期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Equivalence of Free-Response and Multiple-Choice Items

Randy Elliot Bennett Donald A. Rock Minhwei Wang 《Journal of Educational Measurement》1991,28(1):77-92

This study examined the relationship of multiple-choice and free-response items contained on the College Board's Advanced Placement Computer Science (APCS) examination. Confirmatory factor analysis was used to test the fit of a two-factor model where each item format marked its own factor. Results showed a single-factor solution to provide the most parsimonious fit in each of two random-half samples. This finding might be accounted for by several mechanisms, including overlap in the specific processes assessed by the multiple-choice and free-response items and the limited opportunity for skill differentiation afforded by the year-long APCS course. 相似文献

2.

Written feedback: Response certitude and durability

Raymond W. Kulhavy William A. Stock Thomas E. Hancock Linda K. Swindell Penny L. Hammrich 《Contemporary educational psychology》1990,15(4)

This study tested assumptions of a servocontrol model of test item feedback. High school students responded to multiple-choice items and rated their certainty of correctness in each response. Next, learners either received feedback on the items or responded again to the same test. The same items were tested again after 1 and 8 days, with the order to alternatives randomized for half of the subjects in each feedback group. The results generally supported the control model and suggest that response certitude estimates can be treated as an index of comprehension. 相似文献

3.

基于Gitest Ⅲ的听力过关测试试题项目分析

王蓓《湖南科技学院学报》2009,30(5)

通过利用Gitest III软件,作者对湖南城市学院英语专业2009年1月听力过关测试全部客观题进行了较为详细的项目分析,检测了试题的信度和效度,并针对分析结果提出在设计听力测试客观题方面应注意的问题及建议. 相似文献

4.

Validating Measurement of Knowledge Integration in Science Using Multiple-Choice and Explanation Items

Hee-Sun Lee Ou Lydia Liu Marcia C. Linn 《教育实用测度》2013,26(2):115-136

This study explores measurement of a construct called knowledge integration in science using multiple-choice and explanation items. We use construct and instructional validity evidence to examine the role multiple-choice and explanation items plays in measuring students' knowledge integration ability. For construct validity, we analyze item properties such as alignment, discrimination, and target range on the knowledge integration scale using a Rasch Partial Credit Model analysis. For instructional validity, we test the sensitivity of multiple-choice and explanation items to knowledge integration instruction using a cohort comparison design. Results show that (1) one third of correct multiple-choice responses are aligned with higher levels of knowledge integration while three quarters of incorrect multiple-choice responses are aligned with lower levels of knowledge integration, (2) explanation items discriminate between high and low knowledge integration ability students much more effectively than multiple-choice items, (3) explanation items measure a wider range of knowledge integration levels than multiple-choice items, and (4) explanation items are more sensitive to knowledge integration instruction than multiple-choice items. 相似文献

5.

Reducing the need for guesswork in multiple-choice tests

Martin Bush 《Assessment & Evaluation in Higher Education》2015,40(2):218-231

The humble multiple-choice test is very widely used within education at all levels, but its susceptibility to guesswork makes it a suboptimal assessment tool. The reliability of a multiple-choice test is partly governed by the number of items it contains; however, longer tests are more time consuming to take, and for some subject areas, it can be very hard to create new test items that are sufficiently distinct from previously used items. A number of more sophisticated multiple-choice test formats have been proposed dating back at least 60?years, many of which offer significantly improved test reliability. This paper offers a new way of comparing these alternative test formats, by modelling each one in terms of the range of possible test taker responses it enables. Looking at the test formats in this way leads to the realisation that the need for guesswork is reduced when test takers are given more freedom to express their beliefs. Indeed, guesswork is eliminated entirely when test takers are able to partially order the answer options within each test item. The paper aims to strengthen the argument for using more sophisticated multiple-choice test formats, especially for high-stakes summative assessment. 相似文献

6.

Gender Differences in Large-Scale Math Assessments: PISA Trend 2000 and 2003

Ou Lydia Liu Mark Wilson 《教育实用测度》2013,26(2):164-184

Many efforts have been made to determine and explain differential gender performance on large-scale mathematics assessments. A well-agreed-on conclusion is that gender differences are contextualized and vary across math domains. This study investigated the pattern of gender differences by item domain (e.g., Space and Shape, Quantity) and item type (e.g., multiple-choice ⁱ ⁱIn this paper, two kinds of multiple-choice items are discussed: traditional multiple-choice items and complex multiple-choice items. A sample complex multiple choice item is shown in Table 6. The terms “multiple-choice” and “traditional multiple-choice” are used interchangeably to refer to the traditional multiple choice items throughout the paper, while the term “complex multiple-choice” is used to refer to the complex multiple-choice items. Raman K. Grover is now an Independent Psychometrician. items, open constructed-response items). The U.S. portion of the Programme for International Student Assessment (PISA) 2000 and 2003 mathematics assessment was analyzed. A multidimensional Rasch model was used to provide student ability estimates for each comparison. Results revealed a slight but consistent male advantage. Students showed the largest gender difference (d = 0.19) in favor of males on complex multiple-choice items, an unconventional item type. Males and females also showed sizable differences on Space and Shape items, a domain well documented for showing robust male superiority. Contrary to many previous findings reporting male superiority on multiple-choice items, no measurable difference has been identified on multiple-choice items for both the PISA 2000 and the 2003 math assessments. Reasons for the differential gender performance across math domains and item types were speculated, and directions of future research were discussed. 相似文献

7.

命题者：影响阅读理解测试效度的一个因素

李雪曾用强《考试研究》2012,(4):49-60

本研究应用项目反应理论,从被试的阅读能力值和题目的难度值这两个方面,分析阅读理解测试中多项选择题命题者对考试效度的影响。实验设计中,将两组被试同时施测于一项“阅读水平测试”,根据测试结果估计出的两组被试能力值之间无显著性差异。再次将这两组被试分别施测于两位不同命题者所命制的题目,尽管这些题目均产生于相同的阅读材料,且题目的难度值之间并没有显著性差异,被试的表现却显著不同。Rasch模型认为,被试表现由被试能力和试题难度共同决定。因此,可以推测,这是由于不同命题者所命制的题目影响了被试的表现,并进而影响了使用多项选择题进行阅读理解测试的效度。相似文献

8.

Models of Decisionmaking Processes for Multiple-Choice Test Items: An Analysis of Spatial Ability

Rand R. Wilcox Karen Thompson Wilcox 《Journal of Educational Measurement》1988,25(2):125-136

Latent class models of decisionmaking processes related to multiple-choice test items are extremely important and useful in mental test theory. However, building realistic models or studying the robustness of existing models is very difficult. One problem is that there are a limited number of empirical studies that address this issue. The purpose of this paper is to describe and illustrate how latent class models, in conjunction with the answer-until-correct format, can be used to examine the strategies used by examinees for a specific type of task. In particular, suppose an examinee responds to a multiple-choice test item designed to measure spatial ability, and the examinee gets the item wrong. This paper empirically investigates various latent class models of the strategies that might be used to arrive at an incorrect response. The simplest model is a random guessing model, but the results reported here strongly suggest that this model is unsatisfactory. Models for the second attempt of an item, under an answer-until-correct scoring procedure, are proposed and found to give a good fit to data in most situations. Some results on strategies used to arrive at the first choice are also discussed 相似文献

9.

Effect of Varying Item Order on Multiple-Choice Test Scores: Importance of Statistical and Cognitive Difficulty

《教育实用测度》2013,26(1):89-97

Research on the use of multiple-choice tests has presented conflicting evidence about the use of statistical item difficulty as a means of ordering items. An alternate method advocated by many texts is the use of cognitive difficulty. This study examined the effect of using both statistical and cognitive item difficulty in determining item order. Results indicated that those students who received items in an increasing cognitive order, no matter what the order of statistical difficulty, scored higher on hard items. Those students who received the forms with opposing cognitive and statistical difficulty orders scored the highest on medium-level items. The study concludes with a call for more research on the effects of cognitive difficulty and suggests that future studies examine subscores as well as total test results. 相似文献

10.

基于复合选择题的雷同答卷甄别探析——以某资格考试为例

孔祥常颖昊《中国考试》2021,(2)

基于某资格类考试考后数据,对包含单选题和多选题的复合选择题的雷同答卷进行分析,提出单选甄别、多选甄别和合并甄别3种甄别方案,设计不同测验长度、难度、被抄袭考生水平、题目抄袭比率、作弊考生比率及显著性水平的实验样本。研究结果显示:合并甄别和单选甄别都表现出较好的甄别性能,多选甄别由于多选题数量方面的劣势,甄别率低且Ⅰ型错误率高。据此,提出甄别雷同答卷的建议:如果对甄别出的雷同答卷考生进行违纪违规处理,可将单选甄别和合并甄别分别甄别出的结果取交集作为最后处理依据;如果不对甄别结果进行处理,可将2种方案分别甄别出的结果取并集,以此来对考区或考点的考风考纪进行监管。相似文献

11.

Measurement Efficiency of Innovative Item Formats in Computer-Based Testing

Michael G. Jodoin 《Journal of Educational Measurement》2003,40(1):1-15

The psychometric literature provides little empirical evaluation of examinee test data to assess essential psychometric properties of innovative items. In this study, examinee responses to conventional (e.g., multiple choice) and innovative item formats in a computer-based testing program were analyzed for IRT information with the three-parameter and graded response models. The innovative item types considered in this study provided more information across all levels of ability than multiple-choice items. In addition, accurate timing data captured via computer administration were analyzed to consider the relative efficiency of the multiple choice and innovative item types. As with previous research, multiple-choice items provide more information per unit time. Implications for balancing policy, psychometric, and pragmatic factors in selecting item formats are also discussed. 相似文献

12.

Does difficulty-based item order matter in multiple-choice exams? (Empirical evidence from university students)

《Studies in Educational Evaluation》2020

This empirical study aimed to investigate the impact of easy first vs. hard first ordering of the same items in a paper and-pencil multiple-choice exam on the performances of low, moderate, and high achiever examinees, as well as on the item statistics. Data were collected from 554 Turkish university students using two test forms, which included the same multiple-choice items ordered reversely, i.e. easy first vs. hard first. Tests included 26 multiple-choice items about the introductory unit of “Measurement and Assessment” course. The results suggested that sequencing the multiple-choice items in either direction from easy to hard or vice versa did not affect the test performances of the examinees no matter whether they are low, moderate or high achiever examinees. Finally, no statistically significant difference was observed between item statistics of both forms, i.e. the difficulty (p), discrimination (d), point biserial (r), and adjusted point biserial (adj. r) coefficients. 相似文献

13.

Comparisons among Designs for Equating Mixed-Format Tests in Large-Scale Assessments

Sooyeon Kim Michael E. Walker Frederick McHale 《Journal of Educational Measurement》2010,47(1):36-53

In this study we examined variations of the nonequivalent groups equating design for tests containing both multiple-choice (MC) and constructed-response (CR) items to determine which design was most effective in producing equivalent scores across the two tests to be equated. Using data from a large-scale exam, this study investigated the use of anchor CR item rescoring (known as trend scoring) in the context of classical equating methods. Four linking designs were examined: an anchor with only MC items, a mixed-format anchor test containing both MC and CR items; a mixed-format anchor test incorporating common CR item rescoring; and an equivalent groups (EG) design with CR item rescoring, thereby avoiding the need for an anchor test. Designs using either MC items alone or a mixed anchor without CR item rescoring resulted in much larger bias than the other two designs. The EG design with trend scoring resulted in the smallest bias, leading to the smallest root mean squared error value. 相似文献

14.

Cognitive Processing requirements of Constructed Figural Response and Multiple-Choice Items in Architecture Assessment

《Educational Assessment》2013,18(1):83-98

Contrasts between constructed-response items and multiple-choice counterparts have yielded but a few weak generalizations. Such contrasts typically have been based on the statistical properties of groups of items, an approach that masks differences in properties at the item level and may lead to inaccurate conclusions. In this article, we examine item-level differences between a certain type of constructed-response item (called figural response) and comparable multiple-choice items in the domain of architecture. Our data show that in comparing two item formats, item-level differences in difficulty correspond to differences in cognitive processing requirements and that relations between processing requirements and psychometric properties are systematic. These findings illuminate one aspect of construct validity that is frequently neglected in comparing item types, namely the cognitive demand of test items. 相似文献

15.

Gender DIF in Reading and Mathematics Tests With Mixed Item Formats

Catherine S. Taylor Yoonsun Lee 《教育实用测度》2013,26(3):246-280

This was a study of differential item functioning (DIF) for grades 4, 7, and 10 reading and mathematics items from state criterion-referenced tests. The tests were composed of multiple-choice and constructed-response items. Gender DIF was investigated using POLYSIBTEST and a Rasch procedure. The Rasch procedure flagged more items for DIF than did the simultaneous item bias procedure—particularly multiple-choice items. For both reading and mathematics tests, multiple-choice items generally favored males while constructed-response items generally favored females. Content analyses showed that flagged reading items typically measured text interpretations or implied meanings; males tended to benefit from items that asked them to identify reasonable interpretations and analyses of informational text. Most items that favored females asked students to make their own interpretations and analyses, of both literary and informational text, supported by text-based evidence. Content analysis of mathematics items showed that items favoring males measured geometry, probability, and algebra. Mathematics items favoring females measured statistical interpretations, multistep problem solving, and mathematical reasoning. 相似文献

16.

Ethnic DIF in Reading Tests With Mixed Item Formats

Catherine S. Taylor Yoonsun Lee 《Educational Assessment》2013,18(1):35-68

This article presents a study of ethnic Differential Item Functioning (DIF) for 4th-, 7th-, and 10th-grade reading items on a state criterion-referenced achievement test. The tests, administered 1997 to 2001, were composed of multiple-choice and constructed-response items. Item performance by focal groups (i.e., students from Asian/Pacific Island, Black/African American, Native American, and Latino/Hispanic origins) were compared with the performance of White students using simultaneous item bias and Rasch procedures. Flagged multiple-choice items generally favored White students, whereas flagged constructed-response items generally favored students from Asian/Pacific Islander, Black/African American, and Latino/Hispanic origins. Content analysis of flagged reading items showed that positively and negatively flagged items typically measured inference, interpretation, or analysis of text in multiple-choice and constructed-response formats. Items that were not flagged for DIF generally measured very easy reading skills (e.g., literal comprehension) and reading skills that require higher level thinking (e.g., developing interpretations across texts and analyzing graphic elements). 相似文献

17.

Type K and Other Complex Multiple-Choice Items: An Analysis of Research and Item Properties

Mark A. Albanese 《Educational Measurement》1993,12(1):28-33

What is a complex multiple-choice test item? What is the evidence that such items should be avoided? 相似文献

18.

AN EXPERIMENTAL COMPARISON OF ITEM SAMPLING AND EXAMINEE SAMPLING FOR ESTIMATING TEST NORMS

THOMAS R. OWENS DANIEL L. STUFFLEBEAM 《Journal of Educational Measurement》1969,6(2):75-83

An empirical comparison of the accuracy of item sampling and examinee sampling in estimating norm statistics. Item samples were composed of 3, 6, or 12 items selected from a total test of 50 multiple-choice vocabulary questions. Overall, the study findings provided empirical evidence that item sampling is approximately as effective as examinee sampling for estimating the population mean and standard deviation. Contradictory trends occurred for lower ability and higher ability student populations in accuracy of estimated means and standard deviations when the number of items administered increased from 3 to 6 to 12. The findings from this study indicate that the variation of sequences of items occurring in item sampling need not have a significant affect on test performance. 相似文献

19.

The Assessment of Teacher Affective Sensitivity

《The Journal of educational research》2012,105(5):257-263

Abstract

The purpose of this investigation was to develop and validate a simulation device to measure a teacher's ability to identify verbal and nonverbal emotions expressed by students (teacher affective sensitivity). The scale consists of videotaped excerpts of teacher-learner interactions and accompanying multiple-choice instrumentation. Respondents select the answer from each multiple-choice item that they believe most accurately describes the affective state of the pupil viewed on the monitor. Previously produced media focusing on classroom interactions were used to obtain the examples of learner affective expressions. Expert judges constructed two multiple-choice items for each simulation episode. Pilot test administrations allowed for numerous scale revisions. Finally, assessments of scale reliability, and scale construct, predictive, concurrent, and content validity were made. 相似文献

20.

An Empirical Examination of the IRT Information of Polytomously Scored Reading Items Under the Generalized Partial Credit Model

John R. Donoghue 《Journal of Educational Measurement》1994,31(4):295-311

Using Muraki's (1992) generalized partial credit IRT model, polytomous items (responses to which can be scored as ordered categories) from the 1991 field test of the NAEP Reading Assessment were calibrated simultaneously with multiple-choice and short open-ended items. Expected information of each type of item was computed. On average, four-category polytomous items yielded 2.1 to 3.1 times as much IRT information as dichotomous items. These results provide limited support for the ad hoc rule of weighting k-category polytomous items the same as k - 1 dichotomous items for computing total scores. Polytomous items provided the most information about examinees of moderately high proficiency; the information function peaked at 1.0 to 1.5, and the population distribution mean was 0. When scored dichotomously, information in polytomous items sharply decreased, but they still provided more expected information than did the other response formats. For reference, a derivation of the information function for the generalized partial credit model is included. 相似文献