Similar Articles
20 similar articles found (search time: 31 ms)
1.
When judgmental and statistical procedures are both used to identify potentially gender-biased items in a test, to what extent do the results agree? In this study, both procedures were used to evaluate the items in a statewide, 78-item, multiple-choice test of science knowledge. Only one item was flagged by the sensitivity reviewers as being potentially biased, but this item was not flagged by the statistical procedure. None of the nine items flagged by the Mantel-Haenszel procedure were flagged by the sensitivity reviewers. Eight of the nine statistically flagged items were differentially easier for males. Four of these eight measured the same category of objectives. The authors conclude that both judgmental and statistical procedures provide useful information and that both should be used in test construction. They caution readers that content-validity issues need to be addressed when making decisions based on the results of either procedure.
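For readers unfamiliar with the statistical side of this comparison, the sketch below illustrates the Mantel-Haenszel DIF statistic for a single item. Matching examinees on total test score and flagging on the ETS delta metric are common practice, but the variable names and the flagging threshold here are illustrative assumptions rather than details reported in the study.

```python
import numpy as np

def mantel_haenszel_dif(item, group, total):
    """Mantel-Haenszel common odds ratio and ETS delta (MH D-DIF) for one item.

    item  : 0/1 responses to the studied item
    group : 0 = reference group, 1 = focal group
    total : matching variable, e.g., total test score
    """
    num = den = 0.0
    for s in np.unique(total):
        at_s = total == s
        a = np.sum((item == 1) & (group == 0) & at_s)   # reference, correct
        b = np.sum((item == 0) & (group == 0) & at_s)   # reference, incorrect
        c = np.sum((item == 1) & (group == 1) & at_s)   # focal, correct
        d = np.sum((item == 0) & (group == 1) & at_s)   # focal, incorrect
        n = a + b + c + d
        if n == 0:
            continue
        num += a * d / n
        den += b * c / n
    alpha_mh = num / den                  # common odds ratio across score levels
    mh_d_dif = -2.35 * np.log(alpha_mh)   # ETS delta metric; |values| of about 1.5 or more are often flagged
    return alpha_mh, mh_d_dif
```

Negative MH D-DIF values indicate an item that is relatively harder for the focal group than for matched members of the reference group.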

2.
This study attempted to pinpoint the causes of differential item difficulty for blind students taking the braille edition of the Scholastic Aptitude Test's Mathematical section (SAT-M). The study method involved reviewing the literature to identify factors that might cause differential item functioning for these examinees, forming item categories based on these factors, identifying categories that functioned differentially, and assessing the functioning of the items comprising deviant categories to determine if the differential effect was pervasive. Results showed an association between selected item categories and differential functioning, particularly for items that included figures in the stimulus, items for which spatial estimation was helpful in eliminating at least two of the options, and items that presented figures that were small or medium in size. The precise meaning of this association was unclear, however, because some items from the suspected categories functioned normally, factors other than the hypothesized ones might have caused the observed aberrant item behavior, and the differential difficulty might reflect real population differences in relevant content knowledge.

3.
A simulation study was performed to determine whether a group's average percent correct in a content domain could be accurately estimated for groups taking a single test form rather than the entire domain of items. Six item response theory (IRT) based domain score estimation methods were evaluated under conditions of few items per content area on the form taken, small domains, and small group sizes. The methods used item responses from the single form taken to estimate examinee or group ability; domain scores were then computed using the ability estimates and the domain item characteristics. The IRT-based domain score estimates typically showed greater accuracy and greater consistency across forms taken than observed performance on the form taken. For the smallest group size and smallest number of items taken, the accuracy of most IRT-based estimates was questionable; however, a procedure that operates on an estimated distribution of group ability showed promise under most conditions.
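As a rough illustration of the general idea (not one of the six specific methods evaluated), the sketch below computes an expected percent correct over the whole domain from an ability estimated on the form taken. The three-parameter logistic model and all names are assumptions for illustration.

```python
import numpy as np

def p_3pl(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

def expected_domain_score(theta_hat, domain_a, domain_b, domain_c):
    """Expected percent correct over the full item domain for ability theta_hat.

    theta_hat is estimated from the single form actually taken; the domain
    parameter arrays describe every item in the content domain.
    """
    p = p_3pl(theta_hat, domain_a, domain_b, domain_c)
    return 100.0 * p.mean()

# A simple group-level estimate averages the expected domain scores of the group's members:
# group_domain_score = np.mean([expected_domain_score(t, a, b, c) for t in theta_hats])
```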

4.
A new approach for partitioning test items into dimensionally distinct item clusters is introduced. The core of the approach is a new item-pair conditional-covariance-based proximity measure that can be used with hierarchical cluster analysis. An extensive simulation study designed to test the limits of the approach indicates that when approximate simple structure holds, the procedure can correctly partition the test into dimensionally homogeneous item clusters even for very high correlations between the latent dimensions. In particular, the procedure can correctly classify (on average) over 90% of the items for correlations as high as .9. The cooperative role that the procedure can play when used in conjunction with other dimensionality assessment procedures is discussed.
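The sketch below is a simplified stand-in for the kind of procedure described: a pairwise conditional-covariance proximity (conditioning here on the rest score) converted to distances and fed to hierarchical clustering. The paper's actual proximity measure and linkage choices are more refined; everything named here is an assumption for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def conditional_cov(resp, i, j):
    """Covariance of items i and j conditional on the rest score, averaged over score groups."""
    rest = resp.sum(axis=1) - resp[:, i] - resp[:, j]
    covs, weights = [], []
    for s in np.unique(rest):
        block = resp[rest == s]
        if len(block) > 1:
            covs.append(np.cov(block[:, i], block[:, j])[0, 1])
            weights.append(len(block))
    return np.average(covs, weights=weights)

def cluster_items(resp, n_clusters):
    """Hierarchical clustering of items based on pairwise conditional covariances."""
    n_items = resp.shape[1]
    dist = np.zeros((n_items, n_items))
    for i in range(n_items):
        for j in range(i + 1, n_items):
            # larger conditional covariance -> more likely same dimension -> smaller distance
            dist[i, j] = dist[j, i] = -conditional_cov(resp, i, j)
    dist -= dist.min()                       # shift so all distances are non-negative
    iu = np.triu_indices(n_items, k=1)
    z = linkage(dist[iu], method="average")  # condensed distance vector
    return fcluster(z, t=n_clusters, criterion="maxclust")
```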

5.
An assumption of item response theory is that a person's score is a function of the item response parameters and the person's ability. In this paper, the effect of variations in instructional coverage on item characteristic functions is examined. Using data from the Second International Mathematics Study (1985), curriculum clusters were formed based on teachers' ratings of their students' opportunities to learn the items on a test. After forming curriculum clusters, item response curves were compared using signed and unsigned sums of squared differences. Some of the differences in the item response curves between curriculum clusters were found to be large, but better performance was not necessarily related to greater opportunity to learn. The item response curve differences were much larger than differences reported in prior studies based on comparisons of black and white students. Implications of the findings for applications of item response theory to educational achievement test data are discussed.

6.
This study provides hypothesized explanations for local item dependence (LID) in a large-scale hands-on science performance assessment. Items within multi-step item clusters were classified as low or high in LID using contextual analysis procedures described in this and other studies. LID was identified statistically using the average within-cluster (AWC) correlation procedure described in previous studies. Levels of LID identified in contextual analyses were compared to levels of LID identified in correlation analyses. Consistent with other studies, items that appear to elicit locally dependent responses require examinees to answer and explain their answer or to use given or generated information to respond.
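A minimal sketch of an average within-cluster correlation check of the kind referenced here is shown below; the cited AWC procedure may define the index differently (for example, by working with residuals), so treat this purely as an illustration with assumed names.

```python
import numpy as np

def awc_index(resp, clusters):
    """Average within-cluster vs. between-cluster inter-item correlations.

    resp     : persons x items matrix of item scores
    clusters : cluster label for each item (items in one multi-step task share a label)
    """
    r = np.corrcoef(resp, rowvar=False)          # item-by-item correlation matrix
    n_items = resp.shape[1]
    within, between = [], []
    for i in range(n_items):
        for j in range(i + 1, n_items):
            (within if clusters[i] == clusters[j] else between).append(r[i, j])
    return np.mean(within), np.mean(between)
```

A within-cluster average noticeably above the between-cluster average points to local dependence within the multi-step clusters.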

7.
In this paper we present a new methodology for detecting differential item functioning (DIF). We introduce a DIF model, called the random item mixture (RIM), that is based on a Rasch model with random item difficulties (besides the common random person abilities). In addition, a mixture model is assumed for the item difficulties, such that the items may belong to one of two classes: a DIF or a non-DIF class. The crucial difference between the two classes is that the item difficulties in the DIF class may differ across the observed person groups, while they are equal across the person groups for items from the non-DIF class. Statistical inference for the RIM is carried out in a Bayesian framework. The performance of the RIM is evaluated in a simulation study in which it is compared with traditional procedures, such as the likelihood ratio test, the Mantel-Haenszel procedure, and the standardized p-DIF procedure. In this comparison, the RIM performs better than the other methods. Finally, the usefulness of the model is also demonstrated on a real-life data set.
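The full RIM is estimated in a Bayesian framework with random person abilities and random item difficulties; as a much simpler illustration of the mixture idea only, the sketch below writes the marginal log-likelihood of a single item as a two-component mixture of a non-DIF Rasch likelihood and a group-specific (DIF) Rasch likelihood, treating abilities as known. All names and simplifications are assumptions, not the authors' implementation.

```python
import numpy as np

def rasch_p(theta, b):
    """Rasch model probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def item_mixture_loglik(resp, theta, group, b_common, b_group, pi_dif):
    """Marginal log-likelihood of one item under a RIM-style two-class mixture.

    resp     : 0/1 responses of all persons to this item
    theta    : person abilities (treated as known in this sketch)
    group    : 0/1 manifest group membership for each person
    b_common : single difficulty used if the item is in the non-DIF class
    b_group  : array of two group-specific difficulties used in the DIF class
    pi_dif   : prior probability that an item belongs to the DIF class
    """
    p_non = rasch_p(theta, b_common)
    p_dif = rasch_p(theta, b_group[group])
    ll_non = np.sum(resp * np.log(p_non) + (1 - resp) * np.log(1 - p_non))
    ll_dif = np.sum(resp * np.log(p_dif) + (1 - resp) * np.log(1 - p_dif))
    m = max(ll_non, ll_dif)               # log-sum-exp for numerical stability
    return m + np.log((1 - pi_dif) * np.exp(ll_non - m) + pi_dif * np.exp(ll_dif - m))
```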

8.
A procedure for the detection of differential item performance (DIP) is used to investigate the relationships between characteristics of mathematics achievement items and gender differences in performance. Eight randomly equivalent samples of high school seniors were each given a unique form of the ACT Assessment Mathematics Usage Test (ACTM). Students without requisite mathematics courses were deleted from the samples to reduce the confounding effects of differences in instruction at the high school level. Signed measures of DIP were obtained for each item in the eight ACTM forms. These DIP estimates were then analyzed in a 6 × 8 (item category by form) experimental design. A significant item category effect was found indicating a relationship between item characteristics and gender-based DIP. Predictions, based on previous research about the categories of items that would contribute to gender-based DIP, were supported: Geometry and mathematics reasoning items were relatively more difficult for female examinees and the more algorithmic, computation-oriented items were relatively easier.
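The 6 × 8 (item category by form) analysis can be sketched as a two-way ANOVA on the signed DIP estimates. The data below are synthetic, and the five-items-per-cell layout is an assumption purely so the example runs.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(0)
# one row per item: a signed DIP estimate plus its item category and test form
df = pd.DataFrame({
    "dip": rng.normal(0, 0.05, size=6 * 8 * 5),
    "category": np.repeat([f"cat{k}" for k in range(6)], 8 * 5),
    "form": np.tile(np.repeat([f"form{k}" for k in range(8)], 5), 6),
})
anova = sm.stats.anova_lm(ols("dip ~ C(category) + C(form)", data=df).fit(), typ=2)
print(anova)   # a significant category effect ties item characteristics to gender-based DIP
```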

9.
This study was designed to examine the level of dependence within multiple true-false (MTF) test item clusters by computing sets of item intercorrelations with data from a test composed of both MTF and multiple choice (MC) items. It was posited that internal analysis reliability estimates for MTF tests would be spurious due to elevated MTF within-cluster intercorrelations. Results showed that, on the average, MTF within-cluster dependence was no greater than that found between MTF items from different clusters, between MC items, or between MC and MTF items. But item for item, there was greater dependence between items within the same cluster than between items of different clusters.

10.
Using a technique that controlled exposure of items, the investigator examined how reordering items within a power test containing ten letter-series-completion items affected the mean test score, item difficulty indices, and reliability and validity coefficients. The results suggest that effects on test statistics from item rearrangement are generally minimal. The implication of these findings for test designs involving an item-sampling procedure is that performance on an item is minimally influenced by the context in which it occurs.

11.
This was a study of differential item functioning (DIF) for grades 4, 7, and 10 reading and mathematics items from state criterion-referenced tests. The tests were composed of multiple-choice and constructed-response items. Gender DIF was investigated using POLYSIBTEST and a Rasch procedure. The Rasch procedure flagged more items for DIF than did the simultaneous item bias procedure, particularly multiple-choice items. For both reading and mathematics tests, multiple-choice items generally favored males while constructed-response items generally favored females. Content analyses showed that flagged reading items typically measured text interpretations or implied meanings; males tended to benefit from items that asked them to identify reasonable interpretations and analyses of informational text. Most items that favored females asked students to make their own interpretations and analyses of both literary and informational text, supported by text-based evidence. Content analysis of mathematics items showed that items favoring males measured geometry, probability, and algebra. Mathematics items favoring females measured statistical interpretations, multistep problem solving, and mathematical reasoning.

12.
Once a differential item functioning (DIF) item has been identified, little is known about the examinees for whom the item functions differentially. This is because DIF analyses focus on manifest group characteristics that are associated with DIF but do not explain why examinees respond differentially to items. We first analyze item response patterns for gender DIF and then illustrate, through the use of a mixture item response theory (IRT) model, how the manifest characteristic associated with DIF often has a very weak relationship with the latent groups actually being advantaged or disadvantaged by the item(s). Next, we propose an alternative approach to DIF assessment that first uses an exploratory mixture model analysis to define the primary dimension(s) that contribute to DIF and then studies examinee characteristics associated with those dimensions in order to understand the cause(s) of DIF. Comparison of academic characteristics of these examinees across classes reveals some clear differences in manifest characteristics between groups.

13.
We analyzed a pool of items from an admissions test for differential item functioning (DIF) for groups based on age, socioeconomic status, citizenship, or English language status using Mantel-Haenszel and item response theory procedures. DIF items were systematically examined to identify their possible sources by item type, content, and wording. DIF was primarily found for the citizenship groups. As suggested by expert reviewers, possible sources of DIF in the direction of U.S. citizens were often found in Quantitative Reasoning items containing figures, charts, or tables depicting real-world (as opposed to abstract) contexts. DIF items in the direction of non-U.S. citizens included “mathematical” items containing few words. DIF for the Verbal Reasoning items included geocultural references and proper names that may be differentially familiar to non-U.S. citizens. This study is responsive to foundational changes in the fairness section of the Standards for Educational and Psychological Testing, which now consider additional groups in sensitivity analyses, given the increasing demographic diversity in test-taker populations.

14.
Studies that have investigated differences in examinee performance on items administered in paper-and-pencil form or on a computer screen have produced equivocal results. Certain item administration procedures were hypothesized to be among the most important variables causing differences in item performance and ultimately in test scores obtained from these different administration media. A study where these item administration procedures were made as identical as possible for each presentation medium is described. In addition, a methodology is presented for studying the difficulty and discrimination of items under each presentation medium as a post hoc procedure.

15.
Sixty-eight graduate students made general and specific ratings of the quality of 12 classroom test items, which varied in difficulty and discrimination. Four treatment combinations defined two additional factors: group discussion/no group discussion of test items and exposure/no exposure to an instructional module on test item construction. The students rated the items differentially, depending not only on item difficulty level but also on item discriminative power. The group-discussion and module-exposure factors had significant effects on the general item ratings only. Implications of the research were discussed.

16.
This study evaluated 16 hypotheses, subsumed under 7 more general hypotheses, concerning possible sources of bias in test items for black and white examinees on the Graduate Record Examination General Test (GRE). Items were developed in pairs that were varied according to a particular hypothesis, with each item from a pair administered in different forms of an experimental portion of the GRE. Data were analyzed using log linear methods. Ten of the 16 hypotheses showed interactions between group membership and the item version indicating a differential effect of the item manipulation on the performance of black and white examinees. The complexity of some of the interactions found, however, suggested that uncontrolled factors were also differentially affecting performance.

17.
DETECT, the acronym for Dimensionality Evaluation To Enumerate Contributing Traits, is an innovative and relatively new nonparametric dimensionality assessment procedure used to identify mutually exclusive, dimensionally homogeneous clusters of items using a genetic algorithm (Zhang & Stout, 1999). Because the clusters of items are mutually exclusive, this procedure is most useful when the data display approximate simple structure. In many testing situations, however, data display a complex multidimensional structure. The purpose of the current study was to evaluate DETECT item classification accuracy and consistency when the data display different degrees of complex structure, using both simulated and real data. Three variables were manipulated in the simulation study: the percentage of items displaying complex structure (10%, 30%, and 50%), the correlation between dimensions (.00, .30, .60, .75, and .90), and the sample size (500, 1,000, and 1,500). The results from the simulation study reveal that DETECT can accurately and consistently cluster items according to their true underlying dimension when as many as 30% of the items display complex structure, if the correlation between dimensions is less than or equal to .75 and the sample size is at least 1,000 examinees. If 50% of the items display complex structure, then the correlation between dimensions should be less than or equal to .60 and the sample size should be at least 1,000 examinees. When the correlation between dimensions is .90, DETECT does not work well with any complex dimensional structure or sample size. Implications for practice and directions for future research are discussed.

18.
In this study, the authors explored the importance of item difficulty (equated delta) as a predictor of differential item functioning (DIF) of Black versus matched White examinees for four verbal item types (analogies, antonyms, sentence completions, reading comprehension) using 13 GRE-disclosed forms (988 verbal items) and 11 SAT-disclosed forms (935 verbal items). The average correlation across test forms for each item type (and often the correlation for each individual test form as well) revealed a significant relationship between item difficulty and DIF value for both GRE and SAT. The most important finding indicates that for hard items, Black examinees perform differentially better than matched ability White examinees for each of the four item types and for both the GRE and SAT tests! The results further suggest that the amount of verbal context is an important determinant of the magnitude of the relationship between item difficulty and differential performance of Black versus matched White examinees. Several hypotheses accounting for this result were explored.

19.
Studies of differential item functioning under item response theory require that item parameter estimates be placed on the same metric before comparisons can be made. The present study compared the effects of three linking methods on the detection of differential item functioning: a weighted mean and sigma method (WMS), the test characteristic curve method (TCC), and the minimum chi-square method (MCS). Both iterative and noniterative linking procedures were compared for each method. Results indicated that detection of differentially functioning items following linking via the test characteristic curve method was most accurate when the sample size was small. When the sample size was large, results for the three linking methods were essentially the same. Iterative linking provided an improvement in detection of differentially functioning items over noniterative linking, particularly at the .05 alpha level. The weighted mean and sigma method showed greater improvement with iterative linking than either the test characteristic curve or minimum chi-square method.
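For context, the basic (unweighted) mean and sigma linking step underlying the first method can be sketched as below: two constants place one calibration's parameters on the other's metric. The weighted, TCC, and minimum chi-square methods estimate the same constants in different ways, and the iterative variants re-estimate them after removing items flagged for DIF. Variable names here are illustrative assumptions.

```python
import numpy as np

def mean_sigma_link(b_from, b_to):
    """Linking constants that place the 'from' calibration on the 'to' metric.

    b_from, b_to : difficulty estimates of the same (anchor) items from two calibrations
    """
    A = np.std(b_to, ddof=1) / np.std(b_from, ddof=1)
    B = np.mean(b_to) - A * np.mean(b_from)
    return A, B

def transform(a, b, theta, A, B):
    """Apply the linear metric transformation to discrimination, difficulty, and ability."""
    return a / A, A * b + B, A * theta + B
```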

20.
Differential item functioning (DIF) analyses are a routine part of the development of large-scale assessments. Less common are studies to understand the potential sources of DIF. The goals of this study were (a) to identify gender DIF in a large-scale science assessment and (b) to look for trends in the DIF and non-DIF items due to content, cognitive demands, item type, item text, and visual-spatial or reference factors. To facilitate the analyses, DIF studies were conducted at 3 grade levels and for 2 randomly equivalent forms of the science assessment at each grade level (administered in different years). The DIF procedure itself was a variant of the "standardization procedure" of Dorans and Kulick (1986) and was applied to very large sets of data (6 sets of data, each involving 60,000 students). It has the advantages of being easy to understand and to explain to practitioners. Several findings emerged from the study that would be useful to pass on to test development committees. For example, when there was DIF in science items, MC items tended to favor male examinees and OR items tended to favor female examinees. Compiling DIF information across multiple grades and years increases the likelihood that important trends in the data will be identified and that item writing practices will be informed by more than anecdotal reports about DIF.
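The standardization index of Dorans and Kulick (1986), of which this study used a variant, can be sketched as a weighted difference between focal- and reference-group proportions correct at each matched score level. Weighting by the focal group's score distribution is the conventional choice and is assumed here, along with all variable names.

```python
import numpy as np

def std_p_dif(item, group, total):
    """Standardized P-DIF (Dorans & Kulick, 1986) for one item.

    item  : 0/1 responses to the studied item
    group : 0 = reference group, 1 = focal group
    total : matching total score
    Negative values indicate an item that is harder for the focal group than
    for matched members of the reference group.
    """
    value = 0.0
    n_focal = np.sum(group == 1)
    for s in np.unique(total):
        focal = (group == 1) & (total == s)
        ref = (group == 0) & (total == s)
        if focal.sum() == 0 or ref.sum() == 0:
            continue
        w = focal.sum() / n_focal                    # focal-group weight at this score level
        value += w * (item[focal].mean() - item[ref].mean())
    return value
```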
