首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 484 毫秒
1.
Functional form misfit is frequently a concern in item response theory (IRT), although the practical implications of misfit are often difficult to evaluate. In this article, we illustrate how seemingly negligible amounts of functional form misfit, when systematic, can be associated with significant distortions of the score metric in vertical scaling contexts. Our analysis uses two‐ and three‐parameter versions of Samejima's logistic positive exponent model (LPE) as a data generating model. Consistent with prior work, we find LPEs generally provide a better comparative fit to real item response data than traditional IRT models (2PL, 3PL). Further, our simulation results illustrate how 2PL‐ or 3PL‐based vertical scaling in the presence of LPE‐induced misspecification leads to an artificial growth deceleration across grades, consistent with that commonly seen in vertical scaling studies. The results raise further concerns about the use of standard IRT models in measuring growth, even apart from the frequently cited concerns of construct shift/multidimensionality across grades.  相似文献   

2.
Accurate equating results are essential when comparing examinee scores across exam forms. Previous research indicates that equating results may not be accurate when group differences are large. This study compared the equating results of frequency estimation, chained equipercentile, item response theory (IRT) true‐score, and IRT observed‐score equating methods. Using mixed‐format test data, equating results were evaluated for group differences ranging from 0 to .75 standard deviations. As group differences increased, equating results became increasingly biased and dissimilar across equating methods. Results suggest that the size of group differences, the likelihood that equating assumptions are violated, and the equating error associated with an equating method should be taken into consideration when choosing an equating method.  相似文献   

3.
As computer‐based tests become more common, there is a growing wealth of metadata related to examinees’ response processes, which include solution strategies, concentration, and operating speed. One common type of metadata is item response time. While response times have been used extensively to improve estimates of achievement, little work considers whether these metadata may provide useful information on social–emotional constructs. This study uses an analytic example to explore whether metadata might help illuminate such constructs. Specifically, analyses examine whether the amount of time students spend on test items (after accounting for item difficulty and estimates of true achievement), and difficult items in particular, tell us anything about the student's academic motivation and self‐efficacy. While results do not indicate a strong relationship between mean item durations and these constructs in general, the amount of time students spend on very difficult items is highly correlated with motivation and self‐efficacy. The implications of these findings for using response process metadata to gain information on social–emotional constructs are discussed.  相似文献   

4.
The current work examines children's sensitivity to rime unit spelling–sound correspondences within the context of early word reading as a way of assessing word‐specific influences on early word‐reading strategies. Sixty 6–7‐year‐olds participated in an experimental reading task that comprised word items that shared either frequent or infrequent rime unit correspondences. Retrospective self‐reports were taken as measures of strategy choice. The results showed that the children were more accurate in identifying word items that shared a common rime unit (consistent items) when compared with those containing infrequent rime units (unique and exception items). Moreover, while nonlexical (phonological) attempts were most frequently applied across all word types, these resulted in lower levels of accuracy, especially for the exception word items. The current data support the argument that children are increasingly sensitive to rime unit sound–spelling correspondences during the early stages of their word reading and the nature of these word‐specific orthographic representations shape their reliance on using particular lexical or non‐lexical‐based word‐reading strategies.  相似文献   

5.
Many standardized tests are now administered via computer rather than paper‐and‐pencil format. The computer‐based delivery mode brings with it certain advantages. One advantage is the ability to adapt the difficulty level of the test to the ability level of the test taker in what has been termed computerized adaptive testing (CAT). A second advantage is the ability to record not only the test taker's response to each item (i.e., question), but also the amount of time the test taker spends considering and answering each item. Combining these two advantages, various methods were explored for utilizing response time data in selecting appropriate items for an individual test taker. Four strategies for incorporating response time data were evaluated, and the precision of the final test‐taker score was assessed by comparing it to a benchmark value that did not take response time information into account. While differences in measurement precision and testing times were expected, results showed that the strategies did not differ much with respect to measurement precision but that there were differences with regard to the total testing time.  相似文献   

6.
Using factor analysis, we conducted an assessment of multidimensionality for 6 forms of the Law School Admission Test (LSAT) and found 2 subgroups of items or factors for each of the 6 forms. The main conclusion of the factor analysis component of this study was that the LSAT appears to measure 2 different reasoning abilities: inductive and deductive. The technique of N. J. Dorans & N. M. Kingston (1985) was used to examine the effect of dimensionality on equating. We began by calibrating (with item response theory [IRT] methods) all items on a form to obtain Set I of estimated IRT item parameters. Next, the test was divided into 2 homogeneous subgroups of items, each having been determined to represent a different ability (i.e., inductive or deductive reasoning). The items within these subgroups were then recalibrated separately to obtain item parameter estimates, and then combined into Set II. The estimated item parameters and true-score equating tables for Sets I and II corresponded closely.  相似文献   

7.
A rapidly expanding arena for item response theory (IRT) is in attitudinal and health‐outcomes survey applications, often with polytomous items. In particular, there is interest in computer adaptive testing (CAT). Meeting model assumptions is necessary to realize the benefits of IRT in this setting, however. Although initial investigations of local item dependence have been studied both for polytomous items in fixed‐form settings and for dichotomous items in CAT settings, there have been no publications applying local item dependence detection methodology to polytomous items in CAT despite its central importance to these applications. The current research uses a simulation study to investigate the extension of widely used pairwise statistics, Yen's Q3 Statistic and Pearson's Statistic X2, in this context. The simulation design and results are contextualized throughout with a real item bank of this type from the Patient‐Reported Outcomes Measurement Information System (PROMIS).  相似文献   

8.
Abstract

The arrangement of response options in multiple-choice (MC) items, especially the location of the most attractive distractor, is considered critical in constructing high-quality MC items. In the current study, a sample of 496 undergraduate students taking an educational assessment course was given three test forms consisting of the same items but the positions of the most attractive distractor varied across the forms. Using a multiple-indicators–multiple-causes (MIMIC) approach, the effects of the most attractive distractor's positions on item difficulty were investigated. The results indicated that the relative placement of the most attractive distractor and the distance between the most attractive distractor and the keyed option affected students’ response behaviors. Moreover, low-achieving students were more susceptible to response-position changes than high-achieving students.  相似文献   

9.
Data mining methods have drawn considerable attention across diverse scientific fields. However, few applications could be found in the areas of psychological and educational measurement, and particularly pertinent to this article, in test security research. In this study, various data mining methods for detecting cheating behaviors on large‐scale assessments are explored as an alternative to the traditional methods including person‐fit statistics and similarity analysis. A common data set from the Handbook of Quantitative Methods for Detecting Cheating on Tests (Cizek & Wollack) was used for comparing the performance of the different methods. The results indicated that the use of data mining methods may combine multiple sources of information about test takers' performance, which may lead to higher detection rate over traditional item response and response time methods. Several recommendations, all based on our findings, are provided to practitioners.  相似文献   

10.
The Plaut, McClelland, Seidenberg and Patterson (1996) connectionist model of reading was evaluated at two points early in its training against reading data collected from British children on two occasions during their first year of literacy instruction. First, the network's non‐word reading was poor relative to word reading when compared with the children. Second, the network made more non‐lexical than lexical errors, the opposite pattern to the children. Three adaptations were made to the training of the network to bring it closer to the learning environment of a child: an incremental training regime was adopted; the network was trained on grapheme–phoneme correspondences; and a training corpus based on words found in children's early reading materials was used. The modifications caused a sharp improvement in non‐word reading, relative to word reading, resulting in a near perfect match to the children's data on this measure. The modified network, however, continued to make predominantly non‐lexical errors, although evidence from a small‐scale implementation of the full triangle framework suggests that this limitation stems from the lack of a semantic pathway. Taken together, these results suggest that, when properly trained, connectionist models of word reading can offer insights into key aspects of reading development in children.  相似文献   

11.
Phonological analysis of the speech of five children with Down syndrome was compared across single word samples elicited from tests and connected speech samples elicited during play. Analyses indicated that connected speech samples provided fewer total words and word tokens and, in some cases, failed to target certain later developing phonemes. CVCs were similarly represented across single word and connected speech samples, but differences across the sampling conditions included an over‐representation of clusters and an under‐representation of VC and CV forms in single word samples. Overall accuracy for vowels was greater than for consonant production, but no difference in phoneme accuracy was found across the conditions. Differences in phonetic inventories for some subjects indicated the possibility that children may, in their connected speech, avoid phonemes which they have yet to master. Phonological process analysis indicated that differences across the two conditions were restricted to their relative impact on each subject's overall intelligibility. Both similarities and differences with previous studies comparing sampling conditions with other populations are discussed, as are clinical implications.  相似文献   

12.
《教育实用测度》2013,26(3):257-275
The purpose of this study was to investigate the technical properties of stem-equivalent mathematics items differing only with respect to response format. Using socio- economic factors to define the strata, a proportional stratified random sample of 1,366 Connecticut sixth-grade students were administered one of three forms. Classical item analysis, dimensionality assessment, item response theory goodness-of-fit, and an item bias analysis were conducted. Analysis of variance and confirmatory factor analysis were used to examine the functioning of the items presented in the three different formats. It was found that, after equating forms, the constructed-response formats were somewhat more difficult than the multiple-choice format. However, there was no significant difference across formats with respect to item discrimination. A differential item functioning (DIF) analysis was conducted using both the Mantel-Haenszel procedure and the comparison of the item characteristic curves. The DIF analysis indicated that the presence of bias was not greatly affected by item format; that is, items biased in one format tended to be biased in a similar manner when presented in a different format, and unbiased items tended to remain so regardless of format.  相似文献   

13.
Item stem formats can alter the cognitive complexity as well as the type of abilities required for solving mathematics items. Consequently, it is possible that item stem formats can affect the dimensional structure of mathematics assessments. This empirical study investigated the relationship between item stem format and the dimensionality of mathematics assessments. A sample of 671 sixth-grade students was given two forms of a mathematics assessment in which mathematical expression (ME) items and word problems (WP) were used to measure the same content. The effects of mathematical language and reading abilities in responding to ME and WP items were explored using unidimensional and multidimensional item response theory models. The results showed that WP and ME items appear to differ with regard to the underlying abilities required to answer these items. Hence, the multidimensional model fit the response data better than the unidimensional model. For the accurate assessment of mathematics achievement, students’ reading and mathematical language abilities should also be considered when implementing mathematics assessments with ME and WP items.  相似文献   

14.
In some tests, examinees are required to choose a fixed number of items from a set of given items to answer. This practice creates a challenge to standard item response models, because more capable examinees may have an advantage by making wiser choices. In this study, we developed a new class of item response models to account for the choice effect of examinee‐selected items. The results of a series of simulation studies showed: (1) that the parameters of the new models were recovered well, (2) the parameter estimates were almost unbiased when the new models were fit to data that were simulated from standard item response models, (3) failing to consider the choice effect yielded shrunken parameter estimates for examinee‐selected items, and (4) even when the missingness mechanism in examinee‐selected items did not follow the item response functions specified in the new models, the new models still yielded a better fit than did standard item response models. An empirical example of a college entrance examination supported the use of the new models: in general, the higher the examinee's ability, the better his or her choice of items.  相似文献   

15.
Once a differential item functioning (DIF) item has been identified, little is known about the examinees for whom the item functions differentially. This is because DIF focuses on manifest group characteristics that are associated with it, but do not explain why examinees respond differentially to items. We first analyze item response patterns for gender DIF and then illustrate, through the use of a mixture item response theory (IRT) model, how the manifest characteristic associated with DIF often has a very weak relationship with the latent groups actually being advantaged or disadvantaged by the item(s). Next, we propose an alternative approach to DIF assessment that first uses an exploratory mixture model analysis to define the primary dimension(s) that contribute to DIF, and secondly studies examinee characteristics associated with those dimensions in order to understand the cause(s) of DIF. Comparison of academic characteristics of these examinees across classes reveals some clear differences in manifest characteristics between groups.  相似文献   

16.
An important assumption of item response theory is item parameter invariance. Sometimes, however, item parameters are not invariant across different test administrations due to factors other than sampling error; this phenomenon is termed item parameter drift. Several methods have been developed to detect drifted items. However, most of the existing methods were designed to detect drifts in individual items, which may not be adequate for test characteristic curve–based linking or equating. One example is the item response theory–based true score equating, whose goal is to generate a conversion table to relate number‐correct scores on two forms based on their test characteristic curves. This article introduces a stepwise test characteristic curve method to detect item parameter drift iteratively based on test characteristic curves without needing to set any predetermined critical values. Comparisons are made between the proposed method and two existing methods under the three‐parameter logistic item response model through simulation and real data analysis. Results show that the proposed method produces a small difference in test characteristic curves between administrations, an accurate conversion table, and a good classification of drifted and nondrifted items and at the same time keeps a large amount of linking items.  相似文献   

17.
This article uses data from a large‐scale assessment program to illustrate the potential issue of range restriction with the Bookmark method in the context of trying to set cut scores to closely align with a set of college and career readiness benchmarks. Analyses indicated that range restriction issues existed across different response probability (RP) values and item response theory (IRT) models if one were to apply the Bookmark procedure using intact test forms. Results also suggested that range restriction may still be present if one had access to additional data from an item bank. This demonstration critically highlights challenges that may exist in some practical applications of the Bookmark method due items not being designed to cover the full range of examinee abilities.  相似文献   

18.
Reading comprehension is influenced by sources of variance associated with the reader and the task. To gain insight into the complex interplay of multiple sources of influence, we employed crossed random‐effects item response models. These models allowed us to simultaneously examine the degree to which variables related to the type of passage and student characteristics influenced students’ (n = 94; mean age = 11.97 years) performance on two indicators of reading comprehension: different types of comprehension questions and passage fluency. We found that variables related to word recognition, language, and executive function were influential across various types of passages and comprehension questions and also predicted a reader's passage fluency. Further, an exploratory analysis of two‐way interaction effects was conducted. Results suggest that understanding the relative influence of passage, question, and student variables has implications for identifying struggling readers and designing interventions to address their individual needs.  相似文献   

19.
In test development, item response theory (IRT) is a method to determine the amount of information that each item (i.e., item information function) and combination of items (i.e., test information function) provide in the estimation of an examinee's ability. Studies investigating the effects of item parameter estimation errors over a range of ability have demonstrated an overestimation of information when the most discriminating items are selected (i.e., item selection based on maximum information). In the present study, the authors examined the influence of item parameter estimation errors across 3 item selection methods—maximum no target, maximum target, and theta maximum—using the 2- and 3-parameter logistic IRT models. Tests created with the maximum no target and maximum target item selection procedures consistently overestimated the test information function. Conversely, tests created using the theta maximum item selection procedure yielded more consistent estimates of the test information function and, at times, underestimated the test information function. Implications for test development are discussed.  相似文献   

20.
The development of scientific thinking was assessed in 1,581 second, third, and fourth graders (8‐, 9‐, 10‐year‐olds) based on a conceptual model that posits developmental progression from naïve to more advanced conceptions. Using a 66‐item scale, five components of scientific thinking were addressed, including experimental design, data interpretation, and understanding the nature of science. Unidimensional and multidimensional item response theory analyses supported the instrument's reliability and validity and suggested that the multiple components of scientific thinking form a unitary construct, independent of verbal or reasoning skills. A partial credit model gave evidence for a hierarchical developmental progression. Across each grade transition, advanced conceptions increased while naïve conceptions decreased. Independent effects of intelligence, schooling, and parental education on scientific thinking are discussed.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号