Similar Articles
A total of 20 similar articles were retrieved.
1.
This study investigated whether scores obtained from the online and paper-and-pencil administrations of the statewide end-of-course English test were equivalent for students with and without disabilities. Score comparability was evaluated by examining the equivalence of factor structure (measurement invariance) and by conducting differential item and bundle functioning analyses for the online and paper groups. Results supported measurement invariance between the online and paper groups, suggesting that it is meaningful to compare scores across administration modes. When the data were analyzed at both the item and item bundle (content area) levels, the online and paper groups performed similarly.

2.
The early detection of item drift is an important issue for frequently administered testing programs because items are reused over time. Unfortunately, operational data tend to be very sparse and do not lend themselves to frequent monitoring analyses, particularly for on-demand testing. Building on existing residual analyses, the authors propose an item index that requires only moderate-to-small sample sizes to form data for time-series analysis. Asymptotic results are presented to facilitate statistical significance tests. The authors show that the proposed index combined with time-series techniques may be useful in detecting and predicting item drift. Most important, this index is related to a well-known differential item functioning analysis so that a meaningful effect size can be proposed for item drift detection.
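The abstract does not spell out the index itself, but the general idea of monitoring a residual-based item statistic across administration windows can be sketched as follows. This is a minimal illustration only: the residual definition, the window structure, and the simple slope test below are assumptions for the sketch, not the authors' method.

```python
import numpy as np
from scipy import stats

def irt_2pl_prob(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def window_residual(responses, thetas, a, b):
    """Standardized residual for one administration window: observed minus
    model-expected proportion correct, scaled by its model-based standard error."""
    expected = irt_2pl_prob(thetas, a, b)
    n = len(responses)
    resid = responses.mean() - expected.mean()
    se = np.sqrt(np.sum(expected * (1.0 - expected))) / n
    return resid / se

def drift_trend_test(residual_series):
    """Illustrative time-series check: regress the residual index on the
    administration index and test the slope against zero."""
    t = np.arange(len(residual_series))
    slope, intercept, r, p_value, stderr = stats.linregress(t, residual_series)
    return slope, p_value

# Simulated example: an item that slowly becomes easier over 12 windows.
rng = np.random.default_rng(0)
a_true, b_true = 1.2, 0.0
series = []
for window in range(12):
    thetas = rng.normal(size=300)
    p = irt_2pl_prob(thetas, a_true, b_true - 0.02 * window)  # gradual drift
    responses = rng.binomial(1, p)
    series.append(window_residual(responses, thetas, a_true, b_true))
print(drift_trend_test(np.array(series)))
```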

3.
The current study investigated how item formats and their inherent affordances influence test-takers' cognition under uncertainty. Adult participants solved content-equivalent math items in multiple-selection multiple-choice and four alternative grid formats. The results indicated that participants' affirmative response tendency (i.e., judging the given information as True) was affected by the presence of a grid, the type of grid options, and their visual layouts. The item formats further affected the test scores obtained from the alternatives keyed True and the alternatives keyed False, as well as their psychometric properties. The current results suggest that the affordances rendered by item design can lead to markedly different test-taker behaviors and can potentially influence test outcomes. They emphasize that a better understanding of the cognitive implications of item formats could facilitate item design decisions for large-scale educational assessments.

4.
Many statistics used in the assessment of differential item functioning (DIF) in polytomous items yield a single item-level index of measurement invariance that collapses information across all response options of the polytomous item. Utilizing a single item-level index of DIF can, however, be misleading if the magnitude or direction of the DIF changes across the steps underlying the polytomous response process. A more comprehensive approach to examining measurement invariance in polytomous item formats is to examine invariance at the level of each step of the polytomous item, a framework described in this article as differential step functioning (DSF). This article proposes a nonparametric DSF estimator that is based on the Mantel-Haenszel common odds ratio estimator (Mantel & Haenszel, 1959), which is frequently implemented in the detection of DIF in dichotomous items. A simulation study demonstrated that when the level of DSF varied in magnitude or sign across the steps underlying the polytomous response options, the DSF-based approach typically provided a more powerful and accurate test of measurement invariance than did corresponding item-level DIF estimators.
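As a minimal sketch of this idea, the Mantel-Haenszel common odds ratio can be computed once per step after dichotomizing the polytomous response at that step. The cumulative split and the matching on total score used below are illustrative assumptions and need not match the article's exact step definition.

```python
import numpy as np

def mh_common_odds_ratio(success, group, stratum):
    """Mantel-Haenszel common odds ratio (Mantel & Haenszel, 1959) across
    matching-score strata. success: 0/1; group: 'ref' or 'foc'."""
    success, group, stratum = map(np.asarray, (success, group, stratum))
    num = den = 0.0
    for k in np.unique(stratum):
        in_k = stratum == k
        a = np.sum(in_k & (group == "ref") & (success == 1))  # reference, step passed
        b = np.sum(in_k & (group == "ref") & (success == 0))  # reference, step failed
        c = np.sum(in_k & (group == "foc") & (success == 1))  # focal, step passed
        d = np.sum(in_k & (group == "foc") & (success == 0))  # focal, step failed
        t = a + b + c + d
        if t > 0:
            num += a * d / t
            den += b * c / t
    return num / den if den > 0 else np.nan

def dsf_step_odds_ratios(item_score, group, total_score, max_score):
    """DSF sketch: dichotomize the polytomous item at each step (cumulative
    split, score >= m) and compute one MH odds ratio per step."""
    item_score = np.asarray(item_score)
    return {m: mh_common_odds_ratio((item_score >= m).astype(int), group, total_score)
            for m in range(1, max_score + 1)}
```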

5.
When a computerized adaptive testing (CAT) version of a test coexists with its paper-and-pencil (P&P) version, it is important for scores from the CAT version to be comparable to scores from its P&P version. The CAT version may require multiple item pools for test security reasons, and CAT scores based on alternate pools also need to be comparable to each other. In this paper, we review the research literature on CAT comparability issues and synthesize the issues specific to these two settings. A framework for evaluating comparability was developed that contains three categories of criteria: a validity criterion, a psychometric property/reliability criterion, and a statistical assumption/test administration condition criterion. Methods for evaluating comparability under these criteria, as well as various algorithms for improving comparability, are described and discussed. Focusing on the psychometric property/reliability criterion, an example using an item pool of ACT Assessment Mathematics items is provided to demonstrate a process for developing comparable CAT versions and for evaluating comparability. This example illustrates how simulations can be used to improve comparability at the early stages of the development of a CAT. The effects of different specifications of practical constraints, such as content balancing and item exposure rate control, and the effects of using alternate item pools are examined. One interesting finding from this study is that a large part of the incomparability may be due to the change from number-correct score-based scoring to IRT ability estimation-based scoring. In addition, changes in components of a CAT, such as exposure rate control, content balancing, test length, and item pool size, were found to result in different levels of comparability in test scores.
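To illustrate the scoring change the abstract highlights, the contrast between number-correct scoring and IRT ability estimation can be sketched as follows. The 2PL model, the EAP estimator, and the item parameters below are assumptions for the sketch rather than details from the ACT example.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL item response function."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def eap_estimate(responses, a, b, n_quad=61):
    """EAP ability estimate under a 2PL model with a standard-normal prior,
    using a simple quadrature grid."""
    grid = np.linspace(-4, 4, n_quad)
    posterior = np.exp(-0.5 * grid ** 2)          # prior (unnormalized)
    for u, ai, bi in zip(responses, a, b):
        p = p_correct(grid, ai, bi)
        posterior *= p ** u * (1 - p) ** (1 - u)  # likelihood of each response
    posterior /= posterior.sum()
    return float(np.sum(grid * posterior))

# Two response patterns with the same number-correct score can receive
# different IRT ability estimates: one source of scoring incomparability.
a = np.array([0.8, 1.0, 1.5, 2.0])
b = np.array([-1.0, 0.0, 0.5, 1.0])
pattern_easy_items = np.array([1, 1, 0, 0])   # number-correct = 2
pattern_hard_items = np.array([0, 0, 1, 1])   # number-correct = 2
print(eap_estimate(pattern_easy_items, a, b), eap_estimate(pattern_hard_items, a, b))
```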

6.
The comparison of scores from linguistically different tests is a twofold matter: the adaptation of the tests and the comparison of the scores. These two aspects of measurement invariance intersect at the need to guarantee psychometric equivalence between the original and adapted versions. In this study, the authors examined comparability in two stages. First, they conducted a thorough study of progressive factorial invariance, through which they defined an anchor test. Second, they defined an observed-score equating function to establish equivalences between the original test and the adapted test, using a common-item nonequivalent groups design for this purpose.

7.
Subscore added value analyses assume invariance across test-taking populations; however, this assumption may be untenable in practice because differential subdomain relationships may be present among subgroups. The purpose of this simulation study was to understand the conditions associated with noninvariance of subscore added value when manipulating (a) subdomain test length, (b) differences in subgroup mean ability, and (c) subgroup differences in intersubdomain correlations. Results demonstrated that subscore added value was noninvariant for 24–100% of replications (depending on subdomain test length) when the subgroup difference in intersubdomain correlation was equal to .30. To examine whether this condition occurs in practice, applied invariance analyses of three operational testing programs were conducted. Across these datasets, noninvariant subscore added value was present for some subdomains across sex and ethnic subgroups. Overall, these results indicate that noninvariance of subscore added value is largely driven by differential intersubdomain correlations among subgroups, which may be present in some operational testing programs.

8.
The definition of what it means to take a test online continues to evolve with the inclusion of a broader range of item types and a wide array of devices used by students to access test content. To assure the validity and reliability of test scores for all students, device comparability research should be conducted to evaluate the impact of testing device on student test performance. The current study looked at the comparability of test scores across tablets and computers for high school students in three commonly assessed content areas and for a variety of different item types. Results indicate no statistically significant differences across device type for any content area or item type. Student survey results suggest that students may have a preference for taking tests on devices with which they have more experience, but that even limited exposure to tablets in this study increased positive responses for testing on tablets.

9.
Recent research has shown that admissions tests retain the vast majority of their predictive power after controlling for socioeconomic status (SES), and that SES provides only a slight increment over SAT scores and high school grades (high school grade point average [HSGPA]) in predicting academic performance. To address the possibility that these overall analyses obscure differences by race/ethnicity or gender, we examine the role of SES in the test–grade relationship for men and women as well as for various racial/ethnic subgroups within the United States. For each subgroup, the test–grade relationship is only slightly diminished when controlling for SES. Further, SES is a substantially less powerful predictor of academic performance than both SAT scores and HSGPA. Among the indicators of SES (i.e., father's education, mother's education, and parental income), father's education appears to be the strongest predictor of freshman grades across subgroups, with the exception of the Asian subgroup. In general, SES appears to behave similarly across subgroups in the prediction of freshman grades with SAT scores and HSGPA.

10.
Investigations of differential distractor functioning (DDF) can provide valuable information concerning the location and possible causes of measurement noninvariance within a multiple-choice item. In this article, I propose an odds ratio estimator of the DDF effect as modeled under the nominal response model. In addition, I propose a simultaneous distractor-level (SDL) test of invariance based on the results of the distractor-level tests of DDF. The results of a simulation study indicated that the DDF effect estimator maintained good statistical properties under a variety of conditions, and the SDL test displayed substantially higher power than the traditional Mantel-Haenszel test of no DIF when the DDF effect varied in magnitude and/or sign across the distractors.
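Under the nominal response model, the log-odds of choosing a distractor rather than the keyed option is linear in ability, so a simple distractor-level effect can be read off the intercept contrasts when slopes are held equal across groups. The parameters below are hypothetical, and this sketch illustrates only the log-odds-ratio metric, not the article's estimator.

```python
import numpy as np

def nrm_probs(theta, a, c):
    """Nominal response model category probabilities:
    P(option k | theta) proportional to exp(a_k * theta + c_k)."""
    z = np.exp(np.outer(np.atleast_1d(theta), a) + c)
    return z / z.sum(axis=1, keepdims=True)

# Hypothetical NRM parameters for a 4-option item (option 0 is the key),
# estimated separately in the reference and focal groups.
a = np.array([1.2, -0.4, -0.5, -0.3])
c_ref = np.array([0.8, 0.1, -0.2, -0.7])
c_foc = np.array([0.8, 0.6, -0.2, -0.7])   # distractor 1 is more attractive

print(nrm_probs(0.0, a, c_ref))   # option probabilities at theta = 0, reference
print(nrm_probs(0.0, a, c_foc))   # option probabilities at theta = 0, focal

# With equal slopes across groups, log[P(k)/P(key)] differs between groups
# only in the intercept contrast, so a distractor-level effect on the
# log-odds-ratio metric is the group difference in that contrast.
ddf_effect = (c_foc - c_foc[0]) - (c_ref - c_ref[0])
print(ddf_effect[1:])   # one effect per distractor
```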

11.
Trend estimation in international comparative large-scale assessments relies on measurement invariance between countries. However, cross-national differential item functioning (DIF) has been repeatedly documented. We ran a simulation study to compare the performance of trend estimation based on national item parameters, which requires trends to be computed separately for each country, with that of two linking methods employing international item parameters, across several conditions. The trend estimates based on the national item parameters were more accurate than the trend estimates based on the international item parameters when cross-national DIF was present. Moreover, the use of fixed common item parameter calibrations led to biased trend estimates. The detection and elimination of DIF can reduce this bias but is also likely to increase the total error.

12.
Peer victimization is common and linked to maladjustment. Prior research has typically identified four peer victimization subgroups: aggressors, victims, aggressive-victims, and uninvolved. However, findings related to sex and racial-ethnic differences in subgroup membership have been mixed. Using data collected in September of 2002 and 2003, this study conducted confirmatory latent class analysis of a racially-ethnically diverse sample of 5415 sixth graders (49% boys; 50.6% Black; 20.9% Hispanic) representing two cohorts from 37 schools in four U.S. communities to replicate the four subgroups and evaluate measurement invariance of latent class indicators across cohort, sex, race-ethnicity, and study site. Results replicated the four-class solution and illustrated that sociodemographic differences in subgroup membership were less evident after accounting for differential item functioning.

13.
The study aims to investigate the effects of delivery modalities on the psychometric characteristics of cognitive tests and on student performance. A first study assessed the inductive reasoning ability of 715 students under teacher supervision. A second study examined 731 students' performance in applying the control-of-variables strategy in basic physics, without teacher supervision due to the COVID-19 pandemic. Rasch measurement showed that the online format fitted the data better under the unidimensional model in both conditions. Under teacher supervision, paper-based testing outperformed online testing in terms of reliability and total scores, but the opposite pattern was found without teacher supervision. Although measurement invariance between the two versions was confirmed at the item level, the differential bundle functioning analysis favored the online groups on item bundles constructed from figure-related materials. Response time was also discussed as an advantage of technology-based assessment for test development.

14.
The development of alternate assessments for students with disabilities plays a pivotal role in state and national accountability systems. An important assumption in the use of alternate assessments in these accountability systems is that scores are comparable on different test forms across diverse groups of students over time. Test equating is a common way that states attempt to establish score comparability across different test forms. However, equating presents many unique practical and technical challenges for alternate assessments. This article provides case studies of equating for two alternate assessments in Michigan and an approach to determine whether equating would be preferred to not equating on these assessments. This approach is based on examining equated score and performance-level differences and investigating population invariance across subgroups of students with disabilities. Results suggest that using an equating method with these data appeared to have a minimal impact on proficiency classifications. The population invariance assumption was suspect for some subgroups and equating methods, with some large potential differences observed.
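For readers unfamiliar with the mechanics, a population-invariance check of this kind can be sketched by computing a linking function on the total group and within each subgroup and comparing them on a grid of score points. The linear equating function and the single-group data layout below are simplifying assumptions, not the equating designs used in the Michigan case studies.

```python
import numpy as np

def linear_equating(x_scores, y_scores):
    """Linear equating from form X to form Y:
    e(x) = mu_Y + (sigma_Y / sigma_X) * (x - mu_X)."""
    mu_x, sd_x = np.mean(x_scores), np.std(x_scores, ddof=1)
    mu_y, sd_y = np.mean(y_scores), np.std(y_scores, ddof=1)
    return lambda x: mu_y + (sd_y / sd_x) * (np.asarray(x, float) - mu_x)

def population_invariance(x_scores, y_scores, groups, grid):
    """Compare the total-group link with links computed within each subgroup,
    returned as differences (subgroup minus total) at each grid score."""
    x_scores, y_scores, groups = map(np.asarray, (x_scores, y_scores, groups))
    total = linear_equating(x_scores, y_scores)(grid)
    return {g: linear_equating(x_scores[groups == g], y_scores[groups == g])(grid) - total
            for g in np.unique(groups)}
```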

15.
The development of statistical methods for detecting test collusion is a new research direction in the area of test security. Test collusion may be described as large-scale sharing of test materials, including answers to test items. Current methods of detecting test collusion are based on statistics also used in answer-copying detection. In computerized adaptive testing (CAT), therefore, these methods lose power because the actual test varies across examinees. This article addresses that problem by introducing a new approach that works in two stages: in Stage 1, test centers with an unusual distribution of a person-fit statistic are identified via Kullback-Leibler divergence; in Stage 2, examinees from the identified test centers are analyzed further using the person-fit statistic, where the critical value is computed without data from the identified test centers. The approach is extremely flexible. One can employ any existing person-fit statistic. The approach can be applied to all major testing programs: paper-and-pencil testing (P&P), computer-based testing (CBT), multiple-stage testing (MST), and CAT. Also, the definition of a test center is not limited to a geographic location (room, class, college) and can be extended to cover various relations between examinees (from the same undergraduate college, from the same test-prep center, from the same group in a social network). The suggested approach was found to be effective in CAT for detecting groups of examinees with item preknowledge, that is, those with access (possibly unknown to us) to one or more subsets of items prior to the exam.
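The Stage-1 flagging step can be sketched by comparing each test center's empirical distribution of a person-fit statistic with the overall distribution via a discrete Kullback-Leibler divergence. The histogram binning and the quantile threshold below are illustrative assumptions; the article's actual divergence computation and critical values may differ.

```python
import numpy as np

def kl_divergence(p_counts, q_counts, eps=1e-12):
    """Discrete Kullback-Leibler divergence D(p || q) over common bins."""
    p = np.asarray(p_counts, float) + eps
    q = np.asarray(q_counts, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def flag_centers(fit_stats, centers, n_bins=20, quantile=0.95):
    """Stage-1 sketch: compare each center's empirical distribution of a
    person-fit statistic with the overall distribution and flag centers whose
    divergence is unusually large (binning and threshold are illustrative)."""
    fit_stats, centers = np.asarray(fit_stats), np.asarray(centers)
    edges = np.histogram_bin_edges(fit_stats, bins=n_bins)
    overall, _ = np.histogram(fit_stats, bins=edges)
    divergences = {c: kl_divergence(np.histogram(fit_stats[centers == c], bins=edges)[0], overall)
                   for c in np.unique(centers)}
    cutoff = np.quantile(list(divergences.values()), quantile)
    return [c for c, d in divergences.items() if d >= cutoff], divergences
```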

16.
We examined the feasibility and results of a multilevel multidimensional nominal response model (ML-MNRM) for measuring both substantive constructs and extreme response style (ERS) across countries. The ML-MNRM considers within-country clustering while allowing overall item slopes to vary across items and examination of whether certain items were more prone to ERS. We applied this model to survey items from TALIS 2013. Results indicated that self-efficacy items were more likely to trigger ERS compared to need for professional development, and the between-country relationships among constructs can change due to ERS. Simulations assessed the estimation approach and found adequate recovery of model parameters and factor scores. We stress the importance of additional validity studies to improve the cross-cultural comparability of substantive constructs.

17.
Cross-level invariance in a multilevel item response model can be investigated by testing whether the within-level item discriminations are equal to the between-level item discriminations. Testing the cross-level invariance assumption is important for understanding constructs in multilevel data. However, in most applications of multilevel item response models, cross-level invariance is assumed without being tested. In this study, methods for detecting differential item discrimination (DID) across levels and the consequences of ignoring DID are illustrated and discussed using multilevel item response models. Simulation results showed that the likelihood ratio test (LRT) performed well in detecting global DID at the test level when some portion of the items exhibited DID. At the item level, the Akaike information criterion (AIC), the sample-size adjusted Bayesian information criterion (saBIC), the LRT, and the Wald test showed satisfactory rejection rates (>.8) when some portion of the items exhibited DID and the items had lower intraclass correlations (or higher DID magnitudes). When DID was ignored, inaccuracy in the item discrimination estimates and their standard errors was the main problem. Implications of the findings and limitations are discussed.
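The LRT step itself is straightforward once the constrained model (equal within- and between-level discriminations) and the free model (level-specific discriminations) have been fitted. A minimal sketch with hypothetical log-likelihood values follows; fitting the multilevel IRT models themselves is outside its scope.

```python
from scipy.stats import chi2

def likelihood_ratio_test(loglik_constrained, loglik_free, df_diff):
    """LRT comparing a model that constrains within- and between-level item
    discriminations to be equal against a model that frees them."""
    lrt = 2.0 * (loglik_free - loglik_constrained)
    return lrt, chi2.sf(lrt, df_diff)

# Hypothetical log-likelihoods from two fitted multilevel IRT models,
# differing in whether between-level slopes are freed for 10 items.
print(likelihood_ratio_test(-15234.7, -15220.3, df_diff=10))
```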

18.
This application study investigates whether the multiple-choice-to-composite linking functions that determine Advanced Placement Program exam grades remain invariant over subgroups defined by region. Three years of test data from an AP exam are used to study invariance across regions. The study focuses on two questions: (a) How invariant are the grade thresholds across regions? and (b) Do the small sample sizes for some regional groups present particular problems for assessing threshold invariance? The equatability index proposed by Dorans and Holland (2000) is employed to evaluate the invariance of the linking functions, and cross-classification is used to evaluate the invariance of the composite cut scores. Overall, the linkings across regions seem to hold up reasonably well. Nevertheless, more exams need to be examined.

19.
This study presents a new approach to synthesizing differential item functioning (DIF) effect sizes: First, using correlation matrices from each study, we perform a multigroup confirmatory factor analysis (MGCFA) that examines measurement invariance of a test item between two subgroups (i.e., focal and reference groups). Then we synthesize, across the studies, the differences in the estimated factor loadings between the two subgroups, resulting in a meta-analytic summary of the MGCFA effect sizes (MGCFA-ES). The performance of this new approach was examined using a Monte Carlo simulation in which 108 conditions were created by crossing four factors: (1) three levels of item difficulty, (2) four magnitudes of DIF, (3) three levels of sample size, and (4) three types of correlation matrix (tetrachoric, adjusted Pearson, and Pearson). Results indicate that when the MGCFA is fitted to tetrachoric correlation matrices, the meta-analytic summary of the MGCFA-ES performs best in terms of bias and mean square error, 95% confidence interval coverage, empirical standard errors, Type I error rates, and statistical power, and reasonably well with adjusted Pearson correlation matrices. In addition, when tetrachoric correlation matrices are used, the meta-analytic summary of the MGCFA-ES performed well particularly when a highly difficult item with large DIF was administered to a large sample. Our results offer an option for synthesizing the magnitude of DIF on a flagged item across studies in practice.
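The final synthesis step can be sketched with a standard inverse-variance-weighted (fixed-effect) meta-analytic summary of the per-study loading differences. The numbers below are made up, and the article's actual weighting and variance estimation may differ.

```python
import numpy as np

def fixed_effect_summary(effects, variances):
    """Inverse-variance-weighted (fixed-effect) meta-analytic summary of
    per-study effect sizes, with its standard error and 95% CI."""
    effects = np.asarray(effects, float)
    w = 1.0 / np.asarray(variances, float)
    mean = np.sum(w * effects) / np.sum(w)
    se = np.sqrt(1.0 / np.sum(w))
    return mean, se, (mean - 1.96 * se, mean + 1.96 * se)

# Hypothetical reference-minus-focal loading differences and their sampling
# variances for one flagged item across five studies.
loading_diffs = [0.12, 0.08, 0.15, 0.05, 0.10]
sampling_vars = [0.004, 0.006, 0.003, 0.008, 0.005]
print(fixed_effect_summary(loading_diffs, sampling_vars))
```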

20.
If the factor structure of a test does not hold over time (i.e., is not invariant), then longitudinal comparisons of standing on the test are not meaningful. In the case of the Wechsler Intelligence Scale for Children-Third Edition (WISC-III), it is crucial that it exhibit longitudinal factorial invariance because it is widely used in high-stakes special education eligibility decisions. Accordingly, the present study analyzed the longitudinal factor structure of the WISC-III for both configural and metric invariance with a group of 177 students with disabilities tested, on average, 2.8 years apart. Equivalent factor loadings, factor variances, and factor covariances across the retest interval provided evidence of configural and metric invariance. It was concluded that the WISC-III was measuring the same constructs with equal fidelity across time, which allows unequivocal interpretation of score differences as reflecting changes in underlying latent constructs rather than variations in the measurement operation itself.
