Similar Documents
20 similar documents found.
1.
When tests are designed to measure dimensionally complex material, DIF analysis with matching based on the total test score may be inappropriate. Previous research has demonstrated that matching can be improved by using multiple internal or both internal and external measures to more completely account for the latent ability space. The present article extends this line of research by examining the potential to improve matching by conditioning simultaneously on test score and a categorical variable representing the educational background of the examinees. The responses of male and female examinees from a test of medical competence were analyzed using a logistic regression procedure. Results show a substantial reduction in the number of items identified as displaying significant DIF when conditioning is based on total test score and a variable representing educational background as opposed to total test score only.
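A minimal sketch of the kind of logistic-regression DIF screening described above, conditioning first on total score alone and then on total score plus a categorical educational-background variable. All data, variable names (score, group, edu, item), and effect sizes below are hypothetical and are not taken from the study.

```python
# Sketch: logistic-regression DIF with matching on total score plus an
# educational-background category (hypothetical data; not the study's dataset).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "score": rng.integers(0, 51, n),        # total test score (matching variable)
    "group": rng.integers(0, 2, n),         # 0 = reference, 1 = focal (e.g., male/female)
    "edu": rng.choice(["A", "B", "C"], n),  # educational-background category
})
# Simulate a studied item whose difficulty varies with education, not with group.
logit = -2.0 + 0.08 * df["score"] + df["edu"].map({"A": 0.0, "B": 0.5, "C": 1.0})
df["item"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

def dif_lr_test(data, conditioning):
    """Likelihood-ratio test of uniform DIF after conditioning on `conditioning`."""
    base = smf.logit(f"item ~ {conditioning}", data).fit(disp=0)
    full = smf.logit(f"item ~ {conditioning} + group", data).fit(disp=0)
    lr = 2 * (full.llf - base.llf)
    return lr, chi2.sf(lr, df=1)

print("score only:        LR=%.2f, p=%.3f" % dif_lr_test(df, "score"))
print("score + education: LR=%.2f, p=%.3f" % dif_lr_test(df, "score + C(edu)"))
```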

2.
This study analyzes gains in cognitive components of learning competence across cohorts defined by ability tracking in a Czech longitudinal study. Propensity score matching is used to form comparable samples of academic- and non-academic-track students and to eliminate the effect of selective school intake. We applied regression models to the total scores to test for an overall track effect. We further analyze scores and gains on the subscores and check for differential item functioning in Grade 6 and in the change to Grade 9. After three years, no significant difference between the two tracks was apparent in the total learning competence score; however, we found significant differences in some subscores and in the functioning of some items. We argue that item-level analysis is important for a deeper understanding of the implications of tracking and may provide the basis for more precise evidence-based decisions regarding tracking policy.
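The following is an illustrative sketch of propensity-score matching with greedy 1:1 nearest-neighbor matching on the logit of the propensity score, one common implementation. The covariates, caliper rule, and data are invented and do not come from the Czech study.

```python
# Sketch: propensity-score matching of academic- and non-academic-track students
# (illustrative covariates; not the Czech longitudinal data).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 1000
X = rng.normal(size=(n, 3))      # e.g., prior achievement, SES, motivation
track = rng.binomial(1, 1 / (1 + np.exp(-(0.8 * X[:, 0] + 0.5 * X[:, 1]))))  # 1 = academic

# 1) Estimate propensity scores: P(academic track | covariates).
ps = LogisticRegression().fit(X, track).predict_proba(X)[:, 1]
logit_ps = np.log(ps / (1 - ps))

# 2) Greedy 1:1 nearest-neighbor matching on the logit of the propensity score,
#    within a caliper of 0.2 SD of the logit (a common rule of thumb).
caliper = 0.2 * logit_ps.std()
treated, control = np.where(track == 1)[0], np.where(track == 0)[0]
used, pairs = set(), []
for t in treated:
    d = np.abs(logit_ps[control] - logit_ps[t])
    d[[i for i, c in enumerate(control) if c in used]] = np.inf  # skip already-used controls
    j = d.argmin()
    if d[j] <= caliper:
        pairs.append((t, control[j]))
        used.add(control[j])

print(f"matched {len(pairs)} of {len(treated)} academic-track students")
```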

3.
Generalizability theory, as a new generation of measurement theory, is increasingly being applied to large-scale examinations. This article uses multivariate generalizability theory to examine the measurement reliability, the composition of the total score, the decision reliability of the passing score, and the optimization of the test structure for the self-taught higher education examination course "English Proficiency Test (I), Written Test." The study found that the measurement reliability of this examination was relatively high; the proportion of variance each subtest contributed to the universe total score was largely consistent with the intended score weighting of the paper; a passing score of 60 points yielded high decision reliability; and increasing the number of items on every subtest to 15, or increasing the vocabulary subtest alone to 20 items, would effectively improve measurement reliability.
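As a rough illustration of the kind of decision (D) study mentioned in the abstract, the sketch below projects generalizability and dependability coefficients for a single subtest under a persons-by-items design as the number of items grows. The variance components are hypothetical, not those estimated for the English Proficiency Test, and the multivariate composite machinery of the actual study is omitted.

```python
# Sketch: a simple D-study projection for one subtest under a p x i design.
# The variance components below are hypothetical, not those estimated from the
# English Proficiency Test data.
var_p, var_i, var_pi = 0.30, 0.10, 0.60   # person, item, person-by-item (incl. error)

def g_coefficient(n_items):
    """Generalizability (relative) coefficient for a mean score over n_items."""
    return var_p / (var_p + var_pi / n_items)

def phi_coefficient(n_items):
    """Dependability (absolute) coefficient, relevant to pass/fail decisions."""
    return var_p / (var_p + (var_i + var_pi) / n_items)

for n in (10, 15, 20):
    print(f"n_items={n:2d}  E(rho^2)={g_coefficient(n):.3f}  Phi={phi_coefficient(n):.3f}")
```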

4.
We make a distinction between the operational practice of using an observed score to assess differential item functioning (DIF) and the concept of departure from measurement invariance (DMI), which conditions on a latent variable. DMI and DIF indices of effect size, based on the Mantel-Haenszel test of the common odds ratio, converge under restricted conditions if a simple sum score is used as the matching (conditioning) variable in a DIF analysis. Based on theoretical results, we demonstrate analytically that matching on a weighted sum score can substantially reduce the difference between DIF and DMI measures relative to what can be achieved with a simple sum score. We also examine the utility of binning methods that could facilitate operational use of DIF with weighted sum scores. A real-data application is included to demonstrate feasibility.
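A minimal sketch of the operational idea discussed above: computing the Mantel-Haenszel common odds ratio for a studied item while matching on a weighted sum score that has been binned into strata. The responses, weights, and decile binning are illustrative assumptions, not the article's data or its recommended binning method.

```python
# Sketch: Mantel-Haenszel common odds ratio for a studied item, matching on a
# binned weighted sum score (illustrative data and weights, not the article's).
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n, k = 3000, 20
resp = rng.binomial(1, 0.6, size=(n, k))   # item responses
group = rng.integers(0, 2, n)              # 0 = reference, 1 = focal
weights = rng.uniform(0.5, 1.5, k)         # e.g., weights suggested by an IRT model
studied = resp[:, 0]                       # the studied item
matching = resp @ weights                  # weighted sum score

# Bin the continuous weighted score into strata (deciles here) for MH matching.
strata = pd.qcut(matching, q=10, labels=False, duplicates="drop")

num = den = 0.0
for s in np.unique(strata):
    m = strata == s
    A = np.sum((group[m] == 0) & (studied[m] == 1))   # reference correct
    B = np.sum((group[m] == 0) & (studied[m] == 0))   # reference incorrect
    C = np.sum((group[m] == 1) & (studied[m] == 1))   # focal correct
    D = np.sum((group[m] == 1) & (studied[m] == 0))   # focal incorrect
    N = A + B + C + D
    num += A * D / N
    den += B * C / N

alpha_mh = num / den
delta_mh = -2.35 * np.log(alpha_mh)        # ETS delta scale
print(f"MH common odds ratio = {alpha_mh:.3f}, MH D-DIF = {delta_mh:.3f}")
```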

5.
The study of change is based on the idea that the score or index at each measurement occasion has the same meaning and metric across time. In tests or scales with multiple items, such as those common in the social sciences, there are multiple ways to create such scores. Options include using raw or sum scores (i.e., the sum of item responses or a linear transformation thereof), using Rasch-scaled scores provided by the test developers, fitting item response models to the observed item responses and estimating ability or aptitude, and jointly estimating the item response and growth models. Using longitudinal data from the Applied Problems subtest of the Woodcock–Johnson Psycho-Educational Battery–Revised, collected as part of the National Institute of Child Health and Human Development's Study of Early Child Care, we illustrate that this choice can affect the substantive conclusions drawn from the change analysis. Assumptions of the different measurement models, their benefits and limitations, and recommendations are discussed.
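To make the scoring alternatives concrete, the sketch below contrasts two of them for a single response pattern: a raw sum score and the ability estimate implied by a Rasch model with fixed item difficulties. The difficulties and responses are invented, and the joint item-response-and-growth modeling discussed in the article is not attempted here.

```python
# Sketch: two ways of scoring the same response pattern, a raw sum score and a
# Rasch ML ability estimate under fixed (hypothetical) item difficulties.
import numpy as np
from scipy.optimize import brentq

difficulties = np.linspace(-2, 2, 10)               # hypothetical item difficulties
responses = np.array([1, 1, 1, 1, 1, 1, 0, 1, 0, 0])

sum_score = responses.sum()

def theta_mle(raw, b):
    """Rasch MLE of ability: solve sum_i P_i(theta) = raw score."""
    f = lambda theta: np.sum(1 / (1 + np.exp(-(theta - b)))) - raw
    return brentq(f, -6, 6)                          # finite only for 0 < raw < n_items

print(f"sum score = {sum_score}, Rasch theta-hat = {theta_mle(sum_score, difficulties):.3f}")
```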

6.
In this paper I describe and illustrate the Roussos-Stout (1996) multidimensionality-based DIF analysis paradigm, with emphasis on its implications for the selection of the matching and studied subtests for DIF analyses. Standard DIF practice encourages an exploratory search for matching subtest items based on purely statistical criteria, such as a failure to display DIF. By contrast, the multidimensional DIF paradigm emphasizes a substantively informed selection of items for both the matching and studied subtests, based on the dimensions suspected of underlying the test data. Using two examples, I demonstrate that these two approaches lead to different interpretations of the occurrence of DIF in a test. It is argued that selecting the matching and studied subtests as identified under this paradigm can lead to a more informed understanding of why DIF occurs.

7.
A norm distribution consisting of test scores received by 810 college students on a 150-item, dichotomously scored, 4-alternative multiple-choice test was empirically estimated through several item-examinee sampling procedures. The post mortem item-sampling investigation was specifically designed to systematically manipulate the number of subtests, the number of items per subtest, and the number of examinees responding to each subtest. Defining one observation as the score received by one examinee on one item, the results suggest that as the number of observations increases beyond 1.23% of the database, all procedures produce stochastically equivalent results. The results indicate that, in estimating a norm distribution by item sampling, the variable of importance is not the item-sampling procedure per se but the number of observations obtained by the procedure. It should be noted, however, that the test score norm distribution in this investigation was approximately symmetrical, and the possibility should not be overlooked that item sampling as a procedure may be robust only for symmetrical norm distributions.
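A simple simulation sketch of item-examinee sampling for norm estimation: a complete response matrix is generated, disjoint subtests are administered to random subsamples of examinees, and the full-length score distribution is approximated from the resulting observations. The matrix, design values, and the simple rescaling rule are assumptions for illustration, not the procedures compared in the study.

```python
# Sketch: estimating a norm distribution by item-examinee sampling.
# The full response matrix is simulated; it is not the original 810 x 150 data.
import numpy as np

rng = np.random.default_rng(3)
n_examinees, n_items = 810, 150
ability = rng.normal(0, 1, n_examinees)
difficulty = rng.normal(0, 1, n_items)
p = 1 / (1 + np.exp(-(ability[:, None] - difficulty[None, :])))
full = rng.binomial(1, p)                       # complete 810 x 150 response matrix

full_scores = full.sum(axis=1)                  # the norm distribution being estimated
print(f"full data:     mean={full_scores.mean():.1f}  sd={full_scores.std():.1f}")

# Item sampling: 10 disjoint 15-item subtests, each taken by 80 random examinees.
n_subtests, items_per_subtest, examinees_per_subtest = 10, 15, 80
item_split = rng.permutation(n_items).reshape(n_subtests, items_per_subtest)
estimates = []
for items in item_split:
    examinees = rng.choice(n_examinees, examinees_per_subtest, replace=False)
    subtest_scores = full[np.ix_(examinees, items)].sum(axis=1)
    # Rescale each 15-item score to the 150-item metric (a deliberately crude rule;
    # note that it inflates the spread relative to the true norm distribution).
    estimates.append(subtest_scores * (n_items // items_per_subtest))
est = np.concatenate(estimates)                 # 10 x 80 x 15 = 12,000 observations
print(f"item sampling: mean={est.mean():.1f}  sd={est.std():.1f}")
```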

8.
《教育实用测度》 [Applied Measurement in Education], 2013, 26(1): 75-94
In this study we examined whether the measures used in the admission of students to universities in Israel are gender biased. The criterion used to measure bias was performance in the first year of university study; the predictors consisted of an admission score, a high school matriculation score, and a standardized test score as well as its component subtest scores. Statistically, bias was defined according to the boundary conditions given in Linn (1984). No gender bias was detected when using the admission score (which is used for selection) as a predictor of first-year performance in the university. Bias in favor of women was found predominantly when school grades were used as the predictor, whereas bias against women was found predominantly when the standardized test scores were used. It was concluded that the admission score is a valid and unbiased predictor of first-year university performance for both genders.
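For readers who want a concrete starting point, the sketch below runs a Cleary-type differential-prediction check: group-specific intercept and slope terms are tested in a regression of first-year GPA on the admission score. This is one common way to operationalize predictive bias; it is not claimed to reproduce the Linn (1984) boundary conditions used in the study, and the data are simulated.

```python
# Sketch: a Cleary-type differential-prediction check, one common way to examine
# predictive bias (simulated data; not the Israeli admissions data, and not an
# exact implementation of Linn's 1984 boundary conditions).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 1500
df = pd.DataFrame({
    "admission": rng.normal(500, 100, n),    # admission score
    "female": rng.integers(0, 2, n),         # 1 = female, 0 = male
})
df["fygpa"] = 1.0 + 0.004 * df["admission"] + rng.normal(0, 0.5, n)

# Fit a common regression line, then test whether intercept and slope differ by gender.
common = smf.ols("fygpa ~ admission", df).fit()
full = smf.ols("fygpa ~ admission * female", df).fit()
print(full.summary().tables[1])              # gender main effect and interaction terms
print("F test for the added gender terms:", full.compare_f_test(common))
```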

9.
In observed‐score equipercentile equating, the goal is to make scores on two scales or tests measuring the same construct comparable by matching the percentiles of the respective score distributions. If the tests consist of different items with multiple categories for each item, a suitable model for the responses is a polytomous item response theory (IRT) model. The parameters from such a model can be utilized to derive the score probabilities for the tests and these score probabilities may then be used in observed‐score equating. In this study, the asymptotic standard errors of observed‐score equating using score probability vectors from polytomous IRT models are derived using the delta method. The results are applied to the equivalent groups design and the nonequivalent groups design with either chain equating or poststratification equating within the framework of kernel equating. The derivations are presented in a general form and specific formulas for the graded response model and the generalized partial credit model are provided. The asymptotic standard errors are accurate under several simulation conditions relating to sample size, distributional misspecification and, for the nonequivalent groups design, anchor test length.
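The sketch below shows the core equipercentile step only: converting scores on test X to the scale of test Y from two score probability vectors via percentile ranks. The probability vectors are arbitrary placeholders for the polytomous-IRT-derived vectors in the article, and the continuization, kernel smoothing, and delta-method standard errors are omitted.

```python
# Sketch: an equipercentile conversion from two score probability vectors.
# The vectors here are arbitrary; in the article they would come from a
# polytomous IRT model, with standard errors (not shown) from the delta method.
import numpy as np

rng = np.random.default_rng(5)
p_x = rng.dirichlet(np.ones(41))          # score probabilities for test X (0..40)
p_y = rng.dirichlet(np.ones(41))          # score probabilities for test Y (0..40)
scores = np.arange(41)

def percentile_rank(p):
    """Discrete percentile-rank function with the midpoint convention."""
    cdf = np.cumsum(p)
    return cdf - p / 2.0

pr_x, pr_y = percentile_rank(p_x), percentile_rank(p_y)
# Map each X score's percentile rank to the Y score with the same rank
# (linear interpolation of the inverse of PR_Y).
equated = np.interp(pr_x, pr_y, scores)
for x in (0, 10, 20, 30, 40):
    print(f"X = {x:2d}  ->  equated Y = {equated[x]:.2f}")
```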

10.
This study aimed to develop an instrument for assessing kindergarteners' mathematics problem solving (MPS) by using cognitive diagnostic assessment (CDA). A total of 747 children were recruited to examine the psychometric properties of the cognitive diagnostic test. The results showed that the classification accuracy of the 11 cognitive attributes ranged from .68 to .99, with an average of .84. Both the cognitive diagnostic test score and the average mastery probabilities of the 11 cognitive attributes had moderate correlations with the Applied Problems subtest and the Calculation subtest of the Woodcock–Johnson IV Tests of Achievement. Moreover, the correlation between the cognitive diagnostic test and the Applied Problems subtest was higher than that between the cognitive diagnostic test and the Calculation subtest. The results indicated that the formal cognitive diagnostic test was a reliable instrument for assessing kindergarteners' MPS in the domain of number and operations.

11.
12.
The percentage of students retaking college admissions tests is rising. Researchers and college admissions offices currently use a variety of methods for summarizing these multiple scores. Testing organizations such as ACT and the College Board, interested in validity evidence such as correlations with first-year grade point average (FYGPA), often use the most recent test score available. In contrast, institutions report using a variety of composite scoring methods for applicants with multiple test records, including averaging and taking the maximum subtest score across test occasions ("superscoring"). We compare four scoring methods on two criteria. First, we compare correlations between scores and FYGPA by scoring method and find them to be similar. Second, we compare the extent to which test scores differentially predict FYGPA by scoring method and number of retakes. We find that retakes account for additional variance beyond standardized achievement and positively predict FYGPA across all scoring methods. Superscoring minimizes this differential prediction: although it may seem that superscoring should inflate scores across retakes, this inflation is "true" in the sense that it captures the positive effect of retaking on FYGPA. Future research should identify factors related to retesting and consider how they should be used in college admissions.
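To illustrate the composite scoring rules being compared, the toy example below computes most-recent, average, highest-single-administration, and superscore composites for one applicant with three ACT-style test records. The records are made up, composites are left unrounded, and the exact set of methods in the study may differ from these four.

```python
# Sketch: four ways to summarize multiple test records for one applicant
# (toy ACT-style records; the exact set of methods in the study may differ).
import pandas as pd

records = pd.DataFrame({            # one row per test occasion for a single applicant
    "occasion": [1, 2, 3],
    "english":  [22, 25, 24],
    "math":     [26, 24, 27],
    "reading":  [23, 23, 28],
    "science":  [24, 26, 25],
})
subtests = ["english", "math", "reading", "science"]
records["composite"] = records[subtests].mean(axis=1)

most_recent = records.sort_values("occasion").iloc[-1]["composite"]
average     = records["composite"].mean()
highest     = records["composite"].max()
superscore  = records[subtests].max(axis=0).mean()   # best subtest across occasions

print(f"most recent={most_recent:.2f}  average={average:.2f}  "
      f"highest={highest:.2f}  superscore={superscore:.2f}")
```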

13.
Standardized tests are designed to measure broad goals, but many professionals have been concerned with the lack of a fairly specific match between the items (or objectives) on a test and the curriculum (instruction). This study assessed the differences in standardized test scores resulting from curricular differences in two school systems. The degree of curriculum-test match for reading and math in grades 3 and 6 was based on ratings of that match by qualified district personnel. In addition, the results of using different textbook series were analyzed. The dependent variables of test and subtest scores were analyzed using a two-factor MANCOVA in which textbook series and school personnel ratings were the two factors, and pretest scores and the percentage of students eligible for Aid to Families with Dependent Children (AFDC) were the covariates. None of the multivariate F tests were significant at the .05 level. It was concluded that neither the curricular match as judged by district personnel nor the textbook series used had a significant impact on standardized test scores.

14.
The purpose of this study is to evaluate the performance of CATSIB (Computer Adaptive Testing-Simultaneous Item Bias Test) for detecting differential item functioning (DIF) when items in the matching and studied subtests are administered adaptively in the context of a realistic multi-stage adaptive test (MST). The MST was simulated using 4-item modules in a 7-panel administration. Three independent variables expected to affect DIF detection rates were manipulated: item difficulty, sample size, and balanced/unbalanced design. CATSIB met the acceptance criteria (Type I error rates near the nominal 5% level and power of at least 80%) for the large-reference/moderate-focal and large-reference/large-focal sample conditions. These results indicate that CATSIB can detect DIF on an MST consistently and accurately, but only with moderate to large samples.

15.
OBJECTIVE: The Picture Completion subtest of the Wechsler Preschool and Primary Scale of Intelligence-Revised (WPPSI-R) measures visual alertness and the ability to differentiate essential from nonessential details. In children who are hypervigilant as a result of maltreatment, these skills may be over-functioning. It was hypothesized that the Picture Completion subtest scores of these children would be significantly elevated in comparison to their other nonverbal scores and their overall intellectual functioning. METHOD: Fourteen children from a therapeutic day treatment preschool program for maltreated children were administered the WPPSI-R. Standardized discrepancy scores between Picture Completion scores and the Performance mean score (PC-Performance Discrepancy) and the mean of all subscale scores (PC-Overall IQ Discrepancy) were formed and then analyzed. RESULTS: The abused preschoolers scored significantly lower than the population mean on four of the five WPPSI-R Performance subscales; only on Picture Completion did they score significantly higher. Average PC-Performance Discrepancy and PC-Overall IQ Discrepancy scores were greater than one, indicating that, on average, children's Picture Completion scores differed from their Performance mean and from the mean of all their subscale scores by more than one standard deviation. CONCLUSION: An elevated Picture Completion score may serve as a marker for hypervigilance and/or PTSD in children with histories of maltreatment.
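A small arithmetic sketch of the two discrepancy indices, expressed in standard-deviation units of the Wechsler scaled-score metric (mean 10, SD 3). The subtest values are invented; whether the study's Performance mean includes Picture Completion itself is not stated in the abstract, and only Performance subtests are used here to stand in for "all subscale scores."

```python
# Sketch: the two discrepancy indices in standard-deviation units of the
# WPPSI-R scaled-score metric (mean 10, SD 3). Subtest values are made up.
scaled_sd = 3.0
performance = {                     # WPPSI-R Performance scaled scores for one child
    "Object Assembly": 7, "Geometric Design": 6, "Block Design": 7,
    "Mazes": 6, "Picture Completion": 12,
}
pc = performance["Picture Completion"]
other_perf = [v for k, v in performance.items() if k != "Picture Completion"]
overall = list(performance.values())            # stands in for all subtests here

pc_perf_discrepancy = (pc - sum(other_perf) / len(other_perf)) / scaled_sd
pc_overall_discrepancy = (pc - sum(overall) / len(overall)) / scaled_sd
print(f"PC-Performance discrepancy = {pc_perf_discrepancy:.2f} SD")
print(f"PC-Overall discrepancy     = {pc_overall_discrepancy:.2f} SD")
```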

16.
The purpose of this study was to examine the validity evidence of first-grade Dynamic Indicators of Basic Early Literacy Skills (DIBELS) scores for predicting third-grade reading comprehension scores. We used the "simple view" of reading as the theoretical foundation for examining the extent to which DIBELS subtest scores predict comprehension through both word recognition and language comprehension. Scores from the DIBELS Oral Reading Fluency (ORF) subtest, a measure of word recognition speed and accuracy, strongly and significantly predicted multiple measures of reading comprehension. No other DIBELS subtest score explained additional variance beyond DIBELS ORF. Although the experimental DIBELS Word Use Fluency (WUF) subtest was significantly correlated with a language comprehension measure and with measures of reading comprehension, WUF scores did not predict reading comprehension beyond ORF scores. In contrast, first-grade Peabody Picture Vocabulary Test scores did predict additional, significant variance in reading comprehension beyond DIBELS ORF.
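A minimal sketch of the incremental-validity logic described above: fit a model with ORF alone, add a second predictor (labeled ppvt here), and examine the change in R-squared. The variables and coefficients are simulated and are not the study's DIBELS or PPVT data.

```python
# Sketch: incremental validity of a second predictor beyond DIBELS ORF
# (simulated scores; not the study's data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 400
df = pd.DataFrame({"orf": rng.normal(50, 15, n), "ppvt": rng.normal(100, 15, n)})
df["comprehension"] = 0.5 * df["orf"] + 0.2 * df["ppvt"] + rng.normal(0, 10, n)

base = smf.ols("comprehension ~ orf", df).fit()          # Step 1: ORF only
full = smf.ols("comprehension ~ orf + ppvt", df).fit()   # Step 2: add vocabulary
print(f"R2 (ORF only)   = {base.rsquared:.3f}")
print(f"R2 (ORF + PPVT) = {full.rsquared:.3f}")
print(f"Delta R2        = {full.rsquared - base.rsquared:.3f}")
print("F test for the added predictor:", full.compare_f_test(base))
```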

17.
《教育实用测度》 [Applied Measurement in Education], 2013, 26(1): 33-51
The objectives of this study were to examine the impact of different curricula on standardized achievement test scores at the item and objective levels and to determine whether different curricula generate different patterns of item factor loadings. School buildings in a middle-sized district were rated on the degree to which their curricula matched the content of the standardized test, and the actual textbook series used within each building (classroom) was determined. Covariate analyses of objective scores, together with plots and correlations of item p values, indicated very small, nonsignificant differential effects across ratings and textbook series. Factor patterns indicated no curricular effects on the large first factors. These findings parallel the results of a previous study conducted at the subtest level. We conclude that educators need not be unduly concerned about the impact of specific and generally small differences in curricular offerings within a district on standardized test scores or on inferences to a broad content domain.

18.
Thirty mentally retarded persons took part in a study intended to verify the predictive value of learning potential for work adjustment. A multiple regression equation was derived from the data produced by a nonverbal intelligence test (PM-47) and a test of learning potential (an adaptation of the Block Design test; Ionescu et al., 1974). The results showed that only the total score on the block design test has predictive value; this score is the sum of two scores, a "without help" score and a "transfer" score (a measure of learning potential).

19.
Brennan (2012) noted that users of test scores often want (indeed, demand) that subscores be reported, along with total test scores, for diagnostic purposes. Haberman (2008) suggested a method based on classical test theory (CTT) to determine whether subscores have added value over the total score. According to this method, a subscore has added value if the corresponding true subscore is predicted better by the subscore than by the total score. In this note, parallel-forms scores are considered. It is proved that the method of Haberman can equivalently be interpreted as follows: a subscore has added value if it is in better agreement than the total score with the corresponding subscore on a parallel form. The suggested interpretation promises to make the method of Haberman more accessible, because several practitioners find the concept of parallel forms more acceptable or easier to understand than that of a true score. Results are shown for data from two operational tests.
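The parallel-forms interpretation lends itself to a direct illustration: with simulated scores on two parallel forms, compare how well the subscore and the total score on form 1 agree with the corresponding subscore on form 2. All variances and correlations below are invented; this sketch illustrates the agreement criterion as stated above, not Haberman's PRMSE computation itself.

```python
# Sketch: the parallel-forms reading of the added-value criterion, with simulated
# scores on two parallel forms (illustrative only).
import numpy as np

rng = np.random.default_rng(7)
n = 5000
true_sub = rng.normal(0, 1, n)                        # true subscore
true_rest = 0.4 * true_sub + rng.normal(0, 0.9, n)    # true score on the remaining items

# Two parallel forms: identical true scores, independent measurement errors.
sub_form1 = true_sub + rng.normal(0, 0.5, n)
sub_form2 = true_sub + rng.normal(0, 0.5, n)
total_form1 = sub_form1 + (true_rest + rng.normal(0, 0.5, n))

r_sub = np.corrcoef(sub_form1, sub_form2)[0, 1]       # subscore vs. parallel subscore
r_tot = np.corrcoef(total_form1, sub_form2)[0, 1]     # total score vs. parallel subscore
print(f"corr(subscore form 1,    subscore form 2) = {r_sub:.3f}")
print(f"corr(total score form 1, subscore form 2) = {r_tot:.3f}")
print("subscore has added value" if r_sub > r_tot else "subscore does not add value")
```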

20.
As access to and reliance on technology continue to increase, so does the use of computerized testing for admissions, licensure/certification, and accountability exams. Nonetheless, full computer-based test (CBT) implementation can be difficult due to limited resources. As a result, some testing programs offer both CBT and paper-based test (PBT) administration formats. In such situations, evidence that scores obtained from the different formats are comparable must be gathered. In this study, we illustrate how contemporary statistical methods can be used to provide evidence regarding the comparability of CBT and PBT scores at the total test score and item levels. Specifically, we examined the invariance of test structure and item functioning across test administration modes and across subgroups of students defined by SES and sex. Multiple replications of both confirmatory factor analysis and Rasch differential item functioning analyses were used to assess invariance at the factorial and item levels. Results revealed a unidimensional construct with moderate statistical support for strong factorial-level invariance across SES subgroups and moderate support for invariance across sex. Issues involved in applying these analyses to future evaluations of the comparability of scores from different versions of a test are discussed.
