Similar Documents
20 similar documents found (search time: 31 ms)
1.
Subscores Based on Classical Test Theory: To Report or Not to Report (total citations: 1; self-citations: 0; cited by others: 1)
There is an increasing interest in reporting subscores, both at examinee level and at aggregate levels. However, it is important to ensure reasonable subscore performance in terms of high reliability and validity to minimize incorrect instructional and remediation decisions. This article employs a statistical measure based on classical test theory that is conceptually similar to the test reliability measure and can be used to determine when subscores have any added value over total scores. The usefulness of subscores is examined both at the level of the examinees and at the level of the institutions that the examinees belong to. The suggested approach is applied to two data sets from a basic skills test. The results provide little support in favor of reporting subscores for either examinees or institutions for the tests studied here.
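The classical-test-theory check described in this abstract can be sketched in code. The following is a minimal illustration, not the authors' implementation: it assumes continuous item scores, uses Cronbach's alpha as the subscore reliability estimate (which equals the PRMSE of predicting the true subscore from the observed subscore), and the function names are invented for the example.

```python
import numpy as np

def cronbach_alpha(items):
    # items: (n_examinees, n_items) matrix of item scores
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

def subscore_value_added(sub_items, all_items):
    """Haberman-style check: does the observed subscore predict the
    true subscore better than the total score does?"""
    s = sub_items.sum(axis=1)          # observed subscore
    x = all_items.sum(axis=1)          # observed total score
    rel_s = cronbach_alpha(sub_items)  # reliability ~ PRMSE(true sub | s)
    # Correlation of the total score with the TRUE subscore: correct the
    # observed cov(s, x) for the error variance of s, which is shared
    # with x because x contains s.
    var_s = s.var(ddof=1)
    err_var = var_s * (1 - rel_s)
    cov_true_x = np.cov(s, x, ddof=1)[0, 1] - err_var
    prmse_total = cov_true_x**2 / (rel_s * var_s * x.var(ddof=1))
    return rel_s, prmse_total, rel_s > prmse_total
```

In simulated two-subscale data, the subscore tends to add value when its reliability is high and the correlation between subscales is only moderate, consistent with the findings summarized in the abstracts below.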

2.
Brennan noted that users of test scores often want (indeed, demand) that subscores be reported, along with total test scores, for diagnostic purposes. Haberman suggested a method based on classical test theory (CTT) to determine if subscores have added value over the total score. One way to interpret the method is that a subscore has added value only if it has a better agreement than the total score with the corresponding subscore on a parallel form. The focus of this article is on classification of the examinees into “pass” and “fail” (or master and nonmaster) categories based on subscores. A new CTT‐based method is suggested to assess whether classification based on a subscore is in better agreement with classification based on the corresponding subscore on a parallel form than is classification based on the total score. The method can be considered as an assessment of the added value of subscores with respect to classification. The suggested method is applied to data from several operational tests. The added value of subscores with respect to classification is found to be very similar to that found in Haberman's value‐added analysis, except at extreme cutscores.

3.
Will subscores provide additional information beyond what is provided by the total score? Is there a method that can estimate more trustworthy subscores than observed subscores? To answer the first question, this study evaluated whether the true subscore was more accurately predicted by the observed subscore or total score. To answer the second question, three subscore estimation methods (i.e., subscore estimated from the observed subscore, total score, or a combination of both the subscore and total score) were compared. Analyses were conducted using data from six licensure tests. Results indicated that reporting subscores at the examinee level may not be necessary as they did not provide much additional information over what is provided by the total score. However, at the institutional level (for institution size ≥ 30), reporting subscores may not be harmful, although they may be redundant because the subscores were predicted equally well by the observed subscores or total scores. Finally, results indicated that estimating the subscore using a combination of observed subscore and total score resulted in the highest reliability.

4.
Recent research has proposed a criterion to evaluate the reportability of subscores. This criterion is a value‐added ratio (VAR), where values greater than 1 suggest that the true subscore is better approximated by the observed subscore than by the total score. This research extends the existing literature by quantifying statistical significance and effect size for using VAR to provide practical guidelines for subscore interpretation and reporting. Findings indicate that subscores with VAR ≥ 1.1 are a minimum requirement for a meaningful contribution to a user's score interpretation; subscores with .9 < VAR < 1.1 are redundant with the total score; and subscores with VAR ≤ .9 would be misleading to report. Additionally, we discuss what to do when subscores do not add value, yet must be reported, as well as when VAR ≥ 1.1 may be undesirable.
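The reporting guideline above reduces to a small decision rule. The thresholds come from the abstract; the function name and the wording of its labels are illustrative only, assuming the VAR is the ratio of the subscore's PRMSE to the total score's PRMSE.

```python
def classify_subscore(prmse_sub, prmse_total, tol_low=0.9, tol_high=1.1):
    """Apply the VAR guideline: report, redundant, or misleading.

    VAR > 1 means the observed subscore approximates the true subscore
    better than the total score does; the .9/1.1 cutoffs follow the
    guidelines proposed in the abstract above.
    """
    var = prmse_sub / prmse_total
    if var >= tol_high:
        return var, "report: meaningful added value"
    if var <= tol_low:
        return var, "do not report: misleading"
    return var, "redundant with the total score"
```

For example, a subscore whose PRMSE is 0.90 against a total-score PRMSE of 0.75 gives VAR = 1.2 and would be reported under this rule.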

5.
Brennan ( 2012 ) noted that users of test scores often want (indeed, demand) that subscores be reported, along with total test scores, for diagnostic purposes. Haberman ( 2008 ) suggested a method based on classical test theory (CTT) to determine if subscores have added value over the total score. According to this method, a subscore has added value if the corresponding true subscore is predicted better by the subscore than by the total score. In this note, parallel‐forms scores are considered. It is proved that another way to interpret the method of Haberman is that a subscore has added value if it is in better agreement than the total score with the corresponding subscore on a parallel form. The suggested interpretation promises to make the method of Haberman more accessible because several practitioners find the concept of parallel forms more acceptable or easier to understand than that of a true score. Results are shown for data from two operational tests.

6.
The value‐added method of Haberman is arguably one of the most popular methods to evaluate the quality of subscores. The method is based on the classical test theory and deems a subscore to be of added value if the subscore predicts the corresponding true subscore better than does the total score. Sinharay provided an interpretation of the added value of subscores in terms of scores and subscores on parallel forms. This article extends the results of Sinharay and considers the prediction of a subscore on a parallel form from both the subscore and the total raw score on the original form. The resulting predictor essentially becomes the augmented subscore suggested by Haberman. The proportional reduction in mean squared error of the resulting predictor is interpreted as a squared multiple correlation coefficient. The practical usefulness of the derived results is demonstrated using an operational data set.
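The prediction described in this abstract — regressing a parallel‐form subscore on both the original‐form subscore and total score — amounts to an ordinary least-squares fit whose R² is the squared multiple correlation mentioned above. A hedged sketch (the function name and data layout are invented for illustration):

```python
import numpy as np

def augmented_subscore_r2(s, x, s_parallel):
    """Predict the parallel-form subscore from both the subscore (s) and
    the total score (x) on the original form; the R^2 of this regression
    is the squared-multiple-correlation measure described above."""
    X = np.column_stack([np.ones_like(s), s, x])
    beta, *_ = np.linalg.lstsq(X, s_parallel, rcond=None)
    pred = X @ beta
    ss_res = ((s_parallel - pred) ** 2).sum()
    ss_tot = ((s_parallel - s_parallel.mean()) ** 2).sum()
    return pred, 1 - ss_res / ss_tot
```

Because an in-sample least-squares fit can only gain from an extra predictor, this two-predictor R² is never below the R² of the subscore-only regression, which mirrors why the augmented subscore tends to outperform the raw subscore.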

7.
This study investigates the relationships among factor correlations, inter-item correlations, and the reliability estimates of subscores, providing a guideline with respect to psychometric properties of useful subscores. In addition, it compares subscore estimation methods with respect to reliability and distinctness. The subscore estimation methods explored in the current study include augmentation based on classical test theory and multidimensional item response theory (MIRT). The study shows that there is no estimation method that is optimal according to both criteria. Augmented subscores show the most improvement in reliability compared to observed subscores but are the least distinct.

8.
Recently, interest in test subscore reporting for diagnosis purposes has been growing rapidly. The two simulation studies here examined factors (sample size, number of subscales, correlation between subscales, and three factors affecting subscore reliability: number of items per subscale, item parameter distribution, and data generating model) that affected the value of reporting subscores within the classical test theory framework. Results showed that a higher proportion of subscores of added value was related to lower correlation between subscales, more items per subscale, no guessing in responses, smaller variability in difficulty parameters, and matched average item difficulty and average examinee ability.

9.
Recently, there has been an increasing level of interest in subscores for their potential diagnostic value. Haberman (2008b) suggested reporting an augmented subscore that is a linear combination of a subscore and the total score. Sinharay and Haberman (2008) and Sinharay (2010) showed that augmented subscores often lead to more accurate diagnostic information than subscores. In order to report augmented subscores operationally, they should be comparable across the different forms of a test. One way to achieve comparability is to equate them. We suggest several methods for equating augmented subscores. Results from several operational and simulated data sets show that the error in the equating of augmented subscores appears to be small in most practical situations.

10.
Recently, there has been an increasing level of interest in subscores for their potential diagnostic value. Haberman suggested a method based on classical test theory to determine whether subscores have added value over total scores. In this article I first provide a rich collection of results regarding when subscores were found to have added value for several operational data sets. Following that I provide results from a detailed simulation study that examines what properties subscores should possess in order to have added value. The results indicate that subscores have to satisfy strict standards of reliability and correlation to have added value. A weighted average of the subscore and the total score was found to have added value more often.

11.
This study analyzed the relationship between benchmark scores from two curriculum‐based measurement probes in mathematics (M‐CBM) and student performance on a state‐mandated high‐stakes test. Participants were 298 students enrolled in grades 7 and 8 in a rural southeastern school. Specifically, we calculated the criterion‐related and predictive validity of benchmark scores from CBM probes measuring math computation and math reasoning skills. Results of this study suggest that math reasoning probes have strong concurrent and predictive validity. The study also provides evidence that calculation skills, while important, do not have strong predictive strength at the secondary level when a state math assessment is the criterion. When reading comprehension skill is taken into account, math reasoning scores explained the greatest amount of variance in the criterion measure. Computation scores explained less than 5% of the variance in the high‐stakes test, suggesting that the computation probe may have limitations as a universal screening measure for secondary students.

12.
Subscores can be of diagnostic value for tests that cover multiple underlying traits. Some items require knowledge or ability that spans more than a single trait. It is thus natural for such items to be included in more than a single subscore. Subscores only have value if they are reliable enough to justify conclusions drawn from them and if they contain information about the examinee that is distinct from what is in the total test score. In this study we show, for a broad range of conditions of item overlap on subscores, that the value of the subscore is always improved through the removal of such items.

13.
In this study we describe an analytic method for aiding in the generation of subscales that characterize the deep structure of tests. We also derive a procedure for estimating scores for these scales that are much more statistically stable than subscores computed solely from the items contained on that scale. These scores achieve their stability through augmentation with related information from the rest of the test. These methods were used to complement each other on a data set obtained from a Praxis administration. We found that the deep structure of the test yielded ten subscales and that, because the test was essentially unidimensional, ten subscores could be computed, all with very high reliability. This result was contrasted with the calculation of six traditional subscales based on surface features of the items. These subscales also yielded augmented subscores of high reliability.

14.
A practical concern for many existing tests is that subscore test lengths are too short to provide reliable and meaningful measurement. A possible method of improving the subscale reliability and validity would be to make use of collateral information provided by items from other subscales of the same test. To this end, the purpose of this article is to compare two different formulations of an alternative Item Response Theory (IRT) model developed to parameterize unidimensional projections of multidimensional test items: Analytical and Empirical formulations. Two real data applications are provided to illustrate how the projection IRT model can be used in practice, as well as to further examine how ability estimates from the projection IRT model compare to external examinee measures. The results suggest that collateral information extracted by a projection IRT model can be used to improve the reliability and validity of subscale scores, which in turn can provide diagnostic information about the strengths and weaknesses of examinees, helping stakeholders link instruction or curriculum to assessment results.

15.
Assessment data must be valid for the purpose for which educators use them. Establishing evidence of validity is an ongoing process that must be shared by test developers and test users. This study examined the predictive validity and the diagnostic accuracy of universal screening measures in reading. Scores on three different universal screening tools were compared for nearly 500 second‐ and third‐grade students attending four public schools in a large urban district. Hierarchical regression and receiver operating characteristic curves were used to examine the criterion‐related validity and diagnostic accuracy of students’ oral reading fluency (ORF), Fountas and Pinnell Benchmark Assessment System (BAS) scores, and fall scores from the Measures of Academic Progress for reading (MAP). Results indicated that a combination of all three measures accounted for 65% of the variance in spring MAP scores, whereas a reduced model of ORF and MAP scores predicted 60%. ORF and BAS scores did not meet standards for diagnostic accuracy. Combining the measures improved diagnostic accuracy, depending on how criterion scores were calculated. Implications for practice and future research are discussed.

16.
Using the fuzzy comprehensive evaluation method, this study experimentally assessed the effectiveness of physical education teaching in higher education. A mathematical model for evaluating classroom teaching quality was designed, and both the feasibility of the model and the reliability of its evaluation results were tested. The results show that the model is easy to operate, truly reflects the actual effectiveness of college physical education teaching, and substantially improves the validity and fairness of classroom teaching quality evaluation.

17.
Numerous researchers have proposed methods for evaluating the quality of rater‐mediated assessments using nonparametric methods (e.g., kappa coefficients) and parametric methods (e.g., the many‐facet Rasch model). Generally speaking, popular nonparametric methods for evaluating rating quality are not based on a particular measurement theory. On the other hand, popular parametric methods for evaluating rating quality are often based on measurement theories such as invariant measurement. However, these methods are based on assumptions and transformations that may not be appropriate for ordinal ratings. In this study, I show how researchers can use Mokken scale analysis (MSA), which is a nonparametric approach to item response theory, to evaluate rating quality within the framework of invariant measurement without the use of potentially inappropriate parametric techniques. I use an illustrative analysis of data from a rater‐mediated writing assessment to demonstrate how one can use numeric and graphical indicators from MSA to gather evidence of validity, reliability, and fairness. The results from the analyses suggest that MSA provides a useful framework within which to evaluate rater‐mediated assessments for evidence of validity, reliability, and fairness that can supplement existing popular methods for evaluating ratings.

18.
Measuring the quality of service in higher education is increasingly important, particularly as fees introduce a more consumerist ethic amongst students. This paper aims to test and compare the relative efficacy of three measuring instruments of service quality (namely HEdPERF, SERVPERF and the moderating scale of HEdPERF‐SERVPERF) within a higher education setting. The objective was to determine which instrument had the superior measuring capability in terms of unidimensionality, reliability, validity and explained variance. Tests were conducted utilizing a sample of higher education students, and the findings indicated that the HEdPERF scale resulted in more reliable estimations, greater criterion and construct validity, greater explained variance, and consequently was a better fit than the other two instruments. Accordingly, a modified five‐factor structure of HEdPERF is put forward as the superior scale for the higher education sector.

19.
Four definitions of “cultural fairness” are examined and found to be not only mutually contradictory (for reasons which are explained), but all based on the false view that optimum treatment of cultural factors in test construction or test selection can be reduced to completely mechanical procedures. If a conflict arises between the two goals of maximizing a test's validity and minimizing the test's discrimination against certain cultural groups, then a subjective, policy-level decision must be made concerning the relative importance of the two goals. The terms in which this judgment should be made are described, and methods are described for entering the result of this judgment into mechanical procedures for constructing a “culturally optimum” test. Such a test will not necessarily fit any of the four definitions of “cultural fairness.”

20.
Test‐taking strategies are important cognitive skills that strongly affect students’ performance in tests. Using appropriate test‐taking strategies improves students’ achievement and grades, improves students’ attitudes toward tests and reduces test anxiety. This results in improving test accuracy and validity. This study aimed at developing a scale to assess students’ test‐taking strategies at university level. The scale developed was passed through several validation procedures that included content, construct and criterion‐related validity. Similarly, scale reliability (internal reliability and stability over time) was assessed through several procedures. Four samples of students (50, 828, 553 and 235) participated by responding to different versions of the scale. The scale developed consists of 31 items distributed into four sub‐scales: Before‐test, Time management, During‐test and After‐test. To the researcher’s knowledge, this is the first comprehensive scale developed to assess test‐taking strategies used by university students.


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号