Similar Articles
20 similar articles were retrieved.
1.
In discussion of the properties of criterion-referenced tests, it is often assumed that traditional reliability indices, particularly those based on internal consistency, are not relevant. However, if the measurement errors involved in using an individual's observed score on a criterion-referenced test to estimate his or her universe scores on a domain of items are compared to errors of an a priori procedure that assigns the same universe score (the mean observed test score) to all persons, the test-based procedure is found to improve the accuracy of universe score estimates only if the test reliability is above 0.5. This suggests that criterion-referenced tests with low reliabilities generally will have limited use in estimating universe scores on domains of items.
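The 0.5 threshold can be recovered with a short classical-test-theory argument; the following is a sketch in standard notation, which may differ from the article's own. Let $X = T + E$ with $\sigma_X^2 = \sigma_T^2 + \sigma_E^2$ and reliability $\rho = \sigma_T^2 / \sigma_X^2$. Estimating each person's universe score by the observed score $X$ gives expected squared error
$$E[(X - T)^2] = \sigma_E^2 = (1 - \rho)\,\sigma_X^2,$$
while assigning everyone the mean $\mu_X$ gives
$$E[(\mu_X - T)^2] = \sigma_T^2 = \rho\,\sigma_X^2.$$
The test-based estimate is therefore more accurate only when $(1 - \rho)\,\sigma_X^2 < \rho\,\sigma_X^2$, that is, when $\rho > 0.5$.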

2.
Two conventional scores and a weighted score on a group test of general intelligence were compared for reliability and predictive validity. One conventional score consisted of the number of correct answers an examinee gave in responding to 69 multiple-choice questions; the other was the formula score obtained by subtracting from the number of correct answers a fraction of the number of wrong answers. A weighted score was obtained by assigning weights to all the response alternatives of all the questions and adding the weights associated with the responses, both correct and incorrect, made by the examinee. The weights were derived from degree-of-correctness judgments of the set of response alternatives to each question. Reliability was estimated using a split-half procedure; predictive validity was estimated from the correlation between test scores and mean school achievement. Both conventional scores were found to be significantly less reliable but significantly more valid than the weighted scores. (The formula scores were neither significantly less reliable nor significantly more valid than number-correct scores.)
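For concreteness, a minimal Python sketch of the three scoring rules follows. The guessing penalty (wrong / (options - 1)) and the option-weight data structure are illustrative assumptions; the study derived its weights from judges' degree-of-correctness ratings.

def number_correct(responses, key):
    """Number-correct score: one point for each keyed answer chosen."""
    return sum(r == k for r, k in zip(responses, key))

def formula_score(responses, key, n_options=4):
    """Formula score: number correct minus a fraction of the number wrong
    (here the classical correction wrong / (options - 1)); omits are not penalized."""
    right = number_correct(responses, key)
    wrong = sum(r is not None and r != k for r, k in zip(responses, key))
    return right - wrong / (n_options - 1)

def weighted_score(responses, option_weights):
    """Weighted score: sum of the judged degree-of-correctness weight of every
    option actually chosen, whether keyed correct or not."""
    return sum(option_weights[q][r] for q, r in enumerate(responses) if r is not None)

For example, with key = ["A", "C"] and responses = ["A", "B"], number_correct gives 1 and formula_score gives 1 - 1/3, about 0.67; weighted_score instead credits the partial correctness of option "B" according to the judged weights.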

3.
The identification and composition of four subtests constituting the major areas of competence in English as a foreign language is reported, largely based on a study of the literature. A 160-item, four-choice test was used to obtain a number of reliability and validity indices. It was found that grammar was the most reliable and the most valid component when based on total test scores as the criterion, whereas translation was lowest on these two measures. The opposite trend was observed with Grade Point Average as the criterion, whereas the third criterion, writing ability, was found to correlate highest with listening comprehension, which was also found to contribute the highest unique nonchance variance of the four components. These findings and their explanation are discussed.

4.
This article studies the difference between the criterion validity coefficient of the widely used overall scale score for a unidimensional multicomponent measuring instrument and the maximal criterion validity coefficient that is achievable with a linear combination of its components. A necessary and sufficient condition for their identity is presented for the case in which measurement errors are uncorrelated among themselves and with the criterion used. An upper bound on the difference between these validity coefficients is provided, indicating that it cannot exceed the discrepancy between the maximal reliability and composite reliability indexes. A readily applicable latent variable modeling procedure is discussed that can be used for point and interval estimation of the difference between the maximal and scale criterion validity coefficients. The outlined method is illustrated with a numerical example.

5.
A reliability coefficient for criterion-referenced tests is developed from the assumptions of classical test theory. This coefficient is based on deviations of scores from the criterion score, rather than from the mean. The coefficient is shown to have several of the important properties of the conventional norm-referenced reliability coefficient, including its interpretation as a ratio of variances and as a correlation between parallel forms, its relationship to test length, its estimation from a single form of a test, and its use in correcting for attenuation due to measurement error. Norm-referenced measurement is considered as a special case of criterion-referenced measurement.
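A coefficient of the kind described, written here as a sketch in standard classical-test-theory notation (the article's own symbols may differ), replaces deviations from the mean with deviations from the criterion score $C$:
$$k^2(X, T) = \frac{\rho_{XX'}\,\sigma_X^2 + (\mu_X - C)^2}{\sigma_X^2 + (\mu_X - C)^2}.$$
When $C = \mu_X$ the added terms vanish and the coefficient reduces to the conventional norm-referenced reliability $\rho_{XX'}$, which is the special-case relationship noted above.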

6.
A structural equation modeling based method is outlined that accomplishes interval estimation of individual optimal scores resulting from multiple-component measuring instruments evaluating single underlying latent dimensions. The procedure capitalizes on the linear combination of a prespecified set of measures that is associated with maximal reliability and validity. The approach is useful when one is interested in evaluating plausible ranges for subject scores on the composite exhibiting highest measurement consistency and strongest linear relation with a given criterion. The method is illustrated with a numerical example.  相似文献   
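As a hedged sketch of the kind of composite involved: under a single-factor (congeneric) model with loadings $\lambda_i$ and error variances $\theta_i$, the linear combination with maximal reliability is commonly written with weights proportional to $\lambda_i/\theta_i$, giving
$$w_i \propto \frac{\lambda_i}{\theta_i}, \qquad \rho_{\max} = \frac{\sum_i \lambda_i^2/\theta_i}{1 + \sum_i \lambda_i^2/\theta_i}.$$
The interval estimation described above then amounts to placing a confidence band around scores formed with these estimated weights; the article's exact formulation may differ.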

7.
Although the reliability of subscale scores may be suspect, subscale scores are the most common type of diagnostic information included in student score reports. This research compared methods for augmenting the reliability of subscale scores for an 8th-grade mathematics assessment. Yen's Objective Performance Index, Wainer et al.'s augmented scores, and scores based on multidimensional item response theory (IRT) models were compared and found to improve the precision of the subscale scores. However, the augmented subscale scores were found to be more highly correlated and less variable than unaugmented scores. The meaningfulness of reporting such augmented scores as well as the implications for validity and test development are discussed.

8.
Scoring Standardization and Reliability in Oral Testing
Oral tests tend to have relatively high validity but relatively low reliability; without reliability, however, validity cannot truly be guaranteed. How to improve the reliability of oral tests is therefore a question of broad concern among testing researchers. By describing the scoring standardization and rater training for the oral component of the Tsinghua University English Proficiency Test, this paper discusses how standardizing scoring can improve the reliability of oral tests.

9.
Scores on essay‐based assessments that are part of standardized admissions tests are typically given relatively little weight in admissions decisions compared to the weight given to scores from multiple‐choice assessments. Evidence is presented to suggest that more weight should be given to these assessments. The reliability of the writing scores from two of the large volume admissions tests, the GRE General Test (GRE) and the Test of English as a Foreign Language Internet‐based test (TOEFL iBT), based on retesting with a parallel form, is comparable to the reliability of the multiple‐choice Verbal or Reading scores from those tests. Furthermore, and even more important, the writing scores from both tests are as effective as the multiple‐choice scores in predicting academic success and could contribute to fairer admissions decisions.

10.
Value-added scores from tests of college learning indicate how score gains compare to those expected from students of similar entering academic ability. Unfortunately, the choice of value-added model can impact results, and this makes it difficult to determine which results to trust. The research presented here demonstrates how value-added models can be compared on three criteria: reliability, year-to-year consistency and information about score precision. To illustrate, the original Collegiate Learning Assessment value-added model is compared to a new model that employs hierarchical linear modelling. Results indicate that scores produced by the two models are similar, but the new model produces scores that are more reliable and more consistent across years. Furthermore, the new approach provides school-specific indicators of value-added score precision. Although the reliability of value-added scores is sufficient to inform discussions about improving general education programmes, reliability is currently inadequate for making dependable, high-stakes comparisons between postsecondary institutions.
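A generic two-level sketch of the kind of hierarchical linear model used for value-added scoring (illustrative notation, not the Collegiate Learning Assessment's actual specification): student outcomes are regressed on entering academic ability within schools, and the school-level residual serves as the value-added indicator.
$$Y_{ij} = \beta_{0j} + \beta_{1}\,\mathrm{EAA}_{ij} + r_{ij}, \qquad \beta_{0j} = \gamma_{00} + u_{0j},$$
where $\mathrm{EAA}_{ij}$ denotes student $i$'s entering academic ability at school $j$. The empirical-Bayes estimate of $u_{0j}$ plays the role of the school's value-added score, and its posterior standard deviation supplies the kind of school-specific precision indicator mentioned above.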

11.
In this study, we focused on increasing the reliability of ability-achievement difference scores, using the Kaufman Assessment Battery for Children (KABC) as an example. Ability-achievement difference scores are often used as indicators of learning disabilities, but when they are derived from traditional equally weighted ability and achievement scores, they have suboptimal psychometric properties because of the high correlations between the scores. As an alternative to equally weighted difference scores, we examined an orthogonal reliable component analysis (RCA) solution and an oblique principal component analysis (PCA) solution for the standardization sample of the KABC (among 5- to 12-year-olds). The components were easily identifiable as the simultaneous processing, sequential processing, and achievement constructs assessed by the KABC. As judged via the score intercorrelations, all three types of scores had adequate convergent validity, while the orthogonal RCA scores had superior discriminant validity, followed by the oblique PCA scores. Differences between the orthogonal RCA scores were more reliable than differences between the oblique PCA scores, which were in turn more reliable than differences between the traditional equally weighted scores. The increased reliability with which the KABC differences are assessed with the orthogonal RCA method has important practical implications, including narrower confidence intervals around difference scores used in individual administrations of the KABC.
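Why high ability-achievement correlations degrade difference-score reliability can be seen from the standard classical formula (a sketch, not the article's derivation):
$$\rho_{DD'} = \frac{\sigma_X^2\,\rho_{XX'} + \sigma_Y^2\,\rho_{YY'} - 2\,\rho_{XY}\,\sigma_X\,\sigma_Y}{\sigma_X^2 + \sigma_Y^2 - 2\,\rho_{XY}\,\sigma_X\,\sigma_Y}.$$
As the intercorrelation $\rho_{XY}$ approaches the individual reliabilities, the numerator shrinks toward zero much faster than the denominator, which is why orthogonal (RCA) components, with near-zero intercorrelations, yield more reliable differences than highly correlated equally weighted scores.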

12.
In this paper, an attempt has been made to synthesize some of the current thinking in the area of criterion-referenced testing as well as to provide the beginning of an integration of theory and method for such testing. Since criterion-referenced testing is viewed from a decision-theoretic point of view, approaches to reliability and validity estimation consistent with this philosophy are suggested. Also, to improve the decision-making accuracy of criterion-referenced tests, a Bayesian procedure for estimating true mastery scores has been proposed. This Bayesian procedure uses information about other members of a student's group (collateral information), but the resulting estimation is still criterion referenced rather than norm referenced in that the student is compared to a standard rather than to other students. In theory, the Bayesian procedure increases the “effective length” of the test by improving the reliability, the validity, and more importantly, the decision-making accuracy of the criterion-referenced test scores.
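In its simplest form, a Bayesian estimate of this sort shrinks the observed score toward the group mean (a Kelley-type sketch under standard assumptions; the article's own model is more elaborate):
$$\hat{\tau}_i = \rho_{XX'}\,x_i + (1 - \rho_{XX'})\,\mu_X.$$
The group mean $\mu_X$ supplies the collateral information, but the mastery decision is still made by comparing $\hat{\tau}_i$ with the fixed criterion standard rather than with other examinees, which is why the procedure remains criterion referenced.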

13.
《教育实用测度》2013,26(1):25-51
In this study, we compared the efficiency, reliability, validity, and motivational benefits of computerized-adaptive and self-adapted music-listening tests (referred to hereafter as CAT and SAT, respectively). Junior high school general music students completed a tonal memory CAT, a tonal memory SAT, standardized music aptitude and achievement tests, and questionnaires assessing test anxiety, demographics, and attitudes about the CAT and SAT. Standardized music test scores and music course grades served as criterion measures in the concurrent validity analysis. Results showed that the SAT elicited more favorable attitudes from examinees and yielded ability estimates that were higher and less correlated with test anxiety than did the CAT. The CAT, however, required fewer items and less administration time to match the reliability and concurrent validity of the SAT and yielded higher levels of reliability and concurrent validity than the SAT when test length was held constant. These results reaffirm important tradeoffs between the two administration procedures observed in prior studies of vocabulary and algebra skills, with the SAT providing greater potential motivational benefits and the CAT providing greater efficiency. Implications and questions for future research are discussed.

14.
The richness and complexity of video portfolios endanger both the reliability and the validity of the assessment of teacher competencies. In a post-graduate teacher education program, the assessment of video portfolios was evaluated for its reliability, construct validity, and consequential validity. Although the video portfolios facilitated a reliable and valid assessment of teacher competencies, procedures to improve assessment quality were also revealed and are therefore discussed: more explicit grounding of assessment results in the data, peer debriefing, prolonged engagement with the assessment data, and cross-checking to find confirmatory or counterexamples.

15.
This article considers psychometric properties of composite raw scores and transformed scale scores on mixed-format tests that consist of a mixture of multiple-choice and free-response items. Test scores on several mixed-format tests are evaluated with respect to conditional and overall standard errors of measurement, score reliability, and classification consistency and accuracy under three item response theory (IRT) frameworks: unidimensional IRT (UIRT), simple structure multidimensional IRT (SS-MIRT), and bifactor multidimensional IRT (BF-MIRT) models. Illustrative examples are presented using data from three mixed-format exams with various levels of format effects. In general, the two MIRT models produced similar results, while the UIRT model resulted in consistently lower estimates of reliability and classification consistency/accuracy indices compared to the MIRT models.
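Schematically, the three IRT frameworks differ in how each item's success probability is modeled; the following is a sketch in slope-intercept form, and the study's own parameterizations may differ:
$$\text{UIRT:}\quad \operatorname{logit} P(X_{ij}=1) = a_j\,\theta_i + d_j,$$
$$\text{SS-MIRT:}\quad \operatorname{logit} P(X_{ij}=1) = a_j\,\theta_{i,f(j)} + d_j,$$
$$\text{BF-MIRT:}\quad \operatorname{logit} P(X_{ij}=1) = a_{jG}\,\theta_{iG} + a_{jS}\,\theta_{i,f(j)} + d_j,$$
where $f(j)$ indexes the item's format (multiple-choice or free-response). Ignoring the format-specific factors, as UIRT does, treats format-specific variance as error, which is consistent with its lower reliability and classification consistency/accuracy estimates reported above.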

16.
The survey investigated the problems of social desirability (SD), non‐response bias (NRB) and reliability in the Minnesota Multiphasic Personality Inventory – Revised (MMPI‐2) self‐report inventory administered to Brunei student teachers. Bruneians scored higher on all the validity scales than the normative US sample, thereby threatening the internal validity of the study. Of the three validity scales that assess various forms of SD, only the F scale was reliable, and its mean score was in the clinical range. In addition, seven of the ten clinical scales had poor reliability. Although Brunei males scored much higher on the K scale than females, both mean scores were below the critical region. Protocols for two respondents with many missing values indicated that the study’s external validity was vulnerable to NRB effects. Altogether, SD, NRB and low reliability had the potential to undermine the overall validity of the MMPI‐2 and caution against using it ‘as is’ in Brunei.

17.
This paper used a teacher work-motivation rating scale to assess 42 secondary school teachers and examined the scale's reliability and validity. The results indicate that using the scale to measure teachers' work motivation is feasible and meaningful, and that the scale has a reasonable degree of reliability and criterion-related validity; however, factor analysis should still be applied to screen and group its items. The paper also proposes establishing a unified scoring system so that scores become more comparable.

18.
Two ways of measuring change are presented and compared: a conventional “change score”, defined as the difference between scores before and after an interim period, and a process-oriented approach focusing on detailed analysis of conceptually defined response patterns. The validity of the two approaches was investigated. Vocabulary knowledge was assessed by means of equivalent multiple-choice tests administered before and after an intervention, and four characteristic responses were observed: words consistently not understood, words inconsistently understood, learned words, and words consistently understood. The results showed that inclusion of the category “words consistently not understood” offered a “truer” gain score than did the conventional change score. It captured more variance from age and cognitive constraints and appeared educationally more reliable from an assessment-for-teaching perspective.
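One plausible operationalization of the two approaches is sketched below in Python; the mapping of the four categories onto pre/post correctness patterns is an assumption, and the article's own coding rules may differ.

def change_score(pre_total, post_total):
    """Conventional change score: post-test total minus pre-test total."""
    return post_total - pre_total

def classify_word(pre_correct, post_correct):
    """Assign one of the four response patterns described above.
    Treating 'inconsistently understood' as correct-then-wrong is an assumption."""
    if pre_correct and post_correct:
        return "consistently understood"
    if not pre_correct and post_correct:
        return "learned"
    if pre_correct and not post_correct:
        return "inconsistently understood"
    return "consistently not understood"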

19.
Touch screen tablets are being increasingly used in schools for learning and assessment. However, the validity and reliability of assessments delivered via tablets are largely unknown. The present study tested the psychometric properties of a tablet-based app designed to measure early literacy skills. Tablet-based tests were also compared with traditional paper-based tests. Children aged 2–6 years (N = 99) completed receptive tests delivered via a tablet for letter, word, and numeral skills. The same skills were tested with a traditional paper-based test that used an expressive response format. Children (n = 35) were post-tested 8 weeks later to examine the stability of test scores over time. The tablet test scores showed high internal consistency (all α's > .94), acceptable test-retest reliability (ICC range = .39–.89), and were correlated with child age, family SES, and home literacy teaching, indicating good predictive validity. The agreement between scores for the tablet and traditional tests was high (ICC range = .81–.94). The tablet tests provide valid and reliable measures of children's early literacy skills. The strong psychometric properties and ease of use suggest that tablet-based tests of literacy skills have the potential to improve assessment practices for research purposes and classroom use.
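As an illustration of the internal-consistency index reported above, here is a minimal NumPy sketch of Cronbach's alpha; the data layout (examinees in rows, items in columns) is an assumption for the example.

import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a (n_examinees, n_items) score matrix."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances.sum() / total_variance)

The test-retest stability reported above (the ICCs) requires a second administration and an ANOVA-based intraclass correlation rather than this single-occasion index.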

20.
A number of mental-test theorists have called attention to the fact that increasing test reliability beyond an optimal point can actually lead to a decrement in the validity of that test with respect to a criterion. This non-monotonic relation between reliability and validity has been referred to by Loevinger as the “attenuation paradox,” because Spearman’s correction for attenuation leads one to expect that increasing reliability will always increase validity. In this paper a mathematical link between test reliability and test validity is derived which takes into account the correlation between error scores on a test and error scores on a criterion measure the test is designed to predict. It is proved that when the correlation between these two sets of error scores is positive, the non-monotonic relation between test reliability and test validity which has been viewed as a paradox occurs universally.
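The non-monotonicity can be made explicit with the standard decomposition of the observed-score correlation under classical test theory when errors are allowed to correlate (a sketch consistent with, but not necessarily identical to, the paper's derivation):
$$\rho_{XY} = \rho_{T_X T_Y}\sqrt{\rho_{XX'}\,\rho_{YY'}} + \rho_{E_X E_Y}\sqrt{(1-\rho_{XX'})(1-\rho_{YY'})}.$$
When $\rho_{E_X E_Y} > 0$, the second term vanishes as $\rho_{XX'} \to 1$, so validity can peak at an intermediate reliability rather than increasing monotonically, which is precisely the "attenuation paradox."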
