Similar Documents
20 similar documents retrieved.
1.
Reliability comprises both important social and scientific values and the methods for evidencing those values, though in practice the methods are often conflated with the values themselves. With the two understood as distinct, a reliability argument can be made that first articulates the reliability values most relevant to the particular measurement situation and then marshals the most appropriate evidence and theory to support a claim that those values are present. The steps in making a reliability argument are explained, and an extended example is given. The article is intended to provoke discussion, debate, and the development of additional reliability methodologies.

2.
In criterion-referenced tests (CRTs), the traditional measures of reliability used in norm-referenced tests (NRTs) have often proved problematic because of the NRT assumptions of a single underlying ability or competency and of variance in the distribution of scores. CRTs, by contrast, are likely to be created when mastery of the skill or knowledge by all or most test takers is expected, and thus little variation in the scores is expected. A comprehensive CRT often measures a number of discrete tasks that may not represent a single unifying ability or competence. Hence, CRTs theoretically violate the two most essential assumptions of classical NRT reliability theory, and estimating their reliability has traditionally entailed the logistical burden of administering the test more than once to the same test takers. A review of the literature categorizes approaches to reliability for CRTs into two classes: estimates sensitive to all sources of error and estimates of consistency in test outcome. For a single administration of a CRT, Livingston's k² is recommended for estimates sensitive to all sources of error, and Sc is proposed for estimates of consistency in test outcome. Both approaches are compared using data from a CRT exam, and recommendations for interpretation and use are proposed.
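Livingston's k² has a simple closed form (the form usually attributed to Livingston, 1972; the data, cut score, and reliability below are hypothetical), so the single-administration estimate is easy to sketch:

```python
import numpy as np

def livingston_k2(scores, cut_score, reliability):
    """Livingston's criterion-referenced reliability coefficient.

    Replaces deviations from the mean with deviations from the
    criterion (cut) score C:
        k^2 = (r * var + (mean - C)^2) / (var + (mean - C)^2)
    """
    scores = np.asarray(scores, dtype=float)
    var = scores.var(ddof=1)
    dev2 = (scores.mean() - cut_score) ** 2
    return (reliability * var + dev2) / (var + dev2)

# Hypothetical CRT data: conventional reliability .60, cut score 80
rng = np.random.default_rng(0)
scores = rng.normal(85, 5, size=200)
print(livingston_k2(scores, cut_score=80, reliability=0.60))
```

Note that as the group mean approaches the cut score, k² approaches the conventional coefficient, and it exceeds it whenever the mean is far from the cut.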

3.
This article reviews the validity of models based on (a) aptitude-achievement discrepancies, (b) low achievement, (c) intraindividual differences, and (d) response to instruction for the classification and identification of learning disabilities (LD). Models based on aptitude-achievement discrepancies and intraindividual differences showed little evidence of discriminant validity. Low achievement models had stronger discriminant validity but do not adequately assess the most significant component of the LD construct, unexpected underachievement. All three of these status models have limited reliability because of their reliance on a measurement at a single time point. Models that incorporate response to instruction have stronger reliability and validity but cannot represent the sole criterion for LD identification. Hybrid models combining low achievement and response to instruction most clearly capture the LD construct and have the most direct relation to instruction. The assessment of students for LD must reflect a stronger underlying classification that takes into account relations with other developmental disorders as well as the reliability and validity of the underlying classification and resultant identification system.

4.
There is growing interest in using measures of teacher applicant quality to improve hiring decisions, but the statistical properties of such measures are not well understood. We use unique data on structured ratings solicited from the references of teacher applicants to explore the dimensionality of measures of teacher applicant quality and the inter-rater reliability of the reference ratings. Despite questions designed to capture multiple dimensions of applicant quality, factor analysis suggests that the reference ratings capture only one underlying dimension. Point estimates of inter-rater reliability range between 0.23 and 0.31 and are significantly lower for novice applicants. It is difficult to judge whether these levels of reliability are high or low in the current context, given how little evidence exists on applicant assessment tools.
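The abstract does not say which inter-rater reliability estimator was used; a common choice when each applicant receives one rating per reference is a one-way random-effects intraclass correlation, sketched below on simulated data (the applicants-by-raters layout, rating scale, and all values are hypothetical):

```python
import numpy as np

def icc1(ratings):
    """One-way random-effects ICC(1) for an applicants x raters matrix.

    ICC(1) = (MSB - MSW) / (MSB + (k - 1) * MSW),
    where MSB/MSW are the between- and within-applicant mean squares
    and k is the number of raters per applicant.
    """
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    msb = k * ((ratings.mean(axis=1) - grand) ** 2).sum() / (n - 1)
    msw = ((ratings - ratings.mean(axis=1, keepdims=True)) ** 2).sum() / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# Hypothetical: 50 applicants, each rated by 2 references on a 1-5 scale
rng = np.random.default_rng(1)
true_quality = rng.normal(3.5, 0.5, size=(50, 1))
ratings = np.clip(np.round(true_quality + rng.normal(0, 0.8, size=(50, 2))), 1, 5)
print(icc1(ratings))
```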

5.
An increased focus on the use of research evidence (URE) in K-12 education has led to a proliferation of instruments measuring URE in K-12 education settings. However, to date, there has been no review of these measures to inform education researchers' assessment of URE. Here, we systematically review published quantitative measurement instruments in K-12 education. Findings suggest that instruments broadly assess user characteristics, environmental characteristics, and implementation and practices. In reviewing instrument quality, we found that studies infrequently report reliability, validity, and sample demographics for the instruments they develop or use. Future work evaluating and developing instruments should explore environmental characteristics that affect URE, generate items that align with URE theory, and follow standards for establishing instrument reliability and validity.

6.
Lord (1959) has shown that the standard error of measurement of a test is, for all practical purposes, directly proportional to the square root of the number of items on the test. More specifically, Lord found empirically that the standard error of a test was equal to .432√k, where k is the number of items, if the reliability of the test was computed by the Kuder-Richardson (KR) 20 formula. If the KR-21 formula was used, the standard error was equal to .45√k. The present paper sets out to show how these relationships may be derived from the defining formulas of reliability and standard error of measurement, if certain simple assumptions about values of test statistics are made.
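A sketch of the kind of derivation the paper promises, using the KR-21 case and two simple assumptions of my own for illustration: that the mean score is a fixed proportion p̄ of the k items, and that k is reasonably large.

```latex
\mathrm{SEM}^2 = \sigma_x^2\,(1 - r_{21}), \qquad
r_{21} = \frac{k}{k-1}\left[1 - \frac{\mu(k-\mu)}{k\,\sigma_x^2}\right]
\;\Longrightarrow\;
\mathrm{SEM}^2 = \frac{\mu(k-\mu) - \sigma_x^2}{k-1}.
```

Substituting μ = p̄k and letting k grow gives SEM ≈ √(p̄(1 − p̄))·√k; a typical mean proportion correct of p̄ ≈ .75 yields SEM ≈ .433√k, close to Lord's empirical constant.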

7.
ABSTRACT

In the past decade, there has been interest in the assessment of cognitive and affective processes and products for the purposes of meaningful learning. Meaningful measurement (MM) has been proposed in accordance with a humanistic, constructivist, information-processing perspective. Students' responses to the assessment tasks are evaluated according to an item response measurement model, together with a hypothesized model detailing the progressive forms of knowing/competence under examination. There is the possibility of incorporating student errors and alternative frameworks into these evaluation procedures. Meaningful measurement leads us to examine the composite concepts of "ability" and "difficulty". Under the rubric of meaningful measurement, validity assessment (i.e., the internal and external components of construct validity) is essentially the same as an inquiry into the meanings afforded by the measurements. Concepts of reliability expressed as a group statistic, applied in the same way to all examinees in the sample, become unnecessary when the precision of the trait estimates stemming from item response measurement models can be determined at each trait level. Reliability, measured in terms of the standard errors of the estimates, needs to be within acceptable limits if internal validity is to be secured. Further evidence of validity may be provided by in-depth analyses of how "epistemic subjects" of different levels of competence and proficiency engage in different types of assessment tasks, where affective and metacognitive behaviours may be examined as well. These ways of undertaking MM can be codified in a three-level conceptualization of MM. It is within the rubric of this conceptualization and the MM enquiry paradigm that the validity and reliability of test measures are discussed in this paper.

8.
Test unreliability due to guessing in multiple-choice and true/false tests is analysed from first principles, and two new measures are described, intended to be easily communicated without reference to the underlying statistics. One measure concerns the resolution of defined levels of knowledge, and the other the probability of examinees being incorrectly ranked. How the measures decrease with both test length and the number of response options per question is quantified. It is concluded that the results of many tests currently conducted are likely to be unacceptably unreliable. Procedures for increasing test reliability are discussed in a logical sequence intended to aid their understanding.
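The abstract does not reproduce the two measures' definitions, so the sketch below only illustrates the underlying phenomenon: the probability that guessing causes two examinees of genuinely different knowledge levels to be misranked, and how it shrinks with test length and the number of response options (all parameter values are hypothetical):

```python
import numpy as np

def misranking_prob(k, n_options, know_a, know_b, n_sim=100_000, seed=0):
    """Monte Carlo probability that examinee B (lower knowledge) scores
    at least as high as examinee A on a k-item test, when each unknown
    item is answered by guessing among n_options choices.

    know_a, know_b: proportions of items each examinee actually knows.
    """
    rng = np.random.default_rng(seed)
    g = 1.0 / n_options  # chance of a correct guess
    known_a, known_b = int(round(know_a * k)), int(round(know_b * k))
    score_a = known_a + rng.binomial(k - known_a, g, size=n_sim)
    score_b = known_b + rng.binomial(k - known_b, g, size=n_sim)
    return np.mean(score_b >= score_a)

# Misranking shrinks with more items (and likewise with more options):
for k in (20, 50, 100):
    print(k, misranking_prob(k, n_options=4, know_a=0.60, know_b=0.50))
```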

9.
Young children, ages 5–6 years, develop first beliefs about science and about themselves as science learners, and these beliefs are considered important precursors of children's future motivation to pursue science. Yet, due to a lack of adequate measures, little is known about young children's motivational beliefs about learning science. The present two-part study explores the motivational beliefs of young children using a new measure, the Young Children's Science Motivation (Y-CSM) scale. Initial measurement development involved a thorough literature review of existing measures and an extensive piloting phase until a final instrument was reached. To establish scale reliability, measurement invariance, and construct and criterion validity, the final instrument was administered to a new sample of 277 young children, ages 5–6 years, in northern Germany. Results reveal that children's motivational beliefs can be empirically differentiated into their self-confidence and their enjoyment in science at this young age. Older children were more motivated in science, but no significant gender differences were found. Importantly, children in preschools with a science focus reported significantly higher science motivation. This finding stresses the importance of early science education for the development of children's motivational beliefs about science.

10.
Numerous researchers have proposed methods for evaluating the quality of rater-mediated assessments using nonparametric methods (e.g., kappa coefficients) and parametric methods (e.g., the many-facet Rasch model). Generally speaking, popular nonparametric methods for evaluating rating quality are not based on a particular measurement theory. On the other hand, popular parametric methods for evaluating rating quality are often based on measurement theories such as invariant measurement. However, these methods are based on assumptions and transformations that may not be appropriate for ordinal ratings. In this study, I show how researchers can use Mokken scale analysis (MSA), which is a nonparametric approach to item response theory, to evaluate rating quality within the framework of invariant measurement without the use of potentially inappropriate parametric techniques. I use an illustrative analysis of data from a rater-mediated writing assessment to demonstrate how one can use numeric and graphical indicators from MSA to gather evidence of validity, reliability, and fairness. The results from the analyses suggest that MSA provides a useful framework within which to evaluate rater-mediated assessments for evidence of validity, reliability, and fairness that can supplement existing popular methods for evaluating ratings.
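As a minimal illustration of the numeric side of MSA, the sketch below computes Loevinger's scalability coefficient H for dichotomous items; the article itself works with polytomous ratings (in practice one would use dedicated software such as the R mokken package), and the simulated data are hypothetical:

```python
import numpy as np

def loevinger_h(X):
    """Loevinger's scalability coefficient H for dichotomous items.

    H = sum of observed inter-item covariances divided by the sum of
    their maximum possible covariances given the item marginals; a
    core numeric indicator in Mokken scale analysis.
    """
    X = np.asarray(X, dtype=float)
    p = X.mean(axis=0)
    cov = np.cov(X, rowvar=False, ddof=0)
    num, den = 0.0, 0.0
    n_items = X.shape[1]
    for i in range(n_items):
        for j in range(i + 1, n_items):
            num += cov[i, j]
            den += min(p[i], p[j]) - p[i] * p[j]  # maximum covariance
    return num / den

# Hypothetical data: 300 examinees, 5 items driven by one latent trait
rng = np.random.default_rng(2)
theta = rng.normal(size=(300, 1))
difficulty = np.linspace(-1, 1, 5)
X = (theta - difficulty + rng.logistic(size=(300, 5)) > 0).astype(int)
print(loevinger_h(X))
```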

11.
We evaluated community general education (CGE; n = 178), community special education (CSE; n = 30), and hospital-referred (HR; n = 145) children (ages 7-6 to 11-11) prospectively over a 2-year period. During this period, 17 CGE children were referred for evaluation (community referred; CR). Prior to referral, CR children performed more poorly than community-nonreferred (CNR) children on cognitive ability, academic achievement, attention problems, and information processing. CR group performance was equivalent to that of the CSE and HR groups, but HR children showed poorer academic achievement. Referred children performed more poorly on all measures than nonreferred children, whether they met formal diagnostic criteria for a learning disorder or not. Learning disorders may be better conceptualized as a context-dependent problem of functional adaptation than as a disability analogous to physical disabilities, raising questions about the validity of using psychometric test scores as the criterion for identification.

12.
This study presents evidence regarding the construct validity and internal consistency of the IFSP Rating Scale (McWilliam & Jung, 2001), which was designed to rate individualized family service plans (IFSPs) on 12 indicators of family-centered practice. Here, the Rasch measurement model is employed to investigate the scale's functioning and fit, for both person and item diagnostics, on 120 IFSPs that were previously analyzed with a classical test theory approach. Analyses demonstrated that scores on the IFSP Rating Scale fit the model well, though additional items could improve the scale's reliability. Implications for applying the Rasch model to improve special education research and practice are discussed.

13.
What is the extent of error likely with each of several approximations for the standard deviation, internal consistency reliability, and the standard error of measurement? To help answer this question, approximations were compared with exact statistics obtained on 85 different classroom tests constructed and administered by professors in a variety of fields; means and standard deviations of the resulting differences supported the use of approximations in practical situations. Results of this analysis (1) suggest a greater number of alternative formulas that might be employed, and (2) provide additional information concerning the accuracy of approximations with non-normal distributions.

14.
A reliability coefficient for criterion-referenced tests is developed from the assumptions of classical test theory. This coefficient is based on deviations of scores from the criterion score, rather than from the mean. The coefficient is shown to have several of the important properties of the conventional norm-referenced reliability coefficient, including its interpretation as a ratio of variances and as a correlation between parallel forms, its relationship to test length, its estimation from a single form of a test, and its use in correcting for attenuation due to measurement error. Norm-referenced measurement is considered as a special case of criterion-referenced measurement.
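A plausible reading of the construction, consistent with Livingston's k² recommended in item 2 above (the notation here is mine, not the paper's): replace squared deviations from the mean with squared deviations from the criterion score C, so that

```latex
k^2 \;=\; \frac{\mathbb{E}\!\left[(T - C)^2\right]}{\mathbb{E}\!\left[(X - C)^2\right]}
\;=\; \frac{r_{xx}\,\sigma_x^2 + (\mu_x - C)^2}{\sigma_x^2 + (\mu_x - C)^2},
```

which is a ratio of criterion-referenced "variances", and which reduces to the conventional norm-referenced coefficient r_xx when C = μ_x, exactly the special-case relationship the abstract describes.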

15.
Test fairness and test bias are not synonymous concepts. Test bias refers to statistical evidence that the psychometrics or interpretation of test scores depend on group membership, such as gender or race, when such differences are not expected. A test that is grossly biased may be judged to be unfair, but test fairness concerns the broader, more subjective evaluation of assessment outcomes from perspectives of social justice. Thus, the determination of test fairness is not solely a matter of statistics, but statistical evidence is important when evaluating test fairness. This work introduces the use of the structural equation modelling technique of multiple-group confirmatory factor analysis (MGCFA) to evaluate hypotheses of measurement invariance, or whether a set of observed variables measures the same factors with the same precision over different populations. An example of testing for measurement invariance with MGCFA in an actual, downloadable data set is also demonstrated.
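In standard notation (not taken from the article), MGCFA fits the common factor model separately by group and then tests increasingly strict equality constraints:

```latex
x_g = \tau_g + \Lambda_g\,\xi_g + \delta_g, \qquad g = 1, \dots, G
```

Configural invariance constrains only the pattern of zero and nonzero loadings; metric invariance adds Λ₁ = ⋯ = Λ_G (the same factors in the same units); scalar invariance adds τ₁ = ⋯ = τ_G (comparable latent means); strict invariance adds equal residual variances, the "same precision" condition the abstract mentions. Each step is evaluated by the change in model fit relative to the previous one.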

16.
Direct observation of behaviors is a data collection method customarily used in clinical and educational settings. Repeated measures and small samples are inherent characteristics of observational studies that pose challenges to the numerical estimation of reliability for observational data. In this article, we review some debates about the use of Generalizability Theory in estimating the reliability of single-subject observational data. We propose that it can be used, but under a clearly stated set of conditions. The conceptualization of facets and the object of measurement for a common design of observational research is elucidated in a new light. We provide two numerical examples to illustrate the ideas. Limitations of using Generalizability Theory to estimate the reliability of observational data are discussed. © 2007 Wiley Periodicals, Inc. Psychol Schs 44: 433–439, 2007.
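A minimal sketch of the kind of G-study computation involved, for a single-facet crossed design with one observation per cell; here the object of measurement is taken to be sessions and the facet to be observers, in the spirit of the article's reconceptualization, and all data are simulated:

```python
import numpy as np

def g_study(X):
    """Single-facet crossed design (sessions x observers), one
    observation per cell. Estimates variance components from the
    two-way ANOVA expected mean squares and returns the relative
    generalizability coefficient for the mean over all observers.
    """
    X = np.asarray(X, dtype=float)
    n_s, n_o = X.shape
    grand = X.mean()
    ms_s = n_o * ((X.mean(axis=1) - grand) ** 2).sum() / (n_s - 1)
    ms_o = n_s * ((X.mean(axis=0) - grand) ** 2).sum() / (n_o - 1)
    resid = X - X.mean(axis=1, keepdims=True) - X.mean(axis=0) + grand
    ms_so = (resid ** 2).sum() / ((n_s - 1) * (n_o - 1))
    var_so = ms_so                        # sigma^2(so,e)
    var_s = max((ms_s - ms_so) / n_o, 0)  # sigma^2(sessions)
    var_o = max((ms_o - ms_so) / n_s, 0)  # sigma^2(observers)
    e_rho2 = var_s / (var_s + var_so / n_o)
    return var_s, var_o, var_so, e_rho2

# Hypothetical single-subject data: 10 sessions x 3 observers
rng = np.random.default_rng(3)
sessions = rng.normal(10, 2, size=(10, 1))
obs = sessions + rng.normal(0, 1, size=(10, 3))
print(g_study(obs))
```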

17.
Based on educational statistics and measurement theory, and using students' raw examination scores, mathematical models are given for analyzing test-paper quality along four dimensions: difficulty, discrimination, reliability, and validity. Through the analysis of a concrete example, the computations for the given models are carried out in MATLAB, the results of the test-paper quality analysis are obtained, and suggestions for revising the test items are proposed. This helps to further improve the quality of item writing, check teaching effectiveness, appraise teaching quality, and improve teachers' teaching methods.
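The paper implements its models in MATLAB, and the abstract does not give the formulas; the rough Python sketch below therefore uses the textbook versions of the three most mechanical indices (difficulty as proportion correct, discrimination as an upper-minus-lower-27% contrast, and KR-20 reliability) on simulated 0/1 item scores:

```python
import numpy as np

def item_analysis(X):
    """X: examinees x items matrix of 0/1 scores.

    Returns per-item difficulty (proportion correct), discrimination
    (upper-27% minus lower-27% proportion correct), and KR-20 test
    reliability.
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    totals = X.sum(axis=1)
    order = np.argsort(totals)
    g = max(int(round(0.27 * n)), 1)
    low, high = order[:g], order[-g:]
    difficulty = X.mean(axis=0)
    discrimination = X[high].mean(axis=0) - X[low].mean(axis=0)
    var_total = totals.var(ddof=1)
    kr20 = (k / (k - 1)) * (1 - (difficulty * (1 - difficulty)).sum() / var_total)
    return difficulty, discrimination, kr20

# Hypothetical exam: 100 examinees, 20 items of graded difficulty
rng = np.random.default_rng(4)
theta = rng.normal(size=(100, 1))
b = np.linspace(-1.5, 1.5, 20)
X = (theta - b + rng.logistic(size=(100, 20)) > 0).astype(int)
diff, disc, kr20 = item_analysis(X)
print(diff.round(2), disc.round(2), round(kr20, 3))
```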

18.
ABSTRACT

We investigate whether Anchoring Vignettes (AV) improve the intercultural comparability of non-cognitive, student-directed factors (e.g., procrastination). So far, correlation analyses of anchored and non-anchored scores with a criterion have been used to demonstrate the effectiveness of AV in improving data quality. However, correlation analyses are typically used to investigate the external validity of a scale, and before testing for validity, the reliability of the measurement of a construct should be examined. In the present study, we tested for measurement invariance across countries and languages and compared anchored and non-anchored student-directed self-reports that are highly relevant for students' selves and their behaviour and performance. In addition, we apply further criteria for testing reliability. The results indicate that data quality for some of the constructs can, in fact, be improved slightly by anchoring, whereas for other self-reports anchoring is less successful than was hoped. We discuss possible consequences for research methodology.

19.
In this paper, an attempt has been made to synthesize some of the current thinking in the area of criterion-referenced testing as well as to provide the beginning of an integration of theory and method for such testing. Since criterion-referenced testing is viewed from a decision-theoretic point of view, approaches to reliability and validity estimation consistent with this philosophy are suggested. Also, to improve the decision-making accuracy of criterion-referenced tests, a Bayesian procedure for estimating true mastery scores has been proposed. This Bayesian procedure uses information about other members of a student's group (collateral information), but the resulting estimation is still criterion referenced rather than norm referenced in that the student is compared to a standard rather than to other students. In theory, the Bayesian procedure increases the "effective length" of the test by improving the reliability, the validity, and more importantly, the decision-making accuracy of the criterion-referenced test scores.
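The abstract does not specify the estimator, but the general idea, shrinking each observed score toward the group mean (the collateral information) in proportion to reliability, can be sketched with a Kelley-style regressed score; the scores, reliability, and mastery cut below are hypothetical:

```python
import numpy as np

def regressed_true_score(x, reliability):
    """Kelley-style empirical-Bayes estimate of true scores:
    shrink each observed score toward the group mean by the
    test's reliability. The estimate uses collateral information
    from the group, but each student is still compared to the
    mastery standard, not to other students.
    """
    x = np.asarray(x, dtype=float)
    return reliability * x + (1 - reliability) * x.mean()

# Hypothetical CRT: proportion-correct scores, reliability .70, cut .80
scores = np.array([0.95, 0.85, 0.78, 0.60, 0.82])
est = regressed_true_score(scores, reliability=0.70)
print(est.round(3), est >= 0.80)
```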

20.
Applied Measurement in Education (《教育实用测度》), 2013, 26(1): 9–26
We present statistical and theoretical issues that arise from assessing person-fit on measures of typical performance. After presenting the status of past and current research issues, we describe three topics of ongoing concern. First, because typical performance measures tend to be short, and because they have low bandwidth, the detection of person-misfit is often attenuated. Second, there is a need for creative methods of identifying the specific sources of response aberrancy, rather than simply identifying person-misfit. Third, the promise of person-fit measures as moderators of trait-criterion relations remains undemonstrated. We offer commentary on, and potential resolutions to, these three current topics. In terms of future research directions, we outline two lines of advancement that are relevant for both educational and personality psychologists. These are (a) the use of person-fit statistics in the assessment of how item response theory measurement models differ across manifest groups (e.g., ethnicity, gender), and (b) the application of person-fit statistics under "external" item response theory model conditions. We summarize the role these advances could play in helping educational testers go beyond the standard task of identifying "invalid" protocols by discussing how person-fit assessment may contribute to our understanding of individual and group differences in trait structure.
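The abstract discusses person-fit statistics without defining one; the most widely used index in this literature is the standardized log-likelihood l_z (Drasgow, Levine, & Williams, 1985), sketched below on a hypothetical response pattern:

```python
import numpy as np

def lz_person_fit(u, p):
    """Standardized log-likelihood person-fit statistic l_z for one
    response vector.

    u: 0/1 item responses; p: model-implied probabilities of a keyed
    response for this person under a fitted IRT model. Large negative
    values flag aberrant (misfitting) response patterns.
    """
    u, p = np.asarray(u, float), np.asarray(p, float)
    l0 = np.sum(u * np.log(p) + (1 - u) * np.log(1 - p))
    mean = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))
    var = np.sum(p * (1 - p) * (np.log(p / (1 - p))) ** 2)
    return (l0 - mean) / np.sqrt(var)

# Hypothetical probabilities from some fitted model, two examinees
p = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3])
u_typical = np.array([1, 1, 1, 1, 0, 0, 0])   # consistent with p
u_aberrant = np.array([0, 0, 0, 0, 1, 1, 1])  # reversed pattern
print(lz_person_fit(u_typical, p), lz_person_fit(u_aberrant, p))
```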
