Similar Literature
20 similar records retrieved
1.
The conventional focus of validity in educational measurement has been on intended interpretations and uses of test scores. Empirical studies of test use by teachers, administrators and policy-makers show that actual interpretations and uses of test scores in context are invariably shaped by local users’ questions, which frequently require attention to multiple sources of evidence about students’ learning and the factors that shape it, and depend on local capacity to use such information well. This requires a more complex theory of validity that can shift focus as needed from the intended interpretations and uses of test scores that guide test developers to local capacity to support the actual interpretations, decisions and actions that routinely serve local users’ purposes. I draw on the growing empirical literature on data use to illustrate the need for an expanded theory of validity, point to theoretical resources that might guide such an expansion, and suggest a research agenda towards these ends.

2.
In the face of accumulating research and logic, the use of a discrepancy between intelligence and reading achievement test scores is becoming increasingly untenable as a marker of reading disabilities. However, it is not clear what criteria might replace the discrepancy requirement. We surveyed 218 members of journal editorial boards to solicit their opinions on current and proposed definitional components and exclusion criteria. Three components were selected by over two‐thirds of the respondents: reading achievement, phonemic awareness, and treatment validity. However, only 30 percent believed IQ‐reading achievement discrepancy should be a marker. More than 75 percent of the respondents believed exclusion criteria should remain part of the definition. Mental retardation was the most frequently selected exclusion criterion despite rejection of intelligence test scores as a definitional component. Although the findings reflect uncertainty among experts on what elements should comprise a definition, they do signal a willingness to consider new approaches to the conceptually difficult task of defining reading disabilities.

3.
4.
A number of mental-test theorists have called attention to the fact that increasing test reliability beyond an optimal point can actually lead to a decrement in the validity of that test with respect to a criterion. This non-monotonic relation between reliability and validity has been referred to by Loevinger as the “attenuation paradox,” because Spearman’s correction for attenuation leads one to expect that increasing reliability will always increase validity. In this paper a mathematical link between test reliability and test validity is derived which takes into account the correlation between error scores on a test and error scores on a criterion measure the test is designed to predict. It is proved that when the correlation between these two sets of error scores is positive, the non-monotonic relation between test reliability and test validity which has been viewed as a paradox occurs universally.
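For readers unfamiliar with the relations this abstract builds on, the following is a minimal classical-test-theory sketch (standard textbook results, not the paper's own derivation) of why correlated errors make the reliability–validity relation non-monotonic.

```latex
% Sketch under classical test theory: observed scores decompose as
% X = T_X + E_X and Y = T_Y + E_Y, with errors uncorrelated with true scores,
% so the observed test-criterion correlation is
\[
  \rho_{XY} \;=\; \frac{\mathrm{Cov}(T_X, T_Y) + \mathrm{Cov}(E_X, E_Y)}
                       {\sigma_X\,\sigma_Y}.
\]
% When Cov(E_X, E_Y) = 0, Spearman's correction for attenuation applies,
%   rho_{T_X T_Y} = rho_{XY} / sqrt(rho_{XX'} rho_{YY'}),
% so rho_{XY} <= sqrt(rho_{XX'} rho_{YY'}) and raising reliability can only
% raise the ceiling on validity. When Cov(E_X, E_Y) > 0, part of the observed
% correlation rides on shared error; "purifying" the test (raising rho_{XX'})
% removes that shared component, so observed validity can decline -- the
% attenuation paradox the abstract describes.
```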

5.
Recent critics of the Defining Issues Test (DIT) suggest that moral judgment development as currently measured is neither developmental nor moral. Instead, scores on the DIT are claimed to be the result of political attitudes or verbal ability. In making such claims, these critics raise the possibility that the validity of the DIT is suspect as well as the construct it purports to measure. We begin our response with an overview of the various claims that DIT scores reduce to political attitudes or verbal ability. Then, a relevant sample of the DIT literature is summarized in order to assess the degree to which relationships between DIT scores and criterion variables typically used to support the validity of the DIT can be explained by political attitudes or verbal ability. This summary suggests that moral judgment development as measured by the DIT provides a unique source of information that cannot be explained by general/verbal ability or political attitudes.

6.
This study analyzed the relationship between benchmark scores from two curriculum‐based measurement probes in mathematics (M‐CBM) and student performance on a state‐mandated high‐stakes test. Participants were 298 students enrolled in grades 7 and 8 in a rural southeastern school. Specifically, we calculated the criterion‐related and predictive validity of benchmark scores from CBM probes measuring math computation and math reasoning skills. Results of this study suggest that math reasoning probes have strong concurrent and predictive validity. The study also provides evidence that calculation skills, while important, do not have strong predictive strength at the secondary level when a state math assessment is the criterion. When reading comprehension skill is taken into account, math reasoning scores explained the greatest amount of variance in the criterion measure. Computation scores explained less than 5% of the variance in the high‐stakes test, suggesting that computation probes may have limitations as a universal screening measure for secondary students.
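The hierarchical-regression logic the abstract summarizes can be sketched as follows; the data, variable names, and effect sizes are hypothetical, not the study's dataset.

```python
# Hypothetical sketch of the hierarchical-regression logic described above.
import numpy as np

def r_squared(X, y):
    """R^2 from an OLS fit of y on X (an intercept column is added)."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid.var() / y.var()

rng = np.random.default_rng(0)
n = 298                                          # sample size reported in the abstract
reading = rng.normal(size=n)                     # reading comprehension covariate
reasoning = 0.6 * reading + rng.normal(size=n)   # math reasoning probe (invented relation)
computation = 0.3 * reading + rng.normal(size=n) # math computation probe (invented relation)
state_test = 0.7 * reasoning + 0.1 * computation + rng.normal(size=n)

# Enter reading comprehension first, then ask how much each probe adds.
base = r_squared(reading.reshape(-1, 1), state_test)
with_reasoning = r_squared(np.column_stack([reading, reasoning]), state_test)
with_computation = r_squared(np.column_stack([reading, computation]), state_test)

print(f"Delta R^2, reasoning over reading:   {with_reasoning - base:.3f}")
print(f"Delta R^2, computation over reading: {with_computation - base:.3f}")
```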

7.
In recent years, readability formulas have gained new prominence as a basis for selecting texts for learning and assessment. Variables that quantitative tools count (e.g., word frequency, sentence length) provide valid measures of text complexity insofar as they accurately predict representative and high-quality criteria. The longstanding consensus of text researchers has been that such criteria will measure readers’ comprehension of sample texts. This study used Bormuth’s (1969) rigorously developed criterion measure to investigate two of today’s most widely used quantitative text tools—the Lexile Framework and the Flesch–Kincaid Grade-Level formula. Correlations between the two tools’ complexity scores and Bormuth’s measured difficulties of criterion passages were only moderately high in light of the literature and new high-stakes uses for such tools. These correlations declined slightly when passages from the University grade band of use were removed. The ability of these tools to predict measured text difficulties within any single grade band below University was low. Analyses showed that word complexity made a larger contribution relative to sentence complexity when each tool’s predictors were regressed on the Bormuth criterion rather than their original criteria. When the criterion was texts’ grade band of use instead of mean cloze scores, neither tool classified texts well and errors disproportionately placed texts from higher grade bands into lower ones. Results suggest these two text tools may lack adequate validity for their current uses in educational settings.
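For reference, the widely cited form of the Flesch–Kincaid Grade-Level formula evaluated in the study is shown below (Kincaid et al., 1975); the Lexile Framework, by contrast, relies on word-frequency and sentence-length components rather than a single published equation.

```latex
% Standard Flesch--Kincaid Grade-Level formula (Kincaid et al., 1975),
% shown for reference only; it is not reproduced from the abstract above.
\[
  \text{FKGL} \;=\; 0.39 \left( \frac{\text{total words}}{\text{total sentences}} \right)
          \;+\; 11.8 \left( \frac{\text{total syllables}}{\text{total words}} \right)
          \;-\; 15.59
\]
```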

8.
Most of the articles and dissertations dealing with cognitive preferences written since the construct was introduced in the early 1960s were reviewed. Fifty-four of them were found suitable for meta-analysis. The meta-analysis presents means and standard deviations of reliabilities, correlations, standard scores, and effect sizes. The effects and relationships of cognitive preferences and important school- and learning-related variables were studied. The results provide baseline data for comparative purposes and offer empirical evidence that supports the construct validity of cognitive preferences.

9.
Every year, thousands of college and university applicants with learning disabilities (LD) present scores from standardized examinations as part of the admissions process for postsecondary education. Many of these scores are from tests administered with nonstandard procedures due to the examinees' learning disabilities. Using a sample of college students with LD and a control sample, this study investigated the criterion validity and comparability of scores on the Miller Analogies Test when accommodations for the examinees with LD were in place. Scores for examinees with LD from test administrations with accommodations were similar to those of examinees without LD on standard administrations, but less well associated with grade point averages. The results of this study provide evidence that although scores for examinees with LD from nonstandard test administrations are comparable to scores for examinees without LD, they have less criterion validity and are less meaningful for their intended purpose.

10.
Automated text complexity measurement tools (also called readability metrics) have been proposed as a way to help teachers, textbook publishers, and assessment developers select texts that are closely aligned with the new, more demanding text complexity expectations specified in the Common Core State Standards. This article examines a critical element of the validity arguments presented in support of proposed metrics: the claim that criterion text complexity scores developed from students’ responses to reading comprehension test items are reflective of the difficulties actually experienced by students while reading. Evidence that fails to support this assertion is examined, and implications relative to the goal of obtaining valid, unbiased evidence about the measurement properties of proposed readability metrics are discussed.

11.
Assessment data must be valid for the purpose for which educators use them. Establishing evidence of validity is an ongoing process that must be shared by test developers and test users. This study examined the predictive validity and the diagnostic accuracy of universal screening measures in reading. Scores on three different universal screening tools were compared for nearly 500 second‐ and third‐grade students attending four public schools in a large urban district. Hierarchical regression and receiver operating characteristic curves were used to examine the criterion‐related validity and diagnostic accuracy of students’ oral reading fluency (ORF), Fountas and Pinnell Benchmark Assessment System (BAS) scores, and fall scores from the Measures of Academic Progress for reading (MAP). Results indicated that a combination of all three measures accounted for 65% of the variance in spring MAP scores, whereas a reduced model of ORF and MAP scores predicted 60%. ORF and BAS scores did not meet standards for diagnostic accuracy. Combining the measures improved diagnostic accuracy, depending on how criterion scores were calculated. Implications for practice and future research are discussed.
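As a rough illustration of the diagnostic-accuracy indices reported in studies like this one (sensitivity, specificity, and the area under the ROC curve), here is a minimal sketch with invented screening and criterion scores and an arbitrary cut point; it is not the study's analysis.

```python
# Illustrative diagnostic-accuracy sketch: given a fall screening score and a
# dichotomized spring criterion, compute sensitivity, specificity, and AUC.
# All data and cut points below are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
n = 500
screen = rng.normal(50, 10, n)                      # e.g., fall ORF score
criterion = screen + rng.normal(0, 12, n)           # e.g., spring criterion score
at_risk = criterion < np.percentile(criterion, 25)  # "not proficient" flag

cut = 45.0                                          # hypothetical screening cut score
flagged = screen < cut

sensitivity = (flagged & at_risk).sum() / at_risk.sum()
specificity = (~flagged & ~at_risk).sum() / (~at_risk).sum()

# AUC via the rank (Mann-Whitney) formulation: the probability that a randomly
# chosen at-risk student has a lower screening score than a non-at-risk student.
pos, neg = screen[at_risk], screen[~at_risk]
auc = (pos[:, None] < neg[None, :]).mean()

print(f"sensitivity={sensitivity:.2f}  specificity={specificity:.2f}  AUC={auc:.2f}")
```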

12.
Verbal and quantitative reasoning tests provide valuable information about cognitive abilities that are important to academic success. Information about these abilities may be particularly valuable to teachers of students who are English‐language learners (ELL), because leveraging reasoning skills to support comprehension is a critical aptitude for their academic success. However, due to concerns about cultural bias, many researchers advise exclusive use of nonverbal tests with ELL students despite a lack of evidence that nonverbal tests provide greater validity for these students. In this study, a test measuring verbal, quantitative, and nonverbal reasoning was administered to a culturally and linguistically diverse sample of students. The two‐year predictive relationship between ability and achievement scores revealed that nonverbal scores had weaker correlations with future achievement than did quantitative and verbal reasoning ability scores for ELL and non‐ELL students. Results do not indicate differential prediction and do not support the exclusive use of nonverbal tests for ELL students.
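The differential-prediction check mentioned in the abstract is conventionally run as a Cleary-style moderated regression; the sketch below uses invented data and simply reports the group and interaction coefficients that would signal intercept or slope differences.

```python
# Minimal Cleary-style differential-prediction sketch (hypothetical data):
# regress achievement on ability, group membership, and their interaction.
import numpy as np

rng = np.random.default_rng(4)
n = 1000
ell = rng.integers(0, 2, n)                           # 1 = English-language learner
ability = rng.normal(size=n)                          # reasoning score (standardized)
achievement = 0.6 * ability + rng.normal(0, 0.8, n)   # same slope/intercept by design

X = np.column_stack([np.ones(n), ability, ell, ability * ell])
beta, *_ = np.linalg.lstsq(X, achievement, rcond=None)
labels = ["intercept", "ability", "ELL (intercept shift)", "ability x ELL (slope shift)"]
for name, coef in zip(labels, beta):
    print(f"{name:28s} {coef:+.3f}")
# Near-zero group and interaction terms are consistent with no differential prediction.
```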

13.
Test administrators are appropriately concerned about the potential for time constraints to impact the validity of score interpretations; psychometric efforts to evaluate the impact of speededness date back more than half a century. The widespread move to computerized test delivery has led to the development of new approaches to evaluating how examinees use testing time and to new metrics designed to provide evidence about the extent to which time limits impact performance. Much of the existing research is based on these types of observational metrics; relatively few studies use randomized experiments to evaluate the impact of time limits on scores. Of those studies that do report on randomized experiments, none directly compare the experimental results to evidence from observational metrics to evaluate the extent to which these metrics are able to sensitively identify conditions in which time constraints actually impact scores. The present study provides such evidence based on data from a medical licensing examination. The results indicate that these observational metrics are useful but provide an imprecise evaluation of the impact of time constraints on test performance.
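The abstract does not name the specific observational metrics examined; as one illustration of the kind of index typically used in speededness research, the sketch below computes a completion-based indicator and a time-use indicator from invented response data.

```python
# One commonly used family of observational speededness indices (not necessarily
# the metrics used in the study above): the proportion of examinees who fail to
# respond to the final items, and the share of examinees using nearly all of the
# allotted time. Data are invented.
import numpy as np

responses = np.array([          # 1 = answered, 0 = not reached; rows = examinees
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
])
time_used = np.array([0.72, 0.99, 0.97, 0.60])   # fraction of the time limit used

not_reached_rate = 1.0 - responses[:, -2:].all(axis=1).mean()  # missed either of last 2 items
near_limit_rate = (time_used > 0.95).mean()

print(f"examinees not completing the final items: {not_reached_rate:.0%}")
print(f"examinees using >95% of allotted time:    {near_limit_rate:.0%}")
```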

14.
The current study evaluates existing and new validity evidence for the Academic Motivation Scale (AMS; Vallerand et al., 1992). We first provide a narrative review synthesizing past research, and then conduct a validity investigation of the scores from the measure. Data analysis using a sample of 1,406 American college students provided construct validity evidence in the form of a well-fitting seven-factor model and adequate internal consistency of the item responses. Convergent and discriminant validity evidence provided insight into the distinctiveness of the seven subscales. However, support for the scale’s simplex structure, which would represent the self-determination theory continuum, was not fully substantiated. Implications for theory and the scale’s use in the current form are discussed.
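One informal way to inspect the simplex structure mentioned above is to check whether subscale correlations fall off as subscales sit further apart on the hypothesized self-determination continuum. The sketch below uses the AMS subscale ordering but entirely hypothetical data, and is not a substitute for the formal model comparisons the paper reports.

```python
# Informal simplex check: under the self-determination continuum, subscale
# correlations should decline as subscales sit further apart in the ordered
# sequence. Subscale names follow the AMS; the data frame here is hypothetical.
import numpy as np
import pandas as pd

order = ["amotivation", "external", "introjected", "identified",
         "intrinsic_know", "intrinsic_accomplish", "intrinsic_stimulation"]

rng = np.random.default_rng(2)
scores = pd.DataFrame(rng.normal(size=(1406, 7)), columns=order)  # subscale means per respondent

corr = scores[order].corr().to_numpy()
for lag in range(1, len(order)):
    mean_r = np.mean([corr[i, i + lag] for i in range(len(order) - lag)])
    print(f"mean correlation at distance {lag}: {mean_r:+.2f}")
# A simplex pattern would show these means decreasing monotonically as the
# distance (lag) between subscales grows.
```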

15.
This study uses decision tree analysis to determine the most important variables that predict high overall teaching and course scores on a student evaluation of teaching (SET) instrument at a large public research university in the United States. Decision tree analysis is a more robust and intuitive approach for analysing and interpreting SET scores compared to more common parametric statistical approaches. Variables in this analysis included individual items on the SET instrument, self-reported student characteristics, course characteristics and instructor characteristics. The results show that items on the SET instrument that most directly address fundamental issues of teaching and learning, such as helping the student to better understand the course material, are most predictive of high overall teaching and course scores. SET items less directly related to student learning, such as those related to course grading policies, have little importance in predicting high overall teaching and course scores. Variables irrelevant to the construct, such as an instructor’s gender and race/ethnicity, were not predictive of high overall teaching and course scores. These findings provide evidence of criterion and discriminant validity, and show that high SET scores do not reflect student biases against an instructor’s gender or race/ethnicity.
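A minimal sketch of the decision-tree approach described above, using scikit-learn on invented data: fit a shallow tree predicting a high overall rating and inspect feature importances. The feature names only mimic the kinds of variables in the study and are not its actual instrument items.

```python
# Hypothetical decision-tree sketch: which variables drive a "high overall rating"?
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
n = 2000
X = np.column_stack([
    rng.integers(1, 6, n),   # "instructor helped me understand the material" (1-5)
    rng.integers(1, 6, n),   # "grading policy was clear" (1-5)
    rng.integers(0, 2, n),   # instructor gender (construct-irrelevant)
    rng.integers(1, 5, n),   # expected grade
])
feature_names = ["understanding_item", "grading_item", "instructor_gender", "expected_grade"]

# Simulate the pattern the study reports: overall rating tracks the learning item.
overall_high = (X[:, 0] + rng.normal(0, 0.5, n) >= 4).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, overall_high)
for name, importance in zip(feature_names, tree.feature_importances_):
    print(f"{name:20s} importance = {importance:.2f}")
```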

16.
Measures of biographical data, or biodata, provide indicators of one's life history and past experiences. Biodata information is often available in various forms during processes of academic admissions to higher education. Such information can be used, in combination with other factors, to predict students’ future academic and extra-curricular accomplishments. There is a scattered body of literature investigating relationships between standardized biodata measures and a number of student criteria in college. The current study uses meta-analysis methods to summarize findings on how various biodata measures—overall scores or scale scores—predict student accomplishments, including grades, self- and other-rated performances, persistence, and extracurricular accomplishments. Data from 46 independent samples, consisting of 38,478 students and resulting in 74 individual predictor–criterion relationships, were analyzed. Results generally indicate that biodata measures substantially predict students’ academic and extra-curricular accomplishments. Overall biodata scores correlate with grades at .39, persistence at .25, and point-hour ratios at .35. Students’ accomplishments in leadership, visual and performing arts, music, and science were predicted best by biodata measures developed specifically to target those outcomes. This meta-analytic study provides support for the predictive validity of biographical data inventories with respect to student outcomes and adds justification to the use of biodata in academic selection.
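The core meta-analytic step, pooling correlations across independent samples with sample-size weights in the Hunter–Schmidt style, can be sketched as follows; the r and n values are invented, not taken from the 46 samples analysed.

```python
# Bare-bones sketch of pooling correlations across studies (sample-size
# weighting, Hunter-Schmidt style). All r and n values are hypothetical.
import numpy as np

r = np.array([0.42, 0.35, 0.28, 0.45, 0.37])   # study correlations (hypothetical)
n = np.array([310, 1250, 540, 220, 805])       # study sample sizes (hypothetical)

r_bar = np.sum(n * r) / np.sum(n)                      # n-weighted mean correlation
var_obs = np.sum(n * (r - r_bar) ** 2) / np.sum(n)     # observed variance of r
var_err = (1 - r_bar ** 2) ** 2 / (n.mean() - 1)       # expected sampling-error variance
print(f"weighted mean r = {r_bar:.2f}, observed var = {var_obs:.4f}, "
      f"sampling-error var = {var_err:.4f}")
```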

17.
This study examines the predictive validity of three commonly used nursing school admission indices (scholastic aptitude test scores, matriculation grades, and evaluations of performance in a group interview situation) in a sample of 321 Israeli nursing school students. Grade point average, supervisor evaluation of clinical internship, and scores on a government certification exam served as primary indices of criterion performance. Whereas composite aptitude test scores correlated moderately with both grade point average and certification exam scores, matriculation grades correlated negligibly with all three criterion measures. Group interview ratings correlated moderately with clinical performance, but negligibly with the remaining criteria. Aptitude test scores were not found to be biased predictors of criterion performance by ethnicity or social background. The implications of these findings for the selection of nursing school candidates in Israel are discussed.

18.
When a computerized adaptive testing (CAT) version of a test co-exists with its paper-and-pencil (P&P) version, it is important for scores from the CAT version to be comparable to scores from its P&P version. The CAT version may require multiple item pools for test security reasons, and CAT scores based on alternate pools also need to be comparable to each other. In this paper, we review the research literature on CAT comparability issues and synthesize issues specific to these two settings. A framework of criteria for evaluating comparability was developed that contains the following three categories of criteria: validity criterion, psychometric property/reliability criterion, and statistical assumption/test administration condition criterion. Methods for evaluating comparability under these criteria as well as various algorithms for improving comparability are described and discussed. Focusing on the psychometric property/reliability criterion, an example using an item pool of ACT Assessment Mathematics items is provided to demonstrate a process for developing comparable CAT versions and for evaluating comparability. This example illustrates how simulations can be used to improve comparability at the early stages of the development of a CAT. The effects of different specifications of practical constraints, such as content balancing and item exposure rate control, and the effects of using alternate item pools are examined. One interesting finding from this study is that a large part of incomparability may be due to the change from number-correct score-based scoring to IRT ability estimation-based scoring. In addition, changes in components of a CAT, such as exposure rate control, content balancing, test length, and item pool size, were found to result in different levels of comparability in test scores.
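The scoring change the authors single out, from number-correct scoring to IRT ability estimation, can be illustrated with a small sketch contrasting the two scoring rules for one invented response string under a standard 2PL model; the item parameters and responses below are hypothetical, not drawn from the ACT pool used in the paper.

```python
# Illustrative contrast between number-correct scoring and an IRT-based ability
# estimate under a standard 2PL model, using a crude grid-search MLE.
# Item parameters and responses are invented.
import numpy as np

a = np.array([1.2, 0.8, 1.5, 1.0, 0.9])    # 2PL discriminations (hypothetical)
b = np.array([-1.0, -0.3, 0.2, 0.8, 1.5])  # 2PL difficulties (hypothetical)
u = np.array([1, 1, 1, 0, 0])              # one examinee's item responses

number_correct = u.sum()

def loglik(theta):
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))          # 2PL response probabilities
    return np.sum(u * np.log(p) + (1 - u) * np.log(1 - p))

grid = np.linspace(-4, 4, 801)
theta_hat = grid[np.argmax([loglik(t) for t in grid])]

print(f"number-correct score:    {number_correct}")
print(f"2PL ML ability estimate: {theta_hat:+.2f}")
# Two examinees with the same number-correct score can receive different theta
# estimates if they answered items of different difficulty/discrimination --
# one reason scores can shift when the scoring rule changes from P&P to CAT.
```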

19.
An achievement test score can be viewed as a joint function of skill and will, of knowledge and motivation. However, when interpreting and using test scores, the ‘will’ part is not always acknowledged and scores are mostly interpreted and used as pure measures of student knowledge. This paper argues that students’ motivation to do their best on the assessment – their test‐taking motivation – is important to consider from an assessment validity perspective. This is true not least in assessment contexts where the assessment outcome has no consequences for the test‐taker. The paper further argues that the quality of assessment of test‐taking motivation also needs attention. Theoretical and methodological issues related to the assessment of test‐taking motivation are presented from a validity perspective, and findings from empirical studies on the relation between test stakes, test‐taking motivation and test performance are reviewed.

20.
The evolving specification for a series of vertically equated overlapping Key Stage 3 national tests in science in England and Wales poses a number of test development challenges. These include the need to relate standards defined by hierarchically organised ‘level’ criteria to cut‐scores based on total test scores; and the need to allow compensation across the boundaries of sets of items targeted at different levels. A criterion‐related model for test development is described which is governed by a pattern of expectations about the performance of pupils relating to the hierarchical level criteria and builds determination of cut‐scores into the test development process. Some other relevant approaches to standard setting are also discussed.
