Similar literature
1.
Validity is the most fundamental consideration in test development. Understandably, much time, effort, and money are spent in its pursuit. Central to the modern conception of validity are the interpretations made, and uses planned, on the basis of test scores. Unfortunately, there is evidence that test users have difficulty understanding scores as intended. That is, although the proposed interpretations and uses of test scores might be theoretically valid, they might never be realized because the meaning of the message is lost in translation. This gives pause. It is almost absurd to think that the intended interpretations and uses of test scores might fail because they are not aligned with the actual interpretations made and uses enacted by the audience. Despite this, there have only recently been contributions to the literature regarding the interpretability of score reports, the mechanisms by which scores are communicated to their audience, and their relevance to validity. These contributions have focused upon linking, through evidence, the intended interpretation and use with the actual interpretations being made and actions being planned by score users. This article reviews the current conception of validity, validation, and validity evidence with the goal of positioning the emerging notion of validity of usage within the current paradigm.

2.
A misconception exists that validity may refer only to the interpretation of test scores and not to the uses of those scores. The development and evolution of validity theory illustrate that test score interpretation was a primary focus in the earliest days of modern testing, and that validating interpretations derived from test scores remains essential today. However, test scores are not interpreted and then ignored; rather, their interpretations lead to actions. Thus, a modern definition of validity needs to describe the validation of test score interpretations as a necessary, but insufficient, step en route to validating the uses of test scores for their intended purposes. To ignore test use in defining validity is tantamount to defining validity for ‘useless’ tests. The current definition of validity stipulated in the 2014 version of the Standards for Educational and Psychological Testing properly describes validity in terms of both interpretations and uses, and provides a sufficient starting point for validation.

3.
Advances in validity theory and alacrity in validation practice have suffered because the term validity has been used to refer to two incompatible concerns: (1) the degree of support for specified interpretations of test scores (i.e. intended score meaning) and (2) the degree of support for specified applications (i.e. intended test uses). This article provides a brief summary of current validity theory, explication of a critical flaw in the current conceptualisation of validity, and a framework that both accommodates and differentiates validation of test score inferences and justification of test use.

4.
The AERA, APA, NCME Standards define validity as ‘the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests’. A century of disagreement about validity does not mean that there has not been substantial progress. This consensus definition brings together interpretations and use so that they form one idea, not a sequence of steps. Just as test design is framed by a particular context of use, so too must validation research focus on the adequacy of tests for specific purposes. The consensus definition also carries forward major reforms in validity theory begun in the 1970s that rejected separate types of validity evidence for different types of tests, e.g. content validity for achievement tests and predictive correlations for employment tests. When the current definition refers to both ‘evidence and theory’, the Standards are requiring not just that a test be well designed based on theory but that evidence be collected to verify that the test device is working as intended. Having taught policy-makers, citizens, and the courts to use the word validity, especially in high-stakes applications, we cannot after the fact substitute a more limited, technical definition of validity. An official definition provides clarity even for those who disagree, because it serves as a touchstone and obliges them to acknowledge when they are departing from it.

5.
Despite the ease of accessing a wide range of measures, little attention is given to validity arguments when considering whether to use a measure for a new purpose or in a different context. Making a validity argument has historically focused on the intended interpretation and use. There has been a push to consider both the intended and actual interpretations, and how users make sense of the data, when constructing validity arguments, but the practice is not widespread. This paper contributes to existing research on validity by highlighting the value of attending to the actual interpretation and use of a measure aimed at supporting instructional improvement in mathematics. We describe the use of the same measure across two contexts to highlight the importance of attending to characteristics of both users and the contexts in which the measures are used when assessing the validity of inferences for the purpose of instructional improvement efforts.

6.
Applied Measurement in Education (《教育实用测度》), 2013, 26(3): 185-207
With increasing interest in educational accountability, test results are now expected to meet a diverse set of informational needs. But a norm-referenced test (NRT) cannot be expected to meet the simultaneous demands for both norm-referenced and curriculum-specific information. One possible solution, which is the focus of this article, is to customize the NRT. Customized tests may take several forms. They may (a) add a few curriculum-specific items to the end of the NRT, (b) substitute locally constructed items for a few NRT items, (c) substitute a curriculum-specific test (CST) for the NRT, or (d) use equating methods to obtain predicted NRT scores from the CST scores. In this article, we describe the four main approaches to customized testing, address the validity of the uses and interpretations of customized test scores obtained from the four main approaches, and offer recommendations regarding the use of customized tests and the need for further research. Results indicate that customized testing can yield both valid normative and curriculum-specific information when special conditions exist. But there are also many threats to the validity of normative interpretations. Cautious application of customized testing is needed in order to avoid misleading inferences about student achievement.

7.
Assessment Validation in the Context of High-Stakes Assessment (cited 1 time in total: 0 self-citations, 1 citation by others)
Including the perspectives of stakeholder groups (e.g., teachers, parents) can improve the validity of high-stakes assessment interpretations and uses. How stakeholder groups view high-stakes assessments and their uses may differ significantly from how state-level policy officials view them. The views of these stakeholders can contribute to identifying the strengths and weaknesses of the intended assessment interpretations and uses. This article proposes a process approach to validity that addresses assessment validation in the context of high-stakes assessment. The process approach includes a test evaluator or validator who considers the perspectives of five stakeholder groups at four different stages of assessment maturity in relationship to six aspects of construct validity. The tasks of the test evaluator and how stakeholders' views might be incorporated are illustrated at each stage of assessment maturity. How the test evaluator might make judgments about the merit of high-stakes assessment interpretations and uses is discussed.

8.
This article reviews the intended uses of these college‐ and career‐readiness assessments with the goal of articulating an appropriate validity argument to support such uses. These assessments differ fundamentally from today's state assessments employed for state accountability. Current assessments are used to determine if students have mastered the knowledge and skills articulated in state standards; content standards, performance levels, and student impact often differ across states. College‐ and career‐readiness assessments will be used to determine if students are prepared to succeed in postsecondary education. Do students have a high probability of academic success in college or career‐training programs? As with admissions, placement, and selection tests, the primary interpretations that will be made from test scores concern future performance. Statistical evidence of the relationship between test scores and performance in postsecondary education will become an important form of evidence. A validation argument should first define the construct (college and career readiness) and then define appropriate criterion measures. This article reviews alternative definitions and measures of college and career readiness and contrasts traditional standard‐setting methods with empirically based approaches to support a validation argument.

9.
Evaluating the multiple characteristics of alignment has taken a prominent role in educational assessment and accountability systems given its attention in the No Child Left Behind legislation (NCLB). Leading to this rise in popularity, alignment methodologies that examined relationships among curriculum, academic content standards, instruction, and assessments were proposed as strategies to evaluate evidence of the intended uses and interpretations of test scores. In this article, we propose a framework for evaluating alignment studies based on similar concepts that have been recommended for standard setting (Kane). This framework provides guidance to practitioners about how to identify sources of validity evidence for an alignment study and make judgments about the strength of the evidence that may impact the interpretation of the results.

10.
Validity is a central principle of assessment relating to the appropriateness of the uses and interpretations of test results. Usually, one of the inferences that we wish to make is that the score reflects the extent of a student’s learning in a given domain. Thus, it is important to establish that the assessment tasks elicit performances that reflect the intended constructs. This research explored the use of three methods for evaluating whether there are threats to validity in relation to the constructs elicited in international A level geography examinations: (a) Rasch analysis; (b) analysis of processes expected and apparent when students answer questions; and (c) qualitative analysis of responses to items identified as potentially problematic. The results provided strong evidence to support validity with regard to the elicitation of constructs although one question part was identified as a threat to validity. Strengths and weaknesses of the methods can be identified.

11.
Speededness refers to the situation where the time limits on a standardized test do not allow substantial numbers of examinees to fully consider all test items. When tests are not intended to measure speed of responding, speededness introduces a severe threat to the validity of interpretations based on test scores. In this article, we describe test speededness, its potential threats to validity, and traditional and modern methods that can be used to assess the presence of speededness. We argue that more attention must be paid to this issue and that more research must be done to set appropriate time limits on power tests so that speed of responding does not interfere with the construct measured.
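
As an illustration of the kind of traditional check this abstract alludes to, the following is a minimal sketch that summarizes how many examinees reach the end of a test. It is not taken from the article. The response coding (`None` for an unanswered item), the function names, and the default thresholds (80% of examinees completing the test, all examinees reaching 75% of the items) are assumptions based on a commonly cited rule of thumb, and they should be adjusted to the testing program at hand.

```python
def items_reached(responses):
    """Return the position of the last answered item (None means no response)."""
    for i in range(len(responses) - 1, -1, -1):
        if responses[i] is not None:
            return i + 1
    return 0


def speededness_summary(response_matrix, complete_threshold=0.80,
                        reach_fraction=0.75, reach_threshold=1.0):
    """Summarize completion rates and flag possible speededness.

    All thresholds are hypothetical defaults, not values from the abstract.
    """
    n_items = len(response_matrix[0])
    n_examinees = len(response_matrix)
    reached = [items_reached(r) for r in response_matrix]
    # Proportion of examinees who reached the final item.
    prop_complete = sum(1 for k in reached if k == n_items) / n_examinees
    # Proportion of examinees who reached at least reach_fraction of the items.
    prop_reach = sum(1 for k in reached if k >= reach_fraction * n_items) / n_examinees
    flagged = prop_complete < complete_threshold or prop_reach < reach_threshold
    return {"prop_complete": prop_complete,
            "prop_reach": prop_reach,
            "possibly_speeded": flagged}


# Hypothetical scored responses: 1 = correct, 0 = incorrect, None = no response.
responses = [
    [1, 0, 1, 1],
    [1, 1, 0, None],
    [0, 1, None, None],
    [1, 1, 1, 0],
    [1, None, 1, 1],  # an omitted item in the middle still counts as reached
]
summary = speededness_summary(responses)
print(summary)
```

Counting the last answered item, instead of the number of answered items, keeps skipped items in the middle of the test from being mistaken for running out of time.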

12.
In the service of educational accountability, student achievement tests are being used to measure constructs quite unlike those envisioned by test developers. Scores are compared to cut points to create classifications like “proficient”; scores are combined over time to measure growth; student scores are aggregated to measure the effectiveness of teachers, schools, and school districts; indices are created to measure college and career readiness. These and other new uses rely on derived scores created to measure new constructs. The field of educational and psychological measurement has largely ignored these significant, consequential measurement applications. The conceptual frameworks and analytical tools of educational and psychological measurement should be used to study such derived scores and the validity of their uses and interpretations.

13.
This study examines an interactional view on teaching mathematics, whereby meaning is co-produced with the students through a process of negotiation. Further, teaching is viewed from a symbolic interactionism perspective, allowing the analysis to focus on the teacher’s role in the negotiation of meaning. Using methods inspired by grounded theory, patterns of teachers’ interaction are categorized. The results show how teachers’ actions, interpretations and intentions form interactional strategies that guide the negotiation of meaning in the classroom. The theoretical case of revoicing as a teacher action, together with interpretations of mathematical objects from probability theory, is used to exemplify conclusions from the proposed perspective. Data are generated from a lesson sequence with two teachers working with known and unknown constant sample spaces with their classes. In the lessons presented in this article, the focus is on negotiations of the meaning of chance. The analysis revealed how the teachers indicate their interpretations of mathematical objects and intentions to the students to different degrees and, by doing so, create opportunities for the students to ascribe meaning to these objects. The discussion contrasts the findings with possible interpretations from other perspectives on teaching.

14.
Score reports have one or more intended audiences: the people who use the reports to make decisions about test takers, including teachers, administrators, parents and test takers. Attention to audience when designing a score report supports assessment validity by increasing the likelihood that score users will interpret and use assessment results appropriately. Although most design guidelines focus on making score reports understandable to people who are not testing professionals, audiences should be defined by more than just their lack of statistical knowledge. This paper introduces an approach to identifying important audience characteristics for designing computer-based, interactive score reports. Through three examples, we demonstrate how an audience analysis suggests a design pattern, which guides the overall design of a report, as well as design details, such as data representations and scaffolding. We conclude with a research agenda for furthering the use of audience analysis in the design of interactive score reports.

15.
Cronbach made the point that for validity arguments to be convincing to diverse stakeholders, they need to be based on assumptions that are credible to these stakeholders. The interpretations and uses of high-stakes test scores rely on a number of policy assumptions about what should be taught in schools, and more specifically, about the content standards and performance standards that should be applied to students and schools. For example, a high-school graduation test can be developed as a measure of readiness for the world of work, for college, or for citizenship and the activities of daily life. The assumptions built into the assessment need to be subjected to scrutiny and criticism if a strong case is to be made for the validity of the proposed interpretation and use.

16.
Student examinees are key stakeholders in large-scale, high-stakes, public examination systems. How they perceive the purpose of testing, how they comprehend its technical characteristics, and how they interpret scores all influence their response to the system's demands and their preparation for the examinations; this information relates to intended and unintended consequences of testing and is a component of an expanded notion of test validity. The research reported in this paper investigates examinees’ perceptions about the secondary school graduation and university-entrance national exams in Cyprus. Interviews with recent examinees reveal the versatility and complexity of their perceptions about the fairness and appropriateness of the system, which are influenced by design features of the exams and by the local context. There are important, mostly unintended, consequences for their in- and out-of-school experience, for school curricula and for instructional practices. Empirical evidence about consequential aspects of examinations contributes to the validity argument needed to support such programmes.

17.
Current thinking on validity suggests that educational institutions and individuals should evaluate their uses of test scores in the context of their fundamental goals. Regression coefficients and other traditional criterion-related validity statistics provide relevant information, but often do not, by themselves, address the fundamental reasons for using test scores. Formal decision theory models provide a logically rigorous way to do this, but they are difficult to implement in practice. This article considers a simplification of formal decision theory models, in which one estimates the proportion of examinees for whom positive outcomes result from a use of test scores. For uses involving selection, the proportion of examinees with positive outcomes can be calculated by applying traditional regression coefficients to the marginal distribution of scores in the unselected population. The incremental usefulness of using a particular variable can be judged by comparing its proportion to that associated with no selection and to that associated with using another variable, either alone or jointly. Examples, related to college admission and retention, are given to illustrate these ideas.
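
The calculation outlined in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the article's procedure: it assumes a logistic model for the probability of a positive outcome given a score (the abstract does not specify the model form), treats "positive outcome" as success among selected examinees, and uses a hypothetical score distribution, hypothetical coefficients, and a hypothetical cut score.

```python
import math


def p_success(score, intercept, slope):
    """Probability of a positive outcome at a score (assumed logistic model)."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * score)))


def success_rate(score_dist, intercept, slope, cut=None):
    """Expected proportion of positive outcomes among selected examinees.

    score_dist maps each score to its proportion in the unselected population.
    cut=None means no selection, so every examinee is included.
    """
    selected = {s: p for s, p in score_dist.items() if cut is None or s >= cut}
    total = sum(selected.values())
    if total == 0:
        raise ValueError("no examinees meet the cut score")
    # Weight each score's success probability by its share of the selected group.
    return sum(p * p_success(s, intercept, slope) for s, p in selected.items()) / total


# Hypothetical marginal score distribution in the unselected population.
score_dist = {10: 0.1, 15: 0.2, 20: 0.4, 25: 0.2, 30: 0.1}
INTERCEPT, SLOPE = -4.0, 0.2  # hypothetical regression coefficients

baseline = success_rate(score_dist, INTERCEPT, SLOPE)          # no selection
with_cut = success_rate(score_dist, INTERCEPT, SLOPE, cut=20)  # select scores >= 20
print(f"no selection: {baseline:.3f}, cut at 20: {with_cut:.3f}")
```

Comparing `with_cut` against `baseline` mirrors the comparison with "no selection" described in the abstract; the same function could be called with coefficients for a second predictor to compare variables.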

18.
States use standards‐based English language proficiency (ELP) assessments to inform relatively high‐stakes decisions for English learner (EL) students. Results from these assessments are one of the primary criteria used to determine EL students’ level of ELP and readiness for reclassification. The results are also used to evaluate the effectiveness of and funding allocation to district or school programs that serve EL students. In an effort to provide empirical validity evidence for such important uses of ELP assessments, this study focused on examining the constructs of ELP assessments as a fundamental validity issue. Particularly, the study examined the types of language proficiency measured in three sample states’ ELP assessments and the relationship between each type of language proficiency and content assessment performance. The results revealed notable variation in the presence of academic and social language in the three ELP assessments. A series of hierarchical linear modeling (HLM) analyses also revealed varied relationships among social language proficiency, academic language proficiency, and content assessment performance. The findings highlight the importance of examining the constructs of ELP assessments for making appropriate interpretations and decisions based on the assessment scores for EL students. Implications for policy and practice are discussed.

19.
This paper argues for an expanded conception of test validity, in which teachers, as end-users of tests, contribute a distinctive perspective on validity, referred to as inferential validity. It also offers a methodology that could be adopted in order to subject this dimension of validity to scrutiny. An investigation conducted into the meanings constructed by teachers of a literacy test, the Emergent Literacy Baseline Assessment (ELBA), is reported to illustrate the methodology. In the first section of the paper, current conceptions of validity are discussed. It is argued that the validation process for tests should include the clarification and justification of the interpretations and uses of observed scores. This argument is illustrated from the methodology for investigating the validity of the ELBA. Self-assessment questionnaires and focus-group interviews provided data on teachers' views about the validity of the ELBA. Arguments in favour of investigating the validity of large-scale tests by taking into account teachers' perspectives are provided.

20.
Assessment data must be valid for the purpose for which educators use them. Establishing evidence of validity is an ongoing process that must be shared by test developers and test users. This study examined the predictive validity and the diagnostic accuracy of universal screening measures in reading. Scores on three different universal screening tools were compared for nearly 500 second‐ and third‐grade students attending four public schools in a large urban district. Hierarchical regression and receiver operating characteristic curves were used to examine the criterion‐related validity and diagnostic accuracy of students’ oral reading fluency (ORF), Fountas and Pinnell Benchmark Assessment System (BAS) scores, and fall scores from the Measures of Academic Progress for reading (MAP). Results indicated that a combination of all three measures accounted for 65% of the variance in spring MAP scores, whereas a reduced model of ORF and MAP scores predicted 60%. ORF and BAS scores did not meet standards for diagnostic accuracy. Combining the measures improved diagnostic accuracy, depending on how criterion scores were calculated. Implications for practice and future research are discussed.
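
To make the diagnostic-accuracy terms in this abstract concrete, here is a minimal sketch of how sensitivity, specificity, and the area under a receiver operating characteristic curve might be computed for a single screening measure. The data, the cut score, and the function names are hypothetical and are not drawn from the study; the sketch also assumes that lower screening scores indicate higher risk.

```python
def sensitivity_specificity(scores, at_risk, cut):
    """Flag students scoring below `cut` and compare flags with true status."""
    pairs = list(zip(scores, at_risk))
    tp = sum(1 for s, r in pairs if s < cut and r)       # flagged, truly at risk
    fn = sum(1 for s, r in pairs if s >= cut and r)      # missed, truly at risk
    tn = sum(1 for s, r in pairs if s >= cut and not r)  # not flagged, not at risk
    fp = sum(1 for s, r in pairs if s < cut and not r)   # flagged, not at risk
    return tp / (tp + fn), tn / (tn + fp)


def auc(scores, at_risk):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen at-risk student scores
    lower than a randomly chosen not-at-risk student (ties count as half).
    """
    pos = [s for s, r in zip(scores, at_risk) if r]
    neg = [s for s, r in zip(scores, at_risk) if not r]
    wins = sum(1.0 if p < n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical fall screening scores and spring criterion status (True = at risk).
scores = [40, 55, 62, 70, 48, 90, 85, 66]
at_risk = [True, True, False, False, True, False, False, True]

sens, spec = sensitivity_specificity(scores, at_risk, cut=60)
area = auc(scores, at_risk)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}, AUC={area:.4f}")
```

The pairwise form of the AUC is quadratic in the number of students, which is fine for a sketch; a statistics library would normally be used for samples of the size reported in the abstract.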


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号