Similar Articles
 20 similar articles found (search time: 46 ms)
1.
Applied Measurement in Education, 2013, 26(2), 111-122
Teachers in three local school districts helped customize two standardized norm-referenced tests (NRTs). The primary purpose of the investigation was to consider the effects of deleting items from the NRTs and of adding locally constructed items to the NRTs. Specifically, the normative data, percentile ranks (PRs), that would be provided for these customized tests were of interest. The results indicated that the PRs provided for customized tests may be very different from those for the complete test. This was true regardless of how the customized test was constructed, whether it consisted only of items selected from the NRT or of locally constructed items added to those selected from the NRT. Unlike previous studies, however, these results provide little evidence to suggest that the PRs associated with the customized tests were systematically higher than the PRs associated with the full-length test.
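Percentile ranks are defined relative to the norm group's distribution of total scores, so deleting items from or adding items to a test changes the score scale against which an examinee is ranked. A minimal sketch of the conventional mid-percentile-rank computation (illustrative only; the norm sample and score below are hypothetical, and real NRT norms use smoothed national distributions):

```python
# Percentile rank (PR) of a raw score against a norm sample, using the
# conventional "mid-percentile" definition: percent of the norm group
# scoring below, plus half the percent scoring exactly at the score.
def percentile_rank(score, norm_scores):
    below = sum(1 for s in norm_scores if s < score)
    at = sum(1 for s in norm_scores if s == score)
    return 100.0 * (below + 0.5 * at) / len(norm_scores)

norms = [10, 12, 12, 15, 18, 20, 22, 25, 27, 30]  # hypothetical norm sample
print(percentile_rank(20, norms))  # → 55.0
```

Because the same raw-score change maps to a different PR once the item set (and hence the norm distribution) changes, PRs for a customized test need not match those of the full-length test.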

2.
Applied Measurement in Education, 2013, 26(1), 15-35
This study examines the effects of using item response theory (IRT) ability estimates based on customized tests that were formed by selecting specific content areas from a nationally standardized achievement test. Subsets of items were selected from four different subtests of the Iowa Tests of Basic Skills (Hieronymus, Hoover, & Lindquist, 1985) on the basis of (a) selected content areas (content-customized tests) and (b) a representative sampling of content areas (representative-customized tests). For three of the four tests examined, ability estimates and estimated national percentile ranks based on the content-customized tests in school samples tended to be systematically higher than those based on the full tests. The results of the study suggested that for certain populations, IRT ability estimates and corresponding normative scores on content-customized versions of standardized achievement tests cannot be expected to be equivalent to scores based on the full-length tests.

3.
This paper discusses how to maintain the integrity of national normative information for achievement tests when the test that is administered has been customized to satisfy local needs and is not a test that has been nationally normed. Using an Item Response Theory perspective, alternative procedures for item selection and calibration are examined with respect to their effect on the accuracy of normative information. It is emphasized that it is important to match the content of the customized test with that of the normed test if accurate normative data are desired.

4.
A misconception exists that validity may refer only to the interpretation of test scores and not to the uses of those scores. The development and evolution of validity theory illustrate that test score interpretation was a primary focus in the earliest days of modern testing, and that validating interpretations derived from test scores remains essential today. However, test scores are not interpreted and then ignored; rather, their interpretations lead to actions. Thus, a modern definition of validity needs to describe the validation of test score interpretations as a necessary, but insufficient, step en route to validating the uses of test scores for their intended purposes. To ignore test use in defining validity is tantamount to defining validity for 'useless' tests. The current definition of validity stipulated in the 2014 version of the Standards for Educational and Psychological Testing properly describes validity in terms of both interpretations and uses, and provides a sufficient starting point for validation.

5.
Recently there has been a great amount of research and professional educator interest in at-risk, academically low-attaining students, especially low socioeconomic status students at U.S. inner-city schools. A major factor that has been hypothesized in the research literature as being associated with poor academic attainment is the lack of critical and timely instructional feedback, or formative evaluation. Using a sample of 130 inner-city senior high school students, the perceived quality and quantity of formative evaluation received by these students at their elementary and secondary school levels were assessed. In addition, each student was given a mathematics (pre-algebra) assessment using both a one- and two-dimensional format (recognition plus confidence) to determine present levels of mathematics attainment. Finally, data were collected from the cumulative grade-level folders of a subset of these students, especially norm-referenced test (NRT) data in mathematics, to examine their relationship to scores on the Scholastic Aptitude Test-Quantitative portion. The study finds that, in addition to extremely poor mathematics attainment and poor formative evaluation practices, there is little association between SAT (quantitative) scores and the grade-level (mathematics) NRT scores. These findings suggest that parents cannot depend on traditional norm-referenced measures to indicate actual mathematics attainment as these students are progressing through the schools. These findings also challenge urban school administrative personnel to reassess the use of NRT measures to monitor student progress and to develop more comprehensive and systematic formative evaluation procedures and practices for individual students as they progress through each grade level.

6.
Powers, Slaughter, and Helmick (1983) recently analyzed selection, pretest, and posttest scores collected from large numbers of students in two cohorts. These analyses led them to conclude that the equipercentile assumption underlying norm-referenced evaluation methodology is "inappropriate." Re-examination of the data suggests that there is strong support for the validity of the equipercentile assumption in the selection and pretest scores they present. The observed "gains" from pre- to posttests are better attributed to stakeholder bias, posttests that match the curriculum content too closely, or a combination of these two factors than to inappropriateness of the equipercentile assumption. Annual testing where the posttest from one year also serves as the pretest for the next is suggested as a promising solution to both of the cited threats to the internal validity of norm-referenced evaluations.

7.
In criterion-referenced tests (CRTs), the traditional measures of reliability used in norm-referenced tests (NRTs) have often proved problematic because of NRT assumptions of one underlying ability or competency and of variance in the distribution of scores. CRTs, by contrast, are likely to be created when mastery of the skill or knowledge by all or most test takers is expected, and thus little variation in the scores is expected. A comprehensive CRT often measures a number of discrete tasks that may not represent a single unifying ability or competence. Hence, CRTs theoretically violate the two most essential assumptions of classic NRT reliability theory, and estimating their reliability has traditionally entailed the logistical problems of multiple test administrations to the same test takers. A review of the literature categorizes approaches to reliability for CRTs into two classes: estimates sensitive to all measures of error and estimates of consistency in test outcome. For a single test administration of a CRT, Livingston's k² is recommended for estimating all measures of error, and Sc is proposed for estimates of consistency in test outcome. Both approaches are compared using data from a CRT exam, and recommendations for interpretation and use are proposed.
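Livingston's coefficient adjusts the classical reliability estimate for the squared distance between the mean score and the cut score, which is why it stays informative for a CRT even when score variance is small: k² = (r·σ² + (μ - C)²) / (σ² + (μ - C)²). A minimal sketch with hypothetical values (the reliability, variance, mean, and cut score below are assumptions, not taken from the article):

```python
# Livingston's k^2 for a criterion-referenced test: classical reliability r
# adjusted for the squared distance between the mean score and the cut C.
# When the mean sits exactly at the cut, k^2 reduces to r; as the mean
# moves away from C, k^2 rises even if score variance is small.
def livingston_k2(reliability, variance, mean, cut):
    d2 = (mean - cut) ** 2
    return (reliability * variance + d2) / (variance + d2)

# Hypothetical CRT: modest classical reliability, mean well above the cut.
print(livingston_k2(0.70, 25.0, 85.0, 80.0))  # → 0.85
```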

8.
Violations of four selected principles of writing multiple choice items were introduced into an undergraduate political science examination. Three of the four poor practices had no overall effect on test difficulty. A significant (α = .05) interaction effect between the poor practices and course achievement occurred for one of the four practices, with the poorer students generally gaining most from the poorly written items. KR-20 values were significantly lower for sets of items with the same flaws than for "good" versions of the items in three of four comparisons. The reductions in reliability were equivalent to those expected to result from shortening the test by 13 to 56 percent. Concurrent validity (correlation of experimental test scores with final examination scores) was significantly lower in two of four cases. The reductions in validity were equivalent to those expected to result from shortening the test by 56 to 83 percent.
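The "equivalent shortening" comparisons above can be obtained by inverting the Spearman-Brown prophecy formula: given the reliability r of the intact test and the lower reliability r_new observed with flawed items, solve r_new = n·r / (1 + (n - 1)·r) for the length factor n. A hedged sketch with hypothetical reliabilities (the 0.90/0.80 pair is illustrative, not taken from the study):

```python
# Invert the Spearman-Brown prophecy formula: given full-test reliability r
# and a lower observed reliability r_new, find the length factor n (as a
# fraction of the original test length) producing the same drop:
#   r_new = n*r / (1 + (n-1)*r)  =>  n = r_new*(1-r) / (r*(1-r_new))
def equivalent_length_factor(r, r_new):
    return (r_new * (1.0 - r)) / (r * (1.0 - r_new))

r_full, r_flawed = 0.90, 0.80          # hypothetical KR-20 values
n = equivalent_length_factor(r_full, r_flawed)
print(f"equivalent shortening: {100 * (1 - n):.0f}%")  # → 56%
```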

9.
This study was conducted to determine if a norm-referenced test designed to assess instructional design competency could be statistically validated (i.e., confirmed statistically to discriminate between known masters and known nonmasters of instructional design). The test was composed of items written to assess verified competencies required of instructional design professionals. A total of 257 respondents participated in the study over the course of three stages: initial item bank construction, item analysis to determine those items with discrimination power, and the concurrent validity calculation, including determination of the mastery cut-off score. Mean scores of five groups of respondents were analyzed in the final stage. Statistically significant differences were found among the Professional Masters, Education Graduate Students and Undergraduates, Noneducation Graduate Students, and Noneducation Undergraduates. The article concludes with a discussion of the role of such an instrument in conducting research in the field.

10.
This article addresses validity and fairness in the testing of English language learners (ELLs)—students in the United States who are developing English as a second language. It discusses limitations of current approaches to examining the linguistic features of items and their effect on the performance of ELL students. The article submits that these limitations stem from the fact that current ELL testing practices are not effective in addressing three basic notions on the nature of language and the linguistic features of test items: (a) language is a probabilistic phenomenon, (b) the linguistic features of test items are multidimensional and interconnected, and (c) each test item has a unique set of linguistic affordances and constraints. Along with the limitations of current testing practices, for each notion, the article discusses evidence of the effectiveness of several probabilistic approaches to examining the linguistic features of test items in ELL testing.

11.
This paper investigates whether inferences about school performance based on longitudinal models are consistent when different assessments and metrics are used as the basis for analysis. Using norm-referenced (NRT) and standards-based (SBT) assessment results from panel data of a large heterogeneous school district, we examine inferences based on vertically equated scale scores, normal curve equivalents (NCEs), and nonvertically equated scale scores. The results indicate that the effect of the metric depends upon the evaluation objective. NCEs significantly underestimate absolute individual growth, but NCEs and scale scores yield highly correlated (r >.90) school-level results based on mean initial status and growth estimates. SBT and NRT results are highly correlated for status but only moderately correlated for growth. We also find that as few as 30 students per school provide consistent results and that mobility tends to affect inferences based on status but not growth – irrespective of the assessment or metric used.
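NCEs are percentile ranks re-expressed on an equal-interval normal scale, which is one reason they can understate absolute growth measured in scale-score units. A small sketch of the standard PR-to-NCE conversion, NCE = 50 + 21.06·z, where z is the normal deviate for the percentile (the 21.06 factor makes NCEs of 1, 50, and 99 coincide with PRs of 1, 50, and 99):

```python
from statistics import NormalDist

# Convert a percentile rank to a normal curve equivalent (NCE):
# NCE = 50 + 21.06 * z, with z the inverse-normal deviate of PR/100.
def pr_to_nce(pr):
    z = NormalDist().inv_cdf(pr / 100.0)
    return 50.0 + 21.06 * z

print(round(pr_to_nce(50), 1))  # → 50.0
print(round(pr_to_nce(75), 1))
```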

12.
Regular use of questions previously made available to the public (i.e., disclosed items) may provide one way to meet the requirement for large numbers of questions in a continuous testing environment, that is, an environment in which testing is offered at the test taker's convenience throughout the year rather than on a few prespecified test dates. First, it must be shown that such use has effects on test scores small enough to be acceptable. In this study, simulations are used to explore the use of disclosed items under a worst-case scenario in which disclosed items are always answered correctly. Some item pool and test designs were identified in which the use of disclosed items produces effects on test scores that may be viewed as negligible.

13.
Examinations are an important means of checking the effectiveness of teaching and learning; an item bank is the foundation of a test paper, and test-paper analysis is a method for checking the soundness of a paper and analyzing examination results in detail. When building an item bank and drawing items from it, one should follow principles such as no duplication, no omission, balanced allocation of score points, and variety of item types; item selection should control for item type and chapter coverage; and the three kinds of charts used in test-paper analysis are of great help in understanding students and improving instruction.

14.
Construct-Irrelevant Variance in High-Stakes Testing
There are many threats to validity in high-stakes achievement testing. One major threat is construct-irrelevant variance (CIV). This article defines CIV in the context of the contemporary, unitary view of validity and presents logical arguments, hypotheses, and documentation for a variety of CIV sources that commonly threaten interpretations of test scores. A more thorough study of CIV is recommended.

15.
This is the first article to explore the structural similarity between the original Stanford Achievement Test (Tenth Edition) reading test and its customized version. Analyses were conducted across grades on multiple observed variables (individual items, testlets, and item parcels), using linear and nonlinear exploratory and confirmatory factor analysis. The results indicate testlet effects of varying degrees for items within every passage. Across all models, those using individual items as observed variables fit worst, those using testlets as observed variables fit better, and those using item parcels as observed variables fit best. Of the three levels of structural equivalence (congeneric, tau-equivalent, and parallel), the structures of the original Stanford reading test and its customized version were congeneric.

16.
Evaluation is an inherent part of education for an increasingly diverse student population. Confidence in one's test-taking skills, and the associated testing environment, needs to be examined from a perspective that combines the concept of Bandurian self-efficacy with the concept of stereotype threat reactions in a diverse student sample. Factors underlying testing reactions and performance on a cognitive ability test in four different testing conditions (high or low stereotype threat and high or low test face validity) were examined in this exploratory study. The stereotype threat manipulation seemed to lower African-American and Hispanic participants' test scores. However, the hypothesis that there would be an interaction with face validity was only partially supported. Participants' highest scores resulted from low stereotype threat and high face validity, as predicted. However, the lowest scores were not in the high stereotype threat/low face validity condition as expected. Instead, most groups tended to score lower when the test was perceived to be more face valid. The stereotype threat manipulation affected Whites as well as non-Whites, although differently. Specifically, high stereotype threat increased Whites' cognitive ability test scores in the low face validity condition, but decreased them in the high face validity condition. Implications for testing and classroom environment design are discussed.

17.
A central concern surrounding test-based accountability is that teachers may narrow their teaching practices to improve performance on a specific curriculum-based knowledge test rather than to improve student learning more broadly. Two of the most common teaching practices that "teach to the test" are providing test-specific classwork and increasing the frequency with which students take practice tests. Whether such teaching practices improve student learning—both in terms of learning the content associated with a specific knowledge test and in terms of more general learning—is a largely unanswered question. To approach this question, this paper uses a student fixed effects approach to analyze the impact of these kinds of narrow teaching practices on student performance on a specific test as well as a general knowledge test. We find that test-specific classwork and practice tests with specific test items tend to have little or negative impact on curriculum-specific or general knowledge test performance, except for male students, and that subject practice tests (without emphasizing test-specific items) have positive effects on student outcomes on both kinds of tests, but larger on the curriculum-specific than on the general test, and much larger on the curriculum-specific test for male students. We discuss the logic for these results and what they tell us about the effectiveness of test-focused teaching practices more generally.

18.
Despite their significant contributions to research on self-regulated learning, those favoring online and trace approaches have questioned the use of self-report to assess learners' use of learning strategies. An important rejoinder to such criticisms consists of examining the validity of self-report items. The present study was designed to assess the validity of items to assess 9th-grade students' use of planning, monitoring, and regulation when studying math. To establish response process evidence of construct validity, cognitive interviews were coded to determine whether students' interpretations of the items were consistent with their intended meaning and whether their response choices were congruent with those interpretations. Evidence supported the construct validity of monitoring and regulation items, but to a lesser degree those designed to assess planning. We discuss implications of the evidence for the self-report assessment of learners' use of metacognitive strategies.

19.
Applied Measurement in Education, 2013, 26(4), 413-432
With the increasing use of automated scoring systems in high-stakes testing, it has become essential that test developers assess the validity of the inferences based on scores produced by these systems. In this article, we attempt to place the issues associated with computer-automated scoring within the context of current validity theory. Although it is assumed that the criteria appropriate for evaluating the validity of score interpretations are the same for tests using automated scoring procedures as for other assessments, different aspects of the validity argument may require emphasis as a function of the scoring procedure. We begin the article with a taxonomy of automated scoring procedures. The presentation of this taxonomy provides a framework for discussing threats to validity that may take on increased importance for specific approaches to automated scoring. We then present a general discussion of the process by which test-based inferences are validated, followed by a discussion of the special issues that must be considered when scoring is done by computer.

20.
Speededness refers to the situation where the time limits on a standardized test do not allow substantial numbers of examinees to fully consider all test items. When tests are not intended to measure speed of responding, speededness introduces a severe threat to the validity of interpretations based on test scores. In this article, we describe test speededness, its potential threats to validity, and traditional and modern methods that can be used to assess the presence of speededness. We argue that more attention must be paid to this issue and that more research must be done to set appropriate time limits on power tests so that speed of responding does not interfere with the construct measured.
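One traditional way to assess speededness is to treat trailing omits as "not reached" and examine the proportion of examinees who reach each item. A minimal sketch under stated assumptions: the 4x5 response matrix is hypothetical, and treating every blank after an examinee's last answer as "not reached" is a simplifying convention, not the article's method.

```python
# Per-item "reached" proportions: an examinee is assumed to have reached
# every item up to and including their last answered item; all trailing
# blanks (None) are counted as not reached.
def reached_proportions(responses):
    n = len(responses)
    n_items = len(responses[0])
    reached = [0] * n_items
    for resp in responses:
        # index of the last answered item; items after it were not reached
        last = max((i for i, r in enumerate(resp) if r is not None), default=-1)
        for i in range(last + 1):
            reached[i] += 1
    return [count / n for count in reached]

# Hypothetical response matrix: 1/0 = scored answer, None = blank.
data = [
    [1, 0, 1, 1, None],     # did not reach the last item
    [1, 1, 0, None, None],  # stopped two items early
    [0, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
]
print(reached_proportions(data))  # → [1.0, 1.0, 1.0, 0.75, 0.5]
```

A sharp drop in these proportions toward the end of the test is the kind of pattern traditional speededness indices are designed to flag.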


Copyright©北京勤云科技发展有限公司  京ICP备09084417号