Similar Documents

20 similar documents retrieved.
1.
Test scores matter these days. Test takers want to understand how they performed, and test score reports, particularly those for individual examinees, are the vehicles by which most people get the bulk of this information. Historically, score reports have not always met examinees' information or usability needs, but this is clearly changing for the better, thanks to recent, much-needed additions to the psychometric literature as well as improved reporting practices. This paper provides an overview of score reports from a development perspective, focusing on current practices and emerging efforts in the content of reports as well as the process by which reports are designed, evaluated, and ultimately used to communicate with the public.

2.
A College Board-sponsored survey of a nationally representative sample of 1995–96 SAT takers yielded a database for more than 4,000 examinees, about 500 of whom had attended formal coaching programs outside their schools. Several alternative analytical methods were used to estimate the effects of coaching on SAT I: Reasoning Test scores. The various analyses produced slightly different estimates. All of the estimates, however, suggested that the effects of coaching are far less than is claimed by major commercial test preparation companies. The revised SAT does not appear to be any more coachable than its predecessor.

3.
Test preparation activities were determined for a large representative sample of Graduate Record Examination (GRE) Aptitude Test takers. About 3% of these examinees had attended formal coaching programs for one or more sections of the test.
After adjusting for differences in the background characteristics of coached and uncoached students, effects on test scores were related to the length and the type of programs offered. The effects on GRE verbal ability scores were not significantly related to the amount of coaching examinees received, and quantitative coaching effects increased slightly but not significantly with additional coaching. Effects on analytical ability scores, on the other hand, were related significantly to the length of coaching programs, through improved performance on two analytical item types, which have since been deleted from the test.
Overall, the data suggest that, when compared with the two highly susceptible item types that have been removed from the GRE Aptitude Test, the test item types in the current version of the test (now called the GRE General Test) appear to show relatively little susceptibility to formal coaching experiences of the kinds considered here.

4.

To be successful in a high-stakes testing situation is desirable for any test taker. It has been found that, besides content knowledge, test-taking behavior, such as risk-taking strategies, motivation, and test anxiety, is important for test performance. The purposes of the present study were to identify and group test takers with similar patterns of test-taking behavior and to explore how these groups differ in terms of background characteristics and test performance in a high-stakes achievement test context. A sample of Swedish Scholastic Assessment Test takers (N = 1,891) completed a questionnaire measuring their motivation, test anxiety, and risk-taking behavior during the test, as well as background characteristics. A two-step cluster analysis revealed three clusters of test takers with significantly different test-taking behavior profiles: a moderate profile (n = 741), a calm risk-taker profile (n = 637), and a test-anxious, risk-averse profile (n = 513). Group difference analyses showed that the calm risk-taker profile (i.e., a high degree of risk-taking together with relatively low levels of test anxiety and motivation during the test) was the most successful from a test performance perspective, while the test-anxious, risk-averse profile (i.e., a low degree of risk-taking together with high levels of test anxiety and motivation) was the least successful. Informing prospective test takers of these insights can potentially lead to more valid interpretations and inferences based on the test scores.

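The clustering step above uses SPSS's two-step procedure; as a rough stand-in, the sketch below groups test takers on standardized questionnaire scales with k-means (the file name, column names, and the choice of k-means are illustrative assumptions, not the study's exact method):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Group test takers by test-taking behavior. K-means stands in for
# SPSS's two-step cluster procedure; all names below are hypothetical.
df = pd.read_csv("swesat_questionnaire.csv")
features = ["motivation", "test_anxiety", "risk_taking"]

X = StandardScaler().fit_transform(df[features])
df["profile"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Compare test performance across the behavior profiles
print(df.groupby("profile")["test_score"].agg(["size", "mean", "std"]))
```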

5.
Abstract

Educational stakeholders have long known that students might not be fully engaged when taking an achievement test and that such disengagement can undermine the inferences drawn from observed scores. Thanks to the growing prevalence of computer-based tests and the new forms of metadata they produce, researchers have developed and validated procedures for using item response times to identify responses that are likely disengaged. In this study, we examine the impact of two techniques for accounting for test disengagement, (a) removing disengaged test takers from the sample and (b) adjusting test scores by removing rapidly guessed items, on estimates of school contributions to student growth, achievement gaps, and summer learning loss. Our results indicate that removing disengaged examinees from the sample will likely bias the estimates, although on the whole accounting for disengagement had minimal impact on the metrics we examined. Lastly, we provide guidance for policy makers and evaluators on how to account for disengagement in their own work, and we consider the promise and limitations of using achievement test metadata for related purposes.
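As a minimal sketch of the two adjustments, assuming an item-level response-time matrix and a fixed 3-second rapid-guess threshold (the threshold and cutoffs are illustrative; the study's operational choices may differ):

```python
import numpy as np

# rt: response times in seconds; correct: scored responses (0/1);
# both of shape (n_examinees, n_items). The fixed threshold is an
# illustrative assumption; item-specific thresholds are common.
def rapid(rt, threshold=3.0):
    return rt < threshold  # True where a response looks rapidly guessed

def remove_disengaged(correct, rt, max_rapid_prop=0.10):
    """(a) Drop examinees whose rapid-guess proportion exceeds a cutoff."""
    keep = rapid(rt).mean(axis=1) <= max_rapid_prop
    return correct[keep].mean(axis=1)  # proportion correct, retained examinees

def adjust_scores(correct, rt):
    """(b) Rescore each examinee over their non-rapid responses only."""
    engaged = ~rapid(rt)
    n = engaged.sum(axis=1)
    return np.where(n > 0, (correct * engaged).sum(axis=1) / np.maximum(n, 1), np.nan)
```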

6.
In an article in the Winter 2011 issue of the Journal of Educational Measurement, van der Linden, Jeon, and Ferrara suggested that “test takers should trust their initial instincts and retain their initial responses when they have the opportunity to review test items.” They presented a complex IRT model that appeared to show that students would be worse off by changing answers. As noted in a subsequent erratum, this conclusion was based on flawed data, and the corrected data could not be analyzed with their method because the model failed to converge. This left their basic question on the value of answer changing unanswered. A much more direct approach is to simply count the number of examinees whose scores after an opportunity to change answers are higher, lower, or the same as their initial scores. Using the same data set as the original article, an overwhelming majority of the students received higher scores after the opportunity to change answers.
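The tally the authors advocate takes only a few lines; a sketch with illustrative score arrays (one entry per examinee, not the study's data):

```python
import numpy as np

# Compare each examinee's score before and after the review opportunity.
initial = np.array([24, 30, 27, 31, 22])  # illustrative initial scores
final = np.array([26, 30, 28, 30, 25])    # illustrative post-review scores

print("higher:   ", int((final > initial).sum()))
print("lower:    ", int((final < initial).sum()))
print("unchanged:", int((final == initial).sum()))
```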

7.
Every year, thousands of college and university applicants with learning disabilities (LD) present scores from standardized examinations as part of the admissions process for postsecondary education. Many of these scores are from tests administered with nonstandard procedures due to the examinees' learning disabilities. Using a sample of college students with LD and a control sample, this study investigated the criterion validity and comparability of scores on the Miller Analogies Test when accommodations for the examinees with LD were in place. Scores for examinees with LD from test administrations with accommodations were similar to those of examinees without LD on standard administrations, but less well associated with grade point averages. The results of this study provide evidence that although scores for examinees with LD from nonstandard test administrations are comparable to scores for examinees without LD, they have less criterion validity and are less meaningful for their intended purpose.

8.
In order to determine the role of time limits on both test performance and test validity, we asked approximately 300 volunteers (prospective graduate students) to write two essays each: one in a 40-minute time period and the other in 60 minutes. Analyses revealed that, on average, test performance was significantly better when examinees were given 60 minutes instead of 40. However, there was no interaction between test-taking style (fast vs. slow) and time limits. That is, examinees who described themselves as slow writers/test takers did not benefit any more (or any less) from generous time limits than did their quicker counterparts. In addition, there was no detectable effect of different time limits on the meaning of essay scores, as suggested by their relationship to several nontest indicators of writing ability.

9.
In May 1990, new groups of examinees participated in the Swedish Scholastic Aptitude Test (SweSAT). Generally, these new groups were younger and more highly educated than the examinees at earlier test administrations. The purpose of the study reported here was to examine whether gender differences in test results had changed with the changed composition of examinees. The groups of men and women were successively matched according to age and education, and gender differences in test results were compared across age and education groups. The results showed that even though both age and education influenced the test results, no real difference was found between younger and older examinees with regard to gender differences in the test results.

10.
The rise of computer-based testing has brought with it the capability to measure more aspects of a test event than simply the answers selected or constructed by the test taker. One behavior that has drawn much research interest is the time test takers spend responding to individual multiple-choice items. In particular, very short response times (termed rapid guessing) have been shown to indicate disengaged test taking, regardless of whether they occur in high-stakes or low-stakes testing contexts. This article examines rapid-guessing behavior: its theoretical conceptualization and underlying assumptions, methods for identifying it, misconceptions regarding its dynamics, and the contextual requirements for its proper interpretation. It is argued that because it does not reflect what a test taker knows and can do, a rapid guess on an item represents a choice by the test taker to momentarily opt out of being measured. As a result, rapid guessing tends to negatively distort scores and thereby diminish validity. Therefore, because rapid guesses do not contribute to measurement, it makes little sense to include them in scoring.
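One widely used summary in this literature is the proportion of an examinee's responses exceeding a rapid-guess threshold; a sketch under assumed thresholds (10% of each item's median time, floored at two seconds; neither value is taken from this article):

```python
import numpy as np

def response_time_effort(rt):
    """rt: (n_examinees, n_items) response times in seconds.
    Returns the proportion of each examinee's responses exceeding a
    per-item rapid-guess threshold (assumed here: 10% of the item's
    median time, with a 2-second floor)."""
    thresholds = np.maximum(0.10 * np.median(rt, axis=0), 2.0)
    return (rt >= thresholds).mean(axis=1)

# Examinees with very low values are candidates for disengagement flags.
```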

11.
Martin. Assessing Writing, 2009, 14(2): 88–115.
The demand for valid and reliable methods of assessing second and foreign language writing has grown in significance in recent years. One such method is the timed writing test, which has a central place in many testing contexts internationally. The reliability of this test method is heavily influenced by the scoring procedures, including the rating scale to be used and the success with which raters can apply the scale. Reliability is crucial because important decisions and inferences about test takers are often made on the basis of test scores. Determining the reliability of the scoring procedure frequently involves examining the consistency with which raters assign scores. This article presents an analysis of the rating of two sets of timed tests written by intermediate-level learners of German as a foreign language (n = 47) by two independent raters who used a newly developed, detailed scoring rubric containing several categories. The article discusses how the rubric was developed to reflect a particular construct of writing proficiency. Implications for the reliability of the scoring procedure are explored, and considerations for more extensive cross-language research are discussed.
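Rater consistency of the kind examined here is commonly summarized with a correlation and a chance-corrected agreement index; a sketch with illustrative ratings (not the study's data) on a single rubric category scored 0–5:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

# Two raters scoring the same ten scripts on one rubric category.
rater1 = np.array([3, 4, 2, 5, 3, 4, 1, 3, 4, 2])
rater2 = np.array([3, 4, 3, 5, 2, 4, 1, 3, 5, 2])

r, _ = pearsonr(rater1, rater2)  # linear association between raters
qwk = cohen_kappa_score(rater1, rater2, weights="quadratic")  # chance-corrected agreement
print(f"Pearson r = {r:.2f}, weighted kappa = {qwk:.2f}")
```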

12.
Test performance is a function both of the test taker's personal attributes and of the test method facets. However, much of the previous research has addressed the covariates of pupils' assessment preferences rather than those of their actual performance. Following the microsystem perspective, and as part of a larger project, this study set out to detect the learner factors and linguistic parameters that mediate performance on different test formats. A number of language learners responded to the Group Embedded Figures Test, the Willingness to Communicate Scale, the Michigan proficiency test, and a reading comprehension test battery. Based on previous empirical research, a hypothetical model was designed and tested using structural equation modeling. The findings were as follows: (a) performance on controlled and constructed-response tests is substantially mediated by test takers' characteristics (cognitive source); (b) target ability (linguistic source) is the most significant determinant of performance on free-response tasks.

13.
Test validity has been the predominant paradigm underlying much of the research into test bias. If we are interested in identifying causes of bias and determining its effects on test scores, validity may not be the best paradigm. In this article, a simple theoretical framework is presented in which bias can be seen to affect validity, but is not defined by it. Bias is seen instead as a multifaceted aspect of tests and of test takers. Some of the implications of this model for understanding the causes and effects of bias are then explored.

14.
For a certification, licensure, or placement exam, allowing examinees multiple attempts at the test can effectively change the pass rate. This change can occur without any change in the underlying latent trait, as an artifact of multiple attempts and the imperfect reliability of the test. By deriving formulae to compute the pass rate under two definitions, this article provides tools for testing practitioners to compute and evaluate the change in the expected pass rate when a certain (maximum) number of attempts is allowed without any change in the latent trait. The article also includes a simulation study that considers change in ability and differential motivation of examinees to retake the test. Results indicate that the general trend shown by the analytical results is maintained: the marginal expected pass rate increases with more attempts when the testing volume is defined as the total number of test takers, and decreases with more attempts when the testing volume is defined as the total number of test attempts.
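The contrast between the two definitions is easy to reproduce by simulation; a sketch under simple assumptions (standard-normal latent trait, normal measurement error, every failing examinee retaking up to the maximum; none of these values come from the article):

```python
import numpy as np

rng = np.random.default_rng(0)
n, cutoff, error_sd, max_attempts = 100_000, 0.5, 0.6, 3

theta = rng.normal(0.0, 1.0, n)        # latent trait, fixed across attempts
passed = np.zeros(n, dtype=bool)
attempts = np.zeros(n, dtype=int)

for _ in range(max_attempts):
    active = ~passed                   # only failing examinees retake
    attempts[active] += 1
    observed = theta[active] + rng.normal(0.0, error_sd, active.sum())
    passed[active] = observed >= cutoff

print("pass rate per test taker:", passed.mean())                  # rises with attempts
print("pass rate per attempt:   ", passed.sum() / attempts.sum())  # falls with attempts
```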

15.
Abstract

High school students completed both multiple-choice and constructed-response exams over an 845-word narrative passage on which they either took notes or underlined critical information. A control group merely read the text. In addition, half of the learners in each condition were told to expect either a multiple-choice or a constructed-response test following reading. Overall, note takers showed superior posttest recall, and notetaking without test instructions yielded the best group performance. Notetaking also required significantly more time than the other conditions. Underlining for a multiple-choice test led to better recall than underlining for a constructed-response test. Although more multiple-choice than constructed-response items were remembered, test mode did not interact with the other factors.

16.
It has been reasonably well established that test takers can, to varying degrees, answer some reading comprehension questions without reading the passages on which the questions are based, even for carefully constructed measures like the Scholastic Aptitude Test (SAT). The aim of this study was to determine what test-taking strategies examinees use, and which are related to test performance, when reading passages are not available. The research focused on reading comprehension questions similar to those that will be used in the revised SAT, to be introduced in 1994. The most often cited strategies involved choosing answers on the basis of consistency with other questions and reconstructing the main theme of a missing passage from all of the questions and answers in a set. These strategies were more likely to result in successful performance on individual test items than were any of many other possible (and less construct-relevant) strategies.

17.
In this ITEMS module, we provide a two-part introduction to the topic of reliability from the perspective of classical test theory (CTT). In the first part, directed primarily at beginning learners, we review and build on the content presented in the original didactic ITEMS article by Traub and Rowley (1991). Specifically, we discuss the notion of reliability as an intuitive everyday concept to lay the foundation for its formalization as a reliability coefficient via the basic CTT model. We then walk through the step-by-step computation of key reliability indices and discuss the data collection conditions under which each is most suitable. In the second part, directed primarily at intermediate learners, we present a distribution-centered perspective on the same content. We discuss the assumptions of various CTT models, ranging from parallel to congeneric, and review how these affect the choice of reliability statistics. Throughout the module, we use a customized Excel workbook with sample data and basic data manipulation functionalities to illustrate the computation of individual statistics and to allow for structured independent exploration. In addition, we provide quiz questions with diagnostic feedback as well as short videos that walk through sample exercises within the workbook.
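A key computation of the kind covered in the first part, coefficient alpha, is short enough to sketch directly (the item matrix below is illustrative; the module itself works through an Excel workbook):

```python
import numpy as np

def cronbach_alpha(items):
    """items: (n_examinees, k_items) score matrix.
    alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

# Illustrative 0/1 responses: five examinees, four items
demo = [[1, 1, 0, 1], [0, 1, 0, 0], [1, 1, 1, 1], [0, 0, 0, 1], [1, 0, 1, 1]]
print(f"alpha = {cronbach_alpha(demo):.2f}")
```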

18.
The interpretability of score comparisons depends on the design and execution of a sound data collection plan and the establishment of linkings between these scores. When comparisons are made between scores from two or more assessments that are built to different specifications and are administered to different populations under different conditions, the validity of the comparisons hinges on untestable assumptions. For example, tests administered across different disability groups or tests administered to different language groups produce scores for which implicit linkings are presumed to hold. Presumed linking makes use of extreme assumptions to produce links between scores on tests in the absence of common test material or equivalent groups of test takers. These presumed linkings lead to dubious interpretations. This article suggests an approach that indirectly assesses the validity of these presumed linkings among scores on assessments that contain neither equivalent groups nor common anchor material.

19.
This study examined the differential effectiveness of traditional and discovery methods of instruction for teaching science concepts, understandings about science, and scientific attitudes to learners at the concrete and formal levels of cognitive development. The dependent variables were achievement, understanding of science, and scientific attitude, assessed with the ACS Achievement Test (high school chemistry, Form 1979), the Test on Understanding Science (Form W), and the Test on Scientific Attitude, respectively. Mode of instruction and cognitive development were the independent variables. Subjects were 120 Form IV (11th-grade) males enrolled in chemistry classes in Lusaka, Zambia. Sixty of these were concrete reasoners (mean age = 18.23) randomly selected from one of two schools. The remaining 60 subjects were formal reasoners (mean age = 18.06) randomly selected from a second boys' school. Each of these two groups was randomly split into two subgroups of 30 subjects. Traditional and discovery approaches were randomly assigned to the two subgroups of concrete reasoners and to the two subgroups of formal reasoners. Prior to instruction, the subjects were pretested using the ACS Achievement Test, the Test on Understanding Science, and the Test on Scientific Attitude. Subjects received instruction covering eight chemistry topics over approximately 10 weeks, followed by posttests using the same standard tests. Two-way analysis of covariance was used, with pretest scores serving as covariates, and the 0.05 level of significance was adopted; the Tukey WSD technique was used as a follow-up test where applicable. It was found that (1) for the formal reasoners, the discovery group earned significantly higher understanding-science scores than the traditional group, whereas for the concrete reasoners mode of instruction made no difference; (2) overall, formal reasoners earned significantly higher achievement scores than concrete reasoners; (3) in general, subjects taught by the discovery approach earned significantly higher scientific attitude scores than those taught by the traditional approach, while the traditional group outperformed the discovery group on achievement scores. It was concluded that the traditional approach may be an efficient instructional mode for teaching scientific facts and principles to high school students, while the discovery approach seems more suitable for teaching scientific attitudes and for promoting understanding about science and scientists among formal operational learners.
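The analysis described, a two-way ANCOVA with pretest scores as the covariate, maps directly onto standard statistical tooling; a sketch with statsmodels (the file and column names are hypothetical stand-ins for the study's data):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Posttest score by mode of instruction and cognitive level,
# with pretest score as covariate (names are hypothetical).
df = pd.read_csv("chemistry_study.csv")

model = ols("posttest ~ pretest + C(mode) * C(level)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # Type II ANCOVA table
```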

20.
If a test is intended to impact teaching and learning, how can we make a case for its validity? Currently, it is argued, the case for the validity of a test considers only the test maker's view of what the scores mean and whether they are useful for teachers and learners. Although the test maker's perspective is a necessary one, it is insufficient to validate a test, so an expanded framework for validating tests is needed. The proposed expansion uses teachers'/professionals' and students' perspectives as necessary information in validating what test scores mean and whether they are useful to teachers and learners.

