1.

Who Benefits Most From Preparing for a “Coachable” Admissions Test?

Donald E. Powers 《Journal of Educational Measurement》1987,24(3):247-262

A previous study of the initial, preoperational version of the Graduate Record Examinations (GRE) analytical ability measure (Powers & Swinton, 1984) revealed practically and statistically significant effects of test familiarization on analytical test scores. (Two susceptible item types were subsequently removed from the test.) Data from this study were reanalyzed for evidence of differential effects for subgroups of examinees classified by age, ethnicity, degree aspiration, English language dominance, and performance on other sections of the GRE General Test. The results suggested little, if any, difference among subgroups of examinees with respect to their response to the particular kind of test preparation considered in the study. Within the limits of the data, no particular subgroup appeared to benefit significantly more or significantly less than any other subgroup.

2.

Modeling Item Response Times With a Two-State Mixture Model: A New Method of Measuring Speededness

Deborah L. Schnipke David J. Scrams 《Journal of Educational Measurement》1997,34(3):213-232

Speededness refers to the extent to which time limits affect examinees' test performance, and it is often measured by calculating the proportion of examinees who do not reach a certain percentage of test items. However, when tests are number-right scored (i.e., no points are subtracted for incorrect responses), examinees are likely to rapidly guess on items rather than leave them blank. Therefore, this traditional measure of speededness probably underestimates the true amount of speededness on such tests. A more accurate assessment of speededness should also reflect the tendency of examinees to rapidly guess on items as time expires. This rapid-guessing component of speededness can be estimated by modeling response times with a two-state mixture model, as demonstrated with data from a computer-administered reasoning test. Taking into account the combined effect of unreached items and rapid guessing provides a more complete measure of speededness than has previously been available.
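
The traditional measure mentioned in this abstract can be sketched in a few lines. This is an illustration only, not the authors' code: the function names, the use of `None` for an unanswered item, and the 75% cutoff are all assumptions made for the example. An item is treated as "not reached" when it and every later item are blank.

```python
from typing import List, Optional

Row = List[Optional[int]]  # one examinee's item scores; None = no response (assumed encoding)


def items_reached(row: Row) -> int:
    """Return how many items the examinee reached (position of the last answered item)."""
    for i in range(len(row) - 1, -1, -1):
        if row[i] is not None:
            return i + 1
    return 0


def proportion_not_reaching(responses: List[Row], fraction: float = 0.75) -> float:
    """Proportion of examinees who reached fewer than `fraction` of the items.

    The 0.75 default is a hypothetical cutoff chosen for illustration.
    """
    n_items = len(responses[0])
    cutoff = fraction * n_items
    short = sum(1 for row in responses if items_reached(row) < cutoff)
    return short / len(responses)


# Hypothetical data: four examinees, four items.
example = [
    [1, 0, 1, 1],           # reached all four items
    [1, None, None, None],  # reached one item
    [0, 1, None, None],     # reached two items
    [1, 1, 1, None],        # reached three items
]
print(proportion_not_reaching(example, fraction=0.75))
```

An examinee who guesses rapidly on the last items leaves no blanks and so counts as having reached every item, which is why the abstract argues that this index understates speededness on number-right scored tests.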

3.

Improving Multiple-Choice Test Performance for Examinees with Different Levels of Test Anxiety

Linda Crocker Alicia Schmitt 《Journal of Experimental Education》2013,81(4):201-205

The effectiveness of a strategy for improving performance on multiple-choice items for examinees was assessed. An aptitude-treatment interaction model was used to test the possibility of different treatment effects for examinees with different levels of test anxiety. Undergraduate measurement students responded to the Mandler-Sarason Test Anxiety Scale and to an objective test covering course content. For low-anxious examinees, generation of an answer before selecting a multiple-choice response led to higher test performance; for highly test anxious examinees, there was a slightly negative effect on performance.

4.

Inexperienced and Anxious Computer Users: Coping With a Computer-Administered Test of Academic Skills

《Educational Assessment》2013,18(2):153-173

This study assessed the degree to which computer-based administration contributes to test performance differences among examinees. Inexperienced or anxious computer users answered computer-based reading and mathematics questions from a new teacher licensing test. They also answered paper-and-pencil analogues of these questions. While taking the computer-administered tests, half of the examinees had access to on-line familiarization materials only; the other half had additional help from a test supervisor. Results showed that (a) extra assistance from a test supervisor did not have a noticeable effect on test performance, (b) performance on test sections administered later during the session showed no evidence of improvement from practice on earlier sections, (c) most of the variation in performance on the computer-administered tests was explained by performance on the paper-and-pencil analogues rather than by attitudes toward computers or experience with them, and (d) examinees were more positive about the computer-based tests after testing than before. The conclusion was that on-line test familiarization proved adequate for the anxious/inexperienced computer users in the study and that computer-based test administration did not unduly affect examinee performance. The implications of the study are discussed with respect to evaluating new and emerging alternative modes of assessment.

5.

Effects of Differentially Time-Consuming Tests on Computer-Adaptive Test Scores

Brent Bridgeman Frederick Cline 《Journal of Educational Measurement》2004,41(2):137-148

Time limits on some computer-adaptive tests (CATs) are such that many examinees have difficulty finishing, and some examinees may be administered tests with more time-consuming items than others. Results from over 100,000 examinees suggested that about half of the examinees had to guess on the final six questions of the analytical section of the Graduate Record Examination in order to finish before time expired. At the higher-ability levels, even more guessing was required because the questions administered to higher-ability examinees were typically more time consuming. Because the scoring model is not designed to cope with extended strings of guesses, substantial errors in ability estimates can be introduced when CATs have strict time limits. Furthermore, examinees who are administered tests with a disproportionate number of time-consuming items appear to get lower scores than examinees of comparable ability who are administered tests containing items that can be answered more quickly, though the issue is very complex because of the relationship of time and difficulty, and the multidimensionality of the test.

6.

The Effect of Including Pretest Items in an Operational Computerized Adaptive Test: Do Different Ability Examinees Spend Different Amounts of Time on Embedded Pretest Items?

Abdullah A. Ferdous Barbara S. Plake Shu-Ren Chang 《Educational Assessment》2013,18(2):161-173

The purpose of this study was to examine the effect of pretest items on response time in an operational, fixed-length, time-limited computerized adaptive test (CAT). These pretest items are embedded within the CAT, but unlike the operational items, are not tailored to the examinee's ability level. If examinees with higher ability levels need less time to complete these items than do their counterparts with lower ability levels, they will have more time to devote to the operational test questions. Data were from a graduate admissions test that was administered worldwide. Data from both quantitative and verbal sections of the test were considered. For the verbal section, examinees in the lower ability groups spent systematically more time on their pretest items than did those in the higher ability groups, though for the quantitative section the differences were less clear.

7.

A Comparison of Experimental and Observational Approaches to Assessing the Effects of Time Constraints in a Medical Licensing Examination

Polina Harik Brian E. Clauser Irina Grabovsky Peter Baldwin Melissa J. Margolis Deniz Bucak Michael Jodoin William Walsh Steven Haist 《Journal of Educational Measurement》2018,55(2):308-327

Test administrators are appropriately concerned about the potential for time constraints to impact the validity of score interpretations; psychometric efforts to evaluate the impact of speededness date back more than half a century. The widespread move to computerized test delivery has led to the development of new approaches to evaluating how examinees use testing time and to new metrics designed to provide evidence about the extent to which time limits impact performance. Much of the existing research is based on these types of observational metrics; relatively few studies use randomized experiments to evaluate the impact of time limits on scores. Of those studies that do report on randomized experiments, none directly compare the experimental results to evidence from observational metrics to evaluate the extent to which these metrics are able to sensitively identify conditions in which time constraints actually impact scores. The present study provides such evidence based on data from a medical licensing examination. The results indicate that these observational metrics are useful but provide an imprecise evaluation of the impact of time constraints on test performance.

8.

An Investigation of Possible Correlation of General Anxiety with Performance in Eleven‐plus Scores in Year 6 Primary School Pupils

Sandra Eady 《Educational Psychology》1999,19(3):347-359

This study focuses on measuring levels of anxiety experienced by 11‐year‐olds in their last year at primary school and aims to investigate the effect of anxiety on pupils’ performance in eleven‐plus tests. The Taylor Manifest Anxiety Test was used to determine individual levels of anxiety amongst a Year 6 cohort. Their final test scores in the eleven‐plus examination were used as a measure of their overall performance. Correlations were carried out to see if there was any link between levels of anxiety and performance in eleven‐plus for the group as a whole and in terms of gender. There appeared to be no significant link between levels of high anxiety and poor exam performance. However, although there seemed to be no apparent correlation, highly anxious boys performed well in the eleven‐plus examination, as did highly anxious girls.

9.

Autistic Syndromes and Diet: a follow‐up study

Christina Stage 《Scandinavian Journal of Educational Research》2013,57(3):223-235

In May 1990 new groups of examinees participated in the Swedish Scholastic Aptitude Test (SweSAT). Generally these new groups were younger and had higher education than the examinees at earlier test administrations. The purpose of the study reported was to examine whether the gender differences in test results had changed with the changed composition of examinees. The groups of men and women were successively matched according to age and education and comparisons were made of gender differences in test results between different age and education groups. The results, however, showed that even though age as well as education had influence on the test results, no real difference was found between younger and older examinees regarding gender differences in the test results.

10.

Equivalence of students' scores on timed and untimed anatomy practical examinations

Guiyun Zhang Bruce A. Fenderson Richard R. Schmidt J. Jon Veloski 《Anatomical sciences education》2013,6(5):281-285

Untimed examinations are popular with students because there is a perception that first impressions may be incorrect, and that difficult questions require more time for reflection. In this report, we tested the hypothesis that timed anatomy practical examinations are inherently more difficult than untimed examinations. Students in the Doctor of Physical Therapy program at Thomas Jefferson University were assessed on their understanding of anatomic relationships using multiple‐choice questions. For the class of 2012 (n = 46), students were allowed to circulate freely among 40 testing stations during the 40‐minute testing session. For the class of 2013 (n = 46), students were required to move sequentially through the 40 testing stations (one minute per item). Students in both years were given three practical examinations covering the back/upper limb, lower limb, and trunk. An identical set of questions was used for both groups of students (untimed and timed examinations). Our results indicate that there is no significant difference between student performance on untimed and timed examinations (final percent scores of 87.3 and 88.9, respectively). This result also held true for students in the top and bottom 20th percentiles of the class. Moreover, time limits did not lead to errors on even the most difficult, higher‐order questions (i.e., items with P‐values < 0.70). Thus, limiting time at testing stations during an anatomy practical examination does not adversely affect student performance. Anat Sci Educ 6: 281–285. © 2013 American Association of Anatomists.

11.

Validity Issues in Test Speededness

Ying Lu Stephen G. Sireci 《Educational Measurement》2007,26(4):29-37

Speededness refers to the situation where the time limits on a standardized test do not allow substantial numbers of examinees to fully consider all test items. When tests are not intended to measure speed of responding, speededness introduces a severe threat to the validity of interpretations based on test scores. In this article, we describe test speededness, its potential threats to validity, and traditional and modern methods that can be used to assess the presence of speededness. We argue that more attention must be paid to this issue and that more research must be done to set appropriate time limits on power tests so that speed of responding does not interfere with the construct measured.

12.

The Effect of Multidimensionality on IRT True-Score Equating for Subgroups of Examinees

André F. De Champlain 《Journal of Educational Measurement》1996,33(2):181-201

The purpose of this study was to assess the dimensionality of two forms of a large-scale standardized test separately for 3 ethnic groups of examinees and to investigate whether differences in their latent trait composites have any impact on unidimensional item response theory true-score equating functions. Specifically, separate equating functions for African American and Hispanic examinees were compared to those of a Caucasian group as well as the total test taker population. On both forms, a 2-dimensional model adequately accounted for the item responses of Caucasian and African American examinees, whereas a more complex model was required for the Hispanic subgroup. The differences between equating functions for the 3 ethnic groups and the total test taker population were small and tended to be located at the low end of the score scale.

13.

A Comparison of Achievement Test Performance of Nondisabled Students Under Silent Reading and Reading Plus Listening Modes of Administration

《Applied Measurement in Education》2013,26(4):307-320

Average performance on four subtests of the Iowa Tests of Educational Development was compared under two types of test administration. The first adhered to the publisher's standardized directions and time limits. The second permitted students to listen to audiotapes of the test material on individual cassette players as they read the test booklet. No time limits were imposed on the high school students under this second mode of administration. Each of the four subtests (Interpretation of Literary Materials, Analysis of Social Studies Materials, Use of Sources of Information, and Vocabulary) imposes substantial reading demands on the examinee. Only on the literary materials test did students score significantly higher via the tape-assisted administration. Significant differential effects were found as a function of level of reading ability for this test and for the social studies test. Poor readers were helped by the use of the tapes, whereas good readers did not benefit from the use of the tapes. Correlations corrected for attenuation indicate that the two modes of administration measure essentially the same attributes when students exhibit no learning disabilities or reading problems.

14.

Does difficulty-based item order matter in multiple-choice exams? (Empirical evidence from university students)

《Studies in Educational Evaluation》2020

This empirical study aimed to investigate the impact of easy-first vs. hard-first ordering of the same items in a paper-and-pencil multiple-choice exam on the performance of low-, moderate-, and high-achieving examinees, as well as on the item statistics. Data were collected from 554 Turkish university students using two test forms, which included the same multiple-choice items in reverse order, i.e., easy first vs. hard first. The tests included 26 multiple-choice items about the introductory unit of a “Measurement and Assessment” course. The results suggested that sequencing the multiple-choice items in either direction, from easy to hard or vice versa, did not affect the test performance of the examinees, regardless of whether they were low, moderate, or high achievers. Finally, no statistically significant difference was observed between the item statistics of the two forms, i.e., the difficulty (p), discrimination (d), point biserial (r), and adjusted point biserial (adj. r) coefficients.

15.

Answer Changing on Multiple-Choice Test Items Among Eighth-Grade Readers

Clifton A. Casteel 《Journal of Experimental Education》2013,81(4):300-309

This study was done to examine the effect of answer changing on multiple-choice test performance among good and poor readers in the eighth grade. Although the gains of poor readers were higher than those of good readers, all subjects profited significantly from changing their answers on items. For all subjects, when a single response was changed, there was a two-to-one chance that the new response would raise rather than lower the final score. Gains from answer changing on test items were slightly higher for poor readers as a group than were those for good readers. However, the result was determined not to be significant. More important, this hypothesis is strengthened by the fact that all subjects profited from answer changing. Therefore, the results were interpreted as lending support to the notion that answer-changing response among young examinees should be encouraged if there is a reasonable doubt about their “first impression.”

16.

Subscores Based on Classical Test Theory: To Report or Not to Report

Sandip Sinharay Shelby Haberman Gautam Puhan 《Educational Measurement》2007,26(4):21-28

There is an increasing interest in reporting subscores, both at examinee level and at aggregate levels. However, it is important to ensure reasonable subscore performance in terms of high reliability and validity to minimize incorrect instructional and remediation decisions. This article employs a statistical measure based on classical test theory that is conceptually similar to the test reliability measure and can be used to determine when subscores have any added value over total scores. The usefulness of subscores is examined both at the level of the examinees and at the level of the institutions that the examinees belong to. The suggested approach is applied to two data sets from a basic skills test. The results provide little support in favor of reporting subscores for either examinees or institutions for the tests studied here.

17.

An Analysis of Variance Approach for the Estimation of Response Time Distributions in Tests

Yigal Attali 《Journal of Educational Measurement》2010,47(4):458-470

Generalizability theory and analysis of variance methods are employed, together with the concept of objective time pressure, to estimate response time distributions and the degree of time pressure in timed tests. By estimating response time variance components due to person, item, and their interaction, and fixed effects due to item types and examinee time pressure, one can predict the distribution (mean and variance) of total response time for a population of examinees and a particular time limit. Furthermore, these variance components and fixed effects can be used in a simulation approach to estimate the distributions of time pressure during the test to help test developers evaluate the appropriateness of specific time limits. I present theoretical considerations and empirical results from two tests.

18.

Psychometric Equivalence of Ratings for Repeat Examinees on a Performance Assessment for Physician Licensure

Mark R. Raymond Kimberly A. Swygert Nilufer Kahraman 《Journal of Educational Measurement》2012,49(4):339-361

Although a few studies report sizable score gains for examinees who repeat performance‐based assessments, research has not yet addressed the reliability and validity of inferences based on ratings of repeat examinees on such tests. This study analyzed scores for 8,457 single‐take examinees and 4,030 repeat examinees who completed a 6‐hour clinical skills assessment required for physician licensure. Each examinee was rated in four skill domains: data gathering, communication‐interpersonal skills, spoken English proficiency, and documentation proficiency. Conditional standard errors of measurement computed for single‐take and multiple‐take examinees indicated that ratings were of comparable precision for the two groups within each of the four skill domains; however, conditional errors were larger for low‐scoring examinees regardless of retest status. In addition, on their first attempt multiple‐take examinees exhibited less score consistency across the skill domains, but on their second attempt their scores became more consistent. Further, the median correlation between scores on the four clinical skill domains and three external measures was .15 for multiple‐take examinees on their first attempt but increased to .27 for their second attempt, a value comparable to the median correlation of .26 for single‐take examinees. The findings support the validity of inferences based on scores from the second attempt.

19.

An investigation of the gender differential performance on a high-stakes language proficiency test in Iran

Hossein Karami 《Asia Pacific Education Review》2013,14(3):435-444

There has been a growing consensus among educational measurement experts and psychometricians that test taker characteristics may unduly affect performance on tests. This may lead to construct-irrelevant variance in the scores and thus render the test biased. Hence, it is incumbent on test developers and users alike to provide evidence that their tests are free of such bias. The present study used generalizability theory to examine the presence of gender differential performance on a high-stakes language proficiency test, the University of Tehran English Proficiency Test. An analysis of the performance of 2,343 examinees who had taken the test in 2009 indicated that the relative contributions of different facets to score variance were almost uniform across the gender groups. Further, there was no significant interaction between items and persons, indicating that the relative standings of the persons were uniform across all items. The lambda reliability coefficients were also uniformly high. All in all, the study provides evidence that the test is free of gender bias and enjoys a high level of dependability.

20.

MULTIPLE-CHOICE VERSUS FREE-RESPONSE: A SIMULATION STUDY

ROBERT B. FRARY 《Journal of Educational Measurement》1985,22(1):21-31

Responses to a 40-item test were simulated for 150 examinees under free-response and multiple-choice formats. The simulation was replicated three times for each of 30 variations reflecting format and the extent to which examinees were (a) misinformed, (b) successful in guessing free-response answers, and (c) able to recognize with assurance correct multiple-choice options that they could not produce under free-response testing. Internal consistency reliability (KR20) estimates were consistently higher for the free-response score sets, even when the free-response item difficulty indices were augmented to yield mean scores comparable to those from multiple-choice testing. In addition, all test score sets were correlated with four randomly generated sets of unit-normal measures, whose intercorrelations ranged from moderate to strong. These measures served as criteria because one of them had been used as the basic ability measure in the simulation of the test score sets. Again, the free-response score sets yielded superior results even when tests of equal difficulty were compared. The guessing and recognition factors had little or no effect on reliability estimates or correlations with the criteria. The extent of misinformation affected only multiple-choice score KR20's (more misinformation, higher KR20's). Although free-response tests were found to be generally superior, the extent of their advantage over multiple-choice was judged sufficiently small that other considerations might justifiably dictate format choice.
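
For readers unfamiliar with the KR20 index named in this abstract, the standard Kuder-Richardson formula 20 is KR20 = (k / (k - 1)) * (1 - sum(p_i * q_i) / var(total)), where k is the number of items, p_i is the proportion answering item i correctly, and q_i = 1 - p_i. The sketch below is an illustration, not the simulation used in the study: the function name and the tiny data set are hypothetical, and it assumes the population form of the total-score variance (dividing by the number of examinees); some texts divide by n - 1 instead, which gives slightly different values.

```python
from typing import List


def kr20(scores: List[List[int]]) -> float:
    """Kuder-Richardson formula 20 for dichotomous (0/1) item scores.

    scores[e][i] is examinee e's score on item i. Uses the population
    variance of total scores (an assumption; see the lead-in).
    """
    n = len(scores)
    k = len(scores[0])
    if k < 2:
        raise ValueError("KR20 needs at least two items")

    totals = [sum(row) for row in scores]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    if var_total == 0:
        raise ValueError("total scores have zero variance")

    # Sum of item variances p_i * (1 - p_i) across items.
    sum_pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in scores) / n
        sum_pq += p * (1 - p)

    return (k / (k - 1)) * (1 - sum_pq / var_total)


# Hypothetical data: four examinees, three items.
example = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(round(kr20(example), 4))  # 0.75
```

In the example the total-score variance is 1.25 and the item variances sum to 0.625, so the coefficient is 1.5 * (1 - 0.5) = 0.75.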