Similar Articles
20 similar articles found
1.
Speededness refers to the extent to which time limits affect examinees' test performance, and it is often measured by calculating the proportion of examinees who do not reach a certain percentage of test items. However, when tests are number-right scored (i.e., no points are subtracted for incorrect responses), examinees are likely to rapidly guess on items rather than leave them blank. Therefore, this traditional measure of speededness probably underestimates the true amount of speededness on such tests. A more accurate assessment of speededness should also reflect the tendency of examinees to rapidly guess on items as time expires. This rapid-guessing component of speededness can be estimated by modeling response times with a two-state mixture model, as demonstrated with data from a computer-administered reasoning test. Taking into account the combined effect of unreached items and rapid guessing provides a more complete measure of speededness than has previously been available.
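The rapid-guessing idea above can be sketched generically: on the log scale, response times for an item often separate into a fast (guessing) component and a slow (solution-behavior) component, and a two-component mixture assigns each response to one of them. The sketch below uses scikit-learn's GaussianMixture on log times; it is only a minimal illustration of a two-state response-time mixture, not the authors' exact model.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def flag_rapid_guesses(response_times):
    """Fit a two-component mixture to log response times for one item and
    flag responses assigned to the faster component as rapid guesses.
    Minimal two-state sketch; the published model is more elaborate."""
    log_t = np.log(np.asarray(response_times, dtype=float)).reshape(-1, 1)
    gm = GaussianMixture(n_components=2, random_state=0).fit(log_t)
    labels = gm.predict(log_t)
    fast = int(np.argmin(gm.means_.ravel()))   # component with the smaller mean log time
    return labels == fast

# simulated end-of-test item: 10% of examinees guess rapidly (~3 s vs. ~35 s)
rng = np.random.default_rng(1)
times = np.concatenate([rng.lognormal(3.5, 0.4, 900), rng.lognormal(1.0, 0.3, 100)])
print(flag_rapid_guesses(times).mean())        # close to 0.10
```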

2.
Speededness refers to the situation where the time limits on a standardized test do not allow substantial numbers of examinees to fully consider all test items. When tests are not intended to measure speed of responding, speededness introduces a severe threat to the validity of interpretations based on test scores. In this article, we describe test speededness, its potential threats to validity, and traditional and modern methods that can be used to assess the presence of speededness. We argue that more attention must be paid to this issue and that more research must be done to set appropriate time limits on power tests so that speed of responding does not interfere with the construct measured.
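One of the traditional methods alluded to above is the unreached-items index: the proportion of examinees who do not reach a given fraction of the items. A minimal numpy sketch, under the assumption that unreached items are coded as trailing missing values:

```python
import numpy as np

def traditional_speededness(responses, reach_fraction=0.75):
    """responses: examinees x items array with np.nan for unreached items.
    Returns (proportion completing all items, proportion reaching at least
    reach_fraction of the items), the conventional speededness indices."""
    n_items = responses.shape[1]
    trailing_blank = np.argmax(~np.isnan(responses[:, ::-1]), axis=1)  # count of unreached items
    reached = np.where(np.all(np.isnan(responses), axis=1), 0, n_items - trailing_blank)
    return np.mean(reached == n_items), np.mean(reached >= reach_fraction * n_items)
```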

3.
When tests are administered under fixed time constraints, test performances can be affected by speededness. Among other consequences, speededness can result in inaccurate parameter estimates in item response theory (IRT) models, especially for items located near the end of tests (Oshima, 1994). This article presents an IRT strategy for reducing contamination in item difficulty estimates due to speededness. Ordinal constraints are applied to a mixture Rasch model (Rost, 1990) so as to distinguish two latent classes of examinees: (a) a "speeded" class, composed of examinees who had insufficient time to adequately answer end-of-test items, and (b) a "nonspeeded" class, composed of examinees who had sufficient time to answer all items. The parameter estimates obtained for end-of-test items in the nonspeeded class are shown to more accurately approximate their difficulties when the items are administered at earlier locations on a different form of the test. A mixture model can also be used to estimate the class memberships of individual examinees. In this way, it can be determined whether membership in the speeded class is associated with other student characteristics. Results are reported for gender and ethnicity.
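The key device in this approach is the ordinal constraint: for end-of-test items, difficulty in the speeded class is forced to be at least as large as in the nonspeeded class. A minimal sketch of that constraint and of the two-class mixture likelihood for a single examinee (with ability treated as known, which real estimation would instead integrate out) might look as follows; the parameterization is an illustrative assumption, not the article's exact specification.

```python
import numpy as np

def constrained_difficulties(b_nonspeeded, log_shift, speeded_start):
    """Speeded-class difficulties equal the nonspeeded ones, except that
    end-of-test items (from index speeded_start on) are shifted upward by a
    strictly positive amount: the ordinal constraint, enforced via exp()."""
    b_speeded = np.array(b_nonspeeded, dtype=float)
    b_speeded[speeded_start:] += np.exp(log_shift)
    return b_speeded

def rasch_prob(theta, b):
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def mixture_loglik(resp, theta, b_non, b_spd, pi_speeded):
    """Two-class mixture Rasch log-likelihood for one response vector with
    ability treated as known (illustration only; marginal maximum likelihood
    would integrate over theta, e.g., by Gauss-Hermite quadrature)."""
    resp = np.asarray(resp)
    p_non, p_spd = rasch_prob(theta, b_non), rasch_prob(theta, b_spd)
    lik_non = np.prod(np.where(resp == 1, p_non, 1.0 - p_non))
    lik_spd = np.prod(np.where(resp == 1, p_spd, 1.0 - p_spd))
    return np.log((1.0 - pi_speeded) * lik_non + pi_speeded * lik_spd)
```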

4.
A critical component of test speededness is the distribution of the test taker’s total time on the test. A simple set of constraints on the item parameters in the lognormal model for response times is derived that can be used to control the distribution when assembling a new test form. As the constraints are linear in the item parameters, they can easily be included in a mixed integer programming model for test assembly. The use of the constraints is demonstrated for the problems of assembling a new test form to be equally speeded as a reference form, test assembly in which the impact of a change in the content specifications on speededness is to be neutralized, and the assembly of test forms with a revised level of speededness.
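Because the controlled quantities are linear in the item parameters, they drop straight into a mixed integer programming formulation. The sketch below (using the PuLP library) matches a new form to a reference form on the sums of the lognormal time-intensity parameters and of the squared reciprocal time-discrimination parameters; treating exactly these two sums as the controlled quantities is an illustrative assumption, not necessarily the constraints derived in the article.

```python
from pulp import LpProblem, LpMinimize, LpVariable, LpBinary, lpSum

def assemble_equally_speeded_form(beta, alpha, ref_beta_sum, ref_var_sum, n_select, tol=0.5):
    """beta[i]: time intensity, alpha[i]: time discrimination of pool item i.
    Select n_select items whose summed parameters stay within tol of the
    reference form's sums (a pure feasibility problem)."""
    n = len(beta)
    x = [LpVariable(f"x{i}", cat=LpBinary) for i in range(n)]
    prob = LpProblem("equally_speeded_form", LpMinimize)
    prob += lpSum(x)                                   # dummy objective (fixed by the length constraint)
    prob += lpSum(x) == n_select                       # test length
    prob += lpSum(beta[i] * x[i] for i in range(n)) <= ref_beta_sum + tol
    prob += lpSum(beta[i] * x[i] for i in range(n)) >= ref_beta_sum - tol
    prob += lpSum((1.0 / alpha[i] ** 2) * x[i] for i in range(n)) <= ref_var_sum + tol
    prob += lpSum((1.0 / alpha[i] ** 2) * x[i] for i in range(n)) >= ref_var_sum - tol
    prob.solve()
    return [i for i in range(n) if (x[i].value() or 0) > 0.5]
```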

5.
Specially constructed “speeded” and “unspeeded” forms of a Reading Comprehension test were administered to both regular center and fee-free center LSAT candidates in an effort to determine: (1) if the test was more speeded for fee-free candidates, and (2) if reducing the amount of speededness was more beneficial to fee-free candidates. Results of the analyses show: (1) the test is somewhat more speeded for fee-free candidates than for regular candidates, (2) reducing the amount of speededness produces higher scores for both regular and fee-free center candidates, and (3) reducing speededness is not significantly more beneficial (in terms of increasing the number of items answered correctly) to fee-free than to regular center candidates. Lower KR-20 reliability was observed under speeded conditions in the fee-free sample.

6.
Applied Measurement in Education, 2013, 26(1): 95-109
To evaluate the effects of calculator use on performance on the SAT I: Reasoning Test in Mathematics, questions about use of the calculator on the test were inserted into the answer sheets for the November 1996 and November 1997 administrations of the examination. Overall, nearly all examinees indicated that they brought a calculator to the test, and about two thirds reported using them on one third or more of the math items. Some group differences in the use of calculators were observed, with girls using them more frequently than boys and with Whites and Asian Americans using them more often than other racial or ethnic groups. Use of calculators was associated with higher test performance, but the more able students were more likely to have calculators and used them more often. The results were analyzed further using multiple regression and differential item functioning procedures. The degree of speededness under different levels of calculator use was also examined. Overall, the effects of calculator use were found to be small, but detectable.

7.
Single‐best answers to multiple‐choice items are commonly dichotomized into correct and incorrect responses, and modeled using either a dichotomous item response theory (IRT) model or a polytomous one if differences among all response options are to be retained. The current study presents an alternative IRT‐based modeling approach to multiple‐choice items administered with the procedure of elimination testing, which asks test‐takers to eliminate all the response options they consider to be incorrect. The partial credit model is derived for the obtained responses. By extracting more information pertaining to test‐takers’ partial knowledge on the items, the proposed approach has the advantage of providing more accurate estimation of the latent ability. In addition, it may shed some light on the possible answering processes of test‐takers on the items. As an illustration, the proposed approach is applied to a classroom examination of an undergraduate course in engineering science.
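The scoring model named above is the standard partial credit model; only the construction of the score categories comes from the elimination-testing procedure. A generic category-probability function is shown below; reading the category as the number of distractors correctly eliminated is an assumption of this sketch, not necessarily the article's exact scoring rule.

```python
import numpy as np

def pcm_probabilities(theta, step_params):
    """Partial credit model: probabilities of score categories 0..m for one
    item with m step parameters, at ability theta."""
    # category x has numerator exp(sum_{k<=x} (theta - step_k)); the empty sum is 0
    cumulative = np.concatenate(([0.0], np.cumsum(theta - np.asarray(step_params))))
    numerators = np.exp(cumulative - cumulative.max())   # shift for numerical stability
    return numerators / numerators.sum()

print(pcm_probabilities(theta=0.5, step_params=[-1.0, 0.0, 1.2]))  # four category probabilities
```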

8.
The presence of nuisance dimensionality is a potential threat to the accuracy of results for tests calibrated using a measurement model such as a factor analytic model or an item response theory model. This article describes a mixture group bifactor model to account for the nuisance dimensionality due to a testlet structure as well as the dimensionality due to differences in patterns of responses. The model can be used for testing whether or not an item functions differently across latent groups in addition to investigating the differential effect of local dependency among items within a testlet. An example is presented comparing test speededness results from a conventional factor mixture model, which ignores the testlet structure, with results from the mixture group bifactor model. Results suggested the 2 models treated the data somewhat differently. Analysis of the item response patterns indicated that the 2-class mixture bifactor model tended to categorize omissions as indicating speededness. With the mixture group bifactor model, more local dependency was present in the speeded than in the nonspeeded class. Evidence from a simulation study indicated the Bayesian estimation method used in this study for the mixture group bifactor model can successfully recover generated model parameters for 1- to 3-group models for tests containing testlets.

9.
We investigated students’ metacognitive experiences with regard to feelings of difficulty (FD), feelings of satisfaction (FS), and estimate of effort (EE), employing either computerized adaptive testing (CAT) or computerized fixed item testing (FIT). In an experimental approach, 174 students in grades 10 to 13 were tested with either a CAT or a FIT version of a matrices test. Data revealed that metacognitive experiences were not related to the resulting test scores for CAT: test takers who took the matrices test in an adaptive mode were, paradoxically, more satisfied with their performance the worse they had performed in terms of the resulting ability parameter. They also rated the test as easier the worse they had performed, but their estimates of effort were higher the better they had performed. For test takers who took the FIT version, completely different results emerged. In line with previous results, test takers were assumed to base these experiences on the subjectively estimated percentage of items solved. This moderated mediation hypothesis was partly confirmed, as the relation between the percentage of items solved and FD, FS, and EE was revealed to be mediated by the estimated percentage of items solved. Results are discussed with reference to feedback acceptance, errant self-estimations, and test fairness with regard to a possible false regulation of effort in lower ability groups when using CAT.

10.
Computer-based educational assessments often include items that involve drag-and-drop responses. There are different ways that drag-and-drop items can be laid out and different choices that test developers can make when designing these items. Currently, these decisions are based on experts’ professional judgments and design constraints, rather than empirical research, which might threaten the validity of interpretations of test outcomes. To this end, we investigated the effect of drag-and-drop item features on test-taker performance and response strategies with a cognition-centered approach. Four hundred and seventy-six adult participants solved content-equivalent drag-and-drop mathematics items under five design variants. Results showed that: (a) test takers’ performance and response strategies were affected by the experimental manipulations, and (b) test takers mostly used cognitively efficient response strategies regardless of the manipulated item features. Implications of the findings are provided to support test developers’ design decisions.

11.
In an article in the Winter 2011 issue of the Journal of Educational Measurement, van der Linden, Jeon, and Ferrara suggested that “test takers should trust their initial instincts and retain their initial responses when they have the opportunity to review test items.” They presented a complex IRT model that appeared to show that students would be worse off by changing answers. As noted in a subsequent erratum, this conclusion was based on flawed data, and the correct data could not be analyzed by their method because the model failed to converge. This left their basic question on the value of answer changing unanswered. A much more direct approach is to simply count the number of examinees whose scores after an opportunity to change answers are higher, lower, or the same as their initial scores. Using the same data set as the original article, an overwhelming majority of the students received higher scores after the opportunity to change answers.

12.
The humble multiple-choice test is very widely used within education at all levels, but its susceptibility to guesswork makes it a suboptimal assessment tool. The reliability of a multiple-choice test is partly governed by the number of items it contains; however, longer tests are more time consuming to take, and for some subject areas, it can be very hard to create new test items that are sufficiently distinct from previously used items. A number of more sophisticated multiple-choice test formats have been proposed dating back at least 60 years, many of which offer significantly improved test reliability. This paper offers a new way of comparing these alternative test formats, by modelling each one in terms of the range of possible test taker responses it enables. Looking at the test formats in this way leads to the realisation that the need for guesswork is reduced when test takers are given more freedom to express their beliefs. Indeed, guesswork is eliminated entirely when test takers are able to partially order the answer options within each test item. The paper aims to strengthen the argument for using more sophisticated multiple-choice test formats, especially for high-stakes summative assessment.

13.
According to a popular belief, test takers should trust their initial instinct and retain their initial responses when they have the opportunity to review test items. More than 80 years of empirical research on item review, however, has contradicted this belief and shown minor but consistently positive score gains for test takers who changed answers they found to be incorrect during review. This study reanalyzed the problem of the benefits of answer changes using item response theory modeling of the probability of an answer change as a function of the test taker’s ability level and the properties of items. Our empirical results support the popular belief and reveal substantial losses due to changing initial responses for all ability levels. Both the contradiction of the earlier research and support of the popular belief are explained as a manifestation of Simpson’s paradox in statistics.
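A toy table makes the Simpson's paradox argument concrete. The numbers below are invented purely for illustration (they are not the study's data): within each ability stratum, retained answers have the higher success rate, yet when the strata are pooled the direction reverses, because high-ability test takers both change more answers and answer more items correctly overall.

```python
# counts: (number of responses, number ending correct)
strata = {
    "high ability": {"change": (900, 540), "keep": (100, 70)},
    "low ability":  {"change": (100, 20),  "keep": (900, 270)},
}

for name, cells in strata.items():
    for action, (n, good) in cells.items():
        print(f"{name:12s} {action:6s} success rate = {good / n:.2f}")
# within each stratum, keep beats change (0.70 vs 0.60 and 0.30 vs 0.20)

for action in ("change", "keep"):
    n = sum(strata[s][action][0] for s in strata)
    good = sum(strata[s][action][1] for s in strata)
    print(f"pooled       {action:6s} success rate = {good / n:.2f}")
# pooled, change (0.56) appears to beat keep (0.34): the aggregate reverses the stratum-level pattern
```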

14.
High school students completed both multiple-choice and constructed response exams over an 845-word narrative passage on which they either took notes or underlined critical information. A control group merely read the text. In addition, half of the learners in each condition were told to expect either a multiple-choice or constructed response test following reading. Overall, note takers showed superior posttest recall, and notetaking without test instructions yielded the best group performance. Notetaking also required significantly more time than the other conditions. Underlining for a multiple-choice test led to better recall than underlining for a constructed response test. Although more multiple-choice than constructed response items were remembered, test mode failed to interact with the other factors.

15.
In multiple-choice tests, the quality of distractors may be more important than their number. We therefore examined the joint influence of distractor quality and quantity on test functioning by providing a sample of 5,793 participants with five parallel test sets consisting of items that differed in the number and quality of distractors. Surprisingly, we found that items in which only the one best distractor was presented together with the solution provided the strongest criterion-related evidence of the validity of test scores and thus allowed for the most valid conclusions on the general knowledge level of test takers. Items that included the best distractor produced more reliable test scores irrespective of option number. Increasing the number of options increased item difficulty, but did not increase internal consistency when testing time was controlled for.

16.
A sample of college-bound juniors from 275 high schools took a test consisting of 70 math questions from the SAT. A random half of the sample was allowed to use calculators on the test. Both genders and three ethnic groups (White, African American, and Asian American) benefitted about equally from being allowed to use calculators; Latinos benefitted slightly more than the other groups. Students who routinely used calculators on classroom mathematics tests were relatively advantaged on the calculator test. Test speededness was about the same whether or not students used calculators. Calculator effects on individual items ranged from positive through neutral to negative and could either increase or decrease the validity of an item as a measure of mathematical reasoning skills. Calculator effects could be either present or absent in both difficult and easy items.

17.
This study describes the effects of sex, education and age on the total test score on the Swedish Scholastic Aptitude Test (SweSAT), a test used in the selection process to colleges and universities in Sweden since 1977. Its use has so far been limited to one of four quota groups, consisting of applicants 25 years or older with more than four years of work experience. The statistical methods used in this study are regression models with dummy variables, estimated with a corner-point parameterization. The results indicate genuine differences for every variable studied. Test takers with a higher education obtain a higher mean score than those with a lower education, and older test takers obtain a higher mean score on the subtests vocabulary (WORD) and general information (GI) than younger persons. The mean test score for men is higher than the corresponding score for women, even if differences in education and age are controlled for. Finally, some statistical problems related to the analysis of data of this type are discussed.
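A regression of this kind is straightforward to reproduce in outline: the corner-point parameterization is what most software calls treatment (reference-category) coding of dummy variables. The sketch below uses statsmodels' formula interface; the file and column names are assumptions for illustration, not the SweSAT data layout.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("swesat_scores.csv")   # hypothetical file with one row per test taker

# C(...) applies treatment coding by default: one level of each factor is the
# corner point and the remaining dummies are contrasts against it.
model = smf.ols("total_score ~ C(sex) + C(education) + C(age_group)", data=df).fit()
print(model.summary())
```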

18.
There has been a growing research interest in the identification and management of disengaged test taking, which poses a validity threat that is particularly prevalent with low‐stakes tests. This study investigated effort‐moderated (E‐M) scoring, in which item responses classified as rapid guesses are identified and excluded from scoring. Using achievement test data composed of test takers who were quickly retested and showed differential degrees of disengagement, three basic findings emerged. First, standard E‐M scoring accounted for roughly one‐third of the score distortion due to differential disengagement. Second, a modified E‐M scoring method that used more liberal time thresholds performed better, accounting for two‐thirds or more of the distortion. Finally, the inability of E‐M scoring to account for all of the score distortion suggests the additional presence of nonrapid item responses that reflect less‐than‐full engagement by some test takers.
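In its basic form, effort-moderated scoring is simple to state: responses faster than an item-specific time threshold are classified as rapid guesses and dropped, and the score is computed from the remaining engaged responses. The sketch below shows that basic form with proportion-correct scoring; the choice of thresholds, and how liberal they are, is exactly what the study varies.

```python
import numpy as np

def effort_moderated_score(correct, response_times, thresholds):
    """Drop rapid guesses (response time below the item's threshold) and
    return the proportion correct among the remaining engaged responses
    for one examinee. Returns NaN if no engaged responses remain."""
    correct, rt, thr = map(np.asarray, (correct, response_times, thresholds))
    engaged = rt >= thr
    return correct[engaged].mean() if engaged.any() else np.nan

# rapid guesses on the last two items are excluded from the score
print(effort_moderated_score(correct=[1, 0, 1, 1, 0],
                             response_times=[35, 42, 28, 2, 1],
                             thresholds=[5, 5, 5, 5, 5]))   # 0.666...
```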

19.
The rise of computer‐based testing has brought with it the capability to measure more aspects of a test event than simply the answers selected or constructed by the test taker. One behavior that has drawn much research interest is the time test takers spend responding to individual multiple‐choice items. In particular, very short response time, termed rapid guessing, has been shown to indicate disengaged test taking, regardless of whether it occurs in high‐stakes or low‐stakes testing contexts. This article examines rapid‐guessing behavior: its theoretical conceptualization and underlying assumptions, methods for identifying it, misconceptions regarding its dynamics, and the contextual requirements for its proper interpretation. It is argued that because it does not reflect what a test taker knows and can do, a rapid guess to an item represents a choice by the test taker to momentarily opt out of being measured. As a result, rapid guessing tends to negatively distort scores and thereby diminish validity. Therefore, because rapid guesses do not contribute to measurement, it makes little sense to include them in scoring.

20.
Sometimes, test‐takers may not be able to attempt all items to the best of their ability (with full effort) due to personal factors (e.g., low motivation) or testing conditions (e.g., time limit), resulting in poor performances on certain items, especially those located toward the end of a test. Standard item response theory (IRT) models fail to consider such testing behaviors. In this study, a new class of mixture IRT models was developed to account for such testing behavior in dichotomous and polytomous items, by assuming test‐takers were composed of multiple latent classes and by adding a decrement parameter to each latent class to describe performance decline. Parameter recovery, effect of model misspecification, and robustness of the linearity assumption in performance decline were evaluated using simulations. It was found that the parameters in the new models were recovered fairly well by using the freeware WinBUGS; the failure to account for such behavior by fitting standard IRT models resulted in overestimation of difficulty parameters on items located toward the end of the test and overestimation of test reliability; and the linearity assumption in performance decline was rather robust. An empirical example is provided to illustrate the applications and the implications of the new class of models.
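The decrement idea can be sketched with a class-specific item response function in which effective ability declines with item position after some onset point for the speeded class. The exact parameterization below (a 2PL form, a change point, a linear decline) is an illustrative assumption rather than the article's precise specification.

```python
import numpy as np

def declining_2pl_prob(theta, a, b, position, decrement, onset):
    """Probability of a correct response when ability is effectively reduced
    by decrement * (position - onset) for items after the onset point;
    with decrement = 0 this is an ordinary 2PL item response function."""
    decline = decrement * np.maximum(0, position - onset)
    return 1.0 / (1.0 + np.exp(-a * (theta - decline - b)))

positions = np.arange(1, 41)
p_nonspeeded = declining_2pl_prob(0.0, 1.0, 0.0, positions, decrement=0.0,  onset=30)
p_speeded    = declining_2pl_prob(0.0, 1.0, 0.0, positions, decrement=0.15, onset=30)
print(p_nonspeeded[-1], p_speeded[-1])   # end-of-test items look much harder in the speeded class
```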
