Similar Literature
20 similar documents retrieved.
1.
Assessment results collected under low-stakes testing situations are subject to the effects of low examinee effort. The use of computer-based testing allows researchers to develop new ways of measuring examinee effort, particularly using response times. At the item level, responses can be classified as exhibiting either rapid-guessing behavior or solution behavior based on the item response time. Most previous research on response times has been conducted using locally developed instruments. The purpose of the current study was to examine the amount of rapid-guessing behavior within a commercially available, low-stakes instrument. Results indicate less rapid-guessing behavior in these data than in published results from other instruments. Additionally, rapid-guessing behavior varied by item and was significantly related to item length, item position, and the presence of ancillary reading material. The amount of rapid-guessing behavior was consistently very low among the various demographic subpopulations; on average, rapid-guessing behavior was observed on only 1% of item responses. The study also found that even a small amount of rapid-guessing behavior can affect institutional rankings.
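
As a rough illustration of the response-time classification these studies rely on, the sketch below (Python) flags an item response as rapid-guessing behavior when its response time falls below an item-specific threshold. The threshold rule (a small fraction of the item's mean response time, with a floor of a few seconds) and all function names are illustrative assumptions, not the procedure used in the study.

    # Illustrative sketch: classify item responses as rapid-guessing vs. solution
    # behavior from response times. The threshold rule is an assumption, not the
    # study's procedure.
    from statistics import mean

    def item_thresholds(times_by_item, fraction=0.10, floor_seconds=3.0):
        """Per-item time threshold (seconds) below which a response counts as a rapid guess."""
        return {item: max(floor_seconds, fraction * mean(times))
                for item, times in times_by_item.items()}

    def classify_response(item, response_time, thresholds):
        """Return 'rapid_guess' or 'solution_behavior' for one item response."""
        return "rapid_guess" if response_time < thresholds[item] else "solution_behavior"

    # Example: response times (in seconds) for two items across four examinees.
    times_by_item = {"item1": [25.0, 31.2, 2.1, 40.5], "item2": [60.3, 3.0, 55.1, 48.9]}
    thresholds = item_thresholds(times_by_item)
    print(classify_response("item1", 2.1, thresholds))   # rapid_guess
    print(classify_response("item2", 48.9, thresholds))  # solution_behavior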

2.
Previous research has shown that rapid-guessing behavior can degrade the validity of test scores from low-stakes proficiency tests. This study examined, using hierarchical generalized linear modeling, examinee and item characteristics that predict rapid-guessing behavior. Several item characteristics were significant: items with more text or those occurring later in the test were associated with increased rapid guessing, while the inclusion of a graphic in an item was associated with decreased rapid guessing. The sole significant examinee predictor was SAT total score. Implications of these results for measurement professionals developing low-stakes tests are discussed.

3.
A series of 8 tests was administered to university students over 4 weeks for program assessment purposes. The stakes of these tests were low for students; they received course points based on test completion, not test performance. Tests were administered in a counterbalanced order across 2 administrations. Response time effort, a measure of the proportion of items on which solution behavior rather than rapid-guessing behavior was used, was higher when a test was administered in the 1st week, and test scores were also higher. Differences between Week 1 and Week 4 test scores decreased when the test was scored with an effort-moderated model that took into account whether the student used solution or rapid-guessing behavior, and decreased further when students who used rapid guessing on 5 or more of the 30 items were filtered from the data set.

4.
When we administer educational achievement tests, we want to be confident that the resulting scores validly indicate what the test takers know and can do. However, if the test is perceived as low stakes by the test taker, disengaged test taking sometimes occurs, which poses a serious threat to score validity. When computer-based tests are used, disengagement can be detected through occurrences of rapid-guessing behavior. This empirical study investigated the impact of a new effort-monitoring feature that can detect rapid guessing as it occurs and notify proctors that a test taker has become disengaged. The results showed that, after a proctor notification was triggered, test-taking engagement tended to increase, test performance improved, and test scores exhibited stronger convergent validity evidence. These findings provide validation evidence that this innovative testing feature can decrease disengaged test taking.

5.
This study investigates the extent to which contextualized and non-contextualized mathematics test items have a differential impact on examinee effort. Mixture item response theory (IRT) models are applied to two subsets of items from a national assessment of mathematics in the second grade of the pre-vocational track in secondary education in Flanders. One subset focused on elementary arithmetic and consisted of non-contextualized items; the other subset, of contextualized items, focused on the application of arithmetic in authentic problem-solving situations. Results indicate that differential performance on the subsets is largely due to test effort: the non-contextualized items appear to be much more susceptible to low examinee effort in low-stakes testing situations. However, students differ in the extent to which they show low effort; one can distinguish a compliant, an underachieving, and a dropout group, and group membership is linked to relevant background characteristics.

6.
Educational program assessment studies often use data from low-stakes tests to provide evidence of program quality. The validity of scores from such tests, however, is potentially threatened by examinee noneffort. This study investigated the extent to which one type of noneffort—rapid-guessing behavior—distorted the results from three types of commonly used program assessment designs. It was found that, for each design, a modest amount of rapid guessing had a pronounced effect on the results. In addition, motivation filtering was found to be successful in mitigating the effects caused by rapid guessing. It is suggested that measurement practitioners routinely apply motivation filtering whenever the data from low-stakes tests are used to support program decisions.
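
A minimal sketch of motivation filtering as described above: examinees whose responses show too much rapid guessing are removed before group-level results are computed. The specific cutoff used here is an illustrative assumption, not the criterion from the study.

    # Illustrative motivation filtering: drop examinees who rapid-guessed on more
    # than `max_rapid_guesses` items before aggregating scores. The cutoff is an
    # assumption for illustration only.
    def motivation_filter(rapid_guess_flags, max_rapid_guesses=2):
        """rapid_guess_flags maps examinee id -> list of booleans
        (True = rapid guess on that item). Returns the ids retained for analysis."""
        return [examinee for examinee, flags in rapid_guess_flags.items()
                if sum(flags) <= max_rapid_guesses]

    rapid_guess_flags = {
        "s01": [False, False, False, False, False],  # engaged throughout -> retained
        "s02": [True, True, True, False, False],     # frequent rapid guessing -> filtered out
    }
    print(motivation_filter(rapid_guess_flags))  # ['s01']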

7.
The trustworthiness of low-stakes assessment results largely depends on examinee effort, which can be measured by the amount of time examinees devote to items using solution behavior (SB) indices. Because SB indices are calculated for each item, they can be used to understand how examinee motivation changes across items within a test. Latent class analysis (LCA) was used with the SB indices from three low-stakes assessments to explore patterns of solution behavior across items. Across tests, the favored models consisted of two classes, with Class 1 characterized by high and consistent solution behavior (>90% of examinees) and Class 2 by lower and less consistent solution behavior (<10% of examinees). Additional analyses provided supportive validity evidence for the two-class solution, with notable differences between classes in self-reported effort, test scores, gender composition, and testing context. Although results were generally similar across the three assessments, striking differences were found in the nature of the solution behavior pattern for Class 2 and the ability of item characteristics to explain the pattern. The variability in the results suggests motivational changes across items may be unique to aspects of the testing situation (e.g., content of the assessment) for less motivated examinees.

8.
This study investigated whether design guidelines for computer-based learning can be applied to computer-based testing (CBT). Twenty-two students completed a CBT exam with half of the questions presented in a split-screen format that was analogous to the original paper-and-pencil version and half in an integrated format. Results show that students attended to all information in the integrated format while ignoring information in the split format. Interestingly, and contrary to expectations, they worked more efficiently in the split format. A content analysis of the ignored information revealed that it was mostly not relevant to answering the questions, unnecessarily taxed students' cognitive capacity, and inefficiently increased the mental effort they expended. Further comparisons of different mental effort measures indicate that mental effort had an explicit component (i.e., self-reports, explicit utterances) and an implicit component (i.e., silent pauses in thinking aloud, eye-tracking parameters). Consequently, when designing CBT environments, not only the design of the tasks but also the content of the given information and its effect on the different aspects of mental effort must be considered.

9.
The validity of inferences based on achievement test scores is dependent on the amount of effort that examinees put forth while taking the test. With low-stakes tests, for which this problem is particularly prevalent, there is a consequent need for psychometric models that can take into account differing levels of examinee effort. This article introduces the effort-moderated IRT model, which incorporates item response time into proficiency estimation and item parameter estimation. In two studies of the effort-moderated model when rapid guessing (i.e., reflecting low examinee effort) was present, one based on real data and the other on simulated data, the effort-moderated model performed better than the standard 3PL model. Specifically, it was found that the effort-moderated model (a) showed better model fit, (b) yielded more accurate item parameter estimates, (c) more accurately estimated test information, and (d) yielded proficiency estimates with higher convergent validity.
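
The general form usually given for an effort-moderated model makes the probability of a correct response depend on whether the response time indicates solution behavior or a rapid guess: the 3PL function applies when solution behavior is inferred, and chance-level success applies otherwise. The statement below is a paraphrase of that idea in generic notation and may differ from the article's exact formulation.

    P(X_{ij}=1 \mid \theta_j) = SB_{ij}\,P^{\mathrm{3PL}}_i(\theta_j) + (1 - SB_{ij})\,\frac{1}{m_i},
    \qquad
    SB_{ij} = \begin{cases} 1 & t_{ij} \ge \tau_i \ (\text{solution behavior}) \\ 0 & t_{ij} < \tau_i \ (\text{rapid guess}) \end{cases}

Here t_{ij} is examinee j's response time on item i, \tau_i is the item's time threshold, and m_i is the number of response options, so 1/m_i is the chance success rate under random guessing.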

10.
According to item response theory (IRT), examinee ability estimation is independent of the particular set of test items administered from a calibrated pool. Although the most popular application of this feature of IRT is computerized adaptive (CA) testing, a recently proposed alternative is self-adapted (SA) testing, in which examinees choose the difficulty level of each of their test items. This study compared examinee performance under SA and CA tests, finding that examinees taking the SA test (a) obtained significantly higher ability scores and (b) reported significantly lower posttest state anxiety. The results of this study suggest that SA testing is a desirable format for computer-based testing.  相似文献   

11.
We examined change in test-taking effort over the course of a three-hour, five-test, low-stakes testing session. Latent growth modeling results indicated that change in test-taking effort was well represented by a piecewise growth form, wherein effort increased from test 1 to test 4 and then decreased from test 4 to test 5. There was significant variability in effort for each of the five tests, which could be predicted from examinees' conscientiousness, agreeableness, mastery approach goal orientation, and whether the examinee "skipped" or attended the initial testing session. The degree to which examinees perceived a particular test as important was related to effort for the difficult, cognitive test but not for the less difficult, noncognitive tests. There was significant variability in the rates of change in effort, which could be predicted from examinees' agreeableness. Interestingly, change in test-taking effort was not related to change in perceived test importance. Implications of these results for assessment practice and directions for future research are discussed.
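
The piecewise growth form described above can be written with two slope factors, one for the change across tests 1 through 4 and one for the change from test 4 to test 5. The time coding below is a generic illustration, not necessarily the study's exact specification.

    \text{Effort}_{tj} = \beta_{0j} + \beta_{1j}\,\min(t-1,\,3) + \beta_{2j}\,\max(t-4,\,0) + \varepsilon_{tj}, \qquad t = 1,\dots,5

With this coding, \beta_{1j} is examinee j's rate of change over tests 1 through 4 and \beta_{2j} is the change from test 4 to test 5; predictors such as agreeableness enter as covariates of these growth factors.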

12.
13.
《Applied Measurement in Education》2013,26(2):163-183
When low-stakes assessments are administered, the degree to which examinees give their best effort is often unclear, complicating the validity and interpretation of the resulting test scores. This study introduces a new method, based on item response time, for measuring examinee test-taking effort on computer-based test items. This measure, termed response time effort (RTE), is based on the hypothesis that when administered an item, unmotivated examinees will answer too quickly (i.e., before they have time to read and fully consider the item). Psychometric characteristics of RTE scores were empirically investigated and supportive evidence for score reliability and validity was found. Potential applications of RTE scores and their implications are discussed.
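
As described here and in entry 3, response time effort is the proportion of a test's items on which the examinee showed solution behavior rather than rapid guessing. Using the solution-behavior indicator from the sketch under entry 9, that definition can be written as follows (the article's own notation may differ):

    RTE_j = \frac{1}{k}\sum_{i=1}^{k} SB_{ij}

where k is the number of items, so RTE ranges from 0 (rapid guessing on every item) to 1 (solution behavior on every item).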

14.
We describe the development and administration of a recently introduced computer-based test of writing skills. This test asks the examinee to edit a writing passage presented on a computer screen. To do this, the examinee moves a cursor to a suspect section of the passage and chooses from a list of alternative ways of rewriting that section. Any or all parts of the passage can be changed, as often as the examinee likes. An able examinee identifies and fixes errors in grammar, organization, and style, whereas a less able examinee may leave errors untouched, replace an error with another error, or even introduce errors where none existed previously. All these response alternatives contrive to present both obvious and subtle scoring difficulties. These difficulties were attacked through the combined use of option weighting and the sequential probability ratio test, the result of which is to classify examinees into several discrete ability groups. Item calibration was enabled by augmenting sparse pretest samples through data meiosis, in which response vectors were randomly recombined to produce offspring that retained much of the character of their parents. These procedures are described, and operational examples are offered.
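
The "data meiosis" augmentation mentioned above amounts to creating synthetic response vectors by randomly recombining pairs of observed vectors. The Python sketch below is only a guess at what such a recombination step could look like; the item-by-item crossover rule and the names are assumptions, not the authors' algorithm.

    # Illustrative "data meiosis" step: build an offspring response vector by
    # taking each item response at random from one of two parent vectors.
    # This is a guess at the general idea, not the authors' actual procedure.
    import random

    def recombine(parent_a, parent_b, rng=random):
        """Return a synthetic response vector recombined item by item from two parents."""
        assert len(parent_a) == len(parent_b)
        return [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]

    parent_a = ["A", "C", "B", "D", "A"]
    parent_b = ["A", "B", "B", "C", "D"]
    print(recombine(parent_a, parent_b))  # e.g., ['A', 'B', 'B', 'D', 'A']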

15.
《Educational Assessment》2013,18(2):153-173
This study assessed the degree to which computer-based administration contributes to test performance differences among examinees. Inexperienced or anxious computer users answered computer-based reading and mathematics questions from a new teacher licensing test. They also answered paper-and-pencil analogues of these questions. While taking the computer-administered tests, half of the examinees had access to on-line familiarization materials only; the other half had additional help from a test supervisor. Results showed that (a) extra assistance from a test supervisor did not have a noticeable effect on test performance, (b) performance on test sections administered later during the session showed no evidence of improvement from practice on earlier sections, (c) most of the variation in performance on the computer-administered tests was explained by performance on the paper-and-pencil analogues rather than by attitudes toward computers or experience with them, and (d) examinees were more positive about the computer-based tests after testing than before. The conclusion was that on-line test familiarization proved adequate for the anxious/inexperienced computer users in the study and that computer-based test administration did not unduly affect examinee performance. The implications of the study are discussed with respect to evaluating new and emerging alternative modes of assessment.

16.
Building on item response theory (IRT), this paper introduces the characteristics of IRT and the working principles of IRT-based computerized adaptive testing. On that basis, it summarizes methods for choosing the starting point of a test and proposes an improved two-step testing procedure. By improving the test procedure, the number of administered items whose difficulty is far from the examinee's ability level is greatly reduced, shortening testing time and reducing computation while still estimating examinee ability accurately.
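
For context on the adaptive testing procedure summarized above, the sketch below shows the usual building blocks of an IRT-based CAT: a 2PL item response function, Fisher information, and selection of the not-yet-administered item with maximum information at the current ability estimate. The item pool, parameter values, and function names are made up for illustration and are not taken from the paper.

    # Minimal sketch of IRT-based adaptive item selection under a 2PL model:
    # administer the unused item with maximum Fisher information at the current
    # ability estimate. Item parameters are illustrative.
    import math

    def p_correct(theta, a, b):
        """2PL probability of a correct response."""
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))

    def information(theta, a, b):
        """Fisher information of a 2PL item at ability theta."""
        p = p_correct(theta, a, b)
        return a * a * p * (1.0 - p)

    def select_next_item(theta, pool, administered):
        """Pick the unadministered item with maximum information at theta."""
        candidates = [(item, params) for item, params in pool.items()
                      if item not in administered]
        return max(candidates, key=lambda kv: information(theta, *kv[1]))[0]

    pool = {"i1": (1.2, -1.0), "i2": (0.8, 0.0), "i3": (1.5, 0.5)}  # (a, b) per item
    print(select_next_item(theta=0.4, pool=pool, administered={"i1"}))  # 'i3'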

17.
As access and reliance on technology continue to increase, so does the use of computerized testing for admissions, licensure/certification, and accountability exams. Nonetheless, full computer-based test (CBT) implementation can be difficult due to limited resources. As a result, some testing programs offer both CBT and paper-based test (PBT) administration formats. In such situations, evidence that scores obtained from different formats are comparable must be gathered. In this study, we illustrate how contemporary statistical methods can be used to provide evidence regarding the comparability of CBT and PBT scores at the total test score and item levels. Specifically, we looked at the invariance of test structure and item functioning across test administration mode and across subgroups of students defined by SES and sex. Multiple replications of both confirmatory factor analysis and Rasch differential item functioning analyses were used to assess invariance at the factorial and item levels. Results revealed a unidimensional construct with moderate statistical support for strong factorial-level invariance across SES subgroups, and moderate support for invariance across sex. Issues involved in applying these analyses to future evaluations of the comparability of scores from different versions of a test are discussed.

18.
Assessment items are commonly field tested prior to operational use to observe statistical item properties such as difficulty. Item parameter estimates from field testing may be used to assign scores via pre-equating or computer adaptive designs. This study examined differences between item difficulty estimates based on field test and operational data and the relationship of such differences to item position changes and student proficiency estimates. Item position effects were observed for 20 assessments, with items in later positions tending to be more difficult. Moreover, field test estimates of item difficulty were biased slightly upward, which may indicate examinee knowledge of which items were being field tested. Nevertheless, errors in field test item difficulty estimates had negligible impacts on student proficiency estimates for most assessments. Caution is still warranted when using field test statistics for scoring, and testing programs should conduct investigations to determine whether the effects on scoring are inconsequential.

19.
Equating test forms is an essential activity in standardized testing, one that has gained importance with the accountability systems mandated under Adequate Yearly Progress. It is through equating that scores from different test forms become comparable, which allows for the tracking of changes in the performance of students from one year to the next. This study compares three item response theory scaling methods (fixed common item parameter, Stocking and Lord, and concurrent calibration) with respect to the classification of examinees into performance categories and the estimation of the ability parameter when the content of the test form changes slightly from year to year and the examinee ability distribution changes. The results indicate that the calibration methods, especially concurrent calibration, produced more stable results than the transformation method.
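
For readers unfamiliar with the linking step involved, the Stocking and Lord approach places one form's estimates on the other's scale with a linear transformation \theta^{*} = A\theta + B and chooses A and B to minimize the squared difference between the two forms' test characteristic curves over the common items. A generic statement of that criterion is sketched below; the notation is ours, not the article's.

    F(A, B) = \sum_{q} \left[ \sum_{i \in \text{common}} P_i\!\left(\theta_q;\, \hat a_i, \hat b_i, \hat c_i\right) - \sum_{i \in \text{common}} P_i\!\left(\theta_q;\, \frac{\hat a_i^{*}}{A},\, A\hat b_i^{*} + B,\, \hat c_i^{*}\right) \right]^{2}

Here the starred estimates come from the new calibration and the sum runs over a set of ability points \theta_q; the fixed common item parameter and concurrent calibration methods avoid this transformation by anchoring or jointly estimating the common items instead.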

20.
When an exam consists, in whole or in part, of constructed-response items, it is a common practice to allow the examinee to choose a subset of the questions to answer. This procedure is usually adopted so that the limited number of items that can be completed in the allotted time does not unfairly affect the examinee. This results in the de facto administration of several different test forms, where the exact structure of any particular form is determined by the examinee. However, when different forms are administered, a canon of good testing practice requires that those forms be equated to adjust for differences in their difficulty. When the items are chosen by the examinee, traditional equating procedures do not strictly apply due to the nonignorable nature of the missing responses. In this article, we examine the comparability of scores on such tests within an IRT framework. We illustrate the approach with data from the College Board's Advanced Placement Test in Chemistry.

