Similar Documents
20 similar documents found (search time: 453 ms)
1.
Applied Measurement in Education, 2013, 26(2): 163-183
When low-stakes assessments are administered, the degree to which examinees give their best effort is often unclear, complicating the validity and interpretation of the resulting test scores. This study introduces a new method, based on item response time, for measuring examinee test-taking effort on computer-based test items. This measure, termed response time effort (RTE), is based on the hypothesis that when administered an item, unmotivated examinees will answer too quickly (i.e., before they have time to read and fully consider the item). Psychometric characteristics of RTE scores were empirically investigated and supportive evidence for score reliability and validity was found. Potential applications of RTE scores and their implications are discussed.
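The RTE measure described in this abstract is commonly operationalized as the proportion of items on which an examinee exhibited solution behavior, that is, a response time at or above an item-specific rapid-guess threshold. The sketch below illustrates that computation; the threshold values and function name are illustrative assumptions, not details taken from the article.

```python
# Illustrative sketch (not the authors' code): response time effort (RTE)
# computed as the proportion of items answered with solution behavior,
# i.e., a response time at or above an item-specific threshold.
# Threshold values below are hypothetical placeholders.

def response_time_effort(response_times, thresholds):
    """response_times[i] and thresholds[i] are seconds for item i."""
    if len(response_times) != len(thresholds):
        raise ValueError("need one threshold per item")
    solution_behavior = [rt >= th for rt, th in zip(response_times, thresholds)]
    return sum(solution_behavior) / len(solution_behavior)

# An examinee who rapid-guesses on 2 of 5 items gets RTE = 0.6.
rts = [14.2, 1.1, 22.5, 0.8, 9.3]   # observed response times (seconds)
ths = [5.0, 5.0, 8.0, 5.0, 4.0]     # assumed item thresholds (seconds)
print(response_time_effort(rts, ths))  # 0.6
```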

2.
Children, as well as adults, can be handicapped when taking a standardized test because of an unfamiliarity with the test format or with the requirements of the testing situation. This review presents a critical analysis of the skills required for test-taking, the training of test-taking skills, and the experimental evidence on the training. Based on the recommendations of psychologists such as Thorndike, Cronbach, and McClelland, practical classroom strategies for test-taking are discussed. Cautions on the pitfalls of training test-taking skills on questionable dimensions, such as on test item content, are also discussed. The review concludes with recommendations for a task-specific instructional unit which trains the necessary skills for test-taking to assure that the score on the test is an accurate measurement of the skill being assessed.

3.
Item response time data were used in investigating the differences in student test-taking behavior between two device conditions: computer and tablet. Analyses were conducted to address the questions of whether or not the device condition had a differential impact on rapid guessing and solution behaviors (with response time effort used as an indicator) as well as on the time that students spent on the test (reading, mathematics, and science) or a given item type (such as drag-and-drop and fill-in-the-blank). Further analyses were conducted to examine whether the potential impact of device conditions varied by gender and ethnicity groups. Overall, there were no significant differences in response time effort related to device, although some differences related to item type and test sequence were noted. Students tended to spend slightly more time when taking the tests and certain types of items on the tablet than on the computer. No interactions of device with gender or ethnicity were observed. Follow-up research on the item time thresholds is discussed.

4.
Assessments of student learning outcomes (SLO) have been widely used in higher education for accreditation, accountability, and strategic planning purposes. Although important to institutions, the assessment results typically bear no consequence for individual students. It is therefore important to clarify the relationship between motivation and test performance and to identify practical strategies to boost students' motivation in test taking. This study designed an experiment to examine the effectiveness of a motivational instruction. The instruction increased examinees' self-reported test-taking motivation by .89 standard deviations (SDs) and test scores by .63 SDs. Students receiving the instruction spent an average of 14 more seconds on an item than students in the control group. The score difference between the experimental and control groups narrowed to .23 SDs after unmotivated students identified by low response time were removed from the analyses. The findings provide important implications for higher education institutions that administer SLO assessments in a low-stakes setting.

5.
Whenever the purpose of measurement is to inform an inference about a student’s achievement level, it is important that we be able to trust that the student’s test score accurately reflects what that student knows and can do. Such trust requires the assumption that a student’s test event is not unduly influenced by construct-irrelevant factors that could distort his score. This article examines one such factor—test-taking motivation—that tends to induce a person-specific, systematic negative bias on test scores. Because current measurement models underlying achievement testing assume students respond effortfully to test items, it is important to identify test scores that have been materially distorted by non-effortful test taking. A method for conducting effort-related individual score validation is presented, and it is recommended that measurement professionals have a responsibility to identify invalid scores to individuals who make inferences about student achievement on the basis of those scores.

6.
We investigated motivation for taking low stakes tests. Based on expectancy-value theory, we expected that the effect of student perceptions of three task values (interest, usefulness, and importance) on low stakes test performance would be mediated by the student’s reported effort. We hypothesized that all three task value components would play a significant role in predicting test-taking effort, and that effort would significantly predict test performance. Participants were 1005 undergraduate students enrolled at four midsize public universities. After students took all four subtests of CBASE, a standardized general education exam, they immediately filled out a motivation survey. Path analyses showed that the task value variables usefulness and importance significantly predicted test-taking effort and performance for all four tests. These results provide evidence that students who report trying hard on low stakes tests score higher than those who do not. The results indicate that if students do not perceive importance or usefulness of an exam, their effort suffers and so does their test score. While the data are correlational, they suggest that it might be useful for test administrators and school staff to communicate to students the importance and usefulness of the test that they are being asked to complete.

7.
When we administer educational achievement tests, we want to be confident that the resulting scores validly indicate what the test takers know and can do. However, if the test is perceived as low stakes by the test taker, disengaged test taking sometimes occurs, which poses a serious threat to score validity. When computer-based tests are used, disengagement can be detected through occurrences of rapid-guessing behavior. This empirical study investigated the impact of a new effort monitoring feature that can detect rapid guessing, as it occurs, and notify proctors that a test taker has become disengaged. The results showed that, after a proctor notification was triggered, test-taking engagement tended to increase, test performance improved, and test scores exhibited higher convergent validation evidence. The findings of this study provide validation evidence that this innovative testing feature can decrease disengaged test taking.
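The abstract does not spell out how the effort-monitoring feature decides when to notify a proctor. The sketch below shows one plausible rule purely as an illustration: flag an examinee after a run of consecutive rapid guesses. The cutoff, run length, and callback are hypothetical, not the system studied in the article.

```python
# Minimal sketch of an effort-monitoring loop, assuming a simple rule:
# notify the proctor once an examinee produces N rapid guesses in a row.
# The threshold, run length, and callback are hypothetical placeholders.

RAPID_THRESHOLD_S = 3.0   # assumed per-item rapid-guess cutoff (seconds)
RUN_LENGTH = 3            # assumed number of consecutive rapid guesses

def monitor_examinee(item_times, notify_proctor):
    run = 0
    for item_index, seconds in enumerate(item_times):
        if seconds < RAPID_THRESHOLD_S:
            run += 1
            if run == RUN_LENGTH:
                notify_proctor(item_index)   # proctor can re-engage the examinee
                run = 0                      # reset after a notification
        else:
            run = 0

monitor_examinee([12.0, 2.1, 1.5, 0.9, 15.0],
                 lambda i: print(f"rapid guessing detected at item {i + 1}"))
```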

8.
There has been a growing research interest in the identification and management of disengaged test taking, which poses a validity threat that is particularly prevalent with low‐stakes tests. This study investigated effort‐moderated (E‐M) scoring, in which item responses classified as rapid guesses are identified and excluded from scoring. Using achievement test data composed of test takers who were quickly retested and showed differential degrees of disengagement, three basic findings emerged. First, standard E‐M scoring accounted for roughly one‐third of the score distortion due to differential disengagement. Second, a modified E‐M scoring method that used more liberal time thresholds performed better—accounting for two‐thirds or more of the distortion. Finally, the inability of E‐M scoring to account for all of the score distortion suggests the additional presence of nonrapid item responses that reflect less‐than‐full engagement by some test takers.
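As a rough illustration of the effort-moderated scoring idea in this abstract, the sketch below drops responses classified as rapid guesses before computing a proportion-correct score, and shows how a more liberal threshold excludes additional responses. The actual study scores within an IRT framework, and every number here is invented.

```python
# Rough sketch of effort-moderated scoring: exclude responses classified as
# rapid guesses (response time below an item threshold) and score the rest.
# The real study scores within an IRT model; the proportion-correct score
# and all numbers here are only for illustration.

def effort_moderated_score(correct, response_times, thresholds):
    kept = [c for c, rt, th in zip(correct, response_times, thresholds)
            if rt >= th]                      # drop rapid guesses
    return sum(kept) / len(kept) if kept else float("nan")

correct = [1, 0, 0, 1, 1]
times   = [14.0, 1.2, 4.5, 0.9, 11.0]         # seconds per item
strict  = [3.0] * 5                           # standard thresholds
liberal = [6.0] * 5                           # more liberal thresholds
print(effort_moderated_score(correct, times, strict))   # ~0.67
print(effort_moderated_score(correct, times, liberal))  # 1.0 (item 3 also excluded)
```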

9.
Assessment results collected under low-stakes testing situations are subject to effects of low examinee effort. The use of computer-based testing allows researchers to develop new ways of measuring examinee effort, particularly using response times. At the item level, responses can be classified as exhibiting either rapid-guessing behavior or solution behavior based on the item response time. Most previous research involving the study of response times has been conducted using locally developed instruments. The purpose of the current study was to examine the amount of rapid-guessing behavior within a commercially available, low-stakes instrument. Results indicate that smaller amounts of rapid-guessing behavior exist within the data compared to published results using other instruments. Additionally, rapid-guessing behavior varied by item and was significantly related to item length, item position, and presence of ancillary reading material. The amount of rapid-guessing behavior was consistently very low among various demographic subpopulations. On average, rapid-guessing behavior was observed on only 1% of item responses. Also found was that a small amount of rapid-guessing behavior can impact institutional rankings.

10.
Since the turn of the century, an increasing number of low-stakes assessments (i.e., assessments without direct consequences for the test-takers) are being used to evaluate the quality of educational systems. Internationally, research has shown that low-stakes test results can be biased due to students’ low test-taking motivation and that students’ effort levels can vary throughout a testing session involving both cognitive and noncognitive tests. Thus, it is possible that students’ motivation varies throughout a single cognitive test and in turn affects test performance. This study examines the change in test-taking motivation within a 2-h cognitive low-stakes test and its association with test performance. Based on expectancy-value theory, we assessed three components of test-taking motivation (expectancy for success, value, and effort) and investigated their change. Using data from a large-scale student achievement study of German ninth-graders, we employed second-order latent growth modeling and structural equation modeling to predict test performance in mathematics. On average, students’ effort and perceived value of the test decreased, whereas expectancy for success remained stable. Overall, initial test-taking motivation was a better predictor of test performance than change in motivation. Only the variability of change in the expectancy component was positively related to test performance. The theoretical and practical implications for test practitioners are discussed.

11.
This study examined the utility of response time‐based analyses in understanding the behavior of unmotivated test takers. For the data from an adaptive achievement test, patterns of observed rapid‐guessing behavior and item response accuracy were compared to the behavior expected under several types of models that have been proposed to represent unmotivated test taking behavior. Test taker behavior was found to be inconsistent with these models, with the exception of the effort‐moderated model. Effort‐moderated scoring was found to both yield scores that were more accurate than those found under traditional scoring, and exhibit improved person fit statistics. In addition, an effort‐guided adaptive test was proposed and shown by a simulation study to alleviate item difficulty mistargeting caused by unmotivated test taking.

12.
We discuss generalizability (G) theory and the fair and valid assessment of linguistic minorities, especially emergent bilinguals. G theory allows examination of the relationship between score variation and language variation (e.g., variation of proficiency across languages, language modes, and social contexts). Studies examining score variation across items administered in emergent bilinguals' first and second languages show that the interaction of student and the facets (sources of measurement error) item and language is an important source of score variation. Each item poses a unique set of linguistic challenges in each language, and each emergent bilingual individual has a unique set of strengths and weaknesses in each language. Based on these findings, G theory can inform the process of test construction in large-scale testing programmes and the development of testing models that ensure more valid and fair interpretations of test scores for linguistic minorities.
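For readers unfamiliar with G theory, the standard variance decomposition for a fully crossed person x item x language (p × i × l) random-facets design, together with the corresponding generalizability coefficient, is sketched below. This is textbook G-theory notation for the design the abstract implies, not a formula quoted from the article.

```latex
% Variance decomposition for a fully crossed p x i x l (person x item x
% language) random-effects design, and a generalizability coefficient for
% relative error when scores generalize over n_i items and n_l languages.
% Standard G-theory formulation, not the article's specific model.
\begin{align}
\sigma^2(X_{pil}) &= \sigma^2_p + \sigma^2_i + \sigma^2_l
  + \sigma^2_{pi} + \sigma^2_{pl} + \sigma^2_{il} + \sigma^2_{pil,e} \\
E\rho^2 &= \frac{\sigma^2_p}
  {\sigma^2_p + \frac{\sigma^2_{pi}}{n_i} + \frac{\sigma^2_{pl}}{n_l}
   + \frac{\sigma^2_{pil,e}}{n_i n_l}}
\end{align}
```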

13.
Recent developments of person-fit analysis in computerized adaptive testing (CAT) are discussed. Methods from statistical process control are presented that have been proposed to classify an item score pattern as fitting or misfitting the underlying item response theory model in CAT. Most person-fit research in CAT is restricted to simulated data. In this study, empirical data from a certification test were used. Alternatives are discussed to generate norms so that bounds can be determined to classify an item score pattern as fitting or misfitting. Using bounds determined from a sample of a high-stakes certification test, the empirical analysis showed that different types of misfit can be distinguished. Further applications using statistical process control methods to detect misfitting item score patterns are discussed.
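One family of statistical process control methods used for person fit is the CUSUM chart. The sketch below, offered only as an illustration of the general idea, accumulates standardized item-score residuals under a 2PL model at a provisional theta and flags the pattern once a bound is crossed. The residual definition, reference value k, and bound h are arbitrary choices; the article derives its bounds empirically from norms.

```python
# Illustrative CUSUM person-fit check for a CAT item-score pattern.
# Residuals are (observed - expected) item scores under a 2PL model at the
# examinee's provisional theta. The reference value k and decision bound h
# are placeholders, not the article's empirically derived bounds.

import math

def p_2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def cusum_flags(theta, scores, a_params, b_params, k=0.1, h=1.0):
    c_plus, c_minus, flags = 0.0, 0.0, []
    for x, a, b in zip(scores, a_params, b_params):
        p = p_2pl(theta, a, b)
        resid = (x - p) / math.sqrt(p * (1.0 - p))   # standardized residual
        c_plus = max(0.0, c_plus + resid - k)
        c_minus = min(0.0, c_minus + resid + k)
        flags.append(c_plus > h or c_minus < -h)     # misfit signal so far?
    return flags

# Example: a middling examinee who unexpectedly misses several easy items;
# the flags turn True once the pattern drifts from the model's expectation.
theta = 0.0
scores = [1, 1, 0, 0, 0, 1]
a = [1.0] * 6
b = [-1.5, -1.0, -1.2, -1.4, -0.8, 0.5]
print(cusum_flags(theta, scores, a, b))
```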

14.
In classical test theory, a test is regarded as a sample of items from a domain defined by generating rules or by content, process, and format specifications. If the items are a random sample of the domain, then the percent-correct score on the test estimates the domain score, that is, the expected percent correct for all items in the domain. When the domain is represented by a large set of calibrated items, as in item banking applications, item response theory (IRT) provides an alternative estimator of the domain score by transformation of the IRT scale score on the test. This estimator has the advantage of not requiring the test items to be a random sample of the domain, and of having a simple standard error. We present here resampling results in real data demonstrating for uni- and multidimensional models that the IRT estimator is also a more accurate predictor of the domain score than is the classical percent-correct score. These results have implications for reporting outcomes of educational qualification testing and assessment.
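The IRT domain-score estimator referred to here is the expected percent correct over all calibrated items in the domain bank, evaluated at the examinee's scale score. A minimal sketch, assuming a 3PL bank with invented item parameters:

```python
# Sketch of the IRT domain-score estimator: the expected percent correct
# over every calibrated item in the domain bank, evaluated at the examinee's
# scale score (theta). Item parameters are invented for illustration; the
# 3PL form with the 1.7 scaling constant is one common choice.

import math

def p_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

def domain_score(theta, bank):
    """bank: list of (a, b, c) tuples for all items in the domain."""
    return 100.0 * sum(p_3pl(theta, a, b, c) for a, b, c in bank) / len(bank)

bank = [(1.2, -0.5, 0.20), (0.8, 0.0, 0.25), (1.0, 0.7, 0.20), (1.5, 1.2, 0.15)]
print(round(domain_score(0.4, bank), 1))   # ~59.1 with these made-up parameters
```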

15.
Accountability mandates often prompt assessment of student learning gains (e.g., value-added estimates) via achievement tests. The validity of these estimates has been questioned when performance on tests is low stakes for students. To assess the effects of motivation on value-added estimates, we assigned students to one of three test consequence conditions: (a) an aggregate of test scores is used solely for institutional effectiveness purposes, (b) personal test score is reported to the student, or (c) personal test score is reported to faculty. Value-added estimates, operationalized as change in performance between two testing occasions for the same individuals where educational programming was experienced between testing occasions, were examined across conditions, in addition to the effects of test-taking motivation. Test consequences did not impact value-added estimates. Change in test-taking motivation, however, had a substantial effect on value-added estimates. In short, value-added estimates were attenuated due to decreased motivation from pretest to posttest.

16.
This article proposes a model-based procedure, intended for personality measures, for exploiting the auxiliary information provided by the certainty with which individuals answer every item (response certainty). This information is used to (a) obtain more accurate estimates of individual trait levels, and (b) provide a more detailed assessment of the consistency with which the individual responds to the test. The basis model consists of 2 submodels: an item response theory submodel for the responses, and a linear-in-the-coefficients submodel that describes the response certainties. The latter is based on the distance-difficulty hypothesis, and is parameterized as a factor-analytic model. Procedures for (a) estimating the structural parameters, (b) assessing model–data fit, (c) estimating the individual parameters, and (d) assessing individual fit are discussed. The proposal was used in an empirical study. Model–data fit was acceptable and estimates were meaningful. Furthermore, the precision of the individual trait estimates and the assessment of the individual consistency improved noticeably.

17.
In the attempt to identify or prevent unfair tests, both quantitative analyses and logical evaluation are often used. For the most part, fairness evaluation is a pragmatic attempt at determining whether procedural or substantive due process has been accorded to either a group of test takers or an individual. In both the individual and comparative approaches to test fairness, counterfactual reasoning is useful to clarify a potential charge of unfairness: Is it plausible to believe that with an alternative assessment (test or item) or under different test conditions an individual or groups of individuals may have fared better? Beyond comparative questions, fairness can also be framed by moral and ethical choices. A number of ongoing issues are evaluated with respect to these topics including accommodations, differential item functioning (DIF), differential prediction and selection, employment testing, test validation, and classroom assessment.

18.
The validity of inferences based on achievement test scores is dependent on the amount of effort that examinees put forth while taking the test. With low-stakes tests, for which this problem is particularly prevalent, there is a consequent need for psychometric models that can take into account differing levels of examinee effort. This article introduces the effort-moderated IRT model, which incorporates item response time into proficiency estimation and item parameter estimation. In two studies of the effort-moderated model when rapid guessing (i.e., reflecting low examinee effort) was present, one based on real data and the other on simulated data, the effort-moderated model performed better than the standard 3PL model. Specifically, it was found that the effort-moderated model (a) showed better model fit, (b) yielded more accurate item parameter estimates, (c) more accurately estimated test information, and (d) yielded proficiency estimates with higher convergent validity.
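In the form commonly given for the effort-moderated model, a response-time-based solution-behavior indicator switches the item response function between the 3PL curve and a flat chance probability for rapid guesses. The notation below is a hedged reconstruction of that idea, not an equation quoted from the article.

```latex
% Effort-moderated response function in the form commonly associated with
% this model: Delta_ij = 1 if examinee i's response time on item j meets the
% item's rapid-guess threshold (solution behavior), 0 otherwise; g_j is the
% chance rate for a rapid guess (e.g., 1 / number of options). Hedged
% reconstruction, not quoted from the article.
\begin{equation}
P(X_{ij} = 1 \mid \theta_i) =
  \Delta_{ij}\, \Bigl[ c_j + \frac{1 - c_j}{1 + \exp\bigl(-a_j(\theta_i - b_j)\bigr)} \Bigr]
  + (1 - \Delta_{ij})\, g_j
\end{equation}
```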

19.
Graphical Item Analysis (GIA) visually displays the relationship between the total score on a test and the response proportions of the correct and false alternatives of a multiple-choice item. The GIA method provides essential and easily interpretable information about item characteristics (difficulty, discrimination and guessing rate). Low quality items are easily detected with the GIA method because they show response proportions on the correct alternative which decrease with an increase of the total score, or display response proportions of one or more false alternatives which do not decrease with an increase of the total score. The GIA method has two main applications. Firstly, it can be used by researchers in the process of identifying items that need to be excluded from further analysis. Secondly, it can be used by test constructors in the process of improving the quality of the item bank. GIA enables a better understanding of test theory and test construction, especially for those without a background in psychometrics. In this sense, the GIA method might contribute to reducing the gap between the world of psychometrists and the practical world of constructors of achievement tests.
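A minimal sketch of what a GIA display might look like in code, using invented response proportions: for each total-score group, the proportion of examinees choosing the correct and each false alternative of one item is plotted. A well-behaved item shows the correct alternative rising with total score and the distractors falling.

```python
# Minimal sketch of a Graphical Item Analysis plot for one multiple-choice
# item. The score groups and proportions below are invented; a real analysis
# would tabulate them from examinee response data.

import matplotlib.pyplot as plt

score_groups = ["0-10", "11-20", "21-30", "31-40"]   # total-score bands
proportions = {                                       # invented data
    "A (correct)": [0.25, 0.45, 0.70, 0.90],
    "B": [0.30, 0.25, 0.15, 0.05],
    "C": [0.25, 0.20, 0.10, 0.03],
    "D": [0.20, 0.10, 0.05, 0.02],
}

for alternative, props in proportions.items():
    plt.plot(score_groups, props, marker="o", label=alternative)

plt.xlabel("Total-score group")
plt.ylabel("Proportion choosing alternative")
plt.title("Graphical Item Analysis for one item (illustrative data)")
plt.legend()
plt.show()
```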

20.
The attribute hierarchy method (AHM) is a psychometric procedure for classifying examinees' test item responses into a set of structured attribute patterns associated with different components from a cognitive model of task performance. Results from an AHM analysis yield information on examinees' cognitive strengths and weaknesses. Hence, the AHM can be used for cognitive diagnostic assessment. The purpose of this study is to introduce and evaluate a new concept for assessing attribute reliability using the ratio of true score variance to observed score variance on items that probe specific cognitive attributes. This reliability procedure is evaluated and illustrated using both simulated data and student response data from a sample of algebra items taken from the March 2005 administration of the SAT. The reliability of diagnostic scores and the implications for practice are also discussed.
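The attribute reliability concept described in this abstract can be written compactly in classical test theory notation: for the items probing attribute k, reliability is the ratio of true-score variance to observed-score variance.

```latex
% Attribute reliability as the abstract describes it: for the subset of items
% probing attribute k, the ratio of true-score variance to observed-score
% variance, with the standard classical-test-theory error decomposition.
\begin{equation}
\rho_k = \frac{\sigma^2_{T_k}}{\sigma^2_{X_k}}
       = \frac{\sigma^2_{T_k}}{\sigma^2_{T_k} + \sigma^2_{E_k}}
\end{equation}
```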

