Similar Articles
20 similar articles found.
1.
《教育实用测度》2013,26(1):25-51
In this study, we compared the efficiency, reliability, validity, and motivational benefits of computerized-adaptive and self-adapted music-listening tests (referred to hereafter as CAT and SAT, respectively). Junior high school general music students completed a tonal memory CAT, a tonal memory SAT, standardized music aptitude and achievement tests, and questionnaires assessing test anxiety, demographics, and attitudes about the CAT and SAT. Standardized music test scores and music course grades served as criterion measures in the concurrent validity analysis. Results showed that the SAT elicited more favorable attitudes from examinees and yielded ability estimates that were higher and less correlated with test anxiety than did the CAT. The CAT, however, required fewer items and less administration time to match the reliability and concurrent validity of the SAT, and yielded higher levels of reliability and concurrent validity than the SAT when test length was held constant. These results reaffirm important tradeoffs between the two administration procedures observed in prior studies of vocabulary and algebra skills, with the SAT providing greater potential motivational benefits and the CAT providing greater efficiency. Implications and questions for future research are discussed.

2.
This study focused on the effects of administration mode (computer-adaptive test [CAT] versus self-adaptive test [SAT]), item-by-item answer feedback (present versus absent), and test anxiety on results obtained from computerized vocabulary tests. Examinees were assigned at random to four testing conditions (CAT with feedback, CAT without feedback, SAT with feedback, SAT without feedback). Examinees completed the Test Anxiety Inventory (Spielberger, 1980) before taking their assigned computerized tests. Results showed that the CATs were more reliable and took less time to complete than the SATs. Administration time for both the CATs and SATs was shorter when feedback was provided than when it was not, and this difference was most pronounced for examinees at medium to high levels of test anxiety. These results replicate prior findings regarding the precision and administrative efficiency of CATs and SATs but point to new possible benefits of including answer feedback on such tests.

3.
According to item response theory (IRT), examinee ability estimation is independent of the particular set of test items administered from a calibrated pool. Although the most popular application of this feature of IRT is computerized adaptive (CA) testing, a recently proposed alternative is self-adapted (SA) testing, in which examinees choose the difficulty level of each of their test items. This study compared examinee performance under SA and CA tests, finding that examinees taking the SA test (a) obtained significantly higher ability scores and (b) reported significantly lower posttest state anxiety. The results of this study suggest that SA testing is a desirable format for computer-based testing.
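The invariance property this abstract relies on, that ability estimates do not depend on which calibrated items happen to be administered, can be illustrated with a minimal sketch. The Rasch model and the grid-search maximum-likelihood estimator below are our own illustrative choices, not taken from the study:

```python
import numpy as np

def rasch_mle(responses, difficulties, grid=None):
    """Grid-search maximum-likelihood ability estimate under the Rasch
    model: P(correct | theta, b) = 1 / (1 + exp(-(theta - b)))."""
    if grid is None:
        grid = np.linspace(-4, 4, 801)
    x = np.asarray(responses, float)
    b = np.asarray(difficulties, float)
    # probability matrix: one row per candidate theta, one column per item
    p = 1.0 / (1.0 + np.exp(-(grid[:, None] - b[None, :])))
    loglik = (x * np.log(p) + (1 - x) * np.log(1 - p)).sum(axis=1)
    return float(grid[np.argmax(loglik)])
```

Simulating one examinee answering an easy item pool and a hard item pool yields two estimates that both land near the generating ability, which is why CA and SA tests can draw different items from the same calibrated pool and still report comparable scores.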

4.
The development of statistical methods for detecting test collusion is a new research direction in the area of test security. Test collusion may be described as large-scale sharing of test materials, including answers to test items. Current methods of detecting test collusion are based on statistics also used in answer-copying detection. Therefore, in computerized adaptive testing (CAT) these methods lose power because the actual test varies across examinees. This article addresses that problem by introducing a new approach that works in two stages: in Stage 1, test centers with an unusual distribution of a person-fit statistic are identified via Kullback–Leibler divergence; in Stage 2, examinees from identified test centers are analyzed further using the person-fit statistic, where the critical value is computed without data from the identified test centers. The approach is extremely flexible. One can employ any existing person-fit statistic. The approach can be applied to all major testing programs: paper-and-pencil testing (P&P), computer-based testing (CBT), multiple-stage testing (MST), and CAT. Also, the definition of test center is not limited by the geographic location (room, class, college) and can be extended to support various relations between examinees (from the same undergraduate college, from the same test-prep center, from the same group at a social network). The suggested approach was found to be effective in CAT for detecting groups of examinees with item pre-knowledge, meaning those with access (possibly unknown to us) to one or more subsets of items prior to the exam.
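The two-stage screening described above can be sketched as follows. This is our own minimal reading of the procedure: it assumes a discretized (histogram) Kullback–Leibler divergence in Stage 1 and a lower-tail cutoff on the person-fit statistic in Stage 2; the function names, bin count, flagging threshold, and alpha level are all illustrative, not from the article:

```python
import numpy as np

def kl_divergence(p_counts, q_counts, eps=1e-9):
    """Discrete KL divergence D(p || q) from histogram counts,
    with smoothing so empty bins do not produce log(0)."""
    p = np.asarray(p_counts, float) + eps
    q = np.asarray(q_counts, float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def flag_centers(fit_by_center, bins=20, threshold=0.5):
    """Stage 1: flag test centers whose person-fit distribution
    diverges from the pooled distribution of all other centers."""
    all_vals = np.concatenate(list(fit_by_center.values()))
    edges = np.histogram_bin_edges(all_vals, bins=bins)
    flagged = []
    for center, vals in fit_by_center.items():
        others = np.concatenate(
            [v for c, v in fit_by_center.items() if c != center])
        p, _ = np.histogram(vals, bins=edges)
        q, _ = np.histogram(others, bins=edges)
        if kl_divergence(p, q) > threshold:
            flagged.append(center)
    return flagged

def flag_examinees(fit_by_center, flagged, alpha=0.01):
    """Stage 2: compute the critical value only from centers that were
    NOT flagged, then screen examinees in flagged centers (low
    person-fit values are treated as aberrant)."""
    clean = np.concatenate(
        [v for c, v in fit_by_center.items() if c not in flagged])
    crit = np.quantile(clean, alpha)
    return {c: fit_by_center[c][fit_by_center[c] < crit] for c in flagged}
```

Computing the critical value without the flagged centers, as in Stage 2 above, keeps a contaminated center from dragging the reference distribution toward the very aberrance one is trying to detect.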

5.
Examinees who take high-stakes assessments are usually given an opportunity to repeat the test if they are unsuccessful on their initial attempt. To prevent examinees from obtaining unfair score increases by memorizing the content of specific test items, testing agencies usually assign a different test form to repeat examinees. The use of multiple forms is expensive and can present psychometric challenges, particularly for low-volume credentialing programs; thus, it is important to determine if unwarranted score gains actually occur. Prior studies provide strong evidence that the same-form advantage is pronounced for aptitude tests. However, the sparse research within the context of achievement and credentialing testing suggests that the same-form advantage is minimal. For the present experiment, 541 examinees who failed a national certification test were randomly assigned to receive either the same test or a different (parallel) test on their second attempt. Although the same-form group had shorter response times on the second administration, score gains for the two groups were indistinguishable. We discuss factors that may limit the generalizability of these findings to other assessment contexts.

6.
The purpose of this study is to apply the attribute hierarchy method (AHM) to a subset of SAT critical reading items and illustrate how the method can be used to promote cognitive diagnostic inferences. The AHM is a psychometric procedure for classifying examinees' test item responses into a set of attribute mastery patterns associated with different components from a cognitive model. The study was conducted in two steps. In step 1, three cognitive models were developed by reviewing selected literature in reading comprehension as well as research related to SAT Critical Reading. Then, the cognitive models were validated by having a sample of students think aloud as they solved each item. In step 2, psychometric analyses were conducted on the SAT critical reading cognitive models by evaluating the model-data fit between the expected and observed response patterns produced from two random samples of 2,000 examinees who wrote the items. The model that provided the best model-data fit was then used to calculate attribute probabilities for 15 examinees to illustrate our diagnostic testing procedure.

7.
The authors sought to better understand the relationship between students participating in the Advanced Placement (AP) program and subsequent performance on the Scholastic Aptitude Test (SAT). Focusing on students graduating from U.S. public high schools in 2010, the authors used propensity scores to match junior year AP examinees in 3 subjects to similar students who did not take any AP exams in high school. Multilevel regression models with these matched samples demonstrate a mostly positive relationship between AP exam participation and senior year SAT performance, particularly for students who score a 3 or higher. Students who enter into the AP year with relatively lower initial achievement are predicted to perform slightly better on later SAT tests than students with similar initial achievement who do not participate in AP.

8.
Recent simulation studies indicate that there are occasions when examinees can use judgments of relative item difficulty to obtain positively biased proficiency estimates on computerized adaptive tests (CATs) that permit item review and answer change. Our purpose in the study reported here was to evaluate examinees' success in using these strategies while taking CATs in a live testing setting. We taught examinees two item difficulty judgment strategies designed to increase proficiency estimates. Examinees who were taught each strategy and examinees who were taught neither strategy were assigned at random to complete vocabulary CATs under conditions in which review was allowed after completing all items and when review was allowed only within successive blocks of items. We found that proficiency estimate changes following review were significantly higher in the regular review conditions than in the strategy conditions. Failure to obtain systematically higher scores in the strategy conditions was due in large part to errors examinees made in judging the relative difficulty of CAT items.

9.
Computerized adaptive testing in instructional settings
Item response theory (IRT) has most often been used in research on computerized adaptive testing (CAT). Depending on the model used, IRT requires between 200 and 1,000 examinees for estimating item parameters. Thus, it is not practical for instructional designers to develop their own CAT based on the IRT model. Frick improved Wald's sequential probability ratio test (SPRT) by combining it with normative expert systems reasoning, referred to as an EXSPRT-based CAT. While previous studies were based on re-enactments from historical test data, the present study is the first to examine how well these adaptive methods function in a real-time testing situation. Results indicate that the EXSPRT-I significantly reduced test lengths and was highly accurate in predicting mastery. EXSPRT is apparently a viable and practical alternative to IRT for assessing mastery of instructional objectives.
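Wald's SPRT, which the EXSPRT approach builds on, classifies an examinee as master or nonmaster from a running log-likelihood ratio that stops testing as soon as a decision boundary is crossed. A minimal sketch follows; the mastery and nonmastery proportions (p1, p0) and error rates are illustrative values of our choosing, and the normative expert-system extension is not shown:

```python
import math

def sprt_mastery(responses, p0=0.6, p1=0.85, alpha=0.05, beta=0.05):
    """Wald's sequential probability ratio test for mastery testing.
    H1 (master): P(correct) = p1; H0 (nonmaster): P(correct) = p0.
    Returns (decision, items_used); decision is 'master', 'nonmaster',
    or None if no boundary was crossed before items ran out."""
    upper = math.log((1 - beta) / alpha)   # cross above -> accept H1
    lower = math.log(beta / (1 - alpha))   # cross below -> accept H0
    llr = 0.0
    for n, x in enumerate(responses, start=1):
        # add the log-likelihood ratio of this single response
        llr += math.log(p1 / p0) if x else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "master", n
        if llr <= lower:
            return "nonmaster", n
    return None, len(responses)
```

With these defaults a run of nine consecutive correct answers is enough to declare mastery, while four consecutive errors ends the test with a nonmastery decision, which is the source of the test-length savings the abstract reports.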

10.
In this study, the authors explored the importance of item difficulty (equated delta) as a predictor of differential item functioning (DIF) of Black versus matched White examinees for four verbal item types (analogies, antonyms, sentence completions, reading comprehension) using 13 GRE-disclosed forms (988 verbal items) and 11 SAT-disclosed forms (935 verbal items). The average correlation across test forms for each item type (and often the correlation for each individual test form as well) revealed a significant relationship between item difficulty and DIF value for both GRE and SAT. The most important finding indicates that for hard items, Black examinees perform differentially better than matched ability White examinees for each of the four item types and for both the GRE and SAT tests! The results further suggest that the amount of verbal context is an important determinant of the magnitude of the relationship between item difficulty and differential performance of Black versus matched White examinees. Several hypotheses accounting for this result were explored.

11.
Results obtained from computer-adaptive and self-adaptive tests were compared under conditions in which item review was permitted and not permitted. Comparisons of answers before and after review within the "review" condition showed that a small percentage of answers was changed (5.23%), that more answers were changed from wrong to right than from right to wrong (by a ratio of 2.92:1), that most examinees (66.5%) changed answers to at least some questions, that most examinees who changed answers improved their ability estimates by doing so (by a ratio of 2.55 to 1), and that review was particularly beneficial to examinees at high ability levels. Comparisons between the "review" and "no-review" conditions yielded no significant differences in ability estimates or in estimated measurement error and provided no trustworthy evidence that test anxiety moderated the effects of review on those indexes. Most examinees desired review, but permitting it increased testing time by 41%.

12.
The continuous testing framework, where both successful and unsuccessful examinees have to demonstrate continued proficiency at frequent prespecified intervals, is a framework that is used in noncognitive assessment and is gaining in popularity in cognitive assessment. Despite the rigorous advantages of this framework, this paper demonstrates that there is significant inflation in false negatives as both passers and failers continually take a test, especially for examinees close to the passing score. Several passing policies are investigated to control the inflation of false negatives while maintaining low false-positive rates for fixed-length tests. Lastly, recommendations are made for testing professionals who wish to utilize the rigorous nature of the continuous testing framework while avoiding inflated failure rates among qualified examinees.
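The inflation mechanism is easy to reproduce: a true master whose ability sits just above the passing score has a per-administration pass probability only modestly above one half, so the chance of at least one spurious failure grows with every required retest. The sketch below is our own illustration, assuming a normally distributed observed score with a constant standard error of measurement and independent administrations (the paper's models and policies are not reproduced here):

```python
from math import erf, sqrt

def pass_prob(theta, cut, sem):
    """P(observed score >= cut) on a single administration, assuming
    the observed score is normally distributed around the true ability
    theta with standard error of measurement sem."""
    z = (theta - cut) / sem
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def false_negative_rate(theta, cut, sem, administrations):
    """P(a true master, theta >= cut, fails at least one of the
    required administrations), assuming independent errors."""
    p = pass_prob(theta, cut, sem)
    return 1.0 - p ** administrations
```

For an examinee two-thirds of a standard error above the cut, the single-administration failure chance of roughly a quarter compounds to well over two-thirds after five required administrations, which is the inflation the passing policies in the article are designed to control.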

13.
We investigated students' metacognitive experiences with regard to feelings of difficulty (FD), feelings of satisfaction (FS), and estimate of effort (EE), employing either computerized adaptive testing (CAT) or computerized fixed item testing (FIT). In an experimental approach, 174 students in grades 10 to 13 were tested either with a CAT or a FIT version of a matrices test. Data revealed that metacognitive experiences were not related to the resulting test scores for CAT: test takers who took the matrices test in an adaptive mode were paradoxically more satisfied with their performance the worse they had performed in terms of the resulting ability parameter. They also rated the test as easier the lower they had performed, but their estimates of effort were higher the better they had performed. For test takers who took the FIT version, completely different results were revealed. In line with previous results, test takers were supposed to base these experiences on the subjectively estimated percentage of items solved. This moderated mediation hypothesis was in part confirmed, as the relation between the percentage of items solved and FD, FS, and EE was revealed to be mediated by the estimated percentage of items solved. Results are discussed with reference to feedback acceptance, errant self-estimations, and test fairness with regard to a possible false regulation of effort in lower ability groups when using CAT.

14.
《Educational Assessment》2013,18(4):295-308
Performance on the reading comprehension (RC) tasks of the Scholastic Assessment Test-I (SAT-I or the "new" SAT), the Enhanced American College Testing Assessment (ACT), and the Graduate Record Examination (GRE) when passages were missing was examined. For the SAT-I and ACT, scores were well above chance and correlated substantially with verbal score on the earlier version of the SAT, indicating that examinees perform similarly with or without passages. Comparable but weaker results were found for the GRE. The findings raise doubts about the construct validity of the RC task. We argue that performance is influenced by the plausibility of item choices with or without the passages and that this, in turn, is the result of the construction of test items with little knowledge of the underlying reading process.

15.
The Fundamental Conflict Between Item Confidentiality and Psychometric Research in High-Stakes Testing
Taking China's National College Entrance Examination (gaokao) as an example, this paper analyzes how the high-stakes nature of an examination constrains the feasibility of psychometric item analysis. A high-stakes examination demands strict item confidentiality, yet optimizing item quality requires pretesting items on potential examinees, and these two requirements are fundamentally at odds. The paper introduces several qualitative research methods and recommends relying primarily on these qualitative methods during the development stage of high-stakes examinations.

16.
The problem that scores do not accurately represent examinees' true language ability is the most fundamental and intractable problem in language testing: the validity problem. Measures adopted in the past, such as increasing the number of raters and re-scoring, have improved validity to some extent, but none can truly yield an objective score as close as possible to the true score. Longford proposed four score-adjustment models to address the reliability problems of subjective scoring. This paper applies the severity-adjustment model to scores given by aberrant raters on the HSK (Advanced) essay task; the adjusted scores improved considerably. In future administrations, this mathematical adjustment method can therefore largely replace the previous practice of organizing raters to re-score.

17.
This study seeks to develop a better understanding of the underrepresentation of women in science and engineering by analyzing the gender gaps (a) in interest in pursuing a science degree and (b) in science achievement. We use national-level college admissions data to examine gender differences and to explore the association between these outcomes and attendance at single-sex or coeducational schools. The Chilean college admissions system provides a unique context to study these gender differences, since applicants who wish to pursue an undergraduate degree in science or engineering are required to take a high-stakes standardized science achievement test as part of the admission battery. This test has three subjects: biology, physics, and chemistry, and applicants must choose to be tested in only one of them. Significant gender differences exist in examinees' choice of subject and in achievement on the tests. Gender gaps favoring males are observed in all three subjects. Both interest and achievement in science are associated with the sex composition of the school attended.

18.
For a certification, licensure, or placement exam, allowing examinees to take multiple attempts at the test could effectively change the pass rate. Change in the pass rate can occur without any change in the underlying latent trait, and can be an artifact of multiple attempts and imperfect reliability of the test. By deriving formulae to compute the pass rate under two definitions, this article provides tools for testing practitioners to compute and evaluate the change in the expected pass rate when a certain (maximum) number of attempts is allowed without any change in the latent trait. This article also includes a simulation study that considers change in ability and differential motivation of examinees to retake the test. Results indicate that the general trend shown by the analytical results is maintained: the marginal expected pass rate increases with more attempts when the testing volume is defined as the total number of test takers, and decreases with more attempts when the testing volume is defined as the total number of test attempts.
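The two pass-rate definitions can be made concrete with a small computation. The sketch below assumes each examinee has a fixed per-attempt pass probability, independent attempts, and retakes until passing or exhausting the allowed attempts; this is our simplification and deliberately ignores the ability change and differential motivation studied in the article's simulation:

```python
import numpy as np

def pass_rates(per_attempt_p, max_attempts):
    """Expected pass rates under two testing-volume definitions.
    per_attempt_p: per-attempt pass probability for each examinee
    (all entries must be > 0). Examinees retake until they pass or
    exhaust max_attempts, with independent attempts."""
    p = np.asarray(per_attempt_p, float)
    q = 1.0 - p
    # P(pass on some attempt among the first k)
    pass_within_k = 1.0 - q ** max_attempts
    # expected attempts per examinee: sum_{j=0}^{k-1} q^j = (1 - q^k) / p
    exp_attempts = pass_within_k / p
    rate_per_taker = pass_within_k.mean()
    rate_per_attempt = pass_within_k.sum() / exp_attempts.sum()
    return rate_per_taker, rate_per_attempt
```

Even in this stripped-down setting the trend the abstract describes appears: with more allowed attempts, the per-taker rate rises (more people eventually pass) while the per-attempt rate falls (failers contribute extra attempts to the denominator).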

19.
In this study data were examined from several national testing programs to determine whether the change from paper-based administration to computer-based tests (CBTs) influences group differences in performance. Performances by gender, racial, and ethnic groups on the Graduate Record Examination General Test, Graduate Management Admissions Test, SAT I: Reasoning Test, and Praxis: Professional Assessment for Beginning Teachers, were analyzed to determine whether the shift in testing format from paper-and-pencil tests to CBTs posed a disadvantage to any of these subgroups, beyond that already identified for paper-based tests. Although all differences were quite small, some consistent patterns were found for some racial-ethnic and gender groups. African-American examinees and, to a lesser degree, Hispanic examinees appear to benefit from the CBT format. On some tests, female examinees' performance was relatively lower on the CBT version.

20.
Computerized adaptive testing (CAT) and multistage testing (MST) have become two of the most popular modes in large-scale computer-based sequential testing. Though most designs of CAT and MST exhibit strengths and weaknesses in recent large-scale implementations, there is no simple answer to the question of which design is better because different modes may fit different practical situations. This article proposes a hybrid adaptive framework to combine both CAT and MST, inspired by an analysis of the history of CAT and MST. The proposed procedure is a design that transitions from a group sequential design to a fully sequential design. This allows for the robustness of MST in early stages, but also shares the advantages of CAT in later stages, with fine tuning of the ability estimator once its neighborhood has been identified. Simulation results showed that hybrid designs following our proposed principles provided comparable or even better estimation accuracy and efficiency than standard CAT and MST designs, especially for examinees at the two ends of the ability range.
