期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Using Response Time to Detect Item Preknowledge in Computer‐Based Licensure Examinations

Hong Qian Dorota Staniewska Mark Reckase Ada Woo 《Educational Measurement》2016,35(1):38-47

This article addresses the issue of how to detect item preknowledge using item response time data in two computer‐based large‐scale licensure examinations. Item preknowledge is indicated by an unexpected short response time and a correct response. Two samples were used for detecting item preknowledge for each examination. The first sample was from the early stage of the operational test and was used for item calibration. The second sample was from the late stage of the operational test, which may feature item preknowledge. The purpose of this research was to explore whether there was evidence of item preknowledge and compromised items in the second sample using the parameters estimated from the first sample. The results showed that for one nonadaptive operational examination, two items (of 111) were potentially exposed, and two candidates (of 1,172) showed some indications of preknowledge on multiple items. For another licensure examination that featured computerized adaptive testing, there was no indication of item preknowledge or compromised items. Implications for detected aberrant examinees and compromised items are discussed in the article. 相似文献

2.

Embedded Field Test Item Statistics: Can They Be Trusted for Estimating Student Proficiency?

Jeffrey T. Steedle Kristin M. Morrison 《Educational Assessment》2019,24(1):1-12

Assessment items are commonly field tested prior to operational use to observe statistical item properties such as difficulty. Item parameter estimates from field testing may be used to assign scores via pre-equating or computer adaptive designs. This study examined differences between item difficulty estimates based on field test and operational data and the relationship of such differences to item position changes and student proficiency estimates. Item position effects were observed for 20 assessments, with items in later positions tending to be more difficult. Moreover, field test estimates of item difficulty were biased slightly upward, which may indicate examinee knowledge of which items were being field tested. Nevertheless, errors in field test item difficulty estimates had negligible impacts on student proficiency estimates for most assessments. Caution is still warranted when using field test statistics for scoring, and testing programs should conduct investigations to determine whether the effects on scoring are inconsequential. 相似文献

3.

Comparison of Methods for Combining the Minimum Passing Levels for Individual Items into a Passing Score for a Test

Barbara S. Plake Michael T. Kane 《Journal of Educational Measurement》1991,28(3):249-256

The purpose of this study was to compare several methods for determining a passing score on an examination from the individual raters' estimates of minimal pass levels for the items. The methods investigated differ in the weighting that the estimates for each item receive in the aggregation process. An IRT-based simulation method was used to model a variety of error components of minimum pass levels. The results indicate little difference in estimated passing scores across the three methods. Less error was present when the ability level of the minimally competent candidates matched the expected difficulty level of the test. No meaningful improvement in passing score estimation was achieved for a 50-item test as opposed to a 25-item test; however, the RMSE values for estimates with 10 raters were smaller than those for 5 raters. The results suggest that the simplest method for aggregating minimum pass levels across the items in a test–adding them up–is the preferred method. 相似文献

4.

Psychometric Aspects of Maintaining Standards of Examinations

C. A. W. Glas 《教育心理学》1988,8(4):257-270

Through pilot studies and regular examination procedures, the National Institute for Educational Measurement (CITO) in The Netherlands has gathered experience with different methods of maintaining the standards of examinations. The present paper presents an overview of the psychometric aspects of the various approaches that can be chosen for the maintenance of standards. Generally speaking, the approaches to the problem, can be divided into two classes. In the first approach the examinations are a fixed factor, i.e. the examination is already constructed and cannot be changed, and the link between the standards of both examinations is created by some test equating design. In the second approach the items of both examinations are selected from a pre‐tested pool of items, in such a way that two equivalent examinations are constructed. In both approaches the statistical problems of simultaneously modelling possible differences in the ability level of different groups of examinees and differences in the difficulty of the items are solved within the framework of item response theory. It is shown that applying the Rasch model for dichotomous and polytomous items results in a variety of possible test‐equating designs which adequately deal with the restrictions imposed by the practical conditions related to the fact that the equating involves examinations. Especially the requirement of secrecy of the content of new examinations must be taken into account. Finally it is shown that, given a pool of pre‐tested items, optimisation techniques can be used to construct equivalent examinations. 相似文献

5.

天津市初等信息技术考试标准设定方法的研究与实践

高淑印郑刚《考试研究》2013,(4):76-83

天津市初等信息技术考试是面向社会测试应试者计算机应用能力的评测系统,作为一种标准参照考试,从2004年开始实施以来,一直以60分作为合格标准,但实践证明,60分并不能作为判断考生是否合格的永恒标准。该考试系统是上机考试,社会考生自愿报名参加,考试对象年龄差异较大,覆盖小学2-6年级,且每个级别会有不同年龄学生参加,60分的划界分数忽略了每次参加测试的被试者的平均能力不同这一事实,也忽略了同一次考试不同考生抽到的题目不完全一致的事实。这样可能会产生一个问题,即我们只能了解考生的相对能力和相对位置。如果不能正确地将考生归入恰当的等级类别中,这种等级考试的价值就会受很大影响。因此,本文对该考试系统的"合格"标准分数的设定进行研究,利用Angoff法设定划界分数,客观地应用到被试群体中,在提高考试信度、效度的研究与应用方面进行了有益的探索。相似文献

6.

Comparison of six examinations given in Rhetoric 101, at the University of Illinois,fall 1965

Lawrence M. Aleamoni Sarah B. Eitelbach 《Research in higher education》1976,4(4):347-354

Two forms of the College Entrance Examination Board's (CEEB) English Composition Test were compared with four rhetoric final examinations in the basic English composition course at the University of Illinois at Urbana-Champaign during fall semester 1965. This comparison was made to determine if the CEEB test could be used to predict course grade as well as the departmental final examination. In addition, an analysis of the examinations was conducted to determine the stability of the tests. The results indicated that the CEEB test was much more stable and yielded better item statistics, which seemed to characterize a norm-referenced measuring instrument. The departmental examinations, on the other hand, were more highly related to course grade and seemed to more nearly characterize a criterion-referenced measuring instrument. 相似文献

7.

计算机自适应考试系统研究

吕岚《晋城职业技术学院学报》2013,6(4):56-59

本文结合专家经验确定法和项目反应理论,设计出一种简明、实用的计算机自适应考试系统的试题难度确定方法,同时重点分析计算机自适应考试系统的测试起点、终点选择,选题策略和能力值估计方法。最后列举了一个自适应测试的步骤实例。本系统能够根据不同能力被试者随机选择试题项目,减少了测试长度,与传统在线考试系统相比提高了考试效率。相似文献

8.

Effect of Two Selected Item-Writing Practices on Test Difficulty,Discrimination, and Reliability

Cynthia B. Schmeiser Douglas R. Whitney 《Journal of Experimental Education》2013,81(3):30-34

In order to investigate the effect of two item-writing practices on test characteristics, examinations were chosen for study in two undergraduate courses (N = 71 and 210) . About one-fourth of the items on each examination included a practice generally regarded as undesirable in measurement textbooks and alleged to make test items more difficult. Alternate forms which eliminated the undesirable practice were developed and administered at the same time as the original form. Rewriting item stems so that they formed a complete sentence or question resulted in about 6 percent more students answering items correctly. Eliminating unnecessary material in item stems, however, had little effect on difficulty. KR₂₀ values were not appreciably different for the two versions of either test. Neither flaw was found to affect item discrimination indices noticeably. The absence of any substantial practice-by-achievement level interactions suggested little effect of the practices on the validity of the tests. 相似文献

9.

Open-book tests in a university course

Krarup Niels Naeraa Noe Olsen Christian 《Higher Education》1974,3(2):157-164

In the few available studies on the use of books in examinations, open-book tests have been found to reduce pre-test memorization and anxiety during examinations without affecting academic performance. However, these studies were made with students in non-book systems, whereas systems which allowed books in all exams might be thought likely to create a non-fact-learning attitude in students. The present study was undertaken in a book-allowing system with 120 students during a regular course in physiology at a medical school. Each group sat two parallel 60-item multiple choice tests and used books in one test but not in the other. The tests took place about four weeks prior to the final examination, which is of the same type as the experimental tests. Recall items could yield less than 15% of maximum points, so that interpretation and problem-solving items predominated. Total test points with and without books did not differ significantly. An analysis of variance showed that the effect of books on recall items was only slight and that the two tests varied in difficulty, in spite of efforts to secure equality. 相似文献

10.

如何改进考试的分数报告

韩宁《考试研究》2009,(4):68-78

分数报告是实现教育考试功能的重要环节,考试机构应该树立把考生作为消费者的观念,提供尽可能准确、充分、易于理解的分数信息服务。本文指出了考试分数报告中的常见问题,介绍了AERA／APA／NCME行业标准对考试分数报告的要求,并讨论分数报告设计的基本原则,同时还对几个技术细节问题进行详细探讨,在题目映射、垂直量表、诊断性分数报告等几个环节介绍了具体可行的做法。相似文献

11.

Decision making for borderline cases in pass/fail clinical anatomy courses: The practical value of the standard error of measurement and likelihood ratio in a diagnostic test

Milton Severo Fernanda Silva‐Pereira Maria Amélia Ferreira 《Anatomical sciences education》2013,6(3):157-162

Several studies have shown that the standard error of measurement (SEM) can be used as an additional “safety net” to reduce the frequency of false‐positive or false‐negative student grading classifications. Practical examinations in clinical anatomy are often used as diagnostic tests to admit students to course final examinations. The aim of this study was to explore the diagnostic value of SEM using the likelihood ratio (LR) in establishing decisions about students with practical examination scores at or below the pass/fail cutoff score in a clinical anatomy course. Two hundred sixty‐seven students took three clinical anatomy practical examinations in 2011. The students were asked to identify 40 anatomical structures in images and prosected specimens in the practical examination. Practical examination scores were then divided according to the following cutoff scores: 2, 1 SEM below, and 0, 1, 2 SEM above the pass score. The positive predictive value (+PV) and LR of passing the final examination were estimated for each category to explore the diagnostic value of practical examination scores. The +PV (LR) in the six categories defined by the SEM was 39.1% (0.08), 70.0% (0.30), 88.9% (1.04), 91.7% (1.43), 95.8% (3.00), and 97.8% (5.74), respectively. The LR of categories 2 SEM above/below the pass score generated a moderate/large shift in the pre‐ to post‐test probability of passing. The LR increased the usefulness and practical value of SEM by improving confidence in decisions about the progress of students with borderline scores 2 SEM above/below the pass score in practical examinations in clinical anatomy courses. Anat Sci Educ. © 2013 American Association of Anatomists. 相似文献

12.

Teaching social studies to learning disabled high school students: effects of a hypertext study guide

Steven V Horton Randall A Boone Thomas C Lovitt 《British journal of educational technology : journal of the Council for Educational Technology》1990,21(2):118-131

This study investigated the effectiveness of a computer-based study guide using hypertext software to increase textbook comprehension among four learning disabled students enrolled in a remedial high school social studies class. The program provided four levels of instructional cues that matched students to their highest level of independent interaction with a textbook passage, based on item-to-item responses to computer-generated questions. Using alternative forms of a 45-item multiple-choice test, a pre-test/post-test design was arranged, with a retention test given after a 30-day period. Fifteen questions were designated as control items by placing them in the 45-item tests but not in the computer treatment. The computer program consisted of three separate lessons administered across consecutive class sessions, with each followed by a written 15-item multiple choice test containing 10 computer questions and 5 control items. Results indicated a significant gain for pupils on computer items from pre-test to post-test and from pre-test to retention test, while no significant change occurred on control items across measures. A single-case analysis revealed a consistent relationship between gain scores on computer items, reading time on computer, and the number of instructional cues required by students. Two types of non-linear pathways that teacher might consider when constructing study guides are discussed. 相似文献

13.

The implications of content versus item validity on science tests

William L. Yarroch 《科学教学研究杂志》1991,28(7):619-629

The use of content validity as the primary assurance of the measurement accuracy for science assessment examinations is questioned. An alternative accuracy measure, item validity, is proposed. Item validity is based on research using qualitative comparisons between (a) student answers to objective items on the examination, (b) clinical interviews with examinees designed to ascertain their knowledge and understanding of the objective examination items, and (c) student answers to essay examination items prepared as an equivalent to the objective examination items. Calculations of item validity are used to show that selected objective items from the science assessment examination overestimated the actual student understanding of science content. Overestimation occurs when a student correctly answers an examination item, but for a reason other than that needed for an understanding of the content in question. There was little evidence that students incorrectly answered the items studied for the wrong reason, resulting in underestimation of the students' knowledge. The equivalent essay items were found to limit the amount of mismeasurement of the students' knowledge. Specific examples are cited and general suggestions are made on how to improve the measurement accuracy of objective examinations. 相似文献

14.

Do Different Approaches to Examining Construct Comparability in Multilanguage Assessments Lead to Similar Conclusions?

Maria E. Oliveri Kadriye Ercikan 《教育实用测度》2013,26(4):349-366

In this study, we examine the degree of construct comparability and possible sources of incomparability of the English and French versions of the Programme for International Student Assessment (PISA) 2003 problem-solving measure administered in Canada. Several approaches were used to examine construct comparability at the test- (examination of test data structure, reliability comparisons and test characteristic curves) and item-levels (differential item functioning, item parameter correlations, and linguistic comparisons). Results from the test-level analyses indicate that the two language versions of PISA are highly similar as shown by similarity of internal consistency coefficients, test data structure (same number of factors and item factor loadings) and test characteristic curves for the two language versions of the tests. However, results of item-level analyses reveal several differences between the two language versions as shown by large proportions of items displaying differential item functioning, differences in item parameter correlations (discrimination parameters) and number of items found to contain linguistic differences. 相似文献

15.

A comparison of difficulty and discrimination values of selected true-false item types

Douglas Barker Robert L Ebel 《Contemporary educational psychology》1982,7(1):35-40

Thirty-eight undergraduate students were randomly assigned one of two alternate forms of a 144-item true-false midterm examination. Whenever a statement appeared on one form as true and positively stated, it appeared on the alternate form as false and negatively stated. Similarly, a false and positively stated item on one form was true and negatively stated on the other. The subject matter of the two forms was identical and the four kinds of true-false items were equally represented on each form. Difficulty and discrimination indices were computed for each of the four item types. The statistical results showed negatively stated items were more difficult, but no more discriminating, than positively stated items. Also, false items were not statistically more difficult than true items, but were significantly more discriminating. It was concluded that test constructors should include more false items than true items in their instruments and that all items should be stated positively. 相似文献

16.

Technical characteristics and some correlates of the california critical thinking skills test,forms A and B

Stanley S. Jacobs 《Research in higher education》1995,36(1):89-108

Forms A and B of the California Critical Thinking Skills Test (CCTST) were administered to two randomly formed groups of undergraduate students at a large eastern university, as part of the freshman orientation process. Arithmetic means for the forms were significantly different, indicating a lack of equivalence between forms. Principal component analyses and specific patterns of item intercorrelations differed between forms, with the lack of equivalence apparently due to the changes in Form A items, which were carried out in order to create Form B items. Internal consistency reliabilities for total and subtest scores were uniformly low, and it appears the CCTST scores largely reflect verbal intelligence of the type measured by the SAT. It was concluded that the CCTST may be acceptable for research purposes (e.g., as a blocking variable or covariate), but not for decision making concerning individual students, especially with respect to subtest scores and score differences. 相似文献

17.

Equivalent Pass/Fail Decisions

John J. Norcini 《Journal of Educational Measurement》1990,27(1):59-66

In competency testing, it is sometimes difficult to properly equate scores of different forms of a test and thereby assure equivalent cutting scores. Under such circumstances, it is possible to set standards separately for each test form and then scale the judgments of the standard setters to achieve equivalent pass/fail decisions. Data from standard setters and examinees for a medical certifying examination were reanalyzed. Cutting score equivalents were derived by applying a linear procedure to the standard-setting results. These were compared against criteria along with the cutting score equivalents derived from typical examination equating procedures. Results indicated that the cutting score equivalents produced by the experts were closer to the criteria than standards derived from examinee performance, especially when the number of examinees used in equating was small. The root mean square error estimate was about 1 item on a 189-item test. 相似文献

18.

The Effects of Test Length and Sample Size on the Reliability and Equating of Tests Composed of Constructed-Response Items

《教育实用测度》2013,26(1):31-57

Examined in this study were the effects of test length and sample size on the alternate forms reliability and the equating of simulated mathematics tests composed of constructed-response items scaled using the 2-parameter partial credit model. Test length was defined in terms of the number of both items and score points per item. Tests with 2, 4, 8, 12, and 20 items were generated, and these items had 2, 4, and 6 score points. Sample sizes of 200, 500, and 1,000 were considered. Precise item parameter estimates were not found when 200 cases were used to scale the items. To obtain acceptable reliabilities and accurate equated scores, the findings suggested that tests should have at least eight 6-point items or at least 12 items with 4 or more score points per item. 相似文献

19.

Consistency of Angoff-Based Predictions of Item Performance: Evidence of Technical Quality of Results From the Angoff Standard Setting Method

Barbara S. Plake James C. Impara Patrick M. Irwin 《Journal of Educational Measurement》2000,37(4):347-355

Judgmental standard-setting methods, such as the Angoff(1971) method, use item performance estimates as the basis for determining the minimum passing score (MPS). Therefore, the accuracy, of these item peformance estimates is crucial to the validity of the resulting MPS. Recent researchers (Shepard, 1995; Impara & Plake, 1998; National Research Council. 1999) have called into question the ability of judges to make accurate item performance estimates for target subgroups of candidates, such as minimally competent candidates. The propose of this study was to examine the intra- and inter-rater consistency of item performance estimates from an Angoff standard setting. Results provide evidence that item pelformance estimates were consistent within and across panels within and across years. Factors that might have influenced this high degree of reliability, in the item performance estimates in a standard setting study are discussed. 相似文献

20.

Item Arrangement,Cognitive Entry Characteristics,Sex, and Test Anxiety as Predictors of Achievement Examination Performance

《Journal of Experimental Education》2012,80(4):214-219

This study’s general research question was: Given male and female students in an introductory educational psychology course who vary in cognitive entry characteristics and test anxiety, how do three item arrangements (easy to difficult, difficult to easy, and random) located within a 50-item multiple-choice achievement examination influence students’ total test performance? Two hierarchical multiple regression analyses were used to analyze the data. The four predictor variables and their interactions were tested for the amount of variation that they explained in the dependent variable. The main finding within the context of this study is that item arrangements based on item difficulties do not influence achievement examination performance. 相似文献