Similar literature
 20 similar documents found (search time: 31 ms)
1.
Croatian 1st‐year and 3rd‐year high‐school students (N = 170) completed a conceptual physics test. Students were evaluated with regard to two physics topics: Newtonian dynamics and simple DC circuits. Students answered test items and also indicated their confidence in each answer. Rasch analysis facilitated the calculation of three linear measures: (a) an item‐difficulty measure based upon all responses, (b) an item‐confidence measure based upon correct student answers, and (c) an item‐confidence measure based upon incorrect student answers. Comparisons were made with regard to item difficulty and item confidence. The results suggest that students hold stronger alternative conceptions in Newtonian dynamics than in DC circuits, a topic characterized by much lower student confidence in both correct and incorrect answers. A systematic and significant difference between mean student confidence on Newtonian dynamics and DC circuits items was found in both student groups. Findings suggest some steps for physics instruction in Croatia as well as areas of further research for those in science education interested in additional techniques of exploring alternative conceptions. © 2005 Wiley Periodicals, Inc. J Res Sci Teach 43: 150–171, 2006
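The two confidence measures described in (b) and (c) can be approximated, before any Rasch scaling, as per-item means of the confidence ratings split by answer correctness. A minimal sketch in Python (the response and confidence matrices are invented for illustration; the study's Rasch linearization of these measures is omitted):

```python
import numpy as np

def confidence_measures(correct, confidence):
    """Per-item mean confidence, split by answer correctness.

    correct:    examinees x items matrix of 0/1 correctness.
    confidence: examinees x items matrix of confidence ratings.
    Returns (mean confidence on correct answers,
             mean confidence on incorrect answers), per item.
    """
    correct = np.asarray(correct, dtype=bool)
    confidence = np.asarray(confidence, dtype=float)
    on_correct = np.where(correct, confidence, np.nan)
    on_incorrect = np.where(~correct, confidence, np.nan)
    # nanmean ignores the masked (NaN) cells
    return np.nanmean(on_correct, axis=0), np.nanmean(on_incorrect, axis=0)

# Toy data: 3 students, 2 items, confidence rated 1-5
correct = [[1, 0],
           [1, 1],
           [0, 0]]
confidence = [[5, 2],
              [4, 3],
              [1, 2]]
conf_correct, conf_incorrect = confidence_measures(correct, confidence)
```

Comparing the two resulting vectors item by item is what surfaces the pattern the study reports: items where confidence stays high even on wrong answers are candidates for strong alternative conceptions.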

2.
Although multiple choice examinations are often used to test anatomical knowledge, these often forgo the use of images in favor of text‐based questions and answers. Because anatomy is reliant on visual resources, examinations using images should be used when appropriate. This study was a retrospective analysis of examination items that were text based compared to the same questions when a reference image was included with the question stem. Item difficulty and discrimination were analyzed for 15 multiple choice items given across two different examinations in two sections of an undergraduate anatomy course. Results showed that there were some differences in item difficulty, but these were not consistently attributable to either text items or items with reference images. Differences in difficulty were mainly attributable to one group of students performing better overall on the examinations. There were no significant differences in item discrimination for any of the analyzed items. This implies that reference images do not significantly alter the item statistics; however, this does not indicate whether these images were helpful to the students when answering the questions. Care should be taken by question writers to analyze item statistics when making changes to multiple choice questions, including changes made for the perceived benefit of the students. Anat Sci Educ 10: 68–78. © 2016 American Association of Anatomists.
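The item difficulty and discrimination statistics analyzed in studies like this one are classical test theory quantities: difficulty as the proportion of examinees answering correctly, and discrimination as the correlation between an item score and the rest of the test. A minimal sketch (the response matrix is toy data, not the study's):

```python
import numpy as np

def item_analysis(responses):
    """Classical test theory item statistics.

    responses: examinees x items binary matrix (1 = correct).
    Returns (difficulty, discrimination) per item, where difficulty
    is the proportion correct and discrimination is the corrected
    point-biserial: correlation of the item with the rest score
    (total minus the item itself).
    """
    responses = np.asarray(responses, dtype=float)
    difficulty = responses.mean(axis=0)
    total = responses.sum(axis=1)
    n_items = responses.shape[1]
    discrimination = np.empty(n_items)
    for j in range(n_items):
        rest = total - responses[:, j]  # exclude the item itself
        discrimination[j] = np.corrcoef(responses[:, j], rest)[0, 1]
    return difficulty, discrimination

# Toy data: 6 examinees, 3 items
R = [[1, 1, 0],
     [1, 0, 0],
     [1, 1, 1],
     [0, 0, 0],
     [1, 1, 1],
     [1, 0, 1]]
diff, disc = item_analysis(R)
```

Running the same analysis on the text-based and image-based forms of each item, as this study did, then reduces to comparing the two resulting vectors.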

3.
The arrangement of response options in multiple-choice (MC) items, especially the location of the most attractive distractor, is considered critical in constructing high-quality MC items. In the current study, a sample of 496 undergraduate students taking an educational assessment course was given three test forms consisting of the same items but the positions of the most attractive distractor varied across the forms. Using a multiple-indicators–multiple-causes (MIMIC) approach, the effects of the most attractive distractor's positions on item difficulty were investigated. The results indicated that the relative placement of the most attractive distractor and the distance between the most attractive distractor and the keyed option affected students’ response behaviors. Moreover, low-achieving students were more susceptible to response-position changes than high-achieving students.

4.
The effects of training tests on subsequent achievement were studied using two test-item characteristics: item difficulty and item complexity. Ninety Ss were randomly assigned to treatment conditions having easy or difficult items and calling for rote or complex skills. Each S was administered two training tests during the quarter containing only items defined by his treatment condition. The dependent measure was a sixty-item final examination with fifteen items reflecting each of the four treatment condition item types. The results showed greater achievement for those trained with difficult items and with rote items. In addition, two interactions of treatment conditions with type of test item were found. The results are discussed as supporting a hierarchical model rather than a “similarity” transfer model of learning.

5.
The purpose of this study is to apply the attribute hierarchy method (AHM) to a subset of SAT critical reading items and illustrate how the method can be used to promote cognitive diagnostic inferences. The AHM is a psychometric procedure for classifying examinees’ test item responses into a set of attribute mastery patterns associated with different components from a cognitive model. The study was conducted in two steps. In step 1, three cognitive models were developed by reviewing selected literature in reading comprehension as well as research related to SAT Critical Reading. Then, the cognitive models were validated by having a sample of students think aloud as they solved each item. In step 2, psychometric analyses were conducted on the SAT critical reading cognitive models by evaluating the model‐data fit between the expected and observed response patterns produced from two random samples of 2,000 examinees who wrote the items. The model that provided the best model‐data fit was then used to calculate attribute probabilities for 15 examinees to illustrate our diagnostic testing procedure.

6.
Traditional item analyses such as classical test theory (CTT) use exam-taker responses to assessment items to approximate their difficulty and discrimination. The increased adoption by educational institutions of electronic assessment platforms (EAPs) provides new avenues for assessment analytics by capturing detailed logs of an exam-taker's journey through their exam. This paper explores how logs created by EAPs can be employed alongside exam-taker responses and CTT to gain deeper insights into exam items. In particular, we propose an approach for deriving features from exam logs for approximating item difficulty and discrimination based on exam-taker behaviour during an exam. Items for which difficulty and discrimination differ significantly between CTT analysis and our approach are flagged through outlier detection for independent academic review. We demonstrate our approach by analysing de-identified exam logs and responses to assessment items of 463 medical students enrolled in a first-year biomedical sciences course. The analysis shows that the number of times an exam-taker visits an item before selecting a final response is a strong indicator of an item's difficulty and discrimination. Scrutiny by the course instructor of the seven items identified as outliers suggests our log-based analysis can provide insights beyond what is captured by traditional item analyses.

Practitioner notes

What is already known about this topic
  • Traditional item analysis is based on exam-taker responses to the items using mathematical and statistical models from classical test theory (CTT). The difficulty and discrimination indices thus calculated can be used to determine the effectiveness of each item and consequently the reliability of the entire exam.
What this paper adds
  • Data extracted from exam logs can be used to identify exam-taker behaviours which complement classical test theory in approximating the difficulty and discrimination of an item and identifying items that may require instructor review.
Implications for practice and/or policy
  • Identifying the behaviours of successful exam-takers may allow us to develop effective exam-taking strategies and personal recommendations for students.
  • Analysing exam logs may also provide an additional tool for identifying struggling students and items in need of revision.
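The key log-derived feature this study highlights, how often an exam-taker visits an item before submitting a final response, is straightforward to extract from view events. A minimal sketch (the event-log format below is invented for illustration; real EAP logs will differ):

```python
from collections import defaultdict

def visits_per_item(events):
    """Count how many times each exam-taker opened each item.

    events: iterable of (taker_id, item_id) view events, in exam order.
    Returns {item_id: {taker_id: visit_count}}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for taker, item in events:
        counts[item][taker] += 1
    return counts

def mean_visits(counts, item):
    """Average number of visits to one item across all exam-takers."""
    per_taker = counts[item].values()
    return sum(per_taker) / len(per_taker)

# Toy log: s1 revisits q1 once; s2 revisits q2 twice
log = [("s1", "q1"), ("s1", "q2"), ("s1", "q1"),
       ("s2", "q1"), ("s2", "q2"), ("s2", "q2"), ("s2", "q2")]
counts = visits_per_item(log)
```

Per-item mean visit counts computed this way can then be compared against CTT difficulty and discrimination, with large disagreements flagged for instructor review as the paper proposes.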

7.
In today's higher education, high quality assessments play an important role. Little is known, however, about the degree to which assessments are correctly aimed at the students’ levels of competence in relation to the defined learning goals. This article reviews previous research into teachers’ and students’ perceptions of item difficulty. It focuses on the item difficulty of assessments and students’ and teachers’ abilities to estimate item difficulty correctly. The review indicates that teachers tend to overestimate the difficulty of easy items and underestimate the difficulty of difficult items. Students seem to be better estimators of item difficulty. The accuracy of the estimates can be improved by: the information the estimators or teachers have about the target group and their earlier assessment results; defining the target group before the estimation process; the possibility of having discussions about the defined target group of students and their corresponding standards during the estimation process; and by the amount of training in item construction and estimating. In the subsequent study, the ability and accuracy of teachers and students to estimate the difficulty levels of assessment items were examined. In higher education, results show that teachers are able to estimate the difficulty levels correctly for only a small proportion of the assessment items. They overestimate the difficulty level of most of the assessment items. Students, on the other hand, underestimate their own performances. In addition, the relationships between the students’ perceptions of the difficulty levels of the assessment items and their performances on the assessments were investigated. Results provide evidence that the students who performed best on the assessments underestimated their performances the most. Several explanations are discussed and suggestions for additional research are offered.

8.
The purpose of the present study is to examine the language characteristics of a few states' large-scale assessments of mathematics and science and investigate whether the language demands of the items are associated with the degree of differential item functioning (DIF) for English language learner (ELL) students. A total of 542 items from 11 assessments at Grades 4, 5, 7, and 8 from three states were rated for the linguistic complexity based on a developed linguistic coding scheme. The linguistic ratings were compared to each item's DIF statistics. The results yielded a stronger association between the linguistic rating and DIF statistics for ELL students in the “relatively easy” items than in the “not easy” items. Particularly, general academic vocabulary and the amount of language in an item were found to have the strongest association with the degrees of DIF, particularly for ELL students with low English language proficiency. Furthermore, the items were grouped into four bundles to closely look at the relationship between the varying degrees of language demands and ELL students' performance. Differential bundling functioning (DBF) results indicated that the exhibited DBF was more substantial as the language demands increased. By disentangling linguistic difficulty from content difficulty, the results of the study provide strong evidence of the impact of linguistic complexity on ELL students' performance on tests. The study discusses the implications for the validation of the tests and instructions for ELL students.
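The abstract does not name the DIF statistic used, but a common choice in this kind of analysis is the Mantel–Haenszel procedure, which compares the odds of a correct response for reference and focal groups within matched ability strata. A hedged sketch (the stratum counts below are invented, not the study's data):

```python
import math

def mantel_haenszel_alpha(tables):
    """Mantel-Haenszel common odds ratio across score strata.

    tables: list of (a, b, c, d) per stratum, where
      a = reference group correct,  b = reference group incorrect,
      c = focal group correct,      d = focal group incorrect.
    alpha > 1 means the item favors the reference group.
    """
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den

def mh_delta_dif(tables):
    """ETS MH D-DIF scale: -2.35 * ln(alpha); negative favors reference."""
    return -2.35 * math.log(mantel_haenszel_alpha(tables))

# Two ability strata for one item (toy counts)
tables = [(30, 10, 20, 20),
          (40, 5, 30, 15)]
alpha = mantel_haenszel_alpha(tables)
```

Stratifying on total score before comparing groups is what separates genuine DIF from overall ability differences, which is the same concern that motivates the bundle-level (DBF) analysis in the study.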

9.
We investigated students' metacognitive experiences with regard to feelings of difficulty (FD), feelings of satisfaction (FS), and estimate of effort (EE), employing either computerized adaptive testing (CAT) or computerized fixed item testing (FIT). In an experimental approach, 174 students in grades 10 to 13 were tested either with a CAT or a FIT version of a matrices test. Data revealed that metacognitive experiences were not related to the resulting test scores for CAT: test takers who took the matrices test in an adaptive mode were paradoxically more satisfied with their performance the worse they had performed in terms of the resulting ability parameter. They also rated the test as easier the lower they had performed, but their estimates of effort were higher the better they had performed. For test takers who took the FIT version, completely different results were revealed. In line with previous results, test takers were supposed to base these experiences on the subjectively estimated percentage of items solved. This moderated mediation hypothesis was partly confirmed, as the relation between the percentage of items solved and FD, FS, and EE was revealed to be mediated by the estimated percentage of items solved. Results are discussed with reference to feedback acceptance, errant self-estimations, and test fairness with regard to a possible false regulation of effort in lower ability groups when using CAT.

10.
Anatomists often use images in assessments and examinations. This study aims to investigate the influence of different types of images on item difficulty and item discrimination in written assessments. A total of 210 of 460 students volunteered for an extra assessment in a gross anatomy course. This assessment contained 39 test items grouped in seven themes. The answer format alternated per theme and was either a labeled image or an answer list, resulting in two versions containing both images and answer lists. Subjects were randomly assigned to one version. Answer formats were compared through item scores. Both examinations had similar overall difficulty and reliability. Two cross‐sectional images resulted in greater item difficulty and item discrimination, compared to an answer list. A schematic image of fetal circulation led to decreased item difficulty and item discrimination. Three images showed variable effects. These results show that effects on assessment scores are dependent on the type of image used. Results from the two cross‐sectional images suggest an extra ability is being tested. Data from a scheme of fetal circulation suggest a cueing effect. Variable effects from other images indicate that a context‐dependent interaction takes place with the content of questions. The conclusion is that item difficulty and item discrimination can be affected when images are used instead of answer lists; thus, the use of images as a response format has potential implications for the validity of test items. Anat Sci Educ © 2012 American Association of Anatomists.

11.
This article used the multidimensional random coefficients multinomial logit model to examine the construct validity and detect substantial differential item functioning (DIF) in the Chinese version of the Motivated Strategies for Learning Questionnaire (MSLQ-CV). A total of 1,354 Hong Kong junior high school students were administered the MSLQ-CV. The partial credit model showed better goodness of fit than the rating scale model. Five items with substantial gender or grade DIF were removed from the questionnaire, and the correlations between the subscales indicated that the factors of cognitive strategy use and self-regulation were very highly correlated, suggesting a possible combination of the two factors. The test reliability analysis showed that the subscale of test anxiety had lower reliability than the other factors. Finally, the item difficulty and step parameters for the modified 39-item questionnaire were displayed. The order of the step difficulty estimates for some items implied that some grouping of categories might be required in the case of overlapping. Based on these findings, directions for future research are discussed.

12.
Sixty-eight graduate students made general and specific ratings of the quality of 12 classroom test items, which varied in difficulty and discrimination. Four treatment combinations defined two additional factors: group discussion/no group discussion of test items and exposure/no exposure to an instructional module on test item construction. The students rated the items differentially, depending not only on item difficulty level but also on item discriminative power. The group discussion and exposure to module factors had significant effects on the general item ratings only. Implications of the research were discussed.

13.
《教育实用测度》 (Applied Measurement in Education), 2013, 26(1): 89–97
Research on the use of multiple-choice tests has presented conflicting evidence about the use of statistical item difficulty as a means of ordering items. An alternate method advocated by many texts is the use of cognitive difficulty. This study examined the effect of using both statistical and cognitive item difficulty in determining item order. Results indicated that those students who received items in an increasing cognitive order, no matter what the order of statistical difficulty, scored higher on hard items. Those students who received the forms with opposing cognitive and statistical difficulty orders scored the highest on medium-level items. The study concludes with a call for more research on the effects of cognitive difficulty and suggests that future studies examine subscores as well as total test results.

14.
This article demonstrates the utility of restricted item response models for examining item difficulty ordering and slope uniformity for an item set that reflects varying cognitive processes. Twelve sets of paired algebra word problems were developed to systematically reflect various types of cognitive processes required for successful performance. This resulted in a total of 24 items. They reflected distance–rate–time (DRT), interest, and area problems. Hypotheses concerning difficulty ordering and slope uniformity for the items were tested by constraining item difficulty and discrimination parameters in hierarchical item response models. The first set of model comparisons tested the equality of the discrimination and difficulty parameters for each set of paired items. The second set of model comparisons examined slope uniformity within the complex DRT problems. The third set of model comparisons examined whether the familiarity of the story context affected item difficulty for two types of complex DRT problems. The last set of model comparisons tested the hypothesized difficulty ordering of the items.

15.
The nature of anatomy education has changed substantially in recent decades, though the traditional multiple‐choice written examination remains the cornerstone of assessing students' knowledge. This study sought to measure the quality of a clinical anatomy multiple‐choice final examination using item response theory (IRT) models. One hundred seventy‐six students took a multiple‐choice clinical anatomy examination. One‐ and two‐parameter IRT models (difficulty and discrimination parameters) were used to assess item quality. The two‐parameter IRT model demonstrated a wide range in item difficulty, with a median of −1.0 and range from −2.0 to 0.0 (25th to 75th percentile). Similar results were seen for discrimination (median 0.6; range 0.4–0.8). The test information curve achieved maximum discrimination for an ability level one standard deviation below the average. There were 15 items with standardized loading less than 0.3, which was due to several factors: two items had two correct responses, one was not well constructed, two were too easy, and the others revealed a lack of detailed knowledge by students. The test used in this study was more effective in discriminating students of lower ability than those of higher ability. Overall, the quality of the examination in clinical anatomy was confirmed by the IRT models. Anat Sci Educ 3:17–24, 2010. © 2009 American Association of Anatomists.
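The two-parameter IRT model used here gives the probability of a correct response as a logistic function of ability θ, with item discrimination a and difficulty b, and each item's Fisher information peaks at θ = b. That is why a test whose median difficulty sits at −1.0 is most informative about examinees one standard deviation below average, as the abstract reports. A sketch with the study's median parameter values plugged in for illustration:

```python
import math

def p_2pl(theta, a, b):
    """2PL probability of a correct response at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item: a^2 * P * (1 - P), peaking at theta = b."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

# Median item from the study (b = -1.0, a = 0.6): most informative
# one SD below average ability, less so for an average examinee.
info_at_b = item_information(-1.0, 0.6, -1.0)
info_at_mean = item_information(0.0, 0.6, -1.0)
```

Summing these item-level curves over all items yields the test information curve the abstract refers to.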

16.
Federal policy on alternate assessment based on modified academic achievement standards (AA-MAS) inspired this research. Specifically, an experimental study was conducted to determine whether tests composed of modified items would have the same level of reliability as tests composed of original items, and whether these modified items helped reduce the performance gap between AA-MAS eligible and ineligible students. Three groups of eighth-grade students (N = 755) defined by eligibility and disability status took original and modified versions of reading and mathematics tests. In a third condition, the students were provided limited reading support along with the modified items. Changes in reliability across groups and conditions for both the reading and mathematics tests were determined to be minimal. Mean item difficulties within the Rasch model were shown to decrease more for students who would be eligible for the AA-MAS than for non-eligible groups, revealing evidence of differential boost. Exploratory analyses indicated that shortening the question stem may be a highly effective modification, and that adding graphics to reading items may be a poor modification.

17.
This work examines the hypothesis that the arrangement of items according to increasing difficulty is the real source of what is considered the item-position effect. A confusion of the 2 effects is possible because in achievement measures the items are arranged according to their difficulty. Two item subsets of Raven’s Advanced Progressive Matrices (APM), one following the original item order, and the other one including randomly ordered items, were applied to a sample of 266 students. Confirmatory factor analysis models including representations of both the item-position effect and a possible effect due to increasing item difficulty were compared. The results provided evidence for both effects. Furthermore, they indicated a substantial relation between the item-position effects of the 2 APM subsets, whereas no relation was found for item difficulty. This indicates that the item-position effect stands on its own and is not due to increasing item difficulty.

18.
Calibration of an item bank for computer adaptive testing requires substantial resources. In this study, we investigated whether the efficiency of calibration under the Rasch model could be enhanced by improving the match between item difficulty and student ability. We introduced targeted multistage calibration designs, a design type that considers ability‐related background variables and performance for assigning students to suitable items. Furthermore, we investigated whether uncertainty about item difficulty could impair the assembling of efficient designs. The results indicated that targeted multistage calibration designs were more efficient than ordinary targeted designs under optimal conditions. Limited knowledge about item difficulty reduced the efficiency of one of the two investigated targeted multistage calibration designs, whereas targeted designs were more robust.

19.
Certain testing authorities have implied that the proportion of examinees who answer an item correctly may be influenced by the difficulty of the immediately preceding item. If present, such a "sequence effect" would cause p (as an estimate of item difficulty level) to misrepresent an item's "true" level of difficulty. To investigate this hypothesis, a balanced Latin square design was used to rearrange examination items into various test forms. A unique analysis of variance procedure was used to analyze the resulting data. The alleged sequence effect was not found. Certain limitations preclude the generalization of this finding to all students or to all testing situations. However, the evidence provided by this investigation does suggest that comments relating to sequence effects should be qualified as compared with presently appearing statements.

20.
Biased test items were intentionally embedded within a set of test items, and the resulting instrument was administered to large samples of blacks and whites. Three popular item bias detection procedures were then applied to the data: (1) the three-parameter item characteristic curve procedure, (2) the chi-square method, and (3) the transformed item difficulty approach. The three-parameter item characteristic curve procedure proved most effective at detecting the intentionally biased test items; and the chi-square method was viewed as the best alternative. The transformed item difficulty approach has certain limitations yet represents a practical alternative if sample size, lack of computer facilities, or the like preclude the use of the other two procedures.
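The transformed item difficulty approach mentioned last converts each group's proportion correct p into a normalized delta value, Δ = 13 + 4·Φ⁻¹(1 − p), on a scale with mean 13 and standard deviation 4; items whose (Δ_group1, Δ_group2) points fall far from the major axis of the scatter plot are flagged as potentially biased. A sketch of the transformation (the p values below are illustrative):

```python
from statistics import NormalDist

def delta(p):
    """ETS delta transform of proportion correct: larger delta = harder item."""
    return 13.0 + 4.0 * NormalDist().inv_cdf(1.0 - p)

# Proportion correct for one item in two groups (invented numbers):
p_group1, p_group2 = 0.80, 0.55
d1, d2 = delta(p_group1), delta(p_group2)  # the item is harder for group 2
```

Its practicality, noted in the abstract, comes from needing only per-group p values and no iterative model fitting, unlike the three-parameter procedure.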
