Similar Articles
20 similar articles found (search time: 31 ms)
1.
In recent years, students’ test scores have been used to evaluate teachers’ performance. The assumption underlying this practice is that students’ test performance reflects teachers’ instruction. However, this assumption is generally not tested empirically. In this study, we examine the effect of teachers’ instruction on test performance at the item level using a hierarchical differential item functioning (DIF) approach. The items are from the U.S. TIMSS 2011 4th-grade math test. Specifically, we tested whether students who had received instruction on a given item performed significantly better on that item than students who had not received such instruction, with overall math ability controlled for, both with and without student-level and class-level covariates. This study provides preliminary findings regarding why some items show instructional sensitivity and sheds light on how to develop instructionally sensitive items. Implications and directions for further research are also discussed.
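
The abstract does not give the model specification, so the following is a minimal, hypothetical sketch of item-level DIF testing in this spirit: an ordinary logistic regression (a simpler stand-in for the paper's hierarchical model) asking whether an instruction indicator predicts success on an item once overall ability is controlled for. All variable names and the simulated data are illustrative only.

    # Hypothetical sketch only: plain logistic regression in place of the
    # hierarchical DIF model described in the abstract.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 500
    ability = rng.normal(size=n)             # proxy for overall math ability
    instructed = rng.integers(0, 2, size=n)  # 1 = item content was taught
    logit = -0.5 + 1.2 * ability + 0.6 * instructed
    correct = rng.binomial(1, 1 / (1 + np.exp(-logit)))

    X = sm.add_constant(np.column_stack([ability, instructed]))
    fit = sm.Logit(correct, X).fit(disp=0)
    # A significant coefficient on 'instructed', holding ability constant,
    # would flag the item as instructionally sensitive.
    print(fit.params, fit.pvalues)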

2.
Instructional sensitivity is the psychometric capacity of tests or single items to capture the effects of classroom instruction. Yet how current item sensitivity measures relate to (a) actual instruction and (b) overall test sensitivity is rather unclear. The present study aims at closing these gaps by investigating test and item sensitivity to teaching quality, reanalyzing data from a quasi-experimental intervention study in primary school science education (1026 students, 53 classes, Mage = 8.79 years, SDage = 0.49, 50% female). We examine (a) the correlation between item sensitivity measures and the potential for cognitive activation in class and (b) the consequences for test score interpretation when assembling tests from items varying in their degree of sensitivity to cognitive activation. Our study (a) provides validity evidence that item sensitivity measures may be related to actual classroom instruction and (b) points out that inferences about teaching drawn from test scores may vary due to test composition.

3.
Data were generated to simulate the multidimensionality that results from including two or four subtopics on a test. Each item depended on an ability trait attributable to instruction and learning, which was the same across all items, as well as an ability trait unique to the item's subtopic (such as biology on a general science test). The eigenvalues of the item correlation matrix and Yen's Q3 were not greatly influenced by multidimensionality under conditions where a large proportion of students' responses shared the influence of common instruction across subtopics. In contrast, Stout's T procedure was effective at detecting this type of multidimensionality, unless the subtopic abilities were correlated.
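
As a rough illustration of one of the checks mentioned above, the following sketch simulates a general trait plus a subtopic-specific trait and inspects the eigenvalues of the inter-item correlation matrix; the data-generating values are invented, not the study's.

    # Simulate items driven by a common instruction trait, with half the
    # items also loading on a subtopic trait, then check eigenvalues.
    import numpy as np

    rng = np.random.default_rng(1)
    n_students, n_items = 1000, 20
    general = rng.normal(size=n_students)    # trait shared by all items
    subtopic = rng.normal(size=n_students)   # extra trait for items 10-19
    theta = np.column_stack([general + 0.3 * subtopic if i >= 10 else general
                             for i in range(n_items)])
    prob = 1 / (1 + np.exp(-(theta - rng.normal(size=n_items))))
    responses = rng.binomial(1, prob)

    eigvals = np.linalg.eigvalsh(np.corrcoef(responses, rowvar=False))[::-1]
    # A dominant first eigenvalue despite the subtopic factor is consistent
    # with the abstract's finding under strong common instruction.
    print(eigvals[:5].round(2))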

4.
Traditional item analyses such as classical test theory (CTT) use exam-taker responses to assessment items to approximate their difficulty and discrimination. The increased adoption by educational institutions of electronic assessment platforms (EAPs) provides new avenues for assessment analytics by capturing detailed logs of an exam-taker's journey through their exam. This paper explores how logs created by EAPs can be employed alongside exam-taker responses and CTT to gain deeper insights into exam items. In particular, we propose an approach for deriving features from exam logs for approximating item difficulty and discrimination based on exam-taker behaviour during an exam. Items for which difficulty and discrimination differ significantly between CTT analysis and our approach are flagged through outlier detection for independent academic review. We demonstrate our approach by analysing de-identified exam logs and responses to assessment items of 463 medical students enrolled in a first-year biomedical sciences course. The analysis shows that the number of times an exam-taker visits an item before selecting a final response is a strong indicator of an item's difficulty and discrimination. Scrutiny by the course instructor of the seven items identified as outliers suggests our log-based analysis can provide insights beyond what is captured by traditional item analyses.

Practitioner notes

What is already known about this topic
  • Traditional item analysis is based on exam-taker responses to the items using mathematical and statistical models from classical test theory (CTT). The difficulty and discrimination indices thus calculated can be used to determine the effectiveness of each item and consequently the reliability of the entire exam.
What this paper adds
  • Data extracted from exam logs can be used to identify exam-taker behaviours which complement classical test theory in approximating the difficulty and discrimination of an item and identifying items that may require instructor review.
Implications for practice and/or policy
  • Identifying the behaviours of successful exam-takers may allow us to develop effective exam-taking strategies and personal recommendations for students.
  • Analysing exam logs may also provide an additional tool for identifying struggling students and items in need of revision.
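
As a hypothetical illustration of the log-based analysis described in item 4 above: classical difficulty and discrimination computed from responses, a visits-per-item feature derived from the logs, and a simple residual-based outlier flag where the two sources disagree. The array shapes, the feature, and the flagging rule are assumptions, not the paper's procedure.

    # Hypothetical sketch: CTT statistics plus a log-derived feature, with
    # outlier flagging for independent academic review.
    import numpy as np

    rng = np.random.default_rng(2)
    n_students, n_items = 463, 40
    scores = rng.binomial(1, rng.uniform(0.3, 0.9, n_items),
                          (n_students, n_items))
    visits = rng.poisson(2 + 3 * (1 - scores.mean(axis=0)),
                         (n_students, n_items))

    total = scores.sum(axis=1)
    difficulty = scores.mean(axis=0)                      # CTT p-value per item
    discrimination = np.array([np.corrcoef(scores[:, i], total)[0, 1]
                               for i in range(n_items)])  # point-biserial proxy
    mean_visits = visits.mean(axis=0)                     # log-derived feature

    # Flag items whose visit counts depart strongly from what their
    # CTT difficulty would predict (large regression residuals).
    fit = np.poly1d(np.polyfit(difficulty, mean_visits, 1))
    resid = mean_visits - fit(difficulty)
    flagged = np.where(np.abs(resid) > 2 * resid.std())[0]
    print("Items for instructor review:", flagged)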

5.
An assumption of item response theory is that a person's score is a function of the item response parameters and the person's ability. In this paper, the effect of variations in instructional coverage on item characteristic functions is examined. Using data from the Second International Mathematics Study (1985), curriculum clusters were formed based on teachers' ratings of their students' opportunities to learn the items on a test. After forming curriculum clusters, item response curves were compared using signed and unsigned sums of squared differences. Some of the differences in the item response curves between curriculum clusters were found to be large, but better performance was not necessarily related to greater opportunity to learn. The item response curve differences were much larger than differences reported in prior studies based on comparisons of black and white students. Implications of the findings for applications of item response theory to educational achievement test data are discussed.
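
One plausible reading of the comparison indices above, sketched with invented 2PL parameters for two curriculum clusters: the signed index averages raw differences between the curves (direction-preserving), while the unsigned index averages squared differences (magnitude only). This is an illustration, not the paper's exact computation.

    # Hypothetical sketch: compare item response curves between two
    # curriculum clusters; parameter values are invented.
    import numpy as np

    def icc(theta, a, b):
        """2PL item characteristic curve."""
        return 1 / (1 + np.exp(-a * (theta - b)))

    theta = np.linspace(-3, 3, 61)
    p_high_otl = icc(theta, 1.3, -0.2)  # high opportunity-to-learn cluster
    p_low_otl = icc(theta, 1.0, 0.4)    # low opportunity-to-learn cluster

    diff = p_high_otl - p_low_otl
    signed = diff.mean()           # direction-preserving average difference
    unsigned = (diff ** 2).mean()  # squared (magnitude-only) difference
    print(f"signed: {signed:.3f}, unsigned: {unsigned:.3f}")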

6.
High-stakes testing, a phenomenon born of intense accountability pressure across the United States, produces instructional settings that marginalize both curriculum and instruction. Teachers and other school personnel have reduced instruction to drill and practice in an effort to raise standardized and criterion-referenced test scores. This study presents an alternative to current practice that engages students in learning and increases their awareness of the internal workings of standardized tests. The Test Item Construction Model (TICM) guides students through studying test item stems and subsequently creating their own items, via a 12-week process that moves incrementally from understanding test items to creating them. Students grew in their understanding of test item stems and in their ability to generate them. An ANOVA did not yield significant differences between random groups of trained and untrained test writers; however, students in the experimental group demonstrated gains in understanding of test items.

7.
This paper investigates whether two different testing methods, multiple choice and information transfer, produce test-method effects in reading comprehension examinations. In addition to analyzing students' test scores, the study also analyzed item difficulty values, which were estimated using Item Response Theory. The results show that the testing method does affect item difficulty and examinees' performance: in terms of item difficulty, information-transfer items proved more difficult than multiple-choice items.

8.
In test development, item response theory (IRT) is a method to determine the amount of information that each item (i.e., item information function) and combination of items (i.e., test information function) provide in the estimation of an examinee's ability. Studies investigating the effects of item parameter estimation errors over a range of ability have demonstrated an overestimation of information when the most discriminating items are selected (i.e., item selection based on maximum information). In the present study, the authors examined the influence of item parameter estimation errors across 3 item selection methods—maximum no target, maximum target, and theta maximum—using the 2- and 3-parameter logistic IRT models. Tests created with the maximum no target and maximum target item selection procedures consistently overestimated the test information function. Conversely, tests created using the theta maximum item selection procedure yielded more consistent estimates of the test information function and, at times, underestimated the test information function. Implications for test development are discussed.
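
For reference (standard Birnbaum results, not stated in the abstract), the 3PL response and information functions, with discrimination a_i, difficulty b_i, and pseudo-guessing c_i (the 2PL case is c_i = 0), are:

    P_i(\theta) = c_i + \frac{1 - c_i}{1 + e^{-a_i(\theta - b_i)}},
    \qquad
    I_i(\theta) = a_i^2 \,\frac{1 - P_i(\theta)}{P_i(\theta)}
                  \left[\frac{P_i(\theta) - c_i}{1 - c_i}\right]^2,
    \qquad
    T(\theta) = \sum_i I_i(\theta).

Selecting items that maximize I_i(\theta) at a target ability preferentially picks items whose discrimination estimates happen to be inflated by estimation error, which is consistent with the overestimated test information functions the abstract reports.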

9.
The purpose of this study was to compare the opinions of students, teachers, and administrators on student evaluation of instruction in selected community colleges. While important educational decisions in community colleges are made on the basis of students’ evaluations (as in retention, promotion, tenure, and pay), little has been done to test the assumptions behind student evaluation of instruction. The student evaluation process assumes that students are honest and serious, and that they evaluate the instruction itself, not some incidental activity.

A 25‐item Student Evaluation Process Scale was completed by 607 students, 130 faculty, and 45 administrators in five Illinois community colleges. Findings revealed few significant differences in students’ opinions about evaluation of instruction based on the variables of sex, age, school location, student type (transfer or occupational), and class standing. There were likewise few significant differences in faculty opinion, or within the administrative groups, based on selected variables. There were, however, significant differences when the opinions of students, faculty, and administrators were compared. Students and faculty tended to agree with items that questioned the objectivity of student evaluation of instruction. Administrators and students tended to agree with items reflecting the seriousness with which students evaluate instruction. Faculty and administrators indicated that student evaluation of instruction affected faculty members’ instructional performance. Neither students, faculty, nor administrators supported the concept of merit pay tied to student evaluation of instruction.

The role of student evaluation of instruction in a faculty evaluation system must be investigated. A variety of groups should participate in this investigation.

10.
This study sought a scientific way to examine whether item response curves are influenced systematically by the cognitive processes underlying solution of the items in a procedural domain (addition of fractions). Starting from an expert teacher's logical task analysis and prediction of various erroneous rules and sources of misconceptions, an error diagnostic program was developed. This program was used to carry out an error analysis of test performance by three samples of students. After the cognitive structure of the subtasks was validated by a majority of the students, the items were characterized by their underlying subtask patterns. It was found that item response curves for items in the same categories were significantly more homogeneous than those in different categories. In other words, underlying cognitive subtasks appeared to systematically influence the slopes and difficulties of item response curves.

11.
Interpreting and creating graphs play a critical role in scientific practice. The K-12 Next Generation Science Standards call for students to use graphs for scientific modeling, reasoning, and communication. To measure progress on this dimension, we need valid and reliable measures of graph understanding in science. In this research, we designed items to measure graph comprehension, critique, and construction, and developed scoring rubrics based on the knowledge integration (KI) framework. We administered the items to over 460 middle school students. We found that the items formed a coherent scale and had good reliability under both item response theory and classical test theory. The KI scoring rubric showed that most students had difficulty linking graph features to science concepts, especially when asked to critique or construct graphs. In addition, students with limited access to computers, as well as those who speak a language other than English at home, had less integrated understanding than others. These findings point to the need to increase the integration of graphing into science instruction. The results suggest directions for further research leading to comprehensive assessments of graph understanding.

12.
The reading data from the 1983–84 National Assessment of Educational Progress survey were scaled using a unidimensional item response theory model. To determine whether the responses to the reading items were consistent with unidimensionality, the full-information factor analysis method developed by Bock and associates (1985) and Rosenbaum's (1984) test of unidimensionality, conditional (local) independence, and monotonicity were applied. Full-information factor analysis involves the assumption of a particular item response function; the number of latent variables required to obtain a reasonable fit to the data is then determined. The Rosenbaum method provides a test of the more general hypothesis that the data can be represented by a model characterized by unidimensionality, conditional independence, and monotonicity. Results of both methods indicated that the reading items could be regarded as measures of a single dimension. Simulation studies were conducted to investigate the impact of balanced incomplete block (BIB) spiraling, used in NAEP to assign items to students, on methods of dimensionality assessment. In general, conclusions about dimensionality were the same for BIB-spiraled data as for complete data.

13.
The psychometrically sound development of assessment instruments requires pilot testing of candidate items as a first step in gauging their quality, typically a time-consuming and costly effort. Crowdsourcing offers the opportunity for gathering data much more quickly and inexpensively than from most targeted populations. In a simulation of a pilot testing protocol, item parameters for 110 life science questions are estimated from 4,043 crowdsourced adult subjects and then compared with those from 20,937 middle school science students. In terms of item discrimination classification (high vs. low), classical test theory yields an acceptable level of agreement (C-statistic = 0.755); item response theory produces excellent results (C-statistic = 0.848). Item response theory also identifies potential anchor items without including any false positives (items with low discrimination in the targeted population). We conclude that the use of crowdsourced subjects is a reasonable, efficient method for the identification of high-quality items for field testing and for the selection of anchor items to be used for test equating.

14.
The premise of a great deal of current research guiding policy development has been that accommodations are the catalyst for student performance differences. Rather than accepting this premise, two studies were conducted to investigate the influence of extended time and content knowledge on the performance of ninth‐grade students who took a statewide mathematics test with and without accommodations. Each study involved 1,250 accommodated students (extended time only) with learning disabilities and 1,250 nonaccommodated students without disabilities. In Study One, a standard differential item functioning (DIF) analysis illustrated that the usual approach to studying the effects of accommodations contributes little to our understanding of the reasons for performance differences across students. Next, a mixture item response theory DIF model was used to explore the most likely cause(s) of performance differences across the population. The results from both studies suggest that students for whom items functioned differently were not accurately characterized by their accommodation status but rather by their content knowledge. That is, knowing students' accommodation status (i.e., accommodated or nonaccommodated) contributed little to understanding why accommodated and nonaccommodated students differed in their test performance. Rather, the data suggest that a more likely explanation is that mathematics competency differentiated the groups of student learners regardless of their accommodation and/or reading levels.

15.
An important assumption of item response theory is item parameter invariance. Sometimes, however, item parameters are not invariant across different test administrations due to factors other than sampling error; this phenomenon is termed item parameter drift. Several methods have been developed to detect drifted items. However, most of the existing methods were designed to detect drift in individual items, which may not be adequate for test characteristic curve–based linking or equating. One example is item response theory–based true score equating, whose goal is to generate a conversion table relating number‐correct scores on two forms based on their test characteristic curves. This article introduces a stepwise test characteristic curve method that detects item parameter drift iteratively based on test characteristic curves, without needing to set any predetermined critical values. Comparisons are made between the proposed method and two existing methods under the three‐parameter logistic item response model through simulation and real data analysis. Results show that the proposed method produces a small difference in test characteristic curves between administrations, an accurate conversion table, and a good classification of drifted and nondrifted items, while at the same time retaining a large number of linking items.
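
A minimal sketch of the quantity such a method iterates on: the difference between two administrations' test characteristic curves under the 3PL model. All parameter values below are invented for illustration; this is not the paper's algorithm.

    # Hypothetical sketch: compare 3PL test characteristic curves (TCCs)
    # between two administrations when one item's difficulty has drifted.
    import numpy as np

    def p3pl(theta, a, b, c):
        """3PL response probabilities, one column per item."""
        return c + (1 - c) / (1 + np.exp(-a * (theta[:, None] - b)))

    theta = np.linspace(-4, 4, 81)
    a1, b1, c1 = np.full(30, 1.2), np.linspace(-2, 2, 30), np.full(30, 0.2)
    a2, b2, c2 = a1.copy(), b1.copy(), c1.copy()
    b2[5] += 0.8  # the drifted item became harder on the second form

    tcc1 = p3pl(theta, a1, b1, c1).sum(axis=1)  # expected number-correct
    tcc2 = p3pl(theta, a2, b2, c2).sum(axis=1)
    print("max TCC difference:", np.abs(tcc1 - tcc2).max().round(3))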

16.
In many testing programs it is assumed that the context or position in which an item is administered does not have a differential effect on examinee responses to the item. Violations of this assumption may bias item response theory estimates of item and person parameters. This study examines the potentially biasing effects of item position. A hierarchical generalized linear model is formulated for estimating item‐position effects. The model is demonstrated using data from a pilot administration of the GRE wherein the same items appeared in different positions across the test form. Methods for detecting and assessing position effects are discussed, as are applications of the model in the contexts of test development and item analysis.

17.
This study involved the development and application of a two-tier diagnostic test measuring college biology students' understanding of diffusion and osmosis after a course of instruction. The development procedure had three general steps: defining the content boundaries of the test, collecting information on students' misconceptions, and developing the instrument. Misconception data were collected from interviews and from multiple-choice questions with free-response answers. The data were used to develop 12 two-tier multiple-choice items in which the first tier examined content knowledge and the second examined understanding of that knowledge. The conceptual knowledge examined was the particulate and random nature of matter, concentration and tonicity, the influence of life forces on diffusion and osmosis, membranes, kinetic energy of matter, the process of diffusion, and the process of osmosis. The diagnostic instrument was administered to 240 students (123 non-biology majors and 117 biology majors) enrolled in a college freshman biology laboratory course. The students had completed a unit on diffusion and osmosis. The content taught was carefully defined by propositional knowledge statements and was the same content that defined the content boundaries of the test. The split-half reliability was .74. Difficulty indices ranged from 0.23 to 0.95, and discrimination indices ranged from 0.21 to 0.65. Each item was analyzed to determine student understanding of, and to identify misconceptions about, diffusion and osmosis.
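
As a rough illustration (simulated data, not the study's), the classical statistics reported above can be computed as follows. Discrimination here uses the corrected item-total correlation and reliability uses an odd-even split with the Spearman-Brown correction, both common choices, though the paper does not state its exact formulas.

    # Hypothetical sketch of classical item analysis on simulated 0/1 data:
    # 240 students by 12 items, mirroring the counts in the abstract.
    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.binomial(1, rng.uniform(0.2, 0.95, 12), (240, 12))

    difficulty = X.mean(axis=0)  # proportion correct per item
    total = X.sum(axis=1)
    discrimination = np.array([np.corrcoef(X[:, i], total - X[:, i])[0, 1]
                               for i in range(X.shape[1])])

    odd, even = X[:, ::2].sum(axis=1), X[:, 1::2].sum(axis=1)
    r_half = np.corrcoef(odd, even)[0, 1]
    split_half = 2 * r_half / (1 + r_half)  # Spearman-Brown correction
    print(difficulty.round(2), discrimination.round(2), round(split_half, 2))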

18.
Anatomists often use images in assessments and examinations. This study aims to investigate the influence of different types of images on item difficulty and item discrimination in written assessments. A total of 210 of 460 students volunteered for an extra assessment in a gross anatomy course. This assessment contained 39 test items grouped in seven themes. The answer format alternated per theme and was either a labeled image or an answer list, resulting in two versions containing both images and answer lists. Subjects were randomly assigned to one version. Answer formats were compared through item scores. Both examinations had similar overall difficulty and reliability. Two cross‐sectional images resulted in greater item difficulty and item discrimination, compared to an answer list. A schematic image of fetal circulation led to decreased item difficulty and item discrimination. Three images showed variable effects. These results show that effects on assessment scores are dependent on the type of image used. Results from the two cross‐sectional images suggest an extra ability is being tested. Data from a scheme of fetal circulation suggest a cueing effect. Variable effects from other images indicate that a context‐dependent interaction takes place with the content of questions. The conclusion is that item difficulty and item discrimination can be affected when images are used instead of answer lists; thus, the use of images as a response format has potential implications for the validity of test items.

19.
In this study, the relationship between differentiated instruction, as an element of data-based decision making, and student achievement was examined. Classroom observations (n = 144) were used to measure teachers’ differentiated instruction practices and to predict the mathematical achievement of 2nd- and 5th-grade students (n = 953). The analysis of classroom observation data was based on a combination of generalizability theory and item response theory, and student achievement effects were determined by means of multilevel analysis. No significant positive effects were found for differentiated instruction practices. Furthermore, findings showed that students in low-ability groups profited less from differentiated instruction than students in average or high-ability groups. Nevertheless, the findings, data collection, and data-analysis procedures of this study contribute to the study of classroom observation and the measurement of differentiated instruction.

20.
This paper demonstrates, both theoretically and empirically, using both simulated and real test data, that sets of items can be selected that meet the unidimensionality assumption of most item response theory models even though they require more than one ability for a correct response. Sets of items that measure the same composite of abilities, as defined by multidimensional item response theory, are shown to meet the unidimensionality assumption. A method for identifying such item sets is also presented.
