Similar Articles
20 similar articles found (search time: 156 ms)
1.
Concept inventories are often used to assess current student understanding, although the conceptual change models behind them remain contested. Given these controversies and the realities of student assessment, it is important that concept inventories be evaluated with a variety of theoretical models to improve their quality. This study used a modified item response theory model to determine university nonmajor biology students' levels of understanding of natural selection (n = 1,192). Using the Conceptual Inventory of Natural Selection, we report how we applied Bock's modified nominal item response theory model and a distractor-level item analysis. We found that this model can define student levels of understanding and identify problematic distractors.
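To make the model concrete: below is a minimal sketch of the category probabilities under Bock's nominal response model, which underlies the distractor analysis described above. All parameter values are invented for illustration.

```python
import numpy as np

def nominal_response_probs(theta, a, c):
    """Category probabilities under Bock's nominal response model:
    P(k | theta) = exp(a_k * theta + c_k) / sum_m exp(a_m * theta + c_m),
    with one slope a_k and intercept c_k per response option (identified,
    e.g., by constraining each set to sum to zero)."""
    z = np.asarray(a) * theta + np.asarray(c)
    z -= z.max()                      # subtract max for numerical stability
    expz = np.exp(z)
    return expz / expz.sum()

# Illustrative parameters for one 4-option item (keyed option first):
a = np.array([1.2, -0.3, -0.4, -0.5])
c = np.array([0.5, 0.1, -0.2, -0.4])

# Tracing each option across ability levels shows which distractors attract
# low-ability examinees -- the pattern the distractor analysis inspects.
for theta in (-2.0, 0.0, 2.0):
    print(theta, nominal_response_probs(theta, a, c).round(3))
```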

2.
The information matrix can be determined equivalently via the expectation of the Hessian matrix or the expectation of the outer product of the score vector. The identity of these two matrices, however, holds only for a correctly specified model; differences between the two versions of the observed information matrix therefore indicate model misfit. The equality of both matrices can be tested with the so-called information matrix test as a general test of misspecification. This test can be adapted to item response models in order to evaluate the fit of single items and the fit of the whole scale. The performance of different versions of the test is compared in a simulation study with existing tests of model fit, among them the test of Orlando and Thissen, the score test of local independence due to Glas and Suarez-Falcon, and the limited-information approach of Maydeu-Olivares and Joe. In general, the different versions of the information matrix test adhere to the nominal Type I error rate and have high power for detecting misspecified item characteristic curves. Additionally, some versions of the test can be used to detect violations of the local independence assumption.
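To illustrate the identity being tested: the sketch below computes the two versions of the information matrix (minus the average Hessian, and the average outer product of the score) for a simple logistic model standing in for an item response model. The data and parameters are simulated; this is not the article's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Simulate a correctly specified logistic model (intercept and slope):
n = 5000
x = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([-0.5, 1.2])
y = rng.binomial(1, sigmoid(x @ beta))

p = sigmoid(x @ beta)         # evaluated at the true parameters here;
                              # in practice, at the ML estimates
score = (y - p)[:, None] * x  # per-observation score vectors

# Version 1: expectation of minus the Hessian.
hess_part = ((p * (1 - p))[:, None, None]
             * np.einsum('ni,nj->nij', x, x)).mean(axis=0)
# Version 2: expectation of the outer product of the score.
opg_part = (score[:, :, None] * score[:, None, :]).mean(axis=0)

# Under correct specification the two matrices agree in expectation;
# large elementwise differences signal misspecification.
print((hess_part - opg_part).round(4))
```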

3.
In this article, we introduce a person-fit statistic called the hierarchy consistency index (HCI) to help detect misfitting item response vectors for tests developed and analyzed based on a cognitive model. The HCI ranges from −1.0 to 1.0, with values close to −1.0 indicating that students respond unexpectedly or differently from the responses expected under a given cognitive model. A simulation study was conducted to evaluate the power of the HCI in detecting different types of misfitting item response vectors. Simulation results revealed that the detection rate of the HCI was a function of type of misfit, item discriminating power, and test length. The best detection rates were achieved when the HCI was applied to tests that consisted of a large number of highly discriminating items. In addition, whether a misfitting item response vector can be correctly identified depends, to a large degree, on the number of misfits of the item response vector relative to the cognitive model. When misfitting response behavior only affects a small number of item responses, the resulting item response vector will not be substantially different from the expectations under the cognitive model and consequently may not be statistically identified as misfitting. As an item response vector deviates further from the model expectations, misfits are more easily identified and consequently higher detection rates of the HCI are expected.
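A minimal sketch of a common formulation of the HCI follows, assuming the misfit count compares each correctly answered item against its model-implied prerequisite items; the hierarchy and responses are hypothetical, and the cited literature should be consulted for the exact definition.

```python
import numpy as np

def hci(responses, prereq):
    """Hierarchy consistency index for one 0/1 response vector.
    prereq maps each item index to the indices of its prerequisite items
    under the cognitive model.  Returns a value in [-1, 1]; values near
    -1 indicate responses inconsistent with the hierarchy."""
    misfits = comparisons = 0
    for j, x_j in enumerate(responses):
        if x_j == 1:
            for g in prereq.get(j, []):
                comparisons += 1
                if responses[g] == 0:   # item correct, prerequisite missed
                    misfits += 1
    if comparisons == 0:
        return 1.0
    return 1.0 - 2.0 * misfits / comparisons

# Hypothetical 4-item linear hierarchy: item 0 underlies item 1, and so on.
prereq = {1: [0], 2: [0, 1], 3: [0, 1, 2]}
print(hci(np.array([1, 1, 1, 0]), prereq))   # consistent vector -> 1.0
print(hci(np.array([0, 0, 0, 1]), prereq))   # inconsistent vector -> -1.0
```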

4.
Computer-based tests (CBTs) often use random ordering of items to minimize item exposure and reduce the potential for answer copying. Little research has been done, however, to examine item position effects for these tests. In this study, different versions of a Rasch model and different response time models were examined and applied to data from a CBT administration of a medical licensure examination. The models were used specifically to investigate whether item position affected item difficulty and item intensity estimates. Results indicated that the position effect was negligible.
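One simple way to parameterize the effect under study is a Rasch model extended with a linear position term; the sketch below is illustrative and is not the specific set of models fitted in the article.

```python
import numpy as np

def rasch_position_prob(theta, b, gamma, position):
    """P(correct) under a Rasch model with a linear item-position effect:
        logit P = theta - b - gamma * position.
    gamma near 0 corresponds to a negligible position effect."""
    return 1.0 / (1.0 + np.exp(-(theta - b - gamma * position)))

# A mid-difficulty item seen by an average examinee, early vs. late in the
# test, with and without a small position effect (made-up values):
theta, b = 0.0, 0.0
for gamma in (0.0, 0.01):
    print(gamma,
          rasch_position_prob(theta, b, gamma, position=1).round(3),
          rasch_position_prob(theta, b, gamma, position=60).round(3))
```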

5.
Investigations of differential distractor functioning (DDF) can provide valuable information concerning the location and possible causes of violations of measurement invariance within a multiple-choice item. In this article, I propose an odds ratio estimator of the DDF effect as modeled under the nominal response model. In addition, I propose a simultaneous distractor-level (SDL) test of invariance based on the results of the distractor-level tests of DDF. The results of a simulation study indicated that the DDF effect estimator maintained good statistical properties under a variety of conditions, and the SDL test displayed substantially higher power than the traditional Mantel-Haenszel test of no DIF when the DDF effect varied in magnitude and/or size across the distractors.
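The flavor of a distractor-level analysis can be conveyed with a generic Mantel-Haenszel common odds ratio computed among incorrect responders, stratified by ability; note that this is not the nominal-response-model estimator proposed in the article, and all counts below are hypothetical.

```python
def mh_odds_ratio(tables):
    """Mantel-Haenszel common odds ratio across score strata.
    tables: 2x2 counts ((a, b), (c, d)) per stratum, with rows =
    reference/focal group and columns = "chose distractor k" / "chose
    another incorrect option".  An OR far from 1 suggests DDF for
    distractor k."""
    num = den = 0.0
    for (a, b), (c, d) in tables:
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    return num / den

# Hypothetical counts in three ability strata:
tables = [((30, 20), (15, 35)),
          ((25, 25), (12, 38)),
          ((10, 30), (5, 35))]
print(mh_odds_ratio(tables))
```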

6.
There are many educational interventions being implemented to address workforce issues in the field of nanotechnology, but there is no instrument to assess the impact of these interventions on student awareness of, exposure to, and motivation for nanotechnology. To address this need, the Nanotechnology Awareness Instrument was conceptualized. This paper is a progress report on the instrument development process. Version 1 of the instrument was administered to 335 first-year students majoring in food and agriculture fields in a pre-post fashion around a brief classroom exposure to nanotechnology. Following item analysis of the Version 1 responses, the instrument was revised. Version 2 was administered to 1,426 first-year engineering students for item and factor analyses. Results indicate that the Nanotechnology Awareness Instrument shows potential to provide valid information about student awareness of, exposure to, and motivation for nanotechnology. The instrument is not a valid measure of nano-knowledge, and this subscale was dropped from the final version. Implications include the use of the instrument to evaluate programs, interventions, or courses that attempt to increase student awareness of nanotechnology. Further study is necessary to determine how the instrument functions as a pre-post measure.

7.
The objective of the present study was to evaluate the extent to which students who took a computer adaptive test of reading comprehension that accounted for testlet effects were administered fewer passages and received a more precise estimate of their reading comprehension ability than students in the control condition. A randomized controlled trial was used in which 529 students in Grades 4-8 and 10 were randomly assigned to one of two conditions, both taking a computerized adaptive assessment of reading comprehension. Participants in the experimental condition had ability scores estimated with an item response model that accounted for item-dependence effects in the reading assessment, whereas control students took a version in which item-dependence effects were not controlled. Results indicated that examinees in the experimental condition took fewer passages (average Hedges' g = 0.97) and had more reliable estimates of their reading comprehension ability (average Hedges' g = 0.60). Findings are discussed in the context of potential time savings in assessment practices without sacrificing reliability.
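For reference, the effect size reported above, Hedges' g, is Cohen's d with a small-sample bias correction; a short sketch with made-up data:

```python
import numpy as np

def hedges_g(x, y):
    """Hedges' g: Cohen's d multiplied by the bias correction
    J = 1 - 3 / (4 * df - 1), with df = n1 + n2 - 2."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n1, n2 = len(x), len(y)
    df = n1 + n2 - 2
    pooled_sd = np.sqrt(((n1 - 1) * x.var(ddof=1)
                         + (n2 - 1) * y.var(ddof=1)) / df)
    d = (x.mean() - y.mean()) / pooled_sd
    return d * (1.0 - 3.0 / (4.0 * df - 1.0))

# E.g., passages administered in the control vs. experimental condition
# (simulated numbers, not the study's data):
rng = np.random.default_rng(1)
control = rng.normal(10.0, 2.0, 260)
experimental = rng.normal(8.0, 2.0, 269)
print(hedges_g(control, experimental))
```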

8.
9.
This study examines the effectiveness of three approaches for maintaining equivalent performance standards across test forms with small samples: (1) common-item equating, (2) resetting the standard, and (3) rescaling the standard. Rescaling the standard (i.e., applying common-item equating methodology to standard setting ratings to account for systematic differences between standard setting panels) has received almost no attention in the literature. Identity equating was also examined to provide context. Data from a standard setting form of a large national certification test (N examinees = 4,397; N panelists = 13) were split into content-equivalent subforms with common items, and resampling methodology was used to investigate the error introduced by each approach. Common-item equating (circle-arc and nominal weights mean) was evaluated at samples of size 10, 25, 50, and 100. The standard setting approaches (resetting and rescaling the standard) were evaluated by resampling (N = 8) and by simulating panelists (N = 8, 13, and 20). Results were inconclusive regarding the relative effectiveness of resetting and rescaling the standard. Small-sample equating, however, consistently produced new form cut scores that were less biased and less prone to random error than new form cut scores based on resetting or rescaling the standard.
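To fix ideas, the sketch below implements chained mean equating with common items, the simplest member of the small-sample family to which the circle-arc and nominal weights mean methods belong; the function and data are illustrative, not the study's procedures.

```python
import numpy as np

def mean_equate(new_total, anchor_new, anchor_old, old_total):
    """Chained common-item mean equating: the groups' ability difference is
    estimated from the anchor items, and new-form scores are shifted by the
    remaining form-difficulty difference to land on the old form's scale."""
    ability_diff = np.mean(anchor_new) - np.mean(anchor_old)
    form_diff = np.mean(new_total) - np.mean(old_total)
    return lambda x: x - (form_diff - ability_diff)

# Hypothetical small samples (n = 25 per form):
rng = np.random.default_rng(5)
old_total = rng.normal(70, 8, 25); old_anchor = rng.normal(14, 2, 25)
new_total = rng.normal(66, 8, 25); new_anchor = rng.normal(13, 2, 25)
to_old_scale = mean_equate(new_total, new_anchor, old_anchor, old_total)
print(to_old_scale(65.0))   # a new-form cut score placed on the old scale
```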

10.
Orlando and Thissen's S-X² item fit index has performed better than traditional item fit statistics such as Yen's Q1 and McKinley and Mills's G² for dichotomous item response theory (IRT) models. This study extends the utility of S-X² to polytomous IRT models, including the generalized partial credit model, partial credit model, and rating scale model. The performance of the generalized S-X² in assessing item model fit was studied in terms of empirical Type I error rates and power and compared to G². The results suggest that the generalized S-X² is promising for polytomous items in educational and psychological testing programs.
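The aggregation step of S-X² groups examinees by summed score and contrasts observed with model-implied proportions correct. A sketch under the simplifying assumption that the expected proportions have already been computed (in practice via the Lord-Wingersky recursion under the fitted IRT model):

```python
import numpy as np
from scipy.stats import chi2

def s_x2(n_s, obs_prop, exp_prop, n_params):
    """S-X^2 aggregation for one dichotomous item.  n_s[k]: examinee count
    at summed score k; obs_prop[k]: observed proportion correct there;
    exp_prop[k]: model-implied proportion (computation omitted here)."""
    n_s, o, e = map(np.asarray, (n_s, obs_prop, exp_prop))
    stat = np.sum(n_s * (o - e) ** 2 / (e * (1 - e)))
    df = len(n_s) - n_params          # score groups minus item parameters
    return stat, df, chi2.sf(stat, df)

# Hypothetical score-group summary for a two-parameter item:
n_s = [40, 80, 120, 80, 40]
obs = [0.20, 0.35, 0.55, 0.70, 0.90]
exp = [0.18, 0.37, 0.52, 0.72, 0.88]
print(s_x2(n_s, obs, exp, n_params=2))
```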

11.
The power of the chi-square test statistic used in structural equation modeling decreases as the absolute value of the excess kurtosis of the observed data increases, and excess kurtosis is more pronounced the smaller the number of item response categories. As a result, fit is likely to improve as the number of item response categories decreases, regardless of the true underlying factor structure or the χ²-based fit index used to examine model fit. Equivalently, given a target value of approximate fit (e.g., root mean square error of approximation ≤ .05), a model with more factors is needed to reach it as the number of categories increases. This is true regardless of whether the data are treated as continuous (common factor analysis) or as discrete (ordinal factor analysis). We recommend using a large number of response alternatives (≥ 5) to increase the power to detect incorrect substantive models.
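The mechanism can be checked numerically: discretizing a normal latent variable into fewer categories yields larger absolute excess kurtosis. A small simulation sketch (the threshold placement is arbitrary):

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(2)
latent = rng.normal(size=100_000)

for k in (2, 3, 5, 7):
    cuts = np.linspace(-2, 2, k + 1)[1:-1]   # k-1 equally spaced thresholds
    discrete = np.searchsorted(cuts, latent) # 0 .. k-1 category codes
    # Excess kurtosis moves toward 0 (the normal value) as k grows:
    print(k, round(kurtosis(discrete, fisher=True), 3))
```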

12.
The Defining Issues Test (DIT) has been the dominant measure of moral development. The DIT has its roots in Kohlberg's original stage theory of moral judgment development and asks respondents to rank a set of stage-typed statements in order of importance for six stories. However, the question of how well DIT data match the underlying stage model has never been addressed with a statistical model. We therefore applied item response theory (IRT) to a large data set (55,319 cases). We found that the ordering of the stages extracted from the raw data fitted the ordering in the underlying stage model well. Furthermore, difficulty differences of stages across the stories were found, and their magnitude and location were visualized. These findings are compatible with the notion of one latent moral developmental dimension and lend support to the hundreds of studies that have used the DIT-1 and, by implication, to the renewed DIT-2.

13.
Although the reliability of subscale scores may be suspect, subscale scores are the most common type of diagnostic information included in student score reports. This research compared methods for augmenting the reliability of subscale scores for an 8th-grade mathematics assessment. Yen's Objective Performance Index, Wainer et al.'s augmented scores, and scores based on multidimensional item response theory (IRT) models were compared, and all were found to improve the precision of the subscale scores. However, the augmented subscale scores were more highly correlated and less variable than the unaugmented scores. The meaningfulness of reporting such augmented scores, as well as the implications for validity and test development, is discussed.
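Wainer et al.'s augmentation builds on Kelley's univariate regressed score estimate, sketched below with invented values; the full method extends this shrinkage multivariately so that each subscale also borrows strength from the examinee's other subscale scores.

```python
import numpy as np

def kelley_score(x, reliability, group_mean):
    """Kelley's regressed estimate of a subscale true score:
        tau_hat = rho * x + (1 - rho) * group mean."""
    return reliability * np.asarray(x, float) + (1 - reliability) * group_mean

# Hypothetical subscale with reliability .60 and a group mean of 20:
print(kelley_score([14, 20, 26], reliability=0.60, group_mean=20.0))
# -> [16.4, 20.0, 23.6]: scores shrink toward the mean, which is why
# augmented subscale scores are less variable and more intercorrelated.
```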

14.
The purpose of the current study was to examine the validity and diagnostic accuracy of the Intervention Selection Profile-Social Skills (ISP-SS), a brief social skills assessment tool intended for use with students in need of Tier 2 intervention. Participants included 160 elementary and middle school students who had been identified through universal screening as at risk for behavioral concerns. Teacher participants (n = 71) rated each of these students using both the ISP-SS and the Social Skills Improvement System-Rating Scales (SSiS-RS), with the latter measure serving as the criterion in the validity and diagnostic accuracy analyses. Confirmatory factor analysis supported the structural validity of the ISP-SS, indicating that its items broadly conformed to a single "Social Skills" factor. Follow-up analyses suggested the ISP-SS broad scale scores demonstrated adequate internal consistency reliability, with a hierarchical omega coefficient of 0.86. Correlational analyses supported the concurrent validity of the ISP-SS items, finding each item to be moderately or highly related to its corresponding SSiS-RS subscale. Finally, analyses indicated that only three of the seven ISP-SS items demonstrated sufficient diagnostic accuracy; findings therefore suggest additional revisions are needed if the ISP-SS is to be appropriate for use in schools. Implications for practice and future research are discussed.

15.
In structural equation modeling software, either limited-information (bivariate proportions) or full-information item parameter estimation routines could be used for the 2-parameter item response theory (IRT) model. Limited-information methods assume the continuous variable underlying an item response is normally distributed. For skewed and platykurtic latent variable distributions, 3 methods were compared in Mplus: limited information, full information integrating over a normal distribution, and full information integrating over the known underlying distribution. Interfactor correlation estimates were similar for all 3 estimation methods. For the platykurtic distribution, estimation method made little difference for the item parameter estimates. When the latent variable was negatively skewed, for the most discriminating easy or difficult items, limited-information estimates of both parameters were considerably biased. Full-information estimates obtained by marginalizing over a normal distribution were somewhat biased. Full-information estimates obtained by integrating over the true latent distribution were essentially unbiased. For the a parameters, standard errors were larger for the limited-information estimates when the bias was positive but smaller when the bias was negative. For the d parameters, standard errors were larger for the limited-information estimates of the easiest, most discriminating items. Otherwise, they were generally similar for the limited- and full-information estimates. Sample size did not substantially impact the differences between the estimation methods; limited information did not gain an advantage for smaller samples.
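The two full-information variants differ only in the latent distribution integrated over. The sketch below evaluates the marginal log-likelihood of one response pattern under a 2-parameter (slope/intercept) model with a quadrature rule, so that swapping the weights (normal density vs. the true skewed density) reproduces the comparison; all parameters are illustrative.

```python
import numpy as np

def marginal_loglik(pattern, a, d, nodes, weights):
    """Full-information marginal log-likelihood of one 0/1 response
    pattern: the pattern's likelihood at each quadrature node, averaged
    with the weights of the assumed latent distribution."""
    p = 1.0 / (1.0 + np.exp(-(np.outer(nodes, a) + d)))   # nodes x items
    like = np.prod(np.where(pattern == 1, p, 1 - p), axis=1)
    return np.log(np.sum(weights * like))

# Illustrative 3-item test and a standard-normal quadrature rule;
# replacing `dens` with a skewed density gives the other variant.
a = np.array([1.5, 1.0, 0.8])
d = np.array([0.0, -0.5, 0.5])
nodes = np.linspace(-4, 4, 61)
dens = np.exp(-nodes ** 2 / 2)
weights = dens / dens.sum()
print(marginal_loglik(np.array([1, 0, 1]), a, d, nodes, weights))
```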

16.
The goal of this study was to investigate the usefulness of person-fit analysis in validating student score inferences in a cognitive diagnostic assessment. A two-stage procedure was used to evaluate person fit for a diagnostic test in the domain of statistical hypothesis testing. In the first stage, a person-fit statistic, the hierarchy consistency index (HCI; Cui, 2007; Cui & Leighton, 2009), was used to identify misfitting student item-score vectors. In the second stage, students' verbal reports were collected to provide additional information about their response processes and reveal the actual causes of the misfit. This two-stage procedure helped to identify misfit of item-score vectors to the cognitive model used in the design and analysis of the diagnostic test, and to discover the reasons for the misfit, so that students' problem-solving strategies were better understood and their performances were interpreted in a more meaningful way.

17.
Numerous assessments contain a mixture of multiple-choice (MC) and constructed-response (CR) item types, and many have been found to measure more than one trait. Thus, there is a need for multidimensional dichotomous and polytomous item response theory (IRT) modeling solutions, including multidimensional linking software. For example, multidimensional item response theory (MIRT) may have a promising future in subscale score proficiency estimation, leading toward a more diagnostic orientation, which requires the linking of these subscale scores across different forms and populations. Several multidimensional linking studies can be found in the literature; however, none has used a combination of MC and CR item types. This research therefore explores multidimensional linking accuracy for tests composed of both MC and CR items using a matching test characteristic/response function approach. The two-dimensional simulation study presented here used parameters derived from real data from a large-scale statewide assessment with two subscale scores for diagnostic profiling purposes, under varying anchor set lengths (6, 8, 16, 32, 60), across 10 population distributions, with a mixture of simple and complex structured items, using a sample size of 3,000. It was found that for a well-chosen anchor set, the parameters were recovered well after equating across all populations, even for anchor sets composed of as few as six items.
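For intuition, here is a unidimensional sketch of the characteristic-function matching idea (the Stocking-Lord criterion); the multidimensional linking in the article works in the same spirit with a matrix transformation. All parameter values are made up.

```python
import numpy as np
from scipy.optimize import minimize

def stocking_lord_loss(params, a_new, b_new, a_old, b_old, thetas):
    """Squared distance between the anchor items' test characteristic
    curves after rescaling the new form with slope A and intercept B
    (theta* = A * theta + B, so a* = a / A and b* = A * b + B)."""
    A, B = params
    a_t, b_t = a_new / A, A * b_new + B
    tcc_old = (1 / (1 + np.exp(-np.outer(thetas, a_old)
                               + a_old * b_old))).sum(axis=1)
    tcc_new = (1 / (1 + np.exp(-np.outer(thetas, a_t)
                               + a_t * b_t))).sum(axis=1)
    return np.sum((tcc_old - tcc_new) ** 2)

# Hypothetical anchor parameters on the two forms:
a_new = np.array([1.2, 0.9, 1.5]); b_new = np.array([-0.4, 0.3, 0.9])
a_old = np.array([1.0, 0.8, 1.3]); b_old = np.array([-0.5, 0.2, 0.8])
thetas = np.linspace(-4, 4, 41)
res = minimize(stocking_lord_loss, x0=[1.0, 0.0],
               args=(a_new, b_new, a_old, b_old, thetas))
print(res.x)   # estimated slope A and intercept B
```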

18.
Standards-based reform, as codified by the No Child Left Behind Act, relies on the ability of assessments to accurately reflect the learning that takes place in U.S. classrooms. However, this property of assessments, their instructional sensitivity, is rarely, if ever, investigated by test developers, states, or researchers. In this paper, the literature on the psychometric property of instructional sensitivity is reviewed. Three categories of instructional sensitivity measures are identified: those relying on item or test scores only, those relying on item or test scores together with teacher reports of instruction, and strictly judgmental methods. Each method identified in the literature is discussed alongside the evidence for its utility. Finally, recommendations are made as to the proper role of instructional sensitivity in the evaluation of assessments used under standards-based reform.
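As a concrete example of the first category (measures relying on item or test scores only), a sketch of the classic pre-to-post difference index with hypothetical data:

```python
import numpy as np

def ppdi(pre, post):
    """Pre-to-post difference index: the item p-value after instruction
    minus the p-value before instruction.  Values near 0 suggest the item
    is insensitive to instruction."""
    return np.mean(post) - np.mean(pre)

# Hypothetical item answered by the same eight students before and after
# the relevant instruction:
pre = np.array([0, 0, 1, 0, 1, 0, 0, 1])
post = np.array([1, 1, 1, 0, 1, 1, 0, 1])
print(ppdi(pre, post))   # 0.375 -> noticeably sensitive to instruction
```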

19.
Separation anxiety symptoms are very common in children. The present study examines the psychometric properties and factorial structure of the Portuguese version of the Separation Anxiety Scale for Children (SASC). Participants were 874 children (52% male) aged between 8 and 11 years (M = 9.50, SD = 1.15). Factor analysis supported the three-factor model found in the original scale. The instrument showed good reliability for the total score (α = .81); reliabilities for the three factors were Discomfort from separation, α = .80; Worry about separation, α = .72; and Calm at separation, α = .59. Validity, examined via the correlation of the SASC with the separation anxiety subscale of the SCARED, was satisfactory (r = .49), and test-retest reliability for the total scale was good (r = .81). The SASC thus shows good psychometric properties for use with Portuguese children for clinical and research purposes.
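For reference, the alpha coefficients reported above follow the standard Cronbach formula, sketched here with simulated Likert-type data:

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (examinees x items) score matrix:
        alpha = k/(k-1) * (1 - sum of item variances / total-score variance)."""
    items = np.asarray(items, float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

# Simulated 1-5 Likert responses for a 5-item factor (made-up data):
rng = np.random.default_rng(3)
trait = rng.normal(size=(200, 1))
items = np.clip(np.rint(3 + trait + rng.normal(scale=0.8, size=(200, 5))), 1, 5)
print(cronbach_alpha(items))
```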

20.
Traditional item analyses, such as those of classical test theory (CTT), use exam-taker responses to assessment items to approximate their difficulty and discrimination. The increasing adoption by educational institutions of electronic assessment platforms (EAPs) provides new avenues for assessment analytics by capturing detailed logs of an exam-taker's journey through their exam. This paper explores how logs created by EAPs can be employed alongside exam-taker responses and CTT to gain deeper insights into exam items. In particular, we propose an approach for deriving features from exam logs that approximate item difficulty and discrimination from exam-taker behaviour during an exam. Items for which difficulty and discrimination differ significantly between the CTT analysis and our approach are flagged through outlier detection for independent academic review (a simplified sketch of this log-based analysis appears after the practitioner notes below). We demonstrate our approach by analysing de-identified exam logs and responses to assessment items of 463 medical students enrolled in a first-year biomedical sciences course. The analysis shows that the number of times an exam-taker visits an item before selecting a final response is a strong indicator of the item's difficulty and discrimination. Scrutiny by the course instructor of the seven items identified as outliers suggests our log-based analysis can provide insights beyond what is captured by traditional item analyses.

Practitioner notes

What is already known about this topic
  • Traditional item analysis is based on exam-taker responses to the items using mathematical and statistical models from classical test theory (CTT). The difficulty and discrimination indices thus calculated can be used to determine the effectiveness of each item and consequently the reliability of the entire exam.
What this paper adds
  • Data extracted from exam logs can be used to identify exam-taker behaviours which complement classical test theory in approximating the difficulty and discrimination of an item and identifying items that may require instructor review.
Implications for practice and/or policy
  • Identifying the behaviours of successful exam-takers may allow us to develop effective exam-taking strategies and personal recommendations for students.
  • Analysing exam logs may also provide an additional tool for identifying struggling students and items in need of revision.
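A simplified sketch of the log-based analysis referenced above: classical difficulty and discrimination are computed from responses, and items whose log-derived feature (here, mean visits before the final response) departs from the overall trend are flagged for review. The linear fit, the z-score threshold, and the data are simplifying assumptions, not the authors' exact method.

```python
import numpy as np

def ctt_item_stats(scores):
    """Classical difficulty (p-value) and discrimination (corrected
    item-total point-biserial) for an (examinees x items) 0/1 matrix."""
    scores = np.asarray(scores, float)
    difficulty = scores.mean(axis=0)
    total = scores.sum(axis=1)
    discrimination = np.array(
        [np.corrcoef(scores[:, i], total - scores[:, i])[0, 1]
         for i in range(scores.shape[1])])
    return difficulty, discrimination

def flag_outliers(ctt_stat, log_feature, z=2.0):
    """Flag items whose log feature disagrees with the CTT statistic:
    regress the feature on the statistic and flag large residuals."""
    slope, intercept = np.polyfit(ctt_stat, log_feature, 1)
    resid = log_feature - (slope * ctt_stat + intercept)
    return np.abs(resid) > z * resid.std(ddof=1)

# Made-up data shaped like the study (463 examinees, 20 items):
rng = np.random.default_rng(4)
scores = (rng.random((463, 20)) < rng.uniform(0.4, 0.9, 20)).astype(int)
mean_visits = rng.gamma(4.0, 0.5, 20)          # mean visits per item
difficulty, discrimination = ctt_item_stats(scores)
print(flag_outliers(difficulty, mean_visits))  # items for instructor review
```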
