Similar literature
20 similar documents found (search time: 78 ms)
1.
The study of change is based on the idea that the score or index at each measurement occasion has the same meaning and metric across time. In tests or scales with multiple items, such as those common in the social sciences, there are multiple ways to create such scores. Some options include using raw or sum scores (i.e., sum of item responses or linear transformation thereof), using Rasch-scaled scores provided by the test developers, fitting item response models to the observed item responses and estimating ability or aptitude, and jointly estimating the item response and growth models. We illustrate that this choice can have an impact on the substantive conclusions drawn from the change analysis using longitudinal data from the Applied Problems subtest of the Woodcock–Johnson Psycho-Educational Battery–Revised collected as part of the National Institute of Child Health and Human Development's Study of Early Child Care. Assumptions of the different measurement models, their benefits and limitations, and recommendations are discussed.

2.
Careless responding is a bias in survey responses that disregards the actual item content, constituting a threat to the factor structure, reliability, and validity of psychological measurements. Different approaches have been proposed to detect aberrant responses, such as probing questions that directly assess test-taking behavior (e.g., bogus items), auxiliary or paradata (e.g., response times), or data-driven statistical techniques (e.g., Mahalanobis distance). In the present study, gradient boosted trees, a state-of-the-art machine learning technique, are introduced to identify careless respondents. The performance of the approach was compared with established techniques previously described in the literature (e.g., statistical outlier methods, consistency analyses, and response pattern functions) using simulated data and empirical data from a web-based study, in which diligent versus careless response behavior was experimentally induced. In the simulation study, gradient boosting machines outperformed traditional detection mechanisms in flagging aberrant responses. However, this advantage did not transfer to the empirical study. In terms of precision, the results of both the traditional and the novel detection mechanisms were unsatisfactory, although the latter incorporated response times as additional information. The comparison between the results of the simulation and the online study showed that responses in real-world settings seem to be much more erratic than can be expected from the simulation studies. We critically discuss the generalizability of currently available detection methods and provide an outlook on future research on the detection of aberrant response patterns in survey research.

3.
Although there is a common understanding of instructional sensitivity, it lacks a common operationalization. Various approaches have been proposed, some focusing on item responses, others on test scores. As approaches often do not produce consistent results, previous research has created the impression that approaches to instructional sensitivity are noticeably fragmented. To counter this impression, we present an item response theory–based framework that can help us to understand similarities and differences between existing approaches. Using empirical data for illustration, this article identifies three perspectives on instructional sensitivity: One perspective views instructional sensitivity as the capacity to detect differences in students' stages of learning across points of time. A second perspective treats instructional sensitivity as the capacity to detect differences between groups that have received different instruction. For a third perspective, the previous two are combined to consider differences between both time points and groups. We discuss linking sensitivity indices to measures of instruction.

4.
Practical use of the matrix sampling (i.e., item sampling) technique requires the assumption that an examinee's response to an item is independent of the context in which the item occurs. This assumption was tested experimentally by comparing the responses of examinees to a population of items with the responses of examinees to item samples. Matrix sampling mean and variance estimates for verbal, quantitative, and attitude tests were used as dependent variables to test for differences between the “context” and “out-of-context” groups. The estimates obtained from both treatment groups were also compared with actual population values. No significant differences were found between treatments on matrix sample parameter estimates for any of the three types of tests.

5.
Standard setting methods such as the Angoff method rely on judgments of item characteristics; item response theory empirically estimates item characteristics and displays them in item characteristic curves (ICCs). This study evaluated several indexes of rater fit to ICCs as a method for judging rater accuracy in their estimates of expected item performance for target groups of test-takers. Simulated data were used to compare adequately fitting ratings to poorly fitting ratings at various target competence levels in a simulated two-stage standard setting study. The indexes were then applied to a set of real ratings on 66 items evaluated at 4 competence thresholds to demonstrate their relative usefulness for gaining insight into rater “fit.” Based on analysis of both the simulated and real data, it is recommended that fit indexes based on the absolute deviations of ratings from the ICCs be used, and that those based on the standard errors of ratings be avoided. Suggestions are provided for using these indexes in future research and practice.

6.
In international large-scale surveys, constructed response (CR) items are increasingly being used and multiple-choice (MC) items are being used less frequently. In this article the two item types will be compared in terms of any differences they have on national mean scores. TIMSS 1995 and TIMSS 1999 data have been used. Are there different effects of the question types for mathematics and science? Does the introduction of open-ended items into the math and science tests affect the math and science achievement results?

7.
Both structural equation modeling (SEM) and item response theory (IRT) can be used for factor analysis of dichotomous item responses. In this case, the measurement models of both approaches are formally equivalent. They were refined within and across different disciplines, and make complementary contributions to central measurement problems encountered in almost all empirical social science research fields. In this article (a) fundamental formal similarities between IRT and SEM models are pointed out. It will be demonstrated how both types of models can be used in combination to analyze (b) the dimensional structure and (c) the measurement invariance of survey item responses. All analyses are conducted with Mplus, which allows an integrated application of both approaches in a unified, general latent variable modeling framework. The aim is to promote a diffusion of useful measurement techniques and skills from different disciplines into empirical social research.

8.
Contrasts between constructed-response items and multiple-choice counterparts have yielded but a few weak generalizations. Such contrasts typically have been based on the statistical properties of groups of items, an approach that masks differences in properties at the item level and may lead to inaccurate conclusions. In this article, we examine item-level differences between a certain type of constructed-response item (called figural response) and comparable multiple-choice items in the domain of architecture. Our data show that in comparing two item formats, item-level differences in difficulty correspond to differences in cognitive processing requirements and that relations between processing requirements and psychometric properties are systematic. These findings illuminate one aspect of construct validity that is frequently neglected in comparing item types, namely the cognitive demand of test items.

9.
Around the world, multiple-choice tests are widely used as part of high-stakes examinations. To counteract lucky guessing, many of them have instituted a penalty for wrong answers. In this paper, we use administrative data from the Turkish college admissions test to study the heterogeneity in gender differences in the tendency to leave questions blank across subjects, difficulty levels, and stakes. Exploiting the tracking system and using the resulting variation in the effective guessing penalty across different test sections, we find that female test-takers skip significantly more questions than male test-takers in the quantitative track, while we do not find a significant difference in other tracks. Among quantitative track students, the gender gap is larger in Math and when questions are more difficult, while it reverses in Literature. We also find that self-assessment is related to skipping behavior and explains part of the gender gap. Male test-takers are more likely than female test-takers to report that they are good at Math, Science, and Social Science after conditioning on their number of correct answers in the corresponding test sections. This gender gap, consistent with the one in skipping behavior, reverses when it comes to Literature. In contrast to previous literature, our findings suggest that the magnitude and the sign of the gender gap in answering questions under uncertainty are context dependent.

10.
The paper deals with the investigation of gender differences in performances in mathematics for Italian students at the end of lower secondary school. The study is based on a new large-scale assessment test developed and administered by the National Evaluation Institute for the School System. Given the evidence in the literature which favors males, performances of female and male students are compared using different approaches. Scores proposed by educational experts based on item subgroups were considered, while a model-based approach was used within item response theory. The results revealed a significant advantage to males in overall performance, while no meaningful differences were observed with respect to item domain and type. An interpretable item map was developed crossing expert opinions with IRT abilities, and plausible proficiency levels were defined. According to the map-based student classification, a relatively lower percentage of females fell into the highest proficiency groups with respect to males.

11.
Recent developments of person-fit analysis in computerized adaptive testing (CAT) are discussed. Methods from statistical process control are presented that have been proposed to classify an item score pattern as fitting or misfitting the underlying item response theory model in CAT. Most person-fit research in CAT is restricted to simulated data. In this study, empirical data from a certification test were used. Alternatives are discussed to generate norms so that bounds can be determined to classify an item score pattern as fitting or misfitting. Using bounds determined from a sample of a high-stakes certification test, the empirical analysis showed that different types of misfit can be distinguished. Further applications using statistical process control methods to detect misfitting item score patterns are discussed.

12.
The effects of computer and paper test media on EFL test-takers with different computer familiarity in writing scores and in the cognitive writing process have been comprehensively explored from the learners’ aspect as well as on the basis of related theories and practice. The results indicate significant differences in test scores among the test-takers who are less familiar with computers, showing that the computer test medium has greatly impacted this group of test-takers’ writing scores. From the perspective of the cognitive process, they are not significantly different in such stages as ‘goal-setting’, ‘generating ideas’ and ‘reviewing’, while their ‘organising ideas’ and ‘translating’ stages were greatly different, owing mainly to the nature of the test medium itself.

13.
The sample invariance of item discrimination statistics is evaluated in this case study using real data. The hypothesized superiority of the item response model (IRM) is tested against structural equation modeling (SEM) for responses to the Center for Epidemiologic Studies-Depression (CES-D) scale. Responses from 10 random samples of 500 people were drawn from a base sample of 6,621 participants across gender, age, and different health groups. Hierarchical tests of multiple-group structural equation models indicated statistically significant differences exist in item regressions across contrast groups. Although the IRM item discrimination estimates were most stable in all conditions of this case study, additional research on the precision of individual scores and possible item bias is required to support the validity of either model for scoring the CES-D. The SEM approach to examining between-group differences holds promise for any field where heterogeneous populations are assessed and important consequences arise from score interpretations.

14.
A polytomous item is one for which the responses are scored according to three or more categories. Given the increasing use of polytomous items in assessment practices, item response theory (IRT) models specialized for polytomous items are becoming increasingly common. The purpose of this ITEMS module is to provide an accessible overview of polytomous IRT models. The module presents commonly encountered polytomous IRT models, describes their properties, and contrasts their defining principles and assumptions. After completing this module, the reader should have a sound understanding of what a polytomous IRT model is, the manner in which the equations of the models are generated from the model's underlying step functions, how widely used polytomous IRT models differ with respect to their definitional properties, and how to interpret the parameters of polytomous IRT models.

15.
Researchers interested in exploring substantive group differences are increasingly attending to bundles of items (or testlets): the aim is to understand how gender differences, for instance, are explained by differential performances on different types or bundles of items, hence differential bundle functioning (DBF). Some previous work has modelled hierarchies in data in this context or considered item responses within persons, but here we model the bundles themselves as explanatory variables at the item level potentially explaining significant intra-class correlation due to gender differences in item difficulty, and thus explaining variation at the second item level. In this study, we analyse DBF using single- and two-level models (the latter modelling random item effects, which models responses at Level 1 and items at Level 2) in a high-stakes National Mathematics test. The models show comparable regression coefficients but the statistical significances of the two-level models are smaller due to the larger values of the estimated standard errors. We discuss the contrasting relevance of this effect for test developers and gender researchers.

16.
We propose a structural equation model, which reduces to a multidimensional latent class item response theory model, for the analysis of binary item responses with nonignorable missingness. The missingness mechanism is driven by 2 sets of latent variables: one describing the propensity to respond and the other referring to the abilities measured by the test items. These latent variables are assumed to have a discrete distribution, so as to reduce the number of parametric assumptions regarding the latent structure of the model. Individual covariates can also be included through a multinomial logistic parameterization for the distribution of the latent variables. Given the discrete nature of this distribution, the proposed model is efficiently estimated by the expectation–maximization algorithm. A simulation study is performed to evaluate the finite-sample properties of the parameter estimates. Moreover, an application is illustrated with data coming from a student entry test for admission to some university courses.

17.
The definition of what it means to take a test online continues to evolve with the inclusion of a broader range of item types and a wide array of devices used by students to access test content. To assure the validity and reliability of test scores for all students, device comparability research should be conducted to evaluate the impact of testing device on student test performance. The current study looked at the comparability of test scores across tablets and computers for high school students in three commonly assessed content areas and for a variety of different item types. Results indicate no statistically significant differences across device type for any content area or item type. Student survey results suggest that students may have a preference for taking tests on devices with which they have more experience, but that even limited exposure to tablets in this study increased positive responses for testing on tablets.

18.
Latent growth modeling allows social behavioral researchers to investigate within-person change and between-person differences in within-person change. Typically, conventional latent growth curve models are applied to continuous variables, where the residuals are assumed to be normally distributed, whereas categorical variables (i.e., binary and ordinal variables), which do not hold to normal distribution assumptions, have rarely been used. This article describes the latent growth curve model with categorical variables, and illustrates applications using Mplus software that are applicable to social behavioral research. The illustrations use marital instability data from the Iowa Youth and Family Project. We close with recommendations for the specification and parameterization of growth models that use both logit and probit link functions.

19.
Researchers have documented the impact of rater effects, or raters’ tendencies to give different ratings than would be expected given examinee achievement levels, in performance assessments. However, the degree to which rater effects influence person fit, or the reasonableness of test-takers’ achievement estimates given their response patterns, has not been investigated. In rater-mediated assessments, person fit reflects the reasonableness of rater judgments of individual test-takers’ achievement over components of the assessment. This study illustrates an approach to visualizing and evaluating person fit in assessments that involve rater judgment using rater-mediated person response functions (rm-PRFs). The rm-PRF approach allows analysts to consider the impact of rater effects on person fit in order to identify individual test-takers for whom the assessment results may not have a straightforward interpretation. A simulation study is used to evaluate the impact of rater effects on person fit. Results indicate that rater effects can compromise the interpretation and use of performance assessment results for individual test-takers. Recommendations are presented that call researchers and practitioners to supplement routine psychometric analyses for performance assessments (e.g., rater reliability checks) with rm-PRFs to identify students whose ratings may have compromised interpretations as a result of rater effects, person misfit, or both.

20.
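Taking ten popular destination cities, including Beijing and Shanghai, as the objects of study and using data on inbound visitor arrivals and tourism revenue for 1984-2008, this paper analyzes the temporal synchronicity and regional response of urban inbound tourism development. The results show that, from 1984 to 2008, the development of urban inbound tourism was highly synchronous over time: growth was rapid in 1984-1991, moderate in 1992-2003, and accelerating in 2004-2008. The tourism growth rate went through "five rises and four falls," and the fluctuation cycles of the cities rose and fell together. Based on differences in growth indices and correlation coefficients, the ten cities are divided into two types. Owing to differences in tourism investment, resource development, transport location, and other factors, inbound tourism development in each city shows a different regional response. The paper divides the study period into three stages, computes the base-period value and average growth rate of inbound tourism for each city in 1984-1991, 1992-2003, and 2004-2008, and classifies the regional response types of inbound tourism development according to the magnitude of the base-period value and average growth rate in each stage.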
