Similar Literature
 Found 20 similar documents; search took 31 ms
1.
2.
An IRT‐based sequential procedure is developed to monitor items for enhancing test security. The procedure uses a series of statistical hypothesis tests to examine whether the statistical characteristics of each item under inspection have changed significantly during CAT administration. This procedure is compared with a previously developed CTT‐based procedure through simulation studies. The results show that when the total number of examinees is fixed, both procedures can control the type I error rate at any reasonable significance level by choosing an appropriate cutoff point while maintaining a low type II error rate. Further, the IRT‐based method has a much lower type II error rate (that is, more power) than the CTT‐based method when the number of compromised items is small (e.g., 5), which can be achieved if the IRT‐based procedure is applied in an active mode, in the sense that flagged items are replaced with new items.

3.
This module describes and extends X‐to‐Y regression measures that have been proposed for use in the assessment of X‐to‐Y scaling and equating results. Measures are developed that are similar to those based on prediction error in regression analyses but that are directly suited to interests in scaling and equating evaluations. The regression and scaling function measures are compared in terms of their uncertainty reductions, error variances, and the contribution of true score and measurement error variances to the total error variances. The measures are also demonstrated as applied to an assessment of scaling results for a math test and a reading test. The results of these analyses illustrate the similarity of the regression and scaling measures for scaling situations when the tests have a correlation of at least .80, and also show the extent to which the measures can be adequate summaries of nonlinear regression and nonlinear scaling functions, and of heteroskedastic errors. After reading this module, readers will have a comprehensive understanding of the purposes, uses, and differences of regression and scaling functions.

4.
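Cheng Li (程力), Liu Bo (柳博). Adult Education (《成人教育》), 2012, 32(12): 15-17
The rating of estimated difficulty directly determines the difficulty of the assembled test paper and in turn affects the passing standard of the examination. An operational procedure for rating estimated difficulty in the self-taught examination is established through the following steps: forming an evaluation team, training the expert raters, analyzing difficulty factors, estimating item difficulty, adjusting difficulty values, and constructing a difficulty scale. To control rating error, targeted measures are needed: unifying the expert raters' passing standard; clarifying the meanings of statistical item difficulty and estimated item difficulty; and identifying the factors that influence item difficulty.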

5.
Some literature on elder abuse recommends, and practitioners claim, that there should be better assessment and screening tools. In order to improve the accuracy of measurement instruments, the purpose of this article is threefold: (a) to describe the construction of an instrument with formative indicators and the survey design for the sensitive topic of elder abuse, (b) to develop an analytic strategy to improve the precision of the measures, and (c) to evaluate the measurement instrument through quality criteria against outcomes of the instrument. We randomly selected 2,880 home-dwelling older women aged 60 and above from five European Union countries who participated in a survey on elder abuse. Prevalence data on abuse against older women were gathered using postal (BE, FI, PT), face-to-face (BE, LT), and telephone (AT) surveys, all with an identical instrument. A table with outcome measures was calculated to evaluate the formative indicators of the measurement instrument, and a decision strategy for item reduction was developed. The results suggest that 12 (35%) of the indicators in the original 34-indicator instrument can be omitted. The adapted version can provide the same elder abuse prevalence rates (reliability) with the same negative associations in terms of life quality (validity). The results indicate in an applied way how an elder abuse instrument can be evaluated and further developed using formative measures.

6.
Structural equation modeling (SEM) techniques were used to compare 5 methods of assessing HIV/AIDS sexual risk in a large prediction model. These were: (a) multiple measures; (b) a single latent factor; (c) modifying the computation of the dependent variables used in Methods 1 and 2 to weight sexual encounters by specific partner risk; (d) use of risk composites, obtained by multiplying number of sexual partners by number of occasions of unprotected sex; and (e) use of risk indexes that assign a number based on responses to general questions about risk behaviors. Data from 452 at‐risk women from a New England community were analyzed in 5 versions of an HIV/AIDS sexual risk prediction model. Models were compared in terms of SEM empirical fit indexes (χ² [df], average absolute standardized residuals, and Comparative Fit Index), significant paths, explained variance, theoretical fit, and simplicity. Results indicate that: (a) multiple measures and latent factor models are preferable to all others by each of the standards of comparison, (b) in the composite dependent variable models, including information about the partners' number of partners provided little additional explained variance beyond knowing the number of occasions of unprotected sex, and (c) dependent measures that did not remain close to Centers for Disease Control criteria may not be adequately predicting HIV/AIDS sexual risk. Several recommendations are presented for selecting an appropriate conceptualization of HIV/AIDS sexual risk.

7.
This study was concerned with the measurement of a set of indicators of teacher competence, defined by the teachers themselves as observable in their classroom behavior. The question it sought to answer was whether scoring keys for existing low-inference observation schedules could be developed that would measure any or all of these indicators objectively and reliably. Multiple observations were made with four such instruments in 100 classrooms in a single rural school system to provide data relevant to the question. Forty-two scoring keys were developed to measure one or another of 26 indicators identified by the teachers and used to score the records made in the 100 classrooms. An analysis of variance was made of the scores on each key to estimate its reliability and to isolate and assess errors of measurement due to lack of internal consistency and to instability confounded with observer disagreement. It was concluded that keys could be constructed to yield stable and consistent measures of most indicators of competency from records made on at least one of the low-inference observation schedules used, even though they had been designed to measure other variables, but that some efforts to refine the keys, either by empirical analysis or by having the composition of the keys verified by the teachers who defined the indicators, were necessary to ensure that the scores obtained would reflect them accurately.

8.
OBJECTIVE: There were two aims in this research. First, to examine the relationships between childhood sexual abuse and HIV drug and sexual risk taking behaviors among female prisoners, and second, to examine the relationship between a marginal adult living context and HIV drug and sexual risk taking behavior among female prisoners. METHOD: The data were collected through face-to-face interviews with a random sample of 500 women at admission to prison in 1994. Women who were sexually abused while growing up (n = 130) were compared with women who reported no sexual abuse (n = 370) along various demographic and HIV drug and sexual risk taking dimensions. RESULTS: A history of sexual abuse while growing up was associated with increased sexual risk taking behaviors in adulthood. A marginal adult living situation also emerged as an important factor increasing the risk for HIV infection. Examining the co-occurrence of both childhood sexual abuse and adult marginal living context revealed a strong relationship between these two factors and HIV risk taking activities. CONCLUSIONS: The findings indicate that childhood sexual abuse may be a predictor for HIV sexual risk taking behaviors among incarcerated women. The marginal and chaotic adult living style of these women was also associated with the extent of their HIV drug and sexual risk taking behaviors. Our research suggests that the co-occurrence of sexual victimization and marginality is a stronger predictor of HIV risk than either variable alone.

9.
Item response models are finding increasing use in achievement and aptitude test development. Item response theory (IRT) test development involves the selection of test items based on a consideration of their item information functions. But a problem arises because item information functions are determined by their item parameter estimates, which contain error. When the "best" items are selected on the basis of their statistical characteristics, there is a tendency to capitalize on chance due to errors in the item parameter estimates. The resulting test, therefore, falls short of the test that was desired or expected. The purposes of this article are (a) to highlight the problem of item parameter estimation errors in the test development process, (b) to demonstrate the seriousness of the problem with several simulated data sets, and (c) to offer a conservative solution for addressing the problem in IRT-based test development.
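
As a rough illustration of selecting items by their information functions, the sketch below assumes a two-parameter logistic (2PL) model, which the abstract does not specify. The function names and the item dictionaries with keys `a` (discrimination) and `b` (difficulty) are hypothetical. Under the 2PL model the item information at ability θ is a²·P(θ)·(1 − P(θ)).

```python
import math


def p_2pl(theta, a, b):
    """Probability of a correct response under the 2PL model (assumed model)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """2PL item information: a^2 * P * (1 - P)."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)


def select_most_informative(items, theta, k):
    """Return the k items with the highest information at theta.

    `items` is a list of dicts with hypothetical keys "a" and "b" holding
    parameter estimates; ranking on estimates is what invites
    capitalization on chance when those estimates contain error.
    """
    ranked = sorted(
        items,
        key=lambda it: item_information(theta, it["a"], it["b"]),
        reverse=True,
    )
    return ranked[:k]
```

Because the ranking uses estimated parameters, items whose discrimination happens to be overestimated are more likely to be picked, which is the capitalization on chance the abstract describes.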

10.
This study presents a new approach to synthesizing differential item functioning (DIF) effect size: First, using correlation matrices from each study, we perform a multigroup confirmatory factor analysis (MGCFA) that examines measurement invariance of a test item between two subgroups (i.e., focal and reference groups). Then we synthesize, across the studies, the differences in the estimated factor loadings between the two subgroups, resulting in a meta-analytic summary of the MGCFA effect sizes (MGCFA-ES). The performance of this new approach was examined using a Monte Carlo simulation, where we created 108 conditions by four factors: (1) three levels of item difficulty, (2) four magnitudes of DIF, (3) three levels of sample size, and (4) three types of correlation matrix (tetrachoric, adjusted Pearson, and Pearson). Results indicate that when MGCFA is fitted to tetrachoric correlation matrices, the meta-analytic summary of the MGCFA-ES performed best in terms of bias and mean square error values, 95% confidence interval coverages, empirical standard errors, Type I error rates, and statistical power; and reasonably well with adjusted Pearson correlation matrices. In addition, when tetrachoric correlation matrices are used, a meta-analytic summary of the MGCFA-ES performed well, particularly, under the condition that a high difficulty item with a large DIF was administered to a large sample size. Our result offers an option for synthesizing the magnitude of DIF on a flagged item across studies in practice.

11.
Three local observed‐score kernel equating methods that integrate methods from the local equating and kernel equating frameworks are proposed. The new methods were compared with their earlier counterparts with respect to such measures as bias (as defined by Lord's criterion of equity) and percent relative error. The local kernel item response theory observed‐score equating method, which can be used for any of the common equating designs, had a small amount of bias, a low percent relative error, and a relatively low kernel standard error of equating, even when the accuracy of the test was reduced. The local kernel equating methods for the nonequivalent groups with anchor test generally had low bias and were quite stable against changes in the accuracy or length of the anchor test. Although all proposed methods showed small percent relative errors, the local kernel equating methods for the nonequivalent groups with anchor test design had somewhat larger standard error of equating than their kernel method counterparts.

12.
In the past decade, there has been interest in the assessment of cognitive and affective processes and products for the purposes of meaningful learning. Meaningful measurement (MM) has been proposed which is in accordance with a humanistic constructivist information‐processing perspective. Students’ responses to the assessment tasks are now evaluated according to an item response measurement model, together with a hypothesized model detailing the progressive forms of knowing/competence under examination. There is a possibility of incorporating student errors and alternative frameworks into these evaluation procedures. Meaningful measurement leads us to examine the composite concepts of “ability” and “difficulty”. Under the rubric of meaningful measurement, validity assessment (i.e. internal and external components of construct validity) is essentially the same as an inquiry into the meanings afforded by the measurements. Concepts of reliability, expressed as a group statistic that is applied in the same way to all the examinees in the sample, have to be obviated when the precision of the trait estimates stemming from the item response measurement models can be determined at each trait level. Reliability, measured in terms of standard errors of estimates, needs to be within acceptable limits when internal validity is to be secured. Further evidence of validity may be provided by in‐depth analyses of how “epistemic subjects” of different levels of competence and proficiency engage in different types of assessment tasks, where affective and metacognitive behaviours may be examined as well. These ways of undertaking MM can be codified by proposing a three‐level conceptualization of MM. It is within the rubric of this conceptualization and the MM enquiry paradigm that validity and reliability of test measures are discussed in this paper.

13.
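Item writing is a basic element and key link of examinations, and it embodies the scientific rationality of examinations as a means of measurement. The structural elements of an item are the stimulus situation (cue material), the question (response instruction), the score allocation, and the answer and scoring criteria. Scientific soundness in item writing means that the item as a whole, together with the situational material, concepts and principles, and reasoning and argumentation involved in its structural elements, is accurate and reliable, with no errors or ambiguity. The technical specifications on the scientific-soundness dimension require that item writers first clarify the "construct" of the test, that is, which trait of the examinee the test is actually meant to measure. Because a construct is relatively abstract, it needs to be decomposed and made concrete in terms of content domain and cognitive ability; based on the test construct, important rather than trivial learning content should be assessed. Second, they require that the stimulus situation itself be reasonable and scientifically sound, that the response instruction be clear, explicit, and unambiguous, that the score allocation be reasonable, and that the reference answer and scoring criteria be precise and logically consistent with the situational material. Item writing also needs to attend to the local independence of items. Scientific soundness is a basic principle that every examination must follow; from start to finish, item writing must avoid the measurement bias and risk caused by errors of scientific soundness.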

14.
Trend estimation in international comparative large‐scale assessments relies on measurement invariance between countries. However, cross‐national differential item functioning (DIF) has been repeatedly documented. We ran a simulation study using national item parameters, which required trends to be computed separately for each country, to compare trend estimation performances to two linking methods employing international item parameters across several conditions. The trend estimates based on the national item parameters were more accurate than the trend estimates based on the international item parameters when cross‐national DIF was present. Moreover, the use of fixed common item parameter calibrations led to biased trend estimates. The detection and elimination of DIF can reduce this bias but is also likely to increase the total error.

15.
Noting the desirability of the current shift toward mastery testing and criterion-referenced test procedures, an evaluation model is presented which should be useful and practical for such purposes. This model is based on the assumptions that the learning of fundamental skills can be considered all or none, that each item response on a single skill test represents an unbiased sample of the examinee's true mastery status, that measurement error occurring on the test (as estimated from the average interitem correlation) can be of only one type (α or β) for each examinee, and that through practical and theoretical considerations of evaluation error costs and item error characteristics, an optimal mastery criterion can be calculated. Each of these assumptions is discussed and the resultant mastery criteria algorithm is described along with an example from the IPI math program.

16.
Assessment items are commonly field tested prior to operational use to observe statistical item properties such as difficulty. Item parameter estimates from field testing may be used to assign scores via pre-equating or computer adaptive designs. This study examined differences between item difficulty estimates based on field test and operational data and the relationship of such differences to item position changes and student proficiency estimates. Item position effects were observed for 20 assessments, with items in later positions tending to be more difficult. Moreover, field test estimates of item difficulty were biased slightly upward, which may indicate examinee knowledge of which items were being field tested. Nevertheless, errors in field test item difficulty estimates had negligible impacts on student proficiency estimates for most assessments. Caution is still warranted when using field test statistics for scoring, and testing programs should conduct investigations to determine whether the effects on scoring are inconsequential.

17.
The focus of this paper is assessing the impact of measurement errors on the prediction error of an observed‐score regression. Measures are presented and described for decomposing the linear regression's prediction error variance into parts attributable to the true score variance and the error variances of the dependent variable and the predictor variable(s). These measures are demonstrated for regression situations reflecting a range of true score correlations and reliabilities and using one and two predictors. Simulation results also are presented which show that the measures of prediction error variance and its parts are generally well estimated for the considered ranges of true score correlations and reliabilities and for homoscedastic and heteroscedastic data. The final discussion considers how the decomposition might be useful for addressing additional questions about regression functions’ prediction error variances.

18.
Test assembly is the process of selecting items from an item pool to form one or more new test forms. Often new test forms are constructed to be parallel with an existing (or an ideal) test. Within the context of item response theory, the test information function (TIF) or the test characteristic curve (TCC) is commonly used as a statistical target to obtain this parallelism. In a recent study, Ali and van Rijn proposed combining the TIF and TCC as statistical targets, rather than using only a single statistical target. In this article, we propose two new methods using this combined approach, and compare these methods with single statistical targets for the assembly of mixed‐format tests. In addition, we introduce new criteria to evaluate the parallelism of multiple forms. The results show that single statistical targets can be problematic, while the combined targets perform better, especially in situations with increasing numbers of polytomous items. Implications of using the combined target are discussed.

19.
This article presents a methodology for examining the content and nature of item parcels as indicators of a conceptually defined latent construct. An essential component of this methodology is the 2-facet measurement model, which includes items and parcels as facets of construct indicators. The 2-facet model tests assumptions required for accepting parcels as aggregates of item covariation in representing the latent construct. According to this methodology, parcels are acceptable indicators of the latent construct if the 2-facet model meets parametric assumptions for unidimensionality and if items and parcels have content validity as measures of the latent construct. The proposed methodology is illustrated using a 1-factor model of the Worry construct in the test anxiety measurement tradition.

20.
In this ITEMS module, we frame the topic of scale reliability within a confirmatory factor analysis and structural equation modeling (SEM) context and address some of the limitations of Cronbach's α. This modeling approach has two major advantages: (1) it allows researchers to make explicit the relation between their items and the latent variables representing the constructs those items intend to measure, and (2) it facilitates a more principled and formal practice of scale reliability evaluation. Specifically, we begin the module by discussing key conceptual and statistical foundations of the classical test theory model and then framing it within an SEM context; we do so first with a single item and then expand this approach to a multi‐item scale. This allows us to set the stage for presenting different measurement structures that might underlie a scale and, more importantly, for assessing and comparing those structures formally within the SEM context. We then make explicit the connection between measurement model parameters and different measures of reliability, emphasizing the challenges and benefits of key measures while ultimately endorsing the flexible McDonald's ω over Cronbach's α. We then demonstrate how to estimate key measures in both a commercial software program (Mplus) and three packages within an open‐source environment (R). In closing, we make recommendations for practitioners about best practices in reliability estimation based on the ideas presented in the module.
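
For orientation, the two reliability coefficients named above can be sketched as follows. This is a minimal illustration and not the module's Mplus or R code. The function names are hypothetical, and the ω formula assumes a single-factor model with uncorrelated errors and the factor variance fixed to 1, with loadings and unique variances supplied from an already fitted model.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha from raw data.

    `scores` is a list of respondents, each a list of k item scores.
    alpha = k/(k-1) * (1 - sum of item variances / variance of total score)
    """
    k = len(scores[0])
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1.0 - sum(item_vars) / total_var)


def mcdonald_omega(loadings, unique_variances):
    """McDonald's omega for a one-factor model (assumption: uncorrelated
    errors, factor variance fixed to 1).

    omega = (sum of loadings)^2 / ((sum of loadings)^2 + sum of unique variances)
    """
    s = sum(loadings)
    return s * s / (s * s + sum(unique_variances))
```

Unlike α, the ω computation takes the loadings as input, so items are allowed to relate to the latent variable with different strengths; estimating those loadings requires fitting the factor model first, which this sketch leaves out.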
