Similar literature
 Found 20 similar articles; search took 489 ms
1.
Previous methods for estimating the conditional standard error of measurement (CSEM) at specific score or ability levels are critically discussed, and a brief summary of prior empirical results is given. A new method is developed that avoids theoretical problems inherent in some prior methods, is easy to implement, and estimates not only a quantity analogous to the CSEM at each score but also the conditional standard error of prediction (CSEP) at each score and the conditional true score standard deviation (CTSSD) at each score. The new method differs from previous methods in that previous methods have concentrated on attempting to estimate error variance conditional on a fixed value of true score, whereas the new method considers the variance of observed scores conditional on a fixed value of an observed parallel measurement and decomposes these conditional observed score variances into true and error parts. The new method and several older methods are applied to a variety of tests, and representative results are graphically displayed. The CSEM-like estimates produced by the new method are called conditional standard error of measurement in prediction (CSEMP) estimates and are similar to those produced by older methods, but the CSEP estimates produced by the new method offer an alternative interpretation of the accuracy of a test at different scores. Finally, evidence is presented that shows that previous methods can produce dissimilar results and that the shape of the score distribution may influence the way in which the CSEM varies across the score scale.

2.
In Woodruff (1990), I derived estimates for the conditional standard error of measurement in prediction (CSEMP), the conditional standard error of estimation (CSEE), and the conditional standard error of prediction (CSEP). My original estimates assume that the conditional residual error score variances and the conditional residual true score variances, obtained from the regression of an observed score onto a parallel observed score, obey the same step-up rules as do the marginal error score variance and the marginal true score variance. The present article derives alternative estimates for the various test score conditional variances that do not depend on these assumptions.

3.
This module describes and extends X‐to‐Y regression measures that have been proposed for use in the assessment of X‐to‐Y scaling and equating results. Measures are developed that are similar to those based on prediction error in regression analyses but that are directly suited to interests in scaling and equating evaluations. The regression and scaling function measures are compared in terms of their uncertainty reductions, error variances, and the contribution of true score and measurement error variances to the total error variances. The measures are also demonstrated as applied to an assessment of scaling results for a math test and a reading test. The results of these analyses illustrate the similarity of the regression and scaling measures for scaling situations when the tests have a correlation of at least .80, and also show the extent to which the measures can be adequate summaries of nonlinear regression and nonlinear scaling functions, and of heteroskedastic errors. After reading this module, readers will have a comprehensive understanding of the purposes, uses, and differences of regression and scaling functions.

4.
An improved method is derived for estimating conditional measurement error variances, that is, error variances specific to individual examinees or specific to each point on the raw score scale of the test. The method involves partitioning the test into short parallel parts, computing for each examinee the unbiased estimate of the variance of part-test scores, and multiplying this variance by a constant dictated by classical test theory. Empirical data are used to corroborate the principal theoretical deductions.
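
The three steps in this abstract (partition, per-examinee part-score variance, multiply by a constant) can be sketched roughly as follows. This is only an illustration, not the article's procedure: it assumes the k parts are strictly parallel, in which case classical test theory gives k as the multiplier for the total-score error variance. The function name and the sample scores are hypothetical.

```python
from statistics import variance


def conditional_error_variance(part_scores):
    """Estimate one examinee's error variance for the total score.

    part_scores: that examinee's scores on k parallel parts of the test.
    Assumption (not from the abstract): the parts are strictly parallel,
    so the total-score error variance is k times the part-score variance.
    """
    k = len(part_scores)
    if k < 2:
        raise ValueError("at least two parts are needed to estimate a variance")
    # statistics.variance uses the unbiased (n - 1) denominator
    return k * variance(part_scores)


# Hypothetical examinee with scores on four parallel parts
examinee_parts = [4, 6, 5, 5]
error_variance = conditional_error_variance(examinee_parts)  # 4 * (2/3)
conditional_sem = error_variance ** 0.5
```

Averaging these per-examinee estimates over everyone with the same raw score would give an estimate for each point on the raw score scale, which is the second use the abstract mentions.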

5.
This study investigates a sequence of item response theory (IRT) true score equatings based on various scale transformation approaches and evaluates equating accuracy and consistency over time. The results show that the biases and sample variances for the IRT true score equating (both direct and indirect) are quite small (except for the mean/sigma method). The biases and sample variances for the equating functions based on the characteristic curve methods and concurrent calibrations for adjacent forms are smaller than the biases and variances for the equating functions based on the moment methods. In addition, the IRT true score equating is also compared to the chained equipercentile equating, and we observe that the sample variances for the chained equipercentile equating are much smaller than the variances for the IRT true score equating with an exception at the low scores.

6.
Instruction cannot be really personalised, as long as assessment remains norm‐referenced. Whereas psychometrics aims at differentiating the performances of individuals at a given moment, edumetrics aims at differentiating stages of learning for a given individual. The structure of the two projects is the same and generalisability theory offers symmetrical formulae for estimating the reliability of each of these measurement designs. An example is presented in this paper which shows that satisfactory reliability can be obtained in an edumetric situation, where the between‐pupils variance is completely ignored. Even though the absolute error variance is the same in both cases, the relative error variances and hence the standard errors of measurement are different. As the true score variances are also different, the edumetric properties of a test should be considered alongside its psychometric ones. Certification of progress by the teacher, supporting a portfolio of achievement, could even have a summative, as well as a formative, function.

7.
Structural equation modeling provides the framework for investigating experimental effects on the basis of variances and covariances in repeated measurements. A special type of confirmatory factor analysis as part of this framework enables the appropriate representation of the experimental effect and the separation of experimental and nonexperimental parts of variance. The constraint of the matrix of loadings is essential for the representation of the effect. Appropriate constraints of loadings are achievable with the aid of the polynomial function. The representation can even bear on several response modes. The usefulness of this method is demonstrated in data obtained by an experimental task with 3 treatment levels with respect to reaction times and error scores. A model with latent variables representing constancy and increase in reaction times and one latent variable representing increase in error scores serves best in these data. Both reaction times and error scores show experimental effects.

8.
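Under a given weighted regression model, the performance of the least squares estimator, the optimally weighted least squares estimator, and the linear unbiased minimum-variance estimator is compared. When the covariance matrix of the random errors is invertible, an expression can be computed for the difference between the error covariance matrices of the optimally weighted least squares estimator and the linear unbiased minimum-variance estimator, and under certain conditions the two tend to agree.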

9.
Increasing the correlation between the independent variable and the mediator (a coefficient) increases the effect size (ab) for mediation analysis; however, increasing a by definition increases collinearity in mediation models. As a result, the standard errors of product tests increase. The variance inflation caused by increases in a at some point outweighs the increase of the effect size (ab) and results in a loss of statistical power. This phenomenon also occurs with nonparametric bootstrapping approaches because the variance of the bootstrap distribution of ab approximates the variance expected from normal theory. Both variances increase dramatically when a exceeds the b coefficient, thus explaining the power decline with increases in a. Implications for statistical analysis and applied researchers are discussed.
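
As a rough illustration of a "product test", the sketch below computes the first-order (Sobel) standard error of the indirect effect ab and the resulting z statistic. The abstract does not say which product test the authors studied, so the choice of the Sobel form is an assumption, and the function name and input values are hypothetical.

```python
import math


def sobel_test(a, se_a, b, se_b):
    """First-order (Sobel) standard error and z statistic for the product ab.

    a, b: estimated path coefficients (X -> M and M -> Y given X).
    se_a, se_b: their estimated standard errors.
    """
    se_ab = math.sqrt(a ** 2 * se_b ** 2 + b ** 2 * se_a ** 2)
    return se_ab, (a * b) / se_ab


# Hypothetical estimates: the a**2 * se_b**2 term grows with a, and se_b
# itself tends to grow as a raises collinearity between X and M.
se_ab, z = sobel_test(a=0.5, se_a=0.1, b=0.4, se_b=0.1)
```

The formula makes the trade-off in the abstract visible: a larger a raises the numerator ab, but it also enters the standard error directly and, through collinearity, indirectly via se_b.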

10.
The authors sought to identify through Monte Carlo simulations those conditions for which analysis of covariance (ANCOVA) does not maintain adequate Type I error rates and power. The conditions that were manipulated included assumptions of normality and variance homogeneity, sample size, number of treatment groups, and strength of the covariate-dependent variable relationship. Alternative tests studied were Quade's procedure, Puri and Sen's solution, Burnett and Barr's rank difference scores, Conover and Iman's rank transformation test, Hettmansperger's procedure, and the Puri-Sen-Harwell-Serlin test. For balanced designs, the ANCOVA F test was robust and was often the most powerful test through all sample-size designs and distributional configurations. With unbalanced designs, with variance heterogeneity, and when the largest treatment-group variance was matched with the largest group sample size, the nonparametric alternatives generally outperformed the ANCOVA test. When sample size and variance ratio were inversely coupled, all tests became very liberal; no test maintained adequate control over Type I error.

11.
The relation between test reliability and statistical power has been a controversial issue, perhaps due in part to a 1975 publication in the Psychological Bulletin by Overall and Woodward, “Unreliability of Difference Scores: A Paradox for the Measurement of Change”, in which they demonstrated that a Student t test based on pretest-posttest differences can attain its greatest power when the difference score reliability is zero. In the present article, the authors attempt to explain this paradox by demonstrating in several ways that power is not a mathematical function of reliability unless either true score variance or error score variance is constant.

12.
It is well known that measurement error in observable variables induces bias in estimates in standard regression analysis and that structural equation models are a typical solution to this problem. Often, multiple indicator equations are subsumed as part of the structural equation model, allowing for consistent estimation of the relevant regression parameters. In many instances, however, embedding the measurement model into structural equation models is not possible because the model would not be identified. To correct for measurement error one has no other recourse than to provide the exact values of the variances of the measurement error terms of the model, although in practice such variances cannot be ascertained exactly, but only estimated from an independent study. The usual approach so far has been to treat the estimated values of error variances as if they were known exact population values in the subsequent structural equation modeling (SEM) analysis. In this article we show that fixing measurement error variance estimates as if they were true values can make the reported standard errors of the structural parameters of the model smaller than they should be. Inferences about the parameters of interest will be incorrect if the estimated nature of the variances is not taken into account. For general SEM, we derive an explicit expression that provides the terms to be added to the standard errors provided by the standard SEM software that treats the estimated variances as exact population values. Interestingly, we find there is a differential impact of the corrections to be added to the standard errors depending on which parameter of the model is estimated. The theoretical results are illustrated with simulations and also with empirical data on a typical SEM model.

13.
Projecting the changes in the reliability of a difference score (d = X - Y) as a consequence of changes in the reliabilities of X and Y does not represent a straightforward application of the Spearman-Brown formula. Formulas are developed for estimating the changes in the reliability of X - Y under two possible assumptions: (a) X and Y have equal variances both before and after their reliabilities are altered, and (b) X and Y have unequal variances before and after X and Y are modified. The second of these situations, which includes the first as a special case, is probably the more common.
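
For orientation, the sketch below computes the standard classical-test-theory reliability of a difference score d = X - Y, the quantity whose change the article sets out to project. It is not the article's projection formula. It assumes uncorrelated measurement errors in X and Y, and the function and parameter names are hypothetical.

```python
def difference_score_reliability(var_x, var_y, rel_x, rel_y, r_xy):
    """Classical reliability of d = X - Y, assuming uncorrelated errors.

    var_x, var_y: observed-score variances of X and Y.
    rel_x, rel_y: reliabilities of X and Y.
    r_xy: observed correlation between X and Y.
    """
    cov_xy = r_xy * (var_x * var_y) ** 0.5
    true_var_d = rel_x * var_x + rel_y * var_y - 2 * cov_xy
    observed_var_d = var_x + var_y - 2 * cov_xy
    return true_var_d / observed_var_d


# Equal-variance case, as in assumption (a); unequal variances, as in
# assumption (b), only require different var_x and var_y.
rel_d = difference_score_reliability(1.0, 1.0, 0.8, 0.8, 0.5)  # about 0.6
```

Because the reliabilities of X and Y enter both directly and through the observed covariance, lengthening X or Y does not simply step up the reliability of d the way the Spearman-Brown formula steps up a single test.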

14.
Research Findings: We evaluated the score stability of the Mathematical Quality of Instruction (MQI), an observational measure of mathematics instruction. Three raters each scored, independently, 100 video-recorded lessons taught by 20 kindergarten teachers in the spring. Using generalizability theory analyses, we decomposed the MQI’s score stability into potential sources of variation (teachers, lessons, raters, and their interactions). The 13-item (3-domain) Ambitious Mathematics Instruction scale and the Whole Lesson scale each explained about one third of the variance attributed to differences in the main construct of interest (teachers’ instructional strategies). The MQI’s Errors and Imprecision scale was not relevant at the kindergarten level; there were virtually no errors and/or ambiguities observed across the 100 mathematics lessons. In a series of decision studies, we examined improvements in reliability with combinations of up to 6 raters and 8 lessons. Only the Richness of Mathematics domain scores and the Whole Lesson scores achieved acceptable reliabilities. Practice or Policy: The findings have important implications for the use of observation measures to document teachers’ mathematics practices in the early years of school.

15.
This paper presents the results of a simulation study to compare the performance of the Mann-Whitney U test, Student's t test, and the alternate (separate variance) t test for two mutually independent random samples from normal distributions, with both one-tailed and two-tailed alternatives. The estimated probability of a Type I error was controlled (in the sense of being reasonably close to the attainable level) by all three tests when the variances were equal, regardless of the sample sizes. However, it was controlled only by the alternate t test for unequal variances with unequal sample sizes. With equal sample sizes, the probability was controlled by all three tests regardless of the variances. When it was controlled, we also compared the power of these tests and found very little difference. This means that very little power will be lost if the Mann-Whitney U test is used instead of tests that require the assumption of normal distributions.
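
The "alternate (separate variance) t test" is commonly known as Welch's t test. A minimal sketch of its statistic and approximate degrees of freedom (Welch-Satterthwaite) follows. Identifying the abstract's test with Welch's test is an assumption, and the sample data are made up for illustration.

```python
from statistics import mean, variance


def separate_variance_t(x, y):
    """Separate-variance (Welch) t statistic and approximate degrees of freedom."""
    n_x, n_y = len(x), len(y)
    # Squared standard error of each sample mean, without pooling variances
    se2_x = variance(x) / n_x
    se2_y = variance(y) / n_y
    t = (mean(x) - mean(y)) / (se2_x + se2_y) ** 0.5
    df = (se2_x + se2_y) ** 2 / (se2_x ** 2 / (n_x - 1) + se2_y ** 2 / (n_y - 1))
    return t, df


# Hypothetical samples with clearly unequal variances
t_stat, df = separate_variance_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

Unlike Student's t test, nothing here pools the two variances, which is why the statistic stays calibrated when unequal variances are paired with unequal sample sizes.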

16.
Scale scores for educational tests can be made more interpretable by incorporating score precision information at the time the score scale is established. Methods for incorporating this information are examined that are applicable to testing situations with number-correct scoring. Both linear and nonlinear methods are described. These methods can be used to construct score scales that discourage the overinterpretation of small differences in scores. The application of the nonlinear methods also results in scale scores that have nearly equal error variability along the score scale and that possess the property that adding a specified number of points to and subtracting the same number of points from any examinee's scale score produces an approximate two-sided confidence interval with a specified coverage. These nonlinear methods use an arcsine transformation to stabilize measurement error variance for transformed scores. The methods are compared through the use of illustrative examples. The effect of rounding on measurement error variability is also considered and illustrated using stanines.
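
One common variance-stabilizing arcsine transformation for number-correct scores is the Freeman-Tukey form, sketched below. The abstract does not state which arcsine form the article uses, so this choice is an assumption, and the function name is hypothetical. Under a binomial error model, the error variance of the transformed score is roughly constant across the score range.

```python
import math


def arcsine_transform(raw_score, n_items):
    """Freeman-Tukey arcsine transformation of a number-correct score.

    raw_score: number of items answered correctly (0..n_items).
    n_items: number of items on the test.
    """
    if not 0 <= raw_score <= n_items:
        raise ValueError("raw_score must lie between 0 and n_items")
    lower = math.asin(math.sqrt(raw_score / (n_items + 1)))
    upper = math.asin(math.sqrt((raw_score + 1) / (n_items + 1)))
    return 0.5 * (lower + upper)


# Transformed values for a hypothetical 40-item test
transformed = [arcsine_transform(x, 40) for x in range(41)]
```

A scale with nearly equal error variability could then be obtained by applying a linear rescaling to the transformed values, since a linear map multiplies every conditional standard error by the same factor.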

17.
18.
Structural equation modeling (SEM) techniques were used to compare 5 methods of assessing HIV/AIDS sexual risk in a large prediction model. These were: (a) multiple measures; (b) a single latent factor; (c) modifying the computation of the dependent variables used in Methods 1 and 2 to weight sexual encounters by specific partner risk; (d) use of risk composites, obtained by multiplying number of sexual partners by number of occasions of unprotected sex; and (e) use of risk indexes that assign a number based on responses to general questions about risk behaviors. Data from 452 at‐risk women from a New England community were analyzed in 5 versions of an HIV/AIDS sexual risk prediction model. Models were compared in terms of SEM empirical fit indexes (χ² [df], average absolute standardized residuals, and Comparative Fit Index), significant paths, explained variance, theoretical fit, and simplicity. Results indicate that: (a) multiple measures and latent factor models are preferable to all others by each of the standards of comparison, (b) in the composite dependent variable models, including information about the partners' number of partners provided little additional explained variance beyond knowing the number of occasions of unprotected sex, and (c) dependent measures that did not remain close to Centers for Disease Control criteria may not be adequately predicting HIV/AIDS sexual risk. Several recommendations are presented for selecting an appropriate conceptualization of HIV/AIDS sexual risk.

19.
Ashby's Law of Requisite Variety states that variance prepares systems for daily activities and unforeseeable events, suggesting that academic departments comprising faculty from multiple institutions and disciplines would better adapt to ever‐changing environments. This study outlines the disciplinary heritage of full‐time tenure‐stream faculty (N = 495) within criminology and criminal justice doctoral programs (N = 31), then examines the degree to which those programs adhere to Ashby's principle. The study ranks programs on both institutional and disciplinary variances, and how well the programs balance those competing interests. Findings revealed that programs were quite mixed on both variance measures but that variance rankings had little association with how peers rated programs for 2009, in that highly ranked programs appeared with similar frequencies at both the top and bottom of variance rankings. Thus, it appears national respect is not dependent on the variance of faculties with respect to institutional and disciplinary heritage.

20.
《教育实用测度》2013,26(3):295-308
In some measurement settings internal consistency reliability of a measure must be based on a partition of the instrument into only 2 parts that cannot be further subdivided. Each of these 2 parts yields only a single score. If the functional lengths of the parts appear to be unequal or the parts are scored on different scales, the setting calls for a congeneric coefficient. It is shown that a single-valued estimate of the total score reliability is possible only if an assumption is made about the comparative size of the error variances of the parts. Without such an assumption, a range of reliability estimates is consistent with the part-test variances and covariance. But if the reliability of 1 part can be estimated independent of scores on the 2nd part, then a single-valued congeneric estimate of total score reliability is possible.
