Similar Documents
20 similar documents found.
1.
Two new methods have been proposed to determine unexpected sum scores on sub-tests (testlets) both for paper-and-pencil tests and computer adaptive tests. A method based on a conservative bound using the hypergeometric distribution, denoted p, was compared with a method where the probability for each score combination was calculated using a highest density region (HDR). Furthermore, these methods were compared with the standardized log-likelihood statistic with and without a correction for the estimated latent trait value (denoted as l*z and lz, respectively). Data were simulated on the basis of the one-parameter logistic model, and both parametric and non-parametric logistic regression were used to obtain estimates of the latent trait. Results showed that it is important to take the trait level into account when comparing subtest scores. In a nonparametric item response theory (IRT) context, an adapted version of the HDR method was a powerful alternative to p. In a parametric IRT context, results showed that l*z had the highest power when the data were simulated conditionally on the estimated latent trait level.
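For illustration, a minimal sketch of the standardized log-likelihood person-fit statistic lz under the one-parameter logistic model; the item difficulties, trait estimate, and response vector are hypothetical values, not data from the study:

```python
# Sketch of the standardized log-likelihood person-fit statistic lz under
# the one-parameter logistic (Rasch) model. The difficulties b, the trait
# estimate theta, and the response vector u are illustrative assumptions.
import numpy as np

def lz_statistic(u, theta, b):
    p = 1.0 / (1.0 + np.exp(-(theta - b)))                   # response probabilities
    l0 = np.sum(u * np.log(p) + (1 - u) * np.log(1 - p))     # observed log-likelihood
    e = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))      # its expectation
    v = np.sum(p * (1 - p) * np.log(p / (1 - p)) ** 2)       # its variance
    return (l0 - e) / np.sqrt(v)

u = np.array([1, 1, 0, 1, 0, 0, 1, 0])                       # hypothetical item scores
b = np.array([-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0])    # hypothetical difficulties
print(lz_statistic(u, theta=0.3, b=b))   # large negative values flag misfitting patterns
```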

2.
This study investigated the Type I error rate and power of four copying indices, K-index (Holland, 1996), Scrutiny! (Assessment Systems Corporation, 1993), g2 (Frary, Tideman, & Watts, 1977), and ω (Wollack, 1997) using real test data from 20,000 examinees over a 2-year period. The data were divided into three different test lengths (20, 40, and 80 items) and nine different sample sizes (ranging from 50 to 20,000). Four different amounts of answer copying were simulated (10%, 20%, 30%, and 40% of the items) within each condition. The ω index demonstrated the best Type I error control and power in all conditions and at all α levels. Scrutiny! and the K-index were uniformly conservative, and both had poor power to detect true copiers at the small α levels typically used in answer copying detection, whereas g2 was generally too liberal, particularly at small α levels. Some comments on the proper uses of copying indices are provided.

3.
This article used the Wald test to evaluate the item‐level fit of a saturated cognitive diagnosis model (CDM) relative to the fits of the reduced models it subsumes. A simulation study was carried out to examine the Type I error and power of the Wald test in the context of the G‐DINA model. Results show that when the sample size is small and a large number of attributes is required, the Type I error rate of the Wald test for the DINA and DINO models can be higher than the nominal significance levels, while the Type I error rate of the A‐CDM is closer to the nominal significance levels. However, with larger sample sizes, the Type I error rates for the three models are closer to the nominal significance levels. In addition, the Wald test has excellent statistical power to detect when the true underlying model is none of the reduced models examined even for relatively small sample sizes. The performance of the Wald test was also examined with real data. With an increasing number of CDMs from which to choose, this article provides an important contribution toward advancing the use of CDMs in practical educational settings.

4.
This Monte Carlo simulation study investigated the impact of nonnormality on estimating and testing mediated effects with the parallel process latent growth model and 3 popular methods for testing the mediated effect (i.e., Sobel’s test, the asymmetric confidence limits, and the bias-corrected bootstrap). It was found that nonnormality had little effect on the estimates of the mediated effect, standard errors, empirical Type I error, and power rates in most conditions. In terms of empirical Type I error and power rates, the bias-corrected bootstrap performed best. Sobel’s test produced very conservative Type I error rates when the estimated mediated effect and its standard error were correlated, but when the relationship was weak or did not exist, the Type I error was closer to the nominal .05 value.
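For reference, Sobel's test divides the estimated mediated effect ab by its first-order delta-method standard error. A minimal sketch, with hypothetical path estimates and standard errors:

```python
# Sketch of Sobel's z test for a mediated effect a*b, using the first-order
# delta-method standard error. The path estimates and standard errors below
# are hypothetical placeholders, not values from the study.
import math
from scipy.stats import norm

def sobel_test(a, se_a, b, se_b):
    se_ab = math.sqrt(b**2 * se_a**2 + a**2 * se_b**2)  # delta-method SE of a*b
    z = (a * b) / se_ab
    return z, 2 * norm.sf(abs(z))                       # two-sided p value

z, p = sobel_test(a=0.40, se_a=0.10, b=0.35, se_b=0.12)
print(f"z = {z:.3f}, p = {p:.4f}")
```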

5.
Two new indices to detect answer copying on a multiple-choice test—S1 and S2—were proposed. The S1 index is similar to the K index (Holland, 1996) and the K2 index (Sotaridona & Meijer, 2002), but the distribution of the number of matching incorrect answers of the source and the copier is modeled by the Poisson distribution instead of the binomial distribution to improve the detection rates of K and K2. The S2 index was proposed to overcome a limitation of the K and K2 indices, namely their insensitivity to the copying of correct answers. The S2 index incorporates the matching correct answers in addition to the matching incorrect answers. A simulation study was conducted to investigate the usefulness of S1 and S2 for 40- and 80-item tests, sample sizes of 100 and 500, and 10%, 20%, 30%, and 40% answer copying. The Type I error rates and detection rates of S1 and S2 were compared with those of K2 and the ω copying index (Wollack, 1997). Results showed that all four indices were able to maintain their Type I error rates, with S1 and K2 being slightly conservative compared to S2 and ω. Furthermore, S1 had higher detection rates than K2. The S2 index showed a significant improvement in detection rate compared to K and K2.
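The Poisson idea behind S1 can be sketched as an upper-tail probability for the observed number of matching incorrect answers. How the rate parameter is estimated is omitted here (Sotaridona & Meijer use a regression model); the values below are hypothetical:

```python
# Sketch of the Poisson tail probability underlying an S1-style copying
# index: given an expected number of chance matches lam for a copier/source
# pair, the p value is P(M >= m_observed) under a Poisson model. Both
# inputs below are hypothetical.
from scipy.stats import poisson

def poisson_copy_pvalue(m_observed, lam):
    return poisson.sf(m_observed - 1, lam)   # P(M >= m_observed)

print(poisson_copy_pvalue(m_observed=12, lam=4.5))  # small p suggests copying
```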

6.
The purpose of this study was to investigate the power and Type I error rate of the likelihood ratio goodness-of-fit (LR) statistic in detecting differential item functioning (DIF) under Samejima's (1969, 1972) graded response model. A multiple-replication Monte Carlo study was utilized in which DIF was modeled in simulated data sets which were then calibrated with MULTILOG (Thissen, 1991) using hierarchically nested item response models. In addition, the power and Type I error rate of the Mantel (1963) approach for detecting DIF in ordered response categories were investigated using the same simulated data, for comparative purposes. The power of both the Mantel and LR procedures was affected by sample size, as expected. The LR procedure lacked the power to consistently detect DIF when it existed in reference/focal groups with sample sizes as small as 500/500. The Mantel procedure maintained control of its Type I error rate and was more powerful than the LR procedure when the comparison group ability distributions were identical and there was a constant DIF pattern. On the other hand, the Mantel procedure lost control of its Type I error rate, whereas the LR procedure did not, when the comparison groups differed in mean ability; and the LR procedure demonstrated a profound power advantage over the Mantel procedure under conditions of balanced DIF in which the comparison group ability distributions were identical. The choice and subsequent use of any procedure require a thorough understanding of its power and Type I error rates under varying conditions of DIF pattern, comparison group ability distributions (or, as a surrogate, observed score distributions), and item characteristics.

7.
Recent advances in testing mediation have found that certain resampling methods and tests based on the distribution of the product of 2 normal random variables substantially outperform the traditional z test. However, these studies have primarily focused only on models with a single mediator and 2 component paths. To address this limitation, a simulation was conducted to evaluate these alternative methods in a more complex path model with multiple mediators and indirect paths with 2 and 3 paths. Methods for testing contrasts of 2 effects were evaluated also. The simulation included 1 exogenous independent variable, 3 mediators, and 2 outcomes, and varied sample size, number of paths in the mediated effects, test used to evaluate effects, effect sizes for each path, and the value of the contrast. Confidence intervals were used to evaluate the power and Type I error rate of each method, and were examined for coverage and bias. The bias-corrected bootstrap had the least biased confidence intervals, greatest power to detect nonzero effects and contrasts, and the most accurate overall Type I error. All tests had less power to detect 3-path effects and more inaccurate Type I error compared to 2-path effects. Confidence intervals were biased for mediated effects, as found in previous studies. Results for contrasts did not vary greatly by test, although resampling approaches had somewhat greater power and might be preferable because of ease of use and flexibility.
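A bias-corrected bootstrap interval for a simple one-mediator effect ab can be sketched as follows; the data generation and sample size are illustrative assumptions, and the multiple-mediator contrasts studied in the article are not reproduced:

```python
# Sketch of a bias-corrected (BC) bootstrap confidence interval for the
# mediated effect a*b in a one-mediator model estimated by OLS. All data
# and settings are simulated purely for illustration.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
m = 0.4 * x + rng.normal(size=n)          # mediator
y = 0.35 * m + rng.normal(size=n)         # outcome

def ab_estimate(x, m, y):
    a = np.polyfit(x, m, 1)[0]            # slope of m on x
    X = np.column_stack([np.ones_like(x), m, x])
    b = np.linalg.lstsq(X, y, rcond=None)[0][1]  # slope of y on m, controlling for x
    return a * b

est = ab_estimate(x, m, y)
boot = np.empty(2000)
for i in range(boot.size):
    idx = rng.integers(0, n, n)           # resample cases with replacement
    boot[i] = ab_estimate(x[idx], m[idx], y[idx])

z0 = norm.ppf(np.mean(boot < est))        # bias-correction constant
lo, hi = norm.cdf(2 * z0 + norm.ppf([0.025, 0.975]))  # adjusted percentile levels
print(np.quantile(boot, [lo, hi]))        # BC 95% CI; excluding 0 indicates mediation
```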

8.
An approximate χ2 statistic based on McDonald's (1967) nonlinear factor analytic representation of item response theory was proposed and investigated with simulated data. The results were compared with Stout's T statistic (Nandakumar & Stout, 1993; Stout, 1987). Unidimensional and two-dimensional item response data were simulated under varying levels of sample size, test length, test reliability, and dimension dominance. The approximate χ2 statistic had good control over Type I errors when unidimensional data were generated and displayed very good power in identifying the two-dimensional data. The performance of the approximate χ2 was at least as good as Stout's T statistic in all conditions and was better than Stout's T statistic with smaller sample sizes and shorter tests. Further implications regarding the potential use of nonlinear factor analysis and the approximate χ2 in addressing current measurement issues are discussed.

9.
Applied Measurement in Education, 2013, 26(4): 329-349
The logistic regression (LR) procedure for differential item functioning (DIF) detection is a model-based approach designed to identify both uniform and nonuniform DIF. However, this procedure tends to produce inflated Type I errors. This outcome is problematic because it can result in the inefficient use of testing resources, and it may interfere with the study of the underlying causes of DIF. Recently, an effect size measure was developed for the LR DIF procedure and a classification method was proposed. However, the effect size measure and classification method have not been systematically investigated. In this study, we developed a new classification method based on those established for the Simultaneous Item Bias Test. A simulation study also was conducted to determine if the effect size measure affects the Type I error and power rates for the LR DIF procedure across sample sizes, ability distributions, and percentage of DIF items included on a test. The results indicate that the inclusion of the effect size measure can substantially reduce Type I error rates when large sample sizes are used, although there is also a reduction in power.
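The LR DIF procedure itself amounts to likelihood-ratio tests among three nested logistic models, with the change in pseudo-R² commonly used as the effect size. A minimal sketch on simulated data (the simulation settings are illustrative, not those of the study):

```python
# Sketch of the logistic regression DIF procedure: compare nested models
# (total score; + group; + group x score) with likelihood-ratio tests, and
# report the change in pseudo-R^2 as an effect size. Data are simulated
# here purely for illustration.
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(7)
n = 1000
score = rng.normal(size=n)                  # matching variable (e.g., total score)
group = rng.integers(0, 2, n)               # 0 = reference, 1 = focal
logit = 0.8 * score - 0.5 * group           # uniform DIF built in
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

def fit(cols):
    X = sm.add_constant(np.column_stack(cols))
    return sm.Logit(y, X).fit(disp=0)

m1 = fit([score])                           # Model 1: score only
m2 = fit([score, group])                    # Model 2: + group (uniform DIF)
m3 = fit([score, group, score * group])     # Model 3: + interaction (nonuniform DIF)

print("uniform DIF p  =", chi2.sf(2 * (m2.llf - m1.llf), df=1))
print("omnibus DIF p  =", chi2.sf(2 * (m3.llf - m1.llf), df=2))
print("delta pseudo-R2 =", m3.prsquared - m1.prsquared)   # effect size
```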

10.
Type I error rate and power for the t test, Wilcoxon-Mann-Whitney (U) test, van der Waerden Normal Scores (NS) test, and Welch-Aspin-Satterthwaite (W) test were compared for two independent random samples drawn from nonnormal distributions. Data with varying degrees of skewness (S) and kurtosis (K) were generated using Fleishman's (1978) power function. Five sample size combinations were used with both equal and unequal variances. For nonnormal data with equal variances, the power of the U test exceeded the power of the t test regardless of sample size. When the sample sizes were equal but the variances were unequal, the t test proved to be the most powerful test. When variances and sample sizes were unequal, the W test became the test of choice because it was the only test that maintained its nominal Type I error rate.
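Fleishman's power function maps a standard normal Z to Y = a + bZ + cZ² + dZ³ with a = -c, solving (b, c, d) from the target moments. A sketch, with arbitrary target skewness and excess kurtosis:

```python
# Sketch of Fleishman's (1978) power method for generating nonnormal data.
# The three moment equations are Fleishman's; the target skewness and
# excess kurtosis below are arbitrary illustrative values.
import numpy as np
from scipy.optimize import fsolve

def fleishman_coefficients(skew, exkurt):
    def equations(p):
        b, c, d = p
        return (b**2 + 6*b*d + 2*c**2 + 15*d**2 - 1,           # unit variance
                2*c*(b**2 + 24*b*d + 105*d**2 + 2) - skew,      # target skewness
                24*(b*d + c**2*(1 + b**2 + 28*b*d)
                    + d**2*(12 + 48*b*d + 141*c**2 + 225*d**2)) - exkurt)
    b, c, d = fsolve(equations, x0=(1.0, 0.1, 0.0))
    return -c, b, c, d                                          # a = -c

a, b, c, d = fleishman_coefficients(skew=1.75, exkurt=3.75)
z = np.random.default_rng(0).normal(size=100_000)
y = a + b*z + c*z**2 + d*z**3               # nonnormal sample, mean ~0, variance ~1
print(y.mean(), y.std())
```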

11.
Multivariate analysis of variance (MANOVA) is widely used in educational research to compare means on multiple dependent variables across groups. Researchers faced with the problem of missing data often use multiple imputation of values in place of the missing observations. This study compares the performance of 2 methods for combining p values in the context of a MANOVA, with the typical default for dealing with missing data: listwise deletion. When data are missing at random, the new methods maintained the nominal Type I error rate and had power comparable to the complete data condition. When 40% of the data were missing completely at random, the Type I error rates for the new methods were inflated, but not at lower percentages.

12.
When structural equation modeling (SEM) analyses are conducted, significance tests for all important model relationships (parameters including factor loadings, covariances, etc.) are typically conducted at a specified nominal Type I error rate (α). Despite the fact that many significance tests are often conducted in SEM, rarely is multiplicity control applied. Cribbie (2000, 2007) demonstrated that without some form of adjustment, the familywise Type I error rate can become severely inflated. Cribbie also confirmed that the popular Bonferroni method was overly conservative due to the correlations among the parameters in the model. The purpose of this study was to compare the Type I error rates and per-parameter power of traditional multiplicity strategies with those of adjusted Bonferroni procedures that incorporate not only the number of tests in a family, but also the degree of correlation between parameters. The adjusted Bonferroni procedures were found to produce per-parameter power rates higher than the original Bonferroni procedure without inflating the familywise error rate.
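The specific adjusted Bonferroni procedures evaluated in the study are not reproduced here; as a stand-in, the sketch below contrasts the ordinary Bonferroni threshold with a correlation-aware variant based on Nyholt's (2004) effective number of tests, a different but related adjustment. The parameter correlation matrix is hypothetical:

```python
# Stand-in sketch: ordinary Bonferroni (alpha/m) versus a correlation-aware
# threshold that replaces m with Nyholt's (2004) effective number of tests,
# computed from the eigenvalues of the parameter correlation matrix. This
# is NOT the study's own adjustment; the matrix below is hypothetical.
import numpy as np

alpha = 0.05
R = np.array([[1.0, 0.6, 0.5],
              [0.6, 1.0, 0.4],
              [0.5, 0.4, 1.0]])              # hypothetical parameter correlations
m = R.shape[0]
lam = np.linalg.eigvalsh(R)
m_eff = 1 + (m - 1) * (1 - lam.var(ddof=1) / m)   # effective number of tests
print("Bonferroni threshold:      ", alpha / m)
print("Correlation-adjusted level:", alpha / m_eff)   # less conservative
```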

13.
This study examined the effect of model size on the chi-square test statistics obtained from ordinal factor analysis models. The performance of six robust chi-square test statistics was compared across various conditions, including number of observed variables (p), number of factors, sample size, model (mis)specification, number of categories, and threshold distribution. Results showed that the unweighted least squares (ULS) robust chi-square statistics generally outperform the diagonally weighted least squares (DWLS) robust chi-square statistics. The ULSM estimator performed the best overall. However, when fitting ordinal factor analysis models with a large number of observed variables and small sample size, the ULSM-based chi-square tests may yield empirical variances that are noticeably larger than the theoretical values and inflated Type I error rates. On the other hand, when the number of observed variables is very large, the mean- and variance-corrected chi-square test statistics (e.g., based on ULSMV and WLSMV) could produce empirical variances conspicuously smaller than the theoretical values and Type I error rates lower than the nominal level, and demonstrate lower power rates to reject misspecified models. Recommendations for applied researchers and future empirical studies involving large models are provided.

14.
This article reports on a Monte Carlo simulation study, evaluating two approaches for testing the intervention effect in replicated randomized AB designs: two-level hierarchical linear modeling (HLM) and using the additive method to combine randomization test p values (RTcombiP). Four factors were manipulated: mean intervention effect, number of cases included in a study, number of measurement occasions for each case, and between-case variance. Under the simulated conditions, Type I error rate was under control at the nominal 5% level for both HLM and RTcombiP. Furthermore, for both procedures, a larger number of combined cases resulted in higher statistical power, with many realistic conditions reaching statistical power of 80% or higher. Smaller values for the between-case variance resulted in higher power for HLM. A larger number of data points resulted in higher power for RTcombiP.
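The additive method for combining independent p values (Edgington's approach) takes the combined p as the probability that the sum of k independent Uniform(0,1) variables falls at or below the observed sum, i.e., the Irwin-Hall CDF. A minimal sketch with hypothetical p values:

```python
# Sketch of the additive method for combining k independent p values:
# the combined p is the Irwin-Hall CDF evaluated at the observed sum.
# The input p values below are hypothetical.
from math import comb, factorial, floor

def additive_combined_p(p_values):
    k, s = len(p_values), sum(p_values)
    return sum((-1)**j * comb(k, j) * (s - j)**k
               for j in range(floor(s) + 1)) / factorial(k)

print(additive_combined_p([0.04, 0.10, 0.07]))  # combined evidence across cases
```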

15.
Testing the goodness of fit of item response theory (IRT) models is relevant to validating IRT models, and new procedures have been proposed. These alternatives compare observed and expected response frequencies conditional on observed total scores, and use posterior probabilities for responses across θ levels rather than cross-classifying examinees using point estimates of θ and score responses. This research compared these alternatives with regard to their methods, properties (Type I error rates and empirical power), available research, and practical issues (computational demands, treatment of missing data, effects of sample size and sparse data, and available computer programs). Different advantages and disadvantages related to these characteristics are discussed. A simulation study provided additional information about empirical power and Type I error rates.

16.
Two simulation studies investigated Type I error performance of two statistical procedures for detecting differential item functioning (DIF): SIBTEST and Mantel-Haenszel (MH). Because MH and SIBTEST are based on asymptotic distributions requiring "large" numbers of examinees, the first study examined Type I error for small sample sizes. No significant Type I error inflation occurred for either procedure. Because MH has the potential for Type I error inflation for non-Rasch models, the second study used a markedly non-Rasch test and systematically varied the shape and location of the studied item. When the distribution of the measured ability differed across examinee groups, both procedures displayed inflated Type I error for certain items; MH displayed the greater inflation. Also, both procedures displayed statistically biased estimation of the zero DIF for certain items, though SIBTEST displayed much less than MH. When no latent distributional differences were present, both procedures performed satisfactorily under all conditions.
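For reference, the MH procedure pools 2×2 (group × correct/incorrect) tables across total-score levels into a common odds ratio, often reported on the ETS delta scale. A sketch with hypothetical stratified counts:

```python
# Sketch of the Mantel-Haenszel common odds ratio for DIF, computed over
# 2x2 tables formed at each total-score level k, with the ETS delta-scale
# effect size -2.35 * ln(alpha_MH). The stratified counts are hypothetical.
import numpy as np

# Each row k: [A_k, B_k, C_k, D_k] = reference-correct, reference-incorrect,
#             focal-correct, focal-incorrect at score level k.
tables = np.array([[30, 10, 22, 18],
                   [45, 15, 35, 25],
                   [60, 10, 50, 20]], dtype=float)
A, B, C, D = tables.T
N = tables.sum(axis=1)
alpha_mh = np.sum(A * D / N) / np.sum(B * C / N)   # common odds ratio
print("alpha_MH =", alpha_mh)
print("MH D-DIF =", -2.35 * np.log(alpha_mh))      # ETS delta metric
```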

17.
A new procedure for generating instructionally relevant diagnostic feedback is proposed. The approach involves first constructing a strong model of student proficiency and then testing whether individual students' observed item response vectors are consistent with that model. Diagnoses are specified in terms of the combinations of skills needed to score at increasingly higher levels on a test's reported score scale. The approach is applied to the problem of developing diagnostic feedback for the SAT I Verbal Reasoning test. Using a variation of Wright's (1977) person-fit statistic, it is shown that the estimated proficiency model accounts for 91% of the "explainable" variation in students' observed item response vectors.

18.
The authors present a method that ensures control over the Type I error rate for those who visually analyze the data from response-guided multiple-baseline designs. The method can be seen as a modification of visual analysis methods to incorporate a mechanism to control Type I errors or as a modification of randomization test methods to allow response-guided experimentation and visual analysis. The approach uses random assignment of participants to intervention times and a data analyst who is blind to which participants enter treatment at which points in time. The authors provide an example to illustrate the method and discuss the conditions necessary to ensure Type I error control.

19.
The authors compared the Type I error rate and the power to detect differences in slopes and additive treatment effects of analysis of covariance (ANCOVA) and randomized block (RB) designs with a Monte Carlo simulation. For testing differences in slopes, 3 methods were compared: the test of slopes from ANCOVA, the omnibus Block × Treatment interaction, and the linear component of the Block × Treatment interaction of RB. In the test for adjusted means, 2 variations of both ANCOVA and RB were used. The power of the omnibus test of the interaction decreased dramatically as the number of blocks used increased and was always considerably smaller than the specific test of differences in slopes found in ANCOVA. Tests for means when there were concomitant differences in slopes showed that only ANCOVA uniformly controlled Type I error under all configurations of design variables. The most powerful option in almost all simulations for tests of both slopes and means was ANCOVA.
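The ANCOVA test of slope differences is a test of the treatment-by-covariate interaction against the additive model. A minimal sketch on simulated data (all settings are illustrative only):

```python
# Sketch of the ANCOVA-based test of slope differences: fit the model with
# a treatment-by-covariate interaction and test it against the additive
# model. Data are simulated only to make the example runnable.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(3)
n = 120
df = pd.DataFrame({"x": rng.normal(size=n),          # covariate
                   "g": rng.integers(0, 2, n)})      # treatment group
df["y"] = 0.5 * df.x + 0.4 * df.g + 0.3 * df.g * df.x + rng.normal(size=n)

additive = smf.ols("y ~ x + C(g)", data=df).fit()
interaction = smf.ols("y ~ x * C(g)", data=df).fit()
print(anova_lm(additive, interaction))   # F test of equal slopes across groups
```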

20.
DIMTEST is a nonparametric statistical test procedure for assessing unidimensionality of binary item response data. The development of Stout's statistic, T, used in the DIMTEST procedure, does not require the assumption of a particular parametric form for the ability distributions or the item response functions. The purpose of the present study was to empirically investigate the performance of the statistic T with respect to different shapes of ability distributions. Several nonnormal distributions, both symmetric and nonsymmetric, were considered for this purpose. Other factors varied in the study were test length, sample size, and the level of correlation between abilities. The results of Type I error and power studies showed that the test statistic T exhibited consistently similar performance for all different shapes of ability distributions investigated in the study, which confirmed the nonparametric nature of the statistic T.
