Similar literature: 20 records found
1.
Generalizability Theory
Generalizability theory consists of a conceptual framework and a methodology that enable an investigator to disentangle multiple sources of error in a measurement procedure. The roots of generalizability theory can be found in classical test theory and analysis of variance (ANOVA), but generalizability theory is not simply the conjunction of classical theory and ANOVA. In particular, the conceptual framework in generalizability theory is unique. This framework and the procedures of generalizability theory are introduced and illustrated in this instructional module using a hypothetical scenario involving writing proficiency.
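To make the variance-decomposition idea concrete, here is a minimal sketch (not part of the module itself; the data and function names are hypothetical) of a one-facet persons-by-items G study: variance components are solved from the ANOVA mean squares and then combined into the generalizability (relative) and dependability (absolute) coefficients for a D study with a chosen number of items.

```python
import numpy as np

def one_facet_g_study(scores, n_i_dstudy=None):
    """One-facet crossed (p x i) G study.

    scores      : 2-D array, rows = persons, columns = items.
    n_i_dstudy  : number of items assumed in the D study
                  (defaults to the number of items observed).
    """
    n_p, n_i = scores.shape
    n_i_dstudy = n_i if n_i_dstudy is None else n_i_dstudy
    grand = scores.mean()

    # ANOVA sums of squares for persons, items, and the residual (pi,e) term
    ss_p = n_i * np.sum((scores.mean(axis=1) - grand) ** 2)
    ss_i = n_p * np.sum((scores.mean(axis=0) - grand) ** 2)
    ss_tot = np.sum((scores - grand) ** 2)
    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_res = (ss_tot - ss_p - ss_i) / ((n_p - 1) * (n_i - 1))

    # Expected-mean-square solutions for the variance components
    var_p = max((ms_p - ms_res) / n_i, 0.0)
    var_i = max((ms_i - ms_res) / n_p, 0.0)
    var_pi_e = ms_res

    # Relative (E rho^2) and absolute (Phi) coefficients for the D study
    e_rho2 = var_p / (var_p + var_pi_e / n_i_dstudy)
    phi = var_p / (var_p + (var_i + var_pi_e) / n_i_dstudy)
    return {"var_p": var_p, "var_i": var_i, "var_pi_e": var_pi_e,
            "E_rho2": e_rho2, "Phi": phi}

# Hypothetical ratings: 6 examinees by 4 writing prompts
scores = np.array([[4, 5, 4, 5],
                   [2, 3, 2, 2],
                   [5, 5, 4, 5],
                   [3, 3, 4, 3],
                   [1, 2, 2, 1],
                   [4, 4, 3, 4]], dtype=float)
print(one_facet_g_study(scores))
```

Treating items as random and asking how many prompts would be needed to reach a target coefficient is the kind of D-study question the framework is designed to answer.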

2.
The purpose of this study was to investigate the effects of items, passages, contents, themes, and types of passages on the reliability and standard errors of measurement for complex reading comprehension tests. Seven different generalizability theory models were used in the analyses. Results indicated that generalizability coefficients estimated using multivariate models incorporating content strata and types of passages were similar in size to reliability estimates based upon a model that did not include these factors. In contrast, incorporating passages and themes within univariate generalizability theory models produced non-negligible differences in the reliability estimates. This suggested that passages and themes be taken into account when evaluating the reliability of test scores for complex reading comprehension tests.

3.
Applied Measurement in Education, 2013, 26(2): 191-203
Generalizability theory provides a conceptual and statistical framework for estimating variance components and measurement precision. The theory has been widely used in evaluating technical qualities of performance assessments. However, estimates of variance components, measurement error variances, and generalizability coefficients are likely to vary from one sample to another. This study empirically investigates sampling variability of estimated variance components using data collected in several years for a listening and writing performance assessment. This study also evaluates stability of estimated measurement precision from year to year. The results indicated that the estimated variance components varied from one study to another, especially when sample sizes were small. The estimated measurement error variances and generalizability coefficients also changed from one year to another. Measurement precision projected by a generalizability study may not be fully realized in an actual decision study. The study points out the importance of examining variability of estimated variance components and related statistics in performance assessments.

4.
An approach called generalizability in item response modeling (GIRM) is introduced in this article. The GIRM approach essentially incorporates the sampling model of generalizability theory (GT) into the scaling model of item response theory (IRT) by making distributional assumptions about the relevant measurement facets. By specifying a random effects measurement model, and taking advantage of the flexibility of Markov Chain Monte Carlo (MCMC) estimation methods, it becomes possible to estimate GT variance components simultaneously with traditional IRT parameters. It is shown how GT and IRT can be linked together in the context of a single-facet measurement design with binary items. Using both simulated and empirical data with the software WinBUGS, the GIRM approach is shown to produce results comparable to those from a standard GT analysis, while also producing results from a random effects IRT model.
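As a rough illustration of the kind of random-effects measurement model involved (a generic single-facet formulation consistent with the description above, not necessarily the authors' exact parameterization), the binary-item design can be written as

$$
X_{pi} \sim \mathrm{Bernoulli}(\pi_{pi}), \qquad
\operatorname{logit}(\pi_{pi}) = \theta_p - \beta_i, \qquad
\theta_p \sim N(0, \sigma^2_{\theta}), \quad \beta_i \sim N(0, \sigma^2_{\beta}),
$$

with persons and items both treated as random draws from populations. Because the MCMC sampler produces joint posterior draws for the person and item distributions as well as the model-implied expected item scores, GT-style variance components and generalizability coefficients can be computed from the same chains that yield the usual IRT parameter estimates.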

5.
Multilevel bifactor item response theory (IRT) models are commonly used to account for features of the data that are related to the sampling and measurement processes used to gather those data. These models conventionally make assumptions about the portions of the data structure that represent these features. Unfortunately, when data violate these assumptions but the models are used anyway, incorrect conclusions about cluster effects may be drawn and potentially relevant dimensions may go undetected. To address the limitations of these conventional models, a more flexible multilevel bifactor IRT model that does not make these assumptions is presented; it is based on the generalized partial credit model. A simulation study is reported in which the proposed model outperforms competing models, and which demonstrates the consequences of using conventional multilevel bifactor IRT models to analyze data that violate their assumptions. Additionally, the model's usefulness is illustrated through the analysis of Program for International Student Assessment data related to interest in science.
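For reference, the generalized partial credit model that the more flexible model builds on takes the standard form (generic notation, not taken from the article)

$$
P(X_{pi}=k \mid \theta_p)=
\frac{\exp\!\left(\sum_{v=1}^{k} a_i(\theta_p-b_{iv})\right)}
{\sum_{h=0}^{m_i}\exp\!\left(\sum_{v=1}^{h} a_i(\theta_p-b_{iv})\right)},
\qquad k=0,1,\dots,m_i,
$$

where the empty sum for $k=0$ is defined as zero, $a_i$ is the item slope, and $b_{iv}$ are the item step parameters; a multilevel bifactor version would replace $\theta_p$ with some combination of general, specific, and cluster-level latent variables.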

6.

Randomized experiments are considered the gold standard for causal inference because they can provide unbiased estimates of treatment effects for the experimental participants. However, researchers and policymakers are often interested in using a specific experiment to inform decisions about other target populations. In education research, increasing attention is being paid to the potential lack of generalizability of randomized experiments because the experimental participants may be unrepresentative of the target population of interest. This article examines whether generalization may be assisted by statistical methods that adjust for observed differences between the experimental participants and members of a target population. The methods examined include approaches that reweight the experimental data so that participants more closely resemble the target population and methods that utilize models of the outcome. Two simulation studies and one empirical analysis investigate and compare the methods’ performance. One simulation uses purely simulated data while the other utilizes data from an evaluation of a school-based dropout prevention program. Our simulations suggest that machine learning methods outperform regression-based methods when the required structural (ignorability) assumptions are satisfied. When these assumptions are violated, all of the methods examined perform poorly. Our empirical analysis uses data from a multisite experiment to assess how well results from a given site predict impacts in other sites. Using a variety of extrapolation methods, predicted effects for each site are compared to actual benchmarks. Flexible modeling approaches perform best, although linear regression is not far behind. Taken together, these results suggest that flexible modeling techniques can aid generalization while underscoring the fact that even state-of-the-art statistical techniques still rely on strong assumptions.
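One of the reweighting strategies referred to above can be illustrated with a short, generic sketch (hypothetical variable names; not the authors' implementation): a participation model is fit to the pooled experimental and target-population data, and experimental units are then weighted by their inverse odds of participation so that their covariate distribution more closely resembles the target population before the treatment effect is re-estimated.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def reweighted_effect(X_exp, y_exp, treat, X_pop):
    """Target-population treatment effect via inverse-odds-of-participation weights.

    X_exp : covariate matrix for experimental participants
    y_exp : their outcomes
    treat : 0/1 treatment indicator within the experiment
    X_pop : covariate matrix for a sample from the target population
    """
    # Label membership: 1 = in the experiment, 0 = in the population sample
    X_all = np.vstack([X_exp, X_pop])
    s = np.concatenate([np.ones(len(X_exp)), np.zeros(len(X_pop))])
    p = LogisticRegression(max_iter=1000).fit(X_all, s).predict_proba(X_exp)[:, 1]

    # Down-weight kinds of participants who are overrepresented in the experiment
    w = (1 - p) / p

    t, c = treat == 1, treat == 0
    return (np.average(y_exp[t], weights=w[t])
            - np.average(y_exp[c], weights=w[c]))
```

The same weights could instead be fed to a weighted outcome regression, and flexible machine-learning approaches of the kind mentioned above would typically swap the logistic participation model, or the outcome model, for more flexible learners.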

7.
We discuss generalizability (G) theory and the fair and valid assessment of linguistic minorities, especially emergent bilinguals. G theory allows examination of the relationship between score variation and language variation (e.g., variation of proficiency across languages, language modes, and social contexts). Studies examining score variation across items administered in emergent bilinguals' first and second languages show that the interaction of student and the facets (sources of measurement error) item and language is an important source of score variation. Each item poses a unique set of linguistic challenges in each language, and each emergent bilingual individual has a unique set of strengths and weaknesses in each language. Based on these findings, G theory can inform the process of test construction in large-scale testing programmes and the development of testing models that ensure more valid and fair interpretations of test scores for linguistic minorities.

8.
The purpose of this study was to investigate methods of estimating the reliability of school-level scores using generalizability theory and multilevel models. Two approaches, ‘students within schools’ and ‘students within schools and subject areas,’ were conceptualized and implemented in this study. Four methods resulting from the combination of these two approaches with generalizability theory and multilevel models were compared for both balanced and unbalanced data. The generalizability theory and multilevel models for the ‘students within schools’ approach produced the same variance components and reliability estimates for the balanced data, while failing to do so for the unbalanced data. The different results from the two models can be explained by the fact that they use different procedures to estimate the variance components that are, in turn, used to estimate reliability. Among the estimation methods investigated in this study, the generalizability theory model with the ‘students nested within schools crossed with subject areas’ design produced the lowest reliability estimates. Fully nested designs such as (students:schools) or (subject areas:students:schools) had no appreciable impact on the reliability estimates of school-level scores; both designs produced very similar estimates.
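As a point of reference, the standard G-theory result behind both approaches (generic notation, not the article's) gives the reliability of a school mean based on $n_p$ students per school in the (students:schools) design as

$$
E\rho^2_{\text{school}}=\frac{\sigma^2_{s}}{\sigma^2_{s}+\sigma^2_{p:s}/n_p},
$$

where $\sigma^2_{s}$ is the between-school variance and $\sigma^2_{p:s}$ is the variance of students within schools. The generalizability-theory and multilevel-model estimates of this quantity agree whenever the two frameworks return the same variance-component estimates, which holds for balanced data but, as the study found, not for unbalanced data.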

9.
Contemporary educational accountability systems, including state‐level systems prescribed under No Child Left Behind as well as those envisioned under the “Race to the Top” comprehensive assessment competition, rely on school‐level summaries of student test scores. The precision of these score summaries is almost always evaluated using models that ignore the classroom‐level clustering of students within schools. This paper reports balanced and unbalanced generalizability analyses investigating the consequences of ignoring variation at the level of classrooms within schools when analyzing the reliability of such school‐level accountability measures. Results show that the reliability of school means cannot be determined accurately when classroom‐level effects are ignored. Failure to take between‐classroom variance into account biases generalizability (G) coefficient estimates downward and standard errors (SEs) upward if classroom‐level effects are regarded as fixed, and biases G‐coefficient estimates upward and SEs downward if they are regarded as random. These biases become more severe as the difference between the school‐level intraclass correlation (ICC) and the class‐level ICC increases. School‐accountability systems should be designed so that classroom (or teacher) level variation can be taken into consideration when quantifying the precision of school rankings, and statistical models for school mean score reliability should incorporate this information.
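The source of the problem can be made explicit with the standard three-level decomposition (a generic statement of the result, not the article's notation): when classrooms are modeled, the error variance of a school mean based on $n_c$ classrooms of $n_p$ students each is

$$
\sigma^2(\bar X_{s})=\frac{\sigma^2_{c:s}}{n_c}+\frac{\sigma^2_{p:c:s}}{n_c n_p},
\qquad
E\rho^2_{\text{school}}=\frac{\sigma^2_{s}}{\sigma^2_{s}+\sigma^2(\bar X_{s})}.
$$

A two-level model that omits the classroom term has to fold $\sigma^2_{c:s}$ into the remaining components, and it is this misallocation, in opposite directions depending on whether classroom effects are implicitly treated as fixed or as random, that produces the biases in the G coefficients and standard errors described above.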

10.
Monte Carlo studies offer the opportunity to manipulate data sets with specified characteristics and to examine the generalizability of statistical procedures in ways that are not practical using actual (empirical) data. To determine whether computer-generated (simulated) data results accurately represent empirical data, this study replicated an investigation of the effects of item sampling plans in the application of multiple matrix sampling, using both simulated and empirical data sets. Although the results were similar, the empirical data results were more precise. This study suggests that, for some investigations, it may be important to confirm simulation results with empirical data.
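For readers unfamiliar with multiple matrix sampling, the following toy simulation (entirely hypothetical and deliberately simplified, not the study's design) illustrates the kind of item sampling plan being compared: the item pool is split into subtests, each simulated examinee answers only one subtest, and the mean total score on the full test is estimated from the resulting incomplete data matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

n_items, n_examinees, n_forms = 40, 1200, 4
true_p = rng.uniform(0.3, 0.9, n_items)                     # item p-values
forms = np.array_split(rng.permutation(n_items), n_forms)   # item sampling plan

# Each examinee is assigned one subtest in rotation and answers only its items.
correct = np.zeros(n_items)
administered = np.zeros(n_items)
for e in range(n_examinees):
    items = forms[e % n_forms]
    responses = rng.random(items.size) < true_p[items]
    correct[items] += responses
    administered[items] += 1

est_p = correct / administered
# Estimated vs. true mean total score on the full 40-item test
print("estimated:", round(est_p.sum(), 2), " true:", round(true_p.sum(), 2))
```

A Monte Carlo study of the kind described above would repeat this process many times under different sampling plans and compare the distributions of the resulting estimates with those obtained from real response data.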

11.
In judgmental standard setting procedures (e.g., the Angoff procedure), expert raters establish minimum pass levels (MPLs) for test items, and these MPLs are then combined to generate a passing score for the test. As suggested by Van der Linden (1982), item response theory (IRT) models may be useful in analyzing the results of judgmental standard setting studies. This paper examines three issues relevant to the use of IRT models in analyzing the results of such studies. First, a statistic for examining the fit of MPLs, based on judges' ratings, to an IRT model is suggested. Second, three methods for setting the passing score on a test based on item MPLs are analyzed; these analyses, based on theoretical models rather than empirical comparisons among the three methods, suggest that the traditional approach (i.e., setting the passing score on the test equal to the sum of the item MPLs) does not provide the best results. Third, a simple procedure, based on generalizability theory, for examining the sources of error in estimates of the passing score is discussed.
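The third point, quantifying the error in the passing score, can be sketched with a small generic example (hypothetical ratings; a deliberately simple design with items treated as fixed and judges as random, whereas the paper's generalizability-theory treatment is more general): each judge's item MPLs are summed to a judge-level cut score, and the standard error of the mean cut score then reflects the sampling of judges.

```python
import numpy as np

# Hypothetical Angoff ratings: rows = judges, columns = items; each entry is
# the judge's MPL (probability that a minimally competent examinee answers correctly).
mpl = np.array([[0.6, 0.4, 0.7, 0.5, 0.8],
                [0.7, 0.5, 0.6, 0.5, 0.7],
                [0.5, 0.4, 0.8, 0.6, 0.9],
                [0.6, 0.5, 0.7, 0.4, 0.8]])

judge_cuts = mpl.sum(axis=1)      # each judge's implied test-level cut score
cut = judge_cuts.mean()           # traditional cut: sum of the mean item MPLs
se_cut = judge_cuts.std(ddof=1) / np.sqrt(mpl.shape[0])   # error from sampling judges

print(f"passing score = {cut:.2f} of {mpl.shape[1]} points, SE = {se_cut:.2f}")
```

With items treated as random as well, a two-facet G study would partition the error further into judge, item, and judge-by-item components.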

12.
Hilton (2006) criticises the PIRLS (Progress in International Reading Literacy Study) tests and the conduct of the survey, raising questions about the validity of international surveys of reading. Her criticisms fall into four broad areas: cultural validity, methodological issues, construct validity and the survey in England. However, her criticisms are shown to be mistaken. Her claim of forced unidimensionality in the tests is not supported by statistical analyses, and her claims of cultural strangeness are contradicted by the involvement of all participating countries. She is concerned about linguistic diversity, but this is actually reflected in the ways countries organise their surveys. Finally, Hilton suggests that the English sample was biased, but fails to recognise the stringent sampling requirements or the monitoring roles of external assessors and the sampling referee. A careful study of the evidence concerning PIRLS shows that it is actually a fair and robust measure of reading attainment in different countries.

13.
This paper presents a review of recent developments in statistical techniques for repeated-measures analysis of variance. Since the literature has emphasized the issue of mixed model assumptions and their violation, we present an updated perspective on the nature of these assumptions and their implications for mixed model, adjusted mixed model, or multivariate significance tests. However, the central theme of the review is that the validity of mixed model assumptions is but one consideration in selection of an appropriate method of repeated-measures ANOVA. In particular, we recommend the avoidance of omnibus significance tests in favor of specific planned comparisons whenever hypotheses more specific than the omnibus null hypothesis may be formulated a priori. The analyst must also consider whether multiple dependent measures are to be analyzed, and the paper discusses alternative approaches to true multivariate repeated-measures designs. It also includes discussion of other relevant issues, including a brief review of the strengths and weaknesses of commonly available statistical software when applied to the analysis of repeated-measures data.

14.
The Argumentativeness (ARG) Scale and Verbal Aggressiveness Scale have been used in hundreds of studies over the past quarter century. As expected, psychometric research has examined their validity. Although this article focuses on recent criticisms by Kotowski, Levine, Baker, and Bolt, some major points refute earlier criticisms as well. This article argues that (a) a large body of research demonstrates the validity of the scales, (b) the dimensionality of the scales is quite unequivocal, (c) argumentative presumption favors using the original scales (unless and until newer scales demonstrate significantly greater criterion variance), (d) critics of the ARG Scale's predictive validity have failed to include 4 situational components of argumentativeness theory in their testing, (e) both scales are designed and intended to measure extensive sets of relevant behaviors over time, not individual behaviors observed once, and (f) statistical inference cannot confirm null hypotheses, so critics' claims of "no correlation" between scale scores and observable behaviors are not scientific.

15.
This study examined the use of generalizability theory to evaluate the quality of an alternative assessment (journal writing) in mathematics. Twenty-nine junior college students wrote journal tasks on the given topics, and two raters marked the tasks using a scoring rubric, constituting a two-facet G-study design in which students were crossed with tasks and raters. The G coefficient was .76 and the index of dependability was .72. The results showed that increasing the number of tasks had a larger effect on the G coefficient and the index of dependability than increasing the number of raters. Implications for educational practices are discussed.
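The finding that tasks matter more than raters follows directly from the D-study equation for a fully crossed persons $\times$ tasks $\times$ raters design (generic form; the article's estimated components are not reproduced here):

$$
E\rho^2=\frac{\sigma^2_{p}}
{\sigma^2_{p}+\sigma^2_{pt}/n_t+\sigma^2_{pr}/n_r+\sigma^2_{ptr,e}/(n_t n_r)}.
$$

When the person-by-task interaction $\sigma^2_{pt}$ is the dominant error term, as is typical for constructed-response tasks, increasing $n_t$ shrinks the largest term in the denominator, whereas increasing $n_r$ mainly affects the usually smaller $\sigma^2_{pr}$ term.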

16.
While research on metacognition, self-regulation and self-regulated learning is quite mature, these studies have been carried out with varying methodologies and with mixed results. This paper explores the ontological and epistemological assumptions underlying the theories, models and methods used to investigate these three constructs. Using oft-cited theories and models of the three constructs, along with highly cited studies identified in a previous review of these constructs, this paper examined facets of two popular frameworks, the Cartesian-split-mechanistic tradition (CSMT) and the relational tradition, looking specifically at the role of intra-individual development, the inclusiveness of categories and notions of causality in these theories, models and methods. While the theories and models contained elements of both traditions, the methods used to investigate these constructs relied almost exclusively on assumptions from CSMT. Future directions for research include incorporating more studies examining intra-individual change and multiple notions of causality. Future directions for practice include better contextualisation of research results to strengthen the link between theory and practice.

17.
Examined in this study were three procedures for estimating the standard errors of school passing rates using a generalizability theory model. Also examined was how these procedures behaved for student samples that differed in size. The procedures differed in terms of their assumptions about the populations from which students were sampled, and it was found that student sample size generally had a notable effect on the size of the standard error estimates they produced. Moreover, the three procedures produced markedly different standard error estimates when the student sample size was small.

18.

This article discusses the parameters of the "testimonio" by considering its purpose, the role of the author in the text, and its trustworthiness and generalizability. The author then considers the criticisms that have been lodged against Rigoberta Menchu and discusses the competing truth claims that different individuals have.

19.
Models to assess mediation in the pretest–posttest control group design are understudied in the behavioral sciences even though it is the design of choice for evaluating experimental manipulations. The article provides analytical comparisons of the four most commonly used models to estimate the mediated effect in this design: analysis of covariance (ANCOVA), difference score, residualized change score, and cross-sectional model. Each of these models is fitted using a latent change score specification, and a simulation study assesses bias, Type I error, power, and confidence interval coverage of the four models. All but the ANCOVA model make stringent assumptions about the stability and cross-lagged relations of the mediator and outcome that might not be plausible in real-world applications. When these assumptions do not hold, Type I error and statistical power results suggest that only the ANCOVA model performs well. The four models are applied to an empirical example.
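For orientation, a common way to write the ANCOVA mediation model for this design (a generic specification, not necessarily the article's exact latent change score parameterization) is

$$
M_{post}=i_M+a\,T+s_M M_{pre}+e_M,\qquad
Y_{post}=i_Y+c'\,T+b\,M_{post}+s_Y Y_{pre}+e_Y,
$$

where $T$ is the treatment indicator and the mediated effect is the product $ab$. The difference score model can be viewed as the special case that fixes the pretest weights at one (modeling $M_{post}-M_{pre}$ and $Y_{post}-Y_{pre}$), which is one way to see why its behavior depends so strongly on the stability assumptions discussed above.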

20.
The purpose of this study was to examine the quality assurance issues of a national English writing assessment in Chinese higher education. Specifically, using generalizability theory and rater interviews, this study examined how the current scoring policy of the TEM-4 (Test for English Majors – Band 4, a high-stakes national standardized EFL assessment in China) writing could impact its score variability and reliability. Eighteen argumentative essays written by nine English major undergraduate students were selected as the writing samples. Ten TEM-4 raters were first invited to use the authentic TEM-4 writing scoring rubric to score these essays holistically and analytically (with time intervals in between). They were then interviewed for their views on how the current scoring policy of the TEM-4 writing assessment could affect its overall quality. The quantitative generalizability theory results of this study suggested that the current scoring policy would not yield acceptable reliability coefficients. The qualitative results supported the generalizability theory findings. Policy implications for quality improvement of the TEM-4 writing assessment in China are discussed.
