Similar Articles
20 similar articles found.
1.
This study investigates the accuracy of item response theory (IRT) proficiency estimators under multistage testing (MST). We chose a two‐stage MST design that includes four modules (one at Stage 1, three at Stage 2) and three difficulty paths (low, middle, high). We assembled various two‐stage MST panels (i.e., forms) by manipulating two assembly conditions in each module, such as difficulty level and module length. For each panel, we investigated the accuracy of examinees' proficiency levels derived from seven IRT proficiency estimators. The choice of Bayesian (prior) versus non‐Bayesian (no prior) estimators was of more practical significance than the choice of number‐correct versus item‐pattern scoring estimators. The Bayesian estimators were slightly more efficient than the non‐Bayesian estimators, resulting in smaller overall error. Possible score changes caused by the use of different proficiency estimators would be nonnegligible, particularly for low‐ and high‐performing examinees.
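For readers unfamiliar with the distinction the abstract draws, the sketch below contrasts a non-Bayesian (maximum likelihood) and a Bayesian (EAP, standard normal prior) item-pattern estimate of proficiency under a 2PL model. The item parameters and response pattern are hypothetical, and a simple grid search stands in for the estimation routines an operational MST program would use.

```python
import numpy as np

# Hypothetical 2PL item parameters (discrimination a, difficulty b) and one response pattern.
a = np.array([1.2, 0.8, 1.5, 1.0, 0.9])
b = np.array([-1.0, -0.3, 0.2, 0.8, 1.5])
u = np.array([1, 1, 1, 0, 0])          # scored responses for one examinee

def log_likelihood(theta):
    """Log-likelihood of the response pattern under the 2PL model."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return np.sum(u * np.log(p) + (1 - u) * np.log(1 - p))

grid = np.linspace(-4, 4, 401)
loglik = np.array([log_likelihood(t) for t in grid])

# Non-Bayesian ML estimate: maximize the likelihood over the theta grid.
theta_ml = grid[np.argmax(loglik)]

# Bayesian EAP estimate: posterior mean under a standard normal prior.
prior = np.exp(-0.5 * grid**2)
post = np.exp(loglik - loglik.max()) * prior
post /= post.sum()
theta_eap = np.sum(grid * post)

print(f"ML estimate:  {theta_ml:.3f}")
print(f"EAP estimate: {theta_eap:.3f}")   # pulled toward the prior mean of 0
```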

2.
This article illustrates five different methods for estimating Angoff cut scores using item response theory (IRT) models. These include maximum likelihood (ML), expected a priori (EAP), modal a priori (MAP), and weighted maximum likelihood (WML) estimators, as well as the most commonly used approach based on translating ratings through the test characteristic curve (i.e., the IRT true‐score (TS) estimator). The five methods are compared using a simulation study and a real data example. Results indicated that the application of different methods can sometimes lead to different estimated cut scores, and that there can be some key differences in impact data when using the IRT TS estimator compared to other methods. It is suggested that practitioners carefully consider their choice of methods for estimating ability and cut scores, because different methods have distinct features and properties. An important consideration in the application of Bayesian methods relates to the choice of the prior and the potential bias that priors may introduce into estimates.
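As a point of reference, the sketch below illustrates the IRT true-score (TS) approach named in the abstract: panelists' Angoff ratings are summed to an expected raw cut score, which is then mapped to the theta scale by inverting the test characteristic curve. The 2PL item parameters and ratings here are hypothetical.

```python
import numpy as np
from scipy.optimize import brentq

# Hypothetical 2PL parameters and mean Angoff ratings (expected proportion correct
# for a minimally proficient examinee) for each item.
a = np.array([1.1, 0.9, 1.4, 1.0, 0.8, 1.2])
b = np.array([-1.2, -0.5, 0.0, 0.4, 0.9, 1.6])
angoff = np.array([0.85, 0.75, 0.65, 0.60, 0.50, 0.40])

raw_cut = angoff.sum()   # expected number-correct cut score implied by the ratings

def tcc(theta):
    """Test characteristic curve: expected number-correct score at theta."""
    return np.sum(1.0 / (1.0 + np.exp(-a * (theta - b))))

# IRT true-score (TS) cut: solve tcc(theta_cut) = raw_cut.
theta_cut = brentq(lambda t: tcc(t) - raw_cut, -6, 6)
print(f"Raw Angoff cut: {raw_cut:.2f}, TS theta cut: {theta_cut:.3f}")
```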

3.
With known item response theory (IRT) item parameters, Lord and Wingersky provided a recursive algorithm for computing the conditional frequency distribution of number‐correct test scores, given proficiency. This article presents a generalized algorithm for computing the conditional distribution of summed test scores involving real‐number item scores. The generalized algorithm is distinct from the Lord‐Wingersky algorithm in that it explicitly incorporates the task of identifying all possible unique real‐number test scores in each recursion. Some applications of the generalized recursive algorithm, such as IRT test score reliability estimation and IRT proficiency estimation based on summed test scores, are illustrated with a short test by varying scoring schemes for its items.
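A minimal sketch of the generalized recursion described here is given below: each item contributes a set of possible (real-number) scores with conditional probabilities at a fixed proficiency, and the recursion tracks every unique summed score reachable so far. The category probabilities are hypothetical placeholders rather than values derived from a fitted IRT model.

```python
from collections import defaultdict

def summed_score_distribution(items):
    """
    Generalized Lord-Wingersky recursion.

    `items` is a list of dicts mapping each possible (real-number) item score
    to its probability at a fixed proficiency level. Returns a dict mapping
    each attainable summed score to its conditional probability.
    """
    dist = {0.0: 1.0}                         # before any item, the sum is 0
    for item in items:
        new_dist = defaultdict(float)
        for total, p_total in dist.items():   # every summed score reachable so far
            for score, p_score in item.items():
                new_dist[round(total + score, 10)] += p_total * p_score
        dist = dict(new_dist)                 # keeps only the unique sums
    return dist

# Hypothetical category probabilities at one theta: a dichotomous item scored 0/1,
# a polytomous item scored 0/1/2, and an item with real-number weights 0/0.5/1.5.
items = [
    {0.0: 0.3, 1.0: 0.7},
    {0.0: 0.2, 1.0: 0.5, 2.0: 0.3},
    {0.0: 0.4, 0.5: 0.4, 1.5: 0.2},
]
dist = summed_score_distribution(items)
for s in sorted(dist):
    print(f"P(score = {s:>4}) = {dist[s]:.4f}")
```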

4.
The No Child Left Behind Act of 2001 requires that states demonstrate a reduction in the test score minority gap over time but does not specify what methodology states must use to demonstrate this. The Act also requires that a measure of Adequate Yearly Progress be established by each state, expressed in terms of the percent of students who achieve a level of "proficiency" on the state examination. While the most common methods used by states for analyzing the minority gap in test scores over time are percent achieving a performance standard, mean scale scores, and effect sizes, the default method for analyzing the minority gap will likely be the percent achieving proficiency. This article considers some of the practical issues involved in using the percent achieving a performance standard, mean scale scores, and effect sizes to analyze the minority gap, drawing on Texas students' performance on their in-state assessment, the National Assessment of Educational Progress (NAEP), and the SAT. The intent of the article is to help policymakers and others better understand the issues involved in using the various statistics to analyze the minority gap.
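The sketch below computes a group gap in the three ways named in the abstract (percent achieving a performance standard, mean scale scores, and an effect size) on hypothetical score vectors and an illustrative cut score; it is not tied to any of the Texas, NAEP, or SAT data discussed in the article.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical scale scores for two student groups and an illustrative proficiency cut.
group_a = rng.normal(220, 30, size=2000)
group_b = rng.normal(205, 32, size=1500)
cut = 210

# 1) Gap in percent achieving the proficiency standard.
pct_gap = (group_a >= cut).mean() - (group_b >= cut).mean()

# 2) Gap in mean scale scores.
mean_gap = group_a.mean() - group_b.mean()

# 3) Effect size (standardized mean difference using a pooled SD).
n_a, n_b = len(group_a), len(group_b)
pooled_sd = np.sqrt(((n_a - 1) * group_a.var(ddof=1) + (n_b - 1) * group_b.var(ddof=1))
                    / (n_a + n_b - 2))
effect_size = mean_gap / pooled_sd

print(f"Percent-proficient gap: {pct_gap:.3f}")
print(f"Mean scale-score gap:   {mean_gap:.1f}")
print(f"Effect size (d):        {effect_size:.2f}")
```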

5.
Standard errors of measurement of scale scores by score level (conditional standard errors of measurement) can be valuable to users of test results. In addition, the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1985) recommends that conditional standard errors be reported by test developers. Although a variety of procedures are available for estimating conditional standard errors of measurement for raw scores, few procedures exist for estimating conditional standard errors of measurement for scale scores from a single test administration. In this article, a procedure is described for estimating the reliability and conditional standard errors of measurement of scale scores. This method is illustrated using a strong true score model. Practical applications of this methodology are given. These applications include a procedure for constructing score scales that equalize standard errors of measurement along the score scale. Also included are examples of the effects of various nonlinear raw-to-scale score transformations on scale score reliability and conditional standard errors of measurement. These illustrations examine the effects on scale score reliability and conditional standard errors of measurement of (a) the different types of raw-to-scale score transformations (e.g., normalizing scores), (b) the number of scale score points used, and (c) the transformation used to equate alternate forms of a test. All the illustrations use data from the ACT Assessment testing program.
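To make the notion of a conditional standard error of measurement for scale scores concrete, the sketch below uses a simple binomial error model: given a true proportion-correct, raw scores are binomial, and the CSEM of the scale score is the standard deviation of the transformed raw scores. This is only one ingredient of the strong true-score procedure described in the article, and the raw-to-scale transformation shown is hypothetical.

```python
import numpy as np
from scipy.stats import binom

n_items = 40
raw = np.arange(n_items + 1)
# Hypothetical nonlinear raw-to-scale transformation (rounded to integer scale scores).
scale = np.round(20 + 15 * np.sqrt(raw)).astype(int)

def csem_scale(true_p):
    """Conditional SEM of the scale score at a given true proportion-correct,
    assuming raw scores are binomial given the true score."""
    probs = binom.pmf(raw, n_items, true_p)
    mean_scale = np.sum(probs * scale)
    return np.sqrt(np.sum(probs * (scale - mean_scale) ** 2))

for p in (0.3, 0.5, 0.7, 0.9):
    print(f"true proportion-correct {p:.1f}: scale-score CSEM = {csem_scale(p):.2f}")
```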

6.
Economic and social class differences in literacy-specific experiences and access to print resources have been widely documented. This study examined an intervention strategy designed to provide access to literacy materials and opportunities for parent-child storybook reading in three Head Start Centers. There were three specific objectives: (1) to examine the influence of text type (highly predictable, episodic predictable, and narrative) on patterns of interaction between parents and children; (2) to examine whether there were differences in these patterns of interaction between low proficiency and proficient parent readers; and (3) to examine gains in receptive language and concepts of print scores for children of low proficiency and proficient parent readers. Forty-one parents and their children participated in the study; 18 low proficiency parent readers and 23 proficient parent readers were involved in a 12-week book club. Results indicated that text type affected patterns of interaction and that parents' reading proficiency influenced conversational interactions, with different text types serving as a scaffold for parent-child interaction. Regardless of parental reading proficiency, however, children's receptive language and concepts of print improved significantly, providing further evidence for the importance of parental storybook reading on children's emerging literacy.

7.
States participating in the Growth Model Pilot Program reference individual student growth against “proficiency” cut scores that conform with the original No Child Left Behind Act (NCLB). Although achievement results from conventional NCLB models are also cut‐score dependent, the functional relationships between cut‐score location and growth results are more complex and are not currently well described. We apply cut‐score scenarios to longitudinal data to demonstrate the dependence of state‐ and school‐level growth results on cut‐score choice. This dependence is examined along three dimensions: 1) rigor, as states set cut scores largely at their discretion, 2) across‐grade articulation, as the rigor of proficiency standards may vary across grades, and 3) the time horizon chosen for growth to proficiency. Results show that the selection of plausible alternative cut scores within a growth model can change the percentage of students “on track to proficiency” by more than 20 percentage points and reverse accountability decisions for more than 40% of schools. We contribute a framework for predicting these dependencies, and we argue that the cut‐score dependence of large‐scale growth statistics must be made transparent, particularly for comparisons of growth results across states.
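A minimal sketch of the dependence the authors describe: the percent of students "on track to proficiency" is recomputed under alternative cut scores and time horizons, using hypothetical longitudinal scores and a deliberately simple projection rule (each student's latest gain is assumed to persist). Operational growth-model projections differ.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
score_prev = rng.normal(200, 25, n)                    # hypothetical prior-year scores
score_now = score_prev + rng.normal(8, 12, n)          # hypothetical current-year scores

def pct_on_track(cut, horizon_years):
    """Percent of students projected to reach the cut within the horizon,
    assuming each student's most recent gain persists (a simple projection rule)."""
    gain = score_now - score_prev
    projected = score_now + horizon_years * gain
    return 100 * np.mean(projected >= cut)

for cut in (205, 215, 225):                            # varying rigor of the standard
    for horizon in (2, 3, 4):                          # varying time horizon
        print(f"cut={cut}, horizon={horizon}: {pct_on_track(cut, horizon):5.1f}% on track")
```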

8.
Student growth percentiles (SGPs) express students' current observed scores as percentile ranks in the distribution of scores among students with the same prior‐year scores. A common concern about SGPs at the student level, and mean or median SGPs (MGPs) at the aggregate level, is potential bias due to test measurement error (ME). Shang, vanIwaarden, and Betebenner (SVB; this issue) develop a simulation‐extrapolation (SIMEX) approach to adjust SGPs for test ME. In this paper, we use a tractable example in which different SGP estimators, including SVB's SIMEX estimator, can be computed analytically to explain why ME is detrimental to both student‐level and aggregate‐level SGP estimation. A comparison of the alternative SGP estimators to the standard approach demonstrates the common bias‐variance tradeoff problem: estimators that decrease the bias relative to the standard SGP estimator increase variance, and vice versa. Even the most accurate estimator for individual student SGP has large errors of roughly 19 percentile points on average for realistic settings. Those estimators that reduce bias may suffice at the aggregate level, but no single estimator is optimal for meeting the dual goals of student‐ and aggregate‐level inferences.
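The sketch below illustrates the basic SGP computation the abstract refers to, using coarse binning on the prior-year score in place of the quantile-regression machinery used operationally; the scores are simulated and no measurement-error (SIMEX) adjustment is applied.

```python
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(2)
n = 20000
prior = rng.normal(0, 1, n)
current = 0.7 * prior + rng.normal(0, 0.7, n)   # hypothetical prior/current scores

# Group students with (approximately) the same prior score via binning,
# then express each current score as a percentile rank within its group.
bins = np.digitize(prior, np.quantile(prior, np.linspace(0, 1, 21)[1:-1]))
sgp = np.empty(n)
for g in np.unique(bins):
    idx = np.where(bins == g)[0]
    ranks = rankdata(current[idx])              # 1..len(idx)
    sgp[idx] = np.ceil(100 * (ranks - 0.5) / len(idx))

# Aggregate summary (mean/median SGP) for a hypothetical school of 80 students.
school = rng.choice(n, 80, replace=False)
print(f"school mean SGP = {sgp[school].mean():.1f}, median SGP = {np.median(sgp[school]):.0f}")
```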

9.
Equating methods make use of an appropriate transformation function to map the scores of one test form into the scale of another so that scores are comparable and can be used interchangeably. The equating literature shows that the ways of judging the success of an equating (i.e., the score transformation) might differ depending on the adopted framework. Rather than targeting different parts of the equating process and aiming to evaluate the process from different aspects, this article views the equating transformation as a standard statistical estimator and discusses how this estimator should be assessed in an equating framework. For the kernel equating framework, a numerical illustration shows the potential of viewing the equating transformation as a statistical estimator as opposed to assessing it using equating‐specific criteria. A discussion on how this approach can be used to compare other equating estimators from different frameworks is also included.
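For concreteness, the sketch below builds an equipercentile-style equating transformation from Gaussian-kernel continuized score distributions, the core idea of the kernel equating framework mentioned here; it omits presmoothing and the bandwidth and linear adjustments of the full method, and the score data are simulated.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
x_scores = rng.binomial(40, 0.6, 3000)      # hypothetical form X scores
y_scores = rng.binomial(40, 0.55, 3000)     # hypothetical form Y scores

def kernel_cdf(points, scores, h=0.6):
    """Gaussian-kernel continuized CDF evaluated at `points`."""
    return norm.cdf((points[:, None] - scores[None, :]) / h).mean(axis=1)

grid = np.linspace(-1, 41, 2000)
G_y = kernel_cdf(grid, y_scores)

def equate_x_to_y(x):
    """Map a form X score to the form Y scale: y = G^{-1}(F(x))."""
    F_x = kernel_cdf(np.atleast_1d(float(x)), x_scores)[0]
    return np.interp(F_x, G_y, grid)         # numerical inverse of G

for x in (10, 20, 30):
    print(f"X = {x}  ->  equated Y = {equate_x_to_y(x):.2f}")
```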

10.
In the nonequivalent groups with anchor test (NEAT) design, the standard error of linear observed‐score equating is commonly estimated by an estimator derived assuming multivariate normality. However, real data are seldom normally distributed, causing this normal estimator to be inconsistent. A general estimator, which does not rely on the normality assumption, would be preferred, because it is asymptotically accurate regardless of the distribution of the data. In this article, an analytical formula for the standard error of linear observed‐score equating, which characterizes the effect of nonnormality, is obtained under elliptical distributions. Using three large‐scale real data sets as the populations, resampling studies are conducted to empirically evaluate the normal and general estimators of the standard error of linear observed‐score equating. The effects of sample size (50, 100, 250, or 500) and equating method (chained linear, Tucker, or Levine observed‐score equating) are examined. Results suggest that the general estimator has smaller bias than the normal estimator in all 36 conditions; it has larger standard error when the sample size is at least 100; and it has smaller root mean squared error in all but one condition. An R program is also provided to facilitate the use of the general estimator.
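The sketch below shows chained linear observed-score equating in a NEAT design together with a nonparametric bootstrap standard error, a resampling counterpart to the analytical estimators compared in the article; the data are simulated and the authors' R program is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical NEAT data: group 1 takes form X + anchor A, group 2 takes form Y + anchor A.
n1, n2 = 400, 400
a1 = rng.normal(20, 4, n1); x1 = 1.1 * a1 + rng.normal(5, 3, n1)
a2 = rng.normal(21, 4, n2); y2 = 1.0 * a2 + rng.normal(8, 3, n2)

def chained_linear(x, x1, a1, y2, a2):
    """Chained linear equating: link X to A in group 1, then A to Y in group 2."""
    a_equiv = a1.mean() + a1.std(ddof=1) / x1.std(ddof=1) * (x - x1.mean())
    return y2.mean() + y2.std(ddof=1) / a2.std(ddof=1) * (a_equiv - a2.mean())

x_point = 30.0
estimates = []
for _ in range(500):                       # bootstrap resampling within each group
    i1 = rng.integers(0, n1, n1)
    i2 = rng.integers(0, n2, n2)
    estimates.append(chained_linear(x_point, x1[i1], a1[i1], y2[i2], a2[i2]))

print(f"equated score at X={x_point}: {chained_linear(x_point, x1, a1, y2, a2):.2f}")
print(f"bootstrap SE: {np.std(estimates, ddof=1):.3f}")
```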

11.
A formal analysis of the effects of item deletion on equating/scaling functions and reported score distributions is presented. There are two components of the present analysis: analytical and empirical. The analytical decomposition demonstrates how the effects of item characteristics, test properties, individual examinee responses, and rounding rules combine to produce the item deletion effect on the equating/scaling function and candidate scores. In addition to demonstrating how the deleted item's psychometric characteristics can affect the equating function, the analytical component of the report examines the effects of not scoring versus scoring all options correct, the effects of re-equating versus not re-equating, and the interaction between the decision to re-equate or not to re-equate and the scoring option chosen for the flawed item. The empirical portion of the report uses data from the May 1982 administration of the SAT, which contained the circles item, to illustrate the effects of item deletion on reported score distributions and equating functions. The empirical data verify what the analytical decomposition predicts.
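A minimal illustration of the scoring-option comparison discussed above: for a hypothetical 0/1 response matrix and a flawed item, raw scores are recomputed under the original scoring, under scoring all examinees correct, and with the item deleted. The re-equating step analyzed in the article is not shown.

```python
import numpy as np

rng = np.random.default_rng(5)
responses = (rng.random((1000, 50)) < 0.65).astype(int)   # hypothetical 0/1 item scores
flawed = 12                                               # index of the flawed item

original = responses.sum(axis=1)                          # raw scores as administered
all_correct = original + (1 - responses[:, flawed])       # flawed item scored correct for everyone
deleted = original - responses[:, flawed]                 # flawed item removed from scoring

for label, scores in (("original", original),
                      ("score all correct", all_correct),
                      ("item deleted", deleted)):
    print(f"{label:>18}: mean = {scores.mean():6.2f}, SD = {scores.std(ddof=1):5.2f}")
```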

12.
Frequency distributions of test scores may appear irregular and, as estimates of a population distribution, contain a substantial amount of sampling error. Techniques for smoothing score distributions are available that have the capacity to improve estimation. In this article, estimation/smoothing methods that are flexible enough to fit a wide variety of test score distributions are reviewed. The methods are a kernel method, a strong true-score model-based method, and a method that uses polynomial log-linear models. The use of these methods is then reviewed, and applications of the methods are presented that include describing and comparing test score distributions, estimating norms, and estimating equipercentile equivalents in test score equating. Suggestions for further research are also provided.
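As an example of the third method reviewed, the sketch below presmooths a simulated raw-score frequency distribution with a polynomial log-linear model, fit as a Poisson GLM on the score frequencies; the polynomial degree and data are illustrative.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n_items = 30
observed = rng.binomial(n_items, rng.beta(8, 5, 1500))       # hypothetical raw scores
scores = np.arange(n_items + 1)
freq = np.bincount(observed, minlength=n_items + 1)          # observed frequency distribution

# Log-linear smoothing: model log(expected frequency) as a degree-4 polynomial in the
# (standardized) score and fit by Poisson maximum likelihood.
degree = 4
z = (scores - scores.mean()) / scores.std()                  # standardize for numerical stability
design = np.column_stack([z**d for d in range(degree + 1)])  # includes the intercept (d = 0)
fit = sm.GLM(freq, design, family=sm.families.Poisson()).fit()
smoothed = fit.fittedvalues                                  # smoothed frequencies

for s in range(0, n_items + 1, 5):
    print(f"score {s:2d}: observed {freq[s]:4d}, smoothed {smoothed[s]:7.1f}")
```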

13.
To examine the predictive utility of three scales provided in the released database of the Third International Mathematics and Science Study (TIMSS) (international plausible values, standardized percent correct score, and national Rasch score), information was obtained on the performance in state examinations in mathematics and science in 1996 (2,969 Grade 8 students) and in 1997 (2,898 Grade 7 students) of students in the Republic of Ireland who had participated in TIMSS in 1995. Performance on TIMSS was related to later performance in the state examinations using normal and nonparametric maximum likelihood (NPML) random effects models. In every case, standardized percent correct scores were found to be the best predictors of later performance, followed by national Rasch scores, and lastly, by international plausible values. The estimates for normal mixing distributions are close to those estimated by the NPML approach, lending support to the validity of estimates.

14.
This paper illustrates that the psychometric properties of scores and scales that are used with mixed‐format educational tests can impact the use and interpretation of the scores that are reported to examinees. Psychometric properties that include reliability and conditional standard errors of measurement are considered in this paper. The focus is on mixed‐format tests in situations for which raw scores are integer‐weighted sums of item scores. Four associated real‐data examples include (a) effects of weights associated with each item type on reliability, (b) comparison of psychometric properties of different scale scores, (c) evaluation of the equity property of equating, and (d) comparison of the use of unidimensional and multidimensional procedures for evaluating psychometric properties. Throughout the paper, and especially in the conclusion section, the examples are related to issues associated with test interpretation and test use.
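The sketch below illustrates the kind of question raised by example (a): how the weights attached to each item type can change the reliability of an integer-weighted composite, assuming uncorrelated measurement errors across sections. The section standard deviations, reliabilities, and correlation are hypothetical, and this is only one of several ways such a composite reliability could be computed.

```python
import numpy as np

# Hypothetical section statistics for a mixed-format test:
# multiple-choice (MC) and constructed-response (CR) raw-score SDs, reliabilities,
# and the correlation between the two section scores.
sd = np.array([6.0, 3.5])          # [MC, CR] raw-score standard deviations
rel = np.array([0.90, 0.78])       # section reliabilities
corr = 0.75
cov = np.array([[sd[0]**2, corr * sd[0] * sd[1]],
                [corr * sd[0] * sd[1], sd[1]**2]])

def composite_reliability(weights):
    """Reliability of an integer-weighted composite, assuming uncorrelated section errors."""
    w = np.asarray(weights, dtype=float)
    total_var = w @ cov @ w
    error_var = np.sum(w**2 * sd**2 * (1 - rel))
    return 1 - error_var / total_var

for w in ([1, 1], [1, 2], [1, 3], [2, 1]):
    print(f"weights {w}: composite reliability = {composite_reliability(w):.3f}")
```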

15.
No Child Left Behind (NCLB) performance mandates, embedded within state accountability systems, focus school AYP (adequate yearly progress) compliance squarely on the percentage of students at or above proficient. The singular importance of this quantity for decision-making purposes has initiated extensive research into percent proficient as a measure of school quality. In particular, technical discussions have scrutinized the impact of sampling, measurement, and other sources of error on percent proficient statistics. In this article, we challenge the received orthodoxy that measurement error associated with individual students' scores is inconsequential for aggregate percent proficient statistics. Synthesizing current classification accuracy research with techniques from randomized response designs, we establish results which specify the extent to which measurement error—manifest as performance level misclassifications—produces bias and increases error variability for percent-at-performance-level statistics. The results have direct relevance for the design of coherent and fair accountability systems based upon assessment outcomes.
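The core mechanism can be made concrete with a small sketch: given assumed classification error rates, the expected observed percent proficient is a mixture of true positives and false positives, and a randomized-response-style correction inverts that relationship. The error rates and school size below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
p_true = 0.62          # hypothetical true proportion of proficient students in a school
sensitivity = 0.88     # P(classified proficient | truly proficient)
specificity = 0.91     # P(classified not proficient | truly not proficient)
n = 300                # school size

# Expected observed percent proficient under misclassification.
expected_obs = p_true * sensitivity + (1 - p_true) * (1 - specificity)

# Simulate one school's observed rate and apply the bias correction
# p_hat = (obs - (1 - specificity)) / (sensitivity + specificity - 1).
truly = rng.random(n) < p_true
classified = np.where(truly, rng.random(n) < sensitivity, rng.random(n) < 1 - specificity)
obs = classified.mean()
corrected = (obs - (1 - specificity)) / (sensitivity + specificity - 1)

print(f"true: {p_true:.3f}, expected observed: {expected_obs:.3f}")
print(f"one simulated school -> observed: {obs:.3f}, bias-corrected: {corrected:.3f}")
```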

16.
In this note, we demonstrate an interesting use of the posterior distributions (and corresponding posterior samples of proficiency) that are yielded by fitting a fully Bayesian test scoring model to a complex assessment. Specifically, we examine the efficacy of the test in combination with the specific passing score that was chosen through expert judgment, or, in general, any external a priori criterion. In addition, we study the robustness of the test's efficacy with respect to the choice of the passing score.
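A minimal sketch of the idea, under the assumption that posterior draws of proficiency are available for each examinee: each examinee's posterior probability of exceeding the externally chosen passing score is computed, decisions are summarized, and the cut is perturbed to check robustness. The posterior samples below are simulated rather than drawn from a fitted model.

```python
import numpy as np

rng = np.random.default_rng(8)
n_examinees, n_draws = 500, 2000
true_theta = rng.normal(0, 1, n_examinees)
# Hypothetical posterior samples: centered near each examinee's proficiency, posterior SD 0.3.
posterior = true_theta[:, None] + rng.normal(0, 0.3, (n_examinees, n_draws))

cut = 0.5                                           # externally chosen passing score
p_above = (posterior > cut).mean(axis=1)            # posterior P(pass) for each examinee
decision = posterior.mean(axis=1) > cut             # pass/fail based on the posterior mean

# Expected classification accuracy: average posterior probability of the decision made.
expected_accuracy = np.mean(np.where(decision, p_above, 1 - p_above))
print(f"pass rate: {decision.mean():.3f}, expected classification accuracy: {expected_accuracy:.3f}")

# Robustness check: how do decisions shift if the cut moves slightly?
for c in (0.4, 0.5, 0.6):
    print(f"cut = {c}: pass rate = {np.mean(posterior.mean(axis=1) > c):.3f}")
```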

17.
Which kind of content- and language-integrated intervention better supports mathematical learning, depending on students' language background: fostering communication on the discourse level, or additional lexical training on the word level? A controlled trial compared interventions with two different materials (discursive versus lexical-discursive, each 5 × 90 min.) with respect to the dependent variable of conceptual and procedural knowledge of fractions. The effects are investigated differentially for four language groups: monolinguals versus bilinguals, each with higher versus lower German language proficiency (n = 343). For both interventions, the ANOVA shows an increase in mathematical knowledge for the experimental group that is significantly higher than for the control group, but no significant difference between the interventions. The intervention with lexical-discursive materials led to a slightly higher increase of knowledge in the posttest, but the discursive intervention was superior in the follow-up test. Monolingual and multilingual students showed similar patterns of growth, with no differential pattern. However, the more proficient monolingual students tended to profit more from the interventions, especially from the lexical-discursive intervention.

18.
This paper estimates the impact of the Michigan school finance reform, Proposal A, on education inputs and test scores. Using a difference-in-differences estimation strategy, I find that school districts in Michigan used the increase in educational spending generated through Proposal A to increase teacher salaries and, to a smaller extent, to reduce class size. Then, using the foundation allowance created by Proposal A as an instrument, I estimate the causal effect of increased spending on 4th and 7th grade math scores for two test measures – a scaled score and a percent satisfactory measure – and find positive effects of increased spending on 4th grade test scores. A 60% increase in spending increases the percent satisfactory score by one standard deviation. The positive impact of expenditures on test performance seems largely due to higher teacher salaries.
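For readers unfamiliar with the estimation strategy named here, the sketch below runs a generic difference-in-differences regression on simulated district-by-period data with statsmodels; it does not use the Michigan data, and the instrumental-variables step based on the foundation allowance is not shown.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
n_districts = 200
treated = rng.random(n_districts) < 0.5
rows = []
for d in range(n_districts):
    for post in (0, 1):
        effect = 4.0 if (treated[d] and post) else 0.0       # hypothetical true DiD effect
        rows.append({"district": d, "treated": int(treated[d]), "post": post,
                     "score": 250 + 5 * treated[d] + 3 * post + effect + rng.normal(0, 6)})
df = pd.DataFrame(rows)

# Difference-in-differences: the coefficient on treated:post is the estimated effect,
# with standard errors clustered at the district level.
model = smf.ols("score ~ treated + post + treated:post", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["district"]})
print(f"DiD estimate: {model.params['treated:post']:.2f} "
      f"(SE = {model.bse['treated:post']:.2f})")
```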

19.
This paper proposes the conditional root-square estimator and the generalized conditional root-square estimator of the regression coefficients in the constrained linear model. It is shown that, under certain conditions, both estimators improve upon the constrained least-squares estimator of the regression coefficients in the mean squared error sense. The admissibility of the two estimators and methods for choosing the unknown parameter in the generalized conditional root-square estimator are also discussed.
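The root-square estimators themselves are not specified in the abstract; as context, the sketch below computes the constrained (restricted) least-squares estimator they are compared against, for simulated data and a hypothetical linear constraint Rβ = r.

```python
import numpy as np

rng = np.random.default_rng(10)
n, k = 200, 4
X = rng.normal(size=(n, k))
beta_true = np.array([1.0, 2.0, -1.5, 0.5])
y = X @ beta_true + rng.normal(0, 1, n)

# Hypothetical linear constraint R beta = r (here: beta_1 + beta_2 = 0.5).
R = np.array([[0.0, 1.0, 1.0, 0.0]])
r = np.array([0.5])

XtX_inv = np.linalg.inv(X.T @ X)
beta_ols = XtX_inv @ X.T @ y

# Constrained least-squares estimator:
# beta_c = beta_ols - (X'X)^{-1} R' [R (X'X)^{-1} R']^{-1} (R beta_ols - r)
adjust = XtX_inv @ R.T @ np.linalg.inv(R @ XtX_inv @ R.T) @ (R @ beta_ols - r)
beta_constrained = beta_ols - adjust

print("OLS:        ", np.round(beta_ols, 3))
print("Constrained:", np.round(beta_constrained, 3))
print("R @ beta_c =", np.round(R @ beta_constrained, 3))   # satisfies the constraint
```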

20.
The present study tests the effect of ability pairing under two instructional methods in L2 collaborative revision. A pair is characterized by two continuous indices, individual proficiency level and the distance in proficiency between pair members (heterogeneity), together with the interaction between the two. The instructional methods tested are modelling and practising. Results show that the effect of pair composition depends on the instructional strategy. In the Practising condition, less proficient learners profit most from a heterogeneous ability pair, whereas more proficient learners are best paired homogeneously. In the Modelling condition, no effect of pair composition factors was observed. This result illustrates that Modelling is a powerful instructional method for complex learning tasks like collaborative revision in L2, as it overrides some of the grouping effects found in more traditional learning conditions.
