Similar Documents
20 similar documents retrieved.
1.
2.
Scoring criteria are crucial in writing assessment, and different scoring methods affect rater behavior. Research shows that although both holistic and analytic scoring of English writing are reliable, rater severity and examinees' writing scores differ considerably between the two approaches. Overall, under holistic scoring raters' severity tends to converge and approaches the ideal value, whereas under analytic scoring examinees obtain higher writing scores and raters also differ significantly in severity. For high-stakes examinations that determine examinees' futures, holistic scoring is therefore the more highly recommended method.

3.
4.
5.
Scoring is an important factor affecting the reliability and validity of speaking tests. Scoring methods for speaking tests can be divided into subjective scoring and objective or semi-objective scoring. The former mainly comprises holistic rating scales and analytic rating scales; the latter mainly comprises machine scoring, scoring based on discrete objective indicators, and dichotomous (0/1) scoring. This paper reviews and summarizes these scoring methods and points out the strengths and weaknesses of each. It also discusses the relationship between scoring methods and the definition of speaking ability, the choice of scoring method, and the relationship between scoring and test validity.

6.
7.
8.
We examined how raters and tasks influence measurement error in writing evaluation and how many raters and tasks are needed to reach desirable reliabilities of .90 and .80 for children in Grades 3 and 4. A total of 211 children (102 boys) were administered three tasks in narrative and expository genres, respectively, and their written compositions were evaluated using widely used evaluation methods for developing writers: holistic scoring, productivity, and curriculum-based writing scores. Results showed that 54 and 52% of the variance in narrative and expository compositions, respectively, was attributable to true individual differences in writing. Students' scores varied largely by tasks (30.44 and 28.61% of variance), but not by raters. To reach a reliability of .90, multiple tasks and raters were needed, and for a reliability of .80, a single rater and multiple tasks were needed. These findings offer important implications for reliably evaluating children's writing skills, given that writing is typically evaluated by a single task and a single rater in classrooms and even in some state accountability systems.
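
The rater-and-task projections described here follow the logic of a generalizability-theory decision study. Below is a minimal sketch of that calculation in Python; the variance components are hypothetical values loosely echoing the proportions reported in the abstract (roughly half person variance, about 30% task-related variance, negligible rater variance), not the study's actual estimates.

```python
# Illustrative D-study projection for a crossed person x task x rater design.
# Variance components are hypothetical, not taken from the study.

def g_coefficient(var_p, var_pt, var_pr, var_ptr, n_tasks, n_raters):
    """Relative generalizability coefficient when averaging over
    n_tasks tasks and n_raters raters."""
    rel_error = var_pt / n_tasks + var_pr / n_raters + var_ptr / (n_tasks * n_raters)
    return var_p / (var_p + rel_error)

var_p, var_pt, var_pr, var_ptr = 0.54, 0.30, 0.01, 0.15  # hypothetical components

for n_tasks in (1, 2, 3, 4, 6):
    for n_raters in (1, 2, 3):
        g = g_coefficient(var_p, var_pt, var_pr, var_ptr, n_tasks, n_raters)
        print(f"tasks={n_tasks}, raters={n_raters}: G = {g:.2f}")
```

With components of this shape, adding tasks raises the coefficient far more than adding raters, which mirrors the qualitative pattern the abstract reports.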

9.
When practitioners use modern measurement models to evaluate rating quality, they commonly examine rater fit statistics that summarize how well each rater's ratings fit the expectations of the measurement model. Essentially, this approach involves examining the unexpected ratings that each misfitting rater assigned (i.e., carrying out analyses of standardized residuals). One can create plots of the standardized residuals, isolating those that resulted from raters' ratings of particular subgroups. Practitioners can then examine the plots to identify raters who did not maintain a uniform level of severity when they assessed various subgroups (i.e., exhibited evidence of differential rater functioning). In this study, we analyzed simulated and real data to explore the utility of this between-subgroup fit approach. We used standardized between-subgroup outfit statistics to identify misfitting raters and the corresponding plots of their standardized residuals to determine whether there were any identifiable patterns in each rater's misfitting ratings related to subgroups.
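
The between-subgroup fit approach rests on standardized residuals, z = (observed − expected) / √variance, and on outfit mean-squares computed within subgroups. The sketch below illustrates the bookkeeping with entirely synthetic data standing in for the model-implied expectations of a fitted Rasch-family model; none of the values come from the study.

```python
import numpy as np

# Synthetic illustration of between-subgroup rater fit:
# standardized residuals and subgroup-specific outfit mean-squares.

rng = np.random.default_rng(0)
n_raters, n_examinees = 3, 200
expected = rng.uniform(1.5, 3.5, size=(n_raters, n_examinees))  # model-implied expected scores (stand-ins)
variance = rng.uniform(0.4, 0.9, size=(n_raters, n_examinees))  # model-implied score variances (stand-ins)
observed = np.clip(np.rint(expected + rng.normal(0.0, np.sqrt(variance))), 1, 4)
subgroup = rng.integers(0, 2, size=n_examinees)                 # two examinee subgroups

z = (observed - expected) / np.sqrt(variance)                   # standardized residuals

for r in range(n_raters):
    for g in (0, 1):
        outfit_ms = np.mean(z[r, subgroup == g] ** 2)           # subgroup outfit mean-square
        print(f"rater {r}, subgroup {g}: outfit MS = {outfit_ms:.2f}")
```

A rater whose outfit mean-square differs markedly between subgroups would be flagged for closer inspection of the residual plots.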

10.
In an essay rating study multiple ratings may be obtained by having different raters judge essays or by having the same rater(s) repeat the judging of essays. An important question in the analysis of essay ratings is whether multiple ratings, however obtained, may be assumed to represent the same true scores. When different raters judge the same essays only once, it is impossible to answer this question. In this study 16 raters judged 105 essays on two occasions; hence, it was possible to test assumptions about true scores within the framework of linear structural equation models. It emerged that the ratings of a given rater on the two occasions represented the same true scores. However, the ratings of different raters did not represent the same true scores. The estimated intercorrelations of the true scores of different raters ranged from .415 to .910. Parameters of the best fitting model were used to compute coefficients of reliability, validity, and invalidity. The implications of these coefficients are discussed.
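
One way to formalize the question of whether multiple ratings represent the same true scores is a congeneric true-score model, stated here as a reading of the abstract rather than the authors' exact specification:

$$X_{jt} = \mu_{jt} + \lambda_{jt}\,T_j + e_{jt}, \qquad t = 1, 2,$$

where $X_{jt}$ is rater $j$'s rating on occasion $t$ and $T_j$ is that rater's true score. The reported pattern corresponds to a single $T_j$ underlying a rater's two occasions, while $\operatorname{Corr}(T_j, T_{j'}) < 1$ for different raters $j \neq j'$.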

11.
Machine learning has been frequently employed to automatically score constructed response assessments. However, there is a lack of evidence of how this predictive scoring approach might be compromised by construct-irrelevant variance (CIV), which is a threat to test validity. In this study, we evaluated machine scores and human scores with regard to potential CIV. We developed two assessment tasks targeting science teacher pedagogical content knowledge (PCK); each task contains three video-based constructed response questions. 187 in-service science teachers watched the videos, each paired with a given classroom teaching scenario, and then responded to the constructed-response items. Three human experts rated the responses, and the human raters' consensus scores were used to develop machine learning algorithms to predict ratings of the responses. Including the machine as another independent rater, along with the three human raters, we employed the many-facet Rasch measurement model to examine CIV due to three sources: variability of scenarios, rater severity, and rater sensitivity to the scenarios. Results indicate that variability of scenarios impacts teachers' performance, but the impact significantly depends on the construct of interest; for each assessment task, the machine is always the most severe rater, compared to the three human raters. However, the machine is less sensitive than the human raters to the task scenarios. This means the machine scoring is more consistent and stable across scenarios within each of the two tasks.
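
For reference, a common rating-scale formulation of the many-facet Rasch measurement model used in this kind of analysis is

$$\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - \delta_i - \alpha_j - \tau_k,$$

where $\theta_n$ is teacher $n$'s latent proficiency, $\delta_i$ the difficulty of scenario $i$, $\alpha_j$ the severity of rater $j$ (human or machine), and $\tau_k$ the threshold between categories $k-1$ and $k$. Rater sensitivity to scenarios is typically probed by adding a rater-by-scenario interaction (bias) term; the paper's exact parameterization may differ.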

12.
In this study, we describe a framework for monitoring rater performance over time. We present several statistical indices to identify raters whose standards drift and explain how to use those indices operationally. To illustrate the use of the framework, we analyzed rating data from the 2002 Advanced Placement English Literature and Composition examination, employing a multifaceted Rasch approach to determine whether raters exhibited evidence of two types of differential rater functioning over time (i.e., changes in levels of accuracy or scale category use). Some raters showed statistically significant changes in their levels of accuracy as the scoring progressed, while other raters displayed evidence of differential scale category use over time.

13.
This study describes several categories of rater errors (rater severity, halo effect, central tendency, and restriction of range). Criteria are presented for evaluating the quality of ratings based on a many-faceted Rasch measurement (FACETS) model for analyzing judgments. A random sample of 264 compositions rated by 15 raters and a validity committee from the 1990 administration of the Eighth Grade Writing Test in Georgia is used to illustrate the model. The data suggest that there are significant differences in rater severity. Evidence of a halo effect is found for two raters who appear to be rating the compositions holistically rather than analytically. Approximately 80% of the ratings are in the two middle categories of the rating scale, indicating that the error of central tendency is present. Restriction of range is evident when the unadjusted raw score distribution is examined, although this rater error is less evident when adjusted estimates of writing competence are used.
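
Before fitting a FACETS model, the rater errors named here can be screened descriptively from raw ratings. The sketch below uses synthetic ratings (15 raters by 264 essays on a four-category scale, mirroring the study's design but not its data) to compute simple raw-score indicators of severity, central tendency, and restriction of range; it is not a substitute for the Rasch analysis.

```python
import numpy as np

# Synthetic raw-score screening for rater severity, central tendency,
# and restriction of range. Data are invented for illustration.

rng = np.random.default_rng(1)
ratings = rng.integers(1, 5, size=(15, 264)).astype(float)  # 15 raters x 264 essays, categories 1-4

overall_mean = ratings.mean()
for r, rater_scores in enumerate(ratings):
    severity = overall_mean - rater_scores.mean()            # positive = harsher than average
    middle_share = np.isin(rater_scores, (2.0, 3.0)).mean()  # central-tendency signal
    spread = rater_scores.std()                              # low spread suggests restricted range
    print(f"rater {r:2d}: severity={severity:+.2f}, "
          f"middle-category share={middle_share:.2f}, SD={spread:.2f}")
```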

14.
The hierarchical rater model (HRM) recognizes the hierarchical structure of data that arises when raters score constructed response items. In this approach, raters' scores are not viewed as being direct indicators of examinee proficiency but rather as indicators of essay quality; the (latent categorical) quality of an examinee's essay in turn serves as an indicator of the examinee's proficiency, thus yielding a hierarchical structure. Here it is shown that a latent class model motivated by signal detection theory (SDT) is a natural candidate for the first level of the HRM, the rater model. The latent class SDT model provides measures of rater precision and various rater effects, above and beyond simply severity or leniency. The HRM-SDT model is applied to data from a large-scale assessment and is shown to provide a useful summary of various aspects of the raters' performance.
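
A common latent class SDT formulation for the rater level, sketched for orientation and not necessarily the paper's exact parameterization: given the latent quality category $\eta \in \{1, \dots, C\}$ of an essay, rater $r$'s rating $X_r$ satisfies

$$P(X_r \le k \mid \eta) = F\!\left(c_{rk} - d_r\,\eta\right), \qquad k = 1, \dots, K-1,$$

where $F$ is a logistic or normal distribution function, the criteria $c_{rk}$ describe the rater's category usage (and hence severity or leniency), and $d_r$ is the rater's discrimination, a precision measure for distinguishing adjacent quality levels.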

15.
Drawing on Upshur and Turner's (1999) theoretical model of test design and rating, this study uses the discourse features of examinees' spoken output as a reference point to investigate the similarities and differences between holistic and analytic scoring in speaking tests. The results show that among the discourse features of examinees' speech, the fluency measure of meaningful syllables per minute has a significant effect on both scoring approaches, and that in both scoring processes raters attend to the fluency of examinees' speech while neglecting linguistic accuracy and complexity. The paper analyzes these findings further and, from the perspective of examinee discourse, sheds light on the control of rating error in speaking tests.
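
The fluency index highlighted here, meaningful syllables per minute, is a simple rate measure. A hypothetical computation for one speech sample (the counts below are invented, not from the study):

```python
# Hypothetical fluency calculation: meaningful syllables per minute.

meaningful_syllables = 312      # syllables excluding fillers and repairs (invented count)
speaking_time_seconds = 150     # total response time in seconds (invented)

syllables_per_minute = meaningful_syllables / (speaking_time_seconds / 60)
print(f"meaningful syllables per minute: {syllables_per_minute:.1f}")  # 124.8
```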

16.
Rater effects and rating scales are central issues in research on scoring error for constructed-response items. Taking Question 38 (an essay question) from the politics paper of the 2006 college entrance examination (Shanghai) as an example, this study applies the Rater Effects model in ACER Conquest. The results show that scoring of this question exhibited essentially no randomness (haziness), central tendency, or restriction-of-range errors, and that raters were able to distinguish examinees' different performance characteristics reasonably well; apart from a few raters whose rating consistency still needs improvement, however, differences in rater severity were quite pronounced. The author therefore proposes a method for adjusting test scores according to rater severity.
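
A crude illustration of the adjustment idea, expressed in raw-score terms: estimate each rater's leniency as the gap between that rater's mean award and the overall mean, then remove it from the scores the rater assigned. The rater names and numbers below are hypothetical, and the paper's actual method works from model-based severity estimates rather than raw means.

```python
# Hypothetical severity/leniency adjustment in raw-score terms.

scores = {  # rater -> raw scores awarded (invented)
    "rater_A": [8, 9, 7, 10, 9],
    "rater_B": [6, 7, 5, 8, 6],
}

all_scores = [s for awarded in scores.values() for s in awarded]
overall_mean = sum(all_scores) / len(all_scores)

for rater, awarded in scores.items():
    leniency = sum(awarded) / len(awarded) - overall_mean     # positive = more lenient than average
    adjusted = [round(s - leniency, 1) for s in awarded]
    print(f"{rater}: leniency = {leniency:+.2f}, adjusted scores = {adjusted}")
```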

17.
We examine the factor structure of scores from the CLASS-S protocol obtained from observations of middle school classroom teaching. Factor analysis has been used to support both interpretations of scores from classroom observation protocols, like CLASS-S, and the theories about teaching that underlie them. However, classroom observations contain multiple sources of error, most predominantly rater errors. We demonstrate that errors in scores made by two raters on the same lesson have a factor structure that is distinct from the factor structure at the teacher level. Consequently, the "standard" approach of analyzing teacher-level average dimension scores can yield incorrect inferences about the factor structure at the teacher level and possibly misleading evidence about the validity of scores and theories of teaching. We consider alternative hierarchical estimation approaches designed to prevent the contamination of estimated teacher-level factors. These alternative approaches find a teacher-level factor structure for CLASS-S that consists of strongly correlated support and classroom management factors. Our results have implications for future studies using factor analysis on classroom observation data to develop validity evidence and test theories of teaching and for practitioners who rely on the results of such studies to support their use and interpretation of the classroom observation scores.
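
To see why averaging can mislead, suppose each teacher's vector of dimension scores decomposes into a between-teacher part with covariance $\Sigma_B$ and an independent within-teacher (rater-by-lesson) part with covariance $\Sigma_W$. Averaging over $R$ ratings then gives

$$\operatorname{Cov}(\bar{y}) = \Sigma_B + \tfrac{1}{R}\,\Sigma_W,$$

so a factor analysis of teacher-level averages reflects a blend of the two structures unless $R$ is large or the within structure mirrors the between structure. This identity is a standard multilevel result offered as a reading of the abstract, not the authors' notation.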

18.
Numerous researchers have proposed methods for evaluating the quality of rater-mediated assessments using nonparametric methods (e.g., kappa coefficients) and parametric methods (e.g., the many-facet Rasch model). Generally speaking, popular nonparametric methods for evaluating rating quality are not based on a particular measurement theory. On the other hand, popular parametric methods for evaluating rating quality are often based on measurement theories such as invariant measurement. However, these methods are based on assumptions and transformations that may not be appropriate for ordinal ratings. In this study, I show how researchers can use Mokken scale analysis (MSA), which is a nonparametric approach to item response theory, to evaluate rating quality within the framework of invariant measurement without the use of potentially inappropriate parametric techniques. I use an illustrative analysis of data from a rater-mediated writing assessment to demonstrate how one can use numeric and graphical indicators from MSA to gather evidence of validity, reliability, and fairness. The results from the analyses suggest that MSA provides a useful framework within which to evaluate rater-mediated assessments for evidence of validity, reliability, and fairness that can supplement existing popular methods for evaluating ratings.
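
For orientation, the central numeric index in Mokken scale analysis is the scalability coefficient; for a pair of raters $j$ and $k$ (treated as "items" in the rater-mediated setting) it can be written as

$$H_{jk} = \frac{\operatorname{Cov}(X_j, X_k)}{\operatorname{Cov}^{\max}(X_j, X_k)},$$

where $\operatorname{Cov}^{\max}$ is the largest covariance attainable given the two raters' marginal score distributions; rater-level and scale-level $H$ coefficients aggregate these pairwise values. This is the standard MSA definition, stated here as background rather than taken from the article.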

19.
The NTID Writing Test was developed to assess the writing ability of postsecondary deaf students entering the National Technical Institute for the Deaf and to determine their appropriate placement into developmental writing courses. While previous research (Albertini et al., 1986; Albertini et al., 1996; Bochner, Albertini, Samar, & Metz, 1992) has shown the test to be reliable between multiple test raters and a valid measure of writing ability for placement into these courses, changes in curriculum and the rater pool necessitated a new look at interrater reliability and concurrent validity. We evaluated the rating scores for 236 samples from students who entered the college in fall 2001. Using a multiprong approach, we confirmed the interrater reliability and the validity of this direct measure of assessment. The implications of continued use of this and similar tests in light of definitions of validity, local control, and the nature of writing are discussed.
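
A minimal sketch of the kind of interrater agreement check described here, using synthetic scores for two raters rather than NTID data:

```python
import numpy as np

# Synthetic interrater reliability check: correlation plus exact and
# adjacent agreement between two raters' placement scores.

rater1 = np.array([3, 4, 2, 5, 4, 3, 2, 4, 5, 3])
rater2 = np.array([3, 4, 3, 5, 4, 2, 2, 4, 4, 3])

r = np.corrcoef(rater1, rater2)[0, 1]
exact = np.mean(rater1 == rater2)
adjacent = np.mean(np.abs(rater1 - rater2) <= 1)
print(f"Pearson r = {r:.2f}, exact agreement = {exact:.0%}, within one point = {adjacent:.0%}")
```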

20.
Despite the increasing popularity of peer assessment in tertiary-level interpreter education, very little research has been conducted to examine the quality of peer ratings on language interpretation. While previous research on the quality of peer ratings, particularly rating accuracy, mainly relies on correlation and analysis of variance, latent trait modelling emerges as a useful approach to investigate rating accuracy in rater-mediated performance assessment. The present study demonstrates the use of multifaceted Rasch partial credit modelling to explore the accuracy of peer ratings on English-Chinese consecutive interpretation. The analysis shows that there was a relatively wide spread of rater accuracy estimates and that statistically significant differences were found between peer raters regarding rating accuracy. Additionally, it was easier for peer raters to assess some students accurately than others, to peer-assess target language quality accurately than the other rating domains, and to provide accurate ratings to English-to-Chinese interpretation than the other direction. Through these findings, latent trait modelling demonstrates its capability to produce individual-level indices, measure rater accuracy directly, and accommodate sparse data rating designs. It is therefore hoped that substantive inquiries into peer assessment of language interpretation could utilise latent trait modelling to move this line of research forward.
