Similar Articles
20 similar articles found.
1.
Content‐based automated scoring has been applied in a variety of science domains. However, many prior applications involved simplified scoring rubrics without considering rubrics representing multiple levels of understanding. This study tested a concept‐based scoring tool for content‐based scoring, c‐rater, for four science items with rubrics aiming to differentiate among multiple levels of understanding. The items showed moderate to good agreement with human scores. The findings suggest that automated scoring has the potential to score constructed‐response items with complex scoring rubrics, but in its current design cannot replace human raters. This article discusses sources of disagreement and factors that could potentially improve the accuracy of concept‐based automated scoring.

2.
A framework for evaluation and use of automated scoring of constructed‐response tasks is provided that entails both evaluation of automated scoring and guidelines for implementation and maintenance in the context of constantly evolving technologies. Validity issues and challenges associated with automated scoring are discussed within the framework. The fit between the scoring capability and the assessment purpose, the agreement between human and automated scores, the consideration of associations with independent measures, the generalizability of automated scores as implemented in operational practice across different tasks and test forms, and the impact and consequences for the population and subgroups are proffered as integral evidence supporting use of automated scoring. Specific evaluation guidelines are provided for using automated scoring to complement human scoring for tests used for high‐stakes purposes. These guidelines are intended to be generalizable to new automated scoring systems and as existing systems change over time.

3.
4.
This study examined the utility of response time‐based analyses in understanding the behavior of unmotivated test takers. Using data from an adaptive achievement test, patterns of observed rapid‐guessing behavior and item response accuracy were compared to the behavior expected under several types of models that have been proposed to represent unmotivated test taking behavior. Test taker behavior was found to be inconsistent with these models, with the exception of the effort‐moderated model. Effort‐moderated scoring was found both to yield scores that were more accurate than those found under traditional scoring and to exhibit improved person fit statistics. In addition, an effort‐guided adaptive test was proposed and shown by a simulation study to alleviate item difficulty mistargeting caused by unmotivated test taking.
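The effort-moderated idea described above can be sketched in a few lines: responses whose times fall below an item's rapid-guessing threshold are simply excluded when the score is computed. A minimal illustration in Python, with invented response times and an arbitrary 5-second threshold (the actual effort-moderated model in this literature is IRT-based, not a simple proportion correct):

```python
# Minimal sketch of effort-moderated scoring: rapid guesses (response
# time below an item's threshold) are excluded, so only solution-behavior
# responses count toward the score. All data here are illustrative.

def effort_moderated_score(correct, rt, thresholds):
    """Proportion correct over non-rapid responses only.

    correct    -- list of 0/1 item scores
    rt         -- list of response times in seconds
    thresholds -- list of per-item rapid-guessing time thresholds
    """
    engaged = [c for c, t, th in zip(correct, rt, thresholds) if t >= th]
    if not engaged:
        return None  # no motivated responses left to score
    return sum(engaged) / len(engaged)

correct = [1, 0, 1, 1, 0, 1]
rt = [25.0, 2.1, 18.4, 1.5, 30.2, 22.7]   # items 2 and 4 are rapid guesses
thresholds = [5.0] * 6

trad = sum(correct) / len(correct)                    # traditional: 4/6
em = effort_moderated_score(correct, rt, thresholds)  # engaged only: 3/4
print(trad, em)
```

Here the two rapid guesses (both wrong) drag the traditional score down to about 0.67, while the effort-moderated score over the four engaged responses is 0.75.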

5.
The scoring process is critical in the validation of tests that rely on constructed responses. Documenting that readers carry out the scoring in ways consistent with the construct and measurement goals is an important aspect of score validity. In this article, rater cognition is approached as a source of support for a validity argument for scores based on constructed responses, whether such scores are to be used on their own or as the basis for other scoring processes, for example, automated scoring.

6.
The rise of computer‐based testing has brought with it the capability to measure more aspects of a test event than simply the answers selected or constructed by the test taker. One behavior that has drawn much research interest is the time test takers spend responding to individual multiple‐choice items. In particular, very short response time—termed rapid guessing—has been shown to indicate disengaged test taking, regardless of whether it occurs in high‐stakes or low‐stakes testing contexts. This article examines rapid‐guessing behavior—its theoretical conceptualization and underlying assumptions, methods for identifying it, misconceptions regarding its dynamics, and the contextual requirements for its proper interpretation. It is argued that because it does not reflect what a test taker knows and can do, a rapid guess to an item represents a choice by the test taker to momentarily opt out of being measured. As a result, rapid guessing tends to negatively distort scores and thereby diminish validity. Therefore, because rapid guesses do not contribute to measurement, it makes little sense to include them in scoring.
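The identification methods mentioned above often reduce to per-item time thresholds. A minimal sketch, assuming one simple normative-threshold rule (a fixed fraction of each item's mean response time) and invented data; operational threshold-setting methods vary:

```python
# Minimal sketch of a threshold method for flagging rapid guesses:
# each item's threshold is a fixed fraction (here 10%) of that item's
# mean response time. The response-time matrix below is illustrative.

def normative_thresholds(rt_by_item, fraction=0.10):
    """Per-item rapid-guessing thresholds as a fraction of mean RT."""
    return [fraction * (sum(times) / len(times)) for times in rt_by_item]

def flag_rapid_guesses(rt_matrix, thresholds):
    """Return a matrix of booleans: True where a response is a rapid guess."""
    return [[t < th for t, th in zip(row, thresholds)] for row in rt_matrix]

# rows = test takers, columns = items (times in seconds)
rt_matrix = [
    [42.0, 3.0, 55.0],
    [38.0, 61.0, 2.0],
    [40.0, 56.0, 51.0],
]
rt_by_item = list(zip(*rt_matrix))
thresholds = normative_thresholds(rt_by_item)  # [4.0, 4.0, 3.6]
flags = flag_rapid_guesses(rt_matrix, thresholds)
print(flags)
```

Test taker 1 is flagged on item 2 and test taker 2 on item 3; test taker 3 shows no rapid guessing.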

7.
The engagement of teachers as raters to score constructed response items on assessments of student learning is widely claimed to be a valuable vehicle for professional development. This paper examines the evidence behind those claims from several sources, including research and reports over the past two decades, information from a dozen state educational agencies regarding past and ongoing involvement of teachers in scoring‐related activities as of 2001, and interviews with educators who served a decade or more ago for one state's innovative performance assessment program. That evidence reveals that the impact of scoring experience on teachers is more provisional and nuanced than has been suggested. The author identifies possible issues and implications associated with attempts to distill meaningful skills and knowledge from hand‐scoring training and practice, along with other forms of teacher involvement in assessment development and implementation. The paper concludes with a series of research questions that—based on current and proposed practice for the coming decade—seem to the author to require the most immediate attention.

8.
Many large‐scale assessments are designed to yield two or more scores for an individual by administering multiple sections measuring different but related skills. Multidimensional tests, or more specifically simple structured tests such as these, rely on multiple sections of multiple‐choice and/or constructed‐response items to generate multiple scores. In the current article, we propose an extension of the hierarchical rater model (HRM) to be applied to simple structured tests with constructed‐response items. In addition to modeling the appropriate trait structure, the multidimensional HRM (M‐HRM) presented here also accounts for rater severity bias and rater variability or inconsistency. We introduce the model formulation, test parameter recovery with a focus on latent traits, and compare the M‐HRM to other scoring approaches (unidimensional HRMs and a traditional multidimensional item response theory model) using simulated and empirical data. Results show more precise scores under the M‐HRM, with a major improvement in scores when incorporating rater effects versus ignoring them in the traditional multidimensional item response theory model.

9.
In signal detection rater models for constructed response (CR) scoring, it is assumed that raters discriminate equally well between different latent classes defined by the scoring rubric. An extended model that relaxes this assumption is introduced; the model recognizes that a rater may not discriminate equally well between some of the scoring classes. The extension recognizes a different type of rater effect and is shown to offer useful tests and diagnostic plots of the equal discrimination assumption, along with ways to assess rater accuracy and various rater effects. The approach is illustrated with an application to a large‐scale language test.

10.
During the last three decades the constructed‐response format has gradually gained entry into large‐scale assessments of reading comprehension. In its 1991 Reading Literacy Study, the International Association for the Evaluation of Educational Achievement (IEA) included constructed‐response items on an exploratory basis. Ten years later, in the Progress in International Reading Literacy Study (PIRLS) 2001, the constructed‐response format was ascribed special significance as a bearer of central insights into the definition of reading literacy. This article focuses on the significance of the scoring guides and the relation between these guides on the one hand and the text and the items on the other. This relation as realized in PIRLS 2001 is discussed, showing both successful and more problematic aspects of the operationalisation of the intentions expressed in the test's theoretical framework. Handling the problem of semantic openness is essential in representing depth of understanding and represents a field of possibilities for further research and development.

11.
This article argues that digital games and school‐based literacy practices have much more in common than is reported in the research literature. We describe the role digital game paratexts – ancillary print and multimodal texts about digital games – can play in connecting pupils’ gaming literacy practices to ‘traditional’ school‐based literacies still needed for academic success. By including the reading, writing and design of digital game paratexts in the literacy curriculum, teachers can actively and legitimately include digital games in their literacy instruction. To help teachers understand pupils’ gaming literacy practices in relation to other forms of literacy practices, we present a heuristic for understanding gaming (HUG) literacy. We argue our heuristic can be used for effective teacher professional development because it assists teachers in identifying the elements of gameplay that would be appropriate for the demands of the literacy curriculum. The heuristic traces gaming literacy across the quadrants of actions, designs, situations and systems to provide teachers and practitioners with a knowledge of gameplay and a metalanguage for talking about digital games. We argue this knowledge will assist them in capitalising on pupils’ existing gaming literacy by connecting their out‐of‐school gaming literacy practices to the literacy and English curriculum.

12.
Rater training is an important part of developing and conducting large‐scale constructed‐response assessments. As part of this process, candidate raters have to pass a certification test to confirm that they are able to score consistently and accurately before they begin scoring operationally. Moreover, many assessment programs require raters to pass a calibration test before every scoring shift. To support the high‐stakes decisions made on the basis of rater certification tests, a psychometric approach for their development, analysis, and use is proposed. The circumstances and uses of these tests suggest that they are expected to have relatively low reliability. This expectation is supported by empirical data. Implications for the development and use of these tests to ensure their quality are discussed.

13.
Over the past fifty years, a number of automated essay-scoring systems for English writing have been developed in China and abroad, and this line of research has matured considerably. In the field of translation, automated scoring research has been largely confined to machine translation evaluation, while automated scoring of human translations remains at an early stage. In recent years, an automated scoring model for Chinese-to-English translation by Chinese students has been built in China, and research on automated scoring of English-to-Chinese translation has also begun. Because Chinese students' English-to-Chinese translations have characteristics of their own, such a scoring system differs from existing work in variable mining, model validation, and other respects.

14.
In this digital ITEMS module, Dr. Sue Lottridge, Amy Burkhardt, and Dr. Michelle Boyer provide an overview of automated scoring. Automated scoring is the use of computer algorithms to score unconstrained open-ended test items by mimicking human scoring. The use of automated scoring is increasing in educational assessment programs because it allows scores to be returned faster and at lower cost. In the module, they discuss automated scoring from a number of perspectives. First, they discuss the benefits and weaknesses of automated scoring, and what psychometricians should know about it. Next, they describe the overall process of automated scoring, moving from data collection to engine training to operational scoring. Then, they describe how automated scoring systems work, including the basic functions around score prediction as well as other flagging methods. Finally, they conclude with a discussion of the specific validity demands around automated scoring and how they align with the larger validity demands around test scores. Two data activities are provided. The first is an interactive activity that allows the user to train and evaluate a simple automated scoring engine. The second is a worked example that examines the impact of rater error on test scores. The digital module contains a link to an interactive web application as well as its R-Shiny code, diagnostic quiz questions, activities, curated resources, and a glossary.
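The module's first activity, training and evaluating a simple scoring engine, can be sketched at toy scale: turn responses into bag-of-words vectors and assign a new response the score point whose training responses it most resembles. Everything below (the mini training set, the cosine-to-centroid rule) is an invented illustration, far simpler than any operational engine:

```python
# Toy automated scoring engine: bag-of-words vectors plus a
# nearest-centroid rule over score points. Illustrative only.
from collections import Counter
import math

def vectorize(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def train(responses, scores):
    """Build one aggregate bag-of-words 'centroid' per score point."""
    centroids = {}
    for text, score in zip(responses, scores):
        centroids.setdefault(score, Counter()).update(vectorize(text))
    return centroids

def predict(centroids, text):
    vec = vectorize(text)
    return max(centroids, key=lambda s: cosine(centroids[s], vec))

train_responses = [
    "plants use sunlight to make food through photosynthesis",
    "photosynthesis converts light energy into chemical energy",
    "plants grow in soil",
    "the plant is green",
]
train_scores = [2, 2, 1, 1]

engine = train(train_responses, train_scores)
print(predict(engine, "light energy is converted by photosynthesis"))  # 2
```

Evaluation in the module's sense would then compare such predictions with human scores on held-out responses, for example via exact-agreement rates or kappa.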

15.
'Mental models' used by automated scoring for the simulation divisions of the computerized Architect Registration Examination are contrasted with those used by experienced human graders. Candidate solutions (N = 3613) received both automated and human holistic scores. Quantitative analyses suggest high correspondence between automated and human scores, thereby suggesting that similar mental models are implemented. Solutions with discrepancies between automated and human scores were selected for qualitative analysis. The human graders were reconvened to review the human scores and to investigate the source of score discrepancies in light of rationales provided by the automated scoring process. After review, slightly more than half of the score discrepancies were reduced or eliminated. Six sources of discrepancy between original human scores and automated scores were identified: subjective criteria; objective criteria; tolerances/weighting; details; examinee task interpretation; and unjustified. The tendency of the human graders to be compelled by automated score rationales varied by the nature of the original score discrepancy. We determine that, while the automated scores are based on a mental model consistent with that of expert graders, there remain some important differences, both intentional and incidental, which distinguish between human and automated scoring. We conclude that automated scoring has the potential to enhance the validity evidence of scores in addition to improving efficiency.

16.
《教育实用测度》2013,26(4):413-432
With the increasing use of automated scoring systems in high-stakes testing, it has become essential that test developers assess the validity of the inferences based on scores produced by these systems. In this article, we attempt to place the issues associated with computer-automated scoring within the context of current validity theory. Although it is assumed that the criteria appropriate for evaluating the validity of score interpretations are the same for tests using automated scoring procedures as for other assessments, different aspects of the validity argument may require emphasis as a function of the scoring procedure. We begin the article with a taxonomy of automated scoring procedures. The presentation of this taxonomy provides a framework for discussing threats to validity that may take on increased importance for specific approaches to automated scoring. We then present a general discussion of the process by which test-based inferences are validated, followed by a discussion of the special issues that must be considered when scoring is done by computer.

17.
《教育实用测度》2013,26(3):281-299
The growing use of computers for test delivery, along with increased interest in performance assessments, has motivated test developers to develop automated systems for scoring complex constructed-response assessment formats. In this article, we add to the available information describing the performance of such automated scoring systems by reporting on generalizability analyses of expert ratings and computer-produced scores for a computer-delivered performance assessment of physicians' patient management skills. Two different automated scoring systems were examined. These automated systems produced scores that were approximately as generalizable as those produced by expert raters. Additional analyses also suggested that the traits assessed by the expert raters and the automated scoring systems were highly related (i.e., true correlations between test forms, across scoring methods, were approximately 1.0). In the appendix, we discuss methods for estimating this correlation, using ratings and scores produced by an automated system from a single test form.
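A true correlation of approximately 1.0, as reported above, is typically estimated by disattenuating the observed human-machine correlation for the unreliability of each score. A minimal sketch of that correction with invented reliability values (the estimation method in the paper's appendix may differ in detail):

```python
# Minimal sketch of disattenuation: an observed correlation between two
# scoring methods is corrected for the unreliability (generalizability)
# of each. The numbers below are illustrative, not from the study.
import math

def disattenuated_correlation(r_obs, rel_x, rel_y):
    """Correct an observed correlation for measurement error in both scores."""
    return r_obs / math.sqrt(rel_x * rel_y)

r_obs = 0.80          # observed correlation between rater and engine scores
rel_human = 0.78      # generalizability coefficient of expert ratings
rel_machine = 0.82    # generalizability coefficient of automated scores

print(round(disattenuated_correlation(r_obs, rel_human, rel_machine), 3))
```

With these illustrative inputs the corrected value lands at about 1.0, the pattern the abstract describes: the two methods appear to measure essentially the same trait once rater and engine unreliability are accounted for.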

18.
In the standard scoring procedure for multiple‐choice exams, students must choose exactly one response as correct. Often students may be unable to identify the correct response, but can determine that some of the options are incorrect. This partial knowledge is not captured in the standard scoring format. The Coombs elimination procedure is an alternate scoring procedure designed to capture partial knowledge. This paper presents the results of a semester‐long experiment where both scoring procedures were compared on four exams in an undergraduate macroeconomics course. Statistical analysis suggests that the Coombs procedure is a viable alternative to the standard scoring procedure. Implications for classroom instruction and future research are also presented.
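Under one common formulation of the Coombs procedure (assumed here; the paper's exact point values may differ), the examinee crosses out every option believed wrong, earns one point per incorrect option eliminated, and loses k - 1 points for eliminating the correct option:

```python
# Minimal sketch of Coombs elimination scoring for a k-option item,
# under one common point scheme: +1 per incorrect option eliminated,
# -(k - 1) if the correct option is eliminated.

def coombs_score(eliminated, correct, k=4):
    """Score one item from the set of options the examinee eliminated."""
    if correct in eliminated:
        return -(k - 1)
    return len(eliminated)  # every eliminated option is an incorrect one

# Item with options A-D, correct answer C
print(coombs_score({"A", "B", "D"}, "C"))  # full knowledge: 3
print(coombs_score({"A", "B"}, "C"))       # partial knowledge: 2
print(coombs_score({"C"}, "C"))            # misinformation: -3
print(coombs_score(set(), "C"))            # omission: 0
```

The intermediate outcomes (eliminating one or two wrong options) are exactly the partial knowledge that single-answer scoring cannot register.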

19.
Computer automated scoring (CAS) of translation tests in foreign-language courses of self-taught higher education examinations can effectively improve scoring efficiency and objectivity. In this study, automated and human scores for the translation test responses of 72 self-taught learners were compared using correlation analysis and paired-samples t-tests, and the diagnostic results of the two scoring methods were contrasted. The automated scores correlated highly with the human scores, total translation test scores did not differ significantly between the two methods, and the automated scores for this translation test were on the whole reliable; however, the two methods yielded somewhat different diagnoses of the structure of the learners' translation ability.

20.
Chen Yun, 《鸡西大学学报》 (Journal of Jixi University), 2012(10):102-104
With the development of modern educational technology, automated essay-scoring systems are increasingly used in the teaching of English writing. To examine their effectiveness, a 14-week comparative teaching experiment was conducted in two classes of non-English majors, and a questionnaire surveyed the students' views of the automated essay-scoring system. The results show that a writing-instruction model built around automated essay scoring promotes students' English writing ability more than the traditional model does. The study also found that students' writing strategies need improvement when the system is used, and that teachers' guidance must not be neglected.
