Similar Articles
20 similar articles found (search time: 539 ms)
1.
A framework for evaluation and use of automated scoring of constructed-response tasks is provided that entails both evaluation of automated scoring and guidelines for implementation and maintenance in the context of constantly evolving technologies. Validity issues and challenges associated with automated scoring are discussed within the framework. The fit between the scoring capability and the assessment purpose, the agreement between human and automated scores, the consideration of associations with independent measures, the generalizability of automated scores as implemented in operational practice across different tasks and test forms, and the impact and consequences for the population and subgroups are proffered as integral evidence supporting the use of automated scoring. Specific evaluation guidelines are provided for using automated scoring to complement human scoring on tests used for high-stakes purposes. These guidelines are intended to generalize to new automated scoring systems and to existing systems as they change over time.

2.
In this digital ITEMS module, Dr. Sue Lottridge, Amy Burkhardt, and Dr. Michelle Boyer provide an overview of automated scoring. Automated scoring is the use of computer algorithms to score unconstrained open-ended test items by mimicking human scoring. The use of automated scoring is increasing in educational assessment programs because it allows scores to be returned faster and at lower cost. In the module, they discuss automated scoring from a number of perspectives. First, they discuss the benefits and weaknesses of automated scoring and what psychometricians should know about it. Next, they describe the overall process of automated scoring, moving from data collection to engine training to operational scoring. Then, they describe how automated scoring systems work, including the basic score-prediction functions as well as other flagging methods. Finally, they conclude with a discussion of the specific validity demands around automated scoring and how they align with the larger validity demands around test scores. Two data activities are provided. The first is an interactive activity that allows the user to train and evaluate a simple automated scoring engine. The second is a worked example that examines the impact of rater error on test scores. The digital module contains a link to an interactive web application as well as its R-Shiny code, diagnostic quiz questions, activities, curated resources, and a glossary.
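The train-and-evaluate workflow described in the module can be approximated outside the interactive activity. The sketch below is not the module's R-Shiny engine; it trains a simple bag-of-words scoring model on human-scored responses and reports human-machine agreement with exact agreement and quadratic weighted kappa. The file name and column names are assumptions.

```python
# Minimal sketch of engine training and evaluation (not the module's R-Shiny engine).
# Assumes a CSV with hypothetical columns "response" (text) and "human_score" (integer).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

data = pd.read_csv("scored_responses.csv")  # hypothetical file
train_x, test_x, train_y, test_y = train_test_split(
    data["response"], data["human_score"], test_size=0.3, random_state=0
)

# Engine training: text features plus a classifier that mimics human scores.
engine = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2),
                       LogisticRegression(max_iter=1000))
engine.fit(train_x, train_y)

# Evaluation: compare machine scores with held-out human scores.
machine_scores = engine.predict(test_x)
print("Exact agreement:", accuracy_score(test_y, machine_scores))
print("Quadratic weighted kappa:",
      cohen_kappa_score(test_y, machine_scores, weights="quadratic"))
```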

3.
In low-stakes assessments, some students may not reach the end of the test and leave some items unanswered for various reasons (e.g., lack of test-taking motivation, poor time management, or test speededness). Not-reached items are often treated as incorrect or not-administered in the scoring process. However, when the proportion of not-reached items is high, these traditional approaches may yield biased scores and thereby threaten the validity of test results. In this study, we propose a polytomous scoring approach for handling not-reached items and compare its performance with that of the traditional scoring approaches. Real data from a low-stakes math assessment administered to second and third graders were used. The assessment consisted of 40 short-answer items focusing on addition and subtraction. The students were instructed to answer as many items as possible within 5 minutes. Under the traditional scoring approaches, students' responses to not-reached items were treated as either not-administered or incorrect. Under the proposed approach, students' nonmissing responses were scored polytomously based on how accurately and how rapidly they responded to the items, to reduce the impact of not-reached items on ability estimation. The traditional and polytomous scoring approaches were compared on several evaluation criteria, such as model fit indices, test information functions, and bias. The results indicated that the polytomous scoring approach outperformed the traditional approaches. A complete-case simulation corroborated the empirical findings: the scoring approach in which nonmissing items were scored polytomously and not-reached items were treated as not-administered performed best. Implications of the polytomous scoring approach for low-stakes assessments are discussed.
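A minimal sketch of this scoring logic is shown below. The two-second speed cutoff and the 0/1/2 category scheme are illustrative assumptions, not the authors' exact rule; not-reached items are coded as missing (not administered) rather than incorrect.

```python
# Sketch of polytomous scoring for a speeded low-stakes test (illustrative thresholds).
import numpy as np

FAST_SECONDS = 2.0  # assumed speed cutoff; the study's rule may differ

def polytomous_score(correct, response_time, reached):
    """Return 2 (correct and fast), 1 (correct but slow), 0 (incorrect),
    or np.nan for not-reached items, which are treated as not administered."""
    if not reached:
        return np.nan
    if not correct:
        return 0
    return 2 if response_time <= FAST_SECONDS else 1

# Example: a student who answered 3 of 5 items before time ran out.
correctness = [True, False, True, None, None]
times =       [1.4,  3.0,   2.6,  None, None]
reached =     [True, True,  True, False, False]

scores = [polytomous_score(c, t, r) for c, t, r in zip(correctness, times, reached)]
print(scores)  # [2, 0, 1, nan, nan]
```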

4.
5.
The engagement of teachers as raters to score constructed-response items on assessments of student learning is widely claimed to be a valuable vehicle for professional development. This paper examines the evidence behind those claims from several sources, including research and reports over the past two decades, information from a dozen state educational agencies regarding past and ongoing involvement of teachers in scoring-related activities as of 2001, and interviews with educators who served, a decade or more earlier, in one state's innovative performance assessment program. That evidence reveals that the impact of scoring experience on teachers is more provisional and nuanced than has been suggested. The author identifies possible issues and implications associated with attempts to distill meaningful skills and knowledge from hand-scoring training and practice, along with other forms of teacher involvement in assessment development and implementation. The paper concludes with a series of research questions that, based on current and proposed practice for the coming decade, seem to the author to require the most immediate attention.

6.
In this article, we systematize the factors influencing the performance and feasibility of automatic content scoring methods for short text responses. We argue that performance (i.e., how well an automatic system agrees with human judgments) mainly depends on the linguistic variance seen in the responses and that this variance is indirectly influenced by other factors such as target population or input modality. Extending previous work, we distinguish conceptual, realization, and nonconformity variance, which are differentially impacted by the various factors. While conceptual variance relates to the different concepts embedded in the text responses, realization variance refers to their diverse manifestation in natural language. Nonconformity variance is added by aberrant response behavior. Furthermore, besides its performance, the feasibility of using an automatic scoring system depends on external factors, such as ethical or computational constraints, which influence whether a system with a given performance is accepted by stakeholders. Our work provides (i) a framework for assessment practitioners to decide a priori whether automatic content scoring can be successfully applied in a given setup and (ii) new empirical findings, integrated with empirical findings from the literature, on factors that influence automatic systems' performance.

7.
This study investigates the role of automated scoring and feedback in supporting students' construction of written scientific arguments while learning about factors that affect climate change in the classroom. The automated scoring and feedback technology was integrated into an online module. Students produced written scientific arguments in response to structured argumentation prompts. After submitting their open-ended responses, students received scores generated by a scoring engine and written feedback associated with those scores in real time. Using log data that recorded argumentation scores as well as argument submission and revision activities, we answer three research questions: first, how students behaved after receiving the feedback; second, whether and how students' revisions improved their argumentation scores; and third, whether item difficulties shifted with the availability of the automated feedback. Results showed that the majority of students (77%) made revisions after receiving the feedback, and that students with higher initial scores were more likely to revise their responses. Students who revised had significantly higher final scores than those who did not, and each revision was associated with an average increase of 0.55 in the final score. Analysis of item difficulty shifts showed that written scientific argumentation became easier after students used the automated feedback.
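As an illustration of how a per-revision gain such as the 0.55 reported above could be estimated from log data, the sketch below regresses final argumentation scores on the number of revisions. The data frame and its column names are hypothetical, not the study's log schema.

```python
# Illustrative estimate of the average score gain per revision from log data.
# The column names ("final_score", "n_revisions") and values are hypothetical.
import pandas as pd
import statsmodels.api as sm

logs = pd.DataFrame({
    "n_revisions": [0, 1, 1, 2, 3, 0, 2, 4],
    "final_score": [2.0, 2.5, 3.0, 3.5, 4.0, 1.5, 3.0, 4.5],
})

X = sm.add_constant(logs["n_revisions"])
model = sm.OLS(logs["final_score"], X).fit()

# The slope on n_revisions plays the role of the average gain per revision.
print(model.params["n_revisions"])
```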

8.
Internationally, many assessment systems rely predominantly on human raters to score examinations. Arguably, this facilitates the assessment of multiple sophisticated educational constructs, strengthening assessment validity. However, it can introduce subjectivity into the scoring process, engendering threats to accuracy. The present objectives are to examine some key qualitative data collection methods used internationally to research this potential trade-off, and to consider some theoretical contexts within which the methods are usable. Self-report methods such as Kelly's Repertory Grid, think-aloud protocols, stimulated recall, and the NASA Task Load Index have yielded important insights into the competencies needed for scoring expertise, as well as the sequences of mental activity that scoring typically involves. Examples of new data and of recent studies are used to illustrate these methods' strengths and weaknesses. This investigation has significance for assessment designers, developers, and administrators. It may inform decisions on the methods' applicability in American and other rater cognition research contexts.

9.
An assumption fundamental to the scoring of student-constructed responses (e.g., essays) is that raters can focus on the response characteristics of interest rather than on other features. A common example, and the focus of this study, is the ability of raters to score a response based on the content achievement it demonstrates, independent of the quality of the writing in which it is expressed. Previously scored responses from a large-scale assessment in which trained scorers rated exclusively constructed-response formats were altered to enhance or degrade the quality of the writing, and the scores resulting from the altered responses were compared with the original scores. Statistically significant differences in favor of the better-writing condition were found in all six content areas. However, the effect sizes were very small for mathematics, reading, science, and social studies items, and relatively large for items in writing and language usage (mechanics). It was concluded from the last two content areas that the manipulation was successful, and from the first four that trained scorers are reasonably able to differentiate writing quality from other achievement constructs when rating student responses.
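A paired design like this, in which the same responses are scored in original and altered form, lends itself to a paired t test and a standardized effect size on the score differences. The sketch below shows one way to compute both; the score vectors are illustrative, not the study's data.

```python
# Paired t test and a paired-samples effect size (mean difference / SD of differences).
# The score vectors are illustrative, not the study's data.
import numpy as np
from scipy.stats import ttest_rel

original_scores = np.array([3, 2, 4, 3, 3, 2, 4, 3], dtype=float)
enhanced_scores = np.array([3, 3, 4, 3, 4, 2, 4, 3], dtype=float)  # better-writing condition

t_stat, p_value = ttest_rel(enhanced_scores, original_scores)
diff = enhanced_scores - original_scores
effect_size = diff.mean() / diff.std(ddof=1)  # Cohen's d for paired scores

print(f"t = {t_stat:.2f}, p = {p_value:.3f}, d = {effect_size:.2f}")
```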

10.
Automated essay scoring is a developing technology that can provide efficient scoring of large numbers of written responses. Its use in higher education admissions testing provides an opportunity to collect validity and fairness evidence to support current uses and inform its emergence in other areas such as K–12 large-scale assessment. In this study, human and automated scores on essays written by college students with and without learning disabilities and/or attention deficit hyperactivity disorder were compared, using a nationwide (U.S.) sample of prospective graduate students taking the revised Graduate Record Examination. The findings are that, on average, human raters and the automated scoring engine assigned similar essay scores for all groups, despite average differences among groups with respect to essay length and spelling errors.

11.
Scoring is a key factor affecting the reliability and validity of speaking tests. Scoring methods for speaking tests fall into two broad categories: subjective scoring and objective or semi-objective scoring. The former mainly includes holistic rating and analytic rating; the latter mainly includes machine scoring, scoring on discrete objective indicators, and dichotomous (0/1) scoring. This paper reviews and summarizes these scoring methods and identifies the strengths and weaknesses of each. It also discusses the relationship between scoring methods and the definition of speaking ability, the choice of scoring method, and the relationship between scoring and test validity.

12.
The purpose of this study was to examine how different scoring procedures affect the interpretation of maze curriculum-based measurements. Fall and spring data were collected from 199 students receiving supplemental reading instruction. Maze probes were scored first by counting all correct maze choices, and then with four scoring variations designed to reduce the effect of random guessing. Pearson's r correlation coefficients were calculated among the scoring procedures and between maze scores and a standardized measure of reading. In addition, t tests were conducted to compare fall-to-spring growth for each scoring procedure. Results indicated that scores derived from the different procedures are highly correlated, demonstrate criterion-related validity, and show fall-to-spring growth. Educators working with struggling readers may use any of the five scoring procedures to obtain technically sound scores.
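The analyses described here, correlations among scoring procedures and fall-to-spring comparisons, can be sketched as follows. The file name, the column layout, and the procedure labels are assumptions; the study does not name its four guessing-adjusted variations.

```python
# Sketch of the maze analyses: correlations among procedures and fall-to-spring growth.
# File name, column names, and procedure labels are hypothetical.
import pandas as pd
from scipy.stats import ttest_rel

maze = pd.read_csv("maze_scores.csv")  # hypothetical file: one row per student

procedures = ["all_correct", "variation_1", "variation_2",
              "variation_3", "variation_4"]  # assumed labels for the five procedures

# Pearson correlations among the five scoring procedures at fall.
fall_columns = [f"{p}_fall" for p in procedures]
print(maze[fall_columns].corr(method="pearson"))

# Paired t test of fall-to-spring growth for each scoring procedure.
for p in procedures:
    t_stat, p_value = ttest_rel(maze[f"{p}_spring"], maze[f"{p}_fall"])
    print(p, round(t_stat, 2), round(p_value, 4))
```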

13.
Machine learning has frequently been employed to automatically score constructed-response assessments. However, there is little evidence of how this predictive scoring approach might be compromised by construct-irrelevant variance (CIV), which is a threat to test validity. In this study, we evaluated machine scores and human scores with regard to potential CIV. We developed two assessment tasks targeting science teachers' pedagogical content knowledge (PCK); each task contains three video-based constructed-response questions. A total of 187 in-service science teachers watched the videos, each presenting a given classroom teaching scenario, and then responded to the constructed-response items. Three human experts rated the responses, and the human consensus scores were used to develop machine learning algorithms to predict ratings of the responses. Treating the machine as another independent rater alongside the three human raters, we employed the many-facet Rasch measurement model to examine CIV due to three sources: variability of scenarios, rater severity, and rater sensitivity to the scenarios. Results indicate that variability of scenarios affects teachers' performance, but the impact depends significantly on the construct of interest. For each assessment task, the machine is always the most severe rater compared with the three human raters. However, the machine is less sensitive than the human raters to the task scenarios, meaning that machine scoring is more consistent and stable across scenarios within each of the two tasks.
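For reference, one common rating-scale formulation of the many-facet Rasch measurement model, with facets for teacher ability, scenario difficulty, and rater severity, is sketched below; the study's exact parameterization may differ.

```latex
% One common MFRM formulation (sketch; the study's parameterization may differ)
\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - \delta_i - \alpha_j - \tau_k
% \theta_n : ability of teacher n
% \delta_i : difficulty of scenario/item i
% \alpha_j : severity of rater j (the three humans and the machine)
% \tau_k  : threshold for score category k relative to category k-1
```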

14.
Student responses to a large number of constructed-response items in three Math and three Reading tests were scored on two occasions using three ways of assigning raters: single-reader scoring, a different reader for each response (item-specific scoring), and three readers each scoring a rater item block (RIB) containing approximately one-third of a student's responses. Multiple-group confirmatory factor analyses indicated that the three types of total scores were most frequently tau-equivalent. Factor models fitted to the item responses attributed differences in scores to correlated ratings incurred when the same reader scored multiple responses. These halo effects contributed to significantly higher single-reader mean total scores for three of the tests. The similarity of scores under item-specific and RIB scoring suggests that the effect of rater bias on an examinee's set of responses may be minimized by using multiple readers, even if fewer readers than items are used.

15.
Scoring rubrics are critical in writing assessment, and different scoring methods affect raters' scoring behavior. Research shows that although both the holistic and the analytic methods of scoring English writing are reliable, rater severity and examinees' writing scores vary considerably between the two. Overall, under holistic scoring, rater severity tends to converge and approaches the ideal value; under analytic scoring, examinees receive higher writing scores, but rater severity also differs significantly across raters. Consequently, holistic scoring is preferred for high-stakes examinations that shape examinees' futures.

16.
Drawing from multiple theoretical frameworks in cognitive and educational psychology, we present a writing task and scoring system for measuring students' informative writing. Participants in this study were 72 fifth- and sixth-grade students who wrote compositions describing real-world problems and how mathematics, science, and social studies information could be used to solve those problems. Of the 72 students, 69 were able to craft a cohesive response that demonstrated not only planning in writing structure but also elaboration of relevant knowledge in one or more domains. Many-facet Rasch modeling (MFRM) techniques were used to examine the reliability and validity of scores from the writing rating scale. Additionally, a comparison of fifth- and sixth-grade responses supported the validity of the scores, as did the results of a correlational analysis with scores from an overall interest measure. Recommendations for improving writing scoring systems based on the findings of this investigation are provided.

17.
This case study was conducted within the context of a place-based education project implemented with primary school students in the USA. The authors and participating teachers created a performance assessment of standards-aligned tasks to examine 6–10-year-old students' graph interpretation skills as part of an exploratory research project. Fifty-five students participated in a performance assessment interview at the beginning and end of a place-based investigation. Two forms of the assessment were created and counterbalanced within class at pretest and posttest. In situ scoring was conducted such that responses were scored as correct or incorrect during the assessment's administration. Criterion validity analysis demonstrated an age-level progression in student scores. Tests of discriminant validity showed that the instrument detected variability in interpretation skills across each of three graph types (line, bar, and dot plot). Convergent validity was established by correlating in situ scores with those from the Graph Interpretation Scoring Rubric. Students' proficiency with interpreting different types of graphs matched expectations based on age and the standards-based progression of graphs across primary school grades. The assessment tasks were also effective at detecting pre-post gains in students' interpretation of line graphs and dot plots after the place-based project. The results of the case study are discussed in relation to the common challenges associated with performance assessment. Implications are presented in relation to the need for authentic, performance-based instructional and assessment tasks that respond to the Common Core State Standards and the Next Generation Science Standards.

18.
Cindy L. James, Assessing Writing, 2006, 11(3): 167–178
How do scores from writing samples generated by computerized essay scorers compare to those generated by "untrained" human scorers, and what combination of scores, if any, is more accurate at placing students in composition courses? This study endeavored to answer this two-part question by evaluating the correspondence between writing sample scores generated by the IntelliMetric™ automated scoring system and scores generated by University Preparation English faculty, as well as by examining the predictive validity of both the automated and human scores. The results revealed significant correlations between the faculty scores and the IntelliMetric™ scores on the ACCUPLACER OnLine WritePlacer Plus test. Moreover, logistic regression models that utilized both the IntelliMetric™ scores and the average faculty scores were more accurate at placing students (77% overall correct placement rate) than models incorporating only the average faculty score or only the IntelliMetric™ scores.
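The placement comparison described above can be sketched as fitting logistic regression models that use the machine score, the average faculty score, or both, and comparing their placement accuracy. The sketch below is not the study's actual model; the file name, column names, and outcome coding are hypothetical.

```python
# Sketch comparing placement accuracy of logistic regression models that use
# the automated score, the average faculty score, or both. Data are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

placements = pd.read_csv("placement_data.csv")   # hypothetical file
y = placements["placed_in_higher_course"]        # hypothetical 0/1 placement outcome

feature_sets = {
    "machine only": ["machine_score"],
    "faculty only": ["avg_faculty_score"],
    "machine + faculty": ["machine_score", "avg_faculty_score"],
}

for label, cols in feature_sets.items():
    model = LogisticRegression(max_iter=1000)
    accuracy = cross_val_score(model, placements[cols], y, cv=5, scoring="accuracy").mean()
    print(f"{label}: {accuracy:.2%} correct placement")
```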

19.
This study tested three scoring keys for the MTAI: the published empirical key (E); a logical key with three weights (+1, 0, -1); and a new logical key with five scoring weights, one for each of the five response options in MTAI items (+2, +1, 0, -1, -2). The latter, termed the pentachotomous-logical (P-L) key, provided scores with slightly higher internal consistency and a frequency distribution that is less skewed than the others and provides greater spread among extreme positive and negative scores. Use of the P-L scoring weights would facilitate the psychological interpretation of the MTAI. However, the conclusion that the P-L scoring key is an improvement must be tempered by the fact that an expected increase in construct validity was not found.
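A pentachotomous key of this kind amounts to mapping each item's five response options onto the weights +2 through -2, reversing the direction for negatively keyed items, and summing across items. A minimal sketch with hypothetical item keys (not the MTAI's published key) follows.

```python
# Sketch of applying a pentachotomous (+2, +1, 0, -1, -2) scoring key.
# The response labels and item key directions below are hypothetical.

WEIGHTS = {"SA": 2, "A": 1, "U": 0, "D": -1, "SD": -2}  # strongly agree ... strongly disagree

def score_item(response, positively_keyed=True):
    """Weight one response; reverse the sign for negatively keyed items."""
    weight = WEIGHTS[response]
    return weight if positively_keyed else -weight

def total_score(responses, key_directions):
    """Sum weighted responses across items."""
    return sum(score_item(r, k) for r, k in zip(responses, key_directions))

# Example: four items, the last two negatively keyed (hypothetical).
responses = ["SA", "A", "D", "SD"]
key_directions = [True, True, False, False]
print(total_score(responses, key_directions))  # 2 + 1 + 1 + 2 = 6
```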

20.