Similar Documents
 20 similar documents found
1.
ABSTRACT

This study investigates the role of automated scoring and feedback in supporting students’ construction of written scientific arguments while learning about factors that affect climate change in the classroom. The automated scoring and feedback technology was integrated into an online module. Students’ written scientific argumentation occurred when they responded to structured argumentation prompts. After submitting the open-ended responses, students received scores generated by a scoring engine and written feedback associated with the scores in real time. Using the log data that recorded argumentation scores as well as argument submission and revision activities, we answer three research questions: first, how students behaved after receiving the feedback; second, whether and how students’ revisions improved their argumentation scores; and third, whether item difficulties shifted with the availability of the automated feedback. Results showed that the majority of students (77%) made revisions after receiving the feedback, and students with higher initial scores were more likely to revise their responses. Students who revised had significantly higher final scores than those who did not, and each revision was associated with an average increase of 0.55 in the final score. Analysis of item difficulty shifts showed that written scientific argumentation became easier after students used the automated feedback.
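A minimal sketch (not the authors' analysis) of how a per-revision score gain of the kind reported above could be estimated from submission logs; the column names, data, and regression form are illustrative assumptions only.

```python
# Illustrative sketch: estimating the average score gain per revision from
# hypothetical submission-log data. Column names and model form are assumed.
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical log: one row per student, with the initial score, the number
# of revisions made after feedback, and the final argumentation score.
log = pd.DataFrame({
    "initial_score": [2, 3, 1, 4, 2, 3],
    "n_revisions":   [1, 2, 0, 3, 1, 0],
    "final_score":   [3, 4, 1, 6, 3, 3],
})

# Regress the final score on revision count, controlling for the initial
# score; the n_revisions coefficient plays the role of the gain per revision.
model = smf.ols("final_score ~ n_revisions + initial_score", data=log).fit()
print(model.params["n_revisions"])
```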

2.
ABSTRACT

In the current study, two pools of 250 essays, all written in response to the same prompt, were rated by two groups of raters (14 or 15 raters per group), thereby providing an approximation to each essay’s true score. An automated essay scoring (AES) system was trained on the datasets and then scored the essays using a cross-validation scheme. By eliminating one, two, or three raters at a time and estimating the true scores from the remaining raters, an independent criterion was produced against which to judge the validity of the human raters and of the AES system, as well as the interrater reliability. The results of the study indicated that the automated scores correlate with human scores to the same degree as human raters correlate with each other. However, the findings regarding the validity of the ratings support a claim that the reliability and validity of AES diverge: although the AES scoring is, naturally, more consistent than the human ratings, it is less valid.
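A hedged sketch of the general rater-elimination idea described above: hold out a rater, approximate each essay's true score from the remaining raters, and compare how well the held-out human rater and an AES score recover that criterion. The data here are simulated; this is not the study's code or dataset.

```python
# Illustrative sketch: leave-one-rater-out "true score" criterion and
# validity correlations for a held-out human rater versus an AES score.
import numpy as np

rng = np.random.default_rng(0)
n_essays, n_raters = 250, 15
human = rng.normal(3.0, 1.0, size=(n_essays, n_raters))       # simulated ratings
aes = human.mean(axis=1) + rng.normal(0, 0.3, size=n_essays)   # simulated AES scores

held_out = 0
criterion = np.delete(human, held_out, axis=1).mean(axis=1)    # remaining raters

r_human = np.corrcoef(human[:, held_out], criterion)[0, 1]
r_aes = np.corrcoef(aes, criterion)[0, 1]
print(f"held-out rater vs criterion: {r_human:.2f}, AES vs criterion: {r_aes:.2f}")
```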

3.
Cindy L. James. Assessing Writing, 2006, 11(3): 167–178
How do scores from writing samples generated by computerized essay scorers compare to those generated by “untrained” human scorers, and what combination of scores, if any, is more accurate at placing students in composition courses? This study endeavored to answer this two-part question by evaluating the correspondence between writing sample scores generated by the IntelliMetric™ automated scoring system and scores generated by University Preparation English faculty, as well as examining the predictive validity of both the automated and human scores. The results revealed significant correlations between the faculty scores and the IntelliMetric™ scores of the ACCUPLACER OnLine WritePlacer Plus test. Moreover, logistic regression models that utilized both the IntelliMetric™ scores and the average faculty scores were more accurate at placing students (77% overall correct placement rate) than were models incorporating only the average faculty score or only the IntelliMetric™ scores.
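A minimal sketch of a placement model of the kind described above, combining an automated score with an average faculty score to predict a binary placement decision. The features, data, and threshold are hypothetical; this is not the study's actual model.

```python
# Illustrative sketch: logistic regression placement model using an
# automated score plus an average faculty score. Data are made up.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: [automated score, average faculty score] per student,
# with a binary placement outcome (1 = place into the standard course).
X = np.array([[6, 3.5], [4, 2.0], [8, 4.5], [5, 2.5], [7, 4.0], [3, 1.5]])
y = np.array([1, 0, 1, 0, 1, 0])

model = LogisticRegression().fit(X, y)
accuracy = model.score(X, y)   # proportion of correct placements
print(accuracy)
```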

4.
Abstract

Students in a college course were given written criteria, divided into teams, and asked to score their own essay examination. Their pooled ratings correlated .922 with the instructor's ratings. Agreement between the ratings of students and instructor was not related to grade point or total test score. However, grade point and test scores were related negatively to the ambiguity of the students’ answers on the examination. The results support the generalization that subjective scoring standards are readily communicable. Theoretical and practical implications are examined.

5.
Assessing Writing, 2008, 13(2): 80–92
The scoring of student essays by computer has generated much debate and subsequent research. The majority of the research thus far has focused on validating the automated scoring tools by comparing the electronic scores to human scores of writing or other measures of writing skills, and on exploring the predictive validity of the automated scores. However, very little research has investigated possible effects of the essay prompts. This study endeavoured to do so by exploring test scores for three different prompts for the ACCUPLACER® WritePlacer® Plus test, which is scored by the IntelliMetric® automated scoring system. The results indicated that there was no significant difference among the prompts overall, among males, between males and females, by native language, or in comparison to scores generated by human raters. However, there was a significant difference in mean scores by topic for females.
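A small illustration of the kind of prompt-effect comparison described above, using a one-way ANOVA across prompts. The abstract does not specify the statistical test used, and the data below are invented for demonstration.

```python
# Illustrative sketch: testing for a prompt effect on essay scores with a
# one-way ANOVA. Scores are hypothetical.
from scipy import stats

prompt_a = [7, 6, 8, 7, 6]
prompt_b = [6, 7, 7, 8, 6]
prompt_c = [8, 7, 6, 7, 7]

f_stat, p_value = stats.f_oneway(prompt_a, prompt_b, prompt_c)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
```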

6.
This study investigated the relationship between middle school students’ scores for a written assignment (N = 162) and a process that involved students in generating criteria and self‐assessing with a rubric. Gender, time spent writing, grade level, prior rubric use, and previous achievement in English were also examined. The treatment involved using a model essay to scaffold the process of generating a list of criteria for an effective essay, reviewing a written rubric, and using the rubric to self‐assess first drafts. The comparison condition involved generating a list of criteria and reviewing first drafts. Findings include a main effect of treatment, gender, grade level, writing time, and previous achievement on total essay scores, as well as main effects on scores for every criterion on the scoring rubric. The results suggested that reading a model, generating criteria, and using a rubric to self‐assess can help middle school students produce more effective writing.

7.
8.
ABSTRACT

As an alternative to rubric scoring, comparative judgment generates essay scores by aggregating decisions about the relative quality of the essays. Comparative judgment eliminates certain scorer biases and potentially reduces training requirements, thereby allowing a large number of judges, including teachers, to participate in essay evaluation. The purpose of this study was to assess the validity, labor costs, and efficiency of comparative judgment as a potential substitute for rubric scoring. An analysis of two essay prompts revealed that comparative judgment measures agreed with rubric scores at a level similar to that expected of two professional scorers. The comparative judgment measures also correlated slightly more strongly than rubric scores with a multiple-choice writing test. Score reliability exceeding .80 was achieved with approximately nine judgments per response. The average judgment time was 94 seconds, which compared favorably to 119 seconds per rubric score. Practical challenges to future implementation are discussed.
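One common way to aggregate pairwise "which essay is better" judgments into scale values is a Bradley-Terry model. The abstract does not name the specific model used in the study, so the sketch below is a generic illustration with invented judgment counts.

```python
# Illustrative sketch: Bradley-Terry strengths from pairwise judgments,
# fitted with the standard MM (Zermelo) iteration. Data are made up.
import numpy as np

n_essays = 4
# wins[i, j] = number of judgments preferring essay i over essay j
wins = np.array([[0, 3, 2, 4],
                 [1, 0, 2, 3],
                 [2, 2, 0, 3],
                 [0, 1, 1, 0]], dtype=float)

strength = np.ones(n_essays)
for _ in range(100):                       # simple MM iterations
    for i in range(n_essays):
        num = wins[i].sum()                # total wins for essay i
        denom = sum((wins[i, j] + wins[j, i]) / (strength[i] + strength[j])
                    for j in range(n_essays) if j != i)
        strength[i] = num / denom
    strength /= strength.sum()             # fix the scale

print(np.log(strength))                    # logit-scale essay measures
```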

9.
This article presents a novel method, the Complex Dynamics Essay Scorer (CDES), for automated essay scoring using complex network features. Texts produced by college students in China were represented as scale‐free networks (e.g., a word adjacency model) from which typical network features, such as in‐/out‐degrees, the clustering coefficient (CC), and dynamic network measures, were obtained. The CDES integrates the classical concepts of network feature representation and essay score series variation. Several experiments indicated that the network features distinguish essays of different quality and that complex networks can be developed for autoscoring tasks. The average agreement between the CDES and human rater scores was 86.5%, and the average Pearson correlation was .77. The results indicate that the CDES produced functional complex systems and autoscored Chinese essays in a manner consistent with human raters. Our research suggests potential applications in other areas of educational assessment.
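A minimal sketch of building a word-adjacency network from a text and extracting features of the kind named above (in-/out-degree, clustering coefficient). The whitespace tokenization and the toy English sentence are simplifications; the CDES itself and its handling of Chinese text are not reproduced here.

```python
# Illustrative sketch: word-adjacency network features with networkx.
import networkx as nx

text = "the model builds a network and the network encodes the text"
tokens = text.split()

g = nx.DiGraph()
for a, b in zip(tokens, tokens[1:]):   # one directed edge per adjacent word pair
    g.add_edge(a, b)

in_degrees = dict(g.in_degree())
out_degrees = dict(g.out_degree())
clustering = nx.average_clustering(g.to_undirected())
print(in_degrees, out_degrees, clustering)
```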

10.
11.
A framework for the evaluation and use of automated scoring of constructed‐response tasks is provided that entails both evaluation of automated scoring and guidelines for implementation and maintenance in the context of constantly evolving technologies. Validity issues and challenges associated with automated scoring are discussed within the framework. The fit between the scoring capability and the assessment purpose, the agreement between human and automated scores, the consideration of associations with independent measures, the generalizability of automated scores as implemented in operational practice across different tasks and test forms, and the impact and consequences for the population and subgroups are proffered as integral evidence supporting use of automated scoring. Specific evaluation guidelines are provided for using automated scoring to complement human scoring for tests used for high‐stakes purposes. These guidelines are intended to generalize to new automated scoring systems and to existing systems as they change over time.
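The human-automated agreement evidence mentioned above is often summarized with a quadratically weighted kappa, among other statistics; the framework itself does not prescribe a specific statistic, so the following is only an illustration with made-up score vectors.

```python
# Illustrative sketch: quadratically weighted kappa between human and
# engine scores as one agreement statistic. Scores are hypothetical.
from sklearn.metrics import cohen_kappa_score

human_scores = [3, 2, 4, 3, 1, 4, 2, 3]
engine_scores = [3, 2, 3, 3, 2, 4, 2, 4]

qwk = cohen_kappa_score(human_scores, engine_scores, weights="quadratic")
print(f"quadratic weighted kappa = {qwk:.2f}")
```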

12.
'Mental models' used by automated scoring for the simulation divisions of the computerized Architect Registration Examination are contrasted with those used by experienced human graders. Candidate solutions (N = 3613) received both automated and human holistic scores. Quantitative analyses suggest high correspondence between automated and human scores, implying that similar mental models are implemented. Solutions with discrepancies between automated and human scores were selected for qualitative analysis. The human graders were reconvened to review the human scores and to investigate the source of score discrepancies in light of rationales provided by the automated scoring process. After review, slightly more than half of the score discrepancies were reduced or eliminated. Six sources of discrepancy between original human scores and automated scores were identified: subjective criteria, objective criteria, tolerances/weighting, details, examinee task interpretation, and unjustified discrepancies. The tendency of the human graders to be persuaded by automated score rationales varied with the nature of the original score discrepancy. We determine that, while the automated scores are based on a mental model consistent with that of expert graders, there remain some important differences, both intentional and incidental, which distinguish human from automated scoring. We conclude that automated scoring has the potential to enhance the validity evidence of scores in addition to improving efficiency.

13.
By far, the most frequently used method of validating (the interpretation and use of) automated essay scores has been to compare them with scores awarded by human raters. Although this practice is questionable, human-machine agreement is still often regarded as the “gold standard.” Our objective was to refine this model and apply it to data from a major testing program and one system of automated essay scoring. The refinement capitalizes on the fact that essay raters differ in numerous ways (e.g., training and experience), any of which may affect the quality of ratings. We found that automated scores exhibited different correlations with scores awarded by experienced raters (a more compelling criterion) than with those awarded by untrained raters (a less compelling criterion). The results suggest potential for a refined machine-human agreement model that differentiates raters with respect to experience, expertise, and possibly even more salient characteristics.

14.
Scientific argumentation is one of the core practices for teachers to implement in science classrooms. We developed a computer-based formative assessment to support students’ construction and revision of scientific arguments. The assessment is built upon automated scoring of students’ arguments and provides feedback to students and teachers. Preliminary validity evidence was collected in this study to support the use of automated scoring in this formative assessment. The results showed satisfactory psychometric properties for the formative assessment. The automated scores showed satisfactory agreement with human scores, although small discrepancies remained. Automated scores and feedback encouraged students to revise their answers, and students’ scientific argumentation skills improved during the revision process. These findings provide preliminary evidence supporting the use of automated scoring in the formative assessment to diagnose and enhance students’ argumentation skills in the context of climate change in secondary school science classrooms.

15.
In this digital ITEMS module, Dr. Sue Lottridge, Amy Burkhardt, and Dr. Michelle Boyer provide an overview of automated scoring. Automated scoring is the use of computer algorithms to score unconstrained open-ended test items by mimicking human scoring. The use of automated scoring is increasing in educational assessment programs because it allows scores to be returned faster at lower cost. In the module, they discuss automated scoring from a number of perspectives. First, they discuss benefits and weaknesses of automated scoring, and what psychometricians should know about automated scoring. Next, they describe the overall process of automated scoring, moving from data collection to engine training to operational scoring. Then, they describe how automated scoring systems work, including the basic functions around score prediction as well as other flagging methods. Finally, they conclude with a discussion of the specific validity demands around automated scoring and how they align with the larger validity demands around test scores. Two data activities are provided. The first is an interactive activity that allows the user to train and evaluate a simple automated scoring engine. The second is a worked example that examines the impact of rater error on test scores. The digital module contains a link to an interactive web application as well as its R-Shiny code, diagnostic quiz questions, activities, curated resources, and a glossary.
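A toy sketch of the "engine training" step described in the module: regress human scores on bag-of-words features and predict a score for a new response. This is not the module's interactive engine or its R-Shiny code; the responses, scores, and model choice below are assumptions for illustration.

```python
# Illustrative sketch: a minimal scoring engine trained on human-scored
# responses (TF-IDF features + ridge regression). Data are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

responses = [
    "the greenhouse effect traps heat in the atmosphere",
    "heat is trapped by gases so the planet warms",
    "the planet is big",
    "climate changes because greenhouse gases trap outgoing heat",
]
human_scores = [3, 3, 1, 4]

engine = make_pipeline(TfidfVectorizer(), Ridge(alpha=1.0))
engine.fit(responses, human_scores)                      # "engine training"
print(engine.predict(["gases trap heat and warm the planet"]))
```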

16.
This study investigated the use of automated essay scoring (AES) to identify at-risk students enrolled in a first-year university writing course. An application of AES, the Criterion® Online Writing Evaluation Service was evaluated through a methodology focusing on construct modelling, response processes, disaggregation, extrapolation, generalization, and consequence. Based on the results of our two-year study with students (N = 1,482) at a public technological research university in the United States, we found that Criterion offered a defined writing construct congruent with established models, achieved acceptance among students and instructors, showed no statistically significant differences between ethnicity groups of sufficient sample size, correlated at acceptable levels with other writing measures, performed in a stable fashion, and enabled instructors to identify at-risk students to increase their course success.

17.
The technical strengths of automated essay scoring systems provide a good platform for innovative reform of English writing instruction. This study designed and put into practice an English writing instruction model based on an automated essay scoring system, covering the design of a pre-writing stage, a first-draft and peer-review stage, a revision and automated-scoring stage, and an in-class commentary and final-draft stage. A one-year writing instruction experiment showed that the new model prompts students to write and to maintain a regular writing frequency, stimulates their interest in writing, cultivates autonomous writing ability, and improves their English writing proficiency.

18.

This study examines the internal consistency of Novak and Gowin's scoring scheme and its effect on the predictive validity of concept mapping as an alternative science classroom achievement assessment. Data were collected in three typical situations: very limited concept mapping experience with free‐style concept mapping; some concept mapping experience with questions provided; and extensive concept mapping experience with a list of concepts provided. It was found that Novak's scoring scheme was not internally consistent, and therefore there was generally no significant correlation between students’ scores on concept mapping and their scores on conventional classroom achievement assessments. The need for a new scoring scheme when concept mapping is used as an alternative science assessment is discussed.

19.
Abstract

This study was designed to ascertain the prevalence of written output deficits in young gifted children, to delineate the relationship between written output performance and reading performance, and to identify possible mechanisms for specific written output deficits in such children. Data from a sample of children scoring >120 on at least one IQ or achievement subscale indicated: (1) that there was a significant incidence of discrepancies between written spelling scores and reading (decoding) scores, as compared to the population; (2) that performance on spelling tasks was more subject to a maturational timetable than decoding was; (3) that performance on spelling tasks is less amenable than performance on decoding tasks to compensatory enhancement by higher level processing, and involves a sequential processing module that is shared with calculation but not with decoding; and (4) that strengths in visual‐spatial tasks may interact with relative weaknesses in both decoding and calculation tasks to predict even poorer performance on written spelling tasks.

20.
Abstract

Assessment feedback from teachers gains consistently low satisfaction scores in national surveys of student satisfaction, with concern surrounding its timeliness, quality and effectiveness. Equally, there has been heightened interest in the responsibility of learners in engaging with feedback and in how student assessment literacy might be increased. We present results from a five-year longitudinal mixed-methods enquiry, thematically analysing semi-structured interviews and focus groups with undergraduate students who experienced dialogic feed-forward on a course in a British university. We use inferential statistics to compare performance before and after the assessment intervention. The assessment consisted of submitting a draft coursework essay, which was discussed and evaluated face-to-face with the course teacher before a self-reflective piece was written about the assessment process and a final essay was submitted for summative grading. We evidence that this process exerted a positive influence on the student learning experience in a number of inter-related cognitive and affective ways, impacting positively upon learning behaviour, supporting student achievement and raising student satisfaction with feedback. We advocate a cyclic and iterative approach to dialogic feed-forward, which facilitates learners’ longitudinal development. Programme teams should offer systematic opportunities across curricula for students to understand the rationale for feedback and to develop feedback literacy.
