Similar literature
1.
Typical assessment systems often measure isolated ideas rather than the coherent understanding valued in current science classrooms. Such assessments may motivate students to memorize, rather than to use new ideas to solve complex problems. To meet the requirements of the Next Generation Science Standards, instruction needs to emphasize sustained investigations, and assessments need to create a detailed picture of students’ conceptual understanding and reasoning processes.

This article describes the design process and potential for automated scoring of 2 forms of inquiry assessment: Energy Stories and MySystem. To design these assessments, we formed a partnership of teachers, discipline experts, researchers, technologists, and psychometricians to align curriculum, assessments, and rubrics. We illustrate how these items document middle school students’ reasoning about energy flow in life science. We used evidence from review by science teachers and experts in the discipline; classroom experiments; and psychometric analysis to validate the assessments, rubrics, and automated scoring.

2.
Interpreting and creating graphs plays a critical role in scientific practice. The K-12 Next Generation Science Standards call for students to use graphs for scientific modeling, reasoning, and communication. To measure progress on this dimension, we need valid and reliable measures of graph understanding in science. In this research, we designed items to measure graph comprehension, critique, and construction and developed scoring rubrics based on the knowledge integration (KI) framework. We administered the items to over 460 middle school students. We found that the items formed a coherent scale and had good reliability using both item response theory and classical test theory. The KI scoring rubric showed that most students had difficulty linking graph features to science concepts, especially when asked to critique or construct graphs. In addition, students with limited access to computers as well as those who speak a language other than English at home had less integrated understanding than others. These findings point to the need to increase the integration of graphing into science instruction. The results suggest directions for further research leading to comprehensive assessments of graph understanding.

3.
Argumentation is fundamental to science education, both as a prominent feature of scientific reasoning and as an effective mode of learning—a perspective reflected in contemporary frameworks and standards. The successful implementation of argumentation in school science, however, requires a paradigm shift in science assessment from the measurement of knowledge and understanding to the measurement of performance and knowledge in use. Performance tasks requiring argumentation must capture the many ways students can construct and evaluate arguments in science, yet such tasks are both expensive and resource-intensive to score. In this study we explore how machine learning text classification techniques can be applied to develop efficient, valid, and accurate constructed-response measures of students' competency with written scientific argumentation that are aligned with a validated argumentation learning progression. Data come from 933 middle school students in the San Francisco Bay Area and are based on three sets of argumentation items in three different science contexts. The findings demonstrate that we have been able to develop computer scoring models that can achieve substantial to almost perfect agreement between human-assigned and computer-predicted scores. Model performance was slightly weaker for harder items targeting higher levels of the learning progression, largely due to the linguistic complexity of these responses and the sparsity of higher-level responses in the training data set. Comparing the efficacy of different scoring approaches revealed that breaking down students' arguments into multiple components (e.g., the presence of an accurate claim or providing sufficient evidence), developing computer models for each component, and combining scores from these analytic components into a holistic score produced better results than holistic scoring approaches. 
However, this analytical approach was found to be differentially biased when scoring responses from English learner (EL) students as compared to responses from non-EL students on some items. Differences in severity between human and computer scores for EL students across these approaches are explored, and potential sources of bias in automated scoring are discussed.
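
The phrase "substantial to almost perfect agreement" matches the conventional Landis and Koch labels for kappa statistics. The abstract does not say which agreement statistic the authors used, so the following is only a minimal sketch under the assumption that a quadratic weighted kappa is computed between human and computer scores. The function names `weighted_kappa` and `landis_koch_label` are hypothetical and do not come from the study.

```python
from collections import Counter


def weighted_kappa(human, machine, n_levels):
    """Quadratic weighted kappa for two lists of integer scores in 0..n_levels-1.

    Assumption: this statistic is chosen for illustration only; the
    abstract does not name the agreement measure the study used.
    """
    if len(human) != len(machine) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    if n_levels < 2:
        raise ValueError("need at least two score levels")

    n = len(human)
    observed = Counter(zip(human, machine))  # joint counts of (human, machine)
    human_marginal = Counter(human)
    machine_marginal = Counter(machine)
    max_dist = (n_levels - 1) ** 2

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_levels):
        for j in range(n_levels):
            weight = (i - j) ** 2 / max_dist  # 0 on the diagonal, 1 at the extremes
            weighted_observed += weight * observed[(i, j)] / n
            weighted_expected += (
                weight * (human_marginal[i] / n) * (machine_marginal[j] / n)
            )

    if weighted_expected == 0.0:
        # Both raters used one identical score level throughout.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


def landis_koch_label(kappa):
    """Map a kappa value to the conventional Landis and Koch (1977) label."""
    if kappa > 0.80:
        return "almost perfect"
    if kappa > 0.60:
        return "substantial"
    if kappa > 0.40:
        return "moderate"
    if kappa > 0.20:
        return "fair"
    if kappa >= 0.0:
        return "slight"
    return "poor"
```

Computing the statistic separately for EL and non-EL responses would be one simple way to look for the kind of differential bias the abstract reports, although the study's own bias analysis may differ.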

4.
In response to the demand for sound science assessments, this article presents the development of a latent construct called knowledge integration as an effective measure of science inquiry. Knowledge integration assessments ask students to link, distinguish, evaluate, and organize their ideas about complex scientific topics. The article focuses on assessment topics commonly taught in 6th- through 12th-grade classes. Items from both published standardized tests and previous knowledge integration research were examined in 6 subject-area tests. Results from Rasch partial credit analyses revealed that the tests exhibited satisfactory psychometric properties with respect to internal consistency, item fit, weighted likelihood estimates, discrimination, and differential item functioning. Compared with items coded using dichotomous scoring rubrics, those coded with the knowledge integration rubrics yielded significantly higher discrimination indexes. The knowledge integration assessment tasks, analyzed using knowledge integration scoring rubrics, demonstrate strong promise as effective measures of complex science reasoning in varied science domains.

5.
A large-scale study involving 1786 Year 7–10 Korean students from three school districts in Seoul was undertaken to evaluate their understanding of basic optics concepts using a two‐tier multiple‐choice diagnostic instrument consisting of four pairs of items, each of which evaluated the same concept in two different contexts. The instrument, which proved to be reliable, helped identify several context‐dependent alternative conceptions that were held by about 25% of students. At the same time, students’ performance on the diagnostic test correlated with the location of the schools, students’ achievement in school science and their attitudes to science learning. However, students’ grade levels had limited influence on their understanding of basic concepts in optics as measured by the instrument.

6.
Automated scoring systems are typically evaluated by comparing the performance of a single automated rater item-by-item to human raters. This presents a challenge when the performance of multiple raters needs to be compared across multiple items. Rankings could depend on specifics of the ranking procedure; observed differences could be due to random sampling of items and/or responses in the validation sets. Any statistical hypothesis test of the differences in rankings needs to be appropriate for use with rater statistics and adjust for multiple comparisons. This study considered different statistical methods to evaluate differences in performance across multiple raters and items. These methods are illustrated using data from the 2012 Automated Student Assessment Prize competitions. Using average rankings to test for significant differences in performance between automated and human raters, findings show that most automated raters did not perform statistically significantly differently from human-to-human inter-rater agreement for essays but they did perform differently on short-answer items. Differences in average rankings between most automated raters were not statistically significant, even when their observed performance differed substantially.

7.
This yearlong study was implemented in seventh-grade life science classes with the students' regular teacher serving as teacher/researcher. In the study, a method of scoring concept maps was developed to assess knowledge and comprehension levels of science achievement. By linking scoring of concept maps to instructional objectives, scores were based upon the correctness of propositions. High correlations between the concept map scores and unit multiple choice tests provided strong evidence of the content validity of the map scores. Similarly, correlations between map scores and state criterion-referenced and national norm-referenced standardized tests were indicators of high concurrent validity. The approach to concept map scoring in the study represents a distinct departure from traditional methods that focus on characteristics such as hierarchy and branching. A large body of research has demonstrated the utility of such methods in the assessment of higher-level learning outcomes. The results of the study suggest that a concept map might be used in assessing declarative and procedural knowledge, both of which have a place in the science classroom. One important implication of these results is that science curriculum and its corresponding assessment need not be dichotomized into knowledge/comprehension versus higher-order outcomes. © 1998 John Wiley & Sons, Inc. J Res Sci Teach 35: 1103–1127, 1998.

8.
This study considered middle school mathematics teachers’ use of rubrics to score non‐traditional tasks. A group of eighth‐grade teachers attended a two‐day workshop where they evaluated assessment tasks and discussed the use of an associated scoring rubric. Scored samples of student work submitted by the teachers indicated that they had difficulty using the rubrics for scoring. When compared to expert ratings, all except one teacher had discrepancies in scoring and some discrepancies indicated major problems. These discrepancies appear to be related to whether the task contained familiar or unfamiliar content and the mix of procedure and explanation the task required. Several other factors related to discrepancies, such as leniency errors, teacher knowledge, and the halo effect, are also discussed. With the expanded use of rubrics in many arenas, these results show the need for more professional development related to rubric use.

9.
As methods for automated scoring of constructed‐response items become more widely adopted in state assessments, and are used in more consequential operational configurations, it is critical that their susceptibility to gaming behavior be investigated and managed. This article provides a review of research relevant to how construct‐irrelevant response behavior may affect automated constructed‐response scoring, and aims to address a gap in that literature: the need to assess the degree of risk before operational launch. A general framework is proposed for evaluating susceptibility to gaming, and an initial empirical demonstration is presented using the open‐source short‐answer scoring engines from the Automated Student Assessment Prize (ASAP) Challenge.

10.
Global carbon cycling describes the movement of carbon through the atmosphere, biosphere, geosphere, and hydrosphere; it lies at the heart of climate change and sustainability. To understand the global carbon cycle, students will require interdisciplinary knowledge. While standards documents in science education have long promoted interdisciplinary understanding, our current science education system is still oriented toward single‐discipline‐based learning. Furthermore, there is limited work on interdisciplinary assessment. This article presents the validated Interdisciplinary Science Assessment of Carbon Cycling (ISACC), and reports empirical results of a study of high school and undergraduate students, including an analysis of the relationship between interdisciplinary items and disciplinary items. Many‐faceted Rasch analysis produced detailed information about the relative difficulty of items and estimates of ability levels of students. One‐way ANCOVA was used to analyze differences among three grade levels: high school, college Freshman–Sophomore, and college Junior–Senior, with number of science courses as a covariate. Findings indicated significantly higher levels of interdisciplinary understanding among the Freshman–Sophomore group compared to high school students. There was no statistically significant difference between the Freshman–Sophomore group and the Junior–Senior group. Items assessing interdisciplinary understanding were more difficult than items assessing disciplinary understanding of global carbon cycling; however, interdisciplinary and disciplinary understanding were strongly correlated. This study highlights the importance of interdisciplinary understanding in learning carbon cycling and discusses its potential impacts on science curriculum and teaching practices.

11.
Educational Assessment, 2013, 18(3): 201-224
This article discusses an approach to analyzing performance assessments that identifies potential reasons for misfitting items and uses this information to improve on items and rubrics for these assessments. Specifically, the approach involves identifying psychometric features and qualitative features of items and rubrics that may possibly influence misfit; examining relations between these features and the fit statistic; conducting an analysis of student responses to a sample of misfitting items; and finally, based on the results of the previous analyses, modifying characteristics of the items or rubrics and reexamining fit. A mathematics performance assessment containing 53 constructed-response items scored on a holistic scale from 0 to 4 is used to illustrate the approach. The 2-parameter graded response model (Samejima, 1969) is used to calibrate the data. Implications of this method of data analysis for improving performance assessment items and rubrics are discussed as well as issues and limitations related to the use of the approach.

12.
The writing skills of 286 children (157 female and 129 male) were studied by comparing name writing and letter writing scores from preschool to kindergarten with letter and word reading scores over the same time period. Two rubrics for scoring writing were compared to determine if scores based on multiple components (i.e., letter formation, orientation on the vertical axis, left–right orientation, and correct letter sequencing) would better reflect differences in children’s writing knowledge in preschool and kindergarten than rubrics composed of one component (i.e., letter formation only). While developmental changes in writing scores were found, little additional information was provided by multiple component scoring rubrics compared to the single component rubric. Letter writing scores were more strongly related to letter and word reading scores than name writing scores but neither writing score was predictive of growth. Implications of the findings for intentional/systematic writing instruction in preschool curricula are discussed.

13.
Single‐best answers to multiple‐choice items are commonly dichotomized into correct and incorrect responses, and modeled using either a dichotomous item response theory (IRT) model or a polytomous one if differences among all response options are to be retained. The current study presents an alternative IRT‐based modeling approach to multiple‐choice items administered with the procedure of elimination testing, which asks test‐takers to eliminate all the response options they consider to be incorrect. The partial credit model is derived for the obtained responses. By extracting more information pertaining to test‐takers’ partial knowledge on the items, the proposed approach has the advantage of providing more accurate estimation of the latent ability. In addition, it may shed some light on the possible answering processes of test‐takers on the items. As an illustration, the proposed approach is applied to a classroom examination of an undergraduate course in engineering science.
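
For orientation, the general form of the partial credit model (Masters, 1982) is sketched below. The abstract does not give the study's own derivation or say how elimination responses map onto the ordered score x, so the notation here is an assumption: θ_n is the ability of test-taker n, δ_ik is the k-th step difficulty of item i, and m_i is the maximum score on item i.

```latex
% Partial credit model: probability that person n obtains score x on item i.
% Convention: the sum for x = 0 is defined to be zero,
% i.e. \sum_{k=0}^{0} (\theta_n - \delta_{ik}) \equiv 0.
P(X_{ni} = x \mid \theta_n) =
  \frac{\exp\left( \sum_{k=0}^{x} (\theta_n - \delta_{ik}) \right)}
       {\sum_{h=0}^{m_i} \exp\left( \sum_{k=0}^{h} (\theta_n - \delta_{ik}) \right)},
  \qquad x = 0, 1, \dots, m_i .
```

With m_i = 1 the expression reduces to the dichotomous Rasch model, which is why the partial credit model can retain partial-knowledge information that a correct/incorrect coding discards.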

14.
A framework for evaluation and use of automated scoring of constructed‐response tasks is provided that entails both evaluation of automated scoring and guidelines for implementation and maintenance in the context of constantly evolving technologies. Validity issues and challenges associated with automated scoring are discussed within the framework. The fit between the scoring capability and the assessment purpose, the agreement between human and automated scores, the consideration of associations with independent measures, the generalizability of automated scores as implemented in operational practice across different tasks and test forms, and the impact and consequences for the population and subgroups are proffered as integral evidence supporting use of automated scoring. Specific evaluation guidelines are provided for using automated scoring to complement human scoring for tests used for high‐stakes purposes. These guidelines are intended to be generalizable to new automated scoring systems and to existing systems as they change over time.

15.
This article describes a method for identifying test items as disability neutral for children with vision and motor disabilities. Graduate students rated 130 items of the Preschool Language Scale and obtained inter‐rater correlation coefficients of 0.58 for ratings of items as disability neutral for children with vision disability, and 0.77 for ratings of items as disability neutral for children with motor disability. These ratings were used to create three item sets considered disability neutral for children with vision disability, motor disability, or both disabilities. Two methods for scoring the item sets were identified: scoring each set as a partially administered developmental test, or computing standard scores based upon pro‐rated raw score totals. The pro‐rated raw score method generated standard scores that were significantly inflated and therefore less useful for the assessment purposes than the ratio quotient method. This research provides a test accommodation technique for assessing children with multiple disabilities.

16.
This study used qualitative and quantitative approaches to evaluate the effectiveness of self‐learning modules (SLMs) developed to facilitate and individualize students' learning of basic medical sciences. Twenty physiology and nineteen microanatomy SLMs were designed with interactive images, animations, narrations, and self‐assessments. Of 41 medical students, 40 students voluntarily completed a questionnaire with open‐ended and closed‐ended items to evaluate students' attitudes and perspectives on the learning value of SLMs. Closed‐ended items were assessed on a five‐point Likert scale (5 = high score) and the data were expressed as mean ± standard deviation. Open‐ended questions further evaluated students' perspectives on the effectiveness of SLMs; student responses to open‐ended questions were analyzed to identify shared patterns or themes in their experience using SLMs. The results of the midterm examination were also analyzed to compare student performance on items related to SLMs and traditional sessions. Students positively evaluated their experience using the SLMs with an overall mean score of 4.25 (SD ± 0.84). Most students (97%) indicated that the SLMs improved understanding and facilitated learning basic science concepts. SLMs were reported to allow learner control, to help in preparation for subsequent in‐class discussion, and to improve understanding and retention. A significant difference in students' performance was observed when comparing SLM‐related items with non‐SLM items in the midterm examination (P < 0.05). In conclusion, the use of SLMs in an integrated basic science curriculum has the potential to individualize the teaching and improve the learning of basic sciences. Anat Sci Educ 3: 219–226, 2010. © 2010 American Association of Anatomists.

17.
Several benefits of using scoring rubrics in performance assessments have been proposed, such as increased consistency of scoring, the possibility to facilitate valid judgment of complex competencies, and promotion of learning. This paper investigates whether evidence for these claims can be found in the research literature. Several databases were searched for empirical research on rubrics, resulting in a total of 75 studies relevant for this review. Conclusions are that: (1) the reliable scoring of performance assessments can be enhanced by the use of rubrics, especially if they are analytic, topic-specific, and complemented with exemplars and/or rater training; (2) rubrics do not facilitate valid judgment of performance assessments per se. However, valid assessment could be facilitated by using a more comprehensive framework of validity when validating the rubric; (3) rubrics seem to have the potential of promoting learning and/or improving instruction. The main reason for this potential lies in the fact that rubrics make expectations and criteria explicit, which also facilitates feedback and self-assessment.

18.
The research focus on children's science has recently shifted from separate concepts to more comprehensive and complex topics. This study addressed pupils' understanding of the complex topic of energy flow and matter cycling. A scoring system with three categories and six concepts was developed and used by four biology teachers to analyze 106 pupils' concept maps. The results indicate that most of the pupils failed to recognize the interrelationships among the various concepts concerned with units of energy flow and matter cycling. It was the relationship between the living world and the non‐living world that presented the greatest difficulty for understanding. This paper concludes with suggestions for curriculum development and biology teaching.

19.
‘Rubric’ is a term with a variety of meanings. As the use of rubrics has increased both in research and practice, the term has come to represent divergent practices. These range from secret scoring sheets held by teachers to holistic student-developed articulations of quality. Rubrics are evaluated, mandated, embraced and resisted based on often imprecise and inconsistent understandings of the term. This paper provides a synthesis of the diversity of rubrics, and a framework for researchers and practitioners to be clearer about what they mean when they say ‘rubric’. Fourteen design elements or decision points are identified that make one rubric different from another. This framework subsumes previous attempts to categorise rubrics, and should provide more precision to rubric discussions and debate, as well as supporting more replicable research and practice.

20.
The mainstream research on scoring rubrics has emphasized the summative aspect of assessment. In recent years, the use of rubrics for formative purposes has gained more attention. This research has, however, not been conclusive. The aim of this study is therefore to review the research on formative use of rubrics, in order to investigate if, and how, rubrics have an impact on student learning. In total, 21 studies about rubrics were analyzed through content analysis. Sample, subject/task, design, procedure, and findings were compared among the different studies in relation to effects on student performance and self-regulation. Findings indicate that rubrics may have the potential to influence student learning positively, but also that there are several different ways for the use of rubrics to mediate improved performance and self-regulation. There are a number of factors identified that may moderate the effects of using rubrics formatively, as well as factors that need further investigation.
