Similar literature
1.
In response to the demand for sound science assessments, this article presents the development of a latent construct called knowledge integration as an effective measure of science inquiry. Knowledge integration assessments ask students to link, distinguish, evaluate, and organize their ideas about complex scientific topics. The article focuses on assessment topics commonly taught in 6th- through 12th-grade classes. Items from both published standardized tests and previous knowledge integration research were examined in 6 subject-area tests. Results from Rasch partial credit analyses revealed that the tests exhibited satisfactory psychometric properties with respect to internal consistency, item fit, weighted likelihood estimates, discrimination, and differential item functioning. Compared with items coded using dichotomous scoring rubrics, those coded with the knowledge integration rubrics yielded significantly higher discrimination indexes. The knowledge integration assessment tasks, analyzed using knowledge integration scoring rubrics, demonstrate strong promise as effective measures of complex science reasoning in varied science domains.  相似文献   

2.
Typical assessment systems often measure isolated ideas rather than the coherent understanding valued in current science classrooms. Such assessments may motivate students to memorize, rather than to use new ideas to solve complex problems. To meet the requirements of the Next Generation Science Standards, instruction needs to emphasize sustained investigations, and assessments need to create a detailed picture of students’ conceptual understanding and reasoning processes.

This article describes the design process and potential for automated scoring of 2 forms of inquiry assessment: Energy Stories and MySystem. To design these assessments, we formed a partnership of teachers, discipline experts, researchers, technologists, and psychometricians to align curriculum, assessments, and rubrics. We illustrate how these items document middle school students’ reasoning about energy flow in life science. We used evidence from review by science teachers and experts in the discipline; classroom experiments; and psychometric analysis to validate the assessments, rubrics, and automated scoring.  相似文献   

3.
In low-stakes assessments, some students may not reach the end of the test and leave some items unanswered due to various reasons (e.g., lack of test-taking motivation, poor time management, and test speededness). Not-reached items are often treated as incorrect or not-administered in the scoring process. However, when the proportion of not-reached items is high, these traditional approaches may yield biased scores and thereby threaten the validity of test results. In this study, we propose a polytomous scoring approach for handling not-reached items and compare its performance with those of the traditional scoring approaches. Real data from a low-stakes math assessment administered to second and third graders were used. The assessment consisted of 40 short-answer items focusing on addition and subtraction. The students were instructed to answer as many items as possible within 5 minutes. Using the traditional scoring approaches, students’ responses for not-reached items were treated as either not-administered or incorrect in the scoring process. With the proposed scoring approach, students’ nonmissing responses were scored polytomously based on how accurately and rapidly they responded to the items to reduce the impact of not-reached items on ability estimation. The traditional and polytomous scoring approaches were compared based on several evaluation criteria, such as model fit indices, test information function, and bias. The results indicated that the polytomous scoring approaches outperformed the traditional approaches. The complete case simulation corroborated our empirical findings that the scoring approach in which nonmissing items were scored polytomously and not-reached items were considered not-administered performed the best. Implications of the polytomous scoring approach for low-stakes assessments were discussed.

4.
Several benefits of using scoring rubrics in performance assessments have been proposed, such as increased consistency of scoring, the possibility to facilitate valid judgment of complex competencies, and promotion of learning. This paper investigates whether evidence for these claims can be found in the research literature. Several databases were searched for empirical research on rubrics, resulting in a total of 75 studies relevant for this review. Conclusions are that: (1) the reliable scoring of performance assessments can be enhanced by the use of rubrics, especially if they are analytic, topic-specific, and complemented with exemplars and/or rater training; (2) rubrics do not facilitate valid judgment of performance assessments per se. However, valid assessment could be facilitated by using a more comprehensive framework of validity when validating the rubric; (3) rubrics seem to have the potential of promoting learning and/or improving instruction. The main reason for this potential lies in the fact that rubrics make expectations and criteria explicit, which also facilitates feedback and self-assessment.

5.
The Multidimensional School Anger Inventory–Revised (MSAI-R) is a measurement tool to evaluate high school students' anger. Its psychometric features have been tested in the USA, Australia, Japan, Guatemala, and Italy. This study investigates the factor structure and psychometric quality of the Persian version of the MSAI-R using data from an administration of the inventory to 585 Iranian high school students. The study adopted the four-factor underlying structure of high school student anger derived through factor analysis in previous validation studies, which consists of: School Hostility, Anger Experience, Positive Coping, and Destructive Expressions. Confirmatory factor analysis of this four-factor model indicated that it fit the data better than a one-factor baseline model, although the fit was not perfect. The Rasch model showed a very high internal consistency among items, with no item misfitting; however, our results suggest that to represent the construct sufficiently some items should be added to Positive Coping and Destructive Expression. This finding is in agreement with Boman, Curtis, Furlong, and Smith's Rasch analysis of the MSAI-R with an Australian sample. Overall, the results from this study support the psychometric features of the Persian MSAI-R. However, results from some test items also point to the dangers inherent in adapting the same test stimuli to widely divergent cultures.  相似文献   

6.
As item response theory has been more widely applied, investigating the fit of a parametric model becomes an important part of the measurement process. There is a lack of promising solutions to the detection of model misfit in IRT. Douglas and Cohen introduced a general nonparametric approach, RISE (Root Integrated Squared Error), for detecting model misfit. The purposes of this study were to extend the use of RISE to more general and comprehensive applications by manipulating a variety of factors (e.g., test length, sample size, IRT models, ability distribution). The results from the simulation study demonstrated that RISE outperformed G² and S‐X² in that it controlled Type I error rates and provided adequate power under the studied conditions. In the empirical study, RISE detected reasonable numbers of misfitting items compared to G² and S‐X², and RISE gave a much clearer picture of the location and magnitude of misfit for each misfitting item. In addition, there was no practical consequence to classification before and after replacement of misfitting items detected by three fit statistics.

7.
The Progressive Matrices items require varying degrees of analytical reasoning. Individuals high on the underlying trait measured by the Raven should score high on the test. Latent trait models applied to data of the Raven form provide a useful methodology for examining the tenability of the above hypothesis. In this study the Rasch latent model was applied to investigate the fit of observed performance on Raven items to what was expected by the model for individuals at six different levels of the underlying scale. For the most part the model showed a good fit to the test data. The findings were similar to previous empirical work that has investigated the behavior of Rasch test scores. In three instances, however, the item fit statistic was relatively large. A closer study of the “misfitting” items revealed two items were of extreme difficulty, which is likely to contribute to the misfit. The study raises issues about the use of the Rasch model in instances of small samples. Other issues related to the interpretation of the Rasch model to Raven-type data are discussed.  相似文献   

8.
When practitioners use modern measurement models to evaluate rating quality, they commonly examine rater fit statistics that summarize how well each rater's ratings fit the expectations of the measurement model. Essentially, this approach involves examining the unexpected ratings that each misfitting rater assigned (i.e., carrying out analyses of standardized residuals). One can create plots of the standardized residuals, isolating those that resulted from raters’ ratings of particular subgroups. Practitioners can then examine the plots to identify raters who did not maintain a uniform level of severity when they assessed various subgroups (i.e., exhibited evidence of differential rater functioning). In this study, we analyzed simulated and real data to explore the utility of this between‐subgroup fit approach. We used standardized between‐subgroup outfit statistics to identify misfitting raters and the corresponding plots of their standardized residuals to determine whether there were any identifiable patterns in each rater's misfitting ratings related to subgroups.  相似文献   

9.
Researchers have documented the impact of rater effects, or raters’ tendencies to give different ratings than would be expected given examinee achievement levels, in performance assessments. However, the degree to which rater effects influence person fit, or the reasonableness of test-takers’ achievement estimates given their response patterns, has not been investigated. In rater-mediated assessments, person fit reflects the reasonableness of rater judgments of individual test-takers’ achievement over components of the assessment. This study illustrates an approach to visualizing and evaluating person fit in assessments that involve rater judgment using rater-mediated person response functions (rm-PRFs). The rm-PRF approach allows analysts to consider the impact of rater effects on person fit in order to identify individual test-takers for whom the assessment results may not have a straightforward interpretation. A simulation study is used to evaluate the impact of rater effects on person fit. Results indicate that rater effects can compromise the interpretation and use of performance assessment results for individual test-takers. Recommendations are presented that call researchers and practitioners to supplement routine psychometric analyses for performance assessments (e.g., rater reliability checks) with rm-PRFs to identify students whose ratings may have compromised interpretations as a result of rater effects, person misfit, or both.  相似文献   

10.
The recent emphasis on various types of performance assessments raises questions concerning the differential effects of such assessments on population subgroups. Procedures for detecting differential item functioning (DIF) in data from performance assessments are available but may be hindered by problems that stem from this mode of assessment. Foremost among these are problems related to finding an appropriate matching variable. These problems are discussed and results are presented for three methods for DIF detection in polytomous items using data from a direct writing assessment. The purpose of the study is to examine the effects of using different combinations of internal and external matching variables. The procedures included a generalized Mantel-Haenszel statistic, a technique based on meta-analysis methodology, and logistic discriminant function analysis. In general, the results did not support the use of an external matching criterion and indicated that continued problems may be expected in attempts to assess DIF in performance assessments.  相似文献   

11.
Open-ended (OE) items are widely used to gather data on student performance in international achievement studies. However, several factors may threaten validity when using such items. This study examined Finnish coders’ opinions about threats to validity when coding responses to OE items in the PISA 2012 problem-solving test. A total of 6 discussions during 6 coder practice sessions (on 6 OE items) and an interview between 5 coders were audiorecorded and analyzed by means of content analysis, and 3 main threats to validity were found: (1) unclear and complex questions; (2) arbitrary and illogical coding rubrics; and (3) unclear and ambiguous responses. Suggestions are given as to how to respond to these threats in order to improve the validity of international achievement studies.  相似文献   

12.
This paper reports on a study where rubrics have been used to convey assessment expectations to students (n = 176) in three different assessment situations in professional education. These situations are: (1) the development of a survey instrument, which was part of a course in statistics and epidemiology; (2) an inspection of a house, which was part of a course about the functions of buildings for real estate brokers and (3) a workshop in communication with patients, which was part of a course in the evaluation of diagnostic procedures and treatments of oral infections in dental education. In all situations, students’ perceptions and uses of the rubrics were investigated. Findings suggest that it is indeed possible to convey expectations to students through the use of rubrics, in the sense that students not only appreciate the efforts to make assessment criteria transparent, but may also use the criteria in order to support and self-assess their performance. Important features of the rubrics, which were found to facilitate students’ understanding and use of the criteria in these situations, are presented and discussed.

13.
The use of assessment rubrics in the higher education sector is now widespread in a number of disciplines. Typically, these rubrics are constructed by teachers who also tend to be the main users of the rubrics throughout the grading process. In recent years, questions have been raised about this teacher-directed approach and some educators have begun to explore an alternate approach to rubric construction; that is, engaging students in collaboration with their teachers to co-construct assessment rubrics. This paper outlines the processes employed in a project that investigated the co-construction of rubrics within six different contexts. The project aimed to engage students in collaboration with their teachers to co-construct rubrics which are co-owned by teacher and student. A mixed methods approach was utilized to explore the effectiveness of the strategy. Questionnaires, interviews and focus groups were utilized to gather data from both the teacher-participants and student-participants regarding their experiences of being involved in the study. Findings are presented from the perspectives of both students and teachers, relating their views of rubrics and the activity of rubric co-construction. The paper concludes with recommendations for practical approaches to rubric co-construction and future research directions.  相似文献   

14.
Item stem formats can alter the cognitive complexity as well as the type of abilities required for solving mathematics items. Consequently, it is possible that item stem formats can affect the dimensional structure of mathematics assessments. This empirical study investigated the relationship between item stem format and the dimensionality of mathematics assessments. A sample of 671 sixth-grade students was given two forms of a mathematics assessment in which mathematical expression (ME) items and word problems (WP) were used to measure the same content. The effects of mathematical language and reading abilities in responding to ME and WP items were explored using unidimensional and multidimensional item response theory models. The results showed that WP and ME items appear to differ with regard to the underlying abilities required to answer these items. Hence, the multidimensional model fit the response data better than the unidimensional model. For the accurate assessment of mathematics achievement, students’ reading and mathematical language abilities should also be considered when implementing mathematics assessments with ME and WP items.  相似文献   

15.
16.
This article examines three typical approaches to alternate assessment for students with significant cognitive disabilities—portfolios, performance assessments, and rating scales. A detailed analysis of common and unique design features of these approaches is provided, including features of each approach that influence the psychometric quality of their results. Validity imperatives for alternate assessments are reviewed, and approaches for addressing the need for validity evidence are outlined. The article concludes with an examination of three technical challenges—alignment, scores and scoring, and standard setting—common to all alternate assessments. In light of these challenges, existing methods and professional testing standards are endorsed as necessary guidance for understanding and advancing alternate assessment practices.  相似文献   

17.
The mainstream research on scoring rubrics has emphasized the summative aspect of assessment. In recent years, the use of rubrics for formative purposes has gained more attention. This research has, however, not been conclusive. The aim of this study is therefore to review the research on formative use of rubrics, in order to investigate if, and how, rubrics have an impact on student learning. In total, 21 studies about rubrics were analyzed through content analysis. Sample, subject/task, design, procedure, and findings were compared among the different studies in relation to effects on student performance and self-regulation. Findings indicate that rubrics may have the potential to influence students’ learning positively, but also that there are several different ways for the use of rubrics to mediate improved performance and self-regulation. There are a number of factors identified that may moderate the effects of using rubrics formatively, as well as factors that need further investigation.

18.
Computer-based educational assessments often include items that involve drag-and-drop responses. There are different ways that drag-and-drop items can be laid out and different choices that test developers can make when designing these items. Currently, these decisions are based on experts’ professional judgments and design constraints, rather than empirical research, which might threaten the validity of interpretations of test outcomes. To this end, we investigated the effect of drag-and-drop item features on test-taker performance and response strategies with a cognition-centered approach. Four hundred and seventy-six adult participants solved content-equivalent drag-and-drop mathematics items under five design variants. Results showed that: (a) test takers’ performance and response strategies were affected by the experimental manipulations, and (b) test takers mostly used cognitively efficient response strategies regardless of the manipulated item features. Implications of the findings are provided to support test developers’ design decisions.  相似文献   

19.
Teachers' assessment practices were investigated in the context of school restructuring in Title I schools. The survey method included questionnaires distributed to teachers in 11 elementary schools in their 1st year of implementation and teachers in 11 elementary schools in their 4th year of implementation. Focus group interviews were conducted with groups of 8-10 teachers at each school. Results indicated that schools in their 4th year of restructuring had significantly higher mean ratings on the alternative assessment items than did schools in their 1st year of restructuring. These differences were significant for portfolios and student self-assessments. There were significant, positive correlations between scores on the alternative assessment scale with scores on the pedagogical change and student outcome scales. The qualitative data also suggested an increase in teachers' use of alternative assessment strategies and the development of rubrics to evaluate these assessments. There were more responses indicating changes in assessment among teachers in their 4th year of restructuring than among teachers in the 1st year. The qualitative data further indicated that teachers were concerned with the incompatibility between the alternative, authentic models advocated in the restructuring models and the district and state accountability systems that relied on standardized objective tests.  相似文献   

20.
Changes to the design and development of our educational assessments are resulting in the unprecedented demand for a large and continuous supply of content‐specific test items. One way to address this growing demand is with automatic item generation (AIG). AIG is the process of using item models to generate test items with the aid of computer technology. The purpose of this module is to describe and illustrate a template‐based method for generating test items. We outline a three‐step approach where test development specialists first create an item model. An item model is like a mould or rendering that highlights the features in an assessment task that must be manipulated to produce new items. Next, the content used for item generation is identified and structured. Finally, features in the item model are systematically manipulated with computer‐based algorithms to generate new items. Using this template‐based approach, hundreds or even thousands of new items can be generated with a single item model.  相似文献   
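
The three steps named in the abstract above (create an item model, structure the content, manipulate the features algorithmically) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the module itself: the `ITEM_MODEL` dictionary, the `generate_items` function, the train-speed stem, and the feature values are all hypothetical.

```python
from itertools import product

# Step 1 (hypothetical item model): a stem template with named slots,
# plus a rule for computing the correct answer from the slot values.
ITEM_MODEL = {
    "stem": ("A train travels {distance} km in {hours} hours. "
             "What is its average speed in km/h?"),
    "key": lambda b: b["distance"] / b["hours"],
    # Step 2 (hypothetical content): the values each feature may take.
    "features": {
        "distance": [120, 180, 240],
        "hours": [2, 3],
    },
}


def generate_items(model):
    """Step 3: systematically combine feature values to produce items."""
    names = list(model["features"])
    items = []
    # Cartesian product over all feature value lists, in declaration order.
    for values in product(*(model["features"][n] for n in names)):
        binding = dict(zip(names, values))
        items.append({
            "stem": model["stem"].format(**binding),
            "key": model["key"](binding),
        })
    return items


items = generate_items(ITEM_MODEL)
```

With three distances and two durations, this single item model yields six items. Operational item generation would typically also add constraints that filter out implausible or duplicate combinations, which this sketch omits.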
