Similar Articles
20 similar articles found (search time: 31 ms)
1.
Student responses to a large number of constructed response items in three Math and three Reading tests were scored on two occasions using three ways of assigning raters: single reader scoring, a different reader for each response (item-specific), and three readers each scoring a rater item block (RIB) containing approximately one-third of a student's responses. Multiple group confirmatory factor analyses indicated that the three types of total scores were most frequently tau-equivalent. Factor models fitted on the item responses attributed differences in scores to correlated ratings incurred by the same reader scoring multiple responses. These halo effects contributed to significantly increased single reader mean total scores for three of the tests. The similarity of scores for item-specific and RIB scoring suggests that the effect of rater bias on an examinee's set of responses may be minimized with the use of multiple readers, though fewer than the number of items.
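The design comparison above lends itself to a quick simulation. The sketch below is an illustrative toy model with invented variance parameters, not the study's data; it shows why spreading a student's responses across more readers washes out shared rater effects: the fewer the readers, the more one reader's severity is multiplied across items, inflating the spread of total scores.

```python
import random
import statistics

random.seed(7)

N_STUDENTS = 2000
N_ITEMS = 9
TRUE_SD = 1.0      # spread of true item quality (assumed)
RATER_SD = 0.5     # spread of a reader's severity/leniency (assumed)
NOISE_SD = 0.5     # residual rating noise (assumed)

def total_score(n_raters):
    """Total score for one student when the items are split among
    `n_raters` readers; each reader's severity is shared (a halo-like
    dependence) across every item that reader scores."""
    truths = [random.gauss(0, TRUE_SD) for _ in range(N_ITEMS)]
    severities = [random.gauss(0, RATER_SD) for _ in range(n_raters)]
    total = 0.0
    for i, t in enumerate(truths):
        rater = i * n_raters // N_ITEMS      # contiguous item blocks per reader
        total += t + severities[rater] + random.gauss(0, NOISE_SD)
    return total

results = {}
for k in (1, 3, N_ITEMS):                    # single reader, RIB-style, item-specific
    totals = [total_score(k) for _ in range(N_STUDENTS)]
    results[k] = statistics.stdev(totals)
    print(k, "reader(s): total-score SD =", round(results[k], 2))
```

Under these assumptions the single-reader design shows the largest spread of totals and the item-specific design the smallest, consistent with shared rater effects accumulating when one reader scores every response.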

2.
In this digital ITEMS module, Dr. Sue Lottridge, Amy Burkhardt, and Dr. Michelle Boyer provide an overview of automated scoring. Automated scoring is the use of computer algorithms to score unconstrained open-ended test items by mimicking human scoring. The use of automated scoring is increasing in educational assessment programs because it allows scores to be returned faster at lower cost. In the module, they discuss automated scoring from a number of perspectives. First, they discuss benefits and weaknesses of automated scoring, and what psychometricians should know about automated scoring. Next, they describe the overall process of automated scoring, moving from data collection to engine training to operational scoring. Then, they describe how automated scoring systems work, including the basic functions around score prediction as well as other flagging methods. Finally, they conclude with a discussion of the specific validity demands around automated scoring and how they align with the larger validity demands around test scores. Two data activities are provided. The first is an interactive activity that allows the user to train and evaluate a simple automated scoring engine. The second is a worked example that examines the impact of rater error on test scores. The digital module contains a link to an interactive web application as well as its R-Shiny code, diagnostic quiz questions, activities, curated resources, and a glossary.
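The train-then-score pipeline described above can be illustrated in miniature. The sketch below is a toy, not a production engine: the responses, scores, and the single keyword-count feature are all invented for illustration. It fits a least-squares line from a rubric-keyword count to human scores, then scores new responses.

```python
# Toy training set: (response, human score on a 0-2 rubric). All invented.
train = [
    ("photosynthesis uses sunlight to make glucose", 2),
    ("plants use light energy to produce sugar and oxygen", 2),
    ("plants eat soil", 0),
    ("the plant makes food with light", 1),
    ("sunlight helps plants grow", 1),
    ("i dont know", 0),
]
KEYWORDS = {"sunlight", "light", "glucose", "sugar", "oxygen", "energy"}

def feature(text):
    """A single crude feature: count of rubric keywords in the response."""
    return sum(1 for w in text.split() if w in KEYWORDS)

# "Engine training": fit score ≈ a * feature + b by ordinary least squares.
xs = [feature(t) for t, _ in train]
ys = [s for _, s in train]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx

def predict(text):
    """Operational scoring: clamp the rounded prediction to the rubric range."""
    return max(0, min(2, round(a * feature(text) + b)))

print(predict("light energy becomes sugar"))
```

A real engine would use far richer features and flagging logic, but the data-collection, training, and operational-scoring stages follow this same shape.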

3.
Abstract

Students in a college course were given written criteria, divided into teams, and asked to score their own essay examination. Their pooled ratings correlated .922 with the instructor's ratings. Agreement between the ratings of students and instructor was not related to grade point or total test score. However, grade point and test scores were related negatively to the ambiguity of the students’ answers on the examination. The results support the generalization that subjective scoring standards are readily communicable. Theoretical and practical implications are examined.

4.
This study examined acoustic correlates of adults' ratings of infants' cries. Parents and nonparents rated 12 spontaneous cries from young infants on 8 items describing the cries' aversiveness and on 9 semantic differential items. The results indicated that the duration, the amount of dysphonation, and the proportion of energy in various frequency bands were highly correlated with adults' ratings. Further, the pattern of correlations between each of the 17 rating scale items and the acoustic attributes was virtually the same, suggesting that the items represented a single underlying dimension of perceived aversiveness. Finally, no differences were found between the results for parents and nonparents. General issues in the study of cry perception are discussed.

5.
This study examined the usefulness of applying the Rasch rating scale model (Andrich, 1978) to high school grade data. ACT Assessment test scores (English, Mathematics, Reading, and Science Reasoning) were used as "common items" to adjust for different grading standards in individual high school courses both within and across schools. This scaling approach yielded an ACT Assessment-adjusted high school grade point average (AA-HSGPA) on a common scale across high schools and cohorts within a large public university. AA-HSGPA was a better predictor of first-year college grade point average (CGPA) than the regular high school grade point average. The best model for predicting CGPA included both the ACT composite score and AA-HSGPA.
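The common-item adjustment idea can be sketched with a simple linear stand-in for the Rasch model. In the toy below, the grade records, course names, and the fixed common-score slope are all invented for illustration (the study estimates Rasch parameters rather than assuming a slope): each course's mean residual relative to the common test score is treated as its leniency, and grades are re-expressed on a shared scale.

```python
import statistics

# Hypothetical records: (course, common test score, awarded grade 0-4). Invented.
records = [
    ("lenient_hs", 20, 3.8), ("lenient_hs", 24, 4.0), ("lenient_hs", 16, 3.4),
    ("strict_hs",  20, 2.8), ("strict_hs",  24, 3.2), ("strict_hs",  28, 3.6),
    ("average_hs", 22, 3.3), ("average_hs", 18, 2.9), ("average_hs", 26, 3.7),
]

SLOPE = 0.1  # assumed grade points per common-score point, shared across courses

# A course's leniency = mean residual of its grades after removing the part
# predicted by the common score; center leniencies on the overall mean residual.
courses = {c for c, _, _ in records}
leniency = {}
for c in courses:
    resid = [g - SLOPE * s for cc, s, g in records if cc == c]
    leniency[c] = statistics.mean(resid)
overall = statistics.mean(g - SLOPE * s for _, s, g in records)

def adjusted(course, grade):
    """Grade re-expressed on a common scale across courses."""
    return grade - (leniency[course] - overall)

print(round(adjusted("lenient_hs", 3.8), 2), round(adjusted("strict_hs", 2.8), 2))
```

In this toy data a 3.8 from the lenient course and a 2.8 from the strict course land at nearly the same adjusted value, which is the intuition behind an AA-HSGPA-style common scale.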

6.

Summated rating scales to measure attitudes (and other human characteristics) commonly consist of numerous items whose scores are summed to yield a total score. A central assumption underlying the use of this technique is that the items in the scale reflect a common construct. If this assumption is not met, the scoring procedure produces largely meaningless, uninterpretable data. Although this important psychometric principle has been known for a long time, numerous studies in the research literature demonstrate a neglect of this principle. Some studies make no attempt at all to conceptualise the construct to be measured; others conceptualise the construct but then ignore the possibility that it may be multi-dimensional; still others actually contain evidence which indicates that the construct is multi-dimensional and then proceed to ignore that evidence. A possible contributor to the confusion is the widespread misunderstanding about the related yet distinct concepts of internal consistency and uni-dimensionality. This paper presents case studies of poor and good instrument design, in the (forlorn?) hope that clarification of the issues might make a difference in the future.

7.
《教育实用测度》(Applied Measurement in Education), 2013, 26(4): 345-358
Performance assessments typically are scored by having experts rate individual performances. In contexts such as medical licensure, where the examinee population is large and the pool of expert raters is practically limited, this approach may be unworkable. This article describes an automated scoring algorithm for a computer simulation-based examination of physicians' patient-management skills. The algorithm is based on the policy used by clinicians in rating case performances. The results show that scores produced using this algorithm are highly correlated with actual clinician ratings. These scores also are shown to be effective in discriminating between case performances judged to be passing or failing by an independent group of clinicians.

8.
The present study investigates the validity of a 4-point rating scale used to measure the level of preschool children's orientation to literacy during shared book reading. Validity was explored by (a) comparing the children's level of literacy orientation as measured with the Children's Orientation to Book Reading Rating Scale (COB) with a teacher's rating of a child's level of attention and effortful control on the Children's Behaviour Questionnaire (CBQ), and (b) computing the predictive validity of a child's COB rating for overall levels of emergent literacy at the end of the preschool year. This study involved 46 preschool children from low-income backgrounds; children's literacy orientation was rated during a group teacher-led book reading. Children's ratings of literacy orientation during shared book reading using the global 4-point COB scale were significantly correlated with teacher ratings of a child's attention and effortful control as measured on the CBQ. Hierarchical regression results indicated children's literacy orientation significantly predicted children's end-of-year alphabet knowledge and overall emergent reading skills above and beyond the variance contributed by children's language skills and family income. The validity of a global rating for indexing children's level of literacy orientation was supported. Educational implications and recommendations for the COB as a component of early literacy assessment are discussed.

9.
ABSTRACT

This study investigates the role of automated scoring and feedback in supporting students’ construction of written scientific arguments while learning about factors that affect climate change in the classroom. The automated scoring and feedback technology was integrated into an online module. Students’ written scientific argumentation occurred when they responded to structured argumentation prompts. After submitting the open-ended responses, students received scores generated by a scoring engine and written feedback associated with the scores in real time. Using the log data that recorded argumentation scores as well as argument submission and revision activities, we answer three research questions: first, how students behaved after receiving the feedback; second, whether and how students’ revisions improved their argumentation scores; and third, whether item difficulties shifted with the availability of the automated feedback. Results showed that the majority of students (77%) made revisions after receiving the feedback, and students with higher initial scores were more likely to revise their responses. Students who revised had significantly higher final scores than those who did not, and each revision was associated with an average increase of 0.55 in the final score. Analysis of item difficulty shifts showed that written scientific argumentation became easier after students used the automated feedback.

10.
《教育实用测度》(Applied Measurement in Education), 2013, 26(3): 281-299
The growing use of computers for test delivery, along with increased interest in performance assessments, has motivated test developers to develop automated systems for scoring complex constructed-response assessment formats. In this article, we add to the available information describing the performance of such automated scoring systems by reporting on generalizability analyses of expert ratings and computer-produced scores for a computer-delivered performance assessment of physicians' patient management skills. Two different automated scoring systems were examined. These automated systems produced scores that were approximately as generalizable as those produced by expert raters. Additional analyses also suggested that the traits assessed by the expert raters and the automated scoring systems were highly related (i.e., true correlations between test forms, across scoring methods, were approximately 1.0). In the appendix, we discuss methods for estimating this correlation, using ratings and scores produced by an automated system from a single test form.

11.
Content-based automated scoring has been applied in a variety of science domains. However, many prior applications involved simplified scoring rubrics without considering rubrics representing multiple levels of understanding. This study tested c-rater, a concept-based scoring tool for content-based scoring, on four science items with rubrics aiming to differentiate among multiple levels of understanding. Automated scores for the items showed moderate to good agreement with human scores. The findings suggest that automated scoring has the potential to score constructed-response items with complex scoring rubrics, but in its current design cannot replace human raters. This article discusses sources of disagreement and factors that could potentially improve the accuracy of concept-based automated scoring.

12.
Objective. This study compared mother and child ratings of child anxiety to each other and to an objective measure of the child’s avoidant behavior, using a novel motion-tracking paradigm. The study also examined the moderating role of family accommodation for the link between mother ratings of child anxiety and child behavioral avoidance. Design. Participants were 98 children (7- to 14-years-old) and their mothers. Children met criteria for a primary anxiety disorder. Measures included parent and child versions of the Multi-Dimensional Anxiety Scale for Children and the Screen for Child Anxiety Related Emotional Disorders. Children also completed the Spider Phobia Questionnaire for Children and the Family Accommodation Scale for Anxiety—Child Report. The Yale Interactive Kinect Environment Software platform was used to measure children’s behavioral avoidance of spider images. Results. Mother and child ratings of child anxiety were moderately correlated. Only child ratings of child anxiety were associated with child behavioral avoidance. Child-rated family accommodation moderated the association between parent ratings and child avoidance. When accommodation was low, parent ratings correlated with child avoidance, but not when accommodation was high. Conclusions. The findings contribute to understanding commonly reported discrepancies between mother and child ratings of child anxiety symptoms.

13.
A rating scale measuring parent beliefs about play was developed and validated with a sample of 224 African American mothers of children attending Head Start. Principal components analyses of the Parent Play Beliefs Scale (PPBS) revealed two factors, Play Support and Academic Focus, which capture parent attitudes regarding the developmental significance of play. Maternal ratings of Play Support correlated positively with ratings of children's interactive peer play and were positively associated with parent education. Maternal ratings of Academic Focus were negatively correlated with prosocial peer play ratings and positively correlated with ratings of disruptive and disconnected play in children. Findings support the psychometric utility of the new measure. Future directions involving parent play beliefs in conceptual models of children's social competence during early childhood are discussed.

14.
ABSTRACT

Objectives: This study aims to test the dimensionality, reliability, and item quality of the revised UCLA loneliness scale as well as to investigate the differential item functioning (DIF) of the three dimensions of the revised UCLA loneliness scale in community-dwelling Chinese and Korean elderly individuals.

Method: Data from 493 elderly individuals (287 Chinese and 206 Korean) were used to examine the revised UCLA loneliness scale. The Rasch model, based on item response theory (IRT), was used to test dimensionality, reliability, and item fit. The hybrid ordinal logistic regression-IRT test was used to evaluate DIF.

Results: Item separation reliability, person reliability, and Cronbach’s alpha met the benchmarks. The quality of the items in the three-dimension model met the benchmark. Eight items were detected as significant DIF items (at α < .01). The loneliness level of Chinese elderly individuals was significantly higher than that of Koreans in Dimensions 1 and 2, while Korean elderly participants showed significantly higher loneliness levels than Chinese participants in Dimension 3. Several collected demographic characteristics and loneliness levels were more highly correlated in Korean elderly individuals than in Chinese elderly individuals.

Conclusion: Analysis using the three dimensions is reasonable for the revised UCLA loneliness scale. The good quality of the items suggests that the revised UCLA loneliness scale can be used to assess the intended latent traits. Finally, the differences between the levels of loneliness in Chinese and Korean elderly individuals are associated with the factors of loneliness.

15.
ABSTRACT

Students’ attitude towards science (SAS) is often a subject of investigation in science education research. Surveys using rating scales are commonly used in the study of SAS. The present study illustrates how Rasch analysis can be used to provide psychometric information about SAS rating scales. The analyses were conducted on a 20-item SAS scale used in an existing dataset of the Trends in International Mathematics and Science Study (TIMSS) (2011). Data for all the eighth-grade participants from Hong Kong and Singapore (N = 9942) were retrieved for analysis. Additional insights from Rasch analysis that are not commonly available from conventional test and item analyses were discussed, such as measurement invariance of SAS, unidimensionality of the SAS construct, optimum utilization of SAS rating categories, and item difficulty hierarchy in the SAS scale. Recommendations on how TIMSS items measuring SAS can be better designed are discussed. The study also highlights the importance of using Rasch estimates for statistical parametric tests (e.g. ANOVA, t-test) that are common in science education research for group comparisons.

16.
The use of evidence to guide policy and practice in education (Cooper, Levin, & Campbell, 2009) has included an increased emphasis on constructed-response items, such as essays and portfolios. Because assessments that go beyond selected-response items and incorporate constructed-response items are rater-mediated (Engelhard, 2002, 2013), it is necessary to develop evidence-based indices of quality for the rating processes used to evaluate student performances. This study proposes a set of criteria for evaluating the quality of ratings based on the concepts of measurement invariance and accuracy within the context of a large-scale writing assessment. Two measurement models are used to explore indices of quality for raters and ratings: the first model provides evidence for the invariance of ratings, and the second model provides evidence for rater accuracy. Rating quality is examined within four writing domains from an analytic rubric. Further, this study explores the alignment between indices of rating quality based on these invariance and accuracy models within each of the four domains of writing. Major findings suggest that rating quality varies across analytic rubric domains, and that there is some correspondence between indices of rating quality based on the invariance and accuracy models. Implications for research and practice are discussed.

17.
ABSTRACT

Automated essay scoring is a developing technology that can provide efficient scoring of large numbers of written responses. Its use in higher education admissions testing provides an opportunity to collect validity and fairness evidence to support current uses and inform its emergence in other areas such as K–12 large-scale assessment. In this study, human and automated scores on essays written by college students with and without learning disabilities and/or attention deficit hyperactivity disorder were compared, using a nationwide (U.S.) sample of prospective graduate students taking the revised Graduate Record Examination. The findings are that, on average, human raters and the automated scoring engine assigned similar essay scores for all groups, despite average differences among groups with respect to essay length and spelling errors.

18.
Abstract

This study investigated the reliability, validity, and utility of the following three measures of letter-formation quality: (a) a holistic rating system, in which examiners rated letters on a five-point Likert-type scale; (b) a holistic rating system with model letters, in which examiners used model letters that exemplified specific criterion scores to rate letters; and (c) a correct/incorrect procedure, in which examiners used transparent overlays and standard verbal criteria to score letters. Intrarater and interrater reliability coefficients revealed that the two holistic scoring procedures were unreliable, whereas scores obtained by examiners who used the correct/incorrect procedure were consistent over time and across examiners. Although all three of the target measures were sensitive to differences between individual letters, only the scores from the two holistic procedures were associated with other indices of handwriting performance. Furthermore, for each of the target measures, variability in scores was, for the most part, not attributable to the level of experience or sex of the respondents. Findings are discussed with respect to criteria for validating an assessment instrument.

19.
Abstract

The study was designed to assess the strengths and weaknesses of the nursing education preparation of associate degree nursing graduates as reflected in their job performance. The predictive relationships of measures of scholastic success, such as G.P.A. and State Board Examination scores, with graduate job performance were also investigated. A rating scale of 62 items was designed to measure the following dimensions of nursing performance: (a) planning for nursing care, (b) implementing nursing care, (c) interpersonal relationships and communication, (d) leadership and group procedures, (e) evaluating and reporting nursing care, (f) professional involvement, and (g) other. Sources for the rating scale included curriculum objectives and a field survey of performance criteria. Graduates were rated by a nurse and a physician who closely supervised their work. Graduates completed a similar rating scale in which they were asked to rate the adequacy of their educational preparation for various job requirements. Ratings were obtained from a sample of 153 graduates of the associate degree nursing program at Delta College, University Center, Michigan. Results indicated a stated need for additional clinical experience requiring total involvement of nursing students, advanced course work in pharmacology, anatomy, physiology, and nutrition, and planned leadership preparation. Findings demonstrated no significant relationship between various indices of G.P.A. and State Board Examination scores with rated job performance. It is projected that rated job performance is influenced by a number of personality variables. Physicians perceive the performance of nurses from different perspectives than do supervising nurses.

20.
ABSTRACT

In the current study, two pools of 250 essays, all written in response to the same prompt, were rated by two groups of raters (14 or 15 raters per group), thereby providing an approximation to each essay’s true score. An automated essay scoring (AES) system was trained on the datasets and then scored the essays using a cross-validation scheme. By eliminating one, two, or three raters at a time, and by calculating an estimate of the true scores using the remaining raters, an independent criterion was produced against which to judge the validity of the human raters and of the AES system, as well as the interrater reliability. The results of the study indicated that the automated scores correlate with human scores to the same degree as human raters correlate with each other. However, the findings regarding the validity of the ratings support a claim that the reliability and validity of AES diverge: although the AES scoring is, naturally, more consistent than the human ratings, it is less valid.
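The reliability-validity divergence reported above can be mimicked with a toy simulation. In the sketch below (invented parameters, not the study's data), the simulated "engine" is deterministic but deliberately weights a construct-irrelevant proxy such as essay length: rescoring by the machine is perfectly reproducible while two human readings are not, yet the machine's correlation with the true-score criterion is lower.

```python
import random

random.seed(3)
N = 4000

def corr(xs, ys):
    """Pearson correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

true_score = [random.gauss(0, 1) for _ in range(N)]   # criterion (≈ pooled raters)
proxy = [random.gauss(0, 1) for _ in range(N)]        # construct-irrelevant feature

human1 = [t + random.gauss(0, 1) for t in true_score] # two independent human readings
human2 = [t + random.gauss(0, 1) for t in true_score]

def engine(i):                                        # deterministic: no rating noise,
    return 0.5 * true_score[i] + 0.866 * proxy[i]     # but partly scores the proxy

aes_run1 = [engine(i) for i in range(N)]
aes_run2 = [engine(i) for i in range(N)]              # rescoring reproduces scores exactly

print("human-human reliability:", round(corr(human1, human2), 2))
print("AES rescore reliability:", round(corr(aes_run1, aes_run2), 2))
print("human validity:", round(corr(human1, true_score), 2))
print("AES validity:  ", round(corr(aes_run1, true_score), 2))
```

The machine wins on consistency by construction and loses on validity by construction; the study's point is that real AES systems can sit in exactly this position, so reliability evidence alone cannot certify them.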


Copyright © 北京勤云科技发展有限公司 (Beijing Qinyun Technology Development Co., Ltd.)  京ICP备09084417号