Similar literature
1.
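Equating and vertical scaling both serve to establish relationships between scores from different tests. Equating is applied to test forms of the same grade level and the same nature, whereas vertical scaling is applied to tests of similar nature at different grade levels. Vertical scaling places results from different grades on a single growth score scale. A vertical scale is an extended score scale whose metric spans and links the grades, and it is used to assess students' continuous growth in achievement (Nitko, 2004). In teaching, student progress can be monitored and evaluated with a vertical scale. In educational research, a vertical scale can be a powerful tool for longitudinal studies. This paper discusses the methodology of vertical scaling, including the definition of growth, data collection methods, test design, and methods based on Item Response Theory, and offers some practical suggestions for constructing a vertical scale.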

2.
Most growth models implicitly assume that test scores have been vertically scaled. What may not be widely appreciated are the different choices that must be made when creating a vertical score scale. In this paper empirical patterns of growth in student achievement are compared as a function of different approaches to creating a vertical scale. Longitudinal item‐level data from a standardized reading test are analyzed for two cohorts of students between Grades 3 and 6 and Grades 4 and 7 for the entire state of Colorado from 2003 to 2006. Eight different vertical scales were established on the basis of choices made for three key variables: Item Response Theory modeling approach, linking approach, and ability estimation approach. It is shown that interpretations of empirical growth patterns appear to depend upon the extent to which a vertical scale has been effectively “stretched” or “compressed” by the psychometric decisions made to establish it. While all of the vertical scales considered show patterns of decelerating growth across grade levels, there is little evidence of scale shrinkage.

3.
Test scores matter these days. Test‐takers want to understand how they performed, and test score reports, particularly those for individual examinees, are the vehicles by which most people get the bulk of this information. Historically, score reports have not always met the examinees’ information or usability needs, but this is clearly changing for the better due to recent, much‐needed additions to the psychometric literature as well as improved efforts in reporting practices. This paper provides an overview of score reports from a development perspective, focusing on current practices and emerging efforts in content of reports as well as the process by which reports are designed, evaluated, and ultimately used to communicate with the public.

4.
In this digital ITEMS module, Dr. Michael Bunch provides an in-depth, step-by-step look at how standard setting is done. It does not focus on any specific procedure or methodology (e.g., modified Angoff, bookmark, and body of work) but on the practical tasks that must be completed for any standard setting activity. Dr. Bunch carries the participant through every stage of the standard setting process, from developing a plan, through preparations for standard setting, conducting standard setting, and all the follow-up activities that must occur after standard setting in order to obtain the approval of cut scores and translate those cut scores into score reports. The digital module includes a 120-page manual, various ancillary files (e.g., PowerPoint slides, Excel workbooks, sample documents, and forms), links to datasets from the book Standard Setting (Cizek & Bunch, 2007), links to final reports from four recent large-scale standard setting events, quiz questions with formative feedback, and a glossary.

5.
韩宁 《考试研究》2009,(4):68-78
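Score reporting is an important part of fulfilling the function of educational testing. Testing agencies should adopt the view that examinees are consumers and provide score information that is as accurate, complete, and easy to understand as possible. This paper identifies common problems in test score reports, introduces the requirements for score reporting in the AERA/APA/NCME standards, and discusses basic principles of score report design. It also examines several technical details and describes concrete, practicable approaches to item mapping, vertical scales, and diagnostic score reports.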

6.
《Educational Assessment》2013,18(2):203-206
This rejoinder responds to the major statements and claims made in Clemans (this issue). The arbitrary and unrealistic assumptions made by the Thurstone procedure are described. We point out the logical inconsistency of Clemans's claim that the relationship between raw scores and abilities holds when transforming abilities into raw scores but not when transforming raw scores into abilities. Two effects that Clemans claims are caused by item response theory (IRT) scaling are examined, and we demonstrate that they occur more often with Thurstone scaling than with IRT scaling. We reiterate our belief in the superiority of IRT scaling over Thurstone scaling.

7.
We make a distinction between two types of test changes: inevitable deviations from specifications versus planned modifications of specifications. We describe how score equity assessment (SEA) can be used as a tool to assess a critical aspect of construct continuity, the equivalence of scores, whenever planned changes are introduced to testing programs. We also report on how SEA can be used as a quality control check to evaluate whether tests developed to a static set of specifications remain within acceptable tolerance levels with respect to equatability.

8.
As a universal result on turbulent scales, scaling laws are important to research on statistical turbulence. We measured the two-dimensional instantaneous velocity field in the turbulent boundary layer of a flat plate at a momentum thickness Reynolds number of Re_θ = 2167. Scaling laws take different forms at different wall distances and scales. We proposed an expected scaling law and compared it with the She-Leveque (SL) scaling law, using wavelet analysis and traditional statistical methods. The results show that the closer to the wall, the more closely the expected scaling law approaches the SL scaling law.

9.
This paper demonstrates that ‘failure’ is not a direct reflection of student knowledge. Using five years of New York State school-level data, we compare passing rates to raw scores. We find, first, that when ‘cut scores’ are raised, more students fail even if raw scores are increasing. Second, increasing cut scores disproportionately fails more poor students than non-poor students, even though poor students have the fastest rates of raw score improvement. Third, raised cut scores transform the smallest raw score gaps between high- and low-poverty schools into the largest passing gaps. Thus, while students in poor schools know more than they did previously, and although they have learned at superior rates, they are recast as the biggest ‘failures’ they have ever been.
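The first finding in this abstract (more students fail after a cut score is raised, even though raw scores go up) can be illustrated with a small sketch. All numbers below are hypothetical and chosen only for illustration; they are not the New York State data, and the `pass_rate` helper is an assumed name, not something from the paper.

```python
def pass_rate(scores, cut):
    """Return the fraction of scores at or above the cut score."""
    return sum(s >= cut for s in scores) / len(scores)

# Hypothetical raw scores for the same five students in two years.
year1 = [48, 55, 61, 67, 74]
year2 = [s + 3 for s in year1]  # every student gains 3 raw-score points

# Hypothetical cut scores: the bar is raised by more than the students gained.
old_cut, new_cut = 60, 65

rate_before = pass_rate(year1, old_cut)  # three of five students pass
rate_after = pass_rate(year2, new_cut)   # two of five students pass

print(f"mean raw score: {sum(year1) / 5:.1f} -> {sum(year2) / 5:.1f}")
print(f"passing rate:   {rate_before:.0%} -> {rate_after:.0%}")
```

In this toy case the passing rate falls only because the cut score moved further than the scores did, which is the mechanism the abstract describes.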

10.
This module describes and extends X‐to‐Y regression measures that have been proposed for use in the assessment of X‐to‐Y scaling and equating results. Measures are developed that are similar to those based on prediction error in regression analyses but that are directly suited to interests in scaling and equating evaluations. The regression and scaling function measures are compared in terms of their uncertainty reductions, error variances, and the contribution of true score and measurement error variances to the total error variances. The measures are also demonstrated as applied to an assessment of scaling results for a math test and a reading test. The results of these analyses illustrate the similarity of the regression and scaling measures for scaling situations when the tests have a correlation of at least .80, and also show the extent to which the measures can be adequate summaries of nonlinear regression and nonlinear scaling functions, and of heteroskedastic errors. After reading this module, readers will have a comprehensive understanding of the purposes, uses, and differences of regression and scaling functions.

11.
This paper presents a framework to provide a structured approach for developing score reports for cognitive diagnostic assessments (CDAs). Guidelines for reporting and presenting diagnostic scores are based on a review of current educational test score reporting practices and literature from the area of information design. A sample diagnostic report is presented to illustrate application of the reporting framework in the context of one CDA procedure called the Attribute Hierarchy Method (AHM). Integration and application of interdisciplinary techniques from education, information design, and technology are required for effective score reporting. While the AHM is used in this paper, this framework is applicable to any attribute-based diagnostic testing method.

12.
Standard errors of measurement of scale scores by score level (conditional standard errors of measurement) can be valuable to users of test results. In addition, the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1985) recommends that conditional standard errors be reported by test developers. Although a variety of procedures are available for estimating conditional standard errors of measurement for raw scores, few procedures exist for estimating conditional standard errors of measurement for scale scores from a single test administration. In this article, a procedure is described for estimating the reliability and conditional standard errors of measurement of scale scores. This method is illustrated using a strong true score model. Practical applications of this methodology are given. These applications include a procedure for constructing score scales that equalize standard errors of measurement along the score scale. Also included are examples of the effects of various nonlinear raw-to-scale score transformations on scale score reliability and conditional standard errors of measurement. These illustrations examine the effects on scale score reliability and conditional standard errors of measurement of (a) the different types of raw-to-scale score transformations (e.g., normalizing scores), (b) the number of scale score points used, and (c) the transformation used to equate alternate forms of a test. All the illustrations use data from the ACT Assessment testing program.

13.
Performance on a standardized reading comprehension test reflects the number of correct answers readers select from a list of alternate choices, but fails to provide information about how readers cope with the various cognitive demands of the task. The aim of this study was to determine whether three groups of readers, normally achieving (NA), poor comprehenders with no decoding disability (CD), and reading disabled (RD; poor comprehenders with poor decoding skills), differed in their ability to cope with reading comprehension task demands. Three task variables reflected in the question-answer relations that appear on standardized reading comprehension tests were identified. Passage Independent (PI) questions can be answered with reasonable accuracy based on the reader's prior knowledge of the passage content. Inference (INFER) questions require the reader to generate an inference at the local or global test level. Locating (LOCAT) questions require the reader to match the correct answer choice to a detail explicitly stated in the text either verbatim or in paraphrase form. The relations among reader characteristics, cognitive task factors and reading comprehension test scores were analyzed using a structural relations equation with LISREL. It was found that the three reading groups differed with respect to the underlying relationship between their performance on specific question-answer types and their standardized reading comprehension score. For the NA group, a high score on PI was likely to be accompanied by a low score on INFER, whereas in the CD and RD groups, PI and INFER were positively related. The finding of a negative relationship between background knowledge and inference task factors for normally achieving readers suggests that even normal readers may have comprehension difficulties that go undetected on the basis of standardized scores. This study indicates that current comprehension assessments may not be adequate for assessing specific reading difficulties and that more precise diagnostic tools are needed.

14.
Bringing effective practices to scale across large systems requires attending to how information and belief systems come together in decisions to adopt, implement, and sustain those practices. Statewide scaling of the Pyramid Model, a framework for positive behavior intervention and support, across different types of early childhood programs (i.e., Head Start, early childhood special education, and school readiness) is used to describe how decision-making models may enhance professional development efforts. Research Findings: A theoretical model is presented based on implementation science, empirical knowledge, and practice evidence from one state’s experience trying to bring the Pyramid Model to scale across different types of early childhood programs. In this model, attention is given to how professional development systems may need to extend beyond the current focus on enhancing knowledge and skills to also address the belief systems of practitioners, administrators, and policymakers that influence implementation. Practice or Policy: Decision making and program characteristics are discussed relative to competency, organizational, and leadership drivers that may vary between different types of early childhood programs. Implications for statewide professional development systems and future research are discussed.

15.
In educational assessment, overall scores obtained by simply averaging a number of domain scores are sometimes reported. However, simply averaging the domain scores ignores the fact that different domains have different score points, that scores from those domains are related, and that at different score points the relationship between overall score and domain score may be different. To report reliable and valid overall scores and domain scores, I investigated the performance of four methods using both real and simulation data: (a) the unidimensional IRT model; (b) the higher-order IRT model, which simultaneously estimates the overall ability and domain abilities; (c) the multidimensional IRT (MIRT) model, which estimates domain abilities and uses the maximum information method to obtain the overall ability; and (d) the bifactor general model. My findings suggest that the MIRT model not only provides reliable domain scores, but also produces reliable overall scores. The overall score from the MIRT maximum information method has the smallest standard error of measurement. In addition, unlike the other models, there is no linear relationship assumed between overall score and domain scores. Recommendations for sizes of correlations between domains and the number of items needed for reporting purposes are provided.

16.
The authors sought to better understand the relationship between students participating in the Advanced Placement (AP) program and subsequent performance on the Scholastic Aptitude Test (SAT). Focusing on students graduating from U.S. public high schools in 2010, the authors used propensity scores to match junior year AP examinees in 3 subjects to similar students who did not take any AP exams in high school. Multilevel regression models with these matched samples demonstrate a mostly positive relationship between AP exam participation and senior year SAT performance, particularly for students who score a 3 or higher. Students who enter into the AP year with relatively lower initial achievement are predicted to perform slightly better on later SAT tests than students with similar initial achievement who do not participate in AP.

17.
Two methods of constructing equal-interval scales for educational achievement are discussed: Thurstone's absolute scaling method and Item Response Theory (IRT). Alternative criteria for choosing a scale are contrasted. It is argued that clearer criteria are needed for judging the appropriateness and usefulness of alternative scaling procedures, and more information is needed about the qualities of the different scales that are available. In answer to this second need, some examples are presented of how IRT can be used to examine the properties of scales: It is demonstrated that for observed score scales in common use (i.e., any scores that are influenced by measurement error), (a) systematic errors can be introduced when comparing growth at selected percentiles, and (b) normalizing observed scores will not necessarily produce a scale that is linearly related to an underlying normally distributed true trait.

18.
Past research into the relationship between English proficiency test (EPT) scores and score profiles, such as those from the IELTS and the TOEFL, has shown that there is not always a clear relationship between those scores and students’ subsequent academic achievement. Information about students’ academic self-concept (ASC) may provide additional information that helps predict future academic success. Research has consistently shown a positive relationship between students’ ASC and subsequent academic achievement and educational attainment in both school and higher education settings. The purpose of the current study was to examine the relationship between the academic performance of international students and their language proficiency and academic self-concept as well as other characteristics related to academic success. The study focused on first year international students in undergraduate business programs at an English-medium university in Canada. The following information was collected about the student participants: grades in degree program courses, annual GPA, and EPT scores (including subscores). In addition, students completed an academic self-concept scale. To obtain additional information about success in first-year business courses, instructors in two required courses were interviewed about the academic and language requirements in their courses and the profile of successful students. Correlations between the students’ course grades, GPA, EPT scores, and ASC score were calculated. The instructor interviews were analyzed using a content analysis procedure. The findings from all data sources were triangulated and show that language ability, ASC, and other factors impact academic success during the first year in a business program. The implications of these findings are discussed.

19.
Ultrasonography is increasingly used in medical education, but its impact on learning outcomes is unclear. Adding ultrasound may facilitate learning, but may also potentially overwhelm novice learners. Based upon the framework of cognitive load theory, this study seeks to evaluate the relationship between cognitive load associated with using ultrasound and learning outcomes. The use of ultrasound was hypothesized to facilitate learning in anatomy for 161 novice first‐year medical students. Using linear regression analyses, the relationship between reported cognitive load on using ultrasound and learning outcomes as measured by anatomy laboratory examination scores four weeks after ultrasound‐guided anatomy training was evaluated in consenting students. Second anatomy examination scores of students who were taught anatomy with ultrasound were compared with historical controls (those not taught with ultrasound). Ultrasound's perceived utility for learning was measured on a five‐point scale. Cognitive load on using ultrasound was measured on a nine‐point scale. Primary outcome was the laboratory examination score (60 questions). Learners found ultrasound useful for learning. Weighted factor score on “image interpretation” was negatively, but insignificantly, associated with examination scores [F (1,135) = 0.28, beta = −0.22; P = 0.61]. Weighted factor score on “basic knobology” was positively and insignificantly associated with scores [F (1,138) = 0.27, beta = 0.42; P = 0.60]. Cohorts exposed to ultrasound had significantly higher scores than historical controls (82.4% ± SD 8.6% vs. 78.8% ± 8.5%, Cohen's d = 0.41, P < 0.001). Using ultrasound to teach anatomy does not negatively impact learning and may improve learning outcomes. Anat Sci Educ 10: 144–151. © 2016 American Association of Anatomists.

20.
Psychometric properties of item response theory proficiency estimates are considered in this paper. Proficiency estimators based on summed scores and pattern scores include non-Bayes maximum likelihood and test characteristic curve estimators and Bayesian estimators. The psychometric properties investigated include reliability, conditional standard errors of measurement, and score distributions. Four real-data examples include (a) effects of choice of estimator on score distributions and percent proficient, (b) effects of the prior distribution on score distributions and percent proficient, (c) effects of test length on score distributions and percent proficient, and (d) effects of proficiency estimator on growth-related statistics for a vertical scale. The examples illustrate that the choice of estimator influences score distributions and the assignment of examinees to proficiency levels. In particular, for the examples studied, the choice of Bayes versus non-Bayes estimators had a more serious practical effect than the choice of summed versus pattern scoring.

