Similar literature
 20 similar documents found; search time 171 ms
1.
Numerous researchers have proposed methods for evaluating the quality of rater‐mediated assessments using nonparametric methods (e.g., kappa coefficients) and parametric methods (e.g., the many‐facet Rasch model). Generally speaking, popular nonparametric methods for evaluating rating quality are not based on a particular measurement theory. On the other hand, popular parametric methods for evaluating rating quality are often based on measurement theories such as invariant measurement. However, these methods are based on assumptions and transformations that may not be appropriate for ordinal ratings. In this study, I show how researchers can use Mokken scale analysis (MSA), which is a nonparametric approach to item response theory, to evaluate rating quality within the framework of invariant measurement without the use of potentially inappropriate parametric techniques. I use an illustrative analysis of data from a rater‐mediated writing assessment to demonstrate how one can use numeric and graphical indicators from MSA to gather evidence of validity, reliability, and fairness. The results from the analyses suggest that MSA provides a useful framework within which to evaluate rater‐mediated assessments for evidence of validity, reliability, and fairness that can supplement existing popular methods for evaluating ratings.
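
As a rough illustration of the kappa coefficients this abstract names as a popular nonparametric index, the sketch below computes Cohen's kappa for two raters. It is not taken from the study: the function name `cohens_kappa` and the essay scores are hypothetical, and the ratings are treated as unordered categories, which is the simplest (unweighted) form of the statistic.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equally long, non-empty rating lists")
    n = len(ratings_a)
    # Observed agreement: share of items on which both raters gave the same score.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of the raters' marginal proportions, summed over categories.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    p_chance = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both raters used one and the same category throughout
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical scores from two raters on ten essays (1-4 scale).
rater_1 = [1, 2, 2, 3, 3, 3, 4, 4, 2, 1]
rater_2 = [1, 2, 3, 3, 3, 2, 4, 4, 2, 2]
kappa = cohens_kappa(rater_1, rater_2)
```

Because unweighted kappa ignores the ordering of the score categories, it treats a one-point disagreement the same as a three-point one; that limitation is one reason ordinal ratings are often examined with other tools.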

2.
The purpose of this study is to explore the reliability of a potentially more practical approach to direct writing assessment in the context of ESL writing. Traditional rubric rating (RR) is a common yet resource-intensive evaluation practice when performed reliably. This study compared the traditional rubric model of ESL writing assessment and many-facet Rasch modeling (MFRM) to comparative judgment (CJ), the new approach, which shows promising results in terms of reliability. We employed two groups of raters (novice and experienced) and used essays that had been previously double-rated, analyzed with MFRM, and selected with fit statistics. We compared the results of the novice and experienced groups against the initial ratings using raw scores, MFRM, and a modern form of CJ, randomly distributed comparative judgment (RDCJ). Results showed that the CJ approach, though not appropriate for all contexts, can be as reliable as RR while showing promise as a more practical approach. Additionally, CJ is easily transferable to novel assessment tasks while still providing context-specific scores. Results from this study will not only inform future studies but can help guide ESL programs in selecting a rating model best suited to their specific needs.

3.
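Application of the many-facet Rasch model in rater training for subjective test items   Total citations: 7, self-citations: 2, citations by others: 7
The scoring of subjective items is affected by many factors, such as raters' knowledge, overall ability, and personal preferences. These rater biases not only produce subjective differences between raters but also make the same rater's scoring unstable over time, ultimately lowering the reliability of subjective-item scoring. This study applies the many-facet Rasch model to rater training for the essay questions of a national-level examination. By analyzing trial-scoring data from six experienced raters on 58 scripts, the study identified four types of rater bias and then gave each rater individual feedback accordingly, thereby improving the objectivity and precision of scoring.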

4.
Cognitive pretesting (CP) is an interview methodology for pretesting the validity of items during the development of self-report instruments. The present research evaluates a systematic approach to the analysis of CP data. Materials and procedures were developed to rate self-report item performance with CP interview text data. Five raters were trained in the application of that system. Estimates of inter-rater reliability found acceptable to substantial levels of inter-rater agreement. Results from the present study suggest that excellent inter-rater reliability can be achieved in the evaluation of CP data. Guidelines for systematically rating the qualitative data collected using CP methods are provided. Future research should focus on empirical demonstrations of how such rating procedures can lead to improvements in self-report instruments.

5.
Quantitative student evaluations of teaching (SET) and assessments are widely used in higher education as a proxy for teaching quality. However, SET are a function of individual rating behaviours resulting from student background, knowledge and personalities, as well as the learning experience being rated. SET from three years of data from a science department at a Russell Group University in the UK were analysed to highlight issues of sample size in relation to variable perceptions of modules, and develop a statistical model of feedback incorporating individual rating behaviours across modules. Key results are that sample size and individual rating behaviours have the potential to significantly affect summary module ratings, especially for <20 respondents or if individuals have heterogeneous views. A new approach is suggested, to interpret and compare quantitative module ratings, acknowledging uncertainty, variability and individual rating behaviours. This has implications for the interpretation of SET in many aspects of academic life, including university league table positions, the identification of good teaching practice with respect to student satisfaction, and the weight given to SET in individual academics’ promotion applications.

6.
7.
A general procedure is proposed to estimate the reliability of a dual-span rotor based on nonparametric modeling of random uncertainty. First, the vibration equation of the rotor with random uncertainty is constructed based on random matrices through the nonparametric modeling approach. Second, the reliability estimation is performed by response spectral analysis and the moment method. By making full use of the advantages of the nonparametric method and response spectral analysis, not only is the requirement for a probability density function (PDF) avoided, but the first and second moments also no longer need to be estimated or assumed for calculating the reliability. Finally, the statistical index Z*-value based on short-term predictability is introduced to investigate the influence of random uncertainties on the reliability of the dual-span rotor. Illustrative examples show that the results obtained from the proposed procedure are consistent with those from short-term predictability, such that dangerous ranges can be well identified during the start-up process of the rotor.

8.
In this paper a new approach to graphical differential item functioning (DIF) is offered. The methodology is based on a sampling-theory approach to expected response functions (Lewis, 1985; Mislevy, Wingersky, & Sheehan, 1994). Essentially, error in item calibrations is modeled explicitly, and repeated samples are taken from the posterior distributions of the item parameters. Sampled parameter values are used to estimate the posterior distribution of the difference in item characteristic curves (ICCs) for two groups. A point-wise expectation is taken as an estimate of the true difference between the ICCs, and the sampled-difference functions indicate uncertainty in the estimate. The approach is applied to a set of pretest items, and the results are compared to traditional Mantel-Haenszel DIF statistics. The expected-response-function approach is contrasted with Pashley's (1992) graphical DIF approach.
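
The sampling idea described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' procedure: the ICC is assumed to be a two-parameter logistic curve, each posterior is approximated by independent normal distributions, and every name and number (`icc_2pl`, `sampled_icc_difference`, the posterior means and standard deviations) is hypothetical.

```python
import numpy as np


def icc_2pl(theta, a, b):
    """Two-parameter logistic item characteristic curve (an assumed model)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))


def sampled_icc_difference(post_ref, post_focal, theta, n_draws=2000, seed=0):
    """Sample item parameters per group and return the sampled ICC differences.

    post_ref / post_focal hold posterior means and SDs for a and b; the normal
    approximation is an assumption made for this sketch.
    Returns an array of shape (n_draws, len(theta)).
    """
    rng = np.random.default_rng(seed)

    def draw(post):
        a = rng.normal(post["a_mean"], post["a_sd"], size=(n_draws, 1))
        b = rng.normal(post["b_mean"], post["b_sd"], size=(n_draws, 1))
        return icc_2pl(theta[None, :], a, b)  # broadcasts to (n_draws, len(theta))

    return draw(post_ref) - draw(post_focal)


theta = np.linspace(-3, 3, 61)
# Hypothetical calibration results: the item is harder for the focal group.
reference = {"a_mean": 1.2, "a_sd": 0.1, "b_mean": 0.0, "b_sd": 0.1}
focal = {"a_mean": 1.2, "a_sd": 0.1, "b_mean": 0.4, "b_sd": 0.1}

diffs = sampled_icc_difference(reference, focal, theta)
expected_diff = diffs.mean(axis=0)                        # point-wise expectation
lower, upper = np.percentile(diffs, [2.5, 97.5], axis=0)  # point-wise spread of sampled differences
```

Plotting `expected_diff` against `theta` with the `lower` and `upper` band would give a graphical DIF display in the spirit of the abstract; a band that excludes zero over some ability range suggests a group difference there.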

9.
The use of surveys, questionnaires, and rating scales to measure important outcomes in higher education is pervasive, but reliability and validity information is often based on problematic Classical Test Theory approaches. Rasch Analysis, based on Item Response Theory, provides a better alternative for examining the psychometric quality of rating scales and informing scale improvements. This paper outlines a six-step process for using Rasch Analysis to review the psychometric properties of a rating scale. The Partial Credit Model and Andrich Rating Scale Model will be described in terms of the psychometric information (i.e., reliability, validity, and item difficulty) and diagnostic indices generated. Further, this approach will be illustrated through the example of authentic data from a university-wide student evaluation of teaching.
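
To make the Andrich Rating Scale Model mentioned in this abstract more concrete, here is a minimal sketch of its category probabilities. It is not part of the paper's six-step process; the function name `rating_scale_probs` and the parameter values are hypothetical, and the person location, item location, and thresholds would normally come from fitted estimates.

```python
import math


def rating_scale_probs(theta, delta, thresholds):
    """Category probabilities under the Andrich rating scale model.

    theta: person location; delta: item location;
    thresholds: tau_1..tau_M, shared by all items (categories scored 0..M).
    """
    # Cumulative sums of (theta - delta - tau_j); category 0 has an empty sum of 0.
    exponents = [0.0]
    for tau in thresholds:
        exponents.append(exponents[-1] + (theta - delta - tau))
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(exponents)
    weights = [math.exp(e - top) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical four-category item (scores 0-3) and a person slightly above it.
probs = rating_scale_probs(theta=0.5, delta=0.0, thresholds=[-1.0, 0.0, 1.0])
expected_score = sum(k * p for k, p in enumerate(probs))
```

The Partial Credit Model has the same form but lets each item have its own thresholds, so the same function could be reused by passing item-specific `thresholds`.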

10.
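Subjective examinations rely on raters' subjective scoring. Because rater consistency is low and reliability is lacking, the measurement community has long sought ways to improve the reliability of subjective scoring. This paper uses the Longford method to carry out an empirical check for aberrant raters among those who scored the HSK (Advanced) essay test. The results show that the method is indeed effective for detecting rater differences in large-scale standardized subjective examinations.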

11.
Inclusive teaching tasks have consistently been found challenging for teachers, but it is unclear how they are ranked in terms of the extent of self-efficacy required. This study aimed at deriving such a hierarchy. A survey was conducted on 107 primary school teachers in Hong Kong using the Teacher Efficacy for Inclusive Practices scale. A Rasch rating scale model was applied to empirically examine the hierarchical structure. Good person reliability (0.89) and model fit (MNSQ 0.6–1.4) were achieved. Managing physical aggression was found at the top of the hierarchy; this and other results could facilitate the identification of training needs.

12.
In applications of structural equation modeling, it is often desirable to obtain measures of uncertainty for special functions of model parameters. This article provides a didactic discussion of how a method widely used in applied statistics can be employed for approximate standard error and confidence interval evaluation of such functions. The described approach is illustrated with data from a cognitive intervention study, in which it is used to estimate time-invariant reliability in multiwave, multiple indicator models.

13.
The authors address the reliability of scores obtained on the summative performance assessments during the pilot year of our research. Contrary to classical test theory, we discussed the advantages of using generalizability theory for estimating reliability of scores for summative performance assessments. Generalizability theory was used as the framework because of the flexibility this approach provides for examining sources of inconsistency within a complex assessment. Two major sources of inconsistency on scores considered in this study were raters and agencies (teachers' rating vs. researchers' rating). Overall, results showed that the inconsistency in scores attributable to raters and agencies was relatively small. Suggestions regarding improvement of consistency in the subsequent years of our research were provided.

14.
Operational reliability evaluation theory reflects the real-time reliability level of a power system. The component failure rate varies with operating conditions. The impact of real-time operating conditions such as ambient temperature and transformer MVA (megavolt-ampere) loading on transformer insulation life is studied in this paper. The formula of transformer failure rate based on the winding hottest-spot temperature (HST) is given. Thus the real-time reliability model of the transformer based on operating conditions is presented. The work is illustrated using the 1979 IEEE Reliability Test System. The changes of operating conditions are simulated by using an hourly load curve and temperature curve, so the curves of real-time reliability indices are obtained by using operational reliability evaluation.
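
The abstract does not reproduce its HST-based failure-rate formula, so the sketch below does not attempt to reconstruct it. As a related illustration only, it shows the aging acceleration factor from the IEEE C57.91 loading guide, which expresses how fast insulation ages at a given hottest-spot temperature relative to a 110 °C reference. The function names and the hourly temperature values are hypothetical, and linking this factor to a failure rate would need the paper's own model.

```python
import math

REFERENCE_HST_C = 110.0   # reference hottest-spot temperature for 65 degC-rise insulation
AGING_CONSTANT = 15000.0  # temperature constant used in the IEEE C57.91 aging equation


def aging_acceleration_factor(hst_celsius):
    """Insulation aging rate at the given HST relative to aging at 110 degC."""
    return math.exp(
        AGING_CONSTANT / (REFERENCE_HST_C + 273.0)
        - AGING_CONSTANT / (hst_celsius + 273.0)
    )


def equivalent_aging_factor(hourly_hst_celsius):
    """Average acceleration factor over equal-length (here hourly) intervals."""
    factors = [aging_acceleration_factor(t) for t in hourly_hst_celsius]
    return sum(factors) / len(factors)


# Hypothetical hottest-spot temperatures over six consecutive hours.
hourly_hst = [85.0, 95.0, 110.0, 120.0, 105.0, 90.0]
feqa = equivalent_aging_factor(hourly_hst)
```

The exponential form means a few hot hours can dominate the average, which is consistent with the abstract's point that reliability indices should follow the hourly load and temperature curves rather than a single fixed rate.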

15.
This study investigated the reliability, validity, and utility of the following three measures of letter-formation quality: (a) a holistic rating system, in which examiners rated letters on a five-point Likert-type scale; (b) a holistic rating system with model letters, in which examiners used model letters that exemplified specific criterion scores to rate letters; and (c) a correct/incorrect procedure, in which examiners used transparent overlays and standard verbal criteria to score letters. Intrarater and interrater reliability coefficients revealed that the two holistic scoring procedures were unreliable, whereas scores obtained by examiners who used the correct/incorrect procedure were consistent over time and across examiners. Although all three of the target measures were sensitive to differences between individual letters, only the scores from the two holistic procedures were associated with other indices of handwriting performance. Furthermore, for each of the target measures, variability in scores was, for the most part, not attributable to the level of experience or sex of the respondents. Findings are discussed with respect to criteria for validating an assessment instrument.

16.
The purpose of this study was to examine a technique for the development of performance rating scales to measure achievement in courses whose objectives require complex behaviors not easily measurable with paper and pencil achievement tests. A facet-factorial approach to rating scale construction was employed (i.e. the behavior was conceptualized as multidimensional and items for the scales were selected by employing factor analytical techniques) to construct scales to measure clarinet music performance. The three major results of the study were: 1) a thirty-item rating scale based on a six factor structure of clarinet music performance; 2) high inter-judge reliability estimates for both the total score (above .90) and the scale scores (above .60); and, 3) criterion-related validity coefficients greater than .80. Results of the investigation suggest that the facet-factorial approach can be an effective technique for the construction of rating scales to measure complex behavior such as music performance.

17.
In traditional Bayesian software reliability models, it is assumed that all probabilities are precise. In practical applications the parameters of the probability distributions are often uncertain because they depend strongly on subjective information from experts’ judgments on sparse statistical data. In this paper, a quasi-Bayesian software reliability model using interval-valued probabilities to clearly quantify experts’ prior beliefs on possible intervals of the parameters of the probability distributions is presented. The model integrates experts’ judgments with statistical data to obtain more convincing assessments of software reliability with small samples. For some actual data sets, the presented model yields better predictions than the Jelinski-Moranda (JM) model using maximum likelihood (ML). Project supported by the National High-Technology Research and Development Program of China (Grant Nos. 2006AA01Z187, 2007AA040605)

18.
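To improve the measurement accuracy and reliability of the femtosecond-laser transient thermoreflectance technique, the relative optical path difference between the pump and probe beams was extended in order to improve the fit between the experimental data and the theoretical model. A new test bench was built by placing in the pump path a mechanical translation stage identical to the one in the probe path, which increased the optical path difference between the two beams from 4 ns to 8 ns. The measurements show that, once the data from the second 4 ns are smoothly joined to the data from the first 4 ns, the measurement error caused by divergence and drift between the pump and probe beams is negligible. Measurements of the interfacial thermal conductance of Al/Si and Cr/Si samples agree closely with values reported in the literature, confirming the reliability of the improved system.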

19.
Empirical analysis requires researchers to choose which variables to use as controls in their models. Theory should dictate this choice, yet often in social science there are several theories that may suggest the inclusion or exclusion of certain variables as controls. The result of this is that researchers may use different variables in their models and come to disparate conclusions with respect to predicted effects and their statistical significance. In such cases one is uncertain of which particular set of regressors forms the model that represents the data. The approach used below accounts for uncertainty in variable selection by using Bayesian model averaging (BMA). Accounting for uncertainty, we demonstrate that BMA provides better out-of-sample prediction for university graduation rates than results based on alternative variable selection methods.
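
One common way to carry out Bayesian model averaging over candidate regressors is sketched below. It is an assumption-laden illustration rather than the authors' method: posterior model probabilities are approximated with BIC weights under equal prior model probabilities, the models are ordinary least-squares fits, and the data are simulated. The names `bic_linear` and `bma_predict` are hypothetical.

```python
import itertools

import numpy as np


def bic_linear(y, design):
    """BIC and coefficients of an OLS fit with Gaussian errors (design includes the intercept)."""
    n, k = design.shape
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    return n * np.log(rss / n) + k * np.log(n), beta


def bma_predict(y, X, X_new):
    """Average predictions over all predictor subsets, weighted by exp(-BIC / 2)."""
    n, p = X.shape
    bics, preds = [], []
    for size in range(p + 1):
        for subset in itertools.combinations(range(p), size):
            cols = list(subset)
            design = np.column_stack([np.ones(n), X[:, cols]])
            design_new = np.column_stack([np.ones(len(X_new)), X_new[:, cols]])
            bic, beta = bic_linear(y, design)
            bics.append(bic)
            preds.append(design_new @ beta)
    bics = np.array(bics)
    # BIC approximation to posterior model probabilities (equal prior weights assumed).
    weights = np.exp(-0.5 * (bics - bics.min()))
    weights /= weights.sum()
    return weights @ np.array(preds), weights


# Simulated data: only the first of three candidate controls matters.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 2.0 + 1.5 * X[:, 0] + rng.normal(scale=0.5, size=200)
X_new = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])
pred, weights = bma_predict(y, X, X_new)
```

Enumerating every subset is only practical for a handful of candidate variables; with many regressors, BMA implementations typically sample the model space instead.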

20.
A multilevel analysis approach was used to analyse students’ evaluation of teaching (SET). The low value of inter-rater reliability stresses that any solid conclusions on teaching cannot be made on the basis of single feedbacks. To assess a teacher’s general teaching effectiveness, one needs to evaluate four randomly chosen course implementations. Two implementations are needed when one course is evaluated, and if one implementation is evaluated, up to 15 feedbacks are needed. The stability of students’ ratings is very high, which reflects students’ stable rating criteria. There is an obvious rating paradox: from the student’s point of view, each rating is very precise, stable and justifiable, but from the teacher’s point of view a single feedback reflects the quality of teaching to just a moderate extent. Cross-hierarchical analysis reveals that there are large discrepancies between the uses of rating scales; some students are systematically more lenient in their rating whereas others are systematically more severe. The study also reveals that some courses are generally rated more favourably and that some courses are more suitable for certain teachers. Managers can thus improve the quality of teaching by finding the most suitable courses for each teacher.
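
The question of how many feedbacks are needed can be illustrated with the Spearman-Brown prophecy formula, which relates the reliability of a single rating to the reliability of the mean of several ratings. This is a simplified sketch, not the multilevel analysis used in the study; the function names and the single-rating reliability of 0.25 are hypothetical.

```python
import math


def reliability_of_mean(single_rating_reliability, n_ratings):
    """Spearman-Brown: reliability of the mean of n parallel ratings."""
    r = single_rating_reliability
    return n_ratings * r / (1.0 + (n_ratings - 1) * r)


def ratings_needed(single_rating_reliability, target_reliability):
    """Smallest number of ratings whose mean reaches the target reliability."""
    r, t = single_rating_reliability, target_reliability
    if not (0.0 < r < 1.0 and 0.0 < t < 1.0):
        raise ValueError("both reliabilities must lie strictly between 0 and 1")
    exact = t * (1.0 - r) / (r * (1.0 - t))
    # Small tolerance so floating-point noise does not push an exact integer up by one.
    return max(1, math.ceil(exact - 1e-9))


# Hypothetical: a single student rating has reliability 0.25; aim for 0.80.
n_needed = ratings_needed(0.25, 0.80)
achieved = reliability_of_mean(0.25, n_needed)
```

With these illustrative numbers, a low single-rating reliability translates into a double-digit number of ratings, which matches the abstract's general point that single feedbacks cannot support solid conclusions.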
