Similar literature
 Found 20 similar documents; search time 93 ms
1.
《教育实用测度》2013,26(3):249-253
A test segment that lacks content validity with respect to a criterion may be deleted for that reason. At issue is the effect on reliability and validity as measured by the coefficients arising from classical test theory. Assuming that the predictor test has some reasonable degree of internal consistency, deleting a segment of meaningful size is certain to reduce reliability. However, Feldt (1997) showed that a concomitant rise in the validity coefficient may occur under certain limited conditions. The present research further characterizes the circumstances under which validity changes may occur as a result of deletion of a predictor test segment. Specifically, for a positive outcome, one seeks a relatively large correlation between the scores from the deleted segment and the remaining items coupled with a relatively low correlation between scores from the deleted segment and the criterion.

2.
The relation between test reliability and statistical power has been a controversial issue, perhaps due in part to a 1975 publication in the Psychological Bulletin by Overall and Woodward, “Unreliability of Difference Scores: A Paradox for the Measurement of Change”, in which they demonstrated that a Student t test based on pretest-posttest differences can attain its greatest power when the difference score reliability is zero. In the present article, the authors attempt to explain this paradox by demonstrating in several ways that power is not a mathematical function of reliability unless either true score variance or error score variance is constant.

3.
ABSTRACT

We investigate whether Anchoring Vignettes (AV) improve intercultural comparability of non-cognitive student-directed factors (e.g., procrastination). So far, correlation analyses for anchored and non-anchored scores with a criterion have been used to demonstrate the effectiveness of AV in improving data quality. However, correlation analyses are often used to investigate external validity of a scale. Nonetheless, before testing for validity, the reliability of the measurement of a construct should be examined. In the present study, we tested for measurement invariance across countries and languages and compared anchored and non-anchored student-directed self-reports that are highly relevant for the students’ self and their behaviour and performance. In addition, we applied further criteria for testing reliability. The results indicate that the data quality for some of the constructs can – in fact – be improved slightly by anchoring, whereas for other self-reports anchoring is less successful than was hoped. We discuss the findings with regard to possible consequences for research methodology.

4.
Abstract

This study investigated the reliability, validity, and utility of the following three measures of letter-formation quality: (a) a holistic rating system, in which examiners rated letters on a five-point Likert-type scale; (b) a holistic rating system with model letters, in which examiners used model letters that exemplified specific criterion scores to rate letters; and (c) a correct/incorrect procedure, in which examiners used transparent overlays and standard verbal criteria to score letters. Intrarater and interrater reliability coefficients revealed that the two holistic scoring procedures were unreliable, whereas scores obtained by examiners who used the correct/incorrect procedure were consistent over time and across examiners. Although all three of the target measures were sensitive to differences between individual letters, only the scores from the two holistic procedures were associated with other indices of handwriting performance. Furthermore, for each of the target measures, variability in scores was, for the most part, not attributable to the level of experience or sex of the respondents. Findings are discussed with respect to criteria for validating an assessment instrument.

5.
In this paper, an attempt has been made to synthesize some of the current thinking in the area of criterion-referenced testing as well as to provide the beginning of an integration of theory and method for such testing. Since criterion-referenced testing is viewed from a decision-theoretic point of view, approaches to reliability and validity estimation consistent with this philosophy are suggested. Also, to improve the decision-making accuracy of criterion-referenced tests, a Bayesian procedure for estimating true mastery scores has been proposed. This Bayesian procedure uses information about other members of a student's group (collateral information), but the resulting estimation is still criterion referenced rather than norm referenced in that the student is compared to a standard rather than to other students. In theory, the Bayesian procedure increases the “effective length” of the test by improving the reliability, the validity, and more importantly, the decision-making accuracy of the criterion-referenced test scores.

6.
Equivalent forms of a ten-item completion test were constructed. The same test items then were rewritten in matching format and in multiple-choice format, resulting in two forms (A and B) of each of three types of test. All tests were administered to 73 examinees, and parallel-forms reliability coefficients (correlation between scores on A and B) were calculated. These empirically obtained values were compared to the values of the reliability coefficient predicted from theoretically derived equations which indicate the influence of chance success due to guessing on test reliability. In accordance with theory it was found that the completion test was more reliable than the matching test and that the matching test was more reliable than the multiple-choice test. The empirically obtained reliability coefficients were very close to those predicted from the mathematically derived formulas.
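
To make the comparison in this abstract concrete, here is a minimal sketch, not the authors' analysis. It simulates two parallel forms under an assumed all-or-nothing knowledge model with random guessing, then computes the parallel-forms reliability as the Pearson correlation between Form A and Form B scores. The ability range, the guessing probabilities, and all function names are illustrative assumptions; the theoretically derived equations mentioned in the abstract are not reproduced here.

```python
import random


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_form(abilities, n_items, guess_prob, rng):
    """Score one test form: an item is correct if the examinee knows it
    or, failing that, guesses it with probability guess_prob (assumed model)."""
    scores = []
    for p_know in abilities:
        score = 0
        for _ in range(n_items):
            if rng.random() < p_know or rng.random() < guess_prob:
                score += 1
        scores.append(score)
    return scores


def parallel_forms_reliability(guess_prob, n_examinees=73, n_items=10, seed=0):
    """Correlation between scores on two simulated parallel forms (A and B)."""
    rng = random.Random(seed)
    # Hypothetical spread of knowledge levels; not taken from the study.
    abilities = [rng.uniform(0.2, 0.9) for _ in range(n_examinees)]
    form_a = simulate_form(abilities, n_items, guess_prob, rng)
    form_b = simulate_form(abilities, n_items, guess_prob, rng)
    return pearson(form_a, form_b)


if __name__ == "__main__":
    for label, g in [("completion (no guessing)", 0.0),
                     ("multiple choice (4 options)", 0.25)]:
        print(label, round(parallel_forms_reliability(g), 3))
```

Here a guessing probability of 0.0 stands in for the completion format and 0.25 for four-option multiple choice. With only 73 simulated examinees the estimates are noisy, so a single run need not reproduce the ordering reported in the abstract.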

7.
A reliability coefficient for criterion-referenced tests is developed from the assumptions of classical test theory. This coefficient is based on deviations of scores from the criterion score, rather than from the mean. The coefficient is shown to have several of the important properties of the conventional norm-referenced reliability coefficient, including its interpretation as a ratio of variances and as a correlation between parallel forms, its relationship to test length, its estimation from a single form of a test, and its use in correcting for attenuation due to measurement error. Norm-referenced measurement is considered as a special case of criterion-referenced measurement.
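
A coefficient of this kind that is widely cited in the measurement literature is Livingston's k², which replaces deviations from the mean with deviations from the criterion (cut) score C. The sketch below assumes that form, k² = (ρσ² + (μ − C)²) / (σ² + (μ − C)²), where ρ is the conventional norm-referenced reliability, σ² the observed score variance, and μ the mean. The function and parameter names are hypothetical, and the abstract itself does not state the formula.

```python
def criterion_referenced_reliability(norm_reliability, variance, mean, cut_score):
    """Livingston-style k^2 (assumed form, see lead-in).

    norm_reliability: conventional reliability coefficient (rho)
    variance:         observed score variance (sigma^2), must be positive
    mean:             observed score mean (mu)
    cut_score:        criterion score C
    """
    if variance <= 0:
        raise ValueError("variance must be positive")
    # Squared distance between the group mean and the criterion score.
    shift_sq = (mean - cut_score) ** 2
    return (norm_reliability * variance + shift_sq) / (variance + shift_sq)


if __name__ == "__main__":
    # Illustrative numbers only.
    print(criterion_referenced_reliability(0.8, 4.0, 12.0, 10.0))
```

When `cut_score` equals `mean`, the function returns `norm_reliability` unchanged, which is consistent with the abstract's remark that norm-referenced measurement is a special case of criterion-referenced measurement.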

8.
ABSTRACT

In the current study, two pools of 250 essays, all written as a response to the same prompt, were rated by two groups of raters (14 or 15 raters per group), thereby providing an approximation to each essay’s true score. An automated essay scoring (AES) system was trained on the datasets and then scored the essays using a cross-validation scheme. By eliminating one, two, or three raters at a time, and by calculating an estimate of the true scores using the remaining raters, an independent criterion was produced against which to judge the validity of the human raters and that of the AES system, as well as the interrater reliability. The results of the study indicated that the automated scores correlate with human scores to the same degree as human raters correlate with each other. However, the findings regarding the validity of the ratings support a claim that the reliability and validity of AES diverge: although the AES scoring is, naturally, more consistent than the human ratings, it is less valid.

9.
This meta-analysis is designed to test the immigrant paradox hypothesis, which argues that first-generation immigrant students tend to outperform their more acculturated peers. We aim to unpack the complex relation between acculturation and academic performance among immigrant-origin students with attention to methodological and demographic moderators. The review includes 79 independent samples generated from 54 studies, representing 89,827 students (M = 646.24, SD = 862.93) with a mean age of 13.26 (SD = 5.16). We found an overall main effect of 0.04 (p < .001), suggesting a significant, positive correlation between acculturation and academic performance. However, given the significant variation among studies, focused moderator analyses revealed the importance of critical methodological (e.g., type of acculturation measure used, type of academic indicator used, and type of publication) and demographic (e.g., developmental stage, race/ethnicity, urbanicity) factors that moderate the relation between acculturation and school achievement. These results suggest the opposite of the immigrant paradox; that is, second-generation (or more acculturated) students seem to perform better than their first-generation (or less acculturated) peers. Moderation analysis, however, revealed that acculturation seems to have no effect on grades, while having a positive effect on test scores. Finally, we found a positive relation between acculturation and academic performance in studies conducted with children and adolescents, but not for young adults.

10.
Reliability coefficients of linear combinations of observed scores have anomalous properties which have led to persistent difficulties in the investigation of difference scores and gain scores in test theory. Interpretation of these test scores is further complicated by effects of correlated errors of measurement which are likely to appear in difference scores and gain scores in practice. In this paper the discrepancies between classical results and correct results obtained from more general formulas, which allow for correlated errors, are examined systematically. These discrepancies depend strongly on the reliability coefficients of the respective tests and are smallest when the influence of the variables related by the formulas is least. A vector representation of difference scores reveals that these anomalies arise from simple geometric relations among observed scores, true scores, and error scores inherent in the test-theory model. In this context, doubts as to the usefulness of difference scores and gain scores in testing practice expressed by previous authors appear to be justified.
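
For orientation, the classical result that this abstract takes as its starting point can be sketched in a few lines. The function below computes the reliability of a difference score D = X − Y as one minus the ratio of the error variance of D to its observed variance; with `error_corr = 0` it reduces to the textbook classical formula. The `error_corr` parameter is an assumption added here to illustrate how correlated errors change the error variance of D. It is not taken from the paper's more general formulas, and all names are hypothetical.

```python
import math


def difference_score_reliability(sd_x, sd_y, rel_x, rel_y, corr_xy, error_corr=0.0):
    """Reliability of D = X - Y, computed as 1 - var(error of D) / var(D).

    sd_x, sd_y:   observed standard deviations of X and Y
    rel_x, rel_y: reliability coefficients of X and Y
    corr_xy:      observed correlation between X and Y
    error_corr:   assumed correlation between the errors of X and Y
                  (0.0 gives the classical, uncorrelated-errors result)
    """
    var_d = sd_x ** 2 + sd_y ** 2 - 2 * corr_xy * sd_x * sd_y
    if var_d <= 0:
        raise ValueError("difference score has no observed variance")
    # Error variances implied by the reliabilities: sigma^2 * (1 - rho).
    err_var_x = sd_x ** 2 * (1 - rel_x)
    err_var_y = sd_y ** 2 * (1 - rel_y)
    err_cov = error_corr * math.sqrt(err_var_x * err_var_y)
    err_var_d = err_var_x + err_var_y - 2 * err_cov
    return 1 - err_var_d / var_d


if __name__ == "__main__":
    # Illustrative values only: equal SDs, reliabilities of .80, r_xy = .50.
    print(difference_score_reliability(1.0, 1.0, 0.8, 0.8, 0.5))
    print(difference_score_reliability(1.0, 1.0, 0.8, 0.8, 0.5, error_corr=0.5))
```

With equal standard deviations and reliabilities of .80, the classical value falls toward zero as the observed correlation between X and Y approaches .80, which is the familiar reason difference scores are often regarded as unreliable.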

11.
《教育实用测度》2013,26(3):221-240
The scores on 2 distinct tests (e.g., essay and objective) are often combined to create a composite score, which is used to make decisions. The validity of the observed composite can sometimes be evaluated relative to an external criterion. However, in cases where no criterion is available, the observed composite has generally been evaluated in terms of its reliability. The analyses in this article are based on a simple, content-based model for the validity of the observed composite as an estimate of a target composite, based on a priori weights for the 2 tests. The results suggest that giving extra weight to the more reliable of the 2 observed scores tends to improve the reliability of the composite, and up to a point tends to improve its validity. Giving too much weight to the more reliable score can decrease the validity of the observed composite as a measure of the target composite.

12.
Background: The Childhood Trauma Questionnaire – Short Form (CTQ-SF) is a widely utilized self-report instrument in the assessment and characterization of childhood trauma. Yet, research on the instrument’s psychometric properties in clinical samples is sparse, and the Danish version of the CTQ-SF has not been previously evaluated in clinical samples. Objectives: To examine the structural validity, internal consistency reliability, and multi-method convergent validity of the CTQ-SF in a heterogeneous clinical sample from Denmark. Participants and setting: The study was based on data from four Danish clinical samples (N = 393): 1) outpatients diagnosed with personality disorders, 2) patients commencing psychiatric treatment for non-affective first-episode psychosis, 3) patients diagnosed with first-episode or prolonged depression recruited from general practitioners and an outpatient mood disorder clinic, and 4) detained delinquent boys. Methods: Confirmatory factor analysis was used to explore structural validity. Also, we calculated internal consistency and multi-method convergent validity with interview-based ratings of adverse parenting. Results: Confirmatory factor analyses indicated that the five-factor structure described in the CTQ-SF manual, with three error-correlated items, best fitted the data, as compared to various other models. Coefficients of congruence also supported factorial similarity across countries (i.e., US substance abusers and a mixed Brazilian sample). Internal consistency reliability was acceptable and comparable to estimates previously published. Multi-method convergent validity associations further corroborated the validity of the CTQ-SF. Conclusion: These findings provide support for the reliability and validity of the Danish version of the CTQ-SF in clinical samples.

13.
In evaluating the relationship between two measures across different groups (i.e., in evaluating “differential validity”) it is necessary to examine differences in correlation coefficients and in regression lines. Ordinary least squares (OLS) regression is the standard method for fitting lines to data, but its criterion for optimal fit (minimizing the squared vertical distances between the points and the line) is less natural in many contexts than the criterion used in orthogonal regression (minimizing the squared Euclidean distances of points from the line). OLS regression is appropriate if the goal is to predict some unknown dependent variable from a known independent variable, but in examining the relationship between two variables, which both contain error, OLS regression introduces bias. This bias, associated with regression toward the mean, can suggest that the test scores have different relationships, and therefore different meanings, in two groups, when the two sets of test scores have the same relationship and the same meanings in the two groups. The impact of regression toward the mean in differential validity studies is illustrated with two synthetic and two real data sets. Each of the two real data sets includes two measures of competence in applying legal principles to fact situations (an essay test and a multiple-choice test) for candidates in two groups (Black/White in the first example and women/men in the second example).
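
The contrast between the two fitting criteria can be illustrated with a small simulation. This is a sketch under stated assumptions, not the article's analysis: two measures share one true score, both carry independent error of equal variance, and the true slope is 1. The function names, sample size, and variances are hypothetical. Under these assumptions the OLS slope is attenuated by the factor var(true) / (var(true) + var(error)), here 4 / 5 = 0.8, while the orthogonal (major-axis) slope stays near 1.

```python
import math
import random


def _moments(x, y):
    """Centered sums of squares and cross-products."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxx, syy, sxy


def ols_slope(x, y):
    """Slope minimizing squared vertical distances (y regressed on x)."""
    sxx, _, sxy = _moments(x, y)
    return sxy / sxx


def orthogonal_slope(x, y):
    """Slope minimizing squared perpendicular distances to the line."""
    sxx, syy, sxy = _moments(x, y)
    if sxy == 0:
        raise ValueError("orthogonal slope is undefined when the covariance is zero")
    return (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)


def simulate(n=2000, true_sd=2.0, error_sd=1.0, seed=0):
    """Two error-laden measures of the same true score (assumed model)."""
    rng = random.Random(seed)
    x, y = [], []
    for _ in range(n):
        t = rng.gauss(0.0, true_sd)
        x.append(t + rng.gauss(0.0, error_sd))
        y.append(t + rng.gauss(0.0, error_sd))
    return x, y


if __name__ == "__main__":
    x, y = simulate()
    print("OLS slope:       ", round(ols_slope(x, y), 3))
    print("orthogonal slope:", round(orthogonal_slope(x, y), 3))
```

The orthogonal slope recovers the true relationship here only because the two error variances were made equal; with unequal error variances a weighted variant would be needed.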

14.
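This study used a teacher work-motivation rating scale to assess 42 secondary school teachers and examined the reliability and validity of the scale. The results show that using a scale to assess teachers' work motivation is feasible and meaningful, and that the scale has a degree of reliability and criterion-related validity; however, its items should be further screened and grouped by means of factor analysis. The paper also proposes establishing a unified scoring system so that scores are more comparable.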

15.
Objective: We conducted a comprehensive assessment of the reliability and validity of the Interview for Traumatic Events in Childhood (ITEC, Lobbestael, Arntz, Kremers, & Sieswerda, 2006), a retrospective, semi-structured interview for childhood maltreatment. The ITEC aims to yield dimensional scores for severity of experiences of different childhood maltreatment dimensions. Methods: Initial psychometric properties were tested with the pilot version of the ITEC in 362 participants. A second study assessed the revised ITEC in 217 participants, patients and non-patients. Results: Factor analyses produced the best fit for a five-factor model (sexual, physical and emotional abuse, physical and emotional neglect). The scales had good internal consistency, except for the physical neglect subscale, and excellent inter-rater reliability. The scales were highly associated with equivalent scales of the Childhood Trauma Questionnaire (i.e., good convergent validity), and showed good correspondence with patient file information (i.e., good criterion validity). Conclusion: These results support the reliability and validity of the ITEC, making it a potentially useful tool for assessing a broad range of traumatic events in childhood. Practice implication: The first step in therapy for dealing with childhood maltreatment is to map abusive experiences and assess their severity and impact. Since maltreatment is a sensitive topic that is not reported on easily, trauma interviews are promising assessment instruments because they provide the opportunity to probe and clarify. There are hardly any well-validated trauma interviews available that assess the extent of maltreatment in and outside the family in various dimensions. The current study tries to fill this gap by presenting a new trauma interview: the Interview for Traumatic Events in Childhood.

16.
The Moral Competence Test (MCT) was designed over 30 years ago to provide a resource for educators interested in conducting cross-cultural studies of moral development and education. Since its origin, it has been translated into at least 30 languages and used in hundreds of studies. However, few studies provide evidence to support the use of the test in the US. The test’s designer identified three criteria for evaluating the construct validity of the test and its primary scores: whether correlations of stage scores reflect a simplex structure, whether ratings follow the theoretical order of stages, and whether the test differentiates preferences and structures of reasoning. We use these criteria and evidence of criterion and content validity to assess the validity of the MCT. We present results from two US samples (n = 772). Results analyzing the test author’s criteria support the semantic validity of the test; however, evidence of criterion validity raises questions about the C-score as a measure of moral competence. After controlling for stage preferences, the C-score was negatively related to democratic attitudes and positively related to dogmatism.

17.
Abstract

An objective instrument for assessment of motivation for school learning is reported along with evidence of its validity. Rural ninth-grade students in Appalachian Kentucky constituted the sample for studying relationships among variables of school motivation, willingness to compete, and achievement in reading, mathematics, and language. Students in general mathematics and in algebra classes were asked to volunteer for an academic type of contest. Later, the mean motivation score of volunteers exceeded the mean for non-volunteers, a difference significant at the .01 level of confidence. Algebra students’ mean motivation score was significantly higher than the mean for general mathematics students (P > .001). Three months after the motivation scores were obtained, scores on the California Achievement Test were collected. Product-moment correlations between motivation scores and achievement scores ranged from .604 to .718.

Although other writers have reported correlations between objective measures of motivation and teachers’ marks, no previous correlations with achievement test results could be found for comparison. Correlations with GPAs tend to be in the range .32 to .55, which is considerably below the range resulting from this study. Data collected in this project supported hypotheses that the objective measure of school motivation would predict levels of utility for competition and achievement. It is concluded that, for the sample of students involved, the test presented is reliable and has validity for the prediction of willingness to try and levels of achievement as measured by a standardized test.

18.
Abstract

Previous researchers having established the equivalence of a group administered version of the PPVT with the standard procedure of individual administration and the reliability between alternate forms of the PPVT, an attempt was made to establish the concurrent validity of a group administered version of the PPVT in terms of two criterion variables. An r of .62 was obtained between the Otis, a group test of intelligence, and the PPVT. An r of .55 was found between the PPVT and the Stanford Achievement Test. Both r’s were significant beyond the .01 level. The concurrent validity of the PPVT was established and suggestions for additional research were made.

19.
20.
ABSTRACT

Touch screen tablets are being increasingly used in schools for learning and assessment. However, the validity and reliability of assessments delivered via tablets are largely unknown. The present study tested the psychometric properties of a tablet-based app designed to measure early literacy skills. Tablet-based tests were also compared with traditional paper-based tests. Children aged 2–6 years (N = 99) completed receptive tests delivered via a tablet for letter, word, and numeral skills. The same skills were tested with a traditional paper-based test that used an expressive response format. Children (n = 35) were post-tested 8 weeks later to examine the stability of test scores over time. The tablet test scores showed high internal consistency (all αs > .94), acceptable test-retest reliability (ICC range = .39–.89), and were correlated with child age, family SES, and home literacy teaching to indicate good predictive validity. The agreement between scores for the tablet and traditional tests was high (ICC range = .81–.94). The tablet tests provide valid and reliable measures of children’s early literacy skills. The strong psychometric properties and ease of use suggest that tablet-based tests of literacy skills have the potential to improve assessment practices for research purposes and classroom use.
