Similar Articles
20 similar articles found (search time: 125 ms).
1.
A previous study of the initial, preoperational version of the Graduate Record Examinations (GRE) analytical ability measure (Powers & Swinton, 1984) revealed practically and statistically significant effects of test familiarization on analytical test scores. (Two susceptible item types were subsequently removed from the test.) Data from this study were reanalyzed for evidence of differential effects for subgroups of examinees classified by age, ethnicity, degree aspiration, English language dominance, and performance on other sections of the GRE General Test. The results suggested little, if any, difference among subgroups of examinees with respect to their response to the particular kind of test preparation considered in the study. Within the limits of the data, no particular subgroup appeared to benefit significantly more or significantly less than any other subgroup.

2.
A College Board-sponsored survey of a nationally representative sample of 1995–96 SAT takers yielded a database for more than 4,000 examinees, about 500 of whom had attended formal coaching programs outside their schools. Several alternative analytical methods were used to estimate the effects of coaching on SAT I: Reasoning Test scores. The various analyses produced slightly different estimates. All of the estimates, however, suggested that the effects of coaching are far less than is claimed by major commercial test preparation companies. The revised SAT does not appear to be any more coachable than its predecessor.

3.
One of the major assumptions of item response theory (IRT) models is that performance on a set of items is unidimensional, that is, that the probability of successful performance by examinees on a set of items can be modeled by a mathematical model that has only one ability parameter. In practice, this strong assumption is likely to be violated. An important pragmatic question to consider is: What are the consequences of these violations? In this research, evidence is provided of violations of unidimensionality on the verbal scale of the GRE Aptitude Test, and the impact of these violations on IRT equating is examined. Previous factor analytic research on the GRE Aptitude Test suggested that two verbal dimensions, discrete verbal (analogies, antonyms, and sentence completions) and reading comprehension, existed. Consequently, the present research involved two separate calibrations (homogeneous) of discrete verbal items and reading comprehension items as well as a single calibration (heterogeneous) of all verbal item types. Thus, each verbal item was calibrated twice and each examinee obtained three ability estimates: reading comprehension, discrete verbal, and all verbal. The comparability of ability estimates based on homogeneous calibrations (reading comprehension or discrete verbal) to each other and to the all-verbal ability estimates was examined. The effects of homogeneity of the item calibration pool on estimates of item discrimination were also examined. Then the comparability of IRT equatings based on homogeneous and heterogeneous calibrations was assessed. The effects of calibration homogeneity on ability parameter estimates and discrimination parameter estimates are consistent with the existence of two highly correlated verbal dimensions. IRT equating results indicate that although violations of unidimensionality may have an impact on equating, the effect may not be substantial.
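To make the unidimensionality assumption concrete: under a unidimensional IRT model, a single ability parameter governs every item response. A minimal sketch, assuming the three-parameter logistic (3PL) model (the abstract does not name the model actually fit), is

P_i(\theta) = c_i + (1 - c_i) \frac{1}{1 + \exp[-a_i(\theta - b_i)]}

where \theta is the examinee's ability and a_i, b_i, and c_i are the discrimination, difficulty, and lower asymptote of item i. The finding of two highly correlated verbal dimensions means that no single \theta fully accounts for performance on both the discrete verbal and reading comprehension items.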

4.
We evaluated a computer-delivered response type for measuring quantitative skill. "Generating Examples" (GE) presents under-determined problems that can have many right answers. We administered two GE tests that differed in the manipulation of specific item features hypothesized to affect difficulty. Analyses addressed internal consistency reliability, external relations, features contributing to item difficulty, adverse impact, and examinee perceptions. Results showed that GE scores were reasonably reliable but only moderately related to the GRE quantitative section, suggesting the two tests might be tapping somewhat different skills. Item features that increased difficulty included asking examinees to supply more than one correct answer and to identify whether an item was solvable. Gender differences were similar to those found on the GRE quantitative and analytical test sections. Finally, examinees were divided on whether GE items were a fairer indicator of ability than multiple-choice items, but still overwhelmingly preferred to take the more conventional questions.

5.
Applied Measurement in Education, 2013, 26(1): 35–48
This study investigated several current coaching practices used in training test-wiseness for analogy items in standardized test batteries. A three-group design was used that included a general test-taking "encouragement" condition in addition to a no-training control condition. The specific techniques used in training are described. Scholastic Aptitude Test (SAT) scores were obtained from university admission files to verify that no overall aptitude differences existed across the three conditions. Differences were observed for the coached group relative to the two control groups in terms of the overall number of correct responses for the coached item type (analogies). No differences were found for the non-coached item types. Item difficulties for the three groups are also reported, showing that several items were indeed made easier for individuals in the coached group. A qualitative analysis of the items made easier by coaching, in terms of the training techniques used, is given along with an analysis of the items that did not respond to coaching. Finally, a discussion of potentially flawed item types and item characteristics, with suggestions for dealing with such flaws, is given.

6.
Exploratory and confirmatory factor analyses were used to explore relationships among existing item types and three new computer-administered item types for the analytical scale of the Graduate Record Examination General Test. One new item type was an open-ended version of the current multiple-choice analytical reasoning item type. The other new item types had no counterparts on the existing test. The computer tests were administered at four sites to a sample of students who had previously taken the GRE General Test. Scores from the regular GRE and the special computer administration were matched for a sample of 349 students. Factor analyses suggested that the new item types with no counterparts in the existing GRE were reliably assessing unique constructs, but the open-ended analytical reasoning items were not measuring anything beyond what is measured by the current multiple-choice version of these items.

7.
In this study, the authors explored the importance of item difficulty (equated delta) as a predictor of differential item functioning (DIF) for Black versus matched White examinees on four verbal item types (analogies, antonyms, sentence completions, reading comprehension), using 13 disclosed GRE forms (988 verbal items) and 11 disclosed SAT forms (935 verbal items). The average correlation across test forms for each item type (and often the correlation for each individual test form as well) revealed a significant relationship between item difficulty and DIF value for both the GRE and the SAT. The most important finding is that, for hard items, Black examinees perform differentially better than matched-ability White examinees on each of the four item types and on both the GRE and the SAT. The results further suggest that the amount of verbal context is an important determinant of the magnitude of the relationship between item difficulty and the differential performance of Black versus matched White examinees. Several hypotheses accounting for this result were explored.

8.
This study evaluated 16 hypotheses, subsumed under 7 more general hypotheses, concerning possible sources of bias in test items for Black and White examinees on the Graduate Record Examination General Test (GRE). Items were developed in pairs that were varied according to a particular hypothesis, with each item from a pair administered in different forms of an experimental portion of the GRE. Data were analyzed using log-linear methods. Ten of the 16 hypotheses showed interactions between group membership and item version, indicating a differential effect of the item manipulation on the performance of Black and White examinees. The complexity of some of the interactions found, however, suggested that uncontrolled factors were also differentially affecting performance.

9.
This study examines the perceptions of a representative sample of GRE test takers who were asked to indicate their views of the importance of eight widely considered factors in graduate admissions. Overall, candidates perceived undergraduate grades as the most important factor in graduate admissions. Recommendations and one's undergraduate field were rated as somewhat less important than undergraduate grades, and GRE Aptitude Test scores were rated even less important. GRE Advanced (Subject) Test scores were perceived as considerably less important than any other factor. Analyses by subgroup revealed that candidates' perceptions differed markedly according to the graduate field they intended to enter. Perceptions also differed by ethnic group (Blacks versus Whites) but not by sex or age.

10.
In actual test development practice, the number of test items that must be developed and pretested is typically greater, and sometimes much greater, than the number that is eventually judged suitable for use in operational test forms. This has proven to be especially true for one item type, analytical reasoning, which currently forms the bulk of the analytical ability measure of the GRE General Test. This study involved coding the content characteristics of some 1,400 GRE analytical reasoning items. These characteristics were correlated with indices of item difficulty and discrimination. Several item characteristics were predictive of the difficulty of analytical reasoning items. Generally, these same variables also predicted item discrimination, but to a lesser degree. The results suggest several content characteristics that could be considered in extending the current specifications for analytical reasoning items. The use of these item features may also contribute to greater efficiency in developing such items. Finally, the influence of these various characteristics provides a better understanding of the construct validity of the analytical reasoning item type.

11.
The purpose of this study was to identify broad classes of items that behave differentially for handicapped examinees taking special, extended-time administrations of the Scholastic Aptitude Test (SAT). To identify these item classes, the performance of nine handicapped groups and one nonhandicapped group on each of two forms of the SAT was investigated through a two-stage procedure. The first stage centered on the performance of item clusters. Individual items composing clusters showing questionable performance were then examined. This two-stage procedure revealed little indication of differentially functioning item classes. However, some notable instances of differential performance at the item level were detected, the most serious of which affected visually impaired students taking the braille edition of the test.

12.
Some applicants for admission to graduate programs present Graduate Record Examinations (GRE) General Test scores that are several years old. Due to different experiences over time, older GRE verbal, quantitative, and analytical scores may no longer accurately reflect the current capabilities of the applicants. To provide evidence regarding the long-term stability of GRE scores, test-retest correlations and average change (net gain) in test performance were analyzed for GRE General Test repeaters classified by time between test administrations in intervals ranging from less than 6 months to 10 years or more. Findings regarding average changes in verbal and quantitative test performance for long-term repeaters (with 5 years or more between tests), generally, and by graduate major area, sex, and ethnicity, appeared to be consistent with a differential growth hypothesis: Long-term repeaters generally, and in all of the subgroups, registered greater average (net) score gain on verbal tests than on quantitative tests and, for subgroups, the amount of gain tended to vary directly with initial means. A rationale is presented for a growth interpretation of the observed average gains in test performance. Implications for graduate school and GRE Program policies regarding the treatment of older test scores are considered.

13.
The Formulating-Hypotheses (F-H) item presents a situation and asks examinees to generate as many explanations for it as possible. This study examined the generalizability, validity, and examinee perceptions of a computer-delivered version of the task. Eight F-H questions were administered to 192 graduate students. Half of the items restricted examinees to 7 words per explanation, and half allowed up to 15 words. Generalizability results showed high interrater agreement, with tests of between 2 and 4 items scored by one judge achieving coefficients in the .80s. Construct validity analyses found that F-H was only marginally related to the GRE General Test and more strongly related to a measure of ideational fluency than the General Test was. Different response limits tapped somewhat different abilities, with the 15-word constraint appearing more useful for graduate assessment. These items added significantly to conventional measures in explaining school performance and creative expression.
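The rise of a generalizability coefficient with test length follows the usual projection logic. A minimal sketch, assuming error comes only from the item facet (the abstract does not report its G-study design), is the Spearman-Brown formula:

\rho_k = \frac{k\,\rho_1}{1 + (k - 1)\,\rho_1}

where \rho_1 is the single-item coefficient and k is the number of items. For instance, a hypothetical single-item coefficient of .55 projects to about .71 at k = 2 and .83 at k = 4, consistent with coefficients in the .80s for tests of a few items scored by one judge.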

14.
The standardization approach to assessing differential item functioning (DIF), including standardized distractor analysis, is described. The results of studies conducted on Asian Americans, Hispanics (Mexican Americans and Puerto Ricans), and Blacks on the Scholastic Aptitude Test (SAT) are described and then synthesized across studies. When the groups were limited to include only examinees who spoke English as their best language, very few items across forms and ethnic groups exhibited large DIF. Major findings include evidence of differential speededness (where minority examinees did not complete SAT-Verbal sections at the same rate as White students with comparable SAT-Verbal scores) for Blacks and Hispanics and, when the item content is of special interest, advantages for the relevant ethnic group. In addition, homographs tend to disadvantage all three ethnic groups, but the effects of vertical relationships in analogy items are not as consistent. Although these findings are important in understanding DIF, they do not seem to account for all differences. Other variables related to DIF still need to be identified. Furthermore, these findings are seen as tentative until corroborated by studies using controlled data collection designs.
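For reference, the primary index in the standardization approach (following Dorans and Kulick's usual formulation; the abstract does not restate it) is

\text{STD P-DIF} = \frac{\sum_s w_s\,(P_{fs} - P_{rs})}{\sum_s w_s}

where P_{fs} and P_{rs} are the proportions of focal-group and reference-group examinees at matching-score level s who answer the item correctly, and w_s is a weight, typically the focal-group frequency at level s. Values near zero indicate comparable performance after matching on total score; standardized distractor analysis applies the same comparison to each incorrect option.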

15.
In this study we examined alternative item types and section configurations for improving the discriminant and convergent validity of the GRE General Test. A computer-based test of reasoning items and a generating-explanations measure were administered to a sample of 388 examinees who had previously taken the General Test. Confirmatory factor analyses indicated that three dimensions of reasoning (verbal, analytical, and quantitative) and a fourth dimension of verbal fluency based on the generating-explanations task could be distinguished. Notably, generating explanations was as distinct from new variations of reasoning items as it was from verbal and quantitative reasoning. In the full sample, this differentiation was evident in relation to such external criteria as undergraduate grade point average (UGPA), self-reported accomplishments, and a measure of ideational fluency, with generating explanations relating uniquely to aesthetic and linguistic accomplishments and to ideational fluency. For the subset of participants with undergraduate majors in the humanities and social sciences, generating explanations added to the relationship with UGPA over that contributed by the General Test.

16.
Views on testing (its purpose and uses and how its data are analyzed) are related to one's perspective on test takers. Test takers can be viewed as learners, examinees, or contestants. I briefly discuss the perspective of test takers as learners. I maintain that much of psychometrics views test takers as examinees. I discuss test takers as contestants in some detail. Test takers who are contestants in high-stakes settings want reliable outcomes obtained via acceptable scoring of tests administered under clear rules. In addition, it is essential to empirically verify the interpretations attached to scores. At the very least, item and test scores should exhibit certain invariance properties. I note that the "do no harm" dictum borrowed from the field of medicine is particularly relevant to the perspective of test takers as contestants.

17.
In this study, we created a computer-delivered problem-solving task based on the cognitive research literature and investigated its validity for graduate admissions assessment. The task asked examinees to sort mathematical word problem stems according to prototypes. Data analyses focused on the meaning of sorting scores and examinee perceptions of the task. Results showed that those who sorted well tended to have higher GRE General Test scores and college grades than did examinees who sorted less proficiently. Examinees generally preferred this task to multiple-choice items like those found on the General Test's Quantitative section and felt the task was a fairer measure of their ability to succeed in graduate school. Adaptations of the task might be used in admissions tests, as well as for instructional assessments to help lower-scoring examinees localize and remediate problem-solving difficulties.

18.
It is sometimes sensible to think of the fundamental unit of test construction as being larger than an individual item. This unit, dubbed the testlet, must pass muster in the same way that items do. One criterion of a good item is the absence of DIF: the item must function in the same way in all important subpopulations of examinees. In this article, we define what we mean by testlet DIF and provide a statistical methodology to detect it. This methodology parallels the IRT-based likelihood ratio procedures explored previously by Thissen, Steinberg, and Wainer (1988, in press). We illustrate this methodology with analyses of data from a testlet-based experimental version of the Scholastic Aptitude Test (SAT).
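To make the detection step concrete: the likelihood ratio approach fits a compact model in which the studied testlet's item parameters are constrained equal across subpopulations, and an augmented model in which they are free to differ. A minimal sketch of the resulting statistic, under standard asymptotics (the abstract does not detail the testlet parameterization):

G^2 = -2\,[\ln L_{\text{compact}} - \ln L_{\text{augmented}}] \sim \chi^2_{df}

where df is the number of parameters freed in the augmented model; a significant G^2 flags the testlet as exhibiting DIF.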

19.
A new measure focusing explicitly on the cognitive dimension of test anxiety was introduced and examined for psychometric quality relative to existing measures of test anxiety. The new scale was found to be a reliable and valid measure of cognitive test anxiety. The impact of cognitive test anxiety, as well as emotionality and test procrastination, was subsequently evaluated on three course exams and on students' self-reported performance on the Scholastic Aptitude Test for 168 undergraduate students. Higher levels of cognitive test anxiety were associated with significantly lower test scores on each of the three course examinations. High levels of cognitive test anxiety also were associated with significantly lower Scholastic Aptitude Test scores. Procrastination, in contrast, was related to performance only on the course final examination. Gender differences in cognitive test anxiety were documented, but those differences were not related to performance on the course exams. Examination of the relation between the emotionality component of test anxiety and performance revealed that moderate levels of physiological arousal generally were associated with higher exam performance. The results were consistent with cognitive appraisal and information-processing models of test anxiety and support the conclusion that cognitive test anxiety exerts a significant, stable, and negative impact on academic performance measures.

20.
This study investigates the comparability of two equating methods based on item response theory (IRT): true score equating (TSE) and estimated true equating (ETE). Additionally, six scaling methods were implemented within each equating method: mean-sigma, mean-mean, two versions of fixed common item parameter, Stocking and Lord, and Haebara. Empirical test data were examined to investigate the consistency of scores resulting from the two equating methods, as well as the consistency of the scaling methods both within and across equating methods. Results indicate that although the degree of correlation among the equated scores was quite high, regardless of the equating method/scaling method combination, non-trivial differences in equated scores existed in several cases. These differences would likely accumulate across examinees, making group-level differences greater. Systematic differences in the classification of examinees into performance categories were observed across the various conditions: ETE tended to place lower-ability examinees into higher performance categories than TSE, while the opposite was observed for high-ability examinees. Because the study was based on one set of operational data, the generalizability of the findings is limited and further study is warranted.
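To make the scaling step concrete, here is a minimal Python sketch of one of the six scaling methods named above, mean-sigma, together with the IRT true-score function that TSE inverts. It assumes a 2PL model and uses hypothetical common-item parameter values; it illustrates the general technique, not this study's implementation.

import numpy as np

def p_2pl(theta, a, b):
    """Probability of a correct response under the 2PL model."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def mean_sigma(b_old, b_new):
    """Mean-sigma constants (A, B) placing new-form difficulties on the
    old form's ability scale, using common-item difficulty estimates."""
    A = np.std(b_old) / np.std(b_new)
    B = np.mean(b_old) - A * np.mean(b_new)
    return A, B

def true_score(theta, a, b):
    """IRT true score: expected number-correct at ability theta."""
    return p_2pl(theta, a, b).sum()

# Hypothetical common-item parameter estimates from two calibrations.
a_old, b_old = np.array([1.0, 0.8, 1.2, 0.9]), np.array([-1.2, -0.4, 0.3, 1.1])
a_new, b_new = np.array([1.1, 0.7, 1.3, 0.8]), np.array([-1.0, -0.2, 0.5, 1.4])

A, B = mean_sigma(b_old, b_new)
a_star, b_star = a_new / A, A * b_new + B   # new-form items on the old scale

# TSE then maps a new-form number-correct score to the old form by finding
# the theta whose new-form true score equals it and evaluating the
# old-form true score at that same theta (a one-dimensional root search).

By contrast, the Stocking and Lord and Haebara methods choose A and B by minimizing discrepancies between test or item characteristic curves rather than matching the moments of the difficulty estimates, and the fixed common item parameter methods constrain the common items' parameters during calibration itself.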
