Similar Articles
 20 similar articles found (search time: 125 ms)
1.
An essential question when computing test–retest and alternate forms reliability coefficients is how many days there should be between tests. This article uses data from reading and math computerized adaptive tests to explore how the number of days between tests impacts alternate forms reliability coefficients. Results suggest that the highest alternate forms reliability coefficients were obtained when the second test was administered at least 2 to 3 weeks after the first test. Even though reliability coefficients after this amount of time were often similar, results suggested a potential tradeoff in waiting longer to retest, as student ability tended to grow with time. These findings indicate that if keeping student ability similar is a concern, the best time to retest is shortly after 3 weeks have passed since the first test. Additional analyses suggested that alternate forms reliability coefficients were lower when tests were shorter and that narrowing the first-test ability distribution of examinees also impacted estimates. Results did not appear to be largely impacted by differences in first-test average ability, student demographics, or whether the student took the test under standard or extended time. For math and reading tests like the ones analyzed in this article, then, the optimal retest interval appears to be shortly after 3 weeks have passed since the first test.

2.
Two conventional scores and a weighted score on a group test of general intelligence were compared for reliability and predictive validity. One conventional score consisted of the number of correct answers an examinee gave in responding to 69 multiple-choice questions; the other was the formula score obtained by subtracting from the number of correct answers a fraction of the number of wrong answers. A weighted score was obtained by assigning weights to all the response alternatives of all the questions and adding the weights associated with the responses, both correct and incorrect, made by the examinee. The weights were derived from degree-of-correctness judgments of the set of response alternatives to each question. Reliability was estimated using a split-half procedure; predictive validity was estimated from the correlation between test scores and mean school achievement. Both conventional scores were found to be significantly less reliable but significantly more valid than the weighted scores. (The formula scores were neither significantly less reliable nor significantly more valid than number-correct scores.)
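The split-half procedure mentioned above can be sketched in a few lines: score two half-tests (here, odd-positioned vs. even-positioned items), correlate the half scores, then apply the Spearman-Brown correction to estimate full-length reliability. A minimal Python illustration; the 0/1 response matrix is invented, not data from any of the studies listed here.

```python
def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def split_half_reliability(responses):
    """Split-half reliability with the Spearman-Brown correction.

    responses: list of per-examinee item score lists (0/1).
    """
    odd = [sum(r[0::2]) for r in responses]   # items 1, 3, 5, ...
    even = [sum(r[1::2]) for r in responses]  # items 2, 4, 6, ...
    r_half = pearson(odd, even)
    # Spearman-Brown: step the half-test correlation up to full length
    return 2 * r_half / (1 + r_half)

# Hypothetical 6-examinee, 6-item response matrix
data = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0],
]
print(round(split_half_reliability(data), 3))
```

Note that the odd/even split is one convention among many; different splits give different estimates, which is one motivation for coefficients such as KR20 and Cronbach's α.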

3.
Responses to a 40-item test were simulated for 150 examinees under free-response and multiple-choice formats. The simulation was replicated three times for each of 30 variations reflecting format and the extent to which examinees were (a) misinformed, (b) successful in guessing free-response answers, and (c) able to recognize with assurance correct multiple-choice options that they could not produce under free-response testing. Internal consistency reliability (KR20) estimates were consistently higher for the free-response score sets, even when the free-response item difficulty indices were augmented to yield mean scores comparable to those from multiple-choice testing. In addition, all test score sets were correlated with four randomly generated sets of unit-normal measures, whose intercorrelations ranged from moderate to strong. These measures served as criteria because one of them had been used as the basic ability measure in the simulation of the test score sets. Again, the free-response score sets yielded superior results even when tests of equal difficulty were compared. The guessing and recognition factors had little or no effect on reliability estimates or correlations with the criteria. The extent of misinformation affected only multiple-choice KR20 values (more misinformation yielded higher values). Although free-response tests were found to be generally superior, the extent of their advantage over multiple-choice was judged sufficiently small that other considerations might justifiably dictate format choice.
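For reference, the KR20 coefficient this abstract compares across formats is Cronbach's α specialized to dichotomously scored items: k/(k−1) times one minus the ratio of summed item variances (p·q) to total-score variance. A minimal sketch; the response matrices below are invented for illustration, not the simulated sets from the study.

```python
def kr20(responses):
    """KR20 internal-consistency estimate for 0/1 item scores.

    responses: list of per-examinee lists, one 0/1 entry per item.
    """
    n = len(responses)      # number of examinees
    k = len(responses[0])   # number of items
    # Sum of p*q over items, where p is the proportion answering correctly
    pq_sum = 0.0
    for j in range(k):
        p = sum(r[j] for r in responses) / n
        pq_sum += p * (1 - p)
    totals = [sum(r) for r in responses]
    mean = sum(totals) / n
    var_total = sum((t - mean) ** 2 for t in totals) / n  # population variance
    return (k / (k - 1)) * (1 - pq_sum / var_total)

# Perfectly consistent response patterns give KR20 = 1.0;
# inconsistent patterns drive the estimate down (it can go negative).
consistent = [[1, 1, 1], [1, 1, 1], [0, 0, 0], [0, 0, 0]]
print(kr20(consistent))
```

Replacing the 0/1 item scores with item variances of polytomous scores turns the same formula into Cronbach's α, the coefficient reported in several later entries on this page.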

4.
5.
The purpose of this study was to obtain needed additional information concerning the reliability of the PPVT. A group testing procedure was utilized by reproducing the plates of the PPVT on 35 millimeter transparent slides and projecting them onto a 60 X 60 inch screen. A sample of 414 fourth-, fifth-, and sixth-grade pupils was tested twice with Form A and twice with Form B. The time required to administer each separate form was one-half hour. The total testing time for both Form A and Form B when combined into one longer test was one hour. Alternate form reliability compared favorably with the manual. Test-retest reliability coefficients ranged from .73 to .35. Combining the two forms into a test twice as long yielded test-retest reliability coefficients of .90, .88 and .84 for the fourth, fifth and sixth grades respectively.

6.
As a global measure of precision, item response theory (IRT) estimated reliability is derived for four coefficients (Cronbach's α, Feldt-Raju, stratified α, and marginal reliability). Models with different underlying assumptions concerning test-part similarity are discussed. A detailed computational example is presented for the targeted coefficients. A comparison of the IRT model-derived coefficients is made and the impact of varying ability distributions is evaluated. The advantages of IRT-derived reliability coefficients for problems such as automated test form assembly and vertical scaling are discussed.

7.
Open-ended counterparts to a set of items from the quantitative section of the Graduate Record Examination (GRE-Q) were developed. Examinees responded to these items by gridding a numerical answer on a machine-readable answer sheet or by typing on a computer. The test section with the special answer sheets was administered at the end of a regular GRE administration. Test forms were spiraled so that random groups received either the grid-in questions or the same questions in a multiple-choice format. In a separate data collection effort, 364 paid volunteers who had recently taken the GRE used a computer keyboard to enter answers to the same set of questions. Despite substantial format differences noted for individual items, total scores for the multiple-choice and open-ended tests demonstrated remarkably similar correlational patterns. There were no significant interactions of test format with either gender or ethnicity.

8.
Item-response changing as a function of test anxiety was investigated. Seventy graduate students completed the Test Anxiety Scale and 73 multiple-choice items during the quarter. The data supported the hypothesis that high test-anxious students make more item-response changes than low test-anxious students. Results also suggested that both high- and low-anxious students profit proportionally to a similar extent from answer changing. It was further found that more responses were changed on difficult than on easy items for both high- and low-anxious students. Test anxiety is suggested as a factor shaping test-taking style.

9.
Applied Measurement in Education, 2013, 26(3): 167–180
In the figural response item format, proficiency is expressed by manipulating elements of a picture or diagram. Figural response items in architecture were contrasted with multiple-choice counterparts in their ability to predict architectural problem-solving proficiency. Problem-solving proficiency was measured by performance on two architecture design problems, one of which involved a drawing component, whereas the other required only a written verbal response. Both figural response and multiple-choice scores predicted verbal design problem solving, but only the figural response scores predicted graphical problem solving. The presumed mechanism for this finding is that figural response items more closely resemble actual architectural tasks than do multiple-choice items. Some evidence for this explanation is furnished by architects' self-reports, in which architects rated figural response items as "more like what an architect does" than multiple-choice items.

10.
Previous researchers had established the equivalence of a group-administered version of the PPVT with the standard procedure of individual administration, as well as the reliability between alternate forms of the PPVT; this study therefore attempted to establish the concurrent validity of a group-administered version of the PPVT in terms of two criterion variables. An r of .62 was obtained between the Otis, a group test of intelligence, and the PPVT. An r of .55 was found between the PPVT and the Stanford Achievement Test. Both r's were significant beyond the .01 level. The concurrent validity of the PPVT was established, and suggestions for additional research were made.

11.
In contrast to multiple-choice test questions, figural response items call for constructed responses and rely upon figural material, such as illustrations and graphs, as the response medium. Figural response questions in various science domains were created and administered to a sample of 4th-, 8th-, and 12th-grade students. Item and test statistics from parallel sets of figural response and multiple-choice questions were compared. Figural response items were generally more difficult, especially for questions that were difficult (p < .5) in their constructed-response forms. Figural response questions were also slightly more discriminating and reliable than their multiple-choice counterparts, but they had higher omit rates. This article addresses the relevance of guessing to figural response items and the diagnostic value of the item type. Plans for future research on figural response items are discussed.

12.
High school students completed both multiple-choice and constructed response exams over an 845-word narrative passage on which they either took notes or underlined critical information. A control group merely read the text. In addition, half of the learners in each condition were told to expect either a multiple-choice or constructed response test following reading. Overall, note takers showed superior posttest recall, and notetaking without test instructions yielded the best group performance. Notetaking also required significantly more time than the other conditions. Underlining for a multiple-choice test led to better recall than underlining for a constructed response test. Although more multiple-choice than constructed response items were remembered, test mode failed to interact with the other factors.

13.
This article considers psychometric properties of composite raw scores and transformed scale scores on mixed-format tests that consist of a mixture of multiple-choice and free-response items. Test scores on several mixed-format tests are evaluated with respect to conditional and overall standard errors of measurement, score reliability, and classification consistency and accuracy under three item response theory (IRT) frameworks: unidimensional IRT (UIRT), simple structure multidimensional IRT (SS-MIRT), and bifactor multidimensional IRT (BF-MIRT) models. Illustrative examples are presented using data from three mixed-format exams with various levels of format effects. In general, the two MIRT models produced similar results, while the UIRT model resulted in consistently lower estimates of reliability and classification consistency/accuracy indices compared to the MIRT models.

14.
The purpose of this investigation was to develop and validate a simulation device to measure a teacher's ability to identify verbal and nonverbal emotions expressed by students (teacher affective sensitivity). The scale consists of videotaped excerpts of teacher-learner interactions and accompanying multiple-choice instrumentation. Respondents select the answer from each multiple-choice item that they believe most accurately describes the affective state of the pupil viewed on the monitor. Previously produced media focusing on classroom interactions were used to obtain the examples of learner affective expressions. Expert judges constructed two multiple-choice items for each simulation episode. Pilot test administrations allowed for numerous scale revisions. Finally, assessments were made of scale reliability and of construct, predictive, concurrent, and content validity.

15.
This study focused on the development of a two-tier multiple-choice diagnostic instrument, which was designed, progressively modified, and implemented to assess students' understanding of solution chemistry concepts. The results of the study are derived from the responses of 756 Grade 11 students (age 16–17) from 14 different high schools who participated in the study. The final version of the instrument included a total of 13 items that addressed the six aspects of solution chemistry, and students' understanding was challenged in multiple contexts with multiple modes and levels of representation. Cronbach alpha reliability coefficients for the content tier and both tiers of the test were found to be 0.697 and 0.748, respectively. Results indicated that a substantial number of students held an inadequate understanding of solution chemistry concepts. In addition, 21 alternative conceptions observed in more than 10% of the students were reported, along with discussion on possible sources of such conceptions.

16.
Although test scores from similar tests in multiple choice and constructed response formats are highly correlated, equivalence in rankings may mask differences in substantive strategy use. The author used an experimental design and participant think-alouds to explore cognitive processes in mathematical problem solving among undergraduate examinees (N = 64). The study examined the effect of format on mathematics performance and strategy use for male and female examinees given stem-equivalent items. A statistically significant main effect of format on performance was found, with constructed-response items proving more difficult. The multiple-choice format was associated with more varied strategies, backward strategies, and guessing. Format was found to moderate the effect of problem conceptualization on performance. Results suggest that while the multiple-choice format may be adequate for ranking students on performance, the constructed response format should be preferred for many contemporary educational purposes that seek to provide nuanced information about student cognition.

17.
Educators have need for a procedure to generate alternate forms of tests. The reliability of alternate forms generated from a table of specifications is examined, using 78 high school remedial mathematics students as subjects. Ten forms of a test were constructed and administered; seven of these forms were readministered. The alternate forms correlation, .85, is as high as the test-retest correlation, .82, lending support to the hypothesis that alternate forms generated from a table of specifications are reliable. Discussion includes educational uses for a table of specifications in textbooks to generate test forms.

18.
Human Self-Assessment in Multiple-Choice Testing
Research indicates that the multiple-choice format in itself often seems to favor males over females. The present study utilizes a method that enables test takers to assess the correctness of their answers. Applying this self-assessment method may not only make multiple-choice tests less biased but also provide a more comprehensive measurement of usable knowledge, that is, the kind of knowledge about which a person is sufficiently sure that he or she will use it to make decisions and take actions. The performance of male and female undergraduates on a conventional multiple-choice test was compared with their performance on a multiple-choice self-assessment test. Results show that the difference between test scores of males and those of females was reduced when subjects were allowed to make self-assessments. This may be explained in terms of the alleged difference in cognitive style between the genders.

19.
This empirical study aimed to investigate the impact of easy-first vs. hard-first ordering of the same items in a paper-and-pencil multiple-choice exam on the performance of low, moderate, and high achiever examinees, as well as on the item statistics. Data were collected from 554 Turkish university students using two test forms, which included the same multiple-choice items in reverse order, i.e. easy first vs. hard first. Tests included 26 multiple-choice items about the introductory unit of a "Measurement and Assessment" course. The results suggested that sequencing the multiple-choice items in either direction, from easy to hard or vice versa, did not affect the test performance of the examinees, regardless of whether they were low, moderate, or high achievers. Finally, no statistically significant difference was observed between the item statistics of the two forms, i.e. the difficulty (p), discrimination (d), point-biserial (r), and adjusted point-biserial (adj. r) coefficients.

20.
Few reliable and valid measures of reading achievement are available to evaluate programs for elementary English-as-a-second-language (ESL) pupils. Four variations on the cloze procedure, which has been previously used with disadvantaged and ESL elementary pupils, were evaluated using randomly assigned groups of fourth- and fifth-grade students. Matching and multiple-choice variations were selected for comparison because they are in greater consonance with current psycholinguistic theories of the reading process than are other types of reading comprehension measures. Although the overall results were quite similar for the four cloze variations examined, the matching cloze procedure seems to be preferable for elementary ESL students, since these tests produced better item characteristics and were more easily constructed.


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号