Similar Literature
20 similar documents retrieved (search time: 62 ms)
1.
The purpose of this study was to investigate the effects of items, passages, contents, themes, and types of passages on the reliability and standard errors of measurement for complex reading comprehension tests. Seven different generalizability theory models were used in the analyses. Results indicated that generalizability coefficients estimated using multivariate models incorporating content strata and types of passages were similar in size to reliability estimates based upon a model that did not include these factors. In contrast, incorporating passages and themes within univariate generalizability theory models produced non-negligible differences in the reliability estimates. This suggested that passages and themes be taken into account when evaluating the reliability of test scores for complex reading comprehension tests.
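The generalizability models in this entry estimate reliability from variance components. As a hedged, minimal sketch (not the authors' seven models, and ignoring facets such as content strata and passage type), the relative G coefficient for the simplest persons × items design can be computed from two-way ANOVA mean squares; the data matrix below is hypothetical:

```python
import numpy as np

def g_coefficient(scores):
    """Relative generalizability (G) coefficient for a simple
    persons x items (p x i) random-effects design.

    Variance components come from the two-way ANOVA mean squares:
        sigma2_res = MS_residual
        sigma2_p   = (MS_persons - MS_residual) / n_items
    and the G coefficient for a test of n_items items is
        sigma2_p / (sigma2_p + sigma2_res / n_items).
    """
    x = np.asarray(scores, dtype=float)
    n_p, n_i = x.shape
    grand = x.mean()
    ss_p = n_i * ((x.mean(axis=1) - grand) ** 2).sum()
    ss_i = n_p * ((x.mean(axis=0) - grand) ** 2).sum()
    ss_res = ((x - grand) ** 2).sum() - ss_p - ss_i
    ms_p = ss_p / (n_p - 1)
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))
    var_p = max((ms_p - ms_res) / n_i, 0.0)
    return var_p / (var_p + ms_res / n_i)

# Hypothetical 0/1 scores: 4 persons x 3 items
data = [[1, 1, 0],
        [0, 0, 0],
        [1, 1, 1],
        [0, 1, 0]]
print(round(g_coefficient(data), 3))  # → 0.75
```

For this bare p × i design the relative G coefficient coincides with Cronbach's alpha; the study's multivariate designs add further facets precisely to separate out passage and theme effects.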

2.
Equivalent forms of a ten-item completion test were constructed. The same test items then were rewritten in matching format and in multiple-choice format, resulting in two forms (A and B) of each of three types of test. All tests were administered to 73 examinees, and parallel-forms reliability coefficients (correlation between scores on A and B) were calculated. These empirically obtained values were compared to the values of the reliability coefficient predicted from theoretically derived equations which indicate the influence of chance success due to guessing on test reliability. In accordance with theory it was found that the completion test was more reliable than the matching test and that the matching test was more reliable than the multiple-choice test. The empirically obtained reliability coefficients were very close to those predicted from the mathematically derived formulas.
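The empirical coefficient in this entry is simply the Pearson correlation between examinees' scores on forms A and B. A minimal sketch with hypothetical scores (the study's theoretical guessing equations are not reproduced here):

```python
import numpy as np

def parallel_forms_reliability(form_a, form_b):
    """Parallel-forms reliability coefficient: the Pearson correlation
    between examinees' total scores on two equivalent forms."""
    return np.corrcoef(np.asarray(form_a, dtype=float),
                       np.asarray(form_b, dtype=float))[0, 1]

# Hypothetical total scores of 8 examinees on forms A and B
a = [7, 5, 9, 4, 6, 8, 3, 7]
b = [8, 5, 9, 3, 7, 7, 4, 6]
print(round(parallel_forms_reliability(a, b), 3))
```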

3.
This report is a review of reliability data on the PPVT obtained from 32 research studies published between 1965 and 1974. Much of the research was done on Head Start children. Overall, the median of reliability coefficients reported here (0.72) has remained remarkably close to the original median of 0.77 found in standardizing the test. Unexpectedly, elapsed time between test and retest had only a slight effect on the reliability coefficients. However, as expected, the greater the range in ages and ability levels of subjects, the higher the reliabilities. For average children in the elementary grades, and for retarded people of all ages, PPVT scores remained relatively stable over time and there was close equivalence between alternate forms. Scores were least stable for preschool children, especially from minority groups. Black preschool girls were more variable in their performance on the PPVT than boys, and preschool girls generally were more responsive than boys to play periods conducted before testing was begun. A number of variables associated with examiners and setting affected the scores on the test. As expected, raw scores tended to yield slightly higher reliabilities than MA scores and considerably higher reliabilities than IQ scores.

4.
Reliability of Scores From Teacher-Made Tests
Reliability is the property of a set of test scores that indicates the amount of measurement error associated with the scores. Teachers need to know about reliability so that they can use test scores to make appropriate decisions about their students. The level of consistency of a set of scores can be estimated by using the methods of internal analysis to compute a reliability coefficient. This coefficient, which can range between 0.0 and +1.0, usually has values around 0.50 for teacher-made tests and around 0.90 for commercially prepared standardized tests. Its magnitude can be affected by such factors as test length, test-item difficulty and discrimination, time limits, and certain characteristics of the group: the extent of their testwiseness, their level of motivation, and their homogeneity in the ability measured by the test.
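The "methods of internal analysis" mentioned above are typically internal-consistency estimates, the most common being Cronbach's alpha. A minimal sketch with a hypothetical response matrix:

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Cronbach's alpha for an (examinees x items) score matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    x = np.asarray(item_scores, dtype=float)
    k = x.shape[1]
    item_vars = x.var(axis=0, ddof=1)
    total_var = x.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Hypothetical 0/1 responses: 4 examinees x 3 items
responses = [[1, 1, 0],
             [0, 0, 0],
             [1, 1, 1],
             [0, 1, 0]]
print(round(cronbach_alpha(responses), 3))  # → 0.75
```

Lengthening the test or sharpening item discrimination raises the item intercorrelations and hence the coefficient, which is why teacher-made tests, being short, tend toward the 0.50 range.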

5.
The validity and reliability of curriculum‐based measures in reading as indicators of performance and progress for secondary‐school students were examined. Thirty‐five grade 8 students completed reading aloud and maze‐selection measures weekly for 10 weeks. Criterion measures were the state standards test in reading and the Woodcock–Johnson III Test of Achievement. Different time frames for each measure were compared. Most alternate‐form reliability coefficients were above .80. Criterion‐related validity coefficients ranged from .77 to .89. No differences related to time were found. Only maze selection reflected significant growth, with an average increase of 1.29 correct choices per week. Maze growth was related to the reading performance level and to change on the Woodcock–Johnson III from pre‐ to posttest.

6.
Applied Measurement in Education, 2013, 26(1): 31-57
Examined in this study were the effects of test length and sample size on the alternate forms reliability and the equating of simulated mathematics tests composed of constructed-response items scaled using the 2-parameter partial credit model. Test length was defined in terms of the number of both items and score points per item. Tests with 2, 4, 8, 12, and 20 items were generated, and these items had 2, 4, and 6 score points. Sample sizes of 200, 500, and 1,000 were considered. Precise item parameter estimates were not found when 200 cases were used to scale the items. To obtain acceptable reliabilities and accurate equated scores, the findings suggested that tests should have at least eight 6-point items or at least 12 items with 4 or more score points per item.

7.
The article outlines some practical and theoretical weaknesses of the concept of ‘reading age’. It considers ambiguity in the computational methods for determining the ‘average’ reading age and the tendency for a reading age to become a fixed property of a pupil rather than an estimate of his standing, relative to time of testing and choice of test. A theoretical standpoint is taken that too little is known about the way a reader develops for his attainment to be given a developmental or age‐based score. It is further suggested that in any case the relationship between age and reading development is imperfect. This point is developed further when the assumption of a linear pattern in reading development is criticized. It is further suggested that insofar as reading does follow chronological age, an age‐adjustment in the scale used to express attainment is highly desirable. Practical considerations such as selection of reading material, comparison with other educational tests, and the assessment of reading progress are mentioned. The article concludes with a brief summary of the problems associated with reading ‘quotients’ and with designating children as ‘under‐achievers’.

8.
If a test is constructed of testlets, one must take into account the within-testlet structure in the calculation of test statistics. Failing to do so may yield serious biases in the estimation of such statistics as reliability. We demonstrate how to calculate the reliability of a testlet-based test. We show that traditional reliabilities calculated on two reading comprehension tests constructed of four testlets are substantial overestimates.

9.
Educational Assessment, 2013, 18(4): 303-323
In this study, a curriculum-based measurement (CBM) of reading aloud from narrative passages was used to predict performance on statewide achievement tests in reading and math. Scores on multiple-choice reading and math achievement tests were moderately correlated with scores on rate measures during the same year and rate measures administered 1 year previously. The results provide initial support for use of timed oral readings to predict students' performance on statewide achievement tests. Usefulness of CBM in monitoring students' progress toward preestablished benchmarks is supported, as well as the stability of the measures over time. Results are interpreted as a new application of research conducted on CBM during the past 2 decades.

10.
Reliability of a criterion-referenced test is often viewed as the consistency with which individuals who have taken two strictly parallel forms of a test are classified as being masters or nonmasters. However, in practice, it is rarely possible to retest students, especially with equivalent forms. For this reason, methods for making conservative approximations of alternate form (or test-retest “without the effects of testing”) reliability have been developed. Because these methods are computationally tedious and require some psychometric sophistication, they have rarely been used by teachers and school psychologists. This paper (a) describes one method (Subkoviak's) for estimating alternate-form reliability from one administration of a criterion-referenced test and (b) describes a computer program developed by the authors that will handle tests containing hundreds of items for large numbers of examinees and allow any test user to apply the technique described. The program is a superior alternative to other methods of simplifying this estimation procedure that rely upon tables; a user can check classification consistency estimates for several prospective cut scores directly from a data file, without having to make prior calculations.
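The single-administration idea can be sketched in a simplified form. The version below uses each examinee's raw proportion correct as the binomial parameter, whereas Subkoviak's actual procedure first regresses observed scores toward the group mean using a reliability estimate, so treat this as an illustration of the logic only:

```python
from math import comb

def classification_consistency(scores, n_items, cut):
    """Simplified single-administration estimate of master/nonmaster
    decision consistency (in the spirit of Subkoviak's method, but
    without the regression toward the group mean).

    Each n_items-item score is treated as binomial with p equal to the
    observed proportion correct; P is the chance of scoring at or above
    the cut on a parallel form, and P**2 + (1 - P)**2 is the chance of
    receiving the same classification on two administrations."""
    total = 0.0
    for x in scores:
        p = x / n_items
        p_pass = sum(comb(n_items, k) * p ** k * (1 - p) ** (n_items - k)
                     for k in range(cut, n_items + 1))
        total += p_pass ** 2 + (1 - p_pass) ** 2
    return total / len(scores)

# Hypothetical scores on a 10-item test with a cut score of 7
print(round(classification_consistency([9, 8, 6, 3, 10, 7], 10, 7), 3))
```

Note how examinees far from the cut contribute consistency near 1.0 while examinees near the cut drag the index down, which is why the choice of cut score matters so much in the program the authors describe.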

11.
This study employed an adapted alternating treatments single-case design to explore students’ learning of biology content when using a general note-taking (GNT) strategy and a content-specific graphic organizer (CGO) to support reading high school biology texts. The 4 focal participants were 15–18-year-olds committed to a moderate-risk juvenile justice facility. Lessons were delivered once a week for 7 weeks, with CGO delivered first in odd weeks and GNT first in even weeks. When students were unfamiliar with the strategies or experiencing emotional or health problems, their weekly quiz scores tended to be higher on whichever lesson was delivered first. After stabilizing, an average-ability reader did better on CGO lessons, and a student with below-average reading ability did better on GNT lessons. CGO took more time to prepare but an average of 11 minutes less than each GNT lesson to implement. CGO also was associated with more student-initiated responses and more self-reported student preferences.

12.
Reliability has a long history as one of the key psychometric properties of a test. However, a given test might not measure people equally reliably. Test scores from some individuals might have considerably greater error than others. This study proposed two approaches using intraindividual variation to estimate test reliability for each person. A simulation study suggested that the parallel tests approach and the structural equation modeling approach recovered the simulated reliability coefficients. Then in an empirical study, where 45 females were measured daily on the Positive and Negative Affect Schedule (PANAS) for 45 consecutive days, separate estimates of reliability were generated for each person. Results showed that reliability estimates of the PANAS varied substantially from person to person. The methods provided in this article apply to tests measuring changeable attributes and require repeated measures across time on each individual. This article also provides a set of parallel forms of PANAS.
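The parallel-tests approach described here can be sketched for a single person: score two parallel halves of the scale on each day, correlate the half-scores across days, and step the half-length correlation up with the Spearman-Brown formula. All data below are hypothetical, and this is only the skeleton of the idea, not the article's full procedure:

```python
import numpy as np

def person_reliability(half_a, half_b):
    """Per-person reliability via the parallel-tests idea: correlate
    one person's two parallel half-scale scores across repeated daily
    administrations, then apply Spearman-Brown (2r / (1 + r)) to step
    the half-length correlation up to full test length."""
    r = np.corrcoef(np.asarray(half_a, dtype=float),
                    np.asarray(half_b, dtype=float))[0, 1]
    return 2 * r / (1 + r)

# Hypothetical daily half-scale scores for one person over 8 days
half_a = [12, 15, 11, 14, 13, 16, 12, 15]
half_b = [13, 14, 12, 15, 12, 16, 13, 14]
print(round(person_reliability(half_a, half_b), 2))
```

Repeating this for each of the 45 participants yields the person-specific coefficients whose spread the study reports.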

13.
Instructors can use both “multiple‐choice” (MC) and “constructed response” (CR) questions (such as short answer, essay, or problem‐solving questions) to evaluate student understanding of course materials and principles. This article begins by discussing the advantages and concerns of using these alternate test formats and reviews the studies conducted to test the hypothesis (or perhaps better described as the hope) that MC tests, by themselves, perform an adequate job of evaluating student understanding of course materials. Despite research from educational psychology demonstrating the potential for MC tests to measure the same levels of student mastery as CR tests, recent studies in specific educational domains find imperfect relationships between these two performance measures. We suggest that a significant confound in prior experiments has been the treatment of MC questions as homogeneous entities when in fact MC questions may test widely varying levels of student understanding. The primary contribution of the article is a modified research model for CR/MC research based on knowledge‐level analyses of MC test banks and CR question sets from basic computer language programming. The analyses are based on an operationalization of Bloom's Taxonomy of Learning Goals for the domain, which is used to develop a skills‐focused taxonomy of MC questions. However, we propose that these analyses readily generalize to similar teaching domains of interest to decision sciences educators, such as modeling and simulation programming.

14.
Educators have need for a procedure to generate alternate forms of tests. The reliability of alternate forms generated from a table of specifications is examined, using 78 high school remedial mathematics students as subjects. Ten forms of a test were constructed and administered; seven of these forms were readministered. The alternate-forms correlation, .85, is as high as the test-retest correlation, .82, lending support to the hypothesis that alternate forms generated from a table of specifications are reliable. Discussion includes educational uses for a table of specifications in textbooks to generate test forms.

15.
We exploit within-teacher variation in the years that math and reading teachers in grades 4–8 host an apprentice (“student teacher”) in Washington State to estimate the causal effect of these apprenticeships on student achievement, both during the apprenticeship and afterwards. While the average causal effect of hosting a student teacher on student performance in the year of the apprenticeship is indistinguishable from zero in both math and reading, hosting a student teacher is found to have modest positive impacts on student math and reading achievement in a teacher’s classroom in following years. These findings suggest that schools and districts can participate in the student teaching process without fear of short-term decreases in student test scores while potentially gaining modest long-term test score increases.

16.
Comprehension tests are often used interchangeably, suggesting an implicit assumption that they are all measuring the same thing. We examine the validity of this assumption by comparing some of the most popular reading comprehension measures used in research and clinical practice in the United States: the Gray Oral Reading Test (GORT), the two assessments (retellings and comprehension questions) from the Qualitative Reading Inventory (QRI), the Woodcock–Johnson Passage Comprehension subtest (WJPC), and the Reading Comprehension test from the Peabody Individual Achievement Test (PIAT). Modest intercorrelations among the tests suggested that they were measuring different skills. Regression analyses showed that decoding, not listening comprehension, accounts for most of the variance in both the PIAT and the WJPC; the reverse holds for the GORT and both QRI measures. Large developmental differences in what the tests measure were found for the PIAT and the WJPC, but not the other tests, both when development was measured by chronological age and by word reading ability. We discuss the serious implications for research and clinical practice of having different comprehension tests measure different skills and of having the same test assess different skills depending on developmental level.

17.
Examination Reform in Universities and the Cultivation of Undergraduates' Innovative Ability
University examinations are an important means of evaluation in higher education, and they could also serve as an effective means of evaluating undergraduates' innovative ability. At present, however, the content, format, and scheduling of examinations in Chinese universities are poorly designed, and the structure of the examinations cannot satisfy the requirements for assessing innovative ability, so university examinations struggle to evaluate that ability. Concrete reform measures should therefore restructure examination content to highlight the assessment of innovative ability, increase ongoing (formative) assessment to bring out the diagnostic function of examinations, diversify examination formats, and establish a scientific evaluation system.

18.
Time limits on some computer-adaptive tests (CATs) are such that many examinees have difficulty finishing, and some examinees may be administered tests with more time-consuming items than others. Results from over 100,000 examinees suggested that about half of the examinees must guess on the final six questions of the analytical section of the Graduate Record Examination if they were to finish before time expires. At the higher-ability levels, even more guessing was required because the questions administered to higher-ability examinees were typically more time consuming. Because the scoring model is not designed to cope with extended strings of guesses, substantial errors in ability estimates can be introduced when CATs have strict time limits. Furthermore, examinees who are administered tests with a disproportionate number of time-consuming items appear to get lower scores than examinees of comparable ability who are administered tests containing items that can be answered more quickly, though the issue is very complex because of the relationship of time and difficulty, and the multidimensionality of the test.

19.
We provide new evidence about the effect of testing language on test scores using data from two rounds (conducted approximately six years apart) of the New Immigrants Survey. In each round, U.S.-born and foreign-born children of Hispanic origin were randomly assigned to take the Woodcock-Johnson achievement (two reading and two math) tests, either in Spanish or in English. U.S.-born children of Hispanic immigrants perform better in reading tests (but not in math tests) when they are assigned to take tests in English. The size of the testing-language effect remains stable across rounds. Foreign-born children of Hispanic immigrants perform better in both reading and math tests when they are assigned to take tests in Spanish in the first round. However, the size of the testing-language effect declines in reading tests and completely disappears in math tests by the second round. Our results suggest that the depreciation of Spanish skills is an essential factor (and, in some cases, more important than the accumulation of English skills) in explaining the decline in the testing-language effect among foreign-born children. We also explore how age at immigration and years spent in the U.S. affect language assimilation.

20.
If tests of cognitive ability are repeatedly taken, test scores rise. Such retest effects have been observed for a long time and for a variety of tasks. This study investigates retest effects on figural matrix items in an educational context. A short-term effect is assumed for the direct retest administration in the same test session, and a long-term effect is assumed for a retest interval of six months. Using multilevel modeling, we analyze whether the magnitude of these effects is influenced not only by individual variation, but also by the cluster structure of students grouped within classrooms. We also investigate whether the use of identical versus parallel tests has an impact on the size of the retest effects. Our main results show a negligible short-term retest effect, but a large long-term retest effect. Using parallel tests does not contribute to understanding individual differences in retest effects. The variation in retest effects is larger between classrooms than between students. Reasoning ability, as measured with a different test, and school grades significantly influence retest effects at the individual level, but at the classroom level, only reasoning ability is a significant predictor.
