Similar literature
20 similar documents found.
1.
Researchers in education are often interested in determining whether independent groups are equivalent on a specific outcome. Equivalence tests for 2 independent populations have been widely discussed, whereas testing for equivalence with more than 2 independent groups has received little attention. The authors discuss alternatives for testing the equivalence of more than 2 independent populations, and they use a Monte Carlo study to demonstrate and compare the performance of these alternatives under several conditions. The results indicate that a 1-way test (e.g., Wellek's F test) is recommended for assessing the equivalence of more than 2 independent groups because approaches based on conducting pairwise tests of equivalence are overly conservative.

2.
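Differentiated instruction is a high-quality, engaging form of teaching; it is an inevitable trend in the reform of education and teaching and a necessary path to overcoming the shortcomings of traditional instruction. Policy support, theoretical underpinnings, and the momentum of the new curriculum reform make it feasible. In designing differentiated instruction, understanding students is the precondition; finding a point of breakthrough is essential to generating it; challenge and variety are its key elements; and flexible grouping and tiered tasks are its core. Implementing differentiated instruction faces difficulties such as the constraints of the college entrance examination (gaokao), the difficulty of measuring student differences precisely, and teachers' limited energy and ability, but its prospects are bright.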

3.
This module describes and extends X‐to‐Y regression measures that have been proposed for use in the assessment of X‐to‐Y scaling and equating results. Measures are developed that are similar to those based on prediction error in regression analyses but that are directly suited to interests in scaling and equating evaluations. The regression and scaling function measures are compared in terms of their uncertainty reductions, error variances, and the contribution of true score and measurement error variances to the total error variances. The measures are also demonstrated as applied to an assessment of scaling results for a math test and a reading test. The results of these analyses illustrate the similarity of the regression and scaling measures for scaling situations when the tests have a correlation of at least .80, and also show the extent to which the measures can be adequate summaries of nonlinear regression and nonlinear scaling functions, and of heteroskedastic errors. After reading this module, readers will have a comprehensive understanding of the purposes, uses, and differences of regression and scaling functions.

4.
To best influence policymakers, researchers need to provide information and measures of effects that reflect the nature of policy decisions. Specifically, policymakers are often interested in factors associated with changes in the number of cases or rate of disorders in a community. Regression/analysis of variance (ANOVA) models, which focus on the prediction of means, slopes, and variances, do not directly address such questions. In contrast, epidemiological statistics, which focus on differences in proportions of cases, do provide such information. Three epidemiological measures of effect (the risk-ratio, the odds-ratio, and the population attributable fraction) are reviewed; their value as tools for informing public policy is discussed; and examples are provided illustrating their use. Researchers are encouraged to consider adopting an epidemiological perspective as part of their work.
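
The three measures named in the abstract above can be illustrated with a minimal sketch. The abstract gives no formulas or data, so the sketch assumes the standard textbook definitions for a 2x2 exposure-by-outcome table; the class name, function names, and counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TwoByTwo:
    """Counts from a hypothetical exposure-by-outcome table (illustrative only)."""
    exposed_cases: int
    exposed_noncases: int
    unexposed_cases: int
    unexposed_noncases: int


def _risks(t: TwoByTwo) -> tuple[float, float]:
    # Proportion of cases among the exposed and among the unexposed.
    risk_exposed = t.exposed_cases / (t.exposed_cases + t.exposed_noncases)
    risk_unexposed = t.unexposed_cases / (t.unexposed_cases + t.unexposed_noncases)
    return risk_exposed, risk_unexposed


def risk_ratio(t: TwoByTwo) -> float:
    # Risk in the exposed group divided by risk in the unexposed group.
    risk_exposed, risk_unexposed = _risks(t)
    return risk_exposed / risk_unexposed


def odds_ratio(t: TwoByTwo) -> float:
    # Cross-product ratio of the 2x2 table.
    return (t.exposed_cases * t.unexposed_noncases) / (
        t.exposed_noncases * t.unexposed_cases
    )


def population_attributable_fraction(t: TwoByTwo) -> float:
    # Share of the overall risk that would vanish if everyone had the
    # unexposed group's risk (assumed definition, not taken from the abstract).
    _, risk_unexposed = _risks(t)
    total = (
        t.exposed_cases + t.exposed_noncases
        + t.unexposed_cases + t.unexposed_noncases
    )
    overall_risk = (t.exposed_cases + t.unexposed_cases) / total
    return (overall_risk - risk_unexposed) / overall_risk


if __name__ == "__main__":
    table = TwoByTwo(30, 70, 30, 270)  # made-up counts
    print(risk_ratio(table), odds_ratio(table), population_attributable_fraction(table))
```

Unlike the first two measures, the population attributable fraction depends on how common the exposure is in the community, which is one reason it speaks to questions about the number of cases.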

5.
This paper presents the results of a simulation study to compare the performance of the Mann-Whitney U test, Student's t test, and the alternate (separate variance) t test for two mutually independent random samples from normal distributions, with both one-tailed and two-tailed alternatives. The estimated probability of a Type I error was controlled (in the sense of being reasonably close to the attainable level) by all three tests when the variances were equal, regardless of the sample sizes. However, it was controlled only by the alternate t test for unequal variances with unequal sample sizes. With equal sample sizes, the probability was controlled by all three tests regardless of the variances. When it was controlled, we also compared the power of these tests and found very little difference. This means that very little power will be lost if the Mann-Whitney U test is used instead of tests that require the assumption of normal distributions.
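
As a rough illustration of the three statistics compared in the abstract above, the following sketch computes the pooled-variance t, the separate-variance t with Welch-Satterthwaite degrees of freedom, and the Mann-Whitney U statistic using only the standard library. It does not reproduce the paper's simulation; the function names are hypothetical, and the formulas are the usual textbook ones rather than anything quoted from the paper.

```python
import math
from statistics import mean, variance


def pooled_t(x, y):
    """Student's t with a pooled variance estimate; returns (t, df)."""
    n1, n2 = len(x), len(y)
    pooled_var = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    t = (mean(x) - mean(y)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


def separate_variance_t(x, y):
    """Separate-variance t with Welch-Satterthwaite df; returns (t, df)."""
    n1, n2 = len(x), len(y)
    v1, v2 = variance(x) / n1, variance(y) / n2
    t = (mean(x) - mean(y)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


def mann_whitney_u(x, y):
    """U statistic for x: pairs with x_i > y_j, counting ties as one half."""
    return sum(
        1.0 if xi > yj else 0.5 if xi == yj else 0.0
        for xi in x
        for yj in y
    )


if __name__ == "__main__":
    a = [1, 2, 3, 4, 5, 6, 7, 8]  # made-up sample with small variance
    b = [0, 10]                   # made-up sample with large variance
    print(pooled_t(a, b), separate_variance_t(a, b), mann_whitney_u(a, b))
```

With equal sample sizes the two t statistics coincide and only the degrees of freedom differ. With unequal sizes and unequal variances the standard errors diverge, which is consistent with the abstract's finding that only the separate-variance test controlled the Type I error rate in that case.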

6.
The WSD and F tests show the same response when the homogeneous variance assumption is violated. Both are robust when the ns are equal, but may be either seriously conservatively biased or seriously permissively biased when heterogeneous variances are combined with unequal ns. The use of equal ns is recommended for either test. There is no virtue in insisting that the F test be significant prior to conducting the WSD when the alternative to the null is μi ≠ μj. However, the conservative bias created by this procedure is small when K = 4; as is the permissive bias created by conducting both tests. For uniformly distributed μ’s, the two tests have very similar powers. This condition would not be expected for different arrangements of the μ’s.

7.
Translating and adapting tests and questionnaires across languages is a common strategy for comparing people who operate in different languages with respect to their achievement, attitude, personality, or other psychological construct. Unfortunately, when tests and questionnaires are translated from one language to another, there is no guarantee that the different language versions are equivalent. In this study, we present and evaluate a methodology for investigating the equivalence of translated-adapted items using bilingual test takers. The methodology involves applying item response theory models to data obtained from randomly equivalent groups of bilingual respondents. The technique was applied to an English-Turkish version of a course evaluation form. The results indicate that the methodology is effective for flagging items that function differentially across languages as well as for informing the test development and test adaptation processes. The utility and limitations of the procedure for evaluating translation equivalence are discussed.

8.
A reliability coefficient for criterion-referenced tests is developed from the assumptions of classical test theory. This coefficient is based on deviations of scores from the criterion score, rather than from the mean. The coefficient is shown to have several of the important properties of the conventional norm-referenced reliability coefficient, including its interpretation as a ratio of variances and as a correlation between parallel forms, its relationship to test length, its estimation from a single form of a test, and its use in correcting for attenuation due to measurement error. Norm-referenced measurement is considered as a special case of criterion-referenced measurement.
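
The abstract above does not state the coefficient itself. As a hedged sketch, the code below assumes the form commonly given for a criterion-referenced coefficient of this kind, in which the squared distance between the mean and the criterion score is added to both the true-score and the observed-score variance. The function name and parameter names are hypothetical.

```python
def criterion_referenced_reliability(
    reliability: float,
    mean: float,
    variance: float,
    criterion: float,
) -> float:
    """Assumed form: (r * var + (mean - C)^2) / (var + (mean - C)^2).

    `reliability` is the conventional norm-referenced coefficient r,
    `variance` is the observed-score variance, and `criterion` is the
    cut score C from which deviations are taken.
    """
    shift = (mean - criterion) ** 2
    return (reliability * variance + shift) / (variance + shift)


if __name__ == "__main__":
    # Made-up values: r = 0.80, mean = 50, variance = 25, criterion = 60.
    print(criterion_referenced_reliability(0.80, 50.0, 25.0, 60.0))
    # Criterion placed at the mean: the value falls back to r.
    print(criterion_referenced_reliability(0.80, 50.0, 25.0, 50.0))
```

Under this assumed form, setting the criterion equal to the mean returns the conventional coefficient, which matches the abstract's remark that norm-referenced measurement is a special case.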

9.
A potential concern for individuals interested in using item response theory (IRT) with achievement test data is that such tests have been specifically designed to measure content areas related to course curriculum and students taking the tests at different points in their coursework may not constitute samples from the same population. In this study, data were obtained from three administrations of two forms of a Biology achievement test. Data from the newer of the two forms were collected at a spring administration, made up of high school sophomores just completing the Biology course, and at a fall administration, made up mostly of seniors who completed their instruction in the course 6–18 months prior to the test administration. Data from the older form, already on scale, were collected at only a fall administration, where the sample was comparable to the newer form fall sample. IRT and conventional item difficulty parameter estimates for the common items across the two forms were compared for each of the two form/sample combinations. In addition, conventional and IRT score equatings were performed between the new and old forms for each of the form/sample combinations. Widely disparate results were obtained between the equatings based on the two form/sample combinations. Conclusions are drawn about the use of both classical test theory and IRT in situations such as that studied, and implications of the results for achievement test validity are also discussed.

10.
We present statistical tests for departures from random expectation in spatial memory tasks. We consider two common protocols for spatial memory experiments. In the first one, subjects are allowed to search a fixed number of sites. In the second protocol, subjects are allowed to search until they achieve a fixed number of successes. In either of these protocols, the subjects involved may or may not revisit sites that have been previously searched or exploited. This yields four situations to consider: fixed number of sites searched or fixed number of successes, with or without revisits. We derive analytical expressions for the probability mass functions, expectations, and variances associated with each type of null hypothesis. We present three statistical tests of these hypotheses: the Kolmogorov-Smirnov test, the ordinary sign test, and the Z test. We use our results to demonstrate a priori calculation of sample sizes and statistical power and to consider a mixed model of sampling with and without replacement.
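
The abstract above does not reproduce the derived expressions. For the fixed-number-of-sites protocol, one plausible reading is that random search without revisits amounts to sampling without replacement and random search with revisits to sampling with replacement. The sketch below works under that assumption, and under the further assumption that rewards are not depleted on revisits; all names and the example numbers are hypothetical.

```python
from math import comb


def pmf_without_revisits(k: int, sites: int, rewarded: int, searched: int) -> float:
    """Hypergeometric probability of k successes when no site is revisited."""
    if k < 0 or k > searched:
        return 0.0
    return comb(rewarded, k) * comb(sites - rewarded, searched - k) / comb(sites, searched)


def pmf_with_revisits(k: int, sites: int, rewarded: int, searched: int) -> float:
    """Binomial probability of k successes when every visit is an independent draw."""
    if k < 0 or k > searched:
        return 0.0
    p = rewarded / sites
    return comb(searched, k) * p ** k * (1 - p) ** (searched - k)


def mean_and_variance(pmf, sites: int, rewarded: int, searched: int) -> tuple[float, float]:
    """Expectation and variance of the success count under the given null PMF."""
    probs = [pmf(k, sites, rewarded, searched) for k in range(searched + 1)]
    mu = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mu) ** 2 * p for k, p in enumerate(probs))
    return mu, var


if __name__ == "__main__":
    # Made-up design: 8 sites, 4 of them rewarded, 4 searches allowed.
    print(mean_and_variance(pmf_without_revisits, 8, 4, 4))
    print(mean_and_variance(pmf_with_revisits, 8, 4, 4))
```

Both null models give the same expected number of successes, but the variance is smaller without revisits, so the choice of null model affects any test statistic standardized by that variance.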

11.
Exams are increasingly being used as learning tools in the form of collaborative assessments as opposed to their traditional use as a summative assessment tool for verifying individual student learning. Despite the growing popularity of collaborative assessments, few studies test the differential effects of collaborative assessments (versus traditional assessments) on student learning. This paper analyzes 16 empirical studies from various disciplines and investigates the extent to which collaborative assessments improve student learning. The paper further provides recommendations and improved practices for instructors interested in using collaborative assessment in their classroom.

12.
Although standardized tests have been in use for years, there is a lack of consensus about what constitutes appropriate student preparation for testing. Popham (1991) demonstrated that teachers and administrators view preparation in different ways and noted that there is considerable diversity of opinion about which practices are appropriate and inappropriate. Other researchers have attempted to create standards or guidelines for determining appropriate testing practice, but these do not appear to capture the diversity of teacher-initiated preparation. Do teachers and testing specialists see preparation in the same way? What practices fall into the grey area of being not appropriate but not necessarily unethical? This study examines eight categories (40 practices) of preparation or teacher intervention to maximize student test performance. Teachers (N = 42) and testing specialists (N = 10) were asked to examine practices and determine how appropriate or inappropriate the practices were for a specified test. Results show that teachers consistently rate practices to be more appropriate than do testing specialists. Significant differences between teachers and specialists were found for six of the eight categories of preparation. Practices such as motivational activities, pretest interventions, same format preparation, and previous form preparation were perceived to be less evident regarding the appropriateness of their use by teachers in the classroom. This article concludes with a call for test developers and school district representatives to collaborate to determine the appropriateness of testing practice for local needs and a recognition that concrete and widely disseminated guidelines for testing practices are needed for a variety of tests and instructional decisions.

13.
Interest in measuring and evaluating student learning in higher education is growing. There are many tools available to assess student learning. However, the use of such tools may be more or less appropriate under various conditions. This study provides some evidence related to the appropriate use of pre/post‐tests. The question of whether graded tests elicit a higher level of performance (better representation of actual learning gains) than ungraded post‐tests is examined. We examine whether the difficulty level of the questions asked (knowledge/comprehension vs. analysis/application) affects this difference. We test whether the student’s level in the degree programme affects this difference. Results indicate that post‐tests may not demonstrate the full level of student mastery of learning objectives and that both the difficulty level of the questions asked and the level of students in their degree programme affect the difference between graded and ungraded assessments. Some of these differences may be due to causes other than grades on the assessments. Students may have benefited from the post‐test, as a review of the material, or from additional studying between the post‐test and the final examination. Results also indicate that pre‐tests can be useful in identifying appropriate changes in course materials over time.

14.
Author and book title recognition tests have been used extensively in reading‐related research with both children and adults. The present paper reports the development of a book title and author recognition test and data from a UK sample of adults. Higher scores were obtained on the Author test than on the Title test. It is suggested that the tests are suitable for use with the United Kingdom adult population.

15.
Speededness refers to the situation where the time limits on a standardized test do not allow substantial numbers of examinees to fully consider all test items. When tests are not intended to measure speed of responding, speededness introduces a severe threat to the validity of interpretations based on test scores. In this article, we describe test speededness, its potential threats to validity, and traditional and modern methods that can be used to assess the presence of speededness. We argue that more attention must be paid to this issue and that more research must be done to set appropriate time limits on power tests so that speed of responding does not interfere with the construct measured.

16.
Scores on state standards‐based assessments are readily available and may be an appropriate alternative to traditional placement tests for assigning or accepting students into particular courses. Many community colleges do not require test scores for admissions purposes but do require some kind of placement scores for first‐year English and math courses. In this study, we examine the efficacy of using the reading and math portions of the Kansas State Assessment (KSA) for predicting the success of high school students taking College Algebra and College English I at a Kansas community college. Results showed that in this sample KSA scores predicted as well or better than more traditional placement tests and with no extra cost to the institution.

17.
Touch screen tablets are being increasingly used in schools for learning and assessment. However, the validity and reliability of assessments delivered via tablets are largely unknown. The present study tested the psychometric properties of a tablet-based app designed to measure early literacy skills. Tablet-based tests were also compared with traditional paper-based tests. Children aged 2–6 years (N = 99) completed receptive tests delivered via a tablet for letter, word, and numeral skills. The same skills were tested with a traditional paper-based test that used an expressive response format. Children (n = 35) were post-tested 8 weeks later to examine the stability of test scores over time. The tablet test scores showed high internal consistency (all α’s > .94), acceptable test-retest reliability (ICC range = .39–.89), and were correlated with child age, family SES, and home literacy teaching to indicate good predictive validity. The agreement between scores for the tablet and traditional tests was high (ICC range = .81–.94). The tablet tests provide valid and reliable measures of children’s early literacy skills. The strong psychometric properties and ease of use suggest that tablet-based tests of literacy skills have the potential to improve assessment practices for research purposes and classroom use.

18.
The purpose of this study was to examine the effect of a digitized podcast to deliver read-aloud testing accommodations on mobile devices to students with disabilities and reading difficulties. The total sample for this study included 47 middle school students with reading difficulties. Of the 47 students, 16 were identified as students with disabilities who received special education services. Participants were randomly assigned to three experimental testing conditions (standard administration, teacher-controlled read-aloud in traditional group delivery format, and student-controlled read-aloud delivered as a podcast and accessed on a mobile device) and given sample end-of-year science assessments. Based on a factorial analysis of variance, with test conditions and student status as the fixed factors, both student groups demonstrated statistically significant gains based on their testing conditions. Results support the use of podcast delivery as a viable alternative to the traditional teacher-delivered read-aloud test accommodation. Conclusions are discussed in the context of universal design for learning testing accommodations for future research and practice.

19.
Applied Measurement in Education, 2013, 26(3): 241-261
This simulation study compared two procedures to enable an adaptive test to select items in correspondence with a content blueprint. Trait level estimates obtained from testlet-based and constrained adaptive tests administered to 10,000 simulated examinees under two trait distributions and three item pool sizes were compared to the trait level estimates obtained from traditional adaptive tests in terms of mean absolute error, bias, and information. Results indicate that using constrained adaptive testing requires an increase of 5% to 11% in test length over the traditional adaptive test to reach the same error level, and that using testlets requires an increase of 43% to 104% in test length over the traditional adaptive test. Given these results, the use of constrained computerized adaptive testing is recommended for situations in which an adaptive test must adhere to particular content specifications.

20.
In this article, we describe two United Kingdom (UK) screening tests for dyslexia: the Dyslexia Early Screening Test (DEST) and the Cognitive Profiling System (CoPS 1), both normed and designed to be administered by teachers to children four years and older. We first outline the political context in the UK, which, for the first time, makes the use of such tests viable. We then outline the research programs behind and the components of each test; reliability and validity are also discussed. Information is presented on the tests in use. We conclude that tests such as these have the potential to identify children as at risk before they fail, halting the cycle of emotional and motivational problems traditionally associated with dyslexia. Both tests are appropriate for use in the United States, and initial reactions from the education sector have been favorable.
