Similar literature
20 similar documents found.
1.
When tests are designed to measure dimensionally complex material, DIF analysis with matching based on the total test score may be inappropriate. Previous research has demonstrated that matching can be improved by using multiple internal or both internal and external measures to more completely account for the latent ability space. The present article extends this line of research by examining the potential to improve matching by conditioning simultaneously on test score and a categorical variable representing the educational background of the examinees. The responses of male and female examinees from a test of medical competence were analyzed using a logistic regression procedure. Results show a substantial reduction in the number of items identified as displaying significant DIF when conditioning is based on total test score and a variable representing educational background as opposed to total test score only.
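
The abstract does not state the exact model, so the following is only a sketch of one common logistic regression DIF specification, written under stated assumptions. All symbols are illustrative and not taken from the article: Y_ij is the scored response of examinee j to item i, X_j is the total test score, E_jk are dummy indicators for K educational-background categories, and G_j codes gender.

```latex
\operatorname{logit} P(Y_{ij}=1)
  = \beta_0 + \beta_1 X_j + \sum_{k=1}^{K-1} \gamma_k E_{jk}
  + \beta_2 G_j + \beta_3 X_j G_j
```

Under this sketch, uniform DIF corresponds to \beta_2 \neq 0 and nonuniform DIF to \beta_3 \neq 0. Matching on total score alone amounts to dropping the \gamma_k terms, so that group differences related to educational background can be absorbed by \beta_2.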

2.
In this study, 81 experimental and 79 control subjects (randomly assigned to treatments) took Form A of the Nelson Reading Test twice, with a four-week interval between test administrations. Instructions for the retest varied for the E and C groups. The latter group was told that the test was readministered for purposes of assessing improvement. The E subjects were informed that by improving their previous score they would be eligible to win a prize (candy bars, university sweaters, radios). Analysis of covariance indicated that the effect of the awards was significant (p < .01) in terms of number of items attempted and in terms of items correct. The adjusted mean increase for the E subjects was three months. The authors concluded that, if the terms of an actual performance contract were applied to their results, they would realize a profit of approximately $3,000 on a $75 investment.

3.
The purpose of this study was to examine the consistency with which students applied procedural rules for solving signed-number operations across identical items presented in different orders. A test with 64 open-ended items was administered to 161 eighth graders. The test consisted of two 32-item subtests containing identical items. The items in each subtest were in random order. Students' responses to each subtest were compared with respect to the identified underlying rules of operation used to solve each problem set. The results indicated that inconsistent rule application was common among students who had not mastered signed-number arithmetic operations. In contrast, mastery-level students, those who use the right rules, show a stable pattern of rule application in signed-number arithmetic. These results are discussed in light of the hypothesis testing approach to the learning process.

4.
High item discrimination can be a symptom of a special kind of measurement disturbance introduced by an item that gives persons of high ability a special advantage over and above their higher abilities. This type of disturbance, which can be interpreted as a form of item "bias," can be encouraged by methods that routinely interpret highly discriminating items as the "best" items on a test and may be compounded by procedures that weight items by their discrimination. The type of measurement disturbance described and illustrated in this paper occurs when an item is sensitive to individual differences on a second, undesired dimension that is positively correlated with the variable intended to be measured. Possible secondary influences of this type include opportunity to learn, opportunity to answer, and test wiseness.

5.
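Objective: To assess the mental health of primary and secondary school teachers. Methods: The Symptom Checklist-90 (SCL-90) was used to survey 424 primary and secondary school teachers. Results: The teachers' scores on every factor and their number of positive items were all higher than the norm; the difference was significant for the interpersonal sensitivity factor (P<0.05) and highly significant for all others (P<0.001). Female teachers scored higher than male teachers on all SCL-90 measures, with significant differences in the depression factor score and the number of positive items (P<0.05), a more marked difference in the phobic anxiety factor score (P<0.01), and no significant differences elsewhere (P>0.05). Scores rose in the order remote areas, cities, and county towns and rural areas. Between teachers in cities and those in remote areas, only the anxiety and psychoticism factors and the number of positive items differed significantly (P<0.05); the remaining differences were not significant (P>0.05). Teachers in county towns and rural areas scored markedly higher than those in cities and remote areas, with highly significant differences in the obsessive-compulsive, interpersonal sensitivity, depression, and psychoticism factors (P<0.001); the differences in the anxiety and phobic anxiety factors between teachers in county towns and rural areas and those in remote areas were also highly significant (P<0.001). Conclusion: The overall mental health of primary and secondary school teachers is poor, female teachers' mental health is slightly worse than that of male teachers, and urban-rural differences have a significant effect on teachers' mental health.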

6.
This paper presents a new method for using certain restricted latent class models, referred to as binary skills models, to determine the skills required by a set of test items. The method is applied to reading achievement data from a nationally representative sample of fourth-grade students and offers useful perspectives on test structure and examinee ability, distinct from those provided by other methods of analysis. Models fitted to small, overlapping sets of items are integrated into a common skill map, and the nature of each skill is then inferred from the characteristics of the items for which it is required. The reading comprehension items examined conform closely to a unidimensional scale with six discrete skill levels that range from an inability to comprehend or match isolated words in a reading passage to the abilities required to integrate passage content with general knowledge and to recognize the main ideas of the most difficult passages on the test.

7.
A potential concern for individuals interested in using item response theory (IRT) with achievement test data is that such tests have been specifically designed to measure content areas related to course curriculum and students taking the tests at different points in their coursework may not constitute samples from the same population. In this study, data were obtained from three administrations of two forms of a Biology achievement test. Data from the newer of the two forms were collected at a spring administration, made up of high school sophomores just completing the Biology course, and at a fall administration, made up mostly of seniors who completed their instruction in the course from 6–18 months prior to the test administration. Data from the older form, already on scale, were collected at only a fall administration, where the sample was comparable to the newer form fall sample. IRT and conventional item difficulty parameter estimates for the common items across the two forms were compared for each of the two form/sample combinations. In addition, conventional and IRT score equatings were performed between the new and old forms for each of the form/sample combinations. Widely disparate results were obtained between the equatings based on the two form/sample combinations. Conclusions are drawn about the use of both classical test theory and IRT in situations such as that studied, and implications of the results for achievement test validity are also discussed.

8.
This study examines the claim that attempting, or guessing at, more items yields improved formula scores. Two samples of students who had taken a form of the SAT-Verbal consisting of three parallel half-hour sections were used to form the following scores on each of the three sections: the number of attempts, a guessing index, the formula score, and (indirectly) an approximation to an ability score. Correlations were obtained separately for the two samples between the attempts, and the guessing index, on one section, the formula score on a second section, and ability as measured by the third section. The partial correlations obtained hovered near zero, suggesting, contrary to conventional opinion, that, on average, attempting more items and guessing are not helpful in yielding higher formula scores, and that, therefore, formula scoring is not generally disadvantageous to the student who is less willing to guess and attempt an item that he or she is not sure of. On closer examination, however, it became clear that the advantages of guessing depend, at least in part, on the ability of the examinee. Although the relationship is generally quite weak, it is apparently the case that more able examinees do tend to profit somewhat from guessing, and would therefore be disadvantaged by their reluctance to guess. On the other hand, less able examinees may lower their scores if they guess.
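
The partial correlations mentioned in this abstract can be illustrated with the standard first-order formula. This is a minimal sketch only: the function name and the correlation values are hypothetical, and the abstract does not report the study's own coefficients.

```python
import math


def partial_correlation(r_xy: float, r_xz: float, r_yz: float) -> float:
    """First-order partial correlation of x and y, controlling for z."""
    denom = math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    if denom == 0.0:
        raise ValueError("undefined when |r_xz| or |r_yz| equals 1")
    return (r_xy - r_xz * r_yz) / denom


# Hypothetical values: x = attempts on one section, y = formula score on a
# second section, z = ability as measured by a third section.
r = partial_correlation(r_xy=0.30, r_xz=0.50, r_yz=0.60)
print(round(r, 3))
```

With these made-up inputs the raw correlation of .30 between attempts and formula score is fully accounted for by ability, so the partial correlation is zero, which is the kind of pattern the abstract describes.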

9.
The Angoff method requires experts to view every item on the test and make a probability judgment. This can be time consuming when there are large numbers of items on the test. In this study, a G-theory framework was used to determine if a subset of items can be used to make generalizable cut-score recommendations. Angoff ratings (i.e., probability judgments) from previously conducted standard setting studies were used first in a re-sampling study, followed by D-studies. For the re-sampling study, proportionally stratified subsets of items were extracted under various sampling and test-length conditions. The mean cut score, variance components, expected standard error (SE) around the mean cut score, and root-mean-squared deviation (RMSD) across 1,000 replications were estimated at each study condition. The SE and the RMSD decreased as the number of items increased, but this reduction tapered off after approximately 45 items. Subsequently, D-studies were performed on the same datasets. The expected SE was computed at various test lengths. Results from both studies are consistent with previous research indicating that between 40 and 50 items are sufficient to make generalizable cut-score recommendations.
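
As a rough illustration of the D-study step, the sketch below computes the expected SE of a mean Angoff rating for a fully crossed items-by-judges design with both facets treated as random. The design, the function name, and the variance components are assumptions made for illustration; the abstract does not report the study's actual design or estimates.

```python
import math


def cut_score_se(var_item: float, var_judge: float, var_resid: float,
                 n_items: int, n_judges: int) -> float:
    """Expected SE of the mean rating for a crossed items x judges design."""
    if n_items < 1 or n_judges < 1:
        raise ValueError("n_items and n_judges must be positive")
    # Each variance component is divided by the number of conditions
    # sampled from the facet(s) it involves.
    error_var = (var_item / n_items
                 + var_judge / n_judges
                 + var_resid / (n_items * n_judges))
    return math.sqrt(error_var)


# Hypothetical variance components on the proportion (0 to 1) scale.
components = dict(var_item=0.020, var_judge=0.002, var_resid=0.015)
for n_items in (10, 20, 45, 90):
    se = cut_score_se(n_items=n_items, n_judges=10, **components)
    print(n_items, round(se, 4))
```

Because the judge component does not shrink as items are added, the SE levels off as the item count grows, which is one way to read the tapering the abstract reports.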

10.
In classical test theory, a test is regarded as a sample of items from a domain defined by generating rules or by content, process, and format specifications. If the items are a random sample of the domain, then the percent-correct score on the test estimates the domain score, that is, the expected percent correct for all items in the domain. When the domain is represented by a large set of calibrated items, as in item banking applications, item response theory (IRT) provides an alternative estimator of the domain score by transformation of the IRT scale score on the test. This estimator has the advantage of not requiring the test items to be a random sample of the domain, and of having a simple standard error. We present here resampling results in real data demonstrating for uni- and multidimensional models that the IRT estimator is also a more accurate predictor of the domain score than is the classical percent-correct score. These results have implications for reporting outcomes of educational qualification testing and assessment.
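
The transformation from an IRT scale score to a domain score can be sketched as the average model-implied probability of a correct response over every calibrated item in the bank. The example assumes a unidimensional three-parameter logistic model; the function names and item parameters are hypothetical and are not drawn from the article.

```python
import math


def p_correct(theta: float, a: float, b: float, c: float = 0.0) -> float:
    """Probability of a correct response under a 3PL model (no 1.7 scaling)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def domain_score(theta: float, item_bank: list) -> float:
    """Expected proportion correct over all calibrated items in the bank."""
    if not item_bank:
        raise ValueError("item_bank must contain at least one item")
    return sum(p_correct(theta, a, b, c) for a, b, c in item_bank) / len(item_bank)


# Hypothetical bank of (a, b, c) parameters for three calibrated items.
bank = [(1.0, -1.0, 0.2), (1.2, 0.0, 0.2), (0.8, 1.0, 0.2)]
for theta in (-1.0, 0.0, 1.0):
    print(theta, round(100 * domain_score(theta, bank), 1))
```

The items administered to an examinee are used only to estimate theta; the domain score is then computed over the whole bank, which is why the administered items need not be a random sample of the domain.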

11.
In actual test development practice, the number of test items that must be developed and pretested is typically greater, and sometimes much greater, than the number that is eventually judged suitable for use in operational test forms. This has proven to be especially true for one item type, analytical reasoning, that currently forms the bulk of the analytical ability measure of the GRE General Test. This study involved coding the content characteristics of some 1,400 GRE analytical reasoning items. These characteristics were correlated with indices of item difficulty and discrimination. Several item characteristics were predictive of the difficulty of analytical reasoning items. Generally, these same variables also predicted item discrimination, but to a lesser degree. The results suggest several content characteristics that could be considered in extending the current specifications for analytical reasoning items. The use of these item features may also contribute to greater efficiency in developing such items. Finally, the influence of these various characteristics also provides a better understanding of the construct validity of the analytical reasoning item type.

12.
Selected parameters for a negatively skewed and a normally distributed normative distribution were estimated in a post mortem item-examinee sampling investigation. Manipulated systematically were number of subtests, number of items per subtest, and number of examinees responding to each subtest. Each item-examinee sampling procedure was replicated five times. Defining one observation as the score received by one examinee on one item, the results of this investigation support the conclusion that, in estimating parameters by item-examinee sampling, the variable of importance is not the item-examinee sampling procedure but is instead the number of observations obtained by that procedure. Degree of skewness in the normative distribution and failure to distribute all items among subtests were found to be relatively unimportant variables.

13.
《教育实用测度》2013,26(1):31-57
Examined in this study were the effects of test length and sample size on the alternate forms reliability and the equating of simulated mathematics tests composed of constructed-response items scaled using the 2-parameter partial credit model. Test length was defined in terms of the number of both items and score points per item. Tests with 2, 4, 8, 12, and 20 items were generated, and these items had 2, 4, and 6 score points. Sample sizes of 200, 500, and 1,000 were considered. Precise item parameter estimates were not found when 200 cases were used to scale the items. To obtain acceptable reliabilities and accurate equated scores, the findings suggested that tests should have at least eight 6-point items or at least 12 items with 4 or more score points per item.
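
For readers unfamiliar with the model, the sketch below computes category probabilities for one constructed-response item. It assumes that the "2-parameter partial credit model" corresponds to the generalized partial credit model, with one discrimination parameter and one step parameter per score transition; the function names and parameter values are hypothetical.

```python
import math


def gpcm_probs(theta: float, a: float, steps: list) -> list:
    """Probabilities of scores 0..m for an item with m step parameters."""
    # Score 0 has cumulative logit 0; each further score adds a * (theta - b_v).
    logits = [0.0]
    for b in steps:
        logits.append(logits[-1] + a * (theta - b))
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(z - top) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]


def expected_score(theta: float, a: float, steps: list) -> float:
    """Model-implied expected item score at ability theta."""
    return sum(k * p for k, p in enumerate(gpcm_probs(theta, a, steps)))


# Hypothetical item with four score categories (0 to 3).
probs = gpcm_probs(theta=0.5, a=1.1, steps=[-1.0, 0.0, 1.2])
print([round(p, 3) for p in probs])
print(round(expected_score(0.5, 1.1, [-1.0, 0.0, 1.2]), 3))
```

Each additional score category adds a step parameter to estimate, which may help explain why the abstract finds 200 cases too few for precise item parameter estimates.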

14.
《Educational Assessment》2013,18(4):317-340
A number of methods for scoring tests with selected-response (SR) and constructed-response (CR) items are available. The selection of a method depends on the requirements of the program, the particular psychometric model and assumptions employed in the analysis of item and score data, and how scores are to be used. This article compares 3 methods: unweighted raw scores, Item Response Theory pattern scores, and weighted raw scores. Student score data from large-scale end-of-course high school tests in Biology and English were used in the comparisons. In the weighted raw score method evaluated in this study, the CR items were weighted so that SR and CR items contributed the same number of points toward the total score. The scoring methods were compared for the total group and for subgroups of students in terms of the resultant scaled score distributions, standard errors of measurement, and proficiency-level classifications. For most of the student ability distribution, the three scoring methods yielded similar results. Some differences in results are noted. Issues to be considered when selecting a scoring method are discussed.
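
One way to read the weighted raw score method is that CR points are rescaled so the CR section's maximum equals the SR section's maximum. The sketch below follows that reading; the function name and the point totals are hypothetical, and the article may have implemented the weighting differently.

```python
def weighted_raw_score(sr_points: float, cr_points: float,
                       sr_max: float, cr_max: float) -> float:
    """Raw score with CR points rescaled to match the SR section's maximum."""
    if sr_max <= 0 or cr_max <= 0:
        raise ValueError("section maxima must be positive")
    cr_weight = sr_max / cr_max  # makes both sections worth sr_max points
    return sr_points + cr_weight * cr_points


# Hypothetical test: 40 one-point SR items and CR items worth 20 points total.
total = weighted_raw_score(sr_points=30, cr_points=12, sr_max=40, cr_max=20)
print(total)  # 30 SR points plus 12 CR points doubled
```

In this made-up case each CR point counts twice, so an unweighted raw score of 42 becomes a weighted score of 54 out of a possible 80.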

15.
This study investigated the effectiveness of an interactive vocabulary instructional strategy, semantic-feature analysis (SFA), on the content area text comprehension of adolescents with learning disabilities. Prior to reading a social studies text, students in resource classes either completed a relationship chart as part of the SFA condition or used the dictionary to write definitions and sentences as part of the contrast condition. Passage comprehension was measured on a multiple-choice test consisting of two types of items, vocabulary and conceptual. Comprehension was measured immediately following teaching and again 6 months after teaching. Prior knowledge for the content of the passage served as a covariate. Results indicated that students in the SFA instructional condition had significantly greater measured comprehension immediately following and 6 months after initial teaching. These results are discussed in relation to concept-driven, interactive strategies for teaching content and facilitating text comprehension.

16.
Wilcox (16) proposed a latent structure model for answer-until-correct tests that can solve various measurement problems including correcting for guessing without assuming guessing is at random. This paper proposes a closed sequential procedure for estimating true score that can be used in conjunction with an answer-until-correct test. For criterion-referenced tests where the goal is to determine whether an examinee’s true score is above or below a known constant, the accuracy of the new procedure is exactly the same as a more conventional sequential solution. The advantage of the new procedure is that it eliminates the possibility of using an inordinately large number of items when in fact a large number of items is not needed; typical sequential procedures always allow this possibility. In addition, the new procedure appears to compare favorably to traditional tests where the number of items to be administered is fixed in advance.

17.
Two forms of a social studies achievement test were constructed with half the items for each form containing a cue, grammar, or length fault. Faults were found to make the items easier, which was supported by confidence intervals for the differences. However, validity coefficients with achievement and intelligence criteria, as well as the reliability coefficients, were virtually unchanged. The results agreed with those of Dunn and Goldstein (1959), even though the methodology differed. A suggested measure of test-wiseness for groups is presented.

18.
Responses to a 50-item, four-choice test were simulated for 1,000 examinees under conventional formula-scoring instructions. One hundred ninety-two simulation runs reflected variations in the average level of item difficulty, the extent to which examinees tended to omit inappropriately (when the formula-scoring directions recommended guessing), the extent to which they were misinformed (classified correct answers as distractors), the extent to which they guessed contrary to the formula-scoring instructions, the extent to which examinee ability and tendency to omit inappropriately were correlated, the examinee ability level at which misinformation was most prevalent, and the extent to which item difficulty was related to the probability that an examinee would be misinformed. For each examinee, formula scores and expected formula scores were determined allowing and not allowing inappropriate omissions. Under certain conditions, failure to guess as recommended by the formula-scoring instructions produced nontrivial proportions of examinees with expected score losses of one or more points. These conditions were a test of at least moderate difficulty, a low level for the tendency to be misinformed, and at least a moderate level for the tendency to omit inappropriately.
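
The conventional formula score and the expected cost of omitting can be sketched as follows. The example assumes the usual rights-minus-a-fraction-of-wrongs rule and an examinee who eliminates only options that really are wrong, so it ignores the misinformation the simulation also models. The function names and response counts are hypothetical.

```python
def formula_score(right: int, wrong: int, n_choices: int) -> float:
    """Formula score: rights minus wrongs / (choices - 1); omits score zero."""
    return right - wrong / (n_choices - 1)


def expected_gain_from_guess(n_choices: int, n_eliminated: int = 0) -> float:
    """Expected formula-score change from guessing on one item, not omitting."""
    remaining = n_choices - n_eliminated
    if remaining < 1:
        raise ValueError("at least one option must remain")
    p_right = 1.0 / remaining  # guess uniformly among the remaining options
    return p_right - (1.0 - p_right) / (n_choices - 1)


# Hypothetical 50-item, four-choice result: 30 right, 12 wrong, 8 omitted.
print(formula_score(right=30, wrong=12, n_choices=4))
print(expected_gain_from_guess(4))                  # blind guess
print(expected_gain_from_guess(4, n_eliminated=1))  # one distractor ruled out
```

A blind guess has an expected gain of zero, while ruling out a single distractor raises it to one ninth of a point per item. Under these assumptions, an examinee who omits nine such items gives up about one point in expectation, the size of loss the abstract calls nontrivial.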

19.
This article outlines how learning objectives based upon science, technology and society (STS) elements for Palestinian ninth grade science textbooks were identified, which was part of a bigger study to establish an STS foundation in the ninth grade science curriculum in Palestine. First, an initial list of STS elements was determined. Second, using this list, ninth grade science textbooks and curriculum document contents were analyzed. Third, based on this content analysis, a possible list of 71 learning objectives for the integration of STS elements was prepared. This list of learning objectives was refined by using a two-round Delphi technique. The Delphi study was used to rate and to determine the consensus regarding which items (i.e. learning objectives for STS in the ninth grade science textbooks in Palestine) are to be accepted for inclusion. The results revealed that of the initial 71 objectives in round one, 59 objectives within round two had a mean score of 5.683 or higher, which indicated that the learning objectives could be included in the development of STS modules for ninth grade science in Palestine.

20.
Conditioning theories and recent real-time models commonly postulate that a reinforcer is signaled by a series of stimuli. In both Pavlovian and operant procedures, serial stimuli have been shown to control the likelihood and timing of responses over intervals of seconds and minutes. The present experiments were conducted to determine whether serial stimuli exercise similar effects over stimulus-reinforcer intervals in the order of hundreds of milliseconds. Such intervals typify those used in conditioning of the rabbit nictitating membrane response. A sequence of four tone pulses (50–100 msec) was used as the CS to assess the effectiveness of serial stimuli. After training with this CS, tests were conducted in which one or more of the pulses were removed. These perturbations of the sequence of stimuli over a 400-msec interval produced large deficits in CR likelihood and smaller alterations in CR timing. The results are discussed with respect to their implications for current real-time models of conditioning, and particularly with respect to their assumptions about the source of internal stimuli, rules for learning, and rules for generating CRs.
