Similar literature
 Found 20 similar documents (search time: 31 ms)
1.
In automated test assembly (ATA), the methodology of mixed‐integer programming is used to select test items from an item bank to meet the specifications for a desired test form and optimize its measurement accuracy. The same methodology can be used to automate the formatting of the set of selected items into the actual test form. Three different cases are discussed: (i) computerized test forms in which the items are presented on a screen one at a time and only their optimal order has to be determined; (ii) paper forms in which the items need to be ordered and paginated and the typical goal is to minimize paper use; and (iii) published test forms with the same requirements but a more sophisticated layout (e.g., double‐column print). For each case, a menu of possible test‐form specifications is identified, and it is shown how they can be modeled as linear constraints using 0–1 decision variables. The methodology is demonstrated using two empirical examples.

2.
《Applied Measurement in Education》2013,26(3):193-211
A procedure for interpreting multiple-discrimination indices obtained from a multidimensional item-response theory (MIRT) analysis is described and demonstrated. The procedure consists of converting discrimination parameter estimates to direction cosines and cluster analyzing the angular distances between item vectors, grouping together items with similar orientations in the theta space. The procedure is suggested as an alternative to conventional item factor analysis for investigating issues related to test dimensionality within a single test form and between alternate forms of a test.
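
The conversion described in this abstract can be illustrated with a short sketch. It assumes each item is represented by a plain list of discrimination estimates, one per dimension; the function names and the sample values are hypothetical and are not taken from the article. The clustering step itself is not shown.

```python
import math


def direction_cosines(discriminations):
    """Scale a vector of discrimination estimates to unit length.

    Each resulting component is the cosine of the angle between the
    item vector and the corresponding theta axis.
    """
    norm = math.sqrt(sum(a * a for a in discriminations))
    return [a / norm for a in discriminations]


def angular_distance(item_a, item_b):
    """Angle in degrees between two item vectors in the theta space."""
    cos_a = direction_cosines(item_a)
    cos_b = direction_cosines(item_b)
    dot = sum(x * y for x, y in zip(cos_a, cos_b))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    dot = max(-1.0, min(1.0, dot))
    return math.degrees(math.acos(dot))


# Hypothetical two-dimensional discrimination estimates for three items.
items = {
    "item1": [1.2, 0.1],
    "item2": [1.0, 0.2],
    "item3": [0.1, 0.9],
}

# Pairwise angular distances, the input a cluster analysis would work on.
names = sorted(items)
distances = {
    (p, q): angular_distance(items[p], items[q])
    for i, p in enumerate(names)
    for q in names[i + 1:]
}
```

Items whose vectors point in nearly the same direction (item1 and item2 in this made-up data) produce small angles and would be grouped together.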

3.
This empirical study investigated the impact of easy-first vs. hard-first ordering of the same items in a paper-and-pencil multiple-choice exam on the performance of low-, moderate-, and high-achieving examinees, as well as on the item statistics. Data were collected from 554 Turkish university students using two test forms, which included the same multiple-choice items in reverse order, i.e. easy first vs. hard first. The tests included 26 multiple-choice items about the introductory unit of a “Measurement and Assessment” course. The results suggested that sequencing the multiple-choice items in either direction, from easy to hard or vice versa, did not affect the test performance of the examinees, regardless of whether they were low, moderate, or high achievers. Finally, no statistically significant difference was observed between the item statistics of the two forms, i.e. the difficulty (p), discrimination (d), point biserial (r), and adjusted point biserial (adj. r) coefficients.
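
Three of the item statistics named in this abstract can be computed as in the following sketch. It assumes a 0/1 scored response matrix with one row per examinee, and it takes the adjusted point biserial to be the correlation with the total score excluding the item itself, which is a common definition but is not confirmed by the abstract. The discrimination index d is not covered, and the data are made up.

```python
import math
from statistics import fmean


def pearson(x, y):
    """Pearson correlation; assumes neither variable is constant."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def item_statistics(responses, item):
    """Difficulty and point-biserial statistics for one item.

    responses: list of rows, each a list of 0/1 item scores.
    item: column index of the item of interest.
    """
    scores = [row[item] for row in responses]
    totals = [sum(row) for row in responses]
    # Rest score: total score with the item itself removed.
    rest = [t - s for t, s in zip(totals, scores)]
    return {
        "p": fmean(scores),                    # proportion correct
        "r_pb": pearson(scores, totals),       # point biserial
        "adj_r_pb": pearson(scores, rest),     # adjusted point biserial
    }


# Hypothetical responses of four examinees to three items.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
stats = item_statistics(responses, 0)
```

Comparing such statistics item by item across the two orderings is one way to reproduce the kind of check the study reports.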

4.
Using Central Conceptual Structure theory as a heuristic, learning sequences in the acquisition of number knowledge are described in three forms that bridge theory and practice: as a four-stage theoretical progression, as items on a developmental test created to test the theoretical progression, and as learning objectives in a curriculum designed to teach math concepts and skills implied in the theoretical progression. Other aspects of the theory that were used to create teaching methods and materials for the Number Worlds curriculum are also described, as are some of the outcomes of this theory-based program.

5.
How is affective change rated with positive adjectives such as good related to change rated with negative adjectives such as bad? Two nested perfect and imperfect forms of dynamic bipolarity are defined using latent change structural equation models based on tetrads of items. Perfect bipolarity means that latent change scores correlate -1. Meaningful structural equation modeling (SEM) analyses of self-rated affect may require analyzing polychoric correlations, if self-ratings are collected using ordered categories. The models were applied to six 4-wave datasets from Steyer and Riedl (2004). Results suggest that perfect bipolarity is generally compatible with valence self-ratings, whereas imperfect bipolarity is compatible with tension and energy self-ratings. Methodological and substantive limits of the approach are discussed.

6.
《Educational Assessment》2013,18(1):99-110
The purpose of this article is to describe some of the measurement issues encountered in the equating of performance assessments designed for use in making teacher certification decisions. As some teacher certification programs move from sole reliance on multiple-choice items to inclusion of complex performance tasks, difficult measurement issues related to equating may arise. A variety of analytic and judgmental strategies are described in this article that may provide solutions for addressing these equating issues. Analytic strategies are based on examinee data and involve the modification of existing equating procedures, such as linear and equipercentile methods, that have been used successfully in the past with test forms composed of multiple-choice items. Judgmental strategies for equating involve the use of expert judgments to determine the equivalence of scores obtained from alternate forms of an assessment instrument.
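
For readers unfamiliar with the linear method mentioned in this abstract, a minimal sketch follows. It shows only the textbook mean-and-standard-deviation form of linear equating and assumes that the two forms were taken by equivalent groups; it is not the modified procedure the article proposes, and the score lists are invented.

```python
from statistics import fmean, pstdev


def linear_equate(x, scores_x, scores_y):
    """Place a Form X score on the Form Y scale.

    A score is mapped so that its standardized distance from the mean
    is the same on both forms:
        y = mean_Y + (sd_Y / sd_X) * (x - mean_X)
    Assumes Form X scores are not all identical (sd_X > 0).
    """
    mean_x, mean_y = fmean(scores_x), fmean(scores_y)
    sd_x, sd_y = pstdev(scores_x), pstdev(scores_y)
    return mean_y + (sd_y / sd_x) * (x - mean_x)


# Hypothetical total scores from two randomly equivalent groups.
form_x = [10, 20, 30]
form_y = [12, 24, 36]
equated = linear_equate(30, form_x, form_y)
```

Equipercentile equating replaces the straight line with a mapping between equal percentile ranks and is not shown here.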

7.
This study measured and explored the relationships among elementary mathematics teachers’ skill in (a) determining what an item measures, (b) analyzing student work, (c) providing targeted feedback, and (d) determining next instructional steps. Twenty-three elementary mathematics teachers were randomly assigned to one of three conditions: analyzing items and student responses without rubrics, analyzing items and student responses with rubrics, or analyzing items and student responses with rubrics after watching a professional development program on providing feedback to students. Findings show there is a moderate to strong relationship between teachers’ abilities to analyze student responses to infer what a student knows and can do and their abilities to take action based on that information through either providing the student feedback or making appropriate instructional adaptations. Findings show it was relatively more difficult for teachers to provide feedback that was likely to move students forward in their learning than it was for them to analyze a student's response or to determine next instructional steps. No teacher skill differences associated with the different treatment conditions were found.

8.
In most large-scale assessments of student achievement, several broad content domains are tested. Because more items are needed to cover the content domains than can be presented in the limited testing time to each individual student, multiple test forms or booklets are utilized to distribute the items to the students. The construction of an appropriate booklet design is a complex and challenging endeavor that has far-reaching implications for data calibration and score reporting. This module describes the construction of booklet designs as the task of allocating items to booklets under context-specific constraints. Several types of experimental designs are presented that can be used as booklet designs. The theoretical properties and construction principles for each type of design are discussed and illustrated with examples. Finally, the evaluation of booklet designs is described and future directions for researching, teaching, and reporting on booklet designs for large-scale assessments of student achievement are identified.
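
As one concrete illustration of allocating item blocks to booklets, the sketch below builds a small cyclic design with seven item blocks, seven booklets, and three blocks per booklet. The choice of design and the base set (0, 1, 3) are assumptions made for this example; the abstract does not say which designs the module presents.

```python
from collections import Counter
from itertools import combinations


def cyclic_booklets(n_blocks, base):
    """Build booklets by cyclically shifting a base set of block indices."""
    return [
        [(b + shift) % n_blocks for b in base]
        for shift in range(n_blocks)
    ]


# Seven item blocks, three blocks per booklet, seven booklets.
booklets = cyclic_booklets(7, (0, 1, 3))

# How often each pair of blocks shares a booklet (needed to link blocks).
pair_counts = Counter(
    tuple(sorted(pair))
    for booklet in booklets
    for pair in combinations(booklet, 2)
)

# How often each block appears in each booklet position.
position_counts = Counter(
    (position, block)
    for booklet in booklets
    for position, block in enumerate(booklet)
)
```

In this particular design every pair of blocks appears together in exactly one booklet and every block appears once in each position, two properties that are often checked when evaluating a booklet design.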

9.
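For both arithmetic and geometric sequences, the general-term formula usually has only one form, whereas the formula for the sum of the first n terms has two forms and is therefore more flexible to use. Using deductive reasoning and mathematical induction, this paper proves an alternative form of the general-term formula for arithmetic and geometric sequences and illustrates its use with worked examples, making the general-term formula simpler and more flexible to apply.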

10.
The purpose of this article is to illustrate a seven-step process for determining whether inferential reading items were more susceptible to cultural bias than literal reading items. The seven-step process was demonstrated using multiple-choice data from the reading portion of a reading/language arts test for fifth and seventh grade Hispanic, Black, and White examinees. The process began at the broadest level of analyzing bundles of items for differential bundle functioning and finished at the narrowest level of analyzing individual items for differential distractor functioning. Some evidence was found to indicate that inferential items are more susceptible to cultural bias than literal items. Implications of the results are discussed, and suggestions for item writers and test developers are given.

11.
Test assembly is the process of selecting items from an item pool to form one or more new test forms. Often new test forms are constructed to be parallel with an existing (or an ideal) test. Within the context of item response theory, the test information function (TIF) or the test characteristic curve (TCC) is commonly used as a statistical target to obtain this parallelism. In a recent study, Ali and van Rijn proposed combining the TIF and TCC as statistical targets, rather than using only a single statistical target. In this article, we propose two new methods using this combined approach, and compare these methods with single statistical targets for the assembly of mixed‐format tests. In addition, we introduce new criteria to evaluate the parallelism of multiple forms. The results show that single statistical targets can be problematic, while the combined targets perform better, especially in situations with increasing numbers of polytomous items. Implications of using the combined target are discussed.
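
The two statistical targets named in this abstract can be sketched for the simplest case. The code assumes dichotomous items under a two-parameter logistic model without the 1.7 scaling constant, whereas the article deals with mixed-format tests that include polytomous items. The `target_gap` helper is a hypothetical way of comparing a form with a reference form and is not one of the methods the article proposes.

```python
import math


def p_correct(theta, a, b):
    """2PL probability of a correct response (logistic metric)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def tcc(theta, items):
    """Test characteristic curve: expected number-correct score at theta."""
    return sum(p_correct(theta, a, b) for a, b in items)


def tif(theta, items):
    """Test information function: sum of 2PL item informations at theta."""
    total = 0.0
    for a, b in items:
        p = p_correct(theta, a, b)
        total += a * a * p * (1.0 - p)
    return total


def target_gap(form, reference, thetas):
    """Largest absolute TCC and TIF differences over a grid of theta values."""
    tcc_gap = max(abs(tcc(t, form) - tcc(t, reference)) for t in thetas)
    tif_gap = max(abs(tif(t, form) - tif(t, reference)) for t in thetas)
    return tcc_gap, tif_gap


# Hypothetical (a, b) parameters for a reference form and a candidate form.
reference = [(1.0, -0.5), (1.2, 0.0), (0.8, 0.5)]
candidate = [(1.1, -0.4), (1.0, 0.1), (0.9, 0.4)]
grid = [-2.0, -1.0, 0.0, 1.0, 2.0]
gaps = target_gap(candidate, reference, grid)
```

A form that matches the reference closely on one curve can still differ on the other, which is the motivation for using both targets together.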

12.
In many educational tests, both multiple‐choice (MC) and constructed‐response (CR) sections are used to measure different constructs. In many common cases, security concerns lead to the use of form‐specific CR items that cannot be used for equating test scores, along with MC sections that can be linked to previous test forms via common items. In such cases, adjustment by minimum discriminant information may be used to link CR section scores and composite scores based on both MC and CR sections. This approach is an innovative extension that addresses the long‐standing issue of linking CR test scores across test forms in the absence of common items in educational measurement. It is applied to a series of administrations from an international language assessment with MC sections for receptive skills and CR sections for productive skills. To assess the linking results, harmonic regression is applied to examine the effects of the proposed linking method on score stability, among several analyses for evaluation.

13.
When an exam consists, in whole or in part, of constructed-response items, it is a common practice to allow the examinee to choose a subset of the questions to answer. This procedure is usually adopted so that the limited number of items that can be completed in the allotted time does not unfairly affect the examinee. This results in the de facto administration of several different test forms, where the exact structure of any particular form is determined by the examinee. However, when different forms are administered, a canon of good testing practice requires that those forms be equated to adjust for differences in their difficulty. When the items are chosen by the examinee, traditional equating procedures do not strictly apply due to the nonignorable nature of the missing responses. In this article, we examine the comparability of scores on such tests within an IRT framework. We illustrate the approach with data from the College Board's Advanced Placement Test in Chemistry.

14.
Two matched forms of a 50-item multiple-choice grammar test were developed. Twenty items designed to be humorous were included in one form. Test forms were randomly assigned to 126 eighth graders who received the test plus alternate forms of a questionnaire. Inclusion of humorous items did not affect grammar scores on matched humorous/nonhumorous items nor on common post-treatment items, nor did inclusion affect results of anxiety measures. Students favored inclusion of humor on tests, judged effects of humor positively, and estimated humorous items to be easier. Humor did not lower performance but was sought by the students. Potential for more valid and humane measurement is discussed.

15.
This article introduces longitudinal multistage testing (lMST), a special form of multistage testing (MST), as a method for adaptive testing in longitudinal large‐scale studies. In lMST designs, test forms of different difficulty levels are used, whereas the values on a pretest determine the routing to these test forms. Since lMST allows for testing in paper-and-pencil mode, lMST may represent an alternative to conventional testing (CT) in assessments for which other adaptive testing designs are not applicable. In this article the performance of lMST is compared to CT in terms of test targeting as well as bias and efficiency of ability and change estimates. Using a simulation study, the effects of the stability of ability across waves, the difficulty level of the different test forms, and the number of link items between the test forms were investigated.

16.
This study was conducted to determine which skills and concepts students have that are prerequisites for solving moles problems through the use of analog tasks. Two analogous tests with four forms of each were prepared that corresponded to a conventional moles test. The analogs used were oranges and granules of sugar. Slight variations between test items on various forms permitted comparisons that would indicate specific conceptual and mathematical difficulties that students might have in solving moles problems. Different forms of the two tests were randomly assigned to 332 high school chemistry students of five teachers in four schools in central Indiana. Comparisons of total test score, subtest scores, and the number of students answering an item correctly using appropriate t tests and chi-square tests resulted in the following conclusions: (1) the size of the object makes no difference in the problem difficulty; (2) students understand the concepts of mass, volume, and particles equally well; (3) problems requiring two steps are harder than those requiring one step; (4) problems involving scientific notation are more difficult than those that do not; (5) problems involving the multiplication concept are easier than those involving the division concept; (6) problems involving the collective word “bag” are easier to solve than those using the word “billion”; (7) the use of the word “a(n)” makes the problem more difficult than using the number “1”.

17.
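Li Lingpo (李玲坡), 《海外英语》 (Overseas English), 2012,(4):99-100
Reading comprehension carries considerable weight in all kinds of English proficiency tests today. In particular, since the reform of English teaching and English testing, higher demands have been placed on learners' listening, speaking, reading, and writing abilities. Judging from candidates' scores, however, reading comprehension is often the section where the most points are lost. It is therefore essential to analyze the question types encountered in reading and to summarize coping strategies promptly.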

18.
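The "Chronological Table of Generals, Chancellors, and Eminent Officials since the Rise of the Han" in the Shiji (《史记·汉兴以来将相名臣年表》) contains seventy inverted entries (倒文, text written upside down), and they follow a clear pattern. Explanations for this peculiar phenomenon differ. By comparing and analyzing the purpose for which the table was composed, the characteristics of the content written upright and of that written inverted, and the plausibility of using inverted text for supplementary entries in the table, the author argues that the table's main theme is to highlight the historical achievements of the generals, chancellors, and eminent officials since the rise of the Han, and that the inverted entries are supplementary text outside that main theme.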

19.
To test the hypothesis that the basic “logic” utilized by individuals in scientific hypothesis testing is the biconditional (if and only if), and that the biconditional is a precondition for the development of formal operations, a sample of 387 students in grades eight, ten, twelve, and college were administered eight reasoning items. Five of the items involved the formal operational schemata of probability, proportions and correlations. Two of the items involved propositional logic. One item involved the biconditional. Percentages of correct responses on most of the items increased with age. A principal-component analysis revealed three factors, two of which were identified as involving operational thought, one of which involved propositional logic. As predicted, the biconditional reasoning item loaded on one of the operational thought factors. A Guttman scale analysis of the items failed to reveal a unidimensional scale, yet the biconditional reasoning item ordered first, supporting the hypothesis that it is a precondition for formal operational reasoning. Implications for teaching science students how to test hypotheses are discussed.

20.
Reliability of a criterion-referenced test is often viewed as the consistency with which individuals who have taken two strictly parallel forms of a test are classified as being masters or nonmasters. However, in practice, it is rarely possible to retest students, especially with equivalent forms. For this reason, methods for making conservative approximations of alternate form (or test-retest “without the effects of testing”) reliability have been developed. Because these methods are computationally tedious and require some psychometric sophistication, they have rarely been used by teachers and school psychologists. This paper (a) describes one method (Subkoviak's) for estimating alternate-form reliability from one administration of a criterion-referenced test and (b) describes a computer program developed by the authors that will handle tests containing hundreds of items for large numbers of examinees and allow any test user to apply the technique described. The program is a superior alternative to other methods of simplifying this estimation procedure that rely upon tables; a user can check classification consistency estimates for several prospective cut scores directly from a data file, without having to make prior calculations.
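
The kind of calculation this abstract refers to can be sketched as follows. The code follows the commonly described outline of a single-administration agreement estimate: a regressed proportion-correct score for each examinee, a binomial model for a parallel form, and the chance of the same classification on both forms. Treat it as an approximation written for illustration, not as the authors' program. The function name, the externally supplied reliability (for example a KR-20 value), and the scores are all assumptions.

```python
from math import comb
from statistics import fmean


def agreement_coefficient(totals, n_items, cut, reliability):
    """Estimate classification consistency from one test administration.

    totals: observed number-correct scores, one per examinee.
    n_items: number of items on the test.
    cut: minimum number-correct score required for mastery.
    reliability: a reliability estimate supplied by the caller.
    """
    mean_prop = fmean(totals) / n_items
    agreements = []
    for x in totals:
        # Regress the observed proportion toward the group mean.
        p = reliability * (x / n_items) + (1.0 - reliability) * mean_prop
        # Binomial probability of reaching the cut score on a parallel form.
        p_master = sum(
            comb(n_items, k) * p ** k * (1.0 - p) ** (n_items - k)
            for k in range(cut, n_items + 1)
        )
        # Same classification twice: master both times or nonmaster both times.
        agreements.append(p_master ** 2 + (1.0 - p_master) ** 2)
    return fmean(agreements)


# Hypothetical scores on a 10-item test, checked at two candidate cut scores.
scores = [3, 5, 6, 7, 8, 8, 9, 10]
estimates = {
    cut: agreement_coefficient(scores, 10, cut, 0.8) for cut in (6, 8)
}
```

Running the function over several prospective cut scores, as in the last lines, mirrors the use case the abstract describes.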
