Similar Literature (20 similar documents retrieved)
1.
In this study we examined variations of the nonequivalent groups equating design for tests containing both multiple-choice (MC) and constructed-response (CR) items to determine which design was most effective in producing equivalent scores across the two tests to be equated. Using data from a large-scale exam, this study investigated the use of anchor CR item rescoring (known as trend scoring) in the context of classical equating methods. Four linking designs were examined: an anchor with only MC items; a mixed-format anchor test containing both MC and CR items; a mixed-format anchor test incorporating common CR item rescoring; and an equivalent groups (EG) design with CR item rescoring, thereby avoiding the need for an anchor test. Designs using either MC items alone or a mixed anchor without CR item rescoring resulted in much larger bias than the other two designs. The EG design with trend scoring resulted in the smallest bias, leading to the smallest root mean squared error value.
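The bias and root mean squared error criteria used to compare the four designs can be illustrated with a small sketch (not the authors' code; the function name and interface are hypothetical): given the equated scores produced by a candidate design and by the criterion equating at the same raw-score points, weighted bias and RMSE are computed as follows.

```python
import numpy as np

def equating_error(candidate, criterion, weights=None):
    """Weighted bias and root mean squared error of a candidate equating
    function relative to a criterion equating function, both evaluated at
    the same raw-score points (illustrative interface only)."""
    candidate = np.asarray(candidate, dtype=float)
    criterion = np.asarray(criterion, dtype=float)
    if weights is None:                       # default: equal weight per score point
        weights = np.full(candidate.shape, 1.0 / candidate.size)
    diff = candidate - criterion
    bias = float(np.sum(weights * diff))              # signed average difference
    rmse = float(np.sqrt(np.sum(weights * diff**2)))  # root mean squared error
    return bias, rmse
```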

2.
Using data from a large-scale exam, in this study we compared various designs for equating constructed-response (CR) tests to determine which design was most effective in producing equivalent scores across the two tests to be equated. In the context of classical equating methods, four linking designs were examined: (a) an anchor set containing common CR items, (b) an anchor set incorporating common CR items rescored, (c) an external multiple-choice (MC) anchor test, and (d) an equivalent groups design incorporating rescored CR items (no anchor test). The use of CR items without rescoring resulted in much larger bias than the other designs. The use of an external MC anchor resulted in the next largest bias. The use of a rescored CR anchor and the equivalent groups design led to similar levels of equating error.

3.
Learning science requires higher-level (critical) thinking skills that need to be practiced in science classes. This study tested the effect of exam format on critical-thinking skills. Multiple-choice (MC) testing is common in introductory science courses, and students in these classes tend to associate memorization with MC questions and may not see the need to modify their study strategies for critical thinking, because the MC exam format has not changed. To test the effect of exam format, I used two sections of an introductory biology class. One section was assessed with exams in the traditional MC format, while the other section was assessed with both MC and constructed-response (CR) questions. The mixed exam format was correlated with significantly more cognitively active study behaviors and significantly better performance on the cumulative final exam (after accounting for grade point average and gender). There was also less gender bias in the CR answers. This suggests that the MC-only exam format indeed hinders critical thinking in introductory science classes. Introducing CR questions encouraged students to learn more and to be better critical thinkers and reduced gender bias. However, student resistance increased as students adjusted their perceptions of their own critical-thinking abilities.

4.
Numerous assessments contain a mixture of multiple choice (MC) and constructed response (CR) item types and many have been found to measure more than one trait. Thus, there is a need for multidimensional dichotomous and polytomous item response theory (IRT) modeling solutions, including multidimensional linking software. For example, multidimensional item response theory (MIRT) may have a promising future in subscale score proficiency estimation, leading toward a more diagnostic orientation, which requires the linking of these subscale scores across different forms and populations. Several multidimensional linking studies can be found in the literature; however, none have used a combination of MC and CR item types. Thus, this research explores multidimensional linking accuracy for tests composed of both MC and CR items using a matching test characteristic/response function approach. The two-dimensional simulation study presented here used real data-derived parameters from a large-scale statewide assessment with two subscale scores for diagnostic profiling purposes, under varying conditions of anchor set lengths (6, 8, 16, 32, 60), across 10 population distributions, with a mixture of simple versus complex structured items, using a sample size of 3,000. It was found that for a well-chosen anchor set, the parameters were recovered well after equating across all populations, even for anchor sets composed of as few as six items.
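For context, the matching test characteristic (response) function approach to multidimensional linking can be sketched in generic notation for a compensatory multidimensional 2PL model (a standard formulation, not necessarily the exact criterion implemented in this study): the rotation and translation of the latent space are chosen so that the transformed test characteristic surface is as close as possible to the target surface over a grid of ability points.

```latex
P_j(\boldsymbol{\theta}) = \bigl[1 + \exp\{-(\mathbf{a}_j^{\top}\boldsymbol{\theta} + d_j)\}\bigr]^{-1}
\qquad \text{(compensatory M2PL)}

\boldsymbol{\theta}^{*} = \mathbf{A}\boldsymbol{\theta} + \boldsymbol{\beta},
\qquad
\mathbf{a}_j^{*} = (\mathbf{A}^{\top})^{-1}\mathbf{a}_j,
\qquad
d_j^{*} = d_j - \mathbf{a}_j^{*\top}\boldsymbol{\beta}

\min_{\mathbf{A},\,\boldsymbol{\beta}}\;
\sum_{q} w_q
\Bigl[\textstyle\sum_{j} P_j(\boldsymbol{\theta}_q;\mathbf{a}_j, d_j)
      - \sum_{j} P_j(\boldsymbol{\theta}_q;\mathbf{a}_j^{*}, d_j^{*})\Bigr]^{2}
```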

5.
Measurement invariance of the five-factor Servant Leadership Questionnaire between female and male K-12 principals was tested using multi-group confirmatory factor analysis. A sample of 956 principals (56.9% were females and 43.1% were males) was analysed in this study. The hierarchical multi-step measurement invariance test supported the measurement invariance of the five-factor model across gender. Latent factor means were compared between females and males once measurement invariance was established. Results showed that females were significantly higher than males on emotional healing, wisdom, persuasive mapping and organisational stewardship, and they were not statistically different on altruistic calling.

6.
This study addressed the sampling error and linking bias that occur with small samples in a nonequivalent groups anchor test design. We proposed a linking method called the synthetic function, which is a weighted average of the identity function and a traditional equating function (in this case, the chained linear equating function). Specifically, we compared the synthetic, identity, and chained linear functions for various-sized samples from two types of national assessments. One design used a highly reliable test and an external anchor, and the other used a relatively low-reliability test and an internal anchor. The results from each of these methods were compared to the criterion equating function derived from the total samples with respect to linking bias and error. The study indicated that the synthetic functions might be a better choice than the chained linear equating method when samples are not large and, as a result, unrepresentative.
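Concretely, the synthetic function is a weighted average of the chained linear equating function and the identity function (the weight symbol below is generic notation, not necessarily that used in the study):

```latex
\hat{e}_{\mathrm{syn}}(x) \;=\; w\,\hat{e}_{\mathrm{chain}}(x) \;+\; (1 - w)\,x,
\qquad 0 \le w \le 1
```

so that w = 0 returns the identity function (no equating) and w = 1 returns the ordinary chained linear result; intermediate weights trade the sampling variance of the small-sample chained linear function against the bias of assuming the two forms are identical.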

7.
Classification was selected for use in this investigation because of the central position of process factors in teaching and learning. A twelve-section classification program, which was based on 12 rules derived from Piaget's analysis of classification, was used in the study. The program was produced in both a constructed response (CR) format and in a matching multiple-choice (MC) response format. The 36-item classification test was similarly produced in both response modes. Criterion scores on both the CR test and the MC test were collected from each of the 239 grade three subjects following treatment with the CR program, the MC program, or with drawing activities (control). The results of the multivariate and univariate analyses of variance indicated that the program in both response modes enhanced classification achievement, although the effects on MC test scores were not consistent across classes, and that each program format enhanced achievement to a greater degree on the test which matched the program response mode.

8.
Score equity assessment (SEA) is introduced, and placed within a fair assessment context that includes differential prediction or fair selection and differential item functioning. The notion of subpopulation invariance of linking functions is central to the assessment of score equity, just as it has been for differential item functioning and differential prediction. Advanced Placement (AP) data are used for illustrative purposes. The use of multiple-choice and constructed response items in AP provides an opportunity to observe a case where subpopulation invariance of linking functions does not hold (U.S. History), and a case in which it does hold (Calculus AB). The lack of invariance for U.S. History might be attributed to several sources. The role of SEA in assessing the fairness of test assembly processes is discussed.

9.
Typical confirmatory factor analysis studies of factorial invariance test parameter (factor loadings, factor variances/covariances, and uniquenesses) invariance across only two groups (e.g., males and females) or, perhaps, across more than two groups reflecting different levels of a single design facet (e.g., age). The present investigation extends this approach by considering invariance across groups from a two-facet design. Data consist of multiple dimensions of self-concept collected from eight groups of students (total N = 4,000) representing a 2 (Gender) × 4 (Age) design. The gender-stereotypic model posits a particular pattern of gender differences in structure that varies with age. Adopting analysis-of-variance terminology, the model posits that structural differences will vary as a function of gender but that this gender effect interacts with age. In testing this model, I consider the lack of invariance in different sets of parameters attributable to gender, age, and their interaction.

10.
In many educational tests, both multiple-choice (MC) and constructed-response (CR) sections are used to measure different constructs. In many common cases, security concerns lead to the use of form-specific CR items that cannot be used for equating test scores, along with MC sections that can be linked to previous test forms via common items. In such cases, adjustment by minimum discriminant information may be used to link CR section scores and composite scores based on both MC and CR sections. This approach is an innovative extension that addresses the long-standing issue of linking CR test scores across test forms in the absence of common items in educational measurement. It is applied to a series of administrations from an international language assessment with MC sections for receptive skills and CR sections for productive skills. To assess the linking results, harmonic regression is applied to examine the effects of the proposed linking method on score stability, among several analyses for evaluation.
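As background, adjustment by minimum discriminant information is usually formulated as follows (generic notation; the specific constraint functions used in the study are not given in the abstract): base weights v_i are replaced by weights w_i that satisfy chosen moment constraints while staying as close as possible, in the Kullback-Leibler sense, to the base weights, which yields exponentially tilted weights.

```latex
\min_{\{w_i\}} \sum_i w_i \log\frac{w_i}{v_i}
\quad \text{s.t.} \quad
\sum_i w_i\,T_k(x_i) = t_k \;\; (k = 1,\dots,K), \qquad \sum_i w_i = 1
\qquad\Longrightarrow\qquad
w_i \;\propto\; v_i \exp\!\Bigl(\sum_{k} \beta_k\, T_k(x_i)\Bigr)
```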

11.
Many innovative item formats have been proposed over the past decade, but little empirical research has been conducted on their measurement properties. This study examines the reliability, efficiency, and construct validity of two innovative item formats—the figural response (FR) and constructed response (CR) formats used in a K–12 computerized science test. The item response theory (IRT) information function and confirmatory factor analysis (CFA) were employed to address the research questions. It was found that the FR items were similar to the multiple-choice (MC) items in providing information and efficiency, whereas the CR items provided noticeably more information than the MC items but tended to provide less information per minute. The CFA suggested that the innovative formats and the MC format measure similar constructs. Innovations in computerized item formats are reviewed, and the merits as well as challenges of implementing the innovative formats are discussed.
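For reference, the information comparisons summarized above rest on the IRT item and test information functions; under a 2PL model (shown here only as an illustration, since the operational model is not stated in the abstract) and with "information per minute" as the efficiency index, these are:

```latex
I_j(\theta) = a_j^{2}\,P_j(\theta)\bigl[1 - P_j(\theta)\bigr],
\qquad
I(\theta) = \sum_j I_j(\theta),
\qquad
\mathrm{efficiency}_j(\theta) = \frac{I_j(\theta)}{t_j}
```

where t_j is the average response time for item j.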

12.
Instructors can use both "multiple-choice" (MC) and "constructed response" (CR) questions (such as short answer, essay, or problem-solving questions) to evaluate student understanding of course materials and principles. This article begins by discussing the advantages and concerns of using these alternate test formats and reviews the studies conducted to test the hypothesis (or perhaps better described as the hope) that MC tests, by themselves, perform an adequate job of evaluating student understanding of course materials. Despite research from educational psychology demonstrating the potential for MC tests to measure the same levels of student mastery as CR tests, recent studies in specific educational domains find imperfect relationships between these two performance measures. We suggest that a significant confound in prior experiments has been the treatment of MC questions as homogeneous entities when in fact MC questions may test widely varying levels of student understanding. The primary contribution of the article is a modified research model for CR/MC research based on knowledge-level analyses of MC test banks and CR question sets from basic computer language programming. The analyses are based on an operationalization of Bloom's Taxonomy of Learning Goals for the domain, which is used to develop a skills-focused taxonomy of MC questions. However, we propose that these analyses readily generalize to similar teaching domains of interest to decision sciences educators such as modeling and simulation programming.

13.
The Rosenberg Self-Esteem scale (RSE) has been widely used in examinations of sex differences in global self-esteem. However, previous examinations of sex differences have not accounted for method effects associated with item wording, which have consistently been reported by researchers using the RSE. Accordingly, this study examined the multigroup invariance of global self-esteem and method effects associated with negatively worded items on the RSE between males and females. A correlated traits, correlated methods framework for modeling method effects was combined with a standard multigroup invariance routine using covariance structure analysis. Overall, there were few differences between males and females in terms of the measurement of self-esteem and method effects associated with negatively worded items on the RSE. Our findings suggest that, whereas method effects exist on the RSE scale for both males and females, the method effects associated with negatively worded items do not influence the measurement invariance and mean differences in global self-esteem scores between the sexes.

14.
Measurement reliability is one of the core indicators of examination quality, but conventional reliability estimation methods are not appropriate for a test that contains a single high-scoring constructed-response (subjective) item, because such an item contributes too much to the variance of the total test score. One way to address this problem is first to estimate the reliability of the single high-scoring constructed-response item and then to apply the stratified coefficient alpha formula to estimate the measurement reliability of the whole test. The reliability of the single high-scoring constructed-response item can be estimated in two ways: by a test-retest approach, or by an approach based on the fact that the correlation between two random variables is attenuated by the presence of random error.
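The stratified coefficient alpha referred to in this abstract takes the standard form below, where the test is partitioned into strata i (here, the objective section and the single high-scoring constructed-response item), each with score variance sigma_i^2 and reliability rho_i, and sigma_X^2 is the total-score variance:

```latex
\alpha_{\mathrm{strat}} \;=\; 1 \;-\; \frac{\sum_i \sigma_i^{2}\,(1 - \rho_i)}{\sigma_X^{2}}
```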

15.
16.
The study examined the relationships between learning patterns and attitudes towards two assessment formats: open-ended (OE) and multiple-choice (MC), among students in higher education. Sixteen Semantic Differential scales measuring emotional reactions, intellectual reactions and appraisal of each assessment format, along with measures of learning processes, academic self-concept and test anxiety, were administered to 58 students. Results indicated two patterns of relationships between the learning-related variables and the assessment attitudes: high scores on the self-concept measure and on the three measures of learning processes were related to positive attitudes towards the OE format but negative ones towards the MC format; low scores on the test anxiety measures were related to positive attitudes towards the OE format. In addition, significant gender differences emerged with respect to the MC format, with males having more favourable attitudes than females. Results were discussed in light of an adaptive assessment approach.

17.
In this study, we describe what factors influence the observed score correlation between an (external) anchor test and a total test. We show that the anchor to full-test observed score correlation is based on two components: the true score correlation between the anchor and total test, and the reliability of the anchor test. Findings using an analytical approach suggest that making an anchor test a miditest does not generally maximize the anchor to total test correlation. Results are discussed in the context of what conditions maximize the correlations between the anchor and total test.
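The decomposition described here parallels the classical test theory attenuation relation, sketched below in generic notation (the study's exact derivation may differ): when the total test is long and highly reliable, the observed anchor-to-total correlation is driven mainly by the anchor-total true-score correlation and the reliability of the anchor.

```latex
\rho(A, X) \;=\; \rho(T_A, T_X)\,\sqrt{\rho_{AA'}\;\rho_{XX'}}
```

where rho_{AA'} and rho_{XX'} denote the reliabilities of the anchor and the total test.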

18.
The computerization of reading assessments has presented a set of new challenges to test designers. From the vantage point of measurement invariance, test designers must investigate whether the traditionally recognized causes for violating invariance are still a concern in computer-mediated assessments. In addition, it is necessary to understand the technology-related causes of measurement invariance among test-taking populations. In this study, we used the available data (n = 800) from the previous administrations of the Pearson Test of English Academic (PTE Academic) reading, an international test of English comprising 10 test items, to investigate measurement invariance across gender and the Information and Communication Technology Development index (IDI). We conducted a multi-group confirmatory factor analysis (CFA) to assess invariance at four levels: configural, metric, scalar, and structural. Overall, we were able to confirm structural invariance for the PTE Academic, which is a necessary condition for conducting fair assessments. Implications for computer-based education and the assessment of reading are discussed.

19.
The Non-Equivalent-groups Anchor Test (NEAT) design has been in wide use since at least the early 1940s. It involves two populations of test takers, P and Q, and makes use of an anchor test to link them. Two linking methods used for NEAT designs are those (a) based on chain equating and (b) that use the anchor test to post-stratify the distributions of the two operational test scores to a common population (i.e., Tucker equating and frequency estimation). We show that, under different sets of assumptions, both methods are observed score equating methods and we give conditions under which the methods give identical results. In addition, we develop analogues of the Dorans and Holland (2000) RMSD measures of population invariance of equating methods for the NEAT design for both chain and post-stratification equating methods.
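The Dorans and Holland (2000) RMSD index mentioned here is conventionally written as follows, where w_j are subgroup weights, e_j(x) is the linking function estimated in subgroup j, e(x) is the total-group linking function, and the denominator is a standardizing unit on the target score scale (the exact standardization adopted for the NEAT analogues in this study may differ):

```latex
\mathrm{RMSD}(x) \;=\; \frac{\sqrt{\sum_j w_j\,\bigl[e_j(x) - e(x)\bigr]^{2}}}{\sigma_Y}
```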

20.

This study investigates the possibility of utilizing online learning data to design face-to-face activities in a flipped classroom. We focus on heterogeneous group formation for effective collaborative learning. Fifty-three undergraduate students (18 males, 35 females) participated in this study, and 8 students (3 males, 5 females) among them joined post-study interviews. For this study, a total of six student characteristics were used: three demographic characteristics obtained from a simple survey and three academic characteristics captured from online learning data. We define three demographic group heterogeneity variables and three academic group heterogeneity variables, where each variable is calculated using the corresponding student characteristic. In this way, each heterogeneity variable represents the degree of diversity within the group. Then, a two-stage hierarchical regression analysis was conducted to identify the significant group heterogeneity variables that influence face-to-face group achievement. The results show that the academic group heterogeneity variables, which were derived from the online learning data, accounted for a significant proportion of the variance in the group achievement when the demographic group heterogeneity variables were controlled. The interviews also reveal that the academic group heterogeneity indeed affected group interaction and learning outcome. These findings highlight that online learning data can be utilized to obtain relevant information for effective face-to-face activity design in a flipped classroom. Based on the results, we discuss the advantages of this data utilization approach and other implications for face-to-face activity design.
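As an illustration of the two-stage (hierarchical) regression described above — a minimal sketch, assuming group heterogeneity is operationalized as the within-group standard deviation of each characteristic and using hypothetical column and file names; the study's actual operationalization may differ:

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical per-student records: group membership, demographic and academic
# characteristics, and the group's face-to-face achievement score.
students = pd.read_csv("students.csv")   # columns below are assumed for illustration

# Step 0: one heterogeneity value per group = within-group SD of each characteristic.
groups = students.groupby("group_id").agg(
    age_het=("age", "std"),
    major_het=("major_code", "std"),
    gender_het=("gender_code", "std"),
    login_het=("login_count", "std"),       # academic characteristics from
    video_het=("video_minutes", "std"),     # the online learning data
    quiz_het=("quiz_score", "std"),
    achievement=("group_achievement", "first"),
).reset_index()

# Stage 1: demographic heterogeneity variables only.
X1 = sm.add_constant(groups[["age_het", "major_het", "gender_het"]])
m1 = sm.OLS(groups["achievement"], X1).fit()

# Stage 2: add the academic heterogeneity variables derived from online learning data.
X2 = sm.add_constant(groups[["age_het", "major_het", "gender_het",
                             "login_het", "video_het", "quiz_het"]])
m2 = sm.OLS(groups["achievement"], X2).fit()

# The increment in explained variance from stage 1 to stage 2 is the quantity of interest.
print("Delta R^2:", m2.rsquared - m1.rsquared)
```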

