Similar Literature
20 similar documents found.
1.
The aim of this study was to apply Rasch modeling to an examination of the psychometric properties of the Pearson Test of English Academic (PTE Academic). Scores of 140 test-takers drawn from the PTE Academic database were analyzed. The mean age of the participants was 26.45 (SD = 5.82), with ages ranging from 17 to 46. Conformity of the participants' performance on the 86 items of PTE Academic Form 1 of the field test was evaluated using the partial credit model. The person reliability coefficient was .96, and item reliability was .99. No significant differential item functioning was found across subgroups of gender and spoken-language context, indicating that the item data approximated the Rasch model. The findings support the stability of PTE Academic as a measurement tool for assessing English language learners' academic English.
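For reference, the partial credit model named above is usually written in its adjacent-category form; the notation below is generic and not taken from the study itself.

% Partial credit model (Masters, 1982): adjacent-category logit for person n on item i.
% \theta_n is the person measure; \delta_{ik} is the difficulty of step k within item i.
\ln\!\left(\frac{P_{nik}}{P_{ni(k-1)}}\right) = \theta_n - \delta_{ik}, \qquad k = 1, \dots, m_i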

2.
The Nonverbal Literacy Assessment (NVLA) is a literacy assessment designed for students with significant intellectual disabilities. The 218‐item test was initially examined using confirmatory factor analysis. This method showed that the test worked as expected, but the items loaded onto a single factor. This article uses item response theory to investigate the NVLA using Rasch models. First, we reduced the number of items using a unidimensional model, which resulted in high levels of test reliability despite decreasing the number of questions, providing the same information about student abilities in less time. Second, the multidimensional analysis indicated that it is possible to view the NVLA as a test with four dimensions, resulting in more detailed information about student abilities. Finally, we combined these approaches to obtain both specificity and brevity, with a four‐dimensional model using 133 items from the original NVLA.

3.
A look at real data shows that Reckase's psychometric theory for standard setting is not applicable to bookmark and that his simulations cannot explain actual differences between methods. It is suggested that exclusively test-centered, criterion-referenced approaches are too idealized and that a psychophysics paradigm and a theory of group behavior could be more useful in thinking about the standard setting process. In this view, item mapping methods such as bookmark are reasonable adaptations to fundamental limitations in human judgments of item difficulty. They make item ratings unnecessary and have unique potential for integrating external validity data and student performance data more fully into the standard setting process.

4.
The study aims to investigate the effects of delivery modalities on psychometric characteristics and student performance on cognitive tests. A first study assessed the inductive reasoning ability of 715 students under the supervision of teachers. A second study examined 731 students’ performance in applying the control-of-variables strategy in basic physics, but without teacher supervision due to the COVID-19 pandemic. Rasch measurement showed that the online format fitted the data better in the unidimensional model across the two conditions. Under teacher supervision, paper-based testing outperformed online testing in terms of reliability and total scores, but the opposite pattern emerged without teacher supervision. Although measurement invariance between the two versions was confirmed at the item level, the differential bundle functioning analysis favored the online groups on item bundles constructed from figure-related materials. Response time was also discussed as an advantage of technology-based assessment for test development.

5.
This research explored the measurement characteristics of two science examinations and the potential to use access arrangements data to investigate how students requiring reading support are affected by features of exam questions. For the two science examinations, traditional and Rasch analyses provided estimates of difficulty and information on item functioning. For one examination, the performance of students eligible for support from a reader in exams was compared to that of a ‘norm’ group. For selected items, a sample of student responses was analysed. A number of factors that potentially make questions easier or more difficult, or that may contribute to problems with item functioning, were identified, as were several features that may particularly influence students requiring reading support.

6.
Cut‐scores were set by expert judges on assessments of reading and listening comprehension of English as a foreign language (EFL), using the bookmark standard‐setting method to differentiate proficiency levels defined by the Common European Framework of Reference (CEFR). Assessments contained stratified item samples drawn from extensive item pools, calibrated using Rasch models on the basis of examinee responses from a German nationwide assessment of secondary school language performance. The results suggest significant effects of item sampling strategies for the bookmark method on cut‐score recommendations, as well as significant cut‐score judgment revision over cut‐score placement rounds. Results are discussed within a framework of establishing validity evidence supporting cut‐score recommendations using the widely employed bookmark method.
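As a point of reference, the bookmark method converts the Rasch difficulty of the bookmarked item into a cut score via a response-probability (RP) criterion; the abstract does not state which criterion the study used, so the common RP = .67 convention is shown here only as an illustration.

% Cut score implied by bookmarking item b under a response-probability criterion RP.
% \delta_b is the Rasch difficulty of the bookmarked item.
\theta_{\text{cut}} = \delta_b + \ln\!\left(\frac{RP}{1 - RP}\right), \qquad \theta_{\text{cut}} \approx \delta_b + 0.71 \ \text{ for } RP = 0.67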

7.
The use of content validity as the primary assurance of the measurement accuracy for science assessment examinations is questioned. An alternative accuracy measure, item validity, is proposed. Item validity is based on research using qualitative comparisons between (a) student answers to objective items on the examination, (b) clinical interviews with examinees designed to ascertain their knowledge and understanding of the objective examination items, and (c) student answers to essay examination items prepared as an equivalent to the objective examination items. Calculations of item validity are used to show that selected objective items from the science assessment examination overestimated the actual student understanding of science content. Overestimation occurs when a student correctly answers an examination item, but for a reason other than that needed for an understanding of the content in question. There was little evidence that students incorrectly answered the items studied for the wrong reason, resulting in underestimation of the students' knowledge. The equivalent essay items were found to limit the amount of mismeasurement of the students' knowledge. Specific examples are cited and general suggestions are made on how to improve the measurement accuracy of objective examinations.

8.
Given the central importance of the Nature of Science (NOS) and Scientific Inquiry (SI) in national and international science standards and science learning, empirical support for the theoretical delineation of these constructs is of considerable significance. Furthermore, tests of the effects of varying magnitudes of NOS knowledge on domain‐specific science understanding and belief require the application of instruments validated in accordance with AERA, APA, and NCME assessment standards. Our study explores three interrelated aspects of a recently developed NOS instrument: (1) validity and reliability; (2) instrument dimensionality; and (3) item scales, properties, and qualities within the context of Classical Test Theory and Item Response Theory (Rasch modeling). A construct analysis revealed that the instrument did not match published operationalizations of NOS concepts. Rasch analysis of the original instrument—as well as a reduced item set—indicated that a two‐dimensional Rasch model fit significantly better than a one‐dimensional model in both cases. Thus, our study revealed that NOS and SI are supported as two separate dimensions, corroborating theoretical distinctions in the literature. To identify items with unacceptable fit values, item quality analyses were used. A Wright Map revealed that few items sufficiently distinguished high performers in the sample and excessive numbers of items were present at the low end of the performance scale. Overall, our study outlines an approach for how Rasch modeling may be used to evaluate and improve Likert‐type instruments in science education.

9.
This study pioneers a Rasch scoring approach and compares it to a conventional summative approach for measuring longitudinal gains in student learning. In this methodological note, the proposed methodology is demonstrated using rating scales from a student survey administered as part of a higher education outcome assessment. Such assessments have become increasingly important worldwide for purposes of institutional accreditation and accountability to stakeholders. Data were collected from a longitudinal study tracking the self-reported learning outcomes of individual students in the same cohort who completed the student learning experience questionnaire (SLEQ) in their first and final years. The Rasch model was employed for item calibration and latent trait estimation, together with a concurrent calibration scaling procedure incorporating a randomly equivalent groups design and a single group design to measure gains in self-reported learning outcomes across the repeated measures. The sensitivity to change of Rasch scoring relative to the conventional summative scoring method was quantified by a statistical index, relative performance (RP). Findings indicated that Rasch scoring captured gains in learning outcomes better than the conventional summative scoring method, with RP values ranging from 3% to 17% in the cognitive, social, and value domains of the SLEQ. The Rasch scoring approach and the scaling procedure presented in the study can be readily generalised to studies using rating scales to measure change in student learning in the higher education context. The methodological innovations and contributions of this study are discussed.
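To illustrate the kind of comparison the study reports (a minimal sketch, not the SLEQ's actual scaling procedure; the item parameters and responses below are hypothetical), the code estimates Rasch rating-scale person measures with fixed item difficulties and thresholds, then contrasts the gain on the logit metric with the gain on the raw summative score.

import numpy as np

def rsm_category_probs(theta, delta, taus):
    # Rating scale model category probabilities for one item:
    # theta = person measure, delta = item difficulty, taus = shared step thresholds.
    logits = np.concatenate(([0.0], np.cumsum(theta - delta - taus)))
    expl = np.exp(logits - logits.max())  # subtract max for numerical stability
    return expl / expl.sum()

def estimate_theta(responses, deltas, taus, n_iter=50):
    # Maximum-likelihood person estimate given fixed item parameters (Newton-Raphson).
    theta = 0.0
    for _ in range(n_iter):
        grad, info = 0.0, 0.0
        for x, d in zip(responses, deltas):
            p = rsm_category_probs(theta, d, taus)
            cats = np.arange(len(p))
            expected = (cats * p).sum()               # model-expected category
            variance = ((cats - expected) ** 2 * p).sum()
            grad += x - expected
            info += variance
        theta += grad / info
    return theta

# Hypothetical five-item survey scored 0-3 with assumed (fixed) parameters.
deltas = np.array([-0.8, -0.3, 0.0, 0.4, 0.9])
taus = np.array([-1.2, 0.1, 1.1])  # three thresholds define four categories

first_year = [1, 1, 2, 1, 0]
final_year = [2, 2, 3, 2, 1]

raw_gain = sum(final_year) - sum(first_year)
rasch_gain = estimate_theta(final_year, deltas, taus) - estimate_theta(first_year, deltas, taus)
print(f"raw-score gain: {raw_gain}, Rasch (logit) gain: {rasch_gain:.2f}")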

10.
Any examination that involves moderate to high stakes implications for examinees should be psychometrically sound and legally defensible. Currently, there are two broad and competing families of test theories that are used to score examination data. The majority of instructors outside the high‐stakes testing arena rely on classical test theory (CTT) methods. However, advances in item response theory software have made the application of these techniques much more accessible to classroom instructors. The purpose of this research is to analyze a common medical school anatomy examination using both the traditional CTT scoring method and a Rasch measurement scoring method to determine which technique provides more robust findings, and which set of psychometric indicators will be more meaningful and useful for anatomists looking to improve the psychometric quality and functioning of their examinations. Results produced by the more robust and meaningful methodology will undergo a rigorous psychometric validation process to evaluate construct validity. Implications of these techniques and additional possibilities for advanced applications are also discussed. Anat Sci Educ 7: 450–460. © 2014 American Association of Anatomists.

11.
The potential of computer-based assessments for capturing complex learning outcomes has been discussed; however, relatively little is understood about how to leverage such potential for summative and accountability purposes. The aim of this study is to leverage that potential by developing and validating a multimedia-based assessment of scientific inquiry abilities (MASIA) that covers a more comprehensive construct of inquiry abilities and targets secondary school students in different grades. We implemented five steps derived from the construct modeling approach to design MASIA. During the implementation, multiple sources of evidence were collected in the pilot testing and Rasch modeling steps to support the validity of MASIA. In particular, with the participation of 1,066 8th and 11th graders, MASIA showed satisfactory psychometric properties for discriminating students with different levels of inquiry abilities across 101 items in 29 tasks when Rasch models were applied. Additionally, the Wright map indicated that MASIA offered accurate information about students’ inquiry abilities because the distributions of student abilities and item difficulties were comparable. The analysis results also suggested that MASIA offered precise measures of inquiry abilities when the components (questioning, experimenting, analyzing, and explaining) were regarded as a coherent construct. Finally, the increasing mean difficulty thresholds of item responses across the three performance levels of all sub-abilities supported the alignment between the scoring rubrics and the inquiry framework. Together with other sources of validity evidence from the pilot testing, the results support the validity of MASIA.

12.
Educational Assessment, 2013, 18(4), 333-356
Alignment has taken on increased importance given the current high-stakes nature of assessment. To make well-informed decisions about student learning on the basis of test results, assessment items need to be well aligned with standards. Project 2061 of the American Association for the Advancement of Science (AAAS) has developed a procedure for analyzing the content and quality of assessment items. The authors of this study used this alignment procedure to closely examine 2 mathematics assessment items. Student work on these 2 items was analyzed to determine whether the conclusions reached through the use of the alignment procedure could be validated. It was found that the Project 2061 alignment procedure provided an effective tool for in-depth analysis of the mathematical content of each item against a set of standards and for identifying the 1 content standard most closely aligned with each item. Through analyzing student work samples and student interviews, it was also found that students' thinking may not correspond to the standard identified as best aligned with the learning goals of the item. This finding highlights the potential usefulness of analyzing student work to clarify any additional deficiencies of an assessment item not revealed by an alignment procedure.

13.
The Multidimensional School Anger Inventory–Revised (MSAI-R) is a measurement tool for evaluating high school students' anger. Its psychometric features have been tested in the USA, Australia, Japan, Guatemala, and Italy. This study investigates the factor structure and psychometric quality of the Persian version of the MSAI-R using data from an administration of the inventory to 585 Iranian high school students. The study adopted the four-factor underlying structure of high school student anger derived through factor analysis in previous validation studies, which consists of School Hostility, Anger Experience, Positive Coping, and Destructive Expressions. Confirmatory factor analysis of this four-factor model indicated that it fit the data better than a one-factor baseline model, although the fit was not perfect. The Rasch model showed a very high internal consistency among items, with no item misfitting; however, the results suggest that, to represent the construct sufficiently, some items should be added to the Positive Coping and Destructive Expressions factors. This finding is in agreement with Boman, Curtis, Furlong, and Smith's Rasch analysis of the MSAI-R with an Australian sample. Overall, the results from this study support the psychometric features of the Persian MSAI-R. However, results from some test items also point to the dangers inherent in adapting the same test stimuli to widely divergent cultures.

14.
Applications of item response theory (IRT) models assume local item independence and that examinees are independent of each other. When a representative sample for psychometric analysis is selected using a cluster sampling method in a testlet‐based assessment, both local item dependence and local person dependence are likely to be induced. This study proposed a four‐level IRT model to simultaneously account for the dual local dependence due to item clustering and person clustering. Model parameter estimation was explored using the Markov Chain Monte Carlo method. Model parameter recovery was evaluated in a simulation study in comparison with three other related models: the Rasch model, the Rasch testlet model, and the three‐level Rasch model for person clustering. In general, the proposed model recovered the item difficulty and person ability parameters with the least total error. The choice of model did not affect bias in item and person parameter estimation, but it did affect the standard errors (SEs). In some simulation conditions, the difference in classification accuracy between models was as large as 11%. An illustration using real data generally supported the model performance observed in the simulation study.
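For orientation, the Rasch testlet model named above (one of the comparison models) handles local item dependence by adding a person-specific testlet effect to the Rasch logit; the exact parameterization of the proposed four-level extension is not given in the abstract.

% Rasch testlet model: \gamma_{n\,d(i)} is person n's random effect for the testlet d(i)
% containing item i, absorbing local dependence among items within the same testlet.
\operatorname{logit} P(X_{ni} = 1) = \theta_n - b_i + \gamma_{n\,d(i)}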

15.
Computer‐based tests (CBTs) often use random ordering of items in order to minimize item exposure and reduce the potential for answer copying. Little research has been done, however, to examine item position effects for these tests. In this study, different versions of a Rasch model and different response time models were examined and applied to data from a CBT administration of a medical licensure examination. The models were used specifically to investigate whether item position affected item difficulty and item intensity estimates. Results indicated that the position effect was negligible.

16.
Applied Measurement in Education, 2013, 26(4), 331-345
In order to obtain objective measurement for examinations that are graded by judges, an extension of the Rasch model designed to analyze examinations with more than two facets (items/examinees) is used. This extended Rasch model calibrates the elements of each facet of the examination (i.e., examinee performances, items, and judges) on a common log-linear scale. A network for assigning judges to examinations is used to link all facets. Real examination data from the "clinical assessment" part of a certification examination are used to illustrate the application. A range of item difficulties and judge severities were found. Comparison of examinee raw scores with objective linear measures corrected for variations in judge severity shows that judge severity can have a substantial impact on a raw score. Correcting for judge severity improves the fairness of examinee measures and of the subsequent pass-fail decisions because the uncorrected raw scores favor examinee performances graded by lenient judges.
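The extended model described here is commonly written, in Linacre's many-facet notation, as an adjacent-category logit with one term per facet; whether the study used exactly this parameterization is not stated in the abstract.

% Many-facet Rasch model: B_n examinee ability, D_i item difficulty,
% C_j judge severity, F_k difficulty of rating step k.
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k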

17.
The Trends in International Mathematics and Science Study (TIMSS) is a comparative assessment of the achievement of students in many countries. In the present study, a rigorous independent evaluation was conducted of a representative sample of TIMSS science test items because item quality influences the validity of the scores used to inform educational policy in those countries. The items had been administered internationally to 16,009 students in their eighth year of formal schooling. The evaluation had three components. First, the Rasch model, which emphasizes high quality items, was used to evaluate the items psychometrically. Second, readability and vocabulary analyses were used to evaluate the wording of the items to ensure they were comprehensible to the students. And third, item development guidelines were used by a focus group of science teachers to evaluate the items in light of the TIMSS assessment framework, which specified the format, content, and cognitive domains of the items. The evaluation components indicated that the majority of the items were of high quality, thereby contributing to the validity of TIMSS scores. These items had good psychometric characteristics, readability, vocabulary, and compliance with the assessment framework. Overall, the items tended to be difficult: constructed response items assessing reasoning or application were the most difficult, and multiple choice items assessing knowledge or application were less difficult. The teachers revised some of the sampled items to improve their clarity of content, conciseness of wording, and fit with format specifications. For TIMSS, the findings imply that some of the non‐sampled items may need revision, too. For researchers and teachers, the findings imply that the TIMSS science items and the Rasch model are valuable resources for assessing the achievement of students. © 2012 Wiley Periodicals, Inc. J Res Sci Teach 49: 1321–1344, 2012

18.
Understanding how situational features of assessment tasks impact reasoning is important for many educational pursuits, notably the selection of curricular examples to illustrate phenomena, the design of formative and summative assessment items, and determination of whether instruction has fostered the development of abstract schemas divorced from particular instances. The goal of our study was to employ an experimental research design to quantify the degree to which situational features impact inferences about participants’ understanding of Mendelian genetics. Two participant samples from different educational levels and cultural backgrounds (high school, n = 480; university, n = 444; Germany and USA) were used to test for context effects. A multi-matrix test design was employed, and item packets differing in situational features (e.g., plant, animal, human, fictitious) were randomly distributed to participants in the two samples. Rasch analyses of participant scores from both samples produced good item fit, person reliability, and item reliability and indicated that the university sample displayed stronger performance on the items compared to the high school sample. We found, surprisingly, that in both samples, no significant differences in performance occurred among the animal, plant, and human item contexts, or between the fictitious and “real” item contexts. In the university sample, we were also able to test for differences in performance between genders, among ethnic groups, and by prior biology coursework. None of these factors had a meaningful impact upon performance or context effects. Thus some, but not all, types of genetics problem solving or item formats are impacted by situational features.

19.
Mathematical word problems represent a common item format for assessing student competencies. Automatic item generation (AIG) is an effective way of constructing many items with predictable difficulties, based on a set of predefined task parameters. The current study presents a framework for the automatic generation of probability word problems based on templates that allow for the generation of word problems involving different topics from probability theory. It was tested in a pilot study with N = 146 German university students. The items show a good fit to the Rasch model. Item difficulties can be explained by the Linear Logistic Test Model (LLTM) and by the random-effects LLTM. The practical implications of these findings for future test development in the assessment of probability competencies are also discussed.
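As background, the LLTM constrains each Rasch item difficulty to a weighted sum of basic parameters tied to item features (here, presumably the template and task parameters); the notation below is generic rather than the study's own.

% LLTM: difficulty of item i decomposed into K basic parameters \eta_k weighted by
% design-matrix entries q_{ik}; c is a normalization constant.
\beta_i = \sum_{k=1}^{K} q_{ik}\,\eta_k + c
% The random-effects LLTM adds an item-specific residual \varepsilon_i to this decomposition.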

20.
Calibration of an item bank for computer adaptive testing requires substantial resources. In this study, we investigated whether the efficiency of calibration under the Rasch model could be enhanced by improving the match between item difficulty and student ability. We introduced targeted multistage calibration designs, a design type that considers ability‐related background variables and performance for assigning students to suitable items. Furthermore, we investigated whether uncertainty about item difficulty could impair the assembling of efficient designs. The results indicated that targeted multistage calibration designs were more efficient than ordinary targeted designs under optimal conditions. Limited knowledge about item difficulty reduced the efficiency of one of the two investigated targeted multistage calibration designs, whereas targeted designs were more robust.
