Similar Documents
20 similar documents found.
1.
The use of tablets for large-scale testing has transitioned from concept to reality for many state testing programs. This study extended previous research on score comparability between tablets and computers with high school students, comparing score distributions across devices for reading, math, and science and evaluating device effects for gender and ethnicity subgroups. Results indicated no significant differences between tablets and computers for math and science. For reading, a small device effect favoring tablets was found in the middle to lower part of the score distribution. This effect appeared to be driven by gains for male students when testing on tablets. No interactions of device with ethnicity were observed. Consistent with previous research, this study provides additional evidence for a relatively high degree of comparability between tablets and computers.

2.
As access to and reliance on technology continue to increase, so does the use of computerized testing for admissions, licensure/certification, and accountability exams. Nonetheless, full computer-based test (CBT) implementation can be difficult due to limited resources. As a result, some testing programs offer both CBT and paper-based test (PBT) administration formats. In such situations, evidence that scores obtained from different formats are comparable must be gathered. In this study, we illustrate how contemporary statistical methods can be used to provide evidence regarding the comparability of CBT and PBT scores at the total test score and item levels. Specifically, we examined the invariance of test structure and item functioning across administration modes and across subgroups of students defined by SES and sex. Multiple replications of both confirmatory factor analysis and Rasch differential item functioning analyses were used to assess invariance at the factorial and item levels. Results revealed a unidimensional construct with moderate statistical support for strong factorial invariance across SES subgroups, and moderate support for invariance across sex. Issues involved in applying these analyses to future evaluations of the comparability of scores from different versions of a test are discussed.
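To make the Rasch DIF screening step concrete, here is a minimal sketch, assuming a 0/1 response matrix and a crude logit (PROX-style) approximation to Rasch item difficulties rather than a full calibration; the array names, group labels, and 0.5-logit flagging threshold are illustrative assumptions, not the authors' procedure.

```python
# Sketch: screening items for DIF across administration modes (CBT vs. PBT)
# using centered logit difficulties as a rough Rasch-style approximation.
# Assumes `responses` is a 0/1 numpy array (persons x items) and `mode`
# labels each person "CBT" or "PBT". Illustrative only.
import numpy as np

def logit_difficulties(resp):
    p = resp.mean(axis=0).clip(0.01, 0.99)   # item proportions correct
    d = np.log((1 - p) / p)                  # higher = harder
    return d - d.mean()                      # center within group

def screen_dif(responses, mode, flag=0.5):
    cbt = logit_difficulties(responses[mode == "CBT"])
    pbt = logit_difficulties(responses[mode == "PBT"])
    contrast = cbt - pbt                     # positive = harder on CBT
    return {i: round(c, 2) for i, c in enumerate(contrast) if abs(c) > flag}

rng = np.random.default_rng(1)
responses = (rng.random((400, 30)) < 0.6).astype(int)
mode = np.array(["CBT"] * 200 + ["PBT"] * 200)
print(screen_dif(responses, mode))
```

In practice each group's difficulties would come from a full Rasch calibration with replications, as the abstract describes; this sketch only shows the shape of the contrast-and-flag step.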

3.
This study investigated whether scores obtained from the online and paper-and-pencil administrations of the statewide end-of-course English test were equivalent for students with and without disabilities. Score comparability was evaluated by examining equivalence of factor structure (measurement invariance) and by differential item and bundle functioning analyses for the online and paper groups. Results supported measurement invariance between the online and paper groups, suggesting that it is meaningful to compare scores across administration modes. When the data were analyzed at both the item and item bundle (content area) levels, the online and paper groups performed similarly.

4.
When a computerized adaptive testing (CAT) version of a test coexists with its paper-and-pencil (P&P) version, it is important for scores from the CAT version to be comparable to scores from the P&P version. The CAT version may require multiple item pools for test security reasons, and CAT scores based on alternate pools also need to be comparable to each other. In this paper, we review the research literature on CAT comparability issues and synthesize issues specific to these two settings. A framework of criteria for evaluating comparability was developed that contains three categories of criteria: a validity criterion, a psychometric property/reliability criterion, and a statistical assumption/test administration condition criterion. Methods for evaluating comparability under these criteria, as well as various algorithms for improving comparability, are described and discussed. Focusing on the psychometric property/reliability criterion, an example using an item pool of ACT Assessment Mathematics items is provided to demonstrate a process for developing comparable CAT versions and for evaluating comparability. This example illustrates how simulations can be used to improve comparability at the early stages of CAT development. The effects of different specifications of practical constraints, such as content balancing and item exposure rate control, and the effects of using alternate item pools are examined. One interesting finding from this study is that a large part of incomparability may be due to the change from number-correct scoring to IRT ability estimation-based scoring. In addition, changes in components of a CAT, such as exposure rate control, content balancing, test length, and item pool size, were found to result in different levels of comparability in test scores.
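One way to see how a switch from number-correct to IRT ability estimation-based scoring can move scores is the small sketch below, which scores the same response pattern both ways under a Rasch model with a standard normal prior; the item difficulties, quadrature grid, and response vector are assumptions for illustration, not the ACT Mathematics pool used in the paper.

```python
# Sketch: the same response pattern scored two ways -- number-correct
# versus an EAP ability estimate under a Rasch model with a N(0,1) prior.
# Item difficulties and the response vector are illustrative.
import numpy as np

def eap_ability(responses, difficulties, n_quad=61):
    theta = np.linspace(-4, 4, n_quad)                  # quadrature points
    prior = np.exp(-0.5 * theta**2)                     # unnormalized N(0,1)
    p = 1 / (1 + np.exp(-(theta[:, None] - difficulties[None, :])))
    like = np.prod(np.where(responses, p, 1 - p), axis=1)
    post = prior * like
    return float((theta * post).sum() / post.sum())

difficulties = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])
responses = np.array([1, 1, 0, 1, 0])
print("number-correct:", responses.sum())
print("EAP theta     :", round(eap_ability(responses, difficulties), 3))
```

Two patterns with the same number-correct score can yield different EAP estimates when they hit items of different difficulty, which is one mechanism behind the incomparability noted in the abstract.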

5.
Evidence of comparability is generally needed whenever there are variations in the conditions of an assessment administration, including variations introduced by administering an assessment on multiple digital devices (e.g., tablet, laptop, desktop). This article provides a comprehensive examination of issues relevant to the comparability of scores across devices and, as such, a starting point for designing and implementing a research agenda to support the comparability of any assessment program. The work starts with a conceptual framework rooted in the idea of a comparability claim, a conceptual statement about how each student is expected to perform on each of the devices in question. A review of the available literature follows, focusing on how aspects of the devices (touch screens, keyboards, screen size, and displayed content) and aspects of the assessments (content area and item type) relate to student performance and preference. Building on this literature, recommendations to minimize threats to comparability are provided. The article concludes with ways to gather evidence to support claims of comparability.

6.
The use of accommodations has been widely proposed as a means of including English language learners (ELLs) or limited English proficient (LEP) students in state and districtwide assessments. However, very little experimental research has been done on specific accommodations to determine whether these pose a threat to score comparability. This study examined the effects of linguistic simplification of 4th- and 6th-grade science test items on a state assessment. At each grade level, 4 experimental 10-item testlets were included on operational forms of a statewide science assessment. Two testlets contained regular field-test items, but in a linguistically simplified condition. The testlets were randomly assigned to LEP and non-LEP students through the spiraling of test booklets. For non-LEP students, in 4 t-test analyses of the differences in means for each corresponding testlet, 3 of the mean score comparisons were not significantly different, and the 4th showed the regular version to be slightly easier than the simplified version. Analysis of variance (ANOVA), followed by pairwise comparisons of the testlets, showed no significant differences in the scores of non-LEP students across the 2 item types. Among the 40 items administered in both regular and simplified format, item difficulty did not vary consistently in favor of either format. Qualitative analyses of items that displayed significant differences in p values were not informative, because the differences were typically very small. For LEP students, there was 1 significant difference in student means, and it favored the regular version. However, because the study was conducted in a state with a small number of LEP students, the analyses of LEP student responses lacked statistical power. The results of this study show that linguistic simplification is not helpful to monolingual English-speaking students who receive the accommodation. Therefore, the results provide evidence that linguistic simplification is not a threat to the comparability of scores of LEP and monolingual English-speaking students when offered as an accommodation to LEP students. The study findings may also have implications for the use of linguistic simplification accommodations in science assessments in other states and in content areas other than science.
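For readers who want the shape of the testlet mean comparisons, here is a minimal sketch, assuming two arrays of 10-item testlet scores from the randomly spiraled forms; a Welch t-test from scipy is used, which may differ in detail from the analyses the authors ran, and the score arrays are fabricated.

```python
# Sketch: comparing mean testlet scores for the regular versus the
# linguistically simplified version, as randomly assigned via spiraling.
# The score arrays are simulated for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
regular    = rng.binomial(10, 0.62, size=500)   # 10-item testlet scores
simplified = rng.binomial(10, 0.63, size=500)

t, p = stats.ttest_ind(regular, simplified, equal_var=False)  # Welch t-test
print(f"mean regular={regular.mean():.2f}, simplified={simplified.mean():.2f}, "
      f"t={t:.2f}, p={p:.3f}")
```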

7.
With the increased use of computer-based assessments, comparability studies are moving beyond paper-and-pencil versus computer-based comparisons to examine variation among the computers themselves. It is therefore of practical importance to determine whether a device's screen size and resolution affect students' performance. Using data from a large school district that administered tests on either Macs with large, high-definition screens or Chromebooks with standard 14-inch screens, this study compared assessment results between devices by grade, subject, and item type. Results showed no significant evidence that a large, high-definition screen improved students' performance. Likewise, the impact of screen size did not vary by grade level or by item type. Limitations and directions for future research are discussed.

8.
Applied Measurement in Education, 2013, 26(4): 381-405
In recent years, there has been a large increase in the number of university applicants requesting special accommodations for university entrance exams. The Israeli National Institute for Testing and Evaluation (NITE) administers a Psychometric Entrance Test (comparable to the Scholastic Assessment Test in the United States) to assist universities in Israel in selecting undergraduates. Because universities in Israel do not permit flagging of candidates receiving special testing accommodations, such scores are treated as identical to scores attained under regular testing conditions. The increase in the number of students receiving testing accommodations and the prohibition of flagging have brought into focus certain psychometric issues pertaining to the fairness of testing students with disabilities and the comparability of special and standard testing conditions. To address these issues, NITE has developed a computerized adaptive psychometric test for administration to examinees with disabilities. This article discusses the process of developing the computerized test and ensuring its comparability to the paper-and-pencil test; it also presents data on the operational computerized test.

9.
Test preparation activities were determined for a large representative sample of Graduate Record Examination (GRE) Aptitude Test takers. About 3% of these examinees had attended formal coaching programs for one or more sections of the test.
After adjusting for differences in the background characteristics of coached and uncoached students, effects on test scores were related to the length and the type of programs offered. The effects on GRE verbal ability scores were not significantly related to the amount of coaching examinees received, and quantitative coaching effects increased slightly but not significantly with additional coaching. Effects on analytical ability scores, on the other hand, were related significantly to the length of coaching programs, through improved performance on two analytical item types, which have since been deleted from the test.
Overall, the data suggest that, when compared with the two highly susceptible item types that have been removed from the GRE Aptitude Test, the test item types in the current version of the test (now called the GRE General Test) appear to show relatively little susceptibility to formal coaching experiences of the kinds considered here.

10.
High-stakes testing, a phenomenon born out of intense accountability across the United States, produces instructional settings that marginalize both curriculum and instruction. Teachers and other school personnel have reduced instruction to drill and practice in an effort to raise standardized and criterion-referenced test scores. This study presents an alternative to current practice that engages students in learning and increases their awareness of the internal workings of standardized tests. The Test Item Construction Model (TICM) guides students through studying test-item stems and then creating their own items, in a 12-week progression from understanding items to writing them. Students grew in their understanding of test-item stems and in their ability to generate them. An ANOVA did not yield significant differences between random groups of trained and untrained test writers. However, students in the experimental group demonstrated gains in understanding of test items.

11.
Norm-referenced measurement tools, such as reliability, validity, and item analysis, are commonly used to reach and verify conclusions about criteria. Similar tools for criterion-referenced testing situations are scant. This study examined faculty planning and testing decisions and applied formulas to arrive at numerical indices that serve as analytical tools for use with criterion-referenced tests. The research documents the effects of applying the concept of platform unity, which has its roots in curriculum alignment theory. Alignment of curriculum occurs when the planned, the delivered, and the tested curricula are congruent. Specifically, platform unity aligns planned, domain-referenced content with appropriate test types. Mathematical formulas were created to determine numerically whether planned and tested content were congruent. In addition, four other constructs were examined: effectiveness and efficiency of test-item type selection, and overtesting and undertesting of course content. A chi-square goodness-of-fit test was used to compare faculty planning and testing decisions. Data indicated significant differences (p < .01) between content plans and the test types used to test that content. On the basis of the analysis, it was determined that faculty do not plan and test content congruently across three levels of cognitive content. Faculty also tended to overtest content; they were effective in their selection of test types, but not efficient.
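As a rough illustration of the planned-versus-tested comparison, the sketch below runs a chi-square goodness-of-fit test with the planned content proportions treated as expected frequencies; the content levels, counts, and proportions are hypothetical, not the study's data or the authors' platform-unity formulas.

```python
# Sketch: chi-square goodness-of-fit comparing the tested distribution of
# items across three cognitive content levels with the planned distribution.
# Counts and proportions are hypothetical.
from scipy import stats

planned_share = [0.50, 0.30, 0.20]          # planned weight per content level
tested_counts = [28, 14, 8]                 # items actually testing each level

total = sum(tested_counts)
expected = [share * total for share in planned_share]
chi2, p = stats.chisquare(f_obs=tested_counts, f_exp=expected)
print(f"chi2={chi2:.2f}, p={p:.4f}")        # p < .01 would flag misalignment
```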

12.
Examinations are an important means of assessing the effectiveness of teaching and learning; an item bank is the foundation of a test paper, and test-paper analysis is a method for checking the soundness of a test and analyzing examination results in detail. When building an item bank and drawing items from it, one should follow principles such as no repetition, no omission, balanced allocation of points, and a variety of item types; when drawing items, attention should be paid to controlling item types and chapter coverage. The three kinds of charts used in test-paper analysis are very helpful for understanding students and improving instruction.
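A minimal sketch of the kind of item-drawing rules described (no repetition, item-type control, chapter control, and a running point total), assuming the bank is a list of records with hypothetical chapter, type, and points fields; the blueprint is illustrative, not a rule set from the article.

```python
# Sketch: drawing items from an item bank under simple blueprint constraints:
# no repeated items, a fixed count per (chapter, item type) cell, and a
# check on total points. Bank structure and blueprint are illustrative.
import random

def draw_paper(bank, blueprint, seed=0):
    rng = random.Random(seed)
    paper = []
    for (chapter, item_type), n in blueprint.items():
        pool = [it for it in bank
                if it["chapter"] == chapter and it["type"] == item_type]
        if len(pool) < n:
            raise ValueError(f"bank too small for {(chapter, item_type)}")
        paper.extend(rng.sample(pool, n))      # sample() never repeats an item
    return paper

# Hypothetical bank: 200 items spread over 4 chapters and 2 item types.
bank = [{"id": i,
         "chapter": i % 4 + 1,
         "type": "mcq" if i < 120 else "short",
         "points": 2 if i < 120 else 5}
        for i in range(200)]
blueprint = {(1, "mcq"): 5, (2, "mcq"): 5, (3, "short"): 3, (4, "short"): 3}

paper = draw_paper(bank, blueprint)
print(len(paper), "items,", sum(it["points"] for it in paper), "points")
```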

13.
This study investigated the effects of delivery modality on the psychometric characteristics of cognitive tests and on student performance. A first study assessed the inductive reasoning ability of 715 students under teacher supervision. A second study examined 731 students' performance in applying the control-of-variables strategy in basic physics, but without teacher supervision due to the COVID-19 pandemic. Rasch measurement showed that the online format fit the data better under the unidimensional model across the two conditions. Under teacher supervision, paper-based testing was better than online testing in terms of reliability and total scores, but the pattern reversed without teacher supervision. Although measurement invariance was confirmed between the two versions at the item level, differential bundle functioning analysis favored the online groups on item bundles built from figure-related materials. Response time was also discussed as an advantage of technology-based assessment for test development.

14.
Adding representational pictures (RPs) to text-based items has been shown to improve students' test performance. Focusing on potential explanations for this multimedia effect in testing, we propose two functions of RPs in testing: (1) a cognitive facilitation function and (2) a motivational function. We found empirical support for both functions in this computer-based classroom experiment with N = 410 fifth and sixth graders. All students answered 36 manipulated science items that either contained (text-picture) or did not contain (text-only) an RP that visualized the text information in the item stem. Each student worked on both item types, following a rotated within-subject design. We measured students' (a) solution success and (b) time on task (TOT), and identified (c) rapid-guessing behavior (RGB). We used generalized and linear mixed-effects models to investigate RPs' impact on these outcome parameters and considered students' level of test engagement and item positions as covariates. The results indicate that (1) RPs improved all students' performance across item positions in a comparable manner (multimedia effect in testing), (2) RPs have the potential to accelerate item processing (cognitive facilitation function), and (3) the presence of RPs reduced students' RGB rates to a meaningful extent (motivational function). Overall, our data indicate that RPs may promote more reliable test scores, supporting a more valid interpretation of students' achievement levels.
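To illustrate how rapid-guessing behavior can be flagged from response times, here is a minimal sketch using a simple per-item threshold and comparing RGB rates between text-only and text-picture items; the 5-second threshold and simulated times are assumptions, not the authors' classification rule or data.

```python
# Sketch: flagging rapid-guessing behavior (RGB) with a simple per-response
# time threshold and comparing RGB rates for text-only versus text-picture
# items. Threshold and response times are illustrative.
import numpy as np

def rgb_rate(times_sec, threshold_sec=5.0):
    times = np.asarray(times_sec, dtype=float)
    return float((times < threshold_sec).mean())   # share of rapid guesses

rng = np.random.default_rng(3)
text_only    = rng.lognormal(mean=3.0, sigma=0.6, size=2000)  # seconds per response
text_picture = rng.lognormal(mean=3.1, sigma=0.6, size=2000)

print("RGB rate, text-only   :", round(rgb_rate(text_only), 3))
print("RGB rate, text-picture:", round(rgb_rate(text_picture), 3))
```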

15.
Recent research found a substantial effect of digital device type on a computerized, standardized assessment of learning, with findings consistent across grades and content domains. To investigate the source of these differences, the current research probed whether measurement differences, or equivalently test bias due to device type, could explain achievement or proficiency differences. We used 2018–2019 results on the Indiana Learning Evaluation Assessment Readiness Network (ILEARN). The data are census-based and cover grades 3 and 8 in the content domains of mathematics and English language arts (ELA). For the analysis, we used the root mean squared deviation (RMSD) to detect measurement differences across students who took ILEARN on different digital devices. Our findings suggest that few, if any, achievement differences that depend on the type of digital device can be explained by test bias. We discuss our findings in the context of multiple-group versus multiple indicator, multiple cause (MIMIC) measurement models. The findings suggest that digital device operates directly on the construct (math or ELA) rather than on the indicators that measure the construct. Finally, we conclude that a competing hypothesis, that differences are due to a digital device familiarity effect, remains plausible.
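To make the RMSD statistic concrete, here is a minimal sketch for one item in one device group: the weighted root mean squared deviation between the group's observed item characteristic curve and the model-implied Rasch curve. The ability strata, weights, and values are illustrative assumptions, not the ILEARN calibration or the exact formulation the authors used.

```python
# Sketch: RMSD for one item in one device group -- the weighted root mean
# squared deviation between the group's observed proportion correct within
# ability strata and the model-implied Rasch curve. Values are illustrative.
import numpy as np

def rmsd(theta_points, observed_prop, weights, difficulty):
    expected = 1 / (1 + np.exp(-(np.asarray(theta_points) - difficulty)))
    w = np.asarray(weights) / np.sum(weights)        # group ability weights
    dev = np.asarray(observed_prop) - expected
    return float(np.sqrt(np.sum(w * dev**2)))

theta_points  = [-2, -1, 0, 1, 2]                    # ability strata midpoints
weights       = [0.10, 0.22, 0.36, 0.22, 0.10]       # share of the group per stratum
observed_prop = [0.18, 0.33, 0.55, 0.74, 0.88]       # observed proportion correct
print(round(rmsd(theta_points, observed_prop, weights, difficulty=0.0), 3))
# Values near 0 indicate the common item parameters fit this device group.
```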

16.
Anatomists often use images in assessments and examinations. This study investigates the influence of different types of images on item difficulty and item discrimination in written assessments. A total of 210 of 460 students volunteered for an extra assessment in a gross anatomy course. The assessment contained 39 test items grouped into seven themes. The answer format alternated per theme and was either a labeled image or an answer list, resulting in two versions that each contained both images and answer lists. Subjects were randomly assigned to one version, and answer formats were compared through item scores. Both examinations had similar overall difficulty and reliability. Two cross-sectional images resulted in greater item difficulty and item discrimination compared to an answer list. A schematic image of fetal circulation led to decreased item difficulty and item discrimination. Three images showed variable effects. These results show that effects on assessment scores depend on the type of image used. Results from the two cross-sectional images suggest that an extra ability is being tested. Data from the scheme of fetal circulation suggest a cueing effect. Variable effects from other images indicate that a context-dependent interaction takes place with the content of the questions. The conclusion is that item difficulty and item discrimination can be affected when images are used instead of answer lists; thus, the use of images as a response format has potential implications for the validity of test items.
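A minimal sketch of the two classical item statistics compared across answer formats, item difficulty as the proportion correct and discrimination as a corrected item-total correlation; the simulated score matrix and this classical-test-theory computation are assumptions for illustration, not the authors' software or data.

```python
# Sketch: classical item difficulty (proportion correct) and item
# discrimination (corrected item-total point-biserial) per item, the two
# statistics compared between the image and answer-list formats.
import numpy as np

def item_stats(scores):
    scores = np.asarray(scores, dtype=float)         # persons x items, 0/1
    difficulty = scores.mean(axis=0)
    total = scores.sum(axis=1)
    disc = []
    for j in range(scores.shape[1]):
        rest = total - scores[:, j]                  # total with the item removed
        disc.append(np.corrcoef(scores[:, j], rest)[0, 1])
    return difficulty, np.array(disc)

rng = np.random.default_rng(5)
ability = rng.normal(size=300)
item_locs = np.linspace(-1, 1, 8)
prob = 1 / (1 + np.exp(-(ability[:, None] - item_locs)))
scores = (rng.random((300, 8)) < prob).astype(int)

diff, disc = item_stats(scores)
print("difficulty    :", np.round(diff, 2))
print("discrimination:", np.round(disc, 2))
```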

17.
This article compares the costs of four assessment formats: multiple choice, open ended, laboratory station, and full investigation. The amount of time spent preparing the devices, developing scoring consistency for the devices, and scoring the devices was tracked as the devices were developed. These times are presented by individual item and by complete device, and are also compared as if 1,000 students completed each assessment. Finally, the times are converted into cost estimates by assuming a potential hourly wage. The data show that a multiple choice item costs the least; an open ended item costs approximately 80 times as much, a laboratory station item approximately 300 times as much, and a full investigation item approximately 500 times as much. These very large discrepancies in cost are used as a basis for raising several policy issues related to the inclusion of alternative assessment formats in large-scale science achievement testing. (J Res Sci Teach 37: 615-626, 2000)
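The cost conversion itself is simple arithmetic; the sketch below turns preparation, consistency-training, and per-student scoring time into a total cost at an assumed hourly wage for 1,000 examinees. All times and the wage are hypothetical stand-ins, not the figures tracked in the study.

```python
# Sketch: converting development and scoring time into cost per assessment
# format for 1,000 examinees at an assumed hourly wage. All figures are
# hypothetical; the paper reports its own tracked times and cost ratios.
HOURLY_WAGE = 30.0          # assumed dollars per hour
N_STUDENTS = 1000

# format: (prep_hours, consistency_hours, scoring_minutes_per_student)
formats = {
    "multiple choice":     (0.5, 0.0, 0.05),
    "open ended":          (2.0, 3.0, 2.0),
    "laboratory station":  (8.0, 6.0, 6.0),
    "full investigation":  (12.0, 10.0, 12.0),
}

for name, (prep, consist, score_min) in formats.items():
    hours = prep + consist + score_min / 60 * N_STUDENTS
    print(f"{name:20s} ${hours * HOURLY_WAGE:10,.2f}")
```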

18.
Three Wechsler scales (the Wechsler Adult Intelligence Scale, the Wechsler Intelligence Scale for Children, and the Wechsler-Bellevue II) were administered in a counterbalanced design to 72 randomly selected 17-year-old high school students in order to investigate their comparability by testing the equality of (a) means, (b) variances, (c) reliability coefficients, and (d) validity coefficients based on scaled scores and IQs. Results indicated that the subtest scores and IQs for the three scales were not equivalent. The present findings are consistent with most previous results regarding the comparability of Wechsler scales. Although the three scales show highly similar item content and format, they clearly fail to meet the statistical criteria of equivalence for 17-year-old subjects.

19.
The use of constructed-response items in large-scale standardized testing has been hampered by the costs and difficulties associated with obtaining reliable scores. The advent of expert systems may signal the eventual removal of this impediment. This study investigated the accuracy with which expert systems could score a new, non-multiple-choice item type. The item type presents a faulty solution to a computer programming problem and asks the student to correct the solution. This item type was administered to a sample of high school seniors enrolled in an Advanced Placement course in Computer Science who also took the Advanced Placement Computer Science (APCS) examination. Results indicated that the expert systems were able to produce scores for between 82% and 95% of the solutions encountered and to display high agreement with a human reader on the correctness of the solutions. Diagnoses of the specific errors produced by students were less accurate. Correlations with scores on the objective and free-response sections of the APCS examination were moderate. Implications for additional research and for testing practice are offered.

20.
Item response time data were used to investigate differences in student test-taking behavior between two device conditions: computer and tablet. Analyses addressed whether the device condition had a differential impact on rapid-guessing and solution behaviors (with response time effort used as an indicator) as well as on the time students spent on the test (reading, mathematics, and science) or on a given item type (such as drag-and-drop and fill-in-the-blank). Further analyses examined whether the potential impact of device condition varied by gender and ethnicity group. Overall, there were no significant differences in response time effort related to device, although some differences related to item type and test sequence were noted. Students tended to spend slightly more time on the tests, and on certain item types, when testing on the tablet than on the computer. No interactions of device with gender or ethnicity were observed. Follow-up research on item time thresholds is discussed.
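Response time effort (RTE) is typically computed as the proportion of a student's item responses classified as solution behavior, that is, response time at or above an item's rapid-guessing threshold. Here is a minimal sketch under that definition, with the thresholds and response times assumed rather than taken from the study, which itself flags threshold choice as follow-up work.

```python
# Sketch: response time effort (RTE) per student -- the proportion of items
# answered with solution behavior, i.e., response time at or above the
# item's rapid-guessing threshold. Thresholds and times are illustrative.
import numpy as np

def response_time_effort(times, thresholds):
    times = np.asarray(times, dtype=float)           # persons x items, seconds
    thresholds = np.asarray(thresholds, dtype=float) # one threshold per item
    solution = times >= thresholds                   # broadcast across persons
    return solution.mean(axis=1)                     # RTE per student

rng = np.random.default_rng(11)
times = rng.lognormal(mean=3.0, sigma=0.7, size=(5, 40))
thresholds = np.full(40, 4.0)
print(np.round(response_time_effort(times, thresholds), 2))
```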
