期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Establishing Content-Related Validity Evidence for Assessments in Counseling: Application of a Sequential Mixed-Method Approach

Hulya Ermis-Demirtas 《International journal for the advancement of counseling》2018,40(4):387-397

相似文献

2.

语篇元功能的语料库支撑范式介入

刘建鹏杜惠芳洪明《外语教学理论与实践》2014,2(4):41

语料库支撑范式是研究系统功能语法的一个新视角,它建立在对大规模语料的检索、半自动标注基础之上,是一种理论验证与语料驱动相结合的研究方法。语篇元功能的语料库支撑范式介入研究首先是指衔接系统被模型化为可供检索的词型或正则表达式; 然后,以代表不同衔接意义的检索列表或正则表达式为检索项对文本检索,必要时可建立与之相关的附属语境词集; 最后,通过对检索行的解读和分析来建立语篇元功能的意义概率模型。相似文献

3.

Gathering Content-Related Validity Evidence for the School Leaders Licensure Assessment

Clyde M. Reese Richard J. Tannenbaum 《Journal of Personnel Evaluation in Education》1999,13(3):263-282

A licensure assessment's purpose is to measure the relevant knowledge and skills required for safe and effective professional practice. Given the important role of licensure assessments, great care must be paid to the issue of validity: Is the assessment measuring what it claims to measure? In particular, evidence of a licensure assessment's content-related validity must be collected and evaluated prior to incorporating the assessment into a licensure process. The School Leaders Licensure Assessment was developed to be part of state licensure processes for entry-level school principals. To evaluate the use of the assessment for this purpose, a multistate panel of professionals examined the assessment and rendered judgments concerning its appropriateness. The results of this content-related validity study support the use of the School Leaders Licensure Assessment by affirming the relevance and importance of the content being assessed. 相似文献

4.

Evaluation of construct-irrelevant variance yielded by machine and human scoring of a science teacher PCK constructed response assessment

《Studies in Educational Evaluation》2020

Machine learning has been frequently employed to automatically score constructed response assessments. However, there is a lack of evidence of how this predictive scoring approach might be compromised by construct-irrelevant variance (CIV), which is a threat to test validity. In this study, we evaluated machine scores and human scores with regard to potential CIV. We developed two assessment tasks targeting science teacher pedagogical content knowledge (PCK); each task contains three video-based constructed response questions. 187 in-service science teachers watched the videos with each had a given classroom teaching scenario and then responded to the constructed-response items. Three human experts rated the responses and the human-consent scores were used to develop machine learning algorithms to predict ratings of the responses. Including the machine as another independent rater, along with the three human raters, we employed the many-facet Rasch measurement model to examine CIV due to three sources: variability of scenarios, rater severity, and rater sensitivity of the scenarios. Results indicate that variability of scenarios impacts teachers’ performance, but the impact significantly depends on the construct of interest; for each assessment task, the machine is always the most severe rater, compared to the three human raters. However, the machine is less sensitive than the human raters to the task scenarios. This means the machine scoring is more consistent and stable across scenarios within each of the two tasks. 相似文献

5.

Developing and evaluating instructionally sensitive assessments in science

Maria Araceli Ruiz‐Primo Min Li Kellie Wills Michael Giamellaro Ming‐Chih Lan Hillary Mason Deanna Sands 《科学教学研究杂志》2012,49(6):691-712

The purpose of this article is to address a major gap in the instructional sensitivity literature on how to develop instructionally sensitive assessments. We propose an approach to developing and evaluating instructionally sensitive assessments in science and test this approach with one elementary life‐science module. The assessment we developed was administered to 125 students in seven classrooms. The development approach considered three dimensions of instructional sensitivity; that is, assessment items should: represent the curriculum content, reflect the quality of instruction, and have formative value for teaching. Focusing solely on the first dimension, representation of the curriculum content, this study was guided by the following research questions: (1) What science module characteristics can be systematically manipulated to develop items that prove to be instructionally sensitive? and (2) Are the instructionally sensitive assessments developed sufficiently valid to make inferences about the impact of instruction on students' performance? In this article, we describe our item development approach and provide empirical evidence to support validity arguments about the developed instructionally sensitive items. Results indicated that: (1) manipulations of the items at different proximities to vary their sensitivity were aligned with the rules for item development and also corresponded with pre‐to‐post gains; and (2) the items developed at different distances from the science module showed a pattern of pre‐to‐post gain consistent with their instructional sensitivity, that is, the closer the items were to the science module, the larger the observed gains and effect sizes. © 2012 Wiley Periodicals, Inc. J Res Sci Teach 49: 691–712, 2012 相似文献

6.

Digital ITEMS Module 2: Scale Reliability in Structural Equation Modeling

下载免费PDF全文

Gregory R. Hancock Ji An 《Educational Measurement》2018,37(2):73-74

In this ITEMS module, we frame the topic of scale reliability within a confirmatory factor analysis and structural equation modeling (SEM) context and address some of the limitations of Cronbach's α. This modeling approach has two major advantages: (1) it allows researchers to make explicit the relation between their items and the latent variables representing the constructs those items intend to measure, and (2) it facilitates a more principled and formal practice of scale reliability evaluation. Specifically, we begin the module by discussing key conceptual and statistical foundations of the classical test theory model and then framing it within an SEM context; we do so first with a single item and then expand this approach to a multi‐item scale. This allows us to set the stage for presenting different measurement structures that might underlie a scale and, more importantly, for assessing and comparing those structures formally within the SEM context. We then make explicit the connection between measurement model parameters and different measures of reliability, emphasizing the challenges and benefits of key measures while ultimately endorsing the flexible McDonald's ω over Cronbach's α. We then demonstrate how to estimate key measures in both a commercial software program (Mplus) and three packages within an open‐source environment (R). In closing, we make recommendations for practitioners about best practices in reliability estimation based on the ideas presented in the module. 相似文献

7.

Jörg Großschedl Daniela Mahler Thilo Kleickmann Ute Harms 《International Journal of Science Education》2013,35(14):2335-2366

Teachers’ content-related knowledge is a key factor influencing the learning progress of students. Different models of content-related knowledge have been proposed by educational researchers; most of them take into account three categories: content knowledge, pedagogical content knowledge, and curricular knowledge. As there is no consensus about the empirical separability (i.e. empirical structure) of content-related knowledge yet, a total of 134 biology teachers from secondary schools completed three tests which were to capture each of the three categories of content-related knowledge. The empirical structure of content-related knowledge was analyzed by Rasch analysis, which suggests content-related knowledge to be composed of (1) content knowledge, (2) pedagogical content knowledge, and (3) curricular knowledge. Pedagogical content knowledge and curricular knowledge are highly related (r_latent?=?.70). The latent correlations between content knowledge and pedagogical content knowledge (r_latent?=?.48)—and curricular knowledge, respectively (r_latent?=?.35)—are moderate to low (all ps?<?.001). Beyond the empirical structure of content-related knowledge, different learning opportunities for teachers were investigated with regard to their relationship to content knowledge, pedagogical content knowledge, and curricular knowledge acquisition. Our results show that an in-depth training in teacher education, professional development, and teacher self-study are positively related to particular categories of content-related knowledge. Furthermore, our results indicate that teaching experience is negatively related to curricular knowledge, compared to no significant relationship with content knowledge and pedagogical content knowledge. 相似文献

8.

A Polytomous Scoring Approach to Handle Not-Reached Items in Low-Stakes Assessments

Guher Gorgun Okan Bulut 《Educational and psychological measurement》2021,81(5):847

In low-stakes assessments, some students may not reach the end of the test and leave some items unanswered due to various reasons (e.g., lack of test-taking motivation, poor time management, and test speededness). Not-reached items are often treated as incorrect or not-administered in the scoring process. However, when the proportion of not-reached items is high, these traditional approaches may yield biased scores and thereby threatening the validity of test results. In this study, we propose a polytomous scoring approach for handling not-reached items and compare its performance with those of the traditional scoring approaches. Real data from a low-stakes math assessment administered to second and third graders were used. The assessment consisted of 40 short-answer items focusing on addition and subtraction. The students were instructed to answer as many items as possible within 5 minutes. Using the traditional scoring approaches, students’ responses for not-reached items were treated as either not-administered or incorrect in the scoring process. With the proposed scoring approach, students’ nonmissing responses were scored polytomously based on how accurately and rapidly they responded to the items to reduce the impact of not-reached items on ability estimation. The traditional and polytomous scoring approaches were compared based on several evaluation criteria, such as model fit indices, test information function, and bias. The results indicated that the polytomous scoring approaches outperformed the traditional approaches. The complete case simulation corroborated our empirical findings that the scoring approach in which nonmissing items were scored polytomously and not-reached items were considered not-administered performed the best. Implications of the polytomous scoring approach for low-stakes assessments were discussed. 相似文献

9.

“I never thought of it as freezing”: How students answer questions on large‐scale science tests and what they know about science

Tracy Noble Catherine Suarez Ann Rosebery Mary Catherine O'Connor Beth Warren Josiane Hudicourt‐Barnes 《科学教学研究杂志》2012,49(6):778-803

相似文献

10.

Gauging Item Alignment Through Online Systems While Controlling for Rater Effects

下载免费PDF全文

Daniel Anderson Shawn Irvin Julie Alonzo Gerald A. Tindal 《Educational Measurement》2015,34(1):22-33

The alignment of test items to content standards is critical to the validity of decisions made from standards‐based tests. Generally, alignment is determined based on judgments made by a panel of content experts with either ratings averaged or via a consensus reached through discussion. When the pool of items to be reviewed is large, or the content‐matter experts are broadly distributed geographically, panel methods present significant challenges. This article illustrates the use of an online methodology for gauging item alignment that does not require that raters convene in person, reduces the overall cost of the study, increases time flexibility, and offers an efficient means for reviewing large item banks. Latent trait methods are applied to the data to control for between‐rater severity, evaluate intrarater consistency, and provide item‐level diagnostic statistics. Use of this methodology is illustrated with a large pool (1,345) of interim‐formative mathematics test items. Implications for the field and limitations of this approach are discussed. 相似文献

11.

Patterns of Solution Behavior across Items in Low-Stakes Assessments

Dena A. Pastor 《Educational Assessment》2019,24(3):189-212

The trustworthiness of low-stakes assessment results largely depends on examinee effort, which can be measured by the amount of time examinees devote to items using solution behavior (SB) indices. Because SB indices are calculated for each item, they can be used to understand how examinee motivation changes across items within a test. Latent class analysis (LCA) was used with the SB indices from three low-stakes assessments to explore patterns of solution behavior across items. Across tests, the favored models consisted of two classes, with Class 1 characterized by high and consistent solution behavior (>90% of examinees) and Class 2 by lower and less consistent solution behavior (<10% of examinees). Additional analyses provided supportive validity evidence for the two-class solution with notable differences between classes in self-reported effort, test scores, gender composition, and testing context. Although results were generally similar across the three assessments, striking differences were found in the nature of the solution behavior pattern for Class 2 and the ability of item characteristics to explain the pattern. The variability in the results suggests motivational changes across items may be unique to aspects of the testing situation (e.g., content of the assessment) for less motivated examinees. 相似文献

12.

Conceptualizing content-related PD facilitator expertise

Prediger Susanne Roesken-Winter Bettina Stahnke Rebekka Pöhler Birte 《Journal of Mathematics Teacher Education》2022,25(4):403-428

相似文献

13.

The Rating and Matching Item-Objective Alignment Methods

Jerome V. D'Agostino Megan E. Welsh Adriana D. Cimetta Lia D. Falco Shannon Smith Waverely Hester VanWinkle 《教育实用测度》2013,26(1):1-21

Central to the standards-based assessment validation process is an examination of the alignment between state standards and test items. Several alignment analysis systems have emerged recently, but most rely on either traditional rating or matching techniques. Little, if any, analyses have been reported on the degree of consistency between the two methods and on the item and objective characteristics that influence judges' decisions. We randomly assigned judges to either rate item-objective links or match items to objectives while reviewing the 2004 Arizona high school mathematics standards and assessment. Across items we found moderate convergence between methods, and we detected apparent reasons for divergently scored items. We also found that judges relied on item and objective content and intellectual skill features to render decisions. Based on our evidence, we contend that a thorough alignment analysis would involve judges using both rating and matching, while focusing on both content and intellectual skill. The findings have important implications for states when examining the alignment between their standards and assessments. 相似文献

14.

Psychometric Aspects of Maintaining Standards of Examinations

C. A. W. Glas 《教育心理学》1988,8(4):257-270

Through pilot studies and regular examination procedures, the National Institute for Educational Measurement (CITO) in The Netherlands has gathered experience with different methods of maintaining the standards of examinations. The present paper presents an overview of the psychometric aspects of the various approaches that can be chosen for the maintenance of standards. Generally speaking, the approaches to the problem, can be divided into two classes. In the first approach the examinations are a fixed factor, i.e. the examination is already constructed and cannot be changed, and the link between the standards of both examinations is created by some test equating design. In the second approach the items of both examinations are selected from a pre‐tested pool of items, in such a way that two equivalent examinations are constructed. In both approaches the statistical problems of simultaneously modelling possible differences in the ability level of different groups of examinees and differences in the difficulty of the items are solved within the framework of item response theory. It is shown that applying the Rasch model for dichotomous and polytomous items results in a variety of possible test‐equating designs which adequately deal with the restrictions imposed by the practical conditions related to the fact that the equating involves examinations. Especially the requirement of secrecy of the content of new examinations must be taken into account. Finally it is shown that, given a pool of pre‐tested items, optimisation techniques can be used to construct equivalent examinations. 相似文献

15.

Embedded Standard Setting: Aligning Standard-Setting Methodology with Contemporary Assessment Design Principles

Daniel Lewis Robert Cook 《Educational Measurement》2020,39(1):8-21

相似文献

16.

Articulating the validity evidence for a science alternate assessment

下载免费PDF全文

Lori Andersen Brooke L. Nash Sue Bechard 《科学教学研究杂志》2018,55(6):826-848

Students with the most significant cognitive disabilities (SCD) are the 1% of the total student population who have a disability or multiple disabilities that significantly impact intellectual functioning and adaptive behaviors and who require individualized instruction and substantial supports. Historically, these students have received little instruction in science and the science assessments they have participated in have not included age‐appropriate science content. Guided by a theory of action for a new assessment system, an eight‐state consortium developed multidimensional alternate content standards and alternate assessments in science for students in three grade bands (3–5, 6–8, 9–12) that are linked to the Next Generation Science Standards (NGSS Lead States, 2013 ) and A Framework for K‐12 Science Education (Framework; National Research Council, 2012 ). The great variability within the population of students with SCD necessitates variability in the assessment content, which creates inherent challenges in establishing technical quality. To address this issue, a primary feature of this assessment system is the use of hypothetical cognitive models to provide a structure for variability in assessed content. System features and subsequent validity studies were guided by a theory of action that explains how the proposed claims about score interpretation and use depend on specific assumptions about the assessment, as well as precursors to the assessment. This paper describes evidence for the main claim that test scores represent what students know and can do. We present validity evidence for the assumptions about the assessment and its precursors, related to this main claim. The assessment was administered to over 21,000 students in eight states in 2015–2016. We present selected evidence from system components, procedural evidence, and validity studies. We evaluate the validity argument and demonstrate how it supports the claim about score interpretation and use. 相似文献

17.

Development and validation of the groupwork skills questionnaire (GSQ) for higher education

Jennifer Cumming Charlotte Woodcock Sam J. Cooley Mark J.G. Holland Victoria E. Burns 《Assessment & Evaluation in Higher Education》2015,40(7):988-1001

The aim of the present study was to develop and provide psychometric evidence in support of the groupwork skills questionnaire (GSQ) for measuring task and interpersonal groupwork skills. A 46-item version of the GSQ was initially completed by 672 university students. The number of items was reduced to 15 following exploratory factor analyses, and a two-factor model consisting of task and interpersonal groupwork skills was revealed. Confirmatory factor analyses with model re-specification on new data (n = 275 students) established that the best fitting model consisted of 10 items and the same two factors (task and interpersonal). Concurrent validity of the GSQ was then determined with 145 participants by demonstrating significant relationships (p < 0.05) with attitudes towards groupwork and groupwork self-efficacy. Test–retest reliability was examined over a one-week interval. Overall, the GSQ demonstrates good validity and reliability, and has potential for both research and pedagogical application. 相似文献

18.

Development of an item bank for assessing generic competences in a higher-education institute: a Rasch modelling approach

Qin Xie Xiaoling Zhong Wen-Chung Wang Cher Ping Lim 《高等教育研究与发展》2014,33(4):821-835

This paper describes the development and validation of an item bank designed for students to assess their own achievements across an undergraduate-degree programme in seven generic competences (i.e., problem-solving skills, critical-thinking skills, creative-thinking skills, ethical decision-making skills, effective communication skills, social interaction skills and global perspective). The Rasch modelling approach was adopted for instrument development and validation. A total of 425 items were developed. The content validity of these items was examined via six focus group interviews with target students, and the construct validity was verified against data collected from a large student sample (N?=?1151). A matrix design was adopted to assemble the items in 26 test forms, which were distributed at random in each administration session. The results demonstrated that the item bank had high reliability and good construct validity. Cross-sectional comparisons of Years 1–4 students revealed patterns of changes over the years. Correlation analyses shed light on the relationships between the constructs. Implications are drawn to inform future efforts to develop the instrument, and suggestions are made regarding ways to use the instrument to enhance the teaching and learning of generic skills. 相似文献

19.

Evaluating construct validity and internal consistency of early childhood individualized family service plans

《Studies in Educational Evaluation》2015

This study presents evidence regarding the construct validity and internal consistency of the IFSP Rating Scale (McWilliam & Jung, 2001), which was designed to rate individualized family service plans (IFSPs) on 12 indicators of family centered practice. Here, the Rasch measurement model is employed to investigate the scale's functioning and fit for both person and item diagnostics of 120 IFSPs that were previously analyzed with a classical test theory approach. Analyses demonstrated scores on the IFSP Rating Scale fit the model well, though additional items could improve the scale's reliability. Implications for applying the Rasch model to improve special education research and practice are discussed. 相似文献

20.

Estimating School Climate Traits Across Multiple Informants: An Illustration of a Multitrait–Multimethod Validation Through Latent Variable Modeling

Timothy R. Konold Kathan Shukla 《Educational Assessment》2017,22(1):54-69

The use of multiple informants is common in assessments that rely on the judgments of others. However, ratings obtained from different informants often vary as a function of their perspectives and roles in relation to the target of measurement, and causes unrelated to the trait being measured. We illustrate the usefulness of a latent variable multilevel multitrait–multimethod measurement model for extracting trait factors from reports of school climate obtained by students (N = 45,641) and teachers (N = 12,808) residing within 302 high schools. We then extend this framework to include assessments of linkages between the resulting trait factors and potential outcomes that might be used for addressing questions of substantive interest or providing evidence of concurrent validity. The approach is illustrated with data obtained from student and teacher reports of two dimensions of school climate, student engagement, and the prevalence of teasing and bullying in their schools. 相似文献