Examining a Multisemiotic Approach to Measuring Challenging Content for English Learners and Others: Results from the ONPAR Elementary and Middle School Science Study

This study contributes to the empirical research base on the effectiveness of ONPAR, a promising multisemiotic test item development process. ONPAR uses a variety of multisemiotic performance techniques to present and measure challenging concepts and skills of students, including English Learners (ELs) with low English proficiency and non-ELs. Experimental trials were used to investigate how 648 ELs at three English proficiency levels and native English speaking non-ELs performed on randomly assigned 4th and 8th grade traditional items and equivalent ONPAR items measuring challenging science content. General linear modeling, using classroom performance as a covariate, and binomial and multinomial regressions found differential boost across both grades. That is, findings showed that lower English proficient ELs perform better on ONPAR versus traditional forms in both grades, with p < .05 in favor of ONPAR in grade 8, whereas there were no significant differences between the two forms for non-ELs. The results also underscore the viability of the assessment methodology, in which students often demonstrate their response by showing their knowledge and skills. Item level results indicate that the ONPAR approach is useful at mitigating the effect of group.


Introduction
The Next Generation Science Standards (NGSS) in K-12 education demand rigorous teaching and learning of challenging content. Yet, the sophisticated language demands associated with complex content are generally beyond the reach of most English Learners (ELs). Do ELs have to wait until they obtain a corresponding level of academic language to fully participate in learning and to demonstrate their mastery of challenging concepts and skills such as multi-step reasoning and complex problem solving? Is it possible to teach and fluidly assess the more complex content Knowledge, Skills and Abilities (KSAs) of lower as well as higher English proficient students?
Many linguistic and EL experts, teachers, and others believe it is. They suggest that ELs do not necessarily lack at least some level of challenging KSAs, but rather that most current assessments of rigorous content lack the facility to communicate with many ELs about their knowledge, reasoning, and strategies. As Walqui and Heritage (2009), Williams, Tang, and Won (2019), and Walqui (personal communication, March 20, 2018) have commented, these less English proficient students are learning, developing unconventional mental schemas and relying on other resources, using their own multimodal meaning representations to acquire and demonstrate more challenging concepts and skills than they can typically read or adequately express to others in English text or orally.

Research Hypotheses
This study was conducted by the University of Wisconsin, in collaboration with the Center for Applied Linguistics and the partner state of Rhode Island. It contributes to the evidence base supporting the validity claims of multisemiotic assessment items, and their relevance for a wide range of students. In this particular investigation, the research team focused principally on measuring more challenging science abilities of elementary and middle school ELs with low proficiency in English. The assumptions underlying the study are, first, that low English proficient students are learning challenging content, not just basic content more commensurate with their current understanding of English. Second, these students are learning more complex content by working with their teachers to take advantage of other modes of representation that, along with some English, convey meaning from the teacher to the student and from the student back to the teacher (Kress, Jewitt, Ogborn, & Tsatsarelis, 2001; Lemke, 1998). These modes may include, but often are not limited to, their home language, for a variety of reasons.
From these assumptions the ONPAR team posited the following hypotheses:
1) Lower English proficient ELs will be able to demonstrate what they know in science significantly better using ONPAR assessment techniques than they could in a traditional test measuring similar content at similar cognitive complexity levels.
2) High English proficient ELs and native English speakers will be able to demonstrate what they know using ONPAR as well as traditional testing formats. The scores of mid-level English proficient ELs on both forms will generally sit between those of their lower and higher English proficient peers.
3) Between groups and within each form, there will be significant differences between low ELs and non-ELs on the traditional form, but less difference on the ONPAR form.
4) At the item level, ONPAR tasks will show smaller differences between EL and non-EL groups than traditional items measuring the same content.

Background
The ONPAR design methodology was developed to address a series of ongoing challenges in the assessment of more challenging content, and in the assessment of ELs and others with diverse learning schemas. As the field of assessment design has transitioned from paper-based to computer-based formats, a door has been opened for the innovative methodological approaches proposed by ONPAR.
Published by SCHOLINK INC.

If this type of challenging coursework is to be taught to ELs as well as to non-ELs, it stands to reason that these skills must be part of testing and that ELs, as well as others, need access to these types of assessment tasks at most grade levels.
Text editing practices associated with Universal Design for Learning (UDL) principles, along with simplified language and visuals, have been shown to be effective for ELs when items measure more basic knowledge or skills (Carr, 2008; Emick & Kopriva, 2007). However, this type of UDL editing and simplified language is frequently not adequate for the more nuanced and abstract language, language structures, and heavier language load used with more challenging content. To date, these more challenging test requirements and response environments still typically require nuanced reading skills or near grade level writing skills, carrying a heavy receptive and expressive linguistic load (e.g., American Association for the Advancement of Science (AAAS), 2007; Hansen & Zapata-Rivera, 2010).
Further, technology-rich environments such as those first used on the NAEP 2009 science tests tend to use large amounts of text in developing the problems, nuanced language in the selected response item types, and writing in English (Quellmalz & Silberglitt, 2010). Recently, some drag and drop response types have been added, but by themselves are insufficient for measuring the true mastery of lower English proficient ELs.

Advances in the Theoretical Design and Interpretations of Assessments
Within the field of educational measurement, the traditional argument for common inferences had been made on procedural grounds: common content in items and a common approach for synthesizing and summarizing items and response data over items. The latter part of this argument required standardized conditions of observation as a key aspect of synthesizing item data. However, based on the foundational work of Mislevy et al.'s (2004) Evidence Centered Design (ECD), Mislevy's extension of this design into gaming (2011), and Kane's (2013) theoretical and procedural advances in providing defensible evidence of test scores using traditional and novel item types, the conceptual argument has become prominent. This argument relies on evidencing appropriate relationships between target inferences, the knowledge and skills of interest, necessary observations, the properties of tasks or items designed to elicit the observations, and the assessment situations where students interact with the assessment requests. It shifts the focus to being able to collect commensurate but not necessarily identical types of evidence and allows for some flexibility in how data are collected as long as validation criteria are critically evaluated.
For some ELs, the rate of English development will be steep; for many, it may take four or more years before they can demonstrate complex content, reasoning, and skills using primarily text-based methods. As such, this study provides a possible alternative approach.

Methodology
To examine the hypotheses discussed in this article, a randomized experimental study was conducted.
The first three hypotheses focused at the test level. They examined how ELs at different levels of English proficiency and the control groups would perform on the ONPAR and traditional forms, and whether clear differences among form-group combinations could be identified. The fourth hypothesis focused at the item level, quantitatively examining the viability of the ONPAR methodology as compared to the traditional methods.

Instruments
The main instruments used in this study consisted of (1) released state and federal test science multiple choice and constructed response items selected to measure science standards in the two grades, and (2) ONPAR items measuring the same science targets at the item level. Two sets of supportive instruments were also developed for the study, (3) online tutorials and (4) teacher-rating questionnaires.
First, in identifying the traditional items for the study, the research team selected a set of multiple choice and constructed response items and scoring rubrics measuring more challenging KSAs from a set of released large-scale New England Common Assessment Program (NECAP) tests. In three cases, released constructed response items from recent NAEP science tests were also used. These items were used to begin the development of the ONPAR items, and a subset of them and their ONPAR 'peers' were selected for the final test forms.
Second, after the development of the ONPAR framework criteria that would guide the construction of the ONPAR items, and the selection of the released items, a set of ONPAR items was built to specifically measure the same intended item targets as each of the traditional items, at the same levels of cognitive complexity and content demand. The ONPAR items were designed to utilize dynamic simulated contexts in computer environments, and to use a variety of novel response environments that asked students to demonstrate what they knew by manipulating stimuli on the computer in various ways. The test items were assembled carefully using the ONPAR framework to ensure accessibility and their equivalency with the target intent of the traditional items. For instance, the science team first built and then revised item storyboards that visually display how the item questions, contextual stimuli, response elements, and response space sit on one or more screens, and how dynamic aspects of the items such as animations would unfold. Using iterative feedback sessions with other staff, science educators, and state stakeholders, the ONPAR item designs went through multiple iterations. As the final set of ONPAR items neared completion, a group of technology designers and programmers worked beside the ONPAR science team to build the ONPAR items electronically, and to score them in real time using algorithms. All the ONPAR items used novel performance techniques and could not be characterized as either multiple choice or constructed response. A decision was made to constrain the ONPAR scoring to 0-1 when the traditional versions were dichotomous, and to constrain the polytomously scored items so that each ONPAR item and its traditional counterpart measured the same score points. Individual rubrics for ONPAR were created to mirror the NECAP rubrics used for the open-ended traditional items.
Voice-overs of the text questions which students could access as needed were developed in English, Spanish, and Korean and uploaded electronically in each ONPAR item.
As reported elsewhere (Kopriva & Wright, 2017), the team conducted 58 cognitive labs with elementary and middle school students during the ONPAR item development to evaluate the items before they were finalized. The cognitive lab feedback was essential in subsequently refining and finalizing the item design, screen layouts, and assembly processes, evaluating if the multi-modal process was measuring the intended science, and if the designs were accessible to students with a range of challenges. Results in terms of access and measuring the target constructs were generally very positive.
To independently confirm that the ONPAR and traditional items were measuring the same assessment targets at the same level of cognitive complexity the research team consulted with an independent group of assessment and cognitive science experts (Kopriva et al., 2009). Based on their recommendations, the final set of ONPAR/traditional item pairs were selected.
In all, 14 ONPAR/traditional item pairs were selected for the 4th grade ONPAR and traditional test forms, and 13 for the 8th grade forms. Each of the forms was electronically assembled to be delivered by computer. The traditional 4th grade form included 5 constructed response and 9 multiple choice items, and the traditional 8th grade form included 2 constructed response and 11 multiple choice items.
In both grades 4 and 8 the traditional form also included one ONPAR item that was not scored but was placed at the end of the form at the request of the participating teachers. The four forms (two per grade) were assembled electronically and placed in secure online locations accessible during the administrations of the study.
Third, the team developed tutorials to be used immediately prior to the test administration. For students taking the ONPAR form in each grade, the tutorial oriented students to the screen layout and the interactive components of the ONPAR items. For students taking the traditional form (with an ONPAR item at the end), the tutorial provided classic instructions for the multiple choice and constructed response items, followed by a shortened version of the ONPAR tutorial explaining how the ONPAR item is laid out and how its interactive components work.
Fourth, the researchers developed a teacher-rating questionnaire for each grade using a 3-point rating scale. The questionnaires asked teachers to rate each student on how well the student demonstrated mastery in the classroom regarding particular content objectives aligned to each of the item pairs.
Teachers rated students as consistently below grade level, sometimes below grade level, or meets or exceeds grade level, with an additional option of not covered yet. The rating instruments were designed to capture a mid-range grain size of data that Schmidt et al. (2001) found teacher raters could use to differentiate with relatively little guidance.
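As an illustration of how such ratings might be aggregated into covariate totals, the sketch below codes the three grade-level categories numerically and excludes topics rated as not covered yet. Both the numeric coding and the exclusion rule are assumptions made for illustration, not the study's documented procedure.

```python
# Hypothetical coding of the 3-point teacher rating scale:
# 1 = consistently below grade level, 2 = sometimes below grade level,
# 3 = meets or exceeds grade level; None = not covered yet.
def rating_total(ratings):
    """Average a student's ratings over topics, skipping uncovered topics."""
    covered = [r for r in ratings if r is not None]
    return sum(covered) / len(covered) if covered else None

student = [3, 2, None, 3, 1]   # ratings across five hypothetical topics
total = rating_total(student)  # averages only the four covered topics
```

Averaging rather than summing keeps students comparable when different classrooms covered different numbers of topics.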

Sampling
Four groups of elementary and middle school students (native English-speaking students and low, mid and high English proficient ELs) from 26 schools across five states (Georgia, Illinois, Pennsylvania, Virginia, and Wisconsin) participated in the study. ELs were assigned to one of the groups based on their English language proficiency reading score as measured by the WIDA-developed ACCESS for ELLs® assessment: (a) levels 1 or 2 out of 5 (low), (b) level 3 (mid), or (c) level 4 or 5 (high). In all, 648 students completed the assessments; students were randomly administered ONPAR or traditional forms measuring the same content at the same level of cognitive complexity. The 338 4th grade students came from 16 different schools, and the 310 8th graders were from 10 different schools. School sites overall represented a mixture of urban and suburban areas across the five states, and included students ranging from low to high socio-economic status.

Procedures
After the development and selection of the items, forms, and support instruments, the administration and analyses of the cognitive lab results, and the recruitment of participating schools and teachers, the experimental trials occurred over adjacent spring and fall semesters. In each case, the 4th and 8th grade students (5th and 9th graders in the fall administration) were randomly assigned one of the test forms for their grade: the traditional item form or the ONPAR item form.
In conformance with IRB requirements, the research team obtained the necessary participation permissions and signed agreements and conducted online webinar training and question/answer sessions with participating teachers, school administrators, and IT staff.
Before the test administrations, participating teachers received, completed for each of their students, and returned the teacher rating questionnaire. Unique student identifier tags for each student per classroom, and instructions to the teacher to prepare the students for the test administrations, were sent to each teacher before the test dates.
For testing, students were randomly assigned to forms. This was accomplished using the following process. The unique student identification tags sent to the teachers prior to the administration were developed from the latest classroom lists sent from the participating schools. Going alphabetically down the lists, the student tags included a 2-digit unique number assigned each student per classroom, along with classroom, teacher, and school identifying numbers, and a computer identification number.
A copy of these numbers for each list of students and educators was kept securely by the research team at the development site, and the researchers at each of the test administrations had a copy in case the student tags were misplaced.
The laptop computers used to administer the tests were assigned numbers commensurate with the computer numbers on the student tags, and were brought to the sites by research staff for the test administrations at each school. Each of these computers was preloaded with either the traditional or the ONPAR form, as well as the associated pre-test tutorials. Participants completed their test form on the laptop matching the computer number on their student tag.
The entire testing period lasted about 45 minutes. All data collected via the laptops were subsequently downloaded and placed in a central database by student identifier and type of form for cleaning and analyses.

Data Analyses
After the data were cleaned and data sets were compiled, scoring of the test items was completed, as well as calibration and scaling of the forms. Descriptive and inferential analyses of total test scores were then computed, followed by item level analyses.

Scoring
After the data were cleaned and data sets assembled to conduct the analyses, each of the dichotomous ONPAR-traditional pairs, and the polytomously scored ONPAR items (except for one), were scored electronically. NECAP rubrics were used for the five 4th grade and two 8th grade polytomously scored traditional and ONPAR items; most of these items were scored on a 0-3 scale, except for two 4th grade items scored 0-2. Four independent raters (two per grade) were trained by project staff using industry methods to score the traditional constructed response items and the remaining ONPAR item.
Each grade-level rater scored each of the constructed response items, with a 10% read-behind by a third rater from the other grade to ensure reliable results. In the case of adjacent scores, an average score was computed. Project staff mediated in the event that the 10% read-behinds raised concerns or when rater results were not exact or adjacent. Interrater correlations were subsequently computed, with results between the two grade-level raters ranging from .83 to .99. The exact agreement between scores varied from 67% to 93% across items.
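The two-rater agreement checks described above can be sketched as follows. This is a minimal illustration using hypothetical 0-3 rubric scores, not the project's actual scoring software; the 10% read-behind and staff mediation steps are only noted in comments.

```python
import numpy as np

def interrater_summary(r1, r2):
    """Summarize agreement between two raters' scores on one item.

    Returns the Pearson correlation between raters, the exact-agreement
    rate, and resolved scores: adjacent (one-point-apart) scores are
    averaged, while larger discrepancies are left as NaN to flag them
    for staff mediation.
    """
    r1, r2 = np.asarray(r1, dtype=float), np.asarray(r2, dtype=float)
    corr = np.corrcoef(r1, r2)[0, 1]
    exact = np.mean(r1 == r2)
    diff = np.abs(r1 - r2)
    resolved = np.where(diff <= 1, (r1 + r2) / 2.0, np.nan)
    return corr, exact, resolved

# Hypothetical scores from two raters on a 0-3 rubric
a = [3, 2, 2, 0, 1, 3]
b = [3, 2, 1, 0, 1, 2]
corr, exact, final = interrater_summary(a, b)
```

With these hypothetical scores, four of six ratings agree exactly and the two adjacent pairs are averaged to half-point scores.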

Calibration and Scaling
The Rasch model (Rasch, 1980) was used to calibrate the dichotomous response items, and one of its polytomous extensions, the Partial Credit Model (PCM; Masters, 1982), was used for the polytomous items, providing equated estimates across the groups/forms. A linear transformation of the scores was then completed, fixing the mean at 500 and the standard deviation at 100.
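The final linear transformation can be sketched as below: ability estimates in logits (hypothetical values here) are standardized and re-expressed on a scale with mean 500 and standard deviation 100. Note that this sketch standardizes against the sample at hand; in practice the equating constants would come from the Rasch/PCM calibration.

```python
import numpy as np

def to_reporting_scale(theta, mean=500.0, sd=100.0):
    """Linearly transform Rasch ability estimates (logits) so the
    transformed scores have the target mean and standard deviation.
    The transformation is monotone, so score order is preserved."""
    theta = np.asarray(theta, dtype=float)
    z = (theta - theta.mean()) / theta.std()
    return mean + sd * z

thetas = np.array([-1.2, -0.4, 0.0, 0.5, 1.1])  # hypothetical logit estimates
scaled = to_reporting_scale(thetas)
```

Because the transformation is linear, differences between students in logits map proportionally onto the 500/100 reporting scale.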

Descriptive Analysis
Using the SAS statistical package, frequency data about the sample groups, test items, teacher questionnaire results, and item level data were compiled. Correlations were computed between each of these variables per form and grade.

Inferential Analyses
The General Linear Model program within SAS was used to calculate the omnibus two independent variable (group and form) analysis of covariance (ANCOVA) for grades 4 and 8. The covariate was each student's total score from the teacher rating questionnaire, and the dependent variables were test scores. The ANCOVA results showed main effects as well as general interaction effects. Covariate contrasts were conducted for each cell of the group by form interactions.
These analyses, rather than the typical contrast analyses, were calculated using a one independent variable interaction-only ANCOVA to determine if form/group differences could be considered when the main effects were not in the model. Frequency data of the means adjusted for the covariate were also computed.
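The covariate adjustment at the heart of these analyses can be sketched as follows, using hypothetical data: group means on the outcome are adjusted to the grand mean of the covariate via a pooled within-group slope. This is a minimal illustration of how adjusted means are computed, not the SAS GLM procedure the team ran.

```python
import numpy as np

def covariate_adjusted_means(score, covariate, group):
    """Covariate-adjusted group means, as in an ANCOVA.

    Fits a pooled within-group slope b for the covariate, then evaluates
    each group's mean at the grand covariate mean:
        adj_mean_g = mean(y_g) - b * (mean(x_g) - mean(x))
    """
    score = np.asarray(score, dtype=float)
    cov = np.asarray(covariate, dtype=float)
    group = np.asarray(group)
    groups = sorted(set(group.tolist()))
    sxx = sxy = 0.0
    for g in groups:                      # pool within-group sums of squares
        m = group == g
        sxx += np.sum((cov[m] - cov[m].mean()) ** 2)
        sxy += np.sum((cov[m] - cov[m].mean()) * (score[m] - score[m].mean()))
    b = sxy / sxx                         # pooled within-group slope
    grand = cov.mean()
    return {g: score[group == g].mean() - b * (cov[group == g].mean() - grand)
            for g in groups}

# Hypothetical scores, teacher-rating covariate, and group labels
adj = covariate_adjusted_means(
    [10, 12, 14, 20, 22, 24],
    [1, 2, 3, 3, 4, 5],
    ["low", "low", "low", "non", "non", "non"],
)
```

In this toy example the "non" group has both higher scores and higher covariate values, so adjustment narrows the raw 10-point gap between group means.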

Item Regressions
To address the viability of the ONPAR forms relative to the traditional method, a series of binomial logistic and multinomial/rank order regressions computed in SAS analyzed whether the independent variables of group status and classroom accomplishment (operationalized by the teachers' ratings at the question level) were significant predictors of ONPAR and traditional item scores. Dichotomous items were analyzed using binomial logistic regressions, and multinomial/rank order regressions were used for the polytomous items. These methods were used to provide a greater degree of stability in the item coefficients as compared to traditional regression techniques.
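A bare-bones version of the binomial case can be sketched in a few lines. The data and predictor here are hypothetical, and the SAS procedures the team used additionally produce the standard errors and significance tests reported later, which are not shown here.

```python
import numpy as np

def fit_logistic(X, y, iters=25):
    """Binomial logistic regression fit by Newton-Raphson (IRLS).

    A minimal stand-in for the binomial regressions described above:
    y holds 0/1 item scores and the columns of X hold the predictors
    (e.g., an EL-group indicator or a teacher rating).
    """
    X = np.column_stack([np.ones(len(X)), np.asarray(X, float)])  # intercept
    y = np.asarray(y, float)
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # fitted probabilities
        W = p * (1.0 - p)                     # IRLS weights
        # Newton step: beta += (X' W X)^{-1} X' (y - p)
        beta += np.linalg.solve((X * W[:, None]).T @ X, X.T @ (y - p))
    return beta

# Hypothetical data: 0/1 item score vs. a 1-4 classroom-accomplishment rating
ratings = [[1], [2], [2], [3], [3], [4]]
scores = [0, 0, 1, 0, 1, 1]
beta = fit_logistic(ratings, scores)  # beta[1] > 0: higher rating, higher odds
```

A positive slope coefficient here corresponds to the positive Betas reported for the dichotomous items, where students rated higher by their teachers were more likely to answer correctly.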

Results
This experimental study examined four hypotheses that looked at if and how ELs at different levels of English proficiency, along with their non-EL peers, performed on the ONPAR and traditional forms.

Descriptive Analyses
First, frequency counts of the participating sample by grade, form, and group were computed and can be seen in Table 1. Scale score means by grade, form, and group were also computed. Teacher ratings of individual student performance on topics covered in this study and exhibited in the classroom were used for two purposes: to gauge student opportunity to learn the topics addressed in test items and to serve as covariates in the ANCOVA calculations. The ratings, aggregated over topics by classroom, can be found in Table 3. As expected, means increased across the groups as students learned more English, but the variation in ratings remained similar across groups. These findings are encouraging as they suggest that teachers were able to differentiate students with a similar spread of scores across all groups, whether students were ELs or not, and across the three levels of English proficiency. A similar analysis also found that teachers were able to similarly differentiate topics within each group. These results suggest that teachers had a working knowledge of the topics and that students received at least some instruction about each of them. They also provided credibility to the researchers that the ratings could be used as an independent indicator of classroom performance.

Analysis of Covariance
To address the first three hypotheses for the study, ANCOVA tests and subsequent contrasts were completed for both grades. These analyses examined the differences within and between groups and within and between forms.

ANCOVAs, Grade 4
Frequency data, adjusted for the influence of the covariate, are shown in Table 4. In reviewing the results of the omnibus ANCOVA test for grade 4, results indicate that, not surprisingly, there are significant differences by group (F = 3.92, p < .009) but not by form. The interaction was non-significant, even though the adjusted means suggest there may be form differences by group, especially in the mean differences between forms for low ELs versus other groups (albeit with small n's for low ELs on both the traditional and ONPAR forms), and, within forms, where there were greater differences for low and mid ELs versus non-ELs. Because of the magnitude of these differences, we decided to examine the viability of the interaction hypotheses using a one independent variable interaction-only ANCOVA to determine if form/group differences could be considered when the main effects were not in the model.
The F (5.66) was significant at p < .0001, with a correlation coefficient of almost .4 (r = .39), suggesting the differential relationship over cells may be worth investigating. Table 5 reports the results of a set of ANCOVA contrasts most pertinent to this study. Inspecting the results within groups by form, no significant differences were found. That the low ELs in particular did not score significantly differently was disappointing, but power for this contrast was very low.
Differences in adjusted means within groups for low, high, and non-ELs decreased as expected (27, 18, and 10, respectively). Across groups by form, low ELs scored significantly lower than non-ELs on the traditional form, with close but non-significant results (p < .06) for mid ELs versus non-ELs, and no differences between high ELs and non-ELs. For the ONPAR form there were no differences between any EL group and non-ELs. In inspecting the magnitude of the mean differences on the traditional form, adjusted mean score differences of 65 can be seen for low versus non-ELs, 46 for mid ELs versus non-ELs, and two points for high ELs as compared to their non-EL peers. In contrast, the ONPAR score differences for low, mid, and high ELs versus non-ELs are 38, 20, and 9, respectively. Across groups, there seems to be a trend toward more proficient ELs scoring more similarly to non-ELs on both forms, while the performance of less proficient ELs favored the ONPAR forms. As expected, differences between EL groups and non-ELs were substantially smaller for the ONPAR compared to the traditional form, suggesting that the ONPAR methodology may be useful in reducing disparity in scores across groups due to issues of test format and presentation.

ANCOVAs, Grade 8
In similar fashion to grade 4, adjusted descriptive data for grade 8 can be found in Table 6, followed by results of the contrasts in Table 7. Overall, there is variation in the spread of the adjusted means for different groups and forms. The main effects of the omnibus ANCOVA are similar to those in grade 4 (significantly different for the group variable (F = 10.28, p < .00), not different for form), but the interaction effect (F = 2.45) is close to significance at .06. The interaction-only ANCOVA was also conducted for grade 8, with results similar to grade 4 (F = 5.42, p < .0001). In reviewing the subsequent pairwise contrasts, a number of relevant interactions are apparent.
In part, this may be due to the fact that the 8th grade sample sizes for the low and mid ELs are larger than in grade 4. As shown in Table 7, within groups, low ELs showed a significant preference for the ONPAR form.
While none of the other groups appeared to favor a form, it is interesting to note that the mid EL means adjusted for the covariate were higher on the traditional form than on the ONPAR form. When the scores for each EL group were contrasted by form with those from the native English non-EL group, low and mid ELs scored significantly lower than their non-EL peers on the traditional form. Again, no significant differences between high and non-ELs were identified, although the sample size of high ELs is small (n = 16 per form). For the ONPAR form the low ELs scored significantly differently from non-ELs (p < .05), and otherwise findings were non-significant. Similar to what was found in grade 4, the average score gaps between non-ELs and the low and mid EL groups lessened when the ONPAR form was administered as compared to the traditional form. Score differences of 40 versus 111 for low ELs on the ONPAR as compared to the traditional form, and 37 versus 54 for mid ELs on these respective forms, can be seen. The spread of score differences for high ELs was 6 on ONPAR and 31 on the traditional form.

Item Level Analyses
In order to quantitatively examine the viability of the ONPAR methodology as compared to the traditional methods, a series of item level regression analyses on the ONPAR and traditional items at both grade levels were conducted to address the fourth hypothesis. Eleven item pairs were analyzed in 4th grade (Table 8) and 11 in 8th grade (Table 7). Of these, five of the 4th grade items and two of the 8th grade items were polytomously scored. Overall, seven of the 11 items in grade 4 and 10 of the 11 items in grade 8 indicated a significant difference in EL group and/or students' science ability as rated by the teachers. Specifically, results indicate that ONPAR items are much less likely than the traditional versions to have EL status as a significant predictor. In total, six of the traditional items (two in grade 4 and four in grade 8) show EL to be a significant predictor (with positive Betas in the dichotomous items; negative in the polytomous items), compared to zero ONPAR items. For two of the polytomous item pairs (one in each grade), the EL group was found to be a significant predictor on both forms, where, of note, the Betas were again consistently negative.
There may also be some indication that the independent teacher ratings better predict performance on the ONPAR items, with five ONPAR items (two in grade 4 and three in grade 8) showing student accomplishment as a significant predictor compared to two 8th grade traditional items. In three cases (all in 8th grade) both dichotomous item versions were significant. Interestingly, all the significant Betas for science ability showed an inverse relationship, whether they were traditional or ONPAR items.

Discussion
This proof-of-concept experimental study investigated a novel approach to conveying meaning in challenging science assessment items. This technology-based methodology, called ONPAR, uses multisemiotic stimuli to convey meaning in substantive ways, rather than employing them as "window dressing" for text. The intent of this work is to support learning of the challenging content and practices articulated in recent science education standards such as NGSS, and to ensure that all students are included in the teaching and learning of more complex content, reasoning, and strategic skills that extend beyond basic recall and procedural application. This is important because this study suggests that students still acquiring formal academic language can learn more challenging content if we give them a chance, and this approach represents a way of tracking their progress.
Finally, the items and tasks developed using this approach are designed to provide a set of innovative techniques and strategies for developing standardized summative and diagnostic tests capable of being effectively used by students who otherwise cannot access the nuanced language typically associated with more challenging content. While the focus here is ELs, other studies suggest that disadvantaged students and others who learn and explain what they know in non-standard ways, may also benefit (Kopriva et al., 2013;Kopriva, Wright, Malkin, & Myers, 2019).
To these ends, four hypotheses were investigated in this study, three at the test level of analyses and one at the item level. The focal groups were students with low English language proficiency, ELs with mid and high levels of English language proficiency, and native English speakers. Data from a total of 648 students who took the ONPAR or traditional science items measuring grades 4 and 8 science content and skills were analyzed in this study. Where ONPAR adjusted mean scores were not significantly different from traditional scores for lower proficiency EL students (or from non-ELs on the respective forms, for that matter), it seems that low statistical power may be at least partially at fault, since the cognitive labs with younger students certainly suggested differences in access favoring ONPAR items. Clearly, further research needs to be completed to confirm this.
It is noteworthy that, with much higher n's per form, the difference between form types was never significant for native English speaking non-ELs in either grade. This is good news because it suggests that either form is accessible to this group, meaning that this approach could be used in performance tasks, as well as in classroom-based products, for all students.
Higher-proficiency ELs showed no significant score differences between forms or with non-ELs in either grade (although small samples in grade 8 suggest this finding needs to be confirmed). One caveat to this finding comes from ONPAR research in high school biology classrooms that suggests how high-proficiency ELs might behave across grades (Kopriva et al., 2013). Here, the researchers noted that, while ELs significantly preferred ONPAR tasks to traditional tasks measuring the same content at a similar level of cognitive complexity, closer inspection found that most of these ELs had high English proficiency (very few lower-proficiency ELs were in the sample). This finding suggests a possible age effect, and further research needs to be completed to understand differences in access between younger and older ELs.
Results for mid-proficiency ELs at both grades were a surprise in that the adjusted mean differences across forms were relatively small and non-significant (a 16-point difference in grade 4, and 18 in grade 8).
This held true even when sample sizes were small (in grade 4) and more robust (in grade 8). In fact, in grade 8 the traditional adjusted mean score was higher than the ONPAR score, which is consistent with high-proficiency and non-ELs, whereas in grade 4 slightly higher mean scores favored the ONPAR form, a pattern consistent with low-proficiency EL mean score comparisons. Two previous studies using static forms had found that mid-level English proficiency ELs are, overall, a more volatile group in terms of how they access items (Emick & Kopriva, 2007; Carr, 2008). Applied linguistics experts have noted that, in many cases, there is a mid-point in the development of English proficiency where ELs shift from focusing more holistically on meaning making in items to becoming hyper-focused on small elements in the text or assessment tasks. This shift often appears to decrease comprehension, especially of novel stimuli and/or more complex content. However, as these students attain higher levels of proficiency, this hyper-focus relaxes once again; they are able to grasp more nuanced meaning than low and low-mid proficiency ELs and can again discriminate between preferred and non-preferred communication modalities (for instance, see Graves, August, & Mancilla-Martinez, 2013). Clearly, more research needs to be conducted here as well.

Hypothesis #3: Focus on the Performance Within and Across Groups and Forms
Would there be significant differences between low-proficiency ELs and non-ELs on the traditional form, but less difference on the ONPAR form? This study tracked the spread of scores between each EL group and their non-EL peers on both forms. Across grades, findings, with one exception, consistently indicated that the spread of mean score differences on the traditional forms was substantially more pronounced for each EL group than the spread of mean differences on the ONPAR forms.
This occurred everywhere except for high-proficiency ELs versus non-ELs at grade 4, where the differences were rather flat (a difference of two score points on the traditional form versus nine score points on ONPAR) compared to the magnitude of the spread in all other cases. Looking at the spread of traditional mean scores for three dyads of students, the spread between each EL group and non-ELs averaged 26 points, a 17-point spread was reported for mid-proficiency ELs versus non-ELs in grade 8, and a striking 71-point spread of mean differences was found for grade 8 low-proficiency ELs versus non-ELs.
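The spread comparison above reduces to a simple computation: for each form, take the absolute gap between an EL group's mean score and the non-EL mean, then compare gaps across forms. A minimal sketch follows; note that the group means used here are hypothetical placeholders chosen only to illustrate the calculation, not the study's actual data.

```python
# Sketch of the "spread of mean differences" comparison. The means below
# are HYPOTHETICAL illustrative values; only the computation mirrors the
# analysis described in the text.

def spread(el_mean, non_el_mean):
    """Absolute gap between an EL group's mean score and the non-EL mean."""
    return abs(non_el_mean - el_mean)

# Hypothetical adjusted mean scores keyed by (form, group).
means = {
    ("traditional", "low_EL"): 410, ("traditional", "non_EL"): 481,
    ("ONPAR", "low_EL"): 455, ("ONPAR", "non_EL"): 470,
}

trad_gap = spread(means[("traditional", "low_EL")],
                  means[("traditional", "non_EL")])
onpar_gap = spread(means[("ONPAR", "low_EL")],
                   means[("ONPAR", "non_EL")])

# A narrower ONPAR gap is the pattern reported for lower-proficiency ELs.
print(trad_gap, onpar_gap)  # 71 15
```

With these placeholder values the traditional-form gap (71) far exceeds the ONPAR gap (15), mirroring the direction of the grade 8 low-proficiency comparison described above.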
At the item level, when and how might ONPAR tasks minimize the impact of group status on item performance and capture the KSAs of low English proficient ELs better than traditional methods?
There were indications in the item regression examinations that the effect of group status may be reduced using the ONPAR method, but low sample sizes complicate the findings, especially in grade 4.
Likewise, teacher rating estimates of science ability were somewhat more strongly related to lower-proficiency ELs' performance on the ONPAR items, with significant negative Betas for five ONPAR versus traditional item pairs, as compared to their performance on the traditional items in two pairs.
These results suggest that the ONPAR methodology may be useful in helping to close the performance gap due to communication methods and in leveling the performance "playing field" between ELs and non-ELs, especially for lower English proficient students. They also support the notion that individual ONPAR items may be a viable item adaptation for some ELs while other students take the traditional counterparts.

Implications
All in all, this study suggests that the ONPAR approach holds promise, although it has also highlighted possible pitfalls that need to be more fully understood. Fundamentally, the methodology illustrates how a number of multisemiotic techniques, properly assembled and implemented, can be used to substantively convey meaning. This approach broadens the status of these techniques in measurement beyond what has typically been their peripheral role vis-à-vis written text in presenting problems and scenarios to both EL and non-EL test takers, and in configuring how they are able to respond. While by no means eliminating text, ONPAR demonstrates how a number of multisemiotic representations can be used to convey meaning in tandem with language or on their own. Novel response methods, like those exemplified here, seem well-suited to improve how we measure educational content in at least two distinct ways.
First, by carefully integrating multiple sources of stimuli within and across screens to convey intended constructs and supplement response opportunities, this approach can selectively yet pervasively increase accessible communication avenues, so that students for whom language, literacy, or, most likely, language-processing challenges are often a barrier can more effectively access problems and better demonstrate their understanding. Here, the findings suggest that the ONPAR methodology is useful and effective for measuring these more complex knowledge and skills, for students whose language is less well developed as well as for students with adequate literacy skills. We and others (for instance, see Gee, 2008; Roth, 2005) contend, though, that supporting assessment access to more complex content knowledge and skills has the potential to encourage learning of more nuanced academic language, by engaging students in the content material and leveraging that interest as impetus to learn the academic language associated with this content. As Roth (2005) shows, when learning science, students often engage in "muddled" discussions about the phenomena; as their conceptual understanding grows through hands-on experience, so too does their ability to "talk science." A 2018 National Academies of Sciences, Engineering, and Medicine report concurs: academic language development must be anchored in conceptual understanding.
Second, the researchers assert that this study supports the viability of more directly presenting "live" (albeit technology-based) performance scenarios in testing, versus traditional modalities that rely almost exclusively on indirect descriptions or hands-on performances. As the testing field begins to tackle how to utilize the capabilities of technology, ONPAR researchers have spent considerable time deliberating and illustrating how to use multiple stimuli concurrently without overwhelming or confusing students; how to integrate various presentation and response avenues, including language, into seamless communication packages; how specific stimuli or combinations of stimuli convey various kinds of meaning for a number of purposes; and how to capture and concurrently score different types of performances, reasoning arguments, and a range of relational and meta-cognitive explanations using representations that are generated rather than selected by students.
While the technology capabilities referenced in ONPAR are rather straightforward and familiar to most of us who use the internet today, what makes ONPAR "work" is not the individual techniques but rather how they are combined to suit a large number of purposes. These purposes range from clarity and defensibility of the targeted material, and issues of access for literate students, those without much English, and those in between, to building tailored situations and site spaces for presenting construct-relevant multimodal scenarios and capturing a wide set of responses. The ONPAR methodology developed thus far provides a template for how some of these purposes might be met.
Continuing research is, of course, necessary to validate how well the approach produces the evidence demanded by the test inferences, for whom, and under what conditions. The proof-of-concept study of this methodology presented here seems a viable step forward in that direction.