Student Evaluations of Teaching: Is There a Relationship between Student Feedback on Teaching and the Student Final Grade?

The use of Student Evaluations of Teaching (SET) has become widespread practice in higher education despite inconclusive evidence reported in the literature around its validity. Not surprisingly, the validity of SET remains a live debate in higher education, pointing to the need for further research in this area. The current study contributes to broadening knowledge and understanding of the validity of SET by drawing on an online unit evaluation completed by students (n=2430 out of a total student enrolment of N=7757) in one university across three postgraduate education programs over a two-year period, to determine whether there is a relationship between student feedback on teaching and student final unit grade. Findings revealed that students who achieved very high or very low final unit grades tended not to participate in the SET, while students who achieved Pass or Credit grades did participate and thus provided feedback. This indicates that teaching and evaluating staff need to be aware that a large subset of their students is not providing the feedback staff rely on to improve the quality of their courses.


Introduction
There is much debate surrounding the validity and utility of Student Evaluations of Teaching (SET) within the current literature. Some papers have posited that such evaluations are important mechanisms for improving not only teaching practice but also course content (e.g., Borman & Kimball, 2005; Kinash et al., 2015; Richardson, 2010; Secret, Bentley, & Kadolph, 2016), while other authors have raised concerns around their validity and utility (e.g., Hornstein, 2017; Spooren, Brockx, & Mortelmans, 2013). According to Hornstein (2017), studies on the validity of SET "have been beset by questionable conceptual and statistical interpretations … [that] … have rendered the conclusions questionable at best" (p. 5). The validity of SET is thus a highly contested topic on which the evidence remains inconclusive (Spooren et al., 2013), which in turn calls the utility of SET into question for purposes such as improving teaching, promotion, and tenure. This situation calls for continued research in the area.
This study contributes to the debate on the validity of SET by investigating the relationship between student feedback on teaching and students' final unit grades. Some studies have investigated this relationship (e.g., Centra, 2003; Eiszler, 2002; Marsh & Roche, 2000); however, much of this research is dated, which calls for more recent studies, particularly as online evaluations become more prevalent in universities. The current study contributes to this debate in two key areas. First, we assess who the key respondents are in terms of their final unit grades. Second, students' feedback on teaching is assessed to determine whether more favourable feedback is associated with higher grades and less favourable feedback with lower grades. Our analysis addressed two research questions: 1) Does the grade a student receives in a unit affect the likelihood that they will respond to the SET? 2) Does the grade affect the type of response made by the student?

Background Literature-The Use of SET
Higher education institutions employ several mechanisms for monitoring and evaluating course satisfaction from the end-user (student) perspective. These mechanisms range from the informal, such as a tutor asking students how they are finding the course, through to formal instruments. For formal student evaluations of teaching and courses (units), Australian universities employ a standard survey tool that contains the same base questions for use across all their courses.
The proliferation of student surveys seeking to measure and understand teaching and course delivery may be due in some part to increased higher education regulation and a growing need for institutions to be accountable to consumers and regulatory bodies. The Tertiary Education Quality and Standards Agency (TEQSA) mandates that universities provide a mechanism for student feedback on courses, with the focus becoming compliance driven (Shah, Cheng, & Fitzgerald, 2017), as does the Higher Education Standards Framework (Gannaway, Green, & Mertova, 2017). The collection and interpretation of survey results are systematically carried out in many institutions (Shah et al., 2017). In recent years, the Australian Federal Government has initiated the annual Student Experience Survey (Fraile & Bosch-Morell, 2015; McClain et al., 2017), with such data also informing the study of academic achievement in addition to curriculum development (Richardson, 2005).
Within the teaching and education sectors it has been recognised that, amongst other aspects such as content and pedagogical knowledge, teaching quality influences student achievement (Borman & Kimball, 2005; Rozina, Noor, & Mohamed, 2016). In determining the effectiveness of teaching and content, educators have used, amongst other things, student surveys of teaching and content effectiveness. Commonly, these student evaluations of teaching are used by teachers and institutions to monitor the effectiveness of instruction and as a mechanism for course quality assurance (Macfadyen, Dawson, Prest, & Gašević, 2016). Some researchers maintain that student ratings have a high degree of validity (Cashin, 1995; Felder, 1992; Kong, 2014); after all, it is the students who are being taught, learning the content, and attempting to meet course learning outcomes.
The use of SET amongst academics to improve their teaching has been shown to be low (Stein et al., 2012, 2013). Gold and Adam (2016) point out that such low use of SET to improve teaching is attributed to concerns around the validity of data from these evaluations, which makes them less likely to be used for improving teaching. The literature on the validity of SET is explored next.

Validity of SET
Despite the various reported uses of data from SET, concerns have been raised in the current literature around the validity of such data (Spooren et al., 2013). There is some discussion amongst the academic community about the validity of online evaluations, because response rates are lower than for paper-based surveys and thus less representative of the learning and teaching experience (Rienties, 2014); however, some studies show that lengthier comments are supplied in online surveys (Bennett & De Bellis, 2010) and that these may be more positive (Sorenson & Reiner, 2003). The meta-analytical review conducted by Spooren et al. (2013) on the validity of SET also found that students provided more comments in evaluations administered online.
Studies on gender bias in SET have revealed contradictory results, with male academics sometimes ranked more positively than their female counterparts (e.g., Bachen, McLoughlin, & Garcia, 1999) and vice versa (e.g., Basow & Silberg, 1987). Studies that have focused on the relationship between grades and SET have raised further concerns around the validity of these evaluations (e.g., Abrami et al., 1990; Isely & Singh, 2005; McPherson, 2006). According to McClain et al. (2017), such studies confirm the association between grades and SET. Some of these studies (e.g., Johnson, 2003; Lin, 2009) have found evidence of a "reciprocity effect" (McClain, Gulbis, & Hays, 2017, p. 4) in which students rate instructors according to the grade they received (i.e., reward instructors who reward them and punish those who punished them), while others have shown evidence of a "leniency effect" (McClain et al., 2017, p. 4) in which instructors who are seen as lenient in their marking receive more favourable ratings (e.g., Abrami et al., 1990; Johnson, 2003; McPherson, 2006). However, many of these studies are dated, and as online evaluations become more prevalent in universities, more research is needed on the relationship between grades and SET.
While the above studies have raised questions around the validity of data from SET, it is argued that more research is still necessary to broaden knowledge and understanding in this area. According to Spooren et al. (2013), evidence on the validity of SET continues to be inconclusive, and "the utility and validity ascribed to SET should continue to be called into question" (p. 629). The findings of this study are expected to contribute to broadening knowledge and understanding of the validity of data from SET, which may lead to more informed future use of such data.

Data Collection
This research used student feedback data collected from the Unit Survey, administered online from 2013 to 2015 inclusive in three postgraduate programs in the case university, namely the Master of Teaching, Master of Education, and Master of Education (TESOL). This provided an opportunity to observe a balanced view of students' feedback from these programs, which have many units yet moderate student enrolments, keeping the data set within a reasonable tolerance limit. Students have access to the unit evaluation survey for a period of six weeks: two weeks prior to the conclusion of the unit and four weeks after. At the time of completing the SET survey, students do not know their final award (grade) for the unit, though many could work out an approximate award based upon their assessments.

Instrument
The Unit Survey is a confidential online survey that reviews both teaching and unit content using Likert scale questions to establish the percentage of agreement with each statement. There are also two qualitative free-text questions that allow the respondent to offer opinions on what was done well and where improvement may be achieved. It should be noted that it is unusual for students to provide qualitative feedback in our SET surveys; the reasons for this are not known and would provide a basis for an additional study. The Unit Survey is mandated for every unit offering each semester, and the historical data held within the University's data warehouse enabled the project to review and compare unit feedback across a reasonable timeframe.
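To make the percentage-of-agreement figure concrete, the minimal sketch below shows one way it could be derived, assuming a five-point scale on which 4 = Agree and 5 = Strongly agree; the item name and responses are illustrative only, not actual survey data.

```python
import pandas as pd

def percent_agreement(responses: pd.Series) -> float:
    """Share of respondents selecting Agree (4) or Strongly agree (5)
    on a five-point Likert item."""
    return 100.0 * (responses >= 4).mean()

# Illustrative responses to one item (1 = Strongly disagree ... 5 = Strongly agree).
q1 = pd.Series([5, 4, 4, 3, 5, 2, 4])
print(f"Q1 agreement: {percent_agreement(q1):.1f}%")  # 71.4%
```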
The content of the Unit Survey feedback reports provided to Unit Coordinators includes demographic information derived from enrolment records. This allowed demographic data on the student cohorts to be accessed and investigated without retrieving extra data.
The survey data were stored within the University's data warehouse managed by Information Technology Services (ITS), to which the project team did not have direct access. To investigate the relationship, an ITS technical analyst paired each individual confidential response with the final unit grade received by the student, without releasing raw data containing the student IDs stored in the data warehouse. This method, using ITS as independent custodian of the data, guaranteed anonymity for the respondents and allowed mapping of unit demographics (e.g., gender, age, campus/location, study mode) provided to staff within existing eVALUate reporting.
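As a concrete illustration of this kind of privacy-preserving pairing, the sketch below joins hypothetical response and grade extracts on a salted one-way hash of the student ID. The column names, values, and salt are all assumptions for illustration; the actual ITS procedure is not documented here.

```python
import hashlib

import pandas as pd

def pseudonymise(student_id: str, salt: str) -> str:
    """Replace a student ID with a salted one-way hash so that responses
    and grades can be joined without exposing the raw identifier."""
    return hashlib.sha256((salt + student_id).encode("utf-8")).hexdigest()

# Hypothetical extracts, each held only by the independent data custodian.
responses = pd.DataFrame({"student_id": ["s001", "s002", "s003"],
                          "q1_agree": [5, 4, 3]})
grades = pd.DataFrame({"student_id": ["s001", "s002", "s003"],
                       "final_grade": ["D", "C", "P"]})

SALT = "custodian-secret"  # known only to the custodian, never released
for df in (responses, grades):
    df["key"] = df["student_id"].map(lambda sid: pseudonymise(sid, SALT))
    df.drop(columns="student_id", inplace=True)

# De-identified, paired data set that could be released to a project team.
paired = responses.merge(grades, on="key").drop(columns="key")
print(paired)
```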

Analysis
The quantitative data were analysed using IBM SPSS Statistics 22 to review the frequencies of, and relationships between, the feedback from individual Likert scale questions and final unit grades. The qualitative data were analysed via sentiment analysis: each comment in the two free-text questions was read and coded individually within Excel according to whether the student provided feedback using positive or negative language, made statements with no sense of sentiment, or made no comment at all.
The revised coding was then imported into SPSS for further analysis.
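For readers who wish to approximate this coding scheme computationally, the sketch below assigns the same four codes using keyword matching. The study itself coded comments manually in Excel; the word lists and example comments here are purely hypothetical.

```python
import pandas as pd

# Illustrative sentiment lexicons; a real lexicon would be far larger.
POSITIVE = {"helpful", "clear", "engaging", "excellent", "enjoyed"}
NEGATIVE = {"confusing", "unclear", "slow", "poor", "disorganised"}

def code_comment(comment: str | None) -> str:
    """Assign one of the four codes used in the manual scheme:
    positive, negative, neutral (no sense of sentiment), or no comment."""
    if comment is None or not comment.strip():
        return "no comment"
    words = set(comment.lower().split())
    pos, neg = bool(words & POSITIVE), bool(words & NEGATIVE)
    if pos and not neg:
        return "positive"
    if neg and not pos:
        return "negative"
    return "neutral"  # mixed or no sentiment-bearing language

comments = pd.Series([
    "The tutor was helpful and the content clear.",
    "Lectures were confusing at times.",
    "Assessment 2 is due in week 10.",
    "",
])
print(comments.map(code_comment))
```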

Results and Discussion
To better understand the statistical techniques used on the data and their meaning, the results are presented with a running commentary (discussion). This approach is intended to provide a context for the analysis and for what the findings mean. Factor analyses were conducted to compare single and multifactorial solutions. These analyses provided evidence for the validity of the identified factor structure, with KMO > .800 and Bartlett's test of sphericity significant at p < .05. Cronbach's alpha was used to provide evidence for the reliability of the factor structure (α > .700). Factor loadings were displayed if .250 or above, with the threshold for acceptable loadings set at .300.
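A minimal sketch of these checks outside SPSS is shown below, using the factor_analyzer package for KMO and Bartlett's test and a hand-rolled Cronbach's alpha. The simulated Likert responses merely stand in for the actual survey items.

```python
import numpy as np
import pandas as pd
from factor_analyzer.factor_analyzer import (
    calculate_bartlett_sphericity, calculate_kmo)

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a set of Likert items (rows = respondents)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Simulated five-point Likert responses sharing a common latent factor.
rng = np.random.default_rng(0)
latent = rng.normal(size=500)
items = pd.DataFrame({
    f"q{i}": np.clip(np.round(3 + latent + rng.normal(scale=0.8, size=500)),
                     1, 5)
    for i in range(1, 8)
})

chi2, p = calculate_bartlett_sphericity(items)  # want p < .05
_, kmo_total = calculate_kmo(items)             # want KMO > .800
print(f"Bartlett chi2={chi2:.1f}, p={p:.4f}, KMO={kmo_total:.3f}, "
      f"alpha={cronbach_alpha(items):.3f}")
```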
As indicated in Table 2, these students responded most positively to Q1 (The learning outcomes in this unit are clearly identified), Q6 (The workload in this unit is appropriate to the achievement of the learning outcomes), and Q10 (I think about how I can learn more effectively in this unit), with all three sets of mean responses in the Agree (4) to Strongly agree (5) interval. They responded least positively to Q5 (Feedback on my work in this unit helps me to achieve the learning outcomes).
As indicated in Table 2, none of the skew values exceeded 1.96, the critical value at the 5% level, indicating acceptably symmetric distributions. To further interrogate the available data, inferential statistical techniques were undertaken to investigate relationships between variables, as shown in Table 3.
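As a worked illustration of this criterion, one conventional reading divides the sample skewness by its approximate standard error, sqrt(6/n), and compares the result against ±1.96. The snippet below applies that check to simulated data standing in for the survey scores.

```python
import numpy as np
from scipy.stats import skew

def skew_z(x: np.ndarray) -> float:
    """Skewness divided by its approximate standard error (sqrt(6/n));
    |z| < 1.96 suggests skew is not significant at the 5% level."""
    n = len(x)
    return skew(x) / np.sqrt(6.0 / n)

rng = np.random.default_rng(1)
scores = rng.normal(loc=4.0, scale=0.6, size=800)  # simulated item means
print(f"skew z = {skew_z(scores):.2f}")            # expect |z| < 1.96
```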

Inferential Statistics
In this section, associations between personal variables and learning outcomes were examined via regression and ANOVA. A preliminary examination of bivariate correlations (Spearman's rho) did not find any of the personal variables to be redundantly (collinearly) correlated with one another or with the learning outcomes variables. As indicated in Table 4, of the five personal variables, only semester was not significantly associated with learning outcomes. Based on Mahalanobis distance values, three cases were excluded from subsequent analyses. To simplify the display, all regression outcomes are reported in tabular format with only significant outcomes per DV included, as indicated in Table 5. As indicated in Table 5, after excluding year and semester (neither of which was a significant predictor of learning outcomes), the model was significant (p < .001), with 5.7% of the variance explained (R² = .057).
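The sketch below illustrates one way such an analysis could be run outside SPSS: Mahalanobis distances are compared against a chi-square cutoff to exclude multivariate outliers before fitting an OLS regression. The variable names, the alpha = .001 cutoff, and the simulated data are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

def mahalanobis_outliers(X: pd.DataFrame, alpha: float = 0.001) -> pd.Series:
    """Flag cases whose squared Mahalanobis distance exceeds the
    chi-square cutoff for the given alpha (df = number of predictors)."""
    diff = (X - X.mean()).values
    inv_cov = np.linalg.inv(np.cov(X.values.T))
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    return pd.Series(d2 > chi2.ppf(1 - alpha, df=X.shape[1]), index=X.index)

# Simulated stand-in for the survey data set.
rng = np.random.default_rng(2)
n = 600
df = pd.DataFrame({
    "age": rng.integers(21, 70, n),
    "gender": rng.integers(0, 2, n),
    "unit_score": rng.normal(65, 12, n),
})
df["learning_outcomes"] = (3.5 + 0.01 * df["unit_score"]
                           + rng.normal(scale=0.6, size=n))

keep = ~mahalanobis_outliers(df[["age", "gender", "unit_score"]])
model = smf.ols("learning_outcomes ~ age + gender + unit_score",
                data=df[keep]).fit()
print(f"R^2 = {model.rsquared:.3f}, p = {model.f_pvalue:.4g}")
```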

ANOVA Outcomes
An ANOVA was conducted as a follow-up to examine potential interactions between the three significant predictors of learning outcome scores. To do so, the grouped versions of age and unit score, together with the binary variable gender, were entered as predictors into an ANOVA with learning outcomes as the outcome variable. This is shown in Table 6. Levene's test of equality of error variances was significant, which would ordinarily suggest conducting this analysis via non-parametric methods. However, given the interest in interactions and the robustness of parametric methods, the parametric results are reported here.
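A minimal sketch of this design, assuming simulated data and illustrative group labels, first runs Levene's test across grade groups and then fits a full-factorial ANOVA with all two- and three-way interactions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import levene
from statsmodels.stats.anova import anova_lm

# Simulated stand-in data with the three categorical predictors.
rng = np.random.default_rng(3)
n = 900
df = pd.DataFrame({
    "grade": rng.choice(["F", "P", "C", "D", "HD"], n),
    "age_group": rng.choice(["21-24", "25-28", "29-36", "37-69"], n),
    "gender": rng.choice(["F", "M"], n),
})
df["learning_outcomes"] = 4.0 + rng.normal(scale=0.5, size=n)

# Levene's test of equality of error variances across grade groups.
groups = [g["learning_outcomes"].values for _, g in df.groupby("grade")]
print(levene(*groups))

# Full-factorial ANOVA: main effects plus all two- and three-way
# interactions between grade, age group, and gender.
model = smf.ols("learning_outcomes ~ C(grade) * C(age_group) * C(gender)",
                data=df).fit()
print(anova_lm(model, typ=2))
```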
As indicated in Table 6, the main effects for grouped unit score and age group were significant. Further, the two-way interactions between grouped unit score and age group, grouped unit score and gender, and age group and gender were all significant, as was the three-way interaction between grouped unit score, age group, and gender.
As illustrated in Figure 1, participants with high distinction or distinction scores obtained significantly higher learning outcome scores than those who obtained credit or pass scores. However, participants who failed these courses did about as well in terms of learning outcome scores as those with distinctions or high distinctions.

Figure 1. Mean Learning Outcome Scores for Each Grade Achieved
As illustrated in Figure 2, students in the 37-69-year age group with high distinctions obtained significantly higher learning outcome scores than students with pass or fail grades. In other age groups, students with distinctions obtained significantly higher learning outcome scores than students with pass marks. However, those with fail grades obtained learning outcome scores equivalent to those with distinctions or high distinctions.
As illustrated in Figure 3, males and females with distinctions or high distinctions, or who had failed, obtained non-significantly different learning outcome scores. Both males and females with pass grades obtained significantly lower learning outcome scores than those with distinction, high distinction, or fail grades. For credit grades, however, the pattern differed by gender: the learning outcome scores of males with credit grades were significantly lower than those with distinction, high distinction, or fail grades, whereas the scores of females with credit grades were non-significantly different from those groups.

Figure 3. Two-Way Interaction between Gender and Grade
As illustrated in Figure 4, female students in the 37-69-year age group obtained significantly higher learning outcome scores than those in the 21-24-year age group, whereas male students in the 37-69-year age group obtained significantly higher learning outcome scores than those in every other age group.
Three-way interactions are by their nature problematic to interpret. As illustrated by Figure 5, for those with high distinctions, males and females in the 21-24 and 37-69-year age groups obtained significantly higher learning outcome scores than females in the 25-28-year age group, whereas this was not the case for 25-28-year-old males.
For those with credits, males in the 37-69-year age group obtained significantly higher learning outcome scores than males in other age groups, whereas this was not the case for females. For those with pass grades, males in the 37-69-year age group obtained significantly higher learning outcome scores than males in other age groups, whereas females in that age group obtained significantly higher learning outcome scores than those in the 21-24-year age group but not otherwise. For those with fail grades, learning outcome scores did not differ significantly by age group for either males or females.

Conclusion
Within the sample group, students across the range of grades provided either quantitative or qualitative feedback through the Unit Survey. However, low and high achievers were not found to be the key responders; mid-range students were. The qualitative feedback received was constructive rather than abusive or unprofessional. Of particular interest was that students who failed rated their instructors nearly as highly as those who achieved a distinction or high distinction grade.
There is still much work to do within the survey space: from encouraging non-responders to participate and increasing response rates, to continuing the cycle of educating students on why feedback on courses and the student experience is important, and closing the feedback loop by ensuring that students know we are listening to their voice and implementing change.