A Study on the Effectiveness of Automated Essay Marking in the Context of a Blended Learning Course Design

This paper reports on a study undertaken in a Chinese university to investigate the effectiveness of an online automated essay marking system in the context of a Blended Learning course design. Two groups of undergraduate learners studying English were required to write essays as part of their normal course. One group had their essays marked by an online automated essay marking and feedback system; the essays of the second, control group were marked by a tutor who provided feedback in the normal way. The essay scores and attitudes of the two groups to the essay writing tasks were compared. It was found that learners were not disadvantaged by the automated essay marking system. Their mean performance was better (p<0.01) than that of the tutor-marked control group for seven of the essays and showed no difference for three essays. In no case did the tutor-marked group score higher than the automated group. Correlation analyses indicated that both groups improved significantly in performance (p<0.05) over the duration of the course and that there was a significant relationship between the essay scores of the two groups (p<0.01). The comparison of attitudes to the automated and tutor-marked systems was more complex: there was a significant difference between the attitudes of those classified as low and high performers (p<0.05). Overall, the automated system marked accurately and fairly, learners improved their performance over the duration of the course, and their attitude to the automated system was shown to be comparable to or better than the attitude of the control group to the tutor-marked system. In the discussion these findings are placed in a Blended Learning context.

The course is mainly followed by freshmen and sophomores in HE. A further aim of the course is to prepare learners for socially recognized English certificates such as the College English Test Bands 4 and 6 (CET 4 and CET 6), which are regarded as essential qualifications in the Chinese job market and which, in addition, cover many socio-cultural aspects of English.

Details of Teaching English Writing Practice
Several changes in teaching were made at Jian Qiao University in order to adapt to the perceived needs of teaching English writing. Among the four basic English skills of listening, speaking, reading and writing, improving English writing ability has always been a difficult task confronting Chinese teachers and students (Sun, 2014). In College English teaching at Jian Qiao University, the number of students in each class is large. This places heavy demands on teachers, who are required to spend inordinate time and energy marking students' compositions. For that reason, teachers were only able to assign one or two writing tasks in a single semester. Consequently, there were very few opportunities for students to practice their writing, and correspondingly few opportunities for teachers to review progress and provide feedback.
In the experience of one of the authors of this paper, teaching English writing raises many problems related to the process of grading students' English compositions. Firstly, it requires a considerable amount of time to grade essays and provide useful, timely and relevant feedback and evaluation to individual students. Secondly, grading can often be subjective. There is a possibility that students may be stereotyped according to the scores they have obtained rather than judged on their individual strengths and weaknesses. Demographic factors such as gender, age and ethnicity, prior performance on tests and other courses, and socio-cultural factors may conceivably influence feedback and, in extreme cases, the grade obtained by learners. It is often difficult for a teacher to be entirely neutral in their approach to marking essays. For these reasons, scoring essays becomes an enormously complex cognitive task that involves a multitude of inferences, choices, and preferences on the part of the grader. The exact features attended to in an essay, the characteristics and sections that are weighted most highly, and the standards adhered to are all factors that may vary widely across human graders. Indeed, it has been observed that teachers' ratings of essays can be highly variable and often not objective (Huot, 1990; Huot, 1996; Meadows & Billington, 2005).
Additionally, class sizes in Chinese universities are often very large. A teacher may often teach a class with more than 50 students; if he or she teaches several classes in parallel in one semester, then he or she is required to grade several hundred essays. Consequently, essay rating becomes an arduous task for teachers. Although teachers often devote a great deal of effort to marking, many students appear only to be concerned with the final score and less so with the feedback and feed-forward provided by the teacher. Students may be unwilling to review and reflect on the feedback or evaluation from the teacher. This makes it difficult to help a student to improve their writing prior to the next task. A possible reason for this may be the timeliness of the feedback. Fast, efficient feedback is likely to ensure that help is provided in good time to assist with the next task; if feedback is too slow, we argue, then students are likely to pay less attention to it.
Plagiarism is a growing concern in universities across the globe. The prevalence of electronic resources, copy-and-paste and file sharing has made it easy for some students to cheat. Manual marking of essays is slow and complex, as described above, and it is therefore difficult to detect plagiarism in students' paper-based writing. Grading English writing effectively, and providing useful feedback in a timely manner, thus becomes an important task.
Against this background, and in the context of a BL design, online Automated Essay Scoring (AES) has been adopted at Shanghai Jian Qiao University. AES is defined as a computer technology that is able to evaluate and grade written work (Shermis & Burstein, 2003). At Shanghai Jian Qiao University, the English writing course is delivered by face-to-face lectures and tutorials in classrooms.
An online AES system has therefore been implemented to supplement traditional classroom teaching. Using technology to supplement real classroom teaching is a fundamental objective of China's foreign language teaching, as explicitly stated in the National Curriculum of College English Course (2017). According to Kaleta et al. (2007), teachers who design BL courses often place additional online elements within a traditional course framework without removing current activities.
This phenomenon is also referred to as "the course-and-a-half syndrome" (ibid., p. 127).
Figure 1 below summarizes the type of BL design employed within this study. Instruction is delivered in the classroom, while all the necessary exercises and practice are completed online after class.
This may be considered a basic way of combining traditional classroom teaching with supplementary web-based activities. According to several researchers (e.g., Brunner, 2006; Kaleta et al., 2007), many instructors design BL courses in this way. The addition of extra activities to an existing, traditional course, as employed in this study, may be referred to as a basic-level blend.
Figure 1 illustrates the application of the basic-level blend approach to the English writing course design. This leads to the research objectives of this study:


• How can the effectiveness of this basic-level blend be tested?
• What are the advantages and disadvantages of this basic-level blend?

Literature Review
In this brief literature review, four main areas relating to the context of our research are covered:
① How was AES developed?
② What are the claimed benefits and claimed limitations of AES?
③ How do teachers perceive BL, and how does this perception impact course design?
④ What is the attitude of teachers to a basic-level blended course design for English writing?

A Brief Review of Studies on AES
More than 50 years ago, Ellis Page (1966) predicted the arrival of the so-called "teacher's helper" that would grade papers by computer (Shermis, 2014). Just seven years later, Page and his colleagues at the University of Connecticut developed the first automatic essay grading engine, called Project Essay Grade (PEG) (Ajay, Tillett, & Page, 1973; Shermis, 2014). For reasons related to the difficulty of entering text with the technology of the time, the system did not gain popularity until the early 1990s. From then on, some commercial and also several non-profit organizations took up exploring different types of essay scoring systems for the English language. AES systems at that time were adopted by testing companies, universities, and public schools (Toranj & Ansari, 2012). The most widely known AES systems include Project Essay Grade (Page, 1966, 1968, 2003).
In the general literature related to AES, the evaluation process for AES covers a number of criteria, including association with human scores, distribution differences, subgroup differences, and association with external variables of interest (Ramineni & Williamson, 2013); a toy illustration of such an agreement check is sketched at the end of this subsection. Some scholars have compared AES with human raters. According to Shermis (2014), AES performed well in five of seven tests and was close to human raters in the other two. Further studies on the validity of AES systems have suggested that they are able to play a practical role in high-stakes writing assessment (Shermis, 2014).
In her study of the scoring of essays by non-native writers, Weigle (2013) reported that AES is more consistent across multiple assignments than human raters. However, as she states, the operational rules of AES are not able to capture the characteristics of non-native writing; human raters are sensitive to these more specific characteristics when marking essays. Her conclusion from her research with learners of English as a foreign language emphasizes the need to understand students' diverse needs: first when system developers are designing AES systems, but also when teachers are developing courses that include additional AES activities. The more they know about students' needs, the greater the possibility of satisfying the diverse needs of an increasingly large population (Weigle, 2013). However, because writing is an activity that is so deeply human, its association with formulation is double-edged: because students are encouraged to write fluently and to achieve competency in their knowledge of conventions, a certain degree of formulation is necessary.
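The criterion of association with human scores mentioned above is usually quantified with simple agreement statistics. The following sketch is written for this review rather than taken from any of the cited studies: the two score vectors are invented placeholders, and Pearson's r and quadratic-weighted kappa are standard choices, not necessarily the measures the cited authors used.

# Sketch of how agreement between AES and human scores might be checked.
# The score vectors are invented placeholders, not data from the cited studies.
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

human_scores   = [3, 4, 2, 5, 4, 3, 4, 2, 5, 3]  # hypothetical ratings on a 1-5 scale
machine_scores = [3, 4, 3, 5, 4, 3, 5, 2, 4, 3]  # hypothetical AES output

r, p = pearsonr(human_scores, machine_scores)
qwk = cohen_kappa_score(human_scores, machine_scores, weights="quadratic")
print(f"Pearson r = {r:.2f} (p = {p:.3f}), quadratic-weighted kappa = {qwk:.2f}")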

Self-Efficacy
Gairs (2007) showed that some students had a higher satisfaction rating with online learning systems even though their performance was not necessarily enhanced, nor their behavior changed, by the use of AES systems. This was attributed not to the use of the AES system per se, but to an inherent willingness to engage with such systems. It was postulated that motivational processes such as reflection and self-efficacy were likely to be responsible for the high attitude scores. Researchers have argued that learners must take part in reflective activities if these are to result in a significant improvement in self-efficacy and task value in online activities (Qian et al., 2019). Self-reflection may be improved by a constructive BL approach in which students assess their own work, based on feedback and a knowledge of the assessment criteria, in relation to their individual performances and goals.
Learners may then have affective and cognitive reactions guided by their self-judgments, and might be able to make decisions based on previous learning and relate this to future tasks and goals. It is hoped that this hypothesized effect may be measured as an increase in self-efficacy at the end of our study.
Self-efficacy emphasizes the ability and confidence to achieve a goal satisfactorily. It relates to one's belief in one's capability to perform a specific task; it determines how people feel, think and motivate themselves, and it also refers to their confidence to achieve the desired outcome (Bandura, 1986). Individuals' task-specific self-efficacy can be generalized to a wide range of tasks or activities in certain disciplines (Bandura, 1997). Bong (2001) presented findings indicating that students' self-efficacy is an important factor in predicting their learning performance or achievement. Self-efficacy, it may be argued, mediates people's interpretation of their knowledge, skills, or experiences of prior attainments, and is believed to be an essential factor in positively predicting learning outcomes. According to Bandura (1997), students' learning experiences play an important role in explaining their self-efficacy of learning. In our research, the use of AES in a BL context is predicted to increase the self-efficacy of learners.

Curriculum Added with AES
A model that has been empirically demonstrated to yield substantial gains for students was described in "The Framework for Success in Postsecondary Writing" (CWPA, NCTE, NWP, 2011) and also by Graham and Perrin (2007). The general purpose of the study presented here is to explore the advantages and disadvantages of a basic-level blend with AES. It is hoped that this may help teachers to develop a deeper conception of BL in a real context and help students improve their English writing experiences. This involves the learning of phrases, idioms, writing styles, skills, conventions, strategies, rhetorical knowledge and critical thinking.

Details of the Online AES Software Used in This Study
The AES system used in this study is claimed to provide timely, comprehensive and effective grades and diagnostic feedback on students' writing online. It is claimed that the system enables students to better understand their own English compositions and to correct their mistakes themselves in good time, in order to improve their English ability. Teachers are also able to assess the overall writing level of their students, in order to conduct targeted tutorials for learners based on their performances. With the help of the system's automatic review, teachers are able to arrange more pertinent writing assignments easily, thus effectively addressing the traditional teaching problem that "students are unwilling to write, while teachers are unwilling to mark" (AES online, 2019).
It is also claimed by the developers of the system that it can analyze a composition in terms of spelling, content, organization, word choice and grammar, providing multidimensional personalized feedback that can be used for the formative and terminal evaluation of students, and that it can play an extremely important role in improving students' language ability (AES online, 2019). To sum up, the system is claimed to provide timely grading, multidimensional personalized feedback, and support for both formative and terminal evaluation. A toy illustration of this kind of multidimensional analysis is sketched below.
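The sketch that follows is a deliberately simplistic toy written for this paper, not the algorithm of the commercial system described above: the lexicon, the proxy measures and the dimension names are all assumptions made purely for illustration of what per-dimension feedback might look like.

# Toy sketch of multidimensional essay feedback, loosely inspired by the
# dimensions named above (spelling, word choice, organization). It is NOT
# the commercial system's algorithm; all heuristics here are invented.
import re

KNOWN_WORDS = {"the", "a", "an", "essay", "writing", "is", "important",
               "students", "practice", "feedback", "helps", "improve"}  # toy lexicon

def analyze(essay: str) -> dict:
    words = re.findall(r"[a-z']+", essay.lower())
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    unknown = [w for w in words if w not in KNOWN_WORDS]          # crude "spelling" proxy
    diversity = len(set(words)) / len(words) if words else 0.0    # word-choice proxy
    avg_len = len(words) / len(sentences) if sentences else 0.0   # organization proxy
    return {
        "spelling": f"{len(unknown)} word(s) not in lexicon: {unknown[:5]}",
        "word_choice": f"type-token ratio = {diversity:.2f}",
        "organization": f"average sentence length = {avg_len:.1f} words",
    }

for dimension, feedback in analyze("Writing practice helps students improve. Feedback is important.").items():
    print(dimension, "->", feedback)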
These claims, together with the research objectives stated above, lead to the following research questions.

• How can the effectiveness of this basic-level blend be tested?
① Can we observe any significant differences in performance between students using the basic-level blend approach with the AES system and students using the traditional method with paper-based practice only?
② What is the relationship between learning outcomes and learners' satisfaction with the experience of this basic-level blend?
• What are the advantages and disadvantages of this basic-level blend?
① What factors should be considered by teachers in HE when they choose this basic-level blended course design?
② What can be improved in this basic-level blend?

Participants
The experiment involved two groups of learners who were required to produce eleven essays. One group was assessed and given feedback by tutors (the control group), while the second group used the online AES system. The participants were two groups of undergraduates from non-English majors in a Chinese university. The groups were balanced as far as possible in the context of an ex-post-facto study. The demographic variables are shown in Table 2.

According to a model proposed by Zimmerman (2002), there are three stages of self-regulated learning strategy.
These include forethought, performance, and reflection. Learners in this study were required to complete these three stages in their course. Students set learning goals prior to starting a task in the forethought stage. Students then engaged in and completed an essay writing task (performance).
The feedback provided was intended to allow students to reflect on the learning process. How the self-regulated learning strategy was employed in this study is explained below.

Forethought:
The students in both groups were given an orientation to the course by the tutor, including the concepts of feedback, evaluation, goal setting, writing instructions and reflection. For the experimental group, the teacher also demonstrated how to use the AES system, and the students acclimatized themselves to its feedback and evaluation mechanisms. For the control group, the tutor demonstrated simple administrative procedures such as submitting work, how to make corrections according to the feedback and evaluation from the teacher, and how to store their work.
These functions were achieved fairly simply in the online system.

Performance:
The duration of the experiment was approximately 17 weeks, and Table 3 presents the essay topics that were assigned to both groups.

Reflection:
After completion, students reflected on their learning processes either through the writing feedback and evaluation mechanism provided by AES or from the teacher's paper-based comments.
Reflection thus related to the amount and quality of feedback given to participants by the tutor and the online system. Although this was not directly assessed in this research, learner attitude to the process was measured, which was assumed to relate to learners' reflections on the experience.

Figure 2. Phases and Processes of Self-Regulation according to Zimmerman and Moylan (2009)
At the end of the study, students were asked to rate their perceived difficulty of each of the essay topics on a 10-point Likert scale, and also to complete a short questionnaire on their experience of and attitude to English language essay writing.
The students were assigned writing tasks, online and on paper respectively, every 10 days throughout the duration of the study. The topics (as shown in Table 3) were the same for both groups, and the students were not informed of the experiment. Both groups were taught by the same teacher, and received the same curricular content, teaching schedule, requirements, and goal setting.

Results
A comparison between the paper-based essays and the Panorama online system was undertaken as previously described. A pre-test was completed by both groups to test whether there were differences between the samples; the results are shown in Table 5 below. In order to test the significance of any difference in the means of the two groups, an independent samples t test was performed. The results of this test confirmed that there was no significant difference between the mean performances of the groups (t=1.45, df=69, p=0.15). It was noted that, although there was no significant difference in the means, the online students exhibited a slightly higher mean score than the paper-based students.
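For readers who wish to reproduce this kind of check, the sketch below shows an independent-samples t test in Python with scipy; the score lists are invented placeholders, not the study's pre-test data.

# Minimal sketch of the pre-test comparison (independent-samples t test).
# Both score lists are hypothetical placeholders, not the study's data.
from scipy import stats

online_pretest = [72, 68, 75, 70, 66, 71, 69, 74]  # hypothetical, online group
paper_pretest  = [69, 65, 71, 67, 64, 70, 66, 68]  # hypothetical, control group

t, p = stats.ttest_ind(online_pretest, paper_pretest)
print(f"t = {t:.2f}, p = {p:.3f}")  # p > 0.05 would indicate comparable groups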
A comparison was made between the performance of the students as they undertook 10 essay assignments. The mean results of the essays and their topics are presented in Table 6 below. In order to obtain an informal understanding of the performances of the two groups, a graph was plotted showing how the performance of the two groups varied with time. This is shown in Figure 3 below.

Figure 3. Graph of the Comparison between Online and Paper-Based Mean Test Scores
It is interesting to note that the shapes of the two curves are similar; in general, essay scores in the tutor-marked system corresponded to those in the online automated system. To further understand any difference between the performances of the groups in the essay assignments, a 2x10 mixed ANOVA was performed on the data summarized in Table 6 above. The results of this ANOVA were (F=9.845, df=1, p=0.003). The value of p<0.01 leads us to conclude that there was a significant difference in test scores between the online and paper-based groups. The mean values of the test scores (from Table 6 above) were: tutor-marked=68.26; online=73.19. We conclude that, on average, the online automated system learners performed better than the control group.
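A 2x10 mixed ANOVA of the kind reported here can be computed, for example, with the pingouin library, assuming the scores are arranged in long format (one row per student per essay). Everything in the sketch below, including the synthetic data, the column names and the choice of library, is an illustrative assumption rather than the analysis pipeline actually used in the study.

# Sketch of a 2x10 mixed ANOVA: between factor 'group', within factor 'essay'.
# The data frame is synthetic; column names are assumptions for this example.
import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(0)
rows = []
for group, base in [("online", 73), ("paper", 68)]:
    for student in range(10):
        for i in range(10):  # essays E2..E11
            rows.append({"student": f"{group}{student}", "group": group,
                         "essay": f"E{i + 2}",
                         "score": base + i + rng.normal(0, 3)})
df = pd.DataFrame(rows)

aov = pg.mixed_anova(data=df, dv="score", within="essay",
                     subject="student", between="group")
print(aov[["Source", "DF1", "DF2", "F", "p-unc"]])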
A post hoc analysis was performed on the data summarized in Table 6. The results of independent ANOVAs are shown in Table 7 below. It is evident that essays E2, E3, E4, E5, E7, E8 and E9 showed significant differences in performance between the paper-based and online conditions (p one-tailed<0.05). Possible reasons for the lack of a significant difference for essays E6, E10 and E11 (p>0.05) will be discussed later.
The overall shape of the graph presented in Figure 3 is interesting. It suggests that both groups improved their scores over time. This is important as it suggests that the paper-based and online systems were both effective in improving the performance of learners. In order to test this hypothesis, a Pearson's product-moment correlation was performed for both groups to test the significance of any correlation between test scores and study time.
The output from this correlation is presented below in Table 8. Significant positive correlations were found for both paper-based essays (r=0.839, p<0.001) and online essays (r=0.680, p=0.011) with study time. This suggests that there was a significant positive relationship between study time and essay score for paper-based and online essays, showing that scores improved over the duration of the course.
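As an illustration, a score-versus-time correlation of this kind can be computed as follows; the week numbers and mean scores below are invented placeholders, not the values behind Table 8.

# Sketch of the score-versus-time correlation for one group.
# 'weeks' marks when each essay was set; scores are hypothetical placeholders.
from scipy import stats

weeks       = [1, 3, 5, 7, 9, 11, 13, 15, 17]
mean_scores = [65, 66, 68, 69, 71, 72, 74, 75, 77]  # hypothetical group means

r, p = stats.pearsonr(weeks, mean_scores)
print(f"r = {r:.3f}, p = {p:.3f}")  # a significant positive r indicates improvement over time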
The results suggest that in both cases learners' scores improved over time and that the performances of the learners on the essays were also related. This is an important finding in the context of this research: it is important to show that learners are not disadvantaged by a new intervention. We can conclude that the online system is at least as effective as the traditional paper-based system at supporting learners in their essay writing.
The overall shape of the graph displayed in Figure 3 is also interesting, as the shapes of the two curves are similar, which supports the above finding.
In order to explore more fully the shape of the graph in Figure 3 above, a further investigation was performed. Learners ranked their perceived level of difficulty for each essay on a Likert scale (1 to 10), where 1 is easy and 10 is difficult. It was then possible to investigate any relationship between perceived difficulty level and the scores obtained in the essays. A summary is presented in Table 9 below. It is interesting to note that those essays (E6, E10 and E11) for which there was no significant difference in performance between the two groups (see Table 7 above) had relatively high perceived difficulty levels. This factor may account for the lack of a significant difference. The relatively low p values for these exceptions, in the region of p=0.1, suggest that despite the lack of significance there may still be a slight positive effect.
In order to test the significance of any relationship between difficulty ratings and performance, a Spearman's correlation was performed on the data summarized in Table 9 above. The results of this correlation, presented in Table 10 below, were highly significant (p one-tailed<0.001) in most cases. This was taken to indicate that the test scores were indeed positively correlated with perceived difficulty level. There was also a significant relationship between the perceived difficulty levels of the online automated marked essays and the tutor-marked essays (p one-tailed=0.013).
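Because the difficulty ratings are ordinal Likert data, a rank-based Spearman correlation is the appropriate choice here; a minimal sketch with placeholder ratings and scores follows.

# Sketch of the Spearman correlation between perceived difficulty and score.
# Ratings (1-10 scale) and scores are hypothetical placeholders.
from scipy import stats

difficulty = [3, 4, 4, 5, 6, 6, 7, 8, 8, 9]
scores     = [62, 64, 66, 67, 70, 71, 73, 75, 76, 79]

rho, p = stats.spearmanr(difficulty, scores)
print(f"rho = {rho:.3f}, p = {p:.4f}")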
A Mann Whitney U test was performed to test the significance of any difference in the ranking between online automated and tutor marked essays. The results of this analysis showed that there was no significant difference between the perceived difficulty level of the two groups (N=11, U=51.00, p=0.533).
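The same comparison can be sketched with scipy's Mann-Whitney U test; the two rating lists below are hypothetical stand-ins for the per-essay difficulty ratings of the two groups (N=11 essays each).

# Sketch of a Mann-Whitney U test on perceived-difficulty ratings per essay.
# Both rating lists are hypothetical, one value per essay topic (N=11).
from scipy import stats

online_ratings = [3, 4, 4, 5, 5, 6, 6, 7, 8, 8, 9]  # hypothetical
paper_ratings  = [3, 3, 4, 5, 6, 6, 7, 7, 8, 9, 9]  # hypothetical

u, p = stats.mannwhitneyu(online_ratings, paper_ratings, alternative="two-sided")
print(f"U = {u:.1f}, p = {p:.3f}")  # p > 0.05: no difference in perceived difficulty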
An attitude questionnaire was administered to the participants in order to investigate any relationship between performance and attitude to the essays. The results of the questionnaire are shown in Figure 4 below (based on a Likert scale where 1 represents a negative attitude or opinion and 5 a positive one).

Figure 4. Results of an Attitude Questionnaire for Online Automated and Tutor Marked Groups
A Mann-Whitney U test was performed to test the significance of any difference in attitude between the online automated and tutor-marked essay groups. The results of this analysis suggested that there was a significant difference between the attitudes of learners in the two groups (N=19, mean rank online=25.39, tutor-marked=13.61; U=68.5, p one-tailed=0.001). The mean rankings show that the learners with online automated essay marking rated their experience more highly than those with tutor-marked essays.
A correlation was performed to investigate the significance of any relationship between essay score and attitude. Figure 5 below shows mean essay scores and attitude for the two groups of learners.

Figure 5. Mean Essay Scores and Attitudes for the Two Groups
The results of a Spearman's correlation on the data displayed in Figure 5 are shown in Table 11 below. The results show that there is a significant positive correlation between the attitude of the paper-based participants and their essay scores (rho=0.42, p=0.03). This is not seen in the online participants, where there is no significant correlation (rho=0.099, p=0.33). In order to investigate this finding further, an analysis was made of any difference in the attitudes of learners with high and low mean essay scores in each group. Table 12 below shows the mean rankings for the attitudes of learners classified as high and low achievers based on their essay scores, divided at the midpoint. A Mann-Whitney U test was then performed to test the significance of the differences in mean rankings between the individual groups. The results of this analysis, presented in Table 13 below, show that there was a significant difference between the online automated and tutor-marked groups for those classified as low achievers. There was no other significant difference.
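The achievement split and subgroup comparison described above can be sketched as follows; all scores and attitude values are invented placeholders, and splitting at the median is one plausible reading of "divided at the midpoint".

# Sketch of the high/low achiever split and the low-achiever attitude
# comparison between conditions. All values are hypothetical placeholders.
import numpy as np
from scipy import stats

def low_achiever_attitudes(scores, attitudes):
    """Return attitude ratings of learners whose mean score is below the midpoint."""
    scores, attitudes = np.asarray(scores), np.asarray(attitudes)
    return attitudes[scores < np.median(scores)]

online_low = low_achiever_attitudes([55, 60, 62, 65, 70, 75],
                                    [3.8, 4.0, 3.9, 4.2, 4.1, 4.3])  # hypothetical
paper_low  = low_achiever_attitudes([54, 59, 63, 66, 71, 74],
                                    [2.6, 2.9, 3.0, 3.1, 3.4, 3.2])  # hypothetical

u, p = stats.mannwhitneyu(online_low, paper_low, alternative="two-sided")
print(f"U = {u:.1f}, p = {p:.3f}")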
Low achievers following the online system rated it more highly than low achievers in the tutor-marked group. This may be due to several factors, including the feedback provided by the system. Feedback and reflection, as well as a summary of the results, are discussed in the next section.

Discussion and Conclusion
In order to integrate automated essay marking into a Blended Learning context, it is important to show that it is able to perform at least as well as traditional methods. It must be as fair as traditional methods, not disadvantaging students. It should mark accurately when compared to essays marked by tutors. It should provide useful feedback that compares well with that provided by tutors, leading to improvement in performance over the duration of the course. The attitude of learners to the system should be at least as good as that of learners to the traditional tutor-marked system. Tutors and learners should have confidence in the system; this is especially true of tutors if it is to be integrated successfully into a Blended Learning context. Our research has shown that the automated system marks accurately and fairly and that learners improve their performance over the duration of the course. Their attitude to the automated system was measured and shown to be comparable to or better than the attitude of the control group to the tutor-marked system.

Both groups were required to undertake reflective activities, such as reflecting on the individual feedback and evaluation of their writing. The AES system provided greater opportunity for this: it gave immediate feedback as soon as the students submitted their work, allowed adequate time for the students to make as many corrections as they thought necessary, and could then provide continuous suggestions to improve their work. In contrast, the traditional approach was time-consuming and required teachers to spend a great deal of time and effort. Feedback and evaluation in the AES system was quantitatively different from that provided by the tutor. The fact that the AES system performed similarly to or better than the traditional system, in terms of both scores obtained and attitude, suggests that this feedback and reflective process was effective.
It may be argued that the significant difference in performance is due to the automated system marking "softer" than the tutor-marked system. This may indeed be the case. It was also noted that the control group had a slightly lower pre-test mean score than the experimental group (although the difference was not significant). Future research is planned to investigate these issues with larger groups that are better matched and have less variance. The attitude of tutors to the system will also be explored, as this factor is essential to the implementation of the basic-level blend.