Computer-Assisted Assessment in Open-Ended Activities through the Analysis of Traces: A Proof of Concept in Statistics with R Commander

Open-ended tasks are common in Science, Technology, Engineering and Mathematics (STEM) education. However, as far as we know, no tools have been developed to assist in the assessment of the solution process of open-ended questions. In this paper, we propose the use of analysis of traces as a tool to address this need. To illustrate this approach, we developed a modified version of R Commander that collects traces of students’ actions and described a way to analyze them by using regular expressions. We used this tool in an undergraduate introductory statistics course. The traces were analyzed by comparing them to predefined problem-solving steps, arranged by the instructor. The analyses provide information about the time students spent on the activity, their work intensity and the choices they made when solving open-ended questions. This automated assessment tool provides grades highly correlated to those obtained by a traditional test and traditional grading scheme.


INTRODUCTION
Problem-solving, or, more generally, working on open-ended tasks, is commonly thought to be a fundamental activity in learning STEM, either as an educational goal in itself or as a way to develop scientific and mathematical skills (Bahar & Maker, 2015; Cai & Lester, 2010; English & Sriraman, 2010; Hardin, 2003). However, instructors face two main difficulties: assessment is time-consuming, and the steps followed by the students to obtain a solution are rarely available, even when detailed reports are requested.
A common option to reduce assessment time is using objective tests in Virtual Learning Environments (VLEs) and Online Homework Systems (OHSs). Their data are used in the Educational Data Mining (EDM) community to provide feedback to instructors and predict student performance (Romero & Ventura, 2010). However, some authors (Conijn, Snijders, Kleingeld, & Matzat, 2017) argue that this information is of limited value to predict student performance unless relevant assessment data are included. Unfortunately, neither VLEs nor OHSs provide tools to assess the solution process of open-ended tasks.
To obtain relevant content-related assessment data, Intelligent Tutoring System (ITS) applications are designed to capture user interactions by tracing or logging user data. This provides a powerful source of information for both psychological and educational research, as well as for generating useful feedback for users and instructors. Thus, the analysis of information stored in log files facilitates a large variety of studies about the student learning process (see, for instance, Romero and Ventura (2010) and references therein). Throughout the past decade, a large number of papers addressed educational applications of information stored in log files. In this paper we refer to these data as traces. Some of the applications mentioned above may be highlighted: (i) classification of students according to a specific criterion (Stevens, Beal, & Sprang, 2009), (ii) identification of plagiarism (Rosales et al., 2008), (iii) adaptive behavior detection, i.e., when a student learns how to manipulate the application in order to get higher marks (Baker, Corbett, & Koedinger, 2004), or (iv) disengagement detection, i.e., lack of learner motivation (Cocea & Weibelzahl, 2009).
Calvo et al. / Computer-assisted Assessment in Open-ended Activities
Traces are analyzed using the information produced by the application (Romero, Ventura, & Garcia, 2008). Since the information contained in such files and their internal structure influence how the data are accessed and retrieved, some research teams have produced specifications for generating data files containing information that is relevant to the educational task (Mostow & Beck, 2009; Song & Ma, 2008; VanLehn et al., 2009). The Pittsburgh Science of Learning Center (PSLC) has designed a file format adapted to the problems they study (Ritter & Blessing, 1998; Ritter & Koedinger, 1996). The PSLC data format is well documented in a comprehensive manual, which enables other groups to generate PSLC-compatible data from their own applications (PSLC, 2013). However, their model is primarily intended to capture interactions in educational tutoring applications, where tasks or skills are assessed multiple times. This orientation is less suitable when dealing with rich user interfaces and problem-solving tasks where there is not just one solution path and skills cannot be defined in advance, as is the case in solving open-ended statistical problems.
Efforts are currently being made to address the problem of time-consuming assessment and to achieve automatic grading in specific open-ended scenarios. For example, De Marsico, Sciarrone, Sterbini, and Temperini (2017) use a Bayesian network to predict students' performance using teacher and peer evaluation in open-ended questions, and Kinnebrew, Segedy, and Biswas (2017) interpret students' open-ended learning and problem-solving behaviors in the Betty's Brain environment by using Hidden Markov Models and Differential Sequence Mining. Both approaches are difficult to generalize and provide results that are hard for an instructor leading a class to interpret.
In the field of statistics, multiple online resources for teaching and learning have been devised. For instance, under the section 'Statistics and Probability' within the Multimedia Educational Resource for Learning and Online Teaching website (MERLOT, 2018), over a thousand online teaching resources appear that cover almost all the topics in higher education statistics courses. Moreover, numerous experiences in online statistics learning are reported in the literature (Anderson-Cook & Dorai-Raj, 2003; Basturk, 2005; Carnegie Mellon Open Learning Initiative, 2017; Dinov, Sanchez, & Christou, 2008; Gonzalez & Muñoz, 2006; González, Jover, Cobo, & Muñoz, 2010; Lane & Scott, 2000; Larreamendy-Joerns, Leinhardt, & Corredor, 2005; Suanpang, Petocz, & Kalceff, 2004; Symanzik & Vukasinovic, 2003; West & Ogden, 1998). To our knowledge, none of them allows for an automated assessment of open-ended tasks using logs.
Nowadays, one of the most widely used tools in statistics and data analysis is R (Muenchen, 2017; Piatetsky, 2017), an open source programming language and software environment for statistical computing (R Core Team, 2017). To facilitate its use in introductory statistics courses, Fox (2005) developed R Commander, a well-known Graphical User Interface (GUI) for R. It allows the use of R without compromising the learning process, as an R command line interface could represent an obstacle to many students. Although commonly used VLEs offer tools for assessment, they do not allow for assessing the solution of open-ended activities. This paper presents the development of a modified version of R Commander able to trace students' actions, and shows how these traces can be analyzed to evaluate an educational activity and assess student work even for open-ended activities. We also show that assessment based on traces can provide a grade for student work which correlates with more traditional assessment methods. In the following sections, we detail the methodology of this study and highlight the results and conclusions obtained in the first implementation of this assessment model.

Contribution of this paper to the literature
• While commonly used Virtual Learning Environments (VLEs) offer tools for assessment, they do not allow for assessing the solution of open-ended activities.
• A modified version of R Commander, a well-known tool for teaching introductory statistics, was developed to enable collecting traces of student actions when solving open-ended activities.
• These traces were analyzed to assess the work of the students by automatically comparing the collected traces to predefined solution paths. This automatic assessment provides grades comparable to those obtained by a traditional test and traditional grading scheme.

EURASIA J Math Sci and Tech Ed

METHODOLOGY
The methodological design of this research consists of three main aspects: (1) modifying R Commander to record traces of users' actions, (2) analyzing log files, and (3) creating and running a pilot activity in a Biostatistics course.

Modified version of R Commander
As mentioned earlier, R Commander (Fox, 2005) is an open-source GUI for R. As such, its code can be reused and modified to fit different needs. In our case, we modified R Commander to obtain an activity log of the students' work. This modified package can be downloaded from http://asistembe2.iqs.edu/rcmdrtr/. A portable version including R and our modified version of R Commander is also available there.
When the user interacts with this modified version of R Commander, either by clicking on a menu or by entering instructions in the R Script panel, the system records both the user command and the results shown in response. The system also records any editing of the dataset in use. Each record includes a timestamp, an anonymized random code for the user, and a session identifier. This user activity is logged as a sequence of XML entities that are saved locally in the working directory.

Analysis of the Log Files
The information collected in these log files allows analyzing many aspects of users' work, e.g. time spent on the activity, amount of work done, work intensity, solving strategies and paths taken by the user, and correctness of a specific resolution. We detail below the methods we used to estimate some of these aspects from the log files. We developed a library of R functions, which automates the log analysis. These functions can be downloaded as a Shiny application (Chang, Cheng, Allaire, Xie, & McPherson, 2018) from http://asistembe2.iqs.edu/rcmdrtr/.
Time spent on the activity is obtained by adding up the time spans between consecutive user actions. If a time span is longer than a predefined threshold, this lapse can be excluded from the user's working time.
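The time-on-task computation described above can be sketched as follows. This is a minimal illustration in Python, not the authors' R library; the function name `time_on_task`, the `idle_threshold` parameter, and the sample timestamps are all assumptions introduced for this example.

```python
from datetime import datetime, timedelta

def time_on_task(timestamps, idle_threshold=None):
    """Sum the gaps between consecutive actions, skipping gaps above the threshold."""
    ordered = sorted(timestamps)
    total = timedelta(0)
    for previous, current in zip(ordered, ordered[1:]):
        gap = current - previous
        # A gap longer than the threshold is treated as idle time and excluded.
        if idle_threshold is None or gap <= idle_threshold:
            total += gap
    return total

# Hypothetical timestamps for one user (not taken from the study data).
start = datetime(2000, 1, 1, 10, 0, 0)
actions = [start + timedelta(minutes=m) for m in (0, 1, 2, 30)]

total_raw = time_on_task(actions)  # no idle time discounted
total_active = time_on_task(actions, idle_threshold=timedelta(minutes=10))
```

With these made-up timestamps, the 28-minute pause between the third and fourth action counts toward `total_raw` but is dropped from `total_active`.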
Amount of work can be understood as the total number of actions recorded in the log files for a specific user.
We visualize work intensity by plotting the actions of each user on their time-on-task timeline, i.e. the cumulative time spent by the user up to the current action. Actions that lie closer together denote higher intensity.
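One possible way to obtain the position of each action on such a timeline is sketched below in Python. The function name, the use of plain seconds as timestamps, and the optional idle threshold are assumptions made for illustration; the source does not state whether idle time is discounted in this plot.

```python
from itertools import accumulate

def time_on_task_timeline(seconds, idle_threshold=None):
    """Return the cumulative time on task (in seconds) at each action."""
    ordered = sorted(seconds)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    # Optionally count over-long gaps as zero working time (an assumption).
    kept = [g if idle_threshold is None or g <= idle_threshold else 0 for g in gaps]
    # The first action sits at time zero; later ones at the running total.
    return [0] + list(accumulate(kept))

# Hypothetical action times in seconds since the start of a session.
action_times = [0, 60, 120, 1800]
timeline_raw = time_on_task_timeline(action_times)
timeline_active = time_on_task_timeline(action_times, idle_threshold=600)
```

Plotting one column of such values per student would give a chart in which densely packed dots indicate intense work.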
In order to analyze the strategies followed by a student and the correctness of those strategies, the instructor may design some milestones, which can be understood as steps in the solution process. They identify important achievements or potential errors when solving the activity. When a milestone may be reached in different ways, we call it an assessment milestone and each possible way is an observation item. Assessment milestones are defined as logical rules based on the observation items built to assess students' work. A trace-based grade for each user can be obtained by considering the number of successfully reached assessment milestones.
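The relation between observation items, assessment milestones and the resulting grade can be illustrated with a short Python sketch. Everything named here is hypothetical: the item and milestone identifiers, the regular expressions, and the sample trace are invented for the example and are not the rules used in the study. The sketch assumes that a milestone is reached when all items of at least one alternative are observed, and that the grade is the proportion of reached milestones scaled from 0 to 10.

```python
import re

# Hypothetical observation items: each name maps to a regular expression
# searched for in the logged R commands. Patterns are illustrative only.
OBSERVATION_ITEMS = {
    "OA": re.compile(r"1\s*-\s*pbinom\(\s*1\s*,"),
    "OB": re.compile(r"pbinom\(\s*1\s*,.*lower\.tail\s*=\s*FALSE"),
    "OC1": re.compile(r"ppois\("),
    "OC2": re.compile(r"pnorm\("),
}

# Hypothetical assessment milestones as logical rules: a milestone is reached
# if ALL items of at least ONE alternative were observed.
ASSESSMENT_MILESTONES = {
    "A1": [{"OA"}, {"OB"}],   # two alternative ways to reach the milestone
    "A2": [{"OC1", "OC2"}],   # two items that are both required
}

def observed_items(commands):
    """Names of the observation items whose pattern matches some command."""
    return {name for name, pattern in OBSERVATION_ITEMS.items()
            if any(pattern.search(command) for command in commands)}

def reached_milestones(commands):
    """Names of the assessment milestones satisfied by the observed items."""
    items = observed_items(commands)
    return {name for name, alternatives in ASSESSMENT_MILESTONES.items()
            if any(required <= items for required in alternatives)}

def trace_grade(commands, scale=10.0):
    """Proportion of reached assessment milestones, scaled to 0..scale."""
    return scale * len(reached_milestones(commands)) / len(ASSESSMENT_MILESTONES)

# Made-up trace of logged commands for one student.
trace = [
    "pbinom(1, size=63, prob=0.04, lower.tail=FALSE)",
    "ppois(3, lambda=2)",
]
grade = trace_grade(trace)
```

In this toy trace, the first milestone is reached through one of its two alternatives, while the second is not, because only one of its two required items appears.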

Pilot Activity in Teaching Biostatistics
To evaluate the feasibility and usefulness of this approach, a pilot study was run in a Biostatistics course for the degrees in Biomedical Sciences, Biotechnology and Biochemistry at a large public research university in Spain.
In this class, students were asked to carry out several activities online. Some of them adopted the format of a practical case, presented to the students at the beginning of the activity, followed by a series of questions and tasks related to the experiment. Those activities were developed in R with the exams package (Grün & Zeileis, 2009; Zeileis, Umlauf, & Leisch, 2014), which facilitates the creation and maintenance of large-scale exams in statistics courses as PDF, HTML, or e-learning questionnaires in Moodle or other VLEs. Those activities were randomly generated and automatically assessed in Moodle. We used one of those activities to test the traced version of R Commander.
The specific activity we used to test the system for collecting students' traces focused on probability distributions. The practical case study was about a factory releasing several pollutants into the atmosphere (Figure 1). In this scenario, the concentration of these air pollutants could be measured to check whether the law was violated. Students had to work with probability density functions, distribution functions and several topics related to probability. The activity included both questions that expect a numerical answer and multiple choice questions.
The activity was divided into three parts. In the first section, which consisted of six questions related to the concentration of sulphur dioxide (SO2), students worked with normal and binomial distributions. The second section, including eight more questions, continued with normal and binomial distributions but also introduced the Poisson distribution and some operations on random variables. The third part, which contained nine additional questions, mainly focused on Poisson random variables.
In order to assess the students' work from trace files, the expected solution path can be modeled as a set of milestones to be reached while each student follows the computer-based learning activity. Table 1 shows some milestones students have to reach. For instance, in section 2, we are concerned with the concentration of nitrogen dioxide (NO2). The text states that NO2 concentrations follow a normal distribution with a given mean and standard deviation. After computing a 96% confidence interval and taking into account that we take nine measurements a day, i.e. 63 measurements a week, question 5 asks the following: "What is the exact probability (calculating the exact distribution, not the approximate distribution) that in the span of one week, two or more measurements fall outside the 96% confidence interval determined at the beginning?" Milestones O14 and O15 are two possible calculations to obtain the desired answer. As can be seen in Table 1, milestones are mainly composed of R functions (such as pbinom, ppois, and pnorm) and some XML tags.
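To make question 5 concrete, the following Python sketch works through the calculation under two assumptions that the text does not state explicitly: the 63 weekly measurements are independent, and each falls outside the 96% interval with probability 0.04. The number of measurements outside the interval is then binomial with n = 63 and p = 0.04, and the requested probability is P(X ≥ 2). The two routes shown are only examples of equivalent calculations; whether they coincide with milestones O14 and O15 cannot be confirmed from the text.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 63        # measurements per week (nine per day, as stated in the text)
p_out = 0.04  # assumed chance that one measurement falls outside the 96% interval

# Route 1: complement of the lower tail, P(X >= 2) = 1 - P(X <= 1).
p_complement = 1 - sum(binomial_pmf(k, n, p_out) for k in (0, 1))

# Route 2: direct upper tail, P(X >= 2) = sum of P(X = k) for k = 2..n.
p_upper_tail = sum(binomial_pmf(k, n, p_out) for k in range(2, n + 1))
```

Both routes give the same probability, about 0.72. In R, the analogous calls would be `1 - pbinom(1, 63, 0.04)` and `pbinom(1, 63, 0.04, lower.tail = FALSE)`.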
In the spring of 2015, approximately 75 students (of the three degrees cited above) used this activity. We collected valid traces from the work of 62 of them. The activity took place in one session and the total time allotted was two hours. The instructor led the first part of the activity and the students were in charge of parts two and three.

RESULTS
Traces can be analyzed at different levels. The discussion that follows summarizes the observations at group level (including time and number of actions) and the analysis of the students' work from a set of milestones (for both observation item and assessment milestone) defined for the activity described above.
At group level, the distribution of the students' time solving the activity (time on task) is shown in Figure 2. This distribution is almost symmetrical, and its mean is approximately 90 minutes. Most students were actively involved in the activity. Only one of them spent less than 60 minutes solving the problem.
The number of actions performed by the students ranges from 10 (lines in the log file) to a maximum of approximately 50 (Figure 3). The distribution appears to be unimodal and approximately Gaussian, as one would expect.
When looking at the work intensity, no signs of tiredness or quitting are found, as the number of actions is homogeneous throughout the solution process (Figure 4). However, some blank gaps are observed. These intervals could indicate the beginning of a new section and the associated time needed to read its heading and to go through the theoretical questions included in the activity. Figure 4 allows identifying students with uncommon action densities: for instance, the last two subjects in the plot display a lower action density compared to the whole group.
A second level of analysis considers the observation items. For instance, O15 is less frequent than O14, even though both lead to the same answer, as can be seen in Figure 5.
Figure 5 depicts the relation between assessment milestones and observation items. The bars represent the assessment milestones with their corresponding success rate, while the circles situated above represent the corresponding alternative observation items in the assessment milestone. It is a direct visualization of how students answer each question. It is important to note that the sum of the percentages of the observation items achieved can be greater than the corresponding percentage of the assessment milestone, since students can try different or redundant ways to reach the assessment milestone and the software is able to trace those actions. Milestones with a low frequency are located at the end of sections 2 and 3, where the difficulty is higher. For instance, O10 and O22 are clearly the most often used ways to solve specific questions of the activity, identified as the assessment milestones A04 and A09, even though the students were not led by the instructor in this part of the activity.
Figure 9 shows a scatter plot and the regression line between the grades obtained in the Moodle questionnaire and the trace-based grades. Interestingly, the correlation coefficient (0.85) suggests that trace-based assessment and the traditional scheme implemented in Moodle provide similar results. It is worth mentioning that the Moodle forms include some theoretical questions that cannot be picked up by traces.
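As an illustration of how two sets of grades can be compared, the sketch below computes a Pearson correlation coefficient in Python. The text does not say which correlation coefficient was used, so Pearson's r is an assumption here, and the grades are made-up values for five hypothetical students, not the study data.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations around the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Made-up grades on a 0-10 scale for five hypothetical students.
moodle_grades = [4.0, 5.5, 7.0, 8.0, 9.5]
trace_grades = [3.0, 6.0, 6.5, 9.0, 9.0]

r = pearson_r(moodle_grades, trace_grades)
```

A value of r close to 1 would indicate that the two grading schemes rank and space students in a similar way.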

CONCLUSIONS
We developed a modified version of R Commander, allowing traces of students' actions to be collected. Its use is transparent for the users and leaves a file behind that can be submitted as a deliverable. This modified R Commander version can be downloaded from http://asistembe2.iqs.edu/rcmdrtr/.
We used this tool in an undergraduate introductory statistics course aimed at different student profiles. We designed a methodology to analyze the collected data and to estimate a grade from them. The traces were analyzed by comparing them to predefined solution steps. These analyses informed the instructors about the time the students spent, their work intensity and the choices the students made when solving the activity. Students with uncommon action densities could also be identified.
In the specific setting used in this study, the automated assessment presented here provided grades comparable to those obtained by a traditional test and a traditional grading scheme. The correlation coefficient between the trace-based grades and those obtained through the Moodle questionnaire reached 0.85.
The different analyses described in this work can also be valuable to instructors (in this case, in statistics) by informing them of learning gaps, by identifying students with specific learning needs, or by providing insights to give tailored feedback to individual students. The authors also foresee this approach to be useful to scholars involved in discipline-based education research. Some of the limitations of this study are: (1) the assessment milestones and the observation items are inherent to each activity and must be specifically defined; (2) consequently, some training may be required for those willing to create new observation items and/or assessment milestones; (3) the huge amount of information provided must be carefully analyzed to obtain insightful conclusions. In this sense, further work needs to be done to explore the possibility of extending this approach. We expect this assessment methodology to open the door to widespread automated grading of open-ended problems.

Figure 1. Screenshot of the activity. Students are required to calculate some probabilities. The students use R Commander as a statistical calculator and enter their results in the corresponding numerical fields

Figure 2. Time on task (no idle time discounted). The figure includes a histogram, a density curve, a boxplot, and the raw data plotted using a scatter plot with some jittering

Figure 3. Number of actions. The figure includes a histogram, a density curve, a boxplot, and the raw data plotted using a scatter plot with some jittering

Figure 4. Actions over time (each column = 1 student, each dot = 1 action). The columns are ordered by the amount of time needed to complete the activity, from lowest to highest

Figure 5. Relation between achievement in observation items and assessment milestones

Figure 8 can be interpreted as a student efficiency plot, i.e., students with a high grade (green) combined with a low number of actions in a short period of time are the most efficient ones.

Figure 8. Actions-Grade-Time Scatterplot (trace-based). The number of actions recorded in the log file is plotted against the time spent solving the activity. A diverging color scale is used to encode the trace-based grade, computed from the proportion of achieved milestones and scaled from 0 to 10. Red colors indicate grades in the lower range and green colors indicate grades in the higher range. A neutral color is used for the mid-range values

Figure 9. Relation between the Moodle grades, obtained from the questionnaire in Moodle, and the trace-based grades, computed from the assessment milestones. The size of the marker is proportional to the number of students

Table 1. Milestones students have to reach to correctly answer certain questions of the activity. The milestones in the same bold line box are alternative ways to obtain the answer. Milestones O21A and O21B form a unit and both are necessary to reach the solution. The numbers in red are variable parameters that can differ from one student to another