An Empirical Study of Iterative Improvement in Programming Assignments

As automated tools for grading programming assignments become more widely used, it is imperative that we better understand how students are utilizing them. Other researchers have provided helpful data on the role automated assessment tools (AATs) have played in the classroom. In order to investigate improved practices in using AATs for student learning, we sought to better understand how students iteratively modify their programs toward a solution by analyzing more than 45,000 student submissions over 7 semesters in an introductory (CS1) programming course. The resulting metrics allowed us to study what steps students took toward solutions for programming assignments. This paper considers the incremental changes students make and the correlating score between sequential submissions, measured by metrics including source lines of code, cyclomatic (McCabe) complexity, state space, and the 6 Halstead measures of complexity of the program. We demonstrate the value of throttling and show that generating software metrics for analysis can serve to help instructors better guide student learning.


INTRODUCTION
One challenge facing instructors of computer programming is identifying how students learn the material through hands-on practice that takes place beyond the instructor's observation, often outside the classroom.
A better understanding of how students iteratively modify their programs toward a solution can help us improve programming instruction over time. Examining a final submission can determine if a student eventually created a properly working program but does not indicate how efficiently or systematically he or she approached the problem. This paper examines data collected from more than 45,000 student submissions over 7 semesters in an introductory (CS1) programming course. Each semester, we gave students 75 C++ programming assignments, increasing in difficulty over the course of the semester.
Students submitted their work to an online system, Athene (created in-house), which stored, compiled, and ran each submission against a suite of specific test cases and carefully generated random test cases. The Athene system then automatically scored each submission, so students received feedback immediately. Programs that did not pass the entire test suite received either a compile error message or information about one or more failed test cases as feedback. For failed test cases, students received a report of their input, the expected output, and the actual output from their program. Students had the opportunity to modify and resubmit their program in an attempt to improve their grade.
A student session consisted of all submissions by one student for a given problem. There was no limit on the number of submissions that a student could attempt until the deadline, although we experimented with "throttling," a technique that limited a student's attempted solutions within a rolling 15-minute time period. This paper also considers the effects of this throttling on submission behavior.
With this approach, we intend to encourage students to complete each assignment, overcoming early errors to eventually reach a final correct solution. Automated scoring is made possible by preprocessing and other source modifications, then compiling and linking with both standard and specialized libraries. The system scores student submissions by direct inspection of required features (such as functions) and validation of program output.
We look more closely at two key moments in each student session in order to see the amount of change and meaningful progress that takes place (1) between a student's first and second submissions and (2) between a student's first and last submissions. In our courses, 83% of student sessions reach a score of 100%.
In June 2005, Kirsti M. Ala-Mutka surveyed a range of assessment tools available at that time, reviewing their abilities in dynamic testing to construct a secure running environment and check program functionality, efficiency, and student testing skills; and in static testing to check coding style and programming errors, collect software metrics, and assess design [3]. In September 2005, Christopher Douce, et al., provided a historical overview of the development of AATs and a survey of related research, concluding with a series of useful criteria for evaluating such tools, including whether the system "does what it is supposed to do," whether "it is liked by its users," and whether "it helps students become more proficient at programming" [5]. In 2010, Petri Ihantola, et al., followed up with a review of AAT development between 2006 and 2010, determining that the most significant differences among programs had to do with "how tests are defined, how resubmissions are handled, and how the security is guaranteed" [7].
Since the publication of these surveys, several key papers have devoted expanded attention to more recent programs. For example, Mark Sherman, et al., looked at the web-based Bottlenose framework and considered how students responded to instant automated feedback in contrast to time-delayed instructor feedback, finding that students made 50% more submissions per assignment when using the AAT. Sherman's article does not discuss final submission quality with or without the AAT, but does note the straightforward effect that with the AAT, students continue to modify and re-submit assignments, presumably in response to the system's feedback. This is an advantage in a course where they might otherwise submit several assignments before receiving instructor feedback [13]. The article does not go into detail regarding what Bottlenose assesses in providing feedback; for examples of how such programs work, we can look at papers including those from Manuel Rubio-Sánchez, et al. [12] and Tiantian Wang, et al. [14].
Rubio-Sánchez's group reviewed the widely available Mooshak system, using both qualitative and quantitative analysis to evaluate its effectiveness in student learning. Responding to earlier studies that had noted a negative correlation between the introduction of Mooshak and student dropout rates, Rubio-Sánchez's group observed that those studies had not held other variables constant; in particular, the studies had changed teaching methodology simultaneously with introducing Mooshak. In their study, then, Rubio-Sánchez's group included both test and control groups in the form of courses with near-identical syllabi and teaching methodologies, some of which used Mooshak and some of which did not. Qualitatively, students self-reported appreciating instant feedback but had complaints about Mooshak specifically, because while it is effective in assessing whether a program has succeeded or failed, it does not provide feedback to help students make changes along the way. The Mooshak tool was originally created for use in programming contests and still lacks features present in many tools designed for use in courses. Holding other factors constant, Rubio-Sánchez's group did not find a statistically significant change in the dropout rate of courses using Mooshak.
In contrast, Wang's group wrote about AutoLEP, an AAT they developed at the Harbin Institute of Technology in Heilongjiang, China, which combines static analysis with dynamic testing to provide students with location-specific feedback on the syntactic, structural, and logical features in their programs. AutoLEP seems more like Bottlenose in allowing for feedback and multiple submissions, but with the apparent advantage of more precision due to its simultaneous static and dynamic feedback.
The issue of whether AATs ought to allow for multiple submissions comes up again in Vrada Pieterse's paper on the use of AATs (in particular, the Fitchfork software developed by Pieterse's group) in massive open online courses (MOOCs) that teach programming. Pieterse argues that throttling (limiting the number of submissions per assignment) is inappropriate in a MOOC environment in part because the open enrollment format means students are working voluntarily and should have the opportunity to use AATs for repeated revisions as needed to learn the material. A corollary argument, however, is that unlimited submissions carry a risk in the traditional, credit-based classroom, where students may be more tempted to "game the system" in an attempt to get a desired grade rather than to master the material [11]. Such a conjecture corresponds to an observation by Ihantola's group, who write, "we believe that the very fact that the assessment is automatic is likely to change how some students approach the exercise. Knowingly submitting a weak or even incorrect solution that gets accepted by a machine is quite likely more socially acceptable than trying to cheat a person." Pieterse's paper, then, raises the challenge to AAT developers to consider an appropriate level of throttling for the system's target user environment.
Overall, these articles point to a series of features relevant in developing AATs, including secure running environments (sandboxes), static and dynamic testing, and resubmissions and throttling. In our Athene program, we chose to give students feedback primarily from dynamic testing, while running static analysis afterward to examine its usefulness. We also implemented a throttling rule for several semesters to see how that changed student behavior.

Population Characterization
We collected data from 290 students in a Programming I course over seven semesters. We taught the course using the C++ programming language. Our curriculum is a late-objects curriculum, and the subject matter is typical of a CS1 course, including input/output, basic data types, decision structures, repetition, functions, and arrays.
The course serves primarily as a first programming course required of Computer Science majors, but students also include majors in Engineering, Physics, Mathematics, Information Technology, and other related disciplines.

Description of Data Collected
Each semester, we give students 75 programming assignments for homework, most of which are completed outside of class. The data included in this paper comes from all 75 unique assignments.
The students receive their assignments through the Athene online automated system.
Figure 1 shows a representative assignment, typically assigned in the third week of a 15-week semester, shortly after introducing decision structures. In this case, students are assigned to write a console-based program that asks the user to enter 3 integers and reports the largest of the 3. The goal of this assignment is to give the students practice in using if-else statements. As with all Athene assignments, the student is given a problem description, along with at least 1 test case and the expected output for that test case.
For each assignment, a student writes a program and submits the source code to the Athene system, which grades the submission and provides feedback. The student may re-submit a program repeatedly until he or she has successfully written the code.
After the student submits a source code file, the Athene system immediately compiles, runs, and tests the program against established test cases to provide a response. Figures 2 through 5 show a series of representative submissions and their corresponding feedback, all from the same student session in attempting to solve the assignment shown in Figure 1.
Each time a student submits an assignment, the automated system records a set of metrics for the submission. We compute SLOC by counting all source lines, excluding blank lines and comment lines. We compute the McCabe complexity by adding the total number of branch possibilities (if, for, while, and case statements, adding in short-circuit analysis of boolean conditionals) to the total number of functions defined.
The 6 Halstead complexity measures are: Vocabulary, Length, Computed Length, Volume, Difficulty, and Effort. The Halstead numbers are computed by counting the number of unique and total operators and operands and using them in the appropriate Halstead formulas.
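For reference, the standard Halstead formulas can be sketched in C++ as follows. The struct and function names are hypothetical, and the sketch assumes the four counts (unique operators n1, unique operands n2, total operators N1, total operands N2) have already been extracted from the source; how Athene classifies tokens as operators or operands is not described here.

```cpp
#include <cassert>
#include <cmath>

// The six Halstead measures named in the text.
struct HalsteadMeasures {
    double vocabulary;      // n  = n1 + n2
    double length;          // N  = N1 + N2
    double computedLength;  // N^ = n1*log2(n1) + n2*log2(n2)
    double volume;          // V  = N * log2(n)
    double difficulty;      // D  = (n1 / 2) * (N2 / n2)
    double effort;          // E  = D * V
};

// n1, n2: unique operators and operands; N1, N2: total operators and operands.
// Precondition (assumption of this sketch): the program has at least one
// operator and one operand, so no logarithm or division is undefined.
HalsteadMeasures computeHalstead(int n1, int n2, int N1, int N2) {
    assert(n1 > 0 && n2 > 0 && N1 >= n1 && N2 >= n2);
    HalsteadMeasures m{};
    m.vocabulary = n1 + n2;
    m.length = N1 + N2;
    m.computedLength = n1 * std::log2(static_cast<double>(n1)) +
                       n2 * std::log2(static_cast<double>(n2));
    m.volume = m.length * std::log2(m.vocabulary);
    m.difficulty = (n1 / 2.0) * (static_cast<double>(N2) / n2);
    m.effort = m.difficulty * m.volume;
    return m;
}
```

For example, with 4 unique operators, 4 unique operands, and 8 total occurrences of each, the formulas give a vocabulary of 8, a length of 16, a volume of 16 × log2(8) = 48, a difficulty of 4, and an effort of 192.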
Figure 2 shows feedback given to a particular student after an early attempt at solving the assignment shown in Figure 1. This feedback is displayed almost instantaneously after the student submits a source code file. The top of the Athene page displays the student's ID, submission time, score achieved, course in which he or she was enrolled, and the assignment name. The middle section of the feedback page displays the first expected output line the student's submission failed to produce, followed by the actual output that the student's submission did produce.
The bottom section of the feedback page shows the contents of the student's submitted source code file.
If we examine the source code, we can see that the student wrote a program that would perform correctly for both examples given in the problem description shown in Figure 1. But this student did not consider different types of test cases, such as two of the three input numbers being the same value. So the submitted source file shown in Figure 2 passed many of the test cases and was awarded a score of 64 (out of 100), but failed for the first time when the test input was 5 9 9. The student received the feedback message, "expected output: The largest number is 9," and could then review to see that his or her actual output did not contain that statement.
Figure 3 shows the student's next attempt at the assignment. The student added some additional if statements to catch the test case he or she had just missed but still did not cover all possible test cases. (Also worth noting, the student should have corrected his or her existing if statements instead of adding more of them. The Athene system did not provide feedback on this point.)
Figure 4 shows the student's third attempt at the assignment. The student once again added some additional if statements to catch the missed test case, but still did not cover all possible test cases. Because of the additional test cases passed, the student achieved a score of 88.
Figure 5 shows the student's fourth attempt at the assignment. To catch the test case in which all input numbers are the same value, the student added an additional if statement checking specifically for that case. With that addition, all test cases were successfully passed and the student received a score of 100, although the code is overly complex.
Reviewing this assignment and the given student's 4 submissions, we can see that the student eventually turned in a solution that produced the correct output but was not written to be efficient. The metrics for the source code from each of these submissions are shown in Table 1. Looking at metrics, specifically McCabe, can tell an instructor a great deal about the student's solution. For this assignment, the expected McCabe value is 7. The function counts for 1 and each if statement counts for 2, given the possibility of short-circuit evaluation for each. When an instructor sees a McCabe value of 17 for this assignment, the instructor can recognize that the student has not created the expected solution.
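The expected value of 7 is consistent with a solution built from three if statements, each containing one short-circuit operator: 1 (function) + 3 × 2 = 7. The expected solution itself is not shown in this paper, so the following C++ sketch is only one hypothetical form it could take. The name largestOfThree is an assumption, and the logic is written as a helper function for brevity; in a console program of the kind assigned, it would sit directly in main, which is the single function counted above.

```cpp
// Hypothetical reference logic for the "largest of 3 integers" assignment.
// Each condition uses >= so that ties (for example 5 9 9 or 7 7 7) are
// handled without adding extra if statements.
int largestOfThree(int a, int b, int c) {
    int largest = a;
    if (a >= b && a >= c) {          // if: +1, &&: +1
        largest = a;
    } else if (b >= a && b >= c) {   // if: +1, &&: +1
        largest = b;
    } else if (c >= a && c >= b) {   // if: +1, &&: +1
        largest = c;
    }
    return largest;
}
```

Because the comparisons are inclusive, the tie cases that caused the failures described for Figures 2 through 5 fall into an existing branch, which is the kind of correction to existing if statements that the student did not make.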

RESULTS
Table 2 describes the overall data that was analyzed in this paper.
In analyzing the data, we gave extra attention to factors that changed when students showed positive progress. Table 3 represents the data from only those submissions that come out of multiple-attempt sessions, and we always ignored the first attempt (as there would not be a previous submission to compare it against). We call the submissions that showed positive progress "new maximums." Of the eligible submissions, 32.2% were new maximums.
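One way to make the notion of a new maximum concrete is the C++ sketch below. It rests on an assumption that is not stated explicitly in the text: a submission is treated as a new maximum when its score is strictly higher than every earlier score in the same session. The names NewMaxStats and countNewMaximums are hypothetical.

```cpp
#include <cstddef>
#include <vector>

struct NewMaxStats {
    int eligible;     // submissions after the first in a multiple-attempt session
    int newMaximums;  // eligible submissions that beat every earlier score
};

// scores: the scores of one student session, in submission order.
// Assumption: "new maximum" means strictly greater than the best prior score.
NewMaxStats countNewMaximums(const std::vector<int>& scores) {
    NewMaxStats stats{0, 0};
    if (scores.size() < 2) {
        return stats;  // single-attempt sessions contribute nothing
    }
    int best = scores[0];  // the first attempt is never eligible
    for (std::size_t i = 1; i < scores.size(); ++i) {
        ++stats.eligible;
        if (scores[i] > best) {
            ++stats.newMaximums;
            best = scores[i];
        }
    }
    return stats;
}
```

For a made-up session with scores 40, 40, 70, 60, 100, four submissions are eligible and two of them (70 and 100) are new maximums. Summing both counts over all sessions would give a ratio comparable in form to the 32.2% figure.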
Also of special interest to us is the effect that throttling had on what students changed from submission to submission. For semesters 1-4, we allowed students unlimited submissions in any time period without throttling; for semesters 5-7, we established a throttle that limited students to 3 submissions per 15-minute period. This action corresponded with distinctions in student behavior and data outcomes. This data is shown in Table 4.
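A rolling-window throttle of this kind can be sketched in C++ as follows. This is an illustration rather than the Athene implementation: the class name SubmissionThrottle and its interface are hypothetical, timestamps are assumed to be non-decreasing seconds, and the sketch assumes that rejected attempts do not count toward the limit, a detail the text does not specify.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Allows at most maxSubmissions accepted submissions in any rolling window
// of windowSeconds (for example 3 submissions per 900 seconds).
class SubmissionThrottle {
public:
    SubmissionThrottle(std::size_t maxSubmissions, long windowSeconds)
        : max_(maxSubmissions), window_(windowSeconds) {
        assert(maxSubmissions > 0 && windowSeconds > 0);
    }

    // Returns true and records the submission if it is allowed at time
    // nowSeconds; returns false if the student must wait.
    bool tryAccept(long nowSeconds) {
        // Forget accepted submissions that have left the rolling window.
        while (!accepted_.empty() &&
               nowSeconds - accepted_.front() >= window_) {
            accepted_.pop_front();
        }
        if (accepted_.size() >= max_) {
            return false;
        }
        accepted_.push_back(nowSeconds);
        return true;
    }

private:
    std::size_t max_;
    long window_;
    std::deque<long> accepted_;  // timestamps of accepted submissions
};
```

With a limit of 3 per 900 seconds, submissions at 0, 60, and 120 seconds are accepted, one at 180 seconds is rejected, and one at 900 seconds is accepted again because the submission at time 0 has left the window.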

DISCUSSION
We recognize two primary areas of new knowledge emerging from this study. First, we see that throttling of submissions does indeed have an impact on the quality of student submissions. Table 4 shows that the average score of multiple-attempt sessions that eventually score 100% increased from 11 to 28. Knowing that submissions were throttled appears to have made students more careful in making their first submission, presumably putting more thought into their work and doing independent testing instead of relying only on the grading system.
Second, dynamic testing is important, but instructors and students can also benefit from considering style and content. In January 2004, Ala-Mutka published an executive study on the use of Style++ to promote good style practices in students. The AAT was able to discern and respond to a number of unhealthy programming practices with an appropriate grade and feedback on how the student should improve the efficiency of their program. Ala-Mutka found that "students implement more reliable and understandable programs" after having only been required to submit assignments to Style++ for a year [2]. However, Ala-Mutka's study focused on independent student use of the Style++ tool in advance of submitting final assignments, allowing instructors to "concentrate on giving feedback on the more advanced features of program design and course specific issues."
We argue that the session shown in Figures 2 through 5 demonstrates that instructors can learn a great deal more about student submissions by employing some basic metric analysis of submitted code. By using these other types of analysis, we can identify gaps in understanding, even when a student finishes with a score of 100%. With a focus on style, the student may be more capable of thinking in terms of efficiency and efficacy for each line in their code, which can help prevent situations similar to that in Figures 2 through 5, wherein the student needlessly increased the complexity and length of the code rather than rewriting existing code to achieve the desired output.
Overall, we can infer that both an emphasis on technique and use of throttling submissions encourage a reflective perspective of one's work.

FUTURE WORK
In the future, we would like to integrate static analysis tools more seamlessly, giving students more feedback (such as reporting to them the actual complexity level of their submitted program and the expected level).
Another interesting project would be to automatically analyze individual assignments to identify the most common student difficulties, so instructors can address these issues more effectively in class.

Figure 1. Representative assignment as it appears in the Athene online automated system.