Writing Autograder Tests¶
This guide covers best practices for writing Python autograder tests. While it focuses mainly on OK-formatted tests (those used in Berkeley's OkPy and Otter autograders), some of the ideas discussed are applicable to all Pythonic autograders and the tests written for them. The topics discussed herein include the doctest format, OK-formatted test files, structuring tests, and working with randomness.
Doctests¶
Most Python autograders run doctests against the global environment created by executing a student's
submission. These doctests are text formatted as if from a Python interpreter that tells the executor
what code to run and what output to expect. An important nuance of doctests is that they are
based on string comparison, so the outputs 1 and 1.0 are not equal.
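For example, the two expressions below produce different interpreter output even though the values are numerically equal, so a doctest expecting 1 would fail if the student's code produced 1.0:
>>> 2 - 1
1
>>> 2 / 2
1.0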
Doctest Format¶
Python doctests are formatted as input and output from a Python interpreter, e.g.
>>> import numpy as np
>>> np.random.seed(42)
>>> np.random.choice([1, 2, 3])
3
Note that Python structures that require multiple lines use ... prompts, not >>>.
>>> def foo(x):
...     return x
>>> if foo(1) % 2 == 0:
...     print("even")
... else:
...     print("odd")
odd
Lines that have an ellipsis prompt include elif, else, except, and finally clauses,
function bodies, continuations of escaped lines (lines after those ending with \), and other
indented lines (those that begin with whitespace). Note that in a normal interpreter there would be
an additional empty line with an ellipsis prompt after an indented block, but this can be omitted
when writing doctests.
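For example, escaped line continuations and except/finally clauses are written with ellipsis prompts like so:
>>> total = 1 + \
...     2
>>> try:
...     total / 0
... except ZeroDivisionError:
...     print("cannot divide by zero")
... finally:
...     print("done")
cannot divide by zero
done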
Sometimes, writing autograder tests requires special characters whose meanings change when Python
compiles the string, e.g. \. If your autograder test involves these characters, compiling the string
will change the doctest's meaning. For this reason, when writing doctests, it is always best practice
to put autograder tests in raw strings, denoted by the letter r before the opening quote:
# this is an example test in the OK format
test = {
    "suites": [
        {
            "cases": [
                {
                    "code": r"""
                    >>> if 4 % 2 == 0:
                    ...     print("even")
                    even
                    """  # a raw string
                }
            ]
        }
    ]
}
Note also that the doctest library ignores leading whitespace before prompts and outputs (similar to
textwrap.dedent), so even though the spaces before each line in the string above are captured in the
raw string, they are ignored when the doctest is executed.
Running Doctests¶
Autograders abstract away the execution of doctests; once you understand the doctest format, you can write your own autograder tests and let the autograder take care of executing them and parsing the output. Most Python autograders use Python's doctest library to run these tests and return results based on whether or not they pass.
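As a rough sketch of how this works under the hood (this is not the actual implementation of any particular autograder; the run_doctest helper and the exec-based stand-in for a submission are illustrative assumptions), a test's code string can be checked against a submission's global environment with the doctest library:
import doctest

def run_doctest(code, student_globals):
    # parse the interpreter-formatted string into a DocTest bound to the
    # student's global environment
    parser = doctest.DocTestParser()
    test = parser.get_doctest(code, student_globals, "q1", None, None)

    # run the examples and report whether every one passed
    runner = doctest.DocTestRunner(verbose=False)
    runner.run(test)
    return runner.summarize(verbose=False).failed == 0

student_globals = {}
exec("x = 4", student_globals)  # stand-in for executing a student's submission
print(run_doctest(">>> x % 2 == 0\nTrue\n", student_globals))  # True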
Comparisons¶
A fundamental aspect of doctests is that their pass/fail behavior is determined by string comparison. When a doctest is run, the code in the interpreter-formatted string is executed example by example, and the output of each example is expected to match the output shown in the doctest exactly.
Return Value Types¶
None¶
Statements that evaluate to the Python value None have no output. If you enter None into a Python
interpreter, no output is shown:
>>> None
>>> # the next interpreter prompt
Statements that produce no output in this way include assignment statements, import statements, and
calls to any function with no return statement or with a return statement that returns None itself.
(Calls to functions like print and exec also return None; anything print writes is still shown as
output, as in the last example below.)
>>> def foo(x):
...     return x
>>> def bar(y):
...     y = "some string"
>>> def baz(z):
...     return None
>>> foo(1)
1
>>> bar(2)
>>> baz(3)
>>> foo(None)
>>> print("some other string")
some other string
bool¶
Boolean values in Python are by far the easiest to check with doctests because they have a static
string representation and only two possible values. The string representation of bools is the
same as their literal names: True and False.
>>> True
True
>>> False
False
>>> def is_even(x):
...     return x % 2 == 0
>>> is_even(1)
False
>>> is_even(4)
True
int, float, and other numeric types¶
Numeric types are the most difficult to test for two reasons: there are several different types of
numeric values (e.g. int, float, and all of the NumPy numeric types), and rounding errors can occur
depending on how and when students choose to round off calculations. For these reasons, unless you're
working with integers, it is usually easiest to compare numeric values with functions that return
boolean values.
When working with integers, almost all data types that represent them have the same string representation. For this reason, it is relatively easy to write doctests that compare integer values with each other:
>>> from math import factorial
>>> factorial(4)
24
>>> def square(x):
...     return x**2
>>> square(25)
625
If, however, it is possible for the numbers to be floating point values, other methods of comparison
are better suited to writing doctests. One of the best examples is NumPy's isclose function, which
compares two values within adjustable relative and absolute tolerances (the absolute tolerance
defaults to 1e-08). Because NumPy supports both single and double precision floating point values,
rounding errors can occur even when performing the same operation on values represented in different
precision data types. This is why using functions that perform comparisons and return boolean values
is much more robust to all of the ways that students can format their answers.
>>> def divide(a, b):
...     return a / b
>>> divide(divide(5, 3), 3) # solution (a)
0.5555555555555556
>>> divide(5, 3) # solution (b)
1.6666666666666667
>>> divide(1.66666667, 3) # solution (b) cont.
0.5555555566666667
Note that while solutions (a) and (b) above are both substantially correct, the rounding in solution
(b) causes the outputs to differ, so if a test written using solution (a) checks a student's response
that uses solution (b), the student would fail the test. Using a function like np.isclose, this is
avoided:
>>> np.isclose(
... divide(divide(5, 3), 3), # solution (a)
... divide(1.66666667, 3) # solution (b)
... )
True
Because booleans have easy-to-compare string representations, this test is much more robust to all of
the possible solutions to this question, and it demonstrates the best practice for comparing numeric
values. (Note that NumPy also provides np.allclose for element-wise comparison of values in
iterables.)
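For example, a doctest comparing two sequences of floating point values element-wise (with illustrative values) might look like:
>>> import numpy as np
>>> np.allclose([5 / 9, 5 / 3], [0.55555556, 1.66666667])
True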
str¶
String comparisons are relatively straightforward because doctests are themselves based on string
comparison. The main concerns are to be careful of leading and trailing whitespace and to note that
Python's default string delimiters are apostrophes unless the ' character appears in the string (in
which case double quotes are used). If both apostrophes and double quotes appear, then apostrophes
are used as delimiters and the apostrophes in the string are escaped:
>>> 'some string'
'some string'
>>> "some'string"
"some'string"
>>> "some string"
'some string'
>>> """some string"""
'some string'
>>> '''some string\n'''
'some string\n'
>>> '''some string '"\n'''
'some string \'"\n'
other data types¶
Other data types don't involve many additional complexities. For custom objects, note what their
__repr__ method returns and use that. When creating and testing custom classes, always define a
custom __repr__ method; otherwise, the representation will contain the memory address of the
object, which changes between sessions.
>>> class Point:
...     def __init__(self, x, y):
...         self.x = x
...         self.y = y
>>> Point(1, 2)  # this has no __repr__, so it will have the object id
<__main__.Point object at 0x102cb3ac8>
>>> class OtherPoint:
...     def __init__(self, x, y):
...         self.x = x
...         self.y = y
...     def __repr__(self):
...         return f"OtherPoint(x={self.x}, y={self.y})"
>>> OtherPoint(1, 2)  # this has a __repr__, so it will be printed without the id
OtherPoint(x=1, y=2)
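If you can't require a custom __repr__, a more robust alternative is to compare the object's attributes directly so that the doctest only needs to match a boolean; a minimal sketch using the Point class above:
>>> p = Point(1, 2)
>>> (p.x, p.y) == (1, 2)
True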
Always test your tests in a Python interpreter if you're unsure about the string representation of an object. Don't use a Jupyter Notebook or IPython, because they don't necessarily produce the same output and they have different prompts.
Seeding¶
An important aspect of writing code in a data science environment is the ability to simulate random processes. This presents a unique challenge for autograding assignments because the random state of each student's environment will be completely different from that of the grading environment. Some autograding solutions account for this issue by incorporating calls to seed random libraries during execution; Otter-Grader does this very easily using grading configurations, and this is by far the preferred solution if you're using Otter.
When working with other autograding solutions, seeds must be incorporated into the test files themselves, and assignments must be written in such a way that seeding before execution is possible.
Solution 1: Functions¶
The first and easiest solution to this problem is to have students wrap their code involving randomness in a function. In this way, the tests can make a seeding call before calling the function directly, ensuring that the seed is set before the code is executed. Consider the following example question and test:
import numpy as np

# Question 1: Define roll_die, which rolls a 6-sided die using NumPy.
def roll_die():
    return np.random.choice(np.arange(1, 7))
>>> np.random.seed(42)
>>> roll_die()
4
This approach, while easy to implement and effective, is at odds with the general autograding paradigm of testing global variables directly rather than encapsulating logic in needless functions. For this reason, there is another solution that is perhaps more elegant.
Solution 2: Seeding the Students’ Code¶
In this solution, the random libraries are seeded directly before the student writes their code in a manner visible to the students.
import numpy as np
np.random.seed(42)
# Question 1: Assign roll_value to the roll of a six-sided die.
roll_value = np.random.choice(np.arange(1, 7))
>>> roll_value
4
This method is significantly less elegant and is susceptible to students changing the seed, which could throw off the results of the test and fail students incorrectly. It is also important to note that if you're working in a notebook format, you need to include seeds in every cell that involves randomness, because repeated runs of cells will change the random state for subsequent questions, which could result in students failing tests.
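For example, in a notebook, each cell that involves randomness would re-seed before the student's code (the variable names here are hypothetical):
# Cell 1
import numpy as np
np.random.seed(42)
q1_roll = np.random.choice(np.arange(1, 7))

# Cell 2: re-seed here as well; otherwise re-running Cell 1 or running cells
# out of order would leave this cell with an unpredictable random state
np.random.seed(42)
q2_rolls = np.random.choice(np.arange(1, 7), size=10)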
OK Test Format¶
Most autograders developed at UC Berkeley, including Otter-Grader, rely on the OK test format, originally developed for OkPy. This test format relies on doctests to check that students’ code works correctly.
The OK format is very simple: a test is defined by a single Python file that contains a global variable
called test, which is assigned a dictionary of test configurations. The structure of this
dictionary is:
test["name"]
is the name of the test case, and should be a valid Python variable nametest["points"]
is the point value of the test case; points are assigned all-or-nothing and all test cases (more below) must pass to be awarded the pointstest["suites"]
is a list of dictionaries that correspond to test suites, groups of test cases; asuite
consists of:suite["cases"]
is a list of dictionaries that correspond to test cases (individual doctests); acase
consists of:case["code"]
is a Python interpreter-formatted string of code to be executedcase["hidden"]
is a boolean that indicates whether the test case is hidden from studentscase["locked"]
is a boolean that indicates whether the test case is locked; this configuration is generally only relevant to OkPy
suite["score"]
is a boolean that indicates whether the suite is part of the scoresuite["setup"]
is a code string that is run before the individual test cases are runsuite["teardown"]
is a code string that is run after the individual test cases are runsuite["type"]
is a string indicating the type of test; this is almost always going to be"doctest"
unless you’re using OkPy
Note that graders like Otter-Grader only allow test files with a single test suite.
test = {
    "name": "q1",
    "points": 1,
    "suites": [
        {
            "cases": [
                {
                    "code": r"""
                    >>> foo()
                    'bar'
                    """,
                    "hidden": False,
                    "locked": False
                }
            ],
            "scored": True,
            "setup": "",
            "teardown": "",
            "type": "doctest"
        }
    ]
}
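As a rough sketch of how such a file might be consumed (this is not Otter-Grader's actual implementation, and the file path is hypothetical), a grader can execute the test file and read the test global out of the resulting namespace:
# execute the test file and pull out the `test` dictionary it defines
test_globals = {}
with open("tests/q1.py") as f:
    exec(f.read(), test_globals)

test = test_globals["test"]
print(test["name"], test["points"])
for suite in test["suites"]:
    for case in suite["cases"]:
        print(case["code"])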
Master Notebook Tools¶
Most UC Berkeley autograders include tools to abstract away the generation of test files by allowing users to define the tests along with questions and solutions in a simple notebook format. Tools like this include jAssign for OkPy and Otter Assign for Otter-Grader.
Most of these tools rely on the outputs of comment-delimited code cells to generate the doctest-formatted test cases used in test files, and they parse these notebooks into two versions: a directory containing the notebook with solutions and hidden tests, and a directory containing the notebook stripped of solutions with only public tests.
In general, it is highly recommended that you use one of these tools, as they ensure the gradeability of assignments and make the process of creating and distributing assignments much easier. Otter Assign also integrates nicely with Otter's other tools and makes the process of using Otter with third-party services much easier.