Micrograd in C
On This Page Micrograd in C 14 sections
A few days ago I saw this re-implementation of Andrej Karpathy’s micrograd in C posted on Hacker News. I immediately saw an opportunity to get more familiar with C by doing this project myself.
I started by watching Karpathy’s excellent spelled-out intro to neural networks and backpropagation video and following along with the code. My plan was to write this as a reusable library (even if it’s just a learning project nobody would use).
Representing Values
At the core of Karpathy’s micrograd is a Value class that holds a scalar data field, as well as a grad field that holds the gradient of the value. For backpropagation, it holds a list of previous values and a lambda function that computes the derivative of the current node’s output with respect to its inputs, meaning its child nodes, according to the operation that produced the node.
Backpropagation works by walking the graph in reverse topological order and calling the lambda on each node. While doing this, each node must not be visited more than once as that would lead to calling backward on the same node multiple times.
class Value:
def __init__(self, data, children, op):
self.data = data
self.grad = 0
self.children = children
self._backward = None
def backward(self):
"""
Performs backpropagation on the graph, computing the gradient of
the current node with respect to its inputs.
"""
self.grad = 1
# how exactly this sort works is not important here
for value in reversed(topological_sort(self)):
value._backward()Values support a range of mathematical operations, such as negation, addition, multiplication, tanh, and others. These operations produce a new Value with the result of the operation, a lambda for backpropagation and a list of children values.
class Value:
def __add__(self, other):
out = Value(self.data + other.data, (self, other), '+')
def _backward():
self.grad += out.grad
other.grad += out.grad
out._backward = _backward
return outDefining the Value Struct
How do we translate this into C? We can start simple by defining a struct with a data and grad field.
typedef struct mg_value {
float data;
float grad;
} mg_value;Keeping Track of Children Values
Next we will need to keep track of the children values, so we can backpropagate gradients. Instead of a variable-length array, I decided to have two pointers to the left and right children values. For operations that take only one child, the right pointer is NULL.
typedef struct mg_value {
float data;
float grad;
mg_value* left;
mg_value* right;
} mg_value;The Backward Lambda
Recreating the backward lambda may seem tricky at first since C does not have lambdas. We could use a function pointer instead, but a more straightforward approach is to define an enum to represent the operation and have one backward function that takes a node and switches on that enum field.
typedef enum {
MG_OP_NONE,
MG_OP_ADD,
MG_OP_MUL,
MG_OP_POW,
MG_OP_RELU,
MG_OP_TANH,
MG_OP_EXP,
} mg_op;
typedef struct mg_value {
float data;
float grad;
mg_value* left;
mg_value* right;
mg_op op;
} mg_value;For now we will leave it at this op field and will get back to backpropagation soon. There is some more plumbing to do first.
Ownership
The biggest decision we need to make is who owns the values and who is in charge of freeing them. There are many ways to approach this and they all have their trade-offs.
We could use a reference counting system, a garbage collector or some arena-style allocation, but for simplicity I opted for a less complex ownership model: an intrusive linked list. It works by embedding each node’s next pointer inside mg_value itself.
typedef struct mg_value {
float data;
float grad;
mg_value* left;
mg_value* right;
mg_op op;
mg_value* next;
} mg_value;We also need a struct that represents the graph itself. It owns the linked list of values by holding a pointer to the head of the list and keeps track of the length of the list.
struct mg_graph {
mg_value* values;
size_t len;
};Before we do any kind of manipulation on the graph’s nodes we need to be able to allocate and free graphs along with the values they own.
mg_graph* mg_graph_new(void) {
mg_graph* g = calloc(1, sizeof(*g));
return g;
}
void mg_graph_free(mg_graph* g) {
if (!g) {
return;
}
mg_value* v = g->values;
while (v) {
mg_value* next = v->next;
free(v);
v = next;
}
free(g);
}Creating a new value in a graph is simple. We allocate an mg_value and set its data field to the given value, then insert it at the head of the linked list.
mg_value* mg_scalar(mg_graph* g, float data) {
mg_value* v = calloc(1, sizeof(*v));
if (!v) {
return NULL;
}
v->data = data;
v->next = g->values;
g->values = v;
g->len++;
return v;
}Operations
With this in place we have everything we need to define the operations and a backward function. An operation will take one or two mg_value operands and produce a new mg_value result using our new mg_scalar function.
mg_value* mg_add(mg_graph* g, mg_value* a, mg_value* b) {
mg_value* out = mg_scalar(g, a->data + b->data);
if (!out) {
return NULL;
}
out->left = a;
out->right = b;
out->op = MG_OP_ADD;
return out;
}Backpropagation
Backpropagation needs to process each node after all nodes that depend on it. To do that, we first build a topological ordering of the reachable graph, where children come before the node that uses them. Then the backward pass walks that ordering in reverse, starting from the output.
Our linked list is reverse creation order, so we need to sort it first. We do this by creating an array of mg_value pointers and walking down the graph, first on the left subtree, then on the right subtree, and finally inserting the node itself. Since we must not visit the same node twice, we mark each node as visited when we first see it. For that we need to add a visited field to the mg_value struct.
typedef struct mg_value {
float data;
float grad;
mg_value* left;
mg_value* right;
mg_op op;
mg_value* next;
bool visited;
} mg_value;
static void topo_visit(mg_value* v, mg_value** topo, size_t* len) {
if (!v || v->visited) {
return;
}
v->visited = true;
topo_visit(v->left, topo, len);
topo_visit(v->right, topo, len);
topo[(*len)++] = v;
}The backward function now has three steps: clear old gradients and traversal state, build the topological ordering, then walk that ordering in reverse to calculate gradients.
bool mg_backward(mg_graph* g, mg_value* out) {
for (mg_value* v = g->values; v; v = v->next) {
v->grad = 0.0;
v->visited = false;
}
mg_value** topo = malloc(g->len * sizeof(*topo));
if (!topo) {
return false;
}
size_t len = 0;
topo_visit(out, topo, &len);
out->grad = 1.0;
for (size_t i = len; i > 0; i--) {
mg_value* v = topo[i - 1];
switch (v->op) {
case MG_OP_NONE:
break;
case MG_OP_ADD:
v->left->grad += v->grad;
v->right->grad += v->grad;
break;
case MG_OP_MUL:
v->left->grad += v->grad * v->right->data;
v->right->grad += v->grad * v->left->data;
break;
/*
* and so on
*/
}
}
free(topo);
return true;
}There is one noteworthy design decision I made here. mg_backward resets the gradient of all values to zero before starting the backward pass. Zeroing the gradients is needed since we don’t want to accumulate gradients from previous backward passes. This could also be done explicitly by the caller, though.
More Operations and Handling Failures
It’s now time to implement more mathematical operations. Not every operation needs to be a primitive graph operation with its own mg_op case.
For example, subtraction can be expressed as addition plus negation:
Negation can be expressed as multiplication by :
Division can be expressed as multiplication by the reciprocal of the divisor:
This is nice because we don’t need to add cases for each operation in mg_backward. The trade-off is that these operations may allocate several mg_value nodes in the graph.
Let’s have a look at how the division operation is implemented:
mg_value* mg_div(mg_graph* g, mg_value* a, mg_value* b) {
mg_value* minus_one = mg_scalar(g, -1.0f);
mg_value* inv = mg_pow(g, b, minus_one);
mg_value* out = mg_mul(g, a, inv);
return out;
}At this point the library supports mg_add, mg_sub, mg_mul, mg_div, mg_pow, mg_neg, mg_square, mg_relu, mg_tanh, and mg_exp.
Most of these operations can be defined via composition of existing operations, but that introduces a C-specific failure case. Each intermediate allocation may fail, and if that happens we need to roll the graph back to the state it had before the operation started.
I’ve decided to do this by defining an mg_graph_checkpoint struct that saves the graph’s current linked-list head and length.
typedef struct {
mg_value* values;
size_t len;
} mg_graph_checkpoint;Checkpoints can be created at any time by calling mg_graph_save and restored via mg_graph_restore. mg_graph_restore keeps popping values off of the graph until it reaches the value pointed to by the checkpoint.
mg_graph_checkpoint mg_graph_save(const mg_graph* g) {
return (mg_graph_checkpoint){
.values = g->values,
.len = g->len,
};
}
void mg_graph_restore(mg_graph* g, mg_graph_checkpoint checkpoint) {
mg_value* v = g->values;
while (v != checkpoint.values) {
mg_value* next = v->next;
free(v);
v = next;
}
g->values = checkpoint.values;
g->len = checkpoint.len;
}Using this in the operations is a little bit verbose as we must check each created value for failure and roll back the graph if any allocation fails. This is how we use checkpoints in the mg_div operation:
mg_value* mg_div(mg_graph* g, mg_value* a, mg_value* b) {
mg_graph_checkpoint checkpoint = mg_graph_save(g);
mg_value* minus_one = mg_scalar(g, -1.0f);
if (!minus_one) {
goto fail;
}
mg_value* inv = mg_pow(g, b, minus_one);
if (!inv) {
goto fail;
}
mg_value* out = mg_mul(g, a, inv);
if (!out) {
goto fail;
}
return out;
fail:
mg_graph_restore(g, checkpoint);
return NULL;
}Building a Neuron
To train neural nets with micrograd, scalar values are not enough on their own. We need model objects that own trainable parameters. The first one is a neuron.
A neuron multiplies each input by its weight, adds those results together and adds a bias.
Often the neuron will pass the result through an activation function, which lets the neuron make a nonlinear transformation.
A mg_neuron Struct
A neuron can be modeled as a struct that holds weights, a bias, and a flag indicating whether the neuron should use an activation function. It also needs to keep track of the number of input values.
typedef struct {
size_t n_in;
bool non_linear;
mg_value** w;
mg_value* b;
} mg_neuron;To initialize an mg_neuron, we must take care of a few things. We allocate an array of mg_value pointers for our weights and allocate a bias mg_value.
It’s also worth noting that we set the initial weights to random values. If we didn’t do this, the neurons in the same layer would be symmetric and receive the same gradients and remain redundant. We will also scale initial random weights by the inverse square root of the number of inputs to ensure that the sum does not grow without bound as the number of inputs increases.
bool mg_neuron_init(mg_graph* g, mg_neuron* n, size_t n_in, bool non_linear) {
*n = (mg_neuron){0};
n->w = calloc(n_in, sizeof(*n->w));
n->n_in = n_in;
n->non_linear = non_linear;
float limit = 1.0f / sqrtf((float)n_in);
for (size_t i = 0; i < n_in; i++) {
n->w[i] = mg_scalar(g, rand_uniform(-limit, limit));
}
n->b = mg_scalar(g, 0.0);
return true;
}For brevity, I left out the error handling in mg_neuron_init and will omit the mg_neuron_free function here. All it does is reset its fields and free the weights array.
Calling the Neuron
To produce a neuron’s output, we iterate over the inputs, multiply each input by its corresponding weight, add the products to a running sum, then optionally apply the activation function. I will leave out the error handling here, but every allocation may fail and every weight, bias, or input value may be NULL.
mg_value* mg_neuron_call(mg_graph* g, mg_neuron* n, mg_value** x) {
mg_value* out = n->b;
for (size_t i = 0; i < n->n_in; i++) {
mg_value* wx = mg_mul(g, n->w[i], x[i]);
out = mg_add(g, out, wx);
}
if (n->non_linear) {
out = mg_tanh(g, out);
}
return out;
}We will also define functions for retrieving the neuron’s parameters (weights and bias). This will be needed later during training, when we need to update the neuron’s parameters.
size_t mg_neuron_param_count(const mg_neuron* n) {
return n->n_in + 1;
}
void mg_neuron_params(const mg_neuron* n, mg_value** out) {
for (size_t i = 0; i < n->n_in; i++) {
out[i] = n->w[i];
}
out[n->n_in] = n->b;
}Layers and MLPs
A useful abstraction for a multi-layer perceptron is a layer struct that contains a collection of neurons, plus an MLP struct that contains a sequence of layers.
I will keep this brief since these two structs are very similar to the neuron.
An mg_layer holds n_out neurons, all taking n_in inputs.
typedef struct {
size_t n_in;
size_t n_out;
mg_neuron* neurons;
} mg_layer;Calling a layer means calling each neuron with n_in inputs and collecting outputs.
bool mg_layer_call(mg_graph* g, mg_layer* l, mg_value** x, mg_value** out) {
for (size_t i = 0; i < l->n_out; i++) {
out[i] = mg_neuron_call(g, &l->neurons[i], x);
if (!out[i]) {
return false;
}
}
return true;
}An MLP wires layers sequentially, with tanh on hidden layers and a linear output layer.
typedef struct {
size_t n_layers;
mg_layer* layers;
} mg_mlp;Calling an MLP means calling the first layer with the input, then each subsequent layer with the output of the previous layer.
bool mg_mlp_call(mg_graph* g, mg_mlp* m, mg_value** x, mg_value** out) {
mg_value** current = x;
mg_value** tmp = NULL;
for (size_t i = 0; i < m->n_layers; i++) {
mg_layer* layer = &m->layers[i];
mg_value** next = calloc(layer->n_out, sizeof(*next));
mg_layer_call(g, layer, current, next);
free(tmp);
tmp = next;
current = tmp;
}
mg_layer* last = &m->layers[m->n_layers - 1];
for (size_t i = 0; i < last->n_out; i++) {
out[i] = current[i];
}
free(tmp);
return true;
}I’m again omitting the repetitive allocation failure handling code here.
Training Loop
With values, neurons, layers, and MLPs in place, we finally have everything needed to train a small neural network. I used the same small dataset from Karpathy’s micrograd video: four input vectors with one target value each.
We need to take graph ownership and how it interacts with the training into account. The parameters should live for the whole training run, but every forward pass creates temporary values for inputs, predictions, differences and squared errors, and the final loss. Those temporary values need to be destroyed after each iteration.
Luckily we have graph checkpoints and can save a checkpoint after we collect stable pointers to all trainable parameters.
size_t n_params = mg_mlp_param_count(&n);
mg_value** params = malloc(n_params * sizeof(*params));
mg_mlp_params(&n, params);
mg_graph_checkpoint params_checkpoint = mg_graph_save(g);Every value allocated after this point belongs to a training loop step and can be discarded.
mg_value* loss = mg_scalar(g, 0.0f);
for (size_t i = 0; i < 4; i++) {
mg_value* x[] = {
mg_scalar(g, xs[i][0]),
mg_scalar(g, xs[i][1]),
mg_scalar(g, xs[i][2]),
};
mg_value* out[1];
mg_mlp_call(g, &n, x, out);
mg_value* ygt = mg_scalar(g, ys[i]);
mg_value* diff = mg_sub(g, out[0], ygt);
mg_value* sq = mg_square(g, diff);
loss = mg_add(g, loss, sq);
}The loss is the sum of squared errors across all samples. Calling mg_backward on that loss fills in the gradients for every value that contributed to it, including the model parameters.
mg_backward(g, loss);
float learning_rate = 0.01f;
for (size_t i = 0; i < n_params; i++) {
float updated = mg_data(params[i]) - learning_rate * mg_grad(params[i]);
mg_set_data(params[i], updated);
}This is plain gradient descent where each parameter is moved a small step opposite its gradient.
At the end of the step, we restore the graph to the checkpoint we saved after initialization.
mg_graph_restore(g, params_checkpoint);This frees all temporary values created during the forward and backward pass while keeping the parameters alive.
This checkpoint pattern ended up being one of the most useful parts of my C version. In the original Python version, temporary graph nodes are eventually cleaned up by the garbage collector, but in C we have to manage memory manually and be explicit about lifetimes.
Conclusion
This was a fantastic learning project. The math translated fairly directly. The hard part was designing an API that made ownership explicit but still kept the training loop understandable.
I also took this as an opportunity to learn how to use tests and documentation tooling in C. I decided on doxygen with Doxygen Awesome CSS for documentation and I think it looks quite nice. See for yourself: lukad.github.io/micrograd
You can check out the full source code on GitHub at github.com/lukad/micrograd.