November 17, 2020

Our work with differentiable programming, which enables programs to optimize themselves, is part of Facebook AI’s broader efforts to build more advanced tools for machine learning (ML) programming. That’s why we're extending the Kotlin compiler to make differentiability a first-class feature of the Kotlin language, as well as developing a system for tensor typing. Our work enables developers to explore Software 2.0, where software essentially writes itself, via:

Seamless differentiation through primitives, data structures, and control flow

Tensor typing for static, compile-time shape inference and checking

Compile-time errors for differentiable functions and tensor shapes

A performant library providing a Tensor class and machine learning APIs

By enabling intuitive and performant differentiable programming in Kotlin, we’re empowering developers to create powerful, flexible programs that take advantage of problem structure while seamlessly maintaining type safety and keeping debugging simple.

Today, most code is either learnable (written using restrictive machine learning libraries) or explicitly programmed (using traditional coding paradigms). A major obstacle to achieving Software 2.0 is that these two approaches are not truly compatible with each other.

Take the Cartpole reinforcement learning model as an example. Cartpole’s goal is to learn to balance a pole on a cart. But the model starts without any base knowledge and has to learn the laws of physics through trial and error, even though a number of impressive physics simulators have already been written in traditional code. Wouldn’t the model be more efficient if it could leverage these existing simulators?

Differentiable programming addresses this issue. In differentiable programming, arbitrary user (or library) code can be incorporated into more comprehensive models. Differentiable programming also allows developers to leverage gradients to automatically optimize parameterized programs that aren’t written using ML libraries.

We're building an automatic differentiation system for the Kotlin language.

Two Cartpole models after six iterations of learning to balance a pole. The model on the left does not have the physics of the environment incorporated into it, while the model on the right does.

This automatic differentiation (AD) happens at compile time, preserving program structure (such as control flow and function calls) and enabling compiler optimizations that would be infeasible with AD at runtime. We provide differentiability on Kotlin floats and doubles as well as a framework for defining custom differentiable data types. Our team has leveraged this framework to also provide a differentiable Tensor class. This allows users to differentiate through traditional ML models expressed in Kotlin as well as through arbitrary Kotlin code.
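To make the idea of differentiating through ordinary control flow concrete, here is a minimal sketch in stock Kotlin. It uses runtime forward-mode AD with dual numbers, which is a different technique from the compile-time transformation described above; it only illustrates what a derivative that follows branches and loops looks like. The names Dual, derivative, and piecewise are hypothetical and are not part of the system described in this post.

```kotlin
// Forward-mode AD with dual numbers: a runtime sketch in stock Kotlin.
// `Dual`, `derivative`, and `piecewise` are hypothetical names for illustration.
data class Dual(val value: Double, val deriv: Double) {
    operator fun plus(other: Dual) = Dual(value + other.value, deriv + other.deriv)
    operator fun minus(other: Dual) = Dual(value - other.value, deriv - other.deriv)

    // Product rule: (uv)' = u'v + uv'
    operator fun times(other: Dual) =
        Dual(value * other.value, deriv * other.value + value * other.deriv)

    // Quotient rule: (u/v)' = (u'v - uv') / v^2
    operator fun div(other: Dual) = Dual(
        value / other.value,
        (deriv * other.value - value * other.deriv) / (other.value * other.value)
    )
}

// Evaluates f at x with a unit tangent and reads the derivative off the result.
fun derivative(f: (Dual) -> Dual, x: Double): Double = f(Dual(x, 1.0)).deriv

// Ordinary Kotlin with a branch and a loop; the derivative follows the path taken.
fun piecewise(x: Dual): Dual {
    var result = x
    if (x.value > 0.0) {
        repeat(2) { result = result * x }   // x^3 when x > 0
    } else {
        result = result * Dual(-1.0, 0.0)   // -x otherwise
    }
    return result
}
```

Calling derivative(::piecewise, 2.0) takes the x^3 branch and yields 3 * 2^2 = 12.0, while derivative(::piecewise, -3.0) takes the other branch and yields -1.0. A compile-time system can produce the same derivatives while also seeing the whole program structure, which is what makes the optimizations mentioned above possible.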

Here’s a simple example of the ergonomics provided by our language extensions.

    differentiable fun centripetalAccel(velocity: Float, radius: Float) =
        (velocity * velocity) / radius

    val gradients = grads(::centripetalAccel, 8f, 2f)

The differentiable modifier, which is syntactically similar to the suspend modifier, is placed before a function to declare that it is differentiable. A differentiable function can only call other differentiable functions.

When a differentiable function calls a non-differentiable function, the compiler reports a compile-time error that shows up in the IDE during development. This is in contrast to many of today’s dynamic frameworks, in which you can incorrectly call a non-differentiable function and not get an error until far into the program’s execution, or never get an error at all, just an incorrect result that is difficult to track down.

The function grads is called with a function reference and a point at which to take the function’s derivative. Once this code is executed, gradients.velocity will be the derivative of centripetalAccel with respect to velocity, or 8f. Similarly, gradients.radius will be the derivative with respect to radius, or -16f.
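The two values quoted above can be checked by hand or numerically. The sketch below, in stock Kotlin with Doubles, writes out the analytic partial derivatives and compares them with a central-difference approximation. The helpers centralDifference and analyticGrads are hypothetical and only serve as a sanity check; they are not how grads works.

```kotlin
// Same formula as the example above, written in stock Kotlin with Doubles.
fun centripetalAccel(velocity: Double, radius: Double): Double =
    (velocity * velocity) / radius

// Central-difference approximation of df/dx at x (hypothetical helper).
fun centralDifference(f: (Double) -> Double, x: Double, h: Double = 1e-5): Double =
    (f(x + h) - f(x - h)) / (2 * h)

// Analytic partials: d/dv (v^2 / r) = 2v / r and d/dr (v^2 / r) = -v^2 / r^2.
// Returned as (with respect to velocity, with respect to radius).
fun analyticGrads(velocity: Double, radius: Double): Pair<Double, Double> =
    Pair(2 * velocity / radius, -(velocity * velocity) / (radius * radius))
```

At velocity 8 and radius 2, the analytic partials are 2 * 8 / 2 = 8 and -(8 * 8) / (2 * 2) = -16, matching the values above. The numerical estimates from centralDifference agree with them up to a small floating-point error.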

Our library includes derivatives for primitive built-in operations, like add and multiply, so the compiler can reason about and then compute derivatives for compositions of these functions, like centripetalAccel.

Our system also allows developers to add functions with custom derivatives. Custom derivatives allow developers to experiment with custom data types and write functions with components that do not have built-in derivatives.

For instance:

    differentiable fun sigmoid(x: Float): Float = 1f / (1f + math.exp(-x))

    @PullbackOf(::sigmoid)
    fun pullback_sigmoid(x: Float): (Float) -> sigmoid.TangentType = { upstream: Float ->
        sigmoid.TangentType(upstream * sigmoid(x) * (1 - sigmoid(x)))
    }

This pullback function is used to compute the derivative of the function sigmoid. Here, the derivative of sigmoid with respect to x is sigmoid(x) * (1 - sigmoid(x)). The pullback scales this derivative by upstream, the gradient passed back from whatever consumed sigmoid's output, which is the chain rule. The class sigmoid.TangentType is generated to allow the developer to unpack the gradient by parameter name.
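The same pattern can be sketched in stock Kotlin without the language extension: a function returns its forward value together with a closure that maps an upstream gradient to the gradient with respect to the input. This is only an illustration of the pullback idea; sigmoidWithPullback is a hypothetical name, and the generated TangentType is replaced here by a plain Double.

```kotlin
import kotlin.math.exp

// Logistic sigmoid: 1 / (1 + e^(-x)).
fun sigmoid(x: Double): Double = 1.0 / (1.0 + exp(-x))

// Returns the forward value together with a pullback closure.
// The pullback maps an upstream gradient to the gradient with respect to x.
// `sigmoidWithPullback` is a hypothetical name for illustration only.
fun sigmoidWithPullback(x: Double): Pair<Double, (Double) -> Double> {
    val s = sigmoid(x) // computed once in the forward pass, reused by the pullback
    val pullback: (Double) -> Double = { upstream -> upstream * s * (1.0 - s) }
    return Pair(s, pullback)
}
```

For example, sigmoidWithPullback(0.0) returns the value 0.5, and applying its pullback to an upstream gradient of 2.0 gives 2.0 * 0.5 * 0.5 = 0.5. Capturing s in the closure avoids recomputing the forward pass, which is one common reason to define a pullback by hand.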

Along with extensibility through custom pullbacks, we also allow developers to create object-oriented programs. We support user-defined differentiable classes with differentiable and non-differentiable methods and values, as well as new user-defined differentiable data types. These systems ensure that developers can differentiate their natural object-oriented code without being restricted to our libraries.

    differentiable class Model(val includeBias: Boolean) {
        val weight = Tensor.random(Shape(2, 2))
        val bias = if (includeBias) Tensor.random(Shape(2)) else Tensor.zeros(Shape(2))

        differentiable fun forward(input: Tensor) = weight.matmul(input) + bias

        differentiable fun loss(data: Tensor, labels: Tensor): Tensor {
            return crossEntropyLoss(forward(data), labels)
        }

        fun hasBias(): Boolean = includeBias
    }

    val myModel = Model(true)
    val gradients = grads(::Model.loss, myModel).receiver
    gradients.weight // gradient of the loss with respect to myModel.weight
    gradients.bias   // gradient of the loss with respect to myModel.bias
    // gradients.includeBias is not defined because Booleans are not differentiable

The compiler recognizes which elements of the class are differentiable and which ones aren’t. A developer can designate any method of a differentiable class to also be differentiable. Even methods that aren’t differentiable, such as hasBias, which returns a Boolean, are allowed in differentiable classes. However, if a developer marked hasBias as differentiable by adding the modifier, the result would be a compile-time error.

Many of the operators in deep learning, like convolutions, involve complex manipulations of multi-dimensional arrays called tensors. Without static shape information, it is easy to confuse tensors of different shapes, leading to runtime errors that are difficult to debug.

With tensor typing, developers gain compile-time shape inference and checking.

Tensor typing also allows for better code documentation and clarity. Developers can use type annotations as documentation to record what types of tensor inputs are acceptable and expected. Type aliases and generics can be used to further improve code comprehensibility, sharing, and reuse.

Here is a simple example using type aliases to provide more clarity in documentation:

    typealias BatchSize = 100
    typealias Height = 40
    typealias Width = 50

    fun getFirst(
        input: Tensor<[BatchSize, Height, Width]>
    ): Tensor<[Height, Width]> { ... }
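Stock Kotlin has no integer-valued type aliases or shape lists in type arguments, but a rough approximation of compile-time shape checking is possible with phantom types. The sketch below is only an analogy for the extension described here, and every name in it (Dim, Batch, Height, Width, Matrix) is hypothetical. Each dimension is a distinct type, and matmul only accepts an argument whose row dimension has the same type as the receiver's column dimension.

```kotlin
// Phantom-type sketch of shape checking in stock Kotlin.
// Every name here (Dim, Batch, Height, Width, Matrix) is hypothetical.
interface Dim { val size: Int }
object Batch : Dim { override val size = 100 }
object Height : Dim { override val size = 40 }
object Width : Dim { override val size = 50 }

// A rank-2 tensor whose row and column dimensions are tracked in its type.
class Matrix<R : Dim, C : Dim>(val rows: R, val cols: C) {
    val data = DoubleArray(rows.size * cols.size) // row-major storage, zero-initialized

    // The inner dimension C must match by type, so a mismatch fails to compile.
    fun <K : Dim> matmul(other: Matrix<C, K>): Matrix<R, K> {
        val out = Matrix(rows, other.cols)
        for (i in 0 until rows.size) {
            for (j in 0 until other.cols.size) {
                var sum = 0.0
                for (k in 0 until cols.size) {
                    sum += data[i * cols.size + k] * other.data[k * other.cols.size + j]
                }
                out.data[i * other.cols.size + j] = sum
            }
        }
        return out
    }
}
```

With a = Matrix(Batch, Height) and b = Matrix(Height, Width), the call a.matmul(b) type-checks and produces a Matrix<Batch, Width>, while b.matmul(a) is rejected by the compiler. One limitation of this approximation is that two dimensions of equal size but different types are treated as incompatible, which the tensor typing described here avoids by working with the sizes themselves.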

Additionally, we have prioritized the developer experience by integrating our extensions to the Kotlin language with the IntelliJ IDE so that developers can get real-time feedback through type hints and error redlining. This way, tensor shapes can be inspected while models are being written — before they are ever built or run. Anyone who has trained a model for several hours, only to receive a shape error that terminates their progress, knows the extent to which this feedback can save time, resources, and frustration.

Here is a code snippet from a simple convolutional neural network written in IntelliJ. The input data is being fed through multiple layer objects and the developer can inspect the resulting shapes at each step. Note the use of a generic type parameter, N, in the first dimension to signify a variable batch size.

We can induce an error in our code by removing the “maxPool2” layer. This results in a shape error at our first fully connected layer, fc1, because fc1 expects a tensor of shape N x 784, but the incoming shape is actually N x 3136.
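The factor of four between 784 and 3136 is what one would expect if the removed layer is a 2 x 2 max-pool with stride 2, which halves both height and width. The sketch below reproduces that arithmetic as a runtime check in stock Kotlin. The layer sizes are assumptions chosen to match the two numbers (16 channels at 14 x 14 before pooling), and FeatureMap, maxPool, and checkFcInput are hypothetical names; the actual network's dimensions are not given here.

```kotlin
// Runtime shape arithmetic for the pooling example; all names and sizes are assumptions.
data class FeatureMap(val channels: Int, val height: Int, val width: Int) {
    // Number of features per example after flattening.
    val flattened: Int get() = channels * height * width
}

// Output shape of a max-pool with a square window, stride equal to the window, no padding.
fun maxPool(input: FeatureMap, window: Int = 2): FeatureMap =
    FeatureMap(input.channels, input.height / window, input.width / window)

// Mimics the check a fully connected layer performs on its input width.
fun checkFcInput(expected: Int, input: FeatureMap): String =
    if (input.flattened == expected) "ok"
    else "shape error: expected N x $expected, found N x ${input.flattened}"
```

With FeatureMap(16, 14, 14), pooling first gives 16 * 7 * 7 = 784 features and the check passes. Skipping the pool leaves 16 * 14 * 14 = 3136 features and the check reports the mismatch. The difference with tensor typing is that this mismatch surfaces in the editor rather than at runtime.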

We’re excited about the productivity and creativity that our work will foster. To further support work in differentiable programming, we will also be releasing a user library that takes full advantage of our AD and tensor typing systems and allows engineers and developers coming from any ML framework to transition to and deploy onto ours with ease.

*Thanks to all of the members of the Differentiable Programming Languages Team who contributed to this work: Samantha Andow, Arturo Arenas Esparza, Irene Dea, Emilio Arroyo-Fang, Neal Gafter, Johann George, Melissa Grueter, Erik Meijer, Xipeng Shen, Steffi Stumpos, Alanna Tempest, Christy Warden, and Shannon Yang.*