Deep learning to translate between programming languages

July 21, 2020

Migrating a codebase from an archaic programming language such as COBOL to a modern alternative like Java or C++ is a difficult, resource-intensive task that requires expertise in both the source and target languages. COBOL, for example, is still widely used today in mainframe systems around the world, so companies, governments, and others often must choose whether to manually translate their codebases or commit to maintaining code written in a language that dates back to the 1950s.

We've developed and open sourced TransCoder, an entirely self-supervised neural transcompiler system that can make code migration far easier and more efficient. Our method is the first AI system able to translate code from one programming language to another without requiring parallel data for training. We’ve demonstrated that TransCoder can successfully translate functions between C++, Java, and Python 3. TransCoder outperforms open source and commercial rule-based translation programs. In our evaluations, the model correctly translates more than 90 percent of Java functions to C++, 74.8 percent of C++ functions to Java, and 68.7 percent of functions from Java to Python. In comparison, a commercially available tool translates only 61.0 percent of functions correctly from C++ to Java, and an open source translator is accurate for only 38.3 percent of Java functions translated into C++.

Self-supervised training is particularly important for translating between programming languages. Traditional supervised-learning approaches rely on large-scale parallel datasets for training, but these simply don’t exist for COBOL to C++ or C++ to Python, for example. TransCoder relies exclusively on source code written in just one programming language, rather than requiring examples of the same code in both the source and target languages. It requires no expertise in the source or target languages, and its approach can easily be generalized to additional programming languages. We have also created a new evaluation metric designed expressly for this domain.

TransCoder could be useful for updating legacy codebases to modern programming languages, which are typically more efficient and easier to maintain. It also shows how neural machine translation (NMT) techniques can be applied to new domains. As with Facebook AI’s previous work using neural networks to solve advanced mathematics equations, we believe NMT can help with other tasks not typically associated with translation or pattern recognition.

Building a sequence-to-sequence model expressly for programming languages

In natural language, recent advances in neural machine translation have been widely adopted, even among professional translators, who rely more and more on automated machine translation systems. However, these techniques have seen limited application to code translation because of the scarcity of parallel data in this domain. Programmers still rely on rule-based code translators, which require experts to review and debug the output, or they simply translate code manually. TransCoder overcomes these challenges by applying recent advances in unsupervised machine translation to programming languages.

We built a sequence-to-sequence (seq2seq) model with attention, composed of an encoder and a decoder with a transformer architecture. TransCoder uses a single shared model, based in part on Facebook AI’s previous work on XLM, for all programming languages. We trained it following the three principles of unsupervised machine translation detailed in Facebook AI’s previous research: initialization, language modeling, and back translation.
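As a rough illustration, a shared seq2seq transformer of this kind can be sketched as follows in PyTorch. This is not the released TransCoder code; the vocabulary size, layer counts, and language-token convention are illustrative assumptions, and positional encodings are omitted for brevity.

import torch.nn as nn

class SharedTranscoder(nn.Module):
    """One encoder, one decoder, and one embedding table shared by all languages."""
    def __init__(self, vocab_size=64000, d_model=1024, n_heads=8, n_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=n_heads,
            num_encoder_layers=n_layers, num_decoder_layers=n_layers,
            batch_first=True)
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, src_tokens, tgt_tokens):
        # tgt_tokens begins with a language token such as <java> or <cpp>,
        # which tells the shared decoder which language to generate.
        memory = self.transformer.encoder(self.embed(src_tokens))
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt_tokens.size(1))
        out = self.transformer.decoder(self.embed(tgt_tokens), memory, tgt_mask=tgt_mask)
        return self.proj(out)  # token logits over the shared vocabulary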

This graphic shows how TransCoder leverages the three principles of unsupervised machine translation.

We first leveraged source code from open source GitHub projects to pretrain our model using a masked language model (MLM) objective. As in the context of natural language processing, this pretraining creates cross-lingual embeddings: Keywords from different programming languages that are used in similar contexts are very close in embedding space (e.g., catch and except). The cross-lingual nature of these embeddings comes from the significant number of common tokens (anchor points) that exist across languages. Examples of anchor points include keywords common to C++, Java, and Python (e.g., for, while, if, try), as well as mathematical operators, digits, and English strings appearing in the source code.
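As a rough sketch of the idea (the mask rate and special tokens here are assumptions, not TransCoder's exact preprocessing), MLM pretraining hides random tokens in a monolingual stream of source code and trains the model to recover them:

import random

MASK = "<mask>"

def mask_tokens(tokens, mask_prob=0.15):
    """Randomly hide tokens; the model is trained to predict the originals."""
    corrupted, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            corrupted.append(MASK)
            targets.append(tok)   # loss is computed on these positions
        else:
            corrupted.append(tok)
            targets.append(None)  # no loss on unmasked positions
    return corrupted, targets

# The same procedure applies to C++, Java, or Python token streams, so shared
# keywords, operators, and digits act as anchor points across languages.
cpp_tokens = ["for", "(", "int", "i", "=", "0", ";", "i", "<", "n", ";", "i", "++", ")"]
print(mask_tokens(cpp_tokens))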

Pretraining with MLM allows TransCoder to generate high-quality representations of input sequences. However, the decoder lacks the capacity to translate, as it has never been trained to decode a sequence based on a source representation. To address this issue, we trained the model to encode and decode sequences with a Denoising Auto-Encoding (DAE) objective. The DAE objective operates like a supervised machine translation algorithm, where the model is trained to predict a sequence of tokens given a corrupted version of that sequence. The first symbol given as input to the decoder is a special token indicating the output programming language. At test time, a Python sequence can be encoded by the model and decoded using the C++ start symbol to generate a C++ translation. The quality of the C++ translation will depend on the “cross-linguality” of the model: If the Python function and a valid C++ translation are mapped to the same latent representation by the encoder, the decoder will successfully generate this C++ translation.
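A simplified sketch of the corruption step behind the DAE objective is shown below; the specific noise operations (token drops, local shuffling, masking) and their rates are assumptions based on the standard unsupervised machine translation recipe rather than TransCoder's exact settings.

import random

def corrupt(tokens, drop_prob=0.1, shuffle_dist=3, mask_prob=0.1):
    """Produce a noisy version of a token sequence for denoising training."""
    # Randomly drop some tokens.
    kept = [t for t in tokens if random.random() > drop_prob]
    # Locally shuffle token order within a small window.
    keys = [i + random.uniform(0, shuffle_dist) for i in range(len(kept))]
    shuffled = [t for _, t in sorted(zip(keys, kept))]
    # Randomly mask some of the remaining tokens.
    return ["<mask>" if random.random() < mask_prob else t for t in shuffled]

python_fn = ["def", "f", "(", "x", ")", ":", "return", "x", "+", "1"]
noisy = corrupt(python_fn)
# Training pair: the encoder sees `noisy`, the decoder starts from the <python>
# token and must regenerate `python_fn`. At test time, the same encoder output
# can instead be decoded from a <cpp> start token to produce a translation.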


This graphic shows how keywords with similar functions are grouped together.

Cross-lingual model pretraining and Denoising Auto-Encoding alone are enough to generate translations. However, the quality of these translations tends to be low, as the model is never trained to do what it is expected to do at test time, i.e., to translate functions from one language to another. To address this issue, we use back-translation, which is one of the most effective methods to leverage monolingual data in a weakly supervised scenario. We use a single model and a different start token for each target language. It is trained to translate from source to target and from target to source in parallel. The target-to-source version is used to translate target sequences into the source language, producing noisy source sequences corresponding to the ground truth target sequences. The model can then be trained in a weakly supervised manner to reconstruct the target sequences from the noisy source sequences and learn to translate from source to target. The target-to-source and source-to-target versions are trained in parallel until convergence.
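Conceptually, one round of back-translation looks like the sketch below. The model.translate and model.train_step helpers are hypothetical names used only for illustration; they are not part of the released code.

def back_translation_round(model, python_functions, cpp_functions):
    # Python -> C++: the (noisy) C++ hypotheses become sources for training
    # the C++ -> Python direction against the ground truth Python functions.
    noisy_cpp = [model.translate(f, target_lang="cpp") for f in python_functions]
    for src, tgt in zip(noisy_cpp, python_functions):
        model.train_step(src, tgt, target_lang="python")

    # Symmetric direction: C++ -> Python hypotheses train Python -> C++.
    noisy_python = [model.translate(f, target_lang="python") for f in cpp_functions]
    for src, tgt in zip(noisy_python, cpp_functions):
        model.train_step(src, tgt, target_lang="cpp")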

To evaluate their models, most previous studies of source-code translation have relied on metrics used in natural language, such as BLEU score or other methods based on the relative overlap between tokens. These types of metrics are not well suited to programming languages, however. Two programs with small syntactic discrepancies might achieve a very high BLEU score while still producing very different results when the code is executed. Conversely, semantically equivalent programs with different implementations will have low BLEU scores. An alternative metric is the reference match, or the percentage of translations that perfectly match the ground truth reference, but this often underrates the quality of the translation because it is unable to recognize semantically equivalent code.

To better measure the performance of TransCoder and other code translation techniques, we’ve created a new metric called computational accuracy, which evaluates whether the hypothesis function generates the same outputs as the reference when given the same inputs. We are also releasing our test set and the scripts and unit tests we used to compute this metric.
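In spirit, the metric can be sketched as a small harness that runs the reference and the hypothesis on the same inputs and compares their outputs; the snippet below works for Python callables, whereas the released scripts compile and run C++ and Java translations through unit tests.

def computational_accuracy(reference_fn, hypothesis_fn, test_inputs):
    """Fraction of test inputs on which the translation matches the reference."""
    passed = 0
    for args in test_inputs:
        try:
            if hypothesis_fn(*args) == reference_fn(*args):
                passed += 1
        except Exception:
            pass  # a crash counts as a failed test case
    return passed / len(test_inputs)

# A semantically equivalent translation scores 1.0 even if its tokens differ.
def reference(a, b):
    return a + b

def hypothesis(b, a):
    return a + b

print(computational_accuracy(reference, hypothesis, [(1, 2), (3, 4), (0, 7)]))  # 1.0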

The example below shows how TransCoder translated sample code from Python to C++. We used this code as input to the model:

Python input

from collections import deque

# Sum of the minimum and maximum elements over all contiguous subarrays of size k.
def SumOfKsubArray(arr, n, k):
  Sum = 0
  S = deque()
  G = deque()
  for i in range(k):
    while (len(S) > 0 and arr[S[-1]] >= arr[i]):
      S.pop()
    while (len(G) > 0 and arr[G[-1]] <= arr[i]):
      G.pop()
    G.append(i)
    S.append(i)
  for i in range(k, n):
    Sum += arr[S[0]] + arr[G[0]]
    while (len(S) > 0 and S[0] <= i - k):
      S.popleft()
    while (len(G) > 0 and G[0] <= i - k):
      G.popleft()
    while (len(S) > 0 and arr[S[-1]] >= arr[i]):
      S.pop()
    while (len(G) > 0 and arr[G[-1]] <= arr[i]):
      G.pop()
    G.append(i)
    S.append(i)
  Sum += arr[S[0]] + arr[G[0]]
  return Sum
        


TransCoder successfully translated the Python input function SumOfKsubArray into C++. It also inferred the types of the arguments, the return type, and the parameters of the function, and it correctly mapped Python’s deque() container to the C++ deque<> implementation. Here is the model’s output in C++:

C++ unsupervised translation

int SumOfKsubArray(int arr[], int n, int k){
  int Sum = 0;
  deque <int> S;
  deque <int> G;
  for(int i = 0; i < k; i ++){
    while((int) S.size() > 0 && arr[S.back()] >= arr[i])
      S.pop_back();
    while((int) G.size() > 0 && arr[G.back()] <= arr[i])
      G.pop_back();
    G.push_back(i);
    S.push_back(i);
  }
  for(int i = k; i < n; i ++){
    Sum += arr[S.front()] + arr[G.front()];
    while((int) S.size() > 0 && S.front() <= i - k)
      S.pop_front();
    while((int) G.size() > 0 && G.front() <= i - k)
      G.pop_front();
    while((int) S.size() > 0 && arr[S.back()] >= arr[i])
      S.pop_back();
    while((int) G.size() > 0 && arr[G.back()] <= arr[i])
      G.pop_back();
    G.push_back(i);
    S.push_back(i);
  }
  Sum += arr[S.front()] + arr[G.front()];
  return Sum;
}
        

Advancing research and helping with real-world applications

Automatic code translation has the potential to make programmers working in companies or on open source projects more efficient by allowing them to integrate code from other teams within the company or from other open source projects more easily. It can also greatly reduce the effort and expense of updating an old codebase written in an outdated language.

Advances in transcompilation could spur companies and other institutions to update to more recent languages and facilitate future innovation, which could benefit the people who use their services as well as the institutions themselves. Advances in machine translation for programming languages could also help people who lack the time or resources to learn to program in multiple languages.

More broadly, AI has the potential to help with other programming tasks. For example, Facebook AI has previously shared Neural Code Search, a method for using natural language in queries about code, and Getafix, a tool that learns to automatically suggest fixes for coding bugs. While TransCoder isn’t designed to help with debugging or improving code quality, it has the potential to help engineers migrate old codebases or use external code written in other languages.

In order to promote future research on using deep learning for code translation, we are also releasing a test set that enables other researchers to evaluate code translation models using computational accuracy instead of semantics-blind metrics. We look forward to seeing how others build on our work with TransCoder and advance self-supervised learning for new kinds of translation tasks.

Written By

Baptiste Roziere

Research Assistant

Marie-Anne Lachaux

Research Engineer

Lowik Chanussot

Research Engineering Manager

Guillaume Lample

Research Scientist