Deep Learning Models that Write Code - Issue #6
Consider the following scenario: you write a comment, "create a function that fetches all machine learning tweets and saves it to a file", and a deep learning model generates a full Python function that achieves this goal (see screenshot below). It imports a suitable library and creates the required boilerplate authentication code and call structure, so you get decent functionality with minimal effort!
This research area, frequently referred to as deep learning for program synthesis, has recently become quite active, with useful applications in improving developer productivity (e.g., reducing the time to implement features), improving the experience for developers exploring a new language or framework, or even helping novice developers understand existing code.
GitHub Copilot · Your AI pair programmer — copilot.github.com GitHub Copilot works alongside you directly in your editor, suggesting whole lines or entire functions for you.
Large Language Models and Program Synthesis
Simply put, a language model is a statistical model that learns the probability distribution of words in a sequence. It turns out that if we can build such a model with high fidelity, we can solve a few interesting tasks. For example, if we know which word is likely to occur given some preceding sequence of words, we can implement useful functionality like email autocomplete (e.g., given the sequence "Have a great", we can predict that the next likely word is "day").
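To make the autocomplete idea concrete, here is a minimal sketch of a bigram language model: it only counts which word follows which, then predicts the most frequent follower. The toy corpus and the function names (`train_bigram_model`, `next_word_probabilities`, `predict_next`) are made up for illustration; real language models condition on much longer contexts and are trained on far more text.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count how often each word follows another (a bigram language model)."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probabilities(model, word):
    """Estimate P(next word | word) from the raw follower counts."""
    followers = model.get(word.lower(), Counter())
    total = sum(followers.values())
    if total == 0:
        return {}
    return {w: c / total for w, c in followers.items()}


def predict_next(model, word):
    """Return the most likely next word, or None if the word was never seen."""
    probs = next_word_probabilities(model, word)
    return max(probs, key=probs.get) if probs else None


# Toy corpus, invented purely for illustration.
corpus = [
    "have a great day",
    "have a great day ahead",
    "have a great weekend",
    "what a great idea",
]

model = train_bigram_model(corpus)
print(next_word_probabilities(model, "great"))
print(predict_next(model, "great"))
```

On this toy corpus, "day" follows "great" in two of four cases, so it is the predicted completion. An LLM replaces the count table with a neural network, but the task it is trained on is the same kind of next-token prediction.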
When these statistical models are derived using large neural networks with billions of parameters (hence the term large language models, or LLMs), the results and application areas are even more impressive. Results from transformer-based model architectures like BERT and GPT show that these models excel at several complex tasks: they can mimic creative writing, predict sentiment, identify topics within sentences with few examples, meaningfully summarize lengthy documents, translate languages, and more.
GPT-3 Powers the Next Generation of Apps — openai.com Over 300 applications are delivering GPT-3–powered search, conversation, text completion, and other advanced AI features through our API.
Given that code is similar to natural language, recent research has shown that LLMs can also be trained to learn a distribution over code tokens and can help solve several code understanding and code generation tasks.
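As a rough illustration of what "a distribution over code tokens" means, the sketch below splits Python source into tokens with the standard library `tokenize` module and counts which token tends to follow which. This is an assumption-laden toy (three invented snippets, a simple bigram count, hypothetical helper names `code_tokens` and `train_token_model`), not how Codex or any LLM is actually trained, but the objective is similar in spirit: predict the next token from context.

```python
import io
import tokenize
from collections import Counter, defaultdict

# Token types that carry layout rather than content; skipped in this sketch.
SKIPPED_TYPES = {
    tokenize.NEWLINE,
    tokenize.NL,
    tokenize.INDENT,
    tokenize.DEDENT,
    tokenize.COMMENT,
    tokenize.ENDMARKER,
}


def code_tokens(source):
    """Split Python source into a flat list of token strings."""
    reader = io.StringIO(source).readline
    return [
        tok.string
        for tok in tokenize.generate_tokens(reader)
        if tok.type not in SKIPPED_TYPES
    ]


def train_token_model(snippets):
    """Count which code token follows which across all snippets."""
    counts = defaultdict(Counter)
    for snippet in snippets:
        tokens = code_tokens(snippet)
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


# Toy snippets, invented purely for illustration.
snippets = [
    "for i in range(10):\n    print(i)\n",
    "for x in range(n):\n    total += x\n",
    "for item in items:\n    print(item)\n",
]

token_model = train_token_model(snippets)
print(code_tokens("x = 1\n"))
print(token_model["in"].most_common(1))
```

In these snippets the token `in` is followed by `range` twice and by `items` once, so a completion tool built on these counts would suggest `range` after `in`.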
Code Understanding and Generation Tasks: Datasets and Metrics
In their paper, Lu et al. [1] identify 10 general tasks in the area of code understanding and generation. They also highlight a set of datasets that are useful for addressing these tasks (including some datasets that they curate) and discuss the metrics used to evaluate each task.
A list of tasks:
Clone detection (semantic similarity)
Defect detection (predict if code contains bugs)
Cloze test (predict masked token given code snippet)
Code completion (predict next tokens given some context)
Code repair (produce bug free version of code)
Code-to-code translation (translate from one programming language to another)
Natural language code search (retrieve code based on text query, predict if code answers a NL text query)
Text-to-code generation (explicitly generate code given some natural language query)
Code summarization (generate natural language summaries or comments given some code)
Documentation translation (translate code documentation from one natural language to another).
I have written a review/summary of the Lu et al. [1] paper, which can be found here - https://victordibia.com/blog/codex-glue/.
Code Generation/Completion - Codex, AlphaCode
Perhaps the most interesting task in the list above is the one behind the opening use case of this newsletter post: code generation/completion. There has been very interesting progress in this area, starting with the Codex model from OpenAI [2], which powers GitHub Copilot, and more recently the AlphaCode model from DeepMind [3], a transformer-based model for solving competitive programming problems.
Codex [2] is the program synthesis model from OpenAI that powers GitHub Copilot. While Codex focuses on general-purpose program completion, AlphaCode seems to focus specifically on solving competitive programming contest problems. While Codex is based on a multilayer decoder-only model (GPT-3), AlphaCode is based on an encoder-decoder model architecture. A recent paper [4] from Google evaluating LLMs for code generation also uses a decoder-only setup.
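One way to picture the architectural difference is through attention masks. In a decoder-only model, each token can attend only to itself and earlier tokens (a causal mask). In an encoder-decoder model, the encoder reads the whole input with every token attending to every other token, while the decoder still generates causally and additionally attends to the encoder output. The sketch below only prints the two mask patterns; the function names are hypothetical, and this is a generic illustration rather than the Codex or AlphaCode implementation.

```python
def causal_mask(n):
    """Decoder-style mask: position i may attend to positions 0..i only."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def show(mask):
    """Render a mask as rows of 0/1 values for quick inspection."""
    return "\n".join(" ".join(str(v) for v in row) for row in mask)


print("decoder (causal) mask:")
print(show(causal_mask(4)))
print("encoder (bidirectional) mask:")
print(show(bidirectional_mask(4)))
```

Row i of each mask lists which positions token i can see. The lower-triangular pattern is what lets a decoder generate code one token at a time; the full pattern lets an encoder read a whole problem description before any code is produced.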
I have written a review of the AlphaCode paper, which can be found here - https://victordibia.com/blog/alpha-code/.
Conclusions
The space is increasingly interesting! I have used Copilot over the last 4 months and can attest to it actually improving my productivity (hint: it works better than having to consult Stack Overflow from time to time). There are still many open issues, e.g., understanding when these models fail and why they fail (e.g., due to misalignment), improved metrics for evaluating these models, the impact of these tools on developer learning/education, and the general societal impacts of LLMs.
There is an ICLR 2022 workshop track focused on Deep Learning for Code which should be really insightful!
Home | Deep Learning For Code Deep Learning for Code (DL4C) is a workshop that will provide a platform for researchers to share their work on deep learning for code.
References
[1] Lu, S., Guo, D., Ren, S., Huang, J., Svyatkovskiy, A., Blanco, A., Clement, C., Drain, D., Jiang, D., Tang, D. and Li, G., 2021. CodeXGLUE: A machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664.
[2] Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H.P.D.O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G. and Ray, A., 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
[3] Li, Y. et al., 2022. Competition-level code generation with AlphaCode. https://storage.googleapis.com/deepmind-media/AlphaCode/competition_level_code_generation_with_alphacode.pdf
[4] Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., ... and Sutton, C., 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732.