## Objective

- Introduce Codex, a model that generates code from natural language and is a fine-tuned version of GPT.
- Evaluate Codex and release an evaluation set, HumanEval, to measure the functional correctness of code-producing models.

## Previous work

[[Seq2Seq transformers]] models have been used in a variety of domains, language being only one of them. More recently, [[language model]]s have been repurposed for program synthesis, so Codex is not the first and only one; other examples are CodeBERT and PyMT5. An early investigation of GPT-3 found that it was able to produce code even though it was not explicitly trained for that.

> The availability of large code datasets like GitHub has motivated this development. So this is clearly an opportunistic use case, and perhaps this particular fine-tuning approach is not applicable to domains without such extensive datasets.

Something crucial in this case, and distinct from language models, is that this model is evaluated on functional correctness through unit tests, while language output is evaluated only by heuristics or by humans. So for a different fine-tuning domain, perhaps we need a correspondingly different evaluation method.

Performance is correlated with parameter count. Sampling a single output solves 28.8% of the problems, but producing 100 samples per problem and taking whichever passes the unit tests solves 70.2%.

Codex is evaluated on functional correctness. There are other metrics that match the generated solution against a predefined reference solution, but that approach fails to recognize functionally identical yet syntactically different solutions. Therefore unit testing is a more appropriate approach; it is also how humans evaluate code (a sketch of such a unit-test check appears at the end of this note).

While the model can be said to produce code from natural language, more formally it produces code from docstrings. What are the limitations of this approach? Perhaps it limits how dialogically interactive the model can be.

> I think something interesting about functional correctness evaluations is that functional correctness will vary per task domain, but there might be creative ways of turning a subjective task into a test that can be evaluated according to functional correctness.

**pass@k**: the fraction of problems for which at least one of k generated samples passes the unit tests.
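A minimal sketch of the unbiased pass@k estimator the paper uses: generate n ≥ k samples per problem, count the c samples that pass the unit tests, and estimate the probability that a random size-k subset contains at least one passing sample. The example numbers below are illustrative, not results from the paper.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k for a single problem.

    n: total samples generated for the problem
    c: number of those samples that passed the unit tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    # Numerically stable form of 1 - C(n-c, k) / C(n, k)
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Illustrative: 100 samples for one problem, 30 of which pass the tests.
print(pass_at_k(100, 30, 1))    # ~0.30
print(pass_at_k(100, 30, 100))  # 1.0: some sample is guaranteed to pass
```

Averaging this quantity over all HumanEval problems gives the reported pass@k; the gap between sampling once and sampling 100 times is exactly the gap between pass@1 and pass@100.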
**Concepts**

- [[Fine-tuning]]
- Generative pre-training
- [[Semi-supervised training]]

I initially thought that Codex was a fine-tuning of [[GPT3]], but it is actually a fine-tuned GPT; I describe that model here: [[GPT]]

**Reference**

```
@article{Chen_Tworek_Jun_Yuan_Pinto_Kaplan_Edwards_Burda_Joseph_Brockman_et al._2021,
  title={Evaluating Large Language Models Trained on Code},
  abstractNote={We introduce Codex, a GPT language model fine-tuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%. Furthermore, we find that repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult prompts. Using this method, we solve 70.2% of our problems with 100 samples per problem. Careful investigation of our model reveals its limitations, including difficulty with docstrings describing long chains of operations and with binding operations to variables. Finally, we discuss the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics.},
  journal={arXiv},
  author={Chen, Mark and Tworek, Jerry and Jun, Heewoo and Yuan, Qiming and Pinto, Henrique Ponde de Oliveira and Kaplan, Jared and Edwards, Harri and Burda, Yuri and Joseph, Nicholas and Brockman, Greg and et al.},
  year={2021}
}
```
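To make the functional-correctness evaluation concrete (the sketch referenced in the notes above), here is a minimal HumanEval-style check: a docstring-only prompt, a model-written completion, and unit tests that decide pass/fail. The task, completion, and test are illustrative stand-ins, and the real harness runs candidates in a sandbox rather than with a bare `exec`.

```python
# A docstring-only prompt: the model sees this and must write the body.
prompt = '''def incr_list(l: list) -> list:
    """Return a list with each element incremented by 1."""
'''

# A (hypothetical) model-generated completion.
completion = "    return [x + 1 for x in l]\n"

# Unit tests define functional correctness: pass/fail, no string matching.
check_program = prompt + completion + '''
def check(candidate):
    assert candidate([1, 2, 3]) == [2, 3, 4]
    assert candidate([]) == []

check(incr_list)
'''

try:
    exec(check_program, {})  # NOTE: real harnesses sandbox untrusted code
    passed = True
except Exception:
    passed = False

print("functionally correct:", passed)  # True
```

Two syntactically different completions, say a list comprehension versus an explicit loop, both pass this check, which is precisely what match-based metrics fail to recognize.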