The best Large Language Models (LLMs) for coding are trained on code-related data and offer developers a new way to augment their workflows to improve efficiency and productivity. These coding assistants can be used for a wide range of code-related tasks, such as code generation, code analysis to help with debugging, refactoring, and writing test cases, as well as offering chat capabilities to discuss problems and inspire developers with solutions. For this guide we tested several different LLMs that can be used as coding assistants to work out which ones deliver the best results in their given category.
Large language models are an area of technology that is moving very quickly, so while we do our best to keep this guide as up to date as possible, you may want to check whether a newer model has been released and whether it fits your specific use case better.
The best large language models (LLMs) for coding
Best for Enterprises
Originally released in 2021, GitHub Copilot is GitHub and Microsoft's LLM-powered coding assistant, specifically trained on code to help developers with their work, with the aim of improving efficiency and productivity. While the original release used OpenAI's Codex model, a modified version of GPT-3 that was also trained as a coding assistant, GitHub Copilot was updated to use the more advanced GPT-4 model in November 2023.
A core feature of GitHub Copilot is its extension, which integrates the LLM directly into the Integrated Development Environments (IDEs) popular among developers today, including Visual Studio Code, Visual Studio, Vim, Neovim, the JetBrains suite of IDEs, and Azure Data Studio. This direct integration allows GitHub Copilot to use your existing project as context to improve the suggestions it makes for a given prompt, while also giving users hassle-free installation and access to its features. For enterprise users, the model can also be granted access to your organization's existing repositories and knowledge bases to further enhance the quality of its outputs and suggestions.
When writing code, GitHub Copilot can offer suggestions in a few different ways. Firstly, you can write a prompt as an inline comment that is converted into a block of code. This works in a similar way to using other LLMs to generate code blocks from a prompt, but with the added advantage that GitHub Copilot can use existing project files as context to produce a better output. Secondly, GitHub Copilot can provide real-time suggestions as you write your code. For example, if you are writing a regex function to validate an email address, simply starting to write the function can trigger an autocomplete suggestion with the required syntax, as shown in the sketch below. Additionally, you can use the GitHub Copilot Chat extension to ask questions, request suggestions, and get help debugging code in a more context-aware fashion than you might get from LLMs trained on broader datasets. Users enjoy unlimited messages and interactions with GitHub Copilot's chat feature across all subscription tiers.
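To illustrate, here's the sort of completion Copilot typically produces when you start writing an email-validation helper. This is a hand-written sketch of a representative suggestion, not actual Copilot output:

```python
import re

# A common (simplified) pattern for validating email addresses.
EMAIL_PATTERN = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")

def is_valid_email(address: str) -> bool:
    """Return True if the address matches a basic email format."""
    return EMAIL_PATTERN.match(address) is not None

print(is_valid_email("dev@example.com"))   # True
print(is_valid_email("not-an-email"))      # False
```

In practice, typing just the function signature and docstring is often enough for Copilot to propose the body, drawing on the surrounding project for naming conventions and style.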
GitHub Copilot is trained on data from publicly available code repositories, including GitHub itself. GitHub claims Copilot can provide code assistance in any language where a public repository exists, though the quality of suggestions will depend on the volume of data available. All subscription tiers include a public code filter to reduce the risk of suggestions directly copying code from a public repository. By default, GitHub Copilot excludes data submitted by business and enterprise tier customers from being used to train the model further, and offers the ability to exclude files or repositories from being used to inform suggestions. Administrators can configure both features as needed based on their business use cases.
While these features aim to keep your data private, it's worth keeping in mind that prompts aren't processed locally; they rely on external infrastructure to provide code suggestions, and you should factor this into deciding whether this is the right product for you. Users should also be cautious about trusting any outputs implicitly – while the model is generally very good at providing suggestions, like all LLMs it is still prone to hallucinations and can make poor or incorrect suggestions. Always review any code generated by the model to make sure it does what you intend it to do.
In the future it's possible that GitHub will upgrade GitHub Copilot to use the recently released GPT-4o model. GPT-4 was originally released in March 2023, with GitHub Copilot updated to use the new model roughly eight months later. A further upgrade would make sense given GPT-4o's improved intelligence, reduced latency, and lower operating cost, though at this time there has been no official announcement.
If you want to try before you buy, GitHub Copilot offers a free 30-day trial of its cheapest package, which should be sufficient to test out its capabilities, with a $10 per month fee thereafter. Copilot Business costs $19 per user per month, while Enterprise costs $39 per user per month.
Best for individuals
CodeQwen1.5 is a version of Alibaba’s open-source Qwen1.5 LLM specifically trained using public code repositories to assist developers in coding related tasks. This specialized version was released in April 2024, a few months after the release of Qwen1.5 to the public in February 2024.
There are two different versions of CodeQwen1.5 available today. The base model is designed for code generation and suggestions but has limited chat functionality, while the chat version can also answer questions in a more human-like way. Both models have been trained on 3 trillion tokens of code-related data and support a very respectable 92 programming languages, including some of the most common languages in use today such as Python, C++, Java, PHP, C#, and JavaScript.
Unlike the base version of Qwen1.5, which is available to download in several different sizes, CodeQwen1.5 comes in a single 7B size. While this is quite small compared to other models on the market that can be used as coding assistants, it brings some advantages. Despite its small size, CodeQwen1.5 performs incredibly well against some of the larger models that offer coding assistance, both open and closed source. CodeQwen1.5 comfortably beats GPT-3.5 in most benchmarks and provides a competitive alternative to GPT-4, though this can depend on the specific programming language. While GPT-4 may perform better overall, it's important to remember that GPT-4 requires a subscription, has per-token costs that could make it very expensive compared to CodeQwen1.5, and cannot be hosted locally. As with all LLMs, it's risky to implicitly trust any suggestions or responses provided by the model. While steps have been taken to reduce hallucinations, always check the output to make sure it is correct.
As CodeQwen1.5 is open source, you can download a copy of the LLM to use at no additional cost beyond the hardware needed to run it. You'll still need to make sure your system has enough resources to run the model well, but thanks to the smaller model size a modern system with a GPU that has at least 16GB of VRAM and at least 32GB of system RAM should be sufficient, as in the sketch below. CodeQwen1.5 can also be fine-tuned on code from existing projects or other code repositories to further improve the context of its responses and suggestions. The ability to host CodeQwen1.5 within your own local or remote infrastructure, such as a Virtual Private Server (VPS) or dedicated server, should also help to alleviate some of the concerns about data privacy and security often associated with submitting information to third-party providers.
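As a rough guide to what local hosting involves, here is a minimal sketch of loading the chat variant with the Hugging Face transformers library. It assumes the "Qwen/CodeQwen1.5-7B-Chat" checkpoint, a CUDA-capable GPU, and the transformers and accelerate packages installed:

```python
# Minimal sketch of running CodeQwen1.5-7B-Chat locally with Hugging Face
# transformers. Assumes the "Qwen/CodeQwen1.5-7B-Chat" checkpoint and a GPU
# with roughly 16GB of VRAM; quantised builds need less.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/CodeQwen1.5-7B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user",
             "content": "Write a Python function that reverses a linked list."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```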
Alibaba surprised us by releasing its new Qwen2 LLM at the start of June, which it claims offers significant gains over the base Qwen1.5 model. Alibaba also mentioned that the training data used for CodeQwen1.5 is included in Qwen2-72B, which has the potential to offer improved results, but it's currently unclear whether there is a plan to upgrade CodeQwen to use the new model.
Best Value
When it comes to the best bang for your buck, Meta's open-source Llama 3 model, released in April 2024, is one of the best low-cost models on the market today. Unlike many models specifically trained on code-related data to assist developers with coding tasks, Llama 3 is a more general LLM capable of assisting in many ways – one of which happens to be as a coding assistant – and it outperforms CodeLlama, a coding model released by Meta in August 2023 based on Llama 2.
In like-for-like testing with models of the same size, Llama 3 outperforms CodeLlama by a considerable margin when it comes to code generation, interpretation, and understanding. This is impressive considering Llama 3 wasn't trained specifically for code-related tasks, yet it can still outperform models that were. This means that not only can you use Llama 3 to improve efficiency and productivity on coding tasks, but it can be used for other tasks as well. Llama 3 70B has a training data cutoff of December 2023 (March 2023 for the 8B version), which isn't always of critical importance for code-related tasks, but some languages develop quickly and having the most recent data available can be incredibly valuable.
Llama 3 is an open-source model that developers can download and deploy to their own local system or infrastructure. Like CodeQwen1.5, Llama 3 8B is small enough that a modern system with at least 16GB of VRAM and 32GB of system RAM is sufficient to run the model. The larger 70B version of Llama 3 naturally has better capabilities due to the increased parameter count, but the hardware requirement is an order of magnitude greater and would require a significant injection of funds to build a system capable of running it effectively. Luckily, Llama 3 8B offers enough capability that users can get excellent value without breaking the bank. If you find you need the added capability of the larger model, its open-source nature means you can easily rent an external VPS or dedicated server to support your needs, though costs will vary by provider.

If the investment needed for the required hardware, or the cost to rent an external host, is outside your budget, AWS offers API access to the model via a pay-as-you-go plan that charges by the token, as sketched below. AWS currently charges $3.50 per million output tokens, which is a considerable quantity for a very small price. For comparison, OpenAI's GPT-4o costs $15.00 for the same quantity of tokens. If this type of solution appeals to you, make sure to shop around for the best provider for your location, budget, and needs.
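For reference, here is a hedged sketch of what the pay-as-you-go route can look like, calling Llama 3 8B Instruct through AWS Bedrock with boto3. The model ID and request fields follow Bedrock's published Llama schema, but check AWS's documentation for availability in your region and current pricing:

```python
# Sketch of invoking Llama 3 8B Instruct via AWS Bedrock's pay-as-you-go API.
# Assumes AWS credentials are configured and the model is enabled in-region.
import json
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Llama 3's instruct prompt format uses special header tokens.
prompt = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n"
    "Write a Python function that checks whether a string is a palindrome."
    "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"
)

response = client.invoke_model(
    modelId="meta.llama3-8b-instruct-v1:0",
    body=json.dumps({"prompt": prompt, "max_gen_len": 512, "temperature": 0.2}),
)
print(json.loads(response["body"].read())["generation"])
```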
Llama 3 performs well in code generation tasks and adheres well to the prompts given. It will sometimes simplify code based on the prompt, but it's reasonably receptive to instructions to provide a complete solution and, if requested, will segment its answer when it reaches the token limit for a single response. During testing, we asked Llama 3 to write a complete solution in Python for a chess game that would immediately compile and could be played via text prompts, and it dutifully provided the requested code. Although the code initially failed to compile, providing Llama 3 with the error messages from the compiler allowed it to identify where the mistakes were and provide a correction. Llama 3 can effectively debug code segments to identify issues and provide new code to fix the error. As a bonus, it can also explain where the error was located and why it needs to be fixed, helping the user understand the mistake. However, as with all models generating code-related solutions, it's important to check the output and not trust it implicitly. Although models are becoming increasingly intelligent and accurate, they still hallucinate at times and can provide incorrect or insecure responses.
As with other open-source models, any data from your own code repositories that you use to train Llama 3 remains within your control. This helps to alleviate some of the concerns and risks associated with submitting proprietary and personal data to third parties, though you should still consider the implications for your information security policies where required. It doesn't cost anything extra to train a model hosted within your own infrastructure, but some hosts providing API access do charge an additional cost for further training.
You can download Llama 3 today directly from Meta.
Best for code generation
Released in March 2024, Claude 3 Opus is the latest and most capable LLM from Anthropic, which the company claims is the most intelligent LLM on the market today, designed to tackle a variety of different tasks. Although most LLMs can generate code, the accuracy and correctness of the outputs can vary, with mistakes or outright errors creeping in when a model wasn't specifically designed with code generation in mind. Claude 3 Opus bridges that gap by being trained to handle coding-related tasks alongside the regular tasks LLMs are often used for, making for a very powerful multi-faceted solution.
While Anthropic doesn't say how many programming languages it supports, Claude 3 Opus can generate code across a large range of them, from incredibly popular languages such as C++, C#, Python, and Java to older or more niche languages such as FORTRAN, COBOL, and Haskell. Claude 3 Opus relies on the patterns, syntax, coding conventions, and algorithms identified within its code-related training data to generate new code snippets from scratch, which helps avoid direct reproduction of the code used to train it. The large 200k-token context window offered by Claude 3 Opus is incredibly useful when working with large code blocks as you iterate through suggestions and changes.

Like all LLMs, Claude 3 Opus has an output token limit, and it tends to either summarise or truncate a response to fit within a single reply. While summarisation of a purely text response isn't too problematic, as you can ask for additional context, being deprived of a large chunk of required code, such as when generating a test case, is quite a problem. Fortunately, Claude 3 Opus can segment its responses if you request it in your initial prompt, as sketched below. You'll still need to ask it to continue after each reply, but this does allow you to obtain longer-form responses where needed.

As well as generating functional code, Claude 3 Opus adds comments to the code and explains what the generated code does to help developers understand what is happening. In cases where you are using Claude 3 to debug code and generate fixes, this is extremely valuable, as it not only helps solve the problem but also provides context as to why changes were made, or why the code was generated in a specific way.
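Here is an illustrative sketch of that segmentation technique using Anthropic's Python SDK. The prompt wording is our own example; the model name and token limit shown are the values Anthropic documented at the time of writing:

```python
# Sketch: ask Claude 3 Opus up front to split long code across replies.
# Assumes the anthropic package is installed and an API key is configured.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

message = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": (
            "Generate pytest test cases for my payment module. The code will "
            "be long, so split your answer into numbered parts and stop at a "
            "natural break; I will reply 'continue' for each next part."
        ),
    }],
)
print(message.content[0].text)
```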
For those concerned about privacy and data security, Anthropic states that they don’t use any of the data submitted to Claude 3 for the purposes of training the model further, a welcome feature that many will appreciate when working with proprietary code. They also include copyright indemnity protections with their paid subscriptions.
Claude 3 Opus does come with some limitations when it comes to improving the context of responses as it doesn’t currently offer a way to connect your own knowledge bases or codebases for additional training. This probably isn’t a deal breaker for most but could be something worth thinking about when choosing the right LLM for your code generation solution.
This does all come with a hefty price tag compared to other LLMs that offer code generation. API access is among the most expensive on the market at an eye-watering $75 per 1 million output tokens, considerably more than GPT-4o's $15 price tag. Anthropic does offer two additional Claude 3 models, Haiku and Sonnet, which are much cheaper at $1.25 and $15 respectively for the same quantity of tokens, though they have reduced capability compared to Opus. In addition to API access, Anthropic offers three subscription tiers that grant access to Claude 3. The free tier has a lower daily limit and only grants access to the Sonnet model, but should give those looking to test its capabilities a good idea of what to expect. To access Opus, you'll need to subscribe to Pro or Team at $20 and $30 per person per month respectively. The Team subscription requires a minimum of five users, for a total of $150 per month, but increases the usage limits for each user compared to the Pro tier.
Head over to Anthropic's website to create a free account and access Claude 3.
Best for debugging
Since the release of ChatGPT in November 2022, OpenAI has taken the world by storm and offers some of the most intelligent and capable LLMs on the market today. GPT-4 was released in March 2023 as an upgrade to GPT-3.5.
While GPT-4 isn't an LLM designed specifically as a coding assistant, it performs well across a broad range of code-related tasks, including real-time code suggestions, generating blocks of code, writing test cases, and debugging errors. GitHub Copilot has been using a version of GPT-4 with additional training data since November 2023, leveraging GPT-4's natural-language capabilities for code generation and within its chat assistant, which should give you an idea of the value it can provide.
GPT-4 has been trained on code-related data covering many different programming languages and coding practices, helping it understand the vast array of logic flows, syntax rules, and programming paradigms used by developers. This allows GPT-4 to excel at debugging code, helping to solve a variety of issues commonly encountered by developers. Syntax errors can be incredibly frustrating when working with some languages – I'm looking at you and your indentations, Python – so using GPT-4 to review your code can massively speed things up when code won't compile due to errors that are difficult to find. Logical errors are among the toughest to debug, as the code usually compiles correctly but doesn't produce the correct output or operate as desired. By giving GPT-4 your code and an explanation of what it should be doing, as in the sketch below, GPT-4 can analyse and identify where the problem lies, offer suggestions or rewrites to solve it, and even explain what the problem is and how the suggested changes fix it. This can help developers quickly understand the cause of a problem and offers an opportunity to learn how to avoid it in the future.
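As a minimal sketch of this workflow using the OpenAI Python SDK, the example below sends a deliberately buggy snippet (a made-up off-by-one error we wrote for illustration) together with a description of the intended behaviour:

```python
# Sketch of asking GPT-4 to diagnose a logical error via the OpenAI SDK.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

buggy_code = """
def average(numbers):
    total = 0
    for i in range(len(numbers) - 1):  # bug: skips the last element
        total += numbers[i]
    return total / len(numbers)
"""

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a careful code reviewer."},
        {"role": "user", "content": (
            "This function should return the mean of a list but gives the "
            f"wrong result. Find the bug and explain the fix:\n{buggy_code}"
        )},
    ],
)
print(response.choices[0].message.content)
```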
Although the training data cutoff for GPT-4 is September 2021, which is quite a long time ago given the advancements in LLMs over the last year, GPT-4 is continuously refined using new data from user interactions. This allows GPT-4's debugging to become more accurate over time, though it does present some potential risk for the code you submit for analysis, especially when using it to write or debug proprietary code. Users can opt out of their data being used to train GPT-4 further, but this doesn't happen by default, so keep it in mind when using GPT-4 for code-related tasks.
You might be wondering why the recommendation here is GPT-4 when it is four times more expensive than the newer, cheaper, and more intelligent GPT-4o model released in May 2024. In general, GPT-4o has proven to be a more capable model, but for code-related tasks GPT-4 tends to provide responses that are more correct, adhere to the prompt better, and offer better error detection than GPT-4o's. However, the gap is small, and it's likely that GPT-4o will overtake GPT-4 as the model matures through additional training on user interactions. If cost is a major factor in your decision, GPT-4o is a good alternative that covers the majority of what GPT-4 can provide at a much lower cost.
Best LLM for Coding Assistants FAQs
How does a coding assistant work?
Coding assistants use Large Language Models (LLMs) trained on code-related data to provide developers with tools that help increase productivity and efficiency on code-related tasks. The training data often draws on public code repositories, documentation, and other licensed work, enabling the LLM to recognise syntax, coding styles, programming practices, and paradigms, and to provide code generation, debugging, code analysis, and problem-solving capabilities across many different programming languages.
Coding assistants can be integrated into your development environments to provide inline code suggestions, and some can be trained further using an organization’s knowledge bases and codebases to improve the context of suggestions.
Why shouldn’t I implicitly trust the code generated by a coding assistant?
LLMs are becoming increasingly intelligent, but they aren't immune to making mistakes known as "hallucinations". Most coding assistants generate code that works well, but sometimes the code can be incomplete, inaccurate, or completely wrong. This varies from model to model and depends heavily on the training data used and the overall capability of the model itself.
What is a context window?
A context window describes how far back an LLM's memory of a conversation extends, usually measured in tokens. LLMs with a large context window can produce responses that draw on more of the conversation history, which is valuable for developers brainstorming ideas, debugging large sections of code, or iterating on a design. The sketch below illustrates how conversation history consumes a context window.
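This rough illustration uses the tiktoken library to count tokens; the 8,192-token limit shown matches the original GPT-4 context window, and the conversation turns are invented examples:

```python
# Rough illustration of how conversation history consumes a context window.
import tiktoken

CONTEXT_WINDOW = 8192  # original GPT-4 context window, in tokens
encoding = tiktoken.encoding_for_model("gpt-4")

history = [
    "User: Why does my regex reject valid email addresses?",
    "Assistant: Your pattern anchors the domain incorrectly...",
]
# Every past turn must fit in the window alongside the new prompt and reply.
used = sum(len(encoding.encode(turn)) for turn in history)
print(f"{used} of {CONTEXT_WINDOW} tokens used; {CONTEXT_WINDOW - used} remain")
```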