The Allens AI Australian law benchmark

The results

Some impressive high-quality scores

  • Each LLM produced high-quality responses to at least some legal queries. We classify an LLM's performance on a particular question as 'high quality' if it scores an average of 8 marks or higher out of 10 across its three attempts at the same question (a short illustration of this calculation follows this list).
  • Achieving a high-quality response against a question generally required strong scores in each of the three components of the Allens AI Benchmark: substance, citations and clarity.
  • OpenAI o1 and GPT-4o produced the highest number of high-quality responses, each scoring above 8 on 6 of the 30 benchmark questions, although DeepSeek was the only LLM to achieve a perfect average of 10 out of 10 on any benchmark question.
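As a purely illustrative sketch (the data layout, names and example numbers below are our own assumptions, not part of the benchmark), the 'high quality' classification can be computed as a simple average across the three attempts at each question:

```python
# Illustrative only: assumes scores are stored as
# {llm_name: {question_id: [attempt_1, attempt_2, attempt_3]}}, each attempt marked out of 10.
from statistics import mean

HIGH_QUALITY_THRESHOLD = 8  # an average of 8/10 or higher across the three attempts

def high_quality_counts(scores: dict[str, dict[str, list[float]]]) -> dict[str, int]:
    """Count, for each LLM, the questions on which its averaged score is 'high quality'."""
    return {
        llm: sum(1 for attempts in questions.values() if mean(attempts) >= HIGH_QUALITY_THRESHOLD)
        for llm, questions in scores.items()
    }

# Example with made-up numbers: only Q1 averages 8 or higher.
example = {"Model A": {"Q1": [9, 8, 8], "Q2": [6, 7, 5]}}
print(high_quality_counts(example))  # {'Model A': 1}
```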

Performance across practice areas

Each LLM was able to produce one-off high-scoring responses across several legal subject areas (see the below graph, showing the highest score each LLM achieved on any attempt at a question in each subject area tested). The broad distribution of one-off high-quality scores across a variety of practice areas demonstrates the clear potential for the adoption of LLMs as part of legal practice—at least in some areas—if reliability, citation and other issues can be effectively managed.

Quality varies significantly, but the best responses continue to improve

  • While all LLMs produced high-quality responses to some legal queries, average performance across the full set of Allens AI Australian law benchmark questions varied. OpenAI o1 and GPT-4o returned the highest average response scores overall.
  • DeepSeek outperformed both Claude 3.5 and Gemini 2.0; however, this is perhaps to be expected, given its later release date (in 2025) and its widely reported 'distillation' of the existing models. DeepSeek's competitive performance is also notable in view of its widely reported low development and running cost.
  • Overall, GPT-4o performed less effectively when its web search function was disabled. However, the advantage of web search was not apparent for every question: in some instances, web results appeared to dilute the quality of the responses.
  • The overall variability of each LLM's responses is shown on the box and whisker graph below. On the graph, the 'boxes' show the middle 50% of each LLM's scores (the interquartile range), while the 'whiskers' extending from the boxes show the spread of the remaining scores. A tighter spread of scores indicates that an LLM scored more consistently. Interestingly, while GPT-4o produced a slightly higher average score than OpenAI o1, the latter performed more consistently.
  • On one measure (relative standard deviation), OpenAI o1, GPT-4o and DeepSeek were also more consistent than the other LLMs, ie their scores were clustered more closely around the average (a short illustration of these summary statistics follows this list).
  • At the lower end, most LLMs occasionally produced responses that were mostly or completely incorrect, including because the LLM got a fundamental legal principle wrong, applied an overseas jurisdiction's law instead of Australian law, or was too confident or eager. The LLMs generally tended to confirm the user's input query rather than correct it, and several overstated the level of confidence in their advice.
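The summary statistics referred to above can be illustrated with a short sketch. This is an assumption-laden example (the data layout and numbers are ours, not the benchmark's): each LLM's scores are reduced to a mean, the quartiles that define the 'box' in a box-and-whisker plot, and a relative standard deviation.

```python
# Illustrative only: 'scores' is assumed to be a flat list of one LLM's marks out of 10.
from statistics import mean, stdev, quantiles

def summarise(scores: list[float]) -> dict[str, float]:
    """Mean, quartiles (the 'box' of a box-and-whisker plot) and relative standard deviation."""
    q1, median, q3 = quantiles(scores, n=4)   # 25th, 50th and 75th percentiles
    avg = mean(scores)
    rsd = stdev(scores) / avg * 100           # relative standard deviation, as a percentage of the mean
    return {"mean": avg, "q1": q1, "median": median, "q3": q3, "rsd_%": rsd}

# Made-up example scores for a single LLM:
print(summarise([8, 7.5, 9, 4, 6, 8.5, 7, 5.5, 9.5, 6.5]))
```

A lower relative standard deviation corresponds to the tighter clustering around the average described above.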

A perfect response from Gemini 1.5

Query: a company has suffered a cyber breach incident. Does this create any potential liability for the directors? Is there a fiduciary duty that relates to cyber risks?

Gemini 1.5 response:

Yes, a cyber breach incident can create potential liability for directors. Directors have a fiduciary duty to act with care and diligence in the best interests of the company. This includes taking reasonable steps to mitigate cyber risks. Failure to do so could lead to personal liability for the directors if the company suffers losses as a result of the breach.

While there is no specific statute or regulation directly addressing cyber risk and director liability in Australia, the Corporations Act 2001 imposes general duties on directors, including the duty to act with care and diligence (section 180) and the duty to act in good faith in the best interests of the company (section 181). These duties can be applied to cyber risk management.

Recent regulatory guidance and case law have highlighted the importance of cyber security for directors. For example, the Australian Securities and Investments Commission (ASIC) has issued guidance on cyber security for directors, emphasizing the need for boards to understand and manage cyber risks. Additionally, courts have held directors liable for losses caused by a failure to adequately manage other types of risk, which could potentially be applied to cyber risk.

Therefore, it is crucial for directors to be proactive in managing cyber risk, including implementing appropriate security measures, regularly reviewing and updating those measures, and ensuring that the company has a robust incident response plan in place. By doing so, directors can help to mitigate their potential liability in the event of a cyber breach.

Why is the answer correct?

Gemini 1.5 scored 10 out of 10 for this response. It is not only correct, but covers all material issues raised by the query about directors' potential liability and duties in relation to cyber risk. In total, this answer scored 5/5 for substance, 3/3 for citations and 2/2 for clarity.

Why is Gemini 1.5 able to answer so well?

This is a good example of the ability of LLMs to summarise the law and related guidance in particular areas of law that are likely to be well known and discussed extensively on the internet (given they operate as an 'internet in your pocket').  

High scores across substance and citation set the top performers apart

  • LLMs that scored well generally scored well across all three scoring components (eg LLMs that produced strong substance scores also generally performed better on citations and clarity), and LLM rankings against each component generally followed the overall weighted ranking.
  • The average substance score of all the LLMs we tested was higher in 2025 (43 out of 5) than in 2024 (1.92 out of 5), confirming improvement across the board over the year, including with the arrival of a new generation of reasoning models (such as DeepSeek and OpenAI o1).
  • The clarity score also showed clear improvement since 2024. The difference in the scores of Gemini 1.5 compared with Gemini 2.0 clearly demonstrates the continuing improvement in that respect. However, we observed a concerning new development: compared with 2024, several of the LLMs appear to be tuned to adopt a more conversational tone, which they deployed despite our 'wrapper' prompt establishing the legal context. (Gemini 2.0's favoured opener was 'Righto, let's break down the [question]'.) An overly casual tone does not help to produce clear and rigorous legal analysis.
  • Citation performance was much more uneven, with average scores ranging from 1.19 (Gemini 1.5) to 1.60 (OpenAI o1) out of 3.

Average performance of LLMs by score category

Scoring rubrics

We marked the answers based on substance, citations and clarity. The questions were marked individually by our expert bench of senior lawyers.

Substance (5 marks)

We awarded a maximum of 5 marks for the substance of the answer – ie whether the answer was technically correct.

Citations (3 marks)

We awarded a maximum of 3 marks for correct references to cases, laws or guidance. The use of a single fictitious citation automatically leads to 0 marks being awarded.

Clarity (2 marks)

We awarded a maximum of 2 marks for the clarity of the answer.

Substance
  • 0: The response is entirely wrong.
  • 1: The response is generally wrong but contains some correct analysis.
  • 2: The response is generally accurate but contains a number of errors.
  • 3: The response is generally accurate but contains a small number of errors or fails to answer parts of the question.
  • 4: The response is generally accurate and covers most issues.
  • 5: The response is accurate and covers all material issues.

Citations
  • 0: The citations are fictional.
  • 1: The citations are incorrect.
  • 2: The citations are generally accurate but there are important omissions.
  • 3: Adequate and accurate citations are used.

Clarity
  • 0: The response is very difficult to read.
  • 1: The response is clear but not easy to read.
  • 2: The response is clear and easy to read.
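As a minimal sketch of how the three rubric components combine into a mark out of 10 (the function and field names below are our own, and the fictitious-citation rule is applied as described above):

```python
# Illustrative only: assembles a total benchmark score out of 10 from the three rubric components.
from dataclasses import dataclass

@dataclass
class MarkedResponse:
    substance: int                    # 0-5
    citations: int                    # 0-3
    clarity: int                      # 0-2
    has_fictitious_citation: bool = False

def total_score(m: MarkedResponse) -> int:
    """Sum of the components; any fictitious citation automatically zeroes the citation marks."""
    citations = 0 if m.has_fictitious_citation else m.citations
    return m.substance + citations + m.clarity

print(total_score(MarkedResponse(substance=5, citations=3, clarity=2)))   # 10: a 'perfect' response
print(total_score(MarkedResponse(substance=4, citations=2, clarity=2,
                                 has_fictitious_citation=True)))          # 6: citation marks forfeited
```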

 

We (still) need to talk about citations

The LLMs we tested produced a variety of incorrect, irrelevant and fictitious citations, which highlights the need for human verification of any citations produced. 

Citation holds back all models, but particularly Claude and Gemini 1.5

Gemini 1.5 was a clear outlier in citation performance, scoring only 0.64 out of 3 on average.

By contrast, Gemini 2.0 demonstrated significant improvement, scoring 1.21 out of 3. The non-Gemini models' average scores ranged between 1.16 (Claude 3.5 Sonnet) and 1.62 (GPT-4o).

Against this background of overall improvement, it is remarkable (and concerning) that all LLMs are still prone to incorrect or fictitious citations.

GPT-4o was the only model to score higher than 50% on citation (ie it was mostly correct, most of the time) and only provided fictitious citations in one instance. However, 46% of its responses scored a 1 out of 3 due to several problems it shares with a number of its competitors.

Performance of GPT-4o on citation

GPT-4o was the best performer and provided fictitious citations in only one instance. However, 46% of its responses scored 1 out of 3 due to its tendency to cite secondary sources (such as blog posts or law firm websites) without citing the primary case authority or legislation and, where it did cite the correct legislation, to omit section references.


Citations for the wrong jurisdiction

General purpose LLMs continue to struggle with the importance of jurisdiction. Some of the citations that appeared in answers were real, but reflected the law of another jurisdiction without recognising the difference in law. For example, question 5 of the Allens AI Australian law benchmark asks the LLM to recall the principles applied by the court when interpreting a hypothetical patent claim in light of advancements in gene-editing technology, and what criteria should be used to determine whether a particular organism falls within the scope of the patented invention. Nearly all responses by all LLMs referred in some way to the 'doctrine of equivalents', which is part of patent law in jurisdictions such as the US or the UK, but not Australia. In one instance, GPT-4o (with web search not enabled) correctly recognised that the doctrine is not part of Australian law, but still attempted to shoehorn it into the analysis by claiming (without citation) that a Canadian case on the subject 'has influenced Australian jurisprudence'. Some other responses simply hallucinated fictitious Australian cases to support the application of the doctrine of equivalents.

Not accurate enough

In some instances, the source of law, such as the relevant legislation, is correctly identified but the pinpoint to the particular section is missing or incorrect. For example, a bare reference to the Copyright Act, the Competition and Consumer Act or the Personal Property Securities Act may not be sufficient to provide a useful legal analysis.

Flights of fancy

Despite the request to provide citations, in 23% of answers the underlying case law or legislation was either completely absent or just made up.

Quantity over quality

Interestingly, all the newer models appear to have addressed the issue of citation and now lean towards overproducing rather than underproducing citations. However, LLMs generally seem to struggle with understanding the authority of different sources of information, and real-time web search access did not necessarily improve citation quality. For instance, in response to a contract law question firmly set in Parramatta, New South Wales, Gemini 2.0 relied mostly on a Canadian crowd-sourced legal text called 'Clicklaw Wikibook' and cited only one primary source: a 1962 English case (Answer 1 to Q2). While DeepSeek and Gemini 2.0 frequently produced more citations than OpenAI o1, these were mostly irrelevant and erroneous, which diluted the clarity of otherwise accurate answers. It is possible that OpenAI o1's chain-of-thought reasoning contributes more to citation accuracy than grounding in real-time web searches, which may introduce errors into an otherwise adequate pool of training data.

Rise of the reasoning models

Reasoning models performed well across the board, especially on 'difficult' specialist topics

In late 2024 and early 2025, the launches of OpenAI o1 and DeepSeek R1 marked the introduction of new families of 'reasoning models' optimised for solving complex problems using step-by-step 'chain of thought' analysis. A useful feature is that these models share their analysis with the user, instead of simply producing an answer without an explanation. This makes it easier to identify mistakes in the AI's reasoning process, and allows the AI tool to backtrack and correct steps in its own reasoning.
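As a generic illustration only (this is not the benchmark's actual prompt, and no particular vendor's API is shown), the difference between asking for a bare answer and asking for visible chain-of-thought reasoning can be as simple as a prompt wrapper plus a parser that separates the reasoning from the conclusion:

```python
# Illustrative only: a generic chain-of-thought style prompt wrapper and a simple parser.
def build_cot_prompt(question: str) -> str:
    return (
        "You are answering a question of Australian law.\n"
        "First set out your reasoning step by step under the heading 'Reasoning:'.\n"
        "Then state your conclusion under the heading 'Answer:'.\n\n"
        f"Question: {question}"
    )

def split_reasoning(response: str) -> tuple[str, str]:
    """Split a response of the form 'Reasoning: ... Answer: ...' into its two parts."""
    reasoning, _, answer = response.partition("Answer:")
    return reasoning.removeprefix("Reasoning:").strip(), answer.strip()

# Made-up example output in the expected format:
demo = "Reasoning: Section 180 imposes a duty of care and diligence.\nAnswer: Directors may be personally liable."
print(split_reasoning(demo))
```

Exposing the intermediate reasoning in this way is what makes it possible to spot where an analysis goes wrong, rather than only seeing the final conclusion.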

This year's results demonstrate that the OpenAI o1 and DeepSeek R1 reasoning models performed well across the board, especially in 'difficult' areas where most LLMs struggled, such as tax, employment and dispute resolution.

However, this advantage was not absolute. Notably, GPT-4o significantly outperformed the later OpenAI o1 on banking and real estate law queries. Similarly, despite lower overall scores, Claude 3.5 Sonnet and (surprisingly, given its relative age) Gemini 1.5 significantly outperformed all other LLMs on corporate law questions. These results illustrate that reasoning models do not necessarily outperform when executing legal analysis, and different models may perform better when answering different types of questions.

 

Variability

When aggregated by subject area, it is notable that the spread of substance scores achieved by the LLMs varies significantly between practice groups.

Distillation

Another topical aspect of LLMs is distillation—training a model on the behaviour of a larger model, to mimic its capabilities in a smaller, more efficient model.
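As a generic sketch of the technique (not a description of how DeepSeek was actually built), distillation trains a smaller 'student' model to match the output distribution of a larger 'teacher' model; the example below uses plain NumPy on a single batch of made-up logits.

```python
# Illustrative only: a minimal knowledge-distillation loss on one batch of made-up logits.
import numpy as np

def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's softened distribution and the student's."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return float(-(p_teacher * np.log(p_student + 1e-12)).sum(axis=-1).mean())

teacher = np.array([[4.0, 1.0, 0.5], [0.2, 3.5, 0.1]])   # made-up teacher outputs
student = np.array([[3.0, 1.5, 0.5], [0.5, 3.0, 0.4]])   # made-up student outputs
print(distillation_loss(student, teacher))               # training the student minimises this value
```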

DeepSeek's distillation of larger models is well publicised, and during our testing we observed evidence of it. For example, DeepSeek showed remarkable similarities to GPT-4o with web search disabled, in terms of both the structure and substance of its answers for a significant number of questions. The similarity in the two models' behaviour was even greater than the similarity between different versions of GPT LLMs.

Query: in Australia, to what extent is a party’s subjective intention relevant when interpreting a contract?
GPT-4o (web search disabled)

In Australia, a party's **subjective intention** is generally **not relevant** when interpreting a contract. Instead, the courts apply an **objective test** to determine the meaning of the contract. This approach focuses on what a reasonable person in the position of the parties would have understood the terms to mean, based on the **language of the contract** and the **surrounding circumstances known to both parties** at the time the contract was made.

Key Principles

  • Objective Intention: The courts seek to ascertain the parties' intention as expressed in the words of the contract, rather than their subjective, uncommunicated intentions. (See *Toll (FGCT) Pty Ltd v Alphapharm Pty Ltd* (2004) 219 CLR 165 at [40]).
  • Plain Meaning Rule: Words are given their ordinary and natural meaning unless it is clear from the context that they should be interpreted differently.
  • Context and Surrounding Circumstances: Courts can consider the surrounding circumstances (also called the "factual matrix") to shed light on the objective meaning of the terms but only to the extent that they are known to both parties. (*Codelfa Construction Pty Ltd v State Rail Authority of NSW* (1982) 149 CLR 337).
  • Exclusion of Subjective Intention: Evidence of a party's subjective intention or undisclosed beliefs is inadmissible as it does not assist in determining the objective meaning of the agreement. This was affirmed in *Pacific Carriers Ltd v BNP Paribas* (2004) 218 CLR 451.

Exceptions

  • Where the contract is **ambiguous**, courts may look more closely at surrounding circumstances.
  • In cases of **rectification**, subjective intention may be relevant to prove a common intention that differs from the written agreement.

Conclusion

The Australian approach prioritises an **objective assessment** of the parties' expressed intentions over their subjective beliefs, ensuring consistency and predictability in contract interpretation.

DeepSeek

In Australian contract law, the interpretation of a contract is primarily governed by the objective theory of contract, which focuses on the outward manifestations of the parties' intentions rather than their subjective intentions. The High Court of Australia has consistently affirmed this principle in cases such as *Toll (FGCT) Pty Ltd v Alphapharm Pty Ltd* (2004) 219 CLR 165 and *Electricity Generation Corporation v Woodside Energy Ltd* (2014) 251 CLR 640.

Key Principles
  1. Objective Intention: The court seeks to determine what a reasonable person in the position of the parties would have understood the terms of the contract to mean, based on the language used in the contract and the surrounding circumstances known to both parties at the time of contracting (*Codelfa Construction Pty Ltd v State Rail Authority of NSW* (1982) 149 CLR 337).
  2. Subjective Intention Generally Irrelevant: A party’s subjective intention (what they secretly intended or believed) is generally irrelevant unless it is reflected in the express terms of the contract or the surrounding circumstances. The court does not inquire into what a party subjectively intended unless there is ambiguity or evidence of a common understanding (*Pacific Carriers Ltd v BNP Paribas* (2004) 218 CLR 451).
Exceptions
  • Ambiguity: If the contract is ambiguous, evidence of subjective intention may be admissible to resolve the ambiguity, but only if it is consistent with the objective framework of interpretation (*Western Export Services Inc v Jireh International Pty Ltd* (2011) 282 ALR 604).
  • Common Intention: If both parties shared a common subjective understanding of a term, this may be relevant if it aligns with the objective interpretation (*Brambles Holdings Ltd v Bathurst City Council* (2001) 53 NSWLR 153).  
Relevant Statutes

While the common law governs contract interpretation, statutes such as the *Australian Consumer Law* (Schedule 2 of the *Competition and Consumer Act 2010* (Cth)) may influence the interpretation of consumer contracts, particularly in cases involving unfair terms or misleading conduct.

Practical Takeaway

In most cases, a party’s subjective intention is irrelevant unless it is objectively manifested in the contract or the surrounding circumstances. Courts prioritise the plain meaning of the words used and the context in which the contract was formed.