The Allens AI Australian law benchmark

The results

Some impressive high-quality scores

  • Each LLM produced high-quality responses to at least some legal queries. We classify an LLM's performance on a particular question as 'high quality' if it scores an average of 8 marks or higher out of 10 across its three attempts at the same question (a short illustration of this calculation follows this list).
  • Achieving a high-quality response against a question generally required strong scores in each of the three components of the Allens AI Benchmark: substance, citations and clarity.
  • OpenAI o1 and GPT-4o produced the highest number of high-quality responses, each scoring above 8 on 6 of the 30 benchmark questions, although DeepSeek was the only LLM to achieve a perfect average of 10 out of 10 on any benchmark question.
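As a purely illustrative sketch (the data layout, names and example numbers below are our own assumptions, not part of the benchmark), the 'high quality' classification can be computed as a simple average across the three attempts at each question:

```python
# Illustrative only: assumes scores are stored as
# {llm_name: {question_id: [attempt_1, attempt_2, attempt_3]}}, each attempt marked out of 10.
from statistics import mean

HIGH_QUALITY_THRESHOLD = 8  # an average of 8/10 or higher across the three attempts

def high_quality_counts(scores: dict[str, dict[str, list[float]]]) -> dict[str, int]:
    """Count, for each LLM, the questions on which its averaged score is 'high quality'."""
    return {
        llm: sum(1 for attempts in questions.values() if mean(attempts) >= HIGH_QUALITY_THRESHOLD)
        for llm, questions in scores.items()
    }

# Example with made-up numbers: only Q1 averages 8 or higher.
example = {"Model A": {"Q1": [9, 8, 8], "Q2": [6, 7, 5]}}
print(high_quality_counts(example))  # {'Model A': 1}
```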

Performance across practice areas

Each LLM was able to produce one-off high-scoring responses across several legal subject areas (see the below graph, showing the highest score each LLM achieved on any attempt at a question in each subject area tested). The broad distribution of one-off high-quality scores across a variety of practice areas demonstrates the clear potential for the adoption of LLMs as part of legal practice—at least in some areas—if reliability, citation and other issues can be effectively managed.

Quality varies significantly, but the best responses continue to improve

  • While all LLMs produced high-quality responses to some legal queries, average performance across the full set of Allens AI Australian law benchmark questions varied. OpenAI o1 and GPT-4o returned the highest average response scores overall.
  • DeepSeek outperformed both Claude 3.5 and Gemini 2.0; however, this is perhaps to be expected, given its later release date (in 2025) and its widely reported 'distillation' of the existing models. DeepSeek's competitive performance is also notable in view of its widely reported low development and running cost.
  • Overall, GPT-4o performed less effectively when its web search function was disabled. However, the advantage of web search was not apparent for every question: in some instances, web results appeared to dilute the quality of the responses.
  • The overall variability of each LLM's responses is shown on the box and whisker graph below. On the graph, the 'boxes' show the middle 50% of each LLM's scores (the interquartile range), while the 'whiskers' extending from the boxes show the spread of the remaining scores. A tighter spread of scores indicates that an LLM scored more consistently. Interestingly, while GPT-4o produced a slightly higher average score than OpenAI o1, the latter performed more consistently.
  • On one measure (relative standard deviation), OpenAI o1, GPT-4o and DeepSeek were also more consistent than the other LLMs, ie their scores were clustered more closely around the average (a short illustration of these summary statistics follows this list).
  • At the lower end, most LLMs occasionally produced responses that were mostly or completely incorrect, including because the LLM got a fundamental legal principle wrong, applied an overseas jurisdiction's law instead of Australian law, or was too confident or eager. The LLMs generally tended to confirm the user's input query rather than correct it, and several overstated the level of confidence in their advice.
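The summary statistics referred to above can be illustrated with a short sketch. This is an assumption-laden example (the data layout and numbers are ours, not the benchmark's): each LLM's scores are reduced to a mean, the quartiles that define the 'box' in a box-and-whisker plot, and a relative standard deviation.

```python
# Illustrative only: 'scores' is assumed to be a flat list of one LLM's marks out of 10.
from statistics import mean, stdev, quantiles

def summarise(scores: list[float]) -> dict[str, float]:
    """Mean, quartiles (the 'box' of a box-and-whisker plot) and relative standard deviation."""
    q1, median, q3 = quantiles(scores, n=4)   # 25th, 50th and 75th percentiles
    avg = mean(scores)
    rsd = stdev(scores) / avg * 100           # relative standard deviation, as a percentage of the mean
    return {"mean": avg, "q1": q1, "median": median, "q3": q3, "rsd_%": rsd}

# Made-up example scores for a single LLM:
print(summarise([8, 7.5, 9, 4, 6, 8.5, 7, 5.5, 9.5, 6.5]))
```

A lower relative standard deviation corresponds to the tighter clustering around the average described above.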

A perfect response from Gemini 1.5

Query: a company has suffered a cyber breach incident. Does this create any potential liability for the directors? Is there a fiduciary duty that relates to cyber risks?

Gemini 1.5 response:

Yes, a cyber breach incident can create potential liability for directors. Directors have a fiduciary duty to act with care and diligence in the best interests of the company. This includes taking reasonable steps to mitigate cyber risks. Failure to do so could lead to personal liability for the directors if the company suffers losses as a result of the breach.

While there is no specific statute or regulation directly addressing cyber risk and director liability in Australia, the Corporations Act 2001 imposes general duties on directors, including the duty to act with care and diligence (section 180) and the duty to act in good faith in the best interests of the company (section 181). These duties can be applied to cyber risk management.

Recent regulatory guidance and case law have highlighted the importance of cyber security for directors. For example, the Australian Securities and Investments Commission (ASIC) has issued guidance on cyber security for directors, emphasizing the need for boards to understand and manage cyber risks. Additionally, courts have held directors liable for losses caused by a failure to adequately manage other types of risk, which could potentially be applied to cyber risk.

Therefore, it is crucial for directors to be proactive in managing cyber risk, including implementing appropriate security measures, regularly reviewing and updating those measures, and ensuring that the company has a robust incident response plan in place. By doing so, directors can help to mitigate their potential liability in the event of a cyber breach.

Why is the answer correct?

Gemini 1.5 scored 10 out of 10 for this response. It is not only correct, but covers all material issues raised by the query about directors' potential liability and duties in relation to cyber risk. In total, this answer scored 5/5 for substance, 3/3 for citations and 2/2 for clarity.

Why is Gemini 1.5 able to answer so well?

This is a good example of the ability of LLMs to summarise the law and related guidance in particular areas of law that are likely to be well known and discussed extensively on the internet (given they operate as an 'internet in your pocket').  

High scores across substance and citation set the top performers apart

  • LLMs that scored well generally scored well across all three scoring components (eg LLMs that produced strong substance scores also generally performed better on citations and clarity), and LLM rankings against each component generally followed the overall weighted ranking.
  • The average substance score of all the LLMs we tested was higher in 2025 (43 out of 5) than in 2024 (1.92 out of 5), confirming improvement across the board over the year, including with the arrival of a new generation of reasoning models (such as DeepSeek and OpenAI o1).
  • The clarity score also showed clear improvement since 2024. The difference in the scores of Gemini 1.5 compared with Gemini 2.0 clearly demonstrates the continuing improvement in that respect. However, we observed a concerning new development: compared with 2024, several of the LLMs appear to be tuned to adopt a more conversational tone, which they deployed despite our 'wrapper' prompt establishing the legal context. (Gemini 2.0's favoured opener was 'Righto, let's break down the [question]'.) An overly casual tone does not help to produce clear and rigorous legal analysis.
  • Citation performance was much more uneven, with average scores ranging from 1.19 (Gemini 1.5) to 1.60 (OpenAI o1) out of 3.

Average performance of LLMs by score category

Scoring rubrics

We marked the answers based on substance, citations and clarity. The questions were marked individually by our expert bench of senior lawyers.

Substance (5 marks)

We awarded a maximum of 5 marks for the substance of the answer – ie whether the answer was technically correct.

Citations (3 marks)

We awarded a maximum of 3 marks for correct references to cases, laws or guidance. The use of a single fictitious citation automatically leads to 0 marks being awarded.

Clarity (2 marks)

We awarded a maximum of 2 marks for the clarity of the answer.

Substance
  • 0: The response is entirely wrong.
  • 1: The response is generally wrong but contains some correct analysis.
  • 2: The response is generally accurate but contains a number of errors.
  • 3: The response is generally accurate but contains a small number of errors or fails to answer parts of the question.
  • 4: The response is generally accurate and covers most issues.
  • 5: The response is accurate and covers all material issues.

Citations
  • 0: The citations are fictional.
  • 1: The citations are incorrect.
  • 2: The citations are generally accurate but there are important omissions.
  • 3: Adequate and accurate citations are used.

Clarity
  • 0: The response is very difficult to read.
  • 1: The response is clear but not easy to read.
  • 2: The response is clear and easy to read.
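As a minimal sketch of how the three rubric components combine into a mark out of 10 (the function and field names below are our own, and the fictitious-citation rule is applied as described above):

```python
# Illustrative only: assembles a total benchmark score out of 10 from the three rubric components.
from dataclasses import dataclass

@dataclass
class MarkedResponse:
    substance: int                    # 0-5
    citations: int                    # 0-3
    clarity: int                      # 0-2
    has_fictitious_citation: bool = False

def total_score(m: MarkedResponse) -> int:
    """Sum of the components; any fictitious citation automatically zeroes the citation marks."""
    citations = 0 if m.has_fictitious_citation else m.citations
    return m.substance + citations + m.clarity

print(total_score(MarkedResponse(substance=5, citations=3, clarity=2)))   # 10: a 'perfect' response
print(total_score(MarkedResponse(substance=4, citations=2, clarity=2,
                                 has_fictitious_citation=True)))          # 6: citation marks forfeited
```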

 

We (still) need to talk about citations

The LLMs we tested produced a variety of incorrect, irrelevant and fictitious citations, which highlights the need for human verification of any citations produced. 

Citation holds back all models, but particularly Claude and Gemini 1.5

Gemini 1.5 was a clear outlier in citation performance, scoring only 0.64 out of 3 on average.

By contrast, Gemini 2.0 demonstrated significant improvement, scoring 1.21 out of 3. The non-Gemini models' average scores ranged between 1.16 (Claude 3.5 Sonnet) and 1.62 (GPT-4o).

Against this background of overall improvement, it is remarkable (and concerning) that all LLMs are still prone to incorrect or fictitious citations.

GPT-4o was the only model to score higher than 50% on citation (ie it was mostly correct, most of the time) and only provided fictitious citations in one instance. However, 46% of its responses scored a 1 out of 3 due to several problems it shares with a number of its competitors.

Performance of GPT-4o on citation

GPT-4o was the best performer and provided fictitious citations in only one instance. However, 46% of its responses scored 1 out of 3 due to its tendency to cite secondary sources (such as blog posts or law firm websites) without citing the primary case authority or legislation and, where it did cite the correct legislation, to omit section references.


Citations for the wrong jurisdiction

General purpose LLMs continue to struggle with the importance of jurisdiction. Some of the citations that appeared in answers were real, but reflected the law of another jurisdiction without recognising the difference in law. For example, question 5 of the Allens AI Australian law benchmark asks the LLM to recall the principles applied by the court when interpreting a hypothetical patent claim in light of advancements in gene-editing technology, and what criteria should be used to determine whether a particular organism falls within the scope of the patented invention. Nearly all responses by all LLMs referred in some way to the 'doctrine of equivalents', which is part of patent law in jurisdictions such as the US or the UK, but not Australia. In one instance, GPT-4o (with web search not enabled) correctly recognised that the doctrine is not part of Australian law, but still attempted to shoehorn it into the analysis by claiming (without citation) that a Canadian case on the subject 'has influenced Australian jurisprudence'. Some other responses simply hallucinated fictitious Australian cases to support the application of the doctrine of equivalents.

Not accurate enough

In some instances, the source of law, such as the relevant legislation, is correctly identified but the pinpoint to the particular section is missing or incorrect. For example, a bare reference to the Copyright Act, the Competition and Consumer Act or the Personal Property Securities Act may not be sufficient to provide a useful legal analysis.

Flights of fancy

Despite the request to provide citations, in 23% of answers the underlying case law or legislation was either completely absent or just made up.

Quantity over quality

Interestingly, all the newer models appear to have addressed the issue of citation and now lean towards overproducing rather than underproducing citations. However, LLMs generally seem to struggle with understanding the authority of different sources of information, and real-time web search access did not necessarily improve citation quality. For instance, in response to a contract law question firmly set in Parramatta, New South Wales, Gemini 2.0 relied mostly on a Canadian crowd-sourced legal text called 'Clicklaw Wikibook' and cited only one primary source: a 1962 English case (Answer 1 to Q2). While DeepSeek and Gemini 2.0 frequently produced more citations than OpenAI o1, these were mostly irrelevant and erroneous, which diluted the clarity of otherwise accurate answers. It is possible that OpenAI o1's chain-of-thought reasoning contributes more to citation accuracy than grounding in real-time web searches, which may introduce errors into an otherwise adequate pool of training data.

Rise of the reasoning models

Reasoning models performed well across the board, especially on 'difficult' specialist topics

In late 2024 and early 2025, the launches of OpenAI o1 and DeepSeek R1 marked the introduction of new families of 'reasoning models' optimised for solving complex problems using step-by-step 'chain of thought' analysis. A useful feature is that these models share their analysis with the user, instead of simply producing an answer without an explanation. This makes it easier to identify mistakes in the AI's reasoning process, and allows the AI tool to backtrack and correct steps in its own reasoning.
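As a generic illustration only (this is not the benchmark's actual prompt, and no particular vendor's API is shown), the difference between asking for a bare answer and asking for visible chain-of-thought reasoning can be as simple as a prompt wrapper plus a parser that separates the reasoning from the conclusion:

```python
# Illustrative only: a generic chain-of-thought style prompt wrapper and a simple parser.
def build_cot_prompt(question: str) -> str:
    return (
        "You are answering a question of Australian law.\n"
        "First set out your reasoning step by step under the heading 'Reasoning:'.\n"
        "Then state your conclusion under the heading 'Answer:'.\n\n"
        f"Question: {question}"
    )

def split_reasoning(response: str) -> tuple[str, str]:
    """Split a response of the form 'Reasoning: ... Answer: ...' into its two parts."""
    reasoning, _, answer = response.partition("Answer:")
    return reasoning.removeprefix("Reasoning:").strip(), answer.strip()

# Made-up example output in the expected format:
demo = "Reasoning: Section 180 imposes a duty of care and diligence.\nAnswer: Directors may be personally liable."
print(split_reasoning(demo))
```

Exposing the intermediate reasoning in this way is what makes it possible to spot where an analysis goes wrong, rather than only seeing the final conclusion.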

This year's results demonstrate that the OpenAI o1 and DeepSeek R1 reasoning models performed well across the board, especially in 'difficult' areas where most LLMs struggled, such as tax, employment and dispute resolution.

However, this advantage was not absolute. Notably, GPT-4o significantly outperformed the later OpenAI o1 on banking and real estate law queries. Similarly, despite lower overall scores, Claude 3.5 Sonnet and (surprisingly, given its relative age) Gemini 1.5 significantly outperformed all other LLMs on corporate law questions. These results illustrate that reasoning models do not necessarily outperform when executing legal analysis, and different models may perform better when answering different types of questions.

 

Variability

When aggregated by subject area, it is notable that the spread of substance scores achieved by the LLMs varies significantly between practice groups.

Distillation

Another topical aspect of LLMs is distillation—training a model on the behaviour of a larger model, to mimic its capabilities in a smaller, more efficient model.
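As a generic sketch of the technique (not a description of how DeepSeek was actually built), distillation trains a smaller 'student' model to match the output distribution of a larger 'teacher' model; the example below uses plain NumPy on a single batch of made-up logits.

```python
# Illustrative only: a minimal knowledge-distillation loss on one batch of made-up logits.
import numpy as np

def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's softened distribution and the student's."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return float(-(p_teacher * np.log(p_student + 1e-12)).sum(axis=-1).mean())

teacher = np.array([[4.0, 1.0, 0.5], [0.2, 3.5, 0.1]])   # made-up teacher outputs
student = np.array([[3.0, 1.5, 0.5], [0.5, 3.0, 0.4]])   # made-up student outputs
print(distillation_loss(student, teacher))               # training the student minimises this value
```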

DeepSeek's distillation of larger models is well publicised, and during our testing we observed evidence of it. For example, DeepSeek showed remarkable similarities to GPT-4o with web search disabled, in terms of both the structure and substance of its answers for a significant number of questions. The similarity in the two models' behaviour was even greater than the similarity between different versions of GPT LLMs.

Query: in Australia, to what extent is a party’s subjective intention relevant when interpreting a contract?
GPT-4o (web search disabled)

In Australia, a party's **subjective intention** is generally **not relevant** when interpreting a contract. Instead, the courts apply an **objective test** to determine the meaning of the contract. This approach focuses on what a reasonable person in the position of the parties would have understood the terms to mean, based on the **language of the contract** and the **surrounding circumstances known to both parties** at the time the contract was made.

Key Principles

  • Objective Intention: The courts seek to ascertain the parties' intention as expressed in the words of the contract, rather than their subjective, uncommunicated intentions. (See *Toll (FGCT) Pty Ltd v Alphapharm Pty Ltd* (2004) 219 CLR 165 at [40]).
  • Plain Meaning Rule: Words are given their ordinary and natural meaning unless it is clear from the context that they should be interpreted differently.
  • Context and Surrounding Circumstances: Courts can consider the surrounding circumstances (also called the "factual matrix") to shed light on the objective meaning of the terms but only to the extent that they are known to both parties. (*Codelfa Construction Pty Ltd v State Rail Authority of NSW* (1982) 149 CLR 337).
  • Exclusion of Subjective Intention: Evidence of a party's subjective intention or undisclosed beliefs is inadmissible as it does not assist in determining the objective meaning of the agreement. This was affirmed in *Pacific Carriers Ltd v BNP Paribas* (2004) 218 CLR 451.

Exceptions

  • Where the contract is **ambiguous**, courts may look more closely at surrounding circumstances.
  • In cases of **rectification**, subjective intention may be relevant to prove a common intention that differs from the written agreement.

Conclusion

The Australian approach prioritises an **objective assessment** of the parties' expressed intentions over their subjective beliefs, ensuring consistency and predictability in contract interpretation.

DeepSeek

In Australian contract law, the interpretation of a contract is primarily governed by the objective theory of contract, which focuses on the outward manifestations of the parties' intentions rather than their subjective intentions. The High Court of Australia has consistently affirmed this principle in cases such as *Toll (FGCT) Pty Ltd v Alphapharm Pty Ltd* (2004) 219 CLR 165 and *Electricity Generation Corporation v Woodside Energy Ltd* (2014) 251 CLR 640.

Key Principles
  1. Objective Intention: The court seeks to determine what a reasonable person in the position of the parties would have understood the terms of the contract to mean, based on the language used in the contract and the surrounding circumstances known to both parties at the time of contracting (*Codelfa Construction Pty Ltd v State Rail Authority of NSW* (1982) 149 CLR 337).
  2. Subjective Intention Generally Irrelevant: A party’s subjective intention (what they secretly intended or believed) is generally irrelevant unless it is reflected in the express terms of the contract or the surrounding circumstances. The court does not inquire into what a party subjectively intended unless there is ambiguity or evidence of a common understanding (*Pacific Carriers Ltd v BNP Paribas* (2004) 218 CLR 451).
Exceptions
  • Ambiguity: If the contract is ambiguous, evidence of subjective intention may be admissible to resolve the ambiguity, but only if it is consistent with the objective framework of interpretation (*Western Export Services Inc v Jireh International Pty Ltd* (2011) 282 ALR 604).
  • Common Intention: If both parties shared a common subjective understanding of a term, this may be relevant if it aligns with the objective interpretation (*Brambles Holdings Ltd v Bathurst City Council* (2001) 53 NSWLR 153).  
Relevant Statutes

While the common law governs contract interpretation, statutes such as the *Australian Consumer Law* (Schedule 2 of the *Competition and Consumer Act 2010* (Cth)) may influence the interpretation of consumer contracts, particularly in cases involving unfair terms or misleading conduct.

Practical Takeaway

In most cases, a party’s subjective intention is irrelevant unless it is objectively manifested in the contract or the surrounding circumstances. Courts prioritise the plain meaning of the words used and the context in which the contract was formed.