The Allens AI Australian law benchmark

Introduction

Result summary

The 2025 results for the Allens AI Australian law benchmark are in, and they show significant leaps forward in the legal performance of large language models (LLMs). For this year's report, we tested seven market-leading LLMs against our benchmark of 30 questions and assessed their responses against the standard expected of an Australian-qualified lawyer with five years' post-admission experience. Responses were evaluated on substance, citations and clarity by a panel of 23 senior Allens legal experts in April 2025.

The background to our benchmark, together with our methodological observations (including comparisons with other benchmarks), is set out in our methodology.

What were the results?

  • GPT-4o, OpenAI o1 and DeepSeek R1 each bettered 2024's highest score (achieved by GPT-4), particularly on substance scores.
  • While all tested LLMs achieved good scores for some questions, the quality of the responses of all LLMs varied across the full question set.
  • Models that performed well generally did so across a variety of practice areas, although some LLMs excelled in particular ones. GPT-4o significantly outperformed OpenAI o1 on banking and real estate law queries. Despite lower overall scores, Claude 3.5 Sonnet and Gemini 1.5 excelled in corporate law questions.
  • Compared with our 2024 results, the LLMs generally produced higher average citation scores; however, all LLMs on occasion produced citations that were incorrect (a citation score of 1 out of 3) or fictional (a citation score of 0 out of 3).
  • Head-to-head pairings offer insights into the relative performance of reasoning models versus 'traditional' LLMs: GPT-4o and o1 produced similar results, and GPT-4o, o1 and DeepSeek R1 outperformed all other models.
  • Reasoning models were not necessarily better at all types of legal questions, and models tuned to overproduce citations compromised on the clarity of their responses.
  • Performance gains in newer models, and newer generations of the same model (eg between Gemini 1.5 and 2.0), were not linear.
  • A comparison of GPT-4o's performance with web search enabled and disabled illustrates one of a number of factors that businesses should weigh when considering an internal deployment of generative AI tools, balanced against the key advantages of security and the possibility of fine-tuning a model to best suit the specific needs of the business.

Key takeaways: What the results mean for legal practice

The newest generation of LLMs exhibit a significant leap forward in performance when compared with 2024.    

Despite improved performance, achieving consistently high-quality results remains a challenge.

Citation accuracy remains a significant area for improvement.

'Cross-jurisdiction infection' from larger jurisdictions with different laws remains a problem for smaller jurisdictions such as Australia.  

While LLMs currently require lawyer oversight, strong performance across various areas of legal practice offers the possibility of broad applications.    

When incorporating LLMs into their practice, litigators, general counsel and other lawyers should continue to carefully consider how to best validate and verify LLM responses.

Methodology

Launched in 2024, the Allens AI Australian law benchmark assesses the ability of LLMs to provide legal guidance effectively.  

Which LLMs were tested?

We tested the following publicly available LLMs:

  • Anthropic's Claude 3.5 Sonnet, tested via Claude chatbot
  • DeepSeek's DeepSeek R1 model, tested via DeepSeek chatbot
  • Google's Gemini 2.0 and Gemini 1.5, tested via Gemini web chatbot
  • OpenAI's o1, tested via OpenAI's web chatbot
  • OpenAI's GPT-4o (with web searching enabled and disabled), tested via OpenAI's web chatbot.

While our assessment was being completed, several additional LLMs were released, including Claude 4.0 Sonnet (released May 2025), GPT-4.1 (released April 2025), OpenAI o4-mini (research release April 2025) and Gemini 2.5 (released March 2025). Due to time constraints, these have been excluded from the scope of the 2025 report.

The methodology
  • Consistent questions year on year: the same 30 questions across 10 practice areas, evaluated by 23 senior lawyers.
  • Practice areas: contract law, intellectual property, data privacy, employment, real estate, dispute resolution, corporate, competition, tax and banking.
  • Marking: each LLM response is scored out of a total of 10 points, with marks allocated as follows: up to 5 points for substance, up to 3 points for citations, and up to 2 points for clarity (an illustrative sketch of this scoring arithmetic appears below).

  Further details are available here.
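
As a minimal illustration of the marking arithmetic only, the Python sketch below shows how per-question marks could be combined into a total out of 10 and averaged across the question set. The data structure, function names and example figures are hypothetical and do not form part of the benchmark tooling.

    from dataclasses import dataclass

    # Illustrative only: the marking scheme allocates up to 5 points for substance,
    # 3 for citations and 2 for clarity, for a total of 10 per question.
    MAX_SUBSTANCE, MAX_CITATIONS, MAX_CLARITY = 5, 3, 2

    @dataclass
    class QuestionScore:
        substance: int  # 0-5
        citations: int  # 0-3 (per the report's scale, 1 = incorrect citation, 0 = fictional citation)
        clarity: int    # 0-2

        def total(self) -> int:
            # A response's mark for one question, out of 10.
            assert 0 <= self.substance <= MAX_SUBSTANCE
            assert 0 <= self.citations <= MAX_CITATIONS
            assert 0 <= self.clarity <= MAX_CLARITY
            return self.substance + self.citations + self.clarity

    def average_score(scores: list[QuestionScore]) -> float:
        """Average mark (out of 10) across a set of questions, eg the 30-question benchmark."""
        return sum(s.total() for s in scores) / len(scores)

    # Hypothetical example: one model marked on two questions.
    print(average_score([QuestionScore(4, 3, 2), QuestionScore(3, 1, 2)]))  # 7.5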

Which LLMs were excluded?

Notably, the 2025 edition excludes various legal industry-specific tools such as Harvey, Thomson Reuters' CoCounsel and LexisNexis' Lexis+ AI.

While Allens continues to benchmark and leverage several of these (and other) products, none is included in this report. This is to avoid drawing broad performance comparisons between general-purpose products and products that have been optimised for specific use cases or user interactions, as well as (increasingly) 'multi-model' products that use different AI models for different tasks. Our use of 'single prompt' questioning to ensure consistency between LLMs, for example, may not best demonstrate the capabilities of products that offer multiple structured pathways for answering the same question.

Disclaimer

No reliance should be placed on any LLMs or their responses, even in cases where specific answers or LLM performances have been positively marked. They are for general information purposes only and do not claim to be comprehensive or provide legal or other advice.

Similarly, we understand that the providers of the LLMs discussed in this report do not recommend their products be used for Australian law advice, and that the output of those LLMs is provided on an 'as is' basis.