The Allens AI Australian Law Benchmark

About the research

Launched in 2024, the Allens AI Australian Law Benchmark assesses how effectively LLMs can provide legal guidance.

Which LLMs were tested?

We tested the following publicly available LLMs:

  • Anthropic's Claude 3.5 Sonnet, tested via the Claude chatbot;
  • DeepSeek's DeepSeek R1, tested via the DeepSeek chatbot;
  • Google's Gemini 2.0 and Gemini 1.5, tested via the Gemini web chatbot;
  • OpenAI's o1, tested via OpenAI's web chatbot;
  • OpenAI's GPT-4o (with web searching enabled and disabled), tested via OpenAI's web chatbot.
During the course of testing, several additional LLMs were released, including Claude 3.7 Sonnet (released February 2025), GPT-4.1 (released April 2025), OpenAI o4-mini (research release April 2025) and Gemini 2.5 (released March 2025). Due to time constraints, these have been excluded from the scope of the 2025 report.

Methodology

  • A consistent set of questions across 10 practice areas, with responses evaluated by 23 senior lawyers.
  • Practice areas: contract law, intellectual property, data privacy, employment, real estate, dispute resolution, corporate, competition, tax and banking.
  • Marking: each LLM is scored out of 10 points: up to 5 for substance, up to 3 for citations and up to 2 for clarity (see the illustrative sketch below).
Further details: the marking rubric and methodology are available [Link: here].
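By way of illustration only, the minimal sketch below shows how a single response's marks could be tallied under this rubric. The component names and point caps (5 for substance, 3 for citations, 2 for clarity) come from the marking scheme above; the function, data structure and example marks are purely hypothetical and do not reflect the report's actual scoring workflow.

```python
# Hypothetical illustration only: tallying one response's marks under the rubric
# described above (up to 5 points for substance, 3 for citations, 2 for clarity).
RUBRIC_CAPS = {"substance": 5, "citations": 3, "clarity": 2}  # maximum of 10 points

def total_score(marks: dict) -> float:
    """Sum the component marks, capping each component at its rubric maximum."""
    return sum(min(marks.get(component, 0), cap) for component, cap in RUBRIC_CAPS.items())

# Example: 4/5 for substance, 2/3 for citations and 2/2 for clarity -> 8/10
print(total_score({"substance": 4, "citations": 2, "clarity": 2}))  # prints 8
```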

LLMs not tested

Notably, the 2025 edition excludes various legal-industry-specific tools such as Harvey, Thomson Reuters's CoCounsel and LexisNexis's Lexis+ AI.

While Allens continues to benchmark and leverage several of these (and other) products, none are included in this report. This is to avoid drawing broad performance comparisons between general-purpose products and products that have been optimised for specific use cases or user interactions, as well as (increasingly) 'multi-model' products that use different AI models for different tasks. For example, our use of 'single prompt' questioning to ensure consistency between LLMs may not best demonstrate the capabilities of products that offer multiple structured pathways for answering the same question.