The Allens AI Australian law benchmark

Appendix: our methodology in detail

The Allens AI Australian law benchmark is an ongoing benchmarking project that assesses and tracks the capability of LLMs to deliver Australian law legal advice over time. Through a close examination of the results, we consider what opportunities exist today to implement GenAI in legal practice, current limitations and how human lawyers can best harness these capabilities to enhance and streamline legal workflows.

Our methodology

Each year, in conjunction with Linklaters' LinksAI English law benchmark, we ask leading publicly available LLMs to answer a set of 30 legal questions from 10 separate practice areas that an Australian-qualified lawyer may routinely be called upon to respond to or explain in the course of their legal practice. Each question has been selected to require an LLM to summarise an aspect of Australian law, to interpret and explain a contractual provision, or to undertake some blend of these tasks.

The 2025 edition of this benchmark asked the same questions as were posed in 2024, both to deliver an absolute assessment of LLM performance over time and to allow for the relative assessment of different LLMs at a given moment in time. For each question, all LLMs were primed with the following context-setting prompt:

  'You are an experienced Australian lawyer. Provide a concise answer to the question below applying Australian law. Cite any relevant statutes, regulation, guidance or case law'.

To control for random hallucination and variation in user queries and responses, each question was asked of each LLM three times as a 'single shot' query without further interrogation.
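
As an illustration only, the repeated single-shot querying could be orchestrated along the following lines. The query_llm helper and the run_benchmark function are hypothetical placeholders for whichever vendor SDKs are used; they are not part of the benchmark's published tooling.

```python
# Illustrative sketch only: how the repeated single-shot querying step
# could be orchestrated. query_llm() is a hypothetical wrapper around
# each vendor's own SDK and is not part of the benchmark's tooling.

CONTEXT_PROMPT = (
    "You are an experienced Australian lawyer. Provide a concise answer "
    "to the question below applying Australian law. Cite any relevant "
    "statutes, regulation, guidance or case law"
)

REPEATS = 3  # each question is asked three times to surface random hallucination


def query_llm(model: str, system_prompt: str, question: str) -> str:
    """Hypothetical placeholder: call the relevant vendor SDK here."""
    raise NotImplementedError("wire up the vendor-specific client")


def run_benchmark(models: list[str], questions: list[str]) -> dict[tuple[str, str], list[str]]:
    """Collect three independent single-shot responses per model per question."""
    responses: dict[tuple[str, str], list[str]] = {}
    for model in models:
        for question in questions:
            # Single shot: the question is asked afresh each time, with no
            # follow-up interrogation or refinement of the answer.
            responses[(model, question)] = [
                query_llm(model, CONTEXT_PROMPT, question) for _ in range(REPEATS)
            ]
    return responses
```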

Responses were graded by teams of Allens' knowledge management and subject-matter specialists against three scoring rubrics: substance (assessing the accuracy and correctness of each response), citations (assessing the quality, correctness and accuracy of the evidence cited by each LLM) and clarity (assessing how well the response explained the relevant issues). A central team moderated the scores awarded by different practice groups.
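
Purely to make that grading structure concrete, the sketch below shows one way the rubric marks and a simple moderation step could be recorded. The field names, numeric scale and averaging logic are illustrative assumptions and do not reflect the benchmark's actual marking scheme.

```python
from dataclasses import dataclass


@dataclass
class ResponseScore:
    """Illustrative record of one grader's marks for a single LLM response.

    The three rubric fields mirror the benchmark; the numeric scale is an
    assumption for this sketch, not the actual marking scheme.
    """
    llm: str
    question_id: str
    attempt: int        # 1-3, matching the three repeated askings
    substance: float    # accuracy and correctness of the answer
    citations: float    # quality and accuracy of the cited authority
    clarity: float      # how well the response explains the issues


def moderate(grader_scores: list[ResponseScore]) -> dict[str, float]:
    """Stand-in moderation step: average each rubric across graders.

    In the benchmark itself, a central team reviews the marks awarded by
    different practice groups; simple averaging is illustrative only.
    """
    n = len(grader_scores)
    return {
        "substance": sum(s.substance for s in grader_scores) / n,
        "citations": sum(s.citations for s in grader_scores) / n,
        "clarity": sum(s.clarity for s in grader_scores) / n,
    }
```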

Allens AI methodology

  • 30 difficult legal questions from across 10 separate practice areas
  • LLMs primed with a common context-setting prompt and queried with 'single prompt' questions
  • Questions repeated three times to control and test for random hallucination and variance
  • Scoring performed by Allens' knowledge management and subject area specialist lawyers
  • Responses scored against substance, citations and clarity marking rubrics
  • To control for variations in legal-specific LLM workflows and tailored use cases, the benchmark omits CoCounsel, Lexis+ AI, Harvey and similar legal LLM tools.

Comparison with other legal benchmarks

When we released the first edition of the Allens AI Australian law benchmark in 2024 alongside the LinksAI English law benchmark, our reports were among the few projects in the world that sought to provide measurable and repeatable tests to gauge the performance of generative AI in a legal context. In 2025, a number of new legal benchmarks have been launched that are of interest.

Our methodology distinguishes itself from other emerging benchmarks in a number of ways:

  • human grading: the emergence of LLM-as-judge frameworks for assessing LLM performance has allowed benchmarks to be compiled and updated quickly by removing human grading from the process. While the relative benefits of LLM-marked and human-marked frameworks have been discussed elsewhere, a human-marking process allows our benchmark to consider and distinguish the nuanced behaviours that arise in responses to each question.
  • Australian-law specific: grading by Australian law professionals has allowed us to monitor for the intrusion of international legal explanations and principles into Australian law answers. Other published benchmarks typically focus on the jurisprudence of larger jurisdictions such as the US, or assess performance against specific structured datasets that exist in those jurisdictions.
  • scaled question-by-question marking: grading by Allens' knowledge management and specialist teams allows our benchmark to ascribe a wider, analogue range of scores to each response (rather than a binary correct/incorrect assessment). This allows us to assess both the average quality of an LLM's responses (instead of its aggregate performance against a multi-question or multiple-choice dataset) and the degree to which each LLM's responses can be expected to vary in quality for any given question (illustrated in the sketch below this list).
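
To illustrate how analogue marks support both of those measures, the short sketch below computes an average-quality figure and a question-level spread from a set of invented marks. The data and scale are made up for the example and are not benchmark results.

```python
from statistics import mean, pstdev

# Invented marks for one LLM, one mark per repeated asking of each question,
# on an illustrative analogue scale. These are not benchmark results.
marks_by_question = {
    "Q01": [7.0, 6.5, 7.5],
    "Q02": [4.0, 8.0, 3.5],  # similar average, far less consistent
}

for question, marks in marks_by_question.items():
    # The mean captures average quality; the spread captures how much the
    # quality of any single answer can be expected to vary.
    print(f"{question}: mean={mean(marks):.1f}, spread={pstdev(marks):.1f}")
```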

While marking by expert legal practitioners is necessarily more resource-intensive than automated or programmatic marking solutions, it has complemented Allens' broader adoption of LLMs across its legal practice and continues to inform our consideration of how such tools can be used effectively and reliably in day-to-day practice.

Omissions and extensions

Legal LLMs

Notably, the Allens AI Australian law benchmark excludes various legal-industry-specific LLMs such as Harvey, Thomson Reuters' CoCounsel and LexisNexis's Lexis+ AI. While Allens continues to benchmark and leverage several of these (and other) products for internal use, each has been excluded from the scope of this report to avoid drawing broad performance comparisons between products that have been specially optimised for specific use cases or user interactions. Our benchmark's use of 'single prompt' questioning to ensure consistency between LLMs may, for example, disadvantage products that offer multiple structured pathways for answering the same question.

Refined LLM querying

To ensure consistency when comparing LLMs and to control for variations in usage by testers, the Allens AI Australian law benchmark uses repeatable 'single prompt' questions across all queries. Such prompting, however, does not explore the results that can be achieved by carefully refining queries for each LLM or by interrogating and testing LLM responses in subsequent prompts. Nor does it reflect harder-to-test, longer-form, nuanced or niche use cases for LLM adoption in law, which include:

  • summarising longer documents, such as creating a bullet-point summary.
  • extracting specific provisions from contracts and other agreements.
  • making stylistic amendments to a document, such as making it more concise or less formal.
  • ideation, to help generate concepts and ideas.

While we expect that improved results could be obtained by increasing the detail of each prompt (eg by including additional references to Australian law to correct jurisdictional error), by interacting further with LLMs to refine their responses and by considering longer-form content, we leave such experimentation as an extension exercise, with this benchmark providing a starting point for the relative comparison of LLM performance.
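
By way of illustration only, a refined workflow might tighten the jurisdictional framing of the context prompt and then interrogate the model's first answer in a follow-up turn. The prompts and the chat helper below are hypothetical examples and sit outside the benchmark itself.

```python
# Illustrative extension beyond the benchmark's single-prompt design.
# chat() is a hypothetical stand-in for a vendor chat-completion call;
# the prompts are examples only and were not used in the benchmark.

REFINED_CONTEXT = (
    "You are an experienced Australian lawyer. Answer applying Australian law "
    "only; do not rely on UK or US authority unless it has been adopted by an "
    "Australian court. Cite the specific Act, section and Australian cases relied on."
)

FOLLOW_UP = (
    "Check your answer: confirm that each cited authority exists, is Australian "
    "and supports the proposition for which it is cited, then correct any errors."
)


def chat(model: str, messages: list[dict[str, str]]) -> str:
    """Hypothetical placeholder for a vendor chat-completion call."""
    raise NotImplementedError("wire up the vendor-specific client")


def refined_answer(model: str, question: str) -> str:
    """Two-turn refinement: answer, then interrogate the answer."""
    messages = [
        {"role": "system", "content": REFINED_CONTEXT},
        {"role": "user", "content": question},
    ]
    first = chat(model, messages)
    messages += [
        {"role": "assistant", "content": first},
        {"role": "user", "content": FOLLOW_UP},
    ]
    return chat(model, messages)
```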

Human-lawyer benchmarking

Unlike some other studies, our benchmark does not include a human-lawyer control for quality comparison. As the focus of this benchmark is on assessing the relative performance of different LLMs rather than the utility of LLMs against lawyers of a specific seniority, a human comparator was not considered necessary. As a general rule, this benchmark assumes that LLMs will be used by lawyers as part of their toolset, but that the extent of review required will depend in each case on the average quality of an LLM's responses and the degree to which that quality varies from question to question.

Incorporation of prior benchmarks into training data

As the 2024 LinksAI and Allens AI Australian law benchmarks were published online, it is possible that commentary and sample responses from those reports have been indexed and incorporated into the training data of, and therefore reflected in the responses offered by, the LLMs tested in the 2025 benchmark. Depending on the extent of any such indexation, this could conceivably result in improved responses (incorporating suggested improvements from the commentary) or merely reinforce last year's responses (by copying from sample answers). While we have not detected instances of word-for-word replication of 2024 responses, the potential over-fitting of LLM outputs to publicly available responses will need to be monitored in coming years.