The Allens AI Australian law benchmark and methodological observations
An eye on the horizon
The Allens AI Australian law benchmark is an ongoing benchmarking project that assesses and tracks the capability of LLMs to deliver legal advice on Australian law over time. Through a close examination of the results, we consider what opportunities exist today to implement GenAI in legal practice, its current limitations, and how human lawyers can best harness these capabilities to enhance and streamline legal workflows.
Our methodology
Each year, in conjunction with Linklaters' LinksAI English law benchmark, we ask leading publicly available LLMs to answer a set of 30 legal questions from 10 separate practice areas that an Australian-qualified lawyer may be routinely called upon to respond to or explain in the course of their legal practice. Each question has been selected to require an LLM to summarise an aspect of Australian law, to interpret and explain a contractual provision, or to undertake some blend of these tasks.
The 2025 edition of this benchmark asks the same questions as were posed in 2024, in order both to deliver an absolute assessment of LLM performance over time and to allow for the relative assessment of different LLMs at a given point in time. For each question, all LLMs were primed with the following context-setting prompt:
'You are an experienced Australian lawyer. Provide a concise answer to the question below applying Australian law. Cite any relevant statutes, regulation, guidance or case law'.
To control for random hallucination and variation in user queries and responses, each question was asked of each LLM three times as a 'single shot' query without further interrogation.
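For illustration only, the single-shot protocol can be sketched in a few lines of code. The client library, model name and question file used below are hypothetical choices made for the purposes of the sketch, not a description of the tooling used to run the benchmark.

```python
# Minimal sketch of a repeatable 'single shot' benchmark harness.
# The OpenAI-style client, model name and question file are illustrative
# assumptions only; they do not describe the benchmark's actual tooling.
import json
from openai import OpenAI  # hypothetical choice of client library

CONTEXT_PROMPT = (
    "You are an experienced Australian lawyer. Provide a concise answer to the "
    "question below applying Australian law. Cite any relevant statutes, "
    "regulation, guidance or case law."
)
REPEATS = 3  # each question is asked three times to control for random variation

client = OpenAI()

def ask_single_shot(model: str, question: str) -> list[str]:
    """Ask one question REPEATS times, with no follow-up interrogation."""
    responses = []
    for _ in range(REPEATS):
        completion = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": CONTEXT_PROMPT},
                {"role": "user", "content": question},
            ],
        )
        responses.append(completion.choices[0].message.content)
    return responses

if __name__ == "__main__":
    # questions.json: the 30 benchmark questions, tagged by practice area
    # (an illustrative file name and structure)
    questions = json.loads(open("questions.json").read())
    results = {q["id"]: ask_single_shot("gpt-4o", q["text"]) for q in questions}
    print(json.dumps(results, indent=2))
```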
Responses were graded by teams of Allens' knowledge management and subject matter specialists against three scoring rubrics: substance (the accuracy and correctness of each response), citations (the quality, correctness and accuracy of the evidence cited by each LLM) and clarity (how well the response explained the relevant issues). A central team moderated the scores awarded by different practice groups.
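The grading step can similarly be sketched as a simple record kept for each response. The 0-10 scale, field names and additive moderation adjustment below are assumptions made for illustration rather than a description of our actual marking scheme.

```python
# Sketch of a per-response grading record for the three rubrics described above.
# The 0-10 scale, field names and moderation adjustment are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Grade:
    llm: str            # model under test
    question_id: str    # one of the 30 benchmark questions
    attempt: int        # 1..3, one per single-shot repeat
    substance: float    # accuracy and correctness of the response (0-10)
    citations: float    # quality, correctness and accuracy of cited evidence (0-10)
    clarity: float      # how well the response explained the relevant issues (0-10)
    grader: str         # knowledge management / subject matter specialist team

def moderate(grades: list[Grade], adjustment: float) -> list[Grade]:
    """Apply a central moderation adjustment to scores awarded by one practice group."""
    for g in grades:
        g.substance = min(10.0, max(0.0, g.substance + adjustment))
        g.citations = min(10.0, max(0.0, g.citations + adjustment))
        g.clarity = min(10.0, max(0.0, g.clarity + adjustment))
    return grades
```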
Allens AI Australian law benchmark methodology
Comparison with other legal benchmarks
When we released the first edition of the Allens AI Australian law benchmark in 2024 alongside the LinksAI English law benchmark, these reports were among the few projects worldwide that sought to provide measurable and repeatable tests to gauge the performance of generative AI in a legal context. In 2025, a number of new legal benchmarks of interest have been launched, including:
- the CaseLaw, ContractLaw and LegalBench benchmarks published by Vals AI, which consider LLM performance against Canadian case law, contract law and legal reasoning tasks;
- targeted studies of LLM performance, including by Stanford's Institute for Human-Centered Artificial Intelligence (AI on Trial: Legal Models Hallucinate in 1 out of 6 (or More) Benchmarking Queries); and
- further information released by legal LLM providers about their internal benchmarking processes, such as that released by Thomson Reuters for its CoCounsel tool (Legal AI Benchmarking: CoCounsel - Thomson Reuters Institute).
Our methodology distinguishes itself from other emerging benchmarks in a number of ways:
- human-grading: the emergence of LLM-as-judge frameworks for assessing LLM performance has allowed benchmarks to be compiled and updated quickly by removing human grading from the process. While the relative benefits of LLM-marked and human-marked frameworks have been discussed elsewhere, a human-marking process allows our benchmark to consider and distinguish nuanced behaviours that arise with each question.
- Australian-law specific: grading by Australian law professionals has allowed us to monitor for the intrusion of international legal explanations and principles into Australian law answers. Other published benchmarks typically focus on the jurisprudence of larger jurisdictions such as the US, or assess performance against specific structured datasets that exist in those jurisdictions.
- scaled question-by-question marking: grading by Allens' knowledge management and specialist teams allows our benchmark to ascribe a wider, analog range of scores to a response (rather than a binary correct / incorrect assessment), allowing for an assessment of both the average quality of an LLM's responses (instead of its aggregate performance against a multi-question or multiple-choice dataset) and the degree to which each LLM's responses can be expected to vary in quality for any given question (a simple sketch of this approach appears below).
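By way of illustration, and assuming a 0-10 marking scale and the simplified data shapes shown, average quality and question-level variability might be computed along the following lines.

```python
# Sketch of how analog (rather than binary) marks support both an average
# quality measure and a variability measure for each LLM on each question.
# The (llm, question_id, mark) tuples and 0-10 scale are illustrative assumptions.
from collections import defaultdict
from statistics import mean, stdev

def summarise(marks: list[tuple[str, str, float]]) -> dict:
    """marks: (llm, question_id, overall mark for one attempt) on a 0-10 scale."""
    by_question: dict[tuple[str, str], list[float]] = defaultdict(list)
    for llm, question_id, mark in marks:
        by_question[(llm, question_id)].append(mark)
    return {
        key: {
            "average_quality": mean(scores),                            # expected quality
            "variability": stdev(scores) if len(scores) > 1 else 0.0,   # spread across repeats
        }
        for key, scores in by_question.items()
    }

# Example: three repeats of one benchmark question for one (hypothetical) model
print(summarise([("model-a", "Q1", 7.0), ("model-a", "Q1", 5.5), ("model-a", "Q1", 8.0)]))
```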
While marking by expert legal practitioners is necessarily more resource-intensive than other automated or programmatic marking solutions, it has complemented Allens' broader adoption of LLMs across its existing legal practice and continues to facilitate an ongoing consideration of how such tools can be effectively and reliably used in day-to-day practice.
Omissions and extensions
To ensure consistency when comparing LLMs and to control for variations in usage by testers, the Allens AI Australian law benchmark uses repeatable 'single prompt' questions across all queries. Such prompting, however, does not explore the results that can be achieved by carefully refining queries for each LLM or by interrogating and testing LLM responses in subsequent prompts. It also does not reflect harder-to-test, longer-form, nuanced or niche use cases for LLM adoption in law that include:
- summarising longer documents, such as to create a bullet-point summary
- contract extraction of specific provisions from agreements
- stylistic amendment to make a document more concise, less formal, etc.
- ideation to help come up with concepts and ideas.
While we expect that improved results could be obtained by increasing the detail of each prompt (eg by including additional references to Australian law to correct jurisdictional error), by interacting further with LLMs to refine their responses and by considering longer-form content, such experimentation is left as an extension exercise, with this benchmark providing a starting point for the relative comparison of LLM performance.
Unlike other studies, our benchmark does not include a human-lawyer quality control benchmark. As the focus of this benchmark is on assessing the relative performance of different LLMs, rather than the utility of LLMs against lawyers of a specific seniority, a human comparator was not considered necessary. As a general rule, this benchmark assumes that LLMs will be utilised by lawyers as part of their toolset but that the extent of review required will depend in each case on the overall average quality of the LLM's response and the variance in the quality of that LLM's responses.
As the 2024 LinksAI and Allens AI Australian law benchmarks were published online, it is possible that commentary and sample responses from those reports have been indexed and incorporated into the responses offered by any of the LLMs as part of the 2025 benchmark. Depending on the extent of any such indexation, the incorporation could conceivably result in improved responses (incorporating suggested improvements from the commentary) or may merely have reinforced last year's responses (by copying from sample answers). While we have not detected instances of word-for-word replication of 2024 responses, the potential over-fitting of LLM responses to publicly available benchmark material will need to be monitored in coming years.