What does the future hold for the role of AI in the delivery of legal services?
The Allens AI Australian law benchmark and methodological observations
An eye on the horizon

The Allens AI Australian law benchmark is an ongoing benchmarking project that assesses and tracks the capability of LLMs to deliver Australian law legal advice over time. Through a close examination of the results, we consider what opportunities exist today to implement GenAI in legal practice, current limitations and how human lawyers can best harness these capabilities to enhance and streamline legal workflows.
Our methodology

Each year, in conjunction with Linklaters' LinksAI English law benchmark, we ask leading publicly available LLMs to answer a set of 30 legal questions from 10 separate practice areas that an Australian-qualified lawyer may routinely be called upon to answer or explain in the course of their legal practice. Each question has been selected to require an LLM to summarise an aspect of Australian law, to interpret and explain a contractual provision, or to undertake some blend of these tasks. The 2025 edition of this benchmark asked the same questions as were posed in 2024, allowing both an absolute assessment of LLM performance over time and a relative assessment of different LLMs at a given point in time.

For each question, all LLMs were primed with the following context-setting prompt: 'You are an experienced Australian lawyer. Provide a concise answer to the question below applying Australian law. Cite any relevant statutes, regulation, guidance or case law'. To control for random hallucination and for variation in user queries and responses, each question was asked of each LLM three times as a 'single shot' query without further interrogation.

Responses were graded by teams of Allens' knowledge management and subject matter specialists against three scoring rubrics: substance (the accuracy and correctness of each response), citations (the quality, correctness and accuracy of the evidence cited by each LLM) and clarity (how well the response explained the relevant issues). A central team moderated the scores awarded by different practice groups.
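The querying protocol described above is straightforward to reproduce. As a rough illustration only, the sketch below shows how the repeated 'single shot' queries might be automated; the ask_llm helper, model identifiers and data structures are hypothetical placeholders, not the tooling actually used for this benchmark.

```python
# Minimal sketch, assuming a generic chat-completion-style API: automating the
# repeated 'single shot' querying described above. The ask_llm helper, model
# names and data structures are hypothetical placeholders.

SYSTEM_PROMPT = (
    "You are an experienced Australian lawyer. Provide a concise answer to "
    "the question below applying Australian law. Cite any relevant statutes, "
    "regulation, guidance or case law."
)

REPETITIONS = 3  # each question is asked three times to control for random variation


def ask_llm(model: str, system_prompt: str, question: str) -> str:
    """Hypothetical wrapper around whichever vendor API is being benchmarked."""
    raise NotImplementedError("Substitute the relevant vendor client call here.")


def collect_responses(models: list[str], questions: dict[str, str]) -> dict:
    """Gather three independent single-shot responses per model per question."""
    responses: dict[tuple[str, str], list[str]] = {}
    for model in models:
        for q_id, question in questions.items():
            responses[(model, q_id)] = [
                ask_llm(model, SYSTEM_PROMPT, question) for _ in range(REPETITIONS)
            ]
    return responses
```

Keeping the context-setting prompt and the repetition count fixed in one place is what makes the runs repeatable across different LLMs and across benchmark years.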
Comparison with other legal benchmarks

When we released the first edition of the Allens AI Australian law benchmark in 2024 alongside the LinksAI English law benchmark, these reports were among the few projects in the world seeking to provide measurable and repeatable tests of the performance of generative AI in a legal context. In 2025, a number of new legal benchmarks of interest have been launched, including:
Our methodology distinguishes itself from other emerging benchmarks in a number of ways:
While marking by expert legal practitioners is necessarily more resource-intensive than automated or programmatic marking solutions, it has complemented Allens' broader adoption of LLMs across its existing legal practice and continues to inform our ongoing consideration of how such tools can be used effectively and reliably in day-to-day practice.

Omissions and extensions

Legal LLMs

Notably, the Allens AI Australian law benchmark excludes various legal-industry-specific LLMs such as Harvey, Thomson Reuters's CoCounsel and LexisNexis's Lexis+ AI. While Allens continues to benchmark and leverage several of these (and other) products for internal use, each has been excluded from the scope of this report to avoid drawing broad performance comparisons between products that have been specially optimised for specific use cases or user interactions. Our benchmark's use of 'single prompt' questioning to ensure consistency between LLMs may, for example, disadvantage products that offer multiple structured pathways for answering the same question.

Refined LLM querying

To ensure consistency when comparing LLMs and to control for variations in usage by testers, the Allens AI Australian law benchmark uses repeatable 'single prompt' questions across all queries. Such prompting, however, does not explore the results that can be achieved by carefully refining queries for each LLM or by interrogating and testing LLM responses in subsequent prompts. Nor does it reflect harder-to-test, longer-form, nuanced or niche use cases for LLM adoption in law, which include:
While we expect that improved results could be obtained by increasing the detail of each prompt (eg by including additional references to Australian law to correct jurisdictional error), by interacting further with LLMs to refine their responses and by considering longer-form content, such experimentation is best carried out as an extension exercise, with this benchmark providing a starting point for the relative comparison of LLM performance.

Human-lawyer benchmarking

Unlike other studies, our benchmark does not include a human-lawyer quality-control benchmark. As the focus of this benchmark is on assessing the relative performance of different LLMs, rather than the utility of LLMs against lawyers of a specific seniority, a human comparator was not considered necessary. As a general rule, this benchmark assumes that LLMs will be used by lawyers as part of their toolset, but that the extent of review required will depend in each case on the overall average quality of an LLM's responses and the variance in that quality.

Incorporation of prior benchmarks into training data

As the 2024 LinksAI and Allens AI Australian law benchmarks were published online, it is possible that commentary and sample responses from those reports have been indexed and incorporated into the responses offered by any of the LLMs in the 2025 benchmark. Depending on the extent of any such indexation, this could conceivably have improved responses (by incorporating suggested improvements from the commentary) or merely reinforced last year's responses (by copying from sample answers). While we have not detected instances of word-for-word replication of 2024 responses, over-fitting of LLM results to publicly available responses will need to be monitored in coming years.
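On that last point, one lightweight way to screen for verbatim or near-verbatim reuse of previously published sample answers is a simple string-similarity comparison. The sketch below is an illustration under stated assumptions only: the flag_possible_replication helper and the 0.9 threshold are hypothetical, and the benchmark itself relies on reviewer moderation rather than any particular automated check.

```python
# Minimal sketch, assuming the published 2024 sample answers are available as
# plain text: flag a 2025 response that closely matches any of them. The helper
# name and 0.9 threshold are assumptions, not part of the benchmark methodology.
from difflib import SequenceMatcher


def flag_possible_replication(response: str,
                              published_2024_answers: list[str],
                              threshold: float = 0.9) -> bool:
    """Return True if a response closely matches any previously published answer."""
    return any(
        SequenceMatcher(None, response, prior).ratio() >= threshold
        for prior in published_2024_answers
    )
```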