Avoiding Juris-fiction: a way forward for transparent, risk-based AI experimentation in justice and the rule of law


The promise of artificial intelligence (AI) to ease the burden on judicial and legal systems worldwide is real, as evidenced by the buzz at the recent Global Forum on Justice and the Rule of Law. The efficiency argument for AI in public services has long been touted, but only recently have we started to get hard data from studies of Large Language Model (LLM) deployment in the justice domain to inform policymakers. Kudos, then, to the courageous judges who have publicly dipped their robes into the murky waters of AI-assisted decision-making. A recent critical appraisal of LLM use for judicial decision-making in Latin America describes how one pioneering Colombian judge used ChatGPT to draft almost 30% of a ruling on the fundamental rights of a minor. This caused quite a stir.

Efficiency? Certainly.

We know LLMs can organize and structure legal texts in seconds, saving precious time and potentially helping to clear court backlogs. They can also suggest text that makes a point more clearly or forcefully. This type of tech can even add a healthy layer of ambiguity, if a jurist needs a fence to straddle conveniently. The Colombian judge argues that tools like ChatGPT could improve response times in courts, where delays can quite literally mean life or death. Indeed, in a world where legal language can confuse people, AI offers a glimmer of clarity, as in Argentina, where LLMs are being used to simplify complex rulings and make them more accessible to the public. However, this efficiency comes with a caveat: judges must verify AI-generated content rigorously.
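To make the drafting use case concrete, here is a minimal sketch of what such assistance might look like in practice, assuming the OpenAI Python client; the model name, prompt wording, and helper function are illustrative assumptions, not any court's actual workflow.

# Illustrative sketch only: a drafting aid that restructures a ruling into
# plain-language sections. Assumes the OpenAI Python client; the model name
# and prompts are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def draft_summary(ruling_text: str) -> str:
    """Ask the model for a plain-language, structured summary of a ruling."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[
            {"role": "system",
             "content": "Summarize the ruling in plain language, with headed "
                        "sections for facts, issues, reasoning and orders. "
                        "Do not add facts or authorities not in the text."},
            {"role": "user", "content": ruling_text},
        ],
        temperature=0,  # keep the output conservative
    )
    return response.choices[0].message.content

# Whatever comes back is a starting point only: every fact and citation must
# be verified by the judge or clerk before it appears in a ruling.

Even in a sketch this simple, the crucial step is the last one: nothing the model produces should reach a ruling without human verification.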

Accuracy? Not entirely.

LLMs like ChatGPT and Llama have a well-documented tendency to, for want of a better term, make stuff up. AI hallucinations involve spitting out text that is factually incorrect or completely fabricated. As another study revealed, LLMs hallucinated between 58% and 88% of the time when tested on US federal court cases (see Figure 1). This is a damning statistic when you consider that citing erroneous facts in court could lead to hefty fines or even prison terms. Hallucinations are an inherent limitation of current large language models: they are trained to predict and organize language, not necessarily to understand truth or legal nuance.

Figure 1: Hallucination rates by LLM. Source: Dahl et al. (2024).


Moreover, LLMs perform better on higher-profile cases from more prominent legal jurisdictions, while struggling with brand new, old, or less famous rulings. AI models hang out with the superstars of legal precedent, leaving lesser-known rulings and jurisdictions by the wayside. I even did some primary research of my own, trying to elicit from ChatGPT 4.0 the correct authorities for seminal points of law in Irish constitutional jurisprudence. I was sent on an electronic wild goose chase.

So here is the problem: LLMs work well when drafting simple memos or summarizing basic laws. They also work well for an expert who can spot an anomaly, ensuring both efficiency and accuracy. But when called upon to interpret complex legal doctrine, work through historical case law, or tease out nuanced precedents, their outputs become as reliable as a chocolate teapot… for now. Bear in mind that the next generation of GenAI, expected as early as 2025, is predicted to take reasoning and logic to a whole other level. In fact, the recent findings about LLMs could become moot within weeks or months. But none of this will be reliably known, nor will the technology earn the public's trust, unless we experiment with it freely.

Open experimentation – a time for judicious bravery
Despite the clear risks, it is heartening that judges in some jurisdictions are openly pioneering this technology and providing transcripts of the prompts they used, the output received, and their final text. By testing LLMs in real-world cases, they are allowing us all to scrutinize the technology in ways that controlled lab settings cannot. We need more of this open experimentation, and the regulatory and legislative flexibility to permit it. By analyzing these early uses (and seeing the prompts, the results, and how they held up in law), we can set boundaries and establish guidelines to ensure AI is used responsibly in our legal systems.
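As a purely illustrative thought experiment, the kind of transcript these judges are sharing could be captured in a simple, structured disclosure record; the sketch below assumes Python, and its field names are invented for illustration rather than drawn from any court's prescribed format.

# Hypothetical sketch of a disclosure record for AI-assisted drafting.
# Field names are illustrative; no particular court format is implied.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class AIUseDisclosure:
    case_reference: str          # docket or case number
    model_name: str              # the LLM and version used
    prompts: list[str]           # the exact prompts submitted
    raw_outputs: list[str]       # the unedited text returned by the model
    adopted_text: str            # what actually appeared in the final ruling
    human_verification: str      # how facts and authorities were checked
    timestamp: datetime = field(default_factory=datetime.now)

# Publishing records like this alongside rulings would let researchers and
# the public compare the prompts, the raw outputs and the final judicial text.

Something along these lines would make the comparison between prompt, output and final ruling routine rather than exceptional.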

Transparency is key. When judges openly share how they have integrated AI into rulings, it provides invaluable insight into the technology’s strengths and limitations. This experimentation is the first step toward developing good practices, ensuring that AI enhances judicial efficiency without undermining fairness, accuracy, or the very human element that justice demands.

A second step would be to push ahead on the basis of risk, assessed within a hierarchy of rights. Singapore’s move to tackle an annual caseload of 10,000 small claims using generative AI is both sensible and scalable. Other areas, such as regulatory compliance and non-complex tax disputes, would also be good candidates. Some areas of family law may soon be suitable too, such as uncontentious divorces, maintenance payments and other instruments where the goal is simply to move on with life, with minimum hassle and cost to the parties.

In the end, the question is not whether AI should play a role in judicial decision-making; that is already happening. The real question is what we, as policymakers, need to do as we integrate these tools into our justice systems to maintain the human oversight, ethical standards, and rigorous checks required to keep justice fair and trustworthy. We must test, learn, and share our findings widely and truthfully. That way we can move forward in a manner that presents the least risk of encroaching on the fundamental rights of the parties involved. The efficiency gains, no matter how hyped, are meaningless if we lose sight of the core principles that justice demands.

Source: https://blogs.worldbank.org/