Recall and Precision – HM Argus’s quality metrics

Argus Eyes – The Blog on Internal Investigations, Crisis Management and Compliance

3 June 2026

Key Contacts: Sven H. Schneider and Daniel M. Weiß

Since the very first stages of HM Argus’s development, Hengeler Mueller has been using two key performance indicators, “recall” and “precision”. They serve as fundamental quality benchmarks for the tool’s use and also demonstrate quantitatively how HM Argus is continuously evolving. Recall indicates the proportion of all actually relevant documents in the data set that the tool correctly identifies. Precision, meanwhile, measures how many of the documents flagged as relevant in the first step actually prove to be relevant after more thorough review, including manual review where necessary. Recall is, in this respect, a direct indicator of a review’s quality, whereas precision describes its efficiency. At the same time, the two metrics are subject to an inherent tension: a high recall tends to lower precision, and vice versa.
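To make the two definitions concrete, here is a minimal Python sketch. The function name recall_and_precision and the document IDs are hypothetical and chosen purely for illustration; they are not part of HM Argus.

```python
def recall_and_precision(relevant_ids, flagged_ids):
    """Return (recall, precision) for a single review run.

    relevant_ids: IDs of all documents that are actually relevant.
    flagged_ids:  IDs of the documents the tool flagged as relevant.
    """
    relevant = set(relevant_ids)
    flagged = set(flagged_ids)
    true_positives = len(relevant & flagged)  # flagged AND actually relevant

    # Recall: share of all relevant documents that were found.
    recall = true_positives / len(relevant) if relevant else 0.0
    # Precision: share of flagged documents that are actually relevant.
    precision = true_positives / len(flagged) if flagged else 0.0
    return recall, precision


# Hypothetical run: 4 relevant documents, 5 flagged, 3 of them correctly.
recall, precision = recall_and_precision(
    relevant_ids={"D1", "D2", "D3", "D4"},
    flagged_ids={"D1", "D2", "D3", "D7", "D9"},
)
print(f"recall={recall:.2f}, precision={precision:.2f}")
```

In this invented run, three of the four relevant documents are found (recall 0.75), while three of the five flagged documents are actually relevant (precision 0.60).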

In the context of high-quality investigations, a high recall is an absolute prerequisite. Failing to find the literal ‘smoking gun’ or other crucial documents is simply not an option in legal investigations – particularly in cases where the outcome of the proceedings carries significant financial or personal consequences. Based on this consideration, HM Argus is usually optimised in favour of maximum recall, depending on the project. To put this into context: historically, with the traditional method – i.e. keyword filtering followed by manual review – recall values of around 0.70 were considered very good and ‘defendable’. The fact that, conversely, around 30% of potentially relevant documents remained undetected is inherent to the approach: spelling mistakes, synonyms, other languages and paraphrases inevitably lead to gaps in a pure keyword search, resulting in an incomplete picture. As powerful as language is, its pitfalls are equally numerous: anyone searching for “car” but failing to include synonyms such as “set of wheels” in their search terms runs the risk of overlooking highly relevant documents due to methodological shortcomings.
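The gaps of a pure keyword search can be illustrated with a toy example. The six documents below and their relevance labels are invented; the sketch only shows how a single search term misses synonyms, other languages and spelling mistakes, and it is not a description of any real review workflow.

```python
import re

# Toy corpus (invented for illustration): (text, is_relevant)
documents = [
    ("The car was delivered on Friday.", True),
    ("He picked up his new set of wheels.", True),   # synonym
    ("Das Auto wurde am Freitag geliefert.", True),  # other language
    ("The cra was sold below list price.", True),    # spelling mistake
    ("Car park closed for maintenance.", False),     # keyword hit, not relevant
    ("Lunch menu for next week.", False),
]

# Whole-word, case-insensitive match for the single search term "car".
pattern = re.compile(r"\bcar\b", re.IGNORECASE)

hits = [(text, relevant) for text, relevant in documents if pattern.search(text)]
true_positives = sum(1 for _, relevant in hits if relevant)
total_relevant = sum(1 for _, relevant in documents if relevant)

keyword_recall = true_positives / total_relevant
keyword_precision = true_positives / len(hits)
print(f"recall={keyword_recall:.2f}, precision={keyword_precision:.2f}")
```

Only one of the four relevant documents contains the literal term, so recall in this toy corpus is 0.25. The irrelevant “Car park” hit additionally halves precision to 0.50.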

This is where HM Argus comes in: using semantic document analysis based on modern language models, the tool analyses not just individual words, but entire linguistic contexts. For example, in addition to ‘car’ and ‘set of wheels’, HM Argus automatically recognises the term ‘vehicle’, simply because of the semantic proximity of these terms to one another. This paradigm shift already significantly improves recall. However, recall can be improved even further if prompts and contextual information are precisely tailored to the specific characteristics of the matter and the language model used, and if the results are refined using a proprietary relevance scoring system. This requires collaboration in interdisciplinary teams, in which lawyers and technology experts in particular work as equals to develop the tool further. For HM Argus, this means that, as things stand, recall already exceeds 0.90. Instead of the rate of around 70% previously regarded as ‘defendable’, HM Argus can identify over 90% of relevant documents. Even the documents not recognised by the tool often prove to be borderline cases upon manual review, where either classification, relevant or not relevant, would be justifiable even under human judgement.
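How HM Argus’s proprietary relevance scoring works is not described here. Purely to illustrate the general tension between recall and precision in any score-based classification, the following sketch flags every document whose score reaches a threshold. The scores, labels and the function name metrics_at are invented assumptions.

```python
# Invented (score, is_relevant) pairs; a higher score means "more likely relevant".
scored_documents = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True),
    (0.55, False), (0.40, True), (0.30, False), (0.10, False),
]


def metrics_at(threshold):
    """Recall and precision when every score >= threshold is flagged."""
    flagged = [relevant for score, relevant in scored_documents if score >= threshold]
    true_positives = sum(flagged)  # True counts as 1
    total_relevant = sum(relevant for _, relevant in scored_documents)
    recall = true_positives / total_relevant
    precision = true_positives / len(flagged) if flagged else 0.0
    return recall, precision


# Lowering the threshold flags more documents.
for threshold in (0.85, 0.60, 0.35):
    r, p = metrics_at(threshold)
    print(f"threshold={threshold:.2f}  recall={r:.2f}  precision={p:.2f}")
```

On this toy data, lowering the threshold from 0.85 to 0.35 raises recall from 0.50 to 1.00 while precision falls from 1.00 to about 0.67. A review optimised for maximum recall therefore accepts a larger set of flagged documents.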

Despite the focus on recall, the second metric, precision, was also integrated into the development process of HM Argus. The benchmark here is for the tool to perform at least as efficiently as the traditional keyword-based approach. However, the current generation of language models tends to classify too many documents as relevant. To counter this issue, HM Argus employs a supplementary relevance scoring system: through targeted weightings, logical links, cluster capping and a prompt strategy specifically tailored to the matter, precision values ranging from over 0.70 to over 0.80 are achieved, whereas the classic keyword method with manual review regularly yields results of around 0.10.
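What these figures mean for the review workload can be worked through with simple arithmetic. The sketch below assumes a hypothetical matter with 1,000 relevant documents and takes the recall and precision values quoted above as round numbers (0.70 and 0.10 for the keyword method, 0.90 and 0.80 for HM Argus). The function name review_workload is likewise an assumption for illustration.

```python
def review_workload(total_relevant, recall, precision):
    """Return (found, missed, flagged) document counts for given metrics."""
    found = round(total_relevant * recall)  # relevant documents identified
    missed = total_relevant - found         # relevant documents overlooked
    flagged = round(found / precision)      # documents passed on for closer review
    return found, missed, flagged


TOTAL_RELEVANT = 1_000  # hypothetical number of relevant documents in the matter

keyword_run = review_workload(TOTAL_RELEVANT, recall=0.70, precision=0.10)
argus_run = review_workload(TOTAL_RELEVANT, recall=0.90, precision=0.80)

for label, (found, missed, flagged) in (("keyword", keyword_run), ("HM Argus", argus_run)):
    print(f"{label}: found={found}, missed={missed}, flagged for review={flagged}")
```

Under these assumptions, the keyword method flags 7,000 documents to surface 700 relevant ones and misses 300. At the quoted HM Argus values, 1,125 flagged documents surface 900 relevant ones and 100 are missed.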

Recall and precision are the key, industry-wide recognised quality benchmarks for any AI-based review system. HM Argus significantly outperforms the traditional, keyword-driven approach in both respects – not least because the tool is individually optimised for each matter. In addition to the technical sophistication of HM Argus, this also delivers operational benefits: clients receive high-quality results more quickly and at a lower cost.
