Random sampling is a powerful eDiscovery tool that can provide you with reliable measurements of the efficacy and efficiency of searches, reviewers, or other classifiers.
As we discussed in our series on estimating prevalence, formal sampling is a powerful, judicially-approved, and underutilized tool for eDiscovery efforts. In that series we reviewed the necessary concepts, terms, and steps for estimating the prevalence of desired materials within a collected dataset, which can yield a variety of benefits during project planning and discovery negotiations. Estimating prevalence, however, is just one of the ways practitioners can leverage sampling in their projects.
Beyond estimating prevalence, there are other opportunities to replace informal sampling of unknown reliability with formal sampling of precise reliability. For example, since the early days of eDiscovery, it has been common for a knowledgeable team member to test potential search terms and phrases by informally poking around in some of the results they return. Similarly, it has been common for more-senior attorneys to double-check the relevance or privilege determinations made by more-junior attorneys by poking around in some of the batches of documents they’ve reviewed.
These informal approaches to sampling can be valuable for gathering some initial anecdotal evidence and for making instinctual assessments, but using formal sampling instead is much more effective, reliable, and persuasive. Imagine iteratively refining searches for your own use, or negotiating with another party about which searches should be used, armed with precise, reliable information about their relative efficacy. As we saw frequently in our recent TAR case law survey, and as has been suggested in non-TAR cases, judges prefer argument and negotiation based on actual information to that based merely on conjecture and assumption.
But how is testing searches, TAR processes, or reviews actually done? What sampling concepts and terms do you need to know? What steps do you need to take? What kind of results can you achieve, and with what kind of reliability?
In this short series, we will answer these questions by reviewing the concepts, terms, and steps necessary for “testing classifiers” using formal sampling.
“Classifiers” are mechanisms used to classify documents or other materials into discrete categories such as those requiring review and those not requiring review, or relevant and non-relevant, or privileged and non-privileged. That mechanism might be a search using key words or phrases. It might be the decisions of an individual human reviewer or the aggregated decisions of an entire human review process. It might be the software-generated results of a technology-assisted review process. The binary classification decisions of any of these classifiers are testable in the same basic way. To start, we will focus on searches as the classifiers to be tested.
When testing search classifiers, we are actually measuring two things about them: their “recall” and their “precision,” which correlate to their efficacy and their efficiency. Recall measures efficacy: the percentage of the relevant materials in the dataset that the search actually finds. Precision measures efficiency: the percentage of the materials returned by the search that are actually relevant.
Both recall and precision are expressed as percentages out of 100. A search with 100% recall finds all of the relevant materials, and a search with 100% precision returns nothing but relevant materials.
There is also generally a tension between the two measures. Optimizing a search to maximize recall beyond a certain point is likely to require lowering precision and accepting more junk, and optimizing a search to maximize precision beyond a certain point is likely to require accepting lower recall and more missed relevant materials. Deciding what balance between the two is reasonable and proportional is a fact-based determination specific to the needs and circumstances of each matter.
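To make the arithmetic concrete, here is a minimal sketch in Python of how recall and precision would be calculated from the results of testing a search against a reviewed control set; the counts used here are purely hypothetical.

```python
# Hypothetical counts from running a search against a reviewed control set
relevant_found = 320      # relevant documents the search returned (true positives)
relevant_missed = 80      # relevant documents the search failed to return (false negatives)
non_relevant_found = 160  # non-relevant documents the search returned (false positives)

# Recall: what share of all the relevant documents did the search find?
recall = relevant_found / (relevant_found + relevant_missed)

# Precision: what share of the documents the search returned are actually relevant?
precision = relevant_found / (relevant_found + non_relevant_found)

print(f"Recall:    {recall:.0%}")     # 80% of the relevant material was found
print(f"Precision: {precision:.0%}")  # 67% of what was returned is relevant
```

In this hypothetical, the search is reasonably effective (it found most of the relevant material) but somewhat inefficient (a third of what it returned is junk), illustrating the tension described above.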
In order to test a search classifier’s recall and precision, you must already know how many documents actually belong in the classifications you are testing. For example, to determine what percentage of the relevant material a search finds, you must know how much relevant material there is. Since it is not possible to know this about the full dataset without reviewing all of it (which would defeat the purpose of developing good searches), classifiers must be tested against a “control set” drawn from the full dataset.
As when estimating prevalence, control sets are created by taking a simple random sample from the full dataset (after initial, objective culling) and manually reviewing and classifying the materials in that sample. Just as with estimating prevalence, it is important that the review of the control set be done carefully and by knowledgeable team members. In fact, in many cases you may be able to use the same set of documents you reviewed to estimate prevalence as a control set for testing classifiers.
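For illustration only, drawing a simple random sample to serve as a control set can be as straightforward as the following Python sketch; the document identifiers and sample size here are hypothetical placeholders.

```python
import random

# Hypothetical list of document IDs remaining after initial, objective culling
all_doc_ids = [f"DOC-{i:06d}" for i in range(1, 250001)]

# Sample size chosen based on how reliable your results need to be (see below)
sample_size = 1500

# Simple random sample without replacement: every document has an equal
# chance of being selected, which is what makes the sample "formal"
control_set_ids = random.sample(all_doc_ids, sample_size)

# The sampled documents would then be carefully reviewed and classified to
# serve as the known ground truth against which classifiers are tested
```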
Unlike estimating prevalence, however, figuring out the size of the sample needed for your control set is not so cut and dried. As we will see in the next Part, the reliability of the results you get when testing classifiers depends on how many things there were for the classifiers to find in the control set. For example, if you are testing searches designed to find relevant documents, the more relevant documents there are in your control set, the more reliable your results will be.
This means that datasets with low prevalence may require larger control sets for testing classifiers than datasets with high prevalence, depending on how reliable you need your results to be. The results of a prevalence estimation exercise can help you figure out how large a control set you need (and whether your prevalence estimation set can simply be repurposed for this exercise).
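As a rough, back-of-the-envelope illustration (not a substitute for a formal statistical calculation), you can work backward from the number of relevant documents you want the control set to contain; the prevalence estimate and target below are hypothetical.

```python
# Hypothetical inputs
estimated_prevalence = 0.05  # prevalence estimate: roughly 5% of the dataset is relevant
target_relevant_docs = 100   # relevant documents you want the control set to contain

# Expected control set size needed to contain that many relevant documents;
# the lower the prevalence, the larger the control set must be
required_control_set_size = round(target_relevant_docs / estimated_prevalence)

print(required_control_set_size)  # 2000 documents at 5% prevalence
```

At 25% prevalence, the same target would require only about 400 documents, which is why a prevalence estimate is such a useful input to control set planning.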
Upcoming in this Series
In the next Part, we will review the steps to apply these concepts and terms to testing search classifiers.