Random sampling is a powerful eDiscovery tool that can provide you with reliable estimates of the prevalence of relevant materials, missed materials, and more.
A candy store is running a contest. In the front window is a comically enormous jar of jelly beans, all different kinds and colors. Mixed in among them are a secret number of red hot cinnamon candies, similar in size and shape, all red. Whoever can guess closest to the true number of red hots mixed into the jar wins the prize. How do you guess? Do you try to count the red candies you can see, hoping they’re all red hots, and then guess at how many you can’t see? Do you try to count all the candies? Do you try to estimate volumes?
What if you were allowed to take one scoop of candies out of the enormous jar for closer examination, to determine exactly which ones in the scoop were red hots? Could you extrapolate from the scoop to the jar? How much better might your guess be then?
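To make that extrapolation concrete, here is a minimal back-of-the-envelope sketch in Python. All of the numbers (the jar total, the scoop size, and the red hot count) are hypothetical, chosen only for illustration:

```python
# Hypothetical contest numbers (assumptions for illustration, not from any real jar)
jar_total = 20_000      # your estimate of the total candies in the jar
scoop_size = 250        # candies in the one scoop you examined closely
red_hots_in_scoop = 15  # red hots you actually found in that scoop

# Extrapolate the rate observed in the scoop to the whole jar
scoop_rate = red_hots_in_scoop / scoop_size         # 0.06, i.e., 6% of the scoop
estimated_red_hots = round(jar_total * scoop_rate)  # 1,200 red hots estimated
```

The quality of this guess depends on the scoop being a reasonably random draw from the jar, which is the same assumption formal sampling makes explicit.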
Despite years of discussion in the eDiscovery industry about the power and importance of sampling techniques, particularly in the context of technology-assisted review ("TAR"), many practitioners remain unfamiliar with what these techniques can accomplish and when, outside of TAR, they might employ them. Beyond being an essential part of TAR, there are opportunities across the phases of an eDiscovery project to replace guesses based on anecdotal evidence with actual estimates based on formal sampling.
Courts have been encouraging parties to leverage sampling techniques in eDiscovery since before TAR existed, suggesting their use for the validation of search terms and document review processes:
And they have continued encouraging its use for those purposes, even outside of TAR, to this day:
And, of course, as we saw throughout our recent TAR case law survey, the importance of sampling comes up again and again in discovery decisions and orders related to TAR use.
Industry publications, too, have taken repeated notice of the power and importance of sampling in eDiscovery. For example, sampling features prominently in The Sedona Conference’s Commentary on Achieving Quality in the E-Discovery Process, and the EDRM organization has released two editions of a commentary specifically on leveraging sampling in eDiscovery.
Many practitioners already engage in informal types of sampling. As practitioners have done since the early days of discovery, it is common for a knowledgeable team member to test potential search terms and phrases by informally poking around in some of the results they return. The same goes for poking around in the materials collected from different sources or custodians to gauge the relative importance of different tranches of materials, and for quality control checks of document review efforts, with more senior attorneys spot-checking the batches of documents reviewed by less-experienced attorneys to double-check their relevance or privilege determinations.
These informal approaches to sampling are inarguably valuable for gathering anecdotal evidence, making instinctual assessments, and learning about your materials or your efforts. Some information is always better than no information. But there are limits to what can be learned through these informal approaches and to how reliable such insights are.
Formal approaches to sampling, on the other hand, facilitate more precise estimates with known reliability. It is these approaches that make sampling so valuable in TAR specifically and in eDiscovery generally. For example, formal sampling approaches can be used to generate:
These measurements and many more can be taken using the same basic sampling techniques at various points in the discovery project lifecycle.
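As a rough illustration of what "precise estimates with known reliability" means in practice, here is a sketch in Python that draws a simple random sample from a hypothetical document collection and computes a prevalence estimate with a normal-approximation (Wald) confidence interval. The collection, the sample size, and the 95% z-value are all assumptions for illustration, not a prescription for any particular project:

```python
import math
import random

def estimate_prevalence(population, sample_size, is_relevant, z=1.96):
    """Draw a simple random sample and return (point estimate, lower, upper)
    using a normal-approximation (Wald) confidence interval; z=1.96 ~ 95%."""
    sample = random.sample(population, sample_size)
    hits = sum(1 for doc in sample if is_relevant(doc))
    p = hits / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return p, max(0.0, p - margin), min(1.0, p + margin)

# Hypothetical collection: 100,000 "documents," roughly 10% relevant.
# (True booleans stand in for documents so the sketch is self-contained.)
random.seed(42)
population = [random.random() < 0.10 for _ in range(100_000)]

p, lo, hi = estimate_prevalence(population, 400, lambda doc: doc)
```

With a sample of 400, the margin of error here is a few percentage points either way; larger samples tighten the interval, which is exactly the kind of known, quantifiable reliability that informal poking around cannot provide.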
In this short series, we will focus specifically on the process and applications of estimating prevalence: from winning the jelly bean jar contest described above, to planning at the beginning of a project, to checking completeness at the end.
Upcoming in this Series
In the next Part, we will review some key sampling concepts and terms for estimating prevalence.