Finding out How Many Red Hots are in the Jellybean Jar, Estimating Prevalence Series Part 1

Random sampling is a powerful eDiscovery tool that can provide you with reliable estimates of the prevalence of relevant materials, missed materials, and more

A candy store is running a contest.  In the front window is a comically enormous jar of jelly beans, all different kinds and colors.  Mixed in among them are a secret number of red hot cinnamon candies, similar in size and shape, all red.  Whoever can guess closest to the true number of red hots mixed into the jar wins the prize.  How do you guess?  Do you try to count the red candies you can see, hoping they’re all red hots, and then guess at how many you can’t see?  Do you try to count all the candies?  Do you try to estimate volumes?

What if you were allowed to take one scoop of candies out of the enormous jar for closer examination, to determine exactly which ones in the scoop were red hots?  Could you extrapolate from the scoop to the jar?  How much better might your guess be then?

Sampling in eDiscovery

Despite years of discussion in the eDiscovery industry about the power and importance of sampling techniques, particularly in the context of technology-assisted review (“TAR”), many practitioners remain unfamiliar with what those techniques can accomplish and with when, outside of TAR, to use them.  Sampling is more than an essential part of TAR, however: there are opportunities across the phases of an eDiscovery project to replace guesses based on anecdotal evidence with actual estimates based on formal sampling.

Courts have been encouraging parties to leverage sampling techniques in eDiscovery since before TAR existed, suggesting their use for the validation of search terms and document review processes:

  • “Common sense dictates that sampling and other quality assurance techniques must be employed to meet requirements of completeness,” In re Seroquel Prods. Liab. Litig., 244 F.R.D. 650 (M.D. Fla. 2007) [emphasis added]
  • “The implementation of the methodology selected should be tested for quality assurance; and the party selecting the methodology must be prepared to explain the rationale for the method chosen to the court, demonstrate that it is appropriate for the task, and show that it was properly implemented,” Victor Stanley Inc. v. Creative Pipe Inc., 250 F.R.D. 251 (D. Md. 2008) [emphasis added]

And they have continued encouraging its use for those purposes, even outside of TAR, to this day:

  • “Just as it is used in TAR, a random sample of the null set provides validation and quality assurance of the document production when performing key word searches. Magistrate Judge Andrew Peck made this point nearly a decade ago.  See William A. Gross Constr. Assocs., 256 F.R.D. at 135-6 (citing Victor Stanley, Inc. v. Creative Pipe, Inc., 250 F.R.D. 251, 262 (D. Md. 2008)); In re Seroquel Products Liability Litig., 244 F.R.D. 650, 662 (M.D. Fla. 2007) (requiring quality assurance).”  City of Rockford v. Mallinckrodt ARD Inc., No. 17 CV 50107, No. 18 CV 379 (N.D. Ill. Aug. 7, 2018) [emphasis added]

And, of course, as we saw throughout our recent TAR case law survey, the importance of sampling comes up again and again in discovery decisions and orders related to TAR use.

Industry publications, too, have taken repeated notice of the power and importance of sampling in eDiscovery.  For example, sampling features prominently in The Sedona Conference’s Commentary on Achieving Quality in the E-Discovery Process, and the EDRM organization has released two editions of a commentary specifically on leveraging sampling in eDiscovery.

Informal Approaches to Sampling

Many practitioners already engage in informal types of sampling.  Since the early days of discovery, it has been common for a knowledgeable team member to test potential search terms and phrases by informally poking around in some of the results they return.  The same goes for poking around in the materials collected from different sources or different custodians to determine the relative importance of different tranches of materials.  The same also goes for quality control checks of document review efforts, with more senior attorneys poking around in the batches of documents reviewed by less-experienced attorneys to double-check their relevance or privilege determinations.

These informal approaches to sampling are inarguably valuable for gathering anecdotal evidence, making instinctual assessments, and learning about your materials or your efforts.  Some information is always better than no information.  But there are limits to what can be learned through these informal approaches and to how reliable such insights are.

Formal Approaches to Sampling

Formal approaches to sampling, on the other hand, facilitate more precise estimates with known reliability.  It is these approaches that make sampling so valuable in TAR specifically and in eDiscovery generally.  For example, formal sampling approaches can be used to generate:

  • Reliable estimates of how many relevant documents are in a given tranche
  • Reliable projections of the amount of redaction or privilege logging to do
  • Reliable measurements of relevant materials missed by a given process
  • Reliable reporting on the efficacy of a given search or other classifier

These measurements and many more can be taken using the same basic sampling techniques at various points in the discovery project lifecycle.
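To make "estimates with known reliability" a little more concrete, here is a minimal Python sketch of the kind of calculation involved.  It is an illustration under stated assumptions, not a prescribed method: it assumes a simple random sample, uses a Wilson score interval at roughly 95% confidence (one common choice among several), and ignores any finite population correction.  The function name `estimate_prevalence` and the example numbers are hypothetical.

```python
import math


def estimate_prevalence(sample_size, positives, population_size, z=1.96):
    """Estimate prevalence from a simple random sample.

    Assumptions (illustrative only): the sample was drawn at random,
    z=1.96 corresponds to roughly 95% confidence, and the interval is a
    Wilson score interval with no finite population correction.
    """
    if sample_size <= 0 or not 0 <= positives <= sample_size:
        raise ValueError("need sample_size > 0 and 0 <= positives <= sample_size")

    # Point estimate: the share of the sample that was positive.
    p_hat = positives / sample_size

    # Wilson score interval around that share.
    denom = 1 + z**2 / sample_size
    center = (p_hat + z**2 / (2 * sample_size)) / denom
    half_width = (
        z
        * math.sqrt(
            p_hat * (1 - p_hat) / sample_size + z**2 / (4 * sample_size**2)
        )
        / denom
    )
    low = max(0.0, center - half_width)
    high = min(1.0, center + half_width)

    # Extrapolate the rate and its interval to the whole population.
    return {
        "point_rate": p_hat,
        "rate_interval": (low, high),
        "point_count": round(p_hat * population_size),
        "count_interval": (
            math.floor(low * population_size),
            math.ceil(high * population_size),
        ),
    }


if __name__ == "__main__":
    # Hypothetical numbers: 40 relevant documents found in a random
    # sample of 400, drawn from a tranche of 100,000 documents.
    result = estimate_prevalence(400, 40, 100_000)
    print(result)
```

With those hypothetical numbers, the point estimate is 10% prevalence, and the interval runs from roughly 7% to 13%.  That range is what distinguishes a formal estimate from an informal impression: it states not just a guess, but how far off the guess could plausibly be.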

In this short series, we are going to focus specifically on the process and applications of estimating prevalence, from winning the jellybean jar contest described above, to planning at the beginning of a project, to checking completeness at the end.

Upcoming in this Series

In the next Part, we will review some key sampling concepts and terms for estimating prevalence.

