Red Hots, Hot Docs, and the Ones that Got Away, Estimating Prevalence Series Part 3

Random sampling is a powerful eDiscovery tool that can provide you with reliable estimates of the prevalence of relevant materials, missed materials, and more

In “Finding out How Many Red Hots are in the Jellybean Jar,” we discussed a candy contest hypothetical and the importance of sampling techniques to eDiscovery.  In “Key Sampling Concepts for Winning the Candy Contest,” we discussed sampling frame, prevalence, confidence level, and confidence interval.  In this final Part, we apply those concepts to our hypothetical contest and to eDiscovery review.


Now that we understand the necessary sampling concepts, let’s apply those concepts to our candy contest and figure out how many red hots we think are in the jellybean jar.  In order to do so, we will need to identify our sampling frame, select our desired confidence level, and select our desired confidence interval.

For this example, our sampling frame is all the candies in the enormous jellybean jar, which a sign indicates holds approximately 100,000 candies.  For our confidence level, let’s use 95%, which has been referenced in a variety of cases and articles as a potentially acceptable level of confidence, and for our confidence interval, let’s use 4% – also known as a margin of error of +/-2%, which has also been widely discussed and used.  (For example, 95% and +/-2% were the proposed values used in the plan in the da Silva Moore case and in several other cases from our TAR case law survey.)

So, How Many Red Hots?

Now that we have our required values (100,000, 95%, +/-2%, and an assumed 50% prevalence), we are ready to plug them into a sampling tool or calculator to find out how large our simple random sample will need to be.  Most modern document review tools have some form of sampling tool built in, but sampling calculators are also readily available online, and random document selections can be made manually if needed (e.g., by using the RAND function in Microsoft Excel).  Plugging the values we’ve chosen into a sampling calculator reveals that we need a simple random sample of 2,345 pieces of candy to make our desired estimate, which is just 2.345% of the total sampling frame.
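For readers who would rather see the arithmetic than trust a calculator, here is a minimal Python sketch of one common way to size a simple random sample: the normal approximation for a proportion, with a finite population correction.  We have not confirmed which formula any particular calculator or review tool uses, and the function names sample_size and draw_sample are our own, chosen purely for illustration.

```python
import math
import random
from statistics import NormalDist


def sample_size(population, confidence=0.95, margin=0.02, prevalence=0.5):
    """Simple random sample size with a finite population correction."""
    # Two-sided z-score for the confidence level (about 1.96 for 95%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an effectively unlimited population.
    n0 = (z ** 2) * prevalence * (1 - prevalence) / margin ** 2
    # Shrink it to account for the finite sampling frame.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


def draw_sample(item_ids, n, seed=None):
    """Pick n items uniformly at random, without replacement."""
    return random.Random(seed).sample(item_ids, n)


# The contest values: 100,000 candies, 95% confidence, +/-2%, 50% prevalence.
n = sample_size(100_000)
print(n)

# Hypothetical candy IDs 0..99,999 stand in for the contents of the jar.
sample = draw_sample(range(100_000), n, seed=1)
```

With the contest values, this sketch arrives at the same 2,345 figure.  The seed argument is only there to make a draw repeatable, which can be helpful if a selection ever needs to be documented or reproduced.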

Once a candy store employee has retrieved for us a randomly selected assortment of 2,345 pieces of candy from the jar, we can then review those sample candies up close to determine exactly which ones are cinnamon red hots.  Let’s say our review reveals there are 142 red hots among the 2,345 sample candies, or about 6.1%.  We can now say, with 95% confidence, that the overall prevalence of red hots in the jar is between 4.1% and 8.1%, or between roughly 4,100 and 8,100 total red hots.
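The step from sample result to estimated range is simple enough to sketch in a few lines of Python.  This version applies the planned +/-2% margin directly to the observed rate, as we did above; because that margin was sized for an assumed 50% prevalence, it is on the conservative side for a rate near 6%.  The function name estimate_range is our own invention.

```python
def estimate_range(hits, n, population, margin=0.02):
    """Return the observed rate, its low/high bounds, and projected counts."""
    # Observed prevalence, rounded to a tenth of a percent as in the text.
    rate = round(hits / n, 3)
    # Apply the planned margin of error, clamped to the 0-100% range.
    low = max(0.0, rate - margin)
    high = min(1.0, rate + margin)
    # Project the percentage bounds onto the whole sampling frame.
    return rate, low, high, round(low * population), round(high * population)


rate, low, high, low_count, high_count = estimate_range(142, 2345, 100_000)
print(f"{rate:.1%} observed, {low:.1%} to {high:.1%}")
print(f"{low_count:,} to {high_count:,} red hots")
```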

If we were willing to review a larger sample of 8,763 candies, we could tighten the margin of error to +/-1%.  Assuming the larger sample showed the same 6.1% rate, that would narrow the range to between 5,100 and 7,100 total red hots.
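To see the trade-off between sample size and precision, we can turn the calculation around and ask what margin of error a given sample size buys.  The sketch below again assumes the normal approximation with a finite population correction and a worst-case 50% prevalence; margin_of_error is a name we made up for illustration.

```python
import math
from statistics import NormalDist


def margin_of_error(n, population, confidence=0.95, prevalence=0.5):
    """Margin of error for a simple random sample of n drawn from population."""
    # Two-sided z-score for the confidence level (about 1.96 for 95%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Standard error of a proportion for a sample of size n.
    standard_error = math.sqrt(prevalence * (1 - prevalence) / n)
    # The correction shrinks the error as the sample nears the whole frame.
    correction = math.sqrt((population - n) / (population - 1))
    return z * standard_error * correction


for n in (2_345, 8_763):
    print(f"{n:,} candies: +/-{margin_of_error(n, 100_000):.2%}")
```

Note that cutting the margin of error in half takes nearly four times as many candies, which is why narrower ranges get expensive quickly.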

So, How Many Hot Documents?

This same process can be employed in an eDiscovery project to make any number of useful estimations about a new collection of materials, including: the prevalence of relevant materials, the relative prevalence of relevant materials in different sources, and the prevalence of materials requiring special review efforts (e.g., privilege logging, redactions, technical knowledge, etc.).  These estimates can in turn be used to more accurately estimate your needed project resources, optimal project workflows, and likely project costs and durations.  They can also be valuable in assessing the viability of a TAR solution or the need for additional objective culling.  As projects progress, they can also provide a yardstick against which to measure progress and completeness.

It should also be noted that, when applying these techniques to eDiscovery, it is important to use the highest quality document review possible.  While identifying cinnamon red hots is very straightforward, making legal or process determinations about documents can be quite nuanced, and the nature of sampling (extrapolating from a little to a lot) means that mistakes in classification during sampling will have amplified effects on the reliability of your estimates.

And, How Many Did We Miss?

One of the most common applications of prevalence estimation is in testing for completeness at the end of a TAR process or after the application of keyword searches.  This is sometimes referred to as measuring “elusion,” i.e., the quantity of materials that eluded identification by the filtering and review process employed.  For such estimations, the sampling frame is the pool of unreviewed materials eliminated before human review, either by the TAR software used or by the keyword searches applied.  The process is otherwise identical to the one described above.
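As a rough Python sketch of what changes for an elusion test: the sampling frame becomes the set of documents that never reached human review, and the sample result is projected onto that set only.  Every number and name below is hypothetical and purely illustrative; none of it comes from a real matter or a particular review tool.

```python
def elusion_frame(all_doc_ids, reviewed_doc_ids):
    """Documents culled before human review: the sampling frame for elusion."""
    reviewed = set(reviewed_doc_ids)
    return [doc_id for doc_id in all_doc_ids if doc_id not in reviewed]


def elusion(relevant_in_sample, sample_size, frame_size):
    """Point estimate of the elusion rate and of relevant documents missed."""
    rate = relevant_in_sample / sample_size
    return rate, round(rate * frame_size)


# Hypothetical collection: 50,000 documents, of which 10,000 were reviewed.
all_docs = range(1, 50_001)
reviewed = range(1, 10_001)
frame = elusion_frame(all_docs, reviewed)

# Hypothetical sample result: 12 relevant documents in a sample of 1,200.
rate, missed = elusion(relevant_in_sample=12, sample_size=1_200,
                       frame_size=len(frame))
print(f"{rate:.1%} elusion, roughly {missed:,} relevant documents missed")
```

This is a point estimate only; in practice it should be reported together with the confidence level and margin of error that the sample size supports.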

There is no way to perfectly identify and produce all relevant electronic materials (and no legal requirement that you achieve such perfection), but there can be great value in being able to say with some certainty how little (or how much) has been missed.  A reliable estimate can provide concrete evidence of the (in)adequacy of a completed process, or a basis for arguing the (dis)proportionality of any additional discovery efforts.


For Assistance or More Information

Xact Data Discovery (XDD) is a leading international provider of eDiscovery, data management and managed review services for law firms and corporations.  XDD helps clients optimize their eDiscovery matters by orchestrating pristine communication between people, processes, technology and data.  XDD services include forensics, eDiscovery processing, Relativity hosting and managed review.

XDD offers exceptional customer service with a commitment to responsive, transparent and timely communication to ensure clients remain informed throughout the entire discovery life cycle.  At XDD, communication is everything, because you need to know.  Engage with XDD; we’re ready to listen.


About the Author

Matthew Verga

Director, Education and Content Marketing

Matthew Verga is an electronic discovery expert proficient at leveraging his legal experience as an attorney, his technical knowledge as a practitioner, and his skills as a communicator to make complex eDiscovery topics accessible to diverse audiences. An eleven-year industry veteran, Matthew has worked across every phase of the EDRM and at every level from the project trenches to enterprise program design. He leverages this background to produce engaging educational content to empower practitioners at all levels with knowledge they can use to improve their projects, their careers, and their organizations.
