
Pop Quiz: How Do You Test a Search? (Testing Classifiers Series, Part 1)


Random sampling is a powerful eDiscovery tool that can provide you with reliable measurements of the efficacy and efficiency of searches, reviewers, or other classifiers.


As we discussed in our series on estimating prevalence, formal sampling is a powerful, judicially-approved, and underutilized tool for eDiscovery efforts.  In that series we reviewed the necessary concepts, terms, and steps for estimating the prevalence of desired materials within a collected dataset, which can yield a variety of benefits during project planning and discovery negotiations.  Estimating prevalence, however, is just one of the ways practitioners can leverage sampling in their projects.

More Uses for Informal and Formal Sampling in eDiscovery

Beyond estimating prevalence, there are other opportunities to replace informal sampling of unknown reliability with formal sampling of precise reliability.  For example, since the early days of eDiscovery, it has been common for a knowledgeable team member to test potential search terms and phrases by informally poking around in some of the results they return.  Similarly, it has also been common for more senior attorneys to double-check the relevance or privilege determinations made by more junior attorneys by poking around in some of the batches of documents they’ve reviewed.

These informal approaches to sampling can be valuable for gathering some initial anecdotal evidence and for making instinctual assessments, but using formal sampling instead is much more effective, reliable, and persuasive.  Imagine iteratively refining searches for your own use, or negotiating with another party about which searches should be used, armed with precise, reliable information about their relative efficacy.  As we saw frequently in our recent TAR case law survey, and as has been suggested in non-TAR cases, judges prefer argument and negotiation based on actual information to that based merely on conjecture and assumption.

But how is testing searches or TAR processes or reviews actually done?  What sampling concepts and terms do you need to know?  What steps do you need to take?  What kind of results can you achieve, with what kind of reliability?

In this short series, we will answer these questions by reviewing the concepts, terms, and steps necessary for “testing classifiers” using formal sampling.

What Is a Classifier?

“Classifiers” are mechanisms used to classify documents or other materials into discrete categories such as those requiring review and those not requiring review, or relevant and non-relevant, or privileged and non-privileged.  That mechanism might be a search using key words or phrases.  It might be the decisions of an individual human reviewer or the aggregated decisions of an entire human review process.  It might be the software-generated results of a technology-assisted review process.  The binary classification decisions of any of these classifiers are testable in the same basic way.  To start, we will focus on searches as the classifiers to be tested.

What Properties of a Search Classifier Do We Test?

When testing search classifiers, we are actually measuring two things about them: their “recall” and their “precision,” which correspond to their efficacy and their efficiency, respectively:

  • Recall is how much of the total stuff available to find the classifier actually found, so higher recall (i.e., finding more) means greater efficacy, and lower recall (i.e., finding less) means lower efficacy.
  • Precision is how much other, unwanted stuff the classifier included along with the stuff you actually wanted, so higher precision (i.e., less junk) means higher efficiency, and lower precision (i.e., more junk) means lower efficiency.

Both recall and precision are expressed as percentages out of 100:

  • For example, if there are 500 relevant documents somewhere in a dataset, and a search finds 250 of those documents, then that search has a recall of 50% (i.e., 250/500).
  • If the search returned 750 non-relevant documents along with the 250 relevant ones, that search would have a precision of 25% (i.e., 250/1,000).  (Both calculations are worked through in the brief sketch below.)
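
For readers who want to see the arithmetic in one place, here is a minimal sketch of those two calculations, using the hypothetical counts from the example above.  The function names and figures are illustrative only and are not drawn from any particular review platform.

```python
# Illustrative recall and precision arithmetic, using the hypothetical
# counts from the worked example above (not tied to any review platform).

def recall(relevant_found: int, relevant_total: int) -> float:
    """Share of all relevant documents in the dataset that the search found."""
    return relevant_found / relevant_total

def precision(relevant_found: int, total_returned: int) -> float:
    """Share of the documents the search returned that are actually relevant."""
    return relevant_found / total_returned

relevant_total = 500      # relevant documents somewhere in the dataset
relevant_found = 250      # relevant documents the search returned
non_relevant_found = 750  # non-relevant documents returned alongside them
total_returned = relevant_found + non_relevant_found

print(f"Recall:    {recall(relevant_found, relevant_total):.0%}")     # Recall:    50%
print(f"Precision: {precision(relevant_found, total_returned):.0%}")  # Precision: 25%
```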

There is also generally a tension between the two criteria.  Optimizing a search to maximize recall beyond a certain point is likely to require lowering precision and accepting more junk, and optimizing a search to maximize precision beyond a certain point is likely to require accepting lower recall and more missed relevant materials.  Deciding what balance between the two is reasonable and proportional is a fact-based determination specific to the needs and circumstances of each matter.

What Sample Is Needed to Test a Search Classifier?

In order to test a search classifier’s recall and precision, you must already know the numbers of documents in the classifications you are testing.  For example, to determine what percentage of the relevant material is found, you must know how much relevant material there is.  Since it is not possible to know this about the full dataset without reviewing it all (which would defeat the purpose of developing good searches), classifiers must be tested against a “control set” drawn from the full dataset.

Much as we did for estimating prevalence, control sets are created by taking a simple random sample from the full dataset (after initial, objective culling) and manually reviewing and classifying the materials in that sample.  Just as with estimating prevalence, it is important that the review performed on the control set be done carefully and by knowledgeable team members.  In fact, in many cases you may be able to use the same set of documents you reviewed to estimate prevalence as a control set for testing classifiers.

Unlike estimating prevalence, however, figuring out the size of the sample needed for your control set is not so cut and dried.  As we will see in the next Part, the reliability of the results you get when testing classifiers is related to how many potential things there were for the classifiers to find in the control set.  For example, if you are testing searches designed to find relevant documents, the more relevant documents there are in your control set, the more reliable your results will be.

This means that datasets with low prevalence may require larger control sets to test classifiers than datasets with high prevalence, depending on how reliable you need your results to be.  The results of a prevalence estimation exercise can help you figure out how large a control set you need (and whether your prevalence estimation set can just be repurposed for this exercise).
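
To make that relationship concrete, the rough sketch below shows how the margin of error of a recall measurement shrinks as the number of relevant documents in the control set grows, using the standard normal-approximation formula for a proportion at a 95% confidence level.  The prevalence and control-set figures in the comments are assumptions chosen purely for illustration, not recommendations for any particular matter.

```python
# Rough illustration (assumptions, not recommendations): how the number of
# relevant documents in a control set affects the margin of error of a
# recall measurement, using the normal approximation for a proportion.

import math

Z_95 = 1.96  # z-score for a 95% confidence level

def recall_margin_of_error(relevant_in_control_set: int,
                           observed_recall: float = 0.5) -> float:
    """Approximate +/- margin of error for a recall estimate at 95% confidence.

    observed_recall = 0.5 is the worst case (widest interval)."""
    p = observed_recall
    return Z_95 * math.sqrt(p * (1 - p) / relevant_in_control_set)

# Example: at 2% prevalence, a 2,000-document control set holds only ~40
# relevant documents, while a 20,000-document control set holds ~400.
for relevant_docs in (40, 100, 400):
    moe = recall_margin_of_error(relevant_docs)
    print(f"{relevant_docs:>4} relevant docs in control set -> recall +/- {moe:.1%}")
```

As the output shows, a recall estimate based on only a few dozen relevant documents carries a much wider margin of error than one based on several hundred, which is why low-prevalence datasets tend to need larger control sets.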


Upcoming in this Series

In the next Part, we will review the steps to apply these concepts and terms to testing search classifiers.


About the Author

Matthew Verga

Director, Education and Content Marketing

Matthew Verga is an electronic discovery expert proficient at leveraging his legal experience as an attorney, his technical knowledge as a practitioner, and his skills as a communicator to make complex eDiscovery topics accessible to diverse audiences. An eleven-year industry veteran, Matthew has worked across every phase of the EDRM and at every level from the project trenches to enterprise program design. He leverages this background to produce engaging educational content to empower practitioners at all levels with knowledge they can use to improve their projects, their careers, and their organizations.
