
Show Your Work: Contingency Tables and Error Margins, Testing Classifiers Series Part 2


Random sampling is a powerful eDiscovery tool that can provide you with reliable measurements of the efficacy and efficiency of searches, reviewers, or other classifiers

In “Pop Quiz: How Do You Test a Search?,” we discussed the application of sampling techniques to testing classifiers and introduced the concepts of recall and precision.  In this Part, we apply those concepts to testing a hypothetical search classifier.


As we discussed in the last part, sampling can be used to test search classifiers by calculating their recall and precision.  This process replaces informal sampling and instinctual assessments with actual data on efficacy and efficiency, for both your own planning and your discovery process negotiations with other parties.  Doing so requires creating a control set using simple random sampling and then reviewing that control set to identify in advance the items you hope the search classifier will find.  Once you have a reviewed control set, you are ready to run your search classifiers against it for testing, whether those are keyword searches, TAR software, or another search classifier.
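As a concrete illustration, here is a minimal sketch of how a control set might be drawn by simple random sampling, assuming your collection can be represented as a list of document identifiers (the function and variable names here are illustrative, not features of any particular review platform):

```python
import random

def draw_control_set(doc_ids, sample_size, seed=None):
    """Draw a simple random sample (without replacement) to use as a control set.

    doc_ids: list of document identifiers for the full collection.
    sample_size: number of documents to pull for the control set.
    seed: optional seed so the draw can be reproduced later.
    """
    rng = random.Random(seed)
    return rng.sample(doc_ids, sample_size)

# Example: pull a 3,982-document control set from a 100,000-document collection.
collection = [f"DOC-{i:06d}" for i in range(1, 100001)]
control_set = draw_control_set(collection, 3982, seed=42)
print(len(control_set))  # 3982
```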

Using Contingency Tables

Once you have run a search classifier you are testing against your control set, you can calculate recall and precision for it by using “contingency tables.”  Contingency tables (also sometimes referred to as “cross-tabulations” or “cross-tabs”) are simple tables used to break down the results of such a test into four categories: “true positives,” “false positives,” “false negatives,” and “true negatives.”  These four categories are comparisons of the results of the search classifier to the prior results of your manual review of the control set:

  1. True positives are documents that your search classifier returns as relevant results and that your prior review of the control set also marked as relevant, i.e., the right stuff
  2. False positives are documents that your search classifier returns as relevant results but that your prior review of the control set had determined were not relevant, i.e., the wrong stuff
  3. False negatives are documents that your search classifier does not return as relevant results but that your prior review of the control set marked as relevant, i.e., missed stuff
  4. True negatives are documents that your search classifier does not return as relevant results and that your prior review of the control set had determined were not relevant, i.e., actual junk (see the counting sketch after this list)
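To make the comparison concrete, here is a minimal sketch of how these four counts might be tallied, assuming you have the prior review's relevance calls for the control set and the set of documents the classifier returned (the data structures and names here are illustrative):

```python
def tally_contingency(control_labels, classifier_results):
    """Count true/false positives/negatives for a tested classifier.

    control_labels: dict mapping doc_id -> True if prior review marked it relevant.
    classifier_results: set of doc_ids the search classifier returned as relevant.
    """
    tp = fp = fn = tn = 0
    for doc_id, is_relevant in control_labels.items():
        returned = doc_id in classifier_results
        if returned and is_relevant:
            tp += 1        # true positive: the right stuff
        elif returned and not is_relevant:
            fp += 1        # false positive: the wrong stuff
        elif not returned and is_relevant:
            fn += 1        # false negative: missed stuff
        else:
            tn += 1        # true negative: actual junk
    return tp, fp, fn, tn
```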

An Example Application

As we discussed in our series about estimating prevalence, sample sizes of a few thousand documents are common for taking prevalence measurements about large document collections.  So, let’s assume a hypothetical in which you have a randomly selected set of 3,982 documents that you previously reviewed to take a strong measurement of prevalence (99% confidence level, with a margin of error of +/-2%) within your collection of 100,000 documents.  Let’s also assume that your review of that random sample revealed 1,991 relevant documents.
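If you want to see where a control set size like 3,982 comes from, here is a sketch of the standard sample-size calculation for a proportion (normal approximation, worst-case 50% prevalence, with a finite population correction); your sampling calculator may round slightly differently:

```python
import math
from statistics import NormalDist

def sample_size(population, confidence, margin, p=0.5):
    """Sample size for estimating a proportion, with finite population correction."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # ~2.576 for 99% confidence
    n0 = (z ** 2) * p * (1 - p) / margin ** 2            # infinite-population size
    n = n0 / (1 + (n0 - 1) / population)                 # finite population correction
    return math.ceil(n)

print(sample_size(100_000, 0.99, 0.02))  # ~3,982
```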

In addition to knowing prevalence within the overall collection (48-52% prevalence, with 99% confidence), you now have a 3,982 document control set for testing search classifiers, containing 1,991 relevant documents for them to try to find.  The next step is running your search classifier against it and seeing how its classifications compare to those of your prior review.  Let’s assume your hypothetical search returns 1,810 total documents which break down into the four categories as follows:

|  | Deemed Relevant by Prior Review | Deemed Not Relevant by Prior Review |
| --- | --- | --- |
| Returned by Search Classifier (i.e., Deemed Relevant) | 1,267 (True Positives) | 543 (False Positives) |
| Not Returned by Search Classifier (i.e., Deemed Not Relevant) | 724 (False Negatives) | 1,448 (True Negatives) |

As we can see in this contingency table, the 1,810 results from your hypothetical search included 1,267 documents that were also deemed relevant in your prior review, which are your true positives.  It also included 543 documents that were deemed not relevant in your prior review, which are your false positives.  And, finally, we can see it missed 724 documents that were deemed relevant in your prior review, which are your false negatives.

You can use the results shown in this contingency table to easily estimate the recall and precision of the hypothetical search classifier you tested.  As we discussed, recall is the percentage of all the available relevant documents that were successfully identified by the search classifier being tested.  So, in this example, your search identified 1,267 out of 1,991 relevant documents, which gives you a recall of about 63.6%.  Also as we discussed, precision is the percentage of what the search classifier identified that was actually relevant.  So, in this example, the search returned a total of 1,810 documents, including 1,267 relevant documents, which gives you a precision of 70%.
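Here is the same arithmetic spelled out, using the counts from the contingency table above:

```python
tp, fp, fn = 1267, 543, 724   # counts from the contingency table above

recall = tp / (tp + fn)       # share of the 1,991 relevant documents that were found
precision = tp / (tp + fp)    # share of the 1,810 returned documents that are relevant

print(f"recall:    {recall:.1%}")     # ~63.6%
print(f"precision: {precision:.1%}")  # 70.0%
```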

Your hypothetical search, then, has fairly high precision (70%) and decent recall (about 63.6%).  The search could probably be revised to trade off some of that precision for higher recall or, possibly, to improve both numbers.  Subsequent iterations of the search can be easily tested in the same way to measure the effect of your iterative changes.

How Reliable Are These Estimates?

We noted at the beginning of this hypothetical that your control set was a random sample of 3,982 documents that had been taken and reviewed previously to estimate prevalence in the full document collection with a confidence level of 99% and a margin of error of +/-2%.  That same confidence level and margin of error, however, do not carry over to the estimates of recall and precision that you have made using the same documents.  Because recall and precision are each estimated from only a portion of the control set, your effective sample sizes for those measurements are smaller, which in turn makes your margins of error a little wider.

The effective sample size for a recall estimation performed in this way is not the total number of documents in the control set, but rather the number of relevant documents within it that are available to be found.  In this example, the search classifier is looking for the 1,991 relevant documents contained in the control set, which are effectively a random sample of 1,991 relevant documents from among all the relevant documents in the full document collection (a sampling frame you’ve already estimated to be about 50,000 documents).

The effective sample size for a precision estimation is also not the total number of documents in the control set, but rather the number of documents identified by the search classifier.  In this example, the search classifier identified 1,810 documents, which are effectively a random sample of 1,810 documents from among all the documents the search would return from the full document collection (a sampling frame you can estimate to be about 45,500 documents).

Some tools will provide you with these calculations automatically, but you can also plug these numbers into a sampling calculator yourself to work backwards and see what margin of error would apply to your recall and precision measurements.  In this example, your recall estimate would carry a margin of error of about +/-2.83% (at a confidence level of 99%), and your precision estimate would carry a margin of error of about +/-2.97% (also at a confidence level of 99%).  Thus, you could be very confident that your tested search had a recall between 60.77% and 66.43% and a precision between 67.03% and 72.97%.
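If you would rather check these margins yourself than rely on a sampling calculator, here is a sketch of the underlying calculation (normal approximation, worst-case 50% proportion, finite population correction); different calculators use slightly different conventions, so expect small differences in the last decimal place:

```python
import math
from statistics import NormalDist

def margin_of_error(sample_size, population, confidence=0.99, p=0.5):
    """Margin of error for a proportion, with finite population correction."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    moe = z * math.sqrt(p * (1 - p) / sample_size)
    fpc = math.sqrt((population - sample_size) / (population - 1))
    return moe * fpc

# Recall: 1,991 relevant control-set documents out of ~50,000 relevant overall.
print(f"{margin_of_error(1991, 50_000):.2%}")   # ~2.83%
# Precision: 1,810 returned documents out of ~45,500 the search would return overall.
print(f"{margin_of_error(1810, 45_500):.2%}")   # ~2.97%
```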


Upcoming in this Series

Next, in the final Part of this short series, we will discuss the application of these concepts and techniques to the evaluation of human document review.


About the Author

Matthew Verga

Director, Education and Content Marketing

Matthew Verga is an electronic discovery expert proficient at leveraging his legal experience as an attorney, his technical knowledge as a practitioner, and his skills as a communicator to make complex eDiscovery topics accessible to diverse audiences. An eleven-year industry veteran, Matthew has worked across every phase of the EDRM and at every level from the project trenches to enterprise program design. He leverages this background to produce engaging educational content to empower practitioners at all levels with knowledge they can use to improve their projects, their careers, and their organizations.
