Grading Papers: Measuring Human Review, Testing Classifiers Series Part 3


Random sampling is a powerful eDiscovery tool that can provide you with reliable measurements of the efficacy and efficiency of searches, reviewers, or other classifiers.

In “Pop Quiz: How Do You Test a Search?,” we discussed the application of sampling techniques to testing classifiers and introduced the concepts of recall and precision.  In “Show Your Work: Contingency Tables and Error Margins,” we applied those concepts to testing a hypothetical search classifier.  In this final Part, we apply them to human document review.

As we discussed previously, a classifier can be a search, a TAR process, or other things – including a human reviewer or a team of human reviewers.  Just as a search or a TAR tool is making a series of binary classification decisions, so too are your human reviewers, and the quality of those reviewers’ decisions can be assessed in a similar manner to how you assessed the quality of a search classifier in the last Part.  Depending on the scale of your review project, employing these assessment methods can be more efficient than a traditional multi-pass review approach, and in general, they are more precise and informative.

Human Classifiers and Control Sets

In this context, the reviewer (or reviewers) doing the initial review work is (or are) the classifier being tested.  The control set is effectively generated on the fly by the more senior attorney performing quality control review.  Their classification decisions are the standard against which the initial reviewer’s classification decisions can be judged.  If an appropriate document tagging palette is employed, it is not hard to track and compare both sets of decisions to create contingency tables like the one you used to assess your search classifier in the last Part.

In this context, however, we are not typically measuring the recall and precision of the human reviewers, although that could be done as well.  For human reviewers, it is more common to measure “accuracy” and “error rate.”  Accuracy is expressed as a percentage, and it represents the share of all classification decisions made by the initial reviewer(s) that were correct.  Error rate is also expressed as a percentage, and it represents the share of those decisions that were incorrect.  Because every decision is either correct or incorrect, accuracy and error rate always add up to 100%.

In terms of a contingency table, accuracy is derived from the combination of all true positives and true negatives, and error rate is derived from the combination of all false positives and false negatives.

An Example Application to Human Review

Let’s look at an example of how this works.  Let’s assume that you perform quality control review of a random sample of 350 of the 2,000 documents reviewed this week by a particular member of your initial review team.  After completing your classifications and comparing them to those of the initial reviewer, you get the following breakdown of results:

| | Deemed Relevant by the QC Reviewer | Deemed Not Relevant by the QC Reviewer |
| Deemed Relevant by the Initial Reviewer | 70 (True Positives) | 40 (False Positives) |
| Deemed Not Relevant by the Initial Reviewer | 65 (False Negatives) | 175 (True Negatives) |


As with your search classifier, it is now straightforward to calculate an estimated accuracy and error rate for this reviewer’s work this week.  As noted above, accuracy is a combination of all the correct classification decisions, i.e., true positives plus true negatives.  So, in this example, your reviewer made 245 correct decisions (70 + 175) out of 350 total decisions.  That gives you an accuracy of 70%.  As also noted above, error rate is a combination of all the incorrect classification decisions, i.e., false positives plus false negatives.  So, in this example, your reviewer made 105 incorrect decisions (40 + 65) out of 350 total decisions.  That gives you an error rate of 30%.
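The arithmetic above can be sketched in a few lines of Python.  This is only an illustration: the `ContingencyTable` class and its field names are hypothetical, chosen for this example, and the counts are the ones from the table in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContingencyTable:
    """Counts from comparing the initial reviewer's calls to the QC reviewer's calls.

    The class and field names are illustrative, not part of any review platform.
    """
    true_positives: int
    false_positives: int
    false_negatives: int
    true_negatives: int

    @property
    def total(self) -> int:
        # Every sampled document lands in exactly one of the four cells.
        return (self.true_positives + self.false_positives
                + self.false_negatives + self.true_negatives)

    @property
    def accuracy(self) -> float:
        # Correct decisions (true positives + true negatives) over all decisions.
        return (self.true_positives + self.true_negatives) / self.total

    @property
    def error_rate(self) -> float:
        # Incorrect decisions (false positives + false negatives) over all decisions.
        return (self.false_positives + self.false_negatives) / self.total


# Counts from the example QC sample of 350 documents.
sample = ContingencyTable(
    true_positives=70,
    false_positives=40,
    false_negatives=65,
    true_negatives=175,
)

print(f"decisions checked: {sample.total}")
print(f"accuracy:          {sample.accuracy:.0%}")
print(f"error rate:        {sample.error_rate:.0%}")
```

Keeping the four cells as separate counts, rather than storing only the final percentages, means the same data could also be used to compute recall and precision if you want them.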

There is also no reason that the same measurements could not be performed for more than one classification criterion based on the same quality control review (e.g., relevant and not relevant, privileged and not privileged, requiring redaction and not requiring redaction, etc.).  Any binary classifications for which you and your reviewers are making classification decisions can all be measured the same way.  The specific criteria measured and the specific results you get can then guide you in your ongoing reviewer training efforts, review oversight steps, and project staffing decisions.

How Reliable Are These Estimates?

As with testing search classifiers, you can work backwards from these results to determine how reliable these estimates of accuracy and error rate are, based on the size of the sampling frame (i.e., the total number of reviewed documents from which you pulled the sample) and the size of the sample you took.  In this example, the sampling frame would be 2,000 and the sample size would be 350.  Using those numbers, we find that your estimates of accuracy and error rate have a margin of error of +/-4.76% at a confidence level of 95%.  Thus, you could be 95% confident that the rest of the work this week from the reviewer in question was between 65.24% and 74.76% accurate.
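The article does not spell out the formula behind the +/-4.76% figure.  One common approach that reproduces it is the normal approximation for a proportion, using the most conservative assumed proportion of 50% and a finite population correction for the sampling frame.  The sketch below assumes that approach; the function name and its parameters are hypothetical, and an online sample size calculator may use a slightly different method.

```python
import math


def margin_of_error(sample_size: int, frame_size: int,
                    z: float = 1.96, proportion: float = 0.5) -> float:
    """Approximate margin of error for a sampled proportion.

    Assumptions (not stated in the article): normal approximation,
    z = 1.96 for a 95% confidence level, a worst-case proportion of 0.5,
    and a finite population correction for the sampling frame.
    """
    if not 0 < sample_size <= frame_size:
        raise ValueError("sample_size must be between 1 and frame_size")
    if frame_size == 1:
        return 0.0  # a frame of one document leaves nothing to estimate
    standard_error = math.sqrt(proportion * (1 - proportion) / sample_size)
    # The correction shrinks the margin as the sample covers more of the frame.
    correction = math.sqrt((frame_size - sample_size) / (frame_size - 1))
    return z * standard_error * correction


moe = margin_of_error(sample_size=350, frame_size=2000)
accuracy = 0.70  # estimated accuracy from the example QC sample

print(f"margin of error: +/-{moe:.2%}")
print(f"95% interval:    {accuracy - moe:.2%} to {accuracy + moe:.2%}")
```

With a sample of 350 from a frame of 2,000, this yields a margin of roughly 4.76%, matching the figures above.  Using 50% as the assumed proportion gives the widest possible margin, so the estimate errs on the side of caution.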

A Note about Lot Acceptance Sampling

“Lot acceptance sampling” is an approach to quality control that is employed in high-volume, quality-focused processes such as pharmaceutical production or military contract fulfillment.  In this approach, a maximum acceptable error rate is established, and each batch of completed materials is randomly sampled to check that batch’s error rate at a predetermined level of reliability.  If the batch’s error rate is below the acceptable maximum, the batch is accepted, and if the error rate is above the acceptable maximum, the batch is rejected.

Large-scale document review efforts have a lot in common with those other high-volume, quality-focused processes, and some particularly large review projects have employed lot acceptance sampling in a similar way.  Individual batches of documents reviewed by individual reviewers are the batches being accepted or rejected, and a random sample is checked from each completed batch.  Batches with a sufficiently low error rate move on to the next phase of the review and production effort; batches with too high an error rate are rejected and re-reviewed (typically by someone other than the original reviewer).  Error rates and batch rejections can be tracked by reviewer, by team, by classification type, or by other useful properties to identify problem areas for process improvement or problem reviewers for retraining or replacement.
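As a rough illustration of that workflow, the accept-or-reject decision for one batch might look like the sketch below.  Everything here is hypothetical: the function names, the 10% threshold, and the sample size of 50 are placeholders, and a formal lot acceptance plan would set sample sizes and acceptance limits from the desired level of reliability rather than using a simple cutoff.  Treating an error rate exactly equal to the maximum as acceptable is also an assumption of this sketch.

```python
import random


def draw_qc_sample(batch_doc_ids, sample_size, seed=None):
    """Pick a simple random sample of document IDs from one reviewed batch."""
    doc_ids = list(batch_doc_ids)
    rng = random.Random(seed)  # a seed makes the draw reproducible for audits
    return rng.sample(doc_ids, min(sample_size, len(doc_ids)))


def accept_batch(errors_found, sample_size, max_error_rate):
    """Return True if the sampled error rate is within the acceptable maximum."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return errors_found / sample_size <= max_error_rate


# Hypothetical batch of 500 documents, QC sample of 50, 10% maximum error rate.
to_check = draw_qc_sample(range(1, 501), sample_size=50, seed=2024)
print(f"documents to QC: {len(to_check)}")

for errors in (3, 8):
    verdict = "accept" if accept_batch(errors, 50, 0.10) else "reject and re-review"
    print(f"{errors} errors in 50 sampled documents: {verdict}")
```

Logging each verdict alongside the reviewer, team, and classification type would support the kind of tracking described above.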

Many practitioners become uncomfortable at the idea of deliberately identifying an acceptable error rate, or even of actively measuring the error rate at all, but avoiding knowledge of your errors does not prevent their existence.  It just prevents you from being able to address them or being prepared to defend them.  After all, as we have discussed before, the standards for discovery efforts are reasonableness and proportionality – not perfection.

For Assistance or More Information

Xact Data Discovery (XDD) is a leading international provider of eDiscovery, data management and managed review services for law firms and corporations.  XDD helps clients optimize their eDiscovery matters by orchestrating pristine communication between people, processes, technology and data.  XDD services include forensics, eDiscovery processing, Relativity hosting and managed review.

XDD offers exceptional customer service with a commitment to responsive, transparent and timely communication to ensure clients remain informed throughout the entire discovery life cycle.  At XDD, communication is everything – because you need to know.  Engage with XDD; we’re ready to listen.

About the Author

Matthew Verga

Director of Education

Matthew Verga is an electronic discovery expert proficient at leveraging his legal experience as an attorney, his technical knowledge as a practitioner, and his skills as a communicator to make complex eDiscovery topics accessible to diverse audiences. A fourteen-year industry veteran, Matthew has worked across every phase of the EDRM and at every level from the project trenches to enterprise program design. He leverages this background to produce engaging educational content to empower practitioners at all levels with knowledge they can use to improve their projects, their careers, and their organizations.
