Threading, Duplicates & Near-Duplicates, ECA Fundamentals Series Part 4


A multi-part series on the fundamentals eDiscovery practitioners need to know about effective early case assessment in the context of electronic discovery

In “Clearing the Fog of War,” we reviewed the uncertainty inherent in new matters and the three overlapping goals of ECA.  In “Sampling a Well-Stocked Toolkit,” we began our survey of available tools and techniques with an overview and with a discussion of sampling.  In “Searching and Filtering for Fun and Profit,” we continued our survey with a discussion of searching and filtering options.  In this Part, we turn our attention to tools for handling threading and duplicates.

After sampling tools and searching and filtering tools, the next major types of tools available for pursuing the three goals of ECA are those for handling email threading and those for handling duplicates and near-duplicates.


Despite the rise of mobile and social sources and other alternative communication channels, email remains a major component of most ESI collections for eDiscovery, and it tends to be voluminous.  A single gigabyte of email can easily contain 5,000 to 10,000 discrete email messages, plus their attachments and embedded images and objects.  Thankfully, email ESI also typically contains a significant amount of repetition and overlap that can be skipped.

For example, if you collect email from two custodians, you will have multiple copies of the email messages sent between them: a sender copy and a recipient copy for each one.  Moreover, if they are engaged in a thread of replies to each other, the emails in such a thread may contain the preceding emails within themselves as quoted text, and the last one in the thread may contain the full text of the whole thread within itself.  An email that contains a full thread in this way is sometimes referred to as an “inclusive email,” as is any standalone or offshoot email that contains unique content or attachments.

Email threading tools typically offer some version of two functions to users: conversation threading and inclusive email identification.  Additionally, many now offer visualization features as an alternative way to explore the email threads and inclusive emails identified by the system.

Conversation Threading

Conversation threading is a process in which emails are analyzed and automatically organized into thread groups, arranged chronologically.  This analysis looks at existing conversation IDs, if available, and a range of email header fields and other document properties to match up replies in sequence.  Such organization makes it possible to quickly identify related materials, speeding up investigation, and to quickly see the context surrounding a particular message, improving understanding.  Additionally, presenting emails to reviewers as organized threads speeds up later document review.
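
To make the idea concrete, here is a minimal sketch of threading by reply links.  It assumes each email has already been reduced to a small record whose field names (id, in_reply_to, sent) are hypothetical stand-ins modeled on the Message-ID and In-Reply-To headers.  Real platforms also draw on conversation IDs, subject lines, and other properties, so treat this as an illustration rather than how any particular tool works.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical, simplified email records; real tools use many more fields.
emails = [
    {"id": "<a1>", "in_reply_to": None, "sent": datetime(2020, 1, 6, 9, 0)},
    {"id": "<a3>", "in_reply_to": "<a2>", "sent": datetime(2020, 1, 6, 10, 15)},
    {"id": "<a2>", "in_reply_to": "<a1>", "sent": datetime(2020, 1, 6, 9, 30)},
    {"id": "<b1>", "in_reply_to": None, "sent": datetime(2020, 1, 7, 8, 0)},
]


def build_threads(emails):
    """Group emails into threads by following reply links back to a root."""
    by_id = {e["id"]: e for e in emails}

    def root_of(email):
        seen = set()  # guards against malformed, circular reply links
        while email["in_reply_to"] in by_id and email["id"] not in seen:
            seen.add(email["id"])
            email = by_id[email["in_reply_to"]]
        return email["id"]

    threads = defaultdict(list)
    for e in emails:
        threads[root_of(e)].append(e)

    # Arrange each thread group chronologically.
    for members in threads.values():
        members.sort(key=lambda e: e["sent"])
    return dict(threads)


threads = build_threads(emails)
```

In this sketch, a reply whose parent was never collected simply becomes the root of its own thread, which is one reason real tools fall back on additional fields.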

Inclusive Email Identification

Inclusive email identification is a process in which textual analysis is used to identify inclusive emails, i.e. those that contain a full thread within themselves or that otherwise contain unique text or attachments.  Identifying the inclusive emails allows you to more quickly get the full picture, speeding up investigation, and when used as a filter, it can dramatically reduce the number of emails requiring later document review.
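
The following is a rough sketch of the inclusive-email idea, not a description of any platform's actual analysis.  It assumes a thread is a list of hypothetical records with id, body, and attachments fields, that quoted text appears verbatim inside later replies, and that exact duplicates have already been removed.  An email is treated as non-inclusive only when both its text and its attachments are fully contained in some other email in the thread.

```python
def normalize(text):
    """Lowercase and collapse whitespace so formatting differences matter less."""
    return " ".join(text.lower().split())


def find_inclusive(thread):
    """Return ids of emails not fully contained in another email of the thread.

    Assumes exact duplicates were removed beforehand; two identical emails
    would otherwise each be treated as contained in the other.
    """
    inclusive = []
    for email in thread:
        body = normalize(email["body"])
        attachments = set(email["attachments"])
        covered = any(
            other is not email
            and body in normalize(other["body"])
            and attachments <= set(other["attachments"])
            for other in thread
        )
        if not covered:
            inclusive.append(email["id"])
    return inclusive


# Hypothetical three-message thread; m2 carries an attachment that m3 lacks.
thread = [
    {"id": "m1", "body": "Can you send the draft?", "attachments": []},
    {"id": "m2", "body": "Attached. > Can you send the draft?",
     "attachments": ["draft.docx"]},
    {"id": "m3", "body": "Thanks, got it. > Attached. > Can you send the draft?",
     "attachments": []},
]

inclusive_ids = find_inclusive(thread)
```

Here the first message is fully quoted in later replies, so it drops out.  The second message survives even though its text is quoted in the third, because its attachment appears nowhere else.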


As we have discussed before, the operation of computer systems can produce a lot of duplicate files (including duplicate emails, as noted above).  Although duplicates may need to be tracked and reported on in certain circumstances, they do not need to be examined during ECA or included in later review.  Such duplicate files are identified using a technique called “hashing.”


In hashing, a sufficiently unique fingerprint is generated for each file in a collection by feeding the files into a cryptographic hash function (commonly MD5 or SHA-1).  For each file input, the hash function generates a fixed-length output called a “hash” or “hash value” (e.g., a string of 32 hexadecimal characters for MD5).  Identical files produce identical hash values, and because hash values are fixed-length and relatively short, they can be efficiently compared by software to automatically identify matches across even a large collection of ESI.
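
As a small illustration of these properties, the sketch below uses Python's standard hashlib module.  The helper name file_hash and the sample byte strings are our own assumptions for the example; a processing platform would hash the actual file contents.

```python
import hashlib


def file_hash(data: bytes, algorithm: str = "md5") -> str:
    """Return the hexadecimal hash value for a file's raw bytes."""
    return hashlib.new(algorithm, data).hexdigest()


# Hypothetical file contents: two identical copies and one edited version.
copy_a = b"Quarterly budget: 100, 200, 300"
copy_b = b"Quarterly budget: 100, 200, 300"
edited = b"Quarterly budget: 100, 200, 301"

same = file_hash(copy_a) == file_hash(copy_b)       # identical input, identical hash
different = file_hash(copy_a) != file_hash(edited)  # one changed character

md5_length = len(file_hash(copy_a, "md5"))    # MD5 yields 32 hex characters
sha1_length = len(file_hash(copy_a, "sha1"))  # SHA-1 yields 40 hex characters
```

Note that even a one-character change produces a different hash value, which is why hashing finds only true duplicates and not near-duplicates.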

Duplicate Identification

Typically, collected ESI is hashed and deduplicated during processing and loading, prior to the ECA phase of the project.  But, because other rules (e.g., family group preservation) may override deduplication, some duplicates may remain.  For example, if the same spreadsheet was attached to two different emails, neither copy would be removed.  Most platforms provide features for identifying and managing such duplicates within your loaded collection of materials.
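
To illustrate how remaining duplicates might be surfaced, here is a minimal sketch that groups documents by hash value.  The document names and contents are hypothetical, and the function find_duplicate_groups is our own illustration rather than a feature of any particular platform.

```python
import hashlib
from collections import defaultdict

# Hypothetical loaded collection: the same spreadsheet attached to two emails.
documents = {
    "email_001/budget.xlsx": b"Q1,Q2\n100,200\n",
    "email_002/budget.xlsx": b"Q1,Q2\n100,200\n",
    "email_002/notes.txt": b"Call the client on Monday.",
}


def find_duplicate_groups(documents):
    """Group document names by content hash; return groups with 2+ members."""
    groups = defaultdict(list)
    for name, content in documents.items():
        groups[hashlib.md5(content).hexdigest()].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


duplicate_groups = find_duplicate_groups(documents)
```

Both copies of the spreadsheet stay in the collection with their parent emails, but grouping them this way allows them to be reviewed and coded consistently.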

Repeated Content Identification

In addition to identifying fully duplicated documents, many platforms also offer some form of repeated content identification.  Such features are designed to automatically identify frequently repeated blocks of text (e.g., email signature blocks and automatic confidentiality warnings).  Those blocks can then be filtered out of search results, which reduces false positives (particularly for privilege searches), and omitted from semantic/conceptual indices, which improves the effectiveness of the semantic/conceptual analytic tools we will discuss in the next Part.
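
One simple way to picture repeated content identification is to count how many documents each block of text appears in.  The sketch below assumes blocks are paragraphs separated by blank lines and uses a hypothetical min_docs threshold; actual platforms likely use more sophisticated segmentation and matching.

```python
from collections import Counter


def find_repeated_blocks(documents, min_docs=3):
    """Return text blocks that appear in at least min_docs documents."""
    counts = Counter()
    for text in documents:
        # Use a set so a block counts once per document, however often it recurs.
        blocks = {b.strip() for b in text.split("\n\n") if b.strip()}
        counts.update(blocks)
    return {block for block, n in counts.items() if n >= min_docs}


def strip_repeated_blocks(text, repeated):
    """Remove repeated blocks from one document before searching or indexing."""
    kept = [b.strip() for b in text.split("\n\n")
            if b.strip() and b.strip() not in repeated]
    return "\n\n".join(kept)


# Hypothetical emails sharing an automatic confidentiality warning.
footer = "CONFIDENTIAL: This message may be privileged."
docs = [
    "Hi Bob,\n\nSee attached.\n\n" + footer,
    "Please call me.\n\n" + footer,
    "Meeting moved to 3pm.\n\n" + footer,
]

repeated = find_repeated_blocks(docs, min_docs=3)
cleaned = strip_repeated_blocks(docs[0], repeated)
```

With the footer stripped, a privilege search for a term like "privileged" would no longer hit every email that merely carries the automatic warning.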


In addition to true duplicates, it is common for collections of ESI to contain large numbers of “near-duplicates.”  Near-duplicates are documents that are substantially similar to each other, but not truly identical (and therefore not removed during deduplication).  There are two main types of near-duplicates that occur:

  1. Superficially-identical documents that only vary in some metadata property, typically arising from their different sources or collection methods
  2. Documents with some actual variation in content, like successive drafts of a contract

Finding the former reduces the number of documents to consider (and later review), while ensuring consistent treatment across duplicates.  Finding the latter can provide valuable context about the development of key documents over time.

Near-Duplicate Identification

Textual near-duplicate identification is somewhat more complicated behind the scenes than true-duplicate identification.  Rather than comparing whole documents as single, abstracted values, the full textual content of the documents must be broken down into smaller pieces (sometimes called “shingles”).  These small pieces can then be hashed and the sequences of those pieces compared across documents.  If a sufficient number of pieces match, in the right order, the documents will be treated as near-duplicates.  Typically, the threshold of similarity at which the system treats two documents as near-duplicates can be customized.
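
The sketch below illustrates the shingle idea under some simplifying assumptions.  It uses overlapping three-word shingles, hashes each one, and scores two documents by Jaccard similarity (the share of hashed shingles they have in common).  Jaccard similarity compares sets and does not check the order of the pieces beyond what each shingle captures, so it is a simplification of the sequence comparison described above.  The shingle size and the 0.5 threshold are arbitrary choices for illustration.

```python
import hashlib


def shingles(text, size=3):
    """Break text into overlapping word sequences ("shingles")."""
    words = text.lower().split()
    if len(words) < size:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def hashed_shingles(text, size=3):
    """Hash each shingle so the pieces can be compared as short fixed-length values."""
    return {hashlib.md5(s.encode("utf-8")).hexdigest() for s in shingles(text, size)}


def similarity(a, b, size=3):
    """Jaccard similarity: shared hashed shingles divided by all hashed shingles."""
    sa, sb = hashed_shingles(a, size), hashed_shingles(b, size)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def is_near_duplicate(a, b, threshold=0.5, size=3):
    """Treat two documents as near-duplicates at or above the threshold."""
    return similarity(a, b, size) >= threshold


# Hypothetical successive drafts of a contract clause, plus an unrelated note.
draft_1 = "The supplier shall deliver the goods within thirty days of the order date"
draft_2 = "The supplier shall deliver the goods within sixty days of the order date"
unrelated = "Lunch on Friday works for me if the weather holds"

drafts_match = is_near_duplicate(draft_1, draft_2)
unrelated_match = is_near_duplicate(draft_1, unrelated)
```

Raising the threshold makes the grouping stricter, and lowering it pulls in looser variations, which mirrors the customizable similarity setting most platforms expose.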

Threading and Duplicate Tools and the Three Goals

These threading and duplicate tools can yield benefits for each of the three ECA goals:

  1. For Traditional ECA, conversation threading provides quick access to related messages and substantive context, as does near-duplicate identification. The ability to identify and filter repeated content also improves your search and advanced analytics results, making investigation easier.
  2. For EDA, mapping conversation threads and inclusive emails can reveal gaps requiring further collection, identifying excessive duplicates can reveal low-priority tranches of ESI, and identifying excessive near-duplicates might reveal a collection process issue.
  3. For Downstream Prep, the ability to organize conversation threads together and near-duplicates together can dramatically speed up later review work, and the ability to eliminate non-inclusive emails and true duplicates from that review work can dramatically reduce volume.

Upcoming in this Series

In the next Part, we will conclude our survey of available tools and techniques with a discussion of analytic tools and technology-assisted review workflows.

About the Author

Matthew Verga

Director, Education and Content Marketing

Matthew Verga is an electronic discovery expert proficient at leveraging his legal experience as an attorney, his technical knowledge as a practitioner, and his skills as a communicator to make complex eDiscovery topics accessible to diverse audiences. A twelve-year industry veteran, Matthew has worked across every phase of the EDRM and at every level from the project trenches to enterprise program design. He leverages this background to produce engaging educational content to empower practitioners at all levels with knowledge they can use to improve their projects, their careers, and their organizations.
