A multi-part series on the essentials practitioners need to know about ESI collections
In “Collection and the Duty of Technology Competence,” we discussed lawyers’ duty of technology competence and the importance of understanding collection to fulfilling that duty. In “The Broad Scope of Collection,” we discussed the potential legal and technological scope of collection. In “How Computers Store ESI,” we discussed the operation of computer memory. In this part we review collection and recovery from that memory.
As we discussed in the last Part, collection is generally concerned with the primary, non-volatile data storage in a computer (or mobile device), whether that is a hard disk drive, a solid state drive, or the mobile device equivalent. There is a distinction between what’s actually, physically stored on that drive and what the computer is currently tracking in its master storage spreadsheet for that drive, and collections can be based on either.
Physical collections of storage drives capture an exact copy – or “image” – of everything on the physical storage, regardless of what the master storage spreadsheet says about where data is and isn’t on the drive. This is a bit-by-bit copy, also known as a “bitstream” copy, which replicates all the physical contents of the storage exactly as they are, essentially creating a virtual duplicate of that physical hardware. The primary benefits of this approach are its completeness and the potential for recovery of deleted files it provides.
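To make the idea of a bit-by-bit copy concrete, here is a minimal sketch in Python. It is illustrative only: the function name copy_bitstream and the in-memory "drive" are our own assumptions, standing in for a real storage device, and actual forensic imaging is done with dedicated tools rather than a script like this. The point is simply that the copy proceeds byte by byte from start to finish, without consulting any file system records about what is "in use."

```python
import io

def copy_bitstream(source, dest, chunk_size=4096):
    """Copy every byte from source to dest, in order, and return the byte count.

    Hypothetical helper for illustration; it reads the source sequentially
    and never interprets the contents as files or folders.
    """
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty read means the end of the source was reached
            break
        dest.write(chunk)
        total += len(chunk)
    return total

# An in-memory stand-in for a drive: active data, empty space, and a
# leftover fragment that no file system record points to anymore.
drive = io.BytesIO(b"ACTIVE FILE DATA" + b"\x00" * 16 + b"old deleted fragment")
image = io.BytesIO()

copied = copy_bitstream(drive, image, chunk_size=8)
print(copied, image.getvalue() == drive.getvalue())
```

Because nothing is skipped, the leftover fragment at the end of the stand-in drive ends up in the image along with the active data.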
Logical collections of storage drives work within the file system’s management of the drive in question rather than replicating the whole piece of hardware. Logical images exactly replicate everything tracked in the computer’s master storage spreadsheet or some defined subset of it (e.g., specific drive partitions or directories). The primary benefit of this approach is the ability to target more narrowly and collect less extraneous material.
Physical images capture two potential sources of information that logical images do not: slack space and unallocated space. As we noted in the last Part, files in computer storage take up one or more sectors or clusters on a drive. Sectors or clusters not currently in use for active storage are referred to as unallocated space. Some sectors or clusters that are in active use may only be partially full. The remaining, unused portion of such a sector or cluster is referred to as slack space.
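The arithmetic behind slack space can be sketched in a few lines of Python. The 4,096-byte cluster size used here is an assumption chosen for illustration (actual sizes depend on the drive and file system), and slack_bytes is a hypothetical helper name.

```python
def slack_bytes(file_size: int, cluster_size: int = 4096) -> int:
    """Return the unused bytes in the last cluster allocated to a file.

    cluster_size defaults to 4096 bytes purely as an illustrative assumption.
    """
    if file_size < 0 or cluster_size <= 0:
        raise ValueError("file_size must be >= 0 and cluster_size must be > 0")
    # Round up to whole clusters: a file cannot occupy a fraction of a cluster.
    clusters_needed = -(-file_size // cluster_size)
    return clusters_needed * cluster_size - file_size

# A 10,000-byte file needs 3 clusters (12,288 bytes), leaving 2,288 bytes of slack.
print(slack_bytes(10_000))
```

A file that exactly fills its clusters leaves no slack, while a file that spills even one byte into a new cluster leaves almost an entire cluster of it.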
Since computer deletion only deletes the records of what’s in sectors and clusters, rather than actually erasing them, both unallocated space and slack space may contain fragments of deleted files that had been stored there. A forensic examiner working with a full, physical image may be able to use specialized software tools to recover files or file fragments from unallocated or slack space and render them usable for investigation or litigation. While this is not a typical step in routine eDiscovery work, it can be invaluable in cases involving accidental or intentional deletion of relevant ESI.
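As a rough illustration of how such recovery can work, the sketch below scans raw bytes for the well-known start and end markers of a JPEG image, an approach often called file carving. This is a simplified stand-in of our own, not a description of how any particular forensic tool operates: the function name carve_jpegs and the sample data are hypothetical, and real tools must also handle fragmented, overwritten, and partially recoverable files.

```python
from typing import List

JPEG_START = b"\xff\xd8\xff"  # bytes that begin a JPEG file
JPEG_END = b"\xff\xd9"        # bytes that end a JPEG file

def carve_jpegs(raw: bytes) -> List[bytes]:
    """Return byte ranges in raw that begin and end with JPEG markers."""
    found = []
    pos = 0
    while True:
        start = raw.find(JPEG_START, pos)
        if start == -1:
            break  # no further start markers
        end = raw.find(JPEG_END, start + len(JPEG_START))
        if end == -1:
            break  # a start marker with no matching end marker
        found.append(raw[start:end + len(JPEG_END)])
        pos = end + len(JPEG_END)
    return found

# Stand-in for unallocated space: filler bytes around one leftover "image".
leftover = JPEG_START + b"\xe0 placeholder image data " + JPEG_END
unallocated = b"\x00" * 10 + leftover + b"\x00" * 5

recovered = carve_jpegs(unallocated)
print(len(recovered))
```

Note that this only works on a physical image, since a logical image would not include the unallocated bytes being searched.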
During the collection of ESI from computer or device drives, it is important to avoid altering the source in any way by the act of collection. As we noted in the last Part, computers are designed for efficiency rather than data preservation, and when in operation they have a constant flow of information being read from and written to various memory components. To avoid writing anything new to a drive while reading from it, forensic examiners use tools called write blockers. Write blockers are specialized hardware or software tools that block any write commands from being passed to a drive while it is being accessed for collection, ensuring the original source is unaltered by the collection activity.
Just as important as avoiding alteration to the source is verifying that the copies you’ve made are accurate ones. When copying large volumes of ESI (hundreds or thousands of files per source drive is common), there is some potential for errors to occur during the copying of some of those files. Hashing is used to validate that all files have been copied accurately.
Hashing is a technique by which sufficiently unique “fingerprints” can be generated for files. Hash functions are mathematical processes that take variable-length inputs (e.g., the data in a file) and use them to generate fixed-length outputs (e.g., the string of 32 hexadecimal characters produced by MD5). Forensic hashing typically uses a cryptographic hash function (commonly MD5 or SHA-1), which is well-suited to this purpose because, in practice, even a tiny change to the input yields a different output, and two different files are extremely unlikely to share a fingerprint by chance.
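A short Python sketch shows the fixed-length property in action, using the MD5 function from Python's standard hashlib module. The helper name md5_fingerprint and the sample inputs are our own, chosen only for illustration.

```python
import hashlib

def md5_fingerprint(data: bytes) -> str:
    """Return the MD5 hash of data as a string of hexadecimal characters."""
    return hashlib.md5(data).hexdigest()

tiny = md5_fingerprint(b"a")               # a one-byte input
large = md5_fingerprint(b"a" * 1_000_000)  # a one-million-byte input
changed = md5_fingerprint(b"b")            # a different one-byte input

# Both fingerprints are 32 characters long, whatever the input size.
print(len(tiny), len(large))
# Different inputs produce different fingerprints.
print(tiny != changed)
```

The same pattern works with SHA-1 by swapping in hashlib.sha1, which produces a 40-character fingerprint instead.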
To verify a collection’s accuracy, one set of fingerprints is generated from the source files, and that set is then compared to a second set generated from the copied files. Fingerprint matches confirm an accurate copy, and fingerprint mismatches identify copying errors.
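The comparison described above might be sketched as follows. This is a simplified illustration under our own assumptions, not a substitute for the verification built into forensic collection tools: the function names (hash_file, build_manifest, find_mismatches) and the sample files are hypothetical, and the sketch uses SHA-1 from Python's standard library.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List

def hash_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-1 fingerprint of one file, read in chunks."""
    digest = hashlib.sha1()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file's path relative to root onto its fingerprint."""
    return {
        p.relative_to(root).as_posix(): hash_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

def find_mismatches(source_root: Path, copy_root: Path) -> List[str]:
    """List source files that are missing from the copy or hash differently."""
    source = build_manifest(source_root)
    copy = build_manifest(copy_root)
    return sorted(name for name in source if copy.get(name) != source[name])

# Demonstration with two small temporary folders standing in for a
# source drive and its copy; one copied file is deliberately corrupted.
with tempfile.TemporaryDirectory() as tmp:
    src, cpy = Path(tmp) / "source", Path(tmp) / "copy"
    src.mkdir()
    cpy.mkdir()
    (src / "a.txt").write_bytes(b"contract draft")
    (cpy / "a.txt").write_bytes(b"contract draft")
    (src / "b.txt").write_bytes(b"email body")
    (cpy / "b.txt").write_bytes(b"email b0dy")  # simulated copying error
    mismatches = find_mismatches(src, cpy)

print(mismatches)  # prints ['b.txt']
```

An empty list would indicate that every source file has a matching fingerprint in the copy. Note that this sketch only checks the copy against the source, so extra files present only in the copy are not flagged.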
Upcoming in this Series
In the next Part of this series, we will review the intersection of legal and technical realities with a look at metadata, forensic soundness, and chain of custody.
About the Author
Matthew Verga, JD
Director, Education and Content Marketing
Matthew Verga is an electronic discovery expert proficient at leveraging his legal experience as an attorney, his technical knowledge as a practitioner, and his skills as a communicator to make complex eDiscovery topics accessible to diverse audiences. An eleven-year industry veteran, Matthew has worked across every phase of the EDRM and at every level from the project trenches to enterprise program design. He leverages this background to produce engaging educational content to empower practitioners at all levels with knowledge they can use to improve their projects, their careers, and their organizations.