Volume Estimation and Downstream Planning – Measure Twice Series, Part 5


A multi-part series providing guidance on how to effectively scope and plan eDiscovery projects

In the first Part of this series, we reviewed the value of preparation, planning, and checklists, as well as the evolving challenges and expectations associated with eDiscovery project planning.  In the second Part, we discussed the initial eDiscovery project scoping steps you must take.  In the third Part and fourth Part, we discussed some of the investigative steps that can follow, including targeted interviews, reactive data mapping, surveying, and sampling.  In this Part, we turn our attention to volume estimation and cost estimation.

How Much Is a Lot?

Once you have completed your initial planning and completed your investigation activities to validate and flesh out your initial assumptions, you should be equipped with enough information to proceed to estimations of project volumes and potential costs.  At a minimum, you need a reasonably accurate count of:

  • Custodians requiring collection
  • Devices per custodian requiring collection
  • Mailboxes and network shares requiring collection
  • Enterprise or departmental systems requiring collection
  • Backup tapes or other loose media requiring collection

You should also have some sense of how large each category of sources is (i.e., laptop size, mailbox size, etc.) and how broadly you expect to have to collect (i.e., full images vs. logical images vs. pre-filtered collections/exports).  With this information, making an educated guess as to your initial collected volume becomes simple math:

Custodians x (Sum of Issued Devices’ Typical Sizes)
+ Mailboxes x Typical Size
+ Network Shares x Typical Size
+ Sum of Enterprise and Departmental Systems’ Sizes
+ Backup Tapes/Storage Media x Tape/Media Sizes
= Approximate Total ESI Volume to Collect
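As a concrete illustration, that arithmetic can be sketched in a few lines of Python. Every count and size below is an invented placeholder for a hypothetical matter, not a recommendation – substitute the figures from your own investigation:

```python
# Rough collected-volume estimate, mirroring the formula above.
# All sizes are in GB; all figures are illustrative placeholders.

custodians = 20
device_sizes = [250, 60]           # e.g., laptop image + phone image per custodian
mailbox_count, mailbox_size = 20, 15
share_count, share_size = 5, 100
system_sizes = [500, 120]          # enterprise/departmental system exports
media_count, media_size = 10, 400  # backup tapes or other loose media

total_gb = (
    custodians * sum(device_sizes)
    + mailbox_count * mailbox_size
    + share_count * share_size
    + sum(system_sizes)
    + media_count * media_size
)
print(f"Approximate total ESI volume to collect: {total_gb:,} GB")
```

Even a rough spreadsheet or script like this makes it easy to test how sensitive your total is to any one assumption (e.g., doubling the typical mailbox size).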

Once you have this number, you will need to make some additional assumptions and adjustments to project your likely downstream volumes.

First, you’ll need to consider the expansion of the collected data volume that will occur at the beginning of processing.  For example, your collected data volume will include some number of compressed container files (e.g., ZIP, RAR, etc.), each of which will expand into one or more files larger than their compressed size.  Other types of compressed and nested content also exist (e.g., local PST and OST email stores), and during processing all will be fully expanded so each element can be individually normalized, tracked, and reviewed.  The amount of expansion can vary widely – from as little as 10% to more than 40% – depending on just what was in the original collection.

Second, you’ll need to consider the immediate reductions that will occur from de-NISTing, deduplication, and the application of any objective filters:

  • De-NISTing: It is standard practice in the industry to de-NIST each collection to eliminate system, software, and utility files that can have no bearing on the matter at hand.  Just how much material will be eliminated depends on how narrowly or broadly the collection was done.  Full disk images will be greatly reduced in volume; targeted collections of user files will not.
  • Deduplication: It is also standard practice in the industry to globally deduplicate each collection so that only one copy of each record needs to be reviewed, managed, etc.  Modern discovery software makes it simple to track each location where a duplicate appeared and to restore and handle those duplicates later.  Email-heavy collections will see the most volume reduction, since every party to an internal email communication will have duplicate copies of every message between them.
  • Objective Filtering: It is also very common for a new collection to have some objective filtering applied during processing based on the scope of the case or the negotiated limits of discovery – for example, the application of date restrictions or file type restrictions.  How much additional impact these filters have on volume will depend on how narrowly targeted the initial collection process was.

As both parties and collection tools have grown more sophisticated in recent years, the trend has definitely been towards smaller, more-targeted initial collections that therefore reduce less during this phase.  Additionally, it should be noted (when estimating volume for hosting costs) that the final, post-processing volume will expand slightly again when loaded into a review platform to accommodate the review platform’s database file, extracted text files, etc.
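These adjustments can be layered directly onto the collected-volume estimate. The expansion, reduction, and hosting-overhead rates below are illustrative assumptions for a hypothetical matter, not benchmarks – tune each to what your investigation revealed about the collection:

```python
# Projecting downstream volumes from the collected volume.
# All rates are illustrative assumptions, not benchmarks.

collected_gb = 11_620    # from the collection-volume estimate
expansion_rate = 0.25    # containers/PSTs expand ~10-40% during processing
reduction_rate = 0.45    # combined de-NISTing, deduplication, objective filters
hosting_overhead = 0.05  # review platform database, extracted text, etc.

expanded_gb = collected_gb * (1 + expansion_rate)
post_processing_gb = expanded_gb * (1 - reduction_rate)
hosted_gb = post_processing_gb * (1 + hosting_overhead)

print(f"Expanded during processing:      {expanded_gb:,.0f} GB")
print(f"After de-NISTing/dedupe/filters: {post_processing_gb:,.0f} GB")
print(f"Approximate hosted volume:       {hosted_gb:,.0f} GB")
```

Note that a narrowly targeted collection would warrant a much lower reduction rate than the one assumed here, for the reasons discussed above.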

A variety of tools and analyses are available to help you select your assumptions and do these sorts of estimations.  The EDRM group has collected several free calculators here, and their own EDRM Data Calculator is a good place to start.  For moving on to cost estimations and other downstream planning, you will need to estimate not only total volumes, but also potential file/document counts.  You can review EDRM’s published metrics for this here, and the 2016 results of John Tredennick’s annual study of “How Many Documents in a Gigabyte” are available here.

How Much Will It Cost?

At this point in your process, you have completed your initial planning, completed your investigation activities to validate and flesh out your initial assumptions, and you’ve completed your project volume estimations.  You now have enough information to also do cost estimation:

  • Source types and counts for estimating collection costs
  • Projected collected volume for estimating processing costs
  • Projected post-processing volume for estimating hosting costs
  • Projected file/document counts for estimating review costs

As with volume estimation, cost estimation is now largely a matter of math, multiplying your projected volumes and counts by your preferred service provider’s price list (or bundle, etc.).  Several of the calculators linked above for data volume estimation can be used to help you estimate pricing as well, and many service providers also offer their own calculator built to reflect their specific pricing model.
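In code, that multiplication might look like the sketch below. The price list is entirely hypothetical – every rate is an invented placeholder, not market data – so substitute your provider’s actual pricing model:

```python
# Minimal cost-estimation sketch: projected volumes and counts multiplied
# by a hypothetical per-unit price list. All rates are invented placeholders.

price_list = {
    "collection_per_source": 500.0,     # per device, mailbox, or share
    "processing_per_gb": 25.0,
    "hosting_per_gb_per_month": 10.0,
    "review_per_hour": 60.0,
}

sources = 50            # source counts from your investigation
collected_gb = 11_620   # projected collected volume
hosted_gb = 8_388       # projected post-processing (hosted) volume
review_hours = 4_000    # projected review hours
hosting_months = 12     # anticipated hosting duration

estimate = {
    "collection": sources * price_list["collection_per_source"],
    "processing": collected_gb * price_list["processing_per_gb"],
    "hosting": hosted_gb * price_list["hosting_per_gb_per_month"] * hosting_months,
    "review": review_hours * price_list["review_per_hour"],
}
estimate["total"] = sum(estimate.values())

for phase, cost in estimate.items():
    print(f"{phase:>10}: ${cost:,.0f}")
```

A breakdown like this also makes clear where the money goes – in most matters, review dwarfs the other phases, which is why the review estimate deserves the extra care described next.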

Estimation of the review costs portion does require some additional work.  Simply dividing your projected document count by 50 documents per hour to get a total number of hours of review to be performed will not give you an accurate estimate.  Instead, you must consider a number of additional variables:

  • How much do you believe the collection can be further reduced during ECA?
    • By using searching, sampling, filtering, clustering, etc.
  • Will you use near-duplicate identification and email threading to reduce further?
    • They both reduce volume and increase review speed
  • Will review be traditional or technology-assisted (e.g., predictive coding)?
    • The former takes more first-level hours, the latter more QC
  • How much training, oversight, and quality control time will be needed?
    • Management, oversight, and QC needs grow disproportionately with project size
  • How much privileged material do we anticipate needing to code and log?
    • Assume a minimum of 5-10% privileged materials requiring logging
  • Do we expect many spreadsheets, technical drawings, or other difficult documents?
    • If there are enough, establishing specialized workflows can save time

All of these variables will affect how much must be reviewed, how fast it can be reviewed, and how many labor hours the total effort will take.  A deep dive into document review design and management is beyond the scope of this series, but any experienced eDiscovery project manager can help you think through the options and their effects.
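To see how these variables interact, here is a rough Python sketch of a review-hours estimate. Every rate and percentage below is an illustrative assumption for a hypothetical matter, not a standard; real projects should substitute their own tested figures:

```python
# Review-hours sketch incorporating the variables above.
# All rates and percentages are illustrative assumptions.

documents = 1_000_000
eca_reduction = 0.60     # cut from searching, sampling, filtering, clustering
threading_speedup = 1.3  # near-dupe ID + email threading raise docs/hour
base_rate = 50           # baseline documents per reviewer hour
qc_overhead = 0.15       # training, oversight, and QC as a share of review hours
priv_rate = 0.07         # assume ~5-10% privileged documents requiring logging
priv_log_minutes = 3     # extra minutes to log each privileged document

reviewable = documents * (1 - eca_reduction)
first_level_hours = reviewable / (base_rate * threading_speedup)
qc_hours = first_level_hours * qc_overhead
priv_hours = reviewable * priv_rate * priv_log_minutes / 60

total_hours = first_level_hours + qc_hours + priv_hours
print(f"Reviewable documents:  {reviewable:,.0f}")
print(f"Estimated total hours: {total_hours:,.0f}")
```

Note how far this lands from the naive documents-divided-by-fifty calculation: the ECA reduction and threading speedup pull hours down, while QC overhead and privilege logging push them back up.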

Volume and Cost Estimation Checklists

As we noted in the first Part, checklists are invaluable tools for ensuring consistency and completeness in your efforts. Here are model checklists for volume and cost estimation, which you can customize for your organization:

Volume Estimation

  1. How many of each kind of source do we have?
  2. How large do we expect each individual source to be?
  3. How broadly or narrowly do we plan to collect for this matter?
  4. How much volume expansion do we anticipate during processing?
  5. How much reduction from de-NISTing, deduplication, and objective filters?

Cost Estimation

  1. What are our projected volumes before and after processing?
  2. How much additional volume reduction do we expect after processing?
  3. What review options and methods do we expect to employ in this matter?
  4. How much privileged or technically-challenging material do we expect?
  5. What price list, bundle, or model are we using for this matter?

Upcoming in this Series

In the next Part of this series, we will continue our discussion of eDiscovery project planning by considering project roles and communications.

About the Author

Matthew Verga, JD
Director, Education and Content Marketing

Matthew Verga is an electronic discovery expert proficient at leveraging his legal experience as an attorney, his technical knowledge as a practitioner, and his skills as a communicator to make complex eDiscovery topics accessible to diverse audiences.  A ten-year industry veteran, Matthew has worked across every phase of the EDRM and at every level from the project trenches to enterprise program design.  He leverages this background to produce engaging educational content to empower practitioners at all levels with knowledge they can use to improve their projects, their careers, and their organizations.
