An increasing amount of the electronically stored information (“ESI”) requested in litigation discovery originates in databases or other structured data repositories, rather than in the discrete e-mail messages, spreadsheets, and word processing files that have long made up the bulk of most ESI document productions. The reason for this shift is simple: businesses (and even some individuals) creating and managing their accumulated information have discovered that they are able to extract far more utility–and lower their overall management costs–if they store their data in a single repository and in a standardized format. These clear organizational and data mining benefits, combined with the steadily decreasing price of sophisticated database and database search technology, foreshadow a time when the majority of the world’s text-based ESI will be maintained and managed in this manner.
In the meantime, though, as has happened during other major shifts in technology, legal standards and best practices have lagged behind the migration of document-based information to databases. In civil litigation, many common procedures used to validate the accuracy of ESI produced in discovery cannot be applied to information taken from databases due to differences in underlying technology. In addition, lack of experience with database discovery has caused some confusion within the Bench and Bar, both of which have sometimes combined the wholly separate issues of database accuracy and reliability into a single, flawed analysis. Working correctly with this type of evidentiary material requires a more nuanced approach.
1. Testing the Accuracy of Copied Information
One threshold analysis in every document production is whether the materials being produced are accurate copies of their originals. For document-based ESI, a number of relatively fast and inexpensive validation approaches, such as comparing the hash values of original files (or hard drive partitions) and their copies, have gained wide acceptance as defensible methods for showing that copies are accurate and precise duplicates of the original files. Unfortunately, those file and container-based checks cannot be applied to database information, because the data is exported and written to wholly new files that typically contain less information than the original database and often have a different underlying structure. Even if the information is truly identical, the “containers” will be too different to permit these particular electronic comparisons to work.
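The file-level comparison described above can be sketched in a few lines of Python. This is a minimal illustration rather than a forensic tool: the function names are hypothetical, and SHA-256 is assumed as the hash algorithm, although other algorithms are also used in practice.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(original_path, copy_path):
    """True only if the two files are byte-for-byte identical."""
    return sha256_of_file(original_path) == sha256_of_file(copy_path)


def sha256_of_bytes(data):
    """Digest of an in-memory export (hypothetical helper for illustration)."""
    return hashlib.sha256(data).hexdigest()
```

The same records written out as a CSV file and as a JSON file will produce different digests even though the information is identical, which is why this check does not carry over to database exports.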
The absence of an easy, “silver bullet” procedure for authenticating database information doesn’t mean, however, that copies of database information cannot be tested for accuracy. Manual checks, such as tallying the number of database records or rows produced, provide points of direct comparison between the results of a database search query and a production of database information. It is also relatively straightforward to test some (but not necessarily all) of the data fields that are being produced for evidence of data truncation–something that can easily happen if the contents of a text entry field are ported to a spreadsheet or other application that only accepts a set number (e.g., 256) of characters in a single data field. Data truncation is easy to spot if the data loss begins mid-word or mid-sentence; it can be much more difficult to find when the truncation is at the sentence or paragraph level, as can happen when line breaks or carriage returns are falsely recognized as end-of-field markers by an application.
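Both checks can be partly automated. The sketch below assumes that the source query results and the produced export have each been loaded as lists of dictionaries sharing a unique key field; the function names and the field names used in the example are hypothetical.

```python
def row_counts_match(source_rows, produced_rows):
    """Compare the number of records returned by the query and produced."""
    return len(source_rows) == len(produced_rows)


def find_truncated_fields(source_rows, produced_rows, key_field, text_fields):
    """Flag produced text fields that are a shortened prefix of the source."""
    produced_by_key = {row[key_field]: row for row in produced_rows}
    findings = []
    for row in source_rows:
        produced = produced_by_key.get(row[key_field])
        if produced is None:
            findings.append((row[key_field], None, "record missing from production"))
            continue
        for field in text_fields:
            original = row.get(field) or ""
            exported = produced.get(field) or ""
            # A shorter value that matches the start of the original suggests
            # truncation, including cuts at a sentence or paragraph boundary.
            if original != exported and original.startswith(exported):
                findings.append(
                    (row[key_field], field,
                     f"{len(exported)} of {len(original)} characters produced")
                )
    return findings
```

Because the comparison is made against the source value rather than by looking for cut-off words, it also catches the harder case in which a field is cut cleanly at a line break.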
2. The Fallacy of Assuming the Reliability of Database Information
A more insidious problem in the production of database information is a general tendency–by everyone–to assume that the substantive information in a database is always correct. After all, the reasoning goes, someone or some action entered this information into the database–why would inaccurate information be memorialized like this? Thus, unlike other documents, the contents of databases are sometimes accepted as correct without formal validation. This can quickly lead to significant problems.
Databases typically include a combination of information entered by manual (i.e., human) data entry and information automatically created by the database application itself. Both types of information are often requested in discovery–the date and time when a database record was created, for example, can be important substantive evidence in some disputes. However, both may contain material errors that make the data insufficiently reliable to admit as substantive evidence. On the human data entry side, simple problems, such as typos and transposed numbers, may be quite common in database fields that aren’t subject to real-time spell-checking or double key entry to identify typos. Similarly, database information may be correctly entered but based on incorrect source materials. For example, it’s not hyperbole to suggest that just about every database of contact information contains some out-of-date address information or entries relating to deceased individuals. Eight years after I moved to my current residence, I’m still receiving mail addressed not only to the previous owners, but also to the owners before them.
In a business context, database information may also contain other substantive errors. Call center databases, for example, may contain the exact information conveyed by a customer–but not all of that information may be correct. In the pharmaceutical context, someone reporting a possible adverse reaction to a drug may mis-remember the last time that it or other medicines were taken or the doses that were ingested. Because call center staff have no way to validate the accuracy of this information, it is simply entered as reported. Only close analysis may ever reveal that some of this information is materially incorrect.
Database fields automatically populated by the underlying application can also contain errors. For example, when a new database record is created, the database application reads and records the current date and time from the server or workstation on which it is running. Not all of these clocks and calendars are correct. Even when machine clocks are accurately set, the host database server may be located in another time zone, again populating a database record with information that may not be consistent with that of other records. Databases that capture the name of the user entering or updating information in a database record may record incorrect information if the user has logged in from a workstation assigned to someone else or if the user has signed in with another user’s ID and password. Most of these data inconsistencies are unintentional or viewed as inconsequential to the database’s business purposes, but in the harsh and sometimes slanted light of litigation-driven data analysis, these inaccuracies may be given undeserved significance. Cases also exist where data has been intentionally spoofed, and these tell-tales may be the best way of determining which database information is correct and which must be set aside as unreliable.
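The time zone problem in particular lends itself to a simple consistency check. The sketch below assumes that each server's offset from UTC is known and fixed (daylight saving changes are ignored), and that record IDs are assigned sequentially in creation order; both are assumptions made for illustration, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def to_utc(local_timestamp, utc_offset_hours):
    """Convert a server-local timestamp to UTC using a fixed offset."""
    server_zone = timezone(timedelta(hours=utc_offset_hours))
    return local_timestamp.replace(tzinfo=server_zone).astimezone(timezone.utc)


def flag_out_of_sequence(records):
    """Return IDs whose UTC timestamp is earlier than the preceding record's.

    `records` is a list of (record_id, utc_timestamp) pairs. A later ID with
    an earlier timestamp does not prove an error; it only marks the record
    for closer review.
    """
    ordered = sorted(records, key=lambda record: record[0])
    flagged = []
    for previous, current in zip(ordered, ordered[1:]):
        if current[1] < previous[1]:
            flagged.append(current[0])
    return flagged
```

For example, a record stamped 23:30 on March 1 by a server five hours behind UTC was actually created half an hour after a record stamped 05:00 on March 2 by a server one hour ahead of UTC, although the raw values suggest the opposite order.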
Conclusion
Working with client databases is an increasingly important part of litigation discovery. It’s critical to recognize ways that this information may be misleading (unintentionally or not) and to take proactive steps to minimize the risk of relying on flawed information. It’s also important not to confuse validation of the data copying process with validation of the underlying information. While problems in either of these areas can make a production of database information insufficiently reliable to be admitted as substantive evidence, imperfect copies can usually be remedied. In contrast, a perfect copy of flawed original data remains fundamentally flawed.
In partnership with technologists, lawyers and judges are working to develop better guidance for practitioners working with databases in discovery. Such efforts should help educate the legal community and promote greater understanding of this highly nuanced aspect of discovery. In the meantime, though, legal teams should be careful not to gloss over the unique procedures needed to establish the authenticity and accuracy of information extracted from databases. Failing to follow them may make the difference between success and evidentiary catastrophe.