The Southern District of New York has published a new TAR order, upholding a predictive coding process as reasonable but requiring some additional transparency following several discovery missteps, including the inadvertent production of relevant documents.
“[R]easonableness and proportionality, not perfection and scorched-earth, must be their guiding principles.”
By my count, the Southern District of New York has published more predictive coding opinions than any other court in the world. As Judge Peck famously wrote, it is indeed “black letter law” that if a party wants to use TAR, they will be allowed to do so. This new order underscores a shift in the jurisprudence away from issues like defensibility and towards the more nuanced issues around process.
The case centers on New York City’s affordable housing program and policies that the Plaintiffs claim have a disparate impact on racial minorities in violation of state and federal law. The court was involved in the discovery process early, modulating the Plaintiffs’ requests and ultimately ordering the City to produce documents from 50 custodians using “heavily negotiated” search terms. Pursuant to the protocol, the City initiated a linear, manual review of the results.
The Plaintiffs “lodged numerous complaints” regarding the pace of the linear review process. In response, the court directed the use of TAR. Things seemed to be back on track, but then took a turn for the worse.
In the course of discovery, the Plaintiffs received documents that the City had identified in slip sheets as non-responsive but that, “due to a production error,” the Plaintiffs were nonetheless able to view, read, and identify as actually relevant. The Plaintiffs used these documents as the basis for an aggressive transparency motion and requested large volumes of machine-categorized documents. Specifically, they sought documents that were close to the threshold of relevance and documents that were ultimately re-categorized by humans as relevant. In addition, the Plaintiffs sought information regarding the administration of predictive coding, such as the ranking system used (i.e., what cut-off was used, and how many documents were deemed responsive and unresponsive at each ranking).
Magistrate Judge Katharine Parker observed that “[c]ourts are split as to the degree of transparency required” in the predictive coding process. She noted several cases featuring party-driven agreements that promoted transparency, but did not identify any authority in which a court ordered transparency. Likewise, this court, for the most part, stopped short of ordering transparency.
“In sum, the City’s training and review processes and protocols present no basis for finding that the City engaged in gross negligence in connection with its ESI discovery—far from it.”
Instead of ordering transparency, the court conducted an in camera review of some contested documents, which “reveal[ed] that the City appropriately trained and utilized its TAR system.” The City’s seed set included over 7,200 documents that were manually reviewed and used to train the TAR system. The set consisted of randomly selected documents along with some interesting pre-coded exemplar documents, such as the Plaintiffs’ document requests and pleadings. The City conducted five training rounds, each followed by a validation process. In addition, the City provided detailed training and resources for its document review team.
As for the inadvertently produced documents, the court found them de minimis after an in camera review of an 80-document sample.
“However, this Court does not find the labeling of these 20 documents, only 5 of which were ‘incorrectly’ categorized as non-responsive during the initial ESI review—out of the 100,000 documents that have been reviewed thus far in this case—sufficient to question the accuracy and reliability of the City’s TAR process as a whole.”
Nevertheless, in light of the deficiencies identified, the court was persuaded to require the City to provide an additional measure of transparency: It ordered the City to deliver to Plaintiffs random samples of non-responsive, non-privileged documents from the unproduced data sets, enabling them to conduct an ad hoc validation process of their own. Though an unusual remedy, it falls far short of requiring a party to share its “seed set” or other key details about the predictive coding process.
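To make the idea of sample-based validation concrete, here is a minimal Python sketch of how a receiving party might work with a random sample of documents coded non-responsive. Everything in it is an assumption for illustration: the function names, the sample size, and the made-up coding results are ours, and the order does not prescribe any particular method or statistic.

```python
import random


def draw_validation_sample(doc_ids, sample_size, seed=0):
    """Draw a simple random sample, without replacement, from the
    documents coded non-responsive (hypothetical helper)."""
    pool = list(doc_ids)
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    return rng.sample(pool, min(sample_size, len(pool)))


def elusion_rate(sample_labels):
    """Share of sampled 'non-responsive' documents that a second-look
    reviewer actually found responsive."""
    if not sample_labels:
        raise ValueError("sample is empty")
    hits = sum(1 for is_responsive in sample_labels if is_responsive)
    return hits / len(sample_labels)


# Illustrative only: 1,000 hypothetical document IDs, sample of 50.
sample = draw_validation_sample(range(1000), 50, seed=1)
# Made-up second-look coding: 2 of the 50 turn out to be responsive.
labels = [True, True] + [False] * 48
print(f"Estimated elusion rate: {elusion_rate(labels):.1%}")
```

A low rate in the sample suggests few responsive documents were left behind, though a sample this small supports only a rough estimate.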
The court’s explanation of TAR in the order reflects an older technology paradigm often referred to as “TAR 1.0.” Under this kind of system, the producing party creates a seed set to train the machine learning algorithm. Once that seed set reaches critical mass, the machine uses it to rank all remaining documents in the corpus according to relevance, and only documents above a certain relevance threshold are reviewed.
The Plaintiffs’ core argument is that the City improperly trained its TAR 1.0 system by over-designating documents as non-responsive. This, they argued, led the system to categorize documents as non-relevant that were actually relevant. And because of the nature of the TAR 1.0 system, the Plaintiffs argued for extensive transparency into the inner workings of the TAR administration and further data validation.
We’ve been writing about this exact issue for years and have consistently observed that continuous machine learning not only avoids these costly side-litigations but also produces better results. With continuous machine learning there is no “seed set” as the court defines it here. Every document reviewed continually trains and refines the algorithm, and the “seed set,” to the degree there is one, is the production itself.
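The train-once, rank, and cut-off workflow can be sketched in a few lines of Python. This is a toy illustration under our own assumptions, not the City’s system or any vendor’s algorithm: the smoothed log-odds term weights, the function names, the sample texts, and the cut-off of 0 are all hypothetical.

```python
import math
from collections import Counter


def train(seed_set):
    """Learn per-term weights (smoothed log-odds) from a seed set of
    (text, is_responsive) pairs coded by human reviewers."""
    pos, neg = Counter(), Counter()
    for text, is_responsive in seed_set:
        (pos if is_responsive else neg).update(set(text.lower().split()))
    n_pos = sum(1 for _, r in seed_set if r)
    n_neg = len(seed_set) - n_pos
    return {
        term: math.log((pos[term] + 1) / (n_pos + 2))
        - math.log((neg[term] + 1) / (n_neg + 2))
        for term in set(pos) | set(neg)
    }


def score(weights, text):
    """Relevance score: sum of the weights of the distinct terms present."""
    return sum(weights.get(t, 0.0) for t in set(text.lower().split()))


def rank_and_cut(weights, corpus, cutoff):
    """Rank every document once, then split at a fixed cut-off.
    Documents below the cut-off (the 'null set') are never reviewed."""
    ranked = sorted(
        ((score(weights, text), doc_id) for doc_id, text in corpus.items()),
        reverse=True,
    )
    to_review = [doc_id for s, doc_id in ranked if s >= cutoff]
    null_set = [doc_id for s, doc_id in ranked if s < cutoff]
    return to_review, null_set


seed = [
    ("affordable housing lottery preference", True),
    ("community preference policy housing", True),
    ("lunch menu friday", False),
    ("parking garage schedule", False),
]
corpus = {"A": "housing lottery results", "B": "friday parking notice"}
to_review, null_set = rank_and_cut(train(seed), corpus, cutoff=0.0)
print(to_review, null_set)
```

Because the model is frozen after training, any systematic miscoding in the seed set carries through to every ranking decision, which is why a dispute over training tends to turn into a dispute over transparency.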
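For readers who want to see the difference in mechanics, here is a minimal Python sketch of a continuous learning loop. It is a simplified illustration under stated assumptions, not our product’s implementation: the keyword-overlap model, the batch size, and the stopping rule (stop after a batch with no responsive documents) are hypothetical stand-ins, and real stopping criteria are more careful.

```python
def train(labeled_pairs):
    """Toy model: the set of terms seen in documents coded responsive."""
    terms = set()
    for text, is_responsive in labeled_pairs:
        if is_responsive:
            terms.update(text.lower().split())
    return terms


def score(model, text):
    """Toy relevance score: number of distinct terms shared with the model."""
    return len(model & set(text.lower().split()))


def continuous_review(corpus, review, batch_size=2, stop_after_empty_batches=1):
    """Retrain on every coding decision made so far, re-rank what is left,
    and review the top-ranked batch; repeat until batches come up empty."""
    labeled = {}
    empty_batches = 0
    while len(labeled) < len(corpus) and empty_batches < stop_after_empty_batches:
        model = train([(corpus[d], r) for d, r in labeled.items()])
        unreviewed = [d for d in corpus if d not in labeled]
        unreviewed.sort(key=lambda d: score(model, corpus[d]), reverse=True)
        hits = 0
        for doc_id in unreviewed[:batch_size]:
            labeled[doc_id] = review(doc_id)  # human coding decision
            hits += int(labeled[doc_id])
        empty_batches = empty_batches + 1 if hits == 0 else 0
    return labeled


corpus = {
    "d1": "housing lottery preference", "d2": "lunch menu",
    "d3": "lottery community preference", "d4": "parking schedule",
    "d5": "community district housing", "d6": "holiday party",
    "d7": "holiday schedule", "d8": "menu update",
}
truth = {"d1": True, "d3": True, "d5": True}  # made-up reviewer calls
coded = continuous_review(corpus, lambda d: truth.get(d, False))
print(sorted(coded), sum(coded.values()))
```

In this toy run the loop reviews six of the eight documents and finds all three responsive ones, with every coding decision feeding the next ranking rather than a separate seed set.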
As a closing thought, consider some of these numbers from the order:
That’s roughly 12.5% efficiency (12,500 relevant documents for every 100,000 reviewed) and roughly $3.50/document. Using continuous machine learning, we regularly see clients achieve 50%, 60%, or even 70% efficiency. In one of our published case studies, a client using our continuous machine learning technology achieved over 50% efficiency at a cost of $0.50/document. These stats raise the question: Was transparency really the biggest issue in this TAR protocol?
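For anyone who wants to check the arithmetic, here is a short Python sketch using only the figures quoted above. The cost-per-relevant-document metric and the function names are our own framing, and the case-study line treats "over 50%" as exactly 50%, which is a conservative assumption.

```python
def review_efficiency(relevant_found, documents_reviewed):
    """Share of reviewed documents that turned out to be relevant."""
    return relevant_found / documents_reviewed


def cost_per_relevant(cost_per_reviewed_doc, efficiency):
    """Review spend needed, on average, to surface one relevant document."""
    return cost_per_reviewed_doc / efficiency


order_efficiency = review_efficiency(12_500, 100_000)
print(f"{order_efficiency:.1%}")                   # 12.5%
print(cost_per_relevant(3.50, order_efficiency))   # 28.0
print(cost_per_relevant(0.50, 0.50))               # 1.0
```

On these figures, each relevant document cost about $28 in review spend under the protocol in the order, against about $1 in the case study.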