Predictive Coding Explained
A great deal of discussion has taken place recently about a new form of document review that is taking the eDiscovery industry by storm: Predictive Coding. The reasons for this surge in interest are several, as discussed below, but the timing is not coincidental. Two major trends are colliding: 1) the economics of traditional, linear review have become unsustainable, and 2) the early returns from those employing Predictive Coding are nothing short of phenomenal and have given such early adopters a significant competitive advantage. Given the nascent stage of the Predictive Coding world, we thought the timing was right for a quick primer on what Predictive Coding is, what it isn’t, how it came to be and the problems it seeks to address.
Linear document review – where individual reviewers manually review and “code” documents ordered by date, keyword, custodian or in some other simple fashion – has been the accepted standard within the legal industry for decades. This was not a big deal when ESI volumes were measured in megabytes or even a few gigabytes. The explosion of data volumes over the past decade, however, has exposed traditional linear review as an exceedingly inefficient, costly and inconsistent approach to document review (which accounts for 60-70% of the costs of eDiscovery). There is simply so much data to be coded that the old model has become too slow and expensive to keep up.
Why does linear review perform so poorly in most cases? For starters, many – and often most – documents in a review are “false positives” (i.e. irrelevant, unresponsive, or both), yet they are still reviewed by an attorney, which racks up huge and unnecessary costs. Second, documents are typically not organized by topic, which forces reviewers to jump from topic to topic, slowing down the process and leading to inaccurate results. Third, documents aren’t prioritized in any way (e.g. from most important to least important), so reviewers can miss key documents. And finally, because individual attorneys typically know little about a case’s substance, multiple “passes” must be made over the same documents, each with a different focus (e.g. a first pass for relevance, a second for responsiveness, a third for relationship to a substantive category, etc.). Add it all up and one is left with a woefully outdated and extremely expensive approach that is rapidly falling out of favor with clients and outside counsel.
By contrast, Predictive Coding seeks to automate the majority of the review process. Starting from a bit of direction from someone knowledgeable about the matter at hand, Predictive Coding uses sophisticated technology to extrapolate that direction across an entire corpus of documents, so that it can “review” and code anything from a few thousand documents to many terabytes of ESI at a fraction of the cost of linear review. The result? A more thorough, more accurate, more defensible and far more cost-effective document review…which allows attorneys to do what they were trained to do, namely use the facts to advocate on behalf of their client. Predictive Coding is so powerful that it actually changes the economics of eDiscovery, allowing law firms to win new business while maintaining or even improving their margins.
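To make the idea of “extrapolating direction across a corpus” concrete, here is a minimal sketch. It assumes a simple naive Bayes text classifier, which is our own stand-in for illustration; commercial Predictive Coding tools do not necessarily work this way, and the function names and sample documents are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def train(seed_docs):
    """Build a model from (text, is_responsive) pairs coded by an expert.

    Assumption: the seed set contains at least one responsive and one
    non-responsive document; otherwise the prior below cannot be computed.
    """
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, label in seed_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab


def score(model, text):
    """Return log-odds that a document is responsive (higher = more likely)."""
    word_counts, doc_counts, vocab = model
    log_odds = math.log(doc_counts[True] / doc_counts[False])
    totals = {label: sum(word_counts[label].values()) for label in (True, False)}
    for word in tokenize(text):
        if word not in vocab:
            continue  # words never seen in the seed set carry no signal
        # Add-one smoothing so unseen word/label pairs never give zero.
        p_resp = (word_counts[True][word] + 1) / (totals[True] + len(vocab))
        p_non = (word_counts[False][word] + 1) / (totals[False] + len(vocab))
        log_odds += math.log(p_resp / p_non)
    return log_odds


def rank(model, corpus):
    """Order unreviewed documents from most to least likely responsive."""
    return sorted(corpus, key=lambda text: score(model, text), reverse=True)


# Hypothetical seed set coded by someone who knows the matter.
seed = [
    ("brake defect complaint from customer", True),
    ("engineering memo on brake pedal defect", True),
    ("cafeteria lunch menu for friday", False),
    ("holiday party parking reminder", False),
]
model = train(seed)
for doc in rank(model, ["friday lunch menu update",
                        "customer complaint about pedal defect",
                        "parking reminder"]):
    print(round(score(model, doc), 2), doc)
```

The ranking is the point: reviewers see the likeliest documents first, and low-scoring documents can be sampled for quality control rather than read one by one.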
Due to the number of vendors, practitioners, outside counsel and clients in the eDiscovery space, there has been a lot of confusion about Predictive Coding. For all that it is and all it can do, here are several of the most common ways in which commentators have mischaracterized it. Predictive Coding is not:
- The same as culling, threading, categorizing and/or clustering. These techniques can be helpful in organizing documents for review. However, they do not themselves predictively “code” documents, nor do they prioritize documents automatically, nor do they provide quality control after the fact; thus they are not Predictive Coding. Put another way, they address one symptom of linear review (the lack of topical organization of documents) but do not address its fundamental flaws, and they still require huge review teams (often contract attorneys).
- A replacement for attorneys. Simply put, Predictive Coding makes seasoned attorneys more valuable (not less) as it allows them to focus on the most important part of any matter: defending or prosecuting their client’s interests. Predictive Coding also allows attorneys to take on more business by expediting the most tedious element of eDiscovery – document review – which is especially important in the current economic cycle.
- Subject to defensibility issues. This is a classic red herring. Some linear review vendors and practitioners have reflexively voiced concerns about defensibility, namely that Predictive Coding may carry risk because it is not linear review. In fact, the opposite is true: more and more courts are pushing litigants to pursue alternative approaches to document review (like Predictive Coding) due to the risk and costs associated with legacy eDiscovery methods. When part of a thorough, documented process, Predictive Coding is actually more defensible than linear review.
- Solely about technology. The technology aspect of Predictive Coding is not trivial and cannot be discounted; it is not easy to do, which is why linear review has continued to outlive its useful lifespan. But what makes Predictive Coding so defensible and effective are the processes, workflows and documentation of which it is an integral part. Although technology is at its CORE, Predictive Coding includes all of these parts as one integrated whole.
- Solely for big cases and/or big law firms. One of the most common misperceptions is that Predictive Coding is the province of the rich (i.e. AmLaw 100 firms and Fortune 100 clients). This is simply not true. As many small and mid-sized yet forward-looking law firms like Eimer Stahl have begun to realize, Predictive Coding is useful for anyone dealing with litigation or with a regulatory or internal investigation.
With today’s huge volumes of ESI pressuring inside and outside counsel alike to embrace new approaches to eDiscovery, it’s no surprise that Predictive Coding has become so popular in such a short period of time. Look for this trend to continue throughout 2010 and into 2011 and beyond.
The Intersection of Enterprise Search and eDiscovery at the “Book of Knowledge”
Legend has it that during the 16th century Juan Ponce de Leon, the first governor of Puerto Rico and erstwhile conquistador, spent years looking for the fountain of youth in Florida. While he did become the first European to “discover” Florida during his first such expedition to the territory, he never did find a fountain of youth; he instead died in Cuba from a wound inflicted on his last expedition to Florida.
What does a 16th century explorer have to do with a modern day economic giant like Toyota? Both utilized technology to great advantage, both enjoyed a great deal of success, and both were credited with quickly conquering the Americas. But while Ponce de Leon was felled by a poisoned arrow, Toyota, as we have written before, has been felled by a lack of internal awareness – of what the company knew and when it knew it (aka “Proactive ECA”) – which led to an inability to get ahead of events before those events overtook it. The latest bombshell in the Toyota case alleges that the company may have deliberately withheld evidence in safety lawsuits, evidence in the form of a “secret electronic ‘Book of Knowledge’” the company maintained, which allegedly included information about design problems that was never disclosed in lawsuits against the company.
We don’t yet know if such ‘secret Books of Knowledge’ do in fact exist, although the chairman of the House Oversight and Government Reform Committee and a Toyota legal department whistleblower alleged they do. If so, one wonders how Toyota could have either not known of their existence or believed they would never see the light of day. Giving Toyota the benefit of the doubt that the powers-that-be were unaware of any such documents or of their relevance to particular lawsuits, however, one wonders exactly how they could have been so in the dark – and how other companies can ensure they do not find themselves in a similar situation. Here are a few straightforward steps companies can take:
1. Ensure the right people can find relevant, internal information well before litigation. In today’s business environment, we are all deluged by information on a daily basis – with the “Digital Universe” doubling in size every 18 months. To make matters worse, most companies have not rolled out effective enterprise search, making information exceedingly difficult to find; one study found that on average it takes employees 38 minutes to find a single document. Effective enterprise search allows sensitive information to be “findable” by those who should have access to it while securing such documents from those who shouldn’t.
2. Employ “Proactive ECA” to gain critical insight before litigation ensues. As we noted recently, it would appear that Toyota may have relied on “Reactive ECA” – where ESI is not analyzed until it has been identified, then preserved, then collected, then sent to a third party for processing, then loaded into a review tool before it can finally be analyzed. If so, this would not allow the company’s inside and outside counsel to get critical insight into the situation until well after a proceeding has started and taken on a momentum of its own. Simply put, this is typically too late. Instead, “Proactive ECA” can allow companies to get this critical insight before or during the collection phase, delivering key insight from the very outset of a matter.
3. Beware the shortcomings of keyword-only search. As numerous studies have shown, keyword-only search is notoriously inaccurate and ineffective – correctly finding documents only 20% of the time. This fundamental shortcoming makes keyword-only search extremely inefficient in the knowledge management arena and downright dangerous in eDiscovery. Supplementing keyword search with conceptual search, however, is a far more accurate and comprehensive approach, one that can help ensure key documents are not missed by those who need them, regardless of their role at the company.
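The gap described in step 3 can be illustrated with a toy comparison. This is only a sketch under stated assumptions: the hand-built `CONCEPTS` map is a hypothetical stand-in for a real conceptual search engine, and the sample documents are invented.

```python
# Hypothetical, hand-built concept group; a real conceptual search engine
# would infer related terms from the corpus rather than from a fixed map.
CONCEPTS = {
    "acceleration": {"acceleration", "accelerator", "surge", "throttle", "runaway"},
}


def keyword_search(docs, term):
    """Return documents containing the exact term as a whole word."""
    return [d for d in docs if term in d.lower().split()]


def concept_search(docs, term, concepts=CONCEPTS):
    """Return documents containing the term or any term in its concept group."""
    related = concepts.get(term, {term})
    return [d for d in docs if related & set(d.lower().split())]


docs = [
    "Unintended acceleration reported by dealer",
    "Customer says throttle stuck open on highway",
    "Vehicle surge after floor mat install",
    "Quarterly parking allocation",
]

print(keyword_search(docs, "acceleration"))  # finds only the exact-word match
print(concept_search(docs, "acceleration"))  # also finds the throttle and surge reports
```

In this invented sample, three documents concern the same issue but only one uses the searched word, which is exactly the kind of miss that keyword-only search produces.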
Unfortunately for him, Ponce de Leon did not have the luxury of sophisticated enterprise search and modern, US-style eDiscovery. Had such tools been at his disposal, he may well have been able to find the fountain of youth and still be with us to this day.
Toyota, Take Two: Why “Reactive ECA” Doesn’t Work
As we reported last week, Toyota continues to face severe scrutiny amidst multiple recalls, Congressional hearings, grand jury subpoenas and a rising tide of public questions and concerns. Arguably the worst news for the car giant is the fact that the end of the barrage may be nowhere in sight – either on the technical fixes or the PR front. As is often the case in blowups like this, it appears that Toyota has been forced to play catch-up to the events unfolding in front of it (if not over its head) at breathtaking speed.
Wikipedia tells us that the 3 elements to a “Crisis Management” situation are 1) a threat to the organization, 2) the element of surprise, and 3) a short decision time. While we can debate about whether or not Toyota should have been surprised by their current predicament, clearly at this point the company is facing a serious threat, did not expect the situation to have evolved as it has (i.e. it’s safe to say Toyota was surprised), and decision cycles continue to shorten. Put another way, Toyota has been facing and continues to face a crisis; even worse, in the context of elements 2 and 3, this particular crisis appears to have the company very much in reactive mode, responding to events as opposed to shaping them by getting out ahead of things. Simply put, Toyota is not in control of events at this point in time.
This is where Early Case Assessment (ECA) comes in. The whole point of ECA is to 1) give a company instant insight into a situation (notice I did not say “case”), which will allow them to 2) quantify the costs and risk associated with the likeliest outcomes of a situation, and 3) make the best, most informed strategic and tactical decisions from the very outset of a situation. In other words, ECA is meant to allow a party to very quickly become proactive in any situation – making informed decisions that will shape and channel events before they occur. As we have stated again and again, this is impossible to do unless live data is assessed where it sits, before or concurrently with a legal hold or collection. This can logically be called “Proactive ECA” because that is exactly what it is: an early, quick assessment of one’s situation that gives one enough crucial insight to become proactive with any situation before events overtake them.
Unfortunately, most eDiscovery vendors – and, sadly, their customers – simply do not understand this fact or are led in another, less productive direction. Instead, what these vendors sell and their customers buy is “Reactive ECA.” In contrast to Proactive ECA, Reactive ECA is neither early nor proactive; in fact, it typically only applies to document collections as part of relatively structured civil litigation (which does not help with Toyota’s current predicament). Instead of gaining instant insight into a situation by looking at data where it resides – before collection or even preservation – parties using Reactive ECA must typically send legal hold notices…then preserve huge amounts of data through self-collection or by imaging entire drives…then send the data to an external third party for expensive and time-consuming processing…before bringing the data back in-house or accessing it at yet another third party’s data center where ECA (well, really CA at this point as it’s certainly no longer Early) can, finally, be conducted. Weeks and hundreds of thousands (if not millions) of dollars later, events – and costs – have overtaken the client, rendering ECA pointless.
Which brings us back to Toyota. For obvious reasons, we can’t know for sure what Toyota did or did not do as part of any ECA activities they may have undertaken. We also do not know if their current predicament is the result of not having the right information from the outset, from decisions that were made early on, and/or from the sheer speed with which events seem to have escalated and taken on a life of their own. What we do know is that Toyota apparently has access to at least one so-called ECA tool…which appears to have done little to help them in this case. Why might this be? Simply put, tools like this are completely Reactive in nature – and never would have enabled Toyota to search and assess live data early on in this ill-fated process. That, unfortunately, is the real-life embodiment of Reactive ECA in all its glory.
As we have stated before, Toyota is one of the most successful, sophisticated companies in the world, which will emerge from this situation strong and vibrant, although perhaps with a new mandate on safety and transparency. One can only hope that this entire situation will serve as a wakeup call to the entire eDiscovery industry about the shortcomings of Reactive ECA – and why Proactive ECA is so critical to getting out in front of events.
ESI’s Threat to Corporate Brands
Toyota, #3 on the list of the world’s most admired brands (for 2009, at least) and the world’s largest automaker (again, for now at least), has recently come under fire in the US for massive safety recalls of its leading models for various mechanical issues, including faulty braking systems. Even the car giant’s iconic Prius Hybrid line is being affected, with 437,000 such vehicles worldwide requiring a fix. In total, Toyota has committed to fix more than 8 million vehicles worldwide – a number which exceeds its entire global sales in 2009 (7.8 million vehicles).
The direct cost of the recall is expected to be $2B, which Toyota believes will fix the actual issues it faces with respect to the safety of its vehicles. Of more concern, however, is the hit Toyota will take to its brand and how that hit will impact future sales. By one account, demand for Toyota vehicles plunged by 28% in the period immediately after the company’s announcement that it would suspend sales of affected models until safety issues were resolved. A 28% plunge in global sales for Toyota would translate into roughly $18.5 billion in lost revenue per quarter ($74B on an annual basis). Interestingly, this plunge only occurred after Toyota was subjected to continued negative press coverage; in other words, what really hurt Toyota’s brand wasn’t necessarily the fact that it was forced to conduct a massive recall, but rather the fact that the story kept mushrooming day after day after day and simply would not die.
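The revenue arithmetic above can be checked in a few lines. This sketch uses only the figures quoted in the text; the implied quarterly revenue is back-calculated from them, and it assumes that revenue falls in proportion to sales and that the drop persists for four quarters.

```python
# Figures quoted in the text (USD billions); everything else is derived.
sales_drop = 0.28
lost_per_quarter = 18.5

# Annualized loss, assuming the 28% drop holds for all four quarters.
lost_per_year = lost_per_quarter * 4

# Quarterly revenue implied by the two quoted figures (not a reported number).
implied_quarterly_revenue = lost_per_quarter / sales_drop

print(f"Annualized loss: ${lost_per_year:.0f}B")
print(f"Implied quarterly revenue: ${implied_quarterly_revenue:.1f}B")
```

The annual figure is simply four times the quarterly one, so it is best read as an upper bound that holds only if demand never recovers during the year.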
Toyota’s pain was significantly exacerbated when US authorities stepped in, which is where Electronically Stored Information (ESI) enters the picture. First came Tuesday’s announcement that the US National Highway Traffic Safety Administration (NHTSA), led by DoT Secretary Ray LaHood (who presided over the Clinton Impeachment hearings, by the way), was demanding documents from Toyota “to determine if the automaker conducted three of its recent recalls in a timely manner” (note to Toyota North America General Counsel: wording like this from a US regulator – any US regulator, but especially one who loves a good public fight – is an enormous red flag). Next up is an “invitation” from the House Oversight and Government Reform Committee to Toyota Chairman Akio Toyoda to testify about what the company knew, when it knew it, and the efficacy and thoroughness of its response to safety issues. Because the aforementioned House Committee has subpoena power and can use any documentation produced by Toyota to the NHTSA – as well as any documents the Committee itself may subpoena – the situation has morphed from a PR issue to a major, “bet the business” investigation that will highlight any perceived or real inconsistency between what Toyota has said publicly (e.g. about when it realized a recall was necessary) and what conclusions the ESI it will be producing may support. We are, after all, talking about the safety of US drivers in the context of a recall which has already claimed the lives of 34 Americans.
In America, bad PR can be, and typically is, a fleeting thing that does not need to be fatal, and Toyota’s predicament is no different. The company is far too big, respected and successful for its existence to be seriously threatened – even by a situation as bad as the current one. But what Toyota’s obligation to produce ESI to the US government and testify in front of government officials guarantees is that this story is not going away anytime soon, which, if brand experts are to be believed, will have a tremendous impact on the company’s top and bottom lines. In other words, the question isn’t whether Toyota will suffer losses from this situation, but rather just how many billions it will lose.
The First Legal Shot Across the Web 2.0 Bow?
Under the “it was only a matter of time” heading, a recent case in San Francisco Superior Court (where else?) seems to have landed a Silicon Valley law firm in hot water – hoist by their own petard, 21st century-style (Brain Research Labs v. Clarke, 491932). (Full disclosure: I began my legal career at said law firm, Ropers, Majeski, et al; they are a great group of people who know their stuff…which tells you how complicated things have become when even a sophisticated law firm hits turbulence over Web 2.0 issues). Here’s what happened.

In seeking to solicit members for a potential class action suit against the maker of a dietary supplement, Brain Research Labs, a litigation partner at Ropers, Majeski uploaded a video to YouTube. The video was a commentary by partner Thomas Clarke that sought to interest individuals who had used Brain’s dietary supplement in joining forces with Clarke in an anticipated suit against Brain. It included some commentary about the potential defendants in the class action (Brain), including the following:
“…These scam artists do not care if you live or die. They only want you to live long enough to give them your money.”
Brain Research Labs sued Ropers, Majeski for defamation; Ropers, Majeski sought to use California’s anti-SLAPP law as an affirmative defense to the suit. San Francisco Superior Court Judge Harold Kahn denied Ropers, Majeski’s anti-SLAPP defense, thus paving the way for the defamation suit to proceed. Specifically, the court found that 1) the above comments (among others) were potentially defamatory, 2) by choosing ‘new media’ like YouTube “…Clarke chose, in a 21st century way, to ‘litigate in the press’”, and 3) Clarke’s selection of a broad medium like YouTube also undercut his ability to invoke an anti-SLAPP defense, as the judge noted “…there are far more narrowly tailored ways” for Clarke to have communicated with potential class members.
This case is interesting for several reasons, but most notably it is consistent with the warnings we and others have been sounding for many months about the information risks associated with social media. Using mass-communication vehicles like YouTube, Facebook, Twitter and LinkedIn entails significant risk for the exact reasons they are such powerful platforms, namely their ability to communicate truncated (and often out of context) messages to an incredibly wide audience about which the speaker may know nothing and over which they have no control…instantly. With YouTube, Clarke had no way of ensuring that only – or even largely – potential class members would view his video, just as he would not have had such assurance with Facebook, LinkedIn or Twitter (think retweets with that last forum).
This case is but one data point in what is sure to quickly become a rich body of caselaw wading into the use – and potential abuse – of Web 2.0 tools in the 21st century. If a seasoned law firm steeped in litigation and IP-related issues can get in trouble, anything is possible.
About this Blog
INFOcus looks at information-generated challenges facing today’s large enterprises, and seeks to promulgate best practices amongst enterprise IT, KM, records management, compliance and legal practitioners.
