Study shines a light on the real black box problem in legal research.
Recently, while following coverage from the 2017 Center for Computer-Assisted Legal Instruction (CALI) conference, I came across this blog post from the LII blog. It put me on to University of Colorado Law Library Director Susan Nevelow Mart’s study, “The Algorithm as a Human Artifact: Implications for Legal {Re}Search.” Mart tested search query performance across the six legacy services: Casetext, Fastcase, Google Scholar, Lexis Advance, Ravel and Westlaw. I was blown away by the study’s findings.
The good here is that this study put the six legacy services to the test. The goal of the article was, "in part, a call for more algorithmic accountability." This call for greater accountability was interesting, as it mirrored one of the biggest criticisms people have of AI: that they can't trust the "black box" nature of some machine learning techniques. What always puzzled me was that everyone is already living with mysterious black boxes when it comes to legal research, because unknown human-made decisions shape how results are ranked and displayed in every legacy legal research tool.
The How
The same keyword search was entered into each of the six database services, and the top ten results each brought back were analyzed and compared. Each result was classified according to whether it was "unique and relevant" or "unique and not relevant." Mart's study included "fifty different searches," with "many of the searches" coming from Mart's previous "study of digests and citators."
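The overlap analysis at the heart of the study can be sketched roughly like this. The toy result sets below are invented placeholders, not Mart's data, and the case names are purely hypothetical:

```python
# Sketch of the overlap analysis: for each case, count how many of the
# six databases returned it in their (here abbreviated) top results.
# The result sets are invented placeholders, not the study's data.
from collections import Counter

top_results = {
    "Westlaw":        {"case_a", "case_b", "case_c"},
    "Lexis Advance":  {"case_a", "case_d", "case_e"},
    "Fastcase":       {"case_f", "case_b", "case_g"},
    "Google Scholar": {"case_h", "case_i", "case_a"},
    "Casetext":       {"case_j", "case_k", "case_l"},
    "Ravel":          {"case_m", "case_n", "case_o"},
}

# How many databases returned each case?
appearances = Counter(
    case for results in top_results.values() for case in results
)

total_cases = len(appearances)
unique_to_one = sum(1 for n in appearances.values() if n == 1)
in_all_six = sum(1 for n in appearances.values() if n == 6)

print(f"{unique_to_one / total_cases:.0%} of cases unique to one database")
print(f"{in_all_six / total_cases:.0%} of cases found in all six databases")
```

Running this kind of tally over fifty searches is what yields the study's headline figures on uniqueness and overlap.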
The Bad
Mart found that "there is hardly any overlap in the cases which appear in the top ten results returned by each database" and that "an average of 40 percent of the cases were unique to one database, and only about seven percent of the cases were returned in search results in all six databases." In other words, on average, 40% of the cases a given search surfaces appear in only one of the six research services.
Unsurprisingly, the study found that “the oldest database providers, Westlaw and Lexis, were at the top in terms of relevance, with 67% and 57% relevant results, respectively” while the newer providers, “Fastcase, Google Scholar, Casetext, and Ravel, were clustered together at a lower relevance rate, each returning about 40% relevant results.”
Put another way: even at the high end, Westlaw and Lexis return 33% and 43% irrelevant results in their top ten, while the majority of legacy legal research databases return roughly 60% irrelevant results.
Mart also found that an average of “25% of the cases are only in two databases. An average of 15% of the cases appeared in three databases” and only “9.5% of the cases appeared in four databases.”
Also interesting is that Mart found there may be no correlation between the total number of results returned and relevance, as "the average relevance of the top ten results stays fairly consistent even when the number of results increases." This makes sense to those outside the law, but researchers often want to see thousands of results, because a larger result set can create the psychological impression of a more thorough search.
Natural Language Searching?
Mart and her team also found that none of the current providers truly offer natural language search, concluding that:
“reference to natural language searching is frequently a misuse of a technical term that refers to a complex attempt to pattern match speech or text ‘through reference to a database with the aid of grammatical structures models.’”
The study found that, in Westlaw's case, it was "not clear from the promotional material if the methodologies include true natural language searching or not": the interface labels its search "plain language," while early WestlawNext promotional material calls it "natural language." As for Lexis, Mart found that the company calls search without connectors "'natural language' searching." The piece highlighted that Fastcase's promotional materials state that "natural language searches are much less precise." Ravel was upfront about how its system works: when you enter a search, it "found all cases that contain those keywords" and then ranks them using a variety of methods. Casetext's relevance is determined as a function of "keyword frequency, citation count, date, and jurisdiction," according to an email Mart received from a Casetext team member. Google Scholar uses factors such as where a document was published, who wrote it, and how recently it has been cited in other materials.
The Ugly
LII’s blog states that Mart’s research “has implications for practicing attorneys, scholars, students, and technologists” — I couldn’t agree more. Three things jumped out at me in particular after taking in her research.
1. The Current Legal Research Tools are Costing Firms Time and Money
Consider the current legal landscape: according to this report from Mattern & Associates, recovery for time spent on legal research hit an all-time low of 25% in 2016 and is on pace to hit rock bottom, 0% recovery, in 2020. Against that backdrop, the idea that 35–60% of the top ten results brought back from the six legal databases are irrelevant noise is troublesome.
If a firm currently recovers only 25 cents of every dollar spent on legal research, and the tools it is using are inefficient, then money and time are being burned at record rates.
2. The term ‘Natural Language Search’ is being used incorrectly
Something I had long wondered was whether natural language search was truly being offered by the six tools tested in Mart's study, and it would appear, according to the research, that none of the six legal database search services offers true natural language search. That said, perhaps these publishers will release more information about their natural language search capabilities to refute Mart's findings.
Either way, shining light on what constitutes true natural language search, versus keyword and Boolean search dressed up as natural language search, is interesting and timely given the rise of new AI methods that are making true natural language search available to researchers.
3. The idea that a new tool must displace an old one could be dangerous
When speaking to decision makers at law firms, the question of whether a new legal search tool could displace the one the firm currently uses often comes up. The reasoning goes: we use tool X, and X is perfect, so how could the proposed tool Y be any better? The problem is that neither of the two most widely used tools, Lexis and Westlaw, nor any of the proposed alternatives to date, appears to offer anything close to a perfect tool.
Shining light into the real black box problem in legal research
Mart concludes that "this study clearly demonstrates that the need for redundancy in searches and resources has not faded with the rise of the algorithm." Again, I agree. When working with predetermined algorithms made by humans, information falls through the cracks. Algorithms are created by humans who select which factors to weigh favorably, which to ignore, which to weigh moderately, and so on. These weights do not change unless the human, or team of humans, manually changes the predetermined factors and weights they have set. Think of it as turning various knobs, each corresponding to a different value.
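A hand-tuned ranking of this sort can be sketched as a fixed weighted sum. The factor names echo the ones reported for Casetext, but the weights and scoring below are invented for illustration and are not any vendor's actual values:

```python
# Sketch of a static, hand-tuned relevance score: a human picks the
# factors and sets the weights (the "knobs"), and nothing changes
# unless a human edits them. All values here are illustrative.

WEIGHTS = {
    "keyword_frequency": 0.5,
    "citation_count":    0.3,
    "recency":           0.1,
    "jurisdiction":      0.1,
}

def score(case: dict) -> float:
    """Fixed weighted sum over hand-picked features (each in [0, 1])."""
    return sum(WEIGHTS[f] * case[f] for f in WEIGHTS)

case = {"keyword_frequency": 0.8, "citation_count": 0.4,
        "recency": 0.9, "jurisdiction": 1.0}
print(score(case))  # ≈ 0.71 with these hand-set weights
```

Every case gets ranked by the same frozen formula until an engineer turns the knobs again, which is exactly the static behavior described above.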
This is why new artificial intelligence techniques, such as deep learning and neural networks, are so exciting: these models are able to learn in a fluid, non-static way, rather than through the hard-coded rules legacy legal research tools have offered to date. There are challenges in showing how these AI tools reach a decision, but researchers continue to make breakthroughs in opening up the black box of AI decisions. Meanwhile, the current offerings not only clearly have black box issues of their own; the factors they consider are entirely the result of human decisions, rather than the human-plus-machine approach that AI tools bring to the table.
With these new techniques, the machine system modifies its own weights and factors as it learns what works best through interactions with the people who depend on it. Humans can still modify factors as well; however, these new approaches are dynamic, rather than the static stop-and-go of manual algorithmic tuning.
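Contrast the static knobs above with a system that nudges its own weights from user feedback. This is a toy gradient-style update, not how any particular product works, and every name and number is hypothetical:

```python
# Toy sketch of dynamic weight learning: each time a researcher opens
# (or skips) a result, the system nudges its own weights toward the
# features of results that proved useful. Purely illustrative values.

weights = {"keyword_frequency": 0.25, "citation_count": 0.25,
           "recency": 0.25, "jurisdiction": 0.25}
LEARNING_RATE = 0.05

def update(case_features: dict, clicked: bool) -> None:
    """Nudge each weight up for helpful results, down otherwise."""
    direction = 1.0 if clicked else -1.0
    for feature, value in case_features.items():
        weights[feature] += LEARNING_RATE * direction * value

# A researcher opens a heavily cited case, so citation_count gains
# weight without anyone manually turning a knob.
update({"keyword_frequency": 0.2, "citation_count": 0.9,
        "recency": 0.1, "jurisdiction": 0.3}, clicked=True)
print(weights["citation_count"])  # increased from 0.25 (≈ 0.295)
```

The point is the feedback loop: the weights drift continuously with usage instead of waiting for a human to re-tune them.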
The late comedian George Carlin once said, “If the black box flight recorder is never damaged during a plane crash, why isn’t the whole airplane made out of that stuff?” The obvious answer is that in order to have a functional plane, the whole plane couldn’t be made of the black box material.
When it comes to legal research, a black box problem also isn't 100% avoidable. But in my opinion, the new artificial intelligence methods being developed today, especially those using deep learning and neural nets, have the best chance of solving our legal research woes, particularly given the field's plummeting recovery rates. Then again, I'm biased: I'm one of the founders of ROSS Intelligence, after all. That said, so far so good; the data supports this hypothesis.