Evaluating Generative AI for Legal Research: A Benchmarking Project
It is difficult to test large language models (LLMs) without back-end access to run evaluations. To test these products' abilities, librarians can instead use prompt engineering to elicit the desired results (controlling statutes, key cases, drafts of a memo, etc.). Some models are more successful than others at producing specific results, and because these models are frequently updated, evaluations of their efficacy can quickly become outdated. Law librarians and tech experts par excellence Rebecca Fordon, Sean Harrington, and Christine Park plan to propose a typology of legal research tasks grounded in existing computer and information science scholarship, draft corresponding benchmark questions using that typology, and provide rubrics others can use to score the tools they use.
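To make the idea of typology-based questions and shared scoring rubrics concrete, here is a minimal sketch, in Python, of what such a workflow might look like. All of the names, task types, criteria, and the sample question are hypothetical illustrations, not the project's actual typology or rubrics.

```python
from dataclasses import dataclass, field

@dataclass
class RubricCriterion:
    """One dimension a human grader scores, e.g. 'cites the controlling statute'."""
    name: str
    max_points: int

@dataclass
class BenchmarkQuestion:
    """A research prompt tied to a task type from a (hypothetical) typology."""
    task_type: str   # e.g. "find controlling statute", "draft memo section"
    prompt: str
    rubric: list[RubricCriterion] = field(default_factory=list)

def score_response(question: BenchmarkQuestion, grader_points: dict[str, int]) -> float:
    """Return a grader's score as a fraction of the rubric's total points."""
    total = sum(c.max_points for c in question.rubric)
    earned = sum(min(grader_points.get(c.name, 0), c.max_points)
                 for c in question.rubric)
    return earned / total if total else 0.0

# Example: grading one tool's answer to a (hypothetical) statute-finding question.
question = BenchmarkQuestion(
    task_type="find controlling statute",
    prompt="What statute governs the enforceability of non-compete agreements in this state?",
    rubric=[
        RubricCriterion("identifies the controlling statute", 3),
        RubricCriterion("cites the statute accurately", 2),
        RubricCriterion("notes relevant interpreting case law", 2),
    ],
)

# A grader reads the tool's output and assigns points per criterion.
print(score_response(question, {
    "identifies the controlling statute": 3,
    "cites the statute accurately": 1,
    "notes relevant interpreting case law": 0,
}))  # 0.571..., i.e. 4 of 7 possible points
```

Because the grading happens against the rubric rather than inside any vendor's system, the same questions and criteria can be reapplied whenever a tool updates, making results comparable across tools and over time.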