Control Risks’ presentation on ‘The future of eDiscovery technology’ sounded like a bold attempt to predict the predictive. It was actually a practical walk-through of two key eDiscovery tools – concept clustering and predictive coding. Although it focused on Control Risks’ toolkit, the principles are applicable to other eDiscovery products.
While sophisticated technology impresses client organisations sufficiently to make them invest in it, users often fail to make use of its features. The main challenge is user confidence and understanding how the technology works.
Tom McCotter, principal architect, legal technology, likens document analysis in eDiscovery to condensing ‘War and Peace’ into revision notes that contain the essence of the story, characters and underlying message. Contextual clustering is a key element in the redaction process. His presentation took us through some of the tools and techniques.
How contextual clustering works
All technology produces better results from data with rich context. Preserving groups of data – with closely related contexts – allows you to retain its richness. McCotter’s demonstration applied Control Risks’ software to the 20 Newsgroups data set, which developers and others commonly use to test their software and findings.
Each of the 20 Newsgroups has 1,000 articles. The software divided the articles into six categories, three layers deep. The categories and layers are defined by two heuristics: words and terms that appear in relatively few articles; and words and terms that appear frequently in the same document (e.g. ‘hockey’ will indicate that an article should be categorised as sport). Next steps include weighting documents by significance, language and potentially by a basic level of sentiment analysis – for example, positive terms tend to be separated by ‘and’ while negative ones are separated by ‘but’.
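The two heuristics described above resemble the standard TF-IDF weighting scheme, in which a term scores highly when it is frequent within one document but rare across the collection. The following is a minimal sketch of that general technique, not Control Risks’ implementation; the function name `tfidf_scores` and the sample sentences are invented for illustration.

```python
import math
from collections import Counter


def tfidf_scores(documents):
    """Return one {term: score} dict per document (illustrative helper)."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    scores = []
    for tokens in tokenised:
        counts = Counter(tokens)
        scores.append({
            # Frequent in this document, rare in the collection => high score.
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Made-up sample documents, standing in for newsgroup articles.
docs = [
    "the hockey team won the hockey game",
    "the new graphics card renders the game",
    "the senate passed the bill",
]
scores = tfidf_scores(docs)
top_term = max(scores[0], key=scores[0].get)
print(top_term)
```

In this toy collection ‘the’ appears in every document and so scores zero, while ‘hockey’ is the strongest signal for the first document.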
User reluctance around conceptual clustering often occurs because although people understand the principle, they are not clear about how the technology works. This is not helped by the fact that the term is a misnomer: rather than being about clusters per se, it is about high-dimensional similarity – identifying the multitude of factors that define a group.
When conceptual clustering is used for document analysis in eDiscovery, the software identifies some 10,000 to 20,000 dimensions for each document. Each dimension is a factor that can be plotted as an axis on a graph. For example, if you were to plot restaurants by quality and price, McDonald’s and Burger King would appear close to each other. Each document is visualised as a point on a sphere, and cosine similarity, which measures the angle between the points, is used to calculate how similar the documents are.
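To make the restaurant example concrete, here is a minimal sketch of cosine similarity in two dimensions. The quality and price figures are hypothetical and chosen only for illustration; real document vectors would have thousands of dimensions, but the calculation is the same.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical (quality out of 10, price per meal) values, not real data.
restaurants = {
    "McDonald's": (4.0, 6.0),
    "Burger King": (4.0, 7.0),
    "Tasting menu": (9.0, 90.0),
}

sim_burger = cosine_similarity(restaurants["McDonald's"], restaurants["Burger King"])
sim_tasting = cosine_similarity(restaurants["McDonald's"], restaurants["Tasting menu"])
print(sim_burger > sim_tasting)
```

Because cosine similarity compares direction rather than magnitude, two restaurants with the same ratio of quality to price would count as identical, whatever their absolute figures.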
We were then given a look ‘under the bonnet’ of Control Risks’ software development model, which is designed to bring the technology closer to the users, and includes self-service technology for smaller-scale eDiscovery projects and parts of larger projects. ‘Agile Scrum Methodology’ enables developers to deliver tailored functionality in response to clients’ specific requirements within a very short timescale.
Avoiding common pitfalls
Questions from the floor raised some interesting issues. When it comes to predictive coding, which is another key feature of the Control Risks eDiscovery toolkit, the differentiator is the ability to identify the right keywords. When choosing a software solution, McCotter suggests asking different vendors to analyse the same dataset to find the most appropriate predictive modelling algorithm for your business.
Common pitfalls included document sets with multiple languages, because algorithms are language agnostic. One solution is to apply language determination software first to filter out documents in different languages for separate analysis. Spreadsheets are also worth filtering out because they tend to contain similar vocabulary, whatever their context.
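The language-filtering step can be sketched as a pre-processing pass that groups documents by language before any clustering is run. The stopword lists and function names below are assumptions made for illustration only; production language identification relies on far richer models than this.

```python
from collections import defaultdict

# Tiny illustrative stopword lists (an assumption, not a real language model).
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is", "in"},
    "fr": {"le", "la", "et", "les", "des", "est"},
    "de": {"der", "die", "und", "das", "ist", "nicht"},
}


def guess_language(text):
    """Pick the language whose stopwords occur most often, else 'unknown'."""
    tokens = text.lower().split()
    hits = {lang: sum(t in words for t in tokens) for lang, words in STOPWORDS.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else "unknown"


def split_by_language(documents):
    """Group documents so that each language can be analysed separately."""
    groups = defaultdict(list)
    for doc in documents:
        groups[guess_language(doc)].append(doc)
    return dict(groups)


# Made-up sample documents.
docs = [
    "the contract is in the annex",
    "le contrat est dans les annexes",
    "der Vertrag ist nicht unterschrieben",
    "Q3 2014 10,000 20,000",
]
groups = split_by_language(docs)
print(sorted(groups))
```

Documents that match no list, such as the numeric line here, fall into an ‘unknown’ group that can be reviewed by hand.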
Another question related to ranking results – the software gives each document a relevance score, and these scores can be used to manipulate contextual clustering and predictive coding and determine which documents should be included in the next stage of the review.
Education and engagement
Like all statistical forecasting, eDiscovery combines art and science – choosing the appropriate variables and applying the right algorithms. Senior consultant Adam Page added that investing in a comprehensive suite of tools – as provided by Control Risks – allows clients to predict eDiscovery costs. This is a critical consideration, particularly as the UK’s Practice Direction 31B includes provisions related to the technology, timing and cost of eDiscovery.
The chasm between understanding the underlying principles of eDiscovery and applying the technology effectively can be bridged by educating people within user organisations so that investing in sophisticated technology is no longer a leap of faith.
For vendors, it is becoming increasingly important to ensure that their products are also sold internally, within the client organisation. Continuous training and development help users make the most of the increasingly complex software that significantly reduces the time and manpower required for eDiscovery.
As with many complex IT systems, the key message is around user engagement and embedding tools and techniques into working practices. Events like this, which explain how the tools are developed and how they work on a practical level, are surely an important step in the right direction.
Copyright © 2023 Legal IT Professionals. All Rights Reserved.