An overview of information extraction techniques for legal document analysis and processing

Received Jan 2, 2021 Revised Apr 17, 2021 Accepted May 11, 2021

In the Indian legal system, different courts publish their legal proceedings every month for future reference by legal experts and common people. Extensive manual labor and time are required to analyze and process the information stored in these lengthy, complex legal documents. Automatic legal document processing can overcome the drawbacks of manual processing and will be very helpful to the common man for a better understanding of the legal domain. In this paper, we explore recent advances in the field of legal text processing and provide a comparative analysis of the approaches used for it. We divide the approaches into three classes: NLP-based, deep learning-based, and KBP-based approaches. We put special emphasis on the KBP approach, as we strongly believe that this approach can handle the complexities of the legal domain well. We finally discuss some possible future research directions for legal document analysis and processing.


INTRODUCTION
Nowadays a lot of information is available on the internet in structured and unstructured form, stored in multiple documents. This information belongs to different domains and needs to be analyzed and processed to extract the desired piece of information for a particular task. Manual processing and analysis of such a large repository of documents demands considerable effort and is very time consuming. To overcome these problems, automatic information processing and analysis is the need of the hour. Information retrieval and information extraction are the tasks required for automatic document analysis. Information extraction deals with automatically extracting relevant information for a particular application problem from the available corpus and representing it in a structured, machine-readable format. Information retrieval finds relevant information sources, whereas information extraction automatically extracts relevant information from those sources in a structured format. To differentiate between information retrieval (IR) and information extraction (IE), one can say that IR locates the desired document from a large collection, whereas IE focuses on extracting the exact piece of information from a document to answer a user's query. Generally, IE processes human-language texts employing natural language processing (NLP) techniques. Automatic document analysis is desired in different domains like biomedicine, administration, finance, literature, and journalism, among many others. Researchers all over the world are using a combination of these techniques for automatic document analysis in their respective domains.

How legal text differs?
Automatic legal document processing systems must understand some peculiar characteristics of the domain corpus before further processing. Every year, legal institutions produce thousands of documents in the form of legal contracts, law commission reports, tribunal and case judgments, acts, online contracts, and citations. In countries like India, the Supreme Court of India, different state high courts, and hundreds of district courts publish their legal proceedings in the public domain every month. But this large volume of publicly available legal data is not processed effectively to provide legal information to common people. One of the main reasons behind this is the complicated structure of legal documents and common people's lack of knowledge of legal language. Some of the distinguishing features of legal text in comparison to other domain texts are shown below.
a. Legal documents are very long as compared to documents in other domains.
b. Legal documents have a complex internal structure containing descriptions of different acts, citations, and a hierarchical form.
c. The vocabulary of legal documents consists of several domain-specific terminologies that may not be familiar to the non-legal community.
d. Ambiguity exists in legal documents in the form of different interpretations of the same content depending on the hierarchy of different courts, judges, or lawyers.
Citations are far more important in the legal domain than in other domains, as they highlight the key points of a particular case. The legal domain is quite promising for information retrieval and information extraction due to the large available corpus. As legal documents follow a peculiar layout, NLP techniques can process them better than extremely informal news and social media text. Hence, a knowledge base for automatically managing legal documents will be helpful for all types of users.

LITERATURE REVIEW
The approaches used for IE from legal documents are broadly classified into three categories and different legal document processing systems developed using these approaches are discussed below.

NLP techniques for legal text processing
By combining the power of artificial intelligence and computational linguistics, natural language processing (NLP) techniques help machines to "read" text by simulating the human ability to understand language. Some of the applications developed using NLP techniques are machine translation, automatic summarization, sentiment analysis, text classification, and question answering. NLP represents the automatic handling of natural human language in speech or text. The legal domain can be represented as a combination of language, logic, and conceptual relationships, and their analysis [1]. So, there is wide scope for applying NLP techniques to legal information mining.
Kanapala et al. [2] provided a survey of text summarization techniques recently used for legal text processing. The survey covers both single- and multiple-document summarization techniques, tested on datasets such as AustLII, HOLJ, and Federal Court of Canada judgments. The surveyed techniques are divided into four categories, namely linguistic feature-based, graph-based, semantic role labeling-based, and classification-based approaches. Padayachy et al. [3] proposed a comprehensive model to assist legal researchers in accessing legal data for the most applied case. The approach is implemented using LegalCo; the legal database is provided by the organization. The proposed system is composed of four modules: an information retrieval module, where query-dependent ranking and retrieval of documents is performed using the vector space model (VSM); an information extraction module, which extracts facts using NLP techniques for named entity recognition, relation extraction, and event extraction; a storage module, where the extracted facts are stored in a graph database as a labeled property graph (using the Neo4j Python library); and a final module that returns recommendations in the form of the most applied case by performing query-independent ranking of the obtained results.
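The labeled-property-graph representation used for storing extracted facts can be illustrated with a minimal in-memory sketch. Plain Python stands in here for the Neo4j store the authors mention, and all node labels, entity names, and relation types are invented for illustration:

```python
# Minimal in-memory labeled property graph, standing in for a Neo4j store.
# Nodes carry a label and a property dict; edges carry a relation type.

class PropertyGraph:
    def __init__(self):
        self.nodes = {}   # node_id -> (label, properties)
        self.edges = []   # (source_id, relation, target_id)

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = (label, props)

    def add_edge(self, source, relation, target):
        self.edges.append((source, relation, target))

    def neighbours(self, node_id, relation=None):
        """Return targets connected to node_id, optionally filtered by relation type."""
        return [t for s, r, t in self.edges
                if s == node_id and (relation is None or r == relation)]

# Hypothetical facts extracted from a judgment (names invented for illustration).
g = PropertyGraph()
g.add_node("case1", "Case", title="State v. Example", year=2019)
g.add_node("p1", "Party", name="State")
g.add_node("act1", "Act", name="Indian Penal Code")
g.add_edge("case1", "HAS_PARTY", "p1")
g.add_edge("case1", "CITES", "act1")

print(g.neighbours("case1", "CITES"))   # ['act1']
```

In a real deployment the same node labels, properties, and typed relationships map directly onto Neo4j's data model, which is what makes query-independent ranking over the fact graph possible in the final module.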
Surdeanu et al. [4] proposed a method for extracting text relevant to litigation claims, along with entity mentions in each claim, from hierarchically annotated legal domain data. They adopted a semi-supervised, bottom-up approach for building a joint hierarchical conditional random field model using a combination of pseudo-likelihood and Gibbs sampling, and showed that these models outperform models adopting a top-down approach. Constantino et al. [5] propose CLIEL, a system for annotating legal documents using XML tags to facilitate IE of data point instances such as the date of the document, the names of the parties, the governing law, and many more. The system is tested on a set of 97 digitized commercial law documents of different formats, structures, and layouts. CLIEL uses NLP techniques, the Java Annotation Patterns Engine (JAPE), and a rule-based layout detection tree (RLDT) to extract information from the annotated XML document generated for each commercial law document and stores it in a database for future reference.
María et al. [6] present an approach focused on validating and improving the quality of the results of an IE system using an ontology that stores domain knowledge. The proposed approach works on the output produced by AIS, an IE system specialized in analyzing Spanish legal documents. It uses an ontology specially designed for the legal domain and a data curation process to validate the results obtained from AIS, storing them for future reference through an entity aligner module. Bommarito et al. [7] developed LexNLP, a Python package for extracting information from legal and regulatory text. The objective behind its development is to support academic research as well as industrial applications. It uses NLP techniques and machine learning to provide features such as legal document segmentation, extraction of structured information from text, NER, and conversion of text into feature vectors for machine learning models. The model is built from real documents from SEC EDGAR and is open source. Savelka et al. [8] proposed a framework for extracting important sentences from court judgments so that users need not refer to lengthy case documents to understand statutory terms. They adopted techniques such as measuring similarity between case sentences and user queries, using a context model for sentences, query optimization, and identifying novel sentences for user queries. The proposed framework is tested on a labeled dataset of 4,635 sentences for three statutory queries. Kumar et al. [9] worked on finding similarity among court judgments using IR techniques and search engine mechanisms. They compared the all-term and legal-term cosine similarity methods and showed that the legal-term cosine similarity method performs better.
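The all-term versus legal-term cosine similarity comparison in [9] can be sketched with a toy term-frequency model. The legal-term vocabulary and the judgment snippets below are invented stand-ins for a real legal dictionary and real judgments:

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b, vocabulary=None):
    """Cosine similarity over term-frequency vectors. If a vocabulary is
    given, only terms in it are counted (the 'legal term' variant)."""
    tokens_a = [t for t in doc_a.lower().split() if vocabulary is None or t in vocabulary]
    tokens_b = [t for t in doc_b.lower().split() if vocabulary is None or t in vocabulary]
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

# Invented legal vocabulary and judgment snippets, for illustration only.
LEGAL_TERMS = {"appeal", "conviction", "bail", "evidence", "acquitted"}
j1 = "the appeal against conviction was dismissed for lack of evidence"
j2 = "the court granted bail pending appeal after reviewing the evidence"

all_term = cosine_similarity(j1, j2)
legal_term = cosine_similarity(j1, j2, LEGAL_TERMS)
# legal_term > all_term here: the shared legal vocabulary dominates once
# common function words like "the" are filtered out.
```

Restricting the vector space to legal vocabulary removes noise from frequent function words, which is consistent with the finding in [9] that legal-term similarity performs better.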

Deep learning techniques for legal text processing
Recently, deep learning [10] has become the popular choice of researchers for handling complex and heterogeneous legal domain documents. As described by Goldberg [11], deep neural networks outperform traditional rule-based, dictionary-based, and machine learning models by supporting multiple layers and non-linear activation functions and by capturing long-term dependencies. Deep neural networks provide excellent analytical and processing capacity to capture language semantics and syntax, thus coming closer to human sophistication.
Chalkidis and Kampas [12], in a survey, discuss applications of deep learning for processing legal text based on three different NLP tasks, namely text classification, information extraction, and information retrieval. This work focuses primarily on semantic feature representations for deep learning models. One of the important contributions of their research is a legal word embedding dataset, built with the word2vec model on legislation from European countries. Bansal et al. [13] provide a comparative analysis of different legal tasks such as classification, summarization, case reviews, and prediction using deep learning models, namely CNNs, RNNs, LSTMs, and GRUs. Their study classifies legal tasks into three subdomains, viz. data search, legal text analytics, and legal intelligent interfaces. They found that deep learning models provide state-of-the-art performance for the majority of the studied systems.
Lippi et al. [14], [15] proposed a methodology to identify loopholes in online service agreements in the form of unfair clauses. They formulated the identification of unfair clauses as a sentence classification problem, with an experimental setup using support vector machines [16] combined with deep learning architectures, i.e., convolutional neural networks [17] and long short-term memory networks [18]. This work is available as a commercial tool for domain users.
Xia et al. [19] emphasize the need for intelligent justice through effective deep learning techniques. Considering the complex structure of legal documents, similarity analysis is a difficult task. To address this difficulty, they proposed an approach combining Word2vec with a legal document corpus to improve the accuracy of similarity analysis of law documents, and demonstrated that their approach shows improved performance.
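The embedding-based similarity idea above can be sketched as follows: each document is represented by the average of its word vectors, and documents are compared by cosine similarity. The tiny 3-dimensional "embeddings" below are hand-made stand-ins for vectors actually learned by Word2vec from a legal corpus:

```python
import math

# Hand-made 3-dimensional 'embeddings' standing in for learned Word2vec vectors.
EMBEDDINGS = {
    "appeal":     (0.9, 0.1, 0.0),
    "conviction": (0.8, 0.2, 0.1),
    "bail":       (0.7, 0.3, 0.0),
    "cricket":    (0.0, 0.1, 0.9),
    "match":      (0.1, 0.0, 0.8),
}

def doc_vector(tokens):
    """Represent a document as the average of its known word embeddings."""
    known = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not known:
        return (0.0, 0.0, 0.0)
    n = len(known)
    return tuple(sum(v[i] for v in known) / n for i in range(3))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = (math.sqrt(sum(a * a for a in u))
            * math.sqrt(sum(b * b for b in v)))
    return dot / norm if norm else 0.0

legal_a = doc_vector(["appeal", "conviction"])
legal_b = doc_vector(["bail", "appeal"])
sports = doc_vector(["cricket", "match"])
# The two legal documents come out far more similar to each other
# than either is to the sports document.
```

Unlike raw term matching, this representation lets two judgments score as similar even when they share no exact terms, provided their vocabularies are close in the embedding space.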
Nanda et al. [20] extended their work based on unsupervised lexical and semantic similarity techniques [21], [22] to evaluate a multilingual legal corpus of European directives and national legislation (from Ireland, Luxembourg, and Italy). They used shallow neural networks to develop word and paragraph embedding models for the corpus. The proposed work develops unsupervised as well as supervised semantic similarity models to identify transpositions, and their performance is evaluated on various feature sets.
Marques et al. [23] presented a scoring mechanism to rank the most relevant legal citations in case judgments to support a legal argument. The scoring mechanism uses a feature matrix in which each case article serves as a feature for a classifier that produces recommendations. A second score makes use of word embedding text similarity techniques to find relevant citations. The researchers claim that their technique outperforms baseline techniques on the ranking evaluation of relevance criteria.

Knowledge base population for legal text processing
A knowledge base is a machine-readable data repository in a structured format. Some of the popular commercially used knowledge base projects include Wikidata [24], DBpedia [25], and Freebase [26]. A graph is a well-suited data structure for storing factual information in the form of relationships between entities. Knowledge base population (KBP) systems [27] extract knowledge from available resources and generate a knowledge base by considering semantic and contextual information from those resources. The objective of a KBP system is to automatically identify entities in unstructured text documents, discover facts about those entities, and represent them in a structured knowledge base format. A specific KBP system goal should be to use logical reasoning to draw inferences based on the logical contents of the input data. KBP involves two separate sub-tasks: entity linking and slot filling. The entity linking task [28] aligns a textual mention of a named entity to its appropriate entry in the knowledge base, or determines that the entity does not exist in the KB. The slot filling task [29] collects information regarding certain attributes of an entity from the corpus; if the corpus does not provide any information for a given attribute, the system generates a NIL response. Information extraction is necessary and crucial for successfully populating knowledge bases.
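The two KBP sub-tasks can be illustrated with a minimal sketch: a toy entity linker that resolves mentions against an existing KB (returning NIL when no entry matches) and a slot filler that returns NIL for attributes the KB does not hold. All entity ids, aliases, and slots below are invented for illustration:

```python
# Toy knowledge base: canonical entity id -> known aliases and attribute slots.
KB = {
    "E1": {"aliases": {"supreme court of india", "sc of india"},
           "slots": {"type": "court", "location": "New Delhi"}},
    "E2": {"aliases": {"indian penal code", "ipc"},
           "slots": {"type": "statute"}},
}

def link_entity(mention):
    """Entity linking: align a textual mention to a KB entry, else NIL."""
    m = mention.strip().lower()
    for entity_id, entry in KB.items():
        if m in entry["aliases"]:
            return entity_id
    return "NIL"

def fill_slot(entity_id, attribute):
    """Slot filling: return the attribute value, or NIL when it is absent."""
    entry = KB.get(entity_id)
    if entry is None:
        return "NIL"
    return entry["slots"].get(attribute, "NIL")

print(link_entity("IPC"))           # E2
print(fill_slot("E1", "location"))  # New Delhi
print(fill_slot("E2", "location"))  # NIL  (no value available)
```

Real KBP systems replace the alias lookup with learned disambiguation models and the slot table with extractions mined from the corpus, but the NIL contract for both sub-tasks is the same.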
The objective of IE from a text corpus is to extract and represent information as tuples of two entities and a relationship between them. The task of extracting information from a large number of documents in the absence of a labeled dataset is termed open information extraction [30]. This paradigm is claimed to be portable across different domains. One can prefer open information extraction for analyzing legal documents that run across several pages; it can assist practitioners and ordinary people in getting the essence of a complex legal document. One of the important challenges of the traditional information extraction approach [31] is its dependency on handcrafted, domain-specific pattern matching rules; information extraction outside the boundary of those rules cannot be done using traditional IE. Refer to Table 1 for a comparative analysis of traditional versus open IE. Some of the error classes identified in IE [32] are boundary errors, uninformative extractions, redundant relation extractions, and wrong extractions. A variety of approaches have been proposed to address entity linking and slot filling, and these diverse approaches are providing new opportunities for both KBP sub-tasks. TextRunner [33] and Stanford OpenIE [34] are examples of OIE systems. A knowledge graph is a very effective data structure for storing semantically related concepts together, extracted using the open information extraction approach and represented using relational machine learning.
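The (entity, relation, entity) tuple representation can be sketched with a deliberately naive pattern-based extractor. This is a toy stand-in for systems such as TextRunner or Stanford OpenIE: one hand-written regular expression plays the role of the extraction patterns those systems learn automatically, and the sentences are invented:

```python
import re

# One naive pattern: "<subject> <relation verb> <object>".
# Real OIE systems learn many extraction patterns; this single regex is a toy.
PATTERN = re.compile(r"^(?P<subj>[\w ]+?) (?P<rel>cited|upheld|dismissed) (?P<obj>[\w ]+)$")

def extract_triples(sentences):
    """Return (subject, relation, object) tuples for sentences matching the pattern."""
    triples = []
    for sentence in sentences:
        match = PATTERN.match(sentence.strip().rstrip("."))
        if match:
            triples.append((match["subj"], match["rel"], match["obj"]))
    return triples

docs = [
    "The High Court dismissed the appeal.",
    "The bench cited Section 302.",
    "Arguments continued for three days.",  # no pattern match -> no triple
]
print(extract_triples(docs))
```

The toy also shows why traditional pattern-based IE is brittle: the third sentence yields nothing because it falls outside the handcrafted rule, which is exactly the limitation open IE tries to lift.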
Shrinivasa et al. [35] developed a knowledge base named crime base from online news articles in the leading Indian newspapers Times of India and Deccan Chronicle from Jan 2018 to Jun 2018, as crime reports published in newspapers are more authentic than information available on social media. Crime base contains crime entities from multiple modalities in machine-readable form, which can be useful to law enforcement agencies for analyzing crime activities and making future predictions. The novelty of this work is considering image as well as text data for the construction of the knowledge base. Crime base uses a domain-specific, manually crafted rule-based approach for crime entity extraction, employing techniques such as tokenization, POS tagging, NER, named entity disambiguation [36], and contextual and semantic similarity measures [37] for text data, and low- and high-level features for image data. The system visualization is done using the OWL model [38]. Boella [39] proposed a legal knowledge management system for understanding legal terms, different norms, and the interrelations between them. This system will be beneficial for legal experts as well as the common man for a better understanding of the legal domain. The main objective of the proposed work was to semi-automate frequently needed tasks: classification of documents, getting a clear understanding of legal terms, extracting key terms for a user query, and more sophisticated search options. The proposed semi-automatic knowledge population task in the legal domain [40] is implemented using rule-based and statistical procedures for parsing the Italian legal database of norms for sentence extraction, followed by the application of pattern matching rules to identify named entities in the corpus. A statistical framework and legislative XML are used to represent the extracted named entities for visualization purposes.

Proposed approaches for Indian legal system
Legal information retrieval systems need to identify catchphrases from judgments automatically, a mechanism that needs to be explored in depth. Mandal [41] proposed an unsupervised learning approach for the automatic extraction and ranking of catchphrases using the noun phrases from judgments. The proposed system was compared with different supervised and unsupervised baseline systems and achieves statistically better performance. Like catchphrase detection, measuring similarity between legal documents is also desired by IR systems; two types of techniques, graph-based and text-based, are available for this task. Mandal [42] proposed a similarity measuring approach for Indian legal documents using text-based methods combined with topic modelling and neural word and document embeddings for better results. This work shows that the embedding-based approach outperforms both graph-based and text-based baseline systems. Bhattacharya [43] proposed an approach for automatic identification of the rhetorical roles of sentences in Supreme Court of India judgments using deep neural networks; the significance of using deep neural networks is that these systems work better than many baseline systems that use handcrafted features.
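The noun-phrase-based catchphrase ranking in [41] can be sketched with a simple frequency ranking of candidate phrases. The candidate generator below is a crude stopword heuristic standing in for a real noun-phrase chunker, and the example judgment text is invented:

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "was", "is", "in", "and", "to", "by", "that"}

def candidate_phrases(text):
    """Crude noun-phrase stand-in: maximal runs of non-stopword tokens."""
    phrases, current = [], []
    for token in text.lower().replace(".", " ").split():
        if token in STOPWORDS:
            if current:
                phrases.append(" ".join(current))
            current = []
        else:
            current.append(token)
    if current:
        phrases.append(" ".join(current))
    return phrases

def rank_catchphrases(judgment, top_n=3):
    """Rank candidate phrases by frequency, standing in for the unsupervised scorer."""
    counts = Counter(candidate_phrases(judgment))
    return [phrase for phrase, _ in counts.most_common(top_n)]

judgment = ("The anticipatory bail was denied by the sessions court. "
            "The sessions court was of the view that the anticipatory bail "
            "was not justified.")
print(rank_catchphrases(judgment))
# 'anticipatory bail' and 'sessions court' rank first, as both occur twice.
```

A real system would replace the stopword heuristic with POS-based noun-phrase chunking and the raw frequency with the unsupervised ranking score of [41], but the pipeline shape, candidate generation followed by scoring, is the same.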

CONCLUSION
Legal text documents are structurally different from texts in other domains such as news articles or bioinformatics. So, techniques adapted for information extraction from the legal domain demand an understanding of the formats and semantics of legal documents. Also, legal documents exist in different varieties, like contracts, reports, and court judgments, each of which follows a different layout and structure. From the survey conducted on IE from legal texts, it is evident that not much work has been carried out on IE for Indian law system documents. Though the NLP approach seems promising for legal text processing, representing the extracted information in a machine-readable as well as user-friendly form remains a challenge for this approach. The deep learning approach is widely adopted by the researcher community across various domains, but creating a tagged corpus for this approach requires too much manual effort for complex and lengthy legal documents. After analysis of all the approaches for legal text processing, we find that the knowledge base population approach combined with NLP techniques can give promising results for legal text analysis for many tasks, such as automatic summarization, finding the most relevant case judgments, classification of legal documents according to laws, acts, or any other parameter, automatic headnote generation for a case, finding the references for a given case through citations, and many more for the Indian law system. A number of areas with different issues in legal text analysis remain open for exploration, where researchers from the information extraction community can contribute both to relieve legal experts of manual, complicated, time-consuming tasks and to help the common man better understand the legal domain. After discussing the need for automation in the legal sector, a fundamental question arises: will automation in the legal sector replace lawyers and legal analysts in the future?
To answer this question, one needs to understand that the legal domain is highly driven by analysis, decision making, and representation techniques, which are difficult to automate. Still, there are some areas in the legal domain where automation is highly desired: due diligence, i.e., contract review and legal research, to save manual effort; prediction technology, to predict the probable outcome of cases by analyzing previous judgments; legal analytics, to generate data points from past judgments and identify relevant case laws for lawyers to use in their present cases; and automation of documentation, where legal documents are prepared automatically once the relevant inputs are submitted.