
Today, the digital world holds a huge amount of data, and extracting the main points from it manually is next to impossible. There is therefore a need for machine learning algorithms that can shorten long texts and deliver summaries that fluently convey the intended message.

There are numerous ML-based models for this task. Most treat it as a classification problem, deciding whether or not to include each sentence in the summary. Other approaches have used topic information, Latent Semantic Analysis (LSA), sequence-to-sequence models, reinforcement learning, and adversarial processes.

There are two main approaches to text summarization in NLP:

1. Extractive Text Summarization:

In an extractive summary, no new sentences are generated, unlike the way humans summarize; the summary is just a subset of the original text. These summaries convey the approximate content of the text, which is enough to judge its relevance, but they may lack narrative coherence.

A Machine Learning (ML) approach can be devised by using data sets of documents paired with their corresponding extractive summaries. The sentences of each document are modeled as vectors of features extracted from the text. The summarization task can be seen as a two-class classification problem, where a sentence is labeled as “correct” if it belongs to the extractive reference summary, or as “incorrect” otherwise. The trainable summarizer is expected to “learn” the patterns which lead to the summaries, by identifying relevant feature values which are most correlated with the classes “correct” or “incorrect”. When a new document is given to the system, the “learned” patterns are used to classify each sentence of that document into either a “correct” or “incorrect” sentence, producing an extractive summary.
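The two-class setup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the model used in any particular system. The three features (sentence position, relative length, average word frequency), the logistic-regression classifier, the naive sentence splitter, and the toy training document are all hypothetical choices made for the example.

```python
import math
import re


def split_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    return re.findall(r"[a-z0-9']+", sentence.lower())


def sentence_features(sentences):
    """Model each sentence as [bias, position, relative length, average word frequency]."""
    doc_tokens = [tokenize(s) for s in sentences]
    counts = {}
    for tokens in doc_tokens:
        for tok in tokens:
            counts[tok] = counts.get(tok, 0) + 1
    max_count = max(counts.values()) if counts else 1
    max_len = max((len(t) for t in doc_tokens), default=1) or 1
    n = len(sentences)
    features = []
    for i, tokens in enumerate(doc_tokens):
        position = 1.0 - i / n  # earlier sentences get a higher value
        length = len(tokens) / max_len
        if tokens:
            avg_freq = sum(counts[t] for t in tokens) / (len(tokens) * max_count)
        else:
            avg_freq = 0.0
        features.append([1.0, position, length, avg_freq])
    return features


def build_examples(document, reference_summary):
    """Label a sentence 1 ("correct") if it appears in the reference summary, else 0."""
    sentences = split_sentences(document)
    reference = set(split_sentences(reference_summary))
    feats = sentence_features(sentences)
    return [(x, 1 if s in reference else 0) for s, x in zip(sentences, feats)]


def predict(weights, x):
    """Probability that a sentence with feature vector x is "correct"."""
    z = sum(w * xi for w, xi in zip(weights, x))
    z = max(-30.0, min(30.0, z))  # clamp to keep math.exp in a safe range
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=500, lr=0.5):
    """Logistic regression fitted with plain stochastic gradient descent."""
    weights = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for x, y in examples:
            error = y - predict(weights, x)
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
    return weights


def extract_summary(document, weights, threshold=0.5):
    """Keep, in original order, the sentences classified as "correct"."""
    sentences = split_sentences(document)
    feats = sentence_features(sentences)
    return [s for s, x in zip(sentences, feats) if predict(weights, x) >= threshold]


# Toy data, invented for this sketch only.
TRAIN_DOC = ("Alpha signed a contract. The weather was mild. "
             "The contract covers three years.")
TRAIN_REF = "Alpha signed a contract. The contract covers three years."

weights = train(build_examples(TRAIN_DOC, TRAIN_REF))
print(" ".join(extract_summary(TRAIN_DOC, weights)))
```

A real system would train on many document and summary pairs and use richer features. The shape of the pipeline stays the same: label sentences against the reference summary, fit a classifier, then keep the sentences it marks as "correct".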

Here is an example.

Source text:

smartData Enterprises and XYZ enterprise came to sign a contract for the ‘ABC’ project in Jerusalem. In the city, Mary gave birth to a child named Jesus.

Extractive summary:

smartData Enterprises and XYZ enterprise sign contract Jerusalem. Mary birth Jesus.

As you can see above, words have been extracted from the source text and joined to create a summary, although sometimes the summary can be grammatically strange.

2. Abstractive Text Summarization:

In this case, new phrases and sentences are created that convey the useful information in the original text. The sentences generated through this method may therefore not be present in the original document, which is similar to the way humans summarize. Developing this type of summarizer can be difficult because it requires natural language generation. One of the most widely used approaches to this problem is the sequence-to-sequence RNN, a model that produces an output sequence of words from an input sequence of words. The input and output of the model can be of variable lengths.

Here is an example.

Source text:

smartData Enterprises and XYZ enterprise came to sign a contract for the ‘ABC’ project in Jerusalem. In the city, Mary gave birth to a child named Jesus.

Abstractive Summary:

smartData Enterprises and XYZ enterprise came to Jerusalem to sign a contract where Jesus was born.

smartData has developed solutions to the problem of summarizing and paraphrasing using ML. In one of these solutions, we created an ML model that provides effective and efficient summarization and paraphrasing using the T5, DistilBART-CNN, and GPT-3 models. We custom-trained these models on different datasets using transfer learning to achieve better results.

In another solution, we developed a smartbot that extracts information from organization-related documents, preprocesses the documents, and saves the results into databases in a structured format. The bot lets users ask questions related to the organization and tries to give the best answer based on the data available. It uses a BERT model trained on SQuAD for question answering, a custom NLU to understand the questions asked by users, and various data extraction and preprocessing pipelines.
