Text Summarization in NLP

Today, the digital world holds a huge amount of data, and extracting the main points from it manually is next to impossible. There is therefore a need for machine learning algorithms that can shorten long texts and produce fluent summaries that convey the intended message.

There are numerous ML-based models for this task. Most treat summarization as a classification problem: the model decides whether or not to include each sentence in the summary. Other approaches use topic information, Latent Semantic Analysis (LSA), sequence-to-sequence models, reinforcement learning, and adversarial processes.

There are two main approaches to text summarization in NLP:

1. Extractive Text Summarization:

An extractive summary does not generate new sentences the way a human writer would; it is simply a subset of the original text. Such summaries convey the approximate content of the text for relevance judgment, but they may lack narrative coherence.

A Machine Learning (ML) approach can be devised by using the data sets with their corresponding extractive summaries. The sentences of each document are modeled as vectors of features extracted from the text. The summarization task can be seen as a two-class classification problem, where a sentence is labeled as “correct” if it belongs to the extractive reference summary, or as “incorrect” otherwise. The trainable summarizer is expected to “learn” the patterns which lead to the summaries, by identifying relevant feature values which are most correlated with the classes “correct” or “incorrect”. When a new document is given to the system, the “learned” patterns are used to classify each sentence of that document into either a “correct” or “incorrect” sentence, producing an extractive summary.
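The two-class formulation described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the method of any particular system: the three features (sentence position, relative length, and average word frequency), the logistic-regression classifier, the function names, and the tiny training document with its labels are all assumptions made for the example. Label 1 stands for "correct" (the sentence belongs to the reference summary) and 0 for "incorrect".

```python
import math
import re
from collections import Counter

def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def sentence_features(sentences):
    """One vector per sentence: [bias, position, relative length, word salience]."""
    tokenized = [re.findall(r"[a-z0-9']+", s.lower()) for s in sentences]
    freq = Counter(w for words in tokenized for w in words)
    top = max(freq.values()) if freq else 1
    max_len = max((len(words) for words in tokenized), default=1) or 1
    n = len(sentences)
    rows = []
    for i, words in enumerate(tokenized):
        position = 1.0 - i / n              # earlier sentences score higher
        length = len(words) / max_len       # length relative to the longest sentence
        salience = sum(freq[w] for w in words) / (top * len(words)) if words else 0.0
        rows.append([1.0, position, length, salience])
    return rows

def predict(weights, x):
    # Logistic model: estimated probability that the sentence is "correct".
    return 1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(weights, x))))

def train(rows, labels, epochs=2000, lr=0.5):
    # Plain stochastic gradient descent on the logistic loss.
    weights = [0.0] * len(rows[0])
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            error = predict(weights, x) - y
            weights = [w - lr * error * v for w, v in zip(weights, x)]
    return weights

def summarize(text, weights, threshold=0.5):
    # Keep, in original order, every sentence classified as "correct".
    sentences = split_sentences(text)
    rows = sentence_features(sentences)
    return [s for s, x in zip(sentences, rows) if predict(weights, x) >= threshold]

# Hypothetical training document and reference labels (1 = "correct").
train_text = ("The company signed a contract for the project. "
              "The weather was mild. "
              "Lunch was served at noon. "
              "The contract covers the project for two years.")
train_labels = [1, 0, 0, 1]
weights = train(sentence_features(split_sentences(train_text)), train_labels)

new_text = ("The project contract was renewed. "
            "Someone brought cake. "
            "The company expects the project to grow.")
print(summarize(new_text, weights))
```

A real system would train on many documents with reference summaries and use richer features; the point here is only the shape of the pipeline: extract features per sentence, fit a binary classifier, then keep the sentences it labels "correct".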

Here is an example.

Source text:

smartData Enterprises and XYZ enterprise came to sign a contract for the ‘ABC’ project in Jerusalem. In the city, Mary gave birth to a child named Jesus.

Extractive summary:

smartData Enterprises and XYZ enterprise sign contract Jerusalem. Mary birth Jesus.

As the example shows, words are extracted from the source text and joined to create the summary, so the result can sometimes be grammatically awkward.

2. Abstractive Text Summarization:

In this case, new phrases and sentences can be created that convey the useful information from the original text. The generated sentences may therefore not appear in the original document, which is similar to how humans summarize. Developing this type of summarizer can be difficult because it requires generating natural language. One of the most widely used approaches to this problem is the sequence-to-sequence RNN, in which the model produces an output sequence of words from an input sequence of words. The input and output of the model can have different, variable lengths. For example:

Source text:

smartData Enterprises and XYZ enterprise came to sign a contract for the ‘ABC’ project in Jerusalem. In the city, Mary gave birth to a child named Jesus.

Abstractive Summary:

smartData Enterprises and XYZ enterprise came to Jerusalem to sign a contract where Jesus was born.
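The sequence-to-sequence idea can be illustrated with a toy encoder-decoder RNN. The sketch below is an assumption-laden illustration, not a working summarizer: the class name `ToySeq2Seq`, the token ids, and the hidden size are hypothetical, and the weights are random and untrained, so the output tokens are meaningless. It only shows how an input of any length is folded into a hidden state and how the decoder then emits tokens one at a time (greedy decoding) until it produces an end-of-sequence token or reaches a length limit.

```python
import math
import random

class ToySeq2Seq:
    """Untrained Elman-style RNN encoder-decoder with random weights."""

    def __init__(self, vocab_size, hidden_size=8, seed=0):
        rng = random.Random(seed)

        def matrix(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.embed = matrix(vocab_size, hidden_size)   # token id -> input vector
        self.enc_w = matrix(hidden_size, hidden_size)  # encoder recurrent weights
        self.dec_w = matrix(hidden_size, hidden_size)  # decoder recurrent weights
        self.out_w = matrix(vocab_size, hidden_size)   # hidden state -> vocabulary scores

    def _step(self, weights, hidden, token):
        # h_t = tanh(W * h_{t-1} + embedding(token))
        x = self.embed[token]
        return [math.tanh(sum(w * h for w, h in zip(row, hidden)) + x_i)
                for row, x_i in zip(weights, x)]

    def encode(self, tokens):
        # Fold the whole input sequence, whatever its length, into one hidden state.
        hidden = [0.0] * self.hidden_size
        for token in tokens:
            hidden = self._step(self.enc_w, hidden, token)
        return hidden

    def decode(self, hidden, bos, eos, max_len=10):
        # Greedy decoding: feed back the highest-scoring token until EOS or max_len.
        output, token = [], bos
        for _ in range(max_len):
            hidden = self._step(self.dec_w, hidden, token)
            scores = [sum(w * h for w, h in zip(row, hidden)) for row in self.out_w]
            token = max(range(self.vocab_size), key=scores.__getitem__)
            if token == eos:
                break
            output.append(token)
        return output

BOS, EOS = 0, 1  # hypothetical ids for the begin/end-of-sequence tokens
model = ToySeq2Seq(vocab_size=12)
short_out = model.decode(model.encode([5, 7, 3]), BOS, EOS)
long_out = model.decode(model.encode([5, 7, 3, 9, 4, 8, 2, 6]), BOS, EOS)
print(len(short_out), len(long_out))
```

In a trained model, the weights would be learned from pairs of documents and reference summaries, and practical systems typically add mechanisms such as attention on top of this basic encoder-decoder loop.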

smartData has developed ML-based solutions for summarization and paraphrasing. In one of these solutions, we created an ML model that provides effective and efficient summarization and paraphrasing using the T5, distilbart-cnn, and GPT-3 models. These models were custom-trained on different datasets using transfer learning to achieve better, more optimized results.

In another solution, we have developed a smartbot that extracts information from organization-related documents, preprocesses the documents, and saves the results in databases in a structured format. The bot lets users ask questions about the organization and tries to give the best answer based on the available data. It uses a BERT SQuAD model for question answering, a custom NLU component to understand users' questions, and various data extraction and preprocessing pipelines.
