
Hybrid MemNet for Extractive Summarization



 -By Abhishek Singh, Manish Gupta, Vasudeva Varma






Centre for Language Technologies Research Centre 
International Institute of Information Technology 
Hyderabad - 500 032, 
INDIA


A summary can be defined as:

“A summary is a text produced from one or more texts that contains a significant portion of the information in the original text which is no longer than half of the original text.”


“Automatic text summarization is the process of reducing text document(s) with a computer program in order to create a summary that retains the most important points of the original document(s).”



Summarization categories based on type of generated summary:



Extractive Summarization: the most common approach to text summarization. It selects a subset of existing textual units (words, phrases, or sentences) from the original text by assigning a score to each unit and then picking the most informative units to form the summary. However, the resulting summary tends to be incoherent and to lack cohesion, with unresolved co-references and discourse relations.
Abstractive Summarization: aims to generate a summary closer to what a manually written summary looks like. It builds an internal semantic representation of the text and then uses natural language generation techniques to create the summary from scratch, word by word. Such summaries are meant to be grammatically correct and more coherent. In practice, however, the poor performance of natural language generation techniques degrades the quality of the generated summaries, which makes extractive summaries preferable. Early approaches applied sentence fusion and sentence compression to an extract as a first step towards abstractive summarization.


Summarization categories based on number of documents to summarize:


Single-Document Summarization: the setting addressed by the first attempts at building an automated summarization system. Here, the system takes one document as input and produces a concise summary of it. Sentence ranking is an important step in such a system.

Multi-Document Summarization: Because of the high information redundancy on the internet, researchers’ interest shifted towards the problem of multi-document summarization. Here, the system takes multiple documents on the same theme or cluster and produces a single summary of the whole set.
       These systems follow a two-step approach:
  •  Sentence Ranking - scoring sentences by their informativeness.
  •  Sentence Selection - selecting the top k sentences by ranking score such that the sentences selected for the summary are not redundant.

Such systems inherit the primary issue of coherence, which gets worse when sentences carry conflicting information. Because of the redundancy in internet data, these systems are quite popular.
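The two steps above (ranking, then non-redundant selection) can be sketched in a few lines of Python. This is only an illustrative sketch, not the method of the paper: the scoring function (average document frequency of a sentence's words), the Jaccard overlap test, and the `max_overlap` threshold are all assumptions chosen to keep the example small.

```python
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def rank_sentences(sentences):
    """Step 1: score each sentence by the average frequency of its words."""
    counts = Counter(w for s in sentences for w in tokenize(s))
    scores = []
    for s in sentences:
        words = tokenize(s)
        scores.append(sum(counts[w] for w in words) / len(words) if words else 0.0)
    return scores


def jaccard(a, b):
    """Word-set overlap between two sentences, between 0 and 1."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def select_top_k(sentences, scores, k, max_overlap=0.5):
    """Step 2: take the best-scoring sentences, skipping near-duplicates."""
    order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = []
    for i in order:
        if all(jaccard(sentences[i], sentences[j]) <= max_overlap for j in chosen):
            chosen.append(i)
        if len(chosen) == k:
            break
    # Restore document order so the extract reads naturally.
    return [sentences[i] for i in sorted(chosen)]
```

A neural ranker would replace `rank_sentences`, while the selection step would stay essentially the same.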

Summarization categories based on content of summary:

Indicative Summarization: portrays the key topics of the text, reducing the length of the original text by 90%. It includes metadata such as writing style and document length, but it does not provide factual information. It helps a user decide whether to read the document or not.
Informative Summarization: generally longer; it reduces the length of the original text by 70-80%. It includes facts and information and can replace the original text.
Evaluative Summarization: aims to capture the opinion or views of the author on a given topic, subject, or product. It is sometimes also referred to as review or opinion-based summarization.
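To make the reduction percentages concrete, here is a small worked calculation. The 1,000-word document length is a hypothetical figure chosen for illustration; only the 90% and 70-80% reductions come from the text above.

```python
def target_length(original_words, reduction):
    """Summary length in words after removing `reduction` of the original."""
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0 and 1")
    return round(original_words * (1.0 - reduction))


original = 1000  # hypothetical document length in words
indicative = target_length(original, 0.90)    # 90% reduction
informative = (target_length(original, 0.80),  # 80% reduction
               target_length(original, 0.70))  # 70% reduction
```

For this document, an indicative summary would run to roughly 100 words and an informative one to roughly 200 to 300 words.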



Summarization categories based on target audience:

Generic Summarization: the most prevalent type of summarization. It targets a general audience and is independent of the genre or domain of the document and of the purpose of the intended user.
Query-focused Summarization: aims to generate a summary based on the user’s query. The system picks out only the information related to the given query and presents a concise summary to the user. The query can be a phrase, a keyword, or a question. Search engines use this kind of summarization to produce snippets for the suggested web pages related to a user’s query.
Update Summarization: a special type of multi-document summarization in which the user is already familiar with some facts about the news. The idea is to generate a summary that omits the facts the user already knows and presents new content. It follows two steps: detecting novelty, then summarizing the content. Novelty detection is the crucial step here.
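The two-step flow of update summarization (detect novelty, then summarize) might look like the sketch below. The novelty measure (one minus the highest Jaccard word overlap with anything the user already knows), the `min_novelty` threshold, and the trivial "keep the first k novel sentences" summarizer are all assumptions made for illustration.

```python
import re


def tokens(text):
    """Set of lowercase word tokens in a sentence."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def novelty(sentence, known_sentences):
    """1 minus the highest word overlap with any already-known sentence."""
    words = tokens(sentence)
    best = 0.0
    for known in known_sentences:
        other = tokens(known)
        union = words | other
        if union:
            best = max(best, len(words & other) / len(union))
    return 1.0 - best


def update_summary(new_sentences, known_sentences, k=2, min_novelty=0.5):
    # Step 1: novelty detection, dropping what the user has already seen.
    novel = [s for s in new_sentences if novelty(s, known_sentences) >= min_novelty]
    # Step 2: summarization; here simply the first k novel sentences.
    return novel[:k]
```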

Summarization categories based on type of summarizer:

Author Summarization: summaries that reflect the author’s point of view, generally a 150-200 word non-technical summary that aims to present the work in an understandable manner.
Expert Summarization: summaries produced by a domain expert who has sound knowledge of the given field, topic, and domain but is not skilled in producing summaries.
Professional Summarization: summaries produced by professional summarizers, who are not necessarily experts in the given topic, field, or domain.

Summarization categories based on input and output language:

Monolingual Summarization: the system takes input text in one language and produces the summary in the same language. For example: a summarization system for English.
Multilingual Summarization: these systems (called unified systems) can work for multiple languages. However, the input text and the generated summary are in the same language.
Cross-lingual Summarization: these systems process several languages, but the output summary is in a different language from the input text. For example: summarization of Hindi news into English.




Summarization Applications





The major contributions of the work are as follows.
• Hybrid MemNet for Single Document Extractive Summarization

– Introduces a novel architecture that learns a better unified document representation by combining features from the memory network with features from the convolutional BLSTM network.

– Investigates the application of a memory network (which incorporates attention over sentences and captures the notion of summary worthiness of a sentence) and Conv-BLSTM (which incorporates n-gram features and sentence-level information) for learning a better thought vector with rich semantics.

– Further, the authors show experimentally that the proposed Hybrid MemNet architecture outperforms the basic systems and several state-of-the-art baselines. The model achieves a significant performance gain on the DUC 2002 generic single-document summarization dataset.

• Neural Sentence Ranking for Multi-Document Extractive Summarization
– The research presents CSTI, a novel method to encode the semantic and compositional features latent in a sentence. These can be combined with document-dependent features to learn a better heterogeneous sentence representation that captures the notion of summary worthiness of a sentence, which in turn improves the sentence ranking task for extractive summarization.

– The authors examine the application of transfer learning to overcome the serious scarcity of training data for multi-document extractive summarization.

– They demonstrate experimentally that the CSTI-based deep neural architecture outperforms the basic systems and various competitive baselines. The system achieves a significant performance gain on the DUC 2001, 2002, and 2004 generic multi-document summarization datasets.

• Attention based Neural Composition for Abstractive Summary Generation

– They present NASH, a first abstractive neural summarizer for Hindi, and a novel method to obtain the semantic and compositional features latent in the text, capturing an effective thought representation that can be attentively decoded into a high-quality abstractive summary.
– They create a large parallel corpus for Hindi containing ∼250K text-summary pairs.
– They demonstrate experimentally that the NASH architecture outperforms several competitive abstractive baselines by a significant margin of ∼2−3 points in terms of ROUGE score.



What is MemNet?

A MemNet system is one that allows all of the processors to have equal and fast access to a large amount of shared memory. While this abstraction is convenient for the programmer, the system implementation is not as simple as the abstraction implies. If it were, system response time would suffer from scarce resource contention and transmission latency. As additional hosts were added to the network, the potential required memory bandwidth would increase linearly in the number of hosts. To provide this abstraction to the programmer, and to avoid the performance roadblocks of scarce resource contention, extensive caching facilities are provided at each host.





Pytorch-MemNet

The building blocks of the model are:

a) Document Encoder
Captures local (n-gram level) information, global (sentence level) information, and the notion of summary-worthy sentences.

Hybrid MemNet is the summation of the document representation vectors learned from the Convolutional LSTM (Conv-LSTM; for hierarchical encoding) and MemNet (for capturing salience and redundancy).

Convolutional Sentence Encoder

A convolutional neural network applies a convolution operation over the word embeddings, followed by a max pooling operation. Multiple convolutional networks with different filter sizes {1, 2, 3, 4, 5, 6, 7} compute a list of embeddings, which are summed to obtain the final sentence vector.
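The encoder described above can be sketched without any framework so that the arithmetic stays visible. This is a hedged illustration, not the authors' PyTorch code: the embedding size, the number of filters per width, the `tanh` activation, and the random initialization are all assumptions. Only the filter widths 1 to 7, the max pooling, and the final summation come from the description.

```python
import math
import random


def conv_max_pool(word_vectors, filt, bias=0.0):
    """Slide one filter over the sentence and keep the largest activation."""
    width = len(filt)
    best = None
    for start in range(len(word_vectors) - width + 1):
        window = word_vectors[start:start + width]
        total = bias
        for w_row, f_row in zip(window, filt):
            total += sum(w * f for w, f in zip(w_row, f_row))
        act = math.tanh(total)
        best = act if best is None else max(best, act)
    # A sentence shorter than the filter yields 0.0 here; padding is the usual fix.
    return 0.0 if best is None else best


def make_filters(widths, n_filters, emb_dim, rng):
    """One bank of randomly initialized filters per filter width."""
    return {
        w: [[[rng.uniform(-0.1, 0.1) for _ in range(emb_dim)] for _ in range(w)]
            for _ in range(n_filters)]
        for w in widths
    }


def encode_sentence(word_vectors, filters):
    """Sum the max-pooled feature vectors produced by every filter width."""
    n_filters = len(next(iter(filters.values())))
    sentence_vec = [0.0] * n_filters
    for bank in filters.values():
        pooled = [conv_max_pool(word_vectors, f) for f in bank]
        sentence_vec = [a + b for a, b in zip(sentence_vec, pooled)]
    return sentence_vec


rng = random.Random(0)
EMB_DIM, N_FILTERS = 8, 4  # hypothetical sizes
filters = make_filters(range(1, 8), N_FILTERS, EMB_DIM, rng)
sentence = [[rng.uniform(-1, 1) for _ in range(EMB_DIM)] for _ in range(10)]
vector = encode_sentence(sentence, filters)
```

Because every filter width produces a vector of the same size (one value per filter), the per-width vectors can be summed directly, and sentences of any length map to a fixed-size representation.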

Convolutional Bidirectional LSTM Document Encoder

Since a Recurrent Neural Network (RNN) suffers from the vanishing gradient problem over long sequences, a Long Short-Term Memory (LSTM) network is used. Future context is captured by using a Bidirectional LSTM (BLSTM).

MemNet Based Document Encoder

The authors leverage a memory network encoder (MemNet), an architecture previously used to solve question answering and language modeling tasks.
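A single attention hop over a memory of sentence vectors, in the general style of end-to-end memory networks, can be sketched as follows. This is an assumption-laden illustration rather than the paper's exact formulation: the same sentence vectors serve as both keys and values, there is only one hop, and `conv_blstm_doc` is a placeholder standing in for the Conv-BLSTM document vector. The final line shows the summation that forms the hybrid representation.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def memory_hop(query, memory_keys, memory_values):
    """One hop: attend over memory slots, read a weighted sum, add it to the query."""
    weights = softmax([dot(query, k) for k in memory_keys])
    dim = len(memory_values[0])
    read = [sum(w * v[d] for w, v in zip(weights, memory_values)) for d in range(dim)]
    return [q + r for q, r in zip(query, read)], weights


# Hypothetical sentence vectors standing in for the sentence encoder output.
sentence_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query = [1.0, 0.0]  # hypothetical initial document vector
memnet_doc, attention = memory_hop(query, sentence_vectors, sentence_vectors)

conv_blstm_doc = [0.2, -0.1]  # placeholder for the Conv-BLSTM document vector
hybrid_doc = [a + b for a, b in zip(conv_blstm_doc, memnet_doc)]
```

The attention weights give each sentence a share of the document representation, which is one way to read the "summary worthiness" notion mentioned earlier.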






b) Decoder
        Attention based sequence to sequence decoder

Datasets

The Daily Mail corpus is used; it contains 193,986 training documents, 12,417 validation documents, and 10,350 test documents. To evaluate the model, the authors picked the standard DUC-2002 single-document summarization dataset, which consists of 567 documents.

To evaluate the quality of the system summaries, they used ROUGE-1 (unigram overlap) and ROUGE-2 (bigram overlap) as measures of informativeness and ROUGE-L for assessing fluency.
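For intuition, the three metrics can be approximated in a few lines. This is a simplified, recall-only sketch that splits on whitespace; the official ROUGE toolkit adds options such as stemming and also reports precision and F-scores, so its numbers will differ.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n):
    """Fraction of reference n-grams that also appear in the candidate."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    total = sum(ref.values())
    overlap = sum((cand & ref).values())  # clipped counts
    return overlap / total if total else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(candidate, reference):
    """Longest common subsequence length divided by reference length."""
    ref = reference.split()
    return lcs_length(candidate.split(), ref) / len(ref) if ref else 0.0
```

For the reference "the cat sat on the mat" and the candidate "the cat lay on the mat", this gives a ROUGE-1 recall of 5/6, a ROUGE-2 recall of 3/5, and a ROUGE-L recall of 5/6.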

List of Baseline Systems used


  • ILP - Integer Linear Programming for a phrase-based extraction model
  • TGRAPH - Graph-based approach for sentence-based extraction
  • URANK - Unified ranking for single- as well as multi-document summarization
  • LEAD - Selecting the leading three sentences of the document as the summary
  • NN-SE - A neural network based sentence extractor
  • GRU-RNN - Deep classifier that sequentially accepts or rejects each sentence in the document for inclusion in the summary
  • SummaRuNNer - RNN-based extractive summarizer
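Of these baselines, LEAD is simple enough to sketch directly. The regular-expression sentence splitter below is a rough assumption made for illustration; it will mis-split abbreviations such as "Dr." and is not the tokenizer used in the paper.

```python
import re


def lead_summary(document, n=3):
    """Return the first n sentences of the document as the summary."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    sentences = [s.strip() for s in parts if s.strip()]
    return " ".join(sentences[:n])
```

Despite its simplicity, taking the lead sentences is a common reference point for news summarization because news articles tend to put the key facts first.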


Results


Hybrid MemNet outperforms the baselines because the BLSTM learns a richer set of semantics: by processing the sequential data in both directions, it exploits some notion of future context.

  
