Deep Supporting Attention in Memory Network

Title:	Deep Supporting Attention in Memory Network (具深層支持式注意力之記憶性神經網路)
Authors:	Lin, Ting-An (林庭安); Chien, Jen-Tzung (簡仁宗); Department of Electrical Engineering
Keywords:	natural language processing; memory-augmented neural network; sequence-to-sequence learning; attention mechanism; question answering; image caption
Issue date:	2017
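Abstract:	Deep learning has been successfully applied to many natural language processing tasks, such as speech recognition, machine translation, dialogue systems, language understanding, reading comprehension, image captioning, image understanding, and question answering. What these tasks share is that the temporal information in natural language has to be learned by a recurrent neural network (RNN) or a long short-term memory (LSTM). Standard RNNs and LSTMs, however, have two shortcomings. First, an LSTM can store temporal information only in a single internal memory, so the model is too constrained and is forced to discard some temporal information when the history is long and rich. Second, without an attention mechanism, the temporal information becomes a loose and inefficient representation of natural language. For these reasons, recent work has introduced a series of new attention mechanisms that are applied to memory-augmented neural networks. We propose a supporting attention mechanism and apply it to memory networks, which provide an external memory for information storage. In general, attention over natural language or an observed sample focuses on or locates the regions or positions that matter for pattern classification. The attention parameter can be viewed as a latent variable that is estimated indirectly by minimizing the classification loss, and with such attention the target information may not be identified correctly. Therefore, in addition to minimizing the classification error, we also minimize the reconstruction loss on the support data so as to attend directly to the important regions. This solution can be formulated with variational inference. In this study, we introduce tasks based on object recognition, question answering, and image captioning to illustrate various attention solutions in deep learning, with particular focus on solutions that use sequence-to-sequence learning in an end-to-end memory network. In object recognition, the attention mechanism is developed to learn to locate an object and identify its class in a noisy observation in which the object is shifted. Our idea is to learn how to attend through so-called supporting attention, under the condition that support information is available. This attention mechanism aims to learn translation invariance, and the idea is developed not only for object recognition but also for question answering. Basically, the goal of question answering is to build a model that can read a story and answer a question about it. In many cases, however, some sentences in the story do not help in finding the answer. To address this, we use the supporting sentences to learn how to adjust the attention, and we perform question answering with sequence-to-sequence learning in a memory network. The model reads the story, builds memories, attends to the informative sentences, and compresses their information into a fixed-dimensional context vector. The context vector is used to reconstruct the supporting sentences and, combined with the question vector, to find the corresponding answer. In this stage, the encoder and decoder are driven by LSTMs, and the input memories are produced by word embedding. In addition, image captioning requires finding the best natural-language sentence to describe an input image. For this task we propose two attention mechanisms based on the memory network. The first adapts the supporting attention originally used for question answering to image captioning. With this method, the image is first processed by a convolutional network whose parameters are trained on the ImageNet dataset, and it is encoded into a set of high-dimensional feature vectors. These features are attended to several times in succession to compute hidden codes, which are used to generate the different words of the caption. In the implementation, the support data can be produced automatically by other existing models. The supporting attention focuses successively on different objects in the image to generate the text. The second method combines two complementary mechanisms, self-attention and word-to-image attention, for image captioning. Again, the image is processed by a convolutional network into feature maps that are stored in the input memory, and objects in the image are represented by memory slots. Self-attention attends to the relations between pairs of objects, which help natural language generation, while word-to-image attention attends to the object that corresponds to each individual word. An ideal caption not only reflects individual objects at the lexical level but also describes the relations among objects at the sentence level. This method also incorporates the residual concept, which allows a single attention mechanism to be learned across words. In the experiments on object recognition, question answering, and image captioning, we evaluate our methods on the noisy/shifted/cluttered MNIST dataset, the bAbI dataset, and the MS-COCO dataset, respectively. Object recognition is evaluated under different distortions, question answering on the 20 question types in bAbI, and image captioning by the BLEU scores reported on MS-COCO, where BLEU scores a caption against reference texts. We report a number of experiments that demonstrate supporting attention and hybrid attention.
Deep learning has been successfully developed for different natural language processing tasks such as speech recognition, machine translation, dialogue systems, language understanding, reading comprehension, image captioning, image comprehension, and question answering, where the temporal information in natural language can be learned by a recurrent neural network (RNN) or a long short-term memory (LSTM). Standard RNNs and LSTMs have two limitations. First, temporal information in an LSTM is stored in an internal memory, which is too limited to hold the abundant information in long and rich history data. Second, temporal information without attention is basically a loose and insufficient representation for natural language. Accordingly, this dissertation presents a series of new attention mechanisms for memory-augmented neural networks, where deep supporting attention is proposed and incorporated into memory networks, which provide external memory for information storage. In general, attention over an observed sample or natural language works by spotting or locating the region or position of interest for pattern classification. Such an attention parameter is a latent variable, which is estimated indirectly by minimizing the classification loss. With this attention, the target information may not be correctly identified. Therefore, in addition to minimizing the classification error, we directly attend to the region of interest by minimizing the reconstruction error with respect to the support data. This solution can be formulated by variational inference. In this study, tasks based on object recognition, question answering, and image captioning are introduced to illustrate various attention solutions in deep learning. In particular, we focus on solutions to sequence-to-sequence learning in an end-to-end memory network. In the task of object recognition, the attention mechanism is developed to learn the location and identity of an object in a noisy and shifted observation sample.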
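As a rough illustration of the training objective described above, the sketch below adds a reconstruction term on the support data to the usual classification loss. It is not the dissertation's variational formulation: the function names, the squared-error reconstruction term, and the `recon_weight` parameter are assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_loss(class_logits, label, attended, support, recon_weight=1.0):
    """Classification loss plus a weighted reconstruction loss (assumed form).

    class_logits: unnormalized class scores for one sample
    label:        index of the correct class
    attended:     feature vector produced from the attended region (hypothetical)
    support:      feature vector of the support data (hypothetical)
    """
    probs = softmax(class_logits)
    classification = -math.log(probs[label])  # cross-entropy on the true class
    reconstruction = sum((a - s) ** 2 for a, s in zip(attended, support))
    return classification + recon_weight * reconstruction
```

When the attended features match the support features exactly, the reconstruction term vanishes and only the classification loss remains; any mismatch adds a penalty that pushes the attention toward the supported region.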
Our idea is to learn how to attend through so-called supporting attention when support information is available. This attention mechanism corresponds to learning translation invariance. The idea is developed not only for object recognition but also for question answering. Basically, the task of question answering is to build a model that can read a story and answer a query related to it. However, in many cases, some sentences in a story are not helpful for finding the answer. To deal with this issue, we use the supporting sentences to learn how to attend in sequence-to-sequence learning for question answering based on a memory network. This model is capable of reading the story, building the memories, attending to the informative sentences, and embedding the information into a fixed-dimensional context vector. The context vector is taken for reconstructing the supporting sentences and is then augmented with the query vector for retrieving the associated answer. In this procedure, the encoder and decoder are driven by LSTMs, and the input memories are obtained by word embedding. In addition, image captioning aims to find the best natural sentence to describe an input image. We propose two attention methods for image captioning based on the memory network. The first method adapts the supporting attention from question answering to image captioning. In this method, an input image is first encoded by a convolutional layer whose weights are trained on the ImageNet dataset, producing a high-dimensional feature vector. This vector is then attended to multiple times to sequentially calculate the hidden codes that produce the different words of the caption. In the implementation, the support data are automatically acquired by using the other attention method. The supporting attention sequentially zooms in on different objects of an image for text generation.
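The read-attend-answer procedure above can be sketched as a single attention hop over sentence memories. This is a minimal sketch under assumptions: the sentences and the query are taken to be already embedded as fixed-length vectors, attention scores are plain dot products, and the context is combined with the query by addition. The LSTM encoder and decoder and the reconstruction of supporting sentences are omitted, and the names `attend` and `combine` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(memories, query):
    """Return attention weights over memory slots and the resulting context vector.

    memories: list of sentence vectors, one per memory slot
    query:    question vector with the same dimension as each memory
    """
    scores = [sum(m_k * q_k for m_k, q_k in zip(m, query)) for m in memories]
    weights = softmax(scores)
    dim = len(query)
    # Context vector: attention-weighted sum of the memory slots.
    context = [sum(w * m[k] for w, m in zip(weights, memories)) for k in range(dim)]
    return weights, context


def combine(context, query):
    """Augment the context vector with the query (element-wise sum, an assumption)."""
    return [c + q for c, q in zip(context, query)]


# Toy story with two sentence memories; the query is aligned with the first one.
memories = [[1.0, 0.0], [0.0, 1.0]]
query = [10.0, 0.0]
weights, context = attend(memories, query)
answer_features = combine(context, query)
```

In this toy example almost all of the attention mass falls on the first memory, so sentences that do not match the query contribute very little to the context vector.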
The second method combines self-attention and word-to-image attention, which are complementary for image captioning. Again, an image is encoded by a convolutional layer into a number of feature maps that serve as the input memory. Objects in an image are represented by memory slots. Self-attention attends to pairs of objects in an image, whose relations are helpful for generating natural language. Word-to-image attention attends to the object that corresponds to an individual word. A desirable caption not only reflects the individual objects at the lexical level but also characterizes the relations among objects at the syntactic level. This method further incorporates a residual scheme to allow a single attention mechanism in sequential learning at each word. The experiments on object recognition, question answering, and image captioning are evaluated on the noisy/shifted/cluttered MNIST dataset, the bAbI dataset, and the MS-COCO dataset, respectively. Object recognition is evaluated in the presence of different distortions. Question answering on the bAbI task is evaluated over 20 kinds of questions in different styles. Image captioning on the MS-COCO task is reported by BLEU score against the reference captions. We report a number of experiments to demonstrate the effectiveness of supporting attention and hybrid attention in an end-to-end memory network.
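For readers unfamiliar with the BLEU metric mentioned above, the following is a minimal sentence-level sketch: clipped n-gram precision combined with a brevity penalty. The function names are illustrative, no smoothing is applied, and this is not the evaluation script used for the reported MS-COCO results.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against one or more references."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    # For each n-gram, the largest count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform n-gram weights."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    c = len(candidate)
    # Reference length closest to the candidate length (ties go to the shorter one).
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)


references = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
degenerate = "the the the the the the the".split()
unigram_precision = modified_precision(degenerate, references, 1)
```

Clipping is what keeps the degenerate caption from scoring well: "the" appears at most twice in any one reference, so only two of its seven occurrences are credited.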
URI:	http://etd.lib.nctu.edu.tw/cdrfb3/record/nctu/#GT070550718 http://hdl.handle.net/11536/142481
Appears in collections:	Thesis