Title: Multiple Target Prediction for Deep Reinforcement Learning
Authors: Hung, Po-Yen (洪博彥); Chien, Jen-Tzung (簡仁宗)
Institute of Communications Engineering
Keywords: deep learning; reinforcement learning; deep Q network; multiple target prediction; natural language processing
Issue Date: 2017
Abstract:
Reinforcement learning (RL) is a key area of machine learning that deals with how software agents take actions in an environment so as to maximize cumulative reward. RL addresses a sequential decision problem through interaction with the environment, which can be modeled by a Markov decision process (MDP). Reinforcement learners, also called agents, can be classified into model-based and model-free agents. Model-based agents predict what might happen next by using the MDP they build during interaction, while model-free agents learn from rewards directly without building an MDP of the environment. Model-based algorithms imitate how humans learn. Such learning is straightforward for people but extremely hard for a machine, because the agent does not hold the "common sense" that lets human learners anticipate what is likely to happen next, so it cannot predict a varying environment as easily as humans do. Accordingly, model-free algorithms are more practical than model-based algorithms when a model of the environment is difficult to build. However, a traditional model-free algorithm such as Q-learning uses a table to record the values of the visited states, which is extremely inefficient: similar states should have similar values, but a table only updates the value of exactly the state encountered and ignores this relation. A regression model based on a deep neural network (DNN) can handle this complicated mapping. The deep Q network (DQN), a model-free algorithm that incorporates a DNN into Q-learning, has therefore become a popular basis for deep reinforcement learning (DRL). DQN further introduces two components, the replay memory and the target network, which considerably improve the performance of DRL. DQN successfully combines deep learning and reinforcement learning and has had a significant impact on the machine learning community.

Although recent research keeps improving the performance of DQN, the efficiency and convergence of its training procedure are still limited because each parameter update uses the target value of only a single action. In this thesis, we tackle this problem and propose multiple target prediction (MTP) for DRL, in which two new components, a prediction network and a pseudo replay memory, are introduced. The main idea is to predict a whole target vector, covering every action rather than only the action taken, and to update the neural network with all of these targets at once. The prediction network is trained continually during interaction using samples from the replay memory; its purpose is to predict the Q values of the next state and the reward from a given state without exactly knowing the next state. The pseudo replay memory stores each state together with the target vector that the prediction network produces over every possible action from that state. Importantly, our prediction is model-free: it does not need any domain knowledge about the environment. The prediction network thus handles prediction when the agent is blind to the model of the environment, and the pseudo replay memory allows the agent to update the network with respect to a whole target vector rather than a single target value. Basically, the proposed pseudo components provide a general solution for any DRL model with discrete actions and a replay memory, and the resulting MTP-DQN improves training efficiency.
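The core difference can be summarized as replacing the single-entry regression target of DQN with a full vector of targets, one per action, produced by a prediction network and cached in a pseudo replay memory. Below is a minimal numpy sketch of that distinction under simplifying assumptions (linear Q functions and a random stand-in for the prediction network); the names `prediction_net` and `pseudo_memory` and the exact update rules are illustrative, not the implementation described in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, N_ACTIONS, GAMMA, LR = 4, 3, 0.99, 0.1

# Linear Q-network and target network: Q(s) = W @ s gives one value per action.
W = rng.normal(scale=0.1, size=(N_ACTIONS, STATE_DIM))
W_target = W.copy()

def q_values(W, s):
    return W @ s

def dqn_update_single(W, W_target, s, a, r, s_next):
    """Standard DQN step: only the entry of the taken action gets a target."""
    y = r + GAMMA * np.max(q_values(W_target, s_next))
    td_error = y - q_values(W, s)[a]
    W[a] += LR * td_error * s          # gradient step on one output row only
    return W

def mtp_update_multiple(W, W_target, s, target_vector):
    """MTP-style step: regress the whole Q(s) vector onto a target vector."""
    td_errors = target_vector - q_values(W, s)
    W += LR * np.outer(td_errors, s)   # every output row is updated at once
    return W

# Hypothetical prediction network: from (s, a) it predicts the reward and the
# next state without querying the real environment (a random stand-in here).
def prediction_net(s, a):
    return rng.normal(), rng.normal(size=STATE_DIM)

# Build one pseudo-replay-memory entry: a full target vector over all actions.
s = rng.normal(size=STATE_DIM)
target_vector = np.empty(N_ACTIONS)
for a in range(N_ACTIONS):
    r_hat, s_next_hat = prediction_net(s, a)
    target_vector[a] = r_hat + GAMMA * np.max(q_values(W_target, s_next_hat))
pseudo_memory = [(s, target_vector)]

# One multiple-target update versus one single-target update.
W = mtp_update_multiple(W, W_target, *pseudo_memory[0])
W = dqn_update_single(W, W_target, s, a=0, r=1.0, s_next=rng.normal(size=STATE_DIM))
```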
In the experiments, MTP-DQN is evaluated on three tasks. First, the evaluation is performed on a grid-world toy task in which the ground-truth target Q values can be calculated directly (a sketch of such a calculation is given below). This set of experiments illustrates that updating multiple target values in the Q network trains more efficiently than updating a single target value. Second, the evaluation on different Atari 2600 games also shows the merit of MTP in the training convergence of DQN. The third task evaluates how MTP-DQN works for a natural language processing problem. This task is a text game, an environment in which the states and actions are described in natural language. We build a home-world environment with five rooms; the agent can do one thing in each room, and its objective is to act correctly according to the input sentences that describe the current room and the goal. These three tasks demonstrate the merit of MTP-DQN.
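For the grid-world task, "calculated directly" means solving the known toy MDP exactly, for example by value iteration, so the learned Q network has an exact reference to compare against. A minimal sketch under assumed dynamics (a 4x4 grid, a step cost of -1, one absorbing goal cell; these details are illustrative and not taken from the thesis):

```python
import numpy as np

SIZE, GAMMA = 4, 0.99
GOAL = (SIZE - 1, SIZE - 1)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, action):
    """Deterministic grid dynamics: move with clipping at the walls."""
    nxt = (min(max(state[0] + action[0], 0), SIZE - 1),
           min(max(state[1] + action[1], 0), SIZE - 1))
    reward = 0.0 if state == GOAL else -1.0   # goal cell is absorbing
    return nxt, reward, state == GOAL

# Ground-truth Q values by value iteration over the known dynamics.
Q = np.zeros((SIZE, SIZE, len(ACTIONS)))
for _ in range(200):                          # enough sweeps to converge
    for i in range(SIZE):
        for j in range(SIZE):
            for a, act in enumerate(ACTIONS):
                (ni, nj), reward, done = step((i, j), act)
                Q[i, j, a] = reward + (0.0 if done else GAMMA * Q[ni, nj].max())

print(Q[0, 0])  # exact targets a learned Q network can be compared against
```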
URI: http://etd.lib.nctu.edu.tw/cdrfb3/record/nctu/#GT070560205
http://hdl.handle.net/11536/142827
Appears in Collections: Thesis