- [Machine Comprehension by Text-to-Text Neural Question Generation](<https://arxiv.org/pdf/1705.02012.pdf>) by [Xingdi Yuan](https://arxiv.org/search/cs?searchtype=author&query=Yuan%2C+X), [Tong Wang](https://arxiv.org/search/cs?searchtype=author&query=Wang%2C+T), [Caglar Gulcehre](https://arxiv.org/search/cs?searchtype=author&query=Gulcehre%2C+C), [Alessandro Sordoni](https://arxiv.org/search/cs?searchtype=author&query=Sordoni%2C+A), [Philip Bachman](https://arxiv.org/search/cs?searchtype=author&query=Bachman%2C+P), [Sandeep Subramanian](https://arxiv.org/search/cs?searchtype=author&query=Subramanian%2C+S), [Saizheng Zhang](https://arxiv.org/search/cs?searchtype=author&query=Zhang%2C+S), [Adam Trischler](https://arxiv.org/search/cs?searchtype=author&query=Trischler%2C+A) in an ACL '17 workshop
- Proposes an RNN-based model for question generation (QG) from documents, conditioned on answers. It seems to me to be the first neural QG model (contrary to what has been claimed elsewhere).
- Works for **Extractive QA**
- **Motivation:** Answering the questions in most existing QA datasets is an extractive task – it requires selecting some span of text within the document – while question asking is comparatively abstractive – it requires generation of text that may not appear in the document.
- **Cool motivating idea:** generating training data for question answering (Serban et al., 2016; Yang et al., 2017)
- **Method:** Use seq2seq to generate questions, but in addition to maximum-likelihood training for predicting questions from (document, answer) tuples, they use policy-gradient optimization to maximize several auxiliary rewards. These include a language-model-based score for fluency and the performance of a pretrained question-answering model on the generated questions.
- **Potentially interesting related work:** Yang et al. (2017) developed generative domain-adaptive networks, which perform question generation as an auxiliary task in training a QA system. The main goal of their question generation is data augmentation, so the questions themselves are not evaluated.
- **Architecture innovation:** a seq2seq encoder-decoder for QG, with RL rewards derived from running a QA model on the generated questions.
- In **question generation**, they condition the encoder on two different sources of information (compared to the single source in NMT). They base their model on the attention mechanism of Bahdanau et al. (2015) and the pointer-softmax copying mechanism of Gulcehre et al. (2016) [this seems to be important].
- **Input to seq2seq:** two sequences - an encoding of the document/paragraph and an encoding of the answer
- **Document encoding:** augment each document word embedding with a *binary feature* that indicates whether the document word belongs to the answer.
- **Answer encoding:** encode the answer A using the annotation vectors at the answer word positions in the document. Basically, take the document encoder's hidden representations at the answer positions and concatenate them with the answer word embeddings.
- **Encoder:** they first run a bi-LSTM over the augmented document sequence, producing *annotation vectors* h^d = [h^f; h^b] (concatenation of the forward and backward hidden states). The same is done over the answer encoding to give h^a, which they call the *extractive condition encoding*.
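
Below is a minimal PyTorch-style sketch of the encoder described in the last three bullets (binary answer feature, document bi-LSTM, and answer condition encoding). All module names, shapes, and dimensions are my own assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

class QGEncoder(nn.Module):
    """Sketch: embeddings + binary answer feature -> document bi-LSTM
    (annotation vectors) -> answer bi-LSTM (extractive condition encoding)."""

    def __init__(self, vocab_size, emb_dim=300, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # +1 input dim for the binary "does this word belong to the answer?" feature
        self.doc_rnn = nn.LSTM(emb_dim + 1, hidden, bidirectional=True, batch_first=True)
        # answer encoder runs over [document annotation ; answer word embedding]
        self.ans_rnn = nn.LSTM(2 * hidden + emb_dim, hidden, bidirectional=True, batch_first=True)

    def forward(self, doc_ids, in_answer, ans_positions):
        # doc_ids: (B, Td) token ids, in_answer: (B, Td) 0/1 answer mask,
        # ans_positions: (B, Ta) indices of the answer tokens in the document
        doc_emb = self.embed(doc_ids)                                        # (B, Td, E)
        doc_in = torch.cat([doc_emb, in_answer.unsqueeze(-1).float()], dim=-1)
        h_d, _ = self.doc_rnn(doc_in)                                        # annotation vectors (B, Td, 2H)

        # gather annotation vectors at the answer positions and concatenate
        # them with the answer word embeddings
        idx = ans_positions.unsqueeze(-1).expand(-1, -1, h_d.size(-1))
        ans_annot = h_d.gather(1, idx)                                       # (B, Ta, 2H)
        ans_emb = self.embed(doc_ids.gather(1, ans_positions))               # (B, Ta, E)
        _, (ans_state, _) = self.ans_rnn(torch.cat([ans_annot, ans_emb], dim=-1))
        h_a = torch.cat([ans_state[0], ans_state[1]], dim=-1)                # extractive condition encoding (B, 2H)
        return h_d, h_a
```
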
- **Decoder:** a recurrent decoder whose initial state is derived from the encoder outputs.
- > When formulating questions based on documents, it is common to refer to phrases and entities that appear directly in the text. We therefore incorporate into our decoder a mechanism for **copying relevant words** from D. We use the pointer-softmax formulation (Gulcehre et al., 2016), which has two output layers: the *shortlist softmax* and the *location softmax*. The shortlist softmax induces a distribution over words in a predefined output vocabulary. The location softmax is a pointer network (Vinyals et al., 2015) that induces a distribution over document tokens to be copied. A source switching network enables the model to interpolate between these distributions.
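
A rough sketch of what such a pointer-softmax output layer could look like, under my own assumptions about shapes and scoring (not the authors' implementation): a scalar switch interpolates between a vocabulary distribution and a copy distribution over document positions.

```python
import torch
import torch.nn as nn

class PointerSoftmax(nn.Module):
    """Sketch: interpolate a shortlist (vocabulary) softmax with a location
    (copy) softmax via a source switch, in the spirit of Gulcehre et al. (2016)."""

    def __init__(self, hidden, vocab_size):
        super().__init__()
        self.shortlist = nn.Linear(hidden, vocab_size)   # vocabulary logits
        self.copy_score = nn.Linear(hidden, hidden)      # scores over document tokens
        self.switch = nn.Linear(hidden, 1)               # source switching network

    def forward(self, dec_state, doc_annotations):
        # dec_state: (B, H) decoder state, doc_annotations: (B, Td, H)
        p_vocab = torch.softmax(self.shortlist(dec_state), dim=-1)            # (B, V)
        scores = torch.bmm(doc_annotations,
                           self.copy_score(dec_state).unsqueeze(-1)).squeeze(-1)
        p_copy = torch.softmax(scores, dim=-1)                                # (B, Td)
        s = torch.sigmoid(self.switch(dec_state))                             # (B, 1)
        # generate from the vocabulary with prob. s, copy a document token with prob. 1 - s
        return s * p_vocab, (1 - s) * p_copy
```
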
- **Training:** in the decoder, the previous token y(t−1) is taken from the ground-truth question rather than from the model's own output (this is called *teacher forcing*).
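
A toy illustration of teacher forcing (my own sketch, not from the paper): at each step the decoder cell is fed the gold previous token instead of its own prediction.

```python
import torch
import torch.nn.functional as F

def teacher_forced_loss(embed, cell, out_proj, gold_question, init_state):
    """Maximum-likelihood training with teacher forcing.
    embed: nn.Embedding, cell: nn.GRUCell, out_proj: nn.Linear to the vocabulary,
    gold_question: (B, T) token ids, init_state: (B, H) initial decoder state."""
    state, loss = init_state, 0.0
    for t in range(1, gold_question.size(1)):
        state = cell(embed(gold_question[:, t - 1]), state)   # gold y(t-1) fed in
        logits = out_proj(state)
        loss = loss + F.cross_entropy(logits, gold_question[:, t])
    return loss / (gold_question.size(1) - 1)
```
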
- They encourage the model not to generate answer words in the question (different in our setting).
- They also encourage variety in the output words, to counteract the degeneracy towards common outputs that is often observed in NLG systems.
- **Policy Gradient Optimization:**
- They use teacher forcing to train the model to generate text by maximizing ground-truth likelihood. Teacher forcing introduces critical differences between the training phase (in which the model is driven by ground-truth sequences) and the testing phase (in which the model is driven by its own outputs) (Bahdanau et al., 2016). Significantly, teacher forcing prevents the model from making and learning from mistakes during training. This is related to the observation that maximizing ground-truth likelihood does not teach the model how to distribute probability mass among examples other than the ground truth, some of which may be valid questions and some of which may be completely incoherent. This is especially problematic in language, where there are often many ways to say the same thing. A reinforcement learning (RL) approach, by which a model is rewarded or penalized for its own actions, could mitigate these issues – though likely at the expense of reduced stability during training. A properly designed reward, maximized via RL, could provide the model with more information about how to distribute probability mass among sequences that do not occur in the training set (Norouzi et al., 2016).
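
In the standard policy-gradient (REINFORCE) formulation, which is presumably what is meant here, the generator is treated as a policy over question tokens and the expected reward is maximized. The notation below is mine, not copied from the paper (D = document, A = answer, ŷ = generated question, R = total reward, b = a baseline for variance reduction):

```latex
\mathcal{L}_{\mathrm{RL}}
  = -\,\mathbb{E}_{\hat{y} \sim p_\theta(\cdot \mid D, A)}\big[ R(\hat{y}) \big],
\qquad
\nabla_\theta \mathcal{L}_{\mathrm{RL}}
  \approx -\big( R(\hat{y}) - b \big)\, \nabla_\theta \log p_\theta(\hat{y} \mid D, A).
```
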
- The QA reward is the answer accuracy (F1 score) of a pretrained, black-box QA model (MPCM) when answering the generated question.
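
For reference, a token-level F1 in the style of the SQuAD metric could serve as such a reward. A minimal sketch (my own helper, not the authors' code):

```python
from collections import Counter

def token_f1(predicted_answer: str, gold_answer: str) -> float:
    """SQuAD-style token-level F1 between the QA model's answer to the
    generated question and the original gold answer."""
    pred, gold = predicted_answer.lower().split(), gold_answer.lower().split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```
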
- They also include reward terms that enforce fluent English and discourage cheating (e.g., copying answer words into the question).
- Instead of sampling from the model's output distribution, they use beam search to generate questions from the model and approximate the expectation. Empirically, they found that rewards could not be improved through training without this approach.
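
Putting the pieces together, a hedged sketch of one RL update under the structure the notes describe. `beam_search`, `qa_model`, and `lm_score` are hypothetical helpers standing in for the generator's beam search, the pretrained QA model, and the fluency scorer; `token_f1` is the helper sketched above.

```python
import torch

def policy_gradient_step(qg_model, qa_model, lm_score, beam_search, optimizer,
                         document, answer, lambda_qa=1.0, lambda_lm=0.25):
    """One RL update: generate candidate questions with beam search, score them
    with the QA model (F1) plus a fluency reward, take a REINFORCE-style step."""
    # beam_search returns candidate questions and their log-probabilities under qg_model
    questions, log_probs = beam_search(qg_model, document, answer)

    rewards = []
    for q in questions:
        f1 = token_f1(qa_model(document, q), answer)          # QA reward
        rewards.append(lambda_qa * f1 + lambda_lm * lm_score(q))
    rewards = torch.tensor(rewards)
    baseline = rewards.mean()                                 # simple variance reduction

    # maximize expected reward == minimize negative reward-weighted log-likelihood
    loss = -((rewards - baseline) * log_probs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
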
- Experiments on SQuAD.
![image](uploads/9f9375955138bc1924962ca3fe6a40a4/image.png)
## Some Resources on Architectures and Optimization Tricks
- **Pointer-Softmax Formulation:** [Pointing the Unknown Words](https://arxiv.org/abs/1603.08148) by [Caglar Gulcehre](https://arxiv.org/search/cs?searchtype=author&query=Gulcehre%2C+C), [Sungjin Ahn](https://arxiv.org/search/cs?searchtype=author&query=Ahn%2C+S), [Ramesh Nallapati](https://arxiv.org/search/cs?searchtype=author&query=Nallapati%2C+R), [Bowen Zhou](https://arxiv.org/search/cs?searchtype=author&query=Zhou%2C+B), [Yoshua Bengio](https://arxiv.org/search/cs?searchtype=author&query=Bengio%2C+Y) in ACL '16
![image](uploads/2f67aea5962c39ea5fd4a5a0ea117a26/image.png)