
Chatbot and Related Research Paper Notes with Images


Papers related to chatbot models in chronological order spanning about 5 years from 2014. Some papers are not about chatbots, but I included them because they are interesting, and they may provide insights into creating new and different conversation models.

For each paper I provide a link, the names of the authors, and GitHub implementations of the paper (noting the deep learning framework) if I happened to find any. Since I tried to make these notes as concise as possible, they in no way summarize the papers; they are merely a starting point for getting a sense of what a paper is about and for mentioning the main concepts with the help of pictures.

I also divided the papers into the 3 categories below, placing the category tag after the paper title.

  • [n-c] means this is a paper that is neither related to chatbots nor to other seq2seq tasks
  • [s2s] means that this paper is not specifically about chatbots but it is related to the seq2seq architecture or to other sequence-to-sequence NLP transduction tasks (like NMT)
  • [chat] means that this paper is concerned with some aspect of dialog modeling
Check my paper for an organized, in-depth research survey based on most of the papers listed here, up until 2017.08.
Contributions are welcome 😄

Papers

Jiwei Li
2017

Before starting the list of publications that I have read and made notes on, I want to highlight an amazing body of work that I came upon from Jiwei Li. His PhD thesis summarizes all of his most notable publications in the field of neural conversational agents, providing, in my opinion, a number of very interesting papers experimenting with diverse approaches to making open-domain dialog agents better. Almost all of the publications mentioned there will appear later on this page, as I have read and enjoyed them thoroughly. Furthermore, the GitHub link provided contains most of his works in Torch. List of publications discussed in the PhD thesis:

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio
2014.06
  • Generator and discriminator networks
  • Generator tries to mimic the data distribution as closely as possible, so that the discriminator can't decide which sample comes from the data and which from the generator
  • This is analogous to a 2 player minimax game
  • Both can be trained together with backpropagation
  • Alternate between k steps of optimizing D and one step of optimizing G
  • The minimax value function (written out below) is used to optimize D; only its second term is used when optimizing G
  • Optimal: D(x)=0.5 and p_g=p_data
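For reference, the GAN value function these notes refer to, roughly as given in the paper (x is real data, z the noise input to G):

```latex
\min_G \max_D V(D,G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)]
  + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]
```

Only the second expectation depends on G; the paper also notes that in practice G can be trained to maximize log D(G(z)) instead, to avoid vanishing gradients early in training.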
Kyunghyun Cho, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio
2014.06
  • Two RNNs for encoding and decoding of sequences, jointly trained
  • Equations in the paper use the gated recurrent unit (GRU)
  • They only looked at rescoring translation phrases, not generating
Ilya Sutskever, Oriol Vinyals, Quoc V. Le
2014.09
  • Encoder-decoder with LSTM (pretty big architecture); a toy sketch follows this list
  • Words are reversed in source sequence for better performance
  • Left to right beam search decoder
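A toy PyTorch sketch of the encoder-decoder setup (not the paper's deep, large-vocabulary model; sizes and the example batch are made up):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=64, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt):
        src = torch.flip(src, dims=[1])          # reverse the source, as in the paper
        _, state = self.encoder(self.src_emb(src))
        dec_out, _ = self.decoder(self.tgt_emb(tgt), state)  # condition on final encoder state
        return self.out(dec_out)                 # logits over the target vocabulary

model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)
src = torch.randint(0, 1000, (2, 7))             # toy batch of 2 source sequences
tgt = torch.randint(0, 1000, (2, 5))             # teacher-forced target inputs
logits = model(src, tgt)                         # shape (2, 5, 1000)
```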
Kyunghyun Cho, Dzmitry Bahdanau, Yoshua Bengio
2014.09
  • It's hard to compress all information into the context vector, especially for long sequences
  • To solve this, we use soft search to allow the decoder to peek at the relevant source words, and we don't encode into fixed length vector
  • Distinct context vector for each target word
  • Annotation for each source word with strong focus on the parts surrounding it, then the i-th context vector is the weighted sum of these annotations (the equations are written out after this list)
  • These weights computed by alignment model which scores how well the inputs around position j and the output at position i match
  • Alignment model as a feedforward network trained jointly with translation model
  • Encoder is bidirectional RNN, thus the hidden states represent words both before and after the source word
  • Target word probability computed with a multilayer network with a single maxout hidden layer
  • BLEU of 28.45, probably because it's a shallow network
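The soft attention described above, in equation form (my shorthand; h_j is the annotation of source word j, s_{i-1} the previous decoder state, a the feed-forward alignment model):

```latex
e_{ij} = a(s_{i-1}, h_j), \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}, \qquad
c_i = \sum_j \alpha_{ij} h_j
```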
Alex Graves, Greg Wayne, Ivo Danihelka
2014.10
  • RNNs with memory, differentiable end-to-end->trainable with gradient descent
  • Neural network controller interacts with memory bank (matrix) with read and write heads
  • NxM memory matrix, N addresses with an M-long vector at each address
  • Reads and writes are blurry, "focus" determines how specific the addressing is
  • Writing operation composed of erase and add vectors.
  • Weighting vector is also used for reading and writing, constructed by using controller outputs, memory matrix, and previous weighting vector; operations to get the new vector: content addressing -> interpolation -> convolutional shift -> sharpening
  • Content- and location-based addressing are implemented in the above flow to get the weighting, which allows the model to use only the previous weighting, to interpolate it with a content-based address, or to shift it to the next address. The shift can be blurry so sharpening is needed (a simplified addressing sketch follows this list)
  • Copy experiment trained on sequences up to length 20, able to generalize with minor errors up to length 50 (much better than LSTM)
  • Repeated copy can generalize as well to longer sequences and more copy steps
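A simplified NumPy sketch of the addressing flow (content addressing -> interpolation -> convolutional shift -> sharpening). In the real NTM the head parameters k, beta, g, s, gamma are emitted by the controller; here they are plain arguments, and the shift distribution s is a full-length circular kernel for brevity:

```python
import numpy as np

def address(k, beta, g, s, gamma, M, w_prev):
    # Content addressing: cosine similarity between key k and every memory row of M.
    sim = M @ k / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + 1e-8)
    wc = np.exp(beta * sim); wc /= wc.sum()
    # Interpolation with the previous weighting.
    wg = g * wc + (1 - g) * w_prev
    # Circular convolutional shift with shift distribution s.
    n = len(wg)
    ws = np.array([sum(wg[j] * s[(i - j) % n] for j in range(n)) for i in range(n)])
    # Sharpening to undo the blurring introduced by the shift.
    w = ws ** gamma
    return w / w.sum()

N, D = 8, 4
w = address(k=np.random.randn(D), beta=2.0, g=0.7, s=np.eye(N)[1], gamma=1.5,
            M=np.random.randn(N, D), w_prev=np.ones(N) / N)
```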
Lifeng Shang, Zhengdong Lu, Hang Li
2015.03
  • Encoder-decoder model applied to Twitter-style 2-turn conversations, with Bahdanau attention, GRU and beam search for decoding
  • Combines the Bahdanau attention model with the original global context vector representation
  • Evaluation done with human judgment
Oriol Vinyals, Quoc V. Le
2015.06
  • IT helpdesk dataset and movie subtitles; Big architectures and big vocabs
  • Input sequence is what has been conversed so far (context), output sequence is the reply
  • Objective function optimized is not the actual objective achieved through human communication
  • The problems mentioned are inconsistent answers (there is no personality) and not being able to evaluate correctly :(
Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, Bill Dolan
2015.06
  • Encode past information, which is then decoded to promote responses
  • Separate context from last message
  • They use IR to generate more responses to a (c,m,r) triple based on bag of words
  • They use a ton of features together with the neural network models to generate likely responses
Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, Phil Blunsom
2015.06
  • Neural stacks, queues and deques -> effective hierarchical structure for NLP transduction problems
  • In contrast to LSTM, these can generalize to much longer sequences than seen at training
  • Continuous push and pop operations, representing the degree of certainty of pushing or popping
  • RNN is controlling the stack
  • Read, pop, push and other vectors are concatenated as the input to the RNN
  • Shown to work really well for sequence copying, reversal and other transduction tasks
Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, Joelle Pineau
2015.07
  • Non-goal driven dialog systems, incorporating NL understanding, reasoning, decision making and generation
  • HRED (introduced by Sordoni, 2015a) -> an encoder RNN encodes the tokens appearing in an utterance, and a context RNN takes this vector as input and encodes the temporal structure of the utterances appearing so far in the dialogue with a GRU (good diagram in the paper). The decoder takes the output of the context RNN at the current timestep and generates the response with beam search (a minimal sketch follows this list)
  • Speech acts, pause and end of turns included as separate tokens
  • Bidirectional RNN to summarize the information in forward and backward chain of the tokens
  • Pretrained word embeddings on huge google corpus are used to capture more info, and pretraining of the HRED model is done on a Q&A dataset.
  • Training done on movie dialog triples, 10k vocab stripped of person names and numbers
  • For evaluation, perplexity and word error rate are used, although it is unclear how well suited they are
  • A lot of generic "i don't know" answers, because there are too many punctuation and pronoun tokens (maybe semantic structure should be separated from syntactic structure). Also, the usual metrics don't capture similar semantic content, thus they do not correlate with the objective
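A minimal PyTorch sketch of the HRED structure (toy sizes, unidirectional GRUs, no pretraining or beam search):

```python
import torch
import torch.nn as nn

class HRED(nn.Module):
    def __init__(self, vocab, emb=64, utt_hidden=128, ctx_hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.utt_enc = nn.GRU(emb, utt_hidden, batch_first=True)        # encodes one utterance
        self.ctx_rnn = nn.GRU(utt_hidden, ctx_hidden, batch_first=True)  # encodes the dialog so far
        self.decoder = nn.GRU(emb, ctx_hidden, batch_first=True)
        self.out = nn.Linear(ctx_hidden, vocab)

    def forward(self, dialog, reply):
        # dialog: (batch, n_turns, turn_len); reply: (batch, reply_len), teacher-forced
        b, n, t = dialog.shape
        _, u = self.utt_enc(self.emb(dialog.reshape(b * n, t)))
        _, ctx = self.ctx_rnn(u.squeeze(0).reshape(b, n, -1))
        dec_out, _ = self.decoder(self.emb(reply), ctx)   # decoder initialised with context state
        return self.out(dec_out)

model = HRED(vocab=1000)
dialog = torch.randint(0, 1000, (2, 3, 6))   # 2 dialogs, 3 turns, 6 tokens per turn
reply = torch.randint(0, 1000, (2, 5))
logits = model(dialog, reply)                # (2, 5, 1000)
```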
Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, Bill Dolan
2015.10
  • Maximum Mutual Information instead of likelihood of output used for objective function
  • Conventional neural models assign high probability to safe responses
  • Propose to capture the intuition that likelihood of message for a given response should be taken into account
  • To maximize this, an N-best list is generated with beam search, then the list is reranked using the second term log p(S|T)
  • Trained maximum likelihood models and used the MMI criterion above only during testing; a parameter that takes sequence length into account is also used
  • Another approach is to use log p(T) for MMI, but this can lead to ungrammatical outputs; the solution is to multiply the LM term by decaying weights, thus the first words are more diverse and then it gets closer to a LM
  • Multi-reference BLEU used (better for dialog evaluation), with references extracted with IR methods (the two MMI criteria are written out after this list)
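The two MMI decoding criteria, roughly as I remember them (λ and γ are tuned on a dev set, N_T is the response length):

```latex
\text{MMI-antiLM:}\quad \hat{T} = \arg\max_T \bigl[\log p(T \mid S) - \lambda \log p(T)\bigr]

\text{MMI-bidi (N-best reranking):}\quad \hat{T} = \arg\max_T \bigl[\log p(T \mid S) + \lambda \log p(S \mid T) + \gamma N_T\bigr]
```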
Kaisheng Yao, Geoffrey Zweig, Baolin Peng
2015.10
  • Encoder, intention, decoder RNN structure
  • Similar to HRED, but the output of decoder is also fed directly to the encoder RNN, and encoder RNN output is also directly fed to decoder network
  • Basically a HRED with bidirectional attention
Łukasz Kaiser, Ilya Sutskever
2015.11
  • Similar to NTM, but it's parallel and shallow
  • Using convolutional GRUs (architecture described in the paper)
  • Can do long binary addition and multiplication much better than stack RNN or LSTM with attention
  • Grid search to train 729 models, curriculum learning to go to longer inputs only if accuracy is good
  • Gradient noise, hard gate cutoff
  • Small dropout on recurrent connections helps generalization
  • 6 identical sets of non-shared parameters are used at different time steps, thus it can perform different operations at different time steps
  • The above is called relaxation; as the model converges the 6 sets are forced to unify
  • This relaxation has the potential to improve any RNN training
Iulian Vlad Serban, Ryan Lowe, Peter Henderson, Laurent Charlin, Joelle Pineau
2015.12
  • This is an article I didn't fully read!
  • Long paper full of useful corpora classified into categories, also discussion of metrics and of data pre-processing techniques!!!
  • Remove acronyms, slang and misspellings, and apply stemming and lemmatisation (depending on task); also tokenization (defining the smallest unit of input)
  • Speaker segmentation with small gold corpus, and then iteratively segmenting the rest
Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Çaglar Gulçehre, Bing Xiang
2016.02
  • Applying the Bahdanau model to summarization, but using features like POS and TF-IDF together with word embeddings in 1 big vector
  • Switching decoder/pointer (to source) architecture to handle OOV words, by copying them from document to summary
  • Two bi-directional encoders for word level and sentence level, they both have attention, word level attention is affected by sentence level attn.
  • Additional positional info is embedded in the sentence-level RNN
Jiatao Gu, Zhengdong Lu, Hang Li, Victor O.K. Li
2016.03
  • Seq2seq model incorporating a copying mechanism, with which it can directly copy parts of input sequences
  • Similar to the Bahdanau attention model with differences: prediction is based on two modes (generate, copy), where copy-mode picks words from the source
  • In addition to the vocab it uses all the words in the source sentence (even OOV) when using location-based copying.
  • Mixing probabilities of copy-mode and generate-mode (same as Bahdanau) with the same normalization term to make them compete through the softmax (the mixture is written out after this list)
  • Selective read from M attention matrix is used, which bears the location of the word in the source.
  • Both semantics and location of source word encoded into hidden states in M, for attentive and selective read
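The generate/copy mixture, roughly: both modes are scored and share one normalization term Z, so they compete inside a single softmax (ψ_g scores vocabulary words, ψ_c scores source positions whose word equals y_t):

```latex
p(y_t \mid \cdot) = \frac{1}{Z} e^{\psi_g(y_t)} + \frac{1}{Z} \sum_{j:\, x_j = y_t} e^{\psi_c(x_j)},
\qquad Z = \sum_{v \in V} e^{\psi_g(v)} + \sum_{j} e^{\psi_c(x_j)}
```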
Jiwei Li, Michel Galley, Chris Brockett, Georgios P. Spithourakis, Jianfeng Gao, Bill Dolan
2016.03
  • Capture background information and speaking style which is a persona
  • Incorporate speaker and addressee vectors into seq2seq
  • The speaker embedding vector is added to each input of the unrolled decoder LSTM together with the words of the response (a sketch follows this list)
  • Speaker embedding vector is learned through normal backprop together with other params (like word embeddings, but separate)
  • Speaker-addressee model: combine the user vectors -> the same speaker will react differently to different addressees
  • The diversity promoting objective function is used, namely an inverse seq2seq is trained without speaker info to get log(p(S|T))
  • They trained on OpenSubtitles and then adapted the model to Friends conversations (also trained another model on Twitter (c,m,r) triples)
  • There are still errors but pretty consistent and diverse answers
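A toy PyTorch sketch of injecting a learned speaker embedding into every decoder step (names, sizes and speaker ids are made up; the real model does this inside a full seq2seq):

```python
import torch
import torch.nn as nn

vocab, n_speakers, emb, hid = 1000, 50, 64, 128
word_emb = nn.Embedding(vocab, emb)
speaker_emb = nn.Embedding(n_speakers, emb)          # learned like any other parameter
decoder = nn.LSTM(emb * 2, hid, batch_first=True)

tokens = torch.randint(0, vocab, (2, 5))             # response tokens (teacher forcing)
speaker = torch.tensor([3, 7])                       # one speaker id per dialog
spk = speaker_emb(speaker).unsqueeze(1).expand(-1, tokens.size(1), -1)
dec_in = torch.cat([word_emb(tokens), spk], dim=-1)  # speaker vector at every time step
out, _ = decoder(dec_in)
```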
Chia-Wei Liu, Ryan Lowe, Iulian V. Serban, Michael Noseworthy, Laurent Charlin, Joelle Pineau
2016.03
  • BLEU not good where responses are diverse with no matching words, deltaBLEU is weak and needs human annotation for multiple reference replies
  • BLEU is based on n-grams, METEOR produces alignment between response and ground truth, ROUGE is based on longest common subsequence
  • Greedy matching is based on matching words with closest embedding vectors in response and truth, embedding average: sentence level embedding
  • They all correlate (with human judgment) poorly on the Twitter dataset and not at all on the Ubuntu dataset (toy versions of the embedding metrics are sketched after this list)
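Toy NumPy versions of two of the embedding metrics mentioned above (embedding average and greedy matching), with random vectors standing in for real word embeddings:

```python
import numpy as np

def embedding_average(resp, truth):
    a, b = resp.mean(axis=0), truth.mean(axis=0)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def greedy_matching(resp, truth):
    # For each word, take the cosine of its closest word in the other sentence; average both directions.
    def one_way(x, y):
        sims = (x @ y.T) / (np.linalg.norm(x, axis=1)[:, None] * np.linalg.norm(y, axis=1)[None])
        return sims.max(axis=1).mean()
    return (one_way(resp, truth) + one_way(truth, resp)) / 2

resp = np.random.randn(4, 50)    # 4 response words, 50-dim embeddings
truth = np.random.randn(6, 50)   # 6 ground-truth words
print(embedding_average(resp, truth), greedy_matching(resp, truth))
```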
Wang Ling, Edward Grefenstette, Karl Moritz Hermann, Tomas Kocisky, Andrew Senior, Fumin Wang, Phil Blunsom
2016.03
  • Attend over structured inputs to generate code implementations of card descriptions (2 types of inputs: text fields and singular fields)
  • Structured attention implemented with character embeddings, Bi-LSTM, linear and tanh projections, ending with softmax for probabilities
  • Probability over multiple predictors that can generate multiple segments of arbitrary length at time step t (ex: "copy name", "generate char")
  • Objective function is marginal log likelihood over a latent variable (representing a sequence of pairs of predictors and generated strings)
  • 3 types of predictors: char generation (softmax over chars), copy singular field (100% copy), copy text field (pointer network learns probability of copying)
  • Decoder with beam search takes best predictor and best string corresponding to the predictor at each time step to generate most likely code
  • Code compression -> replace commonly generated words (public, return) by tokens, to generate less characters
  • Only model to achieve non-zero accuracy, and better bleu scores than MT or seq2seq models
Xiang Li, Lili Mou, Rui Yan, Ming Zhang
2016.04
  • The computer side should also take the initiative and introduce new content when necessary, breaking stalemates that are detected with keywords like "…" or "Errr"
  • When a stalemate is detected backtrack conversation history to find named entities, then search for related entities in knowledge graph
  • The system is retrieval and ranking based
Tsung-Hsien Wen, David Vandyke, Nikola Mrkšic, Milica Gašic, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, Steve Young
2016.04
  • Performs well across several metrics when trained on only a few hundred dialogues
  • Seq2seq model with dialogue history (belief trackers), and current database search outcome + they use MMI with reward function and beam search
  • Inputs are mapped into two representations: a distributed representation by the intent network (which can be a CNN), and a probability distribution over slot-value pairs called the belief state. Then the most probable values from the belief state are taken to form a query to the DB, and the search result together with the intent and belief state are combined by the policy network to form a single vector representing the next system action
  • Belief tracker keeps track of the dialog state, using a smart weight tying strategy. It maintains a multinomial distr. over values for each informable slot and a binary distr. for each requestable slot
  • Each tracker is a recurrence from output to hidden layer RNN with a CNN feature extractor from user input and machine response
  • Summary belief vectors for each slot, and truth vector from DB (how much the entities match), and vector from intent network is used as input to policy network, to produce action vector
  • Generation LSTM uses the attentive action vector to generate tokens that are delexicalised with pointers to entities in the DB (3 informable, 7 requestable trackers)
Iulian V. Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, Yoshua Bengio
2016.05
  • Not enough variation in the models, only source of variation is through the conditional output distribution
  • HRED with latent variable (based on prior and posterior parametrization) at the decoder trained by maximizing variational lower-bound on the MLE
  • At test a sample latent variable is drawn from the prior for each sub-sequence, and concatenated with output of context RNN
  • At training this is drawn from the approximate posterior (parameterized by its own one-layer FFN), used to estimate the gradient of the variational lower-bound
  • Similar to Variational Recurrent Autoencoder, but the latent variable is conditioned on all previous sub-sequences (sentences)
Sam Wiseman, Alexander M. Rush
2016.06
  • The word-level training loss (difference from the target word) is not what matters at test time; locally-normalized scores and exposure bias are also problematic
  • Proposing a non-probabilistic score for entire sequence and loss function in terms of errors made during beam search
  • !!Scheduled sampling!! = At training seq2seq select the target word at first to be the gold, and later to have higher probability to be the predicted word
  • Beam search is used at training as well to construct sequences, beams are changed when there is a margin violation in the loss of the previous seq
  • Model pretrained with standard word level cross-entropy, the size of the beam is increased gradually during training and dropout is also used
Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, Pieter Abbeel
2016.06
  • GANs are used to learn salient representations in an unsupervised way (like angle/thickness of MNIST digit)
  • Decompose the z noise variable on which the generator is based into z (incompressible noise) and c (<- this targets the salient features)
    • c contains several latent variables/factors
  • Information-theoretic regularization in order to cope with trivial c's
    • There should be high mutual information (MI) between c and the generator distribution G(z,c)
    • If it's high that means P_generator(c|x) has small entropy
    • P(c|x) is approximated with a lower bound of mutual information
  • The approximator and discriminator share parameters, and there is one final fully-connected layer to output the Q(c|x) distribution
  • It is shown that in a regular GAN the lower bound MI is 0, however by training to maximize it goes to maximal MI
  • Three latent variables are used, a categorical one for digit classifying, and two continuous ones for digit rotation and thickness
    • By varying each latent variable it is shown that it learns meaningful representations
Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, Dan Jurafsky
2016.06
  • Two virtual agents explore the space of possible actions while learning to maximize expected reward with policy gradient methods (the basic update is written out after this list)
  • An action is a dialog utterance taken according to the policy, a state is the previous two dialog turns, and a policy is an enc-dec LSTM
  • Reward types:
    • ease of answering: how unlikely it is that the response to the utterance will be a dull one, based on MLE-based seq2seq probabilities
    • information flow: penalize semantic similarity between consecutive turns from same agent
    • semantic coherence: ensure the mutual information between action and previous turns
  • Curriculum learning is used such that first couple of tokens generated based on MLE, then switch to RL, and gradually reduce impact of MLE
  • Longer and more diverse simulated dialogues
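The underlying update is essentially REINFORCE with a baseline (a is a sampled response, s the dialog state, R the combined reward, b a baseline that reduces variance):

```latex
\nabla_\theta J(\theta) \approx \nabla_\theta \log p_\theta(a \mid s)\,\bigl(R(a, s) - b\bigr)
```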
Kaisheng Yao, Baolin Peng, Geoffrey Zweig, Kam-Fai Wong
2016.06
  • HRED with attention
  • Incorporating IDF into the objective function (with log-likelihood), and reinforcement learning is used based on this to compute gradients
  • Training data is computer helpdesk stuff, model performs pretty well
Chen Xing, Wei Wu, Yu Wu, Jie Liu, Yalou Huang, Ming Zhou, Wei-Ying Ma
2016.06
  • Represent people's prior knowledge about the topic, and embed this into reply of seq2seq model with attention
  • Two encoders with separate attention modules, one is bidirectional RNN, other is for topic words, then their attention is jointly fed into decoder
  • The two encoders can affect each other's attention; topic attn finds relevant info, content attn determines the content focus
  • Topic word list obtained from twitter LDA model, they play the role of classification and association in response generation (better first words chosen)
Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, Wei Xu
2016.06
  • Deep lstm enc-dec model with linear connections and an interleaved bi-directional architecture to stack the LSTM layers
  • There is a feed-forward network from the input nodes, fed into the current hidden state and the next layer together with previous hidden state
  • Alternate the RNN direction at different layers, two completely different encoders with different starting directions
  • Dropout is used, attention is computed over the vectors generated by the two encoders, and feed-forward connections are used at the decoder as well
Iulian Vlad Serban, Tim Klinger, Gerald Tesauro, Kartik Talamadupula, Bowen Zhou, Yoshua Bengio, Aaron Courville
2016.06
  • Model multiple parallel sequences by factorizing the joint probability over the sequences
  • Hierarchical abstraction, information flows from high level sequences to low level ones
  • One sequence with the words, and another with coarse tokens (nouns for example)
  • Both sub-models are HREDs, but the coarse predictor encoder encodes all previously generated tokens to a vector which is concatenated with the context RNN
  • Conditioned on the coarse sequence of higher level tokens the natural language sub-model generates a dialog utterance
  • 2 types of coarse representations: noun and activity-entity (extracting verbs and entities, only used for the Ubuntu corpus)
John M. Pierre, Mark Butler, Jacob Portnoff, Luis Aguilar
2016.07
  • More previous conversational turns -> better models
  • Deixis, anaphora, logical consequence for measuring the relevance of the response to previous utterances
Kun Xiong, Anqi Cui, Zefeng Zhang, Ming Li
2016.07
  • CNN and RNN encoder fed into RNN decoder; CNN: learns a topic distribution from sentence matrices, generating a topic vector
  • Context-in model: CNN vector is directly fed to decoder
  • Context-IO model: CNN vector fed to both hidden and output layer of decoder
  • Context-Attention model: attention computed from context at each decoder input
  • Trained on QA pairs with categories, and on twitter style chat
  • Shorter sentences have lower perplexity, but overall results look good
Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton
2016.07
  • Batch norm normalizes the summed inputs at each neuron, leads to faster convergence, and serves as a regularizer as well, but it's hard to apply to RNN
  • Layer normalization: computes statistics over all hidden units in the same layer, so every hidden unit is normalized with the same mean and variance terms (a minimal sketch follows)
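A minimal NumPy sketch of layer normalization as described above (statistics are computed per example over the hidden units of one layer; gain and bias are the learned parameters):

```python
import numpy as np

def layer_norm(h, gain, bias, eps=1e-5):
    mu = h.mean(axis=-1, keepdims=True)      # mean over the hidden units
    sigma = h.std(axis=-1, keepdims=True)    # std over the hidden units
    return gain * (h - mu) / (sigma + eps) + bias

h = np.random.randn(2, 8)                    # batch of 2 examples, 8 hidden units
print(layer_norm(h, gain=np.ones(8), bias=np.zeros(8)))
```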
Lili Mou, Yiping Song, Rui Yan, Ge Li, Lu Zhang, Zhi Jin
2016.07
  • Based on pointwise mutual info, compute a key-word noun, generating reply from this word going backwards and then forwards with 2 different RNNs
  • When computing pointwise mutual information (PMI), frequent words are penalized
  • In the backwards part words are reversed, and the forward RNN depends on the generated backward part
Baskaran Sankaran, Haitao Mi, Yaser Al-Onaizan, Abe Ittycheriah
2016.08
  • Memorize alignments temporally from previous timesteps to modulate the attention in subsequent timesteps (somewhat similar to memory networks)
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, Jeffrey Dean
2016.09
  • 8 layer deep enc-dec with residual connections and attention from the bottom layer of decoder to top layer of encoder (which is bidirectional)
  • Low precision arithmetic (quantization at inference using int operations) for faster training+model and data parallelism
  • Deep LSTM improve performance but only if used with residual connections, input is added to the output of a layer and forms the input to the next layer
  • Wordpiece model cuts up words with a greedy algorithm, thus it has very few OOV (with 8-32k wordpieces), but it's faster than using only characters
  • Maximum likelihood is used together with expected reward RL objective function
  • Length normalization is needed so that beam search doesn't favor shorter results
  • Coverage penalty to favor results that fully cover the source sentence according to attention
  • RL refinement of the trained models barely improves human-judged translation quality (the decoding score with length normalization and coverage penalty is written out below)
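The beam scoring with length normalization and coverage penalty, roughly as given in the paper (α and β are tuned; p_{i,j} is the attention of the j-th target word on the i-th source word):

```latex
s(Y, X) = \frac{\log P(Y \mid X)}{lp(Y)} + cp(X; Y), \qquad
lp(Y) = \frac{(5 + |Y|)^{\alpha}}{(5 + 1)^{\alpha}}, \qquad
cp(X; Y) = \beta \sum_{i} \log\Bigl(\min\bigl(\textstyle\sum_{j} p_{i,j},\, 1.0\bigr)\Bigr)
```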
David Ha, Andrew Dai, Quoc V. Le
2016.09
  • Smaller network to generate the weights for a larger network; both of them trained together with gradient descent
  • Inputs are embedding vectors that describe the entire weights of a given layer (this can also be learned during training)
  • They generate non-shared weights for the LSTM, meaning that weights can change between timesteps, which works better than a standard LSTM
  • Static hypernetwork for CNN: for each layer the input is a layer embedding; the hypernetwork is a 2-layer linear network to project the embedding to a weight matrix
  • Thus it has to learn the projection weights and biases and the embeddings, which are fewer than the original CNN parameters
  • Dynamic hypernetwork for RNN: hypernetwork is an RNN, produces relaxed weight sharing (middle ground between hard and no weight sharing)
  • A linear network is also used in hyperRNN to project embeddings (the network entails similar theory as layer norm)
  • They applied it to a resnet, drastically reducing parameters with relaxed weight sharing
  • They compared hyperLSTM with layer norm LSTM together with recurrent dropout (similar results), and also applied layer norm to the hyper LSTM (best)
Łukasz Kaiser, Samy Bengio
2016.10
  • Active memory can make parallel computations on the whole memory (like neural GPU), doesn't just focus on local stuff like attention
  • Memory operations with convolutions, and with CGRUs
  • After n-th CGRU there are the decoder attention CGRUs, which accumulate outputs and allow access to all outputs produced in steps before t.
Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, Koray Kavukcuoglu
2016.10
  • ByteNet is a one-dimensional convolutional enc-dec (with dilation and residual blocks with ReLUs) for character-level language modelling
  • Stack decoder on top of the representation of the encoder preserving the temporal resolution, instead of passing a context vector
  • Dynamic unfolding: process different-length sentences, with an estimated target length which is usually bigger than the actual target
  • ByteNet is good because it runs in linear time and preserves source sequence resolution
Yiping Song, Rui Yan, Xiang Li, Dongyan Zhao, Ming Zhang
2016.10
  • Retrieve candidate found by IR, feed into generative model (biseq2seq) along with query, then generated reply is post-reranked with the retrieval model
  • Used crowd-sourcing to find out how relevant a query and a reply are instead of negative sampling approach
  • Achieves better results than either sub-model; the 2 sub-components are chosen about equal number of times during post-ranking
Barret Zoph, Quoc V. Le
2016.11
  • Structure of a NN can be specified as a string representing the various parameters, thus a controller RNN could generate such strings
  • The generated network can be trained, and its accuracy used as reward to compute the policy gradient to update the controller
  • RNN generates a CNN, layer by layer and parameter by parameter, then the CNN is trained until convergence and its accuracy is used for the REINFORCE algorithm, a policy gradient method… CNNs produced achieve state of the art on CIFAR-10
  • Add anchor points and set-selection attention to the RNN to propose skip connections (what previous layers to use as input to the current layer)
  • Produce a recurrent cell: as a tree of steps that take x_t and h_t-1 as inputs to produce h_t as output; the nodes can be labeled by functions and methods
  • The awesome recurrent cell produced is implemented in tensorflow as NASCell
Jiwei Li, Alexander H. Miller, Sumit Chopra, Marc’Aurelio Ranzato, Jason Weston
2016.11
  • Teacher gives feedback through RL to learner (bot trained with SL) in the context of QA
  • Both reward-based numerical feedback (only received 50% of the time when it's doing well) and forward prediction methods using textual feedback
  • Memory network: input is last utterance and memories (dialog context and KB); Memories are compared with query vector to select relevant ones
  • RL: policy is MemN2N model, state is dialog history, action space is set of answers
  • Reward based imitation (RBI): choose own model with 1-e probability, otherwise a random answer
  • REINFORCE: maximize expected cumulative reward of the episode
  • Forward prediction (FP): query and memory mapped to a vector representation, and that together with an attention hop over all possible answers is combined to predict teacher feedback; in the online setting the learner needs to update the model using the teacher's textual feedback
  • RBI and FP work better with random exploration; all methods work a little worse than SL on bAbI tasks
Prajit Ramachandran, Peter J. Liu, Quoc V. Le
2016.11
  • Two language models are trained to initialize weights of an enc-dec model on source and target corpus
  • Only 1 LSTM layer, the decoder softmax and embeddings are pretrained, then the model is initialized with these plus one more randomly init. LSTM layer
  • Additional losses added from pretraining objective to regularize the model to avoid overfitting on the small dataset
  • Residual connections from output of pretrained LSTM directly to softmax
  • Attention over the top and first layer; attention vector is passed to 2nd layer at each time step
  • Model gives much better results than baseline on low resource datasets
  • Pretraining only the encoder matters more for summarization, and pretraining only the decoder matters more for MT tasks
Nabiha Asghar, Pascal Poupart, Xin Jiang, Hang Li
2016.12
  • Offline supervised learning of seq2seq model, followed by online active learning
  • Train sequentially on Cornell then on chatlogs, then comes online AL with real users, and learn incrementally from their feedback at each dialog turn
  • Model generates K responses using hamming-diverse beam search -> the user selects the best one or suggests another response, which is then backpropagated using the XENT loss and one-shot (really high learning rate) learning, to immediately change the weights significantly
  • Diverse beam search penalizes similar beams
  • Trained to mimic different moods from user training (only needs 100 interactions to train)
Chongyang Tao, Lili Mou, Dongyan Zhao, Rui Yan
2017.01
  • Unsupervised, thus easy to use; referenced metric comparing embedding similarity of ground truth and generated reply combined with unreferenced metric that uses a neural network scorer to measure the relatedness between generated reply and its query
  • Cosine distance between ground truth and reply using max and min word embeddings
  • Query and reply vector computed with BiGRU, and a score assigned to them by a NN which is trained with negative sampling, by showing it bad responses
  • The 2 scores are combined in different ways: choosing the maximum does not work, but choosing the minimum or averaging the scores gives near-human correlation
Jiwei Li, Will Monroe, Tianlin Shi, Sebastien Jean, Alan Ritter, Dan Jurafsky
2017.01
  • Generator seq2seq model, and discriminator labels dialogs as human or machine generated
  • Quality of machine generated utterances measured by its ability to fool the discriminator
  • Output of discriminator used as reward to the generator using REINFORCE algorithm
  • Discriminative model is a binary classifier: input is dialog encoded into a vector using a HRED
  • Discriminator updated together with generator, using human generated dialog as positive example, and machine-generated dialog as negative example
  • Improve model with reward for every generation step: in order to distinguish word level rewards
  • Monte Carlo search: A partially decoded seq. is finished (sampled) 5 times and fed to discriminator->average score used as reward for the partially dec. seq.
  • Some fraction of responses generated are human so that the generator doesn't get lost, and gets positive rewards sometimes to go the right way
  • Remove short training examples, weighted learning rate based on tf-idf, penalizing word types that have already been generated
  • Adversarial evaluation labels dialogs as machine or human generated, model should achieve 50% accuracy if human and machine dialogs are the same
  • Adversarial success is the fraction of instances in which a model fools the evaluator, the difference between 1 and evaluator accuracy
  • Achieve higher adversarial success than MMI seq2seq models (MC better than vanilla reinforce)
Chen Xing, Wei Wu, Yu Wu, Ming Zhou, Yalou Huang, Wei-Ying Ma
2017.01
  • BiGRU encodes words, and calculates attention over them, then the utterance vector is used as input to utterance level encoder (backward GRU, because more recent utterance is more important), and utterance level attention is calculated over the utterances to form the context vector
  • Word level attention depends on both the hidden states of the decoder and hidden states of utterance level encoder
Mihail Eric, Christopher D. Manning
2017.01
  • Seq2seq with attention and soft copy, only copy from the source entities of the knowledge base
  • Inputs augmented with entity type features, append one-hot class vectors to word embeddings
  • Really simple network outperforming more complex architectures
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean
2017.01
  • Large models trained on huge datasets cost a lot computationally. Conditional computation increases model capacity with less costs: parts of network are active or inactive on a per-example basis; from thousands of networks chooses only a handful, where gating vector is not zero
  • MoE consists of many experts, each a feed-forward NN (same architecture), and a trainable gating network that selects a sparse combination of the experts
  • Apply MoE convolutionally between stacked LSTM layers, different experts become highly specialized based on syntax and semantics
  • Gating is based on softmax, but with added tunable gaussian noise and only selecting top k values
  • The problem is that the batch size b shrinks to k*b/n if k experts are chosen out of n. By distributing the model to separate devices with separate batch updates but keeping the expert parameters shared, we can factorize the size of the batch while updating the model synchronously
  • Apply MoE to all time steps of a previous LSTM layer convolutionally -> bigger batch size
  • Additional "importance" loss added to loss function of the model so that experts are equal -> coefficient of variation of the sum of batchwise gate values
  • Trained 2 LSTM layers with MoE between them; also tried hierarchical MoE, where each expert is a MoE as well
  • With same computational budget it achieves lower perplexity than simple LSTM models on language modelling
  • Also on WMT it achieves new state of the art with billions of parameters but similar training time as GNMT
Louis Shao, Stephan Gouws, Denny Britz, Anna Goldie, Brian Strope, Ray Kurzweil
2017.01
  • Target-side attention into decoder so it can keep track of what has been output so far, to generate longer coherent responses
  • This is memory-intensive, trade-off is the glimpse model, which interpolates between source-side-only attention on the encoder and source and target-side attention on the encoder and decoder, done with fixed-length glimpses from target side, and source + part of target seq. before the glimpse on encoder
  • Rerank beams segment by segment, injecting diversity early, and integrate sampling into beam-search making it stochastic
  • The model produces longer responses that are also more coherent, but for shorter responses they choose to fall back to the baseline without length norm.
Marjan Ghazvininejad, Chris Brockett, Ming-Wei Chang, Bill Dolan, Jianfeng Gao, Wen-tau Yih, Michel Galley
2017.02
  • Condition responses based on conversation history and external facts (amazon, wikipedia) relevant to current context
  • NER is used for example to make a query to retrieve facts; these are fed into a fact encoder -> this summed with conversation encoder are fed into decoder
  • Fact encoder is similar to memory network, retrieves and weights facts based on user input and conversation history
  • Multitask learning: first task is conversational, pure enc-dec model trained; second task exposes the full model to facts as well; third task is similar to autoencoder, it uses facts for both encoders
  • Twitter dataset with mentions of local business, augmented with facts (foursquare tips): many contextually relevant facts -> filter them with tf-idf, retain 10 tips
  • They use beam search and N-best lists reranking based on MMI
  • The results are somewhat more diverse than baseline seq2seq
Kirthevasan Kandasamy, Yoram Bachrach, Ryota Tomioka, Daniel Tarlow, David Carter
2017.02
  • Reward only after agent has reached terminal state, aim is to find a policy (with gradient methods) that does well with respect to the data distribution
  • Value function to get the expected reward if we follow a stochastic policy
  • Action-value function: for the expected reward of taking an action at a state following a specific policy
  • Input encoder (state), output decoder (action), and reward triples from customer service conversations, where the reward is a quality score of the conversation
  • Use the convex combination of re-weighted future rewards for estimating the action-value function
  • Estimate the value function based on an LSTM parameterization (hidden state of bottom layer in enc-dec) of state representation, but constant estimation of value function gives almost the same results
  • 2 layer enc-dec, with batch RL; RL only changes top LSTM layer of decoder and softmax
  • Europarl dataset, bootstrap with more unlabeled data with MLE objective, then train on smaller labeled data with RL (works if RL and MLE have some overlap)
Jason D. Williams, Kavosh Asadi, Geoffrey Zweig
2017.02
  • RNN; domain-specific software and action templates (text or API call) and a conventional entity extraction module
  • Utterance is featurized in 1.bag of words, 2.embedding, 3.entity extraction and these are passed to RNN, output is action template
  • Best results on bAbI dialog tasks 5 and 6, and other toy examples…
Zhou Yu, Alan W Black, Alexander I. Rudnicky
2017.03
  • The utterance is fed into a non-task response generator and a language understanding module, which encodes it for the task response generator; then a response selection policy (using RL) chooses among all of the candidates from the 2 generators, and the response is fed back into the system
  • Language understanding module: based on simple key-word matching because user responses are usually yes / no
  • Task response generator: 8 pre-defined templates about movie promotion considering the info from language module
  • Non-task response generator: 3 methods used (no RNN), keyword retrieval, skip-thought vector, statistical templates based conversation strategies
  • Q-learning used to optimize towards long-term coherence, consistency, variety and continuity
  • Constraints based on conversational data and expert rules applied to reduce number of states
  • Reward function based on 4 weighted metrics: turn-level appropriateness, conversation depth, information gain, conversation length
Tiancheng Zhao, Ran Zhao, Maxine Eskenazi
2017.03
  • Conversation representation with 3 random variables: dialog context c, response utterance x, and latent variable z, which captures a latent distr. over valid responses
  • Generative process: sample a latent variable z from the prior network -> generate x through response decoder
  • Training done with stochastic gradient variational Bayes that maximizes the variational lower bound of the conditional log likelihood p(x|z,c) (written out after this list)
  • Utterance encoder is BiRNN, context enc. and response dec. is 1-layer GRU; samples of z obtained by the recognition (training) or the prior network (testing)
  • Easier to train CVAE with explicitly extracted discourse features y (dialog acts ex.) -> this is the knowledge-guided CVAE, x relying on c,z,y; and y relies on c and z.
  • Tackling vanishing latent variable problem with bag-of-word loss (decoder has to generate a bag of words representation as well through an MLP)
  • Better than a VHRED baseline, more diverse responses; latent variable is correlated with dialog acts and response length
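The variational lower bound being maximized (basic CVAE form, before the knowledge-guided extension and the bag-of-words loss):

```latex
\log p(x \mid c) \;\ge\; \mathbb{E}_{q(z \mid x, c)}\bigl[\log p(x \mid z, c)\bigr] - \mathrm{KL}\bigl(q(z \mid x, c)\,\|\,p(z \mid c)\bigr)
```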
Hao Zhou, Minlie Huang, Tianyang Zhang, Xiaoyan Zhu, Bing Liu
2017.04
  • Seq2seq framework with emotion category embedding, internal implicit emotion memory, and external explicit memory
  • P(Y|X,e), where e is one of 6 emotion categories, we embed this and feed into decoder
  • Internal memory: capture emotion dynamics, each emotion is decaying during decoding, because it is read and written (by the GRU) at each step to the memory
  • External memory: the model can choose between words from a generic or an emotion vocab (separate softmaxes)
  • Regularization: emotion state in internal memory should decay to zero at the end of decoding; there is another term for constraining the external memory
  • Emotion category annotation obtained with bi-lstm emotion classifier (62.3% acc.)
  • ECM model obtains better perplexity (without external memory) and emotional accuracy and better human rating than base seq2seq
Pierre Lison, Serge Bibauw
2017.04
  • Associate each context and response pair with a numerical weight that reflects the quality, then these weights are included in the loss function of a neural model
  • Weights computed via a neural model learned from dialog data, with positive (high quality) and negative examples (quality meaning a coherent and interesting response)
  • Weighting model has 2 sub-networks for context and response tokens, then it produces an embedding together with other quality features, and then a score
  • Tf-idf and dual encoder models are investigated with the new loss (retrieval models), dual encoder with weighting loss produced best results on recall
  • Open subtitles dataset used, lemmatised and pos-tagged, and names replaced by NER with tokens
Satoshi Akasaki, Nobuhiro Kaji
2017.05
  • Decide whether a dialog act is chat or non-chat (task) in order to better integrate chat generators like seq2seq into intelligent assistants
  • They constructed a dataset from yahoo voice with 15k utterances labeled as chat or non-chat (many sentences can be both)
  • Two binary classifiers used
    • SVM using character and word n-gram features and skip-gram word embeddings
    • CNN with word embeddings pre-training
    • Character-based tweet and query GRU enhances these 2 classifiers by training on twitter and yahoo search queries (concatenated as a vector)
  • SVM+embed+tweet+queryGRU performs the best, 87.5% F1 score.
Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, Yann N. Dauphin
2017.05
  • Convolutions can be well parallelized, and conv layers create hierarchical representations
  • Embed inputs into a matrix, and concatenate them with a position embedding to maintain order; proceed similarly with outputs
  • Convolutional block structure: each layer contains 1D conv and a non-linearity; 6 blocks with kernel width 5 mean that the input field consists of 25 elements
  • Output is twice the size of the input, but gated linear units are used to reduce it back to the same size
  • Residual connections from the input of each convolution to the output of the block; also pad the input at each layer in encoder network
  • Multi-step attention: for each decoder layer combine the current decoder state with an embedding of the previous target element, and then compute the dot product between this and each output of the last encoder block
  • Conditional input to current decoder layer is an attention weighted sum of the encoder outputs and input element embeddings; this is added to the output of corresponding decoder layer to get final predictions; this considers which words we previously attended to and can be seen as attention with multiple hops
  • Normalization by scaling conditional inputs by the number of vectors, and scale gradient for the encoder layers by the number of attention mechanisms, and apply dropout to embedding, decoder outputs and to the input of the convolutional blocks
  • Datasets are WMT translation -> better BLEU results than GNMT
  • Grid search over kernel width and encoder/decoder layer depth shows that a narrow kernel and a deep network is the best
Xiaoyu Shen, Hui Su, Yanran Li, Wenjie Li, Shuzi Niu, Yang Zhao, Akiko Aizawa, Guoping Long
2017.05
  • Generate the next response based on the dialog context (modeled separately for both speakers), a stochastic latent variable and an external label
  • The model is HRED with separated context models: encoder RNN for tokens, and 2 status RNN for each speaker utterance
  • Variational auto-encoders used conditioned on context concatenation provided by SPHRED and an additional class label (ex: generic or non-generic response)
  • The class label can be unknown in which case a classifier is implemented to first predict it from the context vector
  • VAE produces the latent variable for the HRED and the posterior distribution of latent variable approximated based on context and class label
  • Dataset used is ubuntu dialog corpus, gradually more and more focus on the latent variable as the training goes on; results are similar to VHRED
Sai Rajeswar, Sandeep Subramanian, Francis Dutil, Christopher Pal, Aaron Courville
2017.05
  • With GANs sequence-level training objective can be incorporated together with curriculum learning on the length of the sequence using a generator (G) and discriminator (D) network
  • Regular GAN objective function is hard to train (unstable/vanishing gradients), which Wasserstein GANs (WGAN) alleviate -> better at assuring that the (D) objective won't only exploit the difference between the sparsity of 1-hot vectors and continuous output predictions
  • (G) is provided with a noise matrix at each time step, transforming it into a sequence of probability distributions over the vocab
  • (G) and (D) model variants:
    • (G) LSTM with peephole connection between output and previous hidden state; (D) LSTM uses binary logistic regression on last hidden state
    • Same 1-D convolutional residual blocks for both (G) and (D)
  • For evaluation of GANs, the likelihood of the sample under the true data distribution is used. Datasets are toy CFG, PCFG, and Chinese poetry and Penn treebank
  • Conditional generation is also explored with a question and positive/negative sentiment attributes added as feature vectors to each conv layer (no LSTM)
Alexander H. Miller, Will Feng, Adam Fisch, Jiasen Lu, Dhruv Batra, Antoine Bordes, Devi Parikh, Jason Weston
2017.05
  • Software platform that provides a unified framework for training and testing dialog models with over 20 tasks (datasets) supported and example models
  • World, agent and teacher classes in python to handle the training of a dialog model in some environment
  • 5 Task categories: QA, Sentence Completion, Goal-Oriented Dialog, Chit-Chat, Visual Dialog
  • Models: HRED, IR, Memory NN, seq2seq
  • Seamless integration with Mechanical Turk for data collection, training, and evaluation
Serhii Havrylov, Ivan Titov
2017.05
  • There is a sampled image and some distracting images -> sender agent has to formulate NL message such that it helps the receiver to choose the sampled image
  • Sender (LSTM) only sees the sampled image, while the receiver (LSTM) sees all images and the message which is a sequence of symbols (strings)
  • Learning by using straight-through Gumbel-softmax estimators which is more efficient than RL methods
  • Sender's inputs are features extracted by a CNN from the target image; it has to sample one token at a time from the categorical distribution
  • Sampling makes the sender agent non-differentiable, so RL (e.g. the REINFORCE algorithm) would normally be needed; instead, a categorical distribution with a continuous relaxation obtained from the Gumbel-softmax distribution is used; as time goes on, samples from the distribution become more one-hot encoded
  • In the forward pass the relaxation is discretized so that it resembles NL, and in the backward pass we use the gradient of continuous relaxation
  • Images from MS COCO dataset, output of the relu7 layer of VGG used; REINFORCE achieves 87% success, while GS-ST achieves 97%
  • Inspecting message symbols shows that a hierarchical language emerged by describing categories, but forcing a language model with KL halves the success rate
Łukasz Kaiser, Aidan N. Gomez, François Chollet
2017.06
  • SliceNet inspired by Xception network based on depthwise separable convolution layers with residual connections applied to MT tasks
  • Depthwise conv is a spatial conv performed independently over every input channel followed by a pointwise conv projecting to a new channel space
  • DSCNN uses much fewer parameters than regular CNN, super-SC uses even fewer parameters by splitting the input into groups along the depth then apply separable conv to each group separately, and then concatenate the results along the depth
  • With DSCNN we can use larger filter windows, thus we don't have to use filter dilation
  • Autoregressive decoder produces new output prediction given encoded input and encoding of all existing predicted outputs (not just previous!)
  • Both encoders and decoder use convolutional modules composed of stacking conv steps with residual connection; one conv-step consists of a ReLU->SepConv->layer norm.
  • Attending is performed by adding a timing signal to the targets (encoding positional info) then doing 2 conv-steps and then attending to the source by computing feature vector similarities between source and target
  • Beats GNMT in WMT english to german by 0.1 BLEU
Tiancheng Zhao, Allen Lu, Kyusong Lee, Maxine Eskenazi
2017.06
  • Framework: 1. entity indexing, 2. slot-value independent enc-dec, 3. utterance lexicalization by replacing special tokens with NL
  • NER is used to detect entities and convert them to indexes, then enc-dec predicts next utterance using KB query
  • Each utterance encoded by a CNN, then enc LSTM reads the utterances and dec LSTM generates output based on attention over enc LSTM states as well
  • Task-oriented dialog dataset augmented by inserting utterance-response pairs from chit-chat style dataset; system first answers the chit-chat style question and then repeats its previous task-oriented question
  • System tested on bus schedule dataset, achieves around 70-80% success rate for finding a good bus schedule between locations
Chaitanya K. Joshi, Fei Mi, Boi Faltings
2017.06
  • Make a restaurant reservation that is personalized to the user's attributes/preferences with memory networks
  • Simulated dialogs made on restaurant reservation task with API calls, but with added personalization attribute values before first dialog turn
  • Augment bAbI tasks with personalization of the bot's language style based on user's gender and age, adding 6 patterns of the same dialog for different styles
  • Other personalization is based on vegetarian / non-vegetarian, adding that to restaurant types KB
  • Rule-based (which should perform 100%), supervised embedding and memory network models investigated in retrieval style dialog
  • Supervised embeddings were very bad, while memory networks were almost 100% for the first two tasks, but only 60% for KB tasks
  • Results on the first 5 original bAbI tasks are also reported
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
2017.06
  • Encoder makes a continuous representation on symbols, then autoregressive decoder takes these and generates one output at a time, consuming previously generated outputs as inputs for the next one
  • Transformer is based on this using stacked self-attention and position-wise, fully connected layers for both enc and dec
  • Encoder:
    • 6 identical layers, each made up of two sub-layers
    • First is a multi-head self-attention mechanism and second is a feed-forward network
    • Residual connection around each sub-layer followed by layer norm.
  • Decoder:
    • 6 identical layers, with same sub-layers as encoder plus a third one
    • Third sub-layer performs multi-head attention over the output of the encoder stack
    • Self-attention is masked compared to the encoder to prevent positions from attending to subsequent positions
  • Scaled (to counteract large vector dimensions) dot-product attention is used over the set of queries, keys, and values (a toy version is sketched after this list)
  • Multi-head attention is used by applying scaled dot-product attention to different linear mappings of queries, keys, and values; the outputs from the attention layers are concatenated and once again projected
  • Transformer attention:
    • In enc-dec attention layers (middle) the queries come from previous decoder layer and the memory keys and values from the output of the encoder
    • In encoder self-attention layers keys, values and queries all come from the output of the previous layer in the enc
    • Decoder self-attention layers allow each position in the dec to attend to all positions in the dec up to and including that position
  • Position wise feed-forward layers are similar to two convolutions with kernel size 1, parameters are shared between positions but are different between layers
  • Positional encodings:
    • Added to the input and output embeddings at the bottom of encoder and decoder stacks
    • Sine and cosine functions of different frequencies based on position
  • Trained on WMT English-to-German and English-to-French, using a word-piece vocab; outperforms previous state-of-the-art ensemble models
  • Dropout is applied to the output of each sub-layer, to the sums of the embeddings and the positional encodings, and to the attention weights
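A toy NumPy version of scaled dot-product attention (single head, no masking; shapes are made up):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # scale to counteract large dimensions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return weights @ V

Q = np.random.randn(5, 64)   # 5 query positions, d_k = 64
K = np.random.randn(7, 64)   # 7 key/value positions
V = np.random.randn(7, 64)
print(scaled_dot_product_attention(Q, K, V).shape)     # (5, 64)
```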
Łukasz Kaiser, Aidan N. Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, Jakob Uszkoreit
2017.06
  • MultiModel trained simultaneously on WMT, ImageNet, WSJ speech, and parsing corpus and COCO image captioning dataset
  • Modality nets convert images, speech, text and categorical data into joint representation space which is variable size
  • 3 basic types of blocks for encoder and decoder:
    • Convolutional blocks: 4 x (ReLU over inputs -> depthwise separable CNN -> layer norm.) with residual connections and dropout at the end of the block
    • Attention blocks: multi-head dot product mechanism with source and target inputs
      • Target (composed with a timing signal from sine and cosine curves and mixed using 2 conv blocks) is self-attended
      • Source passed through 2 pointwise convolutions to generate memory keys and values
      • Finally the query keys, memory keys, and values are used to apply attention between self-attended target and source
    • Mixture-of-Experts blocks: feed-forward networks (experts) and a trainable gating network selecting a sparse combination of experts
  • Encoder encodes inputs to encoded inputs, which together with previously computed outputs are passed to I/O mixer which computes encoded outputs, which together with encoded inputs are passed to autoregressive (left-padded) decoder to generate outputs
  • Modality nets:
    • Different tasks from same domain share modality nets; a special token embedding is learnt for differentiating between tasks
    • Language mod net: tokenized using same vocab of 8k sub-word units
    • Image mod net: number of residual convolutional steps applied
    • Category mod net: output modality by applying conv steps to get the 1D category
    • Audio mod net: 1D waveform or 2D spectrogram transformed with 8 residual convolution blocks
  • MultiModel achieves 10-20% lower performance than state-of-the-art on WMT and ImageNet
  • Accuracy increases slightly for all tasks when trained jointly on 8 tasks compared to training separately on each task
  • Excluding any of the 3 types of blocks reduces performance on all tasks
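A toy sketch of the sparse gating idea behind the mixture-of-experts blocks: a trainable gate scores the experts and only a few of them are evaluated and combined per input. This is a simplified illustration, not the paper's noisy top-k implementation.

```python
import numpy as np

def moe_forward(x, experts, gate_W, k=2):
    """x: (d,) input, experts: list of callables, gate_W: (d, n_experts) gate weights."""
    logits = x @ gate_W
    topk = np.argsort(logits)[-k:]                    # keep only the k highest-scoring experts
    gates = np.exp(logits[topk] - logits[topk].max())
    gates /= gates.sum()                              # softmax over the selected experts only
    return sum(g * experts[i](x) for g, i in zip(gates, topk))

d, n_experts = 8, 4
experts = [lambda x, W=np.random.randn(d, d): x @ W for _ in range(n_experts)]
y = moe_forward(np.random.randn(d), experts, np.random.randn(d, n_experts))
```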
Satwik Kottur, José M.F. Moura, Stefan Lee, Dhruv Batra
2017.06
  • Task & talk game: between Q-bot and A-bot in a world with 64 unique objects
  • A-bot sees an object that Q-bot doesn't and Q-bot has to discover two attributes of this object through dialog, then Q-bot guesses the attributes and both bots receive a reward on how good the guess was (RL)
  • Model Q-bot and A-bot as operating under stochastic policies which are LSTM-based models
  • Q-bot has a listener LSTM encoder, a speaker fully connected layer and a prediction LSTM network
  • Dialog is done through speaker networks and listener LSTMs of Q-bot and A-bot, the final prediction LSTM is based on the previous state and the task encoding
  • REINFORCE algorithm is used; the reward expectation is estimated by sample averages over the environment and dialogs (see the sketch after this entry)
  • Agents usually invent a language to solve the game near perfectly, but this language is not compositional, interpretable or natural
    • Overcomplete vocabularies: A-bot learns to convey each attribute with separate symbol, generalizes very badly
    • Attribute-value vocabulary: limiting the vocab leads to better generalization but still doesn't yield compositionality
    • Memoryless A-bot: resetting the state of A-bot at each dialog round and further reducing its vocab leads to a consistent and compositional language
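A minimal sketch of the REINFORCE estimator referenced above: the gradient of the expected reward is approximated by sample averages over rollouts. A toy categorical policy parameterized directly by logits stands in for the agents' LSTM policies.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_gradient(theta, reward_fn, n_samples=100):
    """Estimate d E[reward] / d theta by averaging reward-weighted score functions."""
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        probs = softmax(theta)
        a = np.random.choice(len(theta), p=probs)   # sample a symbol/action from the policy
        dlogp = -probs
        dlogp[a] += 1.0                             # d log pi(a) / d theta for a softmax policy
        grad += reward_fn(a) * dlogp
    return grad / n_samples

theta = np.zeros(4)
grad = reinforce_gradient(theta, reward_fn=lambda a: 1.0 if a == 2 else 0.0)
```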
Mike Lewis, Denis Yarats, Yann N. Dauphin, Devi Parikh, Dhruv Batra
2017.06
  • Dataset constructed (6k dialogs) with an MTurk task, where humans are given items and negotiate who gets what
    • The items' values are not the same for the 2 participants
    • Same task for agents, where they get a reward if they reach an agreement
  • First a seq2seq is trained to generate the dialog given the input items and values (goal)
    • There is an input encoder RNN, a dialog generator RNN
    • After the dialog is generated an output RNN predicts the output agreement (who gets what), based on input goal and dialog
  • After pretraining with SL, self-play is used, but one agent is fixed since training both led to divergence from human language
    • During RL, the dialog generator acts both as an encoder for the other agent's utterance and as response generator
  • Rollouts are used as a better decoding tactic (sketched after this entry)
    • Agents roll out several candidate utterances until the end of the dialog, and select the utterance that gets the highest expected reward
  • After each RL update an SL update is made
  • Evaluation with humans shows that the simple SL model learns to agree more often, but doesn't reach an optimal solution as often as the RL model using rollouts
    • RL+ROLLOUTS negotiates harder, resulting in more turns
  • Evaluation with an SL agent is much better than with humans, meaning that the RL agent overfitted to the SL agent scenario
  • They take inspiration from AlphaGo and propose scaling tree search to dialog modeling as future work
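A hedged sketch of the rollout decoding tactic: each candidate utterance is scored by simulating the rest of the negotiation several times and averaging the resulting reward. The candidate list, rollout simulator, and reward function are passed in as stand-ins for the paper's models.

```python
import random

def choose_utterance(state, candidates, simulate_to_end, reward, n_rollouts=5):
    """Return the candidate whose simulated continuations give the highest average reward."""
    best, best_score = None, float("-inf")
    for utterance in candidates:
        score = sum(reward(simulate_to_end(state, utterance))
                    for _ in range(n_rollouts)) / n_rollouts
        if score > best_score:
            best, best_score = utterance, score
    return best

# toy usage with dummy stand-ins for the dialog and reward models
best = choose_utterance(
    state=None,
    candidates=["I take the books", "You can have the hats"],
    simulate_to_end=lambda s, u: random.random(),   # hypothetical self-play rollout
    reward=lambda agreement: agreement,             # hypothetical agreement score
)
```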
Zi Yin, Keng-hao Chang, Ruofei Zhang
2017.07
  • Apply seq2seq model to rewrite user question into one that a recommendation system understands and use seq2seq model to score and pick best candidates
  • Bidirectional encoder LSTM, and attention at the top layer of decoder LSTM (we attend to the top layer of the encoder)
  • Entropy to measure the confidence of an agent in whether it should recommend an item based on the previous dialog
  • They propose a greedy uncertainty-reduction algorithm to maximize expected information gain at each step based on mutual information and a set of questions that the chatbot can ask the user
  • By estimating the posterior distribution the model can rank returned items by the IR in order of relevance to the query
  • The chatbot either asks a sampled question or makes a recommendation, based on the confidence given by the entropy value (see the sketch after this entry)
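A toy sketch of the entropy-based decision described above, assuming a posterior distribution over candidate items is available: recommend when the entropy (uncertainty) is low, otherwise ask another question. The threshold is illustrative.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def act(posterior_over_items, threshold=1.0):
    if entropy(posterior_over_items) < threshold:           # confident enough to recommend
        return ("recommend", int(np.argmax(posterior_over_items)))
    return ("ask_question", None)                            # pick the max-info-gain question

print(act([0.9, 0.05, 0.05]))   # low entropy  -> ('recommend', 0)
print(act([0.4, 0.3, 0.3]))     # high entropy -> ('ask_question', None)
```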
Grishma Jena, Mansi Vashisht, Abheek Basu, Lyle Ungar, Joao Sedoc
2017.08
  • The bot consists of two seq2seq models to handle Star Trek-style input (trained on a Star Trek dataset) and everyday conversations (trained on the Cornell dataset)
  • When confidence in the response is low, rule-based outputs are used
  • A binary classifier trained on a Twitter dataset chooses which seq2seq model to use
  • A word-graph algorithm is used to insert Star Trek-specific words into responses; since this can produce ungrammatical sentences, a bigram LM is used to select between candidate sentences
  • It's compared to the Pandora rule-based bot and achieves better coherence and Star Trek-style scores
Sajal Choudhary, Prerna Srivastava, Lyle Ungar, Joao Sedoc
2017.08
  • Domain-specific seq2seq followed by a re-ranker to predict the most likely response and domain combination (which is fed back into domain classifier)
  • Utterance is fed into domain classifier as well as into multiple separately trained domain-specific seq2seq (with attention)
  • Domain classifier is composed of an SVM with logistic regression or an RNN with one-hot input vector representing subsequent domains in the conversation
  • Reddit dataset with 3 domain categories, and another model trained on twitter dataset for out of domain queries
  • Logistic regression over previous domain categories coupled with SVM performed the best, beating a simple seq2seq model
Kartik Goyal, Graham Neubig, Chris Dyer, Taylor Berg-Kirkpatrick
2017.08
  • Models trained with the max-likelihood objective don't take beam-search decoding into account, so they can yield better performance with greedy decoding
  • A Hamming loss is evaluated on the output of beam search, but to make it continuous the beam search decoding procedure is approximated
  • The approximation is achieved by relaxing the objective function with a parameter to become more and more like actual loss function based on beam decoding
  • Decoding is made soft by approximating the argmax with a temperature-controlled softmax (sketched after this entry)
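A small sketch of the relaxation described above: a temperature-controlled softmax that behaves like a differentiable argmax, approaching a one-hot choice as the temperature goes to zero.

```python
import numpy as np

def soft_argmax(scores, temperature):
    z = np.asarray(scores) / temperature
    e = np.exp(z - z.max())
    return e / e.sum()               # peaked distribution instead of a hard, discrete choice

scores = [1.0, 2.0, 0.5]
print(soft_argmax(scores, temperature=1.0))    # soft, differentiable weighting
print(soft_argmax(scores, temperature=0.01))   # nearly one-hot, i.e. close to argmax
```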
Ryan Lowe, Michael Noseworthy, Iulian V. Serban, Nicolas Angelard-Gontier, Yoshua Bengio, Joelle Pineau
2017.08
  • Train a hierarchical RNN with a dataset of dialogues and corresponding human scores
  • Several models used to generate varied responses for this dataset
  • Encoder that learns vector representations of context, model response, reference response, then it computes dot product with learned matrices
  • Model is trained to minimize the squared error between prediction and human score
  • Model is pre-trained as a dialogue model (VHRED), sub-words and layer normalization used in the encoder
  • ADEM (name of the model) correlates somewhat better with human judgment both at the response and system levels
  • It can also generalize to new models, even if it was trained on only retrieval-based models it can test a generative model
Anuroop Sriram, Heewoo Jun, Sanjeev Satheesh, Adam Coates
2017.08
  • Previously a similar model, deep fusion, combined seq2seq and LM hidden states to form the output, but the two were trained separately; thus the seq2seq decoder also had to learn a language model specific to the training data, which makes it hard to transfer to other tasks
  • In cold fusion the seq2seq model is trained together with a fixed pre-trained LM
  • A gating mechanism (neural network) chooses how to combine the LM logits and the seq2seq states to get the final prediction (see the sketch after this entry)
  • They experimented on speech recognition task, and the LM was an RNN
  • Trained on a search query database it achieved a 12% word error rate; applied to movie subtitles it achieved a 28% word error rate, better than the basic seq2seq
  • Further fine-tuning/training on 10% of the movie subtitles gets the word error rate close to a basic seq2seq trained on 100% of movie subtitles dataset
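A rough NumPy sketch of the cold-fusion gating step, assuming a fine-grained gate computed from the decoder state and a projection of the frozen LM's logits; the dimensions, nonlinearities, and single-step setting are illustrative rather than the paper's exact layers.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cold_fusion_step(dec_state, lm_logits, W_lm, W_gate, b_gate, W_out):
    h_lm = np.tanh(W_lm @ lm_logits)                           # project the fixed LM's logits
    g = sigmoid(W_gate @ np.concatenate([dec_state, h_lm]) + b_gate)
    fused = np.concatenate([dec_state, g * h_lm])              # gate controls how much LM to use
    logits = W_out @ fused
    e = np.exp(logits - logits.max())
    return e / e.sum()                                         # distribution over output tokens

d, v = 16, 100                                                 # toy decoder size and vocab size
probs = cold_fusion_step(np.random.randn(d), np.random.randn(v),
                         np.random.randn(d, v), np.random.randn(d, 2 * d),
                         np.zeros(d), np.random.randn(v, 2 * d))
```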
Tao Lei, Yu Zhang
2017.09
  • Simple Recurrent Unit (SRU) that is 10x faster than an LSTM (same speed as a CNN)
  • They use skip-connections for computing the final output of an RNN, and dropout on the inputs
  • To make it less recurrent: drop the connection between the previous state and the neural gates at the current step (the previous state is still used to compute the current state, but only through element-wise operations; see the sketch after this entry)
  • Validated on question answering, language modeling and machine translation achieving similar accuracy as LSTM, but trained much faster
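A minimal NumPy sketch of one SRU step: the gates depend only on the current input (no matrix multiplication with the previous state), and the previous state enters only through cheap element-wise operations, plus a highway/skip connection to the input.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sru_step(x_t, c_prev, W, W_f, b_f, W_r, b_r):
    x_tilde = W @ x_t
    f = sigmoid(W_f @ x_t + b_f)                  # forget gate, computed from the input only
    r = sigmoid(W_r @ x_t + b_r)                  # reset gate, computed from the input only
    c = f * c_prev + (1.0 - f) * x_tilde          # element-wise recurrence
    h = r * np.tanh(c) + (1.0 - r) * x_t          # skip connection to the raw input
    return h, c

d = 8
W, W_f, W_r = (np.random.randn(d, d) for _ in range(3))
h, c = sru_step(np.random.randn(d), np.zeros(d), W, W_f, np.zeros(d), W_r, np.zeros(d))
```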
Iulian V. Serban, Chinnadhurai Sankar, Mathieu Germain, Saizheng Zhang, Zhouhan Lin, Sandeep Subramanian, Taesup Kim, Michael Pieper, Sarath Chandar, Nan Rosemary Ke, Sai Rajeshwar, Alexandre de Brebisson, Jose M. R. Sotelo, Dendi Suhubdy, Vincent Michalski, Alexandre Nguyen, Joelle Pineau, Yoshua Bengio
2017.09
  • An ensemble model that uses several machine learning models (trained separately on datasets, then jointly with reinforcement learning with user interactions) and few hand-crafted rules
  • Takes the dialogue history as input; each model outputs a response, from which a priority response is chosen if one exists, otherwise all model responses are scored to select the best
  • Template-based models:
    • Alicebot (string-matching)
    • Elizabot (engaging using more questions)
    • Initiatorbot (asks hand-written starting questions)
    • Storybot (triggered by the user, returns a story; non-conversational)
  • Knowledge-base based QA:
    • Evibot: forwards the question to Amazon's Evi QA service
    • BoWMovies: handles questions in the movie domain by recognizing entities and tags via string matching or word embeddings
  • Retrieval-based neural networks:
    • VHRED: candidate responses retrieved based on cosine similarity, then the likelihood of each one is computed by VHREDs (trained on separate datasets)
    • SkipThought vector models: handles trigger phrases (with keywords); ensures that the bot follows the Alexa prize rules (bot shouldn't state its opinion)
    • Dual encoder models: encoding dialog history and candidate responses and selecting the best
    • BoW: retrieve response with highest cosine similarity from several Reddit and twitter topics
  • Retrieval-based logistic regression:
    • BoWEscapePlan: returns from a set of 35 generic responses based on a logistic classifier
  • Search Engine-based neural networks:
    • LSTM classifier: chooses responses from a set of search engine results (trained as a binary classifier to choose relevant search snippets)
  • Generation-based neural networks:
    • GRUQuestionGenerator: generates question conditioned on dialog history (start of question is template-based)
  • Model selection policy:
    • Sequential decision-making problem to satisfy long-term dialog with reinforcement learning (reward for each response)
    • Action-value function estimates expected return for a candidate response
    • Stochastic policy is a discrete distribution over actions based on a scoring function
    • A lot of input features to the scoring network: word embeddings, similarity metrics, PoS, dialog acts, bigram, generic response, etc…
  • Scoring model architectures:
    • Scoring model (policy) is a simple feed-forward neural network
    • Supervised pre-training with AMT labelers labeling dialogs produced by the chatbot
  • Supervised learned reward: Predict the Alexa user score with a linear regression model for a dialog history and response, based on hand-selected features
  • Learn the policy with off-policy REINFORCE: reward shaping, by giving 0 reward when a negative user response is detected, and RL reward otherwise
    • Also combine this RL with the learned reward model for automatic rewards
  • Off-policy reinforce has higher variance and lower bias, and supervised learned reward is the opposite
    • New method: trade-off between variance and bias with Q-learning an abstract discourse MDP (second figure below)
    • At each step, there is a hierarchical structure, with a discrete random variable at top, based on the sets of dialog act, user sentiment and genericness
    • Given this sample, the MDP samples a dialog history from a set, then the agent chooses an action according to its policy, after which there's a reward
    • Finally, a variable representing the AMT score is sampled, and a new discrete state is sampled according to the current one, and the action
  • Off-policy REINFORCE, Q-learning, and Supervised AMT offer the best Alexa user scores
    • Q-learning selects responses from much riskier models (bowfact, Reddit) than Supervised AMT does (Alicebot, Elizabot)
    • Off-policy REINFORCE can hold the longest dialogs, and offers the best user score for long dialogs
    • Based on final Alexa user scores only Q-learning achieves a higher score than the base Evibot + Alicebot heuristic
  • Q-learning has the highest topical coherence and topic specificity
Igor Shalyminov, Arash Eshghi, Oliver Lemon
2017.09
  • Augmenting bAbI task 1 with incremental dialog phenomena (hesitations, restarts and corrections)
  • Training MemN2N on bAbI and testing it on the augmented bAbI gives very bad performance
    • Training and testing on augmented bAbI gives better performance, especially with more data
    • Training on augmented bAbI and testing on normal bAbI gives 99% accuracy
  • The Dynamic Syntax and Type Theory with Records (DS-TTR) framework is used by the authors (rule-based)
    • To build word-by-word semantic representations
  • Gives 100% semantic accuracy on both bAbI and augmented bAbI
Wenya Zhu, Kaixiang Mo, Yu Zhang, Zhangbin Zhu, Xuezheng Peng, Qiang Yang
2017.09
  • KB for music domain conversations, containing triplets of {subject, predicate, object}
  • GenDS:
    • Candidate retriever detects entities and retrieves a set of facts from KB
    • Message encoder encodes input message (transforms entities to their general types)
    • Reply decoder decodes this together with the retrieved facts
    • Knowledge gate is used to determine whether to generate common or knowledge words at each time step
  • Dynamic knowledge enquirer:
    • Generates knowledge words based on 3 scores (computed by MLPs)
      • Message matching score
      • Entity update score
      • Entity type update score
    • They depend on last generated words
  • GenDS achieves significantly better entity accuracy than baseline seq2seq
Ben Krause, Marco Damonte, Mihai Dobre, Daniel Duma, Joachim Fainberg, Federico Fancellu, Emmanuel Kahembwe, Jianpeng Cheng, Bonnie Webber
2017.09
  • Data is collected as self-dialogues written by AMT workers
  • Edina converses on 3 topics (movies, music, sports), combining rule-based and machine learning methods
    • Rule-based component with templates (backs off to matching score) (16%)
    • Matching score retrieves an answer (with confidence score) (46%)
      • Using the current utterance, and candidate contexts and responses from AMT dialogs
      • Based on bag-of-words with IDF
    • Neural network, is the last option if the other two fail (20%)
      • Pretrained on opensubtitles and finetuned on AMT self-dialogs (they let it overfit a bit)
    • Proactive component which steers the conversation with questions related to entities (16%)
  • Preprocessing with NER and user-modeling (user preferences caught by rules)
  • Confidence rating is important, for example to know when not to select matching-score outputs
Bing Liu, Ian Lane
2017.09
  • Jointly optimize a dialog agent policy and the user simulator policy used to train it
  • Bootstrapping both agents with supervised learning on task-oriented corpora
    • Then further training them with a collaborative task-oriented goal
    • User simulator is given a goal to complete
    • Dialog agent attempts to estimate this goal and fulfill requests
    • Both receive a reward on the level of task completion
  • Dialog agent
    • Bi-directional LSTM to encode the utterance, previous agent output, and retrieved KB result encoding
    • Dialog acts serve as system actions; an action is sampled with an MLP from the LSTM state
    • Belief tracker maintains and updates a probability distribution over candidate values for each goal slot
    • Dialog agent has KB component and can issue API calls
      • API call with slot-type tokens can be replaced by corresponding values from belief tracker
    • Template-based NLG module to convert system action, slot values and KB entities to NL response
  • User simulator
    • State maintained in an LSTM, takes as input a sampled goal encoding, the previous user output, and current agent input
    • Informable (price range) and requestable slots (address)
  • RL policy gradient optimization
    • States are the LSTM user and agent states
    • Action space is finite and discrete for both the dialog agent and user simulator
      • Actions are not words themselves, but rather higher level
    • Turn-level reward based on the progress that the agent and user made in completing the task in that turn
    • Softmax policy is applied during training, and during evaluation only for the user to generate more diverse utterances
  • Dataset is DSTC2 with added API calls and corresponding KB results
  • Training iteratively the agent and the user simulator
  • RL training improves the task success rate significantly compared to supervised learning
Tom Young, Erik Cambria, Iti Chaturvedi, Minlie Huang, Hao Zhou, Subham Biswas
2017.09
  • Integrate commonsense knowledge into retrieval-based models
  • Dual LSTM used to encode context and response
    • In classical retrieval, compatibility is computed between the created vector representations with a learned weight matrix
  • Commonsense:
    • Made up of assertions, that contain a triplet <c1, r, c2>, where r is a relation between two concepts
    • Concepts are retrieved as n-grams from the message, and all corresponding assertions are searched
    • An LSTM is used to encode all the retrieved assertions
    • Match score between each encoded assertion and response is computed with a learned weight matrix
      • The score of the highest-scoring assertion is added to the original compatibility function
  • Comparison with memory networks and a baseline using comparison based on supervised word embeddings instead of LSTM representations
  • Dataset is 2M twitter status response pairs
    • 1M positive responses (ground truths)
    • For each status a negative response is sampled as a random different response from the training set
  • The Recall@k metric is somewhat better for the Tri-LSTM with commonsense than for the Dual-LSTM without it
Ben Krause, Emmanuel Kahembwe, Iain Murray, Steve Renals
2017.09
  • Adapting the parameters of a model during the generation of a sequence at test time
    • To better capture the slightly different probability distribution
  • A long sequence is divided into a sequence of shorter sequences
    • After each short sequence segment, a backpropagation step is carried out
    • The next sequence element is evaluated with the new parameters
    • This can be applied to sequence generation as well
    • Previous adaptation updates decay exponentially over time
  • Adaptation of hidden units is not direct but rather we adapt a matrix that is multiplied with the hidden units to get the new hidden units -> fewer parameters to adapt (achieves a little bit lower performance)
  • Achieves better perplexities for word and character-level language modeling than state-of-the-art models
  • The longer the sequence the lower perplexity the model achieves
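A hedged sketch of the dynamic evaluation loop described above: the sequence is processed in short segments, a gradient step is taken after each segment, and the adapted parameters decay back toward the originals. The model's loss/gradient computation is passed in as a stand-in, and the paper's update rule is simplified here to plain SGD.

```python
import numpy as np

def dynamic_eval(theta0, segments, loss_and_grad, lr=0.1, decay=0.02):
    theta = theta0.copy()
    total = 0.0
    for seg in segments:
        loss, grad = loss_and_grad(theta, seg)       # score this segment with current params
        total += loss
        theta = theta - lr * grad                    # adapt to the local distribution
        theta = theta + decay * (theta0 - theta)     # earlier adaptation decays over time
    return total / len(segments), theta

# toy usage: a scalar "model" tracking slowly drifting targets
segments = [np.array([1.0]), np.array([1.2]), np.array([1.5])]
f = lambda th, seg: (float(((th - seg) ** 2).sum()), 2 * (th - seg))
avg_loss, adapted = dynamic_eval(np.array([0.0]), segments, f)
```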
Irwan Bello, Barret Zoph, Vijay Vasudevan, Quoc V. Le
2017.09
  • Domain specific language for optimizer
    • An update rule is built from two operands, a unary function applied to each, and a binary function combining the two results
    • A lot of operands, unary and binary functions are accessible to the controller
    • Each operand can be further used until we get to the optimizer equation
  • The policy is an RNN, that selects the operands and operations sequentially
  • Since as the sequence unrolls new operands are created that can be subsequently selected, the softmax weights at each step are different
  • RNN trained to maximize validation performance of the update rules on a specified model
  • For speed increase the child network is a small convnet, and it is trained only for 5 epochs on CIFAR-10
  • PowerSign (discovered update rule)
    • The sign of the gradient and the moving average is multiplied together and a number is raised to this power and then multiplied with the gradient
  • AddSign (discovered update rule)
    • The sign of the gradient and the moving average is multiplied and added to a number, and then multiplied with the gradient
  • The found update rules offer a small performance advantage for larger networks as well
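A small sketch of the two discovered update rules, where g is the current gradient and m an exponential moving average of past gradients; the learning rate and alpha values are illustrative.

```python
import numpy as np

def powersign_update(param, g, m, lr=0.01, alpha=np.e):
    """Scale the gradient up when its sign agrees with the moving average, down otherwise."""
    return param - lr * np.power(alpha, np.sign(g) * np.sign(m)) * g

def addsign_update(param, g, m, lr=0.01, alpha=1.0):
    """Same idea, but the sign agreement is added to a constant instead of exponentiated."""
    return param - lr * (alpha + np.sign(g) * np.sign(m)) * g

p = np.ones(3)
g, m = np.array([0.1, -0.2, 0.3]), np.array([0.2, 0.1, -0.3])
p = powersign_update(p, g, m)
```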
Yi Luan, Chris Brockett, Bill Dolan, Jianfeng Gao, Michel Galley
2017.10
  • The aim is to use non-conversational data to make a seq2seq model learn speaker roles (doctor, technician, etc.)
  • Multi-task learning approach:
    • Seq2seq learns conversational model based on the large general population of speakers
    • Autoencoder utilizes large non-conversational personal data from target speakers
    • The decoder part of the two models are shared and jointly trained so that the language model for generation is adapted to the target speaker
    • However one model can only be trained with 1 type of target speaker
      • So persona based model is tried as well which learns multiple speaker embeddings
  • Twitter data is used, and for the autoencoder 20 twitter users are selected and their posts without replies used as training data
  • Slightly better correlation of output responses with the target speaker's style than the baseline
Jason Lee, Kyunghyun Cho, Jason Weston, Douwe Kiela
2017.10
  • One agent sees an image and describes it in its language
    • Goal is to produce a description close to ground truth and to help other agent identify the target image
  • The other agent has to choose the correct image from several
  • Game played in both directions and agents trained jointly
  • Each agent has an image encoder, a native speaker module, and a foreign language encoder
    • Image encoder is a CNN
    • Speaker module is an RNN taking image representation as the initial state
    • Foreign language encoder is another RNN
  • The model achieves better performance if image encoder or native language encoder is pre-trained and fixed during translation training
  • In conclusion, the achieved BLEU scores are promising but far from NMT baselines
Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, Shuzi Niu
2017.10
  • Dialog dataset available
  • Dialogs from English learning examples
  • Short dialogs on specific topics (average dialog is 8 turns)
  • Each utterance is labeled with one of four dialog acts: {inform, question, directive, commissive}
  • Each utterance is labeled as one of 7 emotion categories: {anger, disgust, fear, happiness, sadness, surprise, other}
  • Dialogs usually follow some pattern along the four dialog acts, like question-inform bi-turn dialog flow
  • 83% of dialogs fall into the "other" emotion category
  • Some baseline dialog models, including retrieval and seq2seq based, are evaluated (with emotion and dialog act included)
Sharath T. S., Shubhangi Tandon, Ryan Bauer
2017.10
  • Use a history of dialog acts to get a global context for a seq2seq model
  • They also realize the problem of the loss function and try to tackle the incorporation of previous dialog turns
  • Conv-net pre-trained to predict dialog acts given input utterances, and the context encoder's hidden state is fed additionally to the decoder
  • Context encoder CNN pre-trained on switchboard corpus
  • The seq2seq part is trained on cornell movie corpus
  • Seq2seq baseline where previous turns are simply concatenated performs worse than single-turn seq2seq
  • Proposed model outperforms baselines on qualitative analysis
    • Automatic evaluation is also given, based on dialog length, diversity, and specificity
  • Choosing one among the least probable beams contributed to diversity of responses.
Baolin Peng, Xiujun Li, Jianfeng Gao, Jingjing Liu, Yun-Nung Chen, Kam-Fai Wong
2017.10
  • Train a discriminator to differentiate the dialog agent responses from human ones
    • Use the output of discriminator as an intrinsic reward (another critic in A2C framework)
    • Similar to Li et al. Adversarial neural dialog generation
  • Applied to movie ticket booking dialog system
    • Binary reward at the end of each dialog
  • As in other RL tasks, the gradients have high variance, so a baseline function is used
  • Alternating optimization is used between the discriminator reward and the A2C reward
  • User simulator is used, that has a goal, and informable and requestable slots
  • Both generator and discriminator are single layer neural networks
  • Adversarial A2C performs better than simple A2C
Bing Liu, Tong Yu, Ian Lane, Ole J. Mengshoel
2017.11
  • Contextual nonlinear multi-armed retrieval bandit networks in an online setting (feedback from users)
    • 2 BiLSTMs for encoding context (sequence of utterances) and responses
    • The vectors produced serve as input to the contextual bandits
    • Binary reward is collected from the user to update the parameters
  • Logistic regression Thompson sampling
    • Apply an approximation of the reward on selected dimensions of a second order polynomial feature space
      • Apply a sigmoid function on cMuᵀ
  • Pretrained on supervised labeled data with the same cMuᵀ score
  • The nonlinear bandit achieves better performance than a linear one, but the recall@1 is still pretty bad
Francis Dutil, Caglar Gulcehre, Adam Trischler, Yoshua Bengio
2017.11
  • Standard RNN seq2seq augmented with alignment planning and commitment vector
  • At each time-step an alignment plan matrix and a commitment plan vector are computed
    • Matrix holds alignment for current and next k timesteps, conditioned on the previously predicted token and current context from encoder hidden states
    • Decoder receives the previous hidden state and predicted token and the context, which is a weighted sum of encoder annotations
      • The weights are from the first row of the alignment matrix
    • Commitment plan vector is a binary decision whether to follow the existing alignment plan or to recompute it
      • Gumbel-softmax trick to make it differentiable (sketched after this entry)
      • If it is 1, then update the alignment by interpolating with the previous alignment plan (mixing ratio determined by a learned gate)
      • If it is 0, the previous alignment plan is used, by shifting the time-step
  • Penalty added to the loss function, so the model doesn't commit too often (update the alignment plan)
  • Better than a baseline seq2seq with attention on the task of finding eulerian circuits of graphs
    • And converges faster on QA and Char-level NMT
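A small sketch of the Gumbel-softmax trick used for the commitment decision: sampling from a categorical distribution is relaxed into a differentiable, temperature-controlled softmax over the logits plus Gumbel noise.

```python
import numpy as np

def gumbel_softmax(logits, temperature=0.5):
    gumbel_noise = -np.log(-np.log(np.random.uniform(size=len(logits))))
    y = (np.asarray(logits) + gumbel_noise) / temperature
    e = np.exp(y - y.max())
    return e / e.sum()                       # near one-hot for small temperatures

commit = gumbel_softmax([0.2, 1.5])          # soft "keep the plan" vs. "recompute it" decision
```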
Guillaume Lample, Ludovic Denoyer, Marc’Aurelio Ranzato
2017.11
  • Build a common latent space between two languages with a single autoencoder seq2seq (with different vocab)
  • For translation the encoded sentence is decoded from the latent representation with the other language's decoder
  • Pretrain with unsupervised word-by-word monolingual translation
  • Constrain latent representations to have same distribution using an adversarial regularization term
    • Discriminator trained to identify the language of a given latent representation
    • Encoder trained to fool the discriminator
  • Train encoder and decoder by reconstructing a sentence given a (random) noisy version in the same language
    • Or by translating it to the other language (which is a noisy version by itself), and translating it back
  • Final loss function is a weighted sum of auto-encoding, cross-domain and adversarial loss
  • Evaluation is done by translating from a language to another and then back to the original, and computing bleu score over original inputs and their reconstruction
  • Proposed model outperforms the word-by-word unsupervised baseline
    • Performs on par with a supervised model trained on fewer parallel sentences
    • However, if trained on the same amount of data, the supervised model far outperforms it
  • Ablation study shows that the pretraining together with the cross-domain loss is the most important
    • After that comes the noising of sentences, since without noise the model merely learns to copy the input sentence
Sergey Edunov, Myle Ott, Michael Auli, David Grangier, Marc’Aurelio Ranzato
2017.11
  • Seq2seq models used are 1D convolutional and recurrent as well (with attention)
  • Token-level loss functions:
    • Token negative log likelihood (NLL)
    • Token NLL with label smoothing
  • Sequence-level loss functions:
    • Directly optimize sequence metrics, by computing a set of outputs and scoring them (Each word has the same loss)
      • One approach is to compute this set with beam search
      • Second strategy is sampling over model's output distribution
    • Sequence NLL: the sum of token log probabilities, normalized by the number of tokens
    • Risk: minimize a cost function based on BLEU or ROUGE
    • MultiMargin: the difference between the cost of pseudo-reference and candidate response, based on the pre-softmax score
    • SoftmaxMargin: sequence NLL augmented with a cost inside the exponent
  • Combined objectives:
    • Weighted combination of a token and sequence level loss
    • Constrained combination, in which one of the two losses is used at any one time
      • If token loss is better than a baseline model, then train on the sequence loss
  • On NMT task, they achieve the best results with weighted combination of losses
  • Regenerating the candidate set for each input is much slower than pre-computing a set of candidates for each input, but achieves better overall BLEU
  • Beam search performs better than sampling
  • Increasing the candidate set size up to 16 increases performance (after that the performance decreases)
Oswaldo Ludwig
2017.11
  • The work is closely related to Li et al.'s adversarial dialog agent
  • Context vector is used as input in the decoder at each time-step
    • This is made up of the entire LSTM encoded dialog history
    • A different LSTM is used to encode the so-far generated response
    • Decoder is a dense layer, predicting the likelihood of current token
    • Greedy decoding is used to generate a token
  • Discriminator performs token-level binary classification
    • Whether the current token is machine or human-generated
    • Takes as input the token, the previous dialog utterances, and the incomplete answer
      • These are processed by two different LSTMs from the encoder, and then fed into a dense layer
    • This way backpropagation can be used instead of reinforcement learning
  • Adversarial training starts with a pre-trained model using teacher forcing
  • Since whole dialogs are fed into the model, machine generated dialogs are also generated in each epoch by the model
  • Discriminator and generator are trained alternately
    • Discriminator is trained on the machine and human dialogs to distinguish between them
    • Then, the generator is trained on the machine-generated dialogs, minimizing the difference between the discriminator output and 1
    • After that, the generator is also trained on only the human dialog dataset with standard cross-entropy loss
  • Dataset is from online English courses
  • Human and adversarial evaluation used (as in Li et al.)
    • Jaccard index between human and adversarial evaluation is 0.58
    • Adversarial training achieves a much better evaluation score
Kaixiang Mo, Yu Zhang, Qiang Yang, Pascale Fung
2017.11
  • Personalized decoder that can transfer phrase-level knowledge between users, while keeping personalized user info intact
    • With the use of a gate to switch between personal and shared phrases
  • The input to the model is the dialog history, where each word is labeled whether it is personal or general
  • First step of decoding is to compute the control gate based on the encoded sentence and the hidden state of shared and personal RNN
    • Then compute the next hidden states based on the gate output
    • Lastly generate the word based on one of the hidden states (given by control gate)
  • Each user is represented by a different decoder RNN
  • Shared and personal component trained together with RL (Reinforce algorithm)
    • Agent takes a combination of general and personal rewards
    • Personal rewards when the user confirms the suggestion of the agent
    • General reward when the user provides information about target task
    • Big general reward when system helps user finish target task
    • Negative general reward when the user rejects to proceed
    • Shared params updated at each iteration, while personalized params updated based on data collected from the corresponding user
  • This decoder is also integrated into the HRED model
  • Model tested in a coffee ordering task setting (very limited dataset)
  • Word-level transfer models perform better than sentence-level transfer
Zachary Lipton, Xiujun Li, Jianfeng Gao, Lihong Li, Faisal Ahmed, Li Deng
2017.11
  • The problem is that Q-learning methods never experience success because of the huge action space in dialog
  • Dialog acts are utterances, with informed and requested slot-value pairs
  • State tracker contains a representation of the conversation history and database features
  • Domain is movie-booking with 39 actions (each slot has two actions, inform and request)
  • Per-turn penalty is given, so that dialog is as short as possible
  • In Q-learning the optimal but intractable Q-policy can be approximated, for example with a learned neural network (DQN)
  • Bayes-by-backprop: the weights of a neural network are sampled from a gaussian distribution
    • Learn params by minimizing KL-divergence between the variational approximation of the distribution and the posterior
  • Bayes-by-backprop Q-network (BBQN), integrates DQN with bayes-by-backprop networks
    • The authors use a simple MLP
    • BBQ network trained with q-learning and Monte Carlo sampling is used over the frozen network to generate targets
    • Targets can also be computed with a maximum a posteriori (MAP) estimate
  • Variational Information Maximizing Exploration (VIME) can be used in BBQN to encourage unexplored state-action regions
  • A rule-based agent is used to pre-fill the replay buffer so that the BBQN sees some successful dialogs
  • Representing dialog state with a vector:
    • One-hot representations of act and slot corresponding to the current user action
    • Act and slot corresponding to last agent action
    • A bag of slots corresponding to all previously filled slots
    • Knowledge base counts
  • BBQNs achieve much better performance than DQNs
Bing Liu, Gokhan Tür, Dilek Hakkani-Tür, Pararth Shah, Larry Heck
2017.11
  • Dialogue-level LSTM takes in the encoding of current user utterance and encoding of previous system action
    • Produces a probability distribution over candidate values for each of tracked goals
    • Utterance-level LSTM is used to encode utterance
  • System action emitted based on current dialog state and retrieved info from KB (using a separate MLP-based policy network)
    • This is translated to NL using a template-based generator
  • The authors first train the system in a supervised manner using task-oriented corpora.
    • Then use REINFORCE to further train the agent (reward at the end of the dialog)
    • Penalty is given, to encourage shorter task completion time
  • RL clearly improves the task success rate and accomplishes the task in fewer turns than SL
    • Updating only the policy network results in less improvement than end-to-end RL
Qi Wu, Peng Wang, Chunhua Shen, Ian Reid, Anton van den Hengel
2017.11
  • In visual dialog (compared to visual QA), the agent has to provide reasoning to keep the conversation flowing (not just yes-no answers)
  • Propose a visual dialog model trained with adversarial learning
    • Discriminator has access to attention weights, which can be regarded as a form of reasoning
    • Monte carlo search used to compute word-level rewards (as in Li et al.)
  • Sequential co-attention is used to combine the image and dialog encodings (from CNN and LSTMs)
    • First utterance of the dialog is the image caption
    • Discriminator is also conditioned on image, question and dialog attention memories
      • And on the encoded question and generated answer (by generator)
    • Attention is always computed over one input, with the weighting based on the other two features
      • First the question feature is used to attend to the image
      • Attended image features and question feature combined to attend to utterances
      • Then, the attended dialog features and attended image features guide the question attention
      • Finally, image attention is run again, guided by attended question and attended dialog
      • The three attended features are concatenated
      • This is fed to an LSTM to compute the probability of generating each token
  • Teacher forcing, whereby the generator is alternately updated based on discriminator reward and MLE loss
  • While the model is mainly generation-based, evaluation is done in retrieval style: rank a set of responses
  • Generator pretrained on the dataset, and discriminator pretrained as well
  • The proposed model performs better than previous state-of-the-art (which also used attention)
    • On recall@k as well as human evaluation
Huiting Liu, Tao Lin, Hanfei Sun, Weijian Lin, Chih-Wei Chang, Teng Zhong, Alexander Rudnicky
2017.11
  • Alexa participant, using an ensemble of rule-based, retrieval and generative models
  • NLU / Preprocessing:
    • Topic detection (6 classes)
    • Intent analysis (42 classes)
    • Entity linking links entities to entries in wikipedia
  • NLU followed by a strategies layer, which selects the reply generator based on preprocessing results
  • Order of priority: rule-based, knowledge-based, retrieval-based, generative (seq2seq)
    • Rule-based: intent templates, backstory, entity-based templates
    • If no match, the system tries to get a response from Evi (KB QA provided by Amazon)
    • If even this fails, then retrieval and generative models are employed
  • Context and topic history is tracked
  • Retrieval is based on recent Twitter data
    • Randomly select from twitter posts related to recognized entities
  • Train an SVM classifier to rerank the candidate responses from seq2seq, based on engagement (binary)
  • Alexa user score is used to see how different modules affect the quality of the bot
    • Evi is used more in higher-rated dialogs
  • Neural generative model is used the most
  • Mean score achieved is lower than MILABOT's
    • Main problems are that the bot is not engaging or coherent
Mircea Mironenco, Dana Kianfar, Ke Tran, Evangelos Kanoulas, Efstratios Gavves
2017.12
  • Q-bot and A-bot trained to guess an image through dialog
  • Intervening by replacing image pixels with random noise and caption words with random words
  • Intervening by replacing each token in Q-bot or A-bot utterance with a random one with some probability
  • Intervening by negating yes/no answers of A-bot, and see if Q-bot cooperates
  • Results:
    • Changing the caption with some probability correlates very well with the final percentile rank
    • Replacing image with random noise or changing the answer has no effect on performance
    • Replacing questions has a slightly bigger effect on performance (still minimal)
  • Basically Q-bot relies on the caption at the beginning of the dialog, so there is no cooperation between the bots
David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, Demis Hassabis
2017.12
  • AlphaZero learns to play chess, shogi, and Go entirely through self-play, at a superhuman level
  • At each step, an action vector is outputted representing the probability of each action based on the state, and a scalar value estimating the expected outcome
  • General-purpose Monte Carlo tree search (MCTS) used
    • Each search consists of a series of simulated self-play games traversing the tree
  • At the end of each game, a reward is given to the neural network policy
    • Parameters updated to minimize the difference between the predicted outcome and the actual outcome
    • And to maximize the similarity of the policy vector to the search probabilities (loss sketched after this entry)
  • AlphaZero outperforms the best programs in each game
    • Trained for less than a day on 5000 TPUs
  • AlphaZero's performance scales better with thinking time per move than Stockfish's
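A small sketch of the value/policy objective described above: squared error between the predicted value and the game outcome, plus a cross-entropy term pulling the network's policy toward the MCTS search probabilities (the paper's L2 regularization term is omitted).

```python
import numpy as np

def alphazero_loss(value_pred, outcome, policy_probs, search_probs, eps=1e-8):
    value_loss = (outcome - value_pred) ** 2                          # match the game result
    policy_loss = -np.sum(search_probs * np.log(policy_probs + eps))  # match search probabilities
    return value_loss + policy_loss

loss = alphazero_loss(0.1, 1.0, np.array([0.7, 0.2, 0.1]), np.array([0.6, 0.3, 0.1]))
```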
Bolin Wei, Shuai Lu, Lili Mou, Hao Zhou, Pascal Poupart, Ge Li, Zhi Jin
2017.12
  • Given a source sequence the conditional distribution of the target sequence has multiple plausible points
  • Mimicking this scenario in MT by shuffling source and target sentences
  • As the percentage of shuffled sentences in a dataset grows, the BLEU score, entropy, and output length go down, reaching values similar to those of a dialog system
Li Zhou, Kevin Small, Oleg Rokhlenko, Charles Elkan
2017.12
  • Define reply generation as an MDP, parameterized by enc-dec
  • Combine on-policy with off-policy policy gradient
  • Two types of rewards:
    • Utterance-level reward captures the quality of generated agent utterance compared with target from training data (with BLEU)
    • Dialog-level reward captures the contribution of reply to achieving dialog goals
      • Negative reward if an API call is issued too early or too late, and positive reward for correct API call parameters
  • Reward shaping used to give rewards to intermediate actions
    • Approximate reward based on BLEU; the last action's reward is the true reward
  • Off-policy policy gradient to help with exploration:
    • Maximize probability of actions in dataset weighted by importance sampling ratios
  • bAbI dialog task 6 dataset
    • They fed all KB restaurants and attributes into the encoder
    • Achieves slightly better performance than the baseline
Boyang Deng, Junjie Yan, Dahua Lin
2017.12
  • Encode layers of networks through an LSTM, and predict on validation data
  • Very few types of layers and architectures permitted (limited in scope)
  • The prediction is also conditioned on the number of epochs
    • Thus the entire learning curve can be predicted
  • CNN layer encoding:
    • A vector representing the type of layer, kernel width and height, and number of channels
    • Similar to word embeddings the discrete vector is transformed to a continuous one
  • Final prediction is given by an MLP based on concatenation of last LSTM hidden state and the epoch index embedding
  • Block-based generation to acquire training samples
    • Reasonable architectures constructed based on simple heuristics and skeletons
    • They only used the accuracy of each network from last epoch for training
  • Trained on CIFAR-10 and MNIST
  • Much better results than previous approaches which relied on at least part of the learning curve
  • As the accuracy of a network increases, the correlation between predicted and actual accuracy improves
Chenxi Liu, Barret Zoph, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, Kevin Murphy
2017.12
  • Similar to A*, searching for more and more complex architectures, consisting of cells (these are learned)
    • Cells consist of blocks that do some convolution operation (8 possible choices) and addition over two input tensors
    • The set of possible inputs to a block is the set of all previous blocks in that cell, plus the final block of the previous cell and of the cell before that
  • Also train an RNN that can predict the reward for any model (validation performance)
  • Progressive learning, starting with cells of only 1 blocks
    • Use the reward predictor to predict the performance of networks with cells consisting of 2 blocks, and pick K most promising ones, compute actual reward and update predictor based on these, then iteratively use more and more blocks
  • The max number of blocks is five, so in total 1280 models are trained
  • The final best network achieves same performance on CIFAR-10 as the best one from NAS, but fewer networks had to be trained, and initial networks are smaller, so less training time
  • The best network on CIFAR-10 is ported to ImageNet and achieves the same performance as previous state-of-the-arts
Felipe Petroski Such, Vashisht Madhavan, Edoardo Conti, Joel Lehman, Kenneth O. Stanley, Jeff Clune
2017.12
  • They used the simplest GA, without crossover, and selecting the parent based on elitism
  • They store each parameter vector as a seed, plus the list of random seeds that produce the series of mutations applied to the parameter
    • This is much more memory efficient, needed for the millions of params of deep nets
  • Novelty search: reward given to agents that perform behavior never seen before (this avoids local optima)
    • It doesn't get stuck where a normal GA would, e.g. on the Image Hard Maze problem
  • Atari and humanoid locomotion tasks used as benchmarks
    • GA outperforms DQN and A3C on some games but performs worse on others
  • Simple random search also outperforms DQN and A3C on a few games
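A sketch of the seed-based parameter encoding: an individual is stored as an initialization seed plus the list of mutation seeds, and its full parameter vector is reconstructed deterministically whenever it is needed (the mutation strength sigma is illustrative).

```python
import numpy as np

def reconstruct(init_seed, mutation_seeds, n_params, sigma=0.02):
    theta = np.random.RandomState(init_seed).randn(n_params)             # initial weights
    for seed in mutation_seeds:
        theta += sigma * np.random.RandomState(seed).randn(n_params)     # replay each mutation
    return theta

# an individual is just a handful of integers, even for millions of parameters
child = reconstruct(init_seed=0, mutation_seeds=[17, 42, 7], n_params=1_000_000)
```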
Sungjin Lee
2017.12
  • Hierarchical encoder used (even the word embeddings are constructed with a character RNN)
  • Actions are selected based on a projection from the state embedding to the action embeddings
  • With continual learning the total loss function over all tasks has to be minimized
    • Without access to prior tasks -> leads to catastrophic forgetting
    • To combat this a modified loss function is used, to preserve the weights learned from prior tasks
  • Small, in-house human-human and human-computer datasets used
  • Weight transfer alone is not enough, the performance diminishes when switching between two tasks
    • The elastic loss, however, performs better
Ioannis Papaioannou, Amanda Cercas Curry, Jose L. Part, Igor Shalyminov, Xinnuo Xu, Yanchao Yu, Ondrej Dušek, Verena Rieser, Oliver Lemon
2017.12
  • Rule-based bots:
    • Persona, Eliza, Weatherbot
  • Retrieval bots:
    • Newsbot, Factbot, Evi
  • Response selection in 3 steps:
    • Bot priority list
    • Contextual priority: newsbot is prioritized if it stays on topic
    • Ranking if no default bots fired
  • They experimented with data-driven bots, which weren't included in the final system (lol)
  • Ranker:
    • Hand-engineered: coherence, flow, questions, same topic, dullness, sentiment polarity
    • Linear classifier, based on n-grams, and dialog features
Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, Jason Weston
2018.01
  • Dialog dataset constructed by crowd workers, containing about 160k utterances
  • Each dialog consists of two specific personas finding out information about each other
  • Revised personas are constructed that are similar to original persona so that models don't simply learn to copy text from the persona into the dialog
  • 4 training scenarios:
    • Conditioning on no persona, conditioning on one of the personas, and conditioning on both
  • Both ranking and generative models are tested
  • Seq2seq model augmented with a memory network that encodes the profile sentences
    • During decoding the decoder attends to the profile representation (picture below)
  • Results:
    • Most models have better hits@1 if persona info is given during training
    • Revised personas perform worse since word overlap is rarer, thus it is a harder problem
    • Based on human evaluation ranking models perform better than generative ones, and persona helps a bit
      • Also a ranking system trained on opensubtitles achieves much lower performance
      • Human evaluation has very high variance
Pararth Shah, Dilek Hakkani-Tur, Gokhan Tür, Abhinav Rastogi, Ankur Bapna, Neha Nayak, Larry Heck
2018.01
  • Build templates/outlines for goal-oriented dialogue by letting two agents converse with discrete actions
  • The task-specific knowledge is left to the developer
    • In this work a database querying task is used
  • Two-step process: map the task specification to a set of dialogue outlines, then map each outline to NL
    • Outlines have annotations, consisting of dialog act and slot-value map
    • To generate outline a scenario is sampled consisting of user profile and user goals (with slots)
      • User profile captures verbosity, and other task independent characteristics
    • Only the annotations are generated which map to a template utterance (rule-based)
  • A user simulator agent and a system agent are used, based on a finite state machine
  • To map outline to NL crowd workers are used (multiple paraphrases of the same outline)
  • Second round of crowdsourcing used to validate the written utterances
  • Thus a high-quality annotated goal-oriented dataset is constructed
Fenfei Guo, Angeliki Metallinou, Chandra Khatri, Anirudh Raju, Anu Venkatesh, Ashwin Ram
2018.01
  • Topic-breadth and topic-depth
    • Bot should be able to converse on a variety of topics and it should sustain long and coherent conversations on given topics
  • Deep average networks (DAN) used to train a topic classifier
    • DAN extended with topic attention table, learning topic-word weights across the vocab, to detect topic-specific keywords
    • It is trained on internal (big) data annotated with 20-50 topics (1 special category for chit-chat or non-topical utterances)
  • Topic-based metrics:
    • Depth: the average number of turns on topics
    • Breadth: histogram based on number of turns the bot talked about different topics across all dialogs
      • Topic-specific keywords coverage: number of distinct keywords from DAN, on a given topic (more is better)
  • Topic depth shows almost the same correlation as user rating with response error rate (amazon manual utterance evaluation)
    • Much higher correlation than what avg. dialog length has
  • Topic breadth correlates much worse, however, this is expected, it is more complementary to user ratings (to eliminate repetitiveness)
Xiaoyu Shen, Hui Su, Shuzi Niu, Vera Demberg
2018.02
  • A separate CVAE and AE is used, where the CVAE generates the input latent variables to the AE decoder (modeled with RNN)
  • RNN encoder extracts the corresponding latent variable target for each turn, based on which a CVAE is trained to reconstruct it through context-dependent Gaussian noise
  • The CVAE replaces an AED (adversarial enc-dec), thus alternating is needed between the AE phase and the CVAE phase (see images)
    • In CVAE phase a sample is obtained from the AE by transforming dialog context into continuous embedding and is used as the target for max likelihood training (RNN encoder is fixed during this phase because it is from AE)
    • In AE phase an utterance is encoded to a continuous latent variable, and a corresponding one is sampled from CVAE posterior distribution
  • KL divergence constraint added to RNN encoder in AE
  • Scheduled sampling is used to go from ground truth latent variable to noisier one, produced by CVAE
  • Trained on DailyDialog; human evaluation shows more fluency than other VAE models
Yookoon Park, Jaemin Cho, Gunhee Kim
2018.04
  • They address the degeneration problem of VAEs: a powerful model like an RNN learns to ignore the latent variables
  • VHCR model proposed, with a hierarchical latent structure, and an utterance drop regularization technique
  • The ignorance of the latent variable can be shown using the KL-divergence term in the loss function, which falls to zero
  • Another problem is the data sparsity: if conditioned on context, there exist very few targets to the same context
    • Therefore hierarchical models can overfit to training data, without using the latent variable
  • In VHCR global latent variable is used along with the utterance level latent variables
    • Context and decoder RNN is conditioned on global latent variable as well
    • Utterance latent variable is conditioned on global l.v.
    • For inference of global latent variable bidirectional RNN is used over the utterance vectors generated by encoder RNN
  • With these latent variables the decoder still learns to ignore them, thus utterance drop is used
    • The utterance encoder vector is randomly replaced by an unknown vector
  • With these additions VHCR achieves much higher KL-divergence
  • Cornell and Ubuntu corpora used for training (utterances longer than 30 words are truncated)
  • With automatic metrics it is shown that VHCR balances better the KL-divergence term and the NLL
  • They show that the global latent variable controls tone, and overall content of the conversation, and the utterance latent variable is a more fine-grained control in response generation (however, this is based only on a few questionable examples)
Tiancheng Zhao, Kyusong Lee, Maxine Eskenazi
2018.04
  • VAE-based approach, but the latent variable is a set of discrete variables
    • These latent variables should capture salient features about the response, and be independent of the context
  • Recognition network maps a sentence to the latent variable z, and the generator network defines the learning signals used to train z
    • Recognition network does not depend on context!
    • Recognition and generator network form a VAE over the response (DI-VAE)
      • Because of the known issues of VAE, they modify the loss function to also optimize mutual information, which is similar to adversarial auto-encoders
    • Another model is using the skip-thought model: discrete variational skip thought (DI-VST)
      • Recognition is the same, but here two RNNs used to predict previous and next sentence
  • Additionally there is an encoder-decoder network and a policy network
    • This is used to encode the context and generate the response using samples from the VAE
    • Policy network trained to predict aggregated posterior from the context
    • An additional loss based on the recognition network penalizes the decoder if its generated responses don't reflect the attributes in the latent variable (LAED)
      • For this a relaxation method is used: weight the word embeddings of the vocab with the probabilities predicted by the decoder, because otherwise the output would be discrete (one word at each step)
  • Using multiple small latent variables is better than using one large, according to perplexity, and the mutual information metrics
  • DI-VST is better at learning dialog acts and emotions through the latent variables on DailyDialog
    • Although homogeneity is still pretty low (0.34 and 0.12)
  • When LAED is added, the attribute accuracy of the model increases, because the decoder is forced to take into account the latent variable
Hao Fang, Hao Cheng, Maarten Sap, Elizabeth Clark, Ari Holtzman, Yejin Choi, Noah A. Smith, Mari Ostendorf
2018.04
  • NLU-DM-NLG
    • Dialog manager stores context and communicates with a knowledge graph
  • NLU extracts the speaker's goals, the potential topic, and sentiment
  • DM is a hierarchical state-based dialog model, with a master that manages the overall conversation and a collection of mini skills
  • Response generation consists of speech acts from four categories: grounding, inform, request, and instruction
  • The model is adapted to the user personality based on some probing questions
    • More extroverted personalities tend to rate the chatbot higher
  • Longer conversations usually received higher rating (but only slight correlation)
Xiaodong Gu, Kyunghyun Cho, Jungwoo Ha, Sunghun Kim
2018.05
  • DialogWAE models the data distribution by training a GAN within the latent variable space
  • Distribution of latent variable is modeled by GAN, which transforms random noise
    • This random noise is drawn from a normal distribution whose mean and covariance matrix are computed from the context with a feed-forward network
    • Optimization: minimize the Wasserstein distance between prior and posterior, and the NLL of a reconstructed response
  • This is wrapped by an encoder-decoder architecture
    • At training the posterior is computed (based on context and response), and the decoder RNN computes the reconstruction loss from this
    • A discriminator (FFN) is trained to tell apart prior and posterior samples
  • Sampling from Gaussian distribution doesn't capture the multimodal nature of responses
    • Thus a mixture of Gaussian distributions is used
    • Gumbel-softmax is used to sample a Gaussian component (see the sketch after this list)
  • Training is done by alternating between an AE phase where the reconstruction loss of responses is minimized and a GAN phase during which the aggregated posterior distribution of the latent variable is matched with the prior distribution
  • Evaluation metrics: BLEU, BOW embedding, distinct
    • For each context 10 responses are sampled
    • Distinct measures the diversity of the responses
  • DialogWAE with a Gaussian mixture prior network outperforms all previous models, and also generates much longer responses
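A rough numpy sketch of the mixture sampling described above; all shapes and parameters are made up for illustration. A Gaussian component is softly selected with Gumbel-softmax and the latent variable is drawn via the reparameterization trick.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=np.random.default_rng()):
    """Relaxed (differentiable) sample over mixture components."""
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + gumbel) / tau
    y = np.exp(y - y.max())
    return y / y.sum()

def sample_latent(component_logits, means, log_vars, rng=np.random.default_rng()):
    """Softly pick a Gaussian component, then reparameterize: z = mu + sigma * eps."""
    w = gumbel_softmax(component_logits, tau=0.5, rng=rng)   # soft one-hot over components
    mu = w @ means
    sigma = np.exp(0.5 * (w @ log_vars))
    return mu + sigma * rng.normal(size=mu.shape)

# toy example: 3 mixture components, 8-dimensional latent space
rng = np.random.default_rng(1)
z = sample_latent(rng.normal(size=3),
                  rng.normal(size=(3, 8)),    # component means
                  rng.normal(size=(3, 8)),    # component log-variances
                  rng)
print(z.shape)   # (8,)
```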
Hainan Zhang, Yanyan Lan, Jiafeng Guo, Jun Xu, Xueqi Cheng
2018.05
  • Problem with generic responses in seq2seq is analyzed
  • The objective of seq2seq is the same as minimizing KL divergence between predicted and true probability
    • However this doesn't penalize enough the common responses where predicted prob. is high and true prob. is low
  • Use coherence between reply and input to estimate true prob.
    • Cosine sim., pre-trained matching models,
    • And dual-learning agents: 2 seq2seq models
      • First agent generates response, second agent calculates coherence and sends it to first agent
      • Then this is repeated for the second agent as well
  • Coherence model is the reward function in an RL setting
  • For dual learning, the agents get the reward from each other
  • Slightly outperform baseline seq2seq and mmi, and adversarial seq2seq on both quantitative and human evaluation
Yansen Wang, Chenyi Liu, Minlie Huang, Liqiang Nie
2018.05
  • 3 types of words are identified: interrogative, topic, ordinary
    • At decoding first a type distribution is estimated
  • Soft typed decoder estimates three type-specific generation distributions over the vocab. (see the sketch after this list)
  • Hard typed decoder uses Gumbel-softmax to approximate argmax of the predicted types
    • Words are pre-classified in types for each input, and only words of the highest prob. type are generated (not over whole vocab.)
  • Significantly better on the distinct unigrams and bigrams metric than baseline seq2seq
    • Also much higher relevant topic word ratio in responses
  • Also much better according to human evaluation
  • Hard typed outperformed soft typed significantly
  • Error distribution analysis shows that errors fall in 3 categories almost evenly: no topic word, wrong topics, wrong word type
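A minimal numpy sketch of how the soft typed decoder's output distribution can be formed (my reading of the notes above; the numbers below are made up): the estimated type distribution weights the three type-specific vocabulary distributions.

```python
import numpy as np

def soft_typed_distribution(type_probs, type_specific_probs):
    """Mix the per-type vocabulary distributions, weighted by the
    estimated type distribution (interrogative / topic / ordinary)."""
    return type_probs @ type_specific_probs    # shape (vocab_size,)

type_probs = np.array([0.2, 0.5, 0.3])              # predicted type distribution
per_type = np.array([[0.70, 0.10, 0.10, 0.10],      # interrogative words
                     [0.10, 0.60, 0.20, 0.10],      # topic words
                     [0.25, 0.25, 0.25, 0.25]])     # ordinary words
print(soft_typed_distribution(type_probs, per_type))   # still sums to 1
```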
Oluwatobi O. Olabiyi, Alan Salimov, Anish Khazane, Erik T. Mueller
2018.05
  • HRED (with attention) + GAN, trained with teacher forcing
  • MLE loss is also added to the loss function of the generator
  • Noise is injected into the decoder of HRED, which ensures that the model is not deterministic
  • Discriminator is a BiRNN on top of the same context RNN from the HRED generator network
  • At generation a list of responses is generated and ranked by the discriminator
  • Outperforms VHRED in automatic metrics, but no human evaluation is given
  • Depending on the dataset word or utterance level noise results in better performance
Tiancheng Zhao, Maxine Eskenazi
2018.05
  • Dialog model that can generalize across domains from only a description of the domain
  • Description is made up of seed responses in the domain, and annotations of these (dialog acts)
  • Alternate between two losses during training
    • Optimize to make seed response representation close to its annotation representation
    • Optimize to make context representation close to response representation
  • HRE is used for context encoding, and the utterance encoder part is the same for seed response encoding (reused)
  • Model evaluated on synthetic restaurant data performs much better than standard seq2seq with copy
Can Xu, Wei Wu, Yu Wu
2018.07
  • Dialogs are annotated with dialog acts
    • 2 high level: context switch and context maintain
    • For each high level, 3 low levels: statement, question, answer
  • From the data it is concluded that context switch and questions are important to make a dialog longer
  • A dialog act classifier is learned based on the manual annotations
    • HRE encodes the dialog, and MLP at the end predicts dialog act probabilities for next utterance
    • Achieves 70% accuracy, it is employed to classify all dialog data
  • For the dialog model, the dialog act classifier is also trained (policy network), together with the response generator
    • Response generator is not hierarchical, however, its inputs are the last two utterances and the predicted dialog act
  • After training the dialog model with supervised learning, they further train only the policy network with self-play reinforcement learning
    • Reward is the dialog length, and response relevance
    • Response relevance is a trained (with negative sampling) LSTM model estimating the relevance between response and a context
    • Dialogs are terminated if the utterances are repetitive or a length limit is reached
  • Performance of the SL and RL model is better (than VHRED, RL-S2S baselines) only according to distinct metric
    • Also much better according to human evaluation, RL model is either very good or very bad, while SL is more average
    • Also longer average dialog length than RL-S2S, both in machine-machine and human-machine setting
  • The predicted dialog acts give very nice interpretability and controllability over the generated dialog
    • Context switch replies are generally longer than context maintain
Bowen Wu, Nan Jiang, Zhifeng Gao, Suke Li, Wenge Rong, Baoxun Wang
2018.08
  • The target probability in response generation can be decomposed to 2 probabilities
    • First a set of suitable words has to be found
    • Then this set of words has to be ordered to form a response
  • Analysing the 2 probabilities
    • The set of words probability leads to optimizing for high-frequency words from a set of replies to given input
    • The word ordering probability just acts as a language model (basically independent from input)
  • A new loss is proposed, with a term that tries to minimize the log prob. of a randomly sampled negative response (i.e. one that is not a response to the given query); see the sketch after this list
    • This is considered more as a regularization than standard loss function
  • Slightly better than the simple seq2seq according to human evaluation and distinct metric
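A minimal sketch of the extra loss term as I read it (the weight alpha and the numbers are illustrative, not from the paper): the NLL of the true response plus a term that pushes down the model's likelihood of a randomly sampled negative response.

```python
def regularized_loss(logp_true_response, logp_negative_response, alpha=0.1):
    """NLL of the true response plus a regularizer that lowers the
    model's log-probability of a randomly sampled negative response."""
    return -logp_true_response + alpha * logp_negative_response

# toy sentence-level log-probabilities under the model
print(regularized_loss(logp_true_response=-12.3, logp_negative_response=-8.7))
```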
Yury Zemlyanskiy, Fei Sha
2018.08
  • The chatbot's goal is to choose utterances that elicit responses from the other agent which increase its understanding of it
  • There is a set of candidate personality traits, which the chatbot has to narrow down so that it arrives at a subset that characterizes the other agent
  • Maximize mutual information between dialog and revealed personality (discovery score)
  • Rerank beam search samples based on discovery score
  • Discovery score improves the engagingness of the chatbot in human evaluation
Ashutosh Baheti, Alan Ritter, Jiwei Li, and Bill Dolan
2018.09
  • Models trained to maximize conditional likelihood assign a low probability to content words compared to function words
  • A topic constraint is added to the training objective
    • A random variable is defined over topics, and the probability of this variable given the source and the prob. given the output have to be similar (dot product)
    • HMM-LDA model is used to estimate topic probability distribution given a sentence (word-wise, so it works with beam search)
  • A semantic constraint is added, so that the source and output have to be similar (dot product)
    • Arora et al., SIF average word embedding used
  • Adding MMI to these constraints results in the best performance on diversity measuring metrics, and also human evaluation for content richness
Tong Niu, Mohit Bansal
2018.09
  • Should-not-change attacks (two of these are sketched after this list)
    • Random swap: swap adjacent words
    • Stopword dropout
    • Data-level paraphrasing: only change words by their synonyms
    • Generative-level paraphrasing: sentence-level paraphrase using neural networks
    • Grammar errors: introduce real grammar errors based on a huge corpus
  • Should-change attack
    • Negate the root verb, change verbs
    • Adjectives or adverbs to their antonyms
    • Turn utterances to random
    • Turn utterances to random but keep entities
    • Turn only entities to random
  • 3 training/evaluation setups: train on normal data and evaluate on adversarial attacks; train on adversarial data and evaluate on adversarial attacks; train on adversarial data and evaluate on normal data
    • For training on should-change attacks, use max-margin loss together with maximum likelihood
  • VHRED and RL model are generally not robust to the attacks, and training on adversarial data makes them more robust
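Two of the should-not-change perturbations, sketched in plain Python (the stopword list is a placeholder, not the one used in the paper).

```python
import random

STOPWORDS = {"the", "a", "an", "of", "to", "is", "and"}   # placeholder list

def random_swap(tokens, rng=random.Random(0)):
    """Should-not-change: swap one pair of adjacent words."""
    if len(tokens) < 2:
        return tokens[:]
    i = rng.randrange(len(tokens) - 1)
    out = tokens[:]
    out[i], out[i + 1] = out[i + 1], out[i]
    return out

def stopword_dropout(tokens, p=0.5, rng=random.Random(0)):
    """Should-not-change: randomly drop stopwords with probability p."""
    return [t for t in tokens if t.lower() not in STOPWORDS or rng.random() >= p]

utterance = "what is the name of the movie".split()
print(random_swap(utterance))
print(stopword_dropout(utterance))
```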
Pierre-Emmanuel Mazare, Samuel Humeau, Martin Raison, Antoine Bordes
2018.09
  • From the huge Reddit dump, they create personas using a profile's sentences that are about themselves
  • Dataset is constructed only as single turn
  • A retrieval model is used, and there is a separate persona and input encoder
  • Conditioning on personas clearly improves the recall metric
  • Transformer model achieves the best performance
  • First training on the Reddit data and then finetuning on persona-chat is much better than just training on persona-chat
Yizhe Zhang, Michel Galley, Jianfeng Gao, Zhe Gan, Xiujun Li, Chris Brockett, Bill Dolan
2018.09
  • Adversarial training to improve diversity
  • Variational information maximization to regularize the adversarial learning, and boost informativeness
  • A backward model is used to calculate this variational lower bound over the mutual information
  • CNN encoder is used, and its output is fed into LSTM decoder together with a random noise vector
  • Soft-argmax is used to make it differentiable, and to be able to use deterministic policy gradient
  • For the discriminator, the source, the target, and the generated response are all projected to the same space (learned embedding)
    • Cosine similarity between projected S, T and S, T' is computed
  • Generator tries to minimize the difference between projected S, T and S, T', while discriminator tries to maximize it
  • Evaluation metrics are BLEU, the 3 embedding-based and the two distinct metrics, and an entropy metric
  • On all of them the AIM is better than a seq2seq-MMI baseline and a GAN baseline
  • According to human evaluation the AIM is better in informativeness than MMI, but on par in relevance
Xinnuo Xu, Ondrej Dusek, Ioannis Konstas, Verena Rieser
2018.09
  • Latent variable based on context and based on coherence
  • Context gate to control the reliance on context or already generated response
    • This is dependent on coherence variable
    • But this coherence is computed based on the dataset and fixed (better results than using the true coherence for each example)
  • Coherence measure: cosine distance of source and response with stop word filtering (a sketch follows this list)
  • Base model is a cVAE
    • One of the losses is to minimize KL between prior and posterior network
    • This is why, at inference, we can condition z on the prior in the same way as it would be conditioned on the posterior during training
  • The original opensubs corpus is used, and a filtered version where they filter based on the coherence of source-response pairs
  • The coherence based data filtering improves results across all metrics
  • CVAE generally outperforms baseline seq2seq across metrics (BLEU, distinct, and coherence metrics)
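A small sketch of the coherence measure mentioned above (placeholder embeddings and stopword list); it is written here as a cosine similarity of averaged word vectors, the distance being one minus this value.

```python
import numpy as np

STOPWORDS = {"the", "a", "an", "of", "to", "is"}   # placeholder list

def coherence(source_tokens, response_tokens, emb):
    """Cosine similarity of the averaged word vectors of source and
    response after stopword filtering; `emb` maps token -> vector."""
    def avg(tokens):
        vecs = [emb[t] for t in tokens if t not in STOPWORDS and t in emb]
        return np.mean(vecs, axis=0) if vecs else None
    s, r = avg(source_tokens), avg(response_tokens)
    if s is None or r is None:
        return 0.0
    return float(s @ r / (np.linalg.norm(s) * np.linalg.norm(r)))

# toy embeddings
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=4) for w in ["movie", "film", "weather"]}
print(coherence("the movie".split(), "a film".split(), emb))
```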
Kangyan Zhou, Shrimai Prabhumoye, Alan W Black
2018.09
  • Document grounded dataset released, containing 100k utterances.
  • Documents are Wikipedia articles about popular movies
Joachim Fainberg, Ben Krause, Mihai Dobre, Marco Damonte, Emmanuel Kahembwe, Daniel Duma, Bonnie Webber, Federico Fancellu
2018.09
  • Dataset available
  • 25000 self-dialogues collected from Mturk on several categories
  • The dialogues are shown to be of high quality
Jianfeng Gao, Michel Galley, Lihong Li
2018.09
  • 70-page long paper, which offers a good in-depth introduction to the many aspects of conversational AI as a field.
  • Abstract: The present paper surveys neural approaches to conversational AI that have been developed in the last few years. We group conversational systems into three categories: (1) question answering agents, (2) task-oriented dialogue agents, and (3) chatbots. For each category, we present a review of state-of-the-art neural approaches, draw the connection between them and traditional approaches, and discuss the progress that has been made and challenges still being faced, using specific systems and models as case studies.
Chandra Khatri, Rahul Goel, Behnam Hedayatnia, Angeliki Metanillou, Anushree Venkatesh, Raefer Gabriel, Arindam Mandal
2018.10
  • Data from 2017 Alexa prize is annotated with dialog acts and topics
    • Keywords useful for determining topic are also labeled
  • Dialogs are also rated by humans for coherence and engagement (based on 4 yes-no questions about the dialog)
  • Topical depth (number of consecutive on-topic utterances) highly correlated with coherence and engagement
  • CDAN and CADAN, models extending the originals with context
    • Either average of utterances is used or dialog acts as the context
  • BiLSTM performs best (classification accuracy) with added context and dialog acts
  • Context extended ADAN performs best for keyword detection
Yujie Xing, Raquel Fernandez
2018.10
  • Li et al.'s persona model modified to work with personality types (OCEAN score over 5 traits)
  • OCEAN score for each speaker is computed based on a number of utterances from that speaker
  • Pre-training on opensubs, because the tv-series dataset used is only 100k samples
  • Sample utterances are computed for each personality on a test set
  • The OCEAN score is able to distinguish somewhat well between personalities (60%)
  • Baseline model achieves only 0.16 F1
  • Original persona model achieves better distinguishability than the personality model (normal)
    • They are both higher than baseline
  • The personalities can be interpolated between the 5 types and if extremes are used 0.53 F1 can be achieved
Hui Su, Xiaoyu Shen, Wenjie Li, Dietrich Klakow
2018.10
  • Response should connect context history and future responses
    • Achieve this by maximizing MMI of current utterance with both past and future contexts
  • Replace utterance with continuous code space learned from the whole dialog flow
    • Follows a Gaussian distribution
  • Dialog history and future encoded with hierarchical RNN
    • MLP on top to estimate Gaussian mean and covariance
  • Based on history and code space the decoder computes output
  • At test time the code space is sampled based on only the history (prior distribution)
    • Use variational inference to maximize variational lower bound
  • Better than VHRED baseline in automatic metrics and human evaluation as well
Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Inigo Casanueva, Stefan Ultes, Osman Ramadan, Milica Gasic
2018.10
  • 10k task-oriented annotated dialogs
  • Dataset
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
2018.10
  • Model is the encoder part of a normal Transformer
  • Masked language model: mask some words in a sentence and predict these based on the others
  • 3-way masking: sometimes mask the word, sometimes replace it with a random word, and sometimes keep the original word (see the sketch after this list)
  • Also pre-train the model for next sentence prediction:
    • Half the time a sentence is the true next sentence, and half the time it is random
      • Model has to predict a binary label
  • For classification fine-tuning, just a classification layer is added, and all parameters are finetuned
  • For other types of tasks specific finetuning layers are added
  • It beats previous SOTA on all GLUE tasks
  • Extensive ablation study is conducted for pre-training type, number of steps and model size
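A quick Python sketch of the 3-way masking (BERT selects about 15% of tokens for prediction; of these, 80% become [MASK], 10% a random word, and 10% stay unchanged); the vocabulary here is a toy example.

```python
import random

def three_way_mask(tokens, vocab, select_p=0.15, rng=random.Random(0)):
    """BERT-style masking: select ~15% of tokens for prediction; of these,
    80% -> [MASK], 10% -> random word, 10% -> keep the original word."""
    masked, targets = tokens[:], [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < select_p:
            targets[i] = tok                    # the model must predict this token
            r = rng.random()
            if r < 0.8:
                masked[i] = "[MASK]"
            elif r < 0.9:
                masked[i] = rng.choice(vocab)   # random replacement
            # else: keep the original token
    return masked, targets

vocab = ["cat", "dog", "movie", "runs", "blue"]
print(three_way_mask("the cat sat on the mat".split(), vocab))
```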
Nicolas Gontier, Koustuv Sinha, Peter Henderson, Iulian Serban, Michael Noseworthy, Prasanna Parthasarathi, Joelle Pineau
2018.11
  • They have some new dataset, but no link yet
  • Ensemble model with a ranker that ranks generated responses, and then selects one to output
    • Generative, retrieval, and rule-based systems
    • Neural question generator generates a question based on the news article
  • The dataset has a news article at the beginning of a dialog, which the dialog should be about
  • Supervised scoring:
    • Predict human vote (from the dataset) based on conversation history
    • There are many different features used for this classifier
    • Classifier achieves 64% accuracy
  • RL based scorer:
    • Estimate the q-value of a response (expected reward after a response)
    • Reward is a weighted version of the vote signal
    • Deep q-network used
  • Since the supervised scorer is not that good, mainly a set of designed rules is used to select a response during the dialog
  • Data was also collected by the user selecting the best response among candidates
    • With this data the supervised scorer proved best with a policy of choosing the response with highest score
Igor Shalyminov, Ondřej Dušek, Oliver Lemon
2018.11
  • Dataset from the 2017 Alexa prize
    • Length correlates somewhat more with positive feedback than negative feedback
    • Length correlates poorly with user rating
  • Ranker takes as input previous utterances and other features like sentiment and names
    • MLP at the end outputs rating (or dialog length)
  • Evaluation is done with sentiment analysis, to check the goodness of replies against a set of positive replies
    • How well can the ranker distinguish between positive and negative replies
  • Training with dialog length achieves slightly better performance than user rating (at a sufficiently big dataset size)
Jun Gao, Wei Bi, Xiaojiang Liu, Junhui Li, Shuming Shi
2018.11
  • Generate a set of responses for each input (bag of instances)
  • Latent space consists of the vocabulary, from which a word is sampled on which the reply should be based
  • Model consists of a latent word inference and a response generation network
    • Response generator encodes the input and sampled words to generate a set of responses
    • Use the minimum of individual losses of responses as overall loss
  • Pre-train the latent word inference network on keyword extraction task
  • Pre-train generator network using top 1 inferred latent word
  • Then jointly train them, using RL for the word inference network, and backprop for the generator
  • Only sample from a smaller set of words (specified for each input), because of huge latent space
  • The proposed model is much better according to human evaluation than S2S and CVAE baselines
Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, Jason Weston
2018.11
  • Dataset constructed where there is a wizard and an apprentice
    • They have to talk about some topics, but the wizard has access to relevant Wikipedia articles
    • The relevant Wikipedia article retrieval model is a simple fixed model
    • The wizard chooses an article and a sentence relevant to his/her response (which will be included in the dataset)
  • Input is the initial topic and the utterances so far
    • Plus the retrieved sentences from Wikipedia are attended to with a Transformer
    • The top knowledge sentence based on attention is selected, and further encoded together with the dialog context
  • Cross-entropy loss extended with a term to select the sentence from the articles which the annotator also selected
  • Transformer achieves a 25 R@1 for finding the correct knowledge sentence (better than MemNet)
  • The Transformer is both used in retrieval and generative dialog setting:
    • Using the gold or predicted knowledge greatly improves performance for retrieval and generative models and using gold knowledge is better
  • The two-stage generative transformer is better with predicted knowledge, while the end-to-end is better with gold knowledge
  • Pretraining on Reddit improved the performance everywhere
  • According to human score retrieval transformer is better than generative, but generative gains bigger relative improvement from using Wikipedia knowledge
Ilya Kulikov, Alexander H. Miller, Kyunghyun Cho, Jason Weston
2018.11
  • Comparing greedy, beam, and iterative beam search
  • Iterative beam search:
    • Run multiple beam searches, but exclude prior hypotheses by setting their score to negative infinity (see the sketch after this list)
    • Thus the candidates are guaranteed to be dissimilar
  • A ranking term is added to the loss function (ranking negative responses lower)
    • Iterative beam search is best according to full-length human dialog evaluation
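A toy sketch of the exclusion idea; the canned scoring table below stands in for an actual beam search, and in the paper the exclusion is implemented by setting banned hypotheses' scores to negative infinity.

```python
def iterative_beam_search(best_hypothesis_fn, num_iterations=3):
    """Repeatedly run (beam) search, excluding hypotheses found earlier,
    so the returned candidates are guaranteed to be distinct."""
    banned, results = set(), []
    for _ in range(num_iterations):
        hyp = best_hypothesis_fn(banned)
        if hyp is None:
            break
        results.append(hyp)
        banned.add(hyp)
    return results

# toy stand-in for beam search: pick the best-scoring canned response not yet banned
SCORES = {"i don't know": -1.0, "sounds great": -1.5, "tell me more": -2.0}
def toy_search(banned):
    remaining = {h: s for h, s in SCORES.items() if h not in banned}
    return max(remaining, key=remaining.get) if remaining else None

print(iterative_beam_search(toy_search))
```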
Heriberto Cuayáhuitl, Seonghan Ryu, Donghyeon Lee, Jihie Kim
2018.12
  • Automatically derive dialog rewards for a dialog dataset
    • Positive reward if the response is in the dataset, negative for randomly sampled responses (see the sketch after this list)
    • Reward for the dialog is the sum of all rewards
    • Generate dialogs with varying number of randomly sampled responses
      • Thus extend dataset from 20k to 150k dialogs
  • The model is a small 2-layer RNN, with a dense layer at the end
  • They experiment with different dialogue history lengths
    • The bigger the dialog history the better, with a max of 0.81 correlation between predicted and true reward when using 25 sentences
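The reward derivation as a minimal sketch (the dataset and dialog below are toy examples, not from the paper).

```python
def derive_dialog_reward(dialog, dataset_responses):
    """+1 for every turn whose response appears in the dataset,
    -1 for every randomly substituted response; the dialog-level
    reward is the sum over turns."""
    return sum(1 if response in dataset_responses else -1
               for _context, response in dialog)

dataset_responses = {"hi there", "i'm fine, thanks"}
dialog = [("hello", "hi there"),
          ("how are you", "purple monkey dishwasher")]   # random substitution
print(derive_dialog_reward(dialog, dataset_responses))   # 1 - 1 = 0
```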
Braden Hancock, Antoine Bordes, Pierre-Emmanuel Mazare, Jason Weston
2019.01
  • Dataset released with 60k utterances and 60k feedback examples
  • Initially train the agent on next utterance prediction, and satisfaction prediction with supervised data (persona-chat)
    • Crowd-working dataset collected for satisfaction scores
  • During deployment:
    • When agent predicts good satisfaction, add the user’s replies as target examples
    • When agent predicts bad satisfaction, ask for feedback, and try to predict the feedback itself (new task)
  • Model was shared between dialog and feedback task
    • But separate model used to predict satisfaction
  • Transformer used in a candidate ranking setting
    • Only two previous turns used as history
  • More improvement is observed if less supervised data is used
  • Using both extra dialog examples and feedback examples provides the biggest improvement
  • Adding the chatbot’s own responses as targets decreases performance
  • The feedback prediction task is generally easier than the dialogue task
  • More frequent retraining using new feedback is beneficial
Thomas Wolf, Victor Sanh, Julien Chaumond, Clement Delangue
2019.01
  • Blog post
  • GPT model pre-trained on BooksCorpus dataset
  • Input representation is the sum of the following vectors (a sketch follows this list):
    • Word embeddings
    • Dialog state embeddings: persona sentences, speaker1, speaker2
    • Positional embeddings
    • Separation tokens could also be added between turns
  • Order of personality sentences doesn’t matter, so the dataset can be augmented by different examples with different orders
    • To promote invariance to ordering the same positional embedding can be reused for each sentence
  • Two losses jointly optimized: next-utterance classification and language modeling
    • Classifier distinguishes between a correct next utterance and a set of distractors
  • Outperforms simple seq2seq
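A numpy sketch of the input representation (all sizes are toy values): each position's vector is the sum of its word, dialog-state (persona / speaker1 / speaker2), and positional embeddings.

```python
import numpy as np

def build_input(word_ids, segment_ids, word_emb, segment_emb, pos_emb):
    """Sum of word, dialog-state (segment), and positional embeddings."""
    positions = np.arange(len(word_ids))
    return word_emb[word_ids] + segment_emb[segment_ids] + pos_emb[positions]

# toy sizes: vocab 10, 3 dialog-state types, max length 8, embedding dim 4
rng = np.random.default_rng(0)
W = rng.normal(size=(10, 4))    # word embeddings
S = rng.normal(size=(3, 4))     # persona / speaker1 / speaker2 embeddings
P = rng.normal(size=(8, 4))     # positional embeddings
x = build_input(np.array([1, 4, 7]), np.array([0, 1, 2]), W, S, P)
print(x.shape)   # (3, 4)
```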
Rongzhong Lian, Min Xie, Fan Wang, Jinhua Peng, Hua Wu
2019.02
  • Prior and posterior distributions over knowledge selection
    • Posterior also takes into account the response
    • Prior is optimized to approximate posterior (KL between them)
  • There is no ground truth knowledge
  • Utterance encoder and knowledge encoder don't share parameters
  • Attention is used between encoded input and knowledge vectors
  • In the posterior module first the input and response are processed by an MLP and then attention with knowledge vectors
  • Two types of decoders:
    • Hard: knowledge concatenated with every word
    • Soft: separate knowledge and utterance GRU, and a fusion unit
  • First the bag of words loss between knowledge and responses is minimized for a couple of epochs
  • Evaluated on persona-chat and wizard-of-wikipedia it achieves better metrics and human scores than memnet baseline
  • Fusion is better than hard knowledge incorporation
Abigail See, Stephen Roller, Douwe Kiela, Jason Weston
2019.02
  • Dialog level controllable attributes
  • Two types of controllable methods:
    • Conditional training: append an additional control variable to the decoder inputs, representing some type of controllable attribute
    • Weighted decoding: during decoding assign different weights to words, based on some features representing controllable attributes (a minimal sketch follows this list)
  • Controllable attributes:
    • Repetition: only with weighted decoding, by looking at bigrams (this control is used in all other controls)
    • Specificity: controlled with both methods, using IDF to weight words and the mean IDF of the words in the whole response as the control variable
    • Response-relatedness: weight words by cos. sim. between the word and the input sentence (conditional training was ineffective)
    • Question-asking: weight question words (not so good), conditional training by giving the ratio of utterances in a dialog that should be questions
  • Conditional training fails to learn more complex attributes, but always outputs good sentences
  • Weighted decoding is efficient by increasing the weight, but this can lead to unintended side-effects
  • Large-scale human evaluation
    • Engagingness increases when controlling for repetitiveness, specificity, and question-asking
    • Repetition controlling provides the biggest improvements across all human metrics
    • Response-relatedness doesn't improve anything
    • Combining repetition control with question asking is even better
    • 50-70% question asking in this setting is the best balance (across multiple human metrics)
  • A balance between all these attributes is essential, but it's hard to pin down what makes a good conversation
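A tiny sketch of weighted decoding (the vocabulary, IDF values, and weight are made up): a weighted feature score is added to every candidate word's log-probability at each decoding step before the next word is chosen.

```python
import numpy as np

def weighted_decoding_scores(log_probs, feature_values, feature_weight):
    """Add a weighted feature score (e.g. IDF for specificity) to the
    per-word log-probabilities at a decoding step."""
    return log_probs + feature_weight * feature_values

vocab = ["the", "movie", "great", "okay"]
log_probs = np.log(np.array([0.4, 0.3, 0.2, 0.1]))   # decoder's next-word distribution
idf = np.array([0.1, 2.0, 1.5, 1.2])                 # higher = more specific word
scores = weighted_decoding_scores(log_probs, idf, feature_weight=0.5)
print(vocab[int(np.argmax(scores))])   # the specificity boost favours "movie"
```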
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever
2019.02
  • Hypothesis: a language model should be able to learn a wide variety of tasks just by being trained as an unsupervised language model (and maybe specifying the task in natural language)
  • Webtext: outbound links from Reddit with at least 3 karma
  • Special byte-pair encoding used, 1024 tokens in context size, and 50k vocab
  • Biggest model, 1.5B params, underfits the 40GB webtext
  • Improves over SOTA on most LM datasets without training on those datasets
  • On reading comprehension, summarization, and translation in the zero-shot setting it is still far from even simple baselines, but achieves impressive results, and specifying the task in natural language as part of the input definitely works
  • There is a small overlap between webtext train data and the many task-specific datasets' test data, but this overlap is not bigger than with their own train data
Yizhe Zhang, Xiang Gao, Sungjin Lee, Chris Brockett, Michel Galley, Jianfeng Gao, Bill Dolan
2019.03
  • A separate topic and persona feature extractor from individual utterances (dot product between extracted vectors)
    • Topic: utterances from the same conversation are likely on the same topic
      • Task is to identify whether two random utterances belong to the same dialog
    • Persona: Task is to identify if two random utterances (from the same dialog) are from the same speaker
  • They also use feature vector disentanglement and binary feature vectors in order to make these features more interpretable
  • Since utterance pairs close to each other might be textually similar, they only collect pairs at least 4 turns away for the positive samples
  • During dialog learning, the context vector (previous turns) and the feature vector of the target utterance are combined with an MLP, and then used as the first hidden state in an LSTM
    • Feature extractors are fixed during dialog learning
    • Additional loss is used between the target feature vectors and self-generated response feature vectors
    • During test time the feature vector is extracted by aggregating over all context utterances
  • They achieve 0.75 topic and 0.6 persona accuracy on a big twitter dataset
  • Investigating the feature vectors shows interesting interpretability
  • Responses seem more consistent with respect to persona
  • With binary feature vectors, interesting controllability can be achieved by toggling specific bits to 1
Nouha Dziri, Ehsan Kamalloo, Kory W. Mathewson, Osmar Zaiane
2019.04
  • The problem setting: given a conversation history (premise) and generated response (hypothesis), decide whether they are entailing, contradictory or neutral
  • ESIM and BERT used
  • Inference corpus based on Personachat
    • Entailment constructed by taking an appropriate and on-topic response
    • Contradiction constructed with random word utterances, or from multiNLI corpus
    • Neutral examples constructed by taking random utterances from the data
  • BERT is better, and there is a little correlation between NLI class and human score of response (entailment gets higher scores)
  • BERT based semantic similarity between utterances is also better correlated with humans than embedding metrics
Shikib Mehri, Evgeniia Razumovskaia, Tiancheng Zhao, Maxine Eskenazi
2019.06
  • Two existing pre-training objectives:
    • Next utterance retrieval (from a set of candidates)
    • Next utterance generation
  • Two new pretraining objectives:
    • Masked-utterance retrieval
      • Retrieve the correct utterance for a sequence of utterances where one of them has been replaced by a random utterance
    • Inconsistency identification
      • Replace a random utterance in a sequence of utterances and find the utterance that was replaced
  • Evaluations done on 4 downstream tasks
    • Belief state prediction
    • Dialog act prediction
    • Next utterance retrieval
    • Next utterance generation
  • Except belief state prediction all tasks show improvement with pretraining
  • They also converge faster and work better with limited data than the baseline
Jiawei Wu, Xin Wang, William Yang Wang
2019.07
  • Given a target utterance pair, a triple containing it is constructed, and also two other triples from previous history are sampled (both ordered and misordered)
    • Sampling these triples is more effective than just encoding the whole dialog history
    • All these dialogues are passed to the network which predicts whether the triple containing the target utterance pair is ordered
  • There can be gaps in a sequence of utterances, the focus is on the order
  • Hypothesis: if a generated response is good the misorder detection should be easy, otherwise harder
    • Based on the expectation that it's misordered we can provide a training signal to how good the generated response is
    • Basically an adversarial learning setup
  • The sampling based order detector achieves 85% accuracy
  • Much better than Li's adversarial dialog agent in both AdverSuc and human eval
Shikib Mehri, Tejas Srinivasan, Maxine Eskenazi
2019.07
  • A neural dialog module is constructed for the classic dialog system modules (NLU, DM, NLG)
    • NLU module takes the context as input and outputs a belief state
    • DM module projects the belief state and database vector and predicts a dialog act vector
    • NLG is a language model conditioned on belief state, dialog act, and database vector
  • Naive fusion
    • Modules are trained independently, and during inference they use each other's outputs
    • This propagates errors
    • The modules can also be finetuned jointly for response generation
  • Multitask fusion
    • Individual modules are learned simultaneously with end-to-end response generation task
  • Structured fusion networks
    • Learn a higher-level model on top of pre-trained modules for end-to-end response generation
    • The higher-level model is achieved by extending the modules with further neural parts and combining them into an end-to-end setup with cold fusion
    • 3 variants: pre-trained modules are fixed, finetuned, or multitasked
  • The baseline is a seq2seq model concatenating context, belief state, and database vector
  • Wizard-of-Oz dataset which is annotated with belief states and dialog acts
  • RL is used to finetune a supervised model with success rate as reward
  • Outperforms seq2seq baseline, but it's worse than a current best BERT-based model
  • SFN with finetuned modules is the best out of all variants
Prakhar Gupta, Shikib Mehri, Tiancheng Zhao, Amy Pavel, Maxine Eskenazi, and Jeffrey P. Bigham
2019.07
  • Single-reference and multi-reference metrics explored (word-overlap, embedding-based)
  • In the multi-reference setting, the reference which provides the best score for the target is used
  • In the multi-reference diversity setting recall is calculated, so the model should produce responses that cover a high percentage of the references (both settings are sketched after this list)
  • Dual-encoder (retrieval-based), Seq2seq, HRED and CVAE is used
  • 4 references collected for each test example in DailyDialog
    • References are at least as good as the original targets
  • In single-reference setting most metrics show insignificant correlation with human judgment
    • In multi-reference setting there is significant correlation, but still not that great
      • Eg. Human responses are rated almost the same as model responses by multi-reference word-overlap metrics
  • Distinct and self-bleu correlate poorly with human diversity judgment, because they don't capture diversity in meaning
    • Reference recall-based metrics show higher correlation
  • Correlation increases with human judgment as more references are added, but seems to plateau at 8
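A sketch of the two multi-reference settings with a toy unigram-overlap metric (the metric and the coverage threshold are illustrative, not the paper's): the quality score keeps the best reference, and diversity is measured as recall over the references.

```python
def multi_reference_score(metric, hypothesis, references):
    """Score the hypothesis against every reference, keep the best score."""
    return max(metric(hypothesis, ref) for ref in references)

def reference_recall(metric, hypotheses, references, threshold=0.5):
    """Fraction of references covered by at least one generated response."""
    covered = sum(any(metric(h, ref) >= threshold for h in hypotheses)
                  for ref in references)
    return covered / len(references)

def unigram_overlap(a, b):                      # toy stand-in metric
    a_set, b_set = set(a.split()), set(b.split())
    return len(a_set & b_set) / max(len(a_set | b_set), 1)

refs = ["i love that movie", "never seen it", "what is it about"]
print(multi_reference_score(unigram_overlap, "i really love that movie", refs))
print(reference_recall(unigram_overlap, ["i love that movie", "what is it about"], refs))
```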
John Wieting, Taylor Berg-Kirkpatrick, Kevin Gimpel, Graham Neubig
2019.07
  • Bleu has several issues when used as a training reward
    • No credit assignment
    • Penalizes lexically different translations
  • Instead use semantic similarity score
    • Average of subword embeddings and then cosine similarity
    • Length penalty is added, because model learns to repeat words
  • First MLE training then finetuning with semantic similarity
  • Marginal improvements in bleu score over finetuning with bleu
    • Also marginally better than MLE or bleu according to human evaluation
  • Bleu provides a higher reward for more frequent words, while the semantic similarity reward (SimiLe) provides a lower one
    • Probably because low-frequency words contribute more to sentence embeddings
Shikib Mehri, Maxine Eskenazi
2019.08
  • Dual encoder is the baseline model (and an ensemble of baseline dual encoders)
  • Idea is that observing different types of negative candidate response sets will result in different representations
    • Negative examples close to the ground truth should produce fine-grained representation, careful for minute differences
    • Negative examples distant should result in abstract representations
  • Semantic similarity measured by cosine sim.
  • L (=5) models are learned such that the negative examples are placed into L buckets based on distance from particular response
    • Each model trained on different level of granularity
    • Models ensembled
  • Baseline ensemble is better than single model, and multi-granular ensemble is a bit better than baseline ensemble
    • Evaluated in retrieval setting, with two models and two datasets
  • Explicit representation modeling is tested with bag-of-words (granular) and dialog act (high-level) prediction
    • Indeed the lowest granularity model achieves the best performance on dialog act prediction and vice versa