The encoder-decoder architecture for recurrent neural networks is achieving state-of-the-art results on standard machine translation benchmarks and is being used in the heart of industrial translation services. The model is simple, but given the large amount of data required to train it, tuning the myriad of design decisions in the model in order get top …
A Gentle Introduction to Neural Machine Translation: The first post in the Neural Translation series. It gives some background on machine translation. The development of machine translation over time is summarized as follows:
Rule-based MT incorporates linguistic properties and constructs grammar rules in order to capture syntactic structure.
Data-driven MT, or statistical translation, relies on the training data alone and maximizes the likelihood of source-target pairs at aligned positions. It uses phrase-based translation, which allows variable-length inputs. It can construct latent (hidden) states, which play a role similar to the context vector produced by the encoder-decoder mechanism. However, this approach suffers when training cases are scarce and when words are rare, so linguistic or syntactic information still needs to be incorporated.
Neural-based MT is a variant of data-driven MT. Like phrase-based translation it handles variable-length input: a source phrase of any length is transformed into a shared context vector by the encoder-decoder mechanism in the model. The fixed length of that shared context vector was a limitation until the attention mechanism (jointly learning to align and translate) was brought into neural translation. With attention, the context vector is recomputed for each output word instead of being fixed once, which gives the model more flexibility.
Encoder-Decoder recurrent model: In this second post, the author introduces the encoder-decoder recurrent model, which is the core model used in Google's translation service (the underlying models date from 2014). He introduces two variations of the encoder-decoder RNN, followed by an attention-based extension:
Sutskever NMT model: An end-to-end model which encodes the input into a fixed-length context vector, from which the decoder produces a variable-length target translation. It was first developed for English-French translation. It uses LSTM units, with gradient clipping to tackle the exploding-gradient problem. Its pre-processing appends an end-of-sentence tag to the source sentence, reverses the source input, and maps out-of-vocabulary words to UNK.
Cho NMT model: A sequence-to-sequence (end-to-end) model similar to the Sutskever NMT model, but using the GRU (Gated Recurrent Unit, a simplified variant of the LSTM) instead of a full LSTM unit. Like the model above, it is trained for English-French translation, but with a much smaller batch size. It also uses a pooling-style layer, maxout.
Cho NMT model + Attention Mechanism: Building on the previous paper, Cho and co-authors observed that performance decreases as the input sequence length and the vocabulary size increase. To mitigate this problem, the attention mechanism was proposed (Bahdanau, Cho, and Bengio). The context vector is no longer a single fixed-length vector but is computed for each output word, and it is trained jointly with the alignment, so the model outputs a target word and finds the best-aligned source positions at the same time.
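The attention step described above can be illustrated with a small NumPy sketch. This is not the code from the posts or the papers: the function name `additive_attention`, the weight names `W_dec`, `W_enc`, `v`, and all dimensions are assumptions chosen for illustration, and the weights are random rather than trained. It only shows how one context vector is formed for a single decoding step as a weighted sum of encoder states.

```python
import numpy as np

def additive_attention(decoder_state, encoder_states, W_dec, W_enc, v):
    """Compute one context vector for one decoding step.

    decoder_state:  shape (d_dec,)   current decoder hidden state
    encoder_states: shape (T, d_enc) one hidden state per source position
    W_dec: (d_dec, d_att), W_enc: (d_enc, d_att), v: (d_att,)  (hypothetical weights)
    """
    # Alignment score per source position: v . tanh(s W_dec + h_j W_enc)
    scores = np.tanh(decoder_state @ W_dec + encoder_states @ W_enc) @ v
    # Softmax over source positions (shifted by the max for numerical stability)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Context vector: attention-weighted sum of the encoder states
    context = weights @ encoder_states
    return context, weights

# Toy usage with random, untrained weights (all sizes are illustrative).
rng = np.random.default_rng(0)
T, d_enc, d_dec, d_att = 5, 4, 3, 6
encoder_states = rng.normal(size=(T, d_enc))
decoder_state = rng.normal(size=d_dec)
W_dec = rng.normal(size=(d_dec, d_att))
W_enc = rng.normal(size=(d_enc, d_att))
v = rng.normal(size=d_att)

context, weights = additive_attention(decoder_state, encoder_states, W_dec, W_enc, v)
```

The weights sum to one over the source positions, so the source sentence can have any length while the context vector keeps the size of a single encoder state.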
Beam search in NMT: Two methods are commonly used in NLP to find the best output sequence. The first is a greedy decoder, which picks the single most likely word at each step. The second is beam search, which keeps the K most likely candidate sequences at each step while generating the output (K is a tunable parameter, also called the beam width). Beam search often does better than greedy decoding. Simple Python code is available in this post.
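As a rough illustration of the difference, here is a minimal beam search sketch. It is not the code from the post: `beam_search`, `next_probs`, and `toy_model` are hypothetical names, and the toy probabilities are made up so that the greedy choice turns out to be worse overall. Setting `beam_width=1` reduces it to greedy decoding.

```python
import math

def beam_search(next_probs, length, beam_width):
    """Return the beam_width best (sequence, log-probability) pairs.

    next_probs(prefix) must return a list of probabilities, one per word
    index, for the next word given the words generated so far.
    """
    beams = [([], 0.0)]  # start with one empty sequence, log-prob 0
    for _ in range(length):
        candidates = []
        for seq, score in beams:
            for word, p in enumerate(next_probs(seq)):
                if p > 0.0:
                    # Sum log-probabilities instead of multiplying probabilities
                    candidates.append((seq + [word], score + math.log(p)))
        # Keep only the beam_width highest-scoring candidates
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

def toy_model(prefix):
    """Made-up next-word distribution over a 3-word vocabulary."""
    if not prefix:
        return [0.5, 0.4, 0.1]
    if prefix[-1] == 0:
        return [0.4, 0.3, 0.3]
    if prefix[-1] == 1:
        return [0.05, 0.9, 0.05]
    return [1 / 3, 1 / 3, 1 / 3]

greedy_seq, greedy_score = beam_search(toy_model, length=2, beam_width=1)[0]
beam_seq, beam_score = beam_search(toy_model, length=2, beam_width=2)[0]
print(greedy_seq, round(math.exp(greedy_score), 2))  # [0, 0] 0.2
print(beam_seq, round(math.exp(beam_score), 2))      # [1, 1] 0.36
```

In this toy case the greedy decoder commits to word 0 because it is the most likely first word, while a beam of width 2 also keeps word 1 alive and finds the more probable full sequence.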
Configure Encoder-Decoder in NMT: Mainly discusses a 2017 paper on the large-scale exploration of NMT architectures, which used English-German translation, and covers the configurations that let a model reach state-of-the-art translation results. The hyperparameters they studied are listed in the table below. They found that using attention (attention dim and attention type) and beam search (beam size and length penalty) improves results significantly compared with models without them. Tuning the other hyperparameters makes only a minor difference, for example the embedding dim (128 is generally good; larger is slightly better, and 2048 achieves the best result by a small margin). They compare different RNN cells, including vanilla RNN, GRU, and LSTM. Performance follows the complexity of the cell type, with LSTM achieving the best results. The depth of the encoder and decoder matters little (one layer is sufficient for a good result in one direction). For encoder direction, bidirectional is better than unidirectional, and reversing the source input is better than not reversing it.
The main difficulty in conducting such a survey is that the space of possible model configurations is too large to explore exhaustively. Some heuristic knowledge may be required to tune a seq2seq model.
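One way to keep track of these choices is a small configuration object. The sketch below is only an illustration: the class name `Seq2SeqConfig` and its field names are hypothetical. The defaults for embedding dim, cell type, depth, and encoder direction follow the findings summarized above, while the attention and beam search values are placeholders that these notes do not specify.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Seq2SeqConfig:
    # Defaults taken from the findings summarized in the notes
    embedding_dim: int = 128            # "128 generally good"
    cell_type: str = "lstm"             # LSTM did best among vanilla RNN, GRU, LSTM
    encoder_depth: int = 1              # one layer reported as sufficient
    decoder_depth: int = 1
    bidirectional_encoder: bool = True  # bidirectional beat unidirectional
    use_attention: bool = True
    # Placeholders: the notes name these hyperparameters but give no values
    attention_dim: int = 512
    attention_type: str = "additive"
    beam_size: int = 10
    length_penalty: float = 1.0

baseline = Seq2SeqConfig()
# Vary one setting at a time instead of searching the whole grid.
wide_embedding = replace(baseline, embedding_dim=2048)
```

Changing one field at a time against a fixed baseline is a simple heuristic for a configuration space that is too large to search exhaustively.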
Hands on with Keras, using a French-English dataset:
Prepare French-English data: Uses the European Parliament 1996-2011 English-French dataset. Some basic text processing is done: tokenizing on whitespace, lowercasing, removing punctuation, converting French characters to Latin (ASCII) characters, and removing non-printable and non-alphabetic tokens (numbers etc.). A minor note: he frequently uses str.maketrans in his code to create a translation table and do single-character replacement, rather than using the re.sub function.
Using Keras to build the model from scratch: Uses pre-trained embeddings and a Keras wrapper.
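The cleaning steps listed for the French-English data can be sketched with the standard library alone. This is a reconstruction under assumptions, not the author's code: the function name `clean_line` and the constant `PUNCT_TABLE` are hypothetical, and the exact order of steps in the original post may differ.

```python
import string
import unicodedata

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_line(line):
    """Clean one sentence: ASCII-fold, lowercase, strip punctuation and numbers."""
    # Decompose accented characters, then drop the non-ASCII accent marks
    line = unicodedata.normalize('NFD', line)
    line = line.encode('ascii', 'ignore').decode('ascii')
    cleaned = []
    for token in line.split():  # tokenize on whitespace
        token = token.lower().translate(PUNCT_TABLE)
        # Drop any remaining non-printable characters
        token = ''.join(ch for ch in token if ch in string.printable)
        # Keep only purely alphabetic tokens (removes numbers and empty tokens)
        if token.isalpha():
            cleaned.append(token)
    return ' '.join(cleaned)

print(clean_line("Reprise de la session, déclarée le 17 décembre 1999!"))
# reprise de la session declaree le decembre
```

Building the table once with str.maketrans and applying it with str.translate deletes all punctuation characters in a single pass, which is why it is a convenient alternative to re.sub for single-character replacement.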
phrase-based translation: A variable-length translation model which doesn't rely on window-based segmentation of the whole sentence.
end-to-end model: No components are trained separately.
Massive Exploration of Neural Machine Translation Architectures