Week Three

This week I focused on learning more about BiMPM. I went through the paper by Zhiguo Wang, Wael Hamza, and Radu Florian titled “Bilateral Multi-Perspective Matching for Natural Language Sentences”. I started looking at this in week one, but I am trying to develop a deeper understanding of BiMPM and natural language sentence matching (NLSM). Given a pair of sentences P and Q, the BiMPM model estimates the probability distribution Pr(y|P,Q) through the following five layers:

The Word Representation Layer represents each word in P and Q with a d-dimensional vector. The vector has two components: a word embedding and a character-composed embedding. The first component, the word embedding, is a fixed vector for each individual word. The character-composed embedding is calculated by feeding each character in a word into a Long Short-Term Memory (LSTM) network.
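To make this concrete, here is a minimal PyTorch sketch of the layer. All sizes (vocabulary, embedding dimensions, hidden size) are illustrative placeholders, not the paper's actual hyperparameters, and the class name is my own:

```python
import torch
import torch.nn as nn

# Illustrative sizes only -- not the hyperparameters from the paper.
VOCAB_SIZE, CHAR_VOCAB = 100, 30
WORD_DIM, CHAR_DIM, CHAR_HIDDEN = 50, 20, 50

class WordRepresentation(nn.Module):
    """Concatenates a fixed word embedding with a character-composed
    embedding produced by running an LSTM over the word's characters."""
    def __init__(self):
        super().__init__()
        self.word_emb = nn.Embedding(VOCAB_SIZE, WORD_DIM)
        self.char_emb = nn.Embedding(CHAR_VOCAB, CHAR_DIM)
        self.char_lstm = nn.LSTM(CHAR_DIM, CHAR_HIDDEN, batch_first=True)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq_len); char_ids: (batch, seq_len, max_chars)
        b, t, c = char_ids.shape
        w = self.word_emb(word_ids)                  # (b, t, WORD_DIM)
        ch = self.char_emb(char_ids.view(b * t, c))  # (b*t, c, CHAR_DIM)
        # Use the LSTM's final hidden state as the character embedding.
        _, (h, _) = self.char_lstm(ch)               # h: (1, b*t, CHAR_HIDDEN)
        ch = h.squeeze(0).view(b, t, CHAR_HIDDEN)
        return torch.cat([w, ch], dim=-1)            # d = WORD_DIM + CHAR_HIDDEN

layer = WordRepresentation()
words = torch.randint(0, VOCAB_SIZE, (2, 4))        # toy batch: 2 sentences, 4 words
chars = torch.randint(0, CHAR_VOCAB, (2, 4, 6))     # up to 6 characters per word
reps = layer(words, chars)                          # (2, 4, WORD_DIM + CHAR_HIDDEN)
```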

The Context Representation Layer incorporates contextual information into the representation of each time-step of P and Q. A bidirectional LSTM (BiLSTM) is used to encode contextual embeddings for each time-step of P, and the same BiLSTM is used to encode Q.
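The key point is weight sharing: one BiLSTM encodes both sentences. A minimal sketch, with illustrative dimensions of my own choosing:

```python
import torch
import torch.nn as nn

D_IN, HIDDEN = 100, 64  # illustrative sizes, not the paper's

# A single BiLSTM shared by both sentences.
context_bilstm = nn.LSTM(D_IN, HIDDEN, batch_first=True, bidirectional=True)

p = torch.randn(2, 7, D_IN)   # toy batch: sentence P, 7 time-steps
q = torch.randn(2, 5, D_IN)   # toy batch: sentence Q, 5 time-steps
p_ctx, _ = context_bilstm(p)  # (2, 7, 2 * HIDDEN)
q_ctx, _ = context_bilstm(q)  # same weights encode Q: (2, 5, 2 * HIDDEN)
```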

The Matching Layer is the core layer in the model. This layer compares each contextual embedding of one sentence against all contextual embeddings of the other sentence. The two sentences P and Q are matched in two directions: each time-step of P is matched against all time-steps of Q, and each time-step of Q is matched against all time-steps of P. This is done with a multi-perspective matching operation. The output of this layer is two sequences of matching vectors, where each matching vector corresponds to the matching result of one time-step against all time-steps of the other sentence.
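The building block of the multi-perspective operation is a cosine similarity computed under l learned "perspectives", where each perspective re-weights the two vectors element-wise before comparing them. A minimal sketch of that function (names and sizes are mine, for illustration):

```python
import torch
import torch.nn.functional as F

def multi_perspective_match(v1, v2, W):
    """m_k = cosine(W_k * v1, W_k * v2), where * is element-wise.
    v1, v2: (d,) contextual embeddings; W: (l, d) learned perspective weights.
    Returns an l-dimensional matching vector."""
    a = W * v1  # (l, d): each row of W re-weights v1 under one perspective
    b = W * v2
    return F.cosine_similarity(a, b, dim=1)  # (l,)

v1, v2 = torch.randn(10), torch.randn(10)  # toy contextual embeddings
W = torch.randn(4, 10)                     # l = 4 perspectives
m = multi_perspective_match(v1, v2, W)     # 4-dimensional matching vector
```

In the full model this function is applied under several matching strategies (e.g. matching a time-step of P against the last time-step, the max, or an attention-weighted combination of Q's time-steps), and the results are concatenated into the matching vector for that time-step.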

The Aggregation Layer aggregates the two sequences of matching vectors into a fixed-length matching vector. Another BiLSTM model is applied to the two sequences of matching vectors individually. The fixed-length matching vector is then constructed by concatenating the vectors from the last time-steps of the BiLSTM models.
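A minimal sketch of the aggregation step. In PyTorch, the final hidden state of each BiLSTM direction is available directly from the LSTM's hidden-state output; dimensions are again illustrative:

```python
import torch
import torch.nn as nn

M_DIM, HIDDEN = 8, 16  # illustrative sizes, not the paper's

# BiLSTM applied over a sequence of matching vectors.
agg_bilstm = nn.LSTM(M_DIM, HIDDEN, batch_first=True, bidirectional=True)

def aggregate(matching_seq):
    # h: (2, batch, HIDDEN) -- final hidden state of each direction.
    _, (h, _) = agg_bilstm(matching_seq)
    return torch.cat([h[0], h[1]], dim=-1)  # (batch, 2 * HIDDEN)

m_p = torch.randn(2, 7, M_DIM)  # matching vectors for P -> Q
m_q = torch.randn(2, 5, M_DIM)  # matching vectors for Q -> P
fixed = torch.cat([aggregate(m_p), aggregate(m_q)], dim=-1)  # (2, 4 * HIDDEN)
```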

The Prediction Layer evaluates the probability distribution Pr(y|P,Q). A two-layer feed-forward neural network consumes the fixed-length matching vector, and the softmax function is applied in the output layer.
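A minimal sketch of this head, assuming a binary task (e.g. paraphrase / not paraphrase); the hidden size and activation are my own illustrative choices:

```python
import torch
import torch.nn as nn

IN_DIM, HIDDEN, NUM_CLASSES = 64, 32, 2  # illustrative sizes

# Two-layer feed-forward network with a softmax output.
predict = nn.Sequential(
    nn.Linear(IN_DIM, HIDDEN),
    nn.Tanh(),
    nn.Linear(HIDDEN, NUM_CLASSES),
    nn.Softmax(dim=-1),
)

probs = predict(torch.randn(3, IN_DIM))  # each row is Pr(y|P,Q), summing to 1
```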

The server I need for my work is currently down. Once it is running again, I will be able to continue to my next step.