This week I focused on learning more about BiMPM. I went through the paper by Zhiguo Wang, Wael Hamza, and Radu Florian titled “Bilateral Multi-Perspective Matching for Natural Language Sentences”. I started looking at this in week one, but I am trying to develop a deeper understanding of BiMPM and NLSM. Given a pair of sentences P and Q, the BiMPM model estimates the probability distribution Pr(*y*|P,Q) through the following five layers:

The **Word Representation Layer** represents each word in P and Q with a d-dimensional vector. The vector has two components: a word embedding and a character-composed embedding. The first component, the word embedding, is a fixed vector for each individual word. The character-composed embedding is calculated by feeding each character in a word into a Long Short-Term Memory network and taking its final hidden state.
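As a rough sketch of this layer, here is a toy NumPy version. The dimensions, the tiny vocabularies, and the random weights are all my own placeholders, not values from the paper; the point is just the structure: word embedding concatenated with the last hidden state of a character-level LSTM.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions (illustration only, not the paper's settings).
WORD_DIM, CHAR_DIM, CHAR_HID = 4, 3, 5
word_emb = {"cat": rng.normal(size=WORD_DIM)}
char_emb = {c: rng.normal(size=CHAR_DIM) for c in "abcdefghijklmnopqrstuvwxyz"}

def lstm_step(x, h, c, W):
    """One LSTM step; W maps [x; h] to the four gate pre-activations."""
    z = W @ np.concatenate([x, h])
    i, f, o, g = np.split(z, 4)
    sig = lambda v: 1 / (1 + np.exp(-v))
    c = sig(f) * c + sig(i) * np.tanh(g)
    h = sig(o) * np.tanh(c)
    return h, c

W_char = rng.normal(size=(4 * CHAR_HID, CHAR_DIM + CHAR_HID))

def word_representation(word):
    # Character-composed embedding: feed each character into the LSTM
    # and keep the final hidden state.
    h, c = np.zeros(CHAR_HID), np.zeros(CHAR_HID)
    for ch in word:
        h, c = lstm_step(char_emb[ch], h, c, W_char)
    # d-dimensional representation: [word embedding; char embedding].
    return np.concatenate([word_emb[word], h])

vec = word_representation("cat")
print(vec.shape)  # (9,) = WORD_DIM + CHAR_HID
```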

The **Context Representation Layer** incorporates contextual information into the representation of each time-step of P and Q. A BiLSTM is used to encode contextual embeddings for each time-step of P, and the same BiLSTM is used to encode Q.
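A minimal sketch of the BiLSTM encoding, again with toy dimensions and random weights of my own choosing: the contextual embedding at each time-step is the concatenation of a forward and a backward LSTM hidden state.

```python
import numpy as np

rng = np.random.default_rng(1)

D, H = 4, 3  # input dim, hidden dim per direction (toy values)

def lstm_step(x, h, c, W):
    z = W @ np.concatenate([x, h])
    i, f, o, g = np.split(z, 4)
    sig = lambda v: 1 / (1 + np.exp(-v))
    c = sig(f) * c + sig(i) * np.tanh(g)
    h = sig(o) * np.tanh(c)
    return h, c

Wf = rng.normal(size=(4 * H, D + H))  # forward-direction weights
Wb = rng.normal(size=(4 * H, D + H))  # backward-direction weights

def bilstm(seq):
    """Encode each time-step with a forward and a backward LSTM pass."""
    def run(xs, W):
        h, c, out = np.zeros(H), np.zeros(H), []
        for x in xs:
            h, c = lstm_step(x, h, c, W)
            out.append(h)
        return out
    fwd = run(seq, Wf)
    bwd = run(seq[::-1], Wb)[::-1]
    # Contextual embedding per step = [forward state; backward state].
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

P = [rng.normal(size=D) for _ in range(5)]
ctx = bilstm(P)  # the same weights would encode Q as well
print(len(ctx), ctx[0].shape)  # 5 (6,)
```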

The **Matching Layer** is the core layer of the model. This layer compares each contextual embedding of one sentence against all contextual embeddings of the other sentence. The model matches the two sentences P and Q in two directions: it matches each time-step of P against all time-steps of Q, and each time-step of Q against all time-steps of P. Both directions use a multi-perspective matching operation. The output of this layer is two sequences of matching vectors, where each matching vector corresponds to the matching result of one time-step against all time-steps of the other sentence.
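The basic building block here is the multi-perspective cosine: each "perspective" reweights the two vectors with a trainable weight vector before taking the cosine. The sketch below shows that operation with toy dimensions and random weights (my placeholders), applied in the style of one of the paper's matching strategies, comparing each time-step of P against a single representative vector of Q.

```python
import numpy as np

rng = np.random.default_rng(2)

H, L = 6, 4  # contextual embedding dim, number of perspectives (toy values)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

W = rng.normal(size=(L, H))  # one trainable weight vector per perspective

def multi_perspective_match(v1, v2):
    """m_k = cosine(W_k ∘ v1, W_k ∘ v2): one cosine score per perspective."""
    return np.array([cosine(W[k] * v1, W[k] * v2) for k in range(L)])

# Match each time-step of P against a single vector summarizing Q
# (hypothetical inputs standing in for contextual embeddings).
P_ctx = [rng.normal(size=H) for _ in range(5)]
q_vec = rng.normal(size=H)
match_vecs = [multi_perspective_match(p, q_vec) for p in P_ctx]
print(len(match_vecs), match_vecs[0].shape)  # 5 (4,)
```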

The **Aggregation Layer** aggregates the two sequences of matching vectors into a fixed-length matching vector. Another BiLSTM model is applied to the two sequences of matching vectors individually. The fixed-length matching vector is then constructed by concatenating the vectors from the last time-steps of the BiLSTM models.
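A toy sketch of the aggregation step (dimensions and weights are again my placeholders): run forward and backward LSTM passes over each sequence of matching vectors, keep only the last hidden states, and concatenate all four of them, which yields a fixed-length vector regardless of sentence length.

```python
import numpy as np

rng = np.random.default_rng(3)

L, H = 4, 3  # matching-vector dim, aggregation hidden dim per direction

def lstm_last(xs, W):
    """Run an LSTM over xs and return only the last hidden state."""
    h, c = np.zeros(H), np.zeros(H)
    sig = lambda v: 1 / (1 + np.exp(-v))
    for x in xs:
        z = W @ np.concatenate([x, h])
        i, f, o, g = np.split(z, 4)
        c = sig(f) * c + sig(i) * np.tanh(g)
        h = sig(o) * np.tanh(c)
    return h

Wf = rng.normal(size=(4 * H, L + H))
Wb = rng.normal(size=(4 * H, L + H))

def aggregate(match_P, match_Q):
    # Concatenate the last forward/backward states from both sequences.
    parts = []
    for seq in (match_P, match_Q):
        parts.append(lstm_last(seq, Wf))        # forward pass
        parts.append(lstm_last(seq[::-1], Wb))  # backward pass
    return np.concatenate(parts)  # fixed length, independent of seq lengths

match_P = [rng.normal(size=L) for _ in range(5)]
match_Q = [rng.normal(size=L) for _ in range(7)]
vec = aggregate(match_P, match_Q)
print(vec.shape)  # (12,) = 4 directions * H
```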

The **Prediction Layer** evaluates the probability distribution Pr(*y*|P,Q). A two-layer feed-forward neural network consumes the fixed-length matching vector, and the *softmax* function is applied in the output layer.
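This last layer is a standard classifier head. A minimal sketch, assuming toy dimensions and random weights of my own (two labels stand in for a paraphrase/not-paraphrase decision):

```python
import numpy as np

rng = np.random.default_rng(4)

D, HID, CLASSES = 12, 8, 2  # matching-vector dim, hidden dim, label count

W1, b1 = rng.normal(size=(HID, D)), np.zeros(HID)
W2, b2 = rng.normal(size=(CLASSES, HID)), np.zeros(CLASSES)

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def predict(match_vec):
    """Two-layer feed-forward network ending in a softmax over labels."""
    h = np.tanh(W1 @ match_vec + b1)
    return softmax(W2 @ h + b2)  # Pr(y | P, Q)

probs = predict(rng.normal(size=D))
print(probs)  # two non-negative probabilities summing to 1
```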

The server I need for this work is currently down. Once it is running again, I will be able to continue to my next step.
