Week Six: 10/9 – 10/13

This week I successfully got the training stage working: I trained a model on the Quora Question Pairs dataset.

I used the command below to get it to work:

Boost_DIR=/opt/boost-1.63.0-gcc4.9/ \
OpenCV_DIR=/opt/opencv2.4-gcc4.9/ \
PKG_CONFIG_PATH=/opt/libzip-1.2.0-gcc4.9/lib/pkgconfig \
PATH=/opt/openexr-2.2.0-bin/bin:/opt/protobuf-3.2.0-gcc4.9/bin/:/opt/python2.7-gcc4.9/bin:$PATH \
PYTHONPATH=/opt/dlib-19.3-gcc4.9/:/opt/opencv2.4-gcc4.9/lib:/opt/opencv2.4-gcc4.9/lib/python2.7/site-packages/ \
LD_LIBRARY_PATH=/opt/openexr-2.2.0-bin/lib:/opt/python2.7-gcc4.9/lib/:/opt/boost-1.63.0-gcc4.9/lib/:/opt/opencv2.4-gcc4.9/lib/:/opt/protobuf-3.2.0-gcc4.9/lib:/opt/glog-gcc4.9-bin/lib:/opt/gflags-gcc4.9-bin/lib:/opt/snappy-gcc4.9-bin/lib/:/opt/cuda/lib64/:/opt/libzip-1.2.0-gcc4.9/lib/ \
python src/SentenceMatchTrainer.py \
    --train_path train.tsv \
    --dev_path dev.tsv \
    --test_path test.tsv \
    --word_vec_path wordvec.txt \
    --suffix sample \
    --fix_word_vec \
    --model_dir models \
    --MP_dim 20
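Since that one-liner is easy to mistype, the same environment overrides can be collected in a small Python launcher. This is just a sketch: the paths and arguments are copied from the command above, and the actual `subprocess.run` call is left commented out.

```python
import os
import subprocess

# Environment overrides from the one-line command above, gathered in one
# place so the launch command stays readable.  Paths copied verbatim.
env = dict(os.environ)
env.update({
    'Boost_DIR': '/opt/boost-1.63.0-gcc4.9/',
    'OpenCV_DIR': '/opt/opencv2.4-gcc4.9/',
    'PKG_CONFIG_PATH': '/opt/libzip-1.2.0-gcc4.9/lib/pkgconfig',
    'PYTHONPATH': '/opt/dlib-19.3-gcc4.9/:/opt/opencv2.4-gcc4.9/lib:'
                  '/opt/opencv2.4-gcc4.9/lib/python2.7/site-packages/',
    'LD_LIBRARY_PATH': '/opt/openexr-2.2.0-bin/lib:/opt/python2.7-gcc4.9/lib/:'
                       '/opt/boost-1.63.0-gcc4.9/lib/:/opt/opencv2.4-gcc4.9/lib/:'
                       '/opt/protobuf-3.2.0-gcc4.9/lib:/opt/glog-gcc4.9-bin/lib:'
                       '/opt/gflags-gcc4.9-bin/lib:/opt/snappy-gcc4.9-bin/lib/:'
                       '/opt/cuda/lib64/:/opt/libzip-1.2.0-gcc4.9/lib/',
})
env['PATH'] = ('/opt/openexr-2.2.0-bin/bin:/opt/protobuf-3.2.0-gcc4.9/bin/:'
               '/opt/python2.7-gcc4.9/bin:' + env.get('PATH', ''))

cmd = ['python', 'src/SentenceMatchTrainer.py',
       '--train_path', 'train.tsv', '--dev_path', 'dev.tsv',
       '--test_path', 'test.tsv', '--word_vec_path', 'wordvec.txt',
       '--suffix', 'sample', '--fix_word_vec',
       '--model_dir', 'models', '--MP_dim', '20']

print(' '.join(cmd))
# subprocess.run(cmd, env=env)  # uncomment to actually launch training
```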


The run used the following configuration:

Namespace(MP_dim=20, NER_dim=20, POS_dim=20, aggregation_layer_num=1, aggregation_lstm_dim=100, batch_size=60, char_emb_dim=20, char_lstm_dim=100, context_layer_num=1, context_lstm_dim=100, dev_path='Quora_question_pair_partition/dev.tsv', dropout_rate=0.1, fix_word_vec=True, highway_layer_num=1, lambda_l2=0.0, learning_rate=0.001, lex_decompsition_dim=-1, max_char_per_word=10, max_epochs=10, max_sent_length=100, model_dir='models', optimize_type='adam', suffix='sample', test_path='Quora_question_pair_partition/test.tsv', train_path='Quora_question_pair_partition/train.tsv', with_NER=False, with_POS=False, with_aggregation_highway=False, with_filter_layer=False, with_highway=False, with_lex_decomposition=False, with_match_highway=False, wo_attentive_match=False, wo_char=False, wo_full_match=False, wo_left_match=False, wo_max_attentive_match=False, wo_maxpool_match=False, wo_right_match=False, word_level_MP_dim=-1, word_vec_path='Quora_question_pair_partition/wordvec.txt')
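That dump is the trainer's argparse namespace, so every entry corresponds to a command-line flag. As an illustration only, here is a minimal sketch of how a few of those flags could be declared, with values taken from the dump above; the real script defines many more options.

```python
import argparse

# Illustrative sketch only: a handful of the trainer's flags, with values
# taken from the configuration dump above (the real script defines many more).
parser = argparse.ArgumentParser()
parser.add_argument('--MP_dim', type=int, default=20,
                    help='multi-perspective matching dimension')
parser.add_argument('--batch_size', type=int, default=60)
parser.add_argument('--learning_rate', type=float, default=0.001)
parser.add_argument('--dropout_rate', type=float, default=0.1)
parser.add_argument('--max_epochs', type=int, default=10)
parser.add_argument('--fix_word_vec', action='store_true',
                    help='keep the pre-trained word vectors frozen')

# The flags actually passed on the command line in the run above:
args = parser.parse_args(['--MP_dim', '20', '--fix_word_vec'])
print(args)
```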


Below is the information printed about the dataset I was training on:

Collect words, chars and labels ...
Number of words: 104910
Number of labels: 2
Number of chars: 1198
word_vocab shape is (106686, 300)
tag_vocab shape is (3, 2)
Build SentenceMatchDataStream ...
Number of instances in trainDataStream: 384348
Number of instances in devDataStream: 10000
Number of instances in testDataStream: 10000
Number of batches in trainDataStream: 6406
Number of batches in devDataStream: 167
Number of batches in testDataStream: 167
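The batch counts are consistent with the batch_size of 60 in the configuration: each stream appears to keep its final partial batch, i.e. batches = ceil(instances / batch_size).

```python
import math

batch_size = 60  # from the configuration dump

# The reported batch counts match ceil(instances / batch_size),
# i.e. the last partial batch is kept rather than dropped.
for name, instances in [('train', 384348), ('dev', 10000), ('test', 10000)]:
    print(name, math.ceil(instances / batch_size))
# train 6406, dev 167, test 167
```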


As the code ran, I observed that the eval accuracy on the validation data kept increasing from epoch to epoch.

After a day of running, I got the result below:

Accuracy for test set is 87.26
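Put in absolute terms (using the test-set size from the stream statistics above):

```python
test_instances = 10000  # size of testDataStream above
accuracy_pct = 87.26    # reported test-set accuracy

# Convert the percentage back into a count of correctly classified pairs.
correct = round(test_instances * accuracy_pct / 100)
print(correct)  # 8726 question pairs classified correctly
```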