Week 13

This week, I focused on training an alternate model, known as doc2vec, on our data.

Numeric representation of text documents is a challenging task in machine learning. Bag of words (BOW), the original model that we used to test both the Biomedical and Wikipedia data sets, is considered a very simplistic method because it loses many subtleties of a good representation, e.g., word order.
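To make the word-order problem concrete, here is a minimal sketch using scikit-learn’s CountVectorizer (an illustrative choice of library, not necessarily what we used): two sentences made of the same words in a different order get the exact same BOW vector.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two sentences with the same words in a different order.
docs = ["the dog bites the man", "the man bites the dog"]

vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(docs).toarray()

# BOW only counts word occurrences, so both rows come out identical:
# the word order is lost entirely.
print(vectorizer.get_feature_names_out())  # ['bites' 'dog' 'man' 'the']
print(vectors)                             # [[1 1 1 2]
                                           #  [1 1 1 2]]
```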

Doc2vec is supposed to be a better technique. It is heavily based on word2vec, so I’ll start with a short introduction to word2vec first.

Word2vec is a well-known technique that gives each word a numeric representation. It can capture relations between words, such as synonyms, antonyms, or analogies, and generates a representation vector for each word.
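As a quick illustration of the kinds of relations these vectors capture, here is a sketch using gensim’s downloader with a small set of pretrained vectors (the model name is just an example; any pretrained word vectors expose the same interface):

```python
import gensim.downloader as api

# Load small pretrained vectors (illustrative model name).
vectors = api.load("glove-wiki-gigaword-50")

# Classic analogy: king - man + woman is closest to queen.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# Similar words cluster together in the vector space.
print(vectors.most_similar("good", topn=3))
```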

This representation is created using one of two algorithms:

  1. Continuous Bag-of-Words model (CBOW)
  2. Skip-Gram model

Continuous bag of words (CBOW) slides a window over the text and predicts the current word from its “context”, the surrounding words. Each word is represented as a feature vector, and after training, these vectors become the word vectors.

Skip-gram is essentially the opposite of CBOW: instead of predicting one word at a time, it uses one word to predict all of the surrounding words (the “context”). It is much slower than CBOW, but is considered more accurate for infrequent words.
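Both algorithms are available through the `sg` flag of gensim’s Word2Vec class. A minimal sketch, assuming gensim 4.x and a toy placeholder corpus:

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences.
sentences = [
    ["machine", "learning", "is", "fun"],
    ["word", "vectors", "capture", "meaning"],
    ["doc2vec", "builds", "on", "word2vec"],
]

# sg=0 selects CBOW: predict the current word from its context window.
cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)

# sg=1 selects skip-gram: predict the context words from the current word.
skipgram = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

# Either way, each vocabulary word ends up with one vector.
print(cbow.wv["word2vec"].shape)      # (100,)
print(skipgram.wv["word2vec"].shape)  # (100,)
```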

With that said, the goal of doc2vec is to create a numeric representation of a document, regardless of its length. While the word vectors represent the concept of a word, the document vector intends to represent the concept of a document.
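In gensim this is the Doc2Vec class: each document is wrapped in a TaggedDocument, and the model learns one vector per tag alongside the word vectors. A minimal sketch, assuming gensim 4.x, with a toy corpus and placeholder parameters:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each document gets a unique tag; the model learns one vector per tag.
corpus = [
    TaggedDocument(words=["bag", "of", "words", "ignores", "order"], tags=[0]),
    TaggedDocument(words=["doc2vec", "represents", "whole", "documents"], tags=[1]),
]

model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

# Vector for a training document, looked up by its tag.
print(model.dv[0].shape)  # (50,)

# Infer a vector for a new, unseen document of any length.
new_vec = model.infer_vector(["an", "unseen", "document"])
print(new_vec.shape)  # (50,)
```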

I am currently working on training the data sets using this doc2vec model. However, I am having trouble splitting the large data sets into smaller sections so that I can use each smaller section as its own “document” for testing; a rough sketch of the chunking idea I am trying is below.
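One way to do the split is to break each large file into fixed-size blocks of tokens and treat every block as a pseudo-document. This is only a sketch of the idea; the chunk size, file path, and naive whitespace tokenization are placeholders, not our actual pipeline:

```python
from gensim.models.doc2vec import TaggedDocument

def chunk_into_documents(tokens, chunk_size=500):
    """Split a long token stream into fixed-size pseudo-documents."""
    docs = []
    for i in range(0, len(tokens), chunk_size):
        chunk = tokens[i:i + chunk_size]
        # Tag each chunk with its index so Doc2Vec learns one vector per chunk.
        docs.append(TaggedDocument(words=chunk, tags=[i // chunk_size]))
    return docs

# Usage: read a raw data set, tokenize naively, then chunk it.
with open("dataset.txt") as f:  # placeholder path
    tokens = f.read().lower().split()

documents = chunk_into_documents(tokens)
print(f"{len(tokens)} tokens -> {len(documents)} pseudo-documents")
```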

Hopefully, I will be able to sort through these problems this week, and next week I will report my findings with this model and how it compares with the bag-of-words model that we used initially.