Week 4-5

My tasks: 

  1. Find the training data sets used in the CoNLL 2010 Shared Task
  2. Learn how to use the bag-of-words strategy
  3. Learn how to use K-Nearest Neighbor (KNN) and Support Vector Machine (SVM)

The training, evaluation, and trial data sets were found at http://rgai.inf.u-szeged.hu/conll2010st/download.html

The data sets included biomedical abstracts and full articles from the BioScope corpus, along with paragraphs from Wikipedia containing “weasel” information.  (“Weasel” words or phrases are statements that deliberately introduce ambiguity. They are often misleading because they express a non-neutral point of view, making the reader or listener think that the writer or speaker has given a clear and direct response when the claim behind it is actually vague and inconclusive.)

Examples of weasel words and phrases:

Some people say…

Many have claimed…

Legend has it…

It is believed that…

Research shows…

Experts say…

It has been suggested…

Seems…

Appears…

Looks like…

Many…

Most…

Almost all…

Up to…

Helps…


The basic idea behind the bag-of-words method is to take any text and count how often each word occurs in it, ignoring word order.

I tried out the bag-of-words method in Python with the help of some examples from www.udacity.com/

I placed three sentences that I created into a list and then “vectorized” that list into the bag of words. 

After making a few mistakes along the way, I got the following result:
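Here is a minimal sketch of that step, assuming scikit-learn's CountVectorizer (the vectorizer the Udacity examples are built around); only the first sentence is the exact one quoted below, and the other two are made-up placeholders:

    from sklearn.feature_extraction.text import CountVectorizer

    email_list = [
        "hi Hi Katie the self driving car will be late Best Sebastian",
        # Placeholder sentences; not necessarily the ones used originally
        "Hi Sebastian the machine learning class will be great great great Best Katie",
        "Hi Katie the machine learning class will be most excellent",
    ]

    vectorizer = CountVectorizer()
    bag_of_words = vectorizer.fit_transform(email_list)  # sparse count matrix
    print(bag_of_words)
    # Each printed row has the form
    #   (0, 7)    2
    # i.e. (sentence index, word index)  count of that word in that sentence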

To translate:

  • First column: Values that resemble 2-D coordinate points
    • The first number, found in the “x” position of the coordinate, indicates the index of the sentence in the “email_list” list that the row refers to.
      • “0” meaning the first sentence
      • “1” meaning the second sentence
      • “2” meaning the third sentence
    • The second number, found in the “y” position of the coordinate, indicates the index of a particular word in the bag of words. In other words, it is the key the vectorizer assigned to that word when it organized the words inside the bag of words.
  • Second column: A single numerical value that gives how many times that word appears in that sentence.

For example, if we look at this row:

  (0, 7)    2

This means that word #7 in the bag of words appeared twice within the first sentence. By simply looking back at the first sentence,

“hi Hi Katie the self driving car will be late Best Sebastian”

we can see that the word “hi” appeared twice: once uppercase and once lowercase. But just to make sure, I tried to find the position, or location, of the word “hi” in the bag of words:
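That lookup, as a sketch under the same CountVectorizer assumption (its vocabulary_ attribute is a dictionary mapping each word to its index):

    print(vectorizer.vocabulary_.get("hi"))
    # prints: 7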

As you can see, “hi” appears as word #7 in the bag of words, which is exactly what we predicted. This also shows that this bag-of-words method is not case-sensitive. To investigate that further, I tried to find the location of the word “Hi” in the bag of words instead. My result:
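The same lookup for “Hi”, under the same assumption:

    print(vectorizer.vocabulary_.get("Hi"))
    # prints: None ("Hi" is not a key in the vocabulary)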

Apparently, “Hi” is not in the bag of words at all, but “hi” is, and it represents both the uppercase and lowercase versions of the word. (With CountVectorizer this is the default behavior: it lowercases all text before building the vocabulary.)


I did not get to work on KNN and SVM as much as I would’ve liked to, nor did I get all the way to testing the training data sets with bag of words and KNN. But I did learn the following:

K-Nearest Neighbor (KNN) is a classification algorithm that falls under the “supervised learning” category of machine learning. To classify a new item, it finds the k labeled training items closest to it (commonly by Euclidean distance) and takes a majority vote among their labels.
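To make that concrete, here is a minimal from-scratch sketch of KNN on made-up 2-D toy data (nothing below comes from the CoNLL data sets):

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_new, k=3):
        # Euclidean distance from the new point to every training point
        distances = np.linalg.norm(X_train - x_new, axis=1)
        # Labels of the k closest training points
        nearest_labels = y_train[np.argsort(distances)[:k]]
        # Majority vote among those labels
        return Counter(nearest_labels).most_common(1)[0][0]

    # Toy data: two clusters labeled "A" and "B"
    X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
    y_train = np.array(["A", "A", "B", "B"])
    print(knn_predict(X_train, y_train, np.array([1.1, 0.9])))  # prints: A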

 

Sources:

www.udacity.com/

https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor/#how-does-knn-work