Monthly Archives: November 2017

Week 12

Unfortunately, I was unable to research this week because of planning, traveling, and attending several family events throughout the holiday week.

I am currently exploring different methods for training and testing on our data that may produce better results than the bag-of-words and count vectorizer approach that we have used this semester.
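For reference, here is a minimal sketch of what a bag-of-words count vectorizer does, written in plain Python. It is a simplified illustration and not the code we use in the project: the function names and the two example sentences are made up, and a library vectorizer may tokenize text differently.

```python
from collections import Counter


def build_vocabulary(sentences):
    """Collect every distinct lowercase word, sorted alphabetically."""
    return sorted({word for sentence in sentences for word in sentence.lower().split()})


def vectorize(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]


# Made-up example sentences (not taken from our data sets).
sentences = ["the drug may reduce pain", "the drug reduces pain"]
vocabulary = build_vocabulary(sentences)
vectors = [vectorize(sentence, vocabulary) for sentence in sentences]

print(vocabulary)  # ['drug', 'may', 'pain', 'reduce', 'reduces', 'the']
print(vectors)     # [[1, 1, 1, 1, 0, 1], [1, 0, 1, 0, 1, 1]]
```

Each sentence becomes a list of word counts, so word order is lost. That is one reason to look at other methods.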

Week 12: 11/20 – 11/24

Due to Thanksgiving break, I was, unfortunately, unable to work much this week. 

I have just started cleaning up and reorganizing our repository on Bitbucket to make it neater and more accessible in the future.

Next week, I will continue investigating the error that I kept getting so that, hopefully, we can finally test the dataset!

Week 11

This week…

I went through the same steps to extract the sentences from the Wikipedia data sets as I did with the biomedical data sets.

  1. After making some mistakes in the extraction process, I trained the model on the Wikipedia training set and tested it on the Wikipedia evaluation set. My result was an “accuracy” measurement of 0.80, which means that our model correctly labeled 80% of the sentences from the evaluation set as certain or uncertain.
  2. Out of curiosity, I also tested the Wikipedia evaluation set using the biomedical training set from the previous bag-of-words model. Surprisingly, my result was an “accuracy” measurement of 0.78, meaning this model correctly labeled 78% of the sentences from the evaluation set as certain or uncertain.
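To make the “accuracy” measurement concrete, here is a small sketch of how it can be computed by hand. The function name and the ten labels below are invented for illustration; they are not our real evaluation data.

```python
def accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical labels for ten evaluation sentences (not real data).
true_labels = ["certain", "uncertain", "certain", "certain", "uncertain",
               "certain", "certain", "uncertain", "certain", "certain"]
predicted_labels = ["certain", "uncertain", "certain", "uncertain", "uncertain",
                    "certain", "certain", "certain", "certain", "certain"]

score = accuracy(true_labels, predicted_labels)
print(score)  # 0.8, because 8 of the 10 labels match
```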

Week 10

My task: to run the bag-of-words model on the Wikipedia training and testing data sets from the CoNLL 2010 Shared Task data and record my observations.

While working on this task, I ran into a lot of errors when executing my code, and my first run-through produced an accuracy measurement of 0.0. I spent the majority of this week sorting through the errors and looking for a fix that would give me a more reasonable accuracy measurement.

Week 11: 11/13 – 11/17

This week, the model needed to be retrained on the dataset to account for updates.

I finally took the manipulated TSV file and ran it. I had a few more bugs to fix in my Python script so that it fully cleans up the data that I receive.

Below is the final script I came up with.

import pandas as pd
import numpy as np

# Load the file of certain sentences that we received. Despite the .csv extension, it is tab-separated.
# The two columns are labelled "Sentence" and "Value". (The second column of '1' values stands for the certainty.)
frame = pd.read_csv('TRAIN_wiki_certain.csv', sep="\t", names=["Sentence", "Value"])

# Since we no longer need the "Value" column, we will just drop it.
frame = frame.drop(columns=['Value'])

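# Transform the dataset into the column layout needed to test this data: Value, Sentence, Compare, id.
frame.insert(loc=0, column='Value', value=0)
# Pair each sentence with the one that follows it. The last row has no following sentence, so its "Compare" value is NaN.
frame['Compare'] = frame['Sentence'].shift(-1)
# Give each row a random integer id from 50 to 404999. These ids are not guaranteed to be unique.
frame['id'] = np.random.randint(50, 405000, frame.shape[0])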

# Finally, write the new dataset out as a TSV file, without the index or a header row.
frame.to_csv("test_final.tsv", sep='\t', index=False, header=False)


The dataset should now be ready to run, but I am still getting an “IndexError: list index out of range” error. Next week, I’ll keep working on debugging it.
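One way to narrow down an “IndexError: list index out of range” like this is to check whether every row of the TSV file has the expected number of fields. The sketch below assumes that the code reading the file splits each line on tabs and expects four non-empty fields. I have not confirmed that this is the cause. The helper name and the sample rows are made up. One thing worth knowing is that the last row written by the script has an empty "Compare" field, because shift(-1) leaves NaN there and to_csv writes NaN as an empty string by default.

```python
def find_bad_rows(lines, expected_fields=4):
    """Return (line_number, reason) pairs for rows that look malformed."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != expected_fields:
            problems.append(
                (line_number, f"found {len(fields)} fields, expected {expected_fields}")
            )
        elif "" in fields:
            problems.append((line_number, "empty field"))
    return problems


# Made-up rows in the Value / Sentence / Compare / id layout (not real data).
sample_lines = [
    "0\tFirst sentence.\tSecond sentence.\t101\n",
    "0\tSecond sentence.\tThird sentence.\n",  # id column missing
    "0\tThird sentence.\t\t103\n",             # empty Compare field
]
bad_rows = find_bad_rows(sample_lines)
for line_number, reason in bad_rows:
    print(f"line {line_number}: {reason}")
```

To check the real output, the same function can be given an open file, for example `find_bad_rows(open("test_final.tsv", encoding="utf-8"))`.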