Unfortunately, I was unable to do research this week because of planning for, traveling to, and attending several family events over the holiday week.
I am in the process of exploring different methods for training and testing that may produce better results than the bag-of-words approach with the count vectorizer that we have used this semester.
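For context, the bag-of-words approach mentioned here can be sketched in a few lines with scikit-learn's CountVectorizer. The sentences, labels, and choice of Naive Bayes classifier below are toy assumptions for illustration, not our actual data or pipeline:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy stand-ins for certain/uncertain sentences
sentences = [
    "The results clearly show an increase.",
    "The protein may be involved in regulation.",
    "We confirmed the effect in all trials.",
    "It is possible that the gene plays a role.",
]
labels = [1, 0, 1, 0]  # 1 = certain, 0 = uncertain

# Bag-of-words: each sentence becomes a vector of word counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)

# A simple classifier trained on the count vectors
clf = MultinomialNB()
clf.fit(X, labels)

# Classify a new sentence, vectorized with the same vocabulary
new = vectorizer.transform(["The gene might affect growth."])
print(clf.predict(new))
```

Any alternative method we try would replace the CountVectorizer step with a different sentence representation while keeping the same train/test structure.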
Due to Thanksgiving break, I was, unfortunately, unable to work much this week.
I have just started cleaning and reorganizing our repository on Bitbucket to make it neater and more accessible in the future.
Next week, I will continue investigating the error I kept getting, and hopefully we can then test the dataset!
I went through the same steps to extract the sentences from the Wikipedia data sets as I did with the biomedical data sets.
- After fixing mistakes in the extraction process, I tested the Wikipedia evaluation set against the model trained on the Wikipedia training set. My result was an “accuracy” measurement of 0.80, which means that our model correctly labeled 80% of the sentences from the evaluation set as certain or uncertain.
- Out of curiosity, I also tested the Wikipedia evaluation set against the model trained on the biomedical training set used in the previous bag-of-words model. Surprisingly, my result was an “accuracy” measurement of 0.78, meaning this model correctly labeled 78% of the sentences from the evaluation set as certain or uncertain.
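Both comparisons above follow the same pattern: fit on one training corpus, then score accuracy on the Wikipedia evaluation set. A minimal sketch of that cross-corpus evaluation, using hypothetical toy sentences and a logistic-regression stand-in rather than our actual files and model:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Toy stand-ins for a training corpus and a separate evaluation corpus
train_sents = ["We proved the theorem.", "This might be wrong.",
               "The data confirm the hypothesis.", "Perhaps it will rain."]
train_labels = [1, 0, 1, 0]  # 1 = certain, 0 = uncertain
eval_sents = ["The experiment confirmed it.", "It could possibly fail."]
eval_labels = [1, 0]

# Fit the vocabulary and classifier on the training corpus only
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_sents)
clf = LogisticRegression().fit(X_train, train_labels)

# Score the held-out set; accuracy = fraction of sentences labeled correctly
X_eval = vectorizer.transform(eval_sents)
print(accuracy_score(eval_labels, clf.predict(X_eval)))
```

Swapping in a different training corpus (as in the biomedical-vs-Wikipedia comparison) only changes the data passed to `fit`; the evaluation step stays the same.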
My task: to run the bag-of-words model on the Wikipedia training and testing data sets from the CoNLL 2010 Shared Task data and record my observations.
While attempting to complete my task this week, I ran into many errors executing my code: my first run-through produced an accuracy measurement of 0.0. I spent most of the week sorting through the errors to find a solution that would give me a more reasonable accuracy measurement.
For this week, the dataset needed to be retrained to account for updates.
I finally took the manipulated TSV file and ran it. I had a few more bugs to fix in my Python script so that it fully processes the data I receive.
Below is the final script I came up with.
import pandas as pd
import numpy as np
# Import the tab-separated file of certain sentences. The two columns will be labeled "Sentence" and "Value". (The second column of '1' values stands for certainty.)
frame = pd.read_csv('TRAIN_wiki_certain.csv', sep="\t", names = ["Sentence", "Value"])
# Since we no longer need the "Value" column, we will just drop it.
frame = frame.drop(columns=['Value'])
# Here we are transforming the dataset into the format needed to be able to test this data.
frame.insert(loc=0, column='Value', value=0)
frame['Compare'] = frame['Sentence']
frame.Compare = frame.Compare.shift(-1)
frame['id'] = np.random.randint(50, 405000, frame.shape[0])  # one random id per row (passing frame.shape would produce a 2-D array)
# Finally, we save the new dataset as a TSV file.
frame.to_csv("test_final.tsv", sep='\t', index=False, header=None)
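As a quick sanity check on what the script produces, here is the same transformation applied to a tiny hypothetical frame (the three one-letter sentences are made up for illustration):

```python
import pandas as pd
import numpy as np

# Tiny stand-in for the certain-sentence file (hypothetical rows)
frame = pd.DataFrame({"Sentence": ["A.", "B.", "C."], "Value": [1, 1, 1]})

frame = frame.drop(columns=["Value"])
frame.insert(loc=0, column="Value", value=0)
frame["Compare"] = frame["Sentence"]
frame.Compare = frame.Compare.shift(-1)  # pair each sentence with the next one
frame["id"] = np.random.randint(50, 405000, frame.shape[0])

print(frame)
# The last row's "Compare" is NaN, since shift(-1) has nothing left to pull up
```

That trailing NaN in the shifted column is worth keeping in mind; a row with a missing field is one plausible place an index error could originate downstream.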
Now we should be ready to run the dataset, but I am still getting an “IndexError: list index out of range”. Next week, I'll keep working to debug this error.