My goal this week was to figure out how to train the machine learning model in sci-kit learn to classify whether a sentence is certain or indicates uncertainty.
I trained a bag-of-words model using the Biomedical full articles training data set.
I then tested the trained model on the Biomedical abstracts training data set.
In conclusion, when the Biomedical full articles training data set is used to train the model using the bag-of-words technique, we get about a 93% accuracy measurement when we test the data on the Biomedical abstracts training data set. Next, is to see how well this model performs when being tested by the Biomedical evaluation data sets.
I also did a little research on an alternate classification model.
Because we would only need to classify sentences under two categories (certain-1 or uncertain-0), this would be considered a binary classification model.
To train binary classification models, we would use the industry-standard learning algorithm known as logistic regression. As Wikipedia defines,
“Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function.”