Monthly Archives: October 2017

Week 8

My goal this week was to figure out how to train a machine learning model in scikit-learn to classify whether a sentence is certain or indicates uncertainty.

I trained a bag-of-words model using the Biomedical full articles training data set.

I then tested the trained model on the Biomedical abstracts training data set.

In conclusion, when the Biomedical full articles training data set is used to train the model with the bag-of-words technique, we get about 93% accuracy when testing on the Biomedical abstracts training data set. Next is to see how well this model performs when tested on the Biomedical evaluation data sets.
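
For reference, here is a minimal sketch of this kind of pipeline in scikit-learn. The file names, the column layout, and the Naive Bayes classifier are assumptions for illustration, not necessarily the exact setup used here:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

train = pd.read_csv("bio_full_articles.csv")  # hypothetical file name
test = pd.read_csv("bio_abstracts.csv")       # hypothetical file name

# Build the bag-of-words vocabulary from the full-articles sentences only.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train["sentence"])
X_test = vectorizer.transform(test["sentence"])  # reuse the same vocabulary

classifier = MultinomialNB()
classifier.fit(X_train, train["label"])

predictions = classifier.predict(X_test)
print("accuracy:", accuracy_score(test["label"], predictions))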


I also did a little research on an alternate classification model.

Because we only need to classify sentences into two categories (certain = 1 or uncertain = 0), this is a binary classification problem.

To train binary classification models, we would use the industry-standard learning algorithm known as logistic regression. As Wikipedia defines it,

“Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function.”
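
To make that definition concrete, here is a small sketch using a toy, made-up feature (a hypothetical hedge-word count) that shows how logistic regression turns a weighted feature into a probability of the "certain" class:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy feature: number of hedge words in a sentence (made-up counts for illustration).
X = np.array([[0], [0], [1], [2], [0], [3]])
y = np.array([1, 1, 0, 0, 1, 0])  # 1 = certain, 0 = uncertain

model = LogisticRegression()
model.fit(X, y)

# Estimated probability that a sentence with one hedge word is "certain".
print(model.predict_proba([[1]])[0, 1])

# The same probability computed directly from the logistic function.
z = model.intercept_[0] + model.coef_[0, 0] * 1
print(1.0 / (1.0 + np.exp(-z)))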

http://docs.aws.amazon.com/machine-learning/latest/dg/training-ml-models.html

https://machinelearningmastery.com/binary-classification-tutorial-with-the-keras-deep-learning-library/

http://dataaspirant.com/2017/04/15/implement-logistic-regression-model-python-binary-classification/

https://dataaspirant.com/2017/03/02/how-logistic-regression-model-works/

https://www.digitalocean.com/community/tutorials/how-to-build-a-machine-learning-classifier-in-python-with-scikit-learn


Week 8: 10/23 – 10/27

For this week, I worked mostly on trying different training arguments to get better performance on my dataset.

I played around with the configuration below:

{
  "dropout_rate": 0.1,
  "suffix": "quora",
  "NER_dim": 20,
  "highway_layer_num": 1,
  "with_match_highway": true,
  "optimize_type": "adam",
  "with_highway": true,
  "max_epochs": 10,
  "with_aggregation_highway": true,
  "with_filter_layer": false,
  "lex_decompsition_dim": -1,
  "aggregation_layer_num": 1,
  "max_char_per_word": 10,
  "wo_maxpool_match": false,
  "context_layer_num": 1,
  "wo_full_match": false,
  "lambda_l2": 0.0,
  "fix_word_vec": true,
  "wo_left_match": false,
  "with_NER": false,
  "aggregation_lstm_dim": 300,
  "context_lstm_dim": 100,
  "POS_dim": 20,
  "with_lex_decomposition": false,
  "learning_rate": 0.001,
  "with_POS": false,
  "wo_right_match": false,
  "MP_dim": 10,
  "max_sent_length": 100,
  "batch_size": 60,
  "wo_max_attentive_match": false,
  "wo_char": false,
  "wo_attentive_match": false,
  "char_emb_dim": 20,
  "char_lstm_dim": 100,
  "word_level_MP_dim": -1,
  "base_dir": "./quora"
}
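
One convenient way to try several argument values is to generate one config file per run. The sketch below only assumes the JSON above; the file names and how the training script is pointed at each config are assumptions:

import copy
import json

with open("quora.config") as f:  # hypothetical file name
    base_config = json.load(f)

for rate in (0.1, 0.2, 0.3):
    config = copy.deepcopy(base_config)
    config["dropout_rate"] = rate
    config["suffix"] = "quora_dropout_%s" % rate  # keep each run's output separate
    out_path = "quora_dropout_%s.config" % rate
    with open(out_path, "w") as f:
        json.dump(config, f, indent=2)
    print("wrote", out_path)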

I also did more research into how to transform the CSV file of sentences into the TSV format we will need in order to run it properly through the Sentence Match Decoder.
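
As a rough sketch of that conversion: assuming the decoder expects tab-separated label / sentence1 / sentence2 columns (the exact layout, file names, and CSV column names here are assumptions that should be checked against the decoder's documentation), the CSV could be rewritten like this:

import csv

with open("sentences.csv", newline="") as src, \
        open("sentences.tsv", "w", newline="") as dst:   # hypothetical file names
    reader = csv.DictReader(src)              # expects label,sentence1,sentence2 columns
    writer = csv.writer(dst, delimiter="\t")
    for row in reader:
        # Tabs inside a sentence would break the TSV columns, so replace them.
        sentence1 = row["sentence1"].replace("\t", " ")
        sentence2 = row["sentence2"].replace("\t", " ")
        writer.writerow([row["label"], sentence1, sentence2])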

Week Seven

This week I worked on setting up Python and NumPy on my computer. I also studied a Python implementation of PageRank by diogojc. The way the code should work is that I will be given a matrix of ones and zeros representing relationships between different news stories; the code should then use PageRank to calculate how related each story is to the others, and produce a graph showing how likely it is for a news story to be true.
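
As a rough sketch of the idea (this is not diogojc's code), the core PageRank computation on such a 0/1 relationship matrix could look like this in NumPy:

import numpy as np

def pagerank(adjacency, damping=0.85, tol=1e-8, max_iter=100):
    """adjacency[i, j] = 1 if story j is related (links) to story i, else 0."""
    n = adjacency.shape[0]
    col_sums = adjacency.sum(axis=0)
    # Column-normalize; columns with no links spread their weight evenly.
    M = np.where(col_sums > 0,
                 adjacency / np.where(col_sums == 0, 1, col_sums),
                 1.0 / n)
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_rank = (1 - damping) / n + damping * (M @ rank)
        if np.abs(new_rank - rank).sum() < tol:
            break
        rank = new_rank
    return rank

# Three stories; a 1 means the two stories are related.
relations = np.array([[0, 1, 1],
                      [1, 0, 0],
                      [1, 1, 0]], dtype=float)
print(pagerank(relations))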

I spent the rest of the time reading a paper called "Collective Influence Maximization in Threshold Models of Information Cascading with First-Order Transitions" by Sen Pei, Xian Teng, Jeffrey Shaman, Flaviano Morone, and Hernan A. Makse. The paper talked about collective influence in threshold models, optimal percolation, the CI-TM algorithm, message passing, and linearization. Here is one example the paper gives:

“Subcritical paths and collective influence of spreaders. (a), Three combinations of neighbors P1, P2, and P3 corresponding to V(i->j) in message passing equation. Node i has a threshold mi = 2. The full activation of at least one combination will lead to V(i->j) = 1. (b) For link i->j with an active neighbor k1 and inactive ones k2 and k3, I(k1->i,i->j) = 0 since i has 0 ( < mi – 1) active neighbor excluding k1 and j, while I(k2->i,i->j) = 1 because i has 1 ( = mi -1) active neighbors k1 excluding k2 and j. (c), Illustrations of subcritical paths ending with link i->j for L=0,1,2. Red dots stand for seeds, while squares represent m – 1 active neighbors attached to subcritical nodes. Subcritical paths are highlighted by thick links. (d), The contribution of seed i to ||v(->)|| exerted through subcritical paths of length L=0,1,2. (e), Calculation  method of CI-TM(L) (i).  Subcritical paths starting from i with length l =< L are displayed by different colors. (f), An example of a subcritical cluster. Assuming a uniform threshold m = 3, nodes inside the circle are subcritical since they all have 2 active neighbors, represented by blue nodes. Activation of the red node will trigger a cascade covering all subcritical nodes.”

Next week I plan on getting Python to work on my computer, running the PageRank code, and maybe modifying the code so it works with our project.

Week Six

This week I worked on understanding the influence of all nodes in a network and efficient collective influence maximization in cascading processes with first-order transitions. I read a paper called "Understanding the Spreading Power of all Nodes in a Network: a Continuous-Time Perspective" by Glenn Lawyer. The paper talked about the correlation to epidemic outcomes, weighted graphs, and different networks, and it includes a picture that shows and explains a network of nodes.

I also found Python code on GitHub that calculates PageRank. Next week I plan on installing NumPy, reading a paper on collective influence, and studying the PageRank code.

Week 6-7

My task:

  1. Figure out how to extract the training sentences from the data sets
  2. Determine how to use bag-of-words (scikit-learn) with the training sentences


When I extracted the sentences from the training data sets, the result was an Excel document with two columns. The first column held the individual sentences found in the data set, and the second column held the 1 or 0 that corresponded to each sentence. A sentence labeled with a 1 was "certain," whereas a sentence labeled with a 0 was deemed "uncertain."

Here are some examples of each pulled from the document of extracted sentences:

In nuclear extracts from monocytes or macrophages, induction of NF-KB occurred only if the cells were previously infected with HIV-1.
Label: 1

When U937 cells were infected with HIV-1, no induction of NF-KB factor was detected, whereas high level of progeny virions was produced, suggesting that this factor was not required for viral replication.
Label: 0

As you can see, the first sentence was labeled with a 1, meaning that we determined that sentence was certain. Reading through that sentence, I cannot find any hedge words or phrases that would indicate any level of uncertainty.

On the other hand, the second sentence was labeled with a 0, meaning that we determined that sentence was uncertain. Notice that the word "suggesting" appears near the middle of the sentence. This is a common hedge word. Because this one hedge word was found in the sentence, the sentence as a whole was labeled as uncertain.
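
Purely to illustrate that idea (the real labels come from the annotated training data, and the hedge-word list below is made up for the example), a naive keyword check could look like this:

HEDGE_WORDS = {"suggesting", "may", "might", "possibly", "appears"}

def naive_label(sentence):
    words = sentence.lower().split()
    # 0 = uncertain if any hedge word appears, otherwise 1 = certain.
    return 0 if any(word.strip(".,;") in HEDGE_WORDS for word in words) else 1

print(naive_label("high level of progeny virions was produced, suggesting that "
                  "this factor was not required"))                      # prints 0
print(naive_label("induction of NF-KB occurred only if the cells were previously "
                  "infected with HIV-1"))                               # prints 1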

A total of 14,541 sentences were extracted from the biomedical training data sets.


I am currently working on the code needed to create a bag of words from the sentences that were extracted.
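
A minimal sketch of that step, assuming the extracted sentences were saved to a CSV file with "sentence" and "label" columns (the file name and column names are assumptions):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

data = pd.read_csv("extracted_sentences.csv")    # hypothetical file name

vectorizer = CountVectorizer()                   # default word tokenization
X = vectorizer.fit_transform(data["sentence"])   # sparse matrix of word counts
y = data["label"]                                # 1 = certain, 0 = uncertain

print(X.shape)                       # (number of sentences, vocabulary size)
print(len(vectorizer.vocabulary_))   # number of distinct words in the vocabulary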