Week 6-7

My tasks:

  1. Figure out how to extract the training sentences from the data sets
  2. Determine how to use bag-of-words (scikit-learn) with the training sentences


Extracting the sentences from the training data sets produced an Excel document with two columns. The first column holds the individual sentences found in the data set, and the second column holds a 1 or a 0 for each sentence. A sentence labeled with a 1 is “certain,” whereas a sentence labeled with a 0 is “uncertain.”
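To make the two-column layout concrete, here is a minimal sketch of how the extracted sentences and labels might be read back in. It assumes the spreadsheet has been exported to CSV with a header row; the column names `sentence` and `label`, the `SAMPLE` text, and the helper `load_labeled_sentences` are all made up for illustration and are not taken from the actual file.

```python
import csv
import io

# Hypothetical CSV export of the two-column spreadsheet.
# The header names "sentence" and "label" are assumptions for this sketch.
SAMPLE = '''sentence,label
"In nuclear extracts from monocytes or macrophages, induction of NF-KB occurred only if the cells were previously infected with HIV-1.",1
"When U937 cells were infected with HIV-1, no induction of NF-KB factor was detected, whereas high level of progeny virions was produced, suggesting that this factor was not required for viral replication.",0
'''


def load_labeled_sentences(handle):
    """Return parallel lists of sentences and integer labels.

    Label convention from the post: 1 = certain, 0 = uncertain.
    """
    sentences, labels = [], []
    for row in csv.DictReader(handle):
        sentences.append(row["sentence"].strip())
        labels.append(int(row["label"]))
    return sentences, labels


# An in-memory file stands in for the real export here.
sentences, labels = load_labeled_sentences(io.StringIO(SAMPLE))
print(len(sentences), labels)
```

With a real export, the `io.StringIO(SAMPLE)` argument would be replaced by an opened file, for example `open(path, newline="", encoding="utf-8")`.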

Here is one example of each, pulled from the document of extracted sentences:

In nuclear extracts from monocytes or macrophages, induction of NF-KB occurred only if the cells were previously infected with HIV-1.
When U937 cells were infected with HIV-1, no induction of NF-KB factor was detected, whereas high level of progeny virions was produced, suggesting that this factor was not required for viral replication.

The first sentence was labeled with a 1, meaning that we determined it to be certain. Reading through that sentence, I cannot find any hedge words or phrases that would indicate any level of uncertainty.

On the other hand, the second sentence was labeled with a 0, meaning that we determined it to be uncertain. Notice the word “suggesting” near the middle of the sentence. It is a common hedge word. Because this one hedge word appears in the sentence, the sentence as a whole was labeled as uncertain.
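The idea that a single hedge word is enough to make a whole sentence uncertain can be sketched as a simple rule. This is only an illustration of that reasoning, not how the labels in the data set were produced. The cue list `HEDGE_CUES` is a tiny made-up sample, and the function names are hypothetical.

```python
import re

# Hypothetical, deliberately tiny list of hedge cues for illustration only.
HEDGE_CUES = {"suggest", "suggests", "suggesting", "may", "might", "possibly", "appears"}


def contains_hedge(sentence, cues=HEDGE_CUES):
    """Return True if any lowercase word in the sentence is a hedge cue."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return any(token in cues for token in tokens)


def rule_label(sentence):
    """Apply the post's convention: 0 = uncertain (hedged), 1 = certain."""
    return 0 if contains_hedge(sentence) else 1


certain_example = (
    "In nuclear extracts from monocytes or macrophages, induction of NF-KB "
    "occurred only if the cells were previously infected with HIV-1."
)
uncertain_example = (
    "When U937 cells were infected with HIV-1, no induction of NF-KB factor was "
    "detected, whereas high level of progeny virions was produced, suggesting "
    "that this factor was not required for viral replication."
)

print(rule_label(certain_example), rule_label(uncertain_example))
```

A word-list rule like this will miss multi-word hedge phrases and will flag cue words used in a non-hedging sense, which is one reason to train a classifier on the labeled sentences instead.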

A total of 14,541 sentences were extracted from the biomedical training data sets.

I am currently working on the code needed to create a bag of words from the sentences that were extracted.
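As a starting point, here is a minimal sketch of what a bag of words might look like with scikit-learn's `CountVectorizer`. It uses only the two example sentences above, and the default tokenization and lowercasing are assumptions, not settings I have settled on.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The two example sentences from the extracted document.
sentences = [
    "In nuclear extracts from monocytes or macrophages, induction of NF-KB "
    "occurred only if the cells were previously infected with HIV-1.",
    "When U937 cells were infected with HIV-1, no induction of NF-KB factor was "
    "detected, whereas high level of progeny virions was produced, suggesting "
    "that this factor was not required for viral replication.",
]

# Default settings: lowercase the text and keep tokens of two or more
# alphanumeric characters.
vectorizer = CountVectorizer()

# X is a sparse matrix with one row per sentence and one column per word.
X = vectorizer.fit_transform(sentences)

# vocabulary_ maps each word to its column index in X.
column = vectorizer.vocabulary_["suggesting"]

print(X.shape)
print(X[0, column], X[1, column])
```

Each row of `X` is the word-count vector for one sentence, so the column for “suggesting” is zero for the certain sentence and nonzero for the uncertain one. The same `fit_transform` call should work on the full list of extracted sentences.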