Week 1

My task: To research and determine an ideal way to detect and extract hedges from a document.

A hedge, or hedge cue, is a word, phrase, or statement that signals an attitude of uncertainty or speculation.

Most hedges fall under the following categories:

Adjectives or adverbs: probable/probably, likely, possible, unsure, unknown, unclear, predicted, etc.
Auxiliaries: may, might, could, etc.
Conjunctions: either…or, etc.
Verbs of hedging: suggest, suspect, indicate, suppose, seem, appear, etc.

Hedge detection refers to the task of identifying these hedges in a text, which is particularly useful in distinguishing between certain, factual information and uncertain, nonfactual information.
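As a first experiment, here is a minimal lexicon-lookup sketch of the idea in Python. The cue list is an illustrative sample drawn from the categories above, not a complete lexicon, and this is far simpler than any real detection system:

```python
import re

# Illustrative hedge-cue lexicon drawn from the categories above
# (a small sample for demonstration, not an exhaustive list).
HEDGE_CUES = {
    "probable", "probably", "likely", "possible", "unsure", "unknown",
    "unclear", "predicted", "may", "might", "could", "suggest",
    "suggests", "suspect", "indicate", "indicates", "suppose",
    "seem", "seems", "appear", "appears",
}

def find_hedges(sentence):
    """Return the hedge cues found in a sentence via simple lexicon lookup."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [tok for tok in tokens if tok in HEDGE_CUES]

print(find_hedges("These results suggest the protein may regulate growth."))
# ['suggest', 'may']
```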

I found that the most widely recognized hedge detection system was developed by Maria Georgescul and presented at the Conference on Computational Natural Language Learning (CoNLL) 2010 Shared Task, which focused on hedging in the domain of biomedical literature. We proposed to reimplement her system in order to extract hedge cues and assertions from our data set.

Once the words in question are extracted, the next step is to determine entailment and contradiction relationships. Through my research, I think the word-space approach known as random indexing could accomplish just that! The process of random indexing can be broken down into three stages.

Stage 1: Each word in the data set is assigned a unique, random representation called an index vector. Index vectors are sparse: a small number of randomly placed +1s and −1s, with the remaining elements set to 0.
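To make stage 1 concrete, here is a small sketch of index-vector generation. The dimensionality and number of non-zero entries are assumptions for illustration; real systems tune these:

```python
import numpy as np

DIM = 2000      # vector dimensionality (assumed for illustration)
NONZERO = 10    # non-zero entries per index vector (assumed)
rng = np.random.default_rng(seed=0)

def make_index_vector():
    """Stage 1: a sparse ternary vector with a few random +/-1 entries."""
    vec = np.zeros(DIM)
    positions = rng.choice(DIM, size=NONZERO, replace=False)
    vec[positions] = rng.choice([1.0, -1.0], size=NONZERO)
    return vec
```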

Stage 2: The document is scanned and context vectors are produced, each representing the contexts (such as a window of surrounding words, or a whole document) in which a word has occurred.

Stage 3: The values in the index and context vectors are assembled into what is known as a co-occurrence matrix, wherein each row represents a unique word and each column represents a unique context.
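Continuing the sketch, stages 2 and 3 might look like this. A sliding co-occurrence window is one common choice of context; the window size here is an assumption on my part:

```python
import numpy as np

DIM, NONZERO, WINDOW = 2000, 10, 2   # assumed hyperparameters
rng = np.random.default_rng(seed=0)

def make_index_vector():
    vec = np.zeros(DIM)
    positions = rng.choice(DIM, size=NONZERO, replace=False)
    vec[positions] = rng.choice([1.0, -1.0], size=NONZERO)
    return vec

def build_context_vectors(tokens):
    """Stage 2: a word's context vector is the sum of the index vectors
    of the words that co-occur with it inside a sliding window."""
    index = {w: make_index_vector() for w in set(tokens)}
    context = {w: np.zeros(DIM) for w in set(tokens)}
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)
        for j in range(lo, hi):
            if j != i:
                context[w] += index[tokens[j]]
    return context

# Stage 3: stacking the context vectors gives the co-occurrence matrix,
# one row per unique word.
tokens = "the results may indicate risk and could indicate harm".split()
context = build_context_vectors(tokens)
matrix = np.vstack([context[w] for w in sorted(context)])
print(matrix.shape)   # (number of unique words, DIM)
```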

Now, herein lies the beauty of this process. Wouldn’t you say that if two words occur in the same or similar contexts, it is safe to assume they have the same or similar meanings? I think so! In fact, this is known as the distributional hypothesis: words with similar meanings tend to occur in similar contexts. The same rule applies within random indexing: words occurring in similar contexts will have similar vectors. To determine whether two words are similar, we simply calculate the cosine of the angle between their context vectors. The closer that value is to 1, the more similar the contexts in which the two words appear.
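In other words, the measure is cos θ = (u · v) / (‖u‖‖v‖). A small helper makes this concrete:

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors: u.v / (|u||v|).
    Values near 1 mean the words occurred in very similar contexts."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

print(cosine(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # ~0.707
```

Applied to the toy tokens from the earlier sketch, cosine(context["may"], context["could"]) should come out noticeably higher than for an unrelated pair, since both words occur near "indicate" and "risk".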

In our case, we will be working with statements rather than individual words. If we assign two statements their own context vectors and we know one statement to be true, then high vector similarity gives us evidence that the other statement is written in the same context and holds the same truth value as the first.
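To sketch how that might look, one simple composition scheme (an assumption on my part, not an established part of random indexing) is to average the context vectors of a statement's words and compare the results with the same cosine measure:

```python
import numpy as np

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def statement_vector(tokens, context):
    """Average the context vectors of a statement's words
    (a simple composition choice assumed for illustration)."""
    vecs = [context[w] for w in tokens if w in context]
    return np.mean(vecs, axis=0)

# With `context` built as in the earlier sketch:
# a = statement_vector("results may indicate risk".split(), context)
# b = statement_vector("results could indicate harm".split(), context)
# A high cosine(a, b) would be taken as evidence that the two statements
# share a context, per the working assumption above.
```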