Week 23

My task: Continue my research on document similarity methods/techniques.

Jaccard similarity, also known as the Jaccard index, is a measure of how similar two sets are.

Sets: A set is an unordered collection of distinct objects. We write a set as its elements separated by commas inside curly brackets, for example {a, b, c}. Because sets are unordered, {a, b} = {b, a}.
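To make the "unordered" point concrete, here is a small sketch using Python's built-in set type. The variable names are only for illustration.

```python
# Two sets with the same elements written in a different order
first = {"a", "b"}
second = {"b", "a"}

# Order does not matter, so the two sets compare as equal
print(first == second)

# Repeated elements collapse, since a set only holds distinct objects
with_repeat = {"a", "b", "a"}
print(len(with_repeat))
```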

The Jaccard similarity measures similarity between finite sample sets. It is defined as the cardinality of the intersection of the sets divided by the cardinality of their union. So, for two sets A and B, the Jaccard similarity is the ratio |A ∩ B| / |A ∪ B|. The result always lies between 0 (nothing in common) and 1 (identical sets).
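The definition translates almost directly into code. Below is a minimal sketch in Python; the function name jaccard_similarity is my own, and returning 1.0 when both sets are empty is a convention I am assuming here, since the ratio is undefined in that case.

```python
def jaccard_similarity(a: set, b: set) -> float:
    """Return |a ∩ b| / |a ∪ b| for two finite sets."""
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as identical (ratio is undefined)
        return 1.0
    return len(a & b) / len(union)


# 2 shared elements (b, c) out of 4 distinct elements (a, b, c, d)
print(jaccard_similarity({"a", "b", "c"}, {"b", "c", "d"}))
```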

Let’s say we have the sentences:

A = “How can I be a geologist?”, B = “What should I do to be a geologist?”

The intersection of the two sentences would be: A ∩ B = [‘a’, ‘be’, ‘geologist?’, ‘i’]. (The words that they have in common.)

The union of the two sentences would be: A ∪ B = [‘a’, ‘be’, ‘geologist?’, ‘to’, ‘do’, ‘i’, ‘what’, ‘should’, ‘how’, ‘can’]. (The combination of the words from both sentences, with each word counted once.)

The cardinality (the number of elements) of the intersection of A and B is: |A ∩ B| = 4. Likewise, the cardinality of the union of A and B is: |A ∪ B| = 10.

Thus, the Jaccard similarity of the sentences A and B is:

|A ∩ B| / |A ∪ B| = 4 / 10 = 0.40.
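The same calculation can be checked with a few lines of Python. This is only a sketch: the tokenize helper is hypothetical, and it assumes words are lowercased and split on whitespace, so punctuation stays attached to the word (which is why ‘geologist?’ shows up as a single token).

```python
sentence_a = "How can I be a geologist?"
sentence_b = "What should I do to be a geologist?"


def tokenize(sentence: str) -> set:
    # Assumption: lowercase, then split on whitespace; punctuation is not stripped
    return set(sentence.lower().split())


tokens_a = tokenize(sentence_a)
tokens_b = tokenize(sentence_b)

intersection = tokens_a & tokens_b  # words the sentences share
union = tokens_a | tokens_b         # every distinct word from both sentences

similarity = len(intersection) / len(union)

print(sorted(intersection))
print(len(intersection), len(union))
print(similarity)
```

A different tokenizer (for example, one that strips punctuation) could produce different sets, so the score depends on how the sentences are split into words.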

I will continue to do research on this method, and soon test to see which method is more accurate: Jaccard similarity or cosine similarity. I will report my findings in the next blog post.