Week 23

My task: Continue my research on document similarity methods/techniques.

Jaccard similarity, also known as the Jaccard index, is a measure of similarity between sets.

Sets: A set is an unordered collection of objects, written as elements separated by commas inside curly brackets, e.g. {a, b, c}. Because sets are unordered, {a, b} = {b, a}.

The Jaccard similarity measures the similarity between finite sample sets and is defined as the cardinality of the intersection of the sets divided by the cardinality of their union. For two sets A and B, it is the ratio of |A ∩ B| to |A ∪ B|.

Let’s say we have the sentences:

A = “How can I be a geologist?”, B = “What should I do to be a geologist?”

The intersection of the two sentences would be: A ∩ B = {‘a’, ‘be’, ‘geologist?’, ‘i’}. (The words that they have in common.)

The union of the two sentences would be: A ∪ B = {‘a’, ‘be’, ‘geologist?’, ‘to’, ‘do’, ‘i’, ‘what’, ‘should’, ‘how’, ‘can’}. (The combination of the words from both sentences.)

The cardinality (the number of elements) of the intersection of A and B would be: |A ∩ B| = 4. Likewise, the cardinality of the union of A and B would be: |A ∪ B| = 10.

Thus, the Jaccard similarity of the sentences A and B is:

|A ∩ B| / |A ∪ B| = 4 / 10 = 0.40.
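The calculation above can be sketched in Python. This is a minimal sketch that uses the same tokenization as the word lists above: lowercase the sentences and split on whitespace, so the trailing “?” stays attached to “geologist?”.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity of two sentences, each treated as a set of words."""
    set_a = set(a.lower().split())
    set_b = set(b.lower().split())
    # |A ∩ B| / |A ∪ B|
    return len(set_a & set_b) / len(set_a | set_b)

a = "How can I be a geologist?"
b = "What should I do to be a geologist?"
print(jaccard_similarity(a, b))  # → 0.4
```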

I will continue to do research on this method, and soon test to see which method is more accurate: Jaccard similarity or cosine similarity. I will report my findings in the next blog post.

Week 22

My task: Continue to research document similarity techniques.

Cosine Similarity vs. Euclidean Distance

I ran example Python code to try to understand the measurement and accuracy differences between the two methods.

I began by importing all the necessary libraries and APIs for this code, including the Wikipedia API. Through this, I was able to access the Wikipedia articles on Machine Learning, Artificial Intelligence, Soccer, and Tennis. Based on my selection, I could already make the assumptions that the Machine Learning article would be more similar to the Artificial Intelligence article than to the Tennis or Soccer articles, and that Tennis would be more similar to Soccer than to Machine Learning or Artificial Intelligence.

I saved each of the Wikipedia articles in its own variable so that it could be used throughout the program. Then, I used CountVectorizer() to turn each of the documents into its own vector for comparison purposes.

I will use the following abbreviations throughout the program:

• ML → Machine Learning
• AI → Artificial Intelligence

First, I measured the Euclidean distance from ML to every other article. As expected, the ML article was closer (more similar) to AI than to tennis.

Then, I repeated the process with cosine similarity. The result was the same: ML was more similar (closer to 1) to AI than to soccer or tennis. However, the gap was not as large as I expected when comparing ML to either of the sports.
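The vectorize-and-compare step can be sketched as follows. Since the full Wikipedia texts are far too long to reproduce here, the short strings below are invented placeholder stand-ins for the four articles; scikit-learn is assumed to be installed.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

# Placeholder stand-ins for the four articles (order: ML, AI, soccer, tennis).
docs = [
    "machine learning algorithms build a model from training data",
    "artificial intelligence machines use learning and a data model",
    "soccer players kick the ball toward the goal on the field",
    "tennis players hit the ball over the net with a racket",
]

# One word-count vector per document.
vectors = CountVectorizer().fit_transform(docs)

# Row 0 is ML; compare it against all four documents (itself included).
print(euclidean_distances(vectors[0], vectors))  # smaller = closer
print(cosine_similarity(vectors[0], vectors))    # closer to 1 = more similar
```

Even on these toy stand-ins, ML comes out closer to AI than to either sport under both measures.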

After those results, I decided to test all the articles now against the following tweet:

“New research release: overcoming many of Reinforcement Learning’s limitations with Evolution Strategies.”

Of course, this tweet is rather broad, so there is not much we can conclude regarding its similarity to the articles. Using Euclidean distance, I found that this tweet is most similar to ML. However, it is nearly as similar to AI as it is to tennis.

Using cosine similarity, I found that this tweet is most similar to ML as well, and it is equally similar to soccer and tennis.

Let’s try something different:

“#LegendsDownUnder The Reds are out for the warm up at the @nibStadium. Not long now until kick-off in Perth.”

At first glance, I can tell this tweet is about sports of some kind.

Using Euclidean distance, I found that this tweet is, surprisingly, more similar to ML than to either of the sports. However, using cosine similarity, I found that it is more similar to the sports than to ML or AI.

Conclusion: Cosine similarity is a more accurate measurement than Euclidean distance when it comes to document similarity.

https://cmry.github.io/notes/euclidean-v-cosine

Week 21

My task: To research document similarity techniques.

Similarity is the measure of how much alike two data objects are. In a data mining context, similarity is usually described as a distance whose dimensions represent features of the objects. If this distance is small, there is a high degree of similarity; if the distance is large, there is a low degree of similarity. Document similarity is the distance between documents.

Similarity is measured in the range [0, 1]. If the two objects in question are X and Y:

• Similarity = 1 means X = Y.
• Similarity = 0 means X ≠ Y.

Usually, documents are treated as similar if they are semantically close and describe similar concepts. On the other hand, “similarity” can also be used in the context of duplicate detection.

The text2vec package provides two sets of functions for measuring various distances/similarities in a unified way. All methods are written with special attention to computational performance and memory efficiency.

***Note: The methods have the suffix `2` in their names because, in contrast to the base `dist()` function, they work with two matrices instead of one.

The five most popular methods for detecting similarity measures are:

1. Euclidean distance
2. Cosine similarity
3. Manhattan distance
4. Minkowski distance
5. Jaccard similarity

I talked about Euclidean distance in the last blog post.

Cosine similarity is generally used as a metric for measuring distance when the magnitude of the vectors does not matter.

This happens, for example, when working with text data represented by word counts. We could assume that when a word (e.g. “science”) occurs more frequently in document 1 than in document 2, document 1 is more related to the topic of science. However, we could also be working with documents of uneven lengths (Wikipedia articles, for example). Then, “science” probably occurred more often in document 1 simply because it was much longer than document 2. Cosine similarity corrects for this.

Text data is the most typical example of when to use this metric. However, you might also want to apply cosine similarity to other cases where some property of the instances means the weights can be larger without meaning anything different. Sensor values captured over varying lengths of time would be one such example.
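A small NumPy sketch of this point: repeating a document's word counts leaves the cosine similarity unchanged, while the Euclidean distance shifts with document length. The vocabulary and counts below are invented for illustration.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Invented word-count vectors over a shared vocabulary, e.g. ["science", "data", "soccer"].
doc1 = np.array([4, 2, 0])
doc2 = np.array([2, 1, 0])   # same word proportions as doc1, half as long
doc2_repeated = 4 * doc2     # the same short text repeated four times

print(cosine(doc1, doc2))            # ≈ 1.0: the direction is identical
print(cosine(doc1, doc2_repeated))   # still ≈ 1.0: length does not matter
print(np.linalg.norm(doc1 - doc2))           # Euclidean distance, however,
print(np.linalg.norm(doc1 - doc2_repeated))  # changes with document length
```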

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.euclidean_distances.html

Implementing the Five Most Popular Similarity Measures in Python

Week 20

My Task: To research “locality-sensitive hashing” (LSH).

Locality-Sensitive Hashing (LSH) is an algorithm for solving the approximate or exact Near Neighbor Search in high-dimensional spaces.

It is often used to solve the problem of finding duplicate documents in a list. If documents were small integers, we could use an array as a symbol table, interpreting each document as an array index so that the value associated with document i is stored in array position i. Since documents are not small integers, we consider using a hash table instead.

In computing, a hash table (hash map) is a data structure that implements the associative array abstract data type, a structure that maps keys to values. A hash table uses a hash function to compute an index into an array of buckets or slots, from which the desired value can be found.
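A toy sketch of that idea (Python's built-in dict is already a hash table; this hypothetical version just makes the hash-to-bucket-index step explicit):

```python
# A toy hash table with chaining: the hash function maps a key to a bucket
# index, and each bucket holds a list of [key, value] pairs.
NUM_BUCKETS = 8
buckets = [[] for _ in range(NUM_BUCKETS)]

def put(key, value):
    idx = hash(key) % NUM_BUCKETS        # hash function -> bucket index
    for pair in buckets[idx]:
        if pair[0] == key:               # key already present: overwrite
            pair[1] = value
            return
    buckets[idx].append([key, value])

def get(key):
    idx = hash(key) % NUM_BUCKETS        # same index computation on lookup
    for k, v in buckets[idx]:
        if k == key:
            return v
    raise KeyError(key)

put("doc-42", "the text of document 42")
print(get("doc-42"))  # → the text of document 42
```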

However, if we need to find not only exact duplicates but also documents with differences such as typos or different words, the problem becomes much more complex. First, we need to define a method of determining whether one document is a duplicate of another. To do this, we can use string metrics, also known as string similarity metrics or string distance functions (e.g. Euclidean distance).

A string metric is a metric that measures the distance between two text strings for approximate string matching or comparison. It provides a number giving an algorithm-specific indication of distance.

The most commonly used distance measurement is the Euclidean distance.

The Euclidean distance between two points is the length of the shortest path connecting them.

The Euclidean distance function can be defined as follows:

d(p, q) = √((p₁ − q₁)² + (p₂ − q₂)² + … + (pₙ − qₙ)²)

Thus, the distance between the points (0, 3, 4, 5) and (7, 6, 3, −1) is √95 ≈ 9.74679434. However, Euclidean distance is not useful in the Natural Language Processing (NLP) field.
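The worked example can be checked in Python; this is a direct transcription of the distance formula above:

```python
import math

p = (0, 3, 4, 5)
q = (7, 6, 3, -1)

# Sum the squared differences in each coordinate, then take the square root.
distance = math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))
print(distance)  # ≈ 9.74679434 (the square root of 95)
```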

In using the LSH method, there is a possibility that two documents that are very similar are not paired together. Likewise, there is also a possibility that two documents that aren’t similar are paired together.

Week 19

No progress was made this week. This was the first week of school and a very hectic week for Resident Assistants because new and old residents are in the process of moving into the dorm and many programs are done in the first week to welcome everyone back to campus. Progress will resume next week.