My task: continue to research document similarity techniques
Cosine Similarity vs. Euclidean Distance
I ran an example Python program to understand how the two methods differ in what they measure and how accurate they are.
I began by importing all the necessary libraries and APIs for this code, including the Wikipedia API. Through this, I was able to access the Wikipedia articles on Machine Learning, Artificial Intelligence, Soccer, and Tennis. Based on my selection, I could already assume that the Machine Learning article would be more similar to the Artificial Intelligence article than to Tennis or Soccer, and that Tennis would be more similar to Soccer than to Machine Learning or Artificial Intelligence.
I saved each of the Wikipedia articles in its own variable so that it could be used throughout the program. Then, I used CountVectorizer() to turn each document into a word-count vector for comparison purposes.
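To make the vectorization step concrete, here is a minimal sketch of what turning documents into count vectors looks like. It is not the original program: the four short texts are hypothetical stand-ins for the Wikipedia articles, and the `tokenize` and `vectorize` helpers are my own plain-Python approximation of CountVectorizer's default behavior (lowercasing, keeping tokens of two or more characters).

```python
import re
from collections import Counter

# Hypothetical stand-in texts; the original program used full Wikipedia articles.
docs = {
    "ML": "machine learning algorithms learn patterns from data",
    "AI": "artificial intelligence systems use machine learning and data",
    "soccer": "soccer players kick the ball to score a goal",
    "tennis": "tennis players hit the ball over the net to score",
}

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters,
    # roughly like CountVectorizer's default settings.
    return re.findall(r"\b\w\w+\b", text.lower())

# One shared, sorted vocabulary built from all documents.
vocabulary = sorted({tok for text in docs.values() for tok in tokenize(text)})

def vectorize(text):
    # Count how often each vocabulary term appears in the text.
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]

vectors = {name: vectorize(text) for name, text in docs.items()}

for name, vec in vectors.items():
    print(name, vec)
```

Every document ends up as a vector of the same length (one slot per vocabulary word), which is what makes the distance and similarity comparisons possible.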
I will use the following abbreviations throughout the program:
- ML –> Machine Learning
- AI –> Artificial Intelligence
First, I measured the Euclidean distance from ML to every other article. As expected, the ML article was closer to (more similar to) AI than to tennis.
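As a rough illustration of the Euclidean distance measurement, the sketch below computes it by hand on tiny count vectors. The vectors and the four-word vocabulary are made up for the example and are not the values from my actual run.

```python
import math

def euclidean_distance(a, b):
    # Square root of the summed squared differences between matching counts.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical counts over the vocabulary [learning, data, ball, score].
ml = [3, 2, 0, 0]
ai = [2, 2, 0, 0]
tennis = [0, 0, 3, 2]

ml_to_ai = euclidean_distance(ml, ai)
ml_to_tennis = euclidean_distance(ml, tennis)

print("ML to AI:", ml_to_ai)
print("ML to tennis:", ml_to_tennis)
```

A smaller distance means the two documents are more alike, so in this toy case ML is closer to AI than to tennis.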
Then, I repeated the process with cosine similarity. The result was the same: ML was more similar (closer to 1) to AI than to soccer or tennis. However, the gap was not as large as I expected when comparing ML to either of the sports.
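For comparison, here is a small sketch of cosine similarity computed by hand. Again, the count vectors are hypothetical and only meant to show how the score behaves; the zero-vector check is my own addition to avoid dividing by zero.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat an empty vector as having no similarity
    return dot / (norm_a * norm_b)

# Hypothetical counts over the vocabulary [learning, data, ball, score].
ml = [3, 2, 0, 0]
ai = [2, 2, 0, 0]
tennis = [0, 0, 3, 2]

ml_vs_ai = cosine_similarity(ml, ai)
ml_vs_tennis = cosine_similarity(ml, tennis)

print("ML vs AI:", round(ml_vs_ai, 3))
print("ML vs tennis:", round(ml_vs_tennis, 3))
```

With count vectors the score falls between 0 and 1. Documents that share no words score 0, and documents with the same word proportions score 1.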
After those results, I decided to test all of the articles against the following tweet:
“New research release: overcoming many of Reinforcement Learning’s limitations with Evolution Strategies.”
Of course, this tweet is rather broad, so there is not much we can conclude regarding its similarity to the articles. Using Euclidean distance, I found that this tweet is most similar to ML. However, it is nearly as similar to AI as it is to tennis.
Using cosine similarity, I found that this tweet is also most similar to ML, and that it is equally similar to soccer and tennis.
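To compare a tweet with the articles, the tweet has to be turned into a vector over the same vocabulary as the articles. The sketch below shows that idea under some assumptions: the six-word vocabulary is hypothetical (the real one would come from the four articles), and `vectorize` is my own helper rather than the library call used in the program.

```python
import re
from collections import Counter

# Hypothetical vocabulary; in the real program it would be built from the articles.
vocabulary = ["evolution", "learning", "limitations", "research", "soccer", "tennis"]

tweet = ("New research release: overcoming many of Reinforcement Learning's "
         "limitations with Evolution Strategies.")

def vectorize(text, vocabulary):
    # Count only the words that already exist in the vocabulary;
    # any other word in the text is ignored.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

tweet_vector = vectorize(tweet, vocabulary)
print(tweet_vector)  # [1, 1, 1, 1, 0, 0]
```

Because a tweet is so short, its vector is mostly zeros with a few ones, which is part of why a broad tweet gives little to distinguish between the articles.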
Let’s try something different:
“#LegendsDownUnder The Reds are out for the warm up at the @nibStadium. Not long now until kick-off in Perth.”
At first glance, I can tell this tweet is about sports of some kind.
Using Euclidean distance, I found that this tweet is, surprisingly, more similar to ML than to either of the sports. However, using cosine similarity, I found that this tweet is more similar to the sports than to ML or AI.
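One possible explanation for this mismatch is that Euclidean distance on raw word counts is sensitive to how large the counts are, while cosine similarity only looks at the direction of the vectors. The sketch below illustrates that effect with made-up numbers. I am assuming, purely for illustration, a sports article whose counts are much larger than those of the ML article; I did not verify that this is what happened in my run.

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical counts over the vocabulary [learning, data, warm, kick].
ml_article = [5, 4, 0, 0]        # shares no words with the tweet
soccer_article = [0, 0, 40, 30]  # shares words with the tweet, but has large counts
tweet = [0, 0, 1, 1]

results = {
    name: (euclidean_distance(tweet, vec), cosine_similarity(tweet, vec))
    for name, vec in [("ML", ml_article), ("soccer", soccer_article)]
}

for name, (dist, sim) in results.items():
    print(name, "distance:", round(dist, 2), "cosine:", round(sim, 2))
```

In this toy case the tweet is "closer" to the ML article by Euclidean distance even though they share no words, while cosine similarity correctly favors the soccer article.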
Conclusion: In these tests, cosine similarity was a more accurate measure than Euclidean distance when it comes to document similarity.