Week 21

My task: To research document similarity techniques.

Similarity is the measure of how much alike two data objects are. Similarity in a data mining context is usually described as a distance with dimensions representing features of the objects. If this distance is small, there will be high degree of similarity; if a distance is large, there will be low degree of similarity. Document similarity is the distance between documents.

Similarity is measured in the range 0 to 1 [0,1]. If the two objects in question are X and Y, 

  • Similarity = 1, means X = Y
  • Similarity =0, means X ≠ Y.

Usually, documents treated as similar if they are semantically close and describe similar concepts. On the other hand, “similarity” can be used in the context of duplicate detection.

The text2vec package provides 2 set of functions for measuring various distances/similarity in a unified way. All methods are written with special attention to computational performance and memory efficiency. This package includes:

***Note: Methods have suffix 2 in their names because in contrast to base dist() function they work with two matrices instead of one.

The five most popular methods for detecting similarity measures are:

  1. Euclidean distance
  2. Cosine similarity
  3. Manhattan distance
  4. Minkowski distance
  5. Jaccard similarity


I talked about Euclidean distance in the last blog post.

Cosine similarity is generally used as a metric for measuring distance when the magnitude of the vectors does not matter.

This happens for example when working with text data represented by word counts. We could assume that when a word (e.g. science) occurs more frequent in document 1 than it does in document 2, that document 1 is more related to the topic of science. However, it could also be the case that we are working with documents of uneven lengths (Wikipedia articles for example). Then, science probably occurred more in document 1 just because it was way longer than document 2. Cosine similarity corrects for this. Text data is the most typical example for when to use this metric. However, you might also want to apply cosine similarity to other cases where some properties of the instances make so that the weights might be larger without meaning anything different. Sensor values that were captured in various lengths (in time) between instances could be such an example.



Implementing the Five Most Popular Similarity Measures in Python