Week 21

My task: To research document similarity techniques.

Similarity is the measure of how much alike two data objects are. Similarity in a data mining context is usually described as a distance with dimensions representing features of the objects. If this distance is small, there will be high degree of similarity; if a distance is large, there will be low degree of similarity. Document similarity is the distance between documents.

Similarity is measured in the range 0 to 1 [0,1]. If the two objects in question are X and Y, 

  • Similarity = 1, means X = Y
  • Similarity =0, means X ≠ Y.

Usually, documents treated as similar if they are semantically close and describe similar concepts. On the other hand, “similarity” can be used in the context of duplicate detection.

The text2vec package provides 2 set of functions for measuring various distances/similarity in a unified way. All methods are written with special attention to computational performance and memory efficiency. This package includes:

***Note: Methods have suffix 2 in their names because in contrast to base dist() function they work with two matrices instead of one.

The five most popular methods for detecting similarity measures are:

  1. Euclidean distance
  2. Cosine similarity
  3. Manhattan distance
  4. Minkowski distance
  5. Jaccard similarity


I talked about Euclidean distance in the last blog post.

Cosine similarity is generally used as a metric for measuring distance when the magnitude of the vectors does not matter.

This happens for example when working with text data represented by word counts. We could assume that when a word (e.g. science) occurs more frequent in document 1 than it does in document 2, that document 1 is more related to the topic of science. However, it could also be the case that we are working with documents of uneven lengths (Wikipedia articles for example). Then, science probably occurred more in document 1 just because it was way longer than document 2. Cosine similarity corrects for this. Text data is the most typical example for when to use this metric. However, you might also want to apply cosine similarity to other cases where some properties of the instances make so that the weights might be larger without meaning anything different. Sensor values that were captured in various lengths (in time) between instances could be such an example.



Implementing the Five Most Popular Similarity Measures in Python




Week 20

My Task: To research “locality-sensitive hashing” (LSH).

Locality-Sensitive Hashing (LSH) is an algorithm for solving the approximate or exact Near Neighbor Search in high-dimensional spaces.

It is often used to solve the problem of finding duplicate documents in a list. If documents are small integers, we can use an array to implement a symbol table, by interpreting the document as an array index so that we can store the value associated with document i in array position i. For this, we consider using a hash table.

In computing, a hash table (hash map) is a data structure which implements an associative array abstract data type, a structure that can map keys to values. A hash table uses a hash function to compute an index into an array of buckets or slots, from which the desired value can be found.

However, if we need to find not only exact duplicates but also documents with differences such as typos or different words, the problem becomes much more complex. First, we need to define a method of determining whether a question is a duplicate of another. To do this we can use string metrics, also known as string similarity metrics or string distance functions (i.e. Euclidean distance).

string metric is a metric that measures distance between two text strings for approximate string matching or comparison. It provides a number indicating an algorithm-specefic indication of distance.

The most common distance measurement used the Euclidean distance.

The Euclidean distance between two points is the length of the shortest path connecting them.

The Euclidean distance function can be defined as follows:

Thus, the distance between points (0,3,4,5) and (7,6,3,-1) is 9.74679434. However, Euclidean distance is not useful in the Natural-Language Processing (NLP) field.

In using the LSH method, there is a possibility that two documents that are very similar are not paired together. Likewise, there is also a possibility that two documents that aren’t similar are paired together. 




Week Twenty-Two

This week I worked on figuring out how to build a certain graph we need for our project. At the moment, I am trying to build a 3D network graph. I installed Plotly on my computer, and now I am trying to learn how to input a network of nodes.

Next week, I will continue working on building the graph.

Week 19

No progress was made this week. This was the first week of school and a very hectic week for Resident Assistants because new and old residents are in the process of moving into the dorm and many programs are done in the first week to welcome everyone back to campus. Progress will resume next week.