Week 20

My Task: To research “locality-sensitive hashing” (LSH).

Locality-Sensitive Hashing (LSH) is an algorithm for solving the approximate or exact Near Neighbor Search in high-dimensional spaces.

It is often used to solve the problem of finding duplicate documents in a list. If documents are small integers, we can use an array to implement a symbol table, by interpreting the document as an array index so that we can store the value associated with document i in array position i. For this, we consider using a hash table.

In computing, a hash table (hash map) is a data structure which implements an associative array abstract data type, a structure that can map keys to values. A hash table uses a hash function to compute an index into an array of buckets or slots, from which the desired value can be found.

However, if we need to find not only exact duplicates but also documents with differences such as typos or different words, the problem becomes much more complex. First, we need to define a method of determining whether a question is a duplicate of another. To do this we can use string metrics, also known as string similarity metrics or string distance functions (i.e. Euclidean distance).

string metric is a metric that measures distance between two text strings for approximate string matching or comparison. It provides a number indicating an algorithm-specefic indication of distance.

The most common distance measurement used the Euclidean distance.

The Euclidean distance between two points is the length of the shortest path connecting them.

The Euclidean distance function can be defined as follows:

Thus, the distance between points (0,3,4,5) and (7,6,3,-1) is 9.74679434. However, Euclidean distance is not useful in the Natural-Language Processing (NLP) field.

In using the LSH method, there is a possibility that two documents that are very similar are not paired together. Likewise, there is also a possibility that two documents that aren’t similar are paired together.