NearPy is a small open-source framework for ANN search in large, high-dimensional data sets. It is written in Python and builds on the well-known scientific computing libraries NumPy and SciPy.
ANN stands for Approximate Nearest Neighbour search, a technique used in pattern-matching applications such as image retrieval, text mining, recommendation systems, and audio search. The databases behind these applications are usually very large, so finding the items most similar to a query with an exact nearest neighbour search is slow. ANN methods based on locality-sensitive hashing return close, though not always exact, matches in considerably less time.
To index and search vectors, NearPy provides the Engine, a modular pipeline built from four types of objects.
- Hashes. The input to the pipeline is a vector, and each hash generates one or more bucket keys from it. When indexing a data set, NearPy stores the vector in the bucket for each of those keys. When searching, the framework gathers neighbour candidates from all the buckets that match the query's keys. Hashes are usually locality-sensitive, meaning they preserve spatial structure: vectors that are close to each other tend to land in the same buckets.
- Storage. Storage adapters store and return bucket contents. NearPy uses in-memory storage by default; a Redis adapter is also available.
- Distance. When the locality-sensitivity of the hashes is not precise enough on its own, the distance to the query vector is computed for every gathered candidate. The distance measure is configurable: Euclidean, angular, or a custom implementation.
- Filter (optional). Filters receive a list of tuples, either (vector, data) or, if a distance was used in the pipeline, (vector, data, distance), and return a filtered list. What a filter does depends on its implementation. NearPy offers three: NearestFilter, DistanceThresholdFilter and UniqueFilter.
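The four stages above can be sketched in plain Python. This is a toy illustration that uses only the standard library, not NearPy's own implementation: `ToyEngine`, `make_random_planes` and `bucket_key` are hypothetical names, random hyperplane projections are assumed as the locality-sensitive hash, and the `data` attached to each vector is assumed to be hashable.

```python
import math
import random
from collections import defaultdict


def make_random_planes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplane normals (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def bucket_key(planes, vector):
    """Hash stage: one bit per hyperplane, '1' if the vector is on its positive side."""
    return "".join(
        "1" if sum(p * x for p, x in zip(plane, vector)) >= 0 else "0"
        for plane in planes
    )


class ToyEngine:
    """Toy pipeline: hashes -> storage -> distance -> filter (an assumption-laden sketch)."""

    def __init__(self, dim, n_bits=4, n_hashes=3, n_nearest=2):
        # Several independent hashes, so each vector gets several bucket keys.
        self.hashes = [make_random_planes(dim, n_bits, seed=i) for i in range(n_hashes)]
        # Storage stage: in-memory buckets, one table per hash.
        self.buckets = [defaultdict(list) for _ in self.hashes]
        self.n_nearest = n_nearest

    def store_vector(self, vector, data):
        # Index the vector under the bucket key produced by every hash.
        for planes, table in zip(self.hashes, self.buckets):
            table[bucket_key(planes, vector)].append((vector, data))

    def neighbours(self, query):
        # Gather candidates from the buckets the query falls into,
        # keeping one entry per data item (a simple uniqueness filter).
        candidates = {}
        for planes, table in zip(self.hashes, self.buckets):
            for vector, data in table.get(bucket_key(planes, query), []):
                candidates[data] = vector
        # Distance stage: Euclidean distance from each candidate to the query.
        scored = [(v, d, math.dist(v, query)) for d, v in candidates.items()]
        # Filter stage: keep only the n_nearest closest candidates.
        scored.sort(key=lambda item: item[2])
        return scored[: self.n_nearest]


engine = ToyEngine(dim=3)
engine.store_vector([1.0, 0.0, 0.0], "a")
engine.store_vector([0.9, 0.1, 0.0], "b")
engine.store_vector([-1.0, 0.0, 0.0], "c")

# "a" comes back first with distance 0.0: a vector always hashes to its own buckets.
result = engine.neighbours([1.0, 0.0, 0.0])
```

Only the vectors that share a bucket with the query are ever compared to it, which is where the time saving comes from. The price is that a true neighbour can be missed when no hash places it in the same bucket as the query.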
NearPy is a modular Python framework that applies ANN search methods to different types of data. It comes with experiment classes for evaluating different engine configurations, hashes, distances and filters on a given data set. For more information, visit the NearPy website.
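To give a feel for what assessing an engine configuration can involve, here is a minimal sketch that compares approximate results against an exact linear scan. It does not use NearPy's experiment classes. The helper names (`exact_neighbours`, `recall`, `precision`) are hypothetical, and the choice of recall and precision as the metrics is an assumption.

```python
import math


def exact_neighbours(query, items, k):
    """Ground truth: linear scan over (vector, data) pairs, returning the k closest data items."""
    ranked = sorted(items, key=lambda item: math.dist(item[0], query))
    return [data for _, data in ranked[:k]]


def recall(approx_ids, exact_ids):
    """Fraction of the true neighbours that the approximate search returned."""
    exact = set(exact_ids)
    if not exact:
        return 1.0  # convention: nothing to find, so nothing was missed
    return len(exact & set(approx_ids)) / len(exact)


def precision(approx_ids, exact_ids):
    """Fraction of the returned items that are true neighbours."""
    approx = set(approx_ids)
    if not approx:
        return 0.0  # convention: an empty result contains no correct items
    return len(approx & set(exact_ids)) / len(approx)


items = [([0.0, 0.0], "o"), ([1.0, 0.0], "x"), ([5.0, 5.0], "far")]
truth = exact_neighbours([0.9, 0.0], items, k=2)

# Pretend an approximate engine returned "x" plus one wrong item.
approx = ["x", "far"]
r = recall(approx, truth)
p = precision(approx, truth)
```

Running the same comparison over many queries and several engine configurations shows how much accuracy each configuration gives up in exchange for speed.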