Usage

Use the following methods to fill, update, reset and search the index.

class elastichash.ElasticHash(es: Elasticsearch, additional_fields: List[str] = ['image_path'], index_prefix: str = 'eh', shards=1, replicas=1, radius=2)

__init__(es: Elasticsearch, additional_fields: List[str] = ['image_path'], index_prefix: str = 'eh', shards=1, replicas=1, radius=2)

Extend Elasticsearch for efficient similarity search based on binary codes.

Parameters:

es (elasticsearch.Elasticsearch) -- instance of an elasticsearch.Elasticsearch client
additional_fields (List[str]) -- list of additional fields used in the index, defaults to ["image_path"]
index_prefix (str) -- a prefix used for ElasticHash indices and functions, defaults to "eh"
shards (int) -- number of shards for ElasticHash indices, defaults to 1
replicas (int) -- number of replicas for ElasticHash indices, defaults to 1
radius (int) -- radius for subcode, defaults to 2
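A minimal construction sketch, assuming the elasticsearch Python client is installed and a node is reachable at http://localhost:9200. The URL and the helper name make_index are illustrative assumptions, not part of the library. The imports sit inside the function so that the snippet can be loaded without a running cluster.

```python
def make_index(es_url="http://localhost:9200"):
    """Create an ElasticHash wrapper around an Elasticsearch client.

    The URL is a placeholder and make_index is a hypothetical helper.
    """
    # Imported lazily so that defining the helper needs no server or packages.
    from elasticsearch import Elasticsearch
    from elastichash import ElasticHash

    es = Elasticsearch(es_url)
    # Keyword arguments mirror the documented signature and its defaults.
    return ElasticHash(
        es,
        additional_fields=["image_path"],
        index_prefix="eh",
        shards=1,
        replicas=1,
        radius=2,
    )
```

Calling make_index() returns the object on which the methods below (add, add_bulk, search, ...) are invoked.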

add(vec: str | ndarray | List[int], additional_fields: Dict[str, Any] = None)

Add a code to the index, optionally along with additional fields. The code must be 256 bits long and can be given as a string of 0 and 1 characters, a list of int, or a numpy array. Alternatively, a code can be represented as a list or numpy array of 4 integer values.

Usage examples:

add(vec='010...0', additional_fields={'image_path': '/path/to/image.jpg'})
add(vec=[0,1,0,...,0])
add(np.array([0,1,0,...,0]))
add(np.array([10,20,-10,-20]))

Parameters:

vec (Union[str, np.ndarray, List[int]]) -- a binary code of length 256, or represented as 4 integers
additional_fields (Dict[str, Any]) -- a dictionary of field name and value pairs that should also be stored in the index
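The relation between the 256-bit form and the 4-integer form can be sketched as follows. This is an illustration only: it assumes each integer holds 64 consecutive bits, most significant bit first, in two's complement (the example with negative values suggests signed integers). The bit order actually used by the library may differ, and both function names are hypothetical.

```python
def bits_to_int64s(bits: str) -> list:
    """Split a 256-character string of '0'/'1' into four signed 64-bit integers."""
    if len(bits) != 256 or set(bits) - {"0", "1"}:
        raise ValueError("expected exactly 256 characters, each '0' or '1'")
    values = []
    for start in range(0, 256, 64):
        unsigned = int(bits[start:start + 64], 2)
        # Reinterpret the unsigned value as a two's complement signed 64-bit integer.
        values.append(unsigned - (1 << 64) if unsigned >= (1 << 63) else unsigned)
    return values


def int64s_to_bits(values) -> str:
    """Inverse of bits_to_int64s under the same (assumed) bit order."""
    if len(values) != 4:
        raise ValueError("expected exactly 4 integers")
    # Masking with 2**64 - 1 maps negative values to their unsigned bit pattern.
    return "".join(format(v & ((1 << 64) - 1), "064b") for v in values)
```

Under this assumption the two representations round-trip, so either form describes the same 256 bits.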

add_bulk(vecs: List[str | ndarray | List[int]] | ndarray, additional_fields: List[Dict[str, Any]] = None)

Add a list or numpy array of codes, optionally together with a corresponding list of dictionaries with additional fields. If a list of additional fields is given, it must have the same length as the list of codes. Each code can be a string, a list of int, or a numpy array.

Parameters:

vecs (Union[List[Union[str, np.ndarray, List[int]]], np.ndarray]) -- a list or numpy array of codes
additional_fields (List[Dict[str, Any]]) -- list of additional fields for the codes
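The length requirement can be checked before calling add_bulk. The helper below is a hypothetical pre-flight check written for illustration, not a function of the library; pairing codes without extra fields with an empty dictionary is likewise an assumption.

```python
def check_bulk_input(vecs, additional_fields=None):
    """Validate the length constraint described for add_bulk and pair items up."""
    if additional_fields is None:
        # No extra fields given: pair every code with an empty dictionary.
        additional_fields = [{} for _ in vecs]
    if len(vecs) != len(additional_fields):
        raise ValueError(
            f"{len(vecs)} codes but {len(additional_fields)} field dictionaries"
        )
    return list(zip(vecs, additional_fields))
```

For example, two codes with one dictionary each yield two (code, fields) pairs, while two codes with a single dictionary raise a ValueError.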

decorrelate(plot_dir: str = None, num_samples: int = None)

Call decorrelate once about 10,000 codes have been added to the index. It rearranges the bit positions, after which search may be significantly faster. If plot_dir is specified, the bit distribution and the correlation matrix are plotted. The number of samples num_samples used for computing the correlation should not be too high (i.e. not higher than 10,000), as the correlation is computed in memory. Based on the correlation, a better permutation of the bits is computed. The permutation is applied either to all documents in the index or, if the operation is interrupted, to none; this is achieved by using a temporary copy of the retrieval index. This step is also needed to increase performance on the short codes as these are the most discriminative ones of the long codes. More details can be found in https://arxiv.org/abs/2305.04710

Parameters:

plot_dir (str) -- directory for correlation and bit distribution plots
num_samples (int) -- number of samples to use for computing the correlation matrix

Returns:

True if decorrelation was successful, False otherwise
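To make the quantity behind this step concrete, the sketch below computes a Pearson correlation matrix between bit positions for a small in-memory sample. It is not the library's implementation, and the function name is hypothetical; how a permutation is derived from the matrix is described in the paper linked above and is not reproduced here.

```python
from math import sqrt


def bit_correlation(codes):
    """Pearson correlation matrix between bit positions of equal-length bit strings."""
    n = len(codes)
    width = len(codes[0])
    # columns[i] holds the value of bit position i across all sampled codes.
    columns = [[int(code[i]) for code in codes] for i in range(width)]
    means = [sum(col) / n for col in columns]
    # Population standard deviation of a 0/1 variable with mean p is sqrt(p * (1 - p)).
    stds = [sqrt(m * (1.0 - m)) for m in means]
    corr = [[0.0] * width for _ in range(width)]
    for i in range(width):
        for j in range(width):
            if stds[i] == 0.0 or stds[j] == 0.0:
                continue  # a constant bit has no defined correlation; leave 0.0
            cov = sum(
                (a - means[i]) * (b - means[j])
                for a, b in zip(columns[i], columns[j])
            ) / n
            corr[i][j] = cov / (stds[i] * stds[j])
    return corr
```

The matrix has one row and one column per bit position, so its size is fixed by the code length, while the cost of filling it grows with the number of sampled codes. This is consistent with the advice to keep num_samples moderate.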

reset()[source]: Delete and recreate all indices. This will also delete all documents in the retrieval index.

search(vec: str | ndarray | List[int], size: int = None) → ObjectApiResponse[Any]

Search the index for documents similar to the given code. The code must be 256 bits long and can be given as a string of 0 and 1 characters, a list of int, or a numpy array. Alternatively, a code can be represented as a list or numpy array of 4 integer values.

The _score value for a document \(d\) in the results is a similarity score to the query vector \(q\) based on Hamming distance \(H\). It is the number of common bits, i.e. \(256-H(q,d)\).
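As a worked illustration of the score definition, the following sketch counts the common bits of two codes given as bit strings, which for 256-bit codes equals \(256-H(q,d)\). The function name is hypothetical and the computation here is done in plain Python, not by Elasticsearch.

```python
def hamming_similarity(q: str, d: str) -> int:
    """Number of positions at which two equal-length bit strings agree."""
    if len(q) != len(d):
        raise ValueError("codes must have the same length")
    # Hamming distance: number of positions where the bits differ.
    hamming = sum(a != b for a, b in zip(q, d))
    return len(q) - hamming
```

Two identical 256-bit codes give the maximum score of 256, and every differing bit lowers the score by one.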

Use util.parse_scores() to turn similarities into distances and/or normalize them.

Parameters:

vec (Union[str, np.ndarray, List[int]]) -- a binary code of length 256, or represented as 4 integers
size (int) -- number of hits to return (default: 10)

Returns:

the search result returned by Elasticsearch
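The result can be read like any Elasticsearch search response. The sketch below assumes the standard hits.hits layout with _id and _score per hit, and converts each score back into a Hamming distance. The helper name is hypothetical, and dividing by the code length is only one plausible normalization, not necessarily the one the library applies.

```python
def scores_to_distances(response, code_length=256, normalize=False):
    """Turn the _score of each hit into a Hamming distance, optionally scaled to [0, 1]."""
    results = []
    for hit in response["hits"]["hits"]:
        # The score counts common bits, so the distance is the remainder.
        distance = code_length - hit["_score"]
        if normalize:
            distance = distance / code_length
        results.append((hit["_id"], distance))
    return results
```

The function accepts any mapping with this layout, so it can be tried on a hand-written dictionary before running it against a live index.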

update(id: int, vec: str | ndarray | List[int] = None, additional_fields: Dict[str, Any] = None)

Update a document in the index by its id. The code, the additional fields, or both can be updated. A new code must be 256 bits long and can be given as a string of 0 and 1 characters, a list of int, or a numpy array. Alternatively, a code can be represented as a list or numpy array of 4 integer values.

Parameters:

id (int) -- id of the document to be updated
vec (Union[str, np.ndarray, List[int]]) -- the new binary code of length 256, or represented as 4 integers
additional_fields (Dict[str, Any]) -- a dictionary of field name and value pairs that should also be stored in the index

Extract scores from ES result:

elastichash.util.parse_scores(es_result: ObjectApiResponse, normalize=False, distance=True)

Extracts _score values from an Elasticsearch result. Optionally, the scores can be normalized and the similarity score can be turned into a distance.

Parameters:

es_result (ObjectApiResponse) -- Elasticsearch result (JSON dict)
normalize (bool) -- whether the score should be normalized, defaults to False
distance (bool) -- whether the similarity score should be turned into a distance, defaults to True

Returns: