Usage
Use the following methods to fill, update, reset and search the index.
- class elastichash.ElasticHash(es: Elasticsearch, additional_fields: List[str] = ['image_path'], index_prefix: str = 'eh', shards=1, replicas=1, radius=2)
- __init__(es: Elasticsearch, additional_fields: List[str] = ['image_path'], index_prefix: str = 'eh', shards=1, replicas=1, radius=2)
Extend Elasticsearch for efficient similarity search based on binary codes.
- Parameters:
es (elasticsearch.Elasticsearch) -- instance of an elasticsearch.Elasticsearch client
additional_fields (List[str]) -- list of additional fields used in the index, defaults to ["image_path"]
index_prefix (str) -- a prefix used for ElasticHash indices and functions, defaults to "eh"
shards (int) -- number of shards for ElasticHash indices, defaults to 1
replicas (int) -- number of replicas for ElasticHash indices, defaults to 1
radius (int) -- radius for subcode, defaults to 2
- add(vec: str | ndarray | List[int], additional_fields: Dict[str, Any] = None)
Add a code to the index, optionally along with additional fields. The code must be 256 bits long and can be given as a string of 0 and 1 characters, a list of int, or a numpy array. Alternatively, a code can be represented as a list or numpy array of 4 integer values.
Usage examples:
add(vec='010...0', additional_fields={'image_path': '/path/to/image.jpg'})
add(vec=[0,1,0,...,0])
add(np.array([0,1,0,...,0]))
add(np.array([10,20,-10,-20]))
- Parameters:
vec (Union[str, np.ndarray, List[int]]) -- a binary code of length 256, or represented as 4 integers
additional_fields (Dict[str, Any]) -- a dictionary of field name and value pairs that should also be stored in the index
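The two accepted representations of a code (256 bits, or 4 integers) can be related by a small conversion. The following is a minimal sketch, not ElasticHash's own implementation. It assumes each integer is a signed 64-bit word and that the first 64 bits of the string map to the first integer, most significant bit first. The helper names `bits_to_ints` and `ints_to_bits` are hypothetical and not part of the library.

```python
CODE_BITS = 256   # length of a full code, as stated in the documentation
WORD_BITS = 64    # assumption: 4 integers of 64 bits each make up one code


def bits_to_ints(bits):
    """Pack a 256-character '0'/'1' string into four signed 64-bit integers."""
    if len(bits) != CODE_BITS or set(bits) - {"0", "1"}:
        raise ValueError("expected a string of exactly 256 '0'/'1' characters")
    words = []
    for start in range(0, CODE_BITS, WORD_BITS):
        unsigned = int(bits[start:start + WORD_BITS], 2)
        # Reinterpret the unsigned value as a two's complement signed integer.
        if unsigned >= 1 << (WORD_BITS - 1):
            unsigned -= 1 << WORD_BITS
        words.append(unsigned)
    return words


def ints_to_bits(words):
    """Unpack four signed 64-bit integers into a 256-character bit string."""
    if len(words) != CODE_BITS // WORD_BITS:
        raise ValueError("expected exactly 4 integers")
    mask = (1 << WORD_BITS) - 1
    # Masking maps negative values to their unsigned 64-bit bit pattern.
    return "".join(format(int(w) & mask, "064b") for w in words)
```

Under these assumptions, `ints_to_bits([10, 20, -10, -20])` yields a 256-character string that `bits_to_ints` maps back to the same four integers. Check the bit order against the library before relying on it.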
- add_bulk(vecs: List[str | ndarray | List[int]] | ndarray, additional_fields: List[Dict[str, Any]] = None)
Add a list or numpy array of codes, optionally together with a corresponding list of dictionaries with additional fields. If a list of additional fields is given, it must have the same length as the list of codes. A code can be either a string, a list of int, or a numpy array.
- Parameters:
vecs (Union[List[Union[str, np.ndarray, List[int]]], np.ndarray]) -- a list of codes
additional_fields (List[Dict[str, Any]]) -- list of additional fields for the codes
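Since the two lists passed to add_bulk must line up, a pre-flight check before the call can catch mismatches early. The following is a minimal sketch of such a check. `is_valid_code` and `check_bulk_args` are hypothetical helpers, not part of ElasticHash, and the signed 64-bit range for the 4-integer form is an assumption.

```python
def is_valid_code(code):
    """Return True if `code` looks like one of the documented code formats."""
    if isinstance(code, str):
        return len(code) == 256 and set(code) <= {"0", "1"}
    values = list(code)  # works for lists and for 1-D numpy arrays
    if len(values) == 256:
        return all(v in (0, 1) for v in values)
    if len(values) == 4:
        # Assumption: the 4-integer form uses signed 64-bit words.
        return all(-(1 << 63) <= int(v) < (1 << 63) for v in values)
    return False


def check_bulk_args(vecs, additional_fields=None):
    """Validate arguments intended for add_bulk; return the number of codes."""
    if additional_fields is not None and len(additional_fields) != len(vecs):
        raise ValueError(
            f"{len(vecs)} codes but {len(additional_fields)} field dictionaries"
        )
    bad = [i for i, code in enumerate(vecs) if not is_valid_code(code)]
    if bad:
        raise ValueError(f"invalid codes at positions {bad}")
    return len(vecs)
```

A call such as `check_bulk_args(codes, fields)` would then precede `add_bulk(codes, fields)`.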
- decorrelate(plot_dir: str = None, num_samples: int = None)
After adding about 10,000 codes to the index, the decorrelate method should be called. After rearranging the bit positions, search may be significantly faster. The bit distribution and correlation matrix are plotted if a plot_dir is specified. The number of samples num_samples used for computing the correlation should not be too high (i.e. not higher than 10,000, as the correlation computation is carried out in memory). Based on the correlation, a better permutation for the bits is computed. The permutation is applied either to all documents in the index or, in case of interruption, to none. This is achieved by using a temporary copy of the retrieval index. This step is also needed to increase performance on the short codes, as these are the most discriminative ones of the long codes. More details can be found in https://arxiv.org/abs/2305.04710
- Parameters:
plot_dir (str) -- directory for correlation and bit distribution plots
num_samples (int) -- number of samples to use for computing the correlation matrix
- Returns:
True if decorrelation was successful, False otherwise
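The guidance above (call decorrelate once roughly 10,000 codes are indexed, and keep num_samples at or below 10,000) can be wrapped in a small guard. This is a minimal sketch: `decorrelate_when_ready` is a hypothetical helper, `eh` is assumed to be an ElasticHash instance, and the caller is assumed to track `num_indexed` itself.

```python
def decorrelate_when_ready(eh, num_indexed, min_codes=10_000,
                           max_samples=10_000, plot_dir=None):
    """Call eh.decorrelate() only once enough codes have been indexed.

    Returns False without touching the index if fewer than `min_codes`
    codes are present; otherwise returns the result of decorrelate().
    """
    if num_indexed < min_codes:
        return False
    # Cap the sample size, since the correlation is computed in memory.
    num_samples = min(num_indexed, max_samples)
    return bool(eh.decorrelate(plot_dir=plot_dir, num_samples=num_samples))
```

Because decorrelate returns False on failure, the return value should be checked before assuming the permutation is in place.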
- reset()
Delete and recreate all indices. This will also delete all documents in the retrieval index.
- search(vec: str | ndarray | List[int], size: int = None) -> ObjectApiResponse[Any]
Search for documents similar to the given code in the index. The code must be 256 bits long and can be given as a string of 0 and 1 characters, a list of int, or a numpy array. Alternatively, a code can be represented as a list or numpy array of 4 integer values.
The _score value for a document \(d\) in the results is a similarity score to the query vector \(q\) based on Hamming distance \(H\). It is the number of common bits, i.e. \(256-H(q,d)\).
Use util.parse_dists() to turn similarities into distances and/or normalize them.
- Parameters:
vec (Union[str, np.ndarray, List[int]]) -- a binary code of length 256, or represented as 4 integers
size (int) -- number of hits to return (default: 10)
- Returns:
the search result returned by Elasticsearch
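The relation between the _score value and the Hamming distance can be checked by hand. The sketch below is not the library's own helper. It assumes codes are given as 256-character bit strings, that the result has the standard Elasticsearch layout with hits under `["hits"]["hits"]`, and that normalization means dividing by 256. The function names are hypothetical.

```python
CODE_BITS = 256


def hamming_distance(a, b):
    """Number of positions at which two 256-character bit strings differ."""
    if len(a) != CODE_BITS or len(b) != CODE_BITS:
        raise ValueError("both codes must be 256 characters long")
    return sum(x != y for x, y in zip(a, b))


def similarity_score(q, d):
    """Number of common bits, i.e. 256 - H(q, d)."""
    return CODE_BITS - hamming_distance(q, d)


def scores_to_distances(es_result, normalize=False):
    """Turn the _score of every hit back into a Hamming distance."""
    distances = []
    for hit in es_result["hits"]["hits"]:
        distance = CODE_BITS - hit["_score"]
        # Assumption: a normalized distance lies in [0, 1].
        distances.append(distance / CODE_BITS if normalize else distance)
    return distances
```

For example, two codes that differ in 3 bit positions have a similarity score of 253, which converts back to a distance of 3.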
- update(id: int, vec: str | ndarray | List[int] = None, additional_fields: Dict[str, Any] = None)
Update a document in the index by its id. The code, the additional fields of the document, or both can be updated. The new code must be 256 bits long and can be given as a string of 0 and 1 characters, a list of int, or a numpy array. Alternatively, a code can be represented as a list or numpy array of 4 integer values.
- Parameters:
id (int) -- id of the document to be updated
vec (Union[str, np.ndarray, List[int]]) -- the new binary code of length 256, or represented as 4 integers
additional_fields (Dict[str, Any]) -- a dictionary of field name and value pairs that should also be stored in the index
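Since both vec and additional_fields are optional, a thin wrapper can reject calls that would change nothing. This is a minimal sketch: `apply_update` is a hypothetical helper, `eh` is assumed to be an ElasticHash instance, and how the library itself treats a call with neither argument is not documented here.

```python
def apply_update(eh, doc_id, vec=None, additional_fields=None):
    """Forward an update to eh.update(), refusing calls with nothing to change."""
    if vec is None and additional_fields is None:
        raise ValueError("pass a new code, additional fields, or both")
    kwargs = {}
    if vec is not None:
        kwargs["vec"] = vec
    if additional_fields is not None:
        kwargs["additional_fields"] = additional_fields
    # Only the arguments that were actually given are passed on.
    eh.update(doc_id, **kwargs)
```

For instance, `apply_update(eh, 42, additional_fields={"image_path": "/new/path.jpg"})` would change only the stored path and leave the code untouched.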
Extract scores from ES result:
- elastichash.util.parse_scores(es_result: ObjectApiResponse, normalize=False, distance=True)
Extracts _score values from Elasticsearch result. Normalization can be applied and the similarity score can be turned into a distance.
- Parameters:
es_result (ObjectApiResponse) -- Elasticsearch result (JSON dict)
normalize (bool) -- if score should be normalized
distance (bool) -- if similarity score should be turned into a distance
- Returns: