Embeddings search vs traditional search ranking algorithms comparison | Part 2, Review and Findings

By Gabriel, 31 Dec 2023, updated 06 Jan 2024

The goal of this review is to compare the quality of new semantic search algorithms based on embeddings against a traditional term-based search ranking algorithm at retrieving good content. The traditional algorithm is Okapi BM25; the new ones are Elastic Learned Sparse EncodeR (ELSER, a retrieval model trained by Elastic.co) and all-MiniLM-L6-v2 from sbert.net. Google is included as well because it is a good reference point. Which one works for which type of query? Part 2 details the experiment and presents the findings.


A computer screen and one big magnifying glass, sai-line art_style, by Stable Diffusion XL

This is part 2 of a piece of work that turned out bigger than anticipated:

In a nutshell, in this experiment I compared 4 different algorithms: for each one, I first indexed ~9000 pieces of content, then ran 16 different search queries and analysed the relevancy of the first 5 results returned.

About the dataset

The documents are ~9000 automotive-related questions, searchable at carsguide.com.au/ask-the-guide.

Disclaimer: Carsguide is my employer, and even though all those questions are publicly accessible, I had special access to the raw data, which made the data extraction step easier.

It is a good dataset because the questions are smallish and self-contained: usually one sentence, with a median length of 54 words, written to contain all the necessary information with no need of context (at least once you know it is about cars!), e.g. “How to Use Bluetooth in a Suzuki Swift?”, “Where are Range Rovers made?”.

When one needs to generate an embedding for a big document, the vector tends to fail to “capture” all the meanings of that document. From my readings, the common workaround is to split the document into smaller chunks. Chunking the right way for your type of documents while preserving good search relevancy on the other end is a challenge in itself. While it has to be done in some way when you are in that situation, it was convenient for this review to bypass the challenge thanks to the small documents, and focus on embeddings search relevancy.
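Since chunking becomes necessary as soon as documents grow, here is a minimal sketch of the common fixed-size-with-overlap approach (the window and overlap sizes are arbitrary illustration values, not recommendations):

```python
def chunk_text(text, max_words=100, overlap=20):
    """Naive fixed-size word chunking with overlapping windows.

    Real pipelines often split on sentence or paragraph boundaries instead;
    the sizes here are purely illustrative.
    """
    words = text.split()
    if len(words) <= max_words:
        return [text]
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks
```

The overlap is there so that a sentence straddling a chunk boundary still lands fully inside at least one chunk.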

The documents don’t include the responses (except for the Google experiment, more on that below). And the search queries run are questions too (or words from a question), like a user would probably write them. Hence this is a symmetric search.

Example: if your query is “Is the engine of 1995 toyota LandCruiser good?”, in a symmetric search you would want to find the entry “Is the engine of 1995 toyota LandCruiser reliable?”. In an asymmetric search you would want to find the entry “A 1995 toyota LandCruiser is tough, rugged and ultimately a reliable workhorse”.

This does simplify the use cases: we are not testing the capacity of the algorithms to find an answer text, which would be structured differently from the question and would imply some challenging interpretation. It is a “simple” semantic search: find the text with the closest meaning.

In other words, even if the dataset is made of questions, we are not testing the algorithms’ capacity to find the most relevant answer, only the most similar questions. It may be best to forget that those are questions and see them for what they are more generally: sentences.

Preparations

I ran the first three algorithms on the elastic.co cloud.

The Google experiment is special because it couldn’t be run in an isolated cloud environment like the other three (obviously): each of the questions from the dataset was published a long time ago on an individual page on carsguide.com.au. The site is essentially the #1 automotive editorial site in Australia, so it has a lot of authority and the questions are popular. This guarantees that they are already well indexed by Google, hence there was no need to prepare and index the data. The indexed pages include the response as well, so Google has indexed both question and answer; that is another difference with the other algorithms to keep in mind. But because the search queries I ran targeted the wording of the questions, I’ll consider that having the answers indexed won’t have a noticeable impact on the results returned.

Here are the steps I did to prepare the three cloud-based experiments:

Setting up elasticsearch instance

  1. Create a free trial on cloud.elastic.co
  2. Create a deployment “CG Semantic Search” with 120 GB storage, 4 GB RAM, up to 2.2 vCPU Elasticsearch (default), in AWS Sydney
  3. Add a machine learning instance to the deployment: 4 GB RAM | 2.1 vCPU up to 8.4 vCPU (the minimum requirement to run the ELSER model)

Step 3 is required for the embeddings experiments. In elastic.co, in order to use a sparse vector embedding implementation, ELSER or another, it is necessary to deploy machine learning nodes in your account. Even though embeddings are very conveniently integrated in this cloud, the computing they require is done outside the ordinary Elasticsearch nodes, on nodes specialized to run the inference engine of machine learning models such as embedding models.

Loading the data

The dataset is ~9000 automotive-related questions, publicly searchable at https://www.carsguide.com.au/ask-the-guide. Questions are relatively small: the median length is 54 words. The answers are not part of the dataset for this experiment.

elastic.co offers a convenient way to load the content once (the ~9000 questions) so it can then be reused by the 3 algorithms tested here, using the Data Visualizer in the Machine Learning UI of Kibana:

I prepared the raw data in a CSV file formatted [id],[question-text], then simply loaded it through the Data Visualizer. After a couple of failed attempts due to special characters, I managed to load all the rows under the index questions-data-3.
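For reference, most special-character issues in CSV exports (quotes, commas, newlines inside a question) can be avoided by letting a CSV library do the quoting. A small sketch with made-up rows:

```python
import csv
import io

# Hypothetical rows (id, question text) including characters that break a
# naive comma-joined export: quotes, commas and embedded newlines.
rows = [
    ("1", "How to Use Bluetooth in a Suzuki Swift?"),
    ("2", 'Is the "GXL" trim worth it, or not?\nAsking for a friend.'),
]

buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_ALL)  # quote every field defensively
writer.writerows(rows)

# Reading it back restores the rows exactly, embedded newlines included.
parsed = [tuple(r) for r in csv.reader(io.StringIO(buf.getvalue()))]
```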

Building the indexes

Next I created 3 different indexes, one per experiment, and ingested the data into each one.

The code snippets below are Elasticsearch Console commands that I used to interact with the REST APIs of Elasticsearch (unless specified otherwise). I ran them through the Kibana web UI, a very convenient way to quickly execute instructions.

BM25

Create (full-text) index

PUT cg-normal
{
  "mappings": {
    "properties": {
      "text": { "type": "text" }
    }
  }
}

Index the test dataset

POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "questions-data-3",
    "size": 50
  },
  "dest": {
    "index": "cg-normal"
  }
}

TOTAL: 9073 in 917 ms

indexing speed: 594,000 / min
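For clarity, the indexing speeds quoted in this review are straightforward conversions of the reindex totals; a quick sketch:

```python
def docs_per_minute(total_docs: int, elapsed_ms: int) -> float:
    """Convert a _reindex total (documents, elapsed milliseconds) to docs/minute."""
    return total_docs / (elapsed_ms / 1000 / 60)

# 9073 documents in 917 ms gives roughly the 594,000/min quoted above.
bm25_speed = docs_per_minute(9073, 917)
```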

And Perform an ordinary full text search

GET cg-normal/_search
{
  "query": {
    "match": {
      "text": "Is the engine of 1995 toyota hilux reliable?"
    }
  }
}
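As a reminder of what BM25 does under the hood: each query term contributes an IDF-weighted, saturating term-frequency score. A sketch of the per-term formula, with the Elasticsearch default k1 and b and hypothetical document frequencies:

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Okapi BM25 contribution of one query term to one document's score,
    using the Lucene-style IDF that Elasticsearch applies (default k1, b)."""
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    tf_norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_norm

# Hypothetical document frequencies in our ~9000-question index: a rare term
# like "hilux" outweighs a common one like "engine".
rare = bm25_term_score(tf=1, doc_len=54, avg_doc_len=54, n_docs=9073, doc_freq=40)
common = bm25_term_score(tf=1, doc_len=54, avg_doc_len=54, n_docs=9073, doc_freq=2000)
```

The tf saturation (k1) is what keeps a document from winning just by repeating a term many times.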

Sparse vector embedding: ELSER

Download and Deploy ELSER

The ELSER model is available by default in elastic.co and is easy to install. I followed the official tutorial:

  1. visit “Trained Models” in Kibana https://cg-semantic-search.kb.ap-southeast-2.aws.found.io:9243/app/ml/trained_models
  2. Click the Download model (model ID: .elser_model_1)
  3. Then click “Start deployment”; in the configuration popup, click “Create”!

Create the index mapping

PUT cg-semantic-a
{
  "mappings": {
    "properties": {
      "ml.tokens": {
        "type": "rank_features"
      },
      "text": {
        "type": "text"
      }
    }
  }
}

Create an ingest pipeline with an inference processor

PUT _ingest/pipeline/cg-semantic-elser-1
{
  "processors": [
    {
      "inference": {
        "model_id": "elser-for-cg-search",
        "target_field": "ml",
        "field_map": {
          "text": "text_field"
        },
        "inference_config": {
          "text_expansion": {
            "results_field": "tokens"
          }
        }
      }
    }
  ]
}

Reindex the data through the inference pipeline

POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "questions-data-3",
    "size": 50
  },
  "dest": {
    "index": "cg-semantic-a",
    "pipeline": "cg-semantic-elser-1"
  }
}

TOTAL: 9073 in 1636192 ms

indexing speed: 332 / min

And Perform a semantic search

GET cg-semantic-a/_search
{
   "query":{
      "text_expansion":{
         "ml.tokens":{
            "model_id":"elser-for-cg-search",
            "model_text":"Is the engine of 1995 toyota hilux reliable?"
         }
      }
   }
}
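Conceptually, the text_expansion query scores a document by a weighted overlap: ELSER expands both the query and the document into weighted tokens, and the score grows with the products of the weights of shared tokens. A sketch with made-up expansions and weights (real ELSER outputs tens of weighted tokens per text):

```python
def sparse_score(query_tokens: dict, doc_tokens: dict) -> float:
    """Dot product over the tokens shared by the query and document expansions."""
    return sum(weight * doc_tokens[token]
               for token, weight in query_tokens.items()
               if token in doc_tokens)

# Hypothetical expansions with hypothetical weights.
query = {"engine": 1.3, "reliable": 1.1, "toyota": 0.9, "1995": 0.7}
doc_a = {"engine": 1.2, "reliable": 1.0, "toyota": 1.1, "hilux": 0.8}
doc_b = {"bluetooth": 1.4, "suzuki": 1.2, "swift": 1.1}
```

A document sharing no expanded tokens with the query (doc_b here) simply scores zero.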

Dense vector embedding: all-MiniLM-L6-v2

Download and Deploy all-MiniLM-L6-v2

Third-party embedding models are obviously not available out of the box in elastic.co, but there is a convenient Docker-based way to download them from the popular site huggingface.co, deploy them and use them through an inference ingest pipeline in Elastic Cloud. I followed the official How to deploy a text embedding model and use it for semantic search guide. Below are the instructions I ran in a shell on my machine:

git clone https://github.com/elastic/eland.git
cd eland
docker build -t elastic/eland .

export CLOUD_ID=CG_Semantic_Search:[base64-encoded-something]

docker run -it --rm elastic/eland \
    eland_import_hub_model \
      --cloud-id $CLOUD_ID \
      -u elastic -p [password] \
      --hub-model-id sentence-transformers/all-MiniLM-L6-v2 \
      --task-type text_embedding \
      --start

Then visit https://cg-semantic-search.kb.ap-southeast-2.aws.found.io:9243/app/ml/trained_models, where you will see the notification “ML job and trained model synchronization required”; follow the steps to synchronise (…)

The model is now deployed in elastic.co and ready to use.

Test the text embedding model (using _infer API)

POST /_ml/trained_models/sentence-transformers__all-minilm-l6-v2/_infer
{
  "docs": [
    {
      "text_field": "How is the weather in Jamaica?"
    }
  ]
}

The API returns a response similar to the following: (values in this example are from a different sentence transformers model)

{
  "inference_results": [
    {
      "predicted_value": [
        0.39521875977516174,
        -0.3263707458972931,
        0.26809820532798767,
        0.30127981305122375,
        0.502890408039093,
        ...
      ]
    }
  ]
}
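These dense vectors are compared with cosine similarity (the similarity configured in the index mapping of the next step). A minimal sketch using toy 3-dimensional vectors instead of the model’s 384:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-d vectors; the second is the first scaled by 2, so same direction.
similar = cosine_similarity([0.4, -0.3, 0.3], [0.8, -0.6, 0.6])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Because cosine ignores vector length, two texts of very different sizes can still be judged close in meaning.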

Create the index mapping

PUT cg-semantic-b
{
  "mappings": {
    "properties": {
      "my_embeddings.predicted_value": {
        "type": "dense_vector",
        "dims": 384,
        "index": true,
        "similarity": "cosine"
      },
      "text": {
        "type": "text"
      }
    }
  }
}

Create an ingest pipeline with an inference processor

PUT _ingest/pipeline/cg-semantic-minilm-l6-v2
{
  "description": "Text embedding pipeline",
  "processors": [
    {
      "inference": {
        "model_id": "sentence-transformers__all-minilm-l6-v2",
        "target_field": "my_embeddings",
        "field_map": {
          "text": "text_field"
        }
      }
    }
  ]
}

Reindex the data through the inference pipeline

POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "questions-data-3",
    "size": 50
  },
  "dest": {
    "index": "cg-semantic-b",
    "pipeline": "cg-semantic-minilm-l6-v2"
  }
}

TOTAL: 9073 in 216794 ms

indexing speed: 2520 / min

And Perform a kNN semantic search

GET cg-semantic-b/_search
{
  "knn": {
    "field": "my_embeddings.predicted_value",
    "query_vector_builder": {
      "text_embedding": {
        "model_id": "sentence-transformers__all-minilm-l6-v2",
        "model_text": "Is the engine of 1995 toyota hilux reliable?"
      }
    },
    "k": 10,
    "num_candidates": 100
  },
  "_source": [
    "id",
    "text"
  ]
}

Experiments

Search queries

I ran 16 manually prepared search queries for each algorithm, covering some popular common questions in that dataset. There are 4 groups of queries:

See 2023 Comparative Search Review Raw Results for the list of queries.

For each query I collected the first 5 results and the score of each one.

For each of the three cloud-based experiments, I performed the queries using the Elasticsearch Console in Kibana (see the examples of search commands in the Preparations section above).

The Google experiment is special because it (obviously) couldn’t be run automatically in a test environment like the others:

Example: in Google search box:

Is the engine of 1995 toyota hilux reliable? site:www.carsguide.com.au/car-advice/q-and-a/

Results

How do those new semantic-based algorithms compare to the existing ones, Okapi BM25 and Google?

Raw results

available in 2023 Comparative Search Review Raw Results

The result analysis is basic: for each algorithm, I evaluated the relevancy of the first result of each search query, with 2 possible outcomes: 1 if it is the best expected result, 0 otherwise.

I found it interesting to set such strict expectations because I know the domain and the dataset quite well, and because I chose the search queries myself. In other words: as a non-forgiving user, would I find this first result to be the best, satisfactory one?

This is a difference from benchmarks like Normalized Discounted Cumulative Gain (NDCG) applied to search engine algorithms: this review is a practical application for a real use case, and the focus is on the first returned result only. Search result order really does matter.
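The scoring described above amounts to precision at rank 1 with binary relevance; converting the per-query judgments into a ratio is then just a mean. A sketch (the judgments here are illustrative, the real ones are in the raw results spreadsheet):

```python
def score_ratio(judgments):
    """Mean of the binary first-result judgments (1 = best expected result),
    i.e. precision at rank 1 over the query set."""
    return sum(judgments) / len(judgments)

# e.g. an algorithm whose first result was the best one for 11 of 16 queries:
bm25_like = [1] * 11 + [0] * 5
```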

This table sums up the scores over the 16 search queries for each algorithm, converted to a value between 0 and 1 (higher values are better):

Algo     Total score   Ratio
BM25     11/16         0.688
ELSER    9/16          0.563
MiniLM   12/16         0.750
Google   14/16         0.875

Considering the 4 non-existing questions, which should have failed for all algorithms, the maximum score should have been 12/16, or 0.750. But somehow, Google managed to surprise me and find excellent results.

Findings

It is mostly a success

Retrieved documents are relevant and the ranking holds up well compared to BM25. That is a great achievement in itself for ELSER and MiniLM: even though the indexing technique behind the scenes is totally different, they reach a satisfactory level of relevancy on our dataset, without any customisation or fine-tuning yet.

Google is still the best :-/

In the 2 examples below, Google’s results impressed me by retrieving unexpectedly good results for challenging searches.

Google is of course more than a single algorithm applied to all requests. It must use a collection of search retrieval approaches and probably some rule-based systems to display the best results, so it is not an apples-to-apples comparison. Still, it sets users’ expectations, so it is definitely worth comparing against.

Performance

Query time is too small to be measured accurately here, but indexing time can be, because indexing processes much more data, and it gives us a glimpse of performance: all-MiniLM-L6-v2 is faster than ELSER, with 2520 documents indexed per minute against only 332/min for ELSER. This came as a surprise to me since I ran both with the same configuration on elastic.co, and ELSER is the in-house model, so I was expecting it to be more optimized than a third-party model.

BM25 is much faster (594,000/min), but considering that I used the smallest possible machine learning node configuration to run the all-MiniLM and ELSER models, I think this is still good performance. It shows that semantic search is a “smallish” machine learning application, which makes it practical to run.

Embeddings: Sparse vector and Dense vector

The sparse encoder returned an unexpected result order for one of the “exact match” query scenarios: the query was What is the best oil type for my Ford Ranger, and is it possible to change the oil myself? ELSER ranked the following documents:

  1. 19.8: How to change oil on a Ford Ranger?
  2. 17.5: What is the best oil type for my Ford Ranger, and is it possible to change the oil myself?

The second result was expected to score higher than the first one because it is the exact match. I’m not sure what happened here, and it will need to be confirmed with other experiments, but at this stage I suspect that the expansion step of sparse vector encoding may favor or penalise some wordings. Maybe that could happen with dense vectors as well, but I didn’t notice it in this experiment. If that is the case, sparse vectors could be subject to adversarial attacks: in a search application, that would mean someone could craft a document’s wording specifically, based on the model vocabulary, to artificially increase its rank in searches.

At this stage, and in my opinion, sparse vectors are inferior to dense vectors when applied to search retrieval. They are more interpretable for sure (one can follow along which terms are matched), but they inherently limit the relation between 2 documents to the list of terms defined in the vocabulary, which is vastly smaller than the number of parameters in the model. For example, ELSER has ~100M parameters (it is trained from a BERT model, which has ~100M) but only 30,000 terms in its vocabulary. And because most term values are 0 for any given document, comparing 2 documents comes down to comparing ~50 terms at most. That is small compared to the 384 dimensions of a dense vector model like all-MiniLM-L6-v2 (which is even a small model, with “only” ~22M parameters).
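To make the point concrete, a small simulation with the numbers above: two unrelated documents, each keeping ~50 nonzero terms out of a 30,000-term vocabulary, share almost no terms, while a dense comparison always involves all 384 dimensions (the random term sets are purely illustrative):

```python
import random

random.seed(7)  # reproducible illustration

VOCAB_SIZE = 30_000   # ELSER vocabulary size
NONZERO_TERMS = 50    # roughly the nonzero terms kept per expanded document
DENSE_DIMS = 384      # all-MiniLM-L6-v2 output dimensions

# Two unrelated documents modeled as random nonzero-term sets.
doc_a = set(random.sample(range(VOCAB_SIZE), NONZERO_TERMS))
doc_b = set(random.sample(range(VOCAB_SIZE), NONZERO_TERMS))

# Their sparse comparison rests on a tiny (often empty) term overlap,
# whereas a dense model always compares the full 384 dimensions.
overlap = len(doc_a & doc_b)
```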

What’s next on embeddings

Embedding models can be multimodal, which term-based algorithms cannot be. That means they can compute embeddings for non-text content (images, video). This is another possibility to improve search with embedding models, and something I’d like to explore.

Fine-tuning the embedding models seems like another promising possibility for search. In the domain of this experiment, automotive questions, the questions often revolve around the same concepts (make, model, year, engine, recall, price…). Some of those concepts are missed by generic models, and hence may be better captured by specialized models…

elastic.co works well

Implementing machine learning models in a production environment is still a fast-changing, experimental area, and even more difficult for a non-machine-learning engineer (like me!), but Elastic Cloud has done a very good job of making it usable and production-ready for whoever is already familiar with Elasticsearch. That is a great achievement, because knowledge of and confidence in deploying new tools take a long time for an organisation to master. In that context, Elastic Cloud looks like a valuable shortcut to me.

Final note

I ran the experiment and wrote the review mostly in my free time, and the views and opinions expressed here do not necessarily state or reflect those of my employer. However, I’m grateful to have been able to work on this good dataset. Furthermore, I started this work during some very valuable learning days organised by my employer, and even more importantly, the constant interaction I have with colleagues on these and similar topics really helps me think about how to experiment with things.

Resources