By Gabriel, 31 Oct 2023, updated 06 Nov 2023
The goal of this review is to compare the quality of the new semantic search algorithms using embeddings against traditional term-based search ranking algorithms at retrieving relevant content. The traditional one is Okapi BM25, and the new ones are the Elastic Learned Sparse EncodeR (ELSER) and all-MiniLM-L6-v2 from sbert.net. Google is added because it is a good reference point. Which one works for what type of query? Part 1 presents the background.
A computer screen and one big magnifying glass, sai-line art_style, by Stable Diffusion XL
This turned out to be a bigger piece of work than anticipated, hence I have split it into two parts.
With the rise of LLMs in the last 12 months, there has been renewed interest in using embeddings to perform search. At least this is my observation. Machine learning is a new field for me, and like many other IT professionals I am trying to understand what LLMs and related techniques can and cannot do, and how we can efficiently apply them to existing IT challenges.
What is an embedding?
An embedding (aka vector embedding or text embedding) is a Natural Language Processing technique where a trained model transforms any piece of text into a list of numbers. Embedding models are sometimes called Sentence Transformers. For example, I have computed the embedding of “How to Use Bluetooth in a Suzuki Swift?” using the all-MiniLM-L6-v2 model. The output is a list of 384 numbers: -0.01730852946639061, -0.059170424938201904, -0.043736234307289124, ...[truncated]. Mathematically speaking, the model computes a projection of any piece of text onto one point in a 384-dimension space. One can read this list of numbers as the coordinates of that text in a (weird) high-dimensional space.
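As an illustration, here is a minimal sketch of how that embedding can be computed with the sentence-transformers library from sbert.net (assuming it is installed, e.g. with pip install sentence-transformers):

```python
from sentence_transformers import SentenceTransformer

# Load the pre-trained all-MiniLM-L6-v2 model from sbert.net.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode a piece of text into a 384-dimension vector.
embedding = model.encode("How to Use Bluetooth in a Suzuki Swift?")

print(embedding.shape)  # (384,)
print(embedding[:3])    # the first few of the 384 coordinates
```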
The key feature of these vectors is that two semantically close texts will be projected onto two close points. During training, the model has captured the similarity between words and sets of words that may be lexicographically different but semantically close. For example, the point for “a car” will be close to the point for “a vehicle with 4 wheels” in that space. That is a fantastic feature for search.
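That closeness can be measured with cosine similarity. A quick sketch (the unrelated third sentence is my own addition for contrast; the exact scores will depend on the model version):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# The pair from the paragraph above, plus an unrelated sentence for contrast.
embeddings = model.encode([
    "a car",
    "a vehicle with 4 wheels",
    "a chocolate cake recipe",
])

print(util.cos_sim(embeddings[0], embeddings[1]))  # expected: relatively high
print(util.cos_sim(embeddings[0], embeddings[2]))  # expected: much lower
```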
Chronologically, this technique was already well established by several companies and research teams before the big bang of ChatGPT and other LLMs a year ago. For example, OpenAI offered its own trained embedding model as SaaS before ChatGPT (producing 1536-dimension vectors). I think those models are smaller and easier to train to a satisfactory level than LLMs are, hence they came first in the research work of many teams around the world. But for many of us, non machine learning engineers, the technique came into the light thanks to LLMs… and their two main limitations: the difficulty (understand: the high cost) of adapting an LLM to custom or recent data, and their tendency to hallucinate.
The difficulty of adapting LLMs to custom or recent data
LLMs are generic, having “crunched” all the accessible text data in the world during their training phase. But what if, at a personal level, I want to question an LLM about my personal emails, agenda, address book, or bank statements (e.g. “What did I buy in my last Amazon order?”, “Which mechanic did I use last time to fix my car?”)? And what if, at a company level, I want to question an LLM about internal documentation or Slack messages (“How do I request access to that service?”, etc.)?
If one wants an LLM customised for one's own data, a fine-tuning training run is required. This is expensive (running machine learning algorithms for long periods of time on expensive computers). Then, when new data is added or existing data is updated, one would have to do it all over again. For a long time the cut-off date of ChatGPT was September 2021. I believe OpenAI has since come up with a workaround to include more recent data, but it has been very secretive about how this is achieved. It is unlikely that the entire costly training was run all over again, and this workaround probably won't give the same level of quality for answers about recent data as it does for data prior to September 2021.
The tendency to hallucinate
LLMs have the inherent behaviour of sometimes fabricating random and wrong responses (but always nicely formulated!). This is commonly called hallucination. If it sounds like the model has no idea what it is talking about, it is because it doesn't. A good reminder that an LLM just computes statistically the most likely next words to output. How can we force an LLM to output a verifiable response, based on precise data?
The embeddings search solution: RAG
The idea is a multi-step approach:
1. Have a database where the vector embeddings of all the data have been computed and stored, and keep it updated by adding vectors for new data. This is an easy incremental effort, thus solving the custom or recent data problem.
2. When a question is asked of the LLM, compute the vector embedding of that question.
3. Retrieve the closest data from the vector database and combine it with the original question to build the prompt for the LLM. In the prompt, the retrieved text is introduced as “hints”, i.e. elements of information to use to answer the question.
This is called Retrieval-Augmented Generation, aka RAG (a minimal sketch follows below). And this should solve the hallucination problem too, by “grounding” the model's response in verifiable elements of data.
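Here is a minimal sketch of that flow, assuming the same all-MiniLM-L6-v2 model. The two sample documents and the prompt template are made up for illustration, the “database” is just an in-memory list, and the final call to an actual LLM is left out:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Step 1: compute and store embeddings for all data (made-up sample documents).
documents = [
    "The Suzuki Swift pairs with a phone through the Bluetooth menu of the audio unit.",
    "The Ford Ranger oil change interval is listed in the owner's manual.",
]
doc_embeddings = model.encode(documents)

# Step 2: embed the question being asked.
question = "How to Use Bluetooth in a Suzuki Swift?"
question_embedding = model.encode(question)

# Step 3: retrieve the closest documents and build the prompt with them as "hints".
hits = util.semantic_search(question_embedding, doc_embeddings, top_k=2)[0]
hints = "\n".join(documents[hit["corpus_id"]] for hit in hits)

prompt = (
    "Answer the question using only the hints below.\n"
    f"Hints:\n{hints}\n\n"
    f"Question: {question}\nAnswer:"
)
print(prompt)  # this prompt would then be sent to the LLM
```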
On a side note, I suppose one could also use a traditional indexing and search algorithm instead of embeddings to find similar content, then combine it with the question in exactly the same way to build the LLM prompt, and still call it RAG. But most of the usages I have found on the web so far use embeddings…
Embedding search in general
RAG is still very much at the experimental stage in a lot of places. It is promising in my opinion, but there are a lot of caveats, and we are yet to be presented with a successful application. That is not the goal of this post. Instead, I wanted to review how good vector embedding search can be compared to traditional search ranking algorithms when simply applied to search: can embedding search significantly improve local search of documents on your personal machine, or website search, i.e. search within a site?
As a web developer I have implemented or maintained search tools on numerous websites throughout my career. I still do. I have experimented with different tools (Apache Solr, Sphinx, Elasticsearch…), every time using traditional term-based ranking algorithms. They work for most search queries, but too often I have found myself using Google to search within those websites (using the site: query operator) because it returns better results :-/
The application of vector embeddings to search is commonly called semantic search.
The documents are ~9000 automotive-related questions, searchable at carsguide.com.au/ask-the-guide.
Disclaimer: Carsguide is my employer, and even though all those questions are publicly accessible, I had special access to the raw data, which made the data extraction step easier.
It is a good dataset because the questions are smallish and self-contained: one sentence, with a median length of 54 words, written to contain all the information needed, with no context required (at least once you know it is about cars!), e.g. “How to Use Bluetooth in a Suzuki Swift?”, “Where are Range Rovers made?”. When one needs to generate an embedding for a big document, it seems that the vector will fail to “capture” all the meanings of that document. From my readings, the common workaround is to split the document into smaller chunks. Chunking the right way for your type of documents while preserving good search relevancy on the other end is a challenge in itself. While it has to be done in some way when you are in this situation, it was convenient for me in this review to be able to bypass this challenge thanks to the small documents, and to focus on the embedding search relevancy.
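For completeness, a naive chunking approach might look like the sketch below: a fixed word-count window with a small overlap. This is only an illustration; real chunkers typically split on sentences or tokens and tune the sizes to the embedding model.

```python
def chunk_text(text: str, max_words: int = 100, overlap: int = 20) -> list[str]:
    """Split a long document into overlapping word-count chunks.

    The overlap means a sentence cut at a chunk boundary still appears
    whole in at least one chunk, at the cost of indexing some words twice.
    """
    words = text.split()
    step = max_words - overlap
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), step)
    ]
```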
The documents don’t include the answers. The comparison here only considers, for a given question, whether the other questions returned have a similar intent.
The 4 search ranking algorithms compared are:
- Okapi BM25, the traditional term-based algorithm
- ELSER (Elastic Learned Sparse EncodeR)
- all-MiniLM-L6-v2 from sbert.net
- Google (with the site: operator), as a reference point
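As a reminder of what the term-based baseline does, here is a small sketch of BM25 ranking using the rank_bm25 Python package. This is not the exact setup used in the review (which relies on a search engine's built-in BM25); it only illustrates term-based scoring, with naive whitespace tokenisation, over three questions quoted in this post:

```python
from rank_bm25 import BM25Okapi

# Three questions from the dataset, tokenised very naively
# (lowercased, trailing "?" stripped, split on spaces).
corpus = [
    "How to Use Bluetooth in a Suzuki Swift?",
    "Where are Range Rovers made?",
    "How to change oil on a Ford Ranger?",
]
tokenized_corpus = [doc.lower().rstrip("?").split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "suzuki swift bluetooth"
scores = bm25.get_scores(query.lower().split())

# The first question should score highest because it shares the query terms.
for score, doc in sorted(zip(scores, corpus), reverse=True):
    print(f"{score:.3f}  {doc}")
```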
Dense Vector and Sparse Vector
All-MiniLM-L6-v2 produces dense vectors, while ELSER produces sparse vectors. These are quite different approaches and there is a lot of reading out there about them, but we will just cover the main aspects and how they affect the results of this review.
A dense vector embedding is a vector of fixed dimension (384 for all-MiniLM-L6-v2): any encoded document results in a 384-dimension vector. For retrieval, the query is first encoded into another 384-dimension vector, which is then compared against all the encoded documents through a vector search returning the nearest neighbours. Considered together, the vector values represent the semantic meaning of the document. Considered separately, each value doesn't mean anything.
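In its simplest brute-force form, that retrieval can be sketched as below (the three documents are questions quoted in this post; the query is my own example):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "How to Use Bluetooth in a Suzuki Swift?",
    "Where are Range Rovers made?",
    "How to change oil on a Ford Ranger?",
]
# Encode every document once; normalised vectors make the dot product a cosine similarity.
doc_vectors = model.encode(documents, normalize_embeddings=True)      # shape (3, 384)
query_vector = model.encode("pairing my phone in a Swift", normalize_embeddings=True)

scores = doc_vectors @ query_vector      # one similarity score per document
for i in np.argsort(-scores):            # nearest neighbours first
    print(f"{scores[i]:.3f}  {documents[i]}")
```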
A sparse vector embedding is a vector of very large dimension (around 30,000 terms for ELSER) where only a small fraction of the entries are non-zero. More information can be found online about the SPLADE model from Naver, whose architecture was used to build ELSER.
For example, the ELSER vector embedding for “How to Use Bluetooth in a Suzuki Swift?” is these 51 weighted tokens:
{
"software": 0.23656718,
"usb": 0.48834926,
"use": 0.9640181,
"while": 0.095061995,
"bmw": 0.72607034,
"tracking": 0.098949105,
"mode": 0.65638715,
"motorcycle": 0.40226376,
"compatible": 0.27592084,
"protocol": 0.015813658,
"transmission": 0.5294601,
"should": 0.20630935,
"connection": 0.28526223,
"communication": 0.28686664,
"cable": 0.476537,
"signal": 0.2654448,
"guide": 0.21253671,
"swift": 2.4527044,
"##tooth": 1.9773918,
"unlock": 0.0679657,
"method": 0.43937576,
"gps": 0.1820281,
"tool": 0.21815841,
"route": 0.09803897,
"transfer": 0.062095787,
"driver": 0.7232692,
"phone": 0.09017429,
"device": 0.07086152,
"link": 0.7142692,
"manual": 0.34261033,
"interface": 0.026841396,
"smart": 0.42636025,
"bike": 0.49763623,
"vehicle": 0.3884574,
"button": 0.8167683,
"can": 0.41126505,
"how": 0.34164882,
"golf": 0.41182873,
"engine": 0.50633496,
"car": 0.030690737,
"kit": 0.07260544,
"suzuki": 2.3925042,
"connect": 0.19877134,
"charge": 0.2748698,
"relay": 0.20421943,
"way": 0.00975579,
"blue": 1.6325705,
"setup": 0.49366593,
"wireless": 0.9060208,
"step": 0.8354686,
"to": 0.029833497
}
For “How to change oil on a Ford Ranger?” it is 58 tokens; for “Where are Range Rovers made?”, it is 41 tokens.
From the official documentation: “ELSER expands the indexed documents into this collections of terms […] These expanded terms are weighted as some of them are more significant than others”. This provides a more “interpretable” vector, as the example above shows. Elastic.co insists that those terms or tokens are NOT synonyms (a trick already used in traditional search engines to improve recall).
The retrieval algorithm is different as well. I understand that ELSER can use a “traditional” inverted index, which is fast, faster than some vector searches. Actually, vector search speed depends on the vector size and the search method used, so that is relative.
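For reference, querying an ELSER-indexed collection from Python might look like the sketch below, using Elasticsearch's text_expansion query (8.x syntax). The cluster URL, index name, tokens field and source field are assumptions about the setup; check the ELSER documentation for the exact mapping and deployed model id:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")       # assumed local cluster

response = es.search(
    index="ask-the-guide",                        # hypothetical index name
    size=5,
    query={
        "text_expansion": {
            "ml.tokens": {                        # field holding the expanded terms
                "model_id": ".elser_model_1",     # deployed ELSER model id
                "model_text": "How to Use Bluetooth in a Suzuki Swift?",
            }
        }
    },
)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("question"))
```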
ELSER is popular within the Elasticsearch “sphere of influence”, that is, within its (big) past and present user base. It is well promoted by elastic.co. Besides that, I think it is a good representative of the sparse vector embedding implementations developed by other actors (mainly SPLADE). That is why I found it interesting to add it to this review.
This is the end of what I felt was the required information to grasp the concepts and understand the review and its findings. Part 2, the actual review, is coming soon, stay tuned.