By Gabriel, 31 Oct 2023, updated 06 Nov 2023
The goal of this review is to compare the quality of the new semantic search algorithms using embeddings against traditional term-based search ranking algorithms at retrieving relevant content. The traditional one is Okapi BM25, and the new ones are the Elastic Learned Sparse EncodeR (ELSER) and all-MiniLM-L6-v2 from sbert.net. Google is added because it is a good reference point. Which one works for what type of query? Part 1 presents the background.
A computer screen and one big magnifying glass, sai-line art_style, by Stable Diffusion XL
This turned out to be a bigger piece of work than anticipated, hence I have split it into two parts.
With the rise of LLMs in the last 12 months, there has been renewed interest in using embeddings to perform search. At least this is my observation. Machine learning is a new field for me, and like many other IT professionals I am trying to understand what LLMs and related techniques can and cannot do, and how we can efficiently apply them to existing IT challenges.
What is an embedding?
An embedding (aka vector embedding or text embedding) is a Natural Language Processing technique where a trained model transforms any piece of text into a list of numbers. Embedding models are sometimes called Sentence Transformers. For example, I have computed the embedding of “How to Use Bluetooth in a Suzuki Swift?” using the all-MiniLM-L6-v2 model. The output is a list of 384 numbers: -0.01730852946639061, -0.059170424938201904, -0.043736234307289124, ...[truncated]. Mathematically speaking, the model computes a projection of any piece of text onto one point in a 384-dimension space. One can read this list of numbers as the coordinates of that text in a (weird) high-dimensional space.
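As an illustration, here is a minimal sketch of how that embedding can be computed with the sentence-transformers library from sbert.net (assuming it is installed, e.g. with pip install sentence-transformers):

```python
from sentence_transformers import SentenceTransformer

# Load the pre-trained all-MiniLM-L6-v2 model from sbert.net.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode a piece of text into a 384-dimension vector.
embedding = model.encode("How to Use Bluetooth in a Suzuki Swift?")

print(embedding.shape)  # (384,)
print(embedding[:3])    # the first few of the 384 coordinates
```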
The key feature of these vectors is that two semantically close texts will be projected onto two close points. During training, the model has captured the similarity between words and sets of words that may be lexicographically different but semantically close. For example, the point for “a car” will be close to the point for “a vehicle with 4 wheels” in that space. That is a fantastic feature for search.
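That closeness can be measured with cosine similarity. A quick sketch (the unrelated third sentence is my own addition for contrast; the exact scores will depend on the model version):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# The pair from the paragraph above, plus an unrelated sentence for contrast.
embeddings = model.encode([
    "a car",
    "a vehicle with 4 wheels",
    "a chocolate cake recipe",
])

print(util.cos_sim(embeddings[0], embeddings[1]))  # expected: relatively high
print(util.cos_sim(embeddings[0], embeddings[2]))  # expected: much lower
```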
Chronologically, this technique was already well established by several companies and research teams before the big bang of ChatGPT and other LLMs a year ago. For example, OpenAI offered its own trained embedding model as SaaS before ChatGPT (producing 1536-dimension vectors). I think those models are smaller and easier to train to a satisfactory level than LLMs are, hence they came first in the research work of many teams around the world. But for many of us, non machine learning engineers, the technique came into the light thanks to LLMs… and their two main limitations: the difficulty (understand: the high cost) of adapting an LLM to custom or recent data, and their tendency to hallucinate.
The difficulty of adapting LLMs to custom or recent data
LLMs are generic, having “crunched” all the accessible text data in the world during their training phase. But what if, at a personal level, I want to question an LLM about my personal emails, agenda, address book, or bank statements (e.g. “What did I buy in my last Amazon order?”, “Which mechanic did I use last time to fix my car?”)? And what if, at a company level, I want to question an LLM about internal documentation or Slack messages (“How do I request access to that service?”, etc.)?
If one wants an LLM customised for one's own data, a fine-tuning training run is required. This is expensive (running machine learning algorithms for long periods of time on expensive computers). Then, when new data is added or existing data is updated, one would have to do it all over again. For a long time the cut-off date of ChatGPT was September 2021. I believe OpenAI has since come up with a workaround to include more recent data, but it has been very secretive about how this is achieved. It is unlikely that the entire costly training was run all over again, and this workaround probably won't give the same level of quality for answers about recent data as it does for data prior to September 2021.
The tendency to hallucinate
LLMs have the inherent behaviour of sometimes fabricating random and wrong responses (but always nicely formulated!). This is commonly called hallucination. If it sounds like the model has no idea what it is talking about, it is because it doesn't. A good reminder that an LLM just computes statistically the most likely next words to output. How can we force an LLM to output a verifiable response, based on precise data?
The embeddings search solution: RAG
The idea is a multi-step approach:
1. Have a database where the vector embeddings of all the data have been computed and stored, and keep it updated by adding vectors for new data. This is an easy incremental effort, thus solving the custom or recent data problem.
2. When a question is asked of the LLM, compute the vector embedding of that question.
3. Retrieve the closest data from the vector database and combine it with the original question to build the prompt for the LLM. In the prompt, the retrieved text is introduced as “hints”, i.e. elements of information to use to answer the question.
This is called Retrieval-Augmented Generation, aka RAG (a minimal sketch follows below). And this should solve the hallucination problem too, by “grounding” the model's response in verifiable elements of data.
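Here is a minimal sketch of that flow, assuming the same all-MiniLM-L6-v2 model. The two sample documents and the prompt template are made up for illustration, the “database” is just an in-memory list, and the final call to an actual LLM is left out:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Step 1: compute and store embeddings for all data (made-up sample documents).
documents = [
    "The Suzuki Swift pairs with a phone through the Bluetooth menu of the audio unit.",
    "The Ford Ranger oil change interval is listed in the owner's manual.",
]
doc_embeddings = model.encode(documents)

# Step 2: embed the question being asked.
question = "How to Use Bluetooth in a Suzuki Swift?"
question_embedding = model.encode(question)

# Step 3: retrieve the closest documents and build the prompt with them as "hints".
hits = util.semantic_search(question_embedding, doc_embeddings, top_k=2)[0]
hints = "\n".join(documents[hit["corpus_id"]] for hit in hits)

prompt = (
    "Answer the question using only the hints below.\n"
    f"Hints:\n{hints}\n\n"
    f"Question: {question}\nAnswer:"
)
print(prompt)  # this prompt would then be sent to the LLM
```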
On a side note, I suppose one could also use a traditional indexing and search algorithm instead of embeddings to find similar content, then combine it with the question in exactly the same way to build the LLM prompt, and still call it RAG. But most of the usages I have found on the web so far use embeddings…
Embedding search in general
RAG is still very much at the experimental stage in a lot of places. It is promising in my opinion, but there are a lot of caveats, and we are yet to be presented with a successful application. That is not the goal of this post. Instead, I wanted to review how good vector embedding search can be compared to traditional search ranking algorithms when simply applied to search: can embedding search significantly improve local search of documents on your personal machine, or website search, i.e. search within a site?
As a web developer I have implemented or maintained search tools on numerous websites throughout my career. I still do. I have experimented with different tools (Apache Solr, Sphinx, Elasticsearch…), every time using traditional term-based ranking algorithms. They work for most search queries, but too often I have found myself using Google to search within those websites (using the site: query operator) because it returns better results :-/
The application of vector embeddings to search is commonly called semantic search.
The documents are ~9000 automotive-related questions, searchable at carsguide.com.au/ask-the-guide.
Disclaimer: Carsguide is my employer, and even though all those questions are publicly accessible, I had special access to the raw data, which made the data extraction step easier.
It is a good dataset because the questions are smallish and self-contained: one sentence, with a median length of 54 words, written to contain all the information needed, with no context required (at least once you know it is about cars!), e.g. “How to Use Bluetooth in a Suzuki Swift?”, “Where are Range Rovers made?”. When one needs to generate an embedding for a big document, it seems that the vector will fail to “capture” all the meanings of that document. From my readings, the common workaround is to split the document into smaller chunks. Chunking the right way for your type of documents while preserving good search relevancy on the other end is a challenge in itself. While it has to be done in some way when you are in this situation, it was convenient for me in this review to be able to bypass this challenge thanks to the small documents, and to focus on the embedding search relevancy.
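For completeness, a naive chunking approach might look like the sketch below: a fixed word-count window with a small overlap. This is only an illustration; real chunkers typically split on sentences or tokens and tune the sizes to the embedding model.

```python
def chunk_text(text: str, max_words: int = 100, overlap: int = 20) -> list[str]:
    """Split a long document into overlapping word-count chunks.

    The overlap means a sentence cut at a chunk boundary still appears
    whole in at least one chunk, at the cost of indexing some words twice.
    """
    words = text.split()
    step = max_words - overlap
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), step)
    ]
```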
The documents don’t include the answers. The comparison here only considers, for a given question, whether the other questions returned have a similar intent.
The 4 search ranking algorithms compared are:
- Okapi BM25, the traditional term-based algorithm
- ELSER (Elastic Learned Sparse EncodeR)
- all-MiniLM-L6-v2 from sbert.net
- Google (with the site: operator), as a reference point
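As a reminder of what the term-based baseline does, here is a small sketch of BM25 ranking using the rank_bm25 Python package. This is not the exact setup used in the review (which relies on a search engine's built-in BM25); it only illustrates term-based scoring, with naive whitespace tokenisation, over three questions quoted in this post:

```python
from rank_bm25 import BM25Okapi

# Three questions from the dataset, tokenised very naively
# (lowercased, trailing "?" stripped, split on spaces).
corpus = [
    "How to Use Bluetooth in a Suzuki Swift?",
    "Where are Range Rovers made?",
    "How to change oil on a Ford Ranger?",
]
tokenized_corpus = [doc.lower().rstrip("?").split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "suzuki swift bluetooth"
scores = bm25.get_scores(query.lower().split())

# The first question should score highest because it shares the query terms.
for score, doc in sorted(zip(scores, corpus), reverse=True):
    print(f"{score:.3f}  {doc}")
```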
Dense Vector and Sparse Vector
All-MiniLM-L6-v2 produces dense vectors, while ELSER produces sparse vectors. These are quite different approaches and there is a lot of reading out there about them, but we will just cover the main aspects and how they affect the results of this review.
A dense vector embedding is a vector of fixed dimension (384 for all-MiniLM-L6-v2): any encoded document results in a 384-dimension vector. For retrieval, the query is first encoded into another 384-dimension vector, which is then compared against all the encoded documents through a vector search returning the nearest neighbours. Considered together, the vector values represent the semantic meaning of the document. Considered separately, each value doesn't mean anything.
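In its simplest brute-force form, that retrieval can be sketched as below (the three documents are questions quoted in this post; the query is my own example):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "How to Use Bluetooth in a Suzuki Swift?",
    "Where are Range Rovers made?",
    "How to change oil on a Ford Ranger?",
]
# Encode every document once; normalised vectors make the dot product a cosine similarity.
doc_vectors = model.encode(documents, normalize_embeddings=True)      # shape (3, 384)
query_vector = model.encode("pairing my phone in a Swift", normalize_embeddings=True)

scores = doc_vectors @ query_vector      # one similarity score per document
for i in np.argsort(-scores):            # nearest neighbours first
    print(f"{scores[i]:.3f}  {documents[i]}")
```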
A sparse vector embedding is a vector of very large dimension (around 30,000 terms for ELSER) where only a small fraction of the entries are non-zero. More information can be found online about the SPLADE model from Naver, whose architecture was used to build ELSER.
For example, the ELSER vector embedding for “How to Use Bluetooth in a Suzuki Swift?” is these 51 weighted tokens:
{
"software": 0.23656718,
"usb": 0.48834926,
"use": 0.9640181,
"while": 0.095061995,
"bmw": 0.72607034,
"tracking": 0.098949105,
"mode": 0.65638715,
"motorcycle": 0.40226376,
"compatible": 0.27592084,
"protocol": 0.015813658,
"transmission": 0.5294601,
"should": 0.20630935,
"connection": 0.28526223,
"communication": 0.28686664,
"cable": 0.476537,
"signal": 0.2654448,
"guide": 0.21253671,
"swift": 2.4527044,
"##tooth": 1.9773918,
"unlock": 0.0679657,
"method": 0.43937576,
"gps": 0.1820281,
"tool": 0.21815841,
"route": 0.09803897,
"transfer": 0.062095787,
"driver": 0.7232692,
"phone": 0.09017429,
"device": 0.07086152,
"link": 0.7142692,
"manual": 0.34261033,
"interface": 0.026841396,
"smart": 0.42636025,
"bike": 0.49763623,
"vehicle": 0.3884574,
"button": 0.8167683,
"can": 0.41126505,
"how": 0.34164882,
"golf": 0.41182873,
"engine": 0.50633496,
"car": 0.030690737,
"kit": 0.07260544,
"suzuki": 2.3925042,
"connect": 0.19877134,
"charge": 0.2748698,
"relay": 0.20421943,
"way": 0.00975579,
"blue": 1.6325705,
"setup": 0.49366593,
"wireless": 0.9060208,
"step": 0.8354686,
"to": 0.029833497
}
For “How to change oil on a Ford Ranger?” it is 58 tokens; for “Where are Range Rovers made?”, it is 41 tokens.
From the official documentation: “ELSER expands the indexed documents into this collections of terms […] These expanded terms are weighted as some of them are more significant than others”. This provides a more “interpretable” vector, as the example above shows. Elastic.co insists that those terms or tokens are NOT synonyms (a trick already used in traditional search engines to improve recall).
The retrieval algorithm is different as well. I understand that ELSER can use a “traditional” inverted index, which is fast, faster than some vector searches. Actually, vector search speed depends on the vector size and the search method used, so that is relative.
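For reference, querying an ELSER-indexed collection from Python might look like the sketch below, using Elasticsearch's text_expansion query (8.x syntax). The cluster URL, index name, tokens field and source field are assumptions about the setup; check the ELSER documentation for the exact mapping and deployed model id:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")       # assumed local cluster

response = es.search(
    index="ask-the-guide",                        # hypothetical index name
    size=5,
    query={
        "text_expansion": {
            "ml.tokens": {                        # field holding the expanded terms
                "model_id": ".elser_model_1",     # deployed ELSER model id
                "model_text": "How to Use Bluetooth in a Suzuki Swift?",
            }
        }
    },
)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("question"))
```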
ELSER is popular within the Elasticsearch “sphere of influence”, that is, within its (big) past and present user base. It is well promoted by elastic.co. Besides that, I think it is a good representative of the sparse vector embedding implementations developed by other actors (mainly SPLADE). That is why I found it interesting to add it to this review.
This is the end of what I felt was the required information to grasp the concepts and understand the review and its findings. Part 2, the actual review, is coming soon, stay tuned.