Old meets New: Hybrid Search
Why state-of-the-art semantic similarity search isn't blowing away the tried-and-true methods
In the realm of digital data, being able to swiftly and efficiently retrieve pertinent information is imperative. Vector search is hyped as the current meta for search, but there might be a better way. Hybrid search is an approach that combines the best of OG search algorithms (like BM25) with the current state of the art. I want to start with a bit of general history about search, to remind us of where we have come from and how a search built today can benefit from the tech that was hyped in its day.
History of Text Search Algorithms
Imagine you are a software engineer building an app like Crunchbase that lets someone search for information about different companies. At its core, the app needs an intuitive search. Let’s take a look at how an engineer’s choice of technology and approach would have changed over time.
Early Days – Exact Match Searching
In the nascent days of web development, searching primarily revolved around exact string matches. Databases would simply retrieve entries that perfectly matched a user’s query. This was straightforward but lacked flexibility. Misspelling or phrasing something slightly differently meant you got no relevant results.
Engineers would rely on simple SQL queries to match strings.
SELECT * FROM companies WHERE name = 'apple';
Wildcard Searches and Regular Expressions
To introduce some flexibility, engineers started using wildcard characters. For instance, searching for "appl*" might return "apple", "applied materials", and "applebees". Regular expressions provided more granular control over search patterns, but these methods were still limited in capturing the intent behind diverse user queries.
With SQL databases, the `LIKE` operator was often used.
SELECT * FROM companies WHERE name LIKE 'appl%';
For regex, many turned to Perl, PHP, or JavaScript's regex capabilities.
appl.*
Fuzzy Searching
The realization that users don't always spell correctly, especially in fast-paced web interactions, led to the adoption of fuzzy searching. Techniques like the Levenshtein distance (or edit distance) measured how many edits (insertions, deletions, substitutions) it would take to transform one string into another. This allowed databases to return results that were "close enough" to the queried term.
PostgreSQL offers the pg_trgm module for fuzzy string matching (based on trigram similarity rather than edit distance, but in the same spirit).
-- requires the extension: CREATE EXTENSION pg_trgm;
SELECT * FROM companies WHERE SIMILARITY(name, 'aplpe') > 0.4;
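To make the edit-distance idea itself concrete, here is a minimal pure-Python sketch of the Levenshtein distance (illustrative only; in practice you would lean on the database or a dedicated library):

def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute ca -> cb (free if equal)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("aplpe", "apple"))  # 2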
BM25 and Probabilistic Search
BM25 emerged as a game-changer. Instead of just looking at raw term frequencies, it factored in document length and the rarity of terms. This probabilistic approach provided a more sophisticated and relevant ranking of search results, greatly enhancing user experience on web platforms. Elasticsearch became a go-to platform, with built-in BM25 support; a match query looks like this:
{
  "query": {
    "match": {
      "companies": {
        "query": "apple",
        "operator": "and"
      }
    }
  }
}
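To see what this ranking function is actually doing, here is a stripped-down sketch of BM25 scoring in Python (constants and IDF smoothing vary slightly by implementation; this follows the common Okapi form):

import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query, given a tokenized corpus."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N  # average document length
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)         # document frequency
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))  # rare terms weigh more
        tf = doc_terms.count(term)                       # raw term frequency
        # tf saturates via k1; b normalizes by document length
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score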
Full-Text Search Engines
As databases grew and user expectations evolved, specialized full-text search engines like Elasticsearch and Apache Solr came into play. These tools are designed for high-speed, complex text searches, utilizing algorithms like BM25 and providing features like faceted search, filtering, and scoring. A typical query in Solr might look like:
q=companies:apple&defType=edismax&qf=companies^1.5
Semantic Search and Vector Search
Traditional search methods struggle to understand context. The advent of word embeddings and neural networks led to semantic search capabilities. Tools like word2vec or FastText can convert words into high-dimensional vectors that capture semantic meaning. The idea is: words with similar meanings have vectors that are close in the vector space. Later on, more advanced models like BERT and its variants took this a notch higher, offering even richer semantic embeddings.
Searching in these vector spaces (often dubbed "vector search") is fundamentally different from traditional methods. ANN (Approximate Nearest Neighbors) algorithms, for instance, efficiently find the most similar vectors in large datasets.
For word embeddings, the `gensim` library with its word2vec implementation was commonly used in Python.
from gensim.models import Word2Vec

sentences = [["apple", "makes", "phones"], ["applebees", "serves", "food"]]  # tokenized corpus
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
model.save("word2vec.model")
For vector search itself, libraries like Annoy and Faiss became popular for finding nearest neighbors in large datasets.
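A minimal Faiss example might look like the following (random vectors stand in for real embeddings here; IndexFlatL2 is actually exact brute-force search, and Faiss's approximate indexes like IndexIVFFlat trade a little accuracy for speed on larger datasets):

import numpy as np
import faiss

dim = 100  # must match the dimensionality of your embeddings
vectors = np.random.rand(1000, dim).astype('float32')  # stand-in for real embeddings

index = faiss.IndexFlatL2(dim)  # exact L2 search; swap in an ANN index at scale
index.add(vectors)

query = np.random.rand(1, dim).astype('float32')
distances, indices = index.search(query, 5)  # 5 nearest neighbors
print(indices[0])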
The Age of Transformers
With models like BERT, GPT, and their successors, we've entered an era where the context and intricate nuances of search queries can be grasped more deeply. Engineers now integrate transformer-based models into web applications, enabling rich, context-aware search experiences. Dense retrieval built on embeddings, from models like Sentence-BERT to hosted APIs like OpenAI's, is being used for more accurate and efficient vector search.
With the rise of transformer models, libraries like Hugging Face's `transformers` became essential.
from transformers import BertTokenizer, BertForSequenceClassification
import torch

# Load a pretrained BERT with a classification head
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

# Tokenize a sentence and run it through the model
inputs = tokenizer("I love apples", return_tensors="pt")
labels = torch.tensor([1]).unsqueeze(0)  # dummy label, shape (1, 1)
outputs = model(**inputs, labels=labels)
Later on, tools like Sentence Transformers (`sentence-transformers`) made it even easier to create embeddings for entire sentences.
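For instance, a few lines get you sentence embeddings and a similarity score (the model name below is just one of the library's pretrained checkpoints):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(["I love apples", "Apples are my favorite fruit"])

# Semantically similar sentences end up close together in vector space
print(util.cos_sim(embeddings[0], embeddings[1]))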
Postgres has pgvector for storing and searching vector embeddings.
What should I use today?
Each phase of search's evolution had its own set of tools and libraries. As technology evolved, the goal shifted from merely finding exact matches to understanding context and semantics.
Just like with anything in software engineering, it can be difficult to know which approach to use for searching the content you have. Keyword and fuzzy search are both very easy to set up, and if you are using a relational database like Postgres, you already have them at your disposal. My general advice would be to push your existing database as far as it goes for your app. If the search experience for your users doesn’t quite feel right after something simple like this, or you need to scale up the data you are ingesting, then you will need more firepower. Elasticsearch is going to get you pretty far: it supports many indexing methods, including vectors, and has been battle-tested by a large user base.
Not to mention the managed version Amazon infamously built.
If you are really dead set on using vectors and are looking for vector databases, check out this post about them.
Sparse vs Dense
Most of us are probably fairly familiar with building a search with the technology available in a relational database (if not, check this out). So we are going to play with sparse vs dense searching, since those are not as widely supported and a dedicated service might be needed. The differences between the two methods are nuanced: both aim to surface relevant results, but the way they measure similarity is notably different, with some key trade-offs.
Sparse Searching (BM25)
Sparse representations, the kind BM25 operates on, are high-dimensional vectors with mostly zeros. A classic example is the Bag-of-Words (BoW) representation or TF-IDF weighting of text, where the vector's dimensionality equals the vocabulary size. In such a setup, words present in a document are denoted by non-zero values at their respective positions, while most other values are zero. These vectors can be stored compactly using structures like compressed sparse row (CSR) matrices, saving space. As the vocabulary grows, however, so does the dimensionality, making the vectors longer.
Searching in sparse vectors can be efficient with methods like inverted indices that quickly map terms to documents. But, with very high dimensionality or complex similarity measures, computational costs rise. One limitation is that these representations may miss semantic relationships, like the connection between "king" and "monarch." Nonetheless, they're ideal for structured, keyword-focused data, particularly in large datasets needing scalable search solutions.
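You can see this sparsity directly with scikit-learn, whose TfidfVectorizer returns a compressed sparse row (CSR) matrix (a small illustrative sketch, not tied to any particular search engine):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "apple makes phones and laptops",
    "applebees is a restaurant chain",
    "applied materials supplies chip makers",
]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)  # scipy CSR matrix: only non-zero values are stored

print(matrix.shape)  # (3, vocabulary_size): one dimension per vocabulary word
print(matrix.nnz)    # non-zero entries, far fewer than rows * columns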
Dense Searching (Vector Similarity)
Dense vectors, like those from word2vec, FastText, and BERT, are low-dimensional and filled with real numbers, with dimensions such as 300 for word2vec remaining consistent regardless of dataset size. Every element is stored, potentially increasing storage demands. Despite the computational intensity of searching dense vector spaces for similar vectors, techniques like Approximate Nearest Neighbors and KD-trees optimize the process. The strength of dense vectors lies in capturing semantic relationships, positioning similar words or phrases closely in vector space. They excel with unstructured datasets and when queries prioritize semantic understanding.
Hybrid Searching: The Best of Both Worlds
The synthesis of BM25 (sparse) and FAISS (dense) promises a more refined search experience. Imagine using BM25 for a primary sift of data, swiftly narrowing down potential results, and then employing FAISS to fine-tune and rank these results based on nuanced semantics.
This fusion not only ensures speed but also precision, delivering results that are both keyword relevant and contextually apt.
As a case study, consider an e-commerce platform. A user might type "winter coat with fur hood." With BM25, the platform quickly zeroes in on all listings with these keywords. But to rank those results by the user's nuanced preferences, like style or brand sentiment, vector similarity is employed, refining the search to match user intent.
Here is an example of what this would look like in Python:
# pip install rank_bm25 faiss-cpu sentence-transformers
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import faiss
class HybridSearch:
    def __init__(self, documents):
        self.documents = documents

        # BM25 initialization
        tokenized_corpus = [doc.split(" ") for doc in documents]
        self.bm25 = BM25Okapi(tokenized_corpus)

        # Sentence transformer for embeddings
        self.model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
        self.document_embeddings = self.model.encode(documents)

        # FAISS initialization
        self.index = faiss.IndexFlatL2(self.document_embeddings.shape[1])
        self.index.add(np.array(self.document_embeddings).astype('float32'))

    def search(self, query, top_n=10):
        top_n = min(top_n, len(self.documents))  # guard against asking for more docs than exist

        # BM25 search: take the top_n highest-scoring documents as candidates
        bm25_scores = self.bm25.get_scores(query.split(" "))
        top_docs_indices = np.argsort(bm25_scores)[-top_n:]

        # Get embeddings of the top documents from the BM25 search
        top_docs_embeddings = [self.document_embeddings[i] for i in top_docs_indices]
        query_embedding = self.model.encode([query])

        # FAISS search over just the candidate documents
        sub_index = faiss.IndexFlatL2(top_docs_embeddings[0].shape[0])
        sub_index.add(np.array(top_docs_embeddings).astype('float32'))
        _, sub_dense_ranked_indices = sub_index.search(np.array(query_embedding).astype('float32'), top_n)

        # Map FAISS results back to original document indices
        final_ranked_indices = [top_docs_indices[i] for i in sub_dense_ranked_indices[0]]

        # Retrieve the actual documents, now ranked by semantic similarity
        ranked_docs = [self.documents[i] for i in final_ranked_indices]
        return ranked_docs
# Sample usage
documents = [
    "Artificial Intelligence is changing the world.",
    "Machine Learning is a subset of AI.",
    "Deep Learning is a subset of Machine Learning.",
    "Natural Language Processing involves understanding text.",
    "Computer Vision allows machines to see and understand.",
    "AI includes areas like NLP and Computer Vision.",
    "The Pyramids of Giza are architectural marvels.",
    "Mozart was a prolific composer during the classical era.",
    "Mount Everest is the tallest mountain on Earth.",
    "The Nile is one of the world's longest rivers.",
    "Van Gogh's Starry Night is a popular piece of art.",
    "Basketball is a sport played with a round ball and two teams."
]

hs = HybridSearch(documents)
query = "Tell me about AI in text and vision."
results = hs.search(query, top_n=10)
print(results)
Other Considerations
The methods discussed here improve the relevance of results returned from a search. Choosing one approach over another will not automatically improve other aspects of performance; if you want to improve something like speed, look at properly indexing your data.
There are a lot of people who want the ChatGPT experience of looking information up. I have shared some thoughts previously on how I would think about building this system today. A fine-tuned model whose sole task is search, and which is fast and cheap to run, feels to me like the north star. It isn't really feasible yet: fine-tuning is still slow and expensive, and hosting the resulting model is still a challenge. It seems that RAG will remain the best approach for the near future, especially as context windows for LLMs continue to increase. I was blown away by how well claude.ai works when given 100 pages from a random ebook. You should totally try this out too.
Conclusion
Google has PageRank, Facebook has its feed-ranking algorithm, Tinder has its Elo score. Each of these is a revolutionary method of search and ranking that only came about because its creators reasoned about the data they collected. Look at the data you have and consider who will be searching through it. Are there indexes you can build to let someone do a simple filter before a full-on text search is needed? If full-text search is needed, how big do you expect the data to get?
Hybrid search might be right for you. An interesting way to test it is to take the example code above, plug in your own data, and play with the search. Does it feel like it works for you and your data?
Further Reading & Resources
For those keen on diving deeper, here's a curated list of resources:
1. A comprehensive guide on TF-IDF and BM25
2. I highly recommend Pinecone's Faiss: The Missing Manual if you are interested in learning more about vector search