Semantic Search With Facebook AI Similarity Search (FAISS)

Introduction

Facebook AI Similarity Search (FAISS)

Faiss (Facebook AI Similarity Search) is an open-source library developed by Facebook, designed for efficient similarity searches and clustering of dense vectors. This library addresses challenges commonly encountered in machine learning applications, particularly those involving high-dimensional vectors, such as image recognition and recommendation systems. Its widespread applicability, combined with features like scalability and flexibility, makes it a valuable tool for various machine learning and data analysis tasks, as demonstrated in its real-world application scenarios outlined in the Facebook Engineering blog post.

Faiss employs advanced techniques like indexing and quantization to accelerate similarity searches in large datasets. Its versatility is evident in its support for both CPU and GPU implementations, ensuring scalability across different hardware configurations. Faiss offers flexibility with options for both exact and approximate similarity searches, allowing users to tailor the level of precision to their specific requirements.
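The trade-off between exact and approximate search can be illustrated without Faiss itself. The following is a minimal pure-Python sketch, not Faiss's implementation: it compares a brute-force scan with an inverted-list scheme that only inspects the clusters closest to the query. The function names, the random data, and the use of randomly sampled vectors as centroids (a crude stand-in for k-means training) are all assumptions made for illustration.

```python
import random


def l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_search(vectors, query, k):
    """Brute force: compare the query with every stored vector."""
    ranked = sorted(range(len(vectors)), key=lambda i: l2(vectors[i], query))
    return ranked[:k]


def build_inverted_lists(vectors, centroids):
    """Assign each vector id to the list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for i, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(centroids[c], vec))
        lists[nearest].append(i)
    return lists


def approximate_search(vectors, centroids, lists, query, k, nprobe):
    """Scan only the `nprobe` lists whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: l2(centroids[c], query))[:nprobe]
    candidates = [i for c in probe for i in lists[c]]
    candidates.sort(key=lambda i: l2(vectors[i], query))
    return candidates[:k]


random.seed(0)
dim, n_vectors, n_lists = 8, 500, 10
vectors = [[random.random() for _ in range(dim)] for _ in range(n_vectors)]
centroids = random.sample(vectors, n_lists)  # stand-in for trained centroids
lists = build_inverted_lists(vectors, centroids)
query = [random.random() for _ in range(dim)]

exact = exact_search(vectors, query, k=5)
approx = approximate_search(vectors, centroids, lists, query, k=5, nprobe=3)
full = approximate_search(vectors, centroids, lists, query, k=5, nprobe=n_lists)

print("exact   :", exact)
print("nprobe=3:", approx)
print("overlap :", len(set(exact) & set(approx)), "of 5")
```

Probing fewer lists means fewer distance computations but a chance of missing a true neighbour; probing every list (`nprobe=n_lists`) degenerates to the exact result. This is the kind of precision dial the paragraph above refers to.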

Semantic Search

Semantic search refers to a search technique that aims to improve the accuracy of search results by understanding the intent and context behind a user’s query. Unlike traditional keyword-based search engines, which rely on matching specific words or phrases, semantic search focuses on the meaning of the query and the content of the documents.

Semantic search systems use natural language processing (NLP) and machine learning algorithms to comprehend the context, relationships, and semantics of words and phrases. This allows them to deliver more relevant results by considering the user’s intent and the context of the query.

Some key components of semantic search include:

Context Understanding: Semantic search engines analyze the context of a query, taking into account factors such as user location, previous search history, and the relationships between words.
Entity Recognition: Identifying and understanding entities (e.g., people, places, and things) in the query and the documents being searched can enhance the accuracy of results.
Concept Matching: Semantic search systems go beyond simple keyword matching and attempt to match the underlying concepts or meanings in the query and documents.
Natural Language Processing (NLP): NLP techniques are employed to understand the natural language in queries and documents, helping the search engine better interpret and respond to user input.
Machine Learning: Algorithms learn from patterns and user behavior, continuously improving the relevance of search results over time.

Semantic search is particularly beneficial for complex queries, ambiguous language, and situations where users may not use the exact keywords that would typically yield the desired results. It has applications in various fields, including information retrieval, recommendation systems, and question-answering systems.
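To make the difference between keyword matching and concept matching concrete, here is a small sketch. The three-dimensional vectors are hand-picked, hypothetical "embeddings" (a real embedding model produces vectors with hundreds of dimensions), and the helper names are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def keyword_overlap(query, text):
    """Fraction of query words that literally appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words)


# Hypothetical embeddings, chosen by hand so that related texts point
# in similar directions.
documents = {
    "how to repair a bicycle tyre": [0.9, 0.1, 0.0],
    "best pasta recipes for dinner": [0.0, 0.2, 0.9],
}
query = "fix flat bike wheel"
query_vector = [0.8, 0.2, 0.1]  # placed near the bicycle document on purpose

for text, vector in documents.items():
    overlap = keyword_overlap(query, text)
    similarity = cosine_similarity(query_vector, vector)
    print(f"{text!r}: keyword overlap={overlap:.2f}, cosine={similarity:.2f}")

best_match = max(documents, key=lambda t: cosine_similarity(query_vector, documents[t]))
print("Best semantic match:", best_match)
```

The query shares no words with either document, so a keyword score cannot separate them, while the vector comparison still ranks the bicycle document first.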

Implementation

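We can implement semantic search with Haystack's FAISSDocumentStore, which pairs a FAISS index (for the embedding vectors) with a SQL database (for the document text and metadata).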

Getting Started

First, set up a new Python environment in VSCode. Open a PowerShell terminal within VSCode and create a virtual environment with the command ->

PowerShell

python -m venv .venv

Then activate the virtual environment in the terminal using ->

PowerShell

.venv\Scripts\Activate.ps1

Now, install the required libraries using pip ->

PowerShell

# Python 3.10.2 preferred
pip install farm-haystack==1.19.0
pip install faiss-cpu==1.7.2
pip install scikit-learn==1.3.0
pip install farm-haystack[faiss]
pip install torch torchvision torchaudio -f https://download.pytorch.org/whl/cu121/torch_stable.html
pip install sentence_transformers==2.2.2
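Before running the main script, it can help to confirm that the packages ended up in the active virtual environment. The following is a small optional check using only the standard library; the helper name `installed_version` is made up for this sketch, and the list simply mirrors the pip commands above.

```python
from importlib import metadata

# Distribution names as used in the pip commands above.
REQUIRED = ["farm-haystack", "faiss-cpu", "scikit-learn", "torch", "sentence-transformers"]


def installed_version(distribution):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


report = {name: installed_version(name) for name in REQUIRED}
for name, version in report.items():
    print(f"{name:<24}{version or 'NOT INSTALLED'}")
```

Any line reporting NOT INSTALLED usually means pip ran outside the activated environment.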

We will use the Hugging Face model all-mpnet-base-v2 for semantic search. According to the Sentence-Transformers documentation, all-mpnet-base-v2 offers the best quality among its general-purpose pretrained models and is an all-round model tuned for many use cases. It is fine-tuned from microsoft/mpnet-base on a large and diverse dataset of over 1 billion training pairs. It can be loaded directly from Hugging Face, or downloaded locally. The script in the Code section expects the model under .\Models\all-mpnet-base-v2, so clone it into that folder using ->

PowerShell

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/sentence-transformers/all-mpnet-base-v2 Models/all-mpnet-base-v2

Code

Python

from haystack.pipelines import Pipeline
from haystack.schema import Document
from haystack.document_stores import FAISSDocumentStore
from haystack.nodes import EmbeddingRetriever
import torch
import os

FaissIndexPath = r".\faiss_index.faiss"
FaissJsonPath = r".\faiss_index.json"
FaissDbPath = r".\faiss_document_store.db"
EmbeddingModelPath = r".\Models\all-mpnet-base-v2"

# Use the GPU if one is available and its drivers are installed
use_gpu = torch.cuda.is_available()

if os.path.exists(FaissDbPath):
    os.remove(FaissDbPath)

JsonKnowledgeObject = {}
# Adding Content to be indexed
JsonKnowledgeObject[
    "content"
] = """Faiss (Facebook AI Similarity Search) is an open-source library developed by Facebook, designed for efficient similarity searches and clustering of dense vectors. This library addresses challenges commonly encountered in machine learning applications, particularly those involving high-dimensional vectors, such as image recognition and recommendation systems. Its widespread applicability, combined with features like scalability and flexibility, makes it a valuable tool for various machine learning and data analysis tasks, as demonstrated in its real-world application scenarios outlined in the Facebook Engineering blog post. 

Faiss employs advanced techniques like indexing and quantization to accelerate similarity searches in large datasets. Its versatility is evident in its support for both CPU and GPU implementations, ensuring scalability across different hardware configurations. Faiss offers flexibility with options for both exact and approximate similarity searches, allowing users to tailor the level of precision to their specific requirements."""

# Adding Meta Data
JsonKnowledgeObject["meta"] = {}
JsonKnowledgeObject["meta"][
    "title"
] = "Semantic Search With Facebook AI Similarity Search (FAISS)"
JsonKnowledgeObject["meta"]["author"] = "ThreadWaiting"
JsonKnowledgeObject["meta"][
    "link"
] = "https://threadwaiting.com/semantic-search-with-facebook-ai-similarity-search-faiss/"

# Initialize/Reload Document Store
document_store = FAISSDocumentStore(
    similarity="cosine", sql_url="sqlite:///faiss_document_store.db"
)

# Convert Json object to Document object
document = Document(
    content=JsonKnowledgeObject["content"], meta=JsonKnowledgeObject["meta"]
)

# Add document to the document store
document_store.write_documents([document])

# Compute embeddings for the stored documents and save the index.
# This needs to be executed every time the data gets refreshed.
retriever = EmbeddingRetriever(
    document_store=document_store, embedding_model=EmbeddingModelPath, use_gpu=use_gpu
)
document_store.update_embeddings(retriever)
document_store.save(index_path=FaissIndexPath)


# Load the saved index into a new DocumentStore instance
document_store = FAISSDocumentStore(faiss_index_path=FaissIndexPath)
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model=EmbeddingModelPath,
    model_format="sentence_transformers",
    top_k=3,
    use_gpu=use_gpu,
)

query_pipeline = Pipeline()
query_pipeline.add_node(component=retriever, name="retriever", inputs=["Query"])

output = query_pipeline.run(query="What is Faiss?")

results_documents = output["documents"]

if results_documents:
    print("\nMatching Article: \n")
    for doc in results_documents:
        print(doc.meta["title"])
        print(doc.content)
        # doc.score may be None; treat that as 0.0 before converting to a percentage
        score = round(float(doc.score or 0.0) * 100, 2)
        print("Match score:", score, "%")
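The script above indexes a single article. To index several, the same content/meta shape can be built in a loop. The sketch below is one possible way to do that; it assumes that Haystack 1.x `write_documents` accepts plain dictionaries with `content` and `meta` keys as well as `Document` objects. The helper name, the input field names, and the sample records are hypothetical.

```python
def to_haystack_dicts(articles):
    """Turn raw article records into {"content", "meta"} dictionaries.

    Records without usable text are skipped so that empty documents
    are never embedded.
    """
    documents = []
    for article in articles:
        content = (article.get("body") or "").strip()
        if not content:
            continue
        documents.append(
            {
                "content": content,
                "meta": {
                    "title": article.get("title", "Untitled"),
                    "author": article.get("author", "Unknown"),
                    "link": article.get("link", ""),
                },
            }
        )
    return documents


# Hypothetical input records; the field names are illustrative only.
articles = [
    {
        "title": "Intro to Faiss",
        "author": "ThreadWaiting",
        "body": "Faiss is a library for similarity search.",
    },
    {"title": "Empty draft", "body": "   "},
]
docs = to_haystack_dicts(articles)
print(len(docs), docs[0]["meta"]["title"])

# With the script's document_store in scope, these could then be written with:
# document_store.write_documents(docs)
```

After writing new documents, the embeddings have to be updated and the index saved again, as in the script.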

Output

PowerShell

>> python .\SemanticSearchFAISS.py
Writing Documents: 10000it [00:00, 304743.30it/s]
Batches: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.15s/it]
Documents Processed: 10000 docs [00:01, 7334.36 docs/s]
Batches: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 55.54it/s]

Matching Article: 

Semantic Search With Facebook AI Similarity Search (FAISS)
Faiss (Facebook AI Similarity Search) is an open-source library developed by Facebook, designed for efficient similarity searches and clustering of dense vectors. This library addresses challenges commonly encountered in machine learning applications, particularly those involving high-dimensional vectors, such as image recognition and recommendation systems. Its widespread applicability, combined with features like scalability and flexibility, makes it a valuable tool for various machine learning and data analysis tasks, as demonstrated in its real-world application scenarios outlined in the Facebook Engineering blog post. 

Faiss employs advanced techniques like indexing and quantization to accelerate similarity searches in large datasets. Its versatility is evident in its support for both CPU and GPU implementations, ensuring scalability across different hardware configurations. Faiss offers flexibility with options for both exact and approximate similarity searches, allowing users to tailor the level of precision to their specific requirements.
Match score: 60.51 %
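A note on reading the match score: to the best of my understanding of Haystack 1.x, retrievers scale scores to the range 0 to 1 by default (`scale_score=True`), and for cosine similarity the scaling is `(cosine + 1) / 2`. Under that assumption, the percentage printed above is not the raw cosine similarity. The sketch below shows the conversion in both directions; the function names are made up for this example.

```python
def scale_cosine(raw):
    """Map a cosine similarity in [-1, 1] onto [0, 1]."""
    return (raw + 1) / 2


def unscale_cosine(scaled):
    """Recover the raw cosine similarity from a scaled score."""
    return 2 * scaled - 1


reported = 0.6051  # the scaled score behind a "Match score" of 60.51 %
raw = unscale_cosine(reported)
print("raw cosine similarity:", round(raw, 4))
```

If this assumption holds, a score of 50 % corresponds to a cosine similarity of 0 (unrelated vectors), so scores should be compared against that baseline rather than against 0 %.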