Semantic Search With MongoDB Atlas

Introduction

Semantic search refers to a search technique that aims to improve the accuracy of search results by understanding the intent and context behind a user’s query. Unlike traditional keyword-based search engines, which rely on matching specific words or phrases, semantic search focuses on the meaning of the query and the content of the documents.

Semantic search systems use natural language processing (NLP) and machine learning algorithms to comprehend the context, relationships, and semantics of words and phrases. This allows them to deliver more relevant results by considering the user’s intent and the context of the query.

Some key components of semantic search include:

  1. Context Understanding: Semantic search engines analyze the context of a query, taking into account factors such as user location, previous search history, and the relationships between words.
  2. Entity Recognition: Identifying and understanding entities (e.g., people, places, and things) in the query and the documents being searched can enhance the accuracy of results.
  3. Concept Matching: Semantic search systems go beyond simple keyword matching and attempt to match the underlying concepts or meanings in the query and documents.
  4. Natural Language Processing (NLP): NLP techniques are employed to understand the natural language in queries and documents, helping the search engine better interpret and respond to user input.
  5. Machine Learning: Algorithms learn from patterns and user behavior, continuously improving the relevance of search results over time.

Semantic search is particularly beneficial for complex queries, ambiguous language, and situations where users may not use the exact keywords that would typically yield the desired results. It has applications in various fields, including information retrieval, recommendation systems, and question-answering systems.
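
To make concept matching concrete, here is a minimal sketch (using the sentence-transformers library and the same all-mpnet-base-v2 model that we set up later in this article) showing that a query matches the document with the related meaning more strongly than the one that merely shares a keyword ->

Python
from sentence_transformers import SentenceTransformer, util

# Same embedding model that is used later in this article
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

query = "What is the capital of France?"
documents = [
    "Paris is where the French government is based.",  # related meaning, few shared words
    "Capital gains tax rates vary by country.",  # shares the keyword "capital", unrelated meaning
]

# Embed the query and the documents, then compare their meanings via cosine similarity
query_embedding = model.encode(query, convert_to_tensor=True)
document_embeddings = model.encode(documents, convert_to_tensor=True)
scores = util.cos_sim(query_embedding, document_embeddings)[0]

for document, score in zip(documents, scores):
    print(f"{score.item():.2f}  {document}")
# The semantically related sentence scores higher than the keyword-matching one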

Implementation

We can implement semantic search using a MongoDB Atlas database together with the Haystack framework.

Getting Started

First, set up a new Python environment in VS Code. Open a PowerShell terminal within VS Code and use the following command ->

PowerShell
python -m venv .venv

to create a virtual environment. Then activate the virtual environment from the same terminal using ->

PowerShell
.venv\Scripts\Activate.ps1

Now, install the required libraries using pip ->

PowerShell
pip install mongodb-atlas-haystack
pip install sentence_transformers

We will be using the Hugging Face model all-mpnet-base-v2 for semantic search. The all-mpnet-base-v2 model provides the best quality among the sentence-transformers models and is an all-round model tuned for many use cases. It was trained on a large and diverse dataset of over 1 billion training pairs, with microsoft/mpnet-base as its base model, and it maps text to 768-dimensional embeddings. It can be used directly via Hugging Face or downloaded locally using ->

PowerShell
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/sentence-transformers/all-mpnet-base-v2
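
As a quick sanity check, the snippet below (a minimal sketch; you can also pass the path of the local clone instead of the model name) loads the model with sentence-transformers and confirms that it produces 768-dimensional embeddings, which is the numDimensions value we will configure in the Atlas index ->

Python
from sentence_transformers import SentenceTransformer

# Load the model from the Hugging Face Hub (or pass the local clone's path instead)
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

embedding = model.encode("What is Faiss?")
print(embedding.shape)  # prints (768,), matching the numDimensions value in the Atlas index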

Create a MongoDB Atlas account if you don’t already have one.

Set up a database and collection in your Atlas cluster and create a vector search index on it using Atlas Search.
Navigate to Deployment > Database > Browse Collections > Atlas Search > Actions > Edit Index and use the following index definition ->

JSON
{
  "fields": [
    {
      "numDimensions": 768,
      "path": "embedding",
      "similarity": "cosine",
      "type": "vector"
    }
  ]
}
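
If you prefer to create the vector search index from code rather than the Atlas UI, recent PyMongo versions offer search-index management helpers. The sketch below is an optional alternative and assumes PyMongo 4.7 or later (where SearchIndexModel accepts type="vectorSearch"); the connection string and names are placeholders matching the configuration used in the code below ->

Python
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

client = MongoClient("mongodb+srv://<username>:<password>@<cluster-host>/")
collection = client["<db_name>"]["<collection_name>"]

# Same definition as the JSON above, created programmatically
index_model = SearchIndexModel(
    definition={
        "fields": [
            {
                "numDimensions": 768,
                "path": "embedding",
                "similarity": "cosine",
                "type": "vector",
            }
        ]
    },
    name="<vsearch_index_name>",
    type="vectorSearch",
)
collection.create_search_index(model=index_model)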

Code

Python
import torch
from haystack.document_stores.mongodb_atlas import MongoDBAtlasDocumentStore
from haystack.nodes import EmbeddingRetriever
from haystack.pipelines import Pipeline
from haystack.schema import Document
from pymongo import MongoClient

mongodb_conn_string = 'mongodb+srv://<username>:<password>@maincluster.d67gxdl.mongodb.net/'
db_name = "<db_name>"
collection_name = "<collection_name>"
index_name = "<vsearch_index_name>"
embedding_model_path=r"C:\Code\Python\Models\all-mpnet-base-v2"

# Initialize MongoDB python client
client = MongoClient(mongodb_conn_string)
collection = client[db_name][collection_name]

# Clear out any existing documents without deleting the search index
collection.delete_many({})

JsonKnowledgeObject = {}
# Adding Content to be indexed
JsonKnowledgeObject[
    "content"
] = """Faiss (Facebook AI Similarity Search) is an open-source library developed by Facebook, designed for efficient similarity searches and clustering of dense vectors. This library addresses challenges commonly encountered in machine learning applications, particularly those involving high-dimensional vectors, such as image recognition and recommendation systems. Its widespread applicability, combined with features like scalability and flexibility, makes it a valuable tool for various machine learning and data analysis tasks, as demonstrated in its real-world application scenarios outlined in the Facebook Engineering blog post. 

Faiss employs advanced techniques like indexing and quantization to accelerate similarity searches in large datasets. Its versatility is evident in its support for both CPU and GPU implementations, ensuring scalability across different hardware configurations. Faiss offers flexibility with options for both exact and approximate similarity searches, allowing users to tailor the level of precision to their specific requirements."""

# Adding Meta Data
JsonKnowledgeObject["meta"] = {}
JsonKnowledgeObject["meta"][
    "title"
] = "Semantic Search With Facebook AI Similarity Search (FAISS)"
JsonKnowledgeObject["meta"]["author"] = "ThreadWaiting"
JsonKnowledgeObject["meta"][
    "link"
] = "https://threadwaiting.com/semantic-search-with-facebook-ai-similarity-search-faiss/"


# Convert Json object to Document object
document = Document(
    content=JsonKnowledgeObject["content"], meta=JsonKnowledgeObject["meta"]
)
# use GPU if available and drivers are installed
use_gpu = torch.cuda.is_available()

document_store = MongoDBAtlasDocumentStore(
    mongo_connection_string=mongodb_conn_string,
    database_name=db_name,
    collection_name=collection_name,
    vector_search_index=index_name,
    similarity="cosine",
    embedding_dim=768,
    embedding_field="embedding",
)

# Build a single retriever around the embedding model; it is reused below for
# computing the document embeddings and for answering queries
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model=embedding_model_path,
    model_format="sentence_transformers",
    use_gpu=use_gpu,
    top_k=3,
)

# Add document to the document store
document_store.write_documents([document])
# Compute and store the embeddings; this needs to be executed every time the data gets refreshed
document_store.update_embeddings(retriever)

query_pipeline = Pipeline()
query_pipeline.add_node(component=retriever, name="retriever", inputs=["Query"])

output = query_pipeline.run(query="What is Faiss?")

results_documents = output["documents"]

if len(results_documents) > 0:
    print("\nMatching Article: \n")
    for doc in results_documents:
        docDoc = doc.to_dict()
        print(docDoc["meta"]["title"])
        print(docDoc["content"])
        score = round(float(str(docDoc["score"] or "0.0")) * 100, 2)
        print("Match score:", score, "%")

Output

PowerShell
Writing Documents: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 28.55 docs/s]
Batches: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.11s/it]
Updating Embeddings: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.19s/ docs] 
Batches: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 19.74it/s]

Matching Article:

Semantic Search With Facebook AI Similarity Search (FAISS)
Faiss (Facebook AI Similarity Search) is an open-source library developed by Facebook, designed for efficient similarity searches and clustering of dense vectors. This library addresses challenges commonly encountered in machine learning applications, particularly those involving high-dimensional vectors, such as image recognition and recommendation systems. Its widespread applicability, combined with features like scalability and flexibility, makes it a valuable tool for various machine learning and data analysis tasks, as demonstrated in its real-world application scenarios outlined in the Facebook Engineering blog post.

Faiss employs advanced techniques like indexing and quantization to accelerate similarity searches in large datasets. Its versatility is evident in its support for both CPU and GPU implementations, ensuring scalability across different hardware configurations. Faiss offers flexibility with options for both exact and approximate similarity searches, allowing users to tailor the level of precision to their specific requirements.
Match score: 80.25 %
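
For reference, behind the scenes the retriever runs an Atlas Vector Search query against the index we defined earlier. A rough standalone equivalent using Atlas's $vectorSearch aggregation stage via PyMongo is sketched below; the numCandidates value is an assumption, and the content/meta field names assume the document layout written by the document store above ->

Python
from pymongo import MongoClient
from sentence_transformers import SentenceTransformer

client = MongoClient("mongodb+srv://<username>:<password>@<cluster-host>/")
collection = client["<db_name>"]["<collection_name>"]
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# Embed the query with the same model that was used to embed the documents
query_vector = model.encode("What is Faiss?").tolist()

results = collection.aggregate([
    {
        "$vectorSearch": {
            "index": "<vsearch_index_name>",
            "path": "embedding",
            "queryVector": query_vector,
            "numCandidates": 100,
            "limit": 3,
        }
    },
    {"$project": {"content": 1, "meta": 1, "score": {"$meta": "vectorSearchScore"}}},
])

for doc in results:
    print(doc["meta"]["title"], round(doc["score"] * 100, 2), "%")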

GitHub: https://github.com/threadwaiting/SemanticSearchMongoDB