Semantic Search using LangChain and MongoDB Atlas

Introduction

Semantic search refers to a search technique that aims to improve the accuracy of search results by understanding the intent and context behind a user’s query. Unlike traditional keyword-based search engines, which rely on matching specific words or phrases, semantic search focuses on the meaning of the query and the content of the documents.

Semantic search systems use natural language processing (NLP) and machine learning algorithms to comprehend the context, relationships, and semantics of words and phrases. This allows them to deliver more relevant results by considering the user’s intent and the context of the query.

Some key components of semantic search include:

  1. Context Understanding: Semantic search engines analyze the context of a query, taking into account factors such as user location, previous search history, and the relationships between words.
  2. Entity Recognition: Identifying and understanding entities (e.g., people, places, and things) in the query and the documents being searched can enhance the accuracy of results.
  3. Concept Matching: Semantic search systems go beyond simple keyword matching and attempt to match the underlying concepts or meanings in the query and documents.
  4. Natural Language Processing (NLP): NLP techniques are employed to understand the natural language in queries and documents, helping the search engine better interpret and respond to user input.
  5. Machine Learning: Algorithms learn from patterns and user behavior, continuously improving the relevance of search results over time.

Semantic search is particularly beneficial for complex queries, ambiguous language, and situations where users may not use the exact keywords that would typically yield the desired results. It has applications in various fields, including information retrieval, recommendation systems, and question-answering systems.
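
To make concept matching a little more concrete, here is a minimal sketch in Python. It assumes the query and the documents have already been turned into embedding vectors; the three-dimensional vectors below are invented purely for illustration (real embedding models produce hundreds or thousands of dimensions), and `cosine_similarity` and `rank_by_similarity` are hypothetical helper names, not part of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (title, score) pairs sorted from most to least similar."""
    scored = [(title, cosine_similarity(query_vec, vec))
              for title, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up vectors: the first two documents point in a similar direction
# even though their titles share almost no keywords.
toy_docs = {
    "How to fix a flat bicycle tyre": [0.9, 0.1, 0.0],
    "Repairing a punctured bike wheel": [0.8, 0.2, 0.1],
    "Baking sourdough bread at home": [0.0, 0.1, 0.9],
}
toy_query = [0.85, 0.15, 0.05]  # stands in for "mend a cycle puncture"

ranking = rank_by_similarity(toy_query, toy_docs)
for title, score in ranking:
    print(f"{score:.3f}  {title}")
```

With these toy numbers both bicycle documents score well above the bread document, although the query shares no exact keyword with either title. That is the behaviour a vector index with cosine similarity is meant to provide at scale.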

Implementation

We can implement semantic search using LangChain and MongoDB Atlas.

Create a MongoDB Atlas account if you don’t already have one.

Set up a database in your Atlas cluster and index it using Atlas Search.
Navigate to Deployment > Database > Browse Collections > Atlas Search > Actions > Edit Index and enter the following index definition:

JSON
{
  "fields": [
    {
      "numDimensions": 1536,
      "path": "embedding",
      "similarity": "cosine",
      "type": "vector"
    }
  ]
}
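
The values in this definition have to agree with the code that follows: 1536 is the vector length produced by OpenAI's `text-embedding-ada-002` model, and `embedding` is the field in which `MongoDBAtlasVectorSearch` stores vectors by default. As a rough sketch, the definition could be generated and sanity-checked in Python before pasting it into the Atlas UI. The helper name `build_vector_index_definition` is hypothetical and exists only for this illustration.

```python
import json

# Similarity functions accepted by Atlas Vector Search indexes.
ALLOWED_SIMILARITIES = {"cosine", "dotProduct", "euclidean"}


def build_vector_index_definition(num_dimensions, path="embedding",
                                  similarity="cosine"):
    """Build an index definition dict shaped like the JSON shown above."""
    if not isinstance(num_dimensions, int) or num_dimensions <= 0:
        raise ValueError("num_dimensions must be a positive integer")
    if similarity not in ALLOWED_SIMILARITIES:
        raise ValueError(f"unsupported similarity: {similarity!r}")
    return {
        "fields": [
            {
                "numDimensions": num_dimensions,
                "path": path,
                "similarity": similarity,
                "type": "vector",
            }
        ]
    }


# 1536 matches the embedding size assumed in this article.
definition = build_vector_index_definition(1536)
print(json.dumps(definition, indent=2))
```

If you switch to an embedding model with a different output size, `numDimensions` must change with it, otherwise the stored vectors will not match the index.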

Code

Python
import warnings
from langchain.document_loaders import WebBaseLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import MongoDBAtlasVectorSearch
from pymongo import MongoClient

openai_api_key = '<Api Key>'
mongodb_conn_string = 'mongodb+srv://<username>:<password>@maincluster.d67gxdl.mongodb.net/'
db_name = "<db_name>"
collection_name = "<collection_name>"
index_name = "<vsearch_index_name>"

# Filter out the UserWarning from langchain
warnings.filterwarnings("ignore", category=UserWarning, module="langchain.chains.llm")

# Step 1: Load Webpages to Index
loaders = [
    WebBaseLoader("https://threadwaiting.com/python-oops/"),
    WebBaseLoader("https://threadwaiting.com/semantic-search-with-facebook-ai-similarity-search-faiss/"),
]
data = []
for loader in loaders:
    data.extend(loader.load())

# Step 2: Transform (Split)
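text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=0,
    # The sentence-boundary separator is a regex lookbehind, so it needs a raw
    # string and is_separator_regex=True; otherwise it is matched literally.
    separators=["\n\n", "\n", r"(?<=\. )", " "],
    is_separator_regex=True,
    length_function=len,
)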
docs = text_splitter.split_documents(data)
print('Split into ' + str(len(docs)) + ' docs')

# Step 3: Embed
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)

# Step 4: Store
# Initialize MongoDB python client
client = MongoClient(mongodb_conn_string)
collection = client[db_name][collection_name]

# Reset w/out deleting the Search Index 
collection.delete_many({})

# Insert the documents in MongoDB Atlas with their embedding
docsearch = MongoDBAtlasVectorSearch.from_documents(
    docs, embeddings, collection=collection, index_name=index_name
)

# Questions to ask against the indexed pages

questions = ["What is an object?", "What is FAISS?"]

# initialize vector store
vectorStore = MongoDBAtlasVectorSearch(
    collection, OpenAIEmbeddings(openai_api_key=openai_api_key), index_name=index_name
)

# Run a max marginal relevance (MMR) search for each question and keep the top hit
for query in questions:
    print("---------------")
    print(query)
    docs = vectorStore.max_marginal_relevance_search(query, k=1)
    if len(docs)>0:
        print(docs[0].metadata['title'])
        print(docs[0].page_content)

# Contextual compression: an LLM extracts only the passages relevant to the query.
# The retriever is built once and reused for every question.
llm = OpenAI(openai_api_key=openai_api_key, temperature=0)
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorStore.as_retriever()
)

for query in questions:
    print("\nAI Response:")
    print("-----------")
    print(query)
    compressed_docs = compression_retriever.get_relevant_documents(query)
    if len(compressed_docs)>0:
        print(compressed_docs[0].metadata['title'])
        print(compressed_docs[0].page_content)

Output

PowerShell
Split into 26 docs
---------------
What is an object?
Python – OOPs – Thread Waiting
Python – OOPs – Thread Waiting

Thread Waiting
Code Everything!

Menu

Home
About
Privacy Policy
Contact Us

 Python – OOPs
17/01/2018Python

Introduction
Here we will discuss about:
Classes and Objects in Python
Closures and Decorators
Descriptors and Properties

Introduction to OOP
Object-oriented programming can model real-life scenarios and suits developing large and complex applications.
Object
In real life, an object is something that you can sense and feel. For example Toys, Bicycles, Oranges and more.
However in Software development, an object is a non tangible entity, which holds some data and is capable of doing certain things.
Class and Object Relationship

Defining Classes
Class
A Class is a template which contains

instructions to build an object.
methods that can be used by the object to exhibit a specific behaviour.
---------------
What is FAISS?
Semantic Search With Facebook AI Similarity Search (FAISS) – Thread Waiting
Introduction
Facebook AI Similarity Search (FAISS)
Faiss (Facebook AI Similarity Search) is an open-source library developed by Facebook, designed for efficient similarity searches and clustering of dense vectors. This library addresses challenges commonly encountered in machine learning applications, particularly those involving high-dimensional vectors, such as image recognition and recommendation systems. Its widespread applicability, combined with features like scalability and flexibility, makes it a valuable tool for various machine learning and data analysis tasks, as demonstrated in its real-world application scenarios outlined in the Facebook Engineering blog post.

AI Response:
-----------
What is an object?
Python – OOPs – Thread Waiting
Object-oriented programming can model real-life scenarios and suits developing large and complex applications.
Object
In real life, an object is something that you can sense and feel. For example Toys, Bicycles, Oranges and more.
However in Software development, an object is a non tangible entity, which holds some data and is capable of doing certain things.

AI Response:
-----------
What is FAISS?
Semantic Search With Facebook AI Similarity Search (FAISS) – Thread Waiting
Faiss (Facebook AI Similarity Search) is an open-source library developed by Facebook, designed for efficient similarity searches and clustering of dense vectors. This library addresses challenges commonly encountered in machine learning applications, particularly those involving high-dimensional vectors, such as image recognition and recommendation systems. Its widespread applicability, combined with features like scalability and flexibility, makes it a valuable tool for various machine learning and data analysis tasks, as demonstrated in its real-world application scenarios outlined in the Facebook Engineering blog post.
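
The first loop in the script calls `max_marginal_relevance_search`, which trades off relevance to the query against diversity among the returned documents. The sketch below illustrates the general max marginal relevance (MMR) idea in plain Python. It is not LangChain's implementation; the two-dimensional vectors are invented, `mmr_select` and `cosine` are hypothetical names, and `lambda_mult` only mirrors the name of the corresponding LangChain parameter.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr_select(query_vec, candidate_vecs, k=2, lambda_mult=0.5):
    """Greedily pick up to k candidate indices.

    Each step picks the candidate maximising
    lambda_mult * relevance - (1 - lambda_mult) * redundancy,
    where redundancy is the highest similarity to anything already picked.
    """
    selected = []
    remaining = list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in remaining:
            relevance = cosine(query_vec, candidate_vecs[i])
            redundancy = max(
                (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


candidates = [
    [0.99, 0.05],  # 0: closest to the query
    [1.0, 0.0],    # 1: almost identical to candidate 0
    [0.6, 0.8],    # 2: less relevant, but points in a different direction
]
query = [1.0, 0.1]

balanced = mmr_select(query, candidates, k=2, lambda_mult=0.5)
relevance_only = mmr_select(query, candidates, k=2, lambda_mult=1.0)
print(balanced)        # [0, 2]
print(relevance_only)  # [0, 1]
```

With `lambda_mult=0.5` the near-duplicate candidate is skipped in favour of the more diverse one, while `lambda_mult=1.0` reduces to a plain similarity ranking. With `k=1`, as in the script, only the single most relevant document is returned, so the diversity term has no effect.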

GitHub: https://github.com/threadwaiting/SemanticSearchMongoDbLangChain