Building an ArXiv Research Assistant with LangGraph, MongoDB, and AWS Bedrock

Share

In this blog post, we’ll explore the implementation of an ArXiv Research Assistant using LangGraph, MongoDB, and AWS Bedrock. This application allows users to search and summarize academic papers from arXiv using a combination of full-text and vector-based search techniques. We’ll dive into the key components of the system and explain how they work together to provide a powerful research tool.

System Overview

The ArXiv Research Assistant is built with the following key components:

  1. MongoDB for document storage and retrieval
  2. AWS Bedrock for embeddings and language model inference
  3. LangGraph for workflow orchestration
  4. Gradio for the user interface

The system follows these main steps:

  1. Accept a user query
  2. Identify the core search topic
  3. Perform a hybrid search (full-text + vector) on stored documents
  4. Download and process new papers if necessary
  5. Generate a comprehensive answer based on the retrieved documents

LangGraph for ArXiv Research Assistant

Let’s explore each component in detail.

MongoDB Setup and Indexing

We use MongoDB to store and retrieve academic papers. The system sets up two crucial indexes:

  1. A vector search index for efficient similarity searches
  2. A full-text search index for keyword-based searches
def setup_vector_search_index():
    index_definitions = [
        {
            "name": VECTOR_SEARCH_INDEX_NAME,
            "type": "vectorSearch",
            "definition": {
                "fields": [
                    {
                        "type": "vector",
                        "path": "embeddings",
                        "numDimensions": 1536,
                        "similarity": "cosine",
                    }
                ]
            },
        },
        {
            "name": SEARCH_INDEX_NAME,
            "type": "search",
            "definition": {
                "mappings": {
                    "dynamic": False,
                    "fields": {"text": {"type": "string"}},
                }
            },
        },
    ]

    for index_definition in index_definitions:
        try:
            collection.create_search_index(model=index_definition)
            time.sleep(10)
        except pymongo.errors.PyMongoError:
            pass

This function ensures that our MongoDB collection has the necessary indexes for efficient searching.

Hybrid Search Implementation

The hybrid search combines full-text search and vector-based search to provide more accurate and relevant results:

def hybrid_search(query, vector_query, weight=0.5, top_n=10):
    pipeline = [
        # ... (aggregation pipeline steps)
    ]
    results = list(collection.aggregate(pipeline))
    return results

This function uses MongoDB’s aggregation pipeline to perform a weighted combination of full-text and vector searches, providing a flexible and powerful search mechanism.

ArXiv Integration

The system integrates with ArXiv to fetch and download papers based on user queries:

def fetch_arxiv_papers(query: str, max_results: int = 5) -> List[Dict]:
    # ... (implementation details)

This function uses the arxiv library to search for and retrieve paper metadata from ArXiv.

AWS Bedrock Integration

AWS Bedrock is used for generating embeddings and performing language model inference:

bedrock_client = boto3.client("bedrock-runtime", region_name="us-east-1", config=config)
embedding_model = BedrockEmbeddings(
    model_id=EMBEDDING_MODEL_NAME, client=bedrock_client
)

The BedrockEmbeddings class is used to generate embeddings for documents and queries, while the bedrock_client is used for language model inference in functions like find_search_topic and generate_answer.

LangGraph Workflow

The core of the application is built using LangGraph, which orchestrates the workflow:

workflow = StateGraph(GraphState)

def create_graph_nodes():
    workflow.add_node("find_search_topic", find_search_topic)
    workflow.add_node("hybrid_search", perform_hybrid_search)
    workflow.add_node("download_paper", download_papers)
    workflow.add_node("generate_answer", generate_answer)
    # ... (edge definitions)

create_graph_nodes()
app = workflow.compile()

This setup defines the workflow as a graph, with nodes representing different stages of the process and edges defining the flow between these stages.

Gradio User Interface

The user interface is implemented using Gradio, providing a chat-like interface for interacting with the ArXiv Research Assistant:

with gr.Blocks(
    fill_height=True,
    fill_width=True,
    title="ArXiv Research Assistant",
    theme=gr.themes.Soft(),
) as demo:
    gr.ChatInterface(
        fn=handle_query,
        type="messages",
        title="ArXiv Research Assistant",
        description="Get assistance in searching and summarizing academic papers from arXiv.",
        # ... (other configuration options)
    )

The handle_query function processes user inputs and streams the results back to the interface.

Applications of an arXiv Research Assistant Powered by RAG

A Retrieval-Augmented Generation (RAG) model applied to the arXiv repository could serve as a transformative tool for efficiently extracting and synthesizing insights from the extensive collection of scientific papers. Here are some potential use cases that highlight its versatility and value:

1. Scientific Literature Review

  • Researchers can query the RAG system with specific questions or topics, such as “What are the latest advancements in quantum computing?” or “How does reinforcement learning apply to robotics?”
  • The system retrieves relevant papers from arXiv, extracts key findings, and generates a coherent summary or explanation, streamlining the literature review process.

2. Automated Research Assistance

  • Assists researchers in identifying gaps in the literature, emerging trends, or summarizing methodologies from multiple sources.
  • Answers highly specific technical questions by synthesizing information across papers, saving significant time and effort.

3. Educational Applications

  • Students and educators can use the system to demystify complex scientific concepts or provide concise overviews of research areas.
  • It can generate study guides, FAQs, or detailed explanations of technical topics tailored to a specific audience.

4. Fostering Cross-disciplinary Collaboration

  • Enables researchers from diverse fields to comprehend papers outside their expertise, fostering interdisciplinary innovation.
  • For instance, a biologist exploring machine learning applications in genomics can gain accessible summaries of relevant computer science research.

5. Support for Patent and Grant Proposals

  • Inventors and grant writers can identify prior work and gather relevant references, ensuring their proposals are novel and well-informed.
  • The system can help generate theoretical frameworks and summaries to strengthen application narratives.

6. Staying Updated with Recent Advances

  • Delivers personalized summaries of newly published arXiv papers on a daily or weekly basis, tailored to individual research interests.
  • Ensures researchers remain current without having to sift through extensive lists of new publications.

7. Contextual Code Retrieval and Explanation

  • For papers containing code snippets or pseudo-code, the system can extract and provide detailed explanations or runnable examples, enhancing reproducibility and accelerating practical applications.

By integrating RAG with arXiv, this system has the potential to revolutionize how knowledge is consumed and utilized across academia, industry, and beyond. Researchers, students, and professionals alike would benefit from its ability to make cutting-edge science more accessible, comprehensible, and actionable.

Conclusion

The ArXiv Research Assistant demonstrates the power of combining modern AI technologies with efficient data storage and retrieval systems. By leveraging LangGraph for workflow orchestration, MongoDB Atlas for flexible document storage and searching, and AWS Bedrock for state-of-the-art language models, we’ve created a powerful tool for helping with academic research.

This system showcases several key concepts:

  1. Hybrid search techniques combining full-text and vector-based approaches
  2. Integration of external APIs (ArXiv) for real-time data retrieval
  3. Use of large language models for topic extraction and answer generation
  4. Graph-based workflow management for complex AI applications
  5. Building user-friendly interfaces for AI-powered tools

Future improvements could include adding more sources beyond ArXiv, implementing user feedback mechanisms, and exploring more advanced retrieval and summarization techniques.

GitHub: https://github.com/mohammaddaoudfarooqi/arXivSearch