
A Step-by-Step Guide:

Building an AI-Powered Academic Research Assistant

A Comprehensive Guide: LangChain, OpenAI, SerpAPI & Pinecone

Table of Contents
1. Introduction

• Project Overview
• Use Case Story and Background

2. Project Structure

3. Step-by-Step Guide

• Step 1: Set Up the Project


• 1.1 Create the Project Directory
• 1.2 Create and Activate a Virtual Environment
• 1.3 Install Prerequisites
• 1.4 Create the Required Files
• 1.5 Install Dependencies
• 1.6 Configure Environment Variables
• Step 2: Create Utility Functions
• Step 3: Define the Agents
• Step 4: Define the Tasks
• Step 5: Create the Main Execution Script

4. Running the Project

5. Explanation of the Workflow

• 5.1 Utility Functions (utils.py)


• 5.2 Agents (agents.py)
• 5.3 Tasks (tasks.py)
• 5.4 Main Execution (main.py)
6. Cloud Architecture Diagram for AI-Powered Research Agent
• 6.1 Components of Cloud Architecture Diagram


1. Introduction
Project Overview

In this tutorial, we will build an AI-powered research agent in Python. The agent leverages various
tools and APIs to retrieve data from research papers, perform web searches, and produce
structured reports.

Use Case Story and Background

Scenario: Academic Research Assistance

User: Dr. Jane Doe, a researcher in machine learning, needs to gather the latest advancements in
natural language processing (NLP) to include in her upcoming review paper.

Goal: Utilize the AI research agent to automate the retrieval of the latest NLP research papers and
summarize their abstracts for a comprehensive overview.

Steps:

1. Setup the Agent: Dr. Doe sets up the research agent by following the step-by-step guide to
install dependencies and configure API keys.

2. Dataset Preparation: She loads the arXiv dataset relevant to NLP and populates the
knowledge base using the provided utility functions.

3. Agent Invocation: Dr. Doe invokes the agent with the input "summarize the latest research in
NLP." The agent fetches the necessary data, performs web searches, and aggregates the
information.

4. Report Generation: The output from the agent is formatted into a structured report, which Dr.
Doe includes in her review paper.

Outcome:

Dr. Doe efficiently compiles a comprehensive summary of the latest NLP advancements, significantly
reducing her manual research time and increasing productivity.


2. Project Structure
research_agent/
├── .env
├── agents.py
├── tasks.py
├── utils.py
├── main.py
├── requirements.txt

3. Step-by-Step Guide
Step 1: Set Up the Project
1.1 Create the Project Directory
Open your terminal or command prompt and create a new project directory:

mkdir research_agent
cd research_agent

1.2 Create and Activate a Virtual Environment


Create a virtual environment:
python -m venv env
source env/bin/activate # On Windows use `env\Scripts\activate`

1.3 Install Prerequisites


• Ensure you have Python installed on your local machine.
• Install graphviz and the other system dependencies required by pygraphviz (the command below is for Debian/Ubuntu):
sudo apt-get install graphviz libgraphviz-dev pkg-config

1.4 Create the Required Files


Create the necessary files as outlined in the Project Structure section.
touch .env agents.py tasks.py utils.py main.py requirements.txt

1.5 Install Dependencies
List the required Python libraries in requirements.txt:
datasets==2.19.1
langchain-pinecone==0.1.0
langchain-openai==0.1.3
langchain==0.1.16
langchain-core==0.1.42
langgraph==0.0.37
langchainhub==0.1.15
semantic-router==0.0.39
serpapi==0.1.5
google-search-results==2.4.2
pygraphviz==1.12

Install the dependencies:


pip install -r requirements.txt

1.6 Configure Environment Variables


Set the API keys in the .env file, using the variable names the code reads in later steps:

OPENAI_API_KEY=your_OpenAI_API_Key
SERPAPI_KEY=your_SerpAPI_Key
PINECONE_API_KEY=your_Pinecone_API_Key
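
The code in the following steps reads these values with os.getenv, which only sees real environment variables. A minimal sketch for loading the .env file at startup, assuming the python-dotenv package (not listed in requirements.txt) is installed:

# Hypothetical snippet, e.g. at the top of main.py; requires `pip install python-dotenv`
from dotenv import load_dotenv

# Read .env from the project directory and export its keys as environment variables
load_dotenv()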


Step 2: Create Utility Functions

Create helper functions and configurations in utils.py:


# utils.py
import os
from getpass import getpass

from datasets import load_dataset
from pinecone import Pinecone
from semantic_router.encoders import OpenAIEncoder
from tqdm.auto import tqdm

# Load the pre-chunked arXiv dataset from the Hugging Face Hub
def load_arxiv_dataset():
    dataset = load_dataset("jamescalam/ai-arxiv2-semantic-chunks", split="train")
    return dataset

# Initialize the Pinecone client, connect to an existing index, and set up the encoder
def initialize_pinecone(index_name: str):
    api_key = os.getenv("PINECONE_API_KEY") or getpass("Pinecone API key: ")
    pc = Pinecone(api_key=api_key)
    index = pc.Index(index_name)  # assumes the index already exists in Pinecone
    encoder = OpenAIEncoder(name="text-embedding-3-small")
    return index, encoder

# Populate the knowledge base by embedding and upserting records in batches
def populate_knowledge_base(index, dataset, encoder):
    data = dataset.to_pandas().iloc[:10000]
    batch_size = 128
    for i in tqdm(range(0, len(data), batch_size)):
        i_end = min(len(data), i + batch_size)
        batch = data[i:i_end].to_dict(orient="records")
        metadata = [
            {"title": r["title"], "content": r["content"], "arxiv_id": r["arxiv_id"]}
            for r in batch
        ]
        ids = [r["id"] for r in batch]
        content = [r["content"] for r in batch]
        embeds = encoder(content)
        index.upsert(vectors=list(zip(ids, embeds, metadata)))

# Build a structured report from the agent's output
def build_report(output: dict):
    research_steps = output["research_steps"]
    if isinstance(research_steps, list):
        research_steps = "\n".join(research_steps)
    return f"Research Summary:\n{research_steps}"
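
A quick usage sketch of these helpers (the query call follows the standard Pinecone client API, and the index name matches the one used later in main.py; this snippet is illustrative, not part of the project files):

# Hypothetical usage of the utils.py helpers
from utils import load_arxiv_dataset, initialize_pinecone, populate_knowledge_base

dataset = load_arxiv_dataset()
index, encoder = initialize_pinecone("gpt-4o-research-agent")
populate_knowledge_base(index, dataset, encoder)

# Semantic search over the populated knowledge base
xq = encoder(["latest advancements in NLP"])  # the encoder returns a list of vectors
res = index.query(vector=xq[0], top_k=3, include_metadata=True)
for match in res.matches:
    print(match.metadata["title"])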


Step 3: Define the Agents


Create the research agent in agents.py:
# agents.py
import os
import re
from getpass import getpass

import requests
from langchain import hub
from langchain.agents import create_openai_tools_agent
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

# Ensure the OpenAI API key is set
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") or getpass("OpenAI API key: ")

# Initialize the OpenAI chat model
llm = ChatOpenAI(
    model="gpt-4o",
    openai_api_key=os.environ["OPENAI_API_KEY"],
    temperature=0,
)

# Regex for extracting the abstract from an arXiv abstract page
abstract_pattern = re.compile(
    r'<blockquote class="abstract mathjax">\s*'
    r'<span class="descriptor">Abstract:</span>\s*(.*?)\s*</blockquote>',
    re.DOTALL,
)

@tool("fetch_arxiv")
def fetch_arxiv_tool(arxiv_id: str):
    """Fetch the abstract of the arXiv paper with the given ID."""
    res = requests.get(f"https://export.arxiv.org/abs/{arxiv_id}")
    re_match = abstract_pattern.search(res.text)
    return re_match.group(1)

# Define other tools as needed...

# Pull a standard tools-agent prompt from the LangChain Hub
prompt = hub.pull("hwchase17/openai-functions-agent")

# Create the agent
oracle_agent_runnable = create_openai_tools_agent(
    llm=llm,
    tools=[
        fetch_arxiv_tool,
        # Add other tools here...
    ],
    prompt=prompt,
)
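
Before wiring the tool into the agent, you can sanity-check it in isolation; the arXiv ID below is just an illustrative example ("Attention Is All You Need"):

# Quick standalone check of the fetch_arxiv tool
print(fetch_arxiv_tool.invoke({"arxiv_id": "1706.03762"}))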


Step 4: Define the Tasks


Define specific tasks and utility functions in tasks.py:

# tasks.py
import json
import os
from getpass import getpass

from serpapi import GoogleSearch

from agents import fetch_arxiv_tool

# Execute the fetch_arxiv tool using the arguments from the agent's latest tool call
def execute_fetch_arxiv(state: dict):
    action = state["agent_out"]
    tool_call = action[-1].message_log[-1].additional_kwargs["tool_calls"][-1]
    args = json.loads(tool_call["function"]["arguments"])  # arguments arrive as a JSON string
    out = fetch_arxiv_tool.invoke({"arxiv_id": args["arxiv_id"]})
    return {"intermediate_steps": [(action, str(out))], "chat_history": state["chat_history"]}

# Run a SerpAPI-backed Google search and join the top results into one context string
def web_search(query: str):
    search = GoogleSearch({
        "engine": "google",
        "api_key": os.getenv("SERPAPI_KEY") or getpass("SerpAPI key: "),
        "q": query,
        "num": 5,
    })
    results = search.get_dict()["organic_results"]
    contexts = "\n---\n".join([f"{x['title']}\n{x['snippet']}\n{x['link']}" for x in results])
    return contexts
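
Note that execute_fetch_arxiv takes a state dict and returns a partial state update: it is shaped like a graph node. Since requirements.txt lists langgraph, a minimal wiring sketch might look like the following (the state schema and node names here are assumptions, not part of the original guide):

# Hypothetical LangGraph wiring for the state-shaped task functions
import operator
from typing import Annotated, TypedDict

from langgraph.graph import StateGraph

from tasks import execute_fetch_arxiv

class AgentState(TypedDict):
    input: str
    chat_history: list
    agent_out: object
    intermediate_steps: Annotated[list, operator.add]

graph = StateGraph(AgentState)
graph.add_node("fetch_arxiv", execute_fetch_arxiv)
# ... add an oracle node for the agent itself, other tool nodes, and routing edges ...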


Step 5: Create the Main Execution Script

Create the main execution script in main.py:


# main.py
from agents import oracle_agent_runnable
from tasks import execute_fetch_arxiv, web_search
from utils import (
    build_report,
    initialize_pinecone,
    load_arxiv_dataset,
    populate_knowledge_base,
)

# Set up and initialize components
dataset = load_arxiv_dataset()

# Assumes the index has already been created in Pinecone
index_name = "gpt-4o-research-agent"
index, encoder = initialize_pinecone(index_name)

# Populate the knowledge base
populate_knowledge_base(index, dataset, encoder)

# Example usage of the agent
inputs = {
    "input": "summarize the latest research in NLP",
    "chat_history": [],
    "intermediate_steps": [],
}
agent_out = oracle_agent_runnable.invoke(inputs)
print(agent_out)

# Format the agent output into a structured report
report = build_report(agent_out)
print(report)

4. Running the Project


Run the agent and observe its output. Make sure all dependencies are installed and API keys are
configured correctly.

1. Run the main script:

python main.py


5. Explanation of the Workflow


5.1 Utility Functions (utils.py)

Defines helper functions and configurations, such as loading datasets and initializing Pinecone.

5.2 Agents (agents.py)

Contains the definitions and initialization of the research agent and tools.

5.3 Tasks (tasks.py)

Defines specific tasks and utility functions for the agent to perform.

5.4 Main Execution (main.py)

Serves as the entry point for running the agent, setting up components, and invoking tasks.


6. Cloud Architecture Diagram for AI-Powered Research Agent

Project Overview: An AI-powered research agent that automates the retrieval of research data,
performs web searches, and provides structured reports. The project leverages various tools and APIs,
and we aim to deploy it in a cloud environment for scalability and reliability.

6.1 Components of Cloud Architecture Diagram

1. User Interface (UI):


• A web-based interface for users to interact with the research agent.
• Input fields for users to provide queries.
• Display area for showing search results and reports.
2. API Gateway:
• Entry point for all API requests.
• Routes requests to appropriate backend services.
3. Authentication and Authorization:
• Secure access to the research agent using API keys or OAuth.
4. Compute Resources:
• Virtual Machines or Containers (e.g., AWS EC2, Azure VM, Google Cloud Compute
Engine, or Kubernetes).
• Serverless functions (e.g., AWS Lambda, Azure Functions, Google Cloud Functions) for
executing agent tasks.
5. Data Storage:
• Database for storing API keys, user data, and configuration settings (e.g., AWS RDS,
Azure SQL Database, Google Cloud SQL).
• Object storage for storing datasets and large files (e.g., AWS S3, Azure Blob Storage,
Google Cloud Storage).
6. AI and Machine Learning Services:
• Utilize cloud-based AI services for natural language processing and other ML tasks (e.g.,
OpenAI API, AWS SageMaker, Azure Cognitive Services, Google AI Platform).
• Pinecone for vector database management.
7. Networking:
• Virtual Private Cloud (VPC) to host the services securely.
• Subnets, routing tables, and internet gateways for managing network traffic.
8. Monitoring and Logging:
• Tools for monitoring application performance and logging (e.g., AWS CloudWatch,
Azure Monitor, Google Cloud Logging).
9. CI/CD Pipeline:
• Continuous Integration and Continuous Deployment pipeline for automated testing and
deployment (e.g., AWS CodePipeline, Azure DevOps, Google Cloud Build).

Specific Technologies and Services:

• Language and Tools: Python, LangChain, PyGraphviz, OpenAI, SerpAPI, Pinecone.


• Cloud Provider: [Specify AWS, Azure, or Google Cloud]
• Additional Tools: Git for version control, Docker for containerization (if applicable).
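
As a concrete illustration of how the UI and API Gateway components could front the agent, here is a minimal sketch of a web entry point, assuming FastAPI (not in requirements.txt) as the web framework:

# Hypothetical FastAPI entry point that a cloud API gateway could route requests to
from fastapi import FastAPI

from agents import oracle_agent_runnable

app = FastAPI()

@app.post("/query")
def query_agent(q: str):
    # Run the research agent on the user's query and return the raw output
    inputs = {"input": q, "chat_history": [], "intermediate_steps": []}
    agent_out = oracle_agent_runnable.invoke(inputs)
    return {"result": str(agent_out)}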


Constructive comments and feedback are welcome.

