
A Step-by-Step Guide:

Building an AI-Powered Academic Research Assistant

A Comprehensive Guide: LangChain, OpenAI, SerpAPI & Pinecone

Table of Contents
1. Introduction

• Project Overview
• Use Case Story and Background

2. Project Structure

3. Step-by-Step Guide

• Step 1: Set Up the Project


• 1.1 Create the Project Directory
• 1.2 Create and Activate a Virtual Environment
• 1.3 Install Prerequisites
• 1.4 Create the Required Files
• 1.5 Install Dependencies
• 1.6 Configure Environment Variables
• Step 2: Create Utility Functions
• Step 3: Define the Agents
• Step 4: Define the Tasks
• Step 5: Create the Main Execution Script

4. Running the Project

5. Explanation of the Workflow

• 5.1 Utility Functions (utils.py)


• 5.2 Agents (agents.py)
• 5.3 Tasks (tasks.py)
• 5.4 Main Execution (main.py)
6. Cloud Architecture Diagram for AI-Powered Research Agent
• 6.1 Components of Cloud Architecture Diagram


1. Introduction
Project Overview

In this tutorial, we will build an AI-powered research agent in Python. The agent leverages various
tools and APIs to retrieve data from research papers, perform web searches, and produce
structured reports.

Use Case Story and Background

Scenario: Academic Research Assistance

User: Dr. Jane Doe, a researcher in machine learning, needs to gather the latest advancements in
natural language processing (NLP) to include in her upcoming review paper.

Goal: Utilize the AI research agent to automate the retrieval of the latest NLP research papers and
summarize their abstracts for a comprehensive overview.

Steps:

1. Setup the Agent: Dr. Doe sets up the research agent by following the step-by-step guide to
install dependencies and configure API keys.

2. Dataset Preparation: She loads the arXiv dataset relevant to NLP and populates the
knowledge base using the provided utility functions.

3. Agent Invocation: Dr. Doe invokes the agent with the input "summarize the latest research in
NLP." The agent fetches the necessary data, performs web searches, and aggregates the
information.

4. Report Generation: The output from the agent is formatted into a structured report, which Dr.
Doe includes in her review paper.

Outcome:

Dr. Doe efficiently compiles a comprehensive summary of the latest NLP advancements, significantly
reducing her manual research time and increasing productivity.


2. Project Structure
research_agent/
├── .env
├── agents.py
├── tasks.py
├── utils.py
├── main.py
├── requirements.txt

3. Step-by-Step Guide
Step 1: Set Up the Project
1.1 Create the Project Directory
Open your terminal or command prompt and create a new project directory:

mkdir research_agent
cd research_agent

1.2 Create and Activate a Virtual Environment


Create a virtual environment:
python -m venv env
source env/bin/activate # On Windows use `env\Scripts\activate`

1.3 Install Prerequisites


• Ensure you have Python installed on your local machine.
• Install graphviz and the other system dependencies required by pygraphviz (the command below is for Debian/Ubuntu):
sudo apt-get install graphviz libgraphviz-dev pkg-config

1.4 Create the Required Files


Create the necessary files as outlined in the Project Structure section.
touch .env agents.py tasks.py utils.py main.py requirements.txt

1.5 Install Dependencies
List the required Python libraries in requirements.txt:
datasets==2.19.1
langchain-pinecone==0.1.0
langchain-openai==0.1.3
langchain==0.1.16
langchain-core==0.1.42
langgraph==0.0.37
langchainhub==0.1.15
semantic-router==0.0.39
serpapi==0.1.5
google-search-results==2.4.2
pygraphviz==1.12

Install the dependencies:


pip install -r requirements.txt

1.6 Configure Environment Variables


Set the API keys in the .env file, using the variable names the code reads in later steps:

OPENAI_API_KEY=your_OpenAI_API_Key
SERPAPI_KEY=your_SerpAPI_Key
PINECONE_API_KEY=your_Pinecone_API_Key
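
The code in the following steps reads these values with os.getenv, which only sees real environment variables. A minimal sketch for loading the .env file at startup, assuming the python-dotenv package (not listed in requirements.txt) is installed:

# Hypothetical snippet, e.g. at the top of main.py; requires `pip install python-dotenv`
from dotenv import load_dotenv

# Read .env from the project directory and export its keys as environment variables
load_dotenv()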


Step 2: Create Utility Functions

Create helper functions and configurations in utils.py:


# utils.py
import os
from getpass import getpass

from datasets import load_dataset
from pinecone import Pinecone
from semantic_router.encoders import OpenAIEncoder
from tqdm.auto import tqdm

# Load the pre-chunked arXiv dataset from the Hugging Face Hub
def load_arxiv_dataset():
    dataset = load_dataset("jamescalam/ai-arxiv2-semantic-chunks", split="train")
    return dataset

# Initialize the Pinecone client, connect to an existing index, and set up the encoder
def initialize_pinecone(index_name: str):
    api_key = os.getenv("PINECONE_API_KEY") or getpass("Pinecone API key: ")
    pc = Pinecone(api_key=api_key)
    index = pc.Index(index_name)  # assumes the index already exists in Pinecone
    encoder = OpenAIEncoder(name="text-embedding-3-small")
    return index, encoder

# Populate the knowledge base by embedding and upserting records in batches
def populate_knowledge_base(index, dataset, encoder):
    data = dataset.to_pandas().iloc[:10000]
    batch_size = 128
    for i in tqdm(range(0, len(data), batch_size)):
        i_end = min(len(data), i + batch_size)
        batch = data[i:i_end].to_dict(orient="records")
        metadata = [
            {"title": r["title"], "content": r["content"], "arxiv_id": r["arxiv_id"]}
            for r in batch
        ]
        ids = [r["id"] for r in batch]
        content = [r["content"] for r in batch]
        embeds = encoder(content)
        index.upsert(vectors=list(zip(ids, embeds, metadata)))

# Build a structured report from the agent's output
def build_report(output: dict):
    research_steps = output["research_steps"]
    if isinstance(research_steps, list):
        research_steps = "\n".join(research_steps)
    return f"Research Summary:\n{research_steps}"
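
A quick usage sketch of these helpers (the query call follows the standard Pinecone client API, and the index name matches the one used later in main.py; this snippet is illustrative, not part of the project files):

# Hypothetical usage of the utils.py helpers
from utils import load_arxiv_dataset, initialize_pinecone, populate_knowledge_base

dataset = load_arxiv_dataset()
index, encoder = initialize_pinecone("gpt-4o-research-agent")
populate_knowledge_base(index, dataset, encoder)

# Semantic search over the populated knowledge base
xq = encoder(["latest advancements in NLP"])  # the encoder returns a list of vectors
res = index.query(vector=xq[0], top_k=3, include_metadata=True)
for match in res.matches:
    print(match.metadata["title"])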


Step 3: Define the Agents


Create the research agent in agents.py:
# agents.py
import os
import re
from getpass import getpass

import requests
from langchain import hub
from langchain.agents import create_openai_tools_agent
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

# Ensure the OpenAI API key is set
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") or getpass("OpenAI API key: ")

# Initialize the OpenAI chat model
llm = ChatOpenAI(
    model="gpt-4o",
    openai_api_key=os.environ["OPENAI_API_KEY"],
    temperature=0,
)

# Regex for extracting the abstract from an arXiv abstract page
abstract_pattern = re.compile(
    r'<blockquote class="abstract mathjax">\s*'
    r'<span class="descriptor">Abstract:</span>\s*(.*?)\s*</blockquote>',
    re.DOTALL,
)

@tool("fetch_arxiv")
def fetch_arxiv_tool(arxiv_id: str):
    """Fetch the abstract of the arXiv paper with the given ID."""
    res = requests.get(f"https://export.arxiv.org/abs/{arxiv_id}")
    re_match = abstract_pattern.search(res.text)
    return re_match.group(1)

# Define other tools as needed...

# Pull a standard tools-agent prompt from the LangChain Hub
prompt = hub.pull("hwchase17/openai-functions-agent")

# Create the agent
oracle_agent_runnable = create_openai_tools_agent(
    llm=llm,
    tools=[
        fetch_arxiv_tool,
        # Add other tools here...
    ],
    prompt=prompt,
)
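
Before wiring the tool into the agent, you can sanity-check it in isolation; the arXiv ID below is just an illustrative example ("Attention Is All You Need"):

# Quick standalone check of the fetch_arxiv tool
print(fetch_arxiv_tool.invoke({"arxiv_id": "1706.03762"}))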


Step 4: Define the Tasks


Define specific tasks and utility functions in tasks.py:

# tasks.py
import json
import os
from getpass import getpass

from serpapi import GoogleSearch

from agents import fetch_arxiv_tool

# Execute the fetch_arxiv tool using the arguments from the agent's latest tool call
def execute_fetch_arxiv(state: dict):
    action = state["agent_out"]
    tool_call = action[-1].message_log[-1].additional_kwargs["tool_calls"][-1]
    args = json.loads(tool_call["function"]["arguments"])  # arguments arrive as a JSON string
    out = fetch_arxiv_tool.invoke({"arxiv_id": args["arxiv_id"]})
    return {"intermediate_steps": [(action, str(out))], "chat_history": state["chat_history"]}

# Run a SerpAPI-backed Google search and join the top results into one context string
def web_search(query: str):
    search = GoogleSearch({
        "engine": "google",
        "api_key": os.getenv("SERPAPI_KEY") or getpass("SerpAPI key: "),
        "q": query,
        "num": 5,
    })
    results = search.get_dict()["organic_results"]
    contexts = "\n---\n".join([f"{x['title']}\n{x['snippet']}\n{x['link']}" for x in results])
    return contexts
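
Note that execute_fetch_arxiv takes a state dict and returns a partial state update: it is shaped like a graph node. Since requirements.txt lists langgraph, a minimal wiring sketch might look like the following (the state schema and node names here are assumptions, not part of the original guide):

# Hypothetical LangGraph wiring for the state-shaped task functions
import operator
from typing import Annotated, TypedDict

from langgraph.graph import StateGraph

from tasks import execute_fetch_arxiv

class AgentState(TypedDict):
    input: str
    chat_history: list
    agent_out: object
    intermediate_steps: Annotated[list, operator.add]

graph = StateGraph(AgentState)
graph.add_node("fetch_arxiv", execute_fetch_arxiv)
# ... add an oracle node for the agent itself, other tool nodes, and routing edges ...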


Step 5: Create the Main Execution Script

Create the main execution script in main.py:


# main.py
from agents import oracle_agent_runnable
from tasks import execute_fetch_arxiv, web_search
from utils import (
    build_report,
    initialize_pinecone,
    load_arxiv_dataset,
    populate_knowledge_base,
)

# Set up and initialize components
dataset = load_arxiv_dataset()

# Assumes the index has already been created in Pinecone
index_name = "gpt-4o-research-agent"
index, encoder = initialize_pinecone(index_name)

# Populate the knowledge base
populate_knowledge_base(index, dataset, encoder)

# Example usage of the agent
inputs = {
    "input": "summarize the latest research in NLP",
    "chat_history": [],
    "intermediate_steps": [],
}
agent_out = oracle_agent_runnable.invoke(inputs)
print(agent_out)

# Format the agent output into a structured report
report = build_report(agent_out)
print(report)

4. Running the Project


Run the agent and observe its output. Make sure all dependencies are installed and API keys are
configured correctly.

1. Run the main script:

python main.py


5. Explanation of the Workflow


5.1 Utility Functions (utils.py)

Defines helper functions and configurations, such as loading datasets and initializing Pinecone.

5.2 Agents (agents.py)

Contains the definitions and initialization of the research agent and tools.

5.3 Tasks (tasks.py)

Defines specific tasks and utility functions for the agent to perform.

5.4 Main Execution (main.py)

Serves as the entry point for running the agent, setting up components, and invoking tasks.


6. Cloud Architecture Diagram for AI-Powered Research Agent

Project Overview: An AI-powered research agent that automates the retrieval of research data,
performs web searches, and provides structured reports. The project leverages various tools and APIs,
and we aim to deploy it in a cloud environment for scalability and reliability.

6.1 Components of Cloud Architecture Diagram

1. User Interface (UI):


• A web-based interface for users to interact with the research agent.
• Input fields for users to provide queries.
• Display area for showing search results and reports.
2. API Gateway:
• Entry point for all API requests.
• Routes requests to appropriate backend services.
3. Authentication and Authorization:
• Secure access to the research agent using API keys or OAuth.
4. Compute Resources:
• Virtual Machines or Containers (e.g., AWS EC2, Azure VM, Google Cloud Compute
Engine, or Kubernetes).
• Serverless functions (e.g., AWS Lambda, Azure Functions, Google Cloud Functions) for
executing agent tasks.
5. Data Storage:
• Database for storing API keys, user data, and configuration settings (e.g., AWS RDS,
Azure SQL Database, Google Cloud SQL).
• Object storage for storing datasets and large files (e.g., AWS S3, Azure Blob Storage,
Google Cloud Storage).
6. AI and Machine Learning Services:
• Utilize cloud-based AI services for natural language processing and other ML tasks (e.g.,
OpenAI API, AWS SageMaker, Azure Cognitive Services, Google AI Platform).
• Pinecone for vector database management.
7. Networking:
• Virtual Private Cloud (VPC) to host the services securely.
• Subnets, routing tables, and internet gateways for managing network traffic.
8. Monitoring and Logging:
• Tools for monitoring application performance and logging (e.g., AWS CloudWatch,
Azure Monitor, Google Cloud Logging).
9. CI/CD Pipeline:
• Continuous Integration and Continuous Deployment pipeline for automated testing and
deployment (e.g., AWS CodePipeline, Azure DevOps, Google Cloud Build).

Specific Technologies and Services:

• Language and Tools: Python, LangChain, PyGraphviz, OpenAI, SerpAPI, Pinecone.


• Cloud Provider: [Specify AWS, Azure, or Google Cloud]
• Additional Tools: Git for version control, Docker for containerization (if applicable).
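
As a concrete illustration of how the UI and API Gateway components could front the agent, here is a minimal sketch of a web entry point, assuming FastAPI (not in requirements.txt) as the web framework:

# Hypothetical FastAPI entry point that a cloud API gateway could route requests to
from fastapi import FastAPI

from agents import oracle_agent_runnable

app = FastAPI()

@app.post("/query")
def query_agent(q: str):
    # Run the research agent on the user's query and return the raw output
    inputs = {"input": q, "chat_history": [], "intermediate_steps": []}
    agent_out = oracle_agent_runnable.invoke(inputs)
    return {"result": str(agent_out)}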


Constructive comments and feedback are welcome.

