
Atal Bihari Vajpayee
Indian Institute of Information Technology and Management Gwalior

Colloquium
A project report submitted in complete fulfilment of the requirements for the B.Tech project

Submitted By:
Tushar Guha Neogi
(2018IMG-066)
CANDIDATE’S DECLARATION
I hereby certify that the work, which is being presented in the report entitled
Colloquium - Google Summer of Code (GSoC), in complete fulfilment of the
requirement for the award of the Degree of Integrated Postgraduate (IT + MBA) and
submitted to the institution, is an authentic record of my own work carried out
during the period June 2021 to August 2021. I have also cited the references for the
text(s)/figure(s)/table(s) from where they have been taken.

Date: 16/10/2022 Signature of the Candidate

INTERNSHIP CERTIFICATE

ABSTRACT
Monitoring the functioning of systems on the basis of multiple metrics is crucial for
any system that provides a service or product to its customers. Monitoring enables
the system maintainer to predict failures, identify bottlenecks in the system, and
solve them accordingly. There are various monitoring tools already present in the
market. BenchRoutes focuses on the monitoring of API endpoints and provides
multiple features for doing so, which makes it unique among other monitoring
software.

BenchRoutes has already released its version 1 and has proved to be a great tool for
monitoring endpoints, providing information about API jitter, ping, response delay,
and system monitoring metrics such as RAM usage and GPU usage. However, there
were also a few challenges to overcome, and scalability was a crucial one. Scaling the
existing architecture required redesigning it to reduce resource consumption, adding
different scrape intervals for different endpoints, and updating the querier and API
integration. During the span of this project I, along with my mentor, worked on these
issues and successfully solved each of these challenges.

ACKNOWLEDGEMENT
I am highly indebted to Mr. Harkishen Singh (mentor) and am obliged to him for
giving me the autonomy to function and experiment with ideas. I would like to take
this opportunity to express my profound gratitude to him for his guidance and his
personal interest in my project, and for his constant support coupled with
confidence-boosting and motivating sessions that proved very fruitful and were
instrumental in infusing self-assurance and trust within me. The nurturing and
blossoming of the present work is mainly due to his valuable guidance, suggestions,
astute judgement, constructive criticism, and an eye for perfection. My mentor
always answered a myriad of my doubts with smiling graciousness and prodigious
patience. He never let me feel that I am a novice, always lending an ear to my views,
appreciating and improving them, and giving me a free hand in my project. It is only
because of his overwhelming interest and helpful attitude that the present work has
attained its stage. Finally, I am grateful to my institution and colleagues whose
constant encouragement served to renew my spirit, refocus my attention and energy,
and carry out this work.

(Tushar Guha Neogi)

TABLE OF CONTENTS

1. INTRODUCTION……………………………………………………………………………………………………07
2. LITERATURE REVIEW…………………………………………………………………………………………08
2.1. Monitoring…………………………………………………………………………………………………….08
2.2. Concurrency………………………………………………………………………………………………….09
3. PROJECT OBJECTIVES…………………………………………………………………………………………12
3.1. Revamping Configurations Of BenchRoutes………………………………………….12
3.2. Scheduling Jobs for scraping endpoint metrics……………………………………..12
3.3. Adding querier and API integration………………………………………………………..13
4. METHODOLOGY……………………………………………………………………………………………………14
4.1. Configuration……………………………………………………………………………………………….14
4.2. Job…………………………………………………………………………………………………………………..15
4.3. Module…………………………………………………………………………………………………………..17
4.4. Architectural Diagram………………………………………………………………………………..20
5. RESULT………………………………………………………………………………………………………………………21
6. CONCLUSION…………………………………………………………………………………………………………22

1. INTRODUCTION

Modern web applications can have routes ranging from a few to millions in
number. This makes it tough to discover the condition and state of such
applications at any given point. Bench-routes monitors the routes of a web
application and helps you know the current state of each route, along with
various related performance metrics.

Bench-routes is a monitoring tool that monitors various performance-based
metrics of the configured API endpoints. Since monitoring each endpoint on a
large scale is an arduous task, bench-routes comes to the rescue. The API
endpoints monitored through Bench-routes can be very large in number
(thousands of endpoints). The scraped data is stored in a TSDB as object
storage. This project mainly focused on revamping the data model of the
scraping mechanism and scheduling the scraping jobs concurrently. It also
involved changing the API configuration and recommencing querying from the
new TSDB.

2. LITERATURE REVIEW
Bench-routes' domain lies in the field of monitoring. The importance of this domain
is increasing exponentially with the ever-increasing complexity of the systems that
provide the products we use on a daily basis, e.g. YouTube, Google, Zomato, Netflix,
Amazon, Flipkart, etc. These are just a handful of examples. In this world of
digitalisation, there are systems powering each product that we use. With a system
come multiple possibilities of failure, and overseeing all the workings of a system lies
in the domain of Monitoring. Let us learn more about monitoring.

2.1. Monitoring
Monitoring entails overseeing the entire development process from planning,
development, integration and testing, deployment, and operations. It involves a
complete and real-time view of the status of applications, services, and infrastructure
in the production environment. Features such as real-time streaming, historical
replay, and visualisations are critical components of application and service
monitoring.

Moving to the operations side of the life cycle, the site reliability engineer needs to
understand the services that can be measured and monitored, so if there's a problem,
it can be fixed. If you don’t have a DevOps toolchain that ties all these processes
together, you have a messy, uncorrelated, chaotic environment. If you have a
well-integrated toolchain, you can get better context into what is going on.

Key Capabilities

1. Shift-left testing: Shift-left testing, performed earlier in the life cycle, helps to
increase quality, shorten test cycles, and reduce errors. For DevOps teams, it is
important to extend shift-left testing practices to monitor the health of
pre-production environments. This ensures that monitoring is implemented early
and often, so that continuity is maintained through production and the quality of
monitoring alerts is preserved. Testing and monitoring should work together, with
early monitoring helping to assess the behaviour of the application through key user
journeys and transactions. This also helps to identify performance and availability
deviations before production deployment.

2. Alert and incident management: In a cloud-native world, incidents are as much a
fact of life as bugs in code. These incidents include hardware and network failures,
misconfiguration, resource exhaustion, data inconsistencies, and software bugs.
DevOps teams should embrace incidents and have high-quality monitors in place to
respond to them.

Some of the best practices to help with this are:


● Build a culture of collaboration, where monitoring is used during
development along with feature/functionality and automated tests
● During development, build appropriate, high-quality alerts in the code that
minimise mean time to detect (MTTD) and mean time to isolate (MTTI)
● Build monitors to ensure dependent services operate as expected
● Allocate time to build required dashboards and train team members to use
them
● Plan “war games” for the service to ensure monitors operate as expected
and catch missing monitors
● During sprints, plan to close actions from previous incident reviews,
especially actions related to building missing monitors and automation
● Build detectors for security (upgrades/patches/rolling credentials)
● Cultivate a “measure and monitor everything” mindset with automation
determining the response to detected alerts

2.2. Concurrency
This concept will be used a lot while designing the revamped architecture of the
system. Concurrency is the execution of multiple instruction sequences at the same
time. It happens in the operating system when several process threads run in
parallel. The running process threads communicate with each other through shared
memory or message passing. Concurrency results in the sharing of resources, which
can lead to problems like deadlocks and resource starvation.
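Since bench-routes itself is written in Go, a small, self-contained Go illustration of the two communication styles mentioned above may help: shared memory guarded by a mutex, and message passing over a channel. This example is illustrative only and is not bench-routes code.

package main

import (
    "fmt"
    "sync"
)

func main() {
    var (
        mu      sync.Mutex
        counter int // shared memory, guarded by mu
        wg      sync.WaitGroup
    )
    results := make(chan int, 4) // message passing between goroutines

    for i := 1; i <= 4; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            mu.Lock() // without the lock, concurrent writes would race
            counter++
            mu.Unlock()
            results <- id // communicate the result instead of sharing it
        }(i)
    }

    wg.Wait()
    close(results)
    for id := range results {
        fmt.Println("worker finished:", id)
    }
    fmt.Println("counter:", counter)
}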

Principles of Concurrency :
Both interleaved and overlapped processes can be viewed as examples of concurrent
processes; both present the same problems.
The relative speed of execution cannot be predicted. It depends on the following:
● The activities of other processes
● The way the operating system handles interrupts
● The scheduling policies of the operating system

Problems in Concurrency :
● Sharing global resources –
Sharing of global resources safely is difficult. If two processes both make use
of a global variable and both perform read and write on that variable, then the
order in which various reads and writes are executed is critical.
● Optimal allocation of resources –
It is difficult for the operating system to manage the allocation of resources
optimally.
● Locating programming errors –
It is very difficult to locate a programming error because reports are usually
not reproducible.
● Locking the channel –
It may be inefficient for the operating system to simply lock the channel and
prevent its use by other processes.

Advantages of Concurrency :
● Running of multiple applications –
It enables you to run multiple applications at the same time.

● Better resource utilisation –
It enables resources that are unused by one application to be used by other
applications.
● Better average response time –
Without concurrency, each application has to be run to completion before the
next one can be run.
● Better performance –
It enables better performance by the operating system. When one application
uses only the processor and another application uses only the disk drive then
the time to run both applications concurrently to completion will be shorter
than the time to run each application consecutively.

Drawbacks of Concurrency :
● It is required to protect multiple applications from one another.
● It is required to coordinate multiple applications through additional
mechanisms.
● Additional performance overheads and complexities in operating systems are
required for switching among applications.
● Sometimes running too many applications concurrently leads to severely
degraded performance.

3. PROJECT OBJECTIVES
The project had three major objectives which were to be completed over a span of 3
months. The objectives are as follows:

3.1. Revamping Configurations Of BenchRoutes


BenchRoutes requires an API configuration from the user for monitoring the
endpoints. The configuration is parsed from a file, config.yml. The API details are
provided by the user in this file according to the prescribed format. The earlier
configuration structure on which BenchRoutes was working was not well structured
and had unnecessary nesting. We were required to design a new configuration
format and a new parser package to parse it.

3.2 Scheduling Jobs for scraping endpoint metrics


Bench-routes monitors API endpoints by launching goroutines for each endpoint
with different workers concurrently. However, there are multiple problems with this
system, such as relaunching of excessive goroutines, no independent scrape interval
for each endpoint, and an inefficient jitter calculation method. We needed to
redesign the architecture for scraping the metrics.

The problems with the existing architecture were:


● The code that iteratively launches goroutines after every scrape interval is
duplicated across different modules. All modules have the same functions: iterate()
and perform().
● Excessive launching of goroutines: we launch multiple goroutines each time,
performing their respective functions (Jitter, Ping, Monitor), and thereafter kill them;
then, after the scrape interval, we launch the goroutines again and repeat the cycle.
● Different endpoints can have different scrape intervals, but the current
implementation does not support this. All the endpoints have a global scrape
interval.
● Ping Module can be used to calculate Jitter values but this is not implemented
yet.

3.3. Adding querier and API integration
Since the base architecture of BenchRoutes has changed due to the previous two
ideas, the earlier APIs need to be replaced with new APIs according to the client
requirements. Also, a new TSDB has been integrated with the new design, hence
querying data from the TSDB should also be changed. Once all these tasks are
completed, all the different modules should be integrated to form the final product.

4. METHODOLOGY

4.1. Configuration
Bench-routes is configured using local-config.yml. This will be generalised to accept the file
name from a flag.

Updated configuration file.

apis:
  - name: route-name-one
    every: 3s
    protocol: https
    domain_or_ip: www.some-url-for-one.com
    route: /api/v1/test
    method: get
    headers:
      key_1: value_1
    params:
      key_1: value_1
      key_2: value_2
    body:
      key_1: value_1
      key_2: value_2
  - name: route-name-two
    every: 5s
    protocol: https
    domain_or_ip: www.some-url-for-two.com
    route: /api/v1/test
    method: post
    params:
      key_1: value_1
      key_2: value_2
    body:
      key_1: value_1
      key_2: value_2

To frame the above YAML in words, the root will be an apis type. apis contains an
array of routes, where each route has name (string), every (duration), protocol
(string), domain_or_ip (string), route (string), method (string), headers
(map[string]string), params (map[string]string) and body (map[string]string).

For the ping interval, the interval will be the lowest of all the scrape intervals
corresponding to the domain. This is to keep things simple for now.
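As an illustration, below is a minimal sketch of the Go types a parser package could use to read this configuration. The package layout, field names, the Load and ScrapeInterval helpers, and the use of gopkg.in/yaml.v3 are assumptions made for this sketch, not the actual bench-routes parser.

package config

import (
    "os"
    "time"

    "gopkg.in/yaml.v3" // assumption: any YAML library would do
)

// API mirrors one entry under the top-level `apis` list.
type API struct {
    Name       string            `yaml:"name"`
    Every      string            `yaml:"every"` // parsed into a time.Duration below
    Protocol   string            `yaml:"protocol"`
    DomainOrIP string            `yaml:"domain_or_ip"`
    Route      string            `yaml:"route"`
    Method     string            `yaml:"method"`
    Headers    map[string]string `yaml:"headers"`
    Params     map[string]string `yaml:"params"`
    Body       map[string]string `yaml:"body"`
}

// Config is the root of the YAML file.
type Config struct {
    APIs []API `yaml:"apis"`
}

// Load reads and parses the configuration file whose path is supplied,
// for example, from a command-line flag.
func Load(path string) (*Config, error) {
    data, err := os.ReadFile(path)
    if err != nil {
        return nil, err
    }
    var c Config
    if err := yaml.Unmarshal(data, &c); err != nil {
        return nil, err
    }
    return &c, nil
}

// ScrapeInterval converts the human-readable `every` field (e.g. "3s").
func (a API) ScrapeInterval() (time.Duration, error) {
    return time.ParseDuration(a.Every)
}

With types like these, the file name can be taken from a flag and handed to Load, and each route's every field can be validated while parsing.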

4.2. Job
A job is a basic unit (low-level type) in bench-routes that contains information about a route.
There will be two types of jobs:
1. monitoringJob
2. machineJob

monitoringJob is for the monitoring of API endpoints, for calculations like response
time and response length. machineJob is for performing pings and extracting the
ping and jitter information from the response.

Monitoring job
Let's discuss monitoringJob first.

Each monitoringJob represents exactly one route from the apis list, thereby having a
one-to-one relation between a job and a route.

type jobInfo struct {
    mux         sync.RWMutex
    name        string
    every       time.Duration
    lastExecute time.Time
}

type monitoringJob struct {
    jobInfo
    app     appendable
    client  *http.Client
    request *http.Request
}

The monitoringJob type (and other job-related types) will be contained in a package
job, which will be inside the lib package. It will contain two types: jobInfo and
monitoringJob. monitoringJob embeds the jobInfo type.

A job can be created using the constructor newMonitoringJob(app appendable,
method, url string, every time.Duration) that returns a &monitoringJob{}. Each job
has a predefined *http.Request and *http.Client, which will be created when the job
is created (in the constructor). These will be stored in the request and client fields of
the job.

Storing the request and client in the struct saves us from repeated allocations and
saves CPU, since we will directly use them as client.Do(request) and get the response.
Earlier, we used to parse URLs at each interval, which is no longer the case.
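A minimal sketch of such a constructor, assuming the signature quoted above; the error return, the empty request body, and the use of the URL as the job name are simplifications made for this sketch, and wiring headers/body from the configuration is omitted.

// Sketch only: builds the request and client once so Execute can reuse them.
func newMonitoringJob(app appendable, method, url string, every time.Duration) (*monitoringJob, error) {
    req, err := http.NewRequest(method, url, nil) // no request body in this sketch
    if err != nil {
        return nil, err
    }
    return &monitoringJob{
        jobInfo: jobInfo{name: url, every: every}, // name could instead be the route name from the config
        app:     app,
        client:  &http.Client{},
        request: req,
    }, nil
}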

appendable is an interface that will be passed to the newMonitoringJob()
constructor. It will contain functions like Add(name, sub_type, timestamp, value),
which will be provided by the tsdb package.
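Expressed as code, the appendable contract could look like the sketch below; the concrete parameter types are an assumption, since the design only fixes the argument order.

// appendable abstracts the TSDB write path; implemented by the tsdb package.
type appendable interface {
    Add(name, subType string, timestamp int64, value float64)
}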

A job will have the following functions:

1. Execute(sig <-chan struct{}): executes the predefined HTTP request using the
client and saves the response (via the appendable) on receiving a signal from the
channel.
2. Abort(): stops the job by closing the sig channel.
3. Info() jobInfo: returns the jobInfo.

There will be an interface Executable with the following functions:

1. Execute(sig <-chan struct{})
2. Abort()
3. Info() jobInfo

The Executable interface will be implemented by the monitoringJob type.
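The following sketch shows how monitoringJob could satisfy Execute. The metric name "response_time", the skip-on-error handling, and the update of lastExecute are assumptions made for illustration.

// Execute fires the prebuilt request each time the scheduler signals and
// records the response time via the appendable; it returns when sig is closed.
func (j *monitoringJob) Execute(sig <-chan struct{}) {
    for range sig {
        start := time.Now()
        resp, err := j.client.Do(j.request)
        if err != nil {
            continue // assumption: skip this tick on failure
        }
        resp.Body.Close()
        j.app.Add(j.name, "response_time", time.Now().Unix(),
            float64(time.Since(start).Milliseconds()))

        j.mux.Lock()
        j.lastExecute = time.Now() // lets the scheduler space out the next run
        j.mux.Unlock()
    }
}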

Machine Job
On similar lines, there will be another job, machineJob, which will also embed the
jobInfo type. It will be exactly like monitoringJob, having the same functions, and it
implements the Executable interface. It will have a constructor newMachineJob(app
appendable, name, domainOrUrl string, every time.Duration) that returns a
&machineJob{}. The only difference is that machineJob will have the following struct:

type machineJob struct {
    jobInfo
    app   appendable
    sigCh chan struct{}
    ping  *ping.Pinger
}

Each machineJob will have a one-to-one relation with a unique domain_or_ip in the
configuration file.

In the Execute() function of machineJob, it will call the Run() function of the
*ping.Pinger type instead of the client.Do(request) call in monitoringJob's Execute.
For more info on how go-ping can be used, see https://github.com/go-ping/ping.
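On the machine side, a sketch of Execute using go-ping could look like the following; creating a fresh Pinger on every tick, the sample count of 3, and reporting StdDevRtt as jitter are choices made for this sketch rather than the actual implementation.

// Execute pings the domain on every scheduler signal and records ping/jitter.
func (j *machineJob) Execute(sig <-chan struct{}) {
    for range sig {
        pinger, err := ping.NewPinger(j.name) // assumption: j.name holds the domain_or_ip
        if err != nil {
            continue
        }
        pinger.Count = 3
        if err := pinger.Run(); err != nil { // blocks until the pings complete
            continue
        }
        stats := pinger.Statistics()
        now := time.Now().Unix()
        j.app.Add(j.name, "ping", now, float64(stats.AvgRtt.Milliseconds()))
        j.app.Add(j.name, "jitter", now, float64(stats.StdDevRtt.Milliseconds()))

        j.mux.Lock()
        j.lastExecute = time.Now()
        j.mux.Unlock()
    }
}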

Finally, there will be a NewJob(typ, app, name, url) function that returns an
Executable. This function will be called by other modules and, based on typ's value, it
will decide whether to create a machineJob or a monitoringJob and accordingly call
the corresponding constructor.
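A sketch of such a factory is shown below; the extra parameters (method, every) and the numeric type constants are assumptions derived from the two constructors described earlier.

const (
    typeMonitor uint8 = iota
    typeMachine
)

// NewJob hides the concrete job type behind the Executable interface.
func NewJob(typ uint8, app appendable, name, url, method string, every time.Duration) (Executable, error) {
    switch typ {
    case typeMonitor:
        return newMonitoringJob(app, method, url, every)
    case typeMachine:
        return newMachineJob(app, name, url, every), nil
    default:
        return nil, fmt.Errorf("job: unknown type %d", typ)
    }
}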

4.3. Module
A module is a high-level unit in bench-routes that calls or manages multiple
low-level units (like jobs).

Modules will be of two types, similar to the job:


1. Machine
2. Monitor

The Machine module is responsible for managing machineJob-type goroutines.
Machine, in general, corresponds to virtual machines and does high-level
management of the (low-level) jobs that calculate ping and jitter.

The Monitor module is responsible for managing monitoringJob-type goroutines.
This is for API/route monitoring and manages the (low-level) jobs that calculate
response time and length.

Both Machine and Monitor will implement the Runnable interface.

The Runnable interface will contain (see the sketch after this list):


1. Run()
2. Reload(configuration)
3. Stop()
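Put as code, the Runnable contract might look like the sketch below; the parameter type for Reload is assumed to be the parsed configuration from Section 4.1.

type Runnable interface {
    Run()
    Reload(conf *config.Config) // assumed parameter type
    Stop()
}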

Machine
The Machine type will launch the machineJobs. These jobs will be launched after
creating them based on the routes received from the configuration. Once you get a
route, create a job using the factory, passing the type as machine along with the
relevant details, and create a channel for it. After creating the channel, call the job's
Execute (as it implements the Executable interface) and pass the created channel
into this Execute function.

Example of a job factory signature:

NewJob(typ uint8, url string) Executable

Store the created channel in a map[*jobInfo]chan<- struct{}, since this channel will
be used to signal the running goroutine in Execute() to do the ping operation. You
can get the jobInfo from the created job via job.Info(), as the job implements Info().

type Machine struct {
    scheduler
    mux    sync.Mutex
    jobs   map[*jobInfo]chan<- struct{}
    reload chan struct{}
}

The Run of Machine will have a goroutine that listens to the reload channel. If the
reload channel closes, the goroutine exits, which marks the shutting down of the
machine.

Functions (a code sketch of this flow follows the list):
1. Run(): runs the Machine goroutine and starts listening to the reload channel. It
creates a context and passes it to the scheduler's run.
2. Reload(configuration): receives the configuration and, based on that, prepares a
new map[*jobInfo]chan<- struct{} and takes mux.Lock(). After taking the lock, we
replace the jobs of the Machine with this new map, release the lock via mux.Unlock(),
and pass a signal to the reload channel, signalling Run to restart the process.
3. Stop(): closes the reload channel, which signals Run() to stop its operations, return
and exit.
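A sketch of this flow is given below. buildJobs is a hypothetical helper, not named in the design above, that creates the jobs from the configuration, starts their Execute goroutines, and returns the jobInfo-to-channel map.

// Run repeatedly starts a scheduler over the current jobs map and restarts it
// whenever Reload signals; it exits once Stop closes the reload channel.
func (m *Machine) Run() {
    for {
        m.mux.Lock()
        sched := scheduler{scanFrequency: time.Second, timeline: m.jobs}
        m.mux.Unlock()

        ctx, cancel := context.WithCancel(context.Background())
        go sched.Run(ctx)

        _, ok := <-m.reload
        cancel() // stop the currently running scheduler
        if !ok {
            return // reload channel closed by Stop(): shut the module down
        }
        // A reload signal was received: loop and build a fresh scheduler
        // whose timeline reflects the newly swapped-in jobs map.
    }
}

// Reload swaps in jobs built from the new configuration and signals Run.
func (m *Machine) Reload(conf *config.Config) {
    newJobs := buildJobs(conf) // hypothetical helper, see lead-in above
    m.mux.Lock()
    m.jobs = newJobs
    m.mux.Unlock()
    m.reload <- struct{}{}
}

// Stop closes the reload channel, which makes Run return.
func (m *Machine) Stop() {
    close(m.reload)
}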

Now let's see what a scheduler is.

type scheduler struct {
    scanFrequency time.Duration
    timeline      map[*jobInfo]chan<- struct{}
}

The scheduler is responsible for sending signals to the sig channel of each job. This
triggers the job to do its work, which can be a ping operation or an HTTP request.

The scheduler works by going through all the jobInfo entries (the keys of the timeline
map) and checking whether the difference between the current time and the
lastExecute of that jobInfo is greater than or equal to the every of that jobInfo. If this
is true, it sends a signal to the corresponding channel (the value for that jobInfo key
in the timeline). This process happens every scanFrequency, which we will always
set to 1 second.

Functions in the scheduler:

1. Run(ctx context.Context): this function is called by the Run() of the
Machine/Monitor and accepts a cancellable context. It creates a ticker with a
1-second interval, and at each tick it looks through all the jobInfo entries in the
timeline and sends the signals based on the algorithm above. The scheduler's Run()
also listens on ctx.Done(), and if the context is cancelled, it exits. The cancel function
(which you get when you create a context with context.WithCancel) is used by the
Monitor/Machine to stop the scheduler, which happens after every reload. In short,
as soon as the higher module gets a reload signal, it stops the currently running
scheduler, creates a new scheduler with a new timeline, and runs the new schedule.

A scheduler is created whenever the high-level module's constructor is called. The
scheduler's Run(context) is called immediately so that the scheduler is active, but it
won't do anything while the timeline is empty. After the timeline gets filled following
a reload, it starts working.
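A sketch of the scheduler loop described above; the non-blocking send is a choice made in this sketch so that a busy job cannot stall the scheduler.

// Run ticks every scanFrequency and signals each job whose interval has elapsed.
// It exits when the Machine/Monitor cancels the context (on reload or stop).
func (s *scheduler) Run(ctx context.Context) {
    ticker := time.NewTicker(s.scanFrequency) // fixed at 1 second
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            now := time.Now()
            for info, sigCh := range s.timeline {
                info.mux.RLock()
                due := now.Sub(info.lastExecute) >= info.every
                info.mux.RUnlock()
                if !due {
                    continue
                }
                select {
                case sigCh <- struct{}{}: // trigger the job's Execute
                default: // job still busy; try again on the next tick
                }
            }
        }
    }
}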

Monitor
The Monitor module is exactly the same as Machine. The only difference is that it
handles monitoringJob instead of machineJob as in the Machine module's case.
Everything else remains the same, including the scheduler.

4.4. Architectural Diagram

5. RESULT
I have made a new parser package that parses the data from config.yml according to
the new design. You can have a look at the new design in the Configuration section. I
have written tests covering most of the edge cases. The parser also validates the API
config against the design.

After discussion with my mentor, we came up with a new composite-pattern design.
Now each endpoint is mapped to an individual job. The jobs can be described as
iteratively running goroutines that are launched concurrently once the server is
started. These jobs are then scheduled using a scheduler, which keeps track of the
last execution time and schedules accordingly. There is also a high-level module
abstraction that controls the low-level units. The calculation of the jitter metric is
also done using the ping metric.

For this implementation, I have built multiple packages: job, module, scheduler and
evaluate. Each package has unit tests and also integration tests.

I have made a new package, querier, which does the querying from the TSDB. It finds
the range endpoints in O(log n) time using binary search. It returns the response
with multiple details about the query, such as the evaluation time, the query type,
and values, which is an array of ping/jitter/monitor values according to the query
type. I have also written tests for the package. Apart from this, I have made a new
api.go file which is integrated with the querier and contains all the APIs required by
the client side, with proper error handling. Refer to the commit links for the actual
code.
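To illustrate the idea, below is a sketch of a range lookup over time-ordered samples using the standard library's sort.Search; the sample type and function name are illustrative and do not mirror the actual querier package.

type sample struct {
    Timestamp int64
    Value     float64
}

// queryRange returns all samples with start <= Timestamp <= end, assuming the
// slice is sorted by Timestamp. Both boundaries are located in O(log n).
func queryRange(samples []sample, start, end int64) []sample {
    lo := sort.Search(len(samples), func(i int) bool { return samples[i].Timestamp >= start })
    hi := sort.Search(len(samples), func(i int) bool { return samples[i].Timestamp > end })
    return samples[lo:hi]
}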

Commit Links
1. https://github.com/bench-routes/bench-routes/commit/2c4998c174527ff39fc2362746354c42a9111439
2. https://github.com/bench-routes/bench-routes/commit/43e68363f8c9e2ff134e9ef4a4c44d2df82cc8ab
3. https://github.com/bench-routes/bench-routes/commit/f8ed0155ca38228ce57d112dac636e1eaa130c1a

6. CONCLUSION
I came out of the 12-week internship with an exuberant amount of confidence and
knowledge. The objectives that were set before the internship period were
completed, and the revamped architecture solved all the issues that arose in the
earlier version. The concurrency of the system was improved with a scheduler
scheduling the independent goroutines, which saved a lot of system resources. The
new architecture also addressed the issue of independent scrape intervals for each
API endpoint. In addition, the updated TSDB can use the new querier for efficient
querying using binary search.

This experience also built secondary skills like adaptability, since I had to adapt to
the existing code quickly and find the bottlenecks in the system. Agile methodology
was followed during the internship period, where work was done in sprints with
three meetings per week scheduled with my mentor. Planning tasks beforehand and
executing them within the deadline taught me to work under a limited time frame.
Finally, after all the meetings and hours of learning and writing code, the objectives
were completed.

