Arnab Paul
TECHNICAL SKILLS
ETL Tools AWS Glue, Azure Data Factory, GCP Data Fusion & Dataflow, Airflow, Spark,
Sqoop, Flume, Apache Kafka, Spark Streaming, Apache NiFi, Microsoft
SSIS, Informatica PowerCenter & IICS, IBM DataStage
NoSQL Databases MongoDB, Cassandra, Amazon DynamoDB, HBase, GCP Datastore
Data Warehouse AWS RedShift, Google Cloud Storage, Snowflake, Teradata, Azure Synapse
SQL Databases Oracle DB, Microsoft SQL Server, IBM DB2, PostgreSQL, Teradata, Azure SQL
Database, Amazon RDS, GCP Cloud SQL, GCP Cloud Spanner
Hadoop Distribution Cloudera, Hortonworks, MapR, AWS EMR, Azure HDInsight, GCP Dataproc
Hadoop Tools HDFS, HBase, Hive, YARN, MapReduce, Pig, Apache Storm, Sqoop,
Oozie, Zookeeper, Spark, SOLR, Atlas
Programming & Scripting Spark Scala, Python, Java, MySQL, PostgreSQL, Shell Scripting, Pig Latin,
HiveQL
Visualization Tableau, Looker, QuickSight, QlikView, Power BI, Grafana, Python
Libraries
AWS EC2, S3, Glacier, Redshift, RDS, EMR, Lambda, Glue, CloudWatch,
Rekognition, Kinesis, CloudFront, Route53, DynamoDB, CodePipeline, EKS,
Athena, QuickSight
Azure DevOps, Synapse Analytics, Data Lake Analytics, Databricks, Blob Storage,
Azure Data Factory, SQL Database, SQL Data Warehouse, Cosmos DB
Google Cloud Platform Compute Engine, Cloud Storage, Cloud SQL, Cloud Datastore, BigQuery,
Pub/Sub, Dataflow, Dataproc, Data Fusion, Data Catalog, Cloud Spanner,
Atom
Web Development HTML, XML, JSON, CSS, jQuery, JavaScript
Monitoring Tools Splunk, Chef, Nagios, ELK
Source Code Management JFrog Artifactory, Nexus, GitHub, CodeCommit
Containerization Docker & Docker Hub, Kubernetes, OpenShift
Build & Development Tools Jenkins, Maven, Gradle, Bamboo
Methodologies Agile/Scrum, Waterfall
PROFESSIONAL EXPERIENCE
Data Engineer at CarMax | Richmond, Virginia Jan 2021 – Jul 2021
CarMax specializes in used-vehicle retail in the US. I joined their inventory management team, and as a Data
Engineer I worked with on-premises Hadoop data infrastructure, AWS cloud architecture & cloud migration to
AWS.
Responsibilities:
Designed and built scalable distributed data solutions, along with a migration plan for moving the existing
on-premises Cloudera Hadoop distribution to AWS, based on business requirements.
Worked with legacy on-premises VMs based on UNIX distributions. Worked with batch data as well as
3rd-party data delivered through FTP. Configured traditional ETL tools such as Informatica.
Worked with HDFS for distributed data storage. Configured Oozie along with Sqoop to move distributed
data with encryption. Implemented data transfers using Apache NiFi.
Wrote YAML files for Kafka producers for ingesting streaming data. Assigned partitions to consumers.
Developed Scala scripts using both DataFrames/SQL/Datasets and RDDs/MapReduce in Spark for data
aggregation and queries, writing data back into the OLTP system through Sqoop.
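For illustration, a minimal PySpark sketch of the same aggregate-and-write-back pattern (the original scripts were in Scala; the table and column names here are hypothetical):

    # PySpark sketch: aggregate raw events and persist to a Hive staging table
    # that a Sqoop export job could then push to the OLTP system.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("inventory-aggregation").getOrCreate()

    # Hypothetical Hive table of raw inventory events
    events = spark.table("inventory_db.vehicle_events")

    daily_counts = (
        events
        .groupBy("store_id", F.to_date("event_ts").alias("event_date"))
        .agg(F.count("*").alias("event_count"))
    )

    # Staging table for the downstream Sqoop export
    daily_counts.write.mode("overwrite").saveAsTable("inventory_db.daily_event_counts")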
Set up static and dynamic resource pools using YARN in Cloudera Manager for job scheduling & cluster
resource management. Used Zookeeper for configuration management, synchronization & other
services.
Performed data profiling, modeling and metadata management tasks on complex data integration
scenarios, adhering to enterprise data governance and data integration standards using Apache Atlas.
Developed the core search module using Apache Solr and customized Solr to handle fallback
searching and to provide custom functions. Worked with big data tools to integrate Solr search.
Worked with Snowflake to integrate Power BI and Solr for dashboard visualizations.
Utilized Apache Spark ML & MLlib with Python to develop and execute machine learning jobs.
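A minimal pyspark.ml sketch of the kind of Spark ML pipeline referenced above (the feature table and columns are hypothetical):

    # Assemble numeric features and fit a simple regression model with Spark ML.
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()
    df = spark.table("inventory_db.pricing_features")  # hypothetical feature table

    assembler = VectorAssembler(inputCols=["mileage", "age_months"], outputCol="features")
    lr = LinearRegression(featuresCol="features", labelCol="sale_price")

    model = Pipeline(stages=[assembler, lr]).fit(df)
    model.transform(df).select("sale_price", "prediction").show(5)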
Part of On-premises Hadoop infrastructure to AWS EMR Migration (Refactoring) Team.
Built data pipelines with data governance for real-time & batch processing using Lambda/Kappa
architectures.
Worked with the S3DistCp tool to copy data from HDFS to S3 buckets. Created a custom
utility tool in Hive that targets and deletes backed-up folders that are manually flagged.
Implemented auto-migration of JSON-format data from a MongoDB server: created the replication
server and defined the source and target endpoints.
Created S3 & EMR endpoints using PrivateLink & used an AWS private subnet network for fast transfers.
Used Spark scripts implemented on EMR to automate the comparison & validation of S3 files against the
original HDFS files.
Developed a custom Spark framework to load data from AWS S3 into Redshift for data warehousing.
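A sketch of the usual S3-to-Redshift load step such a framework would issue (a COPY from S3 via psycopg2; cluster endpoint, table, bucket and IAM role are placeholders):

    # Run a Redshift COPY that ingests curated Parquet files from S3.
    import psycopg2

    COPY_SQL = """
        COPY analytics.vehicle_inventory
        FROM 's3://example-bucket/curated/vehicle_inventory/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
        FORMAT AS PARQUET;
    """

    conn = psycopg2.connect(
        host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="dev", user="etl_user", password="***",
    )
    with conn, conn.cursor() as cur:
        cur.execute(COPY_SQL)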
Used Jenkins for project deployments and helped deploy projects on Jenkins using Git. Used
Docker to achieve delivery goals in a scalable environment & used Kubernetes for orchestration
automation.
Integrated Apache Airflow and wrote scripts to automate workflows.
Created RESTful APIs using Flask to integrate functionalities & communicate with other applications.
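A minimal Flask sketch of such a REST endpoint (routes and payload shape are hypothetical):

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/jobs/<job_id>/status", methods=["GET"])
    def job_status(job_id):
        # In practice this would look the job up in a metadata store
        return jsonify({"job_id": job_id, "status": "RUNNING"})

    @app.route("/jobs", methods=["POST"])
    def submit_job():
        payload = request.get_json(force=True)
        # Hand the request off to the downstream pipeline here
        return jsonify({"accepted": True, "job": payload}), 202

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5000)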
Integrated the Teradata warehouse into the EMR cluster. Developed BTEQ scripts to load data from the
Teradata staging area to the Teradata DataMart. Handled errors & tuned performance in Teradata queries
and utilities.
Defined all settings for an EMR cluster (hardware provisioning, security, & other setup elements) as
infrastructure-as-code, checked in and managed with source control.
Helped develop source control on CodeCommit, with CodeBuild and CodeDeploy, using CloudFormation.
Implemented Athena as a replacement for the Hive query engine. Migrated analysts, end-users, and
other processes (automated Tableau and other dashboard tools) to query S3 directly.
Integrated Git into Jenkins to automate the code check-out process. Used Jenkins for automating
builds and deployments.
Designed deployment strategies using CI/CD pipelines with remote execution. Ensured zero downtime
using a blue/green deployment strategy and shortened deployment cycles through Jenkins automation.
Worked in all areas of Jenkins: setting up CI for new branches, build automation, plugin management,
securing Jenkins, and setting up master/slave configurations.
Worked on IAM, KMS, Secrets Manager, Config, Systems Manager and others for security and access
management.
Implemented AWS RDS to store relational data & integrated it with Elastic Load Balancing.
Implemented Kinesis on EMR for streaming analysis. Migrated existing clickstream Spark jobs to
Kinesis.
Used AWS Glue as the new ETL tool. Used Glue crawlers to catalog the data from S3 and performed
ETL operations on it. Implemented scripts on Glue for data transformation, validation, and data
cleansing.
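A sketch of a typical Glue PySpark job over a crawled catalog table (database, table, column and bucket names are placeholders, not the actual job):

    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read the table the crawler registered in the Glue Data Catalog
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database="raw_db", table_name="vehicle_events"
    )

    # Basic validation/cleansing: drop rows missing the primary key
    df = dyf.toDF().filter("vehicle_id IS NOT NULL")

    df.write.mode("overwrite").parquet("s3://example-bucket/cleansed/vehicle_events/")
    job.commit()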
Used DNS management in Route53 & configured CloudFront for access to media files (images). Worked
with AWS Rekognition for real-time content filtering. Wrote Lambda functions to resize/scale images.
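A sketch of an S3-triggered Lambda that resizes images with Pillow and boto3 (bucket names and target size are illustrative):

    import io
    import boto3
    from PIL import Image

    s3 = boto3.client("s3")
    THUMB_BUCKET = "example-media-thumbnails"

    def handler(event, context):
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]

            obj = s3.get_object(Bucket=bucket, Key=key)
            image = Image.open(io.BytesIO(obj["Body"].read())).convert("RGB")
            image.thumbnail((400, 400))  # resize in place, keeping aspect ratio

            buf = io.BytesIO()
            image.save(buf, format="JPEG")
            s3.put_object(Bucket=THUMB_BUCKET, Key=key, Body=buf.getvalue())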
Worked closely with the GCP team to provide support for data pipelines running BigQuery/Atom jobs.
Worked with tools such as Athena and QuickSight for data cleansing and dashboards.
Environment: Hadoop (HDFS, HBase, Hive, YARN, MapReduce, Pig, Apache Storm, Sqoop, Oozie,
Zookeeper, Spark, SOLR, Atlas), AWS (EC2, S3, Redshift, RDS, EMR, Lambda, Glue, CloudWatch, Rekognition,
Kinesis, CloudFront, Route53, DynamoDB, CodePipeline, Athena, QuickSight), Python, MongoDB, Cassandra,
Snowflake, Airflow, Tableau
Data Engineer at CapitalOne | Capgemini India | Mumbai, India Jun 2016 – July 2019
Capgemini SE is an IT Services & Consulting company with diverse clients from all over the globe. I worked
for CapitalOne, where I was part of the team that handled the data architecture. I worked with Python,
Java applications, Hadoop & a variety of ETL tools including Informatica & Apache Sqoop/Flume.
Responsibilities:
Analyzed, designed, and built scalable distributed data solutions using Hadoop, AWS & GCP.
Worked on multi-tier applications using AWS services (EC2, Route53, S3, RDS, DynamoDB, SNS, SQS,
IAM), focusing on high availability, fault tolerance, and auto-scaling.
Participated in documenting procedures for the smooth transfer of the project from the development to
the testing environment and then moving the code to production.
Worked to implement persistent storage in AWS using EBS, S3, and Glacier. Created volumes
and configured snapshots for EC2 instances. Also built and managed Hadoop clusters on AWS.
Used Spark in Scala to convert distributed data into named columns (DataFrames) & helped develop
predictive analytics.
Developed Scala scripts using both DataFrames/SQL/Datasets and RDDs/MapReduce in Spark for data
aggregation and queries, writing data back into the OLTP system through Sqoop.
Developed Hive queries to pre-process the data required for running business processes.
Implemented multiple generalized solution models using AWS SageMaker.
Extensive expertise using the core Spark APIs and processing data on an EMR cluster.
Worked on Hive queries and Spark to create HBase tables to load large sets of structured, semi-
structured and unstructured data coming from UNIX systems, NoSQL databases, and a variety of portfolios.
Worked on ETL migration services by developing and deploying AWS Lambda functions that generate
metadata which can be written to the Glue Catalog and queried from Athena.
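A sketch of a Lambda-style handler that queries a Glue-cataloged table through Athena with boto3 (database, query and output location are placeholders):

    import boto3

    athena = boto3.client("athena")

    def handler(event, context):
        response = athena.start_query_execution(
            QueryString="SELECT portfolio, COUNT(*) AS n FROM trades GROUP BY portfolio",
            QueryExecutionContext={"Database": "curated_db"},
            ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
        )
        # The query runs asynchronously; poll the execution id for results
        return {"query_execution_id": response["QueryExecutionId"]}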
Programmed in Hive, Spark, Java, and Python to streamline & orchestrate the incoming data and build
data pipelines that surface useful insights.
Loaded data into Spark and used in-memory data computation to generate the output response, storing
the resulting datasets in HDFS/Amazon S3 storage/relational databases.
Migrated legacy Informatica batch/real-time ETL logic to Hadoop using Python, Spark Context,
Spark SQL, DataFrames and RDDs in Databricks.
Experienced in handling large datasets using partitions, Spark in-memory capabilities, broadcasts in
Spark, effective & efficient joins, and transformations during the ingestion process itself.
Worked on tuning Spark applications to set Batch Interval time, level of Parallelism and memory
tuning.
Implemented near-real-time data processing using StreamSets and the Spark/Databricks framework.
Stored Spark Datasets into Snowflake relational databases & used data for Analytics reports.
Migrated a SQL Server database into a multi-cluster Snowflake environment, created data sharing across
multiple applications, and created StreamSets pipelines based on data volume/jobs.
Developed Apache Spark jobs using Python in the test environment for faster data processing and used
Spark SQL for querying.
Used Spark containers for validating data loads in test/dev environments.
Worked on Metapipeline to source tables and to deliver calculated ratio data from AWS to Datamart
(SQL Server) & Credit Edge server.
Worked in tuning relational databases (Microsoft SQL Server, Oracle, MySQL, PostgreSQL) and columnar
databases (Amazon Redshift, Microsoft SQL Data Warehouse).
Hands-on experience in Amazon EC2, Amazon S3, Amazon RedShift, Amazon EMR, Amazon RDS, Amazon
ELB, Amazon CloudFormation, and other services of the AWS family.
Developed job processing scripts using Oozie workflows. Experienced in scheduling & job management.
Wrote Jenkins Groovy scripts for automating workflows including ingestion, ETL and reporting to
dashboards.
Worked on development of scheduled jobs using commands/BASH shell in UNIX.
Environment: Hadoop (Hive, Sqoop, Pig), AWS (EC2, S3, RedShift, EMR, EBS), Java, Python 3.3, Django, Flask,
XML, MySQL, MSSQL Server, Shell Scripting, MongoDB, Cassandra, Docker, Jenkins, JIRA, jQuery
Python Developer at Juniper Networks | Tech Mahindra | Hyderabad, India June 2014 – May 2016
HDFC Bank is a leading private sector bank in India which offers a wide range of banking products and financial
services like investment banking, life insurance and asset management. I was involved in Python web application
development (both front-end & back-end), database management, testing and deployment. I also worked with
UNIX Bash shell scripting for job scheduling & maintenance.
Responsibilities:
Created APIs, database models and views using Python to build a responsive web
application.
Worked on a fully automated continuous integration system using Git, Gerrit, Jenkins, MySQL and in-
house tools developed in Python and Bash.
Participated in the SDLC of a project including Design, Development, Deployment, Testing and
Support.
Deployed & troubleshot applications used as a data source for both customers and the internal service team.
Wrote and executed MySQL queries from Python using the Python MySQL connector and MySQLdb package.
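A minimal sketch of that query pattern with mysql.connector (connection details and table are placeholders):

    import mysql.connector

    conn = mysql.connector.connect(
        host="localhost", user="app_user", password="***", database="app_db"
    )
    cursor = conn.cursor()
    cursor.execute(
        "SELECT id, status FROM test_runs WHERE created_at >= %s", ("2016-01-01",)
    )
    for run_id, status in cursor.fetchall():
        print(run_id, status)
    cursor.close()
    conn.close()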
Implemented front-end website development using CSS, HTML, JavaScript and jQuery.
Worked on a Python/Django based web application with PostgreSQL DB and integrated with third party
email, messaging& storage services.
Developed GUI using webapp2 for dynamically displaying the test block documentation and other
features of Python code using a web browser.
Involved in design, implementation and modifying back-end Python code and MySQL database schema.
Developed user friendly graphical representation of item catalogue configured for specific equipment.
Used Beautiful Soup for web scraping to extract data & generated various capacity planning reports
(graphical) using Python packages like NumPy and matplotlib.
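A sketch of the scraping-plus-reporting pattern (the URL, HTML structure and use of requests are assumptions for illustration):

    import requests
    from bs4 import BeautifulSoup
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    html = requests.get("https://example.com/capacity").text
    soup = BeautifulSoup(html, "html.parser")

    # Assume each row exposes a numeric utilisation value in <td class="usage">
    values = [float(td.get_text()) for td in soup.select("td.usage")]

    plt.plot(values)
    plt.title("Capacity utilisation")
    plt.xlabel("Sample")
    plt.ylabel("Usage (%)")
    plt.savefig("capacity_report.png")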
Automated different workflows, which were previously initiated manually, with Python scripts and UNIX shell
scripting.
Fetched Twitter feeds for certain important keywords using the Twitter Python API.
Used Shell Scripting for UNIX Jobs which included Job scheduling, batch-job scheduling, process control,
forking and cloning and checking status.
Monitored Python scripts that are run as daemons on UNIX to collect trigger and feed arrival
information.
Used JIRA for bug & issue tracking and added algorithms to application for data and address generation.
Environment: Python 2.7 (BeautifulSoup, NumPy, matplotlib), Web Development (CSS, HTML, JavaScript,
jQuery), Database (MySQL, PostgreSQL), UNIX/Linux Shell Script, JIRA, Jenkins, GIT.
Education:
2014, Bachelor of Computer Science