Akhil
Data Engineer
Email: akhil@crispmails.com
Contact: +1 6147697922
Professional Summary:
TECHNICAL SKILLS:
Hadoop/Big Data Technologies: Hadoop, MapReduce, Sqoop, Hive, Oozie, Spark, ZooKeeper, Cloudera Manager, Kafka, Flume
ETL Tools: Informatica
NoSQL Databases: Cosmos DB, HBase, Cassandra, DynamoDB, MongoDB
Monitoring and Reporting: Tableau, custom shell scripts, Power BI
Hadoop Distributions: Hortonworks, Cloudera
Build Tools: Maven
Programming & Scripting: Python, Scala, SQL, Shell Scripting, C, C++
Databases: Oracle, MySQL, Teradata
Version Control: Git
Operating Systems: Linux, Unix, Mac OS X, CentOS, Windows 10, Windows 8, Windows 7
Cloud Computing: AWS, AWS EC2, AWS S3, Azure SQL Database, Azure Data Studio, Azure SQL Data Warehouse, Azure Data Factory (ADF)
Web Technologies: HTML, XML, JDBC, JSP, CSS, JavaScript, SOAP
EDUCATION:
PROFESSIONAL EXPERIENCE:
Responsibilities:
• Proficient in working with the Azure cloud platform (HDInsight, Data Lake, Databricks, Blob Storage, Data Factory, Synapse, SQL DB, DWH, and Data Storage Explorer).
• Designed and deployed data pipelines using DataLake, DataBricks, and Apache Airflow.
• Enabled other teams to work with more complex scenarios and machine learning solutions.
• Worked with Azure Data Factory to integrate data from both on-premises (MySQL, Cassandra) and cloud (Blob Storage, Azure SQL DB) sources, applying transformations before loading into Azure Synapse.
• Developed Spark Scala functions for mining data to provide real-time insights and reports.
• Configured Spark Streaming to receive real-time data from Apache Flume and persisted the stream to Azure Table storage using Scala.
• Used Azure Data Lake to store data and run all types of processing and analytics.
• Created data marts after analyzing raw data in the data warehouse, splitting it into sections by business or department area so that downstream users can easily access insights.
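The mart-splitting step above can be sketched in a few lines of plain Python; the row fields and department values here are hypothetical, and a real warehouse would do this with partitioned tables rather than in-memory dicts:

```python
from collections import defaultdict

# Hypothetical warehouse rows; field names are illustrative only.
warehouse_rows = [
    {"order_id": 1, "department": "sales", "amount": 120.0},
    {"order_id": 2, "department": "finance", "amount": 75.5},
    {"order_id": 3, "department": "sales", "amount": 42.0},
]

def build_data_marts(rows, key="department"):
    """Split warehouse rows into per-department sections (mini data marts)."""
    marts = defaultdict(list)
    for row in rows:
        marts[row[key]].append(row)
    return dict(marts)

marts = build_data_marts(warehouse_rows)
```

Each department then queries only its own slice instead of scanning the full warehouse.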
• Ingested data into Azure Blob Storage and processed it using Databricks. Wrote Spark Scala scripts and UDFs to perform transformations on large datasets.
• Utilized Spark Streaming API to stream data from various sources. Optimized existing Scala code and
improved the cluster performance.
• Used Spark DataFrames to create various datasets and applied business transformations and data cleansing operations in Databricks notebooks.
• Wrote Python scripts to build ETL pipelines and Directed Acyclic Graph (DAG) workflows using Airflow and Apache NiFi.
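The DAG idea behind such workflows can be shown without Airflow itself, using the standard library's topological sorter; the task names and edges below are purely illustrative stand-ins for real pipeline steps:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical ETL steps; an Airflow DAG would wrap these in operators.
def extract():   return "raw"
def transform(): return "clean"
def load():      return "done"

# Each task maps to the set of tasks it depends on.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}
tasks = {"extract": extract, "transform": transform, "load": load}

def run_dag(dag, tasks):
    """Execute tasks in dependency order, returning (name, result) pairs."""
    order = list(TopologicalSorter(dag).static_order())
    return [(name, tasks[name]()) for name in order]

results = run_dag(dag, tasks)
```

Airflow adds scheduling, retries, and distribution on top of exactly this ordering guarantee.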
• Distributed tasks across Celery workers to manage communication between multiple services.
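The task-distribution pattern can be sketched with stdlib queues and threads; in production, Celery replaces this plumbing with a message broker (e.g. RabbitMQ or Redis) and remote worker processes, and the doubling "work" here is just a placeholder:

```python
import queue
import threading

task_queue = queue.Queue()
results = []
lock = threading.Lock()

def worker():
    """Consume tasks until a poison pill (None) arrives."""
    while True:
        item = task_queue.get()
        try:
            if item is None:
                break
            with lock:
                results.append(item * 2)  # stand-in for real service work
        finally:
            task_queue.task_done()

workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()
for n in range(5):
    task_queue.put(n)
task_queue.join()          # block until every real task is processed
for _ in workers:
    task_queue.put(None)    # one poison pill per worker
for w in workers:
    w.join()
```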
• Monitored the Spark cluster using Log Analytics and the Ambari Web UI. Transitioned log storage from Cassandra to Azure SQL Data Warehouse, improving query performance.
• Developed data ingestion pipelines on an Azure HDInsight Spark cluster using Azure Data Factory and Spark SQL. Also worked with Cosmos DB (SQL API and Mongo API).
• Designed custom-built input adapters using Spark, Hive, and Sqoop to ingest data (Snowflake, MS SQL, MongoDB) into HDFS for analysis.
• Loaded data from Web servers and Teradata using Sqoop, Flume and Spark Streaming API.
• Used Flume sink to write directly to indexers deployed on cluster, allowing indexing during ingestion.
• Migrated from Oozie to Apache Airflow; developed Oozie and Airflow workflows for daily incremental loads, pulling data from RDBMS sources (MongoDB, MS SQL).
• Implemented performance tuning logic on targets, sources, mappings, and sessions to provide maximum efficiency and performance.
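Daily incremental loads typically hinge on a watermark: only rows changed since the last successful run are pulled. A minimal sketch, with hypothetical row shapes and an `updated_at` column standing in for whatever change-tracking field the source exposes:

```python
from datetime import datetime

# Hypothetical source rows; a real load would query the RDBMS instead.
source_rows = [
    {"id": 1, "updated_at": datetime(2021, 1, 1)},
    {"id": 2, "updated_at": datetime(2021, 1, 2)},
    {"id": 3, "updated_at": datetime(2021, 1, 3)},
]

def incremental_load(rows, watermark):
    """Return rows changed since the last run plus the advanced watermark."""
    fresh = [r for r in rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in fresh), default=watermark)
    return fresh, new_watermark

fresh, wm = incremental_load(source_rows, datetime(2021, 1, 1))
```

The scheduler persists `wm` between runs so each daily job picks up exactly where the last one stopped.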
• Managed resources and scheduling across the cluster using Azure Kubernetes Service (AKS), which creates, configures, and manages a cluster of virtual machines.
• Used Kubernetes extensively to handle the online and batch workloads that feed analytics and machine learning applications.
• Used Azure DevOps and VSTS (Visual Studio Team Services) for CI/CD, Active Directory for
authentication and Apache Ranger for authorization.
• Analyzed source data with Informatica PowerCenter Designer to extract and transform data from various source systems (Oracle 10g, DB2, SQL Server, and flat files), incorporating business rules through the objects and functions the tool supports.
• Created mappings and mapplets in Informatica PowerCenter to transform data according to business rules.
• Tuned the Informatica mappings for optimal load performance.
• Tuned Spark applications (batch interval, level of parallelism, memory) to improve processing time and efficiency.
• Used Scala for its strong concurrency support, which plays a key role in parallelizing processing of large datasets.
• Developed MapReduce jobs in Scala, compiled to JVM bytecode, for data processing.
• Proficient in utilizing data for interactive Power BI dashboards and reporting purposes based on business
requirements.
Environment: Azure HDInsight, Databricks (ADBX), DataLake (ADLS), CosmosDB, MySQL, Snowflake,
MongoDB, Teradata, Ambari, Flume, VSTS, Tableau, PowerBI, Azure DevOps, Ranger, Informatica,
Azure AD, Git, Blob Storage, Data Factory, Data Storage Explorer, Scala, Hadoop 2.x (HDFS,
MapReduce, Yarn), Spark v2.0.2, Airflow, Hive, Sqoop, HBase.
Responsibilities:
• Performed Big Data analysis using Pig and Hive, with a working understanding of Sqoop and Puppet.
• Created YAML files for each data source, including Glue table stack creation.
• Worked extensively on AWS Components such as Airflow, Elastic Map Reduce (EMR), Athena, Snowflake.
• Extensive experience with Hadoop ecosystem components such as Hadoop, MapReduce, HDFS, HBase, Hive, Sqoop, Pig, ZooKeeper, and Flume.
• Developed a Python script to hit REST APIs and extract data to AWS S3.
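The core of such a script is a pagination loop that pulls batches from the API and hands each one to an uploader. A sketch with the fetch and upload callables injected, so it stays testable offline; a real script would fetch with `urllib`/`requests` and upload with `boto3`, and every name below is hypothetical:

```python
def extract_pages(fetch_page, upload, page_size=100):
    """Pull pages from a REST API via fetch_page(page, size) and hand each
    batch to upload(key, records), e.g. an S3 put. Stops on an empty page."""
    page = 0
    total = 0
    while True:
        records = fetch_page(page, page_size)
        if not records:
            break
        upload(f"batch-{page}.json", records)
        total += len(records)
        page += 1
    return total

# In-memory stand-ins for the API and the S3 bucket (illustrative only).
data = list(range(250))

def fetch_page(page, size):
    return data[page * size:(page + 1) * size]

uploaded = {}
total = extract_pages(fetch_page, uploaded.__setitem__)
```

Injecting the two callables keeps the loop unit-testable and makes swapping in real network and S3 clients a one-line change each.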
• Ingested data through cleansing and transformation steps, leveraging AWS Lambda, AWS Glue, and Step Functions.
• Expertise in modeling HANA calculation views, extending CDS views for external data consumption, designing flow graphs, and using the SAP Predictive Analysis Library to build machine learning models.
• Developed real-time SLA monitoring dashboards in Tableau for Kafka message loads into SAP HANA.
• Proposed an automated system using shell scripts to run the Sqoop jobs.
• Developed Spark jobs using Scala for faster real-time analytics and used Spark SQL for querying.
• Acquired data from REST APIs (JSON); wrangled data with Python and Unix tools; segmented and organized data from disparate sources and loaded it into Google BigQuery.
• Implemented data ingestion and cluster handling for real-time processing using Kafka.
• Good knowledge of AWS CloudFormation templates; configured the SQS service through the Java API to send and receive information.
• Designed a data analysis pipeline in Python, using Amazon Web Services such as S3, EC2 and Elastic Map
Reduce.
• Designed and implemented a Sqoop incremental job to read data from DB2 and load Hive tables, connecting Tableau through HiveServer2 to generate interactive reports.
• Used Sqoop to channel data between RDBMS sources and HDFS.
• Extensive experience importing and exporting data using stream-processing platforms like Flume and Kafka.
• Enhanced HANA costing models (SAP Purchasing/Order Management).
• Spun up different AWS instances, including EC2-Classic and EC2-VPC, using CloudFormation templates.
• Deep understanding of Big Data analytics and algorithms using Hadoop, MapReduce, NoSQL, and distributed computing tools.
• Developed Oozie workflow schedulers to run multiple Hive and Pig jobs that run independently, triggered by time and data availability.
• Experienced in troubleshooting errors in HBase Shell/API, Pig, Hive, and MapReduce.
• Developed Spark applications using PySpark and Spark-SQL for data extraction, transformation and
aggregation from multiple file formats.
• Used HBase/Phoenix to support front-end applications that retrieve data using row keys.
• Developed a strategy for Full load and incremental load using Sqoop.
• Developed custom UDFs in Java to extend Hive and Pig Latin functionality.
• Worked on a Python script to extract data from Netezza databases and transfer it to AWS S3.
• Developed Lambda functions with assigned IAM roles to run Python scripts from various triggers (SQS, EventBridge, SNS).
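The shape of an SQS-triggered Lambda can be sketched as a plain handler function; the payload fields below are hypothetical, and a real handler would also use `boto3` clients, which this sketch deliberately avoids so it runs standalone:

```python
import json

def handler(event, context=None):
    """Hypothetical Lambda entry point for an SQS trigger: each record's
    body is a JSON string; return the parsed payloads."""
    payloads = []
    for record in event.get("Records", []):
        payloads.append(json.loads(record["body"]))
    return payloads

# Trimmed-down shape of the event Lambda delivers for an SQS trigger.
sample_event = {
    "Records": [
        {"body": json.dumps({"id": 1, "action": "ingest"})},
        {"body": json.dumps({"id": 2, "action": "ingest"})},
    ]
}

out = handler(sample_event)
```

EventBridge and SNS triggers deliver differently shaped events, so each trigger type typically gets its own small parsing branch.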
• Imported documents into HDFS and HBase and created HAR files.
• Imported real-time data into Hadoop using Kafka and implemented the zombie runner job for daily imports.
• Good knowledge of and experience with AWS concepts such as EMR and EC2; successfully loaded files to HDFS from Oracle, SQL Server, Teradata, and Netezza using Sqoop.
• Created HiveQL queries on HBase tables and imported work order data into Hive tables efficiently.
• Migrated MapReduce programs into Spark transformations using Spark and Scala.
• Involved in converting Hive/SQL queries into Spark transformations using Spark RDD, Scala and Python.
• Using Hadoop on Cloud service (Qubole) to process data in AWS S3 buckets.
• Executed parameterized Pig, Hive, Impala, and UNIX batches in production.
Environment: Big Data, Hadoop, Oracle, PL/SQL, Scala, Spark SQL, PySpark, Python, Kafka, SAS, SQL, Oozie, SSIS, T-SQL, ETL, HDFS, Cosmos, AWS, ZooKeeper, Hive, HBase.
Client: Hyundai AutoEver America, Fountain Valley, CA — March 2020 – August 2021
Role: Big Data Engineer
Responsibilities:
Environment: Hadoop, MapReduce, Hortonworks, HDFS, Hive, SQL, Cloudera Manager, Pig, Apache Sqoop, Spark, Oozie, HBase, AWS, PL/SQL, MySQL, and Windows.
Responsibilities:
Environment: Hortonworks, Hadoop, HDFS, Pig, Sqoop, Hive, Oozie, ZooKeeper, NoSQL, HBase, Shell Scripting, Scala, Spark, Spark SQL.