
Saumya Shikha

Big Data Engineer

SUMMARY
Around 9 years of IT experience in analysis, design, development, implementation, maintenance, and support, including developing strategic methods for deploying big data technologies to efficiently solve Big Data processing requirements.
Strong working knowledge of systems that handle massive amounts of data running in highly distributed mode on the Cloudera and Hortonworks Hadoop distributions and Amazon AWS.
Hands-on experience using Hadoop ecosystem components such as Hadoop, Hive, Pig, Sqoop, HBase, Cassandra, Spark, Spark Streaming, Spark SQL, Oozie, Zookeeper, Kafka, Flume, the MapReduce framework, YARN, Scala, and Hue.
Wrote complex HiveQL queries to extract required data from Hive tables and authored Hive User Defined Functions (UDFs) as needed.
Good knowledge of Spark architecture and components; efficient with Spark Core, Spark SQL, and Spark Streaming, with expertise in building PySpark and Spark-Scala applications for interactive analysis, batch processing, and stream processing.
Proficient in converting Hive/SQL queries into Spark transformations using DataFrames and Datasets.
Worked on HBase to load and retrieve data for real-time processing using the REST API.
Solid understanding of job workflow scheduling and locking tools/services such as Oozie, Zookeeper, Airflow, and Apache NiFi.
Able to work coherently in both GCP and AWS clouds in parallel.
Strong knowledge of ETL methods for data extraction, transformation, and loading in corporate-wide ETL solutions and data warehouse tools for reporting and data analysis.
Hands-on experience with Google Cloud services such as BigQuery, GCS buckets, and Cloud Functions.
Experience configuring Spark Streaming to receive real-time data from Apache Kafka and store the stream data to HDFS, and expertise in using Spark SQL with various data sources such as JSON, Parquet, and Hive (a minimal PySpark sketch follows this summary).
Extensively used the Spark DataFrame API on the Cloudera platform to perform analytics on Hive data and used DataFrame operations to perform required validations of the data.
Proficient in Python scripting; worked with statistical functions in NumPy, visualization with Matplotlib, and Pandas for organizing data.
Experience using Kafka and Kafka brokers with Spark to process live streaming data.
Experienced in working with Amazon Web Services (AWS), using EC2 for compute and S3 for storage.
Capable of using AWS utilities such as EMR, S3, and CloudWatch to run and monitor Hadoop and Spark jobs on Amazon Web Services (AWS).
Ingested data into the Snowflake cloud data warehouse using Snowpipe.
Extensive experience with micro-batching to ingest millions of files into Snowflake as they arrive in the staging area.
Worked on developing Impala scripts for extraction, transformation, and loading of data into the data warehouse.
Experience importing and exporting data with Sqoop between HDFS and relational database systems.
Experienced in designing time-driven and data-driven automated workflows using Oozie.
Skilled in using Kerberos, Azure AD, Sentry, and Ranger for authentication and authorization.
Designed UNIX shell scripts to automate deployments and other routine tasks.
Extensive experience developing Bash, T-SQL, and PL/SQL scripts.
Proficient in relational databases such as Oracle, MySQL, and SQL Server. Extensive experience working with and integrating NoSQL databases including DynamoDB, Cosmos DB, MongoDB, Cassandra, and HBase.
Extensive knowledge of the Azure cloud platform (HDInsight, Data Lake, Databricks, Blob Storage, Data Factory, Synapse, SQL, SQL DB, DWH, and Data Storage Explorer).
Hands-on experience with visualization tools such as Tableau and Power BI.
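
The Kafka-to-HDFS pattern referenced above can be summarized in a short PySpark Structured Streaming sketch. This is a minimal, hedged illustration rather than code from any specific engagement; the broker address, topic name, and HDFS paths are placeholders.

```python
# Minimal PySpark Structured Streaming sketch: consume a Kafka topic and
# persist the raw events to HDFS as Parquet. Broker, topic, and paths are
# hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
          .option("subscribe", "events-topic")                # placeholder topic
          .option("startingOffsets", "latest")
          .load()
          .select(col("key").cast("string"), col("value").cast("string")))

query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/raw/events")                  # placeholder path
         .option("checkpointLocation", "hdfs:///checkpoints/events")
         .outputMode("append")
         .start())

query.awaitTermination()
```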
Technical Skills

AWS environment: EMR, EC2, EBS, RDS, S3, Athena, Glue, Elasticsearch, Lambda, SQS, DynamoDB, Redshift, ECS, QuickSight, Kinesis
Azure environment: Azure Databricks, Azure Data Lake, Blob Storage, Azure Data Factory, SQL Database, SQL Data Warehouse, Cosmos DB, AAD, Azure Batch
Scripting Languages: Python, PySpark, SQL, Scala, Shell, PowerShell, HiveQL
Databases: Snowflake, MySQL, Oracle, Teradata, MS SQL Server, PostgreSQL, DB2
NoSQL Databases: HBase, DynamoDB
Big Data Ecosystem: HDFS, YARN, MapReduce, Spark, Kafka, Hive, Airflow, StreamSets, Sqoop, HBase, Flume, Ambari, Oozie, Zookeeper, NiFi, Apache Hadoop, Cloudera CDP, Hortonworks HDP
Others: Jenkins, Tableau, Power BI, Grafana
SDLC Methodologies: Agile, Waterfall, Hybrid

Education:

Bachelor's in Computer Science from West Bengal University.

Project Experience

Client: Johnson and Johnson – New Brunswick, NJ


Role: Sr Big Data Engineer
Duration: January 2022 - Present

Responsibilities:

Handled importing of data from various data sources, performed transformations using Hive, MapReduce, and Spark, and loaded the data into HDFS.
Understanding of the AWS product and service suite, primarily EC2, S3, VPC, Lambda, Redshift, Spectrum, Athena, EMR (Hadoop), and related monitoring services, including their applicable use cases, best practices, implementation, and support considerations.
Created functions and assigned roles in AWS Lambda to run Python scripts, and built AWS Lambda functions in Java to perform event-driven processing. Created Lambda jobs and configured roles using the AWS CLI.
Automated ETL tasks and data workflows for the ingest data pipeline through the UC4 scheduling tool.
Experience in change implementation, monitoring, and troubleshooting of AWS Snowflake databases and cluster-related issues.
Assisted with the analysis of data used for Tableau reports and the creation of dashboards.
Designed and implemented large-scale distributed solutions in AWS.
Analyzed and developed programs by considering the extract logic and data load type, building Hadoop ingest processes with relevant tools such as Sqoop, Spark, Scala, Kafka, and Unix shell scripts.
Developed UDFs in Java as needed for use in Pig and Hive queries.
Automated cloud deployments using Chef, Python, and AWS CloudFormation templates.
Optimized MapReduce jobs to use HDFS efficiently by applying various compression mechanisms.
Migrated an entire Oracle database to BigQuery and built data pipelines in Airflow on GCP for ETL jobs using different Airflow operators (a representative DAG sketch follows this list).
Developed automation scripts to transfer data from on-premise clusters to Google Cloud Platform (GCP).
Continuously monitored and managed data pipeline (CI/CD) performance alongside applications from a single console in GCP.
Created and managed cloud VMs with the AWS EC2 command-line clients and the AWS Management Console.
Migrated on-premise database structures to the Confidential Redshift data warehouse. Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
Extracted batch and real-time data from DB2, Oracle, SQL Server, Teradata, and Netezza into Hadoop (HDFS) using Teradata TPT, Sqoop, Apache Kafka, and Apache Storm.
Developed Apache Spark jobs for data cleansing and pre-processing.
Worked on a POC to evaluate various cloud offerings, including Google Cloud Platform (GCP).
Wrote Spark programs to improve the performance and optimization of existing algorithms in Hadoop using the Spark context, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
Designed and built ETL workflows, leading the programming of data extraction from various sources into the Hadoop file system, and implemented end-to-end ETL workflows using Teradata, SQL, TPT, and Sqoop with loads into Hive data stores.
Monitored resources and applications using AWS CloudWatch, including creating alarms on metrics for EBS, EC2, ELB, RDS, S3, and SNS, and configured notifications for alarms generated based on defined events.
Resolved performance issues in Hive and Pig scripts with an understanding of joins, grouping, and aggregation and how they translate to MapReduce jobs.
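
The Oracle-to-BigQuery migration bullet above can be illustrated with a hypothetical Airflow DAG. This is only a sketch of the general pattern (stage extracted files in GCS, load them into BigQuery, then run a transform query); the bucket, dataset, table, and project names are invented, and the operator import paths assume a recent apache-airflow-providers-google package.

```python
# Hypothetical Airflow DAG: load staged Oracle exports from GCS into BigQuery,
# then run a SQL transform. All resource names are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

with DAG(
    dag_id="oracle_to_bigquery_etl",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    load_staging = GCSToBigQueryOperator(
        task_id="load_staging",
        bucket="example-landing-bucket",                 # placeholder bucket
        source_objects=["oracle_export/orders/*.csv"],   # placeholder export files
        destination_project_dataset_table="analytics.staging_orders",
        source_format="CSV",
        write_disposition="WRITE_TRUNCATE",
        autodetect=True,
    )

    transform = BigQueryInsertJobOperator(
        task_id="transform_orders",
        configuration={
            "query": {
                "query": (
                    "SELECT order_id, SUM(amount) AS total "
                    "FROM analytics.staging_orders GROUP BY order_id"
                ),
                "destinationTable": {
                    "projectId": "example-project",
                    "datasetId": "analytics",
                    "tableId": "orders_daily",
                },
                "writeDisposition": "WRITE_TRUNCATE",
                "useLegacySql": False,
            }
        },
    )

    load_staging >> transform
```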

Environment: RHEL, HDFS, Map-Reduce, Hive, GCP, AWS, EC2, S3, Lambda, Redshift, Pig, Sqoop, Oozie, Teradata,
Oracle SQL, UC4, Kafka, GitHub, Hortonworks data platform distribution, Spark, Scala.

Client: Ally Bank - Charlotte, NC


Role: Big Data Engineer
Duration: November 2020 - December 2021

Responsibilities:
Worked on developing Kafka producers and consumers, Cassandra clients, and PySpark jobs integrating with HDFS and Hive.
Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
Experience with the Snowflake cloud data warehouse and AWS S3 buckets for integrating data from multiple source systems, including loading nested JSON-formatted data into Snowflake tables (a minimal sketch follows this list).
Participated in the development, improvement, and maintenance of Snowflake database applications.
Built ETL pipelines in and out of the data warehouse using a combination of Python and Snowflake's SnowSQL, writing SQL queries against Snowflake.
Consulted on Snowflake data platform solution architecture, design, development, and deployment, focused on bringing a data-driven culture across the enterprise.
Experience migrating existing databases from on-premise to AWS Redshift using various AWS services.
Wrote a package containing several procedures and functions in PL/SQL to handle sequencing issues.
Experience in designing, architecting, and implementing scalable cloud-based web applications using AWS.
Completed a POC on Redshift Spectrum to create external tables over S3 files.
Created, modified, and executed DDL on AWS Redshift tables to load data.
Performance-tuned tables in Redshift.
Reviewed explain plans for SQL queries in Redshift.
Automated cloud deployments using Chef, Python, and AWS CloudFormation templates.
Experience in Linux Bash shell scripting and following PEP guidelines in Python.
Worked with backend Python automation, CI pipelines, Docker, and cloud provisioning/automation.
Experience developing web services (WSDL, SOAP, and REST) and consuming web services with Python.
Good experience using shell scripting for automation while following Python PEP guidelines.
Created IAM policies for delegated administration within AWS and configured IAM users, roles, and policies to grant fine-grained access to AWS resources.
Improved infrastructure design and approaches for different projects on the Confidential Web Services (AWS) cloud platform by configuring security groups, Elastic IPs, and storage on S3 buckets.
Participated in planning, implementation, and growth of the customer's Confidential Web Services (AWS) foundational footprint.
Proficient with deployment and management of AWS services, including but not limited to VPC, Route 53, ELB, EBS, EC2, and S3.
Worked on analyzing the Hadoop cluster using different big data analytic tools, including Flume, Pig, Hive, HBase, Oozie, Zookeeper, Sqoop, Spark, and Kafka.
Provided solutions to AWS-specific challenges such as s3-dist-cp 503 Slow Down errors, EMRFS sync errors, Hive metastore inconsistencies, RDS MySQL timeout errors, and queries with long initialization times.
Developed highly scalable classifiers and tools by leveraging machine learning, Apache Spark, and deep learning.
Configured the above jobs in Airflow.
Helped develop a validation framework using Airflow for the data processing.
Extensive working knowledge and experience in building and automating processes using Airflow.
Worked with version control systems such as Subversion, Perforce, and Git to provide a common platform for all developers.
Experience in designing and developing POCs in Spark using Scala to compare the performance of Spark with Hive and SQL/Oracle.
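
One common pattern for the nested JSON loads mentioned above is to stage the files and COPY them into a table with a VARIANT column. The sketch below is a hedged illustration using the snowflake-connector-python library; the connection parameters, file path, and table name are placeholders, not values from this engagement.

```python
# Hedged sketch: land nested JSON in Snowflake by staging a local file and
# running COPY INTO a VARIANT column. All identifiers are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    user="ETL_USER",                 # placeholder credentials
    password="********",
    account="xy12345.us-east-1",     # placeholder account locator
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="RAW",
)

try:
    cur = conn.cursor()
    # Single VARIANT column to hold the nested JSON documents.
    cur.execute("CREATE TABLE IF NOT EXISTS RAW_EVENTS (payload VARIANT)")
    # Upload the local file to the table's internal stage.
    cur.execute("PUT file:///tmp/events.json @%RAW_EVENTS AUTO_COMPRESS=TRUE")
    # Let Snowflake parse the staged JSON into the VARIANT column.
    cur.execute(
        "COPY INTO RAW_EVENTS FROM @%RAW_EVENTS "
        "FILE_FORMAT = (TYPE = 'JSON' STRIP_OUTER_ARRAY = TRUE)"
    )
finally:
    conn.close()
```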

Environment: Apache Spark, Kafka, Cassandra, MongoDB, Databricks, Flume, YARN, Sqoop, Oozie, Hive, Pig, Java, Hadoop distribution of Cloudera 5.4/5.5, Linux, XML, Eclipse, MySQL, AWS

Client: TriWest Healthcare - Phoenix, AZ


Role: Big Data Engineer
Duration: May 2019 - October 2020

Responsibilities:
Created Spark clusters and configured high-concurrency clusters using Azure Databricks to speed up the preparation of high-quality data.
Implemented Spark SQL to load JSON data; frequently used collect() and coalesce() for faster processing of data.
Created data sources to load data into SQL Server (staging database) before and after performing cleansing on the extract tables.
Developed Spark applications using Scala and Java and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
Experienced in writing Spark applications in Scala and Python (PySpark).
Developed Spark programs using the Scala and Java APIs and performed transformations and actions on RDDs.
Deep knowledge of healthcare payer processes and EDI infrastructures.
Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics); ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
Used Azure API Management to maintain on-premises API services with policies.
Developed Azure Function Apps as API services to communicate with the database.
Azure Cosmos DB development and usage.
Involved in the build and Azure deployment of Function Apps from Visual Studio.
Hands-on experience with Azure PaaS; worked on various areas of Azure such as Azure Active Directory, App Services, Azure SQL, and Azure storage offerings like CDN and Blob.
Worked extensively on Azure Active Directory and on-premise Active Directory.
Worked with continuous integration/continuous delivery using tools such as Jenkins, Git, Ant, and Maven; created workflows in Jenkins and worked on the CI/CD model setup using Jenkins.
Developed Python scripts to sync data from GCP Spanner to Azure and monitored the jobs using Airflow.
Responsible for performing various transformations such as sort, join, aggregation, and filter to retrieve various datasets using Apache Spark.
Used Python for SQL/CRUD operations in the database and for file extraction, transformation, and generation.
Developed Spark applications in Python (PySpark) on a distributed environment to load huge numbers of CSV files with differing schemas into Hive ORC tables (a minimal sketch follows this list).
Transformed and analyzed the data using PySpark and Hive, based on ETL mappings.
Developed PySpark programs, created DataFrames, and worked on transformations.
Analyzed the SQL scripts and redesigned them using PySpark SQL for faster performance.
Provided guidance to the development team working on PySpark as the ETL platform.
Created and maintained the development operations pipeline and systems, including continuous integration, continuous deployment, code review tools, and change management systems.
Provided production support and architecture for Big Data Hadoop and Cassandra.
Experienced in developing Spark applications using the Spark Core, Spark SQL, and Spark Streaming APIs.
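
A minimal PySpark sketch of the CSV-to-Hive-ORC load described above is shown below. It assumes schema inference plus selection of a common column set as the reconciliation strategy; the storage path, database, and column names are illustrative only, not actual project values.

```python
# Minimal PySpark sketch: load CSV drops with differing schemas into a Hive
# ORC table. Paths, database, and column names are illustrative placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("csv-to-hive-orc")
         .enableHiveSupport()
         .getOrCreate())

raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv("abfss://landing@exampleaccount.dfs.core.windows.net/claims/*.csv"))

# Reconcile differing schemas by keeping only a known common column set.
common_cols = ["claim_id", "member_id", "service_date", "amount"]  # illustrative
curated = raw.select(*[c for c in common_cols if c in raw.columns])

(curated.write
 .mode("append")
 .format("orc")
 .saveAsTable("curated_db.claims_orc"))
```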

Environment: Confidential, Azure, MongoDB, GCP, Hadoop, Snowflake, Python, Pig, Hive, Oozie, NoSQL, Sqoop, Flume, HDFS, HBase, Map-Reduce, MySQL, Hortonworks, Impala, Cassandra DB, IBM WebSphere, Tomcat.

Client: Virtusa - Bangalore, India


Role: Big Data Engineer
Duration: January 2016 - September 2018

Responsibilities:
Responsible for building scalable distributed data solutions using Hadoop.
Worked on AWS services such as EMR and EC2 for fast and efficient processing of big data, using Amazon S3 as the data lake and Amazon Redshift as the data warehouse.
Implemented and maintained the monitoring and alerting of production and corporate servers/storage using CloudWatch.
Experience extracting source data from sequential files, XML files, and CSV files, then transforming and loading it into the target data warehouse.
Areas of expertise include analytics, design, data warehouse modeling, development, implementation, maintenance, migration, and production support of large-scale enterprise data warehouses.
Designed and developed high-quality integration solutions using the Denodo virtualization tool (reading data from multiple sources including Oracle, Hadoop, and MySQL).
Analyzed the data by performing Hive queries and running Pig scripts to understand user behavior.
Performed cluster balancing and performance tuning of Hadoop components such as HDFS, Hive, Impala, MapReduce, and Oozie workflows.
Created the dashboard framework using Tableau and optimized it using open-source Google optimization tools.
Involved in publishing various kinds of live, interactive data visualizations, dashboards, reports, and workbooks from Tableau Desktop to Tableau Server.
Extensively participated in translating business needs into business intelligence reporting solutions by ensuring the correct selection of the toolset available across the Tableau BI suite.
Handled importing of data from various data sources using Sqoop, performed transformations using Hive and MapReduce, and loaded the data into HDFS.
Good exposure to Spark SQL, Spark Streaming, and the core Spark API, exploring Spark features to build data pipelines (a minimal Spark SQL sketch follows this list).
Developed a data pipeline using Flume, Pig, and Sqoop to ingest cargo data and customer histories into HDFS for analysis.
Configured Sqoop and developed scripts to extract data from MySQL into HDFS.
Designed, developed, and implemented solutions with data warehouse, ETL, data analysis, and BI reporting technologies.
The custom file system plugin allows Hadoop MapReduce programs, HBase, Pig, and Hive to work unmodified and access files directly.
Used Pig as an ETL tool to perform transformations, event joins, filters, and some pre-aggregations before storing the data in HDFS.
Designed and implemented a MapReduce-based large-scale parallel relation-learning system.
Set up and benchmarked Hadoop/HBase clusters for internal use.
Used Oozie as the workflow engine and Falcon for job scheduling; debugged and resolved technical issues and errors.
Analyzed data with Hive, Tez, and Spark SQL and compared the results across Tez and Spark SQL.
Wrote Hive queries to structure the log data in tabular format, facilitating effective querying for business analytics.
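
A short, hedged Spark SQL sketch of the kind of pipeline step described above (querying Hive data and persisting an aggregate) is shown below; the database, table, and output path are illustrative placeholders.

```python
# Hedged Spark SQL sketch: aggregate a Hive table and write the result to HDFS
# as Parquet. Database, table, and path names are illustrative placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cargo-analytics")
         .enableHiveSupport()
         .getOrCreate())

daily_volume = spark.sql("""
    SELECT shipment_date, origin, COUNT(*) AS shipments
    FROM logistics.cargo_events        -- illustrative Hive table
    GROUP BY shipment_date, origin
""")

(daily_volume.write
 .mode("overwrite")
 .parquet("hdfs:///warehouse/reports/daily_cargo_volume"))
```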

Environment: Hadoop, Map-Reduce, Tableau, AWS, EMR, HBase, NIFI, Hive, Impala, Pig, Sqoop, HDFS, Flume, Oozie, Spark, Spark SQL, Spark Streaming, Scala, Cloud Foundry, Kafka, and Confidential.

Client: Adobe - Bangalore, India.


Role: Hadoop Developer
Duration: June 2014 - December 2015

Responsibilities:
Involved in frequent meetings with clients to gather business requirements and convert them into technical specifications for the development team.
Imported and exported data between HDFS and the Oracle database using Sqoop.
Imported data using Sqoop to load data from Oracle into HDFS on a regular basis.
Wrote Hive queries for data analysis to meet the business requirements (a minimal sketch follows this list).
Created Hive tables and worked on them using HiveQL. Experienced in defining job flows.
Involved in creating Hive tables, loading the data, and writing Hive queries that run internally as MapReduce jobs. Developed a custom file system plugin for Hadoop so it can access files on the Data Platform.
The custom file system plugin allows Hadoop MapReduce programs, HBase, Pig, and Hive to work unmodified and access files directly.
Used Pig as an ETL tool to perform transformations, event joins, filters, and some pre-aggregations before storing the data in HDFS.
Designed and implemented a MapReduce-based large-scale parallel relation-learning system.
Set up and benchmarked Hadoop/HBase clusters for internal use.
Loaded the aggregated data into DB2 for reporting on the dashboard.
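
As a hedged illustration of the Hive analysis work above, the sketch below runs a HiveQL aggregation from Python using the PyHive library; the HiveServer2 host, table, and column names are placeholders rather than details from this project.

```python
# Hedged sketch: execute a HiveQL aggregation over data landed in HDFS (for
# example via Sqoop) through HiveServer2. All identifiers are placeholders.
from pyhive import hive

conn = hive.connect(host="hive-server.example.com", port=10000, database="default")
try:
    cur = conn.cursor()
    cur.execute(
        "SELECT category, COUNT(*) AS cnt "
        "FROM sales_events "          # placeholder table imported via Sqoop
        "GROUP BY category"
    )
    for category, cnt in cur.fetchall():
        print(category, cnt)
finally:
    conn.close()
```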

Environment: Hadoop, MapReduce, HDFS, Hive, Java, HBase, DB2, MS Office, Windows

Client: Nitchbit - Bangalore, India.


Role: Software Engineer
Duration: July 2013 - May 2014

Responsibilities:
Strong experience configuring the backend using various Spring framework features such as Spring MVC, Spring Boot, Spring ORM, and Spring Security.
Good experience with Core Java, advanced Java programming, J2EE, JSP, Struts, SQL queries, database programming, OOP, object-oriented analysis and design, relational databases, and SQL.
Created forms to collect and validate data from the user in HTML and JavaScript.
Used Spark for interactive queries, processing of streaming data, and integration with popular NoSQL databases for huge volumes of data.
Used Webpack for bundling React, live-server, Babel, and minifiers, and for generating dependency graphs for web application development.
Extensive experience working with the Spring 2.5/3.0 framework, the Struts framework, O/R mapping with the Hibernate 3.x framework, and web services (SOAP and RESTful).

Environment: Core Java, Advanced JAVA Programming, J2EE JSP, Struts, SQL Queries, HTML, JavaScript, AJAX,
CSS, JSON, jQuery, XML, JSON, Photoshop, Jira, Agile, SQL, Windows.
