Santosh Goud - Senior AWS Big Data Engineer
Email: santoshgoud2526@gmail.com
Contact: 737-372-2211
LinkedIn:
OVERALL SUMMARY:
TECHNICAL SKILLS:
Big Data Tools: Hadoop Ecosystem - MapReduce, Spark, Airflow, NiFi, HBase, Hive, Pig, Sqoop, Kafka, Oozie
Methodologies: RAD, JAD, System Development Life Cycle (SDLC), Agile
Cloud Platform: AWS (Amazon Web Services), Microsoft Azure
Cloud Management: Amazon Web Services (AWS) - EC2, EMR, S3, Redshift, Lambda, Athena
Data Modeling Tools: Erwin Data Modeler, ER Studio v17
Programming Languages: SQL, PL/SQL, and UNIX shell scripting.
OLAP Tools: Tableau, SSAS, Business Objects, and Crystal Reports 9
Databases: Oracle 12c/11g, Teradata R15/R14.
ETL/Data warehouse Tools: Informatica 9.6/9.1, and Tableau.
Operating System: Windows, Unix, Sun Solaris
PROJECT EXPERIENCE:
Worked on ingesting data through cleansing and transformation steps, leveraging AWS Lambda, AWS Glue, and Step Functions.
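A minimal sketch of how such an ingestion run might be kicked off from Python with boto3; the state machine ARN and payload keys are hypothetical placeholders, not the actual pipeline:

```python
# Minimal sketch: start an ingestion workflow with boto3.
# The state machine ARN and payload keys are hypothetical placeholders.
import json
import boto3

sfn = boto3.client("stepfunctions")

def start_ingestion(bucket: str, key: str) -> str:
    """Start the Step Functions workflow that runs the Glue cleansing job."""
    response = sfn.start_execution(
        stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:ingest-pipeline",
        input=json.dumps({"bucket": bucket, "key": key}),
    )
    return response["executionArn"]
```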
Involved in conducting JAD sessions to identify the source systems and data needed by Actimize-SAM
(KYC/CIP).
Assisted with FATCA testing using internal software to ensure that proper controls were in place for the new regulation.
Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java and Scala for data cleaning and preprocessing.
Created a Lambda deployment function and configured it to receive events from S3 buckets.
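A minimal sketch of the handler side, assuming the standard S3 event notification payload; the downstream processing call is a hypothetical placeholder:

```python
# Minimal sketch of an S3-triggered Lambda handler; bucket/key handling only.
# The downstream processing call is a hypothetical placeholder.
import urllib.parse

def lambda_handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        print(f"Received object s3://{bucket}/{key}")
        # process_object(bucket, key)  # hypothetical downstream step
    return {"status": "ok"}
```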
Wrote UNIX shell scripts to automate the jobs and scheduled them as cron jobs via crontab.
Developed various Mappings with the collection of all Sources, Targets, and Transformations using
Informatica Designer
Installed and configured Hive; wrote Hive UDFs and used MapReduce and JUnit for unit testing.
Wrote MapReduce programs and Hive UDFs in Java.
Developed Mappings using Transformations like Expression, Filter, Joiner, and Lookup for better data massaging and to migrate clean, consistent data.
Developed report layouts for Suspicious Activity and Pattern analysis under AML regulations
Prepared and analyzed AS-IS and TO-BE views of the existing architecture and performed Gap Analysis. Created workflow scenarios, designed new process flows, and documented the business process and various business scenarios and activities from the conceptual to the procedural level.
Migrated data from on-premises systems to AWS storage buckets.
Created Spark code to process streaming data from the Kafka cluster and load it into a staging area for processing.
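A minimal PySpark sketch of this kind of Kafka-to-staging stream, assuming Spark Structured Streaming with the Kafka source; the topic name and paths are hypothetical:

```python
# Minimal sketch: read a Kafka topic with Spark Structured Streaming and
# land raw records in a staging area. Topic name and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-staging").getOrCreate()

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "events")                 # hypothetical topic
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary; cast to strings before staging.
events = raw.select(col("key").cast("string"), col("value").cast("string"))

query = (
    events.writeStream.format("parquet")
    .option("path", "hdfs:///staging/events")          # hypothetical path
    .option("checkpointLocation", "hdfs:///chk/events")
    .start()
)
query.awaitTermination()
```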
Created data pipelines for business reports and processed streaming data using the on-premises Kafka cluster.
Processed data from Kafka topics and displayed the real-time streams in dashboards.
Developed a Python script to transfer data from on-premises systems to AWS S3.
Developed a Python script to call REST APIs and extract data to AWS S3.
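A minimal sketch of such an extract script, assuming the requests library alongside boto3; the endpoint, bucket, and key are hypothetical:

```python
# Minimal sketch: pull JSON from a REST endpoint and land it in S3.
# The endpoint URL, bucket, and key are hypothetical placeholders.
import json
import boto3
import requests

def extract_to_s3(url: str, bucket: str, key: str) -> None:
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail fast on HTTP errors
    boto3.client("s3").put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(response.json()).encode("utf-8"),
    )

extract_to_s3("https://api.example.com/v1/records", "raw-zone", "records/2020-01-01.json")
```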
Analyzed business requirements and employed Unified Modeling Language (UML) to develop high-level
and low-level Use Cases, Activity Diagrams, Sequence Diagrams, Class Diagrams, Data-flow Diagrams,
Business Workflow Diagrams, Swim Lane Diagrams, using Rational Rose
Worked with senior developers to implement ad-hoc and standard reports using Informatica, Cognos, MS
SSRS and SSAS.
Joined various tables in Cassandra using Spark and Scala and ran analytics on top of them.
Implemented partitioning, dynamic partitions, and buckets in Hive for efficient data access.
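A minimal HiveQL sketch (issued here through PySpark) of a partitioned, bucketed layout of the kind described; the table and column names are hypothetical:

```python
# Minimal sketch: a partitioned, bucketed Hive table plus a dynamic-partition
# insert. Table and column names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-layout").enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS txns_bucketed (
        txn_id STRING,
        amount DOUBLE
    )
    PARTITIONED BY (txn_date STRING)
    CLUSTERED BY (txn_id) INTO 16 BUCKETS
    STORED AS ORC
""")

# Allow dynamic partitions so each txn_date lands in its own partition.
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
    INSERT OVERWRITE TABLE txns_bucketed PARTITION (txn_date)
    SELECT txn_id, amount, txn_date FROM txns_raw
""")
```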
Thorough understanding of various modules of AML including Watch List Filtering, Suspicious Activity
Monitoring, CTR, CDD, and EDD.
Performed ETL from multiple sources such as Kafka, NiFi, Teradata, and DB2 using Spark on Hadoop.
Used Spark Streaming to receive real-time data from Kafka and stored the stream data in HDFS and NoSQL databases such as HBase and Cassandra using Python.
Collected data from AWS S3 buckets in near real time using Spark Streaming, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
Developed Spark scripts by writing custom RDDs in Scala for data transformations and performed actions on RDDs.
Responsible for building scalable distributed data solutions in an Amazon EMR cluster environment.
Developed Java MapReduce programs to analyze sample log files stored in the cluster.
Used Apache NiFi to copy data from the local file system to HDP.
Worked on Dimensional and Relational Data Modeling using Star and Snowflake Schemas, OLTP/OLAP
system, Conceptual, Logical and Physical data modeling using Erwin.
Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation across multiple file formats.
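A minimal PySpark sketch of multi-format extraction and aggregation along these lines; the paths and columns are hypothetical:

```python
# Minimal sketch: extract from two file formats, then aggregate with Spark SQL.
# Paths and column names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multi-format-etl").getOrCreate()

orders = spark.read.parquet("hdfs:///raw/orders")                              # Parquet source
customers = spark.read.option("header", "true").csv("hdfs:///raw/customers")   # CSV source

orders.createOrReplaceTempView("orders")
customers.createOrReplaceTempView("customers")

daily_totals = spark.sql("""
    SELECT c.region, o.order_date, SUM(o.amount) AS total_amount
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.region, o.order_date
""")
daily_totals.write.mode("overwrite").parquet("hdfs:///curated/daily_totals")
```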
Created YAML files for each data source, including Glue table stack creation.
Worked on a Python script to extract data from Netezza databases and transfer it to AWS S3.
Developed Lambda functions with IAM roles to run Python scripts, wired to various triggers (SQS, EventBridge, SNS).
Developed highly complex Python and Scala code that is maintainable, easy to use, and satisfies application requirements for data processing and analytics using built-in libraries.
Designed and implemented Sqoop incremental jobs to read data from DB2 and load it into Hive tables, then connected Tableau through HiveServer2 to generate interactive reports.
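A minimal sketch of querying those Hive tables over HiveServer2 from Python, assuming the PyHive library (Tableau connects through the same service); the host and table names are hypothetical:

```python
# Minimal sketch: query a Hive table over HiveServer2, the same service
# Tableau connects through. Host, port, and table name are hypothetical.
from pyhive import hive

conn = hive.Connection(host="hiveserver2.internal", port=10000, database="default")
cursor = conn.cursor()
cursor.execute("SELECT account_id, COUNT(*) FROM db2_incremental GROUP BY account_id")
for account_id, n in cursor.fetchall():
    print(account_id, n)
```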
Used Sqoop to move data between RDBMS sources and HDFS.
Automated data processing with Oozie, including data loading into the Hadoop Distributed File System (HDFS).
Environment: AWS, Spark SQL, Oracle 12c, PL/SQL, MapReduce, Hive, Impala, Scala, Erwin, Java, Big Data, Hadoop, PySpark, Python, Kafka, SAS, MDM, Oozie, SSIS, T-SQL, ETL, HDFS, Cosmos, Pig, Sqoop, MS Access.
Environment: Redshift, DynamoDB, PySpark, EC2, EMR, Glue, S3, Java, Kafka, IAM, PostgreSQL, Jenkins, Maven, AWS CLI, Shell Scripting, Git.
Client: Scotia Bank, New York City, NY Feb 2017 - May 2019
Title: Big Data Engineer
Responsibilities:
Participated in data acquisition with the Data Engineer team to extract clinical and imaging data from several data sources such as flat files and other databases.
Installed and configured Apache Hadoop to test the maintenance of log files in Hadoop cluster.
Installed and configured Hive, Pig, Sqoop, Flume and Oozie on the Hadoop cluster.
Installed Oozie workflow engine to run multiple Hive and Pig Jobs.
Set up and benchmarked Hadoop/HBase clusters for internal use.
Architected and implemented medium- to large-scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
Developed Java MapReduce programs to analyze sample log files stored in the cluster.
Developed simple to complex MapReduce jobs using Hive and Pig.
Developed MapReduce programs for data analysis and data cleaning.
Performed Data Preparation by using Pig Latin to get the right data format needed.
Developed Spark scripts in Python on Azure HDInsight for data aggregation and validation, and verified their performance against MapReduce jobs.
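A minimal PySpark sketch of the aggregation-plus-validation pattern described; the input path, columns, and checks are hypothetical:

```python
# Minimal sketch: aggregate a dataset and run simple validation checks.
# Input path, columns, and thresholds are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("aggregate-validate").getOrCreate()

df = spark.read.parquet("wasbs://clinical@account.blob.core.windows.net/visits")

# Validation: no null patient ids before aggregating.
null_ids = df.filter(F.col("patient_id").isNull()).count()
assert null_ids == 0, f"{null_ids} rows have null patient_id"

summary = df.groupBy("clinic").agg(
    F.count("*").alias("visits"),
    F.avg("duration_min").alias("avg_duration"),
)
summary.show()
```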
Extracted and loaded data into the Azure Data Lake environment using Sqoop, where it was accessed by business users.
Primarily involved in Data Migration process using Azure by integrating with GitHub repository and
Jenkins.
Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
Utilized the clinical data to generate features to describe the different illnesses by using LDA Topic
Modelling.
Utilized Waterfall methodology for team and project management.
Used Git for version control with Data Engineer team and Data Scientists colleagues.
Built machine learning models to showcase big data capabilities using PySpark and Spark MLlib.
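A minimal Spark MLlib sketch of the LDA topic-modelling approach mentioned above; the input path, column names, and hyperparameters are hypothetical:

```python
# Minimal sketch: LDA topic modelling over clinical notes with Spark MLlib.
# Input path, column names, and hyperparameters are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("lda-topics").getOrCreate()

notes = spark.read.parquet("wasbs://clinical@account.blob.core.windows.net/notes")

tokens = Tokenizer(inputCol="note_text", outputCol="words").transform(notes)
cv_model = CountVectorizer(inputCol="words", outputCol="features", vocabSize=5000).fit(tokens)
vectorized = cv_model.transform(tokens)

lda = LDA(k=10, maxIter=20)          # 10 illness-related topics, hypothetical
model = lda.fit(vectorized)
model.describeTopics(5).show(truncate=False)
```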
Enhancing Data Ingestion Framework by creating more robust and secure data pipelines.
Implemented data streaming capability using Kafka and Talend for multiple data sources.
Worked with multiple storage formats (Avro, Parquet) and databases (Hive, Impala, Kudu).
Processed the image data through the Hadoop distributed system using MapReduce, then stored it in HDFS.
Used Scala to store streaming data in HDFS and implemented Spark for faster data processing.
Developed the Apache Storm, Kafka, and HDFS integration project to do real-time data analysis.
Created Session Beans and controller Servlets for handling HTTP requests from Talend
Performed data visualization and designed dashboards with Tableau, and generated complex reports including charts, summaries, and graphs to interpret the findings for the team and stakeholders.
Wrote documentation for each report including purpose, data source, column mapping, transformation,
and user group.
Recreated existing application logic and functionality in the Azure Data Lake, Data Factory, SQL Database, and SQL Data Warehouse environment.
Used Windows Azure SQL Reporting Services to create reports with tables, charts, and maps.
Populated HDFS and PostgreSQL with huge amounts of data using Apache Kafka.
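A minimal sketch of the producer side of such a Kafka feed, assuming the kafka-python client; the broker address and topic are hypothetical:

```python
# Minimal sketch: publish records to a Kafka topic that downstream consumers
# load into HDFS and PostgreSQL. Broker address and topic are hypothetical.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker1:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for record in [{"id": 1, "amount": 42.0}, {"id": 2, "amount": 17.5}]:
    producer.send("transactions", value=record)

producer.flush()  # make sure everything is delivered before exiting
```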
Working knowledge of cluster security components like Kerberos, Sentry, SSL/TLS etc.
Involved in the development of agile, iterative, and proven data modeling patterns that provide flexibility.
Knowledge of implementing JILs to automate jobs in the production cluster.
Troubleshot users' analysis bugs (JIRA and IRIS tickets).
Worked with SCRUM team in delivering agreed user stories on time for every Sprint.
Worked on analyzing and resolving the production job failures in several scenarios.
Implemented UNIX scripts to define the use case workflow and to process the data files and automate the
jobs.
Environment: Hadoop, Microservices, Java, MapReduce, Agile, HBase, JSON, Spark, Kafka, JDBC, AWS EMR/EC2/S3, Hive, Pig, Flume, ZooKeeper, Impala, Sqoop.
Client: Axis Bank, Mumbai, India Sep 2012 - Nov 2013
Role: Data Engineer
Responsibilities:
Researched and recommended a suitable technology stack for Hadoop migration, considering the current enterprise architecture.
Responsible for building scalable distributed data solutions using Hadoop.
Experienced in loading and transforming large sets of structured, semi-structured, and unstructured data.
Experienced in developing Spark scripts for data analysis in both Python and Scala.
Wrote Scala scripts to make Spark Streaming work with Kafka as part of Spark-Kafka integration efforts.
Built on-premises data pipelines using Kafka and Spark for real-time data analysis.
Created reports in Tableau for visualization of the created data sets and tested Spark SQL connectors.
Implemented complex Hive UDFs to execute business logic within Hive queries.
Developed Spark jobs and Hive Jobs to summarize and transform data.
Developed different kinds of custom filters and handled pre-defined filters on HBase data using the API.
Implemented Spark using Scala, utilizing DataFrames and the Spark SQL API for faster data processing.
Handled importing data from different sources into HDFS using Sqoop, performed transformations using Hive, and then loaded the curated data back into HDFS.
Exported result sets from Hive to MySQL using the Sqoop export tool for further processing.
Collected and aggregated large amounts of log data and staged it in HDFS for further analysis.
Experienced in managing and reviewing Hadoop log files.
Used Sqoop to transfer data between relational databases and Hadoop.
Worked on HDFS to store and access huge datasets within Hadoop.
Good hands-on experience with GitHub.
Environment: Cloudera Manager, HDFS, Sqoop, Pig, Hive, Oozie, Spark SQL, Tableau, MySQL, Python, Kafka, Flume, Java, Scala, Git.