
Surya Gorrepati | Sr Data Engineer

PROFESSIONAL SUMMARY:
 Around 10 years of experience as a Data Engineer and Python Developer with a proven track record in orchestrating, optimizing, and maintaining data pipelines utilizing cutting-edge technologies such as Apache Spark, Kafka, and AWS services.
 Proficient in architecting large-scale data solutions, with expertise in AWS Redshift, Snowflake, and real-time data processing systems.
 Adept at leveraging Informatica, Teradata, and Apache Airflow for streamlined data integration and orchestration, with a strong foundation in data warehousing and ETL processes.
 Committed to delivering high-quality solutions, with a deep understanding of data engineering practices and methodologies and experience in Agile environments.
 Big Data/Hadoop, data analysis, and data modeling professional with applied information technology experience.
 Strong experience working with HDFS, MapReduce, Spark, Hive, Sqoop, Flume, Kafka, Oozie, Pig, and HBase.
 IT experience in Big Data technologies, Spark, and database development.
 Good experience with Amazon Web Services (AWS) offerings such as EMR and EC2, which provide fast, efficient processing for Teradata and Big Data analytics workloads.
 Solid expertise in cloud platforms, including AWS (IAM, S3, EC2, etc.), and strong programming skills in Python and Java.
 Experience in Apache Spark, Spark Streaming, Spark SQL, and NoSQL databases such as HBase, Cassandra, and MongoDB.
 Established and executed a Data Quality Governance Framework, including an end-to-end process and data quality framework for assessing whether data is suitable for its intended purpose.
 Proficiency in Big Data practices and technologies such as HDFS, MapReduce, Hive, Pig, HBase, Sqoop, Spark, and Kafka.
 Experience in implementing security practices within Airflow, including user authentication, access controls, and
encryption, ensuring data privacy and compliance.
 Extensive experience in loading and analyzing large datasets with the Hadoop framework (MapReduce, HDFS, Pig, Hive, Flume, Sqoop, Spark, Impala, Scala) and NoSQL databases like MongoDB, HBase, and Cassandra.
 Integrated Kafka with Spark Streaming for real-time data processing (an illustrative sketch follows this summary).
 Strong experience in the analysis, design, development, testing, and implementation of Business Intelligence solutions using data warehouse/data mart design, ETL, BI, and client/server applications, including writing ETL scripts with regular expressions and tools such as Informatica and Pentaho.
 Orchestrated intricate ETL processes with Airflow, ensuring seamless execution and monitoring of tasks within defined
schedules.
 Implemented best practices for logging, monitoring, and alerting to maintain high availability and reliability of data pipelines.
 Deep expertise in advanced SQL, data modeling, and distributed data processing frameworks like Spark.
 Expertise in transforming business requirements into analytical models, designing algorithms, and building models spanning data mining, data acquisition, data preparation, data manipulation, feature engineering, and machine learning algorithms.
 Experience with data analytics, data reporting, ad-hoc reporting, graphs, scales, PivotTables, and OLAP reporting.
 Skilled in data parsing, manipulation, and preparation, including methods for describing and profiling data contents.
 Experienced in writing complex SQL queries, including stored procedures, triggers, joins, and subqueries.
 Proficiency in Business Intelligence tools, with a preference for Power BI, facilitating the creation of insightful and visually appealing data visualizations for effective communication and decision-making.
 Extensive experience in generating data visualizations using R, Python and creating dashboards using tools like Tableau.
 Experience designing data models and ensuring they align with business objectives.
 Extensive working experience with Databricks for data engineering and analytics.
 Skilled in crafting efficient data models for seamless integration and reporting in Tableau, as well as Python scripting, PySpark, and SQL for data manipulation and cleansing.
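
The sketch below (referenced from the Kafka/Spark Streaming bullet above) is a minimal, illustrative PySpark Structured Streaming job that consumes a Kafka topic and parses its JSON payload. The broker address, topic name, and payload schema are assumed placeholders rather than details from any specific engagement, and the Spark Kafka connector package must be available on the classpath.

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

# Assumed schema of the JSON payload carried in each Kafka message value.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("amount", DoubleType()),
])

# Read from Kafka; the broker address and topic name are placeholders.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "events")
       .load())

# Parse the binary value column into typed fields and stream to the console.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), event_schema).alias("e"))
          .select("e.*"))

query = events.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
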
TECHNICAL SKILLS:
Big Data Systems: Amazon Web Services (AWS), Azure, Google Cloud Platform (GCP), Cloudera Hadoop, Hortonworks Hadoop, Apache Spark, Spark Streaming, Apache Kafka, Pig, Hive, Amazon S3, AWS Kinesis
Databases: Cassandra, HBase, DynamoDB, MongoDB, BigQuery, SQL, Hive, MySQL, Oracle, PL/SQL, RDBMS, AWS Redshift, Amazon RDS, Teradata, Snowflake
Programming & Scripting: Python, R, Scala, PySpark, SQL, Java, Bash
Web Programming: HTML, CSS, JavaScript, XML
ETL Data Pipelines: Apache Airflow, Sqoop, Flume, Apache Kafka, DBT, Pentaho, SSIS
Visualization: Tableau, Power BI, QuickSight, Looker
Cloud Platforms: AWS, GCP, Azure
Scheduler Tools: Apache Airflow, Azure Data Factory, AWS Glue, Step Functions
Spark Framework: Spark API, Spark Streaming, Spark Structured Streaming, Spark SQL
CI/CD Tools: Jenkins, GitHub, GitLab
Operating Systems: Windows, Linux, Unix, Mac OS X

PROFESSIONAL EXPERIENCE:

WELLS FARGO, REMOTE AUG 2023 – PRESENT


SENIOR AWS DATA ENGINEER
RESPONSIBILITIES:
 Orchestrated the setup, maintenance, and optimization of data pipelines utilizing Kafka and Spark, facilitating seamless
data flow.
 Devised efficient data models to facilitate data loading into Snowflake, integrating with other DIM tables for streamlined
reporting in Tableau.
 Demonstrated proficiency in architecting and managing large-scale data solutions using AWS Redshift and AWS EMR,
alongside deploying real-time data processing systems utilizing AWS Kinesis and AWS Lambda.
 Engineered Shell scripts to facilitate data transfer from MySQL/EDW server to HDFS via Sqoop functionality.
 Created Databricks Job workflows for extracting data from SQL server, followed by uploading files to SFTP utilizing Spark
and Python.
 Utilized GCP services like BigQuery, Cloud Storage, and Dataproc to build and orchestrate data pipelines, capitalizing on cloud scalability.
 Crafted Python (PySpark) scripts for custom UDFs, enabling manipulation, merging, aggregation, and cleaning of data (see the sketch at the end of this list).
 Played a vital role in data cleaning and transformation within Airbyte, ensuring data readiness for analysis and reporting.
 Actively involved in identifying and addressing data integration issues within Airbyte.
 Employed macros in DBT to streamline loading data from ADLS Gen2 to the Snowflake data warehouse, alongside conducting schema tests and data quality tests.
 Interpreted requirements and modeled attributes from diverse source systems like Oracle, Teradata, and CSV files,
employing Informatica and Teradata utilities for staging, integration, and validation before loading into Teradata
Warehouse.
 Leveraged Apache Airflow and Jenkins tools for configuring and optimizing AWS EC2 instances to enhance data workload
processing efficiency.
 Used dynamic cache memory and index cache to amplify the performance of the Informatica server.
 Designed and implemented intricate mappings to extract data from various sources including flat files, RDBMS tables, and
legacy systems.
 Engineered incremental jobs to fetch data from DB2 and load it into Hive tables, subsequently connecting to Tableau for interactive report generation using HiveServer2.
 Designed and developed a data pipeline using Talend Big Data and Spark to ingest data from diverse sources into the Hadoop/Hive data lake.
 Downloaded BigQuery data into pandas or Spark DataFrames within Databricks to enable advanced ETL capabilities.
 Developed PySpark and Spark-SQL applications for data extraction, transformation, and aggregation across multiple file
formats.
 Integrated AWS DynamoDB with AWS Lambda for storing and backing up values of items from DynamoDB streams.
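
As referenced in the custom UDF bullet above, the following is a minimal sketch of a PySpark UDF used to clean a column before aggregation. The column names and toy data are hypothetical, included only to illustrate the pattern.

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col, sum as sum_
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-cleaning-sketch").getOrCreate()

@udf(returnType=StringType())
def normalize_region(value):
    # Trim whitespace, upper-case region codes, and map blanks to UNKNOWN.
    if value is None or not value.strip():
        return "UNKNOWN"
    return value.strip().upper()

# Toy input standing in for a real source table.
df = spark.createDataFrame(
    [(" east ", 120.0), ("West", 75.5), (None, 10.0)],
    ["region", "sales"],
)

# Clean the region column with the UDF, then aggregate sales per region.
cleaned = df.withColumn("region", normalize_region(col("region")))
cleaned.groupBy("region").agg(sum_("sales").alias("total_sales")).show()
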

Environment: AWS (EC2, S3, EBS, ELB, RDS, SNS, SQS, VPC, IAM, CloudFormation, CloudWatch, ELK Stack), Ansible, Python, Shell Scripting, PowerShell, Git, Jira, JBoss, Bamboo, SnapLogic, Docker, WebLogic, GCP, Maven, WebSphere, Unix/Linux, AWS X-Ray, DynamoDB, Kinesis, Snowflake, DBT, Data Modeling, Data Warehouse, Power Query, Splunk, SonarQube, Java, Databricks, Bitbucket, Kafka, Spark, Lambda, Hadoop, Tableau, Hive, SQL, Oracle, scheduling tools.

SPECTRUM HEALTH, REMOTE SEPT 2022 – AUG 2023


SENIOR DATA ENGINEER
RESPONSIBILITIES:
 Conceptualized and executed data pipelines, ETL, data warehouses, and reporting systems using a suite of technologies
including Python, Spark, Dataflow, Airflow, Snowflake, and Databricks to facilitate data-driven advertising solutions.
 Developed data engineering pipeline that significantly reduced processing time by 50%, resulting in substantial cost
savings.
 Spearheaded the design and deployment of a real-time data processing system, boosting data processing speed by 80%
and enabling timely decision-making.
 Implemented robust data quality checks and monitoring systems, leading to a 35% decrease in data errors and enhancing
overall data accuracy.
 Developed Spark programs in Python, applying principles of functional programming to efficiently process complex
structured datasets.
 Applied Agile methodologies in the design and development of ETL applications and data processing workflows, ensuring
flexibility and adaptability.
 Engineered infrastructure as code solutions using CDK, facilitating efficient management of AWS resources for GenAI
applications.
 Led the design and development of high-performance data architectures supporting data warehousing, real-time ETL, and
batch big data processing.
 Developed and optimized complex SQL queries, stored procedures, and triggers for relational database management systems (RDBMS), ensuring optimal data retrieval and manipulation.
 Utilized Hadoop infrastructure to store data in HDFS and leveraged Spark/Hive SQL to migrate SQL codebases to AWS, converting Hive/SQL queries into Spark transformations using Spark RDDs and PySpark.
 Integrated Boto3 into data engineering pipelines for seamless interaction with AWS services, enhancing retrieval and
generation processes for GenAI applications.
 Exported tables from Teradata to HDFS using Sqoop and constructed tables in Hive, handling large sets of structured,
semi-structured, and unstructured data using Hadoop/Big Data concepts.
 Leveraged Spark SQL to load JSON data, create Schema RDDs, and load them into Hive tables, managing structured data efficiently (see the sketch at the end of this list).
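
As noted in the Spark SQL bullet above, here is a small illustrative sketch of loading JSON with Spark, shaping it with SQL, and persisting the result as a Hive table. The file path, field names, and table name are assumptions, and Hive support must be enabled in the Spark session.

from pyspark.sql import SparkSession

# Hive support must be enabled for saveAsTable to target the Hive metastore.
spark = (SparkSession.builder
         .appName("json-to-hive-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Spark infers the schema from the JSON records (path is a placeholder).
events = spark.read.json("/data/raw/events/*.json")

# Register a temporary view so the data can be shaped with plain SQL.
events.createOrReplaceTempView("raw_events")
daily = spark.sql("""
    SELECT event_date, COUNT(*) AS event_count
    FROM raw_events
    GROUP BY event_date
""")

# Persist the aggregate as a managed Hive table (database/table names assumed).
daily.write.mode("overwrite").saveAsTable("analytics.daily_event_counts")
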

Environment: IBM InfoSphere DataStage 9.1/11.5, Oracle 11g, Flat files, Snowflake, Autosys, GCP, UNIX, Erwin, TOAD, MS SQL Server database, XML files, AWS, MS Access database.

BEST BUY, SAN JOSE, CA JUN 2020 – AUG 2022


ROLE: DATA ENGINEER
RESPONSIBILITIES:
 Engaged in the analysis, design, and implementation of business user requirements.
 Employed Python scripting and Spark SQL for gathering extensive datasets.
 Managed both structured and unstructured datasets.
 Automated data ingestion using Beam, Kafka, and Debezium to capture changes from databases like Oracle and MySQL.
 Created Sqoop scripts to handle importing and exporting relational data, including incremental loads of customer and transaction data.
 Proficiently utilized Avro and Parquet files, with a focus on converting data and parsing semi-structured JSON to Parquet using DataFrames in Spark (see the sketch at the end of this list).
 Utilized LSTM and RNN for developing deep learning algorithms.
 Utilized GCP services like BigQuery, Cloud Storage, and Dataproc to build and orchestrate data pipelines, capitalizing on cloud scalability.
 Analyzed and improved relevant data stored in Snowflake using PySpark and Spark SQL.
 Implemented Spring Security to safeguard against SQL injection and manage user access privileges, while also applying
various Java and J2EE design patterns like DAO, DTO, and Singleton.
 Conducted data analysis and profiling using intricate SQL queries across diverse source systems, including Oracle 10g/11g
and SQL Server 2012.
 Led initiatives to optimize and fine-tune big data jobs and SQL queries, resulting in a significant reduction in processing
time by 45%.
 Developed Spark applications for tasks such as data validation, cleansing, transformations, and custom aggregations.
 Stored time-series transformed data from the Spark engine, built on a Hive platform, onto Amazon S3 and Redshift.
 Spearheaded the deployment of multi-clustered environments using AWS EC2 and EMR, and incorporated Docker for
versatile deployment solutions.
 Executed specific data processing and statistical techniques, including sampling, estimation, hypothesis testing, time
series analysis, correlation, and regression analysis, utilizing R.
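
As mentioned in the Avro/Parquet bullet above, the following is a brief sketch, under assumed paths and field names, of converting semi-structured JSON to Parquet with Spark DataFrames, flattening one nested struct along the way.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("json-to-parquet-sketch").getOrCreate()

# Read semi-structured JSON; Spark infers nested struct fields automatically.
orders = spark.read.json("/data/landing/orders/*.json")

# Flatten a nested customer struct into top-level columns (field names assumed).
flat = orders.select(
    col("order_id"),
    col("customer.id").alias("customer_id"),
    col("customer.segment").alias("customer_segment"),
    col("order_total"),
)

# Write columnar Parquet, partitioned to help downstream query pruning.
(flat.write
 .mode("overwrite")
 .partitionBy("customer_segment")
 .parquet("/data/curated/orders_parquet"))
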

Environment: Python, SQL Server, Oracle, HDFS, HBase, AWS, MapReduce, Hive, Impala, Pig, Sqoop, NoSQL, Tableau, RNN, LSTM, Unix/Linux, Core Java.

ADOBE, LEHI, UTAH APRIL 2018 – NOV 2019


ROLE: PYTHON DEVELOPER
RESPONSIBILITIES:
 Managed, designed, developed, and instituted a dashboard control panel for more than 1,000 customers and administrators using Oracle DB, PostgreSQL, and VMware API calls.
 Implemented a login module covering user registration, product registration, and order placement and tracking, which resulted in a 15% increase in overall departmental process efficiency.
 Worked on the backend of the application in Django using Python to create APIs and maintain the databases.
 Developed views and templates with the Django view controller and template language to create a user-friendly web interface.
 Maintained large databases, configured servers, and collaborated on reducing software maintenance expenses, decreasing costs by 10% within one year.
 Strategically planned and executed automated systems (Talend Cloud, DBT, Snowflake DB, Power BI) for extracting
complex business insights.
 Implemented user interface guidelines and standards throughout the development and maintenance of the website using HTML5, JavaScript, and AngularJS.
 Worked on REST API calls and integration with the UI; used AngularJS to develop components for the application team.
 Wrote Python scripts to parse XML documents and load the data into the database.
 Used the pandas API to organize data in time-series and tabular formats for timestamp-based manipulation and retrieval (see the sketch at the end of this list).
 Ensured high-quality data collection and maintained the integrity of the data.
 Participated in code reviews to maintain codebase integrity and adherence to coding standards.
 Implemented automated testing using frameworks such as pytest to ensure code quality and reliability.
 Optimized application performance by identifying and resolving bottlenecks.
 Collaborated with cross-functional teams to gather requirements and translate them into technical specifications.
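
As referenced in the pandas bullet above, this is a minimal sketch of treating timestamped records as a pandas time series for retrieval and resampling; the sample values are made up for illustration.

import pandas as pd

# Toy frame of timestamped readings standing in for real application data.
df = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2019-01-01 09:00", "2019-01-01 09:30", "2019-01-02 10:00"]
        ),
        "orders": [12, 7, 20],
    }
)

# Index by timestamp so slicing and resampling work on time ranges.
ts = df.set_index("timestamp").sort_index()

# Retrieve one day of data, then aggregate to daily totals.
one_day = ts.loc["2019-01-01"]
daily_totals = ts["orders"].resample("D").sum()

print(one_day)
print(daily_totals)
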

Environment: Python, Django, Pandas, REST API, HTML, CSS, JavaScript, AngularJS, Oracle DB, PostgreSQL, Python MySQL connector.

YNOT CREATIVES, INDIA OCT 2016 – MARCH 2018


ROLE: DATA ANALYST
RESPONSIBILITIES:
 Collaborated with a team of analysts and associates to validate lineage and transformations from source to target,
ensuring data integrity and accuracy throughout the process.
 Utilized Tableau integrated with R to generate comprehensive reports for internal teams and external clients, leveraging
mathematical models and statistical techniques to identify valuable data insights.
 Employed R for data exploration and cleaning, leveraging packages like ggplot2 and Plotly for visualization, and dplyr for
data wrangling.
 Successfully decoded SAS scripts and migrated them to Python programming for various reports, demonstrating
proficiency in Python development.
 Defined Key Performance Indicators (KPIs) to assess the efficacy of business decisions, translating business requirements
into feasible technical solutions with clear acceptance criteria to facilitate effective coding.
 Developed presentations, authored, and reviewed reports based on recommendations and findings, and delivered status
updates to senior management, ensuring clear communication of project progress and insights.
 Transformed business cases into live solutions through close collaboration with cross-functional groups, providing them
with tailored solutions to address their needs.
 Created new stored procedures to handle complex business logic and modified existing stored procedures, functions, views, and tables to support new project enhancements and resolve existing defects.
 Performed data validation and cleansing of staged input records before loading OLE DB and flat file sources into the SQL Server database using SSIS packages, and created data mappings to load the data into the warehouse.

Environment: MS SQL Server 2008, SQL Server Business Intelligence Development Studio, R, SAS, Tableau, SSIS 2008, SSRS 2008, Report Builder, Office, Excel, Flat Files, .NET, T-SQL

AKSHARA TECHNOLOGIES, INDIA JULY 2014 – AUG 2016


DATA ANALYST
RESPONSIBILITIES:
 Collected, analyzed, and extracted data from a variety of sources to create reports, dashboards, and analytical solutions, and assisted with debugging Tableau dashboards.
 Used Power BI as a front-end BI tool and MS SQL Server as a back-end database to design and create dashboards, workbooks, and complex aggregate computations.
 Cleaned and processed third-party spending data into manageable deliverables in specific formats using Excel macros and Python libraries (see the sketch at the end of this list).
 Extensively used Informatica client tools: PowerCenter Designer, Workflow Manager, Workflow Monitor, and Repository Manager; extracted data from various heterogeneous sources such as Oracle and flat files.
 Worked with Impala for massive parallel processing of queries for ad-hoc analysis.
 Designed and developed complex queries using Hive and Impala for a logistics application.
 Using Informatica, developed sophisticated SQL queries and scripts to extract, aggregate, and validate data from MS SQL, Oracle, and flat files and load it into a single data warehouse repository.
 Configured and monitored resource utilization throughout the cluster using Cloudera Manager, Search, and Navigator.
 Used Apache Flume to collect and aggregate huge volumes of log data, then staged the data in HDFS for later analysis.
 Used relational and non-relational technologies such as SQL and NoSQL to create data warehouse, data lake, and ETL systems for data processing, transforming business demands into technical design documents.
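
As referenced in the data cleaning bullet above, the following is a hedged sketch of cleaning a third-party spend extract with pandas before export; the file names, column names, and sheet layout are assumptions rather than project specifics.

import pandas as pd

# Load the raw third-party extract; the path and column names are placeholders.
raw = pd.read_csv("third_party_spend.csv")

# Standardize column names and coerce the spend amount to a numeric type.
raw.columns = [c.strip().lower().replace(" ", "_") for c in raw.columns]
raw["spend_amount"] = pd.to_numeric(raw["spend_amount"], errors="coerce")

# Drop exact duplicates and rows whose amount could not be parsed.
clean = raw.drop_duplicates().dropna(subset=["spend_amount"])

# Write the deliverable with one sheet per vendor category (requires openpyxl).
with pd.ExcelWriter("spend_deliverable.xlsx") as writer:
    for category, group in clean.groupby("vendor_category"):
        group.to_excel(writer, sheet_name=str(category)[:31], index=False)
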

Environment: Python, PySpark, Kafka, GitLab, PyCharm, Hadoop, AWS S3, Tableau, Hive, Impala, Flume, Apache NiFi, Java, Shell scripting, SQL, Sqoop, Oozie, Oracle, SQL Server, HBase, Power BI, Agile Methodology

Education:

Bachelor's in Electrical and Electronics Engineering – Amrita Vishwa Vidyapeetham - 2014


Master's in Data Science - University of Texas at Arlington - 2021
