
Jim Xiang

Data Engineer | Santa Clara, CA
SUMMARY
· 6+ years of experience in data analysis and data engineering covering the whole data
lifecycle, from data ingestion, wrangling, and modeling to data visualization and insight discovery.
· Data-driven mindset; passionate about diving deep into data and communicating data-based
findings and insights to co-workers and business stakeholders.
· Strong programming skills in Python (NumPy, Pandas, scikit-learn, Seaborn), SQL, Java, R,
Scala, Linux Shell Scripting.
· Hands-on experience with RDBMS such as MySQL and PostgreSQL as well as NoSQL databases such as
HBase, Cassandra, and MongoDB.
· Good experience with big data frameworks including Hadoop, MapReduce, HDFS, YARN,
HBase, and Hive.
· Experience in real-time data streaming with Spark Streaming, Kafka, and Flume.
· Versed in performing ETL tasks, designing data warehouses, and building data pipelines for data
ingestion, aggregation, transformation, grouping, and joins.
· Hands-on experience with cloud platforms such as AWS and GCP, including AWS experience in
EC2, S3, RDS, Elastic Beanstalk, Glue, and CloudWatch, as well as GCP experience deploying
Docker containers and Kubernetes.
· Knowledge of statistics including descriptive statistics, inferential statistics, probability theory,
probability distributions, and Bayesian statistics.
· Modeling experience leveraging machine learning techniques such as regression, classification,
and dimensionality reduction, as well as deep learning techniques such as convolutional neural
networks and recurrent neural networks with TensorFlow and Keras.
· Good understanding of business requirements and strong product sense; familiar with purchase
funnel analysis, fractional attribution analysis, A/B testing design, and strategies such as SEM and
SEO.
· Proficient in BI tools such as Tableau, Power BI, Google Analytics, and Matplotlib, with good
experience creating interactive data-oriented reports and dashboards.

· Effective communication and presentation skills, demonstrated by working with people from
both Engineering and Marketing and by meeting and collaborating with managers, development
teams, and stakeholders.

SKILLS
Programming Languages: Python, SQL, Java, R, Scala, Shell Scripting
Data Wrangling & Visualization: NumPy, Pandas,Tableau, Matplotlib, ggplot, Seaborn
Machine Learning & Deep Learning: scikit-learn, Logistic Regression, Random Forest, K-means
Clustering, Keras, Tensor Flow
Big Data: Hadoop, Hive, Spark
Cloud: AWS, GCP
Deployment & Version Control: Docker, Kubernetes, Heroku, Git
Web Development: Flask, HTML, CSS
WORK EXPERIENCE
Company: Cisco Oct 2019 – Present
Role: Data Engineer Santa Clara, CA

Project description:
The goals of the project are to build data pipelines for networking-product wholesale data,
integrating data ingestion, data transformation, and data persistence, and to work with the
marketing team to deliver business insights to stakeholders.

Responsibilities:
· Design, create, and implement RDBMS and NoSQL databases; build views, indexes, and stored
procedures.
· Model product information and customer features, and build data warehouse solutions to
support BI activities.
· Write SQL queries against RDBMS such as MySQL/PostgreSQL and HiveQL queries against Hive
tables for data extraction and preliminary data analysis.
· Build data pipelines covering data ingestion, data transformation such as aggregation, filtering,
and cleaning, and data storage.
· Ingest data from SQL and NoSQL databases and from multiple data formats such as XML, JSON,
and CSV.
· Ingest real-time customer behavioral data into HDFS using Flume, Sqoop, and Kafka, and
transform it using Spark Streaming.
· Perform ETL operations using Spark in Scala and PySpark in Python, developing in IntelliJ and
PyCharm respectively.
· Implement and execute parallel MapReduce jobs in Java to process log data from the
servers.
· Monitor and health-check the data warehouse, providing cost-effective failover and disaster
recovery solutions.
· Leverage YARN for large-scale distributed data processing, and troubleshoot and resolve Hadoop
cluster performance issues.
· Manage and query data using Spark and handle streaming data with Kafka to ensure data is
transferred and processed in a fast and reliable manner.
· Leverage AWS S3 as the storage solution for HDFS, AWS Glue as the ETL solution, and Amazon
Kinesis as the data streaming solution to deploy the data pipeline in the cloud.
· Migrate the data warehouse from an RDBMS to Amazon Redshift and analyze log data on S3 using
Amazon Athena. Maintain the Hadoop cluster using Amazon EMR.
· Cleanse, manipulate, and wrangle data using Python to eliminate invalid datasets and
reduce prediction error.
· Conduct A/B tests on metrics such as customer retention, acquisition, sales revenue, and volume
growth to assess product performance.
· Leverage Pandas, NumPy, and Seaborn for exploratory data analysis.
· Extend Hive functionality with user-defined functions, including UDFs, UDTFs, and UDAFs.
· Develop predictive models using Python packages such as SciPy and scikit-learn, as well as
mixed-effects models and time series models in R, based on business requirements.
· Perform feature selection and feature extraction using Spark machine learning libraries, including
algorithms such as multivariate regression, K-means clustering, and KNN.
· Carry out dimensionality reduction with PCA and feature engineering with Random Forest to
capture key features for predicting annual sales and the most-purchased products using Python and R.
· Create Hive-integrated Tableau dashboards and reports to visualize the time series of purchase
value, keep track of business metrics, and deliver business insights to stakeholders.
· Work with Git for version control and Maven to build, test, and deploy Java projects.
Technologies: SQL, Python, Scala, Hive, AWS, Machine Learning

Company: Nokia Aug 2018 – Sept 2019
Role: Data Engineer San Jose, CA

Project description:
The goals of the project are to build effective machine learning tools to assist voice recognition and
text punctuation, translating voice into a human-readable format; the data processing for the project
includes data collection, data cleaning, and statistical model development.

Responsibilities:
· Collect 7 million pairs of ‘raw – punctuated’ text data from CSV files for the text cleaning stage of a
speech recognition app.
· Perform data ingestion, transformation, and cleaning utilizing Python NumPy and Pandas.
· Implement and evaluate an RNN that punctuates text data as a post-processor for speech
recognition, using Keras and TensorFlow.
· Integrate the cleaned text data results with a REST API and deploy the API to
AWS Elastic Beanstalk.
· Leverage S3 as the data lake solution and DynamoDB and RDS as database solutions.
· Leverage CloudWatch to monitor product performance. Implement auto-scaling
structures to handle failovers.
· Design and develop data augmentation for synthetic text and voice data.
· Pre-process raw data and conduct data wrangling such as grouping, aggregation, filtering, and
replacing missing values using Python.
· Apply tree-based ensemble algorithms such as XGBoost and AdaBoost for feature extraction and
feature selection.
· Work with ML teams on acoustics and leverage toolkits such as NLTK.
· Build analysis and prediction algorithms for feature correlations and conduct hypothesis
testing to determine significance levels.
· Develop innovative solutions to big data and cloud issues, such as deploying Docker
containers and Kubernetes pods on GCP.

Technologies: Python, NumPy, Pandas, Keras, TensorFlow, AWS


Company: UC Irvine Aug 2015 – June 2018
Role: Data Engineering Researcher Irvine, CA

Project description:
The goals of the project are to develop data warehouse solutions to store and analyze laboratory
equipment data, and to create reports and dashboards through data manipulation for monitoring
key features and analyzing event logs.

Responsibilities:
· Design the data architecture in MySQL for storage of device information, saving 30% of the
manpower required by the legacy database.
· Create REST APIs for model testers to upload and download their test results on a server
integrated with the database.
· Automate ETL procedures and build a data warehouse to keep product and model-tester
information updated.
· Incorporate new device models into the data warehouse, track product versions, and interpret the
reasons for test failures using data analytics tools.
· Perform data transformation and exploratory data analysis using Python and R, and data
visualization using Matplotlib.
· Migrate databases from DB2 to SQL Server using SSIS and design the data warehouse using
FTDW sizing tools.
· Enforce data quality in the data warehouse through data cleansing using SSIS data flow services.
· Perform data analytics using SSAS and produce formatted reports using SSRS.
· Create reports and dashboards using Tableau and Power BI to deliver business insights to managers
and stakeholders.
· Process XML, JSON, and Delta tables, and build ETL data pipelines with dashboards.
· Implement visualization tools to generate daily, weekly, and monthly dashboards from massive
databases to monitor key data features and handle event logging.
· Support applications by reviewing and tuning production-related queries and handling
long-running batch jobs.
· Isolate, debug, and resolve infrastructure problems.

Technologies: MySQL, ETL, Python, R, SSIS, Tableau

EDUCATION
University of California, Irvine, CA, Ph.D. in Structural Engineering
Southwest Jiaotong University, Sichuan, China, B.S. in Structural Engineering

CERTIFICATES
• Deep Learning Specialization on Coursera: Neural Networks and Deep Learning, Improving Deep
Neural Networks, Structuring Machine Learning Projects, Convolutional Neural Networks, and
Sequence Models.
• Algorithms, Part I and Algorithms, Part II, Princeton Online.
• AWS Certified Solutions Architect – Associate
