Joseph Manoj
Data Scientist (Full Stack), Katalyst Labs (www.dataflo.io), Chennai
PROFESSIONAL SUMMARY_________________________________________________________________
• 4+ Years of experience in Data Science and solving business problems in Sales, Marketing, and Retail
• Experience in Data Visualization & Storytelling. Rich experience in Statistics, ML Modelling, and NLP (NER,
Redaction, Document Classification)
• Proficiency in ML Model Development, Productization, the AWS MLOps Framework, GitLab, and Docker
• Experience in AWS Sagemaker, Kinesis, EMR, Glue, S3, RDS, EC2, Lambda, ECS, EKS, Secrets
• Familiar with designing data ingestion pipelines using Azure ADF, Logic App, EventHub, Blob, SQL
Analytics, ADLS, CosmosDB, and Azure Databricks/Spark
• Exposure to big data technologies such as Hadoop, Sqoop, Pig, Hive, NoSQL, Spark, Kafka, Nifi, and Zookeeper
• Involved in the data collection & integration process from different sources into the data warehouse
• Involved in data modeling and data transformation in the Snowflake data warehouse using dbt
• Implemented various regression/classification algorithms using Python/scikit-learn
• Forecasted univariate and multivariate time-series data using various time-series algorithms
• Handled streaming and large volumes of data using Databricks/Spark
• Experience in Agile methodologies, including writing, refining, and monitoring user stories
• Certified as “Microsoft Certified Azure Data Scientist” and “Tableau Certified Data Scientist”
• Contributed to research projects and published 30+ research articles/blogs in the field of Data Science
• Refer to my data science/Machine learning blogs at www.datasigns.info
• Exposure to Quantum Computing (IBM Qiskit, QML, QNLP)
TECHNICAL SKILLS_______________________________________________________________________
• Programming Language: Python 3
• Packages: Pandas, NumPy, Matplotlib, seaborn, Plotly, Scikit-Learn, BeautifulSoup, Pyspark, Flask
• Cloud Services: AWS (Sagemaker, EMR, ECS, EKS, Fargate, Glue, Kinesis), Azure (Databricks, ADF,
EventHub, Logic App, Blob, CosmosDB), GitLab and Docker
• Big Data: Apache Spark, Hadoop, HIVE, Nifi, PIG
• Database: MySQL, PostgreSQL, MongoDB (NoSQL)
• CloudML/AutoML: H2O.ai, MS-Azure ML Studio, AWS Sagemaker
• ML Algorithms: Linear/Logistic Regression, K-Means, Random Forest, SVM, XGBoost, LightGBM, etc.
• DL/NLP: NLTK, spaCy, LSTM, GAN, with exposure to RNN and CNN networks
• Data warehouse: Snowflake; BI Tools: Tableau, MS-Excel
• Statistical Testing: A/B Testing, Hypothesis Testing, and ANOVA
• Quantum Computing: IBM Qiskit, QML, QNLP (Beginner)
WORK EXPERIENCE_______________________________________________________________________
• Data Scientist, Katalyst Labs Pvt Ltd (www.dataflo.io), Chennai Jan 2021 - Present
• Data Scientist, Hexaware Technologies, Chennai Sep.2019 - Dec.2020 (1.4 Years)
• Data Science Mentor, Stigmata Technologies, Chennai (Freelance) Jan 2018 - Sep. 2019 (1.9 Years)
• Prof/Researcher, St. Joseph’s College of Engg, Chennai June 2005 - Sep 2019 (14.4 Years)
• Software Programmer, Swaminathan Networking, Chennai June 2003 - May 2005 (2 Years)
EDUCATION_______________________________________________________________________________
• Ph.D. (CSE), Manonmaniam Sundaranar University, Tirunelveli, Tamil Nadu, India. Year: Mar. 2015
• M.E. (CSE), Sathyabama University, Chennai, Tamil Nadu, India. Year: Apr. 2009
• M.C.A., Manonmaniam Sundaranar University, Tirunelveli, Tamil Nadu, India. Year: Apr. 2002
• B.Sc. (Chem), Manonmaniam Sundaranar University, Tirunelveli, Tamil Nadu, India. Year: Apr. 1999
CERTIFICATIONS__________________________________________________________________________
• Microsoft Azure Certified Data Science Associate, January 2020
• Tableau Certified Data Scientist, May 2020
• AgileKB Certification in Agile Project Management & Delivery, May 2020
PROFESSIONAL EXPERIENCE_______________________________________________________________
Project #1 (Marketing KPIs Anomaly Detection): Anomaly Detection in Marketing KPIs (Google Analytics) –
forecasting anomalies in Google Analytics KPIs such as sessions, sessions/user, and bounce rate using time-series
data. The system raises alerts and helps the marketing team understand website performance and make decisions.
Environment: AWS Sagemaker/Python, S3, Snowflake, Fivetran, dbt, PostgreSQL, scikit-learn
Deployment: AWS Sagemaker/AWS Fargate/ECS/AWS Secrets Manager/AWS CloudWatch, GitLab/Flask
Responsibilities:
• Discussed business use cases and requirements with stakeholders such as external clients, the Product
Manager, and the Delivery Head
• Participated in Agile ceremonies and demonstrated the progress of data science projects
• Supported writing user stories and the story refinement process
• Involved in setting up data pipelines from data sources (Sales & Marketing apps) using Fivetran
• Set up data pipelines with Snowflake using Snowpipe and fetched the data
• Created PostgreSQL tables to store the model inferences/output
• Model selection, writing the source code in AWS Sagemaker, and maintaining the code in GitLab
• Docker image creation and REST API implementation using Flask
• Stored key credentials in AWS Secrets Manager
• Data pipeline and model pipeline implementations using AWS services (Sagemaker, Fargate)
• Data collection, feature engineering/selection, hyperparameter tuning, and model optimization
• Model building, deploying the model, and monitoring the model (MLOps)
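A minimal illustrative sketch (Python) of the kind of KPI anomaly check built in this project, using a rolling z-score on a daily sessions series; the column names, window, threshold, and CSV source are assumptions, not the production Sagemaker pipeline:

# Illustrative sketch only: rolling z-score anomaly flags for a GA KPI series.
# Column names, window size, and threshold are assumptions, not project settings.
import pandas as pd

def flag_anomalies(kpi: pd.Series, window: int = 28, z_thresh: float = 3.0) -> pd.DataFrame:
    # Compare each point to its trailing rolling mean/std and flag large deviations
    rolling_mean = kpi.rolling(window, min_periods=window).mean()
    rolling_std = kpi.rolling(window, min_periods=window).std()
    z = (kpi - rolling_mean) / rolling_std
    return pd.DataFrame({"value": kpi, "z_score": z, "is_anomaly": z.abs() > z_thresh})

# Hypothetical daily sessions series pulled from the warehouse
sessions = pd.read_csv("ga_sessions.csv", parse_dates=["date"], index_col="date")["sessions"]
print(flag_anomalies(sessions).query("is_anomaly"))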
Project #2 (Sales Conversion Prediction): Prediction of the conversion probability of sales leads – fetches live
updates of new leads and applies an ML model to estimate the conversion probability of each new lead, so that
salespeople can focus on high-probability leads and convert them quickly.
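A small sketch (Python/scikit-learn) of how such a lead-conversion probability score could be produced; the feature names, data file, and model family are assumptions for illustration:

# Sketch: scoring sales leads with a conversion probability.
# Hypothetical feature names, CSV source, and gradient-boosting model choice.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

leads = pd.read_csv("leads.csv")  # hypothetical historical leads with a 'converted' label
features = ["num_visits", "emails_opened", "company_size", "days_since_signup"]
X_train, X_test, y_train, y_test = train_test_split(
    leads[features], leads["converted"], test_size=0.2, random_state=42
)

model = GradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train)

# Probability of conversion for each held-out lead; sales can prioritise by this score
scores = pd.Series(model.predict_proba(X_test)[:, 1], index=X_test.index, name="conversion_probability")
print(scores.sort_values(ascending=False).head())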
Project #1 (Amaze-DIF): Hexaware's unique automated application transformation platform. With a high level
of automation and customization capabilities, it can deploy web applications on the cloud seamlessly and make
the application replatforming journey much simpler yet secure and future-proof.
It has now been extended to ingest structured data into Azure cloud storage based on its metadata, with
real-time analytics then performed using NLP and other ML algorithms. It is a generic model that pulls data
from different RDBMS tools such as MySQL and Oracle and pushes the processed data into Azure storage in real time.
Environment: Databricks, PySpark, Azure Synapse, Azure Data Lake, EventHub, Logic App, MySQL
Responsibilities:
• Connecting Azure EventHub & Logic App with the MySQL database and pulling the streaming data
• Processing the consumed data – cleaning and finding errors in the data set
• Finding data matches with existing data and loading the data into Azure Data Lake
• PySpark is used for real-time analytics based on the requirements, and the final version of the data is
stored in Azure Synapse (SQL data warehouse)
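A condensed PySpark Structured Streaming sketch of this ingestion flow; it assumes the Azure Event Hubs Spark connector is installed, and the connection string, schema, and storage paths are placeholders rather than project values:

# Sketch: stream records from Event Hubs, clean them, and land them in Azure Data Lake.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("amaze-dif-ingest").getOrCreate()

# Hypothetical schema for change records published from MySQL to Event Hubs
schema = StructType([
    StructField("table_name", StringType()),
    StructField("payload", StringType()),
    StructField("event_time", TimestampType()),
])

eh_conf = {
    # The Event Hubs connector expects an encrypted connection string
    "eventhubs.connectionString": spark.sparkContext._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(
        "<event-hubs-connection-string>"
    )
}

raw = spark.readStream.format("eventhubs").options(**eh_conf).load()

# Event Hubs delivers the message body as binary; cast it and parse the JSON payload
parsed = (
    raw.select(from_json(col("body").cast("string"), schema).alias("rec"))
       .select("rec.*")
       .filter(col("payload").isNotNull())  # drop malformed/empty records
)

# Land cleaned records in Azure Data Lake as Parquet for downstream Synapse loads
query = (
    parsed.writeStream
          .format("parquet")
          .option("path", "abfss://raw@<account>.dfs.core.windows.net/amaze")
          .option("checkpointLocation", "abfss://raw@<account>.dfs.core.windows.net/_chk/amaze")
          .start()
)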
Project #2 (Document Redaction): The customer, a world-renowned consulting company, expressed interest
in creating a common repository of all consulting documents without violating data security rules &
regulations, in order to share knowledge and best practices across the company. Natural Language
Processing (NLP) subfields such as NER and redaction were applied and deployed successfully.
Environment: Python, spaCy (NER and Document Redaction), GOCR (OCR Tool)
Responsibilities:
• Client interaction for document identification and infrastructure requirements
• Involved in various phases such as data collection, statistical analysis, building the ML model, and optimization
• NLP model testing using Postman. Involved in demonstrations & preparing project deliverables.
Key Achievements:
• Reduced manual work for identifying document content and redacting sensitive data
• More than 50% of manual effort saved by the NLP solution
• Accuracy improved by 89% by finding and removing invalid entities from the documents
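A minimal spaCy sketch of NER-based redaction of the kind described above; the entity labels chosen for redaction and the example sentence are assumptions, and the production solution also handled OCR output:

# Sketch: replace sensitive named entities with a [REDACTED] placeholder using spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed
SENSITIVE_LABELS = {"PERSON", "ORG", "GPE", "DATE", "MONEY"}  # hypothetical redaction policy

def redact(text: str) -> str:
    doc = nlp(text)
    redacted = text
    # Replace entities from the end of the text so earlier character offsets stay valid
    for ent in reversed(doc.ents):
        if ent.label_ in SENSITIVE_LABELS:
            redacted = redacted[:ent.start_char] + "[REDACTED]" + redacted[ent.end_char:]
    return redacted

print(redact("Acme Corp engaged John Smith in London on 5 March 2019."))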
Project #3 (Port Call Cost Prediction): The client is a shipping services company that provides various harbor
services to ships around the world. They issue a baseline value for the various services they provide before a ship
reaches harbor. Since actual values vary enormously from the baseline, they needed a prediction model to predict
the baseline value correctly. The model and user interface were deployed in the AWS cloud.
Environment: Python, Tableau, AWS Cloud Deployment, ML Algorithms: Linear Regression, XGBoost, NodeJS
Responsibilities:
• Use case discussion with the client; understanding the client requirements from the functional document
• Understanding the bus matrix and acquiring data from their data warehouse
• Involved in various phases of the data science life cycle such as data collection, statistical analysis, building the
ML model, optimization, and deployment in Azure cloud/AWS. Model testing in Postman
Key Achievements:
• Helped the client quote the baseline value correctly so that their goodwill with their clients was maintained
• Increased customer retention by 23%, and the quoted baseline value is consistent
• Increased model accuracy from 86% to 91% and involved in model demonstration to the client
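A brief XGBoost regression sketch of the kind of baseline-cost model described; the feature names, data source, and hyperparameters are illustrative assumptions, not the deployed model:

# Sketch: predict the baseline port-call cost with XGBoost regression.
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

calls = pd.read_csv("port_calls.csv")  # hypothetical historical port-call records
features = ["vessel_dwt", "port_code", "service_type", "stay_hours"]
X = pd.get_dummies(calls[features], columns=["port_code", "service_type"])
y = calls["actual_cost"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = XGBRegressor(n_estimators=300, max_depth=6, learning_rate=0.05)
model.fit(X_train, y_train)

print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))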
3
Curriculum Vitae
Project #4 (Loss Prevention Insights): The client is the leading provider of insurance and related risk
management services to the international transport and logistics industry. The objective is to analyze internal and
publicly available data on equipment fire incidents to derive actionable insights on loss prevention for claims.
Environment: Conda, JupyterLab, Azure VM, Azure Blob Storage
Role and Responsibilities:
• End to end solution designing, identifying the tools, technologies, and recommendations
• Work with team members for solution implementation, review, and optimization.
• Collect fire incidents data from the public domain using Web Scraping and Web Crawling, and convert the
collected data to structured form using NLP techniques.
• Merge with internally available data, analyze combined data, and visualize the results.
• Interface with the stakeholders to remove roadblocks, showcase results to clients, and act on feedback
Key Achievements:
• Supported insurance underwriters in handling the claims process or claims issuance quickly
• Considerably reduced the time of the underwriting process
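A short sketch of the public-data collection step via web scraping; the URL, HTML structure, and field names are hypothetical placeholders for the actual incident sources:

# Sketch: scrape public equipment-fire incident reports into a structured table.
import pandas as pd
import requests
from bs4 import BeautifulSoup

URL = "https://example.org/equipment-fire-incidents"  # placeholder source

response = requests.get(URL, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

records = []
for row in soup.select("table.incidents tr")[1:]:      # skip the (hypothetical) header row
    cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
    if len(cells) >= 3:
        records.append({"date": cells[0], "location": cells[1], "summary": cells[2]})

incidents = pd.DataFrame(records)
incidents.to_csv("fire_incidents_raw.csv", index=False)  # later merged with internal claims data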
Project #5 (Document Classification): The customer is a well-known company serving Health Information
Technology and Clinical Research. They receive millions of clinical trial documents every year; the QC team
reviews the documents, checks for missing fields, classifies the documents per the reference model, extracts
metadata, and configures and uploads them to eTMF. They spend an average of 20 mins/doc and 400 FTEs of
manual effort, yet there is still a high number of backlogs and errors in classification.
Environment: Conda, JupyterLab, Pandas, spaCy, GOCR, scikit-learn, TensorFlow
Solution: Provided NLP/ML-based document classification, metadata extraction, validation, and import of
documents into eTMF. The solution resulted in approximately 75% manual effort reduction.
Responsibilities:
• Involved as a Decision Science Community member and participated in the DS Community forum
• Assisted in algorithm selection, technique evaluation, and optimization
• Contributed to the common knowledge repository
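A compact scikit-learn sketch of a TF-IDF based document classifier of the kind used here; the document fields, labels, and file name are illustrative assumptions:

# Sketch: classify clinical trial documents against the reference model with TF-IDF features.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

docs = pd.read_csv("trial_documents.csv")  # hypothetical: extracted text plus a 'doc_class' label
X_train, X_test, y_train, y_test = train_test_split(
    docs["text"], docs["doc_class"], test_size=0.2, random_state=42, stratify=docs["doc_class"]
)

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=3)),
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))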
Project #6 (Risk Leaders Board): The client is a marine/defense/property insurer involved in issuing
cargo insurance to different ship owners. The customer runs their business under the Lloyd's insurance market. The
proposed system allows them to choose the best leaders or laggards in the market for a particular policy.
Environment: H2O.ai (AutoML), Python, PowerBI, ML Algorithms: XGBoost and K-Means
Responsibilities:
• Involved in use case discussion with the client; understanding the client requirements by studying the functional
document. Understanding the bus matrix and acquiring data from their data warehouse
• Involved in various phases of the data science life cycle such as data collection, statistical analysis, building the ML
model, hypothesis testing, Z-score, hyperparameter tuning, and deployment in Azure cloud/AWS
• Reporting to the onsite coordinator about the status of the test activities on a daily basis
• Applied Agile methodology and attended daily stand-ups and sprint meetings
Key Achievements:
• Helped insurance underwriters complete the claims process and new policy issuance decisions quickly
• Reduced the underwriting process by 20%; model accuracy improved up to 93%
• Developed using an AutoML tool and extracted meaningful insights from the historical data
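An illustrative H2O AutoML sketch of the kind of run used for the leaders/laggards model; the file name, column names, and run limits are assumptions:

# Sketch: train candidate models with H2O AutoML and inspect the leaderboard.
import h2o
from h2o.automl import H2OAutoML

h2o.init()
policies = h2o.import_file("policy_history.csv")   # hypothetical historical policy data
train, test = policies.split_frame(ratios=[0.8], seed=42)

target = "is_leader"                                # hypothetical binary label
features = [c for c in policies.columns if c != target]
train[target] = train[target].asfactor()            # treat the label as categorical
test[target] = test[target].asfactor()

aml = H2OAutoML(max_models=20, max_runtime_secs=1800, seed=42)
aml.train(x=features, y=target, training_frame=train)

print(aml.leaderboard.head())                       # compare candidate models
predictions = aml.leader.predict(test)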