
Data Science Product Development

(CSE 679)
Lecture 2 (29th Jan)
Today’s agenda
● Talk about final project.
● Go through a few examples / templates.
● Talk about logistics / deadlines.
● Go through some cloud basics.
● Go through PyTorch and deep learning examples.

2
Project dates
● Groups should be finalized by Monday.
● I’ll randomly assign anyone who doesn’t email me.
● Stage 1 (2% grade), Feb 8: Project statement.
● Stage 2 (8% grade), Feb 29: Project plan and start.

3
Sample Project
● Airplane Reservation System.
● Discussed this in last class.
● What would this look like as a class project?
● Will talk about it as I go over each section of first
deliverable.

4
Stage 1: Vision.
● The DS idea you want to build.
● Target users: people who want to book by voice or SMS.
● Talk about main market players.
● E.g: Kore.AI
● Statistics: value of airline industry, number of tickets per
year, number of customers.
● High level thoughts on viability / feasibility.

5
Stage 1: Vision.
● My chance to give feedback to you quickly.
● Suggest scope narrowing / changes
● Make sure you’re not locked into a project I think is
unviable for a good grade.
● Will allow project modification after this feedback.
● Maximum 1-2 page report.

6
Stage 2: Project Plan and start.
● Set up your data sources.
● Data storage and data querying.
● Commodity ML solutions.
● You should start working on this even before your Feb 8
submission

7
Stage 2: Initial data sources
● Must have good datasets to build models/ideas.
● Most projects will fail due to inadequate, non-representative,
or poor-quality data.
● Your research and search skills are vital.
● Justify and describe these choices.

8
Some ideas for dataset sources.
● Hugging Face has an NLP datasets repository here.
● Kaggle has huge amount of datasets.
● VisualData has a set of vision datasets here.
● AWS and Google have datasets collections here and here
respectively.

Sometimes easier to work from datasets to idea than vice versa.
9
Reservations: data sources
● ATIS: dataset for training intents.
● All live airline flight schedules here.
● Huge dataset of 400K dialogues here.
● An example of how you could work with multiple,
changing datasets.

10
Stage 2: Dataset storage
● Store raw data, query data.
● Store raw dataset in low cost “bucket”.
● Options: Google Bucket, AWS S3 buckets.
● Set up a queryable database to support app.
● PostgreSQL and MS-SQL are amongst the options.

11
Stage 2: Dataset storage
● Set up schema in your DBs. E.g:
○ Subscribed airlines table.
○ Cities table.
○ Reservations / flights table.
○ Customer ids (by phone number?)
○ Inbound messages (can use synthetic)

Set up with real data / schema by the 29th.
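The schema bullets above could be sketched as follows. This uses sqlite3 purely for a self-contained demo (the project itself would use PostgreSQL or MS-SQL), and all table and column names are illustrative:

```python
# Illustrative reservation-app schema; sqlite3 stands in for PostgreSQL/MS-SQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE airlines (          -- subscribed airlines table
    airline_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE cities (            -- cities table
    city_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE flights (           -- reservations / flights table
    flight_id INTEGER PRIMARY KEY,
    airline_id INTEGER REFERENCES airlines(airline_id),
    origin_id INTEGER REFERENCES cities(city_id),
    dest_id INTEGER REFERENCES cities(city_id),
    departs_at TEXT
);
CREATE TABLE customers (         -- customer ids, keyed by phone number
    phone_number TEXT PRIMARY KEY
);
CREATE TABLE reservations (
    reservation_id INTEGER PRIMARY KEY,
    flight_id INTEGER REFERENCES flights(flight_id),
    phone_number TEXT REFERENCES customers(phone_number)
);
CREATE TABLE inbound_messages (  -- can be filled with synthetic messages
    message_id INTEGER PRIMARY KEY,
    phone_number TEXT,
    body TEXT
);
""")
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
```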


12
Stage 2: Commodity ML approach.
● Identify an API you’ll use.
● E.g: here, Google NLP and Google Dialogflow.
● Amazon and Microsoft have APIs as well.
● Focus on “core” DS model / idea.
● Integrate with your data somewhat.
● Summarize performance.

Should be the largest portion of this deliverable on the 29th.


13
Stage 2: Custom ML models.
● Describe models you may explore in more detail - naive
Bayes, transformers, etc.
● Mention a couple of state-of-the-art research papers to
apply, with brief summary.
● Should be able to show some “toy” versions, although
final POC can wait.

This will be the stopping point of the Feb 29 deliverable.
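A "toy" version of a custom intent model, as described above, might look like this naive Bayes sketch with scikit-learn. The utterances and labels are made up for illustration (not actual ATIS data):

```python
# Toy intent classifier: bag-of-words features + multinomial naive Bayes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical ATIS-style utterances with intent labels.
utterances = [
    "book a flight from boston to denver",
    "what flights leave atlanta tomorrow",
    "cancel my reservation to chicago",
    "i want to cancel the denver booking",
]
intents = ["book_flight", "list_flights", "cancel", "cancel"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(utterances, intents)
prediction = model.predict(["cancel my chicago flight"])[0]
print(prediction)
```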


14
Challenges of a class project.
● Probably won’t get “live” customer data.
● Think how you will simulate / collect this.
● Hold back a dataset you can add later, or plan to interact
with it yourself.
● Can add “complexity” in a later part of the project.

E.g: adding policy changes to reservations, or adding noisy data.
15
Reconsider your project if….
● You can’t identify starting data sources.
● No progression of “simple” to “advanced”.
● No data or model updates would ever be needed.
● Getting synthetic or real data is too difficult.
● No commodity ML available whatsoever.

16
Further Stages
● Stage 3, late March.
● Stage 3 is the “meat” of the project.
● Choose custom model.
● Model deployment/delivery.
● Model updates.

17
Further Stages
● Stage 4, early May.
● Stage 4 is the “polish”.
● Optimize throughput.
● Experimental frameworks.
● Interpret models.

18
Stage 3: Explore custom models.
● Show notebooks / experiments trying out different
custom models.
● Set up MLFlow (or KubeFlow) to track model
performance.
● Summarize your results, choice of final model.
● Record your offline experiments through one of these
platforms.

19
Stage 3: Deployment and updates.
● Serve commodity and custom models through docker /
microservice.
● How can you automate / schedule updates to live data?
(E.g, flights available daily)
● How do you update models and track performance?
(E.g, conversation data changes.)
Set up basic workflow / scheduler
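One common way to serve a model through a microservice (and later package it in Docker) is a small Flask app. Here `predict_intent` is a hypothetical placeholder for the real commodity or custom model:

```python
# Minimal model-serving microservice sketch with Flask.
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_intent(text):
    # Placeholder for the real model call; returns a canned intent.
    return "book_flight" if "book" in text.lower() else "other"

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(force=True)
    return jsonify(intent=predict_intent(payload.get("text", "")))
```

The same app can be copied into a Docker image and run behind a scheduler that refreshes its data daily.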

20
Stage 3: Back to the past.
● You may need to revisit your data storage plan from
stage 2.
● May need to collect more data beyond stage 2.
● You should try to generate “adversarial” data, and try to
break/test your DS model performance.

21
Stage 4: Optimizations.
● Set up basic benchmarking of throughput.
● Optimize performance of models.
● Integrate experimental frameworks.
● Simulate experiments.
● Set up interpretability tools for your models.
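Basic throughput benchmarking, as in the first bullet, can start as simply as timing repeated calls; the workload below is a stand-in for a real model inference call:

```python
# Minimal throughput benchmark: time n calls, report calls/sec.
import time

def benchmark(fn, n_calls=1000):
    """Invoke fn n_calls times and return throughput in calls per second."""
    start = time.perf_counter()
    for _ in range(n_calls):
        fn()
    elapsed = time.perf_counter() - start
    return n_calls / elapsed

# Hypothetical stand-in for one model inference.
throughput = benchmark(lambda: sum(range(100)))
print(f"{throughput:.0f} calls/sec")
```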

22
Other ideas and notes
● News feed + friend recommender hybrid.
● Grocery budget tracker + fitness aide.
● Should have a dynamic/changing component, not only a
static dataset.

24
Let’s play with AWS for a bit.
● AWS Educate account requires no credit card.
● Similar patterns for Google Cloud and Azure.
● Set up AWS credentials and boto.

25
Let’s play with AWS for a bit.
● Let’s create S3 buckets and upload a file.
● Similar patterns for downloading a file.
● Can access from any service.
● Useful for “raw” storage.

26
Let’s play with AWS for a bit.
● Let’s create an SQS queue.
● Useful for ML APIs pulling work.
● One form of workflow management.

27
Let’s talk about SQL vs noSQL.
● Both have their place and advantages imo.
● One high level slide deck.
● AWS’s offering: DynamoDB.
● Other services have their own offering - Google has
BigTable, Azure has CosmosDB.

28
Let’s set up a DynamoDB instance.
● DynamoDB basics.
● Let’s work through some code now.

29
What comes next.
● Form groups by Monday 12:00 pm.
● Rest will be in random groups by Tuesday 12:00 pm.
● Vision statement due by February 8, midnight.
● Stage 2 due by February 29 (dataset in cloud + commodity ML
approach in notebooks + reports).
● HW 1 to be released around Feb 12.
● Due by around March 3.
● Do in Google Colab, email me the link or code.

30
