
Lecture 1 - Sofie De Cnudde (ASOS.com)
Large questions

1. Explain the business and data science behind ASOS’ “You might also like”

What is the goal?

1. ASOS provides an enormous catalogue of products, which may result in information
overload for the customer. A personalized recommendation system can resolve this
problem.
2. To increase customer engagement and, as a result, increase earnings, by adding an extra
layer of personalization.

2. How is this done? What are the technical aspects? How was it done previously?

The YMAL carousel is built using matrix factorisation. A matrix is constructed from explicit
feedback and implicit feedback on the users. Explicit feedback could be ratings that a customer
gives on a certain product. Implicit feedback is feedback that isn’t directly given by the customer
but can be inferred from a customer’s actions on the website. After having built the matrix, the
goal is to learn latent factors or embeddings such that products are close to each other if they are
similar, or if a customer is very likely to like a certain product. The user factor matrix U and item
factor matrix M are learnt using alternating least squares (ALS) to minimize the loss. To prevent
overfitting some type of regularization is required.
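A minimal sketch of this idea in Python/numpy (the toy matrix, latent dimension and regularization value are illustrative, not ASOS's actual setup; a production implicit-feedback ALS would also use confidence weights):

```python
import numpy as np

# Toy interaction matrix (users x items), e.g. counts of views/purchases.
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 0, 5, 4]], dtype=float)

n_users, n_items = R.shape
k, reg = 2, 0.1                      # latent dimensions, L2 regularization strength
rng = np.random.default_rng(0)
U = rng.normal(size=(n_users, k))    # user factor matrix
M = rng.normal(size=(n_items, k))    # item factor matrix

for _ in range(20):                  # alternating least squares
    # Fix M, solve a regularized least-squares problem for the user factors.
    U = np.linalg.solve(M.T @ M + reg * np.eye(k), M.T @ R.T).T
    # Fix U, solve for the item factors.
    M = np.linalg.solve(U.T @ U + reg * np.eye(k), U.T @ R).T

print(U @ M.T)                       # reconstructed scores used for ranking items
```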

They take the viewed item and get the product embedding for that product (through the matrix
factorization described above). In the same way they build a user vector for the customer. They
take those embeddings, scale them and combine them with a weighted average to get a
personalized embedding. The major effort for the data scientists here was deciding what weight
to give to the item embedding and what weight to give to the customer embedding.
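A small sketch of that blending step, assuming unit-length scaling as one plausible choice (the vectors and weight are made up for illustration):

```python
import numpy as np

def personalized_embedding(item_vec, user_vec, w_item):
    """Blend item and user embeddings with a single weight w_item in [0, 1]."""
    item_vec = item_vec / np.linalg.norm(item_vec)   # scale to unit length
    user_vec = user_vec / np.linalg.norm(user_vec)
    return w_item * item_vec + (1 - w_item) * user_vec

blended = personalized_embedding(np.array([0.2, 0.9]), np.array([0.7, 0.1]), w_item=0.6)
# Nearest items to `blended` (e.g. by cosine similarity) would fill the YMAL carousel.
```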
What data is used?

First of all, data on the product itself can be used. To make the YMAL more personalized,
explicit and implicit feedback can be used as well: explicit feedback, e.g. ratings that a customer
has given; implicit feedback, i.e. feedback inferred from a customer’s actions such as clicking on
an item, saving it to their favourites or purchasing it. The YMAL could be personalized even
more by taking into account a user vector, the weather, the intent of the user session, the size,
and stock information.

How are the models optimized? Which metric is used?

The models are optimised by choosing the best weights for the different embeddings (as
described above). For the offline evaluation metric they looked at the hit rate at 1, which is how
many customers purchased the first product recommended to them in the carousel. They simply
loop over weights between 0 and 1, use each weight in the weighted average and see which
combination results in the highest hit rate. They can then read the highest hit rate off the plot for
each line (it makes sense that the item embedding gets more weight here, since the user vector
mostly captures style and reaches further back in time).
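A sketch of that offline grid search, assuming a toy `sessions` structure holding the item vector, user vector, candidate item matrix and the index of the purchased item (this data layout is an assumption, not ASOS's pipeline):

```python
import numpy as np

def hit_rate_at_1(w_item, sessions):
    """Fraction of sessions where the top-ranked recommendation was the purchased item.
    `sessions` is a list of (item_vec, user_vec, candidate_matrix, purchased_idx)."""
    hits = 0
    for item_vec, user_vec, candidates, purchased_idx in sessions:
        blended = w_item * item_vec + (1 - w_item) * user_vec
        ranked_first = np.argmax(candidates @ blended)   # highest dot-product score
        hits += int(ranked_first == purchased_idx)
    return hits / len(sessions)

# Offline grid search over the blending weight:
# weights = np.linspace(0, 1, 21)
# best_w = max(weights, key=lambda w: hit_rate_at_1(w, sessions))
```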

How can the models be improved in the next iteration?

Using session data in real-time in the next iteration makes the model more complex. They kept
the previous approach but extended it with an embedding of the user session. They replaced the
weighted-average layer with a Keras neural network that learned the weights for them, optimising
for the maximum hit rate.
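A minimal Keras sketch of replacing the hand-tuned weighted average with a small network that learns the combination itself; the embedding size, layer sizes and the purchase-probability training objective (used here as a differentiable proxy for hit rate) are assumptions:

```python
import tensorflow as tf

emb_dim = 32
item_in = tf.keras.Input(shape=(emb_dim,), name="item_embedding")
user_in = tf.keras.Input(shape=(emb_dim,), name="user_embedding")
sess_in = tf.keras.Input(shape=(emb_dim,), name="session_embedding")

# Learn the combination weights instead of hand-tuning them.
concat = tf.keras.layers.Concatenate()([item_in, user_in, sess_in])
hidden = tf.keras.layers.Dense(64, activation="relu")(concat)
blended = tf.keras.layers.Dense(emb_dim)(hidden)                   # personalized embedding
score = tf.keras.layers.Dense(1, activation="sigmoid")(blended)    # purchase likelihood

model = tf.keras.Model([item_in, user_in, sess_in], score)
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit([item_vecs, user_vecs, session_vecs], purchased_labels, ...)
```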
Small questions (answer in 2 lines):
What is session data and how is it used to improve the predictions at ASOS?

Session data is data that is collected when the customer visits the website and clicks on
different products. This data can be used as an extra layer of personalization.

Give 3 technical issues when trying to predict the size of a customer?


1. People sometimes buy items for other people with a different size: difficult to determine
the user, so difficult to make recommendations
2. EU, UK and US sizing often differ, and so does sizing across item types (jeans, dresses,
etc.): it is unclear what the correct size is for the different types of items to recommend
3. A person might feel insulted if the prediction doesn’t correspond to the reality (size might
be a sensitive thing)

Lecture 2 - Andrew Ng
Large question (answer in 2 pages):

§ Explain the data-centric view and the model-centric view, as well as their
differences, within developing an AI framework. Additionally answer:
• Data-centric means the data should be optimized. Good data will lead to good
predictions. This means that high quality data should be collected. The collected
data also has to be preprocessed so the machine learning algorithms have the best
possible data to work with.
--> “Hold the code fixed and iteratively improve the data”
--> How can I modify my data (new examples, data augmentation, labeling, etc.) to
improve performance?
• Model-centric means that the models used in machine learning should be
improved, instead of spending extra time working on the data.
--> “Hold the data fixed and iteratively improve the code/model”
--> How can I tune the model architecture to improve performance?

o Explain the advantages and disadvantages of both.

Data-centric :
- Advantages : relatively easy to do basic data cleaning, which results in a decent improvement
in performance
- Disadvantages : can be time consuming if a ton of data needs to be relabeled or if new data
needs to be collected.

Model-centric :
- Advantage: No new data is needed.
- Disadvantages : Complicated, most machine learning algorithms are built by researchers and
are continuously being improved. Improving these algorithms is complicated and time
consuming. The results will probably only be marginally better than what the current algorithms
offer.
o What function does the MLOps framework serve within these views?

Still to be resolved.

o What’s the most important role of MLOps, according to Andrew?


“Ensure consistently high-quality data in all phases of the ML project lifecycle”
Good data is :
- Defined consistently (definition of labels y is unambiguous)
- Covers important cases (good coverage of inputs x)
- Has timely feedback from production data (distribution covers data drift and concept drift)
- Sized appropriately

Small questions (answer in 2 lines):


§ What’s the most important role of MLOps, according to Andrew Ng?
o Ensuring consistent high-quality data in all phases of the ML project lifecycle
§ Is it better to expand the dataset, or to clean/improve the dataset, according to Andrew Ng?
o Cleaning/improving the dataset. The claim that large datasets are needed is spread by
large organizations to discourage start-ups.
§ Give three examples of label inconsistencies: for image data, speech data and for location
data.
o Speech data: when someone hesitates while speaking, it is unclear whether the transcription
should contain a comma or an ellipsis (three dots).
o Image data, when labelers indicate objects differently. For example drawing a pixel perfect
outline of an object vs just drawing a box around the object.
o Location data: drawing one box around the location of multiple things, versus drawing
individual boxes around the things.
Lecture 3 – Véronique Van Vlasselaer (Customs fraud detection)
1. Draw an analytical decision process for fraud detection at customs. Incorporate the following
steps:

o Data enriching

o Whitelist/allowed-list

o Blacklist

o Business Knowledge

o Taking action (i.e. intercepting/ inspecting a package)

For each of the processes, mention if it should be done in real-time, or can be done at a different
time (also mention why you would or wouldn’t do this in real-time).

Every time a package comes in, it needs to be checked whether the origin or the
customer is part of a watchlist/blacklist, e.g. merchants that have sent goods with
falsified information are typically on the watchlist.

If the package is not on the watchlist, they check whether it is on the allowed
list/whitelist (e.g. big vendors like Amazon are often completely aligned with
regulations). The watchlist and the whitelist are based on business knowledge already
present in the company.

If the package is on neither the allowed list nor the watchlist, they enrich the data with
details of the customer, details of the merchant and the risk of the country. Based on
the category of the goods, different kinds of predictive models will be applied to the new
data.

Then, based on all this information, they can apply the business rules. For these rules,
but also for building the watchlist and the allowed list, it is important to rely on business
knowledge that is already present in the company and to incorporate this expertise.
Business rule examples: if probability > 0.65 -> inspect, else do not inspect; or if
origin = US -> do not inspect. The result is that each package that comes into the EU
gets a score indicating whether to investigate it (low-medium-high risk), together with a
motivation for why the package should or should not be investigated.
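A minimal illustration of such business rules sitting on top of the model score; the threshold and the origin rule come from the example above, while the field names are hypothetical:

```python
def decide(package):
    """Combine the model score and simple business rules into an inspect/clear decision."""
    if package["origin"] == "US":            # trusted-origin rule from the example
        return "do not inspect"
    if package["fraud_probability"] > 0.65:  # model-score threshold from the example
        return "inspect"
    return "do not inspect"

print(decide({"origin": "CN", "fraud_probability": 0.8}))   # -> inspect
```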
The previous steps can all be done before the package arrives at customs, since the data
needed for the analytical decision process is typically already available a couple of
days beforehand. Not doing these steps in real-time saves a lot of money, because
doing them in real-time can be very expensive and does not add more value than doing
them before the package arrives. However, the action of investigating the package itself
at customs needs to be done in real-time.

2. Should an AI take over the job of human custom inspection? Talk about the ethical
implications, as well as technical drawbacks. What is the most fruitful reconciliation between the
two?

Covered during the Q&A? Still to be resolved.


Small question
Fill in the following table: which step(s) can be done in batch, and which step(s) need
to be done in real-time?
              Real-time   Batch
Action            x
Decision                    x
Prediction                  x
Data                        x

(explanation) The action to inspect is in real-time. The inspector says “this package in front of
me is a package that I will inspect”. All the calculations up front and the analytical decision
system can be made before the action. The data is typically already available a couple of
days before the package arrives at customs; if the data is available, they can already run the
analytical models, already create a decision, etc.

Lecture 4 – Bart Hamers (BNPPF)


Large questions

1. Imagine you are the CEO at a bank that has not done any work on AI whatsoever.
You hire a data scientist that comes to you with three projects:

i. Automatically labeling and parsing text documents by using NLP

ii. Credit scoring

iii. Building a data pipeline to continuously improve, enrich, process, and gather data on
customers, stock exchange prices, documents …

a. Using the framework put forward by Bart Hamers, draw on the graph
below where you think these projects end up and motivate why.

When focusing on your potential AI projects there are 3 aspects to keep in mind for each
project: why, what, and how.
- Why: how does the project link to your corporate strategy? Here, there are 3 motivations:
improvement, increase and innovate.

Type 1: Improvement → continuous improvement of existing processes; these are rather
short-term projects such as email routing, chatbots, process optimization, …

Type 2: Increase → increase of sales revenue or a reduction in cost of risk; these are often
medium-term projects with significant ambition such as fraud detection, customer segmentation, …

Type 3: Innovate → new business model or value stream creation; these are often long-term
projects with significant ambition and a higher risk of failure, such as new products or service
dimensions.

- What: what is the value for the organization? On which levels can AI create value in the
industry? Often answered through the use of an AI project canvas.

1. Operating expenses → cost saving or avoidance, automation, …

2. Operating income → revenue improvement (volume increase, price optimization)

3. Capital → risk reduction / management

4. Asset efficiency → inventory management

5. Expectations → fraud management, customer satisfaction, reputation gain

- How: complexity: how difficult is it to realize? What are the costs and complexity linked to AI?

What are the costs, efforts and resources needed to realize the AI project? → development
costs, implementation costs, integration costs, change management costs, regulatory and
ethical constraints, …

The portfolio of AI projects thus needs to have a balance between value and complexity. A
balanced portfolio should consist of short term/ low risk projects, medium term/ medium risk
projects, and high risk/ long term projects.
When looking at the 3 proposed projects we thus need to keep these 3 aspects in mind:

i. Automatically labeling and parsing text documents by using NLP

Natural language processing is a branch that focuses on teaching computers how to read and
interpret text in the same way humans do. In banking it can be used to automate certain
document processing, analysis and customer service activities. This is a very hard branch of
data science, which implies high costs and high resource and time intensity for what is a
type 1 (improvement) project. The risk (= incurring large costs + complexity → failure) is thus
quite high while the value for a bank is limited. → low value / high risk / long time to completion

ii. Credit scoring

A credit scoring project would be a type 2 project concerning a reduction of costs, since it
leads to risk reduction by reducing defaults on loans. The risk, however, also concerns ethical
and privacy issues. The time to completion should not be that long, as there are already many
open-source credit scoring algorithms in existence and a bank already has lots of customer
history available. → medium value / medium risk / medium time to completion

iii. Building a data pipeline to continuously improve, enrich, process, and gather data on
customers, stock exchange prices, documents …

Building a data pipeline implies moving data from different sources to a destination for storage
and analysis. If this pipeline is of high quality, it enables improvements in processes, reduction
of costs, and the creation of new business models or value streams on top of the aggregated
data. It really depends on what data is used and comes out of the data pipeline → the value
could be quite high, as it enables other AI projects, while the risk is relatively low/medium
(relatively low ethical / privacy / data contamination concerns). Because the data pipeline is
continuously improved, it is a project which can keep running for a long time. The costs,
resources, and time allocated to the data pipeline should diminish over time, however. → start
with this project out of all 3? (depends on the corporate strategy of the firm, of course)
Blue = project 1, black = project 2, orange = project 3

Lecture 5: Galit Schmueli


Still to be resolved.

Lecture 6: Jonathan Berthe (Robovision)


How did brain research inspire Deep Learning? What is the role of Deep Learning in
scaling AI?
Deep learning is based on the reverse engineering of the brain:
They looked at the visual pathway of a human brain that is processing visual information. For example
they presented images of mice to cats to see what part of the brain of the cat was active when it saw the
pictures.
⇒ They found that processing and classification is done in a hugely parallel way, with a lot of different
layers passing information to each other. The gradual processing of information in the brain is similar
to the gradual processing of information in such a deep neural network, with convolutional kernels
whose filters are learned in an autonomous way.
So you:
- connect images to the input layer of the neural network
- give them the right classification (data annotation), so tell the neural network whether it is, for
example, a cat or a dog
- the connections (weights) will be made in an automated way with a mathematical technique:
gradient descent
So the idea is that you no longer need to program if-then-else code by hand (software 1.0); instead,
Deep Learning models are used to solve your problem (software 2.0): you create an artificial brain,
a big neural network with many neurons.
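A minimal Keras sketch of this "software 2.0" idea, assuming a folder of labelled cat/dog images (the directory path, image size and architecture are illustrative only):

```python
import tensorflow as tf

# Labelled images are the "annotation" step; the directory layout is assumed (subfolders cat/, dog/).
train = tf.keras.utils.image_dataset_from_directory(
    "data/train", image_size=(128, 128), batch_size=32)

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 255, input_shape=(128, 128, 3)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),     # convolutional kernels
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation="sigmoid"),        # cat vs. dog
])

# The "connections" (weights) are learned automatically via gradient descent.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train, epochs=5)
```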
…DL in scaling AI?

Describe three use cases of deep learning in retail, as described by Jonathan Berte.
(1. Automated cash checkouts where customers are able to check out the products themselves
(self-checkout tills) 🡺 this is barcode-oriented technology.
They are going to enhance this with product and behaviour recognition 🡺 hand tracking: checking
the path of the product, here from right to left (scanned).)

2. Smart scale = to detect fruits and vegetables automatically


o @ Colruyt
o to speed up the check-out process
o Software recognizes different types of fruit and vegetables: a smart camera was
developed and put above the weighing scale that immediately determines which
product is present.
▪ customer puts fruit/vegetable in the scale

▪ a trigger is sent to the camera to take a picture

▪ picture runs through the AI platform that identifies the fruit/veg

▪ customer confirms the predicted fruit/veg using a tablet

o The DL model can recognise 120 types of fruits and vegetables at Colruyt. Thanks to AI,
product recognition becomes more and more accurate as more photos are added.
3. Shelf management
o Detect misplaced units
o Count stock
o Detect wrong displayed price tags and promotions
4. Customer behavior analysis
o Analytics on dwell time & customer flows
o Improve shop layout
o Open a new cash register if there are too many customers
o Detect shoplifting
(Are often based on pose detection and store cameras)
===🡺 they all work on the same kind of neural networks

Small questions:

What are the challenges with AI right now, according to Jonathan Berte?
- The current AI field is chaotic because there is a jungle of frameworks with their own drivers and
cycles.
- Many companies today just “throw some AI at the problems”
- Going from a POC (proof of concept) to production is difficult
o Keeping the AI applications live is at least as difficult as getting them live and they need
to be reliable, scalable and adaptable

What is the “language gap” between specialists and data scientists according to
Jonathan Berte? How does Robovision solve this?
On the one hand you have the product specialists or operators who work in the factory and have no DS
knowledge but are just running the AI models.

On the other hand you have the data scientists who don’t have the product knowledge of the operators
and always need more data from them.

= continuous frustration on both sides

🡺 Robovision solves this with their AI platform of collaborative intelligence: operators who have a lot of
knowledge can transpose their knowledge into the AI models, and data scientists prepare the AI models
that are used on the production floor.

(Collaborative intelligence = AI intelligence working hand in hand with product specialists, a tool that can
augment their function)

What is data enhancement? Why does Robovision do this?


= a process that involves adding new data to an existing database.

They do this to create more data in order to train the neural network in a matter of minutes instead of
hours and get a very high performance neural network

(Sometimes there is a lack of sufficient data to build a good AI model, for example with the gas bottle
detection in metal waste processing, where they created artificial data points to train the model. The
more data you add to models, the better they perform.)
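A minimal sketch of generating such artificial data points with simple image augmentation, using Keras preprocessing layers (the transformations and parameters are illustrative, not Robovision's actual pipeline):

```python
import tensorflow as tf

# Random flips/rotations/zooms turn one labelled photo into many training examples.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomZoom(0.2),
])

image = tf.random.uniform((1, 128, 128, 3))        # stand-in for a real photo
augmented_batch = [augment(image, training=True) for _ in range(10)]
```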
Lecture 7: Vinayak Javaly (CUNY)
Small questions (answer in 2 lines):

Vinayak Javaly talks about a somewhat uncommon job in data science, called the “Data
Janitor”. What is a data janitor?
= As a data scientist you often spend around 10% of your time on the real data science and most of the
time you mess around with the data and clean it and preprocess it in order to do the real data science
on this data.

What is the most time consuming part of a data scientist’s job, according to Vinayak
Javaly?
o 1. Building a data pipeline
o 🡪 2. Processing the data
o 3. Building a model
o 4. Explaining what you did to stakeholders

Lecture 8: Walter Daelemans (UA)


Large questions (answer in 2 pages):

What is the ambiguity and reasoning problem? How does this relate to the polarized
opinions on GPT-3?
Still to be resolved.

What are the potential tasks of computer generated transitions between speech, text
and images. Describe these, and list at least one application (real-life use case) for each
transition.
Text to text: A linguistic model can take a sequence of text as input and transform it into another piece of
text. A simple task is paraphrasing a piece of text. Similarly, a more advanced model can automatically
make a summarization of a large text or book. Another application can be the translation of text using a
GPT-3 based model. Finally, GPT-3 models can be used for linguistic tasks like question answering: the
model predicts what would be the most probable response to a textual question. This can be used in
chatbots to help customers. An example is the chatbot ‘Bruce’ that is used to assist passengers in
finding their flight information at Brussels Airport.

Text to speech: A model takes text as input and has a synthetic voice speak it in natural language.
This is how Alexa, Siri, etc. can form spoken responses to questions you ask them through voice or text.

Speech to text: A model can take a spoken fragment as input and be trained to understand what is said.
This way it can make a transcription of what is said. This is a very powerful tool, because it can transform
unstructured audio data into (semi) structured textual data. A feature like this is already available on the
Google Translate app. The combination of this function and the previous 2 functions make it possible for
2 people to have a conversation in their native languages and use a phone to transcribe, translate and
dictate the translated sentences to each other.

Text to image: This is an application using generative AI models. The model takes a text sequence as
input and tries to come up with an image that matches the description the input provided. For instance
feeding ‘a black cat on a table’ can be understood by the model and transformed into an image of a black
cat sitting on a table. This can be used in product design applications.

Images to text: Convolutional networks can be combined with the linguistic models to understand what
is visible in the photo. This can provide a more complex description than traditional image classifications.
An example of image to text are the tags that Google Photos automatically generates for your photo
albums.
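A small sketch of two of these transitions using Hugging Face transformers pipelines (the default public models are used for illustration; the audio file path is a placeholder):

```python
from transformers import pipeline

# Text to text: summarize a long passage.
summarizer = pipeline("summarization")
print(summarizer("GPT-3 style language models can paraphrase, summarize, translate "
                 "and answer questions about text.", max_length=20, min_length=5))

# Speech to text: transcribe an audio file (path is a placeholder).
transcriber = pipeline("automatic-speech-recognition")
# print(transcriber("recording.wav"))
```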

Why did GPT-3 work better than GPT-2?

The GPT-3 model worked better than the GPT-2 model because it has more and larger layers. Size does
matter, as a bigger model can be trained on more data, which leads to better results.

Explain the problems of ambiguity and inference in the context of text mining

NLP algorithms lack background knowledge. This results in the ambiguity problem, as they cannot
always select the right meaning and representation, and in the inference problem, as they cannot
reason based on the context.

Lecture 9: Kris Laukens (UA)


Large questions (answer in 2 pages):
Explain the difference between an experimental and computational approach to learn
protein interactions.
Learning protein interactions is unraveling the network. Just like a social network, proteins interact with
each other and this is examined in the lab.

- The experimental approach can be compared to fishing. If a certain fish bites when you have a
certain bait, you know they have an affinity for each other. So you have a protein and you send a
lot of other proteins to this one, then you check which proteins just pass by and thus don’t
interact with the protein and also which proteins stick around and therefore interact with the
protein. This allows the scientist to elucidate/reveal step by step the molecular network that
underlies the organism. (It’s expensive and time consuming)

- That is why the computational approach exists: this approach lets the computer discover the
rules that determine the protein interactions.
o Step 1: collect enough interaction observations and extract the features that play a role.
o Step 2: look for patterns and automatically predict label of unseen instances
o Features of the proteins are sequences of amino acids. When you have a certain protein
and the sequences of the ones it interacts with, it is possible to extract the pattern =
the consensus sequence (see the sketch below).
o Nowadays more features are also used; this is the traditional approach.
o (This is useful for simple cases, but there are challenges…)
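A toy sketch of extracting a consensus sequence from a handful of aligned sequences (the sequences are made up; real inputs would come from experimental interaction data):

```python
from collections import Counter

# Toy aligned binding-site sequences from proteins that interact with the same target.
sequences = ["ACDKG", "ACEKG", "ACDRG", "TCDKG"]

# The consensus sequence takes the most frequent amino acid at each position.
consensus = "".join(
    Counter(column).most_common(1)[0][0]
    for column in zip(*sequences)
)
print(consensus)   # -> "ACDKG"
```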
Why is it beneficial to use AI in order to predict protein interactions? Which problem
does this solve? What are the current challenges with this? How was this done
previously?
(Not entirely sure.)

CHALLENGES of protein interactions:

- A protein is much more than a sequence (reducing a protein to its sequence also means throwing
away some of the information)
- The pattern location is often unknown: if you don't know what the pattern is or what it looks
like 🡪 it is hard to find
- Often many-to-many relationships (interactions with many partners): how do we deal with that =
complex network

…?
Discuss how bioinformatics can improve vaccine development and discuss some ethical
implications.
During the first outbreak of corona in China 🡪 by using bioinformatics, the genome sequence of the virus
was already available: 30,000 characters, used immediately to see that the virus was related to SARS-like
coronaviruses.

- They annotated the genome: analyze the structure of the active proteins and the coronavirus
structure became available.
o They are also able to predict all the proteins that were encoded by the genome
- Now all the different variants of the COVID-19 virus are examined with genomic sequencing on
a large scale.
- They can now also check to which other viruses the virus is related by comparing the
sequences.
- Bio-informaticians joined forces by sharing their data about the virus on open data sharing
platforms.
- (Based on the genetic code of the virus (which is built by bio-informaticians) you can synthesize
RNA and use that to inject into a patient or a person to let the cell itself produce the proteins
that the antibodies will be reacting against. Choosing the target sequence is done with
bio-informatical procedures.)
- The use of bioinformatics in vaccine development sped up the process a lot; it usually took
decades to develop a vaccine, and now there are novel vaccine technologies like Moderna's,
where it is possible to design a vaccine completely in a computer in just a few days.
- It’s also possible to predict whether a certain vaccine will work for a certain person, thus
whether a good immune response will be triggered.

Ethical implications:
- Something that is still present in the health sector are biases like gender bias or color bias
- Maybe the speed of vaccine development could pose negative effects for the quality and
long-term effects are more difficult to investigate (shorter clinical trials etc.)
- Knowing your genome will reveal some genetic risks, like risks for diseases, and that is a
danger; if the patient did not ask for it, you cannot reveal that information if the patient
doesn't want to know.
- Another problem: if I know my risk because I know my genome, should I tell banks etc. when
applying for a loan?
- Or should DNA data be used by insurance companies?
=> we are all full of genetic risks, and maybe not knowing them is preferable to knowing them
beforehand.
- So there is anonymity in the data, but it is difficult 🡪 removing the name and postcode does not
make it fully anonymous, and the data should never be coupled to commercial use.
- It’s possible to do personalized immunology, this makes use of medical data which should be
handled very carefully and with the highest degree of privacy.
- Open data is good but the privacy should be guaranteed. If the DNA data is used for commercial
goals, I think consent of the person is required, if used for medical goals I think it’s possible to
use the data without consent but anonymized.

Explain the following figure:


- What is described in the figure?

On the y-axis is the cost per human genome and on the x-axis is the time in years.

- At the beginning, the cost of reading a human genome was about 100 million dollars.
- In 2007-2008 there was a disruption which pushed the cost down rapidly, and from then on the
price kept decreasing, up to the point where we are now able to read a human genome for 1,000
dollars.
- This is interesting because a new, second-generation technology appeared on the market
(unlike before, it sequences in a massively parallel way, which made the cost much lower).

- What is Moore’s law?


The yellow line represents Moore’s law, which describes the declining cost of computing capacity: the
declining cost of computational power for one CPU unit.

The cost is halved roughly every 18 months to 2 years for the same capacity. You can see that the
evolution of genome sequencing cost (green line) dropped much faster than Moore’s law in its
exponential decrease, meaning that in those years we were running behind with computers: they were
less and less able to deal with the amounts of data and could not catch up with the genomic evolution.
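As a back-of-the-envelope illustration of that halving rate (the 18-month period and the ~$100M starting cost are the figures quoted above, used purely for arithmetic):

```python
start_cost = 100e6          # ~$100M per genome at the start of the curve
halving_period_years = 1.5  # Moore's-law-style halving every ~18 months

for years in (5, 10, 15):
    cost = start_cost * 0.5 ** (years / halving_period_years)
    print(f"after {years:2d} years: ~${cost:,.0f}")
# Real sequencing costs fell far faster than this after 2007,
# which is why the sequencing-cost line drops below the Moore's law line.
```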
- What is the black line?

- What caused the sudden drop around 2007?


The drop around 2007 was caused by a disruption: the second generation of sequencing technology
appeared on the market, making massively parallel sequencing possible.

- What are the implications of this figure?


See the text above.

Lecture 10: Ellen Tobback (AXA)

Small questions (answer in 2 lines)

In the mail switch classification models of AXA, how come the predicted and probability
values can differ? How is this solved?
Does this have to do with the figure she drew on the board? Still to be resolved.

BERT is trained on masked sentences, briefly explain what this means.


In every sentence you feed to the model, one or more words are masked (in the original BERT setup,
about 15% of the tokens), so the model does not know which word was there. BERT must then predict
which word was in that place, using the contextual information from the rest of the sequence. This way,
BERT can be trained in an unsupervised way on massive amounts of data (like the entire Wikipedia
corpus).
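A small sketch of that masked-word prediction using the Hugging Face fill-mask pipeline with the public bert-base-uncased checkpoint (the example sentence is made up):

```python
from transformers import pipeline

# BERT predicts the word hidden behind the [MASK] token from its context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill_mask("The customer sent an [MASK] about her insurance claim."):
    print(candidate["token_str"], round(candidate["score"], 3))
```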

What are reasons not to use the deep learning model in the mail switch system of AXA?
Human errors are easier to understand than errors made by the DL model. Some of the mistakes the
model makes can seem like gibberish to a human, because the model fits on very small details.
Additionally, the model has real-life consequences for the people who manually classified the
emails before. If there is no plan ready to support them in getting a new task, using the model
will not get the required business support and might seem unethical.

Lecture 11: Compas Case
