Understand The Softmax Function in Minutes: Data Science Bootcamp
Uniqtech
Jan 30, 2018 · 13 min read
Learning machine learning? Specifically trying out neural networks for deep
learning? You have likely run into the Softmax function, a wonderful activation
function that turns numbers, aka logits, into probabilities that sum to one.
The Softmax function outputs a vector that represents the probability
distribution over a list of potential outcomes. It’s also a core element used in deep
learning classification tasks. We will help you understand the Softmax function in a
beginner friendly manner by showing you exactly how it works: by coding your
very own Softmax function in Python.
If you are implementing Softmax in Pytorch and you already know Pytorch well,
scroll down to the Deep Dive section and grab the code.
This article has gotten really popular: 5800+ claps. It is updated constantly. The latest
update, Jan 2020, added a TL;DR section for busy souls. The Dec 2019 update added Softmax with
Numpy, Scipy and Pytorch functional, plus visuals indicating the location of the Softmax function
in a neural network architecture. A full list of updates is below. Your feedback is
welcome! You are welcome to translate it and cite it. We would appreciate it if the
English version is not reposted elsewhere. If you want a free read, just use incognito
mode in your browser. A link back is always appreciated. Comment below and share
your links so that we can link to you in this article. Clap for us on Medium. Thank
you in advance for your support!
Skill pre-requisites: the demonstration code is written with Python list
comprehensions (scroll down to see an entire section explaining list comprehension).
The math operations demonstrated are intuitive and code agnostic: they come down
to taking exponentials, sums and division, aka the normalization step. This article is
for your personal use only, not for production or commercial usage. Please read our
disclaimer.
. . .
Udacity Deep Learning Slide on Softmax
The above Udacity lecture slide shows that the Softmax function turns logits [2.0, 1.0,
0.1] into probabilities [0.7, 0.2, 0.1], and the probabilities sum to 1.
In deep learning, the term logits layer is popularly used for the last neuron layer of
neural network for classification task which produces raw prediction values as real
numbers ranging from [-infinity, +infinity ]. — Wikipedia
Logits are the raw scores output by the last layer of a neural network, before
activation takes place.
. . .
TL;DR:
Softmax turns logits (the numeric output of the last linear layer of a multi-class
classification neural network) into probabilities by taking the exponential of each
output and then normalizing each result by the sum of those exponentials, so that the
entire output vector adds up to one, as all probabilities should. Cross
entropy loss is usually the loss function for such a multi-class classification problem.
Softmax is frequently appended to the last layer of an image classification network,
such as the CNNs (VGG16 for example) used in ImageNet competitions.
import numpy as np

def softmax(x):
    """Compute softmax values for each set of scores in x."""
    return np.exp(x) / np.sum(np.exp(x), axis=0)
Where does the Softmax function fit in a CNN architecture? Image augmented from neurohive cnn. As shown
above, Softmax’s input is the output of the fully connected layer immediately preceding it, and it outputs the
final output of the entire neural network. This output is a probability distribution over all the label class
candidates.
. . .
Softmax is not a black box. It has two components: the special number e raised to some power,
divided by a sum of such terms.
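Written as a formula, that description reads as follows. This is a reconstruction from the text, and the symbols n (the number of logits) and j (the summation index) are introduced here for illustration:

```latex
\mathrm{softmax}(y)_i = \frac{e^{y_i}}{\sum_{j=1}^{n} e^{y_j}}
```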
y_i refers to each element in the logits vector y . Python and Numpy code will be
used in this article to demonstrate math operations. Let’s see it in code:
import numpy as np
# e raised to any real power is always positive
np.exp(100) # about 2.6881171e+43
np.exp(-100) # about 3.720076e-44, a very small number
np.exp(-100) > 0 # still returns True
By the way, using exponentials of the special number e also makes the math easier later!
The logarithm of a product can be turned into a sum, which simplifies summation and
derivative calculation: log(a*b) = log(a) + log(b)
Replacing i with logit gives a more verbose way to write the list comprehension: exps = [np.exp(logit)
for logit in logits] . Note the use of plural and singular nouns. It’s intentional.
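As a minimal sketch of this numerator step, assuming the example logits [2.0, 1.0, 0.1] from the slide at the top of the article:

```python
import numpy as np

# Example logits from the slide: raw scores before activation.
logits = [2.0, 1.0, 0.1]

# Numerator of Softmax: raise e to the power of each logit.
exps = [np.exp(logit) for logit in logits]

print(exps)  # roughly [7.389, 2.718, 1.105]
```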
We just computed the top part of the Softmax function: for each logit, we raised e to
the power of that logit. Each transformed logit j needs to be normalized by
another number in order for all the final outputs, which are probabilities, to sum to
one.
We compute the sum of all the transformed logits and store the sum in a single
variable sum_of_exps , which we will use to normalize each of the transformed logits.
sum_of_exps = sum(exps)
Now we are ready to write the final part of our Softmax function: each transformed
logit j needs to be normalized by sum_of_exps , which is the sum of all the transformed
logits, including itself.
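Putting the three steps together gives the following sketch. The variable names follow the article, and the logits are again the slide's example values:

```python
import numpy as np

logits = [2.0, 1.0, 0.1]

# Step 1: exponentiate each logit (the numerator).
exps = [np.exp(logit) for logit in logits]

# Step 2: add up all the exponentials (the denominator).
sum_of_exps = sum(exps)

# Step 3: divide each exponential by the shared sum.
softmax = [j / sum_of_exps for j in exps]
```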
List comprehension gives us a list back. When we print the list we get
>>> softmax
[0.6590011388859679, 0.2424329707047139, 0.09856589040931818]
>>> sum(softmax)
1.0
The output rounds to [0.7, 0.2, 0.1] as seen on the slide at the beginning of this
article. They sum nicely to one!
. . .
Now that you know the pythonic way to implement Softmax, can you implement it
using Numpy?
import numpy as np

def softmax(x):
    """Compute softmax values for each set of scores in x."""
    return np.exp(x) / np.sum(np.exp(x), axis=0)
Source: StackOverflow
Below is the name of the API and its Numpy equivalent, as specified in the Scipy
documentation.
scipy.special.softmax
softmax(x) = np.exp(x)/sum(np.exp(x))
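A short usage sketch of the NumPy version on the slide's logits. The two-dimensional call is an added illustration of what axis=0 implies: each column, not each row, is treated as one set of scores. The scores array is made up for the example.

```python
import numpy as np

def softmax(x):
    """Compute softmax values for each set of scores in x."""
    return np.exp(x) / np.sum(np.exp(x), axis=0)

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(np.round(probs, 3))  # [0.659 0.242 0.099]

# With a 2-D input, axis=0 normalizes down each column.
scores = np.array([[2.0, 1.0],
                   [1.0, 1.0],
                   [0.1, 1.0]])
col_probs = softmax(scores)
print(col_probs.sum(axis=0))  # each column sums to 1
```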
. . .
sample_list = [1,2,3,4,5]
# console returns None
sample_list # console returns [1,2,3,4,5]
# note the entire expression 1st half & 2nd half are wrapped in []
# so the final return type of this expression is also a list
# hence the name list comprehension
# my tip to understand list comprehension is
# read the 2nd half of the expression first
# understand what kind of list we are iterating through
# what is the individual item aka 'each'
# then read the 1st half
# what do we do with each item
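To make the reading tip concrete, here is a small sketch. The "add one" operation is an arbitrary illustration and is not part of the Softmax code:

```python
sample_list = [1, 2, 3, 4, 5]

# Second half first: "for i in sample_list" visits each item.
# First half: "i" returns each item unchanged.
unchanged = [i for i in sample_list]      # [1, 2, 3, 4, 5]

# Same loop, but the first half adds one to each item.
plus_one = [i + 1 for i in sample_list]   # [2, 3, 4, 5, 6]

# The equivalent explicit for loop, for comparison.
plus_one_loop = []
for i in sample_list:
    plus_one_loop.append(i + 1)
```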
. . .
[[1,0,0], #cat
[0,1,0], #dog
[0,0,1],] #bird
Optional Reading: FYI, this is an identity matrix in linear algebra. Note that only the
diagonal positions have the value 1; the rest are all zero. This format is useful when
the data is not numerical but categorical, and each category is independent
from the others. For example, 1 star yelp review, 2 stars, 3 stars, 4 stars, 5
stars can be one hot encoded, but note that the five are related. They may be better encoded
as 1 2 3 4 5 . We can infer that 4 stars is twice as good as 2 stars. Can we say the
same about names of dogs? Ginger, Mochi, Sushi, Bacon, Max : is Bacon 2x better
than Mochi? There’s no such relationship. In this particular encoding, the first
column represents cat, the second column dog, and the third column bird.
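A minimal sketch of building that encoding in plain Python. The helper name one_hot and the class order are illustrative assumptions:

```python
classes = ["cat", "dog", "bird"]

def one_hot(label, classes):
    """Return a list with 1 at the label's position and 0 elsewhere."""
    return [1 if c == label else 0 for c in classes]

encoded = [one_hot(label, classes) for label in classes]
print(encoded)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```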
The output probabilities are saying 70% sure it is a cat, 20% a dog, 10% a bird. One
can see that the initial differences are adjusted to percentages. The logits were [2.0, 1.0,
0.1], but the probabilities are not in the ratio 2:1:0.1. Before Softmax, we could not say
that it’s 2x more likely to be a cat, because the results were not normalized to sum to one.
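A quick check of this point: after Softmax, the ratio between two probabilities depends on the difference of their logits, not on the ratio of the logits. The sketch below uses the slide's logits and only the standard library:

```python
import math

logits = [2.0, 1.0, 0.1]
exps = [math.exp(z) for z in logits]
total = sum(exps)
probs = [e / total for e in exps]

# cat vs. dog: the logits differ by 1.0, so the probability
# ratio is e**1.0 (about 2.718), not 2.0 / 1.0 = 2.
ratio = probs[0] / probs[1]
print(round(ratio, 3))  # 2.718
```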
The output probability vector is [0.7, 0.2, 0.1] . Can we compare this with the
ground truth of cat [1,0,0] as in one hot encoding? Yes! That’s what is commonly
used in cross entropy loss (We have a cool trick to understand cross entropy loss
and will write a tutorial about it. Read it here.). In fact cross entropy loss is the “best
friend” of Softmax. It is the most commonly used cost function, aka loss function,
aka criterion that is used with Softmax in classification problems. More on that in a
different article.
Why do we still need fancy machine learning libraries with fancy Softmax functions?
Machine learning training typically requires tens of thousands of samples of
training data. Even something as concise as the Softmax function needs to be optimized
to process each element efficiently. Some say that Tensorflow broadcasting is not necessarily
faster than numpy’s matrix broadcasting though.
. . .
Thanks. I can now deploy this to production. Uh no. Hold on! Our implementation is
meant to help everyone understand what the Softmax function does. It uses for
loops and list comprehensions, which are not efficient operations for a production
environment. That’s why top machine learning frameworks, such as Tensorflow and
Pytorch, are implemented in C++. These frameworks can offer much faster and more
efficient computations, especially when dimensions of data get large, and can
leverage parallel processing. So no, you cannot use this code in production.
However, technically if you train on a few thousand examples (generally ML needs
more than 10K records), your machine can still handle it, and inference is possible
even on mobile devices! Thanks Apple Core ML.
Can I use this softmax on imagenet data? Uh definitely no, there are millions of images.
Use Sklearn if you want to prototype. Tensorflow for production. Pytorch 1.0 added
support for production as well. For research, Pytorch and Sklearn softmax
implementations are great.
You have decided to choose Softmax as the final function for classifying your data.
What loss function and cost function should you use with Softmax? The theoretical
answer is Cross Entropy Loss (let us know if you want an article on that. We have a
full pipeline of topics waiting for your vote).
Tell me more about Cross Entropy Loss. Sure thing! Cross Entropy Loss in this case
measures how similar your predictions are to the actual labels. For example, suppose the
probabilities are supposed to be [0.7, 0.2, 0.1] but you predicted [0.3, 0.3, 0.4]
on the first try and [0.6, 0.2, 0.2] on the second try. You can expect
the first try, which is totally inaccurate, almost like a random guess, to have
a higher cross entropy loss than the second try, where you aren’t too far
off from the expected values. Read our full Cross Entropy Loss tutorial here.
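The comparison above can be checked with a small sketch. The helper cross_entropy is a hypothetical name, uses the natural logarithm, and assumes every predicted probability is greater than zero:

```python
import math

def cross_entropy(target, predicted):
    """Cross entropy between a target distribution and a prediction."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted))

target = [0.7, 0.2, 0.1]
first_try = [0.3, 0.3, 0.4]
second_try = [0.6, 0.2, 0.2]

loss_first = cross_entropy(target, first_try)    # about 1.175
loss_second = cross_entropy(target, second_try)  # about 0.840
print(loss_first > loss_second)  # True
```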
import torch

def softmax(x):
    return torch.exp(x) / torch.sum(torch.exp(x), dim=1).view(-1, 1)
dim=1 tells torch.sum() to sum each row across all the columns. .view(-1,1) reshapes
the row sums into a column so that broadcasting divides each row by its own sum. For
details of this formula, visit our Softmax Beyond the Basics article. The above is the
formula. If you are just looking for an API, use softmax or LogSoftmax; see the
Pytorch documentation.
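For readers without Pytorch installed, the same row-wise computation can be sketched in Numpy. This is an analogue rather than the Pytorch code itself: axis=1 plays the role of dim=1, and keepdims=True plays the role of .view(-1,1). The function name softmax_rows and the batch values are made up for illustration.

```python
import numpy as np

def softmax_rows(x):
    """Apply softmax to each row of a 2-D array."""
    exps = np.exp(x)
    # keepdims=True keeps the sums as a column of shape (rows, 1),
    # so each row is divided by its own sum.
    return exps / np.sum(exps, axis=1, keepdims=True)

batch = np.array([[2.0, 1.0, 0.1],
                  [1.0, 1.0, 1.0]])
probs = softmax_rows(batch)
print(probs.sum(axis=1))  # each row sums to 1
```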
We write beginner friendly machine learning, deep learning and data science
articles on Medium. Follow our profile (nearly 500 followers) and our top publication,
Data Science Bootcamp (500+ followers). You can also find our paid newsletter on
Substack.com, where we post Machine Learning Resources, paid subscriber easter
eggs for the best internet resources for ML, DL and data, trends, and summaries of
conferences and seminars. Read more about our offering here. We are developing a
machine learning course as we speak. Thank you for your support. Claps and
followers are always appreciated. New articles from all sites are tweeted out
@siliconlikes
Update History
Updated April 2020, hyperlink fixed for logits.
June 2019 Correction: there was an incorrect statement about Sigmoid and it
has been removed. Remember, Sigmoid predicts one class, for example
Prob(class=A); to calculate Prob(class=B) just do 1-Prob(class=A), because
there are only two classes in binary classification. It’s an either-or relation. Also
changed the Deep Dive section.
May 11 2019 In Progress: a deep dive on Softmax source code Softmax Beyond
the Basics (post under construction): implementation of Softmax in Pytorch
Tensorflow, Softmax in practice, in production.
Coming soon: citation for all the sections to make the article beginner friendly
and robust at the same time.
InProgress May 11 2019: Softmax Beyond the Basics, Softmax in textbooks and
university lecture slides
Logits, aka the scores before Softmax activation, are useful too. Is there a
reason to delay activation with Softmax? Softmax turns logits into numbers
between zero and one. In deep learning, where there are many multiplication
operations, a small number subsequently multiplied by more small numbers will
result in tiny numbers that are hard to compute on. Hint: this sounds like the
vanishing gradient problem. Sometimes logits are used in numerically stable loss
calculation before using activation (Need more citation and details. Section
under construction).
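One common way to keep such calculations numerically stable, offered here as an illustration rather than as this article's method, is to compute log-probabilities directly from the logits and subtract the largest logit before exponentiating. The function name log_softmax and the large example logits are assumptions for the sketch:

```python
import math

def log_softmax(logits):
    """Log of the softmax probabilities, computed stably."""
    m = max(logits)  # shifting by the max avoids overflow in exp
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]

# math.exp(1000.0) on its own would raise OverflowError.
log_probs = log_softmax([1000.0, 999.0, 998.1])
probs = [math.exp(lp) for lp in log_probs]
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
```

Shifting every logit by the same constant leaves the Softmax output unchanged, which is why these large logits give the same probabilities as [2.0, 1.0, 0.1].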
Written by Uniqtech
We are Uniqtech Co. Learn coding, data and software package skills with Uniqtech tutorials and
articles. Contact us hi@uniqtech.co We’d like to hear from you!