Season 11 Taric Jungle Metrics - Machine Learning

A League of Legends Research Paper


James Booth
January 25, 2022
Abstract
In the previous paper, Season 11 Taric Jungle Metrics, lightrocket2’s Taric jungle statistics for
Season 11 were broadly discussed across various metrics of performance. The culmination of that paper was
a logistic and a tree-based model that predicted wins and losses for a given game from six predictors with a
high degree of accuracy. In this paper, that goal remains unchanged, though it will be accomplished
through different means: machine learning. A single machine learning algorithm will be applied in this
paper to the n = 151 test observations used to test the logistic regression and tree-based models seen in
the last section of Season 11 Taric Jungle Metrics. Due to the added complexity of the neural network,
extensive effort will be made to explain what this model is doing and how it differs from the models
introduced in the previous paper. This is done in the first two sections of this paper. Since there is no
need to introduce lightrocket2’s Taric jungle statistics again, this paper is shorter. It is best viewed as
a companion to the previous paper, with its primary focus on the mathematics behind neural networks and
limited focus on Taric jungle itself.

Why Neural Networks?


The neural network is an advanced algorithm meant to teach machines, in this case a computer
program, to imitate the basic functions of the brain. Biologically, the brain makes decisions based on a
wide array of information which is then processed by neurons which fire depending on the inputted
information and the importance of that information. This is imitated mathematically by a complex
algorithm derived from principles of algebra, linear algebra, and multivariable calculus covered in the
first section of this paper, Neural Networks – Background. We will attempt to develop a neural network
through such methods to predict the likelihood of lightrocket2 winning or losing a given Taric jungle
game based on 21 input variables. This may seem needlessly complicated since, in Season 11 Taric
Jungle Metrics, we were able to correctly predict the results of 151 games 93.69% of the time with a
much simpler 6 variable logistic regression model. So why the neural network? Much like our brains, the
neural network inspects all available information to arrive at a decision, in this case winning or losing,
and so it makes a lot of sense to create a model that can imitate that process. Not only will this
model incorporate many more variables, as our brains might, but it will also weigh them in a similar
manner. It is expected that this model will be significantly more accurate as a result.

Neural Networks – Background


To motivate this section, we begin by introducing the derivative, what it does, and building from
this concept to the neural network. For reference, the neural network is a machine learning (aka artificial
intelligence) method where a computer will be taught information to predict a win or loss for a given
game of Taric jungle. This method is built on a complex mathematical algorithm which requires a
significant understanding of linear algebra and calculus, an understanding which will be broadly developed
here. The equation for the rate of change for a two-dimensional space is given by the rise (change in y)
over the run (change in x) or in formula form:
Equation A-1

$$\text{slope} = \frac{\Delta y}{\Delta x} = \frac{y_2 - y_1}{x_2 - x_1}$$

Where $y_2$ denotes the final y-value and $y_1$ denotes the initial y-value. $y_2 - y_1$ denotes the change in y and
$x_2 - x_1$ denotes the change in x. We call the rate of change the slope since one can envision the change in
y as akin to a change in height and the change in x as a change in horizontal distance in two dimensions. Rates of
change can be very large if $y_2 - y_1$ is much bigger than $x_2 - x_1$ or very small if the reverse is true.
Similarly, they can increase if the overall value is positive or decrease if that value is negative. To extend
this discussion to derivatives we will need to consider very small numbers.
There are several types of numbers, the natural numbers, the integers, the rational numbers, and
the real numbers. The natural numbers are given by the set N = {1, 2, 3, … n} where n is some infinitely
large positive number, and zero is not included in the set. The integers Z = { -n, …, -3, -2, -1, 0, 1, 2, 3,
…, n} where -n is an infinitely large negative number and zero is included. Notice that the integers
include every value of the natural numbers while also including their negatives and zero. Further, because
Z includes all of N and N is infinitely large, so is Z. The rational numbers include all the values of the
integers and the natural numbers while also including a ratio of any two integers as a ‘rational number’.
That is the rational numbers Q = {-n, …, -3, -2, -3/2, -1, -1/2, 0, 1/2, 1, 3/2, 2, 3, …, n}. Notice that there
is a single fraction listed between 1 and 2, which is 3/2 = 1 1/2. We could list any number of fractions
between 1 and 2 in the rational numbers, not just one. For example…
Incomplete list of rational numbers between 1 and 2:
{1, …, 10/9, 9/8, 8/7, 7/6, 6/5, 5/4, 4/3, 3/2, 7/4, 19/10, …, 2}
Even more generally, for any two numbers n and n + 1 in Q there is an infinite number of
rational numbers m/p, with m, p in Z (the integers) and m > p, such that n < m/p < n + 1, since there is an infinite
number of possible m and an infinite number of possible p satisfying the given conditions. Knowing that
there is an infinite number of fractions between any two integers Z is a concept extended even more
generally to the real numbers, R. From now on, we will refer to the real numbers as R. The real numbers
contain every rational number, and every integer and natural number by extension, while also containing
numbers which cannot be expressed as a ratio of one integer over another (p/q), which we call irrational
numbers. An example of such a number is Euler’s number, e, named after the 18th-century
mathematician Leonhard Euler. e is given by the following series:

$$e = 1 + \sum_{n=1}^{\infty} \frac{1}{n!} = 1 + 1 + \frac{1}{1 \cdot 2} + \frac{1}{1 \cdot 2 \cdot 3} + \cdots$$

Where $\infty$ denotes infinity and $\Sigma$ denotes the sum, in this case from n = 1 to infinity. That is,
Euler’s number can be thought of as an infinite sum of 1 and a bunch of fractions. The result is
approximately e = 2.71828, though the decimal expansion extends infinitely as smaller and smaller fractions are
always added. This is an irrational number, as it cannot be expressed as a single ratio of two integers but
only as an infinite sum of such ratios. Thus, it exists in R and not in the rational numbers.
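To make this concrete, partial sums of the series can be computed directly. The short R sketch below (R being the software used for the modeling later in this paper) adds up the first several terms and approaches 2.71828; the function name approx_e is simply an illustrative choice, not something from the original analysis.

```r
# Approximate Euler's number e by summing 1 plus the first `terms` values of 1/n!.
# More terms give a closer approximation.
approx_e <- function(terms) {
  1 + sum(1 / factorial(1:terms))
}

approx_e(2)    # 1 + 1 + 1/2 = 2.5
approx_e(5)    # approximately 2.716667
approx_e(15)   # approximately 2.718282, matching e to six decimal places
```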
Recognizing the existence of an infinite number of numbers between any two numbers such as 1
and 2 is important to motivate the concept of the derivative, which will lead into the focus of this paper,
machine learning as it pertains to Taric jungle. The rate of change or slope can be thought of as the
change in y over the change in x. Suppose we want to find the rate of change in a two-dimensional space
(x and y) in R and suppose that the change in x is very small. We know that there are infinitely many
numbers between any two numbers in R. For example, there are infinitely many numbers between 0 and
1/10, such as {1/1000, 1/100, 1/50, 1/25, 1/15}. Then it follows that there are infinitely many numbers
between 0 and 1/1000 such as {1/100000, 1/10000, 1/5000, 1/1001} and there are infinitely many
numbers between 0 and 1/100000 and so on. We can continue squeezing this interval between 0 and some
number, which we will call h, so tightly that there ceases to be any significant difference. That is, $x_2 -
x_1$, where $x_2 = h$ and $x_1 = 0$, approaches zero for the tightest interval you can imagine, say 1/10000000 –
0, though as mentioned, we can imagine an interval as small as we like.
This leads us to the formal definition of the derivative. For some function y = f(x) the derivative
is given by:
Definition A-1

$$\frac{dy}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{(x+h) - x} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$

Where h is an extremely small number, the numerator is the change in y (f(x + h) denotes $y_2$ and f(x)
denotes $y_1$), and the denominator is the change in x, given by (x + h) − x = h. $\lim_{h \to 0}$ denotes the limit as the
very small change in x, h, goes to zero from choosing smaller and smaller intervals of x-values. This
expression is known as the derivative and can be thought of as the ‘instantaneous rate of change’ since it
is the change in y over an extremely small, essentially instantaneous, change in x. The derivative is written
$\frac{dy}{dx}$ for a function of a single variable f(x) and $\frac{\partial f}{\partial x}$ for a function of multiple variables f(x, y,
…). The swirly notation $\partial$ in the second fraction represents the partial derivative, which is an extension of the
derivative. $\frac{dy}{dx}$ means the derivative of y with respect to x and can be thought of as the change of y for
an instantaneous change in x, aka the instantaneous rate of change of y. On the other hand, $\frac{\partial f}{\partial x}$ means the
instantaneous rate of change of f, a function of multiple variables not just x, for an instantaneous change
in x. For example, suppose there exists some function f(x, y) which takes inputs in two dimensions and
outputs a number in a third dimension given by the variable z. Then $\frac{\partial f}{\partial x}$ could be rewritten as $\frac{\partial z}{\partial x}$ and would
mean the change in z for an instantaneous change in x.
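Definition A-1 can be checked numerically by choosing a small h and computing the difference quotient directly. The R sketch below is a minimal illustration under assumed inputs; the example function f(x) = x² and the helper name numeric_derivative are our own choices, not part of the paper's model.

```r
# Approximate the derivative of f at x using the difference quotient of
# Definition A-1 with a small but nonzero step h.
numeric_derivative <- function(f, x, h = 1e-6) {
  (f(x + h) - f(x)) / h
}

f <- function(x) x^2
numeric_derivative(f, 3)   # approximately 6, the exact derivative of x^2 at x = 3
```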
Dimensions are important to discuss, as they will matter for machine learning and
particularly for the Taric jungle algorithm, since 21 predictor variables will be involved. Two-dimensional space,
where x denotes horizontal distance and y vertical, is familiar. Three-dimensional space consists of the x,
y, and z axis. For the sake of mathematical consistency, we assume the existence of n-dimensional spaces
for n = {1, 2, 3, 4, …, ∞}. In the machine learning algorithm that we will present at the end of this paper
we will assume a 22-dimensional space with 21 input variables and one output. This is impossible to
visualize so we will rely on the algorithm to interpret it for us. Partial derivatives are a useful way to
interpret the instantaneous rate of change, or derivative, in higher dimensional (three dimensional or
greater) space. Partial derivatives are generally given by the following:
Definition A-2
$$\frac{\partial f}{\partial x} = \lim_{h \to 0} \frac{f(x + h,\, y,\, z,\, \ldots) - f(x,\, y,\, z,\, \ldots)}{h}$$

For the partial derivative of f with respect to x, though partial derivatives may be taken with respect to
any input variable of a function f(x, y, z, …). The partial derivative is not only evaluated with respect to a
certain input variable, such as x, but is also evaluated with respect to a certain direction. Normally, this
will be the direction along the x-axis, or the instantaneous rate of change along x, for a partial derivative
with respect to x. However, it can be evaluated along any vector in any direction.
A vector is a directed line segment given by 𝑣⃗ for some vector v. In two dimensions, a vector has
components (x, y) where the first denotes its x-value and the second its y-value. A vector along the x-axis
is given by 𝚤⃗ = (1, 0) or a movement of 1 in the x-direction and 0 in the y-direction. Similarly, a vector
along the y-axis is given by 𝚥⃗ = (0, 1) or a movement of 1 in the y-direction and 0 in the x-direction. A
vector that moves along neither axis we can call 𝑣⃗ = (1, 1) for instance, corresponding to a movement of 1
along x and 1 along y, resulting in a point not on the x- or y-axis. The magnitude (length) of a vector is
given by the Pythagorean theorem, $a^2 + b^2 = c^2$, so $|\vec{v}| = \sqrt{1^2 + 1^2} = \sqrt{2}$, where $|\vec{v}|$ denotes the
magnitude. A unit vector is a vector of magnitude 1, so for the vector $\vec{v}$ this requires dividing its x and y
components by its magnitude, $\sqrt{2}$. Here the unit form of $\vec{v}$ is given by $\hat{v} = \left(\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}\right)$. Evaluating its
magnitude shows that it is now 1: $|\hat{v}| = \sqrt{\left(\frac{1}{\sqrt{2}}\right)^2 + \left(\frac{1}{\sqrt{2}}\right)^2} = \sqrt{\frac{1}{2} + \frac{1}{2}} = 1$.
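These magnitude and unit-vector calculations are easy to reproduce in R. The sketch below normalizes the example vector v = (1, 1) from above; the helper names are illustrative only.

```r
# Magnitude (length) of a vector via the Pythagorean theorem, and its unit form.
magnitude <- function(v) sqrt(sum(v^2))
unit_vector <- function(v) v / magnitude(v)

v <- c(1, 1)
magnitude(v)              # sqrt(2), approximately 1.414214
v_hat <- unit_vector(v)   # (1/sqrt(2), 1/sqrt(2))
magnitude(v_hat)          # 1, confirming v_hat is a unit vector
```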
Because vectors 𝚤⃗ and 𝚥⃗ both have magnitude 1 they are unit vectors denoted 𝚤̂ and 𝚥̂, which are
also called basis vectors for two-dimensional space. 𝚤̂ and 𝚥̂ represent the basis vectors for x and y
respectively and are called basis vectors because combining these vectors (through addition and scalar
multiplication) will give any point in two-dimensional space. We can think of a partial derivative of some
function f(x, y) with respect to x, in the direction of the x axis, as being in the direction of the basis vector
𝚤̂. Applying Definition A-2, here is what that looks like:
$$\lim_{h \to 0} \frac{f(x + 1 \cdot h,\ y + 0 \cdot h) - f(x, y)}{h} = \lim_{h \to 0} \frac{f(x + h,\ y) - f(x, y)}{h}$$

Notice that they are the same thing. That is, the direction of the instantaneous rate of change with respect
to x is not changed when taken along 𝚤̂ because 𝚤̂ is a vector that already lies along the x-axis. On the other
hand, if the direction of the same derivative is changed to unit vector 𝑣 with components (a, b) this results
in
Definition A-3
$$\lim_{h \to 0} \frac{f(x + ah,\ y + bh) - f(x, y)}{h}$$
We call this the directional derivative, and it is denoted by $\nabla_{\hat{v}} f$. The directional derivative can
also be written as the dot product of the gradient and the unit vector the derivative is being taken in the
direction of, written as $\nabla_{\hat{v}} f = \nabla f \cdot \hat{v}$. A proper understanding of the gradient and directional derivatives
will be important to the machine learning algorithm. We define the gradient below and prove its
relationship to the directional derivative as a dot product between itself and the unit vector $\hat{v}$.
The gradient is a vector of the partial derivatives given by:
Definition A-4
$$\nabla f = \left(\frac{\partial f}{\partial x},\ \frac{\partial f}{\partial y},\ \frac{\partial f}{\partial z},\ \ldots\right)$$
Where every component of the vector is the instantaneous rate of change in the direction of each predictor
variable of the function f. The dot product is vector multiplication aka the product of two directed line
segments. For any two vectors $\vec{v} = (a, b)$ and $\vec{u} = (d, c)$, where a, b, c, d are real numbers, it is given by
Equation A-2

$$\vec{v} \cdot \vec{u} = ad + bc$$
We have discussed what derivatives mean at the fundamental level but have not shown how
derivatives are computed. We do so here, and this will return full circle to the discussion of the gradient,
directional derivatives, and the dot product. All derivatives are computed in a manner based on the
formula presented in Definition A-1. Different derivatives are described by many rules which will
generally not be covered here, except for the following. First, the power rule: the derivative of x raised to a
power n multiplies by that power and reduces it by one, that is, $\frac{d}{dx} x^n = n x^{n-1}$. Second, the derivative of a
constant is zero. Third, the derivative of a function multiplied by a constant keeps the constant and derives only
the function, for example $\frac{d}{dx}(nx) = n$. With that in mind, we discuss the chain rule and its generalization, the
multivariable chain rule.
Definition A-5

$$\frac{d}{dx} f\big(g(x)\big) = f'\big(g(x)\big) \cdot g'(x)$$

This is the chain rule for a single-variable function composition. A function composition is a
function whose input is another function. An example of this is f(x) = $x^2$, which can be thought of as a
function composition f(g(x)) where g(x) = x and f(x) = $x^2$. Composing g(x) with f(x) gives the
composition. Deriving the composition gives a product of the derivatives of each function. For f(g(x)) =
$x^2$ the derivative of this composition is $f'(g(x)) \cdot g'(x) = 2x \cdot 1 = 2x$. A quick numerical check of this rule
is sketched below. The multivariable chain rule is a generalized version of the
chain rule for functions of multiple input variables f(x, y, z, …). It is given in Definition A-6 below.
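As a quick numerical check of the chain rule, the derivative of a composition can be compared against the difference quotient of Definition A-1. The R sketch below uses f(x) = x² and, purely for illustration, a slightly richer inner function g(x) = 3x + 1 than the g(x) = x of the example above; neither appears in the paper's own model.

```r
# Verify the chain rule numerically: d/dx f(g(x)) should equal f'(g(x)) * g'(x).
f <- function(x) x^2
g <- function(x) 3 * x + 1
f_prime <- function(x) 2 * x   # derivative of f
g_prime <- function(x) 3       # derivative of g

h <- 1e-6
x <- 2
difference_quotient <- (f(g(x + h)) - f(g(x))) / h   # Definition A-1 applied to the composition
chain_rule <- f_prime(g(x)) * g_prime(x)             # Definition A-5
c(difference_quotient, chain_rule)                   # both approximately 42
```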
Definition A-6

$$\frac{d\, f\big(g_1(x),\ g_2(x),\ \ldots,\ g_k(x)\big)}{dx} = \sum_{i=1}^{k} D_i f\big(g_1(x),\ g_2(x),\ \ldots,\ g_k(x)\big)\, \frac{d g_i(x)}{dx}$$

Where $g_1(x), g_2(x), \ldots, g_k(x)$ denote k functions of x which serve as the inputs to f. That is, instead of there
being only one function g(x) serving as the input to f, there are now i = 1, …, k such functions. Furthermore,
$D_i f$ is the notation for the partial derivative of f with respect to its $i$th input variable. This is necessary because
$f(g_1(x), g_2(x), \ldots, g_k(x))$ is a multivariable function, with each input function $g_1(x), g_2(x), \ldots, g_k(x)$
being a new variable of f. Now we will bring the ideas of vectors, the vector dot product, the gradient, the
directional derivative, and the multivariable chain rule together. That is, we will use these concepts
together to demonstrate that
Equation A-3

$$\nabla_{\hat{v}} f = \nabla f \cdot \hat{v}$$
That is, the directional derivative, or the derivative taken in the direction of a unit vector 𝑣 is
equal to the dot product of the gradient of f and that unit vector. To prove this, we will need to use the
multivariable chain rule in the previous definition. Suppose we have a function f(x, y) and a unit vector $\hat{v}$
= (a, b). Then the directional derivative of f along $\hat{v}$ can be written as:

$$\nabla_{\hat{v}} f = \lim_{h \to 0} \frac{f(x + ah,\ y + bh) - f(x, y)}{h}$$

Where h denotes an infinitely small change along x and y respectively, and a and b are multiplied by that
infinitely small change h, changing the direction of the derivative from along the x- and y-axes to along the
components of the vector $\hat{v}$. Now suppose we set $g(z) = f(x_0 + az,\ y_0 + bz)$, where $x_0$, $y_0$, a, and b are
constants and the function g changes with respect to some variable z. We can think of $(x_0, y_0)$ as the
starting point in the two-dimensional input space, from which we move an infinitely small amount along the
unit vector $\hat{v}$ to some new point $(x_0 + az,\ y_0 + bz)$. Then the derivative of g is

$$g'(z) = \lim_{h \to 0} \frac{g(z + h) - g(z)}{h},$$

and evaluating it at z = 0 gives

$$g'(0) = \lim_{h \to 0} \frac{g(h) - g(0)}{h} = \lim_{h \to 0} \frac{f(x_0 + ah,\ y_0 + bh) - f(x_0, y_0)}{h} = \nabla_{\hat{v}} f.$$

That is, the derivative of g(z) at z = 0 equals the directional derivative of f(x, y) at $(x_0, y_0)$. Now write
$g(z) = f(x(z), y(z))$ where $x(z) = x_0 + az$ and $y(z) = y_0 + bz$, so that g is a composition of
functions of z. By Definition A-6,

$$\frac{dg}{dz} = \frac{\partial f}{\partial x}\frac{dx}{dz} + \frac{\partial f}{\partial y}\frac{dy}{dz}.$$

Because the derivative of a constant is zero, the
derivatives of x and y with respect to z evaluate to a and b respectively. The result is $\frac{dg}{dz} = \frac{\partial f}{\partial x} a + \frac{\partial f}{\partial y} b$.
The gradient of f(x, y) is given by $\nabla f = \left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right)$ and the unit vector is $\hat{v} = (a, b)$, thus their dot product $\nabla f \cdot \hat{v}$
$= \frac{\partial f}{\partial x} a + \frac{\partial f}{\partial y} b$ by Equation A-2, and $\nabla f \cdot \hat{v} = \nabla_{\hat{v}} f$, satisfying Equation A-3.
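Equation A-3 can also be verified numerically. The R sketch below approximates the gradient of an example function f(x, y) = x² + 3y with difference quotients and compares the dot product with a unit vector against the directional difference quotient of Definition A-3; the function, the point, and the helper names are illustrative assumptions only.

```r
# Numerical check that the directional derivative equals the gradient dotted
# with the unit vector (Equation A-3).
f <- function(x, y) x^2 + 3 * y
h <- 1e-6

numeric_gradient <- function(f, x, y) {
  c((f(x + h, y) - f(x, y)) / h,   # partial derivative with respect to x
    (f(x, y + h) - f(x, y)) / h)   # partial derivative with respect to y
}

v <- c(1, 1)
v_hat <- v / sqrt(sum(v^2))   # unit vector in the direction of v

x0 <- 1; y0 <- 2
directional <- (f(x0 + v_hat[1] * h, y0 + v_hat[2] * h) - f(x0, y0)) / h   # Definition A-3
dot_form <- sum(numeric_gradient(f, x0, y0) * v_hat)                       # gradient . v_hat
c(directional, dot_form)   # both approximately 3.536, i.e. (2 + 3) / sqrt(2)
```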
Determining that the directional derivative of a multivariable function is equal to the dot product
of the gradient vector of that multivariable function with the directed unit vector is the first step in
creating the neural network framework. Why? Because we are going to use this information to
demonstrate a property of the gradient, that the direction the gradient vector points in is the direction of
greatest increase of a multivariable function. This would seem nonsensical without first establishing the
relationship between the directional derivative and the gradient, as done above. Suppose that we wanted
to maximize the directional derivative, that is we wanted to find the greatest instantaneous rate of change
for some multivariable function f(x, y). The dot product has a property, not discussed here, that the dot
product between any two vectors $\vec{u}$, $\vec{v}$ of fixed lengths is maximized when they point in the same direction. Thus, if we
want to maximize the directional derivative, we must maximize $\nabla f \cdot \hat{v}$. This only occurs when $\hat{v}$ points in
the direction of the gradient; therefore, the gradient vector points in the direction of greatest increase for
any multivariable function f(x, y, z, …).
In this section we have discussed the relevant background information for neural networks. This
was important to do to understand what will be presented further in this paper. For example, in the
following section, we introduce the components of the neural network: weights, biases, and the neurons
themselves along with neural network error and gradient descent. At the end of the section, we will
present our neural network model for lightrocket2’s Taric jungle games before applying it against test
data.

Neural Networks – Developing the Network


Neural networks are designed to operate like the human brain. The human brain, on a very basic
level, works through the firing of neurons, which are cells that transmit nerve impulses through the brain.
We can think of these impulses as signals which can convey anything about any information we see,
touch, or feel. Neurons are connected to each other through synapses and collections of neurons can
transmit increasingly complex signals, in other words more complex information. Neural networks aim to
mimic this process by acting as an artificial brain. The way this is done, simply put, is by introducing
various variables for the network to use to make a particular decision. Why? Because at its core, the
neural network is a function of all the variables input into it. Regarding Taric jungle, this may involve
things such as kills, deaths, assists, cs score, and dragons and barons secured. Neurons, usually denoted
by circles in the network, take in information about each variable in the form of linear combinations and
arrive at a decision. To be clear, neurons are simply containers that hold a single number. That number, for the
sake of clarity, is called the neuron’s activation. Depending on the value of the activation, the neuron may
fire (send information from the neuron to another neuron) or not fire (do not send information from the
neuron). All these variables are then factored into a mathematical algorithm whose output is a number
between 0 and 1, which we call the likelihood of lightrocket2 winning a Taric jungle game (in this case).
To properly introduce and develop the network we begin with weights and biases.
Every neural network has input variables, in this case 21 predictor variables denoted
$x_1, \ldots, x_{21}$, and outputs a result y (a win or a loss) for any given game. The games, along with their
respective values for x and y are input from what is called training data. Training data is information
taken in by the neural network to teach or ‘train’ it to recognize patterns and perform well for data it has
not seen before (test data). The neural network is divided into three layers: the input, hidden, and output
layers. There can only be one input layer and one output layer but there can be as many hidden layers as
we want. The input layer is simply $x_1, \ldots, x_{21}$, the first hidden layer is a series of linear combinations of
the input layer (any further hidden layers are linear combinations of the previous hidden layers), and the
output layer is the overall output of the network, a win or a loss in the case of Taric jungle games. What is
meant by the output of the input layer brings up weights (w) and biases (b). When the brain makes a decision about
anything, it weighs pieces of information against one another as well as against the significance of the decision, and makes a
choice one way or the other. Yes or no. Win or loss. We can think of the weighting of information as
giving a weight, w, to each variable $x_1, \ldots, x_{21}$, and the significance of the decision as some threshold t.
Weights determine the significance of a particular variable x, and the threshold is the level of significance
required for a neuron to fire, i.e. for a positive decision to be made. In Equation B-1 (below) the bias is given by
b = -t. This is algebraic maneuvering, and we introduce t in the first place because it is simplest to think of
the bias as a kind of threshold in decision making.
To decide if a given Taric jungle game will be won or lost evidence from the input layer must be
weighed against a threshold in one or more hidden layers to determine if it is meaningful in predicting a
win or a loss in the output layer. Suppose a single neuron in the hidden layer is deciding about whether
vision score and melee jungle matchups have an impact on winning a particular game. From Season 11
Taric Jungle Metrics we know that neither of these predictors was particularly powerful, so perhaps we
give them low weights and a relatively high threshold (a strongly negative bias). To be clear about what is going on, the value of the
neuron or its activation is given by some linear combination of weights and biases of the input variables
from the input layer of neurons. If the threshold is high, then it is less likely that the neuron will send its
information to the next neuron, and vision and melee jungle matchup will be discarded by the neural
network as meaningful variables in the artificial decision-making process. This process is repeated for
linear combinations of weights and all 21 predictor variables. Here is what this looks like for n predictors:
Equation B-1

The neuron will not fire if

$$w_1 x_1 + w_2 x_2 + \cdots + w_n x_n < t$$

and will fire if

$$w_1 x_1 + w_2 x_2 + \cdots + w_n x_n \ge t$$

The neuron fires if the sum is ≥ t and does not fire if it is < t. Subtract t from both sides of each inequality and
let b = -t, where b denotes the bias of the neuron:

$$w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b < 0 \quad \text{(does not fire)}$$

$$w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b \ge 0 \quad \text{(fires)}$$

The neuron fires if the sum plus the bias b is ≥ 0 and does not fire if it is < 0.

Where $x_1, \ldots, x_n$ and $w_1, \ldots, w_n$ denote the predictor variables x and their respective weights,
while b denotes the bias. Note that by Equation A-2, this expression can be rewritten as $\vec{w} \cdot \vec{x} + b$, where $\vec{x}$
is the vector of x-values, $\vec{w}$ is the vector of weights, and $\cdot$ denotes their dot product. It
can also be written using summation notation as $\sum_{j=1}^{n} w_j x_j + b$. With this model, any neuron will fire,
returning a 1 from that neuron, if the weighted sum plus the bias is greater than or equal to 0, and will return a 0 if
that same sum is less than 0. The problem with this model is that it ignores small changes
in w and b. The brain tends to make slight alterations to decisions based on slight alterations to either the
importance of variables or the variables themselves; as a result, the neural network should too. We
originally stated that a neuron will either fire or not fire depending on whether the weighted sum of
certain x-values plus the bias is greater than or equal to 0. We alter that definition through the sigmoid function
below.
Definition B-1
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

This is the sigmoid function, $\sigma(z)$. To understand how this function works, consider three points, z =
{-10, 0, 10}. For z = -10, the sigmoid function returns $\frac{1}{1 + e^{10}} \approx 0.00005$, or approximately 0; if z = 0
then the sigmoid function returns $\frac{1}{1 + e^{0}} = \frac{1}{2}$; and if z = 10 then $\frac{1}{1 + e^{-10}} \approx 0.99995$, or approximately 1. That
is, for increasingly positive values of z, the sigmoid function returns values closer and closer to 1, though
never equal to or greater than 1, and for increasingly negative z-values, returns numbers closer and closer
to 0 but never equal to or below it. Why? Because Euler’s number, defined as e = 2.718… in the previous
section, is always positive when raised to a power: $e^{-10}$ is positive, for example,
because it is equivalent to $\frac{1}{e^{10}}$, which is 1 divided by a large number, resulting in a very small
positive number, while a large positive exponent returns an increasingly huge positive number. In Definition B-1,
if $e^{-z}$ is extremely small, then the sigmoid function essentially simplifies to 1/1, or approximately 1.
Similarly, if $e^{-z}$ is very large, then it reduces to 1 divided by a very large positive number, or
approximately 0. These statements remain true for arbitrarily large or small values of $e^{-z}$.
Furthermore, the exponent $-z$ is positive when z is negative, as $-1 \cdot -1 = 1$, and it is
negative when z is positive.
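In R the sigmoid is a one-line function; the sketch below simply evaluates it at the three example points used above.

```r
# The sigmoid function of Definition B-1, squashing any real z into the interval (0, 1).
sigmoid <- function(z) 1 / (1 + exp(-z))

sigmoid(c(-10, 0, 10))   # approximately 0.00005, 0.5, and 0.99995
```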
Suppose we feed our linear combinations of weights and biases into the sigmoid function. There
are several reasons to do this, but the main reason is that we want to make small adjustments to
decisions based on small adjustments to the variables or to the importance of those variables (the weights). The
sigmoid function is useful for this because it takes any value of z in R and outputs a continuous range of
values between 0 and 1 rather than just 0 or 1. In this case, a 0 represents no confidence in a particular
decision, such as vision score and melee jungle matchups having influence on winning or losing a Taric jungle
game, and a 1 represents complete confidence. It is akin to a real brain in considering multiple options in
a decision-making process.
Equation B-2 demonstrates what this looks like:
Equation B-2

$$\sigma(z) = \sigma\!\left(\sum_{j=1}^{n} w_j x_j + b\right) = \frac{1}{1 + e^{-\left(\sum_{j=1}^{n} w_j x_j + b\right)}}$$

Where the input of the sigmoid function is $z = \sum_{j=1}^{n} w_j x_j + b$. Clearly, a small change in the weights or
biases of a particular neuron will result in a small change in the output of that neuron, given by σ(𝑧). We
do not only use the sigmoid function in the hidden layer but also in the output layer of the neural network.
Suppose we wish to know whether a given game will be won after putting all the information for 21
predictor variables in the input layer through one or more hidden layer transformations. The output value
must either be a win or a loss, and whatever the value is, it will be transformed by the sigmoid function to
be between 0 and 1. Then a classifier, called the Bayes classifier, is used to say any value of 0.5 or greater
denotes a win, anything below denotes a loss. Why? Because a 1 indicates a 100% chance of winning, 0 a
0% chance of winning, and 0.5 or greater a 50% or above chance of winning. It is natural then, to expect a
win if the chance of winning is greater than 0.5 and so the Bayes Classifier classifies output layer data in
this manner.
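Putting Equation B-2 and the 0.5 cutoff together, a single sigmoid neuron can be sketched in a few lines of R. The weights, bias, and inputs below are made-up placeholders for illustration; they are not the fitted values of the Taric jungle network.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Activation of one sigmoid neuron: sigma of the weighted sum plus the bias (Equation B-2).
neuron_activation <- function(x, w, b) {
  sigmoid(sum(w * x) + b)
}

# Placeholder values only: three inputs, three weights, one bias.
x <- c(0.8, 0.2, 0.5)
w <- c(1.5, -0.7, 0.3)
b <- -0.4

a <- neuron_activation(x, w, b)   # approximately 0.69, strictly between 0 and 1
ifelse(a >= 0.5, "win", "loss")   # Bayes classifier at the 0.5 cutoff: "win"
```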
We have discussed the essentials of the neural network; now we aim to get into greater detail
about how it works. At the beginning of this section, we mentioned that the neural network is a function,
taking 21 input variables and producing 1 output variable: $f(x_1, x_2, \ldots, x_{21}) = y(x) = y$. The goal of this
function is to classify any given game in the n = 151 test observations as a win or a loss, represented as a 1 or
a 0 numerically, based on the values of the variables $x_1, x_2, \ldots, x_{21}$ for those games. In Season 11 Taric
Jungle Metrics, we used a different function, a logistic function, for such a purpose. That function had an
error rate (around 6%) because sometimes it would classify a game as a win when it was a loss or vice
versa. We call the actual result of a given Taric jungle game y, since it came from the data itself. The
predicted value, for our neural network, will be the value of the single neuron in the output layer. As
previously mentioned, the value of neurons in a neural network is called the neuron’s activation and so it
will be denoted a.
The difference between the actual result and the activation is called the error and is given by y(x)
– a. The total amount of error is the sum of the squared differences, given by $\sum_x |y(x) - a|^2$. Why this
expression? Because if we want the total amount of error between our expected and actual outputs, or y-
values, then we need a cumulative sum. That is, if there is a negative error and an equal positive error,
adding them together will cancel them out, giving the misleading perception that there is no error. The $|\cdot|$
operator takes the absolute value of y(x) – a, and squaring it gives a positive squared error for every
observation, so no cancellation can occur. This squared error is then summed over all
n observations and averaged. The goal of this paper is the creation of a neural network that
correctly predicts the result of a Taric jungle game at least 93.69% of the time (the success rate of the
logistic model). This will be done by minimizing error which is done with the cost function.
Equation B-3
$$C(w, b) = \frac{1}{2n} \sum_{x} |y(x) - a|^2$$

Denotes the average cost function of a neural network. It is called the quadratic cost function because the
squared error $|y(x) - a|^2$ has the form of a quadratic, $y = x^2$. The cost function is a function of the weights
and biases, and it outputs the average error of the neural network between the
actual result y(x) and the activation value a, of the neuron in the output layer (only a single neuron for our
Taric jungle network). Since the cost function denotes the average error of the neural network, we wish to
minimize its value. Minimizing the cost function requires gradient descent, which we mentioned earlier
and will discuss in detail here.
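For reference, the quadratic cost of Equation B-3 is a one-liner in R. The sketch below compares a vector of actual results y against a vector of output activations a; both vectors are invented for illustration.

```r
# Quadratic cost of Equation B-3, averaged over the n observations.
quadratic_cost <- function(y, a) {
  sum((y - a)^2) / (2 * length(y))
}

y <- c(1, 0, 1, 1)           # actual results: win, loss, win, win
a <- c(0.9, 0.2, 0.6, 0.8)   # activations of the output neuron for those games
quadratic_cost(y, a)         # 0.03125; smaller values mean less average error
```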
The gradient is a multidimensional vector that points in the direction of greatest increase from
any given point in some real numbered space. It follows, then, that if we multiply the gradient by a
negative constant, say −η, that the gradient will instead point in the direction of greatest decrease from
the same point. The value of η varies, but we want it to be able to reach a global minimum for the cost
function without skipping over the minimum or getting stuck in a local minimum. Mathematically, η scales
the size of each downhill step of gradient descent. This causes the network
to reduce its average error more quickly, and as a result η is called the learning rate. The
larger η is, the faster the learning, and vice versa.

We denote the gradient of the cost function C(w, b) as $\nabla C$. Since C is defined in terms of w and b,
the full gradient vector is given by $\nabla C = \left(\frac{\partial C}{\partial w}, \frac{\partial C}{\partial b}\right)$, and the instantaneous change in C in the direction of
$-\eta \nabla C$ is written as $\Delta C = \nabla C \cdot (-\eta \nabla C) = -\eta |\nabla C|^2$, which we know is true from Equation A-3. So, if the
cost function has a particular value, and it is changed by $-\eta |\nabla C|^2$, its new value is $C' = C - \eta |\nabla C|^2$.
Because C is given in terms of w and b, the weights and biases, the new value of C can be rewritten in
terms of w and b.
Equation B-4
$$w' = w - \eta \frac{\partial C}{\partial w}$$

$$b' = b - \eta \frac{\partial C}{\partial b}$$

Where the change in w and b is given by $-\eta \frac{\partial C}{\partial w}$ and $-\eta \frac{\partial C}{\partial b}$, the components of $-\eta \nabla C$. This is
gradient descent, where the error of the neural network (the cost function) is decreased in the direction of greatest
decrease (the negative gradient) with respect to the function’s components, the weights and the biases.
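The update in Equation B-4 can be sketched in R as a loop that repeatedly nudges a weight and bias downhill on the cost surface. The toy example below fits a single sigmoid neuron to four invented observations; the data, the learning rate eta, and the number of steps are illustrative assumptions, not the settings used for the full Taric jungle network.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Invented data: one predictor x and the actual result y for four games.
x <- c(-2, -1, 1, 2)
y <- c(0, 0, 1, 1)

w <- 0; b <- 0      # start from zero weight and bias
eta <- 0.5          # the learning rate

for (step in 1:1000) {
  a <- sigmoid(w * x + b)                # activations for all four observations
  # Partial derivatives of the quadratic cost C(w, b) with respect to w and b
  dC_dw <- mean((a - y) * a * (1 - a) * x)
  dC_db <- mean((a - y) * a * (1 - a))
  w <- w - eta * dC_dw                   # Equation B-4: w' = w - eta * dC/dw
  b <- b - eta * dC_db                   # Equation B-4: b' = b - eta * dC/db
}

round(sigmoid(w * x + b), 2)   # activations now close to the actual results 0, 0, 1, 1
```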
Now that we know how error is minimized to make neural networks perform well on their
training and test data, we take a step back to discuss neural networks in broad terms. Neural networks are
a modeling tool designed to teach computers how to solve problems on their own. Graphically they are
represented by Figure B-1 below:
Figure B-1

Where the circles on the graph represent the neurons in each layer. In the first layer of Figure B-1, the
input layer, there are five circles and therefore five neurons. There are three neurons in the hidden layer
and one in the output layer. The arrowed lines are synapses and indicate where activations, the numbers
contained in each neuron, are going. In Figure B-1, all five neurons send information from the input layer
to each of the three neurons in the hidden layer. The activation for each neuron in the hidden layer is then
given by a linear combination of the activations in the input layer, those activations being the values of
$x_1, x_2, x_3, x_4$ and $x_5$ respectively. That linear combination is, in general, given by $a_1 =
f(w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 + w_5 x_5 + b_1)$, where $a_1$ denotes the activation for the first neuron of the
hidden layer and f denotes some nonlinear transformation function such as the sigmoid. Since there are three of these neurons
there are three such combinations, each with different weights and biases. A similar combination
exists for the output neuron, which gives the result of the neural network.
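A forward pass through a network shaped like Figure B-1 (five inputs, three hidden neurons, one output) takes only a few lines of R using matrix multiplication. The weights and inputs below are random placeholders, not the fitted coefficients of the Taric jungle network; counting the entries of W1, b1, W2, and b2 gives 15 + 3 + 3 + 1 = 22 coefficients, the figure quoted in the next paragraph.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

set.seed(1)
x  <- runif(5)                        # five input activations (placeholder values)
W1 <- matrix(rnorm(3 * 5), nrow = 3)  # weights from the 5 inputs to the 3 hidden neurons
b1 <- rnorm(3)                        # one bias per hidden neuron
W2 <- matrix(rnorm(1 * 3), nrow = 1)  # weights from the 3 hidden neurons to the 1 output
b2 <- rnorm(1)                        # bias of the output neuron

hidden <- sigmoid(W1 %*% x + b1)       # activations of the hidden layer
output <- sigmoid(W2 %*% hidden + b2)  # activation of the output neuron, between 0 and 1
ifelse(output >= 0.5, "win", "loss")   # Bayes classifier on the output
```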
To reduce error, the weights and biases are changed in order to minimize the cost function,
Equation B-3, in accordance with the calculus shown throughout this paper. Neural networks, particularly
networks with many hidden layers, are known for high accuracy. This is due to repeated optimization of
parameters using cost functions like the one in Equation B-3. For example, in Figure B-1, which is a very
simple neural network, there are 22 coefficients for a five-variable model. This is roughly four times the
complexity of a five-variable logistic model. And as we will see in the following section, our neural
network for Taric jungle is considerably more complicated.

Neural Networks – Applications to Taric Jungle


In this section we will present our neural network for Taric jungle, demonstrate its classification
accuracy, and compare it to the logistic and tree-based regression methods used in Season 11 Taric Jungle
Metrics.

Figure C-1

6-layer neural network for classifying lightrocket2’s Taric jungle games as wins or losses. Test set size n = 151
observations. 21 predictor variables and one qualitative response, either a win or a loss. The network also includes
701 coefficients, those being 660 weights and 41 biases. Due to the complexity of the network, we do not provide its
full formula in this report.

Figure C-1 above illustrates the neural network for a sample of lightrocket2’s
Season 11 Taric jungle games. It is a 6-layer neural network consisting of 1 input layer of 21 predictor
variables $x_1, \ldots, x_{21}$, four hidden layers with their respective activations, and one output layer, which is
the probability of lightrocket2 winning a given Taric jungle game based on the predictors. The goal of this
paper is to test the viability of neural networks in correctly predicting wins and losses for Taric jungle
games. This is demonstrated in Table C-1 below:
Table C-1

Avg. Correctly Classified    Avg. Incorrectly Classified    Avg. Accuracy
146                          5                              96.69%
Max Correctly Classified     Max Incorrectly Classified     Max Accuracy
150                          8                              99.33%
Min Correctly Classified     Min Incorrectly Classified     Min Accuracy
143                          1                              94.41%

The neural network was run 50 times in R, with each run having no influence on the previous
one. Of the 151 test observations, the network, on average classified 146 of them correctly and 5
incorrectly for an average accuracy of 96.69%. Its maximum accuracy was 99.33%, which only occurred
once (2% of the time). The network was never completely accurate in predicting wins and losses, though
it did sometimes correctly predict all losses, or all wins separately. The network’s minimum accuracy was
94.41%, incorrectly classifying 8 games, and this occurred five times, or 10% of the time. In other words,
the network was five times more likely to obtain its minimum accuracy than its maximum accuracy,
dragging the overall accuracy down as a result.
Table C-2

Method                  No. of Coefficients    Test Accuracy
Tree Diagram            5                      86.75%
Logistic Regression     6                      93.69%
Neural Network          701                    96.69%

Given the very high complexity of the neural network we would expect to see significant
increases in accuracy. Table C-2 shows the complexity and accuracy of three classification models, the
neural network, logistic regression, and the tree diagram in predicting Season 11 Taric jungle wins and
losses for n = 151 test observations. The tree diagram was the simplest model, including the fewest
predictor variables and thus the fewest coefficients at 5. It also had the lowest classification rate at
86.75% accuracy for the test data. The logistic model, a six-variable model presented in Season 11 Taric
Jungle Metrics, was considerably better, with a 6.94% increase in accuracy at the expense of one additional
coefficient. As we expected, the neural network presented in this paper was the most accurate, at 96.69%
accuracy after running it 50 times, or a 3% improvement on logistic regression and a 9.94% improvement
on the tree model. To put this in perspective, logistic regression correctly classified roughly 141 of the 151 Taric
jungle games, incorrectly classifying about 10. The neural network has, on average, cut that
error in half, at the expense of being 117 times more complex.
Conclusion
Is the neural network worth using over logistic regression? That is, is a modest improvement in
accuracy worth the cost of greatly increased complexity and reduced interpretability? Consider the following: the
logistic model has a simple formula and was able to be input into online calculators to make live
predictions for given games. The neural network can do the same but to regularly use it one must input
the entire equation with its 701 coefficients and 21 variables. While it may capture a broad variety of
factors influencing a given game’s outcome, increasing its accuracy as a result, the network is too
cumbersome to use easily. Indeed, the neural network cannot be easily understood in detail and can only
be understood at a broad level. Neither of these things would be a problem if logistic regression were ill-
suited for Taric jungle data; however, it is only 3% less accurate than the neural network despite being
considerably less complex. We conclude that the neural network is better suited to more complex problems and
default to logistic regression for our data.
This paper is an addendum to Season 11 Taric Jungle Metrics and is intended to both be a
research paper and an instructive document for those curious about the fundamentals of neural networks.
It was interesting to do this research and apply mathematical methods to lightrocket2’s Taric jungle
games. I hope you have found these papers as interesting to read as I have found them to write. Thanks
for reading.
Cited Works

James, Gareth, et al. An Introduction to Statistical Learning with Applications in R, Second
Edition. Aug. 2021, https://hastie.su.domains/ISLR2/ISLRv2_website.pdf.

Nielsen, Michael A. Neural Networks and Deep Learning. Determination Press, 2015,
http://neuralnetworksanddeeplearning.com/index.html.

Sanderson, Grant, director. But What Is a Neural Network? | Chapter 1, Deep Learning.
YouTube, 3blue1brown, 5 Oct. 2017, https://youtu.be/aircAruvnKk. Accessed 25 Jan. 2022.

Dawkins, Paul. “Directional Derivatives.” Calculus III - Directional Derivatives, Paul Dawkins,
10 Mar. 2021, https://tutorial.math.lamar.edu/classes/calciii/directionalderiv.aspx.
