Biomedical Answers


Question#1:

Part (a)

Let's go through the steps to prove that PCA and a linear autoencoder with linear activation yield
the same result for dimensionality reduction and that the optimization problems are identical.

(a) To show that the term z̅ in Eqn 3 is the zero vector, i.e., z̅ = 0, let's calculate it:

z̅ = (1/N) * Σ z(i)

Now, using the definition of z(i) in PCA:

z(i) = U^T(x(i) - μ)

So, z̅ = (1/N) * Σ [U^T(x(i) - μ)]

Now, substitute the definition of μ from Eqn 1:

z̅ = (1/N) * Σ [U^T(x(i) - (1/N) * Σ x(i))]


Since U^T is a constant matrix, you can take it out of the summation:

z̅ = U^T * (1/N) * Σ [x(i) - (1/N) * Σ x(i)]

Now, the inner term (1/N) * Σ x(i) is exactly μ by Eqn 1, so the bracket is x(i) - μ. Distributing the outer sum over the two parts of the bracket, the first part gives (1/N) * Σ x(i) = μ again, and the second part sums the constant μ over all N samples, giving (1/N) * N * μ = μ:

z̅ = U^T * (1/N) * Σ [x(i) - μ]

z̅ = U^T * [(1/N) * Σ x(i) - (1/N) * Σ μ]

z̅ = U^T * [μ - (1/N) * N * μ]

z̅ = U^T * [μ - μ]

z̅ = U^T * 0 = 0

Hence z̅ is the zero vector: because the data are centred by μ before being projected with U^T, the resulting codes z(i) are centred as well. This holds for any dataset size N, and it is what lets us drop z̅ from Eqn 3 in part (b).
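As a quick numerical sanity check of this result, here is a minimal sketch with synthetic data; the variable names and the SVD-based construction of U are only illustrative, not part of the original problem:

```python
import numpy as np

# Synthetic data: N samples of dimension D, reduced to K components.
rng = np.random.default_rng(0)
N, D, K = 200, 5, 2
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))  # correlated features

mu = X.mean(axis=0)                      # Eqn 1: dataset mean
Xc = X - mu                              # centre the data

# Principal directions: top-K right singular vectors of the centred data.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
U = Vt[:K].T                             # D x K, orthonormal columns

Z = Xc @ U                               # rows are z(i) = U^T (x(i) - mu)
z_bar = Z.mean(axis=0)                   # should be the zero vector

print(np.allclose(z_bar, 0.0))           # True (up to floating-point error)
```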

(b) To show that PCA and a linear autoencoder yield the same result, we'll compare the
optimization problems in Eqn 3 and Eqn 4:

Eqn 3 (PCA):
max (1/N) * Σ ∥z(i) - z̅ ∥^2

Eqn 4 (Linear Autoencoder):


min (1/N) * Σ ∥x(i) - x̃(i)∥^2
Now, let's show that these optimization problems are equivalent. We'll use Eqn 1 (μ = (1/N) * Σ
x(i)) and the result from part (a) (z̅ = 0):

For PCA (Eqn 3):


max (1/N) * Σ ∥z(i) - z̅ ∥^2

Substitute z̅ = 0 from part (a):

max (1/N) * Σ ∥z(i) - 0∥^2

max (1/N) * Σ ∥z(i)∥^2

For Linear Autoencoder (Eqn 4):


min (1/N) * Σ ∥x(i) - x̃(i)∥^2

Using x̃(i) = μ + U * z(i), and μ = (1/N) * Σ x(i):

min (1/N) * Σ ∥x(i) - (μ + U * z(i))∥^2

Now, x̃(i) = μ + U * z(i) is the orthogonal projection of x(i) onto the K-dimensional affine
subspace S that passes through μ and is spanned by the columns of U. The residual x(i) - x̃(i) is
therefore orthogonal to the in-subspace displacement x̃(i) - μ = U * z(i). Applying Pythagoras'
Theorem to the right triangle with vertices x(i), x̃(i), and μ:

∥x(i) - μ∥^2 = ∥x(i) - (μ + U * z(i))∥^2 + ∥U * z(i)∥^2

Rearranging for the reconstruction error:

∥x(i) - (μ + U * z(i))∥^2 = ∥x(i) - μ∥^2 - ∥U * z(i)∥^2

So, the Linear Autoencoder optimization problem becomes:

min (1/N) * Σ [∥x(i) - μ∥^2 - ∥U * z(i)∥^2]

min (1/N) * Σ ∥x(i) - μ∥^2 - (1/N) * Σ ∥U * z(i)∥^2

Now, the first term (1/N) * Σ ∥x(i) - μ∥^2 is the total variance of the data; it does not depend on
U or on the z(i), so it is a constant with respect to the optimization. Minimizing the whole
expression is therefore the same as maximizing the term being subtracted:

max (1/N) * Σ ∥U * z(i)∥^2

Since the columns of U are orthonormal (U^T U = I), we have ∥U * z(i)∥^2 = z(i)^T U^T U z(i) =
∥z(i)∥^2. Combining this with z̅ = 0 from part (a), the Linear Autoencoder objective reduces to:

max (1/N) * Σ ∥z(i)∥^2 = max (1/N) * Σ ∥z(i) - z̅∥^2

which is exactly the PCA optimization problem.

Hence, we have shown that the optimization problems in Eqn 3 (PCA) and Eqn 4 (Linear
Autoencoder) are identical. Therefore, PCA and a linear autoencoder with linear activation yield
the same result for dimensionality reduction.
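The equivalence can also be illustrated numerically. The sketch below (synthetic data, illustrative names) verifies that, for any orthonormal basis U, the constant total energy (1/N) * Σ ∥x(i) - μ∥^2 splits into projected energy plus reconstruction error, so the basis with the largest PCA objective (the principal subspace) is also the one with the smallest autoencoder objective:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, K = 500, 6, 2
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))
mu = X.mean(axis=0)
Xc = X - mu

def objectives(U):
    """Projected energy (PCA objective) and reconstruction error (AE objective)."""
    Z = Xc @ U                                       # z(i) = U^T (x(i) - mu)
    X_rec = mu + Z @ U.T                             # x~(i) = mu + U z(i)
    proj = np.mean(np.sum(Z**2, axis=1))             # (1/N) sum ||z(i)||^2
    rec = np.mean(np.sum((X - X_rec)**2, axis=1))    # (1/N) sum ||x(i) - x~(i)||^2
    return proj, rec

# PCA basis: top-K right singular vectors of the centred data.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
U_pca = Vt[:K].T

# A random orthonormal K-dimensional basis for comparison.
U_rand, _ = np.linalg.qr(rng.normal(size=(D, K)))

total = np.mean(np.sum(Xc**2, axis=1))   # (1/N) sum ||x(i) - mu||^2, a constant
for name, U in [("PCA", U_pca), ("random", U_rand)]:
    proj, rec = objectives(U)
    # Pythagoras: constant total = projected energy + reconstruction error.
    print(name, np.isclose(total, proj + rec), round(proj, 3), round(rec, 3))
# The PCA basis shows the largest projected energy and hence the smallest
# reconstruction error, matching the min/max equivalence derived above.
```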

Question#2:
Let's break down this problem into parts:
(a) Gradients for Batch Normalization: The normalized input for each component k is:
xˆ(i)_k = (x(i)_k - µB_k) / sqrt((σB_k)^2 + ε)
The output is calculated as:
y(i)_k = γ * xˆ(i)_k + β
We want to compute the gradients ∂L_B / ∂(σB_k)^2, ∂L_B / ∂µB_k, and ∂L_B /
∂x(i)_k.

i. ∂L_B / ∂(σB_k)^2:

Because (σB_k)^2 enters y(i)_k for every sample i in the batch, the chain rule sums over the batch:

∂L_B / ∂(σB_k)^2 = Σ_i ∂L_B / ∂y(i)_k * ∂y(i)_k / ∂(σB_k)^2

Differentiating y(i)_k = γ * (x(i)_k - µB_k) * ((σB_k)^2 + ε)^(-1/2) + β with respect to (σB_k)^2:


∂y(i)_k / ∂(σB_k)^2 = γ * (x(i)_k - µB_k) * (-1/2) * ((σB_k)^2 + ε)^(-3/2)

Therefore, the gradient is:


∂L_B / ∂(σB_k)^2 = Σ_i ∂L_B / ∂y(i)_k * γ * (x(i)_k - µB_k) * (-1/2) * ((σB_k)^2 + ε)^(-3/2)

ii. ∂L_B / ∂µB_k:

Again the sum runs over the batch, since µB_k affects every sample. Holding (σB_k)^2 fixed:

∂L_B / ∂µB_k = Σ_i ∂L_B / ∂y(i)_k * ∂y(i)_k / ∂µB_k

Differentiating y(i)_k with respect to µB_k:


∂y(i)_k / ∂µB_k = γ * (-1 / sqrt((σB_k)^2 + ε))

Therefore, the gradient is:


∂L_B / ∂µB_k = Σ_i ∂L_B / ∂y(i)_k * γ * (-1 / sqrt((σB_k)^2 + ε))

(If the dependence of (σB_k)^2 on µB_k is also taken into account, an extra term
∂L_B / ∂(σB_k)^2 * (1/N) * Σ_i (-2) * (x(i)_k - µB_k) appears, but it vanishes because
Σ_i (x(i)_k - µB_k) = 0.)

These gradients are essential for training and optimizing the batch normalization layer
during backpropagation in a neural network.
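These formulas can be sanity-checked with finite differences. The sketch below is illustrative only: it treats µB_k and (σB_k)^2 as independent inputs to the forward pass (matching the partial derivatives above) and uses an arbitrary upstream gradient ∂L_B/∂y(i)_k:

```python
import numpy as np

def bn_forward(x, mu, var, gamma, beta, eps=1e-5):
    """Batch-norm output for one feature, with mu and var as explicit inputs."""
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(2)
N, eps = 8, 1e-5
x = rng.normal(size=N)                   # one component k across a batch of N samples
gamma, beta = 1.7, 0.3
mu, var = x.mean(), x.var()

# Arbitrary upstream gradient dL_B/dy(i), i.e. a toy loss L_B = sum_i w(i) * y(i).
w = rng.normal(size=N)

# Analytic gradients from the formulas above (summing over the batch).
dL_dvar = np.sum(w * gamma * (x - mu) * (-0.5) * (var + eps) ** (-1.5))
dL_dmu = np.sum(w * gamma * (-1.0) / np.sqrt(var + eps))

# Central finite differences, perturbing var and mu as independent inputs.
h = 1e-6
num_dvar = ((w * bn_forward(x, mu, var + h, gamma, beta)).sum()
            - (w * bn_forward(x, mu, var - h, gamma, beta)).sum()) / (2 * h)
num_dmu = ((w * bn_forward(x, mu + h, var, gamma, beta)).sum()
           - (w * bn_forward(x, mu - h, var, gamma, beta)).sum()) / (2 * h)

print(np.isclose(dL_dvar, num_dvar), np.isclose(dL_dmu, num_dmu))  # True True
```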

(b) In practice, during inference (testing or prediction), the batch normalization layer does not
compute statistics from the incoming sample. Instead, it uses a running mean and a running
variance that were accumulated during training, typically as exponential moving averages of the
per-batch mean and variance over all the mini-batches seen. When a single sample is run through
the model at inference time, the layer normalizes it with these running statistics rather than with
statistics computed from that one sample. This matters for two reasons. First, with a single sample
the batch mean equals the sample itself, so subtracting it would wipe out the sample entirely (the
normalized value would be zero and the output would collapse to β regardless of the input), and
the batch variance would be zero. Second, the running statistics keep the normalization consistent
with what the network, and the learned parameters γ and β, saw during training.

In summary, the batch normalization layer handles this case by using running statistics (running
mean and running variance) instead of per-sample statistics during inference, so the normalization
remains consistent with training and individual samples are not destroyed by the normalization
step.
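As an illustration, here is a minimal sketch of how such a layer might maintain and use running statistics. This is not any particular framework's API; the momentum value, initialization, and single-feature interface are assumptions made for the example:

```python
import numpy as np

class BatchNorm1D:
    """Minimal batch-norm for a single feature, illustrating running statistics."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.gamma, self.beta = 1.0, 0.0
        self.momentum, self.eps = momentum, eps
        self.running_mean, self.running_var = 0.0, 1.0

    def forward(self, x, training):
        if training:
            mu, var = x.mean(), x.var()              # batch statistics
            # Exponential moving averages accumulated over training batches.
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mu
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            # Inference: reuse the running statistics, never the sample's own.
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

bn = BatchNorm1D()
rng = np.random.default_rng(3)
for _ in range(200):                                 # training batches update the EMAs
    bn.forward(rng.normal(loc=5.0, scale=2.0, size=32), training=True)

single = np.array([9.0])
print(bn.forward(single, training=False))  # uses running mean/var: informative output
print(bn.forward(single, training=True))   # batch of one: x - mean(x) = 0, output collapses to beta
```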
