625 Preliminary


Question-2:

a)
pairwise plot and summary:

From the correlation matrix and the pairwise plots, here are some observations:

1. **Strong Positive Correlation**:
   - There is a strong positive correlation (0.842) between "Year" and "Volume". This suggests that the trading volume tends to increase over the years.

2. **Weak Correlation**:
   - Most of the correlations between lag variables ("Lag1" to "Lag5") and other variables, including "Today" and "Volume", are weak. For example, the correlations between lag variables and "Today" are generally around -0.07 to 0.06, indicating weak associations.

3. **Weak Negative Correlation**:
   - There is a weak negative correlation (-0.033) between "Volume" and "Today". This suggests a slight negative relationship between trading volume and today's returns.

4. **Correlation with "Today"**:
   - The correlations between lag variables and "Today" are generally weak, with values ranging from -0.075 to 0.059.

5. **Correlation with "Volume"**:
   - Except for the strong positive correlation with "Year", correlations between other variables and "Volume" are relatively weak.

Overall, the correlation matrix suggests that there are mostly weak correlations between
variables in the `Weekly` data set. The strongest correlation is observed between "Year" and
"Volume", indicating an increasing trend in trading volume over the years. However, it's
important to remember that correlation does not imply causation, and further analysis may be
needed to understand the underlying relationships between variables.
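These observations can be reproduced with a short R sketch, assuming the `Weekly` data set from the ISLR package:

```r
# A minimal sketch, assuming the Weekly data from the ISLR package
library(ISLR)

summary(Weekly)      # numerical summary of each variable
pairs(Weekly)        # pairwise scatterplot matrix
cor(Weekly[, -9])    # correlation matrix, excluding the qualitative Direction column
```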
b)
Lag2: This predictor has a statistically significant coefficient with a p-value of 0.0296, denoted
by a single asterisk (*). The coefficient estimate for Lag2 is 0.05844, indicating that a one-unit
increase in Lag2 is associated with an increase in the log-odds of the response variable by
0.05844 units, holding other predictors constant.

Intercept: The intercept term also appears to be statistically significant with a p-value of 0.0019,
indicated by double asterisks (**). The intercept represents the log-odds of the response variable
when all predictor variables are zero.

Other Predictors (Lag1, Lag3, Lag4, Lag5, Volume): These predictors do not appear to be
statistically significant, as their p-values are greater than 0.05. Specifically, Lag1 has a p-value of
0.1181, Lag3 has a p-value of 0.5469, Lag4 has a p-value of 0.2937, Lag5 has a p-value of
0.5833, and Volume has a p-value of 0.5377.

In summary, only Lag2 and the intercept term are statistically significant predictors in the
logistic regression model. The other lag variables (Lag1, Lag3, Lag4, Lag5) and the Volume
variable do not appear to have a significant association with the response variable "Direction" in
this model.
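A minimal sketch of the logistic regression fit behind these numbers, assuming the full `Weekly` data with the five lag variables and `Volume` as predictors:

```r
library(ISLR)

# Logistic regression of Direction on the five lag variables and Volume
glm_fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
               data = Weekly, family = binomial)
summary(glm_fit)   # coefficient estimates, standard errors, and p-values
```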
c)

The provided confusion matrix represents the classification performance of a logistic regression
model with respect to predicting the direction of the market.

Here's the interpretation:

- The rows of the confusion matrix represent the actual directions of the market, while the
columns represent the predicted directions by the logistic regression model.
- In this specific confusion matrix:
- Out of 484 instances where the market actually went down (actual "Down"), the model
correctly predicted 54 of them to go down (predicted "Down"). However, it incorrectly predicted
430 instances to go up (predicted "Up") when they actually went down.
- Out of 605 instances where the market actually went up (actual "Up"), the model correctly
predicted 557 of them to go up (predicted "Up"). However, it incorrectly predicted 48 instances
to go down (predicted "Down") when they actually went up.

The overall accuracy of the model, calculated as the fraction of correct predictions, is
approximately 0.561 or 56.1%. This means that the logistic regression model correctly predicted
the direction of the market approximately 56.1% of the time based on the provided data.
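A minimal sketch of how this confusion matrix and accuracy can be obtained, again assuming the `Weekly` data and a 0.5 probability threshold:

```r
library(ISLR)

glm_fit   <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                 data = Weekly, family = binomial)
glm_probs <- predict(glm_fit, type = "response")        # P(Direction = "Up")
glm_pred  <- ifelse(glm_probs > 0.5, "Up", "Down")      # classify with a 0.5 threshold

table(Actual = Weekly$Direction, Predicted = glm_pred)  # rows = actual, columns = predicted
mean(glm_pred == Weekly$Direction)                      # overall accuracy
```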
Comparing the LDA and QDA fits on the same data, the LDA model achieves an overall accuracy of 62.5% versus 58.6% for the QDA model, so the LDA model appears to provide the best results on this data.
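A minimal sketch of the two fits, assuming the conventional setup of training on the years up to 2008 with `Lag2` as the only predictor and testing on 2009-2010; this setup is consistent with the accuracies reported above:

```r
library(ISLR)
library(MASS)

train <- Weekly$Year < 2009          # training period: 1990-2008
test  <- Weekly[!train, ]            # test period: 2009-2010

lda_fit <- lda(Direction ~ Lag2, data = Weekly, subset = train)
qda_fit <- qda(Direction ~ Lag2, data = Weekly, subset = train)

c(LDA = mean(predict(lda_fit, test)$class == test$Direction),
  QDA = mean(predict(qda_fit, test)$class == test$Direction))   # test-set accuracies
```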
Scatterplot:
SVM Type and Kernel: The SVM model is a C-classification type, indicating that it is designed
for classification tasks. The radial kernel is employed for modeling the decision boundary, which
is flexible and suitable for nonlinear classification problems.

Cost Parameter: The cost parameter controls the penalty for misclassification. A small value of
cost (0.01 in this case) imposes a relatively low penalty on margin violations, which leads to a
wide, soft margin, a large number of support vectors, and a smoother, more conservative decision boundary.

Number of Support Vectors: The number of support vectors is 629, with 313 belonging to one
class and 316 belonging to the other. Support vectors are the data points that lie closest to the
decision boundary and are crucial for defining the decision boundary.

Number of Classes: The model is trained to classify data into two classes, indicated by the levels
"CH" (Citrus Hill) and "MM" (Minute Maid).
Repeating (b) and (c) with a radial kernel, and again with a polynomial kernel (degree = 2):

For the SVM with a radial kernel:
- Training error rate: 0.39125
- Test error rate: 0.3851852

For the SVM with a polynomial kernel (degree = 2):
- Training error rate: 0.3725
- Test error rate: 0.3740741
Comparing these results, we can see that the SVM with a polynomial kernel has lower training
and test error rates compared to the SVM with a radial kernel. Therefore, based on these error
rates, the SVM with a polynomial kernel seems to give better results on this data.
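A minimal sketch of this comparison, assuming the same kind of 800-observation train/test split as above and cost = 0.01 for both kernels:

```r
library(ISLR)
library(e1071)

set.seed(1)
train_idx <- sample(nrow(OJ), 800)
oj_train  <- OJ[train_idx, ]
oj_test   <- OJ[-train_idx, ]

# Fraction of misclassified observations for a fitted SVM on a given data set
err_rate <- function(fit, data) mean(predict(fit, data) != data$Purchase)

svm_radial <- svm(Purchase ~ ., data = oj_train, kernel = "radial",     cost = 0.01)
svm_poly   <- svm(Purchase ~ ., data = oj_train, kernel = "polynomial", cost = 0.01, degree = 2)

c(radial_train = err_rate(svm_radial, oj_train),
  radial_test  = err_rate(svm_radial, oj_test),
  poly_train   = err_rate(svm_poly,   oj_train),
  poly_test    = err_rate(svm_poly,   oj_test))
```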
The result optimal_degree = 1 indicates that, after using cross-validation (or another selection
approach) to choose the degree for the polynomial regression model, a polynomial of degree 1 was
selected as optimal.
In polynomial regression, the degree of the polynomial represents the complexity of the model. A
polynomial of degree 1 corresponds to a linear regression model, where the relationship between
the predictor (dis in this case) and the response (nox) is assumed to be linear.

When optimal_degree = 1, it suggests that the linear model (degree 1 polynomial) provided the
best balance between model complexity and performance on unseen data, as evaluated by the
cross-validation or selection method used. This means that the data does not exhibit strong non-
linear patterns that would require higher-degree polynomials to capture.

Overall, the result implies that a simpler linear model is sufficient for modeling the relationship
between dis and nox in this particular dataset.
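A minimal sketch of how such a degree could be selected, assuming the `Boston` data from the MASS package, `cv.glm()` from the boot package, and 10-fold cross-validation over degrees 1-10:

```r
library(MASS)
library(boot)

set.seed(1)
cv_errors <- sapply(1:10, function(d) {
  fit <- glm(nox ~ poly(dis, d), data = Boston)   # degree-d polynomial regression of nox on dis
  cv.glm(Boston, fit, K = 10)$delta[1]            # 10-fold CV estimate of test MSE
})
optimal_degree <- which.min(cv_errors)
optimal_degree
```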
In the simulated data set provided:
- n: the number of observations is 100.
- p: the number of predictors is 1.

The model can be written as Y = β0 + β1·X + β2·X² + ε, i.e., a quadratic function of the single predictor X plus a random error term, with β2 < 0 (which produces the downward quadratic trend described below).
The scatterplot shows a downward quadratic trend, consistent with the model used to generate
the data. As the value of x increases, the value of y initially increases but then decreases after
reaching a peak. The highest density of points is indeed observed near the center (around zero),
and it decreases as we move away from zero along the x-axis. This pattern aligns with the
quadratic relationship specified in the model, where the response variable y is influenced by the
predictor variable x and its square term.
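A minimal sketch of generating and plotting data of this kind; the specific coefficients and noise level below are illustrative assumptions, not necessarily the values used in the original simulation:

```r
set.seed(1)
n <- 100
x <- rnorm(n)
y <- x - 2 * x^2 + rnorm(n)   # assumed quadratic model with a negative squared term

plot(x, y,
     main = "Simulated data with a downward quadratic trend",
     xlab = "x", ylab = "y")
```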
The results are not the same: the LOOCV errors differ for each random seed because the process of
generating the simulated data involves random noise. Since the data points are generated randomly,
different random seeds produce different sets of data points, and consequently different models are
fitted to them. As a result, the fitted models have different parameters and performance metrics,
such as the LOOCV errors, so we observe variability in the LOOCV errors when different random
seeds are used.
Since the goal is to minimize the LOOCV error, the model with the smallest LOOCV error is
considered the best performer. In this case, Model 2, the quadratic model, has the smallest LOOCV
error of 93.74236, so among the models tested it provides the best predictive performance when
evaluated using LOOCV. The additional complexity introduced by Models 3 and 4 does not lead to
better predictive performance, illustrating the trade-off between model complexity and model
accuracy.
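A minimal sketch of computing the LOOCV error for the four candidate models with `cv.glm()`; the simulated x and y are re-created here under the same assumed generating model as in the earlier sketch:

```r
library(boot)

set.seed(1)
x <- rnorm(100)
y <- x - 2 * x^2 + rnorm(100)                   # assumed generating model, as in the earlier sketch
sim_data <- data.frame(x, y)

loocv_errors <- sapply(1:4, function(d) {
  fit <- glm(y ~ poly(x, d), data = sim_data)   # Model d: polynomial of degree d
  cv.glm(sim_data, fit)$delta[1]                # LOOCV (K defaults to n) estimate of test MSE
})
loocv_errors
```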

The statistical significance of coefficient estimates resulting from least squares regression can be
assessed using p-values. Lower p-values indicate greater significance. Cross-validation results
may support models with lower LOOCV errors, but models with higher errors could still have
significant coefficients. It's crucial to consider both predictive performance and coefficient
significance when evaluating regression models.
The estimate of the standard error of the sample mean μ̂ is approximately 0.409. This means that if
we were to repeatedly take samples from the population and calculate the mean, the standard
deviation of these sample means would be around 0.409. In other words, it gives us a measure of how
much the sample mean might vary from one sample to another due to random sampling variability. A
smaller standard error indicates that the sample mean is more likely to be close to the true
population mean, while a larger standard error suggests that there is more uncertainty in the
estimate of the population mean based on the sample.
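A minimal sketch of this plug-in estimate, SE(μ̂) = s/√n; the Boston housing data and its medv column are assumed here as the variable whose population mean is being estimated:

```r
library(MASS)

mu_hat <- mean(Boston$medv)                     # sample mean, the estimate of mu
se_hat <- sd(Boston$medv) / sqrt(nrow(Boston))  # SE(mu_hat) = s / sqrt(n)
c(mu_hat = mu_hat, se_hat = se_hat)
```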

The result from (c), obtained using the bootstrap method, is slightly higher (0.4195471) than the
result from (b) (0.4088611), which was calculated using the standard formula for the standard
error of the sample mean. This difference suggests that the bootstrap method may be slightly
more conservative in estimating the standard error of the population mean. However, both results
are in the same range, indicating consistency in the estimation.
The 95% confidence interval obtained from the bootstrap method is [21.69371, 23.3719], while
the confidence interval obtained from the t.test method is [21.72953, 23.33608]. Both intervals
are very similar in range, indicating that the bootstrap method provides a comparable estimate of
the confidence interval as the t.test method.
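A minimal sketch of the bootstrap standard error and the two confidence intervals, again assuming the Boston medv variable and the boot package; the bootstrap interval is formed as μ̂ ± 2·SE and compared with t.test():

```r
library(MASS)
library(boot)

boot_mean <- function(data, index) mean(data[index])   # statistic: mean of a bootstrap resample

set.seed(1)
boot_out <- boot(Boston$medv, boot_mean, R = 1000)
boot_se  <- sd(boot_out$t[, 1])                        # bootstrap estimate of SE(mu_hat)

mu_hat <- mean(Boston$medv)
c(lower = mu_hat - 2 * boot_se, upper = mu_hat + 2 * boot_se)  # bootstrap-based 95% CI
t.test(Boston$medv)$conf.int                                   # t-based 95% CI for comparison
```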
