Categorical Analysis Assignment
DEPARTMENT OF STATISTICS
ASSIGNMENT
STUDENT NAME                                                                ID
1. GOSA CHALA.……………....…………………………………………….RU1377/14
• Numerical: Descriptive statistics assume that the data under examination is numerical, as these
measures primarily deal with quantitative information.
• Random: These statistics operate under the assumption of a random and representative sample,
ensuring that the calculated values accurately reflect the broader population.
Understanding these assumptions is pivotal for accurate interpretation and application,
reinforcing the importance of meticulous data collection and the consideration of the statistical
context in which descriptive statistics are employed.
How to Perform Descriptive Statistics in SPSS
Performing Descriptive Statistics in SPSS involves several steps. Here’s a step-by-step guide to
assist you through the procedure:
STEP 3: Specify Variables
Upon selecting “Descriptives,” a dialog box will appear. Transfer the continuous variable you
wish to analyze into the “Variable(s)” box.
SPSS Output for Descriptive Statistics
Descriptives
Interpreting the SPSS output for descriptive statistics is pivotal for drawing meaningful
conclusions. Firstly, focus on the measures of central tendency, such as the mean, median, and
mode. These values provide insights into the typical or average score in your dataset. Next,
examine measures of variability, including the range and standard deviation. The range indicates
the spread of scores from the lowest to the highest, while the standard deviation quantifies the
average amount of variation in your data.
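The same measures can be reproduced outside SPSS. Here is a minimal sketch using Python's standard library on a small made-up set of scores (the values are illustrative, not taken from the SPSS example):

```python
import statistics

# Hypothetical test scores (illustrative only)
scores = [62, 70, 70, 75, 81, 88, 94]

mean = statistics.mean(scores)          # typical (average) score
median = statistics.median(scores)      # middle score
mode = statistics.mode(scores)          # most frequent score
data_range = max(scores) - min(scores)  # spread from lowest to highest
stdev = statistics.stdev(scores)        # sample standard deviation

print(round(mean, 2), median, mode, data_range, round(stdev, 2))
```

Together these mirror the central-tendency and variability columns of the SPSS Descriptives output.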
SPSS Statistics for Descriptives
In our example, SPSS Output for Descriptive Statistics, the descriptive statistics provided
describe five variables: “Having lung cancer or not, Gender of respondents, Smoking cigarette,
Age of respondents, Body Mass Index of respondents” based on a sample of 290 individuals.
The descriptive statistics provided in the data output offer a summary of the data related to lung
cancer.
Here's an interpretation of the key statistics:
• Having Lung Cancer or Not
• Mean: 0.39, indicating that less than half of the respondents have lung cancer.
• Standard Deviation (Std. Deviation): 0.489, suggesting moderate variability in
the responses.
• Skewness: -1.806, indicating the data is highly skewed towards respondents not
having lung cancer.
• Gender of Respondents
• Mean: 0.37, suggesting a slightly lower proportion of one gender over the other.
• Std. Deviation: 0.484, showing moderate variability in gender distribution.
• Skewness: -0.141, indicating a slight skewness in gender distribution.
• Smoking Cigarette
• Mean: 0.42, implying that a slightly higher proportion of respondents are
smokers.
• Std. Deviation: 0.495, indicating moderate variability in smoking status among
respondents.
• Skewness: 0.323, showing a slight skew towards more respondents being
smokers.
• Age of Respondents: The statistics for age are not visible in the image provided.
• Body Mass Index (BMI) of Respondents:
The statistics for BMI are also not visible in the image provided.
Valid N (listwise): 290, indicating that all the statistics are based on 290 valid responses.
• The range and kurtosis for each variable are not discussed here as they are not fully
visible in the table. The range would provide insights into the spread of the data, while
kurtosis would indicate the 'tailedness' of the distribution.
• When interpreting these statistics, it's important to consider the context of the study and
the population from which the data was drawn.
• The mean values give us an average, while the standard deviation, skewness, and kurtosis
(where available) describe the spread and shape of each variable's distribution.
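For a 0/1 variable like these, the mean is simply the proportion coded 1, and the standard deviation follows from that proportion. A sketch assuming a 113-of-290 split, which is an assumption chosen only to be consistent with the reported mean of 0.39 (the exact count is not shown in the output):

```python
import statistics

# Assumed split: 113 of 290 respondents coded 1 (has lung cancer).
# This count is hypothetical, picked to match the reported mean of 0.39.
cancer = [1] * 113 + [0] * 177

mean = statistics.mean(cancer)    # proportion with lung cancer
stdev = statistics.stdev(cancer)  # sample standard deviation

print(round(mean, 2), round(stdev, 3))
```

These reproduce the reported Mean (0.39) and Std. Deviation (0.489) for the lung cancer variable.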
Binary Logistic Regression in SPSS is a powerful statistical technique that unlocks insights in
various fields, from healthcare to marketing. In this post, we’ll navigate the intricacies of
binary logistic regression, providing you with a comprehensive understanding of its applications
and modeling; this guide will equip you with the knowledge and skills to leverage binary logistic
regression effectively.
Before delving into binary logistic regression, let’s take a moment to explore the broader
landscape of logistic regression.
There are three primary types:
• Binomial Logistic Regression,
• Multinomial Logistic Regression, and
• Ordinal Logistic Regression.
• Binomial Logistic Regression deals with binary outcomes, where the dependent variable has
only two possible categories, such as yes/no or pass/fail.
• Multinomial Logistic Regression comes into play when the dependent variable has more than
two unordered categories, allowing us to predict which category a case is likely to fall into.
• Ordinal Logistic Regression is employed when the dependent variable has multiple ordered
categories, like low, medium, and high, enabling us to predict the likelihood of a case falling into
or above a specific category.
Our focus will be on Binary Logistic Regression, which is widely used for binary outcomes and
forms the foundation for understanding logistic regression.
Definition: Binary Logistic Regression
Binary Logistic Regression is a statistical method that deals with predicting binary outcomes,
making it an invaluable tool in various fields, including healthcare, finance, and social sciences.
In binary logistic regression, the dependent variable is categorical with only two possible
outcomes, often coded as 0 and 1. This technique allows us to model and understand the
relationship between one or more independent variables and the probability of an event occurring
or not occurring.
Logistic Regression Equation
At the core of Binary Logistic Regression lies the Logistic Regression Equation, which is vital
for understanding the relationship between the predictor variables and the binary outcome. The
equation can be expressed as follows:
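The equation itself is not reproduced in this copy; in standard notation, writing \(p\) for the probability that the outcome equals 1, it reads:

```latex
\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k
\quad\Longleftrightarrow\quad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k)}}
```

The left-hand side is the log odds (logit) of the event; the coefficients \(\beta_i\) are what SPSS reports in the “Variables in the Equation” table.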
Binary Logistic Regression not only predicts the outcome but also provides insights into the
factors that influence the outcome. By examining the coefficients and odds ratios associated with
each predictor variable, analysts can identify the significance and direction of these influences.
This information is invaluable for decision-making and understanding the driving forces behind
binary outcomes in various scenarios, such as predicting customer churn, diagnosing medical
conditions, or assessing the likelihood of loan default.
Assumptions of Binary Logistic Regression
• Logistic regression does not assume a linear relationship between the dependent and
independent variables.
• The independent variables need not be interval, nor normally distributed, nor linearly
related, nor of equal variance within each group.
• Homoscedasticity is not required. The error terms (residuals) do not need to be normally
distributed.
• The dependent variable in logistic regression is not measured on an interval or ratio scale.
The dependent variable must be dichotomous (2 categories) for binary logistic
regression.
• The categories (groups) as a dependent variable must be mutually exclusive and
exhaustive; a case can only be in one group and every case must be a member of one of the
groups.
• Larger samples are needed than for linear regression because maximum likelihood (ML)
coefficient estimates are large-sample estimates. A minimum of 50 cases per predictor is
recommended (Field, 2013).
• Hosmer, Lemeshow, and Sturdivant (2013) suggest a minimum sample of 10 observations
per independent variable in the model, but caution that 20 observations per variable should
be sought if possible.
• Leblanc and Fitzgerald (2000) suggest a minimum of 30 observations per independent
variable.
Hypothesis of Binary Logistic Regression
In Binary Logistic Regression, hypotheses guide the analysis and the interpretation of results.
Specifically, two hypotheses are central to binary logistic regression:
• Null Hypothesis (H0): There is no significant relationship between the independent variables and
the binary outcome.
• Alternative Hypothesis (H1): At least one of the independent variables has a significant effect
on the binary outcome.
Hypothesis testing in binary logistic regression involves examining the significance of the
coefficients associated with each independent variable. If any coefficient has a p-value less than
the chosen significance level (commonly 0.05), it implies that the corresponding variable has a
significant effect on the outcome. Hypothesis testing is a critical step in determining which
predictor variables contribute significantly to the model and understanding their impact on the
binary outcome.
Step by Step: Running Logistic Regression in SPSS Statistics
Now, let’s delve into the step-by-step process of conducting the Binary Logistic
Regression using SPSS Statistics.
Here’s a step-by-step guide on how to perform a Binary Logistic Regression in SPSS:
Note
Conducting a Binary Logistic Regression in SPSS provides a robust foundation for
understanding the key features of your data. Always ensure that you consult the documentation
corresponding to your SPSS version, as steps might slightly differ based on the software version
in use.
SPSS Output for Binary Logistic Regression
Model Summary
Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1       282.315a             .305                    .413
a. Estimation terminated at iteration number 5 because parameter estimates changed by less than .001.
Classification Tablea
Observed: Having lung cancer or not (Yes/No) vs. Predicted, with Percentage Correct; the cell
counts are not visible in the image provided.
Interpreting Binary Logistic Regression
Interpreting the SPSS output of binary logistic regression involves examining key tables to
understand the model’s performance and the significance of predictor variables. Here are the
essential tables to focus on:
Model Summary
Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1       282.315a             .305                    .413
a. Estimation terminated at iteration number 5 because parameter estimates changed by less than .001.
Interpretation:
-2 Log likelihood: This is a measure of how well the model fits the data. Lower values indicate
better fit. In this case, the value is 282.315.
Cox & Snell R Square: This is a measure of the proportion of variance in the dependent
variable that is accounted for by the independent variables. It ranges from 0 to 1, where higher
values indicate a better fit. Here, it's 0.305, suggesting that the model explains about 30.5% of the
variance in the dependent variable.
Nagelkerke R Square: This is another measure of the proportion of variance explained by the
model, adjusted for the number of predictors. It's also scaled from 0 to 1, with higher values
indicating a better fit. Here, it's 0.413, meaning that the model explains about 41.3% of the
variance in the dependent variable.
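Both pseudo-R² values can be reproduced from the -2 log likelihoods. The intercept-only model's -2LL is not shown in this output, but it can be recovered as 282.315 plus the omnibus model chi-square of 105.470 reported later. A sketch of the computation:

```python
import math

n = 290                   # sample size (Valid N)
neg2ll_model = 282.315    # -2 Log likelihood of the fitted model
chi_square = 105.470      # omnibus model chi-square (improvement over the null model)
neg2ll_null = neg2ll_model + chi_square  # -2LL of the intercept-only model

# Cox & Snell R Square: 1 - (L0/L1)^(2/n), which equals 1 - exp(-chi_square/n)
cox_snell = 1 - math.exp(-chi_square / n)

# Nagelkerke R Square rescales Cox & Snell by its maximum attainable value
max_cox_snell = 1 - math.exp(-neg2ll_null / n)
nagelkerke = cox_snell / max_cox_snell

print(round(cox_snell, 3), round(nagelkerke, 3))
```

Rounded to three decimals, these match the .305 and .413 in the Model Summary table.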
Variables not in the Equation (partial)
                      Score     df    Sig.
BMI                   1.490     1     .222
Overall Statistics    95.616    4     .000
Interpretation
For each variable listed in the "Variables not in the Equation" section, the score, degrees of
freedom, and significance level are provided.
A low p-value (usually less than 0.05) suggests that the variable is significantly related to the
dependent variable.
In this case, it seems that "gender" and "smoking" have very low p-values (both 0.000),
indicating that they are highly significant predictors of the dependent variable. Age also has a
significant p-value of 0.016, suggesting it is also relevant to the model.
However, BMI's p-value is 0.222, indicating that it is not statistically significant in predicting the
dependent variable in this model.
Overall, these results suggest that "gender," "smoking," and "age" are important predictors in the
model, while "BMI" is not significant in predicting the dependent variable.
Omnibus Tests of Model Coefficients is used to test the model fit. If the Model is significant, this
shows that there is a significant improvement in fit compared to the null model; hence, the
model shows a good fit.
        Chi-square    df    Sig.
Model   105.470       4     .000
Interpretation
The omnibus test evaluates the overall significance of the model. In this case, the chi-square
value is 105.470 with 4 degrees of freedom, and the associated p-value is < .0001 (or .000). This
indicates that the model as a whole is statistically significant at a very high level of significance.
Essentially, the model, including all the predictors considered together, is providing statistically
significant information in explaining the variation in the dependent variable.
Overall, these results suggest that the predictors included in the model collectively have a
significant effect on the dependent variable.
The Hosmer and Lemeshow test is also a test of model fit. The Hosmer-Lemeshow statistic
indicates a poor fit if the significance value is less than 0.05. Here, the model adequately fits the
data; hence, there is no significant difference between the observed and predicted values.
Interpretation
The Hosmer and Lemeshow test is a statistical test used to assess the goodness of fit of a logistic
regression model. It helps determine whether there is a significant difference between the
observed and predicted values in the model.
In the given data, the test statistic (Chi-square) is 9.509, and the degrees of freedom (df) are 8.
The significance level (Sig.) is 0.301.
To interpret the results:
The chi-square statistic of 9.509 indicates the overall discrepancy between the observed and
predicted values in the logistic regression model.
The degrees of freedom (df) equal the number of groups minus the number of estimated
parameters; with SPSS's default of 10 groups, there are 8 degrees of freedom.
The significance level (Sig.) of 0.301 indicates the p-value associated with the chi-square
statistic. It represents the probability of obtaining a test statistic as extreme as the observed one,
assuming the null hypothesis is true.
Based on the provided data, since the p-value is 0.301, which is greater than the common
significance level of 0.05, we fail to reject the null hypothesis. This means that there is no
significant difference between the observed and predicted values in the logistic regression model.
In other words, the model appears to provide a good fit to the data.
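The reported significance can be checked directly. For an even number of degrees of freedom, the chi-square upper-tail probability has a closed form that needs only the standard library; this sketch reproduces the p-value from the statistic quoted above:

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable
    with an even, positive number of degrees of freedom df."""
    assert df % 2 == 0 and df > 0
    half = x / 2.0
    # P(X > x) = exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
    return math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(df // 2))

# Hosmer and Lemeshow test from the text: chi-square 9.509 with 8 df
p_value = chi2_sf(9.509, 8)
print(round(p_value, 3))
```

The same function applied to the omnibus chi-square (105.470 with 4 df) returns a value far below 0.001, matching the .000 reported there.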
Classification Table
The Classification Table displays the accuracy of the model in classifying cases into their
respective categories. It includes information on true positives, true negatives, false positives,
and false negatives. The overall classification accuracy percentage indicates how well the model
predicts the binary outcome.
Classification Tablea
Observed: Having lung cancer or not (Yes/No) vs. Predicted, with Percentage Correct; the cell
counts are not visible in the image provided.
Based on the provided data, the model's performance seems reasonably good, with a high overall
percentage correct of 76.9%.
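The cell counts behind the 76.9% are not visible in this copy, so the counts below are hypothetical, chosen only to sum to 290 and to give 223 correct classifications (76.9%). They sketch how the percentages in a classification table are computed:

```python
# Hypothetical confusion-matrix counts (the real cell counts are not
# visible in the source); they sum to 290 and give 76.9% correct.
true_yes_pred_yes = 68   # correctly predicted cases with lung cancer
true_yes_pred_no = 45    # missed cases (false negatives)
true_no_pred_no = 155    # correctly predicted non-cases
true_no_pred_yes = 22    # false alarms (false positives)

total = (true_yes_pred_yes + true_yes_pred_no
         + true_no_pred_no + true_no_pred_yes)
correct = true_yes_pred_yes + true_no_pred_no
overall_pct_correct = 100 * correct / total

print(round(overall_pct_correct, 1))
```

The row-wise percentages (sensitivity and specificity) are computed the same way, dividing each diagonal count by its row total.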
• Coefficients (B) represent the impact of each predictor on the log odds of the binary outcome.
• Exp (B): Odds ratios, which can be calculated as Exp (B) indicate the change in odds for a one-
unit change in the predictor. An odds ratio greater than 1 suggests an increase in the odds of the
event occurring, while a value less than 1 implies a decrease.
• Wald Statistics: the Wald statistic values for each predictor variable help assess the
significance of each predictor. Lower p-values indicate a more significant impact on the
outcome.
• z-values: the square root of the Wald statistic indicates how many standard errors the
coefficient is from zero. Higher absolute values suggest greater significance. (SPSS's logistic
output reports the Wald statistic rather than t-values.)
• P values: Test the null hypothesis that the corresponding coefficient is equal to zero. A low p-
value suggests that the predictors are significantly related to the dependent variable.
By thoroughly examining these output tables, you can gain a comprehensive understanding of
the binary logistic regression model’s performance and the significance of predictor variables.
This information is essential for making informed decisions and drawing meaningful conclusions
from your analysis.
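The significance value attached to each Wald statistic comes from a chi-square distribution with 1 degree of freedom, which reduces to the complementary error function. A sketch using the Wald values reported later in the output (recomputing them from the rounded B and S.E. would give slightly different numbers):

```python
import math

def wald_p_value(wald):
    """p-value for a Wald statistic: chi-square with 1 df,
    P(X > w) = erfc(sqrt(w / 2))."""
    return math.erfc(math.sqrt(wald / 2.0))

print(round(wald_p_value(4.455), 3))   # age: reported Sig. is .035
print(wald_p_value(21.031) < 0.001)    # gender: reported Sig. is .000
```

This is the same test SPSS performs for each row of the Variables in the Equation table.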
• The beta coefficients can be negative or positive, and each has a significance value associated
with it. … If the beta coefficient is negative, the interpretation is that for every 1-unit increase
in the predictor variable, the log odds of the outcome will decrease by the beta coefficient
value.
Based on the Variables in the Equation, it appears that a logistic regression analysis was
performed with several independent variables entered in
Step 1. Let's interpret the results for each variable:
Gender:
The coefficient (B) for gender is 1.410.
The standard error (S.E.) associated with the coefficient is 0.308.
The Wald statistic is 21.031, which is the ratio of the coefficient to its standard error.
The degrees of freedom (df) associated with the Wald statistic is 1.
The significance (Sig.) value indicates the p-value for the Wald statistic, which is 0.000.
The odds ratio (Exp(B)) is 4.097, which represents the change in odds for each unit
increase in gender.
The 95% confidence interval (C.I.) for the odds ratio ranges from 2.242 to 7.486.
Interpretation for Gender:
The variable "gender" is statistically significant (p < 0.001) and has a positive coefficient of
1.410. This indicates that being male (assuming gender 1 represents male and 0 represents
female) is associated with higher odds of the outcome, controlling for other variables.
Specifically, males have 4.097 times higher odds of the outcome compared to females, with a
95% confidence interval ranging from 2.242 to 7.486.
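The odds ratio and its confidence interval can be recovered from B and S.E.; small differences from the SPSS table arise because B and S.E. are rounded to three decimals. A sketch for the gender row:

```python
import math

b, se = 1.410, 0.308   # gender coefficient and standard error from the output
z = 1.959964           # two-sided 95% critical value of the standard normal

odds_ratio = math.exp(b)          # Exp(B)
ci_low = math.exp(b - z * se)     # lower bound of the 95% CI
ci_high = math.exp(b + z * se)    # upper bound of the 95% CI

print(round(odds_ratio, 3), round(ci_low, 3), round(ci_high, 3))
```

The same two lines of arithmetic produce the Exp(B) and CI columns for the smoking, age, and BMI rows.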
Smoking:
The coefficient (B) for smoking is 2.363.
The standard error (S.E.) associated with the coefficient is 0.304.
The Wald statistic is 60.579.
The degrees of freedom (df) associated with the Wald statistic is 1.
The significance (Sig.) value is 0.000.
The odds ratio (Exp(B)) is 10.626.
The 95% confidence interval (C.I.) for the odds ratio ranges from 5.860 to 19.268.
Interpretation for Smoking:
The variable "Smoking" is statistically significant (p < 0.001) and has a positive coefficient of
2.363. This indicates that smoking is associated with higher odds of the outcome, controlling for
other variables. Specifically, smokers have 10.626 times higher odds of the outcome compared to
non-smokers, with a 95% confidence interval ranging from 5.860 to 19.268.
Age:
The coefficient (B) for age is 0.020.
The standard error (S.E.) associated with the coefficient is 0.009.
The Wald statistic is 4.455.
The degrees of freedom (df) associated with the Wald statistic is 1.
The significance (Sig.) value is 0.035.
The odds ratio (Exp(B)) is 1.020.
The 95% confidence interval (C.I.) for the odds ratio ranges from 1.001 to 1.039.
Interpretation for Age:
The variable "Age" is statistically significant (p = 0.035) and has a positive coefficient of 0.020.
This indicates that for each unit increase in age, the odds of the outcome increase by a factor of
1.020, controlling for other variables. The 95% confidence interval for the odds ratio ranges from
1.001 to 1.039.
BMI:
The coefficient (B) for BMI is -0.004.
The standard error (S.E.) associated with the coefficient is 0.008.
The Wald statistic is 0.307.
The degrees of freedom (df) associated with the Wald statistic is 1.
The significance (Sig.) value is 0.579.
The odds ratio (Exp (B)) is 0.996.
The 95% confidence interval (C.I.) for the odds ratio ranges from 0.980 to 1.011.
Interpretation for BMI:
The variable "BMI" is not statistically significant (p = 0.579), as the p-value is greater than the
common significance level of 0.05. This suggests that BMI is not significantly associated with
the outcome when controlling for other variables.
Constant:
The coefficient (B) for the constant term is -2.927.
The standard error (S.E.) associated with the coefficient is 0.758.
The Wald statistic is 14.920.
The degrees of freedom (df) associated with the Wald statistic is 1.
The significance (Sig.) value is 0.000.
The constant term represents the log odds of the outcome when all predictors equal zero; it is
usually not of direct interest but serves as the baseline for the other variables in the equation.
Overall, the interpretation of the logistic regression results suggests that gender, smoking, and
age are statistically significant predictors in the logistic regression model for the outcome
variable, while BMI does not appear to be a significant predictor. The coefficients and odds
ratios provide information about the direction and magnitude of the associations between the
independent variables and the outcome, controlling for other variables.
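Putting the fitted coefficients together, the model's predicted probability for any profile of predictor values follows from the logistic equation. The profile below (a 50-year-old male smoker with BMI 25, assuming male = 1 and smoker = 1 in the coding) is hypothetical:

```python
import math

# Fitted coefficients from the Variables in the Equation table
constant = -2.927
b_gender, b_smoking, b_age, b_bmi = 1.410, 2.363, 0.020, -0.004

# Hypothetical respondent: male (1), smoker (1), age 50, BMI 25
gender, smoking, age, bmi = 1, 1, 50, 25

log_odds = (constant + b_gender * gender + b_smoking * smoking
            + b_age * age + b_bmi * bmi)
probability = 1 / (1 + math.exp(-log_odds))

print(round(log_odds, 3), round(probability, 2))
```

Under these assumed predictor values the model puts the probability of lung cancer at roughly 0.85, which illustrates how strongly the gender and smoking coefficients dominate the prediction.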
Odds Ratio = 1
The probability of falling into the target group is equal to the probability of falling into the
non-target group.
Odds Ratio > 1 (Probability of Event Occurring Increases)
The probability of falling into the target group is greater than the probability of falling into the
non-target group: the event is likely to occur.
Odds Ratio < 1 (Probability of Event Occurring Decreases)
The probability of falling into the target group is less than the probability of falling into the
non-target group: the event is unlikely to occur.
We can say that the odds of a customer choosing a Private Bank offering Value Added Services
are 1.367 times higher than for Public Sector Banks, which do not offer Value Added Services,
with a 95% CI of 1.097 to 1.703.
The important thing about this confidence interval is that it doesn’t cross 1. This is important
because values greater than 1 mean that as the predictor variable(s) increase, so do the odds of (in
this case) selecting Private Bank. Values less than 1 mean the opposite: as the predictor increases,
the odds of selecting Private Bank decrease.
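This check is mechanical: the association is treated as significant at the 5% level exactly when the 95% CI for the odds ratio excludes 1. A sketch using the figures quoted above:

```python
odds_ratio = 1.367
ci_low, ci_high = 1.097, 1.703   # 95% CI for the odds ratio

# Significant at the 5% level iff the interval does not contain 1
crosses_one = ci_low <= 1 <= ci_high
print(odds_ratio, "not significant" if crosses_one else "significant")
```

Here the whole interval sits above 1, so the positive association holds across the plausible range.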
Case Processing Summary and Encoding
The first section of the output shows the Case Processing Summary, highlighting the cases
included in the analysis. In this example we have a total of 290 respondents.
Case Processing Summary
Unweighted Casesa                              N       Percent
Selected Cases    Included in Analysis         290     100.0
                  Missing Cases                0       .0
                  Total                        290     100.0
Unselected Cases                               0       .0
Total                                          290     100.0
a. If weight is in effect, see classification table for the total number of cases.
Interpretation
Based on the case processing summary, it appears that all 290 cases in the dataset were selected for
analysis. There are no missing cases, indicating that there is complete data available for the selected
cases. The total number of cases in the dataset is also 290.
This summary provides an overview of the case selection and missing data status, allowing us to
understand the completeness of the dataset and the number of cases available for analysis.
The Dependent Variable Encoding table shows the coding for the criterion variable: in this
example, respondents coded “Yes” are classified as 0, while those coded “No” are classified
as 1.
Dependent Variable Encoding
Original Value Internal Value
Yes 0
No 1
Interpretation
The encoding of the dependent variable suggests that the value "Yes" is represented by the
internal value 0, while the value "No" is represented by the internal value 1. This encoding
allows for numerical representation and analysis of the dependent variable in statistical models
or algorithms.