Test Metrics
subsample)?
(Note: this is not a screenshot; I used the “copy as HTML table” option.) Here we can see that
the mean income for females is ~6048, while the maximum value is around 45000. Interestingly,
there is at least 1 observation with negative income. Since this variable does not measure
losses, I will drop all negative values.
As we can see from the histogram above, the distribution appears to be lognormal. However,
before applying the log transformation, I will check this with a formal normality test.
Variable    Obs      Pr(skewness)   Pr(kurtosis)   Adj chi2(2)   Prob>chi2
earn        1,654    0.0000         0.0000         988.53        0.0000
The test indicates that the distribution of earn is not normal: the p-value is approximately
0.000. This supports taking the logarithm of our variable. However, before doing so, there is
1 more step to cover: outliers.
There are several dots in the upper part of the distribution; however, 1 observation seems
to be too extreme, so I will delete it from the sample. Finally, I transform the variable
to logs and look at the result:
Indeed, it is much better than what we had at the beginning. One more thing should be said:
another possible dependent variable is income per hour. However, 1) I do not have the number
of hours worked and 2) it would follow the same logic as monthly income.
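The cleaning and transformation steps above can be sketched as follows. This is only a minimal illustration: the income values are hypothetical (not taken from the dataset), the helper name skewness is my own, and I keep strictly positive values because the logarithm is undefined for zero as well as for negative numbers. The outlier removal is not shown, since it was done by visual inspection.

```python
import math

def skewness(values):
    """Population skewness: third central moment divided by variance^1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical monthly incomes, NOT taken from the report's dataset.
earn = [-500, 1200, 2500, 3100, 4000, 5200, 6000, 7500, 9800, 15000, 45000]

# Keep strictly positive values only (assumption: zero incomes are dropped too,
# because log() is undefined for zero and negative numbers).
earn_pos = [v for v in earn if v > 0]
log_earn = [math.log(v) for v in earn_pos]

skew_raw = skewness(earn_pos)
skew_log = skewness(log_earn)
print(f"skewness raw: {skew_raw:.2f}, log: {skew_log:.2f}")
```

For right-skewed data like this, the skewness of the logged values should be much closer to 0 than the skewness of the raw values.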
1.3. Variable of interest –
Now we can move on to the second half of the RQ: number of children. The variable in the
dataset is children. It is a numeric variable, but it is not continuous (there cannot
be 1.5 children, fortunately). Here are the summary statistics:
Children    Freq.    Percent    Cum.
0           990      59.82      59.82
1           361      21.81      81.63
2           262      15.83      97.46
3           35       2.11       99.58
4           6        0.36       99.94
5           1        0.06       100.00
Total       1,655    100.00
The number of children ranges from 0 to 5. The majority of women (990) do not have children,
and only 6 and 1 females report 4 and 5 children respectively. As I have mentioned above,
it is a discrete variable, so it makes little sense to log-transform it or check its
normality. Instead, I will merge females with 3, 4 and 5 children into 1 group, since these
categories are very small on their own. Here is the new variable of interest:
Children    Freq.    Percent    Cum.
0           990      59.82      59.82
1           361      21.81      81.63
2           262      15.83      97.46
3           42       2.54       100.00
Total       1,655    100.00
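The regrouping can be checked with a short sketch. The frequencies are copied from the first table above; the function name group_children and the cap argument are my own illustrative choices, not names from the dataset.

```python
# Frequencies of the original children variable, copied from the table above.
counts = {0: 990, 1: 361, 2: 262, 3: 35, 4: 6, 5: 1}

def group_children(n, cap=3):
    """Map 3, 4 and 5 children to a single top category (hypothetical helper)."""
    return min(n, cap)

# Add up the frequencies of the categories that fall into the same group.
grouped = {}
for n_children, freq in counts.items():
    key = group_children(n_children)
    grouped[key] = grouped.get(key, 0) + freq

# Print the new frequency table: category, frequency, percent, cumulative percent.
total = sum(grouped.values())
cumulative = 0.0
for key in sorted(grouped):
    percent = 100 * grouped[key] / total
    cumulative += percent
    print(f"{key}  {grouped[key]:>5}  {percent:6.2f}  {cumulative:6.2f}")
```

The top category collects 35 + 6 + 1 = 42 observations, which matches the 2.54% share in the regrouped table.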
1.4. Control variables:
We have covered all aspects of the RQ. But is the number of children the only factor that can
influence females' income? Obviously not, which is why I will include several control variables
that can also affect income. Here they are (Note: all relationships will be
explained after the descriptive statistics of each variable):
a) Yeareduc – indicates the number of educational years
Variable    Obs      Mean       Std. Dev.   Min   Max
yeareduc    1,655    12.3861    2.295872    8     24
Nothing special: females study on average ~12 years, with a minimum of 8 and a
maximum of 24 years. As for the distribution and possible transformations,
it looks approximately normal; however, the test shows the opposite:
Variable    Obs      Pr(skewness)   Pr(kurtosis)   Adj chi2   Prob>chi2
yeareduc    1,655    0.0000         0.0000         137.83     0.0000
Despite the test result, I will not transform it because the distribution seems to be suitable for
regression.
b) Age – represents age
Variable Std Mean Std. Dev. Min
Finally, we analyzed all the variables and now it is time to state the signs and explanations for
the relationships:
Variable                 Sign of relationship   Explanation
Number of children       +                      As I have stated at the beginning of the report, a higher
                                                number of children can indicate more working and study
                                                experience, hence a higher income.
Years of education       +                      Similar to the previous cell: the more educated the person,
                                                the higher the demand for his/her skills and hence the
                                                higher the salary (income).
Age (and age squared)    +                      The higher the age, the more experienced the person can be
                                                (an 18-year-old knows almost nothing compared with a
                                                45-year-old top manager).
Skin color               -                      A very relevant topic (BLM); however, nothing game-changing
                                                has happened on this question. Oftentimes, people are biased
                                                with respect to skin color and tend to undervalue the
                                                contribution of black people, so employers pay them less.
                                                That is why the sign is negative.
Here the situation is the opposite: the p-value is small (less than 0.05), hence we reject the
null hypothesis and conclude that not all mean incomes are equal across the numbers of children.
4.2. Multicollinearity
Variable        VIF      1/VIF
children_new
  1             1.19     0.840726
  2             1.25     0.799660
  3             1.06     0.941192
yeareduc        1.04     0.960742
age             58.75    0.017021
age_2           60.59    0.016505
hours           1.03     0.975130
1.black         1.00     0.996950
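To illustrate why age and age_2 stand out in this table, here is a minimal sketch of the VIF formula, VIF = 1 / (1 - R^2). It is a simplification: it uses only two regressors, so R^2 is just the squared correlation between them, whereas the table above is based on all regressors in the model. The ages 18 to 65 are hypothetical, and the function names are my own.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_regressors(x, y):
    """VIF = 1 / (1 - R^2); with a single other regressor, R^2 equals r^2."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical ages 18..65, NOT the report's sample.
age = list(range(18, 66))
age_2 = [a ** 2 for a in age]

vif = vif_two_regressors(age, age_2)
tolerance = 1.0 / vif  # corresponds to the 1/VIF column of the table
print(f"VIF: {vif:.2f}, 1/VIF: {tolerance:.6f}")
```

Because a variable and its square are almost perfectly correlated over a positive range, a large VIF for age and age_2 is expected by construction.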
4.3. Heteroscedasticity
The next post-estimation check is heteroscedasticity. It tests whether the variance of the
error term is the same across different values of the explanatory variables.
Source                chi2     df    p
Heteroskedasticity    56.46    36    0.0162
Skewness              9.94     8     0.2691
Kurtosis              7.07     1     0.0078
Total                 73.48    45    0.0047
For this test, the null hypothesis states that the error variance is constant
(homoscedasticity). However, we can see that the p-value is 0.0162 < 0.05, hence we
reject the null hypothesis and conclude that heteroscedasticity is present.
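The logic of such a test can be sketched in a few lines. Note that this is not the test reported in the table above: as a simpler stand-in I use a studentized Breusch-Pagan test with a single regressor, where the squared OLS residuals are regressed on x and the statistic n * R^2 is compared with a chi-squared distribution with 1 degree of freedom. The data are hypothetical and all function names are my own.

```python
import math

def ols_simple(x, y):
    """Return intercept, slope and R^2 of a one-regressor OLS fit."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return intercept, slope, 1.0 - ss_res / ss_tot

def chi2_sf_df1(stat):
    """Upper-tail probability of a chi-squared variable with 1 df."""
    return math.erfc(math.sqrt(stat / 2.0))

def breusch_pagan_simple(x, y):
    """Studentized Breusch-Pagan LM test with one regressor (a sketch)."""
    intercept, slope, _ = ols_simple(x, y)
    resid_sq = [(b - intercept - slope * a) ** 2 for a, b in zip(x, y)]
    # Auxiliary regression: squared residuals on x.
    _, _, r2_aux = ols_simple(x, resid_sq)
    lm = len(x) * r2_aux
    return lm, chi2_sf_df1(lm)

# Hypothetical data: the error size grows with x, so the variance is not constant.
x = [float(i) for i in range(1, 101)]
y = [1.0 + 2.0 * xi + (0.1 * xi if i % 2 == 0 else -0.1 * xi)
     for i, xi in enumerate(x)]

lm_stat, p_value = breusch_pagan_simple(x, y)
print(f"LM = {lm_stat:.2f}, p = {p_value:.4f}")
print("reject homoscedasticity" if p_value < 0.05 else "do not reject")
```

As a side check, chi2_sf_df1(7.07) agrees with the p-value of 0.0078 in the Kurtosis row of the table, which is the only row with 1 degree of freedom.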
The relationship between earnings and children is log-level (or log-lin). Hence, we can
say the following:
a) If the number of children is 1, then income is lower (compared with women with 0
children, which is the base category) by 0.019*100% = 1.9%
b) If the number of children is 2, then income is lower (compared with women with 0
children, which is the base category) by 0.066*100% = 6.6%
c) If the number of children is more than 2 (meaning 3-4-5), then income is
lower (compared with women with 0 children, which is the base category) by
0.212*100% = 21.2%
That is for the simple regression; for the multiple regression the logic and the
signs are the same, and the only difference is in the final percentages: they
are 4.9, 10.8 and 19.9 for (a), (b) and (c) respectively.
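The coef*100% reading used above is an approximation that works well for small coefficients. For a dummy variable in a log-level model, the exact percentage change is 100 * (exp(coef) - 1). The sketch below compares the two for the simple-regression coefficients. The negative signs are my assumption, based on the text saying that income is lower, and the function name pct_effect is hypothetical.

```python
import math

def pct_effect(coef):
    """Return (approximate, exact) percentage change for a log-level coefficient."""
    approx = 100.0 * coef
    exact = 100.0 * (math.exp(coef) - 1.0)
    return approx, exact

# Coefficient magnitudes from the simple regression above; the negative sign
# is an assumption based on the text saying that income is lower.
coefs = {"1 child": -0.019, "2 children": -0.066, "3+ children": -0.212}

for label, b in coefs.items():
    approx, exact = pct_effect(b)
    print(f"{label}: approx {approx:.1f}%, exact {exact:.2f}%")
```

For the two smaller coefficients the difference is negligible, while for the largest one the exact decrease is about 19.1% rather than 21.2%.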
The overall tendency is that the higher the number of children, the lower the income.
Moreover, all coefficients described above are statistically significant (except for 1 in
the simple regression).