Test Metrics

Does having children increase the person's income (only for the female
subsample)?
Part 1: Mechanism of how it works

First of all, it is crucial to understand the mechanism, that is, the logic behind a possible
relationship between a woman's income and the number of children she has. In my
opinion, the logic is as follows: women with more children are usually
older than women with fewer children (a single pregnancy alone lasts about 9
months). In turn, the older a woman is, the more education and work
experience she can accumulate, which can lead to a higher income.
That is the main justification of the mechanism.
Part 2: Describing the data and choosing the variables

1.1. The whole dataset: the data are a subsample of socio-economic
characteristics from one European country. It includes
information such as place of residence, gender, income, skin color, education, industry
of employment and many others. However, we do not need all of them. In the next
few sections we will select only the necessary ones.
1.2. Dependent variable
Before we start, I should point out that the RQ concerns only the female
subsample. So, while examining the variables, it is better to work with
values that relate to female observations only. That is why I kept only the subsample
where the sex is female. As a result, 2182 of 3837 observations were dropped, leaving 1655. That is a
large amount of data, but it is acceptable because, again, we are interested only in
the female part. We could, of course, state "if gender == 0" in the regression model and it
would work; however, our descriptive analysis would then be biased by the full-sample
observations.
Now let us choose the dependent variable. Since we are interested in whether
having children increases income, our dependent variable should represent
people's earnings. In our dataset the only suitable variable is earn. It represents
the monthly income of the respondents. Now let us look more closely at this variable:
Variable   Obs     Mean       Std. Dev.   Min   Max
earn       1,655   6047.638   2797.286    -99   44999.98
(Note: this is not a screenshot; I used the "copy as HTML table" option.) Here we can see that
the mean income for females is ~6048, while the maximum value is around 45000. Interestingly,
there is at least 1 observation with a negative income. Since this variable does not
record losses, I will drop all negative values.
As we can see from the histogram above, the distribution looks close to log-normal. However,
before the log-transformation, I will check the distribution with a formal normality test.
Variable Obs Pr(skewness) Pr(kurtosis) Adj chi2(2) Prob>chi2
earn 1,654 0.0000 0.0000 988.53 0.0000
The statistical test indicates that the distribution of earn is not normal: the p-value is 0.0000. Together
with the shape of the histogram, this supports taking the logarithm of our variable. However, before doing so, there is 1 more
step to cover: outliers.
There are several dots in the upper part of the distribution; however, there is 1 observation
that seems to be too extreme. I will delete it from the sample. Finally, I transform the
variable to logs and look at the result:
Indeed, it is much better than what we had at the beginning. One more thing should be said: there
could be another dependent variable, namely income per hour. However, 1) I do not
have the number of hours worked and 2) it would follow the same logic as monthly income.
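The cleaning and transformation steps of this subsection can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name log_earnings and the sample values are hypothetical, and the sketch drops zero incomes as well as negative ones, because the logarithm is undefined for them.

```python
import math

def log_earnings(values):
    """Drop non-positive incomes and return the natural log of the rest."""
    # Hypothetical helper: negative values (such as -99) are dropped as in the text;
    # zeros are dropped too, because log(0) is undefined (an assumption of this sketch).
    return [math.log(v) for v in values if v > 0]

# Hypothetical sample values, not the actual dataset
sample = [-99, 1000.0, 6047.638]
logged = log_earnings(sample)
```

Only the two positive values survive the filter, so logged has two entries.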
1.3. Variable of interest
Now we can move on to the second half of the RQ: the number of children. The variable in the
dataset is children. It is a numeric variable; however, it is not continuous (there cannot
be 1.5 children, fortunately). Here are the summary statistics:
Children Freq. Percent Cum.
0 990 59.82 59.82
1 361 21.81 81.63
2 262 15.83 97.46
3 35 2.11 99.58
4 6 0.36 99.94
5 1 0.06 100.00
Total 1,655 100.00
The number of children ranges from 0 to 5. The majority of women (990) do not have children,
and only 6 and 1 females report 4 and 5 children respectively. As mentioned above,
it is a discrete variable, so it makes little sense to log-transform it or check its
normality. Instead, I will merge females with 3, 4 and 5 children into 1 group, since
these groups are very small separately. Here is the new variable of interest (the value 3 now means 3 or more children):
Children Freq. Percent Cum.
0 990 59.82 59.82
1 361 21.81 81.63
2 262 15.83 97.46
3 42 2.54 100.00
Total 1,655 100.00
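The regrouping can be reproduced directly from the frequency table of the original children variable. The sketch below is illustrative only; the function name regroup is hypothetical, and the counts are copied from the first table in this subsection.

```python
from collections import Counter

# Frequencies of the original children variable, copied from the first table
children_freq = {0: 990, 1: 361, 2: 262, 3: 35, 4: 6, 5: 1}

def regroup(children):
    """Map 3, 4 and 5 children to the single top category 3."""
    return min(children, 3)

# Add up the counts of all original values that fall into each new category
grouped_freq = Counter()
for value, count in children_freq.items():
    grouped_freq[regroup(value)] += count

print(dict(grouped_freq))  # prints {0: 990, 1: 361, 2: 262, 3: 42}
```

The merged category contains 35 + 6 + 1 = 42 women, which matches the second table.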
1.4. Control variables
We have covered all aspects of the RQ. But is the number of children the only factor that can
influence female income? Obviously not; that is why I will include several control variables
that can also have an impact on income. Here they are (Note: all relationships will be
explained after the descriptive statistics of each variable):
a) yeareduc: the number of years of education
Variable   Obs     Mean      Std. Dev.   Min   Max
yeareduc   1,655   12.3861   2.295872    8     24
Nothing special: females study on average ~12 years, with a minimum of 8 and a
maximum of 24 years. As for the distribution and transformations,
it looks approximately normal; however, the test shows the opposite:
Variable   Obs     Pr(skewness)   Pr(kurtosis)   Adj chi2(2)   Prob>chi2
yeareduc   1,655   0.0000         0.0000         137.83        0.0000
Despite the test result, I will not transform it: OLS does not require the regressors to be
normally distributed, and the distribution seems suitable for regression.
b) age: represents age
Variable   Obs     Mean       Std. Dev.   Min   Max
age        1,655   38.83021   10.61145    16    60
The average age is ~39 years, with a minimum of 16 and a maximum of 60.
As for the distribution, it is neither normal nor log-normal. Hence, the typical
log-transformation will not help us. Instead, we can use a quadratic
term, age squared (age_2).
c) Skin color: the binary variable black (1 = black, 0 = otherwise).
Skin    Freq.   Percent   Cum.
0       1,351   81.63     81.63
1       304     18.37     100.00
Total   1,655   100.00
Here the classes are imbalanced: only ~18% of the females are black,
while the rest are not. Normality tests and histograms are not required, since it
is a binary variable.
Finally, we have analyzed all the variables, and now it is time to state the expected signs and explanations for
the relationships:
Variable | Sign of relationship | Explanation
Number of children | + | As stated at the beginning of the report, a higher number of children can indicate more work and study experience, hence a higher income.
Years of education | + | Similarly, the more educated the person, the higher the demand for his/her skills and hence the higher the salary (income).
Age (and age squared) | + | The higher the age, the more experienced the person can be (an 18-year-old knows almost nothing compared with a 45-year-old top manager).
Skin color | - | A very relevant topic (BLM); however, nothing game-changing has happened on this question. Oftentimes, people are biased with respect to skin color and tend to undervalue the contribution of black people, so employers pay them less. That is why the sign is negative.
Part 3: Preliminary Analysis

Now we move on to the preliminary analysis. It is needed in order to better understand
the relationships between the variables.
3.1. Correlation matrix
| log_earn yeareduc age age_2 hours
-------------+---------------------------------------------
log_earn | 1.0000
|
|
yeareduc | 0.3861* 1.0000
| 0.0000
|
age | 0.1212* -0.0367 1.0000
| 0.0000 0.1352
|
age_2 | 0.1067* -0.0609* 0.9900* 1.0000
| 0.0000 0.0132 0.0000
|
hours | -0.1122* 0.0785* 0.0293 0.0148 1.0000
| 0.0000 0.0014 0.2335 0.5461
|
We see that the coefficients are neither very high nor very low (age and age_2 are not taken into
account: since the second is created from the first, their high correlation is expected). The highest correlation is
between years of education and log earnings, ~0.39. Moreover, the majority of the coefficients are statistically significant
at the 5% level. Right now, I do not see multicollinearity. Nevertheless, we will
check it formally in the post-estimation section.
3.2. T-test for earnings between skin color groups
------------------------------------------------------------------------------
Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
---------+--------------------------------------------------------------------
0 | 1,350 8.62553 .0108161 .3974087 8.604312 8.646749
1 | 304 8.622339 .0231952 .4044219 8.576695 8.667983
---------+--------------------------------------------------------------------
combined | 1,654 8.624944 .0098006 .3985856 8.605721 8.644167
---------+--------------------------------------------------------------------
diff | .0031909 .0253113 -.0464547 .0528366
------------------------------------------------------------------------------
diff = mean(0) - mean(1) t = 0.1261
Ho: diff = 0 degrees of freedom = 1652
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0

Pr(T < t) = 0.5502 Pr(|T| > |t|) = 0.8997 Pr(T > t) = 0.4498
Interestingly, the t-test shows no difference in mean log earnings between black and non-black females.
All alternative hypotheses have high p-values, so we do not reject the null hypothesis of equal
means. So maybe I was wrong in saying that racism is relevant here. We will see later on.
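As a cross-check, the t statistic can be recomputed by hand from the group summary statistics in the output. The sketch below assumes the standard two-sample t-test with equal variances (pooled standard deviation), which is consistent with the 1652 degrees of freedom reported; the variable names are my own.

```python
import math

# Summary statistics of log_earn by group, copied from the t-test output
n0, mean0, sd0 = 1350, 8.62553, 0.3974087   # group 0 (not black)
n1, mean1, sd1 = 304, 8.622339, 0.4044219   # group 1 (black)

# Pooled variance under the equal-variance assumption
df = n0 + n1 - 2
pooled_var = ((n0 - 1) * sd0**2 + (n1 - 1) * sd1**2) / df

# Standard error of the difference in means and the t statistic
se_diff = math.sqrt(pooled_var * (1 / n0 + 1 / n1))
t_stat = (mean0 - mean1) / se_diff

print(df, round(se_diff, 4), round(t_stat, 3))  # prints 1652 0.0253 0.126
```

The values agree with the reported standard error of the difference (.0253113) and t = 0.1261 up to rounding.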
3.3. ANOVA test for children:

Analysis of Variance
Source SS df MS F Prob > F
------------------------------------------------------------------------
Between groups 2.60742748 5 .521485497 3.31 0.0056
Within groups 260.005448 1648 .157770296
------------------------------------------------------------------------
Total 262.612875 1653 .158870463
Bartlett's test for equal variances: chi2(4) = 5.6292 Prob>chi2 = 0.229
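Here the situation is the opposite: the p-value (0.0056) is less than 0.05, hence we reject the null hypothesis and
conclude that not all mean log incomes across the numbers of children are equal. Bartlett's test (p = 0.229) does not reject equal variances, so this ANOVA assumption is not contradicted.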
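The F statistic in the ANOVA table is the ratio of the between-group to the within-group mean square. A minimal sketch that recomputes it from the sums of squares in the table (variable names are my own):

```python
# Sums of squares and degrees of freedom, copied from the ANOVA table
ss_between, df_between = 2.60742748, 5
ss_within, df_within = 260.005448, 1648

# Mean squares are sums of squares divided by their degrees of freedom
ms_between = ss_between / df_between
ms_within = ss_within / df_within

# The F statistic compares between-group to within-group variation
f_stat = ms_between / ms_within

print(round(ms_between, 4), round(ms_within, 4), round(f_stat, 2))  # prints 0.5215 0.1578 3.31
```

Note that 5 between-group degrees of freedom correspond to 6 groups, that is, the original children variable with values 0 to 5.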
Part 4: Regression models

Econometric equation: log_earn = B0 + B1*children_new + B2*yeareduc + B3*age +
B4*age_2 + B5*black + B6*hours + e
4.1. Initial simple and multiple regressions

In this section, the regression models are built. Initially there are 2 regressions:
a simple and a multiple regression (without any adjustments for multicollinearity or heteroscedasticity).
Then the post-estimation analysis is carried out, all corrections are implemented, and
the final table with all models and coefficients is presented. So there will be no intermediate separate
outputs for each model (since the page limit is only 10).
The 2 models described above have been built; now let us check for multicollinearity and
heteroscedasticity (again, the regression outputs will be shown later).
4.2. Multicollinearity
Variable VIF 1/VIF
children_new
1 1.19 0.840726
2 1.25 0.799660
3 1.06 0.941192
yeareduc 1.04 0.960742
age 58.75 0.017021
age_2 60.59 0.016505
hours 1.03 0.975130
1.black 1.00 0.996950
Multicollinearity is measured via the variance inflation factor (VIF), which shows
how strongly each regressor is linearly related to the other regressors (the
dependent variable is not involved). The higher the VIF, the more probable the
presence of multicollinearity. Here we can see that only age and age_2 have a
high VIF (more than 50); however, as stated in section 3.1 with the correlation
matrix, this is acceptable, since age_2 is just age squared. As for the other
variables, their VIFs are roughly equal to 1. It can be said that there is no
multicollinearity problem to handle.
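To see where a VIF above 50 comes from, recall the standard definition VIF = 1 / (1 - R^2), where R^2 comes from regressing one regressor on all the others. The sketch below is an illustration with a hypothetical helper vif_from_r2; it uses only the 0.9900 correlation from section 3.1 and the 1/VIF value for age from the table.

```python
def vif_from_r2(r_squared):
    """Variance inflation factor for a regressor whose auxiliary regression
    on the other regressors has the given R-squared."""
    return 1.0 / (1.0 - r_squared)

# With only two regressors, the auxiliary R-squared is the squared correlation.
# The correlation between age and age_2 is 0.9900 in section 3.1.
vif_two_regressors = vif_from_r2(0.9900 ** 2)

# The tolerance column 1/VIF equals 1 - R-squared; for age it is 0.017021.
implied_r2_age = 1.0 - 0.017021
vif_age = vif_from_r2(implied_r2_age)

print(round(vif_two_regressors, 2), round(vif_age, 2))  # prints 50.25 58.75
```

The pairwise correlation alone already implies a VIF of about 50; the reported 58.75 is somewhat higher because the auxiliary regression for age also includes the other regressors.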
4.3. Heteroscedasticity
The next post-estimation check is for heteroscedasticity, that is, whether the variance of the error term is
the same across different values of the regressors.
Source chi2 df p
Heteroskedasticity 56.46 36 0.0162
Skewness 9.94 8 0.2691
Kurtosis 7.07 1 0.0078
Total 73.48 45 0.0047
For this test, the null hypothesis states that the error variance is constant (homoscedasticity). However, we can see that the
p-value is 0.0162 < 0.05; hence, we reject the null hypothesis and conclude that
heteroscedasticity is present.
4.4. Heteroscedasticity-corrected model

Once heteroscedasticity is detected, we can adjust our model for it by
adding the "robust" option, which keeps the coefficients unchanged and replaces the standard errors with heteroscedasticity-robust ones.
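What the robust correction does can be illustrated for the simplest case of one regressor. The sketch below is not the estimation used in this report: it is a pure-Python illustration on hypothetical toy data, and it assumes the HC1 variant of the robust estimator (squared residuals weighted observation by observation, with an n / (n - k) correction). The function name simple_ols_with_se is my own.

```python
import math

def simple_ols_with_se(x, y):
    """OLS slope of y on x (with intercept), plus classical and
    heteroscedasticity-robust (HC1) standard errors of the slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

    # Classical SE: one common error variance for all observations
    sigma2 = sum(e ** 2 for e in resid) / (n - 2)
    se_classical = math.sqrt(sigma2 / sxx)

    # HC1 robust SE: each observation contributes its own squared residual,
    # with the n / (n - k) small-sample correction (k = 2 parameters here)
    meat = sum(((xi - x_bar) ** 2) * e ** 2 for xi, e in zip(x, resid))
    se_robust = math.sqrt(n / (n - 2) * meat / sxx ** 2)
    return slope, se_classical, se_robust

# Hypothetical toy data, not taken from the dataset
x = [0, 1, 2, 3, 4]
y = [1, 2, 2, 5, 5]
slope, se_classical, se_robust = simple_ols_with_se(x, y)
```

The slope is the same in both cases; only its standard error differs, which is why a robust model reports identical coefficients with different standard errors.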
4.5. The outputs of all regression models and their interpretation
In this section, I will show the final results of the 3 regression models and interpret them.
                  (1)          (2)            (3)
                  simple_reg   multiple_reg   corrected_multiple_reg
1.children_new    -0.019       -0.049*        -0.049*
                  (0.024)      (0.023)        (0.024)
2.children_new    -0.066*      -0.108***      -0.108***
                  (0.028)      (0.027)        (0.027)
3.children_new    -0.212***    -0.199***      -0.199***
                  (0.063)      (0.057)        (0.054)
yeareduc                       0.068***       0.068***
                               (0.004)        (0.004)
age                            0.026***       0.026***
                               (0.006)        (0.007)
age_2                          -0.000***      -0.000**
                               (0.000)        (0.000)
hours                          -0.043***      -0.043***
                               (0.006)        (0.007)
1.black                        0.001          0.001
                               (0.023)        (0.023)
_cons             8.645***     7.629***       7.629***
                  (0.013)      (0.116)        (0.131)
N                 1654         1654           1654
r2_a              0.008        0.199          0.199
F                 5.293        52.273         50.583
aic               1642.199     1293.396       1293.396
bic               1663.843     1342.095       1342.095
Standard errors in parentheses
* p < 0.05, ** p < 0.01, *** p < 0.001
4.5.1. Relationship between the dependent variable and the variable of interest
The relationship between earnings and children is log-level (log-lin). Hence, the coefficients can be
read as approximate percentage differences:
a) If the number of children is 1, income is lower (compared to women with 0
children, the base category) by about 0.019*100% = 1.9%
b) If the number of children is 2, income is lower (compared to women with 0
children, the base category) by about 0.066*100% = 6.6%
c) If the number of children is more than 2 (meaning 3, 4 or 5), income is
lower (compared to women with 0 children, the base category) by about
0.212*100% = 21.2%
That is for the simple regression. For the multiple regressions the logic
and the signs are the same; the only difference is in the final percentages, which
are 4.9, 10.8 and 19.9 for (a), (b) and (c) respectively.
The overall tendency is that the higher the number of children, the lower the income.
Moreover, all coefficients described above are statistically significant (except for the 1-child coefficient in the
simple regression).
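The rule "100% times the coefficient" is an approximation that works well for small coefficients. For a dummy variable in a log-level model, the exact percentage difference is 100*(exp(coef) - 1). The sketch below applies this standard formula to the children coefficients from the regression table; the function name exact_percent_change is my own.

```python
import math

def exact_percent_change(coef):
    """Exact percentage difference in the outcome implied by a coefficient
    on a dummy variable when the dependent variable is in logs."""
    return 100.0 * (math.exp(coef) - 1.0)

# Coefficients on the children categories, copied from the regression table
coefs = {
    "simple_reg": [-0.019, -0.066, -0.212],
    "multiple_reg": [-0.049, -0.108, -0.199],
}

# Print the exact percentage differences for each model, rounded to 1 decimal
for model, values in coefs.items():
    exact = [round(exact_percent_change(b), 1) for b in values]
    print(model, exact)
```

The exact differences are slightly smaller in magnitude than the approximate ones, most visibly for the largest coefficient: about 19.1% instead of 21.2% in the simple regression.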
4.5.2. Relationship between the dependent variable and the control variables

If I am not mistaken, it is not required to interpret these coefficients the way it is for the
variable of interest. However, to be safe, I will say that the logic is the same:
all relationships are log-level; hence, to interpret them, we can say that when a
control variable increases by 1 unit, income changes by approximately
100%*coef.
Moving on to the signs of the relationships:
1) Years of study: the coefficient is 0.068, meaning a positive relationship: the more
years of study, the higher the income (statistically significant). I predicted the
relationship sign correctly.
2) Age and age_2: the coefficients are 0.026 and -0.000, meaning a positive relationship
for age and a very small negative one for age_2, so income rises with age at a
diminishing rate (statistically significant). I predicted the relationship sign correctly.
3) Hours: the coefficient is -0.043, meaning a negative relationship: the more hours
worked, the lower the income (statistically significant). Interesting... This
is not the sign I expected. Maybe it is because of different salaries
per hour for different positions.
4) Skin color: the coefficient is 0.001, which is positive but not statistically significant, so the model
shows no difference in income between black and non-black females. Interesting as well... Again, my sign prediction was
wrong. Possibly the level of tolerance is now much higher and all people are
treated equally.
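The combined role of age and age_2 can be made explicit with a short derivation. It is a sketch that assumes B3 and B4 denote the coefficients on age and age_2 in the econometric equation of Part 4; because B4 is reported only as -0.000 after rounding, the turning point cannot be evaluated from the table.

```latex
\frac{\partial\, \text{log\_earn}}{\partial\, \text{age}} = B_3 + 2 B_4 \cdot \text{age},
\qquad
\text{age}^{*} = -\frac{B_3}{2 B_4}
```

With B3 > 0 and B4 < 0, the marginal effect of age is positive up to age* and negative after it.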
4.5.3. Inference about the models overall

The main indicator of model quality in our case is R squared, specifically adjusted R
squared, which takes into account the number of variables used.
For the simple linear regression it is only 0.008, while for the multiple regression it is roughly
0.20! This means that the multiple regression explains much more of the variation in
log earnings. So, as the final model, I would use the heteroscedasticity-corrected
multiple regression model (yes, adjusted R squared is the same for both multiple regressions, since the
correction changes only the standard errors, but as was said at the lecture: one cannot simply use a regression without meeting the
assumptions).
Part 5: Limitations and possible endogeneity problem

We have estimated the model from the scientific or analytical point of view. But does
the model make sense from simple human logic? Does it have limitations?
1) Definitely, yes. The first limitation: I did not consider all possible variants of
regressions and variables given in the dataset, due to the limited timeframe.
2) R squared is not very high, meaning there is room to improve.
3) Sample size: by the law of large numbers,
the larger the sample, the closer the estimated coefficients tend to be to their true
values. Here we have only a subsample from 1 country.
4) Possibility of an endogeneity problem: this is the case when the error term is
correlated with one or more regressors. In simple words, it arises when
one or several important variables that are related to the regressors are not taken
into account (omitted variables). That is a main source of endogeneity. In our case, the
unavailable variables that can influence the model are:
a) Average salary per industry: it is obvious that some industries pay
more than others, so I think it would improve our model.
b) Number of children 16-18 years old: we are given information about how
many children under 15 a woman has; however, 16-18-year-old children
are still children, and they should have been considered as well.
c) Years of work in the current company and position: it can be the case that
employees are paid more not only for their achievements and
productivity, but also for their loyalty. It can also change our model.
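Part 6: Answer to the RQ
No, the number of children does not increase the income of females. In this sample, women with more children have lower incomes on average, also after controlling for education, age, hours and skin color.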
