Research Assignment Part I


Part I Data analysis

Question 1. Classify your variables as: continuous, dummy, ordinal, nominal

• Ans. Use tab varname for categorical variables.

Use sum varname for continuous variables.
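A minimal sketch of how these two commands might be applied, assuming the assignment's dataset is already loaded in Stata and that the variables carry the names used in the later commands of this document (region, sex, employed, satisfactionrate, income, consumption and so on):

```stata
* Frequency tables for the categorical variables (nominal, binary, ordinal)
tabulate region
tabulate sex
tabulate employed
tabulate satisfactionrate

* Mean, SD, min and max for the continuous variables
summarize income consumption numbofindividu ageofhhhead priceofteff
```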

Type         Variable               Category   Number  Per cent  Mean      SD        Max    Min    Remark
Categorical  region                 1 Tigray   10      25.64                                       nominal
                                    2 Amhara   8       20.51
                                    3 Oromiya  9       23.08
                                    4 SNNP     12      30.77
             woreda                 1          12      30.77                                       nominal
                                    2          14      35.90
                                    3          13      33.33
             employed               public     15      38.46                                       binary
                                    private    24      61.54
             sex                    female     12      30.77                                       binary
                                    male       27      69.23
             happiness              1          5       12.82                                       ordinal
                                    2          8       20.51
                                    3          3       7.69
                                    4          4       10.26
                                    5          7       17.95
                                    6          3       7.69
                                    7          4       10.26
                                    8          5       12.82
             satisfaction rate      1          9       23.08                                       ordinal
                                    2          14      35.90
                                    3          16      41.03
Continuous   income                                              8807.128  10925.38  45000  -100
             consumption                                         5029.487  14112.63  90000  300
             number of individuals                               4.410256  2.807115  14     1
             age of hh head                                      34.92308  10.9024   65     23
             price of teff                                       1688.105  1704.931  7000   -1000

Question 2. Label the following variables:

• region (1 "Tigray" 2 "Amhara" 3 "Oromia" 4 "SNNP")

. label define region 1"Tigray" 2"Amhara" 3"Oromiya" 4"SNNP"

. label value region region

• sex (1 Female, 2 Male)

. label define sex 1"Female" 2"Male"

. label value sex sex

• employed (1 public, 2 private)

. label define employed 1"public" 2"private"

. label value employed employed

• satisfaction: 1 means not satisfied, 2 satisfied and 3 very satisfied

. label define satisfactionrate 1"not satisfied" 2"satisfied" 3"very satisfied"

. label value satisfactionrate satisfactionrate

4. Check for the existence of outliers using graphs and statistical measures (kurtosis, skewness, mean,
median and SD)

• Ans. To detect outliers graphically we use box plots, which apply only to continuous variables.

The box plot flags as potential outliers any points lying more than 1.5 times the interquartile range
(IQR) below the first quartile or above the third quartile. Other cut-off points are also in common
use, such as 3 times the IQR, and the box plot does not give us the flexibility to change the cut-off.
We will therefore re-check the suspected variables with other commands, such as the extremes
command.

. graph box income

. graph box consumption

. graph box numbofindividu


. graph box ageofhhhead

. graph box priceofteff
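The box-plot rule can also be checked numerically. The following is a sketch only, assuming the dataset is in memory; the local macro names q1, q3 and iqr are illustrative and do not come from the original session:

```stata
* Quartiles of consumption are returned in r(p25) and r(p75)
quietly summarize consumption, detail
local q1  = r(p25)
local q3  = r(p75)
local iqr = `q3' - `q1'

* Points beyond the usual 1.5 x IQR fences (the box-plot rule)
list consumption if (consumption < `q1' - 1.5*`iqr' | consumption > `q3' + 1.5*`iqr') & !missing(consumption)

* Stricter rule: 3 x IQR beyond the quartiles
list consumption if (consumption < `q1' - 3*`iqr' | consumption > `q3' + 3*`iqr') & !missing(consumption)
```

Because the locals are defined and used together, the lines should be run as one block (for example from a do-file).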

Let’s check those variables again with the extremes command, using a cut-off of 3 times the
interquartile range

. extremes income, iqr(3)

The command returns no output, which means that income has no value lying more than 3 times the
interquartile range beyond the quartiles.
. extremes consumption ,iqr(3)

obs: iqr: consum~n

34. 25.909 90000

The consumption variable has one value (observation 34, 90000) lying more than 3 times the
interquartile range beyond the quartiles.

Let’s use histograms for just the income, consumption and price of teff variables

The command is histogram varname
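Applied to the three variables named above, the command might look as follows. This is a sketch assuming the dataset is in memory; the normal option, which overlays a normal density for comparison, is an addition not used in the original:

```stata
* One histogram per variable, with a normal density overlaid
histogram income, normal
histogram consumption, normal
histogram priceofteff, normal
```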


As we can see from the graphs, the consumption and income distributions are strongly skewed, which is
one sign that outliers are present.

We can also identify outliers using z-scores: we generate a standardized version of the variable, and
any value more than 3 standard deviations away from the mean is considered an outlier

. egen stdcons=std( consumption)

After sorting the data we can clearly see that observation 34 of the consumption variable lies more
than 6 standard deviations above the mean
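A short sketch of the sorting and listing step, assuming the dataset is in memory. The variable name z_cons is hypothetical, chosen so that it does not clash with stdcons created above:

```stata
* Standardize consumption: (value - mean) / SD
egen z_cons = std(consumption)

* Sort from largest to smallest and show values more than 3 SDs from the mean
gsort -z_cons
list consumption z_cons if abs(z_cons) > 3 & !missing(z_cons)

* Hand check for the largest value, using the mean and SD of consumption
display (90000 - 5029.487) / 14112.63
```

The hand check gives roughly 6.02, in line with the statement that the value is more than 6 standard deviations out. Note that gsort changes the order of the observations in memory.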

We can also use the skewness and kurtosis measures to check for the existence of outliers in our data

. tabstat income consumption numbofindividu ageofhhhead priceofteff, stats(mean sd
> median max min count variance skewness kurtosis)

stats income consum~n numbof~u ageofh~d priceo~f

mean 8807.128 5029.487 4.410256 34.92308 1688.105


sd 10925.38 14112.63 2.807115 10.9024 1704.931
p50 4000 2500 3 34 945
max 45000 90000 14 65 7000
min -100 300 1 23 -1000
N 39 39 39 39 38
variance 1.19e+08 1.99e+08 7.879892 118.8623 2906789
skewness 1.662905 5.807179 1.748539 1.168226 1.271826
kurtosis 5.124449 35.49622 6.413925 3.890963 4.178991

We check the skewness and kurtosis values against the rule-of-thumb cut-off points used here: between
-3 and 3 for skewness and between -6 and 6 for kurtosis. The consumption variable has a skewness
greater than 3 (5.81) and a kurtosis greater than 6 (35.50), suggesting the existence of outliers.

Hence we will generate the natural log of consumption to minimize the effect of the outlier

. gen lcon=ln(1+ consumption)

Checking for skewness and kurtosis


. tabstat lcon ,stats(mean sd median max min count variance skewness kurtosis)

variable mean sd p50 max min N

lcon 7.698604 1.088007 7.824446 11.40758 5.70711 39

variable variance skewness kurtosis

lcon 1.183759 .5550591 4.958275

We can see that the skewness and kurtosis of the newly generated variable are within the acceptable
range, so we may use the lcon variable instead of the original consumption variable for further analysis.
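The before-and-after comparison could be put side by side as sketched below, assuming lcon has been generated as above. The sktest command is an addition, not part of the original session:

```stata
* Skewness and kurtosis before and after the log transformation
* (Stata reports raw kurtosis, for which a normal distribution gives 3)
tabstat consumption lcon, stats(skewness kurtosis)

* Skewness/kurtosis test of normality for both variables
sktest consumption lcon
```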

6. Generate a variable saving (income - consumption); use the new variable name saving

• Ans
. gen saving= income- consumption

7. Generate a dummy variable for saving; use the new variable name svd

• Ans
. gen svd= saving>0
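One caveat worth keeping in mind: Stata treats a missing value as larger than any number, so saving>0 evaluates to 1 for an observation whose saving is missing. Income and consumption both have 39 observations here, so it makes no difference in this dataset, but a more defensive version might look like the sketch below (the name svd_safe is hypothetical):

```stata
* Dummy = 1 if saving is positive, 0 otherwise, and missing where saving is missing
generate svd_safe = saving > 0 if !missing(saving)

* Compare with the original dummy; the two should agree when nothing is missing
tabulate svd svd_safe, missing
```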

8. Generate a three-category variable for saving (high saver, middle saver and low saver) and use the
new variable name sact

Ans:
. xtile sact= saving,nq(3)

. label define sact 1"low saver" 2"middle saver" 3"high saver"

. label value sact sact

. tab sact

3 quantiles
of saving Freq. Percent Cum.

low saver 14 35.90 35.90


middle saver 12 30.77 66.67
high saver 13 33.33 100.00

Total 39 100.00

9. Cross tabulate between svd and sex


. label define svd 0"no saving" 1"has saving"

. label value svd svd

. tabulate svd sex

sex
svd Female Male Total

no saving 3 11 14
has saving 9 16 25

Total 12 27 39

10. Cross tabulate between svd and employed

. tabulate svd employed

Employed
svd public private Total

no saving 10 4 14
has saving 5 20 25

Total 15 24 39

11. Cross tabulate between svd and region

. tabulate svd region

region
svd Tigray Amhara Oromiya SNNP Total

no saving 3 7 3 1 14
has saving 7 1 6 11 25

Total 10 8 9 12 39

12. Compute the average saving difference between those who are employed in the public sector
and private sector

. tab employed,sum( saving)

Summary of saving
Employed Mean Std. Dev. Freq.

public -6.6666667 922.31593 15


private 6142.8333 13736.628 24

Total 3777.641 11122.484 39

Hence the average saving difference (private minus public) is 6142.8333 - (-6.6667) = 6149.50 birr
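The difference in means can also be tested formally. A sketch, assuming the dataset is in memory; the t test and the unequal option (chosen because the two standard deviations above differ widely) are additions, not part of the original answer:

```stata
* Two-sample t test of mean saving by employment sector,
* allowing the two groups to have different variances
ttest saving, by(employed) unequal

* Hand check of the difference in means from the summary table
display 6142.8333 - (-6.6666667)
```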


13. See the correlation between continuous variables, check the test
. pwcorr income consumption,sig

income consum~n

income 1.0000

consumption 0.6318 1.0000


0.0000

Income and consumption are positively correlated (r = 0.6318). The correlation is moderate, and since
the significance level (0.0000) is below 10%, the correlation between the two is statistically
significant.

. pwcorr priceofteff consumption,sig

priceo~f consum~n

priceofteff 1.0000

consumption 0.1412 1.0000


0.3977

The significance level (0.3977) is greater than 10%; hence the correlation between the price of teff
and consumption is not statistically significant
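The pairwise correlations can also be requested in a single table. A sketch assuming the dataset is in memory; the star() and obs options are additions that flag coefficients significant at the 10% level and show the number of observations behind each pair:

```stata
* All pairwise correlations with p-values, stars at the 10% level,
* and the number of observations used for each pair
pwcorr income consumption priceofteff, sig star(0.10) obs
```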

14. Find the covariance between saving and income

. cor saving income,cov


(obs=39)

saving income

saving 1.2e+08
income 2.2e+07 1.2e+08

Hence the covariance between saving and income is about 22,000,000 (2.2e+07).

15. Do pie chart for svd.


. graph pie,over( svd)
16. If your research title is determinants of saving and you identified consumption, sex and
satisfaction rate as independent variables

a) What is the appropriate model? Regress based on that, and please use the following format to
report your result

Ans a. The appropriate model is OLS regression, since our dependent variable (saving) is continuous.
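The regression command itself is not shown in the source. Judging from the variable names in the VIF output and from the logit command used later in this document, it was presumably along these lines (an assumption, not the author's recorded command):

```stata
* OLS with saving as the dependent variable;
* the i. prefix tells Stata to treat sex and satisfactionrate as categorical
regress saving consumption i.sex i.satisfactionrate
```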

saving              Coef.       P-value (P>|t|)
consumption         -0.6268     0.000
sex
  male              2186.426    0.421
satisfactionrate
  Satisfaction 2    8021.96     0.017
  Satisfaction 3    -2959.358   0.328
R2 = 0.6386
P>F = 0.000

b) Interpret overall significance, individual significance, R2 and coefficients

The overall significance test (F test) gives a P>F value of 0.000, which is below 10%; hence the model
as a whole is significant, i.e. the independent variables as a group significantly affect the dependent
variable.

The R2 value of 0.6386 indicates that 63.86% of the variation in the dependent variable, saving, is
explained by the variation in the independent variables.

Individually, consumption and satisfaction 2 are significant, since their p-values are less than 10%,
while the other two (male and satisfaction 3) are not significant.

For consumption, which is continuous, the coefficient is interpreted directly: for every one-unit
increase in consumption, saving decreases by 0.6268 on average, holding the other variables constant.
For the categorical variables the interpretation is relative to the base category. On average, the
saving of males is 2186.426 birr higher than that of females; the saving of the satisfaction 2 category
is on average 8021.96 birr higher than that of satisfaction 1; and the saving of the satisfaction 3
category is on average 2959.358 birr lower than that of satisfaction 1. The male and satisfaction 3
differences are not statistically significant.

c) Do the four major post estimation tests and comment on them. If there are problems, what is the
solution?

• Checking for multicollinearity: every VIF, and the mean VIF of 1.45, is well below 10, so there is
no serious multicollinearity among the independent variables.

. vif

Variable VIF 1/VIF

consumption 1.11 0.902812


2.sex 1.20 0.833647
satisfacti~e
2 1.83 0.545992
3 1.68 0.596639

Mean VIF 1.45

 Omitted variable test

. ovtest

Ramsey RESET test using powers of the fitted values of saving


Ho: model has no omitted variables
F(3, 31) = 6.65
Prob > F = 0.0013

Since the p-value (0.0013) is less than 10%, we reject the null hypothesis that the model has no
omitted variables. The model appears to be misspecified (relevant variables or non-linear terms may be
missing), so we need to review the specification, include the relevant variables and regress again.

• Heteroskedasticity test

➢ The null hypothesis is that the variance of the residuals is homogeneous (constant). If the
p-value is very small, we reject the null hypothesis in favour of the alternative that the
variance is not homogeneous. In our case the p-value is smaller than 10%, so we reject the
null: the variance of the residuals is not homogeneous.
➢ The appropriate remedy for this issue is to regress again with the robust option.
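The test command and its output are not shown in the source. A sketch of how the check and the suggested remedy might be run, assuming the OLS specification matches the variables listed in the VIF output:

```stata
* Re-estimate so that the post-estimation test has a model to work on
regress saving consumption i.sex i.satisfactionrate

* Breusch-Pagan / Cook-Weisberg test; H0: constant variance
estat hettest

* Remedy: same model with heteroskedasticity-robust standard errors
regress saving consumption i.sex i.satisfactionrate, vce(robust)
```

Robust standard errors leave the coefficients unchanged and only adjust the standard errors and p-values.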

Normality test
. predict e,resid

. swilk e

Shapiro-Wilk W test for normal data

Variable Obs W V z Prob>z

e 39 0.95076 1.909 1.359 0.08713

➢ The null hypothesis is that the residuals are normally distributed. In our case the p-value
(0.0871) is below 10%, so at the 10% level we reject the hypothesis that e is normally
distributed (at the 5% level we would not reject it). The solution is to check whether there
are outliers in the continuous data and take appropriate measures.
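Graphical checks can complement the Shapiro-Wilk test. A sketch, assuming the residual variable e created by predict above is still in memory; these commands are additions to the original session:

```stata
* Kernel density of the residuals with a normal curve overlaid
kdensity e, normal

* Quantile-normal plot: points far from the line indicate non-normality
qnorm e

* Skewness/kurtosis test as a second formal check
sktest e
```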

17. If your research is determinants of saving where your dependent variable is svd and you identified
consumption, sex and satisfaction rate as independent variables

a) Which variable is your dependent and find the appropriate model?

The dependent variable is svd, which is binary; hence the appropriate model is the logit
model.
b) Interpret overall significance, individual significance, R2 and coefficients

. logit svd consumption i.sex i.satisfactionrate

Iteration 0: log likelihood = -25.460206


Iteration 1: log likelihood = -20.052519
Iteration 2: log likelihood = -19.873074
Iteration 3: log likelihood = -19.872591
Iteration 4: log likelihood = -19.872591

Logistic regression Number of obs = 39


LR chi2(4) = 11.18
Prob > chi2 = 0.0247
Log likelihood = -19.872591 Pseudo R2 = 0.2195

svd Coef. Std. Err. z P>|z| [95% Conf. Interval]

consumption -.0000366 .0000372 -0.98 0.326 -.0001095 .0000364


2.sex -1.796979 1.037766 -1.73 0.083 -3.830963 .2370059

satisfactionrate
2 -.5364898 1.281625 -0.42 0.676 -3.048429 1.975449
3 -2.917887 1.305471 -2.24 0.025 -5.476563 -.3592119

_cons 3.578776 1.430678 2.50 0.012 .7746976 6.382854

Prob > chi2: this is the probability of obtaining a chi-square statistic at least as large as
the one observed (11.18) if the null hypothesis is true, i.e. if the independent variables,
taken together, in fact have no effect on the dependent variable. This p-value is compared
to a chosen significance level, perhaps 0.1, 0.05 or 0.01, to determine whether the overall
model is statistically significant. In this case the model is statistically significant
because the p-value (0.0247) is less than 0.1.
The R2 in the logit model is a pseudo R2 (0.2195 here). Logistic regression does not have an
equivalent to the R-squared that is found in OLS regression.

For individual significance we check the p-value (P>|z|): if it is less than 0.1 the variable
is considered significant, otherwise insignificant. Hence from the above output only 2.sex
(p = 0.083) and satisfactionrate 3 (p = 0.025) are significant.

The coefficient (or parameter estimate) for the variable consumption is -0.0000366. This
means that for a one-unit increase in consumption we expect a 0.0000366 decrease in the
log-odds of the dependent variable svd, holding all other independent variables constant.
Similarly, the log-odds of saving are 1.797 lower for males than for females, and 2.918 lower
for the very satisfied category than for the not satisfied (base) category.
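Log-odds are hard to read directly, so the same model is often reported as odds ratios or as marginal effects on the probability. A sketch, assuming it is run on the same dataset; these commands are additions to the original session:

```stata
* Same model reported as odds ratios (exponentiated coefficients)
logit svd consumption i.sex i.satisfactionrate, or

* Average marginal effects on the probability of saving
margins, dydx(*)

* Hand check: odds ratio for male versus female from the coefficient -1.796979
display exp(-1.796979)
```

The hand check gives about 0.166, i.e. the odds of saving for males are roughly one sixth of those for females, other things equal.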
c) Check post estimation test

. lfit

Logistic model for svd, goodness-of-fit test

number of observations = 39
number of covariate patterns = 34
Pearson chi2(29) = 27.56
Prob > chi2 = 0.5415
• Since the p-value (0.5415) is greater than 10%, we cannot reject the null hypothesis that the model
fits the data, so we can say that the model fits adequately

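d) Find the probability of saving for a person who is male and very satisfied and who has an
average consumption value
log(p/(1-p)) = b0 + b1*(mean consumption) + b2*(sex = male) + b3*(satisfactionrate = 3)
where p is the probability of a person saving. The indicators for male and for satisfaction
category 3 each equal 1 for this person (a dummy variable takes the value 0 or 1, not the
category code), so:
log(p/(1-p)) = 3.578776 - 0.0000366*5029.487 - 1.796979*1 - 2.917887*1

= 3.578776 - 0.184079 - 1.796979 - 2.917887 = -1.320169

p = 1/(1 + exp(1.320169)) ≈ 0.21

• Converting the log-odds to a probability, we get a probability of saving of about 0.21 (21%)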
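The conversion from log-odds to a probability can be done in Stata itself. In the sketch below the indicators for male (2.sex) and for satisfaction category 3 enter with the value 1, since they are 0/1 dummies; the margins line is an addition and assumes it is run right after the logit command above:

```stata
* Hand calculation: inverse logit of the linear predictor,
* with the male and "very satisfied" dummies set to 1
display invlogit(3.578776 - 0.0000366*5029.487 - 1.796979 - 2.917887)

* Same quantity from the fitted model, at mean consumption
margins, at(sex=2 satisfactionrate=3 (mean) consumption)
```

The hand calculation gives a probability of roughly 0.21; the margins result may differ slightly because it uses unrounded coefficients.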

18. If your research is on determinants of households satisfaction

a) What is your dependent variable? And the appropriate model

• Ans: the dependent variable will be satisfactionrate and the appropriate model will
be ordered logit, since the satisfaction rate is ordinal

b) If the independent variables include: price, family size, employed ….

c) Interpret overall significance, individual significance, R2 and coefficients


. ologit satisfactionrate priceofteff numbofindividu i.employed

Iteration 0: log likelihood = -40.885697


Iteration 1: log likelihood = -38.023094
Iteration 2: log likelihood = -37.985062
Iteration 3: log likelihood = -37.984953
Iteration 4: log likelihood = -37.984953

Ordered logistic regression Number of obs = 38


LR chi2(3) = 5.80
Prob > chi2 = 0.1217
Log likelihood = -37.984953 Pseudo R2 = 0.0709

satisfactionrate Coef. Std. Err. z P>|z| [95% Conf. Interval]

priceofteff -.0002888 .0002455 -1.18 0.239 -.0007699 .0001923


numbofindividu .1224913 .1300952 0.94 0.346 -.1324905 .3774732
2.employed .0241583 .7809356 0.03 0.975 -1.506447 1.554764

/cut1 -1.322919 .9603114 -3.205095 .5592564


/cut2 .4912193 .9367908 -1.344857 2.327296

• Prob > chi2: this is the probability of getting a likelihood ratio (LR) test statistic as
extreme as, or more extreme than, the one observed under the null hypothesis; the null
hypothesis is that all of the regression coefficients in the model are equal to zero.
This p-value is compared to a specified alpha level, our willingness to accept a type I
error, which is set here at 0.1. In our output the LR chi2(3) statistic is 5.80 with
Prob > chi2 = 0.1217, which is greater than 0.1, so we fail to reject the null
hypothesis: taken together, the predictors are not statistically significant.
• z and P>|z|: these are the test statistic and p-value, respectively, for the null
hypothesis that an individual predictor’s regression coefficient is zero given that the
rest of the predictors are in the model. The test statistic z is the ratio of the Coef. to
the Std. Err. of the respective predictor. The z value follows a standard normal
distribution, which is used to test against a two-sided alternative hypothesis that the
Coef. is not equal to zero. The probability that a particular z test statistic is as
extreme as, or more extreme than, what has been observed under the null hypothesis is
given by P>|z|. The z test statistic for the predictor priceofteff
(-0.0002888/0.0002455) is -1.18 with an associated p-value of 0.239. With our alpha
level set to 0.1, we fail to reject the null hypothesis and conclude that the regression
coefficient for price is not statistically different from zero in estimating satisfaction,
given that employed and family size are in the model. The same holds for
numbofindividu (p = 0.346) and 2.employed (p = 0.975).

• R2 is a pseudo R2 (0.0709 here); it does not have the same meaning as the R2 in OLS
regression, i.e. it is not the share of the variation in the dependent variable explained
by the independent variables.

• Coef.: these are the ordered log-odds (logit) regression coefficients. The standard
interpretation of an ordered logit coefficient is that for a one-unit increase in the
predictor, the response variable level is expected to change by its respective
regression coefficient on the ordered log-odds scale while the other variables in the
model are held constant. For example, a one-unit increase in numbofindividu is
associated with a 0.1225 increase in the ordered log-odds of being in a higher
satisfaction category, although this effect is not statistically significant.
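Ordered log-odds are easier to communicate as odds ratios or as predicted probabilities for each category. A sketch, assuming it is run on the same dataset; these commands are additions to the original session:

```stata
* Ordered logit reported as proportional odds ratios
ologit satisfactionrate priceofteff numbofindividu i.employed, or

* Average predicted probability of each satisfaction category (1, 2, 3)
margins, predict(outcome(1))
margins, predict(outcome(2))
margins, predict(outcome(3))
```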

19. If your study is determinants of households choice for saving institutions,

a) What is your dependent variable? And the appropriate model

• Ans. The dependent variable is the choice of saving institution, which is nominal, and the
appropriate model will be multinomial logit

b) If the independent variables are: saving, age, sex, employed

c) Interpret overall significance, individual significance, R2 and coefficients

• We use the command mlogit in Stata
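The dataset in this assignment has no variable recording the saving institution chosen, so the variable name institution below is purely hypothetical, as is the choice of category 1 as the base outcome. With that caveat, the command might be sketched as:

```stata
* institution: hypothetical nominal variable coding the saving institution chosen
* baseoutcome(1) sets category 1 as the reference category
mlogit institution saving ageofhhhead i.sex i.employed, baseoutcome(1)

* Redisplay the results as relative-risk ratios instead of coefficients
mlogit, rrr
```

Each set of coefficients is then read relative to the base category, in the same way as the categorical predictors are read relative to their base level.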

20. If your study is determinants of saving and if your dependent variable is sact,

d) What is your dependent variable? And the appropriate model

• Ans. The dependent variable is sact and the appropriate model will be ordered logit

e) If the independent variables are: age, employed

f) Interpret Overall significant, individual significant, R2 and coefficients

. ologit sact ageofhhhead i.employed

Iteration 0: log likelihood = -42.76888


Iteration 1: log likelihood = -34.505882
Iteration 2: log likelihood = -34.335307
Iteration 3: log likelihood = -34.334718
Iteration 4: log likelihood = -34.334718

Ordered logistic regression Number of obs = 39


LR chi2(2) = 16.87
Prob > chi2 = 0.0002
Log likelihood = -34.334718 Pseudo R2 = 0.1972

sact Coef. Std. Err. z P>|z| [95% Conf. Interval]

ageofhhhead -.0297168 .0312603 -0.95 0.342 -.0909858 .0315522


2.employed 2.833371 .7796339 3.63 0.000 1.305316 4.361425

/cut1 -.1794248 1.131534 -2.397191 2.038341


/cut2 1.690048 1.186995 -.6364194 4.016515

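• Ans. The interpretation follows the same logic as in question 18 above. Here the model as a whole
is significant (LR chi2(2) = 16.87, Prob > chi2 = 0.0002). 2.employed is individually significant
(p = 0.000): private employment raises the ordered log-odds of being in a higher saving category by
2.833, while ageofhhhead is not significant (p = 0.342).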
Part II Qualitative part
Case I
1. What could be the possible title of the research?
Ans:
• Determinant factors of job satisfaction in the public sector: the case of Mash Woreda,
Sheka Zone.
2. What are the limitations associated with the problem statement?
Ans:
• The problem statement as a whole lacks coherence and the problem is not well stated;
in addition, the extent and severity of the problem are not mentioned at all.
3. What are your comments with the objectives?

Ans:

• Some of the objectives are very difficult to understand.
• They are not specific.
• Some of the specific objectives are repeated.
4. Comment on the unit of analysis? If it is wrong, what do you suggest?
Ans:
• The researcher’s chosen unit of analysis, the managers of the public
sector, does not match the group whose satisfaction he wants to study.
Hence the unit of analysis should be the employees of the selected public
offices, not just the managers.
5. Comment on the sample size determination and sampling technique?
Ans:
• The researcher chooses 45 employees out of the 1200 employees in the
woreda, but it is not clear what technique he used to decide the
sample size.
• The sampling technique the researcher chooses is purposive, but no
justification is given.
6. Comment on the type of data
Ans:
• Regarding the data type I can see no problem, as the researcher wants to
use both primary and secondary data.
7. Comment on the data collection method

Ans:

• In this paper the researcher says nothing about how he plans to collect the
data.

8. Comment on the method of analysis

Ans:
• The decision to use OLS regression does not seem appropriate in this situation,
because the dependent variable is categorical, in which case OLS is not
suitable.
9. Comment on the nature of the dependent variable
Ans:
• The dependent variable is ordinal, describing the level of
job satisfaction of the employees.
10. Could you list possible independent variables?
Ans:
• The possible independent variables include age, sex, marital status,
experience, salary level, job freedom, family size, ethnic background,
etc.
11. Can you draw conceptual framework
Figure 1.1: Conceptual framework (diagram not reproduced). Factor groups shown: Demographic
(age, sex, race); Economy (salary, employment type); Institutional factor (transparency, job
freedom).

Source: own demonstration

12. What are your comments on the model specifications and estimation method? If it
is wrong, what do you suggest?
Ans:
• The model chosen in this paper is the OLS regression model, but the
dependent variable is ordinal; hence an ordered logit model
is more appropriate.

Case II
1. What could be the possible title of the research?
Ans:

• The impact of microfinance institutions on the living standard and poverty
reduction of farmers: the case of Amhara region credit and finance, Dejen district

2. What are the limitations associated with the problem statement?

Ans:
• The problem statement as a whole lacks coherence and the problem is not well stated;
in addition, the extent and severity of the problem are not mentioned at all
• No scientific papers are cited
• Most of the discussion is not related to the objective
• Abbreviations are difficult to understand
• There are language issues
3. What are your comments with the objectives?

Ans:

• The objectives look good.

4. Comment on the unit of analysis? If it is wrong, what do you suggest?
Ans:
• The researcher’s chosen unit of analysis is the beneficiaries of the MFI, but
it should also include those who are not taking loans from the MFI in order to
demonstrate the impact.
5. Comment on the sample size determination and sampling technique?
Ans:
• An appropriate method of sample size determination is used, but no specific
sampling method is stated.
6. Comment on the type of data
Ans:
• Regarding the data type, the researcher had better use both primary and
secondary data, since the objective of the research is to identify
the impact of the MFI on poverty reduction, which might be difficult to
demonstrate using primary data alone.
7. Comment on the data collection method

Ans:

• In this paper the researcher says nothing about how he plans to collect the
data.
8. Comment on the method of analysis
Ans:
• The decision to use a logit model is acceptable if the dependent variable is
binary (for example, under poverty or not), but an ordered logit model
would be more useful for describing poverty at different levels.
9. Comment on the nature of the dependent variable
Ans:
• The dependent variable can be binary or ordinal.
10. Could you list possible independent variables?
Ans:
• The possible independent variables include age, sex, marital status,
family size, etc.
11. Can you draw conceptual framework
Figure 2.1: Conceptual framework (diagram not reproduced). Factor groups shown: Geography
(rain condition, access road); Economy (source of income, access to loan); Demographic
(age, sex, religion); Personal (marital status, family size).

Source: own demonstration

12. What are your comments on the model specifications and estimation method? If it
is wrong, what do you suggest?
Ans:
• The model chosen in this paper is the logit model, and it is acceptable to use this
model since poverty can be described as a binary variable.

Case III: Given the following research title: “Factors affecting loan repayment performance
of small holder farmers: The case of ACSI” answer the following questions

a) Write the statement of the problem (write only half a page)

In countries like Ethiopia, smallholder farmers dominate the economy: they work on 96.3 percent of
the total cultivated area and produce over 95 percent of the national crop production (CSA, 2007).
However, smallholder farmers face a severe shortage of financial resources to purchase productive
agricultural inputs. In addition, the rapid rise in the price of agricultural inputs every year
increases the demand for financial institutions that can support these subsistence farmers.

It is important that borrowed funds be invested for productive purposes, and that the additional
income generated be used to repay loans, so that production processes and credit institutions remain
sustainable and viable. However, failure by farmers to repay their loans on time, or to repay them at
all, has been a serious problem faced by both credit institutions and smallholder farmers. Poor loan
repayment in developing countries has become a major problem in agricultural credit
administration, especially for smallholders who have limited collateral capabilities (Okorie, 2004).

Failure by farmers to repay their loans will discourage credit institutions from giving credit to
other farmers who need financial support, which will eventually lead to system failure.

Therefore this study will try to identify the different factors affecting the loan repayment
performance of smallholder farmers. Few studies have been conducted on this topic (Million
2012; Kebede 2016). All previous studies used descriptive analysis, which does not describe the
important variables well; hence this study will use econometric analysis to assess the extent of the
relationships among variables and will incorporate important socio-economic variables which were
missing in previous studies.

b) Write objectives of the study (write at least 2 specific objectives)

1. To identify the different variables which affect the loan repayment of farmers.

2. To estimate the extent of the effect of the important variables on loan repayment and to
suggest policy options.

b) Unit of analysis (write only 1 line)

• The unit of analysis will be smallholder farmers who are using or have previously used
the loan service
c) Type of data (write only 1 line)
• Both primary and secondary data can be used
d) Population (write only 1 line)
• All smallholder farmers living in the study area
e) Sampling techniques (write only 1 line)
• Random sampling technique will be used
f) Sample size determination (write only 2 lines)
• The sample size can be determined using Slovin’s formula or another appropriate
sample size determination method.
g) Dependent variable and Independent variables (write only 2 lines)
• The dependent variable can be binary, taking a value of 0 or 1
depending on whether the smallholder repays or not; the independent variables can be
continuous, like age, family size and distance from the credit facility, or
categorical, like having credit experience or not, and marital status.
h) The characteristics of the dependent variable (write only 1 line)
• The dependent variable can be binary with two outcomes.
i) The model (write only 1 line)
• The appropriate model will be the logit model, since the dependent variable is
binary.
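For item f, Slovin's formula is n = N / (1 + N*e^2), where N is the population size and e the margin of error. The sketch below uses illustrative values only (N = 1200 and e = 0.05 are assumptions, not figures from the proposed study):

```stata
* Slovin's formula: n = N / (1 + N*e^2)
* N and e below are illustrative values only
local N = 1200
local e = 0.05
display ceil(`N' / (1 + `N' * `e'^2))
```

With these assumed values the formula gives a sample of 300. The three lines should be run together, since local macros only exist within the block in which they are defined.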
