
Assignment 3 Exploratory Visual Analysis

Exploratory Visual Analysis

Introduction
This report gives a step-by-step overview of how I approached the given dataset, formulated my
questions, and worked through the process of answering them. I start with a brief
overview of the dataset, its strengths and weaknesses, and how I reshaped it to suit my needs.
From there I move on to my first question, where I explain the methods I considered and the
decisions I took to arrive at the answer. I then explain how the answer to the first question
helped me frame the second question, and go on to answer it. I finish the
report with my reflections on the experience of performing an exploratory visual
analysis and the insights I gained from it.

Data Profile
The dataset being analyzed is the 2017 “World Development Indicators” report
(https://data.worldbank.org/data-catalog/world-development-indicators), released by The
World Bank. The dataset contains 6 CSV files in total. On examining them, I found that 5 of the
files are data dictionaries that contain metadata. The core data is in a single file,
WDIData (409992 rows and 61 columns, 191 MB in size).

Fig.1 Files in WDI dataset

This dataset is one of the most comprehensive reports available today on how countries across the
world are progressing on numerous factors, 1553 of them to be exact. The factors can be
broadly classified into Agriculture & Rural Development, Aid Effectiveness, Climate Change,
Economy & Growth, Education, Energy & Mining, Environment, External Debt, Financial Sector,
Gender, Health, Infrastructure, Labor & Social Protection, Poverty, Private Sector, Public Sector,
Science & Technology, Social Development, Trade, and Urban Development. This is
time-series data spanning 1960 to 2016, showing how different countries perform
each year in the areas mentioned above. The report covers 217 countries and 47 aggregate
economies.
Turning to the data itself, the first thing I noticed was that the dataset is very sparse. Out
of the 1553 factors, only 60 or so had no missing values. This made it difficult to
identify significant factors and compare countries on them during the initial data
exploration. The second thing was the arrangement of the data; since all the factors were
given as values in a column, it proved very difficult to navigate in Tableau. Hence, I
decided to change the arrangement. I created a column for each of the 1553 factors and
gathered the years, which were previously in individual columns, into a single column. Each
factor then appeared as a measure in Tableau, which made it easy to filter and group the data.
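
The reshaping described above can be sketched in pandas roughly as follows. This is an illustrative sketch rather than the exact script used: the function name `reshape_wdi` is hypothetical, the tiny table stands in for WDIData with invented values, and the column names "Country Name" and "Indicator Name" are assumed to match the file's headers.

```python
import pandas as pd

# Tiny stand-in for WDIData; all values are invented placeholders.
wide = pd.DataFrame({
    "Country Name": ["A", "A", "B", "B"],
    "Indicator Name": ["X", "Y", "X", "Y"],
    "2013": [1.0, 10.0, 3.0, 30.0],
    "2014": [2.0, 20.0, 4.0, float("nan")],
})

def reshape_wdi(df):
    """Return one row per (country, year) with one column per indicator."""
    # Year columns are the ones whose header is purely numeric, e.g. "1960".
    year_cols = [c for c in df.columns if c.isdigit()]
    # Step 1: gather the year columns into a single "Year" column.
    long = df.melt(
        id_vars=["Country Name", "Indicator Name"],
        value_vars=year_cols,
        var_name="Year",
        value_name="Value",
    )
    long["Year"] = long["Year"].astype(int)
    # Step 2: spread the indicators out into one column each.
    tidy = long.pivot_table(
        index=["Country Name", "Year"],
        columns="Indicator Name",
        values="Value",
        aggfunc="first",
    ).reset_index()
    tidy.columns.name = None
    return tidy

tidy = reshape_wdi(wide)
print(tidy)
```

With this layout each indicator becomes its own column, which is what lets a tool like Tableau treat it as a separate measure.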

Question 1
As mentioned before, I was aware that the WDI data has information on Education and that led
me to form my first question:
How is the world progressing on providing primary education to its children and what are the
influential factors?

I chose the factor ‘Children out of school (% of primary school age)’ as the basis for my
analysis. This factor gives the percentage of children of primary school age who are not
attending school. I started with the most obvious step: plotting how the world has changed
with respect to this factor since the start of this century. As this is a time-series plot, I went
with the most suitable choice: a points and line chart. This provided a concise and
precise view of the trend while keeping the ‘Data-Ink ratio’ high, thereby
adhering to Tufte’s principles.

Fig.2 Overview of out of school children across the years.


The visualization showed that the world is gradually improving in providing primary education,
but I wanted to investigate whether the improvement is happening uniformly across
all countries. I started by grouping the world according to
income level. The rationale was that I wanted to see whether the prevalence of education
in a country is independent of its income level. If the two were independent,
the distribution of out-of-school children would be more or less the same across all
groups, and the progress achieved would also be uniform. To do this, I used the 4 aggregate
economies provided: High Income, Upper Middle Income, Lower Middle Income and Low
Income.
I chose a stacked bar chart to visualize this, as it allowed me to depict the
share of out-of-school children in each of these economies, which in turn allowed me to
draw comparisons between them. In a time-series context it also does a decent job of showing
how the factor varied in each economy across the years.
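
Selecting the four income aggregates can be sketched as below, assuming the data has already been reshaped to one row per country and year with one column per factor. The function name `income_group_series` is hypothetical, the spelling of the aggregate names is an assumption about how they appear in the file, and the demo values are invented placeholders.

```python
import pandas as pd

# Assumed spelling of the aggregate names in the "Country Name" column.
INCOME_GROUPS = [
    "High income",
    "Upper middle income",
    "Lower middle income",
    "Low income",
]
INDICATOR = "Children out of school (% of primary school age)"

def income_group_series(tidy, first_year=2000):
    """Return a Year x income-group table of the out-of-school percentage."""
    keep = tidy["Country Name"].isin(INCOME_GROUPS) & (tidy["Year"] >= first_year)
    return tidy[keep].pivot(index="Year", columns="Country Name", values=INDICATOR)

# Invented placeholder rows, including one country and one early year
# that the filter should drop.
demo = pd.DataFrame({
    "Country Name": ["Low income", "Low income", "High income",
                     "High income", "Ethiopia", "Low income"],
    "Year": [2000, 2001, 2000, 2001, 2000, 1999],
    INDICATOR: [40.0, 38.0, 4.0, 3.5, 60.0, 42.0],
})
by_group = income_group_series(demo)
print(by_group)
```

A table in this shape, with one column per income group, maps directly onto a stacked bar chart with years on the horizontal axis.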

Fig.3 Distribution of children not attending primary school, by income levels


The result above was in line with my expectation that the majority of the children not
going to primary school were in low-income countries. I observed that low-income
countries had improved significantly in the past 15 years, from 45.61% in 2000 to 18.92%, while
the other economies had not improved as much. I was excited to see this level of improvement,
which prompted me to delve further into it. According to the World Bank classification [1],
most of the Low-Income countries are in Africa.
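
To put the low-income improvement in perspective, the two figures quoted above can be turned into an absolute and a relative change. The snippet is plain arithmetic on the reported values; the variable names are illustrative.

```python
# Out-of-school percentages for low-income countries, as quoted in the text.
start_pct = 45.61
end_pct = 18.92

# Absolute change, in percentage points.
point_drop = start_pct - end_pct
# Relative change, as a share of the starting value.
relative_drop = point_drop / start_pct * 100

print(f"{point_drop:.2f} percentage points")
print(f"{relative_drop:.1f}% relative reduction")
```

This works out to a drop of 26.69 percentage points, or a reduction of roughly 58.5% relative to the 2000 level.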

Fig.4 Countries classified as low income by The World Bank

I wanted to focus on a balanced subset of these countries, containing countries
that performed well, moderately well and not so well. I left out countries that didn’t have
enough data and chose the years 2002, 2006, 2010 and 2014 because the majority of the countries
had data for these years. I achieved this by applying a series of filters.
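
The filtering was done in Tableau, but the same idea can be sketched in pandas: keep only the countries that have a value in every one of the chosen years. The function name `countries_with_full_coverage`, the column name `out_of_school` and the demo values are hypothetical, and the sketch assumes one row per country and year.

```python
import pandas as pd

YEARS = [2002, 2006, 2010, 2014]

def countries_with_full_coverage(tidy, indicator, years=YEARS):
    """Return the countries that have a non-missing value in every given year."""
    subset = tidy[tidy["Year"].isin(years)]
    # count() ignores NaN, so it gives the number of years with actual data.
    counts = subset.groupby("Country Name")[indicator].count()
    return sorted(counts[counts == len(years)].index)

# Invented placeholder data: A is complete, B has a gap in 2010,
# and C only has rows for two of the four years.
demo = pd.DataFrame({
    "Country Name": ["A"] * 4 + ["B"] * 4 + ["C"] * 2,
    "Year": YEARS + YEARS + [2002, 2014],
    "out_of_school": [50.0, 45.0, 40.0, 35.0,
                      30.0, 28.0, float("nan"), 25.0,
                      20.0, 18.0],
})
kept = countries_with_full_coverage(demo, "out_of_school")
print(kept)
```

Choosing a balanced mix of strong, moderate and weak performers from the surviving countries remains a manual judgment that this filter does not make.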

Fig.5 Out of primary school population through the years


I chose a stacked area graph here because the varying thickness allows us to promptly discern
the change or progress made by each country. The color encoding of the countries is in line with
Mackinlay’s ranking. Although labeling each stream with its values lowers the
Data-Ink ratio, I decided to do so to improve readability. From the graph I noticed that Ethiopia
has made significant improvements, whereas Eritrea and Liberia have become worse. Then
there were countries like Senegal and Mali where the situation had not changed much. But
overall there was a shift towards improvement, as shown by the decreasing width of the graph
from left to right.
Moving on, I wanted to understand why performance differs between
these countries. After exploring the available factors in the dataset, I zeroed in on the factor
‘Government Expenditure on Education (% of GDP)’. I felt that the money a country’s
government spends on education would have a significant impact on the effectiveness and quality
of that education. I checked this by taking the 2014 values and comparing them with the data in
the previous graph. This gave me the final visualization given below.
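
The comparison in this report was made visually. One hedged way to check it numerically is a Pearson correlation between education spending and the change in the out-of-school rate. The function and column names below are hypothetical, and every value is a made-up placeholder rather than a figure from the WDI data. With so few countries, a coefficient like this can only be suggestive.

```python
import pandas as pd

def spending_vs_change(df):
    """Pearson correlation between spending and change in out-of-school rate."""
    return df["expenditure_pct_gdp"].corr(df["change_in_out_of_school"])

# Made-up placeholder values for four unnamed countries.
# A positive change means the out-of-school rate got worse.
demo = pd.DataFrame({
    "expenditure_pct_gdp": [2.0, 3.0, 4.0, 5.0],
    "change_in_out_of_school": [6.0, 2.0, -2.0, -6.0],
})
r = spending_vs_change(demo)
print(round(r, 3))
```

A strongly negative coefficient would be consistent with higher spending going together with a falling out-of-school rate, though it would not by itself establish that spending is the cause.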

Final Visualization
Fig.6 Out of school children in low income countries (above), Government expenditure on education for year
2014 (below)

According to Mackinlay, the best way to represent quantitative data is by position or
length, and the bar chart lets me use both. Hence, I opted for it without
hesitation. The bar chart makes it easy to compare the expenditure of the various
countries within the graph, and the consistent color encoding across graphs makes
looking up a country far easier.
Comparing the graphs, it’s evident that my assumption was correct. Countries like Eritrea
and Liberia that spend less money (approx. 2% of GDP) on education have deteriorated,
whereas in countries like Ethiopia, Niger and Burkina Faso the government has
recognized the importance of education and is focused on improving it by putting in more
resources.
To conclude, we can say that the world is going in the right direction on primary
education. Most of the progress is being made in the low-income countries, and this needs to
continue, but the progress is not uniform. There are countries whose conditions are deteriorating, and we
cannot progress much further unless we bring these countries back on track. Hence a conscious
effort is needed from the governments of low-performing countries, as well as from the global
community, to improve the current situation.

Question 2
My second question is a continuation of the first.
We have seen that the low-income countries are making a lot of progress in improving
primary school enrolment. But does that translate into an improvement in overall education?
Here I focused on the secondary education data provided for the same countries we considered
for the first question, since it served as a context as well as a continuation for our analysis.
I began my investigation by looking at the factor ‘Progression to secondary school (%)’, which
tells us the percentage of students who move on to secondary school after completing primary
school. I believed that progression to secondary school should be related to the improvement
in primary education. Hence, I expected the countries that were doing well in improving
primary school enrolment to perform well here too. I chose the year 2014, as most countries
had data for 2014 and it allows us to draw comparisons with the graphs in Question 1.
On exploring the graph below (Fig.7) and comparing it with Fig.5, I saw that there was a
correlation between primary school and secondary school enrolment in low-income countries.
Countries like Ethiopia, Senegal and Burkina Faso, which had improving primary school enrolment
rates, also had a good progression rate. Hence my assumption was partly correct. But there
were a couple of outliers that did not fit the assumption. There were countries that
performed well in primary education but did not fare well on the progression
rate, and vice versa. One such example was Eritrea. Eritrea had one of the largest ‘out of
primary school’ populations among all the low-income countries we considered, but it had one
of the highest progression rates. I did some background research on this peculiar behavior and
gained some useful insights. It turns out that Eritrea’s high ‘out of school’ population is mainly
due to the many nomadic tribes in Eritrea [5], whose children do not
attend school. There is also no effort from the government to raise awareness among
the tribes. Thus, the children who attend primary school in Eritrea are mostly from the
general population and, as expected, they go on to attend secondary school as well.

Fig.7 Children progressing to secondary school for year 2014


Another outlier I observed here was Niger, which was showing a positive trend in reducing its out
of primary school population, but had the lowest progression rate here. Less than 60% of Niger’s
primary school students joined secondary school. Again, I did some research on this, and found
out that Niger’s education is suffering due to extreme poverty and lack of facilities [4]. Most of
the children who dropped out of primary school did so simply because the schools were far away
and they didn’t have sufficient transportation. Secondary schools were even
fewer in number, which translated into a poor progression rate.
Similarly, most of the outliers I saw in the visualization had a unique factor that affected their
performance in one way or another. Hence, it’s safe to say that, generally, good primary
school participation translates into a good progression rate to secondary school.
Even though this is a step in the right direction, overall improvement in education cannot be
achieved unless the kids stay in school. Thus, I wanted to analyze how many of the students
stayed in secondary school until completion. Due to a lack of relevant data, I opted to use
lower secondary school in place of secondary school. I analyzed the factor ‘Lower Secondary school
completion rate’ to gain some insight. This factor gives the number of students who
completed lower secondary school as a percentage of the total number of students who
progressed to lower secondary school. This helped me understand how many students
dropped out of lower secondary school. Hence, I treated this as the final criterion for evaluating
the overall progress in education.

Final Visualization

Fig.8 Students completing lower secondary school from 2002 to 2014


The above result gave me a fair idea of the overall performance of each country in
education. I saw that Niger had the lowest completion rate, which aligned with the previous
result (Fig.7). Combining the two results, I saw that even if a country has a
good progression rate, that does not guarantee a good completion rate. Ethiopia had a
progression rate greater than 90%, but its lower secondary completion rate was only 29% in 2014.
Similar results were observed for the rest of the countries except Gambia. Finally, when I
combined these with the results from Question 1 (Fig.5), I saw that Gambia was the best
performer, which was surprising. It had a constant but low out of primary school population,
as well as the highest progression rate and the highest completion rate. On the other hand,
Niger was showing good progress in reducing its ‘out of primary school’ population, but when I
considered the overall performance, it fared poorly.
Thus, from the results of both Question 1 and Question 2, we can infer that the current
progress being made in education is not balanced. In most countries progress is happening in
primary education, and they have a good progression rate as well, but there is still a long way to
go in secondary education before it can be considered real progress.

Reflection
As an aspiring data scientist, I expect exploratory visual analysis to play an important part
in my career. The first step in any data modelling or machine learning project is to analyze and
understand the data. Although visual analysis is not always part of the job requirement,
visual analytics greatly widens the scope for understanding the data. It will help me
quickly get my head around the data and discern hidden patterns and correlations in it.
The data was a mess. In Now You See It, Few elaborates on the traits of meaningful data.
This dataset did not satisfy many of his criteria. For instance, it had lots of missing fields, the data
was not atomic, and there was redundancy in several places. In several instances I had to impute
missing values by taking the mean. The structuring choice was also poor. I had to restructure the
data to make it easier to understand and work with. I used Python and the pandas library to
clean, impute and restructure the data.
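
Mean imputation of the kind mentioned above can be sketched in pandas as follows. The report does not say which mean was used, so filling each gap with that country's own mean is an assumption made for illustration. The function name `impute_with_country_mean`, the column name `rate` and the values are likewise hypothetical.

```python
import pandas as pd

def impute_with_country_mean(tidy, indicator):
    """Fill missing values of one indicator with that country's own mean."""
    out = tidy.copy()
    # transform("mean") returns the per-country mean aligned to every row;
    # the mean itself is computed over the non-missing values only.
    country_mean = out.groupby("Country Name")[indicator].transform("mean")
    out[indicator] = out[indicator].fillna(country_mean)
    return out

# Invented placeholder data with one gap per country.
demo = pd.DataFrame({
    "Country Name": ["A", "A", "A", "B", "B"],
    "Year": [2012, 2013, 2014, 2013, 2014],
    "rate": [10.0, float("nan"), 20.0, float("nan"), 4.0],
})
filled = impute_with_country_mean(demo, "rate")
print(filled)
```

Filling with a mean flattens any trend over time, so for time-series data like this it is a rough fix rather than a faithful reconstruction of the missing years.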
Overall, the assignment was a good experience for me. It helped me become familiar with
Tableau, which is a powerful tool. The number of options it provides the user is simply amazing.
However, there are certain aspects with scope for improvement. The
customizations available to the user are limited. Also, several modern visualizations, like the
Sankey chart, are missing. Having said that, I feel that Tableau is a good tool, as it is powerful
enough to handle large datasets and allows the user to create simple visualizations effortlessly.

Bibliography
[1] World Bank, “World Development Indicators”,
https://data.worldbank.org/data-catalog/world-development-indicators
[2] Few, Stephen, “Now You See It”, Chapter 2.
[3] World Bank, “Low Income Countries”, https://data.worldbank.org/income-level/low-income
[4] Knofczynski, Allie, “Why Education in Niger Falls Short”, Sept 6, 2017,
https://borgenproject.org/why-education-in-niger-falls-short/
[5] UNICEF, “Basic Education in Eritrea”, https://www.unicef.org/eritrea/education.html
