Int. J. Knowledge Engineering and Soft Data Paradigms, Vol. 4, No. 2, 2013

Geospatial data mining techniques to investigate gender equality and
empowerment of women status in Bangladesh

Rashedur M. Rahman* and Nahida Sultana

Department of Electrical Engineering and Computer Science,
North South University,
Plot-15, Block-B, Bashundhara,
Dhaka 1229, Bangladesh
*Corresponding author

Abstract: In this paper we apply geospatial data mining techniques to several
data sets to investigate how the promotion of gender equality and empowerment
of women is carried out throughout Bangladesh. Promoting gender equality and
empowerment of women is one of the Millennium Development Goals for 2015. The
indicators used in this paper are the women literacy rate, the ratio of girls in
primary, secondary and tertiary education, and employment history in different
sectors of Bangladesh. Our aim is to study these indicators, analyse them
statistically and mine them geospatially with respect to different areas of
Bangladesh. We also compare the results of a spatial regression model with
those of a classical regression model on this data. The results demonstrate
that the spatial lag model outperforms the classical model in several respects.
We have found that education indicators have a tendency to produce spatial
clusters. Spatial data mining can provide interesting and useful insights for
the government, economists and relevant decision makers. The results can also
be used for causal analysis by domain experts.

Keywords: data mining; spatial autocorrelation; spatial regression; exploratory
spatial data analysis; ESDA.

Reference to this paper should be made as follows: Rahman, R.M. and
Sultana, N. (2013) ‘Geospatial data mining techniques to investigate gender
equality and empowerment of women status in Bangladesh’, Int. J. Knowledge
Engineering and Soft Data Paradigms, Vol. 4, No. 2, pp.166–186.

Biographical notes: Rashedur M. Rahman is an Associate Professor in the
Electrical Engineering and Computer Science Department at North South
University, Dhaka, Bangladesh. He received his PhD in Computer Science from
the University of Calgary, Canada in 2007 and his Masters from the University
of Manitoba, Canada in 2003. He has authored more than 60 peer-reviewed journal
and conference papers in the areas of parallel, distributed, grid and cloud
computing, and knowledge and data engineering. His current research interests
are in data mining, particularly on financial, medical and educational data,
data replication on grids, cloud load characterisation, optimisation of cloud
resource placement and computational finance. He serves on the editorial
boards of a number of journals in the knowledge and data engineering field. He
has also served as a reviewer for a couple of journals published by Elsevier,
Springer and Wiley, and as an organising committee member of different
international conferences organised by IEEE and ACM.

Nahida Sultana received her MSc in Computer Science from North South
University in 2011. She is currently investigating different spatial data
mining techniques on spatial and statistical data on Dhaka, especially on the
women literacy rate and women employment in the service and agriculture
sectors of Bangladesh. She is also interested in regression models, both
classical and spatial.

Copyright © 2013 Inderscience Enterprises Ltd.

1 Introduction

Geospatial data is data or information that identifies the geographic location of
features and boundaries on earth. Geospatial data mining essentially performs exploratory
spatial data analysis (ESDA) on large computerised data repositories that have
geographic metadata. At present ESDA is more important than ever before, owing to the
remarkable processing power of modern computers and the availability of an enormous
amount of geospatial data. However, applications of this field are scarce, and almost
unheard of, in Bangladesh. In this analysis we apply ESDA techniques to Bangladeshi
geospatial data, especially data related to women literacy and empowerment.
The objective of this paper is to check whether the geographic distributions of women
education indicators, gender equality and GDP growth rate, which are related to women
literacy, are consistent with Waldo Tobler’s (1970) first law of geography: “Everything
is related to everything else, but near things are more related than distant things”. Our
research aims to find the nature of the spatial patterns of these variables. We are
interested to see whether they produce any interesting outcome under spatial
autocorrelation and bivariate analysis. We are also interested in finding probable ‘hot spots’
that are surrounded by high women literacy rates and ‘cold spots’ that are surrounded by low
women literacy rates. Moreover, we investigate whether a classical or a spatial
regression model is more appropriate for the data we have.
This paper highlights one of the significant parts of the millennium development goals
(MDGs) of Bangladesh. We focus on Goal 3, that is, the promotion of gender
equality and empowerment of women. We analyse the employment conditions of women in
different sectors of the country. Women in Bangladesh are becoming increasingly visible
in economic spheres. In practically all spheres of development, women are
contributing to the growth of the economy. Women’s increasing involvement in both
agricultural work and non-farm activities has provided them with increased opportunities for
wage work and a certain economic independence.
In this paper we introduce global autocorrelation. We use the univariate Moran’s I and
apply various bivariate ESDA techniques. Using those techniques we illustrate the
women literacy rate within the 64 zillas (districts) of Bangladesh. We also look at local
indicators of spatial association (LISA), in particular univariate LISA. We compute Moran’s I
for the primary, secondary and tertiary education of men and women. The employment
rates of women and men are also presented in the analysis. Finally, we apply regression
analysis to the total women literacy rate and to the primary, secondary and adult literacy
rates of women in the 64 zillas. We fit the data with traditional and spatial regression
models and compare them.
The paper is organised as follows: Section 2 describes related work in the area of
geospatial data mining and studies on gender inequality and its causes, effects and
consequences, and Section 2.1 presents the software package used in this research. Section 3
describes the data, the sources and the process of data acquisition. Section 4 outlines the
methodologies used in this research. A detailed analysis of our research findings is
presented in Section 5. The final section concludes and gives directions for future research.

2 Related work

There are numerous works on gender inequality and its causes, effects and consequences,
whereas geospatial data mining and ESDA are relatively new fields. A review of
some selected studies follows.
Chaudury et al. (2009) explore the impact of gender inequality in education on rural
poverty in Pakistan using logit regression analysis on primary datasets. It is
concluded that gender inequality in education has adverse impact on rural poverty. The
empirical findings suggest that female-male enrolment ratio, female-male literacy ratio,
female-male ratio of total years of schooling, female-male ratio of earners and education
of household head have significant negative impact on rural poverty.
Ahmed et al. (2004) explore the relationship between different levels of education
and poverty through an analysis of household-level data from 60 villages in Bangladesh.
First it depicts the overall trend in school enrolment at primary and secondary level
within 1988–2000, and confirms the inequality that exists in the access to education at
post-primary level. This is followed by a presentation of income and occupation data that
show a strong positive correlation with the level of education. In the second part, an
income function analysis is carried out to assess the impact of education along with
other determinants. The third part analyses the effects of education on the child/woman ratio,
and on the secondary school participation rate of male and female children.
Ahmad (2001) analyses the private benefits and costs of primary versus secondary
education in rural Bangladesh on the basis of household-level data. The study indicates that while
social benefits for primary education are high in Bangladesh, private benefits are higher
for secondary-level education than for the primary level. On the other hand, private costs are
lower for primary education than for secondary education. Poor households in
Bangladesh cannot afford to keep their children in school until they complete the secondary level
because of high costs, both direct costs and opportunity costs. Inequality in access to
secondary education is the main cause of persistent poverty in Bangladesh. The recent
improvement of female participation rates at both primary and secondary levels confirms
the favourable impact of a targeted approach. Policies should be directed at both boys and
girls from poor households.
Paraguas et al. (2005) studied the spatial relationship between the proportion of the
population with a standard of living below the poverty line and soil condition. de
Dominicis et al. (2007) compared traditional measurements of geographic concentration
of economic activities with spatial data analysis techniques. They also presented a
comprehensive analysis of the manufacturing industry and a set of hypotheses.
Türkcan et al. (2009) present applications of spatial data analysis alongside their
research findings. The authors elaborated on the potential application of spatial data
analysis techniques for policy making and discussed some interesting related projects
before delving into their research. This research is quite similar to de Dominicis et al.
(2007). It analyses the spatial properties of Turkey’s manufacturing industry. At the end,
the paper presents examples of the successful application of spatial analysis for
generating clustering policies.
Oliveau (2005) used different techniques to find the spatial patterns of India’s
contemporary demography. District level data on population density, urbanisation level,
fertility rate, child mortality rate and gender inequality were derived from India’s 2001
census. The data was used to identify the spatial patterns and spatial correlation that
described the demographic trend in India.
Uthman (2008) examined the impact of state-level access to basic environmental
services and neighbourhood deprivation on the under-five mortality rate in Nigeria. The study
concluded that the spatial distribution of under-five mortality rates was non-random
and clustered, with a Moran’s I = 0.654 (p = .001). Spatial clustering suggested that the
North-East and North-West of Nigeria could be grouped as an under-five mortality
‘hot-spot’, and the South-West, South-South and South-East of Nigeria could be clustered as an
under-five mortality ‘cold-spot’. The results outlined the consistent finding that access to
safe water, proper sanitation and low-pollution cooking fuel are important factors that
could increase the chances of child survival.

2.1 Primary tool – OpenGeoDa

The tool we use for this research is OpenGeoDa (Anselin et al., 2009). It is a software
package for spatial data analysis that is being developed at the US National Science
Foundation (NSF) funded Centre for Spatially Integrated Social Sciences (CSISS). The
development effort is led by Dr. Luc Anselin, who is currently one of the leading
researchers in the field of spatial data analysis. OpenGeoDa is the improved and open
source version of GeoDa. GeoDa was developed at the Spatial Analysis Laboratory of the
University of Illinois at Urbana-Champaign under the direction of Dr. Anselin. Even in
their early days GeoDa and its predecessor DynESDA were considered powerful tools for
ESDA. With such a rich background, OpenGeoDa inherently supports most ESDA
techniques. It supports basic geospatial visualisation methods such as choropleth mapping,
histograms and box plots, and can perform many advanced tasks such as spatial
autocorrelation, the generation of the Moran scatter plot, calculating LISA, and building and
analysing spatial regression models. Some of the advanced ESDA methods were
developed by Luc Anselin; for these, the implementation in OpenGeoDa can be
considered the authoritative one. OpenGeoDa is open source software with its entire
source code available for alteration. However, for our work we did not need to make any
modifications. We used the binary release of version alpha, available at
OpenGeoda (2011).

3 Data acquisition

Collecting geospatial data is not easy, and it is more difficult in the context of a developing
country like Bangladesh. On top of that, our research concerns the geospatial analysis
of statistical data, not geographic or topographic data. The data we collected therefore did not
have any geographical metadata for geographic information systems, such as longitudes
and latitudes. We had to georeference the data ourselves by associating elements from
the dataset with geographic polygons in a zilla-level digital map of Bangladesh.
First we had to prepare this digital map, using Python 2.7 software to generate the
shapefile together with an Excel file. We obtained our statistical data from the
Bangladesh Bureau of Statistics (BBS), which is the Bangladesh government’s official
organisation for the collection and dissemination of statistical data. We also collected data
from the Bangladesh Bureau of Educational Information and Statistics (BANBEIS) and the
websites of UNDP and UNICEF. Literacy rates are collected at ten-year intervals as part of
the National Population Census. We also collected data from the Gender Statistics Book (GSB).

4 Algorithms and methodology

Our research deals with spatial autocorrelation and spatial regression. For preliminary
visualisation we used a choropleth map drawn using the equal interval classification
scheme (Choropleth, 2010) and a Moran scatter plot (Anselin, 1993). We used the global
Moran statistic to get a sense of the global spatial autocorrelation and LISA to find spatial
clusters. To visualise LISA we used significance maps and cluster maps. We also fit the
data to an appropriate spatial regression model.
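The equal interval classification used for the preliminary choropleth map can be illustrated with a short Python sketch. It is not OpenGeoDa's code: the function name `equal_interval_classes` and the sample rates are hypothetical, and NumPy is assumed to be available. The range between the smallest and largest value is split into k classes of equal width, and each region is assigned the index of the class its value falls in.

```python
import numpy as np

def equal_interval_classes(values, k):
    """Assign each value to one of k equal-width classes (0 .. k-1)."""
    values = np.asarray(values, dtype=float)
    # k + 1 evenly spaced break points from the minimum to the maximum.
    edges = np.linspace(values.min(), values.max(), k + 1)
    # Only the interior break points are needed to decide the class;
    # the maximum value therefore lands in the top class.
    classes = np.digitize(values, edges[1:-1])
    return edges, classes

# Hypothetical literacy rates (%) for five regions, split into four classes.
rates = [10.0, 20.0, 35.0, 50.0, 42.0]
edges, classes = equal_interval_classes(rates, 4)
```

Each class index would then be mapped to one colour of the choropleth map.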

4.1 Global spatial autocorrelation

Spatial autocorrelation is the correlation of a variable with itself across space. A common
statistic used for this is Moran’s (1950) I index. It is calculated as follows:

$$I = \frac{N}{\sum_i \sum_j W_{ij}} \cdot \frac{\sum_i \sum_j W_{ij} (X_i - \bar{X})(X_j - \bar{X})}{\sum_i (X_i - \bar{X})^2}$$

Here, I is Moran’s I index. The sign and magnitude of this index give the nature of
the correlation among the N points, which are indexed by i and j. $X_i$ is the value at the
point currently under consideration and $\bar{X}$ is the mean of all the points in
the dataset. $W_{ij}$ is an element of the weights matrix: $W_{ij}$ is 1 if $X_i$ is a neighbour
of $X_j$, and 0 otherwise.
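As an illustration of the index described above, the following Python sketch computes the global Moran's I for a binary contiguity matrix. It is a minimal sketch assuming NumPy, not the OpenGeoDa implementation; the function name `morans_i` and the four-region example are hypothetical.

```python
import numpy as np

def morans_i(x, w):
    """Global Moran's I for values x (length N) and an N x N weights matrix w."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    n = x.size
    z = x - x.mean()                   # deviations from the mean
    num = (w * np.outer(z, z)).sum()   # sum_i sum_j W_ij (X_i - mean)(X_j - mean)
    den = (z ** 2).sum()               # sum_i (X_i - mean)^2
    return (n / w.sum()) * num / den

# Hypothetical example: four regions in a row; W_ij = 1 for adjacent regions.
w_chain = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
i_trend = morans_i([1.0, 2.0, 3.0, 4.0], w_chain)        # similar values adjoin
i_alternating = morans_i([1.0, 4.0, 1.0, 4.0], w_chain)  # dissimilar values adjoin
```

In this toy example the smoothly increasing values give a positive index and the alternating values give a negative one, which matches the interpretation of the sign given above.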

4.2 The Moran scatter plot and univariate Moran I

For a row-standardised weight matrix, Moran’s I statistic can be written as follows
(Scrucca et al., 2005):

$$I = \frac{x^T W x}{x^T x}$$

where x is the vector of mean-centred values. From the above equation it is clear that I is
the slope of the regression of Wx on x. The plot of the constructed lag variable Wx versus x
is called the Moran scatter plot, and it provides a nice interpretation of Moran’s I. Here Wx
is the row-standardised weighted average of the neighbouring values of x.
The Moran Scatter Plot is basically a two dimensional grid which displays points
representing all the geospatial entities. It calculates the Moran’s I index from all the
points and displays a straight line whose gradient is the Moran’s I. The values on the X
axis are standardised so that their mean is zero and their variance is one. Along the Y axis
are the spatial lags of the chosen variable. This is calculated using the given contiguity
matrix. Spatial lag shows how much the neighbours of a point of interest are affecting
that point.
In Anselin (1993), this plot was developed and shown to be a good ESDA tool for
studying how the local spatial behaviour of the variable builds up the global Moran’s I
statistic. The scatter plot shows how spatially dispersed the data is. It also gives a hint as
to where the potential spatial clusters and outliers lie. The scatter plot is divided into four
quadrants, namely high-high, low-low, high-low and low-high. Points that lie in the
low-low and high-high quadrants indicate that positive correlation exists between the data in
those regions. Similarly, points in the high-low and low-high quadrants indicate that negative
correlation exists between the data in those regions.
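The construction of the Moran scatter plot can be sketched in a few lines of Python with NumPy. This is an illustrative sketch rather than OpenGeoDa's implementation: the function name `moran_scatter` and the toy data are hypothetical, and it assumes that every region has at least one neighbour and that the variable is not constant.

```python
import numpy as np

def moran_scatter(x, w):
    """Return standardised values, their spatial lags, the slope and quadrant labels."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    w_row = w / w.sum(axis=1, keepdims=True)  # row standardise: each row sums to 1
    z = (x - x.mean()) / x.std()              # X axis: mean 0, variance 1
    lag = w_row @ z                           # Y axis: average of the neighbours' values
    slope = (z @ lag) / (z @ z)               # slope of the regression of lag on z
    own = np.where(z >= 0, "high", "low")
    nbr = np.where(lag >= 0, "high", "low")
    quadrants = [f"{a}-{b}" for a, b in zip(own, nbr)]
    return z, lag, slope, quadrants

# Hypothetical example: four regions in a row, adjacent regions are neighbours.
w_chain = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
z, lag, slope, quadrants = moran_scatter([1.0, 2.0, 3.0, 4.0], w_chain)
```

The returned `slope` is the Moran's I for the row-standardised weights, and `quadrants` gives the quadrant of each region, written as own value followed by neighbour average.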

4.3 Bivariate Moran’s I

For the univariate Moran’s I, we find the correlation between the value of a random variable at
any position and the values at its neighbouring positions. In contrast, for the bivariate
Moran’s I, a multivariate coefficient of spatial correlation between two random variables
is calculated. The coefficient is defined as follows (Anselin et al., 2002):

$$\mathrm{corr}_{lk} = y_l^T W y_k$$

where $y_l = (x_l - \bar{x}_l)/\sigma_l$ and $y_k = (x_k - \bar{x}_k)/\sigma_k$. Here both random
variables are standardised such that the mean is zero and the standard deviation is one, and W
is the weight matrix as before. The scatter plot is created by plotting the two variables. The
first variable, $y_l$, is plotted on the X axis, and the second variable, $y_k$, weighted (lagged)
with the weights matrix, is plotted on the Y axis.
The interpretation of the scatter plot is similar to the univariate one. It also includes
the high-high values and the low-low values, where the values correlate with one another
spatially. The high-high region of the scatter plot shows that if variable $y_l$ has a high
value at position s, then variable $y_k$ at the positions that neighbour s will also have a
high value.
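The bivariate coefficient can be illustrated in the same way. The sketch below is hypothetical code, not taken from OpenGeoDa. It standardises both variables, lags the second one with a row-standardised weights matrix and divides the cross product by the number of regions so that the result equals the slope of the bivariate scatter plot. That normalisation, the function name `bivariate_moran` and the toy data are assumptions made for this example.

```python
import numpy as np

def standardise(x):
    """Scale a variable to mean 0 and standard deviation 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def bivariate_moran(x_l, x_k, w):
    """Slope of the lag of y_k (Y axis) regressed on y_l (X axis)."""
    w = np.asarray(w, dtype=float)
    w_row = w / w.sum(axis=1, keepdims=True)  # row-standardised weights
    y_l = standardise(x_l)
    y_k = standardise(x_k)
    lag_k = w_row @ y_k                       # neighbours' average of the second variable
    return (y_l @ lag_k) / y_l.size           # y_l' W y_k, normalised by N

# Hypothetical example: four regions in a row, adjacent regions are neighbours.
w_chain = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
first = [1.0, 2.0, 3.0, 4.0]
same_pattern = bivariate_moran(first, [10.0, 20.0, 30.0, 40.0], w_chain)
opposite_pattern = bivariate_moran(first, [40.0, 30.0, 20.0, 10.0], w_chain)
```

When the second variable follows the same spatial pattern as the first, the coefficient is positive; when it follows the opposite pattern, the coefficient is negative.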

4.4 Local spatial autocorrelation

Moran’s I index views the data from a global perspective. That is, the analysis produces
a statistic that summarises the whole dataset by assuming that the dataset is homogeneous,
i.e., similar in nature. However, it is possible that the dataset contains no global
autocorrelation. In that case we can still find clusters on a local level using local spatial
autocorrelation. The fact that Moran’s I is a summation of individual cross products is
exploited by LISA (Anselin, 1993). The local Moran’s I for each spatial unit is
calculated and evaluated for statistical significance. The local Moran’s $I_i$ was defined by
Anselin (1993) as:

$$I_i = \frac{x_i}{m_2} \sum_j W_{ij} x_j$$

where $m_2 = \sum_j x_j^2 / N$, the x values are mean-centred and N is the number of data points.
The relation between the global Moran’s I and the local Moran’s $I_i$ is given by
$I = \sum_i I_i / N$. A large positive $I_i$ indicates clustering of similar data values around
the ith location, with values that deviate strongly from the average.
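A minimal Python sketch of the local statistic is given below, assuming NumPy, mean-centred values and a row-standardised weights matrix. The function name `local_moran` and the toy data are hypothetical, and the permutation-based significance test that normally accompanies LISA is not included.

```python
import numpy as np

def local_moran(x, w):
    """Local Moran's I_i for each region, using row-standardised weights."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    w_row = w / w.sum(axis=1, keepdims=True)  # row standardise the weights
    z = x - x.mean()                          # mean-centred values
    m2 = (z ** 2).sum() / z.size              # m2 = sum_j x_j^2 / N
    return (z / m2) * (w_row @ z)             # I_i = (x_i / m2) * sum_j W_ij x_j

# Hypothetical example: four regions in a row, adjacent regions are neighbours.
w_chain = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
local_i = local_moran([1.0, 2.0, 3.0, 4.0], w_chain)
global_i = local_i.mean()   # I = sum_i I_i / N
```

Averaging the local values recovers the global statistic, which is the relation between I and $I_i$ stated above.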

4.5 Univariate and bivariate LISA

We analysed this local statistic on one or two variables using univariate and bivariate
LISA respectively. For the univariate LISA, the statistic indicates the degree of
association between the value of a variable at position s and the average of the values of that
variable at the neighbouring locations of s. The bivariate LISA, on the other hand, indicates
the degree of association between the value of a variable at position s and the average of
the values of another variable at the neighbouring locations of s. The LISA scatter plot presents
high-high and low-low clusters with different levels of significance.

4.6 Odds ratio

The odds ratio (OR) is a way of comparing whether the probability of a certain event is
the same for two groups. An OR of 1 implies that the event is equally likely in both
groups. An OR greater than one implies that the event is more likely in the first group.
Shown below is the typical two by two table.

        X−     X+
Y−      a      b      a+b
Y+      c      d      c+d
        a+c    b+d    n = a+b+c+d

We can understand the OR by first noticing what the odds are in each row of the table.
The odds for row Y− are a/b. The odds for row Y+ are c/d. The OR is simply the ratio of
the two odds:

$$OR = \frac{a/b}{c/d}$$

which can be simplified to

$$OR = \frac{ad}{bc}$$
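The calculation can be written as a small Python helper. The function name `odds_ratio` and the counts in the example are hypothetical and only illustrate the arithmetic of the two by two table.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio of a 2 x 2 table with rows (a, b) and (c, d)."""
    if b == 0 or c == 0:
        raise ValueError("b and c must be non-zero for the odds ratio to be defined")
    # (a / b) / (c / d) simplifies to (a * d) / (b * c).
    return (a * d) / (b * c)

# Hypothetical counts: the odds in the first row are 20/10 = 2,
# the odds in the second row are 10/20 = 0.5, so the ratio is 4.
example_or = odds_ratio(20, 10, 10, 20)
# Identical rows give an odds ratio of 1: the event is equally likely in both groups.
equal_or = odds_ratio(5, 5, 5, 5)
```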

4.7 Correlation coefficient, r

The quantity r, called the linear correlation coefficient, measures the strength and the
direction of a linear relationship between two variables. The mathematical formula for
computing r is:

$$r = \frac{n \sum xy - (\sum x)(\sum y)}{\sqrt{n \sum x^2 - (\sum x)^2}\,\sqrt{n \sum y^2 - (\sum y)^2}}$$

where n is the number of pairs of data.

The value of r is such that −1 ≤ r ≤ +1. The + and − signs are used for positive linear
correlations and negative linear correlations, respectively. If x and y have a strong
positive linear correlation, r is close to +1. An r value of exactly +1 indicates a perfect
positive fit. If x and y have a strong negative linear correlation, r is close to −1. An r
value of exactly −1 indicates a perfect negative fit. If there is no linear correlation or a
weak linear correlation, r is close to 0. The coefficient of determination, r², is useful
because it gives the proportion of the variance (fluctuation) of one variable that is
predictable from the other variable. It is a measure that allows us to determine how
certain one can be in making predictions from a certain model or graph. The coefficient of
determination is the ratio of the explained variation to the total variation. It
is such that 0 ≤ r² ≤ 1, and it denotes the strength of the linear association
between x and y.
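The computational form of r, built from the sums of x, y, xy, x² and y² over the n pairs, can be checked with a short Python sketch. The function name `pearson_r` and the sample data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Linear correlation coefficient from the sums of x, y, xy, x^2 and y^2."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_xx - sum_x ** 2)
                   * math.sqrt(n * sum_yy - sum_y ** 2))
    return numerator / denominator

# Hypothetical data: y doubles x, then y falls as x rises.
r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
r_squared = r_up ** 2   # coefficient of determination
```

A perfectly increasing linear relation gives r close to +1 and a perfectly decreasing one gives r close to −1, up to floating point rounding.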

4.8 Spatial regression

Traditional regression models do not take the spatial properties of data into consideration.
This is why we need specialised regression techniques for dealing with geospatial data.
OpenGeoDa supports the construction and analysis of spatial regression models.
We have used these features to see which spatial regression model would best fit the data
we have.
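For reference, the classical model and the spatial lag model are commonly written as shown below. This is standard textbook notation, given here as an assumption about the form of the models rather than a formula taken from this paper: y is the dependent variable, X the matrix of explanatory variables, β the regression coefficients, W the spatial weights matrix, ρ the spatial autoregressive coefficient and ε the error term.

```latex
% Classical (ordinary least squares) regression
y = X\beta + \varepsilon

% Spatial lag model: the spatially lagged dependent variable Wy is an extra regressor
y = \rho W y + X\beta + \varepsilon
```

When ρ = 0 the spatial lag model reduces to the classical model, so the estimated ρ indicates how much the values in neighbouring regions contribute to the fit.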

5 Result analysis

5.1 Global spatial autocorrelation and univariate Global Moran’s I

In this study, we examine the woman literacy rate in Bangladesh, which includes the primary,
secondary and adult woman literacy rates. First, Figure 1 shows the percentile choropleth
map of the 2001 literacy rate of the 64 zillas of Bangladesh. This map is based on the literacy
rate of girls and women aged seven years and above. There are six categories in this
percentile map: 26 zillas fall in the 10%–50% percentile range and 25 zillas in the
50%–90% range. Only seven zillas lie above the 90% mark.

Figure 1 A percentile choropleth map showing the six percentile ranges of the woman literacy
rate of the 64 zillas of Bangladesh (see online version for colours)

In the percentile map we can observe that in remote hill tract areas, such as Bandarban,
the women literacy rate is low, whereas the Dhaka, Gajipur, Jhalokathi, Barisal, Pirozpur and
Bagerhut districts show a high women literacy rate.
From Table 1 we see the Moran’s I values of the primary, secondary and adult women
literacy rates for 2006 and 2009. Universal access to basic education and the
achievement of primary education by the world’s children is one of the most important
goals of the MDGs. The primary school level group includes persons who have completed up
to five years of schooling. Persons attending the 1st and 6th year of schooling have also been
included in this group. The secondary school level group includes persons who have
completed six to nine years of schooling. Persons attending the 10th year of schooling have
also been included in this group. Table 1 gives the Moran’s I values for the primary and
secondary literacy rates and for the literacy rate of women aged 15 to 24 years. The literacy
rate of those aged 15 to 24 is the percentage of persons aged 15 to 24 who can both read and
write, with understanding, a short simple statement on their everyday life. The indicator has a
special significance in reflecting the recent outcomes of the basic education process.
Table 1 Calculated Moran’s I values for woman literacy rate

Variable                                   2006     2009
Primary girls literacy rate                0.3475   0.3776
Secondary girls literacy rate              0.4062   0.5657
15 to 24 years aged women literacy rate    0.4299   0.4619

From Table 1 we see that the Moran’s I values of the primary, secondary and 15 to 24 years
aged woman literacy rates in 2009 are all higher than the corresponding values in 2006.

In Figure 2, we look at the standardised primary girls literacy rates along X and the
standardised average literacy rates of the neighbours along Y. The regression line with
Moran’s I as its slope fits the points reasonably well. Spatial autocorrelation analysis thus
shows that the woman literacy rates in both years had a clear positive global spatial
autocorrelation. This is also true for Figures 3 and 4.

Figure 2 Moran scatter plot of the primary girls literacy rate of the 64 zillas of Bangladesh
in (a) 2009 and (b) 2006 (see online version for colours)

Figure 3 Moran scatter plot of the secondary girls literacy rate of the 64 zillas of Bangladesh
in (a) 2009 and (b) 2006 (see online version for colours)

Figure 4 Moran scatter plot of the 15 to 24 years aged women literacy rate of the 64 zillas of
Bangladesh in (a) 2006 and (b) 2009 (see online version for colours)

5.2 Bivariate global Moran’s I

In the univariate Moran’s I, we analyse the impact of the neighbouring locations on a
specific location using a single variable. In this section, we extend this idea by analysing
the association between two variables. The value of one variable is compared to the spatial
lag of another variable. This lets us study the effect that one variable has on another in a
spatial context.
Under the definition of literacy used in 1991 and 2001, a person aged seven years
or above who is able to write a letter is considered literate. Literacy has been
calculated for the population aged seven years and over.
We carry out the bivariate Moran’s I analysis between school attendance, counting
students aged 5 to 24 years who attend primary or secondary school, and the literacy rate,
and between the literacy of girls and boys aged seven years and above (which includes not
only the people in this age range who can read or write but also persons literate at the
primary, secondary and tertiary levels) and the literacy rate. The results of these analyses
are given in Table 2.

Table 2 Bivariate Moran’s I analysis

                                                                 Moran’s I value
Variables                                                        2001     1991
School attendance girls (5 to 24 years age) vs. literacy rate    0.3721   0.4470
School attendance boys (5 to 24 years age) vs. literacy rate     0.3697   0.4556
7+ and above aged girls vs. literacy rate                        0.4349   0.4536
7+ and above aged boys vs. literacy rate                         0.4028   0.4552

In Table 2 we see that in 2001 the bivariate Moran’s I for school attendance of girls vs.
literacy rate is higher than that for school attendance of boys vs. literacy rate. Also, the
Moran’s I value for girls aged seven years and above versus literacy rate is higher than that
for boys aged seven years and above vs. literacy rate.
For example, the bivariate Moran scatter plot relates the values for literacy in 2001 at
each location (LR2001, horizontal axis) to the average value of school attendance of
girls in 2001 for the neighbouring locations (W_WLR2001, vertical axis). The observed
Moran’s I value of 0.3721 is highly significant and not compatible with a notion of
spatial randomness. So zillas that were surrounded by high school attendance of girls in
2001 showed a good impact on the overall literacy rate in 2001.

5.3 Univariate local Moran’s I

The local Moran statistic can be used to identify ‘hot spots’ and ‘cold spots’. For each
location, the LISA value allows its similarity with its neighbours to be assessed, and five
possibilities can be found:
a ‘high-high’: locations with high values and similar (high value) neighbours
b ‘low-low’: locations with low values and similar (low value) neighbours
c ‘low-high’: locations with low values and high value neighbours
d ‘high-low’: locations with high values and low value neighbours
e locations with no significant autocorrelation.
Regions with significant high-high or low-low values form clusters, whereas regions with
low-high or high-low values are outliers. Figure 5 shows the
cluster map of the woman literacy rate of 2001.

Figure 5 Univariate LISA cluster map of zilla level woman literacy rate of 2001 (see online
version for colours)

Figure 5 shows that there are two types of significant cluster, low-low and
high-high. The low-low cluster of the women literacy rate in 2001 consists of Jamalpur, Kurigram,
Rangpur, Lalmonirhut, Mymensingh, Netrokona, Chitagong, Bandarban, Rangamati and
Cox’s Bazar. This might be due to the fact that a large proportion of tribal people live
in those zillas, and tribal people are reluctant to avail themselves of the education
facilities. The high-high cluster of the women literacy rate in 2001 consists of Narayangonj,
Comilla, Gopalgonj, Norail, Bagerhut, Pirojpur, Barisal, Jhalokathi, Barguna and PotuaKhali.
Due to the scarcity of farmland in those areas, people need to make a living from service and
industry oriented jobs. Those jobs generally require at least ten years of education, which
pushes the people of those areas to be literate.
Figure 6 presents the univariate LISA Moran scatter plot of the zilla-level woman literacy
rate of 2001, where the Moran’s I value is 0.6055. The figure has four quadrants:
low-low, high-low, high-high and low-high. Most of the districts lie in only
two quadrants, high-high and low-low.

Figure 6 Univariate LISA Moran scatter plot of zilla level woman literacy rate of 2001
(see online version for colours)

5.4 Odds ratio

The odds ratio is not a geospatial measure; it is obtained by statistical calculation.
In this study we use the odds ratio to compare the literacy and non-literacy rates of males
and females and to analyse the employment status of men and women.
Table 3 gives the literacy rate, in percent, of persons aged 11 years and above by
literacy skill level.

Table 3 Percentage of persons aged 11 years and above by literacy skill level

               Female   Male
Literate       66.2     65.6
Non-literate   33.8     34.4

Table 4 The employment data in the agricultural and non-agricultural sector from
1999 to 2006

Major occupation            1999–2000                2002–2003                2005–2006
                            Both sex  Male   Female  Both sex  Male   Female  Both sex  Male   Female
Number employed
Total                       38979     31087  7891    44322     34478  9844    47357     36080  11277
Professional, technical     1567      1192   374     1723      1319   403     2231      1737   494
Administrative, managerial  188       173    15      96        92     4       223       201    22
Clerical workers            1211      1081   130     1521      1336   185     1016      872    144
Sales workers               5762      5321   441     6547      6261   286     6711      6476   235
Service workers             2237      998    1239    1979      1027   951     2757      1892   865
Agriculture, forestry and   19343     15577  3767    22764     16992  5772    22926     15221  7705
Production and transport    8671      6744   1926    9693      7450   2243    11493     9681   1812
labourers and others
Percentage of total
Total                       100       100    100     100       100    100     100       100    100
Professional, technical     4.0       3.8    4.7     3.9       3.8    4.1     4.7       4.8    4.4
Administrative, managerial  0.5       0.6    0.2     0.2       0.3    0.0     0.5       0.6    0.2
Clerical workers            3.1       3.5    1.7     3.4       3.9    1.9     2.1       2.4    1.3
Sales workers               14.8      17.1   5.6     14.8      18.2   2.9     5.8       5.2    7.7
Service workers             5.7       3.2    15.7    4.5       3.0    9.7     14.2      18.0   2.1
Agriculture, forestry and   49.6      50.1   47.7    51.4      49.3   58.6    48.4      42.2   68.3
Production and transport    22.3      21.7   24.4    21.9      21.6   22.8    24.3      26.8   16.0
labourers and others

The female-to-male literacy odds ratio (FTMR) is 1.02, which indicates that the female
literacy rate is slightly higher than the male literacy rate.
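
As an illustration, the odds ratio can be computed from the percentages in Table 3. The text does not spell out the formula, so this minimal sketch assumes the usual definition: the odds of being literate (literate share divided by non-literate share) for females, divided by the same odds for males. The function names are hypothetical.

```python
def odds(percent):
    """Odds of an outcome given its percentage share (0 < percent < 100)."""
    share = percent / 100.0
    return share / (1.0 - share)


def odds_ratio(percent_a, percent_b):
    """Odds ratio of group A relative to group B."""
    return odds(percent_a) / odds(percent_b)


# Literate percentages from Table 3 (female, male).
female_literate = 66.2
male_literate = 65.6

ftmr = odds_ratio(female_literate, male_literate)
print(f"FTMR = {ftmr:.3f}")
```

Under this assumed definition the result is slightly above 1.02, which is in line with the reported FTMR up to rounding; a value greater than one means that the odds of literacy are higher for females than for males.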
The latest Labour Force Survey 2008 shows that the total labour force participation rate
for females is around 29.2%. The male-female ratio of non-agricultural employment was
77:23 in 1995–1996 and shifted to 80:20 in 2005–2006, indicating a relative decline in the
female share of non-agricultural employment. Creating opportunities for the female
labour force remains the major bottleneck for wage employment of women in the
non-agricultural sector, with the exception of the garment sector. Table 4 shows
the employment in the agricultural and non-agricultural sector from 1999 to 2006.
Our study also examines the man versus woman employment ratio (MWER), which we
calculate as an odds ratio. The odds ratio is 3.13 in 2002–2003 and 6.28 in 2005–2006,
as presented in Table 5.

Table 5 The man versus woman employment ratio (MWER)

2002–2003 2005–2006
MWER 3.13 6.28

The odds ratio is greater than one, which shows that more men than women are employed.
The ratio has not decreased over the years, which hinders the MDG of Bangladesh.

5.5 Correlation coefficient and coefficient of determination

Table 4 gives the male and female employment data for 1999–2000, 2002–2003 and
2005–2006. The correlation coefficient and the coefficient of determination (R2) are
given in Table 6. For the female employment data, R2 rises from 1999–2000 to 2002–2003
and then falls in 2005–2006, whereas for the male employment data R2 decreases over the
three periods, although it remains much higher than the value for the female employment
data. In the table, FVBS denotes female occupation versus both-sex occupation and MVBS
denotes male occupation versus both-sex occupation.

Table 6 The male and female employment correlation regression result

Year Employment person Correlation Coefficient of determination (R2)

1999–2000 FVBS 0.9452 0.8934
MVBS 0.9969 0.9937
2002–2003 FVBS 0.9624 0.9261
MVBS 0.9956 0.9912
2005–2006 FVBS 0.9404 0.8844
MVBS 0.9862 0.9727
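
The values in Table 6 can be approximated with a short calculation. The text does not state how the coefficients were obtained, so this sketch assumes that FVBS and MVBS are Pearson correlations, taken across the seven occupation rows of Table 4 (Total row excluded), between the female (or male) counts and the both-sex counts. Only the 1999–2000 columns are used, and the helper name `pearson` is hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# 1999-2000 counts from Table 4, one entry per occupation row.
both_sex = [1567, 188, 1211, 5762, 2237, 19343, 8671]
male = [1192, 173, 1081, 5321, 998, 15577, 6744]
female = [374, 15, 130, 441, 1239, 3767, 1926]

r_fvbs = pearson(female, both_sex)
r_mvbs = pearson(male, both_sex)

# The coefficient of determination is the square of the correlation.
print(f"FVBS: r = {r_fvbs:.4f}, R2 = {r_fvbs ** 2:.4f}")
print(f"MVBS: r = {r_mvbs:.4f}, R2 = {r_mvbs ** 2:.4f}")
```

Under this reading the results land close to the 1999–2000 row of Table 6, which suggests, but does not confirm, that this is how the table was produced.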

5.6 Ordinary and spatial regression

In this research we investigate the importance and necessity of including spatial
characteristics in the analysis. Our objective is to develop a model that can predict the
literacy rate of all women in 2001 on the basis of explanatory variables such as the
adult literacy rate and the school attendance rate of females aged 5 to 24 in 2001. The
adult literacy rate is an indicator of great social significance. The adult literacy rate
for the population aged 15 years and above is defined as the ratio of the literate
population aged 15 years and over to the total population aged 15 years and over,
expressed as a percentage. The regression model is given below:
$$\mathrm{ALFE01} = \mathrm{CONST} + a_1\,\mathrm{FESA01} + b_1\,\mathrm{FADU01}$$
Here CONST is the constant term, a1 and b1 are the regression coefficients, and ALFE01,
FESA01 and FADU01 represent the 2001 literacy rate of all women, the school attendance
rate of girls and the literacy rate of adult women, respectively.

From Table 7, we see that the adult women education rate (FADU01) is significant because
its probability is <0.04. The other variable (FESA01) and the constant are not
significant, as their probabilities are >0.04.

Table 7 Ordinary least square result

Variable Coefficient Std. error t-statistic Probability

CONSTANT 1.728758 4.231374 0.408557 0.6842970
FESA01 0.1517121 0.127164 1.193043 0.2374725
FADU01 0.6236322 0.05945695 10.4888 0.0000000
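
As a quick consistency check on Table 7, each t-statistic is the coefficient divided by its standard error. The sketch below only reproduces that arithmetic from the tabulated values; the dictionary layout is an illustrative choice, and the probabilities are not recomputed because they would require the t-distribution and the degrees of freedom.

```python
# (coefficient, standard error) pairs copied from Table 7.
ols_rows = {
    "CONSTANT": (1.728758, 4.231374),
    "FESA01": (0.1517121, 0.127164),
    "FADU01": (0.6236322, 0.05945695),
}

# t-statistic = coefficient / standard error.
t_stats = {name: coef / se for name, (coef, se) in ols_rows.items()}

for name, t in t_stats.items():
    print(f"{name:9s} t = {t:.4f}")
```

A large absolute t-statistic, as for FADU01, corresponds to a small probability, which is why only that variable is judged significant.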

The R-squared for the linear regression is 0.770252, which is reasonably close to 1 and
indicates a good fit between the linear model and the data. As only the adult women
education rate is significant, we next regress the women literacy rate on the adult
women education rate alone. The regression equation is
$$\mathrm{ALFE01} = \mathrm{CONST} + a_1\,\mathrm{FADU01}$$
The results of this regression are given in Table 8.

Table 8 Ordinary least square result

Variable Coefficient Std. error t-statistic Probability

CONSTANT 6.209437 1.955862 3.174783 0.0023356
FADU01 0.6673411 0.04698797 14.20238 0.0000000

Before taking the spatial characteristics into consideration, we first test whether any
spatial autocorrelation exists in the data. A total of five test statistics are reported
in Table 9 to test for spatial dependence.

Table 9 Diagnostics for spatial dependence

Test MI/DF Value PROB

Moran’s I (error) 0.178132 2.4277752 0.0151918
LM (lag) 1 12.8986848 0.0003288
Robust LM (lag) 1 8.7814843 0.0030430
LM (error) 1 4.1399585 0.0418822
Robust LM (error) 1 0.0227580 0.8800881

The first statistic is Moran’s I. Here the value of Moran’s I is 0.178132 and its
probability is 0.0151918, so the statistic is significant. Significance is judged by the
value of the probability (PROB): if the value is <0.04, we consider the statistic to be
significant (Moran, 1950). As Moran’s I is significant, it suggests spatial dependence.
However, this test cannot tell whether a spatial lag model or a spatial error model would
fit the data best. Four Lagrange multiplier test statistics are used for this purpose. The
following workflow (OpenGeoda, 2011) is used to decide between the two
alternatives, i.e., the spatial lag model or the spatial error model.

Figure 7 Workflow for spatial regression model decision
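
A decision rule of this kind can be sketched in code. The content of the figure is not reproduced in the text, so the sketch assumes the commonly described rule: keep OLS if neither LM test is significant, choose the model whose LM test is the only significant one, and fall back on the robust LM tests if both are significant. The function name, argument names and the tie-breaking step are assumptions; the threshold of 0.04 follows the text.

```python
def choose_spatial_model(p_lm_lag, p_lm_error, p_rlm_lag, p_rlm_error, alpha=0.04):
    """Return 'ols', 'lag' or 'error' from the LM test probabilities."""
    lag_sig = p_lm_lag < alpha
    error_sig = p_lm_error < alpha

    if not lag_sig and not error_sig:
        return "ols"    # no evidence of spatial dependence
    if lag_sig and not error_sig:
        return "lag"    # only the lag test is significant
    if error_sig and not lag_sig:
        return "error"  # only the error test is significant

    # Both LM tests are significant: compare the robust versions.
    robust_lag_sig = p_rlm_lag < alpha
    robust_error_sig = p_rlm_error < alpha
    if robust_lag_sig and not robust_error_sig:
        return "lag"
    if robust_error_sig and not robust_lag_sig:
        return "error"
    # Both or neither robust test significant: take the smaller probability.
    return "lag" if p_rlm_lag <= p_rlm_error else "error"


# Probabilities from Table 9.
decision = choose_spatial_model(
    p_lm_lag=0.0003288,
    p_lm_error=0.0418822,
    p_rlm_lag=0.0030430,
    p_rlm_error=0.8800881,
)
print(decision)
```

With the probabilities of Table 9 the rule selects the spatial lag model, and the same choice results at the more common 0.05 threshold, where the decision passes through the robust tests.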

Following the workflow above, Table 9 shows that the probability of LM (lag) is
0.0003288, which is <0.04 and therefore significant, whereas the probability of LM
(error) is 0.0418822. Robust LM (lag) is also significant (0.0030430), but Robust LM
(error) is not (0.8800881). We therefore fit the data with the spatial lag model, which
is presented below:
$$\mathrm{ALFE01} = \mathrm{CONST} + a_1\,\mathrm{FEADU01} + \beta\,\mathrm{W\_ALFE01}$$

Here, W_ALFE01 is the spatially lagged dependent variable for the weight matrix W,
ALFE01 is the woman literacy rate in 2001, FEADU01 is the adult literacy rate of 2001,
CONST is a constant term, and a1 and β are the coefficients. Running the spatial lag
model gives the coefficient values and corresponding statistics reported in Table 10.

This regression gives an R-squared of 0.816588, which is close to 1 and indicates a good
fit. In Table 10, the probabilities of W_ALFE01 and FEADU01 are <0.04, so both
coefficients are significant, including the autoregressive coefficient with the value
β = 0.3650573. The constant term, with a probability of 0.9062555, is not significant.

Table 10 Summary result of spatial lag regression of adult literacy rate versus women literacy rate

Variable Coeff Std. err z-value Prob

W_ALFE01 0.3650573 0.08779334 4.158143 0.0000321
CONST 0.2695622 2.289025 0.1177629 0.9062555
FEADU01 0.5137231 0.05352518 9.597782 0.0000000

Table 11 compares the performance of the classical and spatial lag models. The R-squared
value increases relative to the classical model, indicating that the spatial lag model
fits the data better than the classical model. In addition, the log likelihood increases
from –175.44 in the classical model to –168.583 in the spatial model. Even after the
improved fit is penalised for the added variable (the spatially lagged dependent
variable), the Akaike information criterion (AIC) and the Schwarz criterion (SC) both
decrease relative to OLS.

Table 11 Spatial lag model results

Metric Linear regression Spatial regression

R-squared: 0.764891 0.816588
Log likelihood: –175.44 –168.583
Akaike info criterion: 354.88 343.165
Schwarz criterion: 359.198 349.642
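
The information criteria in Table 11 appear consistent with the standard definitions AIC = 2k − 2 ln L and SC = k ln n − 2 ln L, where ln L is the log likelihood, k the number of estimated coefficients and n the number of observations. In the sketch below, k = 2 for the classical model, k = 3 for the spatial lag model and n = 64 are assumptions inferred from the tabulated values; they are not stated in the text.

```python
from math import log


def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * k - 2 * log_likelihood


def schwarz(log_likelihood, k, n):
    """Schwarz (Bayesian) criterion: k ln n - 2 ln L."""
    return k * log(n) - 2 * log_likelihood


N_OBS = 64  # assumed number of observations (inferred, not from the text)

# Log likelihoods from Table 11; k values are assumptions.
models = {
    "Linear regression": (-175.44, 2),
    "Spatial regression": (-168.583, 3),
}

for name, (ll, k) in models.items():
    print(f"{name}: AIC = {aic(ll, k):.3f}, SC = {schwarz(ll, k, N_OBS):.3f}")
```

Both criteria add a penalty that grows with k, so a lower value for the spatial lag model means that the gain in log likelihood outweighs the cost of the extra coefficient.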

Table 12 presents the true woman literacy rate of 2001 (ALFE01), the predicted literacy
rate ($\widehat{\mathrm{ALFE01}}$), the residuals and the prediction errors for the first
20 observations. The residuals are the estimates of the model error term, i.e.,
$(I - \beta W)\,\mathrm{ALFE01} - (\mathrm{CONST} + a_1\,\mathrm{FEADU01})$, whereas the
prediction error is $\mathrm{ALFE01} - \widehat{\mathrm{ALFE01}}$.
We also calculate the Moran’s I test statistic for the residuals.
Figure 8 presents the Moran scatter plot of the residuals. The Moran’s I statistic is
–0.0570; the negative sign suggests slight dispersion, but the value is close to 0. This
indicates that including the spatially lagged term in the model removes essentially all
of the spatial autocorrelation, as it should.
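
For readers who want to see how such a statistic is formed, Moran’s I for a vector of values and a spatial weight matrix W is (n / S0) · (z′Wz) / (z′z), where z holds the deviations from the mean and S0 is the sum of all weights. The sketch below uses a hypothetical chain of four regions with binary contiguity weights; it is not the district weight matrix used in the paper.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and a square weight matrix."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)  # sum of all weights
    cross = sum(weights[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    total = sum(zi * zi for zi in z)
    return (n / s0) * (cross / total)


# Hypothetical chain of four regions: 1-2, 2-3 and 3-4 are neighbours.
chain_weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

clustered = morans_i([1.0, 2.0, 3.0, 4.0], chain_weights)     # similar neighbours
dispersed = morans_i([1.0, -1.0, 1.0, -1.0], chain_weights)   # alternating values
print(clustered, dispersed)
```

Positive values indicate that neighbouring regions carry similar values, negative values indicate alternation, and values near zero, as for the spatial lag residuals, indicate little remaining spatial structure.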

Table 12 LM lag regression result

OBS ALFE01 Predicted Residual Pred error

1 44.86 44.83870 0.08007 0.02130
2 52.25 51.11696 1.07114 1.13304
3 48.21 40.62344 7.93993 7.58656
4 44.21 45.35398 –1.21959 –1.14398
5 38.91 40.86728 –2.74384 –1.95728
6 18.69 23.14516 –1.99797 –4.45516
7 23.49 38.15266 –13.79261 –14.66266
8 19.95 24.26015 –0.82060 –4.31015
9 28.7 31.39624 –2.36897 –2.69624
10 39.59 39.54879 0.71634 0.04121
11 34.23 36.31013 –2.04572 –2.08013
12 25.64 28.12045 0.41861 –2.48045
13 42.24 41.69074 2.76912 0.54926
14 33.32 37.15178 –3.56555 –3.83178
15 39.71 41.20976 –1.01411 –1.49976
16 27.51 28.73009 1.40811 –1.22009
17 50.58 44.08991 6.03481 6.49009
18 43.81 39.15116 4.29259 4.65884
19 30.4 29.23077 0.58651 1.16923
20 41.15 38.63325 2.73375 2.51675
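
The prediction error column of Table 12 can be checked directly, since it is the observed rate minus the predicted rate. The sketch below does this for the first five observations only; the residual column cannot be reproduced in the same way because it also requires the weight matrix W, which is not given in the text.

```python
# (ALFE01, predicted, reported prediction error) for observations 1-5 of Table 12.
rows = [
    (44.86, 44.83870, 0.02130),
    (52.25, 51.11696, 1.13304),
    (48.21, 40.62344, 7.58656),
    (44.21, 45.35398, -1.14398),
    (38.91, 40.86728, -1.95728),
]

# Prediction error = observed value - predicted value.
computed = [observed - predicted for observed, predicted, _ in rows]

for (_, _, reported), value in zip(rows, computed):
    print(f"computed = {value:9.5f}   reported = {reported:9.5f}")
```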

Figure 8 Moran scatter plot for spatial lag residuals (see online version for colours)

Finally, if we plot the prediction errors against the predicted values, we get a line
almost parallel to the x axis, with a slope of 0.0135 (Figure 9). The slope is close to
zero, which shows that the prediction error has almost no trend with respect to the
predicted value and supports the goodness of fit of the model to the data.

Figure 9 Scatter plot of prediction errors against predicted values (see online version
for colours)

6 Conclusions

The analysis presented in this paper shows that there is some spatial consistency
in the distribution of the women literacy rate. It also reports the impact of the women
literacy rate on GDP growth in Bangladesh. Besides, we present a comparative study of
the literacy rates of women and men. We observe that although the women literacy rate
has increased, the employment rate of women is not satisfactory. On the other hand, the
effect of women literacy on the agricultural and service growth rates has an impact on
the GDP growth rate of the country. Taking policy-level decisions with these spatial
properties in mind can lead to uniform positive development throughout the country.

