R.A. Rafsan1*, Z.S. Ishmam2, T. Ahammed3
Corresponding Author


The prevalence of chest diseases, including pulmonary tuberculosis, asthma, COPD, emphysema, pneumonia and
allied diseases, has been on the rise worldwide. Hospitalization and morbidity of patients with some chest and
pulmonary diseases have shown a distinct correlation with air quality. Specialized hospitals that treat chest
diseases have seen a sharp rise in admissions in Dhaka City, where the air quality is among the worst in the
world. This study aims to develop models that predict hospital admissions due to chest diseases using air quality
levels and meteorological parameters as predictors. We developed two prediction models, Multiple Linear
Regression (MLR) and Multi-layer Perceptron (MLP), a type of feedforward Artificial Neural Network (ANN),
and compared their performance. We collected daily hospital admission data from the National Institute of
Diseases of the Chest and Hospital (NIDCH), Dhaka, a specialized hospital for treating patients with different
kinds of chest diseases. Daily average pollutant concentrations were collected from three Continuous Air
Monitoring Stations (CAMS) in Dhaka, and daily average meteorological parameters for Dhaka were collected
from the Bangladesh Meteorological Department. All data cover the period 2013 to 2018. We measured model
performance using the Root Mean Squared Error (RMSE) of the predicted number of hospital admissions, with
no scaling of the data. The RMSE of the prediction model derived via MLP is 12.879, which is lower than the
RMSE of 12.978 for the model derived via MLR. The results show that the prediction error of the model derived
via MLP is usually lower than that of the model derived via MLR, which indicates that artificial neural networks
can improve prediction performance for this type of dataset. This study demonstrates a modeling approach for
predicting hospital admissions from air pollution levels, which could help hospitals prepare for future surges in
patients due to severe air pollution.

Keywords: Chest disease; air quality; meteorological parameter; Multi-layer Perceptron (MLP); Multiple
Linear Regression (MLR).

1. Introduction

Dhaka city has seen a staggering increase in air pollutant concentrations over the past decade. With the gradual
worsening of air quality, the number of patients with chest and respiratory diseases has also risen in the hospitals
of Dhaka city. Past studies show a correlation between ambient air pollution and health deterioration [1]. Many
studies also associate increased air pollution levels with respiratory tract and chest diseases such as asthma and
chronic obstructive pulmonary disease (COPD) [2, 3]. Along with air pollution, past studies have found
atmospheric changes such as weather and meteorological conditions to be a causal factor that can suddenly
trigger emergency hospital visits by patients with respiratory conditions [4, 5].
Studies that use statistical analysis and machine learning based techniques to assess the association between
daily averaged pollutant concentrations, meteorological variables and daily hospitalization counts due to
respiratory diseases include Multiple Linear Regression (MLR) analysis [6] and linear regression models with
variables unlagged or lagged by 24 hours [7]. Neural network studies in the same field include the combined use
of an Artificial Neural Network (ANN) and conditional logistic regression [8], an ANN based classifier using a
Multi-Layer Perceptron (MLP) with the backpropagation algorithm that predicts peak events (peak demand
days) [9], and seven-day-ahead forecasting of childhood hospital admissions using meteorological and air
pollution data as ANN input variables [10]. There have been few studies related to air pollution, meteorological
influence and health hazards in Bangladesh [11–13]. To the best of our knowledge, no past study in Bangladesh
has created a prediction model for patients suffering from chest diseases.

This study seeks to develop forecast models using MLP and MLR and to evaluate the possible association of
meteorological parameters and air pollution with the number of daily hospital admissions due to chest diseases.
The patient data we used are the daily numbers of patient admissions at the National Institute of Diseases of
the Chest and Hospital (NIDCH), Dhaka. This hospital best represents the acute respiratory distress of the
general population exposed to alarming pollution levels. We also evaluated and compared the performance of
the two models in predicting the number of indoor admissions from air pollutant and meteorological data. We
believe this study can help hospital administrators and policy makers in both environmental and medical
disciplines take suitable decisions to combat sudden surges of patients from a broader perspective.

2. Methodology
2.1 Study Site

Dhaka city, currently the sixth most densely populated city in the world, has an area of 2161 km². Following
the global urbanization trend, the city has become one of the fastest growing cities. However, due to the lack of
proper planning for infrastructure and resource management, and negligence towards environmental policy
frameworks, the Dhaka Metropolitan Area (DMA) has witnessed a severe degradation in its overall
environmental condition, especially in air quality [14]. Vehicular emissions, brick kilns and construction works
have been identified as the major sources of pollution inside the city for the past few decades [15–17]. The major
vehicle-heavy roads are located in the south and south-eastern parts of the city, where the Continuous Air
Monitoring Stations (CAMS) are operating. Most of the brick kiln clusters are in the six districts surrounding
DMA [18]. The meteorological data collection station is located at Agargaon, which is within three kilometers
of all three CAMS.

2.2 Medical, Meteorological and Pollutant Data

The National Institute of Diseases of the Chest and Hospital (NIDCH), Mohakhali, Dhaka, maintains a database
of daily indoor and outdoor patient records. The records include gender and morbidity data for indoor patients,
and separate demographic data for children and adults (both male and female) are available for outdoor
attendees. For our study, the daily number of patient admissions (indoor patients) was collected for the period
2013 to 2018. The corresponding daily meteorological data for Dhaka city were obtained from the Bangladesh
Meteorological Department (BMD), which collects and processes meteorological data from monitoring stations
throughout the country. Of all the meteorological parameters recorded by BMD, we used four in this study:
average dry-bulb temperature (°C), average humidity (%), total rainfall (mm) and prevailing wind speed
(knots).
Pollutant data were provided by the Department of Environment (DoE), Dhaka. Its Clean Air and Sustainable
Environment (CASE) project monitors the criteria pollutants (carbon monoxide, nitrogen dioxide, ozone, sulfur
dioxide, PM10 and PM2.5) with 11 Continuous Air Monitoring Stations (CAMS), which together form a
monitoring network across the country. Three of these CAMS (CAMS-1, Sangshad Bhaban, Sher-e-Bangla
Nagar; CAMS-2, Farmgate; CAMS-3, Darussalam) continuously monitor the criteria pollutants inside Dhaka
city. These CAMS also record some meteorological parameters (solar radiation, relative humidity, ambient
temperature and rainfall), of which solar radiation is also used as a meteorological parameter in this study.

2.3 Data Preprocessing

The daily data provided by DoE for each of the three CAMS were combined with the meteorological parameters
and the number of indoor patient admissions to make a total of three datasets. Each dataset had a total of 11
predictor variables, comprising the air pollutant and meteorological parameters mentioned before, and the only
response variable was the number of daily indoor patient admissions at NIDCH. The daily CAMS data and the
indoor patient data contained a significant percentage of missing values, whereas the data collected from BMD
contained no missing values. We imputed the missing data with a simple averaging method, replacing a missing
value for a given day of a year with the average of the values for the same day in the previous and next years.
Missing values were imputed for every pollutant, but the patient data were kept unchanged because they were
the output variable in our models. Table 1 summarizes the percentage of missing values for the different
variables before and after the imputation process. After the imputation process was complete, some missing
values still remained in the datasets. In the data cleaning process, we made the datasets uniform by ensuring
that no sample in the datasets contained any missing value. The overall flow of this study, including the
preprocessing and forecasting steps, is shown in Figure 1.

Table 1: Percentage of missing values before and after data imputation for different predictor variables
containing missing values

Variable            Unprocessed Data (%)        After Imputation (%)
SO2                 78.5    21.2    12.1        7.3     4.1     4.5
NO2                 84.1    42.0    12.8        8.4     2.6     1.6
CO                  54.7    14.6    24.6        10.2    3.1     0.2
O3                  65.3    11.4    17.5        7.1     3.0     1.5
PM2.5               43.8    35.9    5.0         7.6     7.9     0.4
PM10                52.6    46.6    5.2         5.5     6.8     0.3
Solar Radiation     36.7    34.5    2.6         5.2     2.5     0.2
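
The adjacent-year averaging used for imputation can be illustrated with a minimal Python sketch. The function name, the `(year, day_of_year)` keying and the sample readings are hypothetical, and the sketch assumes that a gap is filled only when both neighbouring years have an observed value, which the text does not state explicitly.

```python
from typing import Dict, Optional, Tuple

Key = Tuple[int, int]  # (year, day_of_year); a hypothetical indexing scheme


def impute_from_adjacent_years(
    series: Dict[Key, Optional[float]]
) -> Dict[Key, Optional[float]]:
    """Fill a missing value with the mean of the same day in the previous and next year.

    Assumption: a gap is filled only when BOTH neighbouring years have an
    observed value for that day; otherwise it stays missing.
    """
    filled = dict(series)
    for (year, day), value in series.items():
        if value is not None:
            continue
        prev_value = series.get((year - 1, day))
        next_value = series.get((year + 1, day))
        if prev_value is not None and next_value is not None:
            filled[(year, day)] = (prev_value + next_value) / 2.0
    return filled


# Hypothetical PM2.5 readings for two days across three consecutive years.
pm25 = {
    (2014, 15): 180.0, (2015, 15): None, (2016, 15): 200.0,
    (2014, 16): None, (2015, 16): None, (2016, 16): 150.0,
}
pm25_filled = impute_from_adjacent_years(pm25)
```

Under this assumption, values whose neighbouring years are also missing remain unfilled, which is consistent with some missing values remaining after imputation.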

Medians and quartiles, along with the spread of the different variables in the datasets, are shown in Figure 2.
The figure shows that most of the predictor variables and the response variable contain a significant amount of
noise or outliers, and that the predictor variables have different ranges. Therefore, to make the variables
comparable and to speed up the optimization of the models, all input variables (features) were scaled. For
scaling we used standardization, also known as z-score normalization. The formulation of this method is:


$$x' = \frac{x - \mu}{\sigma}$$

Where, $x'$ = feature vector after scaling, $x$ = original feature vector, $\mu$ = mean of the feature vector and
$\sigma$ = standard deviation of the feature vector.
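
A minimal sketch of z-score standardization for a single feature is given below. The feature values are hypothetical, and the use of the population standard deviation is an assumption, since the text does not specify which estimator was used.

```python
from statistics import mean, pstdev
from typing import List


def standardize(feature: List[float]) -> List[float]:
    """Return the z-scores of a feature: (x - mean) / standard deviation."""
    mu = mean(feature)
    sigma = pstdev(feature)  # population standard deviation (an assumption)
    if sigma == 0:
        # A constant feature carries no spread; map it to zeros.
        return [0.0 for _ in feature]
    return [(x - mu) / sigma for x in feature]


# Hypothetical daily average temperatures in degrees Celsius.
temperature_c = [18.0, 22.0, 26.0, 30.0]
scaled_temperature = standardize(temperature_c)
```

After scaling, each feature has zero mean and unit standard deviation, so pollutant concentrations and meteorological parameters measured in different units share a comparable range.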

Figure 1: Overall data preprocessing, data cleaning and forecasting process of this study.

2.4 Multiple Linear Regression

The multiple linear regression (MLR) model is one of the most widely used models in statistical analysis and
predictive modeling. For our MLR model, we used Scikit-learn (version 0.23.1), a Python machine learning
library [19]. For training and testing, all three datasets were split randomly with a ratio of 60:40. The objective
of MLR is to use several explanatory variables to predict a response variable through a linear relationship. The
model we used can be expressed as:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$

Here, $y$ = value of response variable (daily no. of indoor patients), $\beta_0$ = unknown regression bias,
$\beta_1, \dots, \beta_n$ = unknown regression coefficients, $x_1, \dots, x_n$ = values of independent variables.
To determine the accuracy of the models, we determined the Root-Mean-Square-Error (RMSE) for the observed
and predicted no. of patient admission.
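
The MLR workflow (random 60:40 split, least-squares fit, RMSE on the test subset) can be sketched as follows. This is an illustration only: it uses NumPy's least-squares solver as a stand-in for the Scikit-learn model used in the study, and the data, seed and variable names are synthetic assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic stand-in data: 500 days, 11 standardized predictors, one response.
n_days, n_features = 500, 11
X = rng.normal(size=(n_days, n_features))
true_coef = rng.normal(size=n_features)
y = 60.0 + X @ true_coef + rng.normal(scale=2.0, size=n_days)

# Random 60:40 split into training and test subsets.
order = rng.permutation(n_days)
n_train = n_days * 6 // 10
train_idx, test_idx = order[:n_train], order[n_train:]


def add_bias_column(features: np.ndarray) -> np.ndarray:
    """Prepend a column of ones so the first coefficient acts as the bias."""
    return np.hstack([np.ones((features.shape[0], 1)), features])


def rmse(observed: np.ndarray, predicted: np.ndarray) -> float:
    """Root mean squared error between observed and predicted values."""
    return float(np.sqrt(np.mean((observed - predicted) ** 2)))


# Ordinary least squares estimate of the bias and the 11 coefficients.
beta, *_ = np.linalg.lstsq(add_bias_column(X[train_idx]), y[train_idx], rcond=None)

y_pred = add_bias_column(X[test_idx]) @ beta
test_rmse = rmse(y[test_idx], y_pred)
```

Because the RMSE is computed on the unscaled response, it is expressed in the same unit as the response itself, here the number of admissions per day.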

Figure 2: Summary of air pollutant, meteorological and medical data showing median, quartiles and range. Whiskers
show the range for all variables.

2.5 Artificial Neural Network

Artificial neural networks (ANNs) are complex multivariate statistical models that can be used to establish a
complex nonlinear relationship between response variables and given explanatory variables. We used a fully
connected feedforward neural network, specifically a Multi-layer Perceptron (MLP), which has previously been
used for patient forecasting using medical and environmental data [9, 20]. This type of network is constructed
from different layers, specifically an input layer, one or more hidden layers and an output layer, each layer
consisting of one or more nodes. The input data enter the network through the input layer nodes and move
forward through the nodes of the network to the output nodes. The nodes are connected with linear connections.
The value of each node is calculated as the weighted sum of its input values plus a bias, and the result is then
passed to the nodes of the next layer. To introduce nonlinearity into the network, we applied Rectified Linear
Units (ReLU) as the activation function at each of the hidden nodes. The ReLU activation function is as follows:


$$f(x) = \max(0, x)$$

Where, $x$ is the input to a node. In our present study, we used a deep feedforward neural network architecture
with a total of 6 hidden layers, determined by trial and error and generalized for all 3 of our datasets. The
overall network architecture, along with the no. of nodes used in each layer, is shown in Figure 3. The no. of
nodes in each layer was also determined by trial and error.
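
The forward pass of such a network can be sketched in a few lines of NumPy. This is a framework-free illustration rather than the PyTorch implementation used in the study: the hidden layer widths, the weight initialization and the random input batch are hypothetical, and no training is performed.

```python
import numpy as np


def relu(z: np.ndarray) -> np.ndarray:
    """Rectified Linear Unit: max(0, z) applied element-wise."""
    return np.maximum(0.0, z)


def init_layers(sizes, rng):
    """Create one (weights, bias) pair per connection between consecutive layers."""
    return [
        (rng.normal(scale=np.sqrt(2.0 / n_in), size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def forward(x: np.ndarray, layers) -> np.ndarray:
    """Propagate a batch through the network; ReLU on hidden layers only."""
    activation = x
    for weights, bias in layers[:-1]:
        activation = relu(activation @ weights + bias)
    weights, bias = layers[-1]
    return activation @ weights + bias  # linear output for regression


rng = np.random.default_rng(seed=0)
# 11 inputs, six hidden layers (widths are hypothetical), one output node.
layer_sizes = [11, 64, 64, 32, 32, 16, 8, 1]
layers = init_layers(layer_sizes, rng)

batch = rng.normal(size=(5, 11))  # five hypothetical standardized samples
predictions = forward(batch, layers)
```

The output layer is left linear so that the network can return any real-valued admission count, while the ReLU on the hidden layers supplies the nonlinearity.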
For all 3 datasets, we used the same train and test subsets as in the MLR models. To implement the MLP models
we used PyTorch (version 1.6.0), a Torch based open-source machine learning framework for Python with good
Graphics Processing Unit (GPU) support [21]. The networks were optimized with the Adam optimizer using the
backpropagation algorithm [22]. We trained the networks using 3 different loss functions: Mean Squared Error
(MSE), Mean Absolute Error (MAE) and Huber Loss [23].

We used a specific form of Huber loss known as Smooth L1 Loss, which is a robust loss function for regression
problems. The function can be described as:


$$L(y, \hat{y}) = \begin{cases} 0.5\,(y - \hat{y})^2, & \text{if } |y - \hat{y}| < 1 \\ |y - \hat{y}| - 0.5, & \text{otherwise} \end{cases}$$

Where, $y$ = observed value, $\hat{y}$ = predicted value. In case of outliers, the Huber loss function increases
less rapidly than quadratic loss functions. Hence, the estimation of the loss using this loss function is robust to
outliers [24]. For hyperparameter optimization and model selection, we further split the test subset into a
cross-validation set and a final test set with a ratio of 50:50. After determining the optimized models for all 3
datasets, we determined RMSE as a measure of accuracy of the models, as we did for the MLR models.
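
The difference between the Smooth L1 loss and a quadratic loss for a single prediction can be sketched as below. The threshold of 1 (the `beta` argument) and the admission counts are assumptions made for illustration; in training, the per-sample losses would be averaged over a batch.

```python
def squared_error(observed: float, predicted: float) -> float:
    """Quadratic loss for a single sample."""
    return (observed - predicted) ** 2


def smooth_l1(observed: float, predicted: float, beta: float = 1.0) -> float:
    """Smooth L1 loss: quadratic for small errors, linear for large ones.

    beta is the switching threshold; beta = 1.0 is an assumed value.
    """
    diff = abs(observed - predicted)
    if diff < beta:
        return 0.5 * diff ** 2 / beta
    return diff - 0.5 * beta


# Hypothetical case: 50 observed admissions on a given day.
small_error_loss = smooth_l1(50.0, 49.5)   # error of 0.5, quadratic branch
outlier_loss = smooth_l1(50.0, 20.0)       # error of 30, linear branch
outlier_squared = squared_error(50.0, 20.0)
```

For the error of 30, the Smooth L1 loss grows linearly to 29.5 whereas the squared error reaches 900, which is why single-day surges influence the fit less under this loss.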

Figure 3: Schematic diagram of feedforward neural network architecture with no. of nodes in each layer. Bias nodes are not
shown in the diagram. Outputs of the nodes are passed through a ReLU activation function before entering the next layer,
except for the output layer.

3. Results and Discussion

We used an ANN (specifically an MLP) and MLR to forecast the number of hospital admissions due to chest
related diseases. The RMSE of the observed and predicted numbers for each of the datasets and training methods
used is summarized in Table 2. The table shows that, for our datasets, the prediction error is lowest for the MLP
model trained with CAMS-3 air pollutant parameters and the MSE loss function. The prediction error for MLR
is also lowest for the dataset with CAMS-3 air pollutant parameters. This indicates that, for this type of dataset,
an MLP trained with MSE loss would be a better model for predicting hospital admissions than MLR. Figure 4
shows the observed and predicted number of indoor admissions in the test set for the best performing model,
smoothed with a 7-data-point moving average. From the figure, we observed that although our model could
recognize the patterns of patient admission fluctuation, it could not identify the peak numbers.
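
The 7-data-point moving average used for plotting can be sketched as follows. The text does not state whether the window is trailing or centered, so the sketch assumes a simple window that only produces a value once 7 points are available; the admission counts are hypothetical.

```python
from typing import List


def moving_average(values: List[float], window: int = 7) -> List[float]:
    """Average of each run of `window` consecutive values (no padding)."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(values) < window:
        return []
    running_sum = sum(values[:window])
    averages = [running_sum / window]
    for i in range(window, len(values)):
        # Slide the window: add the newest value, drop the oldest one.
        running_sum += values[i] - values[i - window]
        averages.append(running_sum / window)
    return averages


# Hypothetical daily admissions containing one single-day surge (40).
daily_admissions = [10, 12, 11, 13, 40, 12, 11, 10, 9]
smoothed = moving_average(daily_admissions, window=7)
```

Averaging over 7 points dampens single-day spikes in both the observed and predicted curves, so the plotted series show weekly level changes rather than individual daily values.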

Table 2: Summary of RMSE of predicted no. of indoor admissions using models trained with different
loss functions

Station                 MLP              MLP              MLP                     MLR
                        (MSE-trained)    (MAE-trained)    (Huber Loss-trained)
Sangshad (CAMS-1)       13.321           13.413           13.447                  13.133
Farmgate (CAMS-2)       13.400           13.548           13.450                  13.380
Darussalam (CAMS-3)     12.524           12.803           12.879                  12.978

It is also worth mentioning that, in our study, feature selection was done by considering the correlations between
the different air pollutant and meteorological parameters. However, the prediction error is higher than in
previous similar studies [10]. One possible reason is that patient records are usually low on weekly and
government holidays, and there is a surge of patients on the next working day. These sudden surge events make
the dependent variable of our dataset less predictable overall. Another reason could be that the sequential
variables studied were not treated as time series. All independent and target variables were taken to be discrete
observations fed into the neural network. This does not take into account the complex seasonality of the data,
i.e. the weekly or monthly seasonality of patient data and the yearly seasonality of pollutant data. It is also
observed that patient admissions are much higher in the winter season.
Our data handling process classified the real-time variability and surges in the datasets as anomalies and
outliers, which had a significant impact on the prediction results. It is also quite difficult to explain the relative
importance of the various input variables in an intelligible form, because the associations between
environmental parameters and medical problems are too complex to be expressed analytically. Training and
applying ANN models that account for more of the factors affecting the phenomenon would probably result in a
significant improvement in their predictive ability. Other neural network architectures, specifically recurrent
neural networks with Long Short-Term Memory (LSTM) cells, may also improve the prediction performance.

Figure 4: No. of hospital admissions for chest diseases predicted by the optimum model (7 data points moving average).
Prediction is done on the whole test set and compared to the observed values in the same set.

4. Conclusion

Our study showed that an MLP trained with the MSE loss function produces a better result than MLR for
predicting hospital admissions of patients with chest diseases from air quality and meteorological datasets.
These predictive models can also be used to identify trends in indoor patient admission. Further study needs to
be conducted on multiple hospitals to reach a more conclusive result. Time series prediction with ANN could
also be explored. Most of the hospital policies in Bangladesh are based on cause and effect rather than on
factual studies. This study will help hospital administrators adopt the changes needed to be well prepared before
a sudden influx of patients.


We are thankful to the Department of Civil Engineering, Bangladesh University of Engineering and Technology
(BUET), for the necessary guidance and motivation for this study. We would also like to thank the Department
of Environment (DoE) for providing the air quality data, the Bangladesh Meteorological Department (BMD) for
the meteorological data, and the National Institute of Diseases of the Chest and Hospital (NIDCH) for the
patient data.

Proceedings of the 5th International Conference on Advances in Civil Engineering (ICACE 2020)
