Generate bioinformatics data using Generative
Adversarial Network: A Review
Sharmilan S 1, Hapugahage Thilak Chaminda 2
Informatics Institute of Technology, Affiliated to Robert Gordon University, Scotland
Colombo, Sri Lanka 1, 2

Abstract – Data is the most important ingredient of machine learning. In bioinformatics the data are highly sensitive, so access to them for secondary purposes (e.g., research) raises many legal and ethical issues. As a result, in many bioinformatics studies collecting the data consumes more time than the development phase. Some work has tried to address the legal and ethical issues by anonymising the data through encryption, de-identification and perturbation of potentially identifiable attributes. To some extent these solutions limit data breaches, but anonymised data do not perform well in analysis and mining tasks. Recently, generative adversarial networks (GANs) have become a research focus of artificial intelligence. The goal of a GAN is to estimate the underlying distribution of real data samples and to generate new samples from that distribution. Here, we review the use of GANs to generate data sets in bioinformatics, presenting examples of current research. To provide a useful and comprehensive perspective, we categorise the research both by bioinformatics data and by GAN architecture and flow. Additionally, we discuss the issues that arise when GANs are used to generate bioinformatics data sets and suggest future research directions. We believe that this review will provide valuable insights for researchers who want to apply GANs to generate bioinformatics data sets.

I INTRODUCTION

Access to data is one of the bottlenecks in the development of machine learning solutions to domain-specific problems. The availability of standard datasets (with associated tasks) has helped to advance the capabilities of learning systems in multiple tasks. However, in bioinformatics and medicine it is hard to collect standard datasets in large quantities [33]. In medicine, defence, security and some other fields the data are highly sensitive, so access to them is tightly controlled.

The exponential growth of the amount of biological data available raises two problems: on the one hand, efficient information storage and management and, on the other hand, the extraction of useful information from these data. The second problem requires the development of tools and methods capable of transforming all these heterogeneous data into biological knowledge about the underlying mechanism [19]. Medical data can be divided into two main types, text and images. Many researchers have worked on predicting or analysing medical issues or diseases using images [19] [33]. At the same time, many researchers have made predictions from patient records or electronic health records [8] [36] [33]. In that case the sensitivity and privacy of the data come into play, as most health records contain personal details.

A prediction model depends entirely on its data sets. Attributes are selected by inspecting the data set, and the collected data are preprocessed and used for model training and testing, so the model learns only what those data sets provide. If the gathered data are of poor quality or class-imbalanced, the trained model will be inefficient [27]. To improve the whole process, the collected data should therefore be balanced, of good quality and cover a wide range of inputs.

Modern machine learning methods based on deep neural network architectures require large amounts of training data to achieve the best possible results [25]. However, because of privacy and legal issues it is not possible to access real patient data on a large scale, and when some data sets can be accessed, their quality and distribution are often poor [8] [33] [19] [36]. This paper therefore discusses the issues and difficulties that bioinformatics research faces in terms of data access and availability, reviews recent research that uses GANs to generate various outputs, including data sets, and considers how GANs are improving the field of bioinformatics by generating data samples.

II BIOINFORMATICS

Bioinformatics is an interdisciplinary field that develops software tools and machine learning models to understand biological data. It combines computer science and mathematics to analyse and interpret biological data. Because it is a wide and complicated area, bioinformatics has several subfields. The most popular are sequence analysis, gene and protein expression, analysis of cellular organization, structural bioinformatics, and network and systems biology
[26]. These areas can be divided further into subareas. For example, DNA sequencing, sequence assembly, genome annotation, computational evolutionary biology and analysis of mutations in cancer are all parts of sequence analysis [26]. The National Center for Biotechnology Information reports three main scientific applications of bioinformatics: evolutionary biology, protein modeling and genome mapping [29]. As the field has advanced, the quantity and quality of biological information have skyrocketed over the past decades.

Because bioinformatics draws on mathematics and computer programming, advances in machine learning and artificial intelligence have greatly improved the field. Deep learning has advanced rapidly since the early 2000s and now demonstrates state-of-the-art performance in various fields, including bioinformatics [20]. The invention of GANs has also helped bioinformatics researchers to produce biomedical data and images for some complicated areas of bioinformatics [33] [11] [19] [41]. Using machine learning and data mining models, bioinformatics has addressed many complicated problems, such as predicting diseases at an early stage, calculating patient risk levels at an early stage, and even modeling and remodeling RNA and DNA [26] [28]. Researchers have developed applications and tools to detect or identify various types of cancer, brain tumours, diabetes, heart attacks and other conditions [26] [29]. In conclusion, bioinformatics has absorbed the advances in computing and AI and applied them effectively in biology and medicine.

III BIOINFORMATICS DATA

A wide array of biomedical data are generated and made available to healthcare experts and researchers for the purposes of research. However, due to the diverse nature of medical data, it is difficult to analyze them and predict outcomes [28]. For most diseases, symptoms and causes differ from region to region and from country to country. For many diseases, symptoms and causes also vary with non-medical parameters such as climate, behavior and culture [33] [29] [43]. Data collected in one place and used for research may therefore not be applicable to another place with different non-medical parameters. When a researcher tries to access medical data, many important parameters, such as date of birth, birthplace, addictions and conditions such as HIV, are hidden to protect patient privacy [30] [17]. Often the collected data do not cover all the classes to be predicted, and for some rare cases there may be no data at all. A supervised model trained on such data will fail to predict rare cases or will predict them wrongly, so the efficiency and accuracy of the model fall because of class imbalance.

A Quality and quantity of the data

Generally, a successful decision support or prediction system needs a good amount of quality data. The data can be collected as domain knowledge or as real patient data sets. The first approach is more expensive, and the knowledge collected must be sufficient in both quality and quantity [33] [32]. The second is easier to obtain, but the amount of data needed is the issue. During the big data boom the health care industry, like other industries, recognised the importance of data and started to collect and store them for future work. There is even research on unifying all medical data in one central system to address data diversity [35]. However, a researcher who tries to access these data faces many legal and ethical issues, because the data are sensitive and collected from patients.

The amount of data required to train a supervised model is not constant; it depends on the complexity of the problem and of the model [25] [10] [32]. Most bioinformatics research problems are complex and sensitive, so researchers need to build efficient models that give highly accurate predictions. A model built with a nonlinear algorithm also needs more data samples for training and testing than one built with a linear algorithm [25].

B Lack of data in terms of quality and quantity

Data mining and machine learning are typically associated with solving real-world problems that are characterized by a large amount of data. However, in practice, collecting large amounts of data in the medical field is often infeasible. Although data mining could make important advances in this field, several challenges must be addressed [11]. Existing works that apply data mining to small datasets have shown the following challenges:

• Overfitting. A classification decision based on a small number of instances is susceptible to overfitting, because the samples are not representative of the whole data distribution [27].
• Poor classification performance. A small amount of data leads to very poor classification performance [13].
• Noise. Noisy data instances lead to unclear class boundaries and reduce overall classification accuracy.

New research and innovation in data mining technology are needed to address these challenges [11]. Class imbalance is another major problem with data, especially in the medical field. In an imbalanced data set, the class with more instances is called the major (majority) class, while the class with relatively few instances is called the minor (minority) class [27] [10]. Most machine learning algorithms work best when the number of instances of each class is roughly equal. When the number of instances of one class far exceeds the other, problems arise. In such situations most classifiers are
biased towards the majority classes and hence show very poor classification rates on minority classes. A classifier may even predict everything as the majority class and ignore the minority class, because it does not have enough evidence for the minority class [27].

IV LEGAL AND ETHICAL ISSUES

One reason behind limited access is that EHR data are composed of personal identifiers, which, in combination with potentially sensitive medical information, induce privacy concerns. As a result, access to such data for secondary purposes (e.g., research) is regulated, as well as controlled by the health care organizations that are at risk if data are misused or breached [17]. The review process by legal departments and institutional review boards can take months, with no guarantee of access [11]. This process limits timely opportunities to use data and may slow advances in biomedical knowledge and patient care. Health care organizations often aim to mitigate privacy risks through the practice of de-identification [22], typically through the perturbation of potentially identifiable attributes (e.g., dates of birth) via generalization, suppression or randomization, and then make the data available for research use [21]. However, most bioinformatics research on disease prediction or risk analysis needs a wide range of data, including exact dates of birth and residential details. If the data are randomized, the accuracy and efficiency of the resulting models may therefore suffer.

V OTHER ISSUES WHILE ACCESSING DATA

In many countries, such as Sri Lanka, there is still no large-scale EHR management [40]. For some diseases and medical cases, such as maternal conditions, autism and many mental illnesses, there are still no recorded data sets, only domain knowledge [36] [15] [38]. Researchers who want to build machine learning models in these areas must collect and record the data themselves, so time allocated for research is diverted to collecting and storing data [36] [40]. Researchers are often not from the medical field, so their domain knowledge is limited compared with that of doctors and medical specialists. They may therefore miss important attributes when building a model, because they consider only the data and the variation in its attributes. In conclusion, many developed countries face issues with privacy and with storing patient data, while some developing countries are still working out how to collect and store the data. Such countries have no history of previous data and only current cases.

Data breaches are now a major issue for the health care industry, which stores sensitive patient data including personal details. A total of 40 major health care record breaches were recorded worldwide in 2017 up to October [14]. Breaches also occur where data are shared for secondary purposes such as research. To limit the privacy impact of a breach, health care organizations use randomization and generalization techniques. However, this approach is not impregnable to attacks, such as linkage via residual information to re-identify the individuals to whom the data correspond [24]. Anonymising and sharing patient data is a new trend in the health care industry, but anonymising the data is time-consuming [24]. In addition, researchers cannot predict outcomes or symptoms by region or for a particular place, because the residential data are anonymised.

Recent research has generated data samples to overcome these issues. In recent years, automatically generated data have advanced to the point of creating entirely fake records based on samples of real records. Solutions of this type can resolve the data access issues as well as the data privacy and security issues.

VII GENERATIVE ADVERSARIAL NETWORKS

GANs are neural networks that learn to create synthetic data similar to some known input data. For instance, researchers have generated convincing images from photographs of everything from bedrooms to album covers, and they display a remarkable ability to reflect higher-order semantic logic [2] [1] [37] [19] [11] [7]. The GAN was invented by Ian Goodfellow [16]. It was introduced in 2014, and many GAN variants have since been introduced by researchers for different tasks [23]. The concept is simple: anyone who wants to improve a skill, especially in games, competes with an opponent who is better than they are, analyzes what went wrong and at which point, and then uses that knowledge to improve. In the same way, the generator network in a GAN competes against a highly accurate discriminator, learns from its mistakes and improves its accuracy, until at some point the generator beats the discriminator [16]. The efficiency and accuracy of the generator depend on how powerful the discriminator is, so a GAN always needs a powerful discriminator [23] [4].

A Architecture and Flow

The architecture of a GAN consists of two neural network models: a generator and a discriminator. The task of the generator is to learn to generate things such as images and sound; the other model is the discriminator.
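As a rough illustration of this two-model structure, the following minimal Python sketch sets up a one-dimensional toy problem. The function names, the linear generator, the logistic discriminator, the Gaussian stand-in data and all parameter values are assumptions made for illustration only; none of them is taken from the reviewed papers. The sketch evaluates the cross-entropy loss of a discriminator on one batch of real samples and one batch of generated samples.

```python
import math
import random


def generator(z, a=1.0, b=0.0):
    """Toy generator: maps a noise value z to a synthetic sample (assumed linear)."""
    return a * z + b


def discriminator(x, w, c):
    """Toy discriminator: logistic model returning the estimated P(x is real)."""
    return 1.0 / (1.0 + math.exp(-(w * x + c)))


def discriminator_loss(real, fake, w, c):
    """Cross-entropy loss of the discriminator on real and generated batches."""
    real_term = sum(math.log(discriminator(x, w, c)) for x in real) / len(real)
    fake_term = sum(math.log(1.0 - discriminator(x, w, c)) for x in fake) / len(fake)
    return -0.5 * real_term - 0.5 * fake_term


rng = random.Random(0)  # fixed seed so the sketch is reproducible
real = [rng.gauss(4.0, 1.0) for _ in range(1000)]             # stand-in for real data
fake = [generator(rng.gauss(0.0, 1.0)) for _ in range(1000)]  # generator output from noise

# A discriminator that outputs 0.5 for every input cannot separate the batches.
uninformed = discriminator_loss(real, fake, w=0.0, c=0.0)
# A discriminator with its decision boundary at x = 2 separates them much better.
informed = discriminator_loss(real, fake, w=2.0, c=-4.0)

print(f"uninformed loss = {uninformed:.4f}")
print(f"informed loss   = {informed:.4f}")
```

The uninformed discriminator has a loss of log 2 (about 0.693), and the discriminator whose boundary lies between the two sample clouds scores lower. In an actual GAN both models are neural networks, and their parameters (here w and c for the discriminator, a and b for the generator) are adjusted by backpropagation rather than set by hand.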
The discriminator classifies generated samples as real or fake on the basis of what it has learned. The discriminator model determines whether a given image looks like a real image from the dataset or like an artificially created image. It is basically a binary classifier that takes the form of a normal convolutional neural network. Over the course of many training iterations, the weights and biases in the discriminator and the generator are trained through backpropagation. The discriminator learns to tell "real" data apart from "fake" data created by the generator. Once a generated sample is classified as fake, the generator receives that feedback from the discriminator and generates a new sample, and this continues until the discriminator is fooled into classifying generated samples as real. A sample GAN structure is shown in Fig. 1.

Fig. 1 Generative Adversarial Network Architecture

This section discusses the training and learning process of a GAN. We first describe the optimization of the discriminator D given the generator G. As in the training of sigmoid-based classifiers, training the discriminator involves minimizing the cross entropy [23]. The loss function is

$$ Obj^D(\theta_D, \theta_G) = -\frac{1}{2} E_{x \sim p_{data}(x)}[\log D(x)] - \frac{1}{2} E_{z \sim p_z(z)}[\log(1 - D(G(z)))] \qquad (1) $$

where x is sampled from the real data distribution p_data(x), z is sampled from the prior distribution p_z(z), such as a uniform or Gaussian distribution, and E(·) represents the expectation [23]. The training data for the discriminator come in two parts, one from the real data distribution p_data(x) and the other from the generated data distribution p_g(x), which differs slightly from conventional binary classification. Given the generator G, equation (1) must be minimized to obtain the optimal discriminator. In continuous space, equation (1) can be reformulated as

$$ Obj^D(\theta_D, \theta_G) = -\frac{1}{2} \int_x p_{data}(x) \log D(x) \, dx - \frac{1}{2} \int_z p_z(z) \log(1 - D(G(z))) \, dz = -\frac{1}{2} \int_x \left[ p_{data}(x) \log D(x) + p_g(x) \log(1 - D(x)) \right] dx \qquad (2) $$

For any nonnegative m and n that are not both zero and any y in [0, 1], the expression

$$ -m \log(y) - n \log(1 - y) \qquad (3) $$

achieves its minimum value at y = m/(m + n). Hence, given the generator G, the objective in equation (2) achieves its minimum value at

$$ D_G^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)} \qquad (4) $$

This is the optimal solution for the discriminator D. According to equation (4), the discriminator of a GAN estimates the ratio of two probability densities. D(x) denotes the probability that x comes from the real data rather than from the generated data. If the input comes from the real data, the discriminator tries to make D(x) approach 1; if the input comes from the generated data G(z), the discriminator tries to make D(G(z)) approach 0, while the generator G tries to make it approach 1 [16]. Since this is a minimax game between G and D, the loss function of G is Obj^G(θ_G) = −Obj^D(θ_D, θ_G). Therefore, the optimization of GAN can be formulated as a minimax problem:

$$ \min_G \max_D V(D, G) = E_{x \sim p_{data}(x)}[\log D(x)] + E_{z \sim p_z(z)}[\log(1 - D(G(z)))] $$

In summary, during the learning process of a GAN the discriminator D must be trained to discriminate input data as accurately as possible, that is, to decide with high accuracy whether the input is real or generated. The generator G is then trained to minimize log(1 − D(G(z))) [23] [16]. Normally D is trained first while G is not trained; once the training of D is complete, D is fixed and G is trained to reduce the discrimination accuracy of D. As this process continues, the model reaches the global optimal solution if and only if p_data = p_g [23] [16]. At that point the discriminator is fooled by the generator and classifies generated data as real data.

VIII RELATED WORK ON GENERATING DATA SETS USING GAN

One study generated electronic health records for secondary purposes [11]. The access to and availability of electronic health records have motivated advances in bioinformatics research. However, EHR systems still do not automatically provide easy access to the data for research. The main reason for the limited access is that the data contain personal details, so health care organizations place high security on their data because the data can easily be misused or breached.

As a solution, the authors used a GAN to create a new approach called medGAN (medical Generative Adversarial Network) [11]. medGAN generates realistic synthetic patient records; it is trained on real records to create fake records. They used an autoencoder together with a GAN to generate high-dimensional discrete variables, from binary to count values [11]. They also proposed mini-batch averaging to effectively
avoid mode collapse. This, together with batch normalization, improves the efficiency of learning [11].

They used a GAN along with an autoencoder to generate the data sets. The model was trained on real-world data sets, from which it tries to produce similar generated sets: the generator network creates data samples and the discriminator tells whether each sample is real or fake. If a sample is judged fake, the generator creates another sample using the feedback from the discriminator. For the experiments they used three large electronic health record sets: a proprietary dataset from a private health organization, which consists of 10 years of longitudinal medical records of 258K patients; the MIMIC-III dataset (Johnson et al., 2016; Goldberger et al., 2000), which is a publicly available dataset consisting of the medical records of 46K intensive care unit (ICU) patients over 11 years old; and a proprietary dataset for a heart failure study from a private health organization, which consists of an 18-month observation period of 30K patients [11].

They carried out a variety of experiments and evaluations to check the quality and performance of the system, and found that mini-batch averaging increased the performance of medGAN [11].

Independent sampling also naturally shows great performance. medGAN seems to capture the dimension-wise distribution relatively well, with a specific weakness in handling codes with low probability. To evaluate medGAN they used two EHR datasets. They then evaluated the generated data through reviews by medical experts: they randomly drew 50 records from the real data and another 50 from the generated data, shuffled them and presented them to the experts, who scored each record from 1 to 10 according to how realistic it appeared. The results are given in Fig. 2 [11].

Fig. 2 Score distribution for real and fake datasets [11].

The findings suggest that medGAN's synthetic data are generally indistinguishable to a human doctor except for several outliers. The fake records identified by the doctor mainly lacked appropriate medication codes [11]. This can also happen in real-world scenarios when data are missing or not recorded.

medGAN uses a GAN to generate multi-label discrete electronic health records (EHR) that resemble real-world records. Through rigorous evaluation using real datasets, medGAN showed impressive results [11]. In the medical field there is no efficient way for researchers to access real data. Considering that, medGAN will benefit the healthcare industry and make a good contribution to the bioinformatics field in terms of data sets. As future directions, the authors plan to explore a sequential version of medGAN and to include other modalities such as lab measures, patient demographics, and free-text medical notes [11].

Images are now widely used in bioinformatics for prediction and analysis, and the availability of large amounts of annotated data is becoming increasingly critical [33]. However, annotated medical data are often scarce and costly to obtain, while recent deep learning methods and deep networks require large amounts of data for training. Wide availability of such data may allow researchers to develop and validate more sophisticated computational techniques. As a solution to this problem, researchers introduced a GAN model to generate synthetic retinal color images [33]. It is an image-simulation-based approach, with which researchers can create as many synthetic retinal images as they want.

The researchers proposed an adversarial autoencoder for the task of retinal vessel network synthesis. They used the generated vessel trees as an intermediate stage for the generation of color retinal images, which is accomplished with a GAN. To train the model they used the Messidor-1 dataset together with automatic retinal vessel segmentations [33]. The vessel segmentation model achieved a 0.9755 AUC on the DRIVE test set, a result aligned with state-of-the-art methods for retinal vessel segmentation. Only images from Messidor-1 with grades 0, 1 and 2 were used in this work, reducing the number of example pairs to 946. This dataset was randomly divided into training (614 pairs), validation (155 pairs) and test (177 pairs) sets, which were downscaled to 256 × 256 before training the model [33]. This preprocessing was done to avoid the poor generalization of their vessel segmentation technique, which produced incorrect segmentations for images in a later stage of diabetic retinopathy [33]. To evaluate the quality and quantity of the images they used several different methods.

In conclusion, the researchers developed a GAN-based model to generate synthetic retinal vessel images. Although it produced some good-quality images, the knowledge of the GAN is limited, because it was trained on only 614 images, all retrieved from one database [33]. As a future extension, the researchers plan to train the model on large-scale data sets from different databases. The size of the synthetic images (256 × 256) is far from the resolution of images produced by current retinal fundus image acquisition systems. These drawbacks could be addressed with larger amounts of data.

As mentioned previously, generated data are in demand among researchers. To contribute to this, another study used a two-stage pipeline for generating synthetic medical images. To test it, the authors generated
synthetic retinal images. In their work they used a dual GAN to generate images [19].

In their two-stage pipeline, the first stage produces segmentation masks that represent the variable geometries of the dataset, and the second stage translates the masks produced in Stage I into photorealistic images. For training and testing the Stage-I GAN they used the DRIVE database, which contains forty pairs of retinal fundus images and segmentation masks extracted manually by two experts [19]. For Stage II they used segmentation masks derived from a CNN segmentation network on the MESSIDOR database. The Stage-I GAN generates variable segmentation masks. It is based on the deep convolutional generative adversarial network (DCGAN) architecture and built on the TensorFlow platform [19]. Stage II is also built on TensorFlow. To improve the realism of the images they used a u-net, which generates a segmentation mask given a photorealistic medical image. The u-net architecture, specifically formulated for biomedical image segmentation, is derived from an autoencoder architecture that relies on unsupervised learning for dimensionality reduction.

They evaluated the u-net on test images from the DRIVE database and compared the results with the ground truth to calculate an F1 score. They also calculated the variance between the 4 synthetic and real datasets through a Kullback–Leibler (KL) divergence score [19]. They obtained an F1 score of 0.8877 for synthetic data and 0.8988 on the DRIVE dataset. When testing variance, they obtained a KL divergence score of 4.752 [19].

In conclusion, they used dual GAN models to generate medical images because of the extreme complexity of medical images. The pipeline is able to capture simple features such as general color, shape and lighting, but it still needs more varied and accurate real image data to generate more realistic images.

Another similar study used a GAN to synthesize medical images, again to address the lack of data in bioinformatics. To overcome this issue the authors proposed a new GAN variant which, in addition to synthesized medical images, also generates segmentation masks for use in supervised medical image analysis applications.

To get the maximum benefit from generated images in supervised algorithms or deep learning tasks, a ground truth solution is necessary for any given input image. The researchers therefore used a modified GAN to generate segmentation masks as well as images [41]. This makes it easier for the discriminator to decide whether an image is real or synthetic, which improves the learning of both the generator and the discriminator networks. They used a TensorFlow implementation of the DCGAN architecture [41], modified it to include support for the generation of segmentation masks and increased the image resolution to 128 × 128. To evaluate the model they used 3-fold cross-validation on the SCR Lung Database [41]. For the quantitative evaluation, they chose to perform image segmentation using the U-Net fully convolutional network architecture. They used the Nesterov optimizer at a learning rate of 0.00001 for the segmentation task, with a momentum of 0.99 and a weight decay of 0.0005 [4-].

They tested the model using real data only, real and synthetic data combined, and synthetic data only. To evaluate the segmentation results, they used the Dice coefficient and Hausdorff distance metrics [41]. The results are shown in Table 1 for the full training set and in Table 2 for the reduced training set.

TABLE 1
Full training set results.

TABLE 2
Reduced training set results.

In conclusion, the researchers reported that with image segmentation the generated lung images show small details and noise more accurately than before [41], details that did not appear when using a GAN alone. They also found that if the GAN training time is too short, the generated images are not usable for later supervised training. Finding a suitable stopping point for GAN training is still a hot topic of current research, as a lower GAN loss during training typically does not indicate higher quality of the generated images. Overall, the study showed that using a GAN together with image segmentation can generate more accurate image data sets.

A study conducted in Switzerland created a tool that uses GANs to generate medical data sets, building on the remarkable success of GANs in generating data. To evaluate the system, the authors generated medical time series data for the intensive care unit [39]. In ICUs, doctors have to make snap decisions under time pressure, where they cannot afford to hesitate. It is already standard in medical training to use simulations to train doctors, but these simulations often rely on hand-engineered rules and physical props. Thus, a model capable of generating diverse and realistic ICU situations could have an immediate application, especially when given the ability to condition on underlying ‘states’ of the patient. As a solution they proposed a Recurrent GAN (RGAN) and Recurrent Conditional GAN (RCGAN) to produce realistic
real-valued multi-dimensional time series, with an emphasis
on their application to medical data [39].

The model presented in that work follows the architecture of
a regular GAN, where both the generator and the
discriminator have been substituted by recurrent neural
networks. The authors therefore present a Recurrent GAN
(RGAN), which can generate sequences of real-valued data,
and a Recurrent Conditional GAN (RCGAN), which can
generate sequences of real-valued data subject to some
conditional inputs [39]. They evaluated their work on toy
data sets. They trained a model on generated data and tested
it on real data (train on synthetic, test on real; TSTR)
[39], and then reversed the procedure, training on real data
and testing on synthetic data (TRTS).

To generate ICU data, they used the recently released
Philips eICU database. It contains around 200,000 patients
from 208 care units across the US, with a total of
224,026,866 entries divided into 33 tables. For the study
they used only four main variables: oxygen saturation
measured by pulse oximeter (SpO2), heart rate (HR),
respiratory rate (RR) and mean arterial pressure (MAP).
After preprocessing the data, they ended up with a cohort of
17,693 patients [39].

Suppose that the model has overfit, and most points in latent
space map to training examples. Then the generated data will
be similar or identical to the training data. If instead the
model has underfit, the generated data will differ
substantially from the training data in distribution. To
check for this, they compared the distributions of
reconstruction errors and compared the generated samples
[39].

In conclusion, they created a tool that generates data sets
using an RGAN and evaluated the model by generating time
series data sets for the ICU. The major finding is that, by
generating labelled training data (conditioning on the
labels and generating the corresponding samples), anyone can
evaluate the quality of the model using the ‘TSTR
technique’, in which a model is trained on the synthetic
data and evaluated on real data. They additionally
illustrated that such a synthetic dataset does not pose a
major privacy concern or constitute a data leak for the
original sensitive training data.

The primary goal of the next study is to generate simulated
data, which is useful when only limited data is available.
The authors took smoking cessation as a case study: testing
a peer recruitment strategy to increase access to a smoking
cessation resource (a technology-assisted tobacco
intervention) [42]. As noted, a key challenge is that data
is limited in health interventions. Identifying ideal
recruiters is therefore a classical classification problem,
so the authors augmented the data for this case study to
improve the efficiency of the classification algorithms.

They introduce two techniques to augment the data sets. The
first changes the values of a very small set of ‘fields’ at
random; the second augments the data using a GAN [42]. They
also proposed a few analysis methods for classification
problems that improve the accuracy of a well-sought class:
the first removes border samples between two categories, the
second reduces dimensionality through feature selection, and
the third sacrifices the accuracy of less valuable classes.
To train the model they used the Share2Quit dataset [42].

There are many existing methods for dealing with this
problem, but their performance needs to be significantly
improved in practice. The results show that applying each of
these analysis methods improves classification accuracy for
the well-sought class, and that GANs can generate large
amounts of simulated data. The authors carried out several
evaluations, checking the accuracy before and after feature
selection (dropping one feature at a time), excluding border
samples, including data augmentation, and applying the
sacrificial boundary. They obtained around 87% accuracy
after including the segmented data and high accuracy after
applying the sacrificial boundary [42]. However, because
their training data was imbalanced, accuracy for some
classes was poor.

Fig. 3 Accuracy level based on the size of data [42].

The authors thus presented machine learning methods
(removing noise, reducing the number of dimensions,
generating simulated data, and sacrificing the accuracy of
unimportant classes) to improve the accuracy of the
important class [42]. They also obtained good accuracy on
the simulated data, as shown in Fig. 3. In future work they
plan to improve the quality of the augmented data.

IX PREVIOUS WORKS TO GENERATE THINGS USING GAN

Following the remarkable success of GANs, they are widely
used in many industries to generate images, text, music and
more. This section discusses the use of GANs in other
industries for generation tasks. One study showed that GANs
can generate open-domain dialogue [18].
The authors proposed a solution using a GAN to produce
sequences that are indistinguishable from human-generated
dialogue utterances. The aim of the research is to generate
meaningful and coherent dialogue responses given the
dialogue history. They used the idea of adversarial
evaluation, training a discriminant function to separate
generated from true sentences, in an attempt to evaluate the
model’s sentence generation capability. They reported good
accuracy levels and used human judges to validate the
results [18].

Fig. 4 Generated bird images using sentences [37].
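
The adversarial evaluation idea used in the dialogue study [18] can be illustrated with a minimal sketch. It assumes that sentences have already been mapped to fixed-length feature vectors (random toy vectors stand in for them here), and it swaps in a plain logistic-regression discriminator trained by gradient descent, which is not the discriminator used in [18]. The names `train_discriminator` and `adversarial_accuracy` are hypothetical. Held-out accuracy near 0.5 suggests the discriminator cannot separate generated from real samples, while accuracy near 1.0 suggests the generated samples are easy to spot.

```python
import numpy as np


def train_discriminator(x, y, lr=0.1, steps=500):
    """Fit a logistic-regression discriminator with plain gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted P(sample is real)
        grad = p - y                             # gradient of log-loss w.r.t. logits
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def adversarial_accuracy(real, fake, seed=0):
    """Held-out accuracy of a discriminator separating real (1) from fake (0)."""
    rng = np.random.default_rng(seed)
    x = np.vstack([real, fake])
    y = np.concatenate([np.ones(len(real)), np.zeros(len(fake))])
    order = rng.permutation(len(y))
    x, y = x[order], y[order]
    split = len(y) // 2                          # first half trains, second half tests
    w, b = train_discriminator(x[:split], y[:split])
    pred = (x[split:] @ w + b) > 0.0
    return float((pred == y[split:]).mean())


# Toy stand-ins for sentence features (assumption: 8-dimensional vectors).
rng = np.random.default_rng(42)
real = rng.normal(0.0, 1.0, size=(400, 8))
good_fake = rng.normal(0.0, 1.0, size=(400, 8))  # same distribution as real
bad_fake = rng.normal(3.0, 1.0, size=(400, 8))   # clearly shifted away from real

acc_good = adversarial_accuracy(real, good_fake)
acc_bad = adversarial_accuracy(real, bad_fake)
print(f"generator matching the real distribution: {acc_good:.2f}")
print(f"generator far from the real distribution: {acc_bad:.2f}")
```

A discriminator score of this kind is only as informative as the discriminator is strong, which is one reason the study also relied on human judges.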
Another study sought to improve the training of GANs for
image synthesis [6], since GANs are especially popular in
image generation [7] [37] [12] [33] [19]. Recent work has
shown that GANs can produce convincing image samples on
datasets with low variability and low resolution [9] [34].
However, GANs struggle to generate globally coherent,
high-resolution samples, particularly from datasets with
high variability. The authors construct a variant of GANs
employing label conditioning that results in 128 × 128
resolution image samples exhibiting global coherence. They
also expand on previous work for image quality assessment
to provide two new analyses for assessing the
discriminability and diversity of samples from
class-conditional image synthesis models [6].

In another study, researchers tried to generate high-quality
samples of natural images. Building a model to generate
high-resolution natural images has been a major problem in
computer vision [12], and recent advances in GANs have
opened a new path toward generating images that resemble
real ones. Their approach is to use a cascade of
convolutional networks within a Laplacian pyramid framework
to generate images in a coarse-to-fine fashion [12]. At each
level of the pyramid, a separate generative convnet model is
trained using the GAN approach. In a quantitative assessment
by human evaluators, their samples were mistaken for real
images around 40% of the time, compared to 10% for samples
drawn from a GAN baseline model [12]. They used 15
volunteers to evaluate the quality of the images generated
by the model, and found that the images generated by their
model were far better than those of the standard GAN model.

Another interesting and useful study generates images from
text, a task for which conventional AI systems and
algorithms are still far from the goal [37]. To address
this, the authors develop a novel deep architecture and GAN
formulation to effectively bridge these advances in text and
image modeling, translating visual concepts from characters
to pixels. Their main contribution is a simple and
effective GAN architecture and training strategy that
enables compelling text-to-image synthesis of bird and
flower images from human-written descriptions [37]. Fig. 4
shows bird images generated by interpolating between two
sentences (within a row the noise is fixed). In future work,
they aim to scale up the model to higher-resolution images
and add more types of text [37].

GANs are also well known for generating text data in
handwritten form, and there is existing work on generating
handwritten numbers [5] [16]. Humans learn handwriting as a
skill from a very early age. The study in [5] deals with the
problem of an intelligent system learning the handwriting of
an entity using a GAN. The authors start by training the
model to generate letters of the alphabet, then single
words, and then lists of words, and propose a modified DCGAN
architecture for the task [5]. To achieve faster learning
they also used a reinforcement method. An early
implementation of their algorithm shows good performance on
the MNIST dataset [5]. During evaluation they also tried to
generate ASCII characters and the digits 0-9, and the wide
variation in handwriting style produced a wide range of
differently styled images of the same digit [5]. They hope
the model will give new insights in this area; its uses
include identification of forged documents, signature
verification, computer-generated art and digitization of
documents, among others.

Another study generates language using GANs. Training GANs
for language generation has proven to be more difficult,
because of the non-differentiable nature of generating text
with recurrent neural networks [31]. The authors show that
recurrent neural networks can be trained from scratch to
generate text with GANs using curriculum learning, by slowly
teaching the model to generate sequences of increasing and
variable length. They evaluate by generating 640 sequences
from each model and measuring %-IN-TEST-n, that is, the
proportion of word n-grams from generated sequences that
also appear in a held-out test set [31]. They found that
training the generator for 50 iterations every 10 training
iterations of the discriminator resulted in superior
performance [31]. They report that their approach vastly
improves the quality of generated sequences compared to a
convolutional baseline.

X ADVANTAGES AND LIMITATIONS OF GAN

As a new framework, GAN comes with advantages as well as
disadvantages. One major advantage is that the GAN
framework works with any type of neural network: it can be
implemented using conventional neural networks, recurrent
neural networks or even advanced deep learning networks.
Another advantage is that Markov chains are
never needed [16] [4] [23] [3]; only conventional
backpropagation is used to obtain gradients, and no
inference is needed during training [16]. In terms of actual
results, GANs produce better generated samples than other
methods [16]. There is no need to design the model to obey
any kind of factorization: any generator net and any
discriminator net will work. Compared to PixelRNN, the
runtime to generate a sample is smaller, because GANs
produce a sample in one shot, while PixelRNNs need to
produce a sample one pixel at a time [16] [3]. GANs perform
better than PixelRNN especially in generating HD data, and
they do not limit the generation dimension, which widens the
scope of the data samples that can be generated.

The generation process of GANs does not require a tedious
sampling sequence; new samples can be drawn directly, which
improves the efficiency of generating new samples [16]. In
practice, the samples generated by GANs are easy for humans
to understand. For example, GANs can generate very sharp and
realistic images [2] [37] [23]. Not only have GANs made
great contributions to the development of generative
models, but they are also meaningful and instructive for
semi-supervised learning [3].

XI CONCLUSION

This paper analyzed and reviewed bioinformatics data, the
legal and ethical issues in accessing EHR data, and other
security issues such as data breaches and misuse. It
explained the current methods for preventing data breaches
and the struggles researchers face when accessing medical
data for research purposes. It then surveyed the state of
the art of GANs. This model has been used in many generative
tasks involving images, text, sound and data, and has been
well received by scientists and researchers. The basic
concept of a GAN is a min-max game between a generator
network and a discriminator network. A key strength of this
model is that it can be built using any type of neural
network.

In addition, this paper reviewed research on generating
bioinformatics data for secondary purposes such as research,
covering both image data generation and EHR data generation.
It also reviewed the use of GANs in other industries,
analyzed the advantages and disadvantages of GANs, and
discussed future research directions. The researcher
believes that this review will provide valuable insights and
serve as a starting point for researchers who wish to apply
GANs to generate data sets for their bioinformatics
research.

GANs have solved many problems in generative modelling and
brought something new to the AI field, but they still have
limitations. GANs adopt the adversarial learning idea, but
convergence of the model and existence of equilibrium
