Algal Research 55 (2021) 102256

Contents lists available at ScienceDirect

Algal Research
journal homepage: www.elsevier.com/locate/algal

Microalgae classification based on machine learning techniques


P. Otálora a, J.L. Guzmán a,*, F.G. Acién b, M. Berenguel a, A. Reul c

a Department of Informatics, University of Almería, CIESOL, ceiA3, 04120 Almería, Spain
b Department of Chemical Engineering, University of Almería, CIESOL, 04120 Almería, Spain
c Department of Ecology and Geology, University of Málaga, Campus de Teatinos, 29071 Málaga, Spain

ARTICLE INFO

Keywords:
Artificial neural network
Deep learning
Image analysis
Microalgae classification
FlowCAM

ABSTRACT

In this paper, two models for the classification of microalgae species based on artificial neural networks have been developed and validated. The models work in combination with FlowCAM, a device capable of capturing each of the particles detected in a sample and obtaining a set of descriptive features for each one. One of the models uses these feature variables as input, while the other makes use of the captured images, both being able to distinguish between two well-known species of microalgae, Scenedesmus almeriensis and Chlorella vulgaris, and to calculate the proportion of each of these in the analyzed mixture. The models were trained with pure samples of each species and validated using mixed combinations of them. The results confirm the potential of image analysis and deep learning techniques for the identification of microalgae cultures, as well as the higher accuracy of the feature-based model, thus extending the range of classification approaches in this field.

1. Introduction

These days, society is facing several issues related to pollution and the sustainability of the current lifestyle. The availability of clean water, the emission of greenhouse gases and energy production are three problems that will grow with the population increase in the upcoming years. In light of this scenario, the investigation of alternatives to the classic solutions to these problems becomes mandatory, in order to find more sustainable and efficient methods.

In this context, the production of microalgae at an industrial scale appears to be a fairly interesting process. Microalgae are photosynthetic microorganisms that can grow and reproduce in a wide variety of environments, without the need for fertile soil or clean water, being able to grow even in polluted water [1]. Microalgae production is presented as a suitable option for the abovementioned problems, since among its applications lies the production of biofuels [2], and the process itself is also useful for wastewater treatment or the mitigation of CO2 emissions from other industrial activities [3]. Furthermore, microalgae have a very beneficial composition for the production of high-value products, such as cosmetics, chemicals, human nutrition or animal feed [4,5].

Despite its multiple applications, it is a very complex process due to its pronounced biological nature. Its productivity is determined by a large number of variables related to the conditions of the culture and the reactor design, and not all of them are controllable [4]. In order to make the production process really competitive, it is essential to maximize productivity, and in order to do so, it is crucial to correctly characterize the process [6].

A key aspect is the species of microalgae that is produced. Although there are more than 30,000 known species of microalgae, fewer than a hundred have been studied, and among these fewer than 20 are commercially exploited [7]. Some examples are Chlorella, Spirulina, Dunaliella or Nannochloropsis. Depending on the application for which the process is intended, the use of one species or another, or even of mixed cultures, will be more advantageous. Therefore, a very important step in optimizing the process is monitoring the culture composition [8].

Typically, this monitoring is performed by analyzing a sample with an optical microscope. This is possible due to the morphological differences between the different species of microalgae [9]. However, it requires the participation of an expert capable of appreciating these differences, while also being a complex and time-consuming activity [10].

In previous works, different culture characterization methods based on absorption spectroscopy or flow cytometry have been developed. The methods based on absorption spectroscopy use spectrophotometers to determine the pigment compositions of the samples, but they are expensive [11]. Flow cytometry methods allow, in addition to

* Corresponding author.
E-mail addresses: p.otalora@ual.es (P. Otálora), joseluis.guzman@ual.es (J.L. Guzmán), facien@ual.es (F.G. Acién), beren@ual.es (M. Berenguel), areul@uma.es
(A. Reul).

https://doi.org/10.1016/j.algal.2021.102256
Received 23 October 2020; Received in revised form 20 January 2021; Accepted 20 February 2021
Available online 10 March 2021
2211-9264/© 2021 Elsevier B.V. All rights reserved.

determining the composition, the cell count and the analysis of the phytoplankton size structure [12]. Another approach is optical image analysis. This method allows the individual classification of microalgae cells based on their morphology [8]. However, it requires advanced image capture and processing techniques.

In the last few years, machine learning techniques, and particularly artificial neural networks (ANNs), have experienced an increase in popularity thanks to the growing computational capacity and the large volume of data that can be generated and managed nowadays [13]. These techniques have the ability to extract models based solely on data, requiring less in-depth knowledge of the system and being able to adapt properly to different circumstances. These models can make use of different types of data, such as categories, numerical values or images, depending on the problem to be solved. These techniques can also be combined with particle analysis tools to obtain data that the ANN will use as input.

Regarding the characterization of microalgae cultures, a neural network model based on light absorption data was proposed in [14] to distinguish between 4 species of microalgae. A convolutional neural network in combination with FlowCAM is proposed in [15] to classify 19 different species. A hybrid method between image processing and ANN is presented in [16] for the identification and classification of 6 species. A model based on deep learning techniques for the detection and classification of 5 different species is shown in [17]. The work in [18] proposed two models to identify, classify and estimate the growth of microalgae using support vector machine and random forest techniques. An image-based neural network model for the classification of 16 different species was developed in [19], capable of differentiating unique images of each cell type with great precision. Several models with machine learning techniques were proposed in [20] in order to differentiate living and dead Chlorella vulgaris cells.

This paper develops two artificial neural networks to characterize cultures composed of Chlorella vulgaris and Scenedesmus almeriensis, which are among the most commercial strains. The first of these ANNs uses input data referring to the morphology of each one of the cells in the sample, while the second makes use of the images of these cells, both of them providing as output the composition of each one of the species in the analyzed sample. All the data have been obtained with the FlowCAM tool [21].

The main contribution of the work is a simple methodology that can be easily generalized to other data acquisition devices. The fact of having developed two ANNs provides great flexibility and adaptability regarding the data to be used. The high volume of data used provides robustness and reliability to the models, improving their learning and verifying their capacity of generalization. The validation of the models with mixed samples is a very representative measure of their effectiveness. In addition, the application of detection thresholds allows the identification of particles that do not correspond to microalgae cells.

The structure of the paper is the following. In Section 2, the techniques used are presented, as well as the methodology that has been followed. Section 3 describes the tests performed to obtain data, the treatment of these data, and the training process of the ANNs and their validation. Finally, conclusions are given in Section 4.

2. Material and methods

This section presents the tools used to obtain and process the data, as well as to develop the proposed models. As previously mentioned, the FlowCAM tool will be used to obtain the images and the features of each cell, which will be used as inputs by the ANNs.

2.1. Microalgae species

The microalgae species used in the trials to develop the networks were Chlorella vulgaris and Scenedesmus almeriensis. Both are freshwater strains, characterized by their high growth rate and tolerance to wide ranges of culture conditions such as pH, temperature and dissolved oxygen. Thus, these strains are typical of open reactors, including those coupled with wastewater treatment. Moreover, Chlorella vulgaris is one of the microalgae strains most produced worldwide for food-related applications, whereas Scenedesmus almeriensis has been reported as a potential source of lutein, also for human-related applications [22,23].

2.2. FlowCAM tool

FlowCAM is a device that analyzes particles or cells in a moving fluid [21]. This tool takes photos, detects and counts the particles in them, and then uses this information to extract a series of descriptive variables for each particle. The tool combines flow cytometry, microscopy and fluorescence detection techniques. It is widely used in the field of biotechnology, making it a source of easily acquired input data.

FlowCAM works as follows. A pump drives the sample into the flow chamber. In this chamber, the device constantly captures images of the moving fluid with a period defined by the user. From these images, the software detects, segments and processes each group of pixels representing a particle, thus obtaining the image of each cell and a series of representative variables.

Each of these samples was first passed through a 20 μm mesh and analyzed in autoimage mode. The device was configured with a 20× lens, a cell size of 50 μm and a flow chamber of 20 μm depth. A stop condition of 50,000 images was also imposed. Once set, the pump is activated and the image is correctly focused. From there, the device automatically detects and captures 50,000 images. For each of the samples mentioned, 3 tests are performed as described, resulting in 150,000 images per sample.

FlowCAM provides as outputs images such as those shown in Fig. 1 in .tif format. As can be seen, it detects microalgae cells correctly, but has the drawback that it can also detect undesirable particles, such as flocs or air bubbles entering the pump. Another phenomenon that occurs, and that can be detrimental to the performance of the network, is the agglomeration of multiple cells in the same image. The data sheet provided by the software already takes this phenomenon into account in a column called “ParticlesPerChain”, but the fact that these agglomerations can occur between cells of different species, or even between cells of one species and undesired bodies, implies the need to apply filtering techniques afterwards. Besides these images, FlowCAM provides the features shown in Table 1, as well as some others related to the date and time of image capture. It is a tool capable of providing a large amount of information in a user-friendly way, so it is appropriate for the problem faced.

2.3. Artificial neural networks

Artificial neural networks (ANNs) are part of the group of algorithms known as “machine learning”. These are characterized by being developed only from data, being able to perform a given task without being explicitly programmed for it [24]. Despite the fact that the computational cost of these algorithms is usually high, the growth of computer capacity in recent years has increased their popularity [13].

ANNs take their name from their similarity to natural neural networks. They are composed of a set of nodes divided into layers, where each node has a series of inputs and outputs. The nodes of a given layer receive as input the outputs of the nodes from the previous layer and provide as output a function of these. This output is then used as input for the nodes of the next layer, and so on until the last layer is reached [25].

A satisfactory ANN requires three fundamental parts to operate. First, it is essential to have sufficient and adequate data for the training of the network and its validation. The second fundamental component is the structure of the network. The layers, their type and their specific size are selected depending on the type of problem to be solved, the type of inputs to be used, the number of data available, or the complexity of the model to be developed. Finally, the last element to


Fig. 1. Sample images taken by FlowCAM of each species.
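The per-particle crops shown in Fig. 1 come from FlowCAM's thresholding and segmentation of each captured frame. The following is an illustrative NumPy sketch of that kind of processing, not the FlowCAM or MATLAB implementation; it assumes particles appear darker than the background and uses a simple flood fill to group touching pixels.

```python
import numpy as np

def segment_particles(gray, threshold=0.5):
    """Split a grayscale frame into per-particle crops: keep pixels darker
    than the threshold (darker-than-background is an assumption), group
    touching pixels into connected components, and crop each bounding box."""
    mask = gray < threshold
    labels = np.zeros(mask.shape, dtype=int)
    crops, current = [], 0
    for i, j in zip(*np.nonzero(mask)):
        if labels[i, j]:
            continue                     # pixel already assigned to a particle
        current += 1
        stack, pixels = [(i, j)], []
        while stack:                     # flood fill one connected component
            y, x = stack.pop()
            if not (0 <= y < mask.shape[0] and 0 <= x < mask.shape[1]):
                continue
            if not mask[y, x] or labels[y, x]:
                continue
            labels[y, x] = current
            pixels.append((y, x))
            stack += [(y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)]
        ys, xs = zip(*pixels)
        crops.append(gray[min(ys):max(ys) + 1, min(xs):max(xs) + 1])
    return crops

# Tiny synthetic frame: two dark blobs on a light background.
frame = np.ones((8, 8))
frame[1:3, 1:3] = 0.1
frame[5:7, 4:8] = 0.2
crops = segment_particles(frame)
print(len(crops))                 # 2 separate particles
print([c.shape for c in crops])   # bounding-box sizes of each crop
```

In a real pipeline each crop would then be resized to a fixed input size before being fed to a network.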

develop the model is the training process. This is defined by a series of options such as its duration, the calculation frequency of the parameters and the stop condition, but also by which data will be used to train the network and which will not [26].

In this work, ANNs will be developed for a classification problem, comprising the abovementioned three parts as explained in Section 3. The aim of this problem is that, given a series of entries, the ANN is able to distinguish to which “class” these entries belong. Thus, for each series of entries the network returns a value between 0 and 1 for each class. The closer this value is to 1, the more likely it is that the classified entries belong to that class. The sum of all outputs will always be equal to 1.

The ANNs and the data processing of this work have been developed using the “Deep Learning Toolbox” [26] and the “Image Processing Toolbox” [27] in the MATLAB environment, which are described in the following section.

2.4. Deep learning and image processing toolboxes

MATLAB’s “Deep Learning Toolbox” has been used for the training and validation of both ANNs. This toolbox allows the design of ANN structures layer by layer, establishing the training options, carrying out that training while providing relevant information about it, and the subsequent use of the trained ANNs for validation or prediction purposes. It provides the user with an intuitive and powerful tool for the creation of models based on deep learning techniques for different types of problems [26].

The “Image Processing Toolbox” has a different functionality. It provides the user with a large number of image processing, analysis and display techniques, and can perform image segmentation, filtering or batch processing tasks. This tool has been used to process all the data used in the image-based ANN [27].

3. Results and discussion

This section presents the tests performed and the complete process followed to obtain the ANN models. In addition, the results achieved in the training and validation processes are shown and discussed.

3.1. Performed tests

Data acquisition for ANN training and testing was performed with the FlowCAM device. Two pure samples of Chlorella vulgaris and Scenedesmus almeriensis were available. These samples were divided and mixed with each other, in order to obtain different samples with different proportions of each species. Therefore, the samples analyzed by the tool presented proportions from 1% to 99%, in addition to the pure samples of each species.

The proportions were selected to represent very diverse samples, with cases of balance between the two species, cases with one species dominant over the other, and extreme cases, such as 99/1, where the difference between concentrations is very significant. In this way it can be checked whether the models developed are capable of correctly identifying samples of all types.

In order to have a powerful and simple methodology, the idea is to use only the pure samples for the ANN training process, since those are the only ones in which it is known with absolute certainty to which species of microalgae each particle corresponds. The rest of the combinations have been used to test the ANNs’ interpolation capabilities.

Notice that sometimes air is introduced into the pump, producing bubbles that can alter the results. The FlowCAM software has a post-processing tool that allows easily removing undesired images. These images can be selected by size or by similarity with other images. In this way, the most erratic images were removed, although this filtering could also have been done later. This process is relatively easy, but it requires human intervention and cannot be automated. In order to minimize time-consuming intervention, only the bubbles that were easy to detect were removed. This intervention is justified by the fact that a bubble is not part of the sample itself but an artefact of data acquisition. In fact, the bubble removal should only have been necessary in the training samples. This makes the final number of images in each sample slightly lower than 150,000.

3.2. Feature-based ANN

The first neural network to be developed uses as inputs the features provided by FlowCAM for each of the detected particles (see Table 1). As data acquisition was carried out in autoimage mode, fluorescence data were not available in this study. The acquisition of the specific fluorescence per cell requires the presence of a single cell in the flow chamber when acquiring the image and data of each cell. If several cells are in the flow chamber, a mean fluorescence is associated to all of them. Thus, fluorescence values per cell would require a very long acquisition time. The autoimage mode, in contrast, allows the acquisition of all data except fluorescence for 1–100 particles per frame. Therefore, in this study the autoimage mode was used and colour ratios were considered as a proxy of cell pigmentation.

3.2.1. Data processing

As mentioned previously, the ANN is able to learn directly from data. For that reason, it becomes vitally important to guarantee the quality of the available data. Using erratic or inappropriate data can lead to a model that does not accurately represent reality. Therefore, before training the network it is necessary to perform data processing.

For this ANN, the data processing involved the selection of


input variables, the creation of output variables and the filtering of outliers. The first data to be processed were those for the training process. These are the ones obtained from the pure Chlorella vulgaris and Scenedesmus almeriensis samples, as they are the only ones where it is known with certainty to which species each cell belongs. Using the data sheets that FlowCAM provided after the tests, an input dataset was created, consisting of 289,708 rows, corresponding to each cell detected, and 54 columns, corresponding to each of the properties of each cell. In addition, two more columns were added, one of which takes the value 0 for the rows corresponding to Chlorella vulgaris samples and 1 for Scenedesmus almeriensis samples, while the other column always takes the opposite value: 0 for Scenedesmus almeriensis and 1 for Chlorella vulgaris. Consequently, the initial input dataset has a size of 289708 × 56.

However, the use of every cell property as an input is unnecessary and may even have a negative effect on the performance of the ANN. Thus, the correlation between each variable and one of the outputs was analyzed, in order to select the variables to be used. Table 1 presents the correlation between each of the inputs and the “Chlorella vulgaris” output.

Table 1
Correlation index between each feature and the “Chlorella vulgaris” output [21].

Property             Correlation   Description
‘Area_ABD_’          −0.5685       Number of pixels converted to a measure of area
‘AspectRatio’        0.5054        Width/length
‘AverageBlue’        0.5587        Average pixel value of the blue color plane
‘AverageGreen’       0.2451        Average pixel value of the green color plane
‘AverageRed’         0.2837        Average pixel value of the red color plane
‘CalibrationFactor’  –             Conversion factor from pixels to microns
‘CalibrationImage’   –             Number of the calibration image used in processing the particle
‘Camera’             –             Camera number
‘CaptureX’           0.0311        Leftmost X coordinate of the particle in the original image
‘CaptureY’           −0.0015       Top Y coordinate of the particle in the original image
‘Ch1Area’            –             Area of fluorescence peak of photomultiplier tube 1
‘Ch1Peak’            –             Peak fluorescence value from photomultiplier tube 1
‘Ch1Width’           –             Sample width of photomultiplier tube 1 above threshold value
‘Ch2Area’            –             Area of fluorescence peak of photomultiplier tube 2
‘Ch2Peak’            –             Peak fluorescence value from photomultiplier tube 2
‘Ch2Width’           –             Sample width of photomultiplier tube 2 above threshold value
‘Ch2_Ch1Ratio’       –             Ch2Peak/Ch1Peak
‘CircleFit’          0.3641        Deviation of the particle edge from a best-fit circle
‘Compactness’        −0.2988       Shape parameter derived from the perimeter and the area
‘ConvexPerimeter’    −0.6610       Approximation of the perimeter of the convex hull of a particle
‘Diameter_ABD_’      −0.7189       Diameter based on a circle with an area equal to the ABD area
‘Diameter_ESD_’      −0.6610       Mean value of 36 feret measurements
‘EdgeGradient’       −0.4227       Average intensity of the pixels of the outside border of the particle
‘Elongation’         −0.2989       Length/breadth ratio based on perimeter and area
‘FeretAngleMax’      0.0225        Angle of the largest feret measurement
‘FeretAngleMin’      0.0970        Angle of the smallest feret measurement
‘FilterScore’        –             Statistical filter score
‘ImageHeight’        −0.5826       Image height in pixels in the collage
‘ImageWidth’         −0.6271       Image width in pixels in the collage
‘ImageX’             −0.1199       Leftmost X coordinate of the particle in the collage
‘ImageY’             −0.2003       Top Y coordinate of the particle in the collage
‘Intensity’          0.3264        Average grayscale value of the pixels
‘Length’             −0.6960       Maximum value of 36 feret measurements
‘ParticlesPerChain’  0.0223        Number of particles grouped into one
‘Perimeter’          −0.5141       Total length of the edges of a particle
‘RatioBlue_Green’    0.5998        Average blue/average green
‘RatioRed_Blue’      −0.6579       Average red/average blue
‘RatioRed_Green’     0.1339        Average red/average green
‘Roughness’          −0.3389       Measure of the irregularity of a particle’s surface
‘ScatterArea’        –             Area of the peak of the scatter detector
‘ScatterPeak’        –             Peak value read from the scatter detector
‘ScatterWidth’       –             Sample width of scatter detector values above the threshold
‘SigmaIntensity’     −0.3334       Standard deviation of grayscale values
‘SourceImage’        −0.6209       Camera image number where the particle was captured
‘SumIntensity’       −0.5562       Sum of grayscale pixel values
‘Transparency’       −0.0366       1 − (ABD diameter/ESD diameter)
‘Volume_ABD_’        −0.1220       Sphere volume calculated from ABD diameter
‘Volume_ESD_’        −0.0437       Sphere volume calculated from ESD diameter
‘Width’              −0.5710       Minimum value of 36 feret measurements

As can be seen, there are several properties that have no correlation with the output. This is because they present constant values, so they do not provide information and are thus discarded. In addition to these, the ‘CaptureX’, ‘CaptureY’, ‘ImageX’, ‘ImageY’ and ‘SourceImage’ properties were discarded, as they only provide information on the order in which the images were taken, which is not relevant for the network. The rest of the variables, regardless of their correlation, provide information about the cell morphology, so they are considered as inputs to the network. Fluorescence data were not available (as explained above), therefore these data are not included. Since the ANN learns from data, it is not necessary to eliminate the variables with lower correlation, since it is the ANN itself that will give them less weight. Therefore, the final training dataset was reduced in size to 289705 × 32 elements.

Finally, in order to avoid problems associated with erratic sampling related to undesired particles, such as bubbles, an outlier filtering was applied to the dataset. This filtering eliminates the rows that present significantly different values from the rest in any of the columns. This ensures that there are no erroneous data that alter the learning of the ANN.

3.2.2. ANN training

Once the data have been processed, the structure of the ANN must be determined. As mentioned above, a classification problem with two different classes is faced: Chlorella vulgaris and Scenedesmus almeriensis. The ANN will be trained only with data corresponding to pure samples of each species. Thus, it is known first of all that the ANN must have two outputs. On the other hand, according to the variables analyzed above, the input to the network must have a size of 30, corresponding to the number of features used. Therefore, the complete architecture of the network is composed of an input layer of size 30, a hidden layer and an output layer of size 2. After multiple trials with sizes between 10 and 100 nodes, a size of 25 nodes was finally set for the hidden layer. This layer size has proven to perform very well while maintaining a simple and fast-to-train structure. Fig. 2 represents the selected network architecture.

The next step in training is the division of the training dataset into training, validation and test sets. The training set is the one from which the network learns, and it therefore contains most of the data; 70% of the data was selected for it, resulting in a total of 202,796 samples. The purpose of the validation set is to check the network’s generalization capacity. The network does not learn directly from this set, but it is used to stop the training when the generalization capacity does not improve. 15% of the data (43,456 samples) was used for this purpose. Finally, the test set has no effect on training, and only provides a measure of the network’s performance. The remaining 15% of the data constitutes this set.

The network was trained using scaled conjugate gradient backpropagation. The training lasts until the accuracy on the validation set stops increasing for 6 consecutive iterations, which indicates that the ANN is memorizing instead of learning, being detrimental to its


Fig. 2. Selected ANN architecture [26].
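The 30–25–2 architecture of Fig. 2 can be sketched as follows. This is an illustrative Python/NumPy forward pass, not the authors' MATLAB model: the layer sizes come from the paper, while the tanh hidden activation, the softmax output (which makes the two class outputs sum to 1, as described in Section 2.3) and the random placeholder weights are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes from the paper: 30 input features, 25 hidden nodes, 2 classes.
# Random weights stand in for the trained parameters (assumption).
W1, b1 = rng.normal(size=(25, 30)), np.zeros(25)
W2, b2 = rng.normal(size=(2, 25)), np.zeros(2)

def softmax(z):
    z = z - z.max()                    # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def classify(features):
    """Map one 30-feature row to (P(Chlorella), P(Scenedesmus))."""
    h = np.tanh(W1 @ features + b1)    # hidden layer of 25 nodes
    return softmax(W2 @ h + b2)        # two outputs in [0, 1] summing to 1

x = rng.normal(size=30)                # one synthetic feature row
p = classify(x)
print(p, p.sum())
```

With trained weights, thresholding these two outputs per particle and counting the results would yield the mixture proportions discussed in Section 3.2.3.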

performance. During the training, 254 iterations were performed, obtaining the results shown in the confusion chart in Fig. 3. This chart represents, for each of the sets, as well as for the combination of all three, the results of the prediction. The first two columns represent the actual class to which an element belongs, while the rows show the class predicted by the network. Thus, a perfect prediction would have 0 elements in positions (2,1) and (1,2) of the table. Row 3 and column 3 summarize, respectively, the accuracy with which the ANN predicts each species and its success rate with each species. In all sets, the prediction error is approximately 3.1%, which can be considered highly satisfactory.

3.2.3. Result validation

To completely validate the ANN it is necessary to check its accuracy regarding the identification of mixed samples. For this purpose, the datasets corresponding to the rest of the tests performed were used. The methodology consists in introducing each row of the dataset as input to the network, so that the ANN returns the outputs “Chlorella vulgaris” and “Scenedesmus almeriensis” for each row. From these outputs, which as previously mentioned can take values between 0 and 1, it is detected how many of the rows correspond to each class, and what the overall proportion is.

Therefore, the network was validated with each dataset corresponding to the five different mixtures. The concentrations identified by the network against the actual ones are shown in the first column of

Fig. 3. Confusion chart of the training process.
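The per-class counts summarized in a confusion chart like that of Fig. 3 can be computed as in the short sketch below. This is an illustrative Python example with made-up labels, not the paper's results; class 0 stands for Chlorella vulgaris and class 1 for Scenedesmus almeriensis.

```python
import numpy as np

def confusion_matrix(actual, predicted, n_classes=2):
    """cm[i, j] = number of samples of actual class j predicted as class i,
    matching the row/column convention described for Fig. 3."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for a, p in zip(actual, predicted):
        cm[p, a] += 1
    return cm

# Made-up labels for illustration (0 = Chlorella, 1 = Scenedesmus).
actual    = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
predicted = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 0])

cm = confusion_matrix(actual, predicted)
error_rate = 1.0 - np.trace(cm) / cm.sum()    # overall misclassification rate
per_class_acc = np.diag(cm) / cm.sum(axis=0)  # success rate per actual class

print(cm)
print(error_rate, per_class_acc)
```

The off-diagonal entries are the positions (2,1) and (1,2) that would be zero under a perfect prediction.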


Table 2 the results are very close to the actual concentrations, as well as being
Actual and predicted concentration for the ANN based on features with and symmetrical, which certifies that the threshold has been correctly
without threshold modification. imposed. Fig. 5a and b shows the real concentration of each species
Actual Predicted concentration Predicted concentration compared to that determined by the ANN, for each of the 3 tests per­
concentration without threshold with threshold formed with each concentration. The ANN behaves in a very similar way
1%Ch–99%Sc 22.0–77.0 5.0–95.0 for each test on the same sample, forming practically a unitary slope
10%Ch–90%Sc 33.6–66.4 12.3–87.7 line.
50%Ch–50%Sc 75.8–24.2 56.5–43.5
90%Ch–10%Sc 94.0–6.0 86.8–13.2
3.3. Image-based ANN
99%Ch–1%Sc 97.2–2.8 93.5–6.5

The second neural network developed use as inputs the images


Table 2. As can be seen, the predicted concentrations are significantly captured by FlowCAM from each of the detected particles, providing a
different from the actual ones. For almost all the mixtures, the network more general classification method.
predicts a much higher percentage of Chlorella vulgaris than the actual
one. Considering that the prediction error in pure samples was around 3.3.1. Data processing
3%, it can be deduced that the error when measuring the concentration In this case, data processing is equally or even more important than
does not lie in the accuracy of the ANN when classifying a cell. Thus, there can be other sources of error. As mentioned above, FlowCAM does not only detect microalgae cells, but also any particle larger than the configured cell size. This implies that some rows correspond to undesired particles. Furthermore, the grouping of several cells that FlowCAM can detect in a single row, either of different species or of the same species, is also a factor that can result in a worse identification. Fig. 4 shows a histogram of both outputs for the 50/50 sample dataset. As can be seen, the network detects a very high number of cells with a Chlorella vulgaris output very close to 1, while with Scenedesmus almeriensis it is more uncertain. This can be due to the fact that FlowCAM detects Chlorella vulgaris cells more easily, or to a tendency of the ANN to detect Chlorella vulgaris when different types of particles join together.

Therefore, it is necessary to perform a tuning that achieves a balance. So far, when counting the cells of each type, a cell has been considered to be Chlorella vulgaris as long as its Chlorella vulgaris output is greater than its Scenedesmus almeriensis output, i.e. greater than 0.5. Nevertheless, since the ANN tends to detect more Chlorella vulgaris than there really is, the threshold above which a cell is considered to be Chlorella vulgaris was chosen incrementally, raising it from 0.5 to 0.995. This value was selected in view of the histogram shown in Fig. 4 and by trial and error. The implementation of the threshold also allows the detection of particles that do not belong to either of the species, namely those for which the output does not exceed either of the two thresholds.

Following this adjustment, the concentrations determined by the network are those shown in the third column of Table 2. As can be seen,

for the ANN developed earlier. As already mentioned, this network works from the .tif files provided by FlowCAM, which is why the steps followed for its development are the segmentation of the images and their resizing to ensure a homogeneous input size.

Since there is no way to separate the individual images directly from the FlowCAM, it is necessary to divide them afterwards. For this purpose, MATLAB's Image Processing Toolbox has been used with a script that can accomplish this objective. The script transforms the image to grayscale and, by imposing an intensity threshold, keeps each pixel whose intensity is higher than that threshold. Once only these pixels are obtained, the different images are separated, detecting as independent images those whose pixels are not in contact with each other. Then, once each individual image has been detected, its position in the initial image is obtained and it is cut out, saving it as a separate image in .png format. In this way, each image is available to be treated as an individual input by the network. This process was performed with the images corresponding to all the samples, automating and simplifying image treatment. As can be seen in Fig. 1, each image has a different size; since the network that has been used needs uniform-size inputs, all the images were resized to 227 × 227 pixels.

3.3.2. ANN training

As done for the other ANN, the next step is the training stage. Given that this network works with a different type of data than the previous one, it is essential to determine a different architecture. Since it is an image classification network, it was decided to start from the structure of an existing network. That is AlexNet, a convolutional ANN composed
of 25 layers, shown in Table 3. For this specific case, the size of the input layer is 227 × 227 × 3, since color images are used, and the size of the output layer is 2. This ANN belongs to the group of algorithms known as deep learning, due to its depth and the complexity of the layers used.

The dataset available for the training, which is composed of the images corresponding to the unmixed Chlorella vulgaris and Scenedesmus almeriensis samples, will be divided using 80% of the images for the training set and the remaining 20% for validation. In this case, no test set will be used, since images are more complex data and it is convenient to employ more resources in training. In addition, training is a long process, so a division into three sets would only make it longer. Certain training options will also be established, such as the maximum training duration, which will be 3 epochs, the initial learning rate, which will be 0.001, and the validation patience, also used in the training of the other network, which will adopt a value of 5 iterations.

This training process is significantly slower than the previous one, exceeding 4 h in duration, because the size of the inputs is much larger, as well as the number of layers in the network. After the training process, the final error of the trained network is 1.6% with the validation data. This is a reasonably low error, so it is possible to claim that the network is able to accurately predict which class each image belongs to.
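The image-conditioning procedure described above (binarize by an intensity threshold, then group touching bright pixels into independent particle images) was implemented by the authors as a MATLAB Image Processing Toolbox script. The following Python sketch re-creates the idea on a toy grayscale grid; the function name and the threshold value are illustrative, not the authors' code:

```python
from collections import deque

def segment(image, threshold=128):
    """Split a grayscale collage into per-particle bounding boxes:
    binarize by intensity threshold, then group bright pixels into
    connected components (particles whose pixels touch)."""
    h, w = len(image), len(image[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for r in range(h):
        for c in range(w):
            if image[r][c] > threshold and not seen[r][c]:
                # flood-fill one component, tracking its bounding box
                q = deque([(r, c)])
                seen[r][c] = True
                rmin = rmax = r
                cmin = cmax = c
                while q:
                    y, x = q.popleft()
                    rmin, rmax = min(rmin, y), max(rmax, y)
                    cmin, cmax = min(cmin, x), max(cmax, x)
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and image[ny][nx] > threshold
                                and not seen[ny][nx]):
                            seen[ny][nx] = True
                            q.append((ny, nx))
                boxes.append((rmin, cmin, rmax, cmax))
    return boxes

# two bright blobs separated by dark background -> two bounding boxes
img = [[0, 200, 0, 0],
       [0, 200, 0, 255],
       [0, 0, 0, 255]]
crops = segment(img)
```

Each bounding box would then be cropped, saved as a separate file, and resized to 227 × 227 pixels before being fed to the network.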

Fig. 4. Histogram of the ANN output for the 50/50 sample.
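The threshold rule discussed above can be sketched in Python as follows. This is an illustrative re-implementation (the function and dictionary names are assumptions), shown with the 0.995 value used for the feature-based network; the image-based network applies the same rule with its own, much stricter threshold:

```python
def mixture_composition(outputs, threshold=0.995):
    """Count particles per species using a decision threshold on the two
    network outputs; particles exceeding neither threshold are treated
    as unknown (debris or cell aggregates)."""
    counts = {"Chlorella": 0, "Scenedesmus": 0, "unknown": 0}
    for ch, sc in outputs:
        if ch > threshold:
            counts["Chlorella"] += 1
        elif sc > threshold:
            counts["Scenedesmus"] += 1
        else:
            counts["unknown"] += 1
    identified = counts["Chlorella"] + counts["Scenedesmus"]
    ratio = counts["Chlorella"] / identified if identified else 0.0
    return counts, ratio

# (Chlorella output, Scenedesmus output) per detected particle
scores = [(0.999, 0.001), (0.002, 0.998), (0.6, 0.4), (0.997, 0.003)]
counts, chlorella_ratio = mixture_composition(scores)
```

Note that the ambiguous particle (0.6, 0.4), which the plain 0.5 rule would have counted as Chlorella vulgaris, is excluded from the composition estimate.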

P. Otálora et al. Algal Research 55 (2021) 102256
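The validation-patience criterion used in the training of both networks can be sketched as an early-stopping loop. In this minimal Python illustration the training itself is mocked by a precomputed list of validation errors, and the function name is an assumption:

```python
def train_with_patience(val_errors, patience=5):
    """Simulate early stopping: halt when the validation error has not
    improved on its best value for `patience` consecutive checks."""
    best = float("inf")
    waited = 0
    for epoch, err in enumerate(val_errors):
        if err < best:
            best, waited = err, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best  # stop early, keep best error
    return len(val_errors) - 1, best

# error improves for 3 epochs, then stagnates -> stops after 5 waits
history = [0.30, 0.20, 0.10, 0.12, 0.11, 0.13, 0.12, 0.14, 0.15, 0.16]
stop, best = train_with_patience(history)
```

With a patience of 5 the loop above halts at epoch index 7, two evaluations before the mock history runs out, retaining the best validation error seen.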

Fig. 5. Predicted concentration against actual concentration for both species and both ANNs.
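The agreement displayed in Fig. 5 can be quantified directly from the with-threshold column of Table 4; for the Chlorella vulgaris percentages, for instance:

```python
# Chlorella vulgaris percentages from Table 4 (with-threshold column)
actual    = [1.0, 10.0, 50.0, 90.0, 99.0]
predicted = [1.2, 10.9, 56.4, 89.7, 97.5]

errors  = [abs(a - p) for a, p in zip(actual, predicted)]
mae     = sum(errors) / len(errors)   # mean absolute error
max_err = max(errors)                 # worst case: the 50/50 mixture
```

The mean absolute error is about 1.9 percentage points, with the 50/50 mixture as the worst case (6.4 points), consistent with the near-unitary slope seen in the figure.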

Table 3
Layers of the neural network for image classification [26].

Number  Layer type
1       Image input
2       Convolution
3       ReLU
4       Cross channel normalization
5       Max pooling
6       Grouped convolution
7       ReLU
8       Cross channel normalization
9       Max pooling
10      Convolution
11      ReLU
12      Grouped convolution
13      ReLU
14      Grouped convolution
15      ReLU
16      Max pooling
17      Fully connected
18      ReLU
19      Dropout
20      Fully connected
21      ReLU
22      Dropout
23      Fully connected
24      Softmax
25      Classification output

3.3.3. Result validation

Following the network training, as in the previous case, it is imperative to test its performance when identifying the compositions of the mixed samples. The methodology in this case is the same as the previous one. The input dataset of each sample is provided and the images of each class classified by the network are counted, obtaining the composition of the mixture. The results obtained in first instance are shown in Table 4. Similar to the previous case, these are not particularly accurate, as the network has a tendency to predict a higher percentage of Chlorella vulgaris than the actual one. The justification for the error remains the same as before: undesired particles or particle clusters, or even a tendency of FlowCAM to detect Chlorella vulgaris cells more frequently.

Table 4
Actual and predicted concentration for the neural network based on images, with and without threshold modification.

Actual concentration   Predicted concentration without threshold   Predicted concentration with threshold
1%Ch–99%Sc             25.5–74.9                                   1.2–98.8
10%Ch–90%Sc            34.6–65.4                                   10.9–89.1
50%Ch–50%Sc            77.0–23.0                                   56.4–43.6
90%Ch–10%Sc            95.4–4.6                                    89.7–10.3
99%Ch–1%Sc             98.6–1.4                                    97.5–2.5

In view of this circumstance, the previous solution was considered. A detection threshold was applied, so that an image will only be considered as Chlorella vulgaris when its similarity to this species exceeds this threshold. After several iterations, the threshold with the best results was determined to be 0.9999975. To obtain this value, the default threshold of 0.5 was used as a starting point, gradually increasing it. In each iteration, the predicted ratio for the 50/50 set was checked until a value was found that provided satisfactory results. From here, this value was tested with the other sets, completing its validation. This means that for the network to decide that an image is


Chlorella vulgaris, it must be completely certain of this. Consequently, the identification of both species becomes balanced and, in the same way as in the previous case, the detection of particles not belonging to either of the species, or of aggregates of cells, is achieved at the same time.

Through this methodology, the results summarized in the third column of Table 4 are obtained. The ANN has the capacity to represent reality with considerable precision, with a minimal difference between the true and the predicted composition. Fig. 5c and d show the predicted concentration plotted against the real one for each of the tests performed with each sample. The dispersion between the predictions is quite small, and the line relating both axes is very close to the unitary slope, which would correspond to a perfect prediction.

3.4. Discussion

After the development and validation of both ANNs, both have proven to be capable of accomplishing the proposed task. The identification of microalgae cultures formed by two different species has been successfully achieved using two different types of data, which validates the usefulness of ANNs for this kind of problem. A great volume of data has been used for both the training and the validation steps. These data belong to three different trials for each sample, and in all cases the results have been very precise. The classification proves to be fairly fast, with a classification time of less than 0.001 s per sample for the feature-based network and 0.1 s per sample for the image-based network (processor: Intel(R) Core(TM) i7-9700 CPU @ 3.00 GHz; RAM: 16.0 GB).

When comparing the two networks with each other, both have followed a similar methodology but have shown slightly different results. The ANN that has used images as input has achieved more accurate results, which is at first sight the most important aspect. However, this network also requires much more time to be trained and to classify any sample, since it is denser and uses more complex data. The images are also heavier data, and therefore more complicated to share. The use of one network or the other is thus justified by the compromise between precision and complexity. Another factor that can be taken into account is the ease of obtaining data: although in this work FlowCAM is able to provide both kinds, this could change if another data acquisition tool were used, and the ANN using images as inputs allows generalization to other devices.

4. Conclusions

This paper presents two ANNs for the identification of microalgae samples composed of Chlorella vulgaris and Scenedesmus almeriensis. The networks use as input data images of microalgae cells and descriptive features provided by the FlowCAM tool. In both networks, training has been done with pure samples and a classification threshold tuning has been applied. The training error did not exceed 4%, while in validation the maximum error was 6.5%. Regarding future work, it would be interesting to extend the networks by using a greater number of species and samples in different conditions, as the followed methodology makes it easy to generalize the classification problem to more species.

Statement of informed consent, human/animal rights

"No conflicts, informed consent, human or animal rights applicable".

Declaration of authors

All the participants are authors of this work and agree to submit the manuscript for peer review to Algal Research.

CRediT authorship contribution statement

Otálora P. was responsible for the work, the neural network development, and the writing of the manuscript. Guzmán J.L. was responsible for the model adaptation and the parameter tuning. Acién F.G. contributed to the elaboration of the manuscript and revision. Berenguel M. was responsible for the Discussion section. Reul A. contributed to revision and finalization of the manuscript.

Declaration of competing interest

The authors declare any potential financial or other interests that could be perceived to influence the outcomes of the research.

Acknowledgments

This work has been partially funded by the following projects: DPI2017-84259-C2-1-R (financed by the Spanish Ministry of Science and Innovation and EU-ERDF funds), and the European Union's Horizon 2020 Research and Innovation Program under Grant Agreement No. 727874 (SABANA).

The acquisition of the FlowCAM by the University of Málaga was co-financed by the 2008–2011 FEDER program for Scientific-Technique Infrastructure (UNMA08-1E005).

References

[1] A. Hernández-Pérez, J.I. Labbé, Microalgas, cultivo y beneficios (Microalgae, culture and benefits, in Spanish), Revista de Biologia Marina y Oceanografia 49 (2014) 157–173.
[2] J.K. Pittman, A.P. Dean, O. Osundeko, The potential of sustainable algal biofuel production using wastewater resources, Bioresour. Technol. 102 (2011) 17–25.
[3] N. Abdel-Raouf, A.A. Al-Homaidan, I.B. Ibraheem, Microalgae and wastewater treatment, Saudi Journal of Biological Sciences 19 (2012) 257–275.
[4] F.G. Acien, J.M. Fernández Sevilla, E. Molina Grima, Microalgae, the basis of mankind sustainability, in: Case Study of Innovative Projects – Successful Real Cases, 2017, pp. 123–140.
[5] C.V. González-López, F. García-Cuadra, N. Jawiarczyk, J.M. Fernández-Sevilla, F.G. Acién-Fernández, Valorization of microalgae and energy resources, in: Sustainable Mobility, 2020.
[6] J.L. Guzmán, F.G. Acién Fernández, M. Berenguel, Modelling and control of microalgae production in industrial photobioreactors (in Spanish), Revista Iberoamericana de Automática e Informática Industrial 00 (2020) 1–15.
[7] K. Heimann, R. Huerlimann, Microalgal classification: major classes and genera of commercial microalgal species, in: Handbook of Marine Microalgae: Biotechnology Advances, 2015, pp. 25–41.
[8] S. Promdaen, P. Wattuya, N. Sanevas, Automated microalgae image classification, Procedia Computer Science 29 (2014) 1981–1992.
[9] P. Coltelli, L. Barsanti, V. Evangelista, A.M. Frassanito, V. Passarelli, P. Gualtieri, Automatic and real time recognition of microalgae by means of pigment signature and shape, Environmental Sciences: Processes and Impacts 15 (2013) 1397–1410.
[10] E. Álvarez, M. Moyano, Á. López-Urrutia, E. Nogueira, R. Scharek, Routine determination of plankton community composition and size structure: a comparison between FlowCAM and light microscopy, J. Plankton Res. 36 (2014) 170–184.
[11] J. Wang, J. Zhao, Y. Wang, W. Wang, Y. Gao, R. Xu, W. Zhao, A new microfluidic device for classification of microalgae cells based on simultaneous analysis of chlorophyll fluorescence, side light scattering, resistance pulse sensing, Micromachines 7 (2016) 1–17.
[12] A. Reul, M. Muñoz, B. Bautista, P.J. Neale, C. Sobrino, J.M. Mercado, M. Segovia, S. Salles, G. Kulk, P. León, W.H. van de Poll, E. Pérez, A. Buma, J.M. Blanco, Effect of CO2, nutrients and light on coastal plankton, III. Trophic cascade, size structure and composition, Aquatic Biology 22 (2014) 59–76.
[13] M.I. Jordan, T.M. Mitchell, Machine learning: trends, perspectives, and prospects, Science 349 (2015) 255–260.
[14] B.M. Franco, L.M. Navas, C. Gómez, C. Sepúlveda, F.G. Acién, Monoalgal and mixed algal cultures discrimination by using an artificial neural network, Algal Res. 38 (2019) 1–7.
[15] I. Correa, P. Drews-Jr, S. Botelho, M.S. De Souza, V.M. Tavano, Deep learning for microalgae classification, in: Proceedings of the 16th IEEE International Conference on Machine Learning and Applications (ICMLA 2017), 2017, pp. 20–25.
[16] G.P.A. Victoria Anand Mary, S. Mohan, Freshwater microalgae image identification and classification based on machine learning technique, Asian Journal of Computer Science and Technology (AJCST) 7 (2018) 63–67.
[17] P. Qian, Z. Zhao, H. Liu, Y. Wang, Y. Peng, S. Hu, J. Zhang, Y. Deng, Z. Zeng, Multi-target deep learning for algal detection and classification, in: Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBS), 2020, pp. 1954–1957.
[18] S. He, Z. Xu, Y. Jiang, J. Ji, E. Forsberg, Y. Li, Classification, identification and growth stage estimation of microalgae based on transmission hyperspectral microscopic imaging and machine learning, Opt. Express 28 (2020) 30686–30700.


[19] D. Yadav, A. Jalal, D. Garlapati, K. Hossain, A. Goyal, G. Pant, Deep learning-based ResNeXt model in phycological studies for future, Algal Res. 50 (2020) 102018.
[20] R. Reimann, B. Zeng, M. Jakopec, M. Burdukiewicz, I. Petrick, P. Schierack, S. Rödiger, Classification of dead and living microalgae Chlorella vulgaris by bioimage informatics and machine learning, Algal Res. 48 (2020) 101908.
[21] FlowCAM® Manual, 2011.
[22] J.M. Fernández-Sevilla, F.G. Acién Fernández, E. Molina Grima, Biotechnological production of lutein and its applications, 2010.
[23] R.A. Kay, L.L. Barton, Microalgae as food and supplement, Crit. Rev. Food Sci. Nutr. 30 (1991) 555–573.
[24] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (2015) 436–444.
[25] J. Schmidhuber, Deep learning in neural networks: an overview, Neural Netw. 61 (2015) 85–117.
[26] M. Hudson Beale, M.T. Hagan, H.B. Demuth, Deep Learning Toolbox™ User's Guide, 1992.
[27] C.M. Thompson, L. Shure, Image Processing Toolbox™ User's Guide, 1995.
