Comparing U-Net Convolutional Network With Mask R-CNN in Agricultural Area Segmentation On Satellite Images
Abstract—Deep learning is the fastest-growing trend in statistical analysis of remote sensing data. Deep learning models are used for information processing of spectral steps, identification statistics, segmentation and classification of the objects in satellite images, etc. Image segmentation can make object statistics more accurate by separating the objects from the background. In this paper, we review Mask R-CNN and U-Net for satellite imagery segmentation, and we conduct an experiment with these models to show their suitability for this field. The experimental mean average precision (mAP) on the dataset of Vietnam satellite images is 95.21% for Mask R-CNN and 92.69% for U-Net.

Keywords—Deep learning, segmentation, Mask R-CNN, U-Net, agricultural areas, satellite images.

I. INTRODUCTION

Satellite imagery, or remote sensing imagery, usually refers to data obtained from satellites, civil aircraft, dedicated aircraft, or drones. Its purpose is to show objects on the surface of the earth through sensors and video cameras. Typically, satellite imagery is used for studying measurements, gathering objective information, and studying the surface of the Earth or other planets. The results of those studies affect many areas of life, such as weather and natural disaster forecasting in meteorology; monitoring the rate of desertification and the rate of coastal erosion in geology; and monitoring forest cover and warning of and monitoring forest fires in forestry, among many other fields.

In agriculture, remote sensing is also used in management, statistics, and farming. Its applications include forecasting and managing agricultural production, remote sensing information and geographic information, surveying and mapping agricultural land, deforestation warning, monitoring the normalized difference vegetation index (NDVI), etc. [1][2]. The emergence of Agriculture 4.0 is attributed to practical applications of agricultural monitoring through satellite imagery.

Deep learning, a branch of artificial intelligence [3] that has seen many works and studies since 2012, has affected different aspects of life by solving many practical problems [4] in areas such as health care, data processing, stock analysis, physical recognition, etc. Deep learning models are often faster and more accurate than other methods in solving problems of artificial intelligence. The reason is that they are modeled on the human neural network, an architecture that has many advantages in information transmission and processing [5]. The areas of science that can be well addressed by using deep learning include automatic speech recognition, image processing identification, natural language processing, bioinformatics, etc.

Deep learning in agricultural satellite image processing is one part of applying deep learning methods to identifying and segmenting agricultural objects on satellite images. Along with the development of deep learning models in image processing [6][7], this topic has been studied in many scientific works for a long time. Previous studies include Land Use and Land Cover Classification [8], Forest Classification and Structural Estimation [9], Agricultural Land Detection in Insular Areas using an improved AlexNet network model [10], etc. These studies indicate that the identification and segmentation of agricultural areas with deep learning models are highly applicable in various fields, such as agricultural mapping, calculation of harvested agricultural yields, and estimation of the amount of fertilizer. The experimental works mentioned were tested only on datasets from Western countries (the Americas or Europe), where science and technology are well developed, with modern, high-resolution equipment. Their findings raise a big question about the effectiveness of agricultural satellite imagery processing in developing countries, especially in Vietnam. Besides the fact that agriculture accounted for 17.4% of Vietnam's GDP in 2015, Vietnam is also the second-largest rice exporter in the world. This suggests that Vietnam may represent the agriculture of Eastern countries.

Regarding the segmentation of agricultural areas on satellite imagery, there are very few experimental studies covering both Mask R-CNN and U-Net in which the researchers compare their effectiveness, advantages, and disadvantages. The Mask R-CNN model [13], proposed by K. He et al., was developed from Faster R-CNN with an additional branch in the architecture that creates a mask layer for object segmentation. Since then, Mask R-CNN has been tested on many different satellite imagery processing problems, with high effectiveness reported [14][15]. Mask R-CNN surpasses the two winners of the COCO Challenge [16]: MNC [17] (2015) and FCIS [18] (2016). Moreover, given its effectiveness in experiments on many satellite image segmentation problems [14][15], Mask R-CNN is expected to bring remarkable advances in agricultural area segmentation on satellite imagery. In addition, U-Net [19], the model proposed by O. Ronneberger et al. with the original purpose of segmenting biomedical objects, was quickly applied to segmenting different types of objects in various fields after showing its high efficiency
[20][21][22]. In satellite imagery processing, U-Net has also been applied to the segmentation of buildings [23], urban areas [24], etc. These experiments yield high-precision predictions, which shows that U-Net can also be a good model for agricultural area segmentation on satellite images. For these reasons, we decided to experiment with and compare Mask R-CNN and U-Net on the segmentation of agricultural area objects on satellite images. This work will help to highlight the effectiveness of Mask R-CNN and U-Net in supporting agricultural remote sensing. In addition, the experiment is conducted on two datasets: one with images obtained in Vietnam, and the other with images obtained in other regions (the Americas or Europe). This approach not only confirms the accuracy of Mask R-CNN and U-Net on the high-quality satellite images of Western countries used in previous works, but also tests their effectiveness on Eastern agricultural areas (especially in Vietnam).

In this paper, we present the results of our research and experiments with the Mask R-CNN and U-Net models in agricultural image segmentation. The paper has six main parts: Part I gives an overview of the work; Part II reviews the current state of deep learning applications in agricultural satellite image processing; Part III describes the process of applying the Mask R-CNN and U-Net deep learning models to agricultural area segmentation on satellite images; Part IV details the training of the two test models; Part V presents the predictions of Mask R-CNN and U-Net on real images, the accuracy of each model, and a comparison of their effectiveness in agricultural area segmentation on satellite images; Part VI gives the conclusion and future development of this work.

II. DEEP LEARNING IN AGRICULTURAL SATELLITE IMAGE SEGMENTATION

The processing of agricultural satellite imagery consists of problems at different levels of complexity. The first level of image processing is object classification. In their study, P. Helber et al. used a deep learning model to classify 10 land objects and achieved more than 98% classification accuracy. The next level combines object classification and object detection. The study of T. Chang et al. applied a deep learning model to classify and identify four different types of forests. Through synthesis, statistics, and comparison of models, Charou [25] studied and compared various deep learning methods for identifying agricultural area objects on satellite images. Most of the methods that Charou tested achieved high performance. However, the above works only highlight the effectiveness of some deep learning models in recognizing objects in satellite images, without addressing their ability to segment objects.

At a higher level of agricultural satellite imagery processing, object segmentation has drawn the attention of many researchers. Object segmentation is the process of separating objects from the background data by defining their boundaries. It allows objects in images to be observed and analyzed more closely and accurately. Regarding agricultural area segmentation on satellite imagery, the study of K.M. Masoud [11] experimentally applied an improved FCN model. He concludes that this model achieves a certain effectiveness in agricultural area segmentation. The drawback of this model is that its architecture is relatively complex and requires a large number of convolution layers, so training and testing the model on a large area requires substantial computing resources (at least 16GB of RAM for wide-area training and testing). Comparing FCN with another model (SegNet), the experimental study by M. Yang and colleagues [12] showed that FCN is more effective in segmenting rice areas. However, both models have limitations on images of large areas, where the objects to be segmented become smaller. U-Net has been found to be a simple yet effective model in various studies that go beyond its original purpose. Given this finding, U-Net seems to be more efficient than FCN during training while remaining competitively accurate in object segmentation. Its U-shaped architecture of symmetric encoders and decoders helps U-Net to synthesize the feature information of objects more accurately. U-Net has also been used experimentally in processing agricultural satellite imagery. Andrei Stoian [26] used U-Net to segment the distribution of many soil types on satellite images and compared the traditional U-Net with a fine-tuned U-Net. However, because that work covers many object types (17 types), its highest segmentation accuracy is achieved on water areas, not agricultural areas. Mask R-CNN is, by nature, a model for solving object recognition problems like those in Charou's research [25]. However, it improves on Charou's models thanks to an additional branch used for object segmentation. Mask R-CNN has proved to be effective in many object segmentation problems [15][19][27][28], including segmentation in satellite imagery. W. Zhang [14] applied this model to the segmentation of Arctic glaciers with an mAP of 70%. Another work, by L. Chen [15], tested Mask R-CNN on the segmentation of urban areas in Xiamen (China), with 90% precision and 87.18% recall. For the segmentation of agricultural areas on satellite images in particular, Mask R-CNN does not seem to have received much attention, and no dedicated work has tested its effectiveness. Based on what Mask R-CNN has demonstrated in other studies, this model shows the potential for high accuracy in segmenting agricultural area objects on satellite imagery. In addition, Mask R-CNN is highly applicable because object position predictions (bounding boxes) can be combined with object segmentation predictions (mask layers). Comparing the two models, T. Zhao's experiment [29] evaluated Mask R-CNN and U-Net on agricultural satellite imagery segmentation, with pomegranate canopy as the target object. This comparative test indicates that Mask R-CNN gives better results than U-Net. However, the test object in that study has low complexity and the test dataset does not contain enough noise, so in our opinion the assessment is of limited value as a reference.

From this study of the current state of deep learning in image processing, it can be seen that many models have been tested for effectiveness, from lower-level image processing tasks such as classification and object detection to higher-level ones such as object segmentation. In satellite image segmentation especially, many popular models have been tested on many types of
2020 7th NAFOSTED Conference on Information and
Computer Science (NICS)
agricultural objects. However, among them, the verification and comparison of the effectiveness of Mask R-CNN and U-Net in agricultural area segmentation seem to be less studied and tested. In our research and experiments, we also evaluate and compare the effectiveness of the Mask R-CNN and U-Net models in agricultural satellite image segmentation, particularly on agricultural areas that are more general and more complex.

III. EXPERIMENTAL PROCESS

Fig. 1. Process of satellite image segmentation

The experimental process consists of 6 main steps, as shown in “Fig. 1”.

Step 1 - Data collection: Data is one of the important factors that directly affect the accuracy of the model. Therefore, we created our own dataset to best suit the requirements of our experiment. The dataset was collected using Google Map and Microsoft's Snipping Tool, with a total of 2400 images for the two datasets of agricultural areas in Vietnam and in other countries. Besides the quantity of samples for each object, the variety of categories should be sufficient for the model to learn the necessary attributes.

Step 2 - Assigning data labels: In this step, the VGG Image Annotator tool [30] was used to assign location labels to the objects in the dataset photos. Data labeling is also a very important stage, as the model relies on these locations for training. If the data is labeled incorrectly, the model fails to learn the object's properties, which leads to inaccurate predictions.

Step 3 - Model training: The core task of model training is to extract the features of labeled objects from the provided data. These features are stored in the model to be "remembered". As a result, when tested on another picture (one not used in training), the model can recognize and confirm the objects. Different models learn differently, so it is important to adjust the learning parameters so that training achieves the best results.

Step 4 - Model testing: Testing helps to identify whether the model has learned properly. If it has not, the data needs to be refined or the learning parameters adjusted, and the model is then retrained to achieve the best results. This can be judged from the predicted results on a picture or from the learning curve. Because each training state reflects a certain behavior of the model, it is necessary to rely on different factors, such as the requirements of the problem, the criteria of the dataset, and the features of the object, to choose the model that is most suitable for each problem.

Step 5 - Commenting and evaluating the model: Based on the data obtained in step 4, as well as the accuracy measurements of the models, we comment on the results in comparison with the model theory. At the same time, the two experimental models are compared in terms of the quality and effectiveness of their segmentation on the same images, and the results are interpreted. Thus, the performance of the two models in segmenting agricultural areas on satellite imagery is evaluated.

Step 6 - Processing predicted results: In this step, we post-process the results of the test model to get more accurate results. False predictions are removed, missing predictions are added, and those that are not completely accurate are refined. The refinement makes the segmentation values more accurate, so that they can be used for various purposes, such as creating new training datasets or producing agricultural area statistics. The more accurate an image is, the more reliably it can be used for statistics.

IV. IMPLEMENTATION AND TRAINING

A. Training data

In remote sensing, information collection devices play an extremely important role because their accuracy affects the reliability of later analysis and statistics. However, these devices are relatively expensive. Therefore, given the small scale and limited funding of our experimental research, we use freely available satellite image data sources. Google Map is popular software with an integrated satellite view function, which we use for data collection. It offers a relatively wide range of viewing altitudes (5m to 2000km) and is extremely flexible, so the view can be adjusted depending on the desired characteristics. Our dataset is a small one, self-collected from the Google Map public domain. During data collection, each selected image contains at least one object. In addition, the object to be segmented (agricultural land) usually covers over 50% of each image. In the labeling process, objects are separated based on the boundary of each type of land. Besides, confusing data, such as tree shadows, tractors, wells, etc., is also restricted during the labeling process. The number of objects per image ranges from 1 to 20.

The dataset we extracted includes 2400 images, divided into two datasets: satellite images of Vietnamese agricultural areas (Vietnam) and satellite images of agricultural areas in other regions (NonVietnam). The images are mostly concentrated in key agricultural areas. The images in each dataset are distributed as follows: in the Vietnam dataset, around 75% from the South, around 20% from the North, and around 5% from the Central region; in the NonVietnam dataset, around 40% from America, around 40% from Europe, around 15% from Africa, and around 5% from Asia. Compared to other regions, Vietnam is a country with an intricate and dense network of rivers. Therefore, agricultural satellite images of this area often contain many interfering objects (houses, rivers, lakes, canals, etc.). In addition, the characteristics of the river network affect the planning of agricultural land, which leads to diversity in the shape and size of agricultural land in Vietnam. Besides, the application of high technology such as segmenting agricultural areas using commercial satellites is limited, so obtaining a full set of agricultural images of different regions in Vietnam is difficult. Details of the training data division are shown in “TABLE I”.

TABLE I. DETAIL OF TRAINING DATA
Dataset      Total   Training   Validation   Testing
Vietnam        650        480          120        50
NonVietnam    1750       1280          320       150
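The division in TABLE I can be reproduced mechanically once the image files are listed. The following is a minimal sketch only: the paper does not state how images were assigned to each subset, so the seeded random shuffle, the `split_dataset` helper, and the `vietnam_0000.png`-style file names are all assumptions made for illustration. Only the subset sizes come from TABLE I.

```python
import random

# Subset sizes (training, validation, testing) taken from TABLE I.
TABLE_I = {
    "Vietnam": (480, 120, 50),
    "NonVietnam": (1280, 320, 150),
}


def split_dataset(items, n_train, n_val, n_test, seed=0):
    """Shuffle items reproducibly and cut them into three disjoint subsets.

    The random assignment and the fixed seed are assumptions; the paper
    only reports the resulting subset sizes.
    """
    if n_train + n_val + n_test != len(items):
        raise ValueError("subset sizes must add up to the number of items")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical file names standing in for the 650 Vietnam images.
vietnam_files = [f"vietnam_{i:04d}.png" for i in range(650)]
train, val, test = split_dataset(vietnam_files, *TABLE_I["Vietnam"])
print(len(train), len(val), len(test))  # 480 120 50
```

In both datasets the training and validation subsets stand in a 4:1 ratio (480:120 and 1280:320), while the test subsets (50 and 150 images) are held out separately.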
Fig. 3. The Mask R-CNN model to be processed during the training

B. U-Net

The error of U-Net is shown in “Fig. 5”. It is clear that the error of the training process tends to decrease, and then decreases slowly. The model achieved its best error value on Vietnam training at the 11th training step (loss = 0.2902) and on NonVietnam training at the 34th training step (loss = 0.1720). Evaluating these two models at their best training steps by the values of IoU and Dice, we get the results in “TABLE IV”. The Dice parameter (the Dice
coefficient) is another accuracy measure. Like IoU, it is calculated from the overlap between the predicted value and the correct object label, but with a different formula.

Similar to Mask R-CNN, “TABLE IV” shows that UNet_NonVietnam, which is trained on more data, achieves better values than UNet_Vietnam.

Fig. 5. The Error of U-Net on the two datasets

TABLE IV. THE PARAMETERS EVALUATING THE ACCURACY OF U-NET
Model             IoU (0.5)   Dice     mAP
UNet_Vietnam      0.7284      0.8368   0.7924
UNet_NonVietnam   0.8302      0.9048   0.9269

C. Comparison and discussion

After experimenting with and evaluating the Mask R-CNN and U-Net models on two datasets of agricultural areas in Vietnam and in other regions, we highlight several observations and report high precision in the experiments. It can be seen that the amount of data used in training greatly affects the learning process of the models. The values of the evaluation parameters on the Vietnam data are always lower than on the NonVietnam data. This is also affected by the quality of the training data, because the sharpness of the Vietnam data is lower than that of the NonVietnam data. Low resolution makes it difficult for the models to learn the attributes of the objects to be identified and makes them easy to confuse with other objects. However, within each model, the differences in the evaluation parameters between the two datasets are not large. It can be concluded that deep learning models are suitable for Vietnamese agricultural areas and are not much less effective there than in Western countries. In addition, our experiment has shown that Mask R-CNN and U-Net can be applied to agricultural area segmentation on satellite images.

... agricultural areas and zoning is relatively clear. For U-Net, the segment values are still noisy. However, the object boundaries produced by U-Net prove to be more accurate than those of Mask R-CNN.

In “Fig. 8”, where the original image shows an agricultural area of the Americas, U-Net shows superiority in object segmentation, although there is still noise in a small green area (upper right). In contrast, Mask R-CNN partitions this image poorly: the incorrect partitioning produces many objects, and the mask of the correct object is not very accurate.

In “Fig. 9”, the original image contains many objects to identify, with many small objects and relatively blurred boundaries. As a result, both U-Net and Mask R-CNN perform poorly at identification and partitioning. Although U-Net has segmented most areas of the fields in the image well, it has merged several objects into one (lower right) because the boundaries of the objects are hard to recognize. In general, however, the objects labeled as agricultural areas are correctly identified and segmented by U-Net. For Mask R-CNN, some agricultural area objects have not been identified (lower left), an interfering object has been mistakenly identified (middle), and the segmentation results (masks) still contain several mistakes. However, Mask R-CNN performs well in partitioning the image into separate objects, without merging multiple objects into one as U-Net does.

In many cases, U-Net's segmentation predictions are more accurate than those of Mask R-CNN when the boundaries of the object are clearly defined. However, the boundaries determined by U-Net become noisy if there are many obstructions. For Mask R-CNN, the mask layer is ineffective and not fully accurate in some cases. In addition, because Mask R-CNN defines segmented objects as individual mask layers (shown in different colors above), its predictions are more valuable in practical applications than those of U-Net. In general, agricultural area zoning on satellite imagery can be handled well by both Mask R-CNN and U-Net.

Fig. 6. a. Original image - b. U-Net’s prediction - c. Mask R-CNN’s prediction - d. Mask R-CNN’s mask
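The IoU and Dice values reported in “TABLE IV” both measure the overlap between a predicted mask and its label. With A the set of predicted pixels and B the set of labeled pixels, the standard definitions are IoU = |A ∩ B| / |A ∪ B| and Dice = 2|A ∩ B| / (|A| + |B|). The following is a minimal sketch of these standard formulas, not the evaluation code used in the experiment: the function names, the representation of masks as nested lists of 0/1 values, and the convention that two empty masks score 1.0 are all assumptions.

```python
def overlap_counts(pred, label):
    """Count intersection, predicted area, and labeled area of two binary masks.

    Masks are equally sized nested lists of 0/1 values (an assumed format).
    """
    inter = pred_area = label_area = 0
    for pred_row, label_row in zip(pred, label):
        for p, t in zip(pred_row, label_row):
            p, t = bool(p), bool(t)
            inter += int(p and t)
            pred_area += int(p)
            label_area += int(t)
    return inter, pred_area, label_area


def iou(pred, label):
    """Intersection over Union: |A ∩ B| / |A ∪ B|."""
    inter, pred_area, label_area = overlap_counts(pred, label)
    union = pred_area + label_area - inter
    # Assumption: two empty masks are treated as a perfect match.
    return inter / union if union else 1.0


def dice(pred, label):
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    inter, pred_area, label_area = overlap_counts(pred, label)
    total = pred_area + label_area
    # Assumption: two empty masks are treated as a perfect match.
    return 2 * inter / total if total else 1.0


# Toy 2x3 masks: 2 shared pixels, 3 predicted pixels, 3 labeled pixels.
pred = [[1, 1, 0],
        [0, 1, 0]]
label = [[1, 0, 0],
         [0, 1, 1]]
print(iou(pred, label))   # 0.5
print(dice(pred, label))  # 0.6666666666666666
```

For a single pair of masks the two scores are tied by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU. The identity does not have to hold for values averaged over many images, which may be why the Dice column of “TABLE IV” is close to, but not exactly, what this formula gives from the IoU column.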