
Ocean Engineering 265 (2022) 112553


Automatic berthing using supervised learning and reinforcement learning


Shoma Shimizu a,∗, Kenta Nishihara a, Yoshiki Miyauchi b, Kouki Wakita b, Rin Suyama b, Atsuo Maki b, Shinichi Shirakawa a

a Yokohama National University, 79-7 Tokiwadai, Hodogaya-ku, Yokohama, Kanagawa, 240-8501, Japan
b Osaka University, 2-1 Yamadaoka, Suita, Osaka, 565-0871, Japan

∗ Corresponding author.
E-mail address: shimizu-shoma-kr@ynu.jp (S. Shimizu).

https://doi.org/10.1016/j.oceaneng.2022.112553
Received 6 July 2022; Received in revised form 18 August 2022; Accepted 11 September 2022; Available online 28 September 2022
0029-8018/© 2022 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

ARTICLE INFO

Keywords:
Autonomous vessel
Automatic berthing/docking
Supervised learning
Reinforcement learning

ABSTRACT

Although various studies have been conducted on automatic berthing, including offline optimization and online control, real-time berthing control remains a difficult problem. Online control methods without reference trajectories are promising for real-time berthing control. We used reinforcement learning (RL), a type of machine learning, to obtain an online control law without reference trajectories. Because online control for automatic berthing is challenging, obtaining an appropriate control law with naive reinforcement learning is difficult. Furthermore, almost all existing online control methods do not consider port geometries. This study proposes a method for obtaining online berthing control laws by combining supervised learning (SL) and RL. We first trained the controller using offline-calculated trajectories and then further trained it using RL. Owing to the SL process, the proposed method can start the RL process with a good control policy. We evaluated the control law performance of the proposed method in a simulation environment that considered port geometries and wind disturbances. The experimental results show that the proposed method can achieve a higher success rate and lower safety risk than the naive SL and RL algorithms.

1. Introduction

Considerable research is being conducted at many research institutes toward realizing autonomous vessels. An outstanding issue in autonomous vessel operation is automatic berthing/docking. Automatic berthing has a long history of research and development, and automatic berthing/docking was demonstrated using an actual ship in Japan as early as the 1980s (Takai and Yoshihisa, 1987; Takai and Ohtsu, 1990). Various studies have been conducted since then, and they are ongoing.

Numerous studies have been conducted on autonomous berthing/docking from various perspectives, including trajectory planning. Shouji formulated an offline berthing problem as an optimal control problem (Shouji et al., 1992). Following their success, several studies have been conducted (Mizuno et al., 2015; Maki et al., 2020; Bitar et al., 2020; Martinsen et al., 2020; Maki et al., 2021; Miyauchi et al., 2022b). These methods require a dynamic model or a state equation. Therefore, to achieve more realistic control, the estimation of dynamic models or state equations was recently addressed by Miyauchi et al. (2022a) and Wakita et al. (2022b).

The ultimate objective of the automatic berthing problem is to achieve real-time control. Therefore, several studies on online control algorithms have been conducted using proportional–integral–differential (PID) controllers (Shouji et al., 1993; Rachman et al., 2021; Sawada et al., 2021; Rachman et al., 2022), model predictive controllers (MPC) (Li et al., 2020), neural network (NN)-type controllers (Tran and Im, 2012; Ahmed and Hasegawa, 2014; Wakita et al., 2022a; Akimoto et al., 2022), and so on.

Most automatic berthing control studies have applied a path-following or trajectory-tracking approach (Shouji et al., 1993; Rachman et al., 2021, 2022; Sawada et al., 2021; Wakita et al., 2022a). However, a control algorithm other than path following or trajectory tracking is necessary. The primary issue with the path-following or trajectory-tracking approach is the risk of control failure during the actual control procedure. If the control fails to follow the reference trajectory, the subsequent control may collapse because it cannot return to the original trajectory.

In contrast, control without a reference is a much more difficult problem to solve. This is because the state equation governing ship motion is complicated and highly nonlinear during the berthing maneuver owing to the complex and transient flow field created by large drift angles and the use of special propeller operations: reversal, idling, and/or boosting. For such nonlinear problems, only a few optimization schemes can obtain a solution.


Previous studies have been conducted using NN-type controllers (Tran and Im, 2012; Ahmed and Hasegawa, 2014; Akimoto et al., 2022; Mizuno et al., 2007). Another difficulty with berthing control is the existence of spatial constraints owing to harbor geometry. To ensure safety, the harbor geometry must be considered in berthing control algorithms. To date, studies that explicitly considered the existence of harbor geometry have been conducted in the framework of offline trajectory exploration, such as Maki et al. (2020, 2021) and Miyauchi et al. (2022b). However, there has not been much research on an online control algorithm that explicitly considers spatial constraints.

On the other hand, reinforcement learning (RL) (Sutton and Barto, 2018), a subfield of machine learning and another paradigm for control theory, has been widely applied in various fields such as robot control (Kober et al., 2013) and Go (Silver et al., 2017; Schrittwieser et al., 2020). RL algorithms can obtain an online control law without explicitly knowing the form of the state equations. This means that they can be applied to nonlinear problems such as berthing control. Koide and Mizuno (2021) attempted to construct a controller for automatic berthing using RL. However, their controller was only for approaching the berth point and not for the entire berthing maneuver. Furthermore, RL algorithms require many state transition computations, i.e., numerous simulation repetitions, which makes obtaining the control laws expensive.

In this study, we used RL to obtain an online control law aimed at enabling the real-time generation of berthing trajectories under spatial constraints. Algorithms for such a problem setting, i.e., obtaining an online control law without reference trajectories under spatial constraints, have not been developed. Therefore, one of the novelties of this study was to target such a problem setting based on RL. In addition, to reduce the number of computations of state transitions required to obtain a good control law, we perform supervised learning (SL) with trajectories obtained by offline optimization to warm up the RL. We also conducted simulation experiments under various initial conditions and wind disturbances to quantitatively evaluate the performance of the obtained control law.

2. Preliminaries

2.1. Maneuvering simulation model

This section introduces a mathematical model for simulating ship maneuvers. A single-propeller single-rudder ship was selected as the subject ship. The maneuver of the ship was modeled as a surge-sway-yaw 3 degrees-of-freedom (DoF) motion.

The coordinate systems used herein are as follows. Fig. 1 shows the space-fixed O−𝑥𝑦 coordinate system and the ship-fixed o0−𝑥0𝑦0 coordinate system. The O−𝑥𝑦 and o0−𝑥0𝑦0 systems have their origins at the upper-left corner of the berth and at midship, respectively. Here, 𝑢 and 𝑣𝑚 represent the longitudinal and lateral speeds of the ship in the o0−𝑥0𝑦0 coordinate system, 𝜓 is the heading angle [rad], and 𝑟 represents the angular velocity of the yaw motion. Regarding the ship actuators, 𝛿 is the rudder angle [deg], and 𝑛 is the propeller speed [rps]. For environmental disturbance, we only considered wind; 𝛾𝑇 and 𝑈𝑇 represent the wind direction and speed over the ground, and 𝛾𝐴 and 𝑈𝐴 represent the apparent wind direction [deg] and wind speed [m/s], respectively. In summary, we define the motion variable vector 𝒙, action vector 𝒂, and disturbance vector 𝒘 as follows:

\begin{cases}
\boldsymbol{x} \equiv (x, u, y, v_m, \psi, r)^{\mathsf{T}} \in \mathbb{R}^{6} \\
\boldsymbol{a} \equiv (\delta, n)^{\mathsf{T}} \in \mathbb{R}^{2} \\
\boldsymbol{w} \equiv (\gamma_T, U_T)^{\mathsf{T}} \in \mathbb{R}^{2} .
\end{cases} \tag{1}

The details of the numerical simulation of ship motion are as follows. First, we introduced the equations of motion:

\begin{aligned}
m(\dot{u} - v_m r - x_G r^2) &= F_{x0} \\
m(\dot{v}_m + x_G \dot{r} + u r) &= F_{y0} \\
I_{zG} \dot{r} &= M_z .
\end{aligned} \tag{2}

Fig. 1. Coordinate systems of 3-DoF motion.

Here, 𝑚 is the mass of the ship; 𝑥𝐺 is the distance to the midship from the center of gravity; 𝐼𝑧𝐺 is the moment of inertia in the yaw direction at the center of gravity; and 𝐹𝑥0, 𝐹𝑦0, 𝑀𝑧 are the hydrodynamic forces and moment acting on the ship in the surge, sway, and yaw directions, respectively.

The MMG model (e.g., Ogawa and Kasai, 1978) was used to solve Eq. (2). The MMG model is a commonly used mathematical model for ship maneuvers. The MMG model assumes that the hydrodynamic force is quasi-steady. Hence, the time derivative of the motion variables 𝒙̇ is simulated by the equations of the MMG model 𝒇MMG as follows:

\dot{\boldsymbol{x}}(t) = \boldsymbol{f}_{\mathrm{MMG}}\left\{ \boldsymbol{x}(t), \boldsymbol{a}(t), \boldsymbol{w}(t) \right\} . \tag{3}

The MMG model has various choices of sub-components to express the hydrodynamics acting on the hull, propeller, and rudder of the ship. Here, we used the same MMG model as in the literature (Miyauchi et al., 2022a), which incorporates maneuvering models for large-drift-angle motion (Yoshimura et al., 2009) and propeller reversal (Hasegawa and Fukutomi, 1994) during the berthing maneuver.
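For illustration only, the following Python sketch shows how a simulation based on Eq. (3) can be advanced over one 5-s control interval. The function f_mmg is a stand-in for 𝒇MMG, not the MMG implementation used in this study; only the kinematic terms are kept so that the sketch runs.

```python
import numpy as np

def f_mmg(x, a, w):
    """Stand-in for the MMG state equation f_MMG(x, a, w) of Eq. (3).

    The real model evaluates hull, propeller, and rudder forces
    (Miyauchi et al., 2022a); here only the kinematic part is kept so
    the sketch runs, and the velocity derivatives are set to zero.
    """
    _, u, _, v_m, psi, r = x
    return np.array([
        u * np.cos(psi) - v_m * np.sin(psi),  # d x / dt
        0.0,                                  # d u / dt   (dynamics omitted)
        u * np.sin(psi) + v_m * np.cos(psi),  # d y / dt
        0.0,                                  # d v_m / dt (dynamics omitted)
        r,                                    # d psi / dt
        0.0,                                  # d r / dt   (dynamics omitted)
    ])

def step_ship(x, a, w, control_interval=5.0, dt=0.1):
    """Advance Eq. (3) over one 5-s control step with explicit Euler sub-steps."""
    for _ in range(int(control_interval / dt)):
        x = x + dt * f_mmg(x, a, w)
    return x

x0 = np.array([7 * 3.0, 0.25, 0.0, 0.0, np.pi, 0.0])          # rough initial state
x1 = step_ship(x0, a=np.array([0.0, 5.0]), w=np.array([180.0, 0.5]))
```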
2.2. Reinforcement learning

Reinforcement learning (RL) (Sutton and Barto, 2018) is an area of machine learning wherein an agent learns what actions to take through interactions with its surrounding environment. The agent observes the current state in the environment and decides the next action based on state observations. The decision rule for the action is called a policy and is represented as a function or probability distribution. Recent RL algorithms have adopted (deep) neural networks for policy representation (Schulman et al., 2015, 2017; Espeholt et al., 2018). The agent transitions to a new state by taking an action every unit of time and receives a reward from the environment. The function that calculates the reward value is called the reward function. The unit time at which the agent performs decision-making is called a step, and a series of steps from the beginning to the end of the interactions is called an episode. RL algorithms aim to find the policy that maximizes the expected cumulative reward obtained in an episode by repeating the interaction between the agent and the environment. In this study, the agent and the environment refer to the ship and the maneuvering simulation model, respectively. Furthermore, the agent's policy indicates the ship's control law.
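The interaction loop described above can be summarized in a few lines; env and policy are hypothetical stand-ins with the usual reset/step interface, not objects defined in this study.

```python
def run_episode(env, policy, max_steps=40):
    """One episode of agent-environment interaction.

    `env` must provide reset() -> state and step(action) -> (state, reward, done);
    `policy` maps a state to an action. Both are hypothetical stand-ins here.
    """
    state = env.reset()
    episode_return = 0.0
    for _ in range(max_steps):
        action = policy(state)                   # decision rule (the policy)
        state, reward, done = env.step(action)   # one unit-time transition
        episode_return += reward                 # cumulative reward of the episode
        if done:
            break
    return episode_return
```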


RL algorithms have been successfully applied to various problems such as robot control (Kober et al., 2013) and Go (Silver et al., 2017; Schrittwieser et al., 2020). The advantage of RL algorithms is that they can learn the policy without knowing the exact mathematical forms of the state transition model and reward function, indicating that they can be applied even to complex nonlinear environments.

3. Problem definition

Simulations were performed at the scale of a model ship of the VLCC M.V. Esso Osaka, whose scale was 1/108.33. Table 1 lists the main characteristics of the ship. This study aims to obtain the (online) control law for the berthing problem using the RL framework. We used two different state definitions, depending on whether the wind speed and direction were observable. If they are unobservable, the definition of the state 𝒔 is the same as that of the motion variable vector 𝒙. In other words, the definition of the state is given by

\boldsymbol{s} \equiv \boldsymbol{x} . \tag{4}

If the wind speed and direction are observable, the definition of the state is given by

\boldsymbol{s} \equiv (\boldsymbol{x}^{\mathsf{T}}, \boldsymbol{w}^{\mathsf{T}})^{\mathsf{T}} \tag{5}
= (x, u, y, v_m, \psi, r, \gamma_T, U_T)^{\mathsf{T}} \in \mathbb{R}^{8} . \tag{6}

Table 1
Principal particulars of the ship used in this study.
  Item                    Value
  Length: 𝐿𝑝𝑝             3.00 (m)
  Breadth: 𝐵              0.49 (m)
  Draft: 𝑑                0.20 (m)
  Block coefficient: 𝐶𝑏   0.83

The action of the agent corresponds to the action vector 𝒂 = (𝛿, 𝑛)ᵀ. Table 2 lists the lower and upper bounds of the rudder angle 𝛿 and the propeller speed 𝑛.

Table 2
Ranges of action.
  Item   Range
  𝛿      [−35.0, 35.0] (deg)
  𝑛      [−20.0, 20.0] (rps)

Each episode ends in 40 steps or when the ship successfully berths at the target position. The values of each dimension of the target state 𝒔target are listed in Table 3. An episode does not end after a collision; instead, it continues as if there were no obstacles. A single step corresponds to 5 s of ship movement. In other words, the rudder angle and propeller speed can be changed at intervals of 5 s. We consider berthing successful when the difference between the current and target states is within the acceptable errors shown in Table 4. The initial state of the ship is randomly sampled from the ranges listed in Table 5. The initial angle 𝜓0 is the angle from the initial position to the target position. In other words, the ship is initialized facing the direction of the target position.

Table 3
Target state 𝒔target.
  Item   Value
  𝑥      −𝐿𝑝𝑝 (m)
  𝑢      0 (m/s)
  𝑦      1.5𝐵 (m)
  𝑣𝑚     0 (m/s)
  𝜓      𝜋 (rad)
  𝑟      0 (rad/s)

Table 4
Acceptable errors for berthing.
  Item   Value
  𝑥      ±0.1 (m)
  𝑢      ±0.05 (m/s)
  𝑦      ±0.1 (m)
  𝑣𝑚     ±0.05 (m/s)
  𝜓      ±𝜋∕180 (rad)
  𝑟      ±𝜋∕180 (rad/s)

Table 5
Ranges of initial state for training by RL.
  Item   Value
  𝑥      [6𝐿𝑝𝑝, 8𝐿𝑝𝑝] (m)
  𝑢      [0.1977353171, 0.2966029756] (m/s)
  𝑦      [−2𝐿𝑝𝑝, 2𝐿𝑝𝑝] (m)
  𝑣𝑚     0.0 (m/s)
  𝜓      𝜓0 (rad)
  𝑟      0.0 (rad/s)

Fig. 2 shows an aerial photograph of Inukai Pond, the pond used in this study, and a visualization of the environment created based on the pond. We conducted numerical experiments using a simulator with this harbor geometry. Therefore, the berthing policy should consider the spatial constraints of the pond. For example, the ship should avoid colliding with the pier's corner when starting at initial positions of 𝑦 ≥ 0.0.
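The success and termination conditions of this problem definition can be written directly from Tables 1, 3, and 4; the following is a minimal sketch.

```python
import numpy as np

L_PP, B = 3.00, 0.49                      # ship length and breadth (Table 1)
DEG = np.pi / 180.0

S_TARGET = np.array([-L_PP, 0.0, 1.5 * B, 0.0, np.pi, 0.0])   # target state (Table 3)
TOLERANCE = np.array([0.1, 0.05, 0.1, 0.05, DEG, DEG])        # acceptable errors (Table 4)

def berthing_succeeded(s):
    """True when every motion-variable component is within the acceptable error."""
    return bool(np.all(np.abs(s[:6] - S_TARGET) <= TOLERANCE))

def episode_done(s, step):
    """An episode ends after 40 steps or on successful berthing;
    a collision does not terminate the episode."""
    return step >= 40 or berthing_succeeded(s)
```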

4. Proposed method and evaluation metrics

RL is a promising approach for learning a berthing control policy. However, in general, RL algorithms require many trials and errors to obtain a good policy from scratch, implying that many simulations are required.

A previous study (Miyauchi et al., 2022b) successfully provided the trajectory (sequence of states and actions) for a given environment in an offline optimization manner. We consider this offline control path to be useful in accelerating online policy learning. The trajectories calculated by offline optimization can be used to train the policy model in a supervised learning (SL) manner because they provide a set of state variables and the corresponding actions to be performed.

Training a policy using reference trajectories based on SL is called behavioral cloning (Bain and Sammut, 1999), where training is more efficient than RL. However, a policy trained by SL is generally sensitive to changes in the initial state or disturbances (Bain and Sammut, 1999), and prediction errors, once generated, lead to further errors (Ross and Bagnell, 2010). In general, a policy trained only by SL cannot generalize well because it only experiences the limited state and action pairs in the training dataset.

In summary, SL is advantageous for cloning given trajectories; however, it does not generalize well with a limited amount of training data. By contrast, RL is expected to obtain a generalized control policy in exchange for high computational costs. This study proposes combining SL and RL to exploit their advantages and accelerate the learning of berthing control policies. Our strategy is to use the control policy trained by SL as a good initial policy for RL.

The policy-training procedure of the proposed method comprises the following two phases (a schematic code sketch follows below).

1. Train a policy model, represented as a neural network, by SL using a training dataset $\mathcal{D}_{\mathrm{train}}$ collected by offline optimization (Miyauchi et al., 2022b). In the offline optimization, the actuator states are restricted to 𝛿 ∈ [−25, 25] and 𝑛 ∈ [−10, 10] to generate trajectories with some manipulation margin.
2. Train the policy model by RL using the model parameters trained in Phase 1 as the initial parameters.

This two-phase training procedure is simple; however, it is expected to accelerate the overall training process because RL can warm start with good initial model parameters obtained by SL. After training the policy model, we evaluated it using the evaluation metrics described in Section 4.3.
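Schematically, the proposed procedure reduces to the following ordering of the two phases; train_sl and train_rl are placeholders for the procedures of Sections 4.1 and 4.2, passed in as callables.

```python
def train_berthing_policy(policy, offline_trajectories, env, train_sl, train_rl):
    """Two-phase training: SL warm start (Phase 1) followed by RL (Phase 2).

    train_sl(policy, data) -> params        : behavioral cloning, Section 4.1
    train_rl(policy, env, init) -> params   : policy-gradient RL,  Section 4.2
    Both are assumed callables; this sketch only fixes their order.
    """
    theta_sl = train_sl(policy, offline_trajectories)   # Phase 1: clone offline trajectories
    return train_rl(policy, env, init=theta_sl)         # Phase 2: RL from the SL parameters
```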


Fig. 2. Aerial photograph of Inukai Pond and a visualization of the environment created based on the pond.

4.1. Phase 1: Supervised learning

We collect 𝑁 trajectories for successful berthing calculated using the offline optimization of Miyauchi et al. (2022b). We denote the 𝑖th trajectory by $\boldsymbol{\tau}_i = \{(\boldsymbol{s}_1^{(i)}, \boldsymbol{a}_1^{(i)}), \ldots, (\boldsymbol{s}_{T_i}^{(i)}, \boldsymbol{a}_{T_i}^{(i)})\}$, where $T_i$ denotes the trajectory length of the 𝑖th berthing data. The training dataset for supervised learning is then given by $\mathcal{D}_{\mathrm{train}} = \{\boldsymbol{\tau}_1, \ldots, \boldsymbol{\tau}_N\}$.

A neural network with trainable model parameters 𝜃, denoted by $y_{\theta} : \mathbb{R}^{D} \to \mathbb{R}^{2}$, was adopted as the policy model and trained to minimize the following mean squared error (MSE) function, where 𝐷 indicates the state's dimension (𝐷 = 8 if the wind speed and direction are observable; otherwise, 𝐷 = 6):

\mathrm{MSE}(\theta; \mathcal{D}_{\mathrm{train}}) = \sum_{i=1}^{N} \sum_{j=1}^{T_i} \left\| y_{\theta}(\boldsymbol{s}_j^{(i)}) - \boldsymbol{a}_j^{(i)} \right\|^{2} \tag{7}

Neural network training was performed by stochastic gradient descent using backpropagation, a standard neural network training algorithm. We monitored the validation error, i.e., the MSE on a separate dataset from $\mathcal{D}_{\mathrm{train}}$, and selected the model parameters with the smallest validation error to prevent overfitting.
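A minimal PyTorch sketch of the SL phase (Eq. (7)) is shown below, assuming the trajectories have already been flattened into state and action tensors. The optimizer settings follow Section 5.1.2, with the L2 coefficient mapped to Adam's weight_decay (an assumption), and the output scaling to the action bounds is omitted.

```python
import torch
import torch.nn as nn

D = 6  # state dimension: 6 without wind observation, 8 with wind

# Policy network: three hidden layers of 64 tanh units, two outputs (delta, n).
policy = nn.Sequential(
    nn.Linear(D, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, 2),
)

# Adam with learning rate 0.001 and L2 coefficient 0.03 (Section 5.1.2).
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3, weight_decay=0.03)

def sl_epoch(states, actions, batch_size=32):
    """One epoch of minimizing the squared error of Eq. (7) over (state, action) pairs."""
    perm = torch.randperm(states.shape[0])
    for start in range(0, states.shape[0], batch_size):
        idx = perm[start:start + batch_size]
        loss = ((policy(states[idx]) - actions[idx]) ** 2).sum(dim=1).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # After each epoch, evaluate the MSE on the held-out trajectory and keep
    # the parameters with the smallest validation error (model selection).
```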
4.2. Phase 2: Reinforcement learning

We used the simulator described in Section 3. A reward function that evaluates the goodness of the agent's situation should be designed for an RL algorithm. A reward function should not depend on a specific simulator model because the simulator model may be replaced in the future owing to refinement of the model and the introduction of further disturbances. Fu et al. (2018) stated that a reward function that takes only the current state as input is robust to changes in the model. Thus, this study uses a reward function expressed by the following formulas, which take only the current state 𝒔 as the input:

r(\boldsymbol{s}) = r'(\boldsymbol{s}) + c_{\mathrm{suc}} \mathbf{1}_{\mathrm{suc}}(\boldsymbol{s}) + c_{\mathrm{obs}} \mathbf{1}_{\mathrm{obs}}(\boldsymbol{s}) , \tag{8}

r'(\boldsymbol{s}) = -\sqrt{ \sum_{i=1}^{6} \left( \frac{s_i - s_i^{\mathrm{target}}}{b_i} \right)^{2} } , \tag{9}

where $s_i$ and $s_i^{\mathrm{target}}$ denote the 𝑖th dimension of the state and the 𝑖th dimension of the target state, respectively, and $b_i$ indicates the range of the 𝑖th dimension of the acceptable errors. The coefficients $c_{\mathrm{suc}}$ and $c_{\mathrm{obs}}$ are constant values that balance each term in Eq. (8), and $\mathcal{S}_{\mathrm{suc}}, \mathcal{S}_{\mathrm{obs}} \subset \mathcal{S}$ represent the set of states that have successfully berthed and the set of states that have collided with the obstacles, respectively. The indicator functions $\mathbf{1}_{\mathrm{suc}}$ and $\mathbf{1}_{\mathrm{obs}}$ are defined as follows:

\mathbf{1}_{\mathrm{suc}}(s) = \begin{cases} 1 & (s \in \mathcal{S}_{\mathrm{suc}}) \\ 0 & (s \notin \mathcal{S}_{\mathrm{suc}}) , \end{cases} \tag{10}

\mathbf{1}_{\mathrm{obs}}(s) = \begin{cases} 1 & (s \in \mathcal{S}_{\mathrm{obs}}) \\ 0 & (s \notin \mathcal{S}_{\mathrm{obs}}) . \end{cases} \tag{11}

The second and third terms of Eq. (8) represent the reward for successful berthing and the penalty for colliding with obstacles, respectively. In this study, we consider that a collision occurs when any of the four vertices, or the internally dividing points of the edges, of the rectangle surrounding the ship are within the obstacles.

During the RL phase, we further trained the policy model obtained in the SL phase. Therefore, we used policy gradient-based RL methods rather than value function-based RL methods, such as Q-learning (Sutton and Barto, 2018), to explicitly optimize the policy model. We can use any policy gradient-based RL algorithm in our proposed method and adopt trust region policy optimization (TRPO) (Schulman et al., 2015, 2016) in the following experiments. The initial state was randomly determined from the ranges listed in Table 5. Therefore, the trained policy after the RL phase should be generalized for various initial states.
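A direct transcription of Eqs. (8)–(11) is given below; in_success_set and in_obstacle_set are placeholders for the membership tests of $\mathcal{S}_{\mathrm{suc}}$ and $\mathcal{S}_{\mathrm{obs}}$ (the rectangle-based collision test is not reproduced here).

```python
import numpy as np

def reward(s, s_target, b, c_suc, c_obs, in_success_set, in_obstacle_set):
    """Reward of Eq. (8): shaped distance term plus success and collision terms.

    s, s_target, b : first six state components, target values, and the
                     acceptable-error ranges of Table 4 (the b_i in Eq. (9)).
    in_success_set, in_obstacle_set : callables playing the role of the
                     indicator functions of Eqs. (10) and (11).
    """
    r_prime = -np.sqrt(np.sum(((s[:6] - s_target) / b) ** 2))   # Eq. (9)
    r = r_prime                                                  # Eq. (8)
    if in_success_set(s):      # 1_suc(s) = 1
        r += c_suc
    if in_obstacle_set(s):     # 1_obs(s) = 1, with c_obs < 0 acting as a penalty
        r += c_obs
    return r
```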
4.3. Evaluation metrics

We propose quantitative evaluation metrics to compare the control policies obtained using different methods. Note that proposing these quantitative evaluation metrics is a contribution of this study because previous studies did not exploit such metrics to evaluate the effectiveness of berthing policies. Specifically, we ran the obtained policy for 100 episodes (100 berthing simulations) from random initial states and measured the following three quantitative metrics:

Safety risk: The number of episodes that collide with obstacles. The smaller the better.

Success rate: The number of episodes wherein the ship successfully berthed without colliding with the obstacles. The larger the better.

Stability: The median and interquartile range of the minimum value of |𝑟′(𝑠)| in each episode with neither success nor collision. The smaller the better.

In the safety risk metric, we evaluated how safely the berthing policy controls the ship. It is critical not to collide with obstacles during a ship's actual berthing. The success rate evaluates the accuracy of the policy in berthing successfully. The stability metric evaluates how stably the policy approaches the pier and how much variance there is in its behavior when it neither collides with the pier nor successfully berths. In this study, berthing is considered successful if the difference between the current and target states is within the acceptable error range. This condition should be satisfied when |𝑟′(𝑠)| is minimized. Therefore, the minimum value of |𝑟′(𝑠)| was used to evaluate stability. The median is a metric that evaluates how stably the policy approaches the target state; the smaller the median, the more stable it is. The interquartile range is a metric of behavior variance; the smaller the range, the lower the variance in the behavior. Using these metrics, we can compare two policies as follows. First, we compare them in terms of the number of collisions because it is the most crucial factor in preventing accidents. If the numbers of collisions are approximately the same, we can compare them based on the number of successes. When the numbers of collisions and successes are both similar, the stability metric can be used to compare them because it is calculated from the cases with neither collisions nor successes.

Furthermore, to evaluate the change in policy behavior depending on the initial position, we divided the range of the initial position into a 4 × 4 grid and evaluated the safety risk and success rate in each region.
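Given per-episode evaluation logs, the three metrics can be computed as in the following sketch; episodes is a hypothetical list of records with collided, succeeded, and min_abs_r_prime fields.

```python
import numpy as np

def evaluate_policy(episodes):
    """Compute safety risk, success rate, and stability from evaluation episodes.

    Each record in `episodes` is assumed to carry:
      collided        : True if the episode hit an obstacle
      succeeded       : True if berthing succeeded without collision
      min_abs_r_prime : min |r'(s)| over the episode (used only when the
                        episode neither succeeded nor collided)
    """
    n_collisions = sum(e["collided"] for e in episodes)    # safety risk
    n_successes = sum(e["succeeded"] for e in episodes)    # success rate
    rest = [e["min_abs_r_prime"] for e in episodes
            if not e["collided"] and not e["succeeded"]]
    med = float(np.median(rest)) if rest else float("nan")
    iqr = (float(np.percentile(rest, 75) - np.percentile(rest, 25))
           if rest else float("nan"))
    return n_collisions, n_successes, med, iqr             # stability = (MED, IQR)
```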
5. Experiment and result

We conducted two experiments: the first was an evaluation in an environment without wind disturbances, and the second was an evaluation in an environment with wind disturbances. The objective of the experiments is to evaluate the effectiveness of the combination of SL and RL through comparison with the SL and RL algorithms alone.

5.1. Experimental settings

5.1.1. Common settings
The policy neural network had three hidden layers comprising 64 units each and a hyperbolic tangent activation function. The output layer comprises two units corresponding to the action values. The output values are scaled by the following equation because the action values 𝛿 and 𝑛 have lower and upper bounds, as shown in Table 2:

z = \frac{a_{\max} - a_{\min}}{2} \tanh \bar{z} + \frac{a_{\max} + a_{\min}}{2} , \tag{12}

where $a_{\max}$ and $a_{\min}$ denote the upper and lower bounds of an action value, and $\bar{z}$ and $z$ are the outputs before and after adjusting the scale, respectively.
values for the training steps are described in each experimental section.
5.1.2. Settings for supervised learning
To train the weight parameters of the policy neural network, we used the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.001 and L2 regularization with a coefficient of 0.03. The MSE function shown in Eq. (7) was used as the loss function. We used offline optimization to generate 40 trajectories with different initial positions and speeds. In particular, we used two initial speed settings of 0.1977 and 0.2966 and 20 uniformly distributed initial positions. Subsequently, we randomly selected 39 trajectories as training data $\mathcal{D}_{\mathrm{train}}$, and the remaining trajectory was used as validation data. Fig. 3 shows the initial positions and directions of the offline-calculated trajectories. We trained the neural network for 30,000 epochs with a minibatch size of 32. Wind disturbances did not occur when the trajectories were collected.

Fig. 3. Spatial distribution of the offline-calculated trajectories. Each green arrow represents the initial position and the direction of the trajectory. The point corresponds to the coordinates of the midship, and the arrow's direction corresponds to the heading angle. The arrow's size has no meaning.

5.1.3. Settings for reinforcement learning
We use TRPO (Schulman et al., 2015, 2016) to train the policy. TRPO is a major policy gradient-based RL algorithm that introduces a constraint on the update strength of the parameters of the policy neural network, resulting in stable optimization of the policy neural network. We used TRPO in our proposed method because we expect it to prevent loss of the control knowledge obtained in the SL phase owing to an excessively large update strength. As a stochastic policy is required during policy training in TRPO, the action values are sampled from a Gaussian distribution to make the action stochastic. In particular, the mean values of the Gaussian distribution correspond to the output values of the policy neural network, and their variance parameters were added as learnable parameters, independent of the network. After policy training, i.e., in the evaluation phase, we used the output values of the network as the action instead of sampling from the Gaussian distribution. In other words, the control policy in the evaluation phase was deterministic.

We implemented TRPO using PyTorch (version 1.10.0) (Paszke et al., 2019) based on ChainerRL (version 0.8.0) (Fujita et al., 2021) and Spinning Up (version 0.2) (Achiam, 2018). The neural networks and probability distributions were implemented as done in ChainerRL; specifically, this includes the policy network, the value function, and the scaling in Eq. (12). We implemented the training code for TRPO in the same manner as in Spinning Up, and most of the hyperparameters were the same as those used in it. We used different values for the training steps and the number of layers in the policy network. We set the number of layers in the policy network to three, as described in Section 5.1.1. For the value function in the TRPO algorithm, a neural network with two hidden layers comprising 64 units and hyperbolic tangent activation was used, which is the default setting in Spinning Up. The specific values for the training steps are described in each experimental section.

For the reward function in Eq. (8), we set 𝑐suc = 5000 and tested multiple values of 𝑐obs for each experiment. The coefficient 𝑐obs < 0 determines the significance of the penalty term for collision, where a greater magnitude of 𝑐obs penalizes collisions more heavily. The reason for experimenting with multiple collision penalty coefficients 𝑐obs is that this is an essential parameter that affects the difficulty of the task. For example, if the penalty magnitude is too small, the control policy may be allowed to collide; conversely, if the penalty magnitude is too large, the policy is expected to avoid approaching the pier and may fail to berth.
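The stochastic policy used during TRPO training can be sketched as follows: the network output gives the Gaussian mean, a state-independent log standard deviation is learned alongside it, and the mean itself is used at evaluation time. This is a simplified illustration, not the ChainerRL/Spinning Up implementation used in this study.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Diagonal Gaussian policy: mean from the network, learnable log-std."""

    def __init__(self, state_dim, action_dim=2, hidden=64):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim),
        )
        # Variance parameters are learnable and independent of the state.
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state, deterministic=False):
        mean = self.mean_net(state)
        if deterministic:            # evaluation phase: use the mean directly
            return mean
        dist = torch.distributions.Normal(mean, self.log_std.exp())
        return dist.sample()         # training phase: stochastic action
```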


5.2. Experiment 1: Environment without wind disturbances

In Experiment 1, we used the state without the wind speed and direction defined in Eq. (1), indicating that wind disturbances did not occur in the RL and evaluation phases. This experiment aims to evaluate the effectiveness of the proposed method and investigate the effect of the collision penalty in a simple setting without wind disturbances.

First, we performed SL using the generated trajectories. Fig. 4 shows the transition of losses for the SL. As shown in Fig. 4, the validation loss is minimum at the 18,000th epoch; therefore, we performed RL using the parameters at the 18,000th epoch as the initial values of the policy network.

Fig. 4. Loss value of supervised learning in Experiment 1. Validation loss is minimum at the 18,000th epoch.

Subsequently, we performed RL and trained the policy for 10⁸ steps with four collision penalties 𝑐obs ∈ {−500, −1000, −1500, −2000}. Table 6 shows the training results for each penalty. "MED" and "IQR" in Table 6 represent the median and interquartile range of the minimum value of |𝑟′(𝑠)| in the episodes, respectively. Table 6 also lists the results when SL and RL were performed separately for comparison. "SL" indicates the result of the policy obtained in the SL phase, and the rows with "(w/o SL)" refer to the results of RL alone; that is, the parameters of the policy network are randomly initialized.

Table 6
Results of the environment without the wind in Experiment 1. Each row shows the results measured in 100 episodes for the corresponding penalty, where "SL" is the result of SL alone, and rows with "(w/o SL)" are the results of RL alone. "MED" and "IQR" represent the median and interquartile range of the minimum value of |𝑟′(𝑠)| in the episodes, respectively.

  Method                    Safety risk    Success rate   Stability
                            # Collisions   # Successes    MED      IQR
  SL                        51             0              12.02    2.41
  𝑐obs = −500               1              83             4.24     18.54
  𝑐obs = −1000              0              77             4.70     12.92
  𝑐obs = −1500              1              0              11.90    7.08
  𝑐obs = −2000              0              0              9.97     5.07
  𝑐obs = −500 (w/o SL)      0              8              9.62     9.44
  𝑐obs = −1000 (w/o SL)     0              0              13.21    9.61
  𝑐obs = −1500 (w/o SL)     0              4              9.53     7.44
  𝑐obs = −2000 (w/o SL)     0              0              11.44    7.38

These results reveal that the number of collisions is much lower when using SL and RL together than when using SL alone. In addition, the number of successes increased considerably for penalties of −500 and −1000, indicating that using SL and RL together is effective in reducing the number of collisions and increasing the number of successes. However, when the penalty coefficient was set to −1500 or −2000, the number of successes did not increase. This is caused by the high magnitude of the penalty, which prevents the ship from approaching the pier and thus from learning. The inability of the ship to approach the pier can be inferred from the fact that the median values (MED) are greater than those of the penalties of −500 and −1000. Penalties of −500 and −1000 have larger IQR values than the others. However, these two are not comparable to the others because the number of trajectories used to calculate the IQR is small owing to the many successes in these two cases.

When using RL alone, the number of collisions is 0 regardless of the penalty; however, the number of successes is lower than that of the proposed method. This is because RL alone required considerable time to learn a good control policy and could not increase the number of successes. These results indicate that the proposed combination of SL and RL can obtain the control policy more efficiently than both the naive SL and RL methods.

Fig. 5. Safety risk (# Collisions) and success rate (# Successes) of different initial position areas using the proposed method in Experiment 1. Numbers in brackets indicate the penalty value 𝑐obs. We measured each metric by running 100 episodes from each initial position area. The coordinates of the initial position areas indicate the coordinates of the O − 𝑥𝑦 system divided by the ship length 𝐿𝑝𝑝. The origin of the coordinate system is the upper-left corner of the berth.

Fig. 5 shows the numbers of collisions and successes in different initial position areas with each penalty. Figs. 5(a) and 5(c) show that several collisions occur at the positions of 𝑦∕𝐿𝑝𝑝 ≤ 0.0, whereas Figs. 5(b) and 5(d) show that the numbers of successes at the positions of 𝑦∕𝐿𝑝𝑝 ≤ 0.0 are smaller than those at the positions of 𝑦∕𝐿𝑝𝑝 ≥ 0.0. These results are contrary to intuition and require careful consideration of the causes.


Fig. 6. Temporal changes in states and actions of an offline-calculated trajectory and a trajectory generated by the policy.

Intuitively, berthing from the positions of 𝑦∕𝐿𝑝𝑝 ≤ 0.0 appears to be easier than that from the positions of 𝑦∕𝐿𝑝𝑝 ≥ 0.0 because avoiding the pier when the ship berths from the positions of 𝑦∕𝐿𝑝𝑝 ≤ 0.0 is unnecessary, whereas berthing from the positions of 𝑦∕𝐿𝑝𝑝 ≥ 0.0 requires avoidance. The characteristics of the training data are one possible explanation for these results. As the method of Miyauchi et al. (2022b) formulates the berthing problem as time minimization, excessive acceleration occurs in the early stages of the generated trajectories when the initial speed is low. Learning to accelerate at the beginning makes subsequent adjustments more complicated and prevents successful berthing because there is less space to adjust the speed and heading angle when berthing from initial positions of 𝑦∕𝐿𝑝𝑝 ≤ 0.0. Figs. 5(e)–5(h) show that most of the values are 0, indicating that the policy training did not progress sufficiently.

Fig. 6 shows the temporal changes in states and actions of an offline-calculated trajectory and a trajectory generated by the policy trained with a penalty of −1000. This figure illustrates that the intermediate control process is very different even though the initial position and velocity are the same. For example, the heading angle frequently increases and decreases along the trajectory of the obtained policy. In addition, the trajectory of the obtained policy lowered the propeller speed to approximately −20, which was not used in the training dataset. This result suggests that restricting the action range in the training data is insufficient to ensure the manipulation margin and that some improvement is also necessary for RL.

Fig. 7. Loss value of supervised learning in Experiment 2. Validation loss is minimum at the 25,000th epoch.

5.3. Experiment 2: Environment with wind disturbances


Table 7
Results of the environment with the wind in Experiment 2. The "Steps" column represents the total number of training steps: up to 1.0 × 10⁸ steps, the policies were trained without the wind, and between 1.0 × 10⁸ and 2.0 × 10⁸ steps, they were trained with the wind. We measured each metric in the environment with the wind. The item "N/A" indicates that the stability metric could not be calculated because all trajectories collided.

  Method                    Steps (×10⁸)   Safety risk    Success rate   Stability
                                           # Collisions   # Successes    MED      IQR
  SL                        N/A            45             0              16.65    2.32
  𝑐obs = −500               1.0            15             82             2.19     0.19
                            2.0            0              99             2.30     0.00
  𝑐obs = −1000              1.0            5              75             3.57     3.12
                            2.0            3              83             2.32     1.58
  𝑐obs = −500 (w/o SL)      1.0            75             0              18.56    13.48
                            2.0            3              0              8.95     5.33
  𝑐obs = −1000 (w/o SL)     1.0            100            0              N/A      N/A
                            2.0            0              0              13.94    7.11

In this section, we train the policy using the state with wind speed and direction represented by Eq. (6). Wind disturbances occurred during the RL and evaluation phases. The purpose of this experiment is to investigate whether the proposed method can obtain a good policy even in an environment with wind disturbances. During the SL phase, we filled in the value of 10⁻³² for the state values of the wind speed and direction input to the policy neural network because the wind speed and direction were excluded from the trajectories of the training dataset. Fig. 7 shows the transition of losses for the SL. As shown in Fig. 7, the validation loss is minimum at the 25,000th epoch; thus, we used the parameters at the 25,000th epoch as the initial parameters of the policy during the subsequent RL phase.

During the RL phase, we first trained the policy for 10⁸ steps in an environment without wind disturbances, where the wind speed and direction were set to 10⁻³², and then trained it for another 10⁸ steps in an environment with wind disturbances. This means that up to the first 10⁸ steps, the experimental procedure was the same as that in Experiment 1, and an additional 10⁸ training steps were performed to adapt the policy to the environment with wind disturbances. In the environment with wind disturbances, the wind speed and direction were randomly determined for each step (5 s of movement) in the ranges of [10⁻³², 1.0] [m/s] and [170, 190] [deg], respectively, and did not change during each step. We used two values for the collision penalty: 𝑐obs ∈ {−500, −1000}. The trained policies were evaluated in an environment with disturbances.
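The per-step wind sampling of Experiment 2 can be written directly as:

```python
import numpy as np

def sample_wind(rng):
    """Draw the wind disturbance for one 5-s step in Experiment 2.

    Wind speed U_T in [1e-32, 1.0] m/s and wind direction gamma_T in
    [170, 190] deg, both held constant within the step.
    """
    gamma_t = rng.uniform(170.0, 190.0)   # wind direction [deg]
    u_t = rng.uniform(1e-32, 1.0)         # wind speed over the ground [m/s]
    return gamma_t, u_t

rng = np.random.default_rng(0)
w = sample_wind(rng)   # one disturbance vector (gamma_T, U_T)
```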
Table 7 presents the results after 1.0 × 10⁸ and 2.0 × 10⁸ steps of training. Compared with naive RL, the combination of SL and RL reduces the number of collisions and increases the number of successes at 2.0 × 10⁸ steps, as in Experiment 1. Therefore, the proposed method was more efficient than the RL-only method. The SL-only method has numerous collisions and no successes, whereas the proposed method at 1.0 × 10⁸ steps shows fewer collisions and more successes in the environment with wind disturbances, although RL training was performed only in the environment without wind disturbances. This result implies that good policies for the environment without wind disturbances can be generalized to the wind setting to some extent, and the proposed method can obtain such policies. Comparing the results at 1.0 × 10⁸ and 2.0 × 10⁸ steps, additional training improves the performance in an environment with wind disturbances. Overall, the proposed method with the 𝑐obs = −500 penalty coefficient had the lowest safety risk and the highest success rate.

Fig. 8. Safety risk (# Collisions) and success rate (# Successes) of different initial position areas at 2.0 × 10⁸ training steps by the proposed method in Experiment 2. Numbers in brackets indicate the penalty value 𝑐obs. We measured each metric by running 100 episodes from each initial position area in the environment with the wind. The coordinates of the initial position areas indicate the coordinates of the O − 𝑥𝑦 system divided by the ship length 𝐿𝑝𝑝. The origin of the coordinate system is the upper-left corner of the berth.

Fig. 8 shows the numbers of collisions and successes in different initial position areas under wind disturbances for the policies trained for 2.0 × 10⁸ steps using the proposed method with each penalty. In Fig. 8(c), the numbers of collisions are higher for the positions of 𝑦∕𝐿𝑝𝑝 ≤ 0.0, whereas in Fig. 8(d), the numbers of successes are smaller at those positions, which is similar to Experiment 1. However, Figs. 8(a) and 8(b) show that the overall numbers of collisions are small and the numbers of successes are large, indicating that the policy obtained with a −500 penalty has high performance.

Fig. 9 shows the temporal changes in the states and actions when the policy trained for 2.0 × 10⁸ steps with a penalty of −500 was run under several initial conditions. The figure shows that a single control policy can generate reasonable berthing trajectories under multiple initial conditions.

6. Discussion

The experimental results show that the control laws obtained by the proposed method have higher success rates and lower safety risks than the naive SL and RL methods. The results imply that online berthing control under spatial constraints without reference trajectories is feasible, which has not been achieved in previous studies.

However, our results should be interpreted within the context of several limitations. The first limitation is that we only employed collision avoidance to improve safety. Future research should focus on ways to improve safety, such as prohibiting actions that are difficult to recover from, and introducing flexible control devices, such as side thrusters, to increase redundancy. Furthermore, safety risk should be evaluated in terms of the number of collisions as well as the ship's speed and distance from the obstacles, as demonstrated in Miyauchi et al. (2022b).

Another limitation is that we experimented only with specific models and port geometries. Therefore, improving generalizability is an important research direction. One possible improvement is to support arbitrary port geometries and target locations. To achieve this, it would be effective to express the state as a value relative to the target or to add the port geometry as an input. The distance between the ship and the obstacles in multiple directions is one method for representing the port geometry (one possible encoding is sketched below). Extending the algorithm to handle the addition of various actuators is another possible improvement. Incorporating ideas from meta-learning (Huisman et al., 2021) or transfer learning (Zhuang et al., 2021) could be beneficial. Developing control policies that are robust against modeling errors is also important. Meta-learning (Huisman et al., 2021) and adversarial training (Bai et al., 2021) are two concepts that may be useful for improving robustness. Furthermore, the control policy must be tested on a real ship to ensure that it functions properly in real-world situations.
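As one concrete reading of the distance-based representation mentioned above, the port geometry could be encoded as the distance from the ship to the nearest obstacle along several bearings. The following ray-casting sketch assumes an occupancy test occupied(x, y) is available; it illustrates the idea only and was not evaluated in this study.

```python
import numpy as np

def obstacle_distances(pos, psi, occupied, n_rays=8, max_range=30.0, step=0.5):
    """Distances to the nearest obstacle along n_rays bearings around the ship.

    pos      : (x, y) midship position in the O-xy frame
    psi      : heading angle [rad]
    occupied : callable (x, y) -> True if the point lies inside an obstacle
               (an assumed occupancy test, e.g. backed by a harbor polygon)
    Returns an array of length n_rays that could serve as extra policy inputs.
    """
    bearings = psi + np.linspace(0.0, 2.0 * np.pi, n_rays, endpoint=False)
    dists = np.full(n_rays, max_range)
    for i, th in enumerate(bearings):
        r = step
        while r < max_range:
            if occupied(pos[0] + r * np.cos(th), pos[1] + r * np.sin(th)):
                dists[i] = r
                break
            r += step
    return dists
```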


Fig. 9. Temporal changes in states and actions when the policy trained for 2.0 × 10⁸ steps with a penalty of −500 was run under several initial conditions.


In addition, quantitatively comparing the performance of the proposed method with existing path-following or trajectory-tracking control methods is necessary.

7. Conclusion

We proposed a method for obtaining an online berthing control law by combining SL and RL. The control policy neural networks were trained and evaluated in a simulator of a specific harbor geometry under various wind and initial conditions. Experimental results show that the proposed method can obtain control laws with higher success rates and lower safety risks than the naive SL and RL methods. The proposed method enables online control that considers port geometries without reference trajectories. We hope that this study will serve as an essential first step toward the practical application of automatic berthing and that future studies will focus on this topic.

CRediT authorship contribution statement

Shoma Shimizu: Methodology, Software, Investigation, Writing – original draft, Visualization. Kenta Nishihara: Methodology, Software, Writing – original draft. Yoshiki Miyauchi: Methodology, Software, Writing – original draft, Writing – review & editing. Kouki Wakita: Methodology, Software, Writing – review & editing. Rin Suyama: Investigation, Software, Writing – review & editing. Atsuo Maki: Conceptualization, Writing – original draft, Project administration, Funding acquisition. Shinichi Shirakawa: Writing – review & editing, Supervision, Project administration.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

The authors are unable or have chosen not to specify which data has been used.

Acknowledgments

This study was supported by a Grant-in-Aid for Scientific Research from the Japan Society for the Promotion of Science, Japan (JSPS KAKENHI Grants #19K04858 and #22H01701). We would like to thank Editage (www.editage.com) for English language editing.

References

Achiam, J., 2018. Spinning up in deep reinforcement learning. https://spinningup.openai.com/en/latest/. (Accessed 30 May 2022).
Ahmed, Y.A., Hasegawa, K., 2014. Experiment results for automatic ship berthing using artificial neural network based controller. IFAC Proc. Vol. 47 (3), 2658–2663.
Akimoto, Y., Miyauchi, Y., Maki, A., 2022. Saddle point optimization with approximate minimization oracle and its application to robust berthing control. ACM Trans. Evol. Learn. Optim. 2 (1). http://dx.doi.org/10.1145/3510425.
Bai, T., Luo, J., Zhao, J., Wen, B., Wang, Q., 2021. Recent advances in adversarial training for adversarial robustness. In: Zhou, Z.-H. (Ed.), Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence. IJCAI-21, International Joint Conferences on Artificial Intelligence Organization, pp. 4312–4321. http://dx.doi.org/10.24963/ijcai.2021/591, Survey Track.
Bain, M., Sammut, C., 1999. A framework for behavioural cloning. In: Machine Intelligence 15, Intelligent Agents [St. Catherine's College, Oxford, July 1995]. Oxford University, GBR, pp. 103–129.
Bitar, G., Martinsen, A.B., Lekkas, A.M., Breivik, M., 2020. Trajectory planning and control for automatic docking of ASVs with full-scale experiments. IFAC-PapersOnLine 53 (2), 14488–14494. http://dx.doi.org/10.1016/j.ifacol.2020.12.1451, 21st IFAC World Congress.
Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., Legg, S., Kavukcuoglu, K., 2018. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In: Dy, J., Krause, A. (Eds.), Proceedings of the 35th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 80, PMLR, pp. 1407–1416. URL: https://proceedings.mlr.press/v80/espeholt18a.html.
Fu, J., Luo, K., Levine, S., 2018. Learning robust rewards with adversarial inverse reinforcement learning. In: Proceedings of the International Conference on Learning Representations. ICLR. URL: https://openreview.net/forum?id=rkHywl-A-.
Fujita, Y., Nagarajan, P., Kataoka, T., Ishikawa, T., 2021. ChainerRL: A deep reinforcement learning library. J. Mach. Learn. Res. 22 (77), 1–14. URL: http://jmlr.org/papers/v22/20-376.html.
Hasegawa, K., Fukutomi, T., 1994. On harbour manoeuvring and neural control system for berthing with tug operation. In: Proc. of 3rd International Conference on Manoeuvring and Control of Marine Craft. MCMC'94, pp. 197–210.
Huisman, M., Van Rijn, J.N., Plaat, A., 2021. A survey of deep meta-learning. Artif. Intell. Rev. 54 (6), 4483–4541. http://dx.doi.org/10.1007/s10462-021-10004-4.
Kingma, D.P., Ba, J., 2015. Adam: A method for stochastic optimization. In: Proceedings of the International Conference on Learning Representations. ICLR. URL: http://arxiv.org/abs/1412.6980.
Kober, J., Bagnell, J.A., Peters, J., 2013. Reinforcement learning in robotics: A survey. Int. J. Robot. Res. 32 (11), 1238–1274. http://dx.doi.org/10.1177/0278364913495721.
Koide, T., Mizuno, N., 2021. Automatic berthing maneuvering of ship by reinforcement learning. In: The 8th SICE Multi-Symposium on Control Systems.
Li, S., Liu, J., Negenborn, R.R., Wu, Q., 2020. Automatic docking for underactuated ships based on multi-objective nonlinear model predictive control. IEEE Access 8, 70044–70057. http://dx.doi.org/10.1109/ACCESS.2020.2984812.
Maki, A., Akimoto, Y., Umeda, N., 2021. Application of optimal control theory based on the evolution strategy (CMA-ES) to automatic berthing (part: 2). J. Mar. Sci. Technol. 26, 835–845. http://dx.doi.org/10.1007/s00773-020-00774-x.
Maki, A., Sakamoto, N., Akimoto, Y., Nishikawa, H., Umeda, N., 2020. Application of optimal control theory based on the evolution strategy (CMA-ES) to automatic berthing. J. Mar. Sci. Technol. 25 (1), 221–233. http://dx.doi.org/10.1007/s00773-019-00642-3.
Martinsen, A.B., Bitar, G., Lekkas, A.M., Gros, S., 2020. Optimization-based automatic docking and berthing of ASVs using exteroceptive sensors: Theory and experiments. IEEE Access 8, 204974–204986. http://dx.doi.org/10.1109/ACCESS.2020.3037171.
Miyauchi, Y., Maki, A., Umeda, N., Rachman, D.M., Akimoto, Y., 2022a. System parameter exploration of ship maneuvering model for automatic docking/berthing using CMA-ES. J. Mar. Sci. Technol. http://dx.doi.org/10.1007/s00773-022-00889-3. URL: https://link.springer.com/10.1007/s00773-022-00889-3.
Miyauchi, Y., Sawada, R., Akimoto, Y., Umeda, N., Maki, A., 2022b. Optimization on planning of trajectory and control of autonomous berthing and unberthing for the realistic port geometry. Ocean Eng. 245, 110390. http://dx.doi.org/10.1016/j.oceaneng.2021.110390.
Mizuno, N., Kuroda, M., Okazaki, T., Ohtsu, K., 2007. Minimum time ship maneuvering method using neural network and nonlinear model predictive compensator. Control Eng. Pract. 15 (6), 757–765. http://dx.doi.org/10.1016/j.conengprac.2007.01.002, Special Section on Control Applications in Marine Systems.
Mizuno, N., Uchida, Y., Okazaki, T., 2015. Quasi real-time optimal control scheme for automatic berthing. IFAC-PapersOnLine 48 (16), 305–312. http://dx.doi.org/10.1016/j.ifacol.2015.10.297, 10th IFAC Conference on Manoeuvring and Control of Marine Craft (MCMC 2015).
Ogawa, A., Kasai, H., 1978. On the mathematical model of manoeuvring motion of ships. Int. Shipbuild. Prog. 25 (292), 306–319. http://dx.doi.org/10.3233/ISP-1978-2529202.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S., 2019. PyTorch: An imperative style, high-performance deep learning library. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (Eds.), Advances in Neural Information Processing Systems, Vol. 32. Curran Associates, Inc., pp. 8024–8035. URL: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
Rachman, D.M., Aoki, Y., Miyauchi, Y., Maki, A., 2022. Automatic docking (berthing) by dynamic positioning system with VecTwin rudder. In: Conference Proceedings of the Japan Society of Naval Architects and Ocean Engineers. Japan Society of Naval Architects and Ocean Engineers.
Rachman, D.M., Miyauchi, Y., Umeda, N., Maki, A., 2021. Feasibility study on the use of evolution strategy: CMA-ES for ship automatic docking problem. In: Proc. 1st International Conference on the Stability and Safety of Ships and Ocean Vehicles. STABS 2021.
Ross, S., Bagnell, D., 2010. Efficient reductions for imitation learning. In: Teh, Y.W., Titterington, M. (Eds.), Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research, vol. 9, pp. 661–668. URL: http://proceedings.mlr.press/v9/ross10a.html.
Sawada, R., Hirata, K., Kitagawa, Y., Saito, E., Ueno, M., Tanizawa, K., Fukuto, J., 2021. Path following algorithm application to automatic berthing control. J. Mar. Sci. Technol. 26, 541–554. http://dx.doi.org/10.1007/s00773-020-00758-x.


Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T., Silver, D., 2020. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature 588 (7839), 604–609. http://dx.doi.org/10.1038/s41586-020-03051-4.
Schulman, J., Levine, S., Abbeel, P., Jordan, M., Moritz, P., 2015. Trust region policy optimization. In: Bach, F., Blei, D. (Eds.), Proceedings of the 32nd International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 37, PMLR, Lille, France, pp. 1889–1897. URL: https://proceedings.mlr.press/v37/schulman15.html.
Schulman, J., Moritz, P., Levine, S., Jordan, M., Abbeel, P., 2016. High-dimensional continuous control using generalized advantage estimation. In: Proceedings of the International Conference on Learning Representations. ICLR.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O., 2017. Proximal policy optimization algorithms. arXiv:1707.06347.
Shouji, K., Ohtsu, K., Hotta, T., 1993. An automatic berthing study by optimal control techniques. IFAC Proc. Vol. 173, 221–229.
Shouji, K., Ohtsu, K., Mizoguchi, S., 1992. An automatic berthing study by optimal control techniques. IFAC Proc. Vol. 25 (3), 185–194.
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., Hassabis, D., 2017. Mastering the game of Go without human knowledge. Nature 550 (7676), 354–359. http://dx.doi.org/10.1038/nature24270.
Sutton, R.S., Barto, A.G., 2018. Reinforcement Learning: An Introduction. MIT Press.
Takai, H., Ohtsu, K., 1990. Automatic berthing experiments using "Shioji-Maru" (in Japanese). J. Jpn. Inst. Navig.
Takai, H., Yoshihisa, H., 1987. An automatic maneuvering system in berthing. In: Proceedings of 18th Ship Control Symposium.
Tran, V.L., Im, N., 2012. A study on ship automatic berthing with assistance of auxiliary devices. Int. J. Nav. Archit. Ocean Eng. 4 (3), 199–210. http://dx.doi.org/10.2478/IJNAOE-2013-0090.
Wakita, K., Akimoto, Y., Rachman, D.M., Amano, N., Fueki, Y., Maki, A., 2022a. Method of tracking control considering static obstacle for automatic berthing using reinforcement learning (in Japanese). In: Conference Proceedings of the Japan Society of Naval Architects and Ocean Engineers. Japan Society of Naval Architects and Ocean Engineers.
Wakita, K., Maki, A., Umeda, N., Miyauchi, Y., Shimoji, T., Rachman, D.M., Akimoto, Y., 2022b. On neural network identification for low-speed ship maneuvering model. J. Mar. Sci. Technol. 1–14. http://dx.doi.org/10.1007/s00773-021-00867-1.
Yoshimura, Y., Nakao, I., Ishibashi, A., 2009. Unified mathematical model for ocean and harbour manoeuvring. In: Proceedings of MARSIM2009, pp. 116–124. URL: http://hdl.handle.net/2115/42969.
Zhuang, F., Qi, Z., Duan, K., Xi, D., Zhu, Y., Zhu, H., Xiong, H., He, Q., 2021. A comprehensive survey on transfer learning. Proc. IEEE 109 (1), 43–76. http://dx.doi.org/10.1109/JPROC.2020.3004555.

