Ocean Engineering
journal homepage: www.elsevier.com/locate/oceaneng
Keywords: Autonomous vessel; Automatic berthing/docking; Supervised learning; Reinforcement learning

Although various studies have been conducted on automatic berthing, including offline optimization and online control, real-time berthing control remains a difficult problem. Online control methods without reference trajectories are promising for real-time berthing control. We used reinforcement learning (RL), a type of machine learning, to obtain an online control law without reference trajectories. Because online control for automatic berthing is challenging, naive reinforcement learning struggles to obtain an appropriate control law. Furthermore, almost all existing online control methods do not consider port geometries. This study proposes a method for obtaining online berthing control laws by combining supervised learning (SL) and RL. We first trained the controller using offline-calculated trajectories and then further trained it using RL. Owing to the SL process, the proposed method can start the RL process with a good control policy. We evaluated the control law performance of the proposed method in a simulation environment that considered port geometries and wind disturbances. The experimental results show that the proposed method can achieve a higher success rate and lower safety risk than the naive SL and RL algorithms.
1. Introduction

Considerable research is being conducted at many research institutes toward realizing autonomous vessels. An outstanding issue in autonomous vessel operation is automatic berthing/docking. Automatic berthing has a long history of research and development, and automatic berthing/docking was demonstrated using an actual ship in Japan as early as the 1980s (Takai and Yoshihisa, 1987; Takai and Ohtsu, 1990). Various studies have been conducted since then, and they are ongoing.

Numerous studies have been conducted on autonomous berthing/docking from various perspectives, including trajectory planning. Shouji et al. formulated an offline berthing problem as an optimal control problem (Shouji et al., 1992). Following their success, several studies have been conducted (Mizuno et al., 2015; Maki et al., 2020; Bitar et al., 2020; Martinsen et al., 2020; Maki et al., 2021; Miyauchi et al., 2022b). These methods require a dynamic model or a state equation. Therefore, to achieve more realistic control, the estimation of dynamic models or state equations was recently addressed by Miyauchi et al. (2022a) and Wakita et al. (2022b).

The ultimate objective of the automatic berthing problem is to achieve real-time control. Therefore, several studies on online control algorithms have been conducted using proportional–integral–differential (PID) controllers (Shouji et al., 1993; Rachman et al., 2021; Sawada et al., 2021; Rachman et al., 2022), model predictive controllers (MPC) (Li et al., 2020), neural network (NN)-type controllers (Tran and Im, 2012; Ahmed and Hasegawa, 2014; Wakita et al., 2022a; Akimoto et al., 2022), and so on.

Most automatic berthing control studies have applied a path-following or trajectory-tracking approach (Shouji et al., 1993; Rachman et al., 2021, 2022; Sawada et al., 2021; Wakita et al., 2022a). However, a control algorithm other than path following or trajectory tracking is necessary. The primary issue with the path-following or trajectory-tracking approach is the risk of control failure during the actual control procedure. If the control fails to follow the reference trajectory, the subsequent control may collapse because it cannot return to the original trajectory.

In contrast, control without reference is a much more difficult problem to solve. This is because the state equation governing ship motion is complicated and highly nonlinear during the berthing maneuver owing to the complex and transient flow field created by large drift angles and the use of special propeller operations: reversal, idling, and/or boosting. For those nonlinear problems, only a few optimization schemes can obtain a solution. Previous studies have been conducted using NN-type controllers (Tran and Im, 2012; Ahmed and Hasegawa, 2014;
∗ Corresponding author.
E-mail address: shimizu-shoma-kr@ynu.jp (S. Shimizu).
https://doi.org/10.1016/j.oceaneng.2022.112553
Received 6 July 2022; Received in revised form 18 August 2022; Accepted 11 September 2022
Available online 28 September 2022
0029-8018/© 2022 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
Table 1
Principal particulars of the ship used in this study.
Item    Value
Length: 𝐿𝑝𝑝    3.00 (m)
Breadth: 𝐵    0.49 (m)
Draft: 𝑑    0.20 (m)
Block coefficient: 𝐶𝑏    0.83

Table 2
Ranges of action.
Item    Range
𝛿    [−35.0, 35.0] (deg)
𝑛    [−20.0, 20.0] (rps)

Table 3
Target state 𝒔target.
Item    Value
𝑥    −𝐿𝑝𝑝 (m)
𝑢    0 (m/s)
𝑦    1.5𝐵 (m)
𝑣𝑚    0 (m/s)
𝜓    𝜋 (rad)
𝑟    0 (rad/s)

Table 4
Acceptable errors for berthing.
Item    Value
𝑥    ±0.1 (m)
𝑢    ±0.05 (m/s)
𝑦    ±0.1 (m)
𝑣𝑚    ±0.05 (m/s)
𝜓    ±𝜋∕180 (rad)
𝑟    ±𝜋∕180 (rad/s)

Table 5
Ranges of initial state for training by RL.
Item    Value
𝑥    [6𝐿𝑝𝑝, 8𝐿𝑝𝑝] (m)
𝑢    [0.1977353171, 0.2966029756] (m/s)
𝑦    [−2𝐿𝑝𝑝, 2𝐿𝑝𝑝] (m)
𝑣𝑚    0.0 (m/s)
𝜓    𝜓0 (rad)
𝑟    0.0 (rad/s)

this harbor geometry. Therefore, the berthing policy should consider the spatial constraints of the pond. For example, the ship should avoid colliding with the pier's corner when starting at initial positions of 𝑦 ≥ 0.0.
Fig. 2. Aerial photograph of Inukai Pond and a visualization of the environment created based on the pond.
4.1. Phase 1: Supervised learning

We collect $N$ trajectories for successful berthing calculated using offline optimization in Miyauchi et al. (2022b). We denote the $i$th trajectory by $\boldsymbol{\tau}_i = \{(\boldsymbol{s}_1^{(i)}, \boldsymbol{a}_1^{(i)}), \ldots, (\boldsymbol{s}_{T_i}^{(i)}, \boldsymbol{a}_{T_i}^{(i)})\}$, where $T_i$ denotes the trajectory length of the $i$th berthing data. The training dataset for supervised learning is then given by $\mathcal{D}_{\mathrm{train}} = \{\boldsymbol{\tau}_1, \ldots, \boldsymbol{\tau}_N\}$.

A neural network with trainable model parameters $\theta$, denoted by $y_\theta : \mathbb{R}^D \to \mathbb{R}^2$, was adopted as the policy model and trained to minimize the following mean squared error (MSE) function, where $D$ indicates a state's dimension:

$$\mathrm{MSE}(\theta; \mathcal{D}_{\mathrm{train}}) = \sum_{i=1}^{N} \sum_{j=1}^{T_i} \bigl\| y_\theta(\boldsymbol{s}_j^{(i)}) - \boldsymbol{a}_j^{(i)} \bigr\|^2 . \quad (7)$$

Neural network training was performed by stochastic gradient descent using backpropagation, a standard neural network training algorithm. We monitored the validation error, MSE, on a separate dataset from $\mathcal{D}_{\mathrm{train}}$ and selected the model parameters with the smallest validation error to prevent overfitting.
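For concreteness, the following is a minimal sketch of this behavior-cloning step in PyTorch (the library the paper reports using for its RL implementation). The network sizes, learning rate, L2 coefficient, minibatch size, and epoch count follow Sections 5.1.1 and 5.1.2; the function names, tensor layout, and the use of Adam weight decay as a stand-in for L2 regularization are illustrative assumptions, not the authors' code.

import torch
import torch.nn as nn

# Policy network: three hidden layers of 64 tanh units, two outputs (rudder, propeller),
# as described in Section 5.1.1.
def build_policy(state_dim: int) -> nn.Module:
    return nn.Sequential(
        nn.Linear(state_dim, 64), nn.Tanh(),
        nn.Linear(64, 64), nn.Tanh(),
        nn.Linear(64, 64), nn.Tanh(),
        nn.Linear(64, 2),
    )

def train_supervised(states, actions, val_states, val_actions, epochs=30000):
    """Behavior cloning on offline-calculated trajectories, minimizing Eq. (7).

    `states`/`actions` stack all (s, a) pairs of the training trajectories;
    the held-out trajectory provides `val_states`/`val_actions`.
    """
    policy = build_policy(states.shape[1])
    # Adam with lr 0.001; weight_decay approximates the L2 coefficient of 0.03.
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3, weight_decay=0.03)
    loss_fn = nn.MSELoss(reduction="sum")
    best_val, best_state = float("inf"), None
    for _ in range(epochs):
        for idx in torch.randperm(len(states)).split(32):  # minibatch size 32
            opt.zero_grad()
            loss = loss_fn(policy(states[idx]), actions[idx])
            loss.backward()
            opt.step()
        with torch.no_grad():  # monitor the validation MSE after each epoch
            val = loss_fn(policy(val_states), val_actions).item()
        if val < best_val:     # keep the parameters with the smallest validation error
            best_val = val
            best_state = {k: v.clone() for k, v in policy.state_dict().items()}
    policy.load_state_dict(best_state)
    return policy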
4.2. Phase 2: Reinforcement learning

We used the simulator described in Section 3. A reward function that evaluates the goodness of the agent situation should be designed for an RL algorithm. A reward function should not depend on a specific simulator model because the simulator model is replaced in the future owing to the refinement of the model and the introduction of further disturbances. Fu et al. (2018) stated that a reward function that takes only the current state as input is robust to changes in the model. Thus, this study uses a reward function expressed in the following formulas that take only the current state $\boldsymbol{s}$ as the input.

$$r(\boldsymbol{s}) = r'(\boldsymbol{s}) + c_{\mathrm{suc}} \mathbf{1}_{\mathrm{suc}}(\boldsymbol{s}) + c_{\mathrm{obs}} \mathbf{1}_{\mathrm{obs}}(\boldsymbol{s}) , \quad (8)$$

$$r'(\boldsymbol{s}) = -\sqrt{\sum_{i=1}^{6} \left( \frac{s_i - s_i^{\mathrm{target}}}{b_i} \right)^{2}} , \quad (9)$$

where $s_i$ and $s_i^{\mathrm{target}}$ denote the $i$th dimension of the state and the $i$th dimension of the target state, respectively, and $b_i$ indicates the range of the $i$th dimension of the acceptable errors. The coefficients $c_{\mathrm{suc}}$ and $c_{\mathrm{obs}}$ are constant values to balance each term in Eq. (8), and $\mathcal{S}_{\mathrm{suc}}$ and $\mathcal{S}_{\mathrm{obs}}$ represent the set of states that have successfully berthed and the set of states that have collided with the obstacles, respectively. The indicator functions $\mathbf{1}_{\mathrm{suc}}$ and $\mathbf{1}_{\mathrm{obs}}$ are defined as follows.

$$\mathbf{1}_{\mathrm{suc}}(\boldsymbol{s}) = \begin{cases} 1 & (\boldsymbol{s} \in \mathcal{S}_{\mathrm{suc}}) \\ 0 & (\boldsymbol{s} \notin \mathcal{S}_{\mathrm{suc}}) , \end{cases} \quad (10)$$

$$\mathbf{1}_{\mathrm{obs}}(\boldsymbol{s}) = \begin{cases} 1 & (\boldsymbol{s} \in \mathcal{S}_{\mathrm{obs}}) \\ 0 & (\boldsymbol{s} \notin \mathcal{S}_{\mathrm{obs}}) . \end{cases} \quad (11)$$

The second and third terms of Eq. (8) represent the reward for successful berthing and the penalty for colliding with obstacles, respectively. In this study, we consider that a collision occurs when any of the four vertices or internally dividing points of the edges of the rectangle surrounding the ship are within the obstacles.

During the RL phase, we further trained the policy model obtained in the SL phase. Therefore, we used policy gradient-based RL methods rather than value function-based RL methods, such as Q-learning (Sutton and Barto, 2018), to explicitly optimize the policy model. We can use any policy gradient-based RL algorithm in our proposed method and adopt the trust region policy optimization (TRPO) (Schulman et al., 2015, 2016) in the following experiment. The initial state was randomly determined from the ranges listed in Table 5. Therefore, the trained policy after the RL phase should be generalized for various initial states.
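As a concrete illustration of Eqs. (8)-(11), the following is a minimal Python sketch of the reward computation, assuming the state is ordered as (𝑥, 𝑢, 𝑦, 𝑣𝑚, 𝜓, 𝑟) and using the target values and acceptable error ranges of Tables 3 and 4 together with the coefficient values reported later in Section 5.1.3. The helpers `in_obstacle` and `hull_points` stand in for the geometric collision check described above; they and the constant names are illustrative, not the authors' implementation.

import math

L_PP, B = 3.00, 0.49                                              # Table 1
S_TARGET = [-L_PP, 0.0, 1.5 * B, 0.0, math.pi, 0.0]               # Table 3
B_RANGE = [0.1, 0.05, 0.1, 0.05, math.pi / 180, math.pi / 180]    # Table 4
C_SUC, C_OBS = 5000.0, -1000.0   # Eq. (8) coefficients; c_obs is varied per experiment

def r_prime(s):
    """Negative scaled distance to the target state, Eq. (9)."""
    return -math.sqrt(sum(((si - ti) / bi) ** 2
                          for si, ti, bi in zip(s, S_TARGET, B_RANGE)))

def is_success(s):
    """Eq. (10): every state component lies within its acceptable error.
    (Heading error is taken without angle wrapping, for simplicity.)"""
    return all(abs(si - ti) <= bi for si, ti, bi in zip(s, S_TARGET, B_RANGE))

def is_collision(s, in_obstacle, hull_points):
    """Eq. (11): any sampled point of the ship rectangle lies inside an obstacle.
    `hull_points(s)` yields (x, y) points on the hull; `in_obstacle(x, y)` tests the port geometry."""
    return any(in_obstacle(px, py) for px, py in hull_points(s))

def reward(s, in_obstacle, hull_points):
    """Reward of Eq. (8)."""
    return (r_prime(s)
            + C_SUC * float(is_success(s))
            + C_OBS * float(is_collision(s, in_obstacle, hull_points)))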
4.3. Evaluation metrics

We propose quantitative evaluation metrics to compare the control policies obtained using different methods. Note that proposing these quantitative evaluation metrics is a contribution of this study because previous studies did not exploit such metrics to evaluate the effectiveness of berthing policies. Specifically, we ran the obtained policy for 100 episodes (100 berthing simulations) from random initial states and measured the following three quantitative metrics:

Safety risk The number of episodes that collide with obstacles. The smaller the better.

Success rate The number of episodes wherein the ship successfully berthed without colliding with the obstacles. The larger the better.

Stability The median (MED) and interquartile range (IQR) of the minimum value of |𝑟′(𝒔)| over the episodes in which the ship neither collided with the obstacles nor berthed successfully. The smaller the better.
In the safety risk metric, we evaluated how safely the berthing policy controls the ship. It is critical not to collide with obstacles during a ship's actual berthing. The success rate evaluates the accuracy of the policy in berthing successfully. The stability metric evaluates how stably the policy approaches the pier and how much variance there is in its behavior when it neither collides with the pier nor successfully berths. In this study, berthing is considered successful if the difference between the current and target states is within the acceptable error range. This condition should be satisfied when |𝑟′(𝒔)| is minimized. Therefore, the minimum value of |𝑟′(𝒔)| was used to evaluate stability. The median is a metric that evaluates how stably the policy approaches the target state; the smaller the median, the more stable it is. The interquartile range is a metric of behavior variance; the smaller the range, the lower the variance in the behavior. Using these metrics, we can compare two policies as follows. First, we compare them in terms of the number of collisions because it is the most crucial factor in preventing accidents. If the number of collisions is approximately the same, we can compare them based on the number of successes. When the numbers of collisions and successes are both similar, the stability metric can be used to compare them because it is calculated from cases with no collisions or successes.

Furthermore, to evaluate the change in policy behavior depending on the initial position, we divided the range of the initial position into a 4 × 4 grid and evaluated the safety risk and success rate in each region.
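To show how the three metrics could be computed from rollout logs, here is a small sketch. It assumes each simulated episode is summarized by a collision flag, a success flag, and the minimum of |𝑟′(𝒔)| reached during the episode; the dataclass and field names are illustrative, not taken from the authors' code.

from dataclasses import dataclass
from statistics import median, quantiles
from typing import List

@dataclass
class EpisodeResult:
    collided: bool          # the ship hit an obstacle at some step
    succeeded: bool         # reached the target state within the acceptable errors
    min_abs_r_prime: float  # minimum of |r'(s)| over the episode

def evaluate(episodes: List[EpisodeResult]):
    """Safety risk, success rate, and stability (MED/IQR) over, e.g., 100 episodes."""
    safety_risk = sum(e.collided for e in episodes)
    success_rate = sum(e.succeeded and not e.collided for e in episodes)
    # Stability uses only episodes with neither a collision nor a success.
    rest = [e.min_abs_r_prime for e in episodes if not e.collided and not e.succeeded]
    if len(rest) >= 2:
        q1, _, q3 = quantiles(rest, n=4)
        med, iqr = median(rest), q3 - q1
    else:
        med = iqr = float("nan")  # "N/A" when stability cannot be computed
    return safety_risk, success_rate, med, iqr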
5. Experiment and result

We conducted two experiments: the first was an evaluation in an environment without wind disturbances, and the second was an evaluation in an environment with wind disturbances. The objective of the experiments is to evaluate the effectiveness of the combination of SL and RL through comparison with the SL and RL algorithms alone.

5.1. Experimental settings

5.1.1. Common settings

The policy neural network had three hidden layers comprising 64 units and a hyperbolic tangent activation function. The output layer comprises two units corresponding to the action values. The output values are scaled by the following equation because the action values 𝛿 and 𝑛 have lower and upper bounds, as shown in Table 2:

$$z = \frac{a_{\max} - a_{\min}}{2} \tanh \bar{z} + \frac{a_{\max} + a_{\min}}{2} , \quad (12)$$

where $a_{\max}$ and $a_{\min}$ denote the upper and lower bounds of an action value, and $\bar{z}$ and $z$ are the outputs before and after adjusting the scale, respectively.
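The scaling of Eq. (12) maps the unbounded network outputs onto the admissible rudder and propeller ranges of Table 2. A minimal sketch in Python/NumPy (the function name is illustrative):

import numpy as np

A_MIN = np.array([-35.0, -20.0])  # lower bounds of delta (deg) and n (rps), Table 2
A_MAX = np.array([35.0, 20.0])    # upper bounds

def scale_action(z_bar: np.ndarray) -> np.ndarray:
    """Eq. (12): squash the raw network outputs z_bar into [A_MIN, A_MAX]."""
    return (A_MAX - A_MIN) / 2.0 * np.tanh(z_bar) + (A_MAX + A_MIN) / 2.0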
5.1.2. Settings for supervised learning

To train the weight parameters in the policy neural network, we used the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.001 and L2 regularization with a coefficient of 0.03. The MSE function shown in Eq. (7) was used as the loss function. We used offline optimization to generate 40 trajectories with different initial positions and speeds. In particular, we used two initial speed settings of 0.1977 and 0.2966 m/s and uniformly distributed 20 initial positions. Subsequently, we randomly selected 39 trajectories as training data $\mathcal{D}_{\mathrm{train}}$, and the remaining trajectory was used as validation data. Fig. 3 shows the initial positions and directions of the offline-calculated trajectories. We trained the neural network for 30,000 epochs with a minibatch size of 32. Wind disturbances did not occur when the trajectories were collected.

Fig. 3. Spatial distribution of the offline-calculated trajectories. Each green arrow represents the initial position and the direction of the trajectory. The point corresponds to the coordinates of the midship, and the arrow's direction corresponds to the heading angle. The arrow's size has no meaning.

5.1.3. Settings for reinforcement learning

We use TRPO (Schulman et al., 2015, 2016) to train the policy. TRPO is a major policy gradient-based RL algorithm that introduces a constraint on the update strength of the parameters of the policy neural network, resulting in stable optimization of the policy neural network. We used TRPO in our proposed method because we expect it to prevent loss of the control knowledge obtained in the SL phase owing to a large update strength. As a stochastic policy is required during policy training in TRPO, the action values are sampled from a Gaussian distribution to make the action stochastic. In particular, the mean values of the Gaussian distribution corresponded to the output values of the policy neural network, and their variance parameters were added as learnable parameters, independent of the network. After policy training, i.e., in the evaluation phase, we used the output values of the network as the action instead of sampling from the Gaussian distribution. In other words, the control policy in the evaluation phase was deterministic.
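As a sketch of this stochastic-policy construction (not the authors' ChainerRL/Spinning Up implementation), a PyTorch Gaussian head with a state-independent, learnable log standard deviation might look as follows; at evaluation time only the mean is used.

import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Diagonal Gaussian policy: mean from the policy network, state-independent
    learnable log-std, deterministic (mean) action at evaluation time."""

    def __init__(self, net: nn.Module, action_dim: int = 2):
        super().__init__()
        self.net = net  # e.g., the SL-pretrained MLP
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state: torch.Tensor, deterministic: bool = False) -> torch.Tensor:
        mean = self.net(state)
        if deterministic:  # evaluation phase: deterministic control
            return mean
        std = self.log_std.exp()  # shared across all states
        return torch.distributions.Normal(mean, std).rsample()  # training phase

How exactly this sampling interacts with the output scaling of Eq. (12) is an implementation detail of the authors' code and is not spelled out here.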
We implemented TRPO using PyTorch (version 1.10.0) (Paszke et al., 2019) based on ChainerRL (version 0.8.0) (Fujita et al., 2021) and Spinning Up (version 0.2) (Achiam, 2018). The neural networks and probability distributions were implemented as done in ChainerRL; specifically, this includes the policy network, the value function, and the scaling in Eq. (12). We implemented the training code for TRPO in the same manner as Spinning Up, and most of the hyperparameters were the same as those used in it. We used different values for the training steps and the number of layers in the policy network. We set the number of layers in the policy network to three, as described in Section 5.1.1. For the value function in the TRPO algorithm, a neural network with two hidden layers comprising 64 units and hyperbolic tangent activation was used, which is the default setting in Spinning Up. The specific values for the training steps are described in each experimental section.

For the reward function in Eq. (8), we set 𝑐suc = 5000 and tested multiple values of 𝑐obs for each experiment. The coefficient 𝑐obs < 0 determines the significance of the penalty term for collision, where a greater magnitude of 𝑐obs penalizes collisions more heavily. The reason for experimenting with multiple collision penalty coefficients 𝑐obs is that this is an essential parameter that affects the difficulty of the task. For example, if the penalty magnitude is too small, the control policy may be allowed to collide; conversely, if the penalty magnitude is too large, the policy is expected to avoid approaching the pier and may fail to berth.

5.2. Experiment 1: Environment without wind disturbances

In Experiment 1, we used the state without the wind speed and direction defined in Eq. (1), indicating that wind disturbances did not
Table 6
Results of the environment without the wind in Experiment 1. Each row shows the
results measured in 100 episodes of the corresponding penalty, where "SL" is the result
of SL alone, and rows with "(w/o SL)" are the results of RL alone. "MED" and "IQR"
represent the median and interquartile range of the minimum value of |𝑟′(𝑠)| in the
episodes, respectively.
Method Safety risk Success rate Stability
# Collisions # Successes MED IQR
SL 51 0 12.02 2.41
𝑐obs = −500 1 83 4.24 18.54
𝑐obs = −1000 0 77 4.70 12.92
𝑐obs = −1500 1 0 11.90 7.08
𝑐obs = −2000 0 0 9.97 5.07
𝑐obs = −500 (w/o SL) 0 8 9.62 9.44
𝑐obs = −1000 (w/o SL) 0 0 13.21 9.61
𝑐obs = −1500 (w/o SL) 0 4 9.53 7.44
𝑐obs = −2000 (w/o SL) 0 0 11.44 7.38
Fig. 6. Temporal changes in states and actions of an offline-calculated trajectory and a trajectory generated by the policy.
Table 7
Results of the environment with the wind in Experiment 2. The "Steps" column gives the total number of training steps: up to 1.0 × 10^8 steps, the policies were trained without the wind, and between 1.0 × 10^8 and 2.0 × 10^8 steps, they were trained with the wind. We measured each metric in the environment with the wind. "N/A" indicates that the stability metric could not be calculated because all trajectories collided.
Method    Steps (×10^8)    Safety risk (# Collisions)    Success rate (# Successes)    Stability MED    Stability IQR
SL    N/A    45    0    16.65    2.32
𝑐obs = −500    1.0    15    82    2.19    0.19
𝑐obs = −500    2.0    0    99    2.30    0.00
𝑐obs = −1000    1.0    5    75    3.57    3.12
𝑐obs = −1000    2.0    3    83    2.32    1.58
𝑐obs = −500 (w/o SL)    1.0    75    0    18.56    13.48
𝑐obs = −500 (w/o SL)    2.0    3    0    8.95    5.33
𝑐obs = −1000 (w/o SL)    1.0    100    0    N/A    N/A
𝑐obs = −1000 (w/o SL)    2.0    0    0    13.94    7.11
Fig. 9. Temporal changes in states and actions when the policy trained for 2.0 × 10^8 steps with the penalty of −500 was run under several initial conditions.
In addition, quantitatively comparing the performance of the proposed method with existing path-following or trajectory-tracking control methods is necessary.

7. Conclusion

We proposed a method for obtaining an online berthing control law by combining SL and RL. The control policy neural networks were trained and evaluated in a simulator of a specific harbor geometry under various wind and initial conditions. Experimental results show that the proposed method can obtain control laws with higher success rates and lower safety risks than naive SL and RL methods. The proposed method enables online control considering port geometries without reference trajectories. We hope that this study will serve as an essential first step toward the practical application of automatic berthing and that future studies will focus on this topic.

CRediT authorship contribution statement

Shoma Shimizu: Methodology, Software, Investigation, Writing – original draft, Visualization. Kenta Nishihara: Methodology, Software, Writing – original draft. Yoshiki Miyauchi: Methodology, Software, Writing – original draft, Writing – review & editing. Kouki Wakita: Methodology, Software, Writing – review & editing. Rin Suyama: Investigation, Software, Writing – review & editing. Atsuo Maki: Conceptualization, Writing – original draft, Project administration, Funding acquisition. Shinichi Shirakawa: Writing – review & editing, Supervision, Project administration.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

The authors are unable or have chosen not to specify which data has been used.

Acknowledgments

This study was supported by a Grant-in-Aid for Scientific Research from the Japan Society for Promotion of Science, Japan (JSPS KAKENHI Grant #19K04858, #22H01701). We would like to thank Editage (www.editage.com) for English language editing.

References

Achiam, J., 2018. Spinning up in deep reinforcement learning. https://spinningup.openai.com/en/latest/. (Accessed 30 May 2022).
Ahmed, Y.A., Hasegawa, K., 2014. Experiment results for automatic ship berthing using artificial neural network based controller. IFAC Proc. Vol. 47 (3), 2658–2663.
Akimoto, Y., Miyauchi, Y., Maki, A., 2022. Saddle point optimization with approximate minimization oracle and its application to robust berthing control. ACM Trans. Evol. Learn. Optim. 2 (1). http://dx.doi.org/10.1145/3510425.
Bai, T., Luo, J., Zhao, J., Wen, B., Wang, Q., 2021. Recent advances in adversarial training for adversarial robustness. In: Zhou, Z.-H. (Ed.), Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence. IJCAI-21, International Joint Conferences on Artificial Intelligence Organization, pp. 4312–4321. http://dx.doi.org/10.24963/ijcai.2021/591, Survey Track.
Bain, M., Sammut, C., 1999. A framework for behavioural cloning. In: Machine Intelligence 15, Intelligent Agents [St. Catherine's College, Oxford, July 1995]. Oxford University, GBR, pp. 103–129.
Bitar, G., Martinsen, A.B., Lekkas, A.M., Breivik, M., 2020. Trajectory planning and control for automatic docking of ASVs with full-scale experiments. IFAC-PapersOnLine 53 (2), 14488–14494. http://dx.doi.org/10.1016/j.ifacol.2020.12.1451, 21st IFAC World Congress.
Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., Legg, S., Kavukcuoglu, K., 2018. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In: Dy, J., Krause, A. (Eds.), Proceedings of the 35th International Conference on Machine Learning. In: Proceedings of Machine Learning Research, vol. 80, PMLR, pp. 1407–1416. URL: https://proceedings.mlr.press/v80/espeholt18a.html.
Fu, J., Luo, K., Levine, S., 2018. Learning robust rewards with adversarial inverse reinforcement learning. In: Proceedings of the International Conference on Learning Representations. ICLR. URL: https://openreview.net/forum?id=rkHywl-A-.
Fujita, Y., Nagarajan, P., Kataoka, T., Ishikawa, T., 2021. ChainerRL: A deep reinforcement learning library. J. Mach. Learn. Res. 22 (77), 1–14. URL: http://jmlr.org/papers/v22/20-376.html.
Hasegawa, K., Fukutomi, T., 1994. On harbour manoeuvring and neural control system for berthing with tug operation. In: Proc. of 3rd International Conference on Manoeuvring and Control of Marine Craft. MCMC'94, pp. 197–210.
Huisman, M., Van Rijn, J.N., Plaat, A., 2021. A survey of deep meta-learning. Artif. Intell. Rev. 54 (6), 4483–4541. http://dx.doi.org/10.1007/s10462-021-10004-4.
Kingma, D.P., Ba, J., 2015. Adam: A method for stochastic optimization. In: Proceedings of the International Conference on Learning Representations. ICLR. URL: http://arxiv.org/abs/1412.6980.
Kober, J., Bagnell, J.A., Peters, J., 2013. Reinforcement learning in robotics: A survey. Int. J. Robot. Res. 32 (11), 1238–1274. http://dx.doi.org/10.1177/0278364913495721.
Koide, T., Mizuno, N., 2021. Automatic berthing maneuvering of ship by reinforcement learning. In: The 8th SICE Multi-Symposium on Control Systems.
Li, S., Liu, J., Negenborn, R.R., Wu, Q., 2020. Automatic docking for underactuated ships based on multi-objective nonlinear model predictive control. IEEE Access 8, 70044–70057. http://dx.doi.org/10.1109/ACCESS.2020.2984812.
Maki, A., Akimoto, Y., Umeda, N., 2021. Application of optimal control theory based on the evolution strategy (CMA-ES) to automatic berthing (part: 2). J. Mar. Sci. Technol. 26, 835–845. http://dx.doi.org/10.1007/s00773-020-00774-x.
Maki, A., Sakamoto, N., Akimoto, Y., Nishikawa, H., Umeda, N., 2020. Application of optimal control theory based on the evolution strategy (CMA-ES) to automatic berthing. J. Mar. Sci. Technol. 25 (1), 221–233. http://dx.doi.org/10.1007/s00773-019-00642-3.
Martinsen, A.B., Bitar, G., Lekkas, A.M., Gros, S., 2020. Optimization-based automatic docking and berthing of ASVs using exteroceptive sensors: Theory and experiments. IEEE Access 8, 204974–204986. http://dx.doi.org/10.1109/ACCESS.2020.3037171.
Miyauchi, Y., Maki, A., Umeda, N., Rachman, D.M., Akimoto, Y., 2022a. System parameter exploration of ship maneuvering model for automatic docking/berthing using CMA-ES. J. Mar. Sci. Technol. http://dx.doi.org/10.1007/s00773-022-00889-3. URL: https://link.springer.com/10.1007/s00773-022-00889-3.
Miyauchi, Y., Sawada, R., Akimoto, Y., Umeda, N., Maki, A., 2022b. Optimization on planning of trajectory and control of autonomous berthing and unberthing for the realistic port geometry. Ocean Eng. 245, 110390. http://dx.doi.org/10.1016/j.oceaneng.2021.110390.
Mizuno, N., Kuroda, M., Okazaki, T., Ohtsu, K., 2007. Minimum time ship maneuvering method using neural network and nonlinear model predictive compensator. Control Eng. Pract. 15 (6), 757–765. http://dx.doi.org/10.1016/j.conengprac.2007.01.002, Special Section on Control Applications in Marine Systems.
Mizuno, N., Uchida, Y., Okazaki, T., 2015. Quasi real-time optimal control scheme for automatic berthing. IFAC-PapersOnLine 48 (16), 305–312. http://dx.doi.org/10.1016/j.ifacol.2015.10.297, 10th IFAC Conference on Manoeuvring and Control of Marine Craft (MCMC 2015).
Ogawa, A., Kasai, H., 1978. On the mathematical model of manoeuvring motion of ships. Int. Shipbuild. Prog. 25 (292), 306–319. http://dx.doi.org/10.3233/ISP-1978-2529202.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S., 2019. PyTorch: An imperative style, high-performance deep learning library. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (Eds.), Advances in Neural Information Processing Systems. Vol. 32. Curran Associates, Inc., pp. 8024–8035. URL: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
Rachman, D.M., Aoki, Y., Miyauchi, Y., Maki, A., 2022. Automatic docking (berthing) by dynamic positioning system with VecTwin rudder. In: Conference Proceedings of the Japan Society of Naval Architects and Ocean Engineers. Japan Society of Naval Architects and Ocean Engineers.
Rachman, D.M., Miyauchi, Y., Umeda, N., Maki, A., 2021. Feasibility study on the use of evolution strategy: CMA-ES for ship automatic docking problem. In: Proc. 1st International Conference on the Stability and Safety of Ships and Ocean Vehicles. STABS 2021.
Ross, S., Bagnell, D., 2010. Efficient reductions for imitation learning. In: Teh, Y.W., Titterington, M. (Eds.), Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. In: Proceedings of Machine Learning Research, vol. 9, PMLR, pp. 661–668. URL: http://proceedings.mlr.press/v9/ross10a.html.
Sawada, R., Hirata, K., Kitagawa, Y., Saito, E., Ueno, M., Tanizawa, K., Fukuto, J., 2021. Path following algorithm application to automatic berthing control. J. Mar. Sci. Technol. 26, 541–554. http://dx.doi.org/10.1007/s00773-020-00758-x.
Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T., Silver, D., 2020. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature 588 (7839), 604–609. http://dx.doi.org/10.1038/s41586-020-03051-4.
Schulman, J., Levine, S., Abbeel, P., Jordan, M., Moritz, P., 2015. Trust region policy optimization. In: Bach, F., Blei, D. (Eds.), Proceedings of the 32nd International Conference on Machine Learning. In: Proceedings of Machine Learning Research, vol. 37, PMLR, Lille, France, pp. 1889–1897. URL: https://proceedings.mlr.press/v37/schulman15.html.
Schulman, J., Moritz, P., Levine, S., Jordan, M., Abbeel, P., 2016. High-dimensional continuous control using generalized advantage estimation. In: Proceedings of the International Conference on Learning Representations. ICLR.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O., 2017. Proximal policy optimization algorithms. arXiv:1707.06347.
Shouji, K., Ohtsu, K., Hotta, T., 1993. An automatic berthing study by optimal control techniques. IFAC Proc. Vol. 173, 221–229.
Shouji, K., Ohtsu, K., Mizoguchi, S., 1992. An automatic berthing study by optimal control techniques. IFAC Proc. Vol. 25 (3), 185–194.
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., Hassabis, D., 2017. Mastering the game of Go without human knowledge. Nature 550 (7676), 354–359. http://dx.doi.org/10.1038/nature24270.
Sutton, R.S., Barto, A.G., 2018. Reinforcement Learning: An Introduction. MIT Press.
Takai, H., Ohtsu, K., 1990. Automatic berthing experiments using "Shioji-Maru" (in Japanese). J. Jpn. Inst. Navig.
Takai, H., Yoshihisa, H., 1987. An automatic maneuvering system in berthing. In: Proceedings of 18th Ship Control Symposium.
Tran, V.L., Im, N., 2012. A study on ship automatic berthing with assistance of auxiliary devices. Int. J. Nav. Archit. Ocean Eng. 4 (3), 199–210. http://dx.doi.org/10.2478/IJNAOE-2013-0090.
Wakita, K., Akimoto, Y., Rachman, D.M., Amano, N., Fueki, Y., Maki, A., 2022a. Method of tracking control considering static obstacle for automatic berthing using reinforcement learning (in Japanese). In: Conference Proceedings of the Japan Society of Naval Architects and Ocean Engineers. Japan Society of Naval Architects and Ocean Engineers.
Wakita, K., Maki, A., Umeda, N., Miyauchi, Y., Shimoji, T., Rachman, D.M., Akimoto, Y., 2022b. On neural network identification for low-speed ship maneuvering model. J. Mar. Sci. Technol. 1–14. http://dx.doi.org/10.1007/s00773-021-00867-1.
Yoshimura, Y., Nakao, I., Ishibashi, A., 2009. Unified mathematical model for ocean and harbour manoeuvring. In: Proceedings of MARSIM2009. pp. 116–124. URL: http://hdl.handle.net/2115/42969.
Zhuang, F., Qi, Z., Duan, K., Xi, D., Zhu, Y., Zhu, H., Xiong, H., He, Q., 2021. A comprehensive survey on transfer learning. Proc. IEEE 109 (1), 43–76. http://dx.doi.org/10.1109/JPROC.2020.3004555.