Mass Mahjong Decision System Based on Transfer Learning

Yajun Zheng, Shuqin Li*

School of Computer Science, Beijing Information Science and Technology University;
Perception and Computational Intelligence Laboratory
ABSTRACT

In this paper, we propose a transfer learning approach to address the lack of data and the difficulty of constructing models effectively, a problem typified by Mass Mahjong in the field of imperfect information games. We design and implement a Mass Mahjong Discard model based on transfer learning: a Bloody Mahjong Discard model, previously well trained on a large dataset, is migrated to a Mass Mahjong Discard model in a similar domain. In the subsequent model optimization, a self-play based approach is used to improve the Mass Mahjong Discard model. The experimental results show that the transfer learning-based Mass Mahjong Discard model performs well with little data and fits the Mass Mahjong discard rules. The model won the second prize in the Mass Mahjong event of the 2021 National University Computer Gaming Competition.

CCS CONCEPTS

• Computing methodologies; • Machine learning; • Learning paradigms; • Multi-task learning; • Transfer learning;

KEYWORDS

Artificial intelligence, Transfer learning, Mahjong, Machine learning

ACM Reference Format:
Yajun Zheng and Shuqin Li*. 2022. Mass Mahjong Decision System Based on Transfer Learning. In 2022 the 6th International Conference on Innovation in Artificial Intelligence (ICIAI) (ICIAI 2022), March 04–06, 2022, Guangzhou, China. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3529466.3529485

1 INTRODUCTION

With the development of deep learning, more and more breakthroughs have been achieved in the field of computer game playing. In particular, the emergence of AlphaGo [1] allowed computers to beat humans at Go with deep learning methods, bringing the application of deep learning to computer game playing to an unprecedented level. In the field of "two against one" competitive play, DouZero [2], drawing on the ideas of AlphaZero [3] and combining Monte Carlo algorithms [4] with deep learning, reached the level of human players within a few days by learning from scratch through self-play without any human knowledge. In Riichi Mahjong, Suphx [5], based on deep reinforcement learning, is currently a strong Japanese Mahjong AI that reaches a level beyond most top human players on the Tenhou Mahjong platform. Although the above research on imperfect information games has, through deep learning, come close to the real level of human players in some games, imperfect information games are diverse in type, and differences in game rules lead to differences in training data, model inputs and outputs, network structure, and validation methods. Mahjong in China alone has variants including National Standard Mahjong, Sichuan Mahjong and Riichi Mahjong. The rules of Mahjong are inherently complex, and there is also some variability in the rules between different Mahjong games. Moreover, Mahjong is a multiplayer game with many decision actions, a large action space and strong uncertainty. As a result, developing a Mahjong decision model for a new domain requires enormous human and material resources and a long process of model tuning, which can guarantee neither a good training effect nor the transferability of research results. Furthermore, in domains where data is lacking or scarce, deep learning either cannot be used for model building at all, or the resulting domain model generalizes poorly. Therefore, this paper applies transfer learning to improve imperfect information games with little data.

In this paper, taking Mass Mahjong as a typical representative with only a small amount of data, we transfer the Bloody Mahjong model [6], which has a large amount of data and is well trained, to the Mass Mahjong domain through transfer learning, constructing a Discard model applicable to Mass Mahjong. At the same time, in order to further improve the accuracy of this model, the model is optimized using a self-play approach.

2 INTRODUCTION TO THE RULES OF MASS MAHJONG

The game of Mass Mahjong is played by four players, and the tiles are divided into three suits, Dot, Bamboo and Character, for a total of 108 tiles [7]. The common Mass Mahjong terms are explained as follows.
Wall: After the initial hands are drawn at the start of the game, the remaining tiles form the wall of tiles.

Draw: The player draws a tile from the Wall.

Discard: The player chooses a tile from his hand and discards it.

Eat: When a tile discarded by the previous player and two tiles in your hand form three consecutive tiles of the same suit, you can take the Eat action.

Pong: If another player discards a tile that is the same as two tiles in your hand, you can take the Pong action.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ICIAI 2022, March 04–06, 2022, Guangzhou, China
© 2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-9550-2/22/03. . . $15.00
https://doi.org/10.1145/3529466.3529485


Kong: There are three forms of Kong: Open Kong, Closed Kong and Repaired Kong. Open Kong is when a tile discarded by another player is the same as three tiles in your hand. When you draw a tile that is the same as three tiles in your hand, you can take the Closed Kong action. If you draw a tile that is the same as a tile you have already Ponged, you can take the Repaired Kong action.

Listen: When the hand is one tile short of winning, it is in a waiting state, and the player can choose whether or not to Listen. After Listen, you receive the score reward directly, but the hand can no longer be changed: whatever tile is subsequently drawn is discarded, until the winning tile appears and you can take the Win action.

Meld: The combinations of tiles formed during the game by the Eat, Pong, Open Kong and Repaired Kong actions.

Win: The player's hand forms a special combination.

At the beginning of the game, the four players sit in the south, east, north and west seats. Each player draws 13 tiles, while the dealer draws one more, for a total of 14 tiles. The dealer then discards first, play proceeding in a clockwise direction; during the game, a player can Draw, Discard, Eat, Pong, Kong, Listen and Win. Among these, Eat, Pong, Kong, Listen and Win all bring direct benefits, and multiple decision actions may be available at the same moment. The priority of the actions is: Win > Kong > Pong > Eat. When a player wins or the wall of tiles is empty, the game ends, and each player's total score is counted based on the score of the winning hand and the scores of the actions taken during the game.

3 OVERALL DESIGN OF THE MASS MAHJONG DECISION SYSTEM

Under the rules of Mass Mahjong, the Eat, Pong, Kong and Listen operations are directly rewarded, so this part can be implemented with knowledge-based rules. The quality of the Discard, however, determines whether the player can subsequently win and score, so the research in this paper focuses on the Mahjong Discard decision. In previous work [6], a Mahjong Discard decision model was trained and performed well; this paper therefore uses the Bloody Mahjong decision model to perform transfer learning.

3.1 The similarities and differences between the rules of Mass Mahjong and Bloody Mahjong

Mass Mahjong and Bloody Mahjong are similar overall, but there are some differences. The main similarities and differences between the two are as follows.

• Similarities: Both use 108 tiles with only three suits: Dot, Bamboo and Character. The rules for the Pong and Kong actions are the same in both. The rules for winning are also similar, as long as the particular tile fits the winning rules.

• Differences: Mass Mahjong does not have the rule of changing three tiles, and there is no missing-suit requirement for winning. However, two actions have been added in Mass Mahjong: Eat and Listen. The Eat, Pong, Kong and Listen actions of Mass Mahjong are all directly rewarded. The game ends as soon as one player wins; there is no situation in which one player wins and the others continue to play.

In conclusion, Mass Mahjong is simpler in its rules than Bloody Mahjong and can be seen as a stage within Bloody Mahjong, which provides good conditions for using transfer learning.

3.2 Overall design concept of the Mass Mahjong decision system

The overall concept of the transfer learning Mass Mahjong Discard model design is shown in Figure 1. It consists of the following four main components.

• Pre-training. Generally, a model that trains well on a similar task is selected for pre-training. Given the characteristics of Mass Mahjong, this paper selects the Bloody Mahjong Discard model, which is similar to Mass Mahjong, as the source model and pre-trains it on the Bloody Mahjong dataset.

• Weight transfer. This step takes the weights of the trained model and uses them as initial weights in the new task; in other words, the parameters of the trained model are transferred to the new model to help train it. In this paper, we transfer the weights of the pre-trained Bloody Mahjong Discard model.

• Re-training. According to the characteristics of the current task, some structures of the original model are fixed or changed to adapt the model to the new task. In this paper, we first remove features unique to Bloody Mahjong, such as changing three tiles and the missing suit, from the model input and change the input structure of the model; we then augment the small amount of Mass Mahjong data and use the transferred weights to retrain on Mass Mahjong, building a transfer learning based Mass Mahjong Discard model.

• Model optimization. After the Mass Mahjong Discard model is trained, it is optimized by fine-tuning. In this paper, the other Mahjong decision models are integrated with the Mahjong Discard model to form the Mahjong decision system. According to the rules of Mass Mahjong, a Mass Mahjong judge system is established to self-play against the Mass Mahjong decision system, and the data generated by self-play is used to update the Mass Mahjong Discard model, achieving fine-tuning and optimization.

Figure 1: Overall design idea of Mass Mahjong decision system

4 DESIGN AND IMPLEMENTATION OF MASS MAHJONG DECISION SYSTEM

4.1 Design and implementation of the Discard model for Mass Mahjong

In previous work, thanks to the large amount of Bloody Mahjong Discard data, a Bloody Mahjong Discard model was constructed by pre-training as the source model for transfer learning. The model input data representation follows the methods proposed in the literature [8-10]: the game scenario was segmented considering the completeness of and correlation between the data, and the known information in the current situation was extracted using knowledge based on human experience. Since Mass Mahjong has no features such as changing three tiles, these features are eliminated when constructing the Mass Mahjong Discard model. In this paper, we choose a 3*9 feature plane to represent the situation, with the 3 rows representing the suits of the tiles and the 9 columns representing the nine ranks 1 to 9. The other known information, including the tiles discarded by the previous, next and opposite players and other features, is represented with this feature plane as well. The specific data representation is shown in Table 1.

Table 1: Model input data representation

Input Dimension   Representation Features
3*9               Player's current hand
3*9               Player's Discard
3*9               Player's Meld
3*9               Last Player's Discard
3*9               Last Player's Meld
3*9               Opposite Player's Discard
3*9               Opposite Player's Meld
3*9               Next Player's Discard
3*9               Next Player's Meld
3*9               All Players' Discard
3*9               Current Remaining Tiles
3*9               Last Hand Position
3*9               Last Hand Tile

The output of the transfer learning model treats the actions as a multi-classification over the 27 kinds of tile actions, as shown in Figure 2 below.

Figure 2: Model output player action representation

In previous work, Bloody Mahjong not only considers the current situation information when discarding a tile, but also takes past tile information into account, and the DenseNet [11] network establishes connections between different layers, so the DenseNet network is chosen for the Mahjong Discard decision. Since this is transfer learning and the tile types of Mass Mahjong and Bloody Mahjong are the same, the previous information still needs to be taken into account when discarding, and only the input layer of the model needs to be modified. The model output is the probability distribution over the 27 tile types.

In this paper, we use the Bloody Mahjong dataset for pre-training to train the Bloody Mahjong Discard model; after a series of parameter tuning, the final accuracy of the Bloody Mahjong Discard model on the test set reaches 91.6%. Next, weight transfer is performed to move the parameters of the Bloody Mahjong Discard model to the Mass Mahjong Discard model; the specific parameters are shown in Table 2 below.

Table 2: Weight transfer parameters

Parameter Name   Numerical Value
num_epochs       50
learning_rate    0.001
batch_size       64
depth            100
growth_rate      12
criterion        Cross Entropy Loss
optimizer        SGD

The loss function is chosen as the cross-entropy loss, expressed in Equation (1):

c = -(1/n) Σ_x [y ln a + (1 - y) ln(1 - a)]    (1)

In Equation (1), y represents the desired output, i.e., the position corresponding to the classification in the label; a represents the actual output of the neuron, i.e., the position of the multi-classification label predicted by the neural network; x represents a sample; and n represents the total number of samples. The optimization objective is to minimize the loss: the value of c decreases gradually as the actual output of the model approaches the true result.

The Mass Mahjong Discard model is obtained by retraining using the Mass Mahjong Discard data after data augmentation and the parameters after weight transfer.
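As a concrete illustration of Equation (1), the loss can be computed directly in plain Python. This is a minimal sketch of ours with made-up target/output values, not the authors' training code (which uses PyTorch's built-in cross-entropy criterion).

```python
import math

def cross_entropy(targets, outputs):
    """c = -(1/n) * sum over samples x of [y*ln(a) + (1-y)*ln(1-a)],
    where y is the desired output and a the actual output for sample x."""
    n = len(targets)
    total = sum(y * math.log(a) + (1 - y) * math.log(1 - a)
                for y, a in zip(targets, outputs))
    return -total / n

# The loss shrinks as the actual outputs approach the desired ones.
far = cross_entropy([1, 0], [0.6, 0.4])    # outputs far from the labels
near = cross_entropy([1, 0], [0.9, 0.1])   # outputs close to the labels
assert near < far
```

For a single sample with y = 1 and a = 0.5 the loss is ln 2 ≈ 0.693, and it falls toward zero as a approaches 1, which is exactly the minimization behaviour described above.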


Table 3: Rules of the Eat, Pong, Kong and Listen action models

Action   Rules For Use
Eat      1. If the number of tiles from the Eat meld already in the hand is 3, do not take the Eat action.
         2. If the Eat meld is already formed in the hand, do not take the Eat action.
         3. In all other cases, take the Eat action if the Eat is legal.
Pong     1. If the Pong tile is a 3 and the hand holds a single 1 or 2, do not take the Pong action.
         2. If the Pong tile is a 7 and the hand holds a single 8 or 9, do not take the Pong action.
         3. In all other cases, take the Pong action if the Pong is legal.
Kong     1. If the winning hand is seven pairs, do not take the Kong action.
         2. If the Kong tile can be divided into a pair and a meld in the hand, do not take the Kong action.
         3. In all other cases, take the Kong action if the Kong is legal.
Listen   1. Listen on the wait with the largest number of winning tiles.
         2. Listen on the tile that scores highest when winning.
         3. Listen on a tile that has not been entirely discarded.
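As an illustration of how such rules translate into code, the Pong rules above might be implemented as follows. The tile representation (plain rank numbers within a single suit) is our simplification for the sketch, not the paper's actual implementation.

```python
def should_pong(discarded, hand):
    """Decide the Pong action following the rules in Table 3.
    `discarded` is the rank of the tile just discarded; `hand` is the list
    of ranks of the same suit currently held by the player."""
    if hand.count(discarded) < 2:
        return False  # a Pong is only legal with two matching tiles in hand
    # Rule 1: skip a Pong on 3 when a lone 1 or 2 would be stranded.
    if discarded == 3 and any(hand.count(t) == 1 for t in (1, 2)):
        return False
    # Rule 2: skip a Pong on 7 when a lone 8 or 9 would be stranded.
    if discarded == 7 and any(hand.count(t) == 1 for t in (8, 9)):
        return False
    # Rule 3: in all other cases, take the legal Pong.
    return True
```

The point of rules 1 and 2 is that Ponging the 3 (or 7) consumes the tiles that a lone 1/2 (or 8/9) would need to complete a run, so the rule declines those Pongs.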

The specific algorithm is shown in Algorithm 1 below.

Algorithm 1 Mass Mahjong Discard model algorithm based on transfer learning
input: data_Bloody, data_Mass
output: model_Mass
model_Bloody = Train(data_Bloody)  // train the Bloody Mahjong Discard model
data_Mass_aug = Augmentation(data_Mass)  // Mass Mahjong data augmentation
model_Mass = Train(data_Mass_aug)  // train the Mass Mahjong Discard model using the Bloody Mahjong Discard model parameters
return model_Mass

Algorithm 2 Self-play model optimization algorithm
input: model_Mass, model_Other, epochs
output: model_Mass
epoch = 0
best_error = 1
while epoch < epochs do
    data = Self_Play(model_Mass, model_Other)
    data_win = Select(data)
    data_train, data_dev, data_test = Split_Data(data_win)
    model_new = Train(data_train, data_dev)
    test_error = Compute_Error(data_test, model_new)
    if test_error < best_error then
        best_error ← test_error
        model_Mass ← model_new
        Save(model_Mass)
    epoch ← epoch + 1
return model_Mass
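The Select and Split_Data steps of Algorithm 2 can be sketched as follows. The game-record structure (a dict of final scores plus the decision records of all players) is our assumption for illustration; the paper does not specify it.

```python
import random

def select_winning_data(games):
    """Select(data): keep only the decisions made by each game's winner."""
    selected = []
    for game in games:
        # The winner is the player with the highest final score.
        winner = max(game["scores"], key=game["scores"].get)
        selected.extend(d for d in game["decisions"] if d["player"] == winner)
    return selected

def split_data(samples, train=0.8, dev=0.1, seed=0):
    """Split_Data: shuffle and split into train/dev/test portions."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n = len(samples)
    i, j = int(n * train), int(n * (train + dev))
    return samples[:i], samples[i:j], samples[j:]
```

Training only on the winners' decisions is what lets the self-play loop pull the Discard model toward play that actually scores under the Mass Mahjong rules.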

4.2 Design and implementation of other action decision models for Mass Mahjong

Mass Mahjong has the actions Eat, Discard, Pong, Kong and Listen; combined, these actions form a complete Mass Mahjong decision system. The Discard model is the one trained by transfer learning, while the other four models are rule-based, both because there are certain constraints on taking those actions and because only a small amount of data can be collected for them. Some of the rules are listed in this paper, as shown in Table 3.

The above models of Eat, Pong, Kong, Listen and Discard are combined into a complete Mass Mahjong decision system.

4.3 Optimized design and implementation of Mass Mahjong models

After the Mass Mahjong decision system is constructed, data is generated through self-play to optimize the Discard model, so that the Mass Mahjong Discard model can be better adapted to the rules of Mass Mahjong. The training process of the self-play model optimization algorithm is shown in Algorithm 2.

5 RESULTS AND ANALYSIS

5.1 Data Processing

Due to the small amount of raw data, no further data cleaning is carried out in the data processing. The data processing mainly performs semantic segmentation of the game situation and feature extraction on the raw data. The Mass Mahjong data used in this paper comes from the game records of the 2020 National University Computer Gaming Mass Mahjong Competition. The raw data is a log in JSON format, with information including: one's hand, one's melds, one's seat number, and the game record so far. The raw data format is shown in Figure 3 below.

Figure 3: Raw data format
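For illustration, such a log entry might be parsed as follows. The field names here are hypothetical placeholders of ours — the actual schema is the one shown in Figure 3, which is not reproduced in the text.

```python
import json

# Hypothetical field names standing in for the schema of Figure 3.
raw = """{
    "hand": ["B1", "B2", "B3", "D5", "D5"],
    "melds": [["C7", "C8", "C9"]],
    "seat": 2,
    "history": [[0, "Discard", "D9"], [1, "Pong", "D9"]]
}"""

record = json.loads(raw)
hand = record["hand"]        # the player's current tiles
melds = record["melds"]      # tile combinations already melded (Eat/Pong/Kong)
seat = record["seat"]        # the player's seat number
history = record["history"]  # the game record so far
```

From such a record, the feature-extraction step fills the 3*9 planes of Table 1 (hand, discards, melds, remaining tiles) for each decision point in the game.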


To improve the generalization ability of the model, data augmentation is carried out on the processed data. The three suits of tiles, Dot, Bamboo and Character, have the same rank, and the numbers one to nine are symmetrical in structure. Referring to AlphaZero's data augmentation method, the three suits can be rotated and the numbers one to nine mirrored symmetrically, giving a total of 48 situations; i.e., for any situation there are 47 other positions equivalent to it.

5.2 Transfer learning model experiment

After data processing, the calibration data are divided; a total of 900,000 calibration samples are obtained. A randomly selected 80% of the data was used as the training set and 20% as the test set. The model was written in PyTorch, using the parameters after weight transfer, and was trained on a single GEFORCE RTX 2080 GPU. The test-set accuracy as training progresses is shown in Figure 4.

Figure 4: Transfer learning experiment results

Although an accuracy of 84% was achieved on the test set, due to the low quality of the data the model needs to be optimized and improved by subsequent self-play games.

In order to provide conditions for the subsequent self-play, this paper constructs the Mass Mahjong judge system. The judge system can simulate the environment of a Mass Mahjong game and is responsible for interacting with the Mass Mahjong decision system; together they form a complete Mass Mahjong game. The Mass Mahjong judge system mainly includes functions such as drawing tiles, judging wins, calculating scores, and judging whether the next decision is legal. The overall design is shown in Figure 5 below.

Figure 5: Overall design idea of Mass Mahjong judge system

In this paper, we use self-play games to produce data, and then use the data of the winning players to train the model for optimization. Ten processes were used for training, and one training session was conducted 250 times per game for 40 rounds each, with a learning rate of 0.001. Using JJWorld's testing software (the National Competition's officially supported testing software), the games were tested for 30 rounds with the four players randomly seated. The unoptimized AIs were the players named TEST 2 and TEST 4, and the optimized AIs were TEST 1 and TEST 3; TEST 3 was trained longer than TEST 1. The final results are shown in Table 4.

As can be seen in Table 4, the unoptimized models won fewer times and scored lower than the optimized models. The optimized models win more often and score higher; moreover, the longer a model is trained, the better it is optimized and the more it learns the rules of Mass Mahjong.

5.3 Results of the practical competition

The Mass Mahjong decision system designed in this paper was entered into the 2021 National University Computer Gaming Competition, where the Mass Mahjong agent won the second prize. Some of the head-to-head results are shown in Table 5 below.

Among them, BISTU-Mahjong1 was last year's runner-up team, based on a deep learning model, and BISTU-Mahjong2 was last year's rule-based entry, which finished in 5th place. The Mass Mahjong decision system based on transfer learning proposed in this paper performs better than both, indicating that the algorithm proposed in this paper has a certain effect.

When playing against other teams' AIs, the model performed only averagely. The reasons were analyzed as follows: on the one hand, there was not enough time for model training; on the other hand, the Eat, Pong and Kong models are implemented with rules and are not well coordinated with the Discard model. In the end, the Mass Mahjong decision system based on transfer learning proposed in this paper won the second prize.

6 CONCLUSION

In this paper, in order to solve the problem of rule differences between adjacent Mahjong domains combined with a lack of data, transfer learning is used to remove features specific to the

Table 4: Competition results for AIs with different levels of learning

Team Name   Total Score   Total Number Of Wins   Winning Number
TEST 3      67            84                     14
TEST 1      -2            42                     7
TEST 4      -11           30                     6
TEST 2      -54           30                     5

Table 5: Comparison Results

Team Name        Total Score
DIHU             170
BISTU-Mahjong1   46
BISTU-Mahjong2   -105
Example AI       -111

transferred source model; the model is then migrated to the new Mahjong variant, and finally the model is further optimized by building a self-play system. In this paper, the Bloody Mahjong model was transferred to the Mass Mahjong model, which eventually won the second prize in the 2021 National University Computer Gaming Mass Mahjong Competition. Since some of the models used in this paper are rule-based, in future work deep learning models could be trained to replace the current rule-based Eat, Pong, Kong and Listen models, so that the models coordinate well with each other and the overall decision-making ability of the system improves.

ACKNOWLEDGMENTS

This work is supported by Normal projects of promoting graduated education program at Beijing Information Science and Technology University (NO. 5212010937), by Normal projects of General Science and Technology research program (NO. KM201911232002), by Construction Project of computer technology specialty (NO. 5112011019), and by Normal projects of promoting graduated education program at Beijing Information Science and Technology University (NO. 5112011041).

REFERENCES
[1] Silver D, Huang A, Maddison C J, et al. Mastering the game of Go with deep neural networks and tree search[J]. Nature, 2016, 529(7587): 484-489.
[2] Zha D, Xie J, Ma W, et al. DouZero: Mastering DouDizhu with Self-Play Deep Reinforcement Learning[J]. arXiv preprint arXiv:2106.06135, 2021.
[3] Silver D, Hubert T, Schrittwieser J, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm[J]. arXiv preprint arXiv:1712.01815, 2017.
[4] Van der Kleij A A J. Monte Carlo tree search and opponent modeling through player clustering in no-limit Texas hold'em poker[J]. University of Groningen, The Netherlands, 2010.
[5] Li J, Koyamada S, Ye Q, et al. Suphx: Mastering Mahjong with Deep Reinforcement Learning[J]. arXiv preprint arXiv:2003.13590, 2020.
[6] Gao S, Li S. Bloody Mahjong playing strategy based on the integration of deep learning and XGBoost[J]. CAAI Transactions on Intelligence Technology, 2021.
[7] Qingyue Wang. Game of Mahjong[M]. Chengdu: Shurong Chess Publishing House, 2003.
[8] Gao S, Okuya F, Kawahara Y, et al. Supervised Learning of Imperfect Information Data in the Game of Mahjong via Deep Convolutional Neural Networks[J]. Information Processing Society of Japan, 2018.
[9] Gao S, Okuya F, Kawahara Y, et al. Building a Computer Mahjong Player via Deep Convolutional Neural Networks[J]. arXiv preprint arXiv:1906.02146, 2019.
[10] Wang M, Yan T, Luo M, et al. A novel deep residual network-based incomplete information competition strategy for four-players Mahjong games[J]. Multimedia Tools and Applications, 2019: 1-25.
[11] Huang G, Liu Z, Van Der Maaten L, et al. Densely connected convolutional networks[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 4700-4708.
