
Linguistic Rules-Based Corpus Generation for Native Chinese Grammatical Error Correction

Shirong Ma1∗, Yinghui Li1∗, Rongyi Sun1, Qingyu Zhou2, Shuling Huang1, Ding Zhang1,
Yangning Li1, Ruiyang Liu4, Zhongli Li2, Yunbo Cao2, Hai-Tao Zheng1,3†, Ying Shen5†

1 Tsinghua Shenzhen International Graduate School, Tsinghua University
2 Tencent Cloud Xiaowei
3 Peng Cheng Laboratory
4 Department of Computer Science and Technology, Tsinghua University
5 School of Intelligent Systems Engineering, Sun Yat-sen University
{masr21,liyinghu20}@mails.tsinghua.edu.cn
Abstract

Chinese Grammatical Error Correction (CGEC) is both a challenging NLP task and a common application in human daily life. Recently, many data-driven approaches have been proposed for the development of CGEC research. However, there are two major limitations in the CGEC field: First, the lack of high-quality annotated training corpora prevents the performance of existing CGEC models from being significantly improved. Second, the grammatical errors in widely used test sets are not made by native Chinese speakers, resulting in a significant gap between the CGEC models and the real application. In this paper, we propose a linguistic rules-based approach to construct large-scale CGEC training corpora with automatically generated grammatical errors. Additionally, we present a challenging CGEC benchmark derived entirely from errors made by native Chinese speakers in real-world scenarios. Extensive experiments1 and detailed analyses not only demonstrate that the training data constructed by our method effectively improves the performance of CGEC models, but also reflect that our benchmark is an excellent resource for further development of the CGEC field.

1 Introduction

In the field of theoretical linguistics, the native speaker is the authority of the grammar.
- Noam Chomsky

Chinese Grammatical Error Correction (CGEC) aims to automatically correct grammatical errors that violate language rules and convert noisy input texts to clean output texts (Wang et al., 2020c). In recent years, CGEC has attracted more and more attention from NLP researchers due to its broad applications in all kinds of daily scenarios and downstream tasks (Duan and Hsu, 2011; Kubis et al., 2019; Omelianchuk et al., 2020).

With the progress of deep learning, data-driven methods based on neural networks, e.g., the Transformer (Vaswani et al., 2017), have become the mainstream for CGEC (Zhao and Wang, 2020; Tang et al., 2021; Zhang et al., 2022; Li et al., 2022a). However, we argue that two problems remain in CGEC: (1) For model training, owing to the limited number of real sentences containing grammatical errors, the long-term lack of high-quality annotated training corpora hinders many data-driven models from exercising their capabilities on the CGEC task. (2) For model evaluation, the widely used benchmarks such as NLPCC (Zhao et al., 2018) and CGED (Rao et al., 2018, 2020) are all derived from grammatical errors made by foreign Chinese learners (i.e., L2 learners) in their process of learning Chinese; the gap between the language usage habits of L2 learners and Chinese native speakers makes the performance of CGEC models in real scenarios unpredictable.

As illustrated in Table 1, the samples of NLPCC and CGED both come from L2 learners, so their sentence structures are relatively short and simple. More crucially, the grammatical errors in these samples are obvious and naive. On the other hand, in the third example, the erroneous sentence is fluent on the whole, which shows that the grammatical errors made by native speakers are more subtle: they are in line with the habits of people's daily communication, but they do not conform to linguistic norms. Therefore, in the broader scenarios of Chinese usage beyond foreigners learning Chinese, we believe that wrong sentences made by native speakers are more valuable and can better evaluate model performance than errors made by L2 learners.

∗ indicates equal contribution. Work is done during Yinghui's internship at Tencent Cloud Xiaowei.
† Corresponding authors: Hai-Tao Zheng and Ying Shen. (E-mail: zheng.haitao@sz.tsinghua.edu.cn, sheny76@mail.sysu.edu.cn)
1 Our dataset and source codes are available at https://github.com/masr2000/CLG-CGEC.
NLPCC (Source: L2 learners)
  Incorrect:   那个消息给我打冲击。
  Translation: The news gave me a hit shock.
  Correct:     那个消息给我很大冲击。
  Translation: The news gave me a big shock.

CGED (Source: L2 learners)
  Incorrect:   他非常被日本的风景吸引了。
  Translation: He was very attracted by the Japanese landscape.
  Correct:     他深深地被日本的风景吸引了。
  Translation: He was deeply attracted by the Japanese landscape.

Native Error (Source: Native speakers)
  Incorrect:   站在即将到来的2017年的起跑线上,我们不由得情不自禁地感到自豪和喜悦。
  Translation: Standing at the starting line of upcoming 2017, we cannot have to help feeling proud and joyful.
  Correct:     站在2017年的起跑线上,我们情不自禁地感到自豪和喜悦。
  Translation: Standing at the starting line of 2017, we cannot help feeling proud and joyful.

Table 1: Examples of grammatical errors from NLPCC, CGED, and Chinese native speakers, respectively.

To alleviate the dilemma of missing large-scale training data, we propose CLG, a novel approach based on Chinese linguistic rules that automatically constructs high-quality ungrammatical sentences from grammatical corpora. Specifically, following authoritative linguistic books (Huang and Liao, 2011; Shao, 2016), we divide Chinese grammatical errors into 6 categories and design detailed grammatical rules to generate corresponding erroneous sentences according to the characteristics of each error type. Our 6 error types are: Structural Confusion, Improper Logicality, Missing Component, Redundant Component, Improper Collocation, and Improper Word Order. Different from traditional data augmentation, ungrammatical sentences generated by CLG more closely match the actual errors that Chinese native speakers would make. Benefiting from our proposed CLG, high-quality and large-scale training samples are automatically constructed with annotated error types. Moreover, to fill the gap between existing benchmarks and practical applications, we collect a test dataset containing grammatical errors made by native Chinese speakers in real scenarios, named NaCGEC, which will be a more challenging benchmark and a meaningful resource to facilitate further development of CGEC.

We conduct extensive experiments to demonstrate the effectiveness of CLG and the challenge of NaCGEC. Quantitative experiments show that models trained on our generated corpus perform better than those trained on traditional CGEC datasets, and compared with general data augmentation methods, the training data obtained by CLG brings larger performance improvements. In addition, qualitative analyses illustrate that it is more difficult for well-educated Chinese native speakers to identify grammatical errors in NaCGEC than in previously existing benchmarks, which indicates that errors in NaCGEC are closer to the real mistakes that native speakers would make in their daily life. We believe that our proposed corpus generation approach and benchmark can greatly contribute to the development of CGEC methods.

2 Related Work

2.1 CGEC Resources

Compared with English Grammatical Error Correction (EGEC), data resources for CGEC are still lacking. NLPCC (Zhao et al., 2018) provides a test set containing 2K sentences and a large-scale dataset collected from the Lang-8 website for model training. CGED (Rao et al., 2018, 2020) is an evaluation dataset focusing on error diagnosis, which contains 5K sentences from the HSK corpus (Cui and Zhang, 2011; Zhang and Cui, 2013); the entire HSK corpus can also be utilized for model training. YACLC (Wang et al., 2021) collects and annotates 32K sentences from Lang-8 to construct a CGEC dataset. The latest MuCGEC (Zhang et al., 2022) selects and re-annotates sentences from the NLPCC, CGED, and Lang-8 corpora to obtain a multi-reference evaluation dataset with 7K sentences. To the best of our knowledge, no existing resources focus on the grammatical errors made by Chinese native speakers; all of the above-mentioned datasets originate from errors made by L2 learners.
2.2 CGEC Methods

CGEC can be considered a seq2seq task. Some existing works employ CNN-based (Ren et al., 2018) or RNN-based (Zhou et al., 2018) models to resolve the CGEC task. Most later work (Wang et al., 2020a; Tang et al., 2021; Zhao and Wang, 2020) employs the Transformer (Vaswani et al., 2017), which has been a great success in Machine Translation. Those studies also propose data augmentation approaches to extend the training data and improve model performance. Recently, researchers have started to treat CGEC as a seq2edit task that iteratively predicts a modification label for each position of the sentence. Similar to GECToR (Omelianchuk et al., 2020), Liang et al. (2020) utilize a seq2edit model for CGEC. Zhang et al. (2022) directly adopt GECToR for CGEC and enhance it by using pretrained language models. TtT (Li and Shi, 2021) proposes a non-autoregressive CGEC approach by employing the BERT (Devlin et al., 2019) encoder with a CRF (Lafferty et al., 2001). Li et al. (2022a) propose a sequence-to-action model to resolve the CGEC task, which combines the advantages of both seq2seq and seq2edit approaches. Unlike previous works, we are the first to focus on the linguistic rules of Chinese grammar and exploit them to automatically obtain high-quality training corpora that improve the performance of CGEC models.

3 Automatic Corpus Generation and Benchmark Construction

3.1 Schema Definition

According to authoritative linguistic books (Huang and Liao, 2011; Shao, 2016), Chinese grammatical errors are categorized into 7 types: Structural Confusion, Improper Logicality, Missing Component, Redundant Component, Improper Collocation, Improper Word Order, and Ambiguity. It is worth noting that errors of ambiguity are often caused by a lack of context information, so correcting such errors requires additional knowledge beyond grammar, which is beyond the essence of the CGEC task. Therefore, we do not consider this type of error. In addition, there is a class of common errors, i.e., spelling errors, which are mainly caused by various confusing characters with similar strokes/pronunciations (Li et al., 2022b). But Wang et al. (2018) have comprehensively studied how to automatically generate large-scale training data containing spelling errors. Since how to automatically generate high-quality training data is one of our core contributions, we also do not need to focus on spelling errors in our study.

From the linguistic point of view, the schema of these 6 error types is explained as follows:

(1) Structural Confusion (结构混乱) means mixing two or more different syntactic structures in one sentence, which results in a confusing sentence structure.

(2) Improper Logicality (不合逻辑) means that the meaning of a sentence is inconsistent or does not conform to objective reasoning.

(3) Missing Component (成分残缺) means that the sentence structure is incomplete and some grammatical components are missing.

(4) Redundant Component (成分冗余) refers to the addition of unnecessary words or phrases to a well-structured sentence.

(5) Improper Collocation (搭配不当) means that the collocation between some components of a sentence does not conform to the structural rules or grammatical conventions of Chinese.

(6) Improper Word Order (语序不当) mainly refers to the ungrammatical order of words or clauses in a sentence.

Some example sentences containing various grammatical errors and the corresponding corrections are presented in Table 2. These examples are all selected from our proposed benchmark, which is described in Section 3.4.

Next, we further introduce the process of our proposed CLG and the details of NaCGEC.

3.2 Correct Sentences Collection

To obtain perfect correct/ungrammatical sentence pairs for model training, we must make sure that the correct sentences are as free of grammatical errors as possible. However, collecting and annotating large-scale correct sentences is an extremely time-consuming and expensive process. To address this issue, we first accumulate massive amounts of high-quality Chinese sentences from public datasets (Xu, 2019), such as the Chinese People's Daily corpus, a Chinese machine translation dataset, and the Chinese wiki corpus, as the raw corpora. Then we randomly selected 1,000 sentences from the raw corpora for human judgment by the annotators.
Structural Confusion
  Incorrect:   食用水果前应该洗净削皮较为安全。
  Translation: Fruit should be washed and peeled before eating is safer.
  Correct:     食用水果前应该洗净削皮。
  Translation: Fruit should be washed and peeled before eating.

Improper Logicality
  Incorrect:   集团向社会各界人士、沿途村庄百姓表示歉意。
  Translation: The group apologizes to people from all walks of life and villagers along the way.
  Correct:     集团向社会各界人士表示歉意。
  Translation: The group apologizes to people from all walks of life.

Missing Component
  Incorrect:   该节目成功地实现了收视冠军。
  Translation: The program has successfully achieved a ratings champion.
  Correct:     该节目成功地实现了收视冠军的目标。
  Translation: The program has successfully achieved the goal of being a ratings champion.

Redundant Component
  Incorrect:   昨天是转会截止日期的最后一天。
  Translation: Yesterday was the last day of the transfer deadline.
  Correct:     昨天是转会的最后一天。
  Translation: Yesterday was the last day of the transfer.

Improper Collocation
  Incorrect:   丝绸之路开拓了千古传诵的壮美篇章。
  Translation: The Silk Road has opened a magnificent chapter that has been passed down through the ages.
  Correct:     丝绸之路谱写了千古传诵的壮美篇章。
  Translation: The Silk Road has written a magnificent chapter that has been passed down through the ages.

Improper Word Order
  Incorrect:   学校三个月内要求每名学生完成20个小时的义工服务。
  Translation: The school in three months requires each student to complete 20 hours of volunteer service.
  Correct:     学校要求每名学生三个月内完成20个小时的义工服务。
  Translation: The school requires each student to complete 20 hours of volunteer service in three months.

Table 2: Examples of sentences with various types of grammatical errors.

The result that more than 97% of these sentences are grammatically correct shows that, in a statistical sense, about 97% of the sentences in the raw corpus do not contain grammatical errors. Further, to ensure the quality of the finally selected sentences, we only keep sentences whose perplexity values are in the top 90% in reverse order (i.e., the 90% of sentences with the lowest perplexity) as our original correct corpus. Finally, we acquire about 590K correct Chinese sentences from public corpora and guarantee their quality to a certain extent.
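The sketch below illustrates this filtering step. The paper does not name the language model used to score perplexity, so the Chinese GPT-2 checkpoint here is only an assumed stand-in, and the "top 90% (reverse order)" selection is read as keeping the lowest-perplexity 90% of sentences.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "uer/gpt2-chinese-cluecorpussmall"  # assumed scoring model
tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModelForCausalLM.from_pretrained(MODEL).eval()

@torch.no_grad()
def perplexity(sentence: str) -> float:
    ids = tok(sentence, return_tensors="pt").input_ids
    loss = lm(ids, labels=ids).loss  # mean per-token cross-entropy
    return math.exp(loss.item())

def filter_corpus(sentences, keep_ratio=0.9):
    """Keep the keep_ratio fraction of sentences with the lowest perplexity."""
    ranked = sorted(sentences, key=perplexity)
    return ranked[: int(len(ranked) * keep_ratio)]
```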
3.3 Ungrammatical Sentence Generation

After collecting correct sentences, we use the THULAC toolkit (Sun et al., 2016) for word segmentation and POS tagging, so that the six types of sentence components (subject, predicate, object, attribute, adverbial, complement) can be identified. Then, for each category of grammatical errors, we design methods for constructing incorrect sentences from correct sentences according to grammatical rules and language conventions, as follows:

(1) For Structural Confusion, we collect common correct sentence structures and mix two structures in one sentence to obtain an erroneous sentence that is nevertheless smooth and confusing.

(2) For Improper Logicality, we collect patterns with logical errors and match a pattern to a correct sentence to modify it into a wrong one.

(3) For Missing Component, we pick a component from the sentence to remove, or add words to the sentence to cover up a sentence component.

(4) For Redundant Component, we select a notional word or meaningful connective in a sentence, and then insert a synonym before or after it to obtain an extra component.

(5) For Improper Collocation, we summarize abundant common collocations between sentence components; we match a collocation that appears in the sentence and randomly replace part of it with a wrong one.

(6) For Improper Word Order, we switch the order of some important sentence components to make the sentence ungrammatical (e.g., reversing the order of related words or the order of an attribute and a head word).

Due to the page limitation, in Figure 1 we only present examples of generating sentences containing two types of errors, "Improper Word Order" and "Improper Logicality"; the other examples are shown in Appendix A.1. A minimal code sketch of this match-and-rewrite paradigm is given below.
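The sketch below implements one fine-grained Improper Word Order rule (the Attributive-head Word reversal described in Appendix A.2) on top of THULAC. It is a simplified illustration under assumed POS conventions, not the released rule set.

```python
import thulac  # the segmenter / POS tagger used in the paper

tagger = thulac.thulac()  # cut() returns [[word, pos], ...]

def make_improper_word_order(sentence):
    """Sketch of one rule (Attributive-head Word, 定语与中心语次序不当):
    find an "attribute + 的 + head noun" span and swap the attribute
    with the head noun, e.g. 壮美的篇章 -> 篇章的壮美.
    Returns None when the rule does not apply, so other rules can
    be tried on the sentence instead."""
    toks = tagger.cut(sentence)
    for i in range(len(toks) - 2):
        head_word, head_pos = toks[i + 2]
        if toks[i + 1][0] == "的" and head_pos == "n":
            swapped = toks[:i] + [toks[i + 2], toks[i + 1], toks[i]] + toks[i + 3:]
            return "".join(w for w, _ in swapped)
    return None
```

In the real pipeline, each rule additionally records the error type of the pair it produces, which is how the generated corpus obtains its error-type labels for free.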
[Figure 1 shows two dependency-parse examples: (a) Improper Word Order, where 我把男生作为恋爱对象 (I treat boys as love objects) is rewritten as 我把恋爱对象作为男生 (I treat love objects as boys); (b) Improper Logicality, where 船桨和风帆等船舶用品 (oars, sails, etc. are marine supplies) is rewritten as 船桨和风帆以及船舶用品 (oars, sails and marine supplies).]
Figure 1: Examples of generating ungrammatical sentences containing different types of errors. In the figures,
NOUN represents “名词(noun)”, VERB represents “动词(verb)”, PRON represents “代词(pronoun)”, CCONJ
represents “连词(conjunction)”, X represents “助动词(auxiliary verb)”.

Based on the above methods, we further design fine-grained grammatical error rules, which are presented in Appendix A.2, and develop the corresponding algorithms2 to change correct sentences into ungrammatical sentences. Through this paradigm of automatically generating erroneous sentences of the corresponding grammatical error types from correct sentences, we obtain not only sentence pairs that can be used for model training, but also the labels of the grammatical error types of the corresponding sentence pairs. Moreover, note that the rules mentioned in the above methods can be applied in combination to construct more complex erroneous instances that have multiple grammatical errors in one sentence.

2 The implementation of the specific algorithms is provided in the supplementary materials we submitted.

3.4 Benchmark Collection and Annotation

Another core contribution of our work is that we construct the test dataset NaCGEC, in which the grammatical errors are all made by native Chinese speakers in real-world scenarios. In this part, we describe how we collect and annotate NaCGEC.

Our benchmark focuses on native Chinese speakers' texts. To ensure that the language characteristics of NaCGEC are consistent with native Chinese speakers, we collect data from the following three real scenarios:

(1) Entrance examinations for secondary schools and universities: In the various entrance examinations for Chinese students, there is a special type of question, namely grammatical error diagnosis questions. These questions provide us with a natural and real corpus with grammatical errors. Therefore, we collect the grammatical error questions of the middle school and university entrance examinations, and the questions in the supporting teaching materials, from the past 10 years for the construction of our test corpus.

(2) Recruitment examinations for government departments: Similar to the entrance examinations for students, in China the exams for civil servants employed by government agencies also include similar grammar questions. Different from the entrance exams, the question text of the civil service recruitment exams is closer to the domain of government official documents, while the questions of the entrance exams are closer to the domain of education. Likewise, we also collect the texts of real test questions from civil service examinations and their auxiliary materials over the past 10 years.

(3) Various Chinese news sites: To guarantee the data size and domain diversity of our benchmark, we hired more than 20 data collectors and asked them to scour various mainstream Chinese news sites for sentences with grammatical errors. To ensure that they could find ungrammatical sentences accurately, before they officially started their work, we systematically taught them grammar knowledge and asked them to pass a test of grammatical error diagnosis questions set by us.

It is worth mentioning that the data collection process lasted for more than 6 months.
After obtaining the ungrammatical corpus, we employ highly educated university students majoring in Chinese linguistics, who are familiar with Chinese grammatical rules, to check the collected corpus and annotate the corresponding error types according to our defined schema. The details of the annotation process are shown in Appendix A.3. We finally obtain 6,767 ungrammatical sentences and the corresponding correction results to compose the NaCGEC benchmark.

4 Data Analyses

4.1 Corpus Statistics

Table 3 reports data statistics such as the number of sentences, average length, and average edit distance (Levenshtein, 1966) of the corpora we constructed, including the automatically generated training data and the manually acquired test data.

                          CLG-Train   NaCGEC-Test
Number of Sentences       591,404     6,767
Erroneous Sentences       591,404     6,496
Number of References      591,404     7,793
Average Length (Char.)    40.31       56.54
Edit Distance (Char.)     2.18        4.19
References / Sentence     1.00        1.20

Table 3: Statistics of the training corpus generated by CLG and the test data of NaCGEC.

In addition, we perform further analysis on the test samples in NaCGEC. Specifically, we analyze the character-level edit distance required to correct sentences with different types of errors. Note that a sentence with multiple errors is counted under each corresponding error type. From Table 4, we see that the average edit distance for most error types is between 2 and 4, while that for sentences with "Improper Word Order" is as high as 11.24. This indicates that such errors require significant adjustments to sentence structure, which is a challenge for CGEC models.

                      Replace  Insert  Delete  Total
Structural Confusion  0.44     0.87    1.55    2.86
Improper Logicality   0.60     0.90    2.06    3.56
Missing Component     0.10     2.10    0.42    2.62
Redundant Component   0.09     0.06    2.00    2.15
Improper Collocation  1.28     0.79    0.79    2.86
Improper Word Order   0.46     5.37    5.42    11.24

Table 4: Average edit distance for sentences containing various types of grammatical errors. Edit operations include Replace, Insert, and Delete.

4.2 Human Evaluation

To verify that NaCGEC more closely matches the language style of Chinese native speakers and the grammatical errors that native speakers would make, we conduct a human evaluation experiment.

Specifically, we randomly sample 300 correct sentences and 300 incorrect sentences from the test sets of NLPCC, CGED, and NaCGEC respectively. We carefully check these sentences to ensure that their correct and incorrect labels are accurate from the perspective of native Chinese speakers. Then we invite 3 Chinese native speakers to perform the following two annotation tasks for these sentences: (1) Annotators determine whether a sentence contains grammatical errors or not. (2) Annotators determine whether the language style of a sentence matches that of a native speaker, giving a score of 0, 1, or 2 depending on the degree of match. We compare the annotators' judgments of whether the sentences contain grammatical errors with the corresponding ground truth, and calculate the Precision, Recall, and F1 of the annotators' binary classification. The results of the 3 annotators are averaged and reported in Table 5.

         Pre    Rec    F1     Score
NLPCC    78.57  82.76  80.61  0.92
CGED     95.00  90.48  92.68  0.85
NaCGEC   72.86  68.00  70.34  1.78

Table 5: Results of the human evaluation experiment. Precision, Recall, and F1 measure how well annotators distinguish whether sentences contain grammatical errors, and the Score represents the extent to which annotators judge that sentences conform to the language habits of native speakers (0, 1, 2).

The results demonstrate that: (1) Compared to the NLPCC and CGED datasets, the test dataset of NaCGEC obtains a significantly higher language style score, suggesting that the sentences in NaCGEC are more in line with the language habits of native speakers. (2) Annotators who are Chinese native speakers have more difficulty distinguishing whether the sentences from our benchmark contain grammatical errors, indicating that the grammatical errors in our benchmark are more likely to be made by native speakers in their everyday writing.
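For reference, the character-level edit distance used in Tables 3 and 4 is the standard Levenshtein distance; a minimal dynamic-programming sketch:

```python
def edit_distance(src: str, tgt: str) -> int:
    """Character-level Levenshtein distance between two sentences."""
    prev = list(range(len(tgt) + 1))  # row for the empty prefix of src
    for i, cs in enumerate(src, 1):
        curr = [i]
        for j, ct in enumerate(tgt, 1):
            curr.append(min(prev[j] + 1,                # delete cs
                            curr[j - 1] + 1,            # insert ct
                            prev[j - 1] + (cs != ct)))  # substitute
        prev = curr
    return prev[-1]
```

Averaging edit_distance(incorrect, reference) over sentences of each error type would give the kind of per-type statistics reported in Table 4.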
Model                                Training Data      Pre    Rec    F0.5
Transformer (Vaswani et al., 2017)   Data Aug. (1000K)  3.50   1.49   2.76
                                     Lang8 (1220K)      8.22   1.04   3.44
                                     Lang8+HSK (1377K)  5.91   0.79   2.57
                                     CLG (591K)         17.19  6.20   12.69
                                     Lang8+CLG (1811K)  26.75  5.89   15.66
GECToR-Chinese (Zhang et al., 2022)  Data Aug. (1000K)  4.35   1.85   3.42
                                     Lang8 (1220K)      20.77  6.97   14.88
                                     Lang8+HSK (1377K)  22.01  8.73   16.88
                                     CLG (591K)         23.25  11.03  19.04
                                     Lang8+CLG (1811K)  27.71  12.19  22.09

Table 6: Experimental results on the NaCGEC benchmark. The number in parentheses is the number of sentences in the corresponding dataset.

5 Experiments

5.1 Experimental Setup

We evaluate the performance of two mainstream CGEC methods on our benchmark: a seq2seq model and a seq2edit model.

Transformer (Vaswani et al., 2017) is a widely used encoder-decoder model that is usually employed to resolve seq2seq tasks. Following previous work (Fu et al., 2018; Wang et al., 2020a; Tang et al., 2021), which treats CGEC as a monolingual translation task and employs the Transformer on NLPCC, we implement and train a Transformer as our seq2seq model.

GECToR-Chinese (Zhang et al., 2022) is a seq2edit model that treats CGEC as a sequence labeling task and predicts a modification label, including insertion, deletion, and substitution (Malmi et al., 2019), for each token of the sentence. The model adopts GECToR (Omelianchuk et al., 2020) and enhances it by applying a pretrained language model as the encoder.

Following the NLPCC shared task (Zhao et al., 2018), we employ the word-level MaxMatch (M2) scorer (Dahlmeier and Ng, 2012) for evaluation, which computes Precision, Recall, and F0.5 between the gold edit set and the system edit set. We train the models using the Lang8 dataset (Zhao et al., 2018), which is the official training dataset of NLPCC, the HSK dataset (Cui and Zhang, 2011; Zhang and Cui, 2013), and the training dataset generated by CLG. In addition, we construct a training dataset by data augmentation (Wang et al., 2020a; Tang et al., 2021) for comparison between our proposed CLG and traditional data augmentation methods. After the training stage, we test the trained models on the NaCGEC benchmark. The implementation details of our experiments are presented in Appendix A.4.

5.2 Experimental Results

The experimental results shown in Table 6 demonstrate that: (1) Existing models do not perform well on our benchmark, with F0.5 below 25, revealing that the NaCGEC benchmark is challenging. (2) For both Transformer and GECToR-Chinese, models trained on the CLG training dataset outperform models trained on other datasets, and the Transformer trained on CLG performs far better than when trained on other datasets. This reflects that the training data generated by our proposed CLG effectively helps models correct grammatical errors made by Chinese native speakers. (3) The CLG and Lang8 datasets can be jointly employed to train a model and further improve its performance, which demonstrates that the dataset constructed by CLG is compatible with existing datasets without conflicts.
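For reference, the final aggregation behind these scores can be sketched as below, where each edit is a hashable tuple such as (start, end, replacement). This omits the M2 scorer's search for the system edit sequence that maximally overlaps the gold annotations, so it is an illustration rather than the official scorer.

```python
def precision_recall_f05(system_edits, gold_edits):
    """P/R/F0.5 between a system edit set and a gold edit set."""
    system, gold = set(system_edits), set(gold_edits)
    tp = len(system & gold)                  # correctly proposed edits
    p = tp / len(system) if system else 1.0
    r = tp / len(gold) if gold else 1.0
    beta2 = 0.5 ** 2                         # F0.5 weighs precision higher
    f05 = (1 + beta2) * p * r / (beta2 * p + r) if p + r else 0.0
    return p, r, f05
```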
5.3 Analysis of Generalization Ability

Furthermore, to verify the generalizability of the data generated by CLG, we conduct an experiment on the NLPCC dataset. To be specific, before training with the Lang8 dataset, we pretrain the model using the CLG training corpus. The compared baselines include the Transformer without pretraining and the Transformer pretrained on a corpus constructed by a traditional data augmentation approach. Following previous work (Wang et al., 2020a; Tang et al., 2021), we implement the following data augmentation approach. After a correct sentence is segmented into words, one of the following operations is performed for each word in the sentence according to different probabilities: 70% no modification, 10% inserting a random word before the word, 10% replacing the word with a random word, and 10% deleting the word. A minimal sketch of this procedure is given below.
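In the sketch, `vocab` (the pool of random words) is a hypothetical argument standing in for the training vocabulary.

```python
import random

def augment(words, vocab):
    """Word-level noising baseline: for each word, keep it (70%),
    insert a random word before it (10%), replace it with a
    random word (10%), or delete it (10%)."""
    noisy = []
    for w in words:
        r = random.random()
        if r < 0.7:             # 70%: no modification
            noisy.append(w)
        elif r < 0.8:           # 10%: insert a random word before
            noisy.extend([random.choice(vocab), w])
        elif r < 0.9:           # 10%: replace with a random word
            noisy.append(random.choice(vocab))
        # else: 10%: delete the word (append nothing)
    return noisy
```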

                          Pre    Rec    F0.5   ∆F0.5
No Pretrained             34.17  13.41  26.09  -
Data Aug. (1000K)         41.49  14.48  30.21  +4.12
Data Aug. (1600K)         42.30  15.94  31.79  +5.70
CLG (591K)                38.24  16.64  30.36  +4.27
CLG + Data Aug. (1591K)   41.73  17.02  32.34  +6.25

Table 7: Experimental results of the Transformer on the NLPCC benchmark.

The results in Table 7 show that: (1) Pretraining the model with CLG improves the model's performance on the NLPCC test dataset, and the improvement is greater than pretraining the model with a larger corpus constructed by traditional data augmentation, which reflects the superiority of our CLG over traditional data augmentation methods on the CGEC task. (2) Compared with directly increasing the size of the corpus constructed by data augmentation, our CLG and the data augmentation approach can be combined to achieve better results. To summarize, the training data constructed by our proposed CLG brings significant and stable improvements to the model on both NaCGEC, which focuses on native speaker errors, and NLPCC, which focuses on foreign learner errors. This phenomenon demonstrates the good generalization ability of the data constructed by CLG.

5.4 Case Study and Fine-grained Analysis

Table 8 illustrates two cases of models trained on different datasets correcting erroneous sentences in NaCGEC. The first sentence contains an error of "Structural Confusion", where "对象...面向 (target...for)" is a mixture of two different sentence patterns. The model trained on the Lang8 dataset cannot recognize this error, while the model trained on CLG perceives the error and corrects the sentence properly. The second sentence involves an error of "Improper Word Order", reversing the order of "把这件事 (this matter)" and "你不要 (you do not)". The error is so difficult for the models that the model trained on Lang8 removes "你不要", which makes the corrected sentence grammatical but entirely changes the original sentence meaning. Meanwhile, the model trained on CLG does not correct it completely, but identifies the sentence as containing an "Improper Word Order" error and makes a partially appropriate correction.

Structural Confusion
  Incorrect:     其对象主要是面向低收入家庭 (Its target is mainly for low-income families)
  Correct:       其对象主要是低收入家庭 (Its target is mainly low-income families)
  Model (Lang8): 其对象主要是面向低收入家庭 (unchanged, wrong)
  Model (CLG):   其对象主要是低收入家庭 (correct)

Improper Word Order
  Incorrect:     请把这件事你不要放在心上 (I hope this matter you do not keep in mind)
  Correct:       请你不要把这件事放在心上 (I hope you do not keep this matter in mind)
  Model (Lang8): 请把这件事放在心上 (I hope you keep this matter in mind; meaning changed)
  Model (CLG):   请不要把这件事你放在心上 (approximately correct)

Table 8: Cases of models trained on different datasets correcting ungrammatical sentences, with each model prediction marked as wrong, correct, or approximately correct.

From these two cases, it can be further revealed that: (1) The distribution of grammatical errors in the Lang8 dataset differs significantly from the distribution of grammatical errors that occur in the everyday writing of native speakers. (2) The model trained with CLG can better identify and correct the grammatical errors of native speakers.

Additionally, we analyze the performance of the model on sentences with fine-grained grammatical error types.

                      Pre    Rec    F0.5
Structural Confusion  37.14  23.53  33.28
Improper Logicality   31.63  19.17  27.99
Missing Component     9.48   4.00   7.44
Redundant Component   27.75  15.90  24.15
Improper Collocation  7.82   3.30   6.13
Improper Word Order   19.82  5.65   13.20

Table 9: Model performance on different types of grammatical errors.

The results in Table 9 illustrate that the model performs particularly poorly on three types of error, i.e., "Missing Component", "Improper Collocation", and "Improper Word Order", which reveals that existing methods have an insufficient understanding of common grammatical collocations and insufficient knowledge of macroscopic sentence structure. We suggest that future research on the CGEC task should be more refined and pay more attention to the connections among the individual components of sentences and the overall structure of sentences.
6 Conclusion

In this paper, we propose CLG, an approach based on linguistic rules that automatically generates ungrammatical sentences from correct texts in order to obtain a large-scale CGEC training corpus. Besides, we collect sentences with grammatical errors written by Chinese native speakers and construct a more challenging CGEC benchmark, NaCGEC. Experimental results and detailed analyses indicate that NaCGEC is consistent with the language habits of Chinese native speakers, and that the training data generated by CLG effectively improves model performance on NaCGEC and other mainstream benchmarks. We hope our work provides better resources and a new direction for future CGEC research.

7 Limitations

7.1 Limitation of Language

This work focuses on the Chinese Grammatical Error Correction (CGEC) task. The grammatical error classification and the ungrammatical sentence generation approach CLG presented in this paper are based on Chinese linguistic rules and cannot be directly extended to other languages such as English. However, we believe that a linguistic perspective can also be introduced into GEC tasks for other languages for more in-depth analysis and exploration.

7.2 Limitation of Experiments

We conduct experiments on CGEC employing two mainstream approaches, the Transformer and GECToR-Chinese models. Limited by hardware resources, the total numbers of parameters of the two models we construct are around 54M and 108M respectively, and we do not try to build larger models or utilize model ensembles. Future work can construct larger models and use more tricks to improve performance on our benchmark.

8 Ethical Considerations

In this paper, we propose CLG, which automatically generates large-scale and high-quality CGEC training data based on our designed linguistic rules. Additionally, we present a human-annotated benchmark, NaCGEC, which focuses on the grammatical errors made by native Chinese speakers. We describe the details of our proposed corpus generation method and our constructed benchmark in the main text. It is worth noting that in the process of generating training data and labeling test data, all the original corpora used are from publicly available datasets or resources on legitimate websites, and no sensitive data is involved. Additionally, CGEC itself is a task that has long received attention and has a wide range of application scenarios. Our work focuses on an important issue that has been overlooked in previous CGEC work, namely grammatical errors made by native Chinese speakers. Therefore, NaCGEC is likely to directly facilitate CGEC research and further enhance the ability of future CGEC models to deal with hard errors made by native speakers.

Acknowledgement

This research is supported by the National Natural Science Foundation of China (Grant No. 62276154 and 62011540405), Beijing Academy of Artificial Intelligence (BAAI), the Natural Science Foundation of Guangdong Province (Grant No. 2021A1515012640), the Basic Research Fund of Shenzhen City (Grant No. JCYJ20210324120012033 and JCYJ20190813165003837), and the Overseas Cooperation Research Fund of Tsinghua Shenzhen International Graduate School (Grant No. HW2021008).

References

Xiliang Cui and Bao-lin Zhang. 2011. The principles for building the "international corpus of learner chinese". Applied Linguistics, 2:100–108.

Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Shijin Wang, and Guoping Hu. 2020. Revisiting pre-trained models for Chinese natural language processing. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 657–668, Online. Association for Computational Linguistics.

Daniel Dahlmeier and Hwee Tou Ng. 2012. Better evaluation for grammatical error correction. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 568–572, Montréal, Canada. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Huizhong Duan and Bo-June Paul Hsu. 2011. Online spelling correction for query completion. In Proceedings of the 20th International Conference on World Wide Web, WWW 2011, Hyderabad, India, March 28 - April 1, 2011, pages 117–126. ACM.

Joseph L Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378.

Kai Fu, Jin Huang, and Yitao Duan. 2018. Youdao's winning solution to the NLPCC-2018 task 2 challenge: A neural machine translation approach to Chinese grammatical error correction. In Natural Language Processing and Chinese Computing - 7th CCF International Conference, NLPCC 2018, Hohhot, China, August 26-30, 2018, Proceedings, Part I, volume 11108 of Lecture Notes in Computer Science, pages 341–350. Springer.

Borong Huang and Xudong Liao. 2011. Modern Chinese (Updated Fifth Edition). Higher Education Press, Beijing, China.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Marek Kubis, Zygmunt Vetulani, Mikolaj Wypych, and Tomasz Zietkiewicz. 2019. Open challenge for correcting errors of speech recognition systems. In Human Language Technology. Challenges for Computer Science and Linguistics - 9th Language and Technology Conference, LTC 2019, Poznan, Poland, May 17-19, 2019, Revised Selected Papers, volume 13212 of Lecture Notes in Computer Science, pages 322–337. Springer.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), Williams College, Williamstown, MA, USA, June 28 - July 1, 2001, pages 282–289. Morgan Kaufmann.

J Richard Landis and Gary G Koch. 1977. The measurement of observer agreement for categorical data. Biometrics, pages 159–174.

VI Levenshtein. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, page 707.

Jiquan Li, Junliang Guo, Yongxin Zhu, Xin Sheng, Deqiang Jiang, Bo Ren, and Linli Xu. 2022a. Sequence-to-action: Grammatical error correction with action guided sequence generation. CoRR, abs/2205.10884.

Piji Li and Shuming Shi. 2021. Tail-to-tail non-autoregressive sequence prediction for Chinese grammatical error correction. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4973–4984, Online. Association for Computational Linguistics.

Yinghui Li, Qingyu Zhou, Yangning Li, Zhongli Li, Ruiyang Liu, Rongyi Sun, Zizhen Wang, Chao Li, Yunbo Cao, and Hai-Tao Zheng. 2022b. The past mistake is the future wisdom: Error-driven contrastive probability optimization for Chinese spell checking. In Findings of the Association for Computational Linguistics: ACL 2022, pages 3202–3213, Dublin, Ireland. Association for Computational Linguistics.

Deng Liang, Chen Zheng, Lei Guo, Xin Cui, Xiuzhang Xiong, Hengqiao Rong, and Jinpeng Dong. 2020. BERT enhanced neural machine translation and sequence tagging model for Chinese grammatical error diagnosis. In Proceedings of the 6th Workshop on Natural Language Processing Techniques for Educational Applications, pages 57–66, Suzhou, China. Association for Computational Linguistics.

Ilya Loshchilov and Frank Hutter. 2017. Fixing weight decay regularization in adam. CoRR, abs/1711.05101.

Eric Malmi, Sebastian Krause, Sascha Rothe, Daniil Mirylenka, and Aliaksei Severyn. 2019. Encode, tag, realize: High-precision text editing. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5054–5065, Hong Kong, China. Association for Computational Linguistics.

Kostiantyn Omelianchuk, Vitaliy Atrasevych, Artem Chernodub, and Oleksandr Skurzhanskyi. 2020. GECToR – grammatical error correction: Tag, not rewrite. In Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 163–170, Seattle, WA, USA → Online. Association for Computational Linguistics.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Z. Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 8024–8035.

Gaoqi Rao, Qi Gong, Baolin Zhang, and Endong Xun. 2018. Overview of NLPTEA-2018 share task Chinese grammatical error diagnosis. In Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications, pages 42–51, Melbourne, Australia. Association for Computational Linguistics.

Gaoqi Rao, Erhong Yang, and Baolin Zhang. 2020. Overview of NLPTEA-2020 shared task for Chinese grammatical error diagnosis. In Proceedings of the 6th Workshop on Natural Language Processing Techniques for Educational Applications, pages 25–35, Suzhou, China. Association for Computational Linguistics.

Hongkai Ren, Liner Yang, and Endong Xun. 2018. A sequence to sequence learning for Chinese grammatical error correction. In Natural Language Processing and Chinese Computing - 7th CCF International Conference, NLPCC 2018, Hohhot, China, August 26-30, 2018, Proceedings, Part II, volume 11109 of Lecture Notes in Computer Science, pages 401–410. Springer.

Jingmin Shao. 2016. General Theory of Modern Chinese (Third Edition). Shanghai Educational Publishing House, Shanghai, China.

Maosong Sun, Xinxiong Chen, Kaixu Zhang, Zhipeng Guo, and Zhiyuan Liu. 2016. THULAC: An efficient lexical analyzer for Chinese.

Zecheng Tang, Yixin Ji, Yibo Zhao, and Junhui Li. 2021. Chinese grammatical error correction enhanced by data augmentation from word and character levels. In Proceedings of the 20th Chinese National Conference on Computational Linguistics, Hohhot, China, pages 13–15.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008.

Chencheng Wang, Liner Yang, Yingying Wang, Yongping Du, and Erhong Yang. 2020a. Chinese grammatical error correction method based on transformer enhanced architecture. Journal of Chinese Information Processing, 34(6):106.

Dingmin Wang, Yan Song, Jing Li, Jialong Han, and Haisong Zhang. 2018. A hybrid approach to automatic corpus generation for Chinese spelling check. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2517–2527, Brussels, Belgium. Association for Computational Linguistics.

Wei Wang, Bin Bi, Ming Yan, Chen Wu, Jiangnan Xia, Zuyi Bao, Liwei Peng, and Luo Si. 2020b. Structbert: Incorporating language structures into pre-training for deep language understanding. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.

Yingying Wang, Cunliang Kong, Liner Yang, Yijun Wang, Xiaorong Lu, Renfen Hu, Shan He, Zhenghao Liu, Yun Chen, Erhong Yang, and Maosong Sun. 2021. YACLC: A Chinese learner corpus with multidimensional annotation. CoRR, abs/2112.15043.

Yu Wang, Yuelin Wang, Jie Liu, and Zhuo Liu. 2020c. A comprehensive survey of grammar error correction. CoRR, abs/2005.06600.

Bright Xu. 2019. NLP Chinese corpus: Large scale Chinese corpus for NLP.

Bao-lin Zhang and Xiliang Cui. 2013. Design concepts of "the construction and research of the interlanguage corpus of Chinese from global learners". Language Teaching and Linguistic Study, 5:27–34.

Yue Zhang, Zhenghua Li, Zuyi Bao, Jiacheng Li, Bo Zhang, Chen Li, Fei Huang, and Min Zhang. 2022. MuCGEC: a multi-reference multi-source evaluation dataset for Chinese grammatical error correction. CoRR, abs/2204.10994.

Yuanyuan Zhao, Nan Jiang, Weiwei Sun, and Xiaojun Wan. 2018. Overview of the NLPCC 2018 shared task: Grammatical error correction. In Natural Language Processing and Chinese Computing - 7th CCF International Conference, NLPCC 2018, Hohhot, China, August 26-30, 2018, Proceedings, Part II, volume 11109 of Lecture Notes in Computer Science, pages 439–445. Springer.

Zewei Zhao and Houfeng Wang. 2020. MaskGEC: Improving neural grammatical error correction via dynamic masking. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, New York, NY, USA, February 7-12, 2020, pages 1226–1233. AAAI Press.

Junpei Zhou, Chen Li, Hengyou Liu, Zuyi Bao, Guangwei Xu, and Linlin Li. 2018. Chinese grammatical error correction using statistical and neural models. In Natural Language Processing and Chinese Computing - 7th CCF International Conference, NLPCC 2018, Hohhot, China, August 26-30, 2018, Proceedings, Part II, volume 11109 of Lecture Notes in Computer Science, pages 117–128. Springer.
Structbert: Incorporating language structures into
A Appendix

A.1 Other Examples of Ungrammatical Sentence Generation

Figure 2 presents some examples of generating ungrammatical sentences that are not mentioned in the body of this paper. In the figures, NOUN represents "名词(noun)", VERB represents "动词(verb)", PRON represents "代词(pronoun)", CCONJ represents "连词(conjunction)", ADP represents "介词(preposition)", X represents "助动词(auxiliary verb)", PART represents "助词(particle/auxiliary word)", and NUM represents "数量词(numeral)".

A.2 Fine-grained Grammatical Error Rules

As described in Section 3.3, we design rules for generating sentences containing the corresponding errors for each of the 6 types of grammatical errors. Moreover, each of these 6 types of errors can be divided into several fine-grained error types. As shown in Figure 3, these grammatical errors can be separated into a total of 26 fine-grained error types. For each fine-grained error type, we consider its characteristics and design the corresponding rules. These fine-grained grammatical error rules are explained as follows:

(1) For Structural Confusion, there are 3 fine-grained error types.

• Mixed Patterns (句式杂糅) means confusion in sentence structure caused by mixing different expressions of the same meaning together.
• Mixed Subjects (中途易辙) means a sentence has two subjects but no connectives.
• Mixed Sentences (两句合一) refers to the inappropriate merging of two sentences, that is, the structure of one sentence is complete, but its last part is used as the beginning of the other sentence.

(2) For Improper Logicality, there are 5 fine-grained error types.

• Measure Word (数词不当) refers to exact and approximate measure words being used in a sentence simultaneously.
• Unreasonable (不合事理) means the sentence does not conform to formal or moral logic, such as conceptual judgments.
• Improper Negation (否定失当) means that the sentence uses multiple negations, making the sentence meaning reversed.
• Reverse Host-guest (主客倒置) means that the positions of the host and guest described in a sentence are reversed.
• Imposing Cause and Effect (强加因果) means that the sentence forces a connection between two events that are not directly causally linked.

(3) For Missing Component, there are 4 fine-grained error types.

• Lack Subject (主语残缺) means that the sentence does not have a subject.
• Lack Predicate (谓语残缺) means that the sentence does not have a predicate.
• Lack Object (宾语残缺) means that the sentence does not have an object.
• Lack Modifier (修饰成分残缺) refers to sentences that lack necessary modifiers.

(4) For Redundant Component, there are 2 fine-grained error types.

• Multi Words (堆砌词语) refers to the use of two or more words in a sentence where one word would have been sufficient, resulting in redundancy.
• Multi Meanings (语义重复) means that two or more words of the same meaning are used in a sentence, or that the latter word already contains the meaning of the former word.

(5) For Improper Collocation, there are 5 fine-grained error types.

• Subject-predicate (主谓搭配不当) means the inappropriate pairing of subject and predicate in a sentence.
• Predicate-object (动宾搭配不当) and Subject-object (主宾搭配不当) are defined analogously to Subject-predicate.
• Modifier-head Word (修饰语中心语搭配不当) means the modifier does not correctly modify the head word in a sentence.
• Connectives (关联词语搭配不当) refers to the incorrect collocation of connectives used in a sentence.
[Figure 2 shows four dependency-parse examples: (a) Structural Confusion, where 本文分为五个部分 (This article is divided into five parts) is rewritten as 本文分为五个部分组成; (b) Redundant Component, where 发展中国家面临治理挑战 (Developing countries face governance challenges) is rewritten as 发展中国家面临遭遇治理挑战; (c) Missing Component, where 我在家里吃晚饭 (At home I have dinner) is rewritten as 我在家里晚饭; (d) Improper Collocation, where 我们的队伍吸取了教训 (Our team learns a lesson) is rewritten as 我们的队伍听取了教训.]

Figure 2: Other examples of generating ungrammatical sentences containing different types of errors.

(6) For Improper Word Order, there are 7 fine-grained error types.

• Multi Attributives (多重定语) means that the order of multiple attributives is wrong.
• Multi Adverbials (多重状语) means that the order of multiple adverbials is incorrect.
• Attributive-head Word (定语与中心语次序不当) means that the positions of the attributive and head word are reversed.
• Prepositions (介词短语位置不当) refers to wrong positions of prepositional phrases.
• Connectives-subject (关联词位置不当) refers to the improper location of connectives in clauses.
• Associated Words (副词位置不当) means that the associated words or phrases are not in the right order.
• Adverbial-attributives (定语状语次序不当) means that attributives and adverbials are misused for each other.

A.3 Details of NaCGEC Annotation Process

The resources of NaCGEC mainly come from three sources, namely entrance examinations, recruitment examinations, and various Chinese news sites. Note that the data from the entrance/recruitment examination scenarios already include specific error types, correct sentences, and even analyses of the reasons for the errors, because these data all come from real exam questions. Therefore, our annotation work is mainly carried out on data from news websites. To be specific, our annotation team consists of 5 annotators and 1 senior annotation referee. Our annotation workflow mainly consists of two parts:

(1) To obtain effective grammatical error labels, we ask each annotator to assign each sample the type of grammatical error he thinks the sample has, so for each sample we obtain 5 preliminary annotation results. We select the one with the most occurrences among these five results as the final labeling result for the sample. If there are multiple error types with the same number of occurrences, the senior referee decides the final error type for the sample based on his profound knowledge.

(2) To obtain reasonable reference correct sentences, we require each annotator to rewrite every sentence and give as many rewrites as they feel feasible. Finally, we select the rewriting results that appear twice or more as the reference standard sentences for the erroneous sentence.
[Figure 3 is a tree diagram grouping the 26 fine-grained error types under the six coarse categories listed in Appendix A.2.]

Figure 3: An overview of the fine-grained classification of different types of grammatical errors.

If the rewriting results given by the 5 annotators all differ from each other for a sample, we directly discard the sample after the senior referee has reviewed it.

Additionally, considering that the process of labeling grammatical error types is essentially a multi-label classification process, we use Fleiss' kappa (Fleiss, 1971) to verify the annotator agreement on the grammatical error type labels. The final Fleiss' kappa score is 0.823, which indicates that our annotation results can be regarded as "almost perfect agreement" (Landis and Koch, 1977).
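For completeness, Fleiss' kappa can be computed from an items-by-categories count matrix as sketched below (a standard formulation, not the authors' script):

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for an (N items x k categories) matrix where
    counts[i, j] is how many of the n raters put item i into
    category j (every row sums to n)."""
    counts = np.asarray(counts, dtype=float)
    n = counts[0].sum()                       # raters per item
    p_j = counts.sum(axis=0) / counts.sum()   # category proportions
    P_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)
```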
A.4 Implementation Details

For the Transformer, we implement the code using PyTorch (Paszke et al., 2019). The word embedding matrix shares weights between the source side and the target side, and the embedding dimension is set to 512. Both the encoder and the decoder of the Transformer contain 6 layers, and each layer contains 8 attention heads. The maximum input length is set to 200, and longer sentences are truncated. We train the model with the AdamW (Loshchilov and Hutter, 2017) optimizer for 20 epochs. The learning rate and dropout rate are set to 5e-4 and 0.1, respectively. In the inference phase, we use beam search as the decoding method, with the number of beams being 8.

For GECToR-Chinese, we utilize the code provided by the original authors (Zhang et al., 2022). Due to the limitation of hardware resources, we select MacBERT(Base) (Cui et al., 2020) as its encoder, unlike the original code, which uses StructBERT (Wang et al., 2020b). The maximum sentence length is also set to 200. Adam (Kingma and Ba, 2015) is employed as the model optimizer. The training process is divided into two stages: in the first stage, the parameters of the encoder are frozen and the model is trained for 2 epochs with a learning rate of 1e-3; in the second stage, the full parameters are tuned and the model is trained for 20 epochs with a learning rate of 2e-5.
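To make the seq2seq setup above concrete, the following is a sketch of the stated configuration; the vocabulary size and the data pipeline are placeholders, not values from the paper.

```python
import torch.nn as nn
from torch.optim import AdamW

VOCAB_SIZE = 32000  # hypothetical; taken from the training vocabulary

embedding = nn.Embedding(VOCAB_SIZE, 512)  # shared by source and target
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       dropout=0.1)
generator = nn.Linear(512, VOCAB_SIZE)     # projects back to the vocabulary

params = (list(embedding.parameters()) + list(model.parameters())
          + list(generator.parameters()))
optimizer = AdamW(params, lr=5e-4)         # trained for 20 epochs

# Inputs are truncated to 200 tokens; decoding uses beam search
# with 8 beams, as described above.
```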
