Tools For Disambiguating RFCs
ANRW ’21, July 24–30, 2021, Virtual Event, USA Jane Yen, Ramesh Govindan, and Barath Raghavan
communication with RFC editors to speed up the RFC publication process. RFC reviewers could benefit from less work to pinpoint textual ambiguity. Those who build reference implementations could leverage such automated ambiguity analysis to develop assertions and/or unit tests, or relax some constraints to accommodate different implementations for interoperability (Figure 1).

In this paper, we first review how a typical RFC is written and reviewed, and consider quantitatively how many drafts are reviewed per year. Then we consider what tools have the potential to improve the whole RFC authoring and publication process. We envision additional tools that can be used to reduce ambiguities from natural language descriptions and can be used to generate code, if applicable, for verifying logical/functional objectives. We then consider both what aspects of ambiguity have been addressed in prior research and what remains to be understood. In this context, we introduce our recent work, SAGE [27], which identifies ambiguities in protocol specification text with Combinatory Categorial Grammar (CCG) [3] analysis, disambiguates with a set of rules, and, given unambiguous specification text, performs per-sentence translation to C++ code. In that work we considered how to handle packet generation and session management across multiple protocols. However, SAGE is not fully general, nor is it usable by the average RFC author or editor. Thus, finally, we discuss what remains to be explored to further reduce the effort required of a human editor.

2 Overview
In this section, we describe what is currently entailed in writing an RFC (though our description is oversimplified). With an understanding of which procedure takes significant human effort, we point out what mechanisms would be helpful to improve the end-to-end RFC publication process.

2.1 RFC drafting and reviewing process
An RFC is generated to describe one or more networking behaviors, where such behaviors are usually related to how a networking system/protocol works. For RFC authors/drafters to design such a system/protocol, one common approach is to alternate between spec writing and implementation, revising both until they are satisfied with a final version (Figure 2). After a draft is generated it is often shared within a working group and then eventually makes its way to an RFC editor for review. An RFC editor may perform similar steps to comment on RFC drafts and to consider any unclear/confusing descriptions; the editor may request that the RFC’s authors revise according to those comments. In some cases, when a reference implementation is provided, software verification methods can also be applied to discover design flaws. This potentially helps RFC authors reflect changes in both RFC drafts and system code.

2.2 Human effort in specification review
To understand how much effort is put into publishing RFCs each year, we collected the number of reviewed RFCs from the RFC editor site (Table 1) [11]. We note that an RFC document is usually tens of pages of text. Therefore, roughly speaking, the networking community is drafting and reviewing thousands of pages of specification text in total every year. Many of these drafts likely contain language ambiguity, making this work quite challenging.

Year         2016   2017   2018   2019   2020
RFC Count     310    263    208    180    209
Table 1: RFCs reviewed per year from 2016 to 2020.

2.3 Tools in the publication process
We identify two additional kinds of tools that can be added in the RFC publication process to reduce the work that must be done by the people involved in the process.

Ambiguity identification. In the process of drafting and reviewing RFCs (Figure 2), RFC drafters may encounter a problem: readers may interpret a statement inconsistently with respect to the drafters’ intention. Before reaching agreement with editors or reviewers, drafters have to repeat the cycle of editing drafts and waiting for reviews through multiple rounds of feedback and revision. A tool that can reduce the tedious nature of this process may be one that can identify ambiguities in natural language sentences.

Compilation from natural language to code. RFC drafters commonly write their system or reference code and RFC drafts in parallel. While the drafters may believe their translations between RFC drafts and reference code are consistent, they cannot always guarantee that natural language sentences and implementations are logically equivalent. When such inconsistency occurs, as it always does, if no reference code is provided, readers of RFCs have to depend on natural language sentences and may build a potentially non-interoperable implementation. Even if reference code is provided, readers may also get confused about whether to depend upon the natural language text or the reference code, which may lead to the specification and the protocol "in the wild" diverging. To reduce the frequency of such inconsistencies, one useful tool could be a compiler that takes in natural language descriptions and turns the descriptions into an intermediate representation such as pseudocode, formal specifications, or executable code. Such a tool need not provide a unique implementation of the specification, but simply provide a piece of code that can be used to verify if the functionality is logically equivalent.
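One lightweight way to use compiler-produced code for checking logical equivalence is differential testing: feed the same inputs to two implementations and look for any input on which they disagree. The sketch below is only an illustration under stated assumptions. The function names are hypothetical, the "generated" routine is a hand-written stand-in for compiler output, and the Internet checksum (RFC 1071) is used simply because it is small. Random testing can reveal a divergence but cannot prove equivalence.

```python
import random


def checksum_reference(data: bytes) -> int:
    """Internet checksum: one's-complement sum of 16-bit big-endian words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def checksum_generated(data: bytes) -> int:
    """Hypothetical stand-in for generated code: folds the carry per word."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += int.from_bytes(data[i:i + 2], "big")
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def find_divergence(impl_a, impl_b, trials=1000, seed=0):
    """Return an input on which the two implementations differ, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        length = rng.randrange(0, 64)
        data = bytes(rng.randrange(256) for _ in range(length))
        if impl_a(data) != impl_b(data):
            return data
    return None


if __name__ == "__main__":
    print(find_divergence(checksum_reference, checksum_generated))
```

A result of `None` only means that no counterexample was found within the sampled inputs; a returned byte string is a concrete case that a drafter could inspect against the specification text.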
Running CCG parsing. We run CCG parsing on each sentence of an RFC. Ideally, an unambiguous sentence results in exactly one logical form. In practice, a CCG parser may output zero or more logical forms, some of which are due to limitations in the parser itself and some from ambiguities inherent in a sentence.

4.2 Disambiguation
Using the CCG parsing results (i.e., how many logical forms are generated) directly as the criteria to define ambiguity might not provide RFC drafters a clear suggestion of how to remove that ambiguity. Some parsing results actually represent exactly one semantic meaning after recovering some information and/or fixing CCG’s own limitations. Filtering [...] makes no sense; entities can switch their orders in a logical form without messing up the semantic meanings; logical forms with the same semantic meanings can be expressed in a verbose or a concise form). SAGE applies sets of rules to disambiguate those logical forms.

4.3 Code generation
RFCs are used to describe networking behaviors from diverse perspectives. Protocol RFCs are a kind of RFC that comes with the intent to program the protocol design as executable code. A protocol RFC contains multiple elements (Table 2), where each component contributes to part of the whole protocol code. In Table 2, we marked the components that SAGE can process from sentence description together with structural and/or syntactical information to executable C++ code.
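As a rough illustration of translating a logical form into a line of C++, the sketch below walks a small tree in post-order (arguments first, then the enclosing predicate) and fills in a text template per predicate. It is not SAGE's implementation: the `Node` class, the `emit` function, the `hdr->` prefix for field names, the trailing semicolon, and the two supported predicates (`Is`, `Num`) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Tuple, Union

Arg = Union["Node", str, int]


@dataclass(frozen=True)
class Node:
    """One predicate of a logical form, e.g. @Is(...) or @Num(...)."""
    pred: str
    args: Tuple[Arg, ...]


def emit(term: Arg) -> str:
    """Translate a logical form into C++ text with a post-order traversal."""
    if isinstance(term, str):  # leaf: assumed to name a header field
        return f"hdr->{term}"
    if isinstance(term, int):  # leaf: a literal number
        return str(term)
    # Post-order: translate every argument before handling the predicate.
    parts = [emit(arg) for arg in term.args]
    if term.pred == "Num" and len(parts) == 1:
        return parts[0]
    if term.pred == "Is" and len(parts) == 2:
        return f"{parts[0]} = {parts[1]};"  # assignment relation
    raise ValueError(f"no code template for @{term.pred}")


# @Is("checksum", @Num(0)) written as a tree
form = Node("Is", ("checksum", Node("Num", (0,))))

if __name__ == "__main__":
    print(emit(form))  # prints: hdr->checksum = 0;
```

Supporting another kind of statement (an assertion or a log call, say) would mean adding another predicate template, and an unknown predicate raises an error rather than silently emitting nothing.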
SAGE’s code generator takes a logical form as an input, and uses a post-order traversal of the logical form obtained after disambiguation to convert the relations of entities into C++ code. For example, “@Is("checksum", @Num(0))” has an assignment relation between variable "checksum" and number 0, and SAGE outputs a line of executable code “hdr->checksum = 0”.

Generality. We evaluated SAGE’s functioning across multiple protocols for the specified components (Table 2), including ICMP and partial support for IGMP, NTP, and BFD. We verified that SAGE-generated code can interoperate with standard implementations of tcpdump, ping, and traceroute. In addition, SAGE shows the ability to parse complicated components, e.g., state management.

5 Challenges
While SAGE is a significant first step to improve the RFC publication process, there remain many challenges to be addressed. Thus far, SAGE can parse one sentence at a time and establish rules to identify ambiguities. SAGE can also compile an unambiguous sentence into a line of code, and sequentially execute a block of code when the descriptions contain clear instructions of execution order. Below, we discuss what challenges could be considered for future directions that build upon SAGE.

Paragraph analysis instead of per-sentence analysis. SAGE mostly parses sentence by sentence to determine a sentence’s ambiguity and generates code according to the order of descriptions. Therefore, SAGE has some parsing limitations in some cases. For example, a sentence might use a pronoun and there could be multiple candidates that the pronoun refers to. The candidates could be discovered from multiple neighboring sentences that are within the same paragraph. If more than one candidate exists, this can be considered as a kind of ambiguity that can confuse RFC readers.

When a compiler compiles a paragraph of text, it is possible that the order of descriptions does not follow sequential execution order. For example, a protocol RFC can describe handshakes that resemble a state machine, with all the involved connection states and their transitions described in a large paragraph. The compiler should no longer assume the generated code should follow the description order to convert sentence by sentence. Instead, the compiler should be aware that all connection states are of equal importance to be executed, and the generated code should be able to switch to any state for execution given the current state. In other words, the compiler has to learn the context of a whole paragraph, determine a correct code block template, and decide whether a variable value will be reused for the next event.

Semantic Meaning and Classifications. SAGE has been evaluated on a number of protocol RFCs with successful code generation. However, the type of generated code remains limited. SAGE can parse descriptions that assign, associate, and rewrite values and simple if-else statements. While these sets of operations have covered a large portion of system code, there are other types of code that should be considered, such as code for asserting values, code for adding constraints, code for logging events, etc.

Although we can use SAGE to disambiguate sentences and generate exactly one logical form, the conversion from logical form to code should not be limited to exactly one kind of code. We have to identify when/whether a description can be interpreted as more than one type of code, and determine what the execution order is when different types of code coexist. In other words, classification of semantic meanings also plays an important role in code generation.

Mis-matched/Mis-captured behaviors. Among RFC components, many components (such as packet formulation or state machine context) are presented with both text and syntactical components, where syntactical components can be diagrams, listings, tables or figures. In some cases, textual sentences/paragraphs are used to explain what the syntactical components represent, or extend what syntactical components have covered. Thus, text and syntactical components are complementary to each other. In other cases, due to mistakes or some other reason, texts and syntactical components can be inconsistent. This situation leaves the RFC reader unsure whether to believe the textual descriptions or the meaning of the syntactical components.

When it comes to code generation for any case, the process of code generation has to (1) correctly associate the same semantic meanings parsed from textual descriptions and syntactical components, (2) identify any missing semantic meanings from either representation, and (3) identify discrepancies between texts and syntactical components. The generated code should not repeat the same semantic meaning output, or miss any mentioned semantic meanings. The code generator should also report/alarm discrepancies to RFC drafters to reduce confusion.

Alternative code representations. The aim of the compiler (§2.3) is to accept natural language specifications and turn them into any structured representation, which can include pseudocode, formal specification language text, and implementations in a variety of programming languages. Thus far, SAGE has illustrated the possibility to turn natural language texts into working C++ code. While every kind of
representation has its advantages, the conversion among different representations could face different challenges. For example, RFC drafters commonly put pseudocode snippets in their drafts to better explain their ideas to their readers, but pseudocode is not executable and maintains some level of flexibility in the expressions given. If pseudocode is generated by the compiler and a drafter would like to compare its logical behavior with a manually written pseudocode implementation, what are the criteria for determining whether the functionalities are equivalent? For another example, some specifications express the protocol’s design in TLA+; the TLA+ language itself has different expressions than logical form and C++, and future work is required to identify the limitations of compiling natural language to TLA+.

Standalone RFC or multiple RFCs. Ideally, every perspective of a protocol can be completely expressed in a single RFC. In practice, a protocol can be explained over multiple RFCs for the ease of reading by topic. For example, one RFC may describe the functionality of the protocol and explain the design of packets, and a separate complementary RFC may explain what each value represents for the fields present in the protocol and what additional constraints would apply under a different scenario.

When multiple RFCs are given, we need to discover how to merge the content by concept without missing or violating any constraints that may exist. From the perspective of an RFC drafter, the drafter must confirm the consistency not only within a standalone RFC but also across all relevant RFCs. Any discrepancy among RFCs could similarly lead to interoperability issues. An ideal compiler should thus take inconsistency into consideration and automatically compare/label/alarm where the discrepancies happen.

Single protocol or stack of protocols. While RFC specifications are usually limited to the discussion of a single protocol, this single protocol has to interact/stack cleanly with other protocols when it’s applied in a networked system. For example, the ICMP RFC describes its design, but it has to be built on top of IP. In such cases, a compiler that is considering both protocols together first has to identify the dependency between the two protocols and/or the constraints to be considered. When the compiler is stacking the two protocols together, it should be aware of, for example, the general IP header description selecting the exact value for ICMP as its next protocol field, and that the total length field in the IP header needs to add the ICMP header and the payload.

Logic vs. performance. A protocol RFC usually focuses on describing its logical functionality and leaves the flexibility of implementing code to any reader of the RFC (unless the RFC provides a reference implementation and expects the future protocol implementer to directly apply the reference implementation in all contexts). In other words, we could reasonably expect that the generated code from the compiler is valued for its correct logic instead of its performance. If a drafter suggests a performance-oriented implementation, some mechanism could be supported to identify which natural-language statements convey performance considerations and which convey logical considerations. Moreover, the compiler can optionally output different versions of generated code according to the needs of the user.

6 Discussion
We have mentioned a number of problems yet to be solved. Here we discuss some existing techniques that might assist us to resolve them with proper modification to SAGE.

For example, to handle the multiple-candidate problem, we might employ tools to identify what terms/tokens are used to refer to the same concept, which is commonly recognized as a “coreference resolution” problem. Although the coreference resolution problem has been studied for years and the performance of tools for this task is generally considered good, we find generic coreference resolution tools are still insufficient to handle many domain-specific terms. In particular, such tools do not perform well when multiple terms are actually the same but appear different because of naming styles, e.g., ADMINDOWN, admin_down, AdminDown. The generic coreference resolution tools provide inflexible APIs, which preclude custom rules that could group such terms together. If we would like to resolve such problems, we could pre-process text, or post-process results to address the limitations of existing tools.

As another example, to handle multiple RFCs that combine to form a single protocol, one method is to leverage proper markdown or reference links to identify the relationships between RFCs. Given proper structural information (e.g., titles) and categorized labels, the same specification documents can be analyzed together and used to discover whether there are any discrepancies.

We do not mean to claim that all the mentioned challenges can be easily solved; these examples still need to undergo comprehensive discussion and evaluation to prove their effectiveness. Our focus in this paper is to provide some perspectives from our experience and identify a number of research directions that can be explored that may be of benefit to the networking standards development community.

References
[1] Abbott, M. B., and Peterson, L. L. A language-based approach to protocol implementation. IEEE/ACM Transactions on Networking (1993).
[2] Anderson, D. P. Automated protocol implementation with RTAG. IEEE Transactions on Software Engineering 14, 3 (1988), 291–300.
[3] Artzi, Y., FitzGerald, N., and Zettlemoyer, L. S. Semantic parsing with combinatory categorial grammars. ACL (Tutorial Abstracts) 3