A Novel Natural Language Processing (NLP) Approach To Automatically Generate Conceptual Class Model From Initial Software Requirements



Mudassar Adeel Ahmed, Wasi Haider Butt, Imran Ahsan,
Muhammad Waseem Anwar (✉), Muhammad Latif, and Farooque Azam

Department of Computer Engineering, College of Electrical and Mechanical Engineering,
National University of Sciences and Technology, H-12, Islamabad, Pakistan
{mudassar.adeel14,imran.ahsan14}@ce.ceme.edu.pk,
{wasi,waseemanwar,mlatif,farooq}@ceme.nust.edu.pk

Abstract. The conceptual class model is an essential design artifact of the Software
Development Life Cycle (SDLC). Generating the class model from early software
requirements demands several resources and additional time. Natural Language
Processing (NLP), on the other hand, is a knowledge discovery approach that
automatically extracts elements of concern from initial plain text documents.
Consequently, it is frequently used to generate SDLC artifacts such as the class
model from early software requirements. However, some manual processing of the
textual requirements is usually required before NLP techniques can be applied,
which makes the whole process semi-automatic. This article presents a novel, fully
automated NLP approach to generate the conceptual class model from initial software
requirements. As part of this research, the Automated Requirements 2 Design
Transformation (AR2DT) tool is developed. Validation is performed through three
benchmark case studies. The experimental results show that the proposed NLP
approach is fully automated and considerably improved compared to other
state-of-the-art approaches.

Keywords: NLP · AR2DT · Class diagram · Software requirements · Natural
language processing

1 Introduction

Extracting significant information from a preliminary set of requirements in the
analysis phase is an inherently crucial task. It requires considerable manual
intervention, which leads to long data processing times and can introduce serious
data processing errors. Natural Language Processing (NLP) has shown promising
results in overcoming such issues, especially in the bio-medical domain [1].
NLP offers automated data processing features and has been applied in various
software development phases to generate requirement specifications [2], design
artifacts and test cases [3] in an automated manner. Much of this research targets
the design phase, including class diagram generation [4], use case generation [5],
collaboration diagram generation [6] and so on.

© Springer Nature Singapore Pte Ltd. 2017


K. Kim and N. Joukov (eds.), Information Science and Applications 2017,
Lecture Notes in Electrical Engineering 424, DOI 10.1007/978-981-10-4154-9_55

Although a noticeable body of research deals with the generation of the class model
from initial plain text software requirements, existing studies usually require
some manual processing of the textual requirements before generating the class
model, which makes the whole process semi-automatic. This deviates from the spirit
of true automation. Therefore, in this article, we propose a novel and fully
automated NLP approach to generate the class model from early software requirements.
An overview of this study is shown in Fig. 1.

[Figure: Early Software Requirements (Plain Text) are processed by the Proposed NLP
Approach (Rules for Splitting Sentences, Rules for Tokenization, Rules for POS
Tagging) and its Implementation in the AR2DT Tool (Implementation of Rules, Class
Generation, User Interface) to produce the Conceptual Class Model (with Code).]

Fig. 1. Overview of research

Firstly, we define novel and improved rules for splitting sentences, tokenization
and POS tagging (Sect. 2). Secondly, we implement the defined rules in the AR2DT tool
(Sect. 2.1). The AR2DT tool has three components, i.e. Implementation of Rules,
Class Generation and User Interface. It takes early software requirements as plain
text and generates a conceptual class model with code, as shown in Fig. 1. Finally,
we use three case studies to validate the proposed approach (Sect. 3). The comparative
analysis with the state-of-the-art is given in Sect. 4. The paper is concluded in Sect. 5.

2 Proposed Methodology and Implementation

The proposed NLP approach comprises novel rules for sentence splitting, tokenization
and POS tagging, as shown in Fig. 2. The defined rules are applied to the initial plain
text software requirements to generate the conceptual class model. Our proposal is
mainly concerned with the extraction of Noun Plural (NNS), Proper Noun Singular (NNP)
and Proper Noun Plural (NNPS) by using matching nouns. The rules are summarized as
follows:

Fig. 2. Proposed architecture of AR2DT

Nouns or Classes Identification: The state-of-the-art (Sect. 4) shows that
researchers usually consider all types of nouns as classes. In contrast, we
propose to consider only NNS, NNP and NNPS as classes.

Conversion of Plural to Singular: Plural nouns are converted to singular, e.g.
books is converted to book.

Remove Redundant Classes: Repeated classes are considered only once. This concept
is implemented by defining a dictionary which includes all the irrelevant glossary words,
e.g. user, software, number etc. A special set of standard guidelines is followed
while developing the glossary of the dictionary in order to avoid any sort of bias.
The identification of classes from plain text is performed on the basis of pre-defined
rules. A class can be described by this equation: C: ϵ [{C, A, O, R}], where C is the
candidate class, A is an attribute of this class, O is an operation or function of
the class and R represents a relationship of the class. The relationships between the
classes can be expressed as follows: R: ϵ [{rT, Cr, Rc}], where R is a relationship,
rT is the relationship type, i.e. association, Cr is the cardinality and Rc is the related class.
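The three rules and the class tuple above can be sketched in a few lines of Python. This is only an illustrative sketch and not the AR2DT implementation, which is written in C# on top of SharpNLP. It assumes the text has already been POS-tagged into (word, tag) pairs with Penn Treebank tags. The names `ConceptualClass`, `Relationship`, `singularize` and `extract_classes` are hypothetical, the glossary holds only the three example words mentioned above, and the plural-to-singular conversion is deliberately naive.

```python
from dataclasses import dataclass, field

# POS tags treated as class candidates (rule 1).
CLASS_TAGS = {"NNS", "NNP", "NNPS"}

# Assumed glossary of irrelevant words; only the examples named in the text.
IRRELEVANT = {"user", "software", "number"}


@dataclass
class Relationship:
    rel_type: str       # rT, e.g. "association"
    cardinality: str    # Cr
    related_class: str  # Rc


@dataclass
class ConceptualClass:
    name: str                                          # C
    attributes: list = field(default_factory=list)     # A
    operations: list = field(default_factory=list)     # O
    relationships: list = field(default_factory=list)  # R


def singularize(noun: str) -> str:
    """Naive plural-to-singular conversion (rule 2); irregular nouns are not handled."""
    lower = noun.lower()
    if lower.endswith("ies") and len(lower) > 3:
        return lower[:-3] + "y"
    if lower.endswith(("sses", "xes", "ches", "shes")):
        return lower[:-2]
    if lower.endswith("s") and not lower.endswith("ss"):
        return lower[:-1]
    return lower


def extract_classes(tagged_tokens):
    """Keep NNS/NNP/NNPS tokens, singularize plurals, drop glossary words and repeats."""
    seen = set()
    classes = []
    for word, tag in tagged_tokens:
        if tag not in CLASS_TAGS:
            continue
        # Names are lower-cased so that repeated mentions compare equal.
        name = singularize(word) if tag in {"NNS", "NNPS"} else word.lower()
        if name in IRRELEVANT or name in seen:  # rule 3
            continue
        seen.add(name)
        classes.append(ConceptualClass(name=name))
    return classes
```

For example, `extract_classes([("Banks", "NNS"), ("accounts", "NNS"), ("banks", "NNS"), ("users", "NNS")])` yields two classes named `bank` and `account`: the repeated noun is kept once and `user` is dropped by the glossary.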

2.1 Implementation
AR2DT is developed in Visual Studio 2010 and written in C# with 1500 lines of code.
SQL Server 2012 is used for storage. In AR2DT, the rules are implemented through
the SharpNLP-1.0.2529 [14] library. Subsequently, a Regular Expression
library is used to match classes by utilizing the concept of the dictionary. The interface
of AR2DT running the ATM case study (Sect. 3) is shown in Fig. 3.
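The dictionary matching step can be illustrated with a small regular-expression sketch. It is written in Python rather than C#, and the glossary contents, the optional plural "s" and the function names `is_irrelevant` and `filter_candidates` are assumptions made for illustration; the paper does not list the actual patterns used in AR2DT.

```python
import re

# Assumed glossary; the text names "user", "software" and "number" as examples.
GLOSSARY = ["user", "software", "number"]

# One case-insensitive pattern matching any glossary word, with an optional plural "s".
GLOSSARY_PATTERN = re.compile(
    r"(?:" + "|".join(re.escape(word) for word in GLOSSARY) + r")s?",
    re.IGNORECASE,
)


def is_irrelevant(candidate: str) -> bool:
    """True if the whole candidate is a glossary word (not merely contains one)."""
    return GLOSSARY_PATTERN.fullmatch(candidate.strip()) is not None


def filter_candidates(candidates):
    """Drop candidate class names that appear in the glossary."""
    return [c for c in candidates if not is_irrelevant(c)]
```

Using `fullmatch` rather than a substring search keeps a candidate such as "numbering" from being discarded just because it contains the glossary word "number".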

Fig. 3. AR2DT user interface

The text area is provided to write or copy/paste the desired case study. The classes
are identified by pressing the Identify Classes button, behind which the business logic
for the rules of sentence splitting, tokenization and POS tagging is implemented. The
Generate Class Diagram Code button creates the code of the class diagram. The generated
classes can be viewed in a grid view. Operations like tokenization and splitting
can be performed separately (without the generation of classes), as shown in Fig. 3. The
details about the AR2DT tool, such as the installation/user manual, executable file,
source code and sample case studies, can be found at [20].
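To give a rough idea of what generating class code from identified classes involves, the following Python sketch emits one empty class skeleton per class name. The output format of AR2DT is not described here, so the skeleton layout and the helper names `to_identifier` and `generate_class_code` are purely hypothetical.

```python
def to_identifier(name: str) -> str:
    """Turn a class name such as 'cash card' into the identifier 'CashCard'."""
    return "".join(part[:1].upper() + part[1:] for part in name.split())


def generate_class_code(class_names):
    """Return one empty class skeleton per identified class name."""
    stubs = [f"class {to_identifier(name)}:\n    pass" for name in class_names]
    return "\n\n\n".join(stubs)
```

Calling `generate_class_code(["bank", "cash card"])` returns the source text of two empty classes, `Bank` and `CashCard`.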

3 Validation

Automatic Teller Machine (ATM) Case Study: Rumbaugh et al. [19] first analyzed
the automatic teller machine case study using the OMT methodology. We take the same
problem statement to present the results of our analysis. The initial software
requirements of the ATM, expressed as plain text, are shown in Fig. 4.

Design the software to support a computerized banking network including both human cashiers and
automatic teller machines (ATMs) to be shared by a consortium of banks. Each bank provides its own
computer to maintain its own accounts and process transactions against them. Cashier stations are
owned by individual banks and communicate directly with their own bank's computers. Human
cashiers enter account and transaction data. Automatic teller machines communicate with a central
computer which clears transactions with the appropriate banks. An automatic teller machine accepts a
cash card, interacts with the user, communicates with the central system to carry out the transaction,
dispenses cash, and prints receipts. The system requires appropriate record-keeping and security
provisions. The system must handle concurrent accesses to the same account correctly. The banks will
provide their own software for their own computers.

Fig. 4. Automatic teller machine problem statement

Rumbaugh et al. [19] took all the nouns and created a list of classes from the case
study. The resulting set contains 23 classes: Software, Consortium, Cash receipt, Cash
card, Account data, Banking network, Bank computer, Bank, Traction, Access, Cashier
station, Central computer, Transaction, ATM, Cashier, Transaction data, Security
provision, Record keeping provision, System, Cost, Receipt, Account, and Customer. In
our case, AR2DT generates 10 classes for the ATM case study, as shown in Fig. 5.

Fig. 5. Conceptual class model of ATM by AR2DT

3.1 Evaluation of Results with the State-of-the-Art


It is assumed in this paper that the models given in object-oriented books are correct,
so we take them as the answer key for matching our results. For evaluation purposes,
we consider three measures, i.e. precision, recall and over-specification. Precision
shows how much of the generated information is correct, i.e. present in the answer key.
The following equations are used to calculate precision, recall and over-specification:

Precision = N_correct / (N_correct + N_incorrect)
Recall = N_correct / (N_correct + N_missing)
Over-specification = N_extra / (N_correct + N_missing)

We evaluate the performance of the AR2DT tool against three case studies, i.e. ATM,
Electronic Filling Program (EFP) and Local Hospital Problem (LHP). Due to space
limitations, we only provide the details of the ATM case study; further details can
be found at [20]. We compare the results of the AR2DT tool with a research study
published in a high-impact journal (i.e. Class-Gen [8]). The results are summarized
in Table 1.

Table 1. Evaluation of results with Class-Gen [8]. Correct, Incorrect, Missing and
Extra are numbers of classes; OS is over-specification.

Sr. #   Case study       Correct   Incorrect   Missing   Extra   Precision   Recall    OS
1       ATM              9         0           1         1       100%        90%       10%
2       EFP              6         1           3         2       85.7%       66.7%     22.2%
3       LHP              5         0           0         0       100%        100%      0%
        AR2DT Avg.                                               94.9%       85.56%    10.73%
        Class-Gen Avg.                                           82.6%       83.3%     42%

Table 1 shows that, on average, AR2DT achieves higher precision (94.9% vs. 82.6%),
slightly higher recall (85.56% vs. 83.3%) and much lower over-specification
(10.73% vs. 42%) than Class-Gen [8].
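The per-case figures in Table 1 follow directly from the three formulas given earlier in this section. The short Python sketch below recomputes them from the class counts in the table; the function and variable names are illustrative choices and are not part of AR2DT.

```python
def precision(n_correct, n_incorrect):
    return n_correct / (n_correct + n_incorrect)


def recall(n_correct, n_missing):
    return n_correct / (n_correct + n_missing)


def over_specification(n_extra, n_correct, n_missing):
    return n_extra / (n_correct + n_missing)


# (correct, incorrect, missing, extra) class counts per case study, from Table 1.
CASES = {
    "ATM": (9, 0, 1, 1),
    "EFP": (6, 1, 3, 2),
    "LHP": (5, 0, 0, 0),
}


def evaluate(cases):
    """Return (precision, recall, over-specification) in percent, rounded to one decimal."""
    results = {}
    for name, (correct, incorrect, missing, extra) in cases.items():
        results[name] = (
            round(100 * precision(correct, incorrect), 1),
            round(100 * recall(correct, missing), 1),
            round(100 * over_specification(extra, correct, missing), 1),
        )
    return results
```

For the ATM case study, 9 correct, 0 incorrect, 1 missing and 1 extra class give a precision of 9/9 = 100%, a recall of 9/10 = 90% and an over-specification of 1/10 = 10%, matching the first row of the table.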

4 Comparative Analysis with the State-of-the-Art

In this section, we compare our proposed approach with state-of-the-art approaches. We
believe it is important to first highlight a few studies relevant to the subject of
automatic conceptual class model generation from natural language (NL) software
requirements using natural language processing. We considered papers published from
2003 to 2014 in the well-known repositories Springer [15], ACM [16], Elsevier [17]
and IEEE [18].
Ibrahim and Ahmad [4] suggested a methodology to automate the analysis process
for class diagram generation from natural language text using NLP. They developed
the RACE tool to extract the classes and relationships for class diagram generation.
Kumar and Sanyal [5] analyzed natural language text and generated a class model and
a use case model with the SUGAR tool.
Deeptimahanti and Sanyal [7] used the Stanford parser, JavaRAP and WordNet to
convert NL requirements to UML models semi-automatically. Elbendak et al. [8]
developed the Class-Gen tool to generate a class diagram from use case descriptions
through a semi-automated approach. Sharma et al. [9] developed the FCDT tool and used
the RSA algorithm for the production of functional design.
Vinay et al. [10] developed R-tool to analyze NL requirements for the identification
of classes, attributes, methods and relationships, which serve as the basis for the
creation of the class diagram. The authors used tokenization as the NLP technique.
Alkhader et al. [11] suggested a framework for class diagram generation from NL
requirements by using the MIMB and GATE tools. Tripathy and Rath [12] developed a
methodology for the identification of class names from SRS documents in an automated
manner. Harmain and Gaizauskas [13] developed CM-Builder for the creation of class
diagrams from NL requirements in a semi-automated way.

4.1 Comparative Analysis


Having presented the existing state-of-the-art approaches in this context, we now
compare significant studies with our proposed approach to highlight their strengths
and weaknesses. We use three parameters to perform this comparison: (1) Input
defines the format of the requirements used to generate the class model.
(2) Coverage describes the coverage area of the selected research study, i.e. whether
it covers the generation of a Class (C), Relationship (R), Attribute (A) and
Operation (O). (3) Automated evaluates the involvement of manual steps required on
the textual requirements before applying the NLP approach. It is rated as Automatic
or Semi-Automatic (in case some manual processing is required). The summary of the
comparison is given in Table 2.

Table 2. Comparative analysis of proposed approach with state-of-the-art

                                            Coverage
Paper                          Input        C     R     A     O     Automated
Ibrahim and Ahmad [4]          Plain text   Yes   Yes   Yes   No    Semi-automatic
Kumar and Sanyal [5]           Plain text   Yes   No    Yes   Yes   Semi-automatic
Deeptimahanti and Sanyal [7]   Plain text   Yes   No    No    No    Semi-automatic
Elbendak et al. [8]            Plain text   Yes   Yes   Yes   No    Semi-automatic
Sharma et al. [9]              RS           Yes   Yes   No    Yes   Semi-automatic
Vinay et al. [10]              Plain text   Yes   Yes   Yes   Yes   Semi-automatic
Alkhader et al. [11]           Plain text   Yes   Yes   Yes   No    Semi-automatic
Tripathy and Rath [12]         RS           Yes   Yes   Yes   Yes   Semi-automatic
Harmain and Gaizauskas [13]    RS           Yes   Yes   Yes   No    Semi-automatic
AR2DT                          Plain text   Yes   Yes   No    No    Automatic

Table 2 shows that our approach fully automates the requirements-to-design
process, which is a significant contribution. Furthermore, our experimental
results (Sect. 3.1) are more encouraging than those of other studies. However,
we are not dealing with the generation of association and operation. We intend to
include these missing features in AR2DT in the near future.

5 Conclusions and Future Work

This article presents a novel Natural Language Processing (NLP) approach to
automatically generate the conceptual class model from early software requirements.
In particular, new sentence splitting, tokenization and POS tagging rules are
defined to avoid the manual processing which is usually required on textual
requirements before the generation of the class model. As part of this research,
the Automated Requirements 2 Design Transformation (AR2DT) tool has been
developed to automatically generate the class model with code from initial plain
text requirements. The application of AR2DT is validated through three benchmark
case studies. Experimental results show that the recall, precision and
over-specification of the AR2DT tool are improved compared to the
state-of-the-art. Furthermore, AR2DT is fully automated and does not require
any manual processing on textual requirements.
Currently, AR2DT does not deal with the generation of class relationships like
aggregation, composition and inheritance. Furthermore, the generation of methods and
cardinalities is also missing. We intend to include these missing features in AR2DT
in future work.

References

1. Meteer, M., Borukhov, B., Crivaro, M., Shafir, M., Thamrongrattanarit, A.: MedLingMap: a
growing resource mapping the bio-medical NLP field. In: Proceedings of the 2012 Workshop
on Biomedical Natural Language Processing (BioNLP 2012), Montreal, Canada, 8 June 2012,
pp. 140–145 (2012)
2. Umber, A., Bajwa, I.S., Asif Naeem, M.: NL-based automated software requirements
elicitation and specification. In: Abraham, A., Lloret Mauri, J., Buford, J.F., Suzuki, J.,
Thampi, S.M. (eds.) ACC 2011. CCIS, vol. 191, pp. 30–39. Springer, Heidelberg (2011). doi:
10.1007/978-3-642-22714-1_4
3. Sneed, H.M.: Testing against natural language requirements. In: Seventh International
Conference on Quality Software. IEEE (2007)
4. Ibrahim, M., Ahmad, R.: Class diagram extraction from textual requirements using natural
language processing (NLP) techniques. In: Second International Conference on Computer
Research and Development, pp. 200–204. IEEE Computer Society, IEEE (2010)
5. Kumar, D.D., Sanyal, R.: Static UML model generator from analysis of requirements
(SUGAR). In: Advanced Software Engineering and Its Applications, pp. 77–84. IEEE (2008)
6. Liu, D., Subramaniam, K., Eberlein, A., Far, B.H.: Natural language requirements
analysis and class model generation using UCDA. In: Orchard, B., Yang, C., Ali, M. (eds.)
IEA/AIE 2004. LNCS (LNAI), vol. 3029, pp. 295–304. Springer, Heidelberg (2004). doi:
10.1007/978-3-540-24677-0_31
7. Deeptimahanti, D.K., Sanyal, R.: Semi-automatic generation of UML models from natural
language requirements. In: ISEC 2011, pp. 165–174. ACM (2011)
8. Elbendak, M., Vickers, P., Rossiter, N.: Parsed use case descriptions as a basis for
object-oriented class model generation. J. Syst. Softw. 84, 1209–1223 (2011). Elsevier
9. Sharma, V.S., Sarkar, S., Verma, K., Panayappan, A., Kass, A.: Extracting high-level
functional design from software requirements. In: 16th Asia-Pacific Software Engineering
Conference. IEEE (2009)
10. Vinay, S., Aithal, S., Desai, P.: An NLP based requirements analysis tool. In: International
Advance Computing Conference. IEEE (2009)
11. Alkhader, Y., Hudaib, A., Hammo, B.: Experimenting with extracting software requirements
using NLP approach. In: ICIA. IEEE (2006)
12. Tripathy, A., Rath, S.K.: Application of natural language processing in object oriented
software development. In: International Conference on Recent Trends in Information
Technology. IEEE (2014)
13. Harmain, H.M., Gaizauskas, R.: CM-builder: a natural language-based CASE tool for object-
oriented analysis. Autom. Softw. Eng. 10, 157–181 (2003). Springer

14. SharpNLP. https://sharpnlp.codeplex.com/. Accessed 12 Sept 2016
15. Springer. http://www.springer.com/in/. Accessed Sept 2016
16. ACM. http://dl.acm.org. Accessed Sept 2016
17. Elsevier. https://www.elsevier.com. Accessed Sept 2016
18. IEEE Scientific database. http://ieeexplore.ieee.org/Xplore/home.jsp. Accessed Sept 2016
19. Rumbaugh, J., Blaha, M., Premerlani, W., Eddy, F., Lorensen, W.: Object-Oriented Modeling
and Design. Pearson Education, Upper Saddle River (1991)
20. AR2DT Tool. http://ceme.nust.edu.pk/ISEGROUP/Resources/ar2dt/ar2dt.html
