Software Mining Repository Notes Unit 1

Empirical Software Engineering: - “Empirical” typically describes any statement about the world that is based on
observation or experience. Empirical software engineering (ESE) is an area of research that emphasizes empirical methods in the field of
software engineering. It involves methods for evaluating, assessing, predicting, monitoring, and controlling the existing artifacts
of software development. ESE applies quantitative methods to the software engineering phenomenon to understand software
development better. ESE has been gaining importance over the past few decades because of the ability to mine data from open-
source software repositories that contain information about software requirements, bugs, and changes.
Software repositories, also known as code repositories, are centralized hubs that help developers create, maintain, and track
software packages. Repository management software controls access to software packages, tracks package deployments, and
includes or integrates with version control systems. The repositories work with package managers and build tools. Advanced
repository features include pipeline workflow tools, static code analysis, vulnerability testing, and ensuring developers have
access to the latest versions of public artifacts.
Public and Private Repositories
Public repositories securely store, publish, and freely share open-source software.
Organizations use private repositories to manage their proprietary software resources. They can publish that software and
charge fees through licensing arrangements.
Software Repositories Features
• Build, maintain, store, or source software packages and containers from public and private feeds
• Package management
• Deployment tracking
• Version control
• Access controls
• Encrypted storage and backup
• Asset discovery
• Pipeline workflow tools
• Release management tools
• Static code analysis
• Vulnerability testing
• Dashboards, statistics, and reporting
Benefits of using software repositories
• Productivity: Software repositories facilitate code reuse, which expedites product development.
• Efficiency: A centralized software repository fosters collaboration. Repository management tools streamline the tracking and
deployment of software packages.
• Security: Encrypted storage, access controls, and backups secure the software packages.
Importance of Mining Software Repositories (MSR)
• Software repositories usually provide a vast array of varied and valuable information regarding software projects.
• By utilizing the information mined from these repositories, software engineering researchers and practitioners do not need to
depend primarily on their intuition and experience, but more on real data for informed decision making.
• Effective mining techniques can extract the right kind of information from these repositories in the right form.
Potential Benefits of MSR
• Support maintenance of the software system
• Empirical validation of techniques and methods
• Supporting software reuse
• Proper allocation of testing and maintenance resources
Data Analysis Procedure After Mining
Types of Software repositories
• Historical repositories
– Source control repositories
– Bug repositories
– Archived communications
• Run-time repositories (deployment logs)
• Source code repositories
Source control repositories/Version Control System: -Source control repositories record and maintain the development trail of
a project. They track each and every change incurred in any of the artifacts of a software system, such as the source code,
documentation manuals, and so on. Additionally, they also maintain the metadata regarding each change, for instance, the
developer or project member who carried out the change, the timestamp when the change was performed, and a short
description of the change. These are the most readily available repositories, and also the most employed in software projects.
Git, CVS, Subversion (SVN), Perforce, and ClearCase are some of the popular source control repositories that are used in
practice.
Bug Repositories: - These repositories track and maintain the resolution history of defect/bug reports, which provide valuable
information regarding the bugs that were reported by the users of a large software project, as well as the developers of that
project. Bugzilla and Jira are the commonly used bug repositories.
Archived Communications: - Discussions regarding the various aspects of a software project during its lifecycle, such as mailing
lists, emails, instant messages, and internet relay chats (IRCs) are recorded in the archived communications.
Run-time repositories, also known as deployment logs, record information regarding the execution of a single deployment, or
different deployments of a software system. For example, run-time repositories may record the error messages reported by a
software application at various deployment sites. Run-time repositories can be used to detect execution
anomalies by discovering the dominant execution or usage patterns across deployments and recording the deviations from them.
Source code repositories maintain the source code for a large number of OSS projects.
SourceForge.net and Google Code are among the most commonly employed code repositories, and host the source code for a
large number of open-source systems, such as Android OS and Apache Foundation projects, among others.
Three Types of VCS
• Local VCS (e.g., Revision Control System)
• Centralized Version Control System (CVCS) (e.g., SVN)
• Distributed/Decentralized Version Control System (DVCS) (e.g., Git)
Local VCS: - A local VCS employs a simple database that records and maintains all the changes to the artifacts of the software
project under revision control. The Revision Control System (RCS) was a very popular local versioning system. This tool operates
by recording the patch sets (i.e., the differences between two revisions of an artifact) in a specific format on the user’s system
while moving from one revision to the next. It can then recreate the image of a project artifact at any point in time by applying
all the stored patches in sequence. However, the user cannot collaborate with other users on other systems, as the database is
local and not maintained centrally. Each user has his/her own copy of the different revisions of project artifacts, and thus there
are consistency and data sharing problems. Moreover, if a user loses the versioning data, it cannot be recovered unless backups
have been maintained from time to time.
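The patch-set idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `make_patch`, `apply_patch`, and `LocalHistory` are hypothetical names, the patch layout is invented for the example, and it is not the file format that RCS actually uses. The sketch keeps the first revision in full and stores only the differences for every later revision.

```python
from difflib import SequenceMatcher


def make_patch(old, new):
    """Record only what changes between two revisions (lists of lines)."""
    ops = SequenceMatcher(a=old, b=new, autojunk=False).get_opcodes()
    # Each patch entry: (start in old, end in old, replacement lines from new)
    return [(i1, i2, new[j1:j2]) for tag, i1, i2, j1, j2 in ops if tag != "equal"]


def apply_patch(old, patch):
    """Rebuild the newer revision from the older one plus its patch."""
    result, cursor = [], 0
    for i1, i2, replacement in patch:
        result.extend(old[cursor:i1])  # unchanged lines before this edit
        result.extend(replacement)     # inserted or replacing lines
        cursor = i2                    # skip the lines this edit removed
    result.extend(old[cursor:])
    return result


class LocalHistory:
    """Toy local version database: revision 1 in full, then one patch per commit."""

    def __init__(self, first_revision):
        self.base = list(first_revision)
        self.patches = []
        self.head = list(first_revision)

    def commit(self, new_lines):
        self.patches.append(make_patch(self.head, new_lines))
        self.head = list(new_lines)
        return len(self.patches) + 1   # revision number of the new revision

    def checkout(self, revision):
        lines = list(self.base)
        for patch in self.patches[: revision - 1]:
            lines = apply_patch(lines, patch)
        return lines
```

Calling `checkout(n)` replays the first n-1 patches on top of the base revision, which is the "recreate the image at any point in time" step. Because everything lives in one object on one machine, losing that object loses the whole history, which mirrors the weakness noted above.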
Centralized VCS (CVCS): - The main aim of a CVCS is to allow the user to easily collaborate with different users on other systems.
These systems, such as CVS, Perforce, and SVN, employ a single centralized server that records and maintains all the versioned
artifacts of a software project under revision control, and there are a number of clients or users that check out (obtain) the
project artifacts from that central server. However, if the central server fails or the data stored at the central server is
corrupted or lost, there is no chance of recovery unless periodic backups are maintained.
Distributed VCS (DVCS): - As opposed to a CVCS, a DVCS (such as Bazaar, Darcs, Git, and Mercurial) ensures that the clients or
users do not just obtain or check out the latest revision or snapshot of the project artifacts, but clone, mirror, or download the
entire software project repository to obtain the artifacts. If any server of the DVCS fails or its data is corrupted or lost, any
of the software project repositories stored on a client machine can be uploaded as a backup to the server to restore it.
Therefore, every checkout carried out by a client is essentially a complete backup of the entire software project data.
Some Important Terms-Build, Revision

Build: A binary that is produced from the source code after changes are committed to the VCS (Version Control System).
Version: An identifier assigned to a build so that the build can be identified.
Revision: An identifier in the VCS that tells which source code existed at which point in time.
Release: A build (or its version) that goes to the production environment or is announced to the public.
VCS Terminology Generic
Repository: A repository is the heart of any version control system. It is the central place where developers store all their work.
A repository stores not only files but also their history. It is accessed over a network, with the repository acting as a server and
the version control tool acting as a client. Clients can connect to the repository and then store their changes to it or retrieve
changes from it. By storing changes, a client makes them available to other people, and by retrieving changes, a client brings
other people's changes into its working copy.
Trunk: The trunk is a directory where all the main development happens and is usually checked out by developers to work on
the project.
Tags: The tags directory is used to store named snapshots of the project. The tag operation gives a descriptive and
memorable name to a specific version in the repository.
Branches: Branch operation is used to create another line of development. It is useful when you want your development
process to fork off into two different directions. For example, when you release version 5.0, you might want to create a branch
so that development of 6.0 features can be kept separate from 5.0 bug-fixes.
Working copy: Working copy is a snapshot of the repository. The repository is shared by all the teams, but people do not modify
it directly. Instead, each developer checks out the working copy. The working copy is a private workplace where developers can
do their work remaining isolated from the rest of the team.
Commit changes: Commit is the process of storing changes from the private workplace to the central server. After a commit, the
changes are available to the whole team. Other developers can retrieve these changes by updating their working copy. Commit is an
atomic operation: either the whole commit succeeds or it is rolled back. Users never see a half-finished commit.
Head: refers to the commit that has been made most recently, either to a branch or to the trunk.
Concurrent Versions System (CVS) is a particular VCS of the centralized type. CVS is used for two apparently unrelated purposes:
record keeping and collaboration. It turns out, however, that these two functions are closely connected. Record keeping became
necessary because people wanted to compare a program's current state with how it was at some point in the past. For example,
in the normal course of implementing a new feature, a developer may bring the program into a thoroughly broken state, where it
will probably remain until the feature is mostly finished.
CVS Specifics
Revision: A committed change in the history of a file or set of files. A revision is one "snapshot" in a constantly changing project.
Repository: The master copy where CVS stores a project's full revision history. Each project has exactly one repository.
Working copy: The copy in which you actually make changes to a project. There can be many working copies of a given project;
generally each developer has his or her own copy.
Check out: To request a working copy from the repository. Your working copy reflects the state of the project as of the moment
you checked it out; when you and other developers make changes, you must use commit and update to "publish" your changes
and view others' changes.
Commit: To send changes from your working copy into the central repository. Also known as check-in.
Log message: A comment you attach to a revision when you commit it, describing the changes. Others can page through the log
messages to get a summary of what's been going on in a project.
Update: To bring others' changes from the repository into your working copy and to show if your working copy has any
uncommitted changes. Be careful not to confuse this with commit; they are complementary operations. Mnemonic: update
brings your working copy up to date with the repository copy.
Conflict: The situation when two developers try to commit changes to the same region of the same file. CVS notices and points
out conflicts, but the developers must resolve them.
CVS features: - CVS is a popular CVCS that hosts a large number of OSS systems. CVS was developed with the primary goal
of handling different revisions of software project artifacts by storing the changes between two subsequent revisions of
these artifacts in the repository. Thus, CVS predominantly stores change logs rather than complete copies of the artifacts. This
does not mean that CVS cannot store binary files. It can, but they are not handled efficiently.
CVS Revision numbers: Each new revision or version of a project artifact stored in the CVS repository is assigned a unique
revision number by the CVS itself.
CVS Branching and Merging: - CVS supports almost all of the functionalities pertaining to branches in a VCS. The user can create
his/her own branch for development, and view, modify, or delete a branch created by the user as well as other users, provided
the user is authorized to access those branches in the repository.
CVS Version control Data: - For each artifact, which is under the repository’s version control, CVS generates detailed version
control data and saves it in a change log or simply log files. The recorded log information can be easily retrieved by using the CVS
log command. Moreover, we can specify some additional parameters so as to allow the retrieval of information regarding a
particular artifact or even the complete project directory.
CVS Shortcoming: - A major shortcoming of CVS that troubles most developers is the lack of
appropriate mechanisms for linking detailed modification reports and classifying changes.
CVS Details
• RCS file: This field contains the path information to identify an artifact in the repository.
• Locks and Access List: These are file content access and security options set by the developer when committing the
file to CVS. They may be used to prevent unauthorized modification of the file: users can download
certain files but cannot commit changes to protected or locked files in the CVS repository.
• Symbolic names: This field contains the revision numbers assigned to tag names. The assignment of revision numbers to the tag
names is carried out individually for each artifact because the revision numbers might be different.
• Description: This field contains the modification reports that describe the change history of the artifact, beginning from the first
commit until the current version. Apart from the changes incurred in the head or main trunk, changes in all the branches are
also recorded there. The revisions are separated by a line of “-” characters.
• Revision number: This field is used to identify the revision of source code artifact (main trunk, branch) that has been subject to
change(s).
• Date: This field records the date and time of the check in.
• Author: This field provides the information of the person who committed the change.
• State: This field provides information about the state of the committed artifact and generally assumes one of these values:
“Exp” (experimental) and “dead” (file has been removed).
• Lines: This field counts the lines added and/or deleted of the newly checked in revision compared with the previous version of a
file. If the current revision is also a branch point, a list of branches derived from this revision is listed in the branches field. In the
above example, the branches field is blank, indicating that the current revision is not a branch point.
• Free Text: This field provides the comments entered by the author while committing the artifact.
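The fields listed above can be pulled out of a log with a small parser. The sketch below is an assumption-laden illustration, not a complete reader for CVS output: `SAMPLE_LOG` is made-up data laid out the way such logs are commonly printed (a header, dashed separators between revisions, and a `date; author; state; lines` line), and `parse_cvs_log` is a hypothetical helper name.

```python
import re

# Illustrative, hand-written sample; not taken from a real project.
SAMPLE_LOG = """\
RCS file: /cvsroot/demo/src/Parser.java,v
symbolic names:
    release_1_0: 1.2
description:
----------------------------
revision 1.2
date: 2003/04/01 10:15:00;  author: alice;  state: Exp;  lines: +3 -1
Fix null check in tokenizer
----------------------------
revision 1.1
date: 2003/03/20 09:00:00;  author: bob;  state: Exp;
Initial import
=============================================================================
"""

ENTRY_SEPARATOR = re.compile(r"^-{20,}$", re.MULTILINE)
HEADER = re.compile(
    r"date: (?P<date>[^;]+);\s+author: (?P<author>[^;]+);\s+state: (?P<state>[^;]+);"
    r"(?:\s+lines: \+(?P<added>\d+) -(?P<deleted>\d+))?"
)


def parse_cvs_log(text):
    """Return the RCS file path and one dict per revision entry."""
    head, *entries = ENTRY_SEPARATOR.split(text)
    rcs_file = re.search(r"^RCS file: (.+)$", head, re.MULTILINE).group(1)
    revisions = []
    for entry in entries:
        lines = entry.strip().splitlines()
        fields = HEADER.match(lines[1]).groupdict()
        # Everything after the header line is free text, minus the closing "=====" line
        message = [line for line in lines[2:] if not line.startswith("=====")]
        revisions.append({
            "revision": lines[0].split()[1],
            "date": fields["date"],
            "author": fields["author"],
            "state": fields["state"],
            "added": int(fields["added"] or 0),      # first revision has no "lines" field
            "deleted": int(fields["deleted"] or 0),
            "message": "\n".join(message),
        })
    return rcs_file, revisions
```

A free-text message that itself contains a long line of dashes would confuse this splitter, so a real mining tool would need a more careful grammar.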
SVN :- Apache Subversion, often abbreviated as SVN, is another software versioning and revision control system
distributed under an open source license. SVN hosts a large number of OSS systems, such as Tomcat and other Apache projects.
Being a CVCS, SVN can operate across networks, so people working at different locations
and on different devices can use it. Like other VCS, SVN conceptualizes and implements a version control database or repository.
However, unlike a working copy, an SVN repository can be considered an abstract entity, which
is accessed and operated upon almost exclusively through tools and libraries, such as TortoiseSVN.
SVN Specifics
Revision numbers: Each revision of a project artifact stored in the SVN repository is assigned a unique natural number, which is
one more than the number assigned to the previous revision. The initial revision of a newly created repository is typically
assigned the number “0,” indicating that it consists of nothing other than an empty trunk or main directory. Unlike most of the
VCS (including CVS), the revision numbers assigned by SVN apply to the entire repository tree of a project, not the individual
project artifacts. Each revision number represents an entire tree, or a specific state of the repository after a change is
committed. In other words, revision “i” means the state of the SVN repository after the “ith” commit. Because the revision number
is global, two revisions of a single file may be identical: even if only one file is changed, the revision number of each and
every artifact is incremented by one. Therefore, every artifact has
the same revision number for a given version of the entire project.
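Repository-wide revision numbering can be illustrated with a toy model. This is a minimal sketch, assuming a hypothetical `ToySvnRepository` class that keeps one full tree per revision in memory; it is not how SVN stores data internally and only shows how the numbers behave.

```python
class ToySvnRepository:
    """Toy model: every commit yields a new numbered snapshot of the whole tree."""

    def __init__(self):
        self.revisions = [{}]  # revision 0 is the empty tree

    def commit(self, changes):
        tree = dict(self.revisions[-1])      # start from the previous tree
        for path, content in changes.items():
            if content is None:
                tree.pop(path, None)         # None marks a deleted file
            else:
                tree[path] = content
        self.revisions.append(tree)
        return len(self.revisions) - 1       # the new repository-wide revision number

    def cat(self, path, revision):
        return self.revisions[revision][path]


repo = ToySvnRepository()
r1 = repo.commit({"a.txt": "v1", "b.txt": "v1"})
r2 = repo.commit({"a.txt": "v2"})            # only a.txt changes
```

After the second commit, `b.txt` exists at revision 1 and at revision 2 with identical content, which is the situation described above: the number advanced for the whole tree although that file did not change.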
Branching and merging: SVN fully provides the developers with various options to maintain parallel branches of their project
artifacts and directories. It permits them to create branches by simply replicating or copying their data, and remembers that the
copies which are created are related among themselves. It also supports the duplication of changes from a given branch to
another. SVN’s repository is specially calibrated to support efficient branching.
• When we duplicate or copy a directory to create a branch, we need not worry that the entire SVN repository will grow in
size. SVN does not actually copy any data. It simply creates a new directory entry pointing to an existing tree in the
repository. Owing to this mechanism, branches in SVN exist as normal directories. This is in contrast to many other VCS,
where branches are typically identified by specific “labels” or identifiers attached to the concerned artifacts.
Merging of different branches: - SVN 1.5 added merge tracking. Before this feature was available,
a great deal of manual effort and external tools were required to keep track of merges.
Version control data: For each artifact that is under version control in the repository, SVN also generates detailed version
control data and stores it in change logs, or simply log files. The recorded log information can be retrieved by using an SVN
client, such as TortoiseSVN, and also by the “svn log” command. Moreover, we can specify additional
parameters to retrieve information regarding a particular artifact or even the complete project directory.
Although SVN classifies changes to files as modified, added, or deleted, it does not directly provide any other classification
of changes, such as enhancement or bug-fixing. Even
though there is a “Bugzilla-ID” field, it is optional, and the developer committing the change is not bound to specify it, even
if he/she has fixed a bug already reported in the Bugzilla database.
SVN Commit Record
Revision number: This field identifies the source code revision (main trunk,branch) that has been modified.
Actions: This field specifies the type of operation(s) performed on the file(s) changed in the current commit. Possible
values are “Modified” (if a file has been changed), “Deleted” (if a file has been deleted), and “Added” (if a file has been
added). A combination of these values is also possible when multiple files are affected in the current commit.
Author: This field identifies the person who did the check-in.
Date: This field records the date and time of the check-in, that is, of permanently recording the changes with SVN.
Bugzilla ID (optional): This field contains the ID of a bug (if the current commit fixes a bug) that has also been reported in the
Bugzilla database. If specified, then this field may be used to link the two repositories: SVN and Bugzilla, together.
SVN Commit Record Example
We may obtain change logs from the SVN (through version control data) and bug details from the Bugzilla.
• Modified: This field lists the source code files that were modified in the current commit. In the above log file, the file
“mbeans-descriptors.dtd” was modified.
• Added: This field lists the source code files that were added to the project in the current commit. In the above log file, this
field is not specified, indicating that no files have been added.
• Message: The message field contains informal data entered by the author during the check-in process.
GIT:- Git is a version control system for tracking changes in computer files. It helps in coordinating work amongst several people
in a project and tracks progress over time. Unlike in many centralized version control systems, branches in Git are easy to merge.
A new branch is typically created every time a developer wants to start working on something, which helps keep production-quality
code on the master branch. Git is a distributed version control system, so every developer gets a local repository with the
full commit history. Having the full history locally makes Git fast, because no network connection is needed to create commits or
perform diffs between commits.
• Git object: It is an abstraction of the key-value pair storage mechanism of Git. It is also called a hash-object pair, and
each object stores a secure hash value (SHA) and a corresponding pointer to the data or files stored by Git for that hash value.
The fields of a Git object are as follows:
– SHA, a unique identifier for each object
– Type of the Git object (string), namely, tree, tag, commit, or blob
– Size of the Git object (integer)
– Content of the Git object, represented in bytes
• Tree: It eliminates the problem of storing different file names but supports storing different files together. A tree
corresponds to branches or file directories. The fields of a Tree object are as follows:
– Name of the Tree object (string)
– Type of the Tree object, having the fixed value "Tree"
• Blob: It plays the role of an inode in a Tree object and stores a file's information and content. Different blobs are
created for different commits of a single file, and each blob is associated with a unique hash value. The fields of a Blob
object are as follows:
– Name of the Blob object (string)
– Type of the Blob object, having the fixed value "Blob"
• Commit: It connects the Tree objects together. The fields of a Commit object are as follows:
– Message specified while committing the changes (string)
– Type of the Commit object, having the fixed value "Commit"
• Tag: It contains a reference to another Git object and may also hold some metadata of that object. The fields of a
Tag object are as follows:
– Name of the Tag object (string)
– Type of the Tag object, having the fixed value "Tag"
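The link between an object's type, size, content, and SHA can be made concrete. The sketch below assumes Git's SHA-1 object format, in which the identifier is the SHA-1 digest of a header (`<type> <size>` followed by a NUL byte) plus the raw content; `git_object_id` is a hypothetical helper name, not a Git API.

```python
import hashlib


def git_object_id(obj_type, content):
    """Compute the identifier Git (SHA-1 object format) assigns to an object.

    obj_type: "blob", "tree", "commit", or "tag"
    content:  the object's raw content as bytes
    """
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


# Identical content always maps to the same identifier,
# and any change to the content produces a different one.
empty_blob_id = git_object_id("blob", b"")
```

For an empty blob this yields `e69de29bb2d1d6434b8b29ae775ad8c2e48c5391`. Because the type and size are part of the hashed data, a blob and a tree with the same bytes would still receive different identifiers.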
Git Specifics
Revision numbers: Unlike CVS, where each new version of a file receives an incrementing revision number (e.g., 1.1 is assigned
to the first version of a committed file), Git does not number revisions sequentially; each commit is identified by the checksum
(SHA) of its contents. Also in contrast to CVS and many other VCS, which store change-sets (i.e., the changes between subsequent
versions), Git thinks of its data more like a set of snapshots of a mini file system. Every time a user performs a commit and
saves the state of his/her project, Git captures a snapshot of what all the files look like at that moment and stores a
reference to that snapshot. For efficiency, if a file is identical in the current and previous commits, Git does not store it
again but simply stores a link to the previously stored file.
Local operations: In Git, most operations are done using files on the client machine, that is, the local disk. For example, if we
want to know the changes between the current version and a version created a few months back, Git computes the differences
locally from its own copy of the history instead of getting the information from a remote server or
downloading the previous version from it. Thus, the user experiences an increase in speed, as the network latency
overhead is avoided. Further, a lot of work can be done offline.
Branching and merging: Git also lets its users exploit its branching capabilities easily and efficiently. All the basic
operations on a branch, such as creation, cloning, modification, and so on, are fully supported by Git. In CVS, the main issue
with branches is that CVS does not support detection of branch merges, so the developer performing the merge has to figure out
the best merge base himself/herself. Git, in contrast, automatically determines the best common ancestor of the two branches and
uses it as the merge base. Thus, merging is much easier in Git.
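The notion of a "best common ancestor" can be sketched on a small commit graph. This is an illustrative brute-force version, assuming a hypothetical `graph` dictionary that maps each commit name to its list of parents; it is not Git's implementation, which is far more efficient.

```python
def ancestors(graph, commit):
    """All commits reachable from `commit` through parent links, including itself."""
    seen, stack = set(), [commit]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return seen


def merge_bases(graph, a, b):
    """Common ancestors of a and b that are not ancestors of another common ancestor."""
    common = ancestors(graph, a) & ancestors(graph, b)
    return {
        c for c in common
        if not any(c != other and c in ancestors(graph, other) for other in common)
    }


# Hypothetical history: A <- B <- C (main), and B <- D <- E (feature)
graph = {"A": [], "B": ["A"], "C": ["B"], "D": ["B"], "E": ["D"]}
base = merge_bases(graph, "C", "E")
```

Here both `A` and `B` are common ancestors of `C` and `E`, but `B` is the more recent one, so it is the merge base. Histories with criss-cross merges can have more than one best common ancestor, which is why the sketch returns a set.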
Version control data: Similar to CVS, for each working file, Git generates version control data and saves it in log files. From
here, the log file and its metadata can be retrieved by using the "git log" command. The specification of additional parameters
allows the retrieval of information regarding a given file or a complete directory.
Commit: This field indicates the checksum for this commit. In Git, everything is checksummed prior to being stored and is then
referenced by using that checksum.
Date: This field records the date and time of the check-in.
Author: This field provides the information of the person who committed the change.
Free text: This field provides informal data or comments given by the author during the commit. This field is of prime
importance in extracting information for areas such as defect prediction, wherein bug or issue IDs are required to identify a
defect, and these can be obtained after effectively processing this field. The name of a changed file is followed by a number
which indicates the total number of LOC changes incurred in that file, which in turn is followed by the number of LOC insertions
(the count of occurrences of "+") and LOC deletions (the count of occurrences of "-"). However, a modified LOC is treated as a
line that is first deleted (-) and then inserted (+) after modifying. The last line summarizes the total number of files changed,
along with total LOC changes (insertions and deletions).
Table 5.1 compares the following freely available features of the software repositories. These repositories can be mined to
obtain useful information for analysis.
• Initial release: The date of initial release is specified.
• Development language: The programming language in which the system is developed.
• Maintained by: The name of the company that is currently responsible for the development.
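Processing the free-text field of a commit, as described above for defect prediction, usually starts with simple pattern matching. The sketch below is a heuristic under stated assumptions: the keywords in `BUG_ID`, the helper names, and the sample messages are illustrative, and the summary-line pattern assumes the common "N files changed, N insertions(+), N deletions(-)" wording.

```python
import re

# Heuristic: a keyword such as "bug", "issue", or "fix" followed by a number
BUG_ID = re.compile(r"\b(?:bug|issue|fix(?:es|ed)?)\s*[:#]?\s*(\d+)", re.IGNORECASE)

SUMMARY = re.compile(
    r"(\d+) files? changed"
    r"(?:, (\d+) insertions?\(\+\))?"
    r"(?:, (\d+) deletions?\(-\))?"
)


def extract_bug_ids(message):
    """Return the sorted, de-duplicated bug IDs mentioned in a commit message."""
    return sorted({int(number) for number in BUG_ID.findall(message)})


def parse_summary(line):
    """Return (files changed, insertions, deletions) from a log summary line."""
    match = SUMMARY.search(line)
    files, added, deleted = match.groups()
    return int(files), int(added or 0), int(deleted or 0)
```

A message such as "Fixes #77 and issue 102" yields the IDs 77 and 102. The pattern will also produce false positives (for example "Fixed 3 typos"), so mined IDs are normally cross-checked against the bug repository before being used.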
Bug Tracking Systems: - A bug tracking system (also known as defect tracking system) is a software system/ application that is
built with the intent of keeping a track record of various defects, bugs, or issues in software development life cycle. It is a type of
issue tracking system.
Bug Information
• The time when the bug was reported in the software system
• Severity of the reported bug
• Behaviour of the source program/module in which the bug was encountered.
• Details on how to reproduce that bug
• Information about the person who reported that bug
• Developers who are possibly working to fix that bug, or will be assigned the job to do so
Components of BTS
A database is a crucial component of a bug tracking system, which stores and maintains information regarding the bugs
reported by the users and/or developers.
Bug Life cycle
1. New: When a new defect is identified by the tester, it falls in the ‘New’ state. It is the first state of the Bug Life Cycle.
2. Assigned: A defect in the ‘New’ state is reviewed and, once approved, assigned to the development team to work on and
resolve. When the defect is assigned to the development team, its status changes to ‘Assigned’.
3. Open: In the ‘Open’ state, the development team is actively working on fixing the defect.
4. Fixed: After making the necessary code changes, the development team marks the defect as ‘Fixed’.
5. Pending Retest: Once the fix is completed, the development team passes the new code to the testing team for retesting. While
the code/application is waiting to be retested on the tester side, the status is ‘Pending Retest’.
6. Retest: At this stage, the tester retests the defect to check whether it has been fixed by the developer, and the status is
marked as ‘Retesting’.
7. Reopen: If, after retesting, the testing team finds that the bug persists even after the development team's fix, the status
of the bug is changed to ‘Reopened’.
8. Verified: If the tester retests the bug after the fix and no longer finds the defect, the status assigned is ‘Verified’.
9. Closed: This is the final state of the Defect Cycle. When testing confirms that the bug has been resolved and no longer
persists, the defect is marked as ‘Closed’.
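The life cycle above is effectively a state machine, which can be sketched as a table of allowed transitions. This is a minimal illustration: the `Bug` class is hypothetical, and the transition out of ‘Reopened’ (back to ‘Assigned’) is an assumption, since the notes do not say where a reopened bug goes next.

```python
TRANSITIONS = {
    "New": {"Assigned"},
    "Assigned": {"Open"},
    "Open": {"Fixed"},
    "Fixed": {"Pending Retest"},
    "Pending Retest": {"Retesting"},
    "Retesting": {"Reopened", "Verified"},
    "Reopened": {"Assigned"},  # assumption: a reopened bug is assigned again
    "Verified": {"Closed"},
    "Closed": set(),           # final state
}


class Bug:
    """A bug report that only accepts status changes allowed by TRANSITIONS."""

    def __init__(self, bug_id):
        self.bug_id = bug_id
        self.status = "New"
        self.history = ["New"]

    def move_to(self, new_status):
        if new_status not in TRANSITIONS[self.status]:
            raise ValueError(f"{self.status} -> {new_status} is not allowed")
        self.status = new_status
        self.history.append(new_status)
```

Keeping the rules in one dictionary mirrors how a bug tracking system's administrator might configure the permitted status values, and the recorded `history` is the kind of resolution trail that bug repositories expose for mining.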
Bug Severity and Bug Priority
Severity is a parameter that denotes the impact of a given defect on the software. Severity relates to the
standards of quality.
Priority is a parameter that decides the order in which the defects should be fixed. Priority relates to the scheduling of
defect resolution.
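Because severity and priority are separate parameters, a severe defect is not automatically the first one to be fixed. The sketch below illustrates this with a hypothetical `BugReport` record; the numeric scales (1 is the most severe and the most urgent) and the tie-breaking rule are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class BugReport:
    bug_id: int
    severity: int  # assumed scale: 1 = most severe impact
    priority: int  # assumed scale: 1 = fix first


def fix_order(bugs):
    """Schedule by priority; severity (then ID) only breaks ties."""
    return sorted(bugs, key=lambda bug: (bug.priority, bug.severity, bug.bug_id))


bugs = [
    BugReport(bug_id=1, severity=1, priority=3),  # severe, but scheduled late
    BugReport(bug_id=2, severity=4, priority=1),  # minor impact, but urgent
    BugReport(bug_id=3, severity=2, priority=1),
]
order = [bug.bug_id for bug in fix_order(bugs)]
```

In this ordering, bug 1 comes last despite having the highest severity, because its priority is the lowest.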
How BTS is Used by admins and Devs?
Ideally, the administrators of a bug tracking system are allowed to manipulate the bug information, such as determining the
possible values of bug status (and hence the bug life cycle states), configuring the permissions based on bug status, changing
the status of a bug, or even removing the bug information from the database.
Advantages of BTS
The primary advantage of a bug tracking system is that it provides a clear, concise, and centralized overview of the bugs
reported in any phase of the software development life cycle, and their state. The information provided is valuable for defining
the product road map and plan of action, or even planning the next release of a software system. Bugzilla is one of the most
widely used bug tracking systems. Several open-source projects, including Mozilla, use Bugzilla.
Mailing List Analysis:- Most open source developers communicate through mailing lists. This style of communication makes
mailing lists a rich source of information which researchers can use to understand software processes and improve development
practices. Mailing lists have been used to infer social structure, identify architectural changes, and also to study the code review
process. Developers use mailing lists to discuss a variety of issues and project decisions. Many of these issues and decisions are
related to and affect the source code. These issues are often driven by external factors such as the introduction of new features
in competing products.
Role of Mailing Lists in OSS
Extracting Data from Software Repositories
The procedure for extracting data from software repositories is depicted in the figure on the next slide. The figure shows the
data-collection process of extracting defect/change reports. The first step in the data-collection procedure is to extract
metrics using metrics-collection tools such as Understand and Chidamber and Kemerer Java Metrics (CKJM). The second step involves
collection of bug information to the desired level of detail (file, method, or class) from the defect report and source
control repositories. Finally, the report containing the software metrics and the defects extracted from the repositories is
generated and can be used by researchers for further analysis.
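The final step, combining metrics with defects into one report, can be sketched as a simple join at class level. This is an illustration only: the row layout, the column names (`class`, `wmc`, `cbo`, `bug_id`), and the helper names are assumptions, and real output from metrics tools or bug trackers would first have to be parsed into this shape.

```python
import csv
import io
from collections import Counter


def build_dataset(metric_rows, defect_rows):
    """Attach a defect count to each class's metric record."""
    defects_per_class = Counter(row["class"] for row in defect_rows)
    return [{**row, "defects": defects_per_class[row["class"]]} for row in metric_rows]


def to_csv(rows):
    """Serialize the combined report as CSV text for later analysis."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical inputs: step 1 yields metrics per class, step 2 yields bugs mapped to classes
metrics = [
    {"class": "Parser", "wmc": 12, "cbo": 4},
    {"class": "Lexer", "wmc": 7, "cbo": 2},
]
defects = [
    {"bug_id": 101, "class": "Parser"},
    {"bug_id": 102, "class": "Parser"},
]
report = build_dataset(metrics, defects)
```

Classes with no reported bugs receive a defect count of zero, so the report covers every class for which metrics were collected.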
