Engineering Reports - 2023 - Parthasarathy - A Framework For Managing Ethics in Data Science Projects


Received: 23 April 2023 Revised: 1 June 2023 Accepted: 15 June 2023

DOI: 10.1002/eng2.12722

RESEARCH ARTICLE

A framework for managing ethics in data science projects

Sudhaman Parthasarathy1, Prabin Kumar Panigrahi2, Girish H. Subramanian3

1 Department of Applied Mathematics and Computational Science, Thiagarajar College of Engineering, Madurai, India
2 Department of Information Systems, Indian Institute of Management (IIM), Indore, India
3 School of Business, Penn State Harrisburg, Middletown, Pennsylvania, USA

Correspondence
Girish H. Subramanian, School of Business, Penn State Harrisburg, Middletown, PA, USA.
Email: ghs2@psu.edu

Abstract
The field of data ethics is concerned with the ethical considerations surrounding data, algorithms, and associated practices, with the aim of identifying ethical
solutions. The application of ethical principles to the handling of data, algorithms, and practices can facilitate the identification and delineation of ethical
quandaries within the domain of data science. The present study focuses on the
topic of data ethics, specifically pertaining to the processes of data collection,
data model construction, evaluation, and deployment. This study introduces a
comprehensive framework designed to facilitate the management of ethical considerations in data science projects. In order to authenticate the framework, a
case study was conducted and our perspectives on its practical implementation
were presented. The description of the scope of future research is also provided.

KEYWORDS
case study, data ethics, data model, data science, framework

1 INTRODUCTION

A software project has never been without ethical challenges or considerations. Data scientists, however, find it difficult
to govern ethics in projects because they follow different timelines than more conventional software application devel-
opment projects. Additionally, the majority of data science projects are made to handle big data and include steps like
data cleansing, data model construction, model evaluation, and deployment. The widespread use of data science in recent
years has had unfavorable repercussions, including an increase in privacy invasion, data-driven prejudice, and data-driven
decision-making without rationale.1-3
The primary focus of data ethics is on what is right and wrong. Numerous data science components related to data pri-
vacy are addressed by the General Data Protection Regulation (GDPR) in Europe, including explainability.2 Data-related
issues examine the most pressing ethical questions that may emerge from the gathering and analyzing of information.
Data scientists use large volumes of collected, stored, and accessible data to create predictions based on prior patterns.4
Companies that use data science should also offer interactive ethical evaluations and training to help staff deal with
ethical issues.5 However, whether or not these groups have the scope and scale of expertise required to properly conduct
this training is unknown. Access to or possession of data does not guarantee the moral application of that data. The data
model mostly addresses moral challenges associated with model creation and deployment. Utilizing historical informa-
tion, an analytical model may be used to better understand and predict future events. Analytical models are mathematical
methods that anticipate a state using past knowledge. However, using an algorithm might result in unethical situations
or make already unethical situations worse.4 Data science has so far been applied primarily to producing advantageous

This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the
original work is properly cited.
© 2023 The Authors. Engineering Reports published by John Wiley & Sons Ltd.

Engineering Reports. 2024;6:e12722. wileyonlinelibrary.com/journal/eng2 1 of 12


https://doi.org/10.1002/eng2.12722
25778196, 2024, 3, Downloaded from https://onlinelibrary.wiley.com/doi/10.1002/eng2.12722 by CochraneArgentina, Wiley Online Library on [03/05/2024]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License

results for business and society, such as risk management, the detection of tax fraud, the prediction of terrorist acts, or in
a commercial context, boosting profitability, raising revenues, or saving money.
Data science has enabled citizens to benefit from better, more effective services. However, as with any technology,
data science has also had drawbacks, including an increase in privacy invasion, discrimination against vulnerable groups
based on data, and data-driven decision-making without justification. Within the fast-growing field of data science,
data ethics is an incipient but rapidly growing subfield of study.6-10 The algorithms
that underlie automated systems in data science projects frequently produce unfair results.11,12 This has the potential to
have a devastating effect, even on products developed in data science projects with the best of intentions. In data science
initiatives, the algorithms powering automated systems often lead to unethical outcomes.7-9,11,12
A data scientist uses preloaded libraries to call on existing datasets. Data science projects require critical decisions
from data collection through model development. Ethics in data science is arguably even more important for managers in
businesses where data science practices are a key asset.13 The members of a team working on a data science project need to
be familiar with the theories, procedures, and stories of ethics in data science because ethical considerations are becoming
increasingly relevant. Data scientists and managers are not inherently unethical, but at the same time, they are not trained
to think this through either. Some popular instances are Microsoft’s racist chatbot, Google Photos’ incorrect recognition
of a picture with black people as gorillas,14,15 Apple Card’s inability to quickly respond to accusations of discrimination
against women,16 Amazon’s apparent discrimination against women,17 and the Cambridge Analytica debacle involving
Facebook data.18
Being ethical has been promoted as a life goal, but there are also significant societal and corporate benefits. The hazards
to one’s reputation and finances are enormous when it comes to data science ethics. If data scientists do not get the
ethical issues correct, they risk having their growth (or possibly the growth of their firm) completely halted and getting
into difficulty during investment discussions or due diligence. Financial hazards are easily correlated with reputational
risks. Lawsuits and settlements can result in significant financial losses, just as unethical data science can cause emo-
tional and bodily suffering. Data science models may be improved as a result of ethical reasoning, possibly with more
accurate forecasts or increased user acceptance. In addition to better data models, ethical behavior may also be a power-
ful marketing tool, as Apple is increasingly emphasizing the privacy component of its products. Thus, data science ethics
can increase business value through higher profits, lower costs, or increased revenue.
Data ethics is the study and promotion of ethical practices with regard to data (including its creation, use, sharing,
dissemination, processing, curation, and collection), algorithms (such as robotics, deep learning, machine learning,
intelligent agents, and artificial intelligence), and corresponding practices (such as professional codes, hacking, programming,
and responsible innovation; for instance, right conduct or right values).6 Thus, data science ethics can be grouped under
the ethics of data, algorithms, and practices.6 In this paper, we focus primarily on data ethics (data collection, building a
data model, its evaluation, and deployment).
The paper is organized as follows: Previous research on ethical issues in data science initiatives is discussed in
Section 2. Our ethics framework for data science projects is described in Section 3. Section 4 presents the case study
conducted in this research work to evaluate the framework. Section 5 outlines our observations on the utilization of the
framework, followed by a conclusion and scope for future research.

2 REVIEW OF LITERATURE

We examined the Scopus digital library (www.scopus.com) to find the most recent articles on data ethics and the related
problems. The search was performed with the help of the following search string, which we built out of descriptive keywords
related to our area of interest: (data science AND (ethics OR data ethics OR ethical concerns) AND (“data science
projects” OR “data privacy” OR “data scientists” OR “data models” OR “data gathering”)). Our review of the literature
followed Kuhrmann et al.’s19 guidelines. We also used snowballing techniques to supplement the Scopus search, as rec-
ommended by these researchers. However, we point out that our objective was not to conduct a comprehensive systematic
literature review.20 Instead, we wanted to track down some of the most up-to-date publications on the topics that interest
us, namely data ethics and the related ethical concerns in data science projects, to highlight the variety of
research focuses taken into account in the data science community. To create our framework for the management of ethics
in data science projects, we sought only to identify the essential elements of data ethics in previously published
related research papers. To reach our study goals, we also review findings from relevant work.

It is no surprise that there were several literature reviews devoted to the topic of ethics in computer science. Brey
& Soraker,21 Stahl et al.,22 Yallop & Aliasghar,23 Akter et al.,1 and Wei & Pardo3 provide useful overviews of the ethical
difficulties and concerns in industrial projects generally, while anthologies aim to cover the most important ones.24-26
However, not one has focused on data scientists and the ethical challenges they confront in the modern day. Ethics in
data science is a topic that has received a lot of attention recently.1,3,10,13,23,27-29 It is uncertain if data science companies
have the breadth and depth of expertise to easily deliver this training, but they have been advised to provide guidance and
support, as well as interactive ethical exams to help staff examine ethical challenges.5
If data science ethics goes unaddressed, it may introduce new business risks. Because of this, some researchers have
remarked that none of the existing codes of conduct adequately address the full spectrum of ethical concerns a data science
team may confront.26,30 Therefore, all the existing practices or approaches toward data ethics management are insufficient
for data science projects. Furthermore, the need for ethics was reaffirmed by a coordinated team of data scientists who
created a code of professional behavior for the area.
More and more data are being created, archived, and made accessible to data scientists so they may analyze patterns
in the data and extrapolate information about the future. Making an ethical framework, for instance, was proposed as a
means of facilitating accurate terminology when discussing moral dilemmas arising from data science.30,31 Data science
teams may benefit from a comprehensive, all-encompassing framework for dealing with data science’s ethical issues.30
Many people stated that there was no existing code of ethics that fully covered what was needed,5,10,31 and they also stated
that a more general code of ethics would not be useful because it would not be specific enough.
Data access or collection does not imply morally acceptable data use. Additionally, “upstream” ethical concerns exist,
like the privacy consequences of how big data are initially collected.32 Feelings, perspectives, and correct data processing
make it difficult to know whether consent was given to the data in question.33 In terms of how the data is used, the data
scientist must verify the “fitness of purpose.” Otherwise, data might be used inappropriately or not for the data provider’s
intentions.4 It is important to pay attention to how the results of the analysis are presented as well as how the data was
analyzed, and the people who design the analytics must fully comprehend and articulate how they will affect the data.34,35
Data cleaning, data modeling, and model deployment were recognized as the three main data-related issues in a data
science project about data ethics.4,36,37
Following the recommendations made by Kitchenham & Charters,38 we developed a data extraction process to find pertinent
data in previously published research works (specifically, 46 related research papers were reviewed) that
are relevant to our goal of identifying the essential components for managing ethics in a data science project. As part of
the data extraction process, we designed a form to keep track of the 46 publications’ ideas, views, contributions, and con-
clusions. After the data were extracted, we used content analysis39,40 to examine the main ethical ideas covered in each
paper. Additionally, as part of the data extraction, each of these crucial ideas was noted.
By examining the papers through an iterative process that involved item surfacing, refining, and regrouping, we were
able to specifically characterize the ethical issues that were raised in the previous research studies. Finally, we conducted
an inter-rater study, as suggested by Fleiss et al.41 to see how well our data extraction and categorization procedures
held up to replication. We had two separate coders examine the research papers to determine the inter-rater agreement
between the researchers. Eighty-four percent of the coding selections made after training were agreed upon by the coders.
To get a final collection of coded data, disagreements were discussed and resolved.
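
The agreement figure reported above is a simple percent agreement between two coders. As a minimal sketch of how such a check can be computed, the Python below calculates percent agreement and, as an added illustration that the study itself does not report, Cohen's kappa, which corrects for agreement expected by chance. The category labels and the two coders' ratings are hypothetical and are not the study's data.

```python
from collections import Counter

def percent_agreement(coder_a, coder_b):
    """Share of items on which the two coders chose the same category."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must rate the same non-empty set of items")
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return matches / len(coder_a)

def cohens_kappa(coder_a, coder_b):
    """Observed agreement corrected for the agreement expected by chance."""
    n = len(coder_a)
    observed = percent_agreement(coder_a, coder_b)
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical category throughout
    return (observed - expected) / (1 - expected)

# Hypothetical codes assigned by two coders to ten extracted passages.
coder_1 = ["privacy", "consent", "fairness", "privacy", "deployment",
           "fairness", "consent", "privacy", "deployment", "fairness"]
coder_2 = ["privacy", "consent", "fairness", "consent", "deployment",
           "fairness", "consent", "privacy", "fairness", "fairness"]

print(f"percent agreement: {percent_agreement(coder_1, coder_2):.2f}")
print(f"Cohen's kappa:     {cohens_kappa(coder_1, coder_2):.2f}")
```

Percent agreement is easy to read but can look high when one category dominates, which is why a chance-corrected statistic is often reported alongside it.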
In their research on the ethical issues affecting data science projects, many researchers1,3-5,34,36,37 concurred that the
elements such as data privacy, data ownership, defining target variables used in a data model, fair evaluation of built-in
data model, and foreseeing the potential consequences of the deployed data model are all set to invite ethical concerns
or issues in a data science project. However, no practical answer or paradigm addressing these issues has emerged from
the aforementioned studies. In addition, we discovered that little work had been done to organize the ethical actions into
a unified procedure represented by a framework. Thus far, Saltz & Dewar4 have done the only study in the field of data
science ethics that has resulted in a published theory.
We combined the specific insights we gained while reading up on ethical considerations and issues in earlier research
initiatives involving data science. We categorize them below as S-1 through S-3, in the form of a summary. This study’s
abridged findings served as the basis for our proposed framework for handling ethics in a data science project.

S-1: As part of data collection and data cleaning in a data science project, addressing data privacy and informed
consent from the data owners is essential to maintaining data ethics in practice.
S-2: Properly defining target variables in a data model and benchmarking the built-in model is required to keep the
model fair and transparent.

S-3: At the stage of the deployment of the data model itself, foreseeing potential consequences will curtail ethical
concerns.

3 DATA ETHICS MANAGEMENT FRAMEWORK

Armed with our understanding of data science projects and their adjoining ethical concerns and issues and considering
the observations made by previous researchers on data ethics (summarized in Section 2), we draw the elements for our
proposed framework for managing ethics in a data science project. Figure 1 shows our proposed framework.
Figure 1 shows that most data science projects involve three primary stages, namely data cleansing, data modeling, and
model deployment. Data collection is the first and foremost step for data scientists. All else being equal, the quality
of the data determines the accuracy and reliability of the analyses. Data cleaning is therefore an essential first step
in fostering an organization-wide culture that values fact-based decision-making.
Companies are gathering more data as production costs have decreased. However, acquiring more information from
many sources may only increase the volume and give the appearance of more evidence; if the added data are inaccurate,
they reinforce systemic error rather than strengthen the analysis. Data become less trustworthy as inaccurate data
become more prevalent. Data scientists should therefore investigate any potential issues with the raw data.
The process of removing erroneous, damaged, incorrectly formatted, duplicate, or inadequate data from a dataset and
replacing it with new, accurate data is known as “data cleaning.” When combining many sources of data, there is a high
probability of making mistakes in the form of data duplication or labeling. Results and algorithms are unreliable when
they are based on inaccurate data, even though the results and algorithms seem to be right. The specific procedures in
the data cleaning process cannot be prescribed in a single, universal fashion because they differ from dataset to dataset.
However, it is crucial to create a template for your data cleaning approach so that you can be certain it is followed properly
each time. From the perspective of data ethics, it is inferred from the framework that data privacy and informed consent
are two integral parts of datasets used by data scientists for analytics.
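
The idea of a reusable cleaning template can be sketched as a single function that applies the same steps in the same order on every run and reports what it removed. This is only an illustration under stated assumptions: the field names and sample records are hypothetical, and the sketch drops faulty rows rather than replacing them with corrected values, which a real project would decide per dataset.

```python
def clean_records(records, required_fields):
    """Apply the same cleaning steps, in the same order, on every run.

    Returns the cleaned records plus a report of how many rows each
    step removed, so that the cleaning itself stays auditable.
    """
    report = {"incomplete": 0, "duplicate": 0}
    seen = set()
    cleaned = []
    for record in records:
        # Step 1: normalise formatting (strip whitespace in string values).
        row = {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}
        # Step 2: drop rows with a missing or empty required field.
        if any(row.get(field) in (None, "") for field in required_fields):
            report["incomplete"] += 1
            continue
        # Step 3: drop exact duplicates, which arise easily when sources are merged.
        key = tuple(sorted(row.items()))
        if key in seen:
            report["duplicate"] += 1
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned, report

# Hypothetical raw records merged from two sources.
raw = [
    {"id": "1", "age": 34, "city": " Madurai "},
    {"id": "1", "age": 34, "city": "Madurai"},   # duplicate after normalisation
    {"id": "2", "age": None, "city": "Indore"},  # incomplete
    {"id": "3", "age": 51, "city": "Indore"},
]
cleaned, report = clean_records(raw, required_fields=["id", "age", "city"])
print(cleaned)
print(report)
```

Returning the report alongside the data makes it possible to notice when a new source suddenly produces many more rejected rows than before.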
By “informed consent,” we mean that the human subject must be made aware of the experiment, must permit it to
proceed, and must be given the option to revoke their consent at any time by notifying the data scientists. The consent
must be given voluntarily, which means that it must not be coerced, so that data analytics can be done without any

FIGURE 1 A framework for data ethics management in data science projects.



ethical trouble. Any data science project that involves people should, ideally, take into account what the Institutional
Review Board (IRB) says. The IRB is made up of different people, some of whom are not scientists. It approves research
on human subjects, weighs the risks to the subjects against the benefits to science, and handles situations where informed
consent cannot be given. Concerning voluntary consent, also called “voluntary disclosure,” the people whose information
is being shared should be told that anything they share voluntarily with others is much less safe than information they
keep to themselves.
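
One way to make revocable consent operational is to check consent at the moment the data are used, not only when they are collected. The sketch below assumes a hypothetical in-memory `ConsentRegistry` keyed by subject and purpose; the class, its method names, and the sample purpose are illustrative and do not come from the framework itself.

```python
from dataclasses import dataclass, field

@dataclass
class ConsentRegistry:
    """Tracks which data subjects currently consent to which purposes."""
    _grants: dict = field(default_factory=dict)  # subject_id -> set of purposes

    def grant(self, subject_id, purpose):
        self._grants.setdefault(subject_id, set()).add(purpose)

    def revoke(self, subject_id, purpose):
        # Revocation always succeeds, even if nothing was granted before.
        self._grants.get(subject_id, set()).discard(purpose)

    def allows(self, subject_id, purpose):
        return purpose in self._grants.get(subject_id, set())

def usable_rows(rows, registry, purpose):
    """Keep only rows whose subject consents to this purpose right now."""
    return [r for r in rows if registry.allows(r["subject_id"], purpose)]

registry = ConsentRegistry()
registry.grant("s1", "churn_model")
registry.grant("s2", "churn_model")
registry.revoke("s2", "churn_model")  # s2 withdraws consent later

rows = [{"subject_id": "s1", "x": 1}, {"subject_id": "s2", "x": 2},
        {"subject_id": "s3", "x": 3}]  # s3 never gave consent
print(usable_rows(rows, registry, "churn_model"))
```

Filtering at use time means a withdrawal takes effect on the next analysis run without anyone having to edit the dataset by hand.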
When discussing data science ethics, data privacy is likely to come up first. In the present digital age, the right to pri-
vacy has assumed significant importance. Data scientists frequently believe that open data is freely available for copying,
which is a common fallacy. It is common for businesses and startups to have ideas about how to use an already-existing
public dataset, but one should take care when obtaining such data. The rights and policies of the database are two factors
that control data privacy.
As a consequence of database rights, it is illegal to replicate a database without the permission of the owner. This is
because database creation requires a significant financial investment, and database rights recognize this fact. A “collection
of independent works, data, or other resources that are arranged in a systematic or methodical form and are individually
accessible by electronic or other means” is a database under European law. When creating a database required a substantial
investment, its contents are protected against unauthorized copying. Searching the public database is allowed, but copying large chunks is not.
After data cleaning comes building a data model with well-defined target variables. Analytics models abstract the real
world. This abstraction purposefully distances analytics results from reality to aid higher-level decision making. The
gap between theory and reality, however, may be unnecessarily widened by unintended omissions or inadequate models.
Here, management expertise, global knowledge, and analytical work can come together to support decisions that
are better than any one of them could produce on its own. Data scientists should review the list of variables used in
the data models and redefine them where appropriate.
The aspect of a dataset about which you want to learn more is referred to as the “Target Variable” of a dataset in a
data model for a data science project. By analyzing your existing data, supervised machine learning may help you find
correlations between your goal and other variables. The data scientists are expected to benchmark the model with other
similar related data models to verify and validate any trade-offs between the input variables and the target variables. This
will be followed by evaluating the model. Data science modeling can incorporate privacy in many ways. Imagine that you
are a data scientist who has been tasked with creating a range of prediction models using datasets acquired from various
data suppliers. In that setting, data scientists must also prevent sensitive variables from being predicted from the datasets.
Political preference, for example, might not be stated explicitly in a dataset but could still be inferred.
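
A rough screening check for such proxy variables is to ask how much better one can guess the sensitive attribute from a single feature than from no feature at all. The sketch below is a simple heuristic, not a method prescribed by the framework; the feature names, the party labels, and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict

def proxy_strength(rows, feature, sensitive):
    """Accuracy gain from guessing `sensitive` using `feature` alone.

    Compares a majority-vote guess per feature value against always
    guessing the overall majority class. It is measured on the same
    rows, so it is an optimistic screening signal, not a validated estimate.
    """
    overall = Counter(r[sensitive] for r in rows)
    baseline = overall.most_common(1)[0][1] / len(rows)

    by_value = defaultdict(Counter)
    for r in rows:
        by_value[r[feature]][r[sensitive]] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_value.values())
    return correct / len(rows) - baseline

# Hypothetical records: "party" is the sensitive attribute to protect.
rows = [
    {"newspaper": "N1", "region": "north", "party": "X"},
    {"newspaper": "N1", "region": "south", "party": "X"},
    {"newspaper": "N1", "region": "north", "party": "X"},
    {"newspaper": "N2", "region": "south", "party": "Y"},
    {"newspaper": "N2", "region": "north", "party": "Y"},
    {"newspaper": "N2", "region": "south", "party": "X"},
]
for feature in ("newspaper", "region"):
    print(feature, round(proxy_strength(rows, feature, "party"), 3))
```

A feature with a clearly positive gain deserves a closer look before it is used as a model input, since it may let the model reconstruct the sensitive attribute indirectly.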
Particular ethical preferences may be incorporated into the model during the modeling step. The fundamental
justification for this is that the data do not accurately reflect the desired results. This could be useful when dealing with
uncommon situations or when positive discrimination toward specific groups is desired. A data science-driven
prediction model may, at some point, have to choose between running over one object (say, A) and another (say, B).
An ethical discussion should be held on which preference to encode, because such occurrences are so infrequent that
they will hardly ever (if ever) appear in the data.
It has been said that debugging is harder than programming. Likewise, data scientists may be unable to properly
evaluate their models if those models are built at the limit of their expertise. Benchmarking involves
comparing the inputs and outputs of one model with projections based on different sets of internal or external data or models.
It can be incorporated into both the model-building process and ongoing monitoring. A common error in benchmarking
is testing and maximizing a statistic that does not address the business issue. A false negative, for instance, could be far more
expensive than a false positive in areas like fraud detection and medical diagnosis. One of the sneakiest ways in which
a benchmark can produce false results is dataset leakage. If more sophisticated models, such as deep neural networks, random
forests, and gradient-boosted machines, are trained on a dataset with leakage, they may outperform simpler models on
holdout sets even though they would have no more predictive power on production data.
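
The point about optimizing the wrong statistic can be illustrated with a small cost-weighted benchmark. In the sketch below, two hypothetical fraud models are compared by plain accuracy and by a total cost in which a missed fraud (false negative) is assumed to be 50 times as expensive as a false alarm. The labels, predictions, and cost values are invented for illustration.

```python
def error_counts(y_true, y_pred):
    """Return (false_negatives, false_positives) for binary labels 0/1."""
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return fn, fp

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def total_cost(y_true, y_pred, cost_fn, cost_fp):
    """Business cost of the errors, weighting each error type separately."""
    fn, fp = error_counts(y_true, y_pred)
    return fn * cost_fn + fp * cost_fp

# Hypothetical holdout set: 1 = fraud, 0 = legitimate.
y_true  = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
model_a = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # misses one fraud, no false alarms
model_b = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # catches both frauds, two false alarms

for name, pred in (("A", model_a), ("B", model_b)):
    print(name, accuracy(y_true, pred), total_cost(y_true, pred, cost_fn=50, cost_fp=1))
```

Under these assumed costs, model A wins on accuracy while model B is far cheaper in practice, which is the kind of trade-off a benchmark should surface before deployment.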
In general, evaluating data science models is a challenging process. It includes deciding what will be monitored,
interpreting the results, and producing reports using data analytics. Data science projects must be carried out ethically
and in accordance with all industry norms. Data scientists must consider fairness and transparency while assessing a data model.2 During
the review stage, the data science model is assessed against fairness standards, including privacy
and discrimination against vulnerable groups. Individual characteristics such as ethnicity, age, and marital status should be
made available to the data scientists so that they can assess how fair the model is when it is applied to sensitive groups.
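
One common way to carry out such a group-level fairness assessment is to compare the rate of favourable predictions across groups, often called a demographic parity check. The paper does not prescribe a specific fairness metric, so the sketch below is one possible choice; the group labels and predictions are hypothetical.

```python
from collections import defaultdict

def selection_rates(groups, y_pred):
    """Share of favourable predictions (1) within each group."""
    totals, positives = defaultdict(int), defaultdict(int)
    for g, p in zip(groups, y_pred):
        totals[g] += 1
        positives[g] += int(p == 1)
    return {g: positives[g] / totals[g] for g in totals}

def parity_gap(groups, y_pred):
    """Largest difference in selection rate between any two groups."""
    rates = selection_rates(groups, y_pred)
    return max(rates.values()) - min(rates.values())

# Hypothetical model decisions (1 = approved) for eight applicants.
groups = ["married"] * 4 + ["single"] * 4
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

print(selection_rates(groups, y_pred))
print(parity_gap(groups, y_pred))
```

A large gap does not by itself prove unfair treatment, but it flags where the team should look more closely. The check is only possible if the sensitive characteristic is available at evaluation time.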
Transparency plays many significant roles in the evaluation of the approach. The first is about properly evaluating
models. It is crucial enough to have ethical implications. Different performance measures may influence different model

decisions. Think about estimating the share price of a private company as an example. If the prediction is made using the
firm’s share prices for a given week in the preceding month and if the price climbed during this time, then it stands to
reason that the prediction will reflect a faster rate of share price growth in the future. However, if the data model takes into
account the dataset on a weekly, monthly, quarterly, half-yearly, and yearly basis, the model should anticipate different
share values.
Data model deployment follows evaluation. Every company and person must balance ethical concerns with the utility
of data. These weights determine ethics, equilibrium, and best practices. Data scientists must consider the pros and cons
(including potential negative consequences) of their data model to avoid ethical issues in the future. When the data science project is
deployed, who has access to the system must be considered. Access to the system may be restricted to particular people or
places for several reasons. The first is that companies must restrict access to sensitive
and private data. A logging system that records each data access is crucial, in addition
to the obvious confidentiality, integrity, and access control measures. For instance, if a banker checks a famous person’s
payment history, they should be held accountable and prepared to justify the action. Build in this logging functionality
whenever you anticipate having access to such sensitive data, and inform your staff of its presence and of the potential
consequences if the data model is viewed without authorization.
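
The logging requirement can be sketched as a thin wrapper that refuses any read without a stated reason and records who accessed which record and when. The class name `AuditedStore`, the record identifiers, and the user names below are hypothetical; a production system would write the log to append-only, tamper-evident storage and not keep it in memory.

```python
from datetime import datetime, timezone

class AuditedStore:
    """Wraps sensitive records so that every read is justified and logged."""

    def __init__(self, records):
        self._records = records
        self.access_log = []  # assumption: stands in for durable audit storage

    def read(self, user, record_id, reason):
        if not reason.strip():
            raise PermissionError("a justification is required for every access")
        # Log before returning so that even failed lookups leave a trace.
        self.access_log.append({
            "when": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "record_id": record_id,
            "reason": reason,
        })
        return self._records[record_id]

store = AuditedStore({"acct-001": {"holder": "public figure", "late_payments": 0}})
record = store.read("banker_17", "acct-001", reason="loan application review")

print(record)
print(store.access_log[0]["user"], store.access_log[0]["reason"])
```

Requiring the reason at call time is a small design choice that makes accountability part of the normal workflow, so that justifications are not reconstructed after the fact.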
Certain substantial, potentially unethical uses exist for some data science tools. Hence, access must be tightly
restricted. Companies and governments embracing data science now have a new kind of power because they have exten-
sive discretion over who has access to the system. Again, being open and honest about the decision-making process,
including the reasons for decisions and ethical considerations, is critical. Managers can be tricked despite their best efforts.
Defense in depth, that is, considering the ramifications of the failure of one line of defense, is a helpful concept.
An additional degree of security for managerial decisions based on data would be to acknowledge that the data itself
still has the potential to be deceptive. If the data are insufficient or unreliable, what are the potential positive and nega-
tive consequences in terms of costs, time, prediction accuracy, and other factors? Setting out a plan to lessen excessive,
undesirable, or unfavorable effects will unquestionably prevent the data science initiatives from suffering a great loss. It
is important to consider these issues carefully because dealing with the consequences of data dishonesty and recovering
from it could take a long time and be difficult. In the next section, we discuss the preliminary evaluation of the proposed
framework, followed by our observations and lessons learned from it.

4 PRELIMINARY EVALUATION OF THE PROPOSED FRAMEWORK

Our framework (Figure 1) for data ethics management in data science projects emerged from the extensive review of
literature on data ethics management in data science projects discussed in previous related research works. We now intend
to evaluate this framework with the help of a case study. Hence, a case study was conducted in a mid-size IT services
company (pseudonym: DS-Tech) during 2020–2021. This company is headquartered in India and owns branches in the
United States and Singapore. DS-Tech is managing several data science projects for their client organizations in India. The
case study methodology was adopted from Yin.42 The case study’s primary purpose was to provide data scientists with a
first look at how well the suggested framework for ethical data management in data science projects stands up to actual
use in the field. Table 1 describes the case study company.
A total of 27 data science project team members (Data scientists 12, IT/Business analysts 3, Team Lead 2, Data engineer
4, Programmers 6) from “DS-Tech” participated in this evaluation process conducted as part of the case study. As suggested
by the case study research methodologist Yin,42 the following procedures were used in this case study (Steps 1–5):

Step 1: Prepare a draft questionnaire (Evaluation Questions-EQ) to interview43 the case study participants and get their responses on our framework.
Step 2: Test the questionnaire with two to three interviewees.
Step 3: Use the final questionnaire (Appendix A) to interview the chosen participants.
Step 4: Assess the practicality of our framework. This questionnaire should answer the framework’s evaluation questions.
Step 5: Engage with case study participants to learn more about the framework.

TABLE 1 Profile of case study company.

Case study company: DS-Tech
Employee strength: 91
Employees’ experience (mean): 4.5
Project domain: Mobile Apps, Healthcare analytics, Insurance systems
Years of existence: 7
Revenue per year (mean) (in $): 2,65,000



FIGURE 2 Evaluation score on the framework for data ethics management.

Our case study respondents were chosen based on the following criteria: the participants had been involved in data
science projects at the case study company for at least 2–3 years in the recent past and were familiar with database
management, software development, and data science model development and deployment. A total of 27 participants were
selected from the case study company, “DS-Tech.” The first author works at a top southern Indian technological institute,
one of whose alumni founded this company. This allowed the authors to work readily with the company on data
collection and analysis to evaluate the proposed framework.
Focused semi-structured interviews, as described by King & Horrock,43 were employed to obtain our data. By “focused,”
we mean that the interviews and our interactions with the case study participants centered on their responses to our
evaluation questionnaire regarding the applicability of the proposed framework. The interviews were “semi-structured”
in that they consisted of predetermined questions, while the interviewer retained the leeway to improvise additional
questions designed to elicit more specific responses from the interviewee.
The data collection process was as follows. As a first step, the proposed framework for data ethics management in
data science projects was shared with the chosen 27 case study participants, and its working principle was explained
to them. They were then asked to apply the framework to their upcoming data science project to ascertain its
suitability in practice, and were given 7 weeks to carry out this exercise. Each participant was then met individually
for an interview as well as informal interaction. At this point, the assessment questionnaire (Appendix A) was given
to each of them and their responses were gathered using a Likert scale rating,44 with the respondents choosing a
number between 1 (“strongly disagree”) and 5 (“strongly agree”) to assess whether the framework manages the team
member’s ethical concerns in practice.
The total duration of this interview process, including informal interactions, was 9 h, spread over 2 days with
DS-Tech. To determine how the suggested framework was viewed by the members of the data science project who utilized
it in the field, the responses (scores) of the case study participants were recorded in an MS Excel sheet and examined
graphically. The descriptive statistics of these scores are displayed in Figure 2.
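
The descriptive statistics reported per evaluation question (mean, lowest, and highest score) can also be computed outside a spreadsheet. The following is a minimal Python sketch under stated assumptions: the study itself used MS Excel, the `responses` dictionary holds made-up illustrative scores rather than the study's data, and the function name `describe` is hypothetical.

```python
from statistics import mean

# Hypothetical Likert responses (1-5), one list per evaluation question.
# These values are illustrative only and are NOT the study's data.
responses = {
    "EQ1": [3, 4, 5, 4],
    "EQ2": [4, 4, 5, 3],
    "EQ3": [5, 4, 4, 4],
}


def describe(scores):
    """Return the mean, lowest, and highest score for one question."""
    if not all(1 <= s <= 5 for s in scores):
        raise ValueError("Likert scores must lie between 1 and 5")
    return {"mean": round(mean(scores), 2), "min": min(scores), "max": max(scores)}


# One summary row per evaluation question.
summary = {eq: describe(scores) for eq, scores in responses.items()}
for eq, stats in summary.items():
    print(eq, stats)
```

Validating the 1 to 5 range before summarizing guards against data-entry errors, which matter when a small sample is summarized by its extremes.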

5 DISCUSSION

This section covers the findings from our initial assessment of the framework for managing data ethics in data science
projects. Taken as a whole, our case study participants perceived the framework as appropriate for use in a data
science project. They concur that the framework makes it easier for the data science project team to address ethical
issues while dealing with data pre-processing, data cleaning and preparation, data modeling, evaluation, and
deployment. The attributes of this framework appear to have generally met the data science company’s expectations,
which is encouraging for the framework’s potential application in their future data science projects. From Figure 2,
we infer that the mean score,
the lowest score, and the highest score provided by the case study participants for the evaluation questions (EQ1–EQ10)
vary between 3 and 5 for the framework on data ethics management in data science projects. The evaluation questions
“EQ9” and “EQ10,” which deal with the suitability of the framework to balance ethical concerns with data utility and to
address ethical concerns about complex data models, obtained mean scores of 3.22 and 3.478, respectively. For the other
EQs, the mean score is considerably higher.
The evaluation questions EQ1, EQ2, EQ5, EQ6, and EQ8 uniformly obtained a lowest score of “3,” while for the other
EQs it is “4.” Similarly, the evaluation questions EQ2, EQ3, EQ4, EQ7, and EQ8 uniformly secured a highest score of
“5,” while for the other EQs it is “4.” As a whole, the scores provided by our case study participants through the
Likert scale appear to favor the use of our proposed framework in practice in data science projects for consciously
managing ethical concerns. Participants in the case study agreed that for data science companies to build a strong
data analytics product, they must strike a balance between ethical considerations and data utilization. However, they
felt that the process of comparing a data model to models that have already been developed did not need to be given
more weight. Data science and related technologies are developing quickly, so this suggestion may be reasonable. Rapid
change may also affect the benchmarking of data models: rather than comparing apples to apples, benchmarking could
occasionally amount to comparing apples to limes.
The conclusions of prior studies4,6 that addressing the ethical issues surrounding data collection, modeling, evalu-
ation, and deployment is crucial for a data science project are supported by the evaluation results of our framework.
Participants in the case study have also noted that the framework has distinct parts to identify and address ethical concerns
regarding the secure collection of data from numerous sources, building a model to define target variables, comparing it
to other models already in use, and finally assessing it for its impact from a behavioral science perspective. It is interesting
to note that this result concurs with earlier studies.4,36,37
The primary objective was to establish a preliminary assessment of the suggested framework for overseeing ethical
considerations in data science initiatives, as outlined by Wieringa. As per the findings of research methodologists, con-
ducting an initial evaluation study is the primary measure toward gradually expanding a novel approach to practical
circumstances. The objective of this study is to exhibit the practical implementation of the approach in a real-life scenario.
The purpose is to enable both researchers and practitioners to gain insights from the experience and compile a set of pre-
ferred attributes for the approach. These attributes can be taken into account while improving, augmenting, or refining
the approach.
The growing dependence on data acquisition, examination, and application across diverse fields, including technol-
ogy, commerce, healthcare, and governance, has given rise to the necessity for a framework for data ethics. The increasing
prevalence of data collection and utilization has led to a heightened awareness of the criticality of ethical considerations
in promoting responsible and advantageous data usage. Data ethics frameworks serve to protect the privacy rights of
individuals by establishing suitable practices for data collection, storage, and sharing. Using a standard data ethics frame-
work, the data science project team should strive to establish a set of guidelines pertaining to the acquisition of informed
consent, anonymization techniques, and measures to prevent unauthorized access or misuse of data. If not appropriately
designed and monitored, data-driven systems possess the capability to sustain biases or engage in discriminatory practices
against specific individuals or groups.
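
One of the guideline areas named above, anonymization, can be illustrated with a small example. The following is a minimal Python sketch of pseudonymization with a keyed hash (HMAC-SHA256); it is not part of the proposed framework, and the function names, field names, and key are hypothetical. Pseudonymization of this kind reduces direct exposure of identifiers but does not by itself amount to full anonymization.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Replace a direct identifier with a keyed hash (HMAC-SHA256)."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def strip_identifiers(record: dict, id_fields: set, secret_key: bytes) -> dict:
    """Return a copy of the record with the listed identifier fields pseudonymized."""
    cleaned = {}
    for field, value in record.items():
        if field in id_fields:
            cleaned[field] = pseudonymize(str(value), secret_key)
        else:
            cleaned[field] = value
    return cleaned


# Hypothetical record and key, for illustration only.
key = b"replace-with-a-secret-key-stored-outside-the-dataset"
record = {"name": "Jane Doe", "email": "jane@example.com", "age": 41}
safe_record = strip_identifiers(record, {"name", "email"}, key)
print(safe_record)
```

Using a keyed hash rather than a plain hash means that someone without the key cannot recompute the pseudonyms from a list of known names, provided the key is stored separately from the data.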
The implementation of an ethical framework can effectively tackle concerns pertaining to algorithmic bias, fairness,
and transparency, thereby fostering just treatment and curbing prejudiced outcomes. The establishment of a
comprehensive data ethics framework plays a pivotal role in fostering trust among entities such as industries,
individuals, and society at large. When data collection practices are transparent and adhere to ethical principles,
stakeholders are more likely to trust the collection, utilization, and dissemination of their data. The implementation
of a data ethics framework prompts organizations to contemplate the wider societal implications of their data-related
activities, helping to ensure that they are consistent with social norms and contribute to the collective welfare.
Furthermore, it lays down mechanisms for ensuring accountability and promoting responsible conduct in activities
related to data.
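
The fairness concern raised in this paragraph can be made concrete with a simple metric. The following is a minimal Python sketch of one common check, the demographic parity difference (the gap in positive-prediction rates between groups). The framework does not prescribe this metric; it is used here only as an illustration, and the function names and example data are hypothetical.

```python
def selection_rate(predictions, groups, group):
    """Share of positive (1) predictions among members of one group."""
    members = [p for p, g in zip(predictions, groups) if g == group]
    if not members:
        raise ValueError(f"no records for group {group!r}")
    return sum(members) / len(members)


def demographic_parity_difference(predictions, groups):
    """Largest gap in selection rate between any two groups (0 means parity)."""
    rates = [selection_rate(predictions, groups, g) for g in set(groups)]
    return max(rates) - min(rates)


# Hypothetical binary model outputs and group labels, for illustration only.
predictions = [1, 0, 1, 1, 0, 0]
groups = ["a", "a", "a", "b", "b", "b"]
gap = demographic_parity_difference(predictions, groups)
print(f"demographic parity difference: {gap:.3f}")
```

A value near zero suggests that the groups receive positive predictions at similar rates; what gap counts as acceptable is a project-level judgment that the team would need to justify.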
Data ethics frameworks also aim to tackle the security and governance dimensions of data management. They provide
recommended practices for safeguarding data against breaches, unauthorized access, and cyber threats. As guided by the
framework, explicit guidelines should be outlined to ensure responsible data stewardship, data sharing, and data
retention in order to prevent any potential misuse or abuse of sensitive information. The implementation of data ethics
frameworks can also help address the obstacles associated with operating in a globalized context, wherein data is
transmitted across international boundaries.
Beyond demonstrating the framework’s suitability for data ethics management in a real-world setting for a data science
project, the purpose of the assessment research was to collect first-hand accounts and gain insight into the type of ethical
considerations that should be taken into account by the data science project team. Following Wieringa & Daneva,45 we
have divided our learnings into three groups: (1) learning about the data ethics challenges faced by the data science
project team; (2) learning about the efforts required of the data science project team to apply our framework; and
(3) learning about the framework usage experience of the case study company’s practitioners in our preliminary study.
The following is a summary of what we learned.
First, participants discussed how even an experienced team working on a data science project could not afford to spend
additional time conforming to all of the rules, regulations, and norms established by regulatory authorities in
various countries for data protection and privacy, because such compliance would require a significant amount of time.
This was the first impediment that the participants of the case study company ran into when attempting to address the
ethical considerations present in a typical data science project. In most cases, it results from a combination of two
factors: a shortage of trained personnel and a lack of awareness. Because they are dedicated to providing their clients
with the data analytics solution they desire, several of the case study participants claimed that they found it very
difficult to strike a balance between ethical concerns and the usage of data. This was one of the reasons why our study
was conducted. To tackle such practical issues successfully, the team working on a data science project needs a
simplified framework to guide and control its process flow and workflow.
Second, we want to point out that it is difficult for us to evaluate our framework in relation to other techniques in
terms of the time and human resources that would be required to use the framework for data ethics management. This is
because past studies on the management of data ethics did not present any frameworks of this kind, even though they
emphasized the significance of such a framework for teams working on data science projects. Third, the findings of our
analysis showed that the proposed framework demanded relatively little time and effort from the practitioners in order
to be utilized. Several of the participants made explicit reference to this observation during the evaluation process
conducted as part of the case study. They also suggested, however, that the members of the team working on the data
science project should participate in frequent training in order to keep up with the laws, rules, regulations, and
standards that regulatory organizations in different countries have enacted for the protection of data and privacy, as
well as for data analysis and reports.

6 CONCLUSIONS AND FUTURE RESEARCH

Data science has had an extremely positive impact on people’s lives and businesses, and it is quickly replacing some
traditional business practices at all but a small number of businesses. However, there is a chance of unintended,
expensive, and severe negative consequences. As many cautionary tales have demonstrated, anyone working on the cutting
edge of data science technologies will inevitably run into ethical issues. Data science ethics takes time, effort, and
training. Open discussions and an understanding of potential issues, ideas, and methods are crucial. Because data
science and data science ethics research are expanding, maintaining best practices will require time and resources. It
is, therefore, likely that a prerequisite is senior management’s willingness to support data science ethics and
recognize its significance.
The prior studies that discuss the requirement for ethical considerations in data science initiatives serve as the
foundation for the framework for managing data ethics in this study. We created a preliminary proposal for a framework
for managing ethics in a data science project following a thorough examination of the literature on data ethics. The
framework’s primary areas of attention are data cleansing, data modeling, and their evaluation and deployment. A case
study with 27 participants from a data science project team at a mid-sized IT services company assessed the framework’s
usability and applicability. The participants offered input on our evaluation questions through a questionnaire and in
informal conversations. Our suggested framework may be viewed as a theory outlining the crucial elements needed to
resolve ethical issues in a typical data science project. From the perspective of practitioners, data science project
teams should adopt a systematic framework like the one suggested in this study as a requirement. We recognize that
teams may be unable to adopt a strictly formal strategy, but they do need to begin considering the absolute minimum in
terms of processes and project documentation.
This study also provides some implications for future research. Additional case studies with IT organizations of
varying sizes and in different domains (for instance, banking, healthcare, and insurance), as well as with other
support team members, are needed to better understand the ethical data management activities of their projects and the
suggested framework. This provides a potential avenue of study. Future research should also focus on risk and impact
assessments for biased data, discriminatory outcomes, and privacy breaches.

AUTHOR CONTRIBUTIONS
Conceptualization, methodology, and analysis: Sudhaman Parthasarathy; Writing—original draft preparation: Sudhaman
Parthasarathy; Validation and Writing [review and editing] and Supervision: Girish H. Subramanian and Prabin Kumar
Panigrahi. All authors read and approved the final manuscript.

FUNDING INFORMATION
The authors did not receive support from any organization for the submitted work.

CONFLICT OF INTEREST STATEMENT


The authors have no conflicts of interest to declare that are relevant to the content of this article.

PEER REVIEW
The peer review history for this article is available at https://www.webofscience.com/api/gateway/wos/peer-review/10.1002/eng2.12722.

DATA AVAILABILITY STATEMENT


The data that support the findings of this study are available from the corresponding author upon reasonable request.

ORCID
Sudhaman Parthasarathy https://orcid.org/0000-0001-7439-6878
Girish H. Subramanian https://orcid.org/0000-0003-3477-3186

REFERENCES
1. Akter S, Dwivedi YK, Sajib S, Biswas K, Bandara RJ, Michael K. Algorithmic bias in machine learning-based marketing models. J Bus Res.
2022;144:201-216.
2. Rachel T, Martens D. Data science ethics concepts, techniques and cautionary. Oxford University Press; 2021.
3. Wei R, Pardo C. Artificial intelligence and SMEs: how can B2B SMEs leverage AI platforms to integrate AI technologies? Ind Mark Manag.
2022;107:466-483.
4. Saltz JS, Dewar N. Data science ethical considerations: a systematic literature review and proposed project framework. Ethics Inf Technol.
2019;21:197-208.
5. Leonelli S. Locating ethics in data science: responsibility and accountability in global and distributed knowledge production systems. Phil
Trans R Soc A. 2016;374(2083):20160122.
6. Baumer BS, Garcia RL, Kim AY, Kinnaird KM, Miles QO. Integrating data science ethics into an undergraduate major: a case study. J Stat
Data Sci Educ. 2022;30(1):15-28.
7. Eubanks V. Automating Inequality: how High-Tech Tools Profile, Police, and Punish the Poor. St. Martin’s Press; 2018.
8. Noble SU. Algorithms of Oppression: how Search Engines Reinforce Racism. NYU Press; 2018.
9. O’Neil C. Weapons of math destruction: how big data increases inequality and threatens democracy. Crown; 2016.
10. Yallop AC, Gica OA, Moisescu OI, Coroş MM, Seraphin H. The digital traveller: implications for data ethics and data governance in tourism
and hospitality. J Consum Mark. 2023;40(2):155-170.
11. D’Ignazio C, Klein LF. Data Feminism. MIT Press; 2020.
12. Fry H. Hello World: Being human in the age of algorithms. WW Norton & Company; 2018.
13. Richards D, Vythilingam R, Formosa P. A principlist-based study of the ethical design and acceptability of artificial social agents. Int
J Human-Comput Stud. 2023;172:102980.
14. Barr A. Google mistakenly tags black people as ‘gorillas’, showing limits of algorithms. Wall Street J. 2015;1(7):2015.
15. Simonite T. A sobering message about the future at AI’s biggest party. Wired. 2019.
16. Natarajan S, Nasiripour S. Viral tweet about apple card leads to Goldman Sachs probe. Bloomberg. 2019, November, 9.
17. Dastin J. Amazon scraps secret AI recruiting tool that showed bias against women. Ethics of Data and Analytics. Auerbach Publications;
2018:296-299.
18. Ghoshal D. Mapped: the breathtaking global reach of Cambridge analytica’s parent company. Quartz; 2018.
19. Kuhrmann M, Méndez Fernández D, Daneva M. On the pragmatic design of literature studies in software engineering: an experience-based
guideline. Empir Softw Eng. 2017;22(6):2852-2891.
20. Webster J, Watson RT. Analyzing the past to prepare for the future: writing a literature review. Manag Inf Syst Q. 2002;26(2):13-23.
21. Brey P, Soraker J. Philosophy of computing and information technology. In: Gabbay DM, Meijers AWM, Woods J, Thagard P, eds.
Philosophy of Technology and Engineering Sciences. Elsevier; 2009:1341-1408.
22. Stahl BC, Timmermans J, Mittelstadt BD. The ethics of computing: a survey of the computing-oriented literature. ACM Comput Surv.
2016;48(4):55-38.
23. Yallop CA, Aliasghar O. No business as usual: a case for data ethics and data governance in the age of coronavirus. Online Inf Rev.
2020;44(6):1217-1221.
24. Johnson D. Computer Ethics. Prentice-Hall; 1985.
25. Johnson D, Nissenbaum H. Computers, Ethics and Social Values. Pearson; 1995.
26. Saltz J, Dewar N, Heckman R. Key concepts for a data science ethics curriculum. Paper presented at: Proceedings of the 49th ACM technical
symposium on computer science education; 2018:952-957.
27. Bag S, Rahman MS, Srivastava G, Shore A, Ram P. Examining the role of virtue ethics and big data in enhancing viable, sustainable, and
digital supply chain performance. Technol Forecast Soc Chang. 2023;186:122154.
28. Bertino E, Kundu A, Sura Z. Data transparency with blockchain and AI ethics. J Data Inform Quality. 2019;11(4):1-8.
29. Hand DJ. Aspects of data ethics in a changing world: where are we now? Big Data. 2018;6(3):176-190.
30. Tractenberg RE, Russell AJ, Morgan GJ, et al. Using ethical reasoning to amplify the reach and resonance of professional codes of conduct
in training big data scientists. Sci Eng Ethics. 2015;21(6):1485-1507.
31. Voronova L, Kazantsev N. The ethics of big data: analytical survey. Paper presented at: 2015 IEEE 17th conference Business Informatics
(CBI), IEEE; 2015;2:57-63.
32. Pascalev M. Privacy exchanges: restoring consent in privacy self-management. Ethics Inf Technol. 2017;19(1):39-48.
33. Braun A, Garriga G. Consumer journey analytics in the context of data privacy and ethics. In: Linnhoff-Popien C, Schneider R, Zaddach M,
eds. Digital Marketplaces Unleashed. Springer; 2018.
34. Floridi L, Taddeo M. What is data ethics? Philos Trans Ser A. 2016;374:2083.
35. Fuller M. Big data, ethics and religion: new questions from a new science. Religion. 2017;8(5):88.
36. Krotov V, Johnson L. Big web data: challenges related to data, technology, legality, and ethics. Bus Horiz. 2022;66:481-491.
37. Wylie CD. Who should do data ethics? Patterns. 2020;1(1):100015.
38. Kitchenham B, Charters S. Guidelines for Performing Systematic Literature Reviews in Software Engineering. Keele; 2007.
39. Elo S, Kyngäs H. The qualitative content analysis process. J Adv Nurs. 2007;62(1):107-115.
40. Hsieh H-F, Shannon SE. Three approaches to qualitative content analysis. Qual Health Res. 2005;15(9):1277-1288.
41. Fleiss JL, Levin B, Paik MC. Determining sample sizes needed to detect a difference between two proportions. Stat Methods Rates Prop.
2004;2:64-85.
42. Yin RK. Case Study Research. 5th ed. Sage; 2013.
43. King N, Horrock C. Interviews in Qualitative Research. Sage; 2010.
44. Allen E, Christopher S. Likert scales and data analyses. Qual Prog. 2007;64-65.
45. Wieringa RJ, Daneva M. Six strategies for generalizing software engineering theories. Sci Comput Program. 2015;101(1):136-152.

How to cite this article: Parthasarathy S, Panigrahi PK, Subramanian GH. A framework for managing ethics in
data science projects. Engineering Reports. 2024;6(3):e12722. doi: 10.1002/eng2.12722

APPENDIX A. QUESTIONS FOR EVALUATION OF THE FRAMEWORK

This evaluation questionnaire has 10 statements. We ask case study participants to assess their agreement or
disagreement with each statement on a Likert scale from 1 (strongly disagree) to 5 (strongly agree). In parentheses
after each statement, we identify the evaluation issue (data cleaning involving privacy, data modeling, model
evaluation, or model deployment) that it covers.

EQ1. It is possible to address ethical concerns, namely data privacy and informed consent, during the data cleaning
process with the help of the framework. (Data cleaning)
EQ2. The framework helps data scientists explore any potential issues with the raw data. (Data cleaning)
EQ3. The framework as a whole helps us to plan and address ethical concerns about our project. (Data modeling)
EQ4. The framework reminds us of the importance of providing a clear definition for target variables during data
modeling. (Data modeling)
EQ5. The framework guides us to benchmark our data model with other standard models within our domain. (Model
evaluation)
EQ6. The framework helps us check the fairness of the data model. (Model evaluation)
EQ7. The framework helps us check the transparency of the data model. (Model evaluation)
EQ8. The framework helps us reason about the potential positive and negative consequences of the data model. (Model
deployment)
EQ9. The framework facilitates the data science project team in balancing ethical concerns with the utility of data.
(Model deployment)
EQ10. The framework covers the important aspects of data ethics required for a complex model in a data science
project. (Model deployment)
