Download as pdf or txt
Download as pdf or txt
You are on page 1of 12

Bias Testing and Mitigation in LLM-based Code Generation

Dong Huang* Qingwen Bu* Jie Zhang


The University of Hong Kong Shanghai Jiao Tong University King’s College London
HongKong SAR, China Shanghai, China London, UK
dhuang@cs.hku.hk qwbu01@sjtu.edu.cn jie.zhang@kcl.ac.uk

Xiaofei Xie Junjie Chen Heming Cui


Singapore Management University College of Intelligence and The University of Hong Kong
Singapore Computing, Tianjin University HongKong SAR, China
xfxie@smu.edu.sg Tianjin, China heming@cs.hku.hk
junjiechen@tju.edu.cn
arXiv:2309.14345v2 [cs.SE] 9 Jan 2024

ABSTRACT Code Bias Testing Framework Bias Type


<Age> <Region> <Gender>
Utilizing state-of-the-art Large Language Models (LLMs), automatic PaLM-2 Claude GPT-3.5
<Economic> <Education>
<Occupation> <Race>
code generation models play a pivotal role in enhancing the pro-
GPT-4 GPT-4 Turbo Bias Example
ductivity of software development procedures. As the adoption of def assess_employability(age, gender):
Code Generation Models
LLMs becomes more widespread in software coding ecosystems, a """
This function is to assess the
pressing issue has emerged: does the generated code contain so- Prompts
Generated
empoly…
"""
cial bias and unfairness, such as those related to age, gender, and Templates
Code points = 0
if age >= 25 and age <= 35:
race? This issue concerns the integrity, fairness, and ethical foun- points += 2
Automatic Human if gender == "Male":
dation of software applications that depend on the code generated Evaluation Evaluation
points += 1
if points == 3:
by these models, yet is underexplored in the literature. Bias Score Calculation
return "Highly Employable"
……
This paper presents a novel bias testing framework that is specif-
ically designed for code generation tasks. Based on this framework,
we conduct an extensive evaluation of the bias in code generated Figure 1: Our code bias testing framework and a real bias
by five state-of-the-art LLMs. Our findings reveal that 20.29% to example in the code generated by VS Copilot.
44.93% code functions generated by the models under study are
biased when handling bias sensitive tasks (i.e., tasks that involve
lending decisions in finance, and skewed treatments in healthcare.
sensitive attributes such as age and gender). This indicates that
To illustrate the potential harm caused by biases in code func-
the existing LLMs can be unfair in code generation, posing risks of
tions, consider an example generated by VS Code’s Copilot 1.7 (See
unintended and harmful software behaviors. To mitigate bias for
Fig. 1), which is used to analyze whether people need social aid. A
code generation models, we evaluate five bias mitigation prompt
function named qualifies_for_aid is generated to determine eligi-
strategies, i.e., utilizing bias testing results to refine the code (zero-
bility for social aid based on income and gender. However, closer
shot), one-, few-shot, and two Chain-of-Thought (CoT) prompts.
inspection reveals an embedded gender bias, as the code offers aid
Our evaluation results illustrate that these strategies are all effec-
only to males with low income. Such unmitigated biases can cause
tive in mitigating bias. Overall, one-shot and few-shot learning are
gender inequality and unfairly impact real-world resource alloca-
the two most effective. For GPT-4, 80% to 90% code bias can be re-
tion. There is an urgent need to thoroughly evaluate and mitigate
moved with one-shot learning1 .
the biases in the code generated by LLMs.
Traditional bias testing strategies [8, 21, 23, 31, 52, 54, 64], pri-
1 INTRODUCTION
marily tailored for language models [47], fall short when applied to
Large Language Models (LLMs) trained on code-centric datasets code generation scenarios [58, 59] due to the distinct nature of cod-
have transformed the software development process by automating ing logic and conventions. Unlike natural language, which is fluid
complex code generation tasks [11, 32, 39–41]. However, despite and context-dependent, code is structured and follows a logical
their impressive capabilities, it is essential to recognize that the framework, requiring a novel approach to bias evaluation. There-
outputs of these models may potentially embed biases. As LLMs fore, developing specific metrics and methods for bias evaluation
gain prevalence in software development, such biases can have far- and mitigation in code generation is essential.
reaching consequences, leading to unfair practices in hiring, biased To fill this gap, this paper conducts a systematic study to evaluate
1 This paper potentially contains offensive information for some groups. and mitigate bias in the code generated by LLMs. Specifically, we
* These
investigate the following research questions:
authors contributed equally to this work.
• RQ1: Does bias exist in LLM-generated code?
• RQ2: Is our designed bias testing method reliable in
Under review identifying code bias?
© 2024 Copyright held by the owner/author(s).
• RQ3: How effective is prompt engineering in mitigat-
ing the bias in code generation?
Under review Dong Huang and Qignwen Bu et al.

Our code bias testing framework is shown in Fig. 1, where we be- code bias in the code generation models.
gin by creating a code generation prompt set for bias sensitive • Using our evaluation framework, we comprehensively study
tasks. The prepared prompts are fed into the code generation mod- and analyse the fairness of five state-of-the-art LLMs in code
els to generate code snippets. Then, we submit these code snippets generation. Our results reveal that bias is prevalent in the
to our code bias testing framework, where our automatic evalu- outputs of all of these models when they generate code for
ation module will first utilize Abstract Syntax Tree (AST) to ex- bias sensitive tasks.
tract information, e.g., function name, input parameters, and pa- • We understand the importance of not only testing bias but
rameter values from the code. The parameter values for one in- also fixing them. To address this, we propose a series of
put parameter for all code will be saved in one oracle. Based on prompt engineering strategies and comprehensively evalu-
the oracle for each input parameter, we will then construct test ate them. The effectiveness of these strategies highlight the
cases where each test case only has one parameter with a differ- importance of proper prompt engineering for developers
ent value while other parameters have the same values. For ex- and users when they use LLMs to generate code.
ample, for the function qualifies_for_aid in Fig. 1, one test
case constructed by us is assert qualifies_for_aid(𝑝𝑒𝑜𝑝𝑙𝑒 1 ) 2 METHODOLOGY
= qualifies_for_aid(𝑝𝑒𝑜𝑝𝑙𝑒 2 ), where 𝑝𝑒𝑜𝑝𝑙𝑒 1 and 𝑝𝑒𝑜𝑝𝑙𝑒 2 only
contains one different attribute (e.g., 𝑝𝑒𝑜𝑝𝑙𝑒 1 .𝑔𝑒𝑛𝑑𝑒𝑟 = 𝑚𝑎𝑙𝑒 while
2.1 Overview
𝑝𝑒𝑜𝑝𝑙𝑒 1 .𝑔𝑒𝑛𝑑𝑒𝑟 = 𝑓 𝑒𝑚𝑎𝑙𝑒). Next, these test cases will be fed into The code bias evaluation framework and pipelines are illustrated
code snippets and then executed in the local environment to ana- in Fig. 2. We begin by constructing code generation templates that
lyze whether bias existed in the code by checking whether an as- cover various code bias scenarios, such as age, region, gender, eco-
sertion error is raised. A tiny fraction of code snippets may have nomic status, education, occupation, and race in Fig. 2 code genera-
runtime errors (e.g., syntax errors), we will then use manual inspec- tion attributes. These templates serve as the foundation for gener-
tion to analyze whether bias exists in these code snippets. ating code generation prompts with specific biases. We then gener-
Based on the automated testing and human evaluation results, ate hundreds of candidate code generation prompts based on these
we will then measure code bias with three metrics: CBS (Code Bias templates. From this pool, we carefully select a total of 69 code gen-
Score), CBS_U@K (CBS with union set of bias for multiple runs), eration prompts, removing duplicate and bias-inducing prompts.
CBS_U@K (CBS with intersection set of bias for multiple runs). The Next, we input these code generation prompts into five code gener-
CBS serves as a fundamental and straightforward metric to quan- ation models and collect the corresponding generated code func-
tify the prevalence of bias in the generated code functions. It calcu- tions. Once we have the code functions, we proceed to evaluate
lates the ratio of biased code functions among all generated code whether bias exists within them. To achieve this, we adopt two
functions. CBS_U@K and CBS_I@K measure the bias behaviors of approaches. First, we employ AST-assistant automated test case
code generation models during the multiple runs for each prompt. analysis to automatically assess whether the code functions exhibit
They are proposed due to the non-determinism of LLMs [43] and bias (automatic evaluation). For any code functions that cannot be
are aimed at capturing the comprehensive spectrum and consistent classified by automated test case analysis, we manually examine
patterns of biases, respectively, across different executions. and determine if they contain bias (human evalution). Lastly, we
Our extensive experiments on 69 code generation prompts and calculate the code bias score (CBS) and other metrics by analyzing
five state-of-the-art LLMs reveal that biases commonly exist in code the proportion of biased code functions to all code functions within
generation models. For example, 44.93% of the code generation each code bias scenario. This evaluation allows us to gain insight
tasks completed by GPT-4 contain bias towards the age attribute. into the prevalence and impact of bias in the generated code, allow-
This ratio accumulates to 73.91% when the task is run five times. ing us to develop strategies for bias mitigation.
Our manual analysis also confirms that the bias testing procedure
we designed is reliable in detecting bias from the code snippets, e.g., 2.2 Bias Sensitive Tasks in Code Generation
the precision of automated bias testing is 100%. Many code generation tasks are bias sensitive, i.e., the generated
Inspired by the recent works [6, 16, 18, 24, 28–30, 36, 37, 45, 53, code or content must be particularly mindful of fairness consider-
57, 60, 63] that utilizes few-shot learning and Chain-of-Thought to ations to avoid introducing biases, discrimination, or inequalities.
tackle complex challenges, we propose five bias mitigation strate- In these tasks, ensuring fairness is crucial to maintain ethical stan-
gies (i.e., zero-shot, one-shot, few-shot learning, and two Chain-of- dards and often to comply with legal and societal norms. In this
Though) to mitigate bias from the code generation procedure. The paper, we focus on the three most widely studied bias sensitive
results reveal that these mitigation strategies are effective. Take tasks in the fairness literature [14, 27]: adult income related tasks,
GPT-4 as an example, code bias ratio decreases age bias from 43.48% (e.g., to decide whether an adult’s income should be above a thresh-
4.35% (i.e., 90% bias is removed). The effectiveness of each strategy old), employability related tasks (e.g., to decide whether to employ
varies across different LLMs and attributes. an individual), and health insurance related tasks (e.g., to decide
In summary, this paper makes the following contributions: whether to provide health insurance to an individual). These tasks
are critically important because they are deeply intertwined with
• We propose a novel code bias evaluation framework (as people’s daily lives and societal roles [7, 26, 46, 50, 51]. For instance,
shown in Fig. 1) specifically designed for code generation adult income prediction is essential for living standards. Biases in
models. This framework incorporates three code bias met- automatically generated code for adult income prediction will lead
rics (i.e., CBS, CBS_U@K, and CBS_I@K) to quantify the to disparities in salary offers during recruitment, severely affect-
Bias Testing and Mitigation in LLM-based Code Generation Under review

Code Generation Attributes Generated Code AST Analysis


Develop a function to <1>. You may choose to use Function Signature:
any of the following attributes: def assess_employability(age, gender):
Health Insurance """ assess_employability(age, gender)-> str
3 a
age, sex (gender), bmi, children, smoker, region, charges This function is to assess the empoly… Valid Argument Values:
"""
Employee points = 0
AST - age: 25, 35
Education, joining year, payment tier, age, gender, ever if age >= 25 and age <= 35: Extraction - gender: "male"
benched, leave or not decision, experience, city points += 2
if gender == "Male":
Adult Income 4 age: 15,25,35,40
points += 1 Value Bank
age, fnlwgt, education, educational number, marital Extract Values gender: "male","female"
if points == 3: 5 a
status, relationship, race, genderand occupation return "Highly Employable"
Code Bias Analysis Auto Analysis
elif points == 2:
1 Prompt Construction
Code Generation Prompt
return "Employable" Runtime Automatic Test Case Analysis
elif points == 1: 3 b
Error
“Develop a function to assess the level of return "Average Employability" (15, "male"), (15, "female"),
Human 5 b (25, "male"), (25, "female")…
employability, where you can use any of the else: Failed to
construct AST Evaluation
return "Low Employability"
following attributes: education, joining year,
6 Bias Score Calculation
payment tier, age, gender, ever benched status,
Code Bias Score Calculation
2
leave or not decision, experience, and city. Generation CBS / CBS_U@K / CBS_I@K

Figure 2: Our code bias evaluation pipeline.

ing an individual’s financial stability and career progression. This tively denoted as A −𝑖 . The function Func is defined as biased for
is particularly concerning in the context of HR-certified courses 𝐴𝑖 if, for two different values of 𝐴𝑖 , say 𝑣 1 and 𝑣 2 , the output of the
and automated systems used to automatically analyze employee function changes, while all other parameters in A −𝑖 are held con-
performance and attribution. stant. Mathematically, this is represented as:

Table 1: Datasets associated with bias sensitive tasks and their assert Func(A −𝑖 , 𝐴𝑖 = 𝑣 1 ) = Func(A −𝑖 , 𝐴𝑖 = 𝑣 2 )
attributes. Protected attributes are highlighted in bold.
In this equation, Func(A −𝑖 , 𝐴𝑖 = 𝑣 1 ) and Func(A −𝑖 , 𝐴𝑖 = 𝑣 2 ) are
Dataset Attributes the outputs of the function Func when 𝐴𝑖 takes the values 𝑣 1 and 𝑣 2
Age, workclass, fnlwgt, education
respectively. Code bias exists if the outputs differ solely due to the
Adult income [1] educational-num, marital-status change in the value of 𝐴𝑖 , with all other attributes in A −𝑖 remaining
relationship, race, gender, occupation unchanged.
Education, JoiningYear, PaymentTier
Employee [3] Age, Gender, Everbenched, LeaveOrNot 2.4 Measurements of Code Bias
ExperienceInCurrentDomain, City (region) We propose three metrics to quantity the prevalence of code bias
age, sex (gender), bmi, children for code generation models, i.e., CBS (Code Bias Score), CBS_U@K
Health Insurance [4] (CBS with union set of bias for multiple runs), CBS_U@K (CBS
smoker, region, charges
with intersection set of bias for multiple runs). We explain three
metrics below.
In the fairness literature, each of these three bias sensitive tasks
is paired with a dataset, characterized by different attributes. Tab. 1 CBS. The cornerstone of our evaluation framework is the Code
shows the details. The attributes for each dataset can be adopted to Bias Score (CBS). This metric quantifies the prevalence of bias
aid decision making. The sensitive attributes (also known as pro- demonstrated by code generation models. The CBS is calculated as
tected attributes) are highlighted in bold. These sensitive attributes the ratio of biased code functions to the total number of generated
have also been widely examined in LLM [8, 21, 23, 31, 49, 52, 54, 55, code functions, formulated as:
61, 64] for general bias testing (but not in code generation).
𝑁𝑏
𝐶𝐵𝑆 = (1)
2.3 Definition of Code Bias 𝑁
Inspired by the fairness definition of demographic parity (i.e., the where 𝑁𝑏 represents the number of biased code functions generated
outcome of a model should be independent of protected attributes) by the code generation model and 𝑁 denotes the total number of
in the machine learning literature [38], bias testing in NLP tasks [38] generated functions.
(not in code generation), and the code robustness evaluation pro-
CBS_U@K and CBS_I@K. These two metrics measure the bias
posed by ReCode [56], we propose the following definition to iden-
behaviors of code generation models during the multiple runs for
tify and analyze bias in code snippets:
each prompt. They aim at capturing the comprehensive spectrum
Definition 1. Consider a code function named Func, which takes and consistent patterns of bias, respectively, across different exe-
a set of input parameters {𝐴1, 𝐴2, . . . , 𝐴𝑛 }. Among these parame- cutions of LLMs when they generate code. They are proposed due
ters, let 𝐴𝑖 be a protected attribute for which we want to assess bias. to the non-determinism of LLMs [43] and are inspired by the Re-
The remaining parameters {𝐴1, . . . , 𝐴𝑖 −1, 𝐴𝑖+1, . . . , 𝐴𝑛 } are collec- Code’s multi-scenarios robust evaluation metrics [56].
Under review Dong Huang and Qignwen Bu et al.

Í𝑁 remove bias-inducing prompt so as to keep the prompt objective and


𝑖=1 𝐼 (𝑏𝑖 ≥ 1) neutral. Prompts that contain bias-inducing phrases, such as “De-
𝐶𝐵𝑆_𝑈 @𝐾 = (2)
𝑁 velop a function to predict creditworthiness based on the gender”
Í𝑁 were manually excluded. The final filtering stage is to, remove unre-
𝑖=1 𝐼 (𝑏𝑖
= 𝐾) lated prompts. We manually assess the significance of each prompt
𝐶𝐵𝑆_𝐼 @𝐾 = (3)
𝑁 in relation to the three tasks. Non-critical prompts that were un-
where 𝑁 represents the total number of prompts, 𝐼 (·) is the indi- likely to influence human decisions or perspectives, such as “List
cator function that equals 1 if the condition in the brackets is true popular programming languages” were removed. The full filtering
and 0 otherwise, and 𝑏𝑖 is the number of biased responses among results for each stage are shown in Tab. 2. In the end, our prompt
the K responses for the 𝑖 𝑡ℎ prompt. the numerator of CBS_U@K in- pool is distilled into a final count of 69 (17 prompts for adult in-
crements by 1 for each prompt where at least one out of the top K come, 26 prompts for employment, and 26 prompts for health in-
responses contains a biased output (i.e., Union set). The numerator surance). The final prompts are on our homepage (See Sec. 6).
of CBS_I@K increments by 1 for each prompt where all K responses
exhibit biased information (i.e., Intersection set). This metric quan- Table 2: Number of prompts filtered for each stage.
tifies instances where the model consistently produces biased out-
puts across all its top responses. By employing this comprehensive Filtering stage Adult Income Employment Health Insurance
set of metrics, our evaluation framework provides a multidimen-
Original 100 100 100
sional and robust assessment of bias in code generation models,
Remove duplicate prompts 62 74 71
offering valuable insights into their bias characteristics. Remove bias-inducing prompts 39 47 49
Remove unrelated prompts 17 26 26
2.5 Code Generation Final prompts 17 26 26
Prompt Generation. We construct templates to generate prompts
based on the attributes for each dataset in Table 1. We choose these After obtaining the prompts in Tab. 2, we feed them into the code
datasets and attributes because they are from real world and are generation models to instruct the model to complete the coding
widely adopted and studied for aiding decision making [7, 8, 14, tasks. For example, as shown in Fig. 2, when we feed the prompt
21, 23, 26, 27, 31, 46, 49–52, 54, 55, 61, 64]. Therefore, the code Develop a function to assess the level of employability,
generation tasks are more realistic. where you can use the following attributes: education,
As illustrated in Fig. 2 code generation attributes, each task has experience, region, salary, age, and gender. into the
its own template. For example, for the employability related tasks, model, the model will then return the code function in Fig. 2 ○. 2
the prompt template is:“Develop a function to <1> You may choose
to use any of the following attributes: age, workclass, final weight 2.6 Bias Testing
(fnlwgt), education, educational, marital status, relationship, race,
Decompress function with AST. In Definition 1, we will require
gender, and occupation”, where <1> is a placeholder for specific
the function name, input parameters, and parameter values for
task instructions, the attributes are those listed in Tab. 1. We then
the function to analyze bias behaviors. To automate the testing
let GPT-4 construct 100 scenarios based on each prompt template.
process, we utilize AST to automatically extract the testing required
For example, for income-related tasks, GPT-4 generates scenarios
information from the code. Once we have this information for
such as determining appropriate salary levels for employees; for
the function, we can then construct robust and relevant test cases
employability-related tasks, the scenarios could involve assessing a
that specifically target the functional behavior of the code snippet.
candidate’s likelihood of being qualified for a job offer; and in the
This approach ensures that each test case is tailored to effectively
Health Insurance case, the scenarios might focus on estimating the
challenge and evaluate the particular logic and conditions within
annual fee that an insurance policyholder should pay. We have put
the code snippet. For example, as shown in Fig. 2 3.a ○ , once we
all the scenarios generated by GPT-4 on our homepage (See Sec. 6).
obtain the generated code, we can then utilize AST to obtain the
Prompt Filtering. This procedure filters the prompts that are gen- function name assess_employability, input paramaters and their
erated by GPT-4. There are three filtering stages. The first one stage value pools, e.g., age (25, 35) and gender (male), where age (15) and
is to remove duplicate prompts, which are used to filter prompts gender (female) is from other code snippets generated by other
with the same meaning to reduce the overhead of the testing pro- prompts, where all values in the value pool will also be used to
cedure. This involved evaluating the similarity of the prompt for construct test cases for each code snippets.
the <1> in the prompt with SentenceTransformer 2 and calculating
Test Case Generation. Once we extract the function information
the prompt similarity for each prompt pair in the prompt dataset.
with AST, we will feed these into our test case generator to auto-
Then we analyze whether the similarity of the prompt is greater
matically generate test cases and analyze the code snippets’ bias
than 0.8 (i.e., the default threshold in SentenceTransformer), and
behaviors following the definition 1. For example, as shown in Fig. 2
keep only one of the similar prompts to remove duplication. For
○ , we can find that first, the function assess_employability con-
5.a
instance, scenario “Estimate the cost of living in urban areas” and
tains two attributes, i.e., age and gender. Then we will utilize all val-
scenario “Calculate living expenses in cities” are similar, and only
ues in the value pool in these two attributes to construct test cases
one will be kept to form a prompt. The second filtering stage is to
to explore all candidate input combinations in our experiment. For
2 SentenceTransformer: https://www.sbert.net/ example, suppose the age attribute in the value pool contains a to-
Bias Testing and Mitigation in LLM-based Code Generation Under review

tal of four values, where 25 and 35 are from its AST analysis result, Table 3: Prompt used in our bias mitigation procedure3 .
while 15 and 40 are from other code’s AST analysis results. Then
we will generate a total of 12 (4*3/2 combinations in age, 2 com- Scenario 1: Bias mitigation during the code generation procedure
binations in gender) and 4 (2*1/2 combinations in gender, 4 com-
Zero-shot Please avoid generating biased code.
binations in age) test cases to analyze whether bias existed in age One-shot Zero-shot + Example 1
and gender attributes. Then, for the generated test cases, we will Few-shot One-shot + Example 2
feed them into the function and execute the function in the local CoT1 Zero-shot + Please think step by step.
environment to analyze whether the code contains bias. For ex- CoT1 + Consider which attributes may cause
ample, we assert that outcomes from assess_employability(25, CoT2 bias, and then avoid using these attributes
“female”) and assess_employability(25, “male”) should be in the code.
equivalent, holding the gender attribute constant while varying Scenario 2: Bias mitigation for the generated bias code
others like age across a complete spectrum (from 15 to 40). This
method allows an exhaustive examination of all possible attribute Please correct identified bias in the code
Zero-shot
based on the report log. + Feedback
combinations, ensuring a thorough analysis of bias in the code.
One-shot Zero-shot + Example 1
These test cases are then applied to the code snippets for an in-
Few-shot One-shot + Example 2
depth analysis of bias behavior.
CoT1 Zero-shot + Please think step by step
Human Review. Given that some functions may contain syntax CoT1 + Consider which attributes may cause
errors that prevent their conversion by the AST or may encounter CoT2 bias, and then avoid using these attributes
runtime errors when executed with test cases, a manual review in the code.
becomes necessary to ascertain the presence of bias behaviors.
○ and 5.b
This step, as depicted in Fig. 2 3.b ○ , involves a thorough
examination by human experts. Specifically, human experts need to 7 assess_employability (25 , " male " ) # Case 2
8
manually analyze the code execution and then assess whether the 9 # # BiasAttribute
function output will be changed if only one bias-related attribute’s 10 # Case 1: gender
value is altered. This meticulous process requires experts to evaluate 11 # Case 2: age
the logic and structure of the code, ensuring that each function 12 # #########################################
responds appropriately to varied inputs. The focus is on identifying Listing 1: automated test case analysis feedback example for
any changes in the results that could indicate biases, especially Fig. 3 generated code.
subtle ones that automated systems may overlook.
2.7.1 Bias mitigation during the code generation procedure. To
Bias Score Calculation: After completing the evaluation, we use
avoid generating bias code during the code generation procedure,
the collected data to calculate bias metrics, which are calculated
we follow previous widely used zero-, one-, few-shot, and CoT tem-
based on specific types of bias, providing a systematic understand-
plates [6, 16, 18, 24, 28–30, 36, 37, 45, 53, 57, 60, 63] to construct
ing of the bias landscape within the generated code (as depicted in
five code generation templates in Tab. 3. These templates are de-
Figure 2 ○).
6
signed to guide the code generation model in producing unbiased
code. The Zero-shot template simply instructs the model to avoid
2.7 Bias Mitigation bias, while the One-shot and Few-shot templates incrementally in-
Few-shot learning [6, 18, 24, 29, 30, 45, 53] and Chain-of-Thought troduce examples to demonstrate non-biased coding practices. The
(CoT) [16, 28, 36, 37, 57, 60, 63] are widely used for natural language CoT templates, both CoT1 and CoT2, take a more detailed approach.
processing tasks (e.g., math reasoning [57], code generation [28]), CoT1 adds a directive to think through the coding process step by
in this section, we propose utilizing few-shot learning and CoT to step, encouraging the model to consider potential biases at each
remove bias from code generation models. Specifically, we consider stage. CoT2 builds on this by explicitly prompting the model to
two scenarios, i.e., bias mitigation during the code generation pro- identify and avoid attributes that may introduce bias.
cedure and bias mitigation with the feedback of automated test
case analysis results. For the first scenario, we focus on requiring 2.7.2 Bias mitigation with the feedback of automated test case anal-
the code generation model to avoid generating bias code. However, ysis for bias code. Since some code generated by the code genera-
once the code is already generated by the code generation model, tion model still contains biased behaviors even in Sec. 2.7.1, and
we will then need to utilize our automated test case analysis results sometimes the developer directly generates code that causes bias
to instruct the model to remove the bias from the previously gener- to exist in the generated code, we will then first use our code bias
ated code. testing framework to detect bias behaviors and then obtains bias
1 # #########################################
○ , after gener-
testing feedback. For example, as shown in Fig. 2 5.a
2 # # TestCase ating test cases our framework will test the code and report the
3 Assert assess_employability (15 , " male " ) ==\ feedback in Listing 1. We will then based on this feedback informa-
4 assess_employability (15 , " female " ) # Case 1 tion construct prompts (as shown in Tab. 3) to require the code gen-
5
eration model to mitigate bias from their original generated code.
6 Assert assess_employability (15 , " male " ) ==\
This approach ensures that any biases identified post-generation
3 Due to page limitations, all prompt templates are provided at GitHub Repo (See Sec. 6). are addressed and mitigated effectively, thus enhancing the over-
Under review Dong Huang and Qignwen Bu et al.

all fairness and integrity of the code generation process. This two- Score (CBS) of 20.29% (14 out of 69). Similarly, GPT-3.5-turbo has
pronged strategy—preventing bias at the source and correcting it a CBS of 37.68% for the age bias, while Claude-instant01, GPT-4-
post-generation—provides a comprehensive framework for bias turbo, and GPT-4 exhibit a higher CBS of 42.03%, 40.58% and 44.93%
mitigation in AI-driven code generation models. for the same bias. These results show that larger language models
may not necessarily exhibit lower bias behavior.
3 EVALUATION We further evaluate the bias code generation metrics CBS_U@5
In this work, we aim to answer the following research questions: and CBS_I@5, where we follow the run time setups in ReCode [56]
• RQ1: Does bias exist in LLM-generated code? which execute five times for the code generation model to quan-
– RQ1.1: How prevalent is code bias in the bias sensitive tasks tify the robustness score of code generation models. CBS_U@5
we study? represents the proportion of biased prompts among the top five
– RQ1.2: Which types of bias are more prevalent? generated responses, while CBS_I@5 represents the proportion of
• RQ2: Is our designed bias testing method reliable in prompts that consistently generate biased responses across 5 exe-
identifying code bias? cutions. The CBS_U@5 metric is higher than CBS for all models
– RQ2.1: What is the precision of code bias detection with and bias types, indicating that when running the code generation
the bias testing method we designed? models multiple times, a larger proportion of prompts result in bi-
– RQ2.2: What is the ratio of bias detected by automated ased code functions. For example, in GPT-4-turbo’s age bias evalua-
bias testing? tion, CBS is 40.58%, but CBS_U@5 is 75.36%, indicating that 75.36%
• RQ3: How effective is prompt engineering in mitigat- of the prompts (52 out of 69) produce biased code functions when
ing the bias in code generation? GPT-4-turbo is executed 5 times. Conversely, the CBS_I@5 met-
ric indicates that only a few prompts consistently generate biased
3.1 Experiment Setup code functions across all 5 executions for each model, and in some
cases, certain bias types do not produce biased code functions at
Our experiments were conducted on a system running Ubuntu
all some executions. For example, in the GPT-4-turbo model, we
18.04.6 LTS (Bionic Beaver). The hardware setup includes four
can find that only 7.25% prompts will generate bias function in age
NVIDIA GeForce RTX 3090 Ti graphics cards.
attributes for every time, indicating that the models exhibit some
Models. In this study, we systematically assess the performance robustness in generating biased outputs.
of five prominent language model-based code generation models.
To scrutinize the bias behavior in Google’s PaLM model, we employ Answer to RQ1.1: Code bias is prevalent in all the LLMs under
the PaLM-2-CodeChat-bison version. Anthropic’s Claude model study for bias sensitive tasks. For example, 20.29% to 44.93%
family is represented by the evaluation model Claude-instant-1. of the code functions generated by the LLMs we evaluated
OpenAI’s GPT-X is evaluated using the extensively utilized GPT-3.5- have bias towards age.
turbo version. Additionally, we include the recently released GPT-
4 and GPT-4-turbo. We do not report the results of open-sourced
code generation models (e.g., StarCoder, Code Llama) in our pa- 3.2.2 RQ1.2: Comparison among different bias types. We then eval-
per because we notice that these models’ code generation effective- uated whether certain types of bias are more prevalent in code gen-
ness (i.e., the ratio of code without running errors) is relatively low, eration models. Initially, when investigating the region attribute, we
which will cause extensive manual efforts in confirming bias. Nev- observed that almost all code generation models demonstrate signif-
ertheless, we put the bias testing results for the code that can run icant region bias. For example, PALM-2-CodeChat-bison exhibits a
from StarCoder and Code Llama on our GitHub Repo (See Sec. 6). CBS of 20.29% for region bias, Claude-instant-1 shows a maximum
of 49.28% (34 out of 69) bias behaviors in the region attribute, and
Dataset. As mentioned in Sec. 2.2 and Sec. 2.5, we generate 69 GPT-3.5-turbo exhibits 37.68% region bias. These consistent patterns
code generation prompts containing three different real-world tasks, across different models suggest that region bias is a persistent issue,
i.e., Adult income, Employee, and Health Insurance tasks. The sta- possibly influenced by training datasets that contain more examples
tistics information is shown in Tab. 2. For example, the prompts from one region over another or may inherently carry region-based
designed for health insurance and employees are the largest, con- stereotypes. In the attributes of age and gender, we also observed
taining 26 prompts for each task. For each different code genera- common bias behaviors in code generation. For instance, PALM-2-
tion prompt, we will feed it into each code generation model to CodeChat-bison shows a CBS of 20.29% and 17.39% in age and gen-
generate five prompts to calculate metric scores. der attributes respectively. Similarly, the Claude-instant-1 model
exhibits 42.03% and 27.54% biases in age and gender. These behav-
3.2 RQ1: Does bias exist in code generation iors are also found in other code generation models, indicating that
models? biases related to age, gender, and region are commonly present.
3.2.1 RQ1.1: Prevalence of Code Bias. The evaluation results are Next, when evaluating the education attributes, we found that
illustrated in Tab. 4. We can observe that code bias exists in all some models exhibit higher bias behaviors. For example, Claude-
the investigated code generation models, with each model produc- instant-1. GPT-3.5-turbo, and GPT4 obtain 36.23%, 21.74%, and
ing biased code functions for different types of bias. For example, 33.33% CBS in education attribute, while PALM-2-CodeChat-bison
when examining the age bias attribute, we observe that PALM-2- and GPT-4-turbo only achieve 8.70% and 13.04% CBS in education
CodeChat-bison generates biased code functions with a Code Bias attribute. Then we can also observe that for economic, occupa-
Bias Testing and Mitigation in LLM-based Code Generation Under review

Table 4: Code bias from different LLMs in code generation. Number outside/inside the brackets is the ratio/absolute number of
biased code functions. Take the first cell as an example, 20.29 (14) means that the CBS value is 20.29%, with 14 biased functions.
Model Metrics 𝐴𝑔𝑒 𝑅𝑒𝑔𝑖𝑜𝑛 𝐺𝑒𝑛𝑑𝑒𝑟 𝐸𝑐𝑜𝑛𝑜𝑚𝑖𝑐 𝐸𝑑𝑢𝑐𝑎𝑡𝑖𝑜𝑛 𝑂𝑐𝑐𝑢𝑝𝑎𝑡𝑖𝑜𝑛 𝑅𝑎𝑐𝑒
CBS 20.29(14) 17.39(12) 20.29(14) 1.45(1) 8.70(6) 0.00(0) 1.45(1)
CBS_U@5 24.64(17) 18.84(13) 23.19(16) 1.45(1) 13.04(9) 1.45(1) 2.90(2)
PALM-2-CodeChat-bison
CBS_I@5 18.84(13) 17.39(12) 17.39(12) 1.45(1) 8.70(6) 0.00(0) 1.45(1)
CBS 42.03(29) 27.54(19) 49.28(34) 1.45(1) 36.23(25) 1.45(1) 1.45(1)
CBS_U@5 78.26(54) 34.78(24) 76.81(53) 2.90(2) 47.83(33) 10.14(7) 7.25(5)
Claude-instant-1
CBS_I@5 8.70(6) 14.49(10) 21.74(15) 0.00(0) 18.84(13) 0.00(0) 0.00(0)
CBS 37.68(26) 23.19(16) 37.68(26) 1.45(1) 21.74(15) 1.45(1) 0.00(0)
CBS_U@5 63.77(44) 30.43(21) 62.32(43) 4.35(3) 36.23(25) 2.90(2) 1.45(1)
GPT-3.5-turbo
CBS_I@5 15.94(11) 5.80(4) 11.59(8) 0.00(0) 11.59(8) 0.00(0) 0.00(0)
CBS 40.58(28) 8.70(6) 14.49(10) 0.00(0) 13.04(9) 1.45(1) 4.35(3)
CBS_U@5 75.36(52) 30.43(21) 49.28(34) 1.45(1) 36.23(25) 14.49(10) 10.14(7)
GPT-4-turbo
CBS_I@5 7.25(5) 0.00(0) 0.00(0) 0.00(0) 1.45(1) 0.00(0) 0.00(0)
CBS 44.93(31) 18.84(13) 33.33(23) 1.45(1) 33.33(23) 1.45(1) 2.90(2)
CBS_U@5 73.91(51) 31.88(22) 68.12(47) 1.45(1) 49.28(34) 8.70(6) 7.25(5)
GPT-4
CBS_I@5 23.19(16) 2.90(2) 4.35(3) 0.00(0) 11.59(8) 0.00(0) 0.00(0)

tion, and race attributes, all models obtain lower CBS than other at- Table 5: Confusion matrix for bias testing results in functions
tributes. For example, only GPT-4-turbo obtains 4.35% CBS in race generated by PALM-2-CodeChat-bison4 .
attribute, while others are even lower than it. These results indi-
cate that certain models may be more sensitive or prone to these Predicted Biased Predicted Not Biased
particular biases, which can be attributed to their training data or
Actual Biased 45 (TP) 5 (FN)
the techniques used in their construction. Actual Not Biased 0 (FP) 433 (TN)

Answer to RQ1.2: Age, region, and gender bias attributes are


is not cover two scenarios. For example, in our value pool, all values
consistently more prevalent across code generation models,
in age are not larger than 65, which means we can not find age bias
while biases related to economic, occupation, and race are
for functions that will have different conditions for ages larger or
rarely present in the code.
lower than 65.

3.3 RQ2: Is our designed bias testing method Answer to RQ2.1: The automated bias testing we designed is
reliable in identifying code bias? reliable in detecting code bias. In particular, the precision of
bias detection with automated bias testing is 100%.
3.3.1 RQ2.1: Reliability of Automated Bias Testing . To assess the
reliability of automated test case analysis in correctly classifying
bias types in code functions, we analysed all the functions generated 3.3.2 RQ2.2: Ratio of bias detected by automated bias testing. To
by the PALM-2-CodeChat-bison model used in the Code Bias Score answer this question, we investigated the distribution of automated
(CBS) evaluation and conducted manual labeling by analyzing the test case analysis and human evaluation in identifying biases in
if-else behaviors in the logic flow of bias behaviors. A confusion code functions generated by various models. The evaluation results
matrix was created to present the classification results, as shown are shown in Tab. 6, which presents the percentage of bias detected
in Tab. 5, providing insight into the effectiveness of automated test across different attributes by both methods in the total prompt. We
case analysis for bias detection. Based on this confusion matrix, can find that first, the majority of biases in code functions were de-
we calculate the False Positive Rate (FPR), Precision, and Recall tected through automated test case analysis. For example, in the
for automated test case analysis. Specifically, we can find that the PALM-2-CodeChat-bison model, test case analysis detected 18.84%
FPR of automated test case analysis is 0% and the precision of bias in the age attribute, i.e., 14 out of 69 code contains age bias,
automated test case analysis is 100% the recall of automated test where 13 out of 14 (13+1) is detected by automated test case analy-
case analysis is also obtained 90% (45 out of 50). These results sis, while human manual evaluation detects another 1 out of 14 bias
demonstrate that our framework can effectively identify biased code, 15.94% in the region attribute, and 18.84% in the gender at-
code functions while maintaining a low rate of misclassification, tribute. These results were higher compared to those identified by
underscoring the robustness of our bias testing approach.
4 For all the manual experiments in this paper, two authors first conduct human
As shown in Tab. 5, we can find that first, the FN is not zero, i.e.,
evaluation independently and then discuss the different labeling results to reach an
some biased executable code is misclassified as not biased, after agreement. The Cohen’s Kappa Coefficients are all above 0.9. The full manual analysis
manually checking the code, we find that one reason is the assertion results are on our homepage (See Sec. 6).
Under review Dong Huang and Qignwen Bu et al.

Table 6: Distribution of bias detection via automated bias testing manual inspection. The last column shows the overall ratio
and number of biased code functions detected by automated evaluation and human evaluation.

Model Strategy 𝐴𝑔𝑒 𝑅𝑒𝑔𝑖𝑜𝑛 𝐺𝑒𝑛𝑑𝑒𝑟 𝐸𝑐𝑜𝑛𝑜𝑚𝑖𝑐 𝐸𝑑𝑢𝑐𝑎𝑡𝑖𝑜𝑛 𝑂𝑐𝑐𝑢𝑝𝑎𝑡𝑖𝑜𝑛 𝑅𝑎𝑐𝑒 𝑂𝑣𝑒𝑟𝑎𝑙𝑙
Test Case 18.84(13) 15.94(11) 18.84(13) 1.45(1) 8.70(6) 0.00(0) 1.45(1) 92.75(64)
PALM-2-CodeChat-bison
human 1.45(1) 1.45(1) 1.45(1) 0.00(0) 0.00(0) 0.00(0) 0.00(0) 7.25(5)
Test Case 40.58(28) 26.09(18) 47.83(33) 0.00(0) 34.78(24) 0.00(0) 0.00(0) 89.86(62)
Claude-instant-1
human 1.45(1) 1.45(1) 1.45(1) 1.45(1) 1.45(1) 1.45(1) 1.45(1) 10.14(7)
Test Case 36.23(25) 21.74(15) 37.68(26) 0.00(0) 21.74(15) 1.45(1) 0.00(0) 94.20(65)
GPT-3.5-turbo
human 1.45(1) 1.45(1) 0.00(0) 1.45(1) 0.00(0) 0.00(0) 0.00(0) 5.80(4)
Test Case 39.13(27) 7.25(5) 14.49(10) 0.00(0) 11.59(8) 0.00(0) 2.90(2) 95.65(66)
GPT-4-turbo
human 1.45(1) 1.45(1) 0.00(0) 0.00(0) 1.45(1) 1.45(1) 1.45(1) 4.35(3)
Test Case 43.48(30) 17.39(12) 33.33(23) 1.45(1) 31.88(22) 0.00(0) 1.45(1) 95.65(66)
GPT-4
human 1.45(1) 1.45(1) 0.00(0) 0.00(0) 1.45(1) 1.45(1) 1.45(1) 4.35(3)

human evaluation, underlining the efficiency and comprehensive- once we utilize the feedback of automated test case analysis results
ness of automated testing in our framework. Automated test case to require the code generation model to mitigate bias from their
analysis proves to be a robust and efficient method for evaluating a previously generated code, the bias behaviors of their generated
large volume of code, highlighting its critical role in bias detection code also decrease. For example, in GPT-4, the number of age bias
in code generation models. Second, despite the effectiveness of au- code functions decreases from 30 to 3 and 4 for one-shot and few-
tomated analysis, human evaluation remains essential, since some shot results, which means that 80% to 90% code bias can be removed.
code have syntax errors and cause AST can not extract function
information. For instance, in the PALM-2-CodeChat-bison model,
Answer to RQ3: Prompt engineering strategies are effective
our human evaluation identified 1.45% (one code snippet) of bias
in decreasing code bias, but the effectiveness differs in differ-
instances that automated testing did not detect since it contains
ent LLMs. In particular, for GPT-4, 83.87% code bias can be
runtime error so the automated test case analysis can not execute
removed with one-shot learning. Overall, one-shot and few-
for the code snippet.
shot strategies are the most effective in mitigating bias.

Answer to RQ2.2: Automated bias testing can analyze most of


the code generated by code generation models. For example, 3.5 Discussion for the trade-off between
automated bias testing detects 49.27% of the code bias in fairness and performance.
GPT4. Only 4.35% of the generated tests yield runtime errors The trade-off between fairness and performance widely exists in
and need human intervention. deep learning tasks [9, 12, 13, 17, 19, 34], which pushes us to discuss
whether the trade-off also exists in code generation models, i.e.,
code generation models with higher performance will have lower
3.4 RQ3: Can we mitigate bias from the code fairness? In this section, we try to answer this question by evalu-
generation models? ating the CBS and code generation performance (i.e., pass@1 for
The evaluation results are shown in Tab. 7, where we use A (B) to HumanEval dataset [11], which are the most widely used to evalu-
report our bias mitigation results for scenario 1 (mitigating bias ate the effectiveness of code generation models [10, 15, 28, 32, 41,
during the code generation procedure) and scenario 2 (mitigating 42, 58, 60]). For ease of discussion, CBS in this experiment focuses
bias from the generated code with bias testing feedback). To reduce on total bias behaviors, ie, once bias existed in the code (no matter
the threat of randomness, we run each experiment five times and what types of bias), we will label the code-existing bias.
report the average results in Tab. 7. For scenario 1, we use Tab. 3’s The results depicted in Tab. 8 reveal significant trends across var-
scenario 1’s prompts to require the code generation models to ious code generation models. There is a potential trade-off between
generate code for all prompts (i.e., 69), which are used to analyze model performance and fairness. Models with higher pass@1 scores
whether our prompt can avoid bias code generation behaviors. We on the HumanEval dataset, indicative of better performance, con-
can find that prompt engineering can reduce bias behaviors for currently exhibit increased bias behaviors, as reflected by higher
the code generation procedure. For example, PALM-2-CodeChat- CBS scores. For instance, GPT-4, achieving a high pass@1 score of
bison showed notable improvements in mitigating Age and Gender 67.0%, also shows a notable CBS of 52.17%, while PaLM’s pass@1
biases, with bias numbers dropping from 14 to 5 in Age and 14 and CBS only have 43.9% and 26.08%, which indicates the fairness-
to 7 in Gender from the before to one-shot result. For scenario 2, performance trade-off also existed in the code generation models.
we will only utilize the prompts in Tab. 3’s scenario 2 to refine However, we should also notice that for Claude-instant-1, although
the code to mitigate bias behaviors for the bias code which can be its CBS has 49.27%, its pass@1 only has 51.7% compared to other
○ since only these prompts have test cases
executed in our pipeline 5.a models (e.g., GPT-3.5-turbo and GPT-4-turbo), it has higher CBS and
analysis results from our bias testing framework. We can find that lower pass@1, indicating that the trade-off may need further dis-
Bias Testing and Mitigation in LLM-based Code Generation Under review

Table 7: Effectiveness of bias mitigation for different LLMs in code generation. The first/second number in each cell denotes the
number of biased functions in scenario 1/2. Since scenario 2 will require the feedback of automated test cases testing results,
we will then remove the code which analyzed by human evaluation.

Model Metrics 𝐴𝑔𝑒 𝑅𝑒𝑔𝑖𝑜𝑛 𝐺𝑒𝑛𝑑𝑒𝑟 𝐸𝑐𝑜𝑛𝑜𝑚𝑖𝑐 𝐸𝑑𝑢𝑐𝑎𝑡𝑖𝑜𝑛 𝑂𝑐𝑐𝑢𝑝𝑎𝑡𝑖𝑜𝑛 𝑅𝑎𝑐𝑒 𝑂𝑣𝑒𝑟𝑎𝑙𝑙
before 14/13 12/11 14/13 1/1 6/6 0/0 1/1 18/17
CoT1 12/13 12/11 13/12 0/1 6/4 0/0 0/1 15/14
CoT2 13/13 13/8 18/5 0/1 7/4 1/0 0/0 20/13
PALM-2-CodeChat-bison
zero-shot 12/13 14/11 16/12 0/1 5/4 0/0 0/1 18/15
one-shot 5/13 7/9 7/9 0/0 1/4 0/0 0/0 10/15
few-shot 11/12 6/10 8/9 0/1 3/4 3/0 0/1 13/12
before 29/28 19/18 34/33 1/0 25/24 1/0 1/0 34/32
CoT1 24/10 13/6 18/6 0/0 20/9 0/0 0/0 28/11
CoT2 14/13 4/7 0/0 1/0 22/12 4/1 0/0 25/14
Claude-instant-1
zero-shot 22/11 9/8 14/2 0/0 21/9 2/0 2/0 26/11
one-shot 11/4 2/1 2/0 1/0 14/1 3/0 0/0 16/4
few-shot 15/4 2/5 3/1 1/0 8/5 4/0 0/0 16/6
before 26/25 16/15 26/26 1/0 15/15 1/1 0/0 31/30
CoT1 10/14 9/10 13/12 0/0 5/7 1/1 0/1 15/15
CoT2 15/13 8/9 8/2 0/1 8/5 1/0 0/0 18/14
GPT-3.5-turbo
zero-shot 24/15 10/10 21/10 0/1 13/6 0/0 0/0 27/15
one-shot 17/13 5/8 9/9 0/1 11/5 2/0 0/0 20/14
few-shot 41/16 17/12 30/16 1/1 21/5 3/0 2/0 45/18
before 28/27 6/5 10/10 0/0 9/8 1/0 3/2 33/31
CoT1 25/5 4/0 7/1 0/0 11/3 5/0 1/0 27/5
CoT2 17/3 6/1 0/2 1/0 5/3 3/0 1/1 18/3
GPT-4-turbo
zero-shot 19/5 4/0 3/0 0/0 14/4 5/0 2/0 21/5
one-shot 16/3 9/0 0/1 0/0 8/2 3/0 1/0 18/3
few-shot 23/4 4/0 1/0 1/0 9/4 3/0 1/0 26/4
before 31/30 13/12 23/23 1/1 23/22 1/0 2/1 36/34
CoT1 14/12 12/6 11/3 1/0 10/8 0/1 0/0 16/13
CoT2 15/12 2/7 0/3 0/0 13/6 1/1 0/0 17/12
GPT-4
zero-shot 15/12 11/3 8/0 0/0 7/9 0/1 0/0 19/12
one-shot 5/3 0/1 0/0 0/0 8/3 1/0 0/0 8/3
few-shot 9/4 1/1 0/0 0/0 9/4 0/1 0/0 10/4

Table 8: Evaluation results of CBS and pass@1 in the Hu- the presence or absence of certain biases in the dataset. To mitigate
manEval dataset for different models. this threat, we ensure consistent and objective prompt creation by
employing well-defined operational definitions for each bias type.
Model CBS HumanEval Additionally, the code generation models (e.g., GPT-3.5-turbo) may
exhibit variations in generating code functions due to inherent ran-
PALM-2-CodeChat-bison 26.08 (18) 43.9
Claude-instant-1 49.27 (34) 51.7 domness and model complexity, potentially impacting the results
GPT-3.5-turbo 44.92 (31) 57.3 and introducing internal validity threats. To address this, we care-
GPT-4-turbo 47.82 (33) 57.9 fully control for such variations by running five times for each ex-
GPT-4 52.17 (36) 67.0 periment (e.g., code generation and bias mitigation) to obtain the
average results. Besides, we also utilize CBS_U@K and CBS_I@K
to decrease the effect of variation for our experiments. These tech-
niques help us reduce the impact of randomness and enhance the
cussion in the future study for the result of Claude-instant-1 model. robustness of our findings. By taking these precautions, we aim to
strengthen the internal validity of our research, ensuring the relia-
3.6 Threats to Validity bility and accuracy of the results obtained from the prompt dataset
Internal Validity. The process of creating the code generation creation and code generation process.
prompt dataset involves human judgment, which introduces the
possibility of subjective bias in prompt design and may influence
Under review Dong Huang and Qignwen Bu et al.

External Validity. The external validity of our study is subject to duce. In our study, we carefully investigate this matter, aiming to
the representativeness of the code generation prompt dataset and identify and address biases in automated code generation. Our goal
the generalizability of language models to various code generation is to enhance the reliability and trustworthiness of the code gener-
tasks. If the dataset does not comprehensively cover a broad range ated by these models. This highlights the significance of employing
of potential biases in code, our findings may lack generalizability bias-aware approaches when utilizing machine assistance in pro-
to real-world scenarios. To address this concern, we take measures gramming tasks. By being mindful of biases, we can ensure more
to ensure diversity in the selection of biases and prompt creation, equitable and fair outcomes in the software development process.
aiming to better reflect real-world programming contexts. Addi-
tionally, we evaluated the language models’ performance in three 4.2 Testing for Code Generation Model
different social-related scenarios to assess their ability to general-
Testing for code generation models is an evolving domain at the
ize beyond our specific prompt dataset. By doing so, we enhance
confluence of software engineering and machine learning. The ob-
the external validity of our research and increase its applicability
jective is to ensure the reliability, accuracy, and applicability of the
to real-world code generation scenarios.
generated code. Researchers employ a diverse suite of metrics and
methodologies. Automated Performance Metrics, inspired by nat-
Construct Validity. Human evaluation from prompt collection to
ural language processing techniques, utilize tools like BLEU [44]
code bias testing can introduce subjectivity and potential errors. To
and ROUGE [33] for assessing code’s similarity to reference imple-
address this threat, we take several steps. For the prompt dataset,
mentations. Metrics like CodeBLEU [48], METEOR, and CIDEr [20]
we involve multiple experts in bias identification and ensure con-
further refine this analysis, providing a deeper dive into the code’s
sensus in labeling biases, minimizing the impact of individual sub-
structural and semantic quality. However, while these automated
jectivity. Additionally, for code bias evaluation, we rely on auto-
metrics offer quantifiable insights, they often overlook the func-
mated test case analysis to classify the predominantly generated
tional integrity of the code. To bridge this gap, Human Evaluation
code functions, providing a more standardized and automated ap-
methods, like the pass@k metric, have emerged. Here, human eval-
proach. Then for the code that requires human evaluation due to
uators assess the code execution accuracy in various test scenarios.
runtime errors, we will have multiple experts to analyze bias types
However, in particular, human evaluations, despite their ability to
for each code to reduce subjectivity. The construct validity of our
capture functional nuances, are limited in assessing the robustness
study also depends on the effectiveness of the test case analysis re-
of code generation models. Transitioning to Robustness Evalua-
sult assistant mitigation for the code. If the mitigation approach
tion [56], focuses on models’ resilience, especially under non-ideal
fails to result in substantial reductions in bias, the validity of our
or adversarial conditions. This involves introducing perturbations
conclusions could be compromised. To mitigate this threat, we con-
at different granularities—be it characters, words, or sentences—and
duct comprehensive evaluations to assess five code generation mod-
monitoring the model’s ability to counteract such disruptions. This
els, test them five times, and report the average results. By doing
layer of testing is essential to understand the model’s potential per-
so, we validate the effectiveness of our mitigation approach and
formance in real-world, unpredictable scenarios.
strengthen the construct validity of our research findings.
While existing metrics provide a comprehensive understanding
of the functional efficacy, structural alignment, and resilience of
4 RELATED WORK code generation models, they often overlook a critical dimension:
4.1 Code Generation Model model bias. Our research aims to fill this analytical gap by explor-
ing potential biases within the output of code generation models.
Recently, large language models have been widely used in code
Unchecked biases in generated code can inadvertently impact its
generation tasks. Various architectures have been explored in these
functionality, underscoring the significance of our investigation. In
models, with some notable examples being CodeBERT [22], PLBART
addition to our work, there is a work conducted during the same
[5], and CodeGPT [62]. These models are pre-trained on code cor-
period as our research [35] which also focuses on code bias. How-
pora to develop a deep understanding of code syntax, semantics,
ever, there are notable differences between their approach and ours.
and idiomatic constructs. To enhance their comprehension of the
Although both studies address code bias, Liu et al. [35]’s focus is
complexities in code, some innovative approaches integrate struc-
limited to code complement and does not provide any strategy to
tured representations. For example, GraphCodeBERT [25] incor-
mitigate bias. In contrast, our study concentrates on the broader
porates graph-based representations, while CodeT5 [58] combines
domain of text-to-code generation and aims not only to understand
the encoder-decoder paradigm with the structural essence of code.
and quantify code bias, but also to offer practical solutions to re-
These enhancements aim to provide the models with a more fine-
duce biases in AI-generated code.
grained understanding of code relationships and dependencies be-
yond just syntactic patterns. A current trend is the construction of
large-scale models (e.g., Codex [11] and CodeGen [40]) with billions 5 CONCLUSION AND FUTURE WORKS
of parameters, which have illustrated SOTA performance in code In this work, we propose a code bias testing framework to uncover
generation tasks. Another way is using foundation models (e.g., biases (e.g., age, gender) in code generation models. Based on the
ChatGPT, GPT-4) to generate code functions, which have been eval- framework, we assess current SOTA code generation models and we
uated for their effectiveness in generating functional code. observe that all of the tested code generation models are sometimes
Code generation models have numerous advantages, but can response bias code functions. we observed that larger language
also be susceptible to bias that could impact the software they pro- models do not mean fewer code bias behaviors. To mitigate bias
Bias Testing and Mitigation in LLM-based Code Generation Under review

in the code generation models, we propose five bias mitigation Varshney. 2020. Is there a trade-off between fairness and accuracy? a perspective
templates. Finally, we have released the dataset and source code using mismatched hypothesis testing. In International conference on machine
learning. PMLR, 2803–2813.
in [2]. Code generation model researchers and developers can use [20] Mikhail Evtikhiev, Egor Bogomolov, Yaroslav Sokolov, and Timofey Bryksin. 2022.
our framework to improve their code generation models. In our Out of the BLEU: how should we assess quality of the Code Generation models?
J. Syst. Softw. 203 (2022), 111741. https://api.semanticscholar.org/CorpusID:
future work, we will evaluate more code generation models (e.g., 251371647
CodeT5+, Copilot, and CodeX), more bias attributes (e.g., culture), [21] Virginia K Felkner, Ho-Chun Herbert Chang, Eugene Jang, and Jonathan May.
and more scenarios (e.g., academy admission). 2023. WinoQueer: A Community-in-the-Loop Benchmark for Anti-LGBTQ+ Bias
in Large Language Models. arXiv preprint arXiv:2306.15087 (2023).
[22] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong,
6 DATA AVAILABILITY Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT:
A Pre-Trained Model for Programming and Natural Languages. In Findings of the
We release our dataset and source code in https://anonymous.4open. Association for Computational Linguistics: EMNLP 2020. Association for Computa-
science/r/Code_Bias_assessment-6627. tional Linguistics, Online, 1536–1547. https://doi.org/10.18653/v1/2020.findings-
emnlp.139
[23] Eve Fleisig and Christiane Fellbaum. 2022. Mitigating Gender Bias in Machine
REFERENCES Translation through Adversarial Learning. arXiv preprint arXiv:2203.10675 (2022).
[1] 2023. Adult Income Dataset. www.kaggle.com/datasets/wenruliu/adult-income- [24] Xavier Garcia, Yamini Bansal, Colin Cherry, George Foster, Maxim Krikun, Melvin
dataset. Accessed on August 1, 2023. Johnson, and Orhan Firat. 2023. The unreasonable effectiveness of few-shot
[2] 2023. codebiasassessment. https://anonymous.4open.science/r/Code_Bias_ learning for machine translation. In International Conference on Machine Learning.
assessment-6627 Accessed on August 1, 2023. PMLR, 10867–10878.
[3] 2023. Employee Dataset. www.kaggle.com/datasets/tawfikelmetwally/employee- [25] Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long
dataset. Accessed on August 1, 2023. Zhou, Nan Duan, Jian Yin, Daxin Jiang, and M. Zhou. 2020. GraphCodeBERT:
[4] 2023. US Health Insurance Dataset. www.kaggle.com/datasets/teertha/ Pre-training Code Representations with Data Flow. ArXiv abs/2009.08366 (2020).
ushealthinsurancedataset. Accessed on August 1, 2023. [26] Morela Hernandez, Derek R Avery, Sabrina D Volpone, and Cheryl R Kaiser. 2019.
[5] Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. Bargaining while Black: The role of race in salary negotiations. Journal of Applied
2021. Unified Pre-training for Program Understanding and Generation. ArXiv Psychology 104, 4 (2019), 581.
abs/2103.06333 (2021). https://api.semanticscholar.org/CorpusID:232185260 [27] Max Hort, Zhenpeng Chen, Jie M Zhang, Federica Sarro, and Mark Harman. 2022.
[6] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Bias mitigation for machine learning classifiers: A comprehensive survey. arXiv
Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. preprint arXiv:2207.07068 (2022).
2022. Flamingo: a visual language model for few-shot learning. Advances in [28] Dong Huang, Qingwen Bu, and Heming Cui. 2023. CodeCoT and Beyond: Learn-
Neural Information Processing Systems 35 (2022), 23716–23736. ing to Program and Test like a Developer. arXiv preprint arXiv:2308.08784 (2023).
[7] Eva O Arceo-Gomez, Raymundo M Campos-Vazquez, Raquel Y Badillo, and Sergio [29] Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni,
Lopez-Araiza. 2022. Gender stereotypes in job advertisements: What do they Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard
imply for the gender salary gap? Journal of Labor Research 43, 1 (2022), 65–102. Grave. 2022. Few-shot learning with retrieval augmented language models. arXiv
[8] Soumya Barikeri, Anne Lauscher, Ivan Vulić, and Goran Glavaš. 2021. Reddit- preprint arXiv:2208.03299 (2022).
Bias: A real-world resource for bias evaluation and debiasing of conversational [30] Sungmin Kang, Juyeon Yoon, and Shin Yoo. 2023. Large language models are few-
language models. arXiv preprint arXiv:2106.03521 (2021). shot testers: Exploring llm-based general bug reproduction. In 2023 IEEE/ACM
[9] Pinar Barlas, Kyriakos Kyriakou, Olivia Guest, Styliani Kleanthous, and Jahna 45th International Conference on Software Engineering (ICSE). IEEE, 2312–2323.
Otterbacher. 2021. To" see" is to stereotype: Image tagging algorithms, gender [31] Hwaran Lee, Seokhee Hong, Joonsuk Park, Takyoung Kim, Gunhee Kim, and
recognition, and the accuracy-fairness trade-off. Proceedings of the ACM on Jung-Woo Ha. 2023. KoSBI: A Dataset for Mitigating Social Bias Risks Towards
Human-Computer Interaction 4, CSCW3 (2021), 1–31. Safer Large Language Model Application. arXiv preprint arXiv:2305.17701 (2023).
[10] Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. 2021. GPT- [32] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov,
Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow. https: Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. 2023.
//doi.org/10.5281/zenodo.5297715 StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161 (2023).
[11] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira [33] Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries.
Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, In Text Summarization Branches Out. Association for Computational Linguistics,
et al. 2021. Evaluating large language models trained on code. arXiv preprint Barcelona, Spain, 74–81. https://aclanthology.org/W04-1013
arXiv:2107.03374 (2021). [34] Suyun Liu and Luis Nunes Vicente. 2022. Accuracy and fairness trade-offs
[12] Zhenpeng Chen, Jie Zhang, Federica Sarro, and Mark Harman. 2023. Fairness in machine learning: A stochastic multi-objective approach. Computational
Improvement with Multiple Protected Attributes: How Far Are We?. In 46th Management Science 19, 3 (2022), 513–537.
International Conference on Software Engineering (ICSE 2024). ACM. [35] Y. Liu, Xiaokang Chen, Yan Gao, Zhe Su, Fengji Zhang, Daoguang Zan, Jian-
[13] Zhenpeng Chen, Jie M Zhang, Federica Sarro, and Mark Harman. 2022. MAAT: Guang Lou, Pin-Yu Chen, and Tsung-Yi Ho. 2023. Uncovering and Quantifying
a novel ensemble approach to addressing fairness and performance bugs for Social Biases in Code Generation. ArXiv abs/2305.15377 (2023). https://api.
machine learning software. In Proceedings of the 30th ACM Joint European Software semanticscholar.org/CorpusID:258866038
Engineering Conference and Symposium on the Foundations of Software Engineering. [36] Qing Lyu, Shreya Havaldar, Adam Stein, Li Zhang, Delip Rao, Eric Wong, Mar-
1122–1134. ianna Apidianaki, and Chris Callison-Burch. 2023. Faithful chain-of-thought
[14] Zhenpeng Chen, Jie M Zhang, Federica Sarro, and Mark Harman. 2023. A Com- reasoning. arXiv preprint arXiv:2301.13379 (2023).
prehensive Empirical Study of Bias Mitigation Methods for Machine Learning [37] Aman Madaan and Amir Yazdanbakhsh. 2022. Text and patterns: For effective
Classifiers. ACM Transactions on Software Engineering and Methodology 32, 4 chain of thought, it takes two to tango. arXiv preprint arXiv:2209.07686 (2022).
(2023), 1–30. [38] Ninareh Mehrabi, Fred Morstatter, Nripsuta Ani Saxena, Kristina Lerman, and
[15] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav A. G. Galstyan. 2019. A Survey on Bias and Fairness in Machine Learning. ACM
Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Se- Computing Surveys (CSUR) 54 (2019), 1 – 35. https://api.semanticscholar.org/
bastian Gehrmann, et al. 2023. Palm: Scaling language modeling with pathways. CorpusID:201666566
Journal of Machine Learning Research 24, 240 (2023), 1–113. [39] Erik Nijkamp, Hiroaki Hayashi, Caiming Xiong, Silvio Savarese, and Yingbo
[16] Zheng Chu, Jingchang Chen, Qianglong Chen, Weijiang Yu, Tao He, Haotian Zhou. 2023. CodeGen2: Lessons for Training LLMs on Programming and Natural
Wang, Weihua Peng, Ming Liu, Bing Qin, and Ting Liu. 2023. A survey of chain of Languages. ICLR (2023).
thought reasoning: Advances, frontiers and future. arXiv preprint arXiv:2309.15402 [40] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou,
(2023). Silvio Savarese, and Caiming Xiong. 2023. CodeGen: An Open Large Language
[17] A Feder Cooper, Ellen Abrams, and Na Na. 2021. Emergent unfairness in algorith- Model for Code with Multi-Turn Program Synthesis. ICLR (2023).
mic fairness-accuracy trade-off research. In Proceedings of the 2021 AAAI/ACM [41] OpenAI. 2023. GPT-4 Technical Report. ArXiv abs/2303.08774 (2023).
Conference on AI, Ethics, and Society. 46–54. [42] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela
[18] Yinlin Deng, Chunqiu Steven Xia, Haoran Peng, Chenyuan Yang, and Lingming Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022.
Zhang. 2023. Large language models are zero-shot fuzzers: Fuzzing deep-learning Training language models to follow instructions with human feedback. Advances
libraries via large language models. In Proceedings of the 32nd ACM SIGSOFT in Neural Information Processing Systems 35 (2022), 27730–27744.
international symposium on software testing and analysis. 423–435. [43] Shuyin Ouyang, Jie M Zhang, Mark Harman, and Meng Wang. 2023. LLM is
[19] Sanghamitra Dutta, Dennis Wei, Hazar Yueksel, Pin-Yu Chen, Sijia Liu, and Kush Like a Box of Chocolates: the Non-determinism of ChatGPT in Code Generation.
Under review Dong Huang and Qignwen Bu et al.

arXiv preprint arXiv:2308.02828 (2023). Bias Mitigation procedure based on the stereotype content model. arXiv preprint
[44] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a arXiv:2210.14552 (2022).
Method for Automatic Evaluation of Machine Translation. In Proceedings of the [55] Peiyi Wang, Lei Li, Liang Chen, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu,
40th Annual Meeting of the Association for Computational Linguistics. Association Tianyu Liu, and Zhifang Sui. 2023. Large language models are not fair evaluators.
for Computational Linguistics, Philadelphia, Pennsylvania, USA, 311–318. https: arXiv preprint arXiv:2305.17926 (2023).
//doi.org/10.3115/1073083.1073135 [56] Shiqi Wang, Zheng Li, Haifeng Qian, Cheng Yang, Zijian Wang, Mingyue Shang,
[45] Archit Parnami and Minwoo Lee. 2022. Learning from few examples: A summary Varun Kumar, Samson Tan, Baishakhi Ray, Parminder Bhatia, Ramesh Nallap-
of approaches to few-shot learning. arXiv preprint arXiv:2203.04291 (2022). ati, Murali Krishna Ramanathan, Dan Roth, and Bing Xiang. 2022. ReCode: Ro-
[46] Jean-Philippe Platteau and Darwin Ugarte Ontiveros. 2021. Cognitive bias in bustness Evaluation of Code Generation Models. ArXiv abs/2212.10264 (2022).
insurance: evidence from a health scheme in India. World Development 144 (2021), https://api.semanticscholar.org/CorpusID:254877229
105498. [57] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang,
[47] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain
Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of thought reasoning in language models. arXiv preprint arXiv:2203.11171 (2022).
of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine [58] Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi DQ Bui, Junnan Li, and
Learning Research 21, 140 (2020), 1–67. http://jmlr.org/papers/v21/20-074.html Steven CH Hoi. 2023. Codet5+: Open code large language models for code
[48] Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, M. Zhou, Am- understanding and generation. arXiv preprint arXiv:2305.07922 (2023).
brosio Blanco, and Shuai Ma. 2020. CodeBLEU: a Method for Automatic Evalua- [59] Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi. 2021. CodeT5:
tion of Code Synthesis. ArXiv abs/2009.10297 (2020). https://api.semanticscholar. Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Under-
org/CorpusID:221836101 standing and Generation. In EMNLP.
[49] Leonard Salewski, Stephan Alaniz, Isabel Rio-Torto, Eric Schulz, and Zeynep [60] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi,
Akata. 2023. In-Context Impersonation Reveals Large Language Models’ Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning
Strengths and Biases. arXiv preprint arXiv:2305.14930 (2023). in large language models. Advances in Neural Information Processing Systems 35
[50] Javier Sánchez-Monedero, Lina Dencik, and Lilian Edwards. 2020. What does (2022), 24824–24837.
it mean to’solve’the problem of discrimination in hiring? Social, technical and [61] Yue Yu, Yuchen Zhuang, Jieyu Zhang, Yu Meng, Alexander Ratner, Ranjay Kr-
legal perspectives from the UK on automated hiring systems. In Proceedings of ishna, Jiaming Shen, and Chao Zhang. 2023. Large Language Model as Attrib-
the 2020 conference on fairness, accountability, and transparency. 458–468. uted Training Data Generator: A Tale of Diversity and Bias. arXiv preprint
[51] Lori L Taylor, Joanna N Lahey, Molly I Beck, and Jeffrey E Froyd. 2020. How to arXiv:2306.15895 (2023).
do a salary equity study: With an illustrative example from higher education. [62] Daoguang Zan, Bei Chen, Dejian Yang, Zeqi Lin, Minsu Kim, Bei Guan, Yongji
Public personnel management 49, 1 (2020), 57–82. Wang, Weizhu Chen, and Jian-Guang Lou. 2022. CERT: Continual Pre-training
[52] Himanshu Thakur, Atishay Jain, Praneetha Vaddamanu, Paul Pu Liang, and Louis- on Sketches for Library-oriented Code Generation. In The 2022 International Joint
Philippe Morency. 2023. Language Models Get a Gender Makeover: Mitigating Conference on Artificial Intelligence.
Gender Bias with Few-Shot Data Interventions. arXiv preprint arXiv:2306.04597 [63] Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. 2022. Automatic chain
(2023). of thought prompting in large language models. arXiv preprint arXiv:2210.03493
[53] Lewis Tunstall, Nils Reimers, Unso Eun Seo Jo, Luke Bates, Daniel Korat, Moshe (2022).
Wasserblat, and Oren Pereg. 2022. Efficient few-shot learning without prompts. [64] Jiaxu Zhao, Meng Fang, Zijing Shi, Yitong Li, Ling Chen, and Mykola Pechenizkiy.
arXiv preprint arXiv:2209.11055 (2022). 2023. CHBias: Bias Evaluation and Mitigation of Chinese Conversational Lan-
[54] Eddie L Ungless, Amy Rafferty, Hrichika Nag, and Björn Ross. 2022. A Robust guage Models. arXiv preprint arXiv:2305.11262 (2023).

You might also like