Reward-Based Deception with Cognitive Bias

Bo Wu, Murat Cubuktepe, Suda Bharadwaj, and Ufuk Topcu

Bo Wu, Murat Cubuktepe, Suda Bharadwaj, and Ufuk Topcu are with the Department of Aerospace Engineering and Engineering Mechanics and the Oden Institute for Computational Engineering and Sciences, University of Texas at Austin, 201 E 24th St, Austin, TX 78712. Email: {bwu3, mcubuktepe, suda.b, utopcu}@utexas.edu

arXiv:1904.11454v1 [cs.AI] 25 Apr 2019

Abstract— Deception plays a key role in adversarial or strategic interactions for the purpose of self-defence and survival. This paper introduces a general framework and solution to address deception. Most existing approaches for deception consider obfuscating crucial information to rational adversaries with abundant memory and computation resources. In this paper, we consider deceiving adversaries with bounded rationality and in terms of expected rewards. This problem is commonly encountered in many applications, especially those involving human adversaries. Leveraging the cognitive bias of humans in reward evaluation under stochastic outcomes, we introduce a framework to optimally assign resources of a limited quantity to defend against human adversaries. Modeling such cognitive biases follows the so-called prospect theory from the behavioral psychology literature. We then formulate the resource allocation problem as a signomial program to minimize the defender's cost in an environment modeled as a Markov decision process. We use police patrol hour assignment as an illustrative example and provide detailed simulation results based on real-world data.

I. INTRODUCTION

Deception refers to a deliberate attempt to mislead or confuse adversaries so that they may take strategies that are in the defender's favor [1]. Deception can limit the effectiveness of an adversary's attack, waste the adversary's resources and prevent the leakage of critical information [2]. It is a widely observed behavior in nature for self-defence and survival. Deception also plays a key role in many aspects of human society, such as economics [3], warfare [4], games [5], cyber security [2] and so on.

In this paper, we focus on the scenario in which the adversary acts in an environment where this interaction is modeled as a Markov decision process (MDP) [6]. The adversary's aim is to collect rewards at each state of the MDP and the defender tries to minimize the accumulated reward through deception. Many existing approaches for deception rely on a rational adversary with sufficient memory and computation power to find its optimal policy [7], [1]. However, deceiving an adversary with only bounded rationality [8], i.e., one whose decisions may follow certain rules that deviate from the optimal action [9], has not been adequately studied so far. Deceiving an adversary with bounded rationality finds, for example, its application in intrusion detection and protection [10] or public safety [11]. Different from obfuscating sensitive system information to the adversary [12], [13], by deception we mean that the defender optimally assigns a limited resource to each state, such that the expected cost from the defender's perspective (or equivalently, the reward for the adversary) incurred by an adversary can be minimized, even though the adversary is expecting more based on his cognitively biased view of rewards.

To deceive a human more effectively, it is essential to understand the human's cognitive characteristics and what affects his decisions (particularly with stochastic outcomes). Works in behavioral psychology, e.g. [14], suggest that humans' decision-making follows intuition and bounded rationality. Empirical evidence has shown that humans tend to evaluate gains and losses differently in decision-making [15]. Humans tend to over-estimate the likelihood of low-probability events and under-estimate the likelihood of high-probability events in a nonlinear fashion [16], [15]. Risk-sensitive measures, such as those in the so-called prospect theory [16], capture such biases and are widely used in psychology and economics to characterize human preferences. Furthermore, humans tend to make decisions that are often sub-optimal [17]. It is generally believed that such sub-optimality is the result of intuitive decisions or preferences that happen automatically and quickly without much reflection [17], [14]. Human decisions are also subject to stochasticity due to limited computational capacity and inherent noise [18]. Consequently, human decisions are often cognitively biased (have a different reward mechanism), probabilistic (have a stochastic action selection policy) and memoryless (only depend on the current state). These are the very characteristics of human decision-making we expect to account for in reward-based deception.

This paper investigates how one can deceive a human adversary by optimally allocating limited resources to minimize his rewards. We model the environment as an MDP to capture the choices available to a human decision-maker and their probabilistic outcomes. We consider opportunistic human adversaries, i.e., they usually do not plan significantly and only act based on immediately available rewards [11]. We describe the human adversary's policy to select different actions following prospect theory and bounded rationality [8]. We model both the adversary's perceived reward and the defender's cost (equivalently, the adversary's reward from the defender's point of view) as functions of the resources available at each state of the MDP. Additionally, we define a subset of the states in the MDP as sensitive states that the human adversary should be kept from visiting.

We then formulate the optimal resource allocation problem as a signomial program (SP) to minimize the defender's cost. SPs are a special form of nonlinear programming (NLP) problems, and they are generally nonconvex.
Solving nonconvex NLPs is NP-hard [19] in general, and a globally optimal solution of an SP cannot be computed efficiently. SPs generalize geometric programs (GP), which can be transformed into convex optimization problems and then solved efficiently [20]. In this paper, we approximate the proposed SP by a GP. In numerical experiments, we show that this approach obtains locally optimal solutions of the SP efficiently by solving a number of GPs. We demonstrate the approach with a problem on the assignment of police patrol hours against opportunistic criminals [11].

The problem we study is closely related to the Stackelberg security game (SSG), which consists of an attacker and a defender that interact with each other. In an SSG, the defender acts first with limited resources and then the attackers play in response [21]. SSG is a popular formalism to study security problems against human adversaries. Early efforts focused on one-shot games where an adversary can only take one move [22], without considering the human's bounded rationality. Then repeated SSGs were considered in wildlife security [10] and fisheries [23], where the defender and the adversary can have repeated interactions. However, neither of these papers considered how a human perceives probabilities, even though the existence of nonlinear probability weighting curves is a well-known result in prospect theory [16]. Such a phenomenon was taken into account in [24] and [25], but [24] only studied one-shot games and [25] did not consider that the adversaries may move from place to place.

The rest of this paper is organized as follows. We first provide the necessary preliminaries for stochastic environment modeling, human cognitive biases and decision-making in Section II. Then we formulate the human deception problem in terms of resource allocation in Section III and show that it can be transformed into a signomial program in Section IV. We propose the computational approach to solve the signomial program in Section V. Section VI shows simulation results and discusses their implications. We conclude the paper and discuss possible future directions in Section VII.

II. PRELIMINARIES

A. Monomials, Posynomials, and Signomials.

Let V = {x_1, . . . , x_n} be a finite set of strictly positive real-valued variables. A monomial over V is an expression of the form

f = c · x_1^{a_1} · · · x_n^{a_n},

where c ∈ R+ is a positive coefficient, and a_i ∈ R are exponents for 1 ≤ i ≤ n. A posynomial over V is a sum of one or more monomials:

g = Σ_{k=1}^{K} c_k · x_1^{a_1k} · · · x_n^{a_nk}.   (1)

If c_k is allowed to be a negative real number for any 1 ≤ k ≤ K, then the expression (1) is a signomial. This definition of monomials differs from the standard algebraic definition, where exponents are positive integers with no restriction on the coefficient sign, and a sum of such monomials is then called a polynomial.

B. Nonlinear programs.

A general nonlinear program (NLP) over a set of real-valued variables V is

minimize f   (2)
subject to
g_i ≤ 1, i = 1, . . . , m,   (3)
h_j = 1, j = 1, . . . , p,   (4)

where f, g_i, and h_j are arbitrary functions over V, and m and p are the numbers of inequality and equality constraints of the program, respectively.

C. Signomial programs and geometric programs.

A special class of NLPs known as signomial programs (SP) is of the form (2)–(4) where f, g_i and h_j are signomials over V; see Section II-A. A geometric program (GP) is an SP of the form (2)–(4) where f, g_i are posynomial functions and h_j are monomial functions over V. GPs can be transformed into convex programs [20, §2.5] and then solved efficiently using interior-point methods [26]. SPs are nonconvex programs in general, and therefore there is no efficient algorithm to compute globally optimal solutions for SPs. However, we can efficiently obtain locally optimal solutions for SPs in our setting, as shown in the following sections.
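To see why a GP admits an efficient solution, recall the standard change of variables x_i = exp(y_i) from [20, §2.5]: after taking logarithms, every monomial becomes affine in y and every posynomial becomes a log-sum-exp of affine functions, which is convex. The following short Python sketch is illustrative only and not part of the paper; the example posynomial is hypothetical.

# Sketch: a GP becomes convex after the substitution x_i = exp(y_i) and taking logs.
# A monomial  c * x1^a1 * x2^a2  maps to the affine function  log c + a1*y1 + a2*y2,
# and a posynomial maps to a log-sum-exp of affine functions, which is convex.
import math

def log_monomial(c, a, y):       # log of a monomial evaluated at x = exp(y)
    return math.log(c) + sum(ai * yi for ai, yi in zip(a, y))

def log_posynomial(terms, y):    # log-sum-exp of the terms' affine images
    return math.log(sum(math.exp(log_monomial(c, a, y)) for c, a in terms))

# Hypothetical example: posynomial x1*x2 + x1^2 at x = (1, 2), i.e. y = (0, log 2)
terms = [(1.0, (1, 1)), (1.0, (2, 0))]
y = (0.0, math.log(2.0))
print(math.exp(log_posynomial(terms, y)))   # 3.0, same as evaluating the posynomial directly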
In this paper, the adversary with bounded rationality moves in an environment modeled as a Markov decision process (MDP) [6].

D. Markov Decision Processes.

An MDP is a tuple M = (S, ν, A, T, U) where
• S is a finite set of states;
• ν : S → [0, 1] is the initial state distribution;
• A is a finite set of actions;
• T(s, a, s') := P(s'|s, a), that is, the probability of transiting from s to s' with action a; and
• U(s) ∈ R+ is the utility function that assigns resources with a quantity U(s) to state s.

At each state s, an adversary has a set of actions available to choose from. The nondeterminism of the action selection is resolved by a policy π executed by the adversary. A (memoryless) policy π : S × A → [0, 1] of an MDP M is a function that maps every state-action pair (s, a), where s ∈ S and a ∈ A, to a probability π(s, a). By definition, the policy π specifies the probability for the next action a to be taken at the current state s. A boundedly rational adversary is often limited in memory and computation power; therefore we only consider memoryless policies.

In an MDP, a finite state-action path is ω = s_0 a_0 s_1 a_1 . . ., where s_i ∈ S, a_i ∈ A and T(s_i, a_i, s_{i+1}) > 0. Given a policy π, it is possible to calculate the probability of such a path P_π(ω) as

P_π(ω) = ν(s_0) Π_i π(s_i, a_i) T(s_i, a_i, s_{i+1}).   (5)
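For illustration, the path probability in (5) can be evaluated directly from ν, π and T. The following Python sketch is not part of the paper; the two-state MDP in it is a hypothetical placeholder.

# Minimal sketch of Eq. (5): probability of a finite state-action path
# under a memoryless policy. The tiny MDP below is hypothetical.

nu = {"s0": 1.0, "s1": 0.0}                       # initial distribution nu(s)
T = {("s0", "a", "s0"): 0.2, ("s0", "a", "s1"): 0.8,
     ("s1", "a", "s0"): 1.0}                      # T(s, a, s')
pi = {("s0", "a"): 1.0, ("s1", "a"): 1.0}         # memoryless policy pi(s, a)

def path_probability(omega):
    """omega = [s0, a0, s1, a1, ..., sN]; returns P_pi(omega) as in Eq. (5)."""
    states, actions = omega[0::2], omega[1::2]
    prob = nu[states[0]]
    for i, a in enumerate(actions):
        prob *= pi[(states[i], a)] * T.get((states[i], a, states[i + 1]), 0.0)
    return prob

print(path_probability(["s0", "a", "s1", "a", "s0"]))  # 1.0 * 0.8 * 1.0 = 0.8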
III. REWARD-BASED DECEPTION

We assume that an adversary with bounded rationality moves around in an environment modeled as an MDP M = (S, ν, A, T, U). When the adversary is at a state s ∈ S, from the defender's point of view, the immediate reward for the human adversary (or equivalently, the cost for the defender) is

R(s) = g(U(s)) ∈ R+,

which is a function of the allocated resource U(s). However, due to the bounded rationality and cognitive biases, the immediate reward R_h(s) at state s perceived by the adversary is a different function of U(s), and is given by

R_h(s) = f(U(s)) ∈ R+,

where f is another function over U. For a given policy π, the expected reward Q_t(s) at each state s and time t with a finite time horizon H can be evaluated as

Q_t(s) = R(s) + Σ_{a} Σ_{s'} π(s, a) T(s, a, s') Q_{t+1}(s'),   (6)

where t = 0, . . . , H − 1 and Q_H(s) = R(s). Therefore, Q_t represents the expected accumulated cost of the defender, or equivalently, the expected reward for the human adversary obtained from the policy π.

The defender's objective is to optimally assign the resources to each state to minimize his cost (equivalently, the adversary's reward) Q, where

Q = Σ_{s} ν(s) Q_0(s),   (7)

by designing the utility function U, where the resources are of limited quantity, i.e., Σ_{s} U(s) = D. Also imagine that there is a set of sensitive states S_s ⊂ S that the adversaries should be kept away from. Denote the set of paths that reach S_s in H steps as Ω, such that for each ω ∈ Ω where ω = s_0 a_0 . . . s_N, we require N ≤ H, s_i ∉ S_s for i < N, and s_N ∈ S_s. In particular, given a policy π, P(♦≤H S_s) can be calculated as

P(♦≤H S_s) = Σ_{ω∈Ω} P(ω).   (8)

Problem 1: Given an MDP M = (S, ν, A, T, U), a time horizon H, the human reward function defined in (12) and the policy π, design the utility function U with a limited total budget Σ_s U(s) = D, such that Q as defined in (7) is minimized and P(♦≤H S_s) ≤ λ, which requires that the probability to reach S_s in H steps should be no larger than λ ∈ [0, 1], i.e.,

P(♦≤H S_s) ≤ λ.   (9)

Remark 1: Problem 1 studies how to optimally assign the reward to trick the adversary into thinking that his policy could obtain more rewards when, in fact, the actual expected reward is minimized with a low probability of visiting the sensitive states S_s.
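The recursions in (6)–(8) can be evaluated by backward induction over the horizon. The Python sketch below is illustrative only (the toy MDP, rewards and policy are hypothetical); it computes Q in (7) together with the probability of reaching the sensitive set within H steps, treating sensitive states as absorbing for the reachability recursion, in line with constraint (18) introduced later.

# Backward induction for Eq. (6)-(7) and the reachability probability of Eq. (8),
# for a generic MDP given as dictionaries; all values here are hypothetical.

def expected_cost_and_reachability(S, A, T, R, nu, pi, sensitive, H):
    Q = {s: R[s] for s in S}                                # Q_H(s) = R(s)
    P = {s: (1.0 if s in sensitive else 0.0) for s in S}    # P_H(s)
    for _ in range(H):                                      # t = H-1, ..., 0
        Q_next, P_next = Q, P
        Q, P = {}, {}
        for s in S:
            succ = sum(pi[(s, a)] * T.get((s, a, sp), 0.0) * Q_next[sp]
                       for a in A for sp in S)
            reach = sum(pi[(s, a)] * T.get((s, a, sp), 0.0) * P_next[sp]
                        for a in A for sp in S)
            Q[s] = R[s] + succ
            P[s] = 1.0 if s in sensitive else reach         # sensitive states absorbing
    total_Q = sum(nu[s] * Q[s] for s in S)                  # Eq. (7)
    reach_prob = sum(nu[s] * P[s] for s in S)               # P(reach S_s within H steps)
    return total_Q, reach_prob

# Toy usage with a hypothetical 2-state MDP and a single action "a":
S, A = ["s0", "s1"], ["a"]
T = {("s0", "a", "s1"): 1.0, ("s1", "a", "s1"): 1.0}
R = {"s0": 1.0, "s1": 2.0}
nu = {"s0": 1.0, "s1": 0.0}
pi = {("s0", "a"): 1.0, ("s1", "a"): 1.0}
print(expected_cost_and_reachability(S, A, T, R, nu, pi, {"s1"}, H=2))  # (5.0, 1.0)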
A. Human Adversaries with Cognitive Biases

To solve Problem 1, it is essential to find the adversary's policy π. In this paper, we take a human as the adversary with bounded rationality who is opportunistic, meaning that he does not have a specific attack goal nor plans strategically, but is flexible about his movement plan and seeks opportunities for attacks [27]. Those attacks may incur rewards to the human adversary and consequently certain costs for the defender. The process of human decision-making typically follows several steps [28]. First, a human recognizes his current situation or state. Second, he evaluates each available action based on the potential immediate reward it can bring. Third, he selects an action following some rules. Then he receives a reward and observes a new state. In this section, we introduce the modeling framework for the second and third steps.

For a human with bounded rationality, the value of a reward from an action is a function of the possible outcomes and their associated probabilities. The prospect theory developed by Kahneman and Tversky [16] is a frequently used modeling framework to characterize the reward perceived by a human. Prospect theory claims that humans tend to over-estimate low probabilities and under-estimate high probabilities in a nonlinear fashion. For example, between winning 100 dollars with probability 1/100 and nothing otherwise, or 1 dollar with probability 1, humans tend to prefer the former, even though both have the same expectation.

Given X as a discrete random variable that has a finite set of outcomes O, a general form of the prospect theory utility V(X) (i.e. the reward anticipated by a human) is the following:

V(X) = Σ_{x∈O} v(x) w(p(x)),   (10)

where v(x) ∈ R denotes the reward perceived by a human from the outcome x. The probability p(x) to get the outcome x is weighted by a nonlinear function w that captures the human tendency to over-estimate low probabilities and under-estimate high probabilities.

The expected immediate reward r_a(s) to perform an action a at state s is

r_a(s) = Σ_{s'} R(s') T(s, a, s').   (11)

However, according to prospect theory, from a human's perspective, the perceived expected immediate reward r_a^h(s) is different. Let X_{s,a} be the random variable for the outcome O_{s,a} of executing action a at state s. We have O_{s,a} = {x_{s'} | T(s, a, s') > 0}, where x_{s'} denotes the event that the state transits from s to s' with action a. The distribution of X_{s,a} is defined as follows:

p(x_{s'}) = T(s, a, s'), ∀x_{s'} ∈ O_{s,a}.

The human perceived reward v(x_{s'}) for the outcome x_{s'} depends on U(s') received from reaching the state s', which is denoted by

v(x_{s'}) = R_h(s') = f(U(s')).
Fig. 1. A simple example for sub-optimality with human cognitive biases. (Recovered from the figure labels: the MDP has states s_0, . . . , s_4 with U(s_1) = 5, U(s_2) = 2, U(s_3) = 2.5, U(s_4) = 2.5; from s_0, action a leads to s_2 with probability 0.9 and to s_1 with probability 0.1, while action b leads to s_3 and s_4 with probability 0.5 each.)

As a result, r_a^h(s) is denoted by

r_a^h(s) = Σ_{x_{s'}∈O_{s,a}} v(x_{s'}) w(p(x_{s'})) = Σ_{s'} f(U(s')) w(T(s, a, s')).   (12)

An empirical form of w is the following [16]:

w(p) = p^γ / (p^γ + (1 − p)^γ)^{1/γ},  γ > 0.   (13)

Consider the MDP depicted in Figure 1, where S = {s_0, . . . , s_4} and A = {a, b}. We assume that R(s) = U(s), R_h(s) = U(s)^0.88 and γ = 0.6 in (13). It can be found from (11) and (12) that r_a(s_0) = 2.3, r_a^h(s_0) = 2.0678, r_b(s_0) = 2.5 and r_b^h(s_0) = 1.8617. Since r_a^h(s_0) > r_b^h(s_0), if a human is at s_0, from the human's perspective he will prefer action a. However, r_a(s_0) < r_b(s_0), which indicates that action b actually has a higher expected immediate reward.

Remark 2: In this example, the rewards are already given, and it can be seen that the human could make a sub-optimal decision. It illustrates how cognitive bias can deviate human behavior from the optimal.

After evaluating the outcome of each candidate action a by r_a^h(s), a human then needs to make an action selection. Humans are known to have only quite limited cognitive capabilities. The human's policy π to choose an action can be described as a random process that biases toward the actions of high r_a^h(s), such that

π(s, a) = r_a^h(s) / Σ_{a'} r_{a'}^h(s),   (14)

where π(s, a) denotes the probability of executing the action a at state s. Such boundedly rational behavior has been observed in humans, for example in urban criminal activities [29]. Intuitively, it implies that the human selects the action a opportunistically at each state s with probability proportional to the perceived immediate reward r_a^h(s).

Now we are ready to redefine Problem 1 as follows.

Problem 2: Solve Problem 1 for π defined as (14).
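As a sanity check of (11)–(14), the following Python sketch (ours, not part of the paper) reproduces the numbers of the Figure 1 example with f(u) = u^0.88 and γ = 0.6, and then forms the opportunistic policy at s_0 according to (14).

# Reproduces the Figure 1 example: Eq. (11)-(13) for actions a and b at s0,
# and the opportunistic policy of Eq. (14). Values follow the text.

U = {"s1": 5.0, "s2": 2.0, "s3": 2.5, "s4": 2.5}
T = {("s0", "a", "s2"): 0.9, ("s0", "a", "s1"): 0.1,
     ("s0", "b", "s3"): 0.5, ("s0", "b", "s4"): 0.5}
gamma = 0.6

R = lambda s: U[s]                 # defender-side reward, R(s) = U(s)
f = lambda u: u ** 0.88            # human-perceived reward, R_h(s) = U(s)^0.88
w = lambda p: p**gamma / (p**gamma + (1 - p)**gamma) ** (1 / gamma)   # Eq. (13)

def r(s, a):        # Eq. (11): expected immediate reward
    return sum(R(sp) * T[(s, a, sp)] for (ss, aa, sp) in T if (ss, aa) == (s, a))

def r_h(s, a):      # Eq. (12): perceived expected immediate reward
    return sum(f(U[sp]) * w(T[(s, a, sp)]) for (ss, aa, sp) in T if (ss, aa) == (s, a))

print(r("s0", "a"), r("s0", "b"))        # 2.3, 2.5
print(r_h("s0", "a"), r_h("s0", "b"))    # approx. 2.0678, 1.8617
total = r_h("s0", "a") + r_h("s0", "b")
print(r_h("s0", "a") / total)            # Eq. (14): pi(s0, a) approx. 0.526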
IV. SIGNOMIAL PROGRAMMING FORMULATION

Given an MDP M, a time horizon H, and the human reward function and policy as defined in (12) and (14), the solution of Problem 1 can be computed by solving the following signomial program. The functions g and f are assumed to be monomial functions of U for our solution method.

minimize Q = Σ_{s} ν(s) Q_0(s)   (15)
subject to
∀s ∈ S, t ∈ {0, . . . , H − 1},
    Q_t(s) ≥ R(s) + Σ_{a∈A} Σ_{s'∈S} π(s, a) T(s, a, s') Q_{t+1}(s')   (16)
∀s ∈ S, t ∈ {0, . . . , H − 1},
    P_t(s) ≥ Σ_{a∈A} Σ_{s'∈S} π(s, a) T(s, a, s') P_{t+1}(s')   (17)
∀t ∈ {0, . . . , H}, ∀s ∈ S_s,  P_t(s) = 1   (18)
∀t ∈ {0, . . . , H},  Σ_{s∈S} ν(s) P_t(s) ≤ λ   (19)
∀s ∈ S, a ∈ A,
    π(s, a) Σ_{a'∈A} Σ_{s'∈S} f(U(s')) w(T(s, a', s')) = Σ_{s'∈S} f(U(s')) w(T(s, a, s'))   (20)
∀s ∈ S,  R(s) = g(U(s))   (21)
Σ_{s∈S} U(s) = D,   (22)

where the variables R(s) are the rewards in each state s, U(s) are the utilities in each state s, π(s, a) are the probabilities of taking action a in state s, Q_t(s) are the expected rewards of state s at time step t, and P_t(s) are the probabilities of reaching the set of target states S_s from state s at time step t.

The objective in (15) minimizes the accumulated expected reward from the initial state distribution ν(s) over a time horizon H. In (16), we compute Q_t(s) by adding the immediate reward in state s and the expected reward of the successor states according to the policy variables π(s, a) for each action a. The probability of reaching each successor state s' depends on the policy variables π(s, a) in each state s and action a. Similar to the constraint in (16), the variables P_t(s) are assigned to the probability of reaching the set of target states S_s from state s and time step t in (17).

The probability of reaching any state s ∈ S_s in each horizon from the states in S_s is set to 1 in (18). The constraint in (19) ensures that the probability of reaching any state s ∈ S_s from the initial state distribution ν(s) is less than λ. The constraint in (20) computes the policy using the model in (14). We give the relationship between rewards and utilities in (21). Finally, (22) gives the total budget for the utilities.

The constraints in (16) and (17) are convex constraints, because the functions on the right-hand sides are posynomial functions, and the functions on the left-hand sides are monomial functions. The constraints in (18) and (19) are affine constraints, therefore they are convex. The constraints in (20) and (22) are equality constraints with posynomials; therefore they belong to the class of signomial constraints, and they are not convex. In the literature, there are various methods to deal with the nonconvex constraints to obtain a locally optimal solution, including sequential convex programming, convex-concave programming, branch and bound, or cutting plane methods [30], [20], [31].
V. COMPUTATIONAL APPROACH FOR THE SIGNOMIAL PROGRAM

In this section, we discuss how to compute a locally optimal solution efficiently for Problem 1 by solving the signomial program in (15)–(22). We propose a sequential convex programming method to compute a local optimum of the signomial program in (15)–(22), following [20, §9.1], solving a sequence of GPs. We obtain each GP by replacing the signomial constraints in the equality constraints of the signomial program in (15)–(22) with monomial approximations of the functions.

A. Monomial approximation

Given a posynomial f, a set of variables {x_1, . . . , x_n}, and an initial point x̂, a monomial approximation [20] f̂ for f around x̂ is

f̂ = f[x̂] Π_{i=1}^{n} (x_i / x̂(x_i))^{a_i},

where, for all 1 ≤ i ≤ n,

a_i = (x̂(x_i) / f[x̂]) · (∂f/∂x_i)[x̂].

Intuitively, a monomial approximation of a posynomial f around an initial point x̂ corresponds to an affine approximation of the posynomial f. Such an approximation is provided by the first-order Taylor approximation of f; see [20, §9.1] for more details.
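The monomial approximation above is straightforward to implement for a posynomial stored as a list of (coefficient, exponent-vector) terms. The Python sketch below is illustrative only; the example posynomial and expansion point are hypothetical.

# Sketch of the monomial approximation in Section V-A for a posynomial
# given as a list of (coefficient, exponent-vector) terms.
import math

def posy_eval(terms, x):
    return sum(c * math.prod(xi ** e for xi, e in zip(x, exps)) for c, exps in terms)

def monomial_approx(terms, x_hat):
    """Return a monomial function approximating the posynomial around x_hat."""
    f0 = posy_eval(terms, x_hat)
    # partial derivative of each term w.r.t. x_i equals term_value * e_i / x_i
    a = []
    for i, xh in enumerate(x_hat):
        dfi = sum(c * math.prod(xj ** e for xj, e in zip(x_hat, exps)) * exps[i] / xh
                  for c, exps in terms)
        a.append(xh / f0 * dfi)
    return lambda x: f0 * math.prod((xi / xh) ** ai for xi, xh, ai in zip(x, x_hat, a))

# Hypothetical example: f(x, y) = x*y + x^2 approximated around (1, 2); both agree at x_hat.
terms = [(1.0, (1, 1)), (1.0, (2, 0))]
f_hat = monomial_approx(terms, (1.0, 2.0))
print(posy_eval(terms, (1.0, 2.0)), f_hat((1.0, 2.0)))   # 3.0  3.0
print(posy_eval(terms, (1.1, 2.1)), f_hat((1.1, 2.1)))   # close but not equal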
For a given instantiation of the utility and policy variables U(s) and π(s, a), we approximate the SP in (15)–(22) to obtain a GP as follows. We first normalize the utility values to ensure that they sum up to D. Then, using those utility values, we compute the policy according to the constraint in (20). After the policy computation, we compute a monomial approximation of each posynomial term in the constraints (20) and (22) around the previous instantiation of the utility and policy variables. After the approximation, we solve the approximate GP. We repeat this procedure until it converges.

One key problem with this approach is that we require an initial feasible point for the signomial problem in (15)–(22), which may be hard to find because of the reachability constraint in (19). Therefore, we introduce a new variable τ and replace the reachability constraint in (19) by the following constraints:

∀t ∈ {0, . . . , H},  Σ_{s∈S} ν(s) P_t(s) ≤ λ · τ   (23)
τ ≥ 1.   (24)

By replacing the reachability constraint, we ensure that any initial utility function and policy is feasible for the signomial program in (15)–(24). To enforce the feasibility of the reachability constraint in (19), we change the objective in (15) as follows:

minimize Q + δ · τ,   (25)

where δ is a positive penalty parameter that determines the violation rate for the soft constraint in (23). In our formulation, we increase δ after each iteration to satisfy the reachability constraint.

We stop the iterations when the change in the value of Q is less than a small positive constant ε. Intuitively, ε defines the required improvement on the objective value for each iteration; once there is not enough improvement, the process terminates.

VI. NUMERICAL EXPERIMENT

Let us consider an urban security problem, where a criminal plans his next move randomly based on his local information about the nearby locations that are protected by police patrols. Such a criminal is opportunistic, i.e., he is not highly strategic and does not conduct careful surveillance and rational planning before making moves. It is known that this kind of opportunistic adversary contributes to the majority of urban crimes [32]. For prevention and protection, each location should be assigned a certain number of police patrol hours. Due to the limited amount of police resources, the total number of patrol hours is limited as well.

Fig. 2. The 35 intersections in the northeast of San Francisco. The map is obtained from Google Maps. The shaded area is a circle with a 500 feet radius. The numbering of the states starts from the bottom left corner and goes from left to right in every row. The number beside each location indicates the number of crimes in that area.
Figure 2 shows 35 intersections in San Francisco, CA, arranged in 7 rows and 5 columns. We use an MDP M = (S, ν, A, T, U) to describe the network of the set of intersections S. The number C(s) of crimes that occurred in the first four weeks of October 2018 within 500 feet of each intersection s is shown in Figure 2. The crime data are obtained from https://www.crimemapping.com/map/ca/sanfrancisco. The criminal can choose to move left, right, up or down to the immediate neighboring intersections. Consequently, there are four actions available. The execution of each action will lead the human to the intended neighboring intersection with a high probability (≥ 0.95) and with a small probability to the other neighboring intersections, to account for unexpected changes of movement plan.
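For concreteness, one way to build such a grid transition function is sketched below in Python. This is not the code used in the paper: the text only states that the intended move succeeds with probability at least 0.95, so the exact value used here, the uniform split of the slip probability among the other neighbors, and the stay-put behavior at the grid boundary are all assumptions.

# Sketch of a 7x5 grid MDP transition function: each action moves to the intended
# neighbor with probability P_INTENDED and (as an assumption) spreads the remaining
# probability uniformly over the other neighboring intersections.

ROWS, COLS = 7, 5                  # rows numbered from the bottom, as in Fig. 2
P_INTENDED = 0.95                  # ">= 0.95" in the text; exact value assumed here
MOVES = {"left": (0, -1), "right": (0, 1), "up": (1, 0), "down": (-1, 0)}

def neighbors(r, c):
    cand = [(r + dr, c + dc) for dr, dc in MOVES.values()]
    return [(rr, cc) for rr, cc in cand if 0 <= rr < ROWS and 0 <= cc < COLS]

def transition(r, c, action):
    """Return {successor: probability} for taking `action` at intersection (r, c)."""
    dr, dc = MOVES[action]
    intended = (r + dr, c + dc)
    nbrs = neighbors(r, c)
    if intended not in nbrs:       # moving off the grid: stay put (assumption)
        return {(r, c): 1.0}
    others = [n for n in nbrs if n != intended]
    probs = {intended: P_INTENDED if others else 1.0}
    for n in others:
        probs[n] = (1.0 - P_INTENDED) / len(others)
    return probs

print(transition(0, 0, "right"))   # corner state: intended neighbor plus one other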
Initially, the criminal has an equal probability of appearing at any state, i.e., ν(s) = 1/35 for any s ∈ S. The utility U(s) denotes the number of police patrol hours that should be allocated to the vicinity of each intersection. The total number of police patrol hours is D = Σ_s U(s) = 400. If a location s is assigned U(s) patrol hours, its reward to the criminal (equivalently, the cost to the defender) is

R(s) = C(s) / U(s).

Intuitively, it means that the reward to the human adversary, from the defender's point of view, is proportional to the crime rate indicated by C(s) and inversely proportional to the police patrol hours. The reward from the human adversary's view is evaluated as

f(U(s)) = R(s)^0.88,

which is a function commonly seen in the literature to describe how a human biases the reward [15].

Initially, the criminal is at s with probability ν(s), where he tries to plan his moves over the next H steps. The objective is to assign the police patrol hours to each state such that the expected accumulated reward in H steps received by the criminal is minimized. The sensitive states S_s = {3, 14, 33} should be visited with a probability no larger than λ = 0.3, i.e.,

P(♦≤H S_s) ≤ 0.3.

The sensitive states are also shown as blue circles in Figure 2.

We formulate the problem as a signomial program. From an initial uniform utility distribution, we instantiate the policies and reward functions. Then, from the initial values, we linearize the signomial program in (15)–(25) to a geometric program. We parse the geometric programs using the tool GPkit [33] and solve them using the solver MOSEK [34]. We set ε = 10^{-4} as the convergence tolerance. All experiments were run on a 2.3 GHz machine with 16 GB RAM. The procedure converged after 32 iterations for a problem with horizon length H = 20 in 230.06 seconds. The expected reward Q from the initial state distribution is 117.15, and the reachability probability of the sensitive states from the initial state distribution is 0.192, which satisfies the reachability specification.
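For readers unfamiliar with the tooling, the snippet below is a minimal, generic GPkit usage sketch. It is not the program in (15)–(25); the toy objective and constraints are hypothetical, and a backend solver such as MOSEK (or cvxopt) must be installed.

# Minimal GPkit sketch: a toy geometric program, not the program in (15)-(25).
# Requires gpkit and a backend solver (e.g., MOSEK or cvxopt) to be installed.
from gpkit import Variable, Model

x = Variable("x")
y = Variable("y")

# minimize a monomial objective subject to posynomial <= monomial constraints
m = Model(1 / (x * y), [x + y <= 2, x >= 0.1])
sol = m.solve(verbosity=0)
print(sol["cost"])        # optimal cost (here 1, attained at x = y = 1)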
Fig. 3. The distribution of utility U.

The result is shown in Figure 3. Different colors at each intersection show the number of patrol hours, i.e., the resource U(s), assigned to each location s. In Figure 3, U(s) is shown on a logarithmic scale for better illustration. As the color bar at the bottom of the figure indicates, the closer the color at each location is to the right side of this bar, the more patrol hours are assigned. For example, the state at (3, 1) (the third state in the first row), where C(s) = 48, is assigned approximately 106 patrol hours, which is about 2 on the logarithmic scale. Therefore, its color is yellow in Figure 3, as indicated by the right tip of the color bar. Together with Figure 2, it can be observed that sensitive places and places with a higher number of crimes get assigned more patrol hours. Consequently, the rewards at those states are fairly low to discourage the criminal from visiting them. The cost at each location is proportional to the crime rate and inversely proportional to the police patrol hours. The patrol hours assigned to each place are intended to minimize the expected cost incurred by the human adversary.

VII. CONCLUSION

This paper introduces a general framework for deceiving adversaries with bounded rationality in terms of minimizing the reward they obtain. Leveraging the cognitive bias of the human from well-known prospect theory, we formulate reward-based deception as a resource allocation problem in a Markov decision process environment and solve it as a signomial program to minimize the adversary's expected reward. We use police patrol hour assignment as the illustrative example and show the validity of our proposed solution approach.
It opens doors for further research on the topic to consider scenarios where the defender can move around and react to the human adversaries in real time, and where the human adversary has a learning capability to adapt to the defender's deceiving policy.

REFERENCES

[1] M. H. Almeshekah and E. H. Spafford, “Cyber security deception,” in Cyber Deception. Springer, 2016, pp. 23–50.
[2] J. Pawlick, E. Colbert, and Q. Zhu, “A game-theoretic taxonomy and survey of defensive deception for cybersecurity and privacy,” arXiv preprint arXiv:1712.05441, 2017.
[3] S. Bonetti, “Experimental economics and deception,” Journal of Economic Psychology, vol. 19, no. 3, pp. 377–395, 1998.
[4] T. Holt, The Deceivers: Allied Military Deception in the Second World War. Simon and Schuster, 2010.
[5] E. Morgulev, O. H. Azar, R. Lidor, E. Sabag, and M. Bar-Eli, “Deception and decision making in professional basketball: Is it beneficial to flop?” Journal of Economic Behavior & Organization, vol. 102, pp. 108–118, 2014.
[6] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
[7] K. Horák, Q. Zhu, and B. Bošanský, “Manipulating adversary's belief: A dynamic game approach to deception by design for proactive network security,” in International Conference on Decision and Game Theory for Security. Springer, 2017, pp. 273–294.
[8] H. A. Simon, Models of Man; Social and Rational. 1957.
[9] C. F. Camerer, Behavioral Game Theory: Experiments in Strategic Interaction. Princeton University Press, 2011.
[10] R. Yang, B. Ford, M. Tambe, and A. Lemieux, “Adaptive resource allocation for wildlife protection against illegal poachers,” in Proceedings of the 2014 International Conference on Autonomous Agents and Multiagent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 2014, pp. 453–460.
[11] C. Zhang, A. X. Jiang, M. B. Short, P. J. Brantingham, and M. Tambe, “Defending against opportunistic criminals: New game-theoretic frameworks and algorithms,” in International Conference on Decision and Game Theory for Security. Springer, 2014, pp. 3–22.
[12] B. Wu and H. Lin, “Privacy verification and enforcement via belief abstraction,” IEEE Control Systems Letters, vol. 2, no. 4, pp. 815–820, Oct 2018.
[13] P. Masters and S. Sardina, “Deceptive path-planning,” in Proceedings of the 26th International Joint Conference on Artificial Intelligence. AAAI Press, 2017, pp. 4368–4375.
[14] D. Kahneman, Thinking, Fast and Slow. Macmillan, 2011.
[15] A. Tversky and D. Kahneman, “Advances in prospect theory: Cumulative representation of uncertainty,” Journal of Risk and Uncertainty, vol. 5, no. 4, pp. 297–323, 1992.
[16] D. Kahneman and A. Tversky, “Prospect theory: An analysis of decision under risk,” in Handbook of the Fundamentals of Financial Decision Making: Part I. World Scientific, 2013, pp. 99–127.
[17] E. Norling, “Folk psychology for human modelling: Extending the BDI paradigm,” in Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems - Volume 1. IEEE Computer Society, 2004, pp. 202–209.
[18] P. B. Reverdy, V. Srivastava, and N. E. Leonard, “Modeling human decision making in generalized Gaussian multiarmed bandits,” Proceedings of the IEEE, vol. 102, no. 4, pp. 544–571, 2014.
[19] D. S. Hochbaum, “Complexity and algorithms for nonlinear optimization problems,” Annals of Operations Research, vol. 153, no. 1, pp. 257–296, 2007.
[20] S. Boyd, S.-J. Kim, L. Vandenberghe, and A. Hassibi, “A tutorial on geometric programming,” Optimization and Engineering, vol. 8, no. 1, 2007.
[21] B. An and M. Tambe, “Stackelberg security games (SSG) basics and application overview,” Improving Homeland Security Decisions, p. 485, 2017.
[22] M. Jain, J. Tsai, J. Pita, C. Kiekintveld, S. Rathi, M. Tambe, and F. Ordóñez, “Software assistants for randomized patrol planning for the LAX airport police and the Federal Air Marshal Service,” Interfaces, vol. 40, no. 4, pp. 267–290, 2010.
[23] W. B. Haskell, D. Kar, F. Fang, M. Tambe, S. Cheung, and E. Denicola, “Robust protection of fisheries with COMPASS,” 2014.
[24] R. Yang, C. Kiekintveld, F. Ordóñez, M. Tambe, and R. John, “Improving resource allocation strategies against human adversaries in security games: An extended study,” Artificial Intelligence, vol. 195, pp. 440–469, 2013.
[25] D. Kar, F. Fang, F. Delle Fave, N. Sintov, and M. Tambe, “A game of thrones: When human behavior models compete in repeated Stackelberg security games,” in Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 2015, pp. 1381–1390.
[26] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[27] Y. D. Abbasi, M. Short, A. Sinha, N. Sintov, C. Zhang, and M. Tambe, “Human adversaries in opportunistic crime security games: Evaluating competing bounded rationality models,” in Proceedings of the Third Annual Conference on Advances in Cognitive Systems (ACS), 2015, p. 2.
[28] K. Doya, “Modulators of decision making,” Nature Neuroscience, vol. 11, no. 4, p. 410, 2008.
[29] M. B. Short, M. R. D'Orsogna, V. B. Pasour, G. E. Tita, P. J. Brantingham, A. L. Bertozzi, and L. B. Chayes, “A statistical model of criminal behavior,” Mathematical Models and Methods in Applied Sciences, vol. 18, no. supp01, pp. 1249–1267, 2008.
[30] R. E. Moore, “Global optimization to prescribed accuracy,” Computers & Mathematics with Applications, vol. 21, no. 6-7, pp. 25–39, 1991.
[31] E. L. Lawler and D. E. Wood, “Branch-and-bound methods: A survey,” Operations Research, vol. 14, no. 4, pp. 699–719, 1966.
[32] P. J. Brantingham and G. Tita, “Offender mobility and crime pattern formation from first principles,” in Artificial Crime Analysis Systems: Using Computer Simulations and Geographic Information Systems. IGI Global, 2008, pp. 193–208.
[33] E. Burnell and W. Hoburg, “GPkit software for geometric programming,” https://github.com/convexengineering/gpkit, 2018, version 0.7.0.
[34] MOSEK ApS, The MOSEK Optimization Toolbox for Python, Version 7.1 (Revision 60), 2015. [Online]. Available: http://docs.mosek.com/7.1/quickstart/Using MOSEK from Python.html
