

Who Blocks Who: Simultaneous Clothing Segmentation for Grouping Images

Nan Wang and Haizhou Ai


Computer Science and Technology Department
Tsinghua University, Beijing, China
wang-n04@mails.tsinghua.edu.cn, ahz@mail.tsinghua.edu.cn

Abstract
Clothing is one of the most informative cues of human appearance. In this paper, we propose a novel multi-person clothing segmentation algorithm for highly occluded images. The key idea is to combine blocking models to address person-wise occlusions. In contrast to the traditional layered model that tries to solve the full layer ranking problem, the proposed blocking model partitions the problem into a series of pair-wise ones and then determines the local blocking relationship based on individual and contextual information. Thus, it is capable of dealing with cases with a large number of people. Additionally, we propose a layout model, formulated as a Markov Network, which incorporates the blocking relationship to pursue an approximately optimal clothing layout for group people. Experiments on a group images dataset demonstrate the effectiveness of our algorithm.

Figure 1. Clothing segmentation for highly occluded group images benefits from blocking information between neighbors (left: traditional method; right: our method). Arrows in the upper-right image indicate the blocking relationship. Figures in this paper are best viewed in color.

1. Introduction

Clothing is one of the most informative cues of human appearance. It has been proven that clothing segmentation benefits human detection [18] and recognition [9], pose estimation [16], and human sketches [5].

This paper addresses the problem of multi-person clothing segmentation under heavy occlusion, since numerous group pictures are captured at family gatherings and social activities. To analyze these images, inter-person occlusions must be taken into account.

Existing clothing segmentation algorithms [9][11] formulate the problem in the MRF framework with different clothing modeling approaches. In [11], the clothing region (foreground) is initialized by torso detection based on dominant color determination, while in [9], the clothing mask is learned by mutual information analysis based on human identities. In addition, Gallagher et al. [9] show that clothing segmentation and human recognition can benefit each other. However, neither of these algorithms considers person-wise occlusions explicitly. Note that the segmentation results in [9] seem to show an ability to address occlusions, but this is achieved by co-segmenting multiple images of the same person wearing the same clothes.

There have been a number of approaches to segmenting objects with partial occlusion. Some innovative papers use part-based models. Winn et al. [21] and Hoiem et al. [10] handle object occlusions by using asymmetric pair-wise potentials that pursue a consistent layout output. Kumar et al. [15] allow for self-occlusion by building a Layered Pictorial Structure for a specific class. However, clothing shapes differ from each other dramatically due to severe occlusion (Fig. 1). It is more appropriate to choose global masks for each person, since local parts are difficult to learn from such inconsistent data.

A similar work to ours is the layered model introduced by Yang et al. [22], where occlusions are handled by object depth estimation. The layer order is searched in a brute-force way based on detector confidence and bottom edge locations.
Figure 2. Graphical diagram of our algorithm. Input data are the image [8] and face detection results [12]. Our algorithm outputs not only the multi-person clothing segmentation results but also the blocking relationship between neighbors.

Group images can contain dozens of people, which sometimes makes the brute-force search intractable. Moreover, it is not necessary to determine the occlusion relationship between every pair of persons.

Our work is also inspired by the work of Eichner et al. [7], where inter-person occlusions are modeled based on relative face locations and the image border to facilitate multi-person pose estimation. However, relative face location can hardly resolve occlusion for people standing abreast. We argue that clothing self-similarity is more helpful in this case, and we integrate the two cues in the form of a Random Forest.

In this paper, we propose to estimate the clothing shapes of group people with a global view of the scene. We treat each person as a node and model their interactions (blocking relationships) as edges. The reason we use such a person-graph (Fig. 1) is that the variation of clothing shapes in group images can be distinguished by individual and context information. (Note that in this scenario human poses do not change much, so clothing shapes are affected mostly by inter-person occlusions.) Pixel-level segmentation can then benefit from this top-down guide.

The key aspect of our method is the use of a Blocking Model for occlusion reasoning. As mentioned before, it is intractable to search for the full layer order in a brute-force way for group images. However, it is not necessary to recover the full order, since the local blocking relationships are informative enough for clothing shape decisions. Thus, in contrast to the traditional layered model used in [22], we partition the full layer ranking problem into a series of pair-wise ones. A Blocking Model combining individual and context information is built to infer the pair-wise layer order.

Taking the inferred blocking results as prior information, we propose a Layout Model, formulated as a Markov Network, to pursue the approximately optimal clothing layout for group people. The ambiguities in overlapping regions are thereby reduced, guiding the bottom-up segmentation algorithm more accurately.

Our main contributions are: (1) an effective Blocking Model to estimate inter-person occlusion, which integrates individual and context information and is capable of scaling to large numbers of people; (2) a Layout Model to generate an approximately optimal clothing layout for group people, conditioned on the inferred blocking relationship.

The rest of the paper is organized as follows: we describe our probabilistic model, including the Layout Model and the Blocking Model, in Sec. 2; introduce the detailed learning and inference algorithms in Sec. 3 and Sec. 4 respectively; and demonstrate experiments in Sec. 5.

2. Multi-Person Clothing Model

Inter-person occlusions in group images are usually severe and dense. An extreme case is shown in Fig. 1, where some people's bodies are completely blocked. In such scenarios, the largely-blocked people can only be located through face detection (rectangles in Fig. 1) rather than human detection.

Given an image and its N detected faces, our algorithm aims to infer the appropriate clothing shape for each person, considering the occlusion between neighbors. The whole algorithm is illustrated in Fig. 2. Let x_i denote the features extracted for each person, including face location and size, the superpixel image [6], and RGB features in the clothing region. Suppose M candidate clothing shapes are generated for each person (as described in Sec. 3, these candidates are generated by a Random Forest); then the shape label y_i ∈ {1, ..., M} selects the clothing shape for each person.
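For concreteness, the following is a minimal sketch (ours, not the authors') of the data structures this formulation implies; the field names and types are our own assumptions:

from dataclasses import dataclass, field

@dataclass
class Person:
    face_xy: tuple                  # face location in the image
    face_size: float                # face scale; clothing shapes are normalized by it
    candidate_shapes: list = field(default_factory=list)  # the M binary shape masks
    shape_label: int = 1            # y_i in {1, ..., M}

@dataclass
class PersonGraph:
    persons: list                                   # feature bundles x_1 ... x_N
    edges: set = field(default_factory=set)         # E: index pairs (i, j) that may occlude
    blocking: dict = field(default_factory=dict)    # b_ij for each (i, j) in E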
The person pairs that may occlude each other are denoted as the edge set E. For any pair of persons, a binary blocking indicator b_ij denotes the blocking relationship: when person i blocks person j, b_ij = 1; otherwise b_ij = 0, and b_ij = 1 − b_ji.

Let x = {x_1, ..., x_N}, y = {y_1, ..., y_N}, and b = {b_ij | (i, j) ∈ E}. The multi-person clothing model maximizes the conditional probability P(y, b | x) over the clothing shapes y and blocking relationships b. The probability can be factored based on Bayes' theorem:

P(y, b | x) = P(y | x, b) P(b | x)    (1)

P(y | x, b) is the probability of each clothing layout y given the blocking information b; we call it the Layout Model. P(b | x) is the probability of the blocking relationships for the current image; we call it the Blocking Model.

2.1. Layout Model

We assume that the person-graph is a Markov Network, so the Layout Model can be factored as:

P(y | x, b) = (1/Z) ∏_i ϕ_i(y_i | x_i)^{β^u} ∏_{(i,j)∈E} ϕ_ij(y_i, y_j | x_i, x_j, b_ij)^{β^p_ij}    (2)

where Z is the partition function that makes P(y | x, b) a distribution, and E denotes the graph structure, which is decided at runtime based on face locations and sizes.

The unary potentials ϕ_i measure the consistency between the clothing shape and the actual edges. We segment the image into superpixels [6], and the region boundaries are used as the actual edges of the image. We perform a naive inference at the pixel and superpixel levels respectively, based on the clothing shape. Suppose the inferred foregrounds at the pixel and superpixel levels are L_px and L_rg respectively; then ϕ_i is defined as their consistency:

ϕ_i = |L_px ∩ L_rg| / |L_px ∪ L_rg|    (3)

A uniform unary weight β^u (Eq. 2) is assumed for all unary potentials.

The pairwise potentials ϕ_ij penalize the conflicting foreground parts of neighboring shapes and encourage a larger foreground for the unoccluded one. Denoting the overlapping region as A, ϕ_ij is calculated as:

ϕ_ij(y_i, y_j | b_ij) = exp(−ϕ^c_ij − ϕ^b_ij)    (4)

where

ϕ^c_ij = ∑_k y_i^k y_j^k / |A|    (5)

ϕ^b_ij = (∑_k y_j^k b_ij + ∑_k y_i^k (1 − b_ij)) / (∑_k y_i^k + ∑_k y_j^k)    (6)

Here y_i^k is the binary shape indicator at pixel k of A (likewise y_j^k), and |A| is the size of the overlapping region. β^p_ij is the weight of the pairwise potential; in our method it represents the confidence of the blocking relationship, so we set β^p_ij = P(b_ij), where P(b_ij) is the blocking probability calculated by Eq. 10. In Eq. 5, the numerator accumulates the number of pixels where the occluded and unoccluded shapes are foreground simultaneously; this foreground conflict is what creates ambiguity for bottom-up segmentation. In Eq. 6, we encourage a larger foreground for the unblocked person in the overlapping region by computing the foreground size proportions. For both potentials, smaller values are preferred.
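As an illustration, the potentials of Eqs. 3-6 can be computed directly on binary masks. The following NumPy sketch is our reading of the equations, not the authors' code; it assumes boolean arrays of the image size and a boolean mask selecting the overlapping region A:

import numpy as np

def unary_potential(fg_pixel, fg_superpixel):
    # Eq. 3: consistency (IoU) of pixel- and superpixel-level foregrounds
    inter = np.logical_and(fg_pixel, fg_superpixel).sum()
    union = np.logical_or(fg_pixel, fg_superpixel).sum()
    return inter / union if union else 0.0

def pairwise_potential(yi, yj, overlap, b_ij):
    # Restrict both binary shapes to the overlapping region A
    yi_a, yj_a = yi[overlap], yj[overlap]
    phi_c = np.logical_and(yi_a, yj_a).sum() / overlap.sum()   # Eq. 5: conflict
    # Eq. 6: the numerator is the foreground of the *blocked* person,
    # so a large foreground for the unblocked person is preferred
    blocked_fg = yj_a.sum() if b_ij == 1 else yi_a.sum()
    denom = yi_a.sum() + yj_a.sum()
    phi_b = blocked_fg / denom if denom else 0.0
    return np.exp(-phi_c - phi_b)                              # Eq. 4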
2.2. Blocking Model

Heuristically, in a group image the person with a lower face location is more likely to block one with a higher face location. However, estimating the occlusion relationship from relative face locations is not as reliable as from human body locations [22]. For example, the 0-th and 2-nd face locations in Fig. 1 are lower than their neighbors', yet those people actually stand behind them. This is not a rare case, since people in group images prefer to stand in rows. A helpful observation is that unoccluded clothing is usually symmetric, or contains more repetitive patterns. These two cues are used in our model to estimate the blocking probability.

We assume that the blocking models for all pairs of neighbors are independent of each other, so P(b | x) can be factored as:

P(b | x) = ∏_{(i,j)∈E} P(b_ij | x_i, x_j)    (7)

The blocking probability between a pair of persons takes the form of a random forest, as shown in Fig. 3.
3. Model Learning

We learn the model using a supervised algorithm based on a manually labeled data set. Each person is labeled separately, and the blocking relationships are generated from the segmentation labels. For nearby persons whose clothing regions (estimated from face location and size) overlap each other, we accumulate their foreground sizes in the overlapping region and determine the blocking relationship b^g_ij and confidence P(b^g_ij) from the foreground proportion:

b^g_ij = [ ∑_k y_i^k > ∑_k y_j^k ],   P(b^g_ij) = max(∑_k y_i^k, ∑_k y_j^k) / (∑_k y_i^k + ∑_k y_j^k)    (8)
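For illustration, Eq. 8 reduces to a few mask operations. The sketch below assumes the ground-truth masks are stored as boolean arrays; it is not the original implementation:

import numpy as np

def ground_truth_blocking(yi, yj, overlap):
    # Foreground sizes of the two labeled masks inside region A
    si = int(yi[overlap].sum())
    sj = int(yj[overlap].sum())
    b_g = int(si > sj)                               # who blocks whom (Eq. 8)
    conf = max(si, sj) / (si + sj) if si + sj else 0.5   # P(b^g_ij)
    return b_g, conf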
Figure 3. (a) Weak features and decision tree for clothing shape sampling; (b) weak features and decision tree for blocking distribution modeling.

3.1. Clothing Shape

As stated in Sec. 2, M clothing shapes should be sampled for each individual person. We learn this model using a Random Forest, since it is very efficient and straightforward to implement [21]. The weak feature H is designed based on the self-similarity properties of clothing (Fig. 3(a)). Specifically, rectangular patches are densely sampled in the clothing regions, and the weak feature is defined as the histogram similarity (χ² distance) between any pair of local patches. Since histogram similarity is continuous, the training data in each non-terminal node are partitioned by comparing the current weak feature with predefined thresholds.

Each decision tree is a binary tree (illustrated in Fig. 3(a)), where the terminals (solid nodes in Fig. 3) record the clothing shapes that fall into them. In contrast to a conventional decision tree, information gain is not available for a high-dimensional clothing shape. Alternatively, we use the Hamming distance between shapes to define the decision tree metrics. The Hamming distance between any two binary shapes is d(c_i, c_j) = ∑_m [c_i^m ≠ c_j^m], where c_i and c_j are binary shape images (all clothing shapes are normalized to the same size based on face size). Let c = {c_1, ..., c_n} be a shape set; then the distance (dissimilarity) of the set is defined as d(c) = max_{i,j} d(c_i, c_j). The feature in each non-terminal node is selected as the one whose partition is the most consistent, i.e.:

H̃ = arg min_{H,τ} ( d(c⁺_{H,τ}) + d(c⁻_{H,τ}) )    (9)

where c⁺_{H,τ} and c⁻_{H,τ} are the subsets produced by comparing the response of H with the threshold τ. The learning process halts when the remaining clothing shapes are consistent enough (d(c) < 0.1) or there are no available weak features.

A set of decision trees is learned using a random subset [4] of the weak features for each node, and each tree is trained on a different bootstrapped dataset. Since the features used in the Random Forest capture the self-similarity of clothing, the output clothing shapes can address occlusion moderately well (see Fig. 6).
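The following sketch illustrates the shape-set dissimilarity and the split selection of Eq. 9 under our own assumptions about array layout; an outer loop over the sampled weak features, not shown, would keep the overall minimizer:

import numpy as np

def set_dissimilarity(shapes):
    # d(c): the largest pairwise Hamming distance within a set of binary
    # shape images (n x H x W), all normalized by face size
    if len(shapes) < 2:
        return 0
    flat = shapes.reshape(len(shapes), -1)
    worst = 0
    for i in range(len(flat) - 1):
        # d(c_i, c_j) = number of disagreeing pixels, for all j > i at once
        worst = max(worst, int((flat[i] != flat[i + 1:]).sum(axis=1).max()))
    return worst

def select_split(responses, shapes, thresholds):
    # Eq. 9 for one weak feature H: pick the threshold tau whose two
    # subsets c+ and c- are the most self-consistent
    best_tau, best_score = None, np.inf
    for tau in thresholds:
        score = (set_dissimilarity(shapes[responses <= tau]) +
                 set_dissimilarity(shapes[responses > tau]))
        if score < best_score:
            best_tau, best_score = tau, score
    return best_tau, best_score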
3.2. Blocking Distribution

We learn the blocking distribution as a Random Forest too. Since it is basically a binary classification problem, we use a traditional decision tree as its elemental model.

For each decision tree, a blocking distribution between neighbors is learned based on relative weak features, as shown in Fig. 3(b). Here we set the right-then-upper person as the reference one. The first kind of feature is related to the one used in Sec. 3.1: for a pair of neighbors, H1 and H2 are calculated for each person respectively, and the similarity difference H1 − H2 is used as the weak feature. The second kind of feature is the face angle θ and face distance d. Again, in the non-terminal nodes these continuous features are used by comparing with predefined thresholds, and the one with the maximum information gain is selected. The terminal nodes maintain the probability that person A blocks person B using non-parametric models, i.e., we count the blocking and unblocking frequencies of the training samples that fall into them.

A set of decision trees is learned using a random subset [4] of the weak features for each node, and each tree is trained on a different bootstrapped dataset. The final blocking distribution is obtained by averaging the outputs of all trees:

P(b_ij | x_i, x_j) = (1/N_bt) ∑_{k=1}^{N_bt} P_k(b_ij | x_i, x_j)    (10)

where N_bt is the number of blocking decision trees and P_k(b_ij | x_i, x_j) is the blocking distribution obtained by a single decision tree.
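A minimal sketch of Eq. 10; leaf_probability is a hypothetical accessor for the blocking frequency stored at the terminal node a sample reaches, since the paper does not name such an interface:

import numpy as np

def blocking_probability(trees, rel_features):
    # Eq. 10: average the per-tree estimates P_k(b_ij | x_i, x_j).
    # rel_features holds the relative weak features of a neighbor pair
    # (H1 - H2 similarity differences, face angle theta, face distance d).
    votes = [tree.leaf_probability(rel_features) for tree in trees]
    return float(np.mean(votes))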
4. Model Inference

Given an image and its face detection results, we aim to infer the clothing layout and also the blocking relationships for the group.

Figure 4. To get a detailed view of the performance of our blocking model, we partition the testing samples into groups based on the proportion of the foreground sizes of neighbors in the overlapping region (min(A, B)/(A + B), referring to Eq. 6); the smaller the proportion, the more likely they block each other. Based on the plot, our algorithm performs well for severe blocking cases (88%). For the inaccurate proportions, we observe that the blocking relationship is not so obvious (see the representative images in the figure), so we depress the inaccurate effect by incorporating the confidence into the pairwise potential functions (Eq. 4), since the confidence (red curve) is highly correlated with the accuracy (blue curve).
4.1. Data Preparing

Based on the face locations, we estimate the possible clothing regions as rectangles. The edges E of the Layout Model are determined for neighboring rectangles with large enough overlapping regions (10% of the smaller rectangle). For each person, we obtain the candidate clothing shapes using the clothing shape random forest learned in Sec. 3.1. For each pair of neighbors, we obtain the blocking distribution using Eq. 10.

4.2. Blocking Relationship

The optimal blocking relationship b̃ can be obtained by maximizing the individual local blocking distributions, i.e.,

b̃ = arg max_b ∏_{(i,j)∈E} P(b_ij | x_i, x_j)    (11)

Yang et al. [22] search for the optimal ordering among all possible permutations. In our method, we simply use the MAP solution, since we observe that uncertain blocking relationships usually occur when neighbors stand close but do not block each other (as shown in Fig. 4), which means that "who blocks who" is not critical in these cases. We incorporate the confidence P(b_ij | x_i, x_j) into the Layout Model through the pairwise weight in Eq. 4, which decreases the effect of the neighbor constraints when the relationship is not confident.
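Because Eq. 7 factorizes over edges, the maximization in Eq. 11 decomposes into independent per-edge decisions, as in this sketch (ours, not the authors' code):

def map_blocking(edges, blocking_prob):
    # Eq. 11: per-edge argmax over the independent pairwise distributions
    b_tilde, confidence = {}, {}
    for (i, j) in edges:
        p = blocking_prob[(i, j)]           # P(b_ij = 1 | x_i, x_j) from Eq. 10
        b_tilde[(i, j)] = int(p >= 0.5)     # who blocks whom
        confidence[(i, j)] = max(p, 1 - p)  # reused as the pairwise weight in Eq. 2
    return b_tilde, confidence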
4.3. Sampling Layout

Given the blocking configuration b̃, we could approximate the MAP solution of Eq. 2 by Loopy Belief Propagation. However, in our method we propose to sample the layout instead of taking a single MAP solution, for two reasons. First, as suggested by Kumar et al. [15], when we use the top-down model as latent variables to guide the bottom-up image segmentation, the expectation of the log likelihood of an MRF can be efficiently optimized by a single graph cut; importantly, they argue that multiple samples are necessary for difficult cases in which the RGB distribution of the background is similar to that of the object. The second reason comes from an insight into our model: in Sec. 3.1, a clothing shape random forest is learned for each person individually, and a random forest increases its generalization accuracy nearly monotonically with the number of decision trees [4]; the final voting over outputs makes the model more robust to noise. A MAP inference that selects only one result from all tree outputs would therefore decrease the generalization ability. From this point of view, sampling from the Layout Model, where the interaction of neighbors is taken into account, is a novel and effective voting method to improve the individual random forest performance.

For convenience of sampling, we use a sum-product algorithm for MAP inference [14] and then sample multiple solutions around the local (global for a tree-graph) minimum by Gibbs sampling. Here we approximate the posterior distribution of the layout solution as in [15]:

p(y) = ∏_{(i,j)∈E} B_ij(y_i, y_j) / ∏_i B_i(y_i)^{q_i − 1}    (12)

where B_i and B_ij are the unary and pairwise beliefs calculated by the sum-product algorithm, and q_i is the number of neighbors of node i.

5. Experiments

We evaluate our method on a publicly available data set of group images [8]. All the images, with heavy occlusions between neighbors, were downloaded from Flickr using different key words: wedding, family, and group images. We manually labeled 281 images with 1051 persons. We randomly partition the ground-truth dataset into two halves, train our model on one half, and test on the other. In our experiments, clothing is labeled as any covering of the torso, including naked shoulders and arms.

Implementation: We detect face location and scale using the face detector of [12] and normalize the images so that each face is 24 × 24. β^u, the weight of the unary potentials in Eq. 2, is set to 0.4 empirically by cross-validation. The Random Forests for clothing shape and blocking relationship each consist of 50 decision trees. For each node, we randomly select √N_f [2] weak clothing self-similarity features (relative face location features are used for all nodes), where N_f is the number of weak features (around 300,000). The rectangles used in the clothing self-similarity features are 4 × 4 and are densely sampled with a step of 2. We calculate the upper and lower bounds of the feature values for each node, partition the interval into 10 pieces, and then select the optimal threshold among them. No pruning is applied to the decision trees.
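For readers who want a reference point, the configuration roughly corresponds to the following scikit-learn settings; this is only an analogy, since the actual forests use the custom shape and blocking criteria of Sec. 3:

from sklearn.ensemble import RandomForestClassifier

blocking_forest = RandomForestClassifier(
    n_estimators=50,        # 50 decision trees per forest
    max_features="sqrt",    # randomly try about sqrt(N_f) weak features per node
    bootstrap=True,         # each tree is trained on a bootstrapped dataset
)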
A sketch of our layout optimization algorithm is shown in Algorithm 1. Only the original image data and face detection results are required. The algorithm outputs a probabilistic mask for each person, providing a top-down guide for further pixelwise segmentation.

Algorithm 1 Layout optimization for group images
Input: image data and face detection results
Output: probabilistic clothing masks
[Data Preparing]
1: Determine the inter-person relationships based on face locations and construct the graph;
2: Generate shape candidates {y_i} for each person based on the clothing model in Sec. 3.1;
[Model Inference]
1: Estimate the blocking relationships {b_ij} and their confidences {P(b_ij)};
2: Optimize Eq. 2 by Loopy Sum-Product Belief Propagation to get the MAP inference result, the unary beliefs {B_i}, and the pairwise beliefs {B_ij};
3: Generate multiple solutions around the MAP one by Gibbs sampling (Eq. 12);
4: Average the multiple clothing shapes for each person to generate the probabilistic clothing mask;

Figure 5. Detailed comparison of the blocking model. Left: blocking accuracy with respect to the number of trees. Right: forest size with respect to the number of trees. The performance is evaluated using only face information, only clothing information, and both (full model).
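The pipeline of Algorithm 1 could be rendered as the following Python driver, reusing map_blocking from the sketch in Sec. 4.2; every other helper here (build_person_graph, loopy_sum_product, gibbs_sample, average_masks) is hypothetical, and the sample count is our placeholder:

def layout_optimization(image, faces, clothing_forest, blocking_forest):
    # [Data Preparing]: steps 1-2 of Algorithm 1
    graph = build_person_graph(faces)        # edges from rectangle-overlap test
    for person in graph.persons:
        person.candidate_shapes = clothing_forest.sample(image, person)  # Sec. 3.1

    # [Model Inference]: steps 1-4 of Algorithm 1
    prob = {e: blocking_forest.probability(image, e) for e in graph.edges}
    blocking, conf = map_blocking(graph.edges, prob)     # Eq. 11
    beliefs = loopy_sum_product(graph, blocking, conf)   # MAP of Eq. 2 + beliefs
    layouts = gibbs_sample(beliefs, num_samples=30)      # draws around Eq. 12
    return average_masks(layouts)   # probabilistic clothing mask per person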
To benefit the bottom-up segmentation, we incorporate the top-down cues as a unary term in a CRF-based framework [20]. Given the clothing shapes sampled for each person via Eq. 12, we obtain the multi-person clothing mask by a Bayesian fusion:

p(l_i = 0) = ∏_{m=1}^{M} p(l_i = 0 | y_m)    (13)

p(l_i = m) = [ p(l_i = 1 | y_m) / ∑_k p(l_i = 1 | y_k) ] · (1 − p(l_i = 0))    (14)

where l_i ∈ {0, 1, ..., M} is the label of pixel i, in which 0 is the background and {1, ..., M} index the different persons. Intuitively, l_i = 0 if and only if pixel i is background in every person's clothing mask.
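A NumPy sketch of the fusion in Eqs. 13-14, assuming the per-person foreground probability maps are stacked into one array (our layout, not the authors'):

import numpy as np

def fuse_masks(person_probs):
    # person_probs[m] is p(l_i = 1 | y_m) for every pixel, shape (M, H, W).
    # Returns an (M + 1, H, W) label posterior with channel 0 = background.
    p_bg = np.prod(1.0 - person_probs, axis=0)       # Eq. 13
    norm = person_probs.sum(axis=0)
    norm[norm == 0] = 1.0                            # guard empty pixels
    p_person = person_probs / norm * (1.0 - p_bg)    # Eq. 14
    return np.concatenate([p_bg[None], p_person], axis=0)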
Clothing and background color models (histograms in RGB space) are estimated based on the clothing masks. Shape and color cues are integrated with superpixel-level constraints [6]. The final segmentation result is obtained by optimizing a Dual-Level Conditional Random Field [20].

Blocking Modeling: The blocking performance is evaluated for the pairs whose confidence P(b^g_ij) > 0.7, since the blocking relationship is not so critical for the others, as shown in Fig. 4.

Experiments are performed with different configurations: only face information, only clothing information, and both (our full model). Based on the curves in Fig. 5, we find that face information achieves a baseline accuracy, since it portrays the basic structure of the group. However, it cannot handle the occlusion among several people standing abreast, which limits its capability for blocking relationship prediction. Moreover, its performance does not increase with the number of trees, since there are actually not many available features for our random selection.

Both clothing information and our full model achieve better accuracies with a large enough number of trees. However, the experiments show that to achieve the same performance, the full model needs fewer trees. In addition, as shown in the right plot of Fig. 5, the size of the random forest based on both kinds of information is much smaller than that based on clothing information alone. The conclusion is that the clothing information captures the blocking relationship between persons, while the face information enhances it by reducing the required number of trees and nodes. As a comparison, we implemented the method of [7]; however, its MAP result only achieves 56.2% accuracy. The reason may be that in our dataset people stand closer, which reduces the power of the Face Relative Size cue, and moreover no self-similarity features are exploited.

Layout Modeling: We investigate the performance of the Layout Model by segmentation accuracy on the manually labeled dataset.

Tab. 1 reports the pixelwise occlusion error (the proportion of pixels assigned to wrong persons) and the segmentation accuracy. We apply three different methods to segment the clothing, which gives a detailed view of the performance of our model. As a baseline, we average over all clothing shapes to get a global mask and simply scale this probabilistic mask for each person with respect to the face size; in this case, no blocking information is used.
              Occlusion Error   Pixel Accuracy
Single Mask        6.8%             88.2%
CF                 5.9%             90.7%
CF + BF            4.6%             92.8%

Table 1. Segmentation performance comparison. Single Mask: averaging over all clothing shapes, which uses no occlusion cues. CF: clothing shape forest only, which addresses blocking information implicitly. CF + BF: Clothing Forest + Blocking Forest (integrated by the Layout Model), which addresses the blocking relationship explicitly.

The performance can be improved by using our Clothing Forest model, which is constructed from clothing self-similarity features. As stated before, unoccluded clothing usually contains more repeatable patterns, so the Clothing Forest in fact incorporates blocking information implicitly, and occlusion between persons can be handled moderately; an example is shown in Fig. 6, marked with an orange single-boundary rectangle. However, when assigning a putative clothing shape to each person, the Clothing Forest still works independently for each person. Our Layout Model enhances this procedure by optimizing the layout with a global view of the entire group; blocking information is captured by the Blocking Forest and embedded as pairwise constraints.

Some comparative segmentation results produced by the three methods are shown in Fig. 6. With the help of blocking information, our segmentation algorithm produces much better object-level boundaries even when the clothing of different persons is very similar (the occluded parts are marked by double-boundary rectangles). We also report the pixelwise segmentation accuracy under each result. Although the quantitative improvement appears marginal, for a segmentation problem a small increment in pixelwise accuracy can produce a remarkable qualitative improvement, as also noted in [19][13]. We also compare with GrabCut [17] (the OpenCV implementation [2]) in Fig. 7. Its initial rectangle is generated from the global clothing template; without blocking information, GrabCut may fail for blocked people.

Figure 6. Comparative segmentation results: (a) input image and face detections; (b) results using a single clothing mask (pixel accuracies 86.5% and 88.5%); (c) results using the clothing shape random forest directly (88.0% and 91.0%); (d) results incorporating the blocking relationship (90.3% and 93.2%).

Figure 7. Comparison experiment with GrabCut. The initial rectangle (red) for GrabCut is generated by our algorithm.

We show additional segmentation results on the data set [8] in Fig. 8. As stated before, we partition the full layer sorting problem into pairwise ones, so our algorithm is capable of dealing with large numbers of persons; Fig. 8 shows such examples, where our algorithm produces good segmentations even in crowded scenes. Note that inaccurate segmentations occur in the arm parts, since the color of skin differs from that of clothing and we do not filter it out specifically.

Figure 8. Additional segmentations. For each pair of images, the first is the input image with detected faces and the second is the segmentation result produced by our algorithm. The last two rows show expressive segmentation results for crowded scenes.

6. Conclusion

In this paper, we have proposed a novel approach to the multi-person clothing segmentation problem. Blocking information is incorporated in a principled way to benefit a clothing shape sampling procedure. Experiments on severely crowded group images with arbitrary numbers of people show that our approach outperforms conventional methods and yields expressive segmentation results.

Currently, the model assumes that there are no serious pose variations. Thus, there is clearly room for improvement, particularly in combination with human pose estimation (e.g., multi-person pose estimation [7], or simultaneous pose estimation and human segmentation [3][1]).

7. Acknowledgements

This work is supported in part by the National Science Foundation of China under grant No. 61075026 and the National Basic Research Program of China under grant No. 2011CB302203, and also by a grant from HP Corporation.

References

[1] L. Bourdev, S. Maji, T. Brox, and J. Malik. Detecting people using mutually consistent poselet activations. In Proc. ECCV, 2010.
[2] G. Bradski. The OpenCV library. Dr. Dobb's Journal of Software Tools, 2000.
[3] M. Bray, P. Kohli, and P. H. S. Torr. PoseCut: Simultaneous segmentation and 3D pose estimation of humans using dynamic graph-cuts. In Proc. ECCV, 2006.
[4] L. Breiman. Random forests. Machine Learning, 2001.
[5] H. Chen, Z. J. Xu, Z. Q. Liu, and S. C. Zhu. Composite templates for cloth modeling and sketching. In Proc. CVPR, 2006.
[6] Y. Deng and B. S. Manjunath. Unsupervised segmentation of color-texture regions in images and videos. PAMI, 2001.
[7] M. Eichner and V. Ferrari. We are family: Joint pose estimation of multiple persons. In Proc. ECCV, 2010.
[8] A. Gallagher and T. Chen. Understanding groups of images of people. In Proc. CVPR, 2009.
[9] A. C. Gallagher and T. Chen. Clothing cosegmentation for recognizing people. In Proc. CVPR, 2008.
[10] D. Hoiem, C. Rother, and J. Winn. 3D LayoutCRF for multi-view object class recognition and segmentation. In Proc. CVPR, 2007.
[11] Z. Hu, H. Yan, and X. Lin. Clothing segmentation using foreground and background estimation based on the constrained Delaunay triangulation. PR, 2007.
[12] C. Huang, H. Ai, Y. Li, and S. Lao. High-performance rotation invariant multiview face detection. PAMI, 2007.
[13] P. Kohli, L. Ladicky, and P. H. S. Torr. Robust higher order potentials for enforcing label consistency. In Proc. CVPR, 2008.
[14] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[15] M. P. Kumar, P. H. S. Torr, and A. Zisserman. OBJCUT: Efficient segmentation using top-down and bottom-up cues. PAMI, 2010.
[16] M. W. Lee and I. Cohen. A model-based approach for estimating human 3D poses in static images. PAMI, 2006.
[17] C. Rother, V. Kolmogorov, and A. Blake. "GrabCut": Interactive foreground extraction using iterated graph cuts. SIGGRAPH, 2004.
[18] J. Sivic, C. L. Zitnick, and R. Szeliski. Finding people in repeated shots of the same scene. In Proc. BMVC, 2006.
[19] M. Szummer, P. Kohli, and D. Hoiem. Learning CRFs using graph cuts. In Proc. ECCV, 2008.
[20] N. Wang and H. Ai. A compositional exemplar-based model for hair segmentation. In Proc. ACCV, 2010.
[21] J. Winn and J. Shotton. The layout consistent random field for recognizing and segmenting partially occluded objects. In Proc. CVPR, 2006.
[22] Y. Yang, S. Hallman, D. Ramanan, and C. Fowlkes. Layered object detection for multi-class segmentation. In Proc. CVPR, 2010.
