A Structural Support Vector Method For Extracting Contexts and Answers of Questions From Online Forums


Wen-Yun Yang†∗, Yunbo Cao†‡, Chin-Yew Lin‡
†Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
‡Microsoft Research Asia, Beijing, China
wenyun.yang@gmail.com {yunbo.cao; cyl}@microsoft.com
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 514–523, Singapore, 6-7 August 2009. © 2009 ACL and AFNLP

∗ This work was done while the first author visited Microsoft Research Asia.

Abstract

This paper addresses the issue of extracting contexts and answers of questions from post discussions of online forums. We propose a novel and unified model by customizing the structural Support Vector Machine method. Our customization has several attractive properties: (1) it gives a comprehensive graphical representation of thread discussion; (2) it designs special inference algorithms instead of general-purpose ones; (3) it can be readily extended to different task preferences by varying loss functions. Experimental results on a real data set show that our methods are both promising and flexible.

1 Introduction

Recently, extracting questions, contexts and answers from post discussions of online forums has attracted increasing academic attention (Cong et al., 2008; Ding et al., 2008). The extracted knowledge can be used either to enrich the knowledge base of community question answering (QA) services such as Yahoo! Answers or to augment the knowledge base of chatbots (Huang et al., 2007).

Figure 1 gives an example of a forum thread with questions, contexts and answers annotated. This thread contains three posts and ten sentences, among which three questions are discussed. The three questions are posed in three sentences, S3, S5 and S6. The context sentences S1 and S2 provide contextual information for question sentence S3. Similarly, the context sentence S4 provides contextual information for question sentences S5 and S6. There are three question-context-answer triples in this example, (S3) − (S1, S2) − (S8, S9), (S5) − (S4) − (S10) and (S6) − (S4) − (S10). As shown in the example, a forum question usually requires contextual information to complement its expression. For example, the question sentence S3 would be of incomplete meaning without the contexts S1 and S2, since the important keyword pet friendly would be lost.

Post1: <context id=1> S1: Hi I am looking for a pet friendly hotel in Hong Kong because all of my family is going there for vacation. S2: my family has 2 sons and a dog. </context> <question id=1> S3: Is there any recommended hotel near Sheung Wan or Tsing Sha Tsui? </question> <context id=2, 3> S4: We also plan to go shopping in Causeway Bay. </context> <question id=2> S5: What’s the traffic situation around those commercial areas? </question> <question id=3> S6: Is it necessary to take a taxi? </question> S7: Any information would be appreciated.
Post2: <answer id=1> S8: The Comfort Lodge near Kowloon Park allows pet as I know, and usually fits well within normal budgets. S9: It is also conveniently located, nearby the Kowloon railway station and subway. </answer>
Post3: <answer id=2, 3> S10: It’s very crowd in those areas, so I recommend MTR in Causeway Bay because it is cheap to take you around. </answer>

Figure 1: An example thread with three posts and ten sentences

The problem of extracting questions, contexts, and answers can be solved in two steps: (1) identify questions and then (2) extract contexts and answers for them. Since identifying questions from forum discussions is already well solved in (Cong et al., 2008), in this paper we focus on step (2) and assume that the questions have already been identified.

Previously, Ding et al. (2008) employed general-purpose graphical models without any customization to the specific extraction problem (step 2). In this paper, we improve the existing models in
three aspects: graphical representation, inference algorithm and loss function.

Graphical representation. We propose a more comprehensive and unified graphical representation to model the thread for relational learning. Our graphical representation has two advantages over previous work (Ding et al., 2008): unifying sentence relations and incorporating question interactions.

Three types of relation should be considered for context and answer extraction: (a) relations between successive sentences (e.g., context sentence S2 occurs immediately before question sentence S3); (b) relations between context sentences and answer sentences (e.g., context S4 presents the phrase Causeway Bay, which links to the answer but is absent from question S6); and (c) relations between multiple labels for one sentence (e.g., one question sentence is unlikely to be the answer to another question, although one sentence can serve as context for more than one question). Our proposed graphical representation improves the modeling of the three types of sentence relation (Section 2.2).

Certain interactions exist among questions. For example, question sentences S5 and S6 interact by sharing context sentence S4. Our proposed graphical representation can naturally model the interactions. Previous work (Ding et al., 2008) performs the extraction of contexts and answers in multiple passes of the thread (with each pass corresponding to one question), which cannot address the interactions well. In comparison, our model performs the extraction in one pass of the thread.

Inference algorithm. Inference is usually a time-consuming process for structured prediction. We design special inference algorithms, instead of the general-purpose inference algorithms used in previous work (Cong et al., 2008; Ding et al., 2008), by taking advantage of special properties of our task. Specifically, we utilize two special properties of thread structure to reduce the inference (time) cost. First, context sentences and question sentences usually occur in the same post while answer sentences can only occur in the following posts. With this property, we can greatly reduce the context (or answer) candidate sets of a question, which results in a significant decrease in inference cost (Section 3). Second, the context candidate set is usually much smaller than the number of sentences in a thread. This property enables our proposal to have an exact and efficient inference (Section 4.1). Moreover, an approximate inference algorithm is also given (Section 4.2).

Loss function. In practice, different application settings usually imply different requirements for system performance. For example, we expect a higher recall for the purpose of archiving questions but a higher precision for the purpose of retrieving questions. A flexible framework should be able to cope with various requirements. We employ a structural Support Vector Machine (SVM) model that can naturally incorporate different loss functions (Section 5).

We use a real data set to evaluate our approach to extracting contexts and answers of questions. The experimental results show both the effectiveness and the flexibility of our approach.

In the next section, we formalize the problem of context and answer extraction and introduce the structural model. In Sections 3, 4 and 5 we give the details of customizing the structural model for our task. In Section 6, we evaluate our methods. In Section 7, we discuss the related work. Finally, we conclude this paper in Section 8.

2 Problem Statement

We first introduce our notations in Section 2.1 and then in Section 2.2 introduce how we model the problem of extracting contexts and answers for questions with a novel form of graphical representation. In Section 2.3 we introduce the structured model based on the new representation.

2.1 Notations

Assume that a given thread contains p posts $\{p_1, \ldots, p_p\}$, which are authored by a set of users $\{u_1, \ldots, u_p\}$. The p posts can be further segmented into n sentences $x = \{x_1, \ldots, x_n\}$. Among the n sentences, m question sentences $q = \{x_{q_1}, \ldots, x_{q_m}\}$ have been identified. Our task is to identify the context sentences and the answer sentences for those m question sentences. More formally, we use four types of label {C, A, Q, P} to stand for context, answer, question and plain labels. Then, our task is to predict an m × n label matrix $y = (y_{ij})_{1 \le i \le m,\, 1 \le j \le n}$, except the m elements $\{y_{1,q_1}, \ldots, y_{m,q_m}\}$, which correspond to (known) question labels. The element $y_{ij}$ in label matrix y represents the role that the jth sentence plays for the ith question. We denote the ith row and jth column of the label matrix y by $y_{i.}$ and $y_{.j}$.
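As a concrete illustration of the notation in Section 2.1, the following minimal Python sketch builds the 3 × 10 label matrix for the thread in Figure 1. The helper name `build_label_matrix` and its 1-based index lists are illustrative choices, not part of the original formulation. Labelling the sentences of the other questions as plain (P) within a row is likewise an assumption made only for this example.

```python
from typing import List, Sequence


def build_label_matrix(
    n: int,
    questions: Sequence[int],
    contexts: Sequence[Sequence[int]],
    answers: Sequence[Sequence[int]],
) -> List[List[str]]:
    """Return the m x n label matrix y; all sentence indices are 1-based."""
    y = [["P"] * n for _ in questions]   # start from plain labels everywhere
    for i, q in enumerate(questions):
        y[i][q - 1] = "Q"                # the known question label y_{i,q_i}
        for j in contexts[i]:
            y[i][j - 1] = "C"
        for j in answers[i]:
            y[i][j - 1] = "A"
    return y


# Thread of Figure 1: questions S3, S5 and S6 (hypothetical encoding).
y = build_label_matrix(
    n=10,
    questions=[3, 5, 6],
    contexts=[[1, 2], [4], [4]],
    answers=[[8, 9], [10], [10]],
)

row_1 = y[0]                       # y_{1.}: roles of all sentences for question 1
col_4 = [row[3] for row in y]      # y_{.4}: roles of sentence S4 for each question

for row in y:
    print("".join(row))
```

Under these assumptions, the column for S4 carries the labels P, C and C, which is exactly the shared-context interaction between questions S5 and S6 described in the introduction.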
Figure 2: Structured models. Panels: (a) skip-chain model; (b) complete skip-chain model; (c) 2D model; (d) label group model. In panels (a) and (b), the label nodes y1, ..., y7 over sentences x1, ..., x7 carry the allowed label sets {C, P}, {C, P}, {C, P}, {Q}, {P}, {A, P}, {A, P}. In panels (c) and (d), the nodes form the m × n label matrix y11, ..., ymn.
2.2 Graphical Representation

Recently, Ding et al. (2008) used skip-chain and 2D Conditional Random Fields (CRFs) (Lafferty et al., 2001) to perform the relational learning for context and answer extraction. The skip-chain CRFs (Sutton and McCallum, 2004; Galley, 2006) model the long distance dependency between context and answer sentences and the 2D CRFs (Zhu et al., 2005) model the dependency between contiguous questions. The graphical representations of those two models are shown in Figures 2(a) and 2(c), respectively. Those two CRFs are both extensions of the linear chain CRFs for the sake of powerful relational learning. However, directly using the skip-chain and 2D CRFs without any customization has obvious disadvantages: (a) the skip-chain model does not model the dependency between an answer sentence and multiple context sentences; and (b) the 2D model does not model the dependency between non-contiguous questions.

To better model the problem of extracting contexts and answers of questions, we propose two more comprehensive models, the complete skip-chain model and the label group model, to improve the capability of the two previous models. These two models are shown in Figures 2(b) and 2(d).

In Figures 2(a) and 2(b), each label node is annotated with its allowed labels, and the labels C, A, Q and P stand for context, answer, question and plain sentence labels, respectively. Note that the complete skip-chain model links every pair of context and answer candidates, and the label group model combines the labels of one sentence into one label group.

2.3 Structured Model

Following the standard machine learning setup, we denote the input and output spaces by $\mathcal{X}$ and $\mathcal{Y}$, then formulate our task as learning a hypothesis function $h : \mathcal{X} \to \mathcal{Y}$ to predict a y when given x. In this setup, x represents a thread of n sentences and m identified questions, and y represents the m × n label matrix to be predicted.

Given a set of training examples, $S = \{(x^{(i)}, y^{(i)}) \in \mathcal{X} \times \mathcal{Y} : i = 1, \ldots, N\}$, we restrict ourselves to the supervised learning scenario. We focus on hypothesis functions that take the form $h(x; w) = \arg\max_{y \in \mathcal{Y}} F(x, y; w)$ with discriminant function $F : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, where $F(x, y; w) = w^T \Psi(x, y)$. As will be introduced in Section 4, we employ structural SVMs (Joachims et al., 2009) to find the optimal parameters w. Structural SVMs have several properties that make them competitive with CRFs. First, they follow the maximum margin strategy, which has been shown to yield competitive or even better
performance (Tsochantaridis et al., 2005; Nguyen and Guo, 2007). Second, they allow users flexible choices of loss functions. Moreover, in general, they have theoretically proved convergence in polynomial time (Joachims et al., 2009).

To use structural SVMs in relational learning, one needs to customize three steps according to specific tasks. The three steps are (a) definition of the joint feature mapping for encoding relations, (b) the algorithm for finding the most violated constraint (inference) for efficient training and (c) definition of the loss function for flexible uses.

In the following Sections 3, 4 and 5, we describe the customizations of the three steps for our context and answer extraction task, respectively.

3 Encoding Relations

We use a joint feature mapping to model the relations between sentences in a thread. For context and answer extraction, the joint feature mapping can be defined as follows,

$$\Psi(x, y) = \begin{bmatrix} \Psi_n(x, y) \\ \Psi_h(x, y) \\ \Psi_v(x, y) \end{bmatrix},$$

where the sub-mappings $\Psi_n(x, y)$, $\Psi_h(x, y)$ and $\Psi_v(x, y)$ encode three types of feature mappings: node features, edge features and label group features. The node features provide the basic information for the output labels. The edge features consist of the sequential edge features and skip-chain edge features for successive label dependencies. The label group features encode the relations within each label group.

Before giving the detailed definitions of the sub-mappings, we first introduce the context and answer candidate sets, which will be used for the definitions and inferences. Each row of the label matrix y corresponds to one question. Assuming that the ith row $y_{i.}$ corresponds to the question with sentence index $q_i$, we have two candidate sets of contexts and answers for this question, denoted by $\mathcal{C}$ and $\mathcal{A}$, respectively. We denote the post indices and the author indices for the n sentences as $p = (p_1, \ldots, p_n)$ and $u = (u_1, \ldots, u_n)$. Then, we can formally define the two candidate sets for $y_{i.}$ as

$$\mathcal{C} = \{\, c_j \mid \underbrace{p_{c_j} = p_{q_i}}_{\text{In Question Post}},\ \underbrace{c_j \neq q_i}_{\text{Not Question Sentence}} \,\},$$

$$\mathcal{A} = \{\, a_j \mid \underbrace{p_{a_j} > p_{q_i}}_{\text{After Question Post}},\ \underbrace{u_{a_j} \neq u_{q_i}}_{\text{Not by the Same User}} \,\}.$$

In the following, we formally define the three feature sub-mappings.

The node feature mapping $\Psi_n(x, y)$ encodes the relations between sentence and label pairs. We define it as follows,

$$\Psi_n(x, y) = \sum_{i=1}^{m} \sum_{j=1}^{n} \psi_n(x_j, y_{ij}),$$

where $\psi_n(x_j, y_{ij})$ is a feature mapping for a given sentence and a label. It can be formally defined as follows,

$$\psi_n(x_j, y_{ij}) = \Lambda(y_{ij}) \otimes \phi_{q_i}(x_j), \qquad (1)$$

where $\otimes$ denotes a tensor product, and $\phi_{q_i}(x_j)$ and $\Lambda(y_{ij})$ denote two vectors. $\phi_{q_i}(x_j)$ contains basic information for the output label. $\Lambda(y_{ij})$ is a 0/1 vector defined as

$$\Lambda(y_{ij}) = [\lambda_C(y_{ij}), \lambda_A(y_{ij}), \lambda_P(y_{ij})]^T,$$

where $\lambda_C(y_{ij})$ equals one if $y_{ij} = C$ and zero otherwise. $\lambda_A(y_{ij})$ and $\lambda_P(y_{ij})$ are similarly defined. Thus, for example, writing out $\psi_n(x_j, y_{ij})$ for $y_{ij} = C$ one gets

$$\psi_n(x_j, y_{ij}) = \begin{bmatrix} \phi_{q_i}(x_j) \\ 0 \\ 0 \end{bmatrix} \begin{matrix} \leftarrow \text{context} \\ \leftarrow \text{answer} \\ \leftarrow \text{plain} \end{matrix}.$$

Note that the node feature mapping does not incorporate the relations between sentences.

The edge feature mapping $\Psi_h(x, y)$ is used to incorporate two types of relation: the relation between successive sentences and the relation between context and answer sentences. It can be defined as follows,

$$\Psi_h(x, y) = \begin{bmatrix} \Psi_{hn}(x, y) \\ \Psi_{hc}(x, y) \end{bmatrix},$$

where $\Psi_{hn}(x, y)$ and $\Psi_{hc}(x, y)$ denote the two types of feature mappings corresponding to sequential edges and skip-chain edges, respectively. The sequential edge mapping is defined as

$$\Psi_{hn}(x, y) = \sum_{i=1}^{m} \sum_{j=1}^{n-1} \psi_{hn}(x_j, x_{j+1}, y_{ij}, y_{i,j+1}).$$
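The candidate sets defined in Section 3 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function name `candidate_sets`, the 0-based sentence indices, and the choice of one distinct author per post for the Figure 1 thread are all illustrative and are not given in the source.

```python
from typing import List, Sequence, Tuple


def candidate_sets(
    post_idx: Sequence[int], user_idx: Sequence[int], q: int
) -> Tuple[List[int], List[int]]:
    """Context and answer candidates (0-based sentence indices) for question q."""
    n = len(post_idx)
    # C: sentences in the question's post, excluding the question sentence itself.
    C = [j for j in range(n) if post_idx[j] == post_idx[q] and j != q]
    # A: sentences in a strictly later post, written by a different user.
    A = [
        j
        for j in range(n)
        if post_idx[j] > post_idx[q] and user_idx[j] != user_idx[q]
    ]
    return C, A


# Figure 1 thread: S1-S7 are in post 1, S8-S9 in post 2, S10 in post 3.
post_idx = [1] * 7 + [2] * 2 + [3]
user_idx = [1] * 7 + [2] * 2 + [3]   # assumed: each post has its own author

C, A = candidate_sets(post_idx, user_idx, q=2)   # question sentence S3
print("context candidates:", C)
print("answer candidates:", A)
```

Note that, read literally, the definition keeps the other question sentences of the same post (here S5 and S6) in the context candidate set, since only the question sentence itself is excluded.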
Dimensions  Descriptions
            φqi(xj) (32 dimensions) in Ψn(x, y)
3           The cosine, WordNet and KL-divergence similarities with the question xqi
3           The cosine, WordNet and KL-divergence similarities with the questions other than xqi
6           The cosine, WordNet and KL-divergence similarities with previous and next sentences
2           Is this sentence xj exactly xqi or one of the questions in {xq1, ..., xqm}?
3           Is this sentence xj in the three beginning sentences?
4           The relative position of this sentence xj to questions
1           Does this sentence xj share the same author with the question sentence xqi?
2           Is this sentence xj in the same post with question sentences?
2           Is this sentence xj in the same paragraph with question sentences?
2           The presence of greeting (e.g., “hi”) and acknowledgement words in this sentence xj
1           The length of this sentence xj
3           The number of nouns, verbs and pronouns in this sentence xj, respectively
            Ψh(x, y) (704 dimensions)
512         For Ψhn(x, y), the above 32 dimension features w.r.t. 4 × 4 = 16 transition patterns
192         For Ψhc(x, y), 12 types of pairwise or merged similarities w.r.t. 16 transition patterns
            Ψv(x, y) (32 dimensions)
16          The transition patterns for any two non-contiguous labels in a label group
16          The transition patterns for any two contiguous labels in a label group

Table 1: Feature descriptions and dimensions
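To make the dimension counts in Table 1 concrete, the sketch below writes out the tensor product of Equation (1) in plain Python and checks the resulting vector lengths. It is only an illustration: the helper names (`kron`, `node_feature`, `edge_feature`) and the ordering of the 16 transition patterns are assumptions, and the 96-dimensional size of a node feature vector is derived here from Equation (1) rather than stated in the table.

```python
from typing import List, Sequence

NODE_LABELS = ("C", "A", "P")       # labels indexed by Lambda(y_ij) in Eq. (1)
ALL_LABELS = ("C", "A", "Q", "P")   # labels behind the 4 x 4 transition patterns


def kron(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Kronecker (tensor) product of two flat vectors."""
    return [ai * bj for ai in a for bj in b]


def node_feature(label: str, phi: Sequence[float]) -> List[float]:
    """psi_n = Lambda(y_ij) (x) phi: phi is copied into the block of `label`."""
    indicator = [1.0 if label == l else 0.0 for l in NODE_LABELS]
    return kron(indicator, phi)


def edge_feature(
    label_a: str, label_b: str, phi_edge: Sequence[float]
) -> List[float]:
    """psi_h = Lambda(y_a, y_b) (x) phi_edge over the 16 transition patterns."""
    patterns = [(s, t) for s in ALL_LABELS for t in ALL_LABELS]  # assumed order
    indicator = [1.0 if (label_a, label_b) == p else 0.0 for p in patterns]
    return kron(indicator, phi_edge)


# A toy 2-dimensional phi shows the block structure for a context label.
print(node_feature("C", [0.5, 0.25]))

# Dimension checks against Table 1 (32 node features, 12 skip-chain features).
dim_node = len(node_feature("A", [0.0] * 32))
dim_hn = len(edge_feature("C", "Q", [0.0] * 32))
dim_hc = len(edge_feature("C", "A", [0.0] * 12))
print(dim_node, dim_hn, dim_hc, dim_hn + dim_hc)
```

With a three-label indicator, a question label maps to the all-zero vector, which matches the fact that question labels are known in advance and are not predicted.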
The skip-chain edge mapping is defined over the complete set of context-answer edges as

$$\Psi_{hc}(x, y) = \sum_{i=1}^{m} \underbrace{\sum_{j \in \mathcal{C}} \sum_{k \in \mathcal{A}}}_{\text{Complete Edges}} \psi_{hc}(x_j, x_k, y_{ij}, y_{ik}),$$

and the two per-edge mappings are

$$\psi_{hn}(x_j, x_{j+1}, y_{ij}, y_{i,j+1}) = \Lambda(y_{ij}, y_{i,j+1}) \otimes \phi_{hn}(x_j, x_{j+1}, y_{ij}, y_{i,j+1}),$$

$$\psi_{hc}(x_j, x_k, y_{ij}, y_{ik}) = \Lambda(y_{ij}, y_{ik}) \otimes \phi_{hc}(x_j, x_k, y_{ij}, y_{ik}),$$

where $\Lambda(y_{ij}, y_{ik})$ is a 16-dimensional vector. It indicates all 4 × 4 pairwise transition patterns of the four types of labels: context, answer, question and plain. Note that, unlike previous work (Ding et al., 2008), we use complete skip-chain (context-answer) edges in $\Psi_{hc}(x, y)$.

The label group feature mapping $\Psi_v(x, y)$ is defined as follows,

$$\Psi_v(x, y) = \sum_{j=1}^{n} \psi_v(x_j, y_{.j}),$$

where $\psi_v(x_j, y_{.j})$ encodes each label group pattern into a vector.

The detailed descriptions and vector dimensions of the features used are listed in Table 1.

4 Structural SVMs and Inference

Given a training set $S = \{(x^{(i)}, y^{(i)}) \in \mathcal{X} \times \mathcal{Y} : i = 1, \ldots, N\}$, we use the structural SVM formulation (Taskar et al., 2003; Tsochantaridis et al., 2005; Joachims et al., 2009), as shown in Optimization Problem 1 (OP1), to learn a weight vector w.

OP 1 (1-Slack Structural SVM)

$$\min_{w,\ \xi \ge 0}\ \frac{1}{2}\|w\|^2 + \frac{C}{N}\,\xi$$

$$\text{s.t.}\ \forall (\bar{y}^{(1)}, \ldots, \bar{y}^{(N)}) \in \mathcal{Y}^N:\quad \frac{1}{N}\, w^T \sum_{i=1}^{N} \left[\Psi(x^{(i)}, y^{(i)}) - \Psi(x^{(i)}, \bar{y}^{(i)})\right] \ \ge\ \frac{1}{N} \sum_{i=1}^{N} \Delta(y^{(i)}, \bar{y}^{(i)}) - \xi,$$

where $\xi$ is a slack variable, $\Psi(x, y)$ is the joint feature mapping and $\Delta(y, \bar{y})$ is the loss function that measures the loss caused by the difference between y and $\bar{y}$. Though OP1 is already a quadratic optimization problem, directly using an off-the-shelf quadratic optimization solver will fail, due to the large number of constraints. Instead, a cutting plane algorithm is used to efficiently solve this problem. For the details of the
structural SVMs, please refer to (Tsochantaridis et al., 2005; Joachims et al., 2009).

The most essential and time-consuming step in structural SVMs is finding the most violated constraint, which is equivalent to solving

$$\arg\max_{y \in \mathcal{Y}}\ w^T \Psi(x^{(i)}, y) + \Delta(y^{(i)}, y). \qquad (2)$$

Without the ability to efficiently find the most violated constraint, the cutting plane algorithm is not tractable.

In the next sub-sections, we introduce the algorithms for finding the most violated constraint, also called loss-augmented inference. The algorithms are essential for the success of customizing structural SVMs to our problem.

4.1 Exact Inference

The exact inference algorithm is designed for a simplified model with the two sub-mappings $\Psi_n$ and $\Psi_h$, that is, without $\Psi_v$.

One naive approach to finding the most violated constraint for the simplified model is to enumerate all the $2^{|\mathcal{C}|+|\mathcal{A}|}$ cases for each row of the label matrix. However, it would be intractable for large candidate sets.

An important property is that the context candidate set is usually much smaller than the total number of sentences in a thread. This property enables us to design an efficient and exact inference algorithm by transforming the original graph representation in Figure 2 to the graphs in Figure 3. This graph transform merges all the nodes in the context candidate set $\mathcal{C}$ into one node with $2^{|\mathcal{C}|}$ possible labels.

Figure 3: The equivalent transform of graphs. Panels: (a) original graph, with label sets {C, P}, {C, P}, {C, P}, {Q}, {P}, {A, P}, {A, P}; (b) transformed graph, in which the three context candidates are merged into one node with labels {PPP, PPC, PCP, PCC, CPP, CPC, CCP, CCC}; (c) decomposed graph, with one copy of the chain {Q}, {P}, {A, P}, {A, P} per merged label, from {PPP} to {CCC}.

Algorithm 1 Exact Inference Algorithm
1: Input: (C_i, A_i) for each q_i, w, x, y
2: for i ∈ {1, ..., m} do
3:   for C_s ⊆ C_i do
4:     [R(C_s), ȳ_i.(C_s)] ← Viterbi(w, x; C_s)
5:   end for
6:   C_s* = arg max_{C_s ⊆ C_i} R(C_s)
7:   ȳ_i.* = ȳ_i.(C_s*)
8: end for
9: return ȳ*

We design an exact inference algorithm in Algorithm 1 based on the graph in Figure 3(c). The algorithm can be summarized in three steps: (1) enumerate all the $2^{|\mathcal{C}|}$ possible labels¹ for the merged node (line 3). (2) For each given label of the merged node, perform the Viterbi algorithm (Rabiner, 1989) on the decomposed graph (line 4) and store the Viterbi algorithm outputs in R and $\bar{y}_{i.}$. (3) From the $2^{|\mathcal{C}|}$ Viterbi algorithm outputs, select the one with the highest score as the output (lines 6 and 7).

¹ Since the merged node is from the context candidate set $\mathcal{C}$, enumerating its label is equivalent to enumerating subsets $\mathcal{C}_s$ of the candidate set $\mathcal{C}$.

The use of the Viterbi algorithm is justified by the fact that there exists a certain equivalence between the decomposed graph (Figure 3(c)) and a linear chain. By fixing the label of the merged node, we can remove the dashed edges in the decomposed graph and regard the remaining graph as a linear chain, which allows Viterbi decoding.

4.2 Approximate Inference

The exact inference cannot handle the complete model with three sub-mappings, $\Psi_n$, $\Psi_h$ and $\Psi_v$, since the label group defeats the graph transform in Figure 3. Thus, we design two approximate algorithms by employing undergenerating and overgenerating approaches (Finley and Joachims, 2008).

First, we develop an undergenerating local greedy search algorithm, shown in Algorithm 2. In the algorithm, there are two loops, an inner and an outer loop. The outer loop terminates when no labels change (steps 3-11). The inner loop enumerates the whole label matrix and greedily determines each label (step 7) by maximizing Equation (2). Since the whole algorithm terminates only if
the label matrix does not change during the last outer loop, at least a locally optimal solution is obtained.

Algorithm 2 Greedy Inference Algorithm
1: Input: w, x, y
2: initialize solution: ȳ ← y^0
3: repeat
4:   y′ ← ȳ
5:   for i ∈ {1, ..., m} do
6:     for j ∈ {1, ..., n} do
7:       ȳ*_ij ← arg max_{ȳ_ij} w^T Ψ(x, ȳ) + Δ(y, ȳ)
8:       ȳ_ij ← ȳ*_ij
9:     end for
10:  end for
11: until ȳ = y′
12: ȳ* ← ȳ
13: return ȳ*

Second, an overgenerating method can be designed by using linear programming relaxation (Finley and Joachims, 2008). To save space, we skip the details of this algorithm here.

5 Loss Functions

Structural SVMs allow users to customize the loss function $\Delta : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ according to different system requirements. In this section, we introduce the loss functions used in our work.

Basic loss function. The simplest way to quantify the prediction quality is counting the number of wrongly predicted labels. Formally,

$$\Delta_b(y, \bar{y}) = \sum_{i=1}^{m} \sum_{j=1}^{n} I[y_{ij} \neq \bar{y}_{ij}], \qquad (3)$$

where $I[\cdot]$ is an indicator function that equals one if the condition holds and zero otherwise.

Recall-vs-precision loss function. In practice, we may place different emphasis on recall and precision according to application settings. We could include this preference into the model by defining the following loss function,

$$\Delta_p(y, \bar{y}) = \sum_{i=1}^{m} \sum_{j=1}^{n} I[y_{ij} \neq P,\ \bar{y}_{ij} = P] \cdot c_r + I[y_{ij} = P,\ \bar{y}_{ij} \neq P] \cdot c_p. \qquad (4)$$

This function penalizes the wrong predictions that decrease recall and those that decrease precision with two weights $c_r$ and $c_p$, respectively. Specifically, we denote the loss function with $c_p / c_r = 2$ and that with $c_r / c_p = 2$ by $\Delta_p^p$ and $\Delta_p^r$, respectively. Various types of loss function can be defined in a similar fashion. To save space, we skip the definitions of other loss functions and only use the above two types of loss functions to show the flexibility of our approach.

6 Experiments

6.1 Experimental Setup

Corpus. We made use of the same data set as introduced in (Cong et al., 2008; Ding et al., 2008). Specifically, the data set includes about 591 threads from the forum TripAdvisor². Each sentence in the threads is tagged with the labels ‘question’, ‘context’, ‘answer’, or ‘plain’ by two annotators. We removed 76 threads that have no question sentences or more than 40 sentences and 6 questions. The remaining 515 forum threads form our data set.

² TripAdvisor (http://www.tripadvisor.com/ForumHome) is one of the most popular travel forums.

Table 2 gives the statistics on the data set. On average, each thread contains 3.95 posts and 2.73 questions, and each question has 1.39 context sentences and 3.31 answer sentences. Note that the number of annotations is much larger than the number of sentences because one sentence can be annotated with multiple labels.

Items in the data set    #items
Thread                      515
Post                      2,035
Sentence                  8,500
question annotation       1,407
context annotation        1,962
answer annotation         4,652
plain annotation         18,198

Table 2: The data statistics

Experimental Details. In all the experiments, we made use of linear models for the sake of computational efficiency. As a preprocessing step, we normalized each feature value into the interval [0, 1] and then followed the heuristic used in SVM-light (Joachims, 1998) to set C to $1/\|x\|^2$, where $\|x\|$ is the average length of input samples (in our case, sentences). The tolerance parameter $\epsilon$ was set to 0.1 (the value also used in (Cai
and Hofmann, 2004)) in all the runs of the experiments.

Evaluation. We calculated the standard precision (P), recall (R) and F1-score (F1) for both tasks (context extraction and answer extraction). All the experimental results were obtained through 5-fold cross validation.

6.2 Baseline Methods

We employed binary SVMs (B-SVM), multiclass SVMs (M-SVM), and C4.5 (Quinlan, 1993) as our baseline methods:

B-SVM. We trained two binary SVMs for context extraction (context vs. non-context) and answer extraction (answer vs. non-answer), respectively. We used the feature mapping $\phi_{q_i}(x_j)$ defined in Equation (1) while training the binary SVM models.

M-SVM. We extended the binary SVMs by training multiclass SVMs for three category labels (context, answer, plain).

C4.5. This decision tree algorithm solved the same classification problem as the binary SVMs and made use of the same set of features.

6.3 Modeling Sentence Relations and Question Interactions

We demonstrate in Table 3 that our approach can make good use of the three types of relation among sentences to boost the performance.

In Table 3, S-SVM represents the structural SVMs only using the node features $\Psi_n(x, y)$. The suffixes H, C, and V denote the models using horizontal sequential edges, complete skip-chain edges and vertical label groups, respectively. The suffixes C* and V* denote the models using incomplete skip-chain edges and vertical sequential edges proposed in (Ding et al., 2008), as shown in Figures 2(a) and 2(c). All the structural SVMs were trained using the basic loss function $\Delta_b$ in Equation (3). From Table 3, we can observe the following advantages of our approaches.

Method         Δb      P (%)   R (%)   F1 (%)
Context Extraction
C4.5           −       74.2    68.7    71.2
B-SVM          −       78.3    72.2    74.9
M-SVM          −       68.0    77.6    72.1
S-SVM          8.86    75.6    71.7    73.4
S-SVM-H        8.60    77.5    75.5    76.3
S-SVM-HC*      8.65    77.9    74.1    75.8
S-SVM-HC       8.62    77.5    75.2    76.2
S-SVM-HCV*     8.08    79.5    79.6    79.5
S-SVM-HCV      7.98    79.7    80.2    79.9
Answer Extraction
C4.5           −       61.3    45.2    51.8
B-SVM          −       69.7    42.0    51.8
M-SVM          −       63.2    51.5    55.8
S-SVM          8.86    67.0    48.0    55.6
S-SVM-H        8.60    66.9    49.7    56.7
S-SVM-HC*      8.65    66.5    49.4    56.4
S-SVM-HC       8.62    65.7    51.5    57.4
S-SVM-HCV*     8.08    65.5    58.7    61.7
S-SVM-HCV      7.98    65.1    61.2    63.0

Table 3: The effectiveness of our approach

Overall improvement. Our structural approach generally improves the extraction as more types of relation (corresponding to more types of edge) are included. The best results, obtained by using the three types of relation together, improve on the binary SVM baseline by about 6% and 20% in terms of F1 values for context extraction and answer extraction, respectively.

The usefulness of relations. The relations encoded by horizontal sequential edges and label groups are useful for both context extraction and answer extraction. The relation encoded by complete skip-chain edges is useful for answer extraction. The complete skip-chain edges not only avoid preprocessing but also boost the performance when compared with the preprocessed skip-chain edges. The label groups improve on the vertical sequential edges.

Interactions among questions. The interactions encoded by label groups are especially useful. We conducted significance tests (sign test) on the experimental results. The test result shows that S-SVM-HCV outperforms all the other methods without vertical edges with statistical significance (p-value < 0.01). Our proposed graphical representation in Figure 2(d) makes it easy to model the complex interactions. In comparison, the 2D model in Figure 2(c) used in previous work (Ding et al., 2008) can only model the interaction between adjacent questions.

6.4 Loss Function Results

We report in Table 4 the comparison between structural SVMs using different loss functions. Note that $\Delta_p^p$ prefers precision and $\Delta_p^r$ prefers recall. From Table 4, we can observe that the experimental results also exhibit this kind of system
Method P (%) R (%) F1 (%) 1
Context Extraction 0.9
Context
Answer
Precision
S-SVM-HCV-4b 79.7 80.2 79.9 0.8
p
S-SVM-HCV-4p 82.0 70.3 75.6 0.7
S-SVM-HCV-4rp 75.7 84.2 79.7 0.6

−1.5 −1 −0.5 0 0.5 1 1.5
Answer Extraction Log loss ratio
1
S-SVM-HCV-4b 65.1 61.2 63.0 Context
S-SVM-HCV-4pp 71.8 52.2 60.2

Answer
0.8
Recall
S-SVM-HCV-4rp 61.8 66.1 63.7 0.6
Table 4: The use of different loss functions 0.4

−1.5 −1 −0.5 0 0.5 1 1.5
preference. Moreover, we further demonstrate the capability of the loss function ∆_p in Figure 4. The curves are obtained by varying the ratio c_p / c_r between the two parameters in Equation (4). The curves confirm our intuition: as log(c_p / c_r) becomes larger, precision increases but recall decreases, and vice versa.

Figure 4: Balancing between precision and recall (axis label: Log loss ratio)

7 Related work

Previous work on extracting questions, answers and contexts is most closely related to our work. Cong et al. (2008) proposed a supervised approach for question detection and an unsupervised approach for answer detection without considering contexts. Ding et al. (2008) used CRFs to detect contexts and answers of questions from forum threads.

Some research on summarizing discussion threads and emails is also related to our work. Zhou and Hovy (2005) segmented internet relay chat, clustered segments into sub-topics, and identified responding segments of the first segment in each sub-topic by assuming the first segment to be the focus. Nenkova and Bagga (2003), Wan and McKeown (2004) and Rambow et al. (2004) organized email summaries by extracting overview sentences as discussion issues. Shrestha and McKeown (2004) used RIPPER as a classifier to detect interrogative questions and their answers, and then used the resulting question-answer pairs as summaries. We also note the existing work on extracting knowledge from discussion threads. Huang et al. (2007) used SVMs to extract input-reply pairs from forums for chatbot knowledge. Feng et al. (2006) implemented a discussion-bot which used cosine similarity to match students' queries with reply posts from an annotated corpus of archived threaded discussions.

Moreover, extensive research has been done within the area of question answering (Burger et al., 2006; Jeon et al., 2005; Harabagiu and Hickl, 2006; Cui et al., 2005; Dang et al., 2006). This work mainly focused on using sophisticated linguistic analysis to construct answers from a large document collection.

8 Conclusion and Future Work

We have proposed a new form of graphical representation for modeling the problem of extracting contexts and answers of questions from online forums, and then customized the structural SVM approach to solve it.

The proposed graphical representation is able to naturally express three types of relation among sentences: the relation between successive sentences, the relation between context sentences and answer sentences, and the relation between multiple labels for one sentence. The representation also enables us to address interactions among questions. We also developed inference algorithms for the structural SVM model by exploiting the special structure of thread discussions.

Experimental results on a real data set show that our approach significantly improves over the baseline methods by effectively utilizing various types of relation among sentences.

Our future work includes: (a) summarizing threads and representing forum threads as question-context-answer triples, which will change the organization of online forums; and (b) enhancing QA services (e.g., Yahoo! Answers) with the content extracted from online forums.

Acknowledgement

The authors would like to thank the anonymous reviewers for their comments, which helped improve this paper.
References

John Burger, Claire Cardie, Vinay Chaudhri, Robert Gaizauskas, Sanda Harabagiu, David Israel, Christian Jacquemin, Chin-Yew Lin, Steve Maiorano, George Miller, Dan Moldovan, Bill Ogden, John Prager, Ellen Riloff, Amit Singhal, Rohini Srihari, Tomek Strzalkowski, Ellen Voorhees, and Ralph Weischedel. 2006. Issues, tasks and program structures to roadmap research in question and answering (qna). ARAD: Advanced Research and Development Activity (US).

Lijuan Cai and Thomas Hofmann. 2004. Hierarchical document categorization with support vector machines. In Proceedings of CIKM, pages 78–87.

Gao Cong, Long Wang, Chin-Yew Lin, and Young-In Song. 2008. Finding question-answer pairs from online forums. In Proceedings of SIGIR, pages 467–474.

Hang Cui, Renxu Sun, Keya Li, Min-Yen Kan, and Tat-Seng Chua. 2005. Question answering passage retrieval using dependency relations. In Proceedings of SIGIR, pages 400–407.

Hoa Dang, Jimmy Lin, and Diane Kelly. 2006. Overview of the trec 2006 question answering track. In Proceedings of TREC, pages 99–116.

Shilin Ding, Gao Cong, Chin-Yew Lin, and Xiaoyan Zhu. 2008. Using conditional random field to extract contexts and answers of questions from online forums. In Proceedings of ACL, pages 710–718.

Donghui Feng, Erin Shaw, Jihie Kim, and Eduard H. Hovy. 2006. An intelligent discussion-bot for answering student queries in threaded discussions. In Proceedings of IUI, pages 171–177.

Thomas Finley and Thorsten Joachims. 2008. Training structural SVMs when exact inference is intractable. In Proceedings of ICML, pages 304–311.

Michel Galley. 2006. A skip-chain conditional random field for ranking meeting utterances by importance. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 364–372.

Sanda M. Harabagiu and Andrew Hickl. 2006. Methods for using textual entailment in open-domain question answering. In Proceedings of ACL, pages 905–912.

Jizhou Huang, Ming Zhou, and Dan Yang. 2007. Extracting chatbot knowledge from online discussion forums. In Proceedings of IJCAI, pages 423–428.

Jiwoon Jeon, W. Bruce Croft, and Joon Ho Lee. 2005. Finding similar questions in large question and answer archives. In Proceedings of CIKM, pages 84–90.

Thorsten Joachims, Thomas Finley, and Chun-Nam Yu. 2009. Cutting-plane training of structural SVMs. Machine Learning.

Thorsten Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of ECML, pages 137–142.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML, pages 282–289.

Ani Nenkova and Amit Bagga. 2003. Facilitating email thread access by extractive summary generation. In Proceedings of RANLP, pages 287–296.

Nam Nguyen and Yunsong Guo. 2007. Comparisons of sequence labeling algorithms and extensions. In Proceedings of ICML, pages 681–688.

John Quinlan. 1993. C4.5: programs for machine learning. Morgan Kaufmann Publisher Incorporation.

Lawrence Rabiner. 1989. A tutorial on hidden markov models and selected applications in speech recognition. In Proceedings of IEEE, pages 257–286.

Owen Rambow, Lokesh Shrestha, John Chen, and Christy Lauridsen. 2004. Summarizing email threads. In Proceedings of HLT-NAACL, pages 105–108.

Lokesh Shrestha and Kathleen McKeown. 2004. Detection of question-answer pairs in email conversations. In Proceedings of COLING, pages 889–895.

Charles Sutton and Andrew McCallum. 2004. Collective segmentation and labeling of distant entities in information extraction. Technical Report 04-49.

Benjamin Taskar, Carlos Guestrin, and Daphne Koller. 2003. Max-margin markov networks. In Advances in Neural Information Processing Systems 16. MIT Press.

Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. 2005. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453–1484.

Stephen Wan and Kathy McKeown. 2004. Generating overview summaries of ongoing email thread discussions. In Proceedings of COLING, pages 549–555.

Liang Zhou and Eduard Hovy. 2005. Digesting virtual "geek" culture: The summarization of technical internet relay chats. In Proceedings of ACL, pages 298–305.

Jun Zhu, Zaiqing Nie, Ji-Rong Wen, Bo Zhang, and Wei-Ying Ma. 2005. 2d conditional random fields for web information extraction. In Proceedings of ICML, pages 1044–1051.