A Structural Support Vector Method For Extracting Contexts and Answers of Questions From Online Forums
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 514–523, Singapore, 6-7 August 2009. © 2009 ACL and AFNLP
three aspects: graphical representation, inference algorithm and loss function.

Graphical representation. We propose a more comprehensive and unified graphical representation to model the thread for relational learning. Our graphical representation has two advantages over previous work (Ding et al., 2008): unifying sentence relations and incorporating question interactions.

Three types of relation should be considered for context and answer extraction: (a) relations between successive sentences (e.g., context sentence S2 occurs immediately before question sentence S3); (b) relations between context sentences and answer sentences (e.g., context S4 contains the phrase "Causeway Bay", which links to the answer but is absent from question S6); and (c) relations between multiple labels for one sentence (e.g., one question sentence is unlikely to be the answer to another question, although one sentence can serve as context for more than one question). Our proposed graphical representation improves the modeling of the three types of sentence relation (Section 2.2).

Certain interactions exist among questions. For example, question sentences S5 and S6 interact by sharing context sentence S4. Our proposed graphical representation can naturally model these interactions. Previous work (Ding et al., 2008) performs the extraction of contexts and answers in multiple passes over the thread (with each pass corresponding to one question), which cannot address the interactions well. In comparison, our model performs the extraction in one pass over the thread.

Inference algorithm. Inference is usually a time-consuming process for structured prediction. We design special inference algorithms, instead of the general-purpose inference algorithms used in previous work (Cong et al., 2008; Ding et al., 2008), by taking advantage of special properties of our task. Specifically, we utilize two special properties of thread structure to reduce the inference (time) cost. First, context sentences and question sentences usually occur in the same post, while answer sentences can only occur in the following posts. With this property, we can greatly reduce the context (or answer) candidate set of a question, which results in a significant decrease in inference cost (Section 3). Second, the context candidate set is usually much smaller than the number of sentences in a thread. This property enables our proposal to have an exact and efficient inference (Section 4.1). Moreover, an approximate inference algorithm is also given (Section 4.2).

Loss function. In practice, different application settings usually imply different requirements for system performance. For example, we expect a higher recall for the purpose of archiving questions but a higher precision for the purpose of retrieving questions. A flexible framework should be able to cope with various requirements. We employ a structural Support Vector Machine (SVM) model that can naturally incorporate different loss functions (Section 5).

We use a real data set to evaluate our approach to extracting contexts and answers of questions. The experimental results show both the effectiveness and the flexibility of our approach.

In the next section, we formalize the problem of context and answer extraction and introduce the structural model. In Sections 3, 4 and 5 we give the details of customizing the structural model for our task. In Section 6, we evaluate our methods. In Section 7, we discuss the related work. Finally, we conclude this paper in Section 8.

2 Problem Statement

We first introduce our notations in Section 2.1 and then in Section 2.2 introduce how we model the problem of extracting contexts and answers for questions with a novel form of graphical representation. In Section 2.3 we introduce the structured model based on the new representation.

2.1 Notations

Assume that a given thread contains $p$ posts $\{p_1, \ldots, p_p\}$, which are authored by a set of users $\{u_1, \ldots, u_p\}$. The $p$ posts can be further segmented into $n$ sentences $x = \{x_1, \ldots, x_n\}$. Among the $n$ sentences, $m$ question sentences $q = \{x_{q_1}, \ldots, x_{q_m}\}$ have been identified. Our task is to identify the context sentences and the answer sentences for those $m$ question sentences. More formally, we use four types of label $\{C, A, Q, P\}$ to stand for context, answer, question and plain labels. Then, our task is to predict an $m \times n$ label matrix $y = (y_{ij})_{1 \le i \le m,\, 1 \le j \le n}$, except the $m$ elements $\{y_{1,q_1}, \ldots, y_{m,q_m}\}$, which correspond to (known) question labels. The element $y_{ij}$ in label matrix $y$ represents the role that the $j$th sentence plays for the $i$th question. We denote the $i$th row and $j$th column of the label matrix $y$ by $y_{i\cdot}$ and $y_{\cdot j}$, respectively.
[Figure 2: Graphical representations of the models, with label nodes y over sentence nodes x. In panels (a) and (b), label nodes y1 to y7 over sentences x1 to x7 are annotated with the allowed labels {C, P}, {C, P}, {C, P}, {Q}, {P}, {A, P}, {A, P}. Panels (c) 2D model and (d) Label group model arrange the labels y11, ..., ymn as an m × n grid.]
2.2 Graphical Representation

Recently, Ding et al. (2008) used skip-chain and 2D Conditional Random Fields (CRFs) (Lafferty et al., 2001) to perform the relational learning for context and answer extraction. The skip-chain CRFs (Sutton and McCallum, 2004; Galley, 2006) model the long distance dependency between context and answer sentences, and the 2D CRFs (Zhu et al., 2005) model the dependency between contiguous questions. The graphical representations of those two models are shown in Figures 2(a) and 2(c), respectively. Those two CRFs are both extensions of the linear chain CRFs for the sake of powerful relational learning. However, directly using the skip-chain and 2D CRFs without any customization has obvious disadvantages: (a) the skip-chain model does not model the dependency between an answer sentence and multiple context sentences; and (b) the 2D model does not model the dependency between non-contiguous questions.

To better model the problem of extracting contexts and answers of questions, we propose two more comprehensive models, the complete skip-chain model and the label group model, to improve the capability of the two previous models. These two models are shown in Figures 2(b) and 2(d).

In Figures 2(a) and 2(b), each label node is annotated with its allowed labels, and the labels C, A, Q and P stand for context, answer, question and plain sentence labels, respectively. Note that the complete skip-chain model links every pair of context and answer candidates, and the label group model combines the labels of one sentence into one label group.

2.3 Structured Model

Following the standard machine learning setup, we denote the input and output spaces by $\mathcal{X}$ and $\mathcal{Y}$, then formulate our task as learning a hypothesis function $h : \mathcal{X} \to \mathcal{Y}$ to predict a $y$ when given $x$. In this setup, $x$ represents a thread of $n$ sentences and $m$ identified questions, and $y$ represents the $m \times n$ label matrix to be predicted.

Given a set of training examples, $S = \{(x^{(i)}, y^{(i)}) \in \mathcal{X} \times \mathcal{Y} : i = 1, \ldots, N\}$, we restrict ourselves to the supervised learning scenario. We focus on hypothesis functions that take the form $h(x; w) = \arg\max_{y \in \mathcal{Y}} F(x, y; w)$ with discriminant function $F : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, where $F(x, y; w) = w^T \Psi(x, y)$. As will be introduced in Section 4, we employ structural SVMs (Joachims et al., 2009) to find the optimal parameters $w$. The structural SVM approach has several properties that make it competitive with CRFs. First, it follows the maximum margin strategy, which has been shown to give competitive or even better
performance (Tsochantaridis et al., 2005; Nguyen and Guo, 2007). Second, it gives users a flexible choice of loss functions. Moreover, in general, its training is theoretically proven to converge in polynomial time (Joachims et al., 2009).

To use structural SVMs in relational learning, one needs to customize three steps according to specific tasks. The three steps are (a) definition of the joint feature mapping for encoding relations, (b) an algorithm for finding the most violated constraint (inference) for efficient training and (c) definition of the loss function for flexible use.

In the following Sections 3, 4 and 5, we describe the customizations of the three steps for our context and answer extraction task, respectively.

3 Encoding Relations

We use a joint feature mapping to model the relations between sentences in a thread. For context and answer extraction, the joint feature mapping can be defined as follows,

$$\Psi(x, y) = \begin{bmatrix} \Psi_n(x, y) \\ \Psi_h(x, y) \\ \Psi_v(x, y) \end{bmatrix},$$

where the sub-mappings $\Psi_n(x, y)$, $\Psi_h(x, y)$ and $\Psi_v(x, y)$ encode three types of feature mappings: node features, edge features and label group features. The node features provide the basic information for the output labels. The edge features consist of the sequential edge features and skip-chain edge features for successive label dependencies. The label group features encode the relations within each label group.

Before giving the detailed definitions of the sub-mappings, we first introduce the context and answer candidate sets, which will be used for the definitions and inferences. Each row of the label matrix $y$ corresponds to one question. Assuming that the $i$th row $y_{i\cdot}$ corresponds to the question with sentence index $q_i$, we thus have two candidate sets of contexts and answers for this question, denoted by $C$ and $A$, respectively. We denote the post indices and the author indices for the $n$ sentences as $p = (p_1, \ldots, p_n)$ and $u = (u_1, \ldots, u_n)$. Then, we can formally define the two candidate sets for $y_{i\cdot}$ as

$$C = \Big\{ c_j \;\Big|\; \underbrace{p_{c_j} = p_{q_i}}_{\text{In Question Post}},\ \underbrace{c_j \neq q_i}_{\text{Not Question Sentence}} \Big\},$$

$$A = \Big\{ a_j \;\Big|\; \underbrace{p_{a_j} > p_{q_i}}_{\text{After Question Post}},\ \underbrace{u_{a_j} \neq u_{q_i}}_{\text{Not by the Same User}} \Big\}.$$

In the following, we formally describe the definitions of the three feature sub-mappings.

The node feature mapping $\Psi_n(x, y)$ encodes the relations between sentence and label pairs. We define it as follows,

$$\Psi_n(x, y) = \sum_{i=1}^{m} \sum_{j=1}^{n} \psi_n(x_j, y_{ij}),$$

where $\psi_n(x_j, y_{ij})$ is a feature mapping for a given sentence and a label. It can be formally defined as follows,

$$\psi_n(x_j, y_{ij}) = \Lambda(y_{ij}) \otimes \phi_{q_i}(x_j), \tag{1}$$

where $\otimes$ denotes a tensor product, and $\phi_{q_i}(x_j)$ and $\Lambda(y_{ij})$ denote two vectors. $\phi_{q_i}(x_j)$ contains the basic information for the output label. $\Lambda(y_{ij})$ is a 0/1 vector defined as

$$\Lambda(y_{ij}) = [\lambda_C(y_{ij}), \lambda_A(y_{ij}), \lambda_P(y_{ij})]^T,$$

where $\lambda_C(y_{ij})$ equals one if $y_{ij} = C$ and zero otherwise. $\lambda_A(y_{ij})$ and $\lambda_P(y_{ij})$ are similarly defined. Thus, for example, writing out $\psi_n(x_j, y_{ij})$ for $y_{ij} = C$ one gets

$$\psi_n(x_j, y_{ij}) = \begin{bmatrix} \phi_{q_i}(x_j) \\ 0 \\ 0 \end{bmatrix} \begin{matrix} \leftarrow \text{context} \\ \leftarrow \text{answer} \\ \leftarrow \text{plain} \end{matrix}.$$

Note that the node feature mapping does not incorporate the relations between sentences.

The edge feature mapping $\Psi_h(x, y)$ is used to incorporate two types of relation: the relation between successive sentences and the relation between context and answer sentences. It can be defined as follows,

$$\Psi_h(x, y) = \begin{bmatrix} \Psi_{hn}(x, y) \\ \Psi_{hc}(x, y) \end{bmatrix},$$

where $\Psi_{hn}(x, y)$ and $\Psi_{hc}(x, y)$ denote the two types of feature mappings corresponding to sequential edges and skip-chain edges, respectively. Their formal definitions are given as follows,

$$\Psi_{hn}(x, y) = \sum_{i=1}^{m} \sum_{j=1}^{n-1} \psi_{hn}(x_j, x_{j+1}, y_{ij}, y_{i,j+1}),$$

$$\Psi_{hc}(x, y) = \sum_{i=1}^{m} \underbrace{\sum_{j \in C} \sum_{k \in A}}_{\text{Complete Edges}} \psi_{hc}(x_j, x_k, y_{ij}, y_{ik}),$$

$$\psi_{hn}(x_j, x_{j+1}, y_{ij}, y_{i,j+1}) = \Lambda(y_{ij}, y_{i,j+1}) \otimes \phi_{hn}(x_j, x_{j+1}, y_{ij}, y_{i,j+1}),$$

$$\psi_{hc}(x_j, x_k, y_{ij}, y_{ik}) = \Lambda(y_{ij}, y_{ik}) \otimes \phi_{hc}(x_j, x_k, y_{ij}, y_{ik}),$$

where $\Lambda(y_{ij}, y_{ik})$ is a 16-dimensional vector. It indicates all $4 \times 4$ pairwise transition patterns of the four types of labels: context, answer, question and plain. Note that, unlike previous work (Ding et al., 2008), we use complete skip-chain (context-answer) edges in $\Psi_{hc}(x, y)$.

The label group feature mapping $\Psi_v(x, y)$ is defined as follows,

$$\Psi_v(x, y) = \sum_{j=1}^{n} \psi_v(x_j, y_{\cdot j}),$$

where $\psi_v(x_j, y_{\cdot j})$ encodes each label group pattern into a vector.

The detailed descriptions and vector dimensions of the used features are listed in Table 1.

Table 1

| Descriptions | Dimensions |
|---|---|
| $\phi_{q_i}(x_j)$ (32 dimensions) in $\Psi_n(x, y)$ | |
| The cosine, WordNet and KL-divergence similarities with the question $x_{q_i}$ | 3 |
| The cosine, WordNet and KL-divergence similarities with the questions other than $x_{q_i}$ | 3 |
| The cosine, WordNet and KL-divergence similarities with previous and next sentences | 6 |
| Is this sentence $x_j$ exactly $x_{q_i}$ or one of the questions in $\{x_{q_1}, \ldots, x_{q_m}\}$? | 2 |
| Is this sentence $x_j$ one of the first three sentences? | 3 |
| The relative position of this sentence $x_j$ to questions | 4 |
| Does this sentence $x_j$ share the same author with the question sentence $x_{q_i}$? | 1 |
| Is this sentence $x_j$ in the same post with question sentences? | 2 |
| Is this sentence $x_j$ in the same paragraph with question sentences? | 2 |
| The presence of greeting (e.g., "hi") and acknowledgement words in this sentence $x_j$ | 2 |
| The length of this sentence $x_j$ | 1 |
| The number of nouns, verbs and pronouns in this sentence $x_j$, respectively | 3 |
| $\Psi_h(x, y)$ (704 dimensions) | |
| For $\Psi_{hn}(x, y)$, the above 32 dimension features w.r.t. $4 \times 4 = 16$ transition patterns | 512 |
| For $\Psi_{hc}(x, y)$, 12 types of pairwise or merged similarities w.r.t. 16 transition patterns | 192 |
| $\Psi_v(x, y)$ (32 dimensions) | |
| The transition patterns for any two non-contiguous labels in a label group | 16 |
| The transition patterns for any two contiguous labels in a label group | 16 |

4 Structural SVMs and Inference

Given a training set $S = \{(x^{(i)}, y^{(i)}) \in \mathcal{X} \times \mathcal{Y} : i = 1, \ldots, N\}$, we use the structural SVM formulation (Taskar et al., 2003; Tsochantaridis et al., 2005; Joachims et al., 2009), as shown in Optimization Problem 1 (OP1), to learn a weight vector $w$.

OP 1 (1-Slack Structural SVM)

$$\min_{w,\ \xi \ge 0} \ \frac{1}{2} \|w\|^2 + \frac{C}{N} \xi$$

$$\text{s.t. } \forall (\bar{y}^{(1)}, \ldots, \bar{y}^{(N)}) \in \mathcal{Y}^N: \quad \frac{1}{N} w^T \sum_{i=1}^{N} \big[\Psi(x^{(i)}, y^{(i)}) - \Psi(x^{(i)}, \bar{y}^{(i)})\big] \ \ge\ \frac{1}{N} \sum_{i=1}^{N} \Delta(y^{(i)}, \bar{y}^{(i)}) - \xi,$$

where $\xi$ is a slack variable, $\Psi(x, y)$ is the joint feature mapping and $\Delta(y, \bar{y})$ is the loss function that measures the loss caused by the difference between $y$ and $\bar{y}$. Though OP1 is already a quadratic optimization problem, directly using an off-the-shelf quadratic optimization solver will fail, due to the large number of constraints. Instead, a cutting plane algorithm is used to efficiently solve this problem. For the details of the
....
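To make the single OP1 constraint concrete, the following sketch computes the smallest slack $\xi$ that satisfies it for one given tuple of candidate outputs $(\bar{y}^{(1)}, \ldots, \bar{y}^{(N)})$. It is only an illustration under stated assumptions: feature vectors are plain Python lists that have already been computed, and the names `dot` and `required_slack` are hypothetical. It is not the cutting plane solver itself.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def required_slack(
    w: Sequence[float],
    psi_true: List[Sequence[float]],   # Psi(x_i, y_i) for each training example
    psi_cand: List[Sequence[float]],   # Psi(x_i, ybar_i) for the candidate outputs
    losses: Sequence[float],           # Delta(y_i, ybar_i) for each example
) -> float:
    """Smallest xi >= 0 satisfying the OP1 constraint for this candidate tuple."""
    n = len(losses)
    # Left-hand side: (1/N) * w^T * sum_i [Psi(x_i, y_i) - Psi(x_i, ybar_i)]
    margin = sum(
        dot(w, [t - c for t, c in zip(pt, pc)])
        for pt, pc in zip(psi_true, psi_cand)
    ) / n
    # Right-hand side without the slack: (1/N) * sum_i Delta(y_i, ybar_i)
    avg_loss = sum(losses) / n
    return max(0.0, avg_loss - margin)


# Toy numbers: two training examples with 2-dimensional joint features.
xi = required_slack(
    w=[1.0, 0.0],
    psi_true=[[2.0, 0.0], [1.0, 1.0]],
    psi_cand=[[1.0, 0.0], [1.0, 0.0]],
    losses=[3.0, 1.0],
)
```

The candidate tuple that maximizes this quantity is the most violated constraint, which is what the inference step has to find during training.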
Algorithm 2: Greedy Inference Algorithm
    Input: $w$, $x$, $y$
    initialize solution: $\bar{y} \leftarrow y_0$
    repeat
        $y' \leftarrow \bar{y}$
        for $i \in \{1, \ldots, m\}$ do
            for $j \in \{1, \ldots, n\}$ do
                $\bar{y}_{ij}^{*} \leftarrow \arg\max_{\bar{y}_{ij}} \ w^T \Psi(x, \bar{y}) + \Delta(y, \bar{y})$
                $\bar{y}_{ij} \leftarrow \bar{y}_{ij}^{*}$
            end for
        end for
    until $\bar{y} = y'$
    $\bar{y}^{*} \leftarrow \bar{y}$
    return $\bar{y}^{*}$

the label matrix does not change during the last outer loop. This indicates that at least a locally optimal solution is obtained.

Second, an overgenerating method can be designed by using linear programming relaxation (Finley and Joachims, 2008). To save space, we skip the details of this algorithm here.

5 Loss Functions

Structural SVMs allow users to customize the loss function $\Delta : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ according to different system requirements. In this section, we introduce the loss functions used in our work.

Basic loss function. The simplest way to quantify the prediction quality is to count the number of wrongly predicted labels. Formally,

$$\Delta_b(y, \bar{y}) = \sum_{i=1}^{m} \sum_{j=1}^{n} I[y_{ij} \neq \bar{y}_{ij}], \tag{3}$$

where $I[\cdot]$ is an indicator function that equals one if the condition holds and zero otherwise.

Recall-vs-precision loss function. In practice, we may place different emphasis on recall and precision according to application settings. We can include this preference in the model by defining the following loss function,

$$\Delta_p(y, \bar{y}) = \sum_{i=1}^{m} \sum_{j=1}^{n} I[y_{ij} \neq P,\ \bar{y}_{ij} = P] \cdot c_r + I[y_{ij} = P,\ \bar{y}_{ij} \neq P] \cdot c_p. \tag{4}$$

This function penalizes wrong predictions that decrease recall and those that decrease precision with two weights $c_r$ and $c_p$, respectively. Specifically, we denote the loss function with $c_p / c_r = 2$ and that with $c_r / c_p = 2$ by $\Delta_p^{p}$ and $\Delta_p^{r}$, respectively. Various types of loss function can be defined in a similar fashion. To save space, we skip the definitions of other loss functions and only use the above two types of loss functions to show the flexibility of our approach.

6 Experiments

6.1 Experimental Setup

Corpus. We made use of the same data set as introduced in (Cong et al., 2008; Ding et al., 2008). Specifically, the data set includes about 591 threads from the forum TripAdvisor². Each sentence in the threads is tagged with the labels 'question', 'context', 'answer', or 'plain' by two annotators. We removed 76 threads that have no question sentences or more than 40 sentences and 6 questions. The remaining 515 forum threads form our data set.

² TripAdvisor (http://www.tripadvisor.com/ForumHome) is one of the most popular travel forums

Table 2 gives the statistics on the data set. On average, each thread contains 3.95 posts and 2.73 questions, and each question has 1.39 context sentences and 3.31 answer sentences. Note that the number of annotations is much larger than the number of sentences because one sentence can be annotated with multiple labels.

| Items in the data set | #items |
|---|---|
| Thread | 515 |
| Post | 2,035 |
| Sentence | 8,500 |
| question annotation | 1,407 |
| context annotation | 1,962 |
| answer annotation | 4,652 |
| plain annotation | 18,198 |

Table 2: The data statistics

Experimental Details. In all the experiments, we made use of linear models for the sake of computational efficiency. As a preprocessing step, we normalized the value of each feature into the interval [0, 1] and then followed the heuristic used in SVM-light (Joachims, 1998) to set $C$ to $1 / \|x\|^2$, where $\|x\|$ is the average length of input samples (in our case, sentences). The tolerance parameter $\epsilon$ was set to 0.1 (the value also used in (Cai
and Hofmann, 2004)) in all the runs of the experiments.

Evaluation. We calculated the standard precision (P), recall (R) and F1-score (F1) for both tasks (context extraction and answer extraction). All the experimental results were obtained through 5-fold cross validation.

6.2 Baseline Methods

We employed binary SVMs (B-SVM), multiclass SVMs (M-SVM), and C4.5 (Quinlan, 1993) as our baseline methods:

B-SVM. We trained two binary SVMs for context extraction (context vs. non-context) and answer extraction (answer vs. non-answer), respectively. We used the feature mapping $\phi_{q_i}(x_j)$ defined in Equation (1) while training the binary SVM models.

M-SVM. We extended the binary SVMs by training multiclass SVMs for three category labels (context, answer, plain).

C4.5. This decision tree algorithm solved the same classification problem as the binary SVMs and made use of the same set of features.

6.3 Modeling Sentence Relations and Question Interactions

We demonstrate in Table 3 that our approach can make good use of the three types of relation among sentences to boost the performance.

| Method | $\Delta_b$ | P (%) | R (%) | F1 (%) |
|---|---|---|---|---|
| Context Extraction | | | | |
| C4.5 | - | 74.2 | 68.7 | 71.2 |
| B-SVM | - | 78.3 | 72.2 | 74.9 |
| M-SVM | - | 68.0 | 77.6 | 72.1 |
| S-SVM | 8.86 | 75.6 | 71.7 | 73.4 |
| S-SVM-H | 8.60 | 77.5 | 75.5 | 76.3 |
| S-SVM-HC* | 8.65 | 77.9 | 74.1 | 75.8 |
| S-SVM-HC | 8.62 | 77.5 | 75.2 | 76.2 |
| S-SVM-HCV* | 8.08 | 79.5 | 79.6 | 79.5 |
| S-SVM-HCV | 7.98 | 79.7 | 80.2 | 79.9 |
| Answer Extraction | | | | |
| C4.5 | - | 61.3 | 45.2 | 51.8 |
| B-SVM | - | 69.7 | 42.0 | 51.8 |
| M-SVM | - | 63.2 | 51.5 | 55.8 |
| S-SVM | 8.86 | 67.0 | 48.0 | 55.6 |
| S-SVM-H | 8.60 | 66.9 | 49.7 | 56.7 |
| S-SVM-HC* | 8.65 | 66.5 | 49.4 | 56.4 |
| S-SVM-HC | 8.62 | 65.7 | 51.5 | 57.4 |
| S-SVM-HCV* | 8.08 | 65.5 | 58.7 | 61.7 |
| S-SVM-HCV | 7.98 | 65.1 | 61.2 | 63.0 |

Table 3: The effectiveness of our approach

In Table 3, S-SVM represents the structural SVMs using only the node features $\Psi_n(x, y)$. The suffixes H, C, and V denote the models using horizontal sequential edges, complete skip-chain edges and vertical label groups, respectively. The suffixes C* and V* denote the models using incomplete skip-chain edges and vertical sequential edges proposed in (Ding et al., 2008), as shown in Figures 2(a) and 2(c). All the structural SVMs were trained using the basic loss function $\Delta_b$ in Equation (3). From Table 3, we can observe the following advantages of our approaches.

Overall improvement. Our structural approach steadily improves the extraction as more types of relation (corresponding to more types of edge) are included. The best results, obtained by using the three types of relation together, improve on the baseline binary SVMs by about 6% and 20% in terms of F1 values for context extraction and answer extraction, respectively.

The usefulness of relations. The relations encoded by horizontal sequential edges and label groups are useful for both context extraction and answer extraction. The relation encoded by complete skip-chain edges is useful for answer extraction. The complete skip-chain edges not only avoid preprocessing but also boost the performance when compared with the preprocessed skip-chain edges. The label groups improve on the vertical sequential edges.

Interactions among questions. The interactions encoded by label groups are especially useful. We conducted significance tests (sign test) on the experimental results. The test result shows that S-SVM-HCV outperforms all the other methods without vertical edges, and the difference is statistically significant (p-value < 0.01). Our proposed graphical representation in Figure 2(d) makes it easier to model the complex interactions. In comparison, the 2D model in Figure 2(c) used in previous work (Ding et al., 2008) can only model the interaction between adjacent questions.

6.4 Loss Function Results

We report in Table 4 the comparison between structural SVMs using different loss functions. Note that $\Delta_p^{p}$ prefers precision and $\Delta_p^{r}$ prefers recall. From Table 4, we can observe that the experimental results also exhibit this kind of system
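| Method | P (%) | R (%) | F1 (%) |
|---|---|---|---|
| Context Extraction | | | |
| S-SVM-HCV-$\Delta_b$ | 79.7 | 80.2 | 79.9 |
| S-SVM-HCV-$\Delta_p^{p}$ | 82.0 | 70.3 | 75.6 |
| S-SVM-HCV-$\Delta_p^{r}$ | 61.8 | 66.1 | 63.7 |

[Figure: plot with axes Precision and Recall and two series, Context and Answer; surviving axis ticks run from 0.6 to 1.]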
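The loss functions compared here follow Equations (3) and (4). A minimal Python sketch is given below, assuming label matrices are nested lists of the strings 'C', 'A', 'Q' and 'P'; the function names and this list encoding are illustrative choices, not taken from the paper.

```python
from typing import List

LabelMatrix = List[List[str]]


def basic_loss(y: LabelMatrix, y_bar: LabelMatrix) -> int:
    """Equation (3): number of entries where the prediction differs from the truth."""
    return sum(
        t != p
        for row_true, row_pred in zip(y, y_bar)
        for t, p in zip(row_true, row_pred)
    )


def recall_precision_loss(
    y: LabelMatrix, y_bar: LabelMatrix, c_r: float = 1.0, c_p: float = 1.0
) -> float:
    """Equation (4): weighted count of recall errors and precision errors.

    A non-plain label predicted as plain costs c_r (hurts recall);
    a plain label predicted as non-plain costs c_p (hurts precision).
    Confusions between two non-plain labels are not penalized by Equation (4).
    """
    loss = 0.0
    for row_true, row_pred in zip(y, y_bar):
        for t, p in zip(row_true, row_pred):
            if t != "P" and p == "P":
                loss += c_r
            elif t == "P" and p != "P":
                loss += c_p
    return loss


# One question, four sentences: two missed labels and one spurious answer.
y_true = [["C", "Q", "A", "P"]]
y_pred = [["P", "Q", "P", "A"]]

loss_b = basic_loss(y_true, y_pred)
loss_precision = recall_precision_loss(y_true, y_pred, c_r=1.0, c_p=2.0)  # c_p / c_r = 2
loss_recall = recall_precision_loss(y_true, y_pred, c_r=2.0, c_p=1.0)     # c_r / c_p = 2
```

With $c_p / c_r = 2$ the spurious answer dominates the loss, while with $c_r / c_p = 2$ the two missed labels dominate, which is how the same model can be steered toward precision or toward recall.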
References

John Burger, Claire Cardie, Vinay Chaudhri, Robert Gaizauskas, Sanda Harabagiu, David Israel, Christian Jacquemin, Chin-Yew Lin, Steve Maiorano, George Miller, Dan Moldovan, Bill Ogden, John Prager, Ellen Riloff, Amit Singhal, Rohini Shrihari, Tomek Strzalkowski, Ellen Voorhees, and Ralph Weishedel. 2006. Issues, tasks and program structures to roadmap research in question and answering (qna). ARAD: Advanced Research and Development Activity (US).

Lijuan Cai and Thomas Hofmann. 2004. Hierarchical document categorization with support vector machines. In Proceedings of CIKM, pages 78–87.

Gao Cong, Long Wang, Chin-Yew Lin, and Young-In Song. 2008. Finding question-answer pairs from online forums. In Proceedings of SIGIR, pages 467–474.

Hang Cui, Renxu Sun, Keya Li, Min-Yen Kan, and Tat-Seng Chua. 2005. Question answering passage retrieval using dependency relations. In Proceedings of SIGIR, pages 400–407.

Hoa Dang, Jimmy Lin, and Diane Kelly. 2006. Overview of the trec 2006 question answering track. In Proceedings of TREC, pages 99–116.

Shilin Ding, Gao Cong, Chin-Yew Lin, and Xiaoyan Zhu. 2008. Using conditional random field to extract contexts and answers of questions from online forums. In Proceedings of ACL, pages 710–718.

Donghui Feng, Erin Shaw, Jihie Kim, and Eduard H. Hovy. 2006. An intelligent discussion-bot for answering student queries in threaded discussions. In Proceedings of IUI, pages 171–177.

Thomas Finley and Thorsten Joachims. 2008. Training structural SVMs when exact inference is intractable. In Proceedings of ICML, pages 304–311.

Michel Galley. 2006. A skip-chain conditional random field for ranking meeting utterances by importance. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 364–372.

Sanda M. Harabagiu and Andrew Hickl. 2006. Methods for using textual entailment in open-domain question answering. In Proceedings of ACL, pages 905–912.

Jizhou Huang, Ming Zhou, and Dan Yang. 2007. Extracting chatbot knowledge from online discussion forums. In Proceedings of IJCAI, pages 423–428.

Jiwoon Jeon, W. Bruce Croft, and Joon Ho Lee. 2005. Finding similar questions in large question and answer archives. In Proceedings of CIKM, pages 84–90.

Thorsten Joachims, Thomas Finley, and Chun-Nam Yu. 2009. Cutting-plane training of structural SVMs. Machine Learning.

Thorsten Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of ECML, pages 137–142.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML, pages 282–289.

Ani Nenkova and Amit Bagga. 2003. Facilitating email thread access by extractive summary generation. In Proceedings of RANLP, pages 287–296.

Nam Nguyen and Yunsong Guo. 2007. Comparisons of sequence labeling algorithms and extensions. In Proceedings of ICML, pages 681–688.

John Quinlan. 1993. C4.5: programs for machine learning. Morgan Kaufmann Publisher Incorporation.

Lawrence Rabiner. 1989. A tutorial on hidden markov models and selected applications in speech recognition. In Proceedings of IEEE, pages 257–286.

Owen Rambow, Lokesh Shrestha, John Chen, and Chirsty Lauridsen. 2004. Summarizing email threads. In Proceedings of HLT-NAACL, pages 105–108.

Lokesh Shrestha and Kathleen McKeown. 2004. Detection of question-answer pairs in email conversations. In Proceedings of COLING, pages 889–895.

Charles Sutton and Andrew McCallum. 2004. Collective segmentation and labeling of distant entities in information extraction. Technical Report 04-49.

Benjamin Taskar, Carlos Guestrin, and Daphne Koller. 2003. Max-margin markov networks. In Advances in Neural Information Processing Systems 16. MIT Press.

Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. 2005. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453–1484.

Stephen Wan and Kathy McKeown. 2004. Generating overview summaries of ongoing email thread discussions. In Proceedings of COLING, pages 549–555.

Liang Zhou and Eduard Hovy. 2005. Digesting virtual "geek" culture: The summarization of technical internet relay chats. In Proceedings of ACL, pages 298–305.

Jun Zhu, Zaiqing Nie, Ji-Rong Wen, Bo Zhang, and Wei-Ying Ma. 2005. 2d conditional random fields for web information extraction. In Proceedings of ICML, pages 1044–1051.