A Progressive Understanding Web Agent For Web Crawler Generation
Wenhao Huang♢, Chenghao Peng♡, Zhixu Li♢, Jiaqing Liang♡†, Yanghua Xiao♢♠†, Liqian Wen♣, Zulong Chen♣
♢ Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University
♡ School of Data Science, Fudan University
♠ Fudan-Aishu Cognitive Intelligence Joint Research Center
♣ Alibaba Holding-Aicheng Technology-Enterprise
{whhuang21,chpeng23}@m.fudan.edu.cn, {liangjiaqing,zhixuli,shawyh}@fudan.edu.cn, {liqian.wenliqian,zulong.czl}@alibaba-inc.com
arXiv:2404.12753v1 [cs.CL] 19 Apr 2024

Abstract

Web automation is a significant technique that accomplishes complicated web tasks by automating common web actions, enhancing operational efficiency and reducing the need for manual intervention. Traditional methods, such as wrappers, suffer from limited adaptability and scalability when faced with a new website. On the other hand, generative agents empowered by large language models (LLMs) exhibit poor performance and reusability in open-world scenarios. [...]

[Figure 1: The DOM tree of an NBA player detail page (LeBron James, #23, Small Forward, Los Angeles Lakers, with per-game statistics such as PPG, RPG, and APG nested in <div> and <span> elements), illustrating the heavy human effort of the wrapper approach: "I have to write a new wrapper for new websites."]
1 Introduction
Web automation refers to the process of programmatically interacting with web-based applications or websites to execute tasks that would typically require human intervention. Web automation streamlines repetitive and time-consuming tasks, significantly enhancing efficiency, accuracy, and scalability across diverse online processes.

In traditional web automation, methods predominantly rely on wrappers, which are scripts or software specifically designed to extract data from predetermined websites or pages. This approach is characteristic of a closed-world scenario, where the automation system only interacts with a predefined, limited set of websites or pages and does not extend beyond this specified domain. Consequently, these traditional methods face limitations in adaptability and scalability, struggling to function effectively when encountering new or altered website structures. Given these limitations, both rule-based wrappers and auto wrappers (Bronzi et al., 2013), despite their differences, share a common dependency on manually annotated examples for each website (Gulhane et al., 2011). For example, more than 1,400 annotations of webpages are used for information extraction (Lockard et al., 2019).

The advent of LLMs has revolutionized web automation by introducing advanced capabilities such as planning, reasoning, reflection, and tool use. Leveraging these capabilities, web automation employs LLMs to construct generative agents that can autonomously navigate, interpret, and interact with web content, effectively solving open-world web-based tasks through sophisticated language understanding and decision-making. However, despite these advancements, this paradigm faces two major issues. On one hand, existing web agent frameworks often demonstrate poor performance, with a success rate as low as 2.0 (Deng et al., 2023) on open-world tasks. On the other hand, a significant weakness of this approach is its insufficient reusability: these agents remain dependent on LLM calls even when dealing with similar tasks, leading to low efficiency when managing a large volume of repetitive and similar webpages.

In this work, we propose a crawler generation task for vertical information web pages. The goal of this task is to automatically generate a series of predefined rules or action sequences that extract the target information. This task calls for a paradigm in which LLMs generate crawlers. Compared to traditional wrappers, this paradigm can be quickly adjusted to different websites and task requirements, a flexibility that enables crawlers to handle diverse and changing web environments more efficiently. Compared to the generative agent paradigm, it introduces intermediate rules that enhance reusability and reduce the dependency on LLMs when dealing with similar tasks, thereby improving efficiency when handling a large number of web tasks.

Although LLMs possess strong webpage information extraction abilities, generating crawlers with LLMs still faces the following challenges. First, LLMs are primarily pre-trained on massive corpora of cleansed, high-quality pure text, lacking exposure to markup languages such as HTML; as a result, they exhibit a limited understanding of the complex structures and semantics inherent in HTML. Second, HTML, as semi-structured data, contains both structured elements (tags and attributes) and unstructured ones (textual content), while also encompassing multiple layers of nested information, which increases the complexity of crawler generation. Third, although LLMs excel at comprehending textual content, they still fall short in understanding the structural information of lengthy documents. This indicates a potential challenge in accurately capturing and utilizing the hierarchical structure inherent in long HTML documents.

Therefore, we introduce AutoCrawler, a two-stage framework designed to address the crawler generation task. An overview of AutoCrawler is presented in Figure 2. Our framework leverages the hierarchical structure of HTML for progressive understanding. Specifically, we propose a heuristic algorithm consisting of top-down and step-back operations based on LLMs. It first tries to refine down to the specific node in the DOM tree containing the target information, and then moves up the DOM tree when execution fails. This process can correct erroneous executions and progressively prune irrelevant parts of the HTML content until an action sequence executes successfully.

Our contributions can be summarized as follows:

• We propose the web crawler generation task and the paradigm of utilizing LLMs for generating crawlers, and we analyze extraction efficiency.

• We introduce AutoCrawler, a two-phase framework with progressive understanding to generate executable action sequences.

• Comprehensive experimental results demonstrate the effectiveness of our framework on the web crawler generation task.

† Corresponding authors.
¹ Resources of this paper can be found at https://github.com/EZ-hwh/AutoCrawler

2 Preliminaries

In this section, we first define the crawler generation task and then present the dataset collection process and its corresponding evaluation metrics.

2.1 Task Formulation

First, we formulate our crawler generation task. Given a set of webpages on the same website w ∈ W describing a subject entity s (also called a topic entity in the previous literature), and its corresponding predefined target attribute r ∈ R, the task objective is to generate an executable rule/action sequence A to extract the target information o from all webpages.

2.2 Datasets

We adopt the semi-structured information extraction task as a testbed for the crawler generation task.
[Figure 2 illustration: Phase 1 (progressive generation) cycles between top-down and step-back operations to build an action sequence, e.g. Step 1: //*[text()='PPG'], Step 2: ./ancestor, Step 3: ./span[1]/text(); Phase 2 (synthesis) merges the action sequences obtained from several seed webpages.]

Figure 2: Our framework for crawler generation. Left: the progressive generation process, a cycle of top-down and step-back operations that progressively produces an executable action sequence; Right: the synthesis process, which generates a stable action sequence based on the ones obtained from seed websites.
SWDE (Hao et al., 2011) is a Structured Web Data Extraction dataset that contains webpages and golden labels from 80 websites in 8 domains, with 124,291 webpages in total. Each website within a domain focuses on 3-5 attributes in its webpages.

ExtendedSWDE (Lockard et al., 2019) involves fine-grained manual annotation of 21 sites in 3 domains from SWDE. While SWDE contains an average of 4,480 triples for 3 predicates per website, the ExtendedSWDE dataset averages 41K triples for 36 predicates per site.

DS1 (Omari et al., 2017) contains 166 annotated webpages from 30 real-life large-scale websites categorized into books, shopping, hotels, and movies.

We transform these datasets with the following settings. First, we design instructions for each domain and each attribute as the input information for the LLMs². Second, for each website in each domain, we sample 100 webpages as the test set. We consider the set of webpages from the same website together with the corresponding extraction instruction as a test case. For example, for the ESPN website³ in the NBA player domain, the sampled 100 player detail webpages and the instruction "Please extract the team of the player he plays now" form a complete test case of our crawler generation task. Third, we pre-process the websites by removing page elements that do not contribute to the semantics: we filter out all DOM element nodes with <script> and <style> tags, and delete all attributes of element nodes except @class. We also replace the original escape characters in the annotations to ensure consistency with the corresponding information on the web.

Ultimately, we collect three datasets containing 320 / 294 / 83 test cases, covering 32 / 221 / 11 different extraction tasks, and comprising a total of 61,566 webpages from 10 different domains.

² Further details about the prompts are in Appendix D.1.
³ https://global.espn.com/nba/

2.3 Evaluation Metrics

A single LLM generation can either directly extract values from web pages or generate sequences of actions. However, existing evaluation schemes for web page extraction tasks still follow the traditional metrics of text information extraction, namely precision, recall, and F1 score. These limit the assessment of methods for the crawler generation task in two ways. First, they focus on extraction from a single webpage, rather than on generalizability across a collection of webpages. Second, they do not effectively measure transferability when the action sequence is applied to other webpages.

To address this issue, we transform the traditional IE evaluation into an executable evaluation. Based on the traditional IE evaluation over a collection of webpages, we categorize the executability of action sequences into the following six situations. Specifically, for each extraction task on a website, the result is classified according to the precision, recall, and F1 score of the extraction: (1) Correct: precision, recall, and F1 all equal 1, indicating that the action sequence is precise; (2) Precision (Prec.): only precision equals 1, indicating perfect accuracy on the instances extracted by the action sequence, but missed relevant instances; (3) Recall (Reca.): only recall equals 1, meaning that all relevant instances in the webpage are identified but some irrelevant instances are incorrectly included; (4) Un-executable (Unex.): recall equals 0, indicating that the action sequence fails to identify any relevant instance; (5) Over-estimate (Over.): precision equals 0, indicating that the action sequence extracts instances although the ground truth is empty; (6) Else: all remaining situations, including partial extraction of the information.

Since the above classifications are mutually exclusive, we use the ratio metric to calculate the proportion of each result in our task:

MR = (# cases of the situation) / (# total cases)    (1)

We are primarily concerned with the success rate, so for the Correct metric, higher values indicate a better proportion of generated execution paths, whereas for the Un-executable metric, lower values are preferable.

3 AutoCrawler

In this section, we describe our framework AutoCrawler for generating a crawler to extract specific information from semi-structured HTML. Our approach is divided into two phases: first, a progressive generation framework that utilizes the hierarchical structure of web pages; second, a synthesis framework based on the results from multiple web pages. The overall framework is presented in Figure 2.

3.1 Modeling

Unlike the wrapper method that generates a single XPath, we model the crawler generation task as an action sequence generation task. Specifically, we generate an action sequence Aseq consisting of a sequence of XPath⁴ expressions from a set of seed webpages (i.e., a small portion of the webpages in the test case, used for generating the sequence):

Aseq = [XPath1, XPath2, ..., XPathn]    (2)

where n denotes the length of the action sequence. We execute the XPaths in the sequence with the parser in order: all XPath expressions except the last one prune the webpage, and the last one extracts the corresponding element value from the pruned web page.

⁴ https://en.wikipedia.org/wiki/XPath

3.2 Progressive Generation

Given the lengthy content and hierarchical structure of webpages, generating a complete and executable crawler in one turn is difficult. However, HTML content is organized as a DOM tree, which makes it possible to prune irrelevant page components and thereby limit the length and height of the DOM tree, improving the performance of LLM generation.

Specifically, we perform a traversal strategy consisting of top-down and step-back operations.

Algorithm 1: Algorithm for progressive understanding
Data: original HTML code h0, task instruction I, max retry times dmax
Result: executable action sequence Aseq to extract the value in the HTML
1  Initialize history Aseq ← [], k = 0;
2  while True do
3      if k > dmax then break;
       // Top-down
4      value, xpath ← LLMg(hk, I);
5      result ← Parser_text(hk, xpath);
6      if result == value then break;
       // Step-back
7      repeat
8          xpath ← xpath + "/..";
9          hk+1 ← Parser_node(hk, xpath);
10     until hk+1 contains value;
11     Append(Aseq, xpath);
12     k ← k + 1;
13 end
14 return Aseq
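To make the pruning-then-extraction semantics of an action sequence concrete, the following is a minimal sketch of an executor. It assumes Python with lxml as the HTML parser; the paper does not name its parser, and the function name and toy page are illustrative only.

```python
from lxml import html

def execute_action_sequence(page_html: str, action_seq: list[str]) -> list[str]:
    """Run an action sequence of XPath expressions against one webpage.

    All expressions except the last prune the DOM down to a subtree;
    the final expression extracts the target values from that subtree.
    """
    node = html.fromstring(page_html)
    for xpath in action_seq[:-1]:      # pruning steps
        matches = node.xpath(xpath)
        if not matches:                # sequence is un-executable here
            return []
        node = matches[0]
    # final step: extraction from the pruned subtree
    return [str(v).strip() for v in node.xpath(action_seq[-1])]

page = '<div class="flex"><span>24.8</span><span>PPG</span></div>'
seq = ["//*[text()='PPG']/..", "./span[1]/text()"]
print(execute_action_sequence(page, seq))  # → ['24.8']
```

The first XPath narrows the tree to the node enclosing the "PPG" label; the second reads the sibling value, mirroring the Figure 2 example.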
                              EXECUTABLE EVALUATION                        IE EVALUATION
Models          Method        Correct(↑)  Prec   Reca   Unex.(↓)  Over.  Else    Prec   Reca   F1
Closed-source LLMs
GPT-3.5-Turbo   COT             36.75     8.83   6.71    43.46    0.71   3.53    89.45  50.43  47.99
                Reflexion       46.29    11.66   2.83    37.10    0.71   1.41    94.67  55.85  55.10
                AutoCrawler     54.84    11.83   8.96    19.35    1.08   3.94    85.85  73.34  69.20
Gemini Pro      COT             29.69    10.94   7.50    47.19    1.25   3.44    81.21  45.22  41.81
                Reflexion       33.12     6.56   4.06    52.50    0.63   3.12    87.45  42.75  40.88
                AutoCrawler     42.81    11.87   4.69    34.38    1.25   5.00    85.70  57.54  54.91
GPT4            COT             61.88    12.50   7.19    14.37    0.94   3.12    87.75  79.90  76.95
                Reflexion       67.50    13.75   4.37    10.94    0.94   2.50    93.28  82.76  82.40
                AutoCrawler     71.56    14.06   5.31     4.06    0.63   4.37    92.49  89.13  88.69
Open-source LLMs
Mistral 7B      COT              3.44     0.31   0.63    95.31    0.00   0.63    94.23   4.55   4.24
                Reflexion        2.19     0.00   0.31    97.19    0.00   0.31    95.60   2.78   2.49
                AutoCrawler      2.87     0.00   0.00    96.77    0.36   0.00    98.57   3.23   2.87
CodeLlama       COT             17.98     3.75   2.25    74.53    0.00   1.50    79.75  21.98  21.36
                Reflexion       18.08     4.80   2.95    73.06    0.00   1.11    78.96  23.26  22.44
                AutoCrawler     23.99     8.12   1.48    64.94    0.00   1.48    78.59  28.70  28.41
Mixtral 8×7B    COT             28.75     8.13   4.37    57.81    0.31   0.63    89.79  38.23  37.26
                Reflexion       36.25     6.88   3.12    51.25    0.00   2.50    89.35  44.57  43.60
                AutoCrawler     46.88    10.62   7.19    30.31    0.63   4.37    87.32  62.71  59.75
Deepseek-coder  COT             36.56    10.94   5.63    42.50    0.63   3.75    86.05  48.78  47.05
                Reflexion       37.19    11.25   4.06    44.69    1.25   1.56    86.41  48.28  47.08
                AutoCrawler     38.75    11.25   5.31    39.69    0.63   4.37    84.91  52.11  49.68

Table 1: The executable evaluation and IE evaluation of LLMs under three frameworks on the SWDE dataset. We examine 7 LLMs, including 3 closed-source LLMs and 4 open-source LLMs.
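The six executability categories reported in Table 1 follow mechanically from a test case's precision and recall, as defined in Section 2.3. A small sketch of that classification; the tie-breaking order when several conditions could hold (e.g. recall 0 checked before precision 0) is an assumption, since the paper only states that the categories are mutually exclusive:

```python
def classify_execution(precision: float, recall: float) -> str:
    """Map a test case's extraction precision/recall onto the six
    executability categories of Section 2.3.  F1 equals 1 exactly
    when precision and recall are both 1, so it needs no extra check."""
    if precision == 1.0 and recall == 1.0:
        return "Correct"
    if precision == 1.0:
        return "Precision"      # all extracted values right, some targets missed
    if recall == 1.0:
        return "Recall"         # all targets found, plus spurious instances
    if recall == 0.0:
        return "Un-executable"  # no relevant instance extracted
    if precision == 0.0:
        return "Over-estimate"  # extracts although the ground truth is empty
    return "Else"               # e.g. partial extraction

assert classify_execution(1.0, 1.0) == "Correct"
assert classify_execution(1.0, 0.4) == "Precision"
assert classify_execution(0.3, 0.0) == "Un-executable"
```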
Top-down refers to starting from the root node of the current DOM tree and progressively refining down to the specific node containing the target information. Step-back refers to reassessing and adjusting the selection criteria by moving up the DOM tree to choose a more reliable and broadly applicable node as a foundation for more consistent and accurate XPath targeting. At each step, we first employ a top-down operation, guiding the LLM to directly write out the XPath leading to the node containing the target information and to judge whether the value extracted with that XPath is consistent with the value it recognizes. If execution fails, we adopt a step-back operation, driven by the LLM, to retreat from the failed node while ensuring that the retained web page still includes the target information. The details are shown in Algorithm 1.

3.3 Synthesis

Although we obtain an executable action sequence from the progressive generation process, differences remain in the specific location of the target information and in the structure across different web pages. An action sequence may rely on XPaths with characteristics specific to a single HTML document and thus lose generalizability. To enhance the reusability of the action sequence, we propose a synthesis phase.

Specifically, we randomly select ns webpages from the test case as seed webpages. We then generate an action sequence for each of them. Subsequently, we execute each of these action sequences to extract information from all the seed webpages. We collect all action sequences and their corresponding results, and then choose as the final action sequence one that extracts all the target information from the seed webpages.

4 Experiment

Intending to put AutoCrawler to practical use, we investigate the following research questions: 1) How does AutoCrawler perform overall on the crawler generation task compared with the current state of the art? 2) How does our progressive understanding generation framework improve performance, and what is its relationship to the size of the LLM? 3) Can we rely entirely on LLMs [...]

                              EXECUTABLE EVALUATION
Models          Method        Correct(↑)  Prec   Reca   Unex.(↓)  Over.  Else
Closed-source LLMs
GPT-3.5-Turbo   COT             41.70    12.92   7.38    35.42    0.74   1.85
                Reflexion       47.23    16.24   2.21    33.21    0.37   0.74
                AutoCrawler     56.89    19.43   5.65    13.43    0.71   3.89
[...]

[...] 3) Compared to closed-source LLMs, even when provided with golden labels, open-source LLMs are unable to achieve sustained performance improvement. This phenomenon demonstrates that the bottleneck for these models lies not in understanding the webpage content but in understanding the webpage's hierarchical structure itself.

4.4 Further Study with AutoCrawler

The length of the action sequence depends on the LLM's capability. To comprehensively explore the performance of different LLMs in understanding web page structure, we examine the impact of the model on the distribution of the number of steps. In particular, we collect all the action sequences and calculate the average number of steps of AutoCrawler with different LLMs. The experimental results are reported in Table 3.

Model            len=1  len=2  len=3  len=4  len=5   Avg. steps
Mixtral 8×7B       89     53     43     24    104      3.00
Mistral 7B         28      7     11      7     84      3.82
Deepseek-coder    137     70     55     29     23      2.14
CodeLlama          75     35     32     18     80      2.97

Table 3: Length of the action sequences of AutoCrawler based on different LLMs on the SWDE dataset.

We observe that AutoCrawler with stronger LLMs generates shorter action sequences: AutoCrawler with GPT4 generates 1.57 steps on average, while AutoCrawler with Mistral 7B generates 3.82 steps on average. This phenomenon can be interpreted as more powerful models having a better understanding of the page's hierarchical structure, and thus being able to accurately output the appropriate XPaths for longer/deeper web pages, reducing the number of steps.

The "U" curve of compression ratio. We define the length of an HTML document as its number of tokens, and its height as the height of the DOM tree it represents. We define the compression ratios of length and height as the ratios of the length/height of the web page after being pruned by AutoCrawler to those of the original web page:

CompressionL = (# tokens of new HTML) / (# tokens of origin HTML)
CompressionH = (# height of new HTML) / (# height of origin HTML)    (3)

We calculate these compression ratios on the Correct cases and rank LLMs by their performance. Figure 3 shows the result. It is interesting to note that there is a "U" curve in both the length and height compression ratios. This phenomenon can be explained from two aspects: on the one hand, when the LLM is powerful, it can generate the correct XPath without stepping back to re-access the sub-DOM tree; on the other hand, when the model is weak, it cannot effectively understand the hierarchical structure of the web page, and thus cannot generate reliable, effective XPaths for it.

[Figure 3: Length and height compression ratios on the SWDE dataset, plotted across backbone models ordered by performance (GPT4, GPT-3.5-turbo, Gemini Pro, Mixtral 8×7B, Deepseek-coder, CodeLlama, Mistral 7B), showing a "U"-shaped curve.]

XPath fragility within AutoCrawler. The fragility of XPath refers to XPath expressions becoming ineffective, or inaccurately matching the target element, when faced with new webpages. This is mainly because XPath pinpoints specific information through predicates, such as text, @class, etc. We mainly focus on the fragility of text predicates, because the webpages in a test case come from the same website (i.e., @class is a good characteristic for generating stable action sequences). Table 4 shows XPath expressions that rely on text. We aim to explore the reusability of XPaths generated from text features. We manually calculated the proportion of bad cases with two types of predicates, contains [...]
Good case
  Question: Here's a webpage with detailed information about an NBA player. Please extract the height of the player.
  XPath: //div[@class='gray200B-dyContent']/b[contains(text(),'Height:')]/following-sibling::text()

Bad case
  Question: Here's a webpage with detailed information about a university. Please extract the contact phone number of the university.
  XPath: //div[@class='infopage']//h5[contains(text(), '703-528-7809')]

Table 4: Examples of XPath fragility. The good case's predicate anchors on information shared across different webpages ('Height:'), while the bad case's predicate anchors on information specific to the seed webpages ('703-528-7809').
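The contrast in Table 4 can be demonstrated on two toy pages from the same hypothetical site: an XPath whose text predicate anchors on the shared label transfers to a new page, while one anchoring on a page-specific value does not. The page snippets and XPaths below are illustrative, not taken from the datasets; lxml is assumed as the parser.

```python
from lxml import html

# Two player pages from the same site: the label text ("Height:") is
# shared across pages, while the value text differs per page.
page_a = "<div><b>Height:</b><span>6-9</span></div>"
page_b = "<div><b>Height:</b><span>7-1</span></div>"

# Good: predicate anchors on the shared label.
label_xpath = "//b[contains(text(),'Height:')]/following-sibling::span/text()"
# Bad (fragile): predicate anchors on a value seen only on the seed page.
value_xpath = "//span[contains(text(),'6-9')]/text()"

a, b = html.fromstring(page_a), html.fromstring(page_b)
assert a.xpath(label_xpath) == ["6-9"] and b.xpath(label_xpath) == ["7-1"]
assert a.xpath(value_xpath) == ["6-9"] and b.xpath(value_xpath) == []  # breaks
```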
Colin Lockard, Prashant Shiralkar, and Xin Luna Dong. 2019. OpenCeres: When open information extraction meets the semi-structured web. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3047–3056.

Kaixin Ma, Hongming Zhang, Hongwei Wang, Xiaoman Pan, and Dong Yu. 2023. LASER: LLM agent with state-space exploration for web navigation.

Adi Omari, Sharon Shoham, and Eran Yahav. 2017. Synthesis of forgiving data extractors. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pages 385–394.

OpenAI. 2022. ChatGPT.

OpenAI. 2023. GPT-4 technical report.

Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel [...]

Chenhao Xie, Jiaqing Liang, Jingping Liu, Chengsong Huang, Wenhao Huang, and Yanghua Xiao. 2021. Revisiting the negative data of distantly supervised relation extraction. arXiv preprint arXiv:2105.10158.

Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. 2023. WebShop: Towards scalable real-world web interaction with grounded language agents.

Longtao Zheng, Rundong Wang, Xinrun Wang, and Bo An. 2024. Synapse: Trajectory-as-exemplar prompting with memory for computer control.

Xiaolin Zheng, Tao Zhou, Zukun Yu, and Deren Chen. 2008. URL rule based focused crawler. In 2008 IEEE International Conference on e-Business Engineering, pages 147–154. IEEE.

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. 2023. WebArena: A realistic web environment for building autonomous agents.

Yichao Zhou, Ying Sheng, Nguyen Vo, Nick Edmonds, and Sandeep Tata. 2021. Simplified DOM trees for transferable attribute extraction from the web. arXiv preprint arXiv:2101.02415.
A Experiments

A.1 Main results on ExtendedSWDE

Because the ExtendedSWDE dataset targets the OpenIE task (the relation is also expected to be extracted), we first map relations onto a predefined list of attributes and remove unusual ones. Specifically, we conducted experiments with 294 attributes from 21 websites selected from the ExtendedSWDE dataset.

Table 8 shows the results. Comparing with Table 1, we find that: 1) Under complex extraction task settings (multiple target values and ambiguous problem descriptions), the closed-source LLMs generate executable action sequences better than the open-source LLMs. 2) Some tasks have unclear descriptions, such as "Calendar System" and "Facilities and Programs Offered" on university websites, which hurt the wrapper generation performance of all methods.

A.2 Main results on DS1

Because DS1 contains only 166 hand-crafted webpages, with just two webpages per website, we take one webpage for inference and the other for evaluation. Moreover, since the number of seed webpages is then equal to one, we test the three methods without applying the synthesis module described in Section 3.3.

Table 9 shows the results on the DS1 dataset. Among all LLMs and the three methods, GPT4 + AutoCrawler achieves the best performance, and AutoCrawler beats the other two methods for every LLM, which is consistent with the conclusions above.

B Analysis on AutoCrawler

B.1 Ablation Study

To further justify the effectiveness of each component of AutoCrawler, we perform an ablation study. The results are shown in Table 6. They show that: 1) AutoCrawler without the second module still beats the other two baseline methods across different LLMs. 2) The second module of AutoCrawler, the synthesis module, not only improves AutoCrawler but also improves the performance of the other methods. Using more webpages for inference makes the generated wrapper more stable and better at generalizing.

                                 EXEC EVAL             IE EVAL
Models          Method         Correct(↑)  Unex.(↓)    F1
GPT-3.5-Turbo   COT              36.75      43.46      47.99
                - synthesis      27.56      57.24      34.44
                Reflexion        46.29      37.10      55.10
                - synthesis      28.62      59.01      35.01
                AutoCrawler      54.84      19.35      69.20
                - synthesis      44.52      29.33      58.44
Gemini Pro      COT              29.69      47.19      41.81
                - synthesis      27.56      57.24      33.09
                Reflexion        33.12      52.50      40.88
                - synthesis      28.62      59.01      37.60
                AutoCrawler      42.81      34.38      54.91
                - synthesis      39.46      31.56      56.48
GPT4            COT              61.88      14.37      76.95
                - synthesis      46.88      30.00      61.20
                Reflexion        67.50      10.94      82.40
                - synthesis      56.87      25.31      69.78
                AutoCrawler      71.56       4.06      88.69
                - synthesis      65.31      11.87      80.41

Table 6: Ablation study on our two-stage framework. We report Correct and Un-executable from the executable evaluation, and F1 from the IE evaluation, on the SWDE dataset.

B.2 Seed Websites

In all previous experiments, we fixed the number of seed webpages at ns = 3, which demonstrates the effectiveness of the synthesis module. In this experiment, we provide different numbers of seed webpages and test the performance of AutoCrawler with GPT4. The result is shown in Figure 4.

As the number of seed webpages increases, the correct ratio increases while the un-executable ratio decreases. This suggests that the performance of AutoCrawler can be further improved by providing more seed webpages. In addition, the improvement shrinks as the number grows, showing that there is an upper limit to how much performance can be gained by increasing the number of seed webpages.

[Figure 4: Correct and Un-executable ratios (executable metric, %) of AutoCrawler with GPT4 under different numbers of seed webpages.]

B.3 Extraction Efficiency

Suppose the number of seed webpages is ns, the number of webpages on the website is NW, the time to generate a wrapper is Tg, the time to synthesize is Ts, and the time to extract information from a webpage with a wrapper is Te. The total time for extracting all information from the website with AutoCrawler is

T1 = TG + TE = (ns·Tg + Ts) + NW·Te    (4)

Besides, if the time for an LLM to directly extract information from a webpage is Td, the total time for extracting all information from the website directly is

T2 = NW·Td    (5)

In real-world scenarios, many webpages from the same website need to be extracted. Although generating a wrapper takes more time than extracting directly from a single webpage, the extraction efficiency on all subsequent webpages improves significantly. To determine how many webpages are needed for AutoCrawler to become the more efficient option for web IE, we calculate the threshold on NW. Supposing T1 ≤ T2, we have

TG + TE = (ns·Tg + Ts) + NW·Te ≤ NW·Td    (6)

NW ≥ (ns·Tg + Ts) / (Td − Te)    (7)

Note that Tg depends on dmax in Algorithm 1 and can be roughly approximated as Tg ≈ dmax·Td. In our experimental settings, we set dmax = 5 and ns = 3. Under the further approximations Ts ≈ Td and Td ≫ Te, AutoCrawler has better extraction efficiency when a website contains more than 16 webpages.

B.4 Comparison with supervised baselines

Table 7 shows the comparison with 5 baseline models for web information extraction in the supervised learning scenario. These models are trained on webpages from some seed websites and tested on the other websites. Although the comparison is unfair because our method operates in a zero-shot setting, AutoCrawler beats most of them on F1 score. This shows that, with an appropriately designed framework, LLMs can surpass supervised learning methods on some web information extraction tasks.

Model                            F1
Render-Full (Hao et al., 2011)   84.30
FreeDOM (Lin et al., 2020)       82.32
SimpDOM (Zhou et al., 2021)      83.06
[...]

B.5 Comparison with COT & Reflexion

Figure 5 shows more intuitively the specific differences between the baselines in our experiments. The most significant difference between AutoCrawler and the other methods lies in whether the hierarchical structure of web pages is utilized to help LLMs reduce the difficulty of complex web structures. COT executes only one turn, while the others execute multiple turns and can learn from failed executions of the wrapper. Compared to the Reflexion method, AutoCrawler employs top-down and step-back operations to prune the DOM tree during each XPath generation process, thereby reducing the length of the web page. In contrast, Reflexion can only reflect and regenerate after producing an un-executable XPath, which does not effectively simplify the webpage.

C Dataset Statistic

Tables 10, 11, and 12 show detailed statistics of the semi-structured web information extraction datasets SWDE, ExtendedSWDE, and DS1.

D Prompt List

D.1 Task Prompt

Table 13 shows the task prompt we design for each attribute in SWDE.

D.2 Module Prompt

We provide a comprehensive list of all the prompts used in this study, offering a clear reference for understanding our experimental approach.

Listing 1: Prompts for ZS_COT, Reflexion and AutoCrawler
Instruction: What's the average points of the player?

COT:
  //*[text()='APG']/span[1]/text()  →  5.1

Reflexion:
  //*[text()='PTS'][1]/text()       →  19   (wrong)
  //*[text()='APG']/span[1]/text()  →  5.1

AutoCrawler:
  //*[text()='PPG']/span[1]/text()  →  17.1 (wrong)
  //*[text()='APG']/span[1]/text()  →  5.1
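In the AutoCrawler trace above, a failed XPath is not simply discarded; Algorithm 1's step-back operation widens it toward an ancestor until the retained subtree still contains the target value. A minimal sketch of that widening step, assuming Python with lxml as a stand-in for the paper's parser (function and variable names are illustrative):

```python
from lxml import html

def step_back(tree, xpath: str, value: str, max_up: int = 10):
    """Step-back sketch (Algorithm 1, lines 7-10): append '/..' to a
    failed XPath until the selected subtree contains the target value."""
    for _ in range(max_up):
        nodes = tree.xpath(xpath)
        if nodes and value in nodes[0].text_content():
            return xpath          # subtree now encloses the target
        xpath += "/.."            # retreat one level up the DOM tree
    return None                   # give up after max_up levels

page = html.fromstring(
    "<div><div class='flex'><span>17.1</span><span>APG</span></div></div>")
# The proposed XPath hits the label node, which does not hold the value:
widened = step_back(page, "//span[text()='APG']", "17.1")
print(widened)  # → //span[text()='APG']/..
```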
Table 8: The executable evaluation and IE evaluation of LLMs with three frameworks on the ExtendedSWDE dataset. We examine 6 LLMs, including 3 closed-source LLMs and 3 open-source LLMs.

Table 9: The executable evaluation and IE evaluation of LLMs with three frameworks on the DS1 dataset. We examine 6 LLMs, including 3 closed-source LLMs and 3 open-source LLMs.
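The break-even threshold derived in Appendix B.3 (Eq. (7) with Tg ≈ dmax·Td, Ts ≈ Td, and Td ≫ Te) can be checked numerically. The sketch below works in units of Td; the function name is illustrative.

```python
import math

def breakeven_pages(n_s: int, d_max: int) -> int:
    """Smallest N_W with T1 <= T2 (Eq. 7), under the paper's
    approximations T_g ~= d_max*T_d, T_s ~= T_d, T_d >> T_e.
    All times are expressed in units of T_d."""
    T_g = d_max * 1.0   # wrapper generation: ~d_max direct extractions
    T_s = 1.0           # synthesis: ~one direct extraction
    T_e = 0.0           # wrapper execution: negligible vs T_d
    return math.ceil((n_s * T_g + T_s) / (1.0 - T_e))

# The paper's setting: d_max = 5, n_s = 3 → crawler generation pays off
# once a website has more than 16 webpages.
print(breakeven_pages(n_s=3, d_max=5))  # → 16
```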