
Exploring the Potential of Generative AI for the World Wide Web

https://webdiffusion.ai/
Nouar AlDahoul⋆, Joseph Hong⋆, Matteo VarvelloΔ, Yasir Zaki⋆
⋆ New York University Abu Dhabi, Δ Nokia Bell Labs
⋆ United Arab Emirates, Δ United States of America
yasir.zaki@nyu.edu, joseph.hong@nyu.edu, matteo.varvello@nokia.com
ABSTRACT
Generative Artificial Intelligence (AI) is a cutting-edge technology capable of producing text, images, and various media content leveraging generative models and user prompts. Between 2022 and 2023, generative AI surged in popularity with a plethora of applications spanning from AI-powered movies to chatbots. In this paper, we delve into the potential of generative AI within the realm of the World Wide Web, specifically focusing on image generation. Web developers already harness generative AI to help craft text and images, while Web browsers might use it in the future to locally generate images for tasks like repairing broken webpages, conserving bandwidth, and enhancing privacy. To explore this research area, we have developed WebDiffusion, a tool that allows us to simulate a Web powered by stable diffusion, a popular text-to-image model, from both a client and server perspective. WebDiffusion further supports crowdsourcing of user opinions, which we use to evaluate the quality and accuracy of 409 AI-generated images sourced from 60 webpages. Our findings suggest that generative AI is already capable of producing pertinent and high-quality Web images, even without requiring Web designers to manually input prompts, just by leveraging contextual information available within the webpages. However, we acknowledge that direct in-browser image generation remains a challenge, as only highly powerful GPUs, such as the A40 and A100, can (partially) compete with classic image downloads. Nevertheless, this approach could be valuable for a subset of the images, for example when fixing broken webpages or handling highly private content.

1 INTRODUCTION
In the last year, generative Artificial Intelligence (AI) has emerged as a revolutionary technology capable of generating diverse forms of content, including text [18, 19], images [5, 29, 36–38], and multimedia [21, 27, 32]. This surge in interest has ignited a wave of innovative applications across various domains, from the entertainment industry with AI-powered movies [39, 45] to the realms of medicine [13, 25]. In this paper, we delve into the fascinating role of generative AI within the context of the World Wide Web.
Web developers are turning to generative AI to automate the creation of images and multimedia assets, or to speed up webpages.¹ Similarly, browser vendors are exploring the potential of generative AI running inside the browser, e.g., Summarizer by Brave [1] and MAX by Arc [4]. In the future, Web media like images could be locally generated by the browser, offering solutions to rectify broken webpages, optimize bandwidth usage, and enhance privacy protection.
¹ According to HubSpot Blogs research, 93% of web designers have used an AI tool to assist with a web design-related task in the summer of 2023 [3].
To explore this intriguing intersection between generative AI and the Web, we introduce WebDiffusion, a tool that allows us to emulate a Web leveraging stable diffusion for image generation, both from a client and a server perspective. We focus on image generation for two reasons. First, images are a vital component of the Web: for example, the median webpage today is comprised of 900 KB of images (or about 44% of its total size) [20]. Second, image generation has no impact on Webcompat [14, 22, 28], differently from AI-based generation of HTML, JavaScript, etc. Still, this is an interesting research question which we leave open for future work.
Given a target webpage, WebDiffusion crawls its content, produces AI-based images, and performs experiments with unmodified browsers. Last but not least, WebDiffusion integrates with Prolific [35], a popular crowdsourcing platform, to gather worldwide eyeballs reporting on the quality of Web images and webpages produced by generative AI. This component of WebDiffusion is currently live at https://webdiffusion.ai/. We use WebDiffusion to investigate several research questions.
∙ Can Web images be automatically generated? We identify the top 200 webpages from Tranco's top one million which contain at least an image, load correctly, and are written in English. From these pages, we collect 1,870 images for which we generate annotations (prompts) to be used for image generation. We consider client-side information only, i.e., alt-text and the content of the webpage, to emulate a generative AI running in the browser, and server-side information, i.e., with the addition of the actual image description and details as an approximation of the Web designer intent. With a user study involving 1,000 participants, we show that the client-based approach produces relevant annotations 80% of the time, while this fraction grows to 90% when considering a server-based approach.
∙ What is the quality of AI-generated webpages? From the above dataset, we randomly select 60 webpages (accounting for 409 unique images) and generate each image (and webpage) using stable diffusion. To do so, we rely on the annotations discussed above, thus producing both client-based and server-based versions of each image/webpage. We then crowdsource about 940 people (10 per image/webpage, on average) to score the quality of the AI-generated content. We find that 70-95% of the images are scored as either "fair"
or higher (“good”, “very good”, or “excellent”), and 95%
of webpages are scored as “good” or higher. The server-
based approach achieves the highest quality, thanks to the
additional contextual information available. We further in-
vestigate which image features, e.g., landscape versus a
celebrity, have the largest impact on high and low scores.
We find that images including text achieve the lowest scores,
followed by images with faces.
∙ Can image generation run in the client? By loading multiple webpages under several network conditions, our experiments show that, as of today, only powerful GPUs (like the Nvidia Tesla A40 and A100) can compete with actual image download, e.g., not dramatically slow down a webpage load. Further, this is only true for SpeedIndex, a popular web performance metric relating only to the visible portion of the screen (i.e., what is known as above the fold). We found that there is potentially a 2.5-5 second improvement
for SpeedIndex leveraging image generation. As we move to PageLoadTime, which instead requires the whole page to be downloaded, then even powerful GPUs would add tens of seconds. We thus conclude that such an approach is currently viable for applications operating on a handful of images, such as privacy-preserving advertising and webpage repairing.

Figure 1: A visualization of WebDiffusion, our tool to evaluate the applicability of text-to-image generation for the Web.
2 BACKGROUND AND RELATED WORK
Text-to-Image Generation – A variety of text-to-image generation models have been released over the last few years. These models face the so-called generative learning trilemma [44], i.e., they can only satisfy two out of three requirements that real-world applications of generative models should possess: high-quality outputs, output sample diversity, and computationally fast and inexpensive sampling. Image generation usually requires high-quality outputs, thus generative adversarial network (GAN) or diffusion models are preferred.
Thanks to recent developments, diffusion models are currently outperforming GAN models [15]. These gradually add Gaussian noise to the input, and afterwards perform a reverse denoising process, which undoes this corruption and outputs an image [44]. GLIDE and DALL-E 2 [29, 36] both use classifier-free diffusion models as a part of their image generation process. We found that these diffusion models produce very realistic outputs and have large mode coverage, ideal for image generation, but tend to be very slow due to the number of iterations required throughout the denoising processes. This is the issue that the latent diffusion model [38], the foundation of stable diffusion [30], addresses. By transforming the diffusion model to be compatible with a compressed image representation in a latent space, latent diffusion allows the retention of high-quality outputs on a wide variety of image-generation tasks quickly and in a computationally efficient manner [30]. Fulfilling the three aforementioned requirements, latent diffusion has potential for real-world applications, which is our motivation for selecting stable diffusion as the image generation model for this paper.
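To make the denoising view concrete, the standard DDPM formulation (our addition; the paper does not spell these equations out) corrupts an image x_0 with a fixed Gaussian chain and learns its reverse:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right), \qquad p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right),$$

where β_t is a small noise schedule and t = 1, …, T. Latent diffusion runs the same chain on a compressed latent z = E(x) produced by an autoencoder, rather than on raw pixels, which is what makes sampling fast enough for practical use.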
Image Generation for the Web – To the best of our knowledge, no previous work has investigated the role of generative AI for Web image consumption. The closest work to this topic is [41], where the authors propose to replace Web images with similar content, e.g., a similar picture of the Golden Gate bridge taken from a different angle. Their rationale is to "race" similar Web images – identified via reverse image searches – to speed up webpage loads. The similarity with our work lies in the idea that Web images are not always "strict", i.e., users might be willing to accept images similar to the originals if this comes with some benefits, such as the page speedups in [41]. The key difference with our work is that we broadly explore the field of generative AI when applied to Web images, considering both a client and server standpoint.

3 WEBDIFFUSION
This section presents WebDiffusion, a tool which allows us to explore and emulate a Web leveraging stable diffusion for image generation. WebDiffusion aims at evaluating both a server and a client mode. The server mode refers to a scenario where Web designers rely on stable diffusion for image generation. The client mode refers to a scenario where image generation is fully run in the browser, e.g., to generate privacy-preserving advertisement or fix broken URLs.
At a high level, WebDiffusion consists of the four components visualized in Figure 1. First, a crawler which makes local copies of actual webpages, while annotating embedded images with textual prompts using two techniques discussed next. Second, a proxy system to replay webpages while replacing images with textual prompts to feed to a stable diffusion server, allowing the emulation of an AI-powered Web without requiring an in-browser implementation. Third, a stable diffusion server which turns these textual prompts into images (Section 3.3). Finally, a Web interface which collects human feedback on the accuracy of AI-generated images and webpages. In the remainder of this section, we describe each component of WebDiffusion.
3.1 Crawler
The first task of WebDiffusion's crawler is to generate local copies of webpages. These webpages can be analyzed to investigate the feasibility of the image annotation techniques discussed below, as well as "replayed" to benchmark the performance of AI-generated webpages. The crawler consists of a Selenium [8] application which, for a given webpage, records headers and data returned for each HTTP(S) request along with its duration. Crawled webpages are stored in a database so that they can then be replayed via the proxy. Next, the crawler runs the following annotation schemes to obtain textual prompts for each image embedded in a given webpage. Each textual prompt is also stored in the database along with the technique used for its derivation.
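The recording step is not spelled out beyond the use of Selenium [8]; a minimal sketch of it, using the selenium-wire package (our assumption) to capture headers, bodies, and durations, could look as follows, with db.save standing in for WebDiffusion's database:

```python
# Sketch of the crawling step: load a page and record every HTTP(S)
# request/response plus its duration. selenium-wire (an assumption;
# the paper only mentions Selenium) wraps Selenium with traffic capture.
from seleniumwire import webdriver

def crawl(url, db):
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        for req in driver.requests:
            if req.response is None:
                continue  # request never completed
            db.save(  # hypothetical database helper
                url=req.url,
                request_headers=dict(req.headers),
                response_headers=dict(req.response.headers),
                body=req.response.body,
                duration=(req.response.date - req.date).total_seconds(),
            )
    finally:
        driver.quit()
```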
Client-Based: This technique aims at being deployable today without any server-side support, meaning that it works solely relying on contextual information, e.g., the alt text sometimes provided along with an image (about 50% of the time [34]). The trade-off is a potential lack of accuracy, as the browser is forced to "guess" the content of an image without having access to it.
This mechanism works as follows. For each image tag in a page's HTML that has either a src or background-image: url attribute in its CSS styling, we look for its alt text (or alt tag), whose goal is to provide a description of an image in case the actual image is missing. In addition, we also iterate through the different div elements associated with the image, looking for potential text within <p>, <h1>, <h2>, <h3> and <h4> tag elements, and then iterate through their children nodes to extract the text associated with them. This is often where the title and the description of an image are stored. We accumulate all the text associated with these different elements to form the client-based textual prompt to be used to generate the image. The crawler also recursively traverses the parent div nodes of the image div to extract all the text related to the image; it stops when it discovers the presence of another image div and reverts back to the previous node.
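A minimal sketch of this extraction, written against BeautifulSoup (our choice; the paper does not name an HTML parser), could look like this; the html string is assumed to hold the crawled page:

```python
# Sketch of client-based annotation: combine an image's alt text with
# text from <p>/<h1>-<h4> nodes in its enclosing divs, walking up the
# tree and stopping at a div that also contains a different image.
from bs4 import BeautifulSoup

TEXT_TAGS = ["p", "h1", "h2", "h3", "h4"]

def client_prompt(img):
    parts = [img.get("alt")] if img.get("alt") else []
    for div in img.find_parents("div"):
        # Stop once the enclosing div contains another image: its text
        # likely describes that image instead.
        if any(other is not img for other in div.find_all("img")):
            break
        for node in div.find_all(TEXT_TAGS):
            parts.append(node.get_text(" ", strip=True))
    # Deduplicate while preserving order, then join into one prompt.
    seen, prompt = set(), []
    for text in parts:
        if text and text not in seen:
            seen.add(text)
            prompt.append(text)
    return "; ".join(prompt)

soup = BeautifulSoup(html, "html.parser")
prompts = {img.get("src"): client_prompt(img) for img in soup.find_all("img")}
```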
Server-Based: This technique implies some server-side support, e.g., from Web developers, to annotate images. It extends the client-based approach with an image annotation obtained via Microsoft's GenerativeImage2Text transformer [26, 40], which derives a textual prompt related to an input image. Note that leveraging GenerativeImage2Text allows us to build an approximation which can be evaluated today without support from Web developers.

Figure 2: WebDiffusion's GUI. The top of the figure shows the text originated using the client-based approach, i.e., eventual alt text and nested divs information. The center of the figure shows both the original and the AI-generated images. The bottom of the figure allows study participants to score the quality of this AI-generated image, between 1 (poor) and 5 (excellent).
3.2 Proxy System
In theory, stable diffusion can be integrated into any open source Web browser. In practice, this is a time-consuming effort which can anyway lead to suboptimal performance, as browsers are very complex software. Instead, WebDiffusion offloads image generation to an external machine (the stable diffusion server) and leverages a proxy system to intercept and operate on Web traffic (see Figure 1). This setup allows WebDiffusion to control how images should be served, either in a classic way or generated by the stable diffusion server.
As shown in Figure 1, legacy clients – mobile or desktop running any browser – are instructed to forward all their HTTP(S) traffic via mitmproxy [11]. This requires installing mitmproxy's root CA (Certificate Authority) to handle HTTPS traffic. Note that this step is only required for testing purposes, as a browser implementation would not require it. We assume all traffic is HTTP/2 [10].
Our experimental proxy system consists of two mitmproxy proxies: one handling only requests for images, and one serving all the other requests. These two proxies are set in the testing client using a simple Proxy Auto-Configuration (PAC) file [7] with a regular expression matching requests for images towards one proxy and vice-versa.² This is necessary to allow the flexibility to emulate network connectivity (realized via Linux tc [12]) at the link between legacy clients and the proxy not handling image requests. In fact, the network characteristics of this link should not affect the images generated by the stable diffusion server, since we are emulating a client-side implementation. This is ensured as long as a fast connection (1 Gbps) is available between client, image proxy, and stable diffusion server.
² Note that this also requires setting the browser's PacHttpsUrlStrippingEnabled policy to false so as to enable full URL visibility in the PAC file with HTTPS.
For each request received, our proxy system performs a database lookup. For non-images, this lookup returns the content to be served. For images, the database lookup might return some textual prompt for image generation. This depends on the experiment running, e.g., some prompts might be missing when testing the client-based approach (see Section 3.1). In the presence of a textual prompt, the proxy contacts the stable diffusion server, which generates the image. The stable diffusion server can be configured with different hardware to emulate different client capabilities. When the image is generated, it is returned to the proxy, which forwards it back to the client.
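The image-side interception can be sketched as a small mitmproxy addon; PROMPTS and the stable diffusion endpoint below are hypothetical stand-ins for WebDiffusion's database and server:

```python
# Sketch of the image proxy: intercept an image request, look up its
# stored textual prompt, and serve an AI-generated image instead of
# fetching the original. Run with: mitmproxy -s this_file.py
import requests
from mitmproxy import http

SD_SERVER = "http://sd-server:8000/generate"  # assumed endpoint
PROMPTS = {}  # url -> prompt; stand-in for the WebDiffusion database

def request(flow: http.HTTPFlow) -> None:
    prompt = PROMPTS.get(flow.request.pretty_url)
    if prompt is None:
        return  # no prompt stored: let the request through unchanged
    png = requests.post(SD_SERVER, json={"prompt": prompt}, timeout=60).content
    flow.response = http.Response.make(
        200, png, {"Content-Type": "image/png"}
    )
```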
3.3 Stable Diffusion Server
Given the prompts derived using any of the above techniques, we then rely on stable diffusion for automated image generation. We chose stable diffusion for the reasons discussed in Section 2, and because of its open source nature. We deployed the stable diffusion (SDXL) implementation of huggingface.co using their stabilityai/stable-diffusion-xl-base-1.0 model [16][33]. It is a latent diffusion model with a three times larger UNet backbone for text-to-image synthesis that shows remarkable performance compared to previous versions of stable diffusion. SDXL uses a second text encoder, OpenCLIP ViT-bigG/14, and multiple novel conditioning schemes [33]. To create prompts from images, as needed by the server-based technique explained earlier, we have also utilized the large-sized GenerativeImage2Text transformer from microsoft/git-large-coco [40][26], which was fine-tuned on COCO to generate an image caption to be used as a text prompt. The architecture of the GenerativeImage2Text transformer consists of one image encoder and one text decoder under a single language modeling task [40].
The quality of images produced by stable diffusion increases with the number of inference steps, which in turn requires a longer generation time. The number of inference steps relates to how many denoising steps the model will use to improve the quality of the generated images. Generally, 50 denoising steps are sufficient to generate high-quality images. While experimenting with stable diffusion in the context of the Web, we realized that some textual prompts benefit more than others from more iterations, e.g., human faces versus a landscape. In this paper, we settle for a constant number of iterations (20) for fairness between images from different webpages. This number was chosen as a good compromise between quality and speed (about 1 second image generation time on our machine). From the different Web images that were generated, we have noticed that the above number of inference steps is enough for generating a large portion of the Web images, where any image that does not contain text or faces had a good quality (for example food, landscapes, objects, animals, etc.). However, an interesting direction for future work is to explore dynamic stable diffusion settings based on the content of a textual prompt, and potentially additional webpage information. Apart from the number of inference steps, all other stable diffusion parameters were kept to the default values: guidance scale (5), and width and height (1024x1024 pixels). We randomly generated the seed to have different images generated each time.
Our stable diffusion server currently runs on a Linux server equipped with an Nvidia Tesla A40 GPU [6] with 48 GB of GDDR6 memory with error-correcting code (ECC), 300 W maximum power consumption, and a 696 GB/s GPU memory bandwidth. The server specifications are: Intel Xeon(R) Bronze 3204 CPU @ 1.90GHz x 12, with 16 GiB RAM, running Ubuntu 20.04.5 LTS.
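Both models are available through standard Hugging Face pipelines; a sketch of the two steps with the settings reported above (file names and device placement are illustrative) might look like:

```python
# Sketch of WebDiffusion's two model calls: GIT (microsoft/git-large-coco)
# captions an original image for the server-based prompt, and SDXL
# generates the replacement image (20 steps, guidance scale 5, 1024x1024,
# random seed, as reported in the text).
import torch
from PIL import Image
from diffusers import StableDiffusionXLPipeline
from transformers import AutoProcessor, AutoModelForCausalLM

# Server-based annotation: caption the original image with GIT.
processor = AutoProcessor.from_pretrained("microsoft/git-large-coco")
git = AutoModelForCausalLM.from_pretrained("microsoft/git-large-coco")
pixels = processor(images=Image.open("original.jpg"),
                   return_tensors="pt").pixel_values
ids = git.generate(pixel_values=pixels, max_length=50)
prompt = processor.batch_decode(ids, skip_special_tokens=True)[0]

# Image generation with SDXL.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
image = pipe(prompt,
             num_inference_steps=20,  # constant across webpages
             guidance_scale=5.0,      # default value
             height=1024, width=1024).images[0]
image.save("generated.png")
```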
3.4 Web Interface
Last but not least, WebDiffusion offers a Web interface at https://webdiffusion.ai/. On this page, users can explore Web images or full webpages automatically generated using the above annotation schemes (see Appendix Figures 7, 8 and 9), while providing some feedback on the quality of the generated content. The Web interface accepts a parameter (?type=[images, webpages]) to control whether to show images or full webpages, and which annotation scheme to use – the default is server-based, while client-based is activated by adding the suffix _client.
Figure 2 shows a visualization of WebDiffusion's Web interface when instrumented to show an image comparison. At the top of the page, some textual prompt is offered to describe the images. This text was obtained using the techniques described above for image annotation. Next, two images are shown. One image is an original Web image pulled from a given website; the other image is originated via stable diffusion using the above textual prompt. Original images are displayed on the left, whereas the AI-powered images are displayed on the right. Both images are clearly labeled. At the bottom of the page, the user is asked to rate the quality of the AI-generated image between 1 (poor) and 5 (excellent).
We integrated WebDiffusion's Web GUI with Prolific [35], a popular crowdsourcing website which allows us to quickly scale user studies. We used a standard 5-point Likert scale with a theme focusing on evaluating quality. The theme uses the following labels (also known as "Likert Scale Response Anchors"): (1) poor, (2) fair, (3) good, (4) very good, and (5) excellent. Notice that we have also provided the numerical values associated with each label so as to give the participants a sense of the distance between the labels, and to help us process the data qualitatively during our analysis phase.

4 ON THE FEASIBILITY, QUALITY, AND PERFORMANCE OF AI-GENERATED WEBPAGES
This section evaluates the applicability of generative AI for the Web, specifically focusing on stable diffusion for image generation [5, 36, 38]. Our evaluation revolves around three main parts. First, we crawl 200 webpages and quantify the feasibility of the techniques discussed in the previous section for image annotation, which is paramount for the generation of the prompts used for image generation. We consider both client-based annotations, i.e., a best effort to locate webpage text which should refer to an image, and server-based annotations, where we assume access to the original image as an approximation of the Web designer intent. Next, we evaluate
the quality of AI-generated webpages, i.e., webpages where images are automatically generated via stable diffusion [38]. We do this by performing several user studies involving about 900 participants provided by Prolific [35]. Finally, we benchmark the performance of AI-generated webpages using the following web performance metrics [17]: SpeedIndex, Page Load Time, and bandwidth usage. We use WebDiffusion to perform all the experiments discussed in this section, leveraging WebPageTest [43] and Chrome to automate the webpage loads and collect the browser telemetry data.

4.1 Can Web Images Be Automatically Generated?
We crawl (using Selenium) the top 500 webpages from Tranco's top one million list. About 25% of these webpages fail to load; for the remaining 75% of webpages, the crawler achieves a 70% success rate (260 webpages); failures are due to webpage formatting, e.g., lack of img tags with clear src or background-image attributes. Next, we filter webpages hosting inappropriate or non-English content, given that stable diffusion currently operates only with English prompts. This results in 200 webpages with a total of 1,870 images for which some client-based annotation is produced. We finally run the GenerativeImage2Text transformer to derive the extra annotation needed by the server-based scheme. This allowed us to generate a textual description of the original images.
Client-based annotation is a best-effort attempt; as such, potential mistakes are possible. The output of the GenerativeImage2Text transformer is instead highly accurate – given its usage of the BLIP image-captioning technique, which is shown to outperform all state-of-the-art techniques [24] – but generic. For example, an image of the Golden Gate bridge might generate a label such as "a bridge with some clouds in the background"; conversely, the text in the webpage might describe more precisely which kind of bridge and its scenery (see Figure 6 in the appendix for additional examples on the difference between both texts).
We evaluate the correctness of the annotations using Natural Language Processing (NLP) of sentence similarity. We use the bert-base-nli-mean-tokens [9] model to generate text embeddings for both annotation schemes, each containing 768 values; we then compute the cosine similarity [23] between both texts' embeddings. The results show that both the median and average similarities are about 50% (with the 75th percentile at 60%, and the 25th percentile at 40%). This result is encouraging since it suggests some intersection between the two annotation schemes. Of course, a perfect intersection is unlikely given that the output of the GenerativeImage2Text transformer is highly generic. However, systematic misbehavior of the client-based annotation scheme would result in much lower cosine similarity values.
To further comment on the quality of automated web image annotations, we perform a user study. We extend WebDiffusion's Web interface (activated with parameter ?type=[scale, scale_prompt], see Section 3.4) to show a single (original) image along with some descriptive text, and ask how relevant such text is to the image. Scores are in the range of "completely irrelevant" to "very relevant". We also allow study participants to respond "cannot judge", for the cases when participants are not knowledgeable of the image, e.g., when the text refers to a celebrity they do not know.

Figure 3: CDF of how relevant client-based and server-based annotations are with respect to original images extracted from webpages (1,870 images from 200 webpages evaluated by 1,000 people, each evaluating 10 random images). Scores range from "completely irrelevant" to "very relevant", plus "cannot judge".

Figure 3 shows the Cumulative Distribution Function (CDF) of the median score (10,000 scores from 1,000 people) for the 1,870 images we have collected from 200 websites, distinguishing between the client-based and server-based annotations shown on top of each image. The figure shows a success rate – score higher than "irrelevant" – of 80% for the client-side approach, and close to 90% for the server-side approach. This result has two important implications. First, not even the highly accurate GenerativeImage2Text is capable of being 100% "relevant", according to our study participants. This is not unexpected, especially with no control on the input images, which can assume a very abstract form (see Figure 6 in the Appendix for sample images with their text description). It follows that the accuracy of the client-based approach is quite high, as it is only 10% worse than the best attainable by a sophisticated AI.
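The annotation-similarity check above is easy to reproduce; a minimal sketch with the sentence-transformers package (the prompt pair is illustrative) is:

```python
# Sketch of the NLP similarity check: embed a client-based and a
# server-based prompt with bert-base-nli-mean-tokens (768-d vectors)
# and compare them with cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("bert-base-nli-mean-tokens")

client = "the Golden Gate bridge and the San Francisco bay at sunset"
server = "a bridge with some clouds in the background"
emb = model.encode([client, server], convert_to_tensor=True)
similarity = util.cos_sim(emb[0], emb[1]).item()  # in [-1, 1]
print(f"cosine similarity: {similarity:.2f}")
```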
(a) CDF of median score per image comparison, distinguishing between client and server-based image generation. (b) Boxplot of median image scores as a function of their tags (food, landscape, object, hand, animal, celebrity, person, face, text); server-based image generation. (c) CDF of median score per webpage comparison, distinguishing between client and server-based image generation.

Figure 4: Results from the user study. Client-based refers to images generated on the client side, i.e., only using contextual information like alt text. Server-based refers to images generated on the server side, i.e., assuming availability of both contextual information and the original image. Scores use the 5-point scale from poor (1) to excellent (5).

4.2 What Is the Quality of AI-Generated Webpages?
This subsection investigates the quality of AI-generated webpages using user feedback collected on Prolific [35]. We first collect feedback per image composing a webpage, and then focus on the full webpages. Table 1 summarizes the user study. We collect 10 pieces of user feedback for the two versions (client-based and server-based) of the 409 images extracted from 60 websites randomly selected from the 200 previously crawled. Next, we also collect 10 pieces of user feedback for the two versions of each webpage, for a total of 9,400 responses. We ask Prolific to offer high quality participants who are fluent in English, so that they can understand the image descriptions. Among the 940 study participants, the minimum approval rate, i.e., the past rate of approval of their work, we recorded was 94%, with 80% of the participants having an approval rate of 99-100%.

                Images (Server)  Images (Client)  Webpages (Server)  Webpages (Client)
Size            409              409              60                 60
Participants    410              413              60                 60
Responses       4,110            4,160            600                600
Age Range       19-63            19-63            20-40              18-52
Gender (M/F)    239/171          197/216          20/10              20/10
# Countries     29               26               9                  10

Table 1: User study summary. Server is short for server-based, and client is short for client-based.
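The per-image scores discussed next are aggregated as medians over the 10 Likert responses; a small sketch of this step (the responses mapping is a hypothetical stand-in for the Prolific export) is:

```python
# Sketch of the score aggregation: per image, take the median of the
# ten 1-5 Likert responses, then summarize the distribution of medians.
import statistics

responses = {  # image id -> ten scores (1=poor ... 5=excellent)
    "img_001": [4, 5, 4, 3, 4, 4, 5, 4, 4, 3],
    "img_002": [2, 3, 2, 2, 3, 1, 2, 2, 3, 2],
}

medians = {img: statistics.median(s) for img, s in responses.items()}
# Fraction of images whose median is "fair" (2) or higher.
fair_or_better = sum(m >= 2 for m in medians.values()) / len(medians)
print(medians, f"{fair_or_better:.0%}")
```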
We start by investigating the quality of AI-generated Web images. Figure 4(a) shows the CDF of the median score (from 10 users) among images, distinguishing between client-based and server-based. The figure shows that, regardless of the approach, the majority of scores are positive, i.e., "fair" or higher. Precisely, 95% of the median scores are better than or equal to "fair" for server-based image generation; this number drops to 70% in the client-based approach. A reduction is expected given the more challenging conditions (lack of access to the original image) in which the client-based approach operates.
Next, we dig deeper into the collected scores with respect to the content of the images. We manually tag each (original) image using the set of tags shown on the x-axis in Figure 4(b); we limit ourselves to a maximum of three tags per image, by considering the three predominant features. For example, the image in Figure 2 is labelled as "object". Note that we use the tags "person" and "face", which might seem redundant; however, these two tags are never used together, and their selection is based on whether a face is the main feature of an image. For example, a closeup of a person would be labelled as "face", whereas a group of people working is labelled as "person".
Figure 4(b) shows one boxplot per image label, where each boxplot contains the median scores of all the images tagged with such a label. We only show results for the server-based image generation scheme since the client-based approach shows a similar trend. If we focus on the median, the figure shows that images containing "food", "landscape", and "object" achieve the best performance, with a median score of "very good" and rarely showing negative scores. Next, images labelled as "hand" or "animal" also perform well, with medians comprised between "good" and "very good". It is worth noting that for all of the above categories even the 25th percentile of the boxplot still achieves a score equal to or above "good". When it comes to images related to humans (those labelled with "celebrity", "person", or "face"), we notice that their median scores are still equal to or above "good"; however, their 25th percentile falls below that, even reaching a score of "fair" in the case of the "face" label. The only images whose median scores fall close to the "fair" score are those containing text. However, even in this category, the 25th percentile of the boxplot does not fall below the "fair" score. We conjecture that these low scores are due to the fact that much of the text generated by stable diffusion is either not completely readable, or does not make sense, due to the inability of stable diffusion to reproduce accurate text. The opposite is instead responsible for the higher scores, e.g., stable diffusion seems able to generate very convincing food images.
Finally, we comment on the quality of AI-generated webpages. Figure 4(c) shows the CDF of the median score (from 10 users) among webpages, distinguishing between the client-based and server-based approach. Compared to Figure 4(a), the overall score has improved, with no scores being "poor", and only a few being "fair". Indeed, 55-68% of the scores are either "very good" or "excellent".
(a) CDF of the delta for FCP, SI, and PLT in seconds (with two different GPUs, A40 and A100, across slow, average, and fast networks). (b) SD bandwidth savings in MB. (c) SD inference time in seconds across multiple GPU types (V100, A40, A6000, A100).

Figure 5: Reality check on running stable diffusion in the client for image generation.

The reason for the score increase compared to the previous image-to-image comparison is that webpages are more complicated, composed of a collection of text and images. While for the images study participants can focus on low-level details, such as potential hand deformities, this is less likely on a webpage, where eyeballs tend to move quickly, potentially ignoring some images. The results for the AI-generated webpages show similar curves for both the client-based and the server-based approach, apart from the gap in the region between "good" and "very good". In this region, about 10% of the images have scored higher in the case of client-based ("very good"), and lower in the case of server-based ("good"). We conjecture that the quality difference of these images is so subtle that deciding which of the two scores, "good" vs. "very good", these images should be assigned becomes a very difficult and subjective decision.
4.3 Can Image Generation Run in the Client?
We conclude our analysis by performing a reality check on running stable diffusion directly in the client. This technology offers several benefits, such as additional privacy and rectification of broken webpages, but comes at the cost of expensive hardware and/or potential webpage slowdowns. For these experiments, we assume a powerful desktop machine similar to our stable diffusion server, i.e., mounting powerful GPUs like the A40 or the A100, since a mobile phone's hardware is far from being capable of running such complex models. For example, as of June 2023, it took an iPhone 13 about 15.6 seconds to generate a single image using stable diffusion leveraging Apple's new Core ML framework [31].³ For this reason, we emulate typical speeds of home connections [2] and ignore slower speeds like 3G. Network emulation is realized using tc (see Figure 1) instructed via the following connectivity profiles: slow (symmetric 20 Mbps and 100 ms RTT), average (symmetric 50 Mbps and 50 ms RTT), and fast (symmetric 100 Mbps and 20 ms RTT). We use Chrome to load the previous 60 webpages (repeating each 5 times), both in their original form and considering AI-generated images. We use textual prompts derived with the server-based scheme, although we did not measure any difference in the duration of image generation between the two approaches.
³ This is viewed as a significant improvement compared to the 23 seconds generation time back in December 2022.
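The profiles above can be applied with Linux tc/netem; a minimal sketch (the interface name, and applying the full RTT as a single egress delay, are simplifying assumptions) is:

```python
# Sketch of the network emulation step: shape rate and add latency on
# the link toward the non-image proxy with tc/netem (requires root).
import subprocess

PROFILES = {
    "slow":    {"rate": "20mbit",  "delay": "100ms"},
    "average": {"rate": "50mbit",  "delay": "50ms"},
    "fast":    {"rate": "100mbit", "delay": "20ms"},
}

def emulate(profile: str, iface: str = "eth0") -> None:
    p = PROFILES[profile]
    # Remove any previous qdisc, then install rate limit plus delay.
    subprocess.run(["tc", "qdisc", "del", "dev", iface, "root"], check=False)
    subprocess.run(["tc", "qdisc", "add", "dev", iface, "root", "netem",
                    "rate", p["rate"], "delay", p["delay"]], check=True)

emulate("average")
```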
Web Performance: We focus on two popular Web performance metrics: SpeedIndex (SI) and Page Load Time (PLT) [17]. SI measures how quickly the visible content of a webpage is displayed during a page load. Conversely, PLT measures the total time it takes the browser to download and visualize the entire webpage, including content located below the fold. Figure 5(a) shows the CDF of the delta for each metric, derived as the difference between the metric computed for the original webpage and the metric computed for the webpage when images are AI-generated. It follows that a negative value indicates a slowdown, and a positive value indicates a speedup. Each delta value in the plot is the median out of 5 runs.
The figure shows that powerful GPUs (A40 and A100) can compete with actual image downloads when considering SI, e.g., 63% (A40) and 87% (A100) of the websites benefit from a speedup even in presence of a very fast network. This happens because, on average, only 2-5 images are loaded above the fold, and the transport might need some time to converge to the available bandwidth, e.g., due to connection establishment and window increase. Furthermore, certain images positioned above the fold have substantial file sizes. However, when we consider PageLoadTime, even on a very slow network almost no webpage loads faster when images are locally generated. Whether this might change in the future is a question of how hardware and AI models will evolve, in contrast with network performance. Regardless, this result also suggests that locally generating images is already viable for applications operating on a handful of images, such as privacy-preserving advertising and webpage repairing – granted that powerful GPUs are adopted.
Resource Usage: Figure 5(b) quantifies the bandwidth savings (MB) due to not transferring Web images, which are instead locally generated. Overall, the majority of webpages we tested enjoy multiple MBs of savings (median of 1.3 MB), and even
more than 5 MB for the 10% heaviest webpages. These savings are marginal for devices on WiFi, the likely connectivity scenario for the powerful devices we are assuming. However, these savings can be significant for content providers. For example, let's consider a popular newspaper with 1 million daily views; a median saving of 1.3 MB per view would account for daily savings of 1.3 TB.
Finally, Figure 5(c) shows the boxplot of the stable diffusion image generation time (i.e., inference time), measured when generating our 409 images with four different popular GPUs. We can see that in the case of the V100 and A40 the median inference time is about 1.1 seconds. However, with more powerful GPUs such as the A100, the median inference time drops down to about 500 ms. It is also worth noting that the boxplots are extremely narrow, suggesting that the inference time does not vary across multiple textual prompts.

5 CONCLUSIONS AND FUTURE WORK
Over the past year, we have observed a significant increase in the adoption of generative Artificial Intelligence (AI) across a range of fields, including medicine and media production. Similarly, Web developers and browser vendors are turning to generative AI for various applications like multimedia asset creation, webpage acceleration, and privacy enhancements.
In this paper, we have delved into the role of generative AI in the specific context of web image generation and consumption. To accomplish this, we have introduced WebDiffusion, a tool that facilitates real-time experiments using unaltered web browsers, thus enabling an in-depth assessment of the quality and effectiveness of AI-generated images within target webpages. Moreover, by seamlessly integrating with crowdsourcing platforms such as Prolific, WebDiffusion offers a valuable avenue for collecting feedback and insights from a diverse global audience. Our results indicate that generative AI can already create relevant and high-quality web images without the need for manual prompts from web designers, relying on contextual information found within webpages. Conversely, in-browser image generation remains a challenging task, as it requires highly capable GPUs like the A40 and A100 to partially match traditional image downloads. Nevertheless, this approach holds promise for small-scale image generation tasks, such as repairing broken webpages or serving privacy-preserving media.
This paper opens several interesting avenues for future work. First, stable diffusion often produces artefacts when generating images of faces and hands, even with a large number of iterations (> 100, increasing the image generation time to above 15 seconds), especially when there is more than one subject in the produced image. While stable diffusion is constantly improving in this respect, one could consider extending WebDiffusion to other text-to-image generation models. For example, the GFPGAN model is a popular method that addresses the issue of distorted faces. This model helps restore faces in samples and improves the overall realistic quality of the generated image [42].
Next, this paper currently only focuses on image generation. While images are a fundamental component of webpages, and account for a large fraction of the page size, there are other elements like HTML, CSS, and JavaScript which can be (partially) automatically generated. Such generation is however more challenging as, differently from images, it can impact Web compatibility, i.e., break the functioning of webpages. Nevertheless, with the constant improvement of generative AI, it is possible to envision how these tools can be used in a broader sense for the Web, and we thus plan to extend WebDiffusion to support them.

6 ETHICS
An institutional review board (IRB) approval was granted to conduct the study, and the authors who conducted the study are CITI certified.

REFERENCES
[1] Brave search introduces the summarizer, an AI tool for synthesized, relevant results. https://brave.com/ai-summarizer/. Accessed: 2023-10-09.
[2] Etisalat internet and landline plans. https://www.etisalat.ae/b2c/eshop/viewProducts?category=homePlans&subCategory=cat1100023&catName=eLife_plan&listVal=eLife_plan&locale=EN. Accessed: 2022-10-13.
[3] How web designers currently use AI. https://blog.hubspot.com/website/web-designers-ai. Accessed: 2023-10-09.
[4] Max - power up your browsing with AI. Let Arc do more for you, so you can do less. https://arc.net/max. Accessed: 2023-10-09.
[5] Midjourney. https://www.midjourney.com/home/.
[6] NVIDIA A40: the world's most powerful data center GPU for visual computing. https://www.nvidia.com/en-us/data-center/a40/. Accessed: 2022-10-13.
[7] Proxy auto-configuration (PAC) file - HTTP: MDN. https://developer.mozilla.org/en-US/docs/Web/HTTP/Proxy_servers_and_tunneling/Proxy_Auto-Configuration_PAC_file.
[8] Selenium. https://www.selenium.dev/.
[9] sentence-transformers/bert-base-nli-mean-tokens · Hugging Face. https://huggingface.co/sentence-transformers/bert-base-nli-mean-tokens.
[10] Hypertext transfer protocol version 2 (HTTP/2). https://datatracker.ietf.org/doc/html/rfc7540, 2015.
[11] A free and open source interactive HTTPS proxy. https://mitmproxy.org/, 2021. Accessed: 2021-04-28.
[12] W. Almesberger et al. Linux network traffic control—implementation overview, 1999.
[13] E. Callaway. How generative AI is building better antibodies. https://www.nature.com/articles/d41586-023-01516-w, 2023. Nature, News. Accessed: 2023-10.
[14] M. Chaqfeh, Y. Zaki, J. Hu, and L. Subramanian. JSCleaner: De-cluttering mobile webpages through JavaScript cleanup. In WWW '20, pages 763–773, New York, NY, USA, 2020. Association for Computing Machinery.
[15] P. Dhariwal and A. Nichol. Diffusion models beat GANs on image synthesis. CoRR, abs/2105.05233, 2021.
[16] Hugging Face. Stable Diffusion XL, 2023. https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/stable_diffusion_xl.
[17] T. Hoßfeld, F. Metzger, and D. Rossi. Speed index: Relating the industrial standard for user perceived web performance to web QoE, pages 1–6, 2018.
[18] M. Hutson. Could AI help you to write your next paper? https://www.nature.com/articles/d41586-022-03479-w, 2022. Nature, Technology Feature. Accessed: 2023-10.
[19] H. Ibrahim, F. Liu, R. Asim, B. Battu, S. Benabderrahmane, B. Alhafni, W. Adnan, T. Alhanai, B. AlShebli, R. Baghdadi, et al. Perception, performance, and detectability of conversational artificial intelligence across 32 university courses. Scientific Reports, 13(1):12187, 2023.
[20] C. Kelton, M. Varvello, A. Aucinas, and B. Livshits. Browselite: A private data saving solution for the web, pages 305–316, 2021.
[21] L. Kumar and D. K. Singh. A comprehensive survey on generative adversarial networks used for synthesizing multimedia content. Multimedia Tools and Applications, pages 1–40, 2023.
[22] J. Kupoluyi, M. Chaqfeh, M. Varvello, R. Coke, W. Hashmi, L. Subramanian, and Y. Zaki. Muzeel: Assessing the impact of JavaScript dead code elimination on mobile web performance. In IMC '22, pages 335–348, New York, NY, USA, 2022. Association for Computing Machinery.
[23] B. Li and L. Han. Distance weighted cosine similarity measure for text classification. In H. Yin, K. Tang, Y. Gao, F. Klawonn, M. Lee, T. Weise, B. Li, and X. Yao, editors, pages 611–618, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg.
[24] J. Li, D. Li, C. Xiong, and S. Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation, 2022.
[25] B. Meskó and E. J. Topol. The imperative for regulatory oversight of large language models (or generative AI) in healthcare. npj Digital Medicine, 6(1):120, 2023.
[26] Microsoft. GIT (GenerativeImage2Text), large-sized, fine-tuned on COCO, 2023. https://huggingface.co/microsoft/git-large-coco.
[27] A. Music. AI music composition tools for content creators.
[28] U. Naseer, T. A. Benson, and R. Netravali. WebMedic: Disentangling the memory-functionality tension for the next billion mobile web users. In HotMobile '21, pages 71–77, New York, NY, USA, 2021. Association for Computing Machinery.
[29] A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. ArXiv, abs/2112.10741, 2021.
[30] S. Patil, P. Cuenca, N. Lambert, and P. von Platen. Stable diffusion with diffusers. https://huggingface.co/blog/stable_diffusion.
[31] P. Cuenca. Faster Stable Diffusion with Core ML on iPhone, iPad, and Mac, 2023. https://huggingface.co/blog/fast-diffusers-coreml.
[32] L. Plaugic. Musician Taryn Southern on composing her new album entirely with AI, Aug 2017.
[33] D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis, 2023.
[34] E. Portis and A. Ranganath. Media: 2022: The Web Almanac by HTTP Archive. https://almanac.httparchive.org/en/2022, Oct 2022.
[35] Prolific. A higher standard of online research. https://www.prolific.co/.
[36] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen. Hierarchical text-conditional image generation with CLIP latents. ArXiv, abs/2204.06125, 2022.
[37] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever. Zero-shot text-to-image generation. CoRR, abs/2102.12092, 2021.
[38] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models, pages 10684–10695, June 2022.
[39] U. Singer, A. Polyak, T. Hayes, X. Yin, J. An, S. Zhang, Q. Hu, H. Yang, O. Ashual, O. Gafni, D. Parikh, S. Gupta, and Y. Taigman. Make-a-video: Text-to-video generation without text-video data, 2022.
[40] J. Wang, Z. Yang, X. Hu, L. Li, K. Lin, Z. Gan, Z. Liu, C. Liu, and L. Wang. GIT: A generative image-to-text transformer for vision and language, 2022.
[41] P. Wang, M. Varvello, C. Ni, R. Yu, and A. Kuzmanovic. WebLego: Trading content strictness for faster webpages, pages 1–10. IEEE, 2021.
[42] X. Wang, Y. Li, H. Zhang, and Y. Shan. Towards real-world blind face restoration with generative facial prior, pages 9168–9178, 2021.
[43] WPO-Foundation. WebPageTest - website performance and optimization test. https://www.webpagetest.org/, 2022. Accessed: 2022-10.
[44] Z. Xiao, K. Kreis, and A. Vahdat. Tackling the generative learning trilemma with denoising diffusion GANs, 2021.
[45] J. Zhu, H. Yang, H. He, W. Wang, Z. Tuo, W.-H. Cheng, L. Gao, J. Song, and J. Fu. MovieFactory: Automatic movie creation from text using large generative models for language and images, 2023.

A APPENDIX
In this appendix we show a few images that received a low score during the user study associated with the feasibility evaluation (Section 4.1). In addition, we also show three webpages in three forms: original and AI-generated, using both the client-based and server-based annotations (Section 3.1).
(a) A golden statue of a person holding a sword; Akriti brass art wares antique brass metal lord ganesha on fan wall hanging for entrance door; living room; decorative
(b) A colorful tree with a man and a dog on it; Mozilla puts people before profit; creating products; technologies and programs that make the internet healthier for everyone. learn more about us more power to you.
(c) A hand is holding a pencil in a drawing style; An effective landing page needs a layout; a way to implement it; clear; concise copy; and ongoing evaluation—all with the goal of driving visitors to your cta. how to make a landing page
(d) A castle made out of colored plastic blocks; Picassotiles 60 piece set 60pcs magnet building tiles clear magnetic 3d building blocks construction playboards
(e) A man with a stack of cassettes on his head; By open mike eagle past and present blur on the new lp from the celebrated mc. read more a tape called component system with the auto reverse
(f) The front of a building with columns and a clock; Nih is the nations medical research agency; supporting scientific studies that turn discovery into health. nih-at-a-glance-whoweare.jpg who we are who we are
(g) A woman sitting at a table using a laptop computer; Visit cisco hybrid work index to understand the inclusive collaboration experiences driving hybrid work. the future of work is hybrid
(h) A couple of pictures of a girl and a boy; Your files and memories stay safe and secure in the cloud; with 5 gb for free and 1 tb+ if you go premium store with confidence visual renderings demonstrating some of the content types that can be safely stored in office
(i) A woman standing in front of a large truck; Lady of the gobi; a rare woman among thousands of coal truck drivers fights for survival on the hazardous mining road from mongolia to china; watch now; 25.13 coal truck driver; maikhuu; in the film lady of the gobi directed by khoroldorj choijoovanchig
(j) A black and white photo with the words all about it; All about the white lotus season 2

Figure 6: Several samples of Web images with their Image2Text (first sentence before the semicolon) and extracted text descriptions (text following the first semicolon) that were given a low score ("irrelevant" or "very irrelevant") by our study participants.
(a) Original (b) Client-based (c) Server-based

Figure 7: Comparing the www.nasa.gov webpage with webpages where images are automatically generated using stable diffusion and both the client-based and server-based approaches.

(a) Original (b) Client-based (c) Server-based

Figure 8: Comparing the www.etsy.com webpage with webpages where images are automatically generated using stable diffusion and both the client-based and server-based approaches.

(a) Original (b) Client-based (c) Server-based

Figure 9: Comparing the www.cnn.com webpage with webpages where images are automatically generated using stable diffusion and both the client-based and server-based approaches.
