
Build a Search Engine for Medium Stories Using Streamlit and Elasticsearch


Search Medium stories in your own app

ChiaChong · Sep 23 · 6 min read

Photo by Firmbee.com on Unsplash

When you read a Medium story and find it interesting or useful, do you save it to your
Medium list? I do. I have several hundred stories categorized and saved in my lists, but I
rarely go back to them when I need to refer to something I know I have seen before. One
reason is that it takes time to scroll through a long list of stories.

I would rather just open a new tab in my browser and enter the keywords in a Google
search. We could do the same thing with Medium's own search, but if you have tried it
before, you will know why I don't. In this article, I'm going to share how I built my own
search engine and what I learned along the way. Sit back and have a coffee for this long story.

The code is available on GitHub. To get started, the main tools to build the search engine
are:

Python 3 — to build and connect the UI (Streamlit) with the search engine.

Elasticsearch — a Lucene-based full-text search engine.

Docker — a containerization tool that helps us package and deploy our application
easily.

Streamlit
Let’s start with Streamlit. Run pip3 install streamlit to install the Python package,
then create a py file in the folder srcs/streamlit_app :

# srcs/streamlit_app/app.py
import streamlit as st

def main():
    st.title('Search Medium Story')
    search = st.text_input('Enter search words:')

if __name__ == '__main__':
    main()


Then, run streamlit run srcs/streamlit_app/app.py in the terminal to start the
Streamlit app. You will see the main page with a text bar to enter the search words.
Image by the author.

Connect Elasticsearch
Now we need to install the Python Elasticsearch client by running pip3 install elasticsearch
in the terminal. After that, create some helper functions in a new py file.

# srcs/streamlit_app/utils.py
def check_and_create_index(es, index: str):
    # define data model
    mappings = {
        'mappings': {
            'properties': {
                'author': {'type': 'keyword'},
                'length': {'type': 'keyword'},
                'title': {'type': 'text'},
                'tags': {'type': 'keyword'},
                'content': {'type': 'text'},
            }
        }
    }
    if not es.indices.exists(index=index):
        es.indices.create(index=index, body=mappings, ignore=400)


def index_search(es, index: str, keywords: str, filters: str,
                 from_i: int, size: int) -> dict:
    """
    Args:
        es: Elasticsearch client instance.
        index: Name of the index we are going to use.
        keywords: Search keywords.
        filters: Tag name to filter medium stories.
        from_i: Start index of the results for pagination.
        size: Number of results returned in each search.
    """
    # search query
    body = {
        'query': {
            'bool': {
                'must': [
                    {
                        'query_string': {
                            'query': keywords,
                            'fields': ['content'],
                            'default_operator': 'AND',
                        }
                    }
                ],
            }
        },
        'highlight': {
            'pre_tags': ['<b>'],
            'post_tags': ['</b>'],
            'fields': {'content': {}}
        },
        'from': from_i,
        'size': size,
        'aggs': {
            'tags': {
                'terms': {'field': 'tags'}
            },
            'match_count': {'value_count': {'field': '_id'}}
        }
    }
    if filters is not None:
        body['query']['bool']['filter'] = {
            'terms': {
                'tags': [filters]
            }
        }

    res = es.search(index=index, body=body)
    # sort popular tags
    sorted_tags = res['aggregations']['tags']['buckets']
    sorted_tags = sorted(
        sorted_tags,
        key=lambda t: t['doc_count'], reverse=True
    )
    res['sorted_tags'] = [t['key'] for t in sorted_tags]
    return res


In the check_and_create_index() function, we define that each story has five properties
— author name, length of the story, title, tags, and the content of the story. In the
index_search() function, we define the Elasticsearch query used to search the index;
details of the query API can be found here.

Note that we add an aggregation term tags in the search query to group and count the
number of each tag in the results, then sort them in descending order as sorted_tags .
These tags will be displayed on top of the results and can be used to filter search results.
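
For context, here is a minimal sketch of how a story document matching this mapping
could be indexed. The story values below are hypothetical; the only real constraint, as
we will see in app.py later, is that the app reads result['_id'] as the story URL, so
the URL doubles as the document _id:

# Hypothetical example: index one story that matches the mapping above.
from elasticsearch import Elasticsearch

es = Elasticsearch(host='0.0.0.0')
doc = {
    'author': 'Jane Doe',                     # hypothetical story metadata
    'length': '6 min read',
    'title': 'A Story About Search',
    'tags': ['python', 'elasticsearch'],
    'content': 'Full text of the story goes here...',
}
# the story URL is used as the document _id (app.py reads result['_id'] as the URL)
es.index(index='medium_data', id='https://medium.com/p/example', body=doc)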

Add the two helper functions to app.py using the following code:

# srcs/streamlit_app/app.py
import sys
import streamlit as st
from elasticsearch import Elasticsearch
sys.path.append('srcs')
from streamlit_app import utils

INDEX = 'medium_data'
PAGE_SIZE = 5
DOMAIN = '0.0.0.0'
es = Elasticsearch(host=DOMAIN)
utils.check_and_create_index(es, INDEX)

def main():
    st.title('Search Medium Story')
    search = st.text_input('Enter search words:')
    if search:
        results = utils.index_search(es, INDEX, search, None, 0, PAGE_SIZE)

if __name__ == '__main__':
    main()

Now we can enter keywords and search, but it will fail because Elasticsearch is not
running yet. We will start it later; it is very simple with the help of Docker.

Display Search Results


Assuming that Elasticsearch is already up and running, the next task is to display
the search results. Let’s create another py file, templates.py , in the same folder.

# srcs/streamlit_app/templates.py
import urllib.parse

def load_css() -> str:
    """ Return all css styles. """
    common_tag_css = """
        display: inline-flex;
        align-items: center;
        justify-content: center;
        padding: .15rem .40rem;
        position: relative;
        text-decoration: none;
        font-size: 95%;
        border-radius: 5px;
        margin-right: .5rem;
        margin-top: .4rem;
        margin-bottom: .5rem;
    """
    return f"""
    <style>
        #tags {{
            {common_tag_css}
            color: rgb(88, 88, 88);
            border-width: 0px;
            background-color: rgb(240, 242, 246);
        }}
        #tags:hover {{
            color: black;
            box-shadow: 0px 5px 10px 0px rgba(0,0,0,0.2);
        }}
        #active-tag {{
            {common_tag_css}
            color: rgb(246, 51, 102);
            border-width: 1px;
            border-style: solid;
            border-color: rgb(246, 51, 102);
        }}
        #active-tag:hover {{
            color: black;
            border-color: black;
            background-color: rgb(240, 242, 246);
            box-shadow: 0px 5px 10px 0px rgba(0,0,0,0.2);
        }}
    </style>
    """

def number_of_results(total_hits: int, duration: float) -> str:
    """ HTML scripts to display number of results and duration. """
    return f"""
    <div style="color:grey;font-size:95%;">
        {total_hits} results ({duration:.2f} seconds)
    </div><br>
    """

def search_result(i: int, url: str, title: str, highlights: str,
                  author: str, length: str, **kwargs) -> str:
    """ HTML scripts to display search results. """
    return f"""
    <div style="font-size:120%;">
        {i + 1}.
        <a href="{url}">
            {title}
        </a>
    </div>
    <div style="font-size:95%;">
        <div style="color:grey;font-size:95%;">
            {url[:90] + '...' if len(url) > 100 else url}
        </div>
        <div style="float:left;font-style:italic;">
            {author} ·&nbsp;
        </div>
        <div style="color:grey;float:left;">
            {length} ...
        </div>
        {highlights}
    </div>
    """

def tag_boxes(search: str, tags: list, active_tag: str) -> str:
    """ HTML scripts to render tag boxes. """
    html = ''
    search = urllib.parse.quote(search)
    for tag in tags:
        if tag != active_tag:
            html += f"""
            <a id="tags" href="?search={search}&tags={tag}">
                {tag.replace('-', ' ')}
            </a>
            """
        else:
            html += f"""
            <a id="active-tag" href="?search={search}">
                {tag.replace('-', ' ')}
            </a>
            """
    html += '<br><br>'
    return html


The functions above generate the HTML and CSS scripts displayed in the UI. The
load_css() function styles the story tags that are used as filters. The story tags
rendered by tag_boxes() are clickable links carrying the URL parameters search and
tags , which are important for persisting the state of the app; we will discuss this
later. The search_result() function is the main component: it displays the URL of the
story, the author name, the length of the story, and a short highlighted text snippet.
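
For instance, a quick sketch of what tag_boxes() returns for a single inactive tag:

from streamlit_app import templates

html = templates.tag_boxes('python search', ['machine-learning'], '')
# html is roughly:
# <a id="tags" href="?search=python%20search&tags=machine-learning">
#     machine learning
# </a><br><br>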

Now we can display search results with this code:

# srcs/streamlit_app/app.py
import sys
import streamlit as st
from elasticsearch import Elasticsearch
sys.path.append('srcs')
from streamlit_app import utils, templates

INDEX = 'medium_data'
PAGE_SIZE = 5
DOMAIN = '0.0.0.0'
es = Elasticsearch(host=DOMAIN)
utils.check_and_create_index(es, INDEX)

def main():
    st.write(templates.load_css(), unsafe_allow_html=True)
    st.title('Search Medium Story')
    search = st.text_input('Enter search words:')
    if search:
        results = utils.index_search(es, INDEX, search, None, 0, PAGE_SIZE)
        total_hits = results['aggregations']['match_count']['value']
        # show number of results and time taken
        st.write(templates.number_of_results(total_hits, results['took'] / 1000),
                 unsafe_allow_html=True)
        # render popular tags as filters
        st.write(templates.tag_boxes(search, results['sorted_tags'][:10], ''),
                 unsafe_allow_html=True)
        # search results
        for i in range(len(results['hits']['hits'])):
            result = results['hits']['hits'][i]
            res = result['_source']
            res['url'] = result['_id']
            res['highlights'] = '...'.join(result['highlight']['content'])
            st.write(templates.search_result(i, **res), unsafe_allow_html=True)
            st.write(templates.tag_boxes(search, res['tags'], ''),
                     unsafe_allow_html=True)

if __name__ == '__main__':
    main()


st.write() with unsafe_allow_html=True is the magic command that lets us render HTML and customize our UI.
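
This is the pattern used throughout the app:

import streamlit as st

# unsafe_allow_html=True renders the string as HTML instead of escaping it
st.write('<b>bold</b> and <i>italic</i> text', unsafe_allow_html=True)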

Filters
At this stage, we can search and display the search results (assuming Elasticsearch is
up and running).
Image by the author.

The grey rectangle boxes on top of the search results are the popular tags among the
results; we can click on any of them to filter the search results. Since they are
hyperlinks containing the URL parameters search and tags , clicking on any of them will
refresh the app state, but we can capture the URL parameters using
st.experimental_get_query_params() and set those values as the new app state.

# srcs/streamlit_app/app.py
import sys
import urllib.parse
import streamlit as st
from elasticsearch import Elasticsearch
sys.path.append('srcs')
from streamlit_app import utils, templates

INDEX = 'medium_data'
PAGE_SIZE = 5
DOMAIN = '0.0.0.0'
es = Elasticsearch(host=DOMAIN)
utils.check_and_create_index(es, INDEX)

def set_session_state():
    # set default values
    if 'search' not in st.session_state:
        st.session_state.search = None
    if 'tags' not in st.session_state:
        st.session_state.tags = None

    # get parameters in url
    para = st.experimental_get_query_params()
    if 'search' in para:
        st.experimental_set_query_params()
        # decode url
        new_search = urllib.parse.unquote(para['search'][0])
        st.session_state.search = new_search
    if 'tags' in para:
        st.experimental_set_query_params()
        st.session_state.tags = para['tags'][0]

def main():
    set_session_state()
    st.write(templates.load_css(), unsafe_allow_html=True)
    st.title('Search Medium Story')
    if st.session_state.search is None:
        search = st.text_input('Enter search words:')
    else:
        search = st.text_input('Enter search words:', st.session_state.search)
    if search:
        results = utils.index_search(es, INDEX, search, st.session_state.tags,
                                     0, PAGE_SIZE)
        total_hits = results['aggregations']['match_count']['value']
        # show number of results and time taken
        st.write(templates.number_of_results(total_hits, results['took'] / 1000),
                 unsafe_allow_html=True)
        # show popular tags
        popular_tags_html = templates.tag_boxes(search, results['sorted_tags'][:10],
                                                st.session_state.tags)
        st.write(popular_tags_html, unsafe_allow_html=True)
        # search results
        for i in range(len(results['hits']['hits'])):
            result = results['hits']['hits'][i]
            res = result['_source']
            res['url'] = result['_id']
            res['highlights'] = '...'.join(result['highlight']['content'])
            st.write(templates.search_result(i, **res), unsafe_allow_html=True)
            st.write(templates.tag_boxes(search, res['tags'], st.session_state.tags),
                     unsafe_allow_html=True)

if __name__ == '__main__':
    main()

I wrote an article about Streamlit session state before. Feel free to check it out for more
details.

Pagination
One last thing we need to do to complete the UI is result pagination. As with the story
tags, we will render the pagination buttons as hyperlinks with the URL parameters
search and tags , plus a new parameter page . Add the following function to
templates.py :

# srcs/streamlit_app/templates.py
def pagination(total_pages: int, search: str, current_page: int, tags: str) -> str:
    """ HTML scripts to render pagination buttons. """
    # search words and tags
    params = f'?search={urllib.parse.quote(search)}'
    if tags is not None:
        params += f'&tags={tags}'

    # avoid invalid page number (<=0)
    if (current_page - 5) > 0:
        start_from = current_page - 5
    else:
        start_from = 1

    hrefs = []
    if current_page != 1:
        hrefs += [
            f'<a href="{params}&page={1}">&lt;&lt;First</a>',
            f'<a href="{params}&page={current_page - 1}">&lt;Previous</a>',
        ]

    for i in range(start_from, min(total_pages + 1, start_from + 10)):
        if i == current_page:
            hrefs.append(f'{current_page}')
        else:
            hrefs.append(f'<a href="{params}&page={i}">{i}</a>')

    if current_page != total_pages:
        hrefs.append(f'<a href="{params}&page={current_page + 1}">Next&gt;</a>')

    return '<div>' + '&emsp;'.join(hrefs) + '</div>'

We also need to capture the new URL parameter and set the session state. The following
code does this:

# srcs/streamlit_app/app.py
import sys
import urllib.parse
import streamlit as st
from elasticsearch import Elasticsearch
sys.path.append('srcs')
from streamlit_app import utils, templates

INDEX = 'medium_data'
PAGE_SIZE = 5
DOMAIN = '0.0.0.0'
es = Elasticsearch(host=DOMAIN)
utils.check_and_create_index(es, INDEX)

def set_session_state():
    # set default values
    if 'search' not in st.session_state:
        st.session_state.search = None
    if 'tags' not in st.session_state:
        st.session_state.tags = None
    if 'page' not in st.session_state:
        st.session_state.page = 1

    # get parameters in url
    para = st.experimental_get_query_params()
    if 'search' in para:
        st.experimental_set_query_params()
        # decode url
        new_search = urllib.parse.unquote(para['search'][0])
        st.session_state.search = new_search
    if 'tags' in para:
        st.experimental_set_query_params()
        st.session_state.tags = para['tags'][0]
    if 'page' in para:
        st.experimental_set_query_params()
        st.session_state.page = int(para['page'][0])

def main():
    set_session_state()
    st.write(templates.load_css(), unsafe_allow_html=True)
    st.title('Search Medium Story')
    if st.session_state.search is None:
        search = st.text_input('Enter search words:')
    else:
        search = st.text_input('Enter search words:', st.session_state.search)

    if search:
        from_i = (st.session_state.page - 1) * PAGE_SIZE
        results = utils.index_search(es, INDEX, search, st.session_state.tags,
                                     from_i, PAGE_SIZE)
        total_hits = results['aggregations']['match_count']['value']
        # show number of results and time taken
        st.write(templates.number_of_results(total_hits, results['took'] / 1000),
                 unsafe_allow_html=True)
        # show popular tags
        popular_tags_html = templates.tag_boxes(search, results['sorted_tags'][:10],
                                                st.session_state.tags)
        st.write(popular_tags_html, unsafe_allow_html=True)
        # search results
        for i in range(len(results['hits']['hits'])):
            result = results['hits']['hits'][i]
            res = result['_source']
            res['url'] = result['_id']
            res['highlights'] = '...'.join(result['highlight']['content'])
            st.write(templates.search_result(i + from_i, **res),
                     unsafe_allow_html=True)
            st.write(templates.tag_boxes(search, res['tags'], st.session_state.tags),
                     unsafe_allow_html=True)

        # pagination
        if total_hits > PAGE_SIZE:
            total_pages = (total_hits + PAGE_SIZE - 1) // PAGE_SIZE
            pagination_html = templates.pagination(total_pages, search,
                                                   st.session_state.page,
                                                   st.session_state.tags)
            st.write(pagination_html, unsafe_allow_html=True)

if __name__ == '__main__':
    main()

Here we have the complete UI for searching, filtering, and pagination. The next step is
to containerize our Streamlit app using Docker.

Docker
To be honest, this is the first container image that I built. I learned from a great tutorial
by Prakhar Srivastav. I strongly suggest you read the tutorial if you are new to Docker.

The only thing we need to containerize our application is a file called Dockerfile (and,
of course, Docker installed on our machine). The Dockerfile is a text file containing a
list of commands that automate the image creation process. Here is an example
Dockerfile to create the image of the Streamlit app:

FROM python:3.7-slim

COPY ./requirements.txt /requirements.txt

RUN apt update && \
    apt install --no-install-recommends -y build-essential gcc && \
    apt clean && rm -rf /var/lib/apt/lists/* && \
    pip3 install --no-cache-dir --upgrade pip setuptools && \
    pip3 install --no-cache-dir -r /requirements.txt

COPY ./srcs /srcs

ENTRYPOINT ["streamlit", "run"]

CMD ["/srcs/streamlit_app/app.py"]

EXPOSE 8501


In the above example, we tell Docker to start from a Python 3.7 base image, copy
requirements.txt from the local directory into the image, and install all the
dependencies. After that, we copy all the source code and start the Streamlit app, which
runs on port 8501 . Lastly, we just need one terminal command to build the image:

docker build -t medium-search-app .

Once the image is built, we can start a container from it with the terminal command:

docker run -p 8501:8501 medium-search-app

The search app is now up and running at localhost:8501 .

Wait, didn’t we forget something? Elasticsearch! It’s time to start the “engine”:

docker run -p 9200:9200 -p 9300:9300 \
    -e "discovery.type=single-node" \
    elasticsearch:7.11.2

The Elasticsearch image will be pulled automatically if it does not exist locally. We can
check the health of Elasticsearch by running:

curl 0.0.0.0:9200
# {
#   "name" : "32746bf3501e",
#   "cluster_name" : "docker-cluster",
#   "cluster_uuid" : "DGEsSP6pTPqV4o4eKgxVgw",
#   "version" : {
#     "number" : "7.11.2",
#     "build_flavor" : "default",
#     "build_type" : "docker",
#     "build_hash" : "3e5a16cfec50876d20ea77b075070932c6464c7d",
#     "build_date" : "2021-03-06T05:54:38.141101Z",
#     "build_snapshot" : false,
#     "lucene_version" : "8.7.0",
#     "minimum_wire_compatibility_version" : "6.8.0",
#     "minimum_index_compatibility_version" : "6.0.0-beta1"
#   },
#   "tagline" : "You Know, for Search"
# }
The search app should be able to connect to Elasticsearch now. It is simple to set up
Elasticsearch with the help of Docker.
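
We can do the same sanity check from Python, using the same client setup as in app.py :

from elasticsearch import Elasticsearch

es = Elasticsearch(host='0.0.0.0')
print(es.ping())   # True once the cluster is reachable
print(es.info())   # the same cluster metadata as the curl output above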

Docker Compose
The previous section shows how simple it is to run a single container, but here we have
to run two containers with two commands. What if we had more? Is there a simpler way to
manage multiple containers? We could write a bash script, but there is a more elegant
way: Docker Compose.

The first thing we need to do is install the docker-compose package with pip3 install
docker-compose , then prepare a config file docker-compose.yml :

version: "3"
services:
  es:
    image: elasticsearch:7.11.2
    container_name: es
    environment:
      - discovery.type=single-node
    ports:
      - 9200:9200
    volumes:
      - esdata:/usr/share/elasticsearch/data

  web:
    image: medium-search-app
    container_name: search-app
    depends_on:
      - es
    ports:
      - 8501:8501

volumes:
  esdata:
    driver: local

One of the advantages of using docker-compose is that it automatically creates a Docker
network. The usage of Docker networks is explained in the tutorial I mentioned above.
Now we can change the Elasticsearch host defined in the search app's app.py from
DOMAIN = '0.0.0.0' to DOMAIN = 'es' , which is the container name of the Elasticsearch
service. Docker will automatically resolve the domain 'es' , so the search app container
connects directly to the Elasticsearch container through the Docker network.
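
In app.py , the change is a single line:

# srcs/streamlit_app/app.py
DOMAIN = 'es'  # the container_name of the Elasticsearch service in docker-compose.yml
es = Elasticsearch(host=DOMAIN)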

The volumes parameter in docker-compose.yml is also important to us since we don’t want
the app to lose data on shutdown or restart. We need to mount the Elasticsearch data
directory to a named volume to persist the data.

Lastly, fire up the search app by running the terminal command:

docker-compose up
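
Tip: docker-compose up -d runs the stack in the background, and docker-compose down
stops and removes the containers.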

Wrap Up
This article covered how to create a search app using Streamlit and Elasticsearch, and
how to containerize the application using Docker.

The source code is available on GitHub. Feel free to try it out. Thanks for reading,
and I hope you enjoyed this and learned something new. I’d love to hear your feedback.

Reference
Docker blog

Thanks to Elliot Gunn and Anupam Chugh. 
