
Build a Search Engine for Medium Stories Using Streamlit and Elasticsearch


Search Medium stories in your own app

ChiaChong · Sep 23 · 6 min read

Photo by Firmbee.com on Unsplash

When you read a Medium story and find it interesting or useful, do you save it to your
Medium list? I do. I have several hundred stories categorized and saved in my lists, but I
rarely go back to them when I need to refer to something I know I have seen before. One
reason is that it takes time to scroll through a long list of stories.

I would rather just open a new tab in my browser and enter the keywords in a Google
search. We could do the same thing with Medium's own search, but if you have tried it
before, you will know why I don't. In this article, I'm going to share how I built my own
search engine and what I learned along the way. Sit back and have a coffee for this long story.

The code is available on GitHub. To get started, the main tools to build the search engine
are:

Python 3 — to build and connect the UI (Streamlit) with the search engine.

Elasticsearch — a Lucene-based full-text search engine.

Docker — a containerization tool that helps us package and deploy our application
easily.

Streamlit
Let’s start with Streamlit. Run pip3 install streamlit to install the Python package,
then create a py file in the folder srcs/streamlit_app :

# srcs/streamlit_app/app.py
import streamlit as st

def main():
    st.title('Search Medium Story')
    search = st.text_input('Enter search words:')

if __name__ == '__main__':
    main()


Then, run streamlit run srcs/streamlit_app/app.py in the terminal to start the
Streamlit app. You will see the main page with a text bar to enter the search words.
Image by the author.

Connect Elasticsearch
Now we need to install the Python Elasticsearch client by running pip3 install elasticsearch
in the terminal. After that, create some helper functions in a new py file.

# srcs/streamlit_app/utils.py
def check_and_create_index(es, index: str):
    # define data model
    mappings = {
        'mappings': {
            'properties': {
                'author': {'type': 'keyword'},
                'length': {'type': 'keyword'},
                'title': {'type': 'text'},
                'tags': {'type': 'keyword'},
                'content': {'type': 'text'},
            }
        }
    }
    if not es.indices.exists(index=index):
        es.indices.create(index=index, body=mappings, ignore=400)


def index_search(es, index: str, keywords: str, filters: str,
                 from_i: int, size: int) -> dict:
    """
    Args:
        es: Elasticsearch client instance.
        index: Name of the index we are going to use.
        keywords: Search keywords.
        filters: Tag name to filter medium stories.
        from_i: Start index of the results for pagination.
        size: Number of results returned in each search.
    """
    # search query
    body = {
        'query': {
            'bool': {
                'must': [
                    {
                        'query_string': {
                            'query': keywords,
                            'fields': ['content'],
                            'default_operator': 'AND',
                        }
                    }
                ],
            }
        },
        'highlight': {
            'pre_tags': ['<b>'],
            'post_tags': ['</b>'],
            'fields': {'content': {}}
        },
        'from': from_i,
        'size': size,
        'aggs': {
            'tags': {
                'terms': {'field': 'tags'}
            },
            'match_count': {'value_count': {'field': '_id'}}
        }
    }
    if filters is not None:
        body['query']['bool']['filter'] = {
            'terms': {
                'tags': [filters]
            }
        }

    res = es.search(index=index, body=body)
    # sort popular tags
    sorted_tags = res['aggregations']['tags']['buckets']
    sorted_tags = sorted(
        sorted_tags,
        key=lambda t: t['doc_count'], reverse=True
    )
    res['sorted_tags'] = [t['key'] for t in sorted_tags]
    return res


In the check_and_create_index() function, we define that each story has five properties
— author name, length of the story, title, tags, and the content of the story. In the
index_search() function, we define the Elasticsearch query used to search the index;
details of the query API can be found here.

Note that we add an aggregation term tags in the search query to group and count the
number of each tag in the results, then sort them in descending order as sorted_tags .
These tags will be displayed on top of the results and can be used to filter search results.
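
For context, here is a minimal sketch of how a story document matching this mapping
could be indexed. The story values below are hypothetical; the only real constraint, as
we will see in app.py later, is that the app reads result['_id'] as the story URL, so
the URL doubles as the document _id:

# Hypothetical example: index one story that matches the mapping above.
from elasticsearch import Elasticsearch

es = Elasticsearch(host='0.0.0.0')
doc = {
    'author': 'Jane Doe',                     # hypothetical story metadata
    'length': '6 min read',
    'title': 'A Story About Search',
    'tags': ['python', 'elasticsearch'],
    'content': 'Full text of the story goes here...',
}
# the story URL is used as the document _id (app.py reads result['_id'] as the URL)
es.index(index='medium_data', id='https://medium.com/p/example', body=doc)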

Add the two helper functions to app.py using the following code:

# srcs/streamlit_app/app.py
import sys
import streamlit as st
from elasticsearch import Elasticsearch
sys.path.append('srcs')
from streamlit_app import utils

INDEX = 'medium_data'
PAGE_SIZE = 5
DOMAIN = '0.0.0.0'
es = Elasticsearch(host=DOMAIN)
utils.check_and_create_index(es, INDEX)

def main():
    st.title('Search Medium Story')
    search = st.text_input('Enter search words:')
    if search:
        results = utils.index_search(es, INDEX, search, None, 0, PAGE_SIZE)

if __name__ == '__main__':
    main()

Now we can enter keywords and search, but it will fail because Elasticsearch is not
running yet. We will start it later; it is very simple with the help of Docker.

Display Search Results


Assuming that Elasticsearch is already up and running, the next task is to display
the search results. Let’s create another py file, templates.py , in the same folder.

# srcs/streamlit_app/templates.py
import urllib.parse

def load_css() -> str:
    """ Return all css styles. """
    common_tag_css = """
        display: inline-flex;
        align-items: center;
        justify-content: center;
        padding: .15rem .40rem;
        position: relative;
        text-decoration: none;
        font-size: 95%;
        border-radius: 5px;
        margin-right: .5rem;
        margin-top: .4rem;
        margin-bottom: .5rem;
    """
    return f"""
    <style>
        #tags {{
            {common_tag_css}
            color: rgb(88, 88, 88);
            border-width: 0px;
            background-color: rgb(240, 242, 246);
        }}
        #tags:hover {{
            color: black;
            box-shadow: 0px 5px 10px 0px rgba(0,0,0,0.2);
        }}
        #active-tag {{
            {common_tag_css}
            color: rgb(246, 51, 102);
            border-width: 1px;
            border-style: solid;
            border-color: rgb(246, 51, 102);
        }}
        #active-tag:hover {{
            color: black;
            border-color: black;
            background-color: rgb(240, 242, 246);
            box-shadow: 0px 5px 10px 0px rgba(0,0,0,0.2);
        }}
    </style>
    """

def number_of_results(total_hits: int, duration: float) -> str:
    """ HTML scripts to display number of results and duration. """
    return f"""
    <div style="color:grey;font-size:95%;">
        {total_hits} results ({duration:.2f} seconds)
    </div><br>
    """

def search_result(i: int, url: str, title: str, highlights: str,
                  author: str, length: str, **kwargs) -> str:
    """ HTML scripts to display search results. """
    return f"""
    <div style="font-size:120%;">
        {i + 1}.
        <a href="{url}">
            {title}
        </a>
    </div>
    <div style="font-size:95%;">
        <div style="color:grey;font-size:95%;">
            {url[:90] + '...' if len(url) > 100 else url}
        </div>
        <div style="float:left;font-style:italic;">
            {author} ·&nbsp;
        </div>
        <div style="color:grey;float:left;">
            {length} ...
        </div>
        {highlights}
    </div>
    """

def tag_boxes(search: str, tags: list, active_tag: str) -> str:
    """ HTML scripts to render tag boxes. """
    html = ''
    search = urllib.parse.quote(search)
    for tag in tags:
        if tag != active_tag:
            html += f"""
            <a id="tags" href="?search={search}&tags={tag}">
                {tag.replace('-', ' ')}
            </a>
            """
        else:
            html += f"""
            <a id="active-tag" href="?search={search}">
                {tag.replace('-', ' ')}
            </a>
            """
    html += '<br><br>'
    return html


The functions above generate the HTML and CSS scripts displayed in the UI. The
load_css() function styles the story tags that are used as filters. The story tags
rendered by tag_boxes() are clickable links carrying the URL parameters search and
tags , which are important for persisting the state of the app; we will discuss this
later. The search_result() function is the main component: it displays the URL of the
story, the author name, the length of the story, and a short highlighted text snippet.
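
For instance, a quick sketch of what tag_boxes() returns for a single inactive tag:

from streamlit_app import templates

html = templates.tag_boxes('python search', ['machine-learning'], '')
# html is roughly:
# <a id="tags" href="?search=python%20search&tags=machine-learning">
#     machine learning
# </a><br><br>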

Now we can display search results with this code:

# srcs/streamlit_app/app.py
import sys
import streamlit as st
from elasticsearch import Elasticsearch
sys.path.append('srcs')
from streamlit_app import utils, templates

INDEX = 'medium_data'
PAGE_SIZE = 5
DOMAIN = '0.0.0.0'
es = Elasticsearch(host=DOMAIN)
utils.check_and_create_index(es, INDEX)

def main():
    st.write(templates.load_css(), unsafe_allow_html=True)
    st.title('Search Medium Story')
    search = st.text_input('Enter search words:')
    if search:
        results = utils.index_search(es, INDEX, search, None, 0, PAGE_SIZE)
        total_hits = results['aggregations']['match_count']['value']
        # show number of results and time taken
        st.write(templates.number_of_results(total_hits, results['took'] / 1000),
                 unsafe_allow_html=True)
        # render popular tags as filters
        st.write(templates.tag_boxes(search, results['sorted_tags'][:10], ''),
                 unsafe_allow_html=True)
        # search results
        for i in range(len(results['hits']['hits'])):
            result = results['hits']['hits'][i]
            res = result['_source']
            res['url'] = result['_id']
            res['highlights'] = '...'.join(result['highlight']['content'])
            st.write(templates.search_result(i, **res), unsafe_allow_html=True)
            st.write(templates.tag_boxes(search, res['tags'], ''),
                     unsafe_allow_html=True)

if __name__ == '__main__':
    main()


st.write() with unsafe_allow_html=True is the magic command that lets us render HTML and customize our UI.
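
This is the pattern used throughout the app:

import streamlit as st

# unsafe_allow_html=True renders the string as HTML instead of escaping it
st.write('<b>bold</b> and <i>italic</i> text', unsafe_allow_html=True)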

Filters
At this stage, we can search and display the search results (assuming Elasticsearch is
up and running).
Image by the author.

The grey rectangle boxes on top of the search results are the popular tags among the
results; we can click on any of them to filter the search results. Since they are
hyperlinks containing the URL parameters search and tags , clicking on any of them will
refresh the app state, but we can capture the URL parameters using
st.experimental_get_query_params() and set those values as the new app state.

# srcs/streamlit_app/app.py
import sys
import urllib.parse
import streamlit as st
from elasticsearch import Elasticsearch
sys.path.append('srcs')
from streamlit_app import utils, templates

INDEX = 'medium_data'
PAGE_SIZE = 5
DOMAIN = '0.0.0.0'
es = Elasticsearch(host=DOMAIN)
utils.check_and_create_index(es, INDEX)

def set_session_state():
    # set default values
    if 'search' not in st.session_state:
        st.session_state.search = None
    if 'tags' not in st.session_state:
        st.session_state.tags = None

    # get parameters in url
    para = st.experimental_get_query_params()
    if 'search' in para:
        st.experimental_set_query_params()
        # decode url
        new_search = urllib.parse.unquote(para['search'][0])
        st.session_state.search = new_search
    if 'tags' in para:
        st.experimental_set_query_params()
        st.session_state.tags = para['tags'][0]

def main():
    set_session_state()
    st.write(templates.load_css(), unsafe_allow_html=True)
    st.title('Search Medium Story')
    if st.session_state.search is None:
        search = st.text_input('Enter search words:')
    else:
        search = st.text_input('Enter search words:', st.session_state.search)
    if search:
        results = utils.index_search(es, INDEX, search, st.session_state.tags,
                                     0, PAGE_SIZE)
        total_hits = results['aggregations']['match_count']['value']
        # show number of results and time taken
        st.write(templates.number_of_results(total_hits, results['took'] / 1000),
                 unsafe_allow_html=True)
        # show popular tags
        popular_tags_html = templates.tag_boxes(search, results['sorted_tags'][:10],
                                                st.session_state.tags)
        st.write(popular_tags_html, unsafe_allow_html=True)
        # search results
        for i in range(len(results['hits']['hits'])):
            result = results['hits']['hits'][i]
            res = result['_source']
            res['url'] = result['_id']
            res['highlights'] = '...'.join(result['highlight']['content'])
            st.write(templates.search_result(i, **res), unsafe_allow_html=True)
            st.write(templates.tag_boxes(search, res['tags'], st.session_state.tags),
                     unsafe_allow_html=True)

if __name__ == '__main__':
    main()

I wrote an article about Streamlit session state before. Feel free to check it out for more
details.

Pagination
One last thing we need to do to complete the UI is result pagination. As with the story
tags, we will render the pagination buttons as hyperlinks with the URL parameters
search and tags , plus a new parameter page . Add the following function to
templates.py :

# srcs/streamlit_app/templates.py
def pagination(total_pages: int, search: str, current_page: int, tags: str) -> str:
    """ HTML scripts to render pagination buttons. """
    # search words and tags
    params = f'?search={urllib.parse.quote(search)}'
    if tags is not None:
        params += f'&tags={tags}'

    # avoid invalid page number (<=0)
    if (current_page - 5) > 0:
        start_from = current_page - 5
    else:
        start_from = 1

    hrefs = []
    if current_page != 1:
        hrefs += [
            f'<a href="{params}&page={1}">&lt;&lt;First</a>',
            f'<a href="{params}&page={current_page - 1}">&lt;Previous</a>',
        ]

    for i in range(start_from, min(total_pages + 1, start_from + 10)):
        if i == current_page:
            hrefs.append(f'{current_page}')
        else:
            hrefs.append(f'<a href="{params}&page={i}">{i}</a>')

    if current_page != total_pages:
        hrefs.append(f'<a href="{params}&page={current_page + 1}">Next&gt;</a>')

    return '<div>' + '&emsp;'.join(hrefs) + '</div>'

We also need to capture the new URL parameter and set the session state. The following
code does this:

# srcs/streamlit_app/app.py
import sys
import urllib.parse
import streamlit as st
from elasticsearch import Elasticsearch
sys.path.append('srcs')
from streamlit_app import utils, templates

INDEX = 'medium_data'
PAGE_SIZE = 5
DOMAIN = '0.0.0.0'
es = Elasticsearch(host=DOMAIN)
utils.check_and_create_index(es, INDEX)

def set_session_state():
    # set default values
    if 'search' not in st.session_state:
        st.session_state.search = None
    if 'tags' not in st.session_state:
        st.session_state.tags = None
    if 'page' not in st.session_state:
        st.session_state.page = 1

    # get parameters in url
    para = st.experimental_get_query_params()
    if 'search' in para:
        st.experimental_set_query_params()
        # decode url
        new_search = urllib.parse.unquote(para['search'][0])
        st.session_state.search = new_search
    if 'tags' in para:
        st.experimental_set_query_params()
        st.session_state.tags = para['tags'][0]
    if 'page' in para:
        st.experimental_set_query_params()
        st.session_state.page = int(para['page'][0])

def main():
    set_session_state()
    st.write(templates.load_css(), unsafe_allow_html=True)
    st.title('Search Medium Story')
    if st.session_state.search is None:
        search = st.text_input('Enter search words:')
    else:
        search = st.text_input('Enter search words:', st.session_state.search)

    if search:
        from_i = (st.session_state.page - 1) * PAGE_SIZE
        results = utils.index_search(es, INDEX, search, st.session_state.tags,
                                     from_i, PAGE_SIZE)
        total_hits = results['aggregations']['match_count']['value']
        # show number of results and time taken
        st.write(templates.number_of_results(total_hits, results['took'] / 1000),
                 unsafe_allow_html=True)
        # show popular tags
        popular_tags_html = templates.tag_boxes(search, results['sorted_tags'][:10],
                                                st.session_state.tags)
        st.write(popular_tags_html, unsafe_allow_html=True)
        # search results
        for i in range(len(results['hits']['hits'])):
            result = results['hits']['hits'][i]
            res = result['_source']
            res['url'] = result['_id']
            res['highlights'] = '...'.join(result['highlight']['content'])
            st.write(templates.search_result(i + from_i, **res),
                     unsafe_allow_html=True)
            st.write(templates.tag_boxes(search, res['tags'], st.session_state.tags),
                     unsafe_allow_html=True)

        # pagination
        if total_hits > PAGE_SIZE:
            total_pages = (total_hits + PAGE_SIZE - 1) // PAGE_SIZE
            pagination_html = templates.pagination(total_pages, search,
                                                   st.session_state.page,
                                                   st.session_state.tags)
            st.write(pagination_html, unsafe_allow_html=True)

if __name__ == '__main__':
    main()

Here we have the complete UI for searching, filtering, and pagination. The next step is
to containerize our Streamlit app using Docker.

Docker
To be honest, this is the first container image that I built. I learned from a great tutorial
by Prakhar Srivastav. I strongly suggest you read the tutorial if you are new to Docker.

The only thing we need to containerize our application is a file called Dockerfile (and,
of course, Docker installed on our machine). The Dockerfile is a text file containing a
list of commands that automate the image creation process. Here is an example
Dockerfile to create the image of the Streamlit app:

FROM python:3.7-slim

COPY ./requirements.txt /requirements.txt

RUN apt update && \
    apt install --no-install-recommends -y build-essential gcc && \
    apt clean && rm -rf /var/lib/apt/lists/* && \
    pip3 install --no-cache-dir --upgrade pip setuptools && \
    pip3 install --no-cache-dir -r /requirements.txt

COPY ./srcs /srcs

ENTRYPOINT ["streamlit", "run"]

CMD ["/srcs/streamlit_app/app.py"]

EXPOSE 8501


In the above example, we tell Docker to start from a Python 3.7 base image, copy
requirements.txt from the local directory into the image, and install all the
dependencies. After that, we copy all the source code and start the Streamlit app, which
runs on port 8501 . Lastly, we just need one terminal command to build the image:

docker build -t medium-search-app .

Once the image is built, we can start a container from it with the terminal command:

docker run -p 8501:8501 medium-search-app

The search app is now up and running at localhost:8501 .

Wait, didn’t we forget something? Elasticsearch! It’s time to start the “engine”:

docker run -p 9200:9200 -p 9300:9300 \
    -e "discovery.type=single-node" \
    elasticsearch:7.11.2

The Elasticsearch image will be pulled automatically if it does not exist locally. We can
check the health of Elasticsearch by running:

curl 0.0.0.0:9200
# {
#   "name" : "32746bf3501e",
#   "cluster_name" : "docker-cluster",
#   "cluster_uuid" : "DGEsSP6pTPqV4o4eKgxVgw",
#   "version" : {
#     "number" : "7.11.2",
#     "build_flavor" : "default",
#     "build_type" : "docker",
#     "build_hash" : "3e5a16cfec50876d20ea77b075070932c6464c7d",
#     "build_date" : "2021-03-06T05:54:38.141101Z",
#     "build_snapshot" : false,
#     "lucene_version" : "8.7.0",
#     "minimum_wire_compatibility_version" : "6.8.0",
#     "minimum_index_compatibility_version" : "6.0.0-beta1"
#   },
#   "tagline" : "You Know, for Search"
# }
The search app should be able to connect to Elasticsearch now. It is simple to set up
Elasticsearch with the help of Docker.
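
We can do the same sanity check from Python, using the same client setup as in app.py :

from elasticsearch import Elasticsearch

es = Elasticsearch(host='0.0.0.0')
print(es.ping())   # True once the cluster is reachable
print(es.info())   # the same cluster metadata as the curl output above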

Docker Compose
The previous section shows how simple it is to run a single container, but here we have
to run two containers with two commands. What if we had more? Is there a simpler way to
manage multiple containers? We could write a bash script, but there is a more elegant
way: Docker Compose.

The first thing we need to do is install the docker-compose package with pip3 install
docker-compose , then prepare a config file docker-compose.yml :

version: "3"
services:
  es:
    image: elasticsearch:7.11.2
    container_name: es
    environment:
      - discovery.type=single-node
    ports:
      - 9200:9200
    volumes:
      - esdata:/usr/share/elasticsearch/data

  web:
    image: medium-search-app
    container_name: search-app
    depends_on:
      - es
    ports:
      - 8501:8501

volumes:
  esdata:
    driver: local

One of the advantages of using docker-compose is that it automatically creates a Docker
network. The usage of Docker networks is explained in the tutorial I mentioned above.
Now we can change the Elasticsearch host defined in the search app's app.py from
DOMAIN = '0.0.0.0' to DOMAIN = 'es' , which is the container name of the Elasticsearch
service. Docker will automatically resolve the domain 'es' , so the search app container
connects directly to the Elasticsearch container through the Docker network.
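
In app.py , the change is a single line:

# srcs/streamlit_app/app.py
DOMAIN = 'es'  # the container_name of the Elasticsearch service in docker-compose.yml
es = Elasticsearch(host=DOMAIN)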

The volumes parameter in docker-compose.yml is also important to us since we don’t want
the app to lose data on shutdown or restart. We need to mount the Elasticsearch data
directory to a named volume to persist the data.

Lastly, fire up the search app by running the terminal command:

docker-compose up
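
Tip: docker-compose up -d runs the stack in the background, and docker-compose down
stops and removes the containers.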

Wrap Up
This article covered how to create a search app using Streamlit and Elasticsearch, and
how to containerize the application using Docker.

The source code is available on GitHub. Feel free to try it out. Thanks for reading,
and I hope you enjoyed this and learned something new. I’d love to hear your feedback.

Reference
Docker blog

Thanks to Elliot Gunn and Anupam Chugh. 
