
Basic Performance Optimization in Django

Saved to Dropbox • Dec 26, 2020, 10:20 PM

TL;DR: Profile your code, make fewer SQL queries, serialize less data
I recently built Carta's Scenario Modeling platform. A little ways into the project I realized I didn't know how to write performant code. I still don't, but here are some things I wish I knew before I got started.
Disclaimer
The sections below are high-level explanations meant to expose you to different techniques you can use to optimize your Django code. Results will vary depending on your database, models, and the nature of your data.

Getting started: example data model
class Library(models.Model):
    name = models.CharField(max_length=200, default='')
    address = models.CharField(max_length=200, default='')


class Author(models.Model):
    name = models.CharField(max_length=200, default='')


class Book(models.Model):
    library = models.ForeignKey(
        Library,
        on_delete=models.CASCADE,
        related_name='books',
    )
    author = models.ForeignKey(
        Author,
        on_delete=models.CASCADE,
        related_name='books',
    )
    title = models.CharField(max_length=200, default='')
    address = models.CharField(max_length=200, default='')

    def get_page_count(self):
        return self.pages.count()


class Page(models.Model):
    book = models.ForeignKey(
        Book,
        on_delete=models.CASCADE,
        related_name='pages',
    )
    text = models.TextField(null=True, blank=True)
    page_number = models.IntegerField()

I created a simple data model consisting of books, pages, authors, and libraries so we can play with the techniques outlined below. The database is seeded with 10,000 books, 1,000,000 pages, 1,000 authors, and 1,000 libraries. The pages are randomly assigned to books and the books are randomly assigned to authors and libraries.

Use line_profiler
If we don't know why our code is slow, it's going to be difficult to figure out how to optimize it. line_profiler is a cool Python module that tells us how much time it takes to execute each line in a function. Before you get started, install the package with pip install line_profiler.

Here's how I typically use it. Imagine we needed a function which returned all of the books corresponding to each library in our database. Let's say we ended up with something like this…

def get_books_by_library_id():
    libraries = Library.objects.all()
    result = {}

    for library in libraries:
        books_in_library = library.books.all()
        result[library.id] = list(books_in_library)

    return result

…which was causing the below view to run slowly.

from django.http import HttpResponse
from test_app.helpers import get_books_by_library_id

def books_by_library_id_view(request):
    books_by_library_id = get_books_by_library_id()
    ...
    return HttpResponse(response)

To find out where the computer is spending its time during the execution of the function, you can add an IPython debugger to the view right before where the suspect function is called…

from django.http import HttpResponse
from test_app.helpers import get_books_by_library_id

def books_by_library_id_view(request):
    from IPython import embed; embed()
    books_by_library_id = get_books_by_library_id()
    ...
    return HttpResponse(response)

…perform whatever action triggers the view, wait for your server to stop running and the IPython shell to start up, load line_profiler as an extension into the IPython shell with %load_ext line_profiler, and then profile the function with %lprun -f your_function_name your_function_name(). If any arguments need to be passed into the function, pass them inside the parentheses of the last part of that command.
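Putting that together, a typical session transcript looks something like this (the magic commands come from line_profiler's IPython extension; the function name matches the example above):

```
In [1]: %load_ext line_profiler
In [2]: %lprun -f get_books_by_library_id get_books_by_library_id()
```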

The % Time column tells you how much of the Total time the function spent executing that line of code. So out of the 47.977 seconds it took to execute the entire function, we spent 98.8% of the total time executing line 6, where we evaluate the library.books.all() queryset by coercing it to a list.

The Hits column tells you how many times that line of code was executed. It looks like line 6 is executed 1000 times — one time for each Library in the database. That means we're making 1000 SQL queries, which is probably why this function is so slow. We'll cover how to make this faster in later sections.

Turn on SQL logging
When we’re really digging into a function
to see where the bottlenecks are it can be
useful to turn on SQL logging.

# settings.py
LOGGING = {
    'version': 1,
    'filters': {
        'require_debug_true': {
            '()': 'django.utils.log.RequireDebugTrue',
        }
    },
    'handlers': {
        'console': {
            'level': 'DEBUG',
            'filters': ['require_debug_true'],
            'class': 'logging.StreamHandler',
        }
    },
    'loggers': {
        'django.db.backends': {
            'level': 'DEBUG',
            'handlers': ['console'],
        }
    }
}

We can add this snippet to settings.py to get a full printout of every SQL query executed in our code. It's usually way too much to keep on at all times, but it can be enlightening to see how many queries are being executed and how long it's taking to execute each of them in a given block of code.

The way I typically use SQL logging is to put a debugger before and after a suspect block of code. I then execute the code, clear my terminal once I hit the first debugger, continue through the first debugger, and investigate the queries that were executed in between while keeping an eye out for repeated, unexpected, or slow queries.

This post won't cover SQL query optimization, but we will go over how to reduce or eliminate the number of repeated queries in a function.

Avoid queries in loops

def get_books_by_library_id():
    libraries = Library.objects.all()
    result = {}

    for library in libraries:
        books_in_library = library.books.all()
        result[library.id] = list(books_in_library)

    return result
Back to our terminally ill function get_books_by_library_id. The problem with this is we're making a round trip to the database to fetch the books for each library. This might be what you want if you only have a few libraries, each with millions of books. But if you have thousands of libraries, each with thousands of books, your code is now executing thousands of queries. To get around this we can grab all of the books up front:

from collections import defaultdict

def get_books_by_library_id_one_query():
    books = Book.objects.all()
    result = defaultdict(list)

    for book in books:
        result[book.library_id].append(book)

    return result
Boom. Now you’re only making one SQL
query regardless of how many libraries
exist in your database.

So how do we know if this is any faster? Python has this cool little module called timeit that tells you how long it takes to execute a function.
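The timeit(...) calls in this article pass the function object directly, which assumes from timeit import timeit has already been run in the shell. Here's a self-contained sketch of the mechanics (the toy function is my own, not from the article):

```python
from timeit import timeit

def build_squares():
    # A stand-in for the Django functions being timed in this article.
    return [i * i for i in range(1000)]

# Call the function 100 times and return the total elapsed seconds.
elapsed = timeit(build_squares, number=100)
print(elapsed)
```

A lower number means a faster function; compare two implementations by timing each with the same number argument.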

In [12]: timeit(get_books_by_library_id, number=10)
Out[12]: 6.598360636999132

In [13]: timeit(get_books_by_library_id_one_query, number=10)
Out[13]: 0.677092163998168

Even though our second function has to loop over 10,000 books, it still runs nearly 10x faster than the original function by eliminating round trips to the database. Keep in mind this all depends on the sparseness/density and scale of your data. In my experience, code generally runs slower if the number of SQL queries made scales with some other input.
select_related()
Imagine you wanted a list of strings in the format of Harry Potter and the Sorcerer's Stone by J.K. Rowling corresponding to each library.

Django makes it easy to write code that executes unnecessary queries. If you don't know what's happening under the hood or you're working with unfamiliar models, you might write something like this:

def get_books_by_author():
    books = Book.objects.all()
    result = defaultdict(list)

    for book in books:
        author = book.author
        title_and_author = '{} by {}'.format(
            book.title,
            author.name,
        )
        result[book.library_id].append(title_and_author)

    return result
Here's the problem: each time you access book.author you're making a query equivalent to Author.objects.get(id=book.author_id). If you're looping over thousands of books, you're also making thousands of completely unnecessary queries. Use select_related to avoid this:

def get_books_by_author_select_related():
    books = Book.objects.all().select_related('author')
    result = defaultdict(list)

    for book in books:
        author = book.author
        title_and_author = '{} by {}'.format(
            book.title,
            author.name,
        )
        result[book.library_id].append(title_and_author)

    return result

select_related works by making a single, more complex SQL query which also returns the fields of the related object. So instead of just getting back data about all of the books, you're also grabbing data for each book's author as well. Now when you access book.author you're actually accessing a cached version of the author rather than making a separate database query.

How much faster is the naive approach vs the approach using select_related? I ran both of these functions using the timeit module and found that the function using select_related was 32x faster:

In [12]: timeit(get_books_by_author, number=10)
Out[12]: 41.363460485998075

In [13]: timeit(get_books_by_author_select_related, number=10)
Out[13]: 1.2787263889913447

select_related won't work in all situations, however:

    to avoid the much larger result set that would result from joining across a 'many' relationship, select_related is limited to single-valued relationships - foreign key and one-to-one.

In order to traverse reverse ForeignKey or ManyToMany relationships, we'll need prefetch_related.

prefetch_related()
prefetch_related is similar to select_related in that it prevents unnecessary SQL queries. But instead of fetching the primary and related objects in one go, prefetch_related makes separate queries for each relationship and "joins" the results together in Python. The downside of this approach is that it requires multiple round trips to the database.

Author.objects.filter(name__startswith='R').prefetch_related('books')
How it works: first a request is fired off which runs the primary query Author.objects.filter(name__startswith='R'). Then a second query equivalent to Book.objects.filter(author_id__in=PKS_OF_AUTHORS_FROM_FIRST_REQUEST) is executed, and both of the responses are merged together into an Author queryset that has each author's books cached in memory. So you end up with a similar result as with select_related, but you arrive there through different means.
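The "join in Python" step can be sketched without Django at all. Assuming the two queries returned plain rows (the data below is made up for illustration), the merge is just a grouping by foreign key:

```python
from collections import defaultdict

# Hypothetical results of the two queries prefetch_related issues.
authors = [{'id': 1, 'name': 'Rowling'}, {'id': 2, 'name': 'Stephenson'}]
books = [
    {'id': 10, 'author_id': 1, 'title': 'Book A'},
    {'id': 11, 'author_id': 1, 'title': 'Book B'},
    {'id': 12, 'author_id': 2, 'title': 'Book C'},
]

# Group the second query's rows by their foreign key...
books_by_author_id = defaultdict(list)
for book in books:
    books_by_author_id[book['author_id']].append(book)

# ...then attach each group to its author, so later accesses hit
# this in-memory cache instead of issuing another database query.
for author in authors:
    author['books'] = books_by_author_id[author['id']]
```

This is only a mental model; Django's actual implementation caches the results on the queryset internals rather than on dictionaries.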

Although you can use prefetch_related anywhere you use select_related, generally your code will run faster with select_related, since it will grab everything you need in one SQL query. But if your data is particularly sparse (millions of books to just a few libraries) you may see a performance boost despite the extra trips to the database. So when in doubt, try both methods and see which one comes out on top.

To summarize: use select_related and prefetch_related if you're going to access a model's relationship after making an initial query. Use prefetch_related if you're following a reverse ForeignKey or ManyToMany relationship. Use select_related if you're following a ForeignKey or OneToOne relationship, but keep in mind that prefetch_related could be a better choice for these types of relationships depending on the nature of your data.

values() and values_list()
The time it takes to serialize SQL responses into Python scales with both the number of rows and columns being returned. In the function below, we're serializing all of the fields defined on the book and author even though we only need the author's name and the book's library_id and title. We're also initializing a Django model instance for no reason, since we're not doing anything special with it (like calling model methods).

def get_books_by_author_select_related():
    books = Book.objects.all().select_related('author')
    result = defaultdict(list)

    for book in books:
        author = book.author
        title_and_author = '{} by {}'.format(
            book.title,
            author.name,
        )
        result[book.library_id].append(title_and_author)

    return result

Because of this, we incur pretty significant overhead that can be eliminated by calling .values() or .values_list() on the queryset to ask only for the fields we need:
def get_books_by_author_select_related_values():
    books = (
        Book.objects
        .all()
        .select_related('author')
        .values('title', 'library_id', 'author__name')
    )
    result = defaultdict(list)

    for book in books.iterator():
        title_and_author = '{} by {}'.format(
            book['title'],
            book['author__name'],
        )
        result[book['library_id']].append(title_and_author)

    return result

def get_books_by_author_select_related_values_list():
    books = (
        Book.objects
        .all()
        .select_related('author')
        .values_list('title', 'library_id', 'author__name')
    )
    result = defaultdict(list)

    for book in books.iterator():
        title_and_author = '{} by {}'.format(
            book[0],
            book[2],
        )
        result[book[1]].append(title_and_author)

    return result

.values() returns a list of dictionary representations of the model instances: [{'title': 'Snow Crash', 'library_id': 9, 'author__name': 'Neil'}, ...] and .values_list() returns a list of tuples representing the model instances: [('Snow Crash', 9, 'Neil'), ...].

So how much faster are these functions? By only grabbing the fields we need, the functions using .values() and .values_list() run ~7x faster than the original function.
In [13]: timeit(get_books_by_author_select_related, number=10)
Out[13]: 1.2787263889913447

In [14]: timeit(get_books_by_author_select_related_values, number=1)
Out[14]: 0.19064296898432076

In [15]: timeit(get_books_by_author_select_related_values_list, number=1)
Out[15]: 0.17425400999491103

Something to note here is these models are super lightweight — there are only 4 fields on the Book model. If we were working with models which had tons of fields, the results would be more extreme.

Even if you need all of the fields on the object, .values() and .values_list() will still provide a substantial performance boost by skipping model initialization. You can get all of the fields back by not passing in any fields as arguments.

# returns a list of model instances
def get_book_instances():
    return list(
        Book.objects
        .all()
        .select_related('author')
    )

# returns a list of dictionaries representing model instances
def get_book_dictionaries():
    return list(
        Book.objects
        .all()
        .select_related('author')
        .values()
    )

# returns a list of dictionaries with only the title of each book
def get_book_dictionaries_title_only():
    return list(
        Book.objects
        .all()
        .select_related('author')
        .values('title')
    )

Getting the dictionary representation of the books is 6.5x faster, and asking for a particular field on the book is 8.8x faster than requesting the model instances.

In [64]: timeit(get_book_instances, number=100)
Out[64]: 12.904168864974054

In [65]: timeit(get_book_dictionaries, number=100)
Out[65]: 2.049193776998436

In [66]: timeit(get_book_dictionaries_title_only, number=100)
Out[66]: 1.4734381759772077

Initializing Django model instances is expensive. If you're only using the data on the model, you're likely better off working with their dictionary or tuple representations.

bulk_create()
This one is pretty simple. If we’re going to
create more than one object at a time, use
bulk_create instead of creating the
objects in a loop. As the name suggests,
bulk_create will insert a list of objects
into the database using one query,
regardless of how many objects we’re
inserting.

It's possible to insert so many objects that the single query generated by bulk_create will be slower than multiple smaller queries. In that case, you can pass in batch_size=SIZE_OF_BATCH as an argument, which will break the master query into smaller queries.
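As a rough mental model (this is not Django's actual implementation), batch_size just splits the row list into consecutive chunks, with one INSERT issued per chunk:

```python
def chunked(objects, batch_size):
    # Split a list into consecutive chunks of at most batch_size items.
    return [objects[i:i + batch_size] for i in range(0, len(objects), batch_size)]

# 2,500 rows with batch_size=1000 become three INSERT queries
# instead of one giant one.
batches = chunked(list(range(2500)), batch_size=1000)
print(len(batches))      # 3
print(len(batches[-1]))  # 500 leftover rows in the final batch
```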

I'm not aware of any rules of thumb regarding the number of objects you need to insert before it makes sense to use batch_size, or how to decide on the size of the batch. Usually I omit it until it becomes clear that there's a bottleneck, and then determine a suitable batch_size from there.

SQL is (generally) faster than Python
Let’s imagine you wanted a function that
returns the total number of pages for each
library. Using what we’ve learned above
you might end up with something like
this:
def get_page_count_by_library_id():
    result = defaultdict(int)
    books = Book.objects.all().prefetch_related('pages')

    for book in books:
        result[book.library_id] += book.get_page_count()

    return result

Even though this will only fire off two queries, we still have to pull all of the books into memory and loop over each one of them. This sort of thing can easily be turned into an annotation.

from django.db.models import Count

def get_page_count_by_library_id_using_annotation():
    result = {}
    libraries = (
        Library.objects
        .all()
        # Count the Page rows reachable through each library's books.
        .annotate(page_count=Count('books__pages'))
        .values_list('id', 'page_count')
    )

    for library_id, page_count in libraries:
        result[library_id] = page_count

    return result

Now, instead of pulling a bunch of Django instances into memory and calling a model method on each of them, you're only pulling the two values you actually care about for each Library: the id and page_count. Our new function runs about 115x faster than the original.

In [66]: timeit(get_page_count_by_library_id, number=10)
Out[66]: 158.0743614450039

In [67]: timeit(get_page_count_by_library_id_using_annotation, number=10)
Out[67]: 1.3725216790044215
This example is pretty simple, but there's a lot you can do with annotations. If you're running into performance issues while doing math on a large queryset, consider writing an annotation to offload that work to the database.

Wrapping up
If you're working on a Django-based project where there has been little or no thought regarding performance, there should be quite a bit of low-hanging fruit to get you started. In the majority of cases, the techniques described above will provide a substantial performance boost without much of a trade-off in terms of readability.
