# Profiling and Performance Optimization in Python
Are you looking to take your Python code to the next level? Profiling and performance optimization are key skills for any Python developer who wants to build high-performance applications. By identifying performance bottlenecks and optimizing your code, you can dramatically improve the speed and efficiency of your programs.
Whether you’re building a high-performance web application or working on a complex data analysis project, the skills in this guide will help you write faster and more efficient Python code. So let’s dive in!
As sample code that we want to profile, consider the module `mcpi.py` below:
```python
from numpy.random import rand
from numpy import sqrt


def main(nsteps):
    n_inside = 0
    for istep in range(nsteps):
        x, y = get_random_point()
        if is_in_circle(x, y):
            n_inside += 1
    return 4 * n_inside / nsteps


def get_random_point():
    return rand(2)


def is_in_circle(x, y):
    return sqrt(x**2 + y**2) <= 1


if __name__ == '__main__':
    print(main(10**6))
```
It’s not really important what the code does, but if you want to know: it’s a primitive method to compute the value of π using Monte Carlo. If you really need π, don’t do this, though! ;-) Let’s profile and optimize this!
First, we run the application in profiling mode:
```
python -m cProfile -o mcpi.stats mcpi.py
```
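If you prefer to stay inside Python rather than use the command line, `cProfile` can also be driven programmatically. A minimal sketch (the `work` function is just a stand-in for the real workload, e.g. `main(10**6)`):

```python
import cProfile
import pstats


def work():
    # Stand-in workload; replace with the function you actually want to profile
    return sum(i * i for i in range(100_000))


profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# Dump to a file that `python -m pstats` can open later,
# or inspect the numbers directly through the pstats API:
profiler.dump_stats("mcpi.stats")
stats = pstats.Stats("mcpi.stats")
stats.sort_stats("cumulative").print_stats(5)
```

This produces the same statistics file as the command-line invocation above.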
And then we look at the statistics that the profiler has generated. You can use graphical tools like `snakeviz`, but in my opinion, the console viewer gives a better overview. Here is how you can start it:

```
python -m pstats mcpi.stats
Welcome to the profile statistics browser.
mcpi.stats%
```
Now let’s print the first 20 lines of the statistics:
```
mcpi.stats% stats 20
Wed Feb 15 09:29:29 2023    mcpi.stats

         3086411 function calls (3083928 primitive calls) in 3.237 seconds

   Random listing order was used
   List reduced from 1067 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      118    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:35(_new_module)
      158    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:58(__init__)
      468    0.001    0.000    0.001    0.000 <frozen importlib._bootstrap>:78(acquire)
      468    0.001    0.000    0.001    0.000 <frozen importlib._bootstrap>:103(release)
      307    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:36(_relax_case)
      160    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:143(__init__)
      160    0.000    0.000    0.001    0.000 <frozen importlib._bootstrap>:147(__enter__)
      160    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:151(__exit__)
      158    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:176(cb)
      468    0.001    0.000    0.001    0.000 <frozen importlib._bootstrap>:157(_get_module_lock)
      308    0.000    0.000    0.001    0.000 <frozen importlib._bootstrap>:194(_lock_unlock_module)
    229/1    0.000    0.000    0.118    0.118 <frozen importlib._bootstrap>:211(_call_with_frames_removed)
     1691    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:222(_verbose_message)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:232(_requires_builtin_wrapper)
      156    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:342(__init__)
      265    0.000    0.000    0.002    0.000 <frozen importlib._bootstrap>:376(cached)
      346    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:389(parent)
      148    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:397(has_location)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:406(spec_from_loader)
      148    0.001    0.000    0.003    0.000 <frozen importlib._bootstrap>:477(_init_module_attrs)
```
The header says that more than 3 million function calls were executed and the entire program ran for 3.237 seconds. The table below the header contains a lot of information but is not very informative so far, because we haven't sorted the statistics yet. As you can see, there are several columns, and we can ask the browser to sort by any of them. For a given function `f`, `ncalls` is the number of times `f` was called, `cumtime` is the total time spent in `f` including all the functions that are called inside `f`, whereas `tottime` is the time spent in `f` itself, excluding all the functions it calls. Let's sort by `cumtime`:
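The difference between `tottime` and `cumtime` is easiest to see with a tiny synthetic example (the function names here are made up for illustration):

```python
import cProfile
import io
import pstats
import time


def inner():
    time.sleep(0.1)   # this tenth of a second lands in inner's tottime


def outer():
    inner()           # inner's 0.1 s counts toward outer's cumtime,
                      # but NOT toward outer's tottime


profiler = cProfile.Profile()
profiler.enable()
outer()
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumtime").print_stats()
print(stream.getvalue())
```

In the printed table, `outer` shows a `cumtime` of roughly 0.1 s but a `tottime` near zero, while `inner` has roughly 0.1 s in both columns.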
```
mcpi.stats% sort cumtime
mcpi.stats% stats 20
Wed Feb 15 10:51:07 2023    mcpi.stats

         3086411 function calls (3083928 primitive calls) in 3.453 seconds

   Ordered by: cumulative time
   List reduced from 1067 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    137/1    0.001    0.000    3.453    3.453 {built-in method builtins.exec}
        1    0.000    0.000    3.453    3.453 mcpi.py:1(<module>)
        1    1.157    1.157    3.312    3.312 mcpi.py:5(main)
  1000000    1.220    0.000    1.220    0.000 mcpi.py:19(is_in_circle)
  1000000    0.131    0.000    0.935    0.000 mcpi.py:15(get_random_point)
  1000000    0.804    0.000    0.804    0.000 {method 'rand' of 'numpy.random.mtrand.RandomState' objects}
    160/1    0.001    0.000    0.141    0.141 <frozen importlib._bootstrap>:986(_find_and_load)
    160/1    0.000    0.000    0.141    0.141 <frozen importlib._bootstrap>:956(_find_and_load_unlocked)
    229/1    0.000    0.000    0.141    0.141 <frozen importlib._bootstrap>:211(_call_with_frames_removed)
    365/1    0.000    0.000    0.141    0.141 {built-in method builtins.__import__}
    148/1    0.001    0.000    0.141    0.141 <frozen importlib._bootstrap>:650(_load_unlocked)
    118/1    0.000    0.000    0.141    0.141 <frozen importlib._bootstrap_external>:777(exec_module)
        1    0.000    0.000    0.141    0.141 /Users/mathcube/venv/lib/python3.8/site-packages/numpy/__init__.py:1(<module>)
   183/17    0.000    0.000    0.137    0.008 <frozen importlib._bootstrap>:1017(_handle_fromlist)
        1    0.000    0.000    0.071    0.071 /Users/mathcube/venv/lib/python3.8/site-packages/numpy/core/__init__.py:1(<module>)
  148/145    0.000    0.000    0.041    0.000 <frozen importlib._bootstrap>:549(module_from_spec)
    29/27    0.000    0.000    0.038    0.001 <frozen importlib._bootstrap_external>:1099(create_module)
    29/27    0.036    0.001    0.038    0.001 {built-in method _imp.create_dynamic}
        1    0.000    0.000    0.029    0.029 /Users/mathcube/venv/lib/python3.8/site-packages/numpy/random/__init__.py:1(<module>)
        1    0.000    0.000    0.029    0.029 /Users/mathcube/venv/lib/python3.8/site-packages/numpy/random/_pickle.py:1(<module>)
```
There is still a lot of information about built-in function calls, but we can now also see our own functions. Obviously, the `main` function consumed most of the cumulative time (3.312 s), which is not surprising because it contains the whole program except the imports. 1.157 s were spent inside the `main` function itself; the rest must have been spent in the functions that `main` called. By the way, we can filter the output by search terms or regular expressions to reduce the noise even more. For example, we can show only the functions from our own module by filtering by `mcpi.py`:
```
mcpi.stats% stats mcpi.py
Wed Feb 15 10:51:07 2023    mcpi.stats

         3086411 function calls (3083928 primitive calls) in 3.453 seconds

   Ordered by: cumulative time
   List reduced from 1067 to 4 due to restriction <'mcpi.py'>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    3.453    3.453 mcpi.py:1(<module>)
        1    1.157    1.157    3.312    3.312 mcpi.py:5(main)
  1000000    1.220    0.000    1.220    0.000 mcpi.py:19(is_in_circle)
  1000000    0.131    0.000    0.935    0.000 mcpi.py:15(get_random_point)
```
Let's try to improve the performance of one of the bottlenecks, the function `is_in_circle`. Each individual call costs practically nothing, but since it is called very often, it consumes a lot of time in total (1.220 s). The function is very simple:
```python
def is_in_circle(x, y):
    return sqrt(x**2 + y**2) <= 1
```
How can we improve on that? Well, `sqrt` is a rather expensive function, at least much more expensive than an integer power or an addition. So let's simply square the inequality:
```python
def is_in_circle(x, y):
    return x**2 + y**2 <= 1
```
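A quick micro-benchmark with `timeit` confirms the intuition. The sketch below uses `math.sqrt` for the scalar comparison; the `numpy` `sqrt` used in the original module carries even more per-call overhead on Python scalars. Exact numbers will vary by machine:

```python
import timeit

setup = "from math import sqrt; x, y = 0.3, 0.4"

# Same membership test, with and without the square root:
with_sqrt = timeit.timeit("sqrt(x**2 + y**2) <= 1", setup=setup, number=1_000_000)
without_sqrt = timeit.timeit("x**2 + y**2 <= 1", setup=setup, number=1_000_000)

print(f"with sqrt:    {with_sqrt:.3f} s")
print(f"without sqrt: {without_sqrt:.3f} s")
```

Since both statements are mathematically equivalent for non-negative radii, dropping the square root is a pure win.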
Running the program in profiling mode again and starting `pstats`, we get:
```
mcpi.stats% stats 20 mcpi.py
Wed Feb 15 10:29:16 2023    mcpi.stats

         3086411 function calls (3083928 primitive calls) in 2.633 seconds

   Ordered by: cumulative time
   List reduced from 1067 to 20 due to restriction <20>
   List reduced from 20 to 4 due to restriction <'mcpi.py'>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    2.634    2.634 mcpi.py:1(<module>)
        1    1.090    1.090    2.378    2.378 mcpi.py:5(main)
  1000000    0.134    0.000    0.921    0.000 mcpi.py:15(get_random_point)
  1000000    0.367    0.000    0.367    0.000 mcpi.py:19(is_in_circle)
```
OK, so we significantly reduced the time spent in `is_in_circle`, from 1.220 s to 0.367 s. What about the `get_random_point` function?
```python
def get_random_point():
    return rand(2)
```
At first glance it seems there is not much we can do about it, but there is! One problem is that it is called very often, and each call carries at least some overhead. So we can rewrite the logic of the calling function and let `rand` produce all random points in one sweep:
```python
from numpy.random import rand


def main(nsteps):
    n_inside = 0
    points = get_random_points(nsteps)
    for point in points:
        x, y = point
        if is_in_circle(x, y):
            n_inside += 1
    return 4 * n_inside / nsteps


def get_random_points(npoints):
    return rand(npoints, 2)


def is_in_circle(x, y):
    return x**2 + y**2 <= 1


if __name__ == '__main__':
    print(main(10**6))
```
Now the statistics look like this:
```
mcpi.stats% stats mcpi.py
Wed Feb 15 10:34:55 2023    mcpi.stats

         1086410 function calls (1083927 primitive calls) in 1.184 seconds

   Ordered by: cumulative time
   List reduced from 1067 to 4 due to restriction <'mcpi.py'>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    1.184    1.184 mcpi.py:1(<module>)
        1    0.752    0.752    1.065    1.065 mcpi.py:4(main)
  1000000    0.296    0.000    0.296    0.000 mcpi.py:19(is_in_circle)
        1    0.000    0.000    0.017    0.017 mcpi.py:15(get_random_points)
```
We have reduced the time for producing the random points from about 0.9 s to 0.017 s! You see, there was a lot of overhead in the function calls themselves. Knowing that, we can try to improve `is_in_circle` further by inlining it in the `main` function:
```python
from numpy.random import rand


def main(nsteps):
    n_inside = 0
    points = get_random_points(nsteps)
    for point in points:
        x, y = point
        if x**2 + y**2 <= 1:  # inlined the is_in_circle function
            n_inside += 1
    return 4 * n_inside / nsteps


def get_random_points(npoints):
    return rand(npoints, 2)


if __name__ == '__main__':
    print(main(10**6))
```
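The gain from inlining comes purely from removing the per-call overhead, which is easy to measure in isolation with `timeit` (numbers are machine-dependent):

```python
import timeit


def is_in_circle(x, y):
    return x**2 + y**2 <= 1


# Same computation, once through a function call and once inlined:
called = timeit.timeit("is_in_circle(x, y)",
                       setup="x, y = 0.3, 0.4",
                       globals=globals(), number=1_000_000)
inlined = timeit.timeit("x**2 + y**2 <= 1",
                        setup="x, y = 0.3, 0.4",
                        number=1_000_000)

print(f"function call: {called:.3f} s")
print(f"inlined:       {inlined:.3f} s")
```

The difference between the two timings is the cost of a million Python function calls, which the inlined version avoids entirely.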
With that change, the statistics are now:
```
mcpi.stats% stats mcpi
Wed Feb 15 10:38:14 2023    mcpi.stats

         86410 function calls (83927 primitive calls) in 1.000 seconds

   Ordered by: cumulative time
   List reduced from 1066 to 3 due to restriction <'mcpi'>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    1.000    1.000 mcpi.py:1(<module>)
        1    0.863    0.863    0.880    0.880 mcpi.py:4(main)
        1    0.000    0.000    0.017    0.017 mcpi.py:15(get_random_points)
```
We gained another 10% in total runtime. Look at the data for the `main` function: it has a `cumtime` of 0.880 s, and most of that (0.863 s) is spent in the function itself. So the bulk of the time goes into the for loop. We can optimize that further by using numpy vectorization:
```python
from numpy.random import rand


def main(nsteps):
    points = get_random_points(nsteps)
    # Make a boolean array indicating whether each point is inside the circle:
    points_in_or_out = points[:, 0]**2 + points[:, 1]**2 <= 1
    n_inside = sum(points_in_or_out)
    return 4 * n_inside / nsteps


def get_random_points(npoints):
    return rand(npoints, 2)


if __name__ == '__main__':
    print(main(10**6))
```
which leads to the statistics:
```
mcpi.stats% stats mcpi
Wed Feb 15 10:44:26 2023    mcpi.stats

         86411 function calls (83928 primitive calls) in 0.255 seconds

   Ordered by: cumulative time
   List reduced from 1067 to 3 due to restriction <'mcpi'>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.255    0.255 mcpi.py:1(<module>)
        1    0.023    0.023    0.123    0.123 mcpi.py:4(main)
        1    0.000    0.000    0.017    0.017 mcpi.py:12(get_random_points)
```
OK, overall we have reduced the runtime step by step from 3.3 s to 0.255 s. And if you look at the `tottime` column, only about 10% of that time is now spent in our own code. The rest goes into the numpy functions we call, plus some overhead. There is not much more we can do about that, except for parallelization. But that's a different story and requires different techniques.