Profiling and Performance Optimization in Python


Are you looking to take your Python code to the next level? Profiling and performance optimization are key skills for any Python developer who wants to build high-performance applications. By identifying performance bottlenecks and optimizing your code, you can dramatically improve the speed and efficiency of your programs.

Whether you’re building a high-performance web application or working on a complex data analysis project, the skills you’ll learn in this guide will help you write faster and more efficient Python code. So let’s dive in and take your Python skills to the next level!

As sample code to profile, consider the module mcpi.py below:

from numpy.random import rand
from numpy import sqrt


def main(nsteps):
    n_inside = 0
    for istep in range(nsteps):
        x, y = get_random_point()
        if is_in_circle(x, y):
            n_inside += 1

    return 4 * n_inside / nsteps


def get_random_point():
    return rand(2)


def is_in_circle(x, y):
    return sqrt(x**2 + y**2) <= 1


if __name__ == '__main__':
    print(main(10**6))

It’s not really important what the code does, but if you want to know: it’s a primitive Monte Carlo method to compute the value of π. If you really need π, don’t do it this way, though! ;-) Let’s profile and optimize it!

First, we run the application in profiling mode:

python -m cProfile -o mcpi.stats mcpi.py
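If you only want a quick look without saving a stats file, cProfile can also sort and print the results directly to the console, for example by cumulative time:

```shell
# Print the profile directly to the console, sorted by cumulative time
python -m cProfile -s cumtime mcpi.py
```

For anything beyond a one-off glance, though, dumping to a stats file as above is more flexible, because you can re-sort and filter the same data as often as you like.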

And then we look at the statistics that the profiler has generated. You can use graphical tools like snakeviz, but in my opinion, the console viewer gives a better overview. Here is how you start it:

python -m pstats mcpi.stats
Welcome to the profile statistics browser.
mcpi.stats%

Now let’s print the first 20 lines of the statistics:

mcpi.stats% stats 20
Wed Feb 15 09:29:29 2023    mcpi.stats

         3086411 function calls (3083928 primitive calls) in 3.237 seconds

   Random listing order was used
   List reduced from 1067 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      118    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:35(_new_module)
      158    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:58(__init__)
      468    0.001    0.000    0.001    0.000 <frozen importlib._bootstrap>:78(acquire)
      468    0.001    0.000    0.001    0.000 <frozen importlib._bootstrap>:103(release)
      307    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:36(_relax_case)
      160    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:143(__init__)
      160    0.000    0.000    0.001    0.000 <frozen importlib._bootstrap>:147(__enter__)
      160    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:151(__exit__)
      158    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:176(cb)
      468    0.001    0.000    0.001    0.000 <frozen importlib._bootstrap>:157(_get_module_lock)
      308    0.000    0.000    0.001    0.000 <frozen importlib._bootstrap>:194(_lock_unlock_module)
    229/1    0.000    0.000    0.118    0.118 <frozen importlib._bootstrap>:211(_call_with_frames_removed)
     1691    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:222(_verbose_message)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:232(_requires_builtin_wrapper)
      156    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:342(__init__)
      265    0.000    0.000    0.002    0.000 <frozen importlib._bootstrap>:376(cached)
      346    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:389(parent)
      148    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:397(has_location)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:406(spec_from_loader)
      148    0.001    0.000    0.003    0.000 <frozen importlib._bootstrap>:477(_init_module_attrs)

The header says that more than 3 million function calls were executed and that the entire program ran for 3.237 seconds. The table below the header contains a lot of data but isn't very informative yet, because we haven't sorted the statistics. The browser can sort by any of the columns: ncalls is the number of times a given function f was called, cumtime is the total time spent in f including all the functions called inside f, whereas tottime is the time spent in f itself, excluding the functions it calls. Let's sort by cumtime:

mcpi.stats% sort cumtime
mcpi.stats% stats 20

Wed Feb 15 10:51:07 2023    mcpi.stats

         3086411 function calls (3083928 primitive calls) in 3.453 seconds

   Ordered by: cumulative time
   List reduced from 1067 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    137/1    0.001    0.000    3.453    3.453 {built-in method builtins.exec}
        1    0.000    0.000    3.453    3.453 mcpi.py:1(<module>)
        1    1.157    1.157    3.312    3.312 mcpi.py:5(main)
  1000000    1.220    0.000    1.220    0.000 mcpi.py:19(is_in_circle)
  1000000    0.131    0.000    0.935    0.000 mcpi.py:15(get_random_point)
  1000000    0.804    0.000    0.804    0.000 {method 'rand' of 'numpy.random.mtrand.RandomState' objects}
    160/1    0.001    0.000    0.141    0.141 <frozen importlib._bootstrap>:986(_find_and_load)
    160/1    0.000    0.000    0.141    0.141 <frozen importlib._bootstrap>:956(_find_and_load_unlocked)
    229/1    0.000    0.000    0.141    0.141 <frozen importlib._bootstrap>:211(_call_with_frames_removed)
    365/1    0.000    0.000    0.141    0.141 {built-in method builtins.__import__}
    148/1    0.001    0.000    0.141    0.141 <frozen importlib._bootstrap>:650(_load_unlocked)
    118/1    0.000    0.000    0.141    0.141 <frozen importlib._bootstrap_external>:777(exec_module)
        1    0.000    0.000    0.141    0.141 /Users/mathcube/venv/lib/python3.8/site-packages/numpy/__init__.py:1(<module>)
   183/17    0.000    0.000    0.137    0.008 <frozen importlib._bootstrap>:1017(_handle_fromlist)
        1    0.000    0.000    0.071    0.071 /Users/mathcube/venv/lib/python3.8/site-packages/numpy/core/__init__.py:1(<module>)
  148/145    0.000    0.000    0.041    0.000 <frozen importlib._bootstrap>:549(module_from_spec)
    29/27    0.000    0.000    0.038    0.001 <frozen importlib._bootstrap_external>:1099(create_module)
    29/27    0.036    0.001    0.038    0.001 {built-in method _imp.create_dynamic}
        1    0.000    0.000    0.029    0.029 /Users/mathcube/venv/lib/python3.8/site-packages/numpy/random/__init__.py:1(<module>)
        1    0.000    0.000    0.029    0.029 /Users/mathcube/venv/lib/python3.8/site-packages/numpy/random/_pickle.py:1(<module>)

There is still a lot of information about built-in function calls, but we can now also see our own functions. Unsurprisingly, the main function consumed most of the cumulative time (3.312s), since it contains the whole program except the imports. Of that, 1.157s were spent inside main itself; the rest must have been spent in the functions that main called. By the way, we can filter the output by search terms or regular expressions to reduce the noise even further. For example, we can show only the functions defined in our own module by filtering by mcpi.py:

mcpi.stats% stats mcpi.py
Wed Feb 15 10:51:07 2023    mcpi.stats

         3086411 function calls (3083928 primitive calls) in 3.453 seconds

   Ordered by: cumulative time
   List reduced from 1067 to 4 due to restriction <'mcpi.py'>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    3.453    3.453 mcpi.py:1(<module>)
        1    1.157    1.157    3.312    3.312 mcpi.py:5(main)
  1000000    1.220    0.000    1.220    0.000 mcpi.py:19(is_in_circle)
  1000000    0.131    0.000    0.935    0.000 mcpi.py:15(get_random_point)
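The same sorting and filtering can also be done from a script instead of the interactive browser, using the pstats module directly. A minimal self-contained sketch (profiling a toy workload here rather than mcpi.py):

```python
import cProfile
import pstats
from pstats import SortKey


def workload():
    # Toy stand-in for the mcpi.py main loop
    return sum(i * i for i in range(100_000))


# Collect a profile and write it to a stats file
profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()
profiler.dump_stats("toy.stats")

# Load the stats, sort by cumulative time, and print the top 5 entries
stats = pstats.Stats("toy.stats")
stats.sort_stats(SortKey.CUMULATIVE).print_stats(5)
```

print_stats accepts the same restrictions as the stats command in the browser, so print_stats("mcpi.py", 20) would reproduce the filtered listing above.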

Let's try to improve one of the bottlenecks, the function is_in_circle. Each individual call costs practically nothing, but since it is called a million times, it consumes a lot of time in total (1.220s). The function is very simple:

def is_in_circle(x, y):
    return sqrt(x**2 + y**2) <= 1

How can we improve on that? Well, the sqrt function is rather expensive, at least much more expensive than an integer power or an addition. So let's simply square the inequality:

def is_in_circle(x, y):
    return x**2 + y**2 <= 1
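You can sanity-check such a micro-optimization in isolation with timeit before re-profiling the whole program. A minimal sketch (using math.sqrt here for a dependency-free comparison; numpy.sqrt on scalars carries even more per-call overhead):

```python
import timeit

setup = "from math import sqrt; x, y = 0.3, 0.4"

# Original check with the square root
t_sqrt = timeit.timeit("sqrt(x**2 + y**2) <= 1", setup=setup, number=100_000)

# Squared inequality: same result, no square root needed
t_squared = timeit.timeit("x**2 + y**2 <= 1", setup=setup, number=100_000)

print(f"with sqrt: {t_sqrt:.4f}s  without sqrt: {t_squared:.4f}s")
```

Note that squaring the inequality only works here because both sides are non-negative, so the comparison result is unchanged.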

Running the program in profiling mode again and starting pstats, we get

mcpi.stats% stats 20 mcpi.py
Wed Feb 15 10:29:16 2023    mcpi.stats

         3086411 function calls (3083928 primitive calls) in 2.633 seconds

   Ordered by: cumulative time
   List reduced from 1067 to 20 due to restriction <20>
   List reduced from 20 to 4 due to restriction <'mcpi.py'>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    2.634    2.634 mcpi.py:1(<module>)
        1    1.090    1.090    2.378    2.378 mcpi.py:5(main)
  1000000    0.134    0.000    0.921    0.000 mcpi.py:15(get_random_point)
  1000000    0.367    0.000    0.367    0.000 mcpi.py:19(is_in_circle)

Ok, so we significantly reduced the time spent in is_in_circle from 1.220s to 0.367s. What about the get_random_point function?

def get_random_point():
    return rand(2)

At first glance, it seems there is not much we can do about it, but there is! One problem is that the function is called very often, and each call carries some overhead. So let's rewrite the logic of the calling function and let rand produce all random points in one sweep:

from numpy.random import rand


def main(npoints):
    n_inside = 0
    points = get_random_points(npoints)
    for point in points:
        x, y = point
        if is_in_circle(x, y):
            n_inside += 1

    return 4 * n_inside / npoints


def get_random_points(npoints):
    return rand(npoints, 2)


def is_in_circle(x, y):
    return x**2 + y**2 <= 1


if __name__ == '__main__':
    print(main(10**6))

Now the statistics look like this:

mcpi.stats% stats mcpi.py
Wed Feb 15 10:34:55 2023    mcpi.stats

         1086410 function calls (1083927 primitive calls) in 1.184 seconds

   Ordered by: cumulative time
   List reduced from 1067 to 4 due to restriction <'mcpi.py'>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    1.184    1.184 mcpi.py:1(<module>)
        1    0.752    0.752    1.065    1.065 mcpi.py:4(main)
  1000000    0.296    0.000    0.296    0.000 mcpi.py:19(is_in_circle)
        1    0.000    0.000    0.017    0.017 mcpi.py:15(get_random_points)

We have reduced the time for producing the random points from 0.935s to 0.017s! You see, there was a lot of overhead in the function calls themselves. Knowing that, we can try to improve is_in_circle further by inlining it in the main function:

from numpy.random import rand


def main(nsteps):
    n_inside = 0
    points = get_random_points(nsteps)
    for point in points:
        x, y = point
        if x**2 + y**2 <= 1: # inlined the is_in_circle function
            n_inside += 1

    return 4 * n_inside / nsteps


def get_random_points(npoints):
    return rand(npoints, 2)


if __name__ == '__main__':
    print(main(10**6))

With that change, the statistics are now:

mcpi.stats% stats mcpi
Wed Feb 15 10:38:14 2023    mcpi.stats

         86410 function calls (83927 primitive calls) in 1.000 seconds

   Ordered by: cumulative time
   List reduced from 1066 to 3 due to restriction <'mcpi'>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    1.000    1.000 mcpi.py:1(<module>)
        1    0.863    0.863    0.880    0.880 mcpi.py:4(main)
        1    0.000    0.000    0.017    0.017 mcpi.py:15(get_random_points)

We gained another 10% in total runtime. Look at the data for the main function: cumtime is 0.880s, and most of it (0.863s) is spent in the function itself. So the bulk of the time goes into the for loop. We can optimize that further with NumPy vectorization:

from numpy.random import rand


def main(nsteps):
    points = get_random_points(nsteps)

    # Make a boolean array indicating if points are in or out of circle:
    points_in_or_out = points[:, 0]**2 + points[:, 1]**2 <= 1

    n_inside = sum(points_in_or_out)
    return 4 * n_inside / nsteps


def get_random_points(npoints):
    return rand(npoints, 2)


if __name__ == '__main__':
    print(main(10**6))

which leads to the statistics:

mcpi.stats% stats mcpi
Wed Feb 15 10:44:26 2023    mcpi.stats

         86411 function calls (83928 primitive calls) in 0.255 seconds

   Ordered by: cumulative time
   List reduced from 1067 to 3 due to restriction <'mcpi'>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.255    0.255 mcpi.py:1(<module>)
        1    0.023    0.023    0.123    0.123 mcpi.py:4(main)
        1    0.000    0.000    0.017    0.017 mcpi.py:12(get_random_points)

Ok, overall we have reduced the runtime step by step from 3.3s to 0.255s. And if you look at the tottime column, you'll see that only about 10% of that time is now spent in our own code; the rest goes into the NumPy functions we call and some import overhead. There is not much more we can do about this, except for parallelization. But that's a different story and needs different techniques.
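One last micro-detail: the final version used the built-in sum on the boolean array, which iterates over the elements at the Python level. NumPy's own reductions avoid that loop entirely; a sketch of the fully vectorized version:

```python
import numpy as np
from numpy.random import rand


def main(nsteps):
    points = rand(nsteps, 2)

    # Boolean array: True where the point lies inside the unit circle
    inside = points[:, 0]**2 + points[:, 1]**2 <= 1

    # np.count_nonzero counts the True entries without a Python-level loop
    n_inside = np.count_nonzero(inside)
    return 4 * n_inside / nsteps


if __name__ == '__main__':
    print(main(10**6))
```

Whether this saves a measurable amount on top of the numbers above depends on your NumPy build and array sizes, so profile it yourself before relying on it.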