Python Tips and Tricks
Introduction
If you're anything like us, most of what you know about Python you learned by trial and error. Most geospatial professionals don't have much formal training in computer science or software engineering. And that's fine, because we're mostly not computer scientists or software engineers. We don't have time to train for a whole new career just to write some automation scripts. But the downside of a purely practical education is that it can be easy to settle for suboptimal solutions because we're not aware of better practices that can improve the performance and readability of our code.
This session is designed to highlight some common code patterns that work OK, but for which a better approach exists.
- Instead of bracket notation, use more unpacking
- Instead of conditionals for validation, use try/except
- Instead of rolling your own solutions, use existing Python capabilities
- Instead of writing setup and teardown code, use context managers
- Instead of only writing documentation in separate files, use docstrings and type hints
- Instead of always modeling data collections as lists, use more tuples and sets
- Instead of list comprehensions, use more generators
- Instead of guessing about inefficiencies, profile your code
This workshop is focused on general patterns that you can use no matter what type of problem you are working on. Because these are general patterns, don't expect to be able to lift the code examples here and use them directly in your code. Do expect to take these strategies and apply them to your code.
The code examples and exercises are written using Jupyter Notebooks. If you have a Google account, you can click the Open in Colab button at the top to run the notebooks using Google's Colab environment. Otherwise, you can click the download button for each notebook to download it to your local machine and run in a notebook environment (e.g. loading the notebook into ArcGIS Pro).
Use Python idioms¶
When presented with a problem, our first instinct might be to write code that reflects how we would solve the problem manually. That's good because we can take advantage of our existing knowledge. But it's also bad because it's likely to be harder to implement than a solution that takes full advantage of Python's capabilities.
You are probably working harder than you need to when you implement solutions that match your manual process. For many types of problems, Python has solved them already. Learning the Pythonic way to address a problem will be easier than coming up with your own. It will also be easier for other people to understand your code, because you are using the well-known idioms of Python instead of your own idiosyncratic implementation.
Unpack values¶
An analyst has a tuple of three values that represent the x, y, and z coordinates of a location. The analyst has a distance function that takes three arguments, one for each coordinate.
coordinates = (2, 5, 4)
def distance_from_origin(x, y, z):
return (x**2 + y**2 + z**2) ** 0.5
The analyst needs to pass the values from the tuple to the function. One way to do that is to use the index of each value with bracket notation.
x = coordinates[0]
y = coordinates[1]
z = coordinates[2]
distance_from_origin(x, y, z)
That works, but it has two problems:
- Repetition of coordinates is error prone and tough to refactor.
- Overuse of brackets makes code harder to read.
You can use unpacking to fix both problems.
x, y, z = coordinates
distance_from_origin(x, y, z)
Variable unpacking takes a collection of values on the right-hand side of = and assigns each value in order to an equal number of names on the left-hand side. Importantly, the number of names on the left must match the number of values in the collection on the right.
x, y, z, m = coordinates  # ValueError: not enough values to unpack (expected 4, got 3)
x, y = coordinates  # ValueError: too many values to unpack (expected 2)
Unpacking to names is useful, but you can go a step further when the values you need to unpack will be passed to a function or class constructor.
distance_from_origin(*coordinates)
The * in front of the variable name unpacks the values so that each value in order is assigned to the parameters of the function.
One disadvantage of unpacking a collection into arguments this way is that it relies on parameter order. That means it only works when you can use positional arguments and doesn't work when you need to specify keyword arguments.
But if the values to unpack are in a dictionary where each key matches a parameter name, you can unpack them as keyword arguments with **. Then the order of values no longer matters.
coordinates_dict = {
"z": 4,
"y": 5,
"x": 2
}
distance_from_origin(**coordinates_dict)
Big Takeaway: Unpacking reduces the amount of code you have to write and makes your code easier to read. Take advantage of it wherever you can.
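One related capability worth knowing (an aside with invented data, not from the examples above): extended unpacking uses a starred name to collect leftover values into a list, which helps when a collection has a known first and last value but a variable middle.
readings = (12.5, 13.1, 14.8, 15.2, 9.9)
first, *middle, last = readings
print(first)   # 12.5
print(middle)  # [13.1, 14.8, 15.2]
print(last)    # 9.9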
Use comprehensions judiciously¶
An analyst has a collection of population densities in people per km2. They need to transform those values into people per mi2.
One way to do that is to loop over all the values, apply a transformation function, and append the transformed value to a new list.
people_per_km2 = (5, 40, 200, 17, 8000)
people_per_mi2 = []
for density in people_per_km2:
mi2_density = density * 2.59
people_per_mi2.append(mi2_density)
people_per_mi2
That code is correct, but it is more verbose than necessary, which can hurt readability.
When you see a pattern where you loop over something, do something to the values, then append new values to an empty list, you should consider replacing it with a list comprehension.
people_per_mi2 = [density * 2.59 for density in people_per_km2]
people_per_mi2
List comprehensions are more readable, but only for people who are familiar with them, so be aware of your audience when using them. They are also a little bit faster than using a for loop.
But comprehensions are bad when the transformation is complex. Imagine we have a list of quantitative values that we want to transform to qualitative values.
List comprehensions let you do that in a single line. But the result is an abomination.
quantitative = [100, 50, 317, 21]
qualitative = ["S" if val < 100 else "M" if val <= 200 else "L" for val in quantitative]
For more complex operations, it is much better to use an explicit loop.
qualitative = []
for val in quantitative:
if val < 100:
qualitative.append("S")
elif val <= 200:
qualitative.append("M")
else:
qualitative.append("L")
qualitative
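A middle path, if the branching logic is worth reusing, is to move it into a small named function and keep the comprehension simple. This is a sketch; size_class is a name invented for illustration.
def size_class(val):
    """Map a quantitative value to a size category."""
    if val < 100:
        return "S"
    if val <= 200:
        return "M"
    return "L"

qualitative = [size_class(val) for val in quantitative]
qualitative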
Use more of the standard library¶
An analyst has tract-level census data records. Each tract has two values: population and households. The analyst could model a single tract as a dictionary.
tract1 = {
"population": 1000,
"households": 500
}
That looks appropriate because it clearly links each value to a key that explains what the value means. But a dictionary is usually not a good data structure for a single record from a table. For one thing, there is a substantial amount of repetition if you need to model many records.
tract2 = {
"population": 2000,
"households": 800
}
tract3 = {
"population": 5000,
"households": 3000
}
Another problem is that dictionaries are mutable, which means the keys can change and cause the dictionary to no longer fit the same data schema.
del tract2["households"]
tract2
Dictionaries are optimized for fast access of a value by key. This is not usually an important goal for an individual record. Using a dictionary to model records is unnecessarily hard. A better data structure for a record is a tuple.
tract1 = (1000, 500)
tract2 = (2000, 800)
tract3 = (5000, 3000)
The problem with tuples, however, is the lack of context for what each value represents. An even better data structure for a record is a named tuple, which you can import from the standard library.
To use a named tuple, create a class that inherits from NamedTuple. For this kind of class, you only need to specify the field names and the datatype the values in each field should have. You can then create instances of that named tuple by passing the appropriate values to the constructor.
from typing import NamedTuple
class Tract(NamedTuple):
population: int
households: int
tract1 = Tract(1000, 500)
tract2 = Tract(2000, 800)
tract3 = Tract(5000, 3000)
You can access the value in a field using dot notation.
tract1.households
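Because a named tuple is still a tuple, the unpacking patterns from earlier keep working. A quick sketch:
population, households = tract1   # unpacking still works
print(tract1[0])                  # index access still works
print(tract1._asdict())           # {'population': 1000, 'households': 500}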
Big takeaway: The standard library has classes and functions that make your life easier without having to install additional packages. You should use them more. Named tuples are just one example. The official documentation lists them all, but some highlights include:
- csv for working with csv files
- dataclasses for creating dataclasses (like NamedTuple, but editable)
- datetime for working with dates and times
- itertools for efficient looping
- math for mathematical functions
- pprint for nicely printing complex data structures
- pathlib and os.path for working with file paths
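To give a flavor of two of these, here is a minimal sketch using pathlib and pprint (the file name and data are invented for illustration):
from pathlib import Path
from pprint import pprint

# pathlib builds paths with the / operator instead of string concatenation
csv_path = Path("data") / "tracts.csv"
print(csv_path.name)  # tracts.csv

# pprint wraps and aligns long or nested structures automatically
pprint({"tract1": {"population": 1000}, "tract2": {"population": 2000}})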
Use more built-ins¶
The analyst wants to know the average number of people per household across all tracts. That is not the same as averaging the number of people per household per tract. The analyst needs to divide the total population across tracts by the total number of households across tracts.
One way to get the right answer is to loop over each tract, keeping running totals of the population and household values. Then calculate the ratio.
population = 0
households = 0
tracts = [tract1, tract2, tract3]
for tract in tracts:
population += tract.population
households += tract.households
population / households
That gives the correct answer, but keeping running totals obscures the goal, which is to create the sums of the total population and households across tracts.
Summing values is a common pattern, and for many common patterns, Python has some built-in capability to make it easier to accomplish. Built-ins differ from the standard library in that you don't have to import anything to get access to built-ins.
Code that is considered Pythonic makes good use of these built-in capabilities. In this case, there is the sum function. This has the advantage of making it more explicit to the reader that the code is summing values in a collection.
population_values = []
household_values = []
tracts = [tract1, tract2, tract3]
for tract in tracts:
population_values.append(tract.population)
household_values.append(tract.households)
sum(population_values) / sum(household_values)
While you could use a list comprehension, there's actually an even better way.
Imagine each tract is a row in a table that has population and household fields. We want to get the sum of each column. But we don't actually have columns, we only have the rows.
This turns out to be a very common type of problem where we have a group of pairs (or triples, etc.), and we want a pair (or triple, etc.) of groups. For these problems, use the built-in zip function.
By using zip, you can save yourself a little bit of typing and a significant amount of thinking about the correct implementation. Using zip also makes your code easier to explain to other people familiar with Python because they don't have to reason through your implementation to make sure it's been done correctly.
population_values, household_values = zip(tract1, tract2, tract3)
sum(population_values) / sum(household_values)
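One caveat worth knowing (our aside, not covered above): plain zip silently stops at the shortest input. On Python 3.10 and later, passing strict=True makes zip raise a ValueError when the inputs have different lengths, which can catch data problems early.
population_values, household_values = zip(tract1, tract2, tract3, strict=True)
sum(population_values) / sum(household_values)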
Big Takeaway: Python built-ins can make your life easier, without even having to import additional libraries. sum, list comprehensions, and zip are among the more useful built-in capabilities of Python you should be using more. The official documentation has the complete list, but some other useful built-in functions include:
- abs for returning the absolute value of a number.
- all and any for testing the truth of a collection of values.
- dir for listing the attributes of an object.
- enumerate for getting both the index and a value from a collection. Useful for complex loops.
- help for getting information about a Python object.
- isinstance and type for getting information about an object's type.
- len for getting the length of a collection, such as a string or list.
- max and min for getting the maximum or minimum value from a group of values.
- open for opening files.
- range for creating a collection of values in a given range.
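As a quick illustrative sketch (the data is invented), here are two of these built-ins in action:
names = ["Anoka", "Dakota", "Carver"]
for i, name in enumerate(names, start=1):
    print(i, name)  # 1 Anoka / 2 Dakota / 3 Carver

print(any(name.startswith("D") for name in names))  # True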
Beg forgiveness. Don't ask permission¶
The analyst writes a function to calculate the population density in people per km2 of land area.
tract1 = {
"land_area": 20,
"population": 1000
}
def pop_density(tract):
return tract["population"] / tract["land_area"]
pop_density(tract1)
There are a few ways this could go wrong. What if the record is missing a population key because no people live there?
tract2 = {
"land_area": 10
}
pop_density(tract2)
What if it's missing a land_area key because it's all water?
tract3 = {
"population": 0
}
pop_density(tract3)
What if it has a land area value of 0 because it's all water?
tract4 = {
"land_area": 0,
"population": 0
}
pop_density(tract4)
One way to deal with potential bad values is to check for them ahead of time with conditional logic.
def pop_density2(tract):
if "population" not in tract.keys():
return 0
elif "land_area" not in tract.keys():
return 0
elif tract["land_area"] == 0:
return 0
else:
return tract["population"] / tract["land_area"]
for tract in (tract1, tract2, tract3, tract4):
print(pop_density2(tract))
But using conditional logic like this is not great. You need to put in the checks before you get to your core logic, which hurts both performance and readability.
You will also inevitably run into edge cases that you didn't anticipate. What if a tract with no people has the population key set to None?
tract5 = {
"land_area": 20,
"population": None
}
pop_density2(tract5)
Instead of writing an exploding mess of spaghetti code to deal with a never-ending parade of edge cases, it is better to use try and except. Python will attempt to run the code in the try block. If that code throws an exception, Python will run the code in the except block that matches the type of exception.
def pop_density3(tract):
try:
return tract["population"] / tract["land_area"]
except (KeyError, ZeroDivisionError, TypeError):
return 0
for tract in (tract1, tract2, tract3, tract4, tract5):
print(pop_density3(tract))
This code is still somewhat fragile. For example, it won't correctly handle records that store values as strings instead of numeric types. But it is usually easier to deal with those complexities as they arise by using try/except rather than if statements. If you really do need some complex conditional logic to handle edge cases, banish it to the except block instead of distracting the reader by putting it up front.
Big takeaway: You can just try things. It's usually easier, faster, and more readable to put the common case in a try block, and handle exceptions for edge cases where the common case doesn't work.
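For completeness, try also supports an else block that runs only when the try block raised nothing (and a finally block that always runs). Here is a sketch of the density function using else; pop_density4 is our name for this variation, not part of the examples above.
def pop_density4(tract):
    try:
        density = tract["population"] / tract["land_area"]
    except (KeyError, ZeroDivisionError, TypeError):
        # Edge cases stay out of the reader's way, down here.
        return 0
    else:
        # Runs only if the try block raised nothing.
        return density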
Use context managers¶
If you want to open a file with Python, you had better make sure you close it, or bad things can happen.
f = open("file.txt", "w")
f.write("Here is some text")
f.close()
This has two problems:
- It's easy to forget to close it, especially because most of the time it doesn't actually cause problems when you don't
- If your code crashes after you open the file but before you close it, it doesn't close properly
Instead, you may already know you should do it like this so that the file closes automatically:
with open("file.txt", "w") as f:
f.write("Here is some different text")
This isn't magic, it's a context manager. A context manager provides setup and teardown code. The setup code always runs at the beginning of the with block before anything inside the block. The teardown code always runs when the with block exits, even if it exited because of an error.
Context managers are useful for reducing the amount of repetitive boilerplate code you have to write to make sure things are set up and torn down correctly.
For example, you may want to write some data to a database, but you want to make sure the transaction gets rolled back if there's a problem with some part of the write.
The code below:
- Connects to a sqlite database
- Puts the data-writing code into a try block using an explicit transaction
- Handles errors in the except block by rolling back the transaction (conn.commit is never reached if there is an error before that)
- Closes the connection
import sqlite3
conn = sqlite3.connect("test.db")
try:
cursor = conn.cursor()
cursor.execute("BEGIN TRANSACTION")
cursor.execute("CREATE TABLE IF NOT EXISTS test (id INTEGER PRIMARY KEY, country TEXT)")
cursor.execute("INSERT INTO test (country) VALUES('Argentina')")
conn.commit()
except Exception as e:
print(f"Error {e}: Rolling back transaction")
conn.rollback()
conn.close()
It turns out that sqlite3 has a context manager that creates the connection, commits the transaction on success, and rolls it back if there's an exception.
It's important to know exactly what kind of setup and teardown a particular context manager does. This particular context manager creates the connection, but it does not automatically close it. You still have to remember to close it yourself.
db_path = "test.db"
with sqlite3.connect(db_path) as conn:
cursor = conn.cursor()
cursor.execute("CREATE TABLE IF NOT EXISTS test (id INTEGER PRIMARY KEY, country TEXT)")
cursor.execute("INSERT INTO test (country) VALUES('Argentina')")
conn.commit()
conn.close()
Big takeaway: If you are working with objects that support context managers, you should use those context managers. Pay attention to how the context manager works though, because it may not do everything you expect.
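If no context manager exists for a setup/teardown pair you write often, you can build your own with the standard library's contextlib. Below is a minimal sketch that also closes the connection, which the sqlite3 context manager above does not do. The sqlite_connection name is invented for this sketch.
import sqlite3
from contextlib import contextmanager

@contextmanager
def sqlite_connection(path):
    """Open a sqlite3 connection and guarantee it is closed afterwards."""
    conn = sqlite3.connect(path)
    try:
        yield conn      # setup is done; hand the connection to the with block
    finally:
        conn.close()    # teardown runs even if the block raised

with sqlite_connection("test.db") as conn:
    conn.execute("SELECT 1")
# The connection is closed here, error or no error.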
Exercises¶
The exercises below invite you to practice applying the different strategies outlined above. They follow the order of the concepts presented, but you can attempt them in any order. Start with the ones that seem most applicable to the work you need to do.
You can find example answers in the ExerciseAnswers.ipynb notebook.
1) Use unpacking for pretty printing¶
The code below uses a loop to print each value in a collection on a separate line.
counties = ["Anoka", "Dakota", "Carver", "Hennepin", "Ramsey", "Scott", "Washington"]
for county in counties:
print(county)
Write a different implementation that uses unpacking to print each value on a separate line using a single call to the print function instead of a loop.
Hint: The print function's first parameter is *objects, which accepts any number of positional arguments (similar to *args in other functions). These arguments are what will be printed. The second parameter is sep, which defines the character to put in between the values to print. The default value of sep is a single space (' '), but it could be a newline character ('\n').
2) Use standard library data classes¶
The code below uses a dictionary to define a record, then changes one of the values in that record.
record = {
"total_population": 5000,
"population_in_poverty": 200
}
record["total_population"] = 6000
print(record)
This pattern cannot be implemented using a named tuple, because named tuples are immutable. A data class is a standard library class that is similar to a named tuple, but it is editable. Write a different implementation of the code above that uses data classes instead of dictionaries.
Hint: The official Python documentation may be hard to understand. You may want to search for a tutorial on data classes specifically.
3) Use the built-in min and max functions¶
The code below creates a list of 20 random numbers between -1000 and 1000.
from random import randint
nums = [randint(-1000, 1000) for i in range(20)]
The code below finds the maximum and minimum values of nums using conditional logic and explicit comparisons to running values.
min_num = 1000
max_num = -1000
for num in nums:
if num > max_num:
max_num = num
if num < min_num:
min_num = num
print(max_num, min_num)
Write a different implementation that uses the built-in max and min functions instead of a loop.
4) Just do things¶
The code below defines three records using a named tuple.
from typing import NamedTuple
class Record(NamedTuple):
total_population: int
population_in_poverty: int
record1 = Record(5000, 2000)
record2 = Record(200, 10)
record3 = Record("400", "30")
The code below calculates the poverty rate, first checking that the values in the record are the correct type, and transforming them if not.
def poverty_rate(record):
total_pop, pop_in_poverty = record
if not isinstance(total_pop, int):
total_pop = int(record.total_population)
if not isinstance(pop_in_poverty, int):
pop_in_poverty = int(record.population_in_poverty)
return pop_in_poverty / total_pop
for record in (record1, record2, record3):
print(poverty_rate(record))
Write a different implementation that doesn't use if to check datatypes ahead of time.
Hint: You may find it useful to first write the code without any error handling to see what type of error occurs.
For an even better way to solve this kind of problem, look into Pydantic models. These models are not built-in or in the standard library, so you have to install the Pydantic library to get them. Pydantic models are like named tuples that guarantee the records will have the correct data type.
5) Use Pythonic patterns for setup and teardown boilerplate¶
The code below opens data.csv and writes some information to the file. Then an exception occurs before the file is closed. The code creates data.csv if it didn't exist before, but if you open the file, you will notice that the data has not been written to it. (In a Google Colab notebook, there is a files icon on the left where you can double-click to open a file in the web interface.)
f = open("data.csv", "w")
f.write("Important data")
raise ValueError
f.close()
Rewrite this code so that the data is written to the file even though it raises an exception.
Help other people understand your code¶
Even if you use Pythonic idioms, your code probably won't be perfectly understandable by itself. You want other people to be able to work with the code you write.
Docstrings¶
An analyst has a function that calculates the distance from a given point to the origin in three dimensions.
def distance_from_origin(x, y, z):
return (x**2 + y**2 + z**2) ** 0.5
One option is to say that it is perfectly obvious what this function does from its name and parameters. But your functions are much more obvious to you than they are to other people. "Other people" includes future you. You do not want future you mad at current you for not explaining what your code does.
A better option is to write down explicitly what this function does, what kind of arguments you can pass to it, and what kind of value it will return. For example, you might have a text file, or a web page, or a Word doc. Hopefully not a sticky note on your monitor, but even that's better than nothing. Something like:
Calculates the distance from a given point in three dimensions to the origin (0, 0, 0).
Args:
x (float): The x-axis coordinate.
y (float): The y-axis coordinate.
z (float): The z-axis coordinate.
Returns:
float: The distance.
That works OK, but separating your code from your documentation forces people to look in two places. It also means that the built-in help function is mostly useless for learning about your function.
help(distance_from_origin)
A better way to document your code is to include the information as a docstring. You can use docstrings with the modules, functions, classes, and methods that you create.
def distance_from_origin_docstring(x, y, z):
"""
Calculates the distance from a given point in three dimensions to the origin (0, 0, 0).
Args:
x (float): The x-axis coordinate.
y (float): The y-axis coordinate.
z (float): The z-axis coordinate.
Returns:
float: The distance.
"""
return (x**2 + y**2 + z**2) ** 0.5
By including a docstring, people can use the built-in help function to see the information without having to open the source code file.
help(distance_from_origin_docstring)
Many IDEs will even show the information when you hover over the function name.
Type hints¶
An analyst tries using the distance_from_origin_docstring function, but is getting an error.
coordinates = [2, 5, 4]
distance = distance_from_origin_docstring(*coordinates)
info_string = "The point is " + distance + " meters from the origin"
print(info_string)
The error is reasonably informative, and the analyst can use it to fix their code. But the problem only showed up after the analyst ran the code. It would be nice to get that information beforehand. Type hints are a way to pass information to type checkers and IDEs that can help ensure that you're using the correct types, without having to actually run the code.
def distance_from_origin_typehints(x: float, y: float, z: float) -> float:
"""
Calculates the distance from a given point in three dimensions to the origin (0, 0, 0).
Args:
x (float): The x-axis coordinate.
y (float): The y-axis coordinate.
z (float): The z-axis coordinate.
Returns:
float: The distance.
"""
return (x**2 + y**2 + z**2) ** 0.5
If the analyst had used this function, type checkers like Mypy would have flagged the attempt to concatenate the float stored in distance with strings. Then the analyst could have corrected their code before running it and seeing the error.
coordinates = [2, 5, 4]
distance = distance_from_origin_typehints(*coordinates)
info_string = "The point is " + distance + " meters from the origin"
print(info_string)
Type hints are well-named. They do not force you to use the right types. They will not cause Python to throw an error if you use the wrong types. They give you a hint that you are not using a value correctly.
For example, the distance_from_origin_typehints function still executes without an error when you pass it a complex number as an argument, even though a complex is not a float.
coordinates = [2j, 5, 4]
distance = distance_from_origin_typehints(*coordinates)
info_string = f"The point is {distance} meters from the origin"
print(info_string)
Type hints can be used for more complex types, like if you need to have a particular container type and you also need to specify the type of the values inside the container.
- Iterable is for containers that you want to use in a for loop.
- Sequence is an Iterable that lets you know the length and access an element by index.
- MutableSequence is a Sequence that you might need to change.
- Mapping is for dictionary-like objects where you want to get values by key.
- MutableMapping is a Mapping that you might need to change.
- If you know you want a specific type, you can also directly use dict, list, tuple, etc.
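As a quick illustration of these container hints in use, here is a minimal sketch (the function and data are invented for illustration):
from collections.abc import Mapping, Sequence

def total_population(tracts: Sequence[Mapping[str, int]]) -> int:
    """Sum the 'population' value across a sequence of tract records."""
    return sum(tract["population"] for tract in tracts)

total_population([{"population": 1000}, {"population": 2000}])  # 3000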
The code below refactors the distance function to use a single parameter and calculates the distance in any number of dimensions.
from collections.abc import Iterable
def n_dimension_distance_from_origin(coords: Iterable[float]) -> float:
"""
Calculates the distance from a given n-dimensional point to the origin.
Args:
coords (Iterable[float]):
An iterable of coordinate values, one for each dimension.
Returns:
float: The distance.
"""
sum_of_squares = sum(d ** 2 for d in coords)
return sum_of_squares ** 0.5
n_dimension_distance_from_origin((1, 1, 1, 1))
n_dimension_distance_from_origin([1, 1, 1, 1])
n_dimension_distance_from_origin(("1", "1", "1", "1"))
Exercises¶
The exercises below invite you to practice applying the different strategies outlined above. They follow the order of the concepts presented, but you can attempt them in any order. Start with the ones that seem most applicable to the work you need to do.
You can find example answers in the ExerciseAnswers.ipynb notebook.
1) Use type hints¶
Determine the input and output types of the calculate_area function below, then add type hints.
Hint: The correct type for vertices is complicated. It is passed to the cycle function, which means you need to be able to loop over it. The containers inside vertices must have both an x and a y property, and you need to be able to do arithmetic using the values of those properties.
from itertools import cycle
from typing import NamedTuple
class Vertex(NamedTuple):
x: float
y: float
def calculate_area(vertices):
subtotals = []
vertex_cycle = cycle(vertices)
next(vertex_cycle)
for vertex in vertices:
next_vertex = next(vertex_cycle)
subtotal = vertex.x * next_vertex.y - vertex.y * next_vertex.x
subtotals.append(subtotal)
area = abs(sum(subtotals) / 2)
return area
vertices = (Vertex(4, 10), Vertex(9, 7), Vertex(11, 2), Vertex(2, 2))
calculate_area(vertices)
2) Add a docstring to a function¶
Determine what the calculate_area function does, then add a docstring.
The examples above use Google-style docstrings, which is a common standard. You may also want to look at other common formats.
Hint: It is not actually necessary to understand the shoelace algorithm implemented by this function. You can still write an excellent docstring explaining what it does and how to use it.
Optimize Performance and Memory Use¶
When you begin to use Python regularly in your work, you'll start noticing bottlenecks in your code. Some workflows may run at lightning speed, while others take hours of processing time to complete, or even crash.
Avoiding bloat is invaluable as you move toward using code for automation, bigger data, and working with APIs. Code efficiency means:
- Less chance of a slowdown or crash: the dreaded MemoryError.
- Quicker response time and fewer bottlenecks for the larger workflow.
- Better scaling.
- Efficient code is often (but not always!) cleaner and more readable.
Let's look at some ways you can reduce bloat in your code.
Access and store only what you need, no more.
- Storage: avoid a list where you could use a tuple
- Membership look-up: avoid a list (or tuple) where you could use a set (or dictionary)
- Iteration: avoid a function (or list comprehension) where you could use a generator (or generator expression)
- Profile: make time for performance checks by profiling your code for bottlenecks
Use fewer lists¶
If you have a collection of values, your first thought may be to store them in a list.
data_list = [17999712, 2015, 'Hawkins Road', 'Linden ', 'NC', 28356]
Lists are nice because they are very flexible. You can change the values in the list, including appending and removing values. But that flexibility comes at a cost. Lists are less efficient than tuples. For example, they use more memory.
import sys
data_tuple = tuple(data_list)
print(sys.getsizeof(data_list))
print(sys.getsizeof(data_tuple))
Note that sys.getsizeof doesn't include the size of data in a container, just the size of the container. You can use it to compare data structures that have the same data in them, but not to compare different data.
Membership look-up: sequential vs. hashable¶
When you want to see if an element already exists in a collection of elements, neither lists nor tuples are the best choice.
- List and tuple lookup is sequential. The bigger the list, the longer look-up takes. This is called O(n) time complexity.
- Set and dictionary lookups use hashing, which means a lookup goes directly to the correct value. Lookup always takes the same amount of time, no matter how much data there is. This is called O(1) time complexity.
For example, imagine an analyst has a dataset of 1 million addresses. They also have a smaller dataset of 10,000 zip codes. They want to know which of the zip codes are associated with at least 1 of the addresses.
One way to do that is with a list.
from random import randint
addresses_zips = [randint(10000, 99950) for _ in range(1_000_000)]
zips_of_interest = [randint(10000, 99950) for _ in range(10_000)]
zips_with_address_match_from_list = []
for address_zip in addresses_zips:
if address_zip in zips_of_interest:
zips_with_address_match_from_list.append(address_zip)
print(len(zips_with_address_match_from_list))
A faster way is to use a set.
zips_of_interest_set = set(zips_of_interest)
zips_with_address_match_from_set = []
for address_zip in addresses_zips:
if address_zip in zips_of_interest_set:
zips_with_address_match_from_set.append(address_zip)
print(len(zips_with_address_match_from_set))
zips_with_address_match_from_set == zips_with_address_match_from_list
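As an aside, if you only need the unique matching zip codes (order and duplicates don't matter), a set intersection expresses the whole job in one line. Note this is not an exact replacement for the loop above, which keeps one entry per matching address:
unique_matches = set(addresses_zips) & set(zips_of_interest)
print(len(unique_matches))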
Big takeaway: Lists are appropriate when you need a collection where you can change the values, but they aren't the best choice for everything. Use a tuple if you don't need to change the values. Use a dictionary or set if you need to check if a value is in the collection.
Use more generators¶
Regular functions and comprehensions typically store outputs into containers, like lists or dictionaries. This can take up unnecessary memory, especially when we're creating multi-step workflows with many intermediate outputs.
In contrast, generators only hold one data item in memory at a time. A generator is a type of iterator that produces results on-demand (lazily), maintaining its state between iterations.
def massive_func():
"""A function that attempts to produce an infinitely long list of even numbers."""
x_list = []
x = 0
while True:
x_list.append(x)
x += 2
return x_list
# Calling this function will run out of space
for x in massive_func():
print(x)
def massive_gen():
"""A generator that produces an infinitely long stream of even numbers."""
x = 0
while True:
yield x
x += 2
# Calling this function will run out of time
for x in massive_gen():
print(x)
What goes for functions also goes for list comprehensions. You can often use a generator expression in place of a list comprehension. We've already seen an example of a generator expression in the n-dimensional distance function:
coords = (1, 1, 1, 1)
sum(d ** 2 for d in coords)
Compare that example to one that uses a list comprehension:
coords = (1, 1, 1, 1)
sum([d ** 2 for d in coords])
The sum function operates by looping over an iterable and adding each value to a running total. In the first case, the iterable is a generator that produces a single value at a time.
In the second case, the list comprehension loops over coords to produce a list where every value is stored in memory. Then the sum function loops over that list.
An important limitation of generators is that because they produce a single value at a time and then forget about it, you cannot reuse them.
generator = (d ** 2 for d in coords)
sum(generator)  # 4: this consumes the generator
max(generator)  # ValueError: max() arg is an empty sequence
Big Takeaway: If you're only going to use a value once, you should probably use a generator. If you need to use it again, you probably need to store it in something like a tuple or list.
Profile, don't guess¶
Profiling is any technique used to measure the performance of your code, such as its speed or resource usage. There are dozens of tools available for profiling, but we'll focus on two:
- Check memory use: Use tracemalloc to check the memory usage of code.
- Spot-profile your code: Use the timeit notebook magic to perform some basic profiling by cell or by line.
To make profiling easier, the cell below defines functions for calculating a sum on a generator expression and on a list comprehension. Both functions will be called with a very large number of coordinates to make profile differences more obvious.
coords = (1, 1) * 1_000_000
def sum_generator(coords):
return sum(d ** 2 for d in coords)
def sum_list_comprehension(coords):
return sum([d ** 2 for d in coords])
Check memory use¶
The cells below use tracemalloc to capture information about memory usage for the two versions of the function.
You do need to restart the kernel between runs of these cells, to ensure tracemalloc isn't counting information stored in memory from a previous cell run.
import tracemalloc
tracemalloc.start()
sum_generator(coords)
current, peak = tracemalloc.get_traced_memory()
print(peak)
import tracemalloc
tracemalloc.start()
sum_list_comprehension(coords)
current, peak = tracemalloc.get_traced_memory()
print(peak)
Spot-check speed with %%timeit¶
The timeit module measures the execution time of a selection of code. Among the ways you'll see it written are "magic" commands in notebooks.
%%timeit is a form of cell magic. It measures the execution time of the entire notebook cell.
%%timeit
sum_generator(coords)
%%timeit
sum_list_comprehension(coords)
If you just want to check the timing for a single line, you can use the %timeit line magic. That's useful if you have some code that takes some time to run, but you don't want it affecting the timeit results. Compare the use of cell magic and line magic in the next two cells.
%%timeit
from time import sleep
sleep(1)
sum_list_comprehension(coords)
from time import sleep
sleep(1)
%timeit sum_list_comprehension(coords)
Big takeaway: You can use your knowledge of Python to make some predictions about where performance bottlenecks are occurring in your code. But you should check to be sure, because those bottlenecks frequently show up in unexpected places.
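Beyond tracemalloc and timeit, the standard library also ships cProfile, which breaks execution time down per function call. A minimal sketch, assuming sum_list_comprehension and coords are already defined as above:
import cProfile

# Profile a single call and sort the report by cumulative time.
cProfile.run("sum_list_comprehension(coords)", sort="cumtime")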
Exercises¶
The exercises below invite you to practice applying the different strategies outlined above. They follow the order of the concepts presented, and you'll need to at least run the code in #3 before you attempt #4 or #5, since they rely on the function definitions in #3. You can otherwise attempt them in any order. Start with the ones that seem most applicable to the work you need to do.
You can find example answers in the ExerciseAnswers.ipynb notebook.
1) Use the right data structure for immutable sequences¶
The code below creates a list containing all years in a research study timeframe, from 1900 to 2030.
The values in this collection will not need to be changed because the study will always use this timeframe.
import sys
def list_from_range(start, end):
"""Create a list from a range of values"""
return list(range(start, end + 1))
start = 1900
end = 2030
studyYears = list_from_range(start, end)
print(studyYears)
print("Bytes used: ", sys.getsizeof(studyYears))
Write a different implementation using a different storage option and demonstrate that option uses less memory.
2) Use the right data structure for membership lookup¶
The code below assigns a collection of placenames to a list. Then, it checks whether a placename is in the list. If not, the placename is reported missing.
If you have 1 million placenames to look up against even a small list, the comparisons multiply quickly: 1 million look-ups against a 6-name list is up to 6 million checks.
placeNames_list = ["Kinshasa", "Duluth", "Uruguay"] * 1_000_000
# O(n) list look-up
if "Dinkytown" not in placeNames_list:
print("Missing.")
Write a different implementation using a storage option that allows quicker checks for membership at scale.
3) Use generators¶
The code below uses a generator to create vertices for triangles from a random selection. It also defines a function for calculating the area of a polygon from its vertices.
from itertools import cycle
from random import randint
class Random_Vertex:
def __init__(self):
self.x = randint(0, 100)
self.y = randint(0, 100)
def generate_polygon_vertices(num_polygons, num_sides):
for _ in range(num_polygons):
vertices = (Random_Vertex() for _ in range(num_sides))
yield vertices
def calculate_area(vertices):
subtotals = []
vertex_cycle = cycle(vertices)
next(vertex_cycle)
for vertex in vertices:
next_vertex = next(vertex_cycle)
subtotal = vertex.x * next_vertex.y - vertex.y * next_vertex.x
subtotals.append(subtotal)
area = abs(sum(subtotals) / 2)
return area
The code below uses the code above to generate 1 million triangles. You want to find out the area of the largest triangle. The code below does this with a list comprehension, which holds all 1 million area values in memory.
triangles = generate_polygon_vertices(1_000_000, 3)
max([calculate_area(triangle) for triangle in triangles])
Rewrite the code above to use less memory.
Hint: The easiest fix is to replace the list comprehension with a generator expression. Harder would be writing your own generator using the yield statement.
4) Check memory use of lists vs. generators¶
Change both cells below to use tracemalloc to compare their memory use.
Hint: Because the notebook keeps many variables in memory, you will want to restart the notebook kernel between running the cells to get a valid comparison. That means you will need to re-run the cell that defines the generate_polygon_vertices generator and calculate_area function.
# Using lists
triangles = generate_polygon_vertices(1_000, 3)
max([calculate_area(triangle) for triangle in triangles])
# Using a generator expression
triangles = generate_polygon_vertices(1_000, 3)
max(calculate_area(triangle) for triangle in triangles)
5) Compare execution speed of lists vs. generators¶
Change both cells below to use timeit to compare their execution time.
# Using a list
triangles = generate_polygon_vertices(1_000, 3)
max([calculate_area(triangle) for triangle in triangles])
# Using a generator expression
triangles = generate_polygon_vertices(1_000, 3)
max(calculate_area(triangle) for triangle in triangles)
Exercise Answers¶
The code cells below are example answers to the workshop exercises. They are useful if you get stuck and need a hint, or if you want to use them as a comparison with your own attempts.
1.1) Use unpacking for pretty printing¶
counties = ["Anoka", "Dakota", "Carver", "Hennepin", "Ramsey", "Scott", "Washington"]
print(*counties, sep='\n')
Anoka
Dakota
Carver
Hennepin
Ramsey
Scott
Washington
1.2) Use standard library data classes¶
from dataclasses import dataclass
@dataclass
class Record:
total_population: int
population_in_poverty: int
record = Record(5000, 200)
record.total_population = 6000
print(record)
Record(total_population=6000, population_in_poverty=200)
1.3) Use the built-in min and max functions¶
from random import randint
nums = [randint(-1000, 1000) for i in range(20)]
print(max(nums), min(nums))
891 -899
1.4) Just do things¶
from typing import NamedTuple
class Record(NamedTuple):
total_population: int
population_in_poverty: int
record1 = Record(5000, 2000)
record2 = Record(200, 10)
record3 = Record("400", "30")
def poverty_rate(record):
total_pop, pop_in_poverty = record
return int(pop_in_poverty) / int(total_pop)
for record in (record1, record2, record3):
print(poverty_rate(record))
0.4
0.05
0.075
1.5) Use Pythonic patterns for setup and teardown boilerplate¶
with open("data.csv", "w") as f:
f.write("Important data")
raise ValueError
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[3], line 3
      1 with open("data.csv", "w") as f:
      2     f.write("Important data")
----> 3 raise ValueError

ValueError:
2.1) Use type hints¶
from itertools import cycle
from collections.abc import Iterable
from typing import NamedTuple
class Vertex(NamedTuple):
x: float
y: float
def calculate_area(vertices: Iterable[Vertex]) -> float:
subtotals = []
vertex_cycle = cycle(vertices)
next(vertex_cycle)
for vertex in vertices:
next_vertex = next(vertex_cycle)
subtotal = vertex.x * next_vertex.y - vertex.y * next_vertex.x
subtotals.append(subtotal)
area = abs(sum(subtotals) / 2)
return area
vertices = (Vertex(4, 10), Vertex(9, 7), Vertex(11, 2), Vertex(2, 2))
calculate_area(vertices)
2.2) Add a docstring to a function¶
def calculate_area(vertices: Iterable[Vertex]) -> float:
"""
Calculate the area of a polygon given the coordinates of its vertices
Args:
vertices (Iterable[Vertex]):
An iterable, such as a list or tuple, of Vertex objects
holding the (x, y) coordinates of each vertex
Returns:
float: The area of the polygon
"""
subtotals = []
vertex_cycle = cycle(vertices)
next(vertex_cycle)
for vertex in vertices:
next_vertex = next(vertex_cycle)
subtotal = vertex.x * next_vertex.y - vertex.y * next_vertex.x
subtotals.append(subtotal)
area = abs(sum(subtotals) / 2)
return area
vertices = (Vertex(4, 10), Vertex(9, 7), Vertex(11, 2), Vertex(2, 2))
calculate_area(vertices)
45.5
3.1) Use the right data structure for immutable sequences¶
import sys
def tuple_from_range(start, end):
"""Create a tuple from a range of values"""
return tuple(range(start, end + 1))
start = 1900
end = 2030
studyYears = tuple_from_range(start, end)
print(studyYears)
print("Bytes used: ", sys.getsizeof(studyYears))
(1900, 1901, 1902, 1903, 1904, 1905, 1906, 1907, 1908, 1909, 1910, 1911, 1912, 1913, 1914, 1915, 1916, 1917, 1918, 1919, 1920, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024, 2025, 2026, 2027, 2028, 2029, 2030) Bytes used: 1088
3.2) Use the right data structure for membership lookup¶
placeNames_list = ["Kinshasa", "Duluth", "Uruguay"] * 1_000_000
placeNames_set = set(placeNames_list)
# O(1) set look-up
if "Dinkytown" not in placeNames_set:
print("Missing.")
3.3) Use generators¶
from itertools import cycle
from random import randint
class Random_Vertex:
def __init__(self):
self.x = randint(0, 100)
self.y = randint(0, 100)
def generate_polygon_vertices(num_polygons, num_sides):
for _ in range(num_polygons):
vertices = (Random_Vertex() for _ in range(num_sides))
yield vertices
def calculate_area(vertices):
subtotals = []
vertex_cycle = cycle(vertices)
next(vertex_cycle)
for vertex in vertices:
next_vertex = next(vertex_cycle)
subtotal = vertex.x * next_vertex.y - vertex.y * next_vertex.x
subtotals.append(subtotal)
area = abs(sum(subtotals) / 2)
return area
Easier: Use a generator expression to find the triangle with the maximum area instead of a list comprehension
triangles = generate_polygon_vertices(1_000_000, 3)
max(calculate_area(triangle) for triangle in triangles)
5000.0
Harder: Write a generator to replace the list comprehension instead of using a generator expression.
def calculate_areas(polygons):
for polygon in polygons:
yield(calculate_area(polygon))
triangles = generate_polygon_vertices(1_000_000, 3)
max(calculate_areas(triangles))
5000.0
3.4) Check memory use of lists vs. generators¶
import tracemalloc
tracemalloc.start()
triangles = generate_polygon_vertices(1_000_000, 3)
max([calculate_area(triangle) for triangle in triangles])
current, peak = tracemalloc.get_traced_memory()
print(peak)
32452069
import tracemalloc
tracemalloc.start()
triangles = generate_polygon_vertices(1_000_000, 3)
max(calculate_area(triangle) for triangle in triangles)
current, peak = tracemalloc.get_traced_memory()
print(peak)
31073
3.5) Compare execution speed of lists vs. generators¶
%%timeit
triangles = generate_polygon_vertices(1_000, 3)
max([calculate_area(triangle) for triangle in triangles])
7.55 ms ± 40.6 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
triangles = generate_polygon_vertices(1_000, 3)
max(calculate_area(triangle) for triangle in triangles)
7.54 ms ± 32.5 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)