Use Python idioms¶
When presented with a problem, our first instinct might be to write code that reflects how we would solve the problem manually. That's good because we can take advantage of our existing knowledge. But it's also bad because it's likely to be harder to implement than a solution that takes full advantage of Python's capabilities.
You are probably working harder than you need to when you implement solutions that match your manual process. For many types of problems, Python has solved them already. Learning the Pythonic way to address a problem will be easier than coming up with your own. It will also be easier for other people to understand your code, because you are using the well-known idioms of Python instead of your own idiosyncratic implementation.
Unpack values¶
An analyst has a tuple of three values that represent the x, y, and z coordinates of a location. The analyst has a distance function that takes three arguments, one for each coordinate.
coordinates = (2, 5, 4)
def distance_from_origin(x, y, z):
    return (x**2 + y**2 + z**2) ** 0.5
The analyst needs to pass the values from the tuple to the function. One way to do that is to use the index of each value with bracket notation.
x = coordinates[0]
y = coordinates[1]
z = coordinates[2]
distance_from_origin(x, y, z)
That works, but it has two problems:
- Repetition of coordinates is error prone and tough to refactor.
- Overuse of brackets makes the code harder to read.
You can use unpacking to fix both problems.
x, y, z = coordinates
distance_from_origin(x, y, z)
Variable unpacking takes a collection of values on the right-hand side of =
and assigns each value, in order, to an equal number of names on the left-hand side. Importantly, the number of names on the left must match the number of values in the collection on the right.
x, y, z, m = coordinates  # raises ValueError: not enough values to unpack
x, y = coordinates        # raises ValueError: too many values to unpack
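When the counts can't match because you only need some of the values, extended unpacking with a starred name absorbs the surplus into a list. A minimal sketch:

```python
coordinates = (2, 5, 4)

# A starred name collects whatever the other names don't take, as a list
x, *rest = coordinates
print(x)     # 2
print(rest)  # [5, 4]

# The starred name can go in any position
*head, z = coordinates
print(head)  # [2, 5]
print(z)     # 4
```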
Unpacking to names is useful, but you can go a step further when the values you need to unpack will be passed to a function or class constructor.
distance_from_origin(*coordinates)
The *
in front of the variable name unpacks the values so that each value in order is assigned to the parameters of the function.
One disadvantage of unpacking a collection into arguments this way is that it relies on parameter order. That means it only works when you can use positional arguments and doesn't work when you need to specify keyword arguments.
But if the values to unpack are in a dictionary where each key matches a parameter name, you can unpack them as keyword arguments with **
. Then the order of values no longer matters.
coordinates_dict = {
    "z": 4,
    "y": 5,
    "x": 2
}
distance_from_origin(**coordinates_dict)
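The same ** syntax also works inside a dictionary literal, which gives a concise way to build one dictionary from others. A small sketch with made-up values:

```python
defaults = {"x": 0, "y": 0, "z": 0}
overrides = {"z": 4}

# Later entries win, so overrides replaces defaults key by key
merged = {**defaults, **overrides}
print(merged)  # {'x': 0, 'y': 0, 'z': 4}
```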
Big Takeaway: Unpacking reduces the amount of code you have to write and makes your code easier to read. Take advantage of it wherever you can.
Use comprehensions judiciously¶
An analyst has a list of population densities in people per km2. They need to transform those values to people per mi2.
One way to do that is to loop over all the values, apply a transformation function, and append the transformed value to a new list.
people_per_km2 = (5, 40, 200, 17, 8000)
people_per_mi2 = []
for density in people_per_km2:
    mi2_density = density * 2.59
    people_per_mi2.append(mi2_density)
people_per_mi2
That code is correct, but it is more verbose than necessary, which can hurt readability.
When you see a pattern where you loop over something, do something to the values, then append new values to an empty list, you should consider replacing it with a list comprehension.
people_per_mi2 = [density * 2.59 for density in people_per_km2]
people_per_mi2
List comprehensions are more readable, but only for people who are familiar with them, so be aware of your audience when using them. They are also a little bit faster than using a for
loop.
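Comprehensions can also filter while they transform by adding an if clause at the end. A sketch reusing the densities above:

```python
people_per_km2 = (5, 40, 200, 17, 8000)

# Convert only the densities above 100 people per km2
dense_only = [density * 2.59 for density in people_per_km2 if density > 100]
print(dense_only)
```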
But comprehensions are bad when the transformation is complex. Imagine we have a list of quantitative values that we want to transform to qualitative values.
List comprehensions let you do that in a single line. But it is an abomination.
quantitative = [100, 50, 317, 21]
qualitative = ["S" if val < 100 else "M" if val <= 200 else "L" for val in quantitative]
For more complex operations, it is much better to use an explicit loop.
qualitative = []
for val in quantitative:
    if val < 100:
        qualitative.append("S")
    elif val <= 200:
        qualitative.append("M")
    else:
        qualitative.append("L")
qualitative
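A middle ground, if you still want a comprehension, is to move the branching into a named function. The helper below (size_label) is a hypothetical name, not part of the original example:

```python
quantitative = [100, 50, 317, 21]

def size_label(val):
    """Map a quantitative value to a size category."""
    if val < 100:
        return "S"
    elif val <= 200:
        return "M"
    return "L"

# The comprehension stays short because the logic has a name
qualitative = [size_label(val) for val in quantitative]
print(qualitative)  # ['M', 'S', 'L', 'S']
```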
Use more of the standard library¶
An analyst has tract-level census data records. Each tract has two values: population and households. The analyst could model a single tract as a dictionary.
tract1 = {
    "population": 1000,
    "households": 500
}
That looks appropriate because it clearly links each value to a key that explains what the value means. But a dictionary is usually not a good data structure for a single record from a table. For one thing, there is a substantial amount of repetition if you need to model many records.
tract2 = {
    "population": 2000,
    "households": 800
}

tract3 = {
    "population": 5000,
    "households": 3000
}
Another problem is that dictionaries are mutable, which means the keys can change and cause the dictionary to no longer fit the same data schema.
del tract2["households"]
tract2
Dictionaries are optimized for fast access to a value by key. That is not usually an important goal for an individual record. Using a dictionary to model records is unnecessarily hard. A better data structure for a record is a tuple.
tract1 = (1000, 500)
tract2 = (2000, 800)
tract3 = (5000, 3000)
The problem with tuples, however, is the lack of context for what each value represents. An even better data structure for a record is a named tuple, which you can import from the standard library.
To use a named tuple, create a class that inherits from NamedTuple
. For this kind of class, you only need to specify the field names and the datatype the values in each field should have. You can then create instances of that named tuple by passing the appropriate values to the constructor.
from typing import NamedTuple
class Tract(NamedTuple):
    population: int
    households: int
tract1 = Tract(1000, 500)
tract2 = Tract(2000, 800)
tract3 = Tract(5000, 3000)
You can access the value in a field using dot notation.
tract1.households
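Named tuples are still tuples, so the earlier idioms keep working: you can unpack them, and _replace gives you an updated copy without mutating the original. A brief sketch:

```python
from typing import NamedTuple

class Tract(NamedTuple):
    population: int
    households: int

tract1 = Tract(1000, 500)

# Unpacking works exactly as with a plain tuple
population, households = tract1
print(population, households)  # 1000 500

# _replace returns a new instance; the original is untouched
bigger = tract1._replace(population=1200)
print(bigger)  # Tract(population=1200, households=500)
print(tract1)  # Tract(population=1000, households=500)
```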
Big takeaway: The standard library has classes and functions that make your life easier without having to install additional packages. You should use them more. Named tuples are just one example. The official documentation lists them all, but some highlights include:
- csv for working with csv files
- dataclasses for creating data classes (like NamedTuple, but editable)
- datetime for working with dates and times
- itertools for efficient looping
- math for mathematical functions
- pprint for nicely printing complex data structures
- pathlib and os.path for working with file paths
Use more built-ins¶
The analyst wants to know the average number of people per household across all tracts. That is not the same as averaging the number of people per household per tract. The analyst needs to divide the total population across tracts by the total number of households across tracts.
One way to get the right answer is to loop over each tract, keeping running totals of the population and household values, then calculate the ratio.
population = 0
households = 0
tracts = [tract1, tract2, tract3]
for tract in tracts:
    population += tract.population
    households += tract.households
population / households
That gives the correct answer, but keeping running totals obscures the goal, which is to sum the population and household values across tracts.
Summing values is a common pattern, and for many common patterns, Python has some built-in capability to make it easier to accomplish. Built-ins differ from the standard library in that you don't have to import anything to get access to built-ins.
Code that is considered Pythonic makes good use of these built-in capabilities. In this case, there is the sum
function. This has the advantage of making it more explicit to the reader that the code is summing values in a collection.
population_values = []
household_values = []
tracts = [tract1, tract2, tract3]
for tract in tracts:
    population_values.append(tract.population)
    household_values.append(tract.households)
sum(population_values) / sum(household_values)
While you could use a list comprehension, there's actually an even better way.
Imagine each tract is a row in a table that has population and household fields. We want to get the sum of each column. But we don't actually have columns, we only have the rows.
This turns out to be a very common type of problem where we have a group of pairs (or triples, etc), and we want a pair (or triple, etc) of groups. For these problems, use the built-in zip
function.
By using zip
, you can save yourself a little bit of typing and a significant amount of thinking about the correct implementation. Using zip
also makes your code easier to explain to other people familiar with Python because they don't have to reason through your implementation to make sure it's been done correctly.
population_values, household_values = zip(tract1, tract2, tract3)
sum(population_values) / sum(household_values)
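If the records are already collected in a list, the same idea works by unpacking the list into zip with *, combining the two idioms from this chapter:

```python
tracts = [(1000, 500), (2000, 800), (5000, 3000)]

# zip(*tracts) turns rows into columns: one tuple per field
population_values, household_values = zip(*tracts)
print(population_values)  # (1000, 2000, 5000)
print(household_values)   # (500, 800, 3000)
print(sum(population_values) / sum(household_values))
```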
Big Takeaway: Python built-ins can make your life easier without even having to import additional libraries. sum, list comprehensions, and zip are among the more useful built-in capabilities of Python you should be using more. The official documentation has the complete list, but some other useful built-in functions include:
- abs for returning the absolute value of a number
- all and any for testing the truth of a collection of values
- dir for listing the attributes of an object
- enumerate for getting both the index and a value from a collection; useful for complex loops
- help for getting information about a Python object
- isinstance and type for getting information about an object's type
- len for getting the length of a collection, such as a string or list
- max and min for getting the maximum or minimum value from a group of values
- open for opening files
- range for creating a collection of values in a given range
Beg forgiveness. Don't ask permission¶
The analyst writes a function to calculate the population density in people per km2 of land area.
tract1 = {
    "land_area": 20,
    "population": 1000
}

def pop_density(tract):
    return tract["population"] / tract["land_area"]
pop_density(tract1)
There are a few ways this could go wrong. What if the record is missing a population
key because no people live there?
tract2 = {
    "land_area": 10
}
pop_density(tract2)  # raises KeyError: 'population'
What if it's missing a land_area
key because it's all water?
tract3 = {
    "population": 0
}
pop_density(tract3)  # raises KeyError: 'land_area'
What if it has a land area value of 0 because it's all water?
tract4 = {
    "land_area": 0,
    "population": 0
}
pop_density(tract4)  # raises ZeroDivisionError
One way to deal with potential bad values is to check for them ahead of time with conditional logic.
def pop_density2(tract):
    if "population" not in tract.keys():
        return 0
    elif "land_area" not in tract.keys():
        return 0
    elif tract["land_area"] == 0:
        return 0
    else:
        return tract["population"] / tract["land_area"]
for tract in (tract1, tract2, tract3, tract4):
    print(pop_density2(tract))
But using conditional logic like this is not great. You need to put in the checks before you get to your core logic, which hurts both performance and readability.
You will also inevitably run into edge cases that you didn't anticipate. What if a tract with no people has the population
key set to None
?
tract5 = {
    "land_area": 20,
    "population": None
}
pop_density2(tract5)  # raises TypeError: None is not a number
Instead of writing an exploding mess of spaghetti code to deal with a never-ending parade of edge cases, it is better to use try
and except
. Python will attempt to run the code in the try
block. If that code throws an exception, Python will run the code in the except
block that matches the type of exception.
def pop_density3(tract):
    try:
        return tract["population"] / tract["land_area"]
    except (KeyError, ZeroDivisionError, TypeError):
        return 0
for tract in (tract1, tract2, tract3, tract4, tract5):
    print(pop_density3(tract))
This code is still somewhat fragile. For example, it won't correctly handle records that store values as strings instead of numeric types. But it is usually easier to deal with those complexities as they arise by using try
/except
rather than if
statements. If you do really need some complex conditional logic to handle edge cases, banish it to the except
block instead of distracting the reader by putting it up front.
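For example, if some records store their values as strings, the conversion can live in the except block rather than up front. The function below (pop_density4) is a sketch extending this idea, with ValueError added for strings that aren't numeric at all:

```python
def pop_density4(tract):
    try:
        return tract["population"] / tract["land_area"]
    except (KeyError, ZeroDivisionError, TypeError):
        # Retry once with explicit conversion; fall back to 0 for anything
        # still missing, zero, or non-numeric
        try:
            return float(tract["population"]) / float(tract["land_area"])
        except (KeyError, ZeroDivisionError, ValueError, TypeError):
            return 0

print(pop_density4({"land_area": "20", "population": "1000"}))  # 50.0
print(pop_density4({"land_area": 10}))                          # 0
```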
Big takeaway: You can just try things. It's usually easier, faster, and more readable to put the common case in a try
block, and handle exceptions for edge cases where the common case doesn't work.
Use context managers¶
If you want to open a file with Python, you had better make sure you close it, or bad things can happen: buffered writes may never reach the disk, and the file handle leaks.
f = open("file.txt", "w")
f.write("Here is some text")
f.close()
This has two problems:
- It's easy to forget to close it, especially because most of the time it doesn't actually cause problems when you don't
- If your code crashes after you open the file but before you close it, it doesn't close properly
Instead, you may already know you should do it like this so that the file closes automatically:
with open("file.txt", "w") as f:
    f.write("Here is some different text")
This isn't magic, it's a context manager. A context manager provides setup and tear down code. The setup code always runs at the beginning of the with
block before anything inside the block. The teardown code always runs when the with
block exits, even if it exited because of an error.
Context managers are useful for reducing the amount of repetitive boilerplate code you have to write to make sure things are set up and torn down correctly.
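You can also write your own context manager when you keep repeating the same setup and teardown. The standard library's contextlib.contextmanager decorator turns a generator function into one; the timer below is a hypothetical example:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    start = time.perf_counter()  # setup runs before the with-block body
    try:
        yield                    # the body of the with block runs here
    finally:
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.4f}s")  # teardown runs even on error

with timed("summing"):
    total = sum(range(1_000_000))
```

Everything before the yield is setup, everything after it is teardown, and the finally guarantees the teardown runs even if the block raises.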
For example, you may want to write some data to a database, but you want to make sure the transaction gets rolled back if there's a problem with some part of the write.
The code below:
- Connects to a sqlite database
- Puts the data writing code into a try block using an explicit transaction
- Handles errors in the except block by rolling back the transaction (conn.commit() is never reached if there is an error before that)
- Closes the connection
import sqlite3

conn = sqlite3.connect("test.db")
try:
    cursor = conn.cursor()
    cursor.execute("BEGIN TRANSACTION")
    cursor.execute("CREATE TABLE IF NOT EXISTS test (id INTEGER PRIMARY KEY, country TEXT)")
    cursor.execute("INSERT INTO test (country) VALUES('Argentina')")
    conn.commit()
except Exception as e:
    print(f"Error {e}: Rolling back transaction")
    conn.rollback()  # note the parentheses: conn.rollback without them does nothing
conn.close()
It turns out that sqlite3 connections work as context managers: they commit the transaction if the with block succeeds and roll it back if there's an exception.
It's important to know exactly what kind of setup and teardown a particular context manager does. This particular context manager handles the transaction, but it does not automatically close the connection. You still have to remember to close it yourself.
db_path = "test.db"

with sqlite3.connect(db_path) as conn:
    cursor = conn.cursor()
    cursor.execute("CREATE TABLE IF NOT EXISTS test (id INTEGER PRIMARY KEY, country TEXT)")
    cursor.execute("INSERT INTO test (country) VALUES('Argentina')")
    # No explicit commit needed: the context manager commits on success

conn.close()  # but closing is still up to you
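If you want the close handled automatically as well, the standard library's contextlib.closing wraps any object that has a close method. Combining it with the connection's own context manager covers both the transaction and the close; a sketch using an in-memory database:

```python
import sqlite3
from contextlib import closing

# closing() guarantees conn.close(); the inner "with conn" commits on
# success and rolls back if an exception escapes the block
with closing(sqlite3.connect(":memory:")) as conn:
    with conn:
        conn.execute("CREATE TABLE test (id INTEGER PRIMARY KEY, country TEXT)")
        conn.execute("INSERT INTO test (country) VALUES ('Argentina')")
    rows = conn.execute("SELECT country FROM test").fetchall()
    print(rows)  # [('Argentina',)]
```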
Big takeaway: If you are working with objects that support context managers, you should use those context managers. Pay attention to how the context manager works though, because it may not do everything you expect.
Exercises¶
The exercises below invite you to practice applying the different strategies outlined above. They follow the order of the concepts presented, but you can attempt them in any order. Start with the ones that seem most applicable to the work you need to do.
You can find example answers in the ExerciseAnswers.ipynb notebook.
1) Use unpacking for pretty printing¶
The code below uses a loop to print each value in a collection on a separate line.
counties = ["Anoka", "Dakota", "Carver", "Hennepin", "Ramsey", "Scott", "Washington"]
for county in counties:
    print(county)
Write a different implementation that uses unpacking to print each value on a separate line using a single call to the print
function instead of a loop.
Hint: The print
function's first parameter is *objects
, which accepts any number of positional arguments (similar to *args
in other functions). These arguments are what will be printed. The second parameter is sep
, which defines the character to put in between the values to print. The default value of sep
is a single space (' '
), but it could be a newline character ('\n'
).
2) Use standard library data classes¶
The code below uses a dictionary to define a record, then changes one of the values in that record.
record = {
    "total_population": 5000,
    "population_in_poverty": 200
}
record["total_population"] = 6000
print(record)
This pattern cannot be implemented using a named tuple, because named tuples are immutable. A data class is a standard library class that is similar to a named tuple, but it is editable. Write a different implementation of the code above that uses data classes instead of dictionaries.
Hint: The official Python documentation may be hard to understand. You may want to search for a tutorial on data classes specifically.
3) Use the built-in min and max functions¶
The code below creates a list of 20 random numbers between -1000 and 1000.
from random import randint
nums = [randint(-1000, 1000) for i in range(20)]
The code below finds the maximum and minimum values of nums
using conditional logic and explicit comparisons to running values.
min_num = 1000
max_num = -1000

for num in nums:
    if num > max_num:
        max_num = num
    if num < min_num:
        min_num = num

print(max_num, min_num)

Write a different implementation that uses the built-in min and max functions instead of the loop and the running values.
4) Just do things¶
The code below defines three records using a named tuple.
from typing import NamedTuple
class Record(NamedTuple):
    total_population: int
    population_in_poverty: int
record1 = Record(5000, 2000)
record2 = Record(200, 10)
record3 = Record("400", "30")
The code below calculates the poverty rate, first checking that the values in the record are the correct type, and transforming them if not.
def poverty_rate(record):
    total_pop, pop_in_poverty = record
    if not isinstance(total_pop, int):
        total_pop = int(record.total_population)
    if not isinstance(pop_in_poverty, int):
        pop_in_poverty = int(record.population_in_poverty)
    return pop_in_poverty / total_pop
for record in (record1, record2, record3):
    print(poverty_rate(record))
Write a different implementation that doesn't use if
to check datatypes ahead of time.
Hint: You may find it useful to first write the code without any error handling to see what type of error occurs.
For an even better way to solve this kind of problem, look into Pydantic models. These models are not built-in or in the standard library, so you have to install the Pydantic library to get them. Pydantic models are like named tuples that guarantee the records will have the correct data type.
5) Use Pythonic patterns for setup and teardown boilerplate¶
The code below opens data.csv
and writes some information to the file. Then an exception occurs before the file is closed. The code creates data.csv
if it didn't exist before, but if you open the file, you will notice that the data has not been written to the file (In a Google Colab notebook, there is a files icon on the left where you can double-click to open a file in the web interface)
f = open("data.csv", "w")
f.write("Important data")
raise ValueError
f.close()
Rewrite this code so that the data is written to the file even though it raises an exception.