Don't try so hard¶
When presented with a problem, our first instinct might be to write code that reflects how we would solve the problem manually. That's good because we can take advantage of our existing knowledge. But it's also bad because a manual-style solution is likely to be harder to implement than one that takes full advantage of Python's capabilities.
You are probably working harder than you need to when you implement solutions that match your manual process. Python has already solved many types of problems for you. Learning the Pythonic way to address a problem will be easier than coming up with your own. It will also be easier for other people to understand your code, because you are using the well-known idioms of Python instead of your own idiosyncratic implementation.
Unpack values¶
An analyst has a tuple of three values that represent the x, y, and z coordinates of a location. The analyst has a distance function that takes three arguments, one for each coordinate.
coordinates = (2, 5, 4)
def distance_from_origin(x, y, z):
    return (x**2 + y**2 + z**2) ** 0.5
The analyst needs to pass the values from the tuple to the function. One way to do that is to use the index of each value with bracket notation.
x = coordinates[0]
y = coordinates[1]
z = coordinates[2]
distance_from_origin(x, y, z)
6.708203932499369
That works, but it has two problems:

- Repetition of `coordinates` is error-prone and tough to refactor.
- Overuse of brackets makes code harder to read.
You can use unpacking to fix both problems.
x, y, z = coordinates
distance_from_origin(x, y, z)
6.708203932499369
Variable unpacking takes a collection of values on the right-hand side of `=` and assigns each value, in order, to an equal number of names on the left-hand side. Importantly, the number of names on the left must match the number of values in the collection on the right.
x, y, z, m = coordinates
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[6], line 1
----> 1 x, y, z, m = coordinates

ValueError: not enough values to unpack (expected 4, got 3)
x, y = coordinates
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[5], line 1
----> 1 x, y = coordinates

ValueError: too many values to unpack (expected 2)
Unpacking to names is useful, but you can go a step further when the values you need to unpack will be passed to a function or class constructor.
distance_from_origin(*coordinates)
6.708203932499369
The `*` in front of the variable name unpacks the values so that each value, in order, is assigned to the parameters of the function.
One disadvantage of unpacking a collection into arguments this way is that it relies on parameter order. That means it only works when you can use positional arguments and doesn't work when you need to specify keyword arguments.
But if the values to unpack are in a dictionary where each key matches a parameter name, you can unpack them as keyword arguments with `**`. Then the order of the values no longer matters.
coordinates_dict = {
    "z": 4,
    "y": 5,
    "x": 2
}
distance_from_origin(**coordinates_dict)
6.708203932499369
If you have ever seen a function definition with `*args` and `**kwargs`, that's related to unpacking. The `*args` parameter means the function can be called with any number of positional arguments. The `**kwargs` parameter means the function can be called with any number of keyword arguments.
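For instance, here is a minimal sketch (the show_args function is made up for illustration):

def show_args(*args, **kwargs):
    # args collects the positional arguments into a tuple
    # kwargs collects the keyword arguments into a dictionary
    print("positional:", args)
    print("keyword:", kwargs)

show_args(2, 5, z=4)

positional: (2, 5)
keyword: {'z': 4}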
Big Takeaway: Unpacking reduces the amount of code you have to write and makes your code easier to read. Take advantage of it wherever you can.
Beg forgiveness. Don't ask permission¶
An analyst has tract-level census data where each tract has three values:
- Total Area in km²
- Water Area in km²
- Population
tract1 = {
    "area": 100,
    "area_water": 20,
    "population": 1000
}
The analyst writes a function to calculate the population density in people per km² of land area.
def pop_density(tract):
    area_land = tract["area"] - tract["area_water"]
    return tract["population"] / area_land
pop_density(tract1)
12.5
But some records don't have a value for the water area because they are all land. This is the equivalent of not having an `"area_water"` column. Passing those records to the function causes a `KeyError` exception.
tract2 = {
    "area": 100,
    "population": 1000
}
pop_density(tract2)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[53], line 6
      1 tract2 = {
      2     "area": 100,
      3     "population": 1000
      4 }
----> 6 pop_density(tract2)

Cell In[52], line 2, in pop_density(tract)
      1 def pop_density(tract):
----> 2     area_land = tract["area"] - tract["area_water"]
      3     return tract["population"] / area_land

KeyError: 'area_water'
One way to deal with potentially bad values is to check ahead of time with conditional logic (`if`/`elif`/`else`).
def pop_density2(tract):
    if "area_water" not in tract.keys():
        area_land = tract["area"]
    else:
        area_land = tract["area"] - tract["area_water"]
    return tract["population"] / area_land
pop_density2(tract2)
10.0
But using conditional logic like this is not great. You need to put in the checks before you get to your core logic, which hurts both performance and readability. You will also run into edge cases you didn't anticipate that cause your code to fail or return the wrong answer. For example, some records without any water have the `"area_water"` value explicitly set to `None`. This is equivalent to having null values in an `"area_water"` column. Passing a record like that to the function causes a `TypeError`.
tract3 = {
    "area": 100,
    "area_water": None,
    "population": 1000
}
pop_density2(tract3)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[55], line 7
      1 tract3 = {
      2     "area": 100,
      3     "area_water": None,
      4     "population": 1000
      5 }
----> 7 pop_density2(tract3)

Cell In[54], line 5, in pop_density2(tract)
      3     area_land = tract["area"]
      4 else:
----> 5     area_land = tract["area"] - tract["area_water"]
      6 return tract["population"] / area_land

TypeError: unsupported operand type(s) for -: 'int' and 'NoneType'
Your code gets more complicated as you deal with those edge cases.
def pop_density3(tract):
    if "area_water" not in tract.keys() or tract["area_water"] is None:
        area_land = tract["area"]
    else:
        area_land = tract["area"] - tract["area_water"]
    return tract["population"] / area_land
pop_density3(tract3)
10.0
No matter how many edge cases you anticipate, there will probably be another one you didn't. For example, passing a record that is all water causes a `ZeroDivisionError`.
tract4 = {
    "area": 100,
    "area_water": 100,
    "population": 0
}
pop_density3(tract4)
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
Cell In[57], line 7
      1 tract4 = {
      2     "area": 100,
      3     "area_water": 100,
      4     "population": 0
      5 }
----> 7 pop_density3(tract4)

Cell In[56], line 6, in pop_density3(tract)
      4 else:
      5     area_land = tract["area"] - tract["area_water"]
----> 6 return tract["population"] / area_land

ZeroDivisionError: division by zero
Instead of writing an exploding mess of spaghetti code to deal with a never-ending parade of edge cases, it is better to use `try` and `except`. Python will attempt to run the code in the `try` block. If that code throws an exception, Python will run the code in the `except` block that matches the type of exception.
def pop_density4(tract):
    try:
        area_land = tract["area"] - tract["area_water"]
        return tract["population"] / area_land
    except (KeyError, TypeError):
        return tract["population"] / tract["area"]
    except ZeroDivisionError:
        return 0

for tract in [tract1, tract2, tract3, tract4]:
    print(pop_density4(tract))
12.5
10.0
10.0
0
This pattern puts your core logic at the top and deals with edge cases afterward, making your code more performant and readable. It also gives you the option to handle different types of errors differently. The first `except` block handles both the `KeyError` and `TypeError` problems by using the total area to calculate population density. The second `except` block handles the `ZeroDivisionError` by appropriately returning `0`.
Big Takeaway: You can just try things. It's usually easier, faster, and more readable to put the common case in a `try` block, and handle exceptions for edge cases where the common case doesn't work.
Use more of the standard library¶
An analyst has tract-level census data records. Each tract has two values: population and households. The analyst could model a single tract as a dictionary.
tract1 = {
    "population": 1000,
    "households": 500
}
That looks appropriate because it clearly links each value to a key that explains what the value means. But a dictionary is usually not a good data structure for a single record from a table. For one thing, there is a substantial amount of repetition if you need to model many records.
tract2 = {
    "population": 2000,
    "households": 800
}

tract3 = {
    "population": 5000,
    "households": 3000
}
Another problem is that dictionaries are mutable, which means the keys can change and cause the dictionary to no longer fit the same data schema.
del tract2["households"]
tract2
{'population': 2000}
Dictionaries are optimized for fast access of a value by key. That is not usually an important goal for an individual record. Using a dictionary to model records is unnecessarily hard. A better data structure for a record is a tuple.
tract1 = (1000, 500)
tract2 = (2000, 800)
tract3 = (5000, 3000)
A glaring omission from a tuple, however, is any context for what each value represents. An even better data structure for a record is a named tuple, which you can import from the standard library.
To use a named tuple, create a class that inherits from `NamedTuple`. For this kind of class, you only need to specify the field names and the datatype the values in each field should have. You can then create instances of that named tuple by passing the appropriate values to the constructor.
from typing import NamedTuple

class Tract(NamedTuple):
    population: int
    households: int

tract1 = Tract(1000, 500)
tract2 = Tract(2000, 800)
tract3 = Tract(5000, 3000)
You can access the value in a field using dot notation.
tract1.households
500
Big Takeaway: The standard library has classes and functions that make your life easier without having to install additional packages. You should use them more. Named tuples are just one example. The official documentation lists them all, but some highlights include (two of them are sketched after this list):

- csv for working with csv files
- dataclasses for creating data classes (like `NamedTuple`, but editable)
- datetime for working with dates and times
- itertools for efficient looping
- math for mathematical functions
- pprint for nicely printing complex data structures
- pathlib and os.path for working with file paths
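As a quick taste, here is a minimal sketch using two of the modules listed above (the example data is arbitrary):

from math import sqrt
from pprint import pprint

# pprint lays nested data structures out across multiple lines
pprint({"tract1": {"population": 1000, "households": 500},
        "tract2": {"population": 2000, "households": 800}})

# sqrt is one of many mathematical functions in the math module
print(sqrt(16))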
Use more built-ins¶
The analyst wants to know the average number of people per household across all tracts. That is not the same as averaging the number of people per household per tract. The analyst needs to divide the total population across tracts by the total number of households across tracts.
One way to get the right answer is to loop over each tract, keeping running totals of the population and household values, then calculate the ratio.
population = 0
households = 0

tracts = [tract1, tract2, tract3]
for tract in tracts:
    population += tract.population
    households += tract.households

population / households
1.8604651162790697
That gives the correct answer, but keeping running totals obscures the goal, which is to create a sum of the total population and households across tracts. Summing values is a common pattern, and for many common patterns, Python has some built-in capability to make them easier to accomplish. Built-ins differ from functionality in the standard library in that you don't have to import anything to get access to them. Code that is considered Pythonic makes good use of these built-in capabilities. In this case, there is the `sum` function. It has the advantage of making it more explicit to the reader that the code is summing the values in a collection.
population_values = []
household_values = []

tracts = [tract1, tract2, tract3]
for tract in tracts:
    population_values.append(tract.population)
    household_values.append(tract.households)

sum(population_values) / sum(household_values)
1.8604651162790697
Appending values to a list in a loop is also a very common pattern. For this case, Python has list comprehensions. List comprehensions are shorter and more readable (once you get used to them) than explicit loops. They also execute faster than the equivalent loop.
tracts = [tract1, tract2, tract3]
population_values = [tract.population for tract in tracts]
household_values = [tract.households for tract in tracts]
sum(population_values) / sum(household_values)
1.8604651162790697
In this particular case, a list comprehension is not so great because we had to iterate over the tracts twice. You could create a nested comprehension that loops over the tracts for each value in a tract, which would gain some efficiency if you had a lot of tracts.
population_values, household_values = [[tract[i] for tract in tracts] for i in range(len(tract1))]
sum(population_values) / sum(household_values)
1.8604651162790697
But this is an abomination. While comprehensions are generally awesome, resist the temptation to make complicated comprehensions. In such cases, it would be better to use an explicit loop instead.
Even better is to know that recombining a group of pairs into a pair of groups (or vice versa) is also a common pattern. Once again, Python has a built-in function, `zip`, to do that for you. By using `zip`, you can save yourself a little bit of typing and a significant amount of thinking about the correct implementation. Using `zip` also makes your code easier to explain to other people familiar with Python, because they don't have to reason through your implementation to make sure it's been done correctly.
population_values, household_values = zip(tract1, tract2, tract3)
sum(population_values) / sum(household_values)
1.8604651162790697
Big Takeaway: Python built-ins can make your life easier, without even having to import additional libraries. `sum`, list comprehensions, and `zip` are among the more useful built-in capabilities of Python you should be using more. The official documentation has the complete list, but some other useful built-in functions include (a few are sketched after this list):

- `abs` for returning the absolute value of a number.
- `all` and `any` for testing the truth of a collection of values.
- `dir` for listing the attributes of an object.
- `enumerate` for getting both the index and a value from a collection. Useful for complex loops.
- `help` for getting information about a Python object.
- `isinstance` and `type` for getting information about an object's type.
- `len` for getting the length of a collection, such as a string or list.
- `max` and `min` for getting the maximum or minimum value from a group of values.
- `open` for opening files.
- `range` for creating a collection of values in a given range.
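For example, here is a minimal sketch applying `enumerate`, `any`, and `max` to the tract records defined above:

tracts = [tract1, tract2, tract3]

# enumerate yields an index alongside each value
for i, tract in enumerate(tracts):
    print(i, tract.population)

# any and all test the truth of a collection of values
print(any(tract.households == 0 for tract in tracts))

# max and min work on any iterable of comparable values
print(max(tract.population for tract in tracts))

0 1000
1 2000
2 5000
False
5000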
Context managers¶
An analyst needs to write some values to a csv file. They open the file, write the content, then close the file.
rows = ("x,y,z", "2,4,5")
f = open('data.csv', 'w')
f.write("\n".join(rows))
f.close()
That works, but it has two problems. The first is that you have to remember to write `f.close()` at the end, or else the file will stay open. That increases the risk of file corruption. That's not too hard in this example, but the more code between the `open` function and the `close` method, the more likely it is that you will forget.
The second problem is that even if you remember to write the teardown code to close the file, it won't run if the code throws an exception before it gets there. That also increases the risk of file corruption.
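One way to guarantee the teardown code runs even when an exception occurs is a `try`/`finally` block, sketched minimally below. The context manager introduced next is a cleaner replacement for this pattern.

f = open("data.csv", "w")
try:
    f.write("\n".join(rows))
finally:
    f.close()  # runs even if the write raised an exception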
Because it is common to need some setup and/or teardown code when working with certain objects, Python has context managers that allow you to apply that code automatically. For example, `TextIOWrapper` objects created by opening text files have teardown code to close the file, so you don't have to invoke the `close` method yourself. Instead, you can use a `with` block to activate the context manager.
with open("data.csv", "w") as f:
    f.write("\n".join(rows))
The `as f` part of the code creates a name (`f`) that points to an object. This is equivalent to
f = open("data.csv", "w")
Before executing the code inside the `with` block, the context manager executes the setup code defined for that object. As soon as the code inside the `with` block finishes (even if it finished because of an exception), the context manager executes the teardown code defined for the object.
Not every object can be used with a context manager. Whoever wrote the code for defining that object had to add special capabilities to the object to enable a context manager. Those capabilities are the special methods `__enter__` (setup) and `__exit__` (teardown).
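As an illustration, here is a minimal sketch of a hypothetical Timer class that supports the context manager protocol:

import time

class Timer:
    def __enter__(self):
        # setup code runs before the body of the with block
        self.start = time.perf_counter()
        return self  # this is the object bound by "as"

    def __exit__(self, exc_type, exc_value, traceback):
        # teardown code runs even if the body raised an exception
        print(f"elapsed: {time.perf_counter() - self.start:.4f} seconds")
        return False  # False means exceptions are not suppressed

with Timer():
    total = sum(range(1_000_000))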
Big Takeaway: Context managers let you avoid boilerplate setup/teardown code. Opening files is probably the most common use case for context managers, but pay attention to how they are used in other libraries you work with as well.
Exercises¶
The exercises below invite you to practice applying the different strategies outlined above. They follow the order of the concepts presented, but you can attempt them in any order. Start with the ones that seem most applicable to the work you need to do.
You can find example answers in the ExerciseAnswers.ipynb notebook.
1) Use unpacking for pretty printing¶
The code below uses a loop to print each value in a collection on a separate line.
counties = ["Anoka", "Dakota", "Carver", "Hennepin", "Ramsey", "Scott", "Washington"]
for county in counties:
    print(county)
Anoka
Dakota
Carver
Hennepin
Ramsey
Scott
Washington
Write a different implementation that uses unpacking to print each value on a separate line using a single call to the `print` function instead of a loop.
Hint: The `print` function's first parameter is `*objects`, which accepts any number of positional arguments (similar to `*args` in other functions). These arguments are what will be printed. The second parameter is `sep`, which defines the character to put in between the values to print. The default value of `sep` is a single space (`' '`), but it could be a newline character (`'\n'`).
2) Use try/except¶
The code below defines two records using named tuples.
from typing import NamedTuple

class Record(NamedTuple):
    total_population: int
    population_in_poverty: int

record1 = Record(5000, 200)
record2 = Record(200, 0)
The code below calculates the ratio of each value in the first record to the corresponding value in the second record. It uses conditional logic to catch potential errors.
from math import inf

for field in Record._fields:
    if getattr(record2, field) != 0:
        ratio = getattr(record1, field) / getattr(record2, field)
    else:
        ratio = inf
    print(ratio)
25.0
inf
Write a different implementation that uses `try` and `except` instead.
Hint: You may find it useful to first write the code without any error handling to see what type of error occurs.
3) Use standard library data classes¶
The code below uses a dictionary to define a record, then changes one of the values in that record.
record = {
    "total_population": 5000,
    "population_in_poverty": 200
}

record["total_population"] = 6000
print(record)
{'total_population': 6000, 'population_in_poverty': 200}
This pattern cannot be implemented using a named tuple, because named tuples are immutable. A data class is a standard library class that is similar to a named tuple, but its fields can be edited. Write a different implementation of the code above that uses a data class instead of a dictionary.
Hint: The official Python documentation may be hard to understand. You may want to search for a tutorial on data classes specifically.
4) Use the built-in min and max functions¶
The code below creates a list of 20 random numbers between -1000 and 1000.
from random import randint
nums = [randint(-1000, 1000) for i in range(20)]
The code below finds the maximum and minimum values of `nums` using conditional logic and explicit comparisons to running values.
min_num = 1000
max_num = -1000
for num in nums:
    if num > max_num:
        max_num = num
    if num < min_num:
        min_num = num
print(max_num, min_num)
992 -952

Write a different implementation that uses the built-in `min` and `max` functions instead.
5) Open a file with a context manager¶
The code below opens a file and writes to it.
f = open("exercise.txt", "w")
f.write("This is example text for an exercise.")
f.close()
Rewrite it to use a context manager instead.