Don't try so hard¶
When presented with a problem, our first instinct might be to write code that reflects how we would solve the problem manually. That's good because we can take advantage of our existing knowledge. But it's also bad because a manual-style solution is likely to be harder to implement than one that takes full advantage of Python's capabilities.
You are probably working harder than you need to when you implement solutions that match your manual process. Python has already solved many types of problems for you. Learning the Pythonic way to address a problem will be easier than coming up with your own. It will also be easier for other people to understand your code, because you are using the well-known idioms of Python instead of your own idiosyncratic implementation.
Unpack values¶
An analyst has a tuple of three values that represent the x, y, and z coordinates of a location. The analyst has a distance function that takes three arguments, one for each coordinate.
coordinates = (2, 5, 4)
def distance_from_origin(x, y, z):
    return (x**2 + y**2 + z**2) ** 0.5
The analyst needs to pass the values from the tuple to the function. One way to do that is to use the index of each value with bracket notation.
x = coordinates[0]
y = coordinates[1]
z = coordinates[2]
distance_from_origin(x, y, z)
6.708203932499369
That works, but it has two problems:

- Repetition of `coordinates` is error-prone and tough to refactor.
- Overuse of brackets makes code harder to read.
You can use unpacking to fix both problems.
x, y, z = coordinates
distance_from_origin(x, y, z)
6.708203932499369
Variable unpacking takes a collection of values on the right-hand side of `=` and assigns each value, in order, to an equal number of names on the left-hand side. Importantly, the number of names on the left must match the number of values in the collection on the right.
x, y, z, m = coordinates
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[6], line 1
----> 1 x, y, z, m = coordinates

ValueError: not enough values to unpack (expected 4, got 3)
x, y = coordinates
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[5], line 1
----> 1 x, y = coordinates

ValueError: too many values to unpack (expected 2)
Unpacking to names is useful, but you can go a step further when the values you need to unpack will be passed to a function or class constructor.
distance_from_origin(*coordinates)
6.708203932499369
The `*` in front of the variable name unpacks the values so that each value, in order, is assigned to the parameters of the function.
One disadvantage of unpacking a collection into arguments this way is that it relies on parameter order. That means it only works when you can use positional arguments and doesn't work when you need to specify keyword arguments.
But if the values to unpack are in a dictionary where each key matches a parameter name, you can unpack them as keyword arguments with `**`. Then the order of the values no longer matters.
coordinates_dict = {
    "z": 4,
    "y": 5,
    "x": 2
}
distance_from_origin(**coordinates_dict)
6.708203932499369
If you have ever seen a function definition with `*args` and `**kwargs`, that's related to unpacking. The `*args` parameter means the function can be called with any number of positional arguments. The `**kwargs` parameter means the function can be called with any number of keyword arguments.
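For instance, here is a minimal sketch (the show_args function is made up for illustration):

def show_args(*args, **kwargs):
    # args collects the positional arguments into a tuple
    # kwargs collects the keyword arguments into a dictionary
    print("positional:", args)
    print("keyword:", kwargs)

show_args(2, 5, z=4)

positional: (2, 5)
keyword: {'z': 4}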
Big Takeaway: Unpacking reduces the amount of code you have to write and makes your code easier to read. Take advantage of it wherever you can.
Beg forgiveness. Don't ask permission¶
An analyst has tract-level census data where each tract has three values:
- Total Area in km²
- Water Area in km²
- Population
tract1 = {
    "area": 100,
    "area_water": 20,
    "population": 1000
}
The analyst writes a function to calculate the population density in people per km² of land area.
def pop_density(tract):
    area_land = tract["area"] - tract["area_water"]
    return tract["population"] / area_land
pop_density(tract1)
12.5
But some records don't have a value for the water area because they are all land. This is the equivalent of not having an `"area_water"` column. Passing those records to the function causes a `KeyError` exception.
tract2 = {
    "area": 100,
    "population": 1000
}
pop_density(tract2)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[53], line 6
      1 tract2 = {
      2     "area": 100,
      3     "population": 1000
      4 }
----> 6 pop_density(tract2)

Cell In[52], line 2, in pop_density(tract)
      1 def pop_density(tract):
----> 2     area_land = tract["area"] - tract["area_water"]
      3     return tract["population"] / area_land

KeyError: 'area_water'
One way to deal with potentially bad values is to check ahead of time with conditional logic (`if`/`elif`/`else`).
def pop_density2(tract):
    if "area_water" not in tract.keys():
        area_land = tract["area"]
    else:
        area_land = tract["area"] - tract["area_water"]
    return tract["population"] / area_land
pop_density2(tract2)
10.0
But using conditional logic like this is not great. You need to put in the checks before you get to your core logic, which hurts both performance and readability. You will also run into edge cases you didn't anticipate that cause your code to fail or return the wrong answer. For example, some records without any water have the `"area_water"` value explicitly set to `None`. This is equivalent to having null values in an `"area_water"` column. Passing a record like that to the function causes a `TypeError`.
tract3 = {
    "area": 100,
    "area_water": None,
    "population": 1000
}
pop_density2(tract3)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[55], line 7
      1 tract3 = {
      2     "area": 100,
      3     "area_water": None,
      4     "population": 1000
      5 }
----> 7 pop_density2(tract3)

Cell In[54], line 5, in pop_density2(tract)
      3     area_land = tract["area"]
      4 else:
----> 5     area_land = tract["area"] - tract["area_water"]
      6 return tract["population"] / area_land

TypeError: unsupported operand type(s) for -: 'int' and 'NoneType'
Your code gets more complicated as you deal with those edge cases.
def pop_density3(tract):
    if "area_water" not in tract.keys() or tract["area_water"] is None:
        area_land = tract["area"]
    else:
        area_land = tract["area"] - tract["area_water"]
    return tract["population"] / area_land
pop_density3(tract3)
10.0
No matter how many edge cases you anticipate, there will probably be another one you didn't. For example, passing a record that is all water causes a `ZeroDivisionError`.
tract4 = {
    "area": 100,
    "area_water": 100,
    "population": 0
}
pop_density3(tract4)
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
Cell In[57], line 7
      1 tract4 = {
      2     "area": 100,
      3     "area_water": 100,
      4     "population": 0
      5 }
----> 7 pop_density3(tract4)

Cell In[56], line 6, in pop_density3(tract)
      4 else:
      5     area_land = tract["area"] - tract["area_water"]
----> 6 return tract["population"] / area_land

ZeroDivisionError: division by zero
Instead of writing an exploding mess of spaghetti code to deal with a never-ending parade of edge cases, it is better to use `try` and `except`. Python will attempt to run the code in the `try` block. If that code throws an exception, Python will run the code in the `except` block that matches the type of exception.
def pop_density4(tract):
    try:
        area_land = tract["area"] - tract["area_water"]
        return tract["population"] / area_land
    except (KeyError, TypeError):
        return tract["population"] / tract["area"]
    except ZeroDivisionError:
        return 0

for tract in [tract1, tract2, tract3, tract4]:
    print(pop_density4(tract))
12.5
10.0
10.0
0
This pattern puts your core logic at the top and deals with edge cases afterward, making your code more performant and readable. It also gives you the option to handle different types of errors differently. The first `except` block handles both the `KeyError` and `TypeError` problems by using the total area to calculate population density. The second `except` block handles the `ZeroDivisionError` by appropriately returning `0`.
Big Takeaway: You can just try things. It's usually easier, faster, and more readable to put the common case in a `try` block, and handle exceptions for edge cases where the common case doesn't work.
Use more of the standard library¶
An analyst has tract-level census data records. Each tract has two values: population and households. The analyst could model a single tract as a dictionary.
tract1 = {
    "population": 1000,
    "households": 500
}
That looks appropriate because it clearly links each value to a key that explains what the value means. But a dictionary is usually not a good data structure for a single record from a table. For one thing, there is a substantial amount of repetition if you need to model many records.
tract2 = {
    "population": 2000,
    "households": 800
}

tract3 = {
    "population": 5000,
    "households": 3000
}
Another problem is that dictionaries are mutable, which means the keys can change and cause the dictionary to no longer fit the same data schema.
del tract2["households"]
tract2
{'population': 2000}
Dictionaries are optimized for fast access of a value by key. That is not usually an important goal for an individual record. Using a dictionary to model records is unnecessarily hard. A better data structure for a record is a tuple.
tract1 = (1000, 500)
tract2 = (2000, 800)
tract3 = (5000, 3000)
A glaring omission from a tuple, however, is any context for what each value represents. An even better data structure for a record is a named tuple, which you can import from the standard library.
To use a named tuple, create a class that inherits from `NamedTuple`. For this kind of class, you only need to specify the field names and the datatype the values in each field should have. You can then create instances of that named tuple by passing the appropriate values to the constructor.
from typing import NamedTuple

class Tract(NamedTuple):
    population: int
    households: int

tract1 = Tract(1000, 500)
tract2 = Tract(2000, 800)
tract3 = Tract(5000, 3000)
You can access the value in a field using dot notation.
tract1.households
500
Big Takeaway: The standard library has classes and functions that make your life easier without having to install additional packages. You should use them more. Named tuples are just one example. The official documentation lists them all, but some highlights include (two of them are sketched after this list):

- csv for working with csv files
- dataclasses for creating data classes (like `NamedTuple`, but editable)
- datetime for working with dates and times
- itertools for efficient looping
- math for mathematical functions
- pprint for nicely printing complex data structures
- pathlib and os.path for working with file paths
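As a quick taste, here is a minimal sketch using two of the modules listed above (the example data is arbitrary):

from math import sqrt
from pprint import pprint

# pprint lays nested data structures out across multiple lines
pprint({"tract1": {"population": 1000, "households": 500},
        "tract2": {"population": 2000, "households": 800}})

# sqrt is one of many mathematical functions in the math module
print(sqrt(16))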
Use more built-ins¶
The analyst wants to know the average number of people per household across all tracts. That is not the same as averaging the number of people per household per tract. The analyst needs to divide the total population across tracts by the total number of households across tracts.
One way to get the right answer is to loop over each tract, keeping running totals of the population and household values, then calculate the ratio.
population = 0
households = 0

tracts = [tract1, tract2, tract3]
for tract in tracts:
    population += tract.population
    households += tract.households

population / households
1.8604651162790697
That gives the correct answer, but keeping running totals obscures the goal, which is to create a sum of the total population and households across tracts. Summing values is a common pattern, and for many common patterns, Python has some built-in capability to make them easier to accomplish. Built-ins differ from functionality in the standard library in that you don't have to import anything to get access to them. Code that is considered Pythonic makes good use of these built-in capabilities. In this case, there is the `sum` function. It has the advantage of making it more explicit to the reader that the code is summing the values in a collection.
population_values = []
household_values = []

tracts = [tract1, tract2, tract3]
for tract in tracts:
    population_values.append(tract.population)
    household_values.append(tract.households)

sum(population_values) / sum(household_values)
1.8604651162790697
Appending values to a list in a loop is also a very common pattern. For this case, Python has list comprehensions. List comprehensions are shorter and more readable (once you get used to them) than explicit loops. They also execute faster than the equivalent loop.
tracts = [tract1, tract2, tract3]
population_values = [tract.population for tract in tracts]
household_values = [tract.households for tract in tracts]
sum(population_values) / sum(household_values)
1.8604651162790697
In this particular case, a list comprehension is not so great because we had to iterate over the tracts twice. You could create a nested comprehension that loops over the tracts for each value in a tract, which would gain some efficiency if you had a lot of tracts.
population_values, household_values = [[tract[i] for tract in tracts] for i in range(len(tract1))]
sum(population_values) / sum(household_values)
1.8604651162790697
But this is an abomination. While comprehensions are generally awesome, resist the temptation to make complicated comprehensions. In such cases, it would be better to use an explicit loop instead.
Even better is to know that recombining a group of pairs into a pair of groups (or vice versa) is also a common pattern. Once again, Python has a built-in function, `zip`, to do that for you. By using `zip`, you can save yourself a little bit of typing and a significant amount of thinking about the correct implementation. Using `zip` also makes your code easier to explain to other people familiar with Python, because they don't have to reason through your implementation to make sure it's been done correctly.
population_values, household_values = zip(tract1, tract2, tract3)
sum(population_values) / sum(household_values)
1.8604651162790697
Big Takeaway: Python built-ins can make your life easier, without even having to import additional libraries. `sum`, list comprehensions, and `zip` are among the more useful built-in capabilities of Python you should be using more. The official documentation has the complete list, but some other useful built-in functions include (a few are sketched after this list):

- `abs` for returning the absolute value of a number.
- `all` and `any` for testing the truth of a collection of values.
- `dir` for listing the attributes of an object.
- `enumerate` for getting both the index and a value from a collection. Useful for complex loops.
- `help` for getting information about a Python object.
- `isinstance` and `type` for getting information about an object's type.
- `len` for getting the length of a collection, such as a string or list.
- `max` and `min` for getting the maximum or minimum value from a group of values.
- `open` for opening files.
- `range` for creating a collection of values in a given range.
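For example, here is a minimal sketch applying `enumerate`, `any`, and `max` to the tract records defined above:

tracts = [tract1, tract2, tract3]

# enumerate yields an index alongside each value
for i, tract in enumerate(tracts):
    print(i, tract.population)

# any and all test the truth of a collection of values
print(any(tract.households == 0 for tract in tracts))

# max and min work on any iterable of comparable values
print(max(tract.population for tract in tracts))

0 1000
1 2000
2 5000
False
5000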
Context managers¶
An analyst needs to write some values to a csv file. They open the file, write the content, then close the file.
rows = ("x,y,z", "2,4,5")
f = open('data.csv', 'w')
f.write("\n".join(rows))
f.close()
That works, but it has two problems. The first is that you have to remember to write `f.close()` at the end, or else the file will stay open. That increases the risk of file corruption. That's not too hard in this example, but the more code between the `open` function and the `close` method, the more likely it is that you will forget.
The second problem is that even if you remember to write the teardown code to close the file, it won't run if the code throws an exception before it gets there. That also increases the risk of file corruption.
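One way to guarantee the teardown code runs even when an exception occurs is a `try`/`finally` block, sketched minimally below. The context manager introduced next is a cleaner replacement for this pattern.

f = open("data.csv", "w")
try:
    f.write("\n".join(rows))
finally:
    f.close()  # runs even if the write raised an exception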
Because it is common to need some setup and/or teardown code when working with certain objects, Python has context managers that allow you to apply that code automatically. For example, `TextIOWrapper` objects created by opening text files have teardown code to close the file, so you don't have to invoke the `close` method yourself. Instead, you can use a `with` block to activate the context manager.
with open("data.csv", "w") as f:
    f.write("\n".join(rows))
The `as f` part of the code creates a name (`f`) that points to an object. This is equivalent to
f = open("data.csv", "w")
Before executing the code inside the `with` block, the context manager executes the setup code defined for that object. As soon as the code inside the `with` block finishes (even if it finished because of an exception), the context manager executes the teardown code defined for the object.
Not every object can be used with a context manager. Whoever wrote the code for defining that object had to add special capabilities to the object to enable a context manager. Those capabilities are the special methods `__enter__` (setup) and `__exit__` (teardown).
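As an illustration, here is a minimal sketch of a hypothetical Timer class that supports the context manager protocol:

import time

class Timer:
    def __enter__(self):
        # setup code runs before the body of the with block
        self.start = time.perf_counter()
        return self  # this is the object bound by "as"

    def __exit__(self, exc_type, exc_value, traceback):
        # teardown code runs even if the body raised an exception
        print(f"elapsed: {time.perf_counter() - self.start:.4f} seconds")
        return False  # False means exceptions are not suppressed

with Timer():
    total = sum(range(1_000_000))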
Big Takeaway: Context managers let you avoid boilerplate setup/teardown code. Opening files is probably the most common use case for context managers, but pay attention to how they are used in other libraries you work with as well.
Exercises¶
The exercises below invite you to practice applying the different strategies outlined above. They follow the order of the concepts presented, but you can attempt them in any order. Start with the ones that seem most applicable to the work you need to do.
You can find example answers in the ExerciseAnswers.ipynb notebook.
1) Use unpacking for pretty printing¶
The code below uses a loop to print each value in a collection on a separate line.
counties = ["Anoka", "Dakota", "Carver", "Hennepin", "Ramsey", "Scott", "Washington"]
for county in counties:
    print(county)
Anoka
Dakota
Carver
Hennepin
Ramsey
Scott
Washington
Write a different implementation that uses unpacking to print each value on a separate line using a single call to the `print` function instead of a loop.
Hint: The `print` function's first parameter is `*objects`, which accepts any number of positional arguments (similar to `*args` in other functions). These arguments are what will be printed. The second parameter is `sep`, which defines the character to put in between the values to print. The default value of `sep` is a single space (`' '`), but it could be a newline character (`'\n'`).
2) Use try/except¶
The code below defines two records using named tuples.
from typing import NamedTuple

class Record(NamedTuple):
    total_population: int
    population_in_poverty: int

record1 = Record(5000, 200)
record2 = Record(200, 0)
The code below calculates the ratio of each value in the first record to the corresponding value in the second record. It uses conditional logic to catch potential errors.
from math import inf

for field in Record._fields:
    if getattr(record2, field) != 0:
        ratio = getattr(record1, field) / getattr(record2, field)
    else:
        ratio = inf
    print(ratio)
25.0
inf
Write a different implementation that uses `try` and `except` instead.
Hint: You may find it useful to first write the code without any error handling to see what type of error occurs.
3) Use standard library data classes¶
The code below uses a dictionary to define a record, then changes one of the values in that record.
record = {
    "total_population": 5000,
    "population_in_poverty": 200
}

record["total_population"] = 6000
print(record)
{'total_population': 6000, 'population_in_poverty': 200}
This pattern cannot be implemented using a named tuple, because named tuples are immutable. A data class is a standard library class that is similar to a named tuple, but its fields can be edited. Write a different implementation of the code above that uses a data class instead of a dictionary.
Hint: The official Python documentation may be hard to understand. You may want to search for a tutorial on data classes specifically.
4) Use the built-in min and max functions¶
The code below creates a list of 20 random numbers between -1000 and 1000.
from random import randint
nums = [randint(-1000, 1000) for i in range(20)]
The code below finds the maximum and minimum values of `nums` using conditional logic and explicit comparisons to running values.
min_num = 1000
max_num = -1000
for num in nums:
    if num > max_num:
        max_num = num
    if num < min_num:
        min_num = num
print(max_num, min_num)
992 -952

Write a different implementation that uses the built-in `min` and `max` functions instead.
5) Open a file with a context manager¶
The code below opens a file and writes to it.
f = open("exercise.txt", "w")
f.write("This is example text for an exercise.")
f.close()
Rewrite it to use a context manager instead.