Python Tips and Tricks
Introduction
If you're anything like us, most of what you know about Python you learned by trial and error. Most geospatial professionals don't have much formal training in computer science or software engineering. And that's fine, because we're mostly not computer scientists or software engineers. We don't have time to train for a whole new career just to write some automation scripts. But the downside of a purely practical education is that it can be easy to settle for suboptimal solutions because we're not aware of better practices that can improve the performance and readability of our code.
This session is designed to highlight some common code patterns that work OK, but for which a better approach exists.
- Instead of bracket notation, use more unpacking
- Instead of conditionals for validation, use try/except
- Instead of rolling your own solutions, use existing Python capabilities
- Instead of writing setup and teardown code, use context managers
- Instead of only writing documentation in separate files, use docstrings and type hints
- Instead of always modeling data collections as lists, use more tuples and sets
- Instead of list comprehensions, use more generators
- Instead of guessing about inefficiencies, profile your code
This workshop is focused on general patterns that you can use no matter what type of problem you are working on. Because these are general patterns, don't expect to be able to lift the code examples here and use them directly in your code. Do expect to take these strategies and apply them to your code.
The code examples and exercises are written using Jupyter Notebooks. If you have a Google account, you can click the Open in Colab button at the top to run the notebooks using Google's Colab environment. Otherwise, you can click the download button for each notebook to download it to your local machine and run in a notebook environment (e.g. loading the notebook into ArcGIS Pro).
Use Python idioms¶
When presented with a problem, our first instinct might be to write code that reflects how we would solve the problem manually. That's good because we can take advantage of our existing knowledge. But it's also bad because it's likely to be harder to implement than a solution that takes full advantage of Python's capabilities.
You are probably working harder than you need to when you implement solutions that match your manual process. For many types of problems, Python has solved them already. Learning the Pythonic way to address a problem will be easier than coming up with your own. It will also be easier for other people to understand your code, because you are using the well-known idioms of Python instead of your own idiosyncratic implementation.
Unpack values¶
An analyst has a tuple of three values that represent the x, y, and z coordinates of a location. The analyst has a distance function that takes three arguments, one for each coordinate.
coordinates = (2, 5, 4)
def distance_from_origin(x, y, z):
return (x**2 + y**2 + z**2) ** 0.5
The analyst needs to pass the values from the tuple to the function. One way to do that is to use the index of each value with bracket notation.
x = coordinates[0]
y = coordinates[1]
z = coordinates[2]
distance_from_origin(x, y, z)
That works, but it has two problems:
- Repetition of coordinates is error prone and tough to refactor.
- Overuse of brackets makes code harder to read.
You can use unpacking to fix both problems.
x, y, z = coordinates
distance_from_origin(x, y, z)
Variable unpacking takes a collection of values on the right-hand side of = and assigns each value in order to an equal number of names on the left-hand side. Importantly, the number of names on the left must match the number of values in the collection on the right.
x, y, z, m = coordinates  # ValueError: not enough values to unpack (expected 4, got 3)
x, y = coordinates  # ValueError: too many values to unpack (expected 2)
Unpacking to names is useful, but you can go a step further when the values you need to unpack will be passed to a function or class constructor.
distance_from_origin(*coordinates)
The * in front of the variable name unpacks the values so that each value in order is assigned to the parameters of the function.
One disadvantage of unpacking a collection into arguments this way is that it relies on parameter order. That means it only works when you can use positional arguments and doesn't work when you need to specify keyword arguments.
But if the values to unpack are in a dictionary where each key matches a parameter name, you can unpack them as keyword arguments with **. Then the order of values no longer matters.
coordinates_dict = {
"z": 4,
"y": 5,
"x": 2
}
distance_from_origin(**coordinates_dict)
Big Takeaway: Unpacking reduces the amount of code you have to write and makes your code easier to read. Take advantage of it wherever you can.
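One related capability worth knowing (an aside with invented data, not from the examples above): extended unpacking uses a starred name to collect leftover values into a list, which helps when a collection has a known first and last value but a variable middle.
readings = (12.5, 13.1, 14.8, 15.2, 9.9)
first, *middle, last = readings
print(first)   # 12.5
print(middle)  # [13.1, 14.8, 15.2]
print(last)    # 9.9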
Use comprehensions judiciously¶
An analyst has a collection of population densities in people per km2. They need to transform those values into people per mi2.
One way to do that is to loop over all the values, apply a transformation function, and append the transformed value to a new list.
people_per_km2 = (5, 40, 200, 17, 8000)
people_per_mi2 = []
for density in people_per_km2:
mi2_density = density * 2.59
people_per_mi2.append(mi2_density)
people_per_mi2
That code is correct, but it is more verbose than necessary, which can hurt readability.
When you see a pattern where you loop over something, do something to the values, then append new values to an empty list, you should consider replacing it with a list comprehension.
people_per_mi2 = [density * 2.59 for density in people_per_km2]
people_per_mi2
List comprehensions are more readable, but only for people who are familiar with them, so be aware of your audience when using them. They are also a little bit faster than using a for loop.
But comprehensions are bad when the transformation is complex. Imagine we have a list of quantitative values that we want to transform to qualitative values.
List comprehensions let you do that in a single line. But the result is an abomination.
quantitative = [100, 50, 317, 21]
qualitative = ["S" if val < 100 else "M" if val <= 200 else "L" for val in quantitative]
For more complex operations, it is much better to use an explicit loop.
qualitative = []
for val in quantitative:
if val < 100:
qualitative.append("S")
elif val <= 200:
qualitative.append("M")
else:
qualitative.append("L")
qualitative
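A middle path, if the branching logic is worth reusing, is to move it into a small named function and keep the comprehension simple. This is a sketch; size_class is a name invented for illustration.
def size_class(val):
    """Map a quantitative value to a size category."""
    if val < 100:
        return "S"
    if val <= 200:
        return "M"
    return "L"

qualitative = [size_class(val) for val in quantitative]
qualitative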
Use more of the standard library¶
An analyst has tract-level census data records. Each tract has two values: population and households. The analyst could model a single tract as a dictionary.
tract1 = {
"population": 1000,
"households": 500
}
That looks appropriate because it clearly links each value to a key that explains what the value means. But a dictionary is usually not a good data structure for a single record from a table. For one thing, there is a substantial amount of repetition if you need to model many records.
tract2 = {
"population": 2000,
"households": 800
}
tract3 = {
"population": 5000,
"households": 3000
}
Another problem is that dictionaries are mutable, which means the keys can change and cause the dictionary to no longer fit the same data schema.
del tract2["households"]
tract2
Dictionaries are optimized for fast access of a value by key. This is not usually an important goal for an individual record. Using a dictionary to model records is unnecessarily hard. A better data structure for a record is a tuple.
tract1 = (1000, 500)
tract2 = (2000, 800)
tract3 = (5000, 3000)
The problem with tuples, however, is the lack of context for what each value represents. An even better data structure for a record is a named tuple, which you can import from the standard library.
To use a named tuple, create a class that inherits from NamedTuple. For this kind of class, you only need to specify the field names and the datatype the values in each field should have. You can then create instances of that named tuple by passing the appropriate values to the constructor.
from typing import NamedTuple
class Tract(NamedTuple):
population: int
households: int
tract1 = Tract(1000, 500)
tract2 = Tract(2000, 800)
tract3 = Tract(5000, 3000)
You can access the value in a field using dot notation.
tract1.households
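Because a named tuple is still a tuple, the unpacking patterns from earlier keep working. A quick sketch:
population, households = tract1   # unpacking still works
print(tract1[0])                  # index access still works
print(tract1._asdict())           # {'population': 1000, 'households': 500}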
Big takeaway: The standard library has classes and functions that make your life easier without having to install additional packages. You should use them more. Named tuples are just one example. The official documentation lists them all, but some highlights include:
- csv for working with csv files
- dataclasses for creating dataclasses (like NamedTuple, but editable)
- datetime for working with dates and times
- itertools for efficient looping
- math for mathematical functions
- pprint for nicely printing complex data structures
- pathlib and os.path for working with file paths
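To give a flavor of two of these, here is a minimal sketch using pathlib and pprint (the file name and data are invented for illustration):
from pathlib import Path
from pprint import pprint

# pathlib builds paths with the / operator instead of string concatenation
csv_path = Path("data") / "tracts.csv"
print(csv_path.name)  # tracts.csv

# pprint wraps and aligns long or nested structures automatically
pprint({"tract1": {"population": 1000}, "tract2": {"population": 2000}})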
Use more built-ins¶
The analyst wants to know the average number of people per household across all tracts. That is not the same as averaging the number of people per household per tract. The analyst needs to divide the total population across tracts by the total number of households across tracts.
One way to get the right answer is to loop over each tract, keeping running totals of the population and household values. Then calculate the ratio.
population = 0
households = 0
tracts = [tract1, tract2, tract3]
for tract in tracts:
population += tract.population
households += tract.households
population / households
That gives the correct answer, but keeping running totals obscures the goal, which is to create the sums of the total population and households across tracts.
Summing values is a common pattern, and for many common patterns, Python has some built-in capability to make it easier to accomplish. Built-ins differ from the standard library in that you don't have to import anything to get access to built-ins.
Code that is considered Pythonic makes good use of these built-in capabilities. In this case, there is the sum function. This has the advantage of making it more explicit to the reader that the code is summing values in a collection.
population_values = []
household_values = []
tracts = [tract1, tract2, tract3]
for tract in tracts:
population_values.append(tract.population)
household_values.append(tract.households)
sum(population_values) / sum(household_values)
While you could use a list comprehension, there's actually an even better way.
Imagine each tract is a row in a table that has population and household fields. We want to get the sum of each column. But we don't actually have columns, we only have the rows.
This turns out to be a very common type of problem where we have a group of pairs (or triples, etc.), and we want a pair (or triple, etc.) of groups. For these problems, use the built-in zip function.
By using zip, you can save yourself a little bit of typing and a significant amount of thinking about the correct implementation. Using zip also makes your code easier to explain to other people familiar with Python because they don't have to reason through your implementation to make sure it's been done correctly.
population_values, household_values = zip(tract1, tract2, tract3)
sum(population_values) / sum(household_values)
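One caveat worth knowing (our aside, not covered above): plain zip silently stops at the shortest input. On Python 3.10 and later, passing strict=True makes zip raise a ValueError when the inputs have different lengths, which can catch data problems early.
population_values, household_values = zip(tract1, tract2, tract3, strict=True)
sum(population_values) / sum(household_values)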
Big Takeaway: Python built-ins can make your life easier, without even having to import additional libraries. sum, list comprehensions, and zip are among the more useful built-in capabilities of Python you should be using more. The official documentation has the complete list, but some other useful built-in functions include:
- abs for returning the absolute value of a number.
- all and any for testing the truth of a collection of values.
- dir for listing the attributes of an object.
- enumerate for getting both the index and a value from a collection. Useful for complex loops.
- help for getting information about a Python object.
- isinstance and type for getting information about an object's type.
- len for getting the length of a collection, such as a string or list.
- max and min for getting the maximum or minimum value from a group of values.
- open for opening files.
- range for creating a collection of values in a given range.
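As a quick illustrative sketch (the data is invented), here are two of these built-ins in action:
names = ["Anoka", "Dakota", "Carver"]
for i, name in enumerate(names, start=1):
    print(i, name)  # 1 Anoka / 2 Dakota / 3 Carver

print(any(name.startswith("D") for name in names))  # True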
Beg forgiveness. Don't ask permission¶
The analyst writes a function to calculate the population density in people per km2 of land area.
tract1 = {
"land_area": 20,
"population": 1000
}
def pop_density(tract):
return tract["population"] / tract["land_area"]
pop_density(tract1)
There are a few ways this could go wrong. What if the record is missing a population key because no people live there?
tract2 = {
"land_area": 10
}
pop_density(tract2)
What if it's missing a land_area key because it's all water?
tract3 = {
"population": 0
}
pop_density(tract3)
What if it has a land area value of 0 because it's all water?
tract4 = {
"land_area": 0,
"population": 0
}
pop_density(tract4)
One way to deal with potential bad values is to check for them ahead of time with conditional logic.
def pop_density2(tract):
if "population" not in tract.keys():
return 0
elif "land_area" not in tract.keys():
return 0
elif tract["land_area"] == 0:
return 0
else:
return tract["population"] / tract["land_area"]
for tract in (tract1, tract2, tract3, tract4):
print(pop_density2(tract))
But using conditional logic like this is not great. You need to put in the checks before you get to your core logic, which hurts both performance and readability.
You will also inevitably run into edge cases that you didn't anticipate. What if a tract with no people has the population key set to None?
tract5 = {
"land_area": 20,
"population": None
}
pop_density2(tract5)
Instead of writing an exploding mess of spaghetti code to deal with a never-ending parade of edge cases, it is better to use try and except. Python will attempt to run the code in the try block. If that code throws an exception, Python will run the code in the except block that matches the type of exception.
def pop_density3(tract):
try:
return tract["population"] / tract["land_area"]
except (KeyError, ZeroDivisionError, TypeError):
return 0
for tract in (tract1, tract2, tract3, tract4, tract5):
print(pop_density3(tract))
This code is still somewhat fragile. For example, it won't correctly handle records that store values as strings instead of numeric types. But it is usually easier to deal with those complexities as they arise by using try/except rather than if statements. If you really do need some complex conditional logic to handle edge cases, banish it to the except block instead of distracting the reader by putting it up front.
Big takeaway: You can just try things. It's usually easier, faster, and more readable to put the common case in a try block, and handle exceptions for edge cases where the common case doesn't work.
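For completeness, try also supports an else block that runs only when the try block raised nothing (and a finally block that always runs). Here is a sketch of the density function using else; pop_density4 is our name for this variation, not part of the examples above.
def pop_density4(tract):
    try:
        density = tract["population"] / tract["land_area"]
    except (KeyError, ZeroDivisionError, TypeError):
        # Edge cases stay out of the reader's way, down here.
        return 0
    else:
        # Runs only if the try block raised nothing.
        return density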
Use context managers¶
If you want to open a file with Python, you had better make sure you close it, or bad things can happen.
f = open("file.txt", "w")
f.write("Here is some text")
f.close()
This has two problems:
- It's easy to forget to close it, especially because most of the time it doesn't actually cause problems when you don't
- If your code crashes after you open the file but before you close it, it doesn't close properly
Instead, you may already know you should do it like this so that the file closes automatically:
with open("file.txt", "w") as f:
f.write("Here is some different text")
This isn't magic, it's a context manager. A context manager provides setup and teardown code. The setup code always runs at the beginning of the with block before anything inside the block. The teardown code always runs when the with block exits, even if it exited because of an error.
Context managers are useful for reducing the amount of repetitive boilerplate code you have to write to make sure things are set up and torn down correctly.
For example, you may want to write some data to a database, but you want to make sure the transaction gets rolled back if there's a problem with some part of the write.
The code below:
- Connects to a sqlite database
- Puts the data-writing code into a try block using an explicit transaction
- Handles errors in the except block by rolling back the transaction (conn.commit is never reached if there is an error before that)
- Closes the connection
import sqlite3
conn = sqlite3.connect("test.db")
try:
cursor = conn.cursor()
cursor.execute("BEGIN TRANSACTION")
cursor.execute("CREATE TABLE IF NOT EXISTS test (id INTEGER PRIMARY KEY, country TEXT)")
cursor.execute("INSERT INTO test (country) VALUES('Argentina')")
conn.commit()
except Exception as e:
print(f"Error {e}: Rolling back transaction")
conn.rollback()
conn.close()
It turns out that sqlite3 has a context manager that creates the connection, commits the transaction on success, and rolls it back if there's an exception.
It's important to know exactly what kind of setup and teardown a particular context manager does. This particular context manager creates the connection, but it does not automatically close it. You still have to remember to close it yourself.
db_path = "test.db"
with sqlite3.connect(db_path) as conn:
cursor = conn.cursor()
cursor.execute("CREATE TABLE IF NOT EXISTS test (id INTEGER PRIMARY KEY, country TEXT)")
cursor.execute("INSERT INTO test (country) VALUES('Argentina')")
conn.commit()
conn.close()
Big takeaway: If you are working with objects that support context managers, you should use those context managers. Pay attention to how the context manager works though, because it may not do everything you expect.
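If no context manager exists for a setup/teardown pair you write often, you can build your own with the standard library's contextlib. Below is a minimal sketch that also closes the connection, which the sqlite3 context manager above does not do. The sqlite_connection name is invented for this sketch.
import sqlite3
from contextlib import contextmanager

@contextmanager
def sqlite_connection(path):
    """Open a sqlite3 connection and guarantee it is closed afterwards."""
    conn = sqlite3.connect(path)
    try:
        yield conn      # setup is done; hand the connection to the with block
    finally:
        conn.close()    # teardown runs even if the block raised

with sqlite_connection("test.db") as conn:
    conn.execute("SELECT 1")
# The connection is closed here, error or no error.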
Exercises¶
The exercises below invite you to practice applying the different strategies outlined above. They follow the order of the concepts presented, but you can attempt them in any order. Start with the ones that seem most applicable to the work you need to do.
You can find example answers in the ExerciseAnswers.ipynb notebook.
1) Use unpacking for pretty printing¶
The code below uses a loop to print each value in a collection on a separate line.
counties = ["Anoka", "Dakota", "Carver", "Hennepin", "Ramsey", "Scott", "Washington"]
for county in counties:
print(county)
Write a different implementation that uses unpacking to print each value on a separate line using a single call to the print function instead of a loop.
Hint: The print function's first parameter is *objects, which accepts any number of positional arguments (similar to *args in other functions). These arguments are what will be printed. The second parameter is sep, which defines the character to put in between the values to print. The default value of sep is a single space (' '), but it could be a newline character ('\n').
2) Use standard library data classes¶
The code below uses a dictionary to define a record, then changes one of the values in that record.
record = {
"total_population": 5000,
"population_in_poverty": 200
}
record["total_population"] = 6000
print(record)
This pattern cannot be implemented using a named tuple, because named tuples are immutable. A data class is a standard library class that is similar to a named tuple, but it is editable. Write a different implementation of the code above that uses data classes instead of dictionaries.
Hint: The official Python documentation may be hard to understand. You may want to search for a tutorial on data classes specifically.
3) Use the built-in min and max functions¶
The code below creates a list of 20 random numbers between -1000 and 1000.
from random import randint
nums = [randint(-1000, 1000) for i in range(20)]
The code below finds the maximum and minimum values of nums using conditional logic and explicit comparisons to running values.
min_num = 1000
max_num = -1000
for num in nums:
if num > max_num:
max_num = num
if num < min_num:
min_num = num
print(max_num, min_num)
Write a different implementation that uses the built-in max and min functions instead of a loop.
4) Just do things¶
The code below defines three records using a named tuple.
from typing import NamedTuple
class Record(NamedTuple):
total_population: int
population_in_poverty: int
record1 = Record(5000, 2000)
record2 = Record(200, 10)
record3 = Record("400", "30")
The code below calculates the poverty rate, first checking that the values in the record are the correct type, and transforming them if not.
def poverty_rate(record):
total_pop, pop_in_poverty = record
if not isinstance(total_pop, int):
total_pop = int(record.total_population)
if not isinstance(pop_in_poverty, int):
pop_in_poverty = int(record.population_in_poverty)
return pop_in_poverty / total_pop
for record in (record1, record2, record3):
print(poverty_rate(record))
Write a different implementation that doesn't use if to check datatypes ahead of time.
Hint: You may find it useful to first write the code without any error handling to see what type of error occurs.
For an even better way to solve this kind of problem, look into Pydantic models. These models are not built-in or in the standard library, so you have to install the Pydantic library to get them. Pydantic models are like named tuples that guarantee the records will have the correct data type.
5) Use Pythonic patterns for setup and teardown boilerplate¶
The code below opens data.csv and writes some information to the file. Then an exception occurs before the file is closed. The code creates data.csv if it didn't exist before, but if you open the file, you will notice that the data has not been written to it. (In a Google Colab notebook, there is a files icon on the left where you can double-click to open a file in the web interface.)
f = open("data.csv", "w")
f.write("Important data")
raise ValueError
f.close()
Rewrite this code so that the data is written to the file even though it raises an exception.
Help other people understand your code¶
Even if you use Pythonic idioms, your code probably won't be perfectly understandable by itself. You want other people to be able to work with the code you write.
Docstrings¶
An analyst has a function that calculates the distance from a given point to the origin in three dimensions.
def distance_from_origin(x, y, z):
return (x**2 + y**2 + z**2) ** 0.5
One option is to say that it is perfectly obvious what this function does from its name and parameters. But your functions are much more obvious to you than they are to other people. "Other people" includes future you. You do not want future you mad at current you for not explaining what your code does.
A better option is to write down explicitly what this function does, what kind of arguments you can pass to it, and what kind of value it will return. For example, you might have a text file, or a web page, or a Word doc. Hopefully not a sticky note on your monitor, but even that's better than nothing. Something like:
Calculates the distance from a given point in three dimensions to the origin (0, 0, 0).
Args:
x (float): The x-axis coordinate.
y (float): The y-axis coordinate.
z (float): The z-axis coordinate.
Returns:
float: The distance.
That works OK, but separating your code from your documentation forces people to look in two places. It also means that the built-in help function is mostly useless for learning about your function.
help(distance_from_origin)
A better way to document your code is to include the information as a docstring. You can use docstrings with the modules, functions, classes, and methods that you create.
def distance_from_origin_docstring(x, y, z):
"""
Calculates the distance from a given point in three dimensions to the origin (0, 0, 0).
Args:
x (float): The x-axis coordinate.
y (float): The y-axis coordinate.
z (float): The z-axis coordinate.
Returns:
float: The distance.
"""
return (x**2 + y**2 + z**2) ** 0.5
By including a docstring, people can use the built-in help function to see the information without having to open the source code file.
help(distance_from_origin_docstring)
Many IDEs will even show the information when you hover over the function name.
Type hints¶
An analyst tries using the distance_from_origin_docstring function, but is getting an error.
coordinates = [2, 5, 4]
distance = distance_from_origin_docstring(*coordinates)
info_string = "The point is " + distance + " meters from the origin"
print(info_string)
The error is reasonably informative, and the analyst can use it to fix their code. But the problem only showed up after the analyst ran the code. It would be nice to get that information beforehand. Type hints are a way to pass information to type checkers and IDEs that can help ensure that you're using the correct types, without having to actually run the code.
def distance_from_origin_typehints(x: float, y: float, z: float) -> float:
"""
Calculates the distance from a given point in three dimensions to the origin (0, 0, 0).
Args:
x (float): The x-axis coordinate.
y (float): The y-axis coordinate.
z (float): The z-axis coordinate.
Returns:
float: The distance.
"""
return (x**2 + y**2 + z**2) ** 0.5
If the analyst had used this function, type checkers like Mypy would have flagged the attempt to concatenate the float stored in distance with strings. Then the analyst could have corrected their code before running it and seeing the error.
coordinates = [2, 5, 4]
distance = distance_from_origin_typehints(*coordinates)
info_string = "The point is " + distance + " meters from the origin"
print(info_string)
Type hints are well-named. They do not force you to use the right types. They will not cause Python to throw an error if you use the wrong types. They give you a hint that you are not using a value correctly.
For example, the distance_from_origin_typehints function still executes without an error when you pass it a complex number as an argument, even though a complex is not a float.
coordinates = [2j, 5, 4]
distance = distance_from_origin_typehints(*coordinates)
info_string = f"The point is {distance} meters from the origin"
print(info_string)
Type hints can be used for more complex types, like if you need to have a particular container type and you also need to specify the type of the values inside the container.
- Iterable is for containers that you want to use in a for loop.
- Sequence is an Iterable that lets you know the length and access an element by index.
- MutableSequence is a Sequence that you might need to change.
- Mapping is for dictionary-like objects where you want to get values by key.
- MutableMapping is a Mapping that you might need to change.
- If you know you want a specific type, you can also directly use dict, list, tuple, etc.
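As a quick illustration of these container hints in use, here is a minimal sketch (the function and data are invented for illustration):
from collections.abc import Mapping, Sequence

def total_population(tracts: Sequence[Mapping[str, int]]) -> int:
    """Sum the 'population' value across a sequence of tract records."""
    return sum(tract["population"] for tract in tracts)

total_population([{"population": 1000}, {"population": 2000}])  # 3000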
The code below refactors the distance function to use a single parameter and calculates the distance in any number of dimensions.
from collections.abc import Iterable
def n_dimension_distance_from_origin(coords: Iterable[float]) -> float:
"""
Calculates the distance from a given n-dimensional point to the origin.
Args:
coords (Iterable[float]):
An iterable of coordinate values, one for each dimension.
Returns:
float: The distance.
"""
sum_of_squares = sum(d ** 2 for d in coords)
return sum_of_squares ** 0.5
n_dimension_distance_from_origin((1, 1, 1, 1))
n_dimension_distance_from_origin([1, 1, 1, 1])
n_dimension_distance_from_origin(("1", "1", "1", "1"))
Exercises¶
The exercises below invite you to practice applying the different strategies outlined above. They follow the order of the concepts presented, but you can attempt them in any order. Start with the ones that seem most applicable to the work you need to do.
You can find example answers in the ExerciseAnswers.ipynb notebook.
1) Use type hints¶
Determine the input and output types of the calculate_area function below, then add type hints.
Hint: The correct type for vertices is complicated. It is passed to the cycle function, which means you need to be able to loop over it. The containers inside vertices must have both an x and a y property, and you need to be able to do arithmetic using the values of those properties.
from itertools import cycle
from typing import NamedTuple
class Vertex(NamedTuple):
x: float
y: float
def calculate_area(vertices):
subtotals = []
vertex_cycle = cycle(vertices)
next(vertex_cycle)
for vertex in vertices:
next_vertex = next(vertex_cycle)
subtotal = vertex.x * next_vertex.y - vertex.y * next_vertex.x
subtotals.append(subtotal)
area = abs(sum(subtotals) / 2)
return area
vertices = (Vertex(4, 10), Vertex(9, 7), Vertex(11, 2), Vertex(2, 2))
calculate_area(vertices)
2) Add a docstring to a function¶
Determine what the calculate_area function does, then add a docstring.
The examples above use Google-style docstrings, which is a common standard. You may also want to look at other common formats.
Hint: It is not actually necessary to understand the shoelace algorithm implemented by this function. You can still write an excellent docstring explaining what it does and how to use it.
Optimize Performance and Memory Use¶
When you begin to use Python regularly in your work, you'll start noticing bottlenecks in your code. Some workflows may run at lightning speed, while others take hours of processing time to complete, or even crash.
Avoiding bloat is invaluable as you move toward using code for automation, bigger data, and working with APIs. Code efficiency means:
- Less chance of a slowdown or crash: the dreaded MemoryError.
- Quicker response time and fewer bottlenecks for the larger workflow.
- Better scaling.
- Efficient code is often (but not always!) cleaner and more readable.
Let's look at some ways you can reduce bloat in your code.
Access and store only what you need, no more.
- Storage: avoid a list where you could use a tuple
- Membership look-up: avoid a list (or tuple) where you could use a set (or dictionary)
- Iteration: avoid a function (or list comprehension) where you could use a generator (or generator expression)
- Profile: make time for performance checks by profiling your code for bottlenecks
Use fewer lists¶
If you have a collection of values, your first thought may be to store them in a list.
data_list = [17999712, 2015, 'Hawkins Road', 'Linden ', 'NC', 28356]
Lists are nice because they are very flexible. You can change the values in the list, including appending and removing values. But that flexibility comes at a cost. Lists are less efficient than tuples. For example, they use more memory.
import sys
data_tuple = tuple(data_list)
print(sys.getsizeof(data_list))
print(sys.getsizeof(data_tuple))
Note that sys.getsizeof doesn't include the size of data in a container, just the size of the container. You can use it to compare data structures that have the same data in them, but not to compare different data.
Membership look-up: sequential vs. hashable¶
When you want to see if an element already exists in a collection of elements, neither lists nor tuples are the best choice.
- List and tuple lookup is sequential. The bigger the list, the longer look-up takes. This is called O(n) time complexity.
- Set and dictionary lookups use hashing, which means a lookup goes directly to the correct value. Lookup always takes the same amount of time, no matter how much data there is. This is called O(1) time complexity.
For example, imagine an analyst has a dataset of 1 million addresses. They also have a smaller dataset of 10,000 zip codes. They want to know which of the zip codes are associated with at least 1 of the addresses.
One way to do that is with a list.
from random import randint
addresses_zips = [randint(10000, 99950) for _ in range(1_000_000)]
zips_of_interest = [randint(10000, 99950) for _ in range(10_000)]
zips_with_address_match_from_list = []
for address_zip in addresses_zips:
if address_zip in zips_of_interest:
zips_with_address_match_from_list.append(address_zip)
print(len(zips_with_address_match_from_list))
A faster way is to use a set.
zips_of_interest_set = set(zips_of_interest)
zips_with_address_match_from_set = []
for address_zip in addresses_zips:
if address_zip in zips_of_interest_set:
zips_with_address_match_from_set.append(address_zip)
print(len(zips_with_address_match_from_set))
zips_with_address_match_from_set == zips_with_address_match_from_list
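As an aside, if you only need the unique matching zip codes (order and duplicates don't matter), a set intersection expresses the whole job in one line. Note this is not an exact replacement for the loop above, which keeps one entry per matching address:
unique_matches = set(addresses_zips) & set(zips_of_interest)
print(len(unique_matches))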
Big takeaway: Lists are appropriate when you need a collection where you can change the values, but they aren't the best choice for everything. Use a tuple if you don't need to change the values. Use a dictionary or set if you need to check if a value is in the collection.
Use more generators¶
Regular functions and comprehensions typically store outputs into containers, like lists or dictionaries. This can take up unnecessary memory, especially when we're creating multi-step workflows with many intermediate outputs.
In contrast, generators only hold one data item in memory at a time. A generator is a type of iterator that produces results on-demand (lazily), maintaining its state between iterations.
def massive_func():
"""A function that attempts to produce an infinitely long list of even numbers."""
x_list = []
x = 0
while True:
x_list.append(x)
x += 2
return x_list
# Calling this function will run out of space
for x in massive_func():
print(x)
def massive_gen():
"""A generator that produces an infinitely long stream of even numbers."""
x = 0
while True:
yield x
x += 2
# Calling this function will run out of time
for x in massive_gen():
print(x)
What goes for functions also goes for list comprehensions. You can often use a generator expression in place of a list comprehension. We've already seen an example of a generator expression in the n-dimensional distance function:
coords = (1, 1, 1, 1)
sum(d ** 2 for d in coords)
Compare that example to one that uses a list comprehension:
coords = (1, 1, 1, 1)
sum([d ** 2 for d in coords])
The sum function operates by looping over an iterable and adding each value to a running total. In the first case, the iterable is a generator that produces a single value at a time.
In the second case, the list comprehension loops over coords to produce a list where every value is stored in memory. Then the sum function loops over that list.
An important limitation of generators is that because they produce a single value at a time and then forget about it, you cannot reuse them.
generator = (d ** 2 for d in coords)
sum(generator)  # 4: this consumes the generator
max(generator)  # ValueError: max() arg is an empty sequence
Big Takeaway: If you're only going to use a value once, you should probably use a generator. If you need to use it again, you probably need to store it in something like a tuple or list.
Profile, don't guess¶
Profiling is any technique used to measure the performance of your code, such as its speed or resource usage. There are dozens of tools available for profiling, but we'll focus on two:
- Check memory use: Use tracemalloc to check the memory usage of code.
- Spot-profile your code: Use the timeit notebook magic to perform some basic profiling by cell or by line.
To make profiling easier, the cell below defines functions for calculating a sum on a generator expression and on a list comprehension. Both functions will be called with a very large number of coordinates to make profile differences more obvious.
coords = (1, 1) * 1_000_000
def sum_generator(coords):
return sum(d ** 2 for d in coords)
def sum_list_comprehension(coords):
return sum([d ** 2 for d in coords])
Check memory use¶
The cells below use tracemalloc to capture information about memory usage for the two versions of the function.
You do need to restart the kernel between runs of these cells, to ensure tracemalloc isn't counting information stored in memory from a previous cell run.
import tracemalloc
tracemalloc.start()
sum_generator(coords)
current, peak = tracemalloc.get_traced_memory()
print(peak)
import tracemalloc
tracemalloc.start()
sum_list_comprehension(coords)
current, peak = tracemalloc.get_traced_memory()
print(peak)
Spot-check speed with %%timeit¶
The timeit module measures the execution time of a selection of code. Among the ways you'll see it written are "magic" commands in notebooks.
%%timeit is a form of cell magic. It measures the execution time of the entire notebook cell.
%%timeit
sum_generator(coords)
%%timeit
sum_list_comprehension(coords)
If you just want to check the timing for a single line, you can use the %timeit line magic. That's useful if you have some code that takes some time to run, but you don't want it affecting the timeit results. Compare the use of cell magic and line magic in the next two cells.
%%timeit
from time import sleep
sleep(1)
sum_list_comprehension(coords)
from time import sleep
sleep(1)
%timeit sum_list_comprehension(coords)
Big takeaway: You can use your knowledge of Python to make some predictions about where performance bottlenecks are occurring in your code. But you should check to be sure, because those bottlenecks frequently show up in unexpected places.
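Beyond tracemalloc and timeit, the standard library also ships cProfile, which breaks execution time down per function call. A minimal sketch, assuming sum_list_comprehension and coords are already defined as above:
import cProfile

# Profile a single call and sort the report by cumulative time.
cProfile.run("sum_list_comprehension(coords)", sort="cumtime")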
Exercises¶
The exercises below invite you to practice applying the different strategies outlined above. They follow the order of the concepts presented, and you'll need to at least run the code in #3 before you attempt #4 or #5, since they rely on the function definitions in #3. You can otherwise attempt them in any order. Start with the ones that seem most applicable to the work you need to do.
You can find example answers in the ExerciseAnswers.ipynb notebook.
1) Use the right data structure for immutable sequences¶
The code below creates a list containing all years in a research study timeframe, from 1900 to 2030.
The values in this collection will not need to be changed because the study will always use this timeframe.
import sys
def list_from_range(start, end):
"""Create a list from a range of values"""
return list(range(start, end + 1))
start = 1900
end = 2030
studyYears = list_from_range(start, end)
print(studyYears)
print("Bytes used: ", sys.getsizeof(studyYears))
Write a different implementation using a different storage option and demonstrate that option uses less memory.
2) Use the right data structure for membership lookup¶
The code below assigns a collection of placenames to a list. Then, it checks whether a placename is in the list. If not, the placename is reported missing.
If you have 1 million placenames to look up against even a small list, the comparisons multiply quickly: 1 million look-ups against a 6-name list is up to 6 million checks.
placeNames_list = ["Kinshasa", "Duluth", "Uruguay"] * 1_000_000
# O(n) list look-up
if "Dinkytown" not in placeNames_list:
print("Missing.")
Write a different implementation using a storage option that allows quicker checks for membership at scale.
3) Use generators¶
The code below uses a generator to create vertices for triangles from a random selection. It also defines a function for calculating the area of a polygon from its vertices.
from itertools import cycle
from random import randint
class Random_Vertex:
def __init__(self):
self.x = randint(0, 100)
self.y = randint(0, 100)
def generate_polygon_vertices(num_polygons, num_sides):
for _ in range(num_polygons):
vertices = (Random_Vertex() for _ in range(num_sides))
yield vertices
def calculate_area(vertices):
subtotals = []
vertex_cycle = cycle(vertices)
next(vertex_cycle)
for vertex in vertices:
next_vertex = next(vertex_cycle)
subtotal = vertex.x * next_vertex.y - vertex.y * next_vertex.x
subtotals.append(subtotal)
area = abs(sum(subtotals) / 2)
return area
The code below uses the code above to generate 1 million triangles. You want to find out the area of the largest triangle. The code below does this with a list comprehension, which holds all 1 million area values in memory.
triangles = generate_polygon_vertices(1_000_000, 3)
max([calculate_area(triangle) for triangle in triangles])
Rewrite the code above to use less memory.
Hint: The easiest fix is to replace the list comprehension with a generator expression. Harder would be writing your own generator using the yield statement.
4) Check memory use of lists vs. generators¶
Change both cells below to use tracemalloc to compare their memory use.
Hint: Because the notebook keeps many variables in memory, you will want to restart the notebook kernel between running the cells to get a valid comparison. That means you will need to re-run the cell that defines the generate_polygon_vertices generator and calculate_area function.
# Using lists
triangles = generate_polygon_vertices(1_000, 3)
max([calculate_area(triangle) for triangle in triangles])
# Using a generator expression
triangles = generate_polygon_vertices(1_000, 3)
max(calculate_area(triangle) for triangle in triangles)
5) Compare execution speed of lists vs. generators¶
Change both cells below to use timeit to compare their execution time.
# Using a list
triangles = generate_polygon_vertices(1_000, 3)
max([calculate_area(triangle) for triangle in triangles])
# Using a generator expression
triangles = generate_polygon_vertices(1_000, 3)
max(calculate_area(triangle) for triangle in triangles)
Exercise Answers¶
The code cells below are example answers to the workshop exercises. They are useful if you get stuck and need a hint, or if you want to use them as a comparison with your own attempts.
1.1) Use unpacking for pretty printing¶
counties = ["Anoka", "Dakota", "Carver", "Hennepin", "Ramsey", "Scott", "Washington"]
print(*counties, sep='\n')
Anoka
Dakota
Carver
Hennepin
Ramsey
Scott
Washington
1.2) Use standard library data classes¶
from dataclasses import dataclass
@dataclass
class Record:
total_population: int
population_in_poverty: int
record = Record(5000, 200)
record.total_population = 6000
print(record)
Record(total_population=6000, population_in_poverty=200)
1.3) Use the built-in min and max functions¶
from random import randint
nums = [randint(-1000, 1000) for i in range(20)]
print(max(nums), min(nums))
891 -899
1.4) Just do things¶
from typing import NamedTuple
class Record(NamedTuple):
total_population: int
population_in_poverty: int
record1 = Record(5000, 2000)
record2 = Record(200, 10)
record3 = Record("400", "30")
def poverty_rate(record):
total_pop, pop_in_poverty = record
return int(pop_in_poverty) / int(total_pop)
for record in (record1, record2, record3):
print(poverty_rate(record))
0.4
0.05
0.075
1.5) Use Pythonic patterns for setup and teardown boilerplate¶
with open("data.csv", "w") as f:
f.write("Important data")
raise ValueError
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[3], line 3
      1 with open("data.csv", "w") as f:
      2     f.write("Important data")
----> 3 raise ValueError

ValueError:
2.1) Use type hints¶
from itertools import cycle
from collections.abc import Iterable
from typing import NamedTuple
class Vertex(NamedTuple):
x: float
y: float
def calculate_area(vertices: Iterable[Vertex]) -> float:
subtotals = []
vertex_cycle = cycle(vertices)
next(vertex_cycle)
for vertex in vertices:
next_vertex = next(vertex_cycle)
subtotal = vertex.x * next_vertex.y - vertex.y * next_vertex.x
subtotals.append(subtotal)
area = abs(sum(subtotals) / 2)
return area
vertices = (Vertex(4, 10), Vertex(9, 7), Vertex(11, 2), Vertex(2, 2))
calculate_area(vertices)
2.2) Add a docstring to a function¶
def calculate_area(vertices: Iterable[Vertex]) -> float:
"""
Calculate the area of a polygon given the coordinates of its vertices
Args:
vertices (Iterable[Vertex]):
An iterable, such as a list or tuple, of Vertex objects
holding the (x, y) coordinates of each vertex
Returns:
float: The area of the polygon
"""
subtotals = []
vertex_cycle = cycle(vertices)
next(vertex_cycle)
for vertex in vertices:
next_vertex = next(vertex_cycle)
subtotal = vertex.x * next_vertex.y - vertex.y * next_vertex.x
subtotals.append(subtotal)
area = abs(sum(subtotals) / 2)
return area
vertices = (Vertex(4, 10), Vertex(9, 7), Vertex(11, 2), Vertex(2, 2))
calculate_area(vertices)
45.5
3.1) Use the right data structure for immutable sequences¶
import sys
def tuple_from_range(start, end):
"""Create a tuple from a range of values"""
return tuple(range(start, end + 1))
start = 1900
end = 2030
studyYears = tuple_from_range(start, end)
print(studyYears)
print("Bytes used: ", sys.getsizeof(studyYears))
(1900, 1901, 1902, 1903, 1904, 1905, 1906, 1907, 1908, 1909, 1910, 1911, 1912, 1913, 1914, 1915, 1916, 1917, 1918, 1919, 1920, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024, 2025, 2026, 2027, 2028, 2029, 2030) Bytes used: 1088
3.2) Use the right data structure for membership lookup¶
placeNames_list = ["Kinshasa", "Duluth", "Uruguay"] * 1_000_000
placeNames_set = set(placeNames_list)
# O(1) set look-up
if "Dinkytown" not in placeNames_set:
print("Missing.")
3.3) Use generators¶
from itertools import cycle
from random import randint
class Random_Vertex:
def __init__(self):
self.x = randint(0, 100)
self.y = randint(0, 100)
def generate_polygon_vertices(num_polygons, num_sides):
for _ in range(num_polygons):
vertices = (Random_Vertex() for _ in range(num_sides))
yield vertices
def calculate_area(vertices):
subtotals = []
vertex_cycle = cycle(vertices)
next(vertex_cycle)
for vertex in vertices:
next_vertex = next(vertex_cycle)
subtotal = vertex.x * next_vertex.y - vertex.y * next_vertex.x
subtotals.append(subtotal)
area = abs(sum(subtotals) / 2)
return area
Easier: Use a generator expression to find the triangle with the maximum area instead of a list comprehension
triangles = generate_polygon_vertices(1_000_000, 3)
max(calculate_area(triangle) for triangle in triangles)
5000.0
Harder: Write a generator to replace the list comprehension instead of using a generator expression.
def calculate_areas(polygons):
for polygon in polygons:
yield(calculate_area(polygon))
triangles = generate_polygon_vertices(1_000_000, 3)
max(calculate_areas(triangles))
5000.0
3.4) Check memory use of lists vs. generators¶
import tracemalloc
tracemalloc.start()
triangles = generate_polygon_vertices(1_000_000, 3)
max([calculate_area(triangle) for triangle in triangles])
current, peak = tracemalloc.get_traced_memory()
print(peak)
32452069
import tracemalloc
tracemalloc.start()
triangles = generate_polygon_vertices(1_000_000, 3)
max(calculate_area(triangle) for triangle in triangles)
current, peak = tracemalloc.get_traced_memory()
print(peak)
31073
3.5) Compare execution speed of lists vs. generators¶
%%timeit
triangles = generate_polygon_vertices(1_000, 3)
max([calculate_area(triangle) for triangle in triangles])
7.55 ms ± 40.6 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
triangles = generate_polygon_vertices(1_000, 3)
max(calculate_area(triangle) for triangle in triangles)
7.54 ms ± 32.5 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)