Python Tips and Tricks
Introduction
If you're anything like us, most of what you know about Python you learned by trial and error. Most geospatial professionals don't have much formal training in computer science or software engineering. And that's fine, because we're mostly not computer scientists or software engineers. We don't have time to train for a whole new career just to write some automation scripts. But the downside of a purely practical education is that it can be easy to settle for suboptimal solutions because we're not aware of better practices that can improve the performance and readability of our code.
This session is designed to highlight some common code patterns that work OK, but for which a better approach exists.
- Instead of bracket notation, use more unpacking
- Instead of conditionals for validation, use try/except
- Instead of rolling your own solutions, use existing Python capabilities
- Instead of writing setup and teardown code, use context managers
- Instead of only writing documentation in separate files, use doc strings and type hints.
- Instead of always modeling data collections as lists, use more tuples and sets
- Instead of always looping over lists, use more iterators
- Instead of list comprehensions, use more generators
- Instead of guessing about inefficiencies, profile your code
This workshop is focused on general patterns that you can use no matter what type of problem you are working on. Because these are general patterns, don't expect to be able to lift the code examples here and use them directly in your code. Do expect to take these strategies and apply them to your code.
The code examples and exercises are written using Jupyter Notebooks. If you have a Google account, you can click the Open in Colab button at the top to run the notebooks using Google's Colab environment. Otherwise, you can click the download button for each notebook to download it to your local machine and run in a notebook environment (e.g. loading the notebook into ArcGIS Pro).
Don't try so hard¶
When presented with a problem, our first instinct might be to write code that reflects how we would solve the problem manually. That's good because we can take advantage of our existing knowledge. But it's also bad because that code is likely to be harder to implement than a solution that takes full advantage of Python's capabilities.
You are probably working harder than you need to when you implement solutions that match your manual process. For many types of problems, Python has solved them already. Learning the Pythonic way to address a problem will be easier than coming up with your own. It will also be easier for other people to understand your code, because you are using the well-known idioms of Python instead of your own idiosyncratic implementation.
Unpack values¶
An analyst has a tuple of three values that represent the x, y, and z coordinates of a location. The analyst has a distance function that takes three arguments, one for each coordinate.
coordinates = (2, 5, 4)
def distance_from_origin(x, y, z):
    return (x**2 + y**2 + z**2) ** 0.5
The analyst needs to pass the values from the tuple to the function. One way to do that is to use the index of each value with bracket notation.
x = coordinates[0]
y = coordinates[1]
z = coordinates[2]
distance_from_origin(x, y, z)
6.708203932499369
That works, but it has two problems:
- Repetition of coordinates is error prone and tough to refactor.
- Overuse of brackets makes code harder to read.
You can use unpacking to fix both problems.
x, y, z = coordinates
distance_from_origin(x, y, z)
6.708203932499369
Variable unpacking takes a collection of values on the right-hand side of =
and assigns each value in order to an equal number of names on the left hand side. Importantly, the number of names on the left must match the number of values in the collection on the right.
x, y, z, m = coordinates
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) Cell In[6], line 1 ----> 1 x, y, z, m = coordinates ValueError: not enough values to unpack (expected 4, got 3)
x, y = coordinates
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) Cell In[5], line 1 ----> 1 x, y = coordinates ValueError: too many values to unpack (expected 2)
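If you genuinely have more values than names, a starred name on the left can absorb the extras. A small sketch (the extra value 7 and the names rest, middle, and last are just for illustration):
first, *rest = (2, 5, 4, 7)
print(first)   # 2
print(rest)    # [5, 4, 7] -- the starred name collects the leftover values into a list
x, *middle, last = (2, 5, 4, 7)
print(middle)  # [5, 4]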
Unpacking to names is useful, but you can go a step further when the values you need to unpack will be passed to a function or class constructor.
distance_from_origin(*coordinates)
6.708203932499369
The *
in front of the variable name unpacks the values so that each value in order is assigned to the parameters of the function.
One disadvantage of unpacking a collection into arguments this way is that it relies on parameter order. That means it only works when you can use positional arguments and doesn't work when you need to specify keyword arguments.
But if the values to unpack are in a dictionary where each key matches a parameter name, you can unpack them as keyword arguments with **
. Then the order of values no longer matters.
coordinates_dict = {
"z": 4,
"y": 5,
"x": 2
}
distance_from_origin(**coordinates_dict)
6.708203932499369
If you have ever seen a function definition with *args
and **kwargs
, that's related to unpacking. The *args
parameter means the function can be called with any number of positional arguments. The **kwargs
means the function can be called with any keyword arguments.
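For example, a function defined this way will accept whatever you pass it (a throwaway sketch; describe_call is not part of the workshop code):
def describe_call(*args, **kwargs):
    # args arrives as a tuple of positional arguments, kwargs as a dictionary of keyword arguments.
    print(args, kwargs)
describe_call(2, 5, z=4)   # prints (2, 5) {'z': 4}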
Big Takeaway: Unpacking reduces the amount of code you have to write and makes your code easier to read. Take advantage of it wherever you can.
Beg forgiveness. Don't ask permission¶
An analyst has tract-level census data where each tract has three values:
- Total Area in km2
- Water Area in km2
- Population
tract1 = {
"area": 100,
"area_water": 20,
"population": 1000
}
The analyst writes a function to calculate the population density in people per km2 of land area.
def pop_density(tract):
    area_land = tract["area"] - tract["area_water"]
    return tract["population"] / area_land
pop_density(tract1)
12.5
But some records don't have a value for the water area because they are all land. This is the equivalent of not having an "area_water"
column. Passing those records to the function causes a KeyError
exception.
tract2 = {
"area": 100,
"population": 1000
}
pop_density(tract2)
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) Cell In[53], line 6 1 tract2 = { 2 "area": 100, 3 "population": 1000 4 } ----> 6 pop_density(tract2) Cell In[52], line 2, in pop_density(tract) 1 def pop_density(tract): ----> 2 area_land = tract["area"] - tract["area_water"] 3 return tract["population"] / area_land KeyError: 'area_water'
One way to deal with potential bad values is to check ahead of time with conditional logic (if/elif/else).
def pop_density2(tract):
    if "area_water" not in tract.keys():
        area_land = tract["area"]
    else:
        area_land = tract["area"] - tract["area_water"]
    return tract["population"] / area_land
pop_density2(tract2)
10.0
But using conditional logic like this is not great. You need to put in the checks before you get to your core logic, which hurts both performance and readability. You will also run into edge cases that you didn't anticipate that cause your code to fail or return the wrong answer. For example, some records without any water have explicitly set the "area_water"
value to None
. This is equivalent to having null values in an "area_water"
column. Passing a record like that to the function causes a TypeError
.
tract3 = {
"area": 100,
"area_water": None,
"population": 1000
}
pop_density2(tract3)
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) Cell In[55], line 7 1 tract3 = { 2 "area": 100, 3 "area_water": None, 4 "population": 1000 5 } ----> 7 pop_density2(tract3) Cell In[54], line 5, in pop_density2(tract) 3 area_land = tract["area"] 4 else: ----> 5 area_land = tract["area"] - tract["area_water"] 6 return tract["population"] / area_land TypeError: unsupported operand type(s) for -: 'int' and 'NoneType'
Your code gets more complicated as you deal with those edge cases.
def pop_density3(tract):
    if "area_water" not in tract.keys() or tract["area_water"] is None:
        area_land = tract["area"]
    else:
        area_land = tract["area"] - tract["area_water"]
    return tract["population"] / area_land
pop_density3(tract3)
10.0
No matter how many edge cases you anticipate, there will probably be another one you didn't. For example, passing a record that is all water causes a ZeroDivisionError
.
tract4 = {
"area": 100,
"area_water": 100,
"population": 0
}
pop_density3(tract4)
--------------------------------------------------------------------------- ZeroDivisionError Traceback (most recent call last) Cell In[57], line 7 1 tract4 = { 2 "area": 100, 3 "area_water": 100, 4 "population": 0 5 } ----> 7 pop_density3(tract4) Cell In[56], line 6, in pop_density3(tract) 4 else: 5 area_land = tract["area"] - tract["area_water"] ----> 6 return tract["population"] / area_land ZeroDivisionError: division by zero
Instead of writing an exploding mess of spaghetti code to deal with a never-ending parade of edge cases, it is better to use try and except. Python will attempt to run the code in the try block. If that code throws an exception, Python will run the code in the except block that matches the type of exception.
def pop_density4(tract):
    try:
        area_land = tract["area"] - tract["area_water"]
        return tract["population"] / area_land
    except (KeyError, TypeError):
        return tract["population"] / tract["area"]
    except ZeroDivisionError:
        return 0
for tract in [tract1, tract2, tract3, tract4]:
    print(pop_density4(tract))
12.5 10.0 10.0 0
This pattern puts your core logic at the top, and deals with edge cases afterward, making your code more performant and readable. It also gives you the option to handle different types of errors differently. The first except
block handles both the KeyError
and TypeError
problems by using the total area to calculate population density. The second except
block handles the ZeroDivisionError
by appropriately returning 0
.
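The examples above only needed except clauses, but a try statement can also carry else and finally clauses. A minimal sketch using tract1 from earlier (the printed messages are just for illustration):
try:
    area_land = tract1["area"] - tract1["area_water"]
except KeyError:
    area_land = tract1["area"]
else:
    # Runs only when the try block raised no exception.
    print("both keys were present")
finally:
    # Always runs, exception or not; a natural home for cleanup code.
    print("finished checking tract1")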
Big takeaway: You can just try things. It's usually easier, faster, and more readable to put the common case in a try
block, and handle exceptions for edge cases where the common case doesn't work.
Use more of the standard library¶
An analyst has tract-level census data records. Each tract has two values: population and households. The analyst could model a single tract as a dictionary.
tract1 = {
"population": 1000,
"households": 500
}
That looks appropriate because it clearly links each value to a key that explains what the value means. But a dictionary is usually not a good data structure for a single record from a table. For one thing, there is a substantial amount of repetition if you need to model many records.
tract2 = {
"population": 2000,
"households": 800
}
tract3 = {
"population": 5000,
"households": 3000
}
Another problem is that dictionaries are mutable, which means the keys can change and cause the dictionary to no longer fit the same data schema.
del tract2["households"]
tract2
{'population': 2000}
Dictionaries are optimized for fast access of a value by key. This is not usually an important goal for an individual record. Using a dictionary to model records is unnecessarily hard. A better data structure for a record is a tuple.
tract1 = (1000, 500)
tract2 = (2000, 800)
tract3 = (5000, 3000)
A glaring omission from a tuple, however, is any context for what each value represents. An even better data structure for a record is a named tuple, which you can import from the standard library.
To use a named tuple, create a class that inherits from NamedTuple
. For this kind of class, you only need to specify the field names and the datatype the values in each field should have. You can then create instances of that named tuple by passing the appropriate values to the constructor.
from typing import NamedTuple
class Tract(NamedTuple):
    population: int
    households: int
tract1 = Tract(1000, 500)
tract2 = Tract(2000, 800)
tract3 = Tract(5000, 3000)
You can access the value in a field using dot notation.
tract1.households
500
Big takeaway: The standard library has classes and functions that make your life easier without having to install additional packages. You should use them more. Named tuples are just one example. The official documentation lists them all, but some highlights include (csv and pathlib are sketched briefly after this list):
- csv for working with csv files
- dataclasses for creating data classes (like NamedTuple, but editable)
- datetime for working with dates and times
- itertools for efficient looping
- math for mathematical functions
- pprint for nicely printing complex data structures
- pathlib and os.path for working with file paths
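As a small illustration of two of these working together (the file name tracts.csv is a throwaway example, not part of the workshop data):
import csv
from pathlib import Path
# Write a tiny csv file, then read it back one row at a time.
path = Path("tracts.csv")
path.write_text("population,households\n1000,500\n2000,800\n")
with path.open(newline="") as f:
    for row in csv.DictReader(f):
        print(row["population"], row["households"])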
Use more built-ins¶
The analyst wants to know the average number of people per household across all tracts. That is not the same as averaging the number of people per household per tract. The analyst needs to divide the total population across tracts by the total number of households across tracts.
One way to get the right answer is to loop over each tract, keeping running totals of the population and household values, then calculate the ratio.
population = 0
households = 0
tracts = [tract1, tract2, tract3]
for tract in tracts:
    population += tract.population
    households += tract.households
population / households
1.8604651162790697
That gives the correct answer, but keeping running totals obscures the goal, which is to create a sum of the total population and households across tracts. Summing values is a common pattern, and for many common patterns, Python has some built-in capability to make them easier to accomplish. Built-ins differ from the functionality in the standard library in that you don't have to import anything to get access to built-ins. Code that is considered Pythonic makes good use of these built-in capabilities. In this case, there is the sum function, which has the advantage of making it more explicit to the reader that the code is summing the values in a collection.
population_values = []
household_values = []
tracts = [tract1, tract2, tract3]
for tract in tracts:
    population_values.append(tract.population)
    household_values.append(tract.households)
sum(population_values) / sum(household_values)
1.8604651162790697
Appending values to a list in a loop is also a very common pattern. For this case, Python has list comprehensions. List comprehensions are shorter and more readable (once you get used to them) than explicit loops. They also execute faster than the equivalent loop.
tracts = [tract1, tract2, tract3]
population_values = [tract.population for tract in tracts]
household_values = [tract.households for tract in tracts]
sum(population_values) / sum(household_values)
1.8604651162790697
In this particular case, a list comprehension is not so great because we had to iterate over the tracts twice. You could write a nested comprehension that loops over the tracts for each value in a tract, which would gain some efficiency if you had a lot of tracts.
population_values, household_values = [[tract[i] for tract in tracts] for i in range(len(tract1))]
sum(population_values) / sum(household_values)
1.8604651162790697
But this is an abomination. While comprehensions are generally awesome, resist the temptation to make complicated comprehensions. In such cases, it would be better to use an explicit loop instead.
Even better is to know that recombining a group of pairs into a pair of groups (or vice versa) is also a common pattern. Once again, Python has a built-in function, zip
, to do that for you. By using zip
, you can save yourself a little bit of typing and a significant amount of thinking about the correct implementation. Using zip
also makes your code easier to explain to other people familiar with Python because they don't have to reason through your implementation to make sure it's been done correctly.
population_values, household_values = zip(tract1, tract2, tract3)
sum(population_values) / sum(household_values)
1.8604651162790697
Big Takeaway: Python built-ins can make your life easier, without even having to import additional libraries. sum
, list comprehensions, and zip
are among the more useful built-in capabilities of Python you should be using more. The official documentation has the complete list, but some other useful built-in functions include:
- abs for returning the absolute value of a number.
- all and any for testing the truth of a collection of values.
- dir for listing the attributes of an object.
- enumerate for getting both the index and a value from a collection. Useful for complex loops (see the sketch after this list).
- help for getting information about a Python object.
- isinstance and type for getting information about an object's type.
- len for getting the length of a collection, such as a string or list.
- max and min for getting the maximum or minimum value from a group of values.
- open for opening files.
- range for creating a collection of values in a given range.
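A quick illustration of a few of these together (the values list is made up for the example):
values = [3, -7, 12, 0]
print(len(values), max(values), min(values))   # 4 12 -7
print(any(v < 0 for v in values))              # True, at least one value is negative
for i, v in enumerate(values):
    print(i, v)                                # index and value on each line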
Context managers¶
An analyst needs to write some values to a csv file. They open the file, write the content, then close the file.
rows = ("x,y,z", "2,4,5")
f = open('data.csv', 'w')
f.write("\n".join(rows))
f.close()
That works, but it has two problems. The first is that you have to remember to write f.close()
at the end or else the file will stay open. That increases the risk of file corruption. That's not too hard in this example, but the more code between the open
function and the close
method, the more likely it is that you will forget.
The second problem is that even if you remember to write the teardown code to close the file, it won't run if the code throws an exception before it gets there. That also increases the risk of file corruption.
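The manual fix for that second problem is to wrap the work in try/finally, sketched below; this is exactly the boilerplate that context managers take off your hands.
f = open('data.csv', 'w')
try:
    f.write("\n".join(rows))
finally:
    # Runs whether or not the write raised an exception.
    f.close()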
Because it is common to need some setup and/or teardown code when working with certain objects, Python has context managers that allow you to apply that code automatically. For example, TextIOWrapper
objects created by opening text files have teardown code to close the file, so you don't have to invoke the close
method yourself. Instead, you can use a with
block to activate the context manager.
with open("data.csv", "w") as f:
    f.write("\n".join(rows))
The as f
part of the code creates a name (f
) that points to an object. This is equivalent to
f = open("data.csv", "w")
Before executing the code inside the with
block, the context manager executes the setup code defined for that object. As soon as the code inside the with
block finishes (even if it finished because of an exception), the context manager executes the teardown code defined for the object.
Not every object can be used with a context manager. Whoever wrote the code defining that object had to implement the special methods (__enter__ and __exit__) that give the object context manager support.
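If you want your own routines to support with, one convenient option is the contextlib.contextmanager decorator from the standard library, which builds those special methods from a generator function. A minimal sketch (the timer example is illustrative, not part of the workshop code):
from contextlib import contextmanager
import time
@contextmanager
def timer(label):
    # Setup runs before the with block's body.
    start = time.perf_counter()
    try:
        yield
    finally:
        # Teardown runs even if the body raises an exception.
        print(f"{label} took {time.perf_counter() - start:.4f} seconds")
with timer("sum of squares"):
    total = sum(i * i for i in range(100_000))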
Big Takeaway: Context managers let you avoid boilerplate setup/teardown code. Opening files is probably the most common use case for context managers, but pay attention to how they are used in other libraries you work with as well.
Exercises¶
The exercises below invite you to practice applying the different strategies outlined above. They follow the order of the concepts presented, but you can attempt them in any order. Start with the ones that seem most applicable to the work you need to do.
You can find example answers in the ExerciseAnswers.ipynb notebook.
1) Use unpacking for pretty printing¶
The code below uses a loop to print each value in a collection on a separate line.
counties = ["Anoka", "Dakota", "Carver", "Hennepin", "Ramsey", "Scott", "Washington"]
for county in counties:
    print(county)
Anoka Dakota Carver Hennepin Ramsey Scott Washington
Write a different implementation that uses unpacking to print each value on a separate line using a single call to the print
function instead of a loop.
Hint: The print function's first parameter is *objects, which accepts any number of positional arguments (similar to *args in other functions). These arguments are what will be printed. The second parameter is sep, which defines the character to put between the values being printed. The default value of sep is a single space (' '), but it could be a newline character ('\n').
2) Use try/except¶
The code below defines two records using named tuples.
from typing import NamedTuple
class Record(NamedTuple):
    total_population: int
    population_in_poverty: int
record1 = Record(5000, 200)
record2 = Record(200, 0)
The code below calculates the ratio of each value in the first record to the corresponding value in the second record. It uses conditional logic to catch potential errors.
from math import inf
for field in Record._fields:
    if getattr(record2, field) != 0:
        ratio = getattr(record1, field) / getattr(record2, field)
    else:
        ratio = inf
    print(ratio)
25.0 inf
Write a different implementation that uses try
and except
instead.
Hint: You may find it useful to first write the code without any error handling to see what type of error occurs.
3) Use standard library data classes¶
The code below uses a dictionary to define a record, then changes one of the values in that record.
record = {
"total_population": 5000,
"population_in_poverty": 200
}
record["total_population"] = 6000
print(record)
{'total_population': 6000, 'population_in_poverty': 200}
This pattern cannot be implemented using a named tuple, because named tuples are immutable. A data class is a standard library class that is similar to a named tuple, but its fields can be changed. Write a different implementation of the code above that uses a data class instead of a dictionary.
Hint: The official Python documentation may be hard to understand. You may want to search for a tutorial on data classes specifically.
4) Use the built-in min and max functions¶
The code below creates a list of 20 random numbers between -1000 and 1000.
from random import randint
nums = [randint(-1000, 1000) for i in range(20)]
The code below finds the maximum and minimum values of nums
using conditional logic and explicit comparisons to running values.
min_num = 1000
max_num = -1000
for num in nums:
    if num > max_num:
        max_num = num
    if num < min_num:
        min_num = num
print(max_num, min_num)
992 -952
Write a different implementation that uses the built-in max and min functions instead of explicit comparisons.
5) Open a file with a context manager¶
The code below opens a file and writes to it.
f = open("exercise.txt", "w")
f.write("This is example text for an exercise.")
f.close()
Rewrite it to use a context manager instead.
Help people understand your code¶
Even if you use Pythonic idioms, your code probably won't be perfectly understandable by itself. But with all the time you save by writing code that is more Pythonic, you can spend more time documenting your code. That way other people can figure it out.
Doc strings¶
An analyst has a function that calculates the distance from a given point to the origin in three dimensions.
def distance_from_origin(x, y, z):
    return (x**2 + y**2 + z**2) ** 0.5
One option is to say that it is perfectly obvious what this function does from its name and parameters. But your functions are much more obvious to you than they are to other people. "Other people" includes future you. You do not want future you mad at current you for not explaining what your code does.
A better option is to write down explicitly what this function does, what kind of arguments you can pass to it, and what kind of value it will return. For example, you might have a text file, or a web page, or a Word doc. Hopefully not a sticky note on your monitor, but even that's better than nothing. Something like:
Calculates the distance from a given point in three dimensions to the origin (0, 0, 0).
Args:
x (float): The x-axis coordinate.
y (float): The y-axis coordinate.
z (float): The z-axis coordinate.
Returns:
float: The distance.
That works OK, but separating your code from your documentation forces people to look in two places. It also means that the built-in help
function is mostly useless for learning about your function.
help(distance_from_origin)
Help on function distance_from_origin in module __main__: distance_from_origin(x, y, z)
A better way to document your code is to include the information as a doc string. You can use doc strings with the modules, functions, classes, and methods that you create.
def distance_from_origin_docstring(x, y, z):
    """
    Calculates the distance from a given point in three dimensions to the origin (0, 0, 0).
    Args:
        x (float): The x-axis coordinate.
        y (float): The y-axis coordinate.
        z (float): The z-axis coordinate.
    Returns:
        float: The distance.
    """
    return (x**2 + y**2 + z**2) ** 0.5
By including a doc string, people can use the built-in help
function to see the information without having to open the source code file.
help(distance_from_origin_docstring)
Help on function distance_from_origin_docstring in module __main__: distance_from_origin_docstring(x, y, z) Calculates the distance from a given point in three dimensions to the origin (0, 0, 0). Args: x (float): The x-axis coordinate. y (float): The y-axis coordinate. z (float): The z-axis coordinate. Returns: float: The distance.
Many IDEs will even show the information when you hover over the function name.
Type hints¶
An analyst tries to use the distance_from_origin_docstring function, but gets an error.
coordinates = [2, 5, 4]
distance = distance_from_origin_docstring(*coordinates)
info_string = "The point is " + distance + " meters from the origin"
print(info_string)
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) Cell In[15], line 3 1 coordinates = [2, 5, 4] 2 distance = distance_from_origin_docstring(*coordinates) ----> 3 info_string = "The point is " + distance + " meters from the origin" 4 print(info_string) TypeError: can only concatenate str (not "float") to str
The error is reasonably informative, and the analyst can use it to fix their code. But the problem only showed up after the analyst ran the code. It would be nice to get that information beforehand. Type hints are a way to pass information to type checkers and IDEs that can help ensure that you're using the correct types, without having to actually run the code.
def distance_from_origin_typehints(x: float, y: float, z: float) -> float:
    """
    Calculates the distance from a given point in three dimensions to the origin (0, 0, 0).
    Args:
        x (float): The x-axis coordinate.
        y (float): The y-axis coordinate.
        z (float): The z-axis coordinate.
    Returns:
        float: The distance.
    """
    return (x**2 + y**2 + z**2) ** 0.5
If the analyst had used this function, type checkers like Mypy would have flagged the use of the distance name as incorrect. Then the analyst could have corrected their code immediately.
Type hints are well-named. They do not force you to use the right types. They will not cause Python to throw an error if you use the wrong types. They give you a hint that you are not using a value correctly. For example, the distance_from_origin_typehints
function executes successfully when you pass it a complex
number as an argument, even though a complex
is not a float
.
coordinates = [2j, 5, 4]
distance = distance_from_origin_typehints(*coordinates)
info_string = f"The point is {distance} meters from the origin"
print(info_string)
The point is (6.082762530298219+0j) meters from the origin
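Type hints can also describe container types and optional values. A short sketch building on the function above (mean_distance is illustrative, not part of the workshop code):
from typing import Optional
def mean_distance(points: list[tuple[float, float, float]]) -> Optional[float]:
    """Return the mean distance to the origin, or None for an empty list."""
    if not points:
        return None
    return sum(distance_from_origin_typehints(*p) for p in points) / len(points)
print(mean_distance([(2, 5, 4), (0, 0, 0)]))   # 3.3541019662496847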
Optimize Performance and Memory Use¶
When you begin to use Python regularly in your work, you'll start noticing bottlenecks in your code. Some workflows may run at lightning speed, while others take hours of processing time to complete, or even crash.
Avoiding bloat is invaluable as you move toward using code for automation, bigger data, and working with APIs. Code efficiency means:
- Less chance of a slowdown or crash: the dreaded MemoryError.
- Quicker response time and fewer bottlenecks for the larger workflow.
- Better scaling.
- Efficient code is often (but not always!) cleaner and more readable.
Let's look at some ways you can reduce bloat in your code.
tl;dr
Access and store only what you need, no more.
- Storage: avoid a list where you could use a tuple
- Membership look-up: avoid a list (or tuple) where you could use a set (or dictionary)
- Iteration: avoid a function (or list comprehension) where you could use a generator (or generator expression)
- Profile: make time for performance checks by profiling your code for bottlenecks
Storage: lists vs. tuples¶
If you have a collection of values, your first thought may be to store them in a list.
data_list = [17999712, 2015, 'Hawkins Road', 'Linden ', 'NC', 28356]
Lists are nice because they are very flexible. You can change the values in the list, including appending and removing values. But that flexibility comes at a cost. Lists are less efficient than tuples. For example, they use more memory.
import sys
data_tuple = (17999712, 2015, 'Hawkins Road', 'Linden ', 'NC', 28356)
print(sys.getsizeof(data_list))
print(sys.getsizeof(data_tuple))
104 88
If you aren't going to be changing the values in a collection, use a tuple instead of a list.
Membership look-up: sequential vs. hashable¶
However, when you want to see if an element already exists in a collection of elements, use a set or dictionary to store that collection if possible.
- List and tuple look-up is sequential, going at the speed of O(n): linear time.
- With lists, Python scans the entire list until it finds the match (or reaches the end).
- Worst case: it has to look at every element.
- Set and dictionary look-ups use hashing to map keys to storage locations. These run at the speed of O(1): constant time.
- No matter how big the collection is, the set only ever has to check 1 value.
- Sets are built on hash tables. Python computes the hash of the element and jumps straight to where it should be stored.
The example below shows that a set is over 100x faster than a list in calculating the first 10,000 values of Recaman's sequence.
def recaman_check(cur, i, visited):
    return (cur - i) < 0 or (cur - i) in visited
def recaman_list(n: int) -> list[int]:
    """
    return a list of the first n numbers of the Recaman series
    """
    visited_list = [0]
    current = 0
    for i in range(1, n):
        if recaman_check(current, i, visited_list):
            current += i
        else:
            current -= i
        visited_list.append(current)
    return visited_list
%%timeit
recaman_list(10000)
386 ms ± 36.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
def recaman_set(n: int) -> set[int]:
    """
    return a set of the first n numbers of the Recaman series
    """
    visited_set = {0}
    current = 0
    for i in range(1, n):
        if recaman_check(current, i, visited_set):
            current += i
        else:
            current -= i
        visited_set.add(current)
    return visited_set
%%timeit
recaman_set(10000)
2.06 ms ± 61.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
When you add an element to a set...
- Python calls the built-in hash() function on the element (which uses the element's __hash__ method) to get a hash value (an integer);
- That hash value determines where the element will be stored in the set's internal structure; and
- When checking whether an element is in the set, Python uses the hash to find it quickly (a small illustration follows this list).
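A small illustration of what is and isn't hashable (the exact hash value will differ from run to run because Python randomizes string hashes):
print(hash("Dinkytown"))                        # some integer; it determines where the value is stored
print("Dinkytown" in {"Dinkytown", "Duluth"})   # True, found via its hash
try:
    bad_set = {["Dinkytown"]}                   # lists are mutable, so they are not hashable
except TypeError as error:
    print(error)                                # unhashable type: 'list'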
Iteration: functions vs. generators¶
We often use functions to operate on data, but generators can be more memory-efficient and faster for certain tasks.
Regular functions and comprehensions typically store outputs into containers, like lists or dictionaries. This can take up unnecessary memory, especially when we're creating multi-step workflows with many intermediate outputs.
In contrast, generators only hold one data item in memory at a time. A generator is a type of iterator that produces results on-demand (lazily), maintaining its state between iterations.
Under the hood, a generator's syntax is similar to a function's. Generally, you:
- define it with def, just like a function,
- provide the logic, and
- ask for the result, either with a return statement (for functions) or a yield statement (for generators).
Here's a quick way to see why a generator is superior for memory. Let's compare a regular function that produces new values endlessly, storing them in a list, to a generator that yields each value one at a time, discarding it from memory as it moves to the next iteration.
def massive_rf():
    """A regular function that produces even numbers, endlessly."""
    x_list = []
    x = 0
    while True:
        x_list.append(x)
        x += 2
# Run it:
massive_rf()
Woah! That did its best, but my notebook has now informed me that, "Your session crashed after using all available RAM."
def massive_gen():
    """A generator that produces even numbers, endlessly."""
    x = 0
    while True:
        yield x
        x += 2
# Run it: (use keyboard interrupt when you want to move on.)
for x in massive_gen():
    print(x)
The generator was willing to keep going until I interrupted it because it did not store each result in memory as it proceeded.
Let's look at a more concrete scenario. Imagine you have a large dataset containing millions of employee records. You want to calculate the combined hourly rates of all employees on an annual salary.
# For the sake of simplicity, we'll represent the dataset with a small sample.
employeeDatabase = [
{'lastName': 'Knope', 'rate': 72000, 'pay_class': 'annual'},
{'lastName': 'Gergich', 'rate': 17, 'pay_class': 'hourly'},
{'lastName': 'Ludgate', 'rate': 60000, 'pay_class': 'annual'},
{'lastName': 'Swanson', 'rate': 'redacted', 'pay_class': 'redacted'},
{'lastName': 'Haverford', 'rate': 52000, 'pay_class': 'annual'}
]
You can use a function for this, but it means i) the entire input dataset will be held in memory, and ii) each result (each worker's hourly value) will be held in memory too.
def hourly_rate(payments):
    """Function that returns each salaried worker's hourly rate."""
    hourlyRates = []
    for worker in payments:
        if worker.get('pay_class') == 'annual':
            hourly = worker['rate'] / 2080
            hourlyRates.append(hourly)
    return hourlyRates
# Sum hourly rates for those receiving an annual salary.
salariesPerHour = sum(hourly_rate(employeeDatabase))
print(f"Total disbursements per hour for salaried employees: ${salariesPerHour:.2f}")
Total disbursements per hour for salaried employees: $88.46
If the input dataset is huge, this eats up a ton of space. Instead, what if we process data lazily, storing one row in memory at a time?
def hourly_rate_gen(payments):
    """Generator that yields each salaried worker's hourly rate."""
    for worker in payments:
        if worker.get('pay_class') == 'annual':
            hourly = worker['rate'] / 2080
            yield hourly
# Sum hourly rates for those receiving an annual salary.
salariesPerHour = sum(hourly_rate_gen(employeeDatabase))
print(f"Total disbursements per hour for salaried employees: ${salariesPerHour:.2f}")
Total disbursements per hour for salaried employees: $88.46
A return statement is your signal that every output being produced will be held in memory at the same time and provided (returned) all at once.
- If a function returns a list of 1 thousand items, all 1 thousand are stored in memory before the end of execution.
In a generator, the yield statement signals that execution can proceed one at a time. When yield is executed, the generator pauses, retaining the generator's state until the next time it is called.
- Lazy outputs: Each output that a generator produces is yielded, then discarded before the next output is yielded.
- Lazy inputs: A generator can also stream input data, but you have to write it that way. For example,
for worker in payments
above is a for loop that streams one element (one worker's information) from the employeeDatabase list at a time.
Tip: Generator pipelines are a powerful workflow for GIS and remote sensing. Use multiple generators to string tasks together lazily. These are hugely helpful for complex spatial analysis workflows, such as raster processing.
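A minimal sketch of such a pipeline, using made-up tract dictionaries rather than real raster data; each stage consumes the previous one lazily:
def read_tracts(records):
    # Stage 1: stream records one at a time instead of loading them all.
    for record in records:
        yield record
def land_area(tracts):
    # Stage 2: compute land area per tract.
    for tract in tracts:
        yield tract["area"] - tract.get("area_water", 0)
def large_only(areas, threshold=50):
    # Stage 3: keep only areas at or above the threshold.
    for area in areas:
        if area >= threshold:
            yield area
records = [{"area": 100, "area_water": 20}, {"area": 40}, {"area": 90, "area_water": 10}]
print(sum(large_only(land_area(read_tracts(records)))))   # 160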
Iteration, continued: List comprehension vs. generator expression¶
Generator expressions (aka generator comprehensions) are concise, one-line generators. Generator expressions can be a handy replacement for list comprehensions.
Let's look at how the analysis above would appear in list comprehension format.
hourly = [worker['rate'] / 2080 for worker in employeeDatabase if worker.get('pay_class') == 'annual']
salariesPerHour = sum(hourly)
print(f"${salariesPerHour:.2f}")
$88.46
As with the function, the list comprehension constructs a list of n values. Then, we use sum() to add all values in the list together.
A generator expression looks almost identical to a list comprehension: simply swap out square brackets with parentheses.
hourly = (worker['rate'] / 2080 for worker in employeeDatabase if worker.get('pay_class') == 'annual')
salariesPerHour = sum(hourly)
print(f"${salariesPerHour:.2f}")
$88.46
Profiling: finding bottlenecks¶
Profiling is any technique used to measure the performance of your code, in particular its speed. There are dozens of tools available for profiling. We'll use a few to:
- Check memory use: Use sys.getsizeof() to check the memory size of variables.
- Spot-profile your code: Use the timeit notebook magic to perform some basic profiling by cell or by line.
- Profile your script comprehensively: The cProfile module can break execution down call by call to determine the number of calls and the total time spent on each.
Check memory use with getsizeof()
¶
Use this tool to quickly check how much memory a variable is taking up on your system.
import sys
tract1 = {
"area": 100,
"area_water": 20,
"population": 1000
}
print(f"Bytes: {sys.getsizeof(tract1)}")
Bytes: 184
print(f"Bytes: {sys.getsizeof(recaman_list(1000))}")
print(f"Bytes: {sys.getsizeof(recaman_set(1000))}")
Bytes: 8856 Bytes: 32984
"You said sets were better than lists!"
Remember, sets are preferred over lists for membership lookup because they are faster, not slimmer.
- If you care more about output size, make a list; it takes up less memory.
- If you care more about task speed, make a set.
Spot-check speed with %%timeit
¶
The timeit
module measures the execution time of a selection of code. Among the many ways you'll see it written are "magic" commands:
%timeit
is a form of line magic. Line magic arguments only extend to the end of the current line.%%timeit
is a form of cell magic. It measures the execution time of the entire notebook cell.
With both of these commands, the notebook will test your code multiple times and print the average speed of those calls.
%%timeit
# Cell magic example
from typing import NamedTuple
class Tract(NamedTuple):
    population: int
    households: int
tract1 = Tract(1000, 500)
tract2 = Tract(2000, 800)
tract3 = Tract(5000, 3000)
tract1.households
113 µs ± 24.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# Line magic example
%timeit sum(hourly_rate(employeeDatabase))
%timeit sum(hourly_rate_gen(employeeDatabase))
859 ns ± 222 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) 712 ns ± 10.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
timeit
tip: Optionally, you can limit the number of calls and repetitions with:
- -n: number of times to execute the main statement (size of each sample to time)
- -r: number of times to repeat the timer (number of samples)
%timeit -n 1 -r 5 sum(hourly_rate_gen(employeeDatabase))
The slowest run took 5.31 times longer than the fastest. This could mean that an intermediate result is being cached. 2.66 µs ± 1.91 µs per loop (mean ± std. dev. of 5 runs, 1 loop each)
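These magics only work in a notebook. In a plain Python script you can get similar measurements from the standard-library timeit module; a small sketch reusing hourly_rate_gen and employeeDatabase from above:
import timeit
# Total seconds for 100,000 calls; divide by the call count for the per-call time.
seconds = timeit.timeit(
    "sum(hourly_rate_gen(employeeDatabase))",
    globals=globals(),
    number=100_000,
)
print(f"{seconds / 100_000:.2e} seconds per call")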
Profile with cProfile
¶
Whereas timeit
is a quick way to test speed, cProfile
is useful as a comprehensive and holistic code profiler. Some perks of cProfile
:
- Compare which lines take longest to execute
- See how often a function is executed
- Sort profiling results by time
- See the respective data the function interacts with
- Print detailed reports with multiple statistics
Let's take a look:
import cProfile
cProfile.run('recaman_list(10000)')
20002 function calls in 0.406 seconds Ordered by: standard name ncalls tottime percall cumtime percall filename:lineno(function) 9999 0.394 0.000 0.394 0.000 <ipython-input-4-1cfc8d8a116c>:1(recaman_check) 1 0.011 0.011 0.406 0.406 <ipython-input-4-1cfc8d8a116c>:4(recaman_list) 1 0.000 0.000 0.406 0.406 <string>:1(<module>) 1 0.000 0.000 0.406 0.406 {built-in method builtins.exec} 9999 0.001 0.000 0.001 0.000 {method 'append' of 'list' objects} 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Above the table, you are given the number of function calls and how long the code took overall.
The field cumtime is the cumulative time it took to call a given function, including all of its subfunctions.
cProfile.run('recaman_set(10000)')
20002 function calls in 0.010 seconds Ordered by: standard name ncalls tottime percall cumtime percall filename:lineno(function) 9999 0.003 0.000 0.003 0.000 <ipython-input-4-1cfc8d8a116c>:1(recaman_check) 1 0.006 0.006 0.010 0.010 <ipython-input-7-c2c0edd0cc91>:1(recaman_set) 1 0.000 0.000 0.010 0.010 <string>:1(<module>) 1 0.000 0.000 0.010 0.010 {built-in method builtins.exec} 9999 0.001 0.000 0.001 0.000 {method 'add' of 'set' objects} 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
These results show that the set-based function executed the same number of calls (20,002), but ran about 40x faster.
cProfile
tip: Use cProfile.Profile() as a context manager!
with cProfile.Profile() as pr:
    def recaman_check(cur, i, visited):
        return (cur - i) < 0 or (cur - i) in visited
    def recaman_set(n: int) -> set[int]:
        """
        return a set of the first n numbers of the Recaman series
        """
        visited_set = {0}
        current = 0
        for i in range(1, n):
            if recaman_check(current, i, visited_set):
                current += i
            else:
                current -= i
            visited_set.add(current)
        return visited_set
    recaman_set(1000)
pr.print_stats('line') # Order by line number.
2008 function calls in 0.001 seconds Ordered by: line number ncalls tottime percall cumtime percall filename:lineno(function) 999 0.000 0.000 0.000 0.000 {method 'add' of 'set' objects} 1 0.000 0.000 0.000 0.000 {built-in method builtins.hasattr} 1 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance} 1 0.000 0.000 0.000 0.000 {built-in method builtins.len} 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects} 999 0.000 0.000 0.000 0.000 <ipython-input-22-3b8d66b9f081>:2(recaman_check) 1 0.001 0.001 0.001 0.001 <ipython-input-22-3b8d66b9f081>:5(recaman_set) 1 0.000 0.000 0.000 0.000 cProfile.py:41(print_stats) 1 0.000 0.000 0.000 0.000 cProfile.py:51(create_stats) 1 0.000 0.000 0.000 0.000 pstats.py:108(__init__) 1 0.000 0.000 0.000 0.000 pstats.py:118(init) 1 0.000 0.000 0.000 0.000 pstats.py:137(load_stats)
Exercises¶
Section 2 exercises summary
- Tuple-based storage
- Set-based look-up
- Generator expression
- Generator
- Compare differences in speed with timeit (%timeit line magic and %%timeit cell magic)
- Check for speed bottlenecks in detail with cProfile
- Stretch goal: Raster generator
1) Tuple-based storage¶
The code below creates a list containing all years in a research study timeframe, from 1900 to 2030.
The values in this collection will not need to be changed because the study will always use this timeframe.
import sys
def listFromRange(r1, r2):
    """Create a list from a range of values"""
    return [item for item in range(r1, r2+1)]
start = 1900
end = 2030
studyYears = listFromRange(start, end)
print(studyYears)
print("Bytes used: ", sys.getsizeof(studyYears))
[range(1900, 2031)] Bytes used: 64
Your turn: For the same timeframe, write a different implementation using a storage option that takes up less memory.
2) Set-based look-up¶
The code below assigns a collection of placenames to a list. Then, it checks whether a placename is in the list. If not, the placename is reported missing.
If you have 1 million placenames to look up and 6 names in the list, that’s up to 6 million checks.
placeNames_list = ["Kinshasa", "Duluth", "Uruguay", "Doherty Residence", "Dinkytown", "Khazad-dum"]
# List look-up
if "Dinkytown" not in placeNames_list:
print("Missing.") # O(n) look-up
Your turn: Write a different implementation using a storage option that allows quicker checks for membership at scale.
3) Generator expression¶
You have a list of random strings which contain a combination of upper and lowercase letters. You have written a list comprehension, lowerCase, to rewrite all of these strings into lowercase.
import random
import string
# Input dataset: A list of random strings. Each string is 8 letters long.
randomStrings = [''.join(random.choices(string.ascii_letters, k=8)) for i in range(10)]
print(randomStrings)
# Convert all strings to lowercase
lowerCase = [x.lower() for x in randomStrings]
print(lowerCase)
['quCxGiYN', 'cUDXcQBk', 'yOVxBjyl', 'QKznqHJV', 'KwkxAbra', 'hjLXVdAh', 'lppRGHIB', 'VoDgKHws', 'mCzLrskq', 'ovkTIIYS'] ['qucxgiyn', 'cudxcqbk', 'yovxbjyl', 'qkznqhjv', 'kwkxabra', 'hjlxvdah', 'lpprghib', 'vodgkhws', 'mczlrskq', 'ovktiiys']
Your turn: Write a different implementation that still prints all the lowercase results, but operates faster than a list comprehension (when used with a large dataset).
4) Generator¶
The following function compares the length of each input dataset to that of a primary list. If the input and primary lists are the same length, the function calculates their difference and returns the result.
# The list that each dataset will be compared to.
primary = [4, 7, 140, 55, 7, 91, 6]
# Input datasets
inputs = (
[0, 3, 40, 55, 6, 98, 4],
[5, 4, 3, 45, 1, 67, 2],
[7, 150, 0.5, 1]
)
def matchingStructure(inputsList, primList):
    """
    This function compares the length of each input collection to the primary
    list. An input that matches in length is subtracted from the primary list and
    the difference is appended to the results list.
    """
    results = []
    for item in inputsList:
        if len(item) == len(primList):
            difference = [b - a for a, b in zip(item, primList)]
            results.append(difference)
    return results
print(matchingStructure(inputs, primary))
[[4, 4, 100, 0, 1, -7, 2], [-1, 3, 137, 10, 6, 24, 4]]
Your turn: Write a different implementation that uses a generator instead of a function to compare lengths and calculate results.
5) Compare differences in speed using timeit
¶
5.1) %timeit
line magic¶
Using %timeit line magic, compare the time it takes each expression below to run.
[i for i in range(50) if i % 2 == 0]
(i for i in range(50) if i % 2 == 0)
5.2) %%timeit
cell magic¶
Using %%timeit
cell magic, calculate the time it takes this cell to run.
Set the command to execute the main statement only once and repeat the timer only once.
employeeDatabase = [
{'lastName': 'Knope', 'rate': 72000, 'pay_class': 'annual'},
{'lastName': 'Gergich', 'rate': 17, 'pay_class': 'hourly'},
{'lastName': 'Ludgate', 'rate': 60000, 'pay_class': 'annual'},
{'lastName': 'Swanson', 'rate': 'redacted', 'pay_class': 'redacted'},
{'lastName': 'Haverford', 'rate': 52000, 'pay_class': 'annual'}
]
def hourly_rate(payments):
    """Function that returns each salaried worker's hourly rate."""
    hourlyRates = []
    for worker in payments:
        if worker.get('pay_class') == 'annual':
            hourly = worker['rate'] / 2080
            hourlyRates.append(hourly)
    return hourlyRates
# Sum hourly rates for those receiving an annual salary.
salariesPerHour = sum(hourly_rate(employeeDatabase))
print(f"Total disbursements per hour for salaried employees: ${salariesPerHour:.2f}")
6) Check for speed bottlenecks in detail using cProfile
¶
Using cProfile
, answer these questions about the following lines of code:
- How long does everything in this cell take to execute?
- Which item takes the longest time to execute? Tip: Sort by cumtime to find hotspots more easily.
dataList = [x for x in range(1, 10_000_000)]
dataTuple = tuple(x for x in range(1, 10_000_000))
listFromList = []
listFromTuple = []
for item in dataList:
    new = item + 1
    listFromList.append(new)
for item in dataTuple:
    new = item + 1
    listFromTuple.append(new)
7) Stretch Goal: Raster generator¶
Let's say you have a raster depicting 500 square meter population density (people per 500m²) across a country. That's a huge dataset! You want to resample the raster down to 1 square kilometer (people per 1km²) to make it easier to work with.
To do this, you have written a function that creates a new raster of 1km² grid cells. Each 1km² cell contains the total population of all 500m² cells within it.
import numpy as np
# Starting dataset: 80x80 grid of people per 500m².
highResPop = np.ones((80, 80)) * 5
Note: The example here uses arrays to represent the rasters for simplicity, and each 500m² cell contains exactly 5 people.
def densityKM(popArray):
    """
    Function that returns population density per km² cell from a
    500 m² resolution population source.
    Input: 500x500m 2D array
    Output: 1x1km 2D array, covering the same area of interest.
    """
    group_size = 20 # Every 20x20 group of 500m² cells equals 1km².
    rows, cols = popArray.shape
    # Aggregate
    kmArray = popArray.reshape(
        rows // group_size, group_size,
        cols // group_size, group_size
    )
    # Sum over each group
    kmDensity = kmArray.sum(axis=(1, 3))
    # Output
    return kmDensity
densityKM(highResPop)
array([[2000., 2000., 2000., 2000.], [2000., 2000., 2000., 2000.], [2000., 2000., 2000., 2000.], [2000., 2000., 2000., 2000.]])
Your turn: Write a different implementation using a generator. As an extra challenge, try to find a way to avoid storing your entire km² array in memory. (Instead, process one group of 20x20 cells at a time).
Exercise Answers¶
The code cells below are example answers to the workshop exercises. They are useful if you get stuck and need a hint or if you want to use them as a comparison with your own attempts.
1.1) Use unpacking for pretty printing¶
counties = ["Anoka", "Dakota", "Carver", "Hennepin", "Ramsey", "Scott", "Washington"]
print(*counties, sep='\n')
Anoka Dakota Carver Hennepin Ramsey Scott Washington
1.2) Use try/except¶
from math import inf
from typing import NamedTuple
class Record(NamedTuple):
    total_population: int
    population_in_poverty: int
record1 = Record(5000, 200)
record2 = Record(200, 0)
for field in Record._fields:
    try:
        ratio = getattr(record1, field) / getattr(record2, field)
    except ZeroDivisionError:
        ratio = inf
    print(ratio)
25.0 inf
1.3) Use standard library data classes¶
from dataclasses import dataclass
@dataclass
class Record:
    total_population: int
    population_in_poverty: int
record = Record(5000, 200)
record.total_population = 6000
print(record)
Record(total_population=6000, population_in_poverty=200)
1.4) Use the built-in min and max functions¶
from random import randint
nums = [randint(-1000, 1000) for i in range(20)]
print(max(nums), min(nums))
891 -899
1.5) Open a file with a context manager¶
with open("exercise.txt", "w") as f:
f.write("This is example text for an exercise.")
2.1) Tuple-based storage¶
import sys
def tupleFromRange(r1, r2):
    """Create a tuple from a range of values"""
    return tuple(range(r1, r2+1))
start = 1900
end = 2030
studyYears = tupleFromRange(start, end)
print(studyYears)
print("Bytes used: ", sys.getsizeof(studyYears))
(1900, 1901, 1902, 1903, 1904, 1905, 1906, 1907, 1908, 1909, 1910, 1911, 1912, 1913, 1914, 1915, 1916, 1917, 1918, 1919, 1920, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024, 2025, 2026, 2027, 2028, 2029, 2030) Bytes used: 1088
2.2) Set-based look-up¶
placeNames_list = ["Kinshasa", "Duluth", "Uruguay", "Doherty Residence", "Dinkytown", "Khazad-dûm"]
placeNames_set = set(placeNames_list)
# Set look-up
if "Dinkytown" not in placeNames_set:
print("Missing.") # O(1) look-up
2.3) Generator expression¶
import random
import string
# Input dataset: A list of random strings. Each string is 8 letters long.
randomStrings = [''.join(random.choices(string.ascii_letters, k=8)) for i in range(10)]
print(randomStrings)
# Convert all strings to lowercase
lowerCase_gen = (x.lower() for x in randomStrings)
for x in lowerCase_gen:
    print(x)
['qwUYHnBg', 'yypGsKIi', 'bRVmubKp', 'RbEdRWKF', 'zCRwzvBG', 'swaWiWNs', 'gXndKiBa', 'acRnakiC', 'kuElpvqh', 'fvyfTINZ'] qwuyhnbg yypgskii brvmubkp rbedrwkf zcrwzvbg swawiwns gxndkiba acrnakic kuelpvqh fvyftinz
2.4) Generator¶
# The list that each dataset will be compared to.
primary = [4, 7, 140, 55, 7, 91, 6]
# Input datasets
inputs = (
[0, 3, 40, 55, 6, 98, 4],
[5, 4, 3, 45, 1, 67, 2],
[7, 150, 0.5, 1]
)
def matchingStructure_gen(inputsList, primList):
    """
    This generator compares the length of each input collection to the primary
    list. An input that matches in length is subtracted from the primary list and
    the difference is yielded.
    """
    for item in inputsList:
        if len(item) == len(primList):
            difference = [b - a for a, b in zip(item, primList)]
            yield difference
for item in matchingStructure_gen(inputs, primary):
    print(item)
[4, 4, 100, 0, 1, -7, 2] [-1, 3, 137, 10, 6, 24, 4]
2.5) Compare differences in speed using timeit
¶
2.5.1) %timeit
line magic¶
%timeit [i for i in range(50) if i % 2 == 0]
%timeit (i for i in range(50) if i % 2 == 0)
4.2 µs ± 833 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 505 ns ± 19.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
2.5.2) %%timeit
cell magic¶
%%timeit -n 1 -r 1
employeeDatabase = [
{'lastName': 'Knope', 'rate': 72000, 'pay_class': 'annual'},
{'lastName': 'Gergich', 'rate': 17, 'pay_class': 'hourly'},
{'lastName': 'Ludgate', 'rate': 60000, 'pay_class': 'annual'},
{'lastName': 'Swanson', 'rate': 'redacted', 'pay_class': 'redacted'},
{'lastName': 'Haverford', 'rate': 52000, 'pay_class': 'annual'}
]
def hourly_rate(payments):
    """Function that returns each salaried worker's hourly rate."""
    hourlyRates = []
    for worker in payments:
        if worker.get('pay_class') == 'annual':
            hourly = worker['rate'] / 2080
            hourlyRates.append(hourly)
    return hourlyRates
# Sum hourly rates for those receiving an annual salary.
salariesPerHour = sum(hourly_rate(employeeDatabase))
print(f"Total disbursements per hour for salaried employees: ${salariesPerHour:.2f}")
Total disbursements per hour for salaried employees: $88.46 79.4 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
2.6) Check for speed bottlenecks in detail with cProfile
¶
import cProfile
with cProfile.Profile() as pr:
    dataList = [x for x in range(1, 10_000_000)]
    dataTuple = tuple(x for x in range(1, 10_000_000))
    listFromList = []
    listFromTuple = []
    for item in dataList:
        new = item + 1
        listFromList.append(new)
    for item in dataTuple:
        new = item + 1
        listFromTuple.append(new)
pr.print_stats('cumtime')
30000008 function calls in 4.517 seconds Ordered by: cumulative time ncalls tottime percall cumtime percall filename:lineno(function) 10000000 1.981 0.000 1.981 0.000 <ipython-input-7-233c72c0aca7>:5(<genexpr>) 19999998 1.670 0.000 1.670 0.000 {method 'append' of 'list' objects} 1 0.866 0.866 0.866 0.866 <ipython-input-7-233c72c0aca7>:4(<listcomp>) 1 0.000 0.000 0.000 0.000 cProfile.py:41(print_stats) 1 0.000 0.000 0.000 0.000 pstats.py:108(__init__) 1 0.000 0.000 0.000 0.000 pstats.py:118(init) 1 0.000 0.000 0.000 0.000 pstats.py:137(load_stats) 1 0.000 0.000 0.000 0.000 cProfile.py:51(create_stats) 1 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance} 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects} 1 0.000 0.000 0.000 0.000 {built-in method builtins.len} 1 0.000 0.000 0.000 0.000 {built-in method builtins.hasattr}
2.7) Stretch goal: Raster generator¶
# # # Exercise solution, version 1 # # #
import numpy as np
# Starting dataset: 80x80 grid of people per 500m².
highResPop = np.ones((80, 80)) * 5
def densityKM_gen(popArray):
    """
    Generator that yields rows of population density per km² cells from a
    500 m² resolution population source.
    Input: 500x500m 2D array
    Output: Each yield output is a 1D array representing one row of densities.
    """
    group_size = 20
    rows, cols = popArray.shape
    # Aggregate
    kmArray = popArray.reshape(
        rows // group_size, group_size,
        cols // group_size, group_size
    )
    # Sum over each group
    kmDensity = kmArray.sum(axis=(1, 3))
    for row in kmDensity:
        yield row # Now yields an array
for row in densityKM_gen(highResPop):
    print(row)
[2000. 2000. 2000. 2000.] [2000. 2000. 2000. 2000.] [2000. 2000. 2000. 2000.] [2000. 2000. 2000. 2000.]
# # # Exercise solution, version 2 (even more memory efficient) # # #
import numpy as np
# Starting dataset: 80x80 grid of people per 500m².
highResPop = np.ones((80, 80)) * 5
def densityKM_gen2(popArray):
    """
    Generator that yields rows of population density per km² cells from a
    500 m² resolution population source.
    Unlike Solution Version 1, this generator does not create the entire km²
    array in memory. It saves memory by processing one group of 20x20
    cells at a time.
    Input: 500x500m 2D array
    Output: Each yield is a 1D NumPy array representing one row of km²
    densities, processed group by group.
    """
    group_size = 20
    rows, cols = popArray.shape
    num_row_blocks = rows // group_size
    num_col_blocks = cols // group_size
    for i in range(num_row_blocks):
        row_densities = []
        row_start = i * group_size
        for j in range(num_col_blocks):
            col_start = j * group_size
            block = popArray[row_start:row_start + group_size,
                             col_start:col_start + group_size]
            density = block.sum()
            row_densities.append(density)
        yield np.array(row_densities)
for row in densityKM_gen2(highResPop):
    print(row)
[2000. 2000. 2000. 2000.] [2000. 2000. 2000. 2000.] [2000. 2000. 2000. 2000.] [2000. 2000. 2000. 2000.]