- Beware of functions that iterate over input arguments multiple times. If these arguments are iterators, you may see strange behavior and missing values.
- Python's iterator protocol defines how containers and iterators interact with the iter and next built-in functions, for loops, and related expressions.
- You can easily define your own iterable container type by implementing the __iter__ method as a generator.
- You can detect that a value is an iterator (instead of a container) if calling iter on it twice produces the same result, which can then be progressed with the next built-in function.
Effective Python
Say you want to analyze tourism numbers for the U.S. state of Texas. Imagine the data set is the number of visitors to each city (in millions per year). You'd like to figure out what percentage of overall tourism each city receives.
To do this you need a normalization function. It sums the inputs to determine the total number of tourists per year. Then it divides each city's individual visitor count by the total to find that city's contribution to the whole.
def normalize(numbers):
    total = sum(numbers)
    result = []
    for value in numbers:
        percent = 100 * value / total
        result.append(percent)
    return result
>>> visits = [15, 35, 80]
>>> percentage = normalize(visits)
>>> percentage
[11.538461538461538, 26.923076923076923, 61.53846153846154]
To scale this up, you can read the data from a file, one line per city, using a generator.
def read_visits(data_path):
    with open(data_path) as f:
        for line in f:
            yield int(line)
Surprisingly, calling normalize on the generator's return value produces no results: normalize returns []. The cause of this behavior is that an iterator produces its results only a single time. If you iterate over an iterator or generator that has already raised a StopIteration exception, you won't get any results the second time around.
>>> it = read_visits('data')
>>> percentage = normalize(it)
>>> percentage
[]
>>> it = read_visits('data')
>>> list(it)
[15, 35, 80]
>>> list(it)
[]
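The same exhaustion can be reproduced without a data file. The sketch below uses a hypothetical generator, count_up_to, which is not part of the original example. It also suggests why normalize returns [] rather than failing: sum consumes every value first, so the for loop that follows has nothing left to visit.

```python
def count_up_to(limit):
    """Hypothetical generator used only for this illustration."""
    for n in range(1, limit + 1):
        yield n


it = count_up_to(3)
first_pass = list(it)   # consumes every value the generator produces
second_pass = list(it)  # the generator is already exhausted

# Calling next on an exhausted iterator raises StopIteration directly.
try:
    next(it)
    exhausted = False
except StopIteration:
    exhausted = True

print(first_pass, second_pass, exhausted)  # [1, 2, 3] [] True
```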
One solution is to explicitly exhaust the input iterator and keep a copy of its contents in a list, but this may not be a good one. The copy of the input iterator's contents could be large. Copying the iterator could cause your program to run out of memory and crash.
def normalize(numbers):
    numbers = list(numbers)  # Copy the iterator
    total = sum(numbers)
    result = []
    for value in numbers:
        percent = 100 * value / total
        result.append(percent)
    return result
Another way around this is to accept a function that returns a new iterator each time it's called, although this solution is not ideal either.
def normalize_func(get_iter):
    total = sum(get_iter())   # New iterator
    result = []
    for value in get_iter():  # New iterator
        percent = 100 * value / total
        result.append(percent)
    return result
To use normalize_func, you can pass in a lambda expression that calls the generator and produces a new iterator each time.
percentage = normalize_func(lambda: read_visits(path))
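For a version that can be run end to end, here is a minimal sketch that writes the three sample values to a temporary file first. The file name visits.txt and the use of tempfile are assumptions made for illustration; they do not come from the original example.

```python
import os
import tempfile


def read_visits(data_path):
    with open(data_path) as f:
        for line in f:
            yield int(line)


def normalize_func(get_iter):
    total = sum(get_iter())   # first fresh iterator
    result = []
    for value in get_iter():  # second fresh iterator
        result.append(100 * value / total)
    return result


with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical data file holding the sample values 15, 35, 80.
    path = os.path.join(tmp_dir, "visits.txt")
    with open(path, "w") as f:
        f.write("15\n35\n80\n")
    percentages = normalize_func(lambda: read_visits(path))

print([round(p, 1) for p in percentages])  # [11.5, 26.9, 61.5]
```

The lambda is called twice, so the file is opened and read twice, once for the sum and once for the loop.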
Though it works, having to pass a lambda function like this is clumsy. A better way to achieve the same result is to provide a new container class that implements the iterator protocol.
Iterator protocol
The iterator protocol is how Python for loops and related expressions traverse the contents of a container type. When Python sees a statement like for x in foo, it will actually call iter(foo). The iter built-in function calls the foo.__iter__ special method in turn. The __iter__ method must return an iterator object (which itself implements the __next__ special method). Then the for loop repeatedly calls the next built-in function on the iterator object until it's exhausted (and raises a StopIteration exception).
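The steps a for loop takes can be spelled out by hand. The helper below, manual_for_loop, is a hypothetical name introduced only for this sketch; it approximates what the loop does rather than reproducing the interpreter's actual implementation.

```python
def manual_for_loop(container):
    """Collect items the way a for loop would, using iter and next directly."""
    collected = []
    iterator = iter(container)     # calls container.__iter__()
    while True:
        try:
            item = next(iterator)  # calls iterator.__next__()
        except StopIteration:      # raised once the iterator is exhausted
            break
        collected.append(item)
    return collected


visits = [15, 35, 80]
print(manual_for_loop(visits))  # [15, 35, 80]
```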
class ReadVisits:
    def __init__(self, data_path):
        self.data_path = data_path

    def __iter__(self):
        with open(self.data_path) as f:
            for line in f:
                yield int(line)
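This container works with the original normalize function unchanged, because sum and the for loop each call ReadVisits.__iter__ and receive their own fresh iterator. The sketch below shows this under the assumption of a temporary file named visits.txt holding the sample values; the file name and the tempfile setup are illustrative only.

```python
import os
import tempfile


class ReadVisits:
    def __init__(self, data_path):
        self.data_path = data_path

    def __iter__(self):
        # Each call opens the file again and returns a new generator.
        with open(self.data_path) as f:
            for line in f:
                yield int(line)


def normalize(numbers):
    total = sum(numbers)   # calls numbers.__iter__() once
    result = []
    for value in numbers:  # calls numbers.__iter__() again
        result.append(100 * value / total)
    return result


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "visits.txt")  # hypothetical data file
    with open(path, "w") as f:
        f.write("15\n35\n80\n")
    visits = ReadVisits(path)
    first = normalize(visits)
    second = normalize(visits)  # the container can be reused

print([round(p, 1) for p in first])  # [11.5, 26.9, 61.5]
```

The trade-off is that the input data is read more than once, here twice per call to normalize.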
How to ensure that parameters aren't just iterators
The protocol states that when an iterator is passed to the iter built-in function, iter will return the iterator itself. In contrast, when a container type is passed to iter, a new iterator object will be returned each time. Thus, you can test an input value for this behavior and raise a TypeError to reject iterators.
def normalize(numbers):
    if iter(numbers) is iter(numbers):  # An iterator -- bad!
        raise TypeError("Must supply a container")
    total = sum(numbers)
    result = []
    for value in numbers:
        percent = 100 * value / total
        result.append(percent)
    return result
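A short sketch of how this check behaves with different inputs follows. VisitsContainer is a hypothetical in-memory stand-in for a file-backed container, introduced only so the example runs without a data file. The last lines show collections.abc.Iterator with isinstance, which is offered here as a possible alternative check and is not taken from the text above.

```python
from collections.abc import Iterator


def normalize(numbers):
    if iter(numbers) is iter(numbers):  # iterators return themselves
        raise TypeError("Must supply a container")
    total = sum(numbers)
    result = []
    for value in numbers:
        result.append(100 * value / total)
    return result


class VisitsContainer:
    """Hypothetical in-memory container used only for this illustration."""

    def __init__(self, values):
        self.values = values

    def __iter__(self):
        for value in self.values:
            yield value


from_list = normalize([15, 35, 80])                        # a list is a container
from_container = normalize(VisitsContainer([15, 35, 80]))  # so is the custom class

try:
    normalize(iter([15, 35, 80]))  # a bare iterator is rejected
    rejected = False
except TypeError:
    rejected = True

# Possible alternative check, assuming the standard library ABC is acceptable.
list_is_iterator = isinstance([15, 35, 80], Iterator)        # False
iter_is_iterator = isinstance(iter([15, 35, 80]), Iterator)  # True
```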