couple your generators with context managers!
As part of my AI-Powered Search chapters, I’m cleaning up Hello LTR’s Python API to make the code more readable as book examples. A big part of the API is working with search training data, known as judgments: mappings of keywords to documents, along with a grade (4 = relevant, 0 = irrelevant).
You can imagine a CSV file:
```
Keywords,document,grade
Rambo,First Blood,4
Rambo,Rambo III,3
Rambo,Chocolat,0
Batman,Batman Begins,4
Batman,Catwoman,3
```
I need to scan over large numbers of these 3-tuples to gather features for each line (things like the keyword’s TF*IDF score against the movie’s description, the movie’s release date, etc.). All of this is ultimately converted into a complete training set for a machine learning library.
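For context, a `parse_row` helper for rows like the CSV above might look like this (a toy sketch – the name `parse_row` appears in the code later in this post, but this body is my assumption, not the book’s actual implementation):

```python
def parse_row(line):
    """Split one CSV row into a (keywords, document, grade) judgment tuple.
    Toy sketch: assumes no quoted commas inside the fields."""
    keywords, document, grade = line.strip().split(',')
    return keywords, document, int(grade)

parse_row("Rambo,First Blood,4")  # -> ('Rambo', 'First Blood', 4)
```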
Enter the generators – along with a bug
How to do this?
We might eagerly loop through the file, and parse our judgments like so:
```python
def judgments_from_file(f):
    judgments = []
    for line in f:
        judgments.append(parse_row(line))
    return judgments
```
Then we can:
```python
judgments = []
with open('judgments.txt') as f:
    # Get the judgments from the file...
    judgments = judgments_from_file(f)
```
And some time later we pass `judgments` along to functions like `gather_features` and `train_model`. This all works hunky-dory. But of course I’d rather not load all that data into memory until I actually go to use it. I’d rather use a Python generator, something like:
```python
def judgments_from_file(f):
    for line in f:
        yield parse_row(line)
```
The generator will now only pull data when we need it.
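To see that laziness in action, here’s a toy sketch that records each parse in a list (the `calls` bookkeeping is mine, purely for illustration):

```python
calls = []

def parse_row(line):
    calls.append(line)  # record that parsing work actually happened
    return line.strip().split(',')

def judgments_from_file(f):
    for line in f:
        yield parse_row(line)

judgments = judgments_from_file(iter(["Rambo,First Blood,4",
                                      "Batman,Batman Begins,4"]))
assert calls == []                       # creating the generator did no work
first = next(judgments)                  # pulling one value parses one row
assert calls == ["Rambo,First Blood,4"]
```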
However, can you spot the bug we introduced by changing `judgments_from_file` into a generator?
Here’s a hint: somewhere in `gather_features` we will use the `judgments` argument:

```python
for j in judgments:
    process(j)
```
Spot the bug now?
`judgments` is now a generator, not a list. The `judgments_from_file` code runs lazily – much later than the eager version, at the `for j in judgments:` line. The first thing `judgments_from_file` does is try to loop over the lines in the file (`for line in f`). And BLAMMO! Here `f` is a closed file. It’s been closed since we exited the `with` block, much earlier in the code.
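Here’s a minimal, self-contained reproduction of the crash (a temp file stands in for judgments.txt, and `parse_row` is a toy stand-in):

```python
import os
import tempfile

def parse_row(line):
    return line.strip().split(',')

def judgments_from_file(f):
    for line in f:
        yield parse_row(line)

with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write("Rambo,First Blood,4\n")
    path = tmp.name

with open(path) as f:
    judgments = judgments_from_file(f)  # no work done yet!

# f is closed now; consuming the generator blows up:
try:
    for j in judgments:
        pass
except ValueError as e:
    error = str(e)  # e.g. "I/O operation on closed file"

os.unlink(path)
```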
I’ve been bitten by this mistake a few times now, in multiple situations. It’s made me annoyed at Python generators. This behavior seems to betray a concept known as uniformity of access. To quote Bertrand Meyer:
> All services offered by a module should be available through a uniform notation, which does not betray whether they are implemented through storage or through computation.
I feel the consumer shouldn’t need to care if it’s a lazily generated iterator or a list in memory. It should work the same. Of course, the world is never so perfect, and my brain hurts thinking too hard about variable lifetimes and language design.
Context Managers to the Rescue
I realized, after reading Fluent Python, that perhaps I could solve this by tying the lifetime of my `judgments` directly to the `with` block. We can create our own context managers with custom behavior on entering and exiting the context (the `__enter__` and `__exit__` methods).
In other words, instead of:

```python
with open('judgments.txt') as f:
    ...
```
I might more safely do:

```python
with judgments_open('judgments.txt') as judgments:
    gather_features(judgments)
    train_model(judgments)
```
With a `judgments_open` function, I can place the `judgments` variable into this `with` block. This is the generator from above. Tying it to a `with` block gives the programmer a solid contract on that variable’s lifetime.
How to do this? Turns out it’s pretty easy! All it takes is a little wrapper over the `judgments_from_file` generator, aided by the `contextmanager` decorator:

```python
from contextlib import contextmanager

@contextmanager
def judgments_open(path=None):
    """Read judgments from the filesystem."""
    f = open(path, 'r')
    try:
        yield judgments_from_file(f)  # <- 'with' runs to here; this becomes the var tied to the with block
    finally:
        f.close()  # <- runs after the 'with' context ends (or on an exception)
```
`judgments_open` will run up to the `yield`, yielding the return value of `judgments_from_file` to the context (this becomes the `judgments` variable in `with judgments_open(...) as judgments:`). Then, when we’re all done in the `with` block (or there’s an exception), the rest of `judgments_open` runs – closing the file.
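Putting the whole pattern together (a self-contained sketch: `parse_row` is a toy stand-in, and a temp file stands in for judgments.txt):

```python
import os
import tempfile
from contextlib import contextmanager

def parse_row(line):
    keywords, document, grade = line.strip().split(',')
    return keywords, document, int(grade)

def judgments_from_file(f):
    for line in f:
        yield parse_row(line)

@contextmanager
def judgments_open(path):
    f = open(path, 'r')
    try:
        yield judgments_from_file(f)  # the generator handed to the 'with' block
    finally:
        f.close()                     # runs when the block exits, even on error

with tempfile.NamedTemporaryFile('w', delete=False) as tmp:
    tmp.write("Rambo,First Blood,4\nBatman,Batman Begins,4\n")
    path = tmp.name

with judgments_open(path) as judgments:
    rows = list(judgments)  # the file is guaranteed open for the whole block

os.unlink(path)
```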
Now of course, I could still do something intentionally stupid, like saving off a reference to `judgments` and doing further work with it afterwards. But I’d know I was shooting myself in the foot – its lifetime is clearly scoped to the `with` block.
The Lesson: Lazy Generators <3 Context Managers
Our data is often lazily generated from a source – a file, a socket, or a database. Under the hood, these sources have their own lifetimes to be managed. We can’t just willy-nilly return generators as drop-in replacements for eagerly-built lists. We need to manage the generator’s lifetime, and the easiest way to do that is to always couple your generators with context managers!
In conclusion: next time you reach for a generator, think about whether you should also be reaching for a context manager ;).