Docstring of the day: a mystery
I like to think I’m pretty good about writing docstrings that will be comprehensible to my successor when I leave a job or get hit by a bus, but sometimes I get a bit silly, and write riddles like…
"""Return resumes and Bjorks and pina coladas.
"""
Here’s the code to go along with it:
def mystery(s):
"""Return resumes and Bjorks and pina coladas.
"""
try:
return s.decode('ascii')
except UnicodeDecodeError:
s = s.decode('latin1')
normalized = unicodedata.normalize('NFKD', s)
return ''.join(char for char in normalized if ord(char) < 128)
If you guessed that this code strips diacritics from the given string, giving the closest possible ASCII representation of any foreign characters, then congratulations you clever so-and-so. I was inspired by a similiar routine in Google Refine. They use a sort of homebaked lookup table, rather than the pithier approach of unicode equivalence - schmucks! It turns out that normalizing foreign characters is a helpful step in clustering swathes of messy free text (something I’m doing a lot of right now at Sciencescape).