Bytes at the Boundaries: Bytes and Unicode in Python 2
At a previous job, I was working on a Python 2 project, and everything was going well. We were blissfully unaware of the difference between bytes and unicode, and we liked it that way. Well, we didn’t like it that way, but we had other stuff to do!
Every so often, there would be a UnicodeDecodeError, and we’d struggle to guess the right combination of str.decode(), str.encode(), from __future__ import unicode_literals, and unicode() to make the error go away. It wasn’t a disciplined, understand-the-problem-and-think-carefully-of-a-solution approach, but more of a slap-these-functions-together-and-see-what-sticks one.
Eventually, enough of these errors accumulated that we were forced to sit down and understand what this UnicodeDecodeError was all about. Luckily for you, we’ve prepared some tips that should help you avoid these errors yourself.
How does it happen?
Python 2 allows developers to play fast-and-loose with strings, treating them sometimes as bytes, sometimes as text.
With ASCII characters, roughly the ones you can easily type on a standard American keyboard, this worked out fine. The byte encoding of ASCII characters is the same for most common encodings.
This means that programs treating bytes directly as text often work — for a while. Then one day you might try to work from a café (with an accented é) and everything falls apart.
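Here’s roughly what that failure looks like in a Python 2 interpreter session. Mixing a unicode string with ASCII-only bytes works, because Python silently decodes the bytes using the ASCII codec; the same operation blows up as soon as a non-ASCII byte shows up:

    >>> u"hello, " + "world"        # Python 2 silently decodes the bytes as ASCII
    u'hello, world'
    >>> u"one " + "caf\xc3\xa9"     # the UTF-8 bytes for the accented e
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)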
What’s the difference anyway?
Think of unicode as representing abstract text. A unicode string is a sequence of characters. Meanwhile, bytes are a sequence of… bytes.
Ok, it may be easiest to understand with an example.
What’s len('a')? Definitely 1, a single character.
What about len('🐍')? In Python 2, the answer was 4. In Python 3, the answer is 1. Why this discrepancy?
🐍, like all emoji, is a single character. But it’s not an ASCII character, so it isn’t typically encoded into binary as a single byte. Between emoji, accented letters, Chinese, Japanese, and Korean characters, and all the other glyphs that are unicode characters, there are too many for each of them to get a unique 8-bit encoding. Instead, we usually give these characters multi-byte encodings. So even though 🐍 looks like it’s just a single character, we usually use multiple bytes when we store it in a file or transmit it over the network. Python 3 enforces the separation between semantic (unicode) characters and bytes by requiring different types for them.
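You can watch this distinction in a Python 3 session, where text and bytes are separate types:

    >>> len('🐍')                      # a str: one character of text
    1
    >>> len('🐍'.encode('utf-8'))      # the same character encoded as UTF-8 bytes
    4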
Unicode characters are the things we type to communicate with one another. Bytes are the things computers use to communicate with one another. An encoding is how we convert from bytes to unicode, or vice versa. When we run '🐍'.encode('utf-8'), we’re asking for the representation of the snake emoji in a form that can be sent over the network or saved to disk. When we run file.read().decode('utf-8'), we’re saying “I believe this file was saved using the UTF-8 encoding, so use that encoding to turn the bytes you find into characters.”
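Concretely, here’s what that round trip looks like in Python 3, with UTF-8 as the encoding; the byte values shown are the actual UTF-8 encoding of the emoji:

    >>> '🐍'.encode('utf-8')                   # text -> bytes, ready for a file or a socket
    b'\xf0\x9f\x90\x8d'
    >>> b'\xf0\x9f\x90\x8d'.decode('utf-8')    # bytes -> text, because we know the encoding
    '🐍'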
How to deal with it?
We won’t be able to deal only in unicode strings — bytes are what come in from the network, from files, etc. All IO is bytes. The best strategy I’ve found for overcoming this problem is to use unicode strings in all the internals of your application, and encode-to and decode-from bytes at the boundaries.
“Boundaries” here means anywhere your program communicates with the outside world: files, the network, a TTY, etc. When we read from one of these places, we read bytes. When we write to one of these places we write bytes. Decoding these bytes into unicode right away will
- Keep the code that needs to understand this distinction from being spread throughout your app, and
- Crash right away if you make a mistake, rather than crashing somewhere down the line, only with specific actions, only on Tuesdays and bank holidays.
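Here’s a minimal sketch of that strategy in Python 3, using hypothetical helper names. The only functions that ever mention bytes or an encoding are the two boundary functions:

    def read_message(path):
        # Boundary: bytes come in from the file; decode them immediately.
        with open(path, 'rb') as f:
            raw = f.read()                     # bytes
        return raw.decode('utf-8')             # unicode text from here on

    def write_message(path, text):
        # Boundary: encode just before the bytes leave the program.
        with open(path, 'wb') as f:
            f.write(text.encode('utf-8'))      # text -> bytes

    def shout(text):
        # Everything in between deals only in text.
        return text.upper()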
Why can’t Python deal with this automatically?
Even though you’ll usually be using the UTF-8 encoding, that won’t always be the case. There are a lot of character encodings. Sometimes a stream of bytes doesn’t even represent characters — it might be an image or another binary file.
Also, consider that a binary stream isn’t self-describing. Is it an image? A string of ASCII characters? A string of Cyrillic characters encoded using ISO 8859-5? You can’t know without some out-of-band signal.
That is, the meaning of the stream must be communicated over some other channel.
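For instance, the same two bytes read very differently depending on which encoding we guess (a Python 3 session; the byte values are just an illustration):

    >>> data = b'\xc3\xa9'
    >>> data.decode('utf-8')         # the accented e from our café
    'é'
    >>> data.decode('iso-8859-5')    # the same bytes, read as Cyrillic
    'УЉ'
    >>> data.decode('latin-1')       # or as two Latin-1 characters
    'Ã©'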
Common channels for this are HTTP headers (Content-Type: application/json; charset=utf-8), file extensions (files that end with “.png” are almost always PNG images), convention (“I promise to only send you files encoded with UTF-32”), or guessing (“This is probably UTF-8 because I expected text and almost all modern text is UTF-8”).
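As a small illustration of the first channel, Python’s standard library will hand you the charset declared in the headers, if the server sent one. The URL and the UTF-8 fallback here are just placeholders:

    from urllib.request import urlopen

    with urlopen('https://example.com/') as resp:
        # The Content-Type header is our out-of-band signal, when it's present.
        charset = resp.headers.get_content_charset() or 'utf-8'   # otherwise, fall back to a guess
        text = resp.read().decode(charset)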
Python 2 did try to deal with this automatically: it assumed bytes were ASCII unless told otherwise. Unfortunately, that assumption was sometimes wrong. And as the Zen of Python advises, “Explicit is better than implicit.”
Further Reading
Ned Batchelder’s Pragmatic Unicode, or, How do I stop the pain presentation is clear and offers Python-specific advice.
Joel Spolsky’s The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) gives more historical context around character sets.
Devetry’s How to Scope a Successful Python3 Migration checklist is a useful tool if you’re planning on upgrading from Python 2.7 before it reaches end-of-life in 2020.