Categories
file-io python python-3.x unicode zip

How to open an unicode text file inside a zip?

I tried

with zipfile.ZipFile("5.csv.zip", "r") as zfile:
for name in zfile.namelist():
with zfile.open(name, 'rU') as readFile:
line = readFile.readline()
print(line)
split = line.split('\t')

it answers:

b'$0.0\t1822\t1\t1\t1\n'
Traceback (most recent call last)
File "zip.py", line 6
split = line.split('\t')
TypeError: Type str doesn't support the buffer API

How to open the text file as unicode instead of as b?

edit For Python 3, using io.TextIOWrapper as this answer describes is the best choice. The answer below could still be helpful for 2.x. I don’t think anything below is actually incorrect even for 3.x, but io.TestIOWrapper is still better.

If the file is utf-8, this will work:

# the rest of the code as above, then:
with zfile.open(name, 'rU') as readFile:
line = readFile.readline().decode('utf8')
# etc

If you’re going to be iterating over the file you can use codecs.iterdecode, but that won’t work with readline().

with zfile.open(name, 'rU') as readFile:
for line in codecs.iterdecode(readFile, 'utf8'):
print line
# etc

Note that neither approach is necessarily safe for multibyte encodings. For example, little-endian UTF-16 represents the newline character with the bytes b'\x0A\x00'. A non-unicode aware tool looking for newlines will split that incorrectly, leaving the null bytes on the following line. In such a case you’d have to use something that doesn’t try to split the input by newlines, such as ZipFile.read, and then decode the whole byte string at once. This is not a concern for utf-8.