Handling of Unicode

This package presents all text strings as Python unicode objects. From Excel 97 onwards, text in Excel spreadsheets has been stored as UTF-16LE (a 16-bit Unicode Transformation Format). Older files (Excel 95 and earlier) don’t keep strings in Unicode; a CODEPAGE record provides a codepage number (for example, 1252) which is used by xlrd to derive the encoding (for same example: “cp1252”) which is used to translate to Unicode.

If the CODEPAGE record is missing (possible if the file was created by third-party software), xlrd will assume that the encoding is ascii, and keep going. If the actual encoding is not ascii, a UnicodeDecodeError exception will be raised and you will need to determine the encoding yourself, and tell xlrd:

book = xlrd.open_workbook(..., encoding_override="cp1252")

If the CODEPAGE record exists but is wrong (for example, the codepage number is 1251, but the strings are actually encoded in koi8_r), it can be overridden using the same mechanism.

The supplied runxlrd.py has a corresponding command-line argument, which may be used for experimentation:

runxlrd.py -e koi8_r 3rows myfile.xls

The first place to look for an encoding, the “codec name”, is the Python documentation.