Thoughts and ideas: Character Encoding

Google released Colaboratory as a data science tool that provides computational resources and hosts Jupyter notebooks for experimentation. It also serves as a good demonstration of the TensorFlow machine learning library. It saves files as Python notebooks (.ipynb extension) within a designated folder inside your Google Drive account.

I got all excited about this, not least because it provides users with 13 GB of RAM and dual-core Intel Xeon processors. These are rationed from some VM, but I am not aware of the internal details.

At the time of writing, it only provides Python 2 and 3 kernels. R and other languages are supposed to be added in the future. It seems they are serious, since they added Jake Vanderplas (http://vanderplas.com/) as a visiting researcher at the beginning of this year.

I took the plunge into Colaboratory by trying to see if I could analyze my local data. There is a Jupyter notebook provided as documentation for loading data from the local machine, Google Drive, Google Sheets, and Google Cloud, as one would expect: https://colab.research.google.com/notebook#fileId=/v2/external/notebooks/io.ipynb However, it does not tell us how we can "use" those files. That seems to be outside the scope of the documentation, so it was of no help to me. I kept trying to read the file after loading it into an object:

##Fail

import pandas as pd

from google.colab import files
uploaded = files.upload()

for fn in uploaded.keys():
    print('User uploaded file "{name}" with length {length} bytes'.format(
        name=fn, length=len(uploaded[fn])))

#Check whether the file is in the current folder (I could not find it)

!ls

#I tried loading the file anyway to see if it was present

pd.read_csv('YCOM-Web2016_2017-11-29.csv')

But there was no file. It kept saying the file could not be found.

The problem was pretty trivial. It turns out that uploaded files are "never" written to the hard drive; they are stored in RAM as Python objects, and we need to work with those objects. I found the solution through this Stack Overflow link:

https://stackoverflow.com/questions/48340341/read-csv-to-dataframe-in-google-colab
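
As a minimal sketch of what "work with those objects" means: files.upload() returns a dictionary that maps each file name to its raw contents as a bytes object. The snippet below uses a hypothetical stand-in dictionary so that it runs outside Colab, and the standard-library csv module instead of pandas. The file name and contents are made up for illustration.

```python
import csv
import io
import os
import tempfile

# Hypothetical stand-in for the dict returned by files.upload():
# keys are file names, values are the raw file contents as bytes.
uploaded = {"example.csv": b"state,value\nCA,1\nNY,2\n"}

raw = uploaded["example.csv"]

# Option 1: decode the bytes and wrap the text in a file-like object.
rows = list(csv.reader(io.StringIO(raw.decode("utf-8"))))

# Option 2: write the bytes to disk so that path-based readers can find the file.
path = os.path.join(tempfile.mkdtemp(), "example.csv")
with open(path, "wb") as fh:
    fh.write(raw)

with open(path, newline="", encoding="utf-8") as fh:
    rows_from_disk = list(csv.reader(fh))

print(rows)
print(rows == rows_from_disk)
```

With pandas, the equivalent of option 1 would be passing io.BytesIO(raw) or io.StringIO(raw.decode(...)) to pd.read_csv, since read_csv accepts file-like objects as well as paths.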

##Partial solution

import pandas as pd
import io

from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
    print('User uploaded file "{name}" with length {length} bytes'.format(
        name=fn, length=len(uploaded[fn])))

##Output

YCOM-Web2016_2017-11-29.csv(text/csv) - 1243703 bytes, last modified: 1/19/2018 - %100 done
User uploaded file "YCOM-Web2016_2017-11-29.csv" with length 1243703 bytes

##End of output

df = pd.read_csv(io.StringIO(uploaded['YCOM-Web2016_2017-11-29.csv'].decode('utf-8')))

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 505403: invalid start byte

What is this UnicodeDecodeError? It turns out that my file was not UTF-8 encoded, as I had assumed. I needed to decode the uploaded bytes with the right encoding so that I got a string the pandas library could swallow. Again, Google came to the rescue: I found this page, which lists the different encodings that Python 3 supports: https://docs.python.org/3/library/codecs.html#standard-encodings
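
To see what the error message is complaining about, it helps to reproduce it on a tiny input. In UTF-8, a byte of the form 10xxxxxx can only continue a multi-byte sequence and can never start one, and 0x96 is such a byte. The sketch below uses a made-up byte string (not the contents of my file) and reads the position and offending byte from the exception object.

```python
# Made-up sample: ASCII text with one stray 0x96 byte, as in the error above.
raw = b"2016 \x96 2017"

try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The exception records where decoding stopped and why.
bad_position = error.start
bad_byte = raw[bad_position]
print(error.reason, hex(bad_byte), "at position", bad_position)
```

The same attributes on a real upload would point at the exact byte that needs a closer look, which is often enough to guess where the file came from.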

I changed my line in the code to

df = pd.read_csv(io.StringIO(uploaded['YCOM-Web2016_2017-11-29.csv'].decode('ISO-8859-1')))
df

and it loaded in full glory!

Notice that the argument to decode changed from utf-8 to ISO-8859-1. ISO-8859-1 assigns a character to every possible byte value, so decoding with it never fails, even on bytes such as 0x96 that cannot start a UTF-8 sequence. I am still mystified, since the file was supposed to be a plain old CSV file. For future reference, always check your file encoding using this command:

file -I file_name.csv

which on my Mac shows up as "unknown-8bit". Not very helpful, but we do know that it is not plain old "us-ascii".
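
A similar check can be sketched in Python by trying a few candidate encodings in order. This is only a heuristic, and the helper name and candidate list are my own assumptions, not part of any library. A decode that succeeds is not proof that the guess is right: ISO-8859-1 accepts every byte, so it belongs last. In Windows-1252 (cp1252), byte 0x96 is an en dash, so that may well have been the real encoding of my file, though that is a guess.

```python
def first_working_encoding(data, candidates=("ascii", "utf-8", "cp1252", "iso-8859-1")):
    """Return the first candidate that decodes data without error (hypothetical helper)."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Made-up sample containing the same stray byte as the error message.
sample = b"2016 \x96 2017"

print(first_working_encoding(sample))
# Same byte, two different characters depending on the encoding chosen:
print(repr(sample.decode("cp1252")))      # 0x96 becomes U+2013 (en dash)
print(repr(sample.decode("iso-8859-1")))  # 0x96 becomes U+0096 (a control character)
```

If the text looks right under cp1252 but shows odd invisible characters under ISO-8859-1, that is a hint the file came from a Windows tool.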

TLDR: Check your encoding before you load the file and do not make quick assumptions especially when working with new systems!

Tuesday, January 23, 2018
