Friday, 29 January 2016

Importing CSV File in Python


CSV file is one of most common used format we encounter. Let see few technique to efficiently import these data to python for better & efficient use.
CSV is also known as comma separated format. you can easily write a code to import CSV data .
But today we will see some of very efficient way to do so.

Python provide few package like CSV, unicodecsv, pandas Thease packages help you to do your job.

  • CSV : package follow basic approach & slow , it reads data line by line & split.
  • unicodecsv : Import data & create a list of dictionaries each data is associated with its attribute as key. So it  coud be very usefull in many cases but this process is also slow
  • pandas : This technique is very fast as compare to  above two methods.
You can download & test in your computer using Ipython notebook [Github_link]

Example

In [2]:
import unicodecsv
import pprint
import csv
In [3]:
csv_file_name="enrollments.csv"

Json Import of csv file

Import data in json formate, All data are imported as string

In [16]:
enrolment = []

f=open(csv_file_name,'rb')

reader =unicodecsv.DictReader(f)
#reader is a itterater so loop is possible only once
print "type(reader) =",reader

#for each_row in reader:
#    enrolment.append(each_row)
enrolment=list(reader) #shorthand for above two line

#close file
f.close()

print "Total no of row : ",len(enrolment),"\n\n"

#print demo data
pprint.pprint(enrolment[1])
type(reader) = <unicodecsv.py2.DictReader instance at 0x04A10A08>
Total no of row :  1640 


{u'account_key': u'448',
 u'cancel_date': u'2014-11-10',
 u'days_to_cancel': u'5',
 u'is_canceled': u'True',
 u'is_udacity': u'True',
 u'join_date': u'2014-11-05',
 u'status': u'canceled'}

## Simple import





Import data Row by row
In [23]:
data = []
with open(csv_file_name, 'rb') as csvfile:
    reader = csv.reader(csvfile, delimiter=' ', quotechar='|')
    for row in reader:
        data.append(row)

print "Headings :",(data[0])
print"Data :",(data[1])
Headings : ['account_key,status,join_date,cancel_date,days_to_cancel,is_udacity,is_canceled']
Data : ['448,canceled,2014-11-10,2015-01-14,65,True,True']

Using Pandas ( Very fast than above two)

In [4]:
import pandas as pd
In [7]:
enrol = pd.read_csv(csv_file_name)
In [10]:
enrol
Out[10]:
account_key status join_date cancel_date days_to_cancel is_udacity is_canceled
0 448 canceled 2014-11-10 2015-01-14 65 True True
1 448 canceled 2014-11-05 2014-11-10 5 True True
2 448 canceled 2015-01-27 2015-01-27 0 True True
3 448 canceled 2014-11-10 2014-11-10 0 True True
4 448 current 2015-03-10 NaN NaN True False
5 448 canceled 2015-01-14 2015-01-27 13 True True
6 448 canceled 2015-01-27 2015-03-10 42 True True
7 448 canceled 2015-01-27 2015-01-27 0 True True
8 448 canceled 2015-01-27 2015-01-27 0 True True
9 700 canceled 2014-11-10 2014-11-16 6 False True
10 429 canceled 2014-11-10 2015-03-10 120 False True
11 429 canceled 2015-03-10 2015-06-17 99 False True
12 60 canceled 2014-11-10 2015-01-14 65 False True
13 60 canceled 2015-01-14 2015-04-01 77 False True
14 60 current 2015-04-01 NaN NaN False False
15 1300 canceled 2014-11-10 2014-11-16 6 False True
16 369 current 2014-11-10 NaN NaN False False

In [11]:
len(enrol["account_key"].unique()) #unique account keys
Out[11]:
1302

No comments:

Post a Comment

THANKS FOR UR GREAT COMMENT

Blogger Widgets