Test Installation

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Data Structures

NumPy data structures

array

NumPy's ndarray, which can be created with the array function, represents a multidimensional, homogeneous array of fixed-size items.

It is a table of elements (usually numbers), all of the same type, indexed by a tuple of non-negative integers. In NumPy, dimensions are called axes, and the number of axes is the rank.
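
As a small illustration of axes and rank, the sketch below builds a two-axis array and inspects it. The name grid is only illustrative and is not used elsewhere in this notebook.

```python
import numpy as np

# A 2 x 3 array: two axes, the first of length 2 and the second of length 3.
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

print(grid.ndim)    # number of axes (the rank)
print(grid.shape)   # length of each axis
print(grid[1, 2])   # indexed by a tuple of integers: row 1, column 2
```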

It is more compact than a Python list (often around one fifth of the size) and faster. The difference is mostly due to indirection: a Python list is an array of pointers to Python objects, which costs at least 4 bytes per pointer plus 16 bytes for even the smallest Python object (4 for the type pointer, 4 for the reference count, 4 for the value, with the memory allocator rounding up to 16). These figures are for a 32-bit build; pointers are 8 bytes on a 64-bit build. A NumPy array is an array of uniform values: single-precision numbers take 4 bytes each and double-precision numbers take 8 bytes. NumPy also performs better, as the timing below shows.

In [5]:
from numpy import arange
from timeit import Timer

Nelements = 10000
Ntimeits = 10000

x = arange(Nelements)
y = list(range(Nelements))  # an actual list on both Python 2 and Python 3

t_numpy = Timer("x.sum()", "from __main__ import x")
t_list = Timer("sum(y)", "from __main__ import y")
print("numpy: %.3e" % (t_numpy.timeit(Ntimeits)/Ntimeits,))
print("list:  %.3e" % (t_list.timeit(Ntimeits)/Ntimeits,))
numpy: 1.215e-05
list:  1.023e-04
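
The memory side of the comparison can be checked in the same spirit. This is a rough sketch, not a precise measurement: sys.getsizeof reports sizes for the running interpreter, so the exact numbers depend on the Python build, and the names values and arr are only illustrative.

```python
import sys
import numpy as np

n = 10000
values = list(range(n))
arr = np.arange(n, dtype=np.float64)

# Bytes used by the array's data buffer: n elements times 8 bytes each.
array_bytes = arr.nbytes

# Rough estimate for the list: the list object itself (its pointer table)
# plus each integer object it points to. Exact numbers vary by Python build.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

print("array: %d bytes" % array_bytes)
print("list:  %d bytes" % list_bytes)
```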

If a NumPy array is initialized with values of different types, they are upcast to a common type. In the example below, the array is created from integers and a float, so it becomes an array of floats; it is then converted to an array of ints with astype.

In [6]:
a1=np.array([1, 2, 3.0])
a2=a1.astype(int)
print('a1:\n{}'.format(a1))
print('a2:\n{}'.format(a2))
a1:
[ 1.  2.  3.]
a2:
[1 2 3]
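
One detail worth seeing with non-integral values: converting floats to ints with astype drops the fractional part rather than rounding. A minimal sketch (the names mixed and as_int are illustrative):

```python
import numpy as np

mixed = np.array([1, 2.5, -3.7])   # ints and floats together
print(mixed.dtype)                 # float64: the int was upcast

as_int = mixed.astype(int)         # explicit conversion to integers
print(as_int)                      # fractional parts are dropped, not rounded
```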

See the NumPy documentation on data types for more information on how to specify them.

Arithmetic operators on arrays apply elementwise. NumPy also provides familiar mathematical functions such as sin, cos, and exp. In NumPy, these are called “universal functions” (ufuncs): they operate elementwise on an array and produce an array as output.
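
A short sketch of elementwise arithmetic and ufuncs, using two small arrays made up for the example:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])

print(a + b)        # elementwise sum
print(a * b)        # elementwise product, not a dot product
print(np.exp(a))    # ufunc applied to every element
print(np.sin(np.array([0.0, np.pi / 2])))
```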

Pandas data structures

Series

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.

A Series can be created from an array-like object such as a list:

In [7]:
s = pd.Series([1,3,5,np.nan,6,8])
s
Out[7]:
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

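A dtype can be passed as a parameter when creating a Series; if it is omitted, it is inferred from the data.

A Series can also be created from a dict. If an index is passed, values are taken in that order, and labels missing from the dict (here 'd') become NaN: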
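
A minimal sketch of the difference between an inferred and an explicit dtype (the names inferred and explicit are illustrative):

```python
import pandas as pd

inferred = pd.Series([1, 2, 3])                    # dtype inferred as an integer type
explicit = pd.Series([1, 2, 3], dtype='float64')   # dtype requested explicitly

print(inferred.dtype)
print(explicit.dtype)
```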

In [8]:
d = {'a' : 0., 'b' : 1., 'c' : 2.}
s2 = pd.Series(d)
print('Default order:\n{}'.format(s2))
print('\nSpecify order:\n{}'.format(pd.Series(d, index=['b', 'c', 'd', 'a'])))
Default order:
a    0.0
b    1.0
c    2.0
dtype: float64

Specify order:
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

A Series can be accessed by position, like an array (using .iloc makes the positional lookup explicit):

In [9]:
s2.iloc[1]
Out[9]:
1.0

or by label, like a dict:

In [10]:
s2['b']
Out[10]:
1.0
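
The two access styles can be put side by side. This is a sketch on a fresh Series named s, built from the same values as the dict above; .iloc, .loc, and .get are standard pandas accessors.

```python
import pandas as pd

s = pd.Series({'a': 0.0, 'b': 1.0, 'c': 2.0})

print(s.iloc[1])          # by position
print(s.loc['b'])         # by label
print('b' in s)           # membership tests the index labels, as with a dict
print(s.get('z', -1.0))   # dict-style lookup with a default for a missing label
```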

DataFrame

DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. It can be thought of as a dict-like container for Series objects. This is the primary pandas data structure.
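
To make the label alignment concrete, here is a small sketch with two frames that only partly overlap. The frames left and right and their values are made up for the example.

```python
import pandas as pd

left = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]}, index=['x', 'y'])
right = pd.DataFrame({'B': [10.0, 20.0], 'C': [30.0, 40.0]}, index=['y', 'z'])

# Addition aligns on both row and column labels; cells that do not
# exist in both frames become NaN.
total = left + right
print(total)

# Selecting a column returns a Series, as with a dict of Series.
print(type(left['A']))
```

Only the cell at row 'y', column 'B' exists in both frames, so it is the only one with a numeric result.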

In [11]:
dates = pd.date_range('20130101', periods=6)
dates  # the result is a DatetimeIndex
Out[11]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
In [12]:
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df
Out[12]:
                   A         B         C         D
2013-01-01 -0.798593 -0.873343  0.741465  1.757651
2013-01-02  0.146600 -1.823005 -0.638928  0.437113
2013-01-03 -0.104512 -0.080836 -2.053928  0.639754
2013-01-04  1.586046 -0.269135  0.870575 -1.521639
2013-01-05  0.368299 -0.586849  1.296458 -0.272598
2013-01-06  1.033061  0.518580  0.284186  0.899014

A DataFrame can also be created by passing a dict of objects that can each be converted to something Series-like.

In [13]:
df2 = pd.DataFrame({ 'A' : 1.,
                     'B' : pd.Timestamp('20130102'),
                     'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                     'D' : np.array([3] * 4,dtype='int32'),
                     'E' : pd.Categorical(["test","train","test","train"]),
                     'F' : 'foo' })
df2
Out[13]:
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
In [ ]:
df2.dtypes

Viewing Data

Both Series and DataFrame have head and tail methods that show the first or last rows of long data. More can be found in the Essential Basic Functionality section of the pandas documentation.
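
A quick sketch of head and tail on a small frame (the name frame and its contents are illustrative):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(40).reshape(10, 4), columns=list('ABCD'))

print(frame.head())     # first 5 rows by default
print(frame.tail(3))    # last 3 rows
```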

We can also visualize data.

In [14]:
%matplotlib nbagg
#%matplotlib inline
import matplotlib
matplotlib.style.use('ggplot')
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=list('ABCD'))
df.plot()
Out[14]:
In [15]:
df.plot(x='A',y='B')
Out[15]:
In [16]:
df2 = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
df2.plot.bar()
Out[16]: