Test Installation¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Data Structures¶
NumPy data structures¶
array¶
NumPy's ndarray, which can be created with the array function, represents a multidimensional, homogeneous array of fixed-size items.
It is a table of elements (usually numbers), all of the same type, indexed by a tuple of non-negative integers. In NumPy, dimensions are called axes, and the number of axes is the rank (available as the ndim attribute).
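As a small illustration of axes and rank, the sketch below builds a two-axis array and inspects it. The variable name m is only for illustration.

```python
import numpy as np

# A 2 x 3 array: two axes, the first of length 2 and the second of length 3.
m = np.array([[1, 2, 3],
              [4, 5, 6]])

print(m.ndim)    # number of axes (the rank): 2
print(m.shape)   # length of each axis: (2, 3)
print(m[1, 2])   # element at row 1, column 2 (indices start at 0): 6
```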
An ndarray is more compact than a Python list (often around one fifth of the size) and faster. The difference is mostly due to indirection: a Python list is an array of pointers to Python objects, so on a 32-bit build each element costs at least 4 bytes for the pointer plus 16 bytes for even the smallest Python object (4 for the type pointer, 4 for the reference count, 4 for the value, and the memory allocator rounds up to 16). On 64-bit builds these per-object costs are larger. A NumPy array stores uniform values directly: single-precision numbers take 4 bytes each and double-precision ones take 8 bytes. NumPy also performs better, as the timing comparison below shows.
from numpy import arange
from timeit import Timer
Nelements = 10000
Ntimeits = 10000
x = arange(Nelements)
# Build a real list so the comparison is array vs. list on Python 2 and 3 alike
y = list(range(Nelements))
t_numpy = Timer("x.sum()", "from __main__ import x")
t_list = Timer("sum(y)", "from __main__ import y")
# Average time per call, in seconds
print("numpy: %.3e" % (t_numpy.timeit(Ntimeits)/Ntimeits,))
print("list: %.3e" % (t_list.timeit(Ntimeits)/Ntimeits,))
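The memory claim can be checked in the same spirit as the timing above. This is a rough sketch: sys.getsizeof reports sizes for the running interpreter, so the exact numbers vary by platform and Python version, and the ratio will not necessarily be one fifth.

```python
import sys
import numpy as np

n = 10000
arr = np.arange(n, dtype=np.float64)
lst = [float(i) for i in range(n)]

# The array stores raw 8-byte values back to back.
array_bytes = arr.nbytes

# The list stores pointers; each one refers to a separate float object.
list_bytes = sys.getsizeof(lst) + sum(sys.getsizeof(v) for v in lst)

print("array: %d bytes" % array_bytes)
print("list:  %d bytes" % list_bytes)
```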
If a NumPy array is created from values of different types, they are upcast to a common type. In the example below, the array is created as an array of floats and then converted to an array of ints.
a1=np.array([1, 2, 3.0])
a2=a1.astype(int)
print ('a1:\n{}'.format(a1))
print ('a2:\n{}'.format(a2))
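To make the upcasting visible, the dtype attribute can be inspected directly. The sketch below also shows that astype(int) truncates toward zero rather than rounding; the variable names are illustrative.

```python
import numpy as np

mixed = np.array([1, 2, 3.0])      # ints and a float are upcast to float
print(mixed.dtype)                 # float64

with_text = np.array([1, 2.5, 'x'])  # adding a string upcasts everything to text
print(with_text.dtype.kind)          # 'U' (unicode string)

# Converting back to int drops the fractional part (truncation, not rounding).
truncated = np.array([1.7, -1.7]).astype(int)
print(truncated)
```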
See the NumPy documentation on data types for more information on how to specify them.
Arithmetic operators on arrays apply elementwise. NumPy also provides familiar mathematical functions such as sin, cos, and exp. These are called “universal functions” (ufuncs): they operate elementwise on an array and produce an array as output.
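A short sketch of elementwise arithmetic and ufuncs, using two small arrays chosen only for illustration:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])

total = a + b      # elementwise sum
product = a * b    # elementwise product, not a dot product
roots = np.sqrt(a) # a ufunc applied to every element

print(total)
print(product)
print(roots)
print(isinstance(np.sqrt, np.ufunc))  # ufuncs are instances of np.ufunc
```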
s = pd.Series([1,3,5,np.nan,6,8])
s
A dtype can be passed as a parameter when creating a Series; if it is omitted, it is inferred from the data.
A Series can also be created from a dict:
d = {'a' : 0., 'b' : 1., 'c' : 2.}
s2 = pd.Series(d)
print ('Default order:\n{}'.format(s2))
print ('\nSpecify order:\n{}'.format(pd.Series(d,index=['b', 'c', 'd', 'a'])))
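The sketch below illustrates the two points above: dtype inference versus an explicit dtype, and what happens when the requested index contains a label that is missing from the dict. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

inferred = pd.Series([1, 3, 5, np.nan, 6, 8])     # NaN forces a float dtype
explicit = pd.Series([1, 2, 3], dtype='float32')  # dtype given explicitly

d = {'a': 0., 'b': 1., 'c': 2.}
reordered = pd.Series(d, index=['b', 'c', 'd', 'a'])

print(inferred.dtype)
print(explicit.dtype)
print(reordered.isna())  # 'd' is not a key of the dict, so its value is NaN
```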
A Series can be accessed by position, like an array:
s2.iloc[1]
or by label, like a dict:
s2['b']
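Positional and label-based access also differ when slicing. A minimal sketch, rebuilding the same Series so that it runs on its own:

```python
import pandas as pd

s2 = pd.Series({'a': 0., 'b': 1., 'c': 2.})

by_position = s2.iloc[1]        # second element
by_label = s2['b']              # element labeled 'b'

pos_slice = s2.iloc[0:2]        # positional slice excludes the end position
label_slice = s2.loc['a':'b']   # label slice includes the end label

print(by_position, by_label)
print(pos_slice)
print(label_slice)
```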
DataFrame¶
DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. It can be thought of as a dict-like container for Series objects. It is the primary pandas data structure.
dates = pd.date_range('20130101', periods=6)
dates #data type is DatetimeIndex
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df
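Label alignment in arithmetic can be seen with two small frames that only partly overlap. This is an illustrative sketch; the frames left and right are made up for the example.

```python
import pandas as pd

left = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['x', 'y'])
right = pd.DataFrame({'B': [10, 20], 'C': [30, 40]}, index=['y', 'z'])

# Values are matched by row and column label, not by position.
# Only row 'y', column 'B' exists in both frames; every other cell is NaN.
total = left + right
print(total)
```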
A DataFrame can also be created by passing a dict of objects that can be converted to something series-like:
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
df2
df2.dtypes
Viewing Data¶
Both Series and DataFrame have head and tail methods to show part of a long dataset. More can be found in Essential Basic Functionality.
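A minimal sketch of head and tail on a made-up 100-row frame; with no argument, both return five rows.

```python
import numpy as np
import pandas as pd

df_long = pd.DataFrame({'n': np.arange(100), 'sq': np.arange(100) ** 2})

first = df_long.head()    # first 5 rows by default
last = df_long.tail(3)    # last 3 rows

print(first)
print(last)
```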
We can also visualize data.
%matplotlib nbagg
#%matplotlib inline
import matplotlib
matplotlib.style.use('ggplot')
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=list('ABCD'))
df.plot()
df.plot(x='A',y='B')
df2 = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
df2.plot.bar()