Test Installation¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Data Structures¶
NumPy data structures¶
array¶
NumPy's ndarray, which can be created with the array function, represents a multidimensional, homogeneous array of fixed-size items.
It is a table of elements (usually numbers), all of the same type, indexed by a tuple of non-negative integers. In NumPy, dimensions are called axes, and the number of axes is the rank (available as the ndim attribute).
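As a small illustration of axes and rank, the sketch below builds a two-axis array and inspects it. The variable name a is only for illustration.

```python
import numpy as np

# A 2 x 3 array: two axes, the first of length 2 and the second of length 3.
a = np.array([[1, 2, 3],
              [4, 5, 6]])

print(a.ndim)    # number of axes (the rank)
print(a.shape)   # length of each axis
print(a.dtype)   # the single element type shared by all items
print(a[1, 2])   # indexing with one integer per axis
```

The exact integer dtype printed depends on the platform, so it is not shown here.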
It is more compact than a Python list (often around one fifth of the size) and faster. The difference is mostly due to indirection: a Python list is an array of pointers to Python objects, so each element costs one pointer plus a full Python object. On a 32-bit build that is at least 4 bytes per pointer plus 16 bytes for even the smallest Python object (4 for the type pointer, 4 for the reference count, 4 for the value, and the memory allocator rounds up to 16); on 64-bit builds both figures are larger. A NumPy array is an array of uniform values: single-precision numbers take 4 bytes each and double-precision ones take 8 bytes. NumPy also performs better, as the timing comparison below shows.
from numpy import arange
from timeit import Timer
Nelements = 10000
Ntimeits = 10000
x = arange(Nelements)
y = list(range(Nelements))  # build a real list; range() is lazy in Python 3
t_numpy = Timer("x.sum()", "from __main__ import x")
t_list = Timer("sum(y)", "from __main__ import y")
# Average time per call, in seconds
print("numpy: %.3e" % (t_numpy.timeit(Ntimeits) / Ntimeits,))
print("list: %.3e" % (t_list.timeit(Ntimeits) / Ntimeits,))
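The memory claim can be checked in a similar way. The following is a rough sketch, assuming CPython: it adds the size of the list's pointer array to the size of every int object it points to, and compares that with the raw data buffer of an int64 array. The names py_list, np_array, list_bytes and array_bytes are illustrative.

```python
import sys
import numpy as np

n = 10000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# A list stores pointers; each element is a separate Python int object.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(v) for v in py_list)
# An ndarray stores the raw values contiguously: n * itemsize bytes.
array_bytes = np_array.nbytes

print("list : %d bytes" % list_bytes)
print("array: %d bytes" % array_bytes)
```

Note that nbytes counts only the data buffer and leaves out the small fixed-size array header, so the comparison slightly favors the array.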
If a NumPy array is initialized with values of different data types, they are upcast to a common type. In the example below, the array is created as an array of float because one element is a float, and it is then converted to an array of int with astype.
a1=np.array([1, 2, 3.0])
a2=a1.astype(int)
print ('a1:\n{}'.format(a1))
print ('a2:\n{}'.format(a2))
More information on how to specify data types is available in the NumPy documentation.
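As a brief sketch of specifying a data type explicitly, the dtype argument accepts either a NumPy type object or a string name. The variable names f, i and z are illustrative.

```python
import numpy as np

f = np.array([1, 2, 3], dtype=np.float64)   # NumPy type object
i = np.array([1, 2, 3], dtype="int32")      # string name
z = np.zeros(3, dtype=np.complex128)        # works for other constructors too

for arr in (f, i, z):
    # itemsize is the number of bytes used by each element
    print(arr.dtype, arr.itemsize)
```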
Arithmetic operators on arrays apply elementwise. NumPy also provides familiar mathematical functions such as sin, cos, and exp. In NumPy, these are called “universal functions” (ufunc): they operate elementwise on an array and produce an array as output.
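A minimal sketch of elementwise arithmetic and ufuncs follows; the arrays a and b are made up for illustration.

```python
import numpy as np

a = np.array([0.0, 1.0, 2.0])
b = np.array([10.0, 20.0, 30.0])

print(a + b)       # elementwise sum
print(a * b)       # elementwise product, not a dot product
print(np.exp(a))   # ufunc applied to each element
print(np.sin(a))   # the result has the same shape as the input
```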
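pandas data structures¶
Series¶
A Series is a one-dimensional labeled array. It can be created from a list of values:
s = pd.Series([1,3,5,np.nan,6,8])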
s
dtype can be passed as a parameter when creating a Series; if it is omitted, it is inferred from the data.
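The difference between an inferred and an explicit dtype can be sketched as follows. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

inferred = pd.Series([1, 2, 3])                   # dtype inferred from the values
explicit = pd.Series([1, 2, 3], dtype="float64")  # dtype requested explicitly
with_nan = pd.Series([1, 2, np.nan])              # NaN forces a floating-point dtype

print(inferred.dtype)
print(explicit.dtype)
print(with_nan.dtype)
```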
A Series can also be created from a dict:
d = {'a' : 0., 'b' : 1., 'c' : 2.}
s2 = pd.Series(d)
print ('Default order:\n{}'.format(s2))
print ('\nSpecify order:\n{}'.format(pd.Series(d,index=['b', 'c', 'd', 'a'])))
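A Series can be accessed by position, like an array (use iloc, since plain integer indexing on a label-indexed Series is deprecated in recent pandas versions):
s2.iloc[1]
or by label, like a dict:
s2['b']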
DataFrame¶
DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. It can be thought of as a dict-like container for Series objects. It is the primary pandas data structure.
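To illustrate label alignment, the sketch below adds two small frames whose row and column labels only partly overlap. The frames left and right are invented for this example.

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["x", "y"])
right = pd.DataFrame({"B": [10, 20], "C": [30, 40]}, index=["y", "z"])

# The result covers the union of the row and column labels.
# Only cells present in both frames get a value; the rest are NaN.
total = left + right
print(total)
```

Here only row y, column B exists in both inputs, so it is the only cell with a number in the result.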
dates = pd.date_range('20130101', periods=6)
dates #data type is DatetimeIndex
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df
A DataFrame can also be created by passing a dict of objects that can each be converted to something Series-like.
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
df2
df2.dtypes
Viewing Data¶
Both Series and DataFrame have head and tail methods to show the first or last rows of long data. More can be found in the Essential Basic Functionality section of the pandas documentation.
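A short sketch of head and tail on a 100-row frame follows; long_df is an illustrative name.

```python
import numpy as np
import pandas as pd

long_df = pd.DataFrame({"n": np.arange(100), "sq": np.arange(100) ** 2})

print(long_df.head())    # first 5 rows by default
print(long_df.tail(3))   # last 3 rows
```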
We can also visualize data.
%matplotlib nbagg
#%matplotlib inline
import matplotlib
matplotlib.style.use('ggplot')
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=list('ABCD'))
df.plot()
df.plot(x='A',y='B')
df2 = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
df2.plot.bar()