pandas

Overview

pandas data structures: pandas mainly provides Series (one-dimensional), DataFrame (two-dimensional), Panel (three-dimensional), Panel4D (four-dimensional), and PanelND (N-dimensional) structures. Of these, Series and DataFrame are by far the most widely used (the Panel family has since been deprecated).

A Series is a one-dimensional labeled array that can hold any data type: integers, strings, floating-point numbers, Python objects, and so on. Elements of a Series are located by label.
A DataFrame is a two-dimensional labeled data structure. Data can be located by row and column labels, something a plain NumPy array does not offer.
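
Below is a minimal sketch of what label-based access looks like; the labels and values are made up for illustration.

>>> import pandas as pd
>>> s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
>>> s['b']                 # locate a value in a Series by its label
20
>>> df = pd.DataFrame([[1, 2], [3, 4]], index=['r1', 'r2'], columns=['x', 'y'])
>>> df.loc['r2', 'y']      # locate a value in a DataFrame by row and column label
4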

Checking the pandas version

>>> import pandas as pd
>>> print(pd.__version__)
0.24.2
>>> pd.show_versions()

Series

Creating a Series

A Series can be thought of as a dataset made up of a single column of data.
The Series constructor looks like this:

s = pd.Series(data, index=index)

The three most common ways to create a Series are:

Creating a Series from a list

>>> arr = ['a', 'b', 'c', 'd']
>>> s1 = pd.Series(arr) # if no index is given, a default integer index starting at 0 is used
>>> s1
0 a
1 b
2 c
3 d
dtype: object

Creating a Series from an ndarray

>>> import numpy as np
>>> n = np.random.randn(5)  # a 1-D array of 5 samples from the standard normal distribution
>>> index = ["a", "b", "c", "d", "e"]
>>> s2 = pd.Series(n, index)
>>> s2
a 0.077125
b -0.554191
c 0.167562
d -1.214207
e -1.038312
dtype: float64

Creating a Series from a dict

>>> d = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
>>> s3 = pd.Series(d)
>>> s3
a 1
b 2
c 3
d 4
dtype: int64

Basic Series operations

Modifying the index of a Series

>>> s1
0 a
1 b
2 c
3 d
dtype: object
>>> s1.index = ['a1','a2','a3','a4']
>>> s1
a1 a
a2 b
a3 c
a4 d
dtype: object

>>> s1.index[0] = '01'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py", line 3938, in __setitem__
raise TypeError("Index does not support mutable operations")
TypeError: Index does not support mutable operations
# an Index object does not support in-place modification

Concatenating Series vertically

>>> s1.index = ['a', 'b', 'c', 'd']   # reset the labels for this example
>>> s1
a a
b b
c c
d d
dtype: object
>>> s3
a 1
b 2
c 3
d 4
dtype: int64
>>> s4 = s3.append(s1)   # append is deprecated in newer pandas; pd.concat([s3, s1]) is the modern equivalent
>>> s4
a 1
b 2
c 3
d 4
a a
b b
c c
d d
dtype: object

Deleting elements by index label

>>> s4.drop('a')          # drops every entry labeled 'a' and returns a new Series
b 2
c 3
d 4
b b
c c
d d
dtype: object

Modifying elements by index label

>>> s4['a'] = [11, 'a1']
>>> s4
a 11
b 2
c 3
d 4
a a1
b b
c c
d d
dtype: object

Looking up elements by index label

>>> s4['a']               # a duplicated label returns every matching entry
a    11
a    a1
dtype: object

Slicing a Series

>>> s1 = pd.Series(['a', 'b', 'c', 'd'])   # a fresh Series with the default integer index
>>> s1
0 a
1 b
2 c
3 d
dtype: object
>>> s1[1:3]          # positional slice: positions 1 and 2, end point excluded
1 b
2 c
dtype: object

Series arithmetic

Series addition

>>> s1
0 a
1 b
2 c
3 d
dtype: object
>>> s2 = pd.Series(['e', 'f', 'g', 'h'])   # redefine s2 for this example
>>> s2
0 e
1 f
2 g
3 h
dtype: object
>>> s1.add(s2)       # for string (object) values, addition concatenates element-wise
0 ae
1 bf
2 cg
3 dh
dtype: object
>>> s2.add(s1)
0 ea
1 fb
2 gc
3 hd
dtype: object

Series subtraction

>>> s3
a 1
b 2
c 3
d 4
dtype: int64
>>> s4 = pd.Series([4, 3, 2, 1], index=['a', 'bb', 'c', 'd'])   # redefine s4 for this example
>>> s4
a 4
bb 3
c 2
d 1
dtype: int64
>>> s3.sub(s4)       # labels present in only one of the two Series give NaN
a -3.0
b NaN
bb NaN
c 1.0
d 3.0
dtype: float64

Series multiplication

>>> s3
a 1
b 2
c 3
d 4
dtype: int64
>>> s4
a 4
bb 3
c 2
d 1
dtype: int64
>>> s3.mul(s4)
a 4.0
b NaN
bb NaN
c 6.0
d 4.0
dtype: float64

Series division

>>> s3
a 1
b 2
c 3
d 4
dtype: int64
>>> s4
a 4
bb 3
c 2
d 1
dtype: int64
>>> s3.div(s4)
a 0.25
b NaN
bb NaN
c 1.50
d 4.00
dtype: float64

Series median

>>> s3
a 1
b 2
c 3
d 4
dtype: int64
>>> s3.median()
2.5

Series sum

>>> s3
a 1
b 2
c 3
d 4
dtype: int64
>>> s3.sum()
10

Series maximum

>>> s3
a 1
b 2
c 3
d 4
dtype: int64
>>> s3.max()
4

Series minimum

>>> s3
a 1
b 2
c 3
d 4
dtype: int64
>>> s3.min()
1

Detecting missing values in a Series

>>> series_10 = pd.Series({'a':10, 'b':20, 'c':30, 'd':40})
>>> series_10.index
Index(['a', 'b', 'c', 'd'], dtype='object')
>>> new_index = ['a', 'b', 'c', 'd', 'e']
>>> series_20 = pd.Series(series_10, index=new_index)   # labels not found in series_10 become NaN
>>> series_20
a 10.0
b 20.0
c 30.0
d 40.0
e NaN
dtype: float64

isnull and notnull detect missing values in a Series and return a boolean Series.

>>> ser_10 = pd.Series({'a': 10, 'b': 20, 'c': 30})
>>> ser_20 = pd.Series(ser_10, index=['a', 'b', 'c', 'd'])
>>> pd.isnull(ser_10)
a False
b False
c False
dtype: bool
>>> pd.isnull(ser_20)
a False
b False
c False
d True
dtype: bool

>>> pd.notnull(ser_10)
a True
b True
c True
dtype: bool
>>> pd.notnull(ser_20)
a True
b True
c True
d False
dtype: bool

>>> ser_20
a 10.0
b 20.0
c 30.0
d NaN
dtype: float64
>>> ser_20[pd.isnull(ser_20)] # select only the missing values
d NaN
dtype: float64
>>> ser_20[pd.notnull(ser_20)] # keep only the non-missing values
a 10.0
b 20.0
c 30.0
dtype: float64

Automatic alignment in Series operations

>>> series_1 = pd.Series([1,2,3,4], index=['a','b','c','d'])
>>> series_2 = pd.Series([4,3,2,1,5], index=['a','b','c','d','e'])
>>> series_1 + series_2   # indexes are aligned; labels missing on one side give NaN
a 5.0
b 5.0
c 5.0
d 5.0
e NaN
dtype: float64

Series and the name attribute

>>> series_1.name = "hello world"
>>> series_1.index.name = 'xxoo'
>>> series_1
xxoo
a 1
b 2
c 3
d 4
Name: hello world, dtype: int64

DataFrame

Creating a DataFrame

Creating a DataFrame from a NumPy array

>>> dates = pd.date_range('today', periods=6)
>>> dates
DatetimeIndex(['2019-05-07 11:19:46.800556', '2019-05-08 11:19:46.800556',
'2019-05-09 11:19:46.800556', '2019-05-10 11:19:46.800556',
'2019-05-11 11:19:46.800556', '2019-05-12 11:19:46.800556'],
dtype='datetime64[ns]', freq='D')
>>> num_arr = np.random.randn(6,4)
>>> num_arr
array([[ 0.73018504, -0.92169922, -1.37867963, -1.40692501],
[ 1.61475442, 0.07078674, 0.59075899, -1.994313 ],
[ 0.00830036, 0.20939717, 0.50323969, 0.43514028],
[ 0.23984788, 0.15418066, -1.06420981, 0.58894686],
[ 1.30096803, 0.87426295, -0.56629328, -0.14877761],
[ 0.94481793, -1.04911144, -1.39403504, 0.77847438]])
>>> columns=['A','B','C','D']
>>> df1 = pd.DataFrame(num_arr, dates, columns)
>>> df1
A B C D
2019-05-07 11:19:46.800556 0.730185 -0.921699 -1.378680 -1.406925
2019-05-08 11:19:46.800556 1.614754 0.070787 0.590759 -1.994313
2019-05-09 11:19:46.800556 0.008300 0.209397 0.503240 0.435140
2019-05-10 11:19:46.800556 0.239848 0.154181 -1.064210 0.588947
2019-05-11 11:19:46.800556 1.300968 0.874263 -0.566293 -0.148778
2019-05-12 11:19:46.800556 0.944818 -1.049111 -1.394035 0.778474

Creating a DataFrame from a dict

>>> data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
... 'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
... 'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
... 'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
>>>
>>> labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
>>> df2 = pd.DataFrame(data, index=labels)
>>> df2
animal age visits priority
a cat 2.5 1 yes
b cat 3.0 3 yes
c snake 0.5 2 no
d dog NaN 3 yes
e dog 5.0 2 no
f cat 2.0 3 no
g snake 4.5 1 no
h cat NaN 1 yes
i dog 7.0 2 no
j dog 3.0 1 no
>>> data = {'成绩':[90, 90, 99], '姓名': ['Tony','Wayne','Moon'], '爱好':['篮球','排球','乒乓球']}  
>>> df_1 = pd.DataFrame(data)
>>> df_1
成绩 姓名 爱好
0 90 Tony 篮球
1 90 Wayne 排球
2 99 Moon 乒乓球

>>> df_1.index
RangeIndex(start=0, stop=3, step=1)
>>> df_1.index = ['xx','oo','xxoo']
>>> df_1
成绩 姓名 爱好
xx 90 Tony 篮球
oo 90 Wayne 排球
xxoo 99 Moon 乒乓球

Creating a DataFrame from a two-dimensional array

>>> df_1 = pd.DataFrame([['Wayne', 'Tony', 'Moon'],[90, 80, 70]])
>>> df_1
0 1 2
0 Wayne Tony Moon
1 90 80 70
>>> df_1 = pd.DataFrame([['Wayne', 'Tony', 'Moon'],[90, 80, 70]], index=['hello', 'world'])
>>> df_1
0 1 2
hello Wayne Tony Moon
world 90 80 70
>>> df_1 = pd.DataFrame([['Wayne', 'Tony', 'Moon'],[90, 80, 70]], index=['hello', 'world'], columns=['a','b','c'])
>>> df_1
a b c
hello Wayne Tony Moon
world 90 80 70

Accessing data in a DataFrame

Viewing the index, columns, and values

>>> df_1.index
Index(['hello', 'world'], dtype='object')
>>> df_1.columns
Index(['a', 'b', 'c'], dtype='object')
>>> df_1.values
array([['Wayne', 'Tony', 'Moon'],
[90, 80, 70]], dtype=object)

Viewing the first and last rows

>>> df2.head()
animal age visits priority
a cat 2.5 1 yes
b cat 3.0 3 yes
c snake 0.5 2 no
d dog NaN 3 yes
e dog 5.0 2 no
>>> df2.tail()
animal age visits priority
f cat 2.0 3 no
g snake 4.5 1 no
h cat NaN 1 yes
i dog 7.0 2 no
j dog 3.0 1 no

Selecting a single column by label

>>> df1
A B C D
2019-05-07 11:19:46.800556 0.730185 -0.921699 -1.378680 -1.406925
2019-05-08 11:19:46.800556 1.614754 0.070787 0.590759 -1.994313
2019-05-09 11:19:46.800556 0.008300 0.209397 0.503240 0.435140
2019-05-10 11:19:46.800556 0.239848 0.154181 -1.064210 0.588947
2019-05-11 11:19:46.800556 1.300968 0.874263 -0.566293 -0.148778
2019-05-12 11:19:46.800556 0.944818 -1.049111 -1.394035 0.778474
>>> df1['A']
2019-05-07 11:19:46.800556 0.730185
2019-05-08 11:19:46.800556 1.614754
2019-05-09 11:19:46.800556 0.008300
2019-05-10 11:19:46.800556 0.239848
2019-05-11 11:19:46.800556 1.300968
2019-05-12 11:19:46.800556 0.944818
Freq: D, Name: A, dtype: float64

Selecting multiple columns by label

>>> df1[['A','B']]
A B
2019-05-07 11:19:46.800556 0.730185 -0.921699
2019-05-08 11:19:46.800556 1.614754 0.070787
2019-05-09 11:19:46.800556 0.008300 0.209397
2019-05-10 11:19:46.800556 0.239848 0.154181
2019-05-11 11:19:46.800556 1.300968 0.874263
2019-05-12 11:19:46.800556 0.944818 -1.049111

Selecting rows by position

>>> df1.iloc[1:3]
A B C D
2019-05-08 11:19:46.800556 1.614754 0.070787 0.590759 -1.994313
2019-05-09 11:19:46.800556 0.008300 0.209397 0.503240 0.435140

Viewing summary statistics of a DataFrame

>>> df1.describe()
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.806479 -0.110364 -0.551537 -0.291242
std 0.613343 0.736756 0.902708 1.149970
min 0.008300 -1.049111 -1.394035 -1.994313
25% 0.362432 -0.673578 -1.300062 -1.092388
50% 0.837501 0.112484 -0.815252 0.143181
75% 1.211931 0.195593 0.235856 0.550495
max 1.614754 0.874263 0.590759 0.778474

Basic DataFrame operations

Transposing a DataFrame

>>> df1
A B C D
2019-05-07 11:19:46.800556 0.730185 -0.921699 -1.378680 -1.406925
2019-05-08 11:19:46.800556 1.614754 0.070787 0.590759 -1.994313
2019-05-09 11:19:46.800556 0.008300 0.209397 0.503240 0.435140
2019-05-10 11:19:46.800556 0.239848 0.154181 -1.064210 0.588947
2019-05-11 11:19:46.800556 1.300968 0.874263 -0.566293 -0.148778
2019-05-12 11:19:46.800556 0.944818 -1.049111 -1.394035 0.778474
>>> df1.T
2019-05-07 11:19:46.800556 2019-05-08 11:19:46.800556 ... 2019-05-11 11:19:46.800556 2019-05-12 11:19:46.800556
A 0.730185 1.614754 ... 1.300968 0.944818
B -0.921699 0.070787 ... 0.874263 -1.049111
C -1.378680 0.590759 ... -0.566293 -1.394035
D -1.406925 -1.994313 ... -0.148778 0.778474

[4 rows x 6 columns]

Sorting a DataFrame by column

>>> df1
A B C D
2019-05-07 11:19:46.800556 0.730185 -0.921699 -1.378680 -1.406925
2019-05-08 11:19:46.800556 1.614754 0.070787 0.590759 -1.994313
2019-05-09 11:19:46.800556 0.008300 0.209397 0.503240 0.435140
2019-05-10 11:19:46.800556 0.239848 0.154181 -1.064210 0.588947
2019-05-11 11:19:46.800556 1.300968 0.874263 -0.566293 -0.148778
2019-05-12 11:19:46.800556 0.944818 -1.049111 -1.394035 0.778474
>>> df1.sort_values(by='A')
A B C D
2019-05-09 11:19:46.800556 0.008300 0.209397 0.503240 0.435140
2019-05-10 11:19:46.800556 0.239848 0.154181 -1.064210 0.588947
2019-05-07 11:19:46.800556 0.730185 -0.921699 -1.378680 -1.406925
2019-05-12 11:19:46.800556 0.944818 -1.049111 -1.394035 0.778474
2019-05-11 11:19:46.800556 1.300968 0.874263 -0.566293 -0.148778
2019-05-08 11:19:46.800556 1.614754 0.070787 0.590759 -1.994313

Slicing a DataFrame

>>> df1
A B C D
2019-05-07 11:19:46.800556 0.730185 -0.921699 -1.378680 -1.406925
2019-05-08 11:19:46.800556 1.614754 0.070787 0.590759 -1.994313
2019-05-09 11:19:46.800556 0.008300 0.209397 0.503240 0.435140
2019-05-10 11:19:46.800556 0.239848 0.154181 -1.064210 0.588947
2019-05-11 11:19:46.800556 1.300968 0.874263 -0.566293 -0.148778
2019-05-12 11:19:46.800556 0.944818 -1.049111 -1.394035 0.778474
>>> df1[1:3]
A B C D
2019-05-08 11:19:46.800556 1.614754 0.070787 0.590759 -1.994313
2019-05-09 11:19:46.800556 0.008300 0.209397 0.503240 0.435140

Copying a DataFrame

>>> df3 = df2.copy()
>>> df3
animal age visits priority
a cat 2.5 1 yes
b cat 3.0 3 yes
c snake 0.5 2 no
d dog NaN 3 yes
e dog 5.0 2 no
f cat 2.0 3 no
g snake 4.5 1 no
h cat NaN 1 yes
i dog 7.0 2 no
j dog 3.0 1 no

Adding a column

>>> df_1 = pd.DataFrame({'成绩': [90, 90, 99], '姓名': ['Tony', 'Wayne', 'Moon'], '爱好': ['篮球', '排球', '乒乓球']}, index=['xx', 'oo', 'xxoo'])   # recreate the earlier frame
>>> df_1['location'] = ['SH', 'BJ', 'GZ']
>>> df_1
成绩 姓名 爱好 location
xx 90 Tony 篮球 SH
oo 90 Wayne 排球 BJ
xxoo 99 Moon 乒乓球 GZ

Deleting a column

>>> df_1
成绩 姓名 爱好 location
xx 90 Tony 篮球 SH
oo 90 Wayne 排球 BJ
xxoo 99 Moon 乒乓球 GZ
>>> df_1.pop('爱好')
xx 篮球
oo 排球
xxoo 乒乓球
Name: 爱好, dtype: object
>>> df_1
成绩 姓名 location
xx 90 Tony SH
oo 90 Wayne BJ
xxoo 99 Moon GZ

Modifying a column

>>> df_1['location'] = ['SJZ','TY','ZZ']
>>> df_1
成绩 姓名 location
xx 90 Tony SJZ
oo 90 Wayne TY
xxoo 99 Moon ZZ

Getting a row

>>> df_1
成绩 姓名 location
xx 90 Tony SJZ
oo 90 Wayne TY
xxoo 99 Moon ZZ
>>> df_1.ix['xx']          # .ix is deprecated; .loc below does the same
成绩 90
姓名 Tony
location SJZ
Name: xx, dtype: object
>>> df_1.loc['xx']
成绩 90
姓名 Tony
location SJZ
Name: xx, dtype: object

Adding a row

>>> df_1.ix['ooxx'] = [88, 'Bessie', 'QHD']   # df_1.loc['ooxx'] = [...] is the non-deprecated equivalent
>>> df_1
成绩 姓名 location
xx 90 Tony SJZ
oo 90 Wayne TY
xxoo 99 Moon ZZ
ooxx 88 Bessie QHD

Modifying a row

>>> df_1.ix['ooxx'] = [88, 'Bessie', 'SH'] 
>>> df_1
成绩 姓名 location
xx 90 Tony SJZ
oo 90 Wayne TY
xxoo 99 Moon ZZ
ooxx 88 Bessie SH

Deleting a row

>>> df_1.drop('xxoo')     # returns a new DataFrame; df_1 itself is unchanged
成绩 姓名 location
xx 90 Tony SJZ
oo 90 Wayne TY
ooxx 88 Bessie SH

Index objects
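
A brief sketch of how pandas Index objects behave (membership tests, set-like operations, immutability), assuming pandas is imported as pd as elsewhere in the document; the labels are made up for illustration.

>>> idx = pd.Index(['a', 'b', 'c'])
>>> idx
Index(['a', 'b', 'c'], dtype='object')
>>> 'b' in idx                          # membership test
True
>>> idx.union(pd.Index(['c', 'd']))     # Index objects support set-like operations
Index(['a', 'b', 'c', 'd'], dtype='object')
>>> idx[0] = 'z'                        # Index objects are immutable
Traceback (most recent call last):
  ...
TypeError: Index does not support mutable operations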

Basic pandas functionality

Reading data files and text data

The pandas.read_* family of functions reads data from files into a DataFrame. The most commonly used is read_csv, which reads delimited text data.

cat data.txt
wayne;tony;bessie
19;20;21
1;1;0

>>> pd.read_csv('data.txt', sep=';', header=None)
0 1 2
0 wayne tony bessie
1 19 20 21
2 1 1 0

cat data.csv
name,age,sex
wayne,19,1
tony,20,1
bessie,21,0

>>> pd.read_csv('data.csv')
name age sex
0 wayne 19 1
1 tony 20 1
2 bessie 21 0

Indexing, selection, and filtering

>>> df01 = pd.read_csv('data.csv')
>>> df01[df01.columns[1:]]      # every column except the first
   age  sex
0   19    1
1   20    1
2   21    0
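
For the filtering part of this section, here is a short sketch of boolean selection together with loc and iloc; the frame below just recreates the illustrative data.csv contents in code.

>>> df01 = pd.DataFrame({'name': ['wayne', 'tony', 'bessie'],
...                      'age': [19, 20, 21],
...                      'sex': [1, 1, 0]})
>>> df01[df01['age'] > 19]      # boolean filtering on a column
     name  age  sex
1    tony   20    1
2  bessie   21    0
>>> df01.loc[0, 'name']         # label-based: row label 0, column 'name'
'wayne'
>>> df01.iloc[0, 0]             # position-based: first row, first column
'wayne'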

Handling missing values (NaN)

cat data1.csv
name,age,salary,gender
Tom,NaN,456.7,M
Merry,34,345.6,NaN
Gerry,NaN,NaN,NaN
John,23,NaN,M
Joe ,18,385.6,F

>>> df2 = pd.read_csv('data1.csv', sep=',')

>>> df2.isnull()
name age salary gender
0 False True False False
1 False False False True
2 False True True True
3 False False True False
4 False False False False
>>> df2.notnull()
name age salary gender
0 True False True True
1 True True True False
2 True False False False
3 True True False True
4 True True True True
>>> df01 = pd.DataFrame(np.random.randint(1,9,size=(4,4)))
>>> df01
0 1 2 3
0 3 7 6 3
1 2 4 1 5
2 3 2 1 4
3 8 5 8 2


# >>> df01 = pd.DataFrame(np.random.randint(1,9,size=(4,4)), index=['a','b','c','d'])
# >>> df01
# 0 1 2 3
# a 4 2 6 7
# b 3 7 5 6
# c 8 5 3 2
# d 6 7 4 4

>>> df01.ix[1:3,1] = np.NaN # set column 1 of rows 1 through 3 to NaN (.ix is deprecated; df01.loc[1:3, 1] is equivalent here)
>>> df01
0 1 2 3
0 3 7.0 6 3
1 2 NaN 1 5
2 3 NaN 1 4
3 8 NaN 8 2

>>> df01.dropna() # drop rows containing NaN; by default any NaN in a row drops it
0 1 2 3
0 3 7.0 6 3

>>> df01.dropna(how='all') # drop only rows in which every value is NaN
0 1 2 3
0 3 7.0 6 3
1 2 NaN 1 5
2 3 NaN 1 4
3 8 NaN 8 2

>>> df01
0 1 2 3
0 3 7.0 6 3
1 2 NaN 1 5
2 3 NaN 1 4
3 8 NaN 8 2
>>> df01.dropna(axis=1) # drop columns that contain NaN
0 2 3
0 3 6 3
1 2 1 5
2 3 1 4
3 8 8 2
>>> from pandas import DataFrame
>>> df = DataFrame(np.random.randn(7,3))
>>> df.ix[:4,1] = np.nan # set column 1 of rows 0 through 4 to NaN
>>> df.ix[:2,2] = np.nan # set column 2 of rows 0 through 2 to NaN
>>> df
0 1 2
0 -0.041760 NaN NaN
1 -0.005338 NaN NaN
2 -0.728414 NaN NaN
3 -1.836972 NaN 1.114158
4 1.737801 NaN -0.103160
5 -0.968829 -0.573951 0.503281
6 -0.565628 -0.889926 -1.307314
>>> df.fillna(0) # fill every NaN with 0
0 1 2
0 -0.041760 0.000000 0.000000
1 -0.005338 0.000000 0.000000
2 -0.728414 0.000000 0.000000
3 -1.836972 0.000000 1.114158
4 1.737801 0.000000 -0.103160
5 -0.968829 -0.573951 0.503281
6 -0.565628 -0.889926 -1.307314
>>> df.fillna({1:0.5, 2:-1, 3:1}) # fill NaN in column 1 with 0.5 and column 2 with -1; there is no column 3, so that key is ignored
0 1 2
0 -0.041760 0.500000 -1.000000
1 -0.005338 0.500000 -1.000000
2 -0.728414 0.500000 -1.000000
3 -1.836972 0.500000 1.114158
4 1.737801 0.500000 -0.103160
5 -0.968829 -0.573951 0.503281
6 -0.565628 -0.889926 -1.307314

Common statistical methods

Method          Description
count           number of non-NA values
describe        summary statistics for a Series or for each DataFrame column
min/max         minimum / maximum value
argmin/argmax   integer positions at which the minimum / maximum value is found
idxmin/idxmax   index labels at which the minimum / maximum value is found
quantile        sample quantile (a value between 0 and 1)
sum             sum of values
mean            mean of values
median          median of values
mad             mean absolute deviation from the mean
var             sample variance
std             sample standard deviation
cumsum          cumulative sum
cummin/cummax   cumulative minimum / maximum
cumprod         cumulative product
pct_change      percent change between elements
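
A short sketch exercising a few of the methods above on a made-up Series.

>>> s = pd.Series([3, 1, 4, 1, 5], index=['a', 'b', 'c', 'd', 'e'])
>>> s.count(), s.sum(), s.mean(), s.median()
(5, 14, 2.8, 3.0)
>>> s.idxmax()          # index label of the maximum value
'e'
>>> s.cumsum()          # cumulative sum
a     3
b     4
c     8
d     9
e    14
dtype: int64
>>> s.pct_change()      # percent change from the previous element
a         NaN
b   -0.666667
c    3.000000
d   -0.750000
e    4.000000
dtype: float64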

Correlation and covariance

>>> df = DataFrame({'GDP':[12, 23, 34, 45, 56], 'air_temperature': [23, 25, 26, 27, 30], 'year':['2001','2002','2003','2004','2005']})
>>> df
GDP air_temperature year
0 12 23 2001
1 23 25 2002
2 34 26 2003
3 45 27 2004
4 56 30 2005
>>> df.corr()             # only numeric columns are included; 'year' holds strings and is excluded
GDP air_temperature
GDP 1.000000 0.977356
air_temperature 0.977356 1.000000
>>> df.cov()
GDP air_temperature
GDP 302.5 44.0
air_temperature 44.0 6.7

Unique values, value counts, and membership

  • unique returns the array of unique values in a Series (the values with duplicates removed)
  • value_counts counts how often each value occurs in a Series
  • isin performs a vectorized membership test and can be used to select subsets of a Series or of a DataFrame column
>>> from pandas import Series
>>> ser = Series(['a','b','c','b','c','a','d','e','b'])
>>> ser.unique()
array(['a', 'b', 'c', 'd', 'e'], dtype=object)
>>> ser.value_counts()
b 3
a 2
c 2
d 1
e 1
dtype: int64
>>> ser.value_counts(ascending=False) # descending order (the default)
b 3
a 2
c 2
d 1
e 1
dtype: int64
>>> ser.value_counts(ascending=True) # ascending order
e 1
d 1
c 2
a 2
b 3
dtype: int64

# unique values of a DataFrame column
>>> df = DataFrame({'order_id':['1001','1002','1003','1004','1005'], 'member_id':['m01','m01','m02','m01','m02'], 'order_amt':[345, 312.2, 123, 250.2, 235]})
>>> df
order_id member_id order_amt
0 1001 m01 345.0
1 1002 m01 312.2
2 1003 m02 123.0
3 1004 m01 250.2
4 1005 m02 235.0
>>> df['member_id'].unique()
array(['m01', 'm02'], dtype=object)

# membership test with isin
>>> ser
0 a
1 b
2 c
3 b
4 c
5 a
6 d
7 e
8 b
dtype: object
>>> mask = ser.isin(['b','c'])
>>> mask
0 False
1 True
2 True
3 True
4 True
5 False
6 False
7 False
8 True
dtype: bool
>>> ser[mask]
1 b
2 c
3 b
4 c
8 b
dtype: object

Arithmetic operations and data alignment
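
A minimal sketch of arithmetic between two DataFrames with automatic alignment; the frames are made up, and labels that do not appear on both sides produce NaN unless a fill_value is supplied.

>>> df_a = pd.DataFrame({'x': [1, 2], 'y': [3, 4]}, index=['r1', 'r2'])
>>> df_b = pd.DataFrame({'x': [10, 20], 'z': [30, 40]}, index=['r1', 'r3'])
>>> df_a + df_b                     # only (r1, x) exists in both frames
       x   y   z
r1  11.0 NaN NaN
r2   NaN NaN NaN
r3   NaN NaN NaN
>>> df_a.add(df_b, fill_value=0)    # treat a value missing on one side as 0
       x    y     z
r1  11.0  3.0  30.0
r2   2.0  4.0   NaN
r3  20.0  NaN  40.0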

Function application and mapping
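
A minimal sketch of applying functions column-wise (apply), element-wise (applymap), and on a single column (map); the data is made up for illustration.

>>> df = pd.DataFrame({'a': [1, 4], 'b': [9, 16]}, index=['r1', 'r2'])
>>> df.apply(lambda col: col.max() - col.min())   # apply a function to each column
a    3
b    7
dtype: int64
>>> df.applymap(lambda x: x * 2)                  # apply a function to every element
    a   b
r1  2  18
r2  8  32
>>> df['a'].map(lambda x: x * 10)                 # element-wise map on one column
r1    10
r2    40
Name: a, dtype: int64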

Hierarchical indexing

  • Two or more index levels along a single axis
  • Hierarchical indexing lets pandas work with higher-dimensional data in a lower-dimensional form
  • Data can be aggregated level by level
  • Both Series and DataFrame support hierarchical indexes
>>> data = Series([988.44, 95895, 3959, 32554, 1235], index=[['2001','2001','2001','2002','2002'],['苹果','香蕉','西瓜','苹果','西瓜']])
>>> data
2001 苹果 988.44
香蕉 95895.00
西瓜 3959.00
2002 苹果 32554.00
西瓜 1235.00
dtype: float64
>>> data01 = data.swaplevel().sort_index() # swap the index levels and sort
>>> data01
苹果 2001 988.44
2002 32554.00
西瓜 2001 3959.00
2002 1235.00
香蕉 2001 95895.00
dtype: float64

>>> df = DataFrame({'year':[2001, 2001, 2002, 2002, 2003], 'fruit':['apple','banana','apple','banana','apple'], 'production':[2345, 3124, 5668, 2532, 2135], 'profits':[233.44, 4452.2, 1225.2, 7845.2, 2352.2]})
>>> df
year fruit production profits
0 2001 apple 2345 233.44
1 2001 banana 3124 4452.20
2 2002 apple 5668 1225.20
3 2002 banana 2532 7845.20
4 2003 apple 2135 2352.20
>>> df = df.set_index(['year','fruit']) # build a hierarchical (MultiIndex) index
>>> df
production profits
year fruit
2001 apple 2345 233.44
banana 3124 4452.20
2002 apple 5668 1225.20
banana 2532 7845.20
2003 apple 2135 2352.20

>>> df.ix[2002,'apple'] # select by the hierarchical index (.ix is deprecated; df.loc[(2002, 'apple')] is equivalent)
production 5668.0
profits 1225.2
Name: (2002, apple), dtype: float64

>>> df.sum(level='year') # aggregate over the 'year' index level
production profits
year
2001 5469 4685.64
2002 8200 9070.40
2003 2135 2352.20
>>> df.min(level="fruit")
production profits
fruit
apple 2135 233.44
banana 2532 4452.20
>>> df.min(level=['year','fruit'])
production profits
year fruit
2001 apple 2345 233.44
banana 3124 4452.20
2002 apple 5668 1225.20
banana 2532 7845.20
2003 apple 2135 2352.20

References