pandas

Overview

pandas data structures: pandas mainly provides Series (one-dimensional), DataFrame (two-dimensional), Panel (three-dimensional), Panel4D (four-dimensional), and PanelND (N-dimensional) structures. Of these, Series and DataFrame are by far the most widely used (the Panel family has since been deprecated).

A Series is a one-dimensional labeled array that can hold any data type: integers, strings, floating-point numbers, Python objects, and so on. Elements of a Series are located by label.
A DataFrame is a two-dimensional labeled data structure. Data can be located by row and column labels, something a plain NumPy array does not offer.
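
Below is a minimal sketch of what label-based access looks like; the labels and values are made up for illustration.

>>> import pandas as pd
>>> s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
>>> s['b']                 # locate a value in a Series by its label
20
>>> df = pd.DataFrame([[1, 2], [3, 4]], index=['r1', 'r2'], columns=['x', 'y'])
>>> df.loc['r2', 'y']      # locate a value in a DataFrame by row and column label
4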

Checking the pandas version

>>> import pandas as pd
>>> print(pd.__version__)
0.24.2
>>> pd.show_versions()

Series

Creating a Series

A Series can be thought of as a dataset made up of a single column of data.
The Series constructor looks like this:

s = pd.Series(data, index=index)

The three most common ways to create a Series are:

Creating a Series from a list

>>> arr = ['a', 'b', 'c', 'd']
>>> s1 = pd.Series(arr) # if no index is given, a default integer index starting at 0 is used
>>> s1
0 a
1 b
2 c
3 d
dtype: object

Creating a Series from an ndarray

>>> import numpy as np
>>> n = np.random.randn(5)  # a 1-D array of 5 samples from the standard normal distribution
>>> index = ["a", "b", "c", "d", "e"]
>>> s2 = pd.Series(n, index)
>>> s2
a 0.077125
b -0.554191
c 0.167562
d -1.214207
e -1.038312
dtype: float64

Creating a Series from a dict

>>> d = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
>>> s3 = pd.Series(d)
>>> s3
a 1
b 2
c 3
d 4
dtype: int64

Basic Series operations

Modifying the index of a Series

>>> s1
0 a
1 b
2 c
3 d
dtype: object
>>> s1.index = ['a1','a2','a3','a4']
>>> s1
a1 a
a2 b
a3 c
a4 d
dtype: object

>>> s1.index[0] = '01'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py", line 3938, in __setitem__
raise TypeError("Index does not support mutable operations")
TypeError: Index does not support mutable operations
# an Index object does not support in-place modification

Concatenating Series vertically

>>> s1.index = ['a', 'b', 'c', 'd']   # reset the labels for this example
>>> s1
a a
b b
c c
d d
dtype: object
>>> s3
a 1
b 2
c 3
d 4
dtype: int64
>>> s4 = s3.append(s1)   # append is deprecated in newer pandas; pd.concat([s3, s1]) is the modern equivalent
>>> s4
a 1
b 2
c 3
d 4
a a
b b
c c
d d
dtype: object

Deleting elements by index label

>>> s4.drop('a')          # drops every entry labeled 'a' and returns a new Series
b 2
c 3
d 4
b b
c c
d d
dtype: object

Modifying elements by index label

>>> s4['a'] = [11, 'a1']
>>> s4
a 11
b 2
c 3
d 4
a a1
b b
c c
d d
dtype: object

Looking up elements by index label

>>> s4['a']               # a duplicated label returns every matching entry
a    11
a    a1
dtype: object

Slicing a Series

>>> s1 = pd.Series(['a', 'b', 'c', 'd'])   # a fresh Series with the default integer index
>>> s1
0 a
1 b
2 c
3 d
dtype: object
>>> s1[1:3]          # positional slice: positions 1 and 2, end point excluded
1 b
2 c
dtype: object

Series arithmetic

Series addition

>>> s1
0 a
1 b
2 c
3 d
dtype: object
>>> s2 = pd.Series(['e', 'f', 'g', 'h'])   # redefine s2 for this example
>>> s2
0 e
1 f
2 g
3 h
dtype: object
>>> s1.add(s2)       # for string (object) values, addition concatenates element-wise
0 ae
1 bf
2 cg
3 dh
dtype: object
>>> s2.add(s1)
0 ea
1 fb
2 gc
3 hd
dtype: object

Series subtraction

>>> s3
a 1
b 2
c 3
d 4
dtype: int64
>>> s4 = pd.Series([4, 3, 2, 1], index=['a', 'bb', 'c', 'd'])   # redefine s4 for this example
>>> s4
a 4
bb 3
c 2
d 1
dtype: int64
>>> s3.sub(s4)       # labels present in only one of the two Series give NaN
a -3.0
b NaN
bb NaN
c 1.0
d 3.0
dtype: float64

Series multiplication

>>> s3
a 1
b 2
c 3
d 4
dtype: int64
>>> s4
a 4
bb 3
c 2
d 1
dtype: int64
>>> s3.mul(s4)
a 4.0
b NaN
bb NaN
c 6.0
d 4.0
dtype: float64

Series division

>>> s3
a 1
b 2
c 3
d 4
dtype: int64
>>> s4
a 4
bb 3
c 2
d 1
dtype: int64
>>> s3.div(s4)
a 0.25
b NaN
bb NaN
c 1.50
d 4.00
dtype: float64

Series median

>>> s3
a 1
b 2
c 3
d 4
dtype: int64
>>> s3.median()
2.5

Series sum

>>> s3
a 1
b 2
c 3
d 4
dtype: int64
>>> s3.sum()
10

Series maximum

>>> s3
a 1
b 2
c 3
d 4
dtype: int64
>>> s3.max()
4

Series minimum

>>> s3
a 1
b 2
c 3
d 4
dtype: int64
>>> s3.min()
1

Detecting missing values in a Series

>>> series_10 = pd.Series({'a':10, 'b':20, 'c':30, 'd':40})
>>> series_10.index
Index(['a', 'b', 'c', 'd'], dtype='object')
>>> new_index = ['a', 'b', 'c', 'd', 'e']
>>> series_20 = pd.Series(series_10, index=new_index)   # labels not found in series_10 become NaN
>>> series_20
a 10.0
b 20.0
c 30.0
d 40.0
e NaN
dtype: float64

isnull and notnull detect missing values in a Series and return a boolean Series.

>>> ser_10 = pd.Series({'a': 10, 'b': 20, 'c': 30})
>>> ser_20 = pd.Series(ser_10, index=['a', 'b', 'c', 'd'])
>>> pd.isnull(ser_10)
a False
b False
c False
dtype: bool
>>> pd.isnull(ser_20)
a False
b False
c False
d True
dtype: bool

>>> pd.notnull(ser_10)
a True
b True
c True
dtype: bool
>>> pd.notnull(ser_20)
a True
b True
c True
d False
dtype: bool

>>> ser_20
a 10.0
b 20.0
c 30.0
d NaN
dtype: float64
>>> ser_20[pd.isnull(ser_20)] # select only the missing values
d NaN
dtype: float64
>>> ser_20[pd.notnull(ser_20)] # keep only the non-missing values
a 10.0
b 20.0
c 30.0
dtype: float64

Automatic alignment in Series operations

>>> series_1 = pd.Series([1,2,3,4], index=['a','b','c','d'])
>>> series_2 = pd.Series([4,3,2,1,5], index=['a','b','c','d','e'])
>>> series_1 + series_2   # indexes are aligned; labels missing on one side give NaN
a 5.0
b 5.0
c 5.0
d 5.0
e NaN
dtype: float64

Series and the name attribute

>>> series_1.name = "hello world"
>>> series_1.index.name = 'xxoo'
>>> series_1
xxoo
a 1
b 2
c 3
d 4
Name: hello world, dtype: int64

DataFrame

Creating a DataFrame

Creating a DataFrame from a NumPy array

>>> dates = pd.date_range('today', periods=6)
>>> dates
DatetimeIndex(['2019-05-07 11:19:46.800556', '2019-05-08 11:19:46.800556',
'2019-05-09 11:19:46.800556', '2019-05-10 11:19:46.800556',
'2019-05-11 11:19:46.800556', '2019-05-12 11:19:46.800556'],
dtype='datetime64[ns]', freq='D')
>>> num_arr = np.random.randn(6,4)
>>> num_arr
array([[ 0.73018504, -0.92169922, -1.37867963, -1.40692501],
[ 1.61475442, 0.07078674, 0.59075899, -1.994313 ],
[ 0.00830036, 0.20939717, 0.50323969, 0.43514028],
[ 0.23984788, 0.15418066, -1.06420981, 0.58894686],
[ 1.30096803, 0.87426295, -0.56629328, -0.14877761],
[ 0.94481793, -1.04911144, -1.39403504, 0.77847438]])
>>> columns=['A','B','C','D']
>>> df1 = pd.DataFrame(num_arr, dates, columns)
>>> df1
A B C D
2019-05-07 11:19:46.800556 0.730185 -0.921699 -1.378680 -1.406925
2019-05-08 11:19:46.800556 1.614754 0.070787 0.590759 -1.994313
2019-05-09 11:19:46.800556 0.008300 0.209397 0.503240 0.435140
2019-05-10 11:19:46.800556 0.239848 0.154181 -1.064210 0.588947
2019-05-11 11:19:46.800556 1.300968 0.874263 -0.566293 -0.148778
2019-05-12 11:19:46.800556 0.944818 -1.049111 -1.394035 0.778474

Creating a DataFrame from a dict

>>> data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
... 'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
... 'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
... 'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
>>>
>>> labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
>>> df2 = pd.DataFrame(data, index=labels)
>>> df2
animal age visits priority
a cat 2.5 1 yes
b cat 3.0 3 yes
c snake 0.5 2 no
d dog NaN 3 yes
e dog 5.0 2 no
f cat 2.0 3 no
g snake 4.5 1 no
h cat NaN 1 yes
i dog 7.0 2 no
j dog 3.0 1 no
>>> data = {'成绩':[90, 90, 99], '姓名': ['Tony','Wayne','Moon'], '爱好':['篮球','排球','乒乓球']}  
>>> df_1 = pd.DataFrame(data)
>>> df_1
成绩 姓名 爱好
0 90 Tony 篮球
1 90 Wayne 排球
2 99 Moon 乒乓球

>>> df_1.index
RangeIndex(start=0, stop=3, step=1)
>>> df_1.index = ['xx','oo','xxoo']
>>> df_1
成绩 姓名 爱好
xx 90 Tony 篮球
oo 90 Wayne 排球
xxoo 99 Moon 乒乓球

Creating a DataFrame from a two-dimensional array

>>> df_1 = pd.DataFrame([['Wayne', 'Tony', 'Moon'],[90, 80, 70]])
>>> df_1
0 1 2
0 Wayne Tony Moon
1 90 80 70
>>> df_1 = pd.DataFrame([['Wayne', 'Tony', 'Moon'],[90, 80, 70]], index=['hello', 'world'])
>>> df_1
0 1 2
hello Wayne Tony Moon
world 90 80 70
>>> df_1 = pd.DataFrame([['Wayne', 'Tony', 'Moon'],[90, 80, 70]], index=['hello', 'world'], columns=['a','b','c'])
>>> df_1
a b c
hello Wayne Tony Moon
world 90 80 70

Accessing data in a DataFrame

Viewing the index, columns, and values

>>> df_1.index
Index(['hello', 'world'], dtype='object')
>>> df_1.columns
Index(['a', 'b', 'c'], dtype='object')
>>> df_1.values
array([['Wayne', 'Tony', 'Moon'],
[90, 80, 70]], dtype=object)

Viewing the first and last rows

>>> df2.head()
animal age visits priority
a cat 2.5 1 yes
b cat 3.0 3 yes
c snake 0.5 2 no
d dog NaN 3 yes
e dog 5.0 2 no
>>> df2.tail()
animal age visits priority
f cat 2.0 3 no
g snake 4.5 1 no
h cat NaN 1 yes
i dog 7.0 2 no
j dog 3.0 1 no

Selecting a single column by label

>>> df1
A B C D
2019-05-07 11:19:46.800556 0.730185 -0.921699 -1.378680 -1.406925
2019-05-08 11:19:46.800556 1.614754 0.070787 0.590759 -1.994313
2019-05-09 11:19:46.800556 0.008300 0.209397 0.503240 0.435140
2019-05-10 11:19:46.800556 0.239848 0.154181 -1.064210 0.588947
2019-05-11 11:19:46.800556 1.300968 0.874263 -0.566293 -0.148778
2019-05-12 11:19:46.800556 0.944818 -1.049111 -1.394035 0.778474
>>> df1['A']
2019-05-07 11:19:46.800556 0.730185
2019-05-08 11:19:46.800556 1.614754
2019-05-09 11:19:46.800556 0.008300
2019-05-10 11:19:46.800556 0.239848
2019-05-11 11:19:46.800556 1.300968
2019-05-12 11:19:46.800556 0.944818
Freq: D, Name: A, dtype: float64

Selecting multiple columns by label

>>> df1[['A','B']]
A B
2019-05-07 11:19:46.800556 0.730185 -0.921699
2019-05-08 11:19:46.800556 1.614754 0.070787
2019-05-09 11:19:46.800556 0.008300 0.209397
2019-05-10 11:19:46.800556 0.239848 0.154181
2019-05-11 11:19:46.800556 1.300968 0.874263
2019-05-12 11:19:46.800556 0.944818 -1.049111

Selecting rows by position

>>> df1.iloc[1:3]
A B C D
2019-05-08 11:19:46.800556 1.614754 0.070787 0.590759 -1.994313
2019-05-09 11:19:46.800556 0.008300 0.209397 0.503240 0.435140

Viewing summary statistics of a DataFrame

>>> df1.describe()
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.806479 -0.110364 -0.551537 -0.291242
std 0.613343 0.736756 0.902708 1.149970
min 0.008300 -1.049111 -1.394035 -1.994313
25% 0.362432 -0.673578 -1.300062 -1.092388
50% 0.837501 0.112484 -0.815252 0.143181
75% 1.211931 0.195593 0.235856 0.550495
max 1.614754 0.874263 0.590759 0.778474

Basic DataFrame operations

Transposing a DataFrame

>>> df1
A B C D
2019-05-07 11:19:46.800556 0.730185 -0.921699 -1.378680 -1.406925
2019-05-08 11:19:46.800556 1.614754 0.070787 0.590759 -1.994313
2019-05-09 11:19:46.800556 0.008300 0.209397 0.503240 0.435140
2019-05-10 11:19:46.800556 0.239848 0.154181 -1.064210 0.588947
2019-05-11 11:19:46.800556 1.300968 0.874263 -0.566293 -0.148778
2019-05-12 11:19:46.800556 0.944818 -1.049111 -1.394035 0.778474
>>> df1.T
2019-05-07 11:19:46.800556 2019-05-08 11:19:46.800556 ... 2019-05-11 11:19:46.800556 2019-05-12 11:19:46.800556
A 0.730185 1.614754 ... 1.300968 0.944818
B -0.921699 0.070787 ... 0.874263 -1.049111
C -1.378680 0.590759 ... -0.566293 -1.394035
D -1.406925 -1.994313 ... -0.148778 0.778474

[4 rows x 6 columns]

Sorting a DataFrame by column

>>> df1
A B C D
2019-05-07 11:19:46.800556 0.730185 -0.921699 -1.378680 -1.406925
2019-05-08 11:19:46.800556 1.614754 0.070787 0.590759 -1.994313
2019-05-09 11:19:46.800556 0.008300 0.209397 0.503240 0.435140
2019-05-10 11:19:46.800556 0.239848 0.154181 -1.064210 0.588947
2019-05-11 11:19:46.800556 1.300968 0.874263 -0.566293 -0.148778
2019-05-12 11:19:46.800556 0.944818 -1.049111 -1.394035 0.778474
>>> df1.sort_values(by='A')
A B C D
2019-05-09 11:19:46.800556 0.008300 0.209397 0.503240 0.435140
2019-05-10 11:19:46.800556 0.239848 0.154181 -1.064210 0.588947
2019-05-07 11:19:46.800556 0.730185 -0.921699 -1.378680 -1.406925
2019-05-12 11:19:46.800556 0.944818 -1.049111 -1.394035 0.778474
2019-05-11 11:19:46.800556 1.300968 0.874263 -0.566293 -0.148778
2019-05-08 11:19:46.800556 1.614754 0.070787 0.590759 -1.994313

Slicing a DataFrame

>>> df1
A B C D
2019-05-07 11:19:46.800556 0.730185 -0.921699 -1.378680 -1.406925
2019-05-08 11:19:46.800556 1.614754 0.070787 0.590759 -1.994313
2019-05-09 11:19:46.800556 0.008300 0.209397 0.503240 0.435140
2019-05-10 11:19:46.800556 0.239848 0.154181 -1.064210 0.588947
2019-05-11 11:19:46.800556 1.300968 0.874263 -0.566293 -0.148778
2019-05-12 11:19:46.800556 0.944818 -1.049111 -1.394035 0.778474
>>> df1[1:3]
A B C D
2019-05-08 11:19:46.800556 1.614754 0.070787 0.590759 -1.994313
2019-05-09 11:19:46.800556 0.008300 0.209397 0.503240 0.435140

Copying a DataFrame

>>> df3 = df2.copy()
>>> df3
animal age visits priority
a cat 2.5 1 yes
b cat 3.0 3 yes
c snake 0.5 2 no
d dog NaN 3 yes
e dog 5.0 2 no
f cat 2.0 3 no
g snake 4.5 1 no
h cat NaN 1 yes
i dog 7.0 2 no
j dog 3.0 1 no

Adding a column

>>> df_1 = pd.DataFrame({'成绩': [90, 90, 99], '姓名': ['Tony', 'Wayne', 'Moon'], '爱好': ['篮球', '排球', '乒乓球']}, index=['xx', 'oo', 'xxoo'])   # recreate the earlier frame
>>> df_1['location'] = ['SH', 'BJ', 'GZ']
>>> df_1
成绩 姓名 爱好 location
xx 90 Tony 篮球 SH
oo 90 Wayne 排球 BJ
xxoo 99 Moon 乒乓球 GZ

Deleting a column

>>> df_1
成绩 姓名 爱好 location
xx 90 Tony 篮球 SH
oo 90 Wayne 排球 BJ
xxoo 99 Moon 乒乓球 GZ
>>> df_1.pop('爱好')
xx 篮球
oo 排球
xxoo 乒乓球
Name: 爱好, dtype: object
>>> df_1
成绩 姓名 location
xx 90 Tony SH
oo 90 Wayne BJ
xxoo 99 Moon GZ

Modifying a column

>>> df_1['location'] = ['SJZ','TY','ZZ']
>>> df_1
成绩 姓名 location
xx 90 Tony SJZ
oo 90 Wayne TY
xxoo 99 Moon ZZ

Getting a row

>>> df_1
成绩 姓名 location
xx 90 Tony SJZ
oo 90 Wayne TY
xxoo 99 Moon ZZ
>>> df_1.ix['xx']          # .ix is deprecated; .loc below does the same
成绩 90
姓名 Tony
location SJZ
Name: xx, dtype: object
>>> df_1.loc['xx']
成绩 90
姓名 Tony
location SJZ
Name: xx, dtype: object

Adding a row

>>> df_1.ix['ooxx'] = [88, 'Bessie', 'QHD']   # df_1.loc['ooxx'] = [...] is the non-deprecated equivalent
>>> df_1
成绩 姓名 location
xx 90 Tony SJZ
oo 90 Wayne TY
xxoo 99 Moon ZZ
ooxx 88 Bessie QHD

Modifying a row

>>> df_1.ix['ooxx'] = [88, 'Bessie', 'SH'] 
>>> df_1
成绩 姓名 location
xx 90 Tony SJZ
oo 90 Wayne TY
xxoo 99 Moon ZZ
ooxx 88 Bessie SH

Deleting a row

>>> df_1.drop('xxoo')     # returns a new DataFrame; df_1 itself is unchanged
成绩 姓名 location
xx 90 Tony SJZ
oo 90 Wayne TY
ooxx 88 Bessie SH

Index objects
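
A brief sketch of how pandas Index objects behave (membership tests, set-like operations, immutability), assuming pandas is imported as pd as elsewhere in the document; the labels are made up for illustration.

>>> idx = pd.Index(['a', 'b', 'c'])
>>> idx
Index(['a', 'b', 'c'], dtype='object')
>>> 'b' in idx                          # membership test
True
>>> idx.union(pd.Index(['c', 'd']))     # Index objects support set-like operations
Index(['a', 'b', 'c', 'd'], dtype='object')
>>> idx[0] = 'z'                        # Index objects are immutable
Traceback (most recent call last):
  ...
TypeError: Index does not support mutable operations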

Basic pandas functionality

Reading data files and text data

The pandas.read_* family of functions reads data from files into a DataFrame. The most commonly used is read_csv, which reads delimited text data.

cat data.txt
wayne;tony;bessie
19;20;21
1;1;0

>>> pd.read_csv('data.txt', sep=';', header=None)
0 1 2
0 wayne tony bessie
1 19 20 21
2 1 1 0

cat data.csv
name,age,sex
wayne,19,1
tony,20,1
bessie,21,0

>>> pd.read_csv('data.csv')
name age sex
0 wayne 19 1
1 tony 20 1
2 bessie 21 0

Indexing, selection, and filtering

>>> df01 = pd.read_csv('data.csv')
>>> df01[df01.columns[1:]]      # every column except the first
   age  sex
0   19    1
1   20    1
2   21    0
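
For the filtering part of this section, here is a short sketch of boolean selection together with loc and iloc; the frame below just recreates the illustrative data.csv contents in code.

>>> df01 = pd.DataFrame({'name': ['wayne', 'tony', 'bessie'],
...                      'age': [19, 20, 21],
...                      'sex': [1, 1, 0]})
>>> df01[df01['age'] > 19]      # boolean filtering on a column
     name  age  sex
1    tony   20    1
2  bessie   21    0
>>> df01.loc[0, 'name']         # label-based: row label 0, column 'name'
'wayne'
>>> df01.iloc[0, 0]             # position-based: first row, first column
'wayne'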

Handling missing values (NaN)

cat data1.csv
name,age,salary,gender
Tom,NaN,456.7,M
Merry,34,345.6,NaN
Gerry,NaN,NaN,NaN
John,23,NaN,M
Joe ,18,385.6,F

>>> df2 = pd.read_csv('data1.csv', sep=',')

>>> df2.isnull()
name age salary gender
0 False True False False
1 False False False True
2 False True True True
3 False False True False
4 False False False False
>>> df2.notnull()
name age salary gender
0 True False True True
1 True True True False
2 True False False False
3 True True False True
4 True True True True
>>> df01 = pd.DataFrame(np.random.randint(1,9,size=(4,4)))
>>> df01
0 1 2 3
0 3 7 6 3
1 2 4 1 5
2 3 2 1 4
3 8 5 8 2


# >>> df01 = pd.DataFrame(np.random.randint(1,9,size=(4,4)), index=['a','b','c','d'])
# >>> df01
# 0 1 2 3
# a 4 2 6 7
# b 3 7 5 6
# c 8 5 3 2
# d 6 7 4 4

>>> df01.ix[1:3,1] = np.NaN # set column 1 of rows 1 through 3 to NaN (.ix is deprecated; df01.loc[1:3, 1] is equivalent here)
>>> df01
0 1 2 3
0 3 7.0 6 3
1 2 NaN 1 5
2 3 NaN 1 4
3 8 NaN 8 2

>>> df01.dropna() # drop rows containing NaN; by default any NaN in a row drops it
0 1 2 3
0 3 7.0 6 3

>>> df01.dropna(how='all') # drop only rows in which every value is NaN
0 1 2 3
0 3 7.0 6 3
1 2 NaN 1 5
2 3 NaN 1 4
3 8 NaN 8 2

>>> df01
0 1 2 3
0 3 7.0 6 3
1 2 NaN 1 5
2 3 NaN 1 4
3 8 NaN 8 2
>>> df01.dropna(axis=1) # drop columns that contain NaN
0 2 3
0 3 6 3
1 2 1 5
2 3 1 4
3 8 8 2
>>> from pandas import DataFrame
>>> df = DataFrame(np.random.randn(7,3))
>>> df.ix[:4,1] = np.nan # set column 1 of rows 0 through 4 to NaN
>>> df.ix[:2,2] = np.nan # set column 2 of rows 0 through 2 to NaN
>>> df
0 1 2
0 -0.041760 NaN NaN
1 -0.005338 NaN NaN
2 -0.728414 NaN NaN
3 -1.836972 NaN 1.114158
4 1.737801 NaN -0.103160
5 -0.968829 -0.573951 0.503281
6 -0.565628 -0.889926 -1.307314
>>> df.fillna(0) # fill every NaN with 0
0 1 2
0 -0.041760 0.000000 0.000000
1 -0.005338 0.000000 0.000000
2 -0.728414 0.000000 0.000000
3 -1.836972 0.000000 1.114158
4 1.737801 0.000000 -0.103160
5 -0.968829 -0.573951 0.503281
6 -0.565628 -0.889926 -1.307314
>>> df.fillna({1:0.5, 2:-1, 3:1}) # fill NaN in column 1 with 0.5 and column 2 with -1; there is no column 3, so that key is ignored
0 1 2
0 -0.041760 0.500000 -1.000000
1 -0.005338 0.500000 -1.000000
2 -0.728414 0.500000 -1.000000
3 -1.836972 0.500000 1.114158
4 1.737801 0.500000 -0.103160
5 -0.968829 -0.573951 0.503281
6 -0.565628 -0.889926 -1.307314

Common statistical methods

Method          Description
count           number of non-NA values
describe        summary statistics for a Series or for each DataFrame column
min/max         minimum / maximum value
argmin/argmax   integer positions at which the minimum / maximum value is found
idxmin/idxmax   index labels at which the minimum / maximum value is found
quantile        sample quantile (a value between 0 and 1)
sum             sum of values
mean            mean of values
median          median of values
mad             mean absolute deviation from the mean
var             sample variance
std             sample standard deviation
cumsum          cumulative sum
cummin/cummax   cumulative minimum / maximum
cumprod         cumulative product
pct_change      percent change between elements
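
A short sketch exercising a few of the methods above on a made-up Series.

>>> s = pd.Series([3, 1, 4, 1, 5], index=['a', 'b', 'c', 'd', 'e'])
>>> s.count(), s.sum(), s.mean(), s.median()
(5, 14, 2.8, 3.0)
>>> s.idxmax()          # index label of the maximum value
'e'
>>> s.cumsum()          # cumulative sum
a     3
b     4
c     8
d     9
e    14
dtype: int64
>>> s.pct_change()      # percent change from the previous element
a         NaN
b   -0.666667
c    3.000000
d   -0.750000
e    4.000000
dtype: float64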

Correlation and covariance

>>> df = DataFrame({'GDP':[12, 23, 34, 45, 56], 'air_temperature': [23, 25, 26, 27, 30], 'year':['2001','2002','2003','2004','2005']})
>>> df
GDP air_temperature year
0 12 23 2001
1 23 25 2002
2 34 26 2003
3 45 27 2004
4 56 30 2005
>>> df.corr()             # only numeric columns are included; 'year' holds strings and is excluded
GDP air_temperature
GDP 1.000000 0.977356
air_temperature 0.977356 1.000000
>>> df.cov()
GDP air_temperature
GDP 302.5 44.0
air_temperature 44.0 6.7

Unique values, value counts, and membership

  • unique returns the array of unique values in a Series (the values with duplicates removed)
  • value_counts counts how often each value occurs in a Series
  • isin performs a vectorized membership test and can be used to select subsets of a Series or of a DataFrame column
>>> from pandas import Series
>>> ser = Series(['a','b','c','b','c','a','d','e','b'])
>>> ser.unique()
array(['a', 'b', 'c', 'd', 'e'], dtype=object)
>>> ser.value_counts()
b 3
a 2
c 2
d 1
e 1
dtype: int64
>>> ser.value_counts(ascending=False) # descending order (the default)
b 3
a 2
c 2
d 1
e 1
dtype: int64
>>> ser.value_counts(ascending=True) # ascending order
e 1
d 1
c 2
a 2
b 3
dtype: int64

# unique values of a DataFrame column
>>> df = DataFrame({'order_id':['1001','1002','1003','1004','1005'], 'member_id':['m01','m01','m02','m01','m02'], 'order_amt':[345, 312.2, 123, 250.2, 235]})
>>> df
order_id member_id order_amt
0 1001 m01 345.0
1 1002 m01 312.2
2 1003 m02 123.0
3 1004 m01 250.2
4 1005 m02 235.0
>>> df['member_id'].unique()
array(['m01', 'm02'], dtype=object)

# membership test with isin
>>> ser
0 a
1 b
2 c
3 b
4 c
5 a
6 d
7 e
8 b
dtype: object
>>> mask = ser.isin(['b','c'])
>>> mask
0 False
1 True
2 True
3 True
4 True
5 False
6 False
7 False
8 True
dtype: bool
>>> ser[mask]
1 b
2 c
3 b
4 c
8 b
dtype: object

Arithmetic operations and data alignment
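
A minimal sketch of arithmetic between two DataFrames with automatic alignment; the frames are made up, and labels that do not appear on both sides produce NaN unless a fill_value is supplied.

>>> df_a = pd.DataFrame({'x': [1, 2], 'y': [3, 4]}, index=['r1', 'r2'])
>>> df_b = pd.DataFrame({'x': [10, 20], 'z': [30, 40]}, index=['r1', 'r3'])
>>> df_a + df_b                     # only (r1, x) exists in both frames
       x   y   z
r1  11.0 NaN NaN
r2   NaN NaN NaN
r3   NaN NaN NaN
>>> df_a.add(df_b, fill_value=0)    # treat a value missing on one side as 0
       x    y     z
r1  11.0  3.0  30.0
r2   2.0  4.0   NaN
r3  20.0  NaN  40.0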

Function application and mapping
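
A minimal sketch of applying functions column-wise (apply), element-wise (applymap), and on a single column (map); the data is made up for illustration.

>>> df = pd.DataFrame({'a': [1, 4], 'b': [9, 16]}, index=['r1', 'r2'])
>>> df.apply(lambda col: col.max() - col.min())   # apply a function to each column
a    3
b    7
dtype: int64
>>> df.applymap(lambda x: x * 2)                  # apply a function to every element
    a   b
r1  2  18
r2  8  32
>>> df['a'].map(lambda x: x * 10)                 # element-wise map on one column
r1    10
r2    40
Name: a, dtype: int64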

Hierarchical indexing

  • Two or more index levels along a single axis
  • Hierarchical indexing lets pandas work with higher-dimensional data in a lower-dimensional form
  • Data can be aggregated level by level
  • Both Series and DataFrame support hierarchical indexes
>>> data = Series([988.44, 95895, 3959, 32554, 1235], index=[['2001','2001','2001','2002','2002'],['苹果','香蕉','西瓜','苹果','西瓜']])
>>> data
2001 苹果 988.44
香蕉 95895.00
西瓜 3959.00
2002 苹果 32554.00
西瓜 1235.00
dtype: float64
>>> data01 = data.swaplevel().sort_index() # swap the index levels and sort
>>> data01
苹果 2001 988.44
2002 32554.00
西瓜 2001 3959.00
2002 1235.00
香蕉 2001 95895.00
dtype: float64

>>> df = DataFrame({'year':[2001, 2001, 2002, 2002, 2003], 'fruit':['apple','banana','apple','banana','apple'], 'production':[2345, 3124, 5668, 2532, 2135], 'profits':[233.44, 4452.2, 1225.2, 7845.2, 2352.2]})
>>> df
year fruit production profits
0 2001 apple 2345 233.44
1 2001 banana 3124 4452.20
2 2002 apple 5668 1225.20
3 2002 banana 2532 7845.20
4 2003 apple 2135 2352.20
>>> df = df.set_index(['year','fruit']) # build a hierarchical (MultiIndex) index
>>> df
production profits
year fruit
2001 apple 2345 233.44
banana 3124 4452.20
2002 apple 5668 1225.20
banana 2532 7845.20
2003 apple 2135 2352.20

>>> df.ix[2002,'apple'] # select by the hierarchical index (.ix is deprecated; df.loc[(2002, 'apple')] is equivalent)
production 5668.0
profits 1225.2
Name: (2002, apple), dtype: float64

>>> df.sum(level='year') # aggregate over the 'year' index level
production profits
year
2001 5469 4685.64
2002 8200 9070.40
2003 2135 2352.20
>>> df.min(level="fruit")
production profits
fruit
apple 2135 233.44
banana 2532 4452.20
>>> df.min(level=['year','fruit'])
production profits
year fruit
2001 apple 2345 233.44
banana 3124 4452.20
2002 apple 5668 1225.20
banana 2532 7845.20
2003 apple 2135 2352.20

References