      data1     data2 key1 key2
0  0.585600  0.670239    a  one
1  1.993652  1.081585    a  two
2 -0.188506 -1.506499    b  one
3  1.016836 -0.084283    b  two
4 -1.428577 -0.464163    a  one
1. Series as the group key
Group by key1 and compute the mean of data1:
grouped = df['data1'].groupby(df['key1'])
grouped.mean()
Out[9]:
key1
a    0.383558
b    0.414165
Name: data1, dtype: float64
The index of the result holds the unique values of the key1 column.
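The frame above is only shown as printed output, so the following is a minimal sketch that rebuilds it from those printed values and repeats the single-key grouping. The variable names `means_by_key1` and `sizes` are illustrative and not part of the original post.

```python
import pandas as pd

# Rebuild the sample frame from the values printed at the top of the post.
df = pd.DataFrame({
    "data1": [0.585600, 1.993652, -0.188506, 1.016836, -1.428577],
    "data2": [0.670239, 1.081585, -1.506499, -0.084283, -0.464163],
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
})

# Grouping a Series by another Series of the same length.
grouped = df["data1"].groupby(df["key1"])

# Nothing is computed until an aggregation is requested.
means_by_key1 = grouped.mean()   # one value per unique key in key1
sizes = grouped.size()           # number of rows that fell into each group
```

The result is indexed by the distinct keys, and `size()` is a quick way to check how many rows each mean is based on.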
If we pass a list of several arrays instead, the data is grouped by each unique combination of keys:
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means
Out[13]:
key1  key2
a     one    -0.421489
      two     1.993652
b     one    -0.188506
      two     1.016836
Name: data1, dtype: float64
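Grouping by two keys yields a Series with a hierarchical index (a MultiIndex) made up of the unique key pairs. unstack() pivots the inner level (key2) into columns: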
means.unstack()
Out[14]:
key2       one       two
key1
a    -0.421489  1.993652
b    -0.188506  1.016836
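As a rough sketch of how the hierarchical result and its unstacked form relate, the snippet below rebuilds the relevant columns from the printed values. `wide` and `by_key1_cols` are illustrative names only.

```python
import pandas as pd

# Only the columns needed for the two-key grouping, taken from the printed frame.
df = pd.DataFrame({
    "data1": [0.585600, 1.993652, -0.188506, 1.016836, -1.428577],
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
})

# Two keys produce a Series indexed by (key1, key2) pairs.
means = df["data1"].groupby([df["key1"], df["key2"]]).mean()

# unstack() moves the innermost level (key2) into the columns.
wide = means.unstack()

# A level name can be given to pivot a different level instead.
by_key1_cols = means.unstack("key1")
```

Each cell of `wide` is one (key1, key2) combination, so a combination that never occurs in the data would show up as NaN.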
2. Arrays as the group key
The group key can be any array of the right length, that is, one with as many elements as there are rows being grouped.
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
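A small sketch of grouping by plain arrays that are not columns of the frame. The `years` array is a hypothetical second key added for illustration; it does not appear in the post.

```python
import numpy as np
import pandas as pd

# data1 values from the printed sample frame.
data1 = pd.Series([0.585600, 1.993652, -0.188506, 1.016836, -1.428577])

# External arrays used as keys; each needs one entry per row of data1.
states = np.array(["Ohio", "California", "California", "Ohio", "Ohio"])
years = np.array([2005, 2005, 2006, 2005, 2006])  # hypothetical second key

by_state = data1.groupby(states).mean()
by_state_year = data1.groupby([states, years]).mean()
```

The keys are matched to rows by position, which is why the arrays have to be exactly as long as the data being grouped.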
Out[21]:
         data1     data2
key1
a     0.383558  0.429221
b     0.414165 -0.795391
df.groupby(['key1','key2']).mean()
Out[22]:
              data1     data2
key1 key2
a    one  -0.421489  0.103038
     two   1.993652  1.081585
b    one  -0.188506 -1.506499
     two   1.016836 -0.084283
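The mean is computed only for numeric columns. In the single-key result above, the non-numeric key2 column is left out; pandas 2.0 and later raise a TypeError in that situation unless numeric_only=True is passed.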
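One way to make the "numeric columns only" behaviour explicit, sketched on the sample frame rebuilt from its printed values. Both variants below are ordinary pandas calls; which one reads better is a matter of taste.

```python
import pandas as pd

df = pd.DataFrame({
    "data1": [0.585600, 1.993652, -0.188506, 1.016836, -1.428577],
    "data2": [0.670239, 1.081585, -1.506499, -0.084283, -0.464163],
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
})

# Option 1: ask for numeric columns only, so the string column key2 is skipped.
numeric_means = df.groupby("key1").mean(numeric_only=True)

# Option 2: select the columns to aggregate before calling mean().
selected_means = df.groupby("key1")[["data1", "data2"]].mean()
```

Selecting columns first also avoids aggregating data that is not needed, which matters more on large frames.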
4. Dictionaries or Series as the group key
Consider the following data:
people = DataFrame(np.random.randn(5,5), columns=['a','b','c','d','e'],index=['Joe','Steve','Wes','Jim','Travis'])
people.iloc[2:3, [1, 2]] = np.nan  # .ix was removed in pandas 1.0; this sets columns b and c of row 'Wes' to NaN
people
Out[56]:
               a         b         c         d         e
Joe     0.561986 -0.473072  0.195847 -0.939898 -0.031220
Steve  -1.113971  1.043835 -0.759549 -0.426541 -0.175329
Wes     0.933713       NaN       NaN  0.134446  0.001153
Jim     0.262178 -1.242311 -1.482147 -0.206859  0.928791
Travis  0.714951  0.760905 -1.027728  1.124935  0.213150
        blue  red
Joe        2    3
Steve      2    3
Wes        1    2
Jim        2    3
Travis     2    3
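The code that produced the blue/red counts is not shown, so the following is only a sketch of one way to get a table of that shape. It assumes a hypothetical `mapping` from column names to colour groups (chosen so that the counts agree with the table above) and groups the transposed frame, which avoids the `axis=1` form that newer pandas versions deprecate.

```python
import pandas as pd

nan = float("nan")
# The people frame, rebuilt from the printed values.
people = pd.DataFrame(
    [[0.561986, -0.473072, 0.195847, -0.939898, -0.031220],
     [-1.113971, 1.043835, -0.759549, -0.426541, -0.175329],
     [0.933713, nan, nan, 0.134446, 0.001153],
     [0.262178, -1.242311, -1.482147, -0.206859, 0.928791],
     [0.714951, 0.760905, -1.027728, 1.124935, 0.213150]],
    columns=["a", "b", "c", "d", "e"],
    index=["Joe", "Steve", "Wes", "Jim", "Travis"],
)

# Hypothetical column-to-group mapping; the unused key 'f' is simply ignored.
mapping = {"a": "red", "b": "red", "c": "blue",
           "d": "blue", "e": "red", "f": "orange"}

# Group the columns by transposing, grouping the rows, and transposing back.
counts = people.T.groupby(mapping).count().T

# A Series works the same way as the dict: its index plays the role of the keys.
map_series = pd.Series(mapping)
counts_from_series = people.T.groupby(map_series).count().T
```

`count()` ignores NaN, which is why the row for Wes has smaller counts than the others.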
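5. Functions as the group key
Any function passed as a group key is called once per index value, and its return values are used as the group names. Here the index holds people's names, so grouping by len groups them by name length: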
people.groupby(len).sum()
Out[66]:
          a         b         c         d         e
3  1.757877 -1.715382 -1.286300 -1.012311  0.898724
5 -1.113971  1.043835 -0.759549 -0.426541 -0.175329
6  0.714951  0.760905 -1.027728  1.124935  0.213150
Functions can be freely mixed with arrays, lists, dictionaries, and Series, because everything is converted to arrays internally:
key_list = ['one']*3 + ['two']*2
people.groupby([len,key_list]).min()
Out[71]:
              a         b         c         d         e
3 one  0.561986 -0.473072  0.195847 -0.939898 -0.031220
  two  0.262178 -1.242311 -1.482147 -0.206859  0.928791
5 one -1.113971  1.043835 -0.759549 -0.426541 -0.175329
6 two  0.714951  0.760905 -1.027728  1.124935  0.213150
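To make the "called once per index value" behaviour concrete, here is a sketch on a deliberately tiny frame with made-up values (only the index labels match the post). It compares grouping by `len` with grouping by the explicitly computed lengths, then mixes the function with a list key.

```python
import pandas as pd

# Toy values, chosen so the sums are easy to check by hand.
people = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0, 4.0, 5.0]},
    index=["Joe", "Steve", "Wes", "Jim", "Travis"],
)

# len is applied to each index label: 3, 5, 3, 3, 6.
by_len = people.groupby(len).sum()

# Equivalent spelling: compute the keys first, then group by the result.
explicit = people.groupby(people.index.map(len)).sum()

# A function and a list can be combined; each contributes one key per row.
key_list = ["one", "one", "one", "two", "two"]
mixed = people.groupby([len, key_list]).min()
```

Writing the keys out explicitly, as in `explicit`, can be easier to debug when the grouping function is more involved than `len`.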
for name, group in df.groupby(['key1','key2']):
    print(name)
    print(group)

('a', 'one')
      data1     data2 key1 key2
0  0.585600  0.670239    a  one
4 -1.428577 -0.464163    a  one
('a', 'two')
      data1     data2 key1 key2
1  1.993652  1.081585    a  two
('b', 'one')
      data1     data2 key1 key2
2 -0.188506 -1.506499    b  one
('b', 'two')
      data1     data2 key1 key2
3  1.016836 -0.084283    b  two
The groups can also be collected into a dictionary of DataFrame pieces:
pieces = dict(list(df.groupby('key1')))
pieces['b']
Out[43]:
      data1     data2 key1 key2
2 -0.188506 -1.506499    b  one
3  1.016836 -0.084283    b  two
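The dictionary trick works because iterating over a GroupBy object yields (name, group) pairs. The sketch below shows both forms on the sample frame rebuilt from its printed values; `rows_per_group` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "data1": [0.585600, 1.993652, -0.188506, 1.016836, -1.428577],
    "data2": [0.670239, 1.081585, -1.506499, -0.084283, -0.464163],
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
})

# With several keys, each group name is a tuple that can be unpacked directly.
rows_per_group = {}
for (k1, k2), group in df.groupby(["key1", "key2"]):
    rows_per_group[(k1, k2)] = group.index.tolist()

# (name, group) pairs are exactly what dict() expects.
pieces = dict(list(df.groupby("key1")))
b_rows = pieces["b"]
```

Each value in `pieces` is an ordinary DataFrame that keeps the original row labels, so it can be inspected or processed on its own.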
{dtype('float64'):       data1     data2
 0  0.585600  0.670239
 1  1.993652  1.081585
 2 -0.188506 -1.506499
 3  1.016836 -0.084283
 4 -1.428577 -0.464163,
 dtype('O'):   key1 key2
 0    a  one
 1    a  two
 2    b  one
 3    b  two
 4    a  one}
   total_bill   tip     sex smoker  day    time  size   tip_pct
0       16.99  1.01  Female     No  Sun  Dinner     2  0.059447
1       10.34  1.66    Male     No  Sun  Dinner     3  0.160542
2       21.01  3.50    Male     No  Sun  Dinner     3  0.166587
3       23.68  3.31    Male     No  Sun  Dinner     2  0.139780
4       24.59  3.61  Female     No  Sun  Dinner     4  0.146808
sex     smoker
Female  No        0.156921
        Yes       0.182150
Male    No        0.160669
        Yes       0.152771
Name: tip_pct, dtype: float64
If a list of functions or function names is passed, the result is a DataFrame whose columns are named after the functions by default:
group_pct.agg(['mean','std',peak_to_peak])
Out[93]:
                   mean       std  peak_to_peak
sex    smoker
Female No      0.156921  0.036421      0.195876
       Yes     0.182150  0.071595      0.360233
Male   No      0.160669  0.041849      0.220186
       Yes     0.152771  0.090588      0.674707
Passing a list of (name, function) tuples sets the DataFrame's column names explicitly:
group_pct.agg([('foo','mean'),('bar',np.std)])
Out[94]:
                    foo       bar
sex    smoker
Female No      0.156921  0.036421
       Yes     0.182150  0.071595
Male   No      0.160669  0.041849
       Yes     0.152771  0.090588
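Neither the tips data nor `peak_to_peak` is defined in this part of the post, so the sketch below uses a small made-up frame and assumes that `peak_to_peak` returns the range (max minus min) of each group. It shows the list form, the (name, function) form, and keyword-based named aggregation as an alternative spelling.

```python
import pandas as pd

# Made-up stand-in for the tips data; only the column names follow the post.
tips = pd.DataFrame({
    "sex": ["Female", "Female", "Male", "Male", "Male"],
    "smoker": ["No", "No", "No", "Yes", "Yes"],
    "tip_pct": [0.10, 0.14, 0.16, 0.20, 0.30],
})

def peak_to_peak(arr):
    """Assumed definition: spread between the largest and smallest value."""
    return arr.max() - arr.min()

group_pct = tips.groupby(["sex", "smoker"])["tip_pct"]

# A list of functions or names: one column per entry, named after the function.
stats = group_pct.agg(["mean", "std", peak_to_peak])

# (name, function) tuples: the first element becomes the column name.
renamed = group_pct.agg([("foo", "mean"), ("bar", "std")])

# Named aggregation via keywords gives the same column names.
named = group_pct.agg(foo="mean", bar="std")
```

Custom functions take their column name from the function's `__name__`, so lambdas are better passed with an explicit name.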
A dictionary mapping column names to functions can be passed directly:
grouped.agg({'tip':np.max, 'size':'sum'})
Out[95]:
                tip  size
sex    smoker
Female No       5.2   140
       Yes      6.5    74
Male   No       9.0   263
       Yes     10.0   150
                tip_pct                                size
                    min       max      mean       std   sum
sex    smoker
Female No      0.056797  0.252672  0.156921  0.036421   140
       Yes     0.056433  0.416667  0.182150  0.071595    74
Male   No      0.071804  0.291990  0.160669  0.041849   263
       Yes     0.035638  0.710345  0.152771  0.090588   150
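A sketch of dictionary-based aggregation on a small made-up frame (the values are hypothetical; the column names follow the post). The second call shows how giving one column a list of functions leads to two-level column labels like those in the table above.

```python
import pandas as pd

# Made-up stand-in for the tips data.
tips = pd.DataFrame({
    "sex": ["Female", "Female", "Male", "Male"],
    "smoker": ["No", "No", "No", "Yes"],
    "tip": [1.0, 3.0, 2.0, 5.0],
    "size": [2, 4, 3, 2],
    "tip_pct": [0.10, 0.14, 0.16, 0.20],
})

grouped = tips.groupby(["sex", "smoker"])

# One function per column: the result keeps flat column names.
per_column = grouped.agg({"tip": "max", "size": "sum"})

# A list for any column makes the result's columns hierarchical.
mixed = grouped.agg({"tip_pct": ["min", "max"], "size": "sum"})
```

With hierarchical columns, a single statistic is addressed by a tuple such as `("tip_pct", "min")`.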
     total_bill   tip     sex smoker  day    time  size   tip_pct
109       14.31  4.00  Female    Yes  Sat  Dinner     2  0.279525
183       23.17  6.50    Male    Yes  Sun  Dinner     4  0.280535
232       11.61  3.39    Male     No  Sat  Dinner     2  0.291990
67         3.07  1.00  Female    Yes  Sat  Dinner     1  0.325733
178        9.60  4.00  Female    Yes  Sun  Dinner     2  0.416667
172        7.25  5.15    Male    Yes  Sun  Dinner     2  0.710345
Grouping by smoker and calling apply with this function gives:
tips.groupby('smoker').apply(top)
Out[99]:
            total_bill   tip     sex smoker   day    time  size   tip_pct
smoker
No     88        24.71  5.85    Male     No  Thur   Lunch     2  0.236746
       185       20.69  5.00    Male     No   Sun  Dinner     5  0.241663
       51        10.29  2.60  Female     No   Sun  Dinner     2  0.252672
       149        7.51  2.00    Male     No  Thur   Lunch     2  0.266312
       232       11.61  3.39    Male     No   Sat  Dinner     2  0.291990
Yes    109       14.31  4.00  Female    Yes   Sat  Dinner     2  0.279525
       183       23.17  6.50    Male    Yes   Sun  Dinner     4  0.280535
       67         3.07  1.00  Female    Yes   Sat  Dinner     1  0.325733
       178        9.60  4.00  Female    Yes   Sun  Dinner     2  0.416667
       172        7.25  5.15    Male    Yes   Sun  Dinner     2  0.710345
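The definition of `top` is not included in this part of the post. Judging from the output, it appears to return the rows with the largest tip_pct values, so the sketch below assumes a definition along those lines and runs it on a small made-up frame. The value columns are selected explicitly before `apply`, which sidesteps version differences in how pandas treats the grouping column.

```python
import pandas as pd

# Made-up stand-in for the tips data.
tips = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "smoker": ["No", "No", "No", "Yes", "Yes", "Yes"],
    "tip_pct": [0.10, 0.30, 0.20, 0.15, 0.05, 0.25],
})

def top(frame, n=2, column="tip_pct"):
    """Assumed definition: the n rows with the largest values in `column`."""
    return frame.sort_values(by=column)[-n:]

# apply calls top once per group and glues the pieces together,
# labelling each piece with its group key.
result = tips.groupby("smoker")[["total_bill", "tip_pct"]].apply(top)

# Extra keyword arguments are forwarded to the function.
top1_by_bill = tips.groupby("smoker")[["total_bill", "tip_pct"]].apply(
    top, n=1, column="total_bill"
)
```

The result carries a two-level index: the group key on the outside and the original row label on the inside, the same layout as the output above.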