数据分组方法
- 分组计算根据某个或某几个字段对数据集进行分组,然后运用特点的函数,得到结果
- 使用groupby方法进行分组计算,得到分组对象GroupBy
- 语法为df.groupby(by=)
- 分组对象GroupBy可以运用描述性统计方法,如count(计数)、mean(均值)、median(中位数)、max(最大值)和min(最小值)等
import pandas as pd
import numpy as np
import os
os.getcwd()
'D:\\Jupyter\\notebook\\Python数据清洗实战\\数据清洗之数据统计'
os.chdir('D:\\Jupyter\\notebook\\Python数据清洗实战\\数据')
df = pd.read_csv('online_order.csv', encoding='gbk', dtype={'customer':str, 'order':str})
df.head(5)
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border=”1″ class=”dataframe”>
<thead>
<tr style="text-align: right;">
<th></th>
<th>customer</th>
<th>order</th>
<th>total_items</th>
<th>discount%</th>
<th>weekday</th>
<th>hour</th>
<th>Food%</th>
<th>Fresh%</th>
<th>Drinks%</th>
<th>Home%</th>
<th>Beauty%</th>
<th>Health%</th>
<th>Baby%</th>
<th>Pets%</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>0</td>
<td>0</td>
<td>45</td>
<td>23.03</td>
<td>4</td>
<td>13</td>
<td>9.46</td>
<td>87.06</td>
<td>3.48</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>1</th>
<td>0</td>
<td>1</td>
<td>38</td>
<td>1.22</td>
<td>5</td>
<td>13</td>
<td>15.87</td>
<td>75.80</td>
<td>6.22</td>
<td>2.12</td>
<td>0.00</td>
<td>0.00</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>2</th>
<td>0</td>
<td>2</td>
<td>51</td>
<td>18.08</td>
<td>4</td>
<td>13</td>
<td>16.88</td>
<td>56.75</td>
<td>3.37</td>
<td>16.48</td>
<td>6.53</td>
<td>0.00</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>3</th>
<td>1</td>
<td>3</td>
<td>57</td>
<td>16.51</td>
<td>1</td>
<td>12</td>
<td>28.81</td>
<td>35.99</td>
<td>11.78</td>
<td>4.62</td>
<td>2.87</td>
<td>15.92</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>4</th>
<td>1</td>
<td>4</td>
<td>53</td>
<td>18.31</td>
<td>2</td>
<td>11</td>
<td>24.13</td>
<td>60.38</td>
<td>7.78</td>
<td>7.72</td>
<td>0.00</td>
<td>0.00</td>
<td>0.0</td>
<td>0.0</td>
</tr>
</tbody>
</table>
</div>
grouped = df.groupby('weekday')
type(grouped)
pandas.core.groupby.generic.DataFrameGroupBy
grouped.mean()
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border=”1″ class=”dataframe”>
<thead>
<tr style="text-align: right;">
<th></th>
<th>total_items</th>
<th>discount%</th>
<th>hour</th>
<th>Food%</th>
<th>Fresh%</th>
<th>Drinks%</th>
<th>Home%</th>
<th>Beauty%</th>
<th>Health%</th>
<th>Baby%</th>
<th>Pets%</th>
</tr>
<tr>
<th>weekday</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>1</th>
<td>30.662177</td>
<td>8.580705</td>
<td>14.693122</td>
<td>22.690866</td>
<td>20.000904</td>
<td>22.522993</td>
<td>13.932553</td>
<td>6.972394</td>
<td>1.152285</td>
<td>11.592562</td>
<td>1.007306</td>
</tr>
<tr>
<th>2</th>
<td>31.868612</td>
<td>8.638014</td>
<td>14.966197</td>
<td>23.994915</td>
<td>19.407738</td>
<td>24.346459</td>
<td>13.559191</td>
<td>4.903366</td>
<td>1.079423</td>
<td>11.277284</td>
<td>1.272638</td>
</tr>
<tr>
<th>3</th>
<td>31.869796</td>
<td>7.794507</td>
<td>15.059898</td>
<td>24.309274</td>
<td>19.957653</td>
<td>23.822470</td>
<td>13.282088</td>
<td>6.702640</td>
<td>1.156829</td>
<td>9.591389</td>
<td>0.937205</td>
</tr>
<tr>
<th>4</th>
<td>32.251899</td>
<td>8.068155</td>
<td>14.324185</td>
<td>24.374364</td>
<td>21.538027</td>
<td>24.553266</td>
<td>13.391946</td>
<td>4.806528</td>
<td>1.031490</td>
<td>9.058201</td>
<td>1.080473</td>
</tr>
<tr>
<th>5</th>
<td>31.406619</td>
<td>9.159031</td>
<td>13.386919</td>
<td>24.602790</td>
<td>20.549153</td>
<td>24.976466</td>
<td>12.485788</td>
<td>5.431221</td>
<td>1.248605</td>
<td>9.655343</td>
<td>0.908227</td>
</tr>
<tr>
<th>6</th>
<td>32.154814</td>
<td>8.414258</td>
<td>14.751084</td>
<td>23.743196</td>
<td>18.707788</td>
<td>23.593699</td>
<td>14.173291</td>
<td>5.878647</td>
<td>1.170585</td>
<td>11.478343</td>
<td>1.150980</td>
</tr>
<tr>
<th>7</th>
<td>32.373837</td>
<td>8.710171</td>
<td>16.989535</td>
<td>22.271512</td>
<td>21.020359</td>
<td>21.093767</td>
<td>13.632481</td>
<td>5.895322</td>
<td>1.145938</td>
<td>13.844250</td>
<td>0.950391</td>
</tr>
</tbody>
</table>
</div>
grouped.mean()['Food%']
weekday
1 22.690866
2 23.994915
3 24.309274
4 24.374364
5 24.602790
6 23.743196
7 22.271512
Name: Food%, dtype: float64
# 多个字段分组
grouped = df.groupby(by=['customer', 'weekday'])
grouped.sum()['total_items']
customer weekday
0 4 96
5 38
1 1 423
2 127
4 37
5 36
10 1 23
3 26
100 1 38
2 78
3 78
7 135
1000 2 6
10000 6 30
10001 6 15
10002 3 11
6 42
7 48
10003 2 4
10004 2 28
3 131
4 93
10005 7 29
10006 2 20
5 27
7 26
10007 2 6
6 15
10008 7 123
10009 1 2
...
9984 6 40
7 61
9985 6 11
9986 1 50
6 49
7 50
9987 1 23
9988 1 18
4 1
9989 1 27
999 1 173
2 45
4 60
5 137
7 149
9990 7 8
9991 6 46
9992 1 13
2 14
5 25
6 24
9993 6 8
9994 2 64
3 57
9995 7 14
9996 7 14
9997 6 5
9998 1 28
6 10
9999 6 4
Name: total_items, Length: 20777, dtype: int64
正文完