还有就是今天要聊的是,数据清洗之 数据分组方法

数据分组方法

  • 分组计算根据某个或某几个字段对数据集进行分组,然后运用特点的函数,得到结果
  • 使用groupby方法进行分组计算,得到分组对象GroupBy
  • 语法为df.groupby(by=)
  • 分组对象GroupBy可以运用描述性统计方法,如count(计数)、mean(均值)、median(中位数)、max(最大值)和min(最小值)等
import pandas as pd
import numpy as np
import os

os.getcwd()

'D:\\Jupyter\\notebook\\Python数据清洗实战\\数据清洗之数据统计'

os.chdir('D:\\Jupyter\\notebook\\Python数据清洗实战\\数据')

df = pd.read_csv('online_order.csv', encoding='gbk', dtype={'customer':str, 'order':str})

df.head(5)

<div>

<style scoped>

.dataframe tbody tr th:only-of-type {

    vertical-align: middle;

}

.dataframe tbody tr th {

    vertical-align: top;

}

.dataframe thead th {

    text-align: right;

}

</style>

<table border=”1″ class=”dataframe”>

<thead>

<tr style="text-align: right;">

  <th></th>

  <th>customer</th>

  <th>order</th>

  <th>total_items</th>

  <th>discount%</th>

  <th>weekday</th>

  <th>hour</th>

  <th>Food%</th>

  <th>Fresh%</th>

  <th>Drinks%</th>

  <th>Home%</th>

  <th>Beauty%</th>

  <th>Health%</th>

  <th>Baby%</th>

  <th>Pets%</th>

</tr>

</thead>

<tbody>

<tr>

  <th>0</th>

  <td>0</td>

  <td>0</td>

  <td>45</td>

  <td>23.03</td>

  <td>4</td>

  <td>13</td>

  <td>9.46</td>

  <td>87.06</td>

  <td>3.48</td>

  <td>0.00</td>

  <td>0.00</td>

  <td>0.00</td>

  <td>0.0</td>

  <td>0.0</td>

</tr>

<tr>

  <th>1</th>

  <td>0</td>

  <td>1</td>

  <td>38</td>

  <td>1.22</td>

  <td>5</td>

  <td>13</td>

  <td>15.87</td>

  <td>75.80</td>

  <td>6.22</td>

  <td>2.12</td>

  <td>0.00</td>

  <td>0.00</td>

  <td>0.0</td>

  <td>0.0</td>

</tr>

<tr>

  <th>2</th>

  <td>0</td>

  <td>2</td>

  <td>51</td>

  <td>18.08</td>

  <td>4</td>

  <td>13</td>

  <td>16.88</td>

  <td>56.75</td>

  <td>3.37</td>

  <td>16.48</td>

  <td>6.53</td>

  <td>0.00</td>

  <td>0.0</td>

  <td>0.0</td>

</tr>

<tr>

  <th>3</th>

  <td>1</td>

  <td>3</td>

  <td>57</td>

  <td>16.51</td>

  <td>1</td>

  <td>12</td>

  <td>28.81</td>

  <td>35.99</td>

  <td>11.78</td>

  <td>4.62</td>

  <td>2.87</td>

  <td>15.92</td>

  <td>0.0</td>

  <td>0.0</td>

</tr>

<tr>

  <th>4</th>

  <td>1</td>

  <td>4</td>

  <td>53</td>

  <td>18.31</td>

  <td>2</td>

  <td>11</td>

  <td>24.13</td>

  <td>60.38</td>

  <td>7.78</td>

  <td>7.72</td>

  <td>0.00</td>

  <td>0.00</td>

  <td>0.0</td>

  <td>0.0</td>

</tr>

</tbody>

</table>

</div>

grouped = df.groupby('weekday')

type(grouped)

pandas.core.groupby.generic.DataFrameGroupBy

grouped.mean()

<div>

<style scoped>

.dataframe tbody tr th:only-of-type {

    vertical-align: middle;

}

.dataframe tbody tr th {

    vertical-align: top;

}

.dataframe thead th {

    text-align: right;

}

</style>

<table border=”1″ class=”dataframe”>

<thead>

<tr style="text-align: right;">

  <th></th>

  <th>total_items</th>

  <th>discount%</th>

  <th>hour</th>

  <th>Food%</th>

  <th>Fresh%</th>

  <th>Drinks%</th>

  <th>Home%</th>

  <th>Beauty%</th>

  <th>Health%</th>

  <th>Baby%</th>

  <th>Pets%</th>

</tr>

<tr>

  <th>weekday</th>

  <th></th>

  <th></th>

  <th></th>

  <th></th>

  <th></th>

  <th></th>

  <th></th>

  <th></th>

  <th></th>

  <th></th>

  <th></th>

</tr>

</thead>

<tbody>

<tr>

  <th>1</th>

  <td>30.662177</td>

  <td>8.580705</td>

  <td>14.693122</td>

  <td>22.690866</td>

  <td>20.000904</td>

  <td>22.522993</td>

  <td>13.932553</td>

  <td>6.972394</td>

  <td>1.152285</td>

  <td>11.592562</td>

  <td>1.007306</td>

</tr>

<tr>

  <th>2</th>

  <td>31.868612</td>

  <td>8.638014</td>

  <td>14.966197</td>

  <td>23.994915</td>

  <td>19.407738</td>

  <td>24.346459</td>

  <td>13.559191</td>

  <td>4.903366</td>

  <td>1.079423</td>

  <td>11.277284</td>

  <td>1.272638</td>

</tr>

<tr>

  <th>3</th>

  <td>31.869796</td>

  <td>7.794507</td>

  <td>15.059898</td>

  <td>24.309274</td>

  <td>19.957653</td>

  <td>23.822470</td>

  <td>13.282088</td>

  <td>6.702640</td>

  <td>1.156829</td>

  <td>9.591389</td>

  <td>0.937205</td>

</tr>

<tr>

  <th>4</th>

  <td>32.251899</td>

  <td>8.068155</td>

  <td>14.324185</td>

  <td>24.374364</td>

  <td>21.538027</td>

  <td>24.553266</td>

  <td>13.391946</td>

  <td>4.806528</td>

  <td>1.031490</td>

  <td>9.058201</td>

  <td>1.080473</td>

</tr>

<tr>

  <th>5</th>

  <td>31.406619</td>

  <td>9.159031</td>

  <td>13.386919</td>

  <td>24.602790</td>

  <td>20.549153</td>

  <td>24.976466</td>

  <td>12.485788</td>

  <td>5.431221</td>

  <td>1.248605</td>

  <td>9.655343</td>

  <td>0.908227</td>

</tr>

<tr>

  <th>6</th>

  <td>32.154814</td>

  <td>8.414258</td>

  <td>14.751084</td>

  <td>23.743196</td>

  <td>18.707788</td>

  <td>23.593699</td>

  <td>14.173291</td>

  <td>5.878647</td>

  <td>1.170585</td>

  <td>11.478343</td>

  <td>1.150980</td>

</tr>

<tr>

  <th>7</th>

  <td>32.373837</td>

  <td>8.710171</td>

  <td>16.989535</td>

  <td>22.271512</td>

  <td>21.020359</td>

  <td>21.093767</td>

  <td>13.632481</td>

  <td>5.895322</td>

  <td>1.145938</td>

  <td>13.844250</td>

  <td>0.950391</td>

</tr>

</tbody>

</table>

</div>

grouped.mean()['Food%']

weekday

1    22.690866

2    23.994915

3    24.309274

4    24.374364

5    24.602790

6    23.743196

7    22.271512

Name: Food%, dtype: float64

# 多个字段分组
grouped = df.groupby(by=['customer', 'weekday'])

grouped.sum()['total_items']

customer  weekday

0         4           96

          5           38

1         1          423

          2          127

          4           37

          5           36

10        1           23

          3           26

100       1           38

          2           78

          3           78

          7          135

1000      2            6

10000     6           30

10001     6           15

10002     3           11

          6           42

          7           48

10003     2            4

10004     2           28

          3          131

          4           93

10005     7           29

10006     2           20

          5           27

          7           26

10007     2            6

          6           15

10008     7          123

10009     1            2

                    ... 

9984      6           40

          7           61

9985      6           11

9986      1           50

          6           49

          7           50

9987      1           23

9988      1           18

          4            1

9989      1           27

999       1          173

          2           45

          4           60

          5          137

          7          149

9990      7            8

9991      6           46

9992      1           13

          2           14

          5           25

          6           24

9993      6            8

9994      2           64

          3           57

9995      7           14

9996      7           14

9997      6            5

9998      1           28

          6           10

9999      6            4

Name: total_items, Length: 20777, dtype: int64

正文完