数据清洗之 数据整理

数据整理

  • 定义
    • 在数据清洗过程中,很多时候需要将不同的数据整理在一起,方便后续的分析,这个过程也叫数据合并
  • 合并方法
    • 常见的合并方法有堆叠和按主键进行合并,堆叠又分为横向堆叠和纵向堆叠,按主键合并类似于sql里面的关联操作
  • 横向堆叠将两张表或多张表在X轴方向,即横向拼接在一起
  • 纵向堆叠将两张表或多张表在Y轴方向,即纵向拼接在一起
  • 注意join 取inner或者outer,分别代表交集和并集
# 用于读取excel文件
import xlrd
import os
import pandas as pd
import numpy as np

os.getcwd()

'D:\\Jupyter\\notebook\\Python数据清洗实战\\数据清洗之数据表处理'

os.chdir('D:\\Jupyter\\notebook\\Python数据清洗实战\\数据')

workbook = xlrd.open_workbook('meal_order_detail.xlsx')

# 获取工作表的名称
sheet_name = workbook.sheet_names()

sheet_name

['meal_order_detail1', 'meal_order_detail2', 'meal_order_detail3']

order1 = pd.read_excel('meal_order_detail.xlsx', sheet_name='meal_order_detail1')
order2 = pd.read_excel('meal_order_detail.xlsx', sheet_name='meal_order_detail2')
order3 = pd.read_excel('meal_order_detail.xlsx', sheet_name='meal_order_detail3')

# ignore_index=True, 表示重置索引
order = pd.concat([order1, order2, order3], axis=0, ignore_index=True)

# 获取末5行
order.tail(5)

<div>

<style scoped>

.dataframe tbody tr th:only-of-type {

    vertical-align: middle;

}

.dataframe tbody tr th {

    vertical-align: top;

}

.dataframe thead th {

    text-align: right;

}

</style>

<table border=”1″ class=”dataframe”>

<thead>

<tr style="text-align: right;">

  <th></th>

  <th>detail_id</th>

  <th>order_id</th>

  <th>dishes_id</th>

  <th>logicprn_name</th>

  <th>parent_class_name</th>

  <th>dishes_name</th>

  <th>itemis_add</th>

  <th>counts</th>

  <th>amounts</th>

  <th>cost</th>

  <th>place_order_time</th>

  <th>discount_amt</th>

  <th>discount_reason</th>

  <th>kick_back</th>

  <th>add_inprice</th>

  <th>add_info</th>

  <th>bar_code</th>

  <th>picture_file</th>

  <th>emp_id</th>

</tr>

</thead>

<tbody>

<tr>

  <th>10032</th>

  <td>5683</td>

  <td>672</td>

  <td>610049</td>

  <td>NaN</td>

  <td>NaN</td>

  <td>爆炒双丝</td>

  <td>0</td>

  <td>1</td>

  <td>35</td>

  <td>NaN</td>

  <td>2016-08-31 21:53:30</td>

  <td>NaN</td>

  <td>NaN</td>

  <td>NaN</td>

  <td>0</td>

  <td>NaN</td>

  <td>NaN</td>

  <td>caipu/301003.jpg</td>

  <td>1089</td>

</tr>

<tr>

  <th>10033</th>

  <td>5686</td>

  <td>672</td>

  <td>609959</td>

  <td>NaN</td>

  <td>NaN</td>

  <td>小炒羊腰\r\n\r\n\r\n</td>

  <td>0</td>

  <td>1</td>

  <td>36</td>

  <td>NaN</td>

  <td>2016-08-31 21:54:40</td>

  <td>NaN</td>

  <td>NaN</td>

  <td>NaN</td>

  <td>0</td>

  <td>NaN</td>

  <td>NaN</td>

  <td>caipu/202005.jpg</td>

  <td>1089</td>

</tr>

<tr>

  <th>10034</th>

  <td>5379</td>

  <td>647</td>

  <td>610012</td>

  <td>NaN</td>

  <td>NaN</td>

  <td>香菇鹌鹑蛋</td>

  <td>0</td>

  <td>1</td>

  <td>39</td>

  <td>NaN</td>

  <td>2016-08-31 21:54:44</td>

  <td>NaN</td>

  <td>NaN</td>

  <td>NaN</td>

  <td>0</td>

  <td>NaN</td>

  <td>NaN</td>

  <td>caipu/302001.jpg</td>

  <td>1094</td>

</tr>

<tr>

  <th>10035</th>

  <td>5380</td>

  <td>647</td>

  <td>610054</td>

  <td>NaN</td>

  <td>NaN</td>

  <td>不加一滴油的酸奶蛋糕</td>

  <td>0</td>

  <td>1</td>

  <td>7</td>

  <td>NaN</td>

  <td>2016-08-31 21:55:24</td>

  <td>NaN</td>

  <td>NaN</td>

  <td>NaN</td>

  <td>0</td>

  <td>NaN</td>

  <td>NaN</td>

  <td>caipu/501003.jpg</td>

  <td>1094</td>

</tr>

<tr>

  <th>10036</th>

  <td>5688</td>

  <td>672</td>

  <td>609953</td>

  <td>NaN</td>

  <td>NaN</td>

  <td>凉拌菠菜</td>

  <td>0</td>

  <td>1</td>

  <td>27</td>

  <td>NaN</td>

  <td>2016-08-31 21:56:54</td>

  <td>NaN</td>

  <td>NaN</td>

  <td>NaN</td>

  <td>0</td>

  <td>NaN</td>

  <td>NaN</td>

  <td>caipu/303004.jpg</td>

  <td>1089</td>

</tr>

</tbody>

</table>

</div>

basic = pd.DataFrame()

# 可以使用for循环进行合并
for i in sheet_name:
    basic_i = pd.read_excel('meal_order_detail.xlsx', sheet_name=i)
    basic = pd.concat([basic, basic_i], axis=0, ignore_index=True)

basic.shape

(10037, 19)

basic.tail(5)

<div>

<style scoped>

.dataframe tbody tr th:only-of-type {

    vertical-align: middle;

}

.dataframe tbody tr th {

    vertical-align: top;

}

.dataframe thead th {

    text-align: right;

}

</style>

<table border=”1″ class=”dataframe”>

<thead>

<tr style="text-align: right;">

  <th></th>

  <th>detail_id</th>

  <th>order_id</th>

  <th>dishes_id</th>

  <th>logicprn_name</th>

  <th>parent_class_name</th>

  <th>dishes_name</th>

  <th>itemis_add</th>

  <th>counts</th>

  <th>amounts</th>

  <th>cost</th>

  <th>place_order_time</th>

  <th>discount_amt</th>

  <th>discount_reason</th>

  <th>kick_back</th>

  <th>add_inprice</th>

  <th>add_info</th>

  <th>bar_code</th>

  <th>picture_file</th>

  <th>emp_id</th>

</tr>

</thead>

<tbody>

<tr>

  <th>10032</th>

  <td>5683</td>

  <td>672</td>

  <td>610049</td>

  <td>NaN</td>

  <td>NaN</td>

  <td>爆炒双丝</td>

  <td>0</td>

  <td>1</td>

  <td>35</td>

  <td>NaN</td>

  <td>2016-08-31 21:53:30</td>

  <td>NaN</td>

  <td>NaN</td>

  <td>NaN</td>

  <td>0</td>

  <td>NaN</td>

  <td>NaN</td>

  <td>caipu/301003.jpg</td>

  <td>1089</td>

</tr>

<tr>

  <th>10033</th>

  <td>5686</td>

  <td>672</td>

  <td>609959</td>

  <td>NaN</td>

  <td>NaN</td>

  <td>小炒羊腰\r\n\r\n\r\n</td>

  <td>0</td>

  <td>1</td>

  <td>36</td>

  <td>NaN</td>

  <td>2016-08-31 21:54:40</td>

  <td>NaN</td>

  <td>NaN</td>

  <td>NaN</td>

  <td>0</td>

  <td>NaN</td>

  <td>NaN</td>

  <td>caipu/202005.jpg</td>

  <td>1089</td>

</tr>

<tr>

  <th>10034</th>

  <td>5379</td>

  <td>647</td>

  <td>610012</td>

  <td>NaN</td>

  <td>NaN</td>

  <td>香菇鹌鹑蛋</td>

  <td>0</td>

  <td>1</td>

  <td>39</td>

  <td>NaN</td>

  <td>2016-08-31 21:54:44</td>

  <td>NaN</td>

  <td>NaN</td>

  <td>NaN</td>

  <td>0</td>

  <td>NaN</td>

  <td>NaN</td>

  <td>caipu/302001.jpg</td>

  <td>1094</td>

</tr>

<tr>

  <th>10035</th>

  <td>5380</td>

  <td>647</td>

  <td>610054</td>

  <td>NaN</td>

  <td>NaN</td>

  <td>不加一滴油的酸奶蛋糕</td>

  <td>0</td>

  <td>1</td>

  <td>7</td>

  <td>NaN</td>

  <td>2016-08-31 21:55:24</td>

  <td>NaN</td>

  <td>NaN</td>

  <td>NaN</td>

  <td>0</td>

  <td>NaN</td>

  <td>NaN</td>

  <td>caipu/501003.jpg</td>

  <td>1094</td>

</tr>

<tr>

  <th>10036</th>

  <td>5688</td>

  <td>672</td>

  <td>609953</td>

  <td>NaN</td>

  <td>NaN</td>

  <td>凉拌菠菜</td>

  <td>0</td>

  <td>1</td>

  <td>27</td>

  <td>NaN</td>

  <td>2016-08-31 21:56:54</td>

  <td>NaN</td>

  <td>NaN</td>

  <td>NaN</td>

  <td>0</td>

  <td>NaN</td>

  <td>NaN</td>

  <td>caipu/303004.jpg</td>

  <td>1089</td>

</tr>

</tbody>

</table>

</div>

df = pd.read_csv('baby_trade_history.csv', dtype={'user_id': str})

df1 = pd.read_csv('sam_tianchi_mum_baby.csv', dtype={'user_id': str})

# 基本信息数据
df1.head(5)

<div>

<style scoped>

.dataframe tbody tr th:only-of-type {

    vertical-align: middle;

}

.dataframe tbody tr th {

    vertical-align: top;

}

.dataframe thead th {

    text-align: right;

}

</style>

<table border=”1″ class=”dataframe”>

<thead>

<tr style="text-align: right;">

  <th></th>

  <th>user_id</th>

  <th>birthday</th>

  <th>gender</th>

</tr>

</thead>

<tbody>

<tr>

  <th>0</th>

  <td>2757</td>

  <td>20130311</td>

  <td>1</td>

</tr>

<tr>

  <th>1</th>

  <td>415971</td>

  <td>20121111</td>

  <td>0</td>

</tr>

<tr>

  <th>2</th>

  <td>1372572</td>

  <td>20120130</td>

  <td>1</td>

</tr>

<tr>

  <th>3</th>

  <td>10339332</td>

  <td>20110910</td>

  <td>0</td>

</tr>

<tr>

  <th>4</th>

  <td>10642245</td>

  <td>20130213</td>

  <td>0</td>

</tr>

</tbody>

</table>

</div>

# 交易数据
df.head(5)

<div>

<style scoped>

.dataframe tbody tr th:only-of-type {

    vertical-align: middle;

}

.dataframe tbody tr th {

    vertical-align: top;

}

.dataframe thead th {

    text-align: right;

}

</style>

<table border=”1″ class=”dataframe”>

<thead>

<tr style="text-align: right;">

  <th></th>

  <th>user_id</th>

  <th>auction_id</th>

  <th>cat_id</th>

  <th>cat1</th>

  <th>property</th>

  <th>buy_mount</th>

  <th>day</th>

</tr>

</thead>

<tbody>

<tr>

  <th>0</th>

  <td>786295544</td>

  <td>41098319944</td>

  <td>50014866</td>

  <td>50022520</td>

  <td>21458:86755362;13023209:3593274;10984217:21985...</td>

  <td>2</td>

  <td>20140919</td>

</tr>

<tr>

  <th>1</th>

  <td>532110457</td>

  <td>17916191097</td>

  <td>50011993</td>

  <td>28</td>

  <td>21458:11399317;1628862:3251296;21475:137325;16...</td>

  <td>1</td>

  <td>20131011</td>

</tr>

<tr>

  <th>2</th>

  <td>249013725</td>

  <td>21896936223</td>

  <td>50012461</td>

  <td>50014815</td>

  <td>21458:30992;1628665:92012;1628665:3233938;1628...</td>

  <td>1</td>

  <td>20131011</td>

</tr>

<tr>

  <th>3</th>

  <td>917056007</td>

  <td>12515996043</td>

  <td>50018831</td>

  <td>50014815</td>

  <td>21458:15841995;21956:3494076;27000458:59723383...</td>

  <td>2</td>

  <td>20141023</td>

</tr>

<tr>

  <th>4</th>

  <td>444069173</td>

  <td>20487688075</td>

  <td>50013636</td>

  <td>50008168</td>

  <td>21458:30992;13658074:3323064;1628665:3233941;1...</td>

  <td>1</td>

  <td>20141103</td>

</tr>

</tbody>

</table>

</div>

# 将df1添加到df右边,关联字段为 "user_id"
df2 = pd.merge(left=df, right=df1, how='inner', left_on='user_id', right_on='user_id')

df2.head(5)

<div>

<style scoped>

.dataframe tbody tr th:only-of-type {

    vertical-align: middle;

}

.dataframe tbody tr th {

    vertical-align: top;

}

.dataframe thead th {

    text-align: right;

}

</style>

<table border=”1″ class=”dataframe”>

<thead>

<tr style="text-align: right;">

  <th></th>

  <th>user_id</th>

  <th>auction_id</th>

  <th>cat_id</th>

  <th>cat1</th>

  <th>property</th>

  <th>buy_mount</th>

  <th>day</th>

  <th>birthday</th>

  <th>gender</th>

</tr>

</thead>

<tbody>

<tr>

  <th>0</th>

  <td>513441334</td>

  <td>19909384116</td>

  <td>50010557</td>

  <td>50008168</td>

  <td>25935:21991;1628665:29784;22019:34731;22019:20...</td>

  <td>1</td>

  <td>20121212</td>

  <td>20110105</td>

  <td>1</td>

</tr>

<tr>

  <th>1</th>

  <td>377550424</td>

  <td>15771663914</td>

  <td>50015841</td>

  <td>28</td>

  <td>1628665:3233941;1628665:3233942;3914866:11580;...</td>

  <td>1</td>

  <td>20121123</td>

  <td>20110620</td>

  <td>1</td>

</tr>

<tr>

  <th>2</th>

  <td>47342027</td>

  <td>14066344263</td>

  <td>50013636</td>

  <td>50008168</td>

  <td>21458:21599;13585028:3416646;1628665:3233942;1...</td>

  <td>1</td>

  <td>20120911</td>

  <td>20101008</td>

  <td>1</td>

</tr>

<tr>

  <th>3</th>

  <td>119784861</td>

  <td>20796936076</td>

  <td>50140021</td>

  <td>50008168</td>

  <td>21458:120325094;22019:2026;22019:34731;22019:3...</td>

  <td>1</td>

  <td>20121129</td>

  <td>20120327</td>

  <td>0</td>

</tr>

<tr>

  <th>4</th>

  <td>159129426</td>

  <td>15198386301</td>

  <td>50013711</td>

  <td>50008168</td>

  <td>21458:11580;1628665:29778;22019:3340598;22019:...</td>

  <td>2</td>

  <td>20120808</td>

  <td>20100825</td>

  <td>0</td>

</tr>

</tbody>

</table>

</div>

正文完