数据整理
- 定义
- 在数据清洗过程中,很多时候需要将不同的数据整理在一起,方便后续的分析,这个过程也叫数据合并
- 合并方法
- 常见的合并方法有堆叠和按主键进行合并,堆叠又分为横向堆叠和纵向堆叠,按主键合并类似于sql里面的关联操作
- 横向堆叠将两张表或多张表在X轴方向,即横向拼接在一起
- 纵向堆叠将两张表或多张表在Y轴方向,即纵向拼接在一起
- 注意join 取inner或者outer,分别代表交集和并集
# 用于读取excel文件
import xlrd
import os
import pandas as pd
import numpy as np
os.getcwd()
'D:\\Jupyter\\notebook\\Python数据清洗实战\\数据清洗之数据表处理'
os.chdir('D:\\Jupyter\\notebook\\Python数据清洗实战\\数据')
workbook = xlrd.open_workbook('meal_order_detail.xlsx')
# 获取工作表的名称
sheet_name = workbook.sheet_names()
sheet_name
['meal_order_detail1', 'meal_order_detail2', 'meal_order_detail3']
order1 = pd.read_excel('meal_order_detail.xlsx', sheet_name='meal_order_detail1')
order2 = pd.read_excel('meal_order_detail.xlsx', sheet_name='meal_order_detail2')
order3 = pd.read_excel('meal_order_detail.xlsx', sheet_name='meal_order_detail3')
# ignore_index=True, 表示重置索引
order = pd.concat([order1, order2, order3], axis=0, ignore_index=True)
# 获取末5行
order.tail(5)
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border=”1″ class=”dataframe”>
<thead>
<tr style="text-align: right;">
<th></th>
<th>detail_id</th>
<th>order_id</th>
<th>dishes_id</th>
<th>logicprn_name</th>
<th>parent_class_name</th>
<th>dishes_name</th>
<th>itemis_add</th>
<th>counts</th>
<th>amounts</th>
<th>cost</th>
<th>place_order_time</th>
<th>discount_amt</th>
<th>discount_reason</th>
<th>kick_back</th>
<th>add_inprice</th>
<th>add_info</th>
<th>bar_code</th>
<th>picture_file</th>
<th>emp_id</th>
</tr>
</thead>
<tbody>
<tr>
<th>10032</th>
<td>5683</td>
<td>672</td>
<td>610049</td>
<td>NaN</td>
<td>NaN</td>
<td>爆炒双丝</td>
<td>0</td>
<td>1</td>
<td>35</td>
<td>NaN</td>
<td>2016-08-31 21:53:30</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>0</td>
<td>NaN</td>
<td>NaN</td>
<td>caipu/301003.jpg</td>
<td>1089</td>
</tr>
<tr>
<th>10033</th>
<td>5686</td>
<td>672</td>
<td>609959</td>
<td>NaN</td>
<td>NaN</td>
<td>小炒羊腰\r\n\r\n\r\n</td>
<td>0</td>
<td>1</td>
<td>36</td>
<td>NaN</td>
<td>2016-08-31 21:54:40</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>0</td>
<td>NaN</td>
<td>NaN</td>
<td>caipu/202005.jpg</td>
<td>1089</td>
</tr>
<tr>
<th>10034</th>
<td>5379</td>
<td>647</td>
<td>610012</td>
<td>NaN</td>
<td>NaN</td>
<td>香菇鹌鹑蛋</td>
<td>0</td>
<td>1</td>
<td>39</td>
<td>NaN</td>
<td>2016-08-31 21:54:44</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>0</td>
<td>NaN</td>
<td>NaN</td>
<td>caipu/302001.jpg</td>
<td>1094</td>
</tr>
<tr>
<th>10035</th>
<td>5380</td>
<td>647</td>
<td>610054</td>
<td>NaN</td>
<td>NaN</td>
<td>不加一滴油的酸奶蛋糕</td>
<td>0</td>
<td>1</td>
<td>7</td>
<td>NaN</td>
<td>2016-08-31 21:55:24</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>0</td>
<td>NaN</td>
<td>NaN</td>
<td>caipu/501003.jpg</td>
<td>1094</td>
</tr>
<tr>
<th>10036</th>
<td>5688</td>
<td>672</td>
<td>609953</td>
<td>NaN</td>
<td>NaN</td>
<td>凉拌菠菜</td>
<td>0</td>
<td>1</td>
<td>27</td>
<td>NaN</td>
<td>2016-08-31 21:56:54</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>0</td>
<td>NaN</td>
<td>NaN</td>
<td>caipu/303004.jpg</td>
<td>1089</td>
</tr>
</tbody>
</table>
</div>
basic = pd.DataFrame()
# 可以使用for循环进行合并
for i in sheet_name:
basic_i = pd.read_excel('meal_order_detail.xlsx', sheet_name=i)
basic = pd.concat([basic, basic_i], axis=0, ignore_index=True)
basic.shape
(10037, 19)
basic.tail(5)
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border=”1″ class=”dataframe”>
<thead>
<tr style="text-align: right;">
<th></th>
<th>detail_id</th>
<th>order_id</th>
<th>dishes_id</th>
<th>logicprn_name</th>
<th>parent_class_name</th>
<th>dishes_name</th>
<th>itemis_add</th>
<th>counts</th>
<th>amounts</th>
<th>cost</th>
<th>place_order_time</th>
<th>discount_amt</th>
<th>discount_reason</th>
<th>kick_back</th>
<th>add_inprice</th>
<th>add_info</th>
<th>bar_code</th>
<th>picture_file</th>
<th>emp_id</th>
</tr>
</thead>
<tbody>
<tr>
<th>10032</th>
<td>5683</td>
<td>672</td>
<td>610049</td>
<td>NaN</td>
<td>NaN</td>
<td>爆炒双丝</td>
<td>0</td>
<td>1</td>
<td>35</td>
<td>NaN</td>
<td>2016-08-31 21:53:30</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>0</td>
<td>NaN</td>
<td>NaN</td>
<td>caipu/301003.jpg</td>
<td>1089</td>
</tr>
<tr>
<th>10033</th>
<td>5686</td>
<td>672</td>
<td>609959</td>
<td>NaN</td>
<td>NaN</td>
<td>小炒羊腰\r\n\r\n\r\n</td>
<td>0</td>
<td>1</td>
<td>36</td>
<td>NaN</td>
<td>2016-08-31 21:54:40</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>0</td>
<td>NaN</td>
<td>NaN</td>
<td>caipu/202005.jpg</td>
<td>1089</td>
</tr>
<tr>
<th>10034</th>
<td>5379</td>
<td>647</td>
<td>610012</td>
<td>NaN</td>
<td>NaN</td>
<td>香菇鹌鹑蛋</td>
<td>0</td>
<td>1</td>
<td>39</td>
<td>NaN</td>
<td>2016-08-31 21:54:44</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>0</td>
<td>NaN</td>
<td>NaN</td>
<td>caipu/302001.jpg</td>
<td>1094</td>
</tr>
<tr>
<th>10035</th>
<td>5380</td>
<td>647</td>
<td>610054</td>
<td>NaN</td>
<td>NaN</td>
<td>不加一滴油的酸奶蛋糕</td>
<td>0</td>
<td>1</td>
<td>7</td>
<td>NaN</td>
<td>2016-08-31 21:55:24</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>0</td>
<td>NaN</td>
<td>NaN</td>
<td>caipu/501003.jpg</td>
<td>1094</td>
</tr>
<tr>
<th>10036</th>
<td>5688</td>
<td>672</td>
<td>609953</td>
<td>NaN</td>
<td>NaN</td>
<td>凉拌菠菜</td>
<td>0</td>
<td>1</td>
<td>27</td>
<td>NaN</td>
<td>2016-08-31 21:56:54</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>0</td>
<td>NaN</td>
<td>NaN</td>
<td>caipu/303004.jpg</td>
<td>1089</td>
</tr>
</tbody>
</table>
</div>
df = pd.read_csv('baby_trade_history.csv', dtype={'user_id': str})
df1 = pd.read_csv('sam_tianchi_mum_baby.csv', dtype={'user_id': str})
# 基本信息数据
df1.head(5)
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border=”1″ class=”dataframe”>
<thead>
<tr style="text-align: right;">
<th></th>
<th>user_id</th>
<th>birthday</th>
<th>gender</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2757</td>
<td>20130311</td>
<td>1</td>
</tr>
<tr>
<th>1</th>
<td>415971</td>
<td>20121111</td>
<td>0</td>
</tr>
<tr>
<th>2</th>
<td>1372572</td>
<td>20120130</td>
<td>1</td>
</tr>
<tr>
<th>3</th>
<td>10339332</td>
<td>20110910</td>
<td>0</td>
</tr>
<tr>
<th>4</th>
<td>10642245</td>
<td>20130213</td>
<td>0</td>
</tr>
</tbody>
</table>
</div>
# 交易数据
df.head(5)
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border=”1″ class=”dataframe”>
<thead>
<tr style="text-align: right;">
<th></th>
<th>user_id</th>
<th>auction_id</th>
<th>cat_id</th>
<th>cat1</th>
<th>property</th>
<th>buy_mount</th>
<th>day</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>786295544</td>
<td>41098319944</td>
<td>50014866</td>
<td>50022520</td>
<td>21458:86755362;13023209:3593274;10984217:21985...</td>
<td>2</td>
<td>20140919</td>
</tr>
<tr>
<th>1</th>
<td>532110457</td>
<td>17916191097</td>
<td>50011993</td>
<td>28</td>
<td>21458:11399317;1628862:3251296;21475:137325;16...</td>
<td>1</td>
<td>20131011</td>
</tr>
<tr>
<th>2</th>
<td>249013725</td>
<td>21896936223</td>
<td>50012461</td>
<td>50014815</td>
<td>21458:30992;1628665:92012;1628665:3233938;1628...</td>
<td>1</td>
<td>20131011</td>
</tr>
<tr>
<th>3</th>
<td>917056007</td>
<td>12515996043</td>
<td>50018831</td>
<td>50014815</td>
<td>21458:15841995;21956:3494076;27000458:59723383...</td>
<td>2</td>
<td>20141023</td>
</tr>
<tr>
<th>4</th>
<td>444069173</td>
<td>20487688075</td>
<td>50013636</td>
<td>50008168</td>
<td>21458:30992;13658074:3323064;1628665:3233941;1...</td>
<td>1</td>
<td>20141103</td>
</tr>
</tbody>
</table>
</div>
# 将df1添加到df右边,关联字段为 "user_id"
df2 = pd.merge(left=df, right=df1, how='inner', left_on='user_id', right_on='user_id')
df2.head(5)
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border=”1″ class=”dataframe”>
<thead>
<tr style="text-align: right;">
<th></th>
<th>user_id</th>
<th>auction_id</th>
<th>cat_id</th>
<th>cat1</th>
<th>property</th>
<th>buy_mount</th>
<th>day</th>
<th>birthday</th>
<th>gender</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>513441334</td>
<td>19909384116</td>
<td>50010557</td>
<td>50008168</td>
<td>25935:21991;1628665:29784;22019:34731;22019:20...</td>
<td>1</td>
<td>20121212</td>
<td>20110105</td>
<td>1</td>
</tr>
<tr>
<th>1</th>
<td>377550424</td>
<td>15771663914</td>
<td>50015841</td>
<td>28</td>
<td>1628665:3233941;1628665:3233942;3914866:11580;...</td>
<td>1</td>
<td>20121123</td>
<td>20110620</td>
<td>1</td>
</tr>
<tr>
<th>2</th>
<td>47342027</td>
<td>14066344263</td>
<td>50013636</td>
<td>50008168</td>
<td>21458:21599;13585028:3416646;1628665:3233942;1...</td>
<td>1</td>
<td>20120911</td>
<td>20101008</td>
<td>1</td>
</tr>
<tr>
<th>3</th>
<td>119784861</td>
<td>20796936076</td>
<td>50140021</td>
<td>50008168</td>
<td>21458:120325094;22019:2026;22019:34731;22019:3...</td>
<td>1</td>
<td>20121129</td>
<td>20120327</td>
<td>0</td>
</tr>
<tr>
<th>4</th>
<td>159129426</td>
<td>15198386301</td>
<td>50013711</td>
<td>50008168</td>
<td>21458:11580;1628665:29778;22019:3340598;22019:...</td>
<td>2</td>
<td>20120808</td>
<td>20100825</td>
<td>0</td>
</tr>
</tbody>
</table>
</div>
正文完