
DataCamp pandas "Merging DataFrames" study notes

Reading multiple data files | Python

Compared with R, one thing that stands out in Python is the fluent use of loops — for example, reading a batch of .csv files with the following loop.

import pandas as pd

filenames = ['1.csv', '2.csv']
dataframes = []
for f in filenames:
    dataframes.append(pd.read_csv(f))

The .append method is the key step here.

But at this point dataframes is a list, so to inspect the first DataFrame you need print(dataframes[0].head()).

There are two other approaches with the same effect.

Using a comprehension

dataframes = [pd.read_csv(f) for f in filenames]

Using glob

from glob import glob
filenames = glob('sales*.csv')
dataframes = [pd.read_csv(f) for f in filenames]

Sorting DataFrame with the Index & columns | Python

sort_index() sorts alphabetically by the index labels — not by calendar month!

In [9]: print(weather2.head())
       Max TemperatureF
Month                  
Apr                  84
Aug                  86
Dec                  68
Feb                  60
Jan                  68

Of course you can also use sort_values('Max TemperatureF'), which sorts by the values rather than by the index.

To get calendar order, set up the ordering yourself:

In [1]: year
Out[1]: 
['Jan',
 'Feb',
 'Mar',
 'Apr',
 'May',
 'Jun',
 'Jul',
 'Aug',
 'Sep',
 'Oct',
 'Nov',
 'Dec']
# Reindex weather1 using the list year: weather2
weather2 = weather1.reindex(year)

The .ffill() method forward-fills the null entries when upsampling — that is, it fills downward from top to bottom.
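As a quick sketch with made-up data (not from the course), here is upsampling an every-other-day series to daily frequency and forward-filling the gap:

```python
import pandas as pd

# Toy series sampled every other day (hypothetical values)
ts = pd.Series([1.0, 3.0], index=pd.to_datetime(['2015-01-01', '2015-01-03']))

# Upsampling to daily frequency creates a gap at 2015-01-02;
# .ffill() fills it downward with the last observed value
filled = ts.resample('D').ffill()
print(filled)
```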

Another common technique is to reindex a DataFrame using the Index of another DataFrame.

Previously we were reindexing a Series; now it is a DataFrame, which is different — e.g. names_1981.reindex(names_1881.index).

Arithmetic with Series & DataFrames | Python

In [10]: week1_range.divide(week1_mean, axis='rows') — here axis='rows' aligns week1_mean with the DataFrame's row index, so each column is divided by the row's mean value.

multiply and divide can be used the same way, as needed.

week1_mean.pct_change() * 100 gives the row-over-row percentage change.

\[\frac{V_t-V_{t-1}}{V_{t-1}} \cdot 100\%\]

bronze.add(silver) adds the two tables, aligned on their index.

In bronze.add(silver, fill_value = 0), the fill_value = 0 means that index labels with no match are treated as adding 0 rather than producing NaN.
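A minimal sketch of the difference, with made-up medal counts:

```python
import pandas as pd

# Hypothetical medal counts, indexed by country
bronze = pd.Series({'USA': 2.0, 'GER': 1.0})
silver = pd.Series({'USA': 3.0, 'FRA': 4.0})

# Without fill_value, labels present in only one Series give NaN
plain = bronze.add(silver)

# With fill_value=0, a missing label simply counts as 0
filled = bronze.add(silver, fill_value=0)
print(filled)
```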

Broadcasting in arithmetic formulas | Python

.str.replace('F','C') is similar to str_replace from the R package stringr.

Appending & concatenating Series | Python

A Series or DataFrame built with .append or pd.concat keeps the original index values unchanged, so loc[0] can still select the first row of each original object; use reset_index() or pd.concat(..., ignore_index=True) to remove this behavior.

Appending pandas Series | Python

print(quarter1.loc['jan 27, 2015':'feb 2, 2015']) — Python lets you slice by date labels like this; really convenient.
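A self-contained sketch of the same label-based date slicing, on a hypothetical daily series:

```python
import pandas as pd

# Hypothetical daily series with a DatetimeIndex
s = pd.Series(range(7), index=pd.date_range('2015-01-27', periods=7, freq='D'))

# .loc slices by date strings, inclusive on both ends
window = s.loc['jan 28, 2015':'feb 1, 2015']
print(window)
```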

Concatenating pandas Series along row axis | Python

Here,

# Initialize empty list: units
units = []

# Build the list of Series
for month in [jan, feb, mar]:
    units.append(month['Units'])

After the for loop,

In [9]: type(units)
Out[9]: list

So pd.concat is needed to combine the list into a single Series: quarter1 = pd.concat(units).

In [13]: type(quarter1)
Out[13]: pandas.core.series.Series

Appending & concatenating DataFrames | Python

This section distinguishes axis=0 from axis=1 in pd.concat. The logic mirrors SQL: axis=0 is equivalent to axis='rows' and stacks vertically, aligning on the column names; axis=1 is equivalent to axis='columns' and stacks horizontally, aligning on the index.
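The two directions in one small sketch (toy DataFrames, made up for illustration):

```python
import pandas as pd

# Two tiny DataFrames with partially overlapping indexes
df1 = pd.DataFrame({'x': [1, 2]}, index=['a', 'b'])
df2 = pd.DataFrame({'y': [3, 4]}, index=['a', 'c'])

# axis=0 / axis='rows': stack vertically, aligning on column names
vertical = pd.concat([df1, df2], axis=0)

# axis=1 / axis='columns': stack horizontally, aligning on the index
horizontal = pd.concat([df1, df2], axis=1)
print(vertical.shape, horizontal.shape)
```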

Reading multiple files to build a DataFrame | Python

This data-merging step is a bit hard to follow.

The key is this for loop.

medal_types = ['bronze', 'silver', 'gold']
medals = []  # list to collect the DataFrames

for medal in medal_types:

    # Create the file name: file_name
    file_name = "%s_top5.csv" % medal

    # Create list of column names: columns
    columns = ['Country', medal]

    # Read file_name into a DataFrame: medal_df
    medal_df = pd.read_csv(file_name, header=0, index_col='Country', names=columns)

    # Append medal_df to medals
    medals.append(medal_df)

Let's break the idea down.

medal_types = ['bronze', 'silver', 'gold']
for medal in medal_types:

    # Create the file name: file_name
    file_name = "%s_top5.csv" % medal

print(file_name)

Why is the answer gold_top5.csv? Because nothing was appended, only the result of the last iteration remains — in fact, file_name was overwritten twice along the way.

medal_types = ['bronze', 'silver', 'gold']
for medal in medal_types:

    # Create the file name: file_name
    print("%s_top5.csv" % medal)

That makes it easier to understand; the output is

bronze_top5.csv
silver_top5.csv
gold_top5.csv
In [10]: pd.read_csv('bronze_top5.csv')
Out[10]: 
          Country   Total
0   United States  1052.0
1    Soviet Union   584.0
2  United Kingdom   505.0
3          France   475.0
4         Germany   454.0

So looking at a single table on its own, Country is just a regular column, not yet the index.

In [14]: pd.read_csv('bronze_top5.csv',header=0,index_col='Country',names=['Country', 'bronze'])
Out[14]: 
                bronze
Country               
United States   1052.0
Soviet Union     584.0
United Kingdom   505.0
France           475.0
Germany          454.0

Now I get it! Here names renames the index as well, but naming it does not keep it as a regular column — index_col='Country' promotes it to the index.

Concatenation, keys, & MultiIndexes | Python

For MultiIndexes, the way to add one yourself is to pass keys to pd.concat; see the example in "Concatenating vertically to get MultiIndexed rows | Python".

Alternatively, build a dict beforehand: rain_dict = {2013: rain2013, 2014: rain2014}.

level=0 refers to the outermost index level (the medal keys here); since sort_remaining defaults to True, the Country level gets sorted as well, which is what changes in the output below.

In [1]: medals.head()
Out[1]: 
                        Total
       Country               
bronze United States   1052.0
       Soviet Union     584.0
       United Kingdom   505.0
       France           475.0
       Germany          454.0
In [2]: medals.sort_index(level=0)
Out[2]: 
                        Total
       Country               
bronze France           475.0
       Germany          454.0
       Soviet Union     584.0
       United Kingdom   505.0
       United States   1052.0

pd.IndexSlice here works like slice() in "Indexing multiple levels of a MultiIndex | Python".
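A small sketch of pd.IndexSlice on MultiIndexed rows (numbers made up; the index must be sorted for slicing):

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples([('bronze', 'France'), ('bronze', 'Germany'),
                                 ('silver', 'France'), ('silver', 'Germany')],
                                names=['medal', 'Country'])
medals = pd.DataFrame({'Total': [475.0, 454.0, 461.0, 350.0]}, index=idx)

# pd.IndexSlice spells out one slice per level: all medals, Country == 'Germany'
germany = medals.loc[pd.IndexSlice[:, 'Germany'], :]
print(germany)
```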

In [7]: pd.concat(dataframes, keys = ['Hardware', 'Software', 'Service'], axis=1).head()
Out[7]: 
                            Hardware                   Software            \
                             Company   Product Units    Company   Product   
Date                                                                        
2015-02-02 08:33:01              NaN       NaN   NaN      Hooli  Software   
2015-02-02 20:54:49        Mediacore  Hardware   9.0        NaN       NaN   
2015-02-03 14:14:18              NaN       NaN   NaN    Initech  Software   
2015-02-04 15:36:29              NaN       NaN   NaN  Streeplex  Software   
2015-02-04 21:52:45  Acme Coporation  Hardware  14.0        NaN       NaN   

                          Service                
                    Units Company Product Units  
Date                                             
2015-02-02 08:33:01   3.0     NaN     NaN   NaN  
2015-02-02 20:54:49   NaN     NaN     NaN   NaN  
2015-02-03 14:14:18  13.0     NaN     NaN   NaN  
2015-02-04 15:36:29  13.0     NaN     NaN   NaN  
2015-02-04 21:52:45   NaN     NaN     NaN   NaN

This is the effect of stacking side by side with keys.

Concatenating DataFrames from a dict | Python

This one is a bit hard to understand.

First, look at one object:

In [7]: jan.head()
Out[7]: 
                  Date    Company   Product  Units
0  2015-01-21 19:13:21  Streeplex  Hardware     11
1  2015-01-09 05:23:51  Streeplex   Service      8
2  2015-01-06 17:19:34    Initech  Hardware     17
3  2015-01-02 09:51:06      Hooli  Hardware     16
4  2015-01-11 14:51:02      Hooli  Hardware     11
In [9]: jan.groupby('Company').sum()
Out[9]: 
                 Units
Company               
Acme Coporation     76
Hooli               70
Initech             37
Mediacore           15
Streeplex           50

.sum() here sums only the numeric columns by default.

Therefore,

# Make the list of tuples: month_list
month_list = [('january', jan), ('february', feb), ('march', mar)]

# Create an empty dictionary: month_dict
month_dict = {}

for month_name, month_data in month_list:

    # Group month_data: month_dict[month_name]
    month_dict[month_name] = month_data.groupby('Company').sum()

Here month_name, month_data in month_list unpacks each tuple into a key and a value, respectively.
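To finish the thought with toy data (the company names and numbers are made up): passing the dict to pd.concat turns its keys into the outer index level:

```python
import pandas as pd

# Hypothetical monthly sales, in the same shape as jan/feb above
jan = pd.DataFrame({'Company': ['Hooli', 'Initech'], 'Units': [16, 17]})
feb = pd.DataFrame({'Company': ['Hooli', 'Hooli'], 'Units': [3, 4]})

month_dict = {}
for month_name, month_data in [('january', jan), ('february', feb)]:
    month_dict[month_name] = month_data.groupby('Company').sum()

# The dict keys become the outer level of a MultiIndex
sales = pd.concat(month_dict)
print(sales)
```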

Outer & inner joins | Python

This touches on array computation.

In [1]: import numpy as np 
        import pandas as pd
        np.arange(8)
Out[1]: array([0, 1, 2, 3, 4, 5, 6, 7])

np.arange(8) produces consecutive integers starting from 0.

In [3]: np.arange(8).reshape(2,4)
Out[3]: 
array([[0, 1, 2, 3],
       [4, 5, 6, 7]])
In [4]: np.arange(8).reshape(2,4) +0.1
Out[4]: 
array([[ 0.1,  1.1,  2.1,  3.1],
       [ 4.1,  5.1,  6.1,  7.1]])

Stacking arrays horizontally

In [5]: A = np.arange(8).reshape(2,4) + 0.1
        B = np.arange(6).reshape(2,3) + 0.2
        C = np.arange(12).reshape(3,4) + 0.3
        np.hstack([B, A])
Out[5]: 
array([[ 0.2,  1.2,  2.2,  0.1,  1.1,  2.1,  3.1],
       [ 3.2,  4.2,  5.2,  4.1,  5.1,  6.1,  7.1]])

hstack stacks horizontally; np.concatenate([B, A], axis=1) does the same.

In [6]: A = np.arange(8).reshape(2,4) + 0.1
        B = np.arange(6).reshape(2,3) + 0.2
        C = np.arange(12).reshape(3,4) + 0.3
        np.vstack([A, C])
Out[6]: 
array([[  0.1,   1.1,   2.1,   3.1],
       [  4.1,   5.1,   6.1,   7.1],
       [  0.3,   1.3,   2.3,   3.3],
       [  4.3,   5.3,   6.3,   7.3],
       [  8.3,   9.3,  10.3,  11.3]])

np.concatenate([A, C], axis=0) does the same.

Concatenating DataFrames with inner join | Python

In [4]: bronze
Out[4]: 
                 Total
Country               
United States   1052.0
Soviet Union     584.0
United Kingdom   505.0
France           475.0
Germany          454.0
# Create the list of DataFrames: medal_list
medal_list = [bronze, silver, gold]

# Concatenate medal_list horizontally using an inner join: medals
medals = pd.concat(medal_list, keys=['bronze', 'silver', 'gold'], axis=1, join='inner')

# Print medals
print(medals)

The inner join keeps the intersection of the three tables' Indexes.

Resampling & concatenating DataFrames with inner join | Python

For the table of .resample() frequency aliases, see Time Series / Date functionality — pandas 0.21.1 documentation.

# Resample and tidy china: china_annual
china_annual = china.resample('A').pct_change(10).dropna()

# Resample and tidy us: us_annual
us_annual = us.resample('A').pct_change(10).dropna()

# Concatenate china_annual and us_annual: gdp
gdp = pd.concat([china_annual,us_annual],join='inner', axis=1)

# Resample gdp and print
print(gdp.resample('10A').last())

Having set up .resample('10A') — honestly this function feels a bit like a black box to me.

<script.py> output:
                   China        US
    Year                          
    1971-12-31  0.988860  1.073188
    1981-12-31  0.972048  1.749631
    1991-12-31  0.962528  0.922811
    2001-12-31  2.492511  0.720398
    2011-12-31  4.623958  0.460947
    2021-12-31  3.789936  0.377506

Merging DataFrames | Python

This mainly covers the pd.merge() function, the analogue of JOIN in SQL.
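A minimal sketch of the SQL analogy, with hypothetical tables sharing a 'city' column:

```python
import pandas as pd

revenue = pd.DataFrame({'city': ['Austin', 'Denver'], 'revenue': [100, 83]})
managers = pd.DataFrame({'city': ['Austin', 'Dallas'], 'manager': ['Charles', 'Carlos']})

# Like SQL: SELECT * FROM revenue JOIN managers ON revenue.city = managers.city
inner = pd.merge(revenue, managers, on='city')               # how='inner' is the default
outer = pd.merge(revenue, managers, on='city', how='outer')  # keep non-matching rows too
print(inner)
print(outer)
```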

Joining DataFrames | Python

In pd.merge(), how='inner' is the default; in .join(), how='left' is the default.

Which one to use:

  • df1.append(df2): stacking vertically
  • pd.concat([df1, df2]):
    • stacking many horizontally or vertically
    • simple inner/outer joins on Indexes
  • df1.join(df2): inner/outer/left/right joins on Indexes
  • pd.merge(df1, df2): many joins on multiple columns

Using merge_ordered() | Python

merge_ordered() sorts the result while it merges.

Worth memorizing the common arguments: pd.merge_ordered(austin, houston, on = 'date', suffixes= ['_aus','_hus'], fill_method='ffill')
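A runnable sketch of those arguments on made-up weather tables (the ratings and dates are hypothetical):

```python
import pandas as pd

austin = pd.DataFrame({'date': pd.to_datetime(['2016-01-01', '2016-03-01']),
                       'ratings': ['Cloudy', 'Sunny']})
houston = pd.DataFrame({'date': pd.to_datetime(['2016-02-01', '2016-03-01']),
                        'ratings': ['Rainy', 'Sunny']})

# Ordered outer merge on 'date'; ffill carries each city's last rating forward
merged = pd.merge_ordered(austin, houston, on='date',
                          suffixes=['_aus', '_hus'], fill_method='ffill')
print(merged)
```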

Using merge_asof() | Python

Similar to pd.merge_ordered(), the pd.merge_asof() function will also merge values in order using the on column, but for each row in the left DataFrame, only rows from the right DataFrame whose 'on' column values are less than the left value will be kept.

oil
        Date  Price
0 1970-01-01   3.35
1 1970-02-01   3.35
2 1970-03-01   3.35
3 1970-04-01   3.35
4 1970-05-01   3.35

auto
    mpg  cyl  displ   hp  weight  accel         yr origin  \
0  18.0    8  307.0  130    3504   12.0 1970-01-01     US   
1  15.0    8  350.0  165    3693   11.5 1970-01-01     US   
2  18.0    8  318.0  150    3436   11.0 1970-01-01     US   
3  16.0    8  304.0  150    3433   12.0 1970-01-01     US   
4  17.0    8  302.0  140    3449   10.5 1970-01-01     US   

                        name  
0  chevrolet chevelle malibu  
1          buick skylark 320  
2         plymouth satellite  
3              amc rebel sst  
4                ford torino

\[\text{left.value} > \text{right.value}\]

Honestly, I don't fully understand this one.
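A tiny sketch may help (quotes and trades are made up; both tables must be sorted on the 'on' column): each left row is matched with the most recent right row at or before it.

```python
import pandas as pd

# Hypothetical price quotes and trades, both sorted on 'time'
quotes = pd.DataFrame({'time': pd.to_datetime(['2015-01-01', '2015-03-01']),
                       'price': [3.35, 3.60]})
trades = pd.DataFrame({'time': pd.to_datetime(['2015-02-15', '2015-03-15']),
                       'units': [5, 7]})

# For each trades row, take the latest quotes row whose 'time'
# does not exceed the trade's 'time'
merged = pd.merge_asof(trades, quotes, on='time')
print(merged)
```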

Building medals DataFrame | Python

In [10]: editions['Edition'].head()
Out[10]: 
0    1896
1    1900
2    1904
3    1908
4    1912
Name: Edition, dtype: int64

So in for year in editions['Edition']:, year is a concrete year such as 1896.

And medals_dict[year] = pd.read_csv(file_path) follows the pattern my_dict[key] = value.

With that, the code below is easy to follow.

# Import pandas
import pandas as pd

# Create empty dictionary: medals_dict
medals_dict = {}

for year in editions['Edition']:

    # Create the file path: file_path
    file_path = 'summer_{:d}.csv'.format(year)
    
    # Load file_path into a DataFrame: medals_dict[year]
    medals_dict[year] = pd.read_csv(file_path)
    
    # Extract relevant columns: medals_dict[year]
    medals_dict[year] = medals_dict[year][['Athlete', 'NOC',  'Medal']]
    
    # Assign year to column 'Edition' of medals_dict
    medals_dict[year]['Edition'] = year
    
# Concatenate medals_dict: medals
medals = pd.concat(medals_dict, ignore_index=True)

# Print first and last 5 rows of medals
print(medals.head())
print(medals.tail())
<script.py> output:
                  Athlete  NOC   Medal  Edition
    0       HAJOS, Alfred  HUN    Gold     1896
    1    HERSCHMANN, Otto  AUT  Silver     1896
    2   DRIVAS, Dimitrios  GRE  Bronze     1896
    3  MALOKINIS, Ioannis  GRE    Gold     1896
    4  CHASAPIS, Spiridon  GRE  Silver     1896
                        Athlete  NOC   Medal  Edition
    29211        ENGLICH, Mirko  GER  Silver     2008
    29212  MIZGAITIS, Mindaugas  LTU  Bronze     2008
    29213       PATRIKEEV, Yuri  ARM  Bronze     2008
    29214         LOPEZ, Mijain  CUB    Gold     2008
    29215        BAROEV, Khasan  RUS  Silver     2008

Computational tools — pandas 0.21.1 documentation. So .expanding() is essentially a rolling window that grows from the first row; .expanding().mean() gives the running (cumulative) mean, not a fixed-window moving average.
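The contrast in a minimal sketch (toy numbers):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

# .expanding(): the window grows from the first row, so this is a running mean
running = s.expanding().mean()

# compare with a fixed-size rolling window of 2
rolling = s.rolling(window=2).mean()
print(running.tolist())
```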

Certificate