Reading multiple data files | Python
Compared with R, what stands out about Python to me is how naturally it uses loops — for example, reading a batch of .csv files goes like this:
filenames = ['1.csv', '2.csv']
dataframes = []
for f in filenames:
    dataframes.append(pd.read_csv(f))
The .append method is the key here. Note that at this point dataframes is still a plain Python list, so inspecting it requires something like print(dataframes[0].head()).
There are two other approaches with the same effect.
Using a comprehension
dataframes = [pd.read_csv(f) for f in filenames]
Using glob
from glob import glob
filenames = glob('sales*.csv')
dataframes = [pd.read_csv(f) for f in filenames]
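Either way, the resulting list can then be fused into a single table with pd.concat. A minimal sketch, using StringIO strings to stand in for the files (the column names are made up):

```python
import pandas as pd
from io import StringIO

# Simulated file contents standing in for '1.csv' and '2.csv'
csv_texts = ["a,b\n1,2\n3,4\n", "a,b\n5,6\n"]

# Same pattern as above: read each "file" into a list of DataFrames...
dataframes = [pd.read_csv(StringIO(text)) for text in csv_texts]

# ...then stack the list into one DataFrame with a fresh 0..n-1 index
combined = pd.concat(dataframes, ignore_index=True)
print(combined.shape)  # (3, 2)
```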
Sorting DataFrame with the Index & columns | Python
sort_index() sorts the labels alphabetically (as strings) — not in month order!
In [9]: print(weather2.head())
Max TemperatureF
Month
Apr 84
Aug 86
Dec 68
Feb 60
Jan 68
Of course you can also use sort_values('Max TemperatureF'), which sorts by the column values rather than the index. To get month order, define the order yourself:
In [1]: year
Out[1]:
['Jan',
'Feb',
'Mar',
'Apr',
'May',
'Jun',
'Jul',
'Aug',
'Sep',
'Oct',
'Nov',
'Dec']
# Reindex weather1 using the list year: weather2
weather2 = weather1.reindex(year)
The .ffill() method forward-fills the null entries when upsampling — that is, gaps are filled downward from the rows above.
Another common technique is to reindex a DataFrame using the Index of another DataFrame.
Previously we always reindexed with a plain list or Series; reindexing with another DataFrame's Index is different, e.g. names_1981.reindex(names_1881.index).
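A toy sketch combining reindex and .ffill(), with made-up values loosely following the weather example above (column name assumed):

```python
import pandas as pd

year = ['Jan', 'Feb', 'Mar', 'Apr']
# Toy weather table whose index sorts alphabetically, not by month
weather1 = pd.DataFrame({'Max TemperatureF': [84, 60, 68]},
                        index=['Apr', 'Feb', 'Jan'])

# Reindex into calendar order; 'Mar' is missing, so it becomes NaN
weather2 = weather1.reindex(year)
# Forward-fill the gap from the row above ('Feb')
weather3 = weather1.reindex(year).ffill()
print(weather3.loc['Mar', 'Max TemperatureF'])  # 60.0
```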
Arithmetic with Series & DataFrames | Python
In [10]: week1_range.divide(week1_mean, axis='rows')
Here axis='rows' means week1_mean is aligned with the row index, so each row of week1_range is divided by its corresponding mean. multiply and divide can be applied the same way, whichever the task calls for.
week1_mean.pct_change() * 100
gives the period-over-period percentage change:
\[\frac{V_t-V_{t-1}}{V_{t-1}} \cdot 100\%\]
bronze.add(silver)
adds the two tables, aligning on the index.
bronze.add(silver, fill_value = 0)
Here fill_value = 0 means index labels that fail to match are treated as adding 0 rather than producing NaN.
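A small sketch of the fill_value behavior on made-up medal counts with partially overlapping indexes:

```python
import pandas as pd

# Hypothetical medal totals; only 'France' appears in both
bronze = pd.Series({'France': 475.0, 'Germany': 454.0})
silver = pd.Series({'France': 461.0, 'Italy': 394.0})

# Plain .add(): labels present in only one Series yield NaN
print(bronze.add(silver))
# fill_value=0 treats the missing side as 0 instead
print(bronze.add(silver, fill_value=0))
```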
Broadcasting in arithmetic formulas | Python
.str.replace('F','C')
is similar to str_replace from the R package stringr.
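As a sketch of broadcasting plus that column rename (column names assumed, values made up):

```python
import pandas as pd

# Toy Fahrenheit readings
temps_f = pd.DataFrame({'Min TemperatureF': [32.0, 50.0],
                        'Max TemperatureF': [68.0, 86.0]})

# Broadcasting: the scalar arithmetic applies to every cell at once
temps_c = (temps_f - 32) * 5 / 9
# Rename the columns to match, much like stringr::str_replace in R
temps_c.columns = temps_c.columns.str.replace('F', 'C')
print(temps_c)
```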
Appending & concatenating Series | Python
A Series or DataFrame built with .append or pd.concat (note the spelling: concat, not concate) keeps the original index labels unchanged, so loc[0] can select the first row of every original Series or DataFrame at once. Use reset_index() or pd.concat(..., ignore_index=True) (ignore_index, not ingore_index) to eliminate this.
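A minimal demonstration of the duplicated labels and the ignore_index fix:

```python
import pandas as pd

s1 = pd.Series(['a', 'b'])
s2 = pd.Series(['c', 'd'])

# Indexes are kept as-is, so label 0 now appears twice
stacked = pd.concat([s1, s2])
print(stacked.loc[0])  # returns both 'a' and 'c'

# ignore_index=True rebuilds a fresh 0..n-1 index
clean = pd.concat([s1, s2], ignore_index=True)
print(clean.loc[3])  # 'd'
```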
Appending pandas Series | Python
print(quarter1.loc['jan 27, 2015':'feb 2, 2015'])
Python can slice directly with date strings like this — really convenient.
Concatenating pandas Series along row axis | Python
Here,
# Initialize empty list: units
units = []
# Build the list of Series
for month in [jan, feb, mar]:
    units.append(month['Units'])
After the for loop,
In [9]: type(units)
Out[9]: list
so pd.concat is needed to fuse them into a single Series.
In [13]: type(quarter1)
Out[13]: pandas.core.series.Series
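A self-contained sketch of this loop with toy monthly frames (values made up):

```python
import pandas as pd

# Toy stand-ins for the monthly DataFrames
jan = pd.DataFrame({'Units': [11, 8]})
feb = pd.DataFrame({'Units': [17]})
mar = pd.DataFrame({'Units': [16, 11]})

units = []
for month in [jan, feb, mar]:
    units.append(month['Units'])

# units is still a plain Python list of Series...
print(type(units))  # <class 'list'>
# ...until pd.concat fuses them into one Series
quarter1 = pd.concat(units)
print(type(quarter1))  # pandas Series
```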
Appending & concatenating DataFrames | Python
This section distinguishes axis=0 from axis=1 in pd.concat — essentially the same logic as SQL. axis=0 is equivalent to axis='rows': frames are stacked vertically, and alignment happens on the columns. axis=1 is equivalent to axis='columns': frames are stacked side by side, and alignment happens on the index.
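A minimal illustration of the two axes on toy frames:

```python
import pandas as pd

df1 = pd.DataFrame({'x': [1, 2]}, index=['a', 'b'])
df2 = pd.DataFrame({'x': [3]}, index=['c'])

# axis=0 / axis='rows': stack vertically, aligning on column names
tall = pd.concat([df1, df2], axis=0)
print(tall.shape)  # (3, 1)

df3 = pd.DataFrame({'y': [10, 20]}, index=['a', 'b'])
# axis=1 / axis='columns': stack side by side, aligning on the index
wide = pd.concat([df1, df3], axis=1)
print(wide.shape)  # (2, 2)
```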
Reading multiple files to build a DataFrame | Python
This data-merging step is a bit hard to follow — mainly the for loop.
for medal in medal_types:
    # Create the file name: file_name
    file_name = "%s_top5.csv" % medal
    # Create list of column names: columns
    columns = ['Country', medal]
    # Read file_name into a DataFrame: df
    medal_df = pd.read_csv(file_name, header=0, index_col='Country', names=columns)
    # Append medal_df to medals
    medals.append(medal_df)
Let's break the idea down.
medal_types = ['bronze', 'silver', 'gold']
for medal in medal_types:
    # Create the file name: file_name
    file_name = "%s_top5.csv" % medal
print(file_name)
Why does this print gold_top5.csv? Because nothing is appended here — file_name is simply reassigned on every pass, so only the last value survives; by the end it has already been overwritten twice.
medal_types = ['bronze', 'silver', 'gold']
for medal in medal_types:
    # Create the file name: file_name
    print("%s_top5.csv" % medal)
This version is easier to follow — the output is
bronze_top5.csv
silver_top5.csv
gold_top5.csv
In [10]: pd.read_csv('bronze_top5.csv')
Out[10]:
Country Total
0 United States 1052.0
1 Soviet Union 584.0
2 United Kingdom 505.0
3 France 475.0
4 Germany 454.0
So looking at a single file, Country is just an ordinary column — not yet the index.
In [14]: pd.read_csv('bronze_top5.csv',header=0,index_col='Country',names=['Country', 'bronze'])
Out[14]:
bronze
Country
United States 1052.0
Soviet Union 584.0
United Kingdom 505.0
France 475.0
Germany 454.0
Now I get it! Here names renames everything including the index column, but naming it doesn't make it a regular column — index_col='Country' promotes it to the index.
Concatenation, keys, & MultiIndexes | Python
For MultiIndexes, the way to add one yourself is to pass keys to pd.concat; the example is in Concatenating vertically to get MultiIndexed rows | Python.
Alternatively, build a dict ahead of time:
rain_dict = {2013: rain2013, 2014: rain2014}
level=0 refers to the outermost level of the MultiIndex (the medal type below), which is what sort_index(level=0) sorts on.
In [1]: medals.head()
Out[1]:
Total
Country
bronze United States 1052.0
Soviet Union 584.0
United Kingdom 505.0
France 475.0
Germany 454.0
In [2]: medals.sort_index(level=0)
Out[2]:
Total
Country
bronze France 475.0
Germany 454.0
Soviet Union 584.0
United Kingdom 505.0
United States 1052.0
The pd.IndexSlice here plays the same role as slice() in Indexing multiple levels of a MultiIndex | Python.
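A sketch of pd.IndexSlice on a toy MultiIndexed frame shaped like medals (values made up); note the index must be sorted before label slicing:

```python
import pandas as pd

# Toy (medal, Country) MultiIndex like the medals example
idx = pd.MultiIndex.from_tuples(
    [('bronze', 'France'), ('bronze', 'Germany'), ('silver', 'France')],
    names=[None, 'Country'])
medals = pd.DataFrame({'Total': [475.0, 454.0, 461.0]}, index=idx)
medals = medals.sort_index()  # lexsort so label slicing works

# pd.IndexSlice lets you slice on the inner level by label
ixs = pd.IndexSlice
france = medals.loc[ixs[:, 'France'], :]
print(france)
```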
In [7]: pd.concat(dataframes, keys = ['Hardware', 'Software', 'Service'], axis=1).head()
Out[7]:
Hardware Software \
Company Product Units Company Product
Date
2015-02-02 08:33:01 NaN NaN NaN Hooli Software
2015-02-02 20:54:49 Mediacore Hardware 9.0 NaN NaN
2015-02-03 14:14:18 NaN NaN NaN Initech Software
2015-02-04 15:36:29 NaN NaN NaN Streeplex Software
2015-02-04 21:52:45 Acme Coporation Hardware 14.0 NaN NaN
Service
Units Company Product Units
Date
2015-02-02 08:33:01 3.0 NaN NaN NaN
2015-02-02 20:54:49 NaN NaN NaN NaN
2015-02-03 14:14:18 13.0 NaN NaN NaN
2015-02-04 15:36:29 13.0 NaN NaN NaN
2015-02-04 21:52:45 NaN NaN NaN NaN
That is the effect of stacking the frames side by side.
Concatenating DataFrames from a dict | Python
This one is a bit hard to understand at first. Start by looking at one object:
In [7]: jan.head()
Out[7]:
Date Company Product Units
0 2015-01-21 19:13:21 Streeplex Hardware 11
1 2015-01-09 05:23:51 Streeplex Service 8
2 2015-01-06 17:19:34 Initech Hardware 17
3 2015-01-02 09:51:06 Hooli Hardware 16
4 2015-01-11 14:51:02 Hooli Hardware 11
In [9]: jan.groupby('Company').sum()
Out[9]:
Units
Company
Acme Coporation 76
Hooli 70
Initech 37
Mediacore 15
Streeplex 50
.sum() here sums only the numeric columns by default.
Therefore:
# Make the list of tuples: month_list
month_list = [('january', jan), ('february', feb), ('march', mar)]
# Create an empty dictionary: month_dict
month_dict = {}
for month_name, month_data in month_list:
    # Group month_data: month_dict[month_name]
    month_dict[month_name] = month_data.groupby('Company').sum()
Here month_name, month_data in month_list unpacks each tuple into what will become the key and the value.
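Putting it together on toy data (values made up): pd.concat on the dict then turns the keys into the outer level of a MultiIndex.

```python
import pandas as pd

jan = pd.DataFrame({'Company': ['Hooli', 'Hooli', 'Initech'],
                    'Units': [16, 11, 17]})
feb = pd.DataFrame({'Company': ['Hooli'], 'Units': [4]})

month_dict = {}
for month_name, month_data in [('january', jan), ('february', feb)]:
    # Sum Units per company (only numeric columns survive)
    month_dict[month_name] = month_data.groupby('Company').sum()

# Passing a dict to pd.concat makes the keys the outer index level
sales = pd.concat(month_dict)
print(sales.loc[('january', 'Hooli'), 'Units'])  # 27
```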
Outer & inner joins | Python
This involves some array arithmetic.
In [1]: import numpy as np
import pandas as pd
np.arange(8)
Out[1]: array([0, 1, 2, 3, 4, 5, 6, 7])
np.arange(8)
generates consecutive integers starting at 0.
In [3]: np.arange(8).reshape(2,4)
Out[3]:
array([[0, 1, 2, 3],
[4, 5, 6, 7]])
In [4]: np.arange(8).reshape(2,4) +0.1
Out[4]:
array([[ 0.1, 1.1, 2.1, 3.1],
[ 4.1, 5.1, 6.1, 7.1]])
Stacking arrays horizontally
In [5]: A = np.arange(8).reshape(2,4) + 0.1
B = np.arange(6).reshape(2,3) + 0.2
C = np.arange(12).reshape(3,4) + 0.3
np.hstack([B, A])
Out[5]:
array([[ 0.2, 1.2, 2.2, 0.1, 1.1, 2.1, 3.1],
[ 3.2, 4.2, 5.2, 4.1, 5.1, 6.1, 7.1]])
hstack stacks horizontally; np.concatenate([B, A], axis=1) does the same.
In [6]: A = np.arange(8).reshape(2,4) + 0.1
B = np.arange(6).reshape(2,3) + 0.2
C = np.arange(12).reshape(3,4) + 0.3
np.vstack([A, C])
Out[6]:
array([[ 0.1, 1.1, 2.1, 3.1],
[ 4.1, 5.1, 6.1, 7.1],
[ 0.3, 1.3, 2.3, 3.3],
[ 4.3, 5.3, 6.3, 7.3],
[ 8.3, 9.3, 10.3, 11.3]])
np.concatenate([A, C], axis=0) does the same.
Concatenating DataFrames with inner join | Python
In [4]: bronze
Out[4]:
Total
Country
United States 1052.0
Soviet Union 584.0
United Kingdom 505.0
France 475.0
Germany 454.0
# Create the list of DataFrames: medal_list
medal_list = [bronze, silver, gold]
# Concatenate medal_list horizontally using an inner join: medals
medals = pd.concat(medal_list, keys=['bronze', 'silver', 'gold'], axis = 1, join = 'inner' )
# Print medals
print(medals)
The result keeps the intersection of the three tables' Indexes.
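A reduced sketch of the inner-join concat (bronze numbers from the table above; silver/gold values assumed):

```python
import pandas as pd

bronze = pd.DataFrame({'Total': [1052.0, 475.0]},
                      index=pd.Index(['United States', 'France'], name='Country'))
silver = pd.DataFrame({'Total': [1195.0, 461.0]},
                      index=pd.Index(['United States', 'France'], name='Country'))
gold = pd.DataFrame({'Total': [2088.0]},
                    index=pd.Index(['United States'], name='Country'))

# join='inner' keeps only Country labels present in every frame
medals = pd.concat([bronze, silver, gold],
                   keys=['bronze', 'silver', 'gold'],
                   axis=1, join='inner')
print(medals.index.tolist())  # ['United States']
```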
Resampling & concatenating DataFrames with inner join | Python
For the .resample() frequency aliases, see the table in Time Series / Date functionality — pandas 0.21.1 documentation.
# Resample and tidy china: china_annual
china_annual = china.resample('A').pct_change(10).dropna()
# Resample and tidy us: us_annual
us_annual = us.resample('A').pct_change(10).dropna()
# Concatenate china_annual and us_annual: gdp
gdp = pd.concat([china_annual,us_annual],join='inner', axis=1)
# Resample gdp and print
print(gdp.resample('10A').last())
With .resample('10A') in place it works, but honestly this function still feels a bit like a black-box operation to me.
<script.py> output:
China US
Year
1971-12-31 0.988860 1.073188
1981-12-31 0.972048 1.749631
1991-12-31 0.962528 0.922811
2001-12-31 2.492511 0.720398
2011-12-31 4.623958 0.460947
2021-12-31 3.789936 0.377506
Merging DataFrames | Python
This mainly covers the pd.merge() function — pandas' counterpart of SQL's JOIN.
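A minimal pd.merge() sketch on toy tables (names made up); the default behaves like SQL INNER JOIN on the shared column:

```python
import pandas as pd

revenue = pd.DataFrame({'city': ['Austin', 'Denver', 'Springfield'],
                        'revenue': [100, 83, 4]})
managers = pd.DataFrame({'city': ['Austin', 'Denver', 'Mendocino'],
                         'manager': ['Charles', 'Joel', 'Brett']})

# Inner join by default: only cities present in both tables survive
merged = pd.merge(revenue, managers, on='city')
print(merged)
```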
Joining DataFrames | Python
In the pd.merge() function, how='inner' is the default.
In the .join() method, how='left' is the default.
Which one to use:
- df1.append(df2): stacking vertically
- pd.concat([df1, df2]):
- stacking many horizontally or vertically
- simple inner/outer joins on Indexes
- df1.join(df2): inner/outer/left/right joins on Indexes
- pd.merge(df1, df2): many joins on multiple columns
Using merge_ordered() | Python
merge_ordered() merges and sorts at the same time.
Worth remembering the common arguments: pd.merge_ordered(austin, houston, on='date', suffixes=['_aus', '_hus'], fill_method='ffill').
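A runnable sketch with made-up Austin/Houston weather ratings:

```python
import pandas as pd

austin = pd.DataFrame({'date': pd.to_datetime(['2016-01-01', '2016-02-08']),
                       'ratings': ['Cloudy', 'Cloudy']})
houston = pd.DataFrame({'date': pd.to_datetime(['2016-01-04', '2016-02-08']),
                        'ratings': ['Rainy', 'Sunny']})

# Rows come out sorted by 'date'; suffixes label the overlapping
# 'ratings' columns, and ffill fills gaps from the previous row
tx = pd.merge_ordered(austin, houston, on='date',
                      suffixes=['_aus', '_hus'], fill_method='ffill')
print(tx)
```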
Using merge_asof() | Python
Similar to pd.merge_ordered(), the pd.merge_asof() function will also merge values in order using the on column, but for each row in the left DataFrame, only rows from the right DataFrame whose 'on' column values are less than the left value will be kept.
oil
Date Price
0 1970-01-01 3.35
1 1970-02-01 3.35
2 1970-03-01 3.35
3 1970-04-01 3.35
4 1970-05-01 3.35
auto
mpg cyl displ hp weight accel yr origin \
0 18.0 8 307.0 130 3504 12.0 1970-01-01 US
1 15.0 8 350.0 165 3693 11.5 1970-01-01 US
2 18.0 8 318.0 150 3436 11.0 1970-01-01 US
3 16.0 8 304.0 150 3433 12.0 1970-01-01 US
4 17.0 8 302.0 140 3449 10.5 1970-01-01 US
name
0 chevrolet chevelle malibu
1 buick skylark 320
2 plymouth satellite
3 amc rebel sst
4 ford torino
\[a.value > b.value\]
To be honest, I don't fully understand this one yet.
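A small sketch of this "most recent value at or before" behavior, with made-up oil prices and car dates; both inputs must be sorted on the key:

```python
import pandas as pd

oil = pd.DataFrame({'Date': pd.to_datetime(['1970-01-01', '1970-02-01']),
                    'Price': [3.35, 3.40]})
auto = pd.DataFrame({'Date': pd.to_datetime(['1970-01-15', '1970-02-20']),
                     'name': ['chevrolet chevelle malibu', 'ford torino']})

# For each auto row, take the latest oil row whose Date <= the auto Date
merged = pd.merge_asof(auto, oil, on='Date')
print(merged['Price'].tolist())  # [3.35, 3.4]
```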
Building medals DataFrame | Python
In [10]: editions['Edition'].head()
Out[10]:
0 1896
1 1900
2 1904
3 1908
4 1912
Name: Edition, dtype: int64
So in for year in editions['Edition']:, year takes concrete year values such as 1896.
Also, medals_dict[year] = pd.read_csv(file_path) is just the familiar my_dict[key] = value pattern,
which makes the code below straightforward.
# Import pandas
import pandas as pd
# Create empty dictionary: medals_dict
medals_dict = {}
for year in editions['Edition']:
    # Create the file path: file_path
    file_path = 'summer_{:d}.csv'.format(year)
    # Load file_path into a DataFrame: medals_dict[year]
    medals_dict[year] = pd.read_csv(file_path)
    # Extract relevant columns: medals_dict[year]
    medals_dict[year] = medals_dict[year][['Athlete', 'NOC', 'Medal']]
    # Assign year to column 'Edition' of medals_dict
    medals_dict[year]['Edition'] = year
# Concatenate medals_dict: medals
medals = pd.concat(medals_dict, ignore_index=True)
# Print first and last 5 rows of medals
print(medals.head())
print(medals.tail())
<script.py> output:
Athlete NOC Medal Edition
0 HAJOS, Alfred HUN Gold 1896
1 HERSCHMANN, Otto AUT Silver 1896
2 DRIVAS, Dimitrios GRE Bronze 1896
3 MALOKINIS, Ioannis GRE Gold 1896
4 CHASAPIS, Spiridon GRE Silver 1896
Athlete NOC Medal Edition
29211 ENGLICH, Mirko GER Silver 2008
29212 MIZGAITIS, Mindaugas LTU Bronze 2008
29213 PATRIKEEV, Yuri ARM Bronze 2008
29214 LOPEZ, Mijain CUB Gold 2008
29215 BAROEV, Khasan RUS Silver 2008
Computational tools — pandas 0.21.1 documentation
So .expanding() is like a rolling window that grows to include all observations so far; .expanding().mean() is the running (cumulative) mean, not a fixed-width moving average.
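A quick sketch contrasting the two on a toy series:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

# Expanding window: mean of all observations up to each row
running = s.expanding().mean()
print(running.tolist())  # [1.0, 1.5, 2.0, 2.5]

# Fixed-size rolling window of 2, for comparison
moving = s.rolling(2).mean()
print(moving.tolist())  # [nan, 1.5, 2.5, 3.5]
```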