These are my Python study notes, based on a DataCamp course. I have summarized the points I consider most important; you can also follow the related links to find the code and explanations.
These notes cover Manipulating DataFrames with pandas, focusing on how to use pandas.
Slicing rows | Python
df.loc['b':'a':-1]
Reverse order: when filtering a pd.DataFrame with .loc, adding a step of -1 easily reverses the slice.
# Slice the row labels 'Perry' to 'Potter': p_counties
p_counties = election.loc['Perry':'Potter']
# Print the p_counties DataFrame
print(p_counties)
# Slice the row labels 'Potter' to 'Perry' in reverse order: p_counties_rev
p_counties_rev = election.loc['Potter':'Perry':-1]
# Print the p_counties_rev DataFrame
print(p_counties_rev)
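The course's `election` DataFrame isn't available here, so the reverse slice can be sketched with a made-up frame sharing the same index labels:

```python
import pandas as pd

# Hypothetical stand-in for the course's `election` DataFrame
df = pd.DataFrame({'votes': [10, 20, 30, 40]},
                  index=['Adams', 'Berks', 'Perry', 'Potter'])

# Forward slice: rows 'Perry' through 'Potter' (inclusive)
forward = df.loc['Perry':'Potter']

# Reverse slice: start and stop swapped, step -1
reverse = df.loc['Potter':'Perry':-1]

print(forward.index.tolist())  # ['Perry', 'Potter']
print(reverse.index.tolist())  # ['Potter', 'Perry']
```

Note that label slicing with a negative step requires a monotonic index.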
Filtering using NaNs | Python
dropna(thresh=1000, axis='columns')
For handling NaNs, the thresh and axis arguments allow some customization.
For NA handling, how='any' and how='all' correspond to OR and AND conditions respectively: how='any' drops a row/column if any value is missing, while how='all' drops it only if every value is missing.
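A minimal sketch of those three options, using a toy frame (not the course data):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, np.nan, np.nan],
                   'c': [1.0, 2.0, 3.0]})

# how='any': drop a column if ANY value is NaN (OR over the values)
print(df.dropna(how='any', axis='columns').columns.tolist())  # ['c']

# how='all': drop a column only if ALL values are NaN (AND over the values)
print(df.dropna(how='all', axis='columns').columns.tolist())  # ['a', 'c']

# thresh: keep columns with at least 2 non-NaN values
print(df.dropna(thresh=2, axis='columns').columns.tolist())  # ['a', 'c']
```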
Using .map() with a dictionary | Python
.map(dictionary)
.map() is a batch, element-wise operation, similar to purrr in R; the concrete usage is below, with a dictionary as the mapping target.
# Create the dictionary: red_vs_blue
red_vs_blue = {'Obama':'blue', 'Romney':'red'}
# Use the dictionary to map the 'winner' column to the new column: election['color']
election['color'] = election['winner'].map(red_vs_blue)
# Print the output of election.head()
print(election.head())
Changing index name labels | Python
This just renames the index labels, nothing more.
# Assign the string 'MONTHS' to sales.index.name
sales.index.name = 'MONTHS'
# Print the sales DataFrame
print(sales)
# Assign the string 'PRODUCTS' to sales.columns.name
sales.columns.name = 'PRODUCTS'
# Print the sales dataframe again
print(sales)
It turns out that index and columns, the two attributes describing the table, have no names by default.
Hierarchical indexing | Python
Multi-level indexes don't seem to exist in R, so this is a very convenient feature. Just use set_index and select two columns; afterwards, sort_index makes the result readable.
# Set the index to be the columns ['state', 'month']: sales
sales = sales.set_index(['state', 'month'])
# Sort the MultiIndex: sales
sales = sales.sort_index()
# Print the sales DataFrame
print(sales)
slice is not particularly easy to understand. Note that the lookup must be a tuple, e.g. stocks.loc[(slice(None), slice('2016-10-03', '2016-10-04')), :], where slice(None) means all labels on that level.
# Look up data for NY in month 1: NY_month1
NY_month1 = sales.loc[('NY', 1), :]
# Look up data for CA and TX in month 2: CA_TX_month2
CA_TX_month2 = sales.loc[(['CA', 'TX'], 2), :]
# Look up data for all states in month 2: all_month2
all_month2 = sales.loc[(slice(None), 2), :]
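Since the course's `sales` frame isn't available, the slice(None) lookup can be reproduced on a toy MultiIndex frame:

```python
import pandas as pd

# Toy version of the course's `sales` DataFrame with a ('state', 'month') MultiIndex
sales = pd.DataFrame(
    {'units': [10, 11, 20, 21, 30, 31]},
    index=pd.MultiIndex.from_product([['CA', 'NY', 'TX'], [1, 2]],
                                     names=['state', 'month']))

# slice(None) selects all labels on that level; note the outer tuple
all_month2 = sales.loc[(slice(None), 2), :]
print(all_month2['units'].tolist())  # [11, 21, 31]
```

Slicing a MultiIndex like this requires the index to be sorted, which from_product guarantees here.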
Pivoting DataFrames | Python
# Pivot the users DataFrame: visitors_pivot
visitors_pivot = users.pivot(index='weekday', columns='city', values='visitors')
# Print the pivoted DataFrame
print(visitors_pivot)
<script.py> output:
city Austin Dallas
weekday
Mon 326 456
Sun 139 237
Pivoting a single variable | Python
# Pivot users pivoted by both signups and visitors: pivot
pivot = users.pivot(index='weekday', columns='city')
# Print the pivoted DataFrame
print(pivot)
visitors signups
city Austin Dallas Austin Dallas
weekday
Mon 326 456 3 5
Sun 139 237 7 12
Pivoting all variables | Python
(un)stack is really just another approach to hierarchical indexing, alongside the earlier set_index and .pivot().
Stacking & unstacking DataFrames | Python
# Unstack users by 'weekday': byweekday
byweekday = users.unstack('weekday')
# Print the byweekday DataFrame
print(byweekday)
# Stack byweekday by 'weekday' and print it
print(byweekday.stack(level = 'weekday'))
<script.py> output:
visitors signups
weekday Mon Sun Mon Sun
city
Austin 326 139 3 7
Dallas 456 237 5 12
visitors signups
city weekday
Austin Mon 326 3
Sun 139 7
Dallas Mon 456 5
Sun 237 12
Restoring the index order | Python
.swaplevel(0, 1) flips the index levels; sort_index() makes the result readable, and .equals() checks for equality.
# Stack 'city' back into the index of bycity: newusers
newusers = bycity.stack(level = 'city')
# Swap the levels of the index of newusers: newusers
newusers = newusers.swaplevel(0, 1)
# Print newusers and verify that the index is not sorted
print(newusers)
# Sort the index of newusers: newusers
newusers = newusers.sort_index()
# Print newusers and verify that the index is now sorted
print(newusers)
# Verify that the new DataFrame is equal to the original
print(newusers.equals(users))
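The unstack/stack/swaplevel roundtrip above can be run end to end on a toy `users` frame (assumed shape, not the course data):

```python
import pandas as pd

# Toy `users` frame indexed by (city, weekday), mimicking the course data
users = pd.DataFrame(
    {'visitors': [326, 139, 456, 237]},
    index=pd.MultiIndex.from_product([['Austin', 'Dallas'], ['Mon', 'Sun']],
                                     names=['city', 'weekday']))

# Unstack 'city' into columns, stack it back, swap levels, then sort
bycity = users.unstack('city')
newusers = bycity.stack(level='city')
newusers = newusers.swaplevel(0, 1).sort_index()

# The roundtrip restores the original frame
print(newusers.equals(users))  # True
```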
Adding names for readability | Python
In [1]: visitors_by_city_weekday
Out[1]:
city Austin Dallas
weekday
Mon 326 456
Sun 139 237
The original DataFrame has weekday as its index and city as its columns name; after reset_index it becomes:
# Reset the index: visitors_by_city_weekday
visitors_by_city_weekday = visitors_by_city_weekday.reset_index()
# Print visitors_by_city_weekday
print(visitors_by_city_weekday)
city weekday Austin Dallas
0 Mon 326 456
1 Sun 139 237
The new row index [0, 1] together with the columns name city reflects the multi-index idea.
# Melt visitors_by_city_weekday: visitors
visitors = pd.melt(visitors_by_city_weekday, id_vars=['weekday'],value_name='visitors')
# Print visitors
print(visitors)
weekday city visitors
0 Mon Austin 326
1 Sun Austin 139
2 Mon Dallas 456
3 Sun Dallas 237
- \(\Box\) Why is city not melted here?
Obtaining key-value pairs with melt() | Python
Key-value pairs can be obtained with col_level=0 in pd.melt().
- \(\Box\) It seems to work even without it.
Setting up a pivot table | Python
Here comes another function, .pivot_table(), on top of the earlier .pivot().
In [1]: by_city_day = users.pivot_table(index = 'weekday', columns='city')
In [2]:
In [3]: # Print by_city_day
In [4]: print(by_city_day)
visitors signups
city Austin Dallas Austin Dallas
weekday
Mon 326 456 3 5
Sun 139 237 7 12
Using other aggregations in pivot tables | Python
.pivot_table() simply replicates the logic of an Excel pivot table.
Both the len and 'count' aggregation functions produce the same result.
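That len/'count' equivalence can be checked on a small numeric frame (made-up numbers, same shape as the course's `users`):

```python
import pandas as pd

users = pd.DataFrame({'weekday': ['Mon', 'Mon', 'Sun', 'Sun'],
                      'visitors': [326, 456, 139, 237],
                      'signups': [3, 5, 7, 12]})

# len and 'count' both report the number of entries per cell
count_table = users.pivot_table(index='weekday', aggfunc='count')
len_table = users.pivot_table(index='weekday', aggfunc=len)

print((count_table.values == len_table.values).all())  # True
```

One caveat: 'count' counts non-null values, while len counts group size, so they diverge when NaNs are present.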
Using margins in pivot tables | Python
margins=True adds marginal totals (an 'All' row).
# Create the DataFrame with the appropriate pivot table: signups_and_visitors
signups_and_visitors = users.pivot_table(index='weekday', aggfunc=sum)
# Print signups_and_visitors
print(signups_and_visitors)
# Add in the margins: signups_and_visitors_total
signups_and_visitors_total = users.pivot_table(index='weekday', aggfunc=sum, margins=True)
# Print signups_and_visitors_total
print(signups_and_visitors_total)
signups visitors
weekday
Mon 8 782
Sun 19 376
signups visitors
weekday
Mon 8.0 782.0
Sun 19.0 376.0
All 27.0 1158.0
Grouping by another series | Python
The key passed to groupby does not have to come from the same table; it can come from another table entirely. As long as the indexes align, this saves you one join step.
# Read life_fname into a DataFrame: life
life = pd.read_csv(life_fname, index_col='Country')
# Read regions_fname into a DataFrame: regions
regions = pd.read_csv(regions_fname, index_col='Country')
# Group life by regions['region']: life_by_region
life_by_region = life.groupby(regions['region'])
# Print the mean over the '2010' column of life_by_region
print(life_by_region['2010'].mean())
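The CSV files aren't available here, but grouping by a Series from another table can be sketched with two toy frames sharing a Country index:

```python
import pandas as pd

# Two toy tables sharing the same 'Country' index (made-up values)
life = pd.DataFrame({'2010': [74.0, 51.0, 76.0]},
                    index=pd.Index(['Chile', 'Chad', 'China'], name='Country'))
regions = pd.DataFrame({'region': ['America', 'Africa', 'Asia']},
                       index=pd.Index(['Chile', 'Chad', 'China'], name='Country'))

# Group one table by a Series from another; pandas aligns them on the index
means = life.groupby(regions['region'])['2010'].mean()
print(means['Africa'])  # 51.0
```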
Aggregation can be done with pivot_table's aggfunc=,
or with groupby()'s agg().
In [14]: by_class_sub.agg(['max','median'])
Out[14]:
age fare
max median max median
pclass
1 80.0 39.0 512.3292 60.0000
2 70.0 29.0 73.5000 15.0458
3 74.0 24.0 69.5500 8.0500
The result here has stacked (hierarchical) columns, so selecting from it requires specifying multiple index levels.
Aggregating on index levels/fields | Python
You can groupby the index, but you must pass level=, e.g. by_year_region = gapminder.groupby(level=['Year', 'region']).
Grouping on a function of the index | Python
Is there a day of the week that is more popular for customers? To find out, you’re going to use .strftime('%a')
to transform the index datetime values to abbreviated days of the week.
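A runnable sketch of grouping on a function of a DatetimeIndex, with made-up dates and values:

```python
import pandas as pd

# Toy sales series with a DatetimeIndex (assumed structure, not the course data)
idx = pd.to_datetime(['2015-02-02', '2015-02-03', '2015-02-09', '2015-02-10'])
sales = pd.Series([10, 20, 30, 40], index=idx)

# Group by abbreviated weekday name derived from the index
by_day = sales.groupby(sales.index.strftime('%a')).sum()
print(sorted(by_day.tolist()))  # [40, 60]: Mondays sum to 40, Tuesdays to 60
```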
Groupby and transformation | Python
The transform here is similar to R's dplyr mutate overwriting a variable.
Moreover, transform and aggregate can be combined by defining a custom function.
def zscore_with_year_and_name(group):
    df = pd.DataFrame(
        {'mpg': zscore(group['mpg']),
         'year': group['yr'],
         'name': group['name']})
    return df
The key point is the group parameter, which plays the role of the function's x.
Detecting outliers with Z-Scores | Python
standardized = gapminder_2010.groupby('region')['life','fertility'].transform(zscore)
This is equivalent to selecting the two columns ['life','fertility'] (newer pandas requires double brackets: [['life','fertility']]), so naturally only results for those two columns are returned.
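A self-contained sketch of groupwise standardization via .transform, with a simple zscore in place of scipy's and made-up data:

```python
import pandas as pd

def zscore(s):
    # Standardize a Series within its group (sample std, ddof=1)
    return (s - s.mean()) / s.std()

df = pd.DataFrame({'region': ['East', 'East', 'West', 'West'],
                   'life': [70.0, 72.0, 60.0, 64.0]})

# .transform applies zscore per group and returns an aligned, same-length result
standardized = df.groupby('region')[['life']].transform(zscore)
print(len(standardized) == len(df))  # True: transform preserves shape
```

This shape preservation is what distinguishes transform from aggregation, which collapses each group to one row.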
Grouping and filtering with .apply() | Python
I haven't quite figured out the method path of this function.
def c_deck_survival(gr):
    c_passengers = gr['cabin'].str.startswith('C').fillna(False)
    return gr.loc[c_passengers, 'survived'].mean()
In [1]: ?str.startswith
Docstring:
S.startswith(prefix[, start[, end]]) -> bool
Return True if S starts with the specified prefix, False otherwise.
With optional start, test S beginning at that position.
With optional end, stop comparing S at that position.
prefix can also be a tuple of strings to try.
Type: method_descriptor
Grouping and filtering with .filter() | Python
by_com_filt = by_company.filter(lambda g:g['Units'].sum() > 35)
This effectively filters whole groups on a condition over an aggregated value, which eliminates the need for a join and is very efficient.
# Read the CSV file into a DataFrame: sales
sales = pd.read_csv('sales.csv', index_col='Date', parse_dates=True)
# Group sales by 'Company': by_company
by_company = sales.groupby('Company')
# Compute the sum of the 'Units' of by_company: by_com_sum
by_com_sum = by_company['Units'].sum()
print(by_com_sum)
# Filter 'Units' where the sum is > 35: by_com_filt
by_com_filt = by_company.filter(lambda g:g['Units'].sum() > 35)
print(by_com_filt)
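Since sales.csv isn't available here, the same .filter pattern can be run on a toy table with made-up company names:

```python
import pandas as pd

# Toy sales table (the course reads sales.csv, which isn't available here)
sales = pd.DataFrame({'Company': ['Acme', 'Acme', 'Hooli', 'Hooli'],
                      'Units': [18, 19, 10, 12]})

by_company = sales.groupby('Company')

# .filter keeps all rows of groups whose Units sum exceeds 35
filt = by_company.filter(lambda g: g['Units'].sum() > 35)
print(filt['Company'].unique().tolist())  # ['Acme']
```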
.map() is great: a bit like transform, though not quite; it is closer to fct_recode from R's forcats package, e.g. under10 = under10.map({True:'under 10', False:'over 10'}).
Code:
# Create the Boolean Series: under10
under10 = titanic['age'] < 10
under10 = under10.map({True:'under 10', False:'over 10'})
# Group by under10 and compute the survival rate
survived_mean_1 = titanic.groupby(under10)['survived'].mean()
print(survived_mean_1)
# Group by under10 and pclass and compute the survival rate
survived_mean_2 = titanic.groupby([under10, 'pclass'])['survived'].mean()
print(survived_mean_2)
Result:
age pclass
over 10 1 0.617555
2 0.380392
3 0.238897
under 10 1 0.750000
2 1.000000
3 0.446429
Name: survived, dtype: float64
Logic:
library(DiagrammeR)
grViz("
digraph LR {
A [label = '.map\nrecode']
B [label = 'groupby\nT/F condition']
C [label = 'mean()']
A -> B -> C
}")
Using .value_counts() for ranking | Python
Notice that .value_counts()
sorts by values by default. The result is returned as a Series of counts indexed by unique entries from the original Series with values (counts) ranked in descending order.
Compute a histogram of the counts of non-null values.
Parameters
----------
values : ndarray (1-d)
sort : boolean, default True
Sort by values
dropna : boolean, default True
Don't include counts of NaN
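A tiny demonstration of that descending sort, with a made-up Series of country codes:

```python
import pandas as pd

s = pd.Series(['USA', 'URS', 'USA', 'GBR', 'USA', 'URS'])

# Counts come back sorted in descending order by default
counts = s.value_counts()
print(counts.index.tolist())  # ['USA', 'URS', 'GBR']
print(counts.tolist())        # [3, 2, 1]
```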
Using .pivot_table() to count medals by type | Python
.sum(axis='columns') produces a new column containing the sum across each row.
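A minimal sketch with made-up medal counts (not the course data):

```python
import pandas as pd

medal_counts = pd.DataFrame({'Gold': [11, 27], 'Silver': [7, 14]},
                            index=[1896, 1900])

# Sum across each row (i.e. over the columns) to get a per-row total
medal_counts['totals'] = medal_counts.sum(axis='columns')
print(medal_counts['totals'].tolist())  # [18, 41]
```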
Understanding the column labels | Python
One idea: filtering is really just selection with various Boolean Series.
Using .nunique() to rank by distinct sports | Python
.nunique()
returns the number of distinct categories.
.isin() is the equivalent of %in% inside filter, a function from R's dplyr package.
Example: is_usa_urs = medals.NOC.isin(['USA','URS']).
USSR: abbreviation for Union of Soviet Socialist Republics.
Counting USA vs. USSR Cold War Olympic Medals | Python
I don't fully understand .idxmax(axis='columns') yet.
In [7]: cold_war_usa_usr_medals.head()
Out[7]:
NOC USA URS
Edition
1952 130.0 117.0
1956 118.0 169.0
1960 112.0 169.0
1964 150.0 174.0
1968 149.0 188.0
In [9]: cold_war_usa_usr_medals.idxmax(axis='columns').head()
Out[9]:
Edition
1952 USA
1956 URS
1960 URS
1964 URS
1968 URS
dtype: object
With .idxmax(axis='columns'), axis='columns' means the comparison runs across the columns, and .idxmax returns the column label of the maximum.
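The behavior can be reproduced on a two-row toy version of the medal table above:

```python
import pandas as pd

# Toy medal counts per edition, USA vs URS columns (values from the output above)
cold_war = pd.DataFrame({'USA': [130.0, 118.0], 'URS': [117.0, 169.0]},
                        index=pd.Index([1952, 1956], name='Edition'))

# axis='columns' compares across columns; idxmax returns the winning column label
winners = cold_war.idxmax(axis='columns')
print(winners.tolist())  # ['USA', 'URS']
```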
Reshaping DataFrames for visualization | Python
unstack can also be applied to the index, not just the columns; what is the difference? Something to investigate later.
Visualizing USA Medal Counts by Edition: Line Plot | Python
In [3]: usa_medals_by_year.head()
Out[3]:
Edition Medal
1896 Bronze 2
Gold 11
Silver 7
1900 Bronze 14
Gold 27
Name: Athlete, dtype: int64
After unstack:
In [5]: usa_medals_by_year.head()
Out[5]:
Medal Bronze Gold Silver
Edition
1896 2 11 7
1900 14 27 14
1904 111 146 137
1908 15 34 14
1912 31 45 25
medals.Medal = pd.Categorical(values = medals.Medal, categories=['Bronze', 'Silver', 'Gold'], ordered=True)
medals.info()
Use it to check whether the category dtype was created.
In [2]: medals.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29216 entries, 0 to 29215
Data columns (total 10 columns):
City 29216 non-null object
Edition 29216 non-null int64
Sport 29216 non-null object
Discipline 29216 non-null object
Athlete 29216 non-null object
NOC 29216 non-null object
Gender 29216 non-null object
Event 29216 non-null object
Event_gender 29216 non-null object
Medal 29216 non-null category
dtypes: category(1), int64(1), object(8)
memory usage: 2.0+ MB
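The effect of an ordered Categorical can be seen on a small made-up Series: sorting follows the declared category order, not alphabetical order.

```python
import pandas as pd

medal = pd.Series(['Gold', 'Bronze', 'Silver', 'Gold'])

# Ordered categorical: Bronze < Silver < Gold
medal = pd.Categorical(values=medal, categories=['Bronze', 'Silver', 'Gold'],
                       ordered=True)
s = pd.Series(medal)

print(s.dtype)                    # category
print(s.sort_values().tolist())   # ['Bronze', 'Silver', 'Gold', 'Gold']
```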