
DataCamp pandas DataFrames study notes

These are my Python study notes, based on the DataCamp course. I have summarized the points I consider most important; you can also follow the related links to find the full code and explanations.

These notes cover Manipulating DataFrames with pandas and mainly concern how to use pandas.

Slicing rows | Python

df.loc['b':'a':-1] slices in reverse order.

When selecting rows of a pd.DataFrame with .loc, appending :-1 as the slice step is a simple way to reverse the row order.

# Slice the row labels 'Perry' to 'Potter': p_counties
p_counties = election.loc['Perry':'Potter']

# Print the p_counties DataFrame
print(p_counties)

# Slice the row labels 'Potter' to 'Perry' in reverse order: p_counties_rev
p_counties_rev = election.loc['Potter':'Perry':-1]

# Print the p_counties_rev DataFrame
print(p_counties_rev)

Filtering using NaNs | Python

dropna(thresh=1000, axis='columns') handles NaNs; thresh and axis let you customize what gets dropped.

For NA handling, how='any' drops a row (or column) that contains any NaN, while how='all' drops it only when every value is NaN; a sketch follows.
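A minimal sketch of the difference, using a made-up DataFrame (the names here are hypothetical):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, np.nan],
                   'b': [np.nan, 5.0, np.nan]})

print(df.dropna(how='any'))   # drops rows containing at least one NaN -> row 1 only
print(df.dropna(how='all'))   # drops only all-NaN rows -> rows 0 and 1 remain

# thresh: keep only columns with at least 2 non-null values -> column 'a'
print(df.dropna(thresh=2, axis='columns'))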

Using .map() with a dictionary | Python

.map(dictionary)

.map() is a batch operation, similar to purrr in R; as shown below, each value is looked up in a dictionary.

# Create the dictionary: red_vs_blue
red_vs_blue = {'Obama':'blue', 'Romney':'red'}

# Use the dictionary to map the 'winner' column to the new column: election['color']
election['color'] = election['winner'].map(red_vs_blue)

# Print the output of election.head()
print(election.head())

Changing index name labels | Python

This only changes the name attached to the index, nothing more.

# Assign the string 'MONTHS' to sales.index.name
sales.index.name = 'MONTHS'

# Print the sales DataFrame
print(sales)

# Assign the string 'PRODUCTS' to sales.columns.name 
sales.columns.name = 'PRODUCTS'

# Print the sales dataframe again
print(sales)

It turns out that the index and columns objects describing the table previously had no name set at all.

Hierarchical indexing | Python

Multi-level indexes don't seem to have an equivalent in R, so this is a very convenient feature.

Just pass the two columns to set_index; a sort_index afterwards makes the result readable.

# Set the index to be the columns ['state', 'month']: sales
sales = sales.set_index(['state', 'month'])

# Sort the MultiIndex: sales
sales = sales.sort_index()

# Print the sales DataFrame
print(sales)

Slicing a MultiIndex is not especially intuitive. Note that the indexer is a tuple, e.g. stocks.loc[(slice(None), slice('2016-10-03', '2016-10-04')), :], where slice(None) means all labels at that level.

# Look up data for NY in month 1: NY_month1
NY_month1 = sales.loc[('NY', 1), :]

# Look up data for CA and TX in month 2: CA_TX_month2
CA_TX_month2 = sales.loc[(['CA', 'TX'], 2), :]

# Look up data for all states in month 2: all_month2
all_month2 = sales.loc[(slice(None), 2), :]

Pivoting DataFrames | Python

# Pivot the users DataFrame: visitors_pivot
visitors_pivot = users.pivot(index = 'weekday', columns='city',values='visitors')

# Print the pivoted DataFrame
print(visitors_pivot)
<script.py> output:
    city     Austin  Dallas
    weekday                
    Mon         326     456
    Sun         139     237

Pivoting a single variable | Python

# Pivot users pivoted by both signups and visitors: pivot
pivot = users.pivot(index = 'weekday', columns='city')

# Print the pivoted DataFrame
print(pivot)
            visitors        signups       
    city      Austin Dallas  Austin Dallas
    weekday                               
    Mon          326    456       3      5
    Sun          139    237       7     12

Pivoting all variables | Python

.stack()/.unstack() are really just another approach to hierarchical indexing, alongside the earlier set_index and .pivot().

Stacking & unstacking DataFrames | Python

# Unstack users by 'weekday': byweekday
byweekday = users.unstack('weekday')

# Print the byweekday DataFrame
print(byweekday)

# Stack byweekday by 'weekday' and print it
print(byweekday.stack(level = 'weekday'))
<script.py> output:
            visitors      signups    
    weekday      Mon  Sun     Mon Sun
    city                             
    Austin       326  139       3   7
    Dallas       456  237       5  12
                    visitors  signups
    city   weekday                   
    Austin Mon           326        3
           Sun           139        7
    Dallas Mon           456        5
           Sun           237       12

Restoring the index order | Python

Use .swaplevel(0, 1) to flip the index levels; sort_index() is for readability, and .equals() checks for equality.

# Stack 'city' back into the index of bycity: newusers
newusers = bycity.stack(level = 'city')

# Swap the levels of the index of newusers: newusers
newusers = newusers.swaplevel(0, 1)

# Print newusers and verify that the index is not sorted
print(newusers)

# Sort the index of newusers: newusers
newusers = newusers.sort_index()

# Print newusers and verify that the index is now sorted
print(newusers)

# Verify that the new DataFrame is equal to the original
print(newusers.equals(users))

Adding names for readability | Python

In [1]: visitors_by_city_weekday
Out[1]: 
city     Austin  Dallas
weekday                
Mon         326     456
Sun         139     237

The original DataFrame has weekday as its index name and city as its columns name; after reset_index it becomes:

# Reset the index: visitors_by_city_weekday
visitors_by_city_weekday = visitors_by_city_weekday.reset_index()

# Print visitors_by_city_weekday
print(visitors_by_city_weekday)

city weekday  Austin  Dallas
0        Mon     326     456
1        Sun     139     237

Here the city label printed above the new 0, 1 integer index is the leftover columns name, which reflects the MultiIndex idea.

# Melt visitors_by_city_weekday: visitors
visitors = pd.melt(visitors_by_city_weekday, id_vars=['weekday'],value_name='visitors')

# Print visitors
print(visitors)
      weekday    city  visitors
    0     Mon  Austin       326
    1     Sun  Austin       139
    2     Mon  Dallas       456
    3     Sun  Dallas       237
  • \(\Box\) Why does city survive the melt without being named? (Columns outside id_vars become value_vars automatically, and the columns name city supplies the var_name.)

Obtaining key-value pairs with melt() | Python

Key-value pairs can be obtained with col_level=0 in pd.melt().

  • \(\Box\) It seems to work without it too; the sketch below shows what actually changes.
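A sketch of what col_level=0 changes, using a hypothetical DataFrame with two column levels (as after a full pivot): without it, melt keeps every column level as its own variable column.

import pandas as pd

df = pd.DataFrame({('visitors', 'Austin'): [326, 139],
                   ('visitors', 'Dallas'): [456, 237]},
                  index=['Mon', 'Sun'])

print(pd.melt(df))               # two label columns: variable_0, variable_1
print(pd.melt(df, col_level=0))  # single 'variable' column with level-0 labels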

Setting up a pivot table | Python

Another function, .pivot_table(), joins the earlier .pivot().

In [1]: by_city_day = users.pivot_table(index = 'weekday', columns='city')

In [2]: 

In [3]: # Print by_city_day

In [4]: print(by_city_day)
        visitors        signups       
city      Austin Dallas  Austin Dallas
weekday                               
Mon          326    456       3      5
Sun          139    237       7     12

Using other aggregations in pivot tables | Python

.pivot_table() replicates the logic of an Excel pivot table. Both the len and 'count' aggregation functions produce the same result, as in the sketch below.
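A minimal sketch, reusing the users DataFrame from above:

# 'count' (string) and len (builtin) both count non-null entries per cell
count_by_weekday1 = users.pivot_table(index='weekday', aggfunc='count')
count_by_weekday2 = users.pivot_table(index='weekday', aggfunc=len)

# The two pivot tables should be identical
print(count_by_weekday1.equals(count_by_weekday2))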

Using margins in pivot tables | Python

margins=True adds marginal totals (the All row).

# Create the DataFrame with the appropriate pivot table: signups_and_visitors
signups_and_visitors = users.pivot_table(index='weekday', aggfunc=sum)

# Print signups_and_visitors
print(signups_and_visitors)

# Add in the margins: signups_and_visitors_total 
signups_and_visitors_total = users.pivot_table(index='weekday', aggfunc=sum, margins=True)

# Print signups_and_visitors_total
print(signups_and_visitors_total)
             signups  visitors
    weekday                   
    Mon            8       782
    Sun           19       376
             signups  visitors
    weekday                   
    Mon          8.0     782.0
    Sun         19.0     376.0
    All         27.0    1158.0

Grouping by another series | Python

.groupby() does not require the grouping key to come from the same table: any Series from another table works as long as the indexes align, which saves an explicit join.

# Read life_fname into a DataFrame: life
life = pd.read_csv(life_fname, index_col='Country')

# Read regions_fname into a DataFrame: regions
regions = pd.read_csv(regions_fname, index_col='Country')

# Group life by regions['region']: life_by_region
life_by_region = life.groupby(regions['region'])

# Print the mean over the '2010' column of life_by_region
print(life_by_region['2010'].mean())

Aggregation can be done with pivot_table plus aggfunc=, or with groupby() plus agg().
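A sketch of how the by_class_sub below could be set up, assuming the course's Titanic data:

# Group Titanic passengers by class, then restrict to two columns
by_class = titanic.groupby('pclass')
by_class_sub = by_class[['age', 'fare']]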

In [14]: by_class_sub.agg(['max','median'])
Out[14]: 
         age             fare         
         max median       max   median
pclass                                
1       80.0   39.0  512.3292  60.0000
2       70.0   29.0   73.5000  15.0458
3       74.0   24.0   69.5500   8.0500

Because the aggregated columns are stacked into a MultiIndex, selecting a value requires specifying multiple levels, as in the sketch below.
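A sketch of that selection, indexing the stacked columns with a tuple:

res = by_class_sub.agg(['max', 'median'])

# The aggregated columns form a MultiIndex, so one value needs both levels
print(res[('fare', 'median')])
print(res.loc[:, ('age', 'max')])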


Aggregating on index levels/fields | Python

You can group by the index, but you must pass level=, e.g. by_year_region = gapminder.groupby(level=['Year', 'region']).

Grouping on a function of the index | Python

Is there a day of the week that is more popular for customers? To find out, you’re going to use .strftime('%a') to transform the index datetime values to abbreviated days of the week.
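A sketch, assuming sales was read with a DatetimeIndex as in the examples elsewhere in these notes:

# Group by the abbreviated weekday name derived from the index
by_day = sales.groupby(sales.index.strftime('%a'))

# Units sold per day of the week
units_sum = by_day['Units'].sum()
print(units_sum)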

Groupby and transformation | Python

Here .transform() is similar to overwriting a variable with mutate() from R's dplyr package.

Transform and aggregate steps can also be combined by defining a custom function:

from scipy.stats import zscore
import pandas as pd

def zscore_with_year_and_name(group):
    # Standardize 'mpg' within the group, carrying 'yr' and 'name' along
    df = pd.DataFrame(
        {'mpg': zscore(group['mpg']),
         'year': group['yr'],
         'name': group['name']})
    return df

The key point is the group argument: the function receives each group's sub-DataFrame in turn.
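Hypothetical usage on the course's auto dataset, applying the function to each year's group:

# .apply() passes each group's sub-DataFrame to the custom function
standardized = auto.groupby('yr').apply(zscore_with_year_and_name)
print(standardized.head())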

Detecting outliers with Z-Scores | Python

standardized = gapminder_2010.groupby('region')[['life', 'fertility']].transform(zscore) selects the 'life' and 'fertility' columns first, so naturally only those two columns are returned.

Grouping and filtering with .apply() | Python

I haven't fully followed the method path of this function.

def c_deck_survival(gr):
    # Mask: cabin string starts with 'C'; missing cabins count as False
    c_passengers = gr['cabin'].str.startswith('C').fillna(False)
    return gr.loc[c_passengers, 'survived'].mean()
In [1]: ?str.startswith
Docstring:
S.startswith(prefix[, start[, end]]) -> bool

Return True if S starts with the specified prefix, False otherwise.
With optional start, test S beginning at that position.
With optional end, stop comparing S at that position.
prefix can also be a tuple of strings to try.
Type:      method_descriptor
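Hypothetical usage, grouping the Titanic data by sex:

# Survival rate of C-deck passengers, computed within each sex group
c_surv_by_sex = titanic.groupby('sex').apply(c_deck_survival)
print(c_surv_by_sex)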

Grouping and filtering with .filter() | Python

by_com_filt = by_company.filter(lambda g: g['Units'].sum() > 35) keeps whole groups that satisfy an aggregate condition, which saves a join and is very efficient.

# Read the CSV file into a DataFrame: sales
sales = pd.read_csv('sales.csv', index_col='Date', parse_dates=True)

# Group sales by 'Company': by_company
by_company = sales.groupby('Company')

# Compute the sum of the 'Units' of by_company: by_com_sum
by_com_sum = by_company['Units'].sum()
print(by_com_sum)

# Filter 'Units' where the sum is > 35: by_com_filt
by_com_filt = by_company.filter(lambda g:g['Units'].sum() > 35)
print(by_com_filt)

.map() is great. It is a bit like transform, but really closer to fct_recode from R's forcats package, e.g. under10 = under10.map({True: 'under 10', False: 'over 10'}).

Code

# Create the Boolean Series: under10
under10 = titanic['age'] < 10
under10 = under10.map({True:'under 10', False:'over 10'})
# Group by under10 and compute the survival rate
survived_mean_1 = titanic.groupby(under10)['survived'].mean()
print(survived_mean_1)

# Group by under10 and pclass and compute the survival rate
survived_mean_2 = titanic.groupby([under10, 'pclass'])['survived'].mean()
print(survived_mean_2)

Result

    age       pclass
    over 10   1         0.617555
              2         0.380392
              3         0.238897
    under 10  1         0.750000
              2         1.000000
              3         0.446429
    Name: survived, dtype: float64

Logic

library(DiagrammeR)
grViz("
      digraph LR {

A [label = '.map\nrecode']  
B [label = 'groupby\nT/F condition']
C [label = 'mean()']

A -> B -> C
      }")

Using .value_counts() for ranking | Python

Notice that .value_counts() sorts by values by default. The result is returned as a Series of counts indexed by unique entries from the original Series with values (counts) ranked in descending order.

Compute a histogram of the counts of non-null values.

Parameters
----------
values : ndarray (1-d)
sort : boolean, default True
    Sort by values

dropna : boolean, default True
    Don't include counts of NaN
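Hypothetical usage on the medals data:

# Top 15 countries ranked by total medal count
medal_counts = medals['NOC'].value_counts()
print(medal_counts.head(15))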

Using .pivot_table() to count medals by type | Python

.sum(axis='columns') produces a new column containing the sum across each row, as in the sketch below.
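A sketch, assuming counted is a pivot table of medal counts per country (the names are hypothetical):

# Row-wise total across the medal-type columns
counted['totals'] = counted.sum(axis='columns')

# Rank countries by the new totals column
counted = counted.sort_values('totals', ascending=False)
print(counted.head())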

Understanding the column labels | Python

One takeaway: filtering is really just selection with various Boolean Series.

Using .nunique() to rank by distinct sports | Python

.nunique() returns the number of distinct categories.
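Hypothetical usage on the medals data:

# Number of distinct sports each country has medaled in
Nsports = medals.groupby('NOC')['Sport'].nunique()
print(Nsports.sort_values(ascending=False).head())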

.isin() is the equivalent of %in% inside filter() from R's dplyr package. Example: is_usa_urs = medals.NOC.isin(['USA','URS']).

USSR: abbreviation for the Union of Soviet Socialist Republics.

Counting USA vs. USSR Cold War Olympic Medals | Python

.idxmax(axis='columns') was not obvious to me at first.

In [7]: cold_war_usa_usr_medals.head()
Out[7]: 
NOC        USA    URS
Edition              
1952     130.0  117.0
1956     118.0  169.0
1960     112.0  169.0
1964     150.0  174.0
1968     149.0  188.0
In [9]: cold_war_usa_usr_medals.idxmax(axis='columns').head()
Out[9]: 
Edition
1952    USA
1956    URS
1960    URS
1964    URS
1968    URS
dtype: object

With .idxmax(axis='columns'), axis='columns' means the comparison runs across the columns within each row, and .idxmax returns the name of the winning column.

Reshaping DataFrames for visualization | Python

.unstack() can also be applied to an index level, not only to columns. What exactly is the difference? Something to look into later.

Visualizing USA Medal Counts by Edition: Line Plot | Python

In [3]: usa_medals_by_year.head()
Out[3]: 
Edition  Medal 
1896     Bronze     2
         Gold      11
         Silver     7
1900     Bronze    14
         Gold      27
Name: Athlete, dtype: int64

After unstacking:

In [5]: usa_medals_by_year.head()
Out[5]: 
Medal    Bronze  Gold  Silver
Edition                      
1896          2    11       7
1900         14    27      14
1904        111   146     137
1908         15    34      14
1912         31    45      25
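A sketch of the line plot this section refers to, drawn from the unstacked frame above:

import matplotlib.pyplot as plt

# One line per medal type across Olympic editions
usa_medals_by_year.plot.line()
plt.show()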

medals.Medal = pd.Categorical(values = medals.Medal, categories=['Bronze', 'Silver', 'Gold'], ordered=True)

Use medals.info() to confirm that the category dtype was created.

In [2]: medals.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29216 entries, 0 to 29215
Data columns (total 10 columns):
City            29216 non-null object
Edition         29216 non-null int64
Sport           29216 non-null object
Discipline      29216 non-null object
Athlete         29216 non-null object
NOC             29216 non-null object
Gender          29216 non-null object
Event           29216 non-null object
Event_gender    29216 non-null object
Medal           29216 non-null category
dtypes: category(1), int64(1), object(8)
memory usage: 2.0+ MB

Certificate