
List comprehensions in Python: study notes

Python Data Science Toolbox (Part 2)

List comprehensions are introduced in detail later. The course teaches you to iterate with for loops and the like, and then to build functions around them. It's taught by Hugo Bowne-Anderson | DataCamp; the guy talks a bit fast, but it's clear enough.

Introduction to iterators | Python

Iterators are most often used together with a for loop. iter() turns an iterable into an iterator. An iterable (like a list) can be printed directly to show its contents, while an iterator is consumed with next() or functions like it. next() and the * unpacking operator are the two best examples of what iterators can do.
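A minimal sketch of that difference, assuming a small example list:

```python
# A list is an iterable: printing it shows all items at once.
nums = [1, 2, 3]
print(nums)

# iter() turns the iterable into an iterator, consumed one item at a time.
it = iter(nums)
first = next(it)   # 1
second = next(it)  # 2
third = next(it)   # 3

# A fourth next() would raise StopIteration; passing a default avoids that.
done = next(it, 'exhausted')
```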

Iterating over iterables (1) | Python

# Create a list of strings: flash
flash = ['jay garrick', 'barry allen', 'wally west', 'bart allen']

# Print each list item in flash using a for loop
for person in flash:
    print(person)
## jay garrick
## barry allen
## wally west
## bart allen

# Create an iterator for flash: superspeed
superspeed = iter(flash)

# Print each item from the iterator
print(next(superspeed))
## jay garrick
print(next(superspeed))
## barry allen
print(next(superspeed))
## wally west
print(next(superspeed))
## bart allen

Iterating over iterables (2) | Python

range() can be used in a for loop to iterate over numbers. A range object is itself an iterable, so wrapping it with iter() turns it into an iterator.

# Create an iterator for range(3): small_value
small_value = iter(range(3))

# Print the values in small_value
print(next(small_value))
## 0
print(next(small_value))
## 1
print(next(small_value))
## 2

# Loop over range(3) and print the values
for num in range(3):
    print(num)
## 0
## 1
## 2
# Create an iterator for range(10 ** 100): googol
googol = iter(range(10 ** 100))

# Print the first 5 values from googol
print(next(googol))
print(next(googol))
print(next(googol))
print(next(googol))
print(next(googol))

Iterators as function arguments | Python

This exercise uses the function range(start, stop[, step]) -> range object.

# Create a range object: values
values = range(10,20+1)

# Print the range object
print(values)
## range(10, 21)

# Create a list of integers: values_list
values_list = list(values)

# Print values_list
print(values_list)
## [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

# Get the sum of values: values_sum
values_sum = sum(values)

# Print values_sum
print(values_sum)
## 165

This shows that list() and sum() can both handle iterables.
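A quick sketch of that point; both calls take the range object or iterator directly, no intermediate list needed:

```python
# sum() consumes a range object directly.
total = sum(range(10, 21))

# list() materializes any iterator, here one built with iter().
materialized = list(iter(range(3)))
```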

Playing with iterators | Python

enumerate() attaches an index to each item of a sequence, turning every element into a tuple. So when you run a for loop over enumerate(...), you can unpack both the index and the value. enumerate() also takes a start argument to set the number the index begins at.

zip() turns two sequences into one, where each element is a tuple containing one item from each.

And in fact, looping through everything with a for loop gives you the same items as unpacking the iterable directly with the * operator.
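A small sketch of that equivalence, collecting the items so they can be compared (the flash list is borrowed from the exercise above):

```python
flash = ['jay garrick', 'barry allen']

# Looping appends one item per iteration...
looped = []
for person in flash:
    looped.append(person)

# ...while * unpacks all items at once; print(*flash) would pass each
# element as a separate argument. Here we capture the unpacking in a list.
unpacked = [*flash]
```

The items are identical; the only difference is that print(*flash) puts them on one line.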

Using enumerate | Python

enumerate() returns an enumerate object that produces a sequence of tuples, and each of the tuples is an index-value pair.

# Create a list of strings: mutants
mutants = ['charles xavier', 
            'bobby drake', 
            'kurt wagner', 
            'max eisenhardt', 
            'kitty pride']

# Create a list of tuples: mutant_list
mutant_list = list(enumerate(mutants))

# Print the list of tuples
print(mutant_list)
## [(0, 'charles xavier'), (1, 'bobby drake'), (2, 'kurt wagner'), (3, 'max eisenhardt'), (4, 'kitty pride')]

# Unpack and print the tuple pairs
for index1,value1 in mutant_list:
    print(index1, value1)
## 0 charles xavier
## 1 bobby drake
## 2 kurt wagner
## 3 max eisenhardt
## 4 kitty pride

# Change the start index
for index2,value2 in enumerate(mutants, start = 1):
    print(index2, value2)
## 1 charles xavier
## 2 bobby drake
## 3 kurt wagner
## 4 max eisenhardt
## 5 kitty pride

To sum up: enumerate() gives you something iterable, at which point both print(list(...)) and a for loop work.

Using zip | Python

Here zip() is very similar to enumerate(): both can be passed to list().

mutants = ['charles xavier',
 'bobby drake',
 'kurt wagner',
 'max eisenhardt',
 'kitty pride']
aliases = ['prof x', 'iceman', 'nightcrawler', 'magneto', 'shadowcat']
powers = ['telepathy',
 'thermokinesis',
 'teleportation',
 'magnetokinesis',
 'intangibility']

# Create a list of tuples: mutant_data
mutant_data = list(zip(mutants, aliases, powers))

# Print the list of tuples
print(mutant_data)
## [('charles xavier', 'prof x', 'telepathy'), ('bobby drake', 'iceman', 'thermokinesis'), ('kurt wagner', 'nightcrawler', 'teleportation'), ('max eisenhardt', 'magneto', 'magnetokinesis'), ('kitty pride', 'shadowcat', 'intangibility')]

# Create a zip object using the three lists: mutant_zip
mutant_zip = zip(mutants, aliases, powers)

# Print the zip object
print(mutant_zip)
## <zip object at 0x0000000027299E88>

# Unpack the zip object and print the tuple values
for value1,value2,value3 in mutant_zip:
    print(value1, value2, value3)
## charles xavier prof x telepathy
## bobby drake iceman thermokinesis
## kurt wagner nightcrawler teleportation
## max eisenhardt magneto magnetokinesis
## kitty pride shadowcat intangibility

Using * and zip to ‘unzip’ | Python

mutants = ['charles xavier',
 'bobby drake',
 'kurt wagner',
 'max eisenhardt',
 'kitty pride']
powers = ['telepathy',
 'thermokinesis',
 'teleportation',
 'magnetokinesis',
 'intangibility']
# Create a zip object from mutants and powers: z1
z1 = zip(mutants,powers)

# Print the tuples in z1 by unpacking with *
print(*z1)

# Re-create a zip object from mutants and powers: z1
z1 = zip(mutants,powers)

# 'Unzip' the tuples in z1 by unpacking with * and zip(): result1, result2
result1, result2 = zip(*z1)

# Check if unpacked tuples are equivalent to original tuples
print(result1 == mutants)
print(result2 == powers)

zip(*z1) performs the 'unzip'.

Using iterators to load large files into memory | Python

The chunksize argument of read_csv() handles loading large datasets. Combined with a for loop, you can then compute over the whole file chunk by chunk; pretty impressive.

import pandas as pd

# Collect the per-chunk sums, then total them
result = []
for chunk in pd.read_csv('data.csv', chunksize=1000):
    result.append(sum(chunk['x']))
total = sum(result)
print(total)

Looking at how total = sum(result) and print(total) are used, we can streamline this:

total = 0
for chunk in pd.read_csv('data.csv', chunksize=1000): 
    total += sum(chunk['x'])
print(total) 

Processing large amounts of Twitter data | Python

# Initialize an empty dictionary: counts_dict
counts_dict = {}

# Iterate over the file chunk by chunk
for chunk in pd.read_csv('tweets.csv', chunksize=10):

    # Iterate over the column in DataFrame
    for entry in chunk['lang']:
        if entry in counts_dict.keys():
            counts_dict[entry] += 1
        else:
            counts_dict[entry] = 1

# Print the populated dictionary
print(counts_dict)
## {'et': 1, 'und': 2, 'en': 97}

That's tedious; let's turn it into a function!

Extracting information for large amounts of Twitter data | Python

# Define count_entries()
def count_entries(csv_file, c_size, colname):
    """Return a dictionary with counts of
    occurrences as value for each key."""
    
    # Initialize an empty dictionary: counts_dict
    counts_dict = {}

    # Iterate over the file chunk by chunk
    for chunk in pd.read_csv(csv_file, chunksize = c_size):

        # Iterate over the column in DataFrame
        for entry in chunk[colname]:
            if entry in counts_dict.keys():
                counts_dict[entry] += 1
            else:
                counts_dict[entry] = 1

    # Return counts_dict
    return counts_dict

# Call count_entries(): result_counts
result_counts = count_entries('tweets.csv', 10, 'lang')

# Print result_counts
print(result_counts)

Congratulations!! | Python

But honestly, writing for loops feels tedious.

List comprehensions | Python

Looking forward to some slick one-line code. That's exactly what a list comprehension is.

# List comprehension: one line
new_nums = [num + 1 for num in nums]

# Equivalent for loop
new_nums = []
for num in nums:
    new_nums.append(num + 1)

It's just the same pieces with the front and back swapped.

List comprehensions: Collapse for loops for building lists into a single line.

Components:

  • Iterable
  • Iterator variable (represents members of the iterable; finally found an explanation of this!)
  • Output expression

They can also be nested. Nested loops, what fun.

pairs_1 = []
# Note: range(0, 2) stops at 1; the stop value 2 is not inclusive.
for num1 in range(0, 2):
    for num2 in range(6, 8):
        pairs_1.append((num1, num2))
print(pairs_1)
## [(0, 6), (0, 7), (1, 6), (1, 7)]
pairs_2 = [(num1, num2) for num1 in range(0, 2) for num2 in range(6, 8)] 
print(pairs_2)
## [(0, 6), (0, 7), (1, 6), (1, 7)]

Write a basic list comprehension | Python

Extracting first letters:

doctor = ['house', 'cuddy', 'chase', 'thirteen', 'wilson']
print([doc[0] for doc in doctor])
## ['h', 'c', 'c', 't', 'w']

Nested list comprehensions | Python

[[output expression] for iterator variable in iterable]

matrix = [[0, 1, 2, 3, 4],
          [0, 1, 2, 3, 4],
          [0, 1, 2, 3, 4],
          [0, 1, 2, 3, 4],
          [0, 1, 2, 3, 4]]
# Create a 5 x 5 matrix using a list of lists: matrix
matrix = [[col for col in range(5)] for row in range(5)]

# Print the matrix
for row in matrix:
    print(row)
## [0, 1, 2, 3, 4]
## [0, 1, 2, 3, 4]
## [0, 1, 2, 3, 4]
## [0, 1, 2, 3, 4]
## [0, 1, 2, 3, 4]

Chew on this one for a bit; it's a rather slick trick.

Advanced comprehensions | Python

print([num ** 2 for num in range(10) if num % 2 == 0])
## [0, 4, 16, 36, 64]
print([num ** 2 if num % 2 == 0 else 0 for num in range(10)])
## [0, 0, 4, 0, 16, 0, 36, 0, 64, 0]

Seeing this, it feels a bit like the pipe %>% in R.

pos_neg = {num: -num for num in range(9)}
print(pos_neg)
## {0: 0, 1: -1, 2: -2, 3: -3, 4: -4, 5: -5, 6: -6, 7: -7, 8: -8}
print(type(pos_neg))
## <class 'dict'>

Using conditionals in comprehensions (1) | Python

Only return names whose length is \(\geq 7\):

# Create a list of strings: fellowship
fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli']

# Create list comprehension: new_fellowship
new_fellowship = [member for member in fellowship if len(member) >= 7]

# Print the new list
print(new_fellowship)
## ['samwise', 'aragorn', 'legolas', 'boromir']

Using conditionals in comprehensions (2) | Python

# Create a list of strings: fellowship
fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli']

# Create list comprehension: new_fellowship
new_fellowship = [member if len(member) >= 7 else '' for member in fellowship]

# Print the new list
print(new_fellowship)
## ['', 'samwise', '', 'aragorn', 'legolas', 'boromir', '']

Dict comprehensions | Python

Recall that the main difference between a list comprehension and a dict comprehension is the use of curly braces {} instead of []. Additionally, members of the dictionary are created using a colon :, as in key:value.

# Create a list of strings: fellowship
fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli']

# Create dict comprehension: new_fellowship
new_fellowship = {member:len(member) for member in fellowship}

# Print the new list
print(new_fellowship)
## {'frodo': 5, 'samwise': 7, 'merry': 5, 'aragorn': 7, 'legolas': 7, 'boromir': 7, 'gimli': 5}

Introduction to generator expressions | Python

A generator doesn't materialize the whole sequence in memory, which is better; otherwise it's no different from a list comprehension.

A generator function produces a sequence of values rather than a single value; otherwise it looks just like an ordinary function. yield is what makes this trick possible.
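A minimal sketch of a generator function (num_sequence is a made-up name, not from the course):

```python
def num_sequence(n):
    """Generate values from 0 to n-1, one at a time."""
    i = 0
    while i < n:
        yield i  # pause here and hand back one value
        i += 1

# Calling the function returns a generator; list() drains it.
gen = num_sequence(3)
values = list(gen)
```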

List comprehensions vs generators | Python

You’ve seen from the videos that list comprehensions and generator expressions look very similar in their syntax, except for the use of parentheses () in generator expressions and brackets [] in list comprehensions.

# List of strings
fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli']

# List comprehension
fellow1 = [member for member in fellowship if len(member) >= 7]

# Generator expression
fellow2 = (member for member in fellowship if len(member) >= 7)

print(type(fellow1))
## <class 'list'>
print(type(fellow2))
## <class 'generator'>

Write your own generator expressions | Python

When writing a for loop over a dictionary, add .items() after in to get key-value pairs (iterating the dict directly yields only the keys).
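A small sketch of that, assuming a made-up dictionary: a generator expression unpacking key-value pairs via .items():

```python
pos_neg = {0: 0, 1: -1, 2: -2}

# .items() yields (key, value) tuples that the generator can unpack.
labels = ('{}: {}'.format(key, value) for key, value in pos_neg.items())
as_list = list(labels)
```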

Changing the output in generator expressions | Python

# Create a list of strings: lannister
lannister = ['cersei', 'jaime', 'tywin', 'tyrion', 'joffrey']

# Create a generator object: lengths
lengths = (len(person) for person in lannister)

# Iterate over and print the values in lengths
for value in lengths:
    print(value)
## 6
## 5
## 5
## 6
## 7

Build a generator | Python

Everything so far has been generator expressions, not generator functions, meaning you couldn't change the output based on an input at will; it didn't feel like a function!

# Create a list of strings
lannister = ['cersei', 'jaime', 'tywin', 'tyrion', 'joffrey']

# Define generator function get_lengths
def get_lengths(input_list):
    """Generator function that yields the
    length of the strings in input_list."""

    # Yield the length of a string
    for person in input_list:
        yield len(person)

# Print the values generated by get_lengths()
for value in get_lengths(lannister):
    print(value)
## 6
## 5
## 5
## 6
## 7

The key here is the yield keyword.

Keep pushing! There's a lot of it and it's tedious, but bulldoze straight through!

List comprehensions for time-stamped data | Python

You can think of DataFrame columns as single-dimension arrays called Series.

Ha ha.

# Extract the created_at column from df: tweet_time
tweet_time = df['created_at']

# Extract the clock time: tweet_clock_time
tweet_clock_time = [entry[11:19] for entry in tweet_time]

# Print the extracted times
print(tweet_clock_time)

Conditional list comprehensions for time-stamped data | Python

# Extract the created_at column from df: tweet_time
tweet_time = df['created_at']

# Extract the clock time: tweet_clock_time
tweet_clock_time = [entry[11:19] for entry in tweet_time if entry[17:19] == '19']

# Print the extracted times
print(tweet_clock_time)

I don't really get these two exercises; since the results can't be reproduced here, they're hard to make sense of.

Dictionaries for data science | Python

# Zip lists: zipped_lists
zipped_lists = zip(feature_names, row_vals)

# Create a dictionary: rs_dict
rs_dict = dict(zipped_lists)

# Print the dictionary
print(rs_dict)

Here zip() merges the data into tuples. Of feature_names and row_vals, the first is the variable names and the second is one row of data. Make a note of dict(): it converts the zip object into a dictionary.

So we can make this a function.

Writing a function to help you | Python

# Define lists2dict()
def lists2dict(list1, list2):
    """Return a dictionary where list1 provides
    the keys and list2 provides the values."""

    # Zip lists: zipped_lists
    zipped_lists = zip(list1, list2)

    # Create a dictionary: rs_dict
    rs_dict = dict(zipped_lists)

    # Return the dictionary
    return rs_dict

# Call lists2dict: rs_fxn
rs_fxn = lists2dict(feature_names, row_vals)

# Print rs_fxn
print(rs_fxn)

Using a list comprehension | Python

# Print the first two lists in row_lists
print(row_lists[0])
print(row_lists[1])

# Turn list of lists into list of dicts: list_of_dicts
list_of_dicts = [lists2dict(feature_names, sublist) for sublist in row_lists]

# Print the first two dictionaries in list_of_dicts
print(list_of_dicts[0])
print(list_of_dicts[1])

The confusing part is that feature_names is given and fixed; it is not produced by the for loop over row_lists.

Also, list_of_dicts comes out as an actual list here; with parentheses () it would be a generator instead.

Turning this all into a DataFrame | Python

This step simply converts the dicts into a DataFrame with pd.DataFrame().

# Import the pandas package
import pandas as pd

# Turn list of lists into list of dicts: list_of_dicts
list_of_dicts = [lists2dict(feature_names, sublist) for sublist in row_lists]

# Turn list of dicts into a DataFrame: df
df = pd.DataFrame(list_of_dicts)

# Print the head of the DataFrame
print(df.head())

Using Python generators for streaming data | Python

When facing streaming data, generators show their practical value: they avoid holding everything in memory and let you churn through the records one by one.

Processing data in chunks (1) | Python

The command with open('datacamp.csv') as datacamp binds the csv file 'datacamp.csv' as datacamp in the context manager. Here, the with statement is the context manager, and its purpose is to ensure that resources are efficiently allocated when opening a connection to a file.

So with is a statement that feels a bit like for, except its job is managing the resource (the open file).
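A small self-contained sketch of what with buys you (the file name demo_with.txt is made up; the file is created first so the example runs on its own):

```python
import os
import tempfile

# Setup only: create a small file to open.
path = os.path.join(tempfile.gettempdir(), 'demo_with.txt')
with open(path, 'w') as f:
    f.write('hello\n')

# The context manager closes the file as soon as the block exits.
with open(path) as file:
    first_line = file.readline()
closed_after = file.closed  # True, with no explicit file.close()
```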

# Open a connection to the file
with open('world_dev_ind.csv') as file:

    # Skip the column names
    file.readline()

    # Initialize an empty dictionary: counts_dict
    counts_dict = {}

    # Process only the first 1000 rows
    for j in range(1000):

        # Split the current line into a list: line
        line = file.readline().split(',')

        # Get the value for the first column: first_col
        first_col = line[0]

        # If the column value is in the dict, increment its value
        if first_col in counts_dict.keys():
            counts_dict[first_col] += 1

        # Else, add to the dict and set value to 1
        else:
            counts_dict[first_col] = 1

# Print the resulting dictionary
print(counts_dict)

Take time to understand this code; it's a watershed for everything that follows. Note that each call to file.readline() returns just one line, not the whole file. Not the easiest thing to grasp; moving on for now.

In short: process a file line by line, for a given number of lines.
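To be precise about readline(): each call returns the next line, and an empty string signals end of file. A sketch using io.StringIO as a stand-in for an open file:

```python
import io

# StringIO behaves like an open text file.
fake_file = io.StringIO('col_a,col_b\n1,2\n3,4\n')

header = fake_file.readline()  # first call: the header line
row1 = fake_file.readline()    # each call advances exactly one line
fake_file.readline()           # consume the last row
at_eof = fake_file.readline()  # '' signals end of file
```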

Writing a generator to load data in chunks (2) | Python

I don't quite get this yet. I feel like I'm not working hard enough! Keep at it! I want to learn all of it.

# Define read_large_file()
def read_large_file(file_object):
    """A generator function to read a large file lazily."""

    # Loop indefinitely until the end of the file
    while True:

        # Read a line from the file: data
        data = file_object.readline()

        # Break if this is the end of the file
        if not data:
            break

        # Yield the line of data
        yield data
        
# Open a connection to the file
with open('world_dev_ind.csv') as file:

    # Create a generator object for the file: gen_file
    gen_file = read_large_file(file)

    # Print the first three lines of the file
    print(next(gen_file))
    print(next(gen_file))
    print(next(gen_file))

Again: it produces a sequence of values rather than one value, and otherwise works like an ordinary function; yield makes it happen.

And what we get here is not a list but a generator.

Writing a generator to load data in chunks (3) | Python

# Initialize an empty dictionary: counts_dict
counts_dict = {}

# Open a connection to the file
with open('world_dev_ind.csv') as file:

    # Iterate over the generator from read_large_file()
    for line in read_large_file(file):

        row = line.split(',')
        first_col = row[0]

        if first_col in counts_dict.keys():
            counts_dict[first_col] += 1
        else:
            counts_dict[first_col] = 1

# Print            
print(counts_dict)

open('world_dev_ind.csv') opens the file via the open function; note that the argument is the path 'world_dev_ind.csv'. Here read_large_file is the function that produces the generator.

Using pandas' read_csv iterator for streaming data | Python

Time for the pd.read_csv() chunksize tricks.

Writing an iterator to load data in chunks (1) | Python

Adding chunksize=100 in pd.read_csv(filename, chunksize=100) makes the result iterable, so next() can be used on it.

# Import the pandas package
import pandas as pd

# Initialize reader object: df_reader
df_reader = pd.read_csv('ind_pop.csv', chunksize = 10)

# Print two chunks
print(next(df_reader))
print(next(df_reader))

This code illustrates the sentence above nicely.

Writing an iterator to load data in chunks (2) | Python

# Initialize reader object: urb_pop_reader
urb_pop_reader = pd.read_csv('ind_pop_data.csv', chunksize = 1000)

# Get the first DataFrame chunk: df_urb_pop
df_urb_pop = next(urb_pop_reader)

# Check out the head of the DataFrame
print(df_urb_pop.head())

# Check out specific country: df_pop_ceb
df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == 'CEB']

# Zip DataFrame columns of interest: pops
pops = zip(df_pop_ceb['Total Population'], df_pop_ceb['Urban population (% of total)'])

# Turn zip object into list: pops_list
pops_list = list(pops)

# Print pops_list
print(pops_list)

In pd.read_csv('ind_pop_data.csv', chunksize = 1000), pd.read_csv loads the data and chunksize sets how many rows go into each chunk; the result becomes an iterable, as the next() call demonstrates. head() is for peeking at the data. df_urb_pop[df_urb_pop['CountryCode'] == 'CEB'] is the equivalent of filter in R. zip() merges the columns into tuples, then list() turns the zip object into a list, and print() shows the result.

Writing an iterator to load data in chunks (3) | Python

# Initialize reader object: urb_pop_reader
urb_pop_reader = pd.read_csv('ind_pop_data.csv', chunksize = 1000)

# Get the first DataFrame chunk: df_urb_pop
df_urb_pop = next(urb_pop_reader)

# Check out specific country: df_pop_ceb
df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == 'CEB']

# Zip DataFrame columns of interest: pops
pops = zip(df_pop_ceb['Total Population'], 
            df_pop_ceb['Urban population (% of total)'])

# Turn zip object into list: pops_list
pops_list = list(pops)

# Use list comprehension to create new DataFrame column 'Total Urban Population'
df_pop_ceb['Total Urban Population'] = [int(tup[0]*tup[1]*0.01) for tup in pops_list]

# Plot urban population data
df_pop_ceb.plot(kind='scatter', x='Year', y='Total Urban Population')
plt.show()

This is really just what mutate does in R, made more complicated. The last bit, [int(tup[0]*tup[1]*0.01) for tup in pops_list], is a little tricky: each tup has the form (x, y), so tup[0] and tup[1] are, naturally, x and y.
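As a side note, the intermediate pops_list isn't strictly needed; zip() can feed the comprehension directly, and tuple unpacking replaces tup[0]/tup[1]. A sketch with plain lists standing in for the DataFrame columns:

```python
# Made-up stand-ins for the 'Total Population' and
# 'Urban population (% of total)' columns.
total_population = [1000, 2000]
urban_pct = [50.0, 25.0]

# Unpack each (total, pct) pair instead of indexing tup[0]/tup[1].
total_urban = [int(total * pct * 0.01)
               for total, pct in zip(total_population, urban_pct)]
```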

Keep going!

Writing an iterator to load data in chunks (4) | Python

# Initialize reader object: urb_pop_reader
urb_pop_reader = pd.read_csv('ind_pop_data.csv', chunksize=1000)

# Initialize empty DataFrame: data
data = pd.DataFrame()

# Iterate over each DataFrame chunk
for df_urb_pop in urb_pop_reader:

    # Check out specific country: df_pop_ceb
    df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == 'CEB']

    # Zip DataFrame columns of interest: pops
    pops = zip(df_pop_ceb['Total Population'],
                df_pop_ceb['Urban population (% of total)'])

    # Turn zip object into list: pops_list
    pops_list = list(pops)

    # Use list comprehension to create new DataFrame column 'Total Urban Population'
    df_pop_ceb['Total Urban Population'] = [int(tup[0] * tup[1] * 0.01) for tup in pops_list]
    
    # Append DataFrame chunk to data: data
    data = pd.concat([data, df_pop_ceb])

# Plot urban population data
data.plot(kind='scatter', x='Year', y='Total Urban Population')
plt.show()

Finally understood this code: first data = pd.DataFrame() creates an empty table, then the for loop fills it up.
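One caveat on this fill-as-you-go pattern: repeatedly appending to a DataFrame is slow, and DataFrame.append was removed in pandas 2.0; the common idiom is to collect the chunks in a list and combine them once at the end with pd.concat(parts). A pure-Python sketch of the same idea:

```python
# Collect processed pieces in a list...
parts = []
for chunk in ([1, 2], [3], [4, 5]):
    parts.append([x * 10 for x in chunk])

# ...and combine them in a single step at the end
# (the pandas equivalent is pd.concat(parts)).
combined = [item for part in parts for item in part]
```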

Writing an iterator to load data in chunks (5) | Python

There are three steps in total, and now it's time to build a function:

  • Loading of the file chunk by chunk,
  • Creating the new column of urban population values, and
  • Plotting the urban population data.

# Define plot_pop()
def plot_pop(filename, country_code):

    # Initialize reader object: urb_pop_reader
    urb_pop_reader = pd.read_csv(filename, chunksize=1000)

    # Initialize empty DataFrame: data
    data = pd.DataFrame()
    
    # Iterate over each DataFrame chunk
    for df_urb_pop in urb_pop_reader:
        # Check out specific country: df_pop_ceb
        df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == country_code]

        # Zip DataFrame columns of interest: pops
        pops = zip(df_pop_ceb['Total Population'],
                    df_pop_ceb['Urban population (% of total)'])

        # Turn zip object into list: pops_list
        pops_list = list(pops)

        # Use list comprehension to create new DataFrame column 'Total Urban Population'
        df_pop_ceb['Total Urban Population'] = [int(tup[0] * tup[1] * 0.01) for tup in pops_list]
    
        # Append DataFrame chunk to data: data
        data = pd.concat([data, df_pop_ceb])

    # Plot urban population data
    data.plot(kind='scatter', x='Year', y='Total Urban Population')
    plt.show()

# Set the filename: fn
fn = 'ind_pop_data.csv'

# Call plot_pop for country code 'CEB'
plot_pop(fn, 'CEB')

# Call plot_pop for country code 'ARB'
plot_pop(fn, 'ARB')

Here, one glance at country_code tells you it stands in for codes like 'CEB'.

