Python Data Science Toolbox (Part 2)
List comprehensions will be introduced in detail later on.
The course teaches you to use for loops and similar iteration patterns, and then to build functions around them.
Hugo Bowne-Anderson | DataCamp. It is his course; he talks a bit fast, but it is clear enough.
Introduction to iterators | Python 1
Iterators are most often used together with a for loop.
iter() defines an object that can be iterated over: it turns an iterable into an iterator.
An iterable can be passed to print() directly; an iterator is consumed with next() or similar calls.
next() and *____ unpacking are the two best examples of what an iterator can do.
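A minimal sketch of the iterable/iterator distinction (the variable names are illustrative):

```python
# A list is an iterable: it can be printed or looped over directly.
flash = ['jay garrick', 'barry allen']
print(flash)

# iter() turns the iterable into an iterator; next() pulls one item at a time.
it = iter(flash)
print(next(it))  # jay garrick
print(next(it))  # barry allen

# The * operator unpacks all remaining items of a fresh iterator at once.
it = iter(flash)
print(*it)  # jay garrick barry allen
```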
Iterating over iterables (1) | Python
# Create a list of strings: flash
flash = ['jay garrick', 'barry allen', 'wally west', 'bart allen']
# Print each list item in flash using a for loop
for person in flash:
    print(person)
## jay garrick
## barry allen
## wally west
## bart allen
# Create an iterator for flash: superspeed
superspeed = iter(flash)
# Print each item from the iterator
print(next(superspeed))
## jay garrick
print(next(superspeed))
## barry allen
print(next(superspeed))
## wally west
print(next(superspeed))
## bart allen
Iterating over iterables (2) | Python
range() can be used in a for loop to iterate over a sequence of numbers (note that it takes integers, not a string: range('string') would raise a TypeError). A range object is an iterable, so adding iter() turns it into an iterator.
# Create an iterator for range(3): small_value
small_value = iter(range(3))
# Print the values in small_value
print(next(small_value))
## 0
print(next(small_value))
## 1
print(next(small_value))
## 2
# Loop over range(3) and print the values
for num in range(3):
    print(num)
## 0
## 1
## 2
# Create an iterator for range(10 ** 100): googol
googol = iter(range(10 ** 100))
# Print the first 5 values from googol
print(next(googol))
## 0
print(next(googol))
## 1
print(next(googol))
## 2
print(next(googol))
## 3
print(next(googol))
## 4
Iterators as function arguments | Python
range(start, stop[, step]) -> range object
This exercise uses the range() function.
# Create a range object: values
values = range(10, 20 + 1)
# Print the range object
print(values)
## range(10, 21)
# Create a list of integers: values_list
values_list = list(values)
# Print values_list
print(values_list)
## [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
# Get the sum of values: values_sum
values_sum = sum(values)
# Print values_sum
print(values_sum)
## 165
This shows that list() and sum() can operate directly on an iterable.
Playing with iterators | Python
enumerate() attaches an index to each element of a sequence, turning each element into a tuple.
So when you run a for loop over enumerate(____), you can unpack both the index and the value.
enumerate() also accepts a start parameter to set the number the index starts from.
zip() merges two sequences into one, with each element becoming a tuple that holds one item from each.
Also, looping over everything with a for loop amounts to the same thing as unpacking with *____.
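A quick illustrative check that a for loop and * unpacking walk the same items, and that the iterator is used up afterwards:

```python
z = zip([1, 2], ['a', 'b'])
print(*z)       # (1, 'a') (2, 'b')
print(list(z))  # [] -- the zip iterator is already exhausted
```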
Using enumerate | Python
enumerate()
returns an enumerate object that produces a sequence of tuples, and each of the tuples is an index-value pair.
# Create a list of strings: mutants
mutants = ['charles xavier',
           'bobby drake',
           'kurt wagner',
           'max eisenhardt',
           'kitty pride']
# Create a list of tuples: mutant_list
mutant_list = list(enumerate(mutants))
# Print the list of tuples
print(mutant_list)
## [(0, 'charles xavier'), (1, 'bobby drake'), (2, 'kurt wagner'), (3, 'max eisenhardt'), (4, 'kitty pride')]
# Unpack and print the tuple pairs
for index1, value1 in mutant_list:
    print(index1, value1)
## 0 charles xavier
## 1 bobby drake
## 2 kurt wagner
## 3 max eisenhardt
## 4 kitty pride
# Change the start index
for index2, value2 in enumerate(mutants, start=1):
    print(index2, value2)
## 1 charles xavier
## 2 bobby drake
## 3 kurt wagner
## 4 max eisenhardt
## 5 kitty pride
To summarize: enumerate() returns something iterable, and at that point both print(list()) and a for ____ loop work on it.
Using zip | Python
zip() here is very similar to enumerate(): both can be consumed by list().
mutants = ['charles xavier',
           'bobby drake',
           'kurt wagner',
           'max eisenhardt',
           'kitty pride']
aliases = ['prof x', 'iceman', 'nightcrawler', 'magneto', 'shadowcat']
powers = ['telepathy',
          'thermokinesis',
          'teleportation',
          'magnetokinesis',
          'intangibility']
# Create a list of tuples: mutant_data
mutant_data = list(zip(mutants, aliases, powers))
# Print the list of tuples
print(mutant_data)
## [('charles xavier', 'prof x', 'telepathy'), ('bobby drake', 'iceman', 'thermokinesis'), ('kurt wagner', 'nightcrawler', 'teleportation'), ('max eisenhardt', 'magneto', 'magnetokinesis'), ('kitty pride', 'shadowcat', 'intangibility')]
# Create a zip object using the three lists: mutant_zip
mutant_zip = zip(mutants, aliases, powers)
# Print the zip object
print(mutant_zip)
## <zip object at 0x0000000027299E88>
# Unpack the zip object and print the tuple values
for value1, value2, value3 in mutant_zip:
    print(value1, value2, value3)
## charles xavier prof x telepathy
## bobby drake iceman thermokinesis
## kurt wagner nightcrawler teleportation
## max eisenhardt magneto magnetokinesis
## kitty pride shadowcat intangibility
Using * and zip to ‘unzip’ | Python
mutants = ['charles xavier',
           'bobby drake',
           'kurt wagner',
           'max eisenhardt',
           'kitty pride']
powers = ['telepathy',
          'thermokinesis',
          'teleportation',
          'magnetokinesis',
          'intangibility']
# Create a zip object from mutants and powers: z1
z1 = zip(mutants,powers)
# Print the tuples in z1 by unpacking with *
print(*z1)
# Re-create a zip object from mutants and powers: z1
z1 = zip(mutants,powers)
# 'Unzip' the tuples in z1 by unpacking with * and zip(): result1, result2
result1, result2 = zip(*z1)
# Check if unpacked tuples are equivalent to original tuples
print(result1 == mutants)
print(result2 == powers)
zip(*____) is exactly the 'unzip' operation. Note that both comparisons print False: zip(*z1) yields tuples, while mutants and powers are lists.
Using iterators to load large files into memory | Python
The chunksize parameter of read_csv() handles loading large data files.
Combined with a for loop, you can then compute totals over the whole file, which is quite neat.
import pandas as pd

result = []
for chunk in pd.read_csv('data.csv', chunksize=1000):
    result.append(sum(chunk['x']))
total = sum(result)
print(total)
Given what total = sum(result) and print(total) do, we can streamline this by keeping a running total instead of a list:
total = 0
for chunk in pd.read_csv('data.csv', chunksize=1000):
    total += sum(chunk['x'])
print(total)
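The same running-total pattern can be sketched without pandas; here is a pure-Python analogue (the chunked helper and the numbers are illustrative, not part of the course):

```python
def chunked(iterable, size):
    """Yield lists of at most `size` items from `iterable`."""
    chunk = []
    for item in iterable:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:  # last, possibly short, chunk
        yield chunk

total = 0
for chunk in chunked(range(10), 3):  # [0,1,2], [3,4,5], [6,7,8], [9]
    total += sum(chunk)
print(total)  # 45
```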
Processing large amounts of Twitter data | Python
# Initialize an empty dictionary: counts_dict
counts_dict = {}
# Iterate over the file chunk by chunk
for chunk in pd.read_csv('tweets.csv', chunksize=10):
    # Iterate over the column in DataFrame
    for entry in chunk['lang']:
        if entry in counts_dict.keys():
            counts_dict[entry] += 1
        else:
            counts_dict[entry] = 1
# Print the populated dictionary
print(counts_dict)
<script.py> output:
{'et': 1, 'und': 2, 'en': 97}
All this is tedious; let's wrap it up as a function!
Extracting information for large amounts of Twitter data | Python
# Define count_entries()
def count_entries(csv_file, c_size, colname):
    """Return a dictionary with counts of
    occurrences as value for each key."""
    # Initialize an empty dictionary: counts_dict
    counts_dict = {}
    # Iterate over the file chunk by chunk
    for chunk in pd.read_csv(csv_file, chunksize=c_size):
        # Iterate over the column in DataFrame
        for entry in chunk[colname]:
            if entry in counts_dict.keys():
                counts_dict[entry] += 1
            else:
                counts_dict[entry] = 1
    # Return counts_dict
    return counts_dict

# Call count_entries(): result_counts
result_counts = count_entries('tweets.csv', 10, 'lang')
# Print result_counts
print(result_counts)
Congratulations!! | Python
Still, writing for loops really does feel tedious.
List comprehensions | Python
Looking forward to the one-line tricks: that is exactly what a list comprehension is.
new_nums = [num + 1 for num in nums]
is equivalent to:
new_nums = []
for num in nums:
    new_nums.append(num + 1)
It simply rearranges the same pieces into a single line.
List comprehensions: Collapse for loops for building lists into a single line.
Components:
- Iterable
- Iterator variable (represents members of the iterable): finally an explanation of this term!
- Output expression
Comprehensions can also be nested (nested loops), which is great.
pairs_1 = []
for num1 in range(0, 2):
    # 2 is not inclusive.
    for num2 in range(6, 8):
        pairs_1.append((num1, num2))
print(pairs_1)
## [(0, 6), (0, 7), (1, 6), (1, 7)]
pairs_2 = [(num1, num2) for num1 in range(0, 2) for num2 in range(6, 8)]
print(pairs_2)
## [(0, 6), (0, 7), (1, 6), (1, 7)]
Write a basic list comprehension | Python
Extracting the first letter of each name.
doctor = ['house', 'cuddy', 'chase', 'thirteen', 'wilson']
print([doc[0] for doc in doctor])
## ['h', 'c', 'c', 't', 'w']
Nested list comprehensions | Python
[[output expression] for iterator variable in iterable]
matrix = [[0, 1, 2, 3, 4],
          [0, 1, 2, 3, 4],
          [0, 1, 2, 3, 4],
          [0, 1, 2, 3, 4],
          [0, 1, 2, 3, 4]]
# Create a 5 x 5 matrix using a list of lists: matrix
matrix = [[col for col in range(5)] for row in range(5)]
# Print the matrix
for row in matrix:
    print(row)
## [0, 1, 2, 3, 4]
## [0, 1, 2, 3, 4]
## [0, 1, 2, 3, 4]
## [0, 1, 2, 3, 4]
## [0, 1, 2, 3, 4]
Chew on this one for a while; it is a neat trick.
Advanced comprehensions | Python
print([num ** 2 for num in range(10) if num % 2 == 0])
## [0, 4, 16, 36, 64]
print([num ** 2 if num % 2 == 0 else 0 for num in range(10)])
## [0, 0, 4, 0, 16, 0, 36, 0, 64, 0]
At this point it starts to feel a bit like R's pipe %>%.
pos_neg = {num: -num for num in range(9)}
print(pos_neg)
## {0: 0, 1: -1, 2: -2, 3: -3, 4: -4, 5: -5, 6: -6, 7: -7, 8: -8}
print(type(pos_neg))
## <class 'dict'>
Using conditionals in comprehensions (1) | Python
Only return names whose length is \(\geq 7\).
# Create a list of strings: fellowship
fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli']
# Create list comprehension: new_fellowship
new_fellowship = [member for member in fellowship if len(member) >= 7]
# Print the new list
print(new_fellowship)
## ['samwise', 'aragorn', 'legolas', 'boromir']
Using conditionals in comprehensions (2) | Python
# Create a list of strings: fellowship
fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli']
# Create list comprehension: new_fellowship
new_fellowship = [member if len(member) >= 7 else '' for member in fellowship]
# Print the new list
print(new_fellowship)
## ['', 'samwise', '', 'aragorn', 'legolas', 'boromir', '']
Dict comprehensions | Python
Recall that the main difference between a list comprehension and a dict comprehension is the use of curly braces {} instead of []. Additionally, members of the dictionary are created using a colon :, as in key: value.
# Create a list of strings: fellowship
fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli']
# Create dict comprehension: new_fellowship
new_fellowship = {member:len(member) for member in fellowship}
# Print the new list
print(new_fellowship)
## {'frodo': 5, 'samwise': 7, 'merry': 5, 'aragorn': 7, 'legolas': 7, 'boromir': 7, 'gimli': 5}
Introduction to generator expressions | Python
A generator does not hold the whole result in memory, which makes it better; otherwise there is no difference from a list comprehension.
A generator function produces a sequence of values rather than a single value; in every other respect it works like an ordinary function.
The yield keyword is what makes this trick possible.
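A minimal generator function to make the yield idea concrete (countdown is an illustrative name, not from the course):

```python
def countdown(n):
    """Yield n, n-1, ..., 1 one value at a time."""
    while n > 0:
        yield n
        n -= 1

gen = countdown(3)
print(next(gen))  # 3
print(list(gen))  # [2, 1] -- the remaining values
```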
List comprehensions vs generators | Python
You’ve seen from the videos that list comprehensions and generator expressions look very similar in their syntax, except for the use of parentheses () in generator expressions and brackets [] in list comprehensions.
# List of strings
fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli']
# List comprehension
fellow1 = [member for member in fellowship if len(member) >= 7]
# Generator expression
fellow2 = (member for member in fellowship if len(member) >= 7)
print(type(fellow1))
## <class 'list'>
print(type(fellow2))
## <class 'generator'>
Write your own generator expressions | Python
When running a for loop over a dictionary, follow in with .items(); for a sequence of numbers, use range().
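A small sketch of looping a generator expression over a dictionary via .items() (the dictionary here is illustrative):

```python
lengths = {'frodo': 5, 'samwise': 7}
# .items() yields (key, value) pairs, so both names are usable in the expression
pairs = (f"{name}: {n}" for name, n in lengths.items())
for line in pairs:
    print(line)
# frodo: 5
# samwise: 7
```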
Changing the output in generator expressions | Python
# Create a list of strings: lannister
lannister = ['cersei', 'jaime', 'tywin', 'tyrion', 'joffrey']
# Create a generator object: lengths
lengths = (len(person) for person in lannister)
# Iterate over and print the values in lengths
for value in lengths:
    print(value)
## 6
## 5
## 5
## 6
## 7
Build a generator | Python
Everything so far was a generator expression, not a generator function; an expression cannot adjust its output to an input on demand, so it does not feel like a function.
# Create a list of strings
lannister = ['cersei', 'jaime', 'tywin', 'tyrion', 'joffrey']
# Define generator function get_lengths
def get_lengths(input_list):
    """Generator function that yields the
    length of the strings in input_list."""
    # Yield the length of a string
    for person in input_list:
        yield len(person)

# Print the values generated by get_lengths()
for value in get_lengths(lannister):
    print(value)
## 6
## 5
## 5
## 6
## 7
The key here is the yield keyword.
Keep at it; there is a lot of material, just bulldoze straight through it!
List comprehensions for time-stamped data | Python
You can think of DataFrame columns as single-dimension arrays called Series.
Ha ha.
# Extract the created_at column from df: tweet_time
tweet_time = df['created_at']
# Extract the clock time: tweet_clock_time
tweet_clock_time = [entry[11:19] for entry in tweet_time]
# Print the extracted times
print(tweet_clock_time)
Conditional list comprehesions for time-stamped data | Python
# Extract the created_at column from df: tweet_time
tweet_time = df['created_at']
# Extract the clock time: tweet_clock_time
tweet_clock_time = [entry[11:19] for entry in tweet_time if entry[17:19] == '19']
# Print the extracted times
print(tweet_clock_time)
These two exercises are hard to fully grasp, because the output cannot be reproduced here and so the result cannot be verified.
Dictionaries for data science | Python
# Zip lists: zipped_lists
zipped_lists = zip(feature_names, row_vals)
# Create a dictionary: rs_dict
rs_dict = dict(zipped_lists)
# Print the dictionary
print(rs_dict)
Here zip merges the data into tuples.
Of feature_names and row_vals, the first holds the variable names and the second one row of data.
Remember dict: it converts the zip object into dictionary form.
So this can be made into a function.
Writing a function to help you | Python
# Define lists2dict()
def lists2dict(list1, list2):
    """Return a dictionary where list1 provides
    the keys and list2 provides the values."""
    # Zip lists: zipped_lists
    zipped_lists = zip(list1, list2)
    # Create a dictionary: rs_dict
    rs_dict = dict(zipped_lists)
    # Return the dictionary
    return rs_dict
# Call lists2dict: rs_fxn
rs_fxn = lists2dict(feature_names, row_vals)
# Print rs_fxn
print(rs_fxn)
Using a list comprehension | Python
# Print the first two lists in row_lists
print(row_lists[0])
print(row_lists[1])
# Turn list of lists into list of dicts: list_of_dicts
list_of_dicts = [lists2dict(feature_names, sublist) for sublist in row_lists]
# Print the first two dictionaries in list_of_dicts
print(list_of_dicts[0])
print(list_of_dicts[1])
The confusing part is that feature_names is fixed; it is not produced by the for loop over row_lists.
Also, list_of_dicts comes out as a list here; using () instead would make it a generator.
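The list-vs-generator difference in miniature (values illustrative): the list is built eagerly, while the generator hands out one value per next() call:

```python
fellowship = ['frodo', 'samwise', 'aragorn']
as_list = [len(m) for m in fellowship]  # all values computed up front
as_gen = (len(m) for m in fellowship)   # computed lazily on demand

print(as_list)       # [5, 7, 7]
print(next(as_gen))  # 5
print(list(as_gen))  # [7, 7] -- only what was left
```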
Turning this all into a DataFrame | Python
This step simply converts the list of dicts into a pd.DataFrame().
# Import the pandas package
import pandas as pd
# Turn list of lists into list of dicts: list_of_dicts
list_of_dicts = [lists2dict(feature_names, sublist) for sublist in row_lists]
# Turn list of dicts into a DataFrame: df
df = pd.DataFrame(list_of_dicts)
# Print the head of the DataFrame
print(df.head())
Using Python generators for streaming data | Python
When facing streaming data, a generator shows its worth: it does not hold the whole stream in memory and can chew through it record by record.
Processing data in chunks (1) | Python
The command with open('datacamp.csv') as datacamp binds the csv file 'datacamp.csv' as datacamp in the context manager. Here, the with statement is the context manager, and its purpose is to ensure that resources are efficiently allocated when opening a connection to a file.
So with, like for, opens an indented block, but its job is managing the resource: it closes the file automatically when the block ends.
# Open a connection to the file
with open('world_dev_ind.csv') as file:
    # Skip the column names
    file.readline()
    # Initialize an empty dictionary: counts_dict
    counts_dict = {}
    # Process only the first 1000 rows
    for j in range(1000):
        # Split the current line into a list: line
        line = file.readline().split(',')
        # Get the value for the first column: first_col
        first_col = line[0]
        # If the column value is in the dict, increment its value
        if first_col in counts_dict.keys():
            counts_dict[first_col] += 1
        # Else, add to the dict and set value to 1
        else:
            counts_dict[first_col] = 1

# Print the resulting dictionary
print(counts_dict)
Study this code carefully; it is the watershed for everything that follows.
Note that file.readline() reads one line at a time, not the whole file; the first call here just skips the header row.
In short: process a file line by line for a given number of lines.
Writing a generator to load data in chunks (2) | Python
Not entirely clear yet. I feel I have not worked hard enough; I will push on and master all of it.
# Define read_large_file()
def read_large_file(file_object):
    """A generator function to read a large file lazily."""
    # Loop indefinitely until the end of the file
    while True:
        # Read a line from the file: data
        data = file_object.readline()
        # Break if this is the end of the file
        if not data:
            break
        # Yield the line of data
        yield data

# Open a connection to the file
with open('world_dev_ind.csv') as file:
    # Create a generator object for the file: gen_file
    gen_file = read_large_file(file)
    # Print the first three lines of the file
    print(next(gen_file))
    print(next(gen_file))
    print(next(gen_file))
Again: it produces a sequence of values rather than a single value, and yield is what performs the trick.
And the result here is not a list but a generator.
Writing a generator to load data in chunks (3) | Python
# Initialize an empty dictionary: counts_dict
counts_dict = {}
# Open a connection to the file
with open('world_dev_ind.csv') as file:
    # Iterate over the generator from read_large_file()
    for line in read_large_file(file):
        row = line.split(',')
        first_col = row[0]
        if first_col in counts_dict.keys():
            counts_dict[first_col] += 1
        else:
            counts_dict[first_col] = 1
# Print
print(counts_dict)
open('world_dev_ind.csv') opens the file through the open function; note the argument is the path 'world_dev_ind.csv'.
Here read_large_file is the function that produces the generator.
Using pandas` read_csv iterator for streaming data | Python
Now for the pd.read_csv() plus chunksize tricks.
Writing an iterator to load data in chunks (1) | Python
Adding chunksize=100 to pd.read_csv(filename, chunksize=100) makes the result an iterator, so next() can be used on it.
# Import the pandas package
import pandas as pd
# Initialize reader object: df_reader
df_reader = pd.read_csv('ind_pop.csv', chunksize = 10)
# Print two chunks
print(next(df_reader))
print(next(df_reader))
This code illustrates that statement nicely.
Writing an iterator to load data in chunks (2) | Python
# Initialize reader object: urb_pop_reader
urb_pop_reader = pd.read_csv('ind_pop_data.csv', chunksize = 1000)
# Get the first DataFrame chunk: df_urb_pop
df_urb_pop = next(urb_pop_reader)
# Check out the head of the DataFrame
print(df_urb_pop.head())
# Check out specific country: df_pop_ceb
df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == 'CEB']
# Zip DataFrame columns of interest: pops
pops = zip(df_pop_ceb['Total Population'], df_pop_ceb['Urban population (% of total)'])
# Turn zip object into list: pops_list
pops_list = list(pops)
# Print pops_list
print(pops_list)
In pd.read_csv('ind_pop_data.csv', chunksize = 1000), pd.read_csv loads the data and chunksize sets how many rows go into each chunk,
and the result becomes an iterator, which next demonstrates.
head gives a quick look at the data.
df_urb_pop['CountryCode'] == 'CEB' is the equivalent of R's filter.
zip merges the columns and makes tuples.
Finally, list converts the result to a list,
and print shows it.
Writing an iterator to load data in chunks (3) | Python
# Initialize reader object: urb_pop_reader
urb_pop_reader = pd.read_csv('ind_pop_data.csv', chunksize = 1000)
# Get the first DataFrame chunk: df_urb_pop
df_urb_pop = next(urb_pop_reader)
# Check out specific country: df_pop_ceb
df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == 'CEB']
# Zip DataFrame columns of interest: pops
pops = zip(df_pop_ceb['Total Population'],
           df_pop_ceb['Urban population (% of total)'])
# Turn zip object into list: pops_list
pops_list = list(pops)
# Use list comprehension to create new DataFrame column 'Total Urban Population'
df_pop_ceb['Total Urban Population'] = [int(tup[0]*tup[1]*0.01) for tup in pops_list]
# Plot urban population data
df_pop_ceb.plot(kind='scatter', x='Year', y='Total Urban Population')
plt.show()
This is really just R's mutate done the long way.
The last bit, [int(tup[0]*tup[1]*0.01) for tup in pops_list], is a little tricky:
each tup has the form (x, y), so tup[0] and tup[1] are, naturally, x and y.
Keep going!
Writing an iterator to load data in chunks (4) | Python
# Initialize reader object: urb_pop_reader
urb_pop_reader = pd.read_csv('ind_pop_data.csv', chunksize=1000)
# Initialize empty DataFrame: data
data = pd.DataFrame()
# Iterate over each DataFrame chunk
for df_urb_pop in urb_pop_reader:
    # Check out specific country: df_pop_ceb
    df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == 'CEB']
    # Zip DataFrame columns of interest: pops
    pops = zip(df_pop_ceb['Total Population'],
               df_pop_ceb['Urban population (% of total)'])
    # Turn zip object into list: pops_list
    pops_list = list(pops)
    # Use list comprehension to create new DataFrame column 'Total Urban Population'
    df_pop_ceb['Total Urban Population'] = [int(tup[0] * tup[1]) for tup in pops_list]
    # Append DataFrame chunk to data: data
    data = data.append(df_pop_ceb)
# Plot urban population data
data.plot(kind='scatter', x='Year', y='Total Urban Population')
plt.show()
Finally I understand this code:
first data = pd.DataFrame() creates an empty table,
then the for loop fills it up.
(Note: DataFrame.append was removed in pandas 2.x; nowadays you would collect the chunks in a list and pd.concat them.)
Writing an iterator to load data in chunks (5) | Python
一共是三个步骤,开始要搞一个函数了。
- Loading of the file chunk by chunk,
- Creating the new column of urban population values, and
- Plotting the urban population data.
# Define plot_pop()
def plot_pop(filename, country_code):
    # Initialize reader object: urb_pop_reader
    urb_pop_reader = pd.read_csv(filename, chunksize=1000)
    # Initialize empty DataFrame: data
    data = pd.DataFrame()
    # Iterate over each DataFrame chunk
    for df_urb_pop in urb_pop_reader:
        # Check out specific country: df_pop_ceb
        df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == country_code]
        # Zip DataFrame columns of interest: pops
        pops = zip(df_pop_ceb['Total Population'],
                   df_pop_ceb['Urban population (% of total)'])
        # Turn zip object into list: pops_list
        pops_list = list(pops)
        # Use list comprehension to create new DataFrame column 'Total Urban Population'
        df_pop_ceb['Total Urban Population'] = [int(tup[0] * tup[1]) for tup in pops_list]
        # Append DataFrame chunk to data: data
        data = data.append(df_pop_ceb)
    # Plot urban population data
    data.plot(kind='scatter', x='Year', y='Total Urban Population')
    plt.show()
# Set the filename: fn
fn = 'ind_pop_data.csv'
# Call plot_pop for country code 'CEB'
plot_pop(fn, 'CEB')
# Call plot_pop for country code 'ARB'
plot_pop(fn, 'ARB')
One look at country_code and you can tell it stands in for codes like 'CEB'.
- iterate (AmE /ˈɪtəreɪt/) v. to say repeatedly; to repeat; to do again; (computing) to iterate