12 min read

Scalable Data Processing in R

Deploys under 50 MByte work best. Files over 10 MByte are likely to cause your deploy to hang.

Not good for deploying to my site, then — the data files here are all very large.

These days parallel computing routinely involves data sets of around 8 million rows, so this is something I have to learn.

Scalable Data Processing in R

R kept freezing on me.

  • 4 hours
  • 15 Videos
  • 49 Exercises

It also works with the Fannie Mae / Freddie Mac mortgage data — if I had learned this earlier, I might have been able to get an \(A^+\).

Taught by this guy: Michael Kane | DataCamp.

In this course, you’ll make use of the Federal Housing Finance Agency’s data, a publicly available data set chronicling all mortgages that were held or securitized by both Federal National Mortgage Association (Fannie Mae) and Federal Home Loan Mortgage Corporation (Freddie Mac) from 2009-2015.

What is Scalable Data Processing? | R

“R is not well-suited for working with data larger than 10-20% of a computer’s RAM.” - The R Installation and Administration Manual

  • Move a subset into RAM
  • Process the subset
  • Keep the result and discard the subset
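
A minimal sketch of this subset-process-discard pattern in base R, reusing the course's mortgage-sample.csv and summing its second column (record_number) chunk by chunk — a sketch only, not how the course itself does it:

# Stream the file in pieces: move a subset into RAM, process it,
# keep only the running result, and discard the subset.
con <- file("mortgage-sample.csv", "r")
header <- readLines(con, n = 1)                    # skip the header line
total <- 0
repeat {
  chunk <- readLines(con, n = 10000)               # move a subset into RAM
  if (length(chunk) == 0) break                    # stop at the end of the file
  values <- as.numeric(sapply(strsplit(chunk, ","), `[`, 2))
  total <- total + sum(values, na.rm = TRUE)       # keep the result
  rm(chunk, values)                                # discard the subset
}
close(con)
total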

Timings are measured with the microbenchmark package.

Why is your code slow? | R

In particular, if you have a data set that is about the size of RAM, you might be better off saving most of the data set on the disk. By loading only the parts of a data set you need, you free up resources so that each part can be processed more quickly.

This is a bit like COMPUTE STATS in SQL. RAM is random access memory; rather than keeping the whole data set there, keep it on disk and load only the parts you need — each part then gets processed much faster.

How does processing time vary by data size? | R

If you are processing all elements of two data sets, and one data set is bigger, then the bigger data set will take longer to process. However, it’s important to realize that how much longer it takes is not always directly proportional to how much bigger it is. That is, if you have two data sets and one is two times the size of the other, it is not guaranteed that the larger one will take twice as long to process. It could take 1.5 times longer or even four times longer. It depends on which operations are used to process the data set.

Indeed, processing time is not proportional to data size.

# Load the microbenchmark package
library(microbenchmark)

# Compare the timings for sorting different sizes of vector
mb <- microbenchmark(
  # Sort a random normal vector length 1e5
  "1e5" = sort(rnorm(1e5)),
  # Sort a random normal vector length 2.5e5
  "2.5e5" = sort(rnorm(2.5e5)),
  # Sort a random normal vector length 5e5
  "5e5" = sort(rnorm(5e5)),
  "7.5e5" = sort(rnorm(7.5e5)),
  "1e6" = sort(rnorm(1e6)),
  times = 10
)

# Plot the resulting benchmark object
plot(mb)

\(\Box\) I could not get microbenchmarkCore to install.

Great Job! Note that the resulting graph shows that the execution time is not the same every time. This is because while the computer was executing your R code, it was also doing other things. As a result, it is a good practice to run each operation being benchmarked multiple times, and to look at the median execution time when evaluating the execution time of R code.
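
For example (a quick check, assuming the mb object from the block above), the median column of the summary is the one to read:

# Per-expression timings; the median is robust to the occasional slow run
summary(mb)[, c("expr", "median")]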

bigmemory is used to store, manipulate, and process big matrices that may be larger than a computer's RAM.

Basic operations on a big.matrix:

  • Create
  • Retrieve
  • Subset
  • Summarize

You are better off moving data to RAM only when the data are needed for processing.

Out-of-Core computing

A big.matrix is worth using when the data are:

  • Larger than about 20% of the size of RAM
  • Dense (not sparse) matrices
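
A minimal sketch of the create / retrieve / subset / summarize operations on a tiny in-memory big.matrix (the real, file-backed version is created in the next exercise):

library(bigmemory)

# Create: a small in-memory big.matrix with named columns
x <- big.matrix(nrow = 5, ncol = 2, type = "integer",
                dimnames = list(NULL, c("a", "b")))
x[, "a"] <- 1:5                 # fill the first column
x[, "b"] <- 6:10                # fill the second column

# Retrieve / Subset: bracket indexing works like a regular matrix
x[1:3, ]

# Summarize: pull the (small) data back into RAM as a regular matrix
colMeans(x[, ])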

Reading a big.matrix object | R

In this exercise, you’ll create your first file-backed big.matrix object using the read.big.matrix() function. The function is meant to look similar to read.table() but, in addition, it needs to know what type of numeric values you want to read (“char”, “short”, “integer”, “double”), it needs the name of the file that will hold the matrix’s data (the backing file), and it needs the name of the file to hold information about the matrix (a descriptor file). The result will be a file on the disk holding the values read in along with a descriptor file which holds extra information (like the number of columns and rows) about the resulting big.matrix object.

The descriptor file stores the column names and indices — roughly like declaring a data set in Python and then rewriting df.columns. Also note that in type = "integer", the "integer" has to be given as a quoted string.

# Load the bigmemory package
library(bigmemory)

# Create the big.matrix object: x
x <- read.big.matrix("mortgage-sample.csv", header = TRUE, 
                     type = "integer" , 
                     backingfile = "mortgage-sample.bin", 
                     descriptorfile = "mortgage-sample.desc")
    
# Find the dimensions of x
dim(x)
[1] 70000    16
  • A current problem: the data can’t be read from Qiniu via an external link — only local files can be read, because the backing lives on local disk.

mortgage-sample.bin and mortgage-sample.desc are generated automatically.

Attaching a big.matrix object | R

Now that the big.matrix object is on the disk, we can use the information stored in the descriptor file to instantly make it available during an R session. This means that you don’t have to reimport the data set, which takes more time for larger files. You can simply point the bigmemory package at the existing structures on the disk and begin accessing data without the wait.

The metadata has effectively already been computed and stored (like COMPUTE STATS in SQL), so attaching is instant.

# Attach mortgage-sample.desc
mort <- attach.big.matrix("mortgage-sample.desc")

# Find the dimensions of mort
dim(mort)

# Look at the first 6 rows of mort
head(mort)
[1] 70000    16
     enterprise record_number msa perc_minority tract_income_ratio
[1,]          1           566   1             1                  3
[2,]          1           116   1             3                  2
[3,]          1           239   1             2                  2
[4,]          1            62   1             2                  3
[5,]          1           106   1             2                  3
[6,]          1           759   1             3                  3
     borrower_income_ratio loan_purpose federal_guarantee borrower_race
[1,]                     1            2                 4             3
[2,]                     1            2                 4             5
[3,]                     3            8                 4             5
[4,]                     3            2                 4             5
[5,]                     3            2                 4             9
[6,]                     2            2                 4             9
     co_borrower_race borrower_gender co_borrower_gender num_units
[1,]                9               2                  4         1
[2,]                9               1                  4         1
[3,]                5               1                  2         1
[4,]                9               2                  4         1
[5,]                9               3                  4         1
[6,]                9               1                  2         2
     affordability year type
[1,]             3 2010    1
[2,]             3 2008    1
[3,]             4 2014    0
[4,]             4 2009    1
[5,]             4 2013    1
[6,]             4 2010    1

Creating tables with big.matrix objects | R

One thing to remember is that $ is not valid for getting a column of either a matrix or a big.matrix.
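
In other words (a small illustration, reusing the attached mort object):

# mort$year would fail: $ is not defined for matrix or big.matrix objects
mort[, "year"][1:5]    # use bracket indexing with the column name instead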

# Create mort
mort <- attach.big.matrix("mortgage-sample.desc")

# Look at the first 3 rows
mort[1:3, ]

# Create a table of the number of mortgages for each year in the data set
table(mort[, "year"])
     enterprise record_number msa perc_minority tract_income_ratio
[1,]          1           566   1             1                  3
[2,]          1           116   1             3                  2
[3,]          1           239   1             2                  2
     borrower_income_ratio loan_purpose federal_guarantee borrower_race
[1,]                     1            2                 4             3
[2,]                     1            2                 4             5
[3,]                     3            8                 4             5
     co_borrower_race borrower_gender co_borrower_gender num_units
[1,]                9               2                  4         1
[2,]                9               1                  4         1
[3,]                5               1                  2         1
     affordability year type
[1,]             3 2010    1
[2,]             3 2008    1
[3,]             4 2014    0

 2008  2009  2010  2011  2012  2013  2014  2015 
 8468 11101  8836  7996 10935 10216  5714  6734 

Data summary using bigsummary | R

The key idea: store R objects on disk rather than in RAM, and bring them into RAM only when they are needed.

The biganalytics package is used for the analysis.

# Load the biganalytics package
library(biganalytics)

# Get the column means of mort
colmean(mort)

# Use biganalytics' summary function to get a summary of the data
summary(mort)
           enterprise         record_number                   msa 
            1.3814571           499.9080571             0.8943571 
        perc_minority    tract_income_ratio borrower_income_ratio 
            1.9701857             2.3431571             2.6898857 
         loan_purpose     federal_guarantee         borrower_race 
            3.7670143             3.9840857             5.3572429 
     co_borrower_race       borrower_gender    co_borrower_gender 
            7.0002714             1.4590714             3.0494857 
            num_units         affordability                  year 
            1.0398143             4.2863429          2011.2714714 
                 type 
            0.5300429 
                               min          max         mean          NAs
enterprise               1.0000000    2.0000000    1.3814571    0.0000000
record_number            0.0000000  999.0000000  499.9080571    0.0000000
msa                      0.0000000    1.0000000    0.8943571    0.0000000
perc_minority            1.0000000    9.0000000    1.9701857    0.0000000
tract_income_ratio       1.0000000    9.0000000    2.3431571    0.0000000
borrower_income_ratio    1.0000000    9.0000000    2.6898857    0.0000000
loan_purpose             1.0000000    9.0000000    3.7670143    0.0000000
federal_guarantee        1.0000000    4.0000000    3.9840857    0.0000000
borrower_race            1.0000000    9.0000000    5.3572429    0.0000000
co_borrower_race         1.0000000    9.0000000    7.0002714    0.0000000
borrower_gender          1.0000000    9.0000000    1.4590714    0.0000000
co_borrower_gender       1.0000000    9.0000000    3.0494857    0.0000000
num_units                1.0000000    4.0000000    1.0398143    0.0000000
affordability            0.0000000    9.0000000    4.2863429    0.0000000
year                  2008.0000000 2015.0000000 2011.2714714    0.0000000
type                     0.0000000    1.0000000    0.5300429    0.0000000

References vs. Copies | R

The deepcopy() function.

Copying matrices and big matrices | R

If you want to copy a big.matrix object, then you need to use the deepcopy() function.

cols = 1:3 selects only columns 1 through 3.

# Use deepcopy() to create first_three
first_three <- deepcopy(mort, cols = 1:3, 
                        backingfile = "first_three.bin", 
                        descriptorfile = "first_three.desc")

# Set first_three_2 equal to first_three
first_three_2 <- first_three

# Set the value in the first row and first column of first_three to NA
first_three[1, 1] <- NA

# Verify the change shows up in first_three_2
first_three_2[1, 1]

# but not in mort
mort[1, 1]
[1] NA
[1] 1

backingfile and descriptorfile: the logic here is to create a reference on disk, and bring the data in only when it is needed.
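
A small sketch of that reference idea, reusing the first_three.desc descriptor file created above:

# In a later R session, re-attach the copy straight from its descriptor file
first_three_again <- attach.big.matrix("first_three.desc")
dim(first_three_again)    # 70000 rows, 3 columns -- no re-import needed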

The Bigmemory Suite of Packages | R

Download the videos.

To summarize: the data can only be accessed after calling attach.big.matrix(); here mort is essentially a mirror of (a reference to) the matrix on disk.

(The videos would not play for me.)

This next part just extracts a variable, and functions such as names() can be used on the result.

# Load the bigtabulate package
library(bigtabulate)

# Call bigtable to create a variable called race_table
race_table <- bigtable(mort, "borrower_race")

# Rename the elements of race_table
names(race_table) <- race_cat
race_table

Here we clearly pull out one variable and then rename its categories.

ccols: a vector of column indices or names specifying which columns should be used for conditioning (e.g. for building a contingency table or structure for tabulation).

> race_cat
[1] "Native Am"   "Asian"       "Black"       "Pacific Is"  "White"      
[6] "Two or More" "Hispanic"    "Not Avail"

bigtable(mort, "borrower_race") tabulates a single variable; the result works like base R's table().

# Create a table of the borrower race by year
race_year_table <- bigtable(mort, c("borrower_race", "year"))

# Convert race_year_table to a data frame called rydf
rydf <- as.data.frame(race_year_table)

# Create the new column Race
rydf$Race <- race_cat

# Let's see what it looks like
rydf

A contingency table can also be built with two conditioning variables.

> # Let's see what it looks like
> rydf
  2008 2009 2010 2011 2012 2013 2014 2015        Race
1   11   18   13   16   15   12   29   29   Native Am
2  384  583  603  568  770  673  369  488       Asian
3  363  320  209  204  258  312  185  169       Black
4   33   38   21   13   28   22   17   23  Pacific Is
5 5552 7739 6301 5746 8192 7535 4110 4831       White
6   43   85   65   58   89   78   46   64 Two or More
7  577  563  384  378  574  613  439  512    Hispanic
9 1505 1755 1240 1013 1009  971  519  618   Not Avail

But the ordering has to be prepared in advance — race_cat must match the row order of the table.

female_residence_prop <- function(x, rows) {
    x_subset <- x[rows, ]
    # Find the proportion of female borrowers in urban areas
    prop_female_urban <- sum(x_subset[, "borrower_gender"] == 2 & 
                                 x_subset[, "msa"] == 1) / 
        sum(x_subset[, "msa"] == 1)
    # Find the proportion of female borrowers in rural areas
    prop_female_rural <- sum(x_subset[, "borrower_gender"] == 2 & 
                                 x_subset[, "msa"] == 0) / 
        sum(x_subset[, "msa"] == 0)
    
    c(prop_female_urban, prop_female_rural)
}

# Find the proportion of female borrowers in 2015
female_residence_prop(mort,mort[, "year"] == 2015)

The function is straightforward — nothing special to note.

> female_residence_prop(mort,mort[, "year"] == 2015)
[1] 0.2737439 0.2304965

Split | R

  • split(): To “split” the mort data by year

  • Map(): To “apply” the function female_residence_prop() to each of the subsets returned from split()

  • Reduce(): To combine the results obtained from Map()

Now I finally understand how the people who write these packages define the word "reduce".
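
As a small reminder of how base R's Reduce() folds a list into a single result (the same pattern Reduce(rbind, ...) uses further down):

Reduce(`+`, list(1, 2, 3))               # 6: fold the list with +
Reduce(rbind, list(c(1, 2), c(3, 4)))    # stack two vectors into a 2 x 2 matrix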

# Split the row numbers of the mortgage data by year
spl <- split(1:nrow(mort), mort[, "year"])

# Call str on spl
str(spl)
> str(spl)
List of 8
 $ 2008: int [1:8468] 2 8 15 17 18 28 35 40 42 47 ...
 $ 2009: int [1:11101] 4 13 25 31 43 49 52 56 67 68 ...
 $ 2010: int [1:8836] 1 6 7 10 21 23 24 27 29 38 ...
 $ 2011: int [1:7996] 11 20 37 46 53 57 73 83 86 87 ...
 $ 2012: int [1:10935] 14 16 26 30 32 33 48 69 81 94 ...
 $ 2013: int [1:10216] 5 9 19 22 36 44 55 58 72 74 ...
 $ 2014: int [1:5714] 3 12 50 60 64 66 103 114 122 130 ...
 $ 2015: int [1:6734] 34 41 54 61 62 65 82 91 102 135 ...

This just splits the row indices of mort into groups by year.

# For each of the row splits, find the female residence proportion
all_years <- Map(function(rows) female_residence_prop(mort, rows), spl)

# Call str on all_years
str(all_years)

Map(function(rows) female_residence_prop(mort, rows), spl) defines an anonymous function: each group of rows \(x\) (like a SQL GROUP BY group) is passed to \(f\) = female_residence_prop.

> str(all_years)
List of 8
 $ 2008: num [1:2] 0.275 0.204
 $ 2009: num [1:2] 0.244 0.2
 $ 2010: num [1:2] 0.241 0.201
 $ 2011: num [1:2] 0.252 0.241
 $ 2012: num [1:2] 0.244 0.21
 $ 2013: num [1:2] 0.275 0.257
 $ 2014: num [1:2] 0.289 0.268
 $ 2015: num [1:2] 0.274 0.23

rbind: row bind.

Use the dimnames() function to add row and column names to this matrix.

# Collect the results as rows in a matrix
prop_female <- Reduce(rbind, all_years)

# Rename the row and column names
dimnames(prop_female) <- list(names(all_years), c("prop_female_urban", "prop_femal_rural"))

# View the matrix
prop_female
> prop_female
     prop_female_urban prop_femal_rural
2008         0.2748514        0.2039474
2009         0.2441074        0.2002978
2010         0.2413881        0.2014028
2011         0.2520644        0.2408931
2012         0.2438950        0.2101313
2013         0.2751059        0.2567164
2014         0.2886756        0.2678571
2015         0.2737439        0.2304965
> str(prop_female)
 num [1:8, 1:2] 0.275 0.244 0.241 0.252 0.244 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:8] "2008" "2009" "2010" "2011" ...
  ..$ : chr [1:2] "prop_female_urban" "prop_femal_rural"

That's the structure: it is a matrix.

# Load the tidyr and ggplot2 packages
library(tidyr)
library(ggplot2)

# Convert prop_female to a data frame
prop_female_df <- as.data.frame(prop_female)

# Add a new column Year
prop_female_df$Year <- row.names(prop_female_df)

# Call gather on prop_female_df
prop_female_long <- gather(prop_female_df, Region, Prop, -Year)

# Create a line plot
ggplot(prop_female_long, aes(x = Year, y = Prop, group = Region, color = Region)) + 
    geom_line()

The plotting here is done on the small aggregated data, though — so not that impressive.

> # Call summary on mort
> summary(mort)
                               min          max         mean          NAs
enterprise               1.0000000    2.0000000    1.3814571    0.0000000
record_number            0.0000000  999.0000000  499.9080571    0.0000000
msa                      0.0000000    1.0000000    0.8943571    0.0000000
perc_minority            1.0000000    9.0000000    1.9701857    0.0000000
tract_income_ratio       1.0000000    9.0000000    2.3431571    0.0000000
borrower_income_ratio    1.0000000    3.0000000    2.6244912  718.0000000
loan_purpose             1.0000000    9.0000000    3.7670143    0.0000000
federal_guarantee        1.0000000    4.0000000    3.9840857    0.0000000
borrower_race            1.0000000    9.0000000    5.3572429    0.0000000
co_borrower_race         1.0000000    9.0000000    7.0002714    0.0000000
borrower_gender          1.0000000    9.0000000    1.4590714    0.0000000
co_borrower_gender       1.0000000    9.0000000    3.0494857    0.0000000
num_units                1.0000000    4.0000000    1.0398143    0.0000000
affordability            0.0000000    9.0000000    4.2863429    0.0000000
year                  2008.0000000 2015.0000000 2011.2714714    0.0000000
type                     0.0000000    1.0000000    0.5300429    0.0000000

I had forgotten for a while: summary() here is similar to Python's .info().

# Load biganalytics and dplyr packages
library(biganalytics)
library(dplyr)

# Call summary on mort
summary(mort)

bir_df_wide <- bigtable(mort, c("borrower_income_ratio", "year")) %>% 
    # Turn it into a data.frame
    as.data.frame() %>% 
    # Create a new column called BIR with the corresponding table categories
    mutate(BIR = c(">=0,<=50%", ">50, <=80%", ">80%"))

bir_df_wide
> bir_df_wide
  2008 2009 2010 2011 2012 2013 2014 2015        BIR
1 1205 1473  600  620  745  725  401  380  >=0,<=50%
2 2095 2791 1554 1421 1819 1861 1032 1145 >50, <=80%
3 4844 6707 6609 5934 8338 7559 4255 5169       >80%

But again this isn't really processing the big data set itself — it's just a summary table. Feels like a waste of time.

# Load the tidyr and ggplot2 packages
library(tidyr)
library(ggplot2)

bir_df_wide %>% 
    # Transform the wide-formatted data.frame into the long format
    gather(Year, Count, -BIR) %>%
    # Use ggplot to create a line plot
    ggplot(aes(x = Year, y = Count, group = BIR, color = BIR)) + 
    geom_line()

Not particularly meaningful.

If your data isn’t numeric - if you have string variables - or if you need a greater range of numeric types - like 8-bit integers - then you might consider trying the ff package. It is similar to bigmemory but includes a structure similar to a data.frame.

Don't bother watching — this course is pretty poor.