This section is mainly about preprocessing the variables: an XGBoost model cannot take text, factor, categorical, or other non-numeric variables, but packaged batch solutions for this already exist. Below I first walk through an example table, then give the integrated solution.
Exploratory data analysis | Python
The goal here is preprocessing. For spotting missing values, df.isnull() is much easier to eyeball than df.describe().
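A minimal sketch on a toy frame (hypothetical data, not the course's table) showing why the per-column missing count is easy to scan:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, just to illustrate the missing-value check
toy = pd.DataFrame({"LotFrontage": [65.0, np.nan, 80.0],
                    "MSZoning": ["RL", "RL", "RM"]})

# isnull().sum() gives one missing count per column at a glance
print(toy.isnull().sum())
```

One line per column, one number per line; no hunting through describe() output.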
In [2]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 21 columns):
MSSubClass 1460 non-null int64
LotFrontage 1201 non-null float64
LotArea 1460 non-null int64
OverallQual 1460 non-null int64
OverallCond 1460 non-null int64
YearBuilt 1460 non-null int64
Remodeled 1460 non-null int64
GrLivArea 1460 non-null int64
BsmtFullBath 1460 non-null int64
BsmtHalfBath 1460 non-null int64
FullBath 1460 non-null int64
HalfBath 1460 non-null int64
BedroomAbvGr 1460 non-null int64
Fireplaces 1460 non-null int64
GarageArea 1460 non-null int64
MSZoning 1460 non-null object
PavedDrive 1460 non-null object
Neighborhood 1460 non-null object
BldgType 1460 non-null object
HouseStyle 1460 non-null object
SalePrice 1460 non-null int64
dtypes: float64(1), int64(15), object(5)
memory usage: 239.6+ KB
Every object column needs to be handled.
Encoding categorical columns I: LabelEncoder | Python
Chapter 4 of Supervised Learning with scikit-learn already has an integrated solution for filling missing values and converting variables to numeric. The steps are as follows.
# Import LabelEncoder
from sklearn.preprocessing import LabelEncoder
Import the packaged solution.
# Fill missing values with 0
df.LotFrontage = df.LotFrontage.fillna(0)
Handle the missing values; here they are simply overwritten with 0.
Writing df.LotFrontage is handier than df["LotFrontage"]; I should have switched to it long ago.
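Filling with 0 is blunt; whether it is sensible depends on the variable. A sketch on hypothetical data comparing fillna(0), as used in the course, with a median fill, a common alternative:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing frontage value
toy = pd.DataFrame({"LotFrontage": [65.0, np.nan, 80.0]})

filled_zero = toy.LotFrontage.fillna(0)                        # as in the course
filled_med = toy.LotFrontage.fillna(toy.LotFrontage.median())  # median alternative

print(filled_zero.tolist())  # [65.0, 0.0, 80.0]
print(filled_med.tolist())   # [65.0, 72.5, 80.0]
```

For tree models the difference is often minor, but 0 can pull a distribution's statistics far from reality.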
# Create a boolean mask for categorical columns
categorical_mask = (df.dtypes == object)
object covers text-like, non-numeric variables; this line selects them so that one line of code can later convert them all to numeric types. This step only sets up the condition. Very handy; I don't know whether R has an equivalent, but building variable attributes in R is a real hassle.
# Get list of categorical column names
categorical_columns = df.columns[categorical_mask].tolist()
This applies the mask we just built to pull out the column names.
# Create LabelEncoder object: le
le = LabelEncoder()
The key function is LabelEncoder(); its job is to turn text into numbers.
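A minimal sketch of what LabelEncoder does to a single column of labels (toy values borrowed from HouseStyle):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["2Story", "1Story", "2Story", "SLvl"])

print(le.classes_)  # sorted unique labels; each label's index is its code
print(codes)        # [1 0 1 2]
```

le.inverse_transform(codes) maps the integers back to the original labels, which is handy for checking what a code means.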
# Apply LabelEncoder to categorical columns
df[categorical_columns] = df[categorical_columns].apply(lambda x: le.fit_transform(x))
.apply does the batch conversion here. So both Python and R have code for batch-converting variable types, and the rule applied does not have to be a label encoder.
Finally, a before-and-after comparison.
In [7]: # Print the head of the categorical columns
print(df[categorical_columns].head())
MSZoning PavedDrive Neighborhood BldgType HouseStyle
0 RL Y CollgCr 1Fam 2Story
1 RL Y Veenker 1Fam 1Story
2 RL Y CollgCr 1Fam 2Story
3 RL Y Crawfor 1Fam 2Story
4 RL Y NoRidge 1Fam 2Story
In [10]: print(df[categorical_columns].head())
MSZoning PavedDrive Neighborhood BldgType HouseStyle
0 3 2 5 0 5
1 3 2 24 0 2
2 3 2 5 0 5
3 3 2 6 0 5
4 3 2 15 0 5
For example, HouseStyle is now fully converted. Which category gets assigned 0 or 1 could be controlled via the levels, but that hardly matters: only logistic regression worries about that; tree-based models do not.
The integrated solution:
# Import LabelEncoder
from sklearn.preprocessing import LabelEncoder
# Fill missing values with 0
df.LotFrontage = df.LotFrontage.fillna(0)
# Create a boolean mask for categorical columns
categorical_mask = (df.dtypes == object)
# Get list of categorical column names
categorical_columns = df.columns[categorical_mask].tolist()
# Print the head of the categorical columns
print(df[categorical_columns].head())
# Create LabelEncoder object: le
le = LabelEncoder()
# Apply LabelEncoder to categorical columns
df[categorical_columns] = df[categorical_columns].apply(lambda x: le.fit_transform(x))
# Print the head of the LabelEncoded categorical columns
print(df[categorical_columns].head())
Encoding categorical columns II: OneHotEncoder | Python
Using LabelEncoder, the CollgCr Neighborhood was encoded as 5, while the Veenker Neighborhood was encoded as 24, and Crawfor as 6. Is Veenker "greater" than Crawfor and CollgCr? No, and allowing the model to assume this natural ordering may result in poor performance.
But this way of numericizing a categorical variable is pretty crude; it smells of an implied ranking. sklearn.preprocessing.OneHotEncoder — scikit-learn 0.19.1 documentation is exactly the fix for this. The Chinese term for it, 一位热码编码, is incomprehensible; a truly bad translation of "one-hot encoding".
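As an aside, pandas can do one-hot encoding without scikit-learn; a sketch with pd.get_dummies on a hypothetical column (not the course's approach, just an alternative):

```python
import pandas as pd

toy = pd.DataFrame({"Neighborhood": ["CollgCr", "Veenker", "Crawfor"]})

# One indicator column per category, so no ordering is implied
onehot = pd.get_dummies(toy, columns=["Neighborhood"])
print(onehot.columns.tolist())
```

Each row now has a single 1 among the Neighborhood_* columns, and no category is "greater" than another.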
# Import OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
It also comes from the sklearn.preprocessing module.
# Create OneHotEncoder: ohe
ohe = OneHotEncoder(categorical_features=categorical_mask, sparse=False)
Create the encoder; the parameters deserve a quick explanation. sparse controls whether a sparse matrix is returned (the same kind NMF is forever working with); here it is set to False. A detail I don't think matters much.
| sparse : boolean, default=True
| Will return sparse matrix if set True else will return an array.
That is the explanation from help.
# Apply OneHotEncoder to categorical columns - output is no longer a dataframe: df_encoded
df_encoded = ohe.fit_transform(df)
This encodes only the variables selected by categorical_mask, so there is no need to worry about the encoding touching the other variables. Next, look at the encoded result.
In [5]: print(df_encoded[:5, :])
[[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 1.00000000e+00
0.00000000e+00 0.00000000e+00 0.00000000e+00 1.00000000e+00
0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 1.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 1.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 0.00000000e+00 1.00000000e+00
0.00000000e+00 0.00000000e+00 6.00000000e+01 6.50000000e+01
8.45000000e+03 7.00000000e+00 5.00000000e+00 2.00300000e+03
0.00000000e+00 1.71000000e+03 1.00000000e+00 0.00000000e+00
2.00000000e+00 1.00000000e+00 3.00000000e+00 0.00000000e+00
5.48000000e+02 2.08500000e+05]
...]
Seeing 0.00000000e+00 and 1.00000000e+00 everywhere tells you the dummy columns of 0s and 1s are in place. Tears of joy.
In [6]: print(df.shape)
(1460, 21)
In [7]: print(df_encoded.shape)
(1460, 62)
Obviously the encoded data has many more columns! The integrated solution:
# Import OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
# Create OneHotEncoder: ohe
ohe = OneHotEncoder(categorical_features=categorical_mask, sparse=False)
# Apply OneHotEncoder to categorical columns - output is no longer a dataframe: df_encoded
df_encoded = ohe.fit_transform(df)
# Print first 5 rows of the resulting dataset - again, this will no longer be a pandas dataframe
print(df_encoded[:5, :])
# Print the shape of the original DataFrame
print(df.shape)
# Print the shape of the transformed array
print(df_encoded.shape)
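One caveat: the categorical_features argument belongs to the scikit-learn 0.19-era API shown above and was later removed. A sketch with the current API, where the encoder is applied to an array of already-selected categorical columns (the toy values are assumptions):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy categorical columns (MSZoning-like and PavedDrive-like values)
X = np.array([["RL", "Y"], ["RM", "N"], ["RL", "P"]])

ohe = OneHotEncoder()                     # returns a sparse matrix by default
encoded = ohe.fit_transform(X).toarray()  # densify for inspection

print(encoded.shape)  # 2 levels + 3 levels -> (3, 5)
```

In newer scikit-learn you select the categorical columns yourself (or use ColumnTransformer) instead of passing a mask to the encoder.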
Encoding categorical columns III: DictVectorizer | Python
Once I had figured out LabelEncoder and OneHotEncoder, I wondered: since both are packaged solutions with no real mathematical depth, why keep them separate instead of merging them? And sure enough, the instructor immediately says they can be merged. Then why make me learn LabelEncoder and OneHotEncoder first? Isn't that just stringing people along? This guy really is not cut out for teaching. What a trap.
For an explanation of DictVectorizer, see sklearn.feature_extraction.DictVectorizer — scikit-learn 0.19.1 documentation.
# Import DictVectorizer
from sklearn.feature_extraction import DictVectorizer
Import DictVectorizer.
# Convert df into a dictionary: df_dict
df_dict = df.to_dict("records")
Convert the table to dictionary format. For what "records" means, see dictionary - How to convert rows in DataFrame in Python to dictionaries - Stack Overflow. An example makes it clear.
import pandas as pd
# your df
# =========================
print(df)
id score1 score2 score3 score4 score5
0 1 0.0000 0.1087 0.0000 0.0786 1
1 2 0.0532 0.3083 0.2864 0.4464 1
2 3 0.0000 0.0840 0.8090 0.2331 1
# to_dict
# =========================
df.to_dict(orient='records')
Out[318]:
[{'id': 1.0,
'score1': 0.0,
'score2': 0.10865899999999999,
'score3': 0.0,
'score4': 0.078597,
'score5': 1.0},
{'id': 2.0,
'score1': 0.053238000000000001,
'score2': 0.308253,
'score3': 0.28635300000000002,
'score4': 0.44643299999999997,
'score5': 1.0},
{'id': 3.0,
'score1': 0.0,
'score2': 0.083978999999999998,
'score3': 0.80898300000000001,
'score4': 0.23305200000000001,
'score5': 1.0}]
# Create the DictVectorizer object: dv
dv = DictVectorizer(sparse=False)
sparse needs no further explanation.
# Apply dv on df: df_encoded
df_encoded = dv.fit_transform(df_dict)
Then encode the dict-formatted table. The result:
In [5]: print(dv.vocabulary_)
{'HouseStyle=2Story': 18, 'Neighborhood=OldTown': 46, 'YearBuilt': 61, 'PavedDrive=P': 57, 'BldgType=TwnhsE': 5, 'GrLivArea': 11, 'GarageArea': 10, 'Neighborhood=MeadowV': 39, 'Neighborhood=NAmes': 41, 'MSZoning=RL': 27, 'FullBath': 9, 'BsmtFullBath': 6, 'OverallCond': 54, 'Neighborhood=Edwards': 36, 'Neighborhood=NridgHt': 45, 'MSZoning=RM': 28, 'HouseStyle=2.5Unf': 17, 'LotFrontage': 22, 'Neighborhood=Blmngtn': 29, 'Remodeled': 59, 'Neighborhood=Mitchel': 40, 'Neighborhood=NoRidge': 44, 'BldgType=2fmCon': 2, 'HouseStyle=SFoyer': 19, 'Neighborhood=SawyerW': 49, 'Neighborhood=SWISU': 47, 'MSSubClass': 23, 'OverallQual': 55, 'Neighborhood=Crawfor': 35, 'PavedDrive=Y': 58, 'BldgType=Duplex': 3, 'HalfBath': 12, 'Neighborhood=Somerst': 50, 'HouseStyle=1.5Unf': 14, 'MSZoning=C (all)': 24, 'Neighborhood=Sawyer': 48, 'Neighborhood=Blueste': 30, 'BldgType=Twnhs': 4, 'LotArea': 21, 'BedroomAbvGr': 0, 'Neighborhood=StoneBr': 51, 'SalePrice': 60, 'MSZoning=RH': 26, 'Neighborhood=ClearCr': 33, 'Neighborhood=NPkVill': 42, 'MSZoning=FV': 25, 'Fireplaces': 8, 'Neighborhood=Veenker': 53, 'Neighborhood=CollgCr': 34, 'BldgType=1Fam': 1, 'BsmtHalfBath': 7, 'HouseStyle=SLvl': 20, 'Neighborhood=IDOTRR': 38, 'Neighborhood=Gilbert': 37, 'Neighborhood=NWAmes': 43, 'HouseStyle=1.5Fin': 13, 'Neighborhood=Timber': 52, 'Neighborhood=BrDale': 31, 'Neighborhood=BrkSide': 32, 'HouseStyle=2.5Fin': 16, 'PavedDrive=N': 56, 'HouseStyle=1Story': 15}
This is the full encoding vocabulary; the details don't matter. The integrated solution:
# Import DictVectorizer
from sklearn.feature_extraction import DictVectorizer
# Convert df into a dictionary: df_dict
df_dict = df.to_dict("records")
# Create the DictVectorizer object: dv
dv = DictVectorizer(sparse=False)
# Apply dv on df: df_encoded
df_encoded = dv.fit_transform(df_dict)
# Print the resulting first five rows
print(df_encoded[:5,:])
# Print the vocabulary
print(dv.vocabulary_)
And that's everything. With this last approach you can write a bit less code. Keep going.
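As a closing check, a minimal round trip of the DictVectorizer workflow on two hypothetical records; numeric fields pass through untouched while string fields become indicator columns, which is exactly the LabelEncoder-plus-OneHotEncoder effect in one step:

```python
from sklearn.feature_extraction import DictVectorizer

# Two hypothetical records in to_dict("records") form
records = [{"Neighborhood": "CollgCr", "LotArea": 8450},
           {"Neighborhood": "Veenker", "LotArea": 9600}]

dv = DictVectorizer(sparse=False)
X = dv.fit_transform(records)

print(dv.feature_names_)  # LotArea passes through; Neighborhood becomes dummies
print(X)
```

The fitted feature_names_ list plays the role of the vocabulary printed above, mapping each output column back to a field or a field=value pair.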