This section is mainly about preprocessing the variables: an XGBoost model cannot take text, factor, categorical, or other non-numeric variables, but packaged batch solutions for this already exist. Below I first walk through an example table, then give the integrated solution.
Exploratory data analysis | Python
The goal here is preprocessing. For spotting missing values, df.isnull() is much easier to eyeball than df.describe().
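A minimal sketch on a toy frame (hypothetical data, not the course's table) showing why the per-column missing count is easy to scan:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, just to illustrate the missing-value check
toy = pd.DataFrame({"LotFrontage": [65.0, np.nan, 80.0],
                    "MSZoning": ["RL", "RL", "RM"]})

# isnull().sum() gives one missing count per column at a glance
print(toy.isnull().sum())
```

One line per column, one number per line; no hunting through describe() output.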
In [2]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 21 columns):
MSSubClass 1460 non-null int64
LotFrontage 1201 non-null float64
LotArea 1460 non-null int64
OverallQual 1460 non-null int64
OverallCond 1460 non-null int64
YearBuilt 1460 non-null int64
Remodeled 1460 non-null int64
GrLivArea 1460 non-null int64
BsmtFullBath 1460 non-null int64
BsmtHalfBath 1460 non-null int64
FullBath 1460 non-null int64
HalfBath 1460 non-null int64
BedroomAbvGr 1460 non-null int64
Fireplaces 1460 non-null int64
GarageArea 1460 non-null int64
MSZoning 1460 non-null object
PavedDrive 1460 non-null object
Neighborhood 1460 non-null object
BldgType 1460 non-null object
HouseStyle 1460 non-null object
SalePrice 1460 non-null int64
dtypes: float64(1), int64(15), object(5)
memory usage: 239.6+ KB
Every object column needs to be handled.
Encoding categorical columns I: LabelEncoder | Python
Chapter 4 of Supervised Learning with scikit-learn already has an integrated solution for filling missing values and converting variables to numeric. The steps are as follows.
# Import LabelEncoder
from sklearn.preprocessing import LabelEncoder
Import the packaged solution.
# Fill missing values with 0
df.LotFrontage = df.LotFrontage.fillna(0)
Handle the missing values; here they are simply overwritten with 0.
Writing df.LotFrontage is handier than df["LotFrontage"]; I should have switched to it long ago.
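Filling with 0 is blunt; whether it is sensible depends on the variable. A sketch on hypothetical data comparing fillna(0), as used in the course, with a median fill, a common alternative:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing frontage value
toy = pd.DataFrame({"LotFrontage": [65.0, np.nan, 80.0]})

filled_zero = toy.LotFrontage.fillna(0)                        # as in the course
filled_med = toy.LotFrontage.fillna(toy.LotFrontage.median())  # median alternative

print(filled_zero.tolist())  # [65.0, 0.0, 80.0]
print(filled_med.tolist())   # [65.0, 72.5, 80.0]
```

For tree models the difference is often minor, but 0 can pull a distribution's statistics far from reality.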
# Create a boolean mask for categorical columns
categorical_mask = (df.dtypes == object)
object covers text-like, non-numeric variables; this line selects them so that one line of code can later convert them all to numeric types. This step only sets up the condition. Very handy; I don't know whether R has an equivalent, but building variable attributes in R is a real hassle.
# Get list of categorical column names
categorical_columns = df.columns[categorical_mask].tolist()
This applies the mask we just built to pull out the column names.
# Create LabelEncoder object: le
le = LabelEncoder()
The key function is LabelEncoder(); its job is to turn text into numbers.
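A minimal sketch of what LabelEncoder does to a single column of labels (toy values borrowed from HouseStyle):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["2Story", "1Story", "2Story", "SLvl"])

print(le.classes_)  # sorted unique labels; each label's index is its code
print(codes)        # [1 0 1 2]
```

le.inverse_transform(codes) maps the integers back to the original labels, which is handy for checking what a code means.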
# Apply LabelEncoder to categorical columns
df[categorical_columns] = df[categorical_columns].apply(lambda x: le.fit_transform(x))
.apply does the batch conversion here. So both Python and R have code for batch-converting variable types, and the rule applied does not have to be a label encoder.
Finally, a before-and-after comparison.
In [7]: # Print the head of the categorical columns
print(df[categorical_columns].head())
MSZoning PavedDrive Neighborhood BldgType HouseStyle
0 RL Y CollgCr 1Fam 2Story
1 RL Y Veenker 1Fam 1Story
2 RL Y CollgCr 1Fam 2Story
3 RL Y Crawfor 1Fam 2Story
4 RL Y NoRidge 1Fam 2Story
In [10]: print(df[categorical_columns].head())
MSZoning PavedDrive Neighborhood BldgType HouseStyle
0 3 2 5 0 5
1 3 2 24 0 2
2 3 2 5 0 5
3 3 2 6 0 5
4 3 2 15 0 5
For example, HouseStyle is now fully converted. Which category gets assigned 0 or 1 could be controlled via the levels, but that hardly matters: only logistic regression worries about that; tree-based models do not.
The integrated solution:
# Import LabelEncoder
from sklearn.preprocessing import LabelEncoder
# Fill missing values with 0
df.LotFrontage = df.LotFrontage.fillna(0)
# Create a boolean mask for categorical columns
categorical_mask = (df.dtypes == object)
# Get list of categorical column names
categorical_columns = df.columns[categorical_mask].tolist()
# Print the head of the categorical columns
print(df[categorical_columns].head())
# Create LabelEncoder object: le
le = LabelEncoder()
# Apply LabelEncoder to categorical columns
df[categorical_columns] = df[categorical_columns].apply(lambda x: le.fit_transform(x))
# Print the head of the LabelEncoded categorical columns
print(df[categorical_columns].head())
Encoding categorical columns II: OneHotEncoder | Python
Using LabelEncoder, the CollgCr Neighborhood was encoded as 5, while the Veenker Neighborhood was encoded as 24, and Crawfor as 6. Is Veenker "greater" than Crawfor and CollgCr? No, and allowing the model to assume this natural ordering may result in poor performance.
But this way of numericizing a categorical variable is pretty crude; it smells of an implied ranking. sklearn.preprocessing.OneHotEncoder — scikit-learn 0.19.1 documentation is exactly the fix for this. The Chinese term for it, 一位热码编码, is incomprehensible; a truly bad translation of "one-hot encoding".
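As an aside, pandas can do one-hot encoding without scikit-learn; a sketch with pd.get_dummies on a hypothetical column (not the course's approach, just an alternative):

```python
import pandas as pd

toy = pd.DataFrame({"Neighborhood": ["CollgCr", "Veenker", "Crawfor"]})

# One indicator column per category, so no ordering is implied
onehot = pd.get_dummies(toy, columns=["Neighborhood"])
print(onehot.columns.tolist())
```

Each row now has a single 1 among the Neighborhood_* columns, and no category is "greater" than another.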
# Import OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
It also comes from the sklearn.preprocessing module.
# Create OneHotEncoder: ohe
ohe = OneHotEncoder(categorical_features=categorical_mask, sparse=False)
Create the encoder; the parameters deserve a quick explanation. sparse controls whether a sparse matrix is returned (the same kind NMF is forever working with); here it is set to False. A detail I don't think matters much.
| sparse : boolean, default=True
| Will return sparse matrix if set True else will return an array.
That is the explanation from help.
# Apply OneHotEncoder to categorical columns - output is no longer a dataframe: df_encoded
df_encoded = ohe.fit_transform(df)
This encodes only the variables selected by categorical_mask, so there is no need to worry about the encoding touching the other variables. Next, look at the encoded result.
In [5]: print(df_encoded[:5, :])
[[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 1.00000000e+00
0.00000000e+00 0.00000000e+00 0.00000000e+00 1.00000000e+00
0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 1.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 1.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 0.00000000e+00 1.00000000e+00
0.00000000e+00 0.00000000e+00 6.00000000e+01 6.50000000e+01
8.45000000e+03 7.00000000e+00 5.00000000e+00 2.00300000e+03
0.00000000e+00 1.71000000e+03 1.00000000e+00 0.00000000e+00
2.00000000e+00 1.00000000e+00 3.00000000e+00 0.00000000e+00
5.48000000e+02 2.08500000e+05]
...]
Seeing 0.00000000e+00 and 1.00000000e+00 everywhere tells you the dummy columns of 0s and 1s are in place. Tears of joy.
In [6]: print(df.shape)
(1460, 21)
In [7]: print(df_encoded.shape)
(1460, 62)
Obviously the encoded data has many more columns! The integrated solution:
# Import OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
# Create OneHotEncoder: ohe
ohe = OneHotEncoder(categorical_features=categorical_mask, sparse=False)
# Apply OneHotEncoder to categorical columns - output is no longer a dataframe: df_encoded
df_encoded = ohe.fit_transform(df)
# Print first 5 rows of the resulting dataset - again, this will no longer be a pandas dataframe
print(df_encoded[:5, :])
# Print the shape of the original DataFrame
print(df.shape)
# Print the shape of the transformed array
print(df_encoded.shape)
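One caveat: the categorical_features argument belongs to the scikit-learn 0.19-era API shown above and was later removed. A sketch with the current API, where the encoder is applied to an array of already-selected categorical columns (the toy values are assumptions):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy categorical columns (MSZoning-like and PavedDrive-like values)
X = np.array([["RL", "Y"], ["RM", "N"], ["RL", "P"]])

ohe = OneHotEncoder()                     # returns a sparse matrix by default
encoded = ohe.fit_transform(X).toarray()  # densify for inspection

print(encoded.shape)  # 2 levels + 3 levels -> (3, 5)
```

In newer scikit-learn you select the categorical columns yourself (or use ColumnTransformer) instead of passing a mask to the encoder.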
Encoding categorical columns III: DictVectorizer | Python
Once I had figured out LabelEncoder and OneHotEncoder, I wondered: since both are packaged solutions with no real mathematical depth, why keep them separate instead of merging them? And sure enough, the instructor immediately says they can be merged. Then why make me learn LabelEncoder and OneHotEncoder first? Isn't that just stringing people along? This guy really is not cut out for teaching. What a trap.
For an explanation of DictVectorizer, see sklearn.feature_extraction.DictVectorizer — scikit-learn 0.19.1 documentation.
# Import DictVectorizer
from sklearn.feature_extraction import DictVectorizer
Import DictVectorizer.
# Convert df into a dictionary: df_dict
df_dict = df.to_dict("records")
Convert the table to dictionary format. For what "records" means, see dictionary - How to convert rows in DataFrame in Python to dictionaries - Stack Overflow. An example makes it clear.
import pandas as pd
# your df
# =========================
print(df)
id score1 score2 score3 score4 score5
0 1 0.0000 0.1087 0.0000 0.0786 1
1 2 0.0532 0.3083 0.2864 0.4464 1
2 3 0.0000 0.0840 0.8090 0.2331 1
# to_dict
# =========================
df.to_dict(orient='records')
Out[318]:
[{'id': 1.0,
'score1': 0.0,
'score2': 0.10865899999999999,
'score3': 0.0,
'score4': 0.078597,
'score5': 1.0},
{'id': 2.0,
'score1': 0.053238000000000001,
'score2': 0.308253,
'score3': 0.28635300000000002,
'score4': 0.44643299999999997,
'score5': 1.0},
{'id': 3.0,
'score1': 0.0,
'score2': 0.083978999999999998,
'score3': 0.80898300000000001,
'score4': 0.23305200000000001,
'score5': 1.0}]
# Create the DictVectorizer object: dv
dv = DictVectorizer(sparse=False)
sparse needs no further explanation.
# Apply dv on df: df_encoded
df_encoded = dv.fit_transform(df_dict)
Then encode the dict-formatted table. The result:
In [5]: print(dv.vocabulary_)
{'HouseStyle=2Story': 18, 'Neighborhood=OldTown': 46, 'YearBuilt': 61, 'PavedDrive=P': 57, 'BldgType=TwnhsE': 5, 'GrLivArea': 11, 'GarageArea': 10, 'Neighborhood=MeadowV': 39, 'Neighborhood=NAmes': 41, 'MSZoning=RL': 27, 'FullBath': 9, 'BsmtFullBath': 6, 'OverallCond': 54, 'Neighborhood=Edwards': 36, 'Neighborhood=NridgHt': 45, 'MSZoning=RM': 28, 'HouseStyle=2.5Unf': 17, 'LotFrontage': 22, 'Neighborhood=Blmngtn': 29, 'Remodeled': 59, 'Neighborhood=Mitchel': 40, 'Neighborhood=NoRidge': 44, 'BldgType=2fmCon': 2, 'HouseStyle=SFoyer': 19, 'Neighborhood=SawyerW': 49, 'Neighborhood=SWISU': 47, 'MSSubClass': 23, 'OverallQual': 55, 'Neighborhood=Crawfor': 35, 'PavedDrive=Y': 58, 'BldgType=Duplex': 3, 'HalfBath': 12, 'Neighborhood=Somerst': 50, 'HouseStyle=1.5Unf': 14, 'MSZoning=C (all)': 24, 'Neighborhood=Sawyer': 48, 'Neighborhood=Blueste': 30, 'BldgType=Twnhs': 4, 'LotArea': 21, 'BedroomAbvGr': 0, 'Neighborhood=StoneBr': 51, 'SalePrice': 60, 'MSZoning=RH': 26, 'Neighborhood=ClearCr': 33, 'Neighborhood=NPkVill': 42, 'MSZoning=FV': 25, 'Fireplaces': 8, 'Neighborhood=Veenker': 53, 'Neighborhood=CollgCr': 34, 'BldgType=1Fam': 1, 'BsmtHalfBath': 7, 'HouseStyle=SLvl': 20, 'Neighborhood=IDOTRR': 38, 'Neighborhood=Gilbert': 37, 'Neighborhood=NWAmes': 43, 'HouseStyle=1.5Fin': 13, 'Neighborhood=Timber': 52, 'Neighborhood=BrDale': 31, 'Neighborhood=BrkSide': 32, 'HouseStyle=2.5Fin': 16, 'PavedDrive=N': 56, 'HouseStyle=1Story': 15}
This is the full encoding vocabulary; the details don't matter. The integrated solution:
# Import DictVectorizer
from sklearn.feature_extraction import DictVectorizer
# Convert df into a dictionary: df_dict
df_dict = df.to_dict("records")
# Create the DictVectorizer object: dv
dv = DictVectorizer(sparse=False)
# Apply dv on df: df_encoded
df_encoded = dv.fit_transform(df_dict)
# Print the resulting first five rows
print(df_encoded[:5,:])
# Print the vocabulary
print(dv.vocabulary_)
And that's everything. With this last approach you can write a bit less code. Keep going.
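As a closing check, a minimal round trip of the DictVectorizer workflow on two hypothetical records; numeric fields pass through untouched while string fields become indicator columns, which is exactly the LabelEncoder-plus-OneHotEncoder effect in one step:

```python
from sklearn.feature_extraction import DictVectorizer

# Two hypothetical records in to_dict("records") form
records = [{"Neighborhood": "CollgCr", "LotArea": 8450},
           {"Neighborhood": "Veenker", "LotArea": 9600}]

dv = DictVectorizer(sparse=False)
X = dv.fit_transform(records)

print(dv.feature_names_)  # LotArea passes through; Neighborhood becomes dummies
print(X)
```

The fitted feature_names_ list plays the role of the vocabulary printed above, mapping each output column back to a field or a field=value pair.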