Get the Data with Pandas | Python
# Import the Pandas library
import pandas as pd
# Load the train and test datasets to create two DataFrames
train_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
train = pd.read_csv(train_url)
test_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
test = pd.read_csv(test_url)
# Print the `head` of the train and test dataframes
print(train.head())
print(test.head())
Rose vs Jack, or Female vs Male | Python
This is similar to R's prop.table and table functions.
# absolute numbers
train["Survived"].value_counts()
# percentages
train["Survived"].value_counts(normalize = True)
You can also write train.Survived.value_counts(); passing normalize = True returns proportions instead of counts.
train["Survived"][train["Sex"] == 'male'].value_counts()
train["Survived"][train["Sex"] == 'female'].value_counts()
Still, it is not as convenient as pd.pivot_table().
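For reference, a minimal pd.pivot_table sketch for the same breakdown, assuming the train DataFrame loaded above:
# Survival proportion by sex; aggfunc="mean" works because Survived is 0/1
print(pd.pivot_table(train, values="Survived", index="Sex", aggfunc="mean"))
# Raw counts per sex
print(pd.pivot_table(train, values="Survived", index="Sex", aggfunc="count"))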
# Passengers that survived vs passengers that passed away
print(train.Survived.value_counts())
# As proportions
print(train["Survived"].value_counts(normalize = True))
# Males that survived vs males that passed away
print(train["Survived"][train["Sex"] == 'male'].value_counts())
# Females that survived vs Females that passed away
print(train["Survived"][train["Sex"] == 'female'].value_counts())
# Normalized male survival
print(train["Survived"][train["Sex"] == 'male'].value_counts(normalize = True))
# Normalized female survival
print(train["Survived"][train["Sex"] == 'female'].value_counts(normalize = True))
<script.py> output:
0 549
1 342
Name: Survived, dtype: int64
0 0.616162
1 0.383838
Name: Survived, dtype: float64
0 468
1 109
Name: Survived, dtype: int64
1 233
0 81
Name: Survived, dtype: int64
0 0.811092
1 0.188908
Name: Survived, dtype: float64
1 0.742038
0 0.257962
Name: Survived, dtype: float64
It looks like it makes sense to predict that all females will survive and all males will die.
Does age play a role? | Python
# Create the column Child and initialize it to NaN
train["Child"] = float('NaN')
# Assign 1 to passengers under 18, 0 to those 18 or older. Print the new column.
# Use .loc rather than chained indexing, which triggers SettingWithCopyWarning
# and may silently fail to assign.
train.loc[train["Age"] < 18, "Child"] = 1
train.loc[train["Age"] >= 18, "Child"] = 0
print(train["Child"])
# Print normalized Survival Rates for passengers under 18
print(train["Survived"][train["Child"] == 1].value_counts(normalize = True))
# Print normalized Survival Rates for passengers 18 or older
print(train["Survived"][train["Child"] == 0].value_counts(normalize = True))
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
5 NaN
6 0.0
7 1.0
8 0.0
9 1.0
10 1.0
11 0.0
12 0.0
13 0.0
14 1.0
15 0.0
16 1.0
17 NaN
18 0.0
19 NaN
20 0.0
21 0.0
22 1.0
23 0.0
24 1.0
25 0.0
26 NaN
27 0.0
28 NaN
29 NaN
...
861 0.0
862 0.0
863 NaN
864 0.0
865 0.0
866 0.0
867 0.0
868 NaN
869 1.0
870 0.0
871 0.0
872 0.0
873 0.0
874 0.0
875 1.0
876 0.0
877 0.0
878 NaN
879 0.0
880 0.0
881 0.0
882 0.0
883 0.0
884 0.0
885 0.0
886 0.0
887 0.0
888 NaN
889 0.0
890 0.0
Name: Child, dtype: float64
1 0.539823
0 0.460177
Name: Survived, dtype: float64
0 0.618968
1 0.381032
Name: Survived, dtype: float64
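As an aside, the same Child column can be built in one line, without row-wise assignment; a sketch against the same train DataFrame:
# (Age < 18) is False for missing ages, so mask those rows back to NaN
train["Child"] = (train["Age"] < 18).astype(float).where(train["Age"].notna())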
First Prediction | Python
# Create a copy of test: test_one
test_one = test.copy()
# Initialize a Survived column to 0
test_one["Survived"] = 0
# Set Survived to 1 if Sex equals "female" and print the `Survived` column from `test_one`
test_one["Survived"][test_one["Sex"] == "female"] = 1
print(test_one.Survived)
Attribute access, test_one.Survived, works here as well.
Intro to decision trees | Python
# Import the Numpy library
import numpy as np
# Import 'tree' from scikit-learn library
from sklearn import tree
Cleaning and Formatting your Data | Python
Now we start filling in some missing values, using .fillna().
# Convert the male and female groups to integer form (again via .loc,
# not chained indexing)
train.loc[train["Sex"] == "male", "Sex"] = 0
train.loc[train["Sex"] == "female", "Sex"] = 1
# Impute the Embarked variable
train["Embarked"] = train["Embarked"].fillna("S")
# Convert the Embarked classes to integer form
train.loc[train["Embarked"] == "S", "Embarked"] = 0
train.loc[train["Embarked"] == "C", "Embarked"] = 1
train.loc[train["Embarked"] == "Q", "Embarked"] = 2
# Print the Sex and Embarked columns
print(train.Sex)
print(train.Embarked)
This is a small piece of feature engineering on Embarked: impute the missing values, then encode the classes as integers.
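A more compact alternative for the same encoding, shown here only as a sketch (it assumes the columns still hold their original string values):
# Map the string categories straight to integers
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})
train["Embarked"] = train["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})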
Creating your first decision tree | Python
The methods that we will use take numpy arrays as inputs, and therefore we will need to create those from the DataFrame that we already have. In other words, before fitting a DecisionTreeClassifier the inputs must be converted to numpy arrays; the simplest way is the .values attribute.
In [1]: type(train["Survived"].values)
Out[1]: numpy.ndarray
target = train["Survived"].values
features = train[["Sex", "Age"]].values
my_tree = tree.DecisionTreeClassifier()
my_tree = my_tree.fit(features, target)
.feature_importances_ shows how important each feature is; .score(X, y) reports the model's mean accuracy.
# Print the train data to see the available features
print(train)
# Create the target and features numpy arrays: target, features_one
target = train["Survived"].values
features_one = train[["Pclass", "Sex", "Age", "Fare"]].values
# Fit your first decision tree: my_tree_one
my_tree_one = tree.DecisionTreeClassifier()
my_tree_one = my_tree_one.fit(features_one, target)
# Look at the importance and score of the included features
print(my_tree_one.feature_importances_)
print(my_tree_one.score(features_one, target))
0.977553310887
Not bad.
In [5]: train[["Pclass", "Sex", "Age", "Fare"]].columns
Out[5]: Index(['Pclass', 'Sex', 'Age', 'Fare'], dtype='object')
In [6]: my_tree_one.feature_importances_
Out[6]: array([ 0.1269655 , 0.31274009, 0.23147703, 0.32881738])
This is where Python is less convenient than R: just getting a sorted importance ranking takes some fiddling.
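That said, pairing the names with the importances only takes one line; a small sketch reusing my_tree_one and pandas:
# Wrap the importances in a Series indexed by feature name, then sort
cols = ["Pclass", "Sex", "Age", "Fare"]
print(pd.Series(my_tree_one.feature_importances_, index=cols).sort_values(ascending=False))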
Predict and submit to Kaggle | Python
.astype(int) changes the column's dtype. In .to_csv("my_solution_one.csv", index_label = ["PassengerId"]), the index_label argument tells .to_csv to write the index into the .csv as well.
# Impute the missing value with the median
test.loc[152, "Fare"] = test["Fare"].median()
# Extract the features from the test set: Pclass, Sex, Age, and Fare.
test_features = test[['Pclass', 'Sex', 'Age', 'Fare']].values
# Make your prediction using the test set
my_prediction = my_tree_one.predict(test_features)
# Create a data frame with two columns: PassengerId & Survived. Survived contains your predictions
PassengerId = np.array(test["PassengerId"]).astype(int)
my_solution = pd.DataFrame(my_prediction, PassengerId, columns = ["Survived"])
print(my_solution)
# Check that your data frame has 418 entries
print(my_solution.shape)
# Write your solution to a csv file with the name my_solution.csv
my_solution.to_csv("my_solution_one.csv", index_label = ["PassengerId"])
Overfitting and how to control it | Python
The depth-related parameters include max_depth, the maximum depth of the tree, and min_samples_split, the minimum number of samples a node must contain before it can be split (nodes with fewer samples stay unsplit). random_state is the counterpart of R's set.seed(): it fixes the random state.
# Create a new array with the added features: features_two
features_two = train[["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked"]].values
# Control overfitting by setting "max_depth" to 10 and "min_samples_split" to 5: my_tree_two
my_tree_two = tree.DecisionTreeClassifier(max_depth = 10, min_samples_split = 5, random_state = 1)
my_tree_two = my_tree_two.fit(features_two, target)
# Print the score of the new decision tree
print(my_tree_two.score(features_two, target)) # 0.905723905724
Here we did some parameter tuning, passing max_depth = 10 and min_samples_split = 5 to tree.DecisionTreeClassifier.
Feature-engineering for our Titanic data set | Python
We build a family_size feature, though it does not feel especially interesting.
# Create train_two with the newly defined feature
train_two = train.copy()
train_two["family_size"] = train_two.SibSp + train_two.Parch + 1
# Create a new feature set and add the new feature
features_three = train_two[["Pclass", "Sex", "Age", "Fare", "SibSp", "Parch", "family_size"]].values
# Define the tree classifier, then fit the model
my_tree_three = tree.DecisionTreeClassifier()
my_tree_three = my_tree_three.fit(features_three,target)
# Print the score of this decision tree
print(my_tree_three.score(features_three, target)) #0.979797979798
Wow, the score actually went up. "SibSp", "Parch", and "family_size" all sit in the model together; because trees are non-linear this is fine, and it helps the model capture the signal.
Your submission scored 0.75598, which is not an improvement of your best score. Keep trying!
A Random Forest analysis in Python | Python
n_estimators needs to be set when using the RandomForestClassifier() class. This argument sets the number of trees you wish to plant and average over; in other words, n_estimators answers how many trees to grow.
# Import the `RandomForestClassifier`
from sklearn.ensemble import RandomForestClassifier
# We want the Pclass, Age, Sex, Fare, SibSp, Parch, and Embarked variables
features_forest = train[["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked"]].values
# Building and fitting my_forest
forest = RandomForestClassifier(max_depth = 10, min_samples_split=2, n_estimators = 100, random_state = 1)
my_forest = forest.fit(features_forest, target)
# Print the score of the fitted random forest
print(my_forest.score(features_forest, target))
# Compute predictions on our test set features then print the length of the prediction vector
test_features = test[["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked"]].values
pred_forest = my_forest.predict(test_features)
print(len(pred_forest))
<script.py> output:
0.939393939394
418
Remember, the .score() measure should be high but not extreme, because an extreme value is a sign of overfitting. Since .score() is computed here on the training data, an inflated value is itself a symptom of overfitting.
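For a less optimistic estimate than the training-set .score(), one option is cross-validation; a sketch, assuming a scikit-learn version that ships sklearn.model_selection:
from sklearn.model_selection import cross_val_score
# Mean accuracy over 5 folds; usually noticeably below the training-set
# score when the model overfits
print(cross_val_score(forest, features_forest, target, cv=5).mean())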
# Request and print the `.feature_importances_` attribute
print(my_tree_two.feature_importances_)
print(my_forest.feature_importances_)
# Compute and print the mean accuracy score for both models
print(my_tree_two.score(features_two, target))
print(my_forest.score(features_two, target))
<script.py> output:
[ 0.14130255 0.17906027 0.41616727 0.17938711 0.05039699 0.01923751
0.0144483 ]
[ 0.10384741 0.20139027 0.31989322 0.24602858 0.05272693 0.04159232
0.03452128]
0.905723905724
0.939393939394
"Pclass","Age","Sex","Fare", "SibSp", "Parch", "Embarked"
The most important feature was “Sex”, but it was more significant for “my_tree_two”
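To see that comparison directly, a quick sketch putting both importance vectors side by side (same column order as features_two and features_forest):
cols = ["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked"]
# One row per feature, one column per model
print(pd.DataFrame({"tree": my_tree_two.feature_importances_,
                    "forest": my_forest.feature_importances_},
                   index=cols).sort_values("tree", ascending=False))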