Kaggle_warmup2-BiClassification

The Titanic - Machine Learning from Disaster competition on Kaggle.

Brief Analysis

First, take a look at the data.

import pandas as pd

train_path = '/kaggle/input/titanic/train.csv'
train_data = pd.read_csv(train_path)
train_data.head()
[figure: train_data.head() output]

The task is to predict from a handful of features. The target feature is Survived, which takes only the values 0 and 1, so this is a standard binary classification problem. The official docs also provide feature engineering ideas that could be used for later optimization, but here we only build a baseline with the necessary processing, such as fillna and one-hot encoding.
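For reference, one-hot encoding with pd.get_dummies turns a categorical column into indicator columns. A tiny illustration on a toy frame (not the competition data):

import pandas as pd

toy = pd.DataFrame({'Sex': ['male', 'female', 'male']})
print(pd.get_dummies(toy, columns=['Sex']))
# yields indicator columns Sex_female and Sex_male
# (values are 0/1 or bool depending on the pandas version)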

Check the NA situation in the data with info(). [figure: train_data.info() output]

You can see that Age needs fillna, Cabin can be dropped since it has too many NAs, and the remaining numerical features can be min-max scaled.
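A quick way to see the same thing numerically (a small check of my own, not from the original notebook):

train_data.isna().sum().sort_values(ascending=False)
# Cabin and Age have by far the most missing values; Embarked has a couple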

from sklearn.preprocessing import MinMaxScaler

train_data = pd.read_csv(train_path)

def preprocessing(df):
    # drop columns and fill na
    df = df.drop(['Name', 'PassengerId', 'Ticket', 'Cabin'], axis=1)
    df['Age'] = df['Age'].fillna(df['Age'].mean())
    # fill Fare too: test.csv has one missing Fare value that would break the scaler
    df['Fare'] = df['Fare'].fillna(df['Fare'].mean())
    # do one hot encoding
    dummy_columns = ['Sex', 'Pclass', 'Embarked']
    df = pd.get_dummies(df, columns=dummy_columns)

    # scale the features to [0, 1]
    scaler = MinMaxScaler()
    feature = df.loc[:, ~df.columns.isin(['Survived'])]
    feature = pd.DataFrame(scaler.fit_transform(feature.values), columns=feature.columns)
    return feature

label = train_data['Survived']
feature = preprocessing(train_data.loc[:, ~train_data.columns.isin(['Survived'])])
feature.head()

After processing, the data looks like this: [figure: feature.head() output]
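One caveat with this baseline: preprocessing() refits MinMaxScaler on whatever frame it receives, so train and test get scaled with different ranges. A sketch of the usual fix, fitting the scaler once on the training features and reusing it (the names train_features/test_features here are illustrative, not from the original notebook):

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_features.values)  # fit on train only
test_scaled = scaler.transform(test_features.values)        # reuse train statistics

For a baseline the difference is small, but it keeps both sets on a consistent scale.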

After that, just feed the data to a model. For binary classification you can throw it straight into LogisticRegression, or write an MLP yourself.

LogisticRegression

I don't use sklearn much day to day, but it felt quite good trying it here. You can also hand-write one, as sketched below; the prediction accuracy ends up around 0.8 either way.
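Hand-writing logistic regression only takes a few lines of numpy. A minimal gradient-descent sketch (my own illustration, not code from the notebook):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train_logreg(X, y, lr=0.1, epochs=1000):
    # X: (n_samples, n_features), y: (n_samples,) of 0/1 labels
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)           # predicted probabilities
        grad_w = X.T @ (p - y) / len(y)  # gradient of mean BCE loss w.r.t. w
        grad_b = (p - y).mean()          # gradient w.r.t. bias
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

w, b = train_logreg(feature.values, label.values)
pred = (sigmoid(feature.values @ w + b) > 0.5).astype(int)

The sklearn version is of course shorter: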

# sklearn Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# potentially use RFE for feature selection (see the sketch below)
LR = LogisticRegression()
LR.fit(feature, label)
y_pred = LR.predict(feature)
print(accuracy_score(label, y_pred))
# 0.8013468013468014
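Since the comment above mentions RFE, here is a quick sketch of how it could be wired in (an assumption on my part; the notebook never actually ran it):

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# keep the 8 most useful features according to recursive feature elimination
rfe = RFE(LogisticRegression(), n_features_to_select=8)
rfe.fit(feature, label)
print(feature.columns[rfe.support_])  # which features survived the elimination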

K-Fold

While we're at it, run K-fold cross validation.

# K-fold cross validation
from sklearn.model_selection import cross_val_score, KFold

kf = KFold(10)
score = cross_val_score(LR, feature, label, cv=kf)
print(score)
# [0.77777778 0.83146067 0.7752809 0.80898876 0.7752809 0.78651685 0.74157303 0.78651685 0.86516854 0.79775281]
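cross_val_score returns one accuracy per fold, so it is usually worth collapsing them into a single summary (a small addition of my own):

print(f'{score.mean():.4f} +/- {score.std():.4f}')
# about 0.79 mean accuracy over the ten folds above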

PyTorch MLP

Mainly: write a Dataset and DataLoader, define a model with nn.Module, then loop zero_grad, backward, step. Finally, track the predictions by hand to print accuracy. If you delete the 2nd and 3rd fully connected layers and the 1st and 2nd activation functions from the model, and change the first layer's output size to 1, you get logistic regression implemented in PyTorch; its result is about the same as sklearn's. A sketch of that reduced variant follows, with the full MLP code after it.
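A minimal sketch of the reduced logistic-regression variant just described (one Linear layer into a Sigmoid; my reconstruction, not code from the notebook):

import torch

class LogReg(torch.nn.Module):
    def __init__(self, n_features):
        super().__init__()
        self.fc = torch.nn.Linear(n_features, 1)  # single linear layer, output size 1
        self.act = torch.nn.Sigmoid()             # squash the logit to a probability

    def forward(self, x):
        return self.act(self.fc(x))

It trains with exactly the same BCELoss/Adam loop as the full MLP below.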

# PyTorch MLP
import torch
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_x, valid_x, train_y, valid_y = train_test_split(feature, label, test_size=0.2)

# create dataset class
class TitanicDataset(Dataset):
    def __init__(self, x, y):
        self.x = torch.from_numpy(x.values).float().to(device)
        self.y = torch.from_numpy(y.values).float().to(device)

    def __len__(self):
        if len(self.x) != len(self.y):
            raise ValueError("Unequal Size of X and Y!")
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

# init dataloaders
train_dataset = TitanicDataset(train_x, train_y)
train_iter = DataLoader(train_dataset, batch_size=128, shuffle=True)
valid_dataset = TitanicDataset(valid_x, valid_y)
valid_iter = DataLoader(valid_dataset, batch_size=128, shuffle=False)

class MLP(torch.nn.Module):
    def __init__(self, n_features):
        super(MLP, self).__init__()
        self.fc1 = torch.nn.Linear(n_features, 256)
        self.fc2 = torch.nn.Linear(256, 128)
        self.fc3 = torch.nn.Linear(128, 1)
        self.fn1 = torch.nn.ReLU()
        self.fn2 = torch.nn.ReLU()
        self.fn3 = torch.nn.Sigmoid()

        torch.nn.init.xavier_uniform_(self.fc1.weight)
        torch.nn.init.xavier_uniform_(self.fc2.weight)
        torch.nn.init.xavier_uniform_(self.fc3.weight)

    def forward(self, x):
        x = self.fn1(self.fc1(x))
        x = self.fn2(self.fc2(x))
        x = self.fn3(self.fc3(x))
        return x

_, n_features = train_x.shape
model = MLP(n_features)
model.to(device)

lr = 1e-4
loss = torch.nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
epoch = 200

for i in range(epoch):
    model.train()
    total_loss = 0
    total = 0
    correct = 0
    for x, y in train_iter:
        optimizer.zero_grad()
        y_pred = torch.squeeze(model(x))
        l = loss(y_pred, y)
        l.backward()
        optimizer.step()
        total_loss += l.item()
        total += len(y)
        correct += (y_pred.round() == y).sum().item()
    train_accuracy = correct / total

    # validate without tracking gradients
    model.eval()
    total = 0
    correct = 0
    with torch.no_grad():
        for x, y in valid_iter:
            y_pred = torch.squeeze(model(x))
            total += len(y)
            correct += (y_pred.round() == y).sum().item()
    test_accuracy = correct / total

    if (i+1) % 10 == 0:
        print(f'Epoch {i+1}:{total_loss}, train accuracy:{train_accuracy}, test accuracy:{test_accuracy}')

Output

Epoch 10:3.5960923433303833, train accuracy:0.7064606547355652, test accuracy:0.7430167198181152
Epoch 20:3.1748223900794983, train accuracy:0.7851123809814453, test accuracy:0.8156424164772034
Epoch 30:2.885961800813675, train accuracy:0.8089887499809265, test accuracy:0.832402229309082
Epoch 40:2.7163011729717255, train accuracy:0.8047752976417542, test accuracy:0.8212290406227112
Epoch 50:2.670557290315628, train accuracy:0.8089887499809265, test accuracy:0.8156424164772034
Epoch 60:2.5934107899665833, train accuracy:0.8202247023582458, test accuracy:0.8212290406227112
Epoch 70:2.5730879604816437, train accuracy:0.8188202381134033, test accuracy:0.8268156051635742
Epoch 80:2.574642688035965, train accuracy:0.8146067261695862, test accuracy:0.8268156051635742
Epoch 90:2.5959592163562775, train accuracy:0.8160112500190735, test accuracy:0.8268156051635742
Epoch 100:2.555866301059723, train accuracy:0.8132022619247437, test accuracy:0.8268156051635742
Epoch 110:2.5190521478652954, train accuracy:0.8160112500190735, test accuracy:0.832402229309082
Epoch 120:2.5432372093200684, train accuracy:0.8132022619247437, test accuracy:0.8379887938499451
Epoch 130:2.5101636350154877, train accuracy:0.8132022619247437, test accuracy:0.8268156051635742
Epoch 140:2.4863936603069305, train accuracy:0.8146067261695862, test accuracy:0.8268156051635742
Epoch 150:2.441671073436737, train accuracy:0.8117977380752563, test accuracy:0.8268156051635742
Epoch 160:2.452588587999344, train accuracy:0.8160112500190735, test accuracy:0.8268156051635742
Epoch 170:2.4411253333091736, train accuracy:0.8132022619247437, test accuracy:0.8268156051635742
Epoch 180:2.5097872614860535, train accuracy:0.8117977380752563, test accuracy:0.8268156051635742
Epoch 190:2.428959161043167, train accuracy:0.817415714263916, test accuracy:0.8268156051635742
Epoch 200:2.395860642194748, train accuracy:0.8188202381134033, test accuracy:0.8268156051635742

submission

# submission
test_path = '/kaggle/input/titanic/test.csv'
submission_path = '/kaggle/input/titanic/gender_submission.csv'

test_data = pd.read_csv(test_path)
test_data = preprocessing(test_data)
test_tensor = torch.from_numpy(test_data.values).float().to(device)

model.eval()
with torch.no_grad():
    y_pred = model(test_tensor).round().squeeze().cpu()

submission_data = pd.read_csv(submission_path)
submission_data['Survived'] = y_pred.to(torch.int32).numpy()
submission_data.to_csv('submission.csv', index=False)
submission_data.head()
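One more pitfall worth noting: pd.get_dummies generates columns from the values it happens to see, so if the test set were missing a category, the dummy columns could differ from training. On Titanic the categories match, but a defensive version would reindex the test features against the training columns (illustrative sketch):

# align test columns to the training feature columns, filling absent dummies with 0
test_data = test_data.reindex(columns=feature.columns, fill_value=0)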