DeBERTa + LGBM + Feature Engineering: Study Notes

Reference notebook: DeBERTa & LightGBM for Automated Essay Scoring.

First a brief look at DeBERTa, then a walkthrough of how the notebook does its feature engineering.

DeBERTa

v1: DeBERTa: Decoding-enhanced BERT with Disentangled Attention

v3: DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing

V1 makes improvements in two areas: Disentangled Attention and the Enhanced Mask Decoder; the paper also introduces a new adversarial-training scheme for fine-tuning.

Disentangled Attention

Recall that BERT simply sums its embeddings (token, position, etc.) to form the input representation.

Unlike BERT, where each token has a single content embedding and a single position embedding whose sum is the word representation, DeBERTa represents each token with two separate vectors, and the position embedding is a relative one. The cross attention between token_i and token_j then decomposes into four parts: content -> content, content -> position, position -> content, and position -> position.
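
Concretely, the decomposition given in the DeBERTa paper: with content vectors $H_i$ and relative-position vectors $P_{i|j}$, the attention score between token $i$ and token $j$ expands as

$$A_{i,j} = \{H_i, P_{i|j}\} \times \{H_j, P_{j|i}\}^{\top} = H_i H_j^{\top} + H_i P_{j|i}^{\top} + P_{i|j} H_j^{\top} + P_{i|j} P_{j|i}^{\top},$$

i.e. content-to-content, content-to-position, position-to-content, and position-to-position; the paper drops the last term, arguing that with relative positions it adds little extra information.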

The detailed computation that follows is fairly involved, but at heart it is still cross terms between matrices, somewhat reminiscent of the FFM model I looked at earlier.

Enhanced Mask Decoder

As described above, DeBERTa's attention only uses relative positions, whereas BERT adds absolute positions directly at the input. Up to this point DeBERTa therefore carries no absolute position information, but absolute positions are still useful, so the authors add them back in.

The BERT model incorporates absolute positions in the input layer. In DeBERTa, we incorporate them right after all the Transformer layers but before the softmax layer for masked token prediction.

所以说白了就是解码的时候(这里的解码个人感觉表示的其实是predict the masked word的意思)我们会加入absolute position来作为补充信息.
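
A rough sketch of the idea, heavily simplified and with hypothetical names (the real Enhanced Mask Decoder feeds the absolute-position information into its final decoding layers rather than simply adding it):

import torch
import torch.nn as nn

class SimplifiedMaskDecoder(nn.Module):
    """Illustrative only: absolute positions injected after the (relative-position) encoder,
    right before the masked-token prediction layer."""
    def __init__(self, hidden_size, vocab_size, max_positions):
        super().__init__()
        self.abs_pos_emb = nn.Embedding(max_positions, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size)

    def forward(self, hidden_states, position_ids):
        # hidden_states: output of the Transformer layers (which only saw relative positions)
        h = hidden_states + self.abs_pos_emb(position_ids)  # absolute position added late
        return self.lm_head(h)  # logits over the vocabulary for masked-token prediction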

SCALE INVARIANT FINE-TUNING(SIFT)

Adversarial training is generally used to make a model more robust; in NLP the natural place to add the perturbation is the word embeddings. However, the embeddings themselves are not normalized, so their scale varies, and the authors therefore propose applying LayerNorm to the embeddings first and perturbing the normalized embeddings.
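
A minimal sketch of that idea, under my own simplifying assumptions (real SIFT computes the perturbation adversarially from the gradient of the loss; here a random direction stands in for it):

import torch
import torch.nn.functional as F

def sift_style_perturb(word_embeddings, epsilon=1e-3):
    # 1) LayerNorm the embeddings so the perturbation scale is comparable across tokens/dims
    normed = F.layer_norm(word_embeddings, word_embeddings.shape[-1:])
    # 2) perturb the *normalized* embeddings; SIFT chooses delta adversarially, not randomly
    delta = torch.randn_like(normed)
    delta = epsilon * delta / (delta.norm(dim=-1, keepdim=True) + 1e-12)
    return normed + delta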

DeBERTa + Feature Engineering + LGBM

import

import gc
import torch
import copy
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments,DataCollatorWithPadding
import nltk
from datasets import Dataset
from glob import glob
import numpy as np
import pandas as pd
import polars as pl
import re
import random
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
from scipy.special import softmax
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier,GradientBoostingClassifier,BaggingClassifier
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB,MultinomialNB,ComplementNB
from sklearn.neural_network import MLPClassifier
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.pipeline import Pipeline
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, f1_score
from sklearn.metrics import cohen_kappa_score
from lightgbm import log_evaluation, early_stopping
import lightgbm as lgb
nltk.download('wordnet')

baseline DeBERTa

A long block of code: MODEL_PATHS points to DeBERTa checkpoints the author has already fine-tuned; each one is loaded and its predictions are collected for a multi-model ensemble.

MAX_LENGTH = 1024
TEST_DATA_PATH = "/kaggle/input/learning-agency-lab-automated-essay-scoring-2/test.csv"
import pandas as pd
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
)
from scipy.special import softmax
import torch
import gc
import glob
MODEL_PATHS = [
    '/kaggle/input/aes2-400-20240419134941/*/*',
    '/kaggle/input/best-model-1/deberta-large-fold1/checkpoint-100/',
    '/kaggle/input/train-best-model-3/deberta-large-fold1/checkpoint-200/'
]
EVAL_BATCH_SIZE = 1

models = []
for path in MODEL_PATHS:
    models.extend(glob.glob(path))

tokenizer = AutoTokenizer.from_pretrained(models[0])

def tokenize(sample):
    return tokenizer(sample['full_text'], max_length=MAX_LENGTH, truncation=True)

df_test = pd.read_csv(TEST_DATA_PATH)
ds = Dataset.from_pandas(df_test).map(tokenize).remove_columns(['essay_id', 'full_text'])

args = TrainingArguments(
    ".",
    per_device_eval_batch_size=EVAL_BATCH_SIZE,
    report_to="none"
)

predictions = []
for model in models:
    model = AutoModelForSequenceClassification.from_pretrained(model)
    trainer = Trainer(
        model=model,
        args=args,
        data_collator=DataCollatorWithPadding(tokenizer),
        tokenizer=tokenizer
    )
    preds = trainer.predict(ds).predictions
    predictions.append(softmax(preds, axis=-1))
    del model, trainer
    torch.cuda.empty_cache()
    gc.collect()

Extracting the DeBERTa predictions

predicted_score = 0.
for p in predictions:
    predicted_score += p

predicted_score /= len(predictions)
df_test['score'] = predicted_score.argmax(-1) + 1
df_test.head()

Feature Engineering

1. text -> paragraph

Paragraphs are split on blank lines (double newlines).

columns = [
    (
        pl.col("full_text").str.split(by="\n\n").alias("paragraph")
    ),
]
PATH = "/kaggle/input/learning-agency-lab-automated-essay-scoring-2/"

train = pl.read_csv(PATH + "train.csv").with_columns(columns)
test = pl.read_csv(PATH + "test.csv").with_columns(columns)

train.head(1)

2. data preprocessing

This covers:

  1. expanding contractions
  2. removing HTML
  3. removing noise such as @ mentions and punctuation

import string

cList = {
"ain't": "am not","aren't": "are not","can't": "cannot","can't've": "cannot have","'cause": "because", "could've": "could have","couldn't": "could not","couldn't've": "could not have","didn't": "did not","doesn't": "does not","don't": "do not","hadn't": "had not","hadn't've": "had not have","hasn't": "has not",
"haven't": "have not","he'd": "he would","he'd've": "he would have","he'll": "he will","he'll've": "he will have","he's": "he is",
"how'd": "how did","how'd'y": "how do you","how'll": "how will","how's": "how is","I'd": "I would","I'd've": "I would have","I'll": "I will","I'll've": "I will have","I'm": "I am","I've": "I have",
"isn't": "is not","it'd": "it had","it'd've": "it would have","it'll": "it will", "it'll've": "it will have","it's": "it is","let's": "let us","ma'am": "madam","mayn't": "may not",
"might've": "might have","mightn't": "might not","mightn't've": "might not have","must've": "must have","mustn't": "must not","mustn't've": "must not have","needn't": "need not","needn't've": "need not have","o'clock": "of the clock","oughtn't": "ought not","oughtn't've": "ought not have","shan't": "shall not","sha'n't": "shall not",
"shan't've": "shall not have","she'd": "she would","she'd've": "she would have","she'll": "she will","she'll've": "she will have","she's": "she is",
"should've": "should have","shouldn't": "should not","shouldn't've": "should not have","so've": "so have","so's": "so is","that'd": "that would","that'd've": "that would have","that's": "that is","there'd": "there had","there'd've": "there would have","there's": "there is","they'd": "they would","they'd've": "they would have","they'll": "they will","they'll've": "they will have","they're": "they are","they've": "they have","to've": "to have","wasn't": "was not","we'd": "we had",
"we'd've": "we would have","we'll": "we will","we'll've": "we will have","we're": "we are","we've": "we have",
"weren't": "were not","what'll": "what will","what'll've": "what will have",
"what're": "what are","what's": "what is","what've": "what have","when's": "when is","when've": "when have",
"where'd": "where did","where's": "where is","where've": "where have","who'll": "who will","who'll've": "who will have","who's": "who is","who've": "who have","why's": "why is",
"why've": "why have","will've": "will have","won't": "will not","won't've": "will not have","would've": "would have","wouldn't": "would not",
"wouldn't've": "would not have","y'all": "you all","y'alls": "you alls","y'all'd": "you all would",
"y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have","you'd": "you had","you'd've": "you would have","you'll": "you you will","you'll've": "you you will have","you're": "you are", "you've": "you have"
}

c_re = re.compile('(%s)' % '|'.join(cList.keys()))

def expandContractions(text, c_re=c_re):
    def replace(match):
        return cList[match.group(0)]
    return c_re.sub(replace, text)

def removeHTML(x):
    html = re.compile(r'<.*?>')
    return html.sub(r'', x)

def dataPreprocessing(x):
    x = x.lower()
    x = removeHTML(x)
    x = re.sub("@\w+", '', x)
    x = re.sub("'\d+", '', x)
    x = re.sub("\d+", '', x)
    x = re.sub("http\w+", '', x)
    x = re.sub(r"\s+", " ", x)
    # x = expandContractions(x)
    x = re.sub(r"\.+", ".", x)
    x = re.sub(r"\,+", ",", x)
    x = x.strip()
    return x

def remove_punctuation(text):
    """
    Remove all punctuation from the input text.

    Args:
    - text (str): The input text.

    Returns:
    - str: The text with punctuation removed.
    """
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

Very professional: a contraction-expansion dictionary like this only comes from doing this kind of work regularly, just like every pentester carries their own password-cracking wordlist.

3. count spelling errors

Link to the english-word-hx word list.

import spacy
import re

nlp = spacy.load("en_core_web_sm")

with open('/kaggle/input/english-word-hx/words.txt', 'r') as file:
    english_vocab = set(word.strip().lower() for word in file)

def count_spelling_errors(text):
    doc = nlp(text)
    lemmatized_tokens = [token.lemma_.lower() for token in doc]
    spelling_errors = sum(1 for token in lemmatized_tokens if token not in english_vocab)
    return spelling_errors

This relies on an English word list: every word in the essay is lemmatized, and a lemma that does not appear in the vocabulary is counted as a spelling error.

First pass (paragraph)

  1. Preprocess the paragraphs, remove punctuation, count spelling errors, and compute paragraph length plus sentence and word counts.

  2. Use two threshold lists: the number of paragraphs longer than a given length becomes one feature, the number shorter than a given length another.

  3. For these basic features plus the spelling-error count, compute summary statistics: mean, max, min, kurtosis, quantiles, and so on.

def Paragraph_Preprocess(tmp):
    tmp = tmp.explode('paragraph')
    tmp = tmp.with_columns(pl.col('paragraph').map_elements(dataPreprocessing))
    tmp = tmp.with_columns(pl.col('paragraph').map_elements(remove_punctuation).alias('paragraph_no_pinctuation'))
    tmp = tmp.with_columns(pl.col('paragraph_no_pinctuation').map_elements(count_spelling_errors).alias("paragraph_error_num"))
    tmp = tmp.with_columns(pl.col('paragraph').map_elements(lambda x: len(x)).alias("paragraph_len"))
    tmp = tmp.with_columns(pl.col('paragraph').map_elements(lambda x: len(x.split('.'))).alias("paragraph_sentence_cnt"),
                           pl.col('paragraph').map_elements(lambda x: len(x.split(' '))).alias("paragraph_word_cnt"),)
    return tmp

# feature_eng
paragraph_fea = ['paragraph_len','paragraph_sentence_cnt','paragraph_word_cnt']
paragraph_fea2 = ['paragraph_error_num'] + paragraph_fea
def Paragraph_Eng(train_tmp):
    num_list = [0, 50,75,100,125,150,175,200,250,300,350,400,500,600]
    num_list2 = [0, 50,75,100,125,150,175,200,250,300,350,400,500,600,700]
    aggs = [
        *[pl.col('paragraph').filter(pl.col('paragraph_len') >= i).count().alias(f"paragraph_>{i}_cnt") for i in [0, 50,75,100,125,150,175,200,250,300,350,400,500,600,700] ],
        *[pl.col('paragraph').filter(pl.col('paragraph_len') <= i).count().alias(f"paragraph_<{i}_cnt") for i in [25,49]],
        *[pl.col(fea).max().alias(f"{fea}_max") for fea in paragraph_fea2],
        *[pl.col(fea).mean().alias(f"{fea}_mean") for fea in paragraph_fea2],
        *[pl.col(fea).min().alias(f"{fea}_min") for fea in paragraph_fea2],
        *[pl.col(fea).sum().alias(f"{fea}_sum") for fea in paragraph_fea2],
        *[pl.col(fea).first().alias(f"{fea}_first") for fea in paragraph_fea2],
        *[pl.col(fea).last().alias(f"{fea}_last") for fea in paragraph_fea2],
        *[pl.col(fea).kurtosis().alias(f"{fea}_kurtosis") for fea in paragraph_fea2],
        *[pl.col(fea).quantile(0.25).alias(f"{fea}_q1") for fea in paragraph_fea2],
        *[pl.col(fea).quantile(0.75).alias(f"{fea}_q3") for fea in paragraph_fea2],
    ]
    df = train_tmp.group_by(['essay_id'], maintain_order=True).agg(aggs).sort("essay_id")
    df = df.to_pandas()
    return df

tmp = Paragraph_Preprocess(train)
train_feats = Paragraph_Eng(tmp)
train_feats['score'] = train['score']

feature_names = list(filter(lambda x: x not in ['essay_id','score'], train_feats.columns))
print('Features Number: ',len(feature_names))
train_feats.head(3)

# Features Number: 53

At this point the feature count reaches 53.

Second pass (sentence)

Same idea as the paragraph features, this time at the sentence level.

def Sentence_Preprocess(tmp):
    tmp = tmp.with_columns(pl.col('full_text').map_elements(dataPreprocessing).str.split(by=".").alias("sentence"))
    tmp = tmp.explode('sentence')
    tmp = tmp.with_columns(pl.col('sentence').map_elements(lambda x: len(x)).alias("sentence_len"))
    tmp = tmp.with_columns(pl.col('sentence').map_elements(lambda x: len(x.split(' '))).alias("sentence_word_cnt"))
    return tmp

# feature_eng
sentence_fea = ['sentence_len','sentence_word_cnt']
def Sentence_Eng(train_tmp):
    aggs = [
        *[pl.col('sentence').filter(pl.col('sentence_len') >= i).count().alias(f"sentence_>{i}_cnt") for i in [0,15,50,100,150,200,250,300] ],
        *[pl.col('sentence').filter(pl.col('sentence_len') <= i).count().alias(f"sentence_<{i}_cnt") for i in [15,50] ],
        *[pl.col(fea).max().alias(f"{fea}_max") for fea in sentence_fea],
        *[pl.col(fea).mean().alias(f"{fea}_mean") for fea in sentence_fea],
        *[pl.col(fea).min().alias(f"{fea}_min") for fea in sentence_fea],
        *[pl.col(fea).sum().alias(f"{fea}_sum") for fea in sentence_fea],
        *[pl.col(fea).first().alias(f"{fea}_first") for fea in sentence_fea],
        *[pl.col(fea).last().alias(f"{fea}_last") for fea in sentence_fea],
        *[pl.col(fea).kurtosis().alias(f"{fea}_kurtosis") for fea in sentence_fea],
        *[pl.col(fea).quantile(0.25).alias(f"{fea}_q1") for fea in sentence_fea],
        *[pl.col(fea).quantile(0.75).alias(f"{fea}_q3") for fea in sentence_fea],
    ]
    df = train_tmp.group_by(['essay_id'], maintain_order=True).agg(aggs).sort("essay_id")
    df = df.to_pandas()
    return df

tmp = Sentence_Preprocess(train)
train_feats = train_feats.merge(Sentence_Eng(tmp), on='essay_id', how='left')

feature_names = list(filter(lambda x: x not in ['essay_id','score'], train_feats.columns))
print('Features Number: ',len(feature_names))
train_feats.head(3)

# Features Number: 81

word

Next, word-length features. Personally this feels a bit excessive, but it is worth seeing what kinds of processing are possible.

# word feature
def Word_Preprocess(tmp):
    tmp = tmp.with_columns(pl.col('full_text').map_elements(dataPreprocessing).str.split(by=" ").alias("word"))
    tmp = tmp.explode('word')
    tmp = tmp.with_columns(pl.col('word').map_elements(lambda x: len(x)).alias("word_len"))
    tmp = tmp.filter(pl.col('word_len')!=0)
    return tmp

# feature_eng
def Word_Eng(train_tmp):
    aggs = [
        *[pl.col('word').filter(pl.col('word_len') >= i+1).count().alias(f"word_{i+1}_cnt") for i in range(15) ],
        pl.col('word_len').max().alias(f"word_len_max"),
        pl.col('word_len').mean().alias(f"word_len_mean"),
        pl.col('word_len').std().alias(f"word_len_std"),
        pl.col('word_len').quantile(0.25).alias(f"word_len_q1"),
        pl.col('word_len').quantile(0.50).alias(f"word_len_q2"),
        pl.col('word_len').quantile(0.75).alias(f"word_len_q3"),
    ]
    df = train_tmp.group_by(['essay_id'], maintain_order=True).agg(aggs).sort("essay_id")
    df = df.to_pandas()
    return df

tmp = Word_Preprocess(train)
train_feats = train_feats.merge(Word_Eng(tmp), on='essay_id', how='left')

feature_names = list(filter(lambda x: x not in ['essay_id','score'], train_feats.columns))
print('Features Number: ',len(feature_names))
train_feats.head(3)

# Features Number: 102

tfidf

I honestly did not expect TF-IDF to still come in handy here.

vectorizer = TfidfVectorizer(
    tokenizer=lambda x: x,
    preprocessor=lambda x: x,
    token_pattern=None,
    strip_accents='unicode',
    analyzer='word',
    ngram_range=(3,6),
    min_df=0.05,
    max_df=0.95,
    sublinear_tf=True,
)

train_tfid = vectorizer.fit_transform([i for i in train['full_text']])
dense_matrix = train_tfid.toarray()
df = pd.DataFrame(dense_matrix)
tfid_columns = [ f'tfid_{i}' for i in range(len(df.columns))]
df.columns = tfid_columns
df['essay_id'] = train_feats['essay_id']
train_feats = train_feats.merge(df, on='essay_id', how='left')
feature_names = list(filter(lambda x: x not in ['essay_id','score'], train_feats.columns))
print('Number of Features: ',len(feature_names))
train_feats.head(3)

# Number of Features: 19729

The feature count immediately explodes, and the usual drawbacks of TF-IDF are still there.

count

vectorizer_cnt = CountVectorizer(
    tokenizer=lambda x: x,
    preprocessor=lambda x: x,
    token_pattern=None,
    strip_accents='unicode',
    analyzer='word',
    ngram_range=(2,3),
    min_df=0.10,
    max_df=0.85,
)
train_tfid = vectorizer_cnt.fit_transform([i for i in train['full_text']])
dense_matrix = train_tfid.toarray()
df = pd.DataFrame(dense_matrix)
tfid_columns = [ f'tfid_cnt_{i}' for i in range(len(df.columns))]
df.columns = tfid_columns
df['essay_id'] = train_feats['essay_id']
train_feats = train_feats.merge(df, on='essay_id', how='left')

Merging in the DeBERTa OOF predictions

import joblib

deberta_oof = joblib.load('/kaggle/input/aes2-400-20240419134941/oof.pkl')
print(deberta_oof.shape, train_feats.shape)

for i in range(6):
    train_feats[f'deberta_oof_{i}'] = deberta_oof[:, i]

feature_names = list(filter(lambda x: x not in ['essay_id','score'], train_feats.columns))
print('Features Number: ',len(feature_names))

train_feats.shape

evaluation metrics

The competition's evaluation metric is quadratic weighted kappa (QWK).
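
For reference, the standard definition (not computed explicitly in the notebook): with observed rating matrix $O$, expected matrix $E$ built from the two raters' marginal distributions, and weights $w_{i,j} = \frac{(i-j)^2}{(N-1)^2}$ over the $N$ score levels,

$$\kappa = 1 - \frac{\sum_{i,j} w_{i,j}\,O_{i,j}}{\sum_{i,j} w_{i,j}\,E_{i,j}}.$$

In the code below this is obtained via sklearn's cohen_kappa_score(..., weights="quadratic").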

def quadratic_weighted_kappa(y_true, y_pred):
    y_true = y_true + a
    y_pred = (y_pred + a).clip(1, 6).round()
    qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")
    return 'QWK', qwk, True

def qwk_obj(y_true, y_pred):
    labels = y_true + a
    preds = y_pred + a
    preds = preds.clip(1, 6)
    f = 1/2*np.sum((preds-labels)**2)
    g = 1/2*np.sum((preds-a)**2+b)
    df = preds - labels
    dg = preds - a
    grad = (df/g - f*dg/g**2)*len(labels)
    hess = np.ones(len(labels))
    return grad, hess

a = 2.998
b = 1.092
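
My reading of qwk_obj: LightGBM's custom-objective API expects a per-sample gradient and Hessian, so instead of the non-differentiable QWK the notebook optimizes the smooth surrogate

$$\mathcal{L} = \frac{f}{g}, \qquad f = \tfrac{1}{2}\sum_i (p_i - y_i)^2, \qquad g = \tfrac{1}{2}\sum_i \big((p_i - a)^2 + b\big),$$

whose per-sample gradient is

$$\frac{\partial \mathcal{L}}{\partial p_i} = \frac{p_i - y_i}{g} - \frac{f\,(p_i - a)}{g^2},$$

which is exactly the grad line above (scaled by len(labels)); the Hessian is simply fixed at 1. The constants a = 2.998 and b = 1.092 presumably approximate the mean and variance of the training scores, so g acts as a stand-in for the QWK denominator.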

Assemble the data and hand it to LGBM

import pickle
with open('/kaggle/input/aes2-400-fes-202404291649/usefe_list.pkl', mode='br') as fi:
    feature_names = pickle.load(fi)
feature_select = feature_names

X = train_feats[feature_names].astype(np.float32).values
y_split = train_feats['score'].astype(int).values
y = train_feats['score'].astype(np.float32).values - a
oof = train_feats['score'].astype(int).values

n_splits = 15

skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)

f1_scores = []
kappa_scores = []
models = []
predictions = []
callbacks = [log_evaluation(period=25), early_stopping(stopping_rounds=75, first_metric_only=True)]

i = 1
for train_index, test_index in skf.split(X, y_split):
    print('fold', i)
    X_train_fold, X_test_fold = X[train_index], X[test_index]
    y_train_fold, y_test_fold, y_test_fold_int = y[train_index], y[test_index], y_split[test_index]
    model = lgb.LGBMRegressor(
        objective=qwk_obj,
        metrics='None',
        learning_rate=0.05,
        max_depth=5,
        num_leaves=10,
        colsample_bytree=0.3,
        reg_alpha=0.7,
        reg_lambda=0.1,
        n_estimators=700,
        random_state=42,
        extra_trees=True,
        class_weight='balanced',
        verbosity=-1)

    predictor = model.fit(X_train_fold,
                          y_train_fold,
                          eval_names=['train', 'valid'],
                          eval_set=[(X_train_fold, y_train_fold), (X_test_fold, y_test_fold)],
                          eval_metric=quadratic_weighted_kappa,
                          callbacks=callbacks,)
    models.append(predictor)
    predictions_fold = predictor.predict(X_test_fold)
    predictions_fold = predictions_fold + a
    oof[test_index] = predictions_fold
    predictions_fold = predictions_fold.clip(1, 6).round()
    predictions.append(predictions_fold)
    f1_fold = f1_score(y_test_fold_int, predictions_fold, average='weighted')
    f1_scores.append(f1_fold)

    kappa_fold = cohen_kappa_score(y_test_fold_int, predictions_fold, weights='quadratic')
    kappa_scores.append(kappa_fold)

    cm = confusion_matrix(y_test_fold_int, predictions_fold, labels=[x for x in range(1,7)])

    disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                                  display_labels=[x for x in range(1,7)])
    disp.plot()
    plt.show()
    print(f'F1 score across fold: {f1_fold}')
    print(f'Cohen kappa score across fold: {kappa_fold}')
    i += 1

mean_f1_score = np.mean(f1_scores)
mean_kappa_score = np.mean(kappa_scores)

print("="*50)
print(f'Mean F1 score across {n_splits} folds: {mean_f1_score}')
print(f'Mean Cohen kappa score across {n_splits} folds: {mean_kappa_score}')
print("="*50)

test data prediction

Preprocessing

tmp = Paragraph_Preprocess(test)
test_feats = Paragraph_Eng(tmp)
# Sentence
tmp = Sentence_Preprocess(test)
test_feats = test_feats.merge(Sentence_Eng(tmp), on='essay_id', how='left')
# Word
tmp = Word_Preprocess(test)
test_feats = test_feats.merge(Word_Eng(tmp), on='essay_id', how='left')

# Tfidf
test_tfid = vectorizer.transform([i for i in test['full_text']])
dense_matrix = test_tfid.toarray()
df = pd.DataFrame(dense_matrix)
tfid_columns = [ f'tfid_{i}' for i in range(len(df.columns))]
df.columns = tfid_columns
df['essay_id'] = test_feats['essay_id']
test_feats = test_feats.merge(df, on='essay_id', how='left')

# CountVectorizer
test_tfid = vectorizer_cnt.transform([i for i in test['full_text']])
dense_matrix = test_tfid.toarray()
df = pd.DataFrame(dense_matrix)
tfid_columns = [ f'tfid_cnt_{i}' for i in range(len(df.columns))]
df.columns = tfid_columns
df['essay_id'] = test_feats['essay_id']
test_feats = test_feats.merge(df, on='essay_id', how='left')

for i in range(6):
    test_feats[f'deberta_oof_{i}'] = predicted_score[:, i]

# Features number
feature_names = list(filter(lambda x: x not in ['essay_id','score'], test_feats.columns))
print('Features number: ',len(feature_names))
test_feats.head(3)

Computing the final predictions

probabilities = []
for model in models:
    proba = model.predict(test_feats[feature_select]) + a
    probabilities.append(proba)

predictions = np.mean(probabilities, axis=0)
predictions = np.round(predictions.clip(1, 6))
print(predictions)

submission = pd.read_csv("/kaggle/input/learning-agency-lab-automated-essay-scoring-2/sample_submission.csv")
submission['score'] = predictions
submission['score'] = submission['score'].astype(int)
submission.to_csv("submission.csv", index=None)
display(submission.head())