Kaggle_warmup3-TimeSeries

While going through study material today I noticed that Kaggle also has time-series ML problems, and I remembered that some companies ask for experience with time-series processing, so this is a from-scratch study of the topic. It probably won't fit in a single post; since this is something I haven't done before, I'll write it up in more detail.

The competition is Store Sales - Time Series Forecasting.

The notebook I'm studying is by Amisha0528.

Data Loading and Merging

The dataset is split across several files:

import pandas as pd

train = pd.read_csv("/kaggle/input/store-sales-time-series-forecasting/train.csv")
test = pd.read_csv("/kaggle/input/store-sales-time-series-forecasting/test.csv")
oil = pd.read_csv("/kaggle/input/store-sales-time-series-forecasting/oil.csv")
stores = pd.read_csv("/kaggle/input/store-sales-time-series-forecasting/stores.csv")
transactions = pd.read_csv("/kaggle/input/store-sales-time-series-forecasting/transactions.csv")
holidays = pd.read_csv("/kaggle/input/store-sales-time-series-forecasting/holidays_events.csv")

The data involves merging multiple tables, which is a kind of problem I hadn't run into before, so I'll record it here.

# Flag each row so train and test can be separated again later
train['test'] = 0
test['test'] = 1

# concat stacks train and test; the test column keeps them distinguishable
data = pd.concat([train, test], axis=0)
# merge joins in the side tables
data = data.merge(holidays, on='date', how='left')
data = data.merge(stores, on='store_nbr', how='left')
data = data.merge(oil, on='date', how='left')
data = data.merge(transactions, on=['date', 'store_nbr'], how='left')
# set_index picks the columns to use as a MultiIndex
data = data.set_index(['store_nbr', 'date', 'family'])
# Drop the 2013-01-01 rows (level 1 of the MultiIndex is date)
data = data.drop(index='2013-01-01', level=1)
data

First a test column is added to tell train and test apart; then train and test are concatenated so each table merge only has to be done once; finally several columns are chosen as a MultiIndex. A small sketch of how a left merge behaves follows.
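As a minimal sketch of what how='left' means (toy values, not from the competition; dcoilwtico is the oil price column in oil.csv): every row of the left table is kept, and rows with no match on the key get NaN.

import pandas as pd

left = pd.DataFrame({'date': ['2013-01-01', '2013-01-02'], 'sales': [0.0, 3.5]})
right = pd.DataFrame({'date': ['2013-01-02'], 'dcoilwtico': [93.1]})

# how='left' keeps both left rows; 2013-01-01 has no oil price, so it becomes NaN
print(left.merge(right, on='date', how='left'))
#          date  sales  dcoilwtico
# 0  2013-01-01    0.0         NaN
# 1  2013-01-02    3.5        93.1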

Dates are then handled with pandas datetime; year, month, day, day-of-week and similar fields can be pulled out directly into separate columns for later use.

# Date handling: recover the index as columns, then derive calendar features
data_ = data.copy().reset_index()

data_['date'] = pd.to_datetime(data_['date'])
data_['day_of_week'] = data_['date'].dt.day_of_week
data_['day_of_year'] = data_['date'].dt.dayofyear
data_['day_of_month'] = data_['date'].dt.day
data_['month'] = data_['date'].dt.month
data_['quarter'] = data_['date'].dt.quarter
data_['year'] = data_['date'].dt.year

train = data_[data_['test'] == 0]
test = data_[data_['test'] == 1]

train.head()

Here we can groupby to inspect sales at each time granularity and do some plotting; a minimal plotting sketch follows the loop below.

# Sum sales at each time granularity with groupby
grouping_columns = ['year', 'quarter', 'month', 'day_of_week', 'day_of_year', 'day_of_month']

for column in grouping_columns:
    grouped_data = train.groupby(column)['sales'].sum()
    grouped_data = pd.DataFrame(grouped_data).reset_index()
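A minimal plotting sketch (the subplot layout is my own choice, assuming matplotlib is available; the notebook's exact plots may differ):

import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 3, figsize=(18, 8))
for ax, column in zip(axes.flat, grouping_columns):
    grouped = train.groupby(column)['sales'].sum().reset_index()
    ax.bar(grouped[column], grouped['sales'])
    ax.set_title(f'total sales by {column}')
plt.tight_layout()
plt.show()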

Creating Lags and Windows

In my view this is the key part: using the date information extracted above, we encode relationships across dates as new columns, and these columns become the features used later.

import numpy as np

# Shifted (lagged) and exponentially weighted features
alphas = [0.95, 0.8, 0.65, 0.5]
lags = [1, 7, 30]

# lag is the delay in days; alpha sets how quickly the window forgets old values.
# Each (store_nbr, family) series is shifted and smoothed independently.
for a in alphas:
    for i in lags:
        data_[f'sales_lag_{i}_alpha_{a}'] = np.log1p(
            data_.groupby(['store_nbr', 'family'])['sales']
                 .transform(lambda x: x.shift(i).ewm(alpha=a, min_periods=1).mean())
        )

data_['sales_lag_7_alpha_0.5'].describe()

shift(i) creates the lag, so each row only sees values at least i steps in the past; ewm(alpha=a) then takes an exponentially weighted mean of those shifted values, with larger alpha putting more weight on the most recent ones.
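A toy sketch of the two operations on a made-up series (values are illustrative only):

import pandas as pd

s = pd.Series([10.0, 20.0, 30.0, 40.0])
# shift(1) delays the series by one step: [NaN, 10, 20, 30]
# ewm(alpha=0.5).mean() then smooths the shifted values
print(s.shift(1).ewm(alpha=0.5, min_periods=1).mean())
# 0          NaN
# 1    10.000000
# 2    16.666667
# 3    24.285714
# dtype: float64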

Final Data Extraction and Column Selection

What remains is the train/validation split and dropping some columns. The author also prepared a to_dummies list, but in the end no dummy encoding seems to be applied (a sketch of what it could look like follows the code).

import math
import re

# Reduce onpromotion to a boolean flag
data_['onpromotion'] = data_['onpromotion'] > 0
sales_lag_columns = list(data_.filter(like='lag').columns)

training_percentage = 0.8

# type_x / type_y come from the duplicate 'type' column in holidays and stores
to_dummies = ['day_of_week', 'day_of_month', 'month', 'quarter', 'year', 'store_nbr',
              'type_y', 'cluster', 'family', 'onpromotion', 'type_x',
              'locale', 'locale_name', 'city', 'state']

X = data_.loc[:, to_dummies + ['test', 'sales', 'id'] + sales_lag_columns]
X[to_dummies] = X[to_dummies].astype('category')

data_train = X[X['test'] == 0]
data_test = X[X['test'] == 1]

n = len(data_train)
training_end = math.floor(n * training_percentage)

# iloc keeps the split half-open, so no row lands in both train and validation
X_train = data_train.iloc[:training_end].drop(['test', 'sales', 'id'], axis=1)
y_train = data_train.iloc[:training_end]['sales']
X_val = data_train.iloc[training_end:].drop(['test', 'sales', 'id'], axis=1)
y_val = data_train.iloc[training_end:]['sales']

X_test = data_test.drop(['test', 'sales', 'id'], axis=1)

# LightGBM rejects feature names with special characters, so strip them
X_train = X_train.rename(columns=lambda x: re.sub('[^A-Za-z0-9_]+', '', x))
X_val = X_val.rename(columns=lambda x: re.sub('[^A-Za-z0-9_]+', '', x))
X_test = X_test.rename(columns=lambda x: re.sub('[^A-Za-z0-9_]+', '', x))
X_train.head()
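For reference, a hypothetical version of the dummy step that the to_dummies list hints at, using pd.get_dummies (not what the notebook actually runs, since it passes category dtypes to LightGBM directly):

# Hypothetical alternative: expand the categorical columns into 0/1 indicator columns
X_onehot = pd.get_dummies(X, columns=to_dummies, drop_first=True)

LightGBM can handle pandas category columns natively, which is why the astype('category') route works without explicit dummies.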


Finally everything goes into LGBMRegressor. LightGBM is short for light gradient-boosting machine.

import lightgbm as lgb

hyper_params = {
    'task': 'train', 'boosting_type': 'gbdt', 'objective': 'regression',
    'metric': ['l1', 'l2'], 'learning_rate': 0.1,
    'feature_fraction': 0.9, 'bagging_fraction': 0.7, 'bagging_freq': 10,
    'verbose': 0, 'max_depth': 50, 'num_leaves': 128, 'max_bin': 512,
}

gbm = lgb.LGBMRegressor(**hyper_params)

gbm.fit(X_train, y_train,
        eval_set=[(X_val, y_val)],
        eval_metric='l1')

y_pred = gbm.predict(X_val)
# Line up predictions next to the validation targets
results = pd.concat([y_val.reset_index(drop=True), pd.Series(y_pred)], axis=1).rename(columns={'sales': 'y_val', 0: 'y_pred'})
# Sales cannot be negative, so clip predictions at zero
results['y_pred'] = results['y_pred'].clip(lower=0)
results = results[results['y_val'] > 10]
results
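Two optional refinements, sketched here as assumptions rather than what the notebook does: early stopping on the validation set via the lgb.early_stopping callback, and scoring with RMSLE, the metric this competition is evaluated on.

from sklearn.metrics import mean_squared_log_error

# Refit with early stopping: stop when validation l1 hasn't improved for 50 rounds
gbm = lgb.LGBMRegressor(**hyper_params)
gbm.fit(X_train, y_train,
        eval_set=[(X_val, y_val)],
        eval_metric='l1',
        callbacks=[lgb.early_stopping(stopping_rounds=50)])

# RMSLE on the clipped validation predictions (both inputs are non-negative)
rmsle = mean_squared_log_error(results['y_val'], results['y_pred']) ** 0.5
print(f'validation RMSLE: {rmsle:.4f}')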

TODO

This post mainly covered the data-processing pipeline and the final model usage, recording the first notebook I've studied. For problem types I haven't met before, the only way forward is to read and practice more; I'll look for a few more notebooks to study and write up later.