Kaggle_warmup3-TimeSeries(3)

Reading further through the competition notebooks, I found another time-series package, MLForecast. Let's walk through how it is used, in the spirit of learning something new.

preprocessing

First, import the required packages and read the data.

import lightgbm as lgb
import numpy as np
import pandas as pd
from mlforecast import MLForecast
from mlforecast.lag_transforms import ExpandingMean, RollingMean
from mlforecast.target_transforms import GlobalSklearnTransformer
from sklearn.preprocessing import FunctionTransformer
from utilsforecast.feature_engineering import fourier
from utilsforecast.preprocessing import fill_gaps
from utilsforecast.plotting import plot_series

df = pd.read_csv('../input/store-sales-time-series-forecasting/train.csv', parse_dates=['date'])
df = df.drop(columns='id')
df['unique_id'] = df['store_nbr'].astype(str) + '_' + df['family']
df.head()

Then we can use `utilsforecast.preprocessing.fill_gaps` to fill in the missing timestamps; the rows inserted for those gaps get NaN in all the other columns.

filled = fill_gaps(df, freq='D', start='per_serie', end='per_serie', time_col='date')
filled.loc[filled['sales'].isnull(), 'date'].value_counts()
# date
# 2013-12-25 1782
# 2014-12-25 1782
# 2015-12-25 1782
# 2016-12-25 1782
# Name: count, dtype: int64
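A minimal pandas sketch of what `fill_gaps` with `start='per_serie'`/`end='per_serie'` does conceptually; this is my own toy re-implementation on made-up data, not the library's code:

```python
import pandas as pd

# Toy frame: serie "a" is missing 2013-01-02; serie "b" has no gaps.
toy = pd.DataFrame({
    'unique_id': ['a', 'a', 'b'],
    'date': pd.to_datetime(['2013-01-01', '2013-01-03', '2013-01-01']),
    'sales': [1.0, 3.0, 5.0],
})

# Reindex each serie onto its own continuous daily range; the rows
# created for the missing dates get NaN in all remaining columns.
pieces = []
for uid, g in toy.groupby('unique_id'):
    full = pd.date_range(g['date'].min(), g['date'].max(), freq='D', name='date')
    piece = g.set_index('date').reindex(full).reset_index()
    piece['unique_id'] = uid
    pieces.append(piece)
filled_toy = pd.concat(pieces, ignore_index=True)
print(filled_toy)
```

Serie "a" gains one row for 2013-01-02 with NaN sales, which matches the Christmas-day NaN counts we saw above.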

We can use plot_series for a quick plot; it needs time_col and target_col to be specified.

plot_series(filled[filled['date'].between('2016-12', '2017-01')], time_col='date', target_col='sales')

Note that id_col here defaults to unique_id.

We can see the data is missing on the 25th; fill it in with interpolate.

filled['sales_interp'] = filled.groupby('unique_id')['sales'].transform(lambda s: s.interpolate())
plot_series(filled[filled['date'].between('2016-12', '2017-01')], time_col='date', target_col='sales_interp')
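A toy check of why the interpolation is done inside groupby('unique_id'): a serie's leading NaN cannot borrow a value from whichever serie happens to sit above it in the frame (the numbers below are made up):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'unique_id': ['a', 'a', 'a', 'b', 'b'],
    'sales': [1.0, np.nan, 3.0, np.nan, 10.0],
})

# Per-serie interpolation: the gap inside "a" is filled linearly,
# but the leading NaN in "b" has no earlier value within its own
# serie, so it stays NaN instead of leaking from serie "a".
toy['sales_interp'] = toy.groupby('unique_id')['sales'].transform(lambda s: s.interpolate())
print(toy['sales_interp'].tolist())  # [1.0, 2.0, 3.0, nan, 10.0]
```

A plain `toy['sales'].interpolate()` would instead fill "b"'s first value from serie "a"'s last observation.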

Then mark the categorical features:

cat_features = ['store_nbr', 'family']
filled[cat_features] = filled[cat_features].astype('category')

For the holiday data, unlike the one-hot encoding used last time, here we only flag whether each date is a holiday (or a work day).

raw_holidays = pd.read_csv(
    '../input/store-sales-time-series-forecasting/holidays_events.csv',
    parse_dates=['date'],
)
keep_holidays = ~raw_holidays['transferred'] & raw_holidays['locale'].eq('National')
holidays = raw_holidays.loc[keep_holidays, ['date', 'type']].copy()
holidays['is_holiday'] = holidays['type'].eq('Holiday').astype('float')
holidays['is_work_day'] = holidays['type'].eq('Work Day').astype('float')
holidays = holidays.drop(columns='type').groupby('date').max()
holidays.head()
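The trailing groupby('date').max() matters: the raw file can contain several national events on the same date, and without collapsing them the later merge would duplicate every train row for that date. A toy check (the event rows here are invented):

```python
import pandas as pd

# Two events fall on the same date; max() keeps one row per date
# and sets is_holiday if ANY of that date's events is a Holiday.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2014-12-26', '2014-12-26', '2015-01-01']),
    'type': ['Holiday', 'Additional', 'Holiday'],
})
toy['is_holiday'] = toy['type'].eq('Holiday').astype('float')
toy['is_work_day'] = toy['type'].eq('Work Day').astype('float')
flags = toy.drop(columns='type').groupby('date').max()
print(flags)
```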

Fill the gaps in the oil price series and merge the tables together.

oil = pd.read_csv(
    '../input/store-sales-time-series-forecasting/oil.csv',
    parse_dates=['date'],
)
filled_oil = oil.set_index('date').reindex(pd.date_range(oil['date'].min(), oil['date'].max(), freq='D', name='date'))
filled_oil = filled_oil.interpolate(limit_direction='both').reset_index()
filled_oil.head()


def assemble_df(df, holidays, oil):
    df = df.merge(holidays, on='date', how='left').merge(oil, on='date', how='left')
    df[['is_holiday', 'is_work_day']] = df[['is_holiday', 'is_work_day']].fillna(0)
    return df


train = assemble_df(filled.drop(columns=['sales', 'onpromotion']), holidays, filled_oil)
train.head()
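The limit_direction='both' on the oil interpolation matters because the series starts with a missing price, which a plain forward interpolation would leave as NaN; a minimal check with made-up numbers:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, 1.0, np.nan, 3.0])

# Default interpolate only fills forward, so the leading NaNs survive;
# limit_direction='both' also back-fills them from the first valid value.
forward_only = s.interpolate()
both = s.interpolate(limit_direction='both')
print(forward_only.tolist())  # [nan, nan, 1.0, 2.0, 3.0]
print(both.tolist())          # [1.0, 1.0, 1.0, 2.0, 3.0]
```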

MLForecast Model

%%time
log_tfm = GlobalSklearnTransformer(
    FunctionTransformer(func=np.log1p, inverse_func=np.expm1)
)
model = lgb.LGBMRegressor(
    n_estimators=400,
    learning_rate=1e-2,
    num_leaves=256,
    num_threads=4,
    force_col_wise=True,
    verbosity=-1,
)

mlf = MLForecast(
    models={'lgb': model},
    freq='D',
    target_transforms=[log_tfm],
    lags=[7, 14],
    lag_transforms={
        1: [ExpandingMean()],
        7: [RollingMean(window_size=14), RollingMean(window_size=28)],
        14: [RollingMean(window_size=14), RollingMean(window_size=28)],
    },
    date_features=['dayofweek', 'day', 'month', 'year'],
    num_threads=4,
)
mlf.fit(
    train,
    time_col='date',
    target_col='sales_interp',
    static_features=cat_features,
)

Here we use an ExpandingMean on lag 1, plus RollingMeans with 14- and 28-day windows on lags 7 and 14. Plain lags are listed separately in `lags`.
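My reading of these transform semantics, sketched in plain pandas on a toy series (this mirrors what I believe mlforecast computes per serie; it is not the library's actual code):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(30, dtype=float))

# lags=[7, 14]: plain shifted copies of the target.
lag7 = s.shift(7)

# lag_transforms={7: [RollingMean(window_size=14)]}: shift first, then
# roll, so the window never touches the 7 most recent (future-leaking)
# values. First valid feature appears once 7 + 14 past values exist.
rolling_mean_lag7_w14 = s.shift(7).rolling(14).mean()

# lag_transforms={1: [ExpandingMean()]}: mean of everything up to t-1.
expanding_mean_lag1 = s.shift(1).expanding().mean()

print(expanding_mean_lag1.iloc[3])   # mean of s[0..2] = 1.0
print(rolling_mean_lag7_w14.iloc[20])  # mean of s[0..13] = 6.5
```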

class RollingMean

Args:

lag (int): Number of periods to offset by before applying the transformation.
window_size (int): Length of the rolling window.
min_samples (int, optional): Minimum number of samples required to compute the statistic. If None, defaults to window_size.


class ExpandingMean

Args:

lag (int): Number of periods to offset by before applying the transformation

The base model is lgb.LGBMRegressor.

prediction

Prepare the data the same way and feed it to mlf; I won't do any further analysis here.

test = pd.read_csv('../input/store-sales-time-series-forecasting/test.csv', parse_dates=['date'])
test['unique_id'] = test['store_nbr'].astype(str) + '_' + test['family']
X_df = assemble_df(test[['unique_id', 'date', 'onpromotion']], holidays, filled_oil)
X_df.head()

%time preds = mlf.predict(h=16, X_df=X_df)

subm = (
    test
    .merge(preds, on=['unique_id', 'date'])
    .drop(columns='unique_id')
    .rename(columns={'lgb': 'sales'})
    [['id', 'sales']]
    .sort_values('id')
)
subm.to_csv('submission.csv', index=False)
subm