class: center, middle, inverse, title-slide # Programming with Data ## Session 6: Forecasting Walmart Sales ### Dr. Wang Jiwei ### Master of Professional Accounting --- class: inverse, center, middle <!-- Define html_df for displaying small tables in html format --> # Case: Walmart Store Sales Forecasting --- ## The question > How can we predict weekly departmental revenue for Walmart, leveraging our knowledge of Walmart, its business, and some limited historical information - Check out the [Kaggle competition](https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting) - Predict weekly for 115,064 (Store, Department, Week) tuples - From 2012-11-02 to 2013-07-26: test dataset - Using [incomplete] weekly revenue data from 2010-02-05 to 2012-11-01 - By department (some weeks missing for some departments): training dataset --- ## More specifically... - Consider time dimensions - What matters: - Time of the year? - Holidays? - Do different stores or departments behave differently? - Wrinkles: - Walmart won't give us weekly sales in the test data - But they'll tell us how well the algorithm performs when we submit the forecasts to Kaggle - We can't use past week sales for prediction because we won't have it for most of the prediction in the testing data... --- ## Load data and packages <!-- Load in primary data set --> ```r library(tidyverse) # we'll extensively use dplyr here library(lubridate) # Great for simple date functions library(broom) # Display regression results in a tidy way weekly <- read.csv("Data/Session_6_WMT_train.csv") weekly.test <- read.csv("Data/Session_6_WMT_test.csv") weekly.features <- read.csv("Data/Session_6_WMT_features.csv") weekly.stores <- read.csv("Data/Session_6_WMT_stores.csv") ``` - `weekly` is our training data - `weekly.test` is our testing data -- no `Weekly_Sales` column - `weekly.features` is general information about (week, store) pairs - Temperature, pricing, etc. - `weekly.stores` is general information about each store --- ## The data - Revenue by week for each department of each of 45 stores - Department is just a number between 1 and 99 - Date of that week - If the week is considered a holiday for sales purposes - Super Bowl (first Sunday in February), Labor Day (first Monday in September), Black Friday (fourth Friday of November), Christmas - Store data: - Which store the data is for, 1 to 45 - Store type (A, B, or C) - Store size - Other data, by week and location: - Temperature, gas price, markdown, CPI, Unemployment, Holidays --- ## The training data ``` ## Rows: 421,570 ## Columns: 5 ## $ Store <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~ ## $ Dept <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~ ## $ Date <chr> "2010-02-05", "2010-02-12", "2010-02-19", "2010-02-26", "~ ## $ Weekly_Sales <dbl> 24924.50, 46039.49, 41595.55, 19403.54, 21827.90, 21043.3~ ## $ IsHoliday <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA~ ``` ``` ## Store Dept Date Weekly_Sales IsHoliday ## 1 1 1 2010-02-05 24924.50 FALSE ## 2 1 1 2010-02-12 46039.49 TRUE ## 3 1 1 2010-02-19 41595.55 FALSE ## 4 1 1 2010-02-26 19403.54 FALSE ## 5 1 1 2010-03-05 21827.90 FALSE ## 6 1 1 2010-03-12 21043.39 FALSE ``` ``` ## Store Dept Date Weekly_Sales ## Min. : 1.0 Min. : 1.00 Length:421570 Min. : -4989 ## 1st Qu.:11.0 1st Qu.:18.00 Class :character 1st Qu.: 2080 ## Median :22.0 Median :37.00 Mode :character Median : 7612 ## Mean :22.2 Mean :44.26 Mean : 15981 ## 3rd Qu.:33.0 3rd Qu.:74.00 3rd Qu.: 20206 ## Max. :45.0 Max. :99.00 Max. :693099 ## IsHoliday ## Mode :logical ## FALSE:391909 ## TRUE :29661 ## ## ## ``` --- ## Walmart's evaluation metric - Walmart uses [MAE (mean absolute error)](https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/overview/evaluation), but with a twist: - They care more about holidays, so any error on holidays has **5 times** the penalty - They call this WMAE, for *weighted* mean absolute error `$$WMAE = \frac{1}{\sum w_i} \sum_{i=1}^{n} w_i \left|y_i-\hat{y}_i\right|$$` - `\(n\)` is the number of test data points - `\(\hat{y}_i\)` is your prediction - `\(y_i\)` is the actual sales - `\(w_i\)` is 5 on holidays and 1 otherwise ```r # Construct a function in R to calculate WMAE wmae <- function(actual, predicted, holidays) { sum(abs(actual - predicted) * (holidays * 4 + 1), na.rm = TRUE) / (length(actual) + 4 * sum(holidays)) } ``` --- ## Before we get started... - The data isn't very clean: - Markdowns are given by 5 separate variables instead of 1 - Date is text format instead of a date - CPI and unemployment data are missing in around a third of the training data - There are some (week, store, department) groups missing from our training data! - Some features to add: - Year - Week - A unique ID for tracking: (store-department-week) tuples - The ID Walmart requests we use for submissions: "1_1_2012-11-02" - Average sales by (store, department) - Average sales by (week, store, department) --- ## Data cleaning ```r preprocess_data <- function(df) { # Merge the data together (Pulled data from outside of function -- "scoping") # https://bookdown.org/rdpeng/rprogdatascience/scoping-rules-of-r.html df <- left_join(df, weekly.stores) # last col 'isHoliday' is already in train data, join the first 11 col only. df <- left_join(df, weekly.features[ , 1:11]) # I am not sure what exactly the five markdowns represent # All missing markdowns will be assigned to 0 and record the last non-missing df$markdown <- 0 df[!is.na(df$MarkDown1), ]$markdown <- df[!is.na(df$MarkDown1), ]$MarkDown1 df[!is.na(df$MarkDown2), ]$markdown <- df[!is.na(df$MarkDown2), ]$MarkDown2 df[!is.na(df$MarkDown3), ]$markdown <- df[!is.na(df$MarkDown3), ]$MarkDown3 df[!is.na(df$MarkDown4), ]$markdown <- df[!is.na(df$MarkDown4), ]$MarkDown4 df[!is.na(df$MarkDown5), ]$markdown <- df[!is.na(df$MarkDown5), ]$MarkDown5 # Fix dates and add useful time variables df$date <- as.Date(df$Date) df$week <- week(df$date) df$year <- year(df$date) df } ``` ```r df <- preprocess_data(weekly) df[df$Weekly_Sales < 0, ]$Weekly_Sales <- 0 df_test <- preprocess_data(weekly.test) ``` > Model may perform better without using markdown --- ## What this looks like ```r df[91:94, ] %>% select(Store, date, markdown, MarkDown3, MarkDown4, MarkDown5) %>% html_df() ``` <table class="table table-striped table-hover" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> Store </th> <th style="text-align:center;"> date </th> <th style="text-align:center;"> markdown </th> <th style="text-align:center;"> MarkDown3 </th> <th style="text-align:center;"> MarkDown4 </th> <th style="text-align:center;"> MarkDown5 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 91 </td> <td style="text-align:left;"> 1 </td> <td style="text-align:center;"> 2011-10-28 </td> <td style="text-align:center;"> 0.00 </td> <td style="text-align:center;"> NA </td> <td style="text-align:center;"> NA </td> <td style="text-align:center;"> NA </td> </tr> <tr> <td style="text-align:left;"> 92 </td> <td style="text-align:left;"> 1 </td> <td style="text-align:center;"> 2011-11-04 </td> <td style="text-align:center;"> 0.00 </td> <td style="text-align:center;"> NA </td> <td style="text-align:center;"> NA </td> <td style="text-align:center;"> NA </td> </tr> <tr> <td style="text-align:left;"> 93 </td> <td style="text-align:left;"> 1 </td> <td style="text-align:center;"> 2011-11-11 </td> <td style="text-align:center;"> 6551.42 </td> <td style="text-align:center;"> 215.07 </td> <td style="text-align:center;"> 2406.62 </td> <td style="text-align:center;"> 6551.42 </td> </tr> <tr> <td style="text-align:left;"> 94 </td> <td style="text-align:left;"> 1 </td> <td style="text-align:center;"> 2011-11-18 </td> <td style="text-align:center;"> 5988.57 </td> <td style="text-align:center;"> 51.98 </td> <td style="text-align:center;"> 427.39 </td> <td style="text-align:center;"> 5988.57 </td> </tr> </tbody> </table> ```r df[1:2, ] %>% select(date, week, year) %>% html_df() ``` <table class="table table-striped table-hover" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> date </th> <th style="text-align:center;"> week </th> <th style="text-align:center;"> year </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 2010-02-05 </td> <td style="text-align:center;"> 6 </td> <td style="text-align:center;"> 2010 </td> </tr> <tr> <td style="text-align:left;"> 2010-02-12 </td> <td style="text-align:center;"> 7 </td> <td style="text-align:center;"> 2010 </td> </tr> </tbody> </table> --- ## Cleaning: Missing CPI and Unemployment ```r # Fill in missing CPI and Unemployment data df_test <- df_test %>% group_by(Store, year) %>% mutate(CPI = ifelse(is.na(CPI), mean(CPI, na.rm = T), CPI), Unemployment = ifelse(is.na(Unemployment), mean(Unemployment, na.rm = T), Unemployment)) %>% ungroup() ``` .center[<img src="../../../Figures/ifelse.png">] > Apply the (store, year)'s average CPI and average Unemployment to missing data --- ## Cleaning: Adding IDs - Build a unique ID - Since store, week and department are all 2 digits, make a 6 digit number with 2 digits for each - `sswwdd` - Build Walmart's requested ID for submissions - `ss_dd_YYYY-MM-DD` ```r # Unique IDs in the data df$id <- df$Store *10000 + df$week * 100 + df$Dept df_test$id <- df_test$Store *10000 + df_test$week * 100 + df_test$Dept # Unique ID and factor building swd <- c(df$id, df_test$id) # Pool all IDs swd <- unique(swd) # Only keep unique elements swd <- data.frame(id = swd) # Make a data frame swd$swd <- factor(swd$id) # Extract factors for using later # Add unique factors to data -- ensures same factors for both data sets df <- left_join(df, swd) df_test <- left_join(df_test, swd) ``` ```r df_test$Id <- paste0(df_test$Store, '_', df_test$Dept, "_", df_test$date) ``` --- ## What the IDs look like ```r # id: numerical # swd: factor # Id: character html_df(df_test[c(20000, 40000, 60000), c("Store", "week", "Dept", "id", "swd", "Id")]) ``` <table class="table table-striped table-hover" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Store </th> <th style="text-align:center;"> week </th> <th style="text-align:center;"> Dept </th> <th style="text-align:center;"> id </th> <th style="text-align:center;"> swd </th> <th style="text-align:center;"> Id </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 8 </td> <td style="text-align:center;"> 27 </td> <td style="text-align:center;"> 33 </td> <td style="text-align:center;"> 82733 </td> <td style="text-align:center;"> 82733 </td> <td style="text-align:center;"> 8_33_2013-07-05 </td> </tr> <tr> <td style="text-align:left;"> 15 </td> <td style="text-align:center;"> 46 </td> <td style="text-align:center;"> 91 </td> <td style="text-align:center;"> 154691 </td> <td style="text-align:center;"> 154691 </td> <td style="text-align:center;"> 15_91_2012-11-16 </td> </tr> <tr> <td style="text-align:left;"> 23 </td> <td style="text-align:center;"> 52 </td> <td style="text-align:center;"> 25 </td> <td style="text-align:center;"> 235225 </td> <td style="text-align:center;"> 235225 </td> <td style="text-align:center;"> 23_25_2012-12-28 </td> </tr> </tbody> </table> --- ## Add in (store, department) average sales ```r # Calculate average sales by store-dept df <- df %>% group_by(Store, Dept) %>% mutate(store_avg = mean(Weekly_Sales, rm.na = T)) %>% ungroup() # Select the first average sales data for each store-dept df_sa <- df %>% group_by(Store, Dept) %>% slice(1) %>% # Select rows by position select(Store, Dept, store_avg) %>% ungroup() # Distribute the store-dept average sales to the testing data df_test <- left_join(df_test, df_sa) ``` ``` ## Joining, by = c("Store", "Dept") ``` ```r # 36 observations have messed up department codes -- ignore (set to 0) df_test[is.na(df_test$store_avg), ]$store_avg <- 0 # Calculate multipliers based on store_avg (and removing NaN and Inf) df$Weekly_mult <- df$Weekly_Sales / df$store_avg df[!is.finite(df$Weekly_mult), ]$Weekly_mult <- NA ``` --- ## Add in (week, store, dept) average sales ```r # Calculate mean by week-store-dept and distribute to df_test df <- df %>% group_by(Store, Dept, week) %>% mutate(naive_mean = mean(Weekly_Sales, rm.na = T)) %>% ungroup() df_wm <- df %>% group_by(Store, Dept, week) %>% slice(1) %>% ungroup() %>% select(Store, Dept, week, naive_mean) df_test <- df_test %>% arrange(Store, Dept, week) df_test <- left_join(df_test, df_wm) ``` ``` ## Joining, by = c("Store", "Dept", "week") ``` --- ## ISSUE: New (week, store, dept) groups - This is in our testing data! - So we'll need to predict out groups we haven't observed at all ```r table(is.na(df_test$naive_mean)) ``` ``` ## ## FALSE TRUE ## 113827 1237 ``` - Fix: Fill with 1 or 2 lags where possible using `ifelse()` and `lag()` - Fix: Fill with 1 or 2 leads where possible using `ifelse()` and `lead()` - Fill with `store_avg` when the above fail - Code is available in the code file -- a bunch of code like: ```r df_test <- df_test %>% arrange(Store, Dept, date) %>% group_by(Store, Dept) %>% mutate(naive_mean=ifelse(is.na(naive_mean), lag(naive_mean), naive_mean)) %>% ungroup() ``` --- ## Cleaning is done - Data is in order - No missing values where data is needed - Needed values created ```r df %>% group_by(week, Store) %>% mutate(sales = mean(Weekly_Sales)) %>% slice(1) %>% ungroup() %>% ggplot(aes(y = sales, x = week, color = factor(Store))) + geom_line() + xlab("Week") + ylab("Sales for Store (dept average)") + theme(legend.position = "none") # remove the plot legend ``` <img src="Session_6s_Kaggle_files/figure-html/unnamed-chunk-18-1.png" width="100%" style="display: block; margin: auto;" /> --- ## How much time on data prep? .center[<img src="../../../Figures/datajobtime.jpg" width="600px">] .center[<a target="_blank" href="https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/">The Survey</a>] --- ## Feature engineering techniques > There are many ways to prepare data. You may read the following articles for a summary of typical feature engineering techniques. We will apply more techniques in future topics. .center[<a target="_blank" href="https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114">Fundamental Techniques of Feature Engineering for Machine Learning</a>] .center[<a target="_blank" href="https://towardsdatascience.com/the-hitchhikers-guide-to-feature-extraction-b4c157e96631">The Hitchhiker’s Guide to Feature Extraction</a>] --- class: inverse, center, middle # Tackling the problem --- ## First try .pull-left[ - Ideal: Use last week to predict next week! <img src="../../../Figures/red_x.png"> > No data for testing... ] .pull-right[ - First instinct: try to use a linear regression to solve this <img src="../../../Figures/green_check.png"> > We have this ] --- ## What to put in the model? .center[<img src="../../../Figures/wmt_weekly.png" height="500px">] --- ## First model ```r mod1 <- lm(Weekly_mult ~ factor(IsHoliday) + factor(markdown > 0) + markdown + Temperature + Fuel_Price + CPI + Unemployment, data = df) tidy(mod1) ``` ``` ## # A tibble: 8 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 1.25 0.0100 125. 0 ## 2 factor(IsHoliday)TRUE 0.0597 0.00337 17.7 2.00e- 70 ## 3 factor(markdown > 0)TRUE 0.0486 0.00240 20.3 3.42e- 91 ## 4 markdown 0.000000697 0.000000237 2.94 3.32e- 3 ## 5 Temperature -0.000832 0.0000490 -17.0 1.16e- 64 ## 6 Fuel_Price -0.0721 0.00223 -32.3 1.23e-228 ## 7 CPI -0.0000842 0.0000241 -3.50 4.67e- 4 ## 8 Unemployment 0.00406 0.000494 8.22 1.97e- 16 ``` ```r glance(mod1) ``` ``` ## # A tibble: 1 x 12 ## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.00556 0.00554 0.549 337. 0 7 -345649. 691317. 691415. ## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int> ``` --- ## Prep submission and in-sample WMAE ```r # Out of sample result df_test$Weekly_mult <- predict(mod1, df_test) df_test$Weekly_Sales <- df_test$Weekly_mult * df_test$store_avg # Required to submit a csv of Id and Weekly_Sales write.csv(df_test[ , c("Id", "Weekly_Sales")], "WMT_linear.csv", row.names = FALSE) # track df_test$WS_linear <- df_test$Weekly_Sales # Check in sample WMAE df$WS_linear <- predict(mod1, df) * df$store_avg w <- wmae(actual = df$Weekly_Sales, predicted = df$WS_linear, holidays = df$IsHoliday) names(w) <- "Linear" wmaes <- c(w) wmaes ``` ``` ## Linear ## 3040.644 ``` --- ## Performance for linear model <img src="../../../Figures/WMT_linear.png" width="700px"> <img src="../../../Figures/WMT_linear_rank.png" width="700px"> --- ## Visualizing in-sample WMAE ```r # compute WMAE for each obs wmae_obs <- function(actual, predicted, holidays) { abs(actual - predicted) * (holidays * 4 + 1) / (length(actual) + 4 * sum(holidays)) } df$wmaes <- wmae_obs(actual = df$Weekly_Sales, predicted = df$WS_linear, holidays = df$IsHoliday) ggplot(data = df, aes(y = wmaes, x = week, color = factor(IsHoliday))) + geom_jitter(width = 0.25) + xlab("Week") + ylab("WMAE") ``` <img src="Session_6s_Kaggle_files/figure-html/long-1.png" width="100%" style="display: block; margin: auto;" /> --- ## Back to the drawing board... .center[<img src="../../../Figures/wmt_weekly2.png" height="500px">] --- ## Second model: Including week ```r mod2 <- lm(Weekly_mult ~ factor(week) + factor(IsHoliday) + factor(markdown>0) + markdown + Temperature + Fuel_Price + CPI + Unemployment, data=df) tidy(mod2) ``` ``` ## # A tibble: 60 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 1.01 0.0119 84.6 0 ## 2 factor(week)2 -0.0604 0.00982 -6.16 7.48e- 10 ## 3 factor(week)3 -0.0668 0.00983 -6.80 1.05e- 11 ## 4 factor(week)4 -0.0911 0.00983 -9.27 1.93e- 20 ## 5 factor(week)5 0.0432 0.00981 4.41 1.06e- 5 ## 6 factor(week)6 0.166 0.00953 17.4 5.68e- 68 ## 7 factor(week)7 0.227 0.00910 25.0 8.90e-138 ## 8 factor(week)8 0.101 0.00896 11.3 1.09e- 29 ## 9 factor(week)9 0.0722 0.00897 8.05 8.15e- 16 ## 10 factor(week)10 0.0830 0.00899 9.23 2.63e- 20 ## # ... with 50 more rows ``` ```r glance(mod2) ``` ``` ## # A tibble: 1 x 12 ## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.0642 0.0640 0.533 490. 0 59 -332843. 665808. 666476. ## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int> ``` --- ## Prep submission and in-sample WMAE ```r # Out of sample result df_test$Weekly_mult <- predict(mod2, df_test) df_test$Weekly_Sales <- df_test$Weekly_mult * df_test$store_avg # Required to submit a csv of Id and Weekly_Sales write.csv(df_test[ , c("Id", "Weekly_Sales")], "WMT_linear2.csv", row.names = FALSE) # track df_test$WS_linear2 <- df_test$Weekly_Sales # Check in sample WMAE df$WS_linear2 <- predict(mod2, df) * df$store_avg w <- wmae(actual = df$Weekly_Sales, predicted = df$WS_linear2, holidays = df$IsHoliday) names(w) <- "Linear 2" wmaes <- c(wmaes, w) wmaes ``` ``` ## Linear Linear 2 ## 3040.644 3208.144 ``` --- ## Performance for linear model 2 <img src="../../../Figures/WMT_linear2.png" width="700px"> <img src="../../../Figures/WMT_linear2_rank.png" width="700px"> ```r wmaes_out ``` ``` ## Linear Linear 2 ## 4954.4 5540.3 ``` --- ## Visualizing in-sample WMAE ```r df$wmaes <- wmae_obs(actual = df$Weekly_Sales, predicted = df$WS_linear2, holidays = df$IsHoliday) ggplot(data=df, aes(y = wmaes, x = week, color = factor(IsHoliday))) + geom_jitter(width = 0.25) + xlab("Week") + ylab("WMAE") ``` <img src="Session_6s_Kaggle_files/figure-html/unnamed-chunk-25-1.png" width="100%" style="display: block; margin: auto;" /> --- ## Visualizing in-sample WMAE by Store ```r ggplot(data=df, aes(y = wmae_obs(Weekly_Sales, WS_linear2, IsHoliday), x = week, color = factor(Store))) + geom_jitter(width = 0.25) + xlab("Week") + ylab("WMAE") + theme(legend.position = "none") ``` <img src="Session_6s_Kaggle_files/figure-html/unnamed-chunk-26-1.png" width="100%" style="display: block; margin: auto;" /> --- ## Visualizing in-sample WMAE by Dept ```r ggplot(data = df, aes(y = wmae_obs(actual = Weekly_Sales, predicted = WS_linear2, holidays = IsHoliday), x = week, color = factor(Dept))) + geom_jitter(width = 0.25) + xlab("Week") + ylab("WMAE") + theme(legend.position = "none") ``` <img src="Session_6s_Kaggle_files/figure-html/unnamed-chunk-27-1.png" width="100%" style="display: block; margin: auto;" /> --- ## Back to the drawing board... .center[<img src="../../../Figures/wmt_weekly3.png" height="500px">] --- ## Third model: Including week x Store x Dept ```r mod3 <- lm(Weekly_mult ~ factor(week):factor(Store):factor(Dept) + factor(IsHoliday) + factor(markdown>0) + markdown + Temperature + Fuel_Price + CPI + Unemployment, data = df) ## Error: cannot allocate vector of size 606.8Gb ``` > ... --- ## Third model: Including week x Store x Dept - Use [`package:fixest`](https://lrberge.github.io/fixest/)'s `feols()`</a> -- it's really more efficient! ```r library(fixest) mod3 <- feols(Weekly_mult ~ markdown + Temperature + Fuel_Price + CPI + Unemployment | swd, data = df) # now you know why create swd tidy(mod3) ``` ``` ## # A tibble: 5 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 markdown -0.00000122 0.000000220 -5.56 2.63e- 8 ## 2 Temperature 0.00130 0.000163 7.95 1.90e- 15 ## 3 Fuel_Price -0.0532 0.00226 -23.5 2.20e-122 ## 4 CPI 0.000190 0.000366 0.518 6.04e- 1 ## 5 Unemployment -0.0291 0.00136 -21.3 8.12e-101 ``` ```r glance(mod3) ``` ``` ## # A tibble: 1 x 9 ## r.squared adj.r.squared within.r.squared pseudo.r.squared sigma nobs AIC ## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> ## 1 0.708 0.526 0.00373 NA 0.379 421551 498237. ## # ... with 2 more variables: BIC <dbl>, logLik <dbl> ``` --- ## Prep submission and in-sample WMAE ```r # Out of sample result # Not sure why there are NA prediction output although all predictors have no missing data df_test$Weekly_mult <- predict(mod3, df_test) df_test$Weekly_Sales <- df_test$Weekly_mult * df_test$store_avg # Required to submit a csv of Id and Weekly_Sales write.csv(df_test[ , c("Id", "Weekly_Sales")], "WMT_FE.csv", row.names = FALSE) # track df_test$WS_FE <- ifelse(is.na(df_test$Weekly_Sales), 0, df_test$Weekly_Sales) # Check in sample WMAE df$WS_FE <- predict(mod3, df) * df$store_avg w <- wmae(actual = df$Weekly_Sales, predicted = df$WS_FE, holidays = df$IsHoliday) names(w) <- "FE" wmaes <- c(wmaes, w) wmaes ``` ``` ## Linear Linear 2 FE ## 3040.644 3208.144 1551.232 ``` --- ## The general predict() function - `predict()` is a generic function for predictions from the results of various model fitting functions. - The function invokes particular methods which depend on the class of the first argument. - For example, if the first argument is an object from the `lm()` model, `predict()` will call the `predict.lm()` function - Typically model functions have been defined such as `predict.lm()` and `predict.glm()` - [`predcit.fixest()`](https://www.rdocumentation.org/packages/fixest/versions/0.8.4/topics/predict.fixest) is defined in the `fixest` package - You may replace the `predict()` with `predict.fixest()` and get same results. - [Refer the manual here](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/predict) --- ## Performance for FE model <img src="../../../Figures/WMT_FE.png" width="700px"> <img src="../../../Figures/WMT_FE_rank.png" width="700px"> ```r wmaes_out ``` ``` ## Linear Linear 2 FE ## 4954.4 5540.3 3357.9 ``` --- ## Visualizing in-sample WMAE ```r df$wmaes <- wmae_obs(actual = df$Weekly_Sales, predicted = df$WS_FE, holidays = df$IsHoliday) ggplot(data=df, aes(y = wmaes, x = week, color = factor(IsHoliday))) + geom_jitter(width = 0.25) + xlab("Week") + ylab("WMAE") ``` <img src="Session_6s_Kaggle_files/figure-html/unnamed-chunk-33-1.png" width="100%" style="display: block; margin: auto;" /> --- ## Problems with the data > Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13 Labor Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13 Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13 Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13 1. The holidays are not always on the same week (the last indicates the week in the testing data) - The Super Bowl is in weeks 7, 7, 6 and *6* - Labor day isn't in our *testing data* at all! - Black Friday is in weeks 48, 47, and *47* - Christmas is in weeks 53, 52, and *52* - Manually adjust the data for these differences 2. Yearly growth -- we aren't capturing it, since we have such a small time span - We can manually adjust the data for this > Code is in the code file -- a lot of [`package:dplyr`](https://dplyr.tidyverse.org) --- ## Performance overall <img src="../../../Figures/WMT_FE_shift.png" width="700px"> <img src="../../../Figures/WMT_FE_shift_rank.png" width="700px"> ```r wmaes_out ``` ``` ## Linear Linear 2 FE Shifted FE ## 4954.4 5540.3 3357.9 3249.1 ``` --- ## Performance overall <img src="../../../Figures/WMT_naivemean.png" width="700px"> <img src="../../../Figures/WMT_naivemean_rank.png" width="700px"> ```r wmaes_out ``` ``` ## Linear Linear 2 FE Shifted FE Naive Mean ## 4954.40 5540.30 3357.90 3249.10 3167.99 ``` --- ## Performance overall <img src="../../../Figures/WMT_ens.png" width="700px"> <img src="../../../Figures/WMT_ens_rank.png" width="700px"> ```r wmaes_out ``` ``` ## Linear Linear 2 FE Shifted FE Naive Mean Ensemble ## 4954.40 5540.30 3357.90 3249.10 3167.99 3173.30 ``` --- ## This was a real problem! - Walmart provided this data back in 2014 as part of a recruiting exercise - <a target="_blank" href="https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting">Details here</a> - <a target="_blank" href="https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/discussion/8125">Discussion of first place entry</a> - <a target="_blank" href="https://github.com/davidthaler/Walmart_competition_code">Code for first place entry</a> - <a target="_blank" href="https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/discussion/8023">Discussion of second place entry</a> - This is what the group project will be like - Each group tackling a data problem which is hosted on Kaggle or Tianchi - You will have training data but testing data will be withheld - You will need to submit to Kaggle/Tianchi for model evaluation --- ## Project deliverables 1. Submission to Kaggle/Tianchi - For model evaluation purpose 2. Submission to me: A .rmd (and .html + .pdf) file including: - The integrated code chunks - Main points and findings - Exploratory analysis of the data used - Your model development, implementation, evaluation, and refinement - A conclusion on how well your group did and what you learned - No zipped file please 3. A group presentation in the last session - A presentation slides (.rmd or .pptx) shall also be submitted - All members to present 4. If files > 50M, please submit through a shared folder using OneDrive or Google Drive. Keep all folder structure with all files and data, and make sure I can reproduce your code without any changes. --- ## Ethics > Kaggle 1st place winner <a target="_blank" href="https://towardsdatascience.com/kaggle-1st-place-winner-cheated-10-000-prize-declared-irrecoverable-bb7e1b639365">cheated</a>, $10,000 prize declared irrecoverable .center[<img src="../../../Figures/kaggle_cheat.jpeg" height = "300px">] --- class: inverse, center, middle # Summary of Session 6 --- ## For next week - Try to replicate the code - You should have completed exploring your project data - Continue your Datacamp career track - [Logistic regression](https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-how-do-i-interpret-odds-ratios-in-logistic-regression) for classification problems --- ## R Coding Style Guide Style is subjective and arbitrary but it is important to follow a generally accepted style if you want to share code with others. I suggest the [The tidyverse style guide](https://style.tidyverse.org/) which is also adopted by [Google](https://google.github.io/styleguide/Rguide.html) with some modification - Highlights of **the tidyverse style guide**: - *File names*: end with .R - *Identifiers*: variable_name, function_name, try not to use "." as it is reserved by Base R's S3 objects - *Line length*: 80 characters - *Indentation*: two spaces, no tabs (RStudio by default converts tabs to spaces and you may change under global options) - *Spacing*: x = 0, not x=0, no space before a comma, but always place one after a comma - *Curly braces {}*: first on same line, last on own line - *Assignment*: use `<-`, not `=` nor `->` - *Semicolon(;)*: don't use, I used once for the interest of space - *return()*: Use explicit returns in functions: default function return is the last evaluated expression - *File paths*: use [relative file path](https://www.w3schools.com/html/html_filepaths.asp) "../../filename.csv" rather than absolute path "C:/mydata/filename.csv". Backslash needs `\\` --- ## R packages used in this slide This slide was prepared on 2021-09-07 from Session_6s_Kaggle.Rmd with R version 4.1.1 (2021-08-10) Kick Things on Windows 10 x64 build 18362 😄. The attached packages used in this slide are: ``` ## fixest broom lubridate forcats stringr dplyr purrr ## "0.9.0" "0.7.9" "1.7.10" "0.5.1" "1.4.0" "1.0.7" "0.3.4" ## readr tidyr tibble ggplot2 tidyverse kableExtra knitr ## "2.0.1" "1.1.3" "3.1.3" "3.3.5" "1.3.1" "1.3.4" "1.33" ```