class: center, middle, inverse, title-slide

# Programming with Data
## Session 5: Regression Forecasts with Seasonality
### Dr. Wang Jiwei
### Master of Professional Accounting

---
class: inverse, center, middle

<!-- Load in primary data set -->

# Application: Quarterly retail revenue

---
## The question

> How can we predict quarterly revenue for retail companies, leveraging our knowledge of such companies?

- In aggregate
- By store
- By department
- Consider time dimensions
  - What matters:
    - Last quarter?
    - Last year?
    - Other timeframes?
  - Cyclicality/Seasonality

---
class: animated, slideInRight

## Time matters a lot for retail

.center[<img src="../../../Figures/holiday_singlesday.jpg" alt="Double 11" height = "500px">]

---
## How to capture time effects?

.pull-left[
- Autoregression
  - Regress `\(y_t\)` on earlier value(s) of itself
    - Last quarter, last year, etc.
- Controlling for time directly in the model
  - Essentially the same as fixed effects last week
]

.pull-right[
.center[<img src="../../../Figures/calendar.png" width="400px">]
]

---
class: inverse, center, middle

# Quarterly revenue prediction

---
## The data

- From quarterly reports of US retail companies
- Two sets of firms:
  - US "Hypermarkets & Super Centers" ([`GICS`](https://en.wikipedia.org/wiki/Global_Industry_Classification_Standard) gsubind: 30101040)
  - US "Multiline Retail" ([`GICS`](https://en.wikipedia.org/wiki/Global_Industry_Classification_Standard) gind: 255030)
- Data from Compustat - Capital IQ > North America - Daily > Fundamentals Quarterly
  - datadate: all available (1962 to 2020 for this case)

.center[<img src="../../../Figures/walmart.jpg" height="300px">]

---
## Formalization

1. Question
    - How can we predict quarterly revenue for large retail companies?
2. Hypotheses (alternative hypotheses only)
    1. Current quarter revenue helps predict next quarter revenue
    2. Revenue from 3 quarters ago helps predict next quarter revenue (a year-over-year effect)
    3. Different quarters exhibit different patterns (seasonality)
    4. A long-run autoregressive model helps predict next quarter revenue
3. Research design
    - Use OLS for all the above -- t-tests for coefficients
    - Holdout sample (testing data): 2016-2020

---
## Variable generation

.rcode[
```r
library(tidyverse)  # As always
library(plotly)     # interactive graphs
library(lubridate)  # import some sensible date functions

# Generate quarter-over-quarter growth "revtq_gr"
df <- df %>%
  group_by(gvkey) %>%
  mutate(revtq_gr = revtq / lag(revtq) - 1) %>%
  ungroup()

# Generate year-over-year growth "revtq_yoy"
df <- df %>%
  group_by(gvkey) %>%
  mutate(revtq_yoy = revtq / lag(revtq, 4) - 1) %>%
  ungroup()

# Generate first difference "revtq_d"
df <- df %>%
  group_by(gvkey) %>%
  mutate(revtq_d = revtq - lag(revtq)) %>%
  ungroup()

# Generate a proper date in R
# datadate (end of reporting period) is YYMMDDn8. (int 20200630)
# quarter() returns the calendar quarter of the date,
# which may differ from the company's fiscal quarter
df$date <- ymd(df$datadate)    # From lubridate
df$cqtr <- quarter(df$date)    # From lubridate
```
]

---
## Date manipulation in R

- <a target = "_blank" href = "https://lubridate.tidyverse.org/reference/ymd.html">`ymd()`</a> from [`package:lubridate`](https://lubridate.tidyverse.org) is a handy way of converting dates.
- It also has `ydm()`, `mdy()`, `myd()`, `dmy()` and `dym()` - It can handle quarters, times, and date-times as well - <a target = "_blank" href="https://rawgit.com/rstudio/cheatsheets/master/lubridate.pdf">Cheat sheet</a> - It will convert the date format to the ISO 8601 international standard which expresses a day as "2001-02-03". - <a target="_blank" href="https://www.rdocumentation.org/packages/base/versions/3.5.2/topics/as.Date">`as.Date()`</a> from the Base R can take a date formatted as "YYYY/MM/DD" and convert to a proper date value - You can convert other date types using the `format =` argument - e.g., "DD.MM.YYYY" is format code "%d.%m.%Y" - <a target="_blank" href="https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/strptime">Full list of date codes</a> - The default date format also follows ISO 8601. - The following code can do the same as `ymd()` ```r # Generate a proper date in R # Datadate is YYMMDDn8. (integer 20200630) df$date <- as.Date(as.character(df$datadate), format = "%Y%m%d") ``` --- ## Example output - The following shows some selective columns <table class="table table-striped table-hover" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> conm </th> <th style="text-align:center;"> date </th> <th style="text-align:center;"> revtq </th> <th style="text-align:center;"> revtq_gr </th> <th style="text-align:center;"> revtq_yoy </th> <th style="text-align:center;"> revtq_d </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> ALLIED STORES </td> <td style="text-align:center;"> 1962-04-30 </td> <td style="text-align:center;"> 156.5 </td> <td style="text-align:center;"> NA </td> <td style="text-align:center;"> NA </td> <td style="text-align:center;"> NA </td> </tr> <tr> <td style="text-align:left;"> ALLIED STORES </td> <td style="text-align:center;"> 1962-07-31 </td> <td style="text-align:center;"> 161.9 </td> <td style="text-align:center;"> 0.0345048 </td> <td style="text-align:center;"> NA </td> <td style="text-align:center;"> 5.4 </td> </tr> <tr> <td style="text-align:left;"> ALLIED STORES </td> <td style="text-align:center;"> 1962-10-31 </td> <td style="text-align:center;"> 176.9 </td> <td style="text-align:center;"> 0.0926498 </td> <td style="text-align:center;"> NA </td> <td style="text-align:center;"> 15.0 </td> </tr> <tr> <td style="text-align:left;"> ALLIED STORES </td> <td style="text-align:center;"> 1963-01-31 </td> <td style="text-align:center;"> 275.5 </td> <td style="text-align:center;"> 0.5573770 </td> <td style="text-align:center;"> NA </td> <td style="text-align:center;"> 98.6 </td> </tr> <tr> <td style="text-align:left;"> ALLIED STORES </td> <td style="text-align:center;"> 1963-04-30 </td> <td style="text-align:center;"> 171.1 </td> <td style="text-align:center;"> -0.3789474 </td> <td style="text-align:center;"> 0.0932907 </td> <td style="text-align:center;"> -104.4 </td> </tr> <tr> <td style="text-align:left;"> ALLIED STORES </td> <td style="text-align:center;"> 1963-07-31 </td> <td style="text-align:center;"> 182.2 </td> <td style="text-align:center;"> 0.0648743 </td> <td style="text-align:center;"> 0.1253860 </td> <td style="text-align:center;"> 11.1 </td> </tr> </tbody> </table> ``` ## # A tibble: 6 x 5 ## conm date datadate fqtr cqtr ## <chr> <date> <int> <int> <int> ## 1 ALLIED STORES 1962-04-30 19620430 1 2 ## 2 ALLIED STORES 1962-07-31 19620731 2 3 ## 3 ALLIED STORES 1962-10-31 19621031 3 4 ## 4 ALLIED STORES 1963-01-31 19630131 4 1 ## 5 ALLIED STORES 
1963-04-30 19630430 1 2 ## 6 ALLIED STORES 1963-07-31 19630731 2 3 ``` --- ## Create 8 quarters (2 years) of lags ```r # Brute force code for variable generation of quarterly data lags df <- df %>% group_by(gvkey) %>% mutate(revtq_l1 = lag(revtq), revtq_l2 = lag(revtq, 2), revtq_l3 = lag(revtq, 3), revtq_l4 = lag(revtq, 4), revtq_l5 = lag(revtq, 5), revtq_l6 = lag(revtq, 6), revtq_l7 = lag(revtq, 7), revtq_l8 = lag(revtq, 8), revtq_gr1 = lag(revtq_gr), revtq_gr2 = lag(revtq_gr, 2), revtq_gr3 = lag(revtq_gr, 3), revtq_gr4 = lag(revtq_gr, 4), revtq_gr5 = lag(revtq_gr, 5), revtq_gr6 = lag(revtq_gr, 6), revtq_gr7 = lag(revtq_gr, 7), revtq_gr8 = lag(revtq_gr, 8), revtq_yoy1 = lag(revtq_yoy), revtq_yoy2 = lag(revtq_yoy, 2), revtq_yoy3 = lag(revtq_yoy, 3), revtq_yoy4 = lag(revtq_yoy, 4), revtq_yoy5 = lag(revtq_yoy, 5), revtq_yoy6 = lag(revtq_yoy, 6), revtq_yoy7 = lag(revtq_yoy, 7), revtq_yoy8 = lag(revtq_yoy, 8), revtq_d1 = lag(revtq_d), revtq_d2 = lag(revtq_d, 2), revtq_d3 = lag(revtq_d, 3), revtq_d4 = lag(revtq_d, 4), revtq_d5 = lag(revtq_d, 5), revtq_d6 = lag(revtq_d, 6), revtq_d7 = lag(revtq_d, 7), revtq_d8 = lag(revtq_d, 8)) %>% ungroup() ``` --- ## Create 8 quarters (2 years) of lags ```r # Custom function to generate a series of lags library(rlang) multi_lag <- function(df, lags, var, postfix="") { var <- enquo(var) quosures <- map(lags, ~quo(lag(!!var, !!.x))) %>% set_names(paste0(quo_text(var), postfix, lags)) return(ungroup(mutate(group_by(df, gvkey), !!!quosures))) } # Generate lags "revtq_l#" df <- multi_lag(df, 1:8, revtq, "_l") # Generate changes "revtq_gr#" df <- multi_lag(df, 1:8, revtq_gr) # Generate year-over-year changes "revtq_yoy#" df <- multi_lag(df, 1:8, revtq_yoy) # Generate first differences "revtq_d#" df <- multi_lag(df, 1:8, revtq_d) ``` - require more advanced understanding of [`metaprogramming`](https://en.wikipedia.org/wiki/Metaprogramming), [`advanced R`](https://adv-r.hadley.nz/metaprogramming.html), [`tidy evaluation`](https://dplyr.tidyverse.org/articles/programming.html), and [`quosure`](https://www.rdocumentation.org/packages/rlang/versions/0.1/topics/quosure) concepts. 
- [`paste0()`](https://www.rdocumentation.org/packages/base/versions/3.5.2/topics/paste): creates a string vector by concatenating all inputs - [`paste()`](https://www.rdocumentation.org/packages/base/versions/3.5.2/topics/paste): same as `paste0()`, but with spaces added in between --- ## Example output <table class="table table-striped table-hover" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> conm </th> <th style="text-align:center;"> date </th> <th style="text-align:center;"> revtq </th> <th style="text-align:center;"> revtq_l1 </th> <th style="text-align:center;"> revtq_gr1 </th> <th style="text-align:center;"> revtq_yoy1 </th> <th style="text-align:center;"> revtq_d1 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> ALLIED STORES </td> <td style="text-align:center;"> 1962-04-30 </td> <td style="text-align:center;"> 156.5 </td> <td style="text-align:center;"> NA </td> <td style="text-align:center;"> NA </td> <td style="text-align:center;"> NA </td> <td style="text-align:center;"> NA </td> </tr> <tr> <td style="text-align:left;"> ALLIED STORES </td> <td style="text-align:center;"> 1962-07-31 </td> <td style="text-align:center;"> 161.9 </td> <td style="text-align:center;"> 156.5 </td> <td style="text-align:center;"> NA </td> <td style="text-align:center;"> NA </td> <td style="text-align:center;"> NA </td> </tr> <tr> <td style="text-align:left;"> ALLIED STORES </td> <td style="text-align:center;"> 1962-10-31 </td> <td style="text-align:center;"> 176.9 </td> <td style="text-align:center;"> 161.9 </td> <td style="text-align:center;"> 0.0345048 </td> <td style="text-align:center;"> NA </td> <td style="text-align:center;"> 5.4 </td> </tr> <tr> <td style="text-align:left;"> ALLIED STORES </td> <td style="text-align:center;"> 1963-01-31 </td> <td style="text-align:center;"> 275.5 </td> <td style="text-align:center;"> 176.9 </td> <td style="text-align:center;"> 0.0926498 </td> <td style="text-align:center;"> NA </td> <td style="text-align:center;"> 15.0 </td> </tr> <tr> <td style="text-align:left;"> ALLIED STORES </td> <td style="text-align:center;"> 1963-04-30 </td> <td style="text-align:center;"> 171.1 </td> <td style="text-align:center;"> 275.5 </td> <td style="text-align:center;"> 0.5573770 </td> <td style="text-align:center;"> NA </td> <td style="text-align:center;"> 98.6 </td> </tr> <tr> <td style="text-align:left;"> ALLIED STORES </td> <td style="text-align:center;"> 1963-07-31 </td> <td style="text-align:center;"> 182.2 </td> <td style="text-align:center;"> 171.1 </td> <td style="text-align:center;"> -0.3789474 </td> <td style="text-align:center;"> 0.0932907 </td> <td style="text-align:center;"> -104.4 </td> </tr> </tbody> </table> --- ## Clean and holdout sample ```r # Clean the data: Replace NaN, Inf, and -Inf with NA df <- df %>% mutate_if(is.numeric, list(~replace(., !is.finite(.), NA))) # Split into training and test datasets # Training dataset: We'll use data released before 2016 train <- filter(df, year(date) < 2016) # Test dataset: We'll use data released 2016 through 2020 (till 3Q2020) test <- filter(df, year(date) >= 2016) ``` - Same cleaning function as last week: - Replaces all `NaN`, `Inf`, and `-Inf` with `NA` - `year()` comes from [`package:lubridate`](https://lubridate.tidyverse.org) --- ## Training vs. test datasets > train a model and test/validate it using the same set of data? 
--

- We build analytics models for forecasting and other predictive purposes
- The key question: **can the model be generalized to a new dataset?**
- We need a new dataset to test how well the model performs
- Existing data will be divided into <a target="_blank" href="https://developers.google.com/machine-learning/crash-course/training-and-test-sets/splitting-data">training data and test data</a>
  - Training data will be used to train/build the model
    - It can be further divided into a training set and a validation set
    - The validation set can be used to further tune the model (e.g., to detect overfitting), which helps us arrive at the best-tuned model
    - We will cover (cross) validation in a future topic
  - Testing data will be used to test how well the model performs
  - In general, an 80/20 split (the Pareto principle) is applied to divide the dataset

<img src="../../../Figures/PartitionTwoSets.svg" alt="Training and test sets">

---
## Workflow with training/test sets

<img src="../../../Figures/WorkflowWithValidationSet.svg" alt="Workflow of model building">

---
class: inverse, center, middle

# Univariate stats

---
## Univariate stats

- To get a better grasp on the problem, looking at univariate stats can help
  - Summary stats (using `summary()`)
  - Correlations using `cor()`
  - Plots using your preferred package such as [`package:ggplot2`](https://ggplot2.tidyverse.org)

```r
summary(df[ , c("revtq", "revtq_gr", "revtq_yoy", "revtq_d", "fqtr")])
```

```
##      revtq            revtq_gr          revtq_yoy           revtq_d          
##  Min.   :     0.00   Min.   :-1.0000   Min.   :-1.0000   Min.   :-24325.206  
##  1st Qu.:    66.01   1st Qu.:-0.1091   1st Qu.: 0.0024   1st Qu.:   -20.260  
##  Median :   312.59   Median : 0.0501   Median : 0.0704   Median :     4.548  
##  Mean   :  2545.48   Mean   : 0.0625   Mean   : 0.1185   Mean   :    23.730  
##  3rd Qu.:  1386.50   3rd Qu.: 0.2032   3rd Qu.: 0.1476   3rd Qu.:    60.146  
##  Max.   :141671.00   Max.   :14.3333   Max.   :47.6600   Max.   : 16117.000  
##  NA's   :394         NA's   :731       NA's   :1020      NA's   :704         
##       fqtr      
##  Min.   :1.000  
##  1st Qu.:1.000  
##  Median :2.000  
##  Mean   :2.479  
##  3rd Qu.:3.000  
##  Max.   :4.000  
## 
```

---
## ggplot2 for visualization

- The following slides will use some custom functions built with <a target="_blank" href="http://www.cookbook-r.com/Graphs/">`package:ggplot2`</a>
- [`package:ggplot2`](https://ggplot2.tidyverse.org) has an odd syntax:
  - It doesn't use pipes (`%>%`), but instead adds layers together (`+`)

```r
library(ggplot2)  # or tidyverse -- it's part of tidyverse
df %>%
  ggplot(aes(y = var_for_y_axis, x = var_for_x_axis)) +
  geom_point()  # scatterplot
```

- `aes()` is for aesthetics -- how the chart is set up
- Other useful aesthetics:
  - `group =` to set groups to list in the legend. Not needed if you use `color =` below
  - `color =` to set color by some grouping variable. Put `factor()` around the variable if you want discrete groups, otherwise it will do a color scale (light to dark)
  - `shape =` to set shapes for points -- <a target="_blank" href="https://cran.r-project.org/web/packages/ggplot2/vignettes/ggplot2-specs.html">see here for a list</a>

---
## ggplot2 for visualization

```r
library(ggplot2)  # or tidyverse -- it's part of tidyverse
df %>%
  ggplot(aes(y = var_for_y_axis, x = var_for_x_axis)) +
  geom_point()  # scatterplot
```

- `geom` stands for geometry -- all the shapes, lines, etc.
start with `geom`
- Other useful geoms:
  - `geom_line()`: makes a line chart
  - `geom_bar()`: makes a bar chart by counting observations in each x category; use `geom_col()` when y is the bar height
  - `geom_smooth(method = "lm")`: adds a fitted linear regression line to the chart
  - `geom_abline(slope = 1)`: adds a 45° line
- Add `xlab("Label text here")` to change the x-axis label
- Add `ylab("Label text here")` to change the y-axis label
- Add `ggtitle("Title text here")` to add a title
- Plenty more details in the <a target="_blank" href="https://github.com/rstudio/cheatsheets/blob/master/data-visualization-2.1.pdf">'Data Visualization Cheat Sheet'</a>

---
## Plotting: Distribution of revenue

.pull-left[
* (1) Revenue

<img src="Session_5s_files/figure-html/unnamed-chunk-15-1.png" width="100%" style="display: block; margin: auto;" />

* (2) Quarterly growth

<img src="Session_5s_files/figure-html/unnamed-chunk-16-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
* (3) Year-over-year growth

<img src="Session_5s_files/figure-html/unnamed-chunk-17-1.png" width="100%" style="display: block; margin: auto;" />

* (4) First difference

<img src="Session_5s_files/figure-html/unnamed-chunk-18-1.png" width="100%" style="display: block; margin: auto;" />
]

---
## What we learn from the graphs?

1. Revenue
    - `\(~\)`
2. Quarterly growth
    - `\(~\)`
3. Year-over-year growth
    - `\(~\)`
4. First difference
    - `\(~\)`

---
## What we learn from the graphs?

1. Revenue
    - This is really skewed data -- a lot of small revenue quarters, but a significant number of large-revenue quarters in the tail
    - Potential fix: use `log(revtq)`?
2. Quarterly growth
    - Quarterly growth is reasonably close to normally distributed
    - Good for OLS
3. Year-over-year growth
    - Year-over-year growth is reasonably close to normally distributed
    - Good for OLS
4. First difference
    - Reasonably close to normally distributed, with really long tails
    - Good enough for OLS

---
## Plotting: Mean revenue by quarter

<div class="clearfix">
.pull-left[
(1) Revenue

<img src="Session_5s_files/figure-html/unnamed-chunk-19-1.png" width="100%" style="display: block; margin: auto;" />

(2) Quarterly growth

<img src="Session_5s_files/figure-html/unnamed-chunk-20-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
(3) Year-over-year growth

<img src="Session_5s_files/figure-html/unnamed-chunk-21-1.png" width="100%" style="display: block; margin: auto;" />

(4) First difference

<img src="Session_5s_files/figure-html/unnamed-chunk-22-1.png" width="100%" style="display: block; margin: auto;" />
]

---
## What we learn from the graphs?

1. Revenue
    - `\(~\)`
2. Quarterly growth
    - `\(~\)`
3. Year-over-year growth
    - `\(~\)`
4. First difference
    - `\(~\)`

---
## What we learn from the graphs?

1. Revenue
    - Revenue seems cyclical!
2. Quarterly growth
    - Definitely cyclical!
3. Year-over-year growth
    - Year-over-year growth is less cyclical -- more constant
4. First difference
    - Definitely cyclical!
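---
## Plotting code: a sketch

The distribution and by-quarter charts above are embedded as pre-rendered images; the plotting code is not shown in these slides. Below is a minimal sketch of how the mean-by-quarter chart could be drawn with [`package:ggplot2`](https://ggplot2.tidyverse.org), assuming the `df` built earlier (with the `fqtr` and `revtq_yoy` columns defined above):

```r
library(tidyverse)

# Mean year-over-year growth by fiscal quarter -- a rough version of panel (3)
df %>%
  filter(is.finite(revtq_yoy)) %>%           # drop NA/NaN/Inf growth rates
  group_by(fqtr) %>%
  summarise(mean_yoy = mean(revtq_yoy)) %>%
  ggplot(aes(x = factor(fqtr), y = mean_yoy)) +
  geom_col() +                               # bar height = pre-computed mean
  xlab("Fiscal quarter") +
  ylab("Mean year-over-year revenue growth")
```

Swapping `revtq_yoy` for `revtq`, `revtq_gr`, or `revtq_d` gives rough versions of the other panels.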
--- ## Plotting: Revenue vs lag by quarter .pull-left[ * (1) Revenue <img src="Session_5s_files/figure-html/unnamed-chunk-23-1.png" width="100%" style="display: block; margin: auto;" /> * (2) Quarterly growth <img src="Session_5s_files/figure-html/unnamed-chunk-24-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ * (3) Year-over-year growth <img src="Session_5s_files/figure-html/unnamed-chunk-25-1.png" width="100%" style="display: block; margin: auto;" /> * (4) First difference <img src="Session_5s_files/figure-html/unnamed-chunk-26-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## What we learn from the graphs? 1. Revenue - `\(~\)` 2. Quarterly growth - `\(~\)` 3. Year-over-year growth - `\(~\)` 4. First difference - `\(~\)` --- ## What we learn from the graphs? 1. Revenue - Revenue is really linear! But each quarter has a distinct linear relation. 2. Quarterly growth - All over the place. Each quarter appears to have a different pattern though. Quarters will matter. 3. Year-over-year growth - Linear but noisy. 4. First difference - Again, all over the place. Each quarter appears to have a different pattern though. Quarters will matter. --- ## Correlation matrices ```r cor(train[,c("revtq","revtq_l1","revtq_l2","revtq_l3","revtq_l4")], use = "complete.obs") # delete row if with NA ``` ``` ## revtq revtq_l1 revtq_l2 revtq_l3 revtq_l4 ## revtq 1.0000000 0.9917996 0.9939751 0.9907381 0.9973540 ## revtq_l1 0.9917996 1.0000000 0.9917016 0.9938476 0.9901821 ## revtq_l2 0.9939751 0.9917016 1.0000000 0.9916042 0.9932811 ## revtq_l3 0.9907381 0.9938476 0.9916042 1.0000000 0.9910049 ## revtq_l4 0.9973540 0.9901821 0.9932811 0.9910049 1.0000000 ``` ```r cor(train[,c("revtq_gr","revtq_gr1","revtq_gr2","revtq_gr3","revtq_gr4")], use = "complete.obs") ``` ``` ## revtq_gr revtq_gr1 revtq_gr2 revtq_gr3 revtq_gr4 ## revtq_gr 1.00000000 -0.33021570 0.06675942 -0.23736085 0.65335232 ## revtq_gr1 -0.33021570 1.00000000 -0.32597810 0.06581984 -0.22955824 ## revtq_gr2 0.06675942 -0.32597810 1.00000000 -0.33452265 0.07215056 ## revtq_gr3 -0.23736085 0.06581984 -0.33452265 1.00000000 -0.32429873 ## revtq_gr4 0.65335232 -0.22955824 0.07215056 -0.32429873 1.00000000 ``` -- > Retail revenue has really high autocorrelation! Concern for multicolinearity. Revenue growth is less autocorrelated and oscillates. --- ## Correlation matrices ```r cor(train[,c("revtq_yoy","revtq_yoy1","revtq_yoy2","revtq_yoy3","revtq_yoy4")], use="complete.obs") ``` ``` ## revtq_yoy revtq_yoy1 revtq_yoy2 revtq_yoy3 revtq_yoy4 ## revtq_yoy 1.0000000 0.6588642 0.4183968 0.4216933 0.1805950 ## revtq_yoy1 0.6588642 1.0000000 0.5802585 0.3731204 0.3546604 ## revtq_yoy2 0.4183968 0.5802585 1.0000000 0.5921796 0.3738081 ## revtq_yoy3 0.4216933 0.3731204 0.5921796 1.0000000 0.5710053 ## revtq_yoy4 0.1805950 0.3546604 0.3738081 0.5710053 1.0000000 ``` ```r cor(train[,c("revtq_d","revtq_d1","revtq_d2","revtq_d3","revtq_d4")], use="complete.obs") ``` ``` ## revtq_d revtq_d1 revtq_d2 revtq_d3 revtq_d4 ## revtq_d 1.0000000 -0.6203336 0.3300007 -0.6075689 0.9165429 ## revtq_d1 -0.6203336 1.0000000 -0.6171063 0.3311438 -0.5872559 ## revtq_d2 0.3300007 -0.6171063 1.0000000 -0.6209104 0.3152248 ## revtq_d3 -0.6075689 0.3311438 -0.6209104 1.0000000 -0.5908631 ## revtq_d4 0.9165429 -0.5872559 0.3152248 -0.5908631 1.0000000 ``` -- > Year over year change fixes the multicollinearity issue. First difference oscillates like quarter over quarter growth. 
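---
## Plotting code: revenue vs. lag, a sketch

The revenue-vs-lag scatterplots a few slides back are also pre-rendered images. A minimal sketch of how panel (1) could be reproduced with `ggplot2`, assuming the `train` data and the lag columns created earlier:

```r
library(tidyverse)

# Revenue against its one-quarter lag, coloured by fiscal quarter;
# per-quarter regression lines show the distinct linear relations
train %>%
  filter(!is.na(revtq), !is.na(revtq_l1)) %>%
  ggplot(aes(x = revtq_l1, y = revtq, color = factor(fqtr))) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  xlab("Quarterly revenue, 1 quarter ago") +
  ylab("Quarterly revenue")
```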
--- ## R Practice - This practice will look at predicting Walmart's quarterly revenue using: - 1 lag - Cyclicality - Practice using: - [`mutate()`](https://dplyr.tidyverse.org/reference/mutate.html) - [`lm()`](https://rdrr.io/r/stats/lm.html) - [`package:ggplot2`](https://ggplot2.tidyverse.org) - Do the exercises in today's practice file - <a target="_blank" href="Session_5s_Exercise.html">R Practice</a> --- class: inverse, center, middle # Forecasting --- ## 1 period models - 1 Quarter lag - We saw a very strong linear pattern here earlier ```r mod1 <- lm(revtq ~ revtq_l1, data = train) ``` - Quarter and year lag - Year-over-year seemed pretty constant ```r mod2 <- lm(revtq ~ revtq_l1 + revtq_l4, data = train) ``` - 2 years of lags - Other lags could also help us predict ```r mod3 <- lm(revtq ~ revtq_l1 + revtq_l2 + revtq_l3 + revtq_l4 + revtq_l5 + revtq_l6 + revtq_l7 + revtq_l8, data = train) ``` - 2 years of lags, by observation quarter - Take into account cyclicality observed in bar charts ```r mod4 <- lm(revtq ~ (revtq_l1 + revtq_l2 + revtq_l3 + revtq_l4 + revtq_l5 + revtq_l6 + revtq_l7 + revtq_l8):factor(fqtr), data = train) ``` --- ## Quarter lag ```r summary(mod1) ``` ``` ## ## Call: ## lm(formula = revtq ~ revtq_l1, data = train) ## ## Residuals: ## Min 1Q Median 3Q Max ## -24399.7 -35.8 -13.0 36.3 15314.7 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 17.299837 12.991379 1.332 0.183 ## revtq_l1 1.001776 0.001474 679.753 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1151 on 8294 degrees of freedom ## (702 observations deleted due to missingness) ## Multiple R-squared: 0.9824, Adjusted R-squared: 0.9824 ## F-statistic: 4.621e+05 on 1 and 8294 DF, p-value: < 2.2e-16 ``` --- ## Quarter and year lag ```r summary(mod2) ``` ``` ## ## Call: ## lm(formula = revtq ~ revtq_l1 + revtq_l4, data = train) ## ## Residuals: ## Min 1Q Median 3Q Max ## -20224.4 -21.6 -7.4 17.8 9320.8 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 8.740416 6.900972 1.267 0.205 ## revtq_l1 0.225726 0.005434 41.540 <2e-16 *** ## revtq_l4 0.816635 0.005650 144.532 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 594.5 on 7855 degrees of freedom ## (1140 observations deleted due to missingness) ## Multiple R-squared: 0.9955, Adjusted R-squared: 0.9955 ## F-statistic: 8.753e+05 on 2 and 7855 DF, p-value: < 2.2e-16 ``` --- ## 2 years of lags ```r summary(mod3) ``` ``` ## ## Call: ## lm(formula = revtq ~ revtq_l1 + revtq_l2 + revtq_l3 + revtq_l4 + ## revtq_l5 + revtq_l6 + revtq_l7 + revtq_l8, data = train) ## ## Residuals: ## Min 1Q Median 3Q Max ## -4854.9 -14.8 -5.7 8.0 5868.9 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 6.173259 4.176286 1.478 0.1394 ## revtq_l1 0.785242 0.011881 66.095 < 2e-16 *** ## revtq_l2 0.106283 0.015152 7.015 2.52e-12 *** ## revtq_l3 -0.026460 0.014771 -1.791 0.0733 . ## revtq_l4 0.931266 0.011653 79.915 < 2e-16 *** ## revtq_l5 -0.779892 0.012756 -61.141 < 2e-16 *** ## revtq_l6 -0.079794 0.015819 -5.044 4.67e-07 *** ## revtq_l7 0.006604 0.015313 0.431 0.6663 ## revtq_l8 0.065782 0.011621 5.660 1.57e-08 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1 ## ## Residual standard error: 343.8 on 7196 degrees of freedom ## (1793 observations deleted due to missingness) ## Multiple R-squared: 0.9986, Adjusted R-squared: 0.9986 ## F-statistic: 6.536e+05 on 8 and 7196 DF, p-value: < 2.2e-16 ``` --- ## 2 years of lags, by observation quarter ```r summary(mod4) ``` ``` ## ## Call: ## lm(formula = revtq ~ (revtq_l1 + revtq_l2 + revtq_l3 + revtq_l4 + ## revtq_l5 + revtq_l6 + revtq_l7 + revtq_l8):factor(fqtr), ## data = train) ## ## Residuals: ## Min 1Q Median 3Q Max ## -6141.4 -14.6 0.3 15.7 4980.3 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -0.42798 3.89557 -0.110 0.912521 ## revtq_l1:factor(fqtr)1 0.50358 0.02104 23.934 < 2e-16 *** ## revtq_l1:factor(fqtr)2 1.11831 0.02231 50.121 < 2e-16 *** ## revtq_l1:factor(fqtr)3 0.81435 0.02848 28.591 < 2e-16 *** ## revtq_l1:factor(fqtr)4 0.89057 0.02585 34.456 < 2e-16 *** ## revtq_l2:factor(fqtr)1 0.25042 0.03399 7.367 1.94e-13 *** ## revtq_l2:factor(fqtr)2 -0.09685 0.02387 -4.057 5.02e-05 *** ## revtq_l2:factor(fqtr)3 0.21067 0.03883 5.425 5.97e-08 *** ## revtq_l2:factor(fqtr)4 0.27270 0.03498 7.797 7.25e-15 *** ## revtq_l3:factor(fqtr)1 0.07270 0.03563 2.040 0.041349 * ## revtq_l3:factor(fqtr)2 -0.01645 0.03468 -0.474 0.635234 ## revtq_l3:factor(fqtr)3 -0.02509 0.02361 -1.063 0.287895 ## revtq_l3:factor(fqtr)4 -0.20644 0.03805 -5.426 5.96e-08 *** ## revtq_l4:factor(fqtr)1 0.54168 0.03735 14.504 < 2e-16 *** ## revtq_l4:factor(fqtr)2 0.68562 0.03282 20.890 < 2e-16 *** ## revtq_l4:factor(fqtr)3 0.31156 0.03463 8.997 < 2e-16 *** ## revtq_l4:factor(fqtr)4 0.81921 0.01761 46.530 < 2e-16 *** ## revtq_l5:factor(fqtr)1 -0.43451 0.02269 -19.151 < 2e-16 *** ## revtq_l5:factor(fqtr)2 -0.74671 0.03512 -21.260 < 2e-16 *** ## revtq_l5:factor(fqtr)3 -0.23691 0.03639 -6.511 7.99e-11 *** ## revtq_l5:factor(fqtr)4 -0.51489 0.03316 -15.526 < 2e-16 *** ## revtq_l6:factor(fqtr)1 0.04230 0.03444 1.228 0.219304 ## revtq_l6:factor(fqtr)2 0.14185 0.02474 5.735 1.02e-08 *** ## revtq_l6:factor(fqtr)3 -0.15935 0.04103 -3.884 0.000104 *** ## revtq_l6:factor(fqtr)4 -0.03394 0.03657 -0.928 0.353350 ## revtq_l7:factor(fqtr)1 0.12468 0.03689 3.380 0.000729 *** ## revtq_l7:factor(fqtr)2 0.06377 0.03416 1.867 0.061935 . ## revtq_l7:factor(fqtr)3 0.05196 0.02472 2.102 0.035572 * ## revtq_l7:factor(fqtr)4 -0.34505 0.03754 -9.192 < 2e-16 *** ## revtq_l8:factor(fqtr)1 -0.11153 0.03137 -3.555 0.000380 *** ## revtq_l8:factor(fqtr)2 -0.13675 0.02785 -4.911 9.27e-07 *** ## revtq_l8:factor(fqtr)3 0.02570 0.02689 0.956 0.339153 ## revtq_l8:factor(fqtr)4 0.11428 0.01554 7.354 2.13e-13 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1 ## ## Residual standard error: 302.6 on 7172 degrees of freedom ## (1793 observations deleted due to missingness) ## Multiple R-squared: 0.9989, Adjusted R-squared: 0.9989 ## F-statistic: 2.109e+05 on 32 and 7172 DF, p-value: < 2.2e-16 ``` --- ## Testing out of sample - RMSE: Root Mean Square Error - RMSE is very affected by outliers, and a bad choice for noisy data that you are OK with missing a few outliers here and there - Doubling error *quadruples* that part of the error ```r rmse <- function(v1, v2) { sqrt(mean((v1 - v2)^2, na.rm = TRUE)) } ``` - MAE: Mean Absolute Error - MAE is measures average accuracy with no weighting - Doubling error *doubles* that part of the error ```r mae <- function(v1, v2) { mean(abs(v1-v2), na.rm = TRUE) } ``` > Both are commonly used for evaluating OLS out of sample --- ## Testing out of sample <table class="table table-striped table-hover" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> adj_r_sq </th> <th style="text-align:center;"> rmse_in </th> <th style="text-align:center;"> mae_in </th> <th style="text-align:center;"> rmse_out </th> <th style="text-align:center;"> mae_out </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 period </td> <td style="text-align:left;"> 0.9823645 </td> <td style="text-align:center;"> 1151.0560 </td> <td style="text-align:center;"> 323.82144 </td> <td style="text-align:center;"> 2916.3430 </td> <td style="text-align:center;"> 1223.4301 </td> </tr> <tr> <td style="text-align:left;"> 1 and 4 periods </td> <td style="text-align:left;"> 0.9955321 </td> <td style="text-align:center;"> 594.4151 </td> <td style="text-align:center;"> 157.48397 </td> <td style="text-align:center;"> 1143.8276 </td> <td style="text-align:center;"> 553.5204 </td> </tr> <tr> <td style="text-align:left;"> 8 periods </td> <td style="text-align:left;"> 0.9986241 </td> <td style="text-align:center;"> 343.5646 </td> <td style="text-align:center;"> 94.98273 </td> <td style="text-align:center;"> 764.7114 </td> <td style="text-align:center;"> 362.1292 </td> </tr> <tr> <td style="text-align:left;"> 8 periods w/ quarters </td> <td style="text-align:left;"> 0.9989338 </td> <td style="text-align:center;"> 301.9370 </td> <td style="text-align:center;"> 92.26997 </td> <td style="text-align:center;"> 757.4591 </td> <td style="text-align:center;"> 354.6585 </td> </tr> </tbody> </table> .pull-left[ 1 quarter model <img src="Session_5s_files/figure-html/unnamed-chunk-42-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ 8 period model, by quarter <img src="Session_5s_files/figure-html/unnamed-chunk-43-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## What about for revenue growth? 
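A growth forecast has to be converted back into a revenue level before it can be compared with the level models above. The slides do not show the evaluation code; below is a minimal sketch of the idea, using a hypothetical single-lag growth model and the `rmse()`/`mae()` helpers defined earlier. The tables that follow apply the same logic to the four lag structures used before.

```r
# Sketch only -- not the exact code behind the tables
gr_mod <- lm(revtq_gr ~ revtq_gr1, data = train)   # simplest growth model

pred_gr   <- predict(gr_mod, test)                 # predicted growth rate
pred_revt <- (1 + pred_gr) * test$revtq_l1         # implied revenue level

rmse(test$revtq, pred_revt)   # out-of-sample RMSE (cf. rmse_out below)
mae(test$revtq, pred_revt)    # out-of-sample MAE  (cf. mae_out below)
```

---
## What about for revenue growth?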
Backing out a revenue prediction, `\(revt_t=(1+growth_t)\times revt_{t-1}\)` <table class="table table-striped table-hover" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> adj_r_sq </th> <th style="text-align:center;"> rmse_in </th> <th style="text-align:center;"> mae_in </th> <th style="text-align:center;"> rmse_out </th> <th style="text-align:center;"> mae_out </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 period </td> <td style="text-align:left;"> 0.0955220 </td> <td style="text-align:center;"> 1110.5010 </td> <td style="text-align:center;"> 307.8361 </td> <td style="text-align:center;"> 3202.2234 </td> <td style="text-align:center;"> 1338.9696 </td> </tr> <tr> <td style="text-align:left;"> 1 and 4 periods </td> <td style="text-align:left;"> 0.4497703 </td> <td style="text-align:center;"> 530.0174 </td> <td style="text-align:center;"> 152.8021 </td> <td style="text-align:center;"> 1355.5009 </td> <td style="text-align:center;"> 631.5524 </td> </tr> <tr> <td style="text-align:left;"> 8 periods </td> <td style="text-align:left;"> 0.6788386 </td> <td style="text-align:center;"> 463.3719 </td> <td style="text-align:center;"> 123.3965 </td> <td style="text-align:center;"> 1165.7280 </td> <td style="text-align:center;"> 530.6755 </td> </tr> <tr> <td style="text-align:left;"> 8 periods w/ quarters </td> <td style="text-align:left;"> 0.7720057 </td> <td style="text-align:center;"> 381.7661 </td> <td style="text-align:center;"> 99.5676 </td> <td style="text-align:center;"> 986.1408 </td> <td style="text-align:center;"> 452.1947 </td> </tr> </tbody> </table> .pull-left[ 1 quarter model <img src="Session_5s_files/figure-html/unnamed-chunk-45-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ 8 period model, by quarter <img src="Session_5s_files/figure-html/unnamed-chunk-46-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## What about for YoY growth? 
Backing out a revenue prediction, `\(revt_t=(1+yoy\_growth_t)\times revt_{t-4}\)` <table class="table table-striped table-hover" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> adj_r_sq </th> <th style="text-align:center;"> rmse_in </th> <th style="text-align:center;"> mae_in </th> <th style="text-align:center;"> rmse_out </th> <th style="text-align:center;"> mae_out </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 period </td> <td style="text-align:left;"> 0.4376253 </td> <td style="text-align:center;"> 520.7532 </td> <td style="text-align:center;"> 129.1364 </td> <td style="text-align:center;"> 1570.5401 </td> <td style="text-align:center;"> 695.8093 </td> </tr> <tr> <td style="text-align:left;"> 1 and 4 periods </td> <td style="text-align:left;"> 0.5378241 </td> <td style="text-align:center;"> 495.5506 </td> <td style="text-align:center;"> 127.3290 </td> <td style="text-align:center;"> 1400.2662 </td> <td style="text-align:center;"> 642.0383 </td> </tr> <tr> <td style="text-align:left;"> 8 periods </td> <td style="text-align:left;"> 0.5430590 </td> <td style="text-align:center;"> 383.6760 </td> <td style="text-align:center;"> 101.1748 </td> <td style="text-align:center;"> 863.9954 </td> <td style="text-align:center;"> 425.6484 </td> </tr> <tr> <td style="text-align:left;"> 8 periods w/ quarters </td> <td style="text-align:left;"> 0.1462837 </td> <td style="text-align:center;"> 705.8313 </td> <td style="text-align:center;"> 193.7847 </td> <td style="text-align:center;"> 1214.8656 </td> <td style="text-align:center;"> 620.3688 </td> </tr> </tbody> </table> .pull-left[ 1 quarter model <img src="Session_5s_files/figure-html/unnamed-chunk-48-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ 8 period model <img src="Session_5s_files/figure-html/unnamed-chunk-49-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## What about for first difference? 
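The same back-out step applies to the first-difference models: predict the change in revenue, then add back last quarter's level. A minimal sketch (hypothetical single-lag model; the tables below again use the four lag structures from before):

```r
# Sketch only -- not the exact code behind the tables
d_mod <- lm(revtq_d ~ revtq_d1, data = train)   # simplest first-difference model

pred_d    <- predict(d_mod, test)
pred_revt <- pred_d + test$revtq_l1             # revt_t = change_t + revt_(t-1)

rmse(test$revtq, pred_revt)
mae(test$revtq, pred_revt)
```

---
## What about for first difference?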
Backing out a revenue prediction, `\(revt_t = change_t + revt_{t-1}\)` <table class="table table-striped table-hover" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> adj_r_sq </th> <th style="text-align:center;"> rmse_in </th> <th style="text-align:center;"> mae_in </th> <th style="text-align:center;"> rmse_out </th> <th style="text-align:center;"> mae_out </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 period </td> <td style="text-align:left;"> 0.3578089 </td> <td style="text-align:center;"> 896.1441 </td> <td style="text-align:center;"> 286.47866 </td> <td style="text-align:center;"> 2247.2158 </td> <td style="text-align:center;"> 986.9519 </td> </tr> <tr> <td style="text-align:left;"> 1 and 4 periods </td> <td style="text-align:left;"> 0.8502591 </td> <td style="text-align:center;"> 444.9570 </td> <td style="text-align:center;"> 113.00284 </td> <td style="text-align:center;"> 860.6968 </td> <td style="text-align:center;"> 411.8824 </td> </tr> <tr> <td style="text-align:left;"> 8 periods </td> <td style="text-align:left;"> 0.9242547 </td> <td style="text-align:center;"> 329.4611 </td> <td style="text-align:center;"> 95.17826 </td> <td style="text-align:center;"> 764.8854 </td> <td style="text-align:center;"> 348.4883 </td> </tr> <tr> <td style="text-align:left;"> 8 periods w/ quarters </td> <td style="text-align:left;"> 0.9383434 </td> <td style="text-align:center;"> 296.7399 </td> <td style="text-align:center;"> 88.32380 </td> <td style="text-align:center;"> 731.1697 </td> <td style="text-align:center;"> 343.4773 </td> </tr> </tbody> </table> .pull-left[ 1 quarter model <img src="Session_5s_files/figure-html/unnamed-chunk-51-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ 8 period model, by quarter <img src="Session_5s_files/figure-html/unnamed-chunk-52-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Takeaways 1. The first difference model works about as well as the revenue model at predicting next quarter revenue - From earlier, it doesn't suffer (as much) from multicollinearity either - This is why time series analysis is often done on first differences - Or second differences (difference in differences) 2. The other models perform pretty well as well 3. Extra lags generally seems helpful when accounting for cyclicality 4. Regressing by quarter helps a bit, particularly with revenue growth --- ## What about for revenue growth? 
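The growth tables so far scored the *implied revenue level*. The next set of tables instead scores the predicted rate or change itself, so the errors below are in the units of the predicted quantity (here a growth rate) and are not directly comparable to the revenue-level errors above. A minimal sketch of the distinction, reusing the hypothetical single-lag model from the earlier sketch:

```r
# Sketch only: score the forecast on the growth rate, not the revenue level
gr_mod  <- lm(revtq_gr ~ revtq_gr1, data = train)
pred_gr <- predict(gr_mod, test)

rmse(test$revtq_gr, pred_gr)   # error in growth-rate units, not revenue units
```

---
## What about for revenue growth?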
Predicting quarter over quarter revenue growth itself (ie, the growth rate) <table class="table table-striped table-hover" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> adj_r_sq </th> <th style="text-align:center;"> rmse_in </th> <th style="text-align:center;"> mae_in </th> <th style="text-align:center;"> rmse_out </th> <th style="text-align:center;"> mae_out </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 period </td> <td style="text-align:left;"> 0.0955220 </td> <td style="text-align:center;"> 0.3436252 </td> <td style="text-align:center;"> 0.2073042 </td> <td style="text-align:center;"> 0.2087555 </td> <td style="text-align:center;"> 0.1663210 </td> </tr> <tr> <td style="text-align:left;"> 1 and 4 periods </td> <td style="text-align:left;"> 0.4497703 </td> <td style="text-align:center;"> 0.2611941 </td> <td style="text-align:center;"> 0.1103827 </td> <td style="text-align:center;"> 0.1373419 </td> <td style="text-align:center;"> 0.0947553 </td> </tr> <tr> <td style="text-align:left;"> 8 periods </td> <td style="text-align:left;"> 0.6788386 </td> <td style="text-align:center;"> 0.1737244 </td> <td style="text-align:center;"> 0.0848606 </td> <td style="text-align:center;"> 0.1269428 </td> <td style="text-align:center;"> 0.0801675 </td> </tr> <tr> <td style="text-align:left;"> 8 periods w/ quarters </td> <td style="text-align:left;"> 0.7720057 </td> <td style="text-align:center;"> 0.1461233 </td> <td style="text-align:center;"> 0.0762027 </td> <td style="text-align:center;"> 0.1267874 </td> <td style="text-align:center;"> 0.0758181 </td> </tr> </tbody> </table> .pull-left[ 1 quarter model <img src="Session_5s_files/figure-html/unnamed-chunk-54-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ 8 period model, by quarter <img src="Session_5s_files/figure-html/unnamed-chunk-55-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## What about for YoY growth? 
Predicting YoY revenue growth rate itself <table class="table table-striped table-hover" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> adj_r_sq </th> <th style="text-align:center;"> rmse_in </th> <th style="text-align:center;"> mae_in </th> <th style="text-align:center;"> rmse_out </th> <th style="text-align:center;"> mae_out </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 period </td> <td style="text-align:left;"> 0.4376253 </td> <td style="text-align:center;"> 0.3022800 </td> <td style="text-align:center;"> 0.1085684 </td> <td style="text-align:center;"> 0.1511589 </td> <td style="text-align:center;"> 0.1006249 </td> </tr> <tr> <td style="text-align:left;"> 1 and 4 periods </td> <td style="text-align:left;"> 0.5378241 </td> <td style="text-align:center;"> 0.2389085 </td> <td style="text-align:center;"> 0.0993933 </td> <td style="text-align:center;"> 0.1493757 </td> <td style="text-align:center;"> 0.0967341 </td> </tr> <tr> <td style="text-align:left;"> 8 periods </td> <td style="text-align:left;"> 0.5430590 </td> <td style="text-align:center;"> 0.1881716 </td> <td style="text-align:center;"> 0.0750616 </td> <td style="text-align:center;"> 0.1358365 </td> <td style="text-align:center;"> 0.0753768 </td> </tr> <tr> <td style="text-align:left;"> 8 periods w/ quarters </td> <td style="text-align:left;"> 0.1462837 </td> <td style="text-align:center;"> 0.2935877 </td> <td style="text-align:center;"> 0.1373069 </td> <td style="text-align:center;"> 0.1866005 </td> <td style="text-align:center;"> 0.1137764 </td> </tr> </tbody> </table> .pull-left[ 1 quarter model <img src="Session_5s_files/figure-html/unnamed-chunk-57-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ 8 period model <img src="Session_5s_files/figure-html/unnamed-chunk-58-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## What about for first difference? 
Predicting first difference in revenue itself <table class="table table-striped table-hover" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> adj_r_sq </th> <th style="text-align:center;"> rmse_in </th> <th style="text-align:center;"> mae_in </th> <th style="text-align:center;"> rmse_out </th> <th style="text-align:center;"> mae_out </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 period </td> <td style="text-align:left;"> 0.3578089 </td> <td style="text-align:center;"> 896.1441 </td> <td style="text-align:center;"> 286.47866 </td> <td style="text-align:center;"> 2247.2158 </td> <td style="text-align:center;"> 986.9519 </td> </tr> <tr> <td style="text-align:left;"> 1 and 4 periods </td> <td style="text-align:left;"> 0.8502591 </td> <td style="text-align:center;"> 444.9570 </td> <td style="text-align:center;"> 113.00284 </td> <td style="text-align:center;"> 860.6968 </td> <td style="text-align:center;"> 411.8824 </td> </tr> <tr> <td style="text-align:left;"> 8 periods </td> <td style="text-align:left;"> 0.9242547 </td> <td style="text-align:center;"> 329.4611 </td> <td style="text-align:center;"> 95.17826 </td> <td style="text-align:center;"> 764.8854 </td> <td style="text-align:center;"> 348.4883 </td> </tr> <tr> <td style="text-align:left;"> 8 periods w/ quarters </td> <td style="text-align:left;"> 0.9383434 </td> <td style="text-align:center;"> 296.7399 </td> <td style="text-align:center;"> 88.32380 </td> <td style="text-align:center;"> 731.1697 </td> <td style="text-align:center;"> 343.4773 </td> </tr> </tbody> </table> .pull-left[ 1 quarter model <img src="Session_5s_files/figure-html/unnamed-chunk-60-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ 8 period model, by quarter <img src="Session_5s_files/figure-html/unnamed-chunk-61-1.png" width="100%" style="display: block; margin: auto;" /> ] --- class: inverse, center, middle # Summary of Session 5 --- ## For next week - Try to replicate the code for this session - How is your group project? - Datacamp - Practice a bit more to keep up to date - Using R more will make it more natural - <a target="_blank" href="https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting"> Case: Walmart Store Sales Forecasting</a> --- ## R Coding Style Guide Style is subjective and arbitrary but it is important to follow a generally accepted style if you want to share code with others. I suggest the [The tidyverse style guide](https://style.tidyverse.org/) which is also adopted by [Google](https://google.github.io/styleguide/Rguide.html) with some modification - Highlights of **the tidyverse style guide**: - *File names*: end with .R - *Identifiers*: variable_name, function_name, try not to use "." 
as it is used in Base R's S3 method names
- *Line length*: 80 characters
- *Indentation*: two spaces, no tabs (by default RStudio converts tabs to spaces; you can change this under Global Options)
- *Spacing*: `x = 0`, not `x=0`; no space before a comma, but always place one after
- *Curly braces {}*: first on same line, last on own line
- *Assignment*: use `<-`, not `=` nor `->`
- *Semicolons (;)*: don't use them; I used one earlier only in the interest of space
- *return()*: use explicit `return()` in functions (by default, a function returns its last evaluated expression)
- *File paths*: use a [relative file path](https://www.w3schools.com/html/html_filepaths.asp) such as "../../filename.csv" rather than an absolute path such as "C:/mydata/filename.csv". Backslashes must be escaped as `\\`

---
## R packages used in this slide

This slide was prepared on 2021-09-24 from Session_5s.Rmd with R version 4.1.1 (2021-08-10) Kick Things on Windows 10 x64 build 18362 😄. The attached packages used in this slide are:

```
##     rlang lubridate    plotly   forcats   stringr     dplyr     purrr 
##  "0.4.11"  "1.7.10" "4.9.4.1"   "0.5.1"   "1.4.0"   "1.0.7"   "0.3.4" 
##     readr     tidyr    tibble   ggplot2 tidyverse kableExtra    knitr 
##   "2.0.1"   "1.1.3"   "3.1.3"   "3.3.5"   "1.3.1"    "1.3.4"    "1.33" 
```