class: center, middle, inverse, title-slide

# Programming with Data
## Session 1: Introduction
### Dr. Wang Jiwei
### Master of Professional Accounting

---
class: inverse, center, middle

# About the course

---
## What will this course cover?

.pull-left[
<img src="../../../Figures/r-logo.png" height="360px">
]

.pull-right[
1. Prerequisite
   - University Statistics
2. Programming with R
   - R programming foundations
3. Linear regressions with R
   - Forecast/predict financial outcomes
4. Binary classification with R
   - Event prediction
   - Classification/detection (of financial fraud)
5. Data visualizations with R
   - ggplot2 package in R
6. Advanced methods
   - Lasso, Ridge and Elastic Net regressions
   - Introduction to machine learning
]

> Using R for forecasting and forensics

---
## Teaching philosophy

1. Programming is best learned by doing
   - more thinking and more hands-on practice
2. Working with others greatly extends learning
   - If you are ahead:
     - The best sign that you've mastered a topic is being able to explain it to others
   - If you are lost:
     - It gives you a chance to get the help you need
3. We generally follow the model below when learning to program with data

.center[Source: R for Data Science <img src="../../../Figures/model_data_science.png" height="200px">]

---
## Textbook and learning materials

.pull-left[
- All course materials are on SMU eLearn
- There is no required textbook
- If you prefer having a textbook...
  - <a target=_blank href="https://rc2e.com/index.html">R Cookbook</a> is good for beginners
  - <a target=_blank href="https://r4ds.had.co.nz/">R for Data Science</a> is good for more advanced learners
- Announcements will be made mainly on eLearn
- Other useful websites
  - https://www.r-bloggers.com/
  - https://stackoverflow.com/questions
  - https://www.google.com/
]

.pull-right[
.center[<img src="https://rc2e.com/images_v2/book_cover.jpg" height="200px">]
.center[<img src="../../../Figures/R_for_Data_Science.png" height="200px">]
]

---
## Self-learning and Datacamp

- You are encouraged to go beyond the assigned materials, either through Datacamp or other online learning platforms such as Coursera and Udemy.
- Datacamp provides *free* access to its *full* library of analytics and coding tutorials
  - You will have free access for 6 months (July 1 to Dec 30, 2021), subject to renewal
- Suggestion: enroll in the **"Data Analyst with R/Python"** career track on Datacamp and finish all of its courses before completing your degree
  - Check eLearn for the link to access Datacamp for free
  - Datacamp automatically records when you finish these

> Practice! Practice! Practice!

.center[<img src="../../../Figures/datacamp_leaderboard_mpa20210821.png" height = "200px">]

---
## Grading

- Participation @ 20%
- Progress assessment @ 30%
- Group project @ 50%
- There is no final exam

> You must attempt all components and pass all components to pass the course

.center[source: medium.com <img src = "../../../Figures/grading.jpeg" height = "300px">]

---
## Participation

.pull-left[
**In Class**

- Come to class to earn 50%
  - If you have a conflict, email me
  - Excused classes do not impact your participation grade
- Ask questions to **extend** or **clarify**
- Answer questions and explain answers
  - Give it your best shot!
- Help those in your group to understand concepts
- Present your work to the class
- Always **turn on your camera** and speak using your microphone
- Take other initiatives to enrich the classroom learning experience
]

.pull-right[
**Outside of Class**

- Verify your understanding of the material
- Apply it to other real-world data
  - The techniques and code will be useful after graduation
- Answers to assignments are expected to be your own work, unless otherwise stated
  - No sharing of answers (unless otherwise stated)
- All submissions are on eLearn
  - Submit on time and follow the instructions
- I will provide snippets of code to help you with the trickier parts
]

---
## Group project

- Data science competition format, hosted on <a target=_blank href="https://www.kaggle.com/competitions">Kaggle</a> or similar platforms
- The project will finish in Session 10 with group presentations
- I will give you more details in a separate document

.center[<img src="../../../Figures/kaggle_logo.png" height = "200px">]

---
## Expectations

.pull-left[
**In class:**

- Participate
- Ask questions
  - Clarify
  - Add to the discussion
- Answer questions
- Work with classmates

.center[<img src = "https://www.insidehighered.com/sites/default/server_files/styles/large-copy/public/media/GettyImages-476803981.jpg" height = "300px">]
]

.pull-right[
**Outside of class:**

- Check eLearn for course announcements
- Do the tutorials on Datacamp if you are not familiar with R
  - This will make the course much easier!
- Do individual work on your own (unless otherwise stated)
  - Submit on eLearn
- Do online courses through Datacamp or other platforms
- Office hours are there to help!
  - Short questions can be emailed instead
]

---
## Office hours

- Make an appointment at the following link
  - [https://calendly.com/drdataking](https://calendly.com/drdataking)
  - The default slot is 15 minutes
  - If you need more time, you may book multiple slots
- Short questions can be emailed
  - I try to respond within 24 hours
- Teaching Assistant (check eLearn)
  - Always make an appointment before approaching the TA

.center[<img src = "https://freight.cargo.site/w/1250/i/470e98259bbcf939ec247a13d9f286131edef2c1b879b75997ecaa332119d25f/Open-Office-Hours.jpg" height = "300px">]

---
## Tech use

- Laptops and other tech are OK!
  - Use them for learning and course-related work
- Examples of good tech use:
  - Taking notes
  - Viewing slides
  - Working out problems
  - Group discussion
- Avoid:
  - Messaging your friends on Whatsapp/Wechat/Telegram/etc.
  - Working on homework or the group project in class
  - Playing games or watching livestreams
- <a href="https://www.insidehighered.com/news/2018/07/27/class-cellphone-and-laptop-use-lowers-exam-scores-new-study-shows">In-class cellphone and laptop use lowers exam scores</a>

.center[<img src = "../../../Figures/techuse.jpg" height = "200px">]

---
class: inverse, center, middle

# About you

---
class: inverse, center, middle

# Introduction to analytics

---
## What is analytics?
> **Oxford:** a careful and complete analysis of data using a model, usually performed by a computer; information resulting from this analysis
<!--https://www.oxfordlearnersdictionaries.com/definition/english/analytics pulled Dec 22, 2020-->

> **Webster:** the method of logical analysis
<!--https://www.merriam-webster.com/dictionary/analytics pulled Dec 22, 2020-->

> **Wikipedia:** the discovery, interpretation, and communication of meaningful patterns in data and applying those patterns towards effective decision making
<!--https://en.wikipedia.org/wiki/Analytics pulled Dec 22, 2020-->

> **Simply put:** Solving problems using data

- Additional layers we can add to the definition:
  - Solving problems using *a lot of* data
  - Solving problems using data *and statistics*
  - Solving problems using data *and computers (programming and/or specialized software)*

---
## The trend

> We search for "analytics" in Google Books and plot how often the word has occurred since 1960

<img src="Session_1s_files/figure-html/unnamed-chunk-2-1.png" width="100%" style="display: block; margin: auto;" />

.center[Made using the R package [`seancarmody/ngramr`](https://github.com/seancarmody/ngramr), which is available on CRAN (the central repository for R packages) and can be installed from RStudio directly]

---
## Analytics vs AI/machine learning

.pull-left[
<img src="../../../Figures/analytics_ml.png" width="350px">
]

.pull-right[
- In-class reading:
  - Future of everything: <a target="_blank" href="https://www.datarobot.com/blog/ai-will-enhance-us-not-replace-us-1/">AI Will Enhance Us, Not Replace Us</a>
    - "The future isn't AI versus humans. It is AI-enhanced humans doing what humans are best at."
  - AI Ethics: <a target="_blank" href="https://www.businessinsider.com/apple-card-faces-investigation-over-gender-discrimination-allegation-2019-11?IR=T">Apple Card is facing a formal investigation</a>
    - "We need transparency and fairness."
]

- Class Discussion

> How will Analytics/AI/ML change society and the accounting profession?

---
## What happened before?

.center[<img src="../../../Figures/wsjsurvey.png" height="500px">]

---
class: inverse, center, middle

# Who uses analytics?

---
## In general

.pull-left[
- Companies
  - Finance
  - Manufacturing
  - Transportation
  - Computing
  - ...
]

.pull-right[
- Governments
  - AI.Singapore
  - Big data office
  - "Smart" initiatives
- Academics
- Individuals!
]

> 59% of companies were using big data in a <a target="_blank" href="https://www.forbes.com/sites/louiscolumbus/2018/12/23/big-data-analytics-adoption-soared-in-the-enterprise-in-2018">2018 survey</a>!

> Which corporate function has the highest/lowest adoption of big data analytics?

---
## Adoption of big data by function

.center[<img src = "../../../Figures/Adoption-of-big-data-by-function.jpg">]

---
## What is analytics used for?
.pull-left[
- Customer service
  - <a target="_blank" href="https://www.forbes.com/sites/tomgroenfeldt/2018/05/03/rbs-uses-analytics-to-make-customer-service-more-than-just-a-slogan/#3d02f7d42108">Royal Bank of Scotland</a>
  - Understanding customer complaints
- Improving products
  - Siemens' <a target="_blank" href="https://www.forbes.com/sites/bernardmarr/2017/05/30/how-siemens-is-using-big-data-and-iot-to-build-the-internet-of-trains/#3b22cfd372b8">Internet of Trains</a>
  - Improving train reliability
- Auditing
  - <a target="_blank" href="https://www.dbs.com/investorday/presentations/Reimagining_Audit.pdf">Continuous Auditing at DBS</a>
  - The Future of Auditing is Auditing the Future
- How about your company?
]

.pull-right[
<a target="_blank" href="https://www.forbes.com/sites/tomgroenfeldt/2018/05/03/rbs-uses-analytics-to-make-customer-service-more-than-just-a-slogan/#3d02f7d42108"><img src="../../../Figures/RBS.png" width="120px"/></a>

<a target="_blank" href="https://www.forbes.com/sites/bernardmarr/2017/05/30/how-siemens-is-using-big-data-and-iot-to-build-the-internet-of-trains/#3b22cfd372b8"><img src="../../../Figures/Siemens-logo.png" width="400px"/></a>

<a target="_blank" href="https://www.dbs.com/investorday/presentations/Reimagining_Audit.pdf"><img src="../../../Figures/DBS_logo.png" width="400px"></a>
]

---
## State of business analytics?

- <a target="_blank" href="https://www.forbes.com/sites/louiscolumbus/2018/06/08/the-state-of-business-intelligence-2018/">Dresner Advisory Service's 2018 Market Study</a>
  - Executive Management, Operations, Sales and Finance were the four primary roles driving business analytics adoption in 2018.
  - Dashboards, reporting, end-user self-service, advanced visualization, and data warehousing were the top five most important initiatives.

.center[<img src="../../../Figures/Functions-Driving-Business-Intelligence-.jpg" height = "350px">]

---
## Head of Finance Data

.pull-left[
**Key tasks and responsibilities**

- Lead the Finance Data Team to maintain and improve the financial data application landscape (BI reporting, planning and budgeting systems) and the data pipelines powering Finance systems and reporting.
- Enable business users to further improve their data literacy and ultimately drive data-driven decision making.

<a target="_blank" href="https://sg.linkedin.com/jobs/view/regional-head-of-finance-data-at-lazada-group-1345336253"><img src="../../../Figures/Lazada_logo.png" width="200px"/></a>
]

.pull-right[
**Qualifications & Skills**

- Degree in Accounting, Finance, Business Administration, Computer Science or a related field
- Experience with big databases, including strong expertise in SQL
- Hands-on experience with Excel, R/Python (a plus) and SAP
- Creative and analytical thinker with strong problem-solving skills
- Strong written and oral communication skills
]

---
class: inverse, center, middle

# Statistics Foundations

---
## Frequentist vs Bayesian statistics

.pull-left[
**Frequentist statistics**

> A specific test is one of an infinite number of replications

- The "correct" answer should occur most frequently, i.e., with a high probability
- Focus on true vs. false
- Treat unknowns as fixed constants to figure out
  - Not random quantities
- Where it's used
  - Classical statistics methods
  - Like OLS
]

.pull-right[
**Bayesian statistics**

> Focus on distributions and beliefs

- Prior distribution: what is believed before the experiment
- Posterior distribution: an updated belief about the distribution after the experiment
- Derive distributions of parameters
- Where it's used:
  - Many machine learning methods
    - Bayesian updating acts as the learning
  - Bayesian statistics
]

---
## Frequentist: Repeat the test

> Did the sun explode just now?

```r
# Don't worry, we will learn how to program in R soon.
# Define a detector
# repeat the test with frequentist statistics
detector <- function() {
  dice <- sample(1:6, size = 2, replace = TRUE)
  if (sum(dice) == 12) {
    "exploded"
  } else {
    "still there"
  }
}

experiment <- replicate(1000, detector())

# p value
paste("p-value: ", sum(experiment == "still there") / 1000,
      "-- Failed to reject H_0 that sun didn't explode")
```

```
# [1] "p-value: 0.971 -- Failed to reject H_0 that sun didn't explode"
```

> Frequentist: The sun didn't explode

---
## Bayesian: Bayes rule

> Did the sun explode just now?

$$ P(A|B) = \frac{P(B|A) P(A)}{P(B)} $$

- `\(A\)`: The sun exploded
- `\(B\)`: The detector said it exploded
- `\(P(A)\)`: Really, really small. Say, ~0. Prior belief
- `\(P(B)\)`: `\(\frac{1}{6}\times\frac{1}{6} = \frac{1}{36}\)`. Probability the detector says it exploded
- `\(P(B|A)\)`: `\(\frac{35}{36}\)`. Likelihood: the probability the detector says it exploded, given that it really did

$$ P(A|B) = \frac{P(B|A) P(A)}{P(B)} = \frac{\frac{35}{36}\times\sim 0}{\frac{1}{36}} = 35\times \sim 0 \approx 0 $$

> Bayesian: The sun didn't explode

---
## What analytics typically relies on

- Regression approaches
  - Most often done in a frequentist manner
  - Can be done in a Bayesian manner as well
- Machine learning
  - Sometimes Bayesian, sometimes frequentist

> We will mainly use frequentist statistics, with some Bayesian applications -- for our purposes, we will not debate the merits of either school of thought, but use tools derived from both

.center[<img src="../../../Figures/bayesian-vs-frequentist-methods.jpg" height="200px">]

---
## Confusion from frequentist approaches

- Possible contradictions:
  - The `\(F\)` test says the model is good yet nothing is statistically significant
  - Individual `\(p\)`-values are good yet the model isn't
  - One measure says the model is good yet another doesn't

> There are many ways to measure a model, each with its own merits. They don't always agree, and it's on us to pick a reasonable measure. We will discuss this more in the applications.

---
class: inverse, center, middle

# Frequentist approaches to things

---
## Population vs Sample

- Population: all objects belonging to a specified set
  - e.g., all companies in Singapore
- Sample: a (random) subset of the population

.center[<img src="../../../Figures/population_sample.png">]

---
## Parameters vs statistics

> Population parameters vs sample statistics of a given variable (such as the height of boys or the earnings of companies)

- mean
- median/quantile
- mode
- standard deviation/variance
- max/min
- distribution

.center[<img src="../../../Figures/skewness.png">]

---
## Normal distribution

> A normal (Gaussian or bell curve) distribution is a type of continuous probability distribution for a real-valued random variable in which the mean, median and mode are all equal.
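
A minimal sketch in R, using simulated data (the seed and sample size are arbitrary choices), to see that the mean and median of a normal sample are roughly equal:

```r
# Draw 100,000 values from a standard normal and compare mean and median;
# both should be close to 0, which is also the mode (peak) of the bell curve.
set.seed(2021)
x <- rnorm(100000, mean = 0, sd = 1)
c(mean = mean(x), median = median(x))
```
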
.center[<img src="../../../Figures/Normal_Distribution_PDF.svg" height = "400px">]

---
## Sampling error

- The sample statistic will *not* be exactly equal to the population parameter
  - But it should be close
  - How close depends on the sample size
- Confidence interval

`$$\bar{X} \pm Z_{1-\alpha/2} \, \frac{\sigma}{\sqrt{N}}$$`

> where `\(\bar{X}\)` is the sample mean, `\(Z_{1-\alpha/2}\)` is the `\(1-\alpha/2\)` critical value of the standard normal distribution (1.645, 1.96 and 2.58 for 10%, 5%, and 1% respectively, which correspond to confidence levels of 90%, 95% and 99%), `\(\sigma\)` is the known population standard deviation, and `\(N\)` is the sample size.

- The larger the sample size `\(N\)`, the closer the sample statistic is to the population parameter
  - trade-off between data collection costs and the margin of error
  - we typically choose a confidence level (such as 99%) and a margin of error to determine the minimum random sample size `\(N\)`

---
## Law of large numbers

> The law of large numbers states that as a sample size grows, its mean gets closer to the average of the whole population.

- Roll a 6-sided dice (1 to 6); the expected mean is 3.5

```r
# Roll a dice many times and average the results
i <- 1
dice <- 0
times <- 10000
while (i <= times) {
  dice <- dice + sample(1:6, 1)
  i <- i + 1
}
paste("Roll", times, "times dice and the mean is", dice/times)
```

```
## [1] "Roll 10000 times dice and the mean is 3.4843"
```

---
## Central Limit Theorem

> If you take sufficiently large random samples from the population with replacement, then the distribution of the sample means will be approximately normally distributed.

```r
# Draw repeated samples of 30 dice rolls and plot the distribution of sample means
i <- 0; meandice <- c()
while (i <= 10000) {
  meandice <- append(meandice, mean(sample(1:6, 30, replace = TRUE)))
  i <- i + 1
}
hist(meandice, col = "lightgreen", breaks = 20)
abline(v = 3.5, col = "blue")
abline(v = mean(meandice), col = "red")
```

<img src="Session_1s_files/figure-html/unnamed-chunk-5-1.png" width="100%" style="display: block; margin: auto;" />

---
## Hypotheses

- `\(H_0\)`: Null hypothesis
  - The status quo is correct
  - Your proposed model/prediction doesn't work
- `\(H_A\)` or `\(H_1\)`: Alternative hypothesis
  - The model/prediction you are proposing works
- Frequentist statistics can never directly support `\(H_0\)`!
  - Reject `\(H_0\)` (a.k.a. find support for `\(H_A\)`) if the `\(p\)`-value < a significance level (such as 5% or 1%)
  - Fail to reject `\(H_0\)` (a.k.a. fail to find support for `\(H_A\)`) if the `\(p\)`-value >= the significance level
  - We can roughly understand the `\(p\)`-value as the probability of observing our result if `\(H_0\)` were true
  - We will discuss this more later

> Even if our `\(p\)`-value is 1, we can't say that the results prove the null hypothesis!
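
---
## Hypothesis testing in R: a minimal sketch

> A minimal sketch using simulated dice rolls and base R's `t.test()`, showing how the reject / fail-to-reject decision looks in code. The seed and sample size below are arbitrary assumptions for illustration.

```r
# Simulate 1,000 rolls of a fair dice and test H_0: the true mean is 3.5
set.seed(2021)                    # arbitrary seed, for reproducibility
rolls <- sample(1:6, size = 1000, replace = TRUE)

result <- t.test(rolls, mu = 3.5) # one-sample t-test against mu = 3.5
result$p.value                    # typically large here...
result$p.value < 0.05             # ...so we fail to reject H_0 at the 5% level
```
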
---
## OLS terminology

$$ y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \ldots + \varepsilon $$

$$ \hat{y} = \hat{\alpha} + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \ldots $$

- `\(y\)`: The output in our model
  - dependent variable
  - the value we want to predict
- `\(\hat{y}\)`: The *predicted* (fitted) output of our model
- `\(x_i\)`: An input in our model
  - independent variables
  - features
  - predictors
- `\(\hat{~}\)`: Something *estimated* from the data, "caret" or "hat"
- `\(\alpha\)`: A constant (intercept), the expected value of `\(y\)` when all `\(x_i\)` are 0
- `\(\beta_i\)`: A coefficient on an input to our model
- `\(\hat{\alpha}\)`, `\(\hat{\beta}_i\)`: The *estimated* intercept and coefficients
- `\(\varepsilon\)`: The error term
  - Its estimate, `\(\hat{\varepsilon} = y - \hat{y}\)`, is the *residual* from the regression
  - What's left when you take the actual `\(y\)` minus the model's prediction

---
## OLS statistical properties

$$ y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \ldots + \varepsilon $$

$$ \hat{y} = \hat{\alpha} + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \ldots $$

1. There should be a *linear* relationship between `\(y\)` and each `\(x_i\)`
   - i.e., `\(y\)` is [approximated by] a constant multiple of each `\(x_i\)`
   - Otherwise we **shouldn't** use a *linear* regression
2. The error term `\(\varepsilon\)` is normally distributed
   - Not so important with larger data sets, but good practice to adhere to
3. Each observation is independent
   - We'll violate this one for the sake of *causality*
4. Homoskedasticity: the variance of the errors is constant
   - This is important
5. Not too much multicollinearity
   - Each `\(x_i\)` should be relatively independent of the others
   - Some is OK

---
class: inverse, center, middle

# Linear model implementation

---
## What exactly is a linear model?

- Anything OLS is linear
- Many transformations can be recast to linear
  - Ex.: `\(log(y) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 {x_1}^2 + \beta_4 x_1 \cdot x_2\)`
  - This is the same as `\(y' = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4\)` where:
    - `\(y' = log(y)\)`
    - `\(x_3 = {x_1}^2\)`
    - `\(x_4 = x_1 \cdot x_2\)`

> Linear models are *very* flexible

.center[source: wikipedia <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Linear_regression.svg/1200px-Linear_regression.svg.png" height = "250px">]

---
## Mental model of OLS: 1 input

.center[<img src="../../../Figures/OLS_model.png" width="600px">]

> Simple OLS measures a simple linear relationship between an input and an output

- e.g.: Future revenue regressed on assets

---
## Multiple inputs

.center[<img src="../../../Figures/OLS_model_m.png" height="400px">]

> OLS measures simple linear relationships between a set of inputs and one output

- e.g.: Future revenue regressed on multiple accounting and macro variables

---
## Model selection

> We will introduce many models. Pick what fits your problem!

- For forecasting a quantity:
  - Usually some sort of linear model estimated using OLS
- For forecasting a binary outcome:
  - Usually logit or a related model
- For forensics:
  - Usually logit or a related model

> automated model selection

.center[<a target="_blank" href = "https://www.kdnuggets.com/2020/02/data-scientists-automl-replace.html"><img src = "../../../Figures/automl.png" height = "250px"></a>]

---
## Variable selection

> Feature engineering

- The options:
  1. Use your own knowledge to select variables
  2. Use a selection model to automate it

.pull-left[
**Own knowledge**

- Build a model based on your knowledge of the problem and situation
- This is generally better
  - The result should be more interpretable
  - For prediction, you should know the relationships better than most algorithms do
]

.pull-right[
<img src="../../../Figures/brain.png" height="200px">
]

---
## Automated variable selection

- Traditional methods include:
  - Forward selection: start with nothing and add the variables that contribute most to Adj. `\(R^2\)` until it stops going up
  - Backward selection: start with all inputs and remove the variables with the worst (negative) contribution to Adj. `\(R^2\)` until it stops going up
  - Stepwise selection: like forward selection, but drops non-significant predictors along the way
- Newer methods:
  - Lasso and Elastic Net based models
    - Optimize with high penalties for complexity (i.e., the # of inputs)
  - We will discuss these in future sessions

.center[<img src="../../../Figures/artificial-neural-network_640.png" height="150px">]

---
## The overfitting problem

> Or: Why do we like simpler models so much?

- Overfitting happens when a model fits the in-sample data *too well*...
  - To the point where it also models any idiosyncrasies or errors in the data
  - This harms prediction performance
  - Directly harming our forecasts

> An overfitted model works really well on its own data, and quite poorly on new data

.center[<a target = "_blank" href = "https://medium.com/greyatom/what-is-underfitting-and-overfitting-in-machine-learning-and-how-to-deal-with-it-6803a989c76"><img src = "https://miro.medium.com/max/1125/1*_7OPgojau8hkiPUiHoGK_w.png" height = "250px"></a>]

---
class: inverse, center, middle

# Statistical tests and Interpretation

---
## Coefficients

- In OLS: `\(\beta_i\)`

.pull-left[
- A change in `\(x_i\)` by 1 unit leads to a change in `\(y\)` by `\(\beta_i\)`, holding the other inputs constant
  - Essentially, the slope between `\(x_i\)` and `\(y\)`
- The blue line in the chart is the regression line for `\(\widehat{Revenue} = \hat{\alpha} + \hat{\beta} \, Assets\)` for retail firms since 1960 (see the minimal `lm()` sketch at the end of these slides)
]

.pull-right[
<img src="Session_1s_files/figure-html/unnamed-chunk-6-1.png" width="100%" style="display: block; margin: auto;" />
]

---
## P-values

- `\(p\)`-values tell us the probability of obtaining a result at least as extreme as ours by random chance alone (i.e., if the null hypothesis is true)

> "The P value is defined as the probability under the assumption of no effect or no difference (null hypothesis), of obtaining a result equal to or more extreme than what was actually observed." <br> -- Dahiru 2008

- These are very useful, particularly for a frequentist approach
  - First used in the 1700s, but popularized by Ronald Fisher in the 1920s and 1930s
- If `\(p<0.05\)` and the coefficient matches our mental model, we can consider this as supporting our model (i.e. rejecting the null)
- If `\(p<0.05\)` but the coefficient is opposite, then it suggests a problem with our model
- If `\(p>0.10\)`, we fail to find support for the alternative hypothesis
- If `\(0.05 < p < 0.10\)`, it depends...
  - For a small dataset or a complex problem, we can use `\(0.10\)` as a cutoff
  - For a huge dataset or a simple problem, we should use `\(0.05\)` (or an even stricter cutoff)

---
## R-square

- `\(R^2\)` = Explained variation / Total variation
  - Variation = difference of the observed output variable from its own mean
- A high `\(R^2\)` indicates that the model fits the data very well
- A low `\(R^2\)` indicates that the model is missing much of the variation in the output
- `\(R^2\)` is technically a *biased* estimator
  - adding more independent variables raises `\(R^2\)`, even if they add no real information
- Adjusted `\(R^2\)` downweights `\(R^2\)` to correct for this
  - `\(R^2_{Adj} = P * R^2 + 1 - P\)`
  - Where `\(P=\frac{n-1}{n-p-1}\)`
    - `\(n\)` is the number of observations
    - `\(p\)` is the number of inputs in the model

---
class: inverse, center, middle

# Causality

---
## What is causality?

`\(A \rightarrow B\)`

- Causality is `\(A\)` *causing* `\(B\)`
  - This means more than `\(A\)` and `\(B\)` being correlated
- i.e., if `\(A\)` changes, `\(B\)` changes. But `\(B\)` changing doesn't mean `\(A\)` changed
  - Unless `\(B\)` is 100% driven by `\(A\)`
- Very difficult to determine, particularly for events that happen [almost] simultaneously
- <a target="_blank" href="http://tylervigen.com/spurious-correlations">Examples of correlations that aren't causation</a>

.center[<a target="_blank" href="https://xkcd.com/552/"><img src="https://imgs.xkcd.com/comics/correlation.png" height="200px"></a>]

---
## Time and causality

`\(A \rightarrow B\)` or `\(A \leftarrow B\)`?

`\(A_t \rightarrow B_{t+1}\)`

- If there is a separation in time, it's easier to say `\(A\)` caused `\(B\)`
  - Observe `\(A\)`, then see if `\(B\)` changes afterwards
- Conveniently, we have this structure when forecasting
  - e.g.:

$$ Revenue_{t+1} = Revenue_t + \ldots $$

---
## Time and causality break down

`\(A_t \rightarrow B_{t+1}\)`? `\(\quad\)` OR `\(\quad\)` `\(C \rightarrow A_t\)` and `\(C \rightarrow B_{t+1}\)`?

- The above illustrates the *correlated omitted variable problem*
  - `\(A\)` doesn't cause `\(B\)`... Instead, some other force `\(C\)` causes both
  - The bane of social scientists everywhere
- This is less important for predictive analytics, as we care more about performance, but...
  - It can complicate interpreting your results
  - Figuring out `\(C\)` can help improve your model's predictions
  - So find `\(C\)`!

.center[<a target="_blank" href="https://xkcd.com/925/"><img src="https://imgs.xkcd.com/comics/cell_phones.png" height="200px"></a>]

---
## Discussion

> Some executives believe that all they need to do is establish correlation. Wrong!

.pull-left[
- **A/B Test:** <a target="_blank" href="https://hbr.org/2017/09/the-surprising-power-of-online-experiments">The Surprising Power of Online Experiments</a>
- Further reading: <a target="_blank" href="https://www.r-bloggers.com/causal-inference-cheat-sheet-for-data-scientists/">Causal Inference cheat sheet for data scientists</a>

> So does causation imply correlation?

.center[source: wikipedia <img src="../../../Figures/Correlation_examples.svg">]
]

.pull-right[
.center[<img src="../../../Figures/a_b_test_bing.png" height="400px">]
]

---
class: inverse, center, middle

# Statistics vs. Data Science Jargon

---
## Statistics vs. Data Science Jargon

<table class=" lightable-paper" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'>
 <thead>
  <tr>
   <th style="text-align:left;"> Statistics </th>
   <th style="text-align:left;"> Data_Science </th>
   <th style="text-align:left;"> Meaning </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> estimation </td>
   <td style="text-align:left;"> learning </td>
   <td style="text-align:left;"> use data to estimate an unknown parameter (mean, variance, model coefficients, etc.) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> regression </td>
   <td style="text-align:left;"> regression/supervised learning </td>
   <td style="text-align:left;"> Predict a continuous value of Y using values of other variables X </td>
  </tr>
  <tr>
   <td style="text-align:left;"> classification </td>
   <td style="text-align:left;"> classification/supervised learning </td>
   <td style="text-align:left;"> Predict a discrete value of Y using values of other variables X </td>
  </tr>
  <tr>
   <td style="text-align:left;"> clustering </td>
   <td style="text-align:left;"> clustering/unsupervised learning </td>
   <td style="text-align:left;"> Group the data based on some variables X </td>
  </tr>
  <tr>
   <td style="text-align:left;"> in/out-of sample </td>
   <td style="text-align:left;"> training/testing sample </td>
   <td style="text-align:left;"> data used for training/testing models </td>
  </tr>
  <tr>
   <td style="text-align:left;"> independent variable </td>
   <td style="text-align:left;"> feature </td>
   <td style="text-align:left;"> the predictor variables (X) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> dependent variable </td>
   <td style="text-align:left;"> label </td>
   <td style="text-align:left;"> the outcome variable to be predicted (Y) </td>
  </tr>
</tbody>
</table>

---
class: inverse, center, middle

# Summary of Session 1

---
## For next week

- Start the **"Data Analyst with R"** career track on Datacamp
- Review the statistics foundations
- Pick a book on R and study it, such as <a target=_blank href="https://rc2e.com/index.html">R Cookbook</a> or <a target=_blank href="https://r4ds.had.co.nz/">R for Data Science</a>
- Install [R](https://cran.rstudio.com/) and [RStudio](https://www.rstudio.com/products/rstudio/download/#download) if you have not done so

---
## R packages used in this slide

This slide was created in Jan 2019 from Session_1s.Rmd and updated on 2021-10-01 with R version 4.1.1 (2021-08-10, "Kick Things") on Windows 10 x64 build 18362 😄. The attached packages used in this slide are:

```
##   forcats   stringr     dplyr     purrr     readr     tidyr    tibble 
##   "0.5.1"   "1.4.0"   "1.0.7"   "0.3.4"   "2.0.1"   "1.1.3"   "3.1.3" 
## tidyverse   ggplot2    ngramr kableExtra     knitr 
##   "1.3.1"   "3.3.5"   "1.7.4"   "1.3.4"    "1.33" 
```
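
---
## OLS in R: a minimal sketch

> A minimal sketch using simulated data (the seed, variable names and true coefficients below are arbitrary assumptions, not course data): fit an OLS regression with `lm()` and read off the coefficients, `\(p\)`-values and `\(R^2\)` discussed earlier.

```r
# Simulate data from a known linear model: y = 1 + 2*x + noise
set.seed(2021)
x <- rnorm(200)
y <- 1 + 2 * x + rnorm(200)

fit <- lm(y ~ x)         # OLS estimation
summary(fit)             # coefficients, p-values, R^2 and adjusted R^2
coef(fit)                # just the estimated alpha-hat and beta-hat
summary(fit)$r.squared   # just R^2
```
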
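
---
## Getting set up: a sketch

> A minimal sketch of how you might install and load the packages listed above after installing R and RStudio; treat the exact package list as an assumption rather than a required step.

```r
# Install the packages used in these slides (run once in the RStudio console).
# The tidyverse bundle includes ggplot2, dplyr, tidyr, readr, purrr, tibble,
# stringr and forcats.
install.packages(c("tidyverse", "ngramr", "kableExtra", "knitr"))

# Load a package for the current session.
library(tidyverse)

# Check which R version you are running.
R.version.string
```
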