class: center, middle, inverse, title-slide

# Programming with Data
## Session 1: Introduction
### Dr. Wang Jiwei
### Master of Professional Accounting

---
class: inverse, center, middle

# About the course

---
## What will this course cover?

.pull-left[
<img src="../../../Figures/r-logo.png" height="360px">
]

.pull-right[
1. Prerequisite
   - University Statistics
2. Programming with R
   - R programming foundations
3. Linear regressions with R
   - Forecast/predict financial outcomes
4. Binary classification with R
   - Event prediction
   - Classification/detection (of financial fraud)
5. Data visualizations with R
   - ggplot2 package in R
6. Advanced methods
   - Lasso, Ridge and Elastic Net regressions
   - Introduction to machine learning
]

> Using R for forecasting and forensics

---
## Teaching philosophy

1. Programming is best learned by doing
   - more thinking and more hands-on practice
2. Working with others greatly extends learning
   - If you are ahead:
     - The best sign that you've mastered a topic is being able to explain it to others
   - If you are lost:
     - It gives you a chance to get the help you need
3. We generally follow the model below when learning to program with data

.center[Source: R for Data Science <img src="../../../Figures/model_data_science.png" height="200px">]

---
## Textbook and learning materials

.pull-left[
- All course materials are on SMU eLearn
- There is no required textbook
- If you prefer having a textbook...
  - <a target=_blank href="https://rc2e.com/index.html">R Cookbook</a> is good for beginners
  - <a target=_blank href="https://r4ds.had.co.nz/">R for Data Science</a> is good for more advanced learners
- Announcements will be made mainly on eLearn
- Other useful websites
  - https://www.r-bloggers.com/
  - https://stackoverflow.com/questions
  - https://www.google.com/
]

.pull-right[
.center[<img src="https://rc2e.com/images_v2/book_cover.jpg" height="200px">]
.center[<img src="../../../Figures/R_for_Data_Science.png" height="200px">]
]

---
## Self-learning and Datacamp

- You are encouraged to go beyond the assigned materials, either through Datacamp or other online learning platforms such as Coursera and Udemy.
- Datacamp provides *free* access to its *full* library of analytics and coding tutorials
  - You will have free access for 6 months (July 1 to Dec 30, 2021), subject to renewal
- Suggestion: enroll in the **"Data Analyst with R/Python"** career track on Datacamp and finish all of its courses before completing your degree
  - Check eLearn for the link to access Datacamp for free
  - Datacamp automatically records when you finish these

> Practice! Practice! Practice!

.center[<img src="../../../Figures/datacamp_leaderboard_mpa20210821.png" height = "200px">]

---
## Grading

- Participation @ 20%
- Progress assessment @ 30%
- Group project @ 50%
- There is no final exam

> You must attempt all components and pass all components to pass the course

.center[source: medium.com <img src = "../../../Figures/grading.jpeg" height = "300px">]

---
## Participation

.pull-left[
**In Class**

- Come to class to earn 50%
  - If you have a conflict, email me
  - Excused classes do not impact your participation grade
- Ask questions to **extend** or **clarify**
- Answer questions and explain answers
  - Give it your best shot!
- Help those in your group to understand concepts
- Present your work to the class
- Always **turn on your camera** and speak using your microphone
- Take other initiatives to enrich the classroom learning experience
]

.pull-right[
**Outside of Class**

- Verify your understanding of the material
- Apply it to other real-world data
  - The techniques and code will be useful after graduation
- Answers to assignments are expected to be your own work, unless otherwise stated
  - No sharing of answers (unless otherwise stated)
- All submissions are on eLearn
  - Submit on time and follow the instructions
- I will provide snippets of code to help you with the trickier parts
]

---
## Group project

- Data science competition format, hosted on <a target=_blank href="https://www.kaggle.com/competitions">Kaggle</a> or similar platforms
- The project will finish in Session 10 with group presentations
- I will give you more details in a separate document

.center[<img src="../../../Figures/kaggle_logo.png" height = "200px">]

---
## Expectations

.pull-left[
**In class:**

- Participate
- Ask questions
  - Clarify
  - Add to the discussion
- Answer questions
- Work with classmates

.center[<img src = "https://www.insidehighered.com/sites/default/server_files/styles/large-copy/public/media/GettyImages-476803981.jpg" height = "300px">]
]

.pull-right[
**Outside of class:**

- Check eLearn for course announcements
- Do the tutorials on Datacamp if you are not familiar with R
  - This will make the course much easier!
- Do individual work on your own (unless otherwise stated)
  - Submit on eLearn
- Do online courses through Datacamp or other platforms
- Office hours are there to help!
  - Short questions can be emailed instead
]

---
## Office hours

- Make an appointment at the following link
  - [https://calendly.com/drdataking](https://calendly.com/drdataking)
  - The default slot is 15 minutes
  - If you need more time, you may book multiple slots
- Short questions can be emailed
  - I try to respond within 24 hours
- Teaching Assistant (check eLearn)
  - Always make an appointment before approaching the TA

.center[<img src = "https://freight.cargo.site/w/1250/i/470e98259bbcf939ec247a13d9f286131edef2c1b879b75997ecaa332119d25f/Open-Office-Hours.jpg" height = "300px">]

---
## Tech use

- Laptops and other tech are OK!
  - Use them for learning and course-related work
- Examples of good tech use:
  - Taking notes
  - Viewing slides
  - Working out problems
  - Group discussion
- Avoid:
  - Messaging your friends on Whatsapp/Wechat/Telegram/etc.
  - Working on homework or the group project in class
  - Playing games or watching livestreams
- <a href="https://www.insidehighered.com/news/2018/07/27/class-cellphone-and-laptop-use-lowers-exam-scores-new-study-shows">In-class cellphone and laptop use lowers exam scores</a>

.center[<img src = "../../../Figures/techuse.jpg" height = "200px">]

---
class: inverse, center, middle

# About you

---
class: inverse, center, middle

# Introduction to analytics

---
## What is analytics?
> **Oxford:** a careful and complete analysis of data using a model, usually performed by a computer; information resulting from this analysis
<!--https://www.oxfordlearnersdictionaries.com/definition/english/analytics pulled Dec 22, 2020-->

> **Webster:** the method of logical analysis
<!--https://www.merriam-webster.com/dictionary/analytics pulled Dec 22, 2020-->

> **Wikipedia:** the discovery, interpretation, and communication of meaningful patterns in data and applying those patterns towards effective decision making
<!--https://en.wikipedia.org/wiki/Analytics pulled Dec 22, 2020-->

> **Simply put:** Solving problems using data

- Additional layers we can add to the definition:
  - Solving problems using *a lot of* data
  - Solving problems using data *and statistics*
  - Solving problems using data *and computers (programming and/or specialized software)*

---
## The trend

> We search for "analytics" in Google Books and plot how often the word has occurred since 1960

<img src="Session_1s_files/figure-html/unnamed-chunk-2-1.png" width="100%" style="display: block; margin: auto;" />

.center[Made using the R package [`seancarmody/ngramr`](https://github.com/seancarmody/ngramr), which is available on CRAN (the central repository for R packages) and can be installed from RStudio directly]

---
## Analytics vs AI/machine learning

.pull-left[
<img src="../../../Figures/analytics_ml.png" width="350px">
]

.pull-right[
- In-class reading:
  - Future of everything: <a target="_blank" href="https://www.datarobot.com/blog/ai-will-enhance-us-not-replace-us-1/">AI Will Enhance Us, Not Replace Us</a>
    - "The future isn't AI versus humans. It is AI-enhanced humans doing what humans are best at."
  - AI Ethics: <a target="_blank" href="https://www.businessinsider.com/apple-card-faces-investigation-over-gender-discrimination-allegation-2019-11?IR=T">Apple Card is facing a formal investigation</a>
    - "We need transparency and fairness."
]

- Class Discussion

> How will Analytics/AI/ML change society and the accounting profession?

---
## What happened before?

.center[<img src="../../../Figures/wsjsurvey.png" height="500px">]

---
class: inverse, center, middle

# Who uses analytics?

---
## In general

.pull-left[
- Companies
  - Finance
  - Manufacturing
  - Transportation
  - Computing
  - ...
]

.pull-right[
- Governments
  - AI.Singapore
  - Big data office
  - "Smart" initiatives
- Academics
- Individuals!
]

> 59% of companies were using big data in a <a target="_blank" href="https://www.forbes.com/sites/louiscolumbus/2018/12/23/big-data-analytics-adoption-soared-in-the-enterprise-in-2018">2018 survey</a>!

> Which corporate function has the highest/lowest adoption of big data analytics?

---
## Adoption of big data by function

.center[<img src = "../../../Figures/Adoption-of-big-data-by-function.jpg">]

---
## What is analytics used for?
.pull-left[
- Customer service
  - <a target="_blank" href="https://www.forbes.com/sites/tomgroenfeldt/2018/05/03/rbs-uses-analytics-to-make-customer-service-more-than-just-a-slogan/#3d02f7d42108">Royal Bank of Scotland</a>
  - Understanding customer complaints
- Improving products
  - Siemens' <a target="_blank" href="https://www.forbes.com/sites/bernardmarr/2017/05/30/how-siemens-is-using-big-data-and-iot-to-build-the-internet-of-trains/#3b22cfd372b8">Internet of Trains</a>
  - Improving train reliability
- Auditing
  - <a target="_blank" href="https://www.dbs.com/investorday/presentations/Reimagining_Audit.pdf">Continuous Auditing at DBS</a>
  - The Future of Auditing is Auditing the Future
- How about your company?
]

.pull-right[
<a target="_blank" href="https://www.forbes.com/sites/tomgroenfeldt/2018/05/03/rbs-uses-analytics-to-make-customer-service-more-than-just-a-slogan/#3d02f7d42108"><img src="../../../Figures/RBS.png" width="120px"/></a>

<a target="_blank" href="https://www.forbes.com/sites/bernardmarr/2017/05/30/how-siemens-is-using-big-data-and-iot-to-build-the-internet-of-trains/#3b22cfd372b8"><img src="../../../Figures/Siemens-logo.png" width="400px"/></a>

<a target="_blank" href="https://www.dbs.com/investorday/presentations/Reimagining_Audit.pdf"><img src="../../../Figures/DBS_logo.png" width="400px"></a>
]

---
## State of business analytics?

- <a target="_blank" href="https://www.forbes.com/sites/louiscolumbus/2018/06/08/the-state-of-business-intelligence-2018/">Dresner Advisory Service's 2018 Market Study</a>
  - Executive Management, Operations, Sales and Finance were the four primary roles driving business analytics adoption in 2018.
  - Dashboards, reporting, end-user self-service, advanced visualization, and data warehousing were the top five most important initiatives.

.center[<img src="../../../Figures/Functions-Driving-Business-Intelligence-.jpg" height = "350px">]

---
## Head of Finance Data

.pull-left[
**Key tasks and responsibilities**

- Lead the Finance Data Team to maintain and improve the financial data application landscape (BI reporting, planning and budgeting systems) and the data pipelines powering Finance systems and reporting.
- Enable business users to further improve their data literacy and ultimately drive data-driven decision making.

<a target="_blank" href="https://sg.linkedin.com/jobs/view/regional-head-of-finance-data-at-lazada-group-1345336253"><img src="../../../Figures/Lazada_logo.png" width="200px"/></a>
]

.pull-right[
**Qualifications & Skills**

- Degree in Accounting, Finance, Business Administration, Computer Science or a related field
- Experience with big databases, including strong expertise in SQL
- Hands-on experience with Excel, R/Python (a plus) and SAP
- Creative and analytical thinker with strong problem-solving skills
- Strong written and oral communication skills
]

---
class: inverse, center, middle

# Statistics Foundations

---
## Frequentist vs Bayesian statistics

.pull-left[
**Frequentist statistics**

> A specific test is one of an infinite number of replications

- The "correct" answer should occur most frequently, i.e., with a high probability
- Focus on true vs. false
- Treat unknowns as fixed constants to figure out
  - Not random quantities
- Where it's used
  - Classical statistics methods
  - Like OLS
]

.pull-right[
**Bayesian statistics**

> Focus on distributions and beliefs

- Prior distribution: what is believed before the experiment
- Posterior distribution: an updated belief about the distribution after the experiment
- Derive distributions of parameters
- Where it's used:
  - Many machine learning methods
    - Bayesian updating acts as the learning
  - Bayesian statistics
]

---
## Frequentist: Repeat the test

> Did the sun explode just now?

```r
# Don't worry, we will learn how to program in R soon.
# Define a detector
# repeat the test with frequentist statistics
detector <- function() {
  dice <- sample(1:6, size = 2, replace = TRUE)
  if (sum(dice) == 12) {
    "exploded"
  } else {
    "still there"
  }
}

experiment <- replicate(1000, detector())

# p value
paste("p-value: ", sum(experiment == "still there") / 1000,
      "-- Failed to reject H_0 that sun didn't explode")
```

```
# [1] "p-value: 0.971 -- Failed to reject H_0 that sun didn't explode"
```

> Frequentist: The sun didn't explode

---
## Bayesian: Bayes rule

> Did the sun explode just now?

$$ P(A|B) = \frac{P(B|A) P(A)}{P(B)} $$

- `\(A\)`: The sun exploded
- `\(B\)`: The detector said it exploded
- `\(P(A)\)`: Really, really small. Say, ~0. Prior belief
- `\(P(B)\)`: `\(\frac{1}{6}\times\frac{1}{6} = \frac{1}{36}\)`. Probability the detector says it exploded
- `\(P(B|A)\)`: `\(\frac{35}{36}\)`. Likelihood: the probability the detector says it exploded, given that it really did

$$ P(A|B) = \frac{P(B|A) P(A)}{P(B)} = \frac{\frac{35}{36}\times\sim 0}{\frac{1}{36}} = 35\times \sim 0 \approx 0 $$

> Bayesian: The sun didn't explode

---
## What analytics typically relies on

- Regression approaches
  - Most often done in a frequentist manner
  - Can be done in a Bayesian manner as well
- Machine learning
  - Sometimes Bayesian, sometimes frequentist

> We will mainly use frequentist statistics, with some Bayesian applications -- for our purposes, we will not debate the merits of either school of thought, but use tools derived from both

.center[<img src="../../../Figures/bayesian-vs-frequentist-methods.jpg" height="200px">]

---
## Confusion from frequentist approaches

- Possible contradictions:
  - The `\(F\)` test says the model is good yet nothing is statistically significant
  - Individual `\(p\)`-values are good yet the model isn't
  - One measure says the model is good yet another doesn't

> There are many ways to measure a model, each with its own merits. They don't always agree, and it's on us to pick a reasonable measure. We will discuss this more in the applications.

---
class: inverse, center, middle

# Frequentist approaches to things

---
## Population vs Sample

- Population: all objects belonging to a specified set
  - e.g., all companies in Singapore
- Sample: a (random) subset of the population

.center[<img src="../../../Figures/population_sample.png">]

---
## Parameters vs statistics

> Population parameters vs sample statistics of a given variable (such as the height of boys or the earnings of companies)

- mean
- median/quantile
- mode
- standard deviation/variance
- max/min
- distribution

.center[<img src="../../../Figures/skewness.png">]

---
## Normal distribution

> A normal (Gaussian or bell curve) distribution is a type of continuous probability distribution for a real-valued random variable in which the mean, median and mode are all equal.
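
A minimal sketch in R, using simulated data (the seed and sample size are arbitrary choices), to see that the mean and median of a normal sample are roughly equal:

```r
# Draw 100,000 values from a standard normal and compare mean and median;
# both should be close to 0, which is also the mode (peak) of the bell curve.
set.seed(2021)
x <- rnorm(100000, mean = 0, sd = 1)
c(mean = mean(x), median = median(x))
```
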
.center[<img src="../../../Figures/Normal_Distribution_PDF.svg" height = "400px">]

---
## Sampling error

- The sample statistic will *not* be exactly equal to the population parameter
  - But it should be close
  - How close depends on the sample size
- Confidence interval

`$$\bar{X} \pm Z_{1-\alpha/2} \, \frac{\sigma}{\sqrt{N}}$$`

> where `\(\bar{X}\)` is the sample mean, `\(Z_{1-\alpha/2}\)` is the `\(1-\alpha/2\)` critical value of the standard normal distribution (1.645, 1.96 and 2.58 for 10%, 5%, and 1% respectively, which correspond to confidence levels of 90%, 95% and 99%), `\(\sigma\)` is the known population standard deviation, and `\(N\)` is the sample size.

- The larger the sample size `\(N\)`, the closer the sample statistic is to the population parameter
  - trade-off between data collection costs and the margin of error
  - we typically choose a confidence level (such as 99%) and a margin of error to determine the minimum random sample size `\(N\)`

---
## Law of large numbers

> The law of large numbers states that as a sample size grows, its mean gets closer to the average of the whole population.

- Roll a 6-sided dice (1 to 6); the expected mean is 3.5

```r
# Roll a dice many times and average the results
i <- 1
dice <- 0
times <- 10000
while (i <= times) {
  dice <- dice + sample(1:6, 1)
  i <- i + 1
}
paste("Roll", times, "times dice and the mean is", dice/times)
```

```
## [1] "Roll 10000 times dice and the mean is 3.4843"
```

---
## Central Limit Theorem

> If you take sufficiently large random samples from the population with replacement, then the distribution of the sample means will be approximately normally distributed.

```r
# Draw repeated samples of 30 dice rolls and plot the distribution of sample means
i <- 0; meandice <- c()
while (i <= 10000) {
  meandice <- append(meandice, mean(sample(1:6, 30, replace = TRUE)))
  i <- i + 1
}
hist(meandice, col = "lightgreen", breaks = 20)
abline(v = 3.5, col = "blue")
abline(v = mean(meandice), col = "red")
```

<img src="Session_1s_files/figure-html/unnamed-chunk-5-1.png" width="100%" style="display: block; margin: auto;" />

---
## Hypotheses

- `\(H_0\)`: Null hypothesis
  - The status quo is correct
  - Your proposed model/prediction doesn't work
- `\(H_A\)` or `\(H_1\)`: Alternative hypothesis
  - The model/prediction you are proposing works
- Frequentist statistics can never directly support `\(H_0\)`!
  - Reject `\(H_0\)` (a.k.a. find support for `\(H_A\)`) if the `\(p\)`-value < a significance level (such as 5% or 1%)
  - Fail to reject `\(H_0\)` (a.k.a. fail to find support for `\(H_A\)`) if the `\(p\)`-value >= the significance level
  - We can roughly understand the `\(p\)`-value as the probability of observing our result if `\(H_0\)` were true
  - We will discuss this more later

> Even if our `\(p\)`-value is 1, we can't say that the results prove the null hypothesis!
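
---
## Hypothesis testing in R: a minimal sketch

> A minimal sketch using simulated dice rolls and base R's `t.test()`, showing how the reject / fail-to-reject decision looks in code. The seed and sample size below are arbitrary assumptions for illustration.

```r
# Simulate 1,000 rolls of a fair dice and test H_0: the true mean is 3.5
set.seed(2021)                    # arbitrary seed, for reproducibility
rolls <- sample(1:6, size = 1000, replace = TRUE)

result <- t.test(rolls, mu = 3.5) # one-sample t-test against mu = 3.5
result$p.value                    # typically large here...
result$p.value < 0.05             # ...so we fail to reject H_0 at the 5% level
```
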
---
## OLS terminology

$$ y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \ldots + \varepsilon $$

$$ \hat{y} = \hat{\alpha} + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \ldots $$

- `\(y\)`: The output in our model
  - dependent variable
  - the value we want to predict
- `\(\hat{y}\)`: The *predicted* (fitted) output of our model
- `\(x_i\)`: An input in our model
  - independent variables
  - features
  - predictors
- `\(\hat{~}\)`: Something *estimated* from the data, "caret" or "hat"
- `\(\alpha\)`: A constant (intercept), the expected value of `\(y\)` when all `\(x_i\)` are 0
- `\(\beta_i\)`: A coefficient on an input to our model
- `\(\hat{\alpha}\)`, `\(\hat{\beta}_i\)`: The *estimated* intercept and coefficients
- `\(\varepsilon\)`: The error term
  - Its estimate, `\(\hat{\varepsilon} = y - \hat{y}\)`, is the *residual* from the regression
  - What's left when you take the actual `\(y\)` minus the model's prediction

---
## OLS statistical properties

$$ y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \ldots + \varepsilon $$

$$ \hat{y} = \hat{\alpha} + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \ldots $$

1. There should be a *linear* relationship between `\(y\)` and each `\(x_i\)`
   - i.e., `\(y\)` is [approximated by] a constant multiple of each `\(x_i\)`
   - Otherwise we **shouldn't** use a *linear* regression
2. The error term `\(\varepsilon\)` is normally distributed
   - Not so important with larger data sets, but good practice to adhere to
3. Each observation is independent
   - We'll violate this one for the sake of *causality*
4. Homoskedasticity: the variance of the errors is constant
   - This is important
5. Not too much multicollinearity
   - Each `\(x_i\)` should be relatively independent of the others
   - Some is OK

---
class: inverse, center, middle

# Linear model implementation

---
## What exactly is a linear model?

- Anything OLS is linear
- Many transformations can be recast to linear
  - Ex.: `\(log(y) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 {x_1}^2 + \beta_4 x_1 \cdot x_2\)`
  - This is the same as `\(y' = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4\)` where:
    - `\(y' = log(y)\)`
    - `\(x_3 = {x_1}^2\)`
    - `\(x_4 = x_1 \cdot x_2\)`

> Linear models are *very* flexible

.center[source: wikipedia <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Linear_regression.svg/1200px-Linear_regression.svg.png" height = "250px">]

---
## Mental model of OLS: 1 input

.center[<img src="../../../Figures/OLS_model.png" width="600px">]

> Simple OLS measures a simple linear relationship between an input and an output

- e.g.: Future revenue regressed on assets

---
## Multiple inputs

.center[<img src="../../../Figures/OLS_model_m.png" height="400px">]

> OLS measures simple linear relationships between a set of inputs and one output

- e.g.: Future revenue regressed on multiple accounting and macro variables

---
## Model selection

> We will introduce many models. Pick what fits your problem!

- For forecasting a quantity:
  - Usually some sort of linear model estimated using OLS
- For forecasting a binary outcome:
  - Usually logit or a related model
- For forensics:
  - Usually logit or a related model

> automated model selection

.center[<a target="_blank" href = "https://www.kdnuggets.com/2020/02/data-scientists-automl-replace.html"><img src = "../../../Figures/automl.png" height = "250px"></a>]

---
## Variable selection

> Feature engineering

- The options:
  1. Use your own knowledge to select variables
  2. Use a selection model to automate it

.pull-left[
**Own knowledge**

- Build a model based on your knowledge of the problem and situation
- This is generally better
  - The result should be more interpretable
  - For prediction, you should know the relationships better than most algorithms do
]

.pull-right[
<img src="../../../Figures/brain.png" height="200px">
]

---
## Automated variable selection

- Traditional methods include:
  - Forward selection: start with nothing and add the variables that contribute most to Adj. `\(R^2\)` until it stops going up
  - Backward selection: start with all inputs and remove the variables with the worst (negative) contribution to Adj. `\(R^2\)` until it stops going up
  - Stepwise selection: like forward selection, but drops non-significant predictors along the way
- Newer methods:
  - Lasso and Elastic Net based models
    - Optimize with high penalties for complexity (i.e., the # of inputs)
  - We will discuss these in future sessions

.center[<img src="../../../Figures/artificial-neural-network_640.png" height="150px">]

---
## The overfitting problem

> Or: Why do we like simpler models so much?

- Overfitting happens when a model fits the in-sample data *too well*...
  - To the point where it also models any idiosyncrasies or errors in the data
  - This harms prediction performance
  - Directly harming our forecasts

> An overfitted model works really well on its own data, and quite poorly on new data

.center[<a target = "_blank" href = "https://medium.com/greyatom/what-is-underfitting-and-overfitting-in-machine-learning-and-how-to-deal-with-it-6803a989c76"><img src = "https://miro.medium.com/max/1125/1*_7OPgojau8hkiPUiHoGK_w.png" height = "250px"></a>]

---
class: inverse, center, middle

# Statistical tests and Interpretation

---
## Coefficients

- In OLS: `\(\beta_i\)`

.pull-left[
- A change in `\(x_i\)` by 1 unit leads to a change in `\(y\)` by `\(\beta_i\)`, holding the other inputs constant
  - Essentially, the slope between `\(x_i\)` and `\(y\)`
- The blue line in the chart is the regression line for `\(\widehat{Revenue} = \hat{\alpha} + \hat{\beta} \, Assets\)` for retail firms since 1960 (see the minimal `lm()` sketch at the end of these slides)
]

.pull-right[
<img src="Session_1s_files/figure-html/unnamed-chunk-6-1.png" width="100%" style="display: block; margin: auto;" />
]

---
## P-values

- `\(p\)`-values tell us the probability of obtaining a result at least as extreme as ours by random chance alone (i.e., if the null hypothesis is true)

> "The P value is defined as the probability under the assumption of no effect or no difference (null hypothesis), of obtaining a result equal to or more extreme than what was actually observed." <br> -- Dahiru 2008

- These are very useful, particularly for a frequentist approach
  - First used in the 1700s, but popularized by Ronald Fisher in the 1920s and 1930s
- If `\(p<0.05\)` and the coefficient matches our mental model, we can consider this as supporting our model (i.e. rejecting the null)
- If `\(p<0.05\)` but the coefficient is opposite, then it suggests a problem with our model
- If `\(p>0.10\)`, we fail to find support for the alternative hypothesis
- If `\(0.05 < p < 0.10\)`, it depends...
  - For a small dataset or a complex problem, we can use `\(0.10\)` as a cutoff
  - For a huge dataset or a simple problem, we should use `\(0.05\)` (or an even stricter cutoff)

---
## R-square

- `\(R^2\)` = Explained variation / Total variation
  - Variation = difference of the observed output variable from its own mean
- A high `\(R^2\)` indicates that the model fits the data very well
- A low `\(R^2\)` indicates that the model is missing much of the variation in the output
- `\(R^2\)` is technically a *biased* estimator
  - adding more independent variables raises `\(R^2\)`, even if they add no real information
- Adjusted `\(R^2\)` downweights `\(R^2\)` to correct for this
  - `\(R^2_{Adj} = P * R^2 + 1 - P\)`
  - Where `\(P=\frac{n-1}{n-p-1}\)`
    - `\(n\)` is the number of observations
    - `\(p\)` is the number of inputs in the model

---
class: inverse, center, middle

# Causality

---
## What is causality?

`\(A \rightarrow B\)`

- Causality is `\(A\)` *causing* `\(B\)`
  - This means more than `\(A\)` and `\(B\)` being correlated
- i.e., if `\(A\)` changes, `\(B\)` changes. But `\(B\)` changing doesn't mean `\(A\)` changed
  - Unless `\(B\)` is 100% driven by `\(A\)`
- Very difficult to determine, particularly for events that happen [almost] simultaneously
- <a target="_blank" href="http://tylervigen.com/spurious-correlations">Examples of correlations that aren't causation</a>

.center[<a target="_blank" href="https://xkcd.com/552/"><img src="https://imgs.xkcd.com/comics/correlation.png" height="200px"></a>]

---
## Time and causality

`\(A \rightarrow B\)` or `\(A \leftarrow B\)`?

`\(A_t \rightarrow B_{t+1}\)`

- If there is a separation in time, it's easier to say `\(A\)` caused `\(B\)`
  - Observe `\(A\)`, then see if `\(B\)` changes afterwards
- Conveniently, we have this structure when forecasting
  - e.g.:

$$ Revenue_{t+1} = Revenue_t + \ldots $$

---
## Time and causality break down

`\(A_t \rightarrow B_{t+1}\)`? `\(\quad\)` OR `\(\quad\)` `\(C \rightarrow A_t\)` and `\(C \rightarrow B_{t+1}\)`?

- The above illustrates the *correlated omitted variable problem*
  - `\(A\)` doesn't cause `\(B\)`... Instead, some other force `\(C\)` causes both
  - The bane of social scientists everywhere
- This is less important for predictive analytics, as we care more about performance, but...
  - It can complicate interpreting your results
  - Figuring out `\(C\)` can help improve your model's predictions
  - So find `\(C\)`!

.center[<a target="_blank" href="https://xkcd.com/925/"><img src="https://imgs.xkcd.com/comics/cell_phones.png" height="200px"></a>]

---
## Discussion

> Some executives believe that all they need to do is establish correlation. Wrong!

.pull-left[
- **A/B Test:** <a target="_blank" href="https://hbr.org/2017/09/the-surprising-power-of-online-experiments">The Surprising Power of Online Experiments</a>
- Further reading: <a target="_blank" href="https://www.r-bloggers.com/causal-inference-cheat-sheet-for-data-scientists/">Causal Inference cheat sheet for data scientists</a>

> So does causation imply correlation?

.center[source: wikipedia <img src="../../../Figures/Correlation_examples.svg">]
]

.pull-right[
.center[<img src="../../../Figures/a_b_test_bing.png" height="400px">]
]

---
class: inverse, center, middle

# Statistics vs. Data Science Jargon

---
## Statistics vs. Data Science Jargon

<table class=" lightable-paper" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'>
 <thead>
  <tr>
   <th style="text-align:left;"> Statistics </th>
   <th style="text-align:left;"> Data_Science </th>
   <th style="text-align:left;"> Meaning </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> estimation </td>
   <td style="text-align:left;"> learning </td>
   <td style="text-align:left;"> use data to estimate an unknown parameter (mean, variance, model coefficients, etc.) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> regression </td>
   <td style="text-align:left;"> regression/supervised learning </td>
   <td style="text-align:left;"> Predict a continuous value of Y using values of other variables X </td>
  </tr>
  <tr>
   <td style="text-align:left;"> classification </td>
   <td style="text-align:left;"> classification/supervised learning </td>
   <td style="text-align:left;"> Predict a discrete value of Y using values of other variables X </td>
  </tr>
  <tr>
   <td style="text-align:left;"> clustering </td>
   <td style="text-align:left;"> clustering/unsupervised learning </td>
   <td style="text-align:left;"> Group the data based on some variables X </td>
  </tr>
  <tr>
   <td style="text-align:left;"> in/out-of sample </td>
   <td style="text-align:left;"> training/testing sample </td>
   <td style="text-align:left;"> data used for training/testing models </td>
  </tr>
  <tr>
   <td style="text-align:left;"> independent variable </td>
   <td style="text-align:left;"> feature </td>
   <td style="text-align:left;"> the predictor variables (X) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> dependent variable </td>
   <td style="text-align:left;"> label </td>
   <td style="text-align:left;"> the outcome variable to be predicted (Y) </td>
  </tr>
</tbody>
</table>

---
class: inverse, center, middle

# Summary of Session 1

---
## For next week

- Start the **"Data Analyst with R"** career track on Datacamp
- Review the statistics foundations
- Pick a book on R and study it, such as <a target=_blank href="https://rc2e.com/index.html">R Cookbook</a> or <a target=_blank href="https://r4ds.had.co.nz/">R for Data Science</a>
- Install [R](https://cran.rstudio.com/) and [RStudio](https://www.rstudio.com/products/rstudio/download/#download) if you have not done so

---
## R packages used in this slide

This slide was created in Jan 2019 from Session_1s.Rmd and updated on 2021-10-01 with R version 4.1.1 (2021-08-10, "Kick Things") on Windows 10 x64 build 18362 😄. The attached packages used in this slide are:

```
##   forcats   stringr     dplyr     purrr     readr     tidyr    tibble 
##   "0.5.1"   "1.4.0"   "1.0.7"   "0.3.4"   "2.0.1"   "1.1.3"   "3.1.3" 
## tidyverse   ggplot2    ngramr kableExtra     knitr 
##   "1.3.1"   "3.3.5"   "1.7.4"   "1.3.4"    "1.33" 
```
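
---
## OLS in R: a minimal sketch

> A minimal sketch using simulated data (the seed, variable names and true coefficients below are arbitrary assumptions, not course data): fit an OLS regression with `lm()` and read off the coefficients, `\(p\)`-values and `\(R^2\)` discussed earlier.

```r
# Simulate data from a known linear model: y = 1 + 2*x + noise
set.seed(2021)
x <- rnorm(200)
y <- 1 + 2 * x + rnorm(200)

fit <- lm(y ~ x)         # OLS estimation
summary(fit)             # coefficients, p-values, R^2 and adjusted R^2
coef(fit)                # just the estimated alpha-hat and beta-hat
summary(fit)$r.squared   # just R^2
```
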
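
---
## Getting set up: a sketch

> A minimal sketch of how you might install and load the packages listed above after installing R and RStudio; treat the exact package list as an assumption rather than a required step.

```r
# Install the packages used in these slides (run once in the RStudio console).
# The tidyverse bundle includes ggplot2, dplyr, tidyr, readr, purrr, tibble,
# stringr and forcats.
install.packages(c("tidyverse", "ngramr", "kableExtra", "knitr"))

# Load a package for the current session.
library(tidyverse)

# Check which R version you are running.
R.version.string
```
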