class: center, middle, inverse, title-slide # Programming with Data ## Session 7: Forecasting with Logistic Regressions ### Dr. Wang Jiwei ### Master of Professional Accounting --- class: inverse, center, middle <!-- Define html_df for displaying small tables in html format --> # Binary outcomes --- ## What are binary outcomes? - Thus far we have talked about events with continuous outcomes - Revenue, earnings, ROA, etc - Binary outcomes only have two possible outcomes - Did something happen, *yes* or *no*? - Is a statement *true* or *false*? -- - Financial accounting: - Will the company's earnings meet analysts' expectations? - Will the company have positive earnings? - Managerial accounting: - Will we have problem with our supply chain? - Will our customer go bankrupt? - Audit: - Is the company committing fraud? - Taxation: - Is the company too aggressive in their tax positions? - Management and strategy - Does the business model change? > We can assign a probability to any of these --- ## Binary classficiation algos - Popular algorithms that can be used for [binary classification](https://en.wikipedia.org/wiki/Binary_classification) include: - Logistic Regression (today's session) - Decision Trees (to be covered) - [k-Nearest Neighbors](https://www.datacamp.com/community/tutorials/machine-learning-in-r) - [Support Vector Machine](http://uc-r.github.io/svm) - [Naive Bayes](https://uc-r.github.io/naive_bayes) --- ## Logistic regression - When modeling a binary outcome, we use logistic regression - A.k.a. logit model - The *logit* function is `\(logit(p) = \text{log}\left(\frac{p}{1-p}\right)\)` - Also called *log odds*, see next slide $$ \text{log}\left(\frac{\text{Prob}(y=1|X)}{1-\text{Prob}(y=1|X)}\right)=\alpha + \beta_1 x_1 + \beta_2 x_2 + \ldots + \varepsilon $$ - The **sign** of the coefficients means the same as before - **+**: *increases* the likelihood of `\(y\)` occurring - **-**: *decreases* the likelihood of `\(y\)` occurring .center[<img src = "../../../Figures/LogitLayout.png" height = "200px">] --- ## Odds vs probability .center[<img src="../../../Figures/odds.png" height="500px">] --- class: inverse, middle, center # Logistic regression interpretation --- ## Interpreting logit values - The level of the coefficient is different - The relationship isn't linear between `\(x_i\)` and `\(y\)` now - Instead, coefficient is in *log odds* - Thus, `\(e^{X\beta}\)` gives you the *odds*, `\(o=\frac{p}{1-p}\)` - To get probability, `\(p\)`, we can calculate `\(p=\frac{o}{1+o}\)` - Interpretation: for a one-unit increase in `\(x_i\)` - the *log odds* for `\(y\)` = 1 increase by `\(\beta\)` (same as the OLS), holding others at a fixed value - the *odds* for `\(y\)` = 1 increase by `\((e^\beta-1)\)` times of baseline odds (ie, odds before the change), holding others at a fixed value - `\(log(o2) - log(o1) = \beta\)` - `\(log(o2/o1) = \beta\)` - `\(o2/o1 = e^\beta\)` - `\(o2 - o1 = (e^\beta - 1) * o1\)` - you need to sum all relevant log odds before converting to probability - <a target="_blank" href="https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-how-do-i-interpret-odds-ratios-in-logistic-regression/">Click here for a review</a> --- ## Implement logit regression - The logistic model is related to our previous linear models as such: > Both linear and logit models are under the class of General Linear Models (GLMs) - To regress a GLM, we use the `glm()` command. - To run a logit regression: ```r mod <- glm(y ~ x1 + x2 + x3 + ..., data = df, family = binomial) summary(mod) ``` > `family = binomial` is what sets the model to be a logit - In fact, the `lm()` we have been using is actually `glm()` when you specify the option `family = gaussian` --- ## Example logit regression > Do holidays increase the likelihood that a department more than doubles its store's average weekly sales across departments? ```r # Create the binary variable from Walmart sales data df$double <- ifelse(df$Weekly_Sales > df$store_avg * 2, 1, 0) model1 <- glm(double ~ IsHoliday, data = df, family = binomial) tidy(model1) ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -3.45 0.00924 -373. 0 ## 2 IsHolidayTRUE 0.539 0.0278 19.4 1.09e-83 ``` > Holidays increase the odds... but by how much? - There are two ways to interpret this: 1. Coefficient by coefficient 2. In total --- ## Interpretting specific coefficients $$ logodds(Double~sales) = -3.45 + 0.54 IsHoliday $$ - Interpreting specific coefficients is easiest done manually - Odds for the `\(IsHoliday\)` coefficient are `exp(0.54) = 1.72` - This means that having a holiday modifies the baseline (i.e., non-Holiday) odds by 1.72 to 1 - Where 1 to 1 is considered no change (`exp(0) = 1`) - Baseline `exp(-3.45)` is 0.032 to 1 ```r # Automating the above: exp(coef(model1)) ``` ``` ## (Intercept) IsHolidayTRUE ## 0.03184725 1.71367497 ``` --- ## Interpretting in total - It is important to note that log odds are additive - So, calculate a new log odd by plugging in values for variables and adding it all up - Holiday: `\(-3.45 + 0.54 * 1 = -2.91\)` - No holiday: `\(-3.45 + 0.54 * 0 = -3.45\)` - Then calculate odds and log odds like before - With holiday: `exp(-2.91) = 0.055` - Without holiday: `exp(-3.45) = 0.032` - Ratio of holiday to without: 1.72! - This is the individual log odds for holiday > We need to specify values to calculate log odds in total --- ## Converting to probabilities - We can calculate a probability at any given point using the log odds $$ Probability = \frac{odds}{odds + 1} $$ - Probability of double sales... - With a holiday: `0.055 / (0.055 + 1) = 0.052` - Without a holiday: `0.032 / (0.032 + 1) = 0.031` > These are easier to interpret, but require specifying values --- ## Using predict() to simplify it - `predict()` can calculate log odds and probabilities for us with minimal effort - Specify `type = "response"` to get probabilities ```r IsHoliday <- c(FALSE, TRUE) test_data <- as.data.frame(IsHoliday) predict(model1, test_data) # log odds if no type = "response" ``` ``` ## 1 2 ## -3.446804 -2.908164 ``` ```r predict(model1, test_data, type = "response") #probabilities ``` ``` ## 1 2 ## 0.03086431 0.05175146 ``` - Here, we see the baseline probability is 3.1% - The probability of doubling sales on a holiday is higher, at 5.2% --- ## R practice: Logit - A continuation of last session's practices answering: - Is Walmart more likely to see a year over year decrease in quarterly revenue during a recession? - Practice using `mutate()` and `glm()` - Do exercises 1 and 2 in the practice file - <a target="_blank" href="Session_7s_Exercise.html">R Practice</a> --- ## What about more complex model? ```r model2 <- glm(double ~ IsHoliday + Temperature + Fuel_Price, data = df, family = binomial) tidy(model2) ``` ``` ## # A tibble: 4 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -1.78 0.0673 -26.4 1.93e-153 ## 2 IsHolidayTRUE 0.370 0.0284 13.0 8.80e- 39 ## 3 Temperature -0.0108 0.000470 -23.0 1.70e-117 ## 4 Fuel_Price -0.309 0.0196 -15.8 6.20e- 56 ``` ```r # Odds exp(coef(model2)) ``` ``` ## (Intercept) IsHolidayTRUE Temperature Fuel_Price ## 0.1692308 1.4483570 0.9892316 0.7340376 ``` > We need to specify values for all inputs to determine probabilities, ie, the impact of each input depends on the values of the others! --- ## Probabilities ```r # Average probability in September hday_sep <- mean(predict(model2, filter(df, IsHoliday, month == 9), type = "response")) no_hday_sep <- mean(predict(model2, filter(df, !IsHoliday, month == 9), type = "response")) # Average probability in December hday_dec <- mean(predict(model2, filter(df, IsHoliday, month == 12), type="response")) no_hday_dec <- mean(predict(model2, filter(df, !IsHoliday, month == 12), type="response")) html_df(data.frame(Month=c(9, 9, 12, 12), IsHoliday=c(FALSE, TRUE, FALSE, TRUE), Probability=c(no_hday_sep, hday_sep, no_hday_dec, hday_dec))) ``` <table class="table table-striped table-hover" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Month </th> <th style="text-align:center;"> IsHoliday </th> <th style="text-align:center;"> Probability </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 9 </td> <td style="text-align:center;"> FALSE </td> <td style="text-align:center;"> 0.0266789 </td> </tr> <tr> <td style="text-align:left;"> 9 </td> <td style="text-align:center;"> TRUE </td> <td style="text-align:center;"> 0.0374761 </td> </tr> <tr> <td style="text-align:left;"> 12 </td> <td style="text-align:center;"> FALSE </td> <td style="text-align:center;"> 0.0398377 </td> </tr> <tr> <td style="text-align:left;"> 12 </td> <td style="text-align:center;"> TRUE </td> <td style="text-align:center;"> 0.0586483 </td> </tr> </tbody> </table> --- ## A bit easier: Marginal effects > Marginal effects tell us the change in our output for a change of 1 unit to an input - Marginal effects are partial derivatives (slope of the tangent line) of a regression equation with respect to each variable in the model for each unit in the data - In OLS regression with no interactions or higher-order term (such as polynomial functions with quadratic terms `\(x^2\)`), the estimated coefficients are marginal effects - Using [`package:margins`](https://github.com/leeper/margins), we can calculate marginal effects - There are a few types that we could calculate: - An *Average Marginal Effect* tells us what the average effect of an input is across all values in our data - This is the default method in the package - We can also specify a specific value to calculate marginal effects at --- ## Logistic 2D/3D curve > Marginal effect means the partial derivative of any given point on the surface .pull-left[ .center[<img src = "../../../Figures/logistic_3d.png">] ] .pull-right[ .center[<img src = "../../../Figures/logistic_2d.png">] ] --- ## Marginal effects in action ```r # Calculate Average Marginal Effects (AME) # It will take a while library(margins) m <- margins(model2) m ``` ``` ## Temperature Fuel_Price IsHoliday ## -0.0003377 -0.009644 0.01334 ``` - By default, the margins() returns the Average Marginal Effect (AME) - A holiday increases the probability of doubling by a flat 1.33% - Not too bad when you consider that the probability of doubling is 3.23% - If the temperature goes up by 1°F (0.55°C), the probability of doubling changes by -0.03% - If the fuel price increases by 1 USD for 1 gallon of gas, the probability of doubling changes by -0.96% --- ## `package:margins` niceties - We can get some extra information about our marginal effects through `summary()`: ```r summary(m) %>% html_df() ``` <table class="table table-striped table-hover" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> factor </th> <th style="text-align:center;"> AME </th> <th style="text-align:center;"> SE </th> <th style="text-align:center;"> z </th> <th style="text-align:center;"> p </th> <th style="text-align:center;"> lower </th> <th style="text-align:center;"> upper </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Fuel_Price </td> <td style="text-align:center;"> -0.0096438 </td> <td style="text-align:center;"> 0.0006163 </td> <td style="text-align:center;"> -15.64800 </td> <td style="text-align:center;"> 0 </td> <td style="text-align:center;"> -0.0108517 </td> <td style="text-align:center;"> -0.0084359 </td> </tr> <tr> <td style="text-align:left;"> IsHoliday </td> <td style="text-align:center;"> 0.0133450 </td> <td style="text-align:center;"> 0.0011754 </td> <td style="text-align:center;"> 11.35372 </td> <td style="text-align:center;"> 0 </td> <td style="text-align:center;"> 0.0110413 </td> <td style="text-align:center;"> 0.0156487 </td> </tr> <tr> <td style="text-align:left;"> Temperature </td> <td style="text-align:center;"> -0.0003377 </td> <td style="text-align:center;"> 0.0000149 </td> <td style="text-align:center;"> -22.71255 </td> <td style="text-align:center;"> 0 </td> <td style="text-align:center;"> -0.0003668 </td> <td style="text-align:center;"> -0.0003085 </td> </tr> </tbody> </table> - Those p-values work just like with our linear models - We also get a confidence interval --- ## Plotting marginal effects ```r # Note: The `which...` part is absolutely necessary at the moment # due to a bug in the package (mismatch of factors and AME values # you may try to remove `which...` to see what happened plot(m, which = summary(m)$factor) ``` <img src="Session_7s_files/figure-html/unnamed-chunk-16-1.png" width="100%" style="display: block; margin: auto;" /> --- ## Marginal effects at a specified value ```r margins(model2, at = list(IsHoliday = c(TRUE, FALSE)), variables = c("Temperature", "Fuel_Price")) %>% summary() %>% html_df() ``` <table class="table table-striped table-hover" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> factor </th> <th style="text-align:center;"> IsHoliday </th> <th style="text-align:center;"> AME </th> <th style="text-align:center;"> SE </th> <th style="text-align:center;"> z </th> <th style="text-align:center;"> p </th> <th style="text-align:center;"> lower </th> <th style="text-align:center;"> upper </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Fuel_Price </td> <td style="text-align:center;"> FALSE </td> <td style="text-align:center;"> -0.0093401 </td> <td style="text-align:center;"> 0.0005989 </td> <td style="text-align:center;"> -15.59617 </td> <td style="text-align:center;"> 0 </td> <td style="text-align:center;"> -0.0105139 </td> <td style="text-align:center;"> -0.0081664 </td> </tr> <tr> <td style="text-align:left;"> Fuel_Price </td> <td style="text-align:center;"> TRUE </td> <td style="text-align:center;"> -0.0131335 </td> <td style="text-align:center;"> 0.0008717 </td> <td style="text-align:center;"> -15.06650 </td> <td style="text-align:center;"> 0 </td> <td style="text-align:center;"> -0.0148420 </td> <td style="text-align:center;"> -0.0114250 </td> </tr> <tr> <td style="text-align:left;"> Temperature </td> <td style="text-align:center;"> FALSE </td> <td style="text-align:center;"> -0.0003271 </td> <td style="text-align:center;"> 0.0000146 </td> <td style="text-align:center;"> -22.46024 </td> <td style="text-align:center;"> 0 </td> <td style="text-align:center;"> -0.0003556 </td> <td style="text-align:center;"> -0.0002985 </td> </tr> <tr> <td style="text-align:left;"> Temperature </td> <td style="text-align:center;"> TRUE </td> <td style="text-align:center;"> -0.0004599 </td> <td style="text-align:center;"> 0.0000210 </td> <td style="text-align:center;"> -21.92927 </td> <td style="text-align:center;"> 0 </td> <td style="text-align:center;"> -0.0005010 </td> <td style="text-align:center;"> -0.0004188 </td> </tr> </tbody> </table> - specify the values through *at* argument, *variables* for the features of changes > On a holiday, if fuel price changes by 1 unit, the probability of doubling changes by -1.31% --- ## Marginal effects at a specified value ```r margins(model2, at = list(Temperature = c(0, 20, 40, 60, 80)), variables = c("IsHoliday")) %>% summary() %>% html_df() ``` <table class="table table-striped table-hover" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> factor </th> <th style="text-align:center;"> Temperature </th> <th style="text-align:center;"> AME </th> <th style="text-align:center;"> SE </th> <th style="text-align:center;"> z </th> <th style="text-align:center;"> p </th> <th style="text-align:center;"> lower </th> <th style="text-align:center;"> upper </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> IsHoliday </td> <td style="text-align:center;"> 0 </td> <td style="text-align:center;"> 0.0234484 </td> <td style="text-align:center;"> 0.0020168 </td> <td style="text-align:center;"> 11.62643 </td> <td style="text-align:center;"> 0 </td> <td style="text-align:center;"> 0.0194955 </td> <td style="text-align:center;"> 0.0274012 </td> </tr> <tr> <td style="text-align:left;"> IsHoliday </td> <td style="text-align:center;"> 20 </td> <td style="text-align:center;"> 0.0194072 </td> <td style="text-align:center;"> 0.0016710 </td> <td style="text-align:center;"> 11.61387 </td> <td style="text-align:center;"> 0 </td> <td style="text-align:center;"> 0.0161320 </td> <td style="text-align:center;"> 0.0226824 </td> </tr> <tr> <td style="text-align:left;"> IsHoliday </td> <td style="text-align:center;"> 40 </td> <td style="text-align:center;"> 0.0159819 </td> <td style="text-align:center;"> 0.0013885 </td> <td style="text-align:center;"> 11.51001 </td> <td style="text-align:center;"> 0 </td> <td style="text-align:center;"> 0.0132604 </td> <td style="text-align:center;"> 0.0187033 </td> </tr> <tr> <td style="text-align:left;"> IsHoliday </td> <td style="text-align:center;"> 60 </td> <td style="text-align:center;"> 0.0131066 </td> <td style="text-align:center;"> 0.0011592 </td> <td style="text-align:center;"> 11.30623 </td> <td style="text-align:center;"> 0 </td> <td style="text-align:center;"> 0.0108345 </td> <td style="text-align:center;"> 0.0153786 </td> </tr> <tr> <td style="text-align:left;"> IsHoliday </td> <td style="text-align:center;"> 80 </td> <td style="text-align:center;"> 0.0107120 </td> <td style="text-align:center;"> 0.0009732 </td> <td style="text-align:center;"> 11.00749 </td> <td style="text-align:center;"> 0 </td> <td style="text-align:center;"> 0.0088046 </td> <td style="text-align:center;"> 0.0126193 </td> </tr> </tbody> </table> > At 0 temperature, a holiday will result in 2.34% increase of probability of doubling. --- class: inverse, center, middle # Today's Application: Shipping delays --- ## The question > Can we leverage global weather data to predict shipping delays? .center[<img src="../../../Figures/Singapore_port.jpg" height="400px">] --- ## A bit about shipping data - WRDS doesn't have shipping data - There are, however, vendors for shipping data, such as: <a target="_blank" href="https://www.marinetraffic.com/en/ais/details/ports/290/Singapore_port:SINGAPORE"><img src="../../../Figures/Marine_traffic.png" height="120px"></a> - They pretty much have any data you could need: - Over 650,000 ships tracked using ground and satellite based AIS - AIS: Automatic Identification System - Live mapping - Weather data - Fleet tracking - Port congestion - <a target="_blank" href="https://www.inmarsat.com/">Inmarsat</a> support for ship operators --- ## What can we see from naval data? .pull-left[ Cruise: Quantum of the Seas .center[<img src="../../../Figures/Ships_Quantum_Seas.JPG" width = "400px">] Oil tankers in the Persian gulf .center[<img src="../../../Figures/Ships_persiangulf_oil.png" width = "400px">] ] -- .pull-right[ Busiest port for transshipment (Singapore) <img src="../../../Figures/Ships_Singapore_busy-trans.png" height = "200px"> Busiest ports by containers and tons (Shanghai & Ningbo-Zhoushan, China) <img src="../../../Figures/Ships_Ningbo-Zhoushan_busy-tonnage.png" height = "200px"> ] --- ## Singaporean owned ships
--- ## Code for last slide's map ```r library(plotly) # for plotting library(RColorBrewer) # for colors # plot with boats, ports, and typhoons # Note: geo is defined in the code file -- it controls layout palette = brewer.pal(8, "Dark2")[c(1, 8, 3, 2)] p <- plot_geo(colors = palette) %>% add_markers(data = df_ports, x = ~port_lon, y = ~port_lat, color = "Port") %>% add_markers(data = df_Aug31, x = ~lon, y = ~lat, color = ~ship_type, text = ~paste('Ship name', shipname)) %>% add_markers(data = typhoon_Aug31, x = ~lon, y = ~lat, color="TYPHOON", text = ~paste("Name", typhoon_name)) %>% layout(showlegend = TRUE, geo = geo, title = 'Singaporean owned container and tanker ships, August 31, 2018') p ``` - `plot_geo()` is from [`package:plotly`](https://plotly-r.com) - `add_markers()` adds points to the map - `layout()` adjusts the layout - Within geo, a list, the following makes the map a globe - `projection = list(type = "orthographic")` --- ## What might matter for shipping? > What observable events or data might provide insight as to whether a naval shipment will be delayed or not? -- 1. Typhoons --- ## Typhoon Jebi .center[<iframe width="560" height="315" src="https://www.youtube.com/embed/kH9o1CSs5Rk?rel=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>] - <a target="_blank" href="https://www.youtube.com/embed/kH9o1CSs5Rk?rel=0">click to play on youtube</a> --- ## Typhoon Jebi .center[<iframe width="800px" height="500px" src="https://earth.nullschool.net/#2018/09/03/2100Z/wind/surface/level/orthographic=-221.34,35.55,806/loc=133.763,27.561"></iframe>] --- ## Typhoons in the data
--- ## Code for last slide's map ```r # plot with boats and typhoons palette = brewer.pal(8, "Dark2")[c(1, 3, 2)] p <- plot_geo(colors = palette) %>% add_markers(data = df_all[df_all$frame == 14,], x = ~lon, y = ~lat, color = ~ship_type, text = ~paste('Ship name', shipname)) %>% add_markers(data = typhoon_Jebi, x = ~lon, y = ~lat, color = "Typhoon Jebi", text = ~paste("Name", typhoon_name, "</br>Time: ", date)) %>% layout(showlegend = TRUE, geo = geo, title = 'Singaporean ships, September 4, 2018, evening') p ``` - This map is made the same way as the first map --- ## R Practice on mapping - Practice interactive mapping using typhoon data - 1 map using [`package:plotly`](https://plotly-r.com) - Bonus practice - 1 map using [`package:leaflet`](https://rstudio.github.io/leaflet/) - There is another interactive mapping [`package:sf`](https://rdrr.io/r/base/library.html) but its installation is not friendly on a Mac - Do exercises 3 and 4 in the practice file - <a target="_blank" href="Session_7s_Exercise.html">R Practice</a> --- class: middle, inverse, center # Predicting delays due to typhoons --- ## Data - If the ship will report a delay of at least 3 hours some time in the next 12-24 hours - What we have: - Ship location - Typhoon location - Typhoon wind speed > We need to calculate distance between ships and typhoons --- ## Distance for geo - There are a number of formulas for this - <a target="_blank" href="https://en.wikipedia.org/wiki/Haversine_formula">*Haversine*</a> for a simple calculation - <a target="_blank" href="https://en.wikipedia.org/wiki/Vincenty's_formulae">*Vincenty's formulae*</a> for a complex, incredibly accurate calculation - Accurate within **0.5mm** - Use `distVincentyEllipsoid()` from `package:geosphere` to get a reasonably quick and accurate calculation - Calculates distance between two sets of points, x and y, structured as matrices - Matrices must have longitude in the first column and latitude in the second column - Provides distance in meters by default ```r library(geosphere) x <- as.matrix(df3[ , c("lon", "lat")]) # ship location y <- as.matrix(df3[ , c("ty_lon", "ty_lat")]) # typhoon location df3$dist_typhoon <- distVincentyEllipsoid(x, y) / 1000 # convert to KM ``` --- ## Clean up - Some indicators to cleanly capture how far away the typhoon is ```r df3$typhoon_500 = ifelse(df3$dist_typhoon < 500 & df3$dist_typhoon >= 0, 1, 0) df3$typhoon_1000 = ifelse(df3$dist_typhoon < 1000 & df3$dist_typhoon >= 500, 1, 0) df3$typhoon_2000 = ifelse(df3$dist_typhoon < 2000 & df3$dist_typhoon >= 1000, 1, 0) ``` .center[<img src="../../../Figures/ifelse.png">] --- ## Do typhoons delay shipments? ```r fit1 <- glm(delayed ~ typhoon_500 + typhoon_1000 + typhoon_2000, data = df3, family = binomial) summary(fit1) ``` ``` ## ## Call: ## glm(formula = delayed ~ typhoon_500 + typhoon_1000 + typhoon_2000, ## family = binomial, data = df3) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -0.2502 -0.2261 -0.2261 -0.2261 2.7127 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -3.65377 0.02934 -124.547 <2e-16 *** ## typhoon_500 0.14073 0.16311 0.863 0.3883 ## typhoon_1000 0.20539 0.12575 1.633 0.1024 ## typhoon_2000 0.16059 0.07106 2.260 0.0238 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 14329 on 59184 degrees of freedom ## Residual deviance: 14322 on 59181 degrees of freedom ## (3866 observations deleted due to missingness) ## AIC: 14330 ## ## Number of Fisher Scoring iterations: 6 ``` > It appears so! --- ## Interpretation of coefficients ```r odds1 <- exp(coef(fit1)) odds1 ``` ``` ## (Intercept) typhoon_500 typhoon_1000 typhoon_2000 ## 0.02589334 1.15111673 1.22800815 1.17420736 ``` - Ships 1,000 to 2,000 km from a typhoon have a 17% increased odds of having a delay ```r m1 <- margins(fit1) summary(m1) ``` ``` ## factor AME SE z p lower upper ## typhoon_1000 0.0052 0.0032 1.6322 0.1026 -0.0010 0.0115 ## typhoon_2000 0.0041 0.0018 2.2570 0.0240 0.0005 0.0076 ## typhoon_500 0.0036 0.0042 0.8626 0.3883 -0.0046 0.0117 ``` - Ships 1,000 to 2,000 km from a typhoon have an extra 0.41% chance of having a delay (baseline of 2.5%, ie, all x = 0) --- ## Interpretation of coefficients - Alternatively, we may calculate actual probability by summing up all relevant log odds. ```r prob_odds1 <- c(exp(coef(fit1)[1]), exp(coef(fit1)[c(2, 3, 4)] + coef(fit1)[c(1, 1, 1)])) probability1 <- prob_odds1 / (1 + prob_odds1) probability1 ``` ``` ## (Intercept) typhoon_500 typhoon_1000 typhoon_2000 ## 0.02523980 0.02894356 0.03081733 0.02950702 ``` - Ships 1,000 to 2,000 km from a typhoon have a 3% chance of having a delay (baseline of 2.5%) - Note the calculation of odds for each scenario: - all typhoon_ = 0: `\(\alpha\)` - typhoon_500 = 1: `\(\alpha+\beta1\)` - typhoon_1000 = 1: `\(\alpha+\beta2\)` - typhoon_2000 = 1: `\(\alpha+\beta3\)` --- ## What about typhoon intensity? - Hong Kong's typhoon classification: <a target="_blank" href="https://www.weather.gov.hk/en/informtc/class.htm">Official source</a> 1. 41-62 km/h: Tropical depression 2. 63-87 km/h: Tropical storm 3. 88-117 km/h: Severe tropical storm 4. 118-149 km/h: **Typhoon** 5. 150-184 km/h: **Severe typhoon** 6. 185+km/h: **Super typhoon** ```r # 1 knot (nautical mile/h) = 1.852 km/h # cut() makes a categorical variable out of a numerical variable # using specified bins df3$Super <- ifelse(df3$intensity_vmax * 1.852 >= 185, 1, 0) df3$Moderate <- ifelse(df3$intensity_vmax * 1.852 >= 88 & df3$intensity_vmax * 1.852 < 185, 1, 0) df3$Weak <- ifelse(df3$intensity_vmax * 1.852 >= 41 & df3$intensity_vmax * 1.852 < 88, 1, 0) df3$HK_intensity <- cut(df3$intensity_vmax * 1.852, c(-1, 41, 62, 87, 117, 149, 999)) table(df3$HK_intensity) ``` ``` ## ## (-1,41] (41,62] (62,87] (87,117] (117,149] (149,999] ## 3398 12039 12615 11527 2255 21141 ``` --- ## Typhoon intensity and delays ```r fit2 <- glm(delayed ~ (typhoon_500 + typhoon_1000 + typhoon_2000) : (Weak + Moderate + Super), data = df3, family = binomial) tidy(fit2) ``` ``` ## # A tibble: 10 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -3.65 0.0290 -126. 0 ## 2 typhoon_500:Weak -0.00879 0.213 -0.0413 0.967 ## 3 typhoon_500:Moderate 0.715 0.251 2.86 0.00430 ## 4 typhoon_500:Super -8.91 123. -0.0726 0.942 ## 5 typhoon_1000:Weak 0.250 0.161 1.55 0.121 ## 6 typhoon_1000:Moderate 0.123 0.273 0.451 0.652 ## 7 typhoon_1000:Super -0.0269 0.414 -0.0648 0.948 ## 8 typhoon_2000:Weak 0.182 0.101 1.80 0.0723 ## 9 typhoon_2000:Moderate 0.0253 0.134 0.189 0.850 ## 10 typhoon_2000:Super 0.311 0.136 2.29 0.0217 ``` > Moderate storms predict delays when within 500km > Super typhoons predict delays when 1,000 to 2,000km away --- ## Interpretation of coefficients ```r odds2 <- exp(coef(fit2)) odds2[c(1, 3, 10)] ``` ``` ## (Intercept) typhoon_500:Moderate typhoon_2000:Super ## 0.02589637 2.04505487 1.36507575 ``` - Ships within 500km of a moderately strong storm have 104% higher *odds* of being delayed - Ships 1,000 to 2,000km from a super typhoon have 36% higher *odds* --- ## Interpretation of coefficients ```r prob_odds2 <- c(exp(coef(fit2)[1]), exp(coef(fit2)[c(3, 10)] + coef(fit2)[c(1, 1)])) probability2 <- prob_odds2 / (1 + prob_odds2) probability2 ``` ``` ## (Intercept) typhoon_500:Moderate typhoon_2000:Super ## 0.02524268 0.05029586 0.03414352 ``` - Ships within 500km of a moderately strong storm have a 5% chance of being delayed (baseline: 2.5%) - Ships 1,000 to 2,000km from a super typhoon have a 3.4% chance --- ## Marginal effects ```r m2 <- margins(fit2) summary(m2) %>% html_df() ``` <table class="table table-striped table-hover" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> factor </th> <th style="text-align:center;"> AME </th> <th style="text-align:center;"> SE </th> <th style="text-align:center;"> z </th> <th style="text-align:center;"> p </th> <th style="text-align:center;"> lower </th> <th style="text-align:center;"> upper </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Moderate </td> <td style="text-align:center;"> 0.0007378 </td> <td style="text-align:center;"> 0.0006713 </td> <td style="text-align:center;"> 1.0990530 </td> <td style="text-align:center;"> 0.2717449 </td> <td style="text-align:center;"> -0.0005779 </td> <td style="text-align:center;"> 0.0020535 </td> </tr> <tr> <td style="text-align:left;"> Super </td> <td style="text-align:center;"> -0.0050241 </td> <td style="text-align:center;"> 0.0860163 </td> <td style="text-align:center;"> -0.0584087 </td> <td style="text-align:center;"> 0.9534231 </td> <td style="text-align:center;"> -0.1736129 </td> <td style="text-align:center;"> 0.1635647 </td> </tr> <tr> <td style="text-align:left;"> typhoon_1000 </td> <td style="text-align:center;"> 0.0035473 </td> <td style="text-align:center;"> 0.0036186 </td> <td style="text-align:center;"> 0.9802921 </td> <td style="text-align:center;"> 0.3269420 </td> <td style="text-align:center;"> -0.0035450 </td> <td style="text-align:center;"> 0.0106396 </td> </tr> <tr> <td style="text-align:left;"> typhoon_2000 </td> <td style="text-align:center;"> 0.0039224 </td> <td style="text-align:center;"> 0.0017841 </td> <td style="text-align:center;"> 2.1985908 </td> <td style="text-align:center;"> 0.0279070 </td> <td style="text-align:center;"> 0.0004257 </td> <td style="text-align:center;"> 0.0074191 </td> </tr> <tr> <td style="text-align:left;"> typhoon_500 </td> <td style="text-align:center;"> -0.0440484 </td> <td style="text-align:center;"> 0.6803640 </td> <td style="text-align:center;"> -0.0647424 </td> <td style="text-align:center;"> 0.9483791 </td> <td style="text-align:center;"> -1.3775373 </td> <td style="text-align:center;"> 1.2894405 </td> </tr> <tr> <td style="text-align:left;"> Weak </td> <td style="text-align:center;"> 0.0009975 </td> <td style="text-align:center;"> 0.0005154 </td> <td style="text-align:center;"> 1.9353011 </td> <td style="text-align:center;"> 0.0529534 </td> <td style="text-align:center;"> -0.0000127 </td> <td style="text-align:center;"> 0.0020077 </td> </tr> </tbody> </table> - Delays appear to be driven mostly by 2 factors: 1. A typhoon 1,000 to 2,000 km away from the ship 2. Weak typhoons --- class: inverse, center, middle # Summary of Session 7 --- ## For next week - Try to replicate the code - Continue your Datacamp career track - Continue with your project - You can start to explore models --- ## R Coding Style Guide Style is subjective and arbitrary but it is important to follow a generally accepted style if you want to share code with others. I suggest the [The tidyverse style guide](https://style.tidyverse.org/) which is also adopted by [Google](https://google.github.io/styleguide/Rguide.html) with some modification - Highlights of **the tidyverse style guide**: - *File names*: end with .R - *Identifiers*: variable_name, function_name, try not to use "." as it is reserved by Base R's S3 objects - *Line length*: 80 characters - *Indentation*: two spaces, no tabs (RStudio by default converts tabs to spaces and you may change under global options) - *Spacing*: x = 0, not x=0, no space before a comma, but always place one after a comma - *Curly braces {}*: first on same line, last on own line - *Assignment*: use `<-`, not `=` nor `->` - *Semicolon(;)*: don't use, I used once for the interest of space - *return()*: Use explicit returns in functions: default function return is the last evaluated expression - *File paths*: use [relative file path](https://www.w3schools.com/html/html_filepaths.asp) "../../filename.csv" rather than absolute path "C:/mydata/filename.csv". Backslash needs `\\` --- ## R packages used in this slide This slide was prepared on 2021-09-08 from Session_7s.Rmd with R version 4.1.1 (2021-08-10) Kick Things on Windows 10 x64 build 18362 🙋. The attached packages used in this slide are: ``` ## geosphere RColorBrewer plotly margins rlang broom ## "1.5-10" "1.1-2" "4.9.4.1" "0.3.26" "0.4.11" "0.7.9" ## lubridate forcats stringr dplyr purrr readr ## "1.7.10" "0.5.1" "1.4.0" "1.0.7" "0.3.4" "2.0.1" ## tidyr tibble ggplot2 tidyverse kableExtra knitr ## "1.1.3" "3.1.3" "3.3.5" "1.3.1" "1.3.4" "1.33" ```