Programming with Data

class: center, middle, inverse, title-slide

# Programming with Data
## Session 3: R Programming (II)
### Dr. Wang Jiwei
### Master of Professional Accounting

---

class: inverse, center, middle

# Logical expressions

---

## Why use logical expressions?

- We just saw an example in our subsetting function
 - `earnings < 20000`
- Logical expressions give us more control over the data
- They let us easily create logical vectors for subsetting data

```r
df$earnings
```

```
## NULL
```

```r
df$earnings < 20000
```

```
## logical(0)
```
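
To see what such a logical vector looks like with actual values, here is a minimal sketch. It assumes a small made-up data frame; the name `toy_df`, its columns, and its values are hypothetical and are not the course data.

```r
# Hypothetical toy data, not the course data set
toy_df <- data.frame(
  firm = c("A", "B", "C", "D"),
  earnings = c(15000, 32000, 8000, 20000)
)

# The comparison returns one TRUE/FALSE per row
low_earnings <- toy_df$earnings < 20000
low_earnings
# [1]  TRUE FALSE  TRUE FALSE

# The logical vector then keeps only the rows where it is TRUE
toy_df[low_earnings, ]
```

Note that the firm with earnings of exactly 20000 is dropped, because `<` is a strict comparison.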

---

## Logical operators

`==` `!=` `>` `<` `>=` `<=` `!` `|` `&`

.pull-left[
- Equals: `==`
    - `2 == 2` `$\rightarrow$` TRUE
    - `2 == 3` `$\rightarrow$` FALSE
    - `'dog'=='dog'` `$\rightarrow$` TRUE
    - `'dog'=='cat'` `$\rightarrow$` FALSE
]
.pull-right[
- Not equals: `!=`
    - The opposite of `==`
    - `2 != 2` `$\rightarrow$` FALSE
    - `2 != 3` `$\rightarrow$` TRUE
    - `'dog'!='cat'` `$\rightarrow$` TRUE
]

- Comparing strings is done character by character
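
A short sketch of what "character by character" means in practice. The example strings are made up for illustration, and the exact ordering of unusual characters can depend on the locale R is running in.

```r
# The first characters differ: "d" sorts after "c"
first_char <- "dog" > "cat"

# The first characters tie, so the second character decides: "o" sorts after "a"
second_char <- "dog" > "dag"

# A string that is a prefix of another string sorts before it
prefix <- "do" < "dog"

c(first_char, second_char, prefix)
# [1] TRUE TRUE TRUE
```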

---

## Logical operators

`==` `!=` `>` `<` `>=` `<=` `!` `|` `&`

.pull-left[
- Greater than: `>`
 - `2 > 1` `$\rightarrow$` TRUE
 - `2 > 2` `$\rightarrow$` FALSE
 - `2 > 3` `$\rightarrow$` FALSE
 - `'dog'>'cat'` `$\rightarrow$` TRUE
]
.pull-right[
- Less than: `<`
 - `2 < 1` `$\rightarrow$` FALSE
 - `2 < 2` `$\rightarrow$` FALSE
 - `2 < 3` `$\rightarrow$` TRUE
 - `'dog'<'cat'` `$\rightarrow$` FALSE
]

.pull-left[
- Greater than or equal to: `>=`
 - `2 >= 1` `$\rightarrow$` TRUE
 - `2 >= 2` `$\rightarrow$` TRUE
 - `2 >= 3` `$\rightarrow$` FALSE
]
.pull-right[
- Less than or equal to: `<=`
 - `2 <= 1` `$\rightarrow$` FALSE
 - `2 <= 2` `$\rightarrow$` TRUE
 - `2 <= 3` `$\rightarrow$` TRUE
]

---

## Logical operators

- Not: `!`
    - This simply inverts everything
    - `!TRUE` `$\rightarrow$` FALSE
    - `!FALSE` `$\rightarrow$` TRUE
- And: `&`
    - `TRUE & TRUE` `$\rightarrow$` TRUE
    - `TRUE & FALSE` `$\rightarrow$` FALSE
    - `FALSE & FALSE` `$\rightarrow$` FALSE
- Or: `|` (pipe, same key as '\\')
    - Note that `|` is evaluated after all `&`s
    - `TRUE | TRUE` `$\rightarrow$` TRUE
    - `TRUE | FALSE` `$\rightarrow$` TRUE
    - `FALSE | FALSE` `$\rightarrow$` FALSE
- You can mix in parentheses for grouping as needed
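
The evaluation order can be checked with a small sketch; the variable names are only for illustration.

```r
# & is evaluated before |, so this reads as TRUE | (FALSE & FALSE)
no_parens <- TRUE | FALSE & FALSE
no_parens
# [1] TRUE

# Parentheses force the | to be evaluated first
with_parens <- (TRUE | FALSE) & FALSE
with_parens
# [1] FALSE

# ! is evaluated before &, so this reads as (!TRUE) & FALSE
not_first <- !TRUE & FALSE
not_first
# [1] FALSE
```

When in doubt, adding parentheses costs nothing and makes the intent explicit.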

---

## Examples for logical operators

- How many tech firms had >$10B in revenue in 2017?

```r
sum(tech_df$revenue > 10000)
```

```
## [1] 46
```

- How many tech firms had >$10B in revenue but had negative earnings in 2017?

```r
sum(tech_df$revenue > 10000 & tech_df$earnings < 0)
```

```
## [1] 4
```

---

## Examples for logical operators

- Who are those 4 with high revenue and negative earnings?

```r
columns <- c("conm", "tic", "earnings", "revenue")
tech_df[tech_df$revenue > 10000 & tech_df$earnings < 0, columns]
```

```
##                               conm   tic  earnings  revenue
## 2100                   CORNING INC   GLW  -497.000 10116.00
## 2874  TELEFONAKTIEBOLAGET LM ERICS  ERIC -4307.493 24629.64
## 11804        DELL TECHNOLOGIES INC 7732B -3728.000 78660.00
## 23377                   NOKIA CORP   NOK -1796.087 27917.49
```

---

## Other special values

- We know `TRUE` and `FALSE` already
    - Note that `FALSE` can be represented as 0
    - Note that `TRUE` can be represented as any non-zero number
- There are also:
    - `Inf`: Infinity, often caused by dividing something by 0
    - `NaN`: "Not a number," likely that the expression 0/0 occurred
    - `NA`: A missing value, usually *not* due to a mathematical error
    - `NULL`: Indicates a variable has nothing in it
- We can check for these with:
    - [`is.infinite()`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/is.finite)
    - [`is.nan()`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/is.finite)
    - [`is.na()`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/NA)
    - [`is.null()`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/NULL)
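
A minimal sketch of how these special values arise and how the checks behave, using only base R functions. The vector `x` is a made-up example.

```r
pos_inf <- 1 / 0       # Inf
not_num <- 0 / 0       # NaN

is.infinite(pos_inf)   # TRUE
is.nan(not_num)        # TRUE
is.na(not_num)         # TRUE: NaN also counts as missing
is.nan(NA)             # FALSE: NA is missing, but not NaN
is.null(NULL)          # TRUE

# Missing values propagate through calculations unless removed
x <- c(1, NA, 3)
mean(x)                # NA
mean(x, na.rm = TRUE)  # 2

# 0 converts to FALSE; any non-zero number converts to TRUE
as.logical(c(0, 1, -2))
# [1] FALSE  TRUE  TRUE
```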

---

## if ... else

- Conditional statements (used for programming)

```r
# cond1, cond2, etc. can be any logical expression
if(cond1) {
  # Code runs if cond1 is TRUE
} else if (cond2) { # Can repeat 'else if' as needed
  # Code runs if this is the first condition that is TRUE
} else {
  # Code runs if none of the above conditions TRUE
}
```
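
As a concrete sketch of the template above, the following labels a single earnings figure. The variable names and labels are hypothetical; the value is borrowed from the Corning row shown earlier.

```r
earnings <- -497

if (earnings > 0) {
  label <- "profit"
} else if (earnings < 0) {
  label <- "loss"        # first condition that is TRUE for -497
} else {
  label <- "break-even"  # reached only when earnings is exactly 0
}

label
# [1] "loss"
```

Note that `if` expects a single `TRUE` or `FALSE`, not a whole vector of them.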

---

## Other uses

- Vectorized conditional statements using [`ifelse()`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/ifelse)
    - `ifelse()` takes 3 vectors and returns 1 vector
        - A vector of `TRUE` or `FALSE`
        - A vector of elements to return from when `TRUE`
        - A vector of elements to return from when `FALSE`

```r
# Outputs odd for odd numbers and even for even numbers
even <- rep("even", 5)
odd <- rep("odd", 5)
numbers <- 1:5
ifelse(numbers %% 2, odd, even)
```

```
## [1] "odd"  "even" "odd"  "even" "odd"
```
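
The `TRUE` and `FALSE` vectors do not need to be built with `rep()`: shorter inputs are recycled to the length of the test. A minimal sketch, with made-up earnings values:

```r
numbers <- 1:5

# Single strings are recycled for every element of the test
parity <- ifelse(numbers %% 2 == 1, "odd", "even")
parity
# [1] "odd"  "even" "odd"  "even" "odd"

# The same pattern applied to a hypothetical earnings vector
toy_earnings <- c(500, -100, 0)
status <- ifelse(toy_earnings < 0, "loss", "no loss")
status
# [1] "no loss" "loss"    "no loss"
```

Writing the test as `numbers %% 2 == 1` makes the logical condition explicit rather than relying on non-zero numbers being treated as `TRUE`.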

---

## Practice: Subsetting df

- This practice focuses on subsetting out potentially interesting parts of our data frame
 - We will also see which of Goldman, JPMorgan, and Citigroup, in which year, had the lowest earnings since 2010
- Do Exercise 5 on the following R practice file:
 - <a target="_blank" href="Session_2s_Exercise.html#Exercise_5:_Subsetting_our_data_frame">R Practice</a>

---
class: inverse, center, middle

# Loops with control structure

---

## Looping: While loop

.pull-left[
<img src="../../../Figures/while-loop.png">
]
.pull-right[
- A [`while()`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/Control) loop executes code repeatedly until a specified condition is `FALSE`
- An index must be initialized before the `while` loop and updated inside the loop; otherwise the loop will never end.

```r
i <- 0
while(i < 5) {
 print(i)
 i <- i + 2
}
```

```
## [1] 0
## [1] 2
## [1] 4
```
]
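
A `while` loop is most useful when the number of iterations is not known in advance. As a sketch under made-up assumptions (a starting balance of 1000 growing at 5% per year), the loop below counts the years needed to double the balance:

```r
balance <- 1000
target <- 2000
years <- 0

while (balance < target) {
  balance <- balance * 1.05  # the condition depends on balance, so it must change here
  years <- years + 1
}

years
# [1] 15
```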

---

## Looping: For loop

.pull-left[
<img src="../../../Figures/for-loop.png">
]
.pull-right[
- A [`for()`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/Control) loop executes code once for each element of a vector or list, assigning that element to the loop variable on each pass

```r
for(i in c(0, 2, 4)) {
  print(i)
}
```

```
## [1] 0
## [1] 2
## [1] 4
```
]
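
Two common ways to write a `for` loop are sketched below: looping over positions with `seq_along()` and looping over the elements themselves. The revenue values are made up; the tickers come from the table shown earlier.

```r
toy_revenue <- c(120, 80, 250)

# Loop over positions 1, 2, 3 and accumulate a running total
total <- 0
for (i in seq_along(toy_revenue)) {
  total <- total + toy_revenue[i]
}
total
# [1] 450

# Loop directly over the elements of a character vector
for (tic in c("GLW", "ERIC", "NOK")) {
  print(tic)
}
```

`seq_along(x)` is safer than `1:length(x)` because it yields an empty sequence when `x` is empty, so the loop body is skipped.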

---

## Dangers of looping in R

- Loops in R are relatively slow because they perform one calculation at a time, whereas R is best at many calculations at once via [vectorization](http://www.noamross.net/archives/2014-04-16-vectorization-in-r-why/) or matrix algebra
    - We will introduce other ways to loop using functions such as `lapply()`
- But as a new programmer, you must understand the logic of loops
- [`Sys.time()`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/Sys.time) returns the current system time

.pull-left[

```r
# Profit margin, all US tech firms
start <- Sys.time()
margin_1 <- rep(0, length(tech_df$earnings))
for(i in seq_along(tech_df$earnings)) {
 margin_1[i] <- tech_df$earnings[i] /
 tech_df$revenue[i]
}
end <- Sys.time()
time_1 <- end - start
time_1
```

```
## Time difference of 0.00900197 secs
```
]
.pull-right[

```r
# Profit margin, all US tech firms
start <- Sys.time()
margin_2 <- tech_df$earnings /
 tech_df$revenue
end <- Sys.time()
time_2 <- end - start
time_2
```

```
## Time difference of 0.002002001 secs
```
]

---

## Dangers of looping in R

- Loops in R are relatively slow because they perform one calculation at a time, whereas R is best at many calculations at once via vectorization or matrix algebra

```r
# Are these calculations identical?
identical(margin_1, margin_2)
```

```
## [1] TRUE
```

```r
# How much slower is the loop?
paste(as.numeric(time_1) / as.numeric(time_2), "times") 
```

```
## [1] "4.49648684053829 times"
```
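
The same comparison can be reproduced without the course data. The sketch below assumes simulated vectors (the names and sizes are made up) and uses base R's `system.time()` instead of subtracting two `Sys.time()` calls. Timings vary by machine, so no expected timing is shown.

```r
set.seed(1)
sim_earnings <- rnorm(1e5)
sim_revenue <- runif(1e5, min = 1, max = 100)

# Element-by-element loop
loop_time <- system.time({
  margin_loop <- numeric(length(sim_earnings))
  for (i in seq_along(sim_earnings)) {
    margin_loop[i] <- sim_earnings[i] / sim_revenue[i]
  }
})

# Vectorized division: one call handles every element
vec_time <- system.time(margin_vec <- sim_earnings / sim_revenue)

identical(margin_loop, margin_vec)
# [1] TRUE

loop_time["elapsed"]
vec_time["elapsed"]
```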

---
class: inverse, center, middle

# Functions and packages

---

## Help functions

- There are two equivalent ways to quickly access help files:
    - `?` and `help()`
    - Usage to get the help file for [`data.frame()`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/data.frame):
        - `?data.frame`
        - `help(data.frame)`
- To see the options for a function, use `args()`

```r
args(data.frame)
```

```
## function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, 
##     fix.empty.names = TRUE, stringsAsFactors = FALSE) 
## NULL
```

---

## A note on using functions

```r
args(data.frame)
```

```
## function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, 
##     fix.empty.names = TRUE, stringsAsFactors = FALSE) 
## NULL
```

- The `...` represents a series of inputs
 - In this case, inputs like `name=data`, where name is the column name and data is a vector
- The `____ = ____` arguments are options for the function
 - The default is prespecified, but you can overwrite it
 - eg, you may change `stringsAsFactors` from FALSE (default) to TRUE
- Options can be very useful or save us a lot of time!
- You can always find them by:
 - Using the `?` command
 - Checking other documentation like <a target="_blank" href="https://www.rdocumentation.org/">www.rdocumentation.org</a>
 - Using the `args()` function
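
A minimal sketch of both kinds of argument in one call. The column names and values are hypothetical, and the `"character"` result assumes R 4.0 or later, where `stringsAsFactors` defaults to `FALSE` as in the `args()` output above.

```r
# name = data pairs are passed through the ... argument
toy_default <- data.frame(firm = c("A", "B"), revenue = c(10, 20))
class(toy_default$firm)
# [1] "character"

# Overriding an option: strings are now stored as factors
toy_factor <- data.frame(firm = c("A", "B"), revenue = c(10, 20),
                         stringsAsFactors = TRUE)
class(toy_factor$firm)
# [1] "factor"
```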

---

## Packages in R

- R packages are collections of functions and data sets developed by the community.
- Most R packages are stored on the official <a target = "_blank" href = "https://cran.r-project.org/">CRAN</a> repository and can be installed directly from within RStudio
- Alternatively, you may download the package to your local disk and use RStudio or the command `install.packages(file.choose(), repos=NULL)` to install it

```r
# To install the tidyverse package which will be used for this course
# tidyverse is a collection of useful packages in R
# https://www.tidyverse.org/
install.packages("tidyverse")

# or to install multiple packages in one go:
install.packages(c("ggplot2", "dplyr", "magrittr"))
```

- Load packages using [`library()`](https://rdrr.io/r/base/library.html)
    - Need to do this each time you open a new instance of R

```r
# Load the tidyverse package
library(tidyverse)
```
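
Installing is needed only once per machine, while loading is needed in every session. One possible way to avoid reinstalling is sketched below, using base R's `requireNamespace()`; the helper name `is_installed()` is hypothetical.

```r
# Hypothetical helper: TRUE if the package can be found, FALSE otherwise
is_installed <- function(pkg) {
  isTRUE(requireNamespace(pkg, quietly = TRUE))
}

# stats ships with R, so this is TRUE on any standard installation
is_installed("stats")

# Install only when missing (commented out so nothing is downloaded by accident)
# if (!is_installed("tidyverse")) install.packages("tidyverse")
```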

---

## Pipe notation

> Pipe: pass the output on the left directly as an input to the function on the right.

- Base R (ie, without any external package) has had an official pipe notation, `|>`, since R version 4.1 (2021).
  - [The New R Pipe](https://www.r-bloggers.com/2021/05/the-new-r-pipe/)
- But a more popular pipe notation had already been provided by [`package:magrittr`](https://magrittr.tidyverse.org)
    - Part of [`package:tidyverse`](https://tidyverse.tidyverse.org), an extremely popular collection of packages
- The magrittr pipe is written `%>%`
    - `Left %>% Right(arg2, ...)` is the same as `Right(Left, arg2, ...)`

> Piping can drastically improve code readability
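
A minimal sketch of the equivalence, written with the base R pipe so that no package is needed. It assumes R 4.1 or later; with magrittr loaded, `%>%` could be used in the same positions.

```r
# Nested calls are read from the inside out
nested <- sum(sqrt(c(1, 4, 9)))

# Piped calls are read left to right: take the vector, then sqrt, then sum
piped <- c(1, 4, 9) |> sqrt() |> sum()

nested
# [1] 6
identical(nested, piped)
# [1] TRUE
```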

---

## Piping example

> Plot tech firms' earnings vs. revenue, >$10B in revenue

```r
# %>% comes from magrittr and ggplot() comes from ggplot2, both part of tidyverse
# alternatively you may launch these two packages separately
# note that ggplot2 chains its layers with "+" rather than with a pipe
library(tidyverse)
library(plotly)

plot <- tech_df %>%
 subset(revenue > 10000) %>%
 ggplot(aes(x = revenue, y = earnings)) + # Adds point, and ticker
 geom_point(shape = 1, aes(text = sprintf("Ticker: %s", tic)))
ggplotly(plot) # Makes the plot interactive
```

*[Interactive plotly scatter plot: `earnings` (y-axis) against `revenue` (x-axis) for tech firms with more than $10B in revenue; hovering over a point shows its ticker.]*

---

## Without piping

```r
library(tidyverse)
library(plotly)

plot <- ggplot(subset(tech_df, revenue > 10000),
 aes(x = revenue, y = earnings)) +
 geom_point(shape = 1, aes(text = sprintf("Ticker: %s", tic)))
ggplotly(plot) # Makes the plot interactive
```

*[Interactive plotly scatter plot: `earnings` (y-axis) against `revenue` (x-axis) for tech firms with more than $10B in revenue; hovering over a point shows its ticker. The output is the same as in the piped version.]*

---

## Practice: library usage

- This practice focuses on using an external library
 - We will also see which of Goldman, JPMorgan, and Citigroup, in which year, had the lowest earnings since 2010
- Do Exercise 6 on the following R practice file:
 - <a target="_blank" href="Session_2s_Exercise.html#Exercise_6:_External_library_usage">R Practice</a>

> Note: The `~` indicates a formula: the left side is the y-axis and the right side is the x-axis

> Note: The | tells lattice to make panels based on the variable(s) to the right

---

## Math functions

- [`sum()`](https://rdrr.io/r/base/sum.html): Sum of a vector
- [`abs()`](https://rdrr.io/r/base/MathFun.html): Absolute value
- [`sign()`](https://rdrr.io/r/base/sign.html): The sign of a number

```r
vector = c(-2, -1, 0, 1, 2)
sum(vector)
```

```
## [1] 0
```

```r
abs(vector)
```

```
## [1] 2 1 0 1 2
```

```r
sign(vector)
```

```
## [1] -1 -1  0  1  1
```

---

## Stats functions

- [`mean()`](https://rdrr.io/r/base/mean.html): Calculates the mean of a vector
- [`median()`](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/median): Calculates the median of a vector
- [`sd()`](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/sd): Calculates the sample standard deviation of a vector
- [`quantile()`](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/quantile): Provides the *quartiles* of a vector
- [`range()`](https://rdrr.io/r/base/range.html): Gives the minimum and maximum of a vector
    - Related: [`min()`](https://rdrr.io/r/base/Extremes.html) and [`max()`](https://rdrr.io/r/base/Extremes.html)

```r
quantile(tech_df$earnings)
```

```
##         0%        25%        50%        75%       100% 
## -4307.4930   -15.9765     1.8370    91.3550 48351.0000
```

```r
range(tech_df$earnings)
```

```
## [1] -4307.493 48351.000
```
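
The remaining functions can be tried on a small vector where the answers are easy to verify by hand. The vector below is made up for illustration.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

mean(x)    # 5: the sum, 40, divided by 8 values
median(x)  # 4.5: the average of the two middle values, 4 and 5
sd(x)      # sample standard deviation, ie, divides by n - 1 = 7
range(x)   # 2 9
c(min(x), max(x))  # the same two numbers as range(x)

# quantile() is not limited to quartiles: pass any probabilities via probs
quantile(x, probs = c(0.1, 0.9))
```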

---

## Make your own functions!

- Use the `function()` function!
 - `my_func <- function(arguments) {code}`
 - It is recommended to use `return()` explicitly to specify what the function returns.

> Simple function: Add 2 to a number

```r
add_two <- function(n) {
 n + 2
}
add_two(500)
```

```
## [1] 502
```

```r
add_two <- function(n) {
 return(n + 2)
}
add_two(500)
```

```
## [1] 502
```

---

## Slightly more complex

```r
mult_together <- function(n1, n2 = 0, square = FALSE) {
  if (!square) {
    return(n1 * n2)
  } else {
    return(n1 * n1)
  }
}

mult_together(5, 6)
```

```
## [1] 30
```

```r
mult_together(5, 6, square = TRUE)
```

```
## [1] 25
```

```r
mult_together(5, square = TRUE)
```

```
## [1] 25
```

---

## Practice: Functions

- This practice focuses on making a custom function
 - Currency conversion between USD and SGD!
- Do Exercise 7 on the following R practice file:
 - <a target="_blank" href="Session_2s_Exercise.html#Exercise_7:_Making_your_own_function">R Practice</a>

---

## Challenging Practice

Define a function called `digits(n)` which returns the number of digits of a given integer. For simplicity, we assume `n` is zero or a positive integer, ie, `n >= 0`.
- if you call `digits(251)`, it should return `3`
- if you call `digits(5)`, it should return `1`
- if you call `digits(0)`, it should return `1`

For practice, you are required to use `if` conditions and `while` loops when necessary. You should use integer division `%/%` in the `while` loop to count the number of digits. You are not allowed to use functions such as `nchar()` and `floor()`.
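
As a hint rather than a solution, the sketch below only shows what integer division `%/%` and its companion, the remainder operator `%%`, return. The variable names are for illustration.

```r
quotient <- 251 %/% 10   # integer division drops the last digit
quotient
# [1] 25

remainder <- 251 %% 10   # the remainder is the digit that was dropped
remainder
# [1] 1

small <- 5 %/% 10        # a single-digit number divided by 10 gives 0
small
# [1] 0
```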

---
class: inverse, center, middle

# Loops with `lapply()` functions

---

## Loops with `lapply()`

You don't always have to write loops using `for` or `while`. There is a family of [`lapply()`](https://rdrr.io/r/base/lapply.html)-style functions which can implement loops.

- [`lapply()`](https://rdrr.io/r/base/lapply.html): Loop over a list, evaluate a function on each element, and return a list
- Others in the family: [`sapply()`](https://rdrr.io/r/base/lapply.html); [`mapply()`](https://rdrr.io/r/base/mapply.html); [`apply()`](https://rdrr.io/r/base/apply.html); [`vapply()`](https://rdrr.io/r/base/lapply.html); [`tapply()`](https://rdrr.io/r/base/tapply.html)

Let's see the structure of [`lapply()`](https://rdrr.io/r/base/lapply.html). It extracts the function using [`match.fun()`](https://rdrr.io/r/base/match.fun.html), checks whether the input is a plain vector or list (if not, it converts the input to a list using [`as.list()`](https://rdrr.io/r/base/list.html)), and finally loops internally in C code (`.Internal(lapply(X, FUN))`).

```r
lapply
```

```
## function (X, FUN, ...) 
## {
## FUN <- match.fun(FUN)
## if (!is.vector(X) || is.object(X)) 
## X <- as.list(X)
## .Internal(lapply(X, FUN))
## }
## <bytecode: 0x000000001b6fc2e8>
## <environment: namespace:base>
```
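
A minimal sketch of what this means in practice: the input can be an ordinary vector, and the result is still a list. The squaring function is a made-up example.

```r
# The input is a plain integer vector, but the result is a list
squares <- lapply(1:3, function(i) i^2)
is.list(squares)
# [1] TRUE

# unlist() flattens the list back into a vector
unlist(squares)
# [1] 1 4 9

# vapply() returns a vector directly and checks each result against a template
squares_vec <- vapply(1:3, function(i) i^2, numeric(1))
squares_vec
# [1] 1 4 9
```

`vapply()` stops with an error if any result does not match the template (here, one number), which makes it a stricter alternative to `sapply()`.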

---

## Apply a function over a list

[`rnorm()`](https://rdrr.io/r/stats/Normal.html) generates normally distributed random numbers (as a vector), with a default mean of 0 and a default standard deviation of 1.

```r
set.seed(1) # make random number generation reproducible
x_list <- list(a = rnorm(10000), b = rnorm(20000, 1, 5))
str(x_list)
```

```
## List of 2
##  $ a: num [1:10000] -0.626 0.184 -0.836 1.595 0.33 ...
##  $ b: num [1:20000] -3.02 -4.28 -4.18 -4.93 -1.5 ...
```

```r
x_list_mean <- lapply(x_list, mean)
str(x_list_mean)
```

```
## List of 2
##  $ a: num -0.00654
##  $ b: num 1.01
```

```r
x_list_mean_vector <- sapply(x_list, mean)
str(x_list_mean_vector)
```

```
##  Named num [1:2] -0.00654 1.00841
##  - attr(*, "names")= chr [1:2] "a" "b"
```

---

## Apply a function over an array

An [`array()`](https://rdrr.io/r/base/array.html) is a data object that can store data in more than two dimensions; like a matrix, all of its elements must share the same data type. Recall that a `matrix` is two-dimensional data with a single data type and a `dataframe` is two-dimensional data which allows different data types across columns. [`apply()`](https://rdrr.io/r/base/apply.html) can evaluate a function over the margins of an array.

```r
set.seed(1) # make random number generation reproducible
# create a 2-dimensional array (a matrix for this case)
x_array <- array(c(rnorm(10000), rnorm(20000, 1, 5)), dim = c(2, 10000))
str(x_array)
```

```
##  num [1:2, 1:10000] -0.626 0.184 -0.836 1.595 0.33 ...
```

```r
# apply mean() on the first dimension, ie, rows of a matrix/dataframe
x_array_mean <- apply(x_array, 1, mean)
str(x_array_mean)
```

```
##  num [1:2] 0.467 0.506
```

```r
# apply mean() on the second dimension, ie, columns of a matrix/dataframe
x_array_mean <- apply(x_array, 2, mean)
str(x_array_mean)
```

```
##  num [1:10000] -0.221 0.38 -0.245 0.613 0.135 ...
```
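
The second argument of `apply()` generalizes beyond rows and columns. As a sketch, assuming a small made-up 3-dimensional array filled with the integers 1 to 24:

```r
# 2 rows x 3 columns x 4 layers, filled column by column, layer by layer
toy_array <- array(1:24, dim = c(2, 3, 4))

# Margin 3: one sum per layer (each layer holds 6 consecutive integers)
layer_sums <- apply(toy_array, 3, sum)
layer_sums
# [1]  21  57  93 129

# Margins 1 and 2 together: one sum per (row, column) cell across the layers
cell_sums <- apply(toy_array, c(1, 2), sum)
dim(cell_sums)
# [1] 2 3
cell_sums[1, 1]  # 1 + 7 + 13 + 19
# [1] 40
```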

---
class: inverse, center, middle

# Managing dataframes with `dplyr`

---

## Read files to data frames

The most popular file format among data analysts is the [comma-separated values (CSV)](https://en.wikipedia.org/wiki/Comma-separated_values) file that uses a comma (`,`) to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas.
  - you can save an Excel file as a CSV file
  
The simplest way to import a smaller CSV file is to use [`read.csv()`](https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/read.table) from base R (ie, without any additional packages). Related functions include [`read.table()`](https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/read.table) (for general delimited text files, with the separator set through `sep`) and [`read.delim()`](https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/read.table) (for tab-delimited text files by default).

```r
df <- read.csv("data/session_2.csv")
```

Other packages also provide functions for importing files:
  - [readr::read_csv()](https://readr.tidyverse.org/reference/read_delim.html)
  - [data.table::fread()](https://Rdatatable.gitlab.io/data.table/reference/fread.html)
  - [readxl::read_excel()](https://readxl.tidyverse.org/reference/read_excel.html)
  - [other packages](https://www.datacamp.com/community/tutorials/r-data-import-tutorial) for other data formats such as JSON, HTML, SAS, STATA, etc
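
As a minimal sketch of how the base R readers differ, the code below writes a small toy table to temporary files (the data and file names are made up for illustration) and reads it back with `read.csv()` and `read.delim()`:

```r
# toy data, not the course dataset
toy <- data.frame(tic = c("AAPL", "MSFT"), fyear = c(2016, 2017))

csv_file <- tempfile(fileext = ".csv")
tab_file <- tempfile(fileext = ".txt")

# write a comma-separated and a tab-separated copy of the same table
write.csv(toy, csv_file, row.names = FALSE)
write.table(toy, tab_file, sep = "\t", row.names = FALSE, quote = FALSE)

from_csv <- read.csv(csv_file)    # expects commas between fields
from_tab <- read.delim(tab_file)  # expects tabs between fields

str(from_csv)
all.equal(from_csv, from_tab)
```

Both readers are wrappers around `read.table()` with different defaults, which is why the separator is the main thing to get right.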

---

## Single table functions

[`package:dplyr`](https://dplyr.tidyverse.org) is part of the [`package:tidyverse`](https://tidyverse.tidyverse.org), which provides useful functions for data manipulation. A competing package is [`package:data.table`](https://r-datatable.com), which is [more efficient](https://atrebas.github.io/post/2019-03-03-datatable-dplyr/) for large datasets (I suggest > 1 GB).

* Rows:
  * [`filter()`](https://dplyr.tidyverse.org/reference/filter.html) chooses rows based on column values.
  * [`slice()`](https://dplyr.tidyverse.org/reference/slice.html) chooses rows based on location.
  * [`arrange()`](https://dplyr.tidyverse.org/reference/arrange.html) changes the order of the rows.
  
* Columns:
  * [`select()`](https://dplyr.tidyverse.org/reference/select.html) changes whether or not a column is included.
  * [`rename()`](https://dplyr.tidyverse.org/reference/rename.html) changes the name of columns.
  * [`mutate()`](https://dplyr.tidyverse.org/reference/mutate.html) changes the values of columns and creates new columns.
  * [`relocate()`](https://dplyr.tidyverse.org/reference/relocate.html) changes the order of the columns.

* Groups of rows:
  * [`summarize()`](https://dplyr.tidyverse.org/reference/summarise.html) collapses a group into a single row.

---

## Filter rows with `filter()`

[`filter()`](https://dplyr.tidyverse.org/reference/filter.html) allows you to select a subset of rows in a data frame. The first argument is the dataframe. The second and subsequent arguments refer to variables within that dataframe, selecting rows where the expression is `TRUE`.

Select all rows where the ticker is AAPL (Apple Inc.) and the fiscal year is after 2013:

```r
library(tidyverse)
df %>% filter(tic == "AAPL" & fyear > 2013)
```

```
##   gvkey datadate fyear indfmt consol popsrc datafmt  tic      conm curcd    ni
## 1  1690 20140930  2014   INDL      C      D     STD AAPL APPLE INC   USD 39510
## 2  1690 20150930  2015   INDL      C      D     STD AAPL APPLE INC   USD 53394
## 3  1690 20160930  2016   INDL      C      D     STD AAPL APPLE INC   USD 45687
## 4  1690 20170930  2017   INDL      C      D     STD AAPL APPLE INC   USD 48351
##     revt    cik costat   gind gsector  gsubind
## 1 182795 320193      A 452020      45 45202030
## 2 233715 320193      A 452020      45 45202030
## 3 215091 320193      A 452020      45 45202030
## 4 229234 320193      A 452020      45 45202030
```

This is roughly equivalent to this base R code:

```r
df[df$tic == "AAPL" & df$fyear > 2013, ]
```
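
The two forms are only roughly equivalent because they treat missing values differently. A minimal sketch on a toy data frame (the values are made up; the column names mirror the course data) illustrates the difference:

```r
library(dplyr)

# toy data with one missing fiscal year
toy <- data.frame(
  tic   = c("AAPL", "AAPL", "MSFT", "AAPL"),
  fyear = c(2013, 2014, 2015, NA)
)

# filter() keeps only rows where the condition is TRUE;
# conditions separated by commas are combined with "and"
by_filter <- toy %>% filter(tic == "AAPL", fyear > 2013)

# bracket subsetting returns an all-NA row wherever the condition is NA
by_bracket <- toy[toy$tic == "AAPL" & toy$fyear > 2013, ]

nrow(by_filter)
nrow(by_bracket)
```

Wrapping the base R condition in `which()` drops the `NA` cases and brings the two results in line.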

---

## Choose rows with `slice()`

[`slice()`](https://dplyr.tidyverse.org/reference/slice.html) lets you select, remove, and duplicate rows by their (integer) locations.

```r
df %>% slice(5:7)
```

```
##   gvkey datadate fyear indfmt consol popsrc datafmt tic     conm curcd   ni
## 1  1004 20150531  2014   INDL      C      D     STD AIR AAR CORP   USD 10.2
## 2  1004 20160531  2015   INDL      C      D     STD AIR AAR CORP   USD 47.7
## 3  1004 20170531  2016   INDL      C      D     STD AIR AAR CORP   USD 56.5
##     revt  cik costat   gind gsector  gsubind
## 1 1594.3 1750      A 201010      20 20101010
## 2 1662.6 1750      A 201010      20 20101010
## 3 1767.6 1750      A 201010      20 20101010
```

It is accompanied by a number of helpers for common use cases:

* [`slice_head()`](https://dplyr.tidyverse.org/reference/slice.html) and [`slice_tail()`](https://dplyr.tidyverse.org/reference/slice.html) select the first or last rows.
* [`slice_sample()`](https://dplyr.tidyverse.org/reference/slice.html) randomly selects rows.
* [`slice_min()`](https://dplyr.tidyverse.org/reference/slice.html) and [`slice_max()`](https://dplyr.tidyverse.org/reference/slice.html) select rows with the lowest or highest values of a variable.
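
A minimal sketch of the helpers listed above, using a toy data frame (values are made up for illustration) and assuming dplyr 1.0.0 or later, where these helpers were introduced:

```r
library(dplyr)

# toy data: four firms with made-up net income
toy <- data.frame(tic = c("A", "B", "C", "D"), ni = c(10, 50, 30, 20))

first_two  <- toy %>% slice_head(n = 2)     # first two rows
last_one   <- toy %>% slice_tail(n = 1)     # last row
top_two    <- toy %>% slice_max(ni, n = 2)  # two largest ni
bottom_one <- toy %>% slice_min(ni, n = 1)  # smallest ni

top_two
bottom_one
```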

---

## Arrange rows with `arrange()`

[`arrange()`](https://dplyr.tidyverse.org/reference/arrange.html) reorders the rows by one or more columns; wrap a column in `desc()` to sort it in descending order:

```r
df %>% arrange(conm, desc(fyear)) %>%
  head()
```

```
##    gvkey datadate fyear indfmt consol popsrc datafmt  tic              conm
## 1 122519 20170630  2017   INDL      C      D     STD FLWS 1-800-FLOWERS.COM
## 2 122519 20160630  2016   INDL      C      D     STD FLWS 1-800-FLOWERS.COM
## 3 122519 20150630  2015   INDL      C      D     STD FLWS 1-800-FLOWERS.COM
## 4 122519 20140630  2014   INDL      C      D     STD FLWS 1-800-FLOWERS.COM
## 5 122519 20130630  2013   INDL      C      D     STD FLWS 1-800-FLOWERS.COM
## 6 122519 20120630  2012   INDL      C      D     STD FLWS 1-800-FLOWERS.COM
##   curcd     ni     revt     cik costat   gind gsector  gsubind
## 1   USD 44.041 1193.625 1084869      A 255020      25 25502020
## 2   USD 36.875 1173.024 1084869      A 255020      25 25502020
## 3   USD 20.287 1121.506 1084869      A 255020      25 25502020
## 4   USD 15.372  756.345 1084869      A 255020      25 25502020
## 5   USD 12.321  735.497 1084869      A 255020      25 25502020
## 6   USD 17.646  716.257 1084869      A 255020      25 25502020
```
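
To make the sort order easier to see, here is a minimal sketch on a four-row toy data frame (made-up values), together with a base R counterpart using `order()`:

```r
library(dplyr)

toy <- data.frame(
  conm  = c("B CORP", "A CORP", "B CORP", "A CORP"),
  fyear = c(2016, 2016, 2017, 2017)
)

# sort by company name, then by fiscal year from newest to oldest
sorted <- toy %>% arrange(conm, desc(fyear))

# base R counterpart: negate the numeric column for descending order
sorted_base <- toy[order(toy$conm, -toy$fyear), ]

sorted
```

Later columns only break ties in earlier ones, so `fyear` is sorted within each company.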

---

## Select columns with `select()`

[`select()`](https://dplyr.tidyverse.org/reference/select.html) allows you to subset a data frame by column names (variables/features/predictors).

```r
# Select columns by name
df %>% select(gvkey, tic, conm, fyear) %>%
  slice(1:3)
```

```
##   gvkey tic     conm fyear
## 1  1004 AIR AAR CORP  2010
## 2  1004 AIR AAR CORP  2011
## 3  1004 AIR AAR CORP  2012
```

```r
# Select all columns between gvkey and conm (inclusive)
df %>% select(gvkey:conm)
# Select all columns except those from gvkey to conm (inclusive)
df %>% select(!(gvkey:conm))
# Select all columns ending with "d"
df %>% select(ends_with("d"))
```
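
A few more selection helpers, sketched on a toy data frame (made-up values; the column names mirror the course data). `starts_with()` and `where()` are tidyselect helpers that work inside `select()`:

```r
library(dplyr)

toy <- data.frame(
  gvkey = c(1004, 1045),
  gind  = c(201010, 203020),
  tic   = c("AIR", "AAL"),
  conm  = c("AAR CORP", "AMERICAN AIRLINES GROUP INC"),
  fyear = c(2010, 2010)
)

g_cols   <- toy %>% select(starts_with("g"))    # names beginning with "g"
num_cols <- toy %>% select(where(is.numeric))   # numeric columns only
renamed  <- toy %>% select(company = conm, tic) # select and rename in one step

names(g_cols)
names(num_cols)
names(renamed)
```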

---

## Rename columns with `rename()`

[`rename()`](https://dplyr.tidyverse.org/reference/rename.html) lets you rename columns using the form `new_name = old_name`.

```r
# rename columns
df %>% select(gvkey, tic, conm, fyear) %>%
  rename(comp_name = conm) %>%  slice(1:3)
```

```
##   gvkey tic comp_name fyear
## 1  1004 AIR  AAR CORP  2010
## 2  1004 AIR  AAR CORP  2011
## 3  1004 AIR  AAR CORP  2012
```
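
As a small sketch on toy data (made-up values), several columns can be renamed in one call, and `rename_with()` applies a function to the column names:

```r
library(dplyr)

toy <- data.frame(gvkey = 1004, tic = "AIR", conm = "AAR CORP")

# rename two columns at once; each pair is new_name = old_name
renamed <- toy %>% rename(comp_name = conm, ticker = tic)

# apply a function (here toupper) to every column name
upper <- toy %>% rename_with(toupper)

names(renamed)
names(upper)
```

Renaming does not change the column order.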

---

## Add new columns with `mutate()`

[`mutate()`](https://dplyr.tidyverse.org/reference/mutate.html) adds new columns or changes existing ones. [`package:DT`](https://github.com/rstudio/DT) helps present larger datasets using the [`datatable()`](https://rdrr.io/pkg/DT/man/datatable.html) function.

```r
library(DT)
df %>% mutate(margin = ni / revt) %>% slice(1:20) %>%
  select(gvkey, conm, tic, fyear, ni, revt, margin) %>%
  datatable(options = list(pageLength = 2), rownames = FALSE)
```

<div id="htmlwidget-e4ef2b92bd0cfbd35b60" style="width:100%;height:auto;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-e4ef2b92bd0cfbd35b60">{"x":{"filter":"none","data":[[1004,1004,1004,1004,1004,1004,1004,1013,1045,1045,1045,1045,1045,1045,1045,1045,1050,1050,1050,1050],["AAR CORP","AAR CORP","AAR CORP","AAR CORP","AAR CORP","AAR CORP","AAR CORP","ADC TELECOMMUNICATIONS INC","AMERICAN AIRLINES GROUP INC","AMERICAN AIRLINES GROUP INC","AMERICAN AIRLINES GROUP INC","AMERICAN AIRLINES GROUP INC","AMERICAN AIRLINES GROUP INC","AMERICAN AIRLINES GROUP INC","AMERICAN AIRLINES GROUP INC","AMERICAN AIRLINES GROUP INC","CECO ENVIRONMENTAL CORP","CECO ENVIRONMENTAL CORP","CECO ENVIRONMENTAL CORP","CECO ENVIRONMENTAL CORP"],["AIR","AIR","AIR","AIR","AIR","AIR","AIR","ADCT","AAL","AAL","AAL","AAL","AAL","AAL","AAL","AAL","CECE","CECE","CECE","CECE"],[2010,2011,2012,2013,2014,2015,2016,2010,2010,2011,2012,2013,2014,2015,2016,2017,2010,2011,2012,2013],[69.826,67.723,55,72.9,10.2,47.7,56.5,62,-471,-1979,-1876,-1834,2882,7610,2676,1919,2.105,8.272,10.85,6.557],[1775.782,2074.498,2167.1,2035,1594.3,1662.6,1767.6,1156.6,22170,24022,24855,26712,42650,40990,40180,42207,140.602,139.192,135.052,197.317],[0.0393212680385318,0.0326454882096777,0.02537953947672,0.0358230958230958,0.00639779213447908,0.0286900036088055,0.0319642453043675,0.0536053951236383,-0.0212449255751015,-0.0823828157522271,-0.0754777710722189,-0.0686582809224319,0.0675732708089097,0.185655037814101,0.0666002986560478,0.045466391830739,0.0149713375343167,0.0594287027990114,0.0803394248141457,0.033230791062098]],"container":"<table class=\"display\">\n <thead>\n <tr>\n <th>gvkey<\/th>\n <th>conm<\/th>\n <th>tic<\/th>\n <th>fyear<\/th>\n <th>ni<\/th>\n <th>revt<\/th>\n <th>margin<\/th>\n <\/tr>\n <\/thead>\n<\/table>","options":{"pageLength":2,"columnDefs":[{"className":"dt-right","targets":[0,3,4,5,6]}],"order":[],"autoWidth":false,"orderClasses":false,"lengthMenu":[2,10,25,50,100]}},"evals":[],"jsHooks":[]}</script>
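
A minimal sketch without the interactive table, on a toy data frame (made-up values): columns created inside `mutate()` can be used by later expressions in the same call.

```r
library(dplyr)

toy <- data.frame(tic = c("A", "B"), ni = c(10, -5), revt = c(100, 50))

# margin is created first, then reused to build profitable
with_margin <- toy %>%
  mutate(margin = ni / revt, profitable = margin > 0)

# base R counterpart for the first new column
toy_base <- toy
toy_base$margin <- toy_base$ni / toy_base$revt

with_margin
```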

---

## Change column order with `relocate()`

[`relocate()`](https://dplyr.tidyverse.org/reference/relocate.html) uses syntax similar to [`select()`](https://dplyr.tidyverse.org/reference/select.html) to move blocks of columns at once. Use `.before` or `.after` to set the destination.

```r
df %>% relocate(tic:revt, .after = fyear) %>%
  tail()
```

```
##        gvkey datadate fyear  tic               conm curcd      ni   revt indfmt
## 72720 324684 20171231  2017 ASLN ASLAN PHARMACEUTIC   USD -39.892    0.0   INDL
## 72721 326688 20131231  2013  NVT NVENT ELECTRIC PLC   USD      NA     NA   INDL
## 72722 326688 20141231  2014  NVT NVENT ELECTRIC PLC   USD      NA     NA   INDL
## 72723 326688 20151231  2015  NVT NVENT ELECTRIC PLC   USD      NA     NA   INDL
## 72724 326688 20161231  2016  NVT NVENT ELECTRIC PLC   USD 259.100 2116.0   INDL
## 72725 326688 20171231  2017  NVT NVENT ELECTRIC PLC   USD 361.700 2097.9   INDL
##       consol popsrc datafmt     cik costat   gind gsector  gsubind
## 72720      C      D     STD 1722926      A 352010      35 35201010
## 72721      C      D     STD 1720635      A 201040      20 20104010
## 72722      C      D     STD 1720635      A 201040      20 20104010
## 72723      C      D     STD 1720635      A 201040      20 20104010
## 72724      C      D     STD 1720635      A 201040      20 20104010
## 72725      C      D     STD 1720635      A 201040      20 20104010
```
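
The placement arguments are easier to follow on a small example. A minimal sketch with a one-row toy data frame (made-up values):

```r
library(dplyr)

toy <- data.frame(gvkey = 1004, datadate = 20110531, fyear = 2010, tic = "AIR")

to_front  <- toy %>% relocate(tic)                      # default: move to the front
after_key <- toy %>% relocate(tic, .after = gvkey)      # place right after gvkey
before_dt <- toy %>% relocate(fyear, .before = datadate) # place right before datadate

names(to_front)
names(after_key)
names(before_dt)
```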

---

## Summarise values with `summarise()`

[`summarize()`](https://dplyr.tidyverse.org/reference/summarise.html) collapses a data frame to a single row.

```r
df %>% summarise(ni_mean = mean(ni, na.rm = TRUE))
```

```
##    ni_mean
## 1 263.1611
```

It's not that useful until we learn the [`group_by()`](https://dplyr.tidyverse.org/reference/group_by.html) verb in a future topic.
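
Even without grouping, `summarise()` can compute several statistics in one call. A minimal sketch on toy data (made-up values) that also shows why `na.rm = TRUE` matters:

```r
library(dplyr)

toy <- data.frame(ni = c(10, 20, NA, 30))

stats <- toy %>%
  summarise(
    n_obs     = n(),                    # number of rows
    n_missing = sum(is.na(ni)),         # number of missing values
    ni_mean   = mean(ni, na.rm = TRUE), # mean of the non-missing values
    ni_raw    = mean(ni)                # NA, because one value is missing
  )

stats
```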

---
class: inverse, center, middle

# Subset a dataframe in R

---

## Five ways to subset a dataframe

- using brackets by extracting the rows and columns we want

```r
df[1:2, c("gvkey", "fyear", "tic", "conm")]
```

```
##   gvkey fyear tic     conm
## 1  1004  2010 AIR AAR CORP
## 2  1004  2011 AIR AAR CORP
```

- using brackets by omitting the rows and columns we don’t want

```r
df[-c(3:nrow(df)), -c(2, 4:7, 10:ncol(df))]
```

```
##   gvkey fyear tic     conm
## 1  1004  2010 AIR AAR CORP
## 2  1004  2011 AIR AAR CORP
```

- using brackets in combination with [`which()`](https://rdrr.io/r/base/which.html) and `%in%`

```r
df[which(df$gvkey == 1004 & df$fyear < 2012),
 names(df) %in% c("gvkey", "fyear","tic", "conm")]
```

```
##   gvkey fyear tic     conm
## 1  1004  2010 AIR AAR CORP
## 2  1004  2011 AIR AAR CORP
```

---

## Five ways to subset a dataframe

- using the [`subset()`](https://rdrr.io/r/base/subset.html) function

```r
subset(df, gvkey == 1004 & fyear < 2012, select = c(gvkey, fyear, tic, conm))
```

```
##   gvkey fyear tic     conm
## 1  1004  2010 AIR AAR CORP
## 2  1004  2011 AIR AAR CORP
```

- using the [`filter()`](https://dplyr.tidyverse.org/reference/filter.html) and [`select()`](https://dplyr.tidyverse.org/reference/select.html) functions from the [`package:dplyr`](https://dplyr.tidyverse.org) package

```r
# library(dplyr) or library(tidyverse)
df %>% filter(gvkey == 1004 & fyear < 2012) %>% select(gvkey, fyear, tic, conm)
```

```
##   gvkey fyear tic     conm
## 1  1004  2010 AIR AAR CORP
## 2  1004  2011 AIR AAR CORP
```

> Choose the way you like the most.
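
As a quick check that the five approaches agree, here is a minimal sketch on a toy data frame (made-up values; the column names mirror the course data) that runs each approach and compares the results:

```r
library(dplyr)

toy <- data.frame(
  gvkey = c(1004, 1004, 1004, 1045),
  fyear = c(2010, 2011, 2012, 2010),
  tic   = c("AIR", "AIR", "AIR", "AAL"),
  ni    = c(69.8, 67.7, 55.0, -471)
)
keep <- c("gvkey", "fyear", "tic")

way1 <- toy[1:2, keep]                    # keep what we want
way2 <- toy[-c(3:nrow(toy)), -4]          # drop what we don't want
way3 <- toy[which(toy$gvkey == 1004 & toy$fyear < 2012),
            names(toy) %in% keep]         # which() and %in%
way4 <- subset(toy, gvkey == 1004 & fyear < 2012,
               select = c(gvkey, fyear, tic))
way5 <- toy %>%
  filter(gvkey == 1004 & fyear < 2012) %>%
  select(gvkey, fyear, tic)

# compare every result with the first one, ignoring row names
all_same <- all(sapply(
  list(way2, way3, way4, way5),
  function(x) isTRUE(all.equal(x, way1, check.attributes = FALSE))
))
all_same
```

The positional approaches (the first two) break silently if rows or columns are reordered, while the condition-based ones keep working.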

---
class: inverse, center, middle

# Summary of Session 3

---

## For next week

- continue with your [Datacamp](https://datacamp.com) and textbook (<a target=_blank href="https://rc2e.com/index.html">R Cookbook</a> or <a target=_blank href="https://r4ds.had.co.nz/"> R for Data Science</a>)
- review today's code and pre-read next week's seminar notes
- complete **Assignment 1** and submit it on eLearn

---

## R Coding Style Guide

Style is subjective and arbitrary, but it is important to follow a generally accepted style if you want to share code with others. I suggest [the tidyverse style guide](https://style.tidyverse.org/), which [Google](https://google.github.io/styleguide/Rguide.html) has also adopted with some modifications.
- Highlights of **the tidyverse style guide**:
 - *File names*: end with .R
 - *Identifiers*: variable_name, function_name; try not to use "." in names, as base R's S3 system uses it to separate a generic function from a class (e.g., `print.data.frame`)
 - *Line length*: 80 characters
 - *Indentation*: two spaces, no tabs (RStudio by default converts tabs to spaces and you may change under global options)
 - *Spacing*: x = 0, not x=0, no space before a comma, but always place one after a comma
 - *Curly braces {}*: first on same line, last on own line
 - *Assignment*: use `<-`, not `=` nor `->`
 - *Semicolon (;)*: don't use them; I used one once in the interest of space
 - *return()*: a function returns its last evaluated expression by default; the tidyverse guide reserves `return()` for early returns, while Google's version asks for explicit returns
 - *File paths*: use a [relative file path](https://www.w3schools.com/html/html_filepaths.asp) such as "../../filename.csv" rather than an absolute path such as "C:/mydata/filename.csv". A backslash in a path string must be written as `\\`
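
A short sketch that applies several of these rules at once. The function `profit_margin()` and its arguments are hypothetical names used only for illustration, and the explicit `return()` on the last line follows one of the conventions in use:

```r
# snake_case names, `<-` for assignment, spaces around operators and after commas
profit_margin <- function(net_income, revenue) {
  # opening brace on the same line, closing brace on its own line
  if (revenue == 0) {
    return(NA_real_)  # early return to avoid dividing by zero
  }
  # explicit return; the last evaluated expression would be returned anyway
  return(net_income / revenue)
}

margin <- profit_margin(net_income = 10, revenue = 200)
margin
```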

---

## R packages used in this slide

This slide was prepared on 2021-09-20 from Session_3s.Rmd with R version 4.1.1 (2021-08-10) Kick Things on Windows 10 x64 build 18362 🙋.

The attached packages used in this slide are:

```
##         DT     plotly    forcats    stringr      dplyr      purrr      readr 
##     "0.18"  "4.9.4.1"    "0.5.1"    "1.4.0"    "1.0.7"    "0.3.4"    "2.0.1" 
##      tidyr     tibble    ggplot2  tidyverse kableExtra      knitr 
##    "1.1.3"    "3.1.3"    "3.3.5"    "1.3.1"    "1.3.4"     "1.33"
```