class: center, middle, inverse, title-slide # Programming with Data ## Session 3: R Programming (II) ### Dr. Wang Jiwei ### Master of Professional Accounting --- class: inverse, center, middle <!-- Define html_df for displaying small tables in html format --> <!-- Load in primary data set --> # Logical expressions --- ## Why use logical expressions? - We just saw an example in our subsetting function - `earnings < 20000` - Logical expressions give us more control over the data - They let us easily create logical vectors for subsetting data ```r df$earnings ``` ``` ## NULL ``` ```r df$earnings < 20000 ``` ``` ## logical(0) ``` --- ## Logical operators `==` `!=` `>` `<` `>=` `<=` `!` `|` `&` .pull-left[ - Equals: `==` - `2 == 2` `\(\rightarrow\)` TRUE - `2 == 3` `\(\rightarrow\)` FALSE - `'dog'=='dog'` `\(\rightarrow\)` TRUE - `'dog'=='cat'` `\(\rightarrow\)` FALSE ] .pull-right[ - Not equals: `!=` - The opposite of `==` - `2 != 2` `\(\rightarrow\)` FALSE - `2 != 3` `\(\rightarrow\)` TRUE - `'dog'!='cat'` `\(\rightarrow\)` TRUE ] - Comparing strings is done character by character --- ## Logical operators `==` `!=` `>` `<` `>=` `<=` `!` `|` `&` .pull-left[ - Greater than: `>` - `2 > 1` `\(\rightarrow\)` TRUE - `2 > 2` `\(\rightarrow\)` FALSE - `2 > 3` `\(\rightarrow\)` FALSE - `'dog'>'cat'` `\(\rightarrow\)` TRUE ] .pull-right[ - Less than: `<` - `2 < 1` `\(\rightarrow\)` FALSE - `2 < 2` `\(\rightarrow\)` FALSE - `2 < 3` `\(\rightarrow\)` TRUE - `'dog'<'cat'` `\(\rightarrow\)` FALSE ] .pull-left[ - Greater than or equal to: `>=` - `2 >= 1` `\(\rightarrow\)` TRUE - `2 >= 2` `\(\rightarrow\)` TRUE - `2 >= 3` `\(\rightarrow\)` FALSE ] .pull-right[ - Less than or equal to: `<=` - `2 <= 1` `\(\rightarrow\)` FALSE - `2 <= 2` `\(\rightarrow\)` TRUE - `2 <= 3` `\(\rightarrow\)` TRUE ] --- ## Logical operators - Not: `!` - This simply inverts everything - `!TRUE` `\(\rightarrow\)` FALSE - `!FALSE` `\(\rightarrow\)` TRUE - And: `&` - `TRUE & TRUE` `\(\rightarrow\)` TRUE - `TRUE 
& FALSE` `\(\rightarrow\)` FALSE
  - `FALSE & FALSE` `\(\rightarrow\)` FALSE
- Or: `|` (pipe, same key as '\\')
  - Note that `|` is evaluated after all `&`s
  - `TRUE | TRUE` `\(\rightarrow\)` TRUE
  - `TRUE | FALSE` `\(\rightarrow\)` TRUE
  - `FALSE | FALSE` `\(\rightarrow\)` FALSE
- You can mix in parentheses for grouping as needed

---

## Examples for logical operators

- How many tech firms had >$10B in revenue in 2017?

```r
sum(tech_df$revenue > 10000)
```

```
## [1] 46
```

- How many tech firms had >$10B in revenue but had negative earnings in 2017?

```r
sum(tech_df$revenue > 10000 & tech_df$earnings < 0)
```

```
## [1] 4
```

---

## Examples for logical operators

- Who are those 4 with high revenue and negative earnings?

```r
columns <- c("conm", "tic", "earnings", "revenue")
tech_df[tech_df$revenue > 10000 & tech_df$earnings < 0, columns]
```

```
##                               conm   tic  earnings  revenue
## 2100                   CORNING INC   GLW  -497.000 10116.00
## 2874  TELEFONAKTIEBOLAGET LM ERICS  ERIC -4307.493 24629.64
## 11804        DELL TECHNOLOGIES INC 7732B -3728.000 78660.00
## 23377                   NOKIA CORP   NOK -1796.087 27917.49
```

---

## Other special values

- We know `TRUE` and `FALSE` already
  - Note that `FALSE` can be represented as 0
  - Note that any non-zero number is treated as `TRUE` when coerced to logical (and `TRUE` itself becomes 1 when coerced to a number)
- There are also:
  - `Inf`: Infinity, often caused by dividing something by 0
  - `NaN`: "Not a number," typically the result of an undefined operation such as 0/0
  - `NA`: A missing value, usually *not* due to a mathematical error
  - `NULL`: Indicates a variable has nothing in it
- We can check for these with:
  - [`is.infinite()`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/is.finite)
  - [`is.nan()`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/is.finite)
  - [`is.na()`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/NA)
  - [`is.null()`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/NULL)

---

## if ... 
else - Conditional statements (used for programming) ```r # cond1, cond2, etc. can be any logical expression if(cond1) { # Code runs if cond1 is TRUE } else if (cond2) { # Can repeat 'else if' as needed # Code runs if this is the first condition that is TRUE } else { # Code runs if none of the above conditions TRUE } ``` --- ## Other uses - Vectorized conditional statements using [`ifelse()`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/ifelse) - If else takes 3 vectors and returns 1 vector - A vector of `TRUE` or `FALSE` - A vector of elements to return from when `TRUE` - A vector of elements to return from when `FALSE` ```r # Outputs odd for odd numbers and even for even numbers even <- rep("even", 5) odd <- rep("odd", 5) numbers <- 1:5 ifelse(numbers %% 2, odd, even) ``` ``` ## [1] "odd" "even" "odd" "even" "odd" ``` --- ## Practice: Subsetting df - This practice focuses on subsetting out potentially interesting parts of our data frame - We will also see which of Goldman, JPMorgan, and Citigroup, in which year, had the lowest earnings since 2010 - Do Exercise 5 on the following R practice file: - <a target="_blank" href="Session_2s_Exercise.html#Exercise_5:_Subsetting_our_data_frame">R Practice</a> --- class: inverse, center, middle # Loops with control structure --- ## Looping: While loop .pull-left[ <img src="../../../Figures/while-loop.png"> ] .pull-right[ - A [`while()`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/Control) loop executes code repeatedly until a specified condition is `FALSE` - An index shall be initiated before the `while` loop, and it must be changed within the loop, otherwise the loop will never end. 
```r
i <- 0
while(i < 5) {
  print(i)
  i <- i + 2
}
```

```
## [1] 0
## [1] 2
## [1] 4
```
]

---

## Looping: For loop

.pull-left[
<img src="../../../Figures/for-loop.png">
]

.pull-right[
- A [`for()`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/Control) loop executes code once for each element of a vector or list, assigning that element to the loop variable on each pass

```r
for(i in c(0, 2, 4)) {
  print(i)
}
```

```
## [1] 0
## [1] 2
## [1] 4
```
]

---

## Dangers of looping in R

- Loops in R are relatively slow: they perform one calculation at a time, whereas R is best at many calculations at once via [vectorization](http://www.noamross.net/archives/2014-04-16-vectorization-in-r-why/) or matrix algebra
- We will introduce other ways to loop, using vectorized functions such as `lapply()`
- But as a new programmer, you must understand the logic of loops
- [`Sys.time()`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/Sys.time) returns the current system time

.pull-left[
```r
# Profit margin, all US tech firms
start <- Sys.time()
# Pre-allocate the result vector, one slot per firm
margin_1 <- rep(0, length(tech_df$earnings))
for(i in seq_along(tech_df$earnings)) {
  margin_1[i] <- tech_df$earnings[i] /
    tech_df$revenue[i]
}
end <- Sys.time()
time_1 <- end - start
time_1
```

```
## Time difference of 0.00900197 secs
```
]

.pull-right[
```r
# Profit margin, all US tech firms
start <- Sys.time()
margin_2 <- tech_df$earnings /
  tech_df$revenue
end <- Sys.time()
time_2 <- end - start
time_2
```

```
## Time difference of 0.002002001 secs
```
]

---

## Dangers of looping in R

- Loops in R are relatively slow: they perform one calculation at a time, whereas R is best at many calculations at once via vectorization or matrix algebra

```r
# Are these calculations identical?
identical(margin_1, margin_2)
```

```
## [1] TRUE
```

```r
# How much slower is the loop?
paste(as.numeric(time_1) / as.numeric(time_2), "times") ``` ``` ## [1] "4.49648684053829 times" ``` --- class: inverse, center, middle # Functions and packages --- ## Help functions - There are two equivalent ways to quickly access help files: - `?` and `help()` - Usage to get the help file for [`data.frame()`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/data.frame): - `?data.frame` - `help(data.frame)` - To see the options for a function, use `args()` ```r args(data.frame) ``` ``` ## function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, ## fix.empty.names = TRUE, stringsAsFactors = FALSE) ## NULL ``` --- ## A note on using functions ```r args(data.frame) ``` ``` ## function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, ## fix.empty.names = TRUE, stringsAsFactors = FALSE) ## NULL ``` - The `...` represents a series of inputs - In this case, inputs like `name=data`, where name is the column name and data is a vector - The `____ = ____` arguments are options for the function - The default is prespecified, but you can overwrite it - eg, you may change `stringsAsFactors` from FALSE (default) to TRUE - Options can be very useful or save us a lot of time! - You can always find them by: - Using the `?` command - Checking other documentation like <a target="_blank" href="https://www.rdocumentation.org/">www.rdocumentation.org</a> - Using the `args()` function --- ## Packages in R - R packages are collections of functions and data sets developed by the community. 
- Most R packages are stored on the official <a target = "_blank" href = "https://cran.r-project.org/">CRAN</a> repository and can be installed directly from within RStudio
- Alternatively, download the package file to your local disk and install it through RStudio or with `install.packages(file.choose(), repos = NULL)`

```r
# To install the tidyverse package which will be used for this course
# tidyverse is a collection of useful packages in R
# https://www.tidyverse.org/
install.packages("tidyverse")
# or to install multiple packages in one go:
install.packages(c("ggplot2", "dplyr", "magrittr"))
```

- Load packages using [`library()`](https://rdrr.io/r/base/library.html)
  - You need to do this each time you open a new instance of R

```r
# Load the tidyverse package
library(tidyverse)
```

---

## Pipe notation

> Pipe: passes the output on the left directly as an input to the right.

- Base R (ie, without any external package) introduced the native pipe `|>` in R version 4.1 (2021).
  - [The New R Pipe](https://www.r-bloggers.com/2021/05/the-new-r-pipe/)
- A more popular pipe had already been provided by the [`package:magrittr`](https://magrittr.tidyverse.org)
  - Part of [`package:tidyverse`](https://tidyverse.tidyverse.org), an extremely popular collection of packages
  - The magrittr pipe is written `%>%`
  - `Left %>% Right(arg2, ...)` is the same as `Right(Left, arg2, ...)`

> Piping can drastically improve code readability

---

## Piping example

> Plot tech firms' earnings vs. revenue, >$10B in revenue

```r
# %>% comes from magrittr and ggplot() comes from ggplot2, both part of tidyverse
# alternatively, you may load these two packages separately
# note that ggplot2 adds layers with "+", which is not a pipe
library(tidyverse)
library(plotly)
plot <- tech_df %>%
  subset(revenue > 10000) %>%
  ggplot(aes(x = revenue, y = earnings)) + # Adds point, and ticker
  geom_point(shape = 1, aes(text = sprintf("Ticker: %s", tic)))
ggplotly(plot) # Makes the plot interactive
```
--- ## Without piping ```r library(tidyverse) library(plotly) plot <- ggplot(subset(tech_df, revenue > 10000), aes(x = revenue, y = earnings)) + geom_point(shape = 1, aes(text = sprintf("Ticker: %s", tic))) ggplotly(plot) # Makes the plot interactive ```
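---

## Piping vs. nesting: a quick check

The rule `Left %>% Right(arg2, ...)` = `Right(Left, arg2, ...)` can be verified on a small example. This is a minimal sketch, assuming R 4.1 or later so that the native pipe `|>` is available without loading any package. `toy_df` and its numbers are made up for illustration and are not taken from `tech_df`.

```r
# Toy data (hypothetical values, for illustration only)
toy_df <- data.frame(
  tic      = c("AAA", "BBB", "CCC"),
  revenue  = c(5000, 12000, 30000),
  earnings = c(100, -50, 900)
)

# Nested call: read from the inside out
nested <- nrow(subset(toy_df, revenue > 10000))

# Piped call with the native pipe |> : read from left to right
piped <- toy_df |>
  subset(revenue > 10000) |>
  nrow()

# Both forms count the same two rows
identical(nested, piped)
```

For a call like this one, `%>%` from `package:magrittr` gives the same result once the package is loaded.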
---

## Practice: library usage

- This practice focuses on using an external library
  - We will also see which of Goldman, JPMorgan, and Citigroup, in which year, had the lowest earnings since 2010
- Do Exercise 6 on the following R practice file:
  - <a target="_blank" href="Session_2s_Exercise.html#Exercise_6:_External_library_usage">R Practice</a>

> Note: The `~` indicates a formula: the left side is the y-axis and the right side is the x-axis

> Note: The `|` tells lattice to make panels based on the variable(s) to the right

---

## Math functions

- [`sum()`](https://rdrr.io/r/base/sum.html): Sum of a vector
- [`abs()`](https://rdrr.io/r/base/MathFun.html): Absolute value
- [`sign()`](https://rdrr.io/r/base/sign.html): The sign of a number

```r
vector <- c(-2, -1, 0, 1, 2)
sum(vector)
```

```
## [1] 0
```

```r
abs(vector)
```

```
## [1] 2 1 0 1 2
```

```r
sign(vector)
```

```
## [1] -1 -1  0  1  1
```

---

## Stats functions

- [`mean()`](https://rdrr.io/r/base/mean.html): Calculates the mean of a vector
- [`median()`](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/median): Calculates the median of a vector
- [`sd()`](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/sd): Calculates the sample standard deviation of a vector
- [`quantile()`](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/quantile): Provides the *quartiles* of a vector
- [`range()`](https://rdrr.io/r/base/range.html): Gives the minimum and maximum of a vector
  - Related: [`min()`](https://rdrr.io/r/base/Extremes.html) and [`max()`](https://rdrr.io/r/base/Extremes.html)

```r
quantile(tech_df$earnings)
```

```
##         0%        25%        50%        75%       100% 
## -4307.4930   -15.9765     1.8370    91.3550 48351.0000 
```

```r
range(tech_df$earnings)
```

```
## [1] -4307.493 48351.000
```

---

## Make your own functions!

- Use the `function()` function!
  - `my_func <- function(arguments) {code}`
- It is recommended to use `return()` explicitly to specify what the function returns.

> Simple function: Add 2 to a number ```r add_two <- function(n) { n + 2 } add_two(500) ``` ``` ## [1] 502 ``` ```r add_two <- function(n) { return(n + 2) } add_two(500) ``` ``` ## [1] 502 ``` --- ## Slightly more complex ```r mult_together <- function(n1, n2=0, square=FALSE) { if (!square) { return(n1 * n2) } else { return(n1 * n1) } } mult_together(5, 6) ``` ``` ## [1] 30 ``` ```r mult_together(5, 6, square = TRUE) ``` ``` ## [1] 25 ``` ```r mult_together(5, square = TRUE) ``` ``` ## [1] 25 ``` --- ## Practice: Functions - This practice focuses on making a custom function - Currency conversion between USD and SGD! - Do Exercise 7 on the following R practice file: - <a target="_blank" href="Session_2s_Exercise.html#Exercise_7:_Making_your_own_function">R Practice</a> --- ## Challenging Practice Define a function called `digits(n)` which returns the number of digits of a given integer number. For simplicity, we assume `n` is zero or positive integer, ie, n >= 0. - if you call `digits(251)`, it should return `3` - if you call `digits(5)`, it should return `1` - if you call `digits(0)`, it should return `1` For practice, you are required to use `if` conditions and `while` loops when necessary. You should use integer division `%/%` in the `while` loop to count the number of digits. You are not allowed to use functions such as `nchar()` and `floor()`. --- class: inverse, center, middle # Loops with `lapply()` functions --- ## Loops with `lapply()` You don't have to always write loops using `for` or `while`. There are a group of [`lapply()`](https://rdrr.io/r/base/lapply.html) functions which can implement loops. 
- [`lapply()`](https://rdrr.io/r/base/lapply.html): Loop over a list, evaluate a function on each element, and return a list
- There are others too: [`sapply()`](https://rdrr.io/r/base/lapply.html); [`mapply()`](https://rdrr.io/r/base/mapply.html); [`apply()`](https://rdrr.io/r/base/apply.html); [`vapply()`](https://rdrr.io/r/base/lapply.html); [`tapply()`](https://rdrr.io/r/base/tapply.html)

Let's see the structure of [`lapply()`](https://rdrr.io/r/base/lapply.html). It extracts the function using [`match.fun()`](https://rdrr.io/r/base/match.fun.html), checks whether `X` is a plain vector or list (if not, it converts `X` to a list using [`as.list()`](https://rdrr.io/r/base/list.html)), and finally loops internally in C code (`.Internal(lapply(X, FUN))`).

```r
lapply
```

```
## function (X, FUN, ...) 
## {
##     FUN <- match.fun(FUN)
##     if (!is.vector(X) || is.object(X)) 
##         X <- as.list(X)
##     .Internal(lapply(X, FUN))
## }
## <bytecode: 0x000000001b6fc2e8>
## <environment: namespace:base>
```

---

## Apply a function over a list

[`rnorm()`](https://rdrr.io/r/stats/Normal.html) generates normally distributed random numbers (as a vector), with a default mean of 0 and a default standard deviation of 1.

```r
set.seed(1) # make random number generation reproducible
x_list <- list(a = rnorm(10000), b = rnorm(20000, 1, 5))
str(x_list)
```

```
## List of 2
##  $ a: num [1:10000] -0.626 0.184 -0.836 1.595 0.33 ...
##  $ b: num [1:20000] -3.02 -4.28 -4.18 -4.93 -1.5 ...
```

```r
x_list_mean <- lapply(x_list, mean)
str(x_list_mean)
```

```
## List of 2
##  $ a: num -0.00654
##  $ b: num 1.01
```

```r
x_list_mean_vector <- sapply(x_list, mean)
str(x_list_mean_vector)
```

```
##  Named num [1:2] -0.00654 1.00841
##  - attr(*, "names")= chr [1:2] "a" "b"
```

---

## Apply a function over an array

Arrays, created with [`array()`](https://rdrr.io/r/base/array.html), are data objects that can store data in more than two dimensions; all elements of an array must share the same data type.
Recall that `matrix` is two-dimensional data with same data type and `dataframe` is two-dimensional data which allows different data types. [`apply()`](https://rdrr.io/r/base/apply.html) can evaluate a function over an array. ```r set.seed(1) # make random number generation reproducible # create a 2-dimensional array (a matrix for this case) x_array <- array(c(rnorm(10000), rnorm(20000, 1, 5)), dim = c(2, 10000)) str(x_array) ``` ``` ## num [1:2, 1:10000] -0.626 0.184 -0.836 1.595 0.33 ... ``` ```r # apply mean() on the first dimension, ie, rows of a matrix/dataframe x_array_mean <- apply(x_array, 1, mean) str(x_array_mean) ``` ``` ## num [1:2] 0.467 0.506 ``` ```r # apply mean() on the second dimension, ie, columns of a matrix/dataframe x_array_mean <- apply(x_array, 2, mean) str(x_array_mean) ``` ``` ## num [1:10000] -0.221 0.38 -0.245 0.613 0.135 ... ``` --- class: inverse, center, middle # Managing dataframes with `dplyr` --- ## Read files to data frames The most popular file format among data analysts is the [comma-separated values (CSV)](https://en.wikipedia.org/wiki/Comma-separated_values) file that uses a comma (`,`) to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. - you can save Excel file into CSV file The simplest way to import smaller CSV is to use the [`read.csv()`](https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/read.table) from the base R (ie, without any additional packages). 
Other functions include: [`read.table()`](https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/read.table) (the general function for delimited text files, with the separator set through its `sep` argument) and [`read.delim()`](https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/read.table) (for tab-delimited text files).

```r
df <- read.csv("data/session_2.csv")
```

Other packages also provide functions for importing files:

- [readr::read_csv()](https://readr.tidyverse.org/reference/read_delim.html)
- [data.table::fread()](https://Rdatatable.gitlab.io/data.table/reference/fread.html)
- [readxl::read_excel()](https://readxl.tidyverse.org/reference/read_excel.html)
- [other packages](https://www.datacamp.com/community/tutorials/r-data-import-tutorial) for other data formats such as JSON, HTML, SAS, STATA, etc

---

## Single table functions

[`package:dplyr`](https://dplyr.tidyverse.org) is part of the [`package:tidyverse`](https://tidyverse.tidyverse.org) and provides useful functions for data manipulation. A competing package is [`package:data.table`](https://r-datatable.com), which is [more efficient](https://atrebas.github.io/post/2019-03-03-datatable-dplyr/) for large datasets (I suggest > 1 GB)

* Rows:
  * [`filter()`](https://dplyr.tidyverse.org/reference/filter.html) chooses rows based on column values.
  * [`slice()`](https://dplyr.tidyverse.org/reference/slice.html) chooses rows based on location.
  * [`arrange()`](https://dplyr.tidyverse.org/reference/arrange.html) changes the order of the rows.
* Columns:
  * [`select()`](https://dplyr.tidyverse.org/reference/select.html) changes whether or not a column is included.
  * [`rename()`](https://dplyr.tidyverse.org/reference/rename.html) changes the name of columns.
  * [`mutate()`](https://dplyr.tidyverse.org/reference/mutate.html) changes the values of columns and creates new columns.
  * [`relocate()`](https://dplyr.tidyverse.org/reference/relocate.html) changes the order of the columns.
* Groups of rows: * [`summarize()`](https://dplyr.tidyverse.org/reference/summarise.html) collapses a group into a single row. --- ## Filter rows with `filter()` [`filter()`](https://dplyr.tidyverse.org/reference/filter.html) allows you to select a subset of rows in a data frame. The first argument is the dataframe. The second and subsequent arguments refer to variables within that dataframe, selecting rows where the expression is `TRUE`. Select all rows with ticker = AAPL (Apple Inc.) and after 2013 fiscal year: ```r library(tidyverse) df %>% filter(tic == "AAPL" & fyear > 2013) ``` ``` ## gvkey datadate fyear indfmt consol popsrc datafmt tic conm curcd ni ## 1 1690 20140930 2014 INDL C D STD AAPL APPLE INC USD 39510 ## 2 1690 20150930 2015 INDL C D STD AAPL APPLE INC USD 53394 ## 3 1690 20160930 2016 INDL C D STD AAPL APPLE INC USD 45687 ## 4 1690 20170930 2017 INDL C D STD AAPL APPLE INC USD 48351 ## revt cik costat gind gsector gsubind ## 1 182795 320193 A 452020 45 45202030 ## 2 233715 320193 A 452020 45 45202030 ## 3 215091 320193 A 452020 45 45202030 ## 4 229234 320193 A 452020 45 45202030 ``` This is roughly equivalent to this base R code: ```r df[df$tic == "AAPL" & df$fyear > 2013, ] ``` --- ## Choose rows with `slice()` [`slice()`](https://dplyr.tidyverse.org/reference/slice.html) is to select, remove, and duplicate rows by their (integer) locations. 
```r df %>% slice(5:7) ``` ``` ## gvkey datadate fyear indfmt consol popsrc datafmt tic conm curcd ni ## 1 1004 20150531 2014 INDL C D STD AIR AAR CORP USD 10.2 ## 2 1004 20160531 2015 INDL C D STD AIR AAR CORP USD 47.7 ## 3 1004 20170531 2016 INDL C D STD AIR AAR CORP USD 56.5 ## revt cik costat gind gsector gsubind ## 1 1594.3 1750 A 201010 20 20101010 ## 2 1662.6 1750 A 201010 20 20101010 ## 3 1767.6 1750 A 201010 20 20101010 ``` It is accompanied by a number of helpers for common use cases: * [`slice_head()`](https://dplyr.tidyverse.org/reference/slice.html) and [`slice_tail()`](https://dplyr.tidyverse.org/reference/slice.html) select the first or last rows. * [`slice_sample()`](https://dplyr.tidyverse.org/reference/slice.html) randomly selects rows. * [`slice_min()`](https://dplyr.tidyverse.org/reference/slice.html) and [`slice_max()`](https://dplyr.tidyverse.org/reference/slice.html) select rows with highest or lowest values of a variable. --- ## Arrange rows with `arrange()` [`arrange()`](https://dplyr.tidyverse.org/reference/arrange.html) is to reorder the rows by a set of column names: ```r df %>% arrange(conm, desc(fyear)) %>% head() ``` ``` ## gvkey datadate fyear indfmt consol popsrc datafmt tic conm ## 1 122519 20170630 2017 INDL C D STD FLWS 1-800-FLOWERS.COM ## 2 122519 20160630 2016 INDL C D STD FLWS 1-800-FLOWERS.COM ## 3 122519 20150630 2015 INDL C D STD FLWS 1-800-FLOWERS.COM ## 4 122519 20140630 2014 INDL C D STD FLWS 1-800-FLOWERS.COM ## 5 122519 20130630 2013 INDL C D STD FLWS 1-800-FLOWERS.COM ## 6 122519 20120630 2012 INDL C D STD FLWS 1-800-FLOWERS.COM ## curcd ni revt cik costat gind gsector gsubind ## 1 USD 44.041 1193.625 1084869 A 255020 25 25502020 ## 2 USD 36.875 1173.024 1084869 A 255020 25 25502020 ## 3 USD 20.287 1121.506 1084869 A 255020 25 25502020 ## 4 USD 15.372 756.345 1084869 A 255020 25 25502020 ## 5 USD 12.321 735.497 1084869 A 255020 25 25502020 ## 6 USD 17.646 716.257 1084869 A 255020 25 25502020 ``` --- ## Select columns 
with `select()` [`select()`](https://dplyr.tidyverse.org/reference/select.html) allows you to subset a data frame by column names (variables/features/predictors) ```r # Select columns by name df %>% select(gvkey, tic, conm, fyear) %>% slice(1:3) ``` ``` ## gvkey tic conm fyear ## 1 1004 AIR AAR CORP 2010 ## 2 1004 AIR AAR CORP 2011 ## 3 1004 AIR AAR CORP 2012 ``` ```r # Select all columns between gvkey and conm (inclusive) df %>% select(gvkey:conm) # Select all columns except those from gvkey to conm (inclusive) df %>% select(!(gvkey:conm)) # Select all columns ending with "d" df %>% select(ends_with("d")) ``` --- ## Rename columns with `rename()` [`rename()`](https://dplyr.tidyverse.org/reference/rename.html) allows you to rename column names ```r # rename columns df %>% select(gvkey, tic, conm, fyear) %>% rename(comp_name = conm) %>% slice(1:3) ``` ``` ## gvkey tic comp_name fyear ## 1 1004 AIR AAR CORP 2010 ## 2 1004 AIR AAR CORP 2011 ## 3 1004 AIR AAR CORP 2012 ``` --- ## Add new columns with `mutate()` [`mutate()`](https://dplyr.tidyverse.org/reference/mutate.html) is to add new columns. [`package:DT`](https://github.com/rstudio/DT) helps to present larger dataset using the [`datatable()`](https://rdrr.io/pkg/DT/man/datatable.html) function. ```r library(DT) df %>% mutate(margin = ni / revt) %>% slice(1:20) %>% select(gvkey, conm, tic, fyear, ni, revt, margin) %>% datatable(options = list(pageLength = 2), rownames = FALSE) ```
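---

## `mutate()` on toy data

The `margin = ni / revt` step above can be tried without the course data set. This is a minimal sketch, assuming `package:dplyr` is installed; `toy_df` and its figures are hypothetical and chosen only to make the arithmetic easy to check.

```r
library(dplyr)

# Made-up figures for illustration only (not from df)
toy_df <- data.frame(
  tic  = c("AAA", "BBB"),
  ni   = c(10, -5),
  revt = c(100, 50)
)

toy_margin <- toy_df %>%
  mutate(
    margin = ni / revt,   # new column computed from existing columns
    loss   = margin < 0   # a column created above can be used right away
  )

# mutate() gives the same numbers as the base R calculation
all.equal(toy_margin$margin, toy_df$ni / toy_df$revt)
```

Note that `mutate()` returns a new data frame and leaves `toy_df` unchanged, so the result has to be assigned if you want to keep it.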
--- ## Change column order with `relocate()` [`relocate()`](https://dplyr.tidyverse.org/reference/relocate.html) uses a similar syntax as [`select()`](https://dplyr.tidyverse.org/reference/select.html) to move blocks of columns at once ```r df %>% relocate(tic:revt, .after = fyear) %>% tail() ``` ``` ## gvkey datadate fyear tic conm curcd ni revt indfmt ## 72720 324684 20171231 2017 ASLN ASLAN PHARMACEUTIC USD -39.892 0.0 INDL ## 72721 326688 20131231 2013 NVT NVENT ELECTRIC PLC USD NA NA INDL ## 72722 326688 20141231 2014 NVT NVENT ELECTRIC PLC USD NA NA INDL ## 72723 326688 20151231 2015 NVT NVENT ELECTRIC PLC USD NA NA INDL ## 72724 326688 20161231 2016 NVT NVENT ELECTRIC PLC USD 259.100 2116.0 INDL ## 72725 326688 20171231 2017 NVT NVENT ELECTRIC PLC USD 361.700 2097.9 INDL ## consol popsrc datafmt cik costat gind gsector gsubind ## 72720 C D STD 1722926 A 352010 35 35201010 ## 72721 C D STD 1720635 A 201040 20 20104010 ## 72722 C D STD 1720635 A 201040 20 20104010 ## 72723 C D STD 1720635 A 201040 20 20104010 ## 72724 C D STD 1720635 A 201040 20 20104010 ## 72725 C D STD 1720635 A 201040 20 20104010 ``` --- ## Summarise values with `summarise()` [`summarize()`](https://dplyr.tidyverse.org/reference/summarise.html) collapses a data frame to a single row. ```r df %>% summarise(ni_mean = mean(ni, na.rm = TRUE)) ``` ``` ## ni_mean ## 1 263.1611 ``` It's not that useful until we learn the [`group_by()`](https://dplyr.tidyverse.org/reference/group_by.html) verb in a future topic. 
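---

## `summarise()` with several statistics

A single `summarise()` call can compute more than one summary at a time. The sketch below assumes `package:dplyr` is installed and uses a hypothetical three-row `toy_df` (not the course data) with one missing value, which also shows why `na.rm = TRUE` is used above.

```r
library(dplyr)

# Made-up values for illustration only; one value is missing
toy_df <- data.frame(ni = c(10, NA, 30))

toy_summary <- toy_df %>%
  summarise(
    ni_mean   = mean(ni, na.rm = TRUE), # mean of the non-missing values
    ni_max    = max(ni, na.rm = TRUE),  # largest non-missing value
    n_missing = sum(is.na(ni))          # how many values are missing
  )
toy_summary

# Without na.rm = TRUE a single NA makes the mean NA
mean(toy_df$ni)
```

The result is still a data frame with one row, with one column per summary.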
---

class: inverse, center, middle

# Subset a dataframe in R

---

## Five ways to subset a dataframe

- using brackets by extracting the rows and columns we want

```r
df[1:2, c("gvkey", "fyear", "tic", "conm")]
```

```
##   gvkey fyear tic     conm
## 1  1004  2010 AIR AAR CORP
## 2  1004  2011 AIR AAR CORP
```

- using brackets by omitting the rows and columns we don’t want

```r
# rows are counted with nrow(), columns with ncol()
df[-c(3:nrow(df)), -c(2, 4:7, 10:ncol(df))]
```

```
##   gvkey fyear tic     conm
## 1  1004  2010 AIR AAR CORP
## 2  1004  2011 AIR AAR CORP
```

- using brackets in combination with [`which()`](https://rdrr.io/r/base/which.html) and `%in%`

```r
df[which(df$gvkey == 1004 & df$fyear < 2012),
   names(df) %in% c("gvkey", "fyear","tic", "conm")]
```

```
##   gvkey fyear tic     conm
## 1  1004  2010 AIR AAR CORP
## 2  1004  2011 AIR AAR CORP
```

---

## Five ways to subset a dataframe

- using the [`subset()`](https://rdrr.io/r/base/subset.html) function

```r
# inside subset(), columns can be referred to by name without df$
subset(df, gvkey == 1004 & fyear < 2012,
       c("gvkey", "fyear","tic", "conm"))
```

```
##   gvkey fyear tic     conm
## 1  1004  2010 AIR AAR CORP
## 2  1004  2011 AIR AAR CORP
```

- using the [`filter()`](https://dplyr.tidyverse.org/reference/filter.html) and [`select()`](https://dplyr.tidyverse.org/reference/select.html) functions from the [`package:dplyr`](https://dplyr.tidyverse.org) package

```r
# library(dplyr) or library(tidyverse)
df %>%
  filter(gvkey == 1004 & fyear < 2012) %>%
  select(gvkey, fyear, tic, conm)
```

```
##   gvkey fyear tic     conm
## 1  1004  2010 AIR AAR CORP
## 2  1004  2011 AIR AAR CORP
```

> Choose the way you like the most

---

class: inverse, center, middle

# Summary of Session 3

---

## For next week

- continue with your [Datacamp](https://datacamp.com) and textbook (<a target=_blank href="https://rc2e.com/index.html">R Cookbook</a> or <a target=_blank href="https://r4ds.had.co.nz/"> R for Data Science</a>)
- review today's code and pre-read next week's seminar notes
- complete **Assignment 1** and submit it on eLearn

---

## R Coding Style Guide

Style is 
subjective and arbitrary, but it is important to follow a generally accepted style if you want to share code with others. I suggest [the tidyverse style guide](https://style.tidyverse.org/), which is also adopted by [Google](https://google.github.io/styleguide/Rguide.html) with some modifications.

- Highlights of **the tidyverse style guide**:
  - *File names*: end with .R
  - *Identifiers*: variable_name, function_name; avoid "." in names because base R's S3 system uses it to separate a generic from a class in method names
  - *Line length*: 80 characters
  - *Indentation*: two spaces, no tabs (RStudio converts tabs to spaces by default; you may change this under global options)
  - *Spacing*: `x = 0`, not `x=0`; no space before a comma, but always one after it
  - *Curly braces {}*: first on same line, last on own line
  - *Assignment*: use `<-`, not `=` nor `->`
  - *Semicolon(;)*: don't use; I used it once in the interest of space
  - *return()*: use explicit returns in functions; by default a function returns the last evaluated expression
  - *File paths*: use a [relative file path](https://www.w3schools.com/html/html_filepaths.asp) "../../filename.csv" rather than an absolute path "C:/mydata/filename.csv". A backslash in a path must be escaped as `\\`

---

## R packages used in this slide

This slide was prepared on 2021-09-20 from Session_3s.Rmd with R version 4.1.1 (2021-08-10) Kick Things on Windows 10 x64 build 18362 🙋.

The attached packages used in this slide are:

```
##         DT     plotly    forcats    stringr      dplyr      purrr      readr 
##     "0.18"  "4.9.4.1"    "0.5.1"    "1.4.0"    "1.0.7"    "0.3.4"    "2.0.1" 
##      tidyr     tibble    ggplot2  tidyverse kableExtra      knitr 
##    "1.1.3"    "3.1.3"    "3.3.5"    "1.3.1"    "1.3.4"     "1.33" 
```