Statistical Machine Learning

Introduction to R

Prof. Jodavid Ferreira

UFPE

R


R is a programming language originally developed for statistical computing and graphics. It is free software, distributed under the GNU General Public License.


  • Today it is widely used in statistics and data science and is among the most popular languages for data analysis.
  • It is an open-source implementation of S, a statistical programming language developed at Bell Laboratories (then part of AT&T).
  • Its main advantages for statistical programming are ease of use, high-quality graphics, and an active user community.

R Language



  • To obtain R, visit the link: https://cloud.r-project.org/

  • CRAN (Comprehensive R Archive Network) is a set of mirror servers distributed worldwide and is used to distribute R and R packages.

  • A new feature release of R (version x.y.0) comes out roughly once a year, followed by a few patch releases (x.y.1, x.y.2, and so on).

  • Keeping R up to date is recommended, since new versions bring performance improvements for recent hardware, new features, and bug fixes.

At the time this lecture was created, R was at version 4.5.2.
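
The native pipe |> used later in these slides was introduced in R 4.1.0, so it can be worth checking the installed version before running the examples. A minimal sketch using base R only:

```r
# Print the version string of the running R installation
print(R.version.string)

# getRversion() returns a version object that can be compared with a string
has_native_pipe <- getRversion() >= "4.1.0"
print(has_native_pipe)
```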

First Steps in R



  • From here on, concepts are introduced through examples, with the goal of building a solid understanding of the essentials needed to get started with R.

Variable Types

Variable types determine the nature of the stored information and directly influence the operations that can be performed on them, as well as the appropriate visualizations for each situation. In general, data can be classified into several categories, such as: numerical, categorical, ordinal, binary, among others. Each category has specific characteristics that must be considered when conducting analyses.

Variable Types



Data can be classified in several ways, but the chosen type must reflect the nature of the information it represents. The main data types include:

  1. Numerical Data: Represent measurable quantities, such as age, weight, height, or temperature. They can be continuous, taking any value within a range (e.g., height), or discrete, taking specific values, usually integers (e.g., number of children).

  2. Categorical Data: Represent categories or groups, such as gender, eye color, or marital status. They can be nominal, without a natural order (e.g., colors), or ordinal, with a natural order (e.g., education levels: primary, secondary, higher education).

Variable Types



Source: https://study.com/academy/lesson/types-of-data-text-numbers-multimedia.html

Variable Types



Source: https://www.projectpro.io/recipes/convert-categorical-features-numerical-features-in-python

Variable Types



  3. Binary Data: Represent two mutually exclusive categories, such as yes/no, true/false, pass/fail.

Variable Types



  4. Text Data: Represent text, such as customer comments, news articles, and interview transcripts.

Source: https://medium.com/@datanizing/modern-text-mining-with-python-part-2-of-5-data-exploration-with-pandas-ee3456cf6a4

Variable Types



  5. Time Data: Represent observations indexed by time, such as daily sales, hourly temperature, and historical events.

Variable Types


  6. Geospatial Data: Represent geographical information, such as GPS coordinates, country borders, and locations of points of interest.

Source: https://www.researchgate.net/publication/235436779_Analises_espaciais_em_planejamento_urbano_novas_tendencias/figures?lo=1

Variable Types



  7. Multimedia Data: Represent images, audio, and video, such as product photos, interview recordings, and promotional videos.

Variable Types



  8. Network Data: Describe connections and relationships, such as social networks, transportation networks, and academic citation networks.

Data Types in R



In R, each kind of data is stored using a specific data type.

  • Below are the most common types in R, with the corresponding class name in parentheses.

  • The examples that follow use the class() function, which returns the class of an object. This matters because the class of an object determines which operations and manipulations can be applied to it.

Data Types in R



  1. Numeric (numeric): For real numbers, including whole-number values. By default R stores a literal such as 2 as a double, so its class is numeric. Examples include age, income, and temperature.

    # Examples of numeric variables
    idade <- 30
    preco_produto <- 99.90
    numero_de_filhos <- 2
    
    idade  # Prints the value of idade (age)
    preco_produto # Prints the value of preco_produto (product price)
    numero_de_filhos # Prints the value of numero_de_filhos (number of children)
    
    class(idade) # Checks the class of idade (numeric)
    class(preco_produto) # Checks the class of preco_produto (numeric)
    class(numero_de_filhos) # Checks the class of numero_de_filhos (numeric)
    [1] 30
    [1] 99.9
    [1] 2
    [1] "numeric"
    [1] "numeric"
    [1] "numeric"
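
Whole numbers typed without a suffix are stored as doubles, which is why all three objects above report numeric. The sketch below, with the illustrative names x and y, contrasts this with the L suffix, which asks R for a true integer:

```r
x <- 2    # stored as a double
y <- 2L   # the L suffix requests an integer

class(x)       # "numeric"
class(y)       # "integer"
typeof(x)      # "double"
is.numeric(y)  # TRUE: integers also count as numeric
class(x + y)   # "numeric": the integer is promoted to double
```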

Data Types in R



  2. Dates (Date): A special type for representing calendar dates.

    # Example of a date variable
    data_nascimento <- as.Date("1993-08-15") # Converts text to the Date type
    
    data_nascimento # Prints the date (data_nascimento = date of birth)
    class(data_nascimento) # Checks the class of the variable (Date)
    [1] "1993-08-15"
    [1] "Date"
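
Storing dates as Date rather than text is what makes date arithmetic possible. A small sketch with two made-up dates (start_date and end_date are illustrative names):

```r
start_date <- as.Date("2024-02-01")
end_date   <- as.Date("2024-03-01")

elapsed <- end_date - start_date   # a difftime object measured in days
as.numeric(elapsed)                # 29 (2024 is a leap year)
start_date + 7                     # "2024-02-08"
format(start_date, "%d/%m/%Y")     # "01/02/2024"
start_date < end_date              # TRUE
```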

Data Types in R



  3. Categorical (Factors) (factor): For nominal or ordinal qualitative variables. Useful for categories such as gender, education level, and colors.

    # Example of a categorical variable (factor)
    nivel_escolaridade <- factor(c("Fundamental", "Médio", "Superior", "Médio", "Fundamental"),
                                 levels = c("Fundamental", "Médio", "Superior"), # Sets the order of the levels
                                 ordered = TRUE) # Marks the factor as ordinal
    
    nivel_escolaridade # Prints the values and the levels
    class(nivel_escolaridade) # Checks the class ("ordered" "factor")
    [1] Fundamental Médio       Superior    Médio       Fundamental
    Levels: Fundamental < Médio < Superior
    [1] "ordered" "factor" 
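
Because the factor is ordered, its values can be compared and ranked. The sketch below uses a separate, made-up factor named education with English levels to show a few such operations:

```r
education <- factor(
  c("primary", "secondary", "higher", "secondary"),
  levels = c("primary", "secondary", "higher"),
  ordered = TRUE
)

education >= "secondary"  # FALSE TRUE TRUE TRUE
as.integer(education)     # 1 2 3 2 (position of each value among the levels)
table(education)          # counts per level, in level order
max(education)            # higher
```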

Data Types in R



  4. Text (character): For qualitative textual variables, such as city names and product descriptions.

    # Example of a text variable
    nome_cidade <- "São Paulo"
    descricao_produto <- "Smartphone de última geração com câmera de alta resolução."
    
    nome_cidade # Prints the city name
    descricao_produto # Prints the product description
    class(nome_cidade) # Checks the class of the variable (character)
    class(descricao_produto) # Checks the class of the variable (character)
    [1] "São Paulo"
    [1] "Smartphone de última geração com câmera de alta resolução."
    [1] "character"
    [1] "character"

Data Types in R



  5. Boolean (Logical) (logical): For variables that can be true or false (TRUE or FALSE in R).

    # Example of a boolean (logical) variable
    aprovado <- TRUE
    possui_carteira_motorista <- FALSE
    
    aprovado # Prints the value of aprovado (passed)
    possui_carteira_motorista # Prints the value of possui_carteira_motorista (has a driver's license)
    class(aprovado) # Checks the class of the variable (logical)
    class(possui_carteira_motorista) # Checks the class of the variable (logical)
    [1] TRUE
    [1] FALSE
    [1] "logical"
    [1] "logical"
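
Logical values behave as 1 (TRUE) and 0 (FALSE) in arithmetic, which makes counting and proportions straightforward. A short sketch with a made-up vector named passed:

```r
passed <- c(TRUE, FALSE, TRUE, TRUE)

sum(passed)   # 3: number of TRUE values
mean(passed)  # 0.75: proportion of TRUE values
any(passed)   # TRUE: at least one value is TRUE
all(passed)   # FALSE: not every value is TRUE
```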

Packages

Expanding R’s capabilities



  • An R package is a collection of functions, data, and documentation that extends the capabilities of base R.

    • Base R is the set of functions that are available when you install R.
  • There are thousands of packages available on CRAN, which have been made available by developers from all over the world.

  • To install a package in R, use the following:

install.packages("package_name")
  • If no error appears in the console, it indicates that the package was installed correctly.
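
Reinstalling a package on every run is unnecessary. One possible pattern, sketched here with a hypothetical helper named install_if_missing, is to install only when the package cannot be found:

```r
# Hypothetical helper: installs a package only when it is not yet available
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package can now be loaded
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so this call installs nothing and returns TRUE
ok <- install_if_missing("stats")
print(ok)
```

The same call with "dplyr" would download the package from CRAN the first time only.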

Packages



  • To load a package in R, we use one of the following functions:
# Example of loading a package
library(tidyverse)

or

# Example of loading a package
require(tidyverse)

Once a package is loaded, its functions can be called in two ways. The first is to call them directly by name:

# Example of using a function from a package
iris |>
  filter(Species == "setosa") |>
  head()
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Packages


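The second way is to prefix each function with the name of its package, which also works when the package has not been loaded:

# Example of using a function from a package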
datasets::iris |>
  dplyr::filter(Species == "setosa") |>
  utils::head()
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
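
The :: form can be tried with packages that ship with R, and the same sketch shows one practical difference between library() and require(). The package name notARealPackage123 is made up for illustration:

```r
# Call functions through their namespace without attaching the package
first_rows <- utils::head(datasets::iris, 3)
nrow(first_rows)            # 3
stats::median(c(1, 3, 5))   # 3

# library() stops with an error when a package is missing;
# require() returns FALSE (with a warning) instead.
loaded <- suppressWarnings(
  require("notARealPackage123", character.only = TRUE, quietly = TRUE)
)
loaded                      # FALSE
```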

Basic Operations



  • R supports the basic arithmetic operations: addition, subtraction, multiplication, and division.
# Example of arithmetic operations
1 + 1
[1] 2
8 - 1
[1] 7
10 * 2
[1] 20
35 / 5
[1] 7
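
Beyond the four basic operations, base R also offers exponentiation, remainder, and integer division, and it follows the usual precedence rules. A brief sketch:

```r
2^3          # 8  (exponentiation)
7 %% 3       # 1  (remainder of the division)
7 %/% 3      # 2  (integer division)
1 + 2 * 3    # 7  (multiplication is evaluated before addition)
(1 + 2) * 3  # 9  (parentheses change the order)
```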

Object Creation



  • In R, objects store values, and they can be of different types, such as numbers, strings, vectors, matrices, and data frames, among others.
  • We create new objects using the assignment operators <- or =. The right-assignment operator -> also works, with the value on the left and the object name on the right:
# Example of creating objects
x <- 42/2
y = 47
1+1 -> z
print(x)
[1] 21
print(y)
[1] 47
print(z)
[1] 2

Using Functions



  • R has many built-in functions for mathematical, statistical, and data manipulation operations, among others. They are called as follows:
# Example of using functions
sqrt(16)
[1] 4
log(2.71828)
[1] 0.9999993
  • It is also possible to define custom functions and call them in the same way:
# Example of creating a function
quadrado <- function(x) {
  return(x^2)
}

saida <- quadrado(5)
print(saida)
[1] 25
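
Custom functions can also take default argument values, and because arithmetic in R is vectorized they often work on whole vectors unchanged. A minimal sketch with a hypothetical function named power:

```r
# Hypothetical function: raises x to a power, squaring by default
power <- function(x, exponent = 2) {
  x^exponent
}

power(5)                # 25 (default exponent)
power(2, exponent = 3)  # 8
power(1:3)              # 1 4 9 (applied element by element)
```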

Vectors, Arrays, Lists, and Matrices


  • In R, vectors, arrays, matrices, and lists are data structures that can store multiple values.
  • A vector is a sequence of values of a single type and is created using the c() function.
# Example of creating a vector
x <- c(1, 2, 3, 4, 5)

print(x)
[1] 1 2 3 4 5
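  • An array is a vector with a dimension attribute, so all of its elements share a single type; it is created using the array() function. When values of different types are mixed, R coerces them to a common type, as the example shows: everything becomes character.
# Example of creating an array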
y <- array(c(1, "a", TRUE), dim = c(3, 1))

print(y)
     [,1]  
[1,] "1"   
[2,] "a"   
[3,] "TRUE"
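
The coercion seen above comes from c() and applies to any atomic vector. The sketch below checks it directly and then builds a three-dimensional array of integers (the names mixed and a are illustrative):

```r
mixed <- c(1, "a", TRUE)
class(mixed)   # "character": all values were coerced to text

# A 2 x 3 x 4 array filled with the integers 1 to 24
a <- array(1:24, dim = c(2, 3, 4))
dim(a)         # 2 3 4
a[2, 3, 4]     # 24 (the last element)
a[1, 2, 3]     # 15
```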

Vectors, Arrays, Lists, and Matrices


  • A list can store values of different types (including other lists) and is created using the list() function.
# Example of creating a list
z <- list(1, "a", TRUE)

print(z)
[[1]]
[1] 1

[[2]]
[1] "a"

[[3]]
[1] TRUE
  • Matrices are two-dimensional arrays, that is, vectors whose elements are arranged in rows and columns. They are created using the matrix() function, which fills the values column by column by default.
# Example of creating a matrix
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)

print(m)
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
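
The fill order can be changed with the byrow argument of matrix(). The following sketch compares both orders and applies two common matrix operations (m_col and m_row are illustrative names):

```r
m_col <- matrix(1:6, nrow = 2)                # default: filled by column
m_row <- matrix(1:6, nrow = 2, byrow = TRUE)  # filled by row

m_col[1, ]       # 1 3 5
m_row[1, ]       # 1 2 3
dim(t(m_row))    # 3 2 (transpose swaps rows and columns)

# Matrix product of m_row with its transpose: a 2 x 2 matrix
prod_mat <- m_row %*% t(m_row)
prod_mat
```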

Vectors, Arrays, Lists, and Matrices


  • To access elements of a vector, array, or matrix, we use square brackets []. For lists, [] returns a sub-list, while double brackets [[]] extract a single element.

It is worth noting that indices in R start at 1, not 0 as in many other programming languages.

# Example of accessing vector elements
x <- c(1, 2, 3, 4, 5)

print(x[1])
[1] 1
print(x[3])
[1] 3
# Example of accessing matrix elements
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)

print(m[1, 2])
[1] 3
print(m[2, 3])
[1] 6
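
Lists and vectors support a few more indexing forms than the ones shown above. A sketch with made-up objects (person and v are illustrative names):

```r
person <- list(name = "Ana", age = 30, scores = c(8, 9.5))

person$name           # "Ana" (access by name)
person[["age"]]       # 30 (double brackets extract the element itself)
class(person["age"])  # "list" (single brackets return a sub-list)
person$scores[2]      # 9.5

v <- c(10, 20, 30, 40)
v[c(1, 4)]            # 10 40 (several positions at once)
v[-1]                 # 20 30 40 (negative index drops a position)
v[v > 25]             # 30 40 (logical condition selects elements)
```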

Vectors, Arrays, Lists, and Matrices


Source: https://www.linkedin.com/pulse/trabalhando-com-objetos-r-vetores-matrizes-data-frames-luz-lopes/

Data Manipulation with dplyr



  • dplyr is an R package that provides a grammar of data manipulation; it is very useful for transforming, filtering, and summarizing data.
  • Its functions are easy to use and handle common data manipulation tasks efficiently, so we will rely on this package for the initial stage of data processing.
  • Whenever possible, we will call functions in the form package::function() to avoid conflicts between functions that share a name across packages.

Data Manipulation with dplyr


  • Let's choose a dataset on which to try the dplyr functions (flights, from the nycflights13 package):
dados <- nycflights13::flights

dados |> 
  head(13)
# A tibble: 13 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
11  2013     1     1      558            600        -2      849            851
12  2013     1     1      558            600        -2      853            856
13  2013     1     1      558            600        -2      924            917
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Data Manipulation with dplyr


  • We can filter rows using the filter() function.
dados |> 
  dplyr::filter(month == 11, day == 1) |> 
  head(13)
# A tibble: 13 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013    11     1        5           2359         6      352            345
 2  2013    11     1       35           2250       105      123           2356
 3  2013    11     1      455            500        -5      641            651
 4  2013    11     1      539            545        -6      856            827
 5  2013    11     1      542            545        -3      831            855
 6  2013    11     1      549            600       -11      912            923
 7  2013    11     1      550            600       -10      705            659
 8  2013    11     1      554            600        -6      659            701
 9  2013    11     1      554            600        -6      826            827
10  2013    11     1      554            600        -6      749            751
11  2013    11     1      555            600        -5      847            854
12  2013    11     1      555            600        -5      839            846
13  2013    11     1      555            600        -5      929            943
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Data Manipulation with dplyr


To use filtering effectively, you need to know how to use comparison and logical operators.

Some operators are:

  • Comparison operators:
    • == equal to
    • != not equal to
    • > greater than
    • < less than
    • >= greater than or equal to
    • <= less than or equal to

  • Logical operators:
    • & and
    • | or
    • ! not
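
These operators are vectorized, so they can be tried on a small vector before being used inside filter(). The sketch below also uses >=, <=, and %in%, which are part of base R; the vector x is made up:

```r
x <- c(3, 8, 11)

x > 5              # FALSE TRUE TRUE
x <= 8             # TRUE TRUE FALSE
x != 8             # TRUE FALSE TRUE
x > 5 & x < 10     # FALSE TRUE FALSE (both conditions must hold)
x < 5 | x > 10     # TRUE FALSE TRUE (at least one condition holds)
!(x == 8)          # TRUE FALSE TRUE
x %in% c(3, 11)    # TRUE FALSE TRUE (membership in a set of values)

# De Morgan's law: negating an "or" equals the "and" of the negations
identical(!(x != 8 | x > 10), x == 8 & x <= 10)   # TRUE
```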

Data Manipulation with dplyr


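  • The earlier filter can be written with an explicit &, or equivalently with ! and | (De Morgan's law); both forms return the same rows: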
dados |> 
  dplyr::filter(month == 11 & day == 1) |> 
  head(5)
# A tibble: 5 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013    11     1        5           2359         6      352            345
2  2013    11     1       35           2250       105      123           2356
3  2013    11     1      455            500        -5      641            651
4  2013    11     1      539            545        -6      856            827
5  2013    11     1      542            545        -3      831            855
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
dados |> 
  dplyr::filter(!(month != 11 | day != 1)) |> 
  head(5)
# A tibble: 5 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013    11     1        5           2359         6      352            345
2  2013    11     1       35           2250       105      123           2356
3  2013    11     1      455            500        -5      641            651
4  2013    11     1      539            545        -6      856            827
5  2013    11     1      542            545        -3      831            855
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Data Manipulation with dplyr


  • Missing values can complicate comparisons.
  • NA (“not available”) represents a value that is missing or unknown.
  • Almost any operation involving an unknown value also returns an unknown value.

What is the result of each expression?

NA == NA
[1] NA
NA > 5
[1] NA
10 == NA
[1] NA
NA + 10
[1] NA
  • To check if a value is missing, you can use the is.na() function.
is.na(NA)
[1] TRUE
is.na(10)
[1] FALSE
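
is.na() is vectorized, so it can locate and count missing values in a whole vector. The sketch also shows two logical expressions whose result is known even though one operand is NA (x is a made-up vector):

```r
x <- c(4, NA, 10)

is.na(x)        # FALSE TRUE FALSE
sum(is.na(x))   # 1: number of missing values
x + 1           # 5 NA 11: the NA propagates through arithmetic

# The result is determined regardless of the unknown value
NA & FALSE      # FALSE
NA | TRUE       # TRUE
```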

Data Manipulation with dplyr


  • The filter() function keeps only the rows where the condition is TRUE and discards the rows where it is FALSE or NA. To keep missing values, ask for them explicitly:
dados |> 
  dplyr::filter(is.na(dep_time)) |> 
  head(10)
# A tibble: 10 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1       NA           1630        NA       NA           1815
 2  2013     1     1       NA           1935        NA       NA           2240
 3  2013     1     1       NA           1500        NA       NA           1825
 4  2013     1     1       NA            600        NA       NA            901
 5  2013     1     2       NA           1540        NA       NA           1747
 6  2013     1     2       NA           1620        NA       NA           1746
 7  2013     1     2       NA           1355        NA       NA           1459
 8  2013     1     2       NA           1420        NA       NA           1644
 9  2013     1     2       NA           1321        NA       NA           1536
10  2013     1     2       NA           1545        NA       NA           1910
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
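
To keep the rows that satisfy a condition together with the rows where the value is missing, the two tests can be combined with |. A sketch on a small made-up data frame (df and its columns are illustrative, not part of flights), assuming dplyr is installed:

```r
# Toy data, made up for illustration
df <- data.frame(id = 1:4, delay = c(5, NA, -2, 30))

# Rows with NA in delay are dropped, because the condition evaluates to NA
late <- dplyr::filter(df, delay > 0)
late$id              # 1 4

# Asking for the missing values explicitly keeps them
late_or_unknown <- dplyr::filter(df, is.na(delay) | delay > 0)
late_or_unknown$id   # 1 2 4
```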

Data Manipulation with dplyr


  • Another useful function is arrange(), which sorts the rows of a data frame.
  • Its interface is similar to that of filter(), but instead of selecting rows it reorders them.
dados |> 
  dplyr::arrange(dplyr::desc(dep_time)) |> 
  head(10)
# A tibble: 10 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013    10    30     2400           2359         1      327            337
 2  2013    11    27     2400           2359         1      515            445
 3  2013    12     5     2400           2359         1      427            440
 4  2013    12     9     2400           2359         1      432            440
 5  2013    12     9     2400           2250        70       59           2356
 6  2013    12    13     2400           2359         1      432            440
 7  2013    12    19     2400           2359         1      434            440
 8  2013    12    29     2400           1700       420      302           2025
 9  2013     2     7     2400           2359         1      432            436
10  2013     2     7     2400           2359         1      443            444
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Data Manipulation with dplyr


  • It is possible to sort by more than one column; simply pass more arguments to the arrange() function.
dados |> 
  dplyr::arrange(year, month, day) |> 
  head(10)
# A tibble: 10 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
  • If you want to sort in descending order, simply use the desc() function.
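
Ascending and descending keys can be mixed in a single arrange() call. The sketch below uses a made-up data frame and assumes dplyr is installed; it also shows that missing values are placed at the end, even with desc():

```r
# Toy data, made up for illustration
df <- data.frame(
  group = c("a", "b", "a", "b", "a"),
  value = c(1, 5, 3, NA, 2)
)

# Sort by group (ascending), then by value (descending) within each group
sorted <- dplyr::arrange(df, group, dplyr::desc(value))
sorted$group   # "a" "a" "a" "b" "b"
sorted$value   # 3 2 1 5 NA
```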

Data Manipulation with dplyr


  • The select() function is used to select columns from a data frame.
dados |> 
  dplyr::select(year, month, day) |> 
  head(15)
# A tibble: 15 × 3
    year month   day
   <int> <int> <int>
 1  2013     1     1
 2  2013     1     1
 3  2013     1     1
 4  2013     1     1
 5  2013     1     1
 6  2013     1     1
 7  2013     1     1
 8  2013     1     1
 9  2013     1     1
10  2013     1     1
11  2013     1     1
12  2013     1     1
13  2013     1     1
14  2013     1     1
15  2013     1     1

Data Manipulation with dplyr


  • It is also possible to exclude columns using select().
dados |> 
  dplyr::select(-year, -month, -day) |> 
  head(15)
# A tibble: 15 × 16
   dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
      <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
 1      517            515         2      830            819        11 UA     
 2      533            529         4      850            830        20 UA     
 3      542            540         2      923            850        33 AA     
 4      544            545        -1     1004           1022       -18 B6     
 5      554            600        -6      812            837       -25 DL     
 6      554            558        -4      740            728        12 UA     
 7      555            600        -5      913            854        19 B6     
 8      557            600        -3      709            723       -14 EV     
 9      557            600        -3      838            846        -8 B6     
10      558            600        -2      753            745         8 AA     
11      558            600        -2      849            851        -2 B6     
12      558            600        -2      853            856        -3 B6     
13      558            600        -2      924            917         7 UA     
14      558            600        -2      923            937       -14 UA     
15      559            600        -1      941            910        31 AA     
# ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
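
select() also accepts column ranges, helper functions, and renaming. A sketch on a one-row made-up data frame whose column names mimic those of flights, assuming dplyr is installed:

```r
# Toy data with column names borrowed from flights
df <- data.frame(dep_time = 517, dep_delay = 2, arr_time = 830, carrier = "UA")

names(dplyr::select(df, dplyr::starts_with("dep")))  # "dep_time" "dep_delay"
names(dplyr::select(df, dep_time:arr_time))          # "dep_time" "dep_delay" "arr_time"
names(dplyr::select(df, -carrier))                   # "dep_time" "dep_delay" "arr_time"
names(dplyr::select(df, airline = carrier))          # "airline" (select and rename)
```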

Data Manipulation with dplyr


  • The mutate() function is used to create new columns from existing ones.
dados |> 
  dplyr::mutate(speed = distance / air_time) |> 
  head(15)
# A tibble: 15 × 20
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
11  2013     1     1      558            600        -2      849            851
12  2013     1     1      558            600        -2      853            856
13  2013     1     1      558            600        -2      924            917
14  2013     1     1      558            600        -2      923            937
15  2013     1     1      559            600        -1      941            910
# ℹ 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>, speed <dbl>
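
In flights, distance is recorded in miles and air_time in minutes, so the speed column above is in miles per minute. A sketch of converting it to miles per hour, using made-up values and assuming dplyr is installed; it also shows that mutate() can use a column created earlier in the same call:

```r
# Toy data, made up for illustration (miles and minutes)
df <- data.frame(distance = c(300, 1000), air_time = c(60, 150))

res <- dplyr::mutate(
  df,
  speed     = distance / air_time,  # miles per minute
  speed_mph = speed * 60            # reuses the column created just above
)
res$speed_mph   # 300 400
```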

Data Manipulation with dplyr


  • There is also the summarise() function, which collapses the data into a single row of summary values.
  • This makes it very useful for obtaining descriptive statistics.
dados |> 
  dplyr::summarise(mean_distance = mean(distance), 
                   mean_air_time = mean(air_time)) 
# A tibble: 1 × 2
  mean_distance mean_air_time
          <dbl>         <dbl>
1         1040.            NA
  • Note that the result for air_time was NA. This happens because mean() returns NA whenever its input contains missing values. To ignore them, pass the argument na.rm = TRUE.
dados |> 
  dplyr::summarise(mean_distance = mean(distance), 
                   mean_air_time = mean(air_time, na.rm = TRUE)) 
# A tibble: 1 × 2
  mean_distance mean_air_time
          <dbl>         <dbl>
1         1040.          151.
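
Before dropping missing values it can help to know how many there are. A sketch that reports the count alongside the mean, on a made-up data frame and assuming dplyr is installed:

```r
# Toy data, made up for illustration
df <- data.frame(air_time = c(100, NA, 200, 150))

res <- dplyr::summarise(
  df,
  n_rows        = dplyr::n(),                    # total number of rows
  n_missing     = sum(is.na(air_time)),          # rows with a missing air_time
  mean_air_time = mean(air_time, na.rm = TRUE)   # mean of the observed values
)
res   # n_rows = 4, n_missing = 1, mean_air_time = 150
```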

Data Manipulation with dplyr


  • The group_by() function is used to group data by one or more variables.
dados |> 
  dplyr::group_by(month) |> 
  dplyr::summarise(mean_distance = mean(distance), 
                   mean_air_time = mean(air_time, na.rm = TRUE)) 
# A tibble: 12 × 3
   month mean_distance mean_air_time
   <int>         <dbl>         <dbl>
 1     1         1007.          154.
 2     2         1001.          151.
 3     3         1012.          149.
 4     4         1039.          153.
 5     5         1041.          146.
 6     6         1057.          150.
 7     7         1059.          147.
 8     8         1062.          148.
 9     9         1041.          143.
10    10         1039.          149.
11    11         1050.          155.
12    12         1065.          163.

Data Manipulation with dplyr


  • Counting is also a very common operation; for this, we use the n() function inside summarise().
dados |> 
  dplyr::group_by(month) |> 
  dplyr::summarise(n = dplyr::n())
# A tibble: 12 × 2
   month     n
   <int> <int>
 1     1 27004
 2     2 24951
 3     3 28834
 4     4 28330
 5     5 28796
 6     6 28243
 7     7 29425
 8     8 29327
 9     9 27574
10    10 28889
11    11 27268
12    12 28135
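
dplyr also provides count(), which combines the grouping and counting steps. The sketch below checks on a made-up data frame that both approaches give the same counts, assuming dplyr is installed:

```r
# Toy data, made up for illustration
df <- data.frame(month = c(1, 1, 2, 3, 3, 3))

by_hand <- df |>
  dplyr::group_by(month) |>
  dplyr::summarise(n = dplyr::n())

shortcut <- dplyr::count(df, month)

by_hand$n    # 2 1 3
shortcut$n   # 2 1 3
```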

Data Manipulation with dplyr


We can also group by multiple variables; simply pass more arguments to the group_by() function.

dados |> 
  dplyr::group_by(month, day) |> 
  dplyr::summarise(n = dplyr::n())
# A tibble: 365 × 3
# Groups:   month [12]
   month   day     n
   <int> <int> <int>
 1     1     1   842
 2     1     2   943
 3     1     3   914
 4     1     4   915
 5     1     5   720
 6     1     6   832
 7     1     7   933
 8     1     8   899
 9     1     9   902
10     1    10   932
# ℹ 355 more rows
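
The Groups line in the output above indicates that the result is still grouped by month: by default summarise() removes only the last grouping variable. A sketch of returning an ungrouped result with the .groups argument, on a made-up data frame and assuming dplyr is installed:

```r
# Toy data, made up for illustration
df <- data.frame(month = c(1, 1, 1, 2), day = c(1, 1, 2, 1))

# Default behaviour: the result stays grouped by month
default <- df |>
  dplyr::group_by(month, day) |>
  dplyr::summarise(n = dplyr::n())
dplyr::group_vars(default)   # "month"

# .groups = "drop" removes all grouping from the result
flat <- df |>
  dplyr::group_by(month, day) |>
  dplyr::summarise(n = dplyr::n(), .groups = "drop")
dplyr::group_vars(flat)      # character(0)
flat$n                       # 2 1 1
```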

Data Manipulation with dplyr


  • For data manipulation, we will work with the tidyverse packages whenever possible. They share a common philosophy of data organization and include:

    • dplyr: for data manipulation
    • ggplot2: for data visualization
    • tidyr: for data tidying
    • readr: for data reading
    • purrr: for functional programming, such as mapping and reduction
    • tibble: for a modern version of the data frame
    • stringr: for string manipulation
    • forcats: for factor manipulation




Thank you!


Slide made with quarto