data.table package

Introduction to data.table

Introduce and study data.table form which is enhanced data.frame.
Sample dataset will be a batrips dataset from the bikeshare14 package, which contains anonymous bike share data on bicycle trips around San Francisco in 2014.



What is data table:

  • Enhanced data.frame (Functions used to query data.frames also work on data.tables)
  • columnar data structure
  • every column must be of same length but can be of different type.



Why use data.table?:

  1. Concise & Consistent syntax: think in terms of rows, columns, and groups
  2. Provides a placeholder for each
  3. Feature-rich (Parallelisation, fast updates by reference, powerful joins). For example, list can be easily trasnformed to data frame form.
# install.packages("data.table")
library(data.table)
x_df <- data.frame(id = 1:2, name = c("a", "b"))
x_dt <- data.table(id = 1:2, name = c("a", "b"))
class(x_dt)
## [1] "data.table" "data.frame"
y <- list(id = 1:2, name = c("a", "b"))
x <- as.data.table(y)

# Functions used to query data.frame also work on data.tables.
nrow(x)
## [1] 2
ncol(x)
## [1] 2
dim(x)
## [1] 2 2
  1. A data table never automatically converts character columns to factors, thus preventing bugs by avoiding unexpected behavior.
x_df <- data.frame(id = 1:2, name = c("a", "b"))
class(x_df$name)
## [1] "factor"
x_dt <- data.table(id = 1:2, name = c("a", "b"))
class(x_dt$name)
## [1] "character"
  1. A data table never sets or uses the row names.
rownames(x_df) <- c("R1", "R2")
x_df
##    id name
## R1  1    a
## R2  2    b
rownames(x_dt) <- c("R1", "R2")
x_dt
##    id name
## 1:  1    a
## 2:  2    b


Characteristics of data.table :

  1. data.table automatically limits printing to just the top 5 and bottom 5 rows of the dataset along with column names and row numbers.
  2. General form of data.table syntax: DT[i, j, by]: filter rows in “i”, compute “j”, grouped by “by”. “i” is used to subset or filter rows.
  3. In a data.table, “,” can be omitted.

For example, subseting 3rd and 4th rows from batrips data.frame would be like this:

# install.packages("bikeshare14")
library(bikeshare14)
batrips <- as.data.table(batrips)
head(batrips)
batrips[3:4] #instead of batrips[3:4, ]. Of course, batrips[3:4, ] would work as well.

# Subset everything except first five rows
batrips[-(1:5)] # or batrips[!(1:5)]

# Filter the 1st, 5th and 8th rows
batrips[c(1,5,8)]

# Select all rows except 1 through 4 and 13 through 15
batrips[-c(1:4, 13:15)]

# Select all rows except the first and last
batrips[-c(1, .N)] #.N means the last row.
  1. Speical symbol .N
    .N: integer value that contains the number of rows in the data.table.
library(bikeshare14)
batrips <- as.data.table(batrips)
# it will return the last row.
batrips[.N]
##    trip_id duration          start_date         start_station
## 1:  588914      364 2014-12-31 23:33:00 Embarcadero at Bryant
##    start_terminal            end_date   end_station end_terminal bike_id
## 1:             54 2014-12-31 23:40:00 Howard at 2nd           63      56
##    subscription_type zip_code
## 1:        Subscriber    94105
# Returns all but the last 10 rows
ans <- batrips[1:(.N-10)]; nrow(ans)
## [1] 326329


Logical Expressions:

  • When filtering or subseting certain rows, column names can be omitted.
library(bikeshare14)
batrips <- as.data.table(batrips)
# Instead of batrips[batrips$subscription_type == "Subscriber", ]
batrips[subscription_type == "Subscriber"]
##         trip_id duration          start_date
##      1:  139545      435 2014-01-01 00:14:00
##      2:  139546      432 2014-01-01 00:14:00
##      3:  139547     1523 2014-01-01 00:17:00
##      4:  139558     1600 2014-01-01 00:28:00
##      5:  139575      795 2014-01-01 01:04:00
##     ---                                     
## 277759:  588902      677 2014-12-31 22:09:00
## 277760:  588907      770 2014-12-31 22:51:00
## 277761:  588910      437 2014-12-31 23:18:00
## 277762:  588911      422 2014-12-31 23:19:00
## 277763:  588914      364 2014-12-31 23:33:00
##                                    start_station start_terminal
##      1:                  San Francisco City Hall             58
##      2:                  San Francisco City Hall             58
##      3:                   Embarcadero at Sansome             60
##      4:     Harry Bridges Plaza (Ferry Building)             50
##      5:                    Embarcadero at Folsom             51
##     ---                                                        
## 277759: San Francisco Caltrain (Townsend at 4th)             70
## 277760:                          Townsend at 7th             65
## 277761:                       Powell Street BART             39
## 277762:          Grant Avenue at Columbus Avenue             73
## 277763:                    Embarcadero at Bryant             54
##                    end_date                                   end_station
##      1: 2014-01-01 00:21:00                               Townsend at 7th
##      2: 2014-01-01 00:21:00                               Townsend at 7th
##      3: 2014-01-01 00:42:00                               Beale at Market
##      4: 2014-01-01 00:54:00                             Steuart at Market
##      5: 2014-01-01 01:17:00      San Francisco Caltrain (Townsend at 4th)
##     ---                                                                  
## 277759: 2014-12-31 22:20:00               Grant Avenue at Columbus Avenue
## 277760: 2014-12-31 23:04:00                                 Howard at 2nd
## 277761: 2014-12-31 23:25:00      San Francisco Caltrain (Townsend at 4th)
## 277762: 2014-12-31 23:26:00 Yerba Buena Center of the Arts (3rd @ Howard)
## 277763: 2014-12-31 23:40:00                                 Howard at 2nd
##         end_terminal bike_id subscription_type zip_code
##      1:           65     473        Subscriber    94612
##      2:           65     395        Subscriber    94107
##      3:           56     331        Subscriber    94112
##      4:           74     413        Subscriber    94102
##      5:           70     290        Subscriber    94105
##     ---                                                
## 277759:           73     440        Subscriber    94133
## 277760:           63     677        Subscriber    94107
## 277761:           70     573        Subscriber    95050
## 277762:           68     604        Subscriber    94133
## 277763:           63      56        Subscriber    94105
# Subset rows where start_terminal = 50 and end_terminal is not 60
# Instead of batrips[batrips$start_terminal == 50 & batrips$end_terminal != 60]
batrips[start_terminal == 50 & end_terminal != 60]
##        trip_id duration          start_date
##     1:  139558     1600 2014-01-01 00:28:00
##     2:  139612      324 2014-01-01 08:44:00
##     3:  139627    14747 2014-01-01 10:11:00
##     4:  139628    14676 2014-01-01 10:11:00
##     5:  139652     1018 2014-01-01 11:09:00
##    ---                                     
## 12706:  588674      599 2014-12-31 13:11:00
## 12707:  588673      555 2014-12-31 13:11:00
## 12708:  588723      630 2014-12-31 14:26:00
## 12709:  588838      321 2014-12-31 17:25:00
## 12710:  588837      338 2014-12-31 17:25:00
##                               start_station start_terminal
##     1: Harry Bridges Plaza (Ferry Building)             50
##     2: Harry Bridges Plaza (Ferry Building)             50
##     3: Harry Bridges Plaza (Ferry Building)             50
##     4: Harry Bridges Plaza (Ferry Building)             50
##     5: Harry Bridges Plaza (Ferry Building)             50
##    ---                                                    
## 12706: Harry Bridges Plaza (Ferry Building)             50
## 12707: Harry Bridges Plaza (Ferry Building)             50
## 12708: Harry Bridges Plaza (Ferry Building)             50
## 12709: Harry Bridges Plaza (Ferry Building)             50
## 12710: Harry Bridges Plaza (Ferry Building)             50
##                   end_date                             end_station
##     1: 2014-01-01 00:54:00                       Steuart at Market
##     2: 2014-01-01 08:50:00                         Clay at Battery
##     3: 2014-01-01 14:17:00    Harry Bridges Plaza (Ferry Building)
##     4: 2014-01-01 14:16:00    Harry Bridges Plaza (Ferry Building)
##     5: 2014-01-01 11:26:00 San Francisco Caltrain 2 (330 Townsend)
##    ---                                                            
## 12706: 2014-12-31 13:21:00                          Post at Kearny
## 12707: 2014-12-31 13:20:00                          Post at Kearny
## 12708: 2014-12-31 14:36:00                         2nd at Townsend
## 12709: 2014-12-31 17:30:00     Mechanics Plaza (Market at Battery)
## 12710: 2014-12-31 17:31:00     Mechanics Plaza (Market at Battery)
##        end_terminal bike_id subscription_type zip_code
##     1:           74     413        Subscriber    94102
##     2:           41     448          Customer    81301
##     3:           50     633          Customer         
##     4:           50     631          Customer         
##     5:           69     390        Subscriber    94107
##    ---                                                
## 12706:           47     288        Subscriber    94107
## 12707:           47     575          Customer    94110
## 12708:           61     555        Subscriber    94804
## 12709:           75     635        Subscriber    94108
## 12710:           75     372          Customer    94108



Helper functions for filtering: %like%, %between%, %chin%:

  • %like% allows you to serach for a pattern in a character or a factor vector
  • It can be used independently on a vector as well
library(bikeshare14)
batrips <- as.data.table(batrips)
# subset all rows where start_station starts with San Francisco
# Instead of batrips[grepl("^San Francisco", start_station)],
batrips[start_station %like% "^San Francisco"] # ^ specifies that you are looking for a pattern at the beginning of a string.
##        trip_id duration          start_date
##     1:  139545      435 2014-01-01 00:14:00
##     2:  139546      432 2014-01-01 00:14:00
##     3:  139594      981 2014-01-01 03:43:00
##     4:  139626     1423 2014-01-01 10:09:00
##     5:  139629      737 2014-01-01 10:12:00
##    ---                                     
## 42417:  588867     1092 2014-12-31 18:15:00
## 42418:  588868     1029 2014-12-31 18:16:00
## 42419:  588872     1037 2014-12-31 18:30:00
## 42420:  588898      979 2014-12-31 22:06:00
## 42421:  588902      677 2014-12-31 22:09:00
##                                   start_station start_terminal
##     1:                  San Francisco City Hall             58
##     2:                  San Francisco City Hall             58
##     3:                  San Francisco City Hall             58
##     4:  San Francisco Caltrain 2 (330 Townsend)             69
##     5:  San Francisco Caltrain 2 (330 Townsend)             69
##    ---                                                        
## 42417:                  San Francisco City Hall             58
## 42418:                  San Francisco City Hall             58
## 42419: San Francisco Caltrain (Townsend at 4th)             70
## 42420:                  San Francisco City Hall             58
## 42421: San Francisco Caltrain (Townsend at 4th)             70
##                   end_date                             end_station
##     1: 2014-01-01 00:21:00                         Townsend at 7th
##     2: 2014-01-01 00:21:00                         Townsend at 7th
##     3: 2014-01-01 03:59:00                 San Francisco City Hall
##     4: 2014-01-01 10:33:00 San Francisco Caltrain 2 (330 Townsend)
##     5: 2014-01-01 10:24:00    Harry Bridges Plaza (Ferry Building)
##    ---                                                            
## 42417: 2014-12-31 18:33:00           Powell at Post (Union Square)
## 42418: 2014-12-31 18:33:00           Powell at Post (Union Square)
## 42419: 2014-12-31 18:47:00               Broadway St at Battery St
## 42420: 2014-12-31 22:22:00                         Clay at Battery
## 42421: 2014-12-31 22:20:00         Grant Avenue at Columbus Avenue
##        end_terminal bike_id subscription_type zip_code
##     1:           65     473        Subscriber    94612
##     2:           65     395        Subscriber    94107
##     3:           58     273          Customer    89503
##     4:           69     309        Subscriber    94105
##     5:           50     584        Subscriber    94107
##    ---                                                
## 42417:           71     633          Customer    93245
## 42418:           71     408          Customer    93245
## 42419:           82     613        Subscriber    94133
## 42420:           41     385        Subscriber    94111
## 42421:           73     440        Subscriber    94133
# It can be used independently on a vector as well
x <- c("aaba", "aaba", "baca")
# Search for "aa" anywhere in the string
x %like% "aa"
## [1]  TRUE  TRUE FALSE
# Search for "ba" at the end of a string
x %like% "ba$"
## [1]  TRUE  TRUE FALSE
# Filter all rows where end_station contains "Market"
any_markets <- batrips[end_station %like% "Market"]
# Filter all rows where end_station ends with "Market" 
end_markets <- batrips[end_station %like% "Market$"]
  • %between% works only on numeric columns and allows you to serach for values in the closed interval [val1, val2]
  • Usage: numeric_col %between% c(val1, val2)
library(bikeshare14)
batrips <- as.data.table(batrips)
# Subset all rows where duration is between 2000 and 3000
# instead of batrips[duration >= 2000 & duration <= 3000],
batrips[duration %between% c(2000, 3000)]
##       trip_id duration          start_date
##    1:  139640     2203 2014-01-01 10:38:00
##    2:  139672     2421 2014-01-01 11:41:00
##    3:  139673     2300 2014-01-01 11:44:00
##    4:  139674     2284 2014-01-01 11:44:00
##    5:  139676     2096 2014-01-01 11:47:00
##   ---                                     
## 3596:  587365     2326 2014-12-29 15:48:00
## 3597:  588010     2568 2014-12-30 12:03:00
## 3598:  588009     2569 2014-12-30 12:04:00
## 3599:  588852     2107 2014-12-31 17:37:00
## 3600:  588897     2379 2014-12-31 22:03:00
##                              start_station start_terminal
##    1:                    Steuart at Market             74
##    2:                Washington at Kearney             46
##    3:                Washington at Kearney             46
##    4:                Washington at Kearney             46
##    5:                Washington at Kearney             46
##   ---                                                    
## 3596:        Powell at Post (Union Square)             71
## 3597: Harry Bridges Plaza (Ferry Building)             50
## 3598: Harry Bridges Plaza (Ferry Building)             50
## 3599:                      Townsend at 7th             65
## 3600:                 Paseo de San Antonio              7
##                  end_date            end_station end_terminal bike_id
##    1: 2014-01-01 11:14:00 Embarcadero at Sansome           60     341
##    2: 2014-01-01 12:22:00  Embarcadero at Bryant           54     477
##    3: 2014-01-01 12:22:00  Embarcadero at Bryant           54     367
##    4: 2014-01-01 12:22:00  Embarcadero at Bryant           54     419
##    5: 2014-01-01 12:22:00  Embarcadero at Bryant           54     628
##   ---                                                                
## 3596: 2014-12-29 16:27:00       Davis at Jackson           42     402
## 3597: 2014-12-30 12:46:00 Embarcadero at Sansome           60     453
## 3598: 2014-12-30 12:46:00 Embarcadero at Sansome           60     497
## 3599: 2014-12-31 18:12:00          2nd at Folsom           62     560
## 3600: 2014-12-31 22:42:00   Paseo de San Antonio            7     296
##       subscription_type zip_code
##    1:        Subscriber    94103
##    2:          Customer    94583
##    3:          Customer    90026
##    4:          Customer    90026
##    5:          Customer    91354
##   ---                           
## 3596:        Subscriber    94111
## 3597:          Customer       55
## 3598:          Customer       55
## 3599:        Subscriber    94582
## 3600:          Customer    91387
  • %in% allows selecting rows that exactly matches one or more values.
c("a", "b", "c", "d") %in% c("d", "a")
## [1]  TRUE FALSE FALSE  TRUE
library(bikeshare14)
batrips <- as.data.table(batrips)
# Filter all rows where trip_id is 588841, 139560, or 139562
filter_trip_ids <- batrips[trip_id %in% c(588841, 139560, 139562)]
  • %chin% is similar to %in%, but it is much faster and only for character vectors
  • Usage: character_col %chin% c(“val1”, “val2”, “val3”)
library(bikeshare14)
batrips <- as.data.table(batrips)
# Subset all rows where start_station is "Japantown", "Mezes Park" or "MLK Library"
batrips[start_station %chin% c("Japntown", "Mezes Park", "MLK Library")]
##       trip_id duration          start_date start_station start_terminal
##    1:  140683      426 2014-01-02 16:07:00   MLK Library             11
##    2:  140787      333 2014-01-02 16:51:00   MLK Library             11
##    3:  141476      467 2014-01-03 10:08:00   MLK Library             11
##    4:  141641      246 2014-01-03 12:32:00   MLK Library             11
##    5:  141830      380 2014-01-03 15:56:00   MLK Library             11
##   ---                                                                  
##  999:  588367    34719 2014-12-31 01:08:00   MLK Library             11
## 1000:  588366    34715 2014-12-31 01:08:00   MLK Library             11
## 1001:  588368    34595 2014-12-31 01:10:00   MLK Library             11
## 1002:  588429      692 2014-12-31 07:29:00   MLK Library             11
## 1003:  588430      599 2014-12-31 07:30:00   MLK Library             11
##                  end_date                       end_station end_terminal
##    1: 2014-01-02 16:14:00 San Jose Diridon Caltrain Station            2
##    2: 2014-01-02 16:56:00                  Adobe on Almaden            5
##    3: 2014-01-03 10:15:00              Paseo de San Antonio            7
##    4: 2014-01-03 12:37:00               San Salvador at 1st            8
##    5: 2014-01-03 16:03:00 San Jose Diridon Caltrain Station            2
##   ---                                                                   
##  999: 2014-12-31 10:47:00                       MLK Library           11
## 1000: 2014-12-31 10:47:00                       MLK Library           11
## 1001: 2014-12-31 10:47:00                       MLK Library           11
## 1002: 2014-12-31 07:40:00 San Jose Diridon Caltrain Station            2
## 1003: 2014-12-31 07:40:00 San Jose Diridon Caltrain Station            2
##       bike_id subscription_type zip_code
##    1:     657        Subscriber    94043
##    2:     140        Subscriber    94536
##    3:     182        Subscriber    95035
##    4:     665        Subscriber    95112
##    5:     176        Subscriber    94043
##   ---                                   
##  999:     128          Customer    95112
## 1000:     667          Customer    95112
## 1001:     241          Customer    95076
## 1002:     180        Subscriber    95112
## 1003:     682        Subscriber    95113
# Much faster than batrips[start_station %in% c("Japntown", "Mezes Park", "MLK Library")]

# Filter all rows with specific start stations
two_stations <- batrips[start_station %chin% c("San Francisco City Hall", "Embarcadero at Sansome")]
Avatar
Shawn Kim
Actively seeking for full-time opportunities | Analytics Position

Actively seeking for full-time opportunities | Analytics Position