Apply family in R: avoiding loops on data

Science, 16.11.2016


Looping is a natural way to iterate through vectors and perform computations, but explicit loops can become slow and verbose when we deal with large data sets, or what is known as Big Data. For these cases, R provides a family of higher-level functions:

  • lapply() loops over a list and evaluates a function on each element, always returning a list.
  • sapply() is a variant of lapply() that simplifies the result to a vector or matrix where possible.
  • apply() evaluates a function over the margins (rows, columns or both) of an array.
  • tapply() evaluates a function over subsets of a vector.
  • mapply() is a multivariate version of sapply(): it walks over several arguments at once.

These functions let us traverse the data in a number of ways without writing explicit loop constructs. They act on an input list, vector, matrix or array, and apply a function (named or anonymous) to it, optionally passing one or several extra arguments along.
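To make the contrast with an explicit loop concrete, here is a minimal sketch that computes the same result both ways. The list `values` and the variable names are made up for illustration and are not part of the original examples.

```r
# Hypothetical input: a named list of numeric vectors
values <- list(a = 1:5, b = c(2.5, 3.5), c = 10)

# Explicit loop: allocate the result, then fill it element by element
loop_result <- vector("list", length(values))
names(loop_result) <- names(values)
for (name in names(values)) {
  loop_result[[name]] <- sum(values[[name]])
}

# Same computation with lapply(): no allocation or index bookkeeping
apply_result <- lapply(values, sum)

# Both approaches produce the same named list
identical(loop_result, apply_result)
```

The lapply() version states what is computed (the sum of each element) and leaves the iteration to R.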

lapply and sapply

lapply() takes a list and a function as input and evaluates that function over each element of the list. If the input is not a list, it is first converted into one using the as.list() function. The result is always a list of the same length as the input. The actual looping is done internally in C code, which is why lapply() is often faster than an equivalent hand-written loop. The following snippet shows a simple example:

data = list(l1 = 1:10, l2 = 1000:1020)
lapply(data, mean)
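Two details mentioned above are easy to check directly: the coercion of non-list input, and the passing of extra arguments to the function. The sketch below uses made-up inputs (`squares`, `scores`) purely for illustration.

```r
# A plain integer vector is not a list; lapply() coerces it with as.list()
squares <- lapply(1:3, function(x) x^2)
is.list(squares)   # the result is a list, one element per input value
length(squares)

# Arguments placed after the function are passed on to every call,
# here na.rm = TRUE so that missing values are ignored by mean()
scores <- list(first = c(1, NA, 3), second = c(4, 5, NA))
means <- lapply(scores, mean, na.rm = TRUE)
means
```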

sapply() is similar to lapply(), except that it tries to simplify the result wherever possible. If every element of the result has length 1, it returns a vector. If every element has the same length greater than 1, it returns a matrix. If the result cannot be simplified, we get the same list that lapply() would return. The following example compares the two functions:

data = list(l1 = 1:10, l2 = runif(10), l3 = rnorm(10,2))

lapply(data, mean)
sapply(data, mean)
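The three simplification cases can be seen side by side in a short sketch. The anonymous function that keeps only even numbers is a made-up example, chosen because it returns results of different lengths.

```r
data <- list(l1 = 1:10, l2 = 1000:1020)

# Every result has length 1: sapply() returns a named numeric vector
means <- sapply(data, mean)

# Every result has the same length (2): sapply() returns a matrix
# with one column per list element
ranges <- sapply(data, range)

# Results have different lengths: nothing to simplify, a list comes back
evens <- sapply(data, function(x) x[x %% 2 == 0])

class(means)
dim(ranges)
class(evens)
```

Because the return type depends on the data, it is worth checking the shape of the result before relying on it in later code.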

apply

The apply() function evaluates a function over the margins of an array. Its second argument selects the margin: 1 applies the function to each row, 2 to each column, and 1:2 to both. By both, we mean that the function is applied to each individual value. A typical use is computing aggregates such as sums or means per row or per column of a matrix.

data = matrix(rnorm(20), nrow=5, ncol=4)

# row sums
apply(data, 1, sum)

# row means
apply(data, 1, mean)

# col sums
apply(data, 2, sum)

# col means
apply(data, 2, mean)

Let's use apply() with an anonymous function to see how many negative numbers each column has:

apply(data, 2, function(x) length(x[x<0]))
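The examples above use margins 1 and 2 on random data. The sketch below adds the third case, margin 1:2, on a small fixed matrix so that the results can be checked by hand. The matrix `m` is a made-up input.

```r
# A small fixed matrix:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
m <- matrix(1:6, nrow = 2, ncol = 3)

# MARGIN = 1: one result per row
row_sums <- apply(m, 1, sum)

# MARGIN = 2: one result per column
col_sums <- apply(m, 2, sum)

# MARGIN = 1:2: the function sees each cell on its own,
# so the result keeps the shape of the input
doubled <- apply(m, 1:2, function(x) x * 2)

row_sums
col_sums
doubled
```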

tapply

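The tapply() function evaluates a function over subsets of a vector, where the subsets are defined by a grouping factor of the same length. If you are familiar with relational databases, this is similar to the GROUP BY construct in SQL. In the following example, gl(3, 10) generates a factor with three levels, each repeated ten times, so the mean is computed separately for each block of ten values: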

data = c(1:10, rnorm(10,2), runif(10))

groups = gl(3,10)

tapply(data, groups, mean)
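To see what the grouping does, it can help to spell the same computation out by hand. This is a minimal sketch with made-up values and labels; split() followed by sapply() is used here only as an illustration of the idea, not as a description of how tapply() is implemented.

```r
# Hypothetical data: six measurements in two labelled groups
values <- c(10, 20, 30, 1, 2, 3)
groups <- factor(c("a", "a", "a", "b", "b", "b"))

# One mean per group, comparable to AVG(values) ... GROUP BY groups in SQL
by_group <- tapply(values, groups, mean)

# The same numbers, step by step: split the vector by group,
# then apply mean() to each piece
by_hand <- sapply(split(values, groups), mean)

by_group
by_hand
```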

mapply

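The mapply() function is a multivariate version of sapply(). It evaluates a function over several sets of arguments at once: the first call receives the first element of each argument, the second call receives the second elements, and so on. Here "in parallel" refers to walking the arguments side by side, not to parallel computing. As a simple example, if we want to build a list of vectors using the rep() function, we have to write the call several times. With mapply() we can achieve the same result more elegantly: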

list(rep(1,4), rep(2,3), rep(3,2), rep(4,1))
mapply(rep, 1:4, 4:1)
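The sketch below checks that the two forms agree and shows what happens when the results can be simplified. The integer literals (`1L` and so on) are used in the hand-written list only because 1:4 is an integer vector; the adding function is a made-up example.

```r
# mapply() calls rep(1, 4), rep(2, 3), rep(3, 2) and rep(4, 1) in turn
by_mapply <- mapply(rep, 1:4, 4:1)

# The hand-written equivalent, with integer values to match 1:4
by_hand <- list(rep(1L, 4), rep(2L, 3), rep(3L, 2), rep(4L, 1))

# The results have different lengths, so mapply() returns a list
is.list(by_mapply)
all.equal(by_mapply, by_hand)

# When every call returns a single value, the default SIMPLIFY = TRUE
# collapses the result into a plain vector
sums <- mapply(function(x, y) x + y, 1:3, c(10, 20, 30))
sums
```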