While looping is a great way to iterate through vectors and perform computations, explicit loops become inefficient and verbose when we deal with very large datasets, or what is known as Big Data. For such cases, R provides a family of more advanced functions:
lapply() loops over a list and evaluates a function on each element.
sapply() is a version of lapply() that simplifies its result where possible.
apply() evaluates a function over the margins (for example, the rows or columns) of an array.
tapply() evaluates a function over subsets of a vector.
mapply() is a multivariate version of lapply().
These functions let us traverse the data in a number of ways while avoiding explicit loop constructs. They act on an input list, matrix, or array and apply a named or anonymous function, with one or several optional arguments.
lapply and sapply
lapply() takes a list and a function as input and evaluates that function over each element of the list. If the input is not a list, it is coerced to one with the as.list() function before the function is applied. The result is always a list. The looping itself is done internally in C code, which can make it faster than an equivalent loop written in R. The following code snippet shows an example:
data = list(l1 = 1:10, l2 = 1000:1020)
lapply(data, mean)
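As a minimal sketch of the coercion step described above, the following passes a plain vector instead of a list and compares the result with an explicit loop. The names `squares` and `by_loop` are illustrative and do not come from the original text.

```r
# A plain integer vector is not a list, so lapply() coerces it with as.list()
squares <- lapply(1:3, function(x) x^2)

# The equivalent explicit loop, with the result list preallocated
by_loop <- vector("list", 3)
for (i in 1:3) {
  by_loop[[i]] <- i^2
}

is.list(squares)             # the result of lapply() is always a list
identical(squares, by_loop)  # both approaches produce the same values
```

The lapply() version says the same thing in one line and needs no bookkeeping for the index or the result container.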
sapply() is similar to lapply() except that it tries to simplify the result wherever possible. If every element of the result has length 1, it returns a vector. If every element has the same length greater than 1, it returns a matrix. If it cannot simplify the result, we get the same list that lapply() would return. The following example illustrates this:
data = list(l1 = 1:10, l2 = runif(10), l3 = rnorm(10,2))
lapply(data, mean)
sapply(data, mean)
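The three simplification cases can be seen side by side with a small fixed input. This is only a sketch; the list `x` and the names `totals`, `ranges`, and `uneven` are made up for illustration.

```r
x <- list(a = 1:3, b = 4:6)

# Every result has length 1: sapply() returns a named vector
totals <- sapply(x, sum)
totals

# Every result has the same length (> 1): sapply() returns a matrix
# with one column per input element
ranges <- sapply(x, range)
dim(ranges)

# Results have different lengths: no simplification, so a list comes back
uneven <- sapply(x, function(v) v[v > 2])
is.list(uneven)
```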
The apply() function evaluates a function over the margins of an array, for instance applying aggregate functions to the rows (1), the columns (2), or both (1:2) of a matrix. With both margins, the function is applied to each individual value.
data = matrix(rnorm(20), nrow=5, ncol=4)
# row sums
apply(data, 1, sum)
# row means
apply(data, 1, mean)
# col sums
apply(data, 2, sum)
# col means
apply(data, 2, mean)
Let's see how many negative numbers each column has, using apply() with an anonymous function:
apply(data, 2, function(x) length(x[x<0]))
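Because the matrix above is random, its output changes on every run. A small fixed matrix makes the effect of each margin value easier to check by hand. The matrix `m` and the result names here are hypothetical, chosen only for this sketch.

```r
# A fixed 2 x 3 matrix, filled column by column: rows are (1, 3, 5) and (2, 4, 6)
m <- matrix(1:6, nrow = 2, ncol = 3)

row_sums <- apply(m, 1, sum)                    # one value per row
col_sums <- apply(m, 2, sum)                    # one value per column
scaled   <- apply(m, 1:2, function(x) x * 10)   # function is called once per cell

row_sums
col_sums
dim(scaled)   # the per-cell result keeps the 2 x 3 shape
```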
tapply() evaluates a function over subsets of a vector, where the subsets are defined by a grouping factor. This is similar to the GROUP BY construct in SQL, if you are familiar with relational databases. The following example illustrates this:
data = c(1:10, rnorm(10,2), runif(10))
groups = gl(3,10)
tapply(data, groups, mean)
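To make the GROUP BY analogy concrete, here is a small sketch with fixed values. The vectors `sales` and `region` are hypothetical data invented for illustration.

```r
# Hypothetical data: six sales figures, each tagged with a region label
sales  <- c(10, 20, 30, 40, 50, 60)
region <- factor(c("east", "west", "east", "west", "east", "west"))

# Roughly: SELECT region, AVG(sales) ... GROUP BY region
by_region <- tapply(sales, region, mean)
by_region

# The same grouping spelled out: split the vector by group,
# then summarise each piece
by_split <- sapply(split(sales, region), mean)
by_split
```

Here the factor plays the role of the GROUP BY column and the function plays the role of the aggregate.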
The mapply() function is a multivariate version of lapply(). It evaluates a function over several sets of arguments in parallel, calling it with the first element of each argument, then the second, and so on. For example, to build a list of vectors with the rep() function, we would normally have to write out each call. With mapply() we can achieve the same result more elegantly, as illustrated next:
list(rep(1,4), rep(2,3), rep(3,2), rep(4,1))
mapply(rep, 1:4, 4:1)
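The following sketch checks that the two forms above agree and shows how an argument that stays fixed across calls can be supplied through the MoreArgs parameter of mapply(). The variable names and the small paste() helper are illustrative assumptions, not part of the original example.

```r
# mapply() pairs up the i-th elements of its arguments:
# rep(1, 4), rep(2, 3), rep(3, 2), rep(4, 1)
by_hand   <- list(rep(1, 4), rep(2, 3), rep(3, 2), rep(4, 1))
by_mapply <- mapply(rep, 1:4, 4:1)

lengths(by_mapply)             # length of each generated vector
all.equal(by_hand, by_mapply)  # compares values, tolerating integer vs. double

# Arguments that are the same for every call go in MoreArgs
tags <- mapply(function(x, n, sep) paste(rep(x, n), collapse = sep),
               c("a", "b"), 2:1, MoreArgs = list(sep = "-"))
tags
```

Since the results of rep() have different lengths here, mapply() cannot simplify them and returns a list, just as sapply() would.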