16/11/2016

While looping is a great way to iterate through vectors and perform computations, explicit loops are not very efficient when we deal with large datasets, often called *Big Data*. For this case, R provides some advanced functions:

- `lapply()` loops over a list and evaluates a function on each element.
- `sapply()` is a simplified version of `lapply()`.
- `apply()` evaluates a function on the boundaries or margins of an array.
- `tapply()` evaluates a function over subsets of a vector.
- `mapply()` is a multivariate version of `lapply()`.

These functions let us traverse the data in a number of ways while avoiding explicit loop constructs. They act on an input list, matrix or array, and apply a named function with one or several optional arguments.

**lapply and sapply**

`lapply()` takes a list and a function as input and evaluates that function over each element of the list. If the input is not a list, it is converted into one using the `as.list()` function before the function is applied, and the output is always a list. It can be faster than an explicit loop because the actual looping is done internally in C code, although the gain is often modest. The following code snippet shows an example:

    data = list(l1 = 1:10, l2 = 1000:1020)
    lapply(data, mean)
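To make the behaviour described above concrete, here is a minimal sketch comparing `lapply()` with the explicit loop it replaces. The vector `x` and the names `roots`, `roots_loop` and `na_mean` are hypothetical and chosen only for illustration.

```r
# Hypothetical input: a plain numeric vector rather than a list
x <- c(1, 4, 9)

# lapply() coerces x with as.list() and always returns a list
roots <- lapply(x, sqrt)

# The equivalent explicit loop, pre-allocating the result list
roots_loop <- vector("list", length(x))
for (i in seq_along(x)) {
  roots_loop[[i]] <- sqrt(x[[i]])
}

identical(roots, roots_loop)  # TRUE

# Extra arguments after the function are passed on to every call
na_mean <- lapply(list(a = c(1, NA, 3)), mean, na.rm = TRUE)
```

Both versions produce the same list; the `lapply()` form simply removes the bookkeeping of indices and pre-allocation.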

`sapply()` is similar to `lapply()` except that it tries to simplify the result wherever possible. If every element of the result has length 1, it returns a *vector*; if every element has the same length greater than 1, it returns a *matrix*; and if it cannot simplify the result, we get the same list that `lapply()` would return. The following example illustrates this:

    data = list(l1 = 1:10, l2 = runif(10), l3 = rnorm(10,2))
    lapply(data, mean)
    sapply(data, mean)
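The three simplification cases can be seen side by side in a small sketch. The input list and the variable names below are assumptions made for illustration, with fixed values so the shapes are easy to check.

```r
# Hypothetical input with fixed values
data <- list(a = 1:4, b = 5:8)

# Every result has length 1: simplified to a named numeric vector
means <- sapply(data, mean)

# Every result has length 2: simplified to a 2 x 2 matrix, one column per element
ranges <- sapply(data, range)

# Results of differing lengths: no simplification, so a list is returned
evens <- sapply(data, function(x) x[x %% 2 == 0 & x > 2])

is.vector(means)   # TRUE
is.matrix(ranges)  # TRUE
is.list(evens)     # TRUE
```

Because the return type depends on the data, `vapply()` is often preferred in non-interactive code, since it requires the expected result type to be stated up front.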

**apply**

The `apply()` function is used to evaluate a function over the margins or boundaries of an array, for instance, applying *aggregate functions* on the rows (1), columns (2) or both (1:2) of an array. By *both*, we mean *apply the function to each individual value*.

    data = matrix(rnorm(20), nrow=5, ncol=4)
    # row sums
    apply(data, 1, sum)
    # row means
    apply(data, 1, mean)
    # col sums
    apply(data, 2, sum)
    # col means
    apply(data, 2, mean)
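As a sketch of what the margin argument does, the following uses a small fixed matrix (a hypothetical `m`, not the random `data` above) so that the results can be checked by hand.

```r
# Hypothetical 2 x 3 matrix; its columns are (1, 2), (3, 4) and (5, 6)
m <- matrix(1:6, nrow = 2, ncol = 3)

row_totals <- apply(m, 1, sum)   # 9 12
col_totals <- apply(m, 2, sum)   # 3 7 11

# MARGIN = c(1, 2) calls the function on every cell and keeps the matrix shape
squared <- apply(m, c(1, 2), function(v) v^2)

# For plain sums, the vectorised helpers give the same numbers
all(row_totals == rowSums(m))    # TRUE
all(col_totals == colSums(m))    # TRUE
```

For simple sums and means, `rowSums()`, `colSums()`, `rowMeans()` and `colMeans()` are usually the more direct choice; `apply()` is most useful when the function is arbitrary.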

Let's see how many negative numbers each column has, using `apply()`:

    apply(data, 2, function(x) length(x[x<0]))
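The same count can be written a little more compactly by summing a logical vector, since `TRUE` counts as 1. This is a sketch on a hypothetical fixed matrix `m`, so the expected counts are known in advance.

```r
# Hypothetical matrix; its columns are (-1, 2, -3) and (4, 5, -6)
m <- matrix(c(-1, 2, -3, 4, 5, -6), nrow = 3)

# Count negatives per column by summing the logical test
neg_apply <- apply(m, 2, function(x) sum(x < 0))   # 2 1

# Fully vectorised alternative: compare the whole matrix, then sum columns
neg_colsums <- colSums(m < 0)                      # 2 1
```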

**tapply**

The function `tapply()` is used to evaluate a function over subsets of a vector, where the subsets are defined by a grouping factor. If you are familiar with relational databases, this is similar to the `GROUP BY` construct in SQL. The following example illustrates this:

    data = c(1:10, rnorm(10,2), runif(10))
    # gl(3,10) builds a factor with 3 levels, each repeated 10 times
    groups = gl(3,10)
    tapply(data, groups, mean)
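To show what the grouping amounts to, here is a minimal sketch on fixed values that compares `tapply()` with the equivalent split-then-apply steps. The names `scores` and `team` are hypothetical.

```r
# Hypothetical data: six scores belonging to two teams
scores <- c(10, 20, 30, 1, 2, 3)
team <- factor(c("a", "a", "a", "b", "b", "b"))

# Group means in one call: a = 20, b = 2
by_team <- tapply(scores, team, mean)

# The same grouping spelled out: split() into a list, then sapply()
by_team_split <- sapply(split(scores, team), mean)

all(by_team == by_team_split)  # TRUE
```

The two results hold the same values; `tapply()` returns them as a one-dimensional array indexed by the factor levels.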

**mapply**

The `mapply()` function is a multivariate version of `lapply()`. It evaluates a function over several sets of arguments in step: the first call uses the first element of each argument, the second call uses the second elements, and so on. As a simple example, if we have to build a list of vectors using the `rep()` function, we would normally have to write the call multiple times. With `mapply()` we can achieve the same result more elegantly, as illustrated next:

    list(rep(1,4), rep(2,3), rep(3,2), rep(4,1))
    mapply(rep, 1:4, 4:1)
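The element-wise pairing can be sketched with a few more cases. The anonymous functions and variable names below are assumptions for illustration; `MoreArgs` is the standard `mapply()` argument for values shared by every call.

```r
# Results of differing lengths stay a list: lengths 4, 3, 2, 1
reps <- mapply(rep, 1:4, 4:1)

# Results of length 1 are simplified to a vector: 1+4, 2+5, 3+6
sums <- mapply(function(x, y) x + y, 1:3, 4:6)

# MoreArgs carries arguments that are the same for every call
labels <- mapply(function(name, n, sep) paste(name, n, sep = sep),
                 c("a", "b"), c(1, 2),
                 MoreArgs = list(sep = "-"))
```

Here `sums` holds 5, 7 and 9, and `labels` holds "a-1" and "b-2". Passing `SIMPLIFY = FALSE` keeps the result as a list in every case, which makes the return type predictable.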