en.proft.me - Blog about Linux, Python, Vim and healthy lifestylehttp://en.proft.me/feed/category/en.proft.me - Blog about Linux, Python, Vim and healthy lifestyle: latest posts by categoriesenTue, 14 Mar 2017 00:00:00 +0200How to convert a factor to numeric in Rhttp://en.proft.me/2017/03/14/how-convert-factor-numeric-r/<p>Sometimes you need to explicitly convert factors to either text or numbers. To do this, you use the functions <code>as.character()</code> or <code>as.numeric()</code>. First, convert your data vector into a factor or use existed factors <code>iris$Species</code> from <em>iris</em> dataset.</p>
<pre class="prettyprint">
v = c("North", "East", "South", "South")
vf <- factor(v)
</pre>
<p>Use <code>as.character()</code> to convert a factor to a character vector</p>
<pre class="prettyprint">
as.character(vf)
</pre>
<p>Use <code>as.numeric()</code> to convert a factor to a numeric vector. Note that this will return the numeric codes that correspond to the factor levels. </p>
<pre class="prettyprint">
as.numeric(vf)
# or from iris dataset
as.numeric(factor(iris$Species))
</pre>
<p>Be very careful when you convert factors with numeric levels to a numeric vector. The results may not be what you expect.</p>
<p>For example, imagine you have a vector that indicates some test score results with the values <code>c(9, 8, 10, 8, 9)</code>, which you convert to a factor</p>
<pre class="prettyprint">
numbers = factor(c(9, 8, 10, 8, 9))
</pre>
<p>To look at the internal representation of numbers, use <code>str()</code></p>
<pre class="prettyprint">
str(numbers)
</pre>
<p>This indicates that R stores the values as <code>c(2, 1, 3, 1, 2)</code> with associated levels of <code>c("8", "9", "10")</code>.</p>Tue, 14 Mar 2017 00:00:00 +0200http://en.proft.me/2017/03/14/how-convert-factor-numeric-r/Density-based clustering in Rhttp://en.proft.me/2017/02/3/density-based-clustering-r/<p><img src="http://en.proft.me/media/science/r_dbc_logo.jpg" alt="Density-based clustering in R" class="right" width="190">
<b class="lb">Introduction</b></p>
<p>Clustering is a multivariate analysis used to group similar objects (close in terms of distance) together in the same group (cluster). Unlike <a href="http://en.proft.me/2015/12/24/types-machine-learning-algorithms/">supervised learning</a> methods (for example, classification and regression) a clustering analysis does not use any label information, but simply uses the similarity between data features to group them into clusters.</p>
<p>Clustering does not refer to specific algorithms but it’s a process to create groups based on similarity measure. Clustering analysis use <a href="http://en.proft.me/2015/12/24/types-machine-learning-algorithms/">unsupervised learning</a> algorithm to create clusters.</p>
<p>Clustering algorithms generally work on simple principle of maximization of intracluster similarities and minimization of intercluster similarities. The measure of similarity determines how the clusters need to be formed.</p>
<p>Similarity is a characterization of the ratio of the number of attributes two objects share in common compared to the total list of attributes between them. Objects which have everything in common are identical, and have a similarity of 1.0. Objects which have nothing in common have a similarity of 0.0.</p>
<p>Clustering can be widely adapted in the analysis of businesses. For example, a marketing department can use clustering to segment customers by personal attributes. As a result of this, different marketing campaigns targeting various types of customers can be designed.</p>
<p>Clustering model is a notion used to signify what kind of clusters we are trying to identify. The four most common models of clustering methods are hierarchical clustering, k-means clustering, model-based clustering, and density-based clustering:</p>
<ul>
<li><a href="http://en.proft.me/2017/01/29/exploring-hierarchical-clustering-r/">Hierarchical clustering</a>. It creates a hierarchy of clusters, and presents the hierarchy in a dendrogram. This method does not require the number of clusters to be specified at the beginning. Distance connectivity between observations is the measure.</li>
<li><a href="http://en.proft.me/2016/06/18/exploring-k-means-clustering-analysis-r/">k-means clustering</a>. It is also referred to as flat clustering. Unlike hierarchical clustering, it does not create a hierarchy of clusters, and it requires the number of clusters as an input. However, its performance is faster than hierarchical clustering. Distance from mean value of each observation/cluster is the measure.</li>
<li><a href="http://en.proft.me/2017/02/1/model-based-clustering-r/">Model-based clustering</a> (or <em>Distribution models</em>). Both hierarchical clustering and k-means clustering use a heuristic approach to construct clusters, and do not rely on a formal model. Model-based clustering assumes a data model and applies an EM algorithm to find the most likely model components and the number of clusters. Significance of statistical distribution of variables in the dataset is the measure.</li>
<li><a href="http://en.proft.me/2017/02/3/density-based-clustering-r/">Density-based clustering</a>. It constructs clusters in regard to the density measurement. Clusters in this method have a higher density than the remainder of the dataset. Density in data space is the measure.</li>
</ul>
<p>A good clustering algorithm can be evaluated based on two primary objectives:</p>
<ul>
<li>High intra-class similarity</li>
<li>Low inter-class similarity</li>
</ul>
<p>Density-based clustering uses the idea of <em>density reachability</em> and <em>density connectivity</em> (as an alternative to <em>distance measurement</em>), which makes it very useful in discovering a cluster in nonlinear shapes. This method finds an area with a higher density than the remaining area. One of the most famous methods is <em>Density-Based Spatial Clustering of Applications with Noise</em> (<em>DBSCAN</em>). It uses the concept of <em>density reachability</em> and <em>density connectivity</em>.</p>
<ul>
<li><em>Density reachability</em>. A point <em>p</em> is said to be density reachable from a point <em>q</em> if point <em>p</em> is within <em>e</em> distance from point <em>q</em> and <em>q</em> has sufficient number of points in its neighbors which are within distance <em>e</em>.</li>
<li><em>Density connectivity</em>. A point <em>p</em> and <em>q</em> are said to be density connected if there exist a point <em>r</em> which has sufficient number of points in its neighbors and both the points <em>p</em> and <em>q</em> are within the <em>e</em> distance. This is chaining process. So, if <em>q</em> is neighbor of <em>r</em>, <em>r</em> is neighbor of <em>s</em>, <em>s</em> is neighbor of <em>t</em> which in turn is neighbor of <em>p</em> implies that <em>q</em> is neighbor of <em>p</em>.</li>
</ul>
<p>The Density-based clustering algorithm DBSCAN is a fundamental data clustering technique for finding arbitrary shape clusters as well as for detecting outliers.</p>
<p>Unlike K-Means, DBSCAN does not require the number of clusters as a parameter. Rather it infers the number of clusters based on the data, and it can discover clusters of arbitrary shape (for comparison, K-Means usually discovers spherical clusters). Partitioning methods (K-means, PAM clustering) and hierarchical clustering are suitable for finding <em>spherical-shaped clusters</em> or <em>convex clusters</em>. In other words, they work well for compact and well separated clusters. Moreover, they are also severely affected by the presence of noise and outliers in the data.</p>
<p>Advantages of Density-based clustering are</p>
<ol>
<li>No assumption on the number of clusters. The number of clusters is often unknown in advance. Furthermore, in an evolving data stream, the number of natural clusters is often changing.</li>
<li>Discovery of clusters with arbitrary shape. This is very important for many data stream applications.</li>
<li>Ability to handle outliers (resistant to noise).</li>
</ol>
<p>Disadvantages of Density-based clustering are</p>
<ol>
<li>If there is variation in the density, noise points are not detected</li>
<li>Sensitive to parameters i.e. hard to determine the correct set of parameters.</li>
<li>The quality of DBSCAN depends on the distance measure.</li>
<li>DBSCAN cannot cluster data sets well with large differences in densities.</li>
</ol>
<p>This algorithm works on a parametric approach. The two parameters involved in this algorithm are:</p>
<ul>
<li><em>e</em> (<em>eps</em>) is the radius of our neighborhoods around a data point <em>p</em>.</li>
<li><em>minPts</em> is the minimum number of data points we want in a neighborhood to define a cluster.</li>
</ul>
<p>One limitation of DBSCAN is that it is sensitive to the choice of <em>e</em>, in particular if clusters have different densities. If <em>e</em> is too small, sparser clusters will be defined as noise. If <em>e</em> is too large, denser clusters may be merged together.</p>
<div style="text-align:center;">
<img alt="r_dbc_concept.png" src="http://en.proft.me/media/science/r_dbc_concept.png"><br/>
</div>
<p>Once these parameters are defined, the algorithm divides the data points into three points:</p>
<ul>
<li><em>Core points</em>. A point <em>p</em> is a core point if at least <em>minPts</em> points are within distance <em>l</em> (<em>l</em> is the maximum radius of the neighborhood from <em>p</em>) of it (including <em>p</em>).</li>
<li><em>Border points</em>. A point <em>q</em> is border from <em>p</em> if there is a path <i>p<sup>1</sup></i>, <em>...</em>, <i>p<sup>n</sup></i> with <i>p<sup>1</sup> = p</i> and <i>p<sup>n</sup> = q</i>, where each <i>p<sup>i+1</sup></i> is directly reachable from <i>p<sup>i</sup></i> (all the points on the path must be core points, with the possible exception of <em>q</em>).</li>
<li><em>Outliers</em>. All points not reachable from any other point are outliers.</li>
</ul>
<div style="text-align:center;">
<img alt="r_dbc_points.png" src="http://en.proft.me/media/science/r_dbc_points.png"><br/>
</div>
<p>The steps in <em>DBSCAN</em> are simple after defining the previous steps:</p>
<ol>
<li>Pick at random a point which is not assigned to a cluster and calculate its neighborhood. If, in the neighborhood, this point has <em>minPts</em> then make a cluster around that; otherwise, mark it as outlier.</li>
<li>Once you find all the core points, start expanding that to include border points.</li>
<li>Repeat these steps until all the points are either assigned to a cluster or to an outlier.</li>
</ol>
<p><b class="lb">Modeling in R</b></p>
<p>In the following part, we will demonstrate how to use DBSCAN to perform density-based clustering.</p>
<pre class="prettyprint">
# install.packages("dbscan")
library(dbscan)
</pre>
<p><strong>Determing optimal value of e (eps)</strong>. First of all, we should define the <a href="http://www.sthda.com/english/wiki/dbscan-density-based-clustering-for-discovering-clusters-in-large-datasets-with-noise-unsupervised-machine-learning">optimal value of e</a>. The method proposed here consists of computing the he <em>k-nearest neighbor distances</em> in a matrix of points.</p>
<p>The idea is to calculate, the average of the distances of every point to its <em>k</em> nearest neighbors. The value of <em>k</em> will be specified by the user and corresponds to <em>MinPts</em>.</p>
<p>Next, these <em>k</em>-distances are plotted in an ascending order. The aim is to determine the <em>knee”</em> which corresponds to the optimal <em>e</em> parameter.</p>
<p>A <em>knee</em> corresponds to a threshold where a sharp change occurs along the <em>k</em>-distance curve.</p>
<p>The function <code>kNNdistplot()</code> can be used to draw the <em>k</em>-distance plot.</p>
<pre class="prettyprint">
iris_matrix <- as.matrix(iris[, -5])
kNNdistplot(iris_matrix, k=4)
abline(h=0.4, col="red")
</pre>
<div style="text-align:center;">
<img alt="r_dbc_optimal_e.png" src="http://en.proft.me/media/science/r_dbc_optimal_e.png" width="90%"><br/>
</div>
<p>It can be seen that the optimal <em>e</em> value is around a distance of 0.4.</p>
<pre class="prettyprint">
set.seed(1234)
db = dbscan(iris_matrix, 0.4, 4)
db
</pre>
<p>The result shows DBSCAN has found <em>four</em> clusters and assigned <em>25</em> cases as noise/outliers.</p>
<p>Display the hull plot</p>
<pre class="prettyprint">
hullplot(iris_matrix, db$cluster)
</pre>
<div style="text-align:center;">
<img alt="r_dbc_hull.png" src="http://en.proft.me/media/science/r_dbc_hull.png" width="90%"><br/>
</div>
<p>The hull plot shows the separation is good, so we can play around with the parameters to get more generalized or specialized clusters. Black points are outliers.</p>
<p>You can play with <em>e</em> (neighborhood size) and <em>MinPts</em> (minimum of points needed for core cluster).</p>
<p>Compare amount of the data within each cluster</p>
<pre class="prettyprint">
table(iris$Species, db$cluster)
</pre>
<p><b class="lb">Comparing clustering methods</b></p>
<p>After fitting data into clusters using different clustering methods, you may wish to measure the accuracy of the clustering. In most cases, you can use either <em>intracluster</em> or <em>intercluster</em> metrics as measurements. The higher the intercluster distance, the better it is, and the lower the intracluster distance, the better it is.</p>
<p>We now introduce how to compare different clustering methods using <code>cluster.stat</code> from the <code>fpc</code> package.</p>
<p>Perform the following steps to compare clustering methods.</p>
<p>First, install and load the <code>fpc</code> package</p>
<pre class="prettyprint">
#install.packages("fpc")
library("fpc")
</pre>
<p>Next, retrieve the cluster validation statistics of clustering method:</p>
<pre class="prettyprint">
cs = cluster.stats(dist(iris[1:4]), db$cluster)
</pre>
<p>Most often, we focus on using <code>within.cluster.ss</code> and <code>avg.silwidth</code> to validate the clustering method. The <code>within.cluster.ss</code> measurement stands for the <em>within clusters sum</em> of squares, and <code>avg.silwidth</code> represents the <em>average silhouette width</em>.</p>
<ul>
<li><code>within.cluster.ss</code> measurement shows how closely related objects are in clusters; the smaller the value, the more closely related objects are within the cluster.</li>
<li><code>avg.silwidth</code> is a measurement that considers how closely related objects are within the cluster and how clusters are separated from each other. The silhouette value usually ranges from 0 to 1; a value closer to 1 suggests the data is better clustered.</li>
</ul>
<pre class="prettyprint">
cs[c("within.cluster.ss","avg.silwidth")]
</pre>
<p>Finally, you can <a href="http://en.proft.me/2017/01/29/exploring-hierarchical-clustering-r/">generate the cluster statistics</a> of different clustering method and list them in a table.</p>Fri, 03 Feb 2017 00:00:00 +0200http://en.proft.me/2017/02/3/density-based-clustering-r/Model-based clustering and Gaussian mixture model in Rhttp://en.proft.me/2017/02/1/model-based-clustering-r/<p><b class="lb">Introduction</b></p>
<p>Clustering is a multivariate analysis used to group similar objects (close in terms of distance) together in the same group (cluster). Unlike <a href="http://en.proft.me/2015/12/24/types-machine-learning-algorithms/">supervised learning</a> methods (for example, classification and regression) a clustering analysis does not use any label information, but simply uses the similarity between data features to group them into clusters.</p>
<p>Clustering does not refer to specific algorithms but it’s a process to create groups based on similarity measure. Clustering analysis use <a href="http://en.proft.me/2015/12/24/types-machine-learning-algorithms/">unsupervised learning</a> algorithm to create clusters.</p>
<p>Clustering algorithms generally work on simple principle of maximization of intracluster similarities and minimization of intercluster similarities. The measure of similarity determines how the clusters need to be formed.</p>
<p>Similarity is a characterization of the ratio of the number of attributes two objects share in common compared to the total list of attributes between them. Objects which have everything in common are identical, and have a similarity of 1.0. Objects which have nothing in common have a similarity of 0.0.</p>
<p>Clustering can be widely adapted in the analysis of businesses. For example, a marketing department can use clustering to segment customers by personal attributes. As a result of this, different marketing campaigns targeting various types of customers can be designed.</p>
<p>Clustering model is a notion used to signify what kind of clusters we are trying to identify. The four most common models of clustering methods are hierarchical clustering, k-means clustering, model-based clustering, and density-based clustering:</p>
<ul>
<li><a href="http://en.proft.me/2017/01/29/exploring-hierarchical-clustering-r/">Hierarchical clustering</a>. It creates a hierarchy of clusters, and presents the hierarchy in a dendrogram. This method does not require the number of clusters to be specified at the beginning. Distance connectivity between observations is the measure.</li>
<li><a href="http://en.proft.me/2016/06/18/exploring-k-means-clustering-analysis-r/">k-means clustering</a>. It is also referred to as flat clustering. Unlike hierarchical clustering, it does not create a hierarchy of clusters, and it requires the number of clusters as an input. However, its performance is faster than hierarchical clustering. Distance from mean value of each observation/cluster is the measure.</li>
<li><a href="http://en.proft.me/2017/02/1/model-based-clustering-r/">Model-based clustering</a> (or <em>Distribution models</em>). Both hierarchical clustering and k-means clustering use a heuristic approach to construct clusters, and do not rely on a formal model. Model-based clustering assumes a data model and applies an EM algorithm to find the most likely model components and the number of clusters. Significance of statistical distribution of variables in the dataset is the measure.</li>
<li><a href="http://en.proft.me/2017/02/3/density-based-clustering-r/">Density-based clustering</a>. It constructs clusters in regard to the density measurement. Clusters in this method have a higher density than the remainder of the dataset. Density in data space is the measure.</li>
</ul>
<p>A good clustering algorithm can be evaluated based on two primary objectives:</p>
<ul>
<li>High intra-class similarity</li>
<li>Low inter-class similarity</li>
</ul>
<p>One disadvantage of <em>hierarchical clustering algorithms</em>, <em>k-means algorithms</em> and others is that they are largely heuristic and not based on formal models. Formal inference is not possible.</p>
<p>The key advantage of <em>model-based approach</em>, compared to the standard clustering methods (<em>k-means</em>, <em>hierarchical clustering</em>, etc.), is the suggestion of the number of clusters and an appropriate model.</p>
<p>Instead of taking a heuristic approach to build a cluster, <em>model-based clustering</em> uses a probability-based approach. <em>Model-based clustering</em> assumes that the data is generated by an underlying probability distribution and tries to recover the distribution from the data. One common model-based approach is using finite mixture models, which provide a flexible modeling framework for the analysis of the probability distribution. Finite mixture models are a linearly weighted sum of component probability distribution.</p>
<p>In <em>model-based clustering</em> the data are considered as coming from a distribution that is mixture of two or more components (i.e. <em>clusters</em>). Each component <em>k</em> (i.e. group or cluster) is modeled by the normal or Gaussian distribution which is <a href="http://www.sthda.com/english/wiki/model-based-clustering-unsupervised-machine-learning">characterized</a> by the parameters: <em>mean vector</em>, <em>covariance matrix</em>, <em>associated probability</em> (each point has a probability of belonging to each cluster).</p>
<p>Each cluster <em>k</em> is centered at the <em>means</em>, with increased density for points near the <em>mean</em>. Geometric features (shape, volume, orientation) of each cluster are determined by <em>the covariance matrix</em>.</p>
<p><em>Model-based clustering</em> are iterative method to fit a set of dataset into clusters by optimizing distributions of datasets in clusters. Gaussian distribution is nothing but normal distribution. This method works in three steps:</p>
<ol>
<li>First randomly choose Gaussian parameters and fit it to set of data points.</li>
<li>Iteratively optimize the distribution parameters to fit as many points it can.</li>
<li>Once it converges to a local minima, you can assign data points closer to that distribution of that cluster.</li>
</ol>
<p>Although this algorithm create complex models, it does capture correlation and dependence among the attributes. The downside is that these methods usually suffer from an overfitting problem. </p>
<p><b class="lb">Details about Gaussian mixture models</b></p>
<p>Probabilistic model-based clustering techniques have been widely used and have shown promising results in many applications, ranging from image segmentation, handwriting recognition, document clustering, topic modeling to information retrieval. Model-based clustering approaches attempt to optimize the fit between the observed data and some mathematical model using a probabilistic approach.</p>
<blockquote>The word model is usually used to represent the type of constraints and geometric properties of the covariance matrices.</blockquote>
<p>Such methods are often based on the assumption that the data are generated by a mixture of underlying <em>probability distributions</em>. In practice, each cluster can be represented mathematically by a parametric probability distribution, such as a <em>Gaussian</em> or a <em>Poisson</em> distribution. Thus, the clustering problem is transformed into a parameter estimation problem since the entire data can be modeled by a mixture of <em>k</em> component distributions.</p>
<p>In probabilistic models, the core idea is to model the data from a generative process. First, a specific form of the generative model (e.g., mixture of Gaussians) is assumed, and then the parameters of this model are estimated with the use of the <em>Expectation Maximization</em> (EM) algorithm. The available data set is used to estimate the parameters in such as way that they have a maximum likelihood fit to the generative model. Given this model, we then estimate the generative probabilities (or fit probabilities) of the underlying data points. Data points which fit the distribution well will have high fit probabilities, whereas anomalies will have very low fit probabilities.</p>
<blockquote>In model-based clustering, it is assumed that the data are generated by a mixture of probability distributions in which each component represents a different cluster.</blockquote>
<p>The generative models are typically solved with the use of an <em>EM</em> approach which is the most widely used method for estimating the parameters of a finite mixture probability density. The model-based clustering framework provides a principal way to deal with several problems in this approach, such as the number of component densities (or clusters), initial values of the parameters (the EM algorithm needs initial parameter values to get started), and distributions of the component densities (e.g., Gaussian). <em>EM</em> starts off with a random or heuristic initialization and then iteratively uses two steps to resolve the circularity in computation:</p>
<ul>
<li><em>E-Step</em>. Determine the expected probability of assignment of data points to clusters with the use of current model parameters.</li>
<li><em>M-Step</em>. Determine the optimum model parameters of each mixture by using the assignment probabilities as weights.</li>
</ul>
<p>In the previous article I introduced <a href="http://en.proft.me/2016/06/18/exploring-k-means-clustering-analysis-r/">k-means clustering</a>, which is one of the most widely used approaches for clustering. This non-parametric method for extracting structure has some excellent properties that make it ideal for many situations. In this article, we'll look at a parametric method that uses the <em>Gaussian distribution</em>, known as the <em>Gaussian mixture model</em> (GMM).</p>
<p>You may have heard of the <em>Gaussian distribution</em>, sometimes known as the <em>normal distribution</em>. Qualitatively it can be thought of as a distribution that occurs naturally and very frequently. This has two parameters, which we need to learn in order to fit this equation to our data: the <em>mean</em> and the <em>standard deviation</em>. The <em>mean</em> is a measure of the center of the distribution. The <em>standard deviation</em> is a measure of the distribution of the data around the mean. Armed with the knowledge that, for a <em>normal distribution</em>, 95% of the data is always within <em>two standard deviations from the mean</em>.</p>
<p><b class="lb">Modeling in R</b></p>
<p>Here, we show example of algorithm on <em>iris dataset</em>.</p>
<p><code>Mclust</code> package uses Bayesian Information Criterion (BIC) to find the number of clusters (model selection). BIC uses the likelihood and a penalty term to guard against overfitting.</p>
<p>First, install and load the library <code>mclust</code></p>
<pre class="prettyprint">
#install.packages("mclust")
library(mclust)
</pre>
<p>You can then perform model-based clustering on the <em>iris dataset</em> using <a href="http://svitsrv25.epfl.ch/R-doc/library/mclust/html/Mclust.html">Mclust</a>:</p>
<pre class="prettyprint">
mb = Mclust(iris[,-5])
#or specify number of clusters
mb3 = Mclust(iris[,-5], 3)
# optimal selected model
mb$modelName
# optimal number of cluster
mb$G
# probality for an observation to be in a given cluster
head(mb$z)
# get probabilities, means, variances
summary(mb, parameters = TRUE)
</pre>
<p>Compare amount of the data within each cluster</p>
<pre class="prettyprint">
table(iris$Species, mb$classification)
# vs
table(iris$Species, mb3$classification)
</pre>
<p>After the data is fit into the model, we plot the model based on clustering results.</p>
<pre class="prettyprint">
plot(mb, what=c("classification"))
</pre>
<div style="text-align:center;">
<img alt="r_mbc_classification.png" src="http://en.proft.me/media/science/r_mbc_classification.png" width="90%"><br/>
</div>
<p>The types of <code>what</code> argument are: "BIC", "classification", "uncertainty", "density". By default, all the above graphs are produced.</p>
<p><code>Mclust</code> uses an identifier for each possible parametrization of the covariance matrix that has three letters: <em>E</em> for "equal", <em>V</em> for "variable" and <em>I</em> for "coordinate axes". The first identifier refers to <em>volume</em>, the second to <em>shape</em> and the third to <em>orientation</em>. For example:</p>
<ul>
<li><em>EEE</em> means that the G clusters have the same volume, shape and orientation in p−dimensional space.</li>
<li><em>VEI</em> means variable volume, same shape and orientation equal to coordinate axes.</li>
<li><em>EIV</em> means same volume, spherical shape and variable orientation.</li>
</ul>
<p>Let's plot estimated density. Contour plot</p>
<pre class="prettyprint">
plot(mb, "density")
</pre>
<div style="text-align:center;">
<img alt="r_mbc_density.png" src="http://en.proft.me/media/science/r_mbc_density.png" width="90%"><br/>
</div>
<p>You can also use the <code>summary()</code> function to obtain the most likely model and the most possible number of clusters. For this example, the most possible number of clusters is five, with a BIC value equal to -556.1142.</p>
<p><b class="lb">Comparing clustering methods</b></p>
<p>After fitting data into clusters using different clustering methods, you may wish to measure the accuracy of the clustering. In most cases, you can use either <em>intracluster</em> or <em>intercluster</em> metrics as measurements. The higher the intercluster distance, the better it is, and the lower the intracluster distance, the better it is.</p>
<p>We now introduce how to compare different clustering methods using <code>cluster.stat</code> from the <code>fpc</code> package.</p>
<p>Perform the following steps to compare clustering methods.</p>
<p>First, install and load the <code>fpc</code> package</p>
<pre class="prettyprint">
#install.packages("fpc")
library("fpc")
</pre>
<p>Next, retrieve the cluster validation statistics of clustering method:</p>
<pre class="prettyprint">
cs = cluster.stats(dist(iris[1:4]), mb$classification)
</pre>
<p>Most often, we focus on using <code>within.cluster.ss</code> and <code>avg.silwidth</code> to validate the clustering method. The <code>within.cluster.ss</code> measurement stands for the <em>within clusters sum</em> of squares, and <code>avg.silwidth</code> represents the <em>average silhouette width</em>.</p>
<ul>
<li><code>within.cluster.ss</code> measurement shows how closely related objects are in clusters; the smaller the value, the more closely related objects are within the cluster.</li>
<li><code>avg.silwidth</code> is a measurement that considers how closely related objects are within the cluster and how clusters are separated from each other. The silhouette value usually ranges from 0 to 1; a value closer to 1 suggests the data is better clustered.</li>
</ul>
<pre class="prettyprint">
cs[c("within.cluster.ss","avg.silwidth")]
</pre>
<p>Finally, you can <a href="http://en.proft.me/2017/01/29/exploring-hierarchical-clustering-r/">generate the cluster statistics</a> of different clustering method and list them in a table.</p>
<p><b class="lb">The relationship between k-means and GMM</b></p>
<p>K-means can be expressed as a special case of the Gaussian mixture model. In general, the Gaussian mixture is more expressive because membership of a data item to a cluster is dependent on the shape of that cluster, not just its proximity.</p>
<p>As with k-means, training a Gaussian mixture model with EM can be quite sensitive to initial starting conditions. If we compare and contrast GMM to k-means, we’ll find a few more initial starting conditions in the former than in the latter. In particular, not only must the initial centroids be specified, but the initial covariance matrices and mixture weights must be specified also. Among other strategies, one approach is to run k-means and use the resultant centroids to determine the initial starting conditions for the Gaussian mixture model.</p>
<p><b class="lb">Outcome</b></p>
<p>A key feature of most clustering algorithms is that every point is assigned to a single cluster. But realistically, many datasets contain a large gray area, and mixture models are a way to capture that.</p>
<p>You can think of <em>Gaussian mixture models</em> as a version of <em>k-means</em> that captures the notion of a gray area and gives <a href="http://en.proft.me/2015/12/7/how-calculate-confidence-interval-r/">confidence levels</a> whenever it assigns a point to a particular cluster.</p>
<p>Each cluster is modeled as a multivariate Gaussian distribution, and the model is specified by giving the following:</p>
<ol>
<li>The number of clusters.</li>
<li>The fraction of all data points that are in each cluster.</li>
<li>Each cluster’s mean and its d-by-d covariance matrix.</li>
</ol>
<p>When training a GMM, the computer keeps a running confidence level of how likely each point is to be in each cluster, and it never decides them definitively: the mean and standard deviation for a cluster will be influenced by every point in the training data, but in proportion to how likely they are to be in that cluster. When it comes time to cluster a new point, you get out a confidence level for every cluster in your model.</p>
<p>Mixture models have many of the same blessings and curses of k-means. They are simple to understand and implement. The computational costs are very light and can be done in a distributed way. They provide clear, understandable output that can be easily used to cluster additional points in the future. On the other hand, they both assume more-or-less convex clusters, and they are both liable to fall into local minimums when training.</p>Wed, 01 Feb 2017 00:00:00 +0200http://en.proft.me/2017/02/1/model-based-clustering-r/Exploring Hierarchical clustering in Rhttp://en.proft.me/2017/01/29/exploring-hierarchical-clustering-r/<p><img src="http://en.proft.me/media/science/r_hc_tree.png" alt="r_hc_tree.png" class="right" width="120">
<b class="lb">Introduction</b></p>
<p>Clustering is a multivariate analysis used to group similar objects (close in terms of distance) together in the same group (cluster). Unlike <a href="http://en.proft.me/2015/12/24/types-machine-learning-algorithms/">supervised learning</a> methods (for example, classification and regression) a clustering analysis does not use any label information, but simply uses the similarity between data features to group them into clusters.</p>
<p>Clustering does not refer to specific algorithms but it’s a process to create groups based on similarity measure. Clustering analysis use <a href="http://en.proft.me/2015/12/24/types-machine-learning-algorithms/">unsupervised learning</a> algorithm to create clusters.</p>
<p>Clustering algorithms generally work on simple principle of maximization of intracluster similarities and minimization of intercluster similarities. The measure of similarity determines how the clusters need to be formed.</p>
<p>Similarity is a characterization of the ratio of the number of attributes two objects share in common compared to the total list of attributes between them. Objects which have everything in common are identical, and have a similarity of 1.0. Objects which have nothing in common have a similarity of 0.0.</p>
<p>Clustering can be widely adapted in the analysis of businesses. For example, a marketing department can use clustering to segment customers by personal attributes. As a result of this, different marketing campaigns targeting various types of customers can be designed.</p>
<p>Clustering model is a notion used to signify what kind of clusters we are trying to identify. The four most common models of clustering methods are hierarchical clustering, k-means clustering, model-based clustering, and density-based clustering:</p>
<ul>
<li><a href="http://en.proft.me/2017/01/29/exploring-hierarchical-clustering-r/">Hierarchical clustering</a>. It creates a hierarchy of clusters, and presents the hierarchy in a dendrogram. This method does not require the number of clusters to be specified at the beginning. Distance connectivity between observations is the measure.</li>
<li><a href="http://en.proft.me/2016/06/18/exploring-k-means-clustering-analysis-r/">k-means clustering</a>. It is also referred to as flat clustering. Unlike hierarchical clustering, it does not create a hierarchy of clusters, and it requires the number of clusters as an input. However, its performance is faster than hierarchical clustering. Distance from mean value of each observation/cluster is the measure.</li>
<li><a href="http://en.proft.me/2017/02/1/model-based-clustering-r/">Model-based clustering</a> (or <em>Distribution models</em>). Both hierarchical clustering and k-means clustering use a heuristic approach to construct clusters, and do not rely on a formal model. Model-based clustering assumes a data model and applies an EM algorithm to find the most likely model components and the number of clusters. Significance of statistical distribution of variables in the dataset is the measure.</li>
<li><a href="http://en.proft.me/2017/02/3/density-based-clustering-r/">Density-based clustering</a>. It constructs clusters in regard to the density measurement. Clusters in this method have a higher density than the remainder of the dataset. Density in data space is the measure.</li>
</ul>
<p>A good clustering algorithm can be evaluated based on two primary objectives:</p>
<ul>
<li>High intra-class similarity</li>
<li>Low inter-class similarity</li>
</ul>
<p>Hierarchical clustering adopts either an agglomerative or divisive method to build a hierarchy of clusters. Regardless of which approach is adopted, both first use a distance similarity measure to combine or split clusters. The recursive process continues until there is only one cluster left or you cannot split more clusters. Eventually, we can use a dendrogram to represent the hierarchy of clusters.</p>
<ul>
<li><em>Agglomerative hierarchical clustering</em>. This is a bottom-up approach. Each observation starts in its own cluster. We can then compute the similarity (or the distance) between each cluster and then merge the two most similar ones at each iteration until there is only one cluster left.</li>
<li><em>Divisive hierarchical clustering</em>. This is a top-down approach. All observations start in one cluster, and then we split the cluster into the two least dissimilar clusters recursively until there is one cluster for each observation.</li>
</ul>
<p>The <em>agglomerative methods</em>, using a recursive algorithm that follows the next phases:</p>
<ul>
<li>Find the two closest points in the dataset.</li>
<li>Link these points and consider them as a single point.</li>
<li>The process starts again, now using the new dataset that contains the new point.</li>
</ul>
<p>In a hierarchical clustering, there are two very important parameters: <em>the distance metric</em> and <em>the linkage method</em>.</p>
<p><strong>Clustering distance metric</strong>. Defining closeness is a fundamental aspect. A measure of dissimilarity is that which defines clusters that will be combined in the case of agglomerative method, or that, in the case of divisive clustering method, when these are to be divided. The main measures of distance are as follows:</p>
<ul>
<li>Euclidean distance</li>
<li>Maximum distance</li>
<li>Manhattan distance</li>
<li>Canberra distance</li>
<li>Binary distance</li>
<li>Pearson distance</li>
<li>Correlation</li>
<li>Spearman distance</li>
</ul>
<p>You can read more about distance measure and it visualization <a href="http://www.sthda.com/english/wiki/clarifying-distance-measures-unsupervised-machine-learning">here</a>.</p>
<p><strong>Linkage methods</strong>. The linkage methods determine how the distance between two clusters is defined. A linkage rule is necessary for calculating the inter-cluster distances. It is important to try several linkage methods to compare their results. Depending on the dataset, some methods may work better. The following is a list of the most common linkage methods:</p>
<ul>
<li><em>Single linkage</em>. This refers to the shortest distance between two points in each cluster. The distance between two clusters is the minimum distance between an observation in one cluster and an observation in the other cluster. A good choice when clusters are obviously separated.</li>
<li><em>Complete linkage</em>. This refers to the longest distance between two points in each cluster. The distance between two clusters is the maximum distance between an observation in one cluster and an observation in the other cluster. It can be sensitive to outliers.</li>
<li><em>Average linkage</em> This refer to the average distance between two points in each cluster. The distance between two clusters is the mean distance between an observation in one cluster and an observation in the other cluster.</li>
<li><em>Ward linkage</em>. This refers to the sum of the squared distance from each point to the mean of the merged clusters. The distance between two clusters is the sum of the squared deviations from points to centroids. Try to minimize the within-cluster sum of squares. It can be sensitive to outliers.</li>
<li><em>Centroid Linkage</em>. The distance between two clusters is the distance between the cluster centroids or means.</li>
<li><em>Median Linkage</em>. The distance between two clusters is the median distance between an observation in one cluster and an observation in the other cluster. It reduces the effect of outliers.</li>
<li><em>McQuitty Linkage</em> When two clusters A and B are be joined, the distance to new cluster C is the average of distances of A and B to C. So, the distance depends on a combination of clusters instead of individual observations in the clusters.</li>
</ul>
<p>Hierarchical clustering is based on the distance connectivity between observations of clusters. The steps involved in the clustering process are:</p>
<ol>
<li>Start with <em>N</em> clusters, <em>N</em> is number of elements (i.e., assign each element to its own cluster). In other words distances (similarities) between the clusters are equal to the distances (similarities) between the items they contain.</li>
<li>Now merge pairs of clusters with the closest to other (most similar clusters) (e.g., the first iteration will reduce the number of clusters to <em>N - 1</em>).</li>
<li>Again compute the distance (similarities) and merge with closest one.</li>
<li>Repeat Steps 2 and 3 to exhaust the items until you get all data points in one cluster.</li>
</ol>
<p><b class="lb">Modeling in R</b></p>
<p>In R, we use the <code>hclust()</code> function for <em>hierarchical cluster analysis</em>. This is part of the <code>stats</code> package. To perform hierarchical clustering, the input data has to be in a distance matrix form.</p>
<p>Another important function used here is <code>dist()</code> this function computes and returns the distance matrix computed by using the specified distance measure to compute the distances between the rows of a data matrix. By default, it is Euclidean distance.</p>
<p>You can then examine the <a href="https://en.wikipedia.org/wiki/Iris_flower_data_set">iris</a> dataset structure:</p>
<pre class="prettyprint">
str(iris)
</pre>
<p>Let's look at the parallel coordinates plot of the data.</p>
<blockquote>Parallel coordinates are a common way of <a href="https://en.wikipedia.org/wiki/Parallel_coordinates">visualizing</a> high-dimensional geometry and analyzing multivariate data. The lines show the relationship between the axes, much like scatterplots, and the patterns that the lines form indicate the relationship. You can also gather <a href="https://www.safaribooksonline.com/blog/2014/03/31/mastering-parallel-coordinate-charts-r/">details</a> about the relationships between the axes when you see the clustering of lines.</blockquote>
<pre class="prettyprint">
# get nice colors
library(colorspace)
iris2 <- iris[,-5]
species_labels <- iris[,5]
species_col <- rev(rainbow_hcl(3))[as.numeric(species_labels)]
MASS::parcoord(iris2, col = species_col, var.label = TRUE, lwd = 2)
title("Parallel coordinates plot of the Iris data")
legend("top", cex = 1, legend = as.character(levels(species_labels)), fill = unique(species_col), horiz = TRUE)
</pre>
<div style="text-align:center;">
<img alt="r_hc_parallel.png" src="http://en.proft.me/media/science/r_hc_parallel.png"><br/>
</div>
<p>We can see that the <em>Setosa</em> species are distinctly different from <em>Versicolor</em> and <em>Virginica</em> (they have lower petal length and width). But <em>Versicolor</em> and <em>Virginica</em> cannot easily be separated based on measurements of their sepal and petal width/length.</p>
<p>You can then use agglomerative hierarchical clustering to cluster the customer data:</p>
<pre class="prettyprint">
iris_hc = hclust(dist(iris[,1:4], method="euclidean"), method="ward.D2")
</pre>
<p>We use the Euclidean distance as distance metrics, and use Ward's minimum variance method to perform agglomerative clustering.</p>
<p>Lastly, you can use the <code>plot</code> function to plot the dendrogram:</p>
<pre class="prettyprint">
plot(iris_hc, hang = -0.01, cex = 0.7)
</pre>
<div style="text-align:center;">
<img alt="r_hc_iris.png" src="http://en.proft.me/media/science/r_hc_iris.png"><br/>
</div>
<p>Finally, we use the <code>plot</code> function to plot the dendrogram of hierarchical clusters. We specify <code>hang</code> to display labels at the bottom of the dendrogram, and use <code>cex</code> to shrink the label to 70 percent of the normal size.</p>
<p>In the dendrogram displayed above, each leaf corresponds to one observation. As we move up the tree, observations that are similar to each other are combined into branches. You can see more kinds of visualizing dendrograms in R <a href="https://rpubs.com/gaston/dendrograms">here</a> or <a href="http://www.sthda.com/english/wiki/beautiful-dendrogram-visualizations-in-r-5-must-known-methods-unsupervised-machine-learning">here</a>.</p>
<p>In a dendrogram, we can see the hierarchy of clusters, but we have not grouped data into different clusters yet. However, we can determine how many clusters are within the dendrogram and cut the dendrogram at a certain tree height to separate the data into different groups. We'll demonstrate how to use the <code>cutree</code> function to separate the data into a given number of clusters.</p>
<p>First, categorize the data into four groups:</p>
<pre class="prettyprint">
fit = cutree(iris_hc, k = 3)
</pre>
<p>You can then examine the cluster labels for the data:</p>
<pre class="prettyprint">
fit
</pre>
<p>Count the number of data within each cluster:</p>
<pre class="prettyprint">
table(fit)
</pre>
<p>Finally, you can visualize how data is clustered with the red rectangle border:</p>
<pre class="prettyprint">
plot(iris_hc, hang = -0.01, cex = 0.7)
rect.hclust(iris_hc, k=3, border="red")
</pre>
<div style="text-align:center;">
<img alt="r_hc_iris_clusters.png" src="http://en.proft.me/media/science/r_hc_iris_clusters.png"><br/>
</div>
<p><code>table</code> the clustering distribution with actual <em>Species</em></p>
<pre class="prettyprint">
table(fit, iris$Species)
</pre>
<p><b class="lb">Comparing clustering methods</b></p>
<p>After fitting data into clusters using different clustering methods, you may wish to measure the accuracy of the clustering. In most cases, you can use either <em>intracluster</em> or <em>intercluster</em> metrics as measurements. The higher the intercluster distance, the better it is, and the lower the intracluster distance, the better it is.</p>
<p>We now introduce how to compare different clustering methods using <code>cluster.stat</code> from the <code>fpc</code> package.</p>
<p>Perform the following steps to compare clustering methods.</p>
<p>First, install and load the <code>fpc</code> package</p>
<pre class="prettyprint">
#install.packages("fpc")
library("fpc")
</pre>
<p>You then need to use hierarchical clustering with the <em>single</em> method to cluster customer data and generate the object <code>hc_single</code></p>
<pre class="prettyprint">
single_c = hclust(dist(iris[,1:4]), method="single")
hc_single = cutree(single_c, k=3)
</pre>
<p>Use hierarchical clustering with the <em>complete</em> method to cluster customer data and generate the object <code>hc_complete</code></p>
<pre class="prettyprint">
complete_c = hclust(dist(iris[,1:4]), method="complete")
hc_complete = cutree(complete_c, k=3)
</pre>
<p>You can then use <a href="http://en.proft.me/2016/06/18/exploring-k-means-clustering-analysis-r/">k-means clustering</a> to cluster customer data and generate the object <code>km</code></p>
<pre class="prettyprint">
set.seed(1234)
km = kmeans(iris[1:4], 3, iter.max=1000, algorithm=c("Forgy"))
</pre>
<p>Next, retrieve the cluster validation statistics of either clustering method:</p>
<pre class="prettyprint">
cs = cluster.stats(dist(iris[1:4]), km$cluster)
</pre>
<p>Most often, we focus on using <code>within.cluster.ss</code> and <code>avg.silwidth</code> to validate the clustering method. The <code>within.cluster.ss</code> measurement stands for the <em>within clusters sum</em> of squares, and <code>avg.silwidth</code> represents the <em>average silhouette width</em>.</p>
<ul>
<li><code>within.cluster.ss</code> measurement shows how closely related objects are in clusters; the smaller the value, the more closely related objects are within the cluster.</li>
<li><code>avg.silwidth</code> is a measurement that considers how closely related objects are within the cluster and how clusters are separated from each other. The silhouette value usually ranges from 0 to 1; a value closer to 1 suggests the data is better clustered.</li>
</ul>
<pre class="prettyprint">
cs[c("within.cluster.ss","avg.silwidth")]
</pre>
<p>Finally, we can generate the cluster statistics of each clustering method and list them in a table:</p>
<pre class="prettyprint">
sapply(list(kmeans = km$cluster, hc_single = hc_single, hc_complete = hc_complete),
function(c) cluster.stats(dist(iris[1:4]), c)[c("within.cluster.ss","avg.silwidth")])
</pre>
<p><b class="lb">Useful links</b></p>
<ul>
<li><a href="http://www.sthda.com/english/wiki/hierarchical-clustering-essentials-unsupervised-machine-learning">Hierarchical Clustering Essentials</a></li>
</ul>Sun, 29 Jan 2017 00:00:00 +0200http://en.proft.me/2017/01/29/exploring-hierarchical-clustering-r/Classification using Random forest in Rhttp://en.proft.me/2017/01/24/classification-using-random-forest-r/<p><img src="http://en.proft.me/media/science/r_rf.png" alt="r_rf.png" class="right" width="120">
<b class="lb">Introduction</b></p>
<p><em>Random forest</em> (or <em>decision tree forests</em>) is one of the most popular decision tree-based ensemble models. The accuracy of these models tends to be higher than most of the other <a href="http://en.proft.me/2016/11/9/classification-using-decision-trees-r/">decision trees</a>. Random Forest algorithm can be used for both classification and regression applications.</p>
<p>The main drawback of <em>decision trees</em> is that they are prone to <em>overfitting</em>. The reason for this is that trees, if grown deep, are able to fit all kinds of variations in the data, including noise. Although it is possible to address this partially by <a href="http://en.proft.me/2016/11/9/classification-using-decision-trees-r/">pruning</a>, the result often remains less than satisfactory.</p>
<p><em>Ensemble learning</em> is a method to combine results produced by different learners into one format, with the aim of producing better classification results and regression results.</p>
<p>In ensemble learning, <em>bagging</em>, <em>boosting</em>, and <em>random forest</em> are the three most common methods.</p>
<p><em>Random forest</em> uses the classification results voted from many classification trees. The idea is simple: a single classification tree will obtain a single classification result with a single input vector. However, a random forest grows many classification trees, obtaining multiple results from a single input. Therefore, a random forest will use the majority of votes from all the decision trees to classify data or use an average output for regression.</p>
<p><em>Random forest</em> creates a large number of <em>decision trees</em>. Every observation is feed into every decision tree. The most common outcome for each observation is used as the final output. A new observation is feed into all the trees and taking a majority vote for each classification model. In ensemble terms, the <em>trees</em> are <em>weak learners</em> and the <em>random forest</em> is a <em>strong learner</em>.</p>
<div style="text-align:center; margin-bottom: 5px;">
<a href="https://citizennet.com/blog/2012/11/10/random-forests-ensembles-and-performance-metrics/"><img alt="r_rf_concept.png" src="http://en.proft.com.ua/media/science/r_rf_concept.jpg" width="90%"></a><br/>
</div>
<p>It's worth noting that relative to other ensemble-based methods, <em>random forests</em> are quite competitive and offer key advantages relative to the competition. For instance, <em>random forests</em> tend to be easier to use and less prone to overfitting. The following is the general strengths and weaknesses of <em>random forest</em> models.</p>
<p>Strengths of <em>Random forest</em></p>
<ul>
<li>This algorithm can solve both type of problems i.e. classification and regression and does a decent estimation at both fronts.</li>
<li>It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.</li>
<li>Selects only the most important features.</li>
<li>Can be used on data with an extremely large number of features or examples (higher dimensionality). It can handle thousands of input variables and identify most significant variables so it is considered as one of the dimensionality reduction methods. Further, the model outputs <em>Importance of variable</em>, which can be a very handy feature (on some random data set).</li>
</ul>
<p>Weaknesses of <em>Random forest</em></p>
<ul>
<li>Unlike a decision tree, the model is not easily interpretable. Random Forest can feel like a black box approach for statistical modelers.</li>
<li>It surely does a good job at classification but not as good as for regression problem as it does not give precise continuous nature predictions. In case of regression, it doesn’t predict beyond the range in the training data, and that they may over-fit data sets that are particularly noisy.</li>
<li>May require some work to tune the model to the data.</li>
</ul>
<p><b class="lb">Modeling in R</b></p>
<p><em>Random forest</em> algorithm is built in <code>randomForest</code> package of R and same name function allows us to use it.</p>
<p>Some of the commonly used parameters of <code>randomForest</code> functions are</p>
<ul>
<li><code>x</code> - Random forest formula</li>
<li><code>data</code> - input data frame</li>
<li><code>ntree</code> - number of decision trees to be grown</li>
<li><code>mtry</code> - the number of features used to find the best feature</li>
<li><code>replace</code> takes True and False and indicates whether to take sample with/without replacement</li>
<li><code>importance</code> - whether independent variable importance in random forest be assessed</li>
<li><code>proximity</code> - whether to calculate proximity measures between rows of a data frame</li>
</ul>
<p>The <code>randomForest</code> package optionally produces two additional pieces of information: a <em>measure of the importance</em> of the predictor variables, and a <em>measure of the internal structure</em> of the data (the proximity of different data points to one another).</p>
<p>By default, the <code>randomForest()</code> function creates an ensemble of 500 trees that consider <code>sqrt(p)</code> random features at each split, where <em>p</em> is the number of features in the training dataset and <code>sqrt()</code> refers to R's square root function. The goal of using a large number of trees is to train enough so that each feature has a chance to appear in several models.</p>
<p>Let's see how the default <code>randomForest()</code> parameters work with the <code>iris</code> data set. The <code>set.seed()</code> function ensures that the result can be replicated:</p>
<pre class="prettyprint">
#install.packages("randomForest")
library("randomForest")
set.seed(1234)
</pre>
<p>Split <code>iris</code> data to <em>training</em> data and <em>testing</em> data</p>
<pre class="prettyprint">
ind = sample(2, nrow(iris), replace=TRUE, prob=c(0.7,0.3))
trainData = iris[ind==1,]
testData = iris[ind==2,]
</pre>
<p>Generate <em>Random forest</em> learning tree</p>
<pre class="prettyprint">
iris_rf = randomForest(Species~., data=trainData, ntree=100, proximity=T)
table(predict(iris_rf), trainData$Species)
</pre>
<p>To look at a summary of the model's performance, we can simply type the resulting object's name:</p>
<pre class="prettyprint">
iris_rf
</pre>
<p>The output notes that the <em>random forest</em> included 100 trees and tried two variables at each split. At first glance, you might be alarmed at the seemingly poor performance according to the confusion matrix—the error rate of 5.36 percent is far worse than the resubstitution error of any of the other ensemble methods. However, this confusion matrix does not show resubstitution error. Instead, it reflects the <em>out-of-bag error rate</em> (listed in the output as OOB estimate of error rate), which unlike resubstitution error, is an unbiased estimate of the test set error. This means that it should be a fairly reasonable estimate of future performance.</p>
<p>You can use the <code>plot()</code> function to plot the mean square error of the forest object:</p>
<pre class="prettyprint">
plot(iris_rf)
</pre>
<div style="text-align:center; margin-bottom: 5px;">
<img alt="r_rf_plot.png" src="http://en.proft.com.ua/media/science/r_rf_plot.png"></a>
</div>
<p>100 decision trees or a forest has been built using the <em>random forest</em> algorithm based learning. We can plot the error rate across decision trees. The plot seems to indicate that after 40 decision trees, there is not a significant reduction in error rate.</p>
<p>You can then examine the importance of each attribute within the fitted classifier:</p>
<pre class="prettyprint">
importance(iris_rf)
</pre>
<p>Try to build <em>random forest</em> for testing data. Similar to other classification methods, you can obtain the classification table:</p>
<pre class="prettyprint">
irisPred = predict(iris_rf, newdata=testData)
table(irisPred, testData$Species)
</pre>
<p>Try to see the margin, positive or negative, if positif it means correct classification</p>
<pre class="prettyprint">
plot(margin(iris_rf, testData$Species))
</pre>
<p>Let's determine the misclassification rate. First, build a confusion matrix. Each column of the matrix represents the number of predictions of each class, while each row represents the instances in the actual class.</p>
<pre class="prettyprint">
CM = table(irisPred, testData$Species)
</pre>
<p>Second, build a <em>diagonal mark quality prediction</em>. Applying the <code>diag</code> function to this table then selects the diagonal elements, i.e., the number of points where <em>random forest</em> agrees with the true classification, and the <code>sum</code> command simply adds these values up.</p>
<pre class="prettyprint">
accuracy = (sum(diag(CM)))/sum(CM)
</pre>
<p>The model has a high overall accuracy that is <em>0.9473684</em>.</p>
<p><b class="lb">Useful links</b></p>
<ul>
<li><a href="https://www.analyticsvidhya.com/blog/2016/04/complete-tutorial-tree-based-modeling-scratch-in-python/">A Complete Tutorial on Tree Based Modeling from Scratch</a></li>
<li><a href="http://www.bios.unc.edu/~dzeng/BIOS740/randomforest.pdf">Classification and Regression by randomForest</a></li>
</ul>Tue, 24 Jan 2017 00:00:00 +0200http://en.proft.me/2017/01/24/classification-using-random-forest-r/Classification using k-Nearest Neighbors in Rhttp://en.proft.me/2017/01/22/classification-using-k-nearest-neighbors-r/<p><img src="http://en.proft.me/media/science/r_knn_classify.png" alt="r_knn_classify.png" class="right" width="120">
<b class="lb">Introduction</b></p>
<p>The <em>k-NN</em> algorithm is among the simplest of all <a href="http://en.proft.me/2015/12/24/types-machine-learning-algorithms/">machine learning algorithms</a>. It also might surprise many to know that <em>k-NN</em> is one of the <a href="http://www.cs.umd.edu/~samir/498/10Algorithms-08.pdf">top 10 data mining algorithms</a>.</p>
<p><em>k-Nearest Neighbors algorithm</em> (or k-NN for short) is a <em>non-parametric method</em> used for <em>classification</em> and <em>regression</em>. In both cases, the input points consists of the <em>k</em> closest training examples in the <em>feature space</em>.</p>
<blockquote>Non-parametric means that it does not make any assumptions on the underlying data distribution.</blockquote>
<p>Since the points are in <em>feature space</em>, they have a notion of <em>distance</em>. This need not necessarily be Euclidean distance although it is the one commonly used. Each of the training data consists of <em>a set of vectors</em> and <em>class label</em> associated with each vector. In the simplest case, it will be either <em>+</em> or <em>–</em> (for positive or negative classes). But <em>k-NN</em>, can work equally well with arbitrary number of classes.</p>
<blockquote><i>k</i> number decides how many neighbors (where neighbors is defined based on the distance metric) influence the classification. With a given <i>k</i> value we can make boundaries of each class. </blockquote>
<p>For any fixed point <em>x</em> we can calculate how close each observation <i>X<sup>i</sup></i> is to <em>x</em> using the Euclidean distance. This ranks the data by how close they are to <em>x</em>. Imagine drawing a small ball about <em>x</em> and slowly inflating it. As the ball hits the first observation <i>X<sup>i</sup></i>, this is the first nearest neighbor of <em>x</em>. As the ball further inflates and hits a second observation, this observation is the second nearest neighbor.</p>
<div style="text-align:center; margin-bottom: 5px;">
<a href="http://bdewilde.github.io/blog/blogger/2012/10/26/classification-of-hand-written-digits-3/"><img alt="dp_decorator.png" src="http://en.proft.com.ua/media/science/r_knn_concept.png" width="60%"></a><br/>
</div>
<p>Efficiency of the <em>k-NN</em> algorithm largely depends on the value of <em>k</em> i.e, choosing the number of nearest neighbors. Choosing the optimal value for <em>k</em> is best done by first inspecting the data. In general, a large <em>k</em> value is more precise as it reduces the overall noise but there is no guarantee. Cross-validation is another way to retrospectively determine a good <em>k</em> value by using an independent dataset to validate the <em>k</em> value. Historically, the optimal <em>k</em> for most datasets has been between 3-10.</p>
<p><em>k-NN</em> makes predictions using the training dataset directly. Predictions are made for a new instance <em>x</em> by searching through the entire training set for the <em>k</em> most similar instances (the neighbors) and summarizing the output variable for those <em>k</em> instances. For regression this might be the mean output variable, in classification this might be the mode (or most common) class value.</p>
<p>To determine which of the <em>k</em> instances in the training dataset are most similar to a new input a distance measure is used. You can choose the best distance metric based on the properties of your data. Following are the most popular distance measures</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Euclidean_distance">Euclidean distance</a>. This measure is popular for real-valued input variables. The Euclidean distance is always greater than or equal to zero. The measurement would be zero for identical points and high for points that show little similarity. It is a good distance measure to use if the input variables are similar in type (e.g. all measured widths and heights)</li>
<li><a href="https://en.wikipedia.org/wiki/Hamming_distance">Hamming distance</a>. Calculate the distance between binary vectors.</li>
<li><a href="https://en.wikipedia.org/wiki/Taxicab_geometry">Manhattan distance</a>. Calculate the distance between real vectors using the sum of their absolute difference. Also called City Block Distance. It is a good measure to use if the input variables are not similar in type (such as age, gender, height, etc.).</li>
<li><a href="https://en.wikipedia.org/wiki/Minkowski_distance">Minkowski distance</a>. Generalization of Euclidean and Manhattan distance.</li>
<li><a href="https://en.wikipedia.org/wiki/Jaccard_index#Tanimoto_similarity_and_distance">Tanimoto distance</a> is only applicable for a binary variable.</li>
<li><a href="https://en.wikipedia.org/wiki/Jaccard_index">Jaccard distance</a> measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets.</li>
<li><a href="https://en.wikipedia.org/wiki/Mahalanobis_distance">Mahalanobis distance</a></li>
<li><a href="https://en.wikipedia.org/wiki/Cosine_similarity">Cosine distance</a> ranges from +1 to -1 where +1 is the highest correlation. Complete opposite points have correlation -1.</li>
</ul>
<p>The output of <em>k-NN</em> depends on whether it is used for classification or regression:</p>
<ul>
<li>In <em>k-NN classification</em>, the <em>output</em> is a <em>class membership</em>. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its <em>k</em> nearest neighbors (<em>k</em> is a positive integer, typically small).</li>
<li>In <em>k-NN regression</em>, the output is the <em>property value</em> for the object. This value is the average of the values of its k nearest neighbors.</li>
</ul>
<p>However, it is more widely used in classification problems in the industry because it's easy of interpretation and low calculation time (for small dataset). The computational complexity of <em>k-NN</em> increases with the size of the training dataset.</p>
<p><em>k-NN</em> is also a <em>lazy algorithm</em>. What this means is that it does not use the training data points to do any <em>generalization</em>. Lack of generalization means that <em>k-NN</em> keeps all the training data and all the training data is needed during the testing phase. As result:</p>
<ul>
<li>more time might be needed as in the worst case, all data points might take point in decision.</li>
<li>more memory is needed as we need to store all training data.</li>
</ul>
<p>In <em>k-Nearest Neighbor classification</em>, the training dataset is used to classify each member of a <em>target</em> dataset. The structure of the data is that there is a <em>classification (categorical) variable</em> of interest ("buyer," or "non-buyer," for example), and a number of additional <em>predictor variables</em> (age, income, location, etc).</p>
<p>Before use <em>k-NN</em> we should do some preparations</p>
<ul>
<li><em>Rescale Data</em>. <em>k-NN</em> performs much better if all of the data has the same scale. Normalizing your data to the range [0, 1] is a good idea. It may also be a good idea to standardize your data if it has a Gaussian distribution.</li>
<li><em>Address Missing Data</em>. Missing data will mean that the distance between samples can not be calculated. These samples could be excluded or the missing values could be imputed.</li>
<li><em>Lower Dimensionality</em>. <em>k-NN</em> is suited for lower dimensional data. You can try it on high dimensional data (hundreds or thousands of input variables) but be aware that it may not perform as well as other techniques. <em>k-NN</em> can benefit from feature selection that reduces the dimensionality of the input feature space.</li>
</ul>
<p>Following is the algorithm:</p>
<ol>
<li>For each row (case) in the target dataset (the set to be classified), locate the <em>k</em> closest members (the <em>k</em> nearest neighbors) of the training dataset. A Euclidean distance measure is used to calculate how close each member of the training set is to the target row that is being examined.</li>
<li>Examine the <em>k</em> nearest neighbors - which classification (category) do most of them belong to? Assign this category to the row being examined.</li>
<li>Repeat this procedure for the remaining rows (cases) in the target set.</li>
</ol>
<p><b class="lb">Modeling in R</b></p>
<p>Let's see how to use <em>k-NN</em> for classification. In this case, we are given some data points for training and also a new unlabelled data for testing. Our aim is to find the class label for the new point. The algorithm has different behavior based on <em>k</em>.</p>
<p>Before starting with the <em>k-NN</em> implementation using R, I would like to discuss two Data Normalization/Standardization techniques for data consistent. You can easily see this when you go through the results of the <code>summary()</code> function. Look at the minimum and maximum values of all the (numerical) attributes. If you see that one attribute has a wide range of values, you will need to normalize your dataset, because this means that the distance will be dominated by this feature.</p>
<ul>
<li><em>Min-Max Normalization</em>. This process transforms a feature value in such a way that all of its values falls in the range of 0 and 1. This technique is traditionally used with <em>k-Nearest Neighbors</em> classification problems.</li>
<li><em>Z-Score Standardization</em>. This process subtracts the mean value of feature <em>x</em> and divides it with the standard deviation of <em>x</em>. The disadvantage with min-max normalization technique is that it tends to bring data towards the mean. If there is a need for outliers to get weighted more than the other values, z-score standardization technique suits better. In order to achieve z-score standardization, one could use R’s built-in <code>scale()</code> function.</li>
</ul>
<p>To perform a <em>k-nearest neighbour classification</em> in R we will make use of the <code>knn</code> function in the <code>class</code> package and <code>iris</code> data set.</p>
<pre class="prettyprint">
# familiarize with data
str(iris)
table(iris$Species)
</pre>
<p>A quick look at the <code>Species</code> attribute through tells you that the division of the species of flowers is 50-50-50.</p>
<p>If you want to check the percentual division of the <code>Species</code> attribute, you can ask for a table of proportions:</p>
<pre class="prettyprint">
round(prop.table(table(iris$Species)) * 100, digits = 1)
</pre>
<p>Also, you can try to get an idea of your data by making some graphs, such as <a href="http://en.proft.me/2016/12/3/charts-r-usage/#scatter">scatter plots</a>. It can be interesting to see how much one variable is affected by another. In other words, you want to see if there is any correlation between two variables.</p>
<pre class="prettyprint">
plot(iris$Sepal.Length, iris$Sepal.Width, col=iris$Species, pch=19)
legend("topright", legend=levels(iris$Species), bty="n", pch=19, col=palette())
</pre>
<div style="text-align:center; margin-bottom: 5px;">
<img alt="r_knn_plot.png" src="http://en.proft.com.ua/media/science/r_knn_plot.png"></a>
</div>
<p>You see that there is a high correlation between the <code>Sepal.Length</code> and the <code>Sepal.Width</code> of the <code>Setosa</code> iris flowers, while the correlation is somewhat less high for the <code>Virginica</code> and <code>Versicolor</code> flowers.</p>
<p>To normalize the data use the following function.</p>
<pre class="prettyprint">
normMinMax = function(x) {
return((x-min(x))/max(x)-min(x))
}
iris_norm <- as.data.frame(lapply(iris[1:4], normMinMax))
</pre>
<p>Now create <em>training</em> and <em>test</em> data set. Following is random splitting of iris data as 70% train and 30% test data sets.</p>
<pre class="prettyprint">
set.seed(1234)
indexes = sample(2, nrow(iris), replace=TRUE, prob=c(0.7, 0.3))
iris_train = iris[indexes==1, 1:4]
iris_test = iris[indexes==2, 1:4]
iris_train_labels = iris[indexes==1, 5]
iris_test_labels = iris[indexes==2, 5]
</pre>
<p>Load <code>class</code> package and call <code>knn()</code> function to build the model.</p>
<pre class="prettyprint">
#install.packages("class")
library("class")
iris_mdl = knn(train=iris_train, test=iris_test, cl=iris_train_labels, k=3)
</pre>
<p>The function <code>knn</code> takes as its arguments a training data set and a testing data set, which we are now going to create.</p>
<p>You can read about Error rate with varying <em>k</em> <a href="http://www.learnbymarketing.com/tutorials/k-nearest-neighbors-in-r-example/">here</a>.</p>
<p>To evaluate the model, use <code>CrossTable()</code> function in <code>gmodels</code> package.</p>
<pre class="prettyprint">
#install.packages("gmodels")
library("gmodels")
CrossTable(x=iris_test_labels, y=iris_mdl, prop.chisq = FALSE)
</pre>
<p>Output:</p>
<div style="text-align:center; margin-bottom: 5px;">
<a href="http://technokarak.com/complete-tutorial-of-knn-classification-algorithm-using-r-programming.html"><img alt="r_knn_output.png" src="http://en.proft.com.ua/media/science/r_knn_output.png"></a><br/>
</div>
<p>As you can see from the table above the model make one error mistake of predicting as <code>versicolor</code> instead of <code>virginica</code>, whereas rest all the predicted correctly.</p>
<p>Since the data set contains only 150 observations, the model makes only one mistake. There may be more mistakes in huge data. In that case we need to try using different approach like instead of <em>min-max normalization</em> use <em>Z-Score standardization</em> and also vary the values of <em>k</em> to see the impact on accuracy of predicted result.</p>
<p>We would like to determine the misclassification rate for a given value of <em>k</em>. First, build a confusion matrix. Each column of the matrix represents the number of predictions of each class, while each row represents the instances in the actual class.</p>
<pre class="prettyprint">
CM = table(iris_test_labels, iris_mdl)
</pre>
<p>Second, build a <em>diagonal mark quality prediction</em>. Applying the <code>diag</code> function to this table then selects the diagonal elements, i.e., the number of points where <em>k-nearest neighbour classification</em> agrees with the true classification, and the <code>sum</code> command simply adds these values up.</p>
<pre class="prettyprint">
accuracy = (sum(diag(CM)))/sum(CM)
</pre>
<p>The model has a high overall accuracy that is <em>0.9736842</em>.</p>
<p><b class="lb">Pros and Cons of k-NN algorithm</b></p>
<p><strong>Pros</strong>. The algorithm is highly unbiased in nature and makes no prior assumption of the underlying data. Being simple and effective in nature, it is easy to implement and has gained good popularity.</p>
<p><strong>Cons</strong>. Indeed it is simple but <em>k-NN</em> algorithm has drawn a lot of flake for being extremely simple! If we take a deeper look, this doesn’t create a model since there’s no abstraction process involved. Yes, the training process is really fast as the data is stored verbatim (hence lazy learner) but the prediction time is pretty high with useful insights missing at times. Therefore, building this algorithm requires time to be invested in data preparation (especially treating the missing data and categorical features) to obtain a robust model.</p>Sun, 22 Jan 2017 00:00:00 +0200http://en.proft.me/2017/01/22/classification-using-k-nearest-neighbors-r/Charts in R by usagehttp://en.proft.me/2016/12/3/charts-r-usage/<p><img src="http://en.proft.me/media/science/r_charts.png" alt="r_charts.png" class="right" width="100"></p>
<p>Every data mining project is incomplete without proper data visualization. From a functional point of view, the following are the graphs and charts which a data scientist would like the audience to look at to infer the information:</p>
<p>There is simple diagram with suggests how to choose the right chart for presentation</p>
<div style="text-align:center; margin-bottom: 10px;">
<img alt="R charts suggests" src="http://en.proft.me/media/science/r_charts_suggests.png" width="95%" />
</div>
<p>Let's initiate data and colors</p>
<pre class="prettyprint">
dataX <- sample(1:50, 20, replace=T)
dataY <- sample(1:50, 20, replace=T)
dataZ <- sample(1:5, 20, replace=T)
palette <- c("#1f77b4", '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2')
c1 = palette[1]
c2 = palette[2]
c3 = palette[3]
c4 = "#FFFFF"
</pre>
<p><b class="lb">Comparisons between variables</b></p>
<p>Basically, when it is needed to represent two or more properties within a variable, then the following charts are used:</p>
<p><a name="bar"></a>
<strong>Bar chart</strong></p>
<p>A bar plot is a chart with rectangular bars with lengths proportional to the values that they represent. The bars can be plotted either vertically or horizontally. A simple bar chart can be created in R with the <code>barplot</code> function. In the example below, data from the sample "pressure" dataset is used to plot the vapor pressure of Mercury as a function of temperature.</p>
<pre class="prettyprint">
barplot(dataX, border=c3, col=c3)
# change font size
barplot(dataX, border=c3, col=c3, cex.axis=0.9, names.arg = dataX)
# plot line
abline(h=10, col="Red", lty=5)
</pre>
<div style="text-align:center; margin-bottom: 10px;">
<img alt="r_chart_bar.png" src="http://en.proft.me/media/science/r_chart_bar.png" />
</div>
<p><a name="box"></a>
<strong>Box plot</strong></p>
<p>A box plot is a chart that illustrates groups of numerical data through the use of quartiles. A simple box plot can be created in R with the <code>boxplot</code> function.</p>
<pre class="prettyprint">
boxplot(dataX, border=c2, col=c3, cex.axis=0.9)
</pre>
<div style="text-align:center; margin-bottom: 10px;">
<img alt="r_chart_box.png" src="http://en.proft.me/media/science/r_chart_box.png" />
</div>
<p><a name="bubble"></a>
<strong>Bubble chart</strong></p>
<p>The bubble chart is a variant of the scatterplot. Like in the scatterplot, points are plotted on a chart area (typically an x-y grid). Two quantitative variables are mapped to the x and y axes, and a third quantitative variables is mapped to the size of each point.</p>
<pre class="prettyprint">
symbols(dataX, dataY, circles=dataZ, fg=c2, bg=c3, inches=0.5, xlab="Data X")
text(dataX, dataY, dataZ, col=c4)
</pre>
<div style="text-align:center; margin-bottom: 10px;">
<img alt="r_chart_bubble.png" src="http://en.proft.me/media/science/r_chart_bubble.png" />
</div>
<p><a name="hist"></a>
<strong>Histogram</strong></p>
<p>A histogram is a graphical representation of the distribution of data. A histogram represents the frequencies of values of a variable bucketed into ranges. Each bar in histogram represents the height of the number of values present in that range.</p>
<p>It allows you to easily see where a relatively large amount of the data is situated and where there is very little data to be found. In other words, you can see where the middle is in your data distribution, how close the data lie around this middle and where possible outliers are to be found. A simple histogram chart can be created in R with the <code>hist</code> function.</p>
<pre class="prettyprint">
hist(dataX, border=c4, col=c3, xlab="Data X")
# add grid
grid()
</pre>
<div style="text-align:center; margin-bottom: 10px;">
<img alt="r_chart_hist.png" src="http://en.proft.me/media/science/r_chart_hist.png" />
</div>
<p><a name="line"></a>
<strong>Line graph</strong></p>
<p>A line chart is a graph that connects a series of points by drawing line segments between them. These points are ordered in one of their coordinate (usually the x-coordinate) value. Line charts are usually used in identifying the trends in data, thus the line is often drawn chronologically. The <code>plot()</code> function in R is used to create the line graph.</p>
<pre class="prettyprint">
plot(dataX, type="o", col=c3, xlab="Time")
legend("bottomleft", title="DataX vs Time", legend=c("data and time"), fill=c3, horiz=T, bty="n")
</pre>
<p>Single character defines the <code>type</code> of line chart to be plotted.</p>
<ul>
<li>p = Points</li>
<li>l = Line</li>
<li>b = Lines and Points</li>
<li>o = Overplotted</li>
<li>h = "Histogram like" lines</li>
<li>s = Steps</li>
</ul>
<div style="text-align:center; margin-bottom: 10px;">
<img alt="r_chart_line.png" src="http://en.proft.me/media/science/r_chart_line.png" />
</div>
<p><a name="stacked"></a>
<strong>Stacked bar chart</strong></p>
<p>Stacked Barplots, or graphs that depict conditional distributions of data, are great for being able to see a level-wise breakdown of the data. We can create bar chart with groups of bars and stacks in each bar by using a matrix as input values.</p>
<p>More than two variables are represented as a matrix which is used to create the group bar chart and stacked bar chart.</p>
<pre class="prettyprint">
months = c("Mar","Apr","May")
dd = data.frame(v1=dataX, v2=dataY, v3=dataZ)
mm = as.matrix(dd)
barplot(mm[1:3,], main = "Total revenue", names.arg=months, xlab="month", ylab="revenue", col=palette)
</pre>
<div style="text-align:center; margin-bottom: 10px;">
<img alt="r_chart_stacked.png" src="http://en.proft.me/media/science/r_chart_stacked.png" />
</div>
<p><a name="radar"></a>
<strong>Radar chart (Spider chart)</strong></p>
<p>A radar chart is a graphical method of displaying multivariate data in the form of a two-dimensional chart of three or more quantitative variables represented on axes starting from the same point. This makes them useful for seeing which variables have similar values or if there are any outliers amongst each variable. Radar Charts are also useful for seeing which variables are scoring high or low within a dataset, making them ideal for displaying performance.</p>
<p>The radar chart is also known as web chart, spider chart, star chart,[1] star plot, cobweb chart, irregular polygon, polar chart, or Kiviat diagram.</p>
<pre class="prettyprint">
install.packages("fmsb")
library("fmsb")
dd = data.frame(
datax=dataX[(3:5)],
datay=dataY[(3:5)],
dataz=dataZ[(3:5)]
)
colnames(dd) = c("Data X", "Data Y", "Data Z")
maxRow = rep(max(dd), ncol(dd))
minRow = rep(min(dd), ncol(dd))
dd = rbind(maxRow, minRow, dd)
colorsRC = c(rgb(0.2,0.5,0.5,0.4), rgb(0.8,0.2,0.5,0.4), rgb(0.7,0.5,0.1,0.4))
# plot all data
radarchart(dd, vlcex=0.8, pfcol=colorsRC, cglcol="#999999")
# plot one data
radarchart(dd[(1:3),], vlcex=0.8, pfcol=colorsRC, cglcol="#999999")
</pre>
<div style="text-align:center; margin-bottom: 10px;">
<img alt="r_chart_radar.png" src="http://en.proft.me/media/science/r_chart_radar.png" />
</div>
<p><a href="http://blog.scottlogic.com/2011/09/23/a-critique-of-radar-charts.html">A critique of Radar charts</a></p>
<p><a name="pie"></a>
<strong>Pie chart</strong></p>
<p>A pie-chart is a representation of values as slices of a circle with different colors. The slices are labeled and the numbers corresponding to each slice is also represented in the chart. <em>Bar plots</em> typically illustrate the same data but in a format that is simpler to comprehend for the viewer. As such, it's recommend avoiding pie charts if at all possible.</p>
<p>In R the pie chart is created using the <code>pie()</code> function which takes positive numbers as a vector input. The additional parameters are used to control labels, color, title etc.</p>
<pre class="prettyprint">
pie(dataX, col=rainbow(length(x)))
</pre>
<div style="text-align:center; margin-bottom: 10px;">
<img alt="r_chart_pie.png" src="http://en.proft.me/media/science/r_chart_pie.png" />
</div>
<p><b class="lb">Testing/viewing proportions</b></p>
<p>It is used when there is a need to display the proportion of contribution by one category to the overall level:</p>
<ul>
<li><a href="#bubble">Bubble chart</a></li>
<li><em>Bubble map</em> is useful for visualization of data points on a map.</li>
<li><a href="#stacked">Stacked bar chart</a></li>
<li><a href="http://analyticstraining.com/2014/how-to-create-a-word-cloud-in-r/">Word cloud</a></li>
</ul>
<p><b class="lb">Relationship between variables</b></p>
<p>Association between two or more variables can be shown using the following charts:</p>
<p><a name="scatter"></a>
<strong>Scatterplot</strong></p>
<p>A <em>scatter diagram</em> examines the relationships between data collected for two different characteristics. Although the scatter diagram cannot determine the cause of such a relationship, it can show whether or not such a relationship exists, and if so, just how strong it is. The analysis produced by the scatter diagram is called <em>Regression Analysis</em>.</p>
<p>The data is displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis. A simple scatter plot can be created in R with the <code>plot</code> function.</p>
<pre class="prettyprint">
plot(dataX, dataY, pch=19, axes=T, frame.plot=F, col=ifelse(dataX > 25, c1, c2))
</pre>
<div style="text-align:center; margin-bottom: 10px;">
<img alt="r_chart_scatter.png" src="http://en.proft.me/media/science/r_chart_scatter.png" />
</div>
<ul>
<li><a href="#bar">Bar chart</a></li>
<li><a href="#radar">Radar chart</a></li>
<li><a href="#line">Line graph</a></li>
<li><a href="https://nrich.maths.org/7288">Tree diagram</a> is simply a way of representing a sequence of events. Tree diagrams are particularly useful in probability since they record all possible outcomes in a clear and uncomplicated manner.</li>
</ul>
<p><b class="lb">Variable hierarchy</b></p>
<p>When it is required to display the order in the variables, such as a sequence of variables, then the following charts are used:</p>
<ul>
<li>Tree diagram</li>
<li><em>Tree map</em> is similar to a pie chart in that it visually displays proportions by varying the area of a shape. A treemap has two useful advantages over a pie chart. First, you can display a lot more elements. In a pie chart, there is an upper-limit to the number of wedges that can be comfortably added to the circle. In a treemap, you can display hundreds, or thousands, of pieces of information. Secondly, a treemap allows you to arrange your data elements hierarchically. That is, you can group your proportions using categorical variables in your data.</li>
</ul>
<p><b class="lb">Data with locations</b></p>
<p>When a dataset contains the geographic location of different cities, countries, and states names, or longitudes and latitudes, then the following charts can be used to display visualization:</p>
<ul>
<li><a href="#bubble">Bubble map</a></li>
<li>Geo mapping</li>
<li>Dot map</li>
<li>Flow map</li>
</ul>
<p><b class="lb">Contribution analysis or part-to-whole</b></p>
<p>When it is required to display constituents of a variable and contribution of each categorical level towards the overall variable, then the following charts are used:</p>
<ul>
<li><a href="#pie">Pie chart</a></li>
<li><a href="#stacked">Stacked bar chart</a></li>
<li><em>Doughnut chart</em>. Just like a pie chart, a doughnut chart shows the relationship of parts to a whole, but a doughnut chart can contain more than one data series. Each data series that you plot in a doughnut chart adds a ring to the chart. The first data series is displayed in the center of the chart. Doughnut charts display data in rings, where each ring represents a data series. If percentages are displayed in data labels, each ring will total 100%.</li>
</ul>
<p><b class="lb">Statistical distribution</b></p>
<p>In order to understand the variation in a variable across different dimensions, represented by another categorical variable, the following charts are used:</p>
<ul>
<li><a href="#box">Box plot</a></li>
<li><a href="#bubble">Bubble chart</a></li>
<li><a href="#hist">Histogram</a></li>
</ul>
<p><strong>Stem and leaf plot</strong></p>
<p>A stem-and-leaf display is a device for presenting quantitative data in a graphical format, similar to a histogram, to assist in visualizing the shape of a distribution.</p>
<p>Unlike histograms, stem-and-leaf displays retain the original data to at least two significant digits, and put the data in order, thereby easing the move to order-based inference and non-parametric statistics.</p>
<p>A basic stem-and-leaf display contains two columns separated by a vertical line. The left column contains the stems and the right column contains the leaves.</p>
<p>A simple stem-and-leaf can be created in R with the <code>stem</code> function. <code>stem</code> produces a stem-and-leaf plot of the values in <code>x</code>. The parameter <code>scale</code> can be used to expand the scale of the plot.</p>
<pre class="prettyprint">
stem(dataX)
</pre>
<p><b class="lb">Unseen patterns</b></p>
<p>For pattern recognition and relative importance of data points on different dimensions of a variable, the following charts are used:</p>
<ul>
<li><a href="#bar">Bar chart</a></li>
<li><a href="#box">Box plot</a></li>
<li><a href="#bubble">Bubble chart</a></li>
<li><a href="#scatter">Scatterplot</a></li>
<li><em>Spiral plot</em> are ideal for showing large data sets, usually to show trends over a large time period. The graph begins from the center of the spiral and progresses outwards. Spiral Plots are versatile and can use bars, lines or points to be displayed along the spiral path. This makes Spiral Plots great for displaying periodic patterns. Colour can be assigned to each period to break them up and to allow some comparison between each period. So for example, if we were to show data over a year, we could assign a colour for each month on the graph. </li>
<li><a href="#line">Line chart</a></li>
</ul>
<p><b class="lb">Spread of values or range</b></p>
<p>The following charts only give the spread of the data points across different bounds:</p>
<ul>
<li>Span chart</li>
<li><a href="#box">Box plot</a></li>
<li><a href="#hist">Histogram</a></li>
</ul>
<p><b class="lb">Textual data representation</b></p>
<p>This is a very interesting way of representing the textual data:</p>
<ul>
<li>Word cloud</li>
</ul>
<p><b class="lb">Useful links</b></p>
<ul>
<li><a href="http://davetang.org/muse/2015/02/12/animated-plots-using-r/">Animated plots using R</a></li>
</ul>Sat, 03 Dec 2016 00:00:00 +0200http://en.proft.me/2016/12/3/charts-r-usage/Modeling Self Organising Maps in Rhttp://en.proft.me/2016/11/29/modeling-self-organising-maps-r/<p><b class="lb">Introduction</b></p>
<p><img src="http://en.proft.me/media/science/som_logo.png" alt="som_logo.png" class="right" width="120"></p>
<p>The <em>Self-Organizing Maps (SOMs)</em> network is a neural network based method for dimension reduction. <em>SOMs</em> can learn from complex, multidimensional data and transform them into a map of fewer dimensions, such as a two-dimensional plot. The two-dimensional plot provides an easy-to-use graphical user interface to help the decision-maker visualize the similarities between consumer preference patterns.</p>
<p>The SOMs also known as Kohonen Neural Networks is an <a href="http://en.proft.me/2015/12/24/types-machine-learning-algorithms/">unsupervised</a> neural network that projects a <em>N</em>-dimensional data set in a reduced dimension space (usually a uni-dimensional (one) or bi-dimensional (two) map). The projection is topological preserving, that is, the proximity among objects in the input space is approximately preserved in the output space. This process, of reducing the dimensionality of vectors, is essentially a data compression technique known as <em>vector quantisation</em>.</p>
<blockquote>The property of <i>topology preserving</i> means that the mapping preserves the relative distance between the points. Points that are near each other in the input space are mapped to nearby map units in the SOMs. The SOMs can thus serve as a cluster analyzing tool of high-dimensional data.</blockquote>
<p><em>SOMs</em> have been successfully applied as a classification tool to various problem domains, including speech recognition, image data compression, image or character recognition, robot control, and medical diagnosis.</p>
<p>SOMs only operate with numeric variables. Categorical variables must be converted to dummy variables.</p>
<p>The SOMs network typically has two layers of nodes, the input layer (<em>n</em> units, (length of training vectors) and the Kohonen layer (<em>m</em> units, number of categories). The input layer is fully connected to the Kohonen layer. The Kohonen layer, the core of the SOM network, functions similar to biological systems in that it can compress the representation of sparse data and spread out dense data using usually a one- or two-dimensional map. This is done by assigning different subareas of the Kohonen layer to different categories of information and, therefore, the location of the processing element in a network becomes specific to a certain characteristic feature in the set of input data.</p>
<p>Following figure is example of two dimensional SOMs. The network is created from a 2D lattice of 'nodes', each of which is fully connected to the input layer. Figure shows a very small Kohonen network of 4 X 4 nodes connected to the input layer (shown in green) representing a two dimensional vector.</p>
<div style="text-align:center; margin-bottom: 10px;">
<a href="http://www.ai-junkie.com/ann/som/som1.html">
<img alt="som_structure.jpg" src="http://en.proft.me/media/science/som_structure.jpg" />
</a>
</div>
<p>Each node has a specific topological position (an <em>x</em>, <em>y</em> coordinate in the lattice) and contains a vector of weights of the same dimension as the input vectors. That is to say, if the training data consists of vectors, <em>V</em>, of <em>n</em> dimensions:</p>
<p><img src="http://latex.codecogs.com/gif.latex?V_1,%20V_2,%20V_3,%20\cdots,%20V_n" alt=""/></p>
<p>Then each node will contain a corresponding weight vector <em>W</em>, of <em>n</em> dimensions:</p>
<p><img src="http://latex.codecogs.com/gif.latex?W_1,%20W_2,%20W_3,%20\cdots,%20W_n" alt=""/></p>
<p>The lines connecting the nodes in figure above are only there to represent adjacency and do not signify a connection as normally indicated when discussing a <a href="http://en.proft.me/2016/06/15/getting-started-deep-learning-r/">neural network</a>.</p>
<p>The SOMs is composed of a pre-defined string or grid of <em>N</em>-dimensional units <em>c^(i)</em> that forms a <em>competitive layer</em>, where only one unit responds at the presentation of each input sample <i>x<sup>(m)</sup></i>. The activation function is an inverse function of <img src="http://latex.codecogs.com/gif.latex?||x^{(m)}-c^{(i)}||" alt="">, so that the unit that is closest to <i>x<sup>(m)</sup></i> wins the competition. Here <img src="http://latex.codecogs.com/gif.latex?||.||" alt=""> indicates euclidean distance which is the most common way of measuring distance between vectors. The winning unit is then updated according to the relationship</p>
<p><img src="http://latex.codecogs.com/gif.latex?c^{(i)}_{new}=c^{(i)}_{old}+\eta(x^{(m)}-c^{(i)}_{old})" alt=""></p>
<p>where <img src="http://latex.codecogs.com/gif.latex?\eta" alt=""/> is the <em>learning rate</em>. However, unlike other competitive methods, each unit is characterised also by its position in the grid.</p>
<div style="text-align:center; margin-bottom: 10px;">
<img alt="som_structure.png" src="http://en.proft.me/media/science/som_structure.png" />
</div>
<p>Therefore, the learning algorithm updates not only the <em>weights</em> of the <em>winning unit</em> but also the <em>weights</em> of its <em>neighbour units</em> in inverse proportion of their distance. The neighbourhood size of each unit shrinks progressively during the training process, starting with nearly the whole map and ending with the single unit. In this way, the map auto-organises so that the units that are spatially close correspond to patterns that are similar in the original space. These units form receptive areas, named <em>activity bubbles</em>, that are encircled by units (<em>dead units</em>) that never win the competition for the samples of the training data set. However, the dead units are trained to form a connection zone amongst the activity bubbles and could win for samples not included in the training set.</p>
<p>A common example used to help teach the principals behind SOMs is the mapping of colours from their three dimensional components - red, green and blue, into two dimensions. Following figure shows an example of a SOMs trained to recognize the eight different colours shown on the right. The colours have been presented to the network as 3D vectors - one dimension for each of the colour components - and the network has learnt to represent them in the 2D space you can see. Notice that in addition to clustering the colours into distinct regions, regions of similar properties are usually found adjacent to each other. Screenshot of the demo program (left) and the colours it has classified (right).</p>
<div style="text-align:center; margin-bottom: 10px;">
<a href="http://www.ai-junkie.com/ann/som/som1.html">
<img alt="som_example.jpg" src="http://en.proft.me/media/science/som_example.jpg" />
</a>
</div>
<p>Training occurs in several steps and over many iterations:</p>
<ol>
<li>Select the size and type of the map. The shape can be hexagonal or square, depending on the shape of the nodes your require. Typically, hexagonal grids are preferred since each node then has 6 immediate neighbours.</li>
<li>Initialise all weight vectors randomly.</li>
<li>Choose a random datapoint from training data and present it to the SOM.</li>
<li>Find the "Best Matching Unit" (BMU) in the map – the most similar node (winning unit).</li>
<li>Determine the nodes within the "neighbourhood" of the BMU. The size of the neighbourhood decreases with each iteration.</li>
<li>Adjust weights of nodes in the BMU neighbourhood towards the chosen datapoint. The learning rate decreases with each iteration. The magnitude of the adjustment is proportional to the proximity of the node to the BMU.</li>
<li>Repeat Steps 3-6 for N iterations / convergence.</li>
</ol>
<p><b class="lb">SOMs vs k-means</b></p>
<p>It must be noted that SOM and <a href="http://en.proft.me/2016/06/18/exploring-k-means-clustering-analysis-r/">k-means</a> algorithms are rigorously identical when the radius of the neighborhood function in the SOM equals zero (Bodt, Verleysen et al. 1997).</p>
<p>In a sense, SOMs can be thought of as a spatially constrained form of k-means clustering (Ripley 1996). In that analogy, every unit corresponds to a "cluster", and the number of clusters is defined by the size of the grid, which typically is arranged in a rectangular or hexagonal fashion.</p>
<p>SOMs is less prone to local optima than k-means. During <a href="http://www.novaims.unl.pt/labnt/geosom/Public/o1-1_5_lobo05_SOM_kmeans.pdf">tests</a> it is quite evident that the search space is better explored by SOMs.</p>
<p>If you are interested in clustering/quantization then <em>k-means</em> obviously is not the best choice, because it is sensitive to initialization. SOMs provide a more robust learning. <em>k-means</em> is more sensitive to the noise present in the dataset compared to SOMs.</p>
<p><b class="lb">Modeling in R</b></p>
<p>The <a href="http://cran.r-project.org/web/packages/kohonen/kohonen.pdf">kohonen</a> package is a well-documented package in R that facilitates the creation and visualisation of SOMs.</p>
<p>We'll map the 150-sample <em>iris</em> data set to a map of five-by-five hexagonally oriented units.</p>
<pre class="prettyprint">
# install the kohonen package
install.packages("kohonen")
# load the kohonen package
library("kohonen")
# scale data
iris.sc = scale(iris[, 1:4])
# build grid
iris.grid = somgrid(xdim = 5, ydim=5, topo="hexagonal")
# build model
iris.som = som(iris.sc, grid=iris.grid, rlen=100, alpha=c(0.05,0.01))
</pre>
<p>The <code>som()</code> function has several parameters. I mention them briefly below. Default values are available for all of them, except the first, the data.</p>
<ul>
<li><code>grid</code> - the rectangular or hexagonal grid of units. The format is the one returned by the function <code>somgrid</code> from the class package.</li>
<li><code>rlen</code> - the numer of iterations, i.e. the number of times the data set will be presented to the map. The default is 100.</li>
<li><code>alpha</code> - the learning rate, determining the size of the adjustments during training. The decrease is linear, and default values are to start from 0.05 and to stop at 0.01.</li>
</ul>
<p>The result of the training, the <code>iris.som</code> object, is a list. The most important element is the <code>codes</code> element, which contains the <strong>codebook</strong> (the data structure that holds the weight vector for each neuron in the grid) vectors as rows. Another element worth inspecting is <code>changes</code>, a vector indicating the size of the adaptions to the codebook vectors during training.</p>
<p>SOMs visualisation are made up of multiple <em>nodes</em>. Typical SOMs visualisations are of <em>heatmaps</em>. A heatmap shows the distribution of a variable across the SOMs. If we imagine our SOMs as a room full of people that we are looking down upon, and we were to get each person in the room to hold up a coloured card that represents their age – the result would be a SOMs heatmap. People of similar ages would, ideally, be aggregated in the same area. The same can be repeated for age, weight, etc. Visualisation of different heatmaps allows one to explore the relationship between the input variables.</p>
<p>The <code>kohonen.plot</code> function is used to visualise the quality of your generated SOMs and to explore the relationships between the variables in your data set. There are a number <a href="http://www.shanelynn.ie/self-organising-maps-for-customer-segmentation-using-r/">different plot</a> types available. Understanding the use of each is key to exploring your SOMs and discovering relationships in your data.</p>
<p><strong>Training Progress.</strong> As the SOMs training iterations progress, the distance from each node's weights to the samples represented by that node is reduced. Ideally, this distance should reach a minimum plateau. This plot option shows the progress over time. If the curve is continually decreasing, more iterations are required.</p>
<pre class="prettyprint">
plot(iris.som, type="changes")
</pre>
<div style="text-align:center; margin-bottom: 10px;">
<img alt="som_changes.png" src="http://en.proft.me/media/science/som_changes.png" />
</div>
<p>So, the preceding graph explains how the mean distance to the closest unit changes with increase in the iteration level.</p>
<p><strong>Node Counts</strong>. The Kohonen packages allows us to visualise the count of how many samples are mapped to each node on the map. This metric can be used as a measure of map quality – ideally the sample distribution is relatively uniform. Large values in some map areas suggests that a larger map would be beneficial. Empty nodes indicate that your map size is too big for the number of samples. Aim for at least 5-10 samples per node when choosing map size.</p>
<pre class="prettyprint">
plot(iris.som, type="count")
</pre>
<div style="text-align:center; margin-bottom: 10px;">
<img alt="som_count.png" src="http://en.proft.me/media/science/som_count.png" />
</div>
<p><strong>Neighbour Distance</strong>. Often referred to as the <em>U-Matrix</em>, this visualisation is of the distance between each node and its neighbours. Typically viewed with a grayscale palette, areas of low neighbour distance indicate groups of nodes that are similar. Areas with large distances indicate the nodes are much more dissimilar – and indicate natural boundaries between node clusters. The U-Matrix can be used to identify clusters within the SOMs map.</p>
<pre class="prettyprint">
plot(iris.som, type="dist.neighbours")
</pre>
<div style="text-align:center; margin-bottom: 10px;">
<img alt="som_neighbours.png" src="http://en.proft.me/media/science/som_neighbours.png" />
</div>
<p><strong>Codes/Weight vectors</strong>. The node weight vectors, or <em>codebook</em>, are made up of normalised values of the original variables used to generate the SOM. Each node’s weight vector is representative / similar of the samples mapped to that node. By visualising the weight vectors across the map, we can see patterns in the distribution of samples and variables. The default visualisation of the weight vectors is a <em>fan diagram</em>, where individual fan representations of the magnitude of each variable in the weight vector is shown for each node.</p>
<pre class="prettyprint">
plot(iris.som, type="codes")
</pre>
<div style="text-align:center; margin-bottom: 10px;">
<img alt="som_codes.png" src="http://en.proft.me/media/science/som_codes.png" />
</div>
<p><strong>Heatmaps</strong>. A SOM heatmap allows the visualisation of the distribution of a single variable across the map. Typically, a SOM investigative process involves the creation of multiple heatmaps, and then the comparison of these heatmaps to identify interesting areas on the map. It is important to remember that the individual sample positions do not move from one visualisation to another, the map is simply coloured by different variables.</p>
<p>The default Kohonen heatmap is created by using the type <code>heatmap</code>, and then providing one of the variables from the set of node weights. In this case we visualise the average <code>Petal.Width</code> level on the SOM.</p>
<pre class="prettyprint">
coolBlueHotRed <- function(n, alpha = 1) {rainbow(n, end=4/6, alpha=alpha)[n:1]}
plot(iris.som, type = "property", property = iris.som$codes[,4], main=names(iris.som$data)[4], palette.name=coolBlueHotRed)
</pre>
<div style="text-align:center; margin-bottom: 10px;">
<img alt="som_heatmaps.png" src="http://en.proft.me/media/science/som_heatmaps.png" />
</div>
<p><strong>Clustering</strong>. The SOMs is comprised of a collection of codebook vectors connected together in a topological arrangement, typically a one dimensional line or a two dimensional grid. The codebook vectors themselves represent prototypes (points) within the domain, whereas the topological structure imposes an ordering between the vectors during the training process. The result is a low dimensional projection or approximation of the problem domain which may be visualized, or from which clusters may be extracted.</p>
<pre class="prettyprint">
## use hierarchical clustering to cluster the codebook vectors
groups = 3
iris.hc = cutree(hclust(dist(iris.som$codes)), groups)
# plot
plot(iris.som, type="codes", bgcol=rainbow(groups)[iris.hc])
#cluster boundaries
add.cluster.boundaries(iris.som, iris.hc)
</pre>
<div style="text-align:center; margin-bottom: 10px;">
<img alt="som_clustering.png" src="http://en.proft.me/media/science/som_clustering.png" />
</div>
<p><b class="lb">Outcome</b></p>
<p>Learning algorithms such as SOMs use an alternative computational model for data processing. Instead of directly manipulating the data, a SOM adapts to the data, or, using the standard terminology, learns from the data. This computational model offers a certain flexibility, which can be used to develop fast and robust data processing algorithms.</p>
<p>The SOMs was designed for unsupervised learning problems such as feature extraction, visualization and clustering. Some extensions of the approach can label the prepared codebook vectors which can be used for classification.</p>
<p>So, SOMs are an unsupervised data visualisation technique that can be used to visualise high-dimensional data sets in lower (typically 2) dimensional representations.</p>
<p>Also, the SOMs has the capability to generalize. Generalization capability means that the network can recognize or characterize inputs it has never encountered before. A new input is assimilated with the map unit it is mapped to.</p>
<p>Advantages os SOMs include:</p>
<ul>
<li>Intuitive method to develop segmentation.</li>
<li>Relatively simple algorithm, easy to explain results to non-data scientists.</li>
<li>New data points can be mapped to trained model for predictive purposes.</li>
</ul>
<p>Disadvantages include:</p>
<ul>
<li>Lack of parallelisation capabilities for very large data sets.</li>
<li>Difficult to represent very many variables in two dimensional plane.</li>
<li>Requires clean, numeric data.</li>
</ul>Tue, 29 Nov 2016 00:00:00 +0200http://en.proft.me/2016/11/29/modeling-self-organising-maps-r/Apply family in R: avoiding loops on datahttp://en.proft.me/2016/11/16/apply-family-r-avoiding-loops-data/<p><img src="http://en.proft.me/media/science/r_apply.png" alt="r_apply.png" class="right" width="120"></p>
<p>While looping is a great way to iterate through <a href="http://en.proft.me/2014/03/23/r-instrument-modeling/">vectors</a> and perform computations, it is not very efficient when we deal with what is known as <em>Big Data</em>. In this case, R provides some advanced functions:</p>
<ul>
<li><code>lapply()</code> method loops over a list and evaluates a function on each element.</li>
<li><code>sapply()</code> method is a simplified version of <code>lapply()</code>.</li>
<li><code>apply()</code> method evaluates a function on the boundaries or margins of an array.</li>
<li><code>tapply()</code> method evaluates a function over subsets of a vector.</li>
<li><code>mapply()</code> method is a multivariate version of <code>lapply()</code>.</li>
</ul>
<p>These functions allow crossing the data in a number of ways and avoid explicit use of loop constructs. They act on an input list, matrix or array, and apply a named function with one or several optional arguments.</p>
<p><b class="lb">lapply and sapply</b></p>
<p><code>lapply()</code> takes a list and a function as input and evaluates that function over each element of the list. If the input list is not a list, it is converted into a list using the <code>as.list()</code> function before the output is returned. It is much faster than a normal loop because the actual looping is done internally using C code. We look at its example in the following code snippet:</p>
<pre class="prettyprint">
data = list(l1 = 1:10, l2 = 1000:1020)
lapply(data, mean)
</pre>
<p>Coming to <code>sapply()</code>, it is similar to <code>lapply()</code> except that it tries to simplify the results wherever possible. For example, if the final result is such that every element is of length 1, it returns a <em>vector</em>, if the length of every element in the result is the same but more than 1, a <em>matrix</em> is returned, and if it is not able to simplify the results, we get the same result as <code>lapply()</code>. We illustrate the same with the following example:</p>
<pre class="prettyprint">
data = list(l1 = 1:10, l2 = runif(10), l3 = rnorm(10,2))
lapply(data, mean)
sapply(data, mean)
</pre>
<p><b class="lb">apply</b></p>
<p>The <code>apply()</code> function is used to evaluate a function over the margins or boundaries of an array, for instance, applying <em>aggregate functions</em> on the rows (1), columns (2) or both (1:2) of an array. By <em>both</em>, we mean <em>apply the function to each individual value</em>. </p>
<pre class="prettyprint">
data = matrix(rnorm(20), nrow=5, ncol=4)
# row sums
apply(data, 1, sum)
# row means
apply(data, 1, mean)
# col sums
apply(data, 2, sum)
# col means
apply(data, 2, mean)
</pre>
<p>Let's see how many negative numbers each column has, using <code>apply()</code> </p>
<pre class="prettyprint">
apply(data, 2, function(x) length(x[x<0]))
</pre>
<p><b class="lb">tapply</b></p>
<p>The function <code>tapply()</code> is used to evaluate a function over the subsets of any vector. This is similar to applying the <code>GROUP BY</code> construct in SQL if you are familiar with using relational databases. We illustrate the same in the following examples:</p>
<pre class="prettyprint">
data = c(1:10, rnorm(10,2), runif(10))
groups = gl(3,10)
tapply(data, groups, mean)
</pre>
<p><b class="lb">mapply</b></p>
<p>The <code>mapply()</code> function is a multivariate version of <code>lapply()</code> and is used to evaluate a function in parallel over sets of arguments. A simple example is if we have to build a list of vectors using the <code>rep()</code> function, we have to write it multiple times. However, with <code>mapply()</code> we can achieve the same in a more elegant way as illustrated next</p>
<pre class="prettyprint">
list(rep(1,4), rep(2,3), rep(3,2), rep(4,1))
mapply(rep, 1:4, 4:1)
</pre>Wed, 16 Nov 2016 00:00:00 +0200http://en.proft.me/2016/11/16/apply-family-r-avoiding-loops-data/Principal Component Analysis (PCA) in Rhttp://en.proft.me/2016/11/15/principal-component-analysis-pca-r/<p><img src="http://en.proft.me/media/science/pca_r.jpg" alt="pca_r.jpg" class="right" width="120"></p>
<p><strong>Principal Component Analysis (PCA)</strong> is a dimensionality reduction technique that is widely used in data analysis. More concretely, PCA is used to reduce a large number of correlated variables into a smaller set of uncorrelated variables called principal components.</p>
<p><b class="lb">Introduction</b></p>
<p>In many datasets, particularly in the social sciences, you will see many variables highly correlated with each other. It may additionally suffer from high dimensionality or, as it is known, the <em>curse of dimensionality</em>. The goal then is to use PCA in order to create a smaller set of variables that capture most of the information from the original set of variables, thus simplifying the dataset and often leading to hidden insights. These new variables (principal components) are highly uncorrelated with each other.</p>
<blockquote>Dimensionality reduction is a process of features extraction rather than a feature selection process. Feature extraction is a process of transforming the data in the high-dimensional space to a space of fewer dimensions.</blockquote>
<p><em>Why do we need to reduce the dimension of the data?</em> Well sometimes it's just difficult to do analysis on high dimensional data, especially on interpreting it. This is because there are dimensions that aren't significant which adds to our problem on the analysis. So in order to deal with this, we remove those nuisance dimension and deal with the significant one.</p>
<div style="text-align:center; margin-bottom: 10px;">
<img alt="pca_reduction.gif" src="http://en.proft.me/media/science/pca_reduction.gif" width="500"/>
</div>
<p>PCA is the process of finding the principal components. The <em>first principal component</em> in a dataset is the linear combination that captures the maximum variance in the data. A <em>second component</em> is created by selecting another linear combination that maximizes the variance with the constraint that its direction is perpendicular to the first component. The subsequent components (equal to the number of variables) would follow this same rule.</p>
<iframe width="460" height="275" src="https://www.youtube.com/embed/BfTMmoDFXyE" frameborder="0" allowfullscreen style="margin-bottom:20px"></iframe>
<p><b class="lb">Modeling in R</b></p>
<p>There are many packages and functions that can apply PCA in R. In this post I will use the function <code>prcomp()</code> from the <code>stats</code> package. </p>
<p>There are <a href="http://www.sthda.com/english/wiki/principal-component-analysis-in-r-prcomp-vs-princomp-r-software-and-data-mining">several functions</a> <code>prcomp()</code> and <code>princomp()</code> from the built-in R <code>stats</code> package.</p>
<ul>
<li>The function <code>princomp()</code> uses the spectral decomposition approach.</li>
<li>The function <code>prcomp()</code> uses the singular value decomposition (SVD).</li>
</ul>
<p>Function <code>prcomp()</code> performs a singular value decomposition directly on the data matrix this is slightly more accurate than doing an eigenvalue decomposition on the covariance matrix but conceptually equivalent. Therefore, <code>prcomp()</code> is the preferred function.</p>
<p>I will use the classical <em>iris dataset</em> for the demonstration. The data contain four continuous variables which corresponds to physical measures of flowers and a categorical variable describing the flowers' species.</p>
<pre class="prettyprint">
# load and explore data
data(iris)
head(iris, 3)
</pre>
<p>We will apply PCA to the four continuous variables and use the categorical variable to visualize the PCs later. PCA is sensitive to scale, so the data has been scaled with a <code>log()</code> transformation to the continuous variables. This is necessary if the input variables have very different variances, beside <code>log</code> transformation we can use <code>scale()</code> function. Also we set <code>center</code> and <code>scale</code> equal to <code>TRUE</code> in the call to <code>prcomp()</code> to standardize the variables prior to the application of PCA:</p>
<pre class="prettyprint">
# log transform
log.ir = log(iris[, 1:4])
ir.species = iris[, 5]
</pre>
<p>Once you have standardised your variables, you can carry out a principal component analysis using the <code>prcomp()</code> function in R.</p>
<pre class="prettyprint">
ir.pca = prcomp(log.ir, center = TRUE, scale. = TRUE)
</pre>
<p>The <code>prcomp()</code> function returns an object of class <code>prcomp</code>, which have some methods available. The <code>print()</code> method returns the standard deviation of each of the four PCs, and their rotation (or loadings), which are the coefficients of the linear combinations of the continuous variables. The standard deviations here are the square roots of the eigenvalues.</p>
<pre class="prettyprint">
print(ir.pca)
</pre>
<p>The <code>plot()</code> method returns a plot of the variances (y-axis) associated with the PCs (x-axis). The figure below is useful to decide how many PCs to retain for further analysis. In this simple case with only 4 PCs this is not a hard task and we can see that the first two PCs explain most of the variability in the data.</p>
<pre class="prettyprint">
plot(ir.pca, type = "l")
</pre>
<div style="text-align:center; margin-bottom: 10px;">
<img alt="pca_plot_line.png" src="http://en.proft.me/media/science/pca_plot_line.png"/>
</div>
<p>By seeing the graph we could say that the first two principal components are able to explain great part of our data. However isn't there any less subjective way to choose our principal components? Yes, the easiest ways are by using two criteria: <em>Kaiser-Guttman criterion</em> and <em>Broken Stick criterion</em>. They are both based on the eigenvalues, which tell us how much a principal component is able to explain of our initial data set. To get our eigenvalues, just call:</p>
<pre class="prettyprint">
ev = ir.pca$sdev^2
</pre>
<p>Now we can use the eigenvalues to contruct a very useful graph that can be help us to decide which princiapl components we need to choose. Copy and paste following <code>evplot()</code> function into R. I've slightly modify it to look more pretty. Original function <a href="http://www.davidzeleny.net/anadat-r/doku.php/en:numecolr:evplot">here</a>. Then, you can call <code>evplot()</code></p>
<pre class="prettyprint">
evplot = function(ev) {
# Broken stick model (MacArthur 1957)
n = length(ev)
bsm = data.frame(j=seq(1:n), p=0)
bsm$p[1] = 1/n
for (i in 2:n) bsm$p[i] = bsm$p[i-1] + (1/(n + 1 - i))
bsm$p = 100*bsm$p/n
# Plot eigenvalues and % of variation for each axis
op = par(mfrow=c(2,1),omi=c(0.1,0.3,0.1,0.1), mar=c(1, 1, 1, 1))
barplot(ev, main="Eigenvalues", col="bisque", las=2)
abline(h=mean(ev), col="red")
legend("topright", "Average eigenvalue", lwd=1, col=2, bty="n")
barplot(t(cbind(100*ev/sum(ev), bsm$p[n:1])), beside=TRUE,
main="% variation", col=c("bisque",2), las=2)
legend("topright", c("% eigenvalue", "Broken stick model"),
pch=15, col=c("bisque",2), bty="n")
par(op)
}
evplot(ev)
</pre>
<div style="text-align:center; margin-bottom: 10px;">
<img alt="pca_eigenvalues.png" src="http://en.proft.me/media/science/pca_eigenvalues.png"/>
</div>
<p>The first <em>barplot</em> tells us (according to Kaiser-Guttman criterion) that only the first principal component should be chosen (because its bar is above the red line). The same component should be chosen according to the other <em>barplot</em> (broken stick model), because it has the only bar above the red bar. Simple as that.</p>
<p>The <code>summary()</code> method describe the importance of the PCs. The first row describe again the standard deviation associated with each PC. The second row shows the proportion of the variance in the data explained by each component while the third row describe the cumulative proportion of explained variance. We can see there that the first two PCs accounts for more than 95% of the variance of the data.</p>
<pre class="prettyprint">
summary(ir.pca)
</pre>
<p>We can use the <code>predict()</code> function if we observe new data and want to predict their PCs values. Just for illustration pretend the last two rows of the <em>iris data</em> has just arrived and we want to see what is their PCs values:</p>
<pre class="prettyprint">
# predict PCs
predict(ir.pca, newdata=tail(log.ir, 2))
</pre>
<p>You might want to see how our initial variables contribute to two or three principal components, and how observations are explained by components. We can use a <code>ggbiplot()</code> for that. Note that <code>choices</code> is the argument defining which principal components will be used on the <code>ggbiplot()</code> (we will use the first two components).</p>
<pre class="prettyprint">
library(devtools)
install_github("ggbiplot", "vqv")
library(ggbiplot)
ggbiplot(ir.pca, choices=c(1,2), obs.scale = 1, var.scale = 1)
</pre>
<div style="text-align:center; margin-bottom: 10px;">
<img alt="pca_ggbiplot1.png" src="http://en.proft.me/media/science/pca_ggbiplot1.png"/>
</div>
<p>We can also use our fifth variable from the iris data set (the categorical variable <code>iris[,5]</code>) to define groups on the <code>ggbiplot()</code>. In that way, observations that belong to the same Iris' species will be grouped.</p>
<pre class="prettyprint">
ggbiplot(ir.pca,choices=c(1,2), groups=iris[,5], obs.scale = 1, var.scale = 1, ellipse = TRUE)
</pre>
<div style="text-align:center; margin-bottom: 10px;">
<img alt="pca_ggbiplot2.png" src="http://en.proft.me/media/science/pca_ggbiplot2.png"/>
</div>
<p>We can see that <code>Petal.Length</code>, <code>Petal.Width</code> and <code>Sepal.Length</code> are positively explained by the first principal component, while the <code>Sepal.Width</code> is negatively explained by it. Also, the second principal component was only able to negatively explain <code>Sepal.Width</code> and <code>Sepal.Length</code>.</p>
<p>So, advantages of Principal Component Analysis include:</p>
<ul>
<li>Reduces the time and storage space required.</li>
<li>Remove multi-collinearity and improves the performance of the machine learning model.</li>
<li>Makes it easier to visualize the data when reduced to very low dimensions such as 2D or 3D.</li>
</ul>Tue, 15 Nov 2016 00:00:00 +0200http://en.proft.me/2016/11/15/principal-component-analysis-pca-r/