en.proft.me - Blog about Linux, Python, Vim and healthy lifestylehttp://en.proft.me/feed/category/en.proft.me - Blog about Linux, Python, Vim and healthy lifestyle: latest posts by categoriesenFri, 03 Aug 2018 00:00:00 +0300Evaluating data-mining algorithmshttp://en.proft.me/2018/08/3/evaluating-data-mining-algorithms/<p>In this note, you will learn how to evaluate models built using data-mining techniques. The ultimate goal for any data analytics model is to perform well on future data. This objective could be achieved only if we build a model that is efficient and robust during the development stage.</p>
<p>While evaluating any model, the most important things we need to consider are as follows:</p>
<ul>
<li>Whether the model is overfitting or underfitting</li>
<li>How well the model generalizes to future data or test data</li>
</ul>
<p>Underfitting, also known as high bias, is a scenario where the model does not perform well even on the training data. This means we have fitted a model that is not expressive enough for the data. For example, say the data is distributed non-linearly and we fit it with a linear model. From the following image, we see that the data is non-linearly distributed. Assume that we have fitted a linear model (orange line). In this case, the predictive power will be low already at the model-building stage.</p>
<p><em>Overfitting</em> is a scenario where the model performs well on training data but poorly on test data. This arises when the model memorizes the data pattern rather than learning from the data. For example, say the data is distributed in a non-linear pattern, and we have fitted a complex model, shown using the green line. In this case, we observe that the model follows the data distribution very closely, taking care of all the ups and downs. Such a model is very likely to fail on previously unseen data.</p>
<div style="text-align:center; margin-bottom: 10px;">
<img alt="r_evaluating_overfitting.png" src="/media/science/r_evaluating_overfitting.png" width="50%">
</div>
<p>The preceding image shows simple, complex, and appropriately fitted models on the training data. The green fit represents overfitting, the orange line represents underfitting, and the black and blue lines represent an appropriate model, which is a trade-off between underfitting and overfitting.</p>
<p>Any fitted model is evaluated to avoid the previously mentioned scenarios using cross-validation, regularization, pruning, model comparison, ROC curves, confusion matrices, and so on.</p>
<p><strong>Cross validation</strong>. This is a very popular technique for model evaluation for almost all models. In this technique, we divide the data into two datasets: a training dataset and a test dataset. The model is built using the training dataset and evaluated using the test dataset. This process is repeated many times. The test errors are calculated for every iteration. The averaged test error is calculated to generalize the model accuracy at the end of all the iterations.</p>
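<p>As a sketch of this procedure, here is a k-fold cross-validation of a linear model in base R, using the built-in <code>mtcars</code> data (the model and the choice of predictors are only illustrative):</p>
<pre class="prettyprint">
# k-fold cross-validation in base R on the built-in mtcars data
set.seed(42)
k <- 5
# assign every row to one of k folds at random
folds <- sample(rep(1:k, length.out = nrow(mtcars)))
errors <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]
  test  <- mtcars[folds == i, ]
  fit   <- lm(mpg ~ wt + hp, data = train)   # build the model on the training folds
  mean((test$mpg - predict(fit, test))^2)    # test error (MSE) on the held-out fold
})
mean(errors)  # averaged test error generalizes the model's accuracy
</pre>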
<p><strong>Regularization</strong>. In this technique, the variable coefficients are penalized to reduce the complexity of the model, with the objective of minimizing the cost function. The two most popular regularization techniques are ridge regression and lasso regression. Both shrink the variable coefficients toward zero (lasso can shrink some of them exactly to zero), so a smaller number of variables ends up fitting the data.</p>
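<p>A minimal illustration of the shrinkage effect of ridge regression, using its closed-form solution on standardized <code>mtcars</code> predictors (the variable choice is arbitrary):</p>
<pre class="prettyprint">
# ridge regression via the closed-form solution solve(t(X)X + lambda*I, t(X)y);
# increasing lambda shrinks the coefficients toward zero
X <- scale(as.matrix(mtcars[, c("wt", "hp", "disp")]))
y <- mtcars$mpg - mean(mtcars$mpg)
ridge_coef <- function(X, y, lambda) {
  solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
}
ridge_coef(X, y, lambda = 0)     # ordinary least-squares coefficients
ridge_coef(X, y, lambda = 100)   # noticeably shrunken coefficients
</pre>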
<p><strong>Confusion matrix</strong>. This technique is popularly used in evaluating a classification model. We build a confusion matrix using the results of the model. We calculate precision and recall/sensitivity/specificity to evaluate the model.</p>
<p><strong>Precision</strong>. This is the proportion of records classified as positive that are truly positive.</p>
<p><strong>Recall/Sensitivity</strong>. This is the proportion of truly positive records that are classified as positive.</p>
<p><strong>Specificity</strong>. Also known as the true negative rate, this is the proportion of truly negative records that are correctly classified as negative.</p>
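<p>These three measures can be computed directly from predicted and actual labels; a base-R sketch on made-up data:</p>
<pre class="prettyprint">
# precision, recall/sensitivity and specificity from predicted vs actual labels
actual    <- c(1, 1, 1, 0, 0, 0, 1, 0, 1, 0)  # invented ground truth
predicted <- c(1, 0, 1, 0, 0, 1, 1, 0, 1, 0)  # invented model output
tp <- sum(predicted == 1 & actual == 1)  # true positives
fp <- sum(predicted == 1 & actual == 0)  # false positives (false alarms)
fn <- sum(predicted == 0 & actual == 1)  # false negatives (misses)
tn <- sum(predicted == 0 & actual == 0)  # true negatives
precision   <- tp / (tp + fp)  # predicted positives that are truly positive
recall      <- tp / (tp + fn)  # truly positive records that were found
specificity <- tn / (tn + fp)  # truly negative records correctly rejected
c(precision = precision, recall = recall, specificity = specificity)
</pre>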
<p>A confusion matrix shown in the following image is constructed using the results of classification models discussed.</p>
<div style="text-align:center; margin-bottom: 10px;">
<img alt="r_confusion_matrix.png" src="/media/science/r_confusion_matrix.png" width="55%">
</div>
<p>Let's understand the confusion matrix:</p>
<ul>
<li><strong>TRUE POSITIVE (TP)</strong>. This is a count of all the responses where the actual response is positive and the model's prediction is positive.</li>
<li><strong>FALSE POSITIVE (FP)</strong>. This is a count of all the responses where the actual response is negative, but the model's prediction is positive. It is, in general, a FALSE ALARM.</li>
<li><strong>FALSE NEGATIVE (FN)</strong>. This is a count of all the responses where the actual response is positive, but the model's prediction is negative. It is, in general, A MISS.</li>
<li><strong>TRUE NEGATIVE (TN)</strong>. This is a count of all the responses where the actual response is negative, and the model's prediction is negative.</li>
</ul>Fri, 03 Aug 2018 00:00:00 +0300http://en.proft.me/2018/08/3/evaluating-data-mining-algorithms/An overview of a Market Basket Analysis (Association Mining) in Rhttp://en.proft.me/2018/07/20/market-basket-analysis-association-mining-r/<p>Market basket analysis is a data mining technique that has the purpose of finding the optimal combination of products or services and allows marketers to exploit this knowledge to provide recommendations, optimize product placement, or develop marketing programs that take advantage of cross-selling. In short, the idea is to identify which items go well together, and profit from it.</p>
<p>You can think of the results of the analysis as an <code>if ... then</code> statement. If a customer buys an airplane ticket, then there is a 46 percent probability that they will buy a hotel room, and if they go on to buy a hotel room, then there is a 33 percent probability that they will rent a car.</p>
<blockquote>There are many ways to see the similarities between items. These are techniques that fall under the general umbrella of <b>association</b>. The outcome of this type of technique, in simple terms, is a set of rules that can be understood as "if this, then that".</blockquote>
<p>However, it is not just for sales and marketing. It is also being used in fraud detection and healthcare; for example, if a patient undergoes treatment A, then there is a 26 percent probability that they might exhibit symptom X. The following measures are used to evaluate the strength of association. Suppose, you are interested in the association between two events A and B:</p>
<ul>
<li><strong>Itemset</strong>: This is a collection of one or more items in the dataset.</li>
<li><strong>Support</strong>: This is the proportion of the transactions in the data that contain an itemset of interest. <em>Support = Number of Rows having both A AND B / Total Number of Rows</em>.</li>
<li><strong>Confidence</strong>: This is the conditional probability that if a person purchases or does <em>x</em>, they will purchase or do <em>y</em>; the act of doing <em>x</em> is referred to as the antecedent or Left-Hand Side (LHS), and <em>y</em> is the consequence or Right-Hand Side (RHS). <em>Confidence = Number of Rows having both A AND B / Number of Rows with A</em>. <em>Expected Confidence = Number of rows with B / Total Number of Rows</em>.</li>
<li><strong>Lift</strong>: This is the probability of <em>x</em> and <em>y</em> occurring together divided by the probability that <em>x</em> and <em>y</em> would occur together if they were independent; for example, if the probability of <em>x</em> and <em>y</em> occurring together is 10 percent, the probability of <em>x</em> is 20 percent, and the probability of <em>y</em> is 30 percent, then the lift is 0.10 / (0.20 * 0.30), which is about 1.67. <em>Lift = Confidence / Expected Confidence</em>.</li>
<li><strong>Rule</strong>: A rule is a notation that represents which item/s is frequently bought with what item/s. It has an LHS and an RHS part and can be represented as follows: <strong>itemset A => itemset B</strong>. This means, the item/s on the right were frequently purchased along with items on the left.</li>
</ul>
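<p>The measures above can be verified by hand on a small invented basket matrix (the items and values are purely illustrative):</p>
<pre class="prettyprint">
# support, confidence and lift computed by hand; rows are transactions,
# columns are items (1 = purchased); the data is purely illustrative
baskets <- matrix(c(1, 1, 0,
                    1, 1, 1,
                    1, 0, 0,
                    0, 1, 1,
                    1, 1, 0),
                  ncol = 3, byrow = TRUE,
                  dimnames = list(NULL, c("beer", "wine", "liquor")))
n <- nrow(baskets)
support_AB    <- sum(baskets[, "beer"] & baskets[, "wine"]) / n  # P(A and B)
confidence    <- support_AB / (sum(baskets[, "beer"]) / n)       # P(B | A)
expected_conf <- sum(baskets[, "wine"]) / n                      # P(B)
lift          <- confidence / expected_conf
lift  # > 1 would mean beer and wine co-occur more often than chance predicts
</pre>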
<p>The package in R that you can use to perform a market basket analysis is <code>arules</code>: Mining Association Rules and Frequent Itemsets. The package offers two different methods of finding rules. The algorithms that the package supports are <em>apriori</em> and <em>ECLAT</em>. There are other algorithms to conduct a market basket analysis, but <em>apriori</em> is used most frequently, and so, that will be our focus.</p>
<p>With <em>apriori</em>, the principle is that, if an itemset is frequent, then all of its subsets must also be frequent. A minimum frequency (support) is determined by the analyst prior to executing the algorithm, and once established, the algorithm will run as follows:</p>
<ul>
<li>Let <em>k = 1</em> (the itemset length)</li>
<li>Generate all itemsets of length <em>k</em> whose support is equal to or greater than the specified minimum support</li>
<li>Repeat for <em>k + 1, ..., n</em>, pruning candidate itemsets that are infrequent (below the minimum support)</li>
<li>Stop the iteration when no new frequent itemsets are identified</li>
</ul>
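<p>The steps above can be sketched as a level-wise search in plain R on a toy transaction list (invented data, not the Groceries dataset used later):</p>
<pre class="prettyprint">
# level-wise frequent-itemset search, following the steps above (toy data)
transactions <- list(c("milk", "bread"), c("milk", "beer"),
                     c("milk", "bread", "beer"), c("bread"), c("milk", "bread"))
minsup <- 2  # minimum absolute support (number of transactions)
count <- function(itemset)
  sum(sapply(transactions, function(tr) all(itemset %in% tr)))
items <- sort(unique(unlist(transactions)))
level <- as.list(items[sapply(items, count) >= minsup])  # frequent 1-itemsets
frequent <- list()
k <- 1
while (length(level) > 0) {
  frequent <- c(frequent, level)
  if (length(level) < 2) break
  # candidate (k+1)-itemsets: unions of pairs of frequent k-itemsets
  cands <- unique(lapply(combn(seq_along(level), 2, simplify = FALSE),
                         function(ix) sort(union(level[[ix[1]]], level[[ix[2]]]))))
  cands <- Filter(function(s) length(s) == k + 1, cands)
  level <- Filter(function(s) count(s) >= minsup, cands)  # prune infrequent
  k <- k + 1
}
frequent  # all frequent itemsets found
</pre>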
<p>Once you have an ordered summary of the most frequent itemsets, you can continue the analysis process by examining the confidence and lift in order to identify the associations of interest.</p>
<p>For our example, we will focus on identifying the association rules for a grocery store. The dataset will be from the <em>arules</em> package and is called <em>Groceries</em>. This dataset consists of actual transactions over a 30-day period from a real-world grocery store and consists of 9,835 different purchases. All the items purchased are put into one of 169 categories, for example, bread, wine, meat, and so on.</p>
<p>For this analysis, we will only need to load two packages, as well as the Groceries dataset:</p>
<pre class="prettyprint">
#install.packages("arules")
#install.packages("arulesViz")
library(arules)
library(arulesViz)
data(Groceries)
head(Groceries)
</pre>
<p>This dataset is structured as a <em>sparse matrix</em> object, known as the <code>transactions</code> class.</p>
<p>So, once the data is in the <code>transactions</code> class, our standard exploration techniques will not work, but the <em>arules</em> package offers us other techniques to explore the data.</p>
<p>The best way to explore this data is with an item frequency plot using the <code>itemFrequencyPlot()</code> function in the <code>arules</code> package. You will need to specify the transaction dataset, the number of items with the highest frequency to plot, and whether or not you want the relative or absolute frequency of the items. Let's first look at the absolute frequency and the top 10 items only:</p>
<pre class="prettyprint">
itemFrequencyPlot(Groceries, topN = 10, type = "absolute")
</pre>
<p>The output of the preceding command is as follows:</p>
<div style="text-align:center; margin-bottom: 10px;">
<img alt="r_arules_FrequencyPlot.png" src="/media/science/r_arules_FrequencyPlot.png" width="60%">
</div>
<p>The top item purchased was <em>whole milk</em>, appearing in roughly 2,500 of the 9,835 transactions.</p>
<p>Look at the first five transactions:</p>
<pre class="prettyprint">
inspect(Groceries[1:5])
</pre>
<p><b class="lb">Modeling and evaluation</b></p>
<p>Throughout the modeling process, we will use the <em>apriori algorithm</em>, which is the appropriately named <code>apriori()</code> function in the <code>arules</code> package. The two main things that we will need to specify in the function are the dataset and the parameters. As for the parameters, you will need to apply judgment when specifying the minimum support, confidence, and the minimum and/or maximum length of basket items in an itemset.</p>
<p>Using the item frequency plots, along with trial and error, let's set the minimum support at 1 in 1,000 transactions and minimum confidence at 90 percent. Additionally, let's establish the maximum number of items to be associated as four. The following is the code to create the object that we will call <code>rules</code></p>
<pre class="prettyprint">
rules <- apriori(Groceries, parameter = list(supp = 0.001, conf = 0.9, maxlen = 4))
</pre>
<p>You will always have to pass the minimum required <em>support</em> and <em>confidence</em>.</p>
<ul>
<li>We set the minimum support to 0.001</li>
<li>We set the minimum confidence of 0.9</li>
<li>We set the maximum number of items per rule (<code>maxlen</code>) to 4</li>
</ul>
<p>Calling the object shows how many rules the algorithm produced.</p>
<p>There are a number of ways to examine the rules. The first thing that I recommend is to set the number of displayed digits to only two, with the <code>options()</code> function in base R. Then, sort and inspect the top five rules based on the lift that they provide, as follows:</p>
<pre class="prettyprint">
options(digits = 2)
rules <- sort(rules, by = "lift", decreasing = TRUE)
inspect(rules[1:5])
</pre>
<p>The rule that provides the best overall lift is the purchase of <em>liquor</em> and <em>red wine</em> leading to the purchase of <em>bottled beer</em>.</p>
<p>You can also sort by the support and confidence, so let's have a look at the first 5 rules <code>by="confidence"</code> in descending order, as follows:</p>
<pre class="prettyprint">
rules <- sort(rules, by = "confidence", decreasing = TRUE)
inspect(rules[1:5])
</pre>
<p>You can see in the table that <em>confidence</em> for these transactions is 100 percent. Moving on to our specific study of beer, we can utilize a function in <code>arules</code> to develop cross tabulations - the <code>crossTable()</code> function - and then examine whatever suits our needs. The first step is to create a table with our dataset:</p>
<pre class="prettyprint">
tab <- crossTable(Groceries)
</pre>
<p>With <code>tab</code> created, we can now examine the joint occurrences between the items.</p>
<p>As you might imagine, shoppers selected liver loaf only 50 times out of the 9,835 transactions. Additionally, of the 924 transactions that included sausage, 10 also included liver loaf. If you want to look at a specific example, you can either specify the row and column number or just spell the items out:</p>
<pre class="prettyprint">
tab["bottled beer","bottled beer"]
</pre>
<p>This tells us that there were 792 transactions containing bottled beer. Let's see what the joint occurrence between bottled beer and canned beer is:</p>
<pre class="prettyprint">
tab["bottled beer","canned beer"]
</pre>
<p>We can now move on and derive specific rules for <em>bottled beer</em>. We will again use the <code>apriori()</code> function, but this time, we will add a syntax around <code>appearance</code>. This means that we will specify in the syntax that we want the left-hand side to be items that increase the probability of a purchase of <em>bottled beer</em>, which will be on the right-hand side. In the following code, notice that I've adjusted the support and confidence numbers. Feel free to experiment with your own settings:</p>
<pre class="prettyprint">
beer.rules <- apriori(data = Groceries,
                      parameter = list(support = 0.0015, confidence = 0.3),
                      appearance = list(default = "lhs", rhs = "bottled beer"))
</pre>
<p>We find ourselves with only 4 association rules. We have seen one of them already. Now let's bring in the other three rules in descending order by lift:</p>
<pre class="prettyprint">
beer.rules <- sort(beer.rules, decreasing = TRUE, by = "lift")
inspect(beer.rules)
</pre>
<p>In all of the instances, the purchase of <em>bottled beer</em> is associated with booze, either <em>liquor</em> and/or <em>red wine</em>, which is no surprise to anyone. What is interesting is that <em>white wine</em> is not in the mix here. Let's take a closer look at this and compare the joint occurrences of <em>bottled beer</em> and types of wine:</p>
<pre class="prettyprint">
tab["bottled beer", "red/blush wine"]
[1] 48
tab["red/blush wine", "red/blush wine"]
[1] 189
48/189
[1] 0.25
tab["white wine", "white wine"]
[1] 187
tab["bottled beer", "white wine"]
[1] 22
22/187
[1] 0.12
</pre>
<p>It's interesting that 25 percent of the time, when someone purchased <em>red wine</em>, they also purchased <em>bottled beer</em>. But with <em>white wine</em>, a joint purchase only happened in 12 percent of the instances. We certainly don't know why in this analysis, but this could potentially help us to determine how we should position our product in this grocery store.</p>
<p>Another thing before we move on is to look at a plot of the rules. This is done with the <code>plot()</code> function in the <code>arulesViz</code> package. There are many graphic options available. For this example, let's specify that we want a <em>graph</em>, showing lift, and the rules provided and shaded by <em>confidence</em>. The following syntax will provide this accordingly:</p>
<pre class="prettyprint">
plot(beer.rules, method="graph", measure="lift", shading="confidence")
</pre>
<p>The following is the output of the preceding command:</p>
<div style="text-align:center; margin-bottom: 10px;">
<img alt="r_arules_graph.png" src="/media/science/r_arules_graph.png" width="80%">
</div>
<p>This graph shows that <em>liquor/red wine</em> provides the best <em>lift</em> and the highest level of <em>confidence</em>, indicated by both the size of the circle and its shading.</p>
<p><b class="lb">Find what factors influenced an event X</b></p>
<p>Suppose we want to find out what customers had purchased before buying <em>whole milk</em>. This will help you understand the patterns that led to the purchase of <em>whole milk</em>.</p>
<pre class="prettyprint">
rules <- apriori(data = Groceries, parameter = list(supp = 0.001, conf = 0.08),
                 appearance = list(default = "lhs", rhs = "whole milk"),
                 control = list(verbose = FALSE))
</pre>
<p><b class="lb">Find out what events were influenced by a given event</b></p>
<p>In this case we ask what customers who bought <em>whole milk</em> also bought. In the rule, <em>whole milk</em> is on the LHS (left-hand side).</p>
<pre class="prettyprint">
rules <- apriori(data = Groceries, parameter = list(supp = 0.001, conf = 0.15, minlen = 2),
                 appearance = list(default = "rhs", lhs = "whole milk"),
                 control = list(verbose = FALSE))
</pre>Fri, 20 Jul 2018 00:00:00 +0300http://en.proft.me/2018/07/20/market-basket-analysis-association-mining-r/Machine Learning Frameworkshttp://en.proft.me/2018/06/21/machine-learning-frameworks/<p>One of the most important advances in practical machine learning involved the creation of frameworks. <em>Frameworks</em> automate many aspects of developing machine learning applications, and they allow developers to re-use code and take advantage of best practices. This discussion introduces five of the most popular frameworks: Torch, Theano, Caffe, Keras, and TensorFlow.</p>
<p><strong>Torch</strong></p>
<p>Torch is the first machine learning framework to attract a significant following. Originally released in 2002 by Ronan Collobert, it began as a toolset for numeric computing. Torch’s computations involve multidimensional arrays called <em>tensors</em>, which can be processed with regular vector/matrix operations. Over time, Torch acquired routines for building, training, and evaluating neural networks.</p>
<p>Torch garnered a great deal of interest from academics and corporations like IBM and Facebook. But its adoption has been limited by its reliance on Lua as its interface language. The other frameworks in this discussion (Theano, Caffe, Keras, and TensorFlow) can be interfaced through Python, which has emerged as the language of choice in the machine learning domain.</p>
<p><strong>Theano</strong></p>
<p>In 2010, a machine learning group at the University of Montreal released Theano, a library for numeric computation. Like NumPy, Theano provides a wide range of Python routines for operating on multidimensional arrays. Unlike NumPy, Theano stores operations in a data structure called a <em>graph</em>, which it compiles into high-performance code. Theano also supports <em>symbolic differentiation</em>, which makes it possible to find derivatives of functions automatically.</p>
<p>Because of its high performance and symbolic differentiation, many machine learning developers have adopted Theano as their numeric computation toolset of choice. Developers particularly appreciate Theano’s ability to execute graphs on graphics processing units (GPUs) as well as central processing units (CPUs).</p>
<p><strong>Caffe</strong></p>
<p>As part of his PhD dissertation at UC Berkeley, Yangqing Jia created Caffe, a framework for developing image recognition applications. As others joined in the development, Caffe expanded to support other machine learning algorithms and many different types of neural networks.</p>
<p>Caffe is written in C++, and like Theano, it supports GPU acceleration. This emphasis on performance has endeared Caffe to many academic and corporate developers. Facebook has become particularly interested in Caffe, and in 2017 it released a reworked version called Caffe2. This version improves Caffe’s performance and makes executing applications on smartphones possible.</p>
<p><strong>Keras</strong></p>
<p>While other offerings focus on performance and breadth of capabilities, Keras is concerned with modularity and simplicity of development. François Chollet created Keras as an interface to other machine learning frameworks, and many developers access Theano through Keras to combine Keras’s simplicity with Theano’s performance.</p>
<p>Keras’s simplicity stems from its small API and intuitive set of functions. These functions focus on accomplishing standard tasks in machine learning, which makes Keras ideal for newcomers to the field but of limited value for those who want to customize their operations.</p>
<p>François Chollet released Keras under the MIT License, and Google has incorporated his interface into TensorFlow. For this reason, many TensorFlow developers prefer to code their neural networks using Keras.</p>
<p><strong>TensorFlow</strong></p>
<p>This section centers on TensorFlow, Google’s gift to the world of machine learning. The Google Brain team open-sourced TensorFlow in 2015, and as of the time of this writing, the current version is 1.4. It’s provided under the Apache 2.0 open source license, which means you’re free to use it, modify it, and distribute your modifications.</p>
<p>TensorFlow’s primary interface is Python, but like Caffe, its core functionality is written in C++ for improved performance. Like Theano, TensorFlow stores operations in a graph that can be deployed to a GPU, a remote system, or a network of remote systems. In addition, TensorFlow provides a utility called TensorBoard, which makes visualizing graphs and their operations possible.</p>
<p>Like other frameworks, TensorFlow supports execution on CPUs and GPUs. In addition, TensorFlow applications can be executed on the Google Cloud Platform (GCP). The GCP provides world-class processing power at relatively low cost, and in my opinion, GCP processing is TensorFlow’s most important advantage.</p>Thu, 21 Jun 2018 00:00:00 +0300http://en.proft.me/2018/06/21/machine-learning-frameworks/How to convert a factor to numeric in Rhttp://en.proft.me/2017/03/14/how-convert-factor-numeric-r/<p>Sometimes you need to explicitly convert factors to either text or numbers. To do this, you use the functions <code>as.character()</code> or <code>as.numeric()</code>. First, convert your data vector into a factor or use an existing factor such as <code>iris$Species</code> from the <em>iris</em> dataset.</p>
<pre class="prettyprint">
v = c("North", "East", "South", "South")
vf <- factor(v)
</pre>
<p>Use <code>as.character()</code> to convert a factor to a character vector</p>
<pre class="prettyprint">
as.character(vf)
</pre>
<p>Use <code>as.numeric()</code> to convert a factor to a numeric vector. Note that this will return the numeric codes that correspond to the factor levels. </p>
<pre class="prettyprint">
as.numeric(vf)
# or from iris dataset
as.numeric(factor(iris$Species))
</pre>
<p>Be very careful when you convert factors with numeric levels to a numeric vector. The results may not be what you expect.</p>
<p>For example, imagine you have a vector that indicates some test score results with the values <code>c(9, 8, 10, 8, 9)</code>, which you convert to a factor</p>
<pre class="prettyprint">
numbers = factor(c(9, 8, 10, 8, 9))
</pre>
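<p>If what you want are the original numeric values rather than the internal codes, the usual base-R idiom is to go through <code>as.character()</code> first:</p>
<pre class="prettyprint">
# converting a factor with numeric levels back to the original values
numbers <- factor(c(9, 8, 10, 8, 9))
as.numeric(numbers)                   # internal codes: 2 1 3 1 2
as.numeric(as.character(numbers))     # original values: 9 8 10 8 9
as.numeric(levels(numbers))[numbers]  # equivalent, and slightly faster
</pre>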
<p>To look at the internal representation of numbers, use <code>str()</code></p>
<pre class="prettyprint">
str(numbers)
</pre>
<p>This indicates that R stores the values as <code>c(2, 1, 3, 1, 2)</code> with associated levels of <code>c("8", "9", "10")</code>.</p>Tue, 14 Mar 2017 00:00:00 +0200http://en.proft.me/2017/03/14/how-convert-factor-numeric-r/Density-based clustering in Rhttp://en.proft.me/2017/02/3/density-based-clustering-r/<p><img src="http://en.proft.me/media/science/r_dbc_logo.jpg" alt="Density-based clustering in R" class="right" width="190">
<b class="lb">Introduction</b></p>
<p>Clustering is a multivariate analysis used to group similar objects (close in terms of distance) together in the same group (cluster). Unlike <a href="http://en.proft.me/2015/12/24/types-machine-learning-algorithms/">supervised learning</a> methods (for example, classification and regression) a clustering analysis does not use any label information, but simply uses the similarity between data features to group them into clusters.</p>
<p>Clustering does not refer to a specific algorithm; it is a process of creating groups based on a similarity measure. Clustering analysis uses <a href="http://en.proft.me/2015/12/24/types-machine-learning-algorithms/">unsupervised learning</a> algorithms to create clusters.</p>
<p>Clustering algorithms generally work on the simple principle of maximizing intracluster similarity and minimizing intercluster similarity. The measure of similarity determines how the clusters are formed.</p>
<p>Similarity is a characterization of the ratio of the number of attributes two objects share in common compared to the total list of attributes between them. Objects which have everything in common are identical, and have a similarity of 1.0. Objects which have nothing in common have a similarity of 0.0.</p>
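<p>This notion of similarity (shared attributes over all distinct attributes) is the Jaccard index, which is a one-liner in R:</p>
<pre class="prettyprint">
# Jaccard similarity: attributes shared over all distinct attributes
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))
jaccard(c("x", "y", "z"), c("x", "y", "z"))  # identical objects: 1
jaccard(c("x", "y"), c("z"))                 # nothing in common: 0
</pre>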
<p>Clustering can be widely adapted in the analysis of businesses. For example, a marketing department can use clustering to segment customers by personal attributes. As a result of this, different marketing campaigns targeting various types of customers can be designed.</p>
<p>Clustering model is a notion used to signify what kind of clusters we are trying to identify. The four most common models of clustering methods are hierarchical clustering, k-means clustering, model-based clustering, and density-based clustering:</p>
<ul>
<li><a href="http://en.proft.me/2017/01/29/exploring-hierarchical-clustering-r/">Hierarchical clustering</a>. It creates a hierarchy of clusters, and presents the hierarchy in a dendrogram. This method does not require the number of clusters to be specified at the beginning. Distance connectivity between observations is the measure.</li>
<li><a href="http://en.proft.me/2016/06/18/exploring-k-means-clustering-analysis-r/">k-means clustering</a>. It is also referred to as flat clustering. Unlike hierarchical clustering, it does not create a hierarchy of clusters, and it requires the number of clusters as an input. However, its performance is faster than hierarchical clustering. Distance from mean value of each observation/cluster is the measure.</li>
<li><a href="http://en.proft.me/2017/02/1/model-based-clustering-r/">Model-based clustering</a> (or <em>Distribution models</em>). Both hierarchical clustering and k-means clustering use a heuristic approach to construct clusters, and do not rely on a formal model. Model-based clustering assumes a data model and applies an EM algorithm to find the most likely model components and the number of clusters. Significance of statistical distribution of variables in the dataset is the measure.</li>
<li><a href="http://en.proft.me/2017/02/3/density-based-clustering-r/">Density-based clustering</a>. It constructs clusters in regard to the density measurement. Clusters in this method have a higher density than the remainder of the dataset. Density in data space is the measure.</li>
</ul>
<p>A good clustering algorithm can be evaluated based on two primary objectives:</p>
<ul>
<li>High intra-class similarity</li>
<li>Low inter-class similarity</li>
</ul>
<p>Density-based clustering uses the idea of <em>density reachability</em> and <em>density connectivity</em> (as an alternative to <em>distance measurement</em>), which makes it very useful in discovering a cluster in nonlinear shapes. This method finds an area with a higher density than the remaining area. One of the most famous methods is <em>Density-Based Spatial Clustering of Applications with Noise</em> (<em>DBSCAN</em>). It uses the concept of <em>density reachability</em> and <em>density connectivity</em>.</p>
<ul>
<li><em>Density reachability</em>. A point <em>p</em> is said to be density reachable from a point <em>q</em> if <em>p</em> is within distance <em>e</em> of <em>q</em> and <em>q</em> has a sufficient number of points (at least <em>minPts</em>) within distance <em>e</em> of it.</li>
<li><em>Density connectivity</em>. Points <em>p</em> and <em>q</em> are said to be density connected if there exists a point <em>r</em> that has a sufficient number of points in its neighborhood and both <em>p</em> and <em>q</em> are reachable from it. This is a chaining process: if <em>q</em> is a neighbor of <em>r</em>, <em>r</em> is a neighbor of <em>s</em>, and <em>s</em> is a neighbor of <em>t</em>, which in turn is a neighbor of <em>p</em>, then <em>q</em> is connected to <em>p</em>.</li>
</ul>
<p>The Density-based clustering algorithm DBSCAN is a fundamental data clustering technique for finding arbitrary shape clusters as well as for detecting outliers.</p>
<p>Unlike K-Means, DBSCAN does not require the number of clusters as a parameter. Rather it infers the number of clusters based on the data, and it can discover clusters of arbitrary shape (for comparison, K-Means usually discovers spherical clusters). Partitioning methods (K-means, PAM clustering) and hierarchical clustering are suitable for finding <em>spherical-shaped clusters</em> or <em>convex clusters</em>. In other words, they work well for compact and well separated clusters. Moreover, they are also severely affected by the presence of noise and outliers in the data.</p>
<p>Advantages of Density-based clustering are</p>
<ol>
<li>No assumption on the number of clusters. The number of clusters is often unknown in advance. Furthermore, in an evolving data stream, the number of natural clusters is often changing.</li>
<li>Discovery of clusters with arbitrary shape. This is very important for many data stream applications.</li>
<li>Ability to handle outliers (resistant to noise).</li>
</ol>
<p>Disadvantages of Density-based clustering are</p>
<ol>
<li>If there is large variation in density, points in sparser regions may be wrongly flagged as noise</li>
<li>Sensitive to parameters i.e. hard to determine the correct set of parameters.</li>
<li>The quality of DBSCAN depends on the distance measure.</li>
<li>DBSCAN cannot cluster data sets well with large differences in densities.</li>
</ol>
<p>This algorithm works on a parametric approach. The two parameters involved in this algorithm are:</p>
<ul>
<li><em>e</em> (<em>eps</em>) is the radius of our neighborhoods around a data point <em>p</em>.</li>
<li><em>minPts</em> is the minimum number of data points we want in a neighborhood to define a cluster.</li>
</ul>
<p>One limitation of DBSCAN is that it is sensitive to the choice of <em>e</em>, in particular if clusters have different densities. If <em>e</em> is too small, sparser clusters will be defined as noise. If <em>e</em> is too large, denser clusters may be merged together.</p>
<div style="text-align:center;">
<img alt="r_dbc_concept.png" src="http://en.proft.me/media/science/r_dbc_concept.png"><br/>
</div>
<p>Once these parameters are defined, the algorithm classifies the data points into three categories:</p>
<ul>
<li><em>Core points</em>. A point <em>p</em> is a core point if at least <em>minPts</em> points (including <em>p</em> itself) are within distance <em>e</em> of it.</li>
<li><em>Border points</em>. A point <em>q</em> is a border point if it is reachable from a core point <em>p</em> but is not a core point itself: there is a path <i>p<sub>1</sub></i>, <em>...</em>, <i>p<sub>n</sub></i> with <i>p<sub>1</sub> = p</i> and <i>p<sub>n</sub> = q</i>, where each <i>p<sub>i+1</sub></i> is directly reachable from <i>p<sub>i</sub></i> (all the points on the path must be core points, with the possible exception of <em>q</em>).</li>
<li><em>Outliers</em>. All points not reachable from any other point are outliers.</li>
</ul>
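<p>For a small dataset, these three categories can be computed directly in base R. The sketch below is illustrative (the toy data, <code>eps</code>, and <code>minPts</code> values are assumptions, not from the original post): it counts the <em>e</em>-neighbors of every point in a distance matrix.</p>
<pre class="prettyprint">
# Toy data: two dense blobs plus one obvious outlier
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.2), ncol = 2),
           matrix(rnorm(40, mean = 2, sd = 0.2), ncol = 2),
           c(5, 5))
eps <- 0.5; minPts <- 4
d <- as.matrix(dist(x))                  # pairwise Euclidean distances
# A point is core if its eps-neighborhood (including itself) holds >= minPts points
n_neighbors <- rowSums(d <= eps)
core <- n_neighbors >= minPts
# Non-core points with a core point within eps are border; the rest are outliers
border <- !core & apply(d <= eps, 1, function(nb) any(nb & core))
outlier <- !core & !border
table(core = core, border = border, outlier = outlier)
</pre>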
<div style="text-align:center;">
<img alt="r_dbc_points.png" src="http://en.proft.me/media/science/r_dbc_points.png"><br/>
</div>
<p>The steps in <em>DBSCAN</em> are simple once the previous concepts are defined:</p>
<ol>
<li>Pick at random a point that is not yet assigned to a cluster and compute its neighborhood. If the neighborhood contains at least <em>minPts</em> points, start a cluster around it; otherwise, mark the point as an outlier.</li>
<li>Once you find all the core points, start expanding that to include border points.</li>
<li>Repeat these steps until all the points are either assigned to a cluster or to an outlier.</li>
</ol>
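<p>The loop above can be condensed into a short base-R sketch. This is didactic only (the optimized implementation lives in the <code>dbscan</code> package used below); it returns a vector of cluster labels, with 0 meaning noise.</p>
<pre class="prettyprint">
# Minimal DBSCAN sketch for a numeric matrix x
dbscan_sketch <- function(x, eps, minPts) {
  d <- as.matrix(dist(x))
  labels <- rep(0L, nrow(x))            # 0 means "noise / unassigned"
  visited <- rep(FALSE, nrow(x))
  cl <- 0L
  for (p in seq_len(nrow(x))) {
    if (visited[p]) next
    visited[p] <- TRUE
    nb <- which(d[p, ] <= eps)
    if (length(nb) < minPts) next       # not a core point: leave as noise for now
    cl <- cl + 1L
    labels[p] <- cl
    queue <- setdiff(nb, p)
    while (length(queue) > 0) {         # expand the cluster through core points
      q <- queue[1]; queue <- queue[-1]
      if (labels[q] == 0L) labels[q] <- cl    # claim border or unassigned points
      if (!visited[q]) {
        visited[q] <- TRUE
        nb_q <- which(d[q, ] <= eps)
        if (length(nb_q) >= minPts)
          queue <- union(queue, nb_q[labels[nb_q] == 0L])
      }
    }
  }
  labels
}
</pre>
<p>For example, <code>dbscan_sketch(as.matrix(iris[, -5]), eps = 0.4, minPts = 4)</code> mirrors the <code>dbscan()</code> call used in the next section.</p>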
<p><b class="lb">Modeling in R</b></p>
<p>In the following part, we will demonstrate how to use DBSCAN to perform density-based clustering.</p>
<pre class="prettyprint">
# install.packages("dbscan")
library(dbscan)
</pre>
<p><strong>Determining optimal value of e (eps)</strong>. First of all, we should define the <a href="http://www.sthda.com/english/wiki/dbscan-density-based-clustering-for-discovering-clusters-in-large-datasets-with-noise-unsupervised-machine-learning">optimal value of e</a>. The method proposed here consists of computing the <em>k-nearest neighbor distances</em> in a matrix of points.</p>
<p>The idea is to calculate the average of the distances from every point to its <em>k</em> nearest neighbors. The value of <em>k</em> is specified by the user and corresponds to <em>MinPts</em>.</p>
<p>Next, these <em>k</em>-distances are plotted in ascending order. The aim is to determine the <em>knee</em>, which corresponds to the optimal <em>e</em> parameter.</p>
<p>A <em>knee</em> corresponds to a threshold where a sharp change occurs along the <em>k</em>-distance curve.</p>
<p>The function <code>kNNdistplot()</code> can be used to draw the <em>k</em>-distance plot.</p>
<pre class="prettyprint">
iris_matrix <- as.matrix(iris[, -5])
kNNdistplot(iris_matrix, k=4)
abline(h=0.4, col="red")
</pre>
<div style="text-align:center;">
<img alt="r_dbc_optimal_e.png" src="http://en.proft.me/media/science/r_dbc_optimal_e.png" width="90%"><br/>
</div>
<p>It can be seen that the optimal <em>e</em> value is around a distance of 0.4.</p>
<pre class="prettyprint">
set.seed(1234)
db = dbscan(iris_matrix, eps = 0.4, minPts = 4)
db
</pre>
<p>The result shows DBSCAN has found <em>four</em> clusters and assigned <em>25</em> cases as noise/outliers.</p>
<p>Display the hull plot:</p>
<pre class="prettyprint">
hullplot(iris_matrix, db$cluster)
</pre>
<div style="text-align:center;">
<img alt="r_dbc_hull.png" src="http://en.proft.me/media/science/r_dbc_hull.png" width="90%"><br/>
</div>
<p>The hull plot shows the separation is good, so we can play around with the parameters to get more generalized or specialized clusters. Black points are outliers.</p>
<p>You can play with <em>e</em> (neighborhood size) and <em>MinPts</em> (minimum of points needed for core cluster).</p>
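<p>For instance, re-running <code>dbscan()</code> over a small grid of <em>e</em> values (the grid values are illustrative) shows how the cluster count and the amount of noise react:</p>
<pre class="prettyprint">
library(dbscan)
iris_matrix <- as.matrix(iris[, -5])
for (eps in c(0.2, 0.4, 0.6, 0.8)) {
  db_eps <- dbscan(iris_matrix, eps = eps, minPts = 4)
  cat(sprintf("eps=%.1f: %d clusters, %d noise points\n",
              eps, max(db_eps$cluster), sum(db_eps$cluster == 0)))
}
</pre>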
<p>Compare the number of observations in each cluster with the species labels:</p>
<pre class="prettyprint">
table(iris$Species, db$cluster)
</pre>
<p><b class="lb">Comparing clustering methods</b></p>
<p>After fitting data into clusters using different clustering methods, you may wish to measure the accuracy of the clustering. In most cases, you can use either <em>intracluster</em> or <em>intercluster</em> metrics as measurements: the higher the intercluster distance and the lower the intracluster distance, the better the clustering.</p>
<p>We now introduce how to compare different clustering methods using <code>cluster.stats()</code> from the <code>fpc</code> package.</p>
<p>Perform the following steps to compare clustering methods.</p>
<p>First, install and load the <code>fpc</code> package</p>
<pre class="prettyprint">
#install.packages("fpc")
library("fpc")
</pre>
<p>Next, retrieve the cluster validation statistics of the clustering method:</p>
<pre class="prettyprint">
cs = cluster.stats(dist(iris[1:4]), db$cluster)
</pre>
<p>Most often, we focus on using <code>within.cluster.ss</code> and <code>avg.silwidth</code> to validate the clustering method. The <code>within.cluster.ss</code> measurement stands for the <em>within clusters sum</em> of squares, and <code>avg.silwidth</code> represents the <em>average silhouette width</em>.</p>
<ul>
<li><code>within.cluster.ss</code> measurement shows how closely related objects are in clusters; the smaller the value, the more closely related objects are within the cluster.</li>
<li><code>avg.silwidth</code> is a measurement that considers how closely related objects are within the cluster and how clusters are separated from each other. The silhouette value ranges from -1 to 1; a value closer to 1 suggests the data is better clustered.</li>
</ul>
<pre class="prettyprint">
cs[c("within.cluster.ss","avg.silwidth")]
</pre>
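<p>As a cross-check, <code>within.cluster.ss</code> can be reproduced by hand: it is the total squared distance from each point to its cluster centroid. The helper below is a sketch; it is verified against <code>kmeans()</code>, whose <code>tot.withinss</code> is the same quantity, and it can equally be applied to <code>db$cluster</code> from the DBSCAN fit above (label 0, the noise points, is excluded).</p>
<pre class="prettyprint">
# within.cluster.ss by hand: total squared distance to each cluster centroid
within_ss <- function(X, labels) {
  sum(sapply(unique(labels[labels > 0]), function(k) {  # label 0 (DBSCAN noise) excluded
    Xi <- X[labels == k, , drop = FALSE]
    sum(scale(Xi, center = TRUE, scale = FALSE)^2)      # squared deviations from centroid
  }))
}
# Self-contained check against kmeans
set.seed(1)
km <- kmeans(iris[, 1:4], centers = 3)
c(manual = within_ss(as.matrix(iris[, 1:4]), km$cluster), kmeans = km$tot.withinss)
</pre>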
<p>Finally, you can <a href="http://en.proft.me/2017/01/29/exploring-hierarchical-clustering-r/">generate the cluster statistics</a> of different clustering methods and list them in a table.</p>
<p><em>Fri, 03 Feb 2017, <a href="http://en.proft.me/2017/02/3/density-based-clustering-r/">http://en.proft.me/2017/02/3/density-based-clustering-r/</a></em></p>
<p><b class="lb">Model-based clustering and Gaussian mixture model in R</b></p>
<p><em><a href="http://en.proft.me/2017/02/1/model-based-clustering-r/">http://en.proft.me/2017/02/1/model-based-clustering-r/</a></em></p>
<p><b class="lb">Introduction</b></p>
<p>Clustering is a multivariate analysis used to group similar objects (close in terms of distance) together in the same group (cluster). Unlike <a href="http://en.proft.me/2015/12/24/types-machine-learning-algorithms/">supervised learning</a> methods (for example, classification and regression) a clustering analysis does not use any label information, but simply uses the similarity between data features to group them into clusters.</p>
<p>Clustering does not refer to a specific algorithm; rather, it is a process to create groups based on a similarity measure. Clustering analysis uses <a href="http://en.proft.me/2015/12/24/types-machine-learning-algorithms/">unsupervised learning</a> algorithms to create clusters.</p>
<p>Clustering algorithms generally work on the simple principle of maximizing intracluster similarity and minimizing intercluster similarity. The measure of similarity determines how the clusters need to be formed.</p>
<p>Similarity is a characterization of the ratio of the number of attributes two objects share in common compared to the total list of attributes between them. Objects which have everything in common are identical, and have a similarity of 1.0. Objects which have nothing in common have a similarity of 0.0.</p>
<p>Clustering can be widely adapted in the analysis of businesses. For example, a marketing department can use clustering to segment customers by personal attributes. As a result of this, different marketing campaigns targeting various types of customers can be designed.</p>
<p>A clustering model is a notion used to signify what kind of clusters we are trying to identify. The four most common models of clustering methods are hierarchical clustering, k-means clustering, model-based clustering, and density-based clustering:</p>
<ul>
<li><a href="http://en.proft.me/2017/01/29/exploring-hierarchical-clustering-r/">Hierarchical clustering</a>. It creates a hierarchy of clusters, and presents the hierarchy in a dendrogram. This method does not require the number of clusters to be specified at the beginning. Distance connectivity between observations is the measure.</li>
<li><a href="http://en.proft.me/2016/06/18/exploring-k-means-clustering-analysis-r/">k-means clustering</a>. It is also referred to as flat clustering. Unlike hierarchical clustering, it does not create a hierarchy of clusters, and it requires the number of clusters as an input. However, its performance is faster than hierarchical clustering. Distance from mean value of each observation/cluster is the measure.</li>
<li><a href="http://en.proft.me/2017/02/1/model-based-clustering-r/">Model-based clustering</a> (or <em>Distribution models</em>). Both hierarchical clustering and k-means clustering use a heuristic approach to construct clusters, and do not rely on a formal model. Model-based clustering assumes a data model and applies an EM algorithm to find the most likely model components and the number of clusters. Significance of statistical distribution of variables in the dataset is the measure.</li>
<li><a href="http://en.proft.me/2017/02/3/density-based-clustering-r/">Density-based clustering</a>. It constructs clusters in regard to the density measurement. Clusters in this method have a higher density than the remainder of the dataset. Density in data space is the measure.</li>
</ul>
<p>A good clustering algorithm can be evaluated based on two primary objectives:</p>
<ul>
<li>High intra-class similarity</li>
<li>Low inter-class similarity</li>
</ul>
<p>One disadvantage of <em>hierarchical clustering algorithms</em>, <em>k-means algorithms</em> and others is that they are largely heuristic and not based on formal models. Formal inference is not possible.</p>
<p>The key advantage of <em>model-based approach</em>, compared to the standard clustering methods (<em>k-means</em>, <em>hierarchical clustering</em>, etc.), is the suggestion of the number of clusters and an appropriate model.</p>
<p>Instead of taking a heuristic approach to build a cluster, <em>model-based clustering</em> uses a probability-based approach. <em>Model-based clustering</em> assumes that the data is generated by an underlying probability distribution and tries to recover the distribution from the data. One common model-based approach is using finite mixture models, which provide a flexible modeling framework for the analysis of the probability distribution. Finite mixture models are a linearly weighted sum of component probability distributions.</p>
<p>In <em>model-based clustering</em> the data are considered as coming from a distribution that is a mixture of two or more components (i.e. <em>clusters</em>). Each component <em>k</em> (i.e. group or cluster) is modeled by the normal or Gaussian distribution, which is <a href="http://www.sthda.com/english/wiki/model-based-clustering-unsupervised-machine-learning">characterized</a> by the parameters: <em>mean vector</em>, <em>covariance matrix</em>, <em>associated probability</em> (each point has a probability of belonging to each cluster).</p>
<p>Each cluster <em>k</em> is centered at the <em>means</em>, with increased density for points near the <em>mean</em>. Geometric features (shape, volume, orientation) of each cluster are determined by <em>the covariance matrix</em>.</p>
<p><em>Model-based clustering</em> is an iterative method that fits a dataset into clusters by optimizing the distributions of the data within the clusters. (The Gaussian distribution is simply the normal distribution.) This method works in three steps:</p>
<ol>
<li>First, randomly choose Gaussian parameters and fit them to the set of data points.</li>
<li>Iteratively optimize the distribution parameters to fit as many points as possible.</li>
<li>Once the algorithm converges to a local optimum, assign each data point to the cluster of the distribution it fits best.</li>
</ol>
<p>Although this algorithm creates complex models, it does capture correlation and dependence among the attributes. The downside is that these methods usually suffer from an overfitting problem.</p>
<p><b class="lb">Details about Gaussian mixture models</b></p>
<p>Probabilistic model-based clustering techniques have been widely used and have shown promising results in many applications, ranging from image segmentation, handwriting recognition, document clustering, topic modeling to information retrieval. Model-based clustering approaches attempt to optimize the fit between the observed data and some mathematical model using a probabilistic approach.</p>
<blockquote>The word model is usually used to represent the type of constraints and geometric properties of the covariance matrices.</blockquote>
<p>Such methods are often based on the assumption that the data are generated by a mixture of underlying <em>probability distributions</em>. In practice, each cluster can be represented mathematically by a parametric probability distribution, such as a <em>Gaussian</em> or a <em>Poisson</em> distribution. Thus, the clustering problem is transformed into a parameter estimation problem since the entire data can be modeled by a mixture of <em>k</em> component distributions.</p>
<p>In probabilistic models, the core idea is to model the data from a generative process. First, a specific form of the generative model (e.g., mixture of Gaussians) is assumed, and then the parameters of this model are estimated with the use of the <em>Expectation Maximization</em> (EM) algorithm. The available data set is used to estimate the parameters in such a way that they have a maximum likelihood fit to the generative model. Given this model, we then estimate the generative probabilities (or fit probabilities) of the underlying data points. Data points which fit the distribution well will have high fit probabilities, whereas anomalies will have very low fit probabilities.</p>
<blockquote>In model-based clustering, it is assumed that the data are generated by a mixture of probability distributions in which each component represents a different cluster.</blockquote>
<p>The generative models are typically solved with the use of an <em>EM</em> approach, which is the most widely used method for estimating the parameters of a finite mixture probability density. The model-based clustering framework provides a principled way to deal with several problems in this approach, such as the number of component densities (or clusters), initial values of the parameters (the EM algorithm needs initial parameter values to get started), and distributions of the component densities (e.g., Gaussian). <em>EM</em> starts off with a random or heuristic initialization and then iteratively uses two steps to resolve the circularity in computation:</p>
<ul>
<li><em>E-Step</em>. Determine the expected probability of assignment of data points to clusters with the use of current model parameters.</li>
<li><em>M-Step</em>. Determine the optimum model parameters of each mixture by using the assignment probabilities as weights.</li>
</ul>
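<p>To make the two steps concrete, here is a minimal EM fit of a two-component, one-dimensional Gaussian mixture in base R. This is a didactic sketch on made-up data, not the <code>mclust</code> implementation; the starting values and component count are assumptions.</p>
<pre class="prettyprint">
set.seed(42)
x <- c(rnorm(200, mean = 0, sd = 1), rnorm(200, mean = 5, sd = 1))
# Initial guesses for the means, standard deviations and mixture weights
mu <- c(-1, 1); sigma <- c(1, 1); w <- c(0.5, 0.5)
for (iter in 1:100) {
  # E-step: responsibility of component 1 for each point
  p1 <- w[1] * dnorm(x, mu[1], sigma[1])
  p2 <- w[2] * dnorm(x, mu[2], sigma[2])
  r  <- p1 / (p1 + p2)
  # M-step: re-estimate parameters using the responsibilities as weights
  w  <- c(mean(r), mean(1 - r))
  mu <- c(sum(r * x) / sum(r), sum((1 - r) * x) / sum(1 - r))
  sigma <- c(sqrt(sum(r * (x - mu[1])^2) / sum(r)),
             sqrt(sum((1 - r) * (x - mu[2])^2) / sum(1 - r)))
}
round(c(mu = mu, sigma = sigma, w = w), 2)   # means recovered near 0 and 5
</pre>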
<p>In the previous article I introduced <a href="http://en.proft.me/2016/06/18/exploring-k-means-clustering-analysis-r/">k-means clustering</a>, which is one of the most widely used approaches for clustering. This non-parametric method for extracting structure has some excellent properties that make it ideal for many situations. In this article, we'll look at a parametric method that uses the <em>Gaussian distribution</em>, known as the <em>Gaussian mixture model</em> (GMM).</p>
<p>You may have heard of the <em>Gaussian distribution</em>, sometimes known as the <em>normal distribution</em>. Qualitatively it can be thought of as a distribution that occurs naturally and very frequently. It has two parameters, which we need to learn in order to fit this equation to our data: the <em>mean</em> and the <em>standard deviation</em>. The <em>mean</em> is a measure of the center of the distribution. The <em>standard deviation</em> is a measure of the spread of the data around the mean. For a <em>normal distribution</em>, roughly 95% of the data lies within <em>two standard deviations of the mean</em>.</p>
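<p>This two-standard-deviation rule is easy to check in R:</p>
<pre class="prettyprint">
# Probability mass within two standard deviations of the mean of a normal distribution
pnorm(2) - pnorm(-2)   # approximately 0.954
</pre>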
<p><b class="lb">Modeling in R</b></p>
<p>Here, we show example of algorithm on <em>iris dataset</em>.</p>
<p>The <code>mclust</code> package uses the Bayesian Information Criterion (BIC) to find the number of clusters (model selection). BIC uses the likelihood and a penalty term to guard against overfitting.</p>
<p>First, install and load the library <code>mclust</code></p>
<pre class="prettyprint">
#install.packages("mclust")
library(mclust)
</pre>
<p>You can then perform model-based clustering on the <em>iris dataset</em> using <a href="http://svitsrv25.epfl.ch/R-doc/library/mclust/html/Mclust.html">Mclust</a>:</p>
<pre class="prettyprint">
mb = Mclust(iris[,-5])
#or specify number of clusters
mb3 = Mclust(iris[,-5], 3)
# optimal selected model
mb$modelName
# optimal number of cluster
mb$G
# probability for an observation to be in a given cluster
head(mb$z)
# get probabilities, means, variances
summary(mb, parameters = TRUE)
</pre>
<p>Compare the number of observations in each cluster with the species labels:</p>
<pre class="prettyprint">
table(iris$Species, mb$classification)
# vs
table(iris$Species, mb3$classification)
</pre>
<p>After the data is fit into the model, we plot the model based on clustering results.</p>
<pre class="prettyprint">
plot(mb, what=c("classification"))
</pre>
<div style="text-align:center;">
<img alt="r_mbc_classification.png" src="http://en.proft.me/media/science/r_mbc_classification.png" width="90%"><br/>
</div>
<p>The types of <code>what</code> argument are: "BIC", "classification", "uncertainty", "density". By default, all the above graphs are produced.</p>
<p><code>Mclust</code> uses an identifier for each possible parametrization of the covariance matrix that has three letters: <em>E</em> for "equal", <em>V</em> for "variable" and <em>I</em> for "coordinate axes". The first identifier refers to <em>volume</em>, the second to <em>shape</em> and the third to <em>orientation</em>. For example:</p>
<ul>
<li><em>EEE</em> means that the G clusters have the same volume, shape and orientation in <em>p</em>-dimensional space.</li>
<li><em>VEI</em> means variable volume, same shape and orientation equal to coordinate axes.</li>
<li><em>EIV</em> means same volume, spherical shape and variable orientation.</li>
</ul>
<p>Let's plot the estimated density as a contour plot:</p>
<pre class="prettyprint">
plot(mb, "density")
</pre>
<div style="text-align:center;">
<img alt="r_mbc_density.png" src="http://en.proft.me/media/science/r_mbc_density.png" width="90%"><br/>
</div>
<p>You can also use the <code>summary()</code> function to obtain the most likely model and the most probable number of clusters. For this example, the most probable number of clusters is five, with a BIC value equal to -556.1142.</p>
<p><b class="lb">Comparing clustering methods</b></p>
<p>After fitting data into clusters using different clustering methods, you may wish to measure the accuracy of the clustering. In most cases, you can use either <em>intracluster</em> or <em>intercluster</em> metrics as measurements: the higher the intercluster distance and the lower the intracluster distance, the better the clustering.</p>
<p>We now introduce how to compare different clustering methods using <code>cluster.stats()</code> from the <code>fpc</code> package.</p>
<p>Perform the following steps to compare clustering methods.</p>
<p>First, install and load the <code>fpc</code> package</p>
<pre class="prettyprint">
#install.packages("fpc")
library("fpc")
</pre>
<p>Next, retrieve the cluster validation statistics of the clustering method:</p>
<pre class="prettyprint">
cs = cluster.stats(dist(iris[1:4]), mb$classification)
</pre>
<p>Most often, we focus on using <code>within.cluster.ss</code> and <code>avg.silwidth</code> to validate the clustering method. The <code>within.cluster.ss</code> measurement stands for the <em>within clusters sum</em> of squares, and <code>avg.silwidth</code> represents the <em>average silhouette width</em>.</p>
<ul>
<li><code>within.cluster.ss</code> measurement shows how closely related objects are in clusters; the smaller the value, the more closely related objects are within the cluster.</li>
<li><code>avg.silwidth</code> is a measurement that considers how closely related objects are within the cluster and how clusters are separated from each other. The silhouette value ranges from -1 to 1; a value closer to 1 suggests the data is better clustered.</li>
</ul>
<pre class="prettyprint">
cs[c("within.cluster.ss","avg.silwidth")]
</pre>
<p>Finally, you can <a href="http://en.proft.me/2017/01/29/exploring-hierarchical-clustering-r/">generate the cluster statistics</a> of different clustering methods and list them in a table.</p>
<p><b class="lb">The relationship between k-means and GMM</b></p>
<p>K-means can be expressed as a special case of the Gaussian mixture model. In general, the Gaussian mixture is more expressive because membership of a data item to a cluster is dependent on the shape of that cluster, not just its proximity.</p>
<p>As with k-means, training a Gaussian mixture model with EM can be quite sensitive to initial starting conditions. If we compare and contrast GMM to k-means, we’ll find a few more initial starting conditions in the former than in the latter. In particular, not only must the initial centroids be specified, but the initial covariance matrices and mixture weights must be specified also. Among other strategies, one approach is to run k-means and use the resultant centroids to determine the initial starting conditions for the Gaussian mixture model.</p>
<p><b class="lb">Outcome</b></p>
<p>A key feature of most clustering algorithms is that every point is assigned to a single cluster. But realistically, many datasets contain a large gray area, and mixture models are a way to capture that.</p>
<p>You can think of <em>Gaussian mixture models</em> as a version of <em>k-means</em> that captures the notion of a gray area and gives <a href="http://en.proft.me/2015/12/7/how-calculate-confidence-interval-r/">confidence levels</a> whenever it assigns a point to a particular cluster.</p>
<p>Each cluster is modeled as a multivariate Gaussian distribution, and the model is specified by giving the following:</p>
<ol>
<li>The number of clusters.</li>
<li>The fraction of all data points that are in each cluster.</li>
<li>Each cluster’s mean and its d-by-d covariance matrix.</li>
</ol>
<p>When training a GMM, the computer keeps a running confidence level of how likely each point is to be in each cluster, and it never decides them definitively: the mean and standard deviation for a cluster will be influenced by every point in the training data, but in proportion to how likely they are to be in that cluster. When it comes time to cluster a new point, you get out a confidence level for every cluster in your model.</p>
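<p>These soft assignments are visible in <code>mb$z</code> (assuming the <code>mb</code> model fitted above): each row holds one observation's membership probabilities, every row sums to 1, and the hard classification is simply the most probable column.</p>
<pre class="prettyprint">
head(round(mb$z, 3))                      # soft membership probabilities per observation
all(abs(rowSums(mb$z) - 1) < 1e-9)        # every row sums to 1
all(max.col(mb$z) == mb$classification)   # hard labels pick the most probable component
</pre>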
<p>Mixture models have many of the same blessings and curses as k-means. They are simple to understand and implement. The computational costs are very light and can be distributed. They provide clear, understandable output that can easily be used to cluster additional points in the future. On the other hand, both assume more-or-less convex clusters, and both are liable to fall into local optima during training.</p>
<p><em>Wed, 01 Feb 2017, <a href="http://en.proft.me/2017/02/1/model-based-clustering-r/">http://en.proft.me/2017/02/1/model-based-clustering-r/</a></em></p>
<p><b class="lb">Exploring Hierarchical clustering in R</b></p>
<p><em><a href="http://en.proft.me/2017/01/29/exploring-hierarchical-clustering-r/">http://en.proft.me/2017/01/29/exploring-hierarchical-clustering-r/</a></em></p>
<p><img src="http://en.proft.me/media/science/r_hc_tree.png" alt="r_hc_tree.png" class="right" width="120">
<b class="lb">Introduction</b></p>
<p>Clustering is a multivariate analysis used to group similar objects (close in terms of distance) together in the same group (cluster). Unlike <a href="http://en.proft.me/2015/12/24/types-machine-learning-algorithms/">supervised learning</a> methods (for example, classification and regression) a clustering analysis does not use any label information, but simply uses the similarity between data features to group them into clusters.</p>
<p>Clustering does not refer to a specific algorithm; rather, it is a process to create groups based on a similarity measure. Clustering analysis uses <a href="http://en.proft.me/2015/12/24/types-machine-learning-algorithms/">unsupervised learning</a> algorithms to create clusters.</p>
<p>Clustering algorithms generally work on the simple principle of maximizing intracluster similarity and minimizing intercluster similarity. The measure of similarity determines how the clusters need to be formed.</p>
<p>Similarity is a characterization of the ratio of the number of attributes two objects share in common compared to the total list of attributes between them. Objects which have everything in common are identical, and have a similarity of 1.0. Objects which have nothing in common have a similarity of 0.0.</p>
<p>Clustering can be widely adapted in the analysis of businesses. For example, a marketing department can use clustering to segment customers by personal attributes. As a result of this, different marketing campaigns targeting various types of customers can be designed.</p>
<p>A clustering model is a notion used to signify what kind of clusters we are trying to identify. The four most common models of clustering methods are hierarchical clustering, k-means clustering, model-based clustering, and density-based clustering:</p>
<ul>
<li><a href="http://en.proft.me/2017/01/29/exploring-hierarchical-clustering-r/">Hierarchical clustering</a>. It creates a hierarchy of clusters, and presents the hierarchy in a dendrogram. This method does not require the number of clusters to be specified at the beginning. Distance connectivity between observations is the measure.</li>
<li><a href="http://en.proft.me/2016/06/18/exploring-k-means-clustering-analysis-r/">k-means clustering</a>. It is also referred to as flat clustering. Unlike hierarchical clustering, it does not create a hierarchy of clusters, and it requires the number of clusters as an input. However, its performance is faster than hierarchical clustering. Distance from mean value of each observation/cluster is the measure.</li>
<li><a href="http://en.proft.me/2017/02/1/model-based-clustering-r/">Model-based clustering</a> (or <em>Distribution models</em>). Both hierarchical clustering and k-means clustering use a heuristic approach to construct clusters, and do not rely on a formal model. Model-based clustering assumes a data model and applies an EM algorithm to find the most likely model components and the number of clusters. Significance of statistical distribution of variables in the dataset is the measure.</li>
<li><a href="http://en.proft.me/2017/02/3/density-based-clustering-r/">Density-based clustering</a>. It constructs clusters in regard to the density measurement. Clusters in this method have a higher density than the remainder of the dataset. Density in data space is the measure.</li>
</ul>
<p>A good clustering algorithm can be evaluated based on two primary objectives:</p>
<ul>
<li>High intra-class similarity</li>
<li>Low inter-class similarity</li>
</ul>
<p>Hierarchical clustering adopts either an agglomerative or divisive method to build a hierarchy of clusters. Regardless of which approach is adopted, both first use a distance similarity measure to combine or split clusters. The recursive process continues until there is only one cluster left or you cannot split more clusters. Eventually, we can use a dendrogram to represent the hierarchy of clusters.</p>
<ul>
<li><em>Agglomerative hierarchical clustering</em>. This is a bottom-up approach. Each observation starts in its own cluster. We can then compute the similarity (or the distance) between each cluster and then merge the two most similar ones at each iteration until there is only one cluster left.</li>
<li><em>Divisive hierarchical clustering</em>. This is a top-down approach. All observations start in one cluster, and then we split the cluster into the two least dissimilar clusters recursively until there is one cluster for each observation.</li>
</ul>
<div style="text-align:center;margin-bottom: 20px;">
<img alt="r_hc_types.png" src="/media/science/r_hc_types.png" width="70%">
</div>
<p>The <em>agglomerative methods</em> use a recursive algorithm that follows these phases:</p>
<ul>
<li>Find the two closest points in the dataset.</li>
<li>Link these points and consider them as a single point.</li>
<li>The process starts again, now using the new dataset that contains the new point.</li>
</ul>
<p>In a hierarchical clustering, there are two very important parameters: <em>the distance metric</em> and <em>the linkage method</em>.</p>
<p><strong>Clustering distance metric</strong>. Defining closeness is a fundamental aspect. The dissimilarity measure determines which clusters are combined in the agglomerative method or, in the divisive method, how a cluster is split. The main measures of distance are as follows:</p>
<ul>
<li>Euclidean distance</li>
<li>Maximum distance</li>
<li>Manhattan distance</li>
<li>Canberra distance</li>
<li>Binary distance</li>
<li>Pearson distance</li>
<li>Correlation</li>
<li>Spearman distance</li>
</ul>
<p>You can read more about distance measures and their visualization <a href="http://www.sthda.com/english/wiki/clarifying-distance-measures-unsupervised-machine-learning">here</a>.</p>
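<p>The choice of metric matters in practice; the same pair of points gives different distances under different measures. A quick check with <code>dist()</code>:</p>
<pre class="prettyprint">
# Distance between (0, 0) and (3, 4) under three metrics
x <- matrix(c(0, 0, 3, 4), ncol = 2, byrow = TRUE)
dist(x)                         # Euclidean (the default): sqrt(3^2 + 4^2) = 5
dist(x, method = "manhattan")   # sum of absolute differences: 3 + 4 = 7
dist(x, method = "maximum")     # maximum coordinate difference: max(3, 4) = 4
</pre>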
<p><strong>Linkage methods</strong>. The linkage methods determine how the distance between two clusters is defined. A linkage rule is necessary for calculating the inter-cluster distances. It is important to try several linkage methods to compare their results. Depending on the dataset, some methods may work better. The following is a list of the most common linkage methods:</p>
<ul>
<li><em>Single linkage</em>. This refers to the shortest distance between two points in each cluster. The distance between two clusters is the minimum distance between an observation in one cluster and an observation in the other cluster. A good choice when clusters are obviously separated.</li>
<li><em>Complete linkage</em>. This refers to the longest distance between two points in each cluster. The distance between two clusters is the maximum distance between an observation in one cluster and an observation in the other cluster. It can be sensitive to outliers.</li>
<li><em>Average linkage</em>. This refers to the average distance between two points in each cluster. The distance between two clusters is the mean distance between an observation in one cluster and an observation in the other cluster.</li>
<li><em>Ward linkage</em>. This refers to the sum of the squared distance from each point to the mean of the merged clusters. The distance between two clusters is the sum of the squared deviations from points to centroids. Try to minimize the within-cluster sum of squares. It can be sensitive to outliers.</li>
<li><em>Centroid Linkage</em>. The distance between two clusters is the distance between the cluster centroids or means.</li>
<li><em>Median Linkage</em>. The distance between two clusters is the median distance between an observation in one cluster and an observation in the other cluster. It reduces the effect of outliers.</li>
<li><em>McQuitty Linkage</em>. When two clusters A and B are joined, the distance to a new cluster C is the average of the distances of A and B to C. So the distance depends on a combination of clusters instead of individual observations in the clusters.</li>
</ul>
<div style="text-align:center;margin-bottom: 20px;">
<a href="http://slideplayer.com/slide/9336538/" target="_blank">
<img alt="r_hc_linkage.jpg" src="/media/science/r_hc_linkage.jpg" width="80%">
</a>
</div>
<div style="text-align:center;margin-bottom: 20px;">
<img alt="r_hc_linkage_table.png" src="/media/science/r_hc_linkage_table.png" width="80%">
</div>
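<p>As a quick sketch (added here for illustration), you can fit the same distance matrix with several of these linkage methods and compare the resulting three-cluster solutions against the iris species:</p>

```r
# Compare cluster assignments produced by different linkage methods
d <- dist(iris[, 1:4])  # Euclidean distance by default
methods <- c("single", "complete", "average", "ward.D2")
fits <- lapply(methods, function(m) cutree(hclust(d, method = m), k = 3))
names(fits) <- methods
# Cross-tabulate each solution against the true species
lapply(fits, function(f) table(f, iris$Species))
```

<p>Single linkage tends to produce one large "chained" cluster on this data, while Ward's method recovers the species reasonably well.</p>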
<p>Hierarchical clustering is based on the distance connectivity between observations of clusters. The steps involved in the clustering process are:</p>
<ol>
<li>Start with <em>N</em> clusters, where <em>N</em> is the number of elements (i.e., assign each element to its own cluster). In other words, the distances (similarities) between the clusters equal the distances (similarities) between the items they contain.</li>
<li>Merge the pair of closest (most similar) clusters (e.g., the first iteration reduces the number of clusters to <em>N - 1</em>).</li>
<li>Recompute the distances (similarities) and again merge the closest pair.</li>
<li>Repeat steps 2 and 3 until all data points end up in one cluster.</li>
</ol>
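<p>The steps above can be sketched directly in a few lines of R. The following toy implementation (an illustration only, not how <code>hclust()</code> works internally) performs single-linkage agglomeration on five one-dimensional points:</p>

```r
# Minimal agglomerative clustering sketch (single linkage, 1-D toy data)
x <- c(1, 2, 8, 9, 25)             # toy data
clusters <- as.list(seq_along(x))  # step 1: every point is its own cluster
while (length(clusters) > 1) {
  # steps 2-3: find the pair of clusters with the smallest inter-cluster distance
  best <- c(1, 2); best_d <- Inf
  for (i in 1:(length(clusters) - 1)) {
    for (j in (i + 1):length(clusters)) {
      d_ij <- min(abs(outer(x[clusters[[i]]], x[clusters[[j]]], "-")))
      if (d_ij < best_d) { best_d <- d_ij; best <- c(i, j) }
    }
  }
  merged <- c(clusters[[best[1]]], clusters[[best[2]]])
  clusters <- c(clusters[-best], list(merged))  # one cluster fewer per iteration
}
clusters[[1]]  # step 4: all points end up in a single cluster
```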
<p><b class="lb">Modeling in R</b></p>
<p>In R, we use the <code>hclust()</code> function for <em>hierarchical cluster analysis</em>. This is part of the <code>stats</code> package. To perform hierarchical clustering, the input data has to be in a distance matrix form.</p>
<p>Another important function used here is <code>dist()</code>. It computes and returns the distance matrix obtained by applying the specified distance measure to the rows of a data matrix. By default, the measure is the Euclidean distance.</p>
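<p>A tiny example (added here) shows what <code>dist()</code> returns for a two-row matrix:</p>

```r
# Pairwise distances between the rows of a matrix
m <- matrix(c(0, 0,
              3, 4), nrow = 2, byrow = TRUE)
as.numeric(dist(m))                        # Euclidean: sqrt(3^2 + 4^2) = 5
as.numeric(dist(m, method = "manhattan"))  # |3| + |4| = 7
```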
<p>You can then examine the <a href="https://en.wikipedia.org/wiki/Iris_flower_data_set">iris</a> dataset structure:</p>
<pre class="prettyprint">
str(iris)
</pre>
<p>Let's look at the parallel coordinates plot of the data.</p>
<blockquote>Parallel coordinates are a common way of <a href="https://en.wikipedia.org/wiki/Parallel_coordinates">visualizing</a> high-dimensional geometry and analyzing multivariate data. The lines show the relationship between the axes, much like scatterplots, and the patterns that the lines form indicate the relationship. You can also gather <a href="https://www.safaribooksonline.com/blog/2014/03/31/mastering-parallel-coordinate-charts-r/">details</a> about the relationships between the axes when you see the clustering of lines.</blockquote>
<pre class="prettyprint">
# get nice colors
library(colorspace)
iris2 <- iris[,-5]
species_labels <- iris[,5]
species_col <- rev(rainbow_hcl(3))[as.numeric(species_labels)]
MASS::parcoord(iris2, col = species_col, var.label = TRUE, lwd = 2)
title("Parallel coordinates plot of the Iris data")
legend("top", cex = 1, legend = as.character(levels(species_labels)), fill = unique(species_col), horiz = TRUE)
</pre>
<div style="text-align:center;">
<img alt="r_hc_parallel.png" src="/media/science/r_hc_parallel.png"><br/>
</div>
<p>We can see that the <em>Setosa</em> species are distinctly different from <em>Versicolor</em> and <em>Virginica</em> (they have lower petal length and width). But <em>Versicolor</em> and <em>Virginica</em> cannot easily be separated based on measurements of their sepal and petal width/length.</p>
<p>You can then use agglomerative hierarchical clustering to cluster the iris data:</p>
<pre class="prettyprint">
iris_hc = hclust(dist(iris[,1:4], method="euclidean"), method="ward.D2")
</pre>
<p>We use the Euclidean distance as the distance metric, and Ward's minimum variance method to perform agglomerative clustering.</p>
<p>Lastly, you can use the <code>plot</code> function to plot the dendrogram:</p>
<pre class="prettyprint">
plot(iris_hc, hang = -0.01, cex = 0.7)
</pre>
<div style="text-align:center;">
<img alt="r_hc_iris.png" src="http://en.proft.me/media/science/r_hc_iris.png"><br/>
</div>
<p>Finally, we use the <code>plot</code> function to plot the dendrogram of hierarchical clusters. We specify <code>hang</code> to display labels at the bottom of the dendrogram, and use <code>cex</code> to shrink the label to 70 percent of the normal size.</p>
<p>In the dendrogram displayed above, each leaf corresponds to one observation. As we move up the tree, observations that are similar to each other are combined into branches. You can see more kinds of visualizing dendrograms in R <a href="https://rpubs.com/gaston/dendrograms">here</a> or <a href="http://www.sthda.com/english/wiki/beautiful-dendrogram-visualizations-in-r-5-must-known-methods-unsupervised-machine-learning">here</a>.</p>
<p>In a dendrogram, we can see the hierarchy of clusters, but we have not grouped data into different clusters yet. However, we can determine how many clusters are within the dendrogram and cut the dendrogram at a certain tree height to separate the data into different groups. We'll demonstrate how to use the <code>cutree</code> function to separate the data into a given number of clusters.</p>
<p>First, categorize the data into three groups:</p>
<pre class="prettyprint">
fit = cutree(iris_hc, k = 3)
</pre>
<p>You can then examine the cluster labels for the data:</p>
<pre class="prettyprint">
fit
</pre>
<p>Cross-tabulate the cluster assignments against the actual <em>Species</em> to count the observations in each cluster:</p>
<pre class="prettyprint">
table(fit, iris$Species)
</pre>
<p>Finally, you can visualize how data is clustered with the red rectangle border:</p>
<pre class="prettyprint">
plot(iris_hc, hang = -0.01, cex = 0.7)
rect.hclust(iris_hc, k=3, border="red")
# rect.hclust(res.hc, k = 3, border = 2:4)
</pre>
<div style="text-align:center;">
<img alt="r_hc_iris_clusters.png" src="http://en.proft.me/media/science/r_hc_iris_clusters.png"><br/>
</div>
<p>An alternative way to produce dendrograms is to explicitly convert <code>hclust</code> objects into <code>dendrogram</code> objects.</p>
<pre class="prettyprint">
# Convert hclust into a dendrogram and plot
hcd <- as.dendrogram(iris_hc)
# Default plot
plot(hcd, type = "rectangle", ylab = "Y")
# Define nodePar
nodePar <- list(lab.cex = 0.6, pch = c(NA, 19),
cex = 0.7, col = "blue")
# Customized plot; remove labels
plot(hcd, ylab = "Y", nodePar = nodePar, leaflab = "none")
# Horizontal plot
plot(hcd, xlab = "Y",
nodePar = nodePar, horiz = TRUE)
# Change edge color
plot(hcd, xlab = "Height", nodePar = nodePar,
edgePar = list(col = 2:3, lwd = 2:1))
</pre>
<p>The R package <code>ggplot2</code> has no functions to plot dendrograms. However, the dedicated <code>ggdendro</code> package offers a decent solution.</p>
<pre class="prettyprint">
# install.packages('ggdendro')
library(ggplot2)
library(ggdendro)
# basic option
ggdendrogram(iris_hc)
# another option
ggdendrogram(iris_hc, rotate = TRUE, size = 4, theme_dendro = FALSE, color = "tomato")
</pre>
<p>Using the function <code>fviz_cluster()</code> [in <code>factoextra</code>], we can also visualize the result in a scatter plot. Observations are represented by points in the plot, using principal components. A frame is drawn around each cluster.</p>
<pre class="prettyprint">
# install.packages('factoextra')
library(factoextra)
fviz_cluster(list(data=iris[,1:4], cluster=fit))
</pre>
<p><b class="lb">Comparing clustering methods</b></p>
<p>After fitting data into clusters using different clustering methods, you may wish to measure the accuracy of the clustering. In most cases, you can use either <em>intracluster</em> or <em>intercluster</em> metrics as measurements. The higher the intercluster distance, the better it is, and the lower the intracluster distance, the better it is.</p>
<p>We now introduce how to compare different clustering methods using <code>cluster.stats()</code> from the <code>fpc</code> package.</p>
<p>Perform the following steps to compare clustering methods.</p>
<p>First, install and load the <code>fpc</code> package:</p>
<pre class="prettyprint">
#install.packages("fpc")
library("fpc")
</pre>
<p>You then need to use hierarchical clustering with the <em>single</em> linkage method to cluster the iris data and generate the object <code>hc_single</code>:</p>
<pre class="prettyprint">
single_c = hclust(dist(iris[,1:4]), method="single")
hc_single = cutree(single_c, k=3)
</pre>
<p>Use hierarchical clustering with the <em>complete</em> linkage method to cluster the iris data and generate the object <code>hc_complete</code>:</p>
<pre class="prettyprint">
complete_c = hclust(dist(iris[,1:4]), method="complete")
hc_complete = cutree(complete_c, k=3)
</pre>
<p>You can then use <a href="http://en.proft.me/2016/06/18/exploring-k-means-clustering-analysis-r/">k-means clustering</a> to cluster the iris data and generate the object <code>km</code>:</p>
<pre class="prettyprint">
set.seed(1234)
km = kmeans(iris[1:4], 3, iter.max=1000, algorithm=c("Forgy"))
</pre>
<p>Next, retrieve the cluster validation statistics for a given clustering result:</p>
<pre class="prettyprint">
cs = cluster.stats(dist(iris[1:4]), km$cluster)
</pre>
<p>Most often, we focus on using <code>within.cluster.ss</code> and <code>avg.silwidth</code> to validate the clustering method. The <code>within.cluster.ss</code> measurement stands for the <em>within clusters sum</em> of squares, and <code>avg.silwidth</code> represents the <em>average silhouette width</em>.</p>
<ul>
<li><code>within.cluster.ss</code> measurement shows how closely related objects are in clusters; the smaller the value, the more closely related objects are within the cluster.</li>
<li><code>avg.silwidth</code> is a measurement that considers how closely related objects are within the cluster and how well clusters are separated from each other. The silhouette value ranges from -1 to 1; a value closer to 1 suggests the data is better clustered.</li>
</ul>
<pre class="prettyprint">
cs[c("within.cluster.ss","avg.silwidth")]
</pre>
<p>Finally, we can generate the cluster statistics of each clustering method and list them in a table:</p>
<pre class="prettyprint">
sapply(list(kmeans = km$cluster, hc_single = hc_single, hc_complete = hc_complete),
function(c) cluster.stats(dist(iris[1:4]), c)[c("within.cluster.ss","avg.silwidth")])
</pre>
<p><b class="lb">Data preparation</b></p>
<p>Remove any missing values (i.e., <code>NA</code> values, for "not available"):</p>
<pre class="prettyprint">
df <- na.omit(iris[,1:4])
</pre>
<p>Next, we have to scale all the numeric variables. Scaling means each variable will now have mean zero and standard deviation one. Ideally, you want a unit in each coordinate to represent the same degree of difference; scaling makes the standard deviation the unit of measurement in each coordinate. This is done to prevent the clustering algorithm from depending on arbitrary variable units. You can scale/standardize the data using the R function <code>scale()</code>:</p>
<pre class="prettyprint">
data <- scale(df)
</pre>
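<p>A quick check (a sketch added here) confirms that every scaled column now has mean zero and standard deviation one:</p>

```r
df <- na.omit(iris[, 1:4])
data <- scale(df)
round(colMeans(data), 10)  # all effectively zero
apply(data, 2, sd)         # all equal to 1
```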
<p><b class="lb">Useful links</b></p>
<ul>
<li><a href="http://www.sthda.com/english/wiki/hierarchical-clustering-essentials-unsupervised-machine-learning">Hierarchical Clustering Essentials</a></li>
</ul>Sun, 29 Jan 2017 00:00:00 +0200http://en.proft.me/2017/01/29/exploring-hierarchical-clustering-r/Classification using Random forest in Rhttp://en.proft.me/2017/01/24/classification-using-random-forest-r/<p><img src="http://en.proft.me/media/science/r_rf.png" alt="r_rf.png" class="right" width="120">
<b class="lb">Introduction</b></p>
<p><em>Random forest</em> (or <em>decision tree forests</em>) is one of the most popular decision tree-based ensemble models. The accuracy of these models tends to be higher than that of most other <a href="http://en.proft.me/2016/11/9/classification-using-decision-trees-r/">decision trees</a>. The Random Forest algorithm can be used for both classification and regression applications.</p>
<p>The main drawback of <em>decision trees</em> is that they are prone to <em>overfitting</em>. The reason for this is that trees, if grown deep, are able to fit all kinds of variations in the data, including noise. Although it is possible to address this partially by <a href="http://en.proft.me/2016/11/9/classification-using-decision-trees-r/">pruning</a>, the result often remains less than satisfactory.</p>
<p><em>Ensemble learning</em> is a method to combine results produced by different learners into one format, with the aim of producing better classification results and regression results.</p>
<p>In ensemble learning, <em>bagging</em>, <em>boosting</em>, and <em>random forest</em> are the three most common methods.</p>
<p><em>Random forest</em> uses the classification results voted from many classification trees. The idea is simple: a single classification tree will obtain a single classification result with a single input vector. However, a random forest grows many classification trees, obtaining multiple results from a single input. Therefore, a random forest will use the majority of votes from all the decision trees to classify data or use an average output for regression.</p>
<p><em>Random forest</em> creates a large number of <em>decision trees</em>. Every observation is fed into every decision tree. The most common outcome for each observation is used as the final output. A new observation is fed into all the trees and a majority vote is taken across the classification models. In ensemble terms, the <em>trees</em> are <em>weak learners</em> and the <em>random forest</em> is a <em>strong learner</em>.</p>
<div style="text-align:center; margin-bottom: 5px;">
<a href="https://citizennet.com/blog/2012/11/10/random-forests-ensembles-and-performance-metrics/"><img alt="r_rf_concept.png" src="http://en.proft.com.ua/media/science/r_rf_concept.jpg" width="90%"></a><br/>
</div>
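<p>The majority vote itself is simple to express in R. With hypothetical predictions from five individual trees for one observation, the forest's output is the most frequent class:</p>

```r
# Hypothetical votes from five trees for a single observation
votes <- c("setosa", "versicolor", "setosa", "setosa", "virginica")
majority <- names(which.max(table(votes)))
majority  # "setosa"
```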
<p>It's worth noting that relative to other ensemble-based methods, <em>random forests</em> are quite competitive and offer key advantages relative to the competition. For instance, <em>random forests</em> tend to be easier to use and less prone to overfitting. The following is the general strengths and weaknesses of <em>random forest</em> models.</p>
<p>Strengths of <em>Random forest</em></p>
<ul>
<li>This algorithm can solve both types of problems, i.e., classification and regression, and does a decent estimation on both fronts.</li>
<li>It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.</li>
<li>Selects only the most important features.</li>
<li>Can be used on data with an extremely large number of features or examples (high dimensionality). It can handle thousands of input variables and identify the most significant ones, so it is considered one of the dimensionality-reduction methods. Further, the model outputs <em>variable importance</em>, which can be a very handy feature.</li>
</ul>
<p>Weaknesses of <em>Random forest</em></p>
<ul>
<li>Unlike a decision tree, the model is not easily interpretable. Random Forest can feel like a black box approach for statistical modelers.</li>
<li>It does a good job at classification, but not as good a job at regression, since it does not give precise continuous predictions. In the case of regression, it cannot predict beyond the range of values in the training data, and it may over-fit data sets that are particularly noisy.</li>
<li>May require some work to tune the model to the data.</li>
</ul>
<p><b class="lb">Modeling in R</b></p>
<p>The <em>random forest</em> algorithm is implemented in the <code>randomForest</code> package of R, and the function of the same name allows us to use it.</p>
<p>Some of the commonly used parameters of <code>randomForest</code> functions are</p>
<ul>
<li><code>formula</code> - a formula describing the model to be fitted</li>
<li><code>data</code> - input data frame</li>
<li><code>ntree</code> - number of decision trees to be grown</li>
<li><code>mtry</code> - the number of features sampled as candidates at each split</li>
<li><code>replace</code> - takes <code>TRUE</code> or <code>FALSE</code> and indicates whether to sample with or without replacement</li>
<li><code>importance</code> - whether the importance of the predictor variables should be assessed</li>
<li><code>proximity</code> - whether to calculate proximity measures between rows of a data frame</li>
</ul>
<p>The <code>randomForest</code> package optionally produces two additional pieces of information: a <em>measure of the importance</em> of the predictor variables, and a <em>measure of the internal structure</em> of the data (the proximity of different data points to one another).</p>
<p>By default, the <code>randomForest()</code> function creates an ensemble of 500 trees that consider <code>sqrt(p)</code> random features at each split, where <em>p</em> is the number of features in the training dataset and <code>sqrt()</code> refers to R's square root function. The goal of using a large number of trees is to train enough so that each feature has a chance to appear in several models.</p>
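<p>For the iris data, which has four predictor columns, this default works out as follows:</p>

```r
p <- ncol(iris) - 1             # 4 predictors (Species is the target)
mtry_default <- floor(sqrt(p))  # default number of features tried at each split
mtry_default                    # 2
```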
<p>Let's see how the default <code>randomForest()</code> parameters work with the <code>iris</code> data set. The <code>set.seed()</code> function ensures that the result can be replicated:</p>
<pre class="prettyprint">
#install.packages("randomForest")
library("randomForest")
set.seed(1234)
</pre>
<p>Split <code>iris</code> data to <em>training</em> data and <em>testing</em> data</p>
<pre class="prettyprint">
ind = sample(2, nrow(iris), replace=TRUE, prob=c(0.7,0.3))
trainData = iris[ind==1,]
testData = iris[ind==2,]
</pre>
<p>Generate <em>Random forest</em> learning tree</p>
<pre class="prettyprint">
iris_rf = randomForest(Species~., data=trainData, ntree=100, proximity=T)
table(predict(iris_rf), trainData$Species)
</pre>
<p>To look at a summary of the model's performance, we can simply type the resulting object's name:</p>
<pre class="prettyprint">
iris_rf
</pre>
<p>The output notes that the <em>random forest</em> included 100 trees and tried two variables at each split. At first glance, you might be alarmed at the seemingly poor performance according to the confusion matrix—the error rate of 5.36 percent is far worse than the resubstitution error of any of the other ensemble methods. However, this confusion matrix does not show resubstitution error. Instead, it reflects the <em>out-of-bag error rate</em> (listed in the output as OOB estimate of error rate), which unlike resubstitution error, is an unbiased estimate of the test set error. This means that it should be a fairly reasonable estimate of future performance.</p>
<p>You can use the <code>plot()</code> function to plot the mean square error of the forest object:</p>
<pre class="prettyprint">
plot(iris_rf)
</pre>
<div style="text-align:center; margin-bottom: 5px;">
<img alt="r_rf_plot.png" src="http://en.proft.com.ua/media/science/r_rf_plot.png">
</div>
<p>A forest of 100 decision trees has been built using the <em>random forest</em> algorithm. We can plot the error rate across the number of decision trees. The plot seems to indicate that after about 40 decision trees, there is no significant further reduction in error rate.</p>
<p>You can then examine the importance of each attribute within the fitted classifier:</p>
<pre class="prettyprint">
importance(iris_rf)
</pre>
<p>Try to build <em>random forest</em> for testing data. Similar to other classification methods, you can obtain the classification table:</p>
<pre class="prettyprint">
irisPred = predict(iris_rf, newdata=testData)
table(irisPred, testData$Species)
</pre>
<p>Plot the margin of the predictions: a positive margin means a correct classification.</p>
<pre class="prettyprint">
plot(margin(iris_rf, testData$Species))
</pre>
<p>Let's determine the misclassification rate. First, build a confusion matrix. Each column of the matrix represents the number of predictions of each class, while each row represents the instances in the actual class.</p>
<pre class="prettyprint">
CM = table(irisPred, testData$Species)
</pre>
<p>Second, compute the accuracy from the confusion matrix. Applying the <code>diag</code> function to this table selects the diagonal elements, i.e., the number of points where the <em>random forest</em> agrees with the true classification, and the <code>sum</code> command simply adds these values up.</p>
<pre class="prettyprint">
accuracy = (sum(diag(CM)))/sum(CM)
</pre>
<p>The model has a high overall accuracy of <em>0.9473684</em>.</p>
<p><b class="lb">Useful links</b></p>
<ul>
<li><a href="https://www.analyticsvidhya.com/blog/2016/04/complete-tutorial-tree-based-modeling-scratch-in-python/">A Complete Tutorial on Tree Based Modeling from Scratch</a></li>
<li><a href="http://www.bios.unc.edu/~dzeng/BIOS740/randomforest.pdf">Classification and Regression by randomForest</a></li>
</ul>Tue, 24 Jan 2017 00:00:00 +0200http://en.proft.me/2017/01/24/classification-using-random-forest-r/Classification using k-Nearest Neighbors in Rhttp://en.proft.me/2017/01/22/classification-using-k-nearest-neighbors-r/<p><img src="http://en.proft.me/media/science/r_knn_classify.png" alt="r_knn_classify.png" class="right" width="120">
<b class="lb">Introduction</b></p>
<p>The <em>k-NN</em> algorithm is among the simplest of all <a href="http://en.proft.me/2015/12/24/types-machine-learning-algorithms/">machine learning algorithms</a>. It also might surprise many to know that <em>k-NN</em> is one of the <a href="http://www.cs.umd.edu/~samir/498/10Algorithms-08.pdf">top 10 data mining algorithms</a>.</p>
<p><em>k-Nearest Neighbors algorithm</em> (or k-NN for short) is a <em>non-parametric method</em> used for <em>classification</em> and <em>regression</em>. In both cases, the input consists of the <em>k</em> closest training examples in the <em>feature space</em>.</p>
<blockquote>Non-parametric means that it does not make any assumptions on the underlying data distribution.</blockquote>
<p>Since the points are in <em>feature space</em>, they have a notion of <em>distance</em>. This need not necessarily be Euclidean distance, although it is the one commonly used. The training data consists of <em>a set of vectors</em>, each with an associated <em>class label</em>. In the simplest case, the label will be either <em>+</em> or <em>–</em> (for positive or negative classes), but <em>k-NN</em> can work equally well with an arbitrary number of classes.</p>
<blockquote><i>k</i> number decides how many neighbors (where neighbors is defined based on the distance metric) influence the classification. With a given <i>k</i> value we can make boundaries of each class. </blockquote>
<p>For any fixed point <em>x</em> we can calculate how close each observation <i>X<sup>i</sup></i> is to <em>x</em> using the Euclidean distance. This ranks the data by how close they are to <em>x</em>. Imagine drawing a small ball about <em>x</em> and slowly inflating it. As the ball hits the first observation <i>X<sup>i</sup></i>, this is the first nearest neighbor of <em>x</em>. As the ball further inflates and hits a second observation, this observation is the second nearest neighbor.</p>
<div style="text-align:center; margin-bottom: 5px;">
<a href="http://bdewilde.github.io/blog/blogger/2012/10/26/classification-of-hand-written-digits-3/"><img alt="r_knn_concept.png" src="http://en.proft.com.ua/media/science/r_knn_concept.png" width="60%"></a><br/>
</div>
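<p>This ranking is easy to compute by hand. For a hypothetical query point <code>x</code>, measure the Euclidean distance to every iris observation and order the results:</p>

```r
# Rank all observations by their Euclidean distance to a query point x
x <- c(5.0, 3.5, 1.5, 0.2)  # hypothetical query in the iris feature space
dists <- apply(iris[, 1:4], 1, function(row) sqrt(sum((row - x)^2)))
nearest <- order(dists)[1:5]  # indices of the first five nearest neighbors
iris[nearest, ]
```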
<p>Efficiency of the <em>k-NN</em> algorithm largely depends on the value of <em>k</em>, i.e., the chosen number of nearest neighbors. Choosing the optimal value for <em>k</em> is best done by first inspecting the data. In general, a large <em>k</em> value is more precise, as it reduces the overall noise, but there is no guarantee. Cross-validation is another way to retrospectively determine a good <em>k</em> value, by using an independent dataset to validate it. Historically, the optimal <em>k</em> for most datasets has been between 3 and 10.</p>
<p><em>k-NN</em> makes predictions using the training dataset directly. Predictions are made for a new instance <em>x</em> by searching through the entire training set for the <em>k</em> most similar instances (the neighbors) and summarizing the output variable for those <em>k</em> instances. For regression this might be the mean output variable, in classification this might be the mode (or most common) class value.</p>
<p>To determine which of the <em>k</em> instances in the training dataset are most similar to a new input a distance measure is used. You can choose the best distance metric based on the properties of your data. Following are the most popular distance measures</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Euclidean_distance">Euclidean distance</a>. This measure is popular for real-valued input variables. The Euclidean distance is always greater than or equal to zero. The measurement would be zero for identical points and high for points that show little similarity. It is a good distance measure to use if the input variables are similar in type (e.g. all measured widths and heights)</li>
<li><a href="https://en.wikipedia.org/wiki/Hamming_distance">Hamming distance</a>. Calculates the distance between binary vectors.</li>
<li><a href="https://en.wikipedia.org/wiki/Taxicab_geometry">Manhattan distance</a>. Calculates the distance between real vectors as the sum of their absolute differences. Also called city block distance. It is a good measure to use if the input variables are not similar in type (such as age, gender, height, etc.).</li>
<li><a href="https://en.wikipedia.org/wiki/Minkowski_distance">Minkowski distance</a>. Generalization of Euclidean and Manhattan distance.</li>
<li><a href="https://en.wikipedia.org/wiki/Jaccard_index#Tanimoto_similarity_and_distance">Tanimoto distance</a> is only applicable for a binary variable.</li>
<li><a href="https://en.wikipedia.org/wiki/Jaccard_index">Jaccard distance</a> measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets.</li>
<li><a href="https://en.wikipedia.org/wiki/Mahalanobis_distance">Mahalanobis distance</a></li>
<li><a href="https://en.wikipedia.org/wiki/Cosine_similarity">Cosine similarity</a> ranges from +1 to -1, where +1 indicates identical orientation; completely opposite vectors have a similarity of -1.</li>
</ul>
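<p>Several of these measures can be written out in one line each (a sketch added here; note that Minkowski with <em>p</em> = 2 reduces to Euclidean and with <em>p</em> = 1 to Manhattan):</p>

```r
euclidean <- function(a, b) sqrt(sum((a - b)^2))
manhattan <- function(a, b) sum(abs(a - b))
minkowski <- function(a, b, p) sum(abs(a - b)^p)^(1 / p)

a <- c(1, 2, 3); b <- c(4, 6, 3)
euclidean(a, b)     # 5
manhattan(a, b)     # 7
minkowski(a, b, 2)  # same as Euclidean
minkowski(a, b, 1)  # same as Manhattan
```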
<p>The output of <em>k-NN</em> depends on whether it is used for classification or regression:</p>
<ul>
<li>In <em>k-NN classification</em>, the <em>output</em> is a <em>class membership</em>. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its <em>k</em> nearest neighbors (<em>k</em> is a positive integer, typically small).</li>
<li>In <em>k-NN regression</em>, the output is the <em>property value</em> for the object. This value is the average of the values of its k nearest neighbors.</li>
</ul>
<p>However, it is more widely used for classification problems in industry because of its ease of interpretation and low calculation time (for small datasets). The computational complexity of <em>k-NN</em> increases with the size of the training dataset.</p>
<p><em>k-NN</em> is also a <em>lazy algorithm</em>. What this means is that it does not use the training data points to do any <em>generalization</em>. Lack of generalization means that <em>k-NN</em> keeps all the training data and all the training data is needed during the testing phase. As result:</p>
<ul>
<li>more time might be needed, as in the worst case all data points might take part in the decision;</li>
<li>more memory is needed, as we have to store all the training data.</li>
</ul>
<p>In <em>k-Nearest Neighbor classification</em>, the training dataset is used to classify each member of a <em>target</em> dataset. The structure of the data is that there is a <em>classification (categorical) variable</em> of interest ("buyer," or "non-buyer," for example), and a number of additional <em>predictor variables</em> (age, income, location, etc).</p>
<p>Before use <em>k-NN</em> we should do some preparations</p>
<ul>
<li><em>Rescale Data</em>. <em>k-NN</em> performs much better if all of the data has the same scale. Normalizing your data to the range [0, 1] is a good idea. It may also be a good idea to standardize your data if it has a Gaussian distribution.</li>
<li><em>Address Missing Data</em>. Missing data will mean that the distance between samples can not be calculated. These samples could be excluded or the missing values could be imputed.</li>
<li><em>Lower Dimensionality</em>. <em>k-NN</em> is suited for lower dimensional data. You can try it on high dimensional data (hundreds or thousands of input variables) but be aware that it may not perform as well as other techniques. <em>k-NN</em> can benefit from feature selection that reduces the dimensionality of the input feature space.</li>
</ul>
<p>Following is the algorithm:</p>
<ol>
<li>For each row (case) in the target dataset (the set to be classified), locate the <em>k</em> closest members (the <em>k</em> nearest neighbors) of the training dataset. A Euclidean distance measure is used to calculate how close each member of the training set is to the target row that is being examined.</li>
<li>Examine the <em>k</em> nearest neighbors - which classification (category) do most of them belong to? Assign this category to the row being examined.</li>
<li>Repeat this procedure for the remaining rows (cases) in the target set.</li>
</ol>
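<p>These three steps translate into a short naive implementation (an illustration only; the <code>knn</code> function from the <code>class</code> package is far more efficient):</p>

```r
# Naive k-NN classifier: majority class among the k nearest training rows
knn_naive <- function(train, labels, newdata, k = 3) {
  apply(newdata, 1, function(x) {
    d <- sqrt(colSums((t(train) - x)^2))  # step 1: distance to every training row
    nn <- order(d)[1:k]                   # the k nearest neighbors
    names(which.max(table(labels[nn])))   # step 2: majority category
  })                                      # step 3: repeated for every target row
}

pred <- knn_naive(iris[, 1:4], iris$Species, iris[c(1, 51, 101), 1:4], k = 5)
unname(pred)  # "setosa" "versicolor" "virginica"
```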
<p><b class="lb">Modeling in R</b></p>
<p>Let's see how to use <em>k-NN</em> for classification. In this case, we are given some data points for training and also a new unlabelled data for testing. Our aim is to find the class label for the new point. The algorithm has different behavior based on <em>k</em>.</p>
<p>Before starting with the <em>k-NN</em> implementation in R, I would like to discuss two data normalization/standardization techniques that keep the features consistent. You can easily see the need for them when you go through the results of the <code>summary()</code> function. Look at the minimum and maximum values of all the (numerical) attributes. If you see that one attribute has a much wider range of values than the others, you will need to normalize your dataset, because otherwise the distance will be dominated by this feature.</p>
<ul>
<li><em>Min-Max Normalization</em>. This process transforms a feature's values so that all of them fall within the range of 0 to 1. This technique is traditionally used with <em>k-Nearest Neighbors</em> classification problems.</li>
<li><em>Z-Score Standardization</em>. This process subtracts the mean of feature <em>x</em> from each value and divides by the standard deviation of <em>x</em>. The disadvantage of the min-max normalization technique is that it tends to bring data towards the mean. If outliers need to be weighted more than the other values, the z-score standardization technique suits better. To achieve z-score standardization, you can use R's built-in <code>scale()</code> function.</li>
</ul>
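<p>Both techniques can be demonstrated on a single vector (a small sketch added for illustration):</p>

```r
x <- c(10, 20, 30, 40, 50)

# Min-max normalization: all values fall into [0, 1]
minmax <- (x - min(x)) / (max(x) - min(x))
minmax  # 0.00 0.25 0.50 0.75 1.00

# Z-score standardization via the built-in scale(): mean 0, sd 1
z <- as.numeric(scale(x))
c(mean(z), sd(z))  # 0 and 1
```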
<p>To perform a <em>k-nearest neighbour classification</em> in R we will make use of the <code>knn</code> function in the <code>class</code> package and <code>iris</code> data set.</p>
<pre class="prettyprint">
# familiarize with data
str(iris)
table(iris$Species)
</pre>
<p>The <code>table()</code> function performs a simple two-way cross-tabulation of values. For each unique value of the first variable, it counts the occurrences of different values of the second variable. It works for numeric, factor, and character variables.</p>
<p>A quick look at the <code>Species</code> attribute tells you that the division of the species of flowers is 50-50-50.</p>
<p>If you want to check the percentage division of the <code>Species</code> attribute, you can ask for a table of proportions:</p>
<pre class="prettyprint">
round(prop.table(table(iris$Species)) * 100, digits = 1)
</pre>
<p>Also, you can try to get an idea of your data by making some graphs, such as <a href="http://en.proft.me/2016/12/3/charts-r-usage/#scatter">scatter plots</a>. It can be interesting to see how much one variable is affected by another. In other words, you want to see if there is any correlation between two variables.</p>
<pre class="prettyprint">
plot(iris$Sepal.Length, iris$Sepal.Width, col=iris$Species, pch=19)
legend("topright", legend=levels(iris$Species), bty="n", pch=19, col=palette())
</pre>
<div style="text-align:center; margin-bottom: 5px;">
<img alt="r_knn_plot.png" src="http://en.proft.com.ua/media/science/r_knn_plot.png">
</div>
<p>You see that there is a high correlation between the <code>Sepal.Length</code> and the <code>Sepal.Width</code> of the <code>Setosa</code> iris flowers, while the correlation is noticeably weaker for the <code>Virginica</code> and <code>Versicolor</code> flowers.</p>
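<p>The visual impression can be checked numerically with base R's <code>cor()</code>; <code>cors</code> is a name made up for this sketch.</p>
<pre class="prettyprint">
# Pearson correlation of sepal length and width within each species
cors <- sapply(split(iris, iris$Species), function(s) {
  cor(s$Sepal.Length, s$Sepal.Width)
})
round(cors, 2)
</pre>
<p>For <code>setosa</code> the correlation comes out around 0.74, clearly higher than for the other two species.</p>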
<p>To normalize the data use the following function.</p>
<pre class="prettyprint">
normMinMax = function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
iris_norm <- as.data.frame(lapply(iris[1:4], normMinMax))
</pre>
<p>Now create the <em>training</em> and <em>test</em> data sets. The following randomly splits the iris data into a 70% training set and a 30% test set. (Since all four iris features are measured in centimetres on similar scales, the un-normalized data works fine here; with features on very different scales you would split <code>iris_norm</code> instead.)</p>
<pre class="prettyprint">
set.seed(1234)
indexes = sample(2, nrow(iris), replace=TRUE, prob=c(0.7, 0.3))
iris_train = iris[indexes==1, 1:4]
iris_test = iris[indexes==2, 1:4]
iris_train_labels = iris[indexes==1, 5]
iris_test_labels = iris[indexes==2, 5]
</pre>
<p>Load <code>class</code> package and call <code>knn()</code> function to build the model.</p>
<pre class="prettyprint">
#install.packages("class")
library("class")
iris_mdl = knn(train=iris_train, test=iris_test, cl=iris_train_labels, k=3)
</pre>
<p>The <code>knn()</code> function takes the training set, the test set, a vector with the true labels of the training observations, and the number of neighbours <em>k</em>; it returns the predicted labels for the test set.</p>
<p>You can read about the error rate with varying <em>k</em> <a href="http://www.learnbymarketing.com/tutorials/k-nearest-neighbors-in-r-example/">here</a>.</p>
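<p>The effect of <em>k</em> can also be explored directly. The sketch below repeats the split from above so it runs on its own, and records the test-set accuracy for <em>k</em> = 1..15 (<code>accs</code> is a name made up for this sketch).</p>
<pre class="prettyprint">
library(class)

# Same 70/30 split as above
set.seed(1234)
indexes <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
iris_train <- iris[indexes == 1, 1:4]
iris_test  <- iris[indexes == 2, 1:4]
iris_train_labels <- iris[indexes == 1, 5]
iris_test_labels  <- iris[indexes == 2, 5]

# Test-set accuracy for each candidate k
accs <- sapply(1:15, function(k) {
  pred <- knn(train = iris_train, test = iris_test,
              cl = iris_train_labels, k = k)
  mean(pred == iris_test_labels)
})
data.frame(k = 1:15, accuracy = round(accs, 3))
</pre>
<p>Plotting <code>accs</code> against <em>k</em> gives a quick picture of where accuracy peaks before larger neighbourhoods start to oversmooth the decision boundary.</p>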
<p>To evaluate the model, use <code>CrossTable()</code> function in <code>gmodels</code> package.</p>
<pre class="prettyprint">
#install.packages("gmodels")
library("gmodels")
CrossTable(x=iris_test_labels, y=iris_mdl, prop.chisq = FALSE)
</pre>
<p>Output:</p>
<div style="text-align:center; margin-bottom: 5px;">
<a href="http://technokarak.com/complete-tutorial-of-knn-classification-algorithm-using-r-programming.html"><img alt="r_knn_output.png" src="http://en.proft.com.ua/media/science/r_knn_output.png"></a><br/>
</div>
<p>As you can see from the table above, the model makes only one mistake, predicting <code>versicolor</code> instead of <code>virginica</code>; all the remaining flowers are predicted correctly.</p>
<p>Since the data set contains only 150 observations, the model makes only one mistake; on a larger data set there may be more. In that case we could try a different approach: use <em>z-score standardization</em> instead of <em>min-max normalization</em>, and vary the value of <em>k</em> to see the impact on the accuracy of the predictions.</p>
<p>We would like to determine the misclassification rate for a given value of <em>k</em>. First, build a confusion matrix. When the values of the classification are plotted in a <em>n</em> x <em>n</em> matrix (2x2 in case of binary classification), the matrix is called the <em>confusion matrix</em>. Each column of the matrix represents the number of predictions of each class, while each row represents the instances in the actual class.</p>
<pre class="prettyprint">
CM = table(iris_test_labels, iris_mdl)
</pre>
<p>Second, compute the accuracy from the diagonal of that table. Applying the <code>diag</code> function selects the diagonal elements, i.e., the number of points where the <em>k-nearest neighbour classification</em> agrees with the true classification, and the <code>sum</code> command simply adds these values up.</p>
<pre class="prettyprint">
accuracy = (sum(diag(CM)))/sum(CM)
</pre>
<p>The model has a high overall accuracy that is <em>0.9736842</em>.</p>
<p><b class="lb">Pros and Cons of k-NN algorithm</b></p>
<p><strong>Pros</strong>. The algorithm is highly unbiased in nature and makes no prior assumptions about the underlying data. Being simple and effective, it is easy to implement and has gained good popularity.</p>
<p><strong>Cons</strong>. Indeed it is simple, but the <em>k-NN</em> algorithm has drawn a lot of flak for being extremely simple! If we take a deeper look, it doesn't really create a model, since there is no abstraction process involved. Yes, the training phase is very fast because the data is stored verbatim (hence "lazy learner"), but prediction time is relatively high, and useful insights are sometimes missing. Therefore, using this algorithm requires time to be invested in data preparation (especially treating missing data and categorical features) to obtain a robust model.</p>Sun, 22 Jan 2017 00:00:00 +0200http://en.proft.me/2017/01/22/classification-using-k-nearest-neighbors-r/Charts in R by usagehttp://en.proft.me/2016/12/3/charts-r-usage/<p><img src="http://en.proft.me/media/science/r_charts.png" alt="r_charts.png" class="right" width="100"></p>
<p>Every data mining project is incomplete without proper data visualization. From a functional point of view, the following are the graphs and charts a data scientist can use to convey information to the audience.</p>
<p>Here is a simple diagram that suggests how to choose the right chart for a presentation.</p>
<div style="text-align:center; margin-bottom: 10px;">
<img alt="R charts suggests" src="http://en.proft.me/media/science/r_charts_suggests.png" width="95%" />
</div>
<p>Let's initiate data and colors</p>
<pre class="prettyprint">
dataX <- sample(1:50, 20, replace=T)
dataY <- sample(1:50, 20, replace=T)
dataZ <- sample(1:5, 20, replace=T)
palette <- c("#1f77b4", '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2')
c1 = palette[1]
c2 = palette[2]
c3 = palette[3]
c4 = "#FFFFFF"
</pre>
<p><b class="lb">Comparisons between variables</b></p>
<p>Basically, when it is needed to represent two or more properties within a variable, then the following charts are used:</p>
<p><a name="bar"></a>
<strong>Bar chart</strong></p>
<p>A bar plot is a chart with rectangular bars with lengths proportional to the values that they represent. The bars can be plotted either vertically or horizontally. A simple bar chart can be created in R with the <code>barplot</code> function. The example below plots the random <code>dataX</code> values generated above.</p>
<pre class="prettyprint">
barplot(dataX, border=c3, col=c3)
# change font size
barplot(dataX, border=c3, col=c3, cex.axis=0.9, names.arg = dataX)
# plot line
abline(h=10, col="Red", lty=5)
</pre>
<div style="text-align:center; margin-bottom: 10px;">
<img alt="r_chart_bar.png" src="http://en.proft.me/media/science/r_chart_bar.png" />
</div>
<p><a name="box"></a>
<strong>Box plot</strong></p>
<p>A box plot is a chart that illustrates groups of numerical data through the use of quartiles. A simple box plot can be created in R with the <code>boxplot</code> function.</p>
<pre class="prettyprint">
boxplot(dataX, border=c2, col=c3, cex.axis=0.9)
</pre>
<div style="text-align:center; margin-bottom: 10px;">
<img alt="r_chart_box.png" src="http://en.proft.me/media/science/r_chart_box.png" />
</div>
<p><a name="bubble"></a>
<strong>Bubble chart</strong></p>
<p>The bubble chart is a variant of the scatterplot. As in the scatterplot, points are plotted on a chart area (typically an x-y grid). Two quantitative variables are mapped to the x and y axes, and a third quantitative variable is mapped to the size of each point.</p>
<pre class="prettyprint">
symbols(dataX, dataY, circles=dataZ, fg=c2, bg=c3, inches=0.5, xlab="Data X")
text(dataX, dataY, dataZ, col=c4)
</pre>
<div style="text-align:center; margin-bottom: 10px;">
<img alt="r_chart_bubble.png" src="http://en.proft.me/media/science/r_chart_bubble.png" />
</div>
<p><a name="hist"></a>
<strong>Histogram</strong></p>
<p>A histogram is a graphical representation of the distribution of data. It represents the frequencies of values of a variable bucketed into ranges; the height of each bar is the number of values that fall into that range.</p>
<p>It allows you to easily see where a relatively large amount of the data is situated and where there is very little data to be found. In other words, you can see where the middle is in your data distribution, how close the data lie around this middle and where possible outliers are to be found. A simple histogram chart can be created in R with the <code>hist</code> function.</p>
<pre class="prettyprint">
hist(dataX, border=c4, col=c3, xlab="Data X")
# add grid
grid()
</pre>
<div style="text-align:center; margin-bottom: 10px;">
<img alt="r_chart_hist.png" src="http://en.proft.me/media/science/r_chart_hist.png" />
</div>
<p><a name="line"></a>
<strong>Line graph</strong></p>
<p>A line chart is a graph that connects a series of points by drawing line segments between them. The points are ordered by one of their coordinates (usually the x-coordinate). Line charts are usually used for identifying trends in data, so the line is often drawn chronologically. The <code>plot()</code> function in R is used to create the line graph.</p>
<pre class="prettyprint">
plot(dataX, type="o", col=c3, xlab="Time")
legend("bottomleft", title="DataX vs Time", legend=c("data and time"), fill=c3, horiz=T, bty="n")
</pre>
<p>A single character passed as the <code>type</code> argument defines the kind of line chart to be plotted.</p>
<ul>
<li>p = Points</li>
<li>l = Line</li>
<li>b = Lines and Points</li>
<li>o = Overplotted</li>
<li>h = "Histogram like" lines</li>
<li>s = Steps</li>
</ul>
<div style="text-align:center; margin-bottom: 10px;">
<img alt="r_chart_line.png" src="http://en.proft.me/media/science/r_chart_line.png" />
</div>
<p><a name="stacked"></a>
<strong>Stacked bar chart</strong></p>
<p>Stacked bar plots, or graphs that depict conditional distributions of data, are great for seeing a level-wise breakdown of the data. We can create a bar chart with groups of bars, and stacks in each bar, by using a matrix as input.</p>
<p>More than two variables are represented as a matrix, which is used to create grouped and stacked bar charts.</p>
<pre class="prettyprint">
months = c("Mar","Apr","May")
dd = data.frame(v1=dataX, v2=dataY, v3=dataZ)
mm = as.matrix(dd)
barplot(mm[1:3,], main = "Total revenue", names.arg=months, xlab="month", ylab="revenue", col=palette)
</pre>
<div style="text-align:center; margin-bottom: 10px;">
<img alt="r_chart_stacked.png" src="http://en.proft.me/media/science/r_chart_stacked.png" />
</div>
<p><a name="radar"></a>
<strong>Radar chart (Spider chart)</strong></p>
<p>A radar chart is a graphical method of displaying multivariate data in the form of a two-dimensional chart of three or more quantitative variables represented on axes starting from the same point. This makes them useful for seeing which variables have similar values or if there are any outliers amongst each variable. Radar Charts are also useful for seeing which variables are scoring high or low within a dataset, making them ideal for displaying performance.</p>
<p>The radar chart is also known as a web chart, spider chart, star chart, star plot, cobweb chart, irregular polygon, polar chart, or Kiviat diagram.</p>
<pre class="prettyprint">
install.packages("fmsb")
library("fmsb")
dd = data.frame(
datax=dataX[(3:5)],
datay=dataY[(3:5)],
dataz=dataZ[(3:5)]
)
colnames(dd) = c("Data X", "Data Y", "Data Z")
maxRow = rep(max(dd), ncol(dd))
minRow = rep(min(dd), ncol(dd))
dd = rbind(maxRow, minRow, dd)
colorsRC = c(rgb(0.2,0.5,0.5,0.4), rgb(0.8,0.2,0.5,0.4), rgb(0.7,0.5,0.1,0.4))
# plot all data
radarchart(dd, vlcex=0.8, pfcol=colorsRC, cglcol="#999999")
# plot one data
radarchart(dd[(1:3),], vlcex=0.8, pfcol=colorsRC, cglcol="#999999")
</pre>
<div style="text-align:center; margin-bottom: 10px;">
<img alt="r_chart_radar.png" src="http://en.proft.me/media/science/r_chart_radar.png" />
</div>
<p><a href="http://blog.scottlogic.com/2011/09/23/a-critique-of-radar-charts.html">A critique of Radar charts</a></p>
<p><a name="pie"></a>
<strong>Pie chart</strong></p>
<p>A pie chart is a representation of values as slices of a circle with different colors. The slices are labeled, and the number corresponding to each slice is also represented in the chart. <em>Bar plots</em> typically illustrate the same data in a format that is simpler for the viewer to comprehend, so it's recommended to avoid pie charts if at all possible.</p>
<p>In R the pie chart is created using the <code>pie()</code> function which takes positive numbers as a vector input. The additional parameters are used to control labels, color, title etc.</p>
<pre class="prettyprint">
pie(dataX, col=rainbow(length(dataX)))
</pre>
<div style="text-align:center; margin-bottom: 10px;">
<img alt="r_chart_pie.png" src="http://en.proft.me/media/science/r_chart_pie.png" />
</div>
<p><b class="lb">Testing/viewing proportions</b></p>
<p>It is used when there is a need to display the proportion of contribution by one category to the overall level:</p>
<ul>
<li><a href="#bubble">Bubble chart</a></li>
<li><em>Bubble map</em> is useful for visualization of data points on a map.</li>
<li><a href="#stacked">Stacked bar chart</a></li>
<li><a href="http://analyticstraining.com/2014/how-to-create-a-word-cloud-in-r/">Word cloud</a></li>
</ul>
<p><b class="lb">Relationship between variables</b></p>
<p>Association between two or more variables can be shown using the following charts:</p>
<p><a name="scatter"></a>
<strong>Scatterplot</strong></p>
<p>A <em>scatter diagram</em> examines the relationship between data collected for two different characteristics. Although the scatter diagram cannot determine the cause of such a relationship, it can show whether or not the relationship exists, and if so, how strong it is. Quantifying such a relationship formally is the job of <em>regression analysis</em>.</p>
<p>The data is displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis. A simple scatter plot can be created in R with the <code>plot</code> function.</p>
<pre class="prettyprint">
plot(dataX, dataY, pch=19, axes=T, frame.plot=F, col=ifelse(dataX > 25, c1, c2))
</pre>
<div style="text-align:center; margin-bottom: 10px;">
<img alt="r_chart_scatter.png" src="http://en.proft.me/media/science/r_chart_scatter.png" />
</div>
<ul>
<li><a href="#bar">Bar chart</a></li>
<li><a href="#radar">Radar chart</a></li>
<li><a href="#line">Line graph</a></li>
<li><a href="https://nrich.maths.org/7288">Tree diagram</a> is simply a way of representing a sequence of events. Tree diagrams are particularly useful in probability since they record all possible outcomes in a clear and uncomplicated manner.</li>
</ul>
<p><b class="lb">Variable hierarchy</b></p>
<p>When it is required to display the order in the variables, such as a sequence of variables, then the following charts are used:</p>
<ul>
<li>Tree diagram</li>
<li><em>Tree map</em> is similar to a pie chart in that it visually displays proportions by varying the area of a shape. A treemap has two useful advantages over a pie chart. First, you can display a lot more elements. In a pie chart, there is an upper-limit to the number of wedges that can be comfortably added to the circle. In a treemap, you can display hundreds, or thousands, of pieces of information. Secondly, a treemap allows you to arrange your data elements hierarchically. That is, you can group your proportions using categorical variables in your data.</li>
</ul>
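<p>As a sketch, a simple tree map can be drawn with the third-party <code>treemap</code> package (assuming it is installed); the data frame and its column names below are made up for illustration.</p>
<pre class="prettyprint">
# install.packages("treemap")
library(treemap)

# Hypothetical two-level hierarchy: categories containing items
dd <- data.frame(
  category = c("A", "A", "B", "B", "C"),
  item     = c("a1", "a2", "b1", "b2", "c1"),
  value    = c(10, 5, 20, 8, 12)
)

# index gives the hierarchy (category, then item); vSize maps value to area
treemap(dd, index = c("category", "item"), vSize = "value")
</pre>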
<p><b class="lb">Data with locations</b></p>
<p>When a dataset contains the geographic location of different cities, countries, and states names, or longitudes and latitudes, then the following charts can be used to display visualization:</p>
<ul>
<li><a href="#bubble">Bubble map</a></li>
<li>Geo mapping</li>
<li>Dot map</li>
<li>Flow map</li>
</ul>
<p><b class="lb">Contribution analysis or part-to-whole</b></p>
<p>When it is required to display constituents of a variable and contribution of each categorical level towards the overall variable, then the following charts are used:</p>
<ul>
<li><a href="#pie">Pie chart</a></li>
<li><a href="#stacked">Stacked bar chart</a></li>
<li><em>Doughnut chart</em>. Just like a pie chart, a doughnut chart shows the relationship of parts to a whole, but a doughnut chart can contain more than one data series. Each data series that you plot in a doughnut chart adds a ring to the chart. The first data series is displayed in the center of the chart. Doughnut charts display data in rings, where each ring represents a data series. If percentages are displayed in data labels, each ring will total 100%.</li>
</ul>
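<p>Base R has no built-in doughnut chart function; one common sketch builds it with the <code>ggplot2</code> package (assumed installed) by drawing a stacked bar in polar coordinates and leaving the centre empty. The data below is made up for illustration.</p>
<pre class="prettyprint">
library(ggplot2)

dd <- data.frame(group = c("A", "B", "C"), value = c(30, 45, 25))

p <- ggplot(dd, aes(x = 2, y = value, fill = group)) +
  geom_col(width = 1, color = "white") +  # one stacked bar
  coord_polar(theta = "y") +              # wrap it into a ring
  xlim(0.5, 2.5) +                        # empty inner space creates the hole
  theme_void()
print(p)
</pre>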
<p><b class="lb">Statistical distribution</b></p>
<p>In order to understand the variation in a variable across different dimensions, represented by another categorical variable, the following charts are used:</p>
<ul>
<li><a href="#box">Box plot</a></li>
<li><a href="#bubble">Bubble chart</a></li>
<li><a href="#hist">Histogram</a></li>
</ul>
<p><strong>Stem and leaf plot</strong></p>
<p>A stem-and-leaf display is a device for presenting quantitative data in a graphical format, similar to a histogram, to assist in visualizing the shape of a distribution.</p>
<p>Unlike histograms, stem-and-leaf displays retain the original data to at least two significant digits, and put the data in order, thereby easing the move to order-based inference and non-parametric statistics.</p>
<p>A basic stem-and-leaf display contains two columns separated by a vertical line. The left column contains the stems and the right column contains the leaves.</p>
<p>A simple stem-and-leaf can be created in R with the <code>stem</code> function. <code>stem</code> produces a stem-and-leaf plot of the values in <code>x</code>. The parameter <code>scale</code> can be used to expand the scale of the plot.</p>
<pre class="prettyprint">
stem(dataX)
</pre>
<p><b class="lb">Unseen patterns</b></p>
<p>For pattern recognition and relative importance of data points on different dimensions of a variable, the following charts are used:</p>
<ul>
<li><a href="#bar">Bar chart</a></li>
<li><a href="#box">Box plot</a></li>
<li><a href="#bubble">Bubble chart</a></li>
<li><a href="#scatter">Scatterplot</a></li>
<li><em>Spiral plots</em> are ideal for showing large data sets, usually to show trends over a long time period. The graph begins at the center of the spiral and progresses outwards. Spiral plots are versatile and can use bars, lines or points displayed along the spiral path, which makes them great for displaying periodic patterns. Colour can be assigned to each period to break the periods up and to allow comparison between them: for example, if we were to show data over a year, we could assign a colour to each month on the graph.</li>
<li><a href="#line">Line chart</a></li>
</ul>
<p><b class="lb">Spread of values or range</b></p>
<p>The following charts only give the spread of the data points across different bounds:</p>
<ul>
<li>Span chart</li>
<li><a href="#box">Box plot</a></li>
<li><a href="#hist">Histogram</a></li>
</ul>
<p><b class="lb">Textual data representation</b></p>
<p>This is a very interesting way of representing the textual data:</p>
<ul>
<li>Word cloud</li>
</ul>
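<p>As a sketch, the third-party <code>wordcloud</code> package (assuming it is installed) can draw one from a vector of words and their frequencies; the words and counts below are made up for illustration.</p>
<pre class="prettyprint">
# install.packages("wordcloud")
library(wordcloud)

words <- c("linux", "python", "vim", "data", "mining", "model", "chart")
freqs <- c(50, 40, 30, 25, 20, 15, 10)

# Word size is proportional to frequency
wordcloud(words = words, freq = freqs, min.freq = 1, colors = palette())
</pre>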
<p><b class="lb">Useful links</b></p>
<ul>
<li><a href="http://davetang.org/muse/2015/02/12/animated-plots-using-r/">Animated plots using R</a></li>
</ul>Sat, 03 Dec 2016 00:00:00 +0200http://en.proft.me/2016/12/3/charts-r-usage/