Every data mining project is incomplete without proper data visualization. From a functional point of view, the following are the graphs and charts which a data scientist would like the audience to look at to infer the information:
There is a simple diagram which suggests how to choose the right chart for a presentation.
Let's initialize the data and colors:
dataX <- sample(1:50, 20, replace=TRUE)
dataY <- sample(1:50, 20, replace=TRUE)
dataZ <- sample(1:5, 20, replace=TRUE)
palette <- c("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "#8c564b", "#e377c2")
c1 = palette[1]
c2 = palette[2]
c3 = palette[3]
c4 = "#FFFFFF" # white; a hex color needs six digits
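Because sample draws random values, every run of the initialization produces different data and therefore different charts. A minimal sketch of making the draws reproducible, assuming a fixed seed is acceptable here (the seed value 42 and the names firstDraw and secondDraw are arbitrary choices for illustration, not part of the original):

```r
# Assumption: 42 is an arbitrary seed chosen for illustration
set.seed(42)
firstDraw <- sample(1:50, 20, replace = TRUE)

# Resetting the same seed reproduces exactly the same sample
set.seed(42)
secondDraw <- sample(1:50, 20, replace = TRUE)

identical(firstDraw, secondDraw)
```

Calling set.seed once before the initialization is enough to make all the charts repeatable between sessions.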
Comparisons between variables
When two or more properties of a variable need to be represented, the following charts are used:
A bar plot is a chart with rectangular bars whose lengths are proportional to the values they represent. The bars can be plotted either vertically or horizontally. A simple bar chart can be created in R with the barplot function. In the example below, the random sample dataX is plotted, and a dashed horizontal reference line is added at the value 10.
barplot(dataX, border=c3, col=c3)
# change font size
barplot(dataX, border=c3, col=c3, cex.axis=0.9, names.arg = dataX)
# plot line
abline(h=10, col="Red", lty=5)
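The text notes that bars can also be drawn horizontally. A small sketch of that variant, assuming the same kind of random data as above (the seed and the label settings are illustrative choices):

```r
set.seed(42)  # assumption: arbitrary seed so the example is reproducible
dataX <- sample(1:50, 20, replace = TRUE)

# horiz = TRUE draws the bars horizontally; las = 1 keeps axis labels upright
# barplot() returns the bar midpoints (y positions when horiz = TRUE)
mids <- barplot(dataX, horiz = TRUE, col = "#2ca02c", border = "#2ca02c",
                names.arg = seq_along(dataX), las = 1, cex.names = 0.7)

# values now run along the x axis, so the reference line is vertical
abline(v = 10, col = "red", lty = 5)
```

The returned midpoints are useful for placing text labels next to individual bars.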
A box plot is a chart that illustrates groups of numerical data through their quartiles. A simple box plot can be created in R with the boxplot function.
boxplot(dataX, border=c2, col=c3, cex.axis=0.9)
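To see the numbers behind the box, the statistics can be computed without drawing anything. A minimal sketch, assuming the same kind of random data (note that R's box uses hinges, which are close to but not always identical to the quartiles):

```r
set.seed(42)  # assumption: arbitrary seed so the example is reproducible
dataX <- sample(1:50, 20, replace = TRUE)

# plot = FALSE returns the summary statistics without drawing
bp <- boxplot(dataX, plot = FALSE)

# stats holds: lower whisker, lower hinge, median, upper hinge, upper whisker
bp$stats

# values that would be drawn as individual points beyond the whiskers
# (may be empty)
bp$out
```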
The bubble chart is a variant of the scatterplot. As in the scatterplot, points are plotted on a chart area (typically an x-y grid). Two quantitative variables are mapped to the x and y axes, and a third quantitative variable is mapped to the size of each point.
symbols(dataX, dataY, circles=dataZ, fg=c2, bg=c3, inches=0.5, xlab="Data X")
text(dataX, dataY, dataZ, col=c4)
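The circles argument of symbols is interpreted as a radius, so passing the raw values makes the bubble area grow with the square of the value. A hedged sketch of one common alternative, scaling the radius by a square root so that the area tracks the third variable (the name radii and the inches value are illustrative choices):

```r
set.seed(42)  # assumption: arbitrary seed so the example is reproducible
dataX <- sample(1:50, 20, replace = TRUE)
dataY <- sample(1:50, 20, replace = TRUE)
dataZ <- sample(1:5, 20, replace = TRUE)

# circles= is read as radii; the square root makes the area proportional to dataZ
radii <- sqrt(dataZ / pi)

symbols(dataX, dataY, circles = radii, inches = 0.3,
        fg = "#ff7f0e", bg = "#2ca02c", xlab = "Data X", ylab = "Data Y")
text(dataX, dataY, dataZ, col = "#FFFFFF", cex = 0.7)
```

Which scaling is appropriate depends on whether readers are expected to compare diameters or areas.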
A histogram is a graphical representation of the distribution of data. It shows the frequencies of the values of a variable bucketed into ranges. The height of each bar represents the number of values that fall in that range.
It allows you to easily see where a relatively large amount of the data is situated and where there is very little data to be found. In other words, you can see where the middle of your data distribution is, how closely the data lie around this middle, and where possible outliers are to be found. A simple histogram chart can be created in R with the hist function.
hist(dataX, border=c4, col=c3, xlab="Data X")
# add grid
grid()
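The bucketing described above can be inspected directly. A minimal sketch, assuming bins of width 10 (the bin edges are an illustrative choice, not from the original):

```r
set.seed(42)  # assumption: arbitrary seed so the example is reproducible
dataX <- sample(1:50, 20, replace = TRUE)

# Explicit bin edges: [0,10], (10,20], (20,30], (30,40], (40,50]
h <- hist(dataX, breaks = seq(0, 50, by = 10), plot = FALSE)

h$breaks  # bin edges
h$counts  # number of values that fall in each bin
```

Passing the same breaks to a plotting call of hist draws bars whose heights are exactly these counts.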
A line chart is a graph that connects a series of points by drawing line segments between them. The points are ordered by one of their coordinates (usually the x-coordinate). Line charts are typically used to identify trends in data, so the points are often drawn in chronological order. The plot() function in R is used to create the line graph.
plot(dataX, type="o", col=c3, xlab="Time")
legend("bottomleft", title="DataX vs Time", legend=c("data and time"), fill=c3, horiz=TRUE, bty="n")
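A single character passed as the type argument defines the type of line chart to be plotted: "p" draws points only, "l" draws lines only, and "o" draws both points and lines.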
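A short sketch comparing several values of the type argument side by side, assuming the same kind of random data (the 2 x 2 layout and the names types and ty are illustrative choices):

```r
set.seed(42)  # assumption: arbitrary seed so the example is reproducible
dataX <- sample(1:50, 20, replace = TRUE)

# Four values of type and a short description of each
types <- c(p = "points", l = "lines", b = "both, with gaps", o = "both, overplotted")

op <- par(mfrow = c(2, 2))  # draw the four charts in a 2 x 2 grid
for (ty in names(types)) {
  plot(dataX, type = ty, col = "#2ca02c", xlab = "Time",
       main = paste0('type = "', ty, '" (', types[[ty]], ")"))
}
par(op)  # restore the previous layout
```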
Stacked bar plots, which depict conditional distributions of data, give a level-wise breakdown of the data. We can create a bar chart with groups of bars and stacks in each bar by using a matrix as input values.
More than two variables are represented as a matrix, which is used to create the grouped bar chart and the stacked bar chart.
months = c("Mar","Apr","May")
dd = data.frame(v1=dataX, v2=dataY, v3=dataZ)
mm = as.matrix(dd)
barplot(mm[1:3,], main = "Total revenue", names.arg=months, xlab="month", ylab="revenue", col=palette)
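The same matrix can be drawn stacked or grouped. A minimal sketch of both forms, assuming the same kind of random data (the three-color vector cols and the legend labels are illustrative choices):

```r
set.seed(42)  # assumption: arbitrary seed so the example is reproducible
dataX <- sample(1:50, 20, replace = TRUE)
dataY <- sample(1:50, 20, replace = TRUE)
dataZ <- sample(1:5, 20, replace = TRUE)

# First three rows only: one column per bar, one row per stack level
mm <- as.matrix(data.frame(v1 = dataX, v2 = dataY, v3 = dataZ))[1:3, ]
months <- c("Mar", "Apr", "May")
cols <- c("#1f77b4", "#ff7f0e", "#2ca02c")

# Stacked (default): rows are stacked inside each bar
stacked <- barplot(mm, names.arg = months, col = cols, main = "Stacked")

# Grouped: beside = TRUE draws the rows next to each other
grouped <- barplot(mm, names.arg = months, col = cols, beside = TRUE,
                   main = "Grouped")
legend("topright", legend = paste("row", 1:3), fill = cols, bty = "n")
```

The stacked form emphasizes the totals per bar, while the grouped form makes the individual levels easier to compare.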
A radar chart is a graphical method of displaying multivariate data in the form of a two-dimensional chart of three or more quantitative variables represented on axes starting from the same point. This makes them useful for seeing which variables have similar values or if there are any outliers amongst each variable. Radar Charts are also useful for seeing which variables are scoring high or low within a dataset, making them ideal for displaying performance.
The radar chart is also known as a web chart, spider chart, star chart, star plot, cobweb chart, irregular polygon, polar chart, or Kiviat diagram.
install.packages("fmsb")
library("fmsb")
dd = data.frame(
  datax=dataX[(3:5)],
  datay=dataY[(3:5)],
  dataz=dataZ[(3:5)]
)
colnames(dd) = c("Data X", "Data Y", "Data Z")
maxRow = rep(max(dd), ncol(dd))
minRow = rep(min(dd), ncol(dd))
dd = rbind(maxRow, minRow, dd)
colorsRC = c(rgb(0.2,0.5,0.5,0.4), rgb(0.8,0.2,0.5,0.4), rgb(0.7,0.5,0.1,0.4))
# plot all data
radarchart(dd, vlcex=0.8, pfcol=colorsRC, cglcol="#999999")
# plot one data
radarchart(dd[(1:3),], vlcex=0.8, pfcol=colorsRC, cglcol="#999999")
A pie chart is a representation of values as slices of a circle with different colors. The slices are labeled, and the numbers corresponding to each slice are also shown in the chart. Bar plots typically illustrate the same data but in a format that is simpler for the viewer to comprehend. As such, it is recommended to avoid pie charts where possible.
In R, the pie chart is created using the pie() function, which takes a vector of non-negative numbers as input. Additional parameters control the labels, colors, title, etc.
pie(dataX, col=rainbow(length(dataX)))
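A pie chart is easier to read when it has few slices and explicit labels. A hedged sketch, assuming the category frequencies of dataZ are what should be shown (the names counts, share and sliceLabels are illustrative):

```r
set.seed(42)  # assumption: arbitrary seed so the example is reproducible
dataZ <- sample(1:5, 20, replace = TRUE)

counts <- table(dataZ)                        # frequency of each distinct value
share <- as.numeric(counts) / sum(counts)     # proportion of the whole
sliceLabels <- paste0(names(counts), ": ", round(100 * share), "%")

pie(counts, labels = sliceLabels, col = rainbow(length(counts)),
    main = "Share of each dataZ value")
```

Because the percentages are rounded for display, the printed labels may not add up to exactly 100.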
Testing/viewing proportions
These charts are used when there is a need to display the proportion that one category contributes to the overall level:
Relationship between variables
Association between two or more variables can be shown using the following charts:
A scatter diagram examines the relationship between data collected for two different characteristics. Although the scatter diagram cannot determine the cause of such a relationship, it can show whether or not such a relationship exists and, if so, how strong it is. Fitting a model to the relationship shown in a scatter diagram is called regression analysis.
The data is displayed as a collection of points, with the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis. A simple scatter plot can be created in R with the plot function.
plot(dataX, dataY, pch=19, axes=T, frame.plot=F, col=ifelse(dataX > 25, c1, c2))
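The strength of a linear relationship can be quantified and drawn on top of the scatter plot. A minimal sketch using the Pearson correlation and a least-squares line, assuming a linear model is a reasonable summary (with independent random samples like these, the correlation is expected to be weak):

```r
set.seed(42)  # assumption: arbitrary seed so the example is reproducible
dataX <- sample(1:50, 20, replace = TRUE)
dataY <- sample(1:50, 20, replace = TRUE)

plot(dataX, dataY, pch = 19, col = "#1f77b4")

# Pearson correlation: strength of the linear relationship, between -1 and 1
r <- cor(dataX, dataY)

# Least-squares line for dataY as a function of dataX, drawn over the points
fit <- lm(dataY ~ dataX)
abline(fit, col = "#d62728", lty = 2)

r
coef(fit)  # intercept and slope
```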
Variable hierarchy
When it is required to display the order in the variables, such as a sequence of variables, then the following charts are used:
Data with locations
When a dataset contains geographic locations, such as the names of cities, states, and countries, or longitudes and latitudes, the following charts can be used to visualize it:
Contribution analysis or part-to-whole
When it is required to display constituents of a variable and contribution of each categorical level towards the overall variable, then the following charts are used:
Statistical distribution
In order to understand the variation in a variable across different dimensions, represented by another categorical variable, the following charts are used:
Stem and leaf plot
A stem-and-leaf display is a device for presenting quantitative data in a graphical format, similar to a histogram, to assist in visualizing the shape of a distribution.
Unlike histograms, stem-and-leaf displays retain the original data to at least two significant digits, and put the data in order, thereby easing the move to order-based inference and non-parametric statistics.
A basic stem-and-leaf display contains two columns separated by a vertical line. The left column contains the stems and the right column contains the leaves.
A simple stem-and-leaf display can be created in R with the stem function. stem produces a stem-and-leaf plot of the values in x. The parameter scale can be used to expand the scale of the plot.
stem(dataX)
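A short sketch of the scale parameter, assuming the same kind of random data (the value 2 is an illustrative choice):

```r
set.seed(42)  # assumption: arbitrary seed so the example is reproducible
dataX <- sample(1:50, 20, replace = TRUE)

stem(dataX)             # default scale = 1
stem(dataX, scale = 2)  # a larger scale spreads the values over more stems
```

The display is printed to the console as text, so no graphics device is needed.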
Unseen patterns
For pattern recognition and relative importance of data points on different dimensions of a variable, the following charts are used:
Spread of values or range
The following charts only give the spread of the data points across different bounds:
Textual data representation
This is a very interesting way of representing the textual data:
Useful links