For the following examples, I am using the famous iris dataset published by Sir Ronald Fisher in 1936. The file is available from the UCI Machine Learning Repository.

Data Set Information:

This is perhaps the best-known database to be found in the pattern recognition literature. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. Predicted attribute: class of iris plant.

This data differs from the data presented in Fisher's article (identified by Steve Chadwick, spchadwick '@' espeedaz.net). The 35th sample should be: 4.9,3.1,1.5,0.2,"Iris-setosa" where the error is in the fourth feature. The 38th sample should be: 4.9,3.6,1.4,0.1,"Iris-setosa" where the errors are in the second and third features.
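
The two corrections above can be checked against any copy of the data. The sketch below is only an illustration: it assumes the copy being inspected is R's built-in `iris` data frame (not the UCI file itself), and `expected` is a hypothetical helper table built from the two corrected rows quoted above.

```r
# Corrected rows quoted in the data set notes (samples 35 and 38).
expected <- data.frame(
  row          = c(35, 38),
  Sepal.Length = c(4.9, 4.9),
  Sepal.Width  = c(3.1, 3.6),
  Petal.Length = c(1.5, 1.4),
  Petal.Width  = c(0.2, 0.1)
)

# Assumption: we inspect R's built-in `iris`, not the UCI file.
measure_cols <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
actual <- iris[expected$row, measure_cols]

# For each of the two rows, TRUE when all four measurements
# agree with the corrected values (within a small tolerance).
diffs <- abs(as.matrix(actual) - as.matrix(expected[, measure_cols]))
rows_match <- rowSums(diffs < 1e-8) == length(measure_cols)

print(actual)
print(rows_match)
```

A copy read from the UCI file could be checked the same way by swapping it in for `iris`, provided its columns are in the same order.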

Attribute Information:

1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
   - Iris Setosa
   - Iris Versicolour
   - Iris Virginica
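
The code in this article refers to a data frame called `dat`. As a minimal sketch of how such an object could be prepared, the block below uses R's built-in `iris` data frame as a stand-in for the UCI file. The column names `sepal_length` and `sepal_width` are assumptions chosen to match `petal_length` and `petal_width`, which the later code uses, and `class` follows attribute 5 above.

```r
# Assumption: R's built-in `iris` stands in for the UCI file.
dat <- iris

# Rename the columns to follow the attribute list above.
# `sepal_length` and `sepal_width` are assumed names.
names(dat) <- c("sepal_length", "sepal_width",
                "petal_length", "petal_width", "class")

# If you have downloaded the UCI file instead (it has no header row),
# something like this should work; "iris.data" is a hypothetical local path:
# dat <- read.csv("iris.data", header = FALSE,
#                 col.names = c("sepal_length", "sepal_width",
#                               "petal_length", "petal_width", "class"))

str(dat)          # structure: 150 observations of 5 variables
table(dat$class)  # number of instances in each class
```

Note that the class labels in the built-in data are `setosa`, `versicolor`, and `virginica`, while the UCI file uses labels such as `Iris-setosa`.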

Histograms

Histograms are widely used in statistical data analysis.

A histogram is a two-dimensional graphical display made of contiguous (adjoining) rectangles, one per bin (interval) on the horizontal axis. The height of each rectangle is the frequency with which values in that bin were observed in the dataset. The vertical axis is labeled either frequency or relative frequency (or probability). With R's default settings, an observation that falls on the boundary between two bins is allocated to the lower bin.
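
To make the boundary rule concrete, here is a small sketch on made-up numbers (the vector `x` and the `breaks` are illustrative, not taken from the iris data). It uses `hist()` with `plot = FALSE` to return the bin counts without drawing anything, and compares the default `right = TRUE` with `right = FALSE`.

```r
# Toy data chosen so that some values sit exactly on bin boundaries.
x <- c(1, 2, 2, 3)
breaks <- c(0, 1, 2, 3)

# Default (right = TRUE): bins are closed on the right, i.e. (1,2], (2,3],
# so a boundary value is counted in the lower bin.
h_right <- hist(x, breaks = breaks, plot = FALSE)

# right = FALSE: bins are closed on the left, i.e. [0,1), [1,2),
# so a boundary value is counted in the upper bin.
h_left <- hist(x, breaks = breaks, right = FALSE, plot = FALSE)

h_right$counts  # 1 2 1
h_left$counts   # 0 1 3

# Relative frequency: each count divided by the number of observations.
rel_freq <- h_right$counts / sum(h_right$counts)
rel_freq        # 0.25 0.50 0.25
```

Because `include.lowest = TRUE` by default, the outermost boundary value (here 3 when `right = FALSE`) is still counted in the last bin rather than dropped.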

Information conveyed:

Histograms provide a view of the data density. Higher bars mark the ranges where the data are relatively more common.

Histograms are especially convenient for understanding the shape of the data distribution.

A histogram can be created in R using the function hist().

# create a histogram for petal length
# (assumes `dat` is a data frame of the iris data with a petal_length column)
hist(x = dat$petal_length,
     col = "lightblue", # adds color to the bars
     main = "Distribution of Petal Length (cm)", # adds a main title
     xlab = "Length (cm)", # relabels the x axis
     ylab = "Frequency", # relabels the y axis
     ylim = c(0, 40)) # adjusts the y axis

Boxplots

A box plot summarizes a data set using five statistics while also plotting unusual observations (outliers).

From: Barr, Diez, Rundel (2016)

- A dark line denotes the median, which splits the data in half.
- A rectangular box represents the middle 50% of the data.
- The length of the box represents the interquartile range (IQR).
- The boundaries of the box are:
  - the first quartile (the 25th percentile, Q1)
  - the third quartile (the 75th percentile, Q3)
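
The quantities in this list can be computed directly. The following sketch again assumes R's built-in `iris` as a stand-in for the data. The 1.5 × IQR fences are the convention `boxplot()` uses by default (`range = 1.5`) to flag outliers; they are not spelled out in the list above. Note also that `boxplot()` draws the box at Tukey's hinges, which can differ slightly from the `quantile()` values used here.

```r
# Assumption: R's built-in `iris` stands in for the article's data.
x <- iris$Petal.Length

q1  <- quantile(x, 0.25, names = FALSE)  # first quartile (25th percentile)
q3  <- quantile(x, 0.75, names = FALSE)  # third quartile (75th percentile)
iqr <- q3 - q1                           # length of the box; same as IQR(x)

# Conventional fences: points more than 1.5 * IQR beyond the box
# are treated as unusual observations (outliers).
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
outliers <- x[x < lower_fence | x > upper_fence]

c(Q1 = q1, median = median(x), Q3 = q3, IQR = iqr)
outliers  # values outside the fences, if any
```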

fivenum(dat$petal_length) # five-number summary: minimum, lower hinge (about Q1), median, upper hinge (about Q3), maximum

[1] 1.00 1.60 4.35 5.10 6.90

boxplot(x = dat$petal_length,
        horizontal = TRUE,
        col = "lightblue",
        main = "Distribution of Petal Length",
        xlab = "Length (cm)")
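
Since the data contain three classes, one natural variation is to draw one box per class using the formula interface of `boxplot()`. This is a sketch under assumptions: it rebuilds `dat` from R's built-in `iris` so that it runs on its own, and the title and axis labels are illustrative.

```r
# Assumption: rebuild `dat` from R's built-in `iris` so the block is self-contained.
dat <- iris
names(dat) <- c("sepal_length", "sepal_width",
                "petal_length", "petal_width", "class")

# One box per class: petal length split by the class column.
b <- boxplot(petal_length ~ class, data = dat,
             col = "lightblue",
             main = "Petal Length by Class",
             xlab = "Class",
             ylab = "Length (cm)")

b$names  # group labels, in the order the boxes are drawn
b$stats  # one column of five box statistics per class
```

The value returned by `boxplot()` holds the statistics used to draw each box, which is handy when you want the numbers as well as the picture.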

Scatterplots

A scatterplot is a tool to visualize and study the relationship between numerical variables. It provides a case-by-case view of data for two numerical variables. Each point in a scatterplot represents a single case.

Each case in the dataset is plotted as a point whose horizontal and vertical coordinates are its values for the two variables.

The following R code generates a scatterplot to visualize the relationship between petal length and petal width.

plot(x = dat$petal_length, 
     y = dat$petal_width, 
     pch=16, 
     main="Petal Length vs. Petal Width", 
     xlab = "Petal Length (cm)", 
     ylab= "Petal Width (cm)")
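
As a possible extension, the points can be colored by class and the strength of the linear relationship summarized with the Pearson correlation. The sketch below assumes `dat` is rebuilt from R's built-in `iris`; the color mapping and legend position are illustrative choices.

```r
# Assumption: rebuild `dat` from R's built-in `iris` so the block is self-contained.
dat <- iris
names(dat) <- c("sepal_length", "sepal_width",
                "petal_length", "petal_width", "class")

# Pearson correlation between the two plotted variables.
r <- cor(dat$petal_length, dat$petal_width)

# Color each point by its class (factor levels mapped to colors 1, 2, 3).
class_cols <- as.integer(dat$class)
plot(x = dat$petal_length,
     y = dat$petal_width,
     pch = 16,
     col = class_cols,
     main = "Petal Length vs. Petal Width",
     xlab = "Petal Length (cm)",
     ylab = "Petal Width (cm)")
legend("topleft",
       legend = levels(dat$class),
       col = seq_along(levels(dat$class)),
       pch = 16)

r  # values close to 1 indicate a strong positive linear relationship
```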

Francis A. Méndez Mediavilla, Ph.D.

Department of Information Systems and Analytics, Texas State University

Visualization Using R Base
