Data visualization is fundamental to understanding patterns and distributions within datasets. While complex plots have their place, sometimes the simplest tools are the most effective. Enter the strip chart, also known as a one-dimensional scatter plot or dot plot. Particularly useful for smaller datasets or when comparing distributions across groups, strip charts offer a clear view of individual data points.1 This guide explores how to create and customize strip charts in R using the base graphics system.
What is a Strip Chart?
A strip chart displays numerical data along a single axis.4 Each data point is represented individually, typically as a dot or symbol. This direct representation makes strip charts excellent for visualizing the distribution of one-dimensional quantitative data, especially when the number of observations (N) is small, as it avoids the potential over-summarization seen in plots like histograms or box plots for limited data.2 They are particularly effective for:
- Visualizing the spread and density of data points.
- Identifying potential clusters, gaps, or outliers.
- Comparing distributions across different categories or groups.1
In R’s base graphics system, the primary function for creating these plots is stripchart().2
Getting Started: Your First Strip Chart
The stripchart() function, part of R’s base graphics, requires no external packages.8 Its most basic usage involves providing a numeric vector containing the data to be plotted.2
Let’s use the built-in mtcars dataset to create a simple strip chart showing the distribution of miles per gallon (MPG) for various car models.
# Load the mtcars dataset (comes with R)
data(mtcars)
# Basic strip chart of miles per gallon (mpg)
stripchart(mtcars$mpg,
           main = "Basic Strip Chart of Car MPG",
           xlab = "Miles Per Gallon (MPG)")
This code generates a horizontal strip chart where each point represents the MPG of one car model. The main argument sets the title, and xlab provides a label for the x-axis.4 Notice that stripchart() accepts various arguments for customization, which will be explored further.
However, even in this simple example, some points may overlap, making it difficult to discern the exact number of data points at certain values. This visual clutter arises from the default behavior of stripchart(), which uses method = "overplot".1 When multiple data points share the same or very close values, they are plotted directly on top of each other. This default is straightforward but often insufficient, especially for datasets with non-unique values or dense clusters, necessitating methods to handle this overlap effectively.
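To get a feel for how much overplotting hides in this example, a quick check can count the repeated values in mtcars$mpg. This is only a minimal sketch: the variable names n_points, n_unique, and n_hidden are illustrative, and the count covers exact ties only, whereas values that are merely close will also overlap visually.

```r
# Count how many mpg values coincide exactly with an earlier value;
# under method = "overplot" these points are drawn on top of one another
n_points <- length(mtcars$mpg)
n_unique <- length(unique(mtcars$mpg))
n_hidden <- sum(duplicated(mtcars$mpg))

cat("Observations:", n_points, "\n")
cat("Distinct values:", n_unique, "\n")
cat("Points hidden behind an identical value:", n_hidden, "\n")
```

If the last number is greater than zero, the default plot shows fewer visible symbols than there are observations.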
Untangling Overlapping Points: The method Argument
The challenge of overlapping points (overplotting) is common in strip charts. The stripchart() function provides the method argument to control how these coincident points are displayed.2 Understanding these methods is crucial for creating informative plots.
There are three primary methods available in base R’s stripchart():
- method = "overplot" (Default): As seen previously, points with identical or very close values are plotted directly on top of one another.1 While this shows the exact location of each value, it can easily hide the true density of points in crowded regions.
- method = "jitter": This method adds a small amount of random noise (typically perpendicular to the data axis) to each point’s position.2 This separation helps prevent points from completely overlapping, making it much easier to visualize the density and distribution, especially for continuous or near-continuous data. The amount of jittering can be controlled by the jitter argument (defaulting to 0.1).2 Increasing the jitter value spreads the points out more. This approach emphasizes the overall distribution shape and density rather than the exact frequency at specific values, as the perpendicular positioning is random.
- method = "stack": When points have identical values, this method stacks them neatly on top of each other (or side-by-side if vertical = TRUE).2 This creates something akin to a mini-histogram at each distinct data value. It is particularly effective for discrete data (like counts or ratings) or data that has been rounded, where multiple observations frequently share the exact same value.2 If the data has many unique values, “stack” can look similar to “overplot” or create very tall, thin stacks that are hard to compare visually.
Let’s compare these methods using the Ozone readings from the built-in airquality dataset. We’ll remove missing values (NAs) first.
# Using airquality data (remove NAs first)
ozone_data <- airquality$Ozone[!is.na(airquality$Ozone)]
# Stack the three plots in one column for comparison;
# par() returns the previous settings so they can be restored afterwards
op <- par(mfrow = c(3, 1), mar = c(4, 4, 2, 1)) # 3 rows, 1 col; adjust margins
# Overplot (default; coincident points hide each other)
stripchart(ozone_data, main = "Overplot Method", xlab = "Ozone (ppb)")
# Jitter method
stripchart(ozone_data, method = "jitter", jitter = 0.2,
           main = "Jitter Method", xlab = "Ozone (ppb)", pch = 16) # Use solid points
# Stack method (works best with discrete/rounded data)
# Stacking continuous data can be less informative
stripchart(ozone_data, method = "stack",
           main = "Stack Method", xlab = "Ozone (ppb)", pch = 16)
# Restore the previous plot layout and margins
par(op)
Observing these plots reveals how the choice of method influences the visual interpretation. “Jitter” provides a good sense of the overall density, while “stack” highlights the frequency of specific (or binned, if rounded) values. “Overplot” often obscures information in dense areas.
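Because the jitter displacement is random, a jittered plot looks slightly different each time it is drawn. One common way to make it reproducible is to fix the seed of R's random number generator first. The sketch below assumes nothing beyond base R; the seed value 123 and the names first_draw and second_draw are arbitrary, and the uniform draws are only a stand-in to illustrate the effect of a seed, not a statement about how stripchart() generates its noise internally.

```r
ozone_data <- airquality$Ozone[!is.na(airquality$Ozone)]

# Fix the random number stream so the jittered positions are reproducible
set.seed(123) # arbitrary seed, chosen only for illustration
stripchart(ozone_data, method = "jitter", jitter = 0.2,
           main = "Jitter Method (reproducible)", xlab = "Ozone (ppb)", pch = 16)

# Same seed, same random draws: illustrated with plain uniform noise
set.seed(123)
first_draw <- runif(5, -0.2, 0.2)
set.seed(123)
second_draw <- runif(5, -0.2, 0.2)
print(identical(first_draw, second_draw))
```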
Here’s a table summarizing the methods:
Method | Description | How Overlap is Handled | Best For | Key Argument(s) |
"overplot" | Default behavior | Points plotted directly on top | Seeing exact values with minimal overlap | – |
"jitter" | Adds small random displacement perpendicular | Points slightly shifted randomly | Visualizing density, continuous data | jitter |
"stack" | Stacks identical points | Points stacked neatly | Discrete or rounded data, seeing frequencies | offset |
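The Ozone readings above are not the most natural fit for stacking. As a hedged illustration of the case where "stack" tends to work well, the sketch below uses the number of cylinders in mtcars, which takes only a few distinct values. The choice of variable, the name cyl_counts, and the offset value of 0.5 are illustrative assumptions, not part of the original comparison.

```r
# Frequency of each cylinder count; only a handful of distinct values occur
cyl_counts <- table(mtcars$cyl)
print(cyl_counts)

# Stack identical values; offset controls the spacing between stacked points
stripchart(mtcars$cyl, method = "stack", offset = 0.5, pch = 16,
           main = "Stacked Strip Chart of Cylinder Counts",
           xlab = "Number of cylinders")
```

Each stack's height corresponds to the matching entry in the frequency table, which is what gives the plot its mini-histogram look.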
Comparing Groups Side-by-Side: Using Formulas
One of the primary strengths of strip charts is their ability to facilitate comparisons of distributions across different groups or categories.1 R’s formula interface provides an elegant way to achieve this within the stripchart() function.2
The formula syntax is typically y ~ g, where y is the numeric variable whose distribution is of interest and g is a factor or categorical variable defining the groups.2 When using a formula, the data argument is normally supplied as well, naming the data frame that contains these variables; without it, the variables are looked up in the calling environment.2 This approach is efficient because it avoids manual data subsetting or looping to create a plot for each group. It follows a standard R convention used in many modeling and plotting functions (such as boxplot() and lm()), making the syntax familiar to many R users.
Let’s use the classic iris dataset to compare the distribution of Sepal Length for the three different Iris species (setosa, versicolor, virginica). Using method = "jitter" is highly recommended for group comparisons, as it allows for a clearer visual assessment of the distribution within each group, facilitating comparison of central tendency, spread, and potential outliers.6
# Using the iris dataset
data(iris)
# Compare Sepal Length across Species using formula
stripchart(Sepal.Length ~ Species, data = iris,
           method = "jitter", jitter = 0.2, # Use jitter for clarity
           main = "Iris Sepal Length by Species",
           xlab = "Sepal Length (cm)", # Horizontal chart: data values run along the x-axis
           ylab = "Species", # Groups are laid out along the y-axis
           col = c("red", "green3", "blue"), # Assign distinct colors to groups
           pch = 16) # Use solid circles for better visibility
This code generates three separate strip charts (one for each species) aligned along a common horizontal axis (Sepal Length), making visual comparison straightforward. The col argument is used here to assign a different color to each species’ points, enhancing visual distinction. The factor levels in iris$Species determine the order of the groups along the y-axis, with the first level drawn at the bottom. While the group.names argument exists to manually set labels 2, using the inherent factor levels is often sufficient.
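If a different group order or friendlier labels are wanted, one possible approach is to reorder the factor levels and pass group.names. The sketch below is illustrative only: the ordering in species_order, the data frame name iris_reordered, and the capitalized labels are assumptions made for the example.

```r
# Reorder the factor levels to change the order of the strips
species_order <- c("virginica", "versicolor", "setosa") # illustrative ordering
iris_reordered <- transform(iris,
                            Species = factor(Species, levels = species_order))

# group.names replaces the axis labels; it must follow the same order as the levels
stripchart(Sepal.Length ~ Species, data = iris_reordered,
           method = "jitter", jitter = 0.2, pch = 16,
           group.names = c("Virginica", "Versicolor", "Setosa"),
           xlab = "Sepal Length (cm)")
```

Keeping the labels in the same order as the levels matters, because group.names is matched to the groups by position.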
Making it Your Own: Customizing Appearance
Like most base R plotting functions, stripchart() offers extensive customization options through various graphical parameters.2 Many standard parameters recognized by par() can be passed directly to stripchart() or through the ... argument mechanism, which allows functions to accept additional, unspecified arguments and pass them along to underlying plotting routines.11
Here are some common customizations demonstrated by enhancing the previous grouped iris plot:
- vertical = TRUE: Changes the plot orientation, drawing the strips vertically instead of horizontally. This can be particularly useful when comparing many groups or when group names are long.1 Note that when vertical = TRUE, the roles of the axes are swapped: the data values are plotted along the vertical (y) axis, and the groups are laid out along the horizontal (x) axis. Consequently, ylab will label the continuous data axis, and xlab will label the group axis.
- pch: Controls the plotting symbol shape. Common values include 1 (open circle), 16 (solid circle), 17 (solid triangle), 15 (solid square). A vector of pch values can be supplied to assign different shapes to different groups.2 A full list can be found in R’s help documentation for points (?points).
- col: Sets the color of the points. Similar to pch, a vector of colors can be used for different groups.2 Using distinct colors and shapes effectively maps visual properties to data features (groups), significantly enhancing clarity and interpretation compared to using defaults or random aesthetic choices.
- main, xlab, ylab: Customize the main title and axis labels.2 Remember that xlab always labels the horizontal axis and ylab the vertical axis, so which of them describes the data and which describes the groups depends on the vertical setting.
- cex: Adjusts the size of the plotting symbols.2
- frame.plot = FALSE: Removes the bounding box drawn around the plot area.3
- axes = TRUE/FALSE: Controls whether axes are drawn.2
- at: Provides manual control over the numeric locations where the strips are drawn along the group axis. Useful for non-uniform spacing or when adding strip charts to existing plots (add = TRUE).2
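As a brief illustration of the at argument, the sketch below places the three iris strips at uneven positions on a horizontal chart. The positions in strip_positions are arbitrary, and ylim is set explicitly as a precaution so that every position falls inside the plotting region; whether the limits adapt automatically may depend on the R version, so treat this as an assumption rather than a requirement.

```r
# Place the three strips at custom positions, leaving a gap before the last one
strip_positions <- c(1, 2, 4) # illustrative positions, one per species

stripchart(Sepal.Length ~ Species, data = iris,
           method = "jitter", jitter = 0.2, pch = 16,
           at = strip_positions,
           ylim = c(0.5, 4.5), # widen the group axis so position 4 is visible
           xlab = "Sepal Length (cm)",
           las = 1) # horizontal tick labels for the species names
```

The at vector needs one entry per group, in the same order as the factor levels.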
Let’s apply some of these customizations to create a vertical iris plot:
# Vertical Iris plot with more customization
stripchart(Sepal.Length ~ Species, data = iris,
           method = "jitter", jitter = 0.2,
           vertical = TRUE, # Make it vertical
           main = "Iris Sepal Length by Species (Vertical)",
           xlab = "Species", # Groups now run along the horizontal axis
           ylab = "Sepal Length (cm)", # Data values now run along the vertical axis
           col = c("darkorange", "purple", "cyan4"), # Different colors
           pch = c(15, 17, 18), # Different shapes per species (square, triangle, diamond)
           cex = 1.2, # Slightly larger points
           frame.plot = FALSE, # Remove box around plot
           las = 1) # Draw all axis tick labels horizontally (style 1)
This customized plot uses orientation, color, shape, and size to present the comparison more effectively.
Mini-Project: Exploring Iris Sepal Lengths
Now, let’s synthesize the concepts covered into a small project.
Goal: To visually compare the distribution of sepal lengths for the three different species of Iris flowers (setosa, versicolor, virginica) using a well-customized strip chart that clearly highlights any differences.
Data: We will use the built-in iris dataset, focusing on the Sepal.Length (numeric) and Species (factor) variables.
Steps & Code: We’ll build upon the previous examples, incorporating the formula interface, the jitter method for clarity, and meaningful customizations (colors, shapes, labels, orientation). Assigning visual elements systematically based on group levels makes the code robust and the plot easier to interpret. stripchart() matches col and pch to groups by position and ignores vector names, so the named vectors are indexed by levels(iris$Species) to ensure the correct color and shape are applied to each species, regardless of the internal order of factor levels.
# Final Iris Sepal Length Comparison Plot
# Define colors and shapes systematically using named vectors
species_colors <- c("setosa" = "red", "versicolor" = "green3", "virginica" = "blue")
species_pch <- c("setosa" = 16, "versicolor" = 17, "virginica" = 18) # solid circle, triangle, diamond
# Look up by factor level so 'setosa' is always red/circle, etc.
species_levels <- levels(iris$Species)
# Create the stripchart
stripchart(Sepal.Length ~ Species, data = iris,
           method = "jitter", jitter = 0.2, # Jitter points to show density
           vertical = TRUE, # Vertical orientation often better for group comparisons
           main = "Comparison of Iris Sepal Lengths by Species",
           xlab = "Species", # Label for the group axis (horizontal)
           ylab = "Sepal Length (cm)", # Label for the data axis (vertical)
           # Index by level name to ensure correct color/pch assignment
           col = species_colors[species_levels],
           pch = species_pch[species_levels],
           cex = 1.1, # Adjust point size
           las = 1) # Draw axis tick labels horizontally for readability
# Optional: Add means for reference
# Calculate mean sepal length for each species
library(dplyr) # Using dplyr for concise calculation
means <- iris %>%
  group_by(Species) %>%
  summarize(mean_length = mean(Sepal.Length), .groups = "drop")
# Add mean points to the plot (groups sit at x = 1, 2, 3 in factor-level order)
points(1:3, means$mean_length, pch = 4, col = "black", cex = 1.5, lwd = 2) # Add 'X' at mean locations
legend("bottomright", legend = "Mean", pch = 4, pt.cex = 1.5, pt.lwd = 2, col = "black", bty = "n") # Add legend for mean
Interpretation: This plot effectively visualizes the sepal length distributions for the three Iris species. It clearly shows that:
- setosa (red circles) generally has the shortest sepal lengths, with a relatively narrow distribution.
- virginica (blue diamonds) tends to have the longest sepal lengths, with a wider spread.
- versicolor (green triangles) falls in between, overlapping significantly with both other species but centered higher than setosa and lower than virginica.
The jittering allows observation of the density within each group, and the added mean markers (black ‘X’) provide a quick reference for the central tendency of each species.
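The visual reading can be cross-checked with a few summary numbers. The sketch below uses only base R (tapply) rather than dplyr; the names sepal_means and sepal_sds are illustrative.

```r
# Per-species mean and standard deviation of sepal length (base R only)
sepal_means <- tapply(iris$Sepal.Length, iris$Species, mean)
sepal_sds <- tapply(iris$Sepal.Length, iris$Species, sd)

# One column per species, one row per statistic
print(round(rbind(mean = sepal_means, sd = sepal_sds), 2))
```

The means should increase from setosa to versicolor to virginica, and the standard deviation for setosa should be the smallest, matching the narrow strip seen in the plot.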
Bonus Tip: Enhancing Boxplots with Strip Charts
While strip charts show individual points, boxplots provide a concise summary of the distribution (median, quartiles, range, potential outliers). Combining these two plot types can offer a richer understanding than either plot alone.9 This is particularly useful for moderate sample sizes where a boxplot might hide important details within the interquartile range (IQR) box, but a strip chart alone lacks explicit summary statistics.
The technique involves creating the boxplot first, then overlaying the strip chart using add = TRUE.9 Careful use of jitter and point aesthetics (like color and transparency) is needed to ensure the points don’t completely obscure the underlying boxplot.
Here’s how to overlay jittered points onto a boxplot for the iris sepal length data:
# Boxplot with overlaid stripchart for Iris Sepal Length
# Set up plot area (optional, but good practice)
par(mar = c(5, 4, 4, 2) + 0.1) # Default margins
# Create the boxplot first (vertical)
boxplot(Sepal.Length ~ Species, data = iris,
        main = "Iris Sepal Length: Boxplot with Individual Points",
        xlab = "Species",
        ylab = "Sepal Length (cm)",
        col = "lightblue", # Light color for the box
        border = "darkblue") # Darker border for the box
# Overlay the stripchart using jitter
stripchart(Sepal.Length ~ Species, data = iris,
           method = "jitter", jitter = 0.15, # Adjust jitter amount
           vertical = TRUE, # Match boxplot orientation
           pch = 16, # Solid circles
           # Use semi-transparent points to see boxplot underneath
           col = rgb(0, 0, 0, alpha = 0.5), # Black points with 50% transparency
           add = TRUE) # Crucial: Add to the existing boxplot
This combined plot provides both the summary statistics from the boxplot (median line, IQR box, whiskers) and the view of individual data points from the strip chart. It allows for a better assessment of data density within the box, verification of potential outliers shown by the boxplot, and a general sense of the distribution’s shape (e.g., symmetry, skewness) that complements the boxplot’s summary.
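One side effect of the overlay is that any value the boxplot flags as an outlier is drawn twice: once by boxplot() and once by stripchart(). A possible variation, sketched below under the assumption that this double drawing is unwanted, suppresses the boxplot's own outlier symbols with outline = FALSE and sets ylim from the full data range so that no point is cut off. The names virginica_len and virginica_out are illustrative.

```r
# Values the boxplot rule would flag as outliers for virginica
virginica_len <- iris$Sepal.Length[iris$Species == "virginica"]
virginica_out <- boxplot.stats(virginica_len)$out
print(virginica_out)

# Suppress the boxplot's own outlier symbols, then draw every observation once
boxplot(Sepal.Length ~ Species, data = iris, outline = FALSE,
        ylim = range(iris$Sepal.Length), # keep all observations in view
        xlab = "Species", ylab = "Sepal Length (cm)",
        col = "lightblue", border = "darkblue")
stripchart(Sepal.Length ~ Species, data = iris,
           method = "jitter", jitter = 0.15, vertical = TRUE,
           pch = 16, col = rgb(0, 0, 0, alpha = 0.5), add = TRUE)
```

The trade-off is that outliers are no longer visually distinguished from other points, so boxplot.stats() is used here to list them separately.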
Conclusion: Why Use Strip Charts? Get to the Point!
Strip charts, created in R using the stripchart() function, are a valuable tool in the data visualization toolkit. Their core strengths lie in their simplicity and their ability to display every individual data point along a single dimension.3 This makes them particularly effective for:
- Visualizing distributions of small datasets where aggregation might obscure details.2
- Comparing distributions across different groups using the formula interface (for example, Sepal.Length ~ Species).1
- Identifying patterns like clustering, gaps, and potential outliers.
- Supplementing other plot types, like boxplots, to provide a more complete picture of the data.9
Key techniques involve choosing the appropriate method ("overplot", "jitter", "stack") to handle overlapping points effectively and using standard graphical parameters (col, pch, vertical, labels, etc.) for customization.
While more complex visualization systems like ggplot2 offer similar capabilities (e.g., geom_jitter, geom_dotplot) within a different framework 15, the base R stripchart() provides a quick, accessible, and often sufficient way to generate these informative plots. They offer a unique view of the data’s “texture”, revealing nuances that aggregated summaries might miss, which makes them a worthwhile addition to any exploratory data analysis workflow.