Getting to the Point: A Guide to R Strip Charts

Data visualization is fundamental to understanding patterns and distributions within datasets. While complex plots have their place, sometimes the simplest tools are the most effective. Enter the strip chart, also known as a one-dimensional scatter plot or dot plot. Particularly useful for smaller datasets or when comparing distributions across groups, strip charts offer a clear view of individual data points.1 This guide explores how to create and customize strip charts in R using the base graphics system.

What is a Strip Chart?

A strip chart displays numerical data along a single axis.4 Each data point is represented individually, typically as a dot or symbol. This direct representation makes strip charts excellent for visualizing the distribution of one-dimensional quantitative data, especially when the number of observations (N) is small, as it avoids the potential over-summarization seen in plots like histograms or box plots for limited data.2 They are particularly effective for:

  • Visualizing the spread and density of data points.
  • Identifying potential clusters, gaps, or outliers.
  • Comparing distributions across different categories or groups.1

In R’s base graphics system, the primary function for creating these plots is stripchart().2

Getting Started: Your First Strip Chart

The stripchart() function, part of R’s base graphics, requires no external packages.8 Its most basic usage involves providing a numeric vector containing the data to be plotted.2

Let’s use the built-in mtcars dataset to create a simple strip chart showing the distribution of miles per gallon (MPG) for various car models.

Code snippet

# Load the mtcars dataset (comes with R)
data(mtcars)

# Basic strip chart of miles per gallon (mpg)
stripchart(mtcars$mpg,
           main = "Basic Strip Chart of Car MPG",
           xlab = "Miles Per Gallon (MPG)")

This code generates a horizontal strip chart where each point represents the MPG of one car model. The main argument sets the title, and xlab provides a label for the x-axis.4 Notice that stripchart() accepts various arguments for customization, which will be explored further.

However, even in this simple example, some points may overlap, making it difficult to discern the exact number of data points at certain values. This visual clutter arises from the default behavior of stripchart(), which uses method = "overplot".1 When multiple data points share the same or very close values, they are plotted directly on top of each other. This default is straightforward but often insufficient, especially for datasets with non-unique values or dense clusters, necessitating methods to handle this overlap effectively.
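To get a feel for how much the default hides, one can count the tied values before plotting. The following is a small base R sketch; the variable names (mpg, n_tied, tied_values) are illustrative and not part of the original example.

```r
# Count how many mpg values coincide exactly with an earlier value
mpg <- mtcars$mpg
n_tied <- sum(duplicated(mpg))

# Tabulate only the values that occur more than once
mpg_counts <- table(mpg)
tied_values <- mpg_counts[mpg_counts > 1]

n_tied       # number of points drawn exactly on top of another point
tied_values  # which values are shared, and by how many cars
```

Every value listed in tied_values appears as a single symbol in the default plot, even though it stands for two or more cars.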

Untangling Overlapping Points: The method Argument

The challenge of overlapping points (overplotting) is common in strip charts. The stripchart() function provides the method argument to control how these coincident points are displayed.2 Understanding these methods is crucial for creating informative plots.

There are three primary methods available in base R’s stripchart():

  1. method = "overplot" (Default): As seen previously, points with identical or very close values are plotted directly on top of one another.1 While this shows the exact location of each value, it can easily hide the true density of points in crowded regions.

  2. method = "jitter": This method adds a small amount of random noise (typically perpendicular to the data axis) to each point’s position.2 This separation helps prevent points from completely overlapping, making it much easier to visualize the density and distribution, especially for continuous or near-continuous data. The amount of jittering can be controlled by the jitter argument (defaulting to 0.1).2 Increasing the jitter value spreads the points out more. This approach emphasizes the overall distribution shape and density rather than the exact frequency at specific values, as the perpendicular positioning is random.

  3. method = "stack": When points have identical values, this method stacks them neatly on top of each other (or side-by-side if vertical = TRUE).2 This creates something akin to a mini-histogram at each distinct data value. It is particularly effective for discrete data (like counts or ratings) or data that has been rounded, where multiple observations frequently share the exact same value.2 The spacing between stacked points is set by the offset argument. If most values are unique, few points coincide and “stack” looks much like “overplot”; with heavily tied data it can instead produce tall, thin stacks that are hard to compare visually.
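Because the jitter is random, the same call can produce a slightly different picture on each run. A minimal sketch, assuming a repeatable figure is wanted (for a report, say), is to fix the random seed immediately before plotting. The seed value 42 is arbitrary.

```r
mpg <- mtcars$mpg

# Fix the seed so the random jitter is identical on every run
set.seed(42)  # arbitrary value; any integer works
stripchart(mpg, method = "jitter", jitter = 0.2, pch = 16,
           main = "Car MPG with Jitter (fixed seed)",
           xlab = "Miles Per Gallon (MPG)")
```

Only the positions across the strip are affected by the seed; the positions along the MPG axis are always the exact data values.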

Let’s compare these methods using the Ozone readings from the built-in airquality dataset. We’ll remove missing values (NAs) first.

Code snippet

# Using airquality data (remove NAs first)
ozone_data <- airquality$Ozone[!is.na(airquality$Ozone)]

# Set up plot layout to show all three plots stacked one above the other
par(mfrow = c(3, 1), mar = c(4, 4, 2, 1)) # 3 rows, 1 col; adjust margins

# Overplot (Default - likely messy)
stripchart(ozone_data, main="Overplot Method", xlab="Ozone (ppb)")

# Jitter Method
stripchart(ozone_data, method = "jitter", jitter = 0.2,
           main="Jitter Method", xlab="Ozone (ppb)", pch=16) # Use solid points

# Stack Method (Works best with discrete/rounded data)
# Stacking continuous data can be less informative
stripchart(ozone_data, method = "stack",
           main="Stack Method", xlab="Ozone (ppb)", pch=16)

# Reset plot layout
par(mfrow = c(1, 1), mar = c(5, 4, 4, 2) + 0.1) # Reset to default

Observing these plots reveals how the choice of method influences the visual interpretation. “Jitter” provides a good sense of the overall density, while “stack” highlights the frequency of specific (or binned, if rounded) values. “Overplot” often obscures information in dense areas.
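Ozone readings are close to continuous, so they do not show “stack” at its best. As a sketch on genuinely discrete data, the example below plots the number of cylinders from mtcars. The offset value of 0.5 is an illustrative choice, not a recommendation, and cyl_counts is a name introduced here.

```r
# Cylinder counts are discrete, so many cars share each value
cyl_counts <- table(mtcars$cyl)

stripchart(mtcars$cyl,
           method = "stack",
           offset = 0.5,  # spacing between stacked points (illustrative value)
           pch = 16,
           main = "Cars by Number of Cylinders",
           xlab = "Cylinders")

# The height of each stack corresponds to these counts
cyl_counts
```

Each stack can be read like a histogram bar, with the advantage that every car is still an individual symbol.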

Here’s a table summarizing the methods:

Method | Description | How Overlap is Handled | Best For | Key Argument(s)
"overplot" | Default behavior | Points plotted directly on top of one another | Seeing exact values with minimal overlap | (none)
"jitter" | Adds small random displacement perpendicular to the data axis | Points slightly shifted randomly | Visualizing density, continuous data | jitter
"stack" | Stacks identical points | Points stacked neatly | Discrete or rounded data, seeing frequencies | offset

Comparing Groups Side-by-Side: Using Formulas

One of the primary strengths of strip charts is their ability to facilitate comparisons of distributions across different groups or categories.1 R’s formula interface provides an elegant way to achieve this within the stripchart() function.2

The formula syntax is typically y ~ g, where y is the numeric variable whose distribution is of interest, and g is a factor or categorical variable defining the groups.2 When using a formula, the data argument is normally supplied as well, to specify the data frame containing these variables.2 This approach is highly efficient as it avoids the need for manual data subsetting or looping to create plots for each group. It leverages a standard R convention used in many modeling and plotting functions (like boxplot(), lm(), etc.), making the syntax familiar to many R users.

Let’s use the classic iris dataset to compare the distribution of Sepal Length for the three different Iris species (setosa, versicolor, virginica). Using method = "jitter" is highly recommended for group comparisons, as it allows for a clearer visual assessment of the distribution within each group, facilitating comparison of central tendency, spread, and potential outliers.6

Code snippet

# Using the iris dataset
data(iris)

# Compare Sepal Length across Species using formula
stripchart(Sepal.Length ~ Species, data = iris,
           method = "jitter", jitter = 0.2, # Use jitter for clarity
           main = "Iris Sepal Length by Species",
           xlab = "Sepal Length (cm)", # Data axis (horizontal by default)
           ylab = "Species", # Group axis
           col = c("red", "green3", "blue"), # Assign distinct colors to groups
           pch = 16) # Use solid circles for better visibility

This code generates three strips (one for each species) aligned along a common horizontal axis (Sepal Length), making visual comparison straightforward. The col argument is used here to assign a different color to each species’ points, enhancing visual distinction. The factor levels in iris$Species determine the order of the groups along the y-axis, from bottom to top. While the group.names argument exists to manually set labels 2, using the inherent factor levels is often sufficient.
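The formula is a convenience rather than a requirement. As a sketch, a similar grouped plot can be produced by passing a named list (built here with split()), and group.names can override the labels. The abbreviated labels and the name sepal_by_species are made up for illustration.

```r
# Split sepal lengths into a named list, one element per species
sepal_by_species <- split(iris$Sepal.Length, iris$Species)

stripchart(sepal_by_species,
           method = "jitter", jitter = 0.2,
           group.names = c("Set.", "Vers.", "Virg."),  # hypothetical short labels
           main = "Iris Sepal Length by Species",
           xlab = "Sepal Length (cm)",
           ylab = "Species",
           col = c("red", "green3", "blue"),
           pch = 16)
```

The labels in group.names are applied in the order of the list elements, which here is the order of the factor levels.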

Making it Your Own: Customizing Appearance

Like most base R plotting functions, stripchart() offers extensive customization options through various graphical parameters.2 Many standard parameters recognized by par() can be passed directly to stripchart() or through the ... argument mechanism, which allows functions to accept additional, unspecified arguments and pass them along to underlying plotting routines.11

Here are some common customizations demonstrated by enhancing the previous grouped iris plot:

  • vertical = TRUE: Changes the plot orientation, drawing the strips vertically instead of horizontally. This can be particularly useful when comparing many groups or when group names are long.1 Note that when vertical = TRUE, the roles of the axes are swapped: the data (the left-hand side of the formula) is plotted along the vertical axis, and the groups (the right-hand side) are placed along the horizontal axis. Consequently, ylab will label the continuous data axis, and xlab will label the group axis.
  • pch: Controls the plotting symbol shape. Common values include 1 (open circle), 16 (solid circle), 17 (solid triangle), 15 (solid square). A vector of pch values can be supplied to assign different shapes to different groups.2 A full list can be found in R’s help documentation for points (?points).
  • col: Sets the color of the points. Similar to pch, a vector of colors can be used for different groups.2 Using distinct colors and shapes effectively maps visual properties to data features (groups), significantly enhancing clarity and interpretation compared to using defaults or random aesthetic choices.
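  • main, xlab, ylab: Customize the main title and axis labels.2 Remember that xlab and ylab always label the horizontal and vertical axes respectively, so which of them describes the data depends on the vertical setting. The dlab and glab arguments label the data axis and the group axis regardless of orientation.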
  • cex: Adjusts the size of the plotting symbols.2
  • frame.plot = FALSE: Removes the bounding box drawn around the plot area.3
  • axes = TRUE/FALSE: Controls whether axes are drawn.2
  • at: Provides manual control over the numeric locations where the strips are drawn along the group axis. Useful for non-uniform spacing or when adding strip charts to existing plots (add = TRUE).2

Let’s apply some of these customizations to create a vertical iris plot:

Code snippet

# Vertical Iris plot with more customization
stripchart(Sepal.Length ~ Species, data = iris,
           method = "jitter", jitter = 0.2,
           vertical = TRUE, # Make it vertical
           main = "Iris Sepal Length by Species (Vertical)",
           xlab = "Species", # Note: with vertical = TRUE the groups sit on the horizontal axis
           ylab = "Sepal Length (cm)", # ylab now labels the vertical data axis
           col = c("darkorange", "purple", "cyan4"), # Different colors
           pch = c(15, 17, 18), # Different shapes per species (square, triangle, diamond)
           cex = 1.2, # Slightly larger points
           frame.plot = FALSE, # Remove box around plot
           las = 1) # Make axis labels horizontal for readability (style 1)

This customized plot uses orientation, color, shape, and size to present the comparison more effectively.
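Keeping xlab and ylab straight when switching orientation is easy to get wrong. One option, sketched here, is to use stripchart()’s dlab (data axis) and glab (group axis) arguments, which follow the data and the groups whichever way the plot is oriented. The wrapper draw_strips() is a hypothetical helper written only for this illustration.

```r
# Hypothetical helper: the same labels work for either orientation
draw_strips <- function(vertical = FALSE) {
  stripchart(Sepal.Length ~ Species, data = iris,
             method = "jitter", jitter = 0.2,
             vertical = vertical,
             dlab = "Sepal Length (cm)",  # always labels the data axis
             glab = "Species",            # always labels the group axis
             pch = 16)
  invisible(par("usr"))  # plot-region limits: x1, x2, y1, y2
}

draw_strips(vertical = FALSE)  # data along the x-axis
draw_strips(vertical = TRUE)   # data along the y-axis
```

If xlab or ylab is also supplied, it takes precedence over dlab and glab for that axis.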

Mini-Project: Exploring Iris Sepal Lengths

Now, let’s synthesize the concepts covered into a small project.

Goal: To visually compare the distribution of sepal lengths for the three different species of Iris flowers (setosa, versicolor, virginica) using a well-customized strip chart that clearly highlights any differences.

Data: We will use the built-in iris dataset, focusing on the Sepal.Length (numeric) and Species (factor) variables.

Steps & Code: We’ll build upon the previous examples, incorporating the formula interface, the jitter method for clarity, and meaningful customizations (colors, shapes, labels, orientation). Assigning visual elements systematically based on group levels makes the code robust and the plot easier to interpret. Named vectors document which color and shape belong to which species. Because stripchart() assigns col and pch to groups by position rather than by name, the code indexes those vectors by the factor levels, which ensures the correct color and shape are applied to each species regardless of the internal order of the levels.

Code snippet

# Final Iris Sepal Length Comparison Plot

# Define colors and shapes as named vectors, one entry per species
# The names document the intended pairing ('setosa' is red/circle, etc.)
species_colors <- c("setosa" = "red", "versicolor" = "green3", "virginica" = "blue")
species_pch <- c("setosa" = 16, "versicolor" = 17, "virginica" = 18) # solid circle, triangle, diamond

# stripchart() matches col/pch to groups by position, so order them by factor level
species_levels <- levels(iris$Species)

# Create the stripchart
stripchart(Sepal.Length ~ Species, data = iris,
           method = "jitter", jitter = 0.2, # Jitter points to show density
           vertical = TRUE, # Vertical orientation often better for group comparisons
           main = "Comparison of Iris Sepal Lengths by Species",
           xlab = "Species", # Group axis (horizontal when vertical = TRUE)
           ylab = "Sepal Length (cm)", # Data axis (vertical when vertical = TRUE)
           # Index by level name so each species gets its intended color/pch
           col = species_colors[species_levels],
           pch = species_pch[species_levels],
           cex = 1.1, # Adjust point size
           las = 1) # Draw axis tick labels horizontally for readability

# Optional: Add means for reference
# Calculate mean sepal length for each species
library(dplyr) # Using dplyr for concise calculation
means <- iris %>%
  group_by(Species) %>%
  summarize(mean_length = mean(Sepal.Length), .groups = "drop")

# Add mean points to the plot (group positions 1, 2, 3 correspond to factor levels)
# The plot is vertical, so groups are on the x-axis and sepal length is on the y-axis
points(1:3, means$mean_length, pch = 4, col = "black", cex = 1.5, lwd = 2) # Add 'X' at mean locations
legend("bottomright", legend = "Mean", pch = 4, pt.cex = 1.5, pt.lwd = 2, col = "black", bty = "n") # Add legend for mean

Interpretation: This plot effectively visualizes the sepal length distributions for the three Iris species. It clearly shows that:

  • setosa (red circles) generally has the shortest sepal lengths, with a relatively narrow distribution.
  • virginica (blue diamonds) tends to have the longest sepal lengths, with a wider spread.
  • versicolor (green triangles) falls in between, overlapping significantly with both other species but centered higher than setosa and lower than virginica.

The jittering allows observation of the density within each group, and the added mean markers (black ‘X’) provide a quick reference for the central tendency of each species.
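The visual impression can be cross-checked numerically. The following base R sketch (an alternative to the dplyr summary used for the mean markers; the variable names are illustrative) computes the mean and range of sepal length for each species.

```r
# Mean sepal length per species, in factor-level order
species_means <- tapply(iris$Sepal.Length, iris$Species, mean)

# Minimum and maximum per species, returned as a 2 x 3 matrix
species_ranges <- sapply(split(iris$Sepal.Length, iris$Species), range)
rownames(species_ranges) <- c("min", "max")

round(species_means, 2)
species_ranges
```

Means that increase from setosa to virginica, together with ranges that overlap between neighboring species, would match what the strip chart shows.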

Bonus Tip: Enhancing Boxplots with Strip Charts

While strip charts show individual points, boxplots provide a concise summary of the distribution (median, quartiles, range, potential outliers). Combining these two plot types can offer a richer understanding than either plot alone.9 This is particularly useful for moderate sample sizes where a boxplot might hide important details within the interquartile range (IQR) box, but a strip chart alone lacks explicit summary statistics.

The technique involves creating the boxplot first, then overlaying the strip chart using add = TRUE.9 Careful use of jitter and point aesthetics (like color and transparency) is needed to ensure the points don’t completely obscure the underlying boxplot.

Here’s how to overlay jittered points onto a boxplot for the iris sepal length data:

Code snippet

# Boxplot with overlaid stripchart for Iris Sepal Length

# Set up plot area (optional, but good practice)
par(mar = c(5, 4, 4, 2) + 0.1) # Default margins

# Create the boxplot first (vertical)
boxplot(Sepal.Length ~ Species, data = iris,
        main = "Iris Sepal Length: Boxplot with Individual Points",
        xlab = "Species",
        ylab = "Sepal Length (cm)",
        col = "lightblue", # Light color for the box
        border = "darkblue") # Darker border for the box

# Overlay the stripchart using jitter
stripchart(Sepal.Length ~ Species, data = iris,
           method = "jitter", jitter = 0.15, # Adjust jitter amount
           vertical = TRUE, # Match boxplot orientation
           pch = 16, # Solid circles
           # Use semi-transparent points to see boxplot underneath
           col = rgb(0, 0, 0, alpha = 0.5), # Black points with 50% transparency
           add = TRUE) # Crucial: Add to the existing boxplot

This combined plot provides both the summary statistics from the boxplot (median line, IQR box, whiskers) and the view of individual data points from the strip chart. It allows for a better assessment of data density within the box, verification of potential outliers shown by the boxplot, and a general sense of the distribution’s shape (e.g., symmetry, skewness) that complements the boxplot’s summary.
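One wrinkle with the overlay is that boxplot() draws its own outlier symbols, and the strip chart then draws the same observations again. A sketch of one way around this, assuming each observation should appear only once, is to suppress the boxplot’s outlier symbols with outline = FALSE and set ylim explicitly so that no overlaid point falls outside the plot region.

```r
# Draw boxes without outlier symbols; keep the full data range on the axis
bp <- boxplot(Sepal.Length ~ Species, data = iris,
              outline = FALSE,
              ylim = range(iris$Sepal.Length),
              xlab = "Species", ylab = "Sepal Length (cm)",
              col = "lightblue", border = "darkblue")

# Every observation, including any outliers, is now drawn exactly once
stripchart(Sepal.Length ~ Species, data = iris,
           method = "jitter", jitter = 0.15,
           vertical = TRUE, pch = 16,
           col = rgb(0, 0, 0, alpha = 0.5),
           add = TRUE)
```

The trade-off is that outliers are no longer visually flagged by the boxplot, so they have to be recognized from their position beyond the whiskers.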

Conclusion: Why Use Strip Charts? Get to the Point!

Strip charts, created in R using the stripchart() function, are a valuable tool in the data visualization toolkit. Their core strengths lie in their simplicity and their ability to display every individual data point along a single dimension.3 This makes them particularly effective for:

  • Visualizing distributions of small datasets where aggregation might obscure details.2
  • Comparing distributions across different groups using the formula interface (for example, Sepal.Length ~ Species).1
  • Identifying patterns like clustering, gaps, and potential outliers.
  • Supplementing other plot types, like boxplots, to provide a more complete picture of the data.9

Key techniques involve choosing the appropriate method ("overplot", "jitter", "stack") to handle overlapping points effectively and using standard graphical parameters (col, pch, vertical, labels, etc.) for customization.

While more complex visualization systems like ggplot2 offer similar capabilities (e.g., geom_jitter, geom_dotplot) within a different framework 15, the base R stripchart() provides a quick, accessible, and often sufficient way to generate these informative plots. They offer a unique view of the data’s “texture”—revealing nuances that aggregated summaries might miss—making them a worthwhile addition to any exploratory data analysis workflow.
