Understanding Box Plots
Fast Facts:
Box plots give your audience a visual representation of the median (middle) value as well as the spread in your data
Displaying critical statistics on a box plot makes information easier to digest than reading the same information on a crowded table
Box plots allow you to quickly compare two different groups of data, whether you’re comparing year-to-year or journal-to-journal
November 17, 2022
By: Sherrie Hill and Kristen Overstreet
Box plots are an excellent way to show detailed information about data with a range of values. Editorial office personnel are often asked to report on key indicators for their journal that do not inherently have a single value, such as the time to initial decision. Since not every submission gets an initial decision in exactly the same number of days, there will be a range of times. Some journals address this by reporting the average time to initial decision. However, if the data is not normally distributed or has significant outliers, the average can be pulled toward the extreme values and may not be the best statistic to use for this data. The median value is often a good alternative to the average value in these cases.
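To see how a single outlier affects the two statistics, here is a minimal Python sketch. The ten values are hypothetical days to initial decision, invented for illustration only; they are not data from any of the journals discussed in this post.

```python
from statistics import mean, median

# Hypothetical days to initial decision for ten manuscripts (illustrative only).
days = [28, 30, 31, 33, 34, 35, 36, 38, 40, 95]

average_days = mean(days)    # pulled upward by the single 95-day manuscript
median_days = median(days)   # middle of the sorted values, barely moved by the outlier

print(f"average: {average_days:.1f} days")
print(f"median:  {median_days:.1f} days")
```

In this made-up data set, eight of the ten manuscripts received a decision faster than the average, which is why the median is often closer to what a typical author experiences.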
Though reporting the median might provide a better representation of what most authors experience, it does not tell the full story. The better solution is to report not only the median but also the variance in the data. In this post, we use "variance" in the general sense of spread or variability rather than the formal statistic of the same name. It gives your reader an understanding of how spread out your data is around the median or average value. When reporting the median, we usually report the interquartile range (IQR) to give information about the spread or variance in the data. For more information on the interquartile range, read our blog post Outliers, Consistency and Context: The Importance of Reporting Variability in Editorial Office Performance Data.
Let’s consider a scenario where we have three journals, which all report an average time to initial decision of 38 days. On the surface, we would assume that the authors for all three journals will have about the same experience regardless of which of these journals they choose to submit to.
However, if we take a closer look, we might get a better view of the journals’ true performance. First, let’s look at the median time to initial decision for each journal. To determine the median, we first arrange the data points in sequential order (ascending or descending) and then find the middle value in the list. Note that if the data set contains an even number of values, the median is the average of the two middle numbers. From the data below, we can see that the median shows a slightly different picture than the average. However, the values are not significantly different, and it still appears that the journals perform fairly similarly. Based solely on the median value, we might expect manuscripts at Journal B to reach an initial decision a few days before those at Journal A.
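The steps for finding the median can be sketched in a few lines of Python. The function name median_of and the sample values are hypothetical and used only for illustration.

```python
def median_of(values):
    """Return the middle value, averaging the two middle values when the count is even."""
    ordered = sorted(values)          # step 1: arrange the values in order
    n = len(ordered)
    if n == 0:
        raise ValueError("median_of() needs at least one value")
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]        # odd count: a single middle value exists
    # even count: average the two values on either side of the midpoint
    return (ordered[middle - 1] + ordered[middle]) / 2


# Hypothetical days to initial decision (illustrative only).
odd_example = [41, 29, 36, 33, 38]
even_example = [41, 29, 36, 33, 38, 45]

print(median_of(odd_example))
print(median_of(even_example))
```

Python's built-in statistics.median follows the same rule, so in practice you would rarely need to write this by hand.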
To understand the spread of the data points (variance), we will also look at the interquartile range, which describes the “Middle 50%”, where half of all the values fall. With the values still in sequential order, divide them equally into four sections (quarters or quartiles). In our time to initial decision example, the first quartile (Q1) contains the data points for the manuscripts that had the fastest time to initial decision. The fourth quartile (Q4) contains the data points for the manuscripts that had the slowest time to initial decision. In our example, the dividing lines that mark Q1 and Q3 each fall between two values. As with the median of an even-sized data set, we report the average of the two neighboring values as Q1 and Q3.
We are mostly interested in what “typically” happens, which we see in the second and third quartiles (Q2 and Q3). This range of values is called the Middle 50% or Interquartile Range (IQR). To calculate the IQR, subtract the Q1 value from the Q3 value. This number represents the spread or variance of the middle half of the data. Since the summary table shows the median and quartile information, it gives your reader what they need to better understand the data.
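As a rough sketch of this calculation, the Python below splits the sorted values into a lower and an upper half and takes the median of each half as Q1 and Q3. This is an assumed convention chosen because it matches the description above; spreadsheets and statistics packages use several slightly different quartile rules and may return slightly different values. The function name and the twelve sample values are hypothetical.

```python
from statistics import median


def quartile_values(values):
    """Return (Q1, median, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 4:
        raise ValueError("need at least four values to form quartiles")
    half = n // 2
    lower_half = ordered[:half]     # fastest half of the manuscripts
    upper_half = ordered[-half:]    # slowest half (skips the middle value when n is odd)
    return median(lower_half), median(ordered), median(upper_half)


# Twelve hypothetical days to initial decision (illustrative only).
days = [25, 30, 31, 34, 36, 38, 39, 41, 43, 47, 52, 60]

q1, med, q3 = quartile_values(days)
iqr = q3 - q1   # spread of the Middle 50%

print(f"Q1 = {q1}, median = {med}, Q3 = {q3}, IQR = {iqr}")
```

With twelve values, each quarter holds three data points, so the Q1 and Q3 dividing lines fall between two values and are averaged, just as in the example described in the text.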
However, it is often easier for readers to interpret the data when they see it represented graphically. Box plots are the best way to visually show the median and interquartile range, as well as the minimum and maximum values. To better interpret box plots, it helps to understand how they are created. If we were to take each of the time to initial decision data points and plot them on a chart for each of the journals, we would get a graph like the one below (left). We could then show our data as it was grouped equally into four boxes (quartiles) for each journal. Note that though each of the boxes has the same number of data points, the boxes are not the same size. Data points that are more similar (more densely packed) result in smaller boxes. When we see a box plot with shorter boxes, we know that there is less variance in that data. In our example, smaller boxes would indicate that the authors for these manuscripts had a very similar time to initial decision.
We want to report the information about the Middle 50% so that the reader understands the variance in our data. To do this, we need to know the value at the top of the first box which contains a quarter of the data. We call this value the first quartile (Q1). The highest value of interest for the Middle 50% is the value at the top of the third box, which is called the third quartile (Q3). You will notice that the value which is at the top of the second box is the value that is in the middle of all the data points (median value). The IQR is obtained by subtracting Q1 from Q3 (IQR = Q3 – Q1).
Graphically, we want to focus the readers’ attention on the IQR (Middle 50%). To do this, we show the boxes for quartiles 2 and 3 on the box plot since they show the range for 50% of the data points.
The extreme data points in quartiles 1 and 4 are also important since they give information about the lowest (minimum) and highest (maximum) values that were recorded. A short line is drawn at the minimum value and another at the maximum value, and a vertical line connects each of them to the box. We now have the standard version of the box plot. It should be noted that if your data has outliers, they can be represented by data points above or below the stated maximum and minimum of the box plot. We will investigate how to determine outliers in a future blog post.
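The construction described above can be sketched in plain Python without a charting library. This is a simplified illustration under several assumptions: it draws the plot horizontally rather than vertically, it runs the whiskers all the way to the minimum and maximum (no separate outlier points), and it uses the median-of-halves convention for Q1 and Q3. The function names and the sample values are hypothetical.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    if len(ordered) < 4:
        raise ValueError("need at least four values")
    half = len(ordered) // 2
    # Assumed convention: Q1 and Q3 are the medians of the lower and upper halves.
    return (ordered[0], median(ordered[:half]), median(ordered),
            median(ordered[-half:]), ordered[-1])


def text_box_plot(values, width=60):
    """Draw a one-line, horizontal box plot using plain characters."""
    low, q1, med, q3, high = five_number_summary(values)
    span = (high - low) or 1          # avoid dividing by zero if all values are equal

    def column(x):
        # Map a data value to a character position between 0 and width - 1.
        return round((x - low) / span * (width - 1))

    row = ["-"] * width               # whisker running from minimum to maximum
    for i in range(column(q1), column(q3) + 1):
        row[i] = "="                  # box covering the Middle 50% (Q1 to Q3)
    row[0] = row[-1] = "|"            # end caps at the minimum and maximum
    row[column(med)] = "M"            # median marker inside the box
    return "".join(row)


# Twelve hypothetical days to initial decision (illustrative only).
days = [25, 30, 31, 34, 36, 38, 39, 41, 43, 47, 52, 60]

print(five_number_summary(days))
print(text_box_plot(days))
```

A shorter run of "=" characters corresponds to a shorter box, meaning the middle half of the manuscripts had more similar times to initial decision.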
Now that we have created the box plot, it will be easier for our audience to understand the authors’ experience for each journal. The median time to initial decision for Journal A is 38 days. This journal also has shorter boxes (smaller variance in the data points). So, 50% of the authors had a very similar time to initial decision for their manuscripts. For Journal A, the middle 50% of manuscripts get an initial decision sometime between 32.5 and 42 days (IQR = 9.5 days). Journal B has a slightly better median value of 36 days and short boxes (low variance), where 50% of its manuscripts reach an initial decision within 30.5 to 41.5 days (IQR = 11 days). However, the box plot also shows that some authors have very different experiences. Though the minimum values for Journal A and Journal B are similar, the maximum values are significantly different. At least one of the manuscripts at Journal B took 78 days to reach an initial decision.
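The IQR figures quoted for Journal A and Journal B follow directly from the quartile values stated above. The short Python check below uses only those reported numbers; the dictionary layout and variable names are our own illustration.

```python
# Quartile values for Journals A and B as reported in the text (in days).
reported = {
    "Journal A": {"q1": 32.5, "median": 38, "q3": 42},
    "Journal B": {"q1": 30.5, "median": 36, "q3": 41.5},
}

# IQR = Q3 - Q1 for each journal.
iqr_by_journal = {name: stats["q3"] - stats["q1"] for name, stats in reported.items()}

for name, iqr in iqr_by_journal.items():
    print(f"{name}: median {reported[name]['median']} days, IQR {iqr} days")
```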
Journal C’s data tells another story. While the median value for Journal C is 37 days, there is a larger variance in the time to initial decision, with 50% of the submissions reaching an initial decision in 23.5 to 52 days (IQR = 28.5 days). Because this range is so large, it is hard to predict how fast a manuscript will move through this journal. The box plot also shows that at least one manuscript received a decision after only 2 days, while other submissions took up to 85 days to get an initial decision. This shows that the authors’ experience at Journal C is very inconsistent. It will be hard for the journal staff to advise an author on what they can expect with any degree of certainty.
As we all strive to create a positive, predictable experience for our authors, box plots can give journal staff illuminating information about their journal’s performance for any key indicator where there is a possible range of values.