Outliers, Consistency and Context: the Importance of Reporting Variability in Editorial Office Performance Data
A measure of variance is a summary statistic representing the amount of spread or scattering in a data set, most commonly given as either a standard deviation or an interquartile range.
If your data is normally distributed, it is appropriate to report the mean (average). Standard deviation is usually reported in conjunction with the mean to describe how the data are distributed.
If your data is skewed or bimodal (not the standard bell-shaped curve seen for normally distributed data), it is more appropriate to report the median. When reporting the median, use the interquartile range (IQR) to describe the data’s distribution.
This article was originally published in Volume 15, Issue 1 of EON (Editorial Office News) in February 2022.
July 21, 2022
By: Jason Roberts and Sherrie Hill
Are you reporting what you think is a key value in your performance reports? Does that statistic tell the whole story? Does your audience take away a full data-derived understanding of your journal stakeholder behaviors? Are you basing processing protocols off data points without full context?
Here is the problem: many journal offices report key indicators as single values such as total submission volume, number of reject decisions, total accepted manuscripts, number of accepted review invitations. This is appropriate since the value (e.g. the journal received 250 submissions in 2021) is based on a single count of an item. There were not 249 or 251 submissions, just 250 submissions. But other indicators that editorial offices report on, particularly anything that is a measurement of time, should ideally be provided with additional information such as a measure of variance.
Understanding simple statistics that add context
So, what is a measure of variance and why is it useful to provide such a measure? A measure of variance is a summary statistic that represents the amount of spread or scattering in a data set, most commonly given as either a standard deviation or an interquartile range. We appreciate that both of these statistical concepts may sound daunting, but we hope this article will aid your understanding by using examples relevant to the editorial office experience. Both measures are also easy to generate in MS Excel, if that is the software you use to analyze your journal data.
To illustrate what we mean, Figure 1 (a and b) shows how measures of variance can provide a richer description of your journal’s performance.
Both journals presented in these visual examples have a mean turnaround time to initial decision of 30 days. The journal in Figure 1a has a neat-looking bell curve, with the majority of decisions being posted close to the mean turnaround time. The journal represented in Figure 1b, with a distinctive “W” shape, is in reality delivering a wildly variable service to its authors. Sometimes it is very quick; on other occasions it is painfully slow. So, why should you care about that? Most obviously, repeat authors appreciate consistency of service delivery. If a journal fluctuates markedly in its performance, it may undermine a regular author’s faith in the journal’s ability to repeatedly deliver strong and timely peer review. Authors have many journal options when it comes to submitting their work. All editorial offices, therefore, should be striving for consistency or run the risk of handing their competitors an advantage. Journals like to boast about visible improvements in processing efficiency. Indeed, a reduction in the mean turnaround time to initial decision is always welcome. But if a proportion of articles still endure a painfully slow peer review experience, despite an improvement in the journal’s overall performance statistic, that is not necessarily a “win” for your journal.
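To make the comparison concrete, here is a minimal Python sketch. The two lists of turnaround times are invented for illustration (they are not the data behind Figure 1); they were chosen so that both hypothetical journals share a mean of 30 days while differing sharply in spread.

```python
from statistics import mean, pstdev

# Hypothetical turnaround times in days, invented for illustration only.
consistent_journal = [26, 28, 29, 30, 30, 31, 32, 34]
erratic_journal = [5, 8, 12, 30, 30, 48, 52, 55]

for name, days in [("consistent", consistent_journal), ("erratic", erratic_journal)]:
    # pstdev is the population standard deviation (all decisions are included).
    print(f"{name}: mean = {mean(days)} days, SD = {pstdev(days):.1f} days")
```

Both lists report the same mean, so only the second number reveals that authors at the erratic journal face a far less predictable wait.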
It might be useful to explore this concept further with a more detailed example. For instance, if you were asked how long the reviewers for your journal take to submit a review, you actually cannot state a single, absolute, value of time. Some take 5 days, some take 15 days. Some reviewers will submit their review on the day the review is due, some will be early, some will be late. In these instances, you need to provide your audience with more information so that they have a better understanding of the actual situation. This is done by providing not only the mean (average) or median (middle value) for the data, but also a measure of how much variability is present. Again, we are doing this to determine the consistency of reviewer behavior.
Running with this example further, it might be tempting to just report the mean or median along with the minimum and maximum values (i.e. the shortest and longest intervals between acceptance of the invitation to review and completion of the review). Additionally, if we subtract the minimum from the maximum value, we could report the range of our data. While this would give your audience information about the best and worst cases, that range value isn’t always informative. The table below, of a kind commonly seen in editorial office reports, illustrates the point using time to initial decision. If you employed this table, you would be reporting that on average it takes a manuscript 41 days to receive an initial decision, but that it could happen in as few as 7 days or take as many as 130 days, as was the case in the most extreme examples over the time period measured. Problematically, the range is so wide that you could not give anyone a realistic idea of how quickly a new manuscript would move through the peer-review process.
Time to Initial Decision
The range of data can be affected both by large variability in the data (i.e. the lack of consistency we referred to earlier) and by outliers. Outliers are data points that fall well outside what we would typically expect (for instance, a manuscript taking 130 days to receive an initial decision when 99% of manuscripts have received a decision in, perhaps, 80 days or less). If we consider the timing data most typically used in editorial offices, such as time to initial decision, time to final decision, time for a reviewer to submit their review, and so on, there are a multitude of ways that outliers can be introduced. What we need to do in generating our reports is recognize these data points as outliers and then decide how to report them. To ignore them is to risk undermining faith in the summary statistics you present, and it may lead to unwarranted processing policy revisions or unnecessary hand-wringing if the summary data looks bad when compared with previous annual performance metrics.
Let’s look at a few scenarios:
First, we will consider time for a reviewer to submit a review. Your journal might set the time limit for a reviewer to submit a review at 14 days. However, there are situations that arise that cause the journal to make exceptions. Sometimes the reviewer has an unexpected life event, or they may have an important deadline come up that they failed to appreciate when initially accepting the invitation to review. Rather than dropping this reviewer and beginning the process of locating a new potential reviewer (thus potentially taking considerable time to try to get this new individual to accept the assignment), most journals would rather grant the current reviewer additional time. There are also specialist reviewers, such as statistical reviewers, who all-too-frequently cannot finish their analysis in the allotted time and are routinely allowed additional time. If one of these reviewers also requests a due date extension, then their time to submit their review moves even farther away from the allotted 14 days. There are reviewers who are needed for niche submissions that require a particular reviewer’s specialized expertise. These reviewers are often top in their field and have busy careers. Most editors will be willing to wait additional time for their input.
We see similar issues occur in our other timing values, such as time from editor assignment to initial decision. This data can be affected by newer, less experienced editors taking longer to secure reviewers and make decisions. Guest editors, regardless of the journal and the training they may receive, regularly take longer because they are simply not familiar with the submission system or processing expectations, and their assigned submissions can become stalled. Submissions on niche topics or of low quality might require extra time to move through the peer review process as the editor struggles to find the required number of reviewers. In short, there are a multitude of reasons why your data may be wildly inconsistent. Some you can control for; others you can do nothing more than report and take into account.
In summary, longer than expected review times and slow-moving manuscripts can cause our timing data to have outliers. Therefore, due to the potential presence of outliers, the range is not a good statistic to report (at least in isolation, without further statistics) because it can give a false impression of what is happening routinely.
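The sensitivity of the range to a single slow manuscript can be sketched in a few lines of Python. The values below are hypothetical, and `data_range` is simply a helper name chosen for this illustration.

```python
# Hypothetical times (days) to initial decision, invented for illustration.
typical = [28, 33, 35, 38, 40, 42, 44, 45, 47, 52]
with_outlier = typical + [130]  # the same data plus one stalled manuscript


def data_range(values):
    """Return the range of the data: maximum minus minimum."""
    return max(values) - min(values)


print("range without the outlier:", data_range(typical))
print("range with the outlier:   ", data_range(with_outlier))
```

Ten of the eleven manuscripts are unchanged, yet the reported range more than quadruples, which is why the range on its own says little about routine performance.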
Standard Deviations and Interquartile Ranges
Standard deviation (SD) is another common statistic that is frequently used to describe how data are distributed within a dataset. Standard deviation is a mathematical value calculated using the mean (average or μ) for the dataset.
In the editorial office, we would use the population standard deviation rather than the sample standard deviation, since we know the actual value (time, measured in days) for every review that was submitted, for example. As a rule of thumb, if your data is normally distributed, about 68% of your data will fall within one standard deviation either side of the mean (average).
For this sample case, we would report that the time it takes a submission to reach an initial decision has a mean of 41 days (+/- 14 days). Therefore, most (about 68%) of the submissions will receive an initial decision between 27 and 55 days (i.e. 14 days either side of the mean of 41 days). As you can see, SD allows us to give a more accurate representation of the spread of our data than the range. To drive the point home further, a smaller standard deviation would reveal a more consistent performance because the data are less spread out around the mean.
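As a rough sketch of the same calculation in Python, the example below uses `statistics.pstdev` (the population form, as opposed to `statistics.stdev`, the sample form). The list of times is hypothetical and is not the data behind the table; it also checks what share of values actually falls within one standard deviation of the mean, since the 68% figure is only a rule of thumb for normally distributed data.

```python
from statistics import mean, pstdev

# Hypothetical times (days) to initial decision; not the article's data.
days = [22, 30, 35, 38, 40, 41, 43, 47, 52, 62]

mu = mean(days)
sigma = pstdev(days)  # population SD: every decision in the period is included
low, high = mu - sigma, mu + sigma

# Share of decisions that actually fall within one SD of the mean.
share_within = sum(low <= d <= high for d in days) / len(days)

print(f"mean = {mu:.0f} days (+/- {sigma:.0f} days)")
print(f"{share_within:.0%} of decisions fell between {low:.0f} and {high:.0f} days")
```

With a small or non-normal dataset like this one, the observed share need not be 68%, which is worth checking before quoting the rule of thumb in a report.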
Time to Initial Decision
However, there are two potential issues with using standard deviation for editorial office timing charts. First, interpreting the standard deviation in this way (the mean plus or minus one SD covering about 68% of the data) assumes that your data is normally distributed. That means that you have just as many values that are higher than the mean (average) value as you do that are lower than the average and that they are evenly distributed in a bell shape (as shown in Figure 3). In general, there is no statistical reason that would lead to a journal’s timing data being normally distributed. For instance, reviewers can be wildly inconsistent in when they return a peer review. This is true not just when we are comparing reviewer-to-reviewer performance, but even when we look at the consistency of an individual’s own performance across multiple manuscripts (i.e. sometimes they are fast, and sometimes they are slow). And then there are those pesky outliers we mentioned earlier, which distort the mean. For the particular data set that we are considering, our distribution looks like this (Figure 4):
As you can see, the histogram of the data doesn’t have the nice bell-shaped normal distribution that we saw in the previous standard deviation chart (Figure 3). That means that if we calculated a standard deviation, we could not say with confidence that a value is just as likely to fall the same distance above or below our mean (average) value.
Secondly, as the formula in Figure 2 shows, the standard deviation calculation uses the dataset’s mean (μ or average). The mean (average) often produces a good estimated value to represent a dataset unless, and this is crucial, there are outliers. As just mentioned, data points that fall well below or above the expected value can significantly affect the calculated mean (average). Though you might have only a few outliers, the mean (average) could be affected enough to change the year-on-year data trend, which might not accurately represent how your journal is actually running. It is likely that most journals have at some point experienced a paper that took 200 days to complete initial peer review, completely throwing off the mean turnaround time in the process.
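The effect of one such paper can be illustrated with a short Python sketch. The turnaround times are hypothetical; the only change between the two lists is a single 200-day manuscript.

```python
from statistics import mean, median

# Hypothetical times (days) to initial decision, invented for illustration.
typical = [28, 33, 35, 38, 40, 42, 44, 45, 47, 52]
with_outlier = typical + [200]  # one paper that took 200 days

for label, days in [("without outlier", typical), ("with outlier", with_outlier)]:
    print(f"{label}: mean = {mean(days):.1f} days, median = {median(days)} days")
```

In this sketch the single slow paper pulls the mean up by roughly two weeks, while the median moves by a single day.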
When reporting your timing data, it is safer to assume that the data is not normally distributed and that it might contain outliers. This is sometimes called “messy” data. However (and very importantly), there is a solution and one, frankly, we would love to see all editorial offices adopt. The interquartile range (IQR) is a better way to describe messy data. To calculate the interquartile range, all of the values are sequentially ordered from smallest to largest. Then the data is divided into four equal parts (quartiles). The middle two parts contain 50% of all of your data points. This is called the middle fifty or interquartile range.
Since the IQR is based on sequential order, we report the median (middle) value for the dataset rather than the mean (average). This is the number that falls exactly in the middle of all of the values once they are sequentially ordered. In our graphic, the median is shown as Q2. Outliers in the dataset have little or no effect on the median, since we only consider the sequential order of the values. Just as you can calculate a mean in Excel, it is just as simple to use Excel to calculate your IQR using the formula builder functionality.
The next thing that we need to tell our audience is how much variability there is in the dataset. For IQR, we do this by reporting the value that is the separating point between the first and second quartiles (Q1) and the value that is between the third and fourth quartiles (Q3).
Time to Initial Decision
For our example, we would report that the median time to initial decision was 42 days and the interquartile range was 10 days (calculated thus: 45 days [Q3 value] – 35 days [Q1 value]). When reported as a value, it is understood to mean that 50% of the data points were within a 10-day span (here, 35 to 45 days) that includes the median of 42 days. However, you should not divide the IQR in half and report it as a +/- value (such as 42 days +/- 5 days). Since our dataset is not necessarily normally distributed, there is no statistical reason to assume that a value is equally likely to fall above or below the median value. For the data in the table above, you will notice that the median value is not equidistant between 35 and 45 days. Again, Excel’s formula builder can do all the heavy lifting for you here and tell you the Q3 and Q1 values and the IQR.
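For readers who work outside Excel, the same quartile calculation can be sketched in Python. The nine values below are hypothetical, chosen only so that the result matches the summary figures quoted above (Q1 of 35, median of 42, Q3 of 45). Note that software packages use slightly different conventions for interpolating quartiles, so results from other tools may differ a little; `method="inclusive"` is one such convention.

```python
from statistics import quantiles

# Hypothetical times (days) to initial decision, chosen to reproduce the
# article's summary values; this is not the journal's actual dataset.
days = [25, 31, 35, 39, 42, 44, 45, 50, 130]

# quantiles() sorts the data and returns the three cut points Q1, Q2, Q3.
q1, q2, q3 = quantiles(days, n=4, method="inclusive")
iqr = q3 - q1

print(f"median (Q2) = {q2:.0f} days")
print(f"Q1 = {q1:.0f} days, Q3 = {q3:.0f} days, IQR = {iqr:.0f} days")
```

The 130-day manuscript sits in the top quartile and so has no influence on Q1, the median, or Q3 in this example.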
When preparing editorial office charts or tables for key indicators that are based on a range of values, you should include not only an estimated value (median/mean) but also some indication of how much that value might vary within your given dataset (a measure of variance such as SD or IQR). Unless you are able to determine that your dataset is normally distributed and does not have outliers, it is more prudent to report the median and interquartile range rather than the mean (average) and standard deviation. Giving your audience the median value and the IQR will provide enough information to form realistic expectations about events occurring in the journal. We appreciate that most journals just report a simple mean statistic, not least because even the most mathematically illiterate of us can grasp a simple concept such as the mean. We are advocating strongly, however, that editorial offices switch to using the median and including the interquartile range to illustrate the spread of data.
Why should you care?
Whether your job is simply to generate statistics on performance to give to an editor, or you use data to change management practices at your journal, it is imperative that you employ the correct and most effective techniques to interpret meaning in your data. The utility of providing measures of variance to set the context for understanding performance metrics therefore cannot be overstated. In short, when analyzing your data you really should ask yourself: is whatever you are measuring consistent? Are a few outliers disrupting your overall performance statistics? You don’t want to make decisions based on results that may have been heavily skewed by outliers.
You may not be statistically inclined. You may be intimidated by data and the act of generating reports. But many editors and editorial board members do possess the requisite interpretive skills and, truly, will appreciate the added statistical context you would provide when supplying a measure of variance.
As this article has demonstrated, it requires little effort to generate these additional statistics, and in doing so you can provide additional depth to the story your data tells.
Figure 1a – Bell curve/normal distribution
Figure 1b – Bimodal distribution
Figure 2 – Standard deviation equation
Figure 3 – Bell curve with standard deviations
Figure 4 – Skewed distribution
Figure 5 – Interquartile range