
7 Search Ranking Factors Analyzed: A Follow-Up Study

Posted by Jeff_Baker

Grab yourself a cup of coffee (or two) and buckle up, because we’re doing maths today.

Again.

Back it on up…

A quick refresher from last time: I pulled data from 50 keyword-targeted articles written on Brafton’s blog between January and June of 2018.

We wrote these articles using a technique published earlier on Moz that generates some seriously awesome results (we’re talking more than doubling our organic traffic in the last six months, but we will get to that in another publication).

We pulled this data again… Only I updated and reran all the data manually, doubling the dataset. No APIs. My brain is Swiss cheese.

We wanted to see how newly written, original content performs over time, and which factors may have impacted that performance.

Why do this the hard way, dude?

“Why not just pull hundreds (or thousands!) of data points from search results to broaden your dataset?”, you might be thinking. It’s been done successfully quite a few times!

Trust me, I was thinking the same thing while weeping tears into my keyboard.

The answer was simple: I wanted to do something different from the massive aggregate studies. I wanted a level of control over as many potentially influential variables as possible.

By using our own data, the study benefited from:

  • The same root Domain Authority across all content.
  • Similar individual URL link profiles (some laughs on that later).
  • Known original publish dates, with no reoptimization efforts or tinkering.
  • Known original keyword targets for each blog (rather than guessing).
  • Known and consistent content depth/quality scores (MarketMuse).
  • Similar content writing techniques for targeting specific keywords for each blog.

You will never eliminate the possibility of misinterpreting correlation as causation. But controlling some of the variables can help.

As Rand once said in a Whiteboard Friday, “Correlation does not imply causation (but it sure is a hint).”

Caveat:

What we gained in control, we lost in sample size. A sample size of 96 is much less useful than ten thousand, or a hundred thousand. So look at the data carefully and use discretion when considering the ranking factors you find most likely to be true.

This resource can help gauge the confidence you should put into each Pearson Correlation value. Generally, the stronger the relationship, the smaller the sample size needed to be confident in the results.
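To get a feel for how sample size changes the confidence you can place in a Pearson value, here is a minimal sketch using the Fisher z-transformation. It is a standard approximation, not necessarily the method behind the linked resource, and the function name `pearson_confidence_interval` is hypothetical. The inputs r = -0.343 and n = 96 come from this study; the other sample sizes are only for comparison.

```python
import math
from statistics import NormalDist


def pearson_confidence_interval(r, n, confidence=0.95):
    """Approximate confidence interval for a Pearson r via the Fisher z-transformation."""
    if n <= 3:
        raise ValueError("need more than 3 observations")
    z = math.atanh(r)                                   # Fisher transform of r
    se = 1.0 / math.sqrt(n - 3)                         # approximate standard error of z
    crit = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.96 for 95%
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


# Same r, different sample sizes (20 and 10,000 are hypothetical comparisons)
for n in (20, 96, 10_000):
    low, high = pearson_confidence_interval(-0.343, n)
    print(f"n={n:>6}: r is plausibly between {low:.3f} and {high:.3f}")
```

With 20 samples the interval crosses zero, so the same r would not rule out "no relationship." With 96 it stays below zero but is still wide, which is why the correlations in this post are best read as hints.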

So what exactly have you done here?

We have generated hints at what may influence the organic performance of newly created content. No more, and no less. But they are indeed interesting hints and maybe worth further discussion or research.

What have you not done?

We have not published sweeping generalizations about Google’s algorithm. This post should not be read as a definitive guide to Google’s algorithm, nor should you assume that your site will demonstrate the same correlations.

So what should I do with this data?

The best way to read this article is to look at the potential correlations we observed in our data and consider how those correlations may or may not apply to your content and strategy.

I’m hoping that this study takes a new approach to studying individual URLs and stimulates constructive debate and conversation.

Your constructive criticism is welcome, and hopefully pushes these conversations forward!

The stat sheet

So quit jabbering and show me the goods, you say? Alright, let’s start with our stats sheet, formatted like a baseball card, because why not?:

*Note: Only blogs with complete ranking data were used in the study. We threw out blogs with missing data rather than adding arbitrary numbers.

And as always, here is the original data set if you care to reproduce my results.

So now the part you have been waiting for…

The analysis

To start, please see the refresher on the Pearson Correlation Coefficient in my last blog post, or Rand’s.

1. Time and performance

I started with a question: “Do blogs age like a Macallan 18 served up neat on a warm summer Friday afternoon, or like tepid milk on a hot summer Tuesday?”

Does the time indexed play a role in how a piece of content performs?

Correlation 1: Time and target keyword position

First we will map the target keyword ranking positions against the number of days its corresponding blog has been indexed. Visually, if there is any correlation we will see some sort of negative or positive linear relationship.

There is a clear negative relationship between the two variables, which means the two variables may be related. But we need to go beyond visuals and use the PCC.

Days live vs. target keyword position

  • PCC: -.343
  • Relationship: Moderate

The data shows a moderate relationship between how long a blog has been indexed and the positional ranking of the target keyword.

But before getting carried away, we shouldn’t solely trust one statistical method and call it a day. Let’s take a look at things another way: Let’s compare the average age of articles whose target keywords rank in the top ten against the average age of articles whose target keywords rank outside the top ten.

Average age of articles based on position

  • Target KW position ≤ 10: 144.8 days
  • Target KW position > 10: 84.1 days

Now a story is starting to become clear: Our newly written content takes a significant amount of time to fully mature.

But for the sake of exhausting this hint, let’s look at the data one final way. We will group the data into buckets of target keyword positions, and days indexed, then apply them to a heatmap.

This should show us a clear visual clustering of how articles perform over time.

This chart, quite literally, paints a picture. According to the data, we shouldn’t expect a new article to realize its full potential until at least 100 days, and likely longer. As a blog post ages, it appears to gain more favorable target keyword positioning.
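For readers who want to build a similar heatmap from their own data, the bucketing step can be sketched roughly as follows. The rows are made up for illustration (they are not the study's data), and the bucket widths of 50 days and 10 positions are assumptions, since the post does not state the widths used.

```python
from collections import Counter

DAY_WIDTH = 50        # days per bucket (assumed, not from the study)
POSITION_WIDTH = 10   # ranking positions per bucket (assumed, not from the study)

# Hypothetical rows: (days indexed, target keyword position)
articles = [(35, 48), (62, 27), (88, 14), (110, 9), (131, 6), (150, 3), (171, 12), (190, 2)]


def heatmap_counts(rows, day_width=DAY_WIDTH, position_width=POSITION_WIDTH):
    """Count articles per (days bucket, position bucket) cell."""
    counts = Counter()
    for days, position in rows:
        day_bucket = (days // day_width) * day_width
        # Positions start at 1, so shift by one to keep 1-10 in the same bucket.
        position_bucket = ((position - 1) // position_width) * position_width + 1
        counts[(day_bucket, position_bucket)] += 1
    return counts


for (day_start, pos_start), count in sorted(heatmap_counts(articles).items()):
    print(f"days {day_start}-{day_start + DAY_WIDTH - 1}, "
          f"positions {pos_start}-{pos_start + POSITION_WIDTH - 1}: {count} article(s)")
```

Each cell count can then be mapped to a color in a spreadsheet or plotting tool of your choice.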

Correlation 2: Time and total ranking keywords on URL

You’ll find that when you write an article it will (hopefully) rank for the keyword you target. But often times it will also rank for other keywords. Some of these are variants of the target keyword, some are tangentially related, and some are purely random noise.

Instinct will tell you that you want your articles to rank for as many keywords as possible (ideally variants and tangentially related keywords).

Predictably, we have found that the relationship between the number of keywords an article ranks for and its estimated monthly organic traffic (per SEMrush) is strong (.447).

We want all of our articles to do things like this:

We want lots of variants each with significant search volume. But, does an article increase the total number of keywords it ranks for over time? Let’s take a look.

Visually this graph looks a little murky due to the existence of two clear outliers on the far right. We will first run the analysis with the outliers, and again without. With the outliers, we observe the following:

Days live vs. total keywords ranking on URL (w/outliers)

  • PCC: .281
  • Relationship: Weak/borderline moderate

There appears to be a relationship between the two variables, but it isn’t as strong. Let’s see what happens when we remove those two outliers:

Visually, the relationship looks stronger. Let’s look at the PCC:

Days live vs. total keywords ranking on URL (without outliers)

  • PCC: .390
  • Relationship: Moderate/borderline strong

The relationship appears to be much stronger with the two outliers removed.
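To see why a couple of points can move the PCC this much, here is a small sketch on made-up numbers (not the study's data). The last two entries play the role of outliers: old posts that rank for almost no keywords. The helper name `pearson` is just an illustrative choice.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: days live vs. total ranking keywords; the last two points are outliers
days = [30, 45, 60, 75, 90, 105, 120, 135, 150, 165, 170, 175]
keywords = [4, 9, 7, 14, 12, 18, 21, 19, 26, 30, 2, 1]

with_outliers = pearson(days, keywords)
without_outliers = pearson(days[:-2], keywords[:-2])
print(f"with outliers:    {with_outliers:.3f}")
print(f"without outliers: {without_outliers:.3f}")
```

Dropping points should always be reported alongside the untrimmed result, as done here, so readers can judge both numbers.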

But again, let’s look at things another way.

Let’s look at the average age of the top 25% of articles and compare them to the average age of the bottom 25% of articles:

Average age of top 25% of articles versus bottom 25%

  • Top 25%: 148.9 days
  • Bottom 25%: 73.8 days

This is exactly why we look at data multiple ways! The top 25% of blog posts with the most ranking keywords have been indexed an average of 149 days, while the bottom 25% have been indexed 74 days — roughly half.
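The top-versus-bottom comparison is easy to reproduce on your own content. A minimal sketch, assuming each article is a (days indexed, total ranking keywords) pair; the rows and the function name `quartile_average_age` are hypothetical:

```python
from statistics import mean

# Hypothetical rows: (days indexed, total ranking keywords). Not the study's data.
articles = [(40, 3), (55, 8), (70, 5), (95, 12), (120, 20), (140, 35), (160, 28), (180, 41)]


def quartile_average_age(rows):
    """Average days indexed for the top and bottom 25% of rows by keyword count."""
    ranked = sorted(rows, key=lambda row: row[1], reverse=True)
    size = max(1, len(ranked) // 4)   # number of articles in each quartile
    top, bottom = ranked[:size], ranked[-size:]
    return mean(days for days, _ in top), mean(days for days, _ in bottom)


top_age, bottom_age = quartile_average_age(articles)
print(f"top 25%: {top_age:.1f} days, bottom 25%: {bottom_age:.1f} days")
```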

To be fully sure, let’s again cluster the data into a heatmap to observe where performance falls on the time continuum:

We see a very similar pattern as in our previous analysis: a clustering of top-performing blogs starting at around 100 days.

Time and performance assumptions

You still with me? Good, because we are saying something BIG here. In our observation, it takes between 3 and 5 months for new content to perform in organic search. Or at the very least, mature.

To look at this one final way, I’ve created a scatterplot of only the top 25% of highest performing blogs and compared them to their time indexed:

There are 48 data points on this chart. The blue points represent the top 25% of articles in terms of strongest target keyword ranking position. The orange points represent the top 25% of articles with the highest number of keyword rankings on their URL. (These can be, and some are, the same URL.)

Looking at the data a little more closely, we see the following:

90% of the top 25% of highest-performing content took at least 100 days to mature, and only two articles took less than 75 days.

Time and performance conclusion

For those of you just starting a content marketing program, remember that you may not see the full organic potential for your first piece of content until month 3 at the earliest. And, it takes at least a couple months of content production to make a true impact, so you really should wait a minimum of 6 months to look for any sort of results.

In conclusion, we expect new content to take at least 100 days to fully mature.

2. Links

But wait, some of you may be saying. What about links, buddy? Articles build links over time, too!

It stands to reason that a blog will gain links (and ranking potential) over time. Links matter, and higher positioned rankings gain links at a faster rate. Thus, we are at risk of misinterpreting correlation as causation if we don’t look at this carefully.

But what none of you know, that I know, is that being the terrible SEO that I am, I had no linking strategy with this campaign.

And I mean zero strategy. The average article generated 1.3 links from .5 linking domains.

Nice.

Linking domains vs. target keyword position

  • PCC: -.022
  • Relationship: None
  • Average linking domains to top 25% of articles: .46
  • Average linking domains to bottom 25% of articles: .46

The one thing consistent across all the articles was a shocking and embarrassing lack of inbound links. This is demonstrated by an insignificant correlation coefficient of -.022. The same goes for the total number of links per URL, with a correlation coefficient of -.029.

These articles appear to have performed primarily on their content rather than inbound links.

(And they certainly would have performed much better with a strong, or any, linking strategy. Nobody is arguing the value of links here.) But mostly…

Shame on me.

Shame. Shame. Shame.

But on a positive note, we were able to generate a more controlled experiment on the effects of time and blog performance. So, don’t fire me just yet?

Note: It would be interesting to pull link quality metrics into the discussion (for the precious few links we did earn) rather than total volume. However, after a cursory look at the data, nothing stood out as being significant.

3. Word count

Content marketers and SEOs love talking about word count. And for good reason. When we collectively agreed that “quality content” was the key to rankings, it would stand to reason that longer content would be more comprehensive, and thus do a better job of satisfying searcher intent. So let’s test that theory.

Correlation 1: Target keyword position versus total word count

Will longer articles increase the likelihood of ranking for the keyword you are targeting?

Not in our case. To be sure, let’s run a similar analysis as before.

Word count vs. target keyword position

  • PCC: .111
  • Relationship: Negligible
  • Average word count of top 25% articles: 1,774
  • Average word count of bottom 25% articles: 1,919

The data shows no impact on rankings based on the length of our articles.

Correlation 2: Total keywords ranking on URL versus word count

One would think that longer content would result in additional ranking keywords, right? Even by accident, you would think that the more related topics you discuss in an article, the more keywords you will rank for. Let’s see if that’s true:

Total keywords ranking on URL vs. word count

  • PCC: -.074
  • Relationship: None

Not in this case.

Word count, speculative tangent

So how can it be that so many studies demonstrate higher word counts result in more favorable rankings? Some reconciliation is in order, so allow me to speculate on what I think may be happening in these studies.

  1. Most likely: Measurement techniques. These studies generally look at one factor relative to rankings: average absolute word count based on position. (And, there actually isn’t much of a difference in average word count between position one and ten.) As we are demonstrating in this article, there may be many other factors at play that need to be isolated and tested for correlations in order to get the full picture, such as: time indexed, on-page SEO (to be discussed later), Domain Authority, link profile, and depth/quality of content (also to be discussed later with MarketMuse as a measure). It’s possible that correlation does not imply causation, and by using word count averages as the single method of measure, we may be painting with too broad a stroke.
  2. Likely: High quality content is longer, by nature. We know that “quality content” is discussed in terms of how well a piece satisfies the intent of the reader. In an ideal scenario, you will create content that fully satisfies everything a searcher would want to know about a given topic. Ideally you own the resource center for the topic, and the searcher does not need to revisit SERPs and weave together answers from multiple sources. By nature, this type of comprehensive content is quite lengthy. Long-form content is arguably a byproduct of creating for quality. Cyrus Shepard does a better job of explaining this likelihood here.
  3. Less likely: Long-form threshold. The articles we wrote for this study ranged from just under 1,000 words to nearly as high as 4,000 words. One could consider all of these as “long-form content,” and perhaps Google does as well. Perhaps there is a word count threshold that Google uses.

This is all speculation. What we can say for certain is that all our content is 900 words and up, and shows no incremental benefit to be had from additional length.

Feel free to disagree with any (or all) of my speculations on my interpretation of the discrepancies of results, but I tend to have the same opinion as Brian Dean with the information available.

4. MarketMuse

At this point, most of you are familiar with MarketMuse. They have created a number of AI-powered tools that help with content planning and optimization.

We use the Content Optimizer tool, which evaluates the top 20 results for any keyword and generates an outline of all the major topics being discussed in SERPs. This helps you create content that is more comprehensive than your competitors, which can lead to better performance in search.

Based on the competitive landscape, the tool will generate a recommended content score (their proprietary algorithm) that you should hit in order to compete with the competing pages ranking in SERPs.

But… if you’re a competitive fellow, what happens if you want to blow the recommended score out of the water? Do higher scores have an impact on rankings? Does it make a difference if your competition has a very low average score?

We pulled every article’s content score, along with MarketMuse’s recommended scores and the average competitor scores, to answer these questions.

Correlation 1: Overall MarketMuse content score

Does a higher overall content score result in better rankings? Let’s take a look:

Absolute MarketMuse score vs. target keyword position

  • PCC: .000
  • Relationship: None

A perfect zero! We weren’t able to beat the system by racking up points. I also checked to see if a higher absolute score would result in a larger number of keywords ranking on the URL — it doesn’t.

Correlation 2: Beating the recommended score

As mentioned, based on the competitive landscape, MarketMuse will generate a recommended content score. What happens if you blow the recommended score out of the water? Do you get bonus points?

In order to calculate this correlation, we pulled the content score percentage attainment and compared it to the target keyword position. For example, if we scored a 30 against a recommended 25, we hit 120% attainment. Let’s see if it matters:

Percentage content score attainment vs. target keyword position

  • PCC: .028
  • Relationship: None

No bonus points for doing extra credit!

Correlation 3: Beating the average competitors’ scores

Okay, if you beat MarketMuse’s recommendations, you don’t get any added benefit, but what if you completely destroy your competitors’ average content scores?

We will calculate this correlation the same way we previously did, with percentage attainment over the average competitor. For example, if we scored a 30 over the average of 10, we hit 300% attainment. Let’s see if that matters:

Percentage attainment over average competitor score versus target KW position

  • PCC: -.043
  • Relationship: None

That didn’t work either! Seems that there are no hacks or shortcuts here.

MarketMuse summary

We know that MarketMuse works, but it seems that there are no additional tricks to this tool.

If you regularly hit the recommended score as we did (average 110% attainment, with 81% of blogs hitting 100% attainment or better) and cover the topics prescribed, you should do well. But don’t fixate on competitor scores or blowing the recommended score out of the water. You may just be wasting your time.

Note: It’s worth noting that we probably would have shown stronger correlations had we intentionally bombed a few MarketMuse scores. Perhaps a test for another day.

5. On-page optimization

Ah, old-school technical SEO. This type of work warms the cockles of a seasoned SEO’s heart. But does it still have a place in our constantly evolving world? Has Google advanced to the point where it doesn’t need technical cues from SEOs to understand what a page is about?

To find out, I have pulled Moz’s on-page optimization score for every article and compared the scores to the target keywords’ positional rankings:

Let’s take a look at the scatterplot for all the keyword targets.

Now looking at the math:

On-page optimization score vs. target keyword position

  • PCC: -.384
  • Relationship: Moderate/strong
  • Average on-page score for top 25%: 91%
  • Average on-page score for bottom 25%: 87%

If you have a keen eye you may have noticed a few strong outliers on the scatterplot. If we remove three of the largest outliers, the correlation goes up to -.435, a strong relationship.

Before we jump to conclusions, let’s look at this data one final way.

Let’s take a look at the percentage of articles with their target keywords ranking 1–10 that also have a 90% on-page score or better. We will compare that number to the percentage of articles ranking outside the top ten that also have a 90% on-page score or better.

If our assumption is correct, we will see a much higher percentage of keywords ranking 1–10 with an on-page score of 90% or better, and a lower number for articles ranking greater than 10.

On-page optimization score by rankings

  • Percentage of KWs ranking 1–10 with ≥ 90% score: 73.5%
  • Percentage of keywords ranking >10 with ≥ 90% score: 53.2%
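The comparison boils down to splitting articles by ranking group and computing, within each group, the share that meets the score threshold. A rough sketch with hypothetical rows (not the study's data) and a hypothetical helper name:

```python
# Hypothetical rows: (target keyword position, on-page score in percent)
articles = [(2, 95), (5, 91), (8, 88), (9, 97), (14, 92), (22, 85), (37, 90), (51, 79)]


def share_meeting_score(rows, min_score=90):
    """Fraction of rows whose on-page score is at least min_score."""
    if not rows:
        return 0.0
    return sum(1 for _, score in rows if score >= min_score) / len(rows)


top_ten = [row for row in articles if row[0] <= 10]
outside_top_ten = [row for row in articles if row[0] > 10]

print(f"positions 1-10: {share_meeting_score(top_ten):.1%}")
print(f"positions >10:  {share_meeting_score(outside_top_ten):.1%}")
```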

This is enough of a hint for me. I’m implementing a 90% minimum on-page score from here on out.

Old school SEOs, rejoice!

6. The competition’s average word count

We won’t put this “word count” argument to bed just yet…

Let’s ask ourselves, “Does it matter how long the average content of the top 20 results is?”

Is there a relationship between the length of your content versus the average competitor?

What if your competitors are writing very short form, and you want to beat them with long-form content?

We will measure this the same way as before, with percentage attainment. For example, if the average word count of the top 20 results for “content marketing agency” is 300, and our piece is 450 words, we hit 150% attainment.
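Since the same attainment ratio shows up in three of these analyses, here it is as a tiny function. The name `percent_attainment` is hypothetical; the inputs are the worked examples already given in this post.

```python
def percent_attainment(actual, benchmark):
    """Return actual as a percentage of benchmark (100 means exactly on target)."""
    if benchmark <= 0:
        raise ValueError("benchmark must be positive")
    return actual * 100 / benchmark


print(percent_attainment(30, 25))    # content score vs. recommended score
print(percent_attainment(30, 10))    # content score vs. average competitor score
print(percent_attainment(450, 300))  # word count vs. average competitor word count
```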

Let’s see if you can “out-verbose” your opponents.

Percentage word count attainment versus target KW position

  • PCC: .062
  • Relationship: None

Alright, I’ll put word count to bed now, I promise.

7. Keyword density

You’ve made it to the last analysis. Congratulations! How many cups of coffee have you consumed? No judgment; this report was responsible for entire coffee farms being completely decimated by yours truly.

For selfish reasons, I couldn’t resist the temptation to dispel this ancient tactic of “using target keywords” in blog content. You know what I’m talking about: when someone says “This blog doesn’t FEEL optimized… did you use the target keyword enough?”

There are still far too many people that believe that littering target keywords throughout a piece of content will yield results. And misguided SEO agencies, along with certain SEO tools, perpetuate this belief.

Yoast has a tool in WordPress that some digital marketers live and die by. They don’t think that a blog is complete until Yoast shows the magical green light, indicating that the content has satisfied the majority of its SEO recommendations:

Uh oh, keyword density is too low! Let’s see if that ACTUALLY matters.

Not looking so good, my keyword-stuffing friends! Let’s take a look at the PCC:

Target keyword ranking position vs. Yoast keyword density

  • PCC: .097
  • Relationship: None/Negligible

Believers would like to see a negative relationship here: as the keyword density goes up, the ranking position number goes down (that is, the ranking improves), producing a downward-sloping line.

What we are looking at is a slightly upward-sloping line, which would indicate losing rankings by keyword stuffing — but fortunately not TOO upward sloping, given the low correlation value.

Okay, so PLEASE let that be the end of “keyword density.” This practice has been disproven in past studies, as referenced by Zyppy. Let’s confidently put this to bed, forever. Please.

Oh, and just for kicks, the Flesch Reading Ease score has no bearing on rankings either (-.03 correlation). Write to a third grade level, or a college level, it doesn’t matter.

TL;DR (I don’t blame you)

What we learned from our data

  1. Time: It took 100 days or more for an article to fully mature and show its true potential. A content marketing program probably shouldn’t be fully scrutinized until month 5 or 6 at the very earliest.
  2. Links: Links matter, I’m just terrible at generating them. Shame.
  3. Word count: It’s not about the length of the content, in absolute terms or relative to the competition. It’s about what is written and how resourceful it is.
  4. MarketMuse: We have proven that MarketMuse works as it prescribes, but there is no added benefit to breaking records.
  5. On-page SEO: Our data demonstrates that it still matters. We all still have a job.
  6. Competitor content length: We weren’t successful at blowing our competitors out of the water with longer content.
  7. Keyword density: Just stop. Join us in modern times. The water is warm.

In conclusion, some reasonable guidance we agree on is:

Wait at least 100 days to evaluate the performance of your content marketing program, write comprehensive content, and make sure your on-page SEO score is 90%+.

Oh, and build links. Unlike me. Shame.

Now go take a nap.


Ranking the 6 Most Accurate Keyword Research Tools

Posted by Jeff_Baker

In January of 2018 Brafton began a massive organic keyword targeting campaign, amounting to over 90,000 words of blog content being published.

Did it work?

Well, yeah. We doubled the number of total keywords we rank for in less than six months. By using our advanced keyword research and topic writing process published earlier this year we also increased our organic traffic by 45% and the number of keywords ranking in the top ten results by 130%.

But we got a whole lot more than just traffic.

From planning to execution and performance tracking, we meticulously logged every aspect of the project. I’m talking blog word count, MarketMuse performance scores, on-page SEO scores, days indexed on Google. You name it, we recorded it.

As a byproduct of this nerdery, we were able to draw juicy correlations between our target keyword rankings and variables that can affect and predict those rankings. But specifically for this piece…

How well keyword research tools can predict where you will rank.

A little background

We created a list of keywords we wanted to target in blogs based on optimal combinations of search volume, organic keyword difficulty scores, SERP crowding, and searcher intent.

We then wrote a blog post targeting each individual keyword. We intended for each new piece of blog content to rank for the target keyword on its own.

With our keyword list in hand, my colleague and I manually created content briefs explaining how we would like each blog post written to maximize the likelihood of ranking for the target keyword. Here’s an example of a typical brief we would give to a writer:

This image links to an example of a content brief Brafton delivers to writers.

Between mid-January and late May, we ended up writing 55 blog posts, each targeting a unique keyword. 50 of those blog posts ended up ranking in the top 100 of Google results.

We then paused and took a snapshot of each URL’s Google ranking position for its target keyword and its corresponding organic difficulty scores from Moz, SEMrush, Ahrefs, SpyFu, and KW Finder. We also took the PPC competition scores from the Keyword Planner Tool.

Our intention was to draw statistical correlations between our keyword rankings and each tool’s organic difficulty score. With this data, we were able to report on how accurately each tool predicted where we would rank.

This study is uniquely scientific, in that each blog had one specific keyword target. We optimized the blog content specifically for that keyword. Therefore every post was created in a similar fashion.

Do keyword research tools actually work?

We use them every day, on faith. But has anyone ever actually asked, or better yet, measured how well keyword research tools report on the organic difficulty of a given keyword?

Today, we are doing just that. So let’s cut through the chit-chat and get to the results…

This image ranks each of the 6 keyword research tools in order: Moz leads with 4.95 stars out of 5, followed by KW Finder, SEMrush, Ahrefs, SpyFu, and lastly Keyword Planner Tool.

While Moz wins top-performing keyword research tool, note that any keyword research tool with organic difficulty functionality will give you an advantage over flipping a coin (or using Google Keyword Planner Tool).

As you will see in the following paragraphs, we have run each tool through a battery of statistical tests to ensure that we painted a fair and accurate representation of its performance. I’ll even provide the raw data for you to inspect for yourself.

Let’s dig in!

The Pearson Correlation Coefficient

Yes, statistics! For those of you currently feeling panicked and lobbing obscenities at your screen, don’t worry — we’re going to walk through this together.

In order to understand the relationship between two variables, our first step is to create a scatter plot chart.

Below is the scatter plot for our 50 keyword rankings compared to their corresponding Moz organic difficulty scores.

This image shows a scatter plot for Moz's keyword difficulty scores versus our keyword rankings. In general, the data clusters fairly tight around the regression line.

We start with a visual inspection of the data to determine if there is a linear relationship between the two variables. Ideally for each tool, you would expect to see the X variable (keyword ranking) increase proportionately with the Y variable (organic difficulty). Put simply, if the tool is working, the higher the keyword difficulty, the less likely you will rank in a top position, and vice-versa.

This chart is all fine and dandy, however, it’s not very scientific. This is where the Pearson Correlation Coefficient (PCC) comes into play.

The PCC measures the strength of a linear relationship between two variables. The output of the PCC is a score ranging from +1 to -1. A score greater than zero indicates a positive relationship; as one variable increases, the other increases as well. A score less than zero indicates a negative relationship; as one variable increases, the other decreases. Both scenarios indicate some level of linear association between the two variables, though not necessarily a causal one. The stronger the relationship between the two variables, the closer to +1 or -1 the PCC will be. Scores near zero indicate a weak relationship or none at all.
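If you would rather see the PCC as code than as prose, here is a minimal sketch of the textbook definition: the covariance of the two variables divided by the product of their standard deviations. The data is made up for illustration, and `pearson_r` is a hypothetical helper, not how the numbers in this post were produced.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return covariance / (std_x * std_y)


# Made-up difficulty scores and ranking positions, for illustration only
difficulty = [18, 25, 31, 40, 47, 52, 60]
position = [4, 3, 12, 9, 21, 17, 30]

r = pearson_r(difficulty, position)
flipped = pearson_r(difficulty, [-p for p in position])
print(f"difficulty vs. position:          {r:+.3f}")
print(f"difficulty vs. negated position:  {flipped:+.3f}")
```

Negating one variable flips the sign but not the strength, which is why the interpretation depends on the absolute value of the score.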

Phew. Still with me?

So each of these scatter plots will have a corresponding PCC score that will tell us how well each tool predicted where we would rank, based on its keyword difficulty score.

We will use the following table from statisticshowto.com to interpret the PCC score for each tool:

Correlation coefficient (r score) and key

  • +.70 or higher: Very strong positive relationship
  • +.40 to +.69: Strong positive relationship
  • +.30 to +.39: Moderate positive relationship
  • +.20 to +.29: Weak positive relationship
  • +.01 to +.19: No or negligible relationship
  • 0: No relationship [zero correlation]
  • -.01 to -.19: No or negligible relationship
  • -.20 to -.29: Weak negative relationship
  • -.30 to -.39: Moderate negative relationship
  • -.40 to -.69: Strong negative relationship
  • -.70 or lower: Very strong negative relationship
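The interpretation table can also be expressed as a small lookup, which is handy when labeling many coefficients at once. This is a sketch: the function name is hypothetical, and values that fall between the table's printed bands (for example .195) are assigned to the lower band, which is an assumption.

```python
def describe_relationship(r):
    """Map a Pearson r to the labels in the interpretation table above."""
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and +1")
    size = abs(r)
    if size == 0:
        return "No relationship"
    if size < 0.20:
        return "No or negligible relationship"
    if size < 0.30:
        strength = "Weak"
    elif size < 0.40:
        strength = "Moderate"
    elif size < 0.70:
        strength = "Strong"
    else:
        strength = "Very strong"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} relationship"


for value in (0.412, 0.364, 0.281, -0.022):
    print(f"{value:+.3f}: {describe_relationship(value)}")
```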

In order to visually understand what some of these relationships would look like on a scatter plot, check out these sample charts from Laerd Statistics.

These scatter plots show three types of correlations: positive, negative, and no correlation. Positive correlations have data plots that move up and to the right. Negative correlations move down and to the right. No correlation has data that follows no linear pattern.

And here are some examples of charts with their correlating PCC scores (r):

These scatter plots show what different PCC values look like visually. The tighter the grouping of data around the regression line, the higher the PCC value.

The closer the numbers cluster towards the regression line in either a positive or negative slope, the stronger the relationship.

That was the tough part – you still with me? Great, now let’s look at each tool’s results.

Test 1: The Pearson Correlation Coefficient

Now that we’ve all had our statistics refresher course, we will take a look at the results, in order of performance. We will evaluate each tool’s PCC score, the statistical significance of the data (P-val), the strength of the relationship, and the percentage of keywords the tool was able to find and report keyword difficulty values for.

In order of performance:

#1: Moz

This image shows a scatter plot for Moz's keyword difficulty scores versus our keyword rankings. In general, the data clusters fairly tight around the regression line.

Revisiting Moz’s scatter plot, we observe a tight grouping of results relative to the regression line with few moderate outliers.

Moz Organic Difficulty Predictability

  • PCC: 0.412
  • P-val: .003 (P<0.05)
  • Relationship: Strong
  • % Keywords Matched: 100.00%

Moz came in first with the highest PCC of .412. As an added bonus, Moz grabs data on keyword difficulty in real time, rather than from a fixed database. This means that you can get a keyword difficulty score for any keyword.

In other words, Moz was able to generate keyword difficulty scores for 100% of the 50 keywords studied.
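If you want to sanity-check a P-val like the one in the Moz table, one common approach is to convert r into a t-statistic with n - 2 degrees of freedom. The sketch below assumes a two-tailed test at the 0.05 level and takes the critical value for 48 degrees of freedom (about 2.011) from a standard t table. It is a rough check, not necessarily the exact procedure our spreadsheet used.

```python
import math


def t_statistic(r: float, n: int) -> float:
    """t-statistic for a Pearson coefficient r computed from n data points."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


# Two-tailed critical value at alpha = 0.05 for 48 degrees of freedom,
# taken from a standard t table (assumption: this is the test being applied).
T_CRITICAL_DF48 = 2.011

t = t_statistic(0.412, 50)  # Moz: PCC of .412 across 50 keywords
significant = abs(t) > T_CRITICAL_DF48
print(round(t, 2), significant)
```

A t-statistic comfortably above the critical value is consistent with the reported P<0.05.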

#2: SpyFu

This image shows a scatter plot for SpyFu's keyword difficulty scores versus our keyword rankings. The plot is similar looking to Moz's, with a few larger outliers.

Visually, SpyFu shows a fairly tight clustering amongst low difficulty keywords, and a couple moderate outliers amongst the higher difficulty keywords.

SpyFu Organic Difficulty Predictability

PCC                   0.405
P-val                 .01 (P<0.05)
Relationship          Strong
% Keywords Matched    80.00%

SpyFu came in right behind Moz with a PCC of .405, 1.7% weaker. However, the tool ran into the largest issue with keyword matching, with only 40 of 50 keywords producing keyword difficulty scores.

#3: SEMrush

This image shows a scatter plot for SEMrush's keyword difficulty scores versus our keyword rankings. The data has a significant number of outliers relative to the regression line.

SEMrush would certainly benefit from a couple mulligans (a second chance to perform an action). The Correlation Coefficient is very sensitive to outliers, which pushed SEMrush’s score down to third (.364).

SEMrush Organic Difficulty Predictability

PCC                   0.364
P-val                 .01 (P<0.05)
Relationship          Moderate
% Keywords Matched    92.00%

Further complicating the research process, only 46 of 50 keywords had keyword difficulty scores associated with them, and many of those had to be found through SEMrush’s “phrase match” feature individually, rather than through the difficulty tool.

Digging around for the data made the process more laborious.

#4: KW Finder

This image shows a scatter plot for KW Finder's keyword difficulty scores versus our keyword rankings. The data also has a significant number of outliers relative to the regression line.

With numerous strong outliers, KW Finder definitely could have benefitted from more than a few mulligans. It came in right behind SEMrush with a score of .360.

KW Finder Organic Difficulty Predictability

PCC                   0.360
P-val                 .01 (P<0.05)
Relationship          Moderate
% Keywords Matched    100.00%

Fortunately, the KW Finder tool had a 100% match rate without any trouble digging around for the data.

#5: Ahrefs

This image shows a scatter plot for Ahrefs' keyword difficulty scores versus our keyword rankings. The data shows tight clustering amongst low difficulty score keywords, and a wide distribution amongst higher difficulty scores.

Ahrefs comes in fifth by a large margin at .316, barely clearing the .30 threshold that separates a weak relationship from a moderate one.

Ahrefs Organic Difficulty Predictability

PCC                   0.316
P-val                 .03 (P<0.05)
Relationship          Moderate
% Keywords Matched    100%

On a positive note, the tool seems to be very reliable with low difficulty scores (notice the tight clustering for low difficulty scores), and matched all 50 keywords.

#6: Google Keyword Planner Tool

This image shows a scatter plot for Google Keyword Planner Tool's keyword difficulty scores versus our keyword rankings. The data shows randomly distributed plots with no linear relationship.

Before you ask, yes, SEO companies still use the paid competition figures from Google’s Keyword Planner Tool (and other tools) to assess organic ranking potential. As you can see from the scatter plot, there is in fact no linear relationship between the two variables.

Google Keyword Planner Tool Organic Difficulty Predictability

PCC                   0.045
P-val                 Statistically insignificant/no linear relationship
Relationship          Negligible/None
% Keywords Matched    88.00%

SEO agencies still using KPT for organic research (you know who you are!) — let this serve as a warning: You need to evolve.

Test 1 summary

For scoring, we will use a ten-point scale and score every tool relative to the highest-scoring competitor. For example, if the second highest score is 98% of the highest score, the tool will receive a 9.8. As a reminder, here are the results from the PCC test:

This bar chart shows the final PCC values for the first test, summarized.

And the resulting scores are as follows:

Tool         PCC Test
Moz          10
SpyFu        9.8
SEMrush      8.8
KW Finder    8.7
Ahrefs       7.7
KPT          1.1

Moz takes the top position for the first test, followed closely by SpyFu (with an 80% match rate caveat).
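For anyone who wants to check our math, the ten-point scaling for Test 1 can be reproduced with a few lines of Python. The PCC values are the ones reported above; the function name `relative_scores` is just ours for this sketch.

```python
# PCC values from the first test.
pcc = {
    "Moz": 0.412,
    "SpyFu": 0.405,
    "SEMrush": 0.364,
    "KW Finder": 0.360,
    "Ahrefs": 0.316,
    "KPT": 0.045,
}


def relative_scores(values, scale=10.0):
    """Score each tool as a share of the best value, on a `scale`-point scale."""
    top = max(values.values())
    return {tool: round(scale * value / top, 1) for tool, value in values.items()}


test1_scores = relative_scores(pcc)
for tool, score in test1_scores.items():
    print(f"{tool}: {score}")
```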

Test 2: Adjusted Pearson Correlation Coefficient

Let’s call this the “Mulligan Round.” In this round, assuming sometimes things just go haywire and a tool just flat-out misses, we will remove the three most egregious outliers from each tool’s data and recalculate its score.
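Here is one way a mulligan round like this could be sketched in Python. Big assumption: we treat the "most egregious" outliers as the points with the largest absolute residuals from a least-squares regression line. That is a common convention, but it is our choice for this illustration, and the data below is made up.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def drop_worst_outliers(xs, ys, k=3):
    """Remove the k points farthest (vertically) from the least-squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    residuals = [abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)]
    # Keep the n - k points with the smallest residuals, in their original order.
    keep = sorted(sorted(range(n), key=lambda i: residuals[i])[: n - k])
    return [xs[i] for i in keep], [ys[i] for i in keep]


# Made-up data: 20 well-behaved points plus three flat-out misses.
difficulty = list(range(1, 21)) + [5, 10, 15]
ranking = list(range(1, 21)) + [20, 0, 30]

before = pearson_r(difficulty, ranking)
trimmed_x, trimmed_y = drop_worst_outliers(difficulty, ranking, k=3)
after = pearson_r(trimmed_x, trimmed_y)
print(round(before, 3), round(after, 3))
```

Even a handful of wild misses drags the coefficient down noticeably, which is why we report both the raw and the adjusted numbers.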

Here are the adjusted results for the handicap round:

Adjusted Scores (3 Outliers removed)    PCC      Difference (+/-)
SpyFu                                   0.527    0.122
SEMrush                                 0.515    0.150
Moz                                     0.514    0.101
Ahrefs                                  0.478    0.162
KW Finder                               0.470    0.110
Keyword Planner Tool                    0.189    0.144

As noted in the original PCC test, some of these tools really took a big hit from major outliers. Specifically, Ahrefs and SEMrush benefitted the most from having their outliers removed, gaining .162 and .150 respectively, while Moz benefitted the least from the adjustments.

For those of you crying out, “But this is real life, you don’t get mulligans with SEO!”, never fear, we will make adjustments for reliability at the end.

Here are the updated scores at the end of round two:

Tool         PCC Test    Adjusted PCC    Total
SpyFu        9.8         10              19.8
Moz          10          9.7             19.7
SEMrush      8.8         9.8             18.6
KW Finder    8.7         8.9             17.6
Ahrefs       7.7         9.1             16.8
KPT          1.1         3.6             4.7

SpyFu takes the lead! Now let’s jump into the final round of statistical tests.

Test 3: Resampling

Because there has never been a study performed on keyword research tools at this scale, we wanted to ensure that we explored multiple ways of looking at the data.

Big thanks to Russ Jones, who put together an entirely different model that answers the question: “What is the likelihood that the keyword difficulty of two randomly selected keywords will correctly predict the relative position of rankings?”

He randomly selected 2 keywords from the list and their associated difficulty scores.

Let’s assume one tool says that the difficulties are 30 and 60, respectively. What is the likelihood that the article written for a score of 30 ranks higher than the article written on 60? Then, he performed the same test 1,000 times.

He also threw out examples where the two randomly selected keywords shared the same rankings, or data points were missing. Here was the outcome:

Resampling    % Guessed correctly
Moz           62.2%
Ahrefs        61.2%
SEMrush       60.3%
KW Finder     58.9%
SpyFu         54.3%
KPT           45.9%

As you can see, this test was particularly critical of each of the tools. As we are starting to see, no one tool is a silver bullet, so it is our job to see how much each tool helps make more educated decisions than guessing.

Most tools stayed pretty consistent with their levels of performance from the previous tests, except SpyFu, which struggled mightily with this test.
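The resampling procedure described above can be sketched roughly as follows. This is our own illustration rather than Russ's actual code: the function name, the fixed random seed, and the decision to also skip pairs with identical difficulty scores are all assumptions, and the sample data is made up.

```python
import random


def pairwise_accuracy(difficulty, rank, trials=1000, seed=0):
    """Share of random keyword pairs whose difficulty order matches their ranking order."""
    rng = random.Random(seed)
    correct = counted = 0
    for _ in range(trials):
        a, b = rng.sample(range(len(rank)), 2)
        if None in (difficulty[a], difficulty[b], rank[a], rank[b]):
            continue  # missing data point: throw the pair out
        if rank[a] == rank[b] or difficulty[a] == difficulty[b]:
            continue  # ties can't be called right or wrong (difficulty ties: our assumption)
        counted += 1
        # Correct when the lower-difficulty keyword also holds the better (lower) position.
        if (difficulty[a] < difficulty[b]) == (rank[a] < rank[b]):
            correct += 1
    return correct / counted if counted else float("nan")


# Made-up data: five keywords, one with no difficulty score.
sample_difficulty = [22, 35, None, 48, 61]
sample_rank = [4, 9, 12, 30, 7]
print(f"{pairwise_accuracy(sample_difficulty, sample_rank):.1%}")
```

A tool that always orders pairs correctly would score 100%, and a tool with no predictive value should hover around 50%.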

In order to score this test, we need to use 50% as the baseline (equivalent of a coin flip, or zero points), and scale each tool relative to how much better it performed over a coin flip, with the top scorer receiving ten points.

For example, Ahrefs scored 11.2 percentage points better than flipping a coin, which is 8.2% less than Moz's margin of 12.2 percentage points, giving Ahrefs a score of 9.2.
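The coin-flip scaling can be written out in a few lines. The percentages below are the resampling results for the five organic difficulty tools; `resampling_scores` is simply a name we chose for this sketch.

```python
COIN_FLIP = 50.0  # baseline: a random guess is right half the time

guessed_correctly = {
    "Moz": 62.2,
    "Ahrefs": 61.2,
    "SEMrush": 60.3,
    "KW Finder": 58.9,
    "SpyFu": 54.3,
}


def resampling_scores(pct_correct, scale=10.0):
    """Scale each tool's edge over a coin flip against the best tool's edge."""
    top_edge = max(pct_correct.values()) - COIN_FLIP
    return {
        tool: round(scale * (pct - COIN_FLIP) / top_edge, 1)
        for tool, pct in pct_correct.items()
    }


test3_scores = resampling_scores(guessed_correctly)
for tool, score in test3_scores.items():
    print(f"{tool}: {score}")
```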

The updated scores are as follows:

Tool         PCC Test    Adjusted PCC    Resampling    Total
Moz          10          9.7             10            29.7
SEMrush      8.8         9.8             8.4           27
Ahrefs       7.7         9.1             9.2           26
KW Finder    8.7         8.9             7.3           24.9
SpyFu        9.8         10              3.5           23.3
KPT          1.1         3.6             -.4           .7

So after the last statistical accuracy test, we have Moz consistently performing alone in the top tier. SEMrush, Ahrefs, and KW Finder all turn in respectable scores in the second tier, followed by the unique case of SpyFu, which performed outstandingly in the first two tests (albeit only returning results on 80% of the tested keywords), then fell flat on the final test.

Finally, we need to make some usability adjustments.

Usability Adjustment 1: Keyword Matching

A keyword research tool doesn’t do you much good if it can’t provide results for the keywords you are researching. Plain and simple, we can’t treat two tools as equals if they don’t have the same level of practical functionality.

To explain in practical terms, if a tool doesn’t have data on a particular keyword, one of two things will happen:

  1. You have to use another tool to get the data, which devalues the entire point of using the original tool.
  2. You miss an opportunity to rank for a high-value keyword.

Neither scenario is good, so we developed a penalty system. For each 10% of match rate under 100%, we deducted a single point from the final score, with a maximum deduction of 5 points. For example, if a tool matched 92% of the keywords, we would deduct .8 points from the final score.

One may argue that this penalty is actually too lenient considering the significance of the two less-than-ideal scenarios outlined above.
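Expressed as code, the keyword matching penalty is a one-liner. The sketch below uses a function name of our own choosing and returns the deduction as a positive number of points.

```python
def match_penalty(match_rate_pct: float, max_penalty: float = 5.0) -> float:
    """Points deducted: one point per 10% of match rate below 100%, capped at max_penalty."""
    shortfall = 100.0 - match_rate_pct
    return min(max_penalty, max(0.0, shortfall / 10.0))


for rate in (100, 92, 88, 80):
    print(f"{rate}% matched -> deduct {match_penalty(rate)} points")
```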

The penalties are as follows:

Tool                    Match Rate    Penalty
KW Finder               100%          0
Ahrefs                  100%          0
Moz                     100%          0
SEMrush                 92%           -.8
Keyword Planner Tool    88%           -1.2
SpyFu                   80%           -2

Please note we gave SEMrush a lot of leniency, in that technically, many of the keywords evaluated were not found in its keyword difficulty tool, but rather through manually digging through the phrase match tool. We will give them a pass, but with a stern warning!

Usability Adjustment 2: Reliability

I told you we would come back to this! Revisiting the second test in which we threw away the three strongest outliers that negatively impacted each tool’s score, we will now make adjustments.

In real life, there are no mulligans. In real life, each of those three blog posts that were thrown out represented a significant monetary and time investment. Therefore, when a tool has a major blunder, the result can be a total waste of time and resources.

For that reason, we will impose a slight penalty on those tools that benefited the most from their handicap.

We will use the level of PCC improvement to evaluate how much a tool benefitted from having its outliers removed. In doing so, we will be rewarding the tools that were the most consistently reliable. As a reminder, the amounts each tool benefitted were as follows:

Tool                    Difference (+/-)
Ahrefs                  0.162
SEMrush                 0.150
Keyword Planner Tool    0.144
SpyFu                   0.122
KW Finder               0.110
Moz                     0.101

In calculating the penalty, we scored each of the tools relative to the top performer, giving the top performer zero penalty and imposing penalties based on how much additional benefit the tools received over the most reliable tool, on a scale of 0–100%, with a maximum deduction of 5 points.

So if a tool received twice the benefit of the top performing tool, it would have had a 100% benefit, receiving the maximum deduction of 5 points. If another tool received a 20% benefit over the most reliable tool, it would get a 1-point deduction. And so on.
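The reliability penalty just described can be sketched like this. One assumption to flag: the published percentages appear to be truncated to whole numbers rather than rounded, so the sketch floors them. The function name is ours, and penalties are returned as positive deductions.

```python
import math

# PCC gain each tool received from having its three worst outliers removed.
gains = {
    "Ahrefs": 0.162,
    "SEMrush": 0.150,
    "Keyword Planner Tool": 0.144,
    "SpyFu": 0.122,
    "KW Finder": 0.110,
    "Moz": 0.101,
}


def reliability_penalties(pcc_gains, max_penalty=5.0):
    """Return {tool: (extra benefit in %, points deducted)} relative to the most reliable tool."""
    baseline = min(pcc_gains.values())  # the most reliable tool gained the least
    result = {}
    for tool, gain in pcc_gains.items():
        # Assumption: whole-number percentages, truncated, capped at 100%.
        extra_pct = min(100, math.floor((gain - baseline) / baseline * 100))
        result[tool] = (extra_pct, extra_pct * max_penalty / 100)
    return result


penalties = reliability_penalties(gains)
for tool, (pct, points) in penalties.items():
    print(f"{tool}: {pct}% extra benefit -> deduct {points} points")
```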

Tool                    % Benefit    Penalty
Ahrefs                  60%          -3
SEMrush                 48%          -2.4
Keyword Planner Tool    42%          -2.1
SpyFu                   20%          -1
KW Finder               8%           -.4
Moz                     0%           0

Results

All told, our penalties were fairly mild, with a slight shuffling in the middle tier. The final scores are as follows:

Tool         Total Score    Stars (5 max)
Moz          29.7           4.95
KW Finder    24.5           4.08
SEMrush      23.8           3.97
Ahrefs       23.0           3.83
SpyFu        20.3           3.38
KPT          -2.6           0.00
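To show how the final numbers come together, here is a short sketch that subtracts both penalties from each tool's three-test total and converts the result to stars. The star conversion (score out of 30 possible points, scaled to 5 stars, floored at zero) is our reading of the table rather than a formula stated elsewhere in this post.

```python
# Totals after the three statistical tests.
totals = {"Moz": 29.7, "SEMrush": 27.0, "Ahrefs": 26.0, "KW Finder": 24.9, "SpyFu": 23.3, "KPT": 0.7}
# Deductions from the two usability adjustments (tools not listed lose nothing).
matching_penalty = {"SEMrush": 0.8, "KPT": 1.2, "SpyFu": 2.0}
reliability_penalty = {"Ahrefs": 3.0, "SEMrush": 2.4, "KPT": 2.1, "SpyFu": 1.0, "KW Finder": 0.4}

MAX_TOTAL = 30.0  # three tests worth ten points each
MAX_STARS = 5.0


def final_score(tool: str) -> float:
    """Three-test total minus both usability penalties."""
    penalty = matching_penalty.get(tool, 0.0) + reliability_penalty.get(tool, 0.0)
    return round(totals[tool] - penalty, 1)


def stars(score: float) -> float:
    """Assumed conversion: share of the 30 possible points, on a 5-star scale, never below zero."""
    return round(max(0.0, score / MAX_TOTAL * MAX_STARS), 2)


for tool in sorted(totals, key=final_score, reverse=True):
    print(f"{tool}: {final_score(tool)} points, {stars(final_score(tool))} stars")
```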

Conclusion

Using any organic keyword difficulty tool will give you an advantage over not doing so. While none of the tools are a crystal ball, providing perfect predictability, they will certainly give you an edge. Further, if you record enough data on your own blogs’ performance, you will get a clearer picture of the keyword difficulty scores you should target in order to rank on the first page.

For example, we know the following about how we should target keywords with each tool:

Tool         Average KD (ranking ≤ 10)    Average KD (ranking ≥ 11)
Moz          33.3                         37.0
SpyFu        47.7                         50.6
SEMrush      60.3                         64.5
KW Finder    43.3                         46.5
Ahrefs       11.9                         23.6

This is pretty powerful information! It’s either first page or bust, so we now know the threshold for each tool that we should set when selecting keywords.
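If you log your own posts, a table like the one above is easy to rebuild. The sketch below assumes you keep one (keyword difficulty, ranking position) pair per blog post for a given tool. The record layout, the function name, and the sample numbers are all hypothetical.

```python
from statistics import mean


def average_kd_by_page(records, page_one_cutoff=10):
    """Average keyword difficulty for posts on page one versus everything below it.

    `records` is an iterable of (keyword_difficulty, ranking_position) pairs.
    Returns (page_one_average, beyond_average); a side with no posts is None.
    """
    page_one = [kd for kd, position in records if position <= page_one_cutoff]
    beyond = [kd for kd, position in records if position > page_one_cutoff]
    return (
        mean(page_one) if page_one else None,
        mean(beyond) if beyond else None,
    )


# Made-up log: (difficulty score from one tool, current Google position).
log = [(25, 3), (35, 8), (30, 10), (40, 15), (50, 42)]
on_page_one, off_page_one = average_kd_by_page(log)
print(on_page_one, off_page_one)
```

The gap between the two averages gives you a rough ceiling for the difficulty scores worth targeting with that tool.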

Stay tuned, because we made a lot more correlations between word count, days live, total keywords ranking, and all kinds of other juicy stuff. Tune in again in early September for updates!

We hope you found this test useful, and feel free to reach out with any questions on our math!

Disclaimer: These results are estimates based on 50 ranking keywords from 50 blog posts and keyword research data pulled from a single moment in time. Search is a shifting landscape, and these results have certainly changed since the data was pulled. In other words, this is about as accurate as we can get from analyzing a moving target.


Ranking the 6 Most Accurate Keyword Research Tools

Posted by Jeff_Baker

In January of 2018 Brafton began a massive organic keyword targeting campaign, amounting to over 90,000 words of blog content being published.

Did it work?

Well, yeah. We doubled the number of total keywords we rank for in less than six months. By using our advanced keyword research and topic writing process published earlier this year we also increased our organic traffic by 45% and the number of keywords ranking in the top ten results by 130%.

But we got a whole lot more than just traffic.

From planning to execution and performance tracking, we meticulously logged every aspect of the project. I’m talking blog word count, MarketMuse performance scores, on-page SEO scores, days indexed on Google. You name it, we recorded it.

As a byproduct of this nerdery, we were able to draw juicy correlations between our target keyword rankings and variables that can affect and predict those rankings. But specifically for this piece…

How well keyword research tools can predict where you will rank.

A little background

We created a list of keywords we wanted to target in blogs based on optimal combinations of search volume, organic keyword difficulty scores, SERP crowding, and searcher intent.

We then wrote a blog post targeting each individual keyword. We intended for each new piece of blog content to rank for the target keyword on its own.

With our keyword list in hand, my colleague and I manually created content briefs explaining how we would like each blog post written to maximize the likelihood of ranking for the target keyword. Here’s an example of a typical brief we would give to a writer:

This image links to an example of a content brief Brafton delivers to writers.

Between mid-January and late May, we ended up writing 55 blog posts each targeting 55 unique keywords. 50 of those blog posts ended up ranking in the top 100 of Google results.

We then paused and took a snapshot of each URL’s Google ranking position for its target keyword and its corresponding organic difficulty scores from Moz, SEMrush, Ahrefs, SpyFu, and KW Finder. We also took the PPC competition scores from the Keyword Planner Tool.

Our intention was to draw statistical correlations between between our keyword rankings and each tool’s organic difficulty score. With this data, we were able to report on how accurately each tool predicted where we would rank.

This study is uniquely scientific, in that each blog had one specific keyword target. We optimized the blog content specifically for that keyword. Therefore every post was created in a similar fashion.

Do keyword research tools actually work?

We use them every day, on faith. But has anyone ever actually asked, or better yet, measured how well keyword research tools report on the organic difficulty of a given keyword?

Today, we are doing just that. So let’s cut through the chit-chat and get to the results…

This image ranks each of the 6 keyword research tools, in order, Moz leads with 4.95 stars out of 5, followed by KW Finder, SEMrush, AHREFs, SpyFu, and lastly Keyword Planner Tool.

While Moz wins top-performing keyword research tool, note that any keyword research tool with organic difficulty functionality will give you an advantage over flipping a coin (or using Google Keyword Planner Tool).

As you will see in the following paragraphs, we have run each tool through a battery of statistical tests to ensure that we painted a fair and accurate representation of its performance. I’ll even provide the raw data for you to inspect for yourself.

Let’s dig in!

The Pearson Correlation Coefficient

Yes, statistics! For those of you currently feeling panicked and lobbing obscenities at your screen, don’t worry — we’re going to walk through this together.

In order to understand the relationship between two variables, our first step is to create a scatter plot chart.

Below is the scatter plot for our 50 keyword rankings compared to their corresponding Moz organic difficulty scores.

This image shows a scatter plot for Moz's keyword difficulty scores versus our keyword rankings. In general, the data clusters fairly tight around the regression line.

We start with a visual inspection of the data to determine if there is a linear relationship between the two variables. Ideally for each tool, you would expect to see the X variable (keyword ranking) increase proportionately with the Y variable (organic difficulty). Put simply, if the tool is working, the higher the keyword difficulty, the less likely you will rank in a top position, and vice-versa.

This chart is all fine and dandy, however, it’s not very scientific. This is where the Pearson Correlation Coefficient (PCC) comes into play.

The PCC measures the strength of a linear relationship between two variables. The output of the PCC is a score ranging from +1 to -1. A score greater than zero indicates a positive relationship; as one variable increases, the other increases as well. A score less than zero indicates a negative relationship; as one variable increases, the other decreases. Both scenarios would indicate a level of causal relationship between the two variables. The stronger the relationship between the two veriables, the closer to +1 or -1 the PCC will be. Scores near zero indicate a weak or no relatioship.

Phew. Still with me?

So each of these scatter plots will have a corresponding PCC score that will tell us how well each tool predicted where we would rank, based on its keyword difficulty score.

We will use the following table from statisticshowto.com to interpret the PCC score for each tool:

Coefficient Correlation R Score

Key

.70 or higher

Very strong positive relationship

.40 to +.69

Strong positive relationship

.30 to +.39

Moderate positive relationship

.20 to +.29

Weak positive relationship

.01 to +.19

No or negligible relationship

0

No relationship [zero correlation]

-.01 to -.19

No or negligible relationship

-.20 to -.29

Weak negative relationship

-.30 to -.39

Moderate negative relationship

-.40 to -.69

Strong negative relationship

-.70 or higher

Very strong negative relationship

In order to visually understand what some of these relationships would look like on a scatter plot, check out these sample charts from Laerd Statistics.

These scatter plots show three types of correlations: positive, negative, and no correlation. Positive correlations have data plots that move up and to the right. Negative correlations move down and to the right. No correlation has data that follows no linear pattern

And here are some examples of charts with their correlating PCC scores (r):

These scatter plots show what different PCC values look like visually. The tighter the grouping of data around the regression line, the higher the PCC value.

The closer the numbers cluster towards the regression line in either a positive or negative slope, the stronger the relationship.

That was the tough part – you still with me? Great, now let’s look at each tool’s results.

Test 1: The Pearson Correlation Coefficient

Now that we’ve all had our statistics refresher course, we will take a look at the results, in order of performance. We will evaluate each tool’s PCC score, the statistical significance of the data (P-val), the strength of the relationship, and the percentage of keywords the tool was able to find and report keyword difficulty values for.

In order of performance:

#1: Moz

This image shows a scatter plot for Moz's keyword difficulty scores versus our keyword rankings. In general, the data clusters fairly tight around the regression line.

Revisiting Moz’s scatter plot, we observe a tight grouping of results relative to the regression line with few moderate outliers.

Moz Organic Difficulty Predictability

PCC

0.412

P-val

.003 (P<0.05)

Relationship

Strong

% Keywords Matched

100.00%

Moz came in first with the highest PCC of .412. As an added bonus, Moz grabs data on keyword difficulty in real time, rather than from a fixed database. This means that you can get any keyword difficulty score for any keyword.

In other words, Moz was able to generate keyword difficulty scores for 100% of the 50 keywords studied.

#2: SpyFu

This image shows a scatter plot for SpyFu's keyword difficulty scores versus our keyword rankings. The plot is similar looking to Moz's, with a few larger outliers.

Visually, SpyFu shows a fairly tight clustering amongst low difficulty keywords, and a couple moderate outliers amongst the higher difficulty keywords.

SpyFu Organic Difficulty Predictability

PCC

0.405

P-val

.01 (P<0.05)

Relationship

Strong

% Keywords Matched

80.00%

SpyFu came in right under Moz with 1.7% weaker PCC (.405). However, the tool ran into the largest issue with keyword matching, with only 40 of 50 keywords producing keyword difficulty scores.

#3: SEMrush

This image shows a scatter plot for SEMrush's keyword difficulty scores versus our keyword rankings. The data has a significant amount of outliers relative to the regression line.

SEMrush would certainly benefit from a couple mulligans (a second chance to perform an action). The Correlation Coefficient is very sensitive to outliers, which pushed SEMrush’s score down to third (.364).

SEMrush Organic Difficulty Predictability

PCC

0.364

P-val

.01 (P<0.05)

Relationship

Moderate

% Keywords Matched

92.00%

Further complicating the research process, only 46 of 50 keywords had keyword difficulty scores associated with them, and many of those had to be found through SEMrush’s “phrase match” feature individually, rather than through the difficulty tool.

The process was more laborious to dig around for data.

#4: KW Finder

This image shows a scatter plot for KW Finder's keyword difficulty scores versus our keyword rankings. The data also has a significant amount of outliers relative to the regression line.

KW Finder definitely could have benefitted from more than a few mulligans with numerous strong outliers, coming in right behind SEMrush with a score of .360.

KW Finder Organic Difficulty Predictability

PCC

0.360

P-val

.01 (P<0.05)

Relationship

Moderate

% Keywords Matched

100.00%

Fortunately, the KW Finder tool had a 100% match rate without any trouble digging around for the data.

#5: Ahrefs

This image shows a scatter plot for AHREF's keyword difficulty scores versus our keyword rankings. The data shows tight clustering amongst low difficulty score keywords, and a wide distribution amongst higher difficulty scores.

Ahrefs comes in fifth by a large margin at .316, barely passing the “weak relationship” threshold.

Ahrefs Organic Difficulty Predictability

PCC

0.316

P-val

.03 (P<0.05)

Relationship

Moderate

% Keywords Matched

100%

On a positive note, the tool seems to be very reliable with low difficulty scores (notice the tight clustering for low difficulty scores), and matched all 50 keywords.

#6: Google Keyword Planner Tool

This image shows a scatter plot for Google Keyword Planner Tool's keyword difficulty scores versus our keyword rankings. The data shows randomly distributed plots with no linear relationship.

Before you ask, yes, SEO companies still use the paid competition figures from Google’s Keyword Planner Tool (and other tools) to assess organic ranking potential. As you can see from the scatter plot, there is in fact no linear relationship between the two variables.

Google Keyword Planner Tool Organic Difficulty Predictability

PCC

0.045

P-val

Statistically insignificant/no linear relationship

Relationship

Negligible/None

% Keywords Matched

88.00%

SEO agencies still using KPT for organic research (you know who you are!) — let this serve as a warning: You need to evolve.

Test 1 summary

For scoring, we will use a ten-point scale and score every tool relative to the highest-scoring competitor. For example, if the second highest score is 98% of the highest score, the tool will receive a 9.8. As a reminder, here are the results from the PCC test:

This bar chart shows the final PCC values for the first test, summarized.

And the resulting scores are as follows:

Tool

PCC Test

Moz

10

SpyFu

9.8

SEMrush

8.8

KW Finder

8.7

Ahrefs

7.7

KPT

1.1

Moz takes the top position for the first test, followed closely by SpyFu (with an 80% match rate caveat).

Test 2: Adjusted Pearson Correlation Coefficient

Let’s call this the “Mulligan Round.” In this round, assuming sometimes things just go haywire and a tool just flat-out misses, we will remove the three most egregious outliers to each tool’s score.

Here are the adjusted results for the handicap round:

Adjusted Scores (3 Outliers removed)

PCC

Difference (+/-)

SpyFu

0.527

0.122

SEMrush

0.515

0.150

Moz

0.514

0.101

Ahrefs

0.478

0.162

KWFinder

0.470

0.110

Keyword Planner Tool

0.189

0.144

As noted in the original PCC test, some of these tools really took a big hit with major outliers. Specifically, Ahrefs and SEMrush benefitted the most from their outliers being removed, gaining .162 and .150 respectively to their scores, while Moz benefited the least from the adjustments.

For those of you crying out, “But this is real life, you don’t get mulligans with SEO!”, never fear, we will make adjustments for reliability at the end.

Here are the updated scores at the end of round two:

Tool

PCC Test

Adjusted PCC

Total

SpyFu

9.8

10

19.8

Moz

10

9.7

19.7

SEMrush

8.8

9.8

18.6

KW Finder

8.7

8.9

17.6

AHREFs

7.7

9.1

16.8

KPT

1.1

3.6

4.7

SpyFu takes the lead! Now let’s jump into the final round of statistical tests.

Test 3: Resampling

Being that there has never been a study performed on keyword research tools at this scale, we wanted to ensure that we explored multiple ways of looking at the data.

Big thanks to Russ Jones, who put together an entirely different model that answers the question: “What is the likelihood that the keyword difficulty of two randomly selected keywords will correctly predict the relative position of rankings?”

He randomly selected 2 keywords from the list and their associated difficulty scores.

Let’s assume one tool says that the difficulties are 30 and 60, respectively. What is the likelihood that the article written for a score of 30 ranks higher than the article written on 60? Then, he performed the same test 1,000 times.

He also threw out examples where the two randomly selected keywords shared the same rankings, or data points were missing. Here was the outcome:

Resampling

% Guessed correctly

Moz

62.2%

Ahrefs

61.2%

SEMrush

60.3%

Keyword Finder

58.9%

SpyFu

54.3%

KPT

45.9%

As you can see, this tool was particularly critical on each of the tools. As we are starting to see, no one tool is a silver bullet, so it is our job to see how much each tool helps make more educated decisions than guessing.

Most tools stayed pretty consistent with their levels of performance from the previous tests, except SpyFu, which struggled mightily with this test.

In order to score this test, we need to use 50% as the baseline (equivalent of a coin flip, or zero points), and scale each tool relative to how much better it performed over a coin flip, with the top scorer receiving ten points.

For example, Ahrefs scored 11.2% better than flipping a coin, which is 8.2% less than Moz which scored 12.2% better than flipping a coin, giving AHREFs a score of 9.2.

The updated scores are as follows:

Tool

PCC Test

Adjusted PCC

Resampling

Total

Moz

10

9.7

10

29.7

SEMrush

8.8

9.8

8.4

27

Ahrefs

7.7

9.1

9.2

26

KW Finder

8.7

8.9

7.3

24.9

SpyFu

9.8

10

3.5

23.3

KPT

1.1

3.6

-.4

.7

So after the last statistical accuracy test, we have Moz consistently performing alone in the top tier. SEMrush, Ahrefs, and KW Finder all turn in respectable scores in the second tier, followed by the unique case of SpyFu, which performed outstanding in the first two tests (albeit, only returning results on 80% of the tested keywords), then falling flat on the final test.

Finally, we need to make some usability adjustments.

Usability Adjustment 1: Keyword Matching

A keyword research tool doesn’t do you much good if it can’t provide results for the keywords you are researching. Plain and simple, we can’t treat two tools as equals if they don’t have the same level of practical functionality.

To explain in practical terms, if a tool doesn’t have data on a particular keyword, one of two things will happen:

  1. You have to use another tool to get the data, which devalues the entire point of using the original tool.
  2. You miss an opportunity to rank for a high-value keyword.

Neither scenario is good, therefore we developed a penalty system. For each 10% match rate under 100%, we deducted a single point from the final score, with a maximum deduction of 5 points. For example, if a tool matched 92% of the keywords, we would deduct .8 points from the final score.

One may argue that this penalty is actually too lenient considering the significance of the two unideal scenarios outlined above.

The penalties are as follows:

Tool

Match Rate

Penalty

KW Finder

100%

0

Ahrefs

100%

0

Moz

100%

0

SEMrush

92%

-.8

Keyword Planner Tool

88%

-1.2

SpyFu

80%

-2

Please note we gave SEMrush a lot of leniency, in that technically, many of the keywords evaluated were not found in its keyword difficulty tool, but rather through manually digging through the phrase match tool. We will give them a pass, but with a stern warning!

Usability Adjustment 2: Reliability

I told you we would come back to this! Revisiting the second test in which we threw away the three strongest outliers that negatively impacted each tool’s score, we will now make adjustments.

In real life, there are no mulligans. In real life, each of those three blog posts that were thrown out represented a significant monetary and time investment. Therefore, when a tool has a major blunder, the result can be a total waste of time and resources.

For that reason, we will impose a slight penalty on those tools that benefited the most from their handicap.

We will use the level of PCC improvement to evaluate how much a tool benefitted from removing their outliers. In doing so, we will be rewarding the tools that were the most consistently reliable. As a reminder, the amounts each tool benefitted were as follows:

Tool

Difference (+/-)

Ahrefs

0.162

SEMrush

0.150

Keyword Planner Tool

0.144

SpyFu

0.122

KWFinder

0.110

Moz

0.101

In calculating the penalty, we scored each tool relative to the top performer, meaning the tool that benefited the least from having its outliers removed. The top performer received zero penalty. Every other tool was penalized based on how much additional benefit it received over the most reliable tool, on a scale of 0–100%, with a maximum deduction of 5 points.

So if a tool received twice the benefit of the top-performing tool, it would have had a 100% benefit, receiving the maximum deduction of 5 points. If another tool received a 20% benefit over the most reliable tool, it would get a 1-point deduction. And so on.

| Tool | % Benefit | Penalty |
| --- | --- | --- |
| Ahrefs | 60% | -3 |
| SEMrush | 48% | -2.4 |
| Keyword Planner Tool | 42% | -2.1 |
| SpyFu | 20% | -1 |
| KWFinder | 8% | -0.4 |
| Moz | 0% | 0 |
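The reliability penalty can be sketched in a few lines of Python. The function and variable names are ours for illustration. One assumption to flag loudly: the sketch rounds each percentage down to a whole number, which we inferred because that is what reproduces the values in the table. Treat this as an approximation of the spreadsheet math, not the exact formula.

```python
import math

# PCC improvement each tool gained from dropping its three worst outliers
# (values from the "Difference (+/-)" table above).
pcc_improvement = {
    "Ahrefs": 0.162,
    "SEMrush": 0.150,
    "Keyword Planner Tool": 0.144,
    "SpyFu": 0.122,
    "KWFinder": 0.110,
    "Moz": 0.101,
}


def reliability_penalties(improvements, max_deduction=5.0):
    """Return {tool: (benefit_pct, penalty)} relative to the most reliable tool.

    The most reliable tool is the one with the smallest improvement.
    ASSUMPTION: percentages are floored to whole numbers and capped at 100.
    """
    baseline = min(improvements.values())
    results = {}
    for tool, gain in improvements.items():
        benefit_pct = math.floor((gain / baseline - 1) * 100)
        benefit_pct = min(benefit_pct, 100)
        results[tool] = (benefit_pct, benefit_pct * max_deduction / 100)
    return results


reliability = reliability_penalties(pcc_improvement)

for tool, (pct, penalty) in reliability.items():
    print(f"{tool}: {pct}% benefit, -{penalty:g} points")
```

A tool with exactly twice the baseline improvement lands on 100% and takes the full 5-point deduction, which matches the rule described above.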

Results

All told, our penalties were fairly mild, with a slight shuffling in the middle tier. The final scores are as follows:

| Tool | Total Score | Stars (5 max) |
| --- | --- | --- |
| Moz | 29.7 | 4.95 |
| KWFinder | 24.5 | 4.08 |
| SEMrush | 23.8 | 3.97 |
| Ahrefs | 23.0 | 3.83 |
| SpyFu | 20.3 | 3.38 |
| Keyword Planner Tool | -2.6 | 0.00 |
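If you want to sanity-check the star ratings, the sketch below converts a total score to stars. Big caveat: the 30-point maximum is an assumption on our part, inferred because dividing each total score by 6 reproduces every star value in the table. Negative totals are floored at zero stars, as in the Keyword Planner Tool row. The names are illustrative.

```python
def stars(total_score, max_score=30.0, max_stars=5.0):
    """Convert a total score to a star rating, rounded to two decimals.

    ASSUMPTION: max_score=30 is inferred from the published table, not
    stated in the scoring rules. Scores below zero get zero stars.
    """
    return round(max(0.0, total_score / max_score * max_stars), 2)


# Total scores from the results table above.
total_scores = {
    "Moz": 29.7,
    "KWFinder": 24.5,
    "SEMrush": 23.8,
    "Ahrefs": 23.0,
    "SpyFu": 20.3,
    "Keyword Planner Tool": -2.6,
}

star_ratings = {tool: stars(score) for tool, score in total_scores.items()}

for tool, rating in star_ratings.items():
    print(f"{tool}: {rating:.2f} stars")
```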

Conclusion

Using any organic keyword difficulty tool will give you an advantage over not doing so. While none of the tools is a crystal ball that offers perfect predictability, they will certainly give you an edge. Further, if you record enough data on your own blogs’ performance, you will get a clearer picture of the keyword difficulty scores you should target in order to rank on the first page.

For example, we know the following about how we should target keywords with each tool:

| Tool | Average KD, ranking ≤ 10 | Average KD, ranking ≥ 11 |
| --- | --- | --- |
| Moz | 33.3 | 37.0 |
| SpyFu | 47.7 | 50.6 |
| SEMrush | 60.3 | 64.5 |
| KWFinder | 43.3 | 46.5 |
| Ahrefs | 11.9 | 23.6 |

This is pretty powerful information! It’s either first page or bust, so we now know the threshold for each tool that we should set when selecting keywords.
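Here is one way you might put those numbers to work, sketched in Python. It treats each tool’s average KD for first-page posts as a ceiling when vetting a candidate keyword. Using the average as a hard cutoff is our simplifying assumption, and the function name and example scores are hypothetical. Your own thresholds should come from your own data.

```python
# Average KD of our posts that ranked on page one (positions 1-10),
# taken from the table above.
FIRST_PAGE_AVG_KD = {
    "Moz": 33.3,
    "SpyFu": 47.7,
    "SEMrush": 60.3,
    "KWFinder": 43.3,
    "Ahrefs": 11.9,
}


def within_first_page_range(tool, kd_score):
    """True if `kd_score` is at or below the tool's first-page average.

    ASSUMPTION: the average is used as a simple ceiling. A stricter or
    looser cutoff may suit a different site.
    """
    return kd_score <= FIRST_PAGE_AVG_KD[tool]


# Hypothetical KD scores for a candidate keyword, one per tool.
candidate = {"Moz": 30, "Ahrefs": 20}

for tool, kd in candidate.items():
    verdict = "go for it" if within_first_page_range(tool, kd) else "too competitive"
    print(f"{tool} KD {kd}: {verdict}")
```

Note how the same keyword can pass on one tool and fail on another, because every tool scores difficulty on its own scale.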

Stay tuned, because we made a lot more correlations between word count, days live, total keywords ranking, and all kinds of other juicy stuff. Tune in again in early September for updates!

We hope you found this test useful, and feel free to reach out with any questions on our math!

Disclaimer: These results are estimates based on 50 ranking keywords from 50 blog posts and keyword research data pulled from a single moment in time. Search is a shifting landscape, and these results have certainly changed since the data was pulled. In other words, this is about as accurate as we can get from analyzing a moving target.
