Ahrefs is a backlink checking, SEO analysis and competitor monitoring company. It boasts one of the biggest backlink indexes on the market. If you run a bigger site, you can be sure that its bots crawl you, at least to improve that backlink index.
Ahrefs plans start at 99 USD (there were limited free ones as well). That would be quite a good price, except that these plans might leak data to your main competitors. You can find out which topics and posts your competitors consider important, easily and for free, without even logging in to an Ahrefs account. That is a big leak: it helps you determine what the competition is working on and where you can improve.
See, Ahrefs tries to render the pages on your site that someone else is monitoring. I first noticed it in my log files while going through them for a different reason. I wanted to know why they re-fetch style files even though the Expires header is set for a much longer time frame.
The letter is kind of a lie, as the posts for which CSS is fetched are neither the ones with the most traffic nor the most important ones, nor is this a backlink index update. Also, I have noticed that Ahrefs fetches CSS for the same URI again and again (a link is checked twice per day on average). That is huge: no search engine re-checks files that have not been updated in months so often. My guess is that it is competitor analysis for someone who uses Ahrefs and has entered either competitors or important keywords.
- The rendered links could have been considered important for me at some point in the past, e.g. Delta hijacker;
- However, some of them are no longer important and get few queries per month according to the Google keyword tool;
- For some of them I have never ranked well. That is really important, as it shows where I can improve in the future.
The fetching is done to crawl JS links and evaluate page scores. Ahrefs does not do this for the whole internet, just for important pages. And that lets us filter those pages easily:
Step 1. Download your access log file
Run grep:
grep Ahrefs access_log | grep style.css > ahrefsleak.txt
You can replace style.css with the name of any JS file or stylesheet your site uses. You will get lines like this:
18.104.22.168 - - [18/Mar/2018:19:08:16 +0000] "GET /wp-content/[zzzzzz]/style.css HTTP/2.0" 200 24327 "https://www.[site.com]//[uri]" "Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)"
where https://www.[site.com]//[uri] is the URI someone is interested in. I got around 3000 lines from a week's worth of data. Now let's filter them further. Note that the IP used for fetching styles is different from the main bot IP.
Update: Since April 10, 2018, Ahrefs no longer sends referer information. However, it still leaks paid account data, so you might need to craft a slightly fancier grep command:
grep Ahrefs logs/access_log | grep -E -B1 "(css|js)" | grep -E -v "(css|js|--)"
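The grep chain above keeps the Ahrefs request logged right before each CSS or JS fetch, on the assumption that this preceding request is the page being rendered. The same heuristic can be sketched in awk so that only the requested paths are printed, each once. This is a rough sketch, not a definitive method: the function name `pages_before_assets` is made up for illustration, and it assumes a combined-format log with straight double quotes around the request.

```shell
# Hypothetical helper: print the path of every Ahrefs request that is
# immediately followed (among Ahrefs requests) by a fetch of a .css or .js file.
# Assumes combined log format: the request is the first double-quoted field.
pages_before_assets() {
    grep Ahrefs "$1" | awk '
        {
            split($0, q, "\"")      # q[2] is the request, e.g. GET /page HTTP/2.0
            split(q[2], req, " ")   # req[2] is the requested path
            is_asset = (req[2] ~ /\.(css|js)(\?|$)/)
            # An asset fetch right after a page fetch marks that page as rendered.
            if (is_asset && prev != "") print prev
            # Remember only page requests; reset after an asset fetch.
            prev = is_asset ? "" : req[2]
        }' | sort -u
}

# Run it only if the log from the command above is present.
if [ -f logs/access_log ]; then
    pages_before_assets logs/access_log
fi
```

Unlike the plain grep, this matches `.css` and `.js` only at the end of the requested path (or before a query string), so a URL that merely contains "js" or "css" somewhere is not treated as an asset.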
Step 2. Import the file into a spreadsheet
Import this data into Google Sheets, Excel or LibreOffice Calc, using space as the separator. Delete all columns except the one containing https://www.[site.com]//[uri]. Sort the data by this column.
Step 3. Delete duplicate data
For Google Sheets, this is a good tutorial to follow: https://developers.google.com/apps-script/articles/removing_duplicates. I got around 200 URIs and basic keywords that someone is interested in and was careless enough to submit to Ahrefs.
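If you prefer to stay on the command line, Steps 2 and 3 can be approximated without a spreadsheet. This is a minimal sketch, assuming the log lines from Step 1 are in combined log format with straight double quotes, so that the referer is the fourth field when a line is split on `"`. The function name `leaked_uris` is made up for illustration, and the sketch only applies to log lines that still carry a referer.

```shell
# Hypothetical helper: list every referer URI once, with the number of times
# it was seen, most frequent first.
# Assumes combined log format: splitting on '"' gives the request in field 2,
# the referer in field 4 and the user agent in field 6.
leaked_uris() {
    awk -F'"' '$4 != "" && $4 != "-" { print $4 }' "$1" \
        | sort | uniq -c | sort -rn
}

# Run it only if the file produced in Step 1 is present.
if [ -f ahrefsleak.txt ]; then
    leaked_uris ahrefsleak.txt
fi
```

The first column of the output is the number of style fetches logged for that URI, so the pages that are re-checked most often come first.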
Now you can check the rankings of these posts manually and improve your content, and you will benefit from Ahrefs without paying for it.
Ahrefs could solve this problem in several ways:
- Stop showing referrer data when fetching scripts and styles. However, this is easy to work around too. (That is what happened on April 10.)
- Start adhering to silly internet standards like the Expires header, and render multiple pages in batches.
- Fetch styles for all the pages.
At the moment, if you use a paid plan, I suggest you stop entering new important keywords into its monitor.