Take control over AdSense Matched content recommendations

AdSense hates documentation, or has at least chosen not to document how you can exclude articles from appearing in the AdSense Matched content sections. Luckily, there’s in fact a system for it and I’ll even tell you how to use it.

Matched content is an article recommendation system from AdSense that tries to suggest relevant and interesting articles to your website’s visitors. It’s currently only available to publishers with many pages and high levels of traffic. You’ve probably seen many of these content recommendation systems all around the web. The one thing they all have in common is that they most often make bad recommendations.

Not all AdSense publishers are eligible to use Matched content. Many publishers are interested in the system, however, as it has all of Google’s knowledge and personalization experience behind its recommendations.

Sometimes, however, it chooses to feature something you’d rather not put in front of too many people. These may be articles that aren’t all that flattering to your business, articles that are outdated or otherwise no longer relevant, or maybe they’re just boring legal pages. Whatever the reason, it’s sometimes desirable to exclude certain pages from becoming recommended reading for your visitors.

You can exclude a page using the standard /robots.txt file, but this will also exclude your page from search engines, which probably isn’t what you want.

If you’re running Matched content units from AdSense, a.k.a. “Recommended by Google”, you may have noticed some new bot activity from Google. Roughly once per day, a robot identifying itself as “Mediapartners-Google” originating from an IP address belonging to Google requests the following three files:

/google_matched_content_rules.xml
/google_matched_content_blacklist.txt
/google_matched_content_whitelist.txt

In keeping with the finest AdSense traditions, the formats of these files aren’t documented anywhere.

These files are placed at the top level (root) of your web server. Their purpose is clear from their names, and their format can be determined with some trial and error. Do keep in mind that the following information isn’t documented by AdSense, so AdSense may change the format or stop supporting it at any time.

The two text files were easy enough to work out. In the blocklist (/google_matched_content_blacklist.txt), list one URL per line that you wish to exclude. In the allow-list (/google_matched_content_whitelist.txt), list exceptions to broader exclusion rules. You can make a broad matching rule using wildcard expressions with the * character. For example, https://www.example.com/documents/ doesn’t exclude other pages underneath /documents/ unless you add a wildcard at the end like this: https://www.example.com/documents/*.

If you’ve excluded a directory using wildcards, you can still allow certain pages to appear in content recommendations by including them in the allow-list file. The longest matching URL (in number of characters) in either your blocklist or allow-list seems to take precedence. There’s no point in adding URLs that are already allowed to the allow-list, as allowing is the default policy anyway.
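
The precedence rule described above can be sketched in a few lines of Python. This is only a model of the behavior I observed, not AdSense’s actual implementation. The function names are made up for illustration, and treating a tie in length as “blocked” is purely my assumption.

```python
import re


def pattern_to_regex(pattern):
    """Turn a rule like 'https://example.com/docs/*' into a compiled regex.

    Everything is matched literally except '*', which matches any run of
    characters (assumed wildcard semantics).
    """
    return re.compile(re.escape(pattern).replace(r"\*", ".*"))


def longest_match(url, patterns):
    """Length of the longest pattern that matches url, or -1 if none match."""
    lengths = [len(p) for p in patterns if pattern_to_regex(p).fullmatch(url)]
    return max(lengths, default=-1)


def is_recommendable(url, blocklist, allowlist):
    """Apply the observed 'longest rule wins' precedence.

    Pages are allowed by default. A page is excluded when a blocklist rule
    matches and no longer allow-list rule matches (ties count as blocked,
    which is an assumption).
    """
    blocked = longest_match(url, blocklist)
    allowed = longest_match(url, allowlist)
    return blocked < 0 or allowed > blocked


blocklist = ["https://www.example.com/documents/*"]
allowlist = ["https://www.example.com/documents/annual-report"]

for page in (
    "https://www.example.com/documents/annual-report",
    "https://www.example.com/documents/old-memo",
    "https://www.example.com/news/today",
):
    print(page, is_recommendable(page, blocklist, allowlist))
```

Only the old-memo page is excluded here: the annual-report page is rescued by the longer allow-list entry, and the news page never matched a blocklist rule in the first place.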

The /google_matched_content_rules.xml file was a bit trickier to work out. In the end I set up a crawler to retrieve this file from the Alexa Top 1 Million websites, hoping that some publishers would have this file available. Six out of a million websites had this file, but that was enough for me to work out the format and options.

Here is a simple sample rules file:

<?xml version="1.0" encoding="utf-8"?>
<rules>
  <blacklist>
    <uri>https://example.com/about/*</uri>
    <uri>https://example.com/privacy-policy</uri>
  </blacklist>
  <whitelist>
    <uri>https://example.com/about/mobile-apps</uri>
  </whitelist>
</rules>
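
If you maintain your exclusions somewhere else, such as in a CMS, you may prefer to generate this file rather than edit it by hand. Here’s a minimal sketch using Python’s standard library. The build_rules function is hypothetical, and the element names are simply the ones from the sample above. It needs Python 3.9 or later for ET.indent.

```python
import xml.etree.ElementTree as ET


def build_rules(blocked, allowed):
    """Build a rules document with <blacklist> and <whitelist> URI lists."""
    rules = ET.Element("rules")

    blacklist = ET.SubElement(rules, "blacklist")
    for url in blocked:
        # ElementTree escapes characters such as '&' in URLs for us.
        ET.SubElement(blacklist, "uri").text = url

    whitelist = ET.SubElement(rules, "whitelist")
    for url in allowed:
        ET.SubElement(whitelist, "uri").text = url

    ET.indent(rules, space="  ")  # pretty-print; Python 3.9+
    data = ET.tostring(rules, encoding="utf-8", xml_declaration=True)
    return data.decode("utf-8")


xml_text = build_rules(
    blocked=[
        "https://example.com/about/*",
        "https://example.com/privacy-policy",
    ],
    allowed=["https://example.com/about/mobile-apps"],
)
print(xml_text)
```

Save the output as /google_matched_content_rules.xml at the root of your web server.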

So far, I’ve only shown the same capabilities in the rules file as you’ve got with the plain-text blocklist and allow-list files. However, I’ve also identified some more powerful features, though I’m not entirely sure how they’re supposed to work. The sample size was rather small and I’ve had more trouble verifying these, although I do think my guesses as to their application are pretty close. Some documentation from AdSense would be of great help here! I’ll provide a more complex example, and then move on to speculate about how it all hangs together afterward.

<?xml version="1.0" encoding="utf-8"?>
<!-- SPDX-License-Identifier: CC0-1.0 -->
<rules>
  <section>
    <source>
      <uri>https://example.com/reviews/movies/*</uri>
    </source>
    <target>
      <uri>https://example.com/celebrities/*</uri>
      <uri>https://example.com/culture/*</uri>
      <uri>https://example.com/entertainment/*</uri>
    </target>
  </section>
  <section>
    <source>
      <uri>https://example.com/technology/*</uri>
    </source>
    <target>
      <uri>https://example.com/gadgets/*</uri>
      <uri>https://example.com/science/*</uri>
      <uri>https://example.com/technology/*</uri>
    </target>
  </section>
  <whitelist>
    <rss>https://example.com/feed.rss2</rss>
  </whitelist>
</rules>

As I’ve already mentioned, I can only speculate on the exact purposes of the section, source, target, and rss elements. However, my educated guesses as to their purposes are as follows:

A section’s source identifies a URL pattern for pages that your readers visit on your website. While a reader is on a page matching a section’s source, Matched content should preferably recommend content that matches the URLs listed in that section’s target. This seems correct given the format, but I’ve been unable to confirm that this is how it works. Recommendations still seem to be at AdSense’s discretion.

This can be a powerful feature for publishers who publish frequently and on diverse topics.
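
To make my guess about sections concrete, here’s a small Python sketch that reads a rules document and lists the target patterns that would apply to a given page. It models my interpretation only. How AdSense actually evaluates these elements is unconfirmed, and the function names and wildcard handling are my own assumptions.

```python
import re
import xml.etree.ElementTree as ET

RULES_XML = """
<rules>
  <section>
    <source>
      <uri>https://example.com/reviews/movies/*</uri>
    </source>
    <target>
      <uri>https://example.com/celebrities/*</uri>
      <uri>https://example.com/culture/*</uri>
      <uri>https://example.com/entertainment/*</uri>
    </target>
  </section>
  <section>
    <source>
      <uri>https://example.com/technology/*</uri>
    </source>
    <target>
      <uri>https://example.com/gadgets/*</uri>
      <uri>https://example.com/science/*</uri>
      <uri>https://example.com/technology/*</uri>
    </target>
  </section>
</rules>
"""


def wildcard_match(pattern, url):
    """True if url matches pattern, where '*' matches any run of characters."""
    regex = re.escape(pattern).replace(r"\*", ".*")
    return re.fullmatch(regex, url) is not None


def preferred_targets(rules_xml, page_url):
    """Collect target patterns from every section whose source matches page_url."""
    root = ET.fromstring(rules_xml)
    targets = []
    for section in root.findall("section"):
        sources = [uri.text for uri in section.findall("source/uri")]
        if any(wildcard_match(source, page_url) for source in sources):
            targets.extend(uri.text for uri in section.findall("target/uri"))
    return targets


print(preferred_targets(RULES_XML, "https://example.com/reviews/movies/some-film"))
print(preferred_targets(RULES_XML, "https://example.com/sports/match-report"))
```

A movie review page picks up the celebrities, culture, and entertainment patterns, while a page outside any source gets an empty list and would presumably fall back to AdSense’s regular recommendations.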

The last element, rss, seems to specify the URL of a syndication feed. I presume this feed is fetched more often than once per day, unlike the main rules files, and I further guess that links in the feed may be prioritized above other links when they’d provide relevant recommendations. I haven’t been able to confirm this either, but I can’t see any other use this tag would have.

Google AdSense doesn’t have any contact options, so I was unable to ask for input on this article.

I’ll update this post later as more information from AdSense becomes available.