How to Block AI Chatbots From Scraping Your Website’s Content

As things stand, AI chatbots have a free license to scrape your website and use its content without your permission. Concerned about your content being scraped by such tools?

The good news is, you can stop AI tools from accessing your website, but there are some caveats. Here, we show you how to block the bots using the robots.txt file for your website, plus the pros and cons of doing so.

How Do AI Chatbots Access Your Web Content?

AI chatbots are trained using multiple datasets, some of which are open-source and publicly available. For example, GPT-3 was trained using five datasets, according to a research paper published by OpenAI: Common Crawl, WebText2, Books1, Books2, and Wikipedia.

Common Crawl includes petabytes (thousands of TBs) of data from websites collected since 2008, similar to how Google’s search algorithm crawls web content. WebText2 is a dataset created by OpenAI, containing roughly 45 million web pages linked to from Reddit posts with at least three upvotes.

So, in the case of ChatGPT, the AI bot isn’t accessing and crawling your web pages directly, at least not yet. However, OpenAI’s announcement of a ChatGPT-hosted web browser has raised concerns that this could be about to change.

In the meantime, website owners should keep an eye on other AI chatbots as more of them hit the market. Bard is the other big name in the field, and very little is known about the datasets being used to train it. Obviously, we know Google’s search bots are constantly crawling web pages, but this doesn’t necessarily mean Bard has access to the same data.

Why Are Some Website Owners Concerned?

The biggest concern for website owners is that AI bots like ChatGPT, Bard, and Bing Chat devalue their content. AI bots use existing content to generate their responses, but also reduce the need for users to access the original source. Instead of users visiting websites to access information, they can simply get Google or Bing to generate a summary of the information they need.

When it comes to AI chatbots in search, the big concern for website owners is losing traffic. In the case of Bard, the AI bot rarely includes citations in its generative responses to tell users which pages it gets its information from.

So, aside from replacing website visits with AI responses, Bard removes almost any chance of the source website receiving traffic, even if the user wants more information. Bing Chat, on the other hand, more commonly links to information sources.

In other words, the current fleet of generative AI tools is using the work of content creators to systematically replace the need for content creators. Ultimately, you have to ask what incentive this leaves website owners to continue publishing content. And, by extension, what happens to AI bots when websites stop publishing the content they rely upon to function?

How to Block AI Bots From Your Website

If you don’t want AI bots using your web content, you can block them from accessing your site using the robots.txt file. Unfortunately, you have to block each bot individually and specify it by name.

For example, Common Crawl’s bot is called CCBot, and you can block it by adding the following code to your robots.txt file:

User-agent: CCBot
Disallow: /

This will block Common Crawl from crawling your website in the future, but it won’t remove any data already collected during previous crawls.

If you’re worried about ChatGPT’s new plugins accessing your web content, OpenAI has already published instructions for blocking its bot. In this case, ChatGPT’s bot is called ChatGPT-User, and you can block it by adding the following code to your robots.txt file:

User-agent: ChatGPT-User
Disallow: /

Blocking search engine AI bots from crawling your content is another problem entirely, though. As Google is highly secretive about the training data it uses, it’s impossible to identify which bots you’ll need to block and whether they’ll even respect commands in your robots.txt file (many crawlers don’t).
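
For bots that do publish a name, such as CCBot and ChatGPT-User, every rule has the same two-line shape, so the rules can be generated from a list of names. The following is a minimal Python sketch rather than an official tool; the function name build_robots_txt and the list passed to it are illustrative assumptions, and you would still need to merge the output with any rules your robots.txt file already contains.

```python
def build_robots_txt(bot_names):
    """Return robots.txt text that disallows the whole site for each named bot."""
    # One "User-agent / Disallow" pair per bot, separated by blank lines.
    blocks = [f"User-agent: {name}\nDisallow: /" for name in bot_names]
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    # Bot names taken from the article; extend the list as new bots are announced.
    print(build_robots_txt(["CCBot", "ChatGPT-User"]))
```

Keeping the names in one list means adding a newly announced bot is a one-line change, although it does nothing about bots that never announce themselves.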

How Effective Is This Method?

Blocking AI bots in your robots.txt file is the most effective method currently available, but it’s not particularly reliable.

The first problem is that you have to specify each bot you want to block, but who can keep track of every AI bot hitting the market? The next issue is that the commands in your robots.txt file are voluntary instructions, not enforced rules. While Common Crawl, ChatGPT, and many other bots respect these commands, many bots don’t.
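
Both limitations can be seen with Python’s standard-library urllib.robotparser module, which evaluates rules the way a well-behaved crawler would. This is only a sketch: the helper name is_allowed, the example.com URL, and the bot name SomeNewAIBot are hypothetical, and the result describes what a compliant crawler should do, not what any given bot actually does.

```python
from urllib.robotparser import RobotFileParser

# Rules that name two bots explicitly, as described in the article.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Report whether a compliant crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/some-article"  # placeholder URL
    # The last name is hypothetical: a bot the rules above do not mention.
    for bot in ("CCBot", "ChatGPT-User", "SomeNewAIBot"):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, bot, page) else "blocked"
        print(bot, verdict)
```

The two named bots are reported as blocked, while the unnamed bot is allowed because no rule mentions it. A crawler that ignores robots.txt never runs a check like this at all.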

The other big caveat is that you can only block AI bots from performing future crawls. You can’t remove data from previous crawls or send requests to companies like OpenAI to erase all of your data.

Should You Block AI Tools From Accessing Your Website?

Unfortunately, there’s no simple way to block all AI bots from accessing your website, and manually blocking each individual bot is almost impossible. Even if you keep up with the latest AI bots roaming the web, there’s no guarantee they’ll all adhere to the commands in your robots.txt file.

The real question here is whether the results are worth the effort, and the short answer is (almost certainly) no.

There are potential downsides to blocking AI bots from your website, too. Most of all, you won’t be able to collect meaningful data to prove whether tools like Bard are benefiting or harming your search marketing strategy.

Yes, you can assume that a lack of citations is harmful, but you’re only guessing if you lack the data because you blocked AI bots from accessing your content. It was a similar story when Google first introduced featured snippets to Search.

For relevant queries, Google shows a snippet of content from web pages on the results page, answering the user’s question. This means users don’t need to click through to a website to get the answer they’re looking for. This caused panic among website owners and SEO experts who rely on generating traffic from search queries.

However, the kind of queries that trigger featured snippets are generally low-value searches like “what is X” or “what’s the weather like in New York”. Anyone who wants in-depth information or a comprehensive weather report is still going to click through, and those who don’t were never all that valuable in the first place.

You might find it’s a similar story with generative AI tools, but you’ll need the data to prove it.

Don’t Rush Into Anything

Website owners and publishers are understandably concerned about AI technology and frustrated by the idea of bots using their content to generate instant responses. However, this isn’t the time for rushing into counteroffensive moves. AI technology is a fast-moving field, and things will continue to evolve at a rapid pace. Take this opportunity to see how things play out and analyze the potential threats and opportunities AI brings to the table.

The current system of relying on content creators’ work to replace them isn’t sustainable. Whether companies like Google and OpenAI change their approach or governments introduce new regulations, something has to give. At the same time, the negative implications of AI chatbots on content creation are becoming increasingly apparent, which website owners and content creators can use to their advantage.
