Award-Winning Editor of Clarkesworld Magazine, Forever Magazine, The Best Science Fiction of the Year, and More

Block the Bots that Feed “AI” Models by Scraping Your Website

“AI” companies think that we should have to opt-out of data-scraping bots that take our work to train their products. There isn’t even a required no-scraping period between the announcement and when they start. Too late? Tough. Once they have your data, they don’t provide you with a way to have it deleted, even before they’ve processed it for training.

These companies should be prevented from using data that they haven’t been given explicit consent for. Opt-out is problematic as it counts on concerned parties hearing about new or modified bots BEFORE their sites are targeted by them. That is simply not practical.

It should be strictly opt-in. No one should be required to provide their work for free to any person or organization. The online community is under no responsibility to help them create their products. Some will declare that I am “Anti-AI” for saying such things, but that would be a misrepresentation. I am not declaring that these systems should be torn down, simply that their developers aren’t entitled to our work. They can still build those systems with purchased or donated data.

There are ongoing court cases and debates in political circles around the world. Decisions and policies will move more slowly than either side on this issue would like, but in the meantime, SOME of the bots involved in scraping data for training have been identified and can be blocked. (Others may still be secret or operate without respect for the wishes of a website’s owner.) Here’s how:

(If you are not technically inclined, please talk to your webmaster, whatever support options are at your disposal, or a tech-savvy friend.)

 

robots.txt

This is a file placed in the root (top-level) directory of your website that is used to tell web crawlers and bots which portions of your website they are allowed to visit. Well-behaved bots honor these directives. (Not all scraping bots are well-behaved and there are no consequences, short of negative public opinion, for ignoring them. At this point, there have been no claims that the bots named in this post have ignored these directives.)

This is what our robots.txt looks like:

User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Omgilibot
Disallow: /

User-agent: Omgili
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: ImagesiftBot 
Disallow: /

User-agent: cohere-ai
Disallow: /

The first line identifies CCBot, the bot used by the Common Crawl, whose data has been used to train ChatGPT, Bard, and a number of other models. The second line states that this user agent is not allowed to access any part of our website. Some image-scraping bots also use Common Crawl data to find images.

The next two user-agents identify ChatGPT-specific bots.

ChatGPT-User is the bot used when a ChatGPT user instructs it to reference your website. It’s not automatically going to your site on its own, but it is still accessing and using data from your site.

GPTBot is a bot that OpenAI specifically uses to collect bulk training data from your website for ChatGPT.

Google-Extended is the recently announced product token that allows you to block Google from scraping your site for Bard and VertexAI. This will not have an impact on Google Search indexing. The only way this works is if it is in your robots.txt. According to their documentation: “Google-Extended doesn’t have a separate HTTP request user agent string. Crawling is done with existing Google user agent strings; the robots.txt user-agent token is used in a control capacity.”

Anthropic-ai is used by Anthropic to gather data for their “AI” products, such as Claude.

ClaudeBot is another agent used by Anthropic that is more specifically related to Claude.

Omgilibot and Omgili are from webz.io. I noticed The New York Times was blocking them and discovered that they sell data for training LLMs.

FacebookBot is Meta’s bot that crawls public web pages to improve language models for their speech recognition technology. This is not the bot Facebook uses to fetch the image and snippet shown when you post a link there.

Diffbot is a somewhat dishonest scraping bot used to collect data to train LLMs. This is their default user-agent, but they make it easy for their clients to change it to something else and ignore your wishes.

Bytespider has been identified as ByteDance’s bot used to gather data for their LLMs, including Doubao.

ImagesiftBot is billed as a reverse image search tool, but it’s associated with The Hive, a company that produces models for image generation. It’s not definitively scraping for “AI” models, but there are enough reasons to be concerned that it may be. Commenters here have suggested its inclusion. If anyone from the company would like to clarify, we’re all ears.

cohere-ai is an unconfirmed bot believed to be associated with Cohere’s chatbot. It falls into the same class as ChatGPT-User, as it appears to trigger in response to a user-directed query.

ChatGPT has previously been reported to use another unnamed bot that referenced Reddit posts to find “quality data.” That bot’s user agent has never been officially identified and its current status is unknown. Reddit, Tumblr, and others have recently announced their intent to license their users’ content to the “AI” industry and, in some cases, there are no opt-out controls. Protecting your work outside of your own websites is likely to become increasingly complicated.

 

Updating or Installing robots.txt

You can check whether your website has a robots.txt by going to yourwebsite.com/robots.txt. If that page can’t be found, then you don’t have one.

If your site is hosted by Squarespace (see below), or another simple website-building site, you could have a problem. At present, many of those companies don’t allow you to update or add your own robots.txt. They may not even have the ability to do it for you. I recommend contacting support so you can get specific information regarding their current abilities and plans to offer such functionality. Remind them that once slurped up, you have no ability to remove your work from their hold, so this is an urgent priority. (It also demonstrates once again why “opt-out” is a bad model.)

If you are using Wix, they provide directions for modifying your robots.txt here.

If you are using Squarespace, they provide directions for blocking a fixed set of AI scraping bots here. They will allow you to block some, but not all, of the bots mentioned in this post.

If you are using WordPress (not WordPress.com–see below), there are a few plugins that allow you to modify your robots.txt. Many SEO (Search Engine Optimization) plugins include robots.txt editing features. (Use those instead of making your own.) Here are a few we’ve run into:

      • Yoast: directions
      • AIOSEO: directions (there’s a report in the comments that user agent blocking may not be working at the moment)
      • SEOPress: directions
      • Dark Visitors: (under option 2) – this one will self-update with newly discovered bots; they also maintain a useful website with information about different bots

If your WordPress site doesn’t have a robots.txt or something else that modifies robots.txt, there are also dedicated plugins that can block bots like GPTBot and CCBot for you. (Disclaimer: I don’t use these plugins, but know people who do.)

For more experienced users: If you don’t have a robots.txt, you can create a text file by that name and upload it via FTP to your website’s root directory. If you have one, it can be downloaded, altered, and re-uploaded. If your hosting company provides you with cPanel or some other control panel, you can use its file manager to view, modify, or create the file as well.
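If you are creating the file from scratch and only want to block the bots above, here is a minimal sketch (user agents can be grouped so they share a single rule; extend the list with any of the others named in this post):

User-agent: CCBot
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: Google-Extended
User-agent: anthropic-ai
User-agent: ClaudeBot
Disallow: /

User-agent: *
Disallow:

The final group, with an empty Disallow, simply makes it explicit that every other crawler is still allowed; leaving it out has the same effect.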

If your site already has a robots.txt, it’s important to know where it came from as something else may be updating it. You don’t want to accidentally break something, so talk to whoever set up your website or your hosting provider’s support team.

 

Firewalls and CDNs (less common, but better option)

Your website may have a firewall or CDN in front of your actual server. Many of these products have the ability to block bots and specific user agents. Blocking the user agents (CCBot, GPTBot, ChatGPT-User, Anthropic-ai, ClaudeBot, Omgilibot, Omgili, FacebookBot, Diffbot, Bytespider, ImagesiftBot, and cohere-ai) there is even more effective than using a robots.txt directive. (As I mentioned, directives can be ignored. Blocks at the firewall level prevent them from accessing your site at all.) Some of these products include Sucuri, Cloudflare, QUIC.cloud, and Wordfence. (Happy to add more if people let me know about them. Please include a link to their user agent blocking documentation as well.) Contact their support if you need further assistance.

CLOUDFLARE USERS: In September, Cloudflare rolled out a new setting to block AI bots in the Web Application Firewall (WAF) located under the Security settings. Directions are in the linked post. There does not appear to be a list of the bots this new rule covers, so it is still advisable to use other measures as well.
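In Cloudflare’s WAF you can also create your own custom rule that matches these user agents explicitly. As a rough sketch (not an official recipe; check Cloudflare’s documentation for current field names and syntax), a custom rule expression along these lines, with the action set to Block, covers a few of the bots from this post and can be extended with the rest:

(http.user_agent contains "GPTBot") or (http.user_agent contains "CCBot") or (http.user_agent contains "ClaudeBot") or (http.user_agent contains "Bytespider")

Other firewalls and CDNs generally offer an equivalent way to match on the User-Agent header; their support documentation will have the specifics.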

NOTE: Google-Extended isn’t a bot. You need to have this in your robots.txt file if you want to prevent them from using your site content as training.

 

.htaccess (another option)

In the comments, DJ Mary pointed out that you can also block user agents with your website’s .htaccess file (used by Apache and compatible servers) by adding these lines:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (CCBot|ChatGPT|GPTBot|anthropic-ai|ClaudeBot|Omgilibot|Omgili|FacebookBot|Diffbot|Bytespider|ImagesiftBot|cohere-ai) [NC]
RewriteRule ^ - [F]

I’d rate this one as something for more experienced people to do. This has a similar effect to that of the firewall and CDN blocks above.
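A quick way to confirm that a firewall or .htaccess block of this kind is working is to request a page while pretending to be one of the bots. For example, with curl (a sketch; substitute your own address):

curl -I -A "GPTBot" https://yourwebsite.com/

If the block is in place, the response should come back as 403 Forbidden rather than a normal 200 OK.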

NOTE: Google-Extended isn’t a bot. You need to have this in your robots.txt file if you want to prevent them from using your site content as training.

 

Additional Protection for Images

There are some image-scraping tools that honor the following directive:

<meta name="robots" content="noai, noimageai">

when placed in the <head> section of your webpages. Unfortunately, many image-scraping tools allow their users to ignore this directive.
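For those editing their theme or page templates directly, a minimal sketch of where the tag goes (the surrounding markup is only illustrative):

<head>
  <title>Your Page</title>
  <meta name="robots" content="noai, noimageai">
</head>

Many CMSes and site builders also offer a plugin or theme setting for inserting tags into the head of every page, which is usually the easier route.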

Tools like Glaze and Mist can make it more difficult for models to perform style mimicry based on altered images. (Assuming they don’t get or already have an unaltered copy from another source.)

There are other techniques that you can apply for further protection (blocking direct access to images, watermarking, etc.) but I’m probably not the best person to talk to for this one. If you know a good source, recommend them in the comments.

 

Podcasts

The standard lack of transparency from the “AI” industry makes it difficult to know what is being done with regards to audio. It is clear, however, that the Common Crawl lists audio among the types of data it has acquired. Blocking the bots mentioned above should protect your RSS feed (the part of your site that shares information about episodes), but if your audio files (or RSS feed) are hosted on a third-party service (like Libsyn, PodBean, Blubrry, etc.), they may remain accessible from that end if the host isn’t blocking these bots. I am presently unaware of any that are blocking them, but I have started asking. The very nature of how podcasts are distributed makes it very difficult to close up the holes that would allow access. This is yet another reason why opt-in needs to be the standard.

 

ai.txt

I just came across this one recently and I don’t know which “AI” companies are respecting Spawning’s ai.txt settings, but if anyone is, it’s worth having. They provide a tool to generate the file and an assortment of installation directions for different websites.

https://site.spawning.ai/spawning-ai-txt

 

Substack

If you are using Substack, there is an option to “block AI training,” but it defaults to off. Courtesy of Alan Baxter on Bluesky: “If you do NOT want your publication to be used to train AI, open your publication, go to Settings > Publication details and switch it on.”

Image courtesy of Alan Baxter.

 

WordPress.com

If you are using the WordPress hosting provided by WordPress.com (this is not to be confused with WordPress installed on your own hosting plan), please be aware that they have partnered with “AI” companies and will be providing your content to those companies unless you opt-out. This post details what you can do to prevent that, but the option you are looking for (Prevent Third-Party Sharing) can be found under Settings → General, in the privacy section.

“Activating the “Prevent third-party sharing” feature excludes your site’s public content from our network of content and research partners. It also adds known AI bots to the “disallow” list in your site’s robots.txt file in order to stop them from crawling your site, though it is up to AI platforms to honor this request. Using this option also means your blog posts will not appear in the WordPress.com Reader.”

 

Closing

None of these options are guarantees. They are based on an honor system and there’s no shortage of dishonorable people who want to acquire your data for the “AI” gold rush or other purposes. Sadly, the most effective means of protecting your work from scraping is to not put it online at all. Even paywall models can be compromised by someone determined to do so.

Other techniques for informing bots that you don’t want your work used for “AI” training (or might be willing to allow it under license) are being developed. One such effort, the TDM Reservation Protocol (TDMRep), has been drafted by a W3C community group but “is not a W3C Standard nor is it on the W3C Standards Track.” I am unaware of any bots currently employing this functionality, though some vendors have mentioned it to me. While overly complex, something like this has the advantage of removing the need to block individual bots and companies. Like the others, it would not protect you from bad actors.

Writers and artists should also start advocating for “AI”-specific clauses in their contracts to restrict publishers from using, selling, donating, or licensing their work for the purposes of training these systems. Online works might be the most vulnerable to being fed to training algorithms, but the print, audio, and ebook editions developed by publishers can be used too. It is not safe to assume that anyone will take the necessary steps to protect your work from these uses, so get it in writing.

 

[This post will be updated with additional information as it becomes available.]

9/28/2023 – Added the recently announced Google-Extended robots.txt product token. This must be in robots.txt. There are no alternatives.

9/28/2023 – Added Omgilibot/Omgili, bots apparently used by a company that sells data for LLM training.

9/29/2023 – Adam Johnson on Mastodon pointed us at FacebookBot, which is used by Meta to help improve their language models.

11/6/2023 – Added anthropic-ai user-agent used by Anthropic.

11/16/2023 – Added Substack section information provided by Alan Baxter.

11/17/2023 – Squarespace has provided directions to block some, but not all bots mentioned in this post.

12/12/2023 – Added information about Cloudflare AI blocking rules in WAF.

1/25/2024 – Added Bytespider bot courtesy of darkvisitors.com.

2/28/2024 – Added WordPress.com directions for preventing the sharing of your content with their “AI” partners.

3/1/2024 – Added ImagesiftBot due to concerns raised by people commenting on this post. At this time, it’s not a confirmed “AI” scraping bot that is contributing to art-generating models, but there’s sufficient cause to be cautious due to its association with a company (The Hive) that does. Hive hasn’t revealed how it acquires data or from where.

3/23/2024 – Added information about Diffbot and cohere-ai.

4/8/2024 – Added Dark Visitors WordPress Plugin

4/27/2024 – Added ClaudeBot as per comment from Scott Adams


32 Comments

  1. Lena

    Thanks for this. I’d already modified robots.txt, but it hadn’t occurred to me to find the block list at QUIC.cloud (previously known only as “hoops I had to jump through for caching”).

    There’s never a perfect solution to bad actors, but someday “the plaintiffs took these measures to protect their work from the defendant, and the defendant ignored those measures” might be important, so we should use the imperfect solutions available.

  2. Beau

    Since AI users tend to be dishonest, from what I have experienced, I put a price on AI uses in my contracts. So if they do use an AI program without telling me and it comes to my attention, I got proof that they owe me a certain amount of money.
    Something like
    Misuse as AI “training” per piece: 30000.
    Every piece generated from an AI database built in part from my work: 10000 per piece generated.
    Same for some sort of AI filtering, such as tools that are meant to obscure AI plagiarism, that’s just more use of material in AI databases and “training”, as most “tools” just harvest anything that it is used on. So every piece being filtered, 30000 and every output from the filtering, 10000.
    So, totally allowed, just a bit costly. Even with an explanation for the high price. The replacement factor and the general moral objections from my side.
    I have seen AI users hand over AI generated images to people who specifically demanded no AI use – even in their contract. So – I just do not trust those folks at all anymore.

    I would not use Cloudflare, this is the only company I looked into for a few things so far.
    Their ToS specify that you grant them ALL rights to use your material that is sent with their service, in any way. I am not sure which products that all applies to – but still. The point 2.5 about “Customer Content and Network Data” sounds spooky
    https://www.cloudflare.com/terms/
    “2.5.1 […] Subject to the terms of this Agreement, you hereby grant us a non-exclusive, fully sublicensable, worldwide, royalty-free right to collect, use, copy, store, transmit, modify and create derivative works of Customer Content, in each case to the extent necessary to provide the Services.”
    “Services” is defined at the top of the ToS page.
    I can not understand why they want all those rights, sub-licensing for example. Having a tool to shrink images online for example – Okay your service is “modifying” something, I can buy that. But “collect”? “Store”? For how long?
    It’s just too expansive for me and I do not trust those, or similar, ToS.

    • Collect and store tend to be required for caching, but yeah, that language is a bit too open-ended. Will look into it some more.

  3. Hi, Neil!

    Great article! Let me put in a plug for my employer, Akamai Technologies, which invented the CDN and is a top security company. Kudos to you for keeping this issue visible, and sorry you have to do it.

  4. Eric J. Francis

    Hi, Neil!

    Great article. Let me put in a plug for my employer, Akamai Technologies, which invented the CDN. They’re also a top security company. Sorry you have to deal with this, but very glad you’re keeping it in the public eye.

    Eric

  5. Eric J. Francis

    My apologies, here is the link to Akamai’s CDN page: https://www.akamai.com/solutions/content-delivery-network

    • Thanks Eric. I’ve been linking to vendor’s documentation for how to block user agents, but I can’t find it on Akamai’s site. If you can provide a link to that, I’ll gladly add them.

  6. A nice recap… There are some IP addresses that can be blocked too.

    But for people who can edit their .htaccess file, they should try this:

    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} (CCBot|ChatGPT|GPTBot) [NC]
    RewriteRule ^ - [F]

    (this is an edited version of the section I use on my blogs to block all unwanted bots – mine contains 20+ keywords to look for)

    FYI, on WordPress I use the AIOWPS plugin. There’s a Blacklist Manager that allows IPs and user agents to be blocked. It works for IPs, but seems to not work for UAs anymore.
    I switched to using the “Custom .htaccess rules settings” in Tools, to add the above condition.

    Sincerely
    DJM

    PS: shared your page on Mastodon, before noticing you had an account. Edited the post to tag you.

  7. In case anyone is interested, here’s a snippet for lighttpd configurations that will deny access to the user agents:

    $REQUEST_HEADER["User-Agent"] =~ "(?i)(CCBot|ChatGPT|GPTBot)" {
    url.access-deny = ( "" )
    }

  8. thank you for this immensely helpful resource! i’d suggest correcting the double quote characters in the meta tag string, just so that it’s more convenient to copy and paste; it doesn’t parse correctly with only right quotation marks.

    • Thanks. WordPress was converting the quotes, so I switched that to preformatted text and it looks better now.

  9. Bella Han

    Thank you, Neil, for ringing the bell. I’m not particularly familiar with web tech, but I’ll keep your warning in mind when maintaining my personal website. Truly, “the most effective means of protecting your work from scraping is to not put it online at all.”

    • Yes, but as we’ve seen from the recent debacle concerning the books3 dataset, works don’t need to be online to be stolen. For those that don’t know, the books3 dataset is made up of many pirated ebooks and has been used to train many language models. It is the subject of some of the pending lawsuits regarding LLMs.

  10. Robert

    This is great, thanks. Some comments.

    Omgilibot should be blocked using “omgili” and “omgilibot” according to http://omgili.com/crawler.html

    You’ll want to add something about Castle Globals Hive AI https://thehive.ai/ as far as I know, the bot does not identify itself though.

    There’s Diffbot https://www.diffbot.com/ but in general the bots do not identify themselves. The site https://www.diffbot.com/products/crawl/ claims to use “tens of thousands of unique IPs” to hide itself.

    The EU’s OpenWebSearch https://openwebsearch.eu includes “knowledge graphs containing structured conceptual knowledge or AI language models capable of text generation.” https://openwebsearch.eu/the-project/ so you may want to add information about blocking that from https://openwebsearch.eu/owler/

    • Thanks! I’m still researching some of those other services and have emailed OpenWebSearch (worth a try) as well as some others in hopes of gaining some more information. Diffbot is one of the many known bad bots. (https://soggi.org/misc/articles/Diffbot-blocking-bad-bot-rude-content-scraper-from-websites.htm) This is just one of the many reasons we need legal requirements regarding training data, at the very minimum a requirement of transparency and an ability to opt-out. Optimally, a requirement for opt-in. Violations of those requirements and the products built using those fruits need to be banned, fined, and held legally liable for each individual instance. Without consequences, nothing will change.

    • I’ve heard back from the OpenWebSearch team and they claim “We currently don’t train LLMs on the crawled content. We only index.” Further comments suggest that they are monitoring the standards being developed for opting out, pointing to a few, and intend to adopt one. “We hope to be able to keep transparency high, i.e. every content owner / webmaster should be able to see how her/his website is processed in our pipeline.” LLMs are, however, “something to consider for the future.”

  11. John

    FWIW you can group all user-agents together in robots.txt, it makes maintenance easier:

    user-agent: ua1
    user-agent: ua2
    user-agent: ua3
    disallow: /

    Also, if you’re blocking with htaccess, I’d recommend still letting them get the robots.txt file, so that they know not to try.

  12. Fred

    There’s another aspect these scrapers tend to lightly skip over with intent: with original content, the creator owns the copyright and with that, the right to determine what happens with it. The fact that you make something public does not automatically mean that you made it free for all use, you can state exceptions (that’s why the likes of Meta et al make you practically surrender your firstborn in their terms of service – you give them a perpetual license to use whatever you post by agreeing to their terms).

    Of course, those who benefit from your free labour will enthusiastically ignore the terms of your site, but scraping without your consent doesn’t cancel your copyright just because they’d want it to.

    It would need a collective action to stop them on this basis, though, as content thieves (let’s call it what it is) are making a lot of money with their theft, part of which will be used to fight the lawsuits against those whose rights were trampled in the process – only collective action will make this affordable.

    That said, there’s nothing that says you can’t start polluting your content with data that is not visible to users but will be picked up by AI so there’s evidence it was your work that was used. You might as well start now.

    Oh, by the way, don’t forget putting disclaimers under email as well – some providers scrape those too. Just because a robot (well, software) does it doesn’t make that right either.

  13. Thanks for this wonderful post! I added it to my list of recommendations in this article on my blog: https://creativeramblings.com/ramblings/ai-generated-content.

  14. Thanks for the article and the list of “U/A”. Now I should add here that there is also a *nuclear option* of CIDR-blocking all requests coming from web hosting, VPS, colo/dedi, and “cloud”/clown computing (dis)services; reviewed according to `whois` output of the offending IPs:

    https://cheapskatesguide.org/articles/the-centralization-of-robots.html

    I now employ this option using the old-school `Deny from CIDR/MASKLEN` directive in Apache `.htaccess` (on top of robots.txt) whenever I find spiders which disregard `robots.txt`, cloaked with browser-like U/A strings, faked other crawlers’ U/A, or doing drive-by exploits, in my daily logs review. This is less time-consuming in the long run than blocking individual IPs; especially for personal website owners.

    (This is similar to a DiffBot blocking method you linked in a comment earlier, but implemented as a regular regimen targeting *every* non-identified bot)

    Cheers for the small Internet of people.

  15. I’m now spotting a new bot, with U/A string “Mozilla/5.0 (compatible; ImagesiftBot; +imagesift.com)”, whose declared affiliation is https://imagesift.com/, which operates a reverse-image search as a cover; but it actually scrapes the web for https://thehive.ai/ which provides automated “moderation” and text/image content laundering (dis)service.

    This bot seems to come from multiple addresses in the range 64.124.8.0/24 (64.124.8.32, 64.124.8.34, 64.124.8.46, 64.124.8.61, 64.124.8.69, 64.124.8.70, 64.124.8.74, 64.124.8.75, 64.124.8.77, 64.124.8.86, 64.124.8.91, 64.124.8.99 are individual IPs I’ve seen), and while it seems to pay some mind to robots.txt…

    Its robots.txt behavior is really dubious; its operator said that if one doesn’t block it specifically (even if you block *ALL* other bots with `User-Agent: *`), it would try to invite itself in as if it were “Googlebot” and scrape on anyway if the site happens to allow Googlebot (which I’d imagine many people’s sites do). Webmasters beware.

    This is just despicable; so apart from adding `Disallow: /` for ImagesiftBot specifically, I also CIDR-blocked the entire range of both the originating address and all addresses associated with imagesift.com and thehive.ai as well. (The latter two use Amazon AWS, so I just do CIDR-wide nuclear-block on corresponding parts of AWS)

  16. Gerardo

    Hi,

    just be aware that Bytespider reads the robots.txt but completely ignores it and keeps processing the website (as of today). It is therefore better to block it e.g. via the user agent as described.

  17. You might want to add, https://imagesift.com to the ever-growing list of AI scraping bots. This one is hitting my online gallery (~180k of photos).

    • Thanks. At present, it’s billed as reverse image search, which would put it in the same class as the Google Image Search tools, which I don’t currently list. It is, however, connected to The Hive, which has not revealed the identity of their bots or how they acquire data for their models, so I’ve included it with a short warning. If they are willing to go on record that their scraping won’t ever be used for training generative AI, I’ll remove them. So far as I can tell, they’ve made no such statement, so it feels fair to be cautious given their lineage.

  18. Hi there, about WordPress.com and the Reader, the Support guide was later updated to remove “Using this option also means your blog posts will not appear in the WordPress.com Reader” following an investigation by an employee of Automattic. https://mastodon.social/@wpcommaven/112008801675632474

  19. Hello,
    Thank you for this article which helped me discover ImagesiftBot.
    I’m also blocking User-agent: cohere-ai even though it hasn’t been confirmed yet.

  20. Pascal Roussel

    Excellent article, Neil, thanks a lot!

    Updating all this info in a timely manner will obviously be critical but also very time consuming because of the extensive research it requires. How do you envision tackling this in the long run?

    • There’s a wide variety of sources for this kind of information that I am monitoring and using for periodic updates to the site. I try to update within 24 hours of something new coming to my attention. As you can see in the comments, a number of other people are alerting us to things they hear about as well. We’re happy to be a central clearinghouse for this information, but it’s really a community effort.

  21. Frank

    Thanks for the article. Only: Can someone show me an example where an AI is actually blocked with the robots.txt configuration? I did various tests with well-known sites that adapted their robots.txt accordingly… and ChatGPT always summarizes the pages for me from each of these sites, or shows me suitable results for my questions. For example: “Show me three webpages on amazon.com with books from Brian Tracy”.

  22. You need to include ClaudeBot because it still crawls my site with only the anthropic-ai block in place
