Cloudflare Reports Perplexity’s Use of Stealth Bots to Bypass Website Directives for Data Retrieval

Perplexity, an artificial intelligence company, is facing scrutiny for allegedly engaging in unauthorized data scraping from various websites. Cloudflare, a prominent web security firm, conducted a study revealing that Perplexity’s crawler bots have been ignoring website directives and concealing their identities to evade detection. The findings indicate a breach of trust between content creators and web crawlers, prompting Cloudflare to take action against these stealth tactics.
Cloudflare Exposes Perplexity’s Stealth Crawling Activities
In a recent blog post, Cloudflare detailed its findings regarding Perplexity’s “stealth crawling” practices. The company reported that Perplexity has been modifying its user agents and changing its source Autonomous System Numbers (ASNs) to obscure its crawling activities. This behavior includes ignoring or failing to fetch robots.txt files, which are essential for guiding web crawlers on how to interact with a website. The implications of these actions are significant, as they undermine the established trust between website owners and crawlers.
Understanding the relationship between content websites and crawlers is crucial. Content creators rely on third-party services, such as search engines, to index their information and make it accessible to users. However, this relationship hinges on mutual trust, where crawlers must adhere to specific rules that dictate transparency and respect for website preferences. When a website blocks a bot, it is expected that the bot will cease its crawling activities. Perplexity’s alleged disregard for these protocols raises concerns about the integrity of data scraping practices.
Cloudflare’s Research Methodology and Findings
Cloudflare researchers tested Perplexity’s behavior by creating new test domains that had never been indexed by any search engine. These domains were intentionally kept from being publicly discoverable, so that no outside party could find them by ordinary means. To further prevent crawling, the researchers added a robots.txt file that blocked all bots from accessing any part of the website. Despite these precautions, Cloudflare found that Perplexity was still able to retrieve detailed information about these domains.
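As a rough illustration of what such a blanket robots.txt looks like, and how a well-behaved crawler would be expected to interpret it, the following sketch uses Python's standard urllib.robotparser module. The bot name "ExampleBot" and the domain "example.test" are placeholders chosen for this example and do not come from Cloudflare's report.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that asks every crawler to stay away from the whole site.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    # "ExampleBot" and "example.test" are hypothetical names.
    print(is_allowed(BLOCK_ALL, "ExampleBot", "https://example.test/article"))
```

A compliant crawler that runs a check like this before each request would skip the page. Note that robots.txt is purely advisory: nothing in the protocol stops a crawler that chooses not to consult it.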
The researchers discovered that Perplexity’s web crawlers employ various tactics to bypass website directives. For instance, if a declared user agent is denied access through robots.txt, Perplexity’s bots simply ignore these instructions and continue scraping data. Additionally, when a website uses a web application firewall (WAF) to block the bot, Perplexity switches to a generic browser user agent that impersonates a popular browser, such as Google Chrome running on macOS. This strategy allows the bots to evade detection and continue their scraping activities.
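To see why a user agent swap defeats a simple block, consider a deliberately naive filter that rejects requests by matching a bot token in the User-Agent header. This is only a sketch of the general idea, not how Cloudflare's WAF works; the token "examplebot" and both header strings are illustrative assumptions.

```python
# Tokens a site owner has chosen to block (hypothetical values).
BLOCKED_TOKENS = ("examplebot",)

def naive_waf_blocks(user_agent: str) -> bool:
    """Return True if the User-Agent header contains a blocked token."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_TOKENS)

# A crawler that identifies itself honestly (illustrative string).
DECLARED_UA = "Mozilla/5.0 (compatible; ExampleBot/1.0)"

# A generic desktop browser string of the kind any client can send.
GENERIC_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/124.0.0.0 Safari/537.36"
)

if __name__ == "__main__":
    print(naive_waf_blocks(DECLARED_UA))
    print(naive_waf_blocks(GENERIC_UA))
```

The declared crawler is rejected while the browser-like string passes, because the User-Agent header is set by the client and cannot be trusted on its own. This is why bot detection generally has to combine the header with other signals, such as network origin.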
Impact of Stealth Tactics on Data Quality
Cloudflare’s investigation revealed that Perplexity’s undeclared bots use multiple IP addresses not listed in the company’s official range, further complicating detection efforts. These bots also rotate through different Autonomous System Numbers (ASNs) to mask their identity. Notably, Cloudflare observed that when these stealth crawlers were successfully blocked, the quality of Perplexity’s responses diminished. The company began relying on alternative data sources to provide answers, indicating a direct link between the effectiveness of the crawling tactics and the quality of the information retrieved.
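One way a site operator might check whether a request really comes from a crawler's published address range is sketched below with Python's standard ipaddress module. The ranges shown are reserved documentation networks standing in for a real published list; they are assumptions for this example, not Perplexity's actual ranges.

```python
import ipaddress

# Placeholder "official" ranges (reserved documentation networks).
PUBLISHED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_declared_source(ip: str) -> bool:
    """Return True if ip falls inside any of the published crawler ranges."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in PUBLISHED_RANGES)

if __name__ == "__main__":
    print(is_declared_source("192.0.2.10"))
    print(is_declared_source("203.0.113.5"))
```

A request that claims to be a known crawler but arrives from outside the published ranges is suspect. The reverse case is harder: traffic from unlisted addresses that also presents a browser-like user agent gives this kind of check nothing to match against, which is the detection gap the report describes.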
In response to these findings, Cloudflare has enhanced its bot management system to automatically register and protect against the undeclared crawling activities associated with Perplexity. The company has implemented signature matches for these stealth crawlers within its managed rules, effectively blocking unauthorized AI crawling activity. This protective measure is now available to all Cloudflare users, including those on the free tier, ensuring a broader defense against similar tactics in the future.