
Why can’t search engines access the deep web?


With 5.25 billion users, 130 trillion webpages and 2.5 quintillion bytes of data generated daily, the internet is an unfathomably big place – and its growth shows no sign of stopping. However, most of its data – commonly known as the ‘deep web’ – cannot be reached via standard search engines and is therefore inaccessible to the everyday internet user. Yet it is possible to access this goldmine that lies beneath the surface web and uncover game-changing intelligence – provided you know where and how to look.

The surface web

The ‘surface web’ refers to the publicly available portion of the internet that is indexed by search engines (such as Google, Bing, or Brave Search) and includes news and e-commerce sites, blogs (a reported 600 million of them), and social media networks. The number of indexed websites is vast, containing over 1.9 billion unique hostnames and an estimated 45 billion webpages.     

Yet the truly staggering statistic is that between 96% and 99% of content on the internet is not indexed by search engines. This includes the hundreds of billions of webpages from outdated or deleted sites found on internet archives, as well as countless unstructured data points generated on mobile apps, which often have limited web access (at least for now).


So how do we get to this data? Using open-source intelligence (OSINT) techniques and tools, a significant proportion of the deep web can be accessed, including information behind paywalls, leaked data, and confidential corporate records.

The deep web

The sheer size of the deep web is almost impossible to fully appreciate, but a better understanding of how search engines crawl and index the internet can help to clarify things.

Commercial search engines use bots (or ‘spiders’) that crawl the internet and discover new sites by following the hyperlinks between webpages. The search engine analyses the content of each page and stores it, thereby indexing the page. Engines deliberately exclude much of the material crawled by their bots: in 2016, Google reported that it had crawled 130 trillion webpages, yet it indexes only an estimated 45 billion pages, prioritising what it considers the most relevant results, which it ranks using a range of evaluative metrics. Several other factors affect the crawling process – or stop it altogether.
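The link-following step described above can be sketched in a few lines of Python using only the standard library. The HTML below is a hypothetical page, not a real site; a production crawler would fetch it over the network, but the discovery logic is the same:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page, mimicking how a
    spider discovers new URLs to add to its crawl frontier."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# A hypothetical page the crawler has just fetched.
page = '<html><body><a href="/about">About</a> <a href="https://example.com/news">News</a></body></html>'

parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # → ['/about', 'https://example.com/news']
```

A real spider would repeat this for every discovered URL, which is exactly why pages with no inbound hyperlinks never enter the index.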

Encryption and CAPTCHAs

As crawlers are unable to input passwords, any pages protected by logins or encryption (such as online bank accounts, medical records, and private emails) are excluded from search engine indexes. More significantly, spiders are (with some exceptions) unable to submit queries via search boxes or to pass CAPTCHAs, meaning that the bulk of online public records – such as lawsuits, property deeds, and corporate filings – remains hidden from view, accessible only to those who can find the appropriate registry.

Robots.txt

Another important aspect of the crawling process is the Robots Exclusion Protocol, implemented as a robots.txt file placed at the root of a website. This file tells compliant crawlers which areas of the site may be scanned and which should be excluded from the crawl. Website owners typically use robots.txt files to prevent their sites being overloaded with unnecessary web traffic.
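As a sketch of how these instructions work in practice, Python's standard library can parse a robots.txt file and answer whether a given crawler may fetch a URL. The file contents and domain below are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, similar to what a site owner might publish.
robots_txt = """\
User-agent: *
Disallow: /admin/
Disallow: /internal/privacy-policy.pdf
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A compliant crawler checks each URL against the rules before fetching.
print(rp.can_fetch("*", "https://example.com/admin/"))       # → False
print(rp.can_fetch("*", "https://example.com/blog/post-1"))  # → True
```

Note that the protocol is purely advisory: well-behaved bots such as Googlebot honour it, but nothing technically prevents a crawler from ignoring it.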

The file can also be a rich source of intelligence. By examining the Robots Exclusion Protocol, OSINT analysts can gain further insights into a website and potentially uncover otherwise hidden files. For example, a website’s robots.txt file might exclude several directories from the crawl, as well as specific documents – say, PDF copies of the company’s privacy policy. Whilst exclusion often means these files cannot be found via search engines, they remain accessible to analysts once the exact URL is known.
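A minimal sketch of this analyst technique: harvest every Disallow entry from a robots.txt body as a list of paths worth visiting directly. The file contents and paths below are hypothetical:

```python
# A hypothetical robots.txt retrieved from a target website.
robots_txt = """\
User-agent: *
Disallow: /internal/
Disallow: /docs/privacy-policy-2019.pdf
Disallow: /docs/privacy-policy-2021.pdf
"""

# Each Disallow path is a lead: appended to the site's domain, it may
# reveal files that no search engine will ever surface.
hidden_paths = [
    line.split(":", 1)[1].strip()
    for line in robots_txt.splitlines()
    if line.lower().startswith("disallow:")
]
print(hidden_paths)
# → ['/internal/', '/docs/privacy-policy-2019.pdf', '/docs/privacy-policy-2021.pdf']
```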

“A significant proportion of the deep web is available if the exact URL is known.”

It is also important to note that robots.txt files often specify which crawlers can scan the website. Consequently, Google, Bing and DuckDuckGo may provide widely divergent search results – not to mention the numerous country-specific search engines which analysts commonly use during international investigations.

Read about the challenges of conducting investigations in Arabic here.

The dark web

The dark web, a subsection of the deep web, remains inaccessible to most users as it requires specific software, configurations, or – in the case of the commonly used Tor network – non-standard internet communication protocols. It is estimated that the dark web constitutes between 0.01% and 0.03% of the deep web, though this figure continues to grow rapidly. Activities on the dark web are well-documented, including the sale of illegal firearms, drugs, malware, and stolen personal identification. For these reasons, the dark web is not typically utilised for corporate investigations, though it can prove useful in some cases.

Finding the intelligence

The internet presents multiple challenges to anyone hunting for information, not least the vast amount of data it holds. Locating and extracting decision-making intelligence from the noise of the surface, deep and dark webs requires expertise in a wide range of OSINT techniques.

Article by Andrew Knight.
