A look at the Web Crawler, AI Search & Data Ingestion

This article covers the following topics:

  1. Architecture
  2. Data ingestion
  3. Crawler
  4. FAQ

Web Crawler FAQ

Question: We started an ingestion, but it returned 0 pages. Why?

The most common causes of an ingestion returning 0 pages are:

| Common Causes | Best Practice |
| --- | --- |
| **Headless browsing is turned off while crawling a Single Page Application (SPA).** SPA pages render their content with JavaScript, so a crawler that does not use a headless browser will not load the content and will not find the other pages to crawl. (An SPA is a web app that loads only a single web document and then updates the body of that document through JavaScript APIs such as Fetch when different content is to be shown. One way to recognize an SPA: as you navigate through different sections or "pages" of the website, the URL in the address bar changes, but the page does not reload completely; only parts of the page update.) | If any of the web pages you are crawling use Single Page Application (SPA) technologies, be sure to turn on the headless browsing option. |
| **Crawled pages are behind a firewall.** | If the pages are behind systems that filter traffic, update those systems to whitelist traffic from the crawler. Reach out to your Zammo point of contact to get the crawler IP ranges for your Zammo managed app. |
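
To check in advance whether a site is likely to need headless browsing, it can help to look at the HTML the server returns before any JavaScript runs. The following is a minimal sketch of that idea, not part of the Zammo product: the function names (`looks_like_spa`, `fetch_static_html`) and the thresholds (`min_links`, `min_text_chars`) are assumptions chosen for illustration. It flags a page as a probable SPA when the raw HTML loads scripts but contains almost no links and almost no visible text, which is what a non-headless crawler would see.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class _StaticPageStats(HTMLParser):
    """Counts links, scripts, and visible text in HTML as served (no JavaScript executed)."""

    def __init__(self):
        super().__init__()
        self.link_count = 0
        self.script_count = 0
        self.text_chars = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
            if tag == "script":
                self.script_count += 1
        elif tag == "a" and dict(attrs).get("href"):
            self.link_count += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only count text that a reader would see, not script or style bodies.
        if self._skip_depth == 0:
            self.text_chars += len(data.strip())


def looks_like_spa(html, min_links=3, min_text_chars=200):
    """Heuristic only: thresholds are illustrative assumptions, not product values."""
    stats = _StaticPageStats()
    stats.feed(html)
    stats.close()
    return (
        stats.script_count > 0
        and stats.link_count < min_links
        and stats.text_chars < min_text_chars
    )


def fetch_static_html(url, timeout=10):
    """Downloads the page the way a non-headless crawler would: no JavaScript is run."""
    request = Request(url, headers={"User-Agent": "spa-check-sketch/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

For example, `looks_like_spa(fetch_static_html("https://example.com"))` returns `True` when the page is probably rendered client-side. A `True` result suggests turning on headless browsing before starting the ingestion; a `False` result does not guarantee that every page on the site is static.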

Question: Why is our ingestion process taking a very long time?