CENTIPEDE

High-performance cloud-based web harvesting platform

Intelligent data scraping

What can CENTIPEDE do?

CENTIPEDE is a high-performance, cloud-based web harvesting platform built with NLP tools and techniques. It provides:

  • NLP-based data crawling for targeted websites
  • Automated, accurate data collection through a quality monitoring layer
  • Output data in a custom file format

Key Features

Quality Monitoring Layer

Up to 96% completeness and accuracy of data, ensured by a quality monitoring layer on top of the crawler engine.

Data Aggregator – Hybrid Approach

  • For certain websites, such as Ticketmaster, the initial data aggregation checks whether an API exists and, if so, uses it to retrieve event data
  • If that fails, it falls back to crawling the data with IP masking (to avoid blacklisting)
  • The hybrid approach maximizes data collection where APIs are available, reduces crawling effort, and improves crawling performance
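
The API-first, crawl-as-fallback flow described above can be sketched as follows. This is a minimal illustration, not CENTIPEDE's actual implementation: the names `aggregate`, `api_fetch`, and `crawl_fetch` are hypothetical, the two fetchers are passed in as plain functions, and treating an empty API response as a failure is an assumption. IP masking would live inside the crawl function and is not shown.

```python
from typing import Callable, Optional

Record = dict[str, str]
Fetcher = Callable[[str], list[Record]]


def aggregate(
    site: str,
    api_fetch: Optional[Fetcher],
    crawl_fetch: Fetcher,
) -> tuple[str, list[Record]]:
    """Return (source, records), preferring the API and falling back to crawling."""
    if api_fetch is not None:
        try:
            records = api_fetch(site)
            if records:  # assumption: an empty API response counts as a failure
                return "api", records
        except Exception:
            pass  # any API error triggers the crawl fallback
    # No API available, or the API call failed: crawl the site instead.
    return "crawl", crawl_fetch(site)
```

Returning the source label alongside the records makes it easy to log how often the fallback path is taken for each website.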

Customizable Solution

CENTIPEDE can be customized to your specific needs

CENTIPEDE can be configured with the specific set of websites to crawl

Data crawling/extraction on regular intervals

Considers a specific set of “mandatory” and “nice to have” fields, provided by the client

Quality Monitoring Layer to ensure up to 96% completeness and accuracy

An automated quality monitoring layer will ensure completeness and accuracy
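
One way to picture a completeness check over the client's “mandatory” and “nice to have” fields is the sketch below. How CENTIPEDE actually measures completeness is not stated here, so the definition used (share of non-empty field values across all records) and the names `is_filled`, `completeness`, and `passes` are assumptions; the 0.96 default simply mirrors the 96% figure quoted above.

```python
def is_filled(value) -> bool:
    """A value counts as filled if it is not None and not blank."""
    return value is not None and str(value).strip() != ""


def completeness(records: list[dict], fields: list[str]) -> float:
    """Share of (record, field) slots that hold a non-empty value."""
    slots = len(records) * len(fields)
    if slots == 0:
        return 1.0  # nothing to check, so nothing is missing
    filled = sum(is_filled(r.get(f)) for r in records for f in fields)
    return filled / slots


def passes(records: list[dict], mandatory: list[str], threshold: float = 0.96) -> bool:
    """Check the mandatory fields against a completeness threshold."""
    return completeness(records, mandatory) >= threshold
```

The same `completeness` function can be called with the “nice to have” fields to report their coverage separately, without letting them fail a run.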

Stores the crawled/extracted data on your FTP/SFTP server

The extracted data is stored in CSV format with a date and timestamp
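
As a rough illustration of the CSV output, the sketch below serializes records to CSV text and builds a file name that carries the date and timestamp. Placing the timestamp in the file name, the `YYYYMMDD_HHMMSS` pattern, and the names `export_name` and `to_csv` are all assumptions; uploading the file to FTP/SFTP is left out.

```python
import csv
import io
from datetime import datetime


def export_name(prefix: str, when: datetime) -> str:
    """Build a file name such as events_20240102_030405.csv."""
    return f"{prefix}_{when.strftime('%Y%m%d_%H%M%S')}.csv"


def to_csv(records: list[dict], fields: list[str]) -> str:
    """Serialize records to CSV text with a fixed column order."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Missing fields become empty cells; extra keys are dropped.
        writer.writerow({f: record.get(f, "") for f in fields})
    return buf.getvalue()
```

Fixing the column order through `fields` keeps every export aligned with the agreed field list, even when a crawl returns records with missing or extra keys.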

Components considered for pricing

  • Hosted on AWS – provides high availability
  • Uses high-performance clusters to run the scraping process in parallel
  • In addition to the infrastructure that ensures speed and availability, we have built a quality monitoring layer that compares the scraped data with screenshots captured during scraping. It uses OCR techniques to verify that the scraped data is accurate
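
The screenshot comparison can be sketched as a per-field check of scraped values against the text an OCR engine extracted from the screenshot. This is only an illustration under stated assumptions: the OCR step itself is not shown (`ocr_text` stands for its output), the names `normalize` and `verify_against_ocr` are hypothetical, the normalization only handles ASCII letters and digits, and exact phrase matching is a simplification, since real OCR output usually contains recognition errors that call for fuzzy matching.

```python
import re


def normalize(text: str) -> str:
    """Lower-case and reduce every run of non-alphanumerics to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def verify_against_ocr(record: dict[str, str], ocr_text: str) -> dict[str, bool]:
    """For each scraped field, report whether its value appears in the OCR text."""
    haystack = f" {normalize(ocr_text)} "
    result = {}
    for field, value in record.items():
        needle = normalize(value)
        # Pad with spaces so only whole-word matches count; empty values never match.
        result[field] = bool(needle) and f" {needle} " in haystack
    return result
```

For example, a record with the name "Rock Fest 2024" and the venue "City Hall", checked against OCR text reading "ROCK FEST 2024 - live at Town Hall", would confirm the name and flag the venue for review.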