Class FetchOverdueCrawlDatumProcessor

  • All Implemented Interfaces:
    CrawlDatumProcessor

    public class FetchOverdueCrawlDatumProcessor
    extends Object
    implements CrawlDatumProcessor
    Simple custom crawl datum processor that counts the records overdue for fetching, e.g. new URLs that are still unfetched after two days.
    • Field Detail

      • overDueTimeLimit

        protected long overDueTimeLimit
      • overDueTime

        protected long overDueTime
      • numOverDue

        protected long numOverDue
    • Constructor Detail

      • FetchOverdueCrawlDatumProcessor

        public FetchOverdueCrawlDatumProcessor(Configuration conf)
    • Method Detail

      • count

        public void count(CrawlDatum crawlDatum)
        Description copied from interface: CrawlDatumProcessor
        Process a single crawl datum instance to aggregate custom counts.
        Specified by:
        count in interface CrawlDatumProcessor
        Parameters:
        crawlDatum - CrawlDatum instance to count information from
      • finalize

        public void finalize(HostDatum hostDatum)
        Description copied from interface: CrawlDatumProcessor
        Process the final host datum instance and store the aggregated custom counts in the HostDatum.
        Specified by:
        finalize in interface CrawlDatumProcessor
        Parameters:
        hostDatum - HostDatum instance to hold the aggregated custom counts
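
The count/finalize lifecycle described above can be illustrated with a minimal, self-contained sketch. It is not the Nutch implementation. SketchCrawlDatum and SketchHostDatum are hypothetical stand-ins for CrawlDatum and HostDatum, and the status constants and the "num_overdue" metadata key are invented for illustration. The field meanings are assumptions too: overDueTimeLimit is taken to be a duration in milliseconds and overDueTime a cutoff timestamp computed as "now minus limit". The second method is named finalizeCounts here only to avoid confusion with Object.finalize().

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for CrawlDatum: only a status and a fetch time.
class SketchCrawlDatum {
    static final int STATUS_UNFETCHED = 1; // assumed code, illustration only
    static final int STATUS_FETCHED = 2;   // assumed code, illustration only

    final int status;
    final long fetchTimeMillis;

    SketchCrawlDatum(int status, long fetchTimeMillis) {
        this.status = status;
        this.fetchTimeMillis = fetchTimeMillis;
    }
}

// Hypothetical stand-in for HostDatum: a plain map holds the aggregated counts.
class SketchHostDatum {
    final Map<String, Long> metaData = new HashMap<>();
}

// Sketch of a processor that counts overdue records for one host.
class OverdueCounterSketch {
    // Field names mirror the documented protected fields; semantics are assumed.
    protected long overDueTimeLimit; // allowed delay, in milliseconds
    protected long overDueTime;      // cutoff timestamp: older fetch times are overdue
    protected long numOverDue;       // running count of overdue records

    OverdueCounterSketch(long nowMillis, long overDueTimeLimitMillis) {
        this.overDueTimeLimit = overDueTimeLimitMillis;
        this.overDueTime = nowMillis - overDueTimeLimitMillis;
    }

    // Called once per record: count it if it is unfetched and past the cutoff.
    void count(SketchCrawlDatum datum) {
        if (datum.status == SketchCrawlDatum.STATUS_UNFETCHED
                && datum.fetchTimeMillis < overDueTime) {
            numOverDue++;
        }
    }

    // Called once at the end: store the aggregated count in the host record.
    void finalizeCounts(SketchHostDatum hostDatum) {
        hostDatum.metaData.put("num_overdue", numOverDue); // hypothetical key
    }
}
```

In this sketch the caller feeds every record of a host to count and then calls finalizeCounts once, which matches the two-phase contract described for CrawlDatumProcessor. Passing the current time into the constructor is a choice made for the sketch so that the cutoff is deterministic in tests.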