Class CrawlDatum

    • Field Detail

      • STATUS_DB_UNFETCHED

        public static final byte STATUS_DB_UNFETCHED
        Page was not fetched yet.
        See Also:
        Constant Field Values
      • STATUS_DB_FETCHED

        public static final byte STATUS_DB_FETCHED
        Page was successfully fetched.
        See Also:
        Constant Field Values
      • STATUS_DB_GONE

        public static final byte STATUS_DB_GONE
        Page no longer exists.
        See Also:
        Constant Field Values
      • STATUS_DB_REDIR_TEMP

        public static final byte STATUS_DB_REDIR_TEMP
        Page temporarily redirects to another page.
        See Also:
        Constant Field Values
      • STATUS_DB_REDIR_PERM

        public static final byte STATUS_DB_REDIR_PERM
        Page permanently redirects to another page.
        See Also:
        Constant Field Values
      • STATUS_DB_NOTMODIFIED

        public static final byte STATUS_DB_NOTMODIFIED
        Page was successfully fetched and found not modified.
        See Also:
        Constant Field Values
      • STATUS_DB_DUPLICATE

        public static final byte STATUS_DB_DUPLICATE
        Page was marked as a duplicate of another page.
        See Also:
        Constant Field Values
      • STATUS_DB_ORPHAN

        public static final byte STATUS_DB_ORPHAN
        Page was marked as an orphan, e.g. it no longer has any inlinks.
        See Also:
        Constant Field Values
      • STATUS_DB_MAX

        public static final byte STATUS_DB_MAX
        Maximum value of DB-related status.
        See Also:
        Constant Field Values
      • STATUS_FETCH_SUCCESS

        public static final byte STATUS_FETCH_SUCCESS
        Fetching was successful.
        See Also:
        Constant Field Values
      • STATUS_FETCH_RETRY

        public static final byte STATUS_FETCH_RETRY
        Fetching unsuccessful, needs to be retried (transient errors).
        See Also:
        Constant Field Values
      • STATUS_FETCH_REDIR_TEMP

        public static final byte STATUS_FETCH_REDIR_TEMP
        Fetching temporarily redirected to another page.
        See Also:
        Constant Field Values
      • STATUS_FETCH_REDIR_PERM

        public static final byte STATUS_FETCH_REDIR_PERM
        Fetching permanently redirected to another page.
        See Also:
        Constant Field Values
      • STATUS_FETCH_GONE

        public static final byte STATUS_FETCH_GONE
        Fetching unsuccessful - page is gone.
        See Also:
        Constant Field Values
      • STATUS_FETCH_NOTMODIFIED

        public static final byte STATUS_FETCH_NOTMODIFIED
        Fetching successful - page is not modified.
        See Also:
        Constant Field Values
      • STATUS_FETCH_MAX

        public static final byte STATUS_FETCH_MAX
        Maximum value of fetch-related status.
        See Also:
        Constant Field Values
      • STATUS_SIGNATURE

        public static final byte STATUS_SIGNATURE
        Page signature.
        See Also:
        Constant Field Values
      • STATUS_INJECTED

        public static final byte STATUS_INJECTED
        Page was newly injected.
        See Also:
        Constant Field Values
      • STATUS_LINKED

        public static final byte STATUS_LINKED
        Page discovered through a link.
        See Also:
        Constant Field Values
      • STATUS_PARSE_META

        public static final byte STATUS_PARSE_META
        Page received metadata from a parser.
        See Also:
        Constant Field Values
    • Constructor Detail

      • CrawlDatum

        public CrawlDatum()
      • CrawlDatum

        public CrawlDatum​(int status,
                          int fetchInterval)
      • CrawlDatum

        public CrawlDatum​(int status,
                          int fetchInterval,
                          float score)
    • Method Detail

      • hasDbStatus

        public static boolean hasDbStatus​(CrawlDatum datum)
      • hasFetchStatus

        public static boolean hasFetchStatus​(CrawlDatum datum)
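The STATUS_DB_MAX and STATUS_FETCH_MAX constants suggest that status bytes are grouped into numeric ranges, which is presumably what hasDbStatus and hasFetchStatus test. The following is a minimal sketch of such range checks, not the Nutch implementation. The class StatusRanges, its method names, and the constant values are assumptions made for illustration; the real values are listed under Constant Field Values.

```java
/** Hypothetical stand-in for illustration; not part of Nutch. */
final class StatusRanges {
    // Assumed upper bounds for each status group (illustrative values only).
    static final byte DB_MAX = 0x1f;
    static final byte FETCH_MAX = 0x3f;

    /** True if the status falls at or below the assumed DB upper bound. */
    static boolean isDbStatus(byte status) {
        return status <= DB_MAX;
    }

    /** True if the status lies above the DB range and within the fetch range. */
    static boolean isFetchStatus(byte status) {
        return status > DB_MAX && status <= FETCH_MAX;
    }

    public static void main(String[] args) {
        System.out.println(isDbStatus((byte) 0x02));    // true
        System.out.println(isFetchStatus((byte) 0x21)); // true
        System.out.println(isFetchStatus((byte) 0x41)); // false
    }
}
```

With real CrawlDatum objects, the static methods hasDbStatus(datum) and hasFetchStatus(datum) documented above would be called instead of these stand-ins.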
      • getStatus

        public byte getStatus()
      • getStatusName

        public static String getStatusName​(byte value)
      • getStatusByName

        public static byte getStatusByName​(String name)
      • setStatus

        public void setStatus​(int status)
      • getFetchTime

        public long getFetchTime()
        Get the fetch time.
        Returns:
        long value indicating either the time of the last fetch, or the next fetch time, depending on whether Fetcher or CrawlDbReducer set the time.
      • setFetchTime

        public void setFetchTime​(long fetchTime)
        Sets either the time of the last fetch or the next fetch time, depending on whether Fetcher or CrawlDbReducer set the time.
        Parameters:
        fetchTime - the fetch time to set.
      • getModifiedTime

        public long getModifiedTime()
      • setModifiedTime

        public void setModifiedTime​(long modifiedTime)
      • getRetriesSinceFetch

        public byte getRetriesSinceFetch()
      • setRetriesSinceFetch

        public void setRetriesSinceFetch​(int retries)
      • getFetchInterval

        public int getFetchInterval()
      • setFetchInterval

        public void setFetchInterval​(int fetchInterval)
      • setFetchInterval

        public void setFetchInterval​(float fetchInterval)
      • getScore

        public float getScore()
      • setScore

        public void setScore​(float score)
      • getSignature

        public byte[] getSignature()
      • setSignature

        public void setSignature​(byte[] signature)
      • setMetaData

        public void setMetaData​(MapWritable mapWritable)
      • putAllMetaData

        public void putAllMetaData​(CrawlDatum other)
        Add all metadata from other CrawlDatum to this CrawlDatum.
        Parameters:
        other - CrawlDatum
      • getMetaData

        public MapWritable getMetaData()
        Get the CrawlDatum metadata.
        Returns:
        a MapWritable if one was set or read in readFields(DataInput); an empty map if the CrawlDatum was freshly created (the map is lazily instantiated).
        See Also:
        readFields(DataInput)
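The lazy instantiation described for getMetaData can be sketched as follows. This is an illustration under stated assumptions, not the Nutch source: the class LazyMetaHolder is hypothetical, and a plain java.util.Map stands in for Hadoop's MapWritable so that the example compiles without external dependencies.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical holder showing a lazily created metadata map. */
final class LazyMetaHolder {
    // Stays null until metadata is first requested or set.
    private Map<String, String> metaData;

    /** Replaces the metadata map, mirroring the role of setMetaData. */
    void setMetaData(Map<String, String> map) {
        this.metaData = map;
    }

    /** Returns the map, creating an empty one on first access. */
    Map<String, String> getMetaData() {
        if (metaData == null) {
            metaData = new HashMap<>();
        }
        return metaData;
    }

    /** Copies all entries from another holder, mirroring putAllMetaData. */
    void putAllMetaData(LazyMetaHolder other) {
        if (other.metaData != null) {
            getMetaData().putAll(other.metaData);
        }
    }

    public static void main(String[] args) {
        LazyMetaHolder fresh = new LazyMetaHolder();
        System.out.println(fresh.getMetaData().isEmpty()); // true

        LazyMetaHolder source = new LazyMetaHolder();
        source.getMetaData().put("key", "value");
        fresh.putAllMetaData(source);
        System.out.println(fresh.getMetaData().get("key")); // value
    }
}
```

One consequence of this pattern is that callers never receive null from the getter, so they can read or add entries without a prior null check.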
      • set

        public void set​(CrawlDatum that)
        Copy the contents of another instance into this instance.
        Parameters:
        that - an existing CrawlDatum
      • compareTo

        public int compareTo​(CrawlDatum that)
        Sort two CrawlDatum objects by decreasing score.
        Specified by:
        compareTo in interface Comparable<CrawlDatum>
        Parameters:
        that - an existing CrawlDatum
        Returns:
        1 if any one field (score, status, fetchTime, retries, fetchInterval or modifiedTime) of the new CrawlDatum minus the corresponding field of the existing CrawlDatum is greater than 0, otherwise -1.
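To illustrate what sorting by decreasing score means for a Comparable, here is a minimal sketch. The class ScoredEntry and its fields are hypothetical, and it compares on score only; the documented compareTo also consults status, fetchTime, retries, fetchInterval and modifiedTime, which are omitted here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in that orders entries by decreasing score. */
final class ScoredEntry implements Comparable<ScoredEntry> {
    final String url;
    final float score;

    ScoredEntry(String url, float score) {
        this.url = url;
        this.score = score;
    }

    @Override
    public int compareTo(ScoredEntry that) {
        // Arguments are reversed so that a higher score sorts first.
        return Float.compare(that.score, this.score);
    }

    /** Returns the URLs of the given entries in decreasing score order. */
    static List<String> sortedUrls(List<ScoredEntry> entries) {
        List<ScoredEntry> copy = new ArrayList<>(entries);
        Collections.sort(copy);
        List<String> urls = new ArrayList<>();
        for (ScoredEntry e : copy) {
            urls.add(e.url);
        }
        return urls;
    }

    public static void main(String[] args) {
        List<ScoredEntry> entries = new ArrayList<>();
        entries.add(new ScoredEntry("a", 0.5f));
        entries.add(new ScoredEntry("b", 2.0f));
        entries.add(new ScoredEntry("c", 1.0f));
        System.out.println(sortedUrls(entries)); // [b, c, a]
    }
}
```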
      • hashCode

        public int hashCode()
        Overrides:
        hashCode in class Object
      • execute

        public boolean execute​(org.apache.commons.jexl3.JexlScript expr,
                               String url)