Class CCIndexingFilter

    • Field Detail

      • FIELD

        public static String FIELD
        The name of the document field this filter uses.
    • Constructor Detail

      • CCIndexingFilter

        public CCIndexingFilter()
    • Method Detail

      • filter

        public NutchDocument filter(NutchDocument doc,
                                    Parse parse,
                                    Text url,
                                    CrawlDatum datum,
                                    Inlinks inlinks)
                             throws IndexingException
        Description copied from interface: IndexingFilter
        Adds fields or otherwise modifies the document that will be indexed for a parse. Unwanted documents can be removed from indexing by returning a null value.
        Specified by:
        filter in interface IndexingFilter
        Parameters:
        doc - document instance for collecting fields
        parse - parse data instance
        url - page URL
        datum - crawl datum for the page (fetch datum from segment containing fetch status and fetch time)
        inlinks - page inlinks
        Returns:
        modified (or a new) document instance, or null (meaning the document should be discarded)
        Throws:
        IndexingException - if an error occurs during filtering
      • addUrlFeatures

        public void addUrlFeatures(NutchDocument doc,
                                   String urlString)
        Adds the features represented by a license URL. URLs are of the form "http://creativecommons.org/licenses/xx-xx/xx/xx", where each "xx" names a license feature.
        Parameters:
        doc - a NutchDocument to augment
        urlString - the URL to extract features from
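
        The URL form described above can be illustrated with a minimal, standalone sketch. It is not taken from the filter's source: the class name LicenseUrlFeatures and the method extractFeatures are hypothetical, and the sketch assumes that features are obtained by splitting the URL path at "/" and "-" and dropping the leading "licenses" segment. It returns the feature names as a list rather than adding them to a NutchDocument.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper, for illustration only; not part of the Nutch API.
class LicenseUrlFeatures {

    // Returns the feature names found in a license URL of the form
    // http://creativecommons.org/licenses/xx-xx/xx/xx (assumed splitting rule).
    static List<String> extractFeatures(String urlString) {
        List<String> features = new ArrayList<>();
        String path;
        try {
            path = new URI(urlString).getPath();
        } catch (URISyntaxException e) {
            return features; // malformed URL: no features
        }
        if (path == null) {
            return features; // opaque URI without a path
        }
        // Break the path at slashes and dashes.
        StringTokenizer tokens = new StringTokenizer(path, "/-");
        if (tokens.hasMoreTokens()) {
            tokens.nextToken(); // skip the leading "licenses" segment
        }
        while (tokens.hasMoreTokens()) {
            features.add(tokens.nextToken());
        }
        return features;
    }
}
```

        Under these assumptions, extractFeatures("http://creativecommons.org/licenses/by-nc/2.0/uk") yields the features by, nc, 2.0 and uk; a real filter would add each one to the document instead of collecting them in a list.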