Package org.apache.nutch.parse
Class OutlinkExtractor
- java.lang.Object
  - org.apache.nutch.parse.OutlinkExtractor
public class OutlinkExtractor extends Object
Extractor to extract Outlinks / URLs from plain text using regular expressions.
- Since:
- 0.7
- Version:
- 1.0
- Author:
- Stephan Strittmatter - http://www.sybit.de
- See Also:
- Comparison of different regexp-Implementations, Overview about Java Regexp APIs
-
-
Constructor Summary
Constructors
- OutlinkExtractor()
-
Method Summary
Modifier and Type, Method, Description
- static Outlink[] getOutlinks(String plainText, String anchor, Configuration conf)
  Extracts Outlinks from the given plain text and adds the anchor to the extracted Outlinks.
- static Outlink[] getOutlinks(String plainText, Configuration conf)
  Extracts Outlinks from the given plain text.
-
-
-
Method Detail
-
getOutlinks
public static Outlink[] getOutlinks(String plainText, Configuration conf)
Extracts Outlinks from the given plain text. Applying this method to input that is not plain text can result in extremely long runtimes in pathological cases (PostScript is a known example).
- Parameters:
- plainText - the plain text from which URLs should be extracted.
- conf - a populated Configuration
- Returns:
- Array of Outlinks found in plainText
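The kind of regular-expression scan described here can be sketched in plain Java as follows. This is not Nutch's implementation: the class name PlainTextUrlSketch, the URL pattern, and the maxChars cap are all assumptions made for illustration. The cap is one possible guard against the long runtimes mentioned above when the input turns out not to be plain text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not part of org.apache.nutch.parse.
final class PlainTextUrlSketch {

    // Assumed pattern: an http, https, or ftp scheme followed by a run of
    // characters that are not whitespace, angle brackets, or quotes.
    private static final Pattern URL_PATTERN =
            Pattern.compile("(?i)\\b(?:https?|ftp)://[^\\s<>\"']+");

    // Returns every URL-like substring found in the first maxChars characters.
    static List<String> extractUrls(String plainText, int maxChars) {
        List<String> urls = new ArrayList<>();
        if (plainText == null) {
            return urls;
        }
        // Bound the amount of text scanned (an assumption of this sketch).
        String text = plainText.length() > maxChars
                ? plainText.substring(0, maxChars)
                : plainText;
        Matcher matcher = URL_PATTERN.matcher(text);
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String text = "See http://www.sybit.de and ftp://example.org/file.txt for details";
        for (String url : extractUrls(text, 10_000)) {
            System.out.println(url);
        }
    }
}
```

The pattern uses a single negated character class with no nested quantifiers, so matching time grows roughly linearly with input length. Whether that holds for the pattern Nutch actually uses is not stated on this page.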
-
getOutlinks
public static Outlink[] getOutlinks(String plainText, String anchor, Configuration conf)
Extracts Outlinks from the given plain text and adds the anchor to the extracted Outlinks.
- Parameters:
- plainText - the plain text from which URLs should be extracted.
- anchor - the anchor of the URL
- conf - a populated Configuration
- Returns:
- Array of Outlinks found in plainText
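As a rough illustration of what attaching an anchor to each extracted link might look like, the sketch below pairs every URL found in the text with the same anchor string. The Link class is a hypothetical stand-in rather than Nutch's Outlink, and the pattern and the null handling are assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not part of org.apache.nutch.parse.
final class AnchoredLinkSketch {

    // Stand-in for an outlink: a target URL plus the anchor text attached to it.
    static final class Link {
        final String toUrl;
        final String anchor;

        Link(String toUrl, String anchor) {
            this.toUrl = toUrl;
            this.anchor = anchor;
        }
    }

    // Assumed pattern: scheme, "://", then non-whitespace, non-quote characters.
    private static final Pattern URL_PATTERN =
            Pattern.compile("(?i)\\b(?:https?|ftp)://[^\\s<>\"']+");

    // Finds URL-like substrings and gives each one the same anchor.
    static Link[] extractLinks(String plainText, String anchor) {
        List<Link> links = new ArrayList<>();
        if (plainText != null) {
            String safeAnchor = (anchor == null) ? "" : anchor;
            Matcher matcher = URL_PATTERN.matcher(plainText);
            while (matcher.find()) {
                links.add(new Link(matcher.group(), safeAnchor));
            }
        }
        return links.toArray(new Link[0]);
    }
}
```

A call such as `AnchoredLinkSketch.extractLinks(text, "docs")` returns one Link per match, each carrying "docs" as its anchor.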
-
-