java.lang.Object
- org.apache.nutch.util.URLUtil

```
public class URLUtil
extends Object
```
Utility class for URL analysis

Constructor Summary

Constructors
Constructor Description

URLUtil()

Method Summary

Modifier and Type	Method	Description
`static String`	`chooseRepr(String src, String dst, boolean temp)`	Given two urls, a src and a destination of a redirect, it returns the representative url.
`static String`	`getDomainName(String url)`	Returns the domain name of the URL.
`static String`	`getDomainName(URL url)`	Get the domain name of the URL.
`static String`	`getDomainSuffix(String url)`	Returns the domain suffix corresponding to the last public part of the hostname.
`static String`	`getDomainSuffix(URL url)`	Returns the public suffix corresponding to the last public part of the hostname.
`static String`	`getHost(String url)`	Returns the lowercased hostname for the URL or null if the URL is not well-formed.
`static String`	`getHost(URL url)`	Returns the lowercased hostname for the URL.
`static String[]`	`getHostSegments(String url)`	Splits the hostname of the URL into segments at each "."
`static String[]`	`getHostSegments(URL url)`	Splits the hostname of the URL into segments at each "."
`static String`	`getPage(String url)`	Returns the page for the url.
`static String`	`getProtocol(String url)`
`static String`	`getProtocol(URL url)`
`static String`	`getTopLevelDomainName(String url)`	Returns the top-level domain name of the URL.
`static String`	`getTopLevelDomainName(URL url)`	Returns the top-level domain name of the URL.
`static boolean`	`isHomePageOf(URL url, String hostName)`	Test whether a URL is the home page or root page of a host.
`static boolean`	`isSameDomainName(String url1, String url2)`	Returns whether the given URLs have the same domain name.
`static boolean`	`isSameDomainName(URL url1, URL url2)`	Returns whether the given URLs have the same domain name.
`static void`	`main(String[] args)`	For testing
`static URL`	`resolveURL(URL base, String target)`	Resolve relative URL-s and fix a java.net.URL error in handling of URLs with pure query targets.
`static String`	`toASCII(String url)`
`static String`	`toUNICODE(String url)`

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - URLUtil
```
public URLUtil()
```
- Method Detail
  - resolveURL
```
public static URL resolveURL(URL base,
                             String target)
                      throws MalformedURLException
```
    Resolve relative URL-s and fix a java.net.URL error in handling of URLs with pure query targets.
    
    Parameters:
    
    base - base url
    
    target - target url (may be relative)
    
    Returns:
    
    resolved absolute url.
    
    Throws:
    
    MalformedURLException - if the input base URL is malformed
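
    The pure-query case mentioned above can be illustrated with a minimal sketch. It is not Nutch's implementation: the class name ResolveSketch is hypothetical, and java.net.URI is used instead of java.net.URL only to keep the example free of checked exceptions. The assumption is that a target starting with "?" should keep the base path and replace only the query.

```java
import java.net.URI;

final class ResolveSketch {

    /** Resolves target against base; query-only targets keep the base path (assumption). */
    static URI resolve(URI base, String target) {
        target = target.trim();
        if (target.startsWith("?")) {
            // Cut the base at its own query or fragment, whichever comes first.
            String b = base.toString();
            int cut = b.length();
            int q = b.indexOf('?');
            int f = b.indexOf('#');
            if (q >= 0) cut = Math.min(cut, q);
            if (f >= 0) cut = Math.min(cut, f);
            return URI.create(b.substring(0, cut) + target);
        }
        // Everything else goes through the standard resolver.
        return base.resolve(target);
    }

    public static void main(String[] args) {
        URI base = URI.create("http://a.com/b/c.html?x=0");
        System.out.println(resolve(base, "?y=1"));
        System.out.println(resolve(base, "d.html"));
    }
}
```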
  - getDomainName
```
public static String getDomainName(URL url)
```
    Get the domain name of the URL. The domain name of a URL is the substring of the URL's hostname without subdomain names. As an example,
    getDomainName(new URL("https://lucene.apache.org/"))
    will return
    apache.org
    
    Special cases:
    
    if the hostname does not end in a valid domain suffix, the entire hostname is returned.
    
    for URLs without a hostname, an empty string is returned.
    
    Valid domain suffixes are taken from https://publicsuffix.org/list/public_suffix_list.dat and are matched using crawler-commons' EffectiveTldFinder. Only ICANN domain suffixes are used. Because EffectiveTldFinder loads the public suffix list as the file "effective_tld_names.dat" from the Java classpath, a specific version of the public suffix list (e.g., the most recent one) can be used by placing it under the name "effective_tld_names.dat" in Nutch's conf/ folder. See EffectiveTldFinder.getAssignedDomain(String, boolean, boolean).
    Parameters:
    
    url - input URL to extract the domain from
    
    Returns:
    
    the domain name string
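
    The special cases above can be sketched without the real public suffix list. This is an illustration only, not Nutch's or crawler-commons' code: the class DomainNameSketch and its tiny hard-coded SUFFIXES set are hypothetical stand-ins for EffectiveTldFinder, and returning the whole host when the host is itself a suffix is an assumption.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;

final class DomainNameSketch {

    // Hypothetical stand-in for the public suffix list (ICANN section).
    private static final Set<String> SUFFIXES = Set.of("org", "com", "uk", "co.uk");

    static String domainName(URI uri) {
        String host = uri.getHost();
        if (host == null) {
            return ""; // no hostname: empty string
        }
        host = host.toLowerCase(Locale.ROOT);
        String[] labels = host.split("\\.");
        // Try the longest candidate suffix first, then shorter ones.
        for (int i = 0; i < labels.length; i++) {
            String candidate = String.join(".", Arrays.copyOfRange(labels, i, labels.length));
            if (SUFFIXES.contains(candidate)) {
                if (i == 0) {
                    return host; // assumption: host is itself a suffix
                }
                // Domain name = one label plus the matched suffix.
                return String.join(".", Arrays.copyOfRange(labels, i - 1, labels.length));
            }
        }
        return host; // no valid suffix: entire hostname
    }

    public static void main(String[] args) {
        System.out.println(domainName(URI.create("https://lucene.apache.org/")));
        System.out.println(domainName(URI.create("https://www.example.co.uk/")));
        System.out.println(domainName(URI.create("http://localhost/")));
    }
}
```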
  - getDomainName
```
public static String getDomainName(String url)
                            throws MalformedURLException
```
    Returns the domain name of the URL. The domain name of a URL is the substring of the URL's hostname without subdomain names. As an example,
    getDomainName("https://lucene.apache.org/")
    will return
    apache.org
    
    See getDomainName(URL) for more information.
    
    Parameters:
    
    url - input URL string to extract the domain from
    
    Returns:
    
    the domain name
    
    Throws:
    
    MalformedURLException - if the input URL is malformed
  - getTopLevelDomainName
```
public static String getTopLevelDomainName(URL url)
```
    Returns the top-level domain name of the URL. The top-level domain name of a URL is the last dot-separated label of the URL's hostname. As an example,
    getTopLevelDomainName(new URL("https://www.example.co.uk/"))
    will return
    uk
    
    In case of internationalized top-level domains, the ASCII representation is returned.
    
    Parameters:
    
    url - input URL to extract the top-level domain name from
    
    Returns:
    
    the top-level domain name or null if there is none
  - getTopLevelDomainName
```
public static String getTopLevelDomainName(String url)
                                    throws MalformedURLException
```
    Returns the top-level domain name of the URL. The top-level domain name of a URL is the last dot-separated label of the URL's hostname. As an example,
    getTopLevelDomainName("https://www.example.co.uk/")
    will return
    uk
    
    In case of internationalized top-level domains, the ASCII representation is returned.
    
    Parameters:
    
    url - input URL string to extract the top-level domain name from
    
    Returns:
    
    the top-level domain name or null if there is none
    
    Throws:
    
    MalformedURLException - if the input URL is malformed
  - isSameDomainName
```
public static boolean isSameDomainName(URL url1,
                                       URL url2)
```
    Returns whether the given URLs have the same domain name. As an example,
    isSameDomainName(new URL("http://lucene.apache.org"), new URL("http://people.apache.org/"))
    will return true.
    
    Parameters:
    
    url1 - first URL to compare domain name
    
    url2 - second URL to compare domain name
    
    Returns:
    
    true if the domain names are equal
  - isSameDomainName
```
public static boolean isSameDomainName(String url1,
                                       String url2)
                                throws MalformedURLException
```
    Returns whether the given URLs have the same domain name. As an example,
    isSameDomainName("http://lucene.apache.org", "http://people.apache.org/")
    will return true.
    
    Parameters:
    
    url1 - first URL string to compare domain name
    
    url2 - second URL string to compare domain name
    
    Returns:
    
    true if the domain names are equal
    
    Throws:
    
    MalformedURLException - if any of the input URLs are malformed
  - getDomainSuffix
```
public static String getDomainSuffix(URL url)
```
    Returns the public suffix corresponding to the last public part of the hostname. In case of internationalized domain suffixes, the ASCII representation is returned. For the URL https://www.taiuru.māori.nz/ the suffix xn--mori-qsa.nz is returned.
    
    Parameters:
    
    url - a URL to extract the domain suffix from
    
    Returns:
    
    the domain suffix or null if there is none
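
    The ASCII representation mentioned above is the IDNA (Punycode) form of the name. A minimal sketch using the JDK's java.net.IDN shows the conversion in both directions. The class IdnSketch is hypothetical, and whether URLUtil itself relies on java.net.IDN is an assumption.

```java
import java.net.IDN;

final class IdnSketch {

    /** Converts a (possibly internationalized) host or suffix to its ASCII form. */
    static String toAsciiForm(String name) {
        return IDN.toASCII(name);
    }

    /** Converts an ASCII (Punycode) host or suffix back to Unicode. */
    static String toUnicodeForm(String name) {
        return IDN.toUnicode(name);
    }

    public static void main(String[] args) {
        // "m\u0101ori.nz" is the suffix from the example URL above,
        // written with a Unicode escape to avoid source-encoding issues.
        String unicode = "m\u0101ori.nz";
        String ascii = toAsciiForm(unicode);
        System.out.println(ascii);
        System.out.println(toUnicodeForm(ascii).equals(unicode));
    }
}
```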
  - getDomainSuffix
```
public static String getDomainSuffix(String url)
                              throws MalformedURLException
```
    Returns the domain suffix corresponding to the last public part of the hostname. In case of internationalized domain suffixes, the ASCII representation is returned. For the URL https://www.taiuru.māori.nz/ the suffix xn--mori-qsa.nz is returned.
    
    Parameters:
    
    url - a URL string to extract the domain suffix from
    
    Returns:
    
    the domain suffix or null if there is none
    
    Throws:
    
    MalformedURLException - if the input URL string is malformed
  - getHostSegments
```
public static String[] getHostSegments(URL url)
```
    Splits the hostname of the URL into segments at each "."
    
    Parameters:
    
    url - a URL to extract host segments from
    
    Returns:
    
    a string array of host segments
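
    As a rough illustration of the splitting described above, the sketch below breaks a hostname at each dot. The class HostSegmentsSketch is hypothetical, and returning an empty array for a URL without a host is an assumption rather than documented behavior.

```java
import java.net.URI;

final class HostSegmentsSketch {

    /** Splits the host of the given URL at each "."; empty array if there is no host (assumption). */
    static String[] hostSegments(URI url) {
        String host = url.getHost();
        if (host == null) {
            return new String[0];
        }
        // The dot must be escaped because split() takes a regular expression.
        return host.split("\\.");
    }

    public static void main(String[] args) {
        for (String segment : hostSegments(URI.create("https://www.example.co.uk/"))) {
            System.out.println(segment);
        }
    }
}
```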
  - getHostSegments
```
public static String[] getHostSegments(String url)
                                throws MalformedURLException
```
    Splits the hostname of the URL into segments at each "."
    
    Parameters:
    
    url - a url string to extract host segments from
    
    Returns:
    
    a string array of host segments
    
    Throws:
    
    MalformedURLException - if the input url string is malformed
  - chooseRepr
```
public static String chooseRepr(String src,
                                String dst,
                                boolean temp)
```
    Given two urls, a src and a destination of a redirect, it returns the representative url.
    This method implements an extended version of the algorithm used by the Yahoo! Slurp crawler described here:
    How does the Yahoo! webcrawler handle redirects?
    
    In the examples below, an asterisk marks the url that is kept.
    
    Choose the destination url if either url is malformed.
    
    If the domains are different, then keep the destination whether the redirect is temporary or permanent
    
    a.com -> b.com*
    
    If the redirect is permanent and the source is root, keep the source.
    
    *a.com -> a.com?y=1 || *a.com -> a.com/xyz/index.html
    
    If the redirect is permanent and the source is not root and the destination is root, keep the destination
    
    a.com/xyz/index.html -> a.com*
    
    If the redirect is permanent and neither the source nor the destination is root, then keep the destination
    
    a.com/xyz/index.html -> a.com/abc/page.html*
    
    If the redirect is temporary and source is root and destination is not root, then keep the source
    
    *a.com -> a.com/xyz/index.html
    
    If the redirect is temporary and source is not root and destination is root, then keep the destination
    
    a.com/xyz/index.html -> a.com*
    
    If the redirect is temporary and neither the source nor the destination is root, then keep the shortest url. First compare the hosts and keep the url with the shortest host; if both hosts are equal, compare the paths. Paths are compared first by length, then by the number of / path separators.
    
    a.com/xyz/index.html -> a.com/abc/page.html*
    
    *www.a.com/xyz/index.html -> www.news.a.com/xyz/index.html
    
    If the redirect is temporary and both the source and the destination are root, then keep the shortest sub-domain
    
    *www.a.com -> www.news.a.com
    
    Although it is not part of this method, a further piece of representative url logic runs during indexing, after scoring. When the basic fields are created before indexing, if a url has a representative url stored, both the url and its representative url (which should never be the same) are compared by their linkrank scores. The higher-scoring one is kept as the url and the lower-scoring one is stored as the orig url in the index.
    Parameters:
    
    src - The source url.
    
    dst - The destination url.
    
    temp - Is the redirect a temporary redirect.
    
    Returns:
    
    String The representative url.
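
    The rules above can be sketched as a single decision chain. This is a simplified illustration, not Nutch's code: the class ChooseReprSketch is hypothetical, the domain is approximated as the last two host labels instead of using the public suffix list, "shortest host" is read as string length, and ties fall back to the destination. All of these are assumptions.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

final class ChooseReprSketch {

    /** Parses a url; returns null if it is malformed or has no host. */
    private static URI parse(String url) {
        try {
            URI u = new URI(url);
            return u.getHost() == null ? null : u;
        } catch (URISyntaxException e) {
            return null;
        }
    }

    /** Assumption: the domain is the last two host labels. */
    private static String domain(URI u) {
        String[] p = u.getHost().toLowerCase(Locale.ROOT).split("\\.");
        int n = p.length;
        return n < 2 ? p[0] : p[n - 2] + "." + p[n - 1];
    }

    /** Root means an empty or "/" path and no query string. */
    private static boolean isRoot(URI u) {
        String path = u.getRawPath();
        boolean emptyPath = path == null || path.isEmpty() || path.equals("/");
        return emptyPath && u.getRawQuery() == null;
    }

    private static long slashes(String s) {
        return s.chars().filter(c -> c == '/').count();
    }

    static String chooseRepr(String src, String dst, boolean temp) {
        URI s = parse(src);
        URI d = parse(dst);
        if (s == null || d == null) return dst;          // malformed: destination
        if (!domain(s).equals(domain(d))) return dst;    // different domains
        boolean sRoot = isRoot(s);
        boolean dRoot = isRoot(d);
        if (!temp) return sRoot ? src : dst;             // permanent redirect
        if (sRoot != dRoot) return sRoot ? src : dst;    // temporary, exactly one root
        int hs = s.getHost().length();
        int hd = d.getHost().length();
        if (hs != hd) return hs < hd ? src : dst;        // shortest host wins
        if (sRoot) return dst;                           // both root, tie (assumption)
        String ps = s.getRawPath();
        String pd = d.getRawPath();
        if (ps.length() != pd.length()) return ps.length() < pd.length() ? src : dst;
        return slashes(ps) < slashes(pd) ? src : dst;    // ties keep the destination
    }

    public static void main(String[] args) {
        System.out.println(chooseRepr("http://www.a.com/", "http://www.news.a.com/", true));
        System.out.println(chooseRepr("http://a.com/xyz/index.html", "http://a.com/", false));
    }
}
```

    Full URLs with a scheme are used here because java.net.URI does not report a host for bare names such as a.com.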
  - getHost
```
public static String getHost(String url)
```
    Returns the lowercased hostname for the URL or null if the URL is not well-formed.
    
    Parameters:
    
    url - The URL to check.
    
    Returns:
    
    String the hostname for the URL.
  - getHost
```
public static String getHost(URL url)
```
    Returns the lowercased hostname for the URL.
    
    Parameters:
    
    url - The URL to check.
    
    Returns:
    
    String the hostname for the URL.
  - getPage
```
public static String getPage(String url)
```
    Returns the page for the url. The page consists of the protocol, host, and path, but does not include the query string. The host is lowercased but the path is not.
    
    Parameters:
    
    url - The url to check.
    
    Returns:
    
    String The page for the url.
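
    A minimal sketch of the "page" notion described above: protocol, lowercased host, and unchanged path, with the query string dropped. The class PageSketch is hypothetical. Returning null for malformed input and omitting the port are assumptions, since the description does not cover them.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

final class PageSketch {

    /** Returns protocol://host/path with a lowercased host, or null if the url is malformed (assumption). */
    static String page(String url) {
        try {
            URI u = new URI(url);
            if (u.getScheme() == null || u.getHost() == null) {
                return null;
            }
            // The path keeps its original case; query and fragment are dropped.
            return u.getScheme() + "://" + u.getHost().toLowerCase(Locale.ROOT) + u.getRawPath();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(page("https://WWW.Example.com/Some/Path?q=1"));
    }
}
```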
  - getProtocol
```
public static String getProtocol(String url)
```
  - getProtocol
```
public static String getProtocol(URL url)
```
  - toASCII
```
public static String toASCII(String url)
```
  - toUNICODE
```
public static String toUNICODE(String url)
```
  - main
```
public static void main(String[] args)
```
    For testing
    
    Parameters:
    
    args - command-line arguments; run with no args to get help
  - isHomePageOf
```
public static boolean isHomePageOf(URL url,
                                   String hostName)
```
    Test whether a URL is the home page or root page of a host. This is the case if the URL path is / and the query, port, fragment, and userinfo are empty or not given. In other words, the URL is: protocol://hostName/
    
    Parameters:
    
    url - the URL to test
    
    hostName - the host name to test the URL on
    
    Returns:
    
    true if the URL is the home or root page of the host
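
    The conditions above translate directly into a chain of checks. The sketch below uses java.net.URI rather than java.net.URL to avoid checked exceptions. The class HomePageSketch is hypothetical, and the exact, case-sensitive host comparison is an assumption.

```java
import java.net.URI;

final class HomePageSketch {

    private static boolean emptyOrMissing(String s) {
        return s == null || s.isEmpty();
    }

    /** True if url has the form protocol://hostName/ with nothing else attached. */
    static boolean isHomePageOf(URI url, String hostName) {
        return "/".equals(url.getRawPath())
            && emptyOrMissing(url.getRawQuery())
            && emptyOrMissing(url.getRawFragment())
            && emptyOrMissing(url.getRawUserInfo())
            && url.getPort() == -1            // -1 means no explicit port
            && hostName.equals(url.getHost());
    }

    public static void main(String[] args) {
        System.out.println(isHomePageOf(URI.create("https://example.org/"), "example.org"));
        System.out.println(isHomePageOf(URI.create("https://example.org/index.html"), "example.org"));
    }
}
```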
