Class HttpBase
- java.lang.Object
  - org.apache.nutch.protocol.http.api.HttpBase
-
Field Summary
Fields
Modifier and Type, Field, Description
protected String accept
The "Accept" request header value.
protected String acceptCharset
The "Accept-Charset" request header value.
protected String acceptLanguage
The "Accept-Language" request header value.
static int BUFFER_SIZE
static Text COOKIE
protected boolean enableCookieHeader
Controls whether to set the Cookie HTTP header based on CrawlDatum metadata.
protected boolean enableIfModifiedsinceHeader
Configuration directive for the If-Modified-Since HTTP header.
protected int maxContent
The length limit for downloaded content, in bytes.
protected long maxCrawlDelay
Skip page if Crawl-Delay longer than this value.
protected int maxDuration
The time limit to download the entire content, in seconds.
protected boolean partialAsTruncated
Whether to save partial fetches as truncated content.
protected HashMap<String,String> proxyException
The proxy exception list.
protected String proxyHost
The proxy hostname.
protected int proxyPort
The proxy port.
protected Proxy.Type proxyType
The proxy type.
static Text RESPONSE_TIME
protected boolean responseTime
Record the response time in the CrawlDatum's metadata, see property http.store.responsetime.
protected boolean storeHttpHeaders
Record the HTTP response header in the metadata, see property store.http.headers.
protected boolean storeHttpRequest
Record the HTTP request in the metadata, see property store.http.request.
protected boolean storeIPAddress
Record the IP address of the responding server, see property store.ip.address.
protected boolean storeProtocolVersions
Record the HTTP and SSL/TLS protocol versions and the SSL/TLS cipher suites, see property store.protocol.versions.
protected int timeout
The network timeout in milliseconds.
protected boolean tlsCheckCertificate
Whether to check TLS/SSL certificates.
protected Set<String> tlsPreferredCipherSuites
Which TLS/SSL cipher suites to support.
protected Set<String> tlsPreferredProtocols
Which TLS/SSL protocols to support.
protected boolean useHttp11
Whether to use HTTP/1.1.
protected boolean useHttp2
Whether to use HTTP/2.
protected boolean useProxy
Indicates whether a proxy is used.
protected String userAgent
The Nutch 'User-Agent' request header.
-
Fields inherited from interface org.apache.nutch.protocol.Protocol
X_POINT_ID
-
-
Method Summary
Modifier and Type, Method, Description
String getAccept()
String getAcceptCharset()
String getAcceptLanguage()
Value of "Accept-Language" request header sent by Nutch.
Configuration getConf()
String getCookie(URL url)
If per-host cookies are configured, looks up the cookie for the given URL.
int getMaxContent()
int getMaxDuration()
The time limit to download the entire content, in seconds.
ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum)
Get the ProtocolOutput for a given URL and CrawlDatum.
String getProxyHost()
int getProxyPort()
protected abstract Response getResponse(URL url, CrawlDatum datum, boolean followRedirects)
crawlercommons.robots.BaseRobotRules getRobotRules(Text url, CrawlDatum datum, List<Content> robotsTxtContent)
Retrieve robot rules applicable for this URL.
int getTimeout()
Set<String> getTlsPreferredCipherSuites()
Set<String> getTlsPreferredProtocols()
boolean getUseHttp11()
String getUserAgent()
boolean isCookieEnabled()
boolean isIfModifiedSinceEnabled()
boolean isStoreHttpHeaders()
boolean isStoreHttpRequest()
boolean isStoreIPAddress()
boolean isStorePartialAsTruncated()
Whether to save partial fetches as truncated content, cf. the property http.partial.truncated.
boolean isTlsCheckCertificates()
protected void logConf()
protected static void main(HttpBase http, String[] args)
byte[] processDeflateEncoded(byte[] compressed, URL url)
byte[] processGzipEncoded(byte[] compressed, URL url)
void setConf(Configuration conf)
boolean useProxy(String host)
boolean useProxy(URI uri)
boolean useProxy(URL url)
-
-
-
Field Detail
-
RESPONSE_TIME
public static final Text RESPONSE_TIME
-
COOKIE
public static final Text COOKIE
-
BUFFER_SIZE
public static final int BUFFER_SIZE
- See Also:
- Constant Field Values
-
proxyHost
protected String proxyHost
The proxy hostname.
-
proxyPort
protected int proxyPort
The proxy port.
-
proxyType
protected Proxy.Type proxyType
The proxy type.
-
useProxy
protected boolean useProxy
Indicates whether a proxy is used.
-
timeout
protected int timeout
The network timeout in milliseconds.
-
maxContent
protected int maxContent
The length limit for downloaded content, in bytes.
-
maxDuration
protected int maxDuration
The time limit to download the entire content, in seconds.
-
partialAsTruncated
protected boolean partialAsTruncated
Whether to save partial fetches as truncated content.
-
userAgent
protected String userAgent
The Nutch 'User-Agent' request header
-
acceptLanguage
protected String acceptLanguage
The "Accept-Language" request header value.
-
acceptCharset
protected String acceptCharset
The "Accept-Charset" request header value.
-
accept
protected String accept
The "Accept" request header value.
-
useHttp11
protected boolean useHttp11
Whether to use HTTP/1.1.
-
useHttp2
protected boolean useHttp2
Whether to use HTTP/2
-
responseTime
protected boolean responseTime
Record the response time in the CrawlDatum's metadata, see property http.store.responsetime.
-
storeIPAddress
protected boolean storeIPAddress
Record the IP address of the responding server, see property store.ip.address.
-
storeHttpRequest
protected boolean storeHttpRequest
Record the HTTP request in the metadata, see property store.http.request.
-
storeHttpHeaders
protected boolean storeHttpHeaders
Record the HTTP response header in the metadata, see property store.http.headers.
-
storeProtocolVersions
protected boolean storeProtocolVersions
Record the HTTP and SSL/TLS protocol versions and the SSL/TLS cipher suites, see property store.protocol.versions.
-
maxCrawlDelay
protected long maxCrawlDelay
Skip page if Crawl-Delay longer than this value.
-
tlsCheckCertificate
protected boolean tlsCheckCertificate
Whether to check TLS/SSL certificates
-
tlsPreferredProtocols
protected Set<String> tlsPreferredProtocols
Which TLS/SSL protocols to support
-
tlsPreferredCipherSuites
protected Set<String> tlsPreferredCipherSuites
Which TLS/SSL cipher suites to support
-
enableIfModifiedsinceHeader
protected boolean enableIfModifiedsinceHeader
Configuration directive for the If-Modified-Since HTTP header.
-
enableCookieHeader
protected boolean enableCookieHeader
Controls whether to set the Cookie HTTP header based on CrawlDatum metadata.
-
-
Method Detail
-
setConf
public void setConf(Configuration conf)
- Specified by:
setConf in interface Configurable
-
getConf
public Configuration getConf()
- Specified by:
getConf in interface Configurable
-
getProtocolOutput
public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum)
Description copied from interface: Protocol
Get the ProtocolOutput for a given URL and CrawlDatum.
- Specified by:
getProtocolOutput in interface Protocol
- Parameters:
url - canonical URL
datum - associated CrawlDatum
- Returns:
- the ProtocolOutput
-
getProxyHost
public String getProxyHost()
-
getProxyPort
public int getProxyPort()
-
useProxy
public boolean useProxy(URL url)
-
useProxy
public boolean useProxy(URI uri)
-
useProxy
public boolean useProxy(String host)
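The three useProxy overloads carry no description on this page. As a rough illustration of how a host-based check against the proxy exception list could work, here is a minimal stand-alone sketch. The class ProxySketch, its constructor, and the toProxy() helper are hypothetical, the lower-casing of host names is an assumption, and none of this is taken from the Nutch source; only the field names mirror the ones documented above.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-alone class, NOT the Nutch implementation.
class ProxySketch {
  final String proxyHost;
  final int proxyPort;
  final Proxy.Type proxyType;
  // Hosts that must bypass the proxy (assumed to be stored in lower case).
  final Map<String, String> proxyException = new HashMap<>();

  ProxySketch(String proxyHost, int proxyPort, Proxy.Type proxyType) {
    this.proxyHost = proxyHost;
    this.proxyPort = proxyPort;
    this.proxyType = proxyType;
  }

  /** True if a proxy is configured and the host is not on the exception list. */
  boolean useProxy(String host) {
    if (proxyHost == null || proxyHost.isEmpty()) {
      return false; // no proxy configured at all
    }
    if (host == null) {
      return true; // nothing to match against the exception list
    }
    return !proxyException.containsKey(host.toLowerCase(Locale.ROOT));
  }

  /** Convenience overload: decide based on the host part of a URI. */
  boolean useProxy(URI uri) {
    return useProxy(uri.getHost());
  }

  /** Build a java.net.Proxy without resolving the proxy host name. */
  Proxy toProxy() {
    return new Proxy(proxyType,
        InetSocketAddress.createUnresolved(proxyHost, proxyPort));
  }
}
```

With a proxy configured for proxy.example.org:8080 and intranet.example.org on the exception list, useProxy("www.example.com") returns true in this sketch, while a URI on intranet.example.org returns false.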
-
getTimeout
public int getTimeout()
-
isIfModifiedSinceEnabled
public boolean isIfModifiedSinceEnabled()
-
isCookieEnabled
public boolean isCookieEnabled()
-
isStoreIPAddress
public boolean isStoreIPAddress()
-
isStoreHttpRequest
public boolean isStoreHttpRequest()
-
isStoreHttpHeaders
public boolean isStoreHttpHeaders()
-
getMaxContent
public int getMaxContent()
-
getMaxDuration
public int getMaxDuration()
The time limit to download the entire content, in seconds. See the property http.time.limit.
- Returns:
- the maximum duration
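To make the interplay of the content length limit (maxContent, in bytes), the time limit (maxDuration, in seconds) and truncation more concrete, the following stand-alone sketch reads a stream until either limit is hit and reports whether the result was cut short. The class LimitedReadSketch and its Result type are hypothetical, and treating a negative value as "no limit" is an assumption of this sketch; this page does not say how a protocol implementation actually enforces the limits.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, NOT the Nutch implementation.
class LimitedReadSketch {

  /** Content read so far plus a flag telling whether a limit cut it short. */
  static final class Result {
    final byte[] content;
    final boolean truncated;

    Result(byte[] content, boolean truncated) {
      this.content = content;
      this.truncated = truncated;
    }
  }

  /**
   * Read from the stream until it ends, maxContent bytes were collected,
   * or maxDurationSeconds elapsed. Negative limits mean "no limit"
   * (an assumption of this sketch).
   */
  static Result read(InputStream in, int maxContent, int maxDurationSeconds)
      throws IOException {
    long deadline = maxDurationSeconds < 0
        ? Long.MAX_VALUE
        : System.currentTimeMillis() + maxDurationSeconds * 1000L;
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[8192];
    int n;
    while ((n = in.read(buf)) != -1) {
      if (maxContent >= 0 && out.size() + n > maxContent) {
        // Keep only the bytes that still fit and stop reading.
        out.write(buf, 0, maxContent - out.size());
        return new Result(out.toByteArray(), true);
      }
      out.write(buf, 0, n);
      if (System.currentTimeMillis() > deadline) {
        // Time limit exceeded: keep what was read and flag it as truncated.
        return new Result(out.toByteArray(), true);
      }
    }
    return new Result(out.toByteArray(), false);
  }
}
```

A caller could then decide, in the spirit of partialAsTruncated, whether to store a truncated result or discard it.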
-
isStorePartialAsTruncated
public boolean isStorePartialAsTruncated()
Whether to save partial fetches as truncated content, cf. the property http.partial.truncated.
- Returns:
- true if partially fetched truncated content is stored
-
getUserAgent
public String getUserAgent()
-
getCookie
public String getCookie(URL url)
If per-host cookies are configured, looks up the cookie for the given URL.
- Parameters:
url - the URL to look up a cookie for
- Returns:
- the cookie or null
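The contract described here (a per-host lookup that yields the cookie or null) can be pictured with a plain map keyed by host name. The sketch below is only an illustration: the class CookieSketch, its put method, and the lower-casing of host names are assumptions, and this page does not describe how Nutch configures or stores per-host cookies.

```java
import java.net.URL;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-alone class, NOT the Nutch implementation.
class CookieSketch {
  // Host name (lower case) mapped to the Cookie header value to send.
  private final Map<String, String> hostCookies = new HashMap<>();

  /** Register a cookie for a host; host names are compared case-insensitively. */
  void put(String host, String cookie) {
    hostCookies.put(host.toLowerCase(Locale.ROOT), cookie);
  }

  /** Return the cookie configured for the URL's host, or null if there is none. */
  String getCookie(URL url) {
    return hostCookies.get(url.getHost().toLowerCase(Locale.ROOT));
  }
}
```

A caller would check the result for null before adding a Cookie header to the request.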
-
getAcceptLanguage
public String getAcceptLanguage()
Value of "Accept-Language" request header sent by Nutch.
- Returns:
- The value of the "Accept-Language" header.
-
getAcceptCharset
public String getAcceptCharset()
-
getAccept
public String getAccept()
-
getUseHttp11
public boolean getUseHttp11()
-
isTlsCheckCertificates
public boolean isTlsCheckCertificates()
-
logConf
protected void logConf()
-
processGzipEncoded
public byte[] processGzipEncoded(byte[] compressed, URL url) throws IOException
- Throws:
IOException
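This page gives only the signature of processGzipEncoded. As an illustration of what decoding a gzip-encoded response body involves, here is a minimal sketch using java.util.zip. The class GzipSketch, the maxBytes parameter (negative meaning "no limit") and the gzip helper are hypothetical and chosen for this example; the actual Nutch method takes the URL as its second argument and may behave differently.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical stand-alone helper, NOT the Nutch implementation.
class GzipSketch {

  /** Decompress gzip data, keeping at most maxBytes of output (negative = no limit). */
  static byte[] gunzip(byte[] compressed, int maxBytes) throws IOException {
    try (GZIPInputStream in =
             new GZIPInputStream(new ByteArrayInputStream(compressed));
         ByteArrayOutputStream out = new ByteArrayOutputStream()) {
      byte[] buf = new byte[8192];
      int n;
      while ((n = in.read(buf)) != -1) {
        if (maxBytes >= 0 && out.size() + n > maxBytes) {
          // Keep only the bytes that still fit under the limit, then stop.
          out.write(buf, 0, maxBytes - out.size());
          break;
        }
        out.write(buf, 0, n);
      }
      return out.toByteArray();
    }
  }

  /** Compress bytes with gzip; only used here to build sample input. */
  static byte[] gzip(byte[] raw) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
      gz.write(raw);
    }
    return out.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    byte[] raw = "hello, crawler".getBytes(StandardCharsets.UTF_8);
    byte[] back = gunzip(gzip(raw), -1);
    System.out.println(new String(back, StandardCharsets.UTF_8));
  }
}
```

Capping the output size while decompressing matters for a crawler because a small compressed body can expand to a very large one.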
-
processDeflateEncoded
public byte[] processDeflateEncoded(byte[] compressed, URL url) throws IOException
- Throws:
IOException
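For the deflate content encoding, a decoder has to cope with two variants seen in practice: zlib-wrapped data and raw deflate data without the zlib header. The sketch below uses java.util.zip.Inflater, whose constructor takes a nowrap flag for the raw variant. The class DeflateSketch and its method signatures are hypothetical, and the page does not say which variants the Nutch method accepts or how it handles errors.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical stand-alone helper, NOT the Nutch implementation.
class DeflateSketch {

  /**
   * Inflate deflate-compressed data.
   * nowrap = false expects a zlib wrapper, nowrap = true expects raw deflate.
   */
  static byte[] inflate(byte[] compressed, boolean nowrap) throws IOException {
    Inflater inflater = new Inflater(nowrap);
    inflater.setInput(compressed);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[8192];
    try {
      while (!inflater.finished()) {
        int n = inflater.inflate(buf);
        if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
          break; // input ended early or needs a dictionary: return what we have
        }
        out.write(buf, 0, n);
      }
    } catch (DataFormatException e) {
      throw new IOException("invalid deflate data", e);
    } finally {
      inflater.end(); // release native resources
    }
    return out.toByteArray();
  }

  /** Compress bytes with deflate; only used here to build sample input. */
  static byte[] deflate(byte[] raw, boolean nowrap) {
    Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, nowrap);
    deflater.setInput(raw);
    deflater.finish();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[8192];
    while (!deflater.finished()) {
      out.write(buf, 0, deflater.deflate(buf));
    }
    deflater.end();
    return out.toByteArray();
  }
}
```

A caller that does not know which variant a server sent could try the zlib-wrapped form first and retry with nowrap set to true if an IOException is thrown.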
-
getResponse
protected abstract Response getResponse(URL url, CrawlDatum datum, boolean followRedirects) throws ProtocolException, IOException
- Throws:
ProtocolException
IOException
-
getRobotRules
public crawlercommons.robots.BaseRobotRules getRobotRules(Text url, CrawlDatum datum, List<Content> robotsTxtContent)
Description copied from interface: Protocol
Retrieve robot rules applicable for this URL.
- Specified by:
getRobotRules in interface Protocol
- Parameters:
url - URL to check
datum - page datum
robotsTxtContent - container to store responses when fetching the robots.txt file for debugging or archival purposes. Instead of a robots.txt file, it may include redirects or an error page (404, etc.). Response Content is appended to the passed list. If null is passed nothing is stored.
- Returns:
- robot rules (specific for this URL or default), never null
-
-