Class HttpBase

    • Field Detail

      • RESPONSE_TIME

        public static final Text RESPONSE_TIME
      • COOKIE

        public static final Text COOKIE
      • proxyHost

        protected String proxyHost
        The proxy hostname.
      • proxyPort

        protected int proxyPort
        The proxy port.
      • proxyType

        protected Proxy.Type proxyType
        The proxy type.
      • useProxy

        protected boolean useProxy
        Indicates whether a proxy is used.
      • timeout

        protected int timeout
        The network timeout, in milliseconds.
      • maxContent

        protected int maxContent
        The length limit for downloaded content, in bytes.
      • maxDuration

        protected int maxDuration
        The time limit to download the entire content, in seconds.
      • partialAsTruncated

        protected boolean partialAsTruncated
        Whether to save partial fetches as truncated content.
      • userAgent

        protected String userAgent
        The Nutch "User-Agent" request header value.
      • acceptLanguage

        protected String acceptLanguage
        The "Accept-Language" request header value.
      • acceptCharset

        protected String acceptCharset
        The "Accept-Charset" request header value.
      • accept

        protected String accept
        The "Accept" request header value.
      • useHttp11

        protected boolean useHttp11
        Whether to use HTTP/1.1.
      • useHttp2

        protected boolean useHttp2
        Whether to use HTTP/2.
      • responseTime

        protected boolean responseTime
        Record the response time in the CrawlDatum's metadata, see property http.store.responsetime.
      • storeIPAddress

        protected boolean storeIPAddress
        Record the IP address of the responding server, see property store.ip.address.
      • storeHttpRequest

        protected boolean storeHttpRequest
        Record the HTTP request in the metadata, see property store.http.request.
      • storeHttpHeaders

        protected boolean storeHttpHeaders
        Record the HTTP response headers in the metadata, see property store.http.headers.
      • storeProtocolVersions

        protected boolean storeProtocolVersions
        Record the HTTP and SSL/TLS protocol versions and the SSL/TLS cipher suites, see property store.protocol.versions.
      • maxCrawlDelay

        protected long maxCrawlDelay
        Skip the page if its Crawl-Delay is longer than this value.
      • tlsCheckCertificate

        protected boolean tlsCheckCertificate
        Whether to check TLS/SSL certificates.
      • tlsPreferredProtocols

        protected Set<String> tlsPreferredProtocols
        Which TLS/SSL protocols to support.
      • tlsPreferredCipherSuites

        protected Set<String> tlsPreferredCipherSuites
        Which TLS/SSL cipher suites to support.
      • enableIfModifiedsinceHeader

        protected boolean enableIfModifiedsinceHeader
        Whether the If-Modified-Since HTTP header is enabled.
      • enableCookieHeader

        protected boolean enableCookieHeader
        Controls whether to set the Cookie HTTP header based on CrawlDatum metadata.
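
The maxContent field above describes a byte limit on downloaded content. The following is a minimal sketch of how such a limit could be applied to a fetched body. It is not the HttpBase implementation: the class ContentLimitSketch, its nested Limited type, and the method applyLimit are hypothetical names, and treating a negative limit as "no limit" is an assumption not stated in this section.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of Nutch. */
final class ContentLimitSketch {

    /** The bytes that were kept, plus a flag telling whether anything was cut off. */
    static final class Limited {
        final byte[] content;
        final boolean truncated;

        Limited(byte[] content, boolean truncated) {
            this.content = content;
            this.truncated = truncated;
        }
    }

    /**
     * Keeps at most maxContent bytes of the fetched body.
     * Assumption: a negative maxContent disables the limit.
     */
    static Limited applyLimit(byte[] fetched, int maxContent) {
        if (maxContent < 0 || fetched.length <= maxContent) {
            return new Limited(fetched, false);
        }
        return new Limited(Arrays.copyOf(fetched, maxContent), true);
    }

    public static void main(String[] args) {
        Limited result = applyLimit(new byte[] {1, 2, 3, 4, 5}, 3);
        // prints "3 bytes kept, truncated=true"
        System.out.println(result.content.length + " bytes kept, truncated=" + result.truncated);
    }
}
```

Returning the truncation flag alongside the bytes lets a caller record that the stored content is incomplete, which is the kind of information a setting such as partialAsTruncated would rely on.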
    • Constructor Detail

      • HttpBase

        public HttpBase()
        Creates a new instance of HttpBase.
      • HttpBase

        public HttpBase(org.slf4j.Logger logger)
        Creates a new instance of HttpBase.
        Parameters:
        logger - the Logger to use in this HttpBase
    • Method Detail

      • getProxyHost

        public String getProxyHost()
      • getProxyPort

        public int getProxyPort()
      • useProxy

        public boolean useProxy(URL url)
      • useProxy

        public boolean useProxy(URI uri)
      • useProxy

        public boolean useProxy(String host)
      • getTimeout

        public int getTimeout()
      • isIfModifiedSinceEnabled

        public boolean isIfModifiedSinceEnabled()
      • isCookieEnabled

        public boolean isCookieEnabled()
      • isStoreIPAddress

        public boolean isStoreIPAddress()
      • isStoreHttpRequest

        public boolean isStoreHttpRequest()
      • isStoreHttpHeaders

        public boolean isStoreHttpHeaders()
      • getMaxContent

        public int getMaxContent()
      • getMaxDuration

        public int getMaxDuration()
        The time limit to download the entire content, in seconds. See the property http.time.limit.
        Returns:
        the maximum duration
      • isStorePartialAsTruncated

        public boolean isStorePartialAsTruncated()
        Whether to save partial fetches as truncated content, cf. the property http.partial.truncated.
        Returns:
        true if partially fetched truncated content is stored
      • getUserAgent

        public String getUserAgent()
      • getCookie

        public String getCookie(URL url)
        If per-host cookies are configured, this method looks up the cookie for the given URL.
        Parameters:
        url - the URL to look up a cookie for
        Returns:
        the cookie or null
      • getAcceptLanguage

        public String getAcceptLanguage()
        Value of the "Accept-Language" request header sent by Nutch.
        Returns:
        the value of the "Accept-Language" header
      • getAcceptCharset

        public String getAcceptCharset()
      • getAccept

        public String getAccept()
      • getUseHttp11

        public boolean getUseHttp11()
      • isTlsCheckCertificates

        public boolean isTlsCheckCertificates()
      • getTlsPreferredCipherSuites

        public Set<String> getTlsPreferredCipherSuites()
      • getTlsPreferredProtocols

        public Set<String> getTlsPreferredProtocols()
      • logConf

        protected void logConf()
      • processGzipEncoded

        public byte[] processGzipEncoded(byte[] compressed,
                                         URL url)
                                  throws IOException
        Throws:
        IOException
      • processDeflateEncoded

        public byte[] processDeflateEncoded(byte[] compressed,
                                            URL url)
                                     throws IOException
        Throws:
        IOException
      • getRobotRules

        public crawlercommons.robots.BaseRobotRules getRobotRules(Text url,
                                                                  CrawlDatum datum,
                                                                  List<Content> robotsTxtContent)
        Description copied from interface: Protocol
        Retrieve robot rules applicable for this URL.
        Specified by:
        getRobotRules in interface Protocol
        Parameters:
        url - URL to check
        datum - page datum
        robotsTxtContent - container to store responses when fetching the robots.txt file for debugging or archival purposes. Instead of a robots.txt file, it may include redirects or an error page (404, etc.). Response Content is appended to the passed list. If null is passed, nothing is stored.
        Returns:
        robot rules (specific for this URL or default), never null
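
The signatures of processGzipEncoded and processDeflateEncoded above suggest that each takes a compressed response body and returns the decoded bytes, throwing IOException on failure. The following is a minimal sketch of that kind of decoding with the standard java.util.zip classes. It is not the HttpBase implementation: the class DecodeSketch and the methods gunzip, inflate and drain are hypothetical names, the url parameter is left out, and "deflate" bodies are assumed to carry the zlib wrapper.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.InflaterInputStream;

/** Hypothetical helper for illustration; not part of Nutch. */
final class DecodeSketch {

    /** Reads the stream to its end and returns everything that was read. */
    private static byte[] drain(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    /** Decodes a gzip-compressed body; malformed input raises an IOException. */
    static byte[] gunzip(byte[] compressed) throws IOException {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return drain(in);
        }
    }

    /** Decodes a deflate body. Assumption: the data is zlib-wrapped, not raw deflate. */
    static byte[] inflate(byte[] compressed) throws IOException {
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            return drain(in);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] original = "User-agent: *".getBytes(StandardCharsets.UTF_8);

        // Compress a small body so there is something to decode.
        ByteArrayOutputStream packed = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(packed)) {
            gz.write(original);
        }

        byte[] decoded = gunzip(packed.toByteArray());
        // prints "User-agent: *"
        System.out.println(new String(decoded, StandardCharsets.UTF_8));
    }
}
```

A production decoder would likely also cap the size of the decoded output, for example with a limit such as maxContent, so that a small compressed body cannot expand without bound.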