Defined in file: hzHttpClient.h

The hzWebhost class facilitates automated downloading of the set of documents available at any given domain. This is a parameter-driven and generally recursive process. Starting from a list of one or more 'root' pages, such as the home page, pages are downloaded and any links they contain to other pages are garnered. Then, subject to specified limiting criteria, these latter pages are downloaded in turn. The process terminates when all discovered pages have been downloaded. By default, links are limited to other pages on the same website, or on other websites listed as related to it. Other criteria, such as file date and file type, may also be applied as pages are read in.

Where authentication is required, the authentication sequence is normally by login form submission. The login form is downloaded from a particular URL, the username and password are filled in, and the form is sent back to the URL indicated in the form (this may or may not be the same). NOTE this will not work where anti-robot mechanisms are in place, such as Google reCAPTCHA forms. Note also that the sequence of pages to visit may have to include a seemingly pointless visit to a page (normally the home page), purely for the client to be issued with a cookie, in order for the login to be accepted.
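
By way of orientation, a minimal usage sketch follows. It is illustrative only: the example.com URL is hypothetical, the assignment of a string literal to an hzUrl is an assumption, and error checking of the hzEcode returns is omitted. Only the class name and the method signatures are taken from the reference below.

    #include "hzHttpClient.h"

    int main()
    {
        hzWebhost   site;                   // Crawler instance for one target website
        hzUrl       root;
        hzString    criteria;               // No additional limiting criteria in this sketch

        root = "http://www.example.com/";   // Hypothetical root page (assumes hzUrl accepts a string literal)

        site.AddRoot(root, criteria);       // Seed the crawl; discovered links are followed
                                            // subject to the limiting criteria
        site.Scrape();                      // Download root pages and, recursively, all qualifying links

        return 0;
    }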

This class employs the following private sub-classes:

_nodeList

_pageList

Constructors/Destructors:

hzWebhost(void)
~hzWebhost(void)

Public Methods:

hzEcode AddBan(hzString& pageEnding)
    Adds a page ending to the set of banned endings (see m_Banned). Links matching a banned ending are not visited.

hzEcode AddRSS(hzUrl& rss)
    Adds an RSS feed URL for the target website.

hzEcode AddRoot(hzUrl& url, hzString& criteria)
    Adds a root URL for the target website.

hzEcode AuthBasic(const char* username, const char* password)
    Sets the basic authentication string for the website (if the site uses this method). Once set, all requests to the target website are submitted with this string in the HTTP header.

hzDocument* Download(hzUrl& url)
    Fetches the page found at the supplied URL and returns it as a document (either XML or HTML). Note that if the page has already been downloaded (is in the site's history), it is only downloaded again if its time to live has expired; otherwise this function reloads it from file.
    Returns: Pointer to a newly allocated document, which must be deleted after use.

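    A short usage sketch (the URL is hypothetical and the site object is as in the introductory sketch above):

        hzUrl       url;
        hzDocument* pDoc;

        url  = "http://www.example.com/news/index.html";    // Hypothetical page
        pDoc = site.Download(url);          // Fetched over HTTP, or reloaded from file if still live

        if (pDoc)
        {
            // ... examine the XML or HTML document ...
            delete pDoc;                    // Caller owns the returned document
        }
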
hzEcode GetRSS(void)
    In general a website can be thought of as a source of 'rolling' news updates, in which old pages are deleted, new pages created and existing pages modified on an ad-hoc basis. RSS feeds allow greater ease when syncing an external website to the local machine. By periodically reading one or more RSS feeds, one obtains a set of links which can generally be taken as the set of pages deemed 'current' by the website. By comparing these links to a history file of already fetched links, new pages can be added to a repository as they appear on the site. The RSS feeds themselves are just XML files containing links. This function obtains all the RSS feeds from the site, garners all the links from them, and then downloads any pages from those links that are not already in the site history. The feeds themselves are not saved, as they will be fetched again.
    Arguments: None

hzEcode Login(void)
    Executes the login process. This is always a case of downloading each page listed in m_Authsteps (if any) and then posting to the URL given in m_Authpage (if provided) with the name-value pairs listed in m_Authform.
    Arguments: None

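    A hedged configuration sketch follows. It assumes the m_Auth* members are accessible to the caller, that hzList provides an Add() method, that hzUrl can be constructed from a string literal, and that hzPair exposes name and value fields; none of these details are confirmed by this reference:

        hzPair  user, pass;

        site.m_Authsteps.Add(hzUrl("http://www.example.com/"));    // Visited only so a cookie is issued
        site.m_Authpage = hzUrl("http://www.example.com/login");   // Hypothetical login form URL

        user.name = "username";   user.value = "jsmith";           // Form field names are hypothetical
        pass.name = "password";   pass.value = "secret";
        site.m_Authform.Add(user);
        site.m_Authform.Add(pass);

        site.Login();   // Downloads each m_Authsteps page, then posts m_Authform to m_Authpage
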
void Logout(void)
    Executes the logout process.
    Arguments: None
    Returns: None

hzEcode Scrape(void)
    A scrape captures the current state of the website, or a limited portion of it, to file. The scraping process runs through a set of known links for the website, downloading the page for each in turn. Each downloaded page is then examined for links. Links to domains other than the one in question are ignored, as are links to such things as images. Remaining links not found in the set of known links are added to the set. The process terminates when all the links have been attempted. The set of known links needs to comprise the site's home page, plus a login page if this exists and is not the same as the home page. These are usually enough to 'bootstrap' the rest of the site.
    Arguments: None

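    The loop described above is a standard worklist traversal. The following standalone sketch illustrates the logic in plain C++; it is not the hzWebhost implementation, and fetchPage, extractLinks, sameDomain and isResource are hypothetical helpers (declared but not defined here):

        #include <queue>
        #include <set>
        #include <string>
        #include <vector>

        std::string              fetchPage(const std::string& url);         // Download one page
        std::vector<std::string> extractLinks(const std::string& html);     // Links found in the page
        bool sameDomain(const std::string& url, const std::string& domain);
        bool isResource(const std::string& url);                            // Image or similar

        void scrape(const std::string& domain, const std::vector<std::string>& roots)
        {
            std::set<std::string>   known(roots.begin(), roots.end());      // Set of known links
            std::queue<std::string> todo;

            for (const std::string& r : roots)
                todo.push(r);

            while (!todo.empty())
            {
                std::string url = todo.front();
                todo.pop();

                for (const std::string& link : extractLinks(fetchPage(url)))
                {
                    if (!sameDomain(link, domain))  continue;   // Other domains ignored
                    if (isResource(link))           continue;   // Images etc ignored

                    if (known.insert(link).second)  // Link not already in the known set
                        todo.push(link);
                }
            }
            // Terminates once every known link has been attempted
        }
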
hzEcode Sync(void)
    Runs the series of hzWebCMD directives to sync key pages from a website to a repository.
    Arguments: None

void _clear(void)
    Clears the hzWebhost for shutdown, or for re-initialization to sync another website.
    Arguments: None
    Returns: None

hzEcode _loadstatus(void)
    Loads the visit status file (called upon startup). This way pages that have already been loaded are not re-fetched unless they are out of date.
    Arguments: None

hzEcode _savestatus(void)
    Writes out the visit status file. This keeps a record of which URLs have already been downloaded, to which files, and the expiry date (after which the page will have to be fetched again).
    Arguments: None

hzEcode getRss_r(HttpRC& hRet, hzUrl& feed, uint32_t nLevel)
    Recursive fetch of RSS documents. The supplied URL is downloaded and loaded into an XML document, where it is tested to ensure it is an XML document. The RSS feed is assumed to contain only links. These links may be to HTML pages or to other, sub RSS feeds. The HTML pages are end points of the process: they are downloaded, and any links they may contain are recorded but not followed. The sub RSS feeds are then processed by recursive calls to this function.
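
    The recursion can be pictured with the following standalone sketch (plain C++, not the actual implementation; fetchRssLinks, isRssFeed and downloadPage are hypothetical helpers):

        #include <string>
        #include <vector>

        std::vector<std::string> fetchRssLinks(const std::string& feed);    // Download feed, parse XML, return links
        bool isRssFeed(const std::string& url);                             // Sub-feed, rather than an HTML page
        void downloadPage(const std::string& url);                          // Fetch an HTML end point

        void getRss_r(const std::string& feed, unsigned nLevel, unsigned maxLevel)
        {
            if (nLevel > maxLevel)
                return;                     // Guard against unbounded recursion

            for (const std::string& link : fetchRssLinks(feed))
            {
                if (isRssFeed(link))
                    getRss_r(link, nLevel + 1, maxLevel);   // Recurse into the sub-feed
                else
                    downloadPage(link);     // HTML end point: fetched; its links recorded, not followed
            }
        }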

Member Variables:

hzHttpClient HC - HTTP client instance
hzString m_AuthBasic - If set, this is supplied with each GetPage() call.
hzUrl m_Authexit - Logout URL
hzList<hzPair> m_Authform - List of name-value pairs to submit to the site's login form (given as m_Authpage, see below)
hzUrl m_Authpage - Login page for the site (if applicable; may be the same as the home page)
hzList<hzUrl> m_Authsteps - Initial URL requests that must be made for cookie collection before the login form can be submitted
hzSet<hzString> m_Banned - Filter for banning visitation. Links matching this filter are not visited, stored or processed.
hzList<hzWebCMD> m_Commands - List of commands to effect a SYNC operation
hzUrl m_ContactUs - Used to post messages to websites
hzString m_CookiePath - Session cookie path (set when a new cookie is offered by the server, used in all subsequent requests)
hzString m_CookieSess - Session cookie (set when a new cookie is offered by the server, used in all subsequent requests)
hzMapS<hzString,hzCookie> m_Cookies - All cookies needed for the session with the server
hzSet<hzString> m_Domains - Allowed domains for the site and its links
hzSet<hzEmaddr> m_Emails - Email addresses occurring in the bodies of downloaded pages
hzList<hzUrl> m_Feeds - List of RSS feed URLs (web scraping only)
hzMapS<hzString,hzHtmForm*> m_Forms - Map of forms found in loaded pages
hzUrl m_Homepage - Root URL for the site
hzString m_Name - Canonical name of the site, e.g. 'positive news'
hzMapS<hzString,hzWebhost::_nodeList*> m_Nodelists - Map of lists of selected nodes
hzSet<hzUrl> m_Offsite - Links discovered that are to pages in other domains or websites
uint32_t m_Opflags - Operational flags
hzMapS<hzString,hzWebhost::_pageList*> m_Pagelists - Map of lists of selected links
hzString m_Password - For controlled access to the site
hzString m_Repos - Target directory for downloaded pages
hzList<hzPair> m_Roots - List of root commands (web scraping only)
uint32_t m_Sofar - Count of Sync commands executed
hzChain m_Styles - Stylesheet
hzChain m_Trace - List of data items garnered (XML format)
hzString m_Username - For controlled access to the site
hzDocument* m_docAuth - Login page
hzDocument* m_docHome - Home page
hzMapS<hzUrl,hzDocMeta*> m_mapHist - Download history: maps visited URLs to document metadata
hzDocument* m_resAuth - Login page response
hzDocument* m_resLast - Last page downloaded
hzXmlSlct m_tagDate - For extraction of a date
hzXmlSlct m_tagDesc - For extraction of a description
hzXmlSlct m_tagItem - For extraction of an item
hzXmlSlct m_tagLink - For extraction of a link
hzXmlSlct m_tagTitl - For extraction of a title
hzXmlSlct m_tagUqid - For extraction of a unique item id
hzVect<hzDocMeta*> m_vecHist - Download history as a vector of document metadata