Defined in file: hzHttpClient.h
The hzWebhost class facilitates automated downloading of the set of documents available at a given domain. This is a parameter-driven and generally recursive process. Starting from a list of one or more 'root' pages, such as the home page, pages are downloaded and any links they contain to other pages are garnered. Then, subject to specified limiting criteria, these latter pages are downloaded in turn. The process terminates when all discovered pages have been downloaded. By default, links are limited to other pages on the same website or on other websites listed as related to it. Other criteria, such as file date and file type, may also apply to the pages read in. Where authentication is required, the authentication sequence is normally by login form submission. The login form is downloaded from a particular URL, the username and password are filled in, and the form is sent back to the URL it indicates (which may or may not be the same). NOTE: this will not work where anti-robot mechanisms are in place, such as Google reCAPTCHA forms. Note also that the sequence of pages to visit may have to include a seemingly pointless visit to a page (normally the home page), purely so that the client is issued a cookie required for the login to be accepted.
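The login form submission described above amounts to posting the form's name-value pairs back to the server. As an illustration only (the helper names below are not part of hzWebhost, and m_Authform is merely the list such pairs would come from), here is a minimal sketch of building an application/x-www-form-urlencoded POST body from name-value pairs:

```cpp
#include <cctype>
#include <string>
#include <utility>
#include <vector>

// Percent-encode one value for an application/x-www-form-urlencoded body.
static std::string urlEncode(const std::string& s)
{
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : s)
    {
        if (std::isalnum(c) || c == '-' || c == '_' || c == '.' || c == '~')
            out += static_cast<char>(c);
        else if (c == ' ')
            out += '+';
        else
        {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 0x0F];
        }
    }
    return out;
}

// Join name-value pairs (such as those a login form list like m_Authform
// would hold) into the body of the POST request.
static std::string formBody(const std::vector<std::pair<std::string, std::string>>& pairs)
{
    std::string body;
    for (const auto& p : pairs)
    {
        if (!body.empty())
            body += '&';
        body += urlEncode(p.first) + '=' + urlEncode(p.second);
    }
    return body;
}
```

The body produced this way is what gets posted to the login URL, alongside any cookies collected in the preliminary page visits.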
This class employs the following private sub-classes:
_nodeList
_pageList
Constructors/Destructors
| hzWebhost* | hzWebhost | (void) | |
| void | ~hzWebhost | (void) |
Public Methods:
| hzEcode | AddBan | (hzString& pageEnding) | Adds a page ending to the ban filter (see m_Banned); links matching it are not visited |
| hzEcode | AddRSS | (hzUrl& rss) | Adds an RSS feed URL for the target website |
| hzEcode | AddRoot | (hzUrl& url, hzString& criteria) | Adds a root URL for the target website |
| hzEcode | AuthBasic | (const char* username, const char* password) | Sets the basic authentication string for the website (if the site uses this method). Once set, all requests to the target website will be submitted with this string in the HTTP header. |
| hzDocument* | Download | (hzUrl& url) | Fetch the page found at the supplied URL and return it as a document (either XML or HTML). Note that if the page has already been downloaded (is in the site's history), it is downloaded again only if its time to live has expired; otherwise this function reloads it from file. Returns a pointer to a newly allocated document, which must be deleted after use. |
| hzEcode | GetRSS | (void) | In general a website can be thought of as a source of 'rolling' news updates in which old pages are deleted, new pages created and existing pages modified on an ad-hoc basis. RSS feeds allow greater ease when syncing an external website to the local machine. By periodically reading one or more RSS feeds, one obtains a set of links which can generally be taken as the set of pages deemed 'current' by the website. By comparing these links to a history file of already fetched links, new pages can be added to a repository as they appear on the site. The RSS feeds are just XML files containing links. This function obtains all the RSS feeds from the site, garners all the links from them and then downloads any pages from the links that are not already in the site history. The feeds themselves are not saved as they will be fetched again. Arguments: None |
| hzEcode | Login | (void) | Execute the login process. This is always a case of downloading each page listed in m_Authsteps (if any) and then posting to the URL given in m_Authpage (if provided) with the name-value pairs listed in m_Authform. Arguments: None |
| void | Logout | (void) | Execute the logout process. Arguments: None Returns: None |
| hzEcode | Scrape | (void) | In general a website can be thought of as a source of 'rolling' news updates in which old pages are deleted, new pages created and existing pages modified on an ad-hoc basis. A scrape captures the current state of the website, or a limited portion of it, to file. The scraping process runs through a set of known links for the website, downloading the page for each in turn. Each downloaded page is then examined for links. Links to domains other than the one in question are ignored. Links to such things as images are also ignored. Remaining links not found in the set of known links are added to this set. The process terminates when all the links have been attempted. The set of known links will need to comprise the site's home page and a login page if this exists and is not the same as the home page. These will usually be enough to 'bootstrap' the rest of the site. Arguments: None |
| hzEcode | Sync | (void) | Run the series of hzWebCMD directives to sync key pages from a website to a repository Arguments: None |
| void | _clear | (void) | Clears the hzWebhost for shutdown or for re-initialization for syncing another website Arguments: None Returns: None |
| hzEcode | _loadstatus | (void) | Load visit status file (called upon startup). This way we do not re-fetch pages that have already been loaded unless they are out of date. Arguments: None |
| hzEcode | _savestatus | (void) | Write out the visit status file. This keeps a record of which URLs have already been downloaded, the files they were saved to, and the expiry date (after which the page will have to be fetched again). Arguments: None |
| hzEcode | getRss_r | (HttpRC& hRet, hzUrl& feed, uint32_t nLevel) | Recursive fetch of RSS documents. The supplied URL is downloaded and loaded into an XML document, where it is tested to ensure it is an XML document. The RSS feed is assumed to contain only links. These links may be to HTML pages or to other (sub) RSS feeds. The HTML pages are end points of the process: they are downloaded, and any links they contain are recorded but not followed. The sub-RSS feeds are then processed by recursive calls to this function. |
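The history rule Download() applies above (re-fetch only when the time to live has expired, otherwise reload from file) can be sketched with standard containers. The DocMeta fields here are assumptions, standing in for whatever hzDocMeta actually records:

```cpp
#include <ctime>
#include <map>
#include <string>

// Per-URL record kept in a download history. The fields are illustrative
// assumptions, not the actual layout of hzDocMeta.
struct DocMeta
{
    std::string filepath;   // where the fetched page was saved
    std::time_t expires;    // time after which the page is stale
};

enum class FetchAction { FromCache, Refetch };

// Decide whether a URL must be fetched again or can be reloaded from file,
// mirroring the rule Download() applies to the site history.
FetchAction decide(const std::map<std::string, DocMeta>& hist,
                   const std::string& url, std::time_t now)
{
    auto it = hist.find(url);
    if (it == hist.end() || now >= it->second.expires)
        return FetchAction::Refetch;
    return FetchAction::FromCache;
}
```

The visit status file written by _savestatus() persists exactly this kind of record between runs, so the decision survives a restart.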
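The core comparison GetRSS() performs, taking the links garnered from all feeds and subtracting those already in the site history, can be sketched as follows (a simplification using plain strings rather than hzUrl):

```cpp
#include <set>
#include <string>
#include <vector>

// Given the links garnered from all RSS feeds and the set of already
// fetched links, return only the pages that still need downloading.
std::vector<std::string> newLinks(const std::vector<std::string>& feedLinks,
                                  const std::set<std::string>& history)
{
    std::vector<std::string> fresh;
    for (const auto& url : feedLinks)
        if (!history.count(url))
            fresh.push_back(url);
    return fresh;
}
```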
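The Scrape() process described above is a classic worklist traversal: attempt every known link, garner new same-domain page links from each result, and stop when nothing new appears. A minimal self-contained sketch, in which fetch(), sameDomain() and isPage() are stand-ins for the download step and the domain and file-type filters:

```cpp
#include <functional>
#include <queue>
#include <set>
#include <string>
#include <vector>

// Worklist form of the scrape loop. Returns the final set of known links.
std::set<std::string> scrape(
    const std::vector<std::string>& roots,
    const std::function<std::vector<std::string>(const std::string&)>& fetch,
    const std::function<bool(const std::string&)>& sameDomain,
    const std::function<bool(const std::string&)>& isPage)
{
    std::set<std::string> known(roots.begin(), roots.end());
    std::queue<std::string> todo;
    for (const auto& r : roots)
        todo.push(r);

    while (!todo.empty())
    {
        std::string url = todo.front();
        todo.pop();
        for (const auto& link : fetch(url))
        {
            if (!sameDomain(link) || !isPage(link))   // ignore offsite and image links
                continue;
            if (known.insert(link).second)            // not seen before
                todo.push(link);
        }
    }
    return known;
}
```

Seeding the roots with the home page (and login page, if distinct) bootstraps the rest of the site, as the description above notes.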
Member Variables:
| hzHttpClient | HC | HTTP client instance | |
| hzString | m_AuthBasic | If set, this is supplied with each GetPage() call. | |
| hzUrl | m_Authexit | Logout URL | |
| hzList<hzPair> | m_Authform | List of name value pairs to submit to the site's login form (given as m_Authpage see below) | |
| hzUrl | m_Authpage | Login page for site (if applicable, may be same as home) | |
| hzList<hzUrl> | m_Authsteps | Initial URL requests that must be made for cookie collecting before the login form can be submitted. | |
| hzSet<hzString> | m_Banned | Filter for banning visitation. Links meeting this are not visited, stored or processed. | |
| hzList<hzWebCMD> | m_Commands | List of commands to effect a SYNC operation | |
| hzUrl | m_ContactUs | Used to post messages to websites | |
| hzString | m_CookiePath | Cookie path (set when a new cookie is offered by the server, used in all subsequent requests) | |
| hzString | m_CookieSess | Session cookie (set when new cookie is offered by server, used in all subsequent requests) | |
| hzMapS<hzString,hzCookie> | m_Cookies | All cookies needed for the session with server | |
| hzSet<hzString> | m_Domains | Allowed domains for the site and its links | |
| hzSet<hzEmaddr> | m_Emails | Email addresses occurring in this page's body | |
| hzList<hzUrl> | m_Feeds | List of RSS feed URLs for the site (see AddRSS) | |
| hzMapS<hzString,hzHtmForm*> | m_Forms | Map of forms found in loaded pages. | |
| hzUrl | m_Homepage | Root URL for site | |
| hzString | m_Name | Canonical name of site, e.g. 'positive news' | |
| hzMapS<hzString,hzWebhost::_nodeList*> | m_Nodelists | Map of lists of selected nodes | |
| hzSet<hzUrl> | m_Offsite | Links discovered that are to pages in other domains or websites | |
| uint32_t | m_Opflags | Operational flags | |
| hzMapS<hzString,hzWebhost::_pageList*> | m_Pagelists | Map of lists of selected links | |
| hzString | m_Password | For controlled access to site | |
| hzString | m_Repos | Target directory for downloaded pages. | |
| hzList<hzPair> | m_Roots | List of root commands (Webscraping only) | |
| uint32_t | m_Sofar | Count of Sync commands executed | |
| hzChain | m_Styles | Stylesheet | |
| hzChain | m_Trace | List of data items garnered (XML format) | |
| hzString | m_Username | For controlled access to site | |
| hzDocument* | m_docAuth | Login page | |
| hzDocument* | m_docHome | Home page | |
| hzMapS<hzUrl,hzDocMeta*> | m_mapHist | Download history: document metadata for each fetched URL | |
| hzDocument* | m_resAuth | Login page response | |
| hzDocument* | m_resLast | Last page downloaded | |
| hzXmlSlct | m_tagDate | For extraction of a date | |
| hzXmlSlct | m_tagDesc | For extraction of a description | |
| hzXmlSlct | m_tagItem | For extraction of an item | |
| hzXmlSlct | m_tagLink | For extraction of a link | |
| hzXmlSlct | m_tagTitl | For extraction of a title | |
| hzXmlSlct | m_tagUqid | For extraction of a unique item id | |
| hzVect<hzDocMeta*> | m_vecHist | Download history: document metadata in order of visitation | |