Defined in file: hzDocument.h
Derivative of: hzDocument
A whole or partial HTML Page or Document
Constructors/Destructors
| hzDocHtml* | hzDocHtml | (void) | |
| NULL-TYPE | hzDocHtml | (void) | |
| void | ~hzDocHtml | (void) | |
| NULL-TYPE | ~hzDocHtml | (void) | |
Public Methods:
| void | Clear | (void) | Recursively clear the tree of nodes. Arguments: None. Returns: None. |
| hzString& | CookiePath | (void) | |
| hzString& | CookieSess | (void) | |
| hzEcode | Export | (hzString& filepath) | Exports an HTML page to a file named as per the supplied file path. |
| uint32_t | ExtractLinksBasic | (hzVect<hzUrl>& links, hzSet<hzString>& domains, hzString& form) | Find all links on a page that lie within a set of acceptable domains and match any supplied criteria. These are aggregated to the supplied vector of link URLs. If no domains or criteria are supplied, all the links in the page will be aggregated. Note that the links in a page are established in the Load() function. This function merely filters them. It does not read the page content. |
| uint32_t | ExtractLinksContent | (hzMapS<hzUrl,hzString>& links, hzSet<hzString>& domains, hzString& criteria) | Find all links on a page that lie within a set of acceptable domains and match any supplied criteria. These are aggregated to the supplied map of link URLs to link content. If no domains or criteria are supplied, all the links in the page will be aggregated. Note that the links in a page are established in the Load() function. This function merely filters them. It does not read the page content. |
| hzEcode | FindElements | (hzVect<hzHtmElem*>& elements, hzString& htag, hzString& attrName, hzString& attrValue) | Find all elements in a page with the given tag name and/or attribute and value. |
| hzEcode | FindElements | (hzVect<hzHtmElem*>& elements, const char* srchExp) | Find all tags meeting the supplied criteria and place pointers to the tags in the supplied results vector. Note: The criteria take the form of one or more name-value pairs as follows: 1) name="some_name"; - Only applies if the element is given an id, which is often not the case 2) type="html_tagtype"; - The element is of the right type, e.g. <table> 3) class="class_value"; - The element has the given class value 4) pname="param_name"; - The element has the parameter 5) pvalue="param_value"; - The element has the parameter value 6) cont="content_value"; - The element has contents of the given value |
| hzEcode | FindElements | (hzVect<hzHtmElem*>& elements, hzString& srchExp) | Select elements from this document according to the supplied search expression. Webpages (HTML documents) commonly contain a lot of superfluous matter whilst confining most information content to a limited set of elements (tags). If it is known which element(s) contain what information (e.g. title, author, body content), FindElements can be used to select these element(s), from which data can then be efficiently extracted. Support functions: SelectElements() itself calls the private member function _selectExp to do the selecting. This places selected elements in a hzSet ordered by their RAM address (this ensures tags are only counted once). SelectElements() then re-orders the elements from the hzSet into a hzVect. _selectExp (hzSet<hzHtmElem*>& elements, const hzString& exp) simply breaks up the expression into a term or 'term op expression' and calls the second support function, _selectTerm(), to find the set of tags for each term. _selectTerm (hzSet<hzHtmElem*>& elements, const hzString& exp) deals only with terms designed to specify elements. Each term consists of one or more tag specifiers which, when multiple, are separated by a + sign. A single tag specifier will identify a list of one or more tags within the document. Subsequent tag specifiers will do the same but will limit the search to descendants of the tags found under the previous tag specifier. _selectTerm() calls the third support function, _selectTag(), on each tag specifier in turn to actually do the selecting. _selectTag (hzSet<hzHtmElem*>& parents, hzSet<hzHtmElem*>& elements, const hzString& exp) uses a single tag specifier to select tags from the HTML document and then, if a list of parents (previously found tags) is supplied, the selected tags are tested to ensure they have an ancestor among the list of parents. Each tag specifier is encased in a <> block and is of the general form <tagname attr1='value1' attr2='value2' ...> where either the tag name or at least one attribute must exist. If an attribute is specified, the tag must match on the attribute to be selected. Wildcards can be used as well. |
| hzHtmElem* | GetRoot | (void) | |
| hzEcode | Import | (hzString& path) | Loads an HTML document into a tree of HTML nodes |
| hzEcode | Load | (hzChain& Z) | Populate the hzDocHtml object with HTML source code in the supplied chain. Two scenarios are permitted, Full or Partial, as follows: 1) Full: If the HTML source has <html> as its first tag it will be considered a full page and tested as such. It will be expected to have the standard sub-tags of <head> and <body> and their corresponding anti-tags. If either of these is missing or in error (malformed or containing unexpected or malformed tags), the HTML source code is deemed to be syntactically in error and the load fails. 2) Partial: If the opening tag of the HTML source code is not the <html> tag, it is viable only as an HTML fragment that could be seamlessly inserted into the <body> part of a whole HTML page. That is to say, all its tags must be legal sub-tags of <body> and not of <head>, and neither the <body> nor the <head> tag or anti-tag may be present. In either case, tags are loaded into a tree of nodes (tags). The nodes/tags may be searched for and examined. Note: Unlike XML, where tags are named so that content in the tree can be searched directly, the nodes in HTML are not named and so cannot be definitively referenced (they only have type). Some other process must apply application-specific criteria to read meaning into the data. |
| hzEcode | Load | (const char* fpath) | Loads an HTML document into a tree of HTML nodes |
| void | Report | (hzLogger& xlog) | Show list of nodes plus content. Returns: None. |
| hzDoctype | Whatami | (void) | |
| hzEcode | _htmPreproc | (hzChain& Z) | Remove comments and non-applicable conditional comments from HTML |
| hzHtmElem* | _proctag | (hzHtmElem* pParent, hzChain::Iter& ci, hzHtagtype type) | This assumes the chain iterator is currently at a '<' char and that this is the start of an HTML tag or anti-tag. To succeed, the tag must be both a known HTML tag and of the correct form. If successful, the iterator will be advanced to one place beyond the terminating '>'. If unsuccessful, the iterator will be left unchanged. Returns: Pointer to a new hzHtmElem if the operation was successful; NULL if the function could not identify a tag. Scope: Private to the hzDocHtml class. |
| void | _report | (hzLogger& xlog, hzHtmElem* node) | Recursive support function for the non-recursive hzDocHtml::Report. Returns: None. |
| hzEcode | _selectExp | (hzSet<hzHtmElem*>& elements, hzString& srchExp) | Recursive support function for hzDocHtml::SelectElements (see below). Breaks up the expression into a term or 'term op expression' and calls _selectTerm to find the set of tags for each term. The terms can be enclosed in parentheses but individually they take the form of tags enclosed in a <> block. The tag name is the first and often only part, but attributes may optionally be specified after it. |
| hzEcode | _selectTag | (hzSet<hzHtmElem*>& parents, hzSet<hzHtmElem*>& elements, hzString& tagspec) | Finds the set of tags meeting the supplied tag specifier. |
| hzEcode | _selectTerm | (hzSet<hzHtmElem*>& elements, hzString& term) | A 'term', within the context of HTML document tag selection, can be a specification of a single tag or it can specify multiple tags. In the latter case, where multiple tag specifiers are concatenated, hierarchy is implied. Selection works on the basis of more detail, more tests. For example, the term <div> will populate the set of elements found with every <div> tag in the document. The term <div class> will only find div tags with an attribute of 'class', while the term <div class="body"> will only find div tags that have an attribute of class whose value is 'body'. It should be noted, however, that tags are selected if they have what is asked for in the term. There is presently no means to exclude tags that have something we don't want them to have. A hierarchical concatenated term such as <div class='body'><p> will find every paragraph tag in the document whose parent tag is a div with an attribute of class whose value is 'body'. If no div tags meet those criteria, nothing will be selected. Likewise, if div tags do meet the <div class="body"> test but are not followed directly by the <p> tag, nothing is selected. Note that multiple tag terms are implemented by multiple calls to _selectTag, with the selection of tags found being reduced by each call. |
| hzEcode | _xport | (hzChain& Z, hzHtmElem* node) | Recursive support function for hzDocHtml::Export. It exports the full tag (including attributes and content) of the supplied node and all subnodes to the supplied chain. |
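
The tag specifier behaviour described for _selectTerm and _selectTag can be illustrated with a small standalone sketch. It does not use or reproduce the library's implementation: Elem, TagSpec, parseTerm, matches and selectTerm are hypothetical names, elements are held in a plain std::vector with parent indices, and wildcards are not handled. The hierarchy test assumed here is "has an ancestor among the previously selected tags", as stated for _selectTag. The parser skips anything between specifiers, so both `<div><p>` and `<div>+<p>` are accepted.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for an HTML element; not the library's hzHtmElem.
struct Elem {
    std::string tag;
    std::map<std::string, std::string> attrs;
    int parent;  // index of the parent element, -1 for the root
};

// Hypothetical parsed form of one specifier such as <div class='body'>.
struct TagSpec {
    std::string tag;                            // empty: any tag name
    std::map<std::string, std::string> attrs;   // empty value: attribute must merely exist
};

// Split a term into its <...> specifiers and parse each one.
std::vector<TagSpec> parseTerm(const std::string& term) {
    std::vector<TagSpec> specs;
    std::size_t i = 0;
    while ((i = term.find('<', i)) != std::string::npos) {
        std::size_t end = term.find('>', i);
        if (end == std::string::npos) break;
        std::string body = term.substr(i + 1, end - i - 1);
        TagSpec spec;
        bool first = true;
        std::size_t p = 0;
        while (p < body.size()) {
            while (p < body.size() && std::isspace(static_cast<unsigned char>(body[p]))) ++p;
            if (p >= body.size()) break;
            std::size_t q = p;
            while (q < body.size() && body[q] != '=' &&
                   !std::isspace(static_cast<unsigned char>(body[q]))) ++q;
            std::string name = body.substr(p, q - p);
            if (q < body.size() && body[q] == '=') {
                ++q;  // step over '='
                std::string value;
                if (q < body.size() && (body[q] == '\'' || body[q] == '"')) {
                    char quote = body[q++];
                    std::size_t r = body.find(quote, q);
                    if (r == std::string::npos) r = body.size();
                    value = body.substr(q, r - q);
                    q = (r < body.size()) ? r + 1 : r;
                } else {
                    std::size_t r = q;
                    while (r < body.size() && !std::isspace(static_cast<unsigned char>(body[r]))) ++r;
                    value = body.substr(q, r - q);
                    q = r;
                }
                spec.attrs[name] = value;
            } else if (first) {
                spec.tag = name;        // first bare word is the tag name
            } else {
                spec.attrs[name] = "";  // bare attribute: existence test only
            }
            first = false;
            p = q;
        }
        specs.push_back(spec);
        i = end + 1;
    }
    return specs;
}

// An element matches if it has everything the specifier asks for.
bool matches(const Elem& e, const TagSpec& s) {
    if (!s.tag.empty() && e.tag != s.tag) return false;
    for (const auto& kv : s.attrs) {
        auto it = e.attrs.find(kv.first);
        if (it == e.attrs.end()) return false;
        if (!kv.second.empty() && it->second != kv.second) return false;
    }
    return true;
}

// Apply each specifier in turn; later specifiers keep only elements
// that have an ancestor among the elements selected by the previous one.
std::vector<int> selectTerm(const std::vector<Elem>& doc, const std::string& term) {
    std::vector<int> found;
    bool firstSpec = true;
    for (const TagSpec& spec : parseTerm(term)) {
        std::vector<int> next;
        for (int i = 0; i < static_cast<int>(doc.size()); ++i) {
            if (!matches(doc[i], spec)) continue;
            bool ok = firstSpec;
            for (int a = doc[i].parent; !ok && a >= 0; a = doc[a].parent) {
                assert(a < static_cast<int>(doc.size()));  // parent indices must be valid
                ok = std::find(found.begin(), found.end(), a) != found.end();
            }
            if (ok) next.push_back(i);
        }
        found = next;
        firstSpec = false;
    }
    return found;
}
```

Each additional specifier can only shrink the selection, which mirrors the note that the set of tags found is reduced by each call to _selectTag. Switching the inner loop to inspect only `doc[i].parent` would give a strict parent test instead of an ancestor test.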
Member Variables:
| hzString | m_Base | Base for URLs beginning with / | |
| hzChain | m_Content | Full content of web-page | |
| hzString | m_CookiePath | Set by HTML header upon Browse() or LoadHtml() or LoadFile() | |
| hzString | m_CookieSess | Set by HTML header upon Browse() or LoadHtml() or LoadFile() | |
| hzSet<hzEmaddr> | m_Emails | Email addresses occurring in this page's body | |
| hzString | m_EntityTag | Entity tag from the header if given | |
| hzList<hzHtmForm*> | m_Forms | List of forms appearing in the page (if any) | |
| hzString | m_Title | Will be filename on export | |
| hzArray<hzHtmElem> | m_arrNodes | Complete set of nodes, in order of appearance in the document | |
| hzMapM<hzString,hzHtmElem*> | m_mapTags | All nodes within document | |
| hzHtmElem* | m_pBody | All tags found in the body (body is level 1) | |
| hzHtmElem* | m_pHead | All tags found in the header (head is level 1) | |
| hzHtmElem* | m_pRoot | All tags found on level 0 | |
| hzSet<hzUrl> | m_setLinks | Links to other pages occurring in this page's body | |
| hzVect<hzUrl> | m_vecLinks | All links in the order they appear | |
| hzVect<hzHtmElem*> | m_vecTags | All elements in the order they appear | |
| hzVect<hzString> | m_vecText | All text sections found in page | |
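
The filtering rule given for ExtractLinksBasic (links already collected at load time are only filtered, and an empty domain set or empty criteria means "accept all") can be sketched in standalone C++ as follows. This is an illustration under stated assumptions, not the library's code: hostOf and filterLinks are hypothetical names, URLs are plain std::string rather than hzUrl, the criteria are assumed to be a simple substring test, and the return value is assumed to be the number of links added.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: host part of an absolute URL, or "" if there is no scheme.
std::string hostOf(const std::string& url) {
    std::size_t start = url.find("://");
    if (start == std::string::npos) return "";
    start += 3;
    std::size_t end = url.find_first_of("/:?#", start);
    return url.substr(start, end == std::string::npos ? std::string::npos : end - start);
}

// Append to 'out' every link whose host is in 'domains' (if any are given)
// and which contains 'criteria' (if given). Assumed to return the count added.
std::uint32_t filterLinks(std::vector<std::string>& out,
                          const std::vector<std::string>& pageLinks,
                          const std::set<std::string>& domains,
                          const std::string& criteria) {
    std::uint32_t added = 0;
    for (const std::string& link : pageLinks) {
        if (!domains.empty() && domains.count(hostOf(link)) == 0) continue;
        if (!criteria.empty() && link.find(criteria) == std::string::npos) continue;
        out.push_back(link);
        ++added;
    }
    assert(added <= pageLinks.size());  // filtering never invents links
    return added;
}
```

Because the function only reads the pre-collected link list, it never touches page content, which matches the description of the Extract functions as pure filters over what Load() established.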