Defined in file: hzDocument.h

Derivative of: hzDocument

A whole or partial HTML Page or Document

Constructors/Destructors

hzDocHtml*	hzDocHtml	(void)
void	~hzDocHtml	(void)

Public Methods:

void	Clear	(void)	Recursively clears the tree of nodes. Arguments: None. Returns: None.
hzString&	CookiePath	(void)
hzString&	CookieSess	(void)
hzEcode	Export	(hzString& filepath)	Exports an HTML page to the file named by the supplied file path.
uint32_t	ExtractLinksBasic	(hzVect<hzUrl>& links, hzSet<hzString>& domains, hzString& form)	Finds all links on a page that lie within a set of acceptable domains and match any supplied criteria, and appends them to the supplied vector of link URLs. If no domains or criteria are supplied, all links in the page are aggregated. Note that the links in a page are established by the Load() function. This function merely filters them; it does not read the page content.
uint32_t	ExtractLinksContent	(hzMapS<hzUrl,hzString>& links, hzSet<hzString>& domains, hzString& criteria)	Finds all links on a page that lie within a set of acceptable domains and match any supplied criteria, and adds them to the supplied map of link URLs to link content. If no domains or criteria are supplied, all links in the page are aggregated. Note that the links in a page are established by the Load() function. This function merely filters them; it does not read the page content.
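The filtering rule described for the two ExtractLinks functions can be sketched in standard C++ as follows. This is an illustration only, not the library's code: Link and filterLinks are hypothetical names, std::string stands in for hzUrl and hzString, and matching the criteria by substring is an assumption, since the text above does not say how criteria are compared.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a link that was already collected when the page was loaded.
struct Link {
    std::string domain;  // e.g. "example.com"
    std::string url;     // full URL
};

// Appends to 'out' every link whose domain is in 'domains' and whose URL contains
// 'criteria'. An empty domain set or an empty criteria string imposes no restriction,
// so with both empty every link is aggregated. Returns the number of links appended.
// Substring matching is an assumption made for this sketch.
std::size_t filterLinks(const std::vector<Link>& pageLinks,
                        const std::set<std::string>& domains,
                        const std::string& criteria,
                        std::vector<std::string>& out)
{
    std::size_t added = 0;
    for (const Link& link : pageLinks) {
        if (!domains.empty() && domains.count(link.domain) == 0)
            continue;  // domain not acceptable
        if (!criteria.empty() && link.url.find(criteria) == std::string::npos)
            continue;  // criteria not met
        out.push_back(link.url);
        ++added;
    }
    return added;
}
```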
hzEcode	FindElements	(hzVect<hzHtmElem*>& elements, hzString& htag, hzString& attrName, hzString& attrValue)	Finds all elements in a page with the given tag name and/or attribute and value.
hzEcode	FindElements	(hzVect<hzHtmElem*>& elements, const char* srchExp)	Finds all tags meeting the supplied criteria and places pointers to them in the supplied results vector. Note: The criteria take the form of one or more name-value pairs, as follows: 1) name="some_name"; - Only applies if the element has been given an id, which is often not the case. 2) type="html_tagtype"; - The element is of the given type, e.g. <table>. 3) class="class_value"; - The element has the given class value. 4) pname="param_name"; - The element has the given parameter. 5) pvalue="param_value"; - The element has the given parameter value. 6) cont="content_value"; - The element has content of the given value.
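The name-value form of the criteria listed above can be illustrated with a small parser. This is a sketch under assumptions, not the library's parser: parseCriteria is a hypothetical name, and the handling of whitespace and malformed input is a choice made for this example.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Parses criteria of the form  name="value"; name="value";  into a map of
// name -> value. Returns false if a pair has no name, no opening quote or no
// closing quote. Spaces and semicolons between pairs are skipped.
bool parseCriteria(const std::string& expr, std::map<std::string, std::string>& out)
{
    std::size_t pos = 0;
    while (pos < expr.size()) {
        if (expr[pos] == ' ' || expr[pos] == ';') {
            ++pos;
            continue;
        }
        std::size_t eq = expr.find('=', pos);
        if (eq == std::string::npos || eq == pos)
            return false;  // no '=' or empty name
        if (eq + 1 >= expr.size() || expr[eq + 1] != '"')
            return false;  // value must be quoted
        std::size_t close = expr.find('"', eq + 2);
        if (close == std::string::npos)
            return false;  // unterminated value
        out[expr.substr(pos, eq - pos)] = expr.substr(eq + 2, close - eq - 2);
        pos = close + 1;
    }
    return true;
}
```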
hzEcode	FindElements	(hzVect<hzHtmElem*>& elements, hzString& srchExp)	Selects elements from this document according to the supplied search expression. Webpages (HTML documents) commonly contain a lot of superfluous matter whilst confining most information content to a limited set of elements (tags). If it is known which elements contain which information (e.g. title, author, body content), FindElements can be used to select those elements, from which data can then be extracted efficiently. Support functions: SelectElements() itself calls the private member function _selectExp to do the selecting. This places the selected elements in a hzSet ordered by their RAM address, which ensures each tag is counted only once. SelectElements() then re-orders the elements from the hzSet into a hzVect. _selectExp (hzSet<hzHtmElem>& elements, const hzString& exp) simply breaks up the expression into a term or 'term op expression' and calls the second support function, _selectTerm(), to find the set of tags for each term. _selectTerm (hzSet<hzHtmElem>& elements, const hzString& exp) deals only with terms designed to specify elements. Each term consists of one or more tag specifiers which, when multiple, are separated by a + sign. A single tag specifier identifies a list of one or more tags within the document. Subsequent tag specifiers do the same but limit the search to descendants of the tags found under the previous tag specifier. _selectTerm() calls the third support function, _selectTag(), on each tag specifier in turn to actually do the selecting. _selectTag (hzSet<hzHtmElem>& parents, hzSet<hzHtmElem>& elements, const hzString& exp) uses a single tag specifier to select tags from the HTML document and then, if a list of parents (previously found tags) is supplied, tests the selected tags to ensure each has an ancestor among the list of parents.
Each tag specifier is encased in a <> block and is of the general form <tagname attr1='value1' attr2='value2' ...>, where either the tag name or at least one attribute must be present. If an attribute is specified, the tag must match on that attribute to be selected. Wildcards can be used as well.
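As an illustration of the tag specifier form just described, the sketch below splits a specifier into an optional tag name and a set of attribute tests. TagSpec and parseTagSpec are hypothetical names, not part of the library, and wildcards are deliberately left out.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical parsed form of one tag specifier.
struct TagSpec {
    std::string name;                          // empty if only attributes were given
    std::map<std::string, std::string> attrs;  // attribute -> required value ("" = any value)
};

// Parses <tagname attr1='value1' attr2 ...>. Returns false if the <> block is
// missing, a quoted value is unterminated, or neither a name nor an attribute is given.
bool parseTagSpec(const std::string& spec, TagSpec& out)
{
    if (spec.size() < 3 || spec.front() != '<' || spec.back() != '>')
        return false;
    std::size_t i = 1;
    const std::size_t last = spec.size() - 1;  // index of the closing '>'

    auto skipSpaces = [&] {
        while (i < last && std::isspace(static_cast<unsigned char>(spec[i]))) ++i;
    };
    auto readWord = [&] {
        std::size_t start = i;
        while (i < last && !std::isspace(static_cast<unsigned char>(spec[i])) && spec[i] != '=') ++i;
        return spec.substr(start, i - start);
    };

    skipSpaces();
    std::size_t first = i;
    std::string word = readWord();
    if (i < last && spec[i] == '=')
        i = first;        // first word is an attribute, so there is no tag name
    else
        out.name = word;

    for (skipSpaces(); i < last; skipSpaces()) {
        std::string attr = readWord();
        if (attr.empty())
            return false;
        std::string value;
        if (i < last && spec[i] == '=') {
            ++i;
            if (i >= last || (spec[i] != '\'' && spec[i] != '"'))
                return false;
            char quote = spec[i++];
            std::size_t close = spec.find(quote, i);
            if (close == std::string::npos || close >= last)
                return false;
            value = spec.substr(i, close - i);
            i = close + 1;
        }
        out.attrs[attr] = value;
    }
    return !out.name.empty() || !out.attrs.empty();
}
```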
hzHtmElem*	GetRoot	(void)
hzEcode	Import	(hzString& path)	Loads an HTML document into a tree of HTML nodes
hzEcode	Load	(hzChain& Z)	Populates the hzDocHtml object with the HTML source code in the supplied chain. Two scenarios are permitted, Full or Partial, as follows: 1) Full: If the HTML source has <html> as its first tag, it is considered a full page and tested as such. It is expected to have the standard sub-tags <head> and <body> and their corresponding anti-tags. If either of these is missing or in error (malformed, or containing unexpected or malformed tags), the HTML source code is deemed syntactically in error and the load fails. 2) Partial: If the opening tag of the HTML source code is not the <html> tag, the source is viable only as an HTML fragment that could be seamlessly inserted into the <body> part of a whole HTML page. That is to say, all its tags must be legal sub-tags of <body> and not of <head>, and no <body> or <head> tag or anti-tag may be present. In either case, tags are loaded into a tree of nodes (tags). The nodes/tags may be searched for and examined. Note: Unlike XML, where tags are named so that content in the tree can be searched directly, the nodes in HTML are not named and so cannot be definitively referenced (they only have a type). Some other process must apply application-specific criteria to read meaning into the data.
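The Full/Partial decision described for Load can be sketched as a simple classifier over the raw source. This is a rough illustration under stated simplifications, not the library's logic: HtmlKind, containsTag and classify are hypothetical names, a leading DOCTYPE or comment is not handled, and the legality of individual sub-tags is not checked.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

enum class HtmlKind { Full, Partial, Invalid };

// True if 'src' contains 'prefix' (e.g. "<head" or "</head") followed by '>' or
// whitespace, so that "<head" does not match "<header".
static bool containsTag(const std::string& src, const std::string& prefix)
{
    for (std::size_t p = src.find(prefix); p != std::string::npos; p = src.find(prefix, p + 1)) {
        std::size_t after = p + prefix.size();
        if (after < src.size() &&
            (src[after] == '>' || std::isspace(static_cast<unsigned char>(src[after]))))
            return true;
    }
    return false;
}

// Full:    first tag is <html>, and <head>, </head>, <body>, </body> are all present.
// Partial: first tag is not <html>, and no head or body tag or anti-tag is present.
// Anything else is treated as invalid in this sketch.
HtmlKind classify(const std::string& source)
{
    std::string src = source;
    std::transform(src.begin(), src.end(), src.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    std::size_t open = src.find('<');
    if (open == std::string::npos)
        return HtmlKind::Invalid;

    bool hasHead = containsTag(src, "<head"), hasHeadEnd = containsTag(src, "</head");
    bool hasBody = containsTag(src, "<body"), hasBodyEnd = containsTag(src, "</body");

    bool startsWithHtml = src.compare(open, 5, "<html") == 0 && open + 5 < src.size() &&
        (src[open + 5] == '>' || std::isspace(static_cast<unsigned char>(src[open + 5])));

    if (startsWithHtml)
        return (hasHead && hasHeadEnd && hasBody && hasBodyEnd) ? HtmlKind::Full : HtmlKind::Invalid;
    return (hasHead || hasHeadEnd || hasBody || hasBodyEnd) ? HtmlKind::Invalid : HtmlKind::Partial;
}
```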
hzEcode	Load	(const char* fpath)	Loads an HTML document from the supplied file path into a tree of HTML nodes.
void	Report	(hzLogger& xlog)	Shows the list of nodes plus content. Returns: None.
hzDoctype	Whatami	(void)
hzEcode	_htmPreproc	(hzChain& Z)	Removes comments and non-applicable conditional comments from the HTML.
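A comment-stripping pass of the kind _htmPreproc performs might look like the sketch below. It is a simplification, not the library's code: stripComments is a hypothetical name, it works on std::string rather than hzChain, and it removes every <!-- ... --> block without deciding whether a conditional comment applies.

```cpp
#include <cstddef>
#include <string>

// Returns a copy of 'html' with every <!-- ... --> block removed.
// An unterminated comment causes the rest of the input to be dropped.
std::string stripComments(const std::string& html)
{
    std::string out;
    std::size_t pos = 0;
    while (pos < html.size()) {
        std::size_t open = html.find("<!--", pos);
        if (open == std::string::npos) {
            out.append(html, pos, std::string::npos);  // no more comments: copy the rest
            break;
        }
        out.append(html, pos, open - pos);             // copy text before the comment
        std::size_t close = html.find("-->", open + 4);
        if (close == std::string::npos)
            break;                                     // unterminated comment
        pos = close + 3;                               // resume after "-->"
    }
    return out;
}
```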
hzHtmElem*	_proctag	(hzHtmElem* pParent, hzChain::Iter& ci, hzHtagtype type)	Assumes the chain iterator is currently at a '<' char and that this is the start of an HTML tag or anti-tag. To succeed, the tag must be both a known HTML tag and of the correct form. If successful, the iterator is advanced to one place beyond the terminating '>'. If unsuccessful, the iterator is left unchanged. Returns: Pointer to a new hzHtmElem if the operation was successful; NULL if the function could not identify a tag. Scope: Private to the hzDocHtml class.
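The iterator contract described for _proctag (advance past '>' on success, leave the position untouched on failure) can be illustrated as follows. readTag is a hypothetical function over std::string, and the short list of known tags is a placeholder chosen for this sketch, not the library's tag table.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>

// Tries to read a tag or anti-tag starting at src[pos], which should be '<'.
// On success: stores the lower-cased tag name in 'name', sets 'closing' for an
// anti-tag, moves 'pos' to one past the terminating '>' and returns true.
// On failure: returns false and leaves 'pos', 'name' and 'closing' unchanged.
bool readTag(const std::string& src, std::size_t& pos, std::string& name, bool& closing)
{
    // Placeholder set of known tags, for illustration only.
    static const std::set<std::string> known = {"html", "head", "body", "div", "p", "a", "table"};

    std::size_t i = pos;  // work on a copy so failure cannot move 'pos'
    if (i >= src.size() || src[i] != '<')
        return false;
    ++i;
    bool isClosing = (i < src.size() && src[i] == '/');
    if (isClosing)
        ++i;

    std::string word;
    while (i < src.size() && std::isalnum(static_cast<unsigned char>(src[i])))
        word += static_cast<char>(std::tolower(static_cast<unsigned char>(src[i++])));
    if (known.count(word) == 0)
        return false;     // not a known tag

    std::size_t gt = src.find('>', i);
    if (gt == std::string::npos)
        return false;     // tag is never terminated

    name = word;
    closing = isClosing;
    pos = gt + 1;
    return true;
}
```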
void	_report	(hzLogger& xlog, hzHtmElem* node)	Recursive support function for the non-recursive hzDocHtml::Report. Returns: None.
hzEcode	_selectExp	(hzSet<hzHtmElem*>& elements, hzString& srchExp)	Recursive support function for hzDocHtml::SelectElements (see below). Breaks up the expression into a term or 'term op expression' and calls _selectTerm to find the set of tags for each term. The terms can be enclosed in parentheses but individually, they take the form of tags enclosed in a <> block. The tag name is the first and often only part, but attributes may optionally be specified after it.
hzEcode	_selectTag	(hzSet<hzHtmElem>& parents, hzSet<hzHtmElem>& elements, hzString& tagspec)	Finds the set of tags meeting the supplied tag specifier.
hzEcode	_selectTerm	(hzSet<hzHtmElem*>& elements, hzString& term)	A 'term', within the context of HTML document tag selection, can specify a single tag or multiple tags. In the latter case, where multiple tag specifiers are concatenated, hierarchy is implied. Selection works on the basis of more detail, more tests. For example, the term <div> will populate the set of elements found with every <div> tag in the document. The term <div class> will only find div tags with an attribute of 'class', while the term <div class="body"> will only find div tags that have an attribute of class whose value is 'body'. It should be noted, however, that tags are selected if they have what is asked for in the term. There is presently no means to exclude tags that have something we don't want them to have. A hierarchical concatenated term such as <div class='body'><p> will find every paragraph tag in the document whose parent tag is a div with an attribute of class whose value is 'body'. If no div tags meet those criteria, nothing will be selected. Likewise, if div tags do meet the <div class="body"> test but are not followed directly by the <p> tag, nothing is selected. Note that multiple-tag terms are implemented by multiple calls to _selectTag, with the selection of tags found being reduced by each call.
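The narrowing behaviour described for _selectTerm can be sketched with plain standard containers. Elem, Spec, matches and selectTerm are hypothetical names used only for illustration, each specifier is limited to one attribute test, and the hierarchy check uses the direct parent, following the <div class='body'><p> example above.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical, minimal model of a document element.
struct Elem {
    std::string tag;
    std::map<std::string, std::string> attrs;
    const Elem* parent = nullptr;
};

// One tag specifier: empty fields mean "no test on this part".
struct Spec {
    std::string tag;
    std::string attr;
    std::string value;
};

// More detail, more tests: tag name, then attribute presence, then attribute value.
static bool matches(const Elem& e, const Spec& s)
{
    if (!s.tag.empty() && e.tag != s.tag)
        return false;
    if (s.attr.empty())
        return true;
    std::map<std::string, std::string>::const_iterator it = e.attrs.find(s.attr);
    if (it == e.attrs.end())
        return false;
    return s.value.empty() || it->second == s.value;
}

// Applies each specifier in turn. After the first, an element is kept only if its
// parent was selected by the previous specifier, so each pass reduces the selection.
std::set<const Elem*> selectTerm(const std::vector<Elem>& doc, const std::vector<Spec>& term)
{
    std::set<const Elem*> found;
    for (std::size_t n = 0; n < term.size(); ++n) {
        std::set<const Elem*> next;
        for (const Elem& e : doc)
            if (matches(e, term[n]) && (n == 0 || found.count(e.parent) > 0))
                next.insert(&e);
        found.swap(next);
    }
    return found;
}
```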
hzEcode	_xport	(hzChain& Z, hzHtmElem* node)	Recursive support function for hzDocHtml::Export. It exports the full tag (including attributes and content) of the supplied node and all subnodes to the supplied chain.
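A recursive export of this shape can be sketched as below. Node and exportNode are hypothetical stand-ins for hzHtmElem and _xport, a std::string takes the place of the hzChain, and the sketch ignores void tags such as <br> and does no character escaping.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical, minimal tree node.
struct Node {
    std::string tag;
    std::vector<std::pair<std::string, std::string> > attrs;
    std::string text;            // text content placed before any children
    std::vector<Node> children;
};

// Appends the full tag of 'n' (attributes, content and all subnodes) to 'out'.
void exportNode(const Node& n, std::string& out)
{
    out += "<" + n.tag;
    for (std::size_t i = 0; i < n.attrs.size(); ++i)
        out += " " + n.attrs[i].first + "=\"" + n.attrs[i].second + "\"";
    out += ">";
    out += n.text;
    for (std::size_t i = 0; i < n.children.size(); ++i)
        exportNode(n.children[i], out);  // recurse into each subnode in document order
    out += "</" + n.tag + ">";
}
```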

Member Variables:

hzString	m_Base	Base for URLs beginning with /
hzChain	m_Content	Full content of web-page
hzString	m_CookiePath	Set by HTML header upon Browse() or LoadHtml() or LoadFile()
hzString	m_CookieSess	Set by HTML header upon Browse() or LoadHtml() or LoadFile()
hzSet<hzEmaddr>	m_Emails	Email addresses occurring in this page's body
hzString	m_EntityTag	Entity tag from the header if given
hzList<hzHtmForm*>	m_Forms	List of forms appearing in the page (if any)
hzString	m_Title	Will be filename on export
hzArray<hzHtmElem>	m_arrNodes	Complete set of nodes, in order of appearance in the document
hzMapM<hzString,hzHtmElem*>	m_mapTags	All nodes within document
hzHtmElem*	m_pBody	All tags found in the body (body is level 1)
hzHtmElem*	m_pHead	All tags found in the header (head is level 1)
hzHtmElem*	m_pRoot	All tags found on level 0
hzSet<hzUrl>	m_setLinks	Links to other pages occurring in this page's body
hzVect<hzUrl>	m_vecLinks	All links in the order they appear
hzVect<hzHtmElem*>	m_vecTags	All elements in the order they appear
hzVect<hzString>	m_vecText	All text sections found in page