Many web applications allow users to upload files, usually for the purpose of sharing information. Either the files ARE the information, e.g. an audio or video clip, or they CONTAIN the information in the form of extractable text, e.g. the resume of a job candidate uploaded as a .docx file. Uploaded files can be stored as files, although the recommended method is to store them in binary datum repositories. Uploaded files are invariably introduced into the system by form submissions, which means a data object, and thus a data class, is involved. The file arrives as the value of a data object member that is of a binary data type. This should be BASETYPE_BINARY unless the file is actually a text file, in which case it will be BASETYPE_TXTDOC.

An uploaded file is of limited use on its own. It is recommended that uploaded files be accompanied by their original filename and by a text description. The latter can be omitted if the file is a text file. Associating such data with files is simple: make the filename and the description members of the data class used in the upload form. The filename will be a member of type STRING. The description will be a member of type TXTDOC and will have an index applied in the applicable repository.

The file description has two possible sources. It could be supplied as part of the form submission, or it could be extracted from the file itself. Automated text extraction is only a good idea in some circumstances, because it requires functionality that depends on the MIME type of the file. There are a considerable number of text-bearing MIME types and new formats are arising all the time. While many are just compressed files that ultimately expand out to XML, extracting the text still requires a detailed knowledge of the format. Where text extraction is desired, the usual approach is to select a narrow set of acceptable MIME types, then source external programs to cover that set.
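The "narrow set of acceptable MIME types" approach can be sketched as a lookup table from MIME type to extractor. The following Python is an illustration only and is not part of Dissemino: the names EXTRACTORS, extract_text, extract_docx and extract_plain are hypothetical, and the .docx extractor uses only the standard library where a real deployment would more likely call an external program. It also shows the point about compressed files expanding to XML, since a .docx file is a zip archive whose body text lives in word/document.xml.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml of a .docx file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"


def extract_docx(data: bytes) -> str:
    """Return the paragraph text of a .docx file, one paragraph per line."""
    # A .docx file is a zip archive; the main body is stored as XML.
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Each paragraph holds runs of text in <w:t> elements.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def extract_plain(data: bytes) -> str:
    """Plain text needs no extraction, only decoding (UTF-8 assumed)."""
    return data.decode("utf-8", errors="replace")


# Hypothetical allow-list: only these MIME types are accepted for extraction.
EXTRACTORS = {
    DOCX_MIME: extract_docx,
    "text/plain": extract_plain,
}


def extract_text(mime_type: str, data: bytes) -> str:
    """Dispatch to the extractor for mime_type, rejecting anything unlisted."""
    try:
        extractor = EXTRACTORS[mime_type]
    except KeyError:
        raise ValueError(f"text extraction not supported for {mime_type}") from None
    return extractor(data)
```

Supporting a further format then means adding one entry to the table, which keeps the accepted set explicit and small.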

Under the Dissemino regime, file type checking based on filename extension is built into the form validation JavaScript generated as part of the page generation process. This is generally considered adequate, as browsers operate on the same principle. It does mean, though, that the server initially just accepts the MIME type sent in by the browser. What happens next depends on the <procdata> section of the relevant formhandler. If the file is to be stored as is, a class member of type BINARY will act as the receptacle. If the given filename (as named on the user's PC) is to be retained, this will require a second class member of type TEXT. If the file is to have text extracted from it (a process which will fail if the MIME type is wrong), the extracted text will have to be placed in yet another class member of type DOCUMENT.
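The extension-based principle can be illustrated with a short Python sketch. This is not the JavaScript that Dissemino generates. The table ALLOWED_EXTENSIONS and both function names are hypothetical, and the second function shows one way a server could cross-check the browser's claimed MIME type against the filename, should that be wanted.

```python
import os

# Hypothetical allow-list: lower-case extension -> MIME type the form accepts.
ALLOWED_EXTENSIONS = {
    ".pdf": "application/pdf",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".txt": "text/plain",
}


def mime_from_filename(filename: str):
    """Return the expected MIME type for an allowed extension, else None."""
    _, ext = os.path.splitext(filename)
    return ALLOWED_EXTENSIONS.get(ext.lower())


def browser_mime_is_plausible(filename: str, claimed_mime: str) -> bool:
    """True if the MIME type sent by the browser matches the file extension."""
    expected = mime_from_filename(filename)
    # Ignore any parameters such as "; charset=utf-8" on the claimed type.
    claimed = claimed_mime.split(";")[0].strip().lower()
    return expected is not None and expected == claimed
```

Neither check inspects the file content, so a renamed file still passes. That is why the text extraction step can fail when the MIME type is wrong.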

So the <procdata> section would look something like this:

<procdata>
    <extract tgt="thetext" src="%e_file->data"/>
    <commit class="clasname" object="objid">
        <set member="filename" input="%e_file->name"/>
        <set member="filedata" input="%e_file->data"/>
        <set member="filetext" input="%e_thetext"/>
    </commit>
</procdata>

Note firstly the use of the -> notation. The class member 'filedata' is of type BINARY and as such has two values when it appears in the HTTP event: the filename, given by ->name, and the file content, given by ->data. Note also that the <extract> instruction creates and populates an event-bound variable called 'thetext' with text extracted from the file content. The extracted text is formatted as XML bound within an <xtreeItem> tag, which can be directly included in a page of generated HTML output. This variable is later used in the <commit> instruction to set the class member 'filetext', which is of type DOCUMENT. In the class definition, the filetext member may have an index applied.

In another page designed to present the uploaded file, there are three options. The extracted text can be included, allowing those without the application installed to see the text (wise for docx). The file can be embedded with either an <object> or an <embed> tag (OK for PDF, as most browsers have the Adobe Reader active). Or there may be separate links allowing a choice, e.g. 'view as text' and 'view as PDF'.