TidyHtml

A new Umbraco datatype that helps you import xml/html (pasted in, or loaded from a file or url) and apply cleaning operations that remove content, tags and tag attributes.

Basically it is a text content box with a filter applied: the input is html/xml, and the result after filtering is either plain text or some subset of the original html/xml, depending on how you set up the datatype.

Why would you use this? An obvious situation is that you (or your content editor) sometimes want to bring existing html/xml files into a site without spending much effort cleaning up the text before import.

For example, third-party software for handling sport events typically outputs its results as html pages. An easy way to import these results into a site is to add a TidyHtml datatype to a document type of your site, and then import the result html when creating a new node of this document type. If the TidyHtml datatype is set up to handle the sport result html (e.g. by removing everything outside the <body> tags, removing the <body> tags themselves, and afterwards removing all tag attributes) then your template will be able to output the TidyHtml field content directly, in a nice way, using the layout etc. from the site.

 

Installation

Download and install the package - this is done from the Packages option in the Developer section.

The package installs only the building block for a Data Type: the TidyHtml render control becomes available, but no ready-to-use datatypes are created for you.

 

Getting started

Set up the data type

Start by creating a new Data Type and selecting the "Tidy Html" render control from the drop-down.

Now you have to specify the TidyHtml properties, i.e. what this datatype is supposed to tidy up.

TidyHtml_DataTypeSetup

 

External data encoding:

This property specifies the encoding used when loading the text content from a file/web stream. Choose the encoding that your source is actually encoded in.
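
Why the encoding choice matters can be shown with a minimal sketch. It uses Python's standard library purely for illustration, not the package's own code; the helper name decode_source and the sample text are hypothetical.

```python
def decode_source(raw: bytes, encoding: str) -> str:
    """Decode bytes loaded from a file or web stream using the configured encoding."""
    return raw.decode(encoding)


# Hypothetical sample: a result line containing a non-ASCII character.
text = "Løb 1"

# Source saved as ISO-8859-1 and decoded with the matching encoding: text survives.
print(decode_source(text.encode("iso-8859-1"), "iso-8859-1"))

# Source saved as UTF-8 but decoded as ISO-8859-1: the "ø" comes out garbled.
print(decode_source(text.encode("utf-8"), "iso-8859-1"))
```

If imported text shows garbled special characters, a mismatch like the second call is the first thing to check.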

Output xml (vs. CDATA):

You can save the cleaned-up content either as normal text (CDATA), as is usual for textbox content, or as XML. Saving as XML allows you to query the content from e.g. your XSLT macros. When saving as XML the content must be well-formed XML - otherwise you will get an error upon saving. By default the content is saved as CDATA text.
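
The XML option only accepts content that parses as XML. The following is a minimal sketch of such a check, assuming Python's standard xml.etree.ElementTree parser rather than whatever the package uses internally; the function name is_well_formed and the wrapping <root> element are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as XML once wrapped in a single root element."""
    try:
        # Wrap in a root element so fragments with several top-level tags can be parsed.
        ET.fromstring(f"<root>{fragment}</root>")
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<table><tr><td>1</td></tr></table>"))  # well-formed
print(is_well_formed("<p>Line one<br>Line two</p>"))          # unclosed <br>
```

Typical html such as an unclosed <br> fails this kind of check, so content like that is better saved as CDATA or cleaned up first.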

Adding tidy-functions:

Here you set up a sequence of tidy-functions, each paired with the tag it operates on. The functions are applied in the order listed.

Available functions are:

Function: Description
Remove tag: Remove start/end tags of this type.
Remove tag and content: Remove start/end tags of this type as well as all content embedded. In case of missing end tags the remaining text will be removed.
Remove tag and subtags: Remove start/end tags of this type as well as all tags of any type from content embedded. In case of missing end tags the remaining text will be removed.
Strip attributes from tag: Remove any attributes from tags of the specified type.
Strip attributes from tag and subtags: Remove any attributes from tags of the specified type and from all tags embedded. In case of missing end tags all tags in the remaining text will be removed.
Remove content outside tag: Remove any content before the start tag and after the end tag.
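
To make the behaviour of a function sequence concrete, here is a rough sketch of four of the functions and how they chain. It is an illustration in Python using regular expressions, not the package's actual implementation; all function names, the tidy helper and the sample html are assumptions. The regex approach does not handle nested tags of the same type or a ">" inside attribute values.

```python
import re


def remove_tag(html: str, tag: str) -> str:
    """Remove start/end tags of this type, keeping the embedded content."""
    return re.sub(rf"</?{re.escape(tag)}\b[^>]*>", "", html, flags=re.IGNORECASE)


def remove_tag_and_content(html: str, tag: str) -> str:
    """Remove the tags and all embedded content; with no end tag, remove the rest of the text."""
    t = re.escape(tag)
    pattern = rf"<{t}\b[^>]*>.*?(?:</{t}\s*>|\Z)"
    return re.sub(pattern, "", html, flags=re.IGNORECASE | re.DOTALL)


def strip_attributes(html: str, tag: str) -> str:
    """Remove any attributes from start tags of the specified type."""
    t = re.escape(tag)
    return re.sub(rf"<({t})\b[^>]*?(/?)>", r"<\1\2>", html, flags=re.IGNORECASE)


def remove_content_outside_tag(html: str, tag: str) -> str:
    """Keep only the span from the first start tag to the last end tag of this type."""
    t = re.escape(tag)
    match = re.search(rf"<{t}\b[^>]*>.*</{t}\s*>", html, flags=re.IGNORECASE | re.DOTALL)
    # Assumption: if the tag is not found, the text is left unchanged.
    return match.group(0) if match else html


def tidy(html: str, steps) -> str:
    """Apply (function, tag) pairs in order, as in the datatype setup."""
    for func, tag in steps:
        html = func(html, tag)
    return html


# Hypothetical source page, similar to the sport result example.
source = (
    "<html><head><title>Results</title></head>"
    '<body bgcolor="white"><table border="1">'
    '<tr><td class="n">1</td></tr></table></body></html>'
)

steps = [
    (remove_content_outside_tag, "body"),
    (remove_tag, "body"),
    (strip_attributes, "table"),
    (strip_attributes, "td"),
]

print(tidy(source, steps))
```

The order matters: removing the content outside <body> first means the later steps only touch the part of the page you intend to keep.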

 

Use your TidyHtml Data Type

Add your new Tidy Html Data Type to a document type, and create a content node of the document type.

Now it's time to actually use the new data type.

In this example the source is a standard html page produced by 3rd party software for running events.

TidyHtml_SprintCupHtmlSource

You have the option to either:

  1. Copy/paste the text to the textbox of the data type
  2. Specify a local text file using the browse button - remember to click "Save" to actually load the file
  3. Specify a web page URL in the url-edit box - remember to click "Save" to actually load the page

Either option results in the text appearing in the Data Type textbox.

Now click TidyHtml to apply your functions to the content.

TidyHtml_ResultatNode

 

Now render the new Data Type field in e.g. a template. In this example the Data Type field name is "htmlImport". Alternatively, if the content is saved as XML, you can reference and query it in your XSLT macros.

 

TidyHtml_Template

Finally - launch your web page to review the result.

TidyHtml_SprintCupResult

 

 

And that's it.

The package is available for download below. Sources are also available for everyone - just send us a mail.