A new Umbraco datatype that helps you import xml/html (pasted or
loaded from a file/url) and applies cleaning operations to remove
content, tags and tag attributes.
Basically it is used as a text content box with a filter applied -
the text input is html/xml, and the result after the filter is
applied is text or some subset of the original html/xml - depending
on your setup of the datatype.
Why would you use this? An obvious situation is that you (or
your content editor) sometimes want to aggregate existing html/xml
files into a site, without going to too much trouble cleaning up
the text before importing it.
E.g. some 3rd party software for handling sport events typically
outputs its results as html pages. An easy way to import these
results into a site would be to add a TidyHtml datatype to a
document type of your site, and then import the result html when
creating a new node of this document type. If the TidyHtml datatype
is set up to handle the sport result html (e.g. by skipping all
content outside and including the <body> tags and afterwards
removing all tag attributes) then your template would be able to
directly output the TidyHtml field content in a nice way, using the
layout etc. from the site.
Installation
Download and install the package - this is done from the
Packages option in the Developer section.
What is installed is only the TidyHtml render-control, i.e. the
building block for a DataType - not any ready-to-use datatypes. You
create the datatype yourself as described below.
Getting started
Set up the data type
Start by creating a new Data Type and selecting the "Tidy Html"
render control from the drop down.
Now you have to specify the TidyHtml properties, i.e. what this
datatype is supposed to tidy up.

External data encoding:
This property specifies the encoding used when loading the text
content from a file/web stream. Choose the encoding that your
source is encoded in.
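To see why this setting matters, here is a minimal Python sketch. It is not the package's code (the package is an Umbraco render control); the function name decode_source and the sample text are made up for illustration. It decodes the same bytes once with the matching encoding and once with a mismatched one.

```python
def decode_source(raw: bytes, encoding: str) -> str:
    """Decode raw file/web bytes using the encoding chosen for the source.

    Hypothetical helper for illustration, not part of the TidyHtml package.
    """
    return raw.decode(encoding)


# "Malmö" as an older results page might store it (ISO-8859-1 / Latin-1)
raw = "Malmö".encode("iso-8859-1")

# Encoding matches the source: the text comes back intact
right = decode_source(raw, "iso-8859-1")

# Encoding does not match: the ö byte is not valid UTF-8 and is replaced
wrong = raw.decode("utf-8", errors="replace")
```

If imported text shows broken national characters, a wrong encoding setting is the first thing to check.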
Output xml (vs. CDATA):
You have the option to save the cleaned-up content either as
normal text (CDATA), as is usual for textbox content, or as XML.
Saving as XML allows you to query it from e.g. your XSLT
macros. When saving as XML the content must be well-formed
- otherwise you will get an error upon saving. By default the
content is saved as CDATA text.
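Typical html is often not valid XML, so it can be handy to test a fragment before switching this option on. The following Python sketch is only an illustration of that check, not how the package validates content. The helper name is_well_formed and the trick of wrapping the fragment in a single <root> element are assumptions made here.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as XML once wrapped in one root.

    Hypothetical helper; the <root> wrapper is an assumption so that
    fragments with several top-level elements can be checked.
    """
    try:
        ET.fromstring(f"<root>{fragment}</root>")
        return True
    except ET.ParseError:
        return False


# Every start tag has a matching end tag
ok = is_well_formed("<table><tr><td>1</td></tr></table>")

# <td> is never closed
bad = is_well_formed("<table><tr><td>1</tr></table>")

# HTML-style <br> without an end tag is fine in html but not in XML
also_bad = is_well_formed("<p>line one<br>line two</p>")
```

Content like the last two examples would have to be cleaned further (or saved as CDATA) before it could be stored as XML.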
Adding tidy-functions:
Here you set up a sequence of tidy-functions as well as the tag
used in the function.
Available functions are:
Function | Description |
Remove tag | Remove start/end tags of this type |
Remove tag and content | Remove start/end tags of this type as well as all content embedded. In case of missing end tags the remaining text will be removed |
Remove tag and subtags | Remove start/end tags of this type as well as all tags of any type from content embedded. In case of missing end tags the remaining text will be removed |
Strip attributes from tag | Remove any attributes from tags of the specified type |
Strip attributes from tag and subtags | Remove any attributes from tags of the specified type and from all tags embedded. In case of missing end tags all tags in the remaining text will be removed |
Remove content outside tag | Remove any content before the start tag and after the end tag. |
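To make the descriptions above more concrete, here is a rough Python sketch of three of the functions. It is an approximation based only on the table, not the package's actual implementation. The function names are made up, the regular expressions do not handle nested tags of the same type, and they assume attribute values contain no ">" character.

```python
import re


def remove_tag(html: str, tag: str) -> str:
    """'Remove tag': drop start/end tags of this type, keep the content."""
    t = re.escape(tag)
    return re.sub(rf"</?{t}\b[^>]*>", "", html, flags=re.IGNORECASE)


def remove_tag_and_content(html: str, tag: str) -> str:
    """'Remove tag and content': drop the tags and everything between them.

    If the end tag is missing, the rest of the text is removed as well.
    """
    t = re.escape(tag)
    pattern = rf"<{t}\b[^>]*>.*?(?:</{t}\s*>|\Z)"
    return re.sub(pattern, "", html, flags=re.IGNORECASE | re.DOTALL)


def strip_attributes(html: str, tag: str) -> str:
    """'Strip attributes from tag': keep the tag, drop its attributes."""
    t = re.escape(tag)
    # Group 2 keeps the "/" of a self-closing tag such as <br/>
    return re.sub(rf"<({t})\b[^>]*?(/?)>", r"<\1\2>", html,
                  flags=re.IGNORECASE)


no_font = remove_tag('<p><font color="red">Hi</font></p>', "font")
# -> '<p>Hi</p>'

no_script = remove_tag_and_content("Results<script>track()</script> 2024", "script")
# -> 'Results 2024'

unclosed = remove_tag_and_content("Results<script>track()", "script")
# -> 'Results'

plain_td = strip_attributes('<td class="n" align="left">Anna</td>', "td")
# -> '<td>Anna</td>'
```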
Use your TidyHtml Data Type
Add your new Tidy Html Data Type to a document type, and create
a content node of the document type.
Now it's time to actually use the new data type.
In this example we would like to use a standard html page
created by a 3rd party running event software as our source.

You have the option to either:
- Copy/paste the text to the textbox of the data type
- Specify a local text file using the browse button - remember to
click "Save" to actually load the file
- Specify a web page URL in the url-edit box - remember to click
"Save" to actually load the page
Either option results in the text appearing in the Data Type
textbox.
Now click TidyHtml to apply your functions to the content.
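Conceptually, this step runs your configured sequence of tidy-functions over the text, one after the other. The Python sketch below illustrates that idea for the sport result case (keep only the body, drop the body tags, strip attributes from the table and its subtags). It is not the package's code: all function names, the sample page and the regular expressions are assumptions made for illustration, and nested tags of the same type are not handled.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL


def remove_content_outside(html: str, tag: str) -> str:
    """Keep only the text from the first start tag to the last end tag."""
    t = re.escape(tag)
    match = re.search(rf"<{t}\b[^>]*>.*</{t}\s*>", html, flags=FLAGS)
    return match.group(0) if match else html


def remove_tag(html: str, tag: str) -> str:
    """Drop start/end tags of this type but keep the content."""
    t = re.escape(tag)
    return re.sub(rf"</?{t}\b[^>]*>", "", html, flags=re.IGNORECASE)


def strip_attributes_tag_and_subtags(html: str, tag: str) -> str:
    """Drop attributes from this tag and from all tags embedded in it."""
    t = re.escape(tag)

    def clean(match: re.Match) -> str:
        # Rewrite every start tag inside the matched element without attributes
        return re.sub(r"<([A-Za-z][\w:-]*)\b[^>]*?(/?)>", r"<\1\2>",
                      match.group(0))

    return re.sub(rf"<{t}\b[^>]*>.*?(?:</{t}\s*>|\Z)", clean, html,
                  flags=FLAGS)


def tidy(html: str, steps) -> str:
    """Apply the (function, tag) pairs in the order they were set up."""
    for func, tag in steps:
        html = func(html, tag)
    return html


steps = [
    (remove_content_outside, "body"),
    (remove_tag, "body"),
    (strip_attributes_tag_and_subtags, "table"),
]

page = ('<html><head><title>Results</title></head>'
        '<body bgcolor="#ffffff"><table border="1" width="100%">'
        '<tr><td align="left">1</td><td class="n">Anna</td></tr>'
        '</table></body></html>')

result = tidy(page, steps)
# -> '<table><tr><td>1</td><td>Anna</td></tr></table>'
```

The order matters: running the attribute step before removing the content outside the body would give the same result here, but removing a tag before a later function refers to it would leave that later function with nothing to match.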

Now render the new Data Type field in, for example, a template. In
this example the Data Type field name is "htmlImport". Alternatively,
if the content is saved as XML, you can reference and query it in
your XSLT macros.
Finally - launch your web page to review the result

And that's it.
The package is available for download below. Sources are also
available for everyone - just send us a mail.