uPortal and Feed Cleaning

Posted on August 11, 2005 by lhl

The nature of syndicated data on the web is such that quality and correctness are often (nearly invariably) uneven. The RSS specs are themselves rather murky, and even the best of sites will push out the occasional unescaped entity or improper encoding.

As seems to be its natural inclination, uPortal completely ignores this reality and completely barfs when encountering any hint of irregularity. uPortal parses RSS via its XSLT channel utilizing Xalan-J, where “error recovery” means throwing an exception, dying, and spewing an ugly error at the user.

By far the most commonly encountered error is a character encoding issue. The uPortal channel, expecting XML, defaults to UTF-8 when the encoding is left unspecified. If there are multi-byte characters, you’re screwed. My solution, which so far has fixed all the feeds that we’re currently ingesting, is a two-parter: a Python first stage and a PHP second stage. In most cases you’d want to combine the two into one (the Python code, probably); we’re running the two-parter because the latter code came first and is used for other purposes.

(If you’re using uPortal: performance isn’t an issue because the channel gets cached by default for 20 minutes. Be sure, though, to check that your version of the XSLT channel has my caching patch applied. There was a three-year-old caching bug that caused the channel not to cache at all for guest layouts and to cache inefficiently for logged-in users.)

The centerpiece of the Python code is Mark Pilgrim’s Universal Feed Parser (UFP). This, of course, solves all the issues related to parsing different flavors of RSS.
With version 3.x of the UFP, character encoding is handled better, and strings are automatically converted to unicode when possible. From there, output is a simple unicodestring.encode('utf-8') away. PHP deals with unicode rather atrociously by comparison.
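
The UFP does its own encoding detection, so the following is only a stdlib illustration of the decode-then-encode idea, not what the parser does internally. It is a minimal Python 3 sketch; the function names and the fallback order (UTF-8, then Windows-1252, then Latin-1) are assumptions made for the example.

```python
def decode_feed_bytes(raw, candidates=("utf-8", "windows-1252")):
    """Decode raw feed bytes with the first candidate encoding that fits."""
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: Latin-1 maps every byte to a code point, so it never raises.
    return raw.decode("latin-1")


def to_utf8(raw):
    """Normalize feed bytes of uncertain encoding to UTF-8 bytes."""
    return decode_feed_bytes(raw).encode("utf-8")
```

Trying UTF-8 first matters: valid UTF-8 is rarely valid by accident, whereas the single-byte encodings will happily accept almost anything.
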
Note, and this really screwed me for a while, that mxTidy, which the UFP defaults to using if it finds it, does not play nice with unicode and will screw you. So be sure to turn it off. (I haven’t tried µTidylib or TidyHTMLTreeBuilder yet.)
If tidy worked, it could have taken care of converting your named entities into numeric character references, but since it doesn’t, I instead added entity declarations to the DOCTYPE to cover my bases. You’ll want to load at least the first two, and to be safe, all three of the normative XHTML entity sets.
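
An alternative to declaring the entity sets is to do the named-to-numeric conversion yourself, which is the job tidy would otherwise have done. This is a different technique from the DOCTYPE approach described above. A minimal Python 3 sketch using the standard library's HTML entity table; `named_to_numeric` is a hypothetical helper name.

```python
import re
from html.entities import name2codepoint

# The five entities XML predefines are left alone; a parser already knows them.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
_NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(text):
    """Rewrite known HTML named entities as numeric character references."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            # Unknown names are passed through untouched rather than guessed at.
            return match.group(0)
        return "&#%d;" % name2codepoint[name]
    return _NAMED_ENTITY.sub(replace, text)
```

With this in place the output no longer depends on the XML parser resolving external entity declarations.
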
After that, I do some HTML to XHTML processing (unnecessary if tidy worked like it should), and also convert bare ampersands that aren’t part of an entity. This is a good one. Here’s the PCRE:
```
/&(?!#?[xX]?(?:[0-9a-fA-F]+|\w{1,8});)/
```
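
The pattern carries over unchanged to Python's `re` module, should you fold this step into the Python stage. A minimal sketch; `escape_bare_ampersands` is a hypothetical name, and replacing with `&amp;` is the assumed intent.

```python
import re

# Same pattern as the PCRE above: an ampersand NOT followed by something
# that looks like an entity or character reference (&name; &#38; &#x26;).
_BARE_AMPERSAND = re.compile(r"&(?!#?[xX]?(?:[0-9a-fA-F]+|\w{1,8});)")


def escape_bare_ampersands(text):
    """Escape ampersands that are not already part of an entity reference."""
    return _BARE_AMPERSAND.sub("&amp;", text)
```

Because the lookahead skips anything already shaped like an entity, running the function twice over the same text is harmless.
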
Special cases: if you’re loading images via HTTP and your page is served over HTTPS, the browser may display security warnings, so you might want to omit or convert those images as appropriate.
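
For the "omit" option, something along these lines would do. It is a rough Python 3 sketch that assumes the markup has already been cleaned up, so each `img` tag is well formed and contains no stray `>`; `drop_insecure_images` is a hypothetical name.

```python
import re

# Matches a whole <img ...> tag whose src attribute starts with http://
_HTTP_IMG = re.compile(
    r"<img\b[^>]*\bsrc\s*=\s*[\"']http://[^>]*>",
    re.IGNORECASE,
)


def drop_insecure_images(markup):
    """Remove img tags that load over plain HTTP; HTTPS images are kept."""
    return _HTTP_IMG.sub("", markup)
```
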
Note: one rule you shouldn’t need if you run through the UFP is conversion of smart quotes, but if you’re doing other processing on content that hasn’t gone through the first steps, it would be a good idea.
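
A smart-quote rule can be as small as a translation table. A minimal Python 3 sketch, assuming the goal is plain ASCII quotes; the C1 code points (0x91 to 0x94) are included on the assumption that Windows-1252 text was mis-decoded as Latin-1 somewhere upstream, and `straighten_quotes` is a hypothetical name.

```python
SMART_QUOTES = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    # Windows-1252 quote bytes that were decoded as Latin-1 (assumption)
    "\x91": "'",
    "\x92": "'",
    "\x93": '"',
    "\x94": '"',
})


def straighten_quotes(text):
    """Replace curly quotes with their plain ASCII equivalents."""
    return text.translate(SMART_QUOTES)
```
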

And that’s that. Ta-da! The Aristocrats!