Python Tidbits

I spent the majority of my waking hours the past couple of days writing a Python script (had a chunk of XML-RPC code already written in Python) that processes zip files from Blackboard and puts them onto Confluence. Some notes:

  • Python’s time module wasn’t installed by Debian until apt-get’ing mxDateTime
  • libclamav’s file scan will catch all kinds of stuff that the buffer scan will miss
  • Python’s string slicing is sweet (str[::-1] for reversing)
  • sorted() isn’t in Python 2.3, so sorting a hash has to be done against the keys as a list
  • mechanize for Python really isn’t there yet
  • It’s too easy to take CPAN for granted.
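
The slicing and sorting notes above can be sketched in a few lines. This is a minimal illustration that avoids sorted() so it matches the Python 2.3 constraint; the function names are made up for the example.

```python
def reverse_string(text):
    # A slice with a step of -1 walks the string backwards.
    return text[::-1]


def items_sorted_by_key(mapping):
    # Without sorted(), copy the keys into a list and sort that in place.
    keys = list(mapping.keys())
    keys.sort()
    return [(key, mapping[key]) for key in keys]


print(reverse_string("confluence"))
print(items_sorted_by_key({"zip": 3, "blackboard": 1, "wiki": 2}))
```

On Python 2.4 and later, sorted(mapping) gives the same key ordering in one call.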

Structured Blogging and Data Representation

Most conversation about structured blogging has dealt with the idea at the application and delivery level (microformats, etc). I’ve been interested in the relationship of blogging and other loose KM applications (specifically wikis and outliners) for a while now. I believe that ultimately these applications are more alike than they are different, and can/should be intrinsically tied together (that’d make it a blikiliner, right?). However, I’ve been hung up on the best way to store, represent, and relate the common data. With the imminent promise of time to pursue these issues, I’ve started to pick my earlier work back up, with a fresh pair of eyes.

  • Caching – my original (and still current) approach towards representing pieces of data (microcontent) has been using directed graphs (with typed nodes and relationships, very much like the RDF model). One of the roadblocks I hit was that unlike with trees, there are no high performance ways to store this in SQL. Last year it occurred to me that I should completely forget that, store them as simply as possible (nodes and relationships), and simply build the partial sets (views) and cache those with an appropriate lookup table. Or I could bite the bullet and see if using an RDF database makes sense.
  • First class data structures – this actually is something I’m still thinking about. In a graph model, maybe it doesn’t really matter as long as you can extract types and relationships. There are some things that you’d definitely want to extract (like external and internal hyperlinks), but that can be pretty trivially done post-hoc… Once you’ve committed to that sort of data representation, all you have to worry about then is how to usably combine that information. Still up for question is how fine grained nodes should be, and how to best point within nodes (think about annotations and purple number style addressing). One could conceivably split elements to DOM constituents pretty easily, but that ups the number of nodes you need to keep track of by an order of magnitude or two (but might be a better alternative to an xpath type approach).
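
The "store it simply, cache the views" idea above might look roughly like the sketch below. Everything here is an assumption made for illustration: the table and column names, the in-memory SQLite database, and the ViewCache class (a plain dictionary standing in for the lookup table).

```python
import sqlite3


def open_store():
    # Two flat tables: typed nodes, and typed relationships between them.
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE nodes (id INTEGER PRIMARY KEY, type TEXT, body TEXT);
        CREATE TABLE edges (src INTEGER, rel TEXT, dst INTEGER);
    """)
    return db


def add_node(db, node_type, body):
    cur = db.execute("INSERT INTO nodes (type, body) VALUES (?, ?)",
                     (node_type, body))
    return cur.lastrowid


def add_edge(db, src, rel, dst):
    db.execute("INSERT INTO edges (src, rel, dst) VALUES (?, ?, ?)",
               (src, rel, dst))


class ViewCache:
    """Hypothetical cache of partial sets, keyed by (source node, relationship)."""

    def __init__(self, db):
        self.db = db
        self._views = {}

    def neighbors(self, src, rel):
        key = (src, rel)
        if key not in self._views:
            # Build the view once; later lookups skip the join entirely.
            self._views[key] = self.db.execute(
                "SELECT n.id, n.type, n.body FROM edges e "
                "JOIN nodes n ON n.id = e.dst "
                "WHERE e.src = ? AND e.rel = ? ORDER BY n.id",
                (src, rel)).fetchall()
        return self._views[key]

    def invalidate(self, src, rel):
        # Call after writing an edge so the view is rebuilt on next read.
        self._views.pop((src, rel), None)


db = open_store()
post = add_node(db, "post", "Structured blogging")
link = add_node(db, "hyperlink", "http://example.org/")
add_edge(db, post, "links_to", link)
cache = ViewCache(db)
print(cache.neighbors(post, "links_to"))
```

The cost of this approach is invalidation: every write has to know which cached views it touches, which is the part a real implementation would need to get right.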

I’m still doing searches to see if there’s been anything new published over the past year or two. Some links:

Web Framework Notes

I’m working on rearchitecture/refactoring of a couple projects right now, one which requires support for custom modules and provides for quick and easy modification by non-experts, and another which is a bit more straightforward, with an emphasis on scaling.

In doing these redesigns, looking at existing frameworks and application designs obviously helps. I’m not vehemently against frameworks, but I have found myself generally reluctant to use most of them, because even the best designed impose conceptual and organizational constraints which oftentimes cancel out any potential productivity gains. So yes, I’m a ‘library’ guy in that sense. However, that doesn’t mean I’m against frameworks, just that I haven’t found one that naturally maps onto my conceptual model of web development.

As with most people, this model has been primarily dominated by page-driven REST-type interaction, but for the past few years, I’ve been trying to come to grips with a holistic approach to handling AJAX (née remote-scripting) and more recently SOAs and APIs. A couple rough notes, probably to be refined:

  • I’m a big fan of the most straightforward controller mapping possible. One of the great things about how scripted application layers like PHP work is that you can start with a URL and figure out how it works from there. Of course, things can quickly get complicated. [this is bad] Minor modifications quickly become a PITA.
  • Separating the view is also a good thing, with the caveat that templating languages generally suck. Recently, a coworker of mine has been working on a simple XML-based view framelet called Phiz, which on a conceptual level is really appealing (implementation-wise, what it requires is a good caching system)
  • OOP concepts are a must – inheritance and polymorphism being my top properties. As far as patterns, especially for modules, the Service Locator and Decorator are on my mind right now. Like most people, I’ve settled primarily on an overall MVC model, although I’ve been thinking a lot about IoC and how event streams might be processed.
  • I’ll have to compare how other frameworks implement AJAX and API responses, but the design I’m working on should handle that interaction with pretty much no duplication or messiness.
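
One common way to serve page, AJAX, and API requests without duplication is to have controllers return plain data and pick a renderer per request. The sketch below is a hypothetical illustration of that general pattern, not the design described above; the route table, renderer names, and sample data are all invented.

```python
import json
from html import escape


def list_posts(params):
    # Controller: returns plain data and knows nothing about output formats.
    return {"posts": [{"id": 1, "title": "Hello & welcome"}]}


def render_html(data):
    # Page view: escape titles so the markup stays well-formed.
    items = "".join("<li>%s</li>" % escape(post["title"])
                    for post in data["posts"])
    return "text/html", "<ul>%s</ul>" % items


def render_json(data):
    # AJAX/API view: same data, different serialization.
    return "application/json", json.dumps(data)


# Straightforward mappings: URL -> controller, format -> renderer.
ROUTES = {"/posts": list_posts}
RENDERERS = {"html": render_html, "json": render_json}


def dispatch(path, params=None, fmt="html"):
    data = ROUTES[path](params or {})
    return RENDERERS[fmt](data)


print(dispatch("/posts"))
print(dispatch("/posts", fmt="json"))
```

Because the controller only produces data, adding another output format means adding one renderer rather than touching every controller.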

uPortal and Feed Cleaning

The nature of syndicated data on the web is such that quality and correctness are oftentimes (nearly invariably) uneven. The RSS specs are themselves rather murky, and even the best of sites will push out the occasional unescaped entity or improper encoding.

As seems to be its natural inclination, uPortal completely ignores this reality and completely barfs when encountering any hint of irregularity. uPortal parses RSS via its XSLT channel utilizing Xalan-J, where “error recovery” means throwing an exception, dying, and spewing an ugly error at the user.

By and large, the most commonly encountered errors are character encoding issues. The uPortal channel, expecting XML, defaults to UTF-8 when encoding is left unspecified. If there are multi-byte characters, you’re screwed. My solution, which so far has fixed all the feeds that we’re currently ingesting, is a two-parter, using a Python first stage and a PHP second stage. Although in most cases you’d want to combine it into one (the Python code, probably), we’re running the two-parter because the latter code came first and is used for other purposes.

(If you’re using uPortal: performance isn’t an issue because the channel gets cached by default for 20m. Be sure though to check that your version of the XSLT channel has my caching patch applied. There was a 3 year old caching bug that caused the channel not to cache for guest layouts and inefficiently for logged in users).

  • The centerpiece of the Python code is to use Mark Pilgrim’s Universal Feed Parser. This, of course, solves all the issues related to parsing different flavors of RSS
  • With version 3.x of the UFP, character encoding is dealt with better, and strings are automatically converted to unicode when possible. From there, output is a simple unicodestring.encode('utf-8') away. PHP deals with unicode rather atrociously by comparison.
  • Note, and this really screwed me for a while, that mxTidy, which the UFP defaults to using if it finds it, does not play nice with unicode and will screw you. So be sure to turn it off. (I haven’t tried µTidylib or TidyHTMLTreeBuilder yet)
  • If tidy worked, it could have taken care of converting your entity characters into numerics, but since I haven’t gotten it working, I instead made entity declarations in the DOCTYPE to cover my bases. You’ll want to load at least the first two, and to be safe, all three of the normative XHTML entity sets
  • After that, I do some HTML to XHTML processing (unnecessary if tidy would work like it should), and also conversion of non-entity ampersands. This is a good one. Here’s the PCRE:
    /&(?!#?[xX]?(?:[0-9a-fA-F]+|\w{1,8});)/
  • Special cases: If you’re loading images via HTTP and your page is on HTTPS, errors may be displayed, so you might want to omit or convert as appropriate
  • Note: One rule you shouldn’t need if you run through the UFP is conversion of smart quotes, but if you’re doing other processing that hasn’t gone through the first steps, that would be a good idea
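
The ampersand rule and the final encoding step above can be sketched in Python using only the standard re module. The pattern is the PCRE from the list, unchanged; the function names are made up for this example, and the feed parsing itself is left out.

```python
import re

# An ampersand that does not start a numeric (&#38;, &#x26;) or named
# (&amp;, &nbsp;) entity reference.
BARE_AMPERSAND = re.compile(r"&(?!#?[xX]?(?:[0-9a-fA-F]+|\w{1,8});)")


def escape_bare_ampersands(text):
    # Existing entities are left alone; only stray ampersands are escaped.
    return BARE_AMPERSAND.sub("&amp;", text)


def clean_for_xml(text):
    # Escape stray ampersands, then hand the channel explicit UTF-8 bytes.
    return escape_bare_ampersands(text).encode("utf-8")


print(escape_bare_ampersands("AT&T &amp; R&D &#38; &nbsp;"))
print(clean_for_xml("caf\u00e9 & bar"))
```

Note that the lookahead only checks that something entity-shaped follows; it does not verify that a named entity is actually declared, which is why the DOCTYPE declarations are still needed.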

And that’s that. Ta-da! The Aristocrats!

Yay Python!

Recently, I’ve been writing some Python CGIs for administrative interfaces (Python’s SOAP and XML-RPC interfaces are really sweet). Today I had to calculate array differences and was looking into doing it in Perl, but it turns out that it’s even easier in Python because it has set functionality built-in. Here’s how to calculate the difference of two arrays:


a_set = set(a_list)
b_set = set(b_list)
aonly = list(a_set - b_set)

Sweet!
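
One caveat with the set approach is that it drops duplicates and does not preserve the original order. If either matters, a small variation works; the function name here is invented for the example.

```python
def ordered_difference(a_list, b_list):
    # Build the set once so each membership test is fast,
    # then keep a_list's own order and duplicates.
    b_set = set(b_list)
    return [item for item in a_list if item not in b_set]


print(ordered_difference(["carol", "alice", "bob", "alice"], ["bob"]))
```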

Um… Is this thing on?

The blog has lain fallow a bit while other areas have taken priority, but I’m going to do my best to get back in the saddle. I can see now how easy it is to just stop once you get out of the habit… And then how it’s sort of awkward to start up again because… well, enough navel gazing.

One of the causes of much busyness is that I finally got around to the long overdue task of clearing stuff off my plate, starting with quitting work (I finish up at the end of the month), which entails wrapping up with a new revision of the campus-wide portal, pending launch of campus-wide wikis, and piloting of blogs (my involvement with Elgg should be continuing).

While working at USC has been a mostly positive experience, there were too many things on the todo list, and something had to give. I’m really looking forward to having at least a few weeks to unwind and knock out a number of little projects starting next month.

There’s a ginormous backlog of cool stuff to comment on (locally and on the web), so more soon.

Tabs vs Spaces

I’m typically not bothered too much by the whole tabs vs spaces thing, but for some reason, it sort of got on my nerves last week (I do most of my editing in vim where it doesn’t really matter, but I’ve been doing some stuff in SubEthaEdit lately) and got me looking at some of the back and forth…

Anyway, while the last article actually got me thinking, in the end, I decided to stick with spaces. Tabs just introduced more complexity, and mixing tabs/spaces leads to all kinds of wonkiness when done improperly. I did decide to be a bit more aggressive about using retab if I got bothered again…

For reference, my vim settings for tabs:

" number of spaces a <Tab> keypress counts for while editing
set softtabstop=2
" indent by two spaces per indent level (>>, <<, autoindent)
set shiftwidth=2
" insert spaces instead of literal tab characters
set expandtab
" display any existing literal tabs as two columns wide
set tabstop=2

More on the TV Thing

The other week, I came across an interesting article in the Weekly Standard (yeah, that Weekly Standard) on Joss Whedon’s upcoming Firefly movie (née half-season cancelled Fox show), Serenity. Firefly was a highly regarded, but totally doomed space-western show that I’d never really bothered to look into despite having heard good things about it (not being a huge Whedon fan, being way too busy, and not watching TV being some strong factors).

The article piqued my interest though, especially in light of the whole Global Frequency thing and my pleasant surprise with Battlestar Galactica (another show doing very smart online stuff — and yeah, I’ve been watching a lot of TV programming the past few months for someone who doesn’t even own a set anymore).

I ended up going through the series this past week, and came away impressed. It starts out smart and grows on you as it goes on – there’s a lot of character, both in the uh, characters, and the settings (a wild-west spin on humanity’s 26th-century interplanetary colonization, with an Anglo-Sino Alliance occupying ‘civilized’ core worlds surrounded by frontiersy border worlds). I should mention however that the pronunciation of the Chinese phrases they interject is mangled laughably beyond belief (and comprehension).

Perhaps as interesting as the story in the show is the meta-story of the show: put in a Friday death slot and shown out of order by Fox TV execs, but subsequently kept on life-support post-cancellation by the dedication of both fans and the production crew, and given a new lease on life as a feature film following strong DVD sales (as strong as they’ve ever been right now, currently the #8 top seller on Amazon.com, 1.5 years after release – and yeah, they sold me a copy).

The Sci-Fi channel will be airing all the Firefly episodes made (including 3 never aired by Fox) in the next couple weeks, which should be an interesting movie lead-in (if my math is right, they won’t get to the end before the movie comes out – maybe that’s planned to encourage DVD sales?)

I don’t know if there’s a moral to this story. The Fox execs are making money hand over fist despite their blindness, but at least it highlights the changing dynamics in entertainment consumption and hints at the additional opportunities available for those with vision – Global Frequency highlighted the opportunities for seeding (ahem) pilots and seeing what takes, while things like Family Guy and Firefly perhaps point to a viable model of recognizing and continuing quality shows canned before their time (that’s exploiting known-quantity, low-risk, untapped revenue streams to the suits).

On Laziness

I’ve noticed that recently I’ve been taking fewer and fewer photos since I’ve gotten my T7. I think that part of it is that since I carry it around everywhere, there’s less of a need to justify bringing it by taking lots of photos – hence, less interesting stuff to post. Even when I do have photos, it’s more of a pain to transfer, since I need a Duo adapter and a separate card reader (vs carrying around my CF/PC card adapter in my laptop slot). And lastly, when I am taking shots, I’m much more likely to do video now than stills. And beyond a cheesy AppleScript, I don’t really have any automation tools for thumbnailing/posting those. As someone once said, “it’s hard work.”

A short death polo clip* from a BBQ in Golden Gate Park this weekend:

Guido!

* no actual deaths