I’ve been playing around with categorization for a while, and have been watching the recent rise of freetagging with great interest. A friend, Nick Mote, recently wrote a paper that does a good job of both summarizing developments from a close-to-information-sciences perspective and outlining several near-term issues.
Among them, I believe that both disambiguation and synonym merging are relative non-issues. For the former, the ease of intersections almost makes it moot from the practical perspective of searching. For the latter issue, we are already beginning to see automated solutions (related tags).
One of the reasons for the relative ease of solving these problems is that the applicable relevance algorithms are already quite familiar to lay web practitioners (i.e., people like me, without a CS Ph.D) from their long-time use in e-commerce (collaborative filtering), spam filters (mathematical filters), and now social networks (web of trust). [my faves: clustering, adaptive resonance, context graphs, CRM114]
Anyway, what I wanted to ruminate on was the non-hierarchical freetag model of unions, intersections, and differences, and see if there’s a way to build a practical (both in terms of backend implementation and user interface) bridge with (more) traditional hierarchical faceted classification.
The first hurdle is, I suppose, coming up with a convincing argument that hierarchies are worthwhile. I think it’s quite obvious that in everyday life we categorize and subcategorize often, and that treating hierarchy as a first-class object isn’t completely out of the realm of sense. The real question is whether there’s a way of reintroducing hierarchy that doesn’t reintroduce the problems it caused in the first place.
First, let’s talk a bit about data structures. Traditional structures will explicitly delineate parent/child relationships, either via pointers or relational structures. Note that this can be generalized into the generic subject/predicate/object triples that we see in RDF. While I’m very partial to typed links (and late binding and dynamic properties… keep on target), I think we can see that this will lead to a level of complexity that will work against both ease of use (first rule of getting user participation) and social/corpus relevance matching (although spam filtering engines like CRM114 are built for sparse data).
Before we get to something I’m throwing out, I’d also like to mention that in our faceted hierarchies, Celko’s nested set and adjacency list models (he recently published a whole book on trees in SQL) won’t directly work, as we’re dealing with what will likely be very bushy graphs (think overlapping, possibly-cyclic digraphs). A real mess, huh?
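To make the mess a bit more concrete, here's a rough sketch of parent/child links stored as subject/predicate/object triples. The `child_of` predicate, the sample data, and the function names are all made up for illustration; the point is just that once a node can have two parents or sit in a cycle, even "give me everything under foo" needs a visited set.

```python
from collections import defaultdict

# Hypothetical sample data: (subject, predicate, object) triples.
triples = [
    ("bar", "child_of", "foo"),
    ("qux", "child_of", "foo"),
    ("baz", "child_of", "bar"),
    ("baz", "child_of", "qux"),  # baz has two parents
    ("foo", "child_of", "baz"),  # and this closes a cycle
]

def children_map(triples, predicate="child_of"):
    """Group subjects under the object they point at via `predicate`."""
    children = defaultdict(set)
    for subject, pred, obj in triples:
        if pred == predicate:
            children[obj].add(subject)
    return children

def descendants(root, triples):
    """Collect every node reachable below `root`, tolerating cycles."""
    children = children_map(triples)
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child not in seen:  # the guard that keeps cycles from looping forever
                seen.add(child)
                stack.append(child)
    return seen
```

Note that with the cycle in the sample data, `descendants("foo", triples)` ends up containing `foo` itself, which is exactly the kind of thing a strict tree model isn't built to express.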
So, my 3AM brainfart last night was to try attacking from the point of view of using traditional tagging structures, and taking the idea of separators for hierarchies and improving on that. For example:
tag: foo
tag: foo.bar
tag: foo.bar.baz
tag: foo.qux
If we do a search for foo(\..*)?, we will get everything within the tree ‘foo’, inclusive. This relieves us of many of the disadvantages of traditional hierarchical representation, and only marginally increases the complexity of either searches or tag-renaming (the former can be globbed relatively inexpensively and the latter is costly either way).
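Here's a minimal sketch of that subtree search, assuming tags are plain dot-separated strings. The function names and the extra sample tags (`foobar`, `quux.foo`) are mine, added to show what should not match. The pattern is anchored so that a root of `foo` picks up `foo` and anything below it, but not a tag that merely starts with the same letters.

```python
import re

# The tags from the example, plus two hypothetical near-misses.
tags = ["foo", "foo.bar", "foo.bar.baz", "foo.qux", "foobar", "quux.foo"]

def subtree_pattern(root):
    """Regex matching `root` itself or any dotted descendant of it."""
    return re.compile(re.escape(root) + r"(\..+)?")

def in_subtree(tag, root):
    """Same test without a regex: exact match, or the root followed by a dot."""
    return tag == root or tag.startswith(root + ".")

def search(root, tags):
    """Return every tag inside the tree rooted at `root`, inclusive."""
    pattern = subtree_pattern(root)
    return [t for t in tags if pattern.fullmatch(t)]
```

The `startswith` version is the one that maps onto a cheap prefix glob in most storage backends; the regex is only there to show the equivalent pattern.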
Now, the main crux of the matter comes with the user interface end. ‘foo.bar.baz’ is a pain in the ass to type. Sure, your non-hierarchical option is to type ‘foo’ and ‘bar’ and ‘baz’, but this, at least from the input side, removes one of the advantages of hierarchical input.
In this case, then, why not do masking? When storing/searching, take both the entire ‘foo.bar.baz’ as well as the most specific child identifier ‘baz’. This creates a new disambiguation issue:
tag: baz
tag: baz (foo.bar.baz)
tag: baz (foo.qux.quux.baz)
From a search/aggregation perspective this might not necessarily be bad, as it’d combat the sparseness issue, but from an entry perspective it again minimizes the hierarchy usefulness quotient (at this point, one begins to ask: are intersections that bad? The answer, of course, is no in most cases, but yes in some).
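As a sketch of the masking idea, assuming each item carries one hierarchical tag: index the item under both the full path and the last segment. The item ids and function names below are hypothetical. The leaf key aggregates everything called ‘baz’, while a second helper lists the full paths behind a leaf, which is what you'd need to show someone to disambiguate.

```python
from collections import defaultdict

# Hypothetical (item, tag) pairs mirroring the three 'baz' cases above.
tagged_items = [
    ("post-1", "baz"),
    ("post-2", "foo.bar.baz"),
    ("post-3", "foo.qux.quux.baz"),
]

def leaf_of(tag):
    """Most specific child identifier: the segment after the last dot."""
    return tag.rsplit(".", 1)[-1]

def build_index(tagged_items):
    """Store each item under its full path and under its leaf."""
    index = defaultdict(set)
    for item, tag in tagged_items:
        for key in {tag, leaf_of(tag)}:
            index[key].add(item)
    return index

def full_paths(leaf, tagged_items):
    """All distinct tags that collapse onto the same leaf."""
    return sorted({tag for _, tag in tagged_items if leaf_of(tag) == leaf})

index = build_index(tagged_items)
```

Looking up `index["baz"]` gives all three posts, while `index["foo.bar.baz"]` gives only the one tagged with that exact path.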
A UI solution for this is to have an auto-completing combobox that recognizes hierarchies. This widget is useful for traditional freetagging as well, so it is a worthwhile avenue to pursue regardless.
[Err, revisions, less half-bakedness forthcoming, as well as some code. Well, might as well put this out for commenting. Finishing my trackback/comment code might be good.. ah, screw it, will procrastinate later. Back to real work for now]
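The matching logic behind such a combobox might look something like this. It's only a sketch: the `suggest` function, its ranking rule (shallower paths first), and the sample tag list are my own assumptions, not anything from an existing widget. The idea is that typing ‘ba’ should surface hierarchical tags whose segments start with ‘ba’, and typing ‘foo.q’ should keep narrowing down the path.

```python
def suggest(typed, known_tags, limit=10):
    """Return known tags where the full path or any segment starts with `typed`."""
    typed = typed.lower()
    matches = []
    for tag in known_tags:
        lowered = tag.lower()
        segments = lowered.split(".")
        if lowered.startswith(typed) or any(s.startswith(typed) for s in segments):
            matches.append(tag)
    # Assumed ranking: shallower paths first, then alphabetical.
    matches.sort(key=lambda t: (t.count("."), t))
    return matches[:limit]

# Hypothetical set of tags already in the system.
known = ["foo", "foo.bar", "foo.bar.baz", "foo.qux", "foo.qux.quux.baz"]
```

With this data, `suggest("ba", known)` offers `foo.bar`, `foo.bar.baz`, and `foo.qux.quux.baz`, so the user types the short leaf and picks the full path instead of typing it out.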