I spent an hour or so this morning writing a regular expression to conditionally convert HTML character entities. Finding character entities is easy: &([A-Za-z]+;|#[0-9]+;), and ‘grep -v’ will give you the opposite. Of course, when doing substitutions in Perl there is no ‘-v’ option, so this becomes a bit more problematic. Here’s my solution:
s/&([^a-z#]|[a-z]+(?=[^a-z;]+)|#\d+(?=[^\d;]+))/&amp;$1/i
The three alternatives, in order:
- [^a-z#] : doesn't begin w/ alpha or #
- [a-z]+(?=[^a-z;]+) : alpha, but no ;
- #\d+(?=[^\d;]+) : '#' but no ;
You need to use lookaheads to ensure that the string you’re matching is not followed by a semicolon, i.e. that it isn’t already a complete entity.
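For anyone who wants to poke at this outside Perl, here is a rough Python sketch of the same idea. It assumes the substitution is meant to turn a bare & into &amp; and that the digit classes are \d; the function names are made up for illustration. The second function is a variant using a negative lookahead, which is just one other way to write it and not necessarily what anyone else would use.

```python
import re

# The pattern from the post, carried over to Python's re module.
# Assumption: a bare "&" should become "&amp;", and "\d" is the digit class.
BARE_AMP = re.compile(
    r"&([^a-z#]|[a-z]+(?=[^a-z;]+)|#\d+(?=[^\d;]+))",
    re.IGNORECASE,
)

def escape_bare_ampersands(text):
    """Escape ampersands that do not start an entity (hypothetical helper)."""
    return BARE_AMP.sub(r"&amp;\1", text)

# Variant: a negative lookahead says "an & that is NOT followed by
# letters-or-#digits and a semicolon". It also catches an & at the
# very end of the string, which the pattern above does not.
NOT_ENTITY = re.compile(r"&(?!(?:[a-z]+|#\d+);)", re.IGNORECASE)

def escape_with_negative_lookahead(text):
    """Same goal as above, written with a negative lookahead (hypothetical helper)."""
    return NOT_ENTITY.sub("&amp;", text)

if __name__ == "__main__":
    for sample in ["Tom & Jerry", "&amp; &#169; stay", "AT&T rocks", "AT&T"]:
        print(repr(sample), "->", repr(escape_bare_ampersands(sample)),
              "|", repr(escape_with_negative_lookahead(sample)))
```

One thing the sketch makes visible: because each positive lookahead needs at least one character after the match, the original pattern leaves an & (or something like &T) at the very end of the input untouched, and in a run like && it only escapes the first one. The negative-lookahead variant handles both cases.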
Of course, some time later a friend sent me an example that does basically the same thing.
Anyway, some good came out of this: I broke out the copy of Mastering Regular Expressions that I hadn’t touched in a while, and I have a few links that are of interest:
- The Regex Coach – OMFG this program kicks so much ass. The highlighting/step-through and tree are especially nice
- Regular Expression Library – not much in here, but it’s a good idea
I’ve spent the rest of the day so far shooting around town. I need to acquire 7 stills from this video that ‘represent’ LA.