I haven’t been closely following the comment spamming problem, but it looks like it’s hit Trackback now as well. Furthermore, the spammers have discovered flooding and anonymous proxies… It’s become clear to me that these attacks will completely change the nature of the weblog landscape. It was only a matter of time, I suppose. Rather than waiting for it to overtake and destroy the medium (a la USENET), it’d probably be good to be proactive.

At this point, it looks like rate-limiting (and auto-blacklisting based on flooding) is the most effective stopgap. The addition of easy deletion/banning might be a good idea (marking a comment as spam, either from a custom interface or from the page itself, will remove the spam, blacklist the URLs it points to, and blacklist the posting IP). Bayesian-type filtering probably won’t work very well at this point b/c of the lack of headers and the size of the corpus, although a SpamAssassin-like point system might (see also, slashcode noise filters). Using redirects (a la 2.661) may reduce the impetus for spamming (although not for those that are just being annoying). White-listing sort of defeats the purpose, although I could see this whole thing being a good push for a Digital ID system (whether actual DigID or adhoc via PGP/GPG signatures). This could work in conjunction w/ a white-list/black-list system.
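To make the rate-limiting plus auto-blacklisting idea concrete, here is a minimal sketch in Python. Everything in it (the `FloodGuard` name, the thresholds, the in-memory storage) is made up for illustration and is not taken from any weblog package; a real install would persist the ban list and probably key on more than the IP.

```python
import time
from collections import defaultdict, deque


class FloodGuard:
    """Hypothetical per-IP rate limiter that bans repeat flooders."""

    def __init__(self, max_posts=3, window=60.0, strikes_to_ban=2):
        self.max_posts = max_posts            # posts allowed per window
        self.window = window                  # window length in seconds
        self.strikes_to_ban = strikes_to_ban  # rejected posts before a ban
        self.recent = defaultdict(deque)      # ip -> timestamps of accepted posts
        self.strikes = defaultdict(int)       # ip -> number of rejected posts
        self.banned = set()                   # auto-blacklisted IPs

    def allow(self, ip, now=None):
        """Return True if a comment from `ip` should be accepted."""
        now = time.time() if now is None else now
        if ip in self.banned:
            return False
        posts = self.recent[ip]
        # Forget accepted posts that have fallen out of the window.
        while posts and now - posts[0] >= self.window:
            posts.popleft()
        if len(posts) >= self.max_posts:
            # Over the limit: reject, and ban once the strikes add up.
            self.strikes[ip] += 1
            if self.strikes[ip] >= self.strikes_to_ban:
                self.banned.add(ip)
            return False
        posts.append(now)
        return True
```

The comment handler would call `allow(ip)` before saving anything. The "mark as spam" button described above could simply add the poster's IP to `banned` directly.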

For the current flooding, which only serves as an attack tool, it may be a matter of coming up with a number of challenges that can’t be automated (two checkbox questions, one of which will ban you; form fields and questions that are randomly generated), or assigning session ids to track a client regardless of IP. Of course, trackback would be more difficult. For trackbacks, one could run a mathematical filter on the trackback url before (and periodically after) putting it up… That’d have the bonus of checking for linkrot as well. (see also pingback as alternative)

Other people have been putting way more brainpower into this than I; this is just me blabbing off the top of my head.

(I don’t think I have to worry too much about comment or trackback spam right now, the flooders seem to try to attack anyone who writes about them)

Some orkut observations:

  • I got on the other day; by the count algorithm (uid from 1M), I am user 6617. My last friend request earlier today has a user number of about 23K. It will be interesting to see the growth curve of the ‘invite only’ network
  • Everyone on my current friends list is a blogger (also the biggest community I am in is ‘Bloggers’ which was at about 2 or 3 when I joined; now grown to 235)
  • The system looks like it’s written in C#/ASP.NET; it uses a lot of xmlhttp for inline page updating (works in IE, Mozilla, fails silently in Safari)
  • There’s a level of privacy control (restricting information by predefined, but not arbitrary groups)
  • You can add someone as a friend automatically, which is then pending. If you are rejected, you can’t add them again, but they can add you; I’m assuming if you reject them it’ll mean you can never add each other, but I wasn’t really feeling like testing that part out
  • There are some ratings; you can be a fan of someone, and rate your friends (aggregated pseudonymously) on trustworthiness, coolness, and sexiness

Never Mind The Bollocks, Here’s The Wonderchicken – stavrosthewonderchicken ruminates on blogs and punk rock.

Weblogs are a party, damn it, and sometimes they’re publications too, or instead, and sometimes they’re diaries, sometimes they’re pieces of art, sometimes they’re tools for self-promotion, sometimes they’re money-making ventures, sometimes they’re monuments to ego, sometimes they’re massive wanks, sometimes they’re public services, sometimes they’re dedications of faith, sometimes they’re communities.

Thought: would remove human error from spam/ham classification if you sent for training based on message location (if misclassified message is in inbox, always classify as spam, if it’s in the error mailbox, always classify as ham); working on that tonight.

OK, done. Here’s a version of my script that will automatically submit as misclassified spam anything in the inbox, and as misclassified ham anything that’s anywhere else (easier than coding the specific folder, should be just as effective). Sure, it has the possibility of being slightly unsafe, but less than human error, for me at least. (Alternate method would be to tokenize the message, find out whether it’s classified as spam or not, and reverse that; actually not that hard since I already tokenize to strip the X-CRM114-Status header)… Well, it’s late and I’m lazy. Good enough.
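The location rule is small enough to write down. This is a sketch in Python rather than the actual AppleScript; the function name and the "INBOX" default are placeholders, and the two return values are the ‘spam’ and ‘nonspam’ training commands.

```python
def training_label(mailbox: str, inbox_name: str = "INBOX") -> str:
    """Pick the training command from where the misclassified message sits.

    A bad message still in the inbox must be spam the filter missed;
    a bad message anywhere else must be ham that was filed as spam.
    """
    in_inbox = mailbox.strip().lower() == inbox_name.lower()
    return "spam" if in_inbox else "nonspam"
```

Note that this only makes sense for messages you have already decided are misclassified; run it over a correctly filed message and it will happily train the wrong way.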

Just got an email from Tim Conner, developer of BlogApp and BlogScript (he has some neat AppleScript snippets as well), with a couple of string functions he uses:

(**** Example ****)
-- this example will find the word "work" in the string 
-- "Bob went to work." and replace it with "the beach".
set myResult to snr("Bob went to work.", "work", "the beach")
display dialog myResult
--
(**** fast search and replace methods ****)
on snr(the_string, search_string, replace_string)
  return my list_to_string((my string_to_list(the_string, search_string)), replace_string)
end snr
on list_to_string(the_list, the_delim)
  my atid(the_delim)
  set the_string to (every text item of the_list) as string
  my atid("")
  return the_string
end list_to_string
on string_to_list(the_string, the_delim)
  my atid(the_delim)
  set the_list to (every text item of the_string) as list
  my atid("")
  return the_list
end string_to_list
on atid(the_delim)
  set AppleScript's text item delimiters to the_delim
end atid

Should come in handy next time I take the fork to the eye.

I finally got around to setting up postfix, courier, procmail, getmail, and crm114 on my server. It was surprisingly painful considering I already had postfix and courier working. I’ve not figured out why mutt is being a pain…

In any case, I’m now training CRM114. Having done only a few dozen error corrections, it’s already starting to get pretty good. Hopefully I’ll have enough volume in the next couple of weeks to really get it well trained, and then never have to worry about it again.

Training involves forwarding erroneous mail back to yourself prepended with a ‘spam’ or ‘nonspam’ command and your password. Since Apple’s Mail.app doesn’t do full-source forwarding by default, I wrote two AppleScripts to automate the process (one of Apple’s included scripts gets you half-way there). They send out an email automatically as spam/nonspam (stripping the bad X-CRM114-Status line as well) and either delete the message or move it to the inbox as appropriate. (I also have procmail set up to move the training results into their own folder)

Rename the last part to whatever you want your key command to be, and put it in your ~/Library/Scripts/Mail Scripts/ folder.
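For anyone not on Mail.app, the header-stripping step is easy to reproduce elsewhere. Here is a rough Python sketch using the standard `email` module; the function name is made up, and it deliberately stops short of building the training message, since the exact command format depends on your CRM114 setup.

```python
from email import message_from_string


def strip_status_header(raw_source: str, header: str = "X-CRM114-Status") -> str:
    """Return the message source with every `header` line removed."""
    msg = message_from_string(raw_source)
    # Deleting a header removes all occurrences and is a no-op if it is absent.
    del msg[header]
    # as_string() re-serializes the message, so the result is not guaranteed
    # to be byte-for-byte identical to the input apart from the header.
    return msg.as_string()
```

Feed it the full source of the misclassified message and forward the result with the appropriate command.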

Lastly, I want to reiterate how much AppleScript documentation sucks total ass. The Language Guide is useless since it doesn’t have any references to basic operations (so, if Google doesn’t turn anything useful up on applescript string parsing, you’re up a creek – this stuff isn’t in the application dictionaries either…). Basically, the only way to get anything done is to dig around until you find an AppleScript that does something similar.