Clean and store your raw tags like Flickr

Today I was working through how a new application would handle tags, and realized that I strongly believe Flickr has the most robust method of storing and querying them. They do it well, and I wanted to follow their lead.

The main reason I feel they’ve got the best system is how they handle their ‘raw’ tags and their ‘clean’ tags.

When a photo is tagged at Flickr, the tag itself is saved in two different formats – raw and clean. If you were to tag a photo with “St. Patrick’s Day”, that’s what would remain in your list of tags, visible on screen. Behind the scenes, though, Flickr also cleans that tag up and stores it as “stpatricksday”. This is a subtle but powerful model.

It keeps the original tagger happy (“I know how I want to tag things, darn it”) and it makes the tags more functional in terms of finding things later (both for the tagger, and everyone else). The clean tag is what is used in URLs, in the tagclouds, and wherever aggregation is important for statistics. The tags “N.Y.C.” and “NYC” and “nyc” are all ‘cleaned’ down to the same thing (“nyc”) so when a query comes in for nyc, all three original photos would be presented in the results.
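To make the aggregation idea concrete, here is a minimal sketch of grouping photos by their clean tag while keeping each raw tag intact. Everything in it is made up for illustration: the photo ids, the `index_by_clean_tag` helper, and especially `simple_clean`, which is a deliberately tiny stand-in that only strips whitespace and periods before lowercasing. It is not Flickr's rule, just enough to show the N.Y.C. case.

```ruby
# Hypothetical stand-in cleaner: strips only whitespace and periods, then
# lowercases. Just enough for the N.Y.C. example; not Flickr's full rule.
def simple_clean(raw_tag)
  raw_tag.gsub(/[\s.]/, "").downcase
end

# Build an index from clean tag to the [photo_id, raw_tag] pairs behind it.
# `taggings` is an array of [photo_id, raw_tag] pairs.
def index_by_clean_tag(taggings)
  taggings.group_by { |_photo_id, raw_tag| simple_clean(raw_tag) }
end

# Made-up sample data: three spellings of the same tag, plus one other tag.
SAMPLE_TAGGINGS = [
  [101, "N.Y.C."],
  [102, "NYC"],
  [103, "nyc"],
  [104, "Boston"]
].freeze

index = index_by_clean_tag(SAMPLE_TAGGINGS)

# A query is cleaned the same way before the lookup, so "nyc" finds all
# three photos, each still carrying the raw tag its owner typed.
index.fetch(simple_clean("nyc"), []).each do |photo_id, raw_tag|
  puts "photo #{photo_id} tagged as #{raw_tag}"
end
```

The point of the design is that the query goes through the same cleaner as the stored tags, so the lookup key and the stored key can never drift apart.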

I wanted that cleaning function for myself. I looked everywhere today and couldn’t find it documented in any more detail than what the Flickr API pages themselves say:

raw: The ‘raw’ version of the tag – as entered by the user. This version can contain spaces and punctuation.

tag-body: The ‘clean’ version of the tag – as processed by Flickr. This version is used for constructing urls.

Flickr-Like Tag Cleaning Regular Expression

Please let me know if you find any errors, or if this gets out of date. I’ll try to keep it current over time.

The function can be described fairly simply – “remove spaces and punctuation, then lowercase”. I wrote a one-line Ruby regular expression to do this:

clean_tag = raw_tag.gsub(/[\s"!@#\$\%^&*():\-_+=\'\/.;`<>\[\]?\\]/,"").downcase
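As a rough sketch of how the one-liner might be put to work, the snippet below keeps both forms of a tag side by side and uses only the clean form when building a URL. The `StoredTag` struct, the `build_tag` and `tag_path` helpers, and the `/tags/` path layout are all my own assumptions for illustration, not anything Flickr documents.

```ruby
require 'cgi'

# The one-liner from above, wrapped in a method.
def clean_tag(raw_tag)
  raw_tag.gsub(/[\s"!@#\$\%^&*():\-_+=\'\/.;`<>\[\]?\\]/, "").downcase
end

# Hypothetical record that keeps both forms of a tag.
StoredTag = Struct.new(:raw, :clean)

def build_tag(raw_tag)
  StoredTag.new(raw_tag, clean_tag(raw_tag))
end

# Hypothetical URL layout. Only the clean form goes into the path, and it is
# CGI-escaped because the cleaner leaves some characters (such as "|") alone.
def tag_path(tag)
  "/tags/#{CGI.escape(tag.clean)}"
end

tag = build_tag("Mrs. Jones")
puts tag.raw        # shown back to the user exactly as typed
puts tag.clean      # used for lookups and aggregation
puts tag_path(tag)  # used in links
```

Storing the clean form alongside the raw one, rather than recomputing it on every read, means lookups can match on a plain string column.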

The rest of my test code is included below. As far as I could test today, these are the punctuation characters that Flickr scrubs from your raw tags:

#!/usr/local/bin/ruby -w

require 'cgi'

# removes whitespace
# downcases A-Z
# removes 27 different punctuation characters
  # quotation marks
  # exclamation point
  # at symbol
  # pound sign
  # dollar sign
  # percent sign
  # caret
  # ampersand
  # asterisk
  # open parenthesis
  # close parenthesis
  # colon
  # hyphen
  # underscore
  # plus sign
  # equals sign
  # apostrophe
  # forward slash
  # period
  # semicolon
  # backtick
  # open angle bracket
  # close angle bracket
  # open square bracket
  # close square bracket
  # question mark
  # backslash
# does not affect other characters (you should safely CGI.escape these)
  # curly brackets
  # tilde
  # pipe
  # british pound
  # euro symbol
  # chinese characters
def clean_tag(raw_tag)
  raw_tag.gsub(/[\s"!@#\$\%^&*():\-_+=\'\/.;`<>\[\]?\\]/,"").downcase
end

tags = [
  # should remove the offending characters
  "\"double\" quotes",          # quotation marks                     doublequotes
  "!excited!iam!",              # exclamation point                   excitediam
  "test@example.com",           # at symbol                           testexamplecom
  "pound#it",                   # pound sign                          poundit
  "$ave on everyThing",         # dollar sign                         aveoneverything
  "i feel 30% better",          # percent sign                        ifeel30better
  "carats^aretasty",            # caret                               caratsaretasty
  "and&this&and&that",          # ampersand                           andthisandthat
  "maris*61",                   # asterisk                            maris61
  "i think (maybe)",            # open and close parentheses          ithinkmaybe
  "F:ooBar",                    # colon                               foobar
  "hyphen-ated",                # hyphen                              hyphenated
  "under_my_score",             # underscore                          undermyscore
  "1+1=2",                      # plus and equals                     112
  "Saint Patrick's Day",        # apostrophe                          saintpatricksday
  "/leaning/forward/ish",       # forward slash                       leaningforwardish
  "Mrs. Jones",                 # period                              mrsjones
  "semi;automatic;parsing",     # semicolon                           semiautomaticparsing
  "back`tick`here",             # backtick                            backtickhere
  "open<and>close",             # open and close angle brackets       openandclose
  "don't[be]square",            # open and close square brackets      dontbesquare
  "you?sure",                   # question mark                       yousure
  "back\\slash",                # backslash                           backslash
  # should only encode the rest of these
  "crab|vs|pipe",               # pipe                                crab%7Cvs%7Cpipe
  "東京",                         # chinese characters                  %E6%9D%B1%E4%BA%AC
  "£",                          # british pound                       %C2%A3
  "nice {curly} brackets",      # curly brackets                      nice%7Bcurly%7Dbrackets
  "Mötley Crüe",                # umlauts                             m%C3%B6tleycr%C3%BCe
  "Tōkyō"                       # long o                              t%C5%8Dky%C5%8D
].each do |t|
  print t
  print "\n\tcleaned  -->  "
  print clean_tag(t)
  print "\n\tescaped  -->  "
  print CGI.escape(clean_tag(t))
  print "\n"
end
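The script above prints its results for you to compare by eye against the comments. If you would rather have the comparison done for you, here is a small sketch that checks a handful of the same tags against their expected values. The `EXPECTED` table and the `failed_tags` helper are names I made up for this example, and the table covers only a few ASCII rows from the list above.

```ruby
require 'cgi'

def clean_tag(raw_tag)
  raw_tag.gsub(/[\s"!@#\$\%^&*():\-_+=\'\/.;`<>\[\]?\\]/, "").downcase
end

# raw tag => expected CGI.escape(clean_tag(raw)), copied from the comments
# in the tag list above.
EXPECTED = {
  "\"double\" quotes"     => "doublequotes",
  "$ave on everyThing"    => "aveoneverything",
  "1+1=2"                 => "112",
  "Saint Patrick's Day"   => "saintpatricksday",
  "back\\slash"           => "backslash",
  "crab|vs|pipe"          => "crab%7Cvs%7Cpipe",
  "nice {curly} brackets" => "nice%7Bcurly%7Dbrackets"
}.freeze

# Returns the subset of entries whose cleaned and escaped form does not
# match the expected value.
def failed_tags(expected)
  expected.reject { |raw, want| CGI.escape(clean_tag(raw)) == want }
end

failures = failed_tags(EXPECTED)
if failures.empty?
  puts "all #{EXPECTED.size} tags cleaned as expected"
else
  failures.each do |raw, want|
    puts "#{raw.inspect}: wanted #{want}, got #{CGI.escape(clean_tag(raw))}"
  end
end
```

If Flickr ever changes what it scrubs, updating the table and rerunning this is a quicker way to spot the difference than rereading the printed output.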

