Today I was working out how a new application should handle tags, and realized that I strongly believe Flickr has the most robust method of storing and querying them. They do it well, and I wanted to follow their lead.
The main reason I feel they’ve got the best system is how they handle their ‘raw’ tags and their ‘clean’ tags.
When a photo is tagged at Flickr, the tag itself is saved in two different formats: raw and clean. If you were to tag a photo with “St. Patrick’s Day”, that’s what would remain in your list of tags, visible on screen. Behind the scenes, though, Flickr cleans that tag up into “stpatricksday”. This is a subtle but powerful model.
It keeps the original tagger happy (“I know how I want to tag things, darn it”) and it makes the tags more functional in terms of finding things later (both for the tagger, and everyone else). The clean tag is what is used in URLs, in the tagclouds, and wherever aggregation is important for statistics. The tags “N.Y.C.” and “NYC” and “nyc” are all ‘cleaned’ down to the same thing (“nyc”) so when a query comes in for nyc, all three original photos would be presented in the results.
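To make the raw/clean split concrete, here is a minimal sketch of how an application might store both forms. The `TagIndex` class and its method names are hypothetical, and its `clean` method is a simplified stand-in that only strips whitespace, periods and apostrophes; it is not Flickr's actual rule.

```ruby
# Hypothetical in-memory tag index: the raw tag is kept for display,
# while photos are grouped under the cleaned tag for lookups.
class TagIndex
  def initialize
    # clean tag => array of [photo_id, raw_tag] pairs
    @photos_by_clean_tag = Hash.new { |hash, key| hash[key] = [] }
  end

  # Simplified stand-in cleaner (an assumption for this sketch):
  # drops whitespace, periods and apostrophes, then lowercases.
  def clean(raw_tag)
    raw_tag.gsub(/[\s.']/, "").downcase
  end

  # Store the photo under the clean tag, remembering the raw tag.
  def tag(photo_id, raw_tag)
    @photos_by_clean_tag[clean(raw_tag)] << [photo_id, raw_tag]
  end

  # Clean the query the same way, so any spelling finds the same photos.
  def find(query)
    @photos_by_clean_tag.fetch(clean(query), [])
  end
end

index = TagIndex.new
index.tag(1, "N.Y.C.")
index.tag(2, "NYC")
index.tag(3, "nyc")

index.find("nyc").each do |photo_id, raw_tag|
  puts "photo #{photo_id} (tagged as #{raw_tag.inspect})"
end
```

Because the query goes through the same cleaning step as the stored tags, searching for “N.Y.C.” or “nyc” returns the same three photos, while each photo still shows the tag exactly as its owner typed it.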
I wanted that cleaning function for myself. I looked everywhere today, and couldn’t find it described in any more detail than on the Flickr API pages themselves:
raw: The ‘raw’ version of the tag – as entered by the user. This version can contain spaces and punctuation.
tag-body: The ‘clean’ version of the tag – as processed by Flickr. This version is used for constructing urls.
Flickr-Like Tag Cleaning Regular Expression
Please let me know if you find any errors, or if this gets out of date. I’ll try to keep it current over time.
The function can be described fairly simply: “remove spaces and punctuation, then lowercase”. I wrote a one-line Ruby regular expression to do this:
clean_tag = raw_tag.gsub(/[\s"!@#\$\%^&*():\-_+=\'\/.;`<>\[\]?\\]/,"").downcase
The rest of my test code is included below. As far as I could test today, this is the ‘punctuation’ that Flickr scrubs from your raw tags:
#!/usr/local/bin/ruby -w
require 'cgi'
# removes whitespace
# downcases A-Z
# removes 27 different punctuation characters
#   quotation marks
#   exclamation point
#   at symbol
#   pound sign
#   dollar sign
#   percent sign
#   caret
#   ampersand
#   asterisk
#   open parenthesis
#   close parenthesis
#   colon
#   hyphen
#   underscore
#   plus sign
#   equals sign
#   apostrophe
#   forward slash
#   period
#   semicolon
#   backtick
#   open angle bracket
#   close angle bracket
#   open square bracket
#   close square bracket
#   question mark
#   backslash
# does not affect other characters (you should safely CGI.escape these)
#   curly brackets
#   tilde
#   pipe
#   British pound
#   euro symbol
#   Chinese characters
# returns the cleaned tag: whitespace and listed punctuation removed, then lowercased
def clean_tag(raw_tag)
  raw_tag.gsub(/[\s"!@#\$\%^&*():\-_+=\'\/.;`<>\[\]?\\]/, "").downcase
end
# each comment reads: character being tested -> expected output
tags = [
  # should remove the offending characters
  "\"double\" quotes",       # quotation marks -> doublequotes
  "!excited!iam!",           # exclamation point -> excitediam
  "test@example.com",        # at symbol -> testexamplecom
  "pound#it",                # pound sign -> poundit
  "$ave on everyThing",      # dollar sign -> aveoneverything
  "i feel 30% better",       # percent sign -> ifeel30better
  "carats^aretasty",         # caret -> caratsaretasty
  "and&this&and&that",       # ampersand -> andthisandthat
  "maris*61",                # asterisk -> maris61
  "i think (maybe)",         # open and close parentheses -> ithinkmaybe
  "F:ooBar",                 # colon -> foobar
  "hyphen-ated",             # hyphen -> hyphenated
  "under_my_score",          # underscore -> undermyscore
  "1+1=2",                   # plus and equals -> 112
  "Saint Patrick's Day",     # apostrophe -> saintpatricksday
  "/leaning/forward/ish",    # forward slash -> leaningforwardish
  "Mrs. Jones",              # period -> mrsjones
  "semi;automatic;parsing",  # semicolon -> semiautomaticparsing
  "back`tick`here",          # backtick -> backtickhere
  "open<and>close",          # open and close angle brackets -> openandclose
  "don't[be]square",         # open and close square brackets -> dontbesquare
  "you?sure",                # question mark -> yousure
  "back\\slash",             # backslash -> backslash
  # should only encode the rest of these
  "crab|vs|pipe",            # pipe -> crab%7Cvs%7Cpipe
  "東京",                    # Chinese characters -> %E6%9D%B1%E4%BA%AC
  "£",                       # British pound -> %C2%A3
  "nice {curly} brackets",   # curly brackets -> nice%7Bcurly%7Dbrackets
  "Mötley Crüe",             # umlauts -> m%C3%B6tleycr%C3%BCe
  "Tōkyō"                    # long o -> t%C5%8Dky%C5%8D
].each do |t|
  print t
  print "\n\tcleaned --> "
  print clean_tag(t)
  print "\n\tescaped --> "
  print CGI.escape(clean_tag(t))
  print "\n"
end
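If you would rather have the script check itself than compare the printed output by eye, the same idea can be sketched as a small table of expected values. This is only an illustration: it covers a handful of the tags above, and the `expected` and `failures` names are my own. The regular expression is the one from this post, so it reflects my testing rather than anything Flickr has published.

```ruby
require 'cgi'

# Same regular expression as in the post.
def clean_tag(raw_tag)
  raw_tag.gsub(/[\s"!@#\$\%^&*():\-_+=\'\/.;`<>\[\]?\\]/, "").downcase
end

# raw tag => [expected clean tag, expected CGI-escaped tag]
expected = {
  "St. Patrick's Day" => ["stpatricksday", "stpatricksday"],
  "N.Y.C."            => ["nyc", "nyc"],
  "1+1=2"             => ["112", "112"],
  "back\\slash"       => ["backslash", "backslash"],
  "crab|vs|pipe"      => ["crab|vs|pipe", "crab%7Cvs%7Cpipe"],
  "東京"              => ["東京", "%E6%9D%B1%E4%BA%AC"],
}

# Keep only the entries whose actual output differs from the expectation.
failures = expected.reject do |raw, (clean, escaped)|
  clean_tag(raw) == clean && CGI.escape(clean_tag(raw)) == escaped
end

if failures.empty?
  puts "all #{expected.size} tags cleaned as expected"
else
  failures.each_key do |raw|
    puts "unexpected result for #{raw.inspect}: #{clean_tag(raw).inspect}"
  end
end
```

Adding a row to `expected` whenever Flickr's behavior changes would make it easier to notice when the regular expression has gone out of date.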
Tags: flickr - SocialTagging