Monday, May 23, 2005

On Recommendation Systems

Pete Lyons wonders about the voodoo behind Amazon's recommendations, which of late seem to have nailed things for him as they successfully mine the Long Tail of music. He asks:

Do they have more data now or better algorithms?
Sad fellow that I am, I actually happen to have read the research paper behind their approach, Amazon.com Recommendations: Item-to-Item Collaborative Filtering (PDF), which describes the techniques they use. The question then is how do these glue layer, machine-learning heuristic, and algorithm-eating artificial intelligence folks do recommendation systems and what should one make of them?

Now I'm no algorithm wonk, but my mathematics and electrical engineering haven't atrophied enough to prevent me from recognizing the elegance of Amazon's technique. What's also interesting is that, like Google Suggest, they make good use of offline processing so that almost everything can be pre-calculated and cached for reasons of scalability and speed.
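
To make the offline/online split concrete, here is a minimal sketch of item-to-item collaborative filtering in Python. It is my own toy reading of the general idea, not Amazon's code: the function names, the cosine similarity over purchase sets, and the sample purchase data are all assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def build_similarity_table(purchases):
    """Offline step: precompute item-to-item similarities.

    `purchases` maps a customer name to the set of items they bought
    (hypothetical data shape). Similarity is the cosine between the
    binary purchase vectors of two items.
    """
    item_counts = defaultdict(int)   # how many customers bought each item
    pair_counts = defaultdict(int)   # how many customers bought both items
    for items in purchases.values():
        for item in items:
            item_counts[item] += 1
        for a, b in combinations(sorted(items), 2):
            pair_counts[(a, b)] += 1

    table = defaultdict(dict)
    for (a, b), together in pair_counts.items():
        sim = together / sqrt(item_counts[a] * item_counts[b])
        table[a][b] = sim
        table[b][a] = sim
    return table


def recommend(table, owned, top_n=3):
    """Online step: a cheap lookup against the precomputed table."""
    scores = defaultdict(float)
    for item in owned:
        for other, sim in table.get(item, {}).items():
            if other not in owned:
                scores[other] += sim
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]


# Made-up purchase histories, purely for illustration.
purchases = {
    "ann": {"Kind of Blue", "A Love Supreme", "Head Hunters"},
    "bob": {"Kind of Blue", "A Love Supreme"},
    "cat": {"Kind of Blue", "Head Hunters"},
    "dan": {"Bustin' Loose", "Head Hunters"},
}
table = build_similarity_table(purchases)
print(recommend(table, {"Kind of Blue"}))
# prints ['A Love Supreme', 'Head Hunters']
```

The expensive part (the table) can be rebuilt in a batch job and cached, while the per-visit work is just a handful of dictionary lookups, which is the scalability point made above.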

As with almost all recommendation or machine learning systems, the more data they have the better things get. Of course the computation costs also increase but, if you can parallelize things a little, you can leverage the experience we now have of 10 years of Mr Moore in the Datacenter of the web and throw server farms at the problem. It's not just the data, however: if you look at the mathematics, you'll see that the algorithm used plays the most crucial role, much like PageRank or Kleinberg's earlier Clever (PDF, PS) did for search systems on the web, as discussed here.

Music or book purchases on the whole are episodic unless you're someone like me who is obsessed and constantly looking for the perfect beat. Amazon as a retailer doesn't have to be worried about real-time timeliness; I'd guess their average user buys or browses through the site monthly, weekly or at a stretch perhaps once a day (e.g. outliers like myself who are always tweaking their wishlists). Amazon also gives users easy ways of providing feedback so that they can train the system at their leisure as they interact with the site. Like Netflix, they just have to throw in a little unobtrusive DOM scripting, hidden iframe or XMLHttpRequest thingimijig and the user can click on those star rating systems without even seemingly reloading the page.

The benefits for the user are frictionless immediacy and the notion that one is teaching the autistic machine. The endpoint of machine learning is to perfectly anticipate human desires. Per contra I actually believe human psyches require a little imperfection. We need a little interaction and especially some conversation as we mediate our ever-changing world. In Apple's Knowledge Navigator 1988 concept, the interaction was more with a friend or butler. And as we know Jeeves was opinionated and mostly correct but he also made mistakes and Wooster could sometimes be smug. Should the people working on recommendation systems ever get to their nirvana, I would speculate that it would be a little unsatisfactory for the users, hence I'd suggest that they make sure to throw in enough errors (maybe a 5 percent threshold of randomness) to keep things exciting. I actually cherish the mistakes in my music collection and even the things Best Left Unread.
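
The "5 percent threshold of randomness" suggestion can be sketched in a few lines of Python. This is only one way to read it, under assumptions of my own: the function name `add_serendipity`, the per-slot replacement rule, and the sample catalog are all hypothetical.

```python
import random


def add_serendipity(ranked, catalog, rate=0.05, rng=None):
    """Swap each recommended slot for a random catalog item with
    probability `rate`, so the list is never perfectly predictable.

    `ranked` is the engine's ordered list; `catalog` is everything
    that could be recommended. Both are assumed to be lists of titles.
    """
    rng = rng or random.Random()
    already = set(ranked)
    unused = [item for item in catalog if item not in already]
    rng.shuffle(unused)

    result = list(ranked)
    for slot in range(len(result)):
        # Only swap while there are still unused catalog items left.
        if unused and rng.random() < rate:
            result[slot] = unused.pop()
    return result


# Made-up example data.
ranked = ["A Love Supreme", "Head Hunters", "Kind of Blue"]
catalog = ranked + ["Bustin' Loose", "Zombie", "Blue Train"]
print(add_serendipity(ranked, catalog, rate=0.05, rng=random.Random(2005)))
```

Passing a seeded `random.Random` keeps the behaviour repeatable while experimenting; in practice one would probably want the wildcard picks drawn from somewhere more interesting than a uniform shuffle.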

Enough people have been noting that Audioscrobbler's music recommendation system has improved enough that, late adopter that I am, I went ahead yesterday and installed their plugin on several of my machines. On the older machines, I use Winamp which has lower memory requirements than iTunes. My digital music collection (10,300 songs, 70.6 GB) is larger than the largest iPod so I am waiting for another Moore's Law inspired doubling of storage before I adopt that platform, not to mention that my ear canal is too small for those white earbuds.

Audioscrobbler's approach has the virtue of actually keeping track of your actual usage of the content in question so that if it notices that you keep playing James Carter's The Intimacy Of My Woman's Beautiful Eyes as I did last week, then perhaps it will be smart enough to figure out that you'll dig Nicholas Payton's Captain Crunch (Meets The Cereal Killer). At least that's the theory. Some fear that this smacks of Big Brotherism and indeed we need to be vigilant that the benefits we gain in giving up some of our privacy outweigh the possible pitfalls.
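
As a rough illustration of recommending from actual listening rather than purchases, here is a small Python sketch that compares play counts between listeners. I don't know how Audioscrobbler really does it; the nearest-listener approach, the function names and the play counts below are all assumptions made up for the example.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two play-count dictionaries."""
    shared = set(u) & set(v)
    dot = sum(u[t] * v[t] for t in shared)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def suggest_from_neighbour(me, others):
    """Find the listener whose play counts look most like mine and
    return the tracks they play that I haven't, most played first."""
    if not others:
        return []
    nearest = max(others, key=lambda name: (cosine(me, others[name]), name))
    unheard = {t: n for t, n in others[nearest].items() if t not in me}
    return sorted(unheard, key=lambda t: (-unheard[t], t))


# Hypothetical play counts.
me = {"The Intimacy Of My Woman's Beautiful Eyes": 12, "Bustin' Loose": 3}
others = {
    "listener_a": {
        "The Intimacy Of My Woman's Beautiful Eyes": 9,
        "Captain Crunch (Meets The Cereal Killer)": 7,
    },
    "listener_b": {"Bustin' Loose": 1, "Zombie": 20},
}
print(suggest_from_neighbour(me, others))
# prints ['Captain Crunch (Meets The Cereal Killer)']
```

Because the weights are plays rather than purchases, a track left on repeat counts for far more than one bought and forgotten, which is the virtue described above.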

On the other hand, humans are social beasts and don't mind a certain amount of looking over one's shoulders or, like the teenagers on the bus, we don't mind others plugging in to our headphone jacks. Or like some of our neighbours, we want everyone in the neighbourhood to hear our sounds whatever they may be. There was some mp3 player or other I saw on Engadget a while back that had 2 headphone jacks to facilitate intimate sharing of tunes and that's the right idea. There's even controversy among iPod people over the new vocabulary: does podjacking mean "plugging your cord into the jack of another person's iPod (and vice versa, of course) to hear what that person is listening to" or is it "using an FM transmitter attachment to take over neighboring radios". Amazon's Listmania feature, our Netflix queues, the sharing of iTunes playlists, things like Webjay, Shoutcast, last.fm and All Consuming express the pent-up desire for the two-way web of publishing of content and interests.

If you love books or music as much as I do, and write about them with incisiveness and opinion, others will always be asking you what the latest thing is in order that they can seek out new things (or know what to avoid) or simply share their opinions or anecdotes. These are cultural markers that help establish shared context. That is the role of DJs or the personal recommenders we all have, the people who are arbiters of the cool. And if the people who market these items seek out these tastemakers with the moral equivalent of payola, it's just smart advertising. Samuel Jackson will never buy a hat in his life again unless Kangol loses their corporate mind. In my own little way, long after I was active at WHRB, record companies were still sending me 12 inch singles even though my Technics turntables had been stolen and I'd asked repeatedly to be removed from their mailing lists.

Anyway here's Koranteng's Musical Toli at Audioscrobbler.

It's a little scary in that you could figure out that as I was writing these words Chuck Brown & The Soul Searchers - Bustin' Loose was making my early morning brighter and you might even picture me getting up to take a few shuffle steps around my living room in my pajamas.

The Audioscrobbler recommendation system kicks in after 100 or so tracks played and I'll give them a couple of weeks or so to see if it's hype and report back with the results. I too am looking for overlooked breakbeats.

Dan Bricklin's Listgarden is software in that mode. My current Forms work is all about lists, indeed until the lawyers heard about Microsoft List Builder, that was the preferred terminology for some. I actually would rather call a form a form, and a view a view but I'm a cantankerous sort as you might know.

In the web site link category where timeliness is of importance, the sheer scale and real-time nature of the blogosphere makes the design of efficient recommendation systems more problematic. I assume all the big boys are burning the midnight oil trying to crack that question. We want both search and recommendations to help navigate the web. Google has been very quiet of late although their Web Accelerator is pointing to the kind of thinking required. Just when you think that Blogger has been sadly neglected (no categories? no trackback? no easy lists or blogrolls etc) all those PhDs will surprise you. At least I hope they will - the contortions I've been going through with Blogger deserve a separate post. I assume the same thing is true at Yahoo, MSN and the like. In the meantime, it has been Technorati, del.icio.us, PubSub, Blogdigger, Furl and BlogPulse for me (not to mention Blogdex, Popdex, Hot Links, Memeorandum and that latecomer Ice Rocket). Mark Fletcher has been intimating that Bloglines will surface something in the summer. If they do, they will have quite an advantage since so many are increasingly living in Bloglines and other newsreaders that Tim Bray is now pondering the usefulness of his Browser Market Share numbers.

Another Glue Layer Person, Leonard Richardson, has one of the most interesting papers (and software that one can play with, complete with Python source). He explicates the problem space quite clearly:

The Ultra Gleeper: A Recommendation Engine for Web Pages
Recommendation engines were built and run into troubles. Seemingly insurmountable problems emerged and the flame of hype moved elsewhere. Recommendation engines for web pages were not built or successfully launched. To even attempt one would require development of a web crawler and the associated resources. Today, recommendation engines have something of the reputation of a well-meaning relative who gives you gifts you often already have or don't quite want. Most useful recommendations come from knowledgeable friends or trusted web sites.

But over the years, as people built these web sites, they came up with models and tools for solving the basic problem of finding and tracking useful web sites. The wide adoption of these strategies has not only brought down the cost of building a web page recommendation engine, it's removed some of the insurmountable problems that still plague recommendation engines for other domains. It's now possible for someone with a dedicated server to run a recommendation system for themselves and their friends. I've done it and I'll show you how to do it.
An officious-looking "Legal Education" document came in through my Big Blue inbox in the past few days, so presumably I should consult that before peeking at the code if indeed that is allowed. The tension between the open source imperative and the inhibition of so-called "Intellectual Property" is a minefield that like all such battlegrounds causes stalls (and amputations); we still haven't learnt how to navigate these things easily.

David Hyatt recently wrote about Implementing CSS and gave pointers to the major lessons learned while implementing CSS support in Safari and Mozilla, along with a couple of optimizations he came up with (very clever algorithms by the way). The subsequent public discussion about the design and performance tradeoffs made in complex things like web browsers stimulated my engineering juices. Lots of things are browser-like these days and the techniques he has written about have wide applicability. A high performance dynamic style rule matching component could be used in many products that don't have to do with browsing. Instead I have to consult a lawyer or read yet another stack of incomprehensible PowerPoint slides. Why exactly did my parents forbid me a law career? It seems that lawyers are the only ones who have guaranteed job security (trigger-happy disclaiming is the rule).

And while digressing about job security, I should mention that the news of 10,000 to 13,000 jobs being eliminated at IBM hasn't made for a comfortable work environment for the past few weeks. Those are mighty nice round numbers... Rumours are rampant in the workplace that, as the Beeb suggests, "it's too expensive to lay off in Europe, they have all those unions and are vaguely socialist therefore they're going to cut around here in the US where there is much less difficulty hiring or firing". Otto von Bismarck knew a thing or two about paternalism and the unions in Europe even survived Maggie Thatcher's defenestration of Arthur Scargill. I know my family are all wondering what might happen and if I'll be affected, and not just quietly; it's an ongoing concern in conversations. And who knows? We all worry about our own little patch of the woods but also about those colleagues who we may not be talking to in coming months, sometimes to the point of work being affected. Despite our undoubted professionalism, we're only human and don't understand abstract and capricious things like gravity or economics; we'd rather discuss tangible things like books, music and recommendation systems.

The two topics that never get spoken about in the US corporate world are one's actual salary and the impending job cuts. Everybody just keeps their heads down; the fear being that if you make too much noise, you might get the cut because admittedly we've all heard that such things have happened somewhere, somehow, at some time in some corporation. Corporations being the expression of unsentimental capitalism, I assume the worst part of being a manager is having to lay someone off and see their face. I'm too sensitive for that kind of thing and hence I'll never aspire to a managerial position. The one thing that you'll hear as the received wisdom-du-jour gets explained is that the euphemistic "resource action is not the fault of those affected" but rather "simply reactions to market conditions" or philosophically that ultimately it was a failure of planning or of something or other. But maybe I shouldn't stick my neck out any further...

In the spirit of the recommendation systems I've discussed I'll endeavour to share a few playlists with toli readers this week before London calls this weekend.

See also: On The Long Tail of Music, Metrics and Recommendations


2 comments:

Robert Jamison said...

You might look at StumbleUpon.com and digg.com as a couple stellar examples of good recommendation systems for websites.

You might also look at my own experimental recommendation system for different types of media at kindakarma.com, I'd love to have your feedback on it.

Greg Linden said...

Thanks, Koranteng, glad you liked the paper!