Showing posts with label heuristics. Show all posts

Tuesday, June 14, 2005

On The Long Tail of Music, Metrics and Recommendations

Responding to Chris Anderson's Bring tha Noise post with some commentary, some metrics from my music collection and pointers to artists who inhabit this Long Tail of music...


Some lunch break ego-surfing prompts this piece. Chris "Long Tail" Anderson tries to use real world data to validate some of his theorizing about music trends. Yours truly is one of the guinea pigs for this exercise in Bringing tha noise, as he, Public Enemy and Anthrax would have it. He's trying to figure out the signal to noise ratio in the fringe.

I seem to be a good source of metrics for this kind of thing because my tastes are not too mainstream and also because I augment my writing with lots of allusions and links to music and books that I champion or lean on to underline the slightest point. "Mining the cultural zeitgeist" is how I have described this tendency of mine when Jon Udell's metrics leaped out at me. I also tend to just use Amazon links, which helps those seeking statistics. So then, what to say about the Long Tail of music?

First this: there's a difference between
  1. the things that I write about, and quote or link to (i.e. what you'll see if you mined this blog)
  2. the things I actually listen to or read
  3. the things I've bought - my entire collection of music and books
I recently wrote a piece On Recommendation Systems that is in much the same area as this Long Tail discussion.

On the matter of recommendation systems, a couple of comments are in order.
  • Leonard Richardson felt a little guilty that he hadn't spent time enhancing Ultra-Gleeper but that's the way things go, we're all stretched thin. Like he suggests, sometimes it's the idea that is important and not the embodiment in code, we see that all the time in technology. If one manages to give some intern somewhere an itch to scratch, as that paper should do, that is a contribution to knowledge and we are all the better for it.
  • Robert Jamison seems to have made a great contribution in ideas (Searching, Sharing and Stumbling), if not in code (kindakarma.com), to the recommendation systems debate and points to a few others of note. I will be playing with said code and services in coming weeks.
But back to Anderson and Udell's search for metrics...

One thing I didn't comment on in Jon's post was that he was looking at statistics of bloggers he reads who most frequently cite books on Amazon.com. The problem was that many of the links from this site are about music rather than books. This may have skewed his statistics a little.

In other words, what was required was some sort of link classifier to figure out what a link refers to (whether book, music CD, DVD etc.). The classifier would be like a regexp pre-processor that would figure out that, for example, a link to IMDB is probably about a movie, in much the same way as the LibraryLookup project maps ASINs to OPACs. I assume All Consuming uses the Amazon API to figure out what type a given item is (book, music, DVD, other). I continue to wonder if one can have a classifier that will determine when a link points to a person... All Consuming is in flux as they transfer to 43 Things, and their old REST APIs are orphaned in the transition, but that product has an interesting take on link classification and tagging.
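
A rough sketch of such a regexp pre-processor, in Python, might look like the following. Everything here is an assumption made for illustration: the rule set, the labels and the classify name are hypothetical, and the URL shapes are only the common IMDB and Amazon patterns. The book test leans on the fact that a book's ASIN is its 10-character ISBN; that is a heuristic, and a real classifier would still need an API lookup to tell a CD from a DVD.

```python
import re
from urllib.parse import urlparse

# Hypothetical rules for illustration only; not an exhaustive catalogue of URL shapes.
AMAZON_HOST = re.compile(r"(?:^|\.)amazon\.[a-z.]+$")
ASIN_IN_PATH = re.compile(r"/(?:ASIN|dp|gp/product)/([A-Z0-9]{10})(?:/|$)")
ISBN10 = re.compile(r"\d{9}[\dX]")  # book ASINs are 10-character ISBNs


def classify(url):
    """Guess what kind of thing a link points to, using only the URL."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()

    if host == "imdb.com" or host.endswith(".imdb.com"):
        if parts.path.startswith("/title/"):
            return "movie"
        if parts.path.startswith("/name/"):
            return "person"
        return "unknown"

    if AMAZON_HOST.search(host):
        match = ASIN_IN_PATH.search(parts.path)
        if match:
            # Heuristic: an ISBN-shaped ASIN is probably a book; anything else
            # (music, DVD, other) cannot be told apart from the URL alone.
            return "book" if ISBN10.fullmatch(match.group(1)) else "non-book item"

    return "unknown"
```

Note that the "non-book item" bucket is exactly where the skew creeps in: a pass like this can separate book links from the rest, but it cannot say which of the rest are music.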

Items 2 (what I actually listen to) and 3 (what I've purchased) in the above list are also interesting.

I have a partial pass at item 2, what I actually listen to. You can check out my statistics at Audioscrobbler. I installed Audioscrobbler on my home machine and have been playing with it to figure out if its recommendation engine will prove useful; I'll report back in a couple of months. After a couple of weeks, it has me in quite good company on the basis of the 769 tracks I've played on the home computer.

With a music collection as large as I have, my listening habits tend to be
  • One third playlist-driven (I have numerous playlists for every mood as befits someone in the grip of musical obsession - thanks Gardner by the way for your kind comments and Autumn song playlist suggestions)
  • One third shuffle serendipity
  • The other third typically comes into play during shuffle; I'll hear something which will make me think of some other song and I'll do an ad-hoc thematic playlist. The problem is that I don't tend to save those tangential playlists born of serendipity to feed back into the first category. I'm too lazy for that.
I'm a bit of an audiophile bigot and "can hear the difference" between original cds and even the 224 kbps VBR mp3 encodings I've done of my collection. Also some things have to be listened to loud (and one has to keep the neighbours on their toes every now and then). My computers haven't been hooked up to the big speakers and gleaming stereo system; the computer speakers are some generic $30 muffled fuzz boxes. Thus I need to get a Squeezebox, Audiotron or something although I'm a late adopter in such things. Thus, what is missing in the Audioscrobbler statistics is the music I listen to on my stereo. Still Audioscrobbler is a good proxy for much of my actual consumption of music.

On the topic of playlists, I'll be sharing some shortly so that others can get a sense of my beat matching juxtaposition insanity.

One problem with dealing with statistics from the blog is that, for example, even though I'd estimate that 40 percent of my collection is jazz, I write about jazz only occasionally, like when I was Vibing with Abbey Lincoln or reminiscing about A Soul Jazz Thing with a belated appreciation of the late Jimmy Smith, or say alluding to Kamal (of The Roots) and his Ahmad Jamal keyboard stylings. There has only been one "proper posting" on jazz; otherwise the allusions are all incidental, say comparing Miles Davis's First Quintet to Rokia Traoré's band.

Jazz is less quotable and difficult to write about so perhaps that is why it wouldn't show up more readily in the statistics from the blog. Still the jazz idiom is a big part of my musical taste and outlook on life.

Metrics From My Music Collection


Anyway here is the data for number 3: I give you my music collection for your forensic analyses.

collection.m3u and collection.pls

My digitized music collection stands at 10,346 songs and amounts to 70.8 GB (about 36 days of continuous music). Note that this is larger than the largest iPod (hence my pause in adopting that ubiquitous platform). I only ripped things that I liked so presumably one should add a good 50 percent filler for a fuller picture. I've been revisiting that notion since the Best Left Unread piece with a newfound appreciation of the bad things in my collection; disk space is cheap and I've just ordered a 250 GB spare hard drive to mitigate future disasters, so that disk could be put to work.
Per contra I actually believe human psyches require a little imperfection.


Of the 1,527 artists therein, there were about 15 that were problematic (e.g. I'm sure there's no musician called Hawaii 5.0 or James Bond 007, to take some of the theme songs I occasionally play to spice things up, and similarly Ghana Highlife 1 to 10 are simply proxies for the unknowns I've picked up). I haven't digitized the 4 crates of records that I have lying about. That's a weekend project for this summer like my photo-digitization project of a few months ago (see Cultural Sensitivity in Technology).

Apparently there are 1,540 albums in the collection. I haven't been quite diligent about tagging albums so 2,000 songs don't have album information. Handwaving a 10 song per album ratio, that's another 200 albums, so let's say 1,750 albums. I'll fix that and update those files in the next few days.

Generating metadata for albums is difficult: what album would 50 Cent's blistering Jay-Z diss that was the talk of mixtapes in 2002 fall in? And there are also things that come from Greatest Hits, or Live albums as opposed to the original album, and I typically have multiple versions of things. I can't remember which version I actually ripped.

I am actually crazy to have spent cash money on most of these albums instead of downloading: assume that's $14,000 to $27,982.50 (at $8 per album, or $15.99 if I were to pay what record companies would prefer). Luckily I got a lot of these for free when I used to DJ and was on record company mailing lists; I also used Columbia House and BMG music zealously. Still that's a lot of money. I bought my first cd in 1992; previously it was all vinyl and tapes. I can't say that I've downloaded much from file sharing networks (I use Gnucleus). At the higher end of the scale of the cost estimates, $30,000 is a mortgage downpayment (well it was before this insane real-estate bubble that even leads to riots). Music companies should love me. I justify it by saying that my only hobbies are music and books; no drink, fast cars (I'm a public transport prole) or other flash.
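
For anyone who wants to check the handwaving, the arithmetic can be sketched in a few lines of Python. The figures are the ones quoted in this post; the variable and function names are made up for the example, and prices are kept in whole cents to avoid floating point surprises.

```python
# Figures quoted in the post; names are illustrative only.
TAGGED_ALBUMS = 1540      # albums with proper tags
UNTAGGED_SONGS = 2000     # songs with no album information
SONGS_PER_ALBUM = 10      # the handwaved ratio

# 1,540 tagged albums plus roughly 200 more implied by the untagged songs.
estimated_albums = TAGGED_ALBUMS + UNTAGGED_SONGS // SONGS_PER_ALBUM

# The post rounds the estimate up to a convenient 1,750.
ROUNDED_ALBUMS = 1750


def cost_in_dollars(albums, price_cents):
    """Total spend for a number of albums at a per-album price in cents."""
    return albums * price_cents / 100


low = cost_in_dollars(ROUNDED_ALBUMS, 800)     # at $8 per album
high = cost_in_dollars(ROUNDED_ALBUMS, 1599)   # at $15.99 per album

print(f"{estimated_albums} albums estimated, rounded to {ROUNDED_ALBUMS}")
print(f"${low:,.2f} to ${high:,.2f}")
```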

I've tagged songs by genre but here you really want to be able to specify multiple tags just like in any of the new social software.

I suppose I should switch to iTunes to get an xml representation of these playlists. One of the bad versions of Winamp, the bloated one that caused a backlash, used to export playlists in an xml format. That would have been better than m3u or pls. Still that is nothing that a capable regexp and Unix pipeline wizardry couldn't fix.
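
As a taste of the regexp route, here is a minimal Python sketch that tallies artists and running time from a playlist. It assumes the file is in the extended M3U form, where each track is preceded by a line like #EXTINF:215,Artist - Title, and that the display string really does follow the "Artist - Title" convention; neither is guaranteed for any given playlist. The summarize_m3u name is made up for the example.

```python
import re
from collections import Counter

# Extended M3U track line: "#EXTINF:<seconds>,<display string>"
EXTINF = re.compile(r"^#EXTINF:(-?\d+),(.*)$")


def summarize_m3u(lines):
    """Return (artist counts, total seconds) from extended M3U lines."""
    artists = Counter()
    total_seconds = 0
    for line in lines:
        match = EXTINF.match(line.strip())
        if not match:
            continue  # header lines and file paths carry no metadata
        seconds, display = int(match.group(1)), match.group(2)
        if seconds > 0:  # -1 conventionally means "length unknown"
            total_seconds += seconds
        # Assumption: display strings look like "Artist - Title".
        artist, separator, _title = display.partition(" - ")
        artists[artist.strip() if separator else "(unknown)"] += 1
    return artists, total_seconds


sample = [
    "#EXTM3U",
    "#EXTINF:215,Adriana Evans - Seein' Is Believing",
    "adriana.mp3",
    "#EXTINF:301,Full Force - Alice, I Want You Just For Me",
    "alice.mp3",
]
artists, seconds = summarize_m3u(sample)
print(artists.most_common(), seconds)
```

The "(unknown)" bucket is where the problematic artists and untagged theme songs would land, which is probably the first thing worth cleaning up.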

I'm still in disaster recovery mode but once I've got Cygwin set up, I'll be sure to download and install PlaylistManager and slice and dice the data with that Linux tool.

Anyway I hope this helps...

Full Force in the Long Tail


The obligatory musical note... While writing this note, I just rediscovered the wonderfully fresh sounds of Adriana Evans whose 1995 album for some reason didn't blow up (stupid record companies). Adriana Evans is the name that you drop when you want to put someone talking up Ashanti or Brandy in their place. Love is all Around and Seein' is Believing are some of the most laidback soul jams of all time.

Adriana Evans


Speaking of Seein' is Believin', how about Cheryl "Pepsii" Riley's (erstwhile of Thanks for my Child fame) song of the same name from 1988's Me, Myself and I album, produced by Full Force. That song is soul/funk perfection, much like Jerome Prister's sublime Say You'll Be that I recently found after 18 years.

Cheryl Pepsii Riley


And speaking of Full Force, what about that Hall of Fame production crew and House Party funkers? In recent years they have been souping things up for Britney and Christine (you've got to pay bills). Their roots, like Jam and Lewis, are in pure soul and a hip-hop/funk aesthetic. Guess who's coming to the crib? is my favourite album of theirs although others preferred their eponymous debut album.

Full Force


My picks in the Full Force canon
  • Your Love Is So Def
    My dad lost all respect for me when I used to sing this because he heard "deaf" in my loud singing rather than the 80s hip-hop notion of def.
  • Love Is For Suckers (Like You And Me)
    Shower crooners at school couldn't stop singing this song for weeks on end. Not to mention the lines we all primed to repeat on cue, their irresistible verbal tics:
    "Come on bust a move."
    "Full Force Get Busy One Time"
  • Take Care Of Homework
    "Yo homeboy, your girl looking kinda fly lately, know what I'm saying?"
    "No I don't know what you're saying."
    "What I'm saying is that she be like checking me out 'cos you been neglecting. D'you know what I mean?"
    "No I don't know what you mean!"
    "What I mean is that you better take care of homework or a dude like me will push up and take your place!"
    On the basis of the irreversibly funky break beat from Take Care of Homework, these brothers proved they were funky enough to revamp James Brown in his sadly neglected I'm Real album. The title track is a ferocious call to arms by The Godfather.

    James Brown is Real
  • I'd also note that their Smoove album is no spring chicken. It contains such gems as Ain't My Type Of Hype, the Smooved out title track and that phenomenal ballad, Kiss Those Lips. These are cases in point about the versatility of the crew.
  • They gained fame with Alice, I Want You Just For Me with the famous opening.
    Testing, Testing One, Two
    Testing, One Two
    In The Place To Be...
    Girl I Want To Shower You With Diamonds And Pearls
    And When We're All Alone I'll Take You For A Trip Around The World
    Yes Indeed I Like Your Style
    Ooh You're Worth My While
    Baby, I'm Your Carpenter,
    Please Me Lay Your Tile
    I Don't Want To Share You With No Else
    Alice Be My Girl
    Can't You See?
    I Want You Just For Me?
    Full Force Get Busy One Time
  • Their repeated collaborations with Lisa Lisa & Cult Jam are also essential Spanish Fly.

    Lisa Lisa and Cult Jam with Full Force


    I Wonder If I Take You Home will still fill a dancefloor to this day.
    "Would You Still Be In Love, Baby?
    Because I Need You Tonight."
  • Their work with UTFO especially on Roxanne, Roxanne also gave rise to the legendary The Complete Story of Roxanne, those 103 responses to UTFO's 1985 novelty hit.

    The Story of Roxanne is one of the best stories of the Long Tail of Music (the other great case in point is All Roads Lead to Apache).

    Roxanne, Roxanne brought to prominence The Real Roxanne and more importantly Roxanne Shante, possibly the best female MC of all time. Roxanne's Revenge will go down as the best response record of all time. One of my favourite songs of hers is Live on Stage, which is Marley Marl's production at its best.

    Roxanne Shante
  • Full Force consolidated their fame with collaborations with Kid 'N Play in House Party and its two sequels. The movies were the inspired brainchild of the Hudlin Brothers, a decade before Barbershop.

    House Party


    Kid-N-Play 2 Hype


    Kid 'N Play were no slouches themselves even if they tended towards the pop side of the hip-hop spectrum. They were fun, their beats and lyrics focused on putting a groove in your step. More to the point, they could dance; at a time when Hip Hop was heading into inaccessible navel-gazing, they reminded everyone about the dance in Kool Herc's dancehall that was in the founding mythology of the music. As historians of Hip-Hop would write, break dancing (the B Boying and B Girling aesthetic) is an essential element - some would even term it the most important of the 4 elements of hip-hop, even more than MCing (rapping), DJing, and that maligned "social nuisance" element, Graffiti. Gittin Funky and Rollin' with Kid 'N Play will cause circles to form and Soul Train breakouts on the dancefloor. Funhouse is their best album and that's the spirit. And those flat tops!
Full Force have since cashed out of the soul side of things except to give some "urban" substance to blue-eyed pop, and who can blame them: "black" audiences are so fickle.


Monday, May 23, 2005

On Recommendation Systems

Pete Lyons wonders about the voodoo behind Amazon's recommendations which of late seems to have nailed things for him as they successfully mine the Long Tail of music. He asks

Do they have more data now or better algorithms?
Sad fellow that I am, I actually happen to have read the research paper behind their approach, Amazon.com recommendations - Item-to-Item Collaborative Filtering (PDF), which describes the techniques they used. The question then is how do these glue layer, machine-learning heuristic, and algorithm-eating artificial intelligence folks do recommendation systems and what should one make of them?

Now I'm no algorithm wonk, but my mathematics and electrical engineering hasn't atrophied enough to prevent me from recognizing the elegance of Amazon's technique. What's also interesting is that like Google Suggest, they make good use of offline processing so that almost everything can be pre-calculated and cached for reasons of scalability and speed.
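
To make the offline pre-calculation concrete, here is a minimal Python sketch in the spirit of item-to-item collaborative filtering: build a similar-items table ahead of time from who bought what, then answer recommendation requests with cheap table lookups. It is a toy under stated assumptions, not Amazon's implementation. The cosine-style similarity over purchase sets, the function names and the data shape are all choices made for this illustration.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def build_similar_items(purchases):
    """Offline step. purchases maps a customer to the set of items bought."""
    buyers = defaultdict(set)
    for customer, items in purchases.items():
        for item in items:
            buyers[item].add(customer)

    # Count how many customers bought each pair of items together.
    together = defaultdict(int)
    for items in purchases.values():
        for a, b in combinations(sorted(items), 2):
            together[(a, b)] += 1

    # Cosine similarity between the two items' binary purchase vectors.
    table = defaultdict(dict)
    for (a, b), count in together.items():
        similarity = count / sqrt(len(buyers[a]) * len(buyers[b]))
        table[a][b] = similarity
        table[b][a] = similarity
    return table


def recommend(table, owned, k=3):
    """Online step: sum precomputed similarities for items not yet owned."""
    scores = defaultdict(float)
    for item in owned:
        for other, similarity in table.get(item, {}).items():
            if other not in owned:
                scores[other] += similarity
    # Highest score first; ties broken alphabetically for repeatability.
    return sorted(scores, key=lambda i: (-scores[i], i))[:k]
```

The split is the point: build_similar_items can grind away in a batch job over the whole purchase history, while recommend only touches the rows for the handful of items a given user owns.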

As with almost all recommendation or machine learning systems, the more data they have the better things get. Of course the computation costs also increase but, if you can parallelize things a little, you can leverage the experience we now have of 10 years of Mr Moore in the Datacenter of the web and throw server farms at the problem. It's not just the data, however: if you look at the mathematics, you'll see that the algorithm used plays the most crucial role, much like PageRank or Kleinberg's earlier Clever (PDF, PS) did for search systems on the web as discussed here.

Music or book purchases on the whole are episodic unless you're someone like me who is obsessed and constantly looking for the perfect beat. Amazon as a retailer doesn't have to be worried about real-time timeliness; I'd guess their average user buys or browses through the site monthly, weekly or at a stretch perhaps once a day (e.g. outliers like myself who are always tweaking their wishlists). Amazon also gives users easy ways of providing feedback so that they can train the system at their leisure as they interact with the site. Like Netflix, they just have to throw in a little unobtrusive DOM scripting, hidden iframe or XMLHttpRequest thingamajig and the user can click on those star rating systems without even seemingly reloading the page.

The benefits for the user are frictionless immediacy and the notion that one is teaching the autistic machine. The endpoint of machine learning is to perfectly anticipate human desires. Per contra I actually believe human psyches require a little imperfection. We need a little interaction and especially some conversation as we mediate our ever-changing world. In Apple's Knowledge Navigator 1988 concept, the interaction was more with a friend or butler. And as we know Jeeves was opinionated and mostly correct but he also made mistakes and Wooster could sometimes be smug. Should the people working on recommendation systems ever get to their nirvana, I would speculate that it would be a little unsatisfactory for the users, hence I'd suggest that they make sure to throw in enough errors (maybe a 5 percent threshold of randomness) to keep things exciting. I actually cherish the mistakes in my music collection and even the things Best Left Unread.
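
The 5 percent suggestion can be sketched in a few lines of Python. This is only one way to read it: each slot in a ranked list has a small chance (epsilon) of being swapped for a random item from the wider catalogue. The with_serendipity name and its parameters are made up for the illustration.

```python
import random


def with_serendipity(ranked, catalogue, epsilon=0.05, rng=None):
    """Swap each recommendation for a random wildcard with probability epsilon."""
    rng = rng or random.Random()
    already_ranked = set(ranked)
    # Wildcards are catalogue items the engine did not already recommend.
    wildcards = [item for item in catalogue if item not in already_ranked]

    result = []
    for item in ranked:
        if wildcards and rng.random() < epsilon:
            # Take a wildcard out of the pool so it cannot appear twice.
            result.append(wildcards.pop(rng.randrange(len(wildcards))))
        else:
            result.append(item)
    return result
```

Setting epsilon to 0 gives back the engine's list untouched, and nudging it upwards trades accuracy for surprise, which is about the right dial for keeping things exciting.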

Enough people have been noting that Audioscrobbler's music recommendation system has improved enough that, late adopter that I am, I went ahead yesterday and installed their plugin on several of my machines. On the older machines, I use Winamp which has lower memory requirements than iTunes. My digital music collection (10,300 songs, 70.6 GB) is larger than the largest iPod so I am waiting for another Moore's Law inspired doubling of storage before I adopt that platform, not to mention that my ear canal is too small for those white earbuds.

Audioscrobbler's approach has the virtue of actually keeping track of your actual usage of the content in question so that if it notices that you keep playing James Carter's The Intimacy Of My Woman's Beautiful Eyes as I did last week, then perhaps it will be smart enough to figure out that you'll dig Nicholas Payton's Captain Crunch (Meets The Cereal Killer). At least that's the theory. Some fear that this smacks of Big Brotherism and indeed we need to be vigilant that the benefits we gain in giving up some of our privacy outweigh the possible pitfalls.

On the other hand, humans are social beasts and don't mind a certain amount of looking over one's shoulders or, like the teenagers on the bus, we don't mind others plugging in to our headphone jacks. Or like some of our neighbours, we want everyone in the neighbourhood to hear our sounds whatever they may be. There was some mp3 player or other I saw on Engadget a while back that had 2 headphone jacks to facilitate intimate sharing of tunes and that's the right idea. There's even controversy among iPod people over the new vocabulary: does podjacking mean "plugging your cord into the jack of another person's iPod (and vice versa, of course) to hear what that person is listening to" or is it "using an FM transmitter attachment to take over neighboring radios". Amazon's Listmania feature, our Netflix queues, the sharing of iTunes playlists, things like Webjay, Shoutcast, last.fm and All Consuming express the pent-up desire for the two-way web of publishing of content and interests.

If you love books or music as much as I do, and write about them with incisiveness and opinion, others will always be asking you what the latest thing is in order that they can seek out new things (or know what to avoid) or simply share their opinions or anecdotes. These are cultural markers that help establish shared context. That is the role of DJs or the personal recommenders we all have, the people who are arbiters of the cool. And if the people who market these items seek out these tastemakers with the moral equivalent of payola, it's just smart advertising. Samuel Jackson will never buy a hat in his life again unless Kangol loses their corporate mind. In my own little way, long after I was active at WHRB, record companies were still sending me 12 inch singles even though my Technics turntables had been stolen and I'd asked repeatedly to be removed from their mailing lists.

Anyway here's Koranteng's Musical Toli at Audioscrobbler.

It's a little scary in that you could figure out that as I was writing these words Chuck Brown & The Soul Searchers - Bustin' Loose was making my early morning brighter and you might even picture me getting up to take a few shuffle steps around my living room in my pajamas.

The Audioscrobbler recommendation system kicks in after 100 or so tracks played and I'll give them a couple of weeks or so to see if it's hype and report back with the results. I too am looking for overlooked breakbeats.

Dan Bricklin's Listgarden is software in that mode. My current Forms work is all about lists, indeed until the lawyers heard about Microsoft List Builder, that was the preferred terminology for some. I actually would rather call a form a form, and a view a view but I'm a cantankerous sort as you might know.

In the web site link category where timeliness is of importance, the sheer scale and real-time nature of the blogosphere makes the design of efficient recommendation systems more problematic. I assume all the big boys are burning the midnight oil trying to crack that question. We want both search and recommendations to help navigate the web. Google has been very quiet of late although their Web Accelerator is pointing to the kind of thinking required. Just when you think that Blogger has been sadly neglected (no categories? no trackback? no easy lists or blogrolls etc) all those PhDs will surprise you. At least I hope they will - the contortions I've been going through with Blogger deserve a separate post. I assume the same thing is true at Yahoo, MSN and the like. In the meantime, it has been Technorati, del.icio.us, PubSub, Blogdigger, Furl and BlogPulse for me (not to mention Blogdex, Popdex, Hot Links, Memeorandum and that latecomer Ice Rocket). Mark Fletcher has been intimating that Bloglines will surface something in the summer. If they do, they will have quite an advantage since so many are increasingly living in Bloglines and other newsreaders that Tim Bray is now pondering the usefulness of his Browser Market Share numbers.

Another Glue Layer Person, Leonard Richardson, has one of the most interesting papers (and software that one can play with, complete with Python source). He explicates the problem space quite clearly:

The Ultra Gleeper: A Recommendation Engine for Web Pages
Recommendation engines were built and run into troubles. Seemingly insurmountable problems emerged and the flame of hype moved elsewhere. Recommendation engines for web pages were not built or successfully launched. To even attempt one would require development of a web crawler and the associated resources. Today, recommendation engines have something of the reputation of a well-meaning relative who gives you gifts you often already have or don't quite want. Most useful recommendations come from knowledgeable friends or trusted web sites.

But over the years, as people built these web sites, they came up with models and tools for solving the basic problem of finding and tracking useful web sites. The wide adoption of these strategies has not only brought down the cost of building a web page recommendation engine, it's removed some of the insurmountable problems that still plague recommendation engines for other domains. It's now possible for someone with a dedicated server to run a recommendation system for themselves and their friends. I've done it and I'll show you how to do it.
An officious-looking "Legal Education" document came in through my Big Blue inbox in the past few days, so presumably I should consult that before peeking at the code if indeed that is allowed. The tension between the open source imperative and the inhibition of so-called "Intellectual Property" is a minefield that like all such battlegrounds causes stalls (and amputations); we still haven't learnt how to navigate these things easily.

David Hyatt recently wrote about Implementing CSS and gave pointers to the major lessons learned while implementing CSS support in Safari and Mozilla, along with a couple of optimizations he came up with (very clever algorithms by the way). The subsequent public discussion about the design and performance tradeoffs made in complex things like web browsers stimulated my engineering juices. Lots of things are browser-like these days and the techniques he has written about have wide applicability. A high performance dynamic style rule matching component could be used in many products that don't have to do with browsing. Instead I have to consult a lawyer or read yet another stack of incomprehensible PowerPoint slides. Why exactly did my parents forbid me a law career? It seems that lawyers are the only ones who have guaranteed job security (trigger-happy disclaiming is the rule).

And while digressing about job security, I should mention that the news of 10,000 to 13,000 jobs being eliminated at IBM hasn't made for a comfortable work environment for the past few weeks. Those are mighty nice round numbers... Rumours are rampant in the workplace that, as the Beeb suggests, "it's too expensive to lay off in Europe, they have all those unions and are vaguely socialist therefore they're going to cut around here in the US where there is much less difficulty hiring or firing." Otto Von Bismarck knew a thing or two about paternalism and the unions in Europe even survived Maggie Thatcher's defenestration of Arthur Scargill. I know my family are all wondering what might happen and if I'll be affected and not just quietly, it's an ongoing concern in conversations. And who knows? We all worry about our own little patch of the woods but also about those colleagues who we may not be talking to in coming months, sometimes to the point of work being affected. Despite our undoubted professionalism, we're only human and don't understand abstract and capricious things like gravity or economics; we'd rather discuss tangible things like books, music and recommendation systems.

The two topics that never get spoken about in the US corporate world are one's actual salary and the impending job cuts. Everybody just keeps their heads down; the fear being that if you make too much noise, you might get the cut because admittedly we've all heard that such things have happened somewhere, somehow, at some time in some corporation. Corporations being the expression of unsentimental capitalism, I assume the worst part of being a manager is having to lay someone off and see their face. I'm too sensitive for that kind of thing and hence I'll never aspire to a managerial position. The one thing that you'll hear as the received wisdom-du-jour gets explained is that the euphemistic "resource action is not the fault of those affected" but rather "simply reactions to market conditions" or philosophically that ultimately it was a failure of planning or of something or other. But maybe I shouldn't stick my neck out any further...

In the spirit of the recommendation systems I've discussed I'll endeavour to share a few playlists with toli readers this week before London calls this weekend.

Soundtracks to this tale: See also: On The Long Tail of Music, Metrics and Recommendations
