Archive for the ‘Algorithms’ Category

Don’t use python’s list comprehensions for set operations. Use, um, the set type and her functions.

Tuesday, September 9th, 2008

I’m making, say, 60,000 function calls, each involving two python lists (no duplicates within these lists, so they’re theoretically more like sets), with each having some overlap with the other.  I’m making these 60,000 calls, say, 30,000 times.  The question: how do I find the symmetric difference of the two lists efficiently via my function call?

Say my lists are a and b.  I’m concerned with the most computationally intensive part of the function, the calculation of the symmetric difference.  Here’s one approach:

[elem for elem in a if elem not in b] + [elem for elem in b if elem not in a]

But this ends up taking anywhere from .8 to 1.4 seconds for the 60,000 function calls.  Too long!  I thought perhaps if I can collapse it all into one list comprehension, then I’d be fine … maybe the issue is the fixed cost of calling the function:

[elem for elem in set(a+b) if (elem in a and elem not in b) or (elem in b and elem not in a)]

But this is even sillier - not to mention it takes a ton of time.  Anywhere from 1 - 1.8 seconds for the calls.  So why not just use python’s set type?  I can then turn my code into this guy:

a, b = set(a), set(b)

sym_diff = a.symmetric_difference(b)

This takes about .2 - .4 seconds, which is a great improvement, especially considering the number of calculations I’m making.

Moral of the story: list comprehensions are pretty handy, but not particularly efficient.  I’ll report more after I jump into the C code that dictates these operations.  Likely I could cut down this run time even more!

Value of recommendation engines

Thursday, July 31st, 2008

MIT technology review, which apparently has been published since 1899 according to the cover, in May 2008 published an article by Michael Schrage, “Recommendation Nation”. The last few paragraphs:

For all my excitement about the future of recommendation services, I can’t help feeling the way I felt about search in 2001. Existing recommendation engines have a lot of value, but they’re still primitive. Distinctions between browsing and comparison (that is, between looking at products and choosing between them) are poorly understood. We’ve yet to see how user-generated tags make product and service descriptions more precise and useful. The more specific, explicit, and time-sensitive the tag, the better the potential recommendations will be.

Smart people all over the world are working on these problems. Billions of dollars are at stake. Netflix is offering a million dollars to anyone who can improve the efficacy of its (exceptionally successful) recommendation engine. That’s a small price to pay for a company whose future depends on its ability to compete with Blockbuster and the digital video delivery companies of the future. It is an interesting and important problem, because it’s not only individuals who watch the movies, but couples, families, and friends. Perhaps the winning algorithm will be optimized for the preferences of groups.

When I get good recommendations, I spend my time and money differently. Even better recommendations will dramatically increase the value of that time and money. That’s a digital future I crave and expect. I hope Internet innovators take my recommendations as seriously as I take theirs.

Not that we need the validation. The point is I fully agree that better recommendations are needed especially as we are increasingly overwhelmed by mounds of content.

What the impending The Dark Knight / Godfather War tells us about the IMDb Top 250

Tuesday, July 29th, 2008

Movie lovers are a pretty intense bunch.  And we should know - we’re rather insane movie lovers ourselves.  But it seems that two classes of film lovers are poised for an all-out war.  A recent /film blog post reports that The Dark Knight has overtaken The Godfather on the IMDb Top 250 list:

The Godfather has fallen to the #3 spot, after nearly a decade at the top of the Internet Movie Database’s listing of the Top 250 Movies of All Time [...]The Dark Knight overtook The Godfather’s throne, but this latest development is really interesting because it might show evidence of a fanboy mob at work. Could it be that Dark Knight fans are intentionally voting down Godfather in hopes of keeping The Dark Knight at the top spot? Why else would The Shawshank Redemption have overtaken The Godfather in a time when neither film is in the public forefront? The percentage of users who gave Godfather a 1 out of 10 (the lowest rating possible) grew from 6.1% to 6.4%, just enough to push Shawshank ahead, while the percentage of participating users who loved the film, giving it a 10 out of 10, remained the same (57%). It’s also worth noting that while any IMDb user can vote and effect a movie’s overall rating, only regular IMDb users can influence the film’s top 250 placement.

Nevermind the irony; The Dark Knight is in part about an insane, brilliant criminal besting the mob world.  The problem is, this isn’t a question about which one is better.  This isn’t McCain vs. Obama - after all, The Godfather is a frickin’ bad-ass movie.  And as far as Batman goes, Ledger’s performance as The Joker is, as everyone is aware, pretty much one of the best-performed roles any of us have ever seen.  No, comparing the two movies is hardly useful.  The real question is whether or not The Dark Knight can stand the test of time, as The Godfather has.  That will take a bit of time, so that whatever hype still exists concerning The Dark Knight will have settled into genuine criticism.  The IMDb Top 250 as currently rendered is not really sophistocated enough to handle that type of comparison.  So how can we control for new movies making a big splash?

joker

For starters, the formula used to calculate the list (at the bottom of the Top 250 page) could incorporate temporal effects, so that movies that have been consistently popular throughout time would be rated higher than movies whose true mettle has not been tested by the sands of time.  (I’ll try to keep this post supremely non-technical, so as not to scare anyone away, but those of you who would understand the necessary math can probably imagine what I’m thinking.  Perhaps next time we can delve into the math-ey stuff).  A statistician or econometrician would have already thought of a few ways of using ratings over time.  So consider the following: The Godfather has been number one for about ten years, and has had a consistent number of people continually rate it high over the course of the IMDB’s tenure.  That’s the mark of a good movie - if new generations rediscover it, and love it just as much as those who first saw it, then that movie is a real winner.  If a movie makes a big splash when it first comes out, but putters about years (or even months) after (or simply fades away), then it doesn’t deserve its spot.  Its hard to say whether or not it’ll matter for The Dark Knight in the long run, but I bet you’d see some serious reshuffling of the other movies on the list.  The Incredibles might finally lose to The Graduate.  Finally.

The Dark Knight is a great movie - everyone at HelloMovies adored it (except Stan, but he hates everything :-) ) - but if we have to wait to see whether or not people will rate it consistently well enough over time to consider it a classic, then we might as well factor the movie’s age and temporal distribution of ratings into the Top 250 calculation and save movie fans the trouble.  That way, if a bunch of fanboys vote down Vito Corleone, nothing saddening like the don dropping to number three could ever occur, unless said fanboys somehow manage a many-years long siege on that castle.  And it’s not like the IMDb doesn’t have data to tune the right parameters to make it happen, either.  No doubt IMDb has probably thought of this already, but it’s mighty hard to change an institution like the Top 250 and keep users from revolting.  But hey, any true movie fan should already be raging against the machine on this amazing upset.  This might be just the catalyst IMDb needs to change things up!

How we will beat Netflix

Tuesday, April 8th, 2008

The Netflix Challenge is a $1 million prize for an improved movie recommendation algorithm that does better than the Netflix algorithm at predicting how a user will rate a movie.

Different teams, including some at Stanford, have adopted various approaches to the problem including very, very sophisticated algorithms. But a few teams have taken a different approach and instead of limiting themselves to the Netflix data set they have instead included additional useful data sets. And no surprise - when you think outside the box the results are often much better (but unfortunately you won’t win the prize by bending the rules).

The bigger point is that adding more independent data usually beats out designing increasingly better, but only incrementally better, algorithms.

Another example of this comes from Google. Many people mistakenly believe the success of Google is predicated on their brilliant algorithms, namely PageRank. The truth is the innovation lies in the dual recognition that hyperlinks were an important measure of popularity and that anchortext in the web index should weight the page title. Previous search engines had only used the text of the web pages themselves, so the addition of these two data sets rocketed Google’s search up the leaderboard.

Another Google example is from Adwords - the keyword auction model. Overture popularized advertisers bidding on keywords but Google significantly improved the results by adding additional data: the click-through rate (CTR) on each ad. This change made Google’s ad marketplace much more efficient than Overture’s and again the point is the algorithm itself isn’t the key component but rather the addition of new data.

So we aren’t working on incrementally improving the algorithms around personalized search and recommendations for movies - we are working on adding more data, the right data, to fundamentally change the quality of those recommendations.

The obvious question remains: can’t Netflix do that also? they could, but they would have to radically change the information they collect from each user, on each users social circle, on each movie, and to top it all off develop the right algorithms. Still sound easy?