Thursday, September 27, 2007

A great application for named entity recognition

1) Locate a highly polar, drippingly opinionated piece of political bloggery.

2) Identify all person names.

3) Randomly replace them with other figures.

4) Repost the story somewhere else, and record whether anyone can detect the change.
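
Steps 2 and 3 can be sketched in a few lines of Python. This is only a toy illustration: the name list `KNOWN_PEOPLE` and the functions `find_person_names` and `swap_person_names` are hypothetical, and a fixed name list stands in for a real named entity recognizer, which would be needed for real blog text.

```python
import random
import re

# Hypothetical stand-in for a real NER system: a fixed list of known names.
KNOWN_PEOPLE = ["Alice Adams", "Bob Brown", "Carol Clark"]


def _name_pattern(name):
    # Match the name only as whole words.
    return r"\b" + re.escape(name) + r"\b"


def find_person_names(text, gazetteer=KNOWN_PEOPLE):
    """Step 2: return the set of known person names that occur in the text."""
    return {name for name in gazetteer if re.search(_name_pattern(name), text)}


def swap_person_names(text, replacements, rng=None, gazetteer=KNOWN_PEOPLE):
    """Step 3: replace each detected name with a randomly chosen other figure.

    Returns the rewritten text and the name mapping that was applied.
    """
    rng = rng or random.Random()
    found = sorted(find_person_names(text, gazetteer))
    if not found:
        return text, {}
    # Each detected name gets one replacement, used consistently throughout.
    mapping = {name: rng.choice(replacements) for name in found}
    # Substitute in a single pass so a replacement is never itself replaced.
    longest_first = sorted(found, key=len, reverse=True)
    pattern = re.compile("|".join(_name_pattern(n) for n in longest_first))
    return pattern.sub(lambda m: mapping[m.group(0)], text), mapping


if __name__ == "__main__":
    post = "Alice Adams blamed Bob Brown, and Bob Brown blamed Alice Adams."
    new_post, mapping = swap_person_names(
        post, ["Dana Diaz", "Evan Ellis"], rng=random.Random(0)
    )
    print(mapping)
    print(new_post)
```

Steps 1 and 4 (finding the post, and recording whether anyone notices) are left to the human collaborators.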

If successful, it could reduce the time required to generate political blogs by 50%, a huge gain for the economy. I just got this idea in a flash while reading this. If you're interested in collaborating on this research, let me know.

Saturday, September 22, 2007

Now this is funny

No longer timely, but catching up on my RSS, here's a line from an Ars Technica piece, "Sun to sell Windows Server boxes":


The move comes as a definite surprise, even for veteran Sun-watchers who are used to seeing the company switch strategies on a dime (see the historical chart above).

I always get it backwards

An interesting Ars Technica post reviews evidence that some of us are susceptible to sales pitches that start by confusing you, and then clear things up. All these years, I've been following the opposite model for talks and papers: always start off with something that everyone can understand, even if you end up presenting something dense and hard to follow later on. From now on I'll put my hardest math in the abstracts, or maybe even the title.

A puzzle

Knowledge is power - so why are consumers (and voters) so willing to give up control of information about their own behavior, social life, etc. to companies (or the government)? I just followed some links from Fernando's blog and spent an intriguing time reading up on Vendor Relationship Management (VRM) and some related projects. Giving consumers more control over their data is an old idea, and one that has never really taken off. Maybe the real market niche is for systems like Farecast, which do data-mining of corporate behavior and put the information in the hands of consumers.

(Disclosure: I'm on the advisory board of Farecast, and anyway would be a natural fan of any company that uses "classic rule learning".)

Monday, September 03, 2007

Economics of data-sharing

Last month I posted some speculations that the limited use of public transit was as much a problem of understanding how to use it as a problem of availability and cost. I got a comment from a legitimate expert on this, Joe Hughes, who pointed out that decent data exchange formats exist but "transit agencies haven't traditionally seen the value of sharing their information with outside developers". (Joe also shared a pointer to his blog, which includes an interesting post on geocoding ideas, btw.)

The economics of data exchange is something I've thought about in the past. I think that efficient schemes for sharing information are best thought of as public goods - i.e. goods that have positive externalities - i.e. things that benefit everyone, but don't benefit any one party enough for that party to actually want to pay for them. One classic example of a public good is a road system (which benefits both the merchants and the consumers that it connects). Like a road, data-sharing schemes in general connect data consumers and potential data providers.

Unlike a road, however, the connected parties don't usually exist until some data-sharing scheme is possible - so you often see "chicken and egg" problems: there's no point establishing a standard until somebody's ready to use the published data, but nobody's going to be ready until the data is there. You can jump-start the system from either end (or from a third neutral point, by scraping data sites and standardizing the format - let me tell you about this company I used to work for...) but that is always risky, since it's hard to guess the size of a market that doesn't exist.