Friday, December 28, 2007

Collective Kvetching on The Point

Over the fall semester, I got too busy to blog, but I bookmarked a couple of items that seemed to be worth thinking about. One was an article on Webware about a site called The Point.

The Point is a new site to help instigators collect the wishes of the masses and to get participants to pledge to take action when a "tipping point" of participation is reached.

For example, if you are upset that Southwest Airlines no longer lets families with small children board first, you can join the pledge to boycott Southwest once 2,000 other people also sign up. As soon as the desired number of people sign on to the campaign, the pledge is activated. But if they don't, you're not left twisting in the wind executing a meaningless protest.

The Point can also work with financial action: You can join a pledge to participate in an event if enough other people sign on, as well. If the pledge goals are met, your credit card (that you've previously submitted) is debited. If not, you're not charged.
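The tipping-point mechanism described above can be sketched in a few lines. This is only an illustration, assuming a hypothetical `Pledge` class and made-up field names; it says nothing about how The Point is actually built.

```python
from dataclasses import dataclass, field


@dataclass
class Pledge:
    """A campaign that only takes effect once enough people have signed on."""
    goal: str
    threshold: int                                # the "tipping point"
    signers: set = field(default_factory=set)

    def sign(self, person: str) -> None:
        # A set means that signing twice still counts once.
        self.signers.add(person)

    @property
    def active(self) -> bool:
        return len(self.signers) >= self.threshold

    def charges(self, amount: float) -> dict:
        """Amounts to debit: every signer pays if active, nobody otherwise."""
        if not self.active:
            return {}
        return {person: amount for person in self.signers}
```

The point of the design is that `charges` returns nothing at all until the threshold is met, so nobody is left acting (or paying) alone.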

I wish I'd thought of this---and I hope that The Point reaches its own tipping point in use. It's great to see people coming up with new collective action using the web, and it will be interesting to see how/if this catches on once it is integrated with social networks (which is planned).

Saturday, November 24, 2007

(video/quicktime Object)

From Mike Berman's blog, the SoCal version of Stonehenge, but with more computers involved in the design. Of course in Pittsburgh's weather it would take all of November to see one cycle of the poem completely - would that make it better or worse?

Tuesday, November 13, 2007

StupidFilter :: Main / About

StupidFilter is "an open-source filter software that can detect rampant stupidity in written English. This will be accomplished with weighted Bayesian or similar analysis and some rules-based processing, similar to spam detection engines. ... To this end, we're collecting a ranked corpus of stupid text, gleaned from user comments on public websites and ranked on a five-point scale."

I wonder, to what extent can stupidity be modeled with a unigram distribution? What is the overall distribution of stupid comments? And how many random stupid comments does the average person look at before moving on?
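For the unigram question, here is a minimal sketch of what such a model might look like: a naive Bayes classifier over word counts with add-one smoothing. The class name and labels are hypothetical, and this is not StupidFilter's code, just the simplest model of the kind the quote hints at.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9']+", text.lower())


class UnigramClassifier:
    """Naive Bayes over unigrams with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}          # label -> Counter of words
        self.doc_counts = Counter()    # label -> number of training comments

    def train(self, text, label):
        self.word_counts.setdefault(label, Counter()).update(tokenize(text))
        self.doc_counts[label] += 1

    def log_score(self, text, label):
        # Vocabulary is the union of all words seen under any label.
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        counts = self.word_counts[label]
        total = sum(counts.values())
        # Log prior for the label, then one smoothed log likelihood per word.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for word in tokenize(text):
            score += math.log((counts[word] + 1) / (total + len(vocab)))
        return score

    def classify(self, text):
        return max(self.word_counts, key=lambda label: self.log_score(text, label))
```

A unigram model like this ignores word order entirely, which is exactly why it is an interesting question how far it would get.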

Thursday, November 08, 2007

Does your party affiliation begin with a 'D'?

Avi Rubin asks: Does your home address begin with a '5'? Interesting story, although it is colored by Avi's unfortunate skepticism about the reliability of modern compu^@^@^@^@^@^@^@ //SYSIN DD* ERROR MISSING DLL

Wednesday, October 31, 2007

AT&T Slashdotted

...and also Wired-Blog-Networked (?) with this story about a research paper from 2001 about a language for data-mining, more provocatively called a "Programming Language for Mass Surveillance". While I'm not happy with much of what my former employer seems to have gotten up to since I left, I guess I know too much of the backstory on this to subscribe to this particular round of hysteria. As an employee back in the late 90's and a sometime colleague of the principals on the paper I'd readily believe that the original purpose of "Hancock" was, as claimed, problems more like detecting long-distance fraud, than supporting the NSA and their ilk. (As an aside, at least 2/3 of the authors of this paper are now at Google).

This all points out an intriguing problem, really, for those of us engaged in R&D. Any sufficiently general tool can be used for many purposes, some good and some evil - and as tool creators, we have little control over the eventual effect of what we do. In fact, the better you are as a scientist and engineer, the more general-purpose your results and tools will be - so in CS, at least, it's unlikely that you won't facilitate something unpleasant sometime in your career. Certainly, we can make choices about where we work and how we direct our energies, but the bottom line is that it is not as scientists, but as citizens (and consumers) that we need to decide to what uses technology will be put.

And one man's language for mass surveillance might be another man's language for analyzing protein-protein interactions to look for cancer cures.

Friday, October 19, 2007

Take it back! Take it back!

The hardest part of having a secret revealed is tracking down everyone that heard it and forcing them to forget it. From a story titled Breathtaking Abuse of the Constitution: a confrontation between the Phoenix New Times and one Sheriff Arpaio (in which the Sheriff's home address appeared in a PNT op-ed story) has degenerated into a subpoena asking for, among other things, the domain name and IP address of anyone that has accessed the Phoenix New Times website since 2004.

Early on the story says "It is, we fear, the authorities' belief that what you are about to read here is against the law to publish" and indeed the authors spent a night in jail for writing it, so you know it's worth a read.

Wednesday, October 10, 2007

Another interesting post from Lauren Weinstein

Can you control what updates happen on your computer, or not? And who should have control? In Lauren's words: "who has the right to ultimately control operations on a system -- the owner of the computer itself, or a software vender?"

I bet you'll never guess what Microsoft's answer to this question is.

Maybe this is why AT&T bricks hacked iPhones?

From Eweek, via Dave Farber's IP mailing list:

The iPhone has been turned into a "pocket-sized … network-enabled root shell...A rootkit takes on a whole new meaning when the attacker has access to the camera, microphone, contact list and phone hardware. Couple this with 'always-on' Internet access over EDGE and you have a perfect spying device".

Thursday, September 27, 2007

A great application for named entity recognition

1) Locate a highly polar, drippingly opinionated piece of political bloggery.

2) Identify all person names.

3) Randomly replace them with other figures.

4) Repost the story somewhere else, and record whether anyone can detect the change.

If successful it could reduce the time required to generate political blogs by 50%, a huge gain for the economy. I just got this idea in a flash while reading this. If you're interested in collaborating on this research, let me know.
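For prospective collaborators, steps 2 and 3 might be sketched as below. This assumes the person names have already been found by some named entity recognizer (the `found_names` argument stands in for its output), and the function name and arguments are hypothetical.

```python
import random
import re


def swap_names(text, found_names, replacement_pool, seed=None):
    """Replace each name in found_names with a randomly chosen different name.

    found_names stands in for the output of a real NER system (step 2).
    Each name is mapped to the same replacement everywhere in the text
    (step 3). Assumes replacement_pool holds at least one name that
    differs from each found name.
    """
    if not found_names:
        return text, {}
    rng = random.Random(seed)
    mapping = {}
    for name in found_names:
        candidates = [n for n in replacement_pool if n != name]
        mapping[name] = rng.choice(candidates)
    # Longest names first, so a full name is matched before a bare surname.
    ordered = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in ordered) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(0)], text), mapping
```

Returning the mapping alongside the rewritten text makes step 4 easier, since you need to know which swaps readers did or did not notice.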

Saturday, September 22, 2007

Now this is funny

No longer timely, but catching up on my RSS, here's a line from an Ars Technica piece, Sun to sell Windows Server boxes:

The move comes as a definite surprise, even for veteran Sun-watchers who are used to seeing the company switch strategies on a dime (see the historical chart above).

I always get it backwards

An interesting Ars Technica post reviews evidence that some of us are susceptible to sales pitches that start by confusing you, and then clear things up. All these years, I've been following a model for talks and papers: always start off with something that everyone can understand, even if you end up presenting something dense and hard to follow later on. From now on I'll put my hardest math in the abstracts, or maybe even the title.

A puzzle

Knowledge is power - so why are consumers (and voters) so willing to give up control of information about their own behavior, social life, etc to companies (or the government)? I just followed some links from Fernando's blog and spent an intriguing time reading up on Vendor Relationship Management (VRM) and some related projects. Giving consumers more control over their data is an old idea and one that has never really taken off. Maybe the real market niche is for systems like Farecast, that do data-mining of corporate behavior and put the information in the hands of consumers.

(Disclosure: I'm on the advisory board of Farecast, and anyway would be a natural fan of any company that uses "classic rule learning".)

Monday, September 03, 2007

Economics of data-sharing

Last month I posted some speculations that the limited use of public transit was as much a problem of understanding how to use it as a problem of availability and cost. I got a comment from a legitimate expert on this, Joe Hughes, who pointed out that decent data exchange formats exist but "transit agencies haven't traditionally seen the value of sharing their information with outside developers". (Joe also shared a pointer to his blog, which includes an interesting post on geocoding ideas, btw).

The economics of data exchange is something I've thought about in the past. I think that efficient schemes for sharing information are best thought of as public goods - ie goods that have positive externality - ie things that benefit everyone, but don't benefit any one party enough for them to actually want to pay for it. One classic example of a public good is a road system (which benefits both the merchants and consumers that are connected). Like a road, data sharing schemes in general connect data consumers and potential data providers.

Unlike a road, however, the connected parties don't usually exist until some data-sharing scheme is possible - so you often see "chicken and egg" problems: there's no point establishing a standard until somebody's ready to use the published data, but nobody's going to be ready until the data is there. You can jump start the system from either end (or from a third neutral point, by scraping data sites and standardizing the format - let me tell you about this company I used to work for...) but that is always risky, since it's hard to guess the size of a market that doesn't exist.

Thursday, August 30, 2007

Good news for privacy advocates or telcos - I'm not sure which

Here's a nice article on DCSNet, the FBI's "sophisticated, point-and-click surveillance system that performs instant wiretaps on almost any communications device" (from Lauren Weinstein). From my years at AT&T I'm dubious that the technology works with the 1984-like seamless smoothness the article suggests, but this part sounds accurate to me:

Despite its ease of use, the new technology is proving more expensive than a traditional wiretap. Telecoms charge the government an average of $2,200 for a 30-day CALEA wiretap, while a traditional intercept costs only $250, according to the Justice Department inspector general. A federal wiretap order in 2006 cost taxpayers $67,000 on average, according to the most recent U.S. Court wiretap report.

Wednesday, August 29, 2007

As summer finally ends...

Back from vacation, I'm so behind in everything it's amazing...starting with the news....
  • What's cooler than printing 3-D objects? Maybe printing human organs? This is only a little bit far-out...and even modest success would be hugely interesting, not necessarily to clinicians, but to developmental biologists and others that study how cells interact in tissues.
  • Some useful warnings about the "cult of FireFox" and the evil of Ad Block Plus. Remember, not reading ads is theft. And not citing my papers - that causes uncontrollable weight gain.
  • Hal Daume promises to automate the construction of LDA-like statistical models. Well, at least partly. A fascinating idea, although a challenging one...for whatever reason the ML community doesn't seem to take to these sorts of high-level tools. AutoBayes and WinBUGS are prior efforts along these lines.
  • The scary privacy-infringement stories of the week: from Ars Technica, we learn that China is to begin web monitoring with Clippy-style animated police and, if that's not horrifying enough, a confessed movie pirate has been ordered to switch to Windows by the court, so his parole officer can install the appropriate monitoring software. (However, the rumors that the court also ordered a switch from Emacs to Notepad are apparently false.)
  • In related news, the EFF's suit against AT&T may have gotten stronger: even though National Intelligence Director Mike McConnell has said before that "the disclosure of any information that would tend to confirm or deny... an alleged classified intelligence relationship between the NSA and MCI/Verizon, would cause exceptionally grave harm to the national security" he, oops, confirmed that AT&T was assisting in surveillance: "under ... the terrorist surveillance program, the private sector had assisted us...and they were being sued". And the DoD's official web sites are more than 100x more likely to leak sensitive information than milbloggers.

Friday, August 17, 2007

Please re-calibrate yesterday's posting...

Today, the same news source has a post on DARPA's bootstrap learning project which, while accurate enough in most of the details, has a pretty high hyperbole factor. (I helped write a grant for this program, which is novel enough, but not exactly "far far far out" - it has very concrete one-year deliverables, and is a pretty typical DARPA 6.1 program in terms of risk and innovation.)

Thursday, August 16, 2007

Nothing to hide, and nowhere to hide it

Shenzhen, China attempts to take the next step in citizen surveillance with 20k cameras "equipped with 'intelligence'".

Tuesday, August 14, 2007

Famous researchers barking about SEAL

Richard Wang's SEAL system has gotten a few hits from the curious since my posting. Now it's getting some discussion from Matt Hurst and Fernando Pereira.

Thursday, August 09, 2007

Hard to do vs hard to figure out

I'd like to argue for a second that better websites might do more to stop global warming than hybrid cars....

Sometimes it's easier to just go with what you know than to try something new. Case in point - I was down in DC earlier this week, and even though I had a car and free parking at my final destination, I ended up taking the Metro to my workshop rather than driving - not because it was faster or cheaper, but just because it was easier. I understand the Metro; but I get lost driving around DC on my own. Likewise, when I first moved to the NYC area, I tended to walk or take cabs more frequently from point-to-point in Manhattan...then, after I'd figured out the baroque and arcane subway system (and no longer found myself mysteriously deposited in Brooklyn at random intervals) I used it instead.

Now I live in Pittsburgh and one of the fringe benefits of my job is a free bus pass. So, I take the bus everywhere, right? Well, not really - I use a few routes I know, and drive most other places. The main obstacle, I think, is just not knowing what bus to take when. It's easier to drive. But that's changing...

Matt Hurst has a nice run-down of mapping systems that give information on public transit (or ways to walk instead of drive). Google transit also has a great system for Pittsburgh (and a handful of other cities). I think all that's really needed to make public transit more widely used is some tools like this, development of the BusML standard for route information, free municipal wifi and something to access it with that fits in my pocket.

Tuesday, July 31, 2007

Flyer's Rights

You gotta love anything called a bill of rights for airplane passengers. How do I get them to add the right not to be sold credit cards while on the plane?

A paper emerges from the tunnel of anonymous review

Overall I'm in favor of anonymous review, but one of the many annoyances of it is that during the review process I always feel reluctant to advertise. But now...

Richard's paper describing his "Set Expander in Any Language" system was accepted to ICDM. Of all my students Richard's the only one who really enjoys building; if you don't have time to read the (preliminary version of the) paper you can play with the demo.

Friday, July 27, 2007

Deep packet inspection meets 'Net neutrality, CALEA: Page 1

There's a very interesting, in-depth discussion of "deep packet inspection", and some of its implications, on Ars Technica. DPI is diving into the packets flowing through an ISP, and opening them up to inspect the content - e.g., there are commercial tools to identify the type of traffic (virus vs YouTube video vs iTunes download vs chat vs email vs ...). "Flow analysis" is assembling packets together (e.g., to reconstruct an email message), and that's also commercialized. DPI products that are "CALEA-compliant" can collect and offload a user's datastream (CALEA is the "Communications Assistance to Law Enforcement Act") - usually this stuff is farmed out by an ISP to specialists. Once packets (or flows) are classified it's also possible to impose rules - e.g., squash viruses, eliminate denial-of-service attacks, disallow on-line games for non-premium users, or slow down traffic from, say, YouTube to a crawl unless Google pays up a designated fee.

According to the article, current DPI systems classify packets using signature-based methods, much like anti-virus systems do. This makes a lot of sense if you're only interested in Personally I'm surprised that machine learning isn't used in this step yet - but I suspect that this will happen before long.
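To make the signature-based step concrete, here is a toy sketch: match the first bytes of a payload against known protocol prefixes, then look up a rule for the resulting class. The signature table and the policy are made up for illustration (the byte prefixes themselves are standard protocol openings), and real DPI products are of course far more elaborate.

```python
# Illustrative byte signatures only; commercial products use much larger
# signature sets than this.
SIGNATURES = [
    ("bittorrent", b"\x13BitTorrent protocol"),  # BitTorrent handshake prefix
    ("http", b"GET "),
    ("http", b"POST "),
    ("smtp", b"EHLO "),
    ("smtp", b"HELO "),
]

# Hypothetical ISP policy: what to do with each traffic class.
POLICY = {
    "bittorrent": "throttle",
    "http": "allow",
    "smtp": "allow",
}


def classify(payload: bytes) -> str:
    """Return the traffic class whose signature starts the payload."""
    for label, signature in SIGNATURES:
        if payload.startswith(signature):
            return label
    return "unknown"


def action_for(payload: bytes, default: str = "allow") -> str:
    """Look up the policy action for a payload; unknown traffic gets the default."""
    return POLICY.get(classify(payload), default)
```

A learned classifier would replace the hand-written `SIGNATURES` table with a model trained on labeled flows, which is the step I'd expect to see change.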

Thursday, July 26, 2007

Arts - Queen's Brian May to complete astrophysics doctorate

Brian May, the 60-year-old former Queen guitarist, just submitted his doctoral thesis in astrophysics on "Radial Velocities in the Zodiacal Dust Cloud" (Imperial College, London). When I retire I think I'll become a rock star.

Data Mining: Text Mining, Visualization and Social Media: LinguisticAgents

Matt Hurst has been spying out the action at AAAI. Today's post on LinguisticAgents (an Israeli company) is interesting. From a quick read, NanoSyntax is combining morphology with syntax - which makes loads and loads of sense in Hebrew, certainly, and similar methods have been effective in other languages - Klein et al have a very nice paper from CoNLL a few years back on character-level models for NER, to give one example.

I always find it amusin', though, how different industry types and academics pitch their intellectual wares. AI companies are so often based on transformational revolutionary brand-new ideas (if you believe the white papers), whereas most of us longhair university folks are plugging away with incremental improvements to the big idea from, say, three years ago. Does that seem backwards to anyone else?

What are you willing to do - for science?

Of course you trust your ISP, but next time you're using a friend's, visit

It's, you know, a good cause.

Wednesday, July 25, 2007

It's a sign!

While reading the blogs on my front porch, my laptop ran out of power and hibernated while I was reading this. (Discovered from Boingboing).

Monday, July 23, 2007

More email leaks

For those that are interested in information diffusion processes: I got an update from Vitor on publicity on our paper on "detecting email leaks" - there have been a dozen or so followups to the news stories I mentioned back in June. Interestingly all of these are in Portuguese:


Research, high-tech and general news websites

At least 30 comments on a discussion board, asking the question: "Have you ever sent an email to the wrong person?"

And some news in Portugal, pretty much echoing the news from Brazil

I guess there's some element of "local news" with these, since Vitor's Brazilian, but there's nothing particular about the content of the story that seems especially Brazilian - it's just a technique to avoid a class of email-related errors (and was evaluated, in fact, on the Enron corpus, which is all English.) It's interesting that language is as much of a barrier as it appears to be for the spread of high-tech news.

If you don't own an iPhone yet...

It got easier. Penetrated, completely compromised; the hack validated by Steve Bellovin and Avi Rubin; published in the NYT; and even slashdotted. On the plus side the guy that hacked it is a former employee of the NSA.

Monday, July 16, 2007

Friday, July 13, 2007

The laughter curve

Ok, maybe this is politics...but it's also a great example of really bad data analysis. Maybe there is a Laffer curve, but you sure have to look hard to see it here. I wonder how many economists would choose the line in the top graph over the one in the bottom graph, if all they saw were the points, without any labels?

Monday, July 09, 2007

Sunday, July 08, 2007

Personalization and polarity

In the more than 20 years I've been studying AI I've discovered that every decent knife has two edges. Even those knives that seem like really, really cool ideas when you first grok them. How could giving everyone a free editorial column be a bad thing? And how could improving access to all that new content not be beneficial?

Anyway, I can't resist responding to Fernando's response to Matt's response to my response to Lauren Weinstein's posting (are you following all this?) on search-term polarity.

The original post by Lauren Weinstein that triggered this thread was about the visible global impact of search rankings, but William's discussion suggests a less global but possibly more powerful effect in search personalization, of whether a personalization algorithm could become a strong reinforcer of prejudice without the counter-pressure of critical discussion of globally visible search ranking.

Here's an even broader suggestion: could just having more and better access to more and more diverse content have the same effect - i.e., is the growing blog world "a strong reinforcer of prejudice without the counter-pressure of critical discussion of globally visible" content? It's certainly easy enough to fill time reading political commentary that you can be 99% sure you'll agree with - and look how bitterly partisan the country has become, and how little is now universally accepted as correct.

Maybe Matt or Fernando know whether anyone's ever looked into whether the effect I'm speculating about is real - and if it is, what could we scientists do to create the appropriate "counter-pressure". Ideas, anyone?

Friday, July 06, 2007

iShouldn't have wondered about security

According to RixStep, the iPhone includes, among its many OS X-based cool features, a root password of 'dottie'.


Here's something interesting, I wonder how far it will go. Wikipedia's been so successful, a lot of people seem to be trying to take a further step in that direction. So far, Freebase is my favorite project of that sort - they have a very precise clear vision that's easy to convey (if not accomplish).

Thursday, July 05, 2007


According to Matt Hurst the iPhone buzz has peaked which means that I'll be fashionably late with my contribution...

I don't expect to get an iPhone anytime soon - I'm still perfectly happy with my ancient Samsung i500. But I'm amazed that Apple decided to lock in to using AT&T as a provider. (Apparently the first day the iPhone was released AT&T Edge had widespread network outages, affecting non-iPhone users as well as iPhone users.) And I'm thrilled that Apple decided on a closed platform - it will be way amusing to follow the inevitable opening of the iPhone by hackers. The checklist below (from IP) is already yesterday's news:
  1. Break DMG Password *COMPLETE*
  2. Break Activation *COMPLETE*
  3. Unlock Phone
  4. Run Third Party Applications
  5. Allow DUN/Tethering
  6. Remove IMEI Transmitting
  7. Enable Disk Mode
Yes, it would be cool to be able to use an iPhone on whatever network you want - but there's a socially interesting issue here also. My smartphone holds my information, some of which is potentially quite private (e.g., what doctors do I go to? when am I out of my house?), and which we know is not always very secure. I think I know how and where my Palm/phone device keeps this information, but on a networked device with a closed architecture, what guarantees do you have? We know how careful Apple has been with DRM in iTunes - it'll be interesting to see how much effort Apple has put into locking up its customers' information.

Friday, June 29, 2007

Research Publicity

Last year my student Vitor Carvalho got the clever idea of using machine learning methods to detect when an email is mistakenly sent to the wrong person (an "email leak"). A few months after our publication of some results on this, the idea's getting a little publicity:

Search-term Polarity

Via Farber's "Interesting People" list: Lauren Weinstein has a nice discussion of a well-known bug or feature of Google's ranking method - namely, if you search for "Jew" the top-ranked page is, well, sort of uncomplimentary to us Members of the Tribe. Google's analysis is that "Jew" is used more in an "anti-Semitic context".

Lauren's comments here are interesting, and raise some nice questions. If my search term indicates I'm an anti-Semite, should I get a page that's ranked highly by other anti-Semites, or one that's ranked highly by the (hopefully) larger general community? What if my search term indicates I'm a creationist? A disbeliever in global warming? Arguably there's a continuum between Google-bombing (ie, manipulation of search results by a small group) and just exploiting linguistic regularities of a subcommunity to give better search results.

Wednesday, June 20, 2007

Dangers of eating lunch at home

Durn, I missed CMU's "Dynamic Balance Festival" and a chance to try out an extreme pogo stick.

Tuesday, June 19, 2007

AT&T Boing-Boinged

My old employer AT&T makes the most-read blog, Boing Boing.

But even at $10/month, AT&T DSL should be avoided like the plague. These are the scumbags who illegally wiretapped the entire Internet for the NSA, who broke net-neutrality to find "copyright infringements," and who inspired NBC to call for a law requiring all ISPs to do the same (imagine -- a law forbidding network neutrality!). Seriously: the only day I wouldn't piss on AT&T is if they were on fire.

This isn't politics, btw - it's technology.

Monday, June 18, 2007

The Shape of Things to Come?

The book The Long Tail is about the way the market for "soft" goods (like music) has changed with the internet. The last chapter discussed some interesting new technologies, including 3D printers - with speculation that they will ultimately force the same sort of changes in the market for real, physical goods. Right now 3D printers are expensive and slow, but they are already making an impact in certain niches--for instance, a talk at CMU a few years ago made a great case for using them to print 3D models of proteins. (Believe it or not, holding one for a minute or two beats any CHIME visualization). A good friend of mine recently blogged a visit to Desktop Factory, which makes 3D printers, and posted a bunch of interesting images. Look at the bones, man!

Friday, June 15, 2007

Symbol grounding and relations

I've been reading a number of Peter Turney's papers and I've lately been catching up on his blog, which has a number of really interesting posts related to his work - and remarkably related to the things I've been pondering over the last few months.

For instance: machine learning spent years learning how to recognize classes of objects by their attributes; a popular topic now is collective classification, or recognizing the class of an object (in part) by considering how that object is related to other objects. Are attributes only a "convenient fiction" - a useful abstraction that ultimately must be defined in terms of relations? Is an apple intrinsically red, or is "redness" something that describes the interaction of the apple and the sensory system of the observer? Likewise, is every attribute properly a description of an object and some sensory system or measurement instrument?

This seems like a rather strange and abstract question, but it's intimately connected with the "symbol grounding problem", the subject of another of Peter's posts, which in turn is connected with my long-standing interest in data integration - a very practical real-world problem. To combine data from two heterogeneous knowledge bases is, properly speaking, impossible to do automatically: they are different formal systems, and there is no surefire way to translate between them. There is no common ground. By the same token, communication between two people is also impossible. How do we know that what I mean by "red" is the same as what you mean?

The solution to the problem, for human communication, appears to be that language is grounded - in part by common perceptual systems. (Goleman's book Social Intelligence is a nice description of some of the ingenious mechanisms that have evolved for establishing this common grounding.) The effect is that I don't really know that what I mean by "fear" is the same as what you mean; and in fact, it may not be the same. But if we're both neurologically typical it is almost certainly highly similar to what you mean.

Back to attributes and relations - I'm not sure, but I think the apparent primacy of relations starts to emerge when you start thinking about these issues. Everything is ultimately defined in terms of relationships with other things, which are finally grounded in our own perceptions--those few things that we don't need words to explain or understand.

Wednesday, June 06, 2007

Is it real or is it demoware?

From Lynn Monson. These guys have so been watching Minority Report.

Google Streetview

There's lots and lots of talk about Google Streetview, including quite a lot of gosh-what-about-privacy? postings: eg there's a rundown of comments in BoingBoing today. One that I resonate with is:

What is the difference between posting a picture of people on a public street on Google Street View or on Flickr?

Obviously - the information is the same, only the information access is different. There's no difference in kind of information - just a qualitatively easier interface to getting it. Anyone can take a plane and a taxi and take a picture of my house - but I don't expect that anyone will bother. So it's jarring to find out suddenly that anyone with a DSL line and 30 sec to spare can get the same effect.

Our expectations of privacy (and privacy laws) are driven by what's easy, and what's likely, not what's theoretically possible.

My personal take on this - it's another great example of how smoothly information can be integrated together, when it's all grounded in the physical world. Entries from maps + business listings + satimages + street view are relatively easy to search together and visualize simultaneously, since they're all tied together by physical location in space.

New Look and New Resolutions

I'm back to blogging for a while, but I'm staying away from politics. It's just way way too much of a time sink.