Friday, May 16, 2008

We have violated the prime directive

Noah Smith and I are co-supervising Tae Yano on a project involving analysis of political blogs, and Tae left a pile of results and code on her CMU web site as a way of communicating with us...world-readable. Surprisingly, someone at one of the blogs she spidered, Little Green Footballs, actually noticed, leading to a lot of investigative work in this fascinating thread:
Anyone know what this page at Carnegie Mellon means? It’s some kind of experiment that involves comments posted at LGF, and I have a feeling it’s not friendly.
...
#5: Maybe it's post-modern poetry, academia-style? Using LGF comments as gibberish to transcend interpretation?
...
#18: Obviously it's a KGB plot.
...
#12: there's a Kos directory too. Probably just two what they consider to be partisan sites to test some kind of language algorithms or something
...
#20: I think they are doing a text analysis looking for Charles' sock puppets -- like the way some trolls accuse Robert Spencer of posting as Hugh
...
#23: Looks like they are doing another election-year study looking at patterns of argument or other aspects of communication.
...
#47: Who ever thought that garbage labeled as "research" could be so sophisticated and at the same time be so worthless. ... Polishing farts to get Piled higher and Deeper has become, for the most part, an ephemeral exercise in self deception. [ouch! sounds like my last grant review].
...
#48: This isn't that advanced of stuff - we are working on a similar concept for internal corporate communications. This has been around a while.
...
#60: EXPERIMENTATION WITHOUT REPRESENTATION! I demand compensation!
...
#61: I wonder if linking to their experiment will introduce fatal feedback loops, leading to a disturbance in the Krell mind-field and possible space-time discontinuity?
...
#79: It looks to my un-Pythonesque eye like a programmer is using the cosine function to create some kind of mapping on how comments refer to one another based on comment location in the thread; and looking at different types of postings from Charles (open thread, breaking news, news outlet criticisms) to see if there are any trends in the discussions that vary according to the posting type. That has to be one of the worst run-on sentences I've ever written.
...
#92: [#64: "What is port sniffing?"] after you pour some in the glass, but before you take a sip, swirl it around a bit and inhale whilst the dark chocolate is melting on your tongue......
...
#162: Ahh, so basically it's a merger between post-Marxist deconstructionism and THX-1138-esque Orwellian personality dehumanization.
...
#172: I liked the method of withdrawing all posts by 204 commenters, then asking the system to predict what those posts were (or *if* the post existed, I think) based on all the rest.
...
#187: ok here's how I see it. There is a whole group at Carnegie Mellon CS department doing research in Artificial Intelligence by running statistics on blogs. A grad student Tae Yano stashed her research files on a server and left them wide open to WWW. The files pertain to running basic stats on DK and LGF, up to cosine similarity. The AI goal would be to have a blog-commenting software indistinguishable from a blog-commenting human. For all I know, this, or any other comment, is already written by a python script.
...
#277: [Quoted from a grant proposal Noah & I wrote that got unearthed, along with various other papers, pictures of Noah's cats, Tae's cv, her picture, her programming project on knitting, wedding announcement, etc:] "Political text is often indirect, sarcastic, repetitive, hyperbolic, emotional, biased, manipulative, and riddled with unstated assumptions." - Byte me.
...
#314: So some students are doing research on blogs. No big deal. It was interesting to see what it was about, but that's about it. I don't think they feel the need to protect their files. Why should they? Who would want to mess with that? It's just a school project, I don't see point of "exposing" the students and publicizing their information. At least that's my take.
...
#316: I noticed that the research group to which she belongs is partially funded by a DARPA grant.
...
#361: And the same techniques/analyses will probably work on Arabic-language websites, too.
...
#428: Are terrorists ‘phone banking’ for Barack?
...
#474: Imagine a natural language program that could respond to comments with charm and style, sort of a robo-blogger. Now imagine an army of them, all set to monitor a different political blog, run by a campaign manager for a politician. Add to its writing ability an encyclopedic memory, with instant access to famous quotes, historical facts, trivia, statistics, and every word ever uttered by the opposition. You now have an army of ultimate bloggers, all completely under the control of one campaign manager... no more "going off message" by some underpaid/volunteer lackey, just high quality counter-opinion, ready to be inserted into the blogs of anyone who disagrees with your candidate. This research will eventually lead to robo-blogging to kill emerging scandals and alternative opinions on issues... no more Rathergates as they will be smothered in the cradle by the most charming bloggers around -- the poli-bots.
...
#482: She's trying for a data-mining tool tailored for blogs that separates "useful, thoughtful" information from all the mindless dreck in the blogosphere. Lotsa luck with that! As an aside, I've been on the Carnegie Mellon campus, toured the Computer Science department, and met with CS faculty. It's a gorgeous campus. The school clearly has big bucks. CMU holds numerous contracts with various government agencies related to the information technology aspects of defense, computer security, homeland security, and similar "black ops" topics. At least some people on that campus have intimate access to NSA, DOD, and CIA. It's a spooky place.
...
#489: We should, for amusement's sake, keep an eye open for a KOS diary about this. There may be some entertaining histrionics and conniption fits over their being the unwitting subject of DARPA funded research.
...
#492: But but but... Markos is a CIA agent.... dKos is a DARPA funded research project....
...
#496: Don't spill the beans. The Koslings haven't figured any of that out yet. Agent Markos will have a rough time of it when he is exposed as being a double secret agent of the Zionist conspiracy. Don't blow his cover.
...
#497: From this, we conclude that LGF not only has MORE numbers than DKOS, but BIGGER numbers as well. If YOU TOO want bigger numbers, choose LGF brand blogs.
...
#514: Was just thinking that all you people have too much time on your hands...but then it occurred to me I'm sitting here reading all this.
...
#518: Looks like the whole thing has gone 404. My guess is that she just wanted a corpus of data for some programming project and, now that the object under study is aware of her, it's no longer useful. I don't see any dark purpose here. How evil can someone be who writes knitting software?
...
#519: I suspect that if the mice KNOW they are in an experiment, they will not produce the same results, as they would otherwise, thus invalidating the experiment.
...
#520: [re: #21 zombie: "Talk about pointless. People get PhDs for this crap."] Not really pointless. The "value" of this may be questionable though, at this point in time. As the web and blogosphere has grown exponentially in influence, there has been great interest in determining if real life outcomes can be affected by influential opinions posted on blogs -- and then re-created in numerous other places to mimic majority viewpoints. There are numerous companies involved in this research-like activity which can be tailored for marketing, business intelligence, financial trading, political campaign management uses (for example, Umbria of Denver, CO was just sold to JD Power). They use primarily data visualization tools (like those used now by lawyers engaged in electronic data discovery) -- These are similar to the software tools used by intelligence services to monitor, track, analyze, for example, wireless (web/phone) transmissions and discussions originating in the US, and destined for overseas delivery in places like, say, Iran.
...
#531: Yea- shut down. I didn't get to it in time to even see what it was all about. So, now I'm depressed. And stuff.
Since this morning, Tae has chmod'ed all her code to hide it, but I suggested that she keep it visible, since folks were having so much fun with it. (Of course now she's embarrassed and wants to clean up her code first...) Noah and I also wrote an open letter to LGF explaining what was happening.

I'm mostly amused (as you can tell), but also impressed at (1) how much was uncovered about Tae, Noah and me and (2) how much of this obscure statistical NLP code the LGF community was able to figure out, communally. The peril and power of the internet.
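For anyone curious about the "cosine" that commenters #79 and #187 spotted: cosine similarity is a standard way to score how much two pieces of text overlap, by treating each one as a vector of word counts and measuring the angle between the vectors. What follows is only a minimal sketch of that general idea, not Tae's actual code; the function names `bag_of_words` and `cosine_similarity` and the simple tokenizer are made up for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens (hypothetical tokenizer)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors.

    Returns 0.0 if either vector is empty.
    """
    # Only words present in both texts contribute to the dot product.
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


first = bag_of_words("Obviously it's a KGB plot")
second = bag_of_words("Obviously it's a research project")
print(round(cosine_similarity(first, second), 2))  # prints 0.6
```

The two example sentences share three of their five words, so the score is 3/5; identical texts score 1.0 and texts with no words in common score 0.0.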

6 comments:

Husband Of A Democrat said...

Hiya, just wanted to thank you for the late night entertainment at LGF. Pretty cool research you've got going on there. Tell Tae not to feel too embarrassed. It was, after all, a late night thread and the few people that were actually interested in it will probably forget about it by tomorrow. Take care.

William Cohen said...

Thanks Bosforus!

Tim Grossner said...

"I'm mostly amused (as you can tell), but also impressed at (1) how much was uncovered about Tae, Noah and me..." - Could it be you were under the impression that all Conservatives and LGF readers were slack-jawed yokels? :-P

William Cohen said...

"Could it be you were under the impression that all Conservatives and LGF readers were slack-jawed yokels? :-P"

Aw, don't take it like that, Tim - it's just that some of the stuff we do is pretty obscure. I know about LGF and Rathergate, so I know you're not all yokels...on the other hand you have to admit that not all conservatives have full mastery of Google-fu.

:-)

Tim Grossner said...

Agreed - if there is one thing I have learned it's that everyone can make a mistake, and to take everything I see on the Interwebs with a grain of salt... While not exactly the same, that brings to mind this case of "rush" to judgement (see what I did there? :-).

Tim

Unknown said...

Hehehe, not bad.

I read LGF and like the blog, I liked the open letter to Charles and the good people there. It's nice to see you guys took it in stride :) .