Why CAPTCHAs are evil

Have you ever signed up for an internet forum or web app? Chances are, you’ve seen a CAPTCHA: a little image with distorted letters demanding that you prove that you are human. Or is that what it is really asking? Perhaps instead, it’s asking that you prove that you’re sighted.

Phoney Security
CAPTCHAs really just provide the illusion of security

CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart) are supposed to prevent automated computer programs from posting spam messages on public forums. A CAPTCHA is a kind of Turing test, built on the assumption that computers cannot pass it.

Language Matters and Computational Linguistics

I’m rather disappointed with the second edition of Language Matters, by Donna Jo Napoli and Vera Lee-Schoenfeld. The second edition, published in 2010, makes some minor updates to the 2003 first edition and adds some new material.

Chapter 7, “Can computers learn language?”, received only minor edits that update references in the examples: the term VCR has become DVR. The examples themselves, however, have not changed, and neither has the authors’ conclusion.

The two examples they use are:
1) Record “Law and Order” at 9 P.M. on Channel 10.

2) If there’s a movie on tonight with Harrison Ford in it, then record it. But if it’s American Graffiti, then don’t bother because I already have a copy of that.

As Napoli notes (Lee-Schoenfeld was not involved in the first edition), this task would involve asking the computer “to scan a list of TV programs, recognize which ones are movies, filter out the particular movie American Graffiti, determine whether Harrison Ford is an actor in the remaining movies, and then activate the ‘record’ function on the DVR at all the appropriate times on all of the appropriate channels” (Language Matters, 2nd edition, p. 99). Napoli goes on to suggest that “we’d be asking the computer to work from ordinary sentences, extracting the operations and then properly associating them with the correct vocabulary items, a much harder task” (Language Matters, 2nd edition, p. 99).

Of interest here is that Napoli’s summary does not follow the order in which the command would actually be parsed. In particular, Napoli filters out American Graffiti before performing any search for Harrison Ford. This strikes me as strange, since the first step in parsing this statement would be the same for a linguist as for a software parser: parse the first sentence before attempting to add context from the second.

Napoli and Lee-Schoenfeld make several bold, definitive statements throughout the text that I found lacking in support, but in this case they simply dismiss the idea as a “much harder task”. That statement may have earned a bare pass in 2003, but in 2010 it’s a harder sell. Admittedly, the Jeopardy! showdown with IBM’s Watson had not yet taken place, but in a revised text I would expect some level of research to validate these claims. There are several journals on computational linguistics available, such as Computational Linguistics, which has been open access since March 2009.

In particular, the example given above is domain-specific. It deals with television-specific language, for which there are databases of the relevant terms, such as movie titles and casting information.
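To make the point about ordering and domain data concrete, here is a minimal Python sketch of the second example. It assumes the two sentences have already been understood as two constraints, so it says nothing about the language-understanding step itself. The `Listing` type, the function names, and the channel numbers and start times are all hypothetical, invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Listing:
    """One entry in a hypothetical TV listings database."""
    title: str
    kind: str      # "movie" or "series"
    cast: tuple    # actor names
    channel: int   # made-up channel number
    start: str     # made-up start time, 24-hour clock


# Hypothetical listings for one evening (channels and times are invented).
LISTINGS = [
    Listing("American Graffiti", "movie",
            ("Richard Dreyfuss", "Ron Howard", "Harrison Ford"), 4, "20:00"),
    Listing("Witness", "movie", ("Harrison Ford", "Kelly McGillis"), 7, "21:00"),
    Listing("Law and Order", "series", ("Sam Waterston",), 10, "21:00"),
    Listing("Jaws", "movie", ("Roy Scheider", "Richard Dreyfuss"), 12, "22:00"),
]


def movies_with_actor(listings, actor):
    """Constraint from the first sentence: movies featuring the given actor."""
    return [item for item in listings
            if item.kind == "movie" and actor in item.cast]


def excluding_title(listings, title):
    """Constraint from the second sentence: drop one title already owned."""
    return [item for item in listings if item.title != title]


def to_record(listings):
    """Apply the constraints in the order the sentences give them."""
    candidates = movies_with_actor(listings, "Harrison Ford")
    return excluding_title(candidates, "American Graffiti")


for item in to_record(LISTINGS):
    print(f"record {item.title} on channel {item.channel} at {item.start}")
```

With the sample data above, only Witness is left to record. The exclusion from the second sentence is applied as a refinement of the first sentence's result, which is the order the command itself suggests.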

Even before Watson, I would not have considered a problem of this scope to be extraordinarily difficult, primarily because of the limited domain. A more general domain would increase the difficulty considerably, but current research looks more hopeful. Computers are still not ready to pass the Turing test, though there are some indications that this may happen in the relatively near future.

Language Matters is a very accessible text, covering many aspects of language and linguistics for readers without much experience in the field. Aside from the chapter on computers and language, the book provides a good introduction to a number of topics. I only wish that, in the revision process, the authors had revisited some of their conclusions about what is an active field of research.