Thursday, June 11, 2009

Rapped on the accents

Many search algorithms that I have used, on the internet and in locally installed software, are lenient about punctuation and diacritical parsley such as the accent aigu and the Umlaut. As I remember, Google used to silently "normalize" in the background (ö and oe giving approximately the same results for German words), but it doesn't do that anymore.
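
For the programmers in the audience, here is roughly what such lenient "normalizing" could look like. This is only a sketch in Python, not a description of what Google or anyone else actually does. The function names and the German folding table are my own assumptions for illustration.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold_german(text: str) -> str:
    """Hypothetical German-style folding: map ö to oe (and so on) first,
    then strip whatever diacritics remain."""
    table = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})
    return strip_diacritics(text.lower().translate(table))


# A search index keyed on the folded form would treat these as the same word.
print(strip_diacritics("pusilánime"))
print(fold_german("schön") == fold_german("schoen"))
```

Note that the two foldings disagree about ö: plain stripping gives "o", the German convention gives "oe". Which one is appropriate depends on the language, which is part of why silent normalization is harder than it looks.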

The Diccionario de la lengua Española, maintained by the Real Academia Española, is unforgiving. Rereading Nerval's El desdichado, I wanted to know what the Academia had to say about desdichado. There is a "coloq." subentry saying sin malicia, pusilánime. Since the English congenerics have very distinct meanings, I wanted to check the Spanish. I entered pusilanime without bothering about the accent, and got this:
> La palabra pusilanime no está registrada en el Diccionario. Las que se muestran a continuación tienen una escritura cercana.
>
> * pusilánime

Since there was only one entry "in the vicinity" of what I entered, I thought: the software might just as well have opened to that result. But no, it raps my knuckles to make me acknowledge that "á" is not "a" in Spanish orthography. This is not a case where I would say (formulating as a native English speaker) that "the accent makes a difference" in meaning, apart from indicating the syllable to be stressed. But then si (if) and sí (yes) are completely different words.

On the whole, I should just take the rap and purge the "diacritical parsley" idea from my head. It makes life easier.


Noetica said...

Stu, you can make Diccionario de la lengua Española ignore diacriticals. Just select "Búsqueda sin signos diacríticos", next to where you type in the item you are searching for. See other options, too. The present online version is less severe than the CD-ROM version that I have installed on my hard-drive.

El Desdichado, eh? I translated it some time ago myself. A haunted piece, by a haunted poet.

Stuart said...

Thanks for the tip, Noetica. Unfortunately, selecting "Búsqueda sin signos diacríticos" for pusilanime still results in the knuckle-rap I described.

With "sin signos diacríticos", a search for sí turns up both sí and si, as expected. When I selected "Búsqueda por aproximación" and searched for sí, I got exactly sí. This surprised me, because I thought "por aproximación" would cover "sin signos diacríticos". But in fact they've separated out these issues in a pleasing way. The Ayuda explains that "por aproximación" means poco a poco, taking one algorithm after another.

> Búsqueda por aproximación (por defecto).
> Aplica sucesivamente los tres tipos de búsqueda posteriores, hasta lograr resultados.
> Comienza por la búsqueda exacta, que es la más precisa, y termina por la
> semejanza fonético-ortográfica, la más amplia.
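
The cascade the Ayuda describes can be sketched in a few lines of Python. This is my own guess at the shape of the thing, not the Academia's code. The tiny word list is made up, I am assuming that the middle step is the search without diacritics, and `difflib.get_close_matches` is only a stand-in for whatever "semejanza fonético-ortográfica" really computes.

```python
import difflib
import unicodedata

# Hypothetical mini-lexicon, for illustration only.
LEXICON = ["si", "sí", "pusilánime", "desdichado"]


def strip_diacritics(text):
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def exact(query, lexicon):
    return [w for w in lexicon if w == query]


def without_diacritics(query, lexicon):
    key = strip_diacritics(query)
    return [w for w in lexicon if strip_diacritics(w) == key]


def similar(query, lexicon):
    # Stand-in for phonetic-orthographic similarity (an assumption).
    return difflib.get_close_matches(query, lexicon, n=5, cutoff=0.8)


# Narrowest mode first, broadest last; stop at the first mode with results.
CASCADE = (exact, without_diacritics, similar)


def by_approximation(query, lexicon=LEXICON):
    for mode in CASCADE:
        hits = mode(query, lexicon)
        if hits:
            return mode.__name__, hits
    return None, []


print(by_approximation("sí"))
print(without_diacritics("si", LEXICON))
```

Because the exact search already succeeds for sí, the cascade never reaches the broader modes, which matches what I saw. The sketch is more forgiving than the real site in one respect: here pusilanime would be found at the second step, whereas the Diccionario rejected it.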

I found the word "posteriores" here to be briefly distracting, since it drags in the order in which the four modalidades de consulta *currently* appear in the Ayuda, and in the drop-down list on the browser page. I regard presentation order as a supervenient feature that should not intrude in discussion, planning or explanation of underlying algorithms. The only order that is relevant to understanding how this stuff functions is that of the three search modes apart from "Búsqueda por aproximación".

This fourth search mode stands in no order relationship to the other three. For the sake of user convenience, and in accordance with tradition, a "default" setting will usually appear first in a list. But the search modes would function just as well if the "default" setting appeared last on the list. I program for a living, and have learned to be very severe in seeing to it that algorithmic design and UI design are kept quite distinct.
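
To make the point concrete, here is a minimal Python sketch in which the order the cascade tries things and the order the menu shows things are two separate pieces of data. All names here (`MODES`, `CASCADE_ORDER`, `menu`, `run`) are hypothetical, and only two underlying modes are included to keep it short.

```python
import unicodedata


def fold(text):
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The search modes, keyed by name. A dict lookup implies no ranking.
MODES = {
    "exact": lambda q, words: [w for w in words if w == q],
    "no_diacritics": lambda q, words: [w for w in words if fold(w) == fold(q)],
}

# Algorithmic fact: the order in which the default mode tries the others.
CASCADE_ORDER = ("exact", "no_diacritics")


def approximate(q, words):
    for name in CASCADE_ORDER:
        hits = MODES[name](q, words)
        if hits:
            return hits
    return []


MODES["approximate"] = approximate


# Presentation fact: the order of entries in the drop-down list.
def menu(order, default="approximate"):
    return [(name, name == default) for name in order]


def run(selected, q, words):
    return MODES[selected](q, words)


words = ["si", "sí"]
default_first = menu(["approximate", "exact", "no_diacritics"])
default_last = menu(["no_diacritics", "exact", "approximate"])
print(run("approximate", "sí", words))
```

Shuffling the list passed to `menu` changes what the user sees and nothing else. Changing `CASCADE_ORDER` changes the results and leaves the menu alone.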

Noetica said...

Now I understand. I had read too fast. I agree with you on the general point: these things could be made a lot more intuitive and friendly. Certainly TLFi could, as you have pointed out elsewhere. Frustrating, that one: so much power with so clunky an interface.