
Using only orthographic features to identify a language

As I talked about in an earlier post, the Latin alphabet is now used to represent more languages than any other script system. But the Latin alphabet doesn’t have a character for every sound in the world; it was created to represent Latin, and it’s been adapted somewhat since then, but there are still plenty of natural language sounds out there that have no corresponding Latin character. In order to represent these sounds in the Latin alphabet without creating a new letter—an expensive proposition—we’ve relied on diacritical marks, as seen in ù, ú, û, ũ, ū, ŭ, ü, ủ, ů, ű, ǔ, ȕ, ȗ, ư, ụ, ṳ, ų, ṷ, ṵ, ṹ, ṻ, ǖ, ǜ, ǘ, ǚ, ừ, ứ, ữ, ử, ự, and ʉ, as well as combinations of letters, such as ch.

But sometimes diacritics and multigraphs aren’t used to represent new sounds. Sometimes they’re used for ideological reasons, to create the illusion that a language is more distinct than it actually is. For example, when planning the Basque orthography, the language moguls decided to use tx to represent the same sound that ch represents in Spanish. Why create a new digraph when an existing one would serve perfectly well? One reason, surely, is that the invented digraph would emphasize that Basque is not Spanish—that the Basques are not Spanish!

There’s an interesting result of all this: Because of the unique characters and character combinations of many languages, it’s fairly easy to tell what language something is written in, even if you don’t know that language. If you see an ü and a sch in some text, it’s probably German, and if there were a ß, that’d give it away for sure.
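This telltale-feature idea can be sketched as a toy classifier. Everything below is illustrative: the feature lists are tiny, hand-picked samples of distinctive characters and digraphs, not real linguistic data, and a serious system would use character n-gram statistics instead.

```python
# Toy language guesser based purely on orthographic telltales.
# The feature lists are small, hand-picked illustrations, not real data.
TELLTALES = {
    "German":  ["ß", "sch", "ü", "ö"],
    "Spanish": ["ñ", "¡", "¿"],
    "Basque":  ["tx", "tz", "ts"],
    "Polish":  ["ł", "ż", "cz", "sz"],
}

def guess_language(text):
    text = text.lower()
    # Score each language by how many of its telltales appear in the text.
    scores = {lang: sum(t in text for t in feats)
              for lang, feats in TELLTALES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(guess_language("Das Mädchen wußte es schon"))  # → German
```

Even a made-up word like "schürfen" would score as German here, which is exactly the point of the next section.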

I hypothesized that this would even work with made-up words.

This idea was the basis for an art project of mine, in which I considered the most defining characteristics of several languages: average number of letters per word, most common beginning and ending letters, most frequent letters, and unique or stereotypical multigraphs, characters or punctuation marks. As a result, I’ve created, among others, the most English fake English word.
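The post doesn’t spell out the generation procedure, but the statistics listed suggest a rough recipe. Here is a minimal, deterministic sketch of that idea in Python; the assembly rule is my own assumption, and a real generator would presumably add randomness and account for multigraphs and phonotactics.

```python
from collections import Counter

def most_typical_word(words):
    """Assemble a fake word from a language sample's defining statistics:
    average word length, most common first and last letters, and the
    most frequent letters overall."""
    words = [w.lower() for w in words if w.isalpha()]
    avg_len = round(sum(map(len, words)) / len(words))
    first = Counter(w[0] for w in words).most_common(1)[0][0]
    last = Counter(w[-1] for w in words).most_common(1)[0][0]
    # Fill the middle with the most frequent letters not already used.
    middle = [c for c, _ in Counter("".join(words)).most_common()
              if c not in (first, last)]
    return first + "".join(middle[:avg_len - 2]) + last

sample = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
print(most_typical_word(sample))  # a four-letter word starting with t, ending with e
```

Run on a full dictionary rather than this toy sample, the same statistics would yield something far closer to "the most English fake English word."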


The resulting words are interesting enough, but instead of simply printing them here like any other text, I traced their typeset forms and filled them in with watercolor, inviting meditations on mechanical reproduction, handwriting and creation.

See if you can tell which languages the rest of these made-up words are written in.







What makes an A an A

I recently began learning Japanese, and I was struck by how some of the kana look like Latin letters. ん, for example, looks like h. な looks like tj. ケ looks like k. These are just coincidences, of course; Japanese characters have nothing to do with Latin letters. And on their own, maybe these examples aren’t that exciting. But when I considered ナ, it got more interesting: ナ looks a little bit like a t, but only if you’re looking for resemblances. If you just saw it on its own, it’s unlikely that you’d think it’s a t. Why? Probably because the vertical line slants in the wrong direction, and the horizontal line is a bit too long.

What makes each letter each letter? What is it about one collection of strokes that makes us think A, while another does nothing of the sort?

Here we get into prototype theory, a storied line of research in cognitive science that overlaps a lot with linguistic semantics. Imagine you were trying to explain to someone what bird means by showing them a picture. Would it be more helpful to show them a robin or an ostrich? If you’re a Westerner, you probably said a robin. That’s because, for some reason, a robin is much more birdish than an ostrich. That is, a robin exhibits more of the characteristics that we consider prototypical of birds, while an ostrich is more of an outlier.

In the context of letterforms, we can suppose that we have a learned sense of the prototypical characteristics of each letter. But no one writes like Helvetica; indeed, our handwritten letterforms vary wildly: We can put a tail on our U or happily leave it off; we can put a cap on our J; and we can round our E, V and W with no negative repercussions. Interestingly, in some cases the handwritten form greatly differs from the typed form. The lowercase a is a good example: Almost no one writes it with the upper stem (as seen in the non-italic a), but that’s the most common typed form. When we see the two-story a handwritten, it gives us pause.

The question is: How many of these prototypical letterform characteristics can we pluck away before our letters become unidentifiable? And how do we learn these characteristics in the first place?

I think generally the distance we can stray from the prototype has to do with whether the modified version could be confused with another letter. It won’t matter if the stems of your N are crooked. It’ll still be legible. If the vertical line of your E is a bit too long, it’s no big deal. But if you make the line too long on your L, it might look like a t. And if you hurriedly write H, it might look like “it.” If you round your M, it’s fine, but if you round your D too much it might look like an O. An upside down E is still an E, and somehow an upside down A is still recognizable as an A. But an upside down M reads as a W. How tall can the stem on an n be before it becomes an h?

But the rules don’t always have to do with confusion with other letters; sometimes the rules are simply in place so we can see that a given form is, indeed, a letter, and what letter it is, without delay. For example, I’ve found that my lowercase f is sometimes unrecognized by others. It’s not that they think it’s a different letter, but rather that they don’t know what it is.

Where does our knowledge of these prototypical characteristics come from? Simply put, we logically induce the rules from seeing letterform after letterform throughout our lives.

Besides pure input, we can learn from explicit feedback. As children, we might have written P backwards. It wasn’t possible to confuse ꟼ with another letter, and it was certainly still recognizable as P, but our teacher would still have said it was wrong and corrected us. In some cases, we might have mixed up d and b, but eventually we would have realized (whether through correction or on our own) that these create ambiguities.

It gets interesting when we learn new scripts, which come with a whole other set of rules to learn. Intuitively we’ll want to apply some of our native script–specific rules to the new script. For example, Japanese publishers print Western numerals as monospace characters, where each character takes the same amount of horizontal space, because all native Japanese characters are monospace. Compare 100 and １００. The first looks normal to us because the 1 takes up less space than the wider 0s. But the second would look more normal to a Japanese reader, because each character has the same width.
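Those fixed-width digit shapes are Unicode’s “fullwidth forms” (U+FF10 through U+FF19 for the digits), which sit at a constant offset from their ASCII counterparts, so producing them is mechanical:

```python
def to_fullwidth(s):
    # Fullwidth forms live at a fixed offset (0xFEE0) above ASCII
    # for the printable range U+0021 through U+007E.
    return "".join(chr(ord(c) + 0xFEE0) if "!" <= c <= "~" else c
                   for c in s)

print(to_fullwidth("100"))  # → １００
```

The same offset covers letters, which is why you sometimes see ＲＯＭＡＪＩ set this way in Japanese text.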

But we have to be careful. When I write in English, I know that I can loop the tail on my j and g and it won’t affect readability. I also know that I can wait to cross my t’s until the very end of the word, when I can elegantly get them all with a single stroke that comes off the last letter. But when I’m writing in Japanese, I don’t know what modifications I can do without causing problems.

When I was starting out with hiragana, I wrote き exactly as it appears typed. But I soon learned that no one actually handwrites it that way; proficient writers disconnect the bottom curve from the vertical line, like this:

Japanese Hiragana ki

Of course, handwriting き as it appears on the computer isn’t detrimental; it’s just amateurish. But the point is that I have no idea which mistakes I may be unwittingly making that are detrimental.

In learning a new script, we have to learn which characteristics matter most. It begins with writing direction: English, written horizontally, keeps all characters the same height (ascenders and descenders notwithstanding). Japanese, traditionally written vertically, fits every character into a square, with width uniformity apparently being the priority.

Even within a single script, it seems there are cultural variables—something not mentioned in language classes. Most Europeans, for example, write the number 1 with a long tail and no base, almost like a ʌ. A lesser-known example is the cursive lowercase z. Below is the one I learned in school on the left, along with the one most commonly used in Poland on the right (probably not 100 percent accurate, since it’s my imitation :)).

Cursive Z's

There’s a good chance that if you wrote a cursive z and asked a Pole what it was, they wouldn’t have a clue. (We, on the other hand, are a bit luckier, given that the cursive Polish z more closely resembles a regular z.) I was also pleased to discover that the Spanish tend to write their lowercase f just as I do—a way that proved unrecognizable to many of my fellow Americans. If you’re somehow as interested in all this as I am, check out the Wikipedia article on regional handwriting variation.

So what’s the point? From here, we could go through every script in the world and each character in each script, defining all the possible acceptable variations and analyzing them. Maybe I will someday, but not now. For the moment, let this meditation suffice to give a perhaps newfound appreciation for how well you know the Latin script—and what went into getting you to where you are. The fact is that we know this script mind-bogglingly well—to the point where we can spot when others aren’t so fluent, given away by an awkward stem here or a crooked form there—even if we’re not actually aware of this ability.

Quality in writing today

There’s been a lot of talk about how the quality of writing is going downhill these days. But what does that mean, exactly? Sure, we can assume that if kids are lolling and btwing in term papers, the writing is probably not of high quality. But is there more to it? Moreover, is the quality of writing really going downhill? And if so, should this worry us?

Quality is a major theme of the book Zen and the Art of Motorcycle Maintenance, and the author begins his investigation by specifically analyzing quality in writing. Robert Pirsig’s idea of Quality is one that exists prior to any manifestation. In other words, it’s not the writing itself that is high or low quality; it’s what’s behind the writing. Therefore, if today’s writing does not have quality, perhaps this is because it is not as carefully thought through as writing once was; the stuff behind the words isn’t as honed.

In Doing Our Own Thing, John McWhorter says that formal English has lost its value over the past several decades as we’ve shifted toward writing that is more reflective of oral English. He argues that this change has resulted in a decline in quality of oratory, poetry, theater, preaching and—ultimately—thinking.

Even so, McWhorter recognizes that the shift in writing from complex to straightforward has had some positive effects: Immigrants and minorities have begun to pass into the mainstream, and education has become more democratized.

Amalia Gnanadesikan shares this viewpoint. As she writes in The Writing Revolution, “[Language purists] may have a point about stylistic quality, but they have failed to notice that what they are really witnessing is the ultimate triumph of the written word—the point at which the technology becomes second nature to a whole society” (p. 272). That is, writing isn’t going downhill entirely; it’s just that people have platforms for writing today who never would have in the past. Therefore a much higher proportion of writing is of lower quality, but that doesn’t necessarily mean that quality writing is nonexistent.

Naomi Baron talks about this also in Always On. Like Gnanadesikan, she argues that today’s writing has degraded simply because we’re doing so much more of it. Because of the sheer number of words each of us must type each day, writing has lost its expectation of quality. She observes that our face-to-face conversations, which are as a rule of lower quality than writing, are gradually being replaced by writing. As a result, writing falls in quality.

According to Baron, one of the primary manifestations of this fall in quality is spelling trouble. She condemns spell-check, saying it robs us of opportunities to learn to spell on our own (definitely true), and it causes us to rely on a tool that can’t solve all our problems. (Cant and wont are both words, after all, and even the virtual wizard doesn’t know you meant can’t or won’t.) I agree with Baron up to this point, but I don’t support her claim that spelling is about to spiral out of control.

In Always On, Baron cites the case of compound words. Is it makeup, make-up or make up? In this example, each of the three means something different. But in less common words, that’s not a problem. Where are the innocent deaths that occur from someone spelling nonetheless as none-the-less? And even where the spelling does distinguish among multiple meanings, context almost always makes it clear: when you’re talking about how you need to use makeup before you can make up in time for the make-up, for example. But is it really so scandalous that a few words might have idiosyncratic spellings?

I don’t think so. After all, we already do this, both in spelling and in speech. In spelling, we have the famous U.S. and British differences. In speech, we have multiple pronunciations for words like aunt, either, adult and coupon—even for such common words as a and the. We also have multiple possibilities for pluralizing certain words, like persona, which can pluralize as personas and personae. In my opinion, it’s highly unlikely that we’re going to slip into a chaotic world where everyday words have countless spellings, as they did in the olden days—the world Baron fears.

Here the trade-off between diversity and understandability comes into play. Idiosyncratic spellings can only go so far before the problems they cause for understanding outweigh the value of creative expression. Naturally, they will subside as we converge on a more rigid spelling system. (Granted, less rigid than what we had a century ago, perhaps, but rigid nonetheless.)

But is quality writing really all about good spelling? Certainly not. In fact, I’d like to think that mechanical issues are the smallest definers of quality. But then what are the important parts? Pirsig says in Zen that everyone knows, but no one can define them. Well, that’s not very helpful. If I can think back to high school with any accuracy, I remember a rubric we used to judge writing, using criteria like Content, Word Choice, Sentence Fluency, Organization and Voice. Of these, perhaps Content is the most important—it’s the one that requires you to have something to say in the first place.

Can we really claim that we’re spiraling downward in all these areas? McWhorter seems to think so, suggesting that we Americans have lost all capacity for formal argument and critical thinking. I don’t think that’s true; it’s just that formal argument and critical thinking are not found in all the places they once were.