Validating Input In An Internationalized Application

The 'bopomofo' iPhone keyboard

If your application accepts input from users, you're likely going to want to perform some validation on that input. If you allow users to create data within your system that is displayed to other users (i.e. it should be "human-readable"), you might consider implementing a validation routine that ensures that all input for this data consists only of letters and whitespace. While I generally tend to shy away from overly aggressive validation of user input, I think that this kind of validation can make sense in some scenarios where you wouldn't want someone's cat running across their keyboard to create a document titled “((@(8#jk(@(‘’299” in your system.

The Simple Solution

But how should you go about this validation? A simple regular expression (is there such a thing?) should do the trick, right? A quick Google search for "regex for letters and whitespace" might yield something like this: ^[a-zA-Z ]+$ (loosely based on the question and accepted answer of a Stack Overflow question). In short, this regular expression matches any non-empty string consisting solely of the letters A-Z (upper or lower case) and the space character. Note that it accepts only the literal space, not tabs or other whitespace.

This regex would allow the user to enter “My Document”, but not “ZOMG The. Best. Document. EVERRRR!!!!!!11!!1one 🙂”. This is great and does what you want until your application needs to support someone who doesn’t speak English. (Side note: I normally wouldn’t care if someone wanted to name a document something silly. This is an admittedly contrived example that I put together to illustrate a point. Be nice to your users and always validate responsibly.)
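As a rough illustration, here is how that check might look in Python. The function name is_valid_title is made up for this sketch, and fullmatch() stands in for the ^ and $ anchors.

```python
import re

# Same character class as ^[a-zA-Z ]+$ ; fullmatch() plays the role of ^ and $.
ASCII_LETTERS_AND_SPACES = re.compile(r"[a-zA-Z ]+")

def is_valid_title(text: str) -> bool:
    """Return True if text is non-empty and holds only A-Z, a-z, and spaces."""
    return ASCII_LETTERS_AND_SPACES.fullmatch(text) is not None

print(is_valid_title("My Document"))  # True
print(is_valid_title("ZOMG The. Best. Document. EVERRRR!!!!!!11!!1one 🙂"))  # False
```

The cat's contribution from earlier is rejected as well, since it contains punctuation and digits.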

Your Alphabet Is Not The Only Alphabet

The alphabets used in other languages (if they even have an alphabet) often make use of diacritical marks. In the Czech alphabet, for example, there are some familiar-looking letters like 'A', 'E', and 'J', but there are also letters like 'Č' and 'ů'. It's important to note here that the diacritical marks above these letters are not just "accent marks". That is to say that they aren't just modifiers that are applied to the "normal" letters; they are completely different letters.

So when a user goes to create a document called 'Můj dokument' (which is likely an atrocious translation of 'My Document'… my apologies to the Czechs), your regular expression will reject it. You could be an arrogant American and tell them to just make it 'Muj dokument', but that's not very nice. The point of your validation routine was to help ensure that input remained human-readable, and using characters in foreign alphabets is a perfectly human thing to do.
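To see the rejection in action, here is a small Python sketch that runs the ASCII-only pattern against a Czech title containing the letter ů (written as the escape \u016f so the example doesn't depend on file encoding). The helper name is_valid_title is hypothetical. The unicodedata lookups show that, as far as Unicode is concerned, ů is a lowercase letter in its own right.

```python
import re
import unicodedata

ASCII_LETTERS_AND_SPACES = re.compile(r"[a-zA-Z ]+")

def is_valid_title(text: str) -> bool:
    """ASCII-only check: A-Z, a-z, and spaces."""
    return ASCII_LETTERS_AND_SPACES.fullmatch(text) is not None

czech_title = "M\u016fj dokument"   # the second character is the letter u with a ring above
stripped_title = "Muj dokument"     # diacritic removed to appease the validator

print(is_valid_title(czech_title))     # False
print(is_valid_title(stripped_title))  # True

# Unicode classifies the character as a lowercase letter ("Ll"), not as a symbol.
print(unicodedata.category(czech_title[1]))  # Ll
print(unicodedata.name(czech_title[1]))      # LATIN SMALL LETTER U WITH RING ABOVE
```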

In this particular scenario you'd likely be better served by the regular expression shorthand for "word character", which is "\w". In the .NET flavor of regex, this matches Unicode letters, meaning letters with diacritics are fair game. It's important to note that it also matches decimal digits (including digits from other scripts, not just 0-9) and connector punctuation such as the underscore, but not whitespace, so the space still has to be added to the character class explicitly: ^[\w ]+$. In my opinion, digits and the underscore are perfectly acceptable things to allow in a human-readable string, but you could always modify your validation routine to expressly reject digits and underscores if there's some really good reason to disallow them.
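The same idea can be sketched in Python, with the caveat that this is Python's re module rather than .NET: for str patterns, Python's \w is also Unicode-aware by default, though the exact set of characters it accepts is not identical to .NET's. The function names below are invented for this example, and the letters-only variant uses the [^\W\d_] idiom (a word character that is neither a digit nor an underscore).

```python
import re

# Word characters (Unicode letters, digits, underscore) plus the space.
WORDS_AND_SPACES = re.compile(r"[\w ]+")

# Stricter variant: word characters minus digits and underscore, plus the space.
LETTERS_AND_SPACES = re.compile(r"(?:[^\W\d_]| )+")

def allows_word_chars(text: str) -> bool:
    """Accept Unicode letters, digits, underscores, and spaces."""
    return WORDS_AND_SPACES.fullmatch(text) is not None

def allows_letters_only(text: str) -> bool:
    """Accept Unicode letters and spaces; reject digits and underscores."""
    return LETTERS_AND_SPACES.fullmatch(text) is not None

print(allows_word_chars("M\u016fj dokument"))      # True
print(allows_word_chars("Report_2024 final"))      # True
print(allows_word_chars("((@(8#jk"))               # False
print(allows_letters_only("M\u016fj dokument"))    # True
print(allows_letters_only("Report_2024 final"))    # False
```

Which of the two variants is appropriate depends on whether digits and underscores are genuinely a problem for your data; the looser one is usually the friendlier default.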

Why Should I Care?

If your application is and will always be for a target audience of English-speaking users, then maybe you don't need to care. If you think there's any remote chance that your application might someday be used by folks who don't speak English, do yourself a favor and take a second to think about how best to validate input from your users.