As a corporate coder (sigh) I work a lot with databases. One of my biggest concerns (besides data consistency) is how easy it is to search through my data. Being a native Spanish speaker, accents and tildes make that a little difficult.

Removing Nonspacing Characters (Tildes, accents) from Unicode Strings in Python

Beautiful Natural Language Complexity

Spanish is a beautiful language. I've always thought it has the right level of complexity to make complex ideas easy to express, and simple ideas short enough. Also, it's far less context dependent than English, thank the lord.

But complexity has its price: people tend to avoid some of its beauty for no apparent reason other than laziness. Others (like me) tend to avoid its complexity in order to accommodate the former kind of people.

How so?

Let's take an example. The words for the feminine "this" and for (he or she) "is" are "esta" (sometimes "ésta", when "this" stands alone as a pronoun) and "está" respectively. You do not need to be a native Spanish speaker to notice that these words are almost identical in writing. The only way you can distinguish one from the other is the accent, a diacritic. But here's the catch: many people tend to skip these accentuation details and leave it to context to infer the actual word (which makes baby Jesus cry and kills a kitten :( ).

That's OK for people. Thanks to English and my high level of intelligence (yeah, right) I have no trouble deciphering these context-bound words and inferring what they mean.

But let's talk about computers.

How this problem affects your database

So you have your beloved user, Anita, registering a new customer. She's all intellectual and stuff; she knows how to write. She gets a new customer, Mr. Julián Hernández Güero. She registers him correctly in your system and makes a successful sale.

Later, her coworker, Ms. Diana Dyneson, receives a complaint call from Mr. Julián reporting an issue with his product. Diana types "Julian Hernandez". Nothing comes up in the search. Sorry Mr. Hernández, you seem to be confusing us, 'cuz you have never bought anything here.

What do?

It's so easy to get this wrong, so your best bet is to store everything in a "neutral" way.

Since I love Python, let's do it in Python. What we're going to do here is take a Unicode string and use two different approaches to "clean" it. Let's proceed.

First of all, if you want to use Unicode literals inside your Python script, you must specify the encoding of your file. Do this by adding this as your first line:

# -*- coding: utf-8 -*-

(#coding: utf-8 will also work here)

Let's import the single most important module here, unicodedata, which, as the official Python docs state, defines properties for Unicode characters and helps you handle them. I love Python's batteries : )

import unicodedata

Now let's get our Unicode text. You could get it from a file or a database, but that's a problem for another post, so we'll use this literal for now:

text = u"áéíóú äëïöü ñÑ û"

Note that if you try to print it just like that in a Windows console, you'll get the following error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)

That's because the console's default encoding can't handle the characters you're trying to output just like that. You must first encode them to a suitable format and then you'll be fine. Try:

print(text.encode("utf-8"))

The first and obligatory step in this "process" will be to normalize our Unicode string. I will not go very deep into the details, partly because I'm not trying to be a Unicode guru and partly because fuck you, but read this: the Unicode standard defines two ways of representing a "special character". One uses two characters (one "normal" character plus one combining character); the other uses a single pre-composed character to represent the special character as such.

The first of these approaches is called canonical decomposition, form D; the second is referred to as form C, because it first applies a canonical decomposition and then re-composes the pre-combined characters.

I hope by this point it's very clear that we will use the D form on our text, so we can separate every nonspacing character and then somehow throw it away from our string. This is how you do it:

text = unicodedata.normalize("NFKD", text)  # D is for d-ecomposed; the K also applies compatibility decomposition (plain "NFD" works too)
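To see what decomposition actually does, here's a quick sketch that takes a single pre-composed character apart and inspects each resulting code point with unicodedata:

```python
# -*- coding: utf-8 -*-
import unicodedata

# NFKD splits the pre-composed "ñ" into a base letter plus a combining mark
decomposed = unicodedata.normalize("NFKD", u"ñ")

for ch in decomposed:
    # combining() returns 0 for base characters, nonzero for combining marks
    print(unicodedata.name(ch), unicodedata.combining(ch))
# LATIN SMALL LETTER N 0
# COMBINING TILDE 230
```

That nonzero combining class is exactly what we'll key on later to filter the marks out.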

The first and easiest-on-the-eye thing we'll do is simply "transform" our Unicode string into an ASCII string, dumping everything that is not ASCII. Let's get right to it.

clean_text = text.encode("ascii", "ignore")

The first parameter of the encode method is the encoding used to represent our Unicode string; the second defines how to treat errors in the conversion. Here we just chose to ignore them. When we print our clean text, we'll see the following:

aeiou aeiou nN u
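For the curious, "ignore" is not the only error handler encode accepts. A small sketch of the common alternatives:

```python
# -*- coding: utf-8 -*-
s = u"año"

# "strict" (the default) raises UnicodeEncodeError on any non-ASCII character
try:
    s.encode("ascii")
except UnicodeEncodeError:
    pass

print(s.encode("ascii", "ignore"))   # drops the offending character: ao
print(s.encode("ascii", "replace"))  # substitutes a "?": a?o
```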

Nice! Keep in mind that by using this approach you will no longer be handling a Unicode string object, but a regular Python str object, which holds 8-bit characters. The only drawback I can quickly think of with this approach is that it will simply wipe out every non-ASCII character. Maybe you don't want that, and only wanted to explicitly wipe out the combining characters. Let's try that again, carefully:

clean_text = u"".join([ch for ch in text if not unicodedata.combining(ch)])

So we recreate our Unicode string, this time leaving out the combining characters. I think this is more like what I was expecting. This way you can keep other kinds of characters in (like the special punctuation marks "¿" and "¡").
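To see the difference, here's the same filter applied to a string that contains those punctuation marks (note that the encode-to-ASCII approach would have dropped them entirely):

```python
# -*- coding: utf-8 -*-
import unicodedata

text = u"¿Qué pasó? ¡Ñoño!"

# decompose first, then keep only the non-combining characters
decomposed = unicodedata.normalize("NFKD", text)
clean = u"".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(clean.encode("utf-8"))  # ¿Que paso? ¡Nono!
```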

Keep in mind that this method gives you a Unicode string, so if you just try to print it like that in a Windows console, you'll get the same error I mentioned earlier. Encode it before printing!
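Putting the pieces together for the database scenario above, here is a minimal sketch of a helper you might run on names before storing a searchable copy next to the original. The function name and the lowercasing step are my own assumptions, not anything prescribed by the standard library:

```python
# -*- coding: utf-8 -*-
import unicodedata

def make_search_key(name):
    """Return an accent-free, lowercased copy of `name` for searching.

    Hypothetical helper: store this in a separate column and query
    against it, keeping the original (accented) name for display.
    """
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = u"".join(ch for ch in decomposed
                        if not unicodedata.combining(ch))
    return stripped.lower()

print(make_search_key(u"Julián Hernández Güero"))  # julian hernandez guero
```

With this in place, Diana's query "Julian Hernandez" matches Anita's record, because both the stored key and the search terms pass through the same normalization.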

Well, I hope everything was clear and concise enough to help you.

Enjoy your coding!


PS: if you're OK with the first approach (just extracting the ASCII) you can do it easily with the unidecode module. Install it with a standard pip install unidecode.

Then you just do:

from unidecode import unidecode
print(unidecode(u"áéíóú äëïöü ñÑ û"))

easy peasy!

Posted by: fabzter
Last revised: 21 Sep, 2012 02:34 PM

Comments


Efraim
05 Jan, 2016 01:30 PM

Hi man.

First of all, great post.

I'm from Brazil and ran into the same problem you described - storing accentuated strings in a database, and having users searching them later.

The thing is, isn't there another way around this problem? Because if I have a string 'João é meu amigo' (João IS my friend) and take out all the accents, it will be stored as 'Joao e meu amigo' (Joao AND my friend), so the meaning of the phrase is changed.

I've been struggling with this for a few days and haven't thought of a way to solve this problem.

Any ideas?

Thanks.

