This is related to the previous post. I have started developing a drop-in replacement for atoi() and atol(). The new functions maintain compatibility with the library functions and yet are faster on the platforms and compilers I tested. The sign of a number and whitespace are handled as atoi() would handle them, with the caveat that the whitespace set is hardcoded to match atoi() behaviour under LC_ALL=C. See the source code and performance charts.
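
A minimal Python sketch of the semantics just described (the actual replacement is written in C; overflow handling and the atol() variant are omitted):

def my_atoi(s):
    # Skip leading whitespace as in the "C" locale, read an optional
    # sign, then accumulate decimal digits until the first non-digit.
    i, n = 0, len(s)
    while i < n and s[i] in " \t\n\v\f\r":
        i += 1
    sign = 1
    if i < n and s[i] in "+-":
        sign = -1 if s[i] == "-" else 1
        i += 1
    result = 0
    while i < n and "0" <= s[i] <= "9":
        result = result * 10 + (ord(s[i]) - ord("0"))
        i += 1
    return sign * result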

This year I had the pleasure of attending the Code::Dive conference in Wrocław, Poland. One of the invited speakers was Andrei Alexandrescu, the author of the famous “Modern C++ Design” book and co-designer of the D language, who recently left Facebook to concentrate on his “pet project”. I set my sights on Andrei and didn’t miss any of his lectures – they were worth it.

Sadly, the video materials from the conference have not been officially published yet. If you search reddit, however, you will find links to non-public recordings on YouTube. I recommend both parts of “Writing Fast Code”. It’s an eye-opening experience – Andrei shows how the mathematical properties of operations and the processor’s internal architecture can be combined to increase performance by several hundred percent. And the optimized functions are simple to understand and yet useful: fast versions of integer-to-string and string-to-integer conversions.

The claimed optimizations were substantial: 2–3 times faster than the library function and up to 2 times faster than a handwritten one. However, Andrei underlined that everything should be properly measured and that you should not rely on intuition or experts, since they are often wrong. So I started my own verification task. First, I wrote the atoui() routine using the Horner algorithm, the equivalent of the “handwritten” version in the lecture. As the next step, I added “one small trick” that Andrei showed us. It saved some comparisons and the code was visibly faster: atoui_f() on the chart.
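
For reference, here is a minimal Python sketch of the Horner approach (my measured versions are in C; the chained range check below stands in for the “one small trick”, which in C – as far as I recall – collapses the two digit-range comparisons into a single unsigned one):

def atoui(s):
    # Horner's rule: result = result * 10 + digit, consuming digits
    # left to right and stopping at the first non-digit.
    result = 0
    for ch in s:
        d = ord(ch) - ord('0')
        if not (0 <= d <= 9):   # in C: (unsigned)(ch - '0') > 9, one comparison
            break
        result = result * 10 + d
    return result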

Then I rewrote the function applying the optimizations that should better parallelize the computation done in the ALU… and got something slower even than the original Horner algorithm: atoui_and(). After reducing the number of comparisons with “one small trick”, the new version, atoui_and_f(), was still losing to the original atoui() on my old i5 CPU and the MinGW gcc 4.9.2 compiler, even after applying full optimizations for the architecture. Unrolling the loops – atoui_and_u() – did not improve the situation. Horner was winning.
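
I won’t reproduce the measured C code here, but the general shape of the ILP transformation is to consume more than one digit per step, shortening the serial multiply–add dependency chain so the ALU has independent work to do. A rough Python sketch of the idea (digit validity checks omitted):

def atoui_pairs(s):
    # Two digits per iteration: one multiply by 100 replaces two
    # dependent multiplies by 10, halving the length of the chain.
    result, i, n = 0, 0, len(s)
    while i + 1 < n:
        result = result * 100 + (ord(s[i]) - 48) * 10 + (ord(s[i + 1]) - 48)
        i += 2
    if i < n:                       # odd length: one trailing digit
        result = result * 10 + (ord(s[i]) - 48)
    return result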

Well, that could not be right: probably Alexandrescu used a different combination of compiler and architecture. Luckily, I have a secondary development machine which is… well… not as old as the primary one. Still, I could not replicate Andrei’s results in full. But at least for some string sizes atoui_and_u() (orange, unrolled and mathematically optimized) was minimally faster even than atoui_f() (red, the improved Horner algorithm). The chart from my MacBook is below. The function zero() is just a plain empty loop, prevented from being optimized out, that serves as a baseline (another trick by Andrei).
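
Python has no optimizer to outsmart, but the baseline idea translates directly: time a do-nothing loop over the same input and subtract it, so loop and call overhead do not pollute the comparison. A sketch, reusing the atoui() sketch from above:

import timeit

def zero(s):
    # Baseline: iterate over the input but do no work at all.
    for _ in s:
        pass

base = timeit.timeit(lambda: zero("1234567890"), number=1_000_000)
real = timeit.timeit(lambda: atoui("1234567890"), number=1_000_000)
print("atoui minus baseline:", real - base)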

atoui performance

So, a minor success. And it pays off to know your maths and the tricks of the trade after all.

I hope to write about the implementation details if I have some time. Please leave a comment if you are interested.

Note: in Alexandrescu’s video you will find ILP, ILP+unsigned and ILP+unsigned+unrolling; these correspond to atoui_and, atoui_and_f and atoui_and_u respectively.

I’m bringing to an end this series of mildly interesting, useless factoids on Google Translate’s ability to recognize the likely language judging by just the first one or two letters. The Polish language uses almost all letters of the Roman alphabet and adds nine specific ones: ą, ę, ć, ł, ń, ó, ś, ź, ż. Some of these are also used by other languages like Lithuanian, but I suspected that each of them, on its own, would be detected by Google Translate as being part of a Polish word. Of course this is an example of Central Europe-centric thinking ;-) Was I right?

It turns out that ć, ę, ł, ś, ź and ż are indeed recognized as Polish. The letter ą was marked as English, which might mean “no detection, use the default”. It’s debatable whether this behaviour is correct, since no valid Polish word can start with ą. On the other hand, ę never occurs at the beginning of a word either, and it was detected as Polish. What’s interesting – the letter ó was marked as probably Irish and ń as Yoruba. I understand that in those languages ó and ń are just versions of the base letter with an added acute accent, whereas in Polish their pronunciation is quite distinct from o and n.

Speaking of accents, here’s some bonus information: ' (single apostrophe) is guessed to be Afrikaans and ` (back apostrophe): Armenian.

Finally, here are the changes in the proposed language to translate from when typing “ósemka” – the name of the digit 8 and one of the few Polish words that start with ó.

ó      Irish - detected
ós     Galician - detected
óse    Irish - detected
ósem   Icelandic - detected
ósemk  Icelandic - detected
ósemka Polish - detected

I wrote a script that ran all two-letter word prefixes against the Google Translate page and stored the auto-detected language. It has no real use apart from producing a long list of mildly interesting factoids; for example, the prefix “ji” is most likely to be Czech, “sy” – Afrikaans, “ta” – Maltese, etc.
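
The scraping part depends on the Google Translate page internals, so I’m not sketching it; generating all 676 two-letter prefixes, however, is a one-liner (detect_language() below is a hypothetical placeholder for whatever does the lookup):

import itertools
import string

prefixes = ["".join(p) for p in
            itertools.product(string.ascii_lowercase, repeat=2)]
# results = {p: detect_language(p) for p in prefixes}  # lookup not shown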

Here’s the visualization of the prefix-to-language mapping, done using PIL. I used the HSV color space so that the colors would be uniformly distributed among the hues, although the less saturated ones seem to be difficult to distinguish from one another. Each color represents a language; the languages are listed at the bottom. Generating N easily distinguishable colors seems to be an interesting problem in itself.
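
A minimal sketch of the evenly-spaced-hues idea, using the standard colorsys module (an illustration of the approach, not the actual script; saturation and value are pinned at maximum):

import colorsys

def distinct_colors(n):
    # Spread n hues evenly around the HSV wheel and convert to RGB bytes.
    return [tuple(int(255 * c) for c in colorsys.hsv_to_rgb(i / n, 1.0, 1.0))
            for i in range(n)]

print(distinct_colors(3))   # [(255, 0, 0), (0, 255, 0), (0, 0, 255)]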

Note that letter combinations unlikely to prefix a word in any language seem to default to English.

Two letter prefixes and detected languages

Disclaimer: all data came from Google Translate.

When you use Google Translate with the “Instant translation” and “Detect language” options turned on, the language is detected as soon as you type the first letter.

For example, a single letter “a” is identified as text most probably in English, and both “w” and “z” are likely to be Polish. These happen to be valid words in the respective languages.

Here is the chart of the various languages detected in response to all the letters of the Roman alphabet.

Letter to likely language mapping chart

Note that most letters default to English. Of course, this Google Translate behaviour probably depends on the user’s country and language settings, so your mileage may vary.

In the next installment of this series I’ll write about language-specific characters and multi-letter prefixes.

Replacing each letter by its counterpart in the reversed alphabet – English

Start by loading the list of words (plenty of English word lists are available on the Internet)

In [1]:
words = [ w.strip() for w in open(r"D:\opt\data\words.txt", "r").readlines() ]
In [2]:
len(words)
Out[2]:
354986
In [3]:
# I wonder what the longest word is
max(len(w) for w in words)
Out[3]:
31
In [4]:
# out of curiosity...
[ w for w in words if len(w) == 31]
Out[4]:
['dichlorodiphenyltrichloroethane']

Filter the words, leaving only the ones consisting of the letters 'a'–'z'

In [5]:
import string

az_words = [ w for w in words if set(w) <= set(string.ascii_lowercase) and len(w) > 0 ]
In [6]:
len(az_words)
Out[6]:
351076

Define the mapping of each letter to its counterpart in the reversed alphabet, e.g. 'a' ↔ 'z', 'b' ↔ 'y' etc.

In [7]:
m = dict( (k, chr(ord('a')+ord('z')-ord(k))) for k in string.ascii_lowercase )
print(m)
{'r': 'i', 'l': 'o', 'i': 'r', 'c': 'x', 'w': 'd', 'd': 'w', 'z': 'a', 'b': 'y', 'h': 's', 't': 'g', 'y': 'b', 'p': 'k', 'u': 'f', 'x': 'c', 'o': 'l', 'v': 'e', 'a': 'z', 's': 'h', 'q': 'j', 'f': 'u', 'n': 'm', 'g': 't', 'j': 'q', 'm': 'n', 'k': 'p', 'e': 'v'}

In [8]:
def map_word(w):
    return "".join(m[letter] for letter in w)

print(map_word('a'), map_word('zz'), map_word('abc'))
z aa zyx

The original example – WIZARD ↔ DRAZIW

In [9]:
# note the usage of the Martian smiley [::-1] to reverse the string
map_word('wizard') == 'wizard'[::-1]
Out[9]:
True

Let's find all the 'wizard' words in English!

In [10]:
wizard_words = [ w for w in az_words if map_word(w) == w[::-1] ]
In [11]:
len(wizard_words)
Out[11]:
32
In [12]:
sorted(wizard_words)
Out[12]:
['az',
 'bevy',
 'by',
 'fu',
 'girt',
 'grit',
 'gt',
 'hols',
 'hovels',
 'hs',
 'ir',
 'izar',
 'klop',
 'levo',
 'lo',
 'mn',
 'nm',
 'ol',
 'pk',
 'polk',
 'sh',
 'tg',
 'trig',
 'vire',
 'vole',
 'wd',
 'wird',
 'wizard',
 'wold',
 'xc',
 'za',
 'zira']

Notably, wizard is one of the two longest wizard words; the other one is hovels.

Another question

How many mapped words become valid words? That is, we no longer check whether the reversed mapped word is the original one; we only check whether the mapped word exists in the dictionary. For example, abc becomes zyx after mapping, and we accept abc as a member of the new set if zyx is also in the dictionary.

In [13]:
az_set = set(az_words) #speed up lookup
good_words = [ w for w in az_words if map_word(w) in az_set ]
In [14]:
len(good_words)
Out[14]:
534

Find the longest words that are still valid after being remapped, as well as the words they map to.

In [15]:
maxlen = max( len(w) for w in good_words )
[ (w, map_word(w)) for w in good_words if len(w) == maxlen ]
Out[15]:
[('brigs', 'yirth'),
 ('drogh', 'wilts'),
 ('droob', 'willy'),
 ('erizo', 'viral'),
 ('girth', 'trigs'),
 ('grogs', 'tilth'),
 ('tilth', 'grogs'),
 ('trigs', 'girth'),
 ('viral', 'erizo'),
 ('willy', 'droob'),
 ('wilts', 'drogh'),
 ('yirth', 'brigs')]

Here’s a simple toy project – find all the word squares. A word square is an n × n matrix of letters where each row and each column is a word. I used a brute-force approach and no recursion (to make it harder), so this algorithm may easily be improved. Let’s get the words:

words = [ w.strip() for w in open("myfavoritedictfile.txt", "r").readlines() ]

Let’s define the class for the word square. Being lazy, I will not define a class at all.

import collections
Square = collections.namedtuple('Square', ['n', 'hlist', 'vlist'])

hlist is the list of horizontal words; vlist is for the vertical ones.

Let’s check if a new word can be added – we need to verify each cell where it crosses an already placed word. The check is symmetrical, hence we need only one function and swap the lists as needed.

def add(list1, list2, w):
    # Try to append w to list1; its crossing letters must agree with
    # every word already placed in list2.
    k = len(list1)                  # w will become the k-th word in list1
    for i, e in enumerate(list2):
        if e[k] != w[i]:            # cell (i, k) is shared by e and w
            return False
    list1.append(w)
    return True

Now the actual function that fills the word square and yields the results as a generator.

def fill_iter(n, words):
    count = 0
    z = Square(n, [], [])
    nw = len(words)
    stk = [ 0 ]                    # stack of candidate word indices
    while 0 < len(stk) <= 2 * n:
        count += 1

        # Odd stack depth places a vertical word, even depth a horizontal one.
        if len(stk) % 2:
            r = add(z.vlist, z.hlist, words[stk[-1]])
        else:
            r = add(z.hlist, z.vlist, words[stk[-1]])
        if count % 1000000 == 0:
            print(count, z, words[stk[-1]])    # progress report

        if r:
            if len(z.hlist) == len(z.vlist) == n:
                yield z                # full square found
                r = False              # force backtracking to find more
                if len(stk) % 2:
                    z.vlist.pop()
                else:
                    z.hlist.pop()
            else:
                stk.append(0)          # descend: fill the next slot
        if not r:
            stk[-1] += 1               # try the next candidate word
            while stk[-1] == nw:       # candidates exhausted: backtrack
                stk.pop()
                if len(stk) == 0:
                    return None        # the whole search space is done
                if len(stk) % 2:
                    z.vlist.pop()
                else:
                    z.hlist.pop()

                stk[-1] += 1
    return None

Since this approach is very slow, let’s reduce the search space a bit. Here’s how to get all the 5-letter words that remain valid words when read backwards, e.g. BUS-SUB. If we limit ourselves to these words, our word square will be a “magic” word square (almost, since we do not check the diagonal). And finally – run!

s5 = set(w for w in words if len(w) == 5)
rw5 = [w for w in s5 if w[::-1] in s5 ]

for z in fill_iter(5, rw5):
    print(z)

Just a few selected examples in Polish:


Square(n=5, hlist=['igrom', 'glebo', 'rener', 'obelg', 'morgi'], vlist=['igrom', 'glebo', 'rener', 'obelg', 'morgi'])
Square(n=5, hlist=['momus', 'odoru', 'motor', 'urodo', 'surom'], vlist=['momus', 'odoru', 'motor', 'urodo', 'surom'])
Square(n=5, hlist=['kinaz', 'imaga', 'nagan', 'agami', 'zanik'], vlist=['kinaz', 'imaga', 'nagan', 'agami', 'zanik'])
