13/10/2009

Word cloud algorithm

This is how Iv been doing my word clouds for anyone to copy/learn/improve. Iv had this working in python/cairo, JS/HTML and JS/SVG and its fast enough to work in all.

First the input - we have words/phrases and a weight. Something like this in python:

[
  ('dunk', 100),
  ('python', 50),
  ('rules', 20),
]

You can make up that list however you want. The important thing is we have a word and an importance.

Standard word clouds use font size to display importance. So from these numbers we need to calculate a font size for each word/importance. You could just use the importance as a font size or you could specify a max/min font size and use the importance to work out a new size. How ever you do it we now have a similar list as the input but we can replace importance with font size.

[
  ('dunk', 20),
  ('python', 10),
  ('rules', 8),
]

Normally you will sort the words by font-size/importance in descending order. If you dont it wont matter but you might get some strange results.

Before we place words we need to be able to be able to calculate a bounding box for each word. This is actually rather important. If your doing this in HTML then you can add a new div element with the right font-size and inner text and then use JS to find the pixel size of the div ( the same technique works for SVG too). If your using cairo then you can simply request the extents of a string without having to actually draw anything.

Placing the first word is simple: just choose somewhere and place the word there. Easiest is probably just choosing the centre of your page but there is nothing stopping you using the corner of your output document or anywhere else. Once you've chosen a place we can remove this item from our input list and add it to a "placed" list.

Now for the best bit: we know that next time we add a word we want to add it close to the last... or - easier: we want to place the next word on one of the edges. All we need is a list of edges that we can try and place our word on.

So after placing the first word, we take the bounding box and turn that into a list of edges. Keep the list of edges in order - either clockwise or anticlockwise. For each edge calculate a normal ( this is why you need to maintain order ). All the normals should point away from the centre of the bounding box. These edges should be added to a "free edges" list.

Now for the next word: first calculate the bounding box for the word. Now for each edge in the free edges list place the word on that edge. You'll need to use the edge normal to correct place the word. We know have the new word neatly aligned with the first. For the second word we have no trouble with collisions as nothing can possibly collide - but for all the other words we have we need to make sure that the new word doesnt collide with any others. For this we can do a simple rect/rect collision test with every word in the placed words list. If we detect a collision simple try the next edge in the free edges list. The successful edge should now be removed from the free edges list and new edges from the new word should be added. The edge on the new word that aligns with the chosen edge can also be removed.

Now just keep repeating until either you cant place any more words or dont have any words left to replace. Keep looping through edges, testing for collisions, adding new edges...

Thats it.

There are a load of ways to make this nicer. First, sorting the free edges list by some criteria makes for nicer clouds. Sorting by distance to the centre makes for a nice looking cloud. You can also consider better ways for adding/removing edges: here I have said simply remove the used edge, but if you think about it adding a new word could possibly create new edges where words dont align perfectly.

Other things to think about are word rotation, which is reasonably simple, you just need to decide when to rotate a word. Also you can make things much more complicated by calculating a bounding box for each character in a word instead of for the whole word. This will let words be placed inside other words. It does of course greatly increase the complexity (and therefore runtime).

No comments: