Categories
Life Programming

Hacker learning Chinese

I guess when a hacker learns a language it is different from the way “others” do that. I guess this, because I think I’m a hacker, I’m learning a language and it feels different. I see two main factors coming in the picture:

  1. I’m connecting things in my life, so the things I do usually need some motivation or purpose behind them, without which they are abandoned. The activities I keep up the best are the ones that connect multiple different things.
  2. I want to do things as efficiently as possible. Hack the tools, hack the process to make it better. If there are no tools, make some. If no processes, come up with ideas.

Okay, these are pretty vague expressed like this, now let’s give the example that prompted this post: I’m learning Chinese. Living in Taiwan for two and a half years now, so “high time” doesn’t even start to describe it. Finally I got a tutor, and a good one at that, so twice each week I have a good session of chatting and learning. After each of the sessions we have at least 30 or 40 new words and expressions written down. Those would normally be just forgotten, so I take an effort (about another hour or so after the session) to type them into a Google Document, this very one on the picture:

Google Docs Chinese vocabulary
The Google Document that powers my learning (click to enlarge)

Enter the English expression, the Chinese original and the pronunciation in Bopomofo (which I prefer to Pinyin). This last is possible because of Yahoo Chinese-English Dictionary (one service that is generally better than Google’s own Translate, though I frequently need to use both). The other three columns I’ll came in just recently.

At the moment I’m up to 307 expressions, and that’s just not possible to practice from a spreadsheet like that. I remembered, though, checking out a fellow StartupBus participant, Pamela Fox‘s Google Spreadsheet Flash Cards some time ago. It was fun but followed a different logic than me so couldn’t make complete use of it. But then: why not make a new practice system for myself? This goes back to the precious points: 1) connecting programming and languages – in both I’ll learn something new and they’ll reinforce each other’s motivation, 2) use the exact tools that I need even if I have to make them (because it’s possible to do).

Also after having done Who Said That? (that is currently down due to the AWS fail), I was into guessing games: let’s make a vocabulary guessing practice app that uses the above spreadsheet that I have anyways. What format it should be? Well, for the very first test, to see how to interact with Google Docs and such, just made it as a simple, console-based python app, something like this:

Chinese learning console
The console app to practice me Chinese vocabulary (click to enlarge)

Each practice round has 30 questions, randomized, 4 possible solutions, with the pronunciation as hint. Simple, though very ugly in the inside at the moment (here’s the repo, I still need to write a ReadMe). It works and now I know what to look out for in terms of implementation (how to log in, how to get and update data, stuff like that).

The other 3 columns in the above spreadsheet are explained as well: they are keeping score such as the total number of times a word/expression is practiced, the number of good answers and the number of current good-answers-in-a-row. These provide a rough-and-ready way to diagnose and manage the learning process – until I come up with better ways. Anyway, right now I get about 50%-70% good which is more than I expected, but still a lot more room for improvement.

A list of improvements to the program that I’m thinking about:

  • Making it into a site, so I can use my phone on the go to practice. Also, potentially others can use it as well to practice anything based on any spreadsheet with “one side – other side – hint” structure.
  • I’m pretty sure this whole things could be done in a single Javascript powered page. I don’t know enough Javascript to pull it off yet, but I’ve seen that all the components separately, and that would make a very portable and compact solution.
  • Need to figure out some easier setup of the spreadsheet if this is to be used by others later. I cannot rely on them understanding what they way I was thinking. Maybe in-app option to add more fields?
  • I remember reading somewhere that the most effective practice is that I reduce the frequency of checking words that I know well. That’s where the last column comes in: as one has more right answers in a row, one can tast that word/expression fewer times. If there’s a mistake then go back to the original method and test it more until score builds up again. This can potentially be a very complicated algorithm, I got to think of a way that scales well (ie. it is not too bad compared to an ideal method but does not require extensive amount of calculation). Have some ideas, but they need more polish.
  • After watching Salman Khan’s TED talk this year it grabbed me just how much information is there in one’s actions (they do amazing feedback to teachers on how the students study), how much better you can understand why did people what they did if you have all those diagnostic information (ah, the temptation of Big Data:). To apply this idea of extended diagnostics I could have a logging system instead of keeping (a simple) score. From there the system could get: how much you practice, how are you doing / getting better, what are easier or more difficult words for you, what two words you mistake with one another, suggest things to practice more, suggest words from topics that you know well to extend your knowledge… And more (this was just a little brainstorming). I don’t think I’d have time to extend it like that, and ideas are a dime a dozen, but one never knows…
  • Adding more modes of learning not just multiple choice, multiplayer learning, more game mechanics (achievements, pins anyone? :)
  • If there are central datasets instead of self-provided ones, then the system can anticipate what are the difficult parts from other students’ performance before you.

Now, let’s just see how will my Chinese improve during all this hacking, since that’s the main point, isn’t it? :)

New Formosa Restaurant Signature Dishes
Some motivation for learning, loads of Taiwanese food :)

Ps: If you have any language learning tricks, let me know in the comments!

Categories
Programming

Building (for) fun

I’ve been doing things very differently since I came back from the Startup Bus. Even before that I was writing up lists of technologies I wanted to try – but all there was to it, lists. Now I know one can indeed create something in a short time, so there’s no excuse to not actually trying all those tech. Here’s a little case study of what I’ve been doing in the last ~two weeks.

I had an idea for a game: I love quotations and reading a lot of Twitter I thought there could be a lot of “clevers” and “funnies” and “insightfuls” in there… I follow a lot of people and it is always interesting to see the different style people talk. So let’s take a bunch of people who others might know, find their tweets that make good quotation and let people guess who that tweet belongs to.

Yeah, that’s the whole thing. And all of it mostly made because 1) I like these kind of puzzles, 2) wanted to learn a bit of web app programming. Going ahead and looking at the result: Here it is, Who Said That? :

Website screenshot
playing with Who Said That. Here, I missed.

To summarize (mostly for my own education), here’s what I learned in the one week of development and in the one week since going live:

Behind the scenes

First I had to choose what technology it should all be based on. The two main contenders were Ruby on Rails and Django. I was reading a lot of pros and cons of both. I know that almost everyone I met on the Bus were doing Rails, and seen it myself too. On the other hand my go-to language to do things is Python. So they were on more-or-less equal footing. In the end I went with Django, because it is completely new to me and can have a little “niche” experience (not because not many people use Django, just not many who I know) that might come handy later. Also, there will be lots of other opportunities to try Rails in other projects. :)

I also wanted something hosted. Since Rails is out, Heroku is out too (that’s what everyone seems to be using, for a good reason, though I also didn’t want to spend money on it if I don’t have to). I also considered Google AppEngine which is not strictly Django but at least Python. I used before (and it is great at least on the small scale), so wanted to see if there’s any other hosting company out there. Fortunately there’s a startup that seems to be just the right thing for me: DotCloud. They are still in beta, but I got somehow an invite by talking to them on Twitter. They host Ruby, Python, Rails, Django, databases (MySQL, PostgreSQL, Redis), and more. One push deploy, roll-back, quite good docs, … and I love startups so why not help them too by testing their platform. :) So there it is: Django on DotCloud.

I also wanted to try a few different techs, like “can I use Redis for something in this project”? But that’s just the wrong way. Instead I should always ask: I want to do this or that: what’s the right tech for it? Using PostgreSQL and Django at the moment, though would think MongoDB would be a good fit too, but DotCloud does not have Mongo yet.

Getting interesting data

First I thought it will be quite easy, because people seemed to like the Twitter API. It was everything but… Now I learned to be much more careful of what I say: an API can be good for a lot of things, but to get exactly what I want can often mean a lot of wrangling, trial and error, and tweaking my own thought process to think in their developers’ way.

What did I want? “Quote-quality tweets”, that is: not a reply, not a retweet, not a link. Popularity (by some measure to be determined) is optional, but would be nice.

How did it go this time:

  1. “I want interesting Tweets from all kinds of folks, from everywhere!” (If you already know web development and my naivete start to make you sniggle at this point, I know, I know…) That looked like a perfect job for statuses/public_timeline. But that only pulls a couple of Tweets in.
  2. Then maybe the Streaming API? Well, I cannot have access to all the tweets (and probably I couldn’t handle of that traffic either). At my level (“Spitzer”) I can only have access to a random 1% of tweets. So I wrote a little app that was listening to the Stream and filtering tweets that looked like a good quote. I left that running the whole night. How many good tweets it collected? I think it was about 3 (yeah, single digit). How many of those were from people who anyone else would be interested? 0. Well, this failed again.
  3. [There was some more trial and error that I don’t remember very well…]
  4. Then had a spark: keep it simple. Instead of fishing in the big pond, let’s choose the interesting people first, and import what they are saying. Well, duh! So it can be even simpler: set up a Twitter user for the sole purpose of administration. Whoever that user follows will be in my database. Can also use Twitter lists to organize them and get some categories out (e.g. someone can be in Celebrity, Actors and Comedy in the same time). This also lends itself to easy administration within a Twitter interface, don’t have to roll my own.

So with this, I just had to figure out who to follow. At this point I mostly imported people who I know and I think they write well or interesting things.

  • WeFollow is a good resource for categories and popular people
  • Twitter’s own Who to follow
  • Using the Similar people link when I added someone new

So far I have about 70-80 people in the database, giving more than 4000 useful quotes. Some people are better than other. I was surprised how many of them just post loads of links but almost no personal content (and these are the things I like to learn in a project like this).

In the end I had something simple working and I enjoyed using. Now let’s make it ready to open up.

Domain Name

This part I didn’t really have to but wanted to: putting it on it’s own domain. I mean the likes of “somethingsomething.dotcloud.com” is okay, but I wanted to be able to make short links to the game and that format doesn’t really leaves much space e.g. if I make a Tweet with a link…

Been looking for a name for the better half of two days. Lots of searching and getting inspiration from Domai.nr. I learned a couple of things. One idea was “Who Said It?”, but .it’s Italy does not let people with not EU addresses to register. GoDaddy goes around this restriction for an extra $19.99/year (they vouch for you or something) but that was just too much for me and I didn’t want to register with them anyways. I do have an EU permanent residence at the moment, but just didn’t want to risk it. There were a couple of other domain names that are restrictive in this sense, and in the end I went with Austria’s .at, for “saidth.at”. It’s okay, not too expensive (the prices very a lot between countries, before I though there’s a more-or-less common price). One thing I didn’t like about it, that now I’m registered with living in “Taiwan, Province of China”. Say WHAT? (Okay, let’s put politics aside, that’s for some other time).

Making it look something

Well, from the design of the page one thing is easy to tell: I’m not a designer. I know I should make something simple, clean (because that’s what I like to use as well), but not entirely sure how to get around doing it. I was looking for CSS speech bubbles for a long time. It seems that most of the designs use the :before and :after pseudo-element, which is cool, but has its own problems: eg. cannot be manipulated from Javascript – if I wanted to make a speech bubble pointing 4-ways and then deleting 3 once the player guessed, then 1) I don’t know a way to use 4 :before/:after elements, 2) cannot alter them…. Well, I’ll look into this later.

In the end I’m happier with this opacity change, and hiding/unhiding of elements. I’m using jQuery to achieve it, and also for the AJAX calls to get the correct answer. It works well enough until I find a good designer who’s willing to improve on the page for not much in return. Or we’ll see…

My favorite part of this design is the relative robustness: I tried it even in the Kindle 3‘s browser, everything worked very well except for the results’ display by setting .innerHTML… Guess I should look around if there’s another way to do that. I bet there is.

Showing it off

Well, this didn’t work too well so far. I try to put shameless plugs everywhere, but I got altogether about 35 pageviews in the last 10 days…. and I think half of that must be me. :) Hurrah for Google Analytics. Things tried/to try

  • I like the “tweet this puzzle” button, but that got altogether 1 click. Guess people are spammed with linked content already so it does not stand out
  • Set up twitter account for the page. Not even spam-bots noticed it. The hidden management bot account on the other hand got a few “fans” already… :)
  • Set up a blog on Posterous. Got to write better content, so far I have 1 lone through-click. Good thing is that when I post something there, it is shown in their new post directories, so that blog got already much more casual “views” than this blog here. Space to improve
  • Submitting it later to tech sites? Maybe, when I made I finished implementing a few things that I want. Or when I give up implementing them for the moment.
  • ???

Maintenance

Since launch, I keep filing bugs, feature requests and refactor requests in the GitHub issue page of the site (well, just not to forget it). I did fix a couple of things but didn’t make that many new features as I thought. It might very well be because I’m more a “hacker” (i.e. making something work quick) then a developer (making something work well). I need some attitude change in this respect. Even if this is a “just for fun” project, I do have many ideas to improve it beyond the current stage. It would be too bad to let it die out of laziness.

Future

Keep improving. See whether there’s something the main domain (saidth.at) could be used for. See if I can pivot it to be something more interesting to people. See where did I go wrong with my assumptions that what makes good game mechanics (since I have 0 experience in that field before:).  Also, there are other ideas popping up in my head all the time, so got to think what is better: getting some of the other things done or improve this.

If anyone have any comments, feel free, I take this as an experiment and want to learn as much as possible from it. If you are making things for fun as well, tell me in the comments, I’d love to see! :)

Categories
Life Programming

Hopped on a bus

It’s been about two weeks since I came back from the adventure that is called the StartupBus. Things just start to sink in, soon the glare will fade and see what remains of the total awesomeness that is the Bus. I think a lot will stick with me, it was just way beyond what I could have imagined and I can already feel the changes the trip & the people made to me. The best possible changes, to say. :)

So, the StartupBus in a nutshell: 6 buses from all around the US set off a journey, with about 30 “buspreneurs” on board each, who ideally don’t know each other or haven’t worked together before. First, whoever has some business/product ideas, can pitch to the others. Groups are formed and everybody starts to make their idea into reality – and have to do that frantically, since there’s only about 48 hours before the buses arrive to their destination. Sounds like fun? If it doesn’t, you should just give it a try :)

Palm Springs morning before heading out.

I had no idea what to expect and when I heard the pitches that Tuesday morning, rolling out of San Francisco on the highway, I was thinking to myself: what do I do here? What can I contribute? There are just so many amazing developers, setting the bar. But I could choose a project in the end, something that was close to my heart since I do love to travel: Fly By Miles (the site went back to pre-launch mode for a bit, but that’s where it will be ;) – a site that wants people to use their frequent flyer miles well and easily. Our team had 8 members altogether: 5 developers (with yours truly), 2 designers and a business strategist. Of course these are lines drawn in the sand, everyone chipped in a bit in each role.

I have to say, thos ~3 days are close to a blur now. Should have continued writing my journal but there was just no chance for that. We were mostly hopping from co-working spaces to hotels and back, all through California, Arizona, New Mexico and Texas. I should know where, I wrote the Android app that told the map where we all are. There were some very nice places, interesting scenery that I haven’t seen much of. There were challenges like walking up to people in Santa Monica and getting feedback on our project – and taking a video of it (before this I’ve never thought I could do that. But got our 3 recorded videos and a few people’s thoughts off-the-record). There was always something to do and I enjoyed being left to my own devices and coming back with a solution. Not the best solution usually, so there were usually more than one iterations for solving each of the problems I tried to tackle, but at least there was a good modus operandi: give Greg a task, some time, maybe a couple of Red Bulls and he’ll come back with something.

By the way, Red Bulls: don’t let them anywhere near me for the time being. Lost count at 7, but the shakes stayed for a couple days after I stopped so it must have been more than that. Haven’t had it since high school (which has been a while) and under this hackathon circumstances they are very addictive…

Startup bus life snapshot
Life on the bus: laptops, getting contacts on phone, powernaps… Intense and awesome.

Well, after a few hotels and sleepless/frantic nights, we arrived to Austin. Finished up a proof-of-concept, working prototype and got some sleep. The next days were about taking Austin in, with Ignite, with SxSW, flashmob, friends, places, food… There was some programming as well, since Fly By Miles got into the semifinals, then into the finals, held in the Hilton. It was great, I really enjoyed it, an albeit we didn’t win, there was a lot to take home. Also, since that was my last night in Austin, I haven’t actually slept (again) and it was surreal to leave the place at 5am. Missed our CNN interview, but well, if we do it right, there will be more than enough to make up for that. ;)

What do I think I learned:

  • Now I can code everywhere. In the park, on the curb, on the plane, on the bus… Though Taiwanese buses have more hectic drivers and their suspension is worse than the Bus was so better hold on to that computer. Haven’t been coding on a motor-scooter yet, but I will…. probably won’t try that. :)
  • It is worth knowing the popular tech in the field I’m interested in even if I’m not using it (yet). In our group Ruby on Rails was the thing – and I’m a Python hacker. It was quite confusing for the first time, but I can see the advantages. This stands for other fields as well: mobile development, databases, design, front end interface…. There’s a lot of interesting tech out there, and I believe that one should try things out before the need arises. That makes educated choices down the road.
  • Have a niche. I might be biased, but I think this is even more important than the previous point. Know something else than others. Networking, game development, Big Data, Python, functional programming, myriads of 3rd party APIs (Google, Twillio, …), and so on. Anything that excites you.
  • Be quick learner. Before the trip I had to (well, wanted to :P ) pick up Android development in a week. And it produced something that crashed fewer times than I thought it would. :) After coming back I wanted to pick up Django in a week because I was curious. There’s even a result (Wanna see? There will be a proper post about that). I believe everyone can do that, so no excuse for not learning a new things ever week.
  • Choose carefully in what things you rely on others. It is easy to get burned and few things are worse than having your fate in someone else’s hand.
  • Take note what people say, but don’t have to take them too seriously. Everyone’s been excited at the end of the trip and wanted to continue the project. Two weeks passed and 2 (and a half) people left of the 8. No hard feelings, that’s how things work and there will certainly time to work with them together.
  • Getting in touch with people is easy. Not just technically, but nothing should hold you back. Fire off that email to the big shot you met, talk to the 2nd level contact on Linked In if you need to… Most of them will love something personal, just keep it simple, honest, and no hard feelings if it doesn’t work out. This liberated me on so many levels
  • Startup life is a test on one’s liver. I’d thought it is a marginal issue, but better be prepared. Not sure what way (no, I don’t mean training), but got to.
  • The best is always what you have but it’s always good to look out for more. For example. the Silicon Valley bus was undoubtedly the best bus of them all! :D But that won’t stop me from knowing as many amazing folks from each of the buses and off the buses as well.
  • Silicon Valley at least was Mac World. Been an outsider with my Lenovo, but I don’t mind if I cannot share the power plug with the other 29 peeps on the bus. Penguin power!
  • Laptop stickers are cool. I’m late to the party to say that (seeing some of the laptops there) but anyways…. I got to be selective, though, otherwise just too many stickers flying around.
  • Now I just cannot stop the flow of ideas. It is not a question that I’m a “starter”. What I need to learn how to select the good ones (or, probably I should select the “ridiculous” ones, they seem to work the best), and how to be a “finisher”.
Todo list
Morning results of the todo list after a whole night of hacking

Probably there are more lessons, but that’s enough. Now off to do some practice. Let’s see what happens until next year’s Bus :)

Categories
Life Programming

The power of not caring

I’m reminded time and time again, that the best things I do come from fun and passion, not mere sweat and grinding teeth.

Recently I read about a project called The Startup Bus. It’s a bunch of strangers getting on a bus going from A to B, for 48 hours, in which time they build a startup company from 0 to launch. It looked very intriguing and I applied even if I couldn’t imagine going there (after all, it’s in the US and I’m in Taiwan). Wrote up an introduction that was… how shall I put it? Ordinary? Bland? (maybe even that’s generous). Then I started to hear from more and more people on Twitter that they got on it and I did as most people do when they cannot get something: started to want it :) On the other hand, looking at who got on I had no hope that I’d be selected, none at all.  So instead of revising the application I had, just gave it up…. but also just wrote another one application. In true nerdy fashion in a programming language. And made it actually run. Just for fun. It is not perfect (can see it on Github) and actually I cannot imagine that no-one else thought of it… When I was satisfied, just sent it in, closed the browser and walked out. Not even an hour passed, I got my invitation…

So I’m heading off to San Francisco this weekend to mingle with a bunch of very clever people, do a lot of programming, most likely things I’ve never tried or even imagined trying, go all the way to Austin, when maybe could pitch to possible investors. What will come out of all this, I have no idea. But certainly glad I “didn’t care” for long enough that creativity (no matter how shallow) started to flow.

For me this whole story is brings up a poem by Dallas Clayton:

Good/Bad

How a bad idea starts:
“That looks easy… I could do that.”
How a good idea starts:
“That looks fun… I should do that.”

I like this way of thinking a lot, I even got it on a bag to remind me. :)

Now all the preparation is on the way, I’m helping by making an Android app to be able to track the buses en route. I week ago I didn’t know anything about Android apps. But there you go, it’s working, more or less. :P

BusDroid running on HTC Desire
A little side project for the Bus.

… and soon I’m hoping to use the power of not caring for many other things in life.

Categories
Programming

Poor lemmings, or adventures into cliques

I love programming competitions and attempt to try every new site I find. Those sites will worth another post later, this current will be about a programming contest on CoderCharts I took part recently. It finished about two days ago and I didn’t do as well as I hoped, but did much better than I would have a year ago. Out of the 8 puzzles i have: 5 solved, 1 given up on for now, 1 haven’t even started and 1 attempted but run out of time . This last on is one Lemmings mating, one of the “hard” puzzles, and when I say run out of time I mean at 4am I decided that for the 4 ours then left to finish the competition I’d rather sleep than try to think of another optimization. Still, it bothered be and I did manage to write solution (with ~70% score, not too bad). Now I want to document the how did I get there so I won’t forget.

* Spolier warning. If you haven’t solved the problem and want to do it independently, do not read on. *

Path to a solution

Looking at the problem statement, there’s clearly a graph-related problem behind all the story of the poor creatures. When I’m not sure about not just what should be the solution, but what exactly is the right description the problem, I usually open The Algorithm Design Manual (ADM). It’s a great book, so far I couldn’t find a problem where it had nothing to say. It is a bit dated, however, and while sets me on the right path, there’s usually much more reading than that. In this situation it quickly revealed that the problem I face is finding the maximum clique in the graph describing the problem. Well, quickly learned that there’s a big can of worms waiting for me in there.

Maximum clique

So, to summarize, a clique is a group of vertices that are all connected to each other by edges. Got a maximal clique when it is not possible to add any other vertex in a way that everyone’s still connected. Finally the maximum clique is the largest maximal clique. Note how logical the notation is but still so easy to mix up. It is relatively easy to find maximal cliques (start from any vertex and add more connected vertices until there’s none left that is connected to all already selected). Finding the maximum clique (or any maximum clique since there could be more than one) is, however, a proper NP-complete problem.

The ADM only says something along the lines, that “well, then you just have to do a backtracking through all the graph”. Which is a great idea and simple to implement, I would had a result, but it is terribly inefficient because in a naive implementation one would check lots of cases which cannot possibly have the result. So fist thing to do is pruning, or eliminating paths that we are sure that it cannot improve what we already have. Wikipedia helps in finding just such a thing, called Bron–Kerbosch algorithm. Idea is quite simple (just wrap the head around recursive functions) and it can be implemented in Python following the pseudo-code version on Wikipedia. Here’s mine (not really optimized or anything).

def bronkerbosch(R, P, X, graph):
    """
    Want all vertices from R, some from P and none from X
    where the graph dict defines the connections
    """
    if len(P) == 0 and len(X) == 0:
        return R
    candidates = P.copy()
    max = set([])
    for v in candidates:
        res = bronker1(R.union([v]),
                       P.intersection(graph[v]),
                       X.intersection(graph[v]),
                       graph
                       )
        if len(res) > len(max):
            max = res
        P.remove(v)
        X.add(v)
    return max

This really worked and actually 3 out of 7 tests are passed, but the rest timed out. Got to find something better. Looking at more on Wikipedia, I was tempted to get on with the version of the algorithm that they said is “the fastest”, by J.M. Robson. It’s written up, but it is from the pre-MathJax era, so it’s just terrible work to figure out and keep up with all the strange notation. Also, it looks like a big collection of special cases and theoretical shortcuts. I’m sure it works, maybe I’ll come back to it later but for now I wanted to have a little bit more of an insight as well, so looked a bit more.

Vertex colouring

Since at this point I really caught up with the fact that maximum clique finding is an important problem, so went to Google Scholar to see what the academia has written. A lot, apparently. I was browsing though them in a kind of reverse-chronological order, so I could find the “newest” algorithms, since that should be the best. In the end a pattern emerged: most of the hard work can be done by employing a different technique, the vertex colouring. The two are connected because a clique in a graph G is an independent set in the complement graph of G (two vertices are connected in the complement of G if  and only if are not connected in G), and vertex coloring is a good method of finding independent sets. What we want is labeling the vertices with the smallest number of different labels (“colours”) such that vertices with the same label are independent from each other. The number of different labels we have to use to do that helps pruning our clique finding algorithm by setting an upper bound how big clique it is possible to make from the vertices we have in a particular step.

As an example we can start out with a graph like this (taken from the original lemmings problem):

A sample graph for illustration
Defining the graph: vertices are labelled and edges show connection

When running this graph through a vertex colouring algorithm, we would get something along this lines:

Result of graph colouring
The result of the graph colouring: vertices of the same colour are independent

The blues are not connected to each other, neither are the greens and so on. In this case we need 4 colours to separate the graph into independent groups. Of course when programming one would use numbers instead of colours (thus often this called “numbering”): blue was actually “1”, red was “2”, … By adjusting the algorithm in terms of what is the order of the colours and the order of the vertices within each colour the previous algorithm can be improved by orders of magnitudes since we much better pruning. In the maximal clique every vertex would have different colour (though not all colours will necessarily be used).

Some notes on the algorithm to get started:

  1. Try to keep nodes in the order of neighbours in the current sub-graph. The smaller is good, because they can be quickly eliminated, reducing the test-space more.
  2. Go from colors with fewer members to more members, similar reasons.
  3. Can use the color number + the current clique size for better pruning: if my largest clique so far has Q elements, my current clique has P and the tested vertex comes from the highest colour number N, then in case of N + P <= Q it is futile to go on, I cannot improve the result and time to backtrack. Less efficient methods use the C number of elements in the current candidate set instead of N, and since N<=C, the pruning bound set by C+P is more “loose” letting more tests to be done than necessary.

For example, the final solution of this given problem is the set of vertices highlighted by the edges is shown on the next picture, and it is done in only 4 steps or so…

The final solution, the maximum clique
The maximum clique on this sample graph defined by the highlighted edges.

Different methods

In the papers I found a few different colouring schemes, and most of the improvements seem to be made in:

  • initial ordering, or
  • ordering within each of the colours, or
  • better upper-bound estimation for the possible clique size from the available colour information.

The hardest things about reading these papers is that often the pseudo-code is a big mess, there are plenty of typos and the examples are quite scarce.

In the end I used algorithms mixed from two different papers:

Tomita, Akutsu and Matsunaga, Efficient Algorithms for Finding Maximum and Maximal Cliques: Effective Tools for Bioinformatics, 2011 (link) This has a few different algorithms and the most effective one looked too complicated and I went with MCQ, one of the improved but not perfect on. Now 6/7 tests pass, that’s got to be the right path. According to benchmarks the coloring takes up the most time

Segundo, Rodriguez-Losada and Jimeneza, An exact bit-parallel algorithm for the maximum clique problem, Computers & Operations Research, Volume 38, Issue 2, 2011, pp.571-581 (doi:10.1016/j.cor.2010.07.019). Their algorithm is a variation on the previous one and I’m not using it to it’s full potential (wonder how fast a C implementation would be). Also, I wonder if I misunderstood something but the original form of the algorithm failed on me. There are nodes that don’t need to have colour information because it does not matter at that stage of the algorithm. Their version of colouring then seems to remove those vertices from the colouring function output. This does not matter for their example (it is correct) but in my tests it gives wrong result (the algorithm finishes too early). So, mixed in from the previous paper that I keep all nodes whether they are coloured now or not, because later they might be, it works like a charm. And about a factor of 2-3 faster than before, enough to pass all 7 tests on CoderCharts and with a reasonable score. I think the real improvement is not really in the pruning (this version seems to check a few more nodes than the previous) but somehow the colouring function is faster – that might tell something about my Python skills too, though.

Lessons

This could probably be solved with other algorithms as well, I would be curious to see others’ code / hear their stories. What I took home from this:

  • Know your problem and know where to look for solutions. The best is to study and practice a lot.
  • Often pre-processing the data is a crucial step to the solution. Since there’s usually a memory/speed tradeoff, it is worth experimenting: if and algorithm is too slow, what could be prepared and stores to exchange calculating something with a mere data lookup?
  • Different algorithms have bottlenecks at different sections. E.g. in this case there is the data preparation, the number of steps to test possible solutions and the time it takes for re-colouring. On the path to the final solution I had algorithms inefficient in each of the sections.
  • Test cases are important. The example in the problem setting and the examples in all the papers are too small to be helpful for optimization. The “bonus” test case on problem site, however , is just too big. Actually, even my passed algorithm is too slow to have a result for it in reasonable time. So, if there’s no provided test case, make your own
  • The example cases I generated are much smaller than the 7 tests I need to pass, but they are still slower. Every paper said that maximum clique finding has very different running time for different (random) graphs. I presume that some of the test graphs are in fact specially prepared. Also, some of  the changes in code that made my own test cases faster (sometimes 20%) often failed more tests on the web. The message is that coding competitions like this are set towards finding the optimal solution for their own tests – whatever your winning solution might be, it is not necessarily the best overall solution and if I ever come across a similar problem in real life setting I likely to have to do differently.
  • Most of the Python optimization advice I found is outdated and sometimes outright hurting the performance. Do plenty of performance testing. Start optimizing the big slow downs, the small shortcuts rarely worth the time in competition setting.
  • Language choice matters but does not matter that much. I’d think Python actually pushes me to be more efficient because many things are slower. The code on the other hand is more readable in the end (ideally).
  • Someone has to update Wikipedia. :)

I will share the code later, just need to wait until some time has passed after the competition.