Physicist, hacker. Enjoys avant-garde literature probably a bit too much. Open source advocate and contributor, both for software and hardware. Follow these posts on the Fediverse at @gergely@gergely.imreh.net
I’m reminded time and time again that the best things I do come from fun and passion, not mere sweat and gritted teeth.
Recently I read about a project called The Startup Bus. It’s a bunch of strangers getting on a bus going from A to B for 48 hours, in which time they build a startup company from 0 to launch. It looked very intriguing and I applied, even though I couldn’t imagine going (after all, it’s in the US and I’m in Taiwan). I wrote up an introduction that was… how shall I put it? Ordinary? Bland? (maybe even that’s generous). Then I started to hear from more and more people on Twitter that they got on, and I did what most people do when they cannot get something: I started to want it :) On the other hand, looking at who got on, I had no hope that I’d be selected, none at all. So instead of revising the application I had, I just gave up on it… but also wrote another application. In true nerdy fashion: in a programming language. And made it actually run. Just for fun. It is not perfect (you can see it on Github), and I can hardly imagine that no-one else has thought of it… When I was satisfied, I just sent it in, closed the browser and walked out. Not even an hour later, I got my invitation…
So I’m heading off to San Francisco this weekend to mingle with a bunch of very clever people, do a lot of programming (most likely things I’ve never tried or even imagined trying), and go all the way to Austin, where we might even pitch to potential investors. What will come out of all this, I have no idea. But I’m certainly glad I “didn’t care” for long enough that creativity (no matter how shallow) started to flow.
For me this whole story brings up a poem by Dallas Clayton:
Good/Bad
How a bad idea starts: “That looks easy… I could do that.” How a good idea starts: “That looks fun… I should do that.”
I like this way of thinking a lot; I even got it on a bag to remind me. :)
Now all the preparation is under way, and I’m helping by making an Android app to track the buses en route. A week ago I didn’t know anything about Android apps. But there you go, it’s working, more or less. :P
… and soon I’m hoping to use the power of not caring for many other things in life.
I was catching up on StartupBus, an awesome (crazy? probably both) project that I really wish I could take part in. One part was watching the videos they shared on YouTube, especially a series by Phil McKinney (Chief Technology Officer at HP), where among other things he talks about the Rules of the Garage (the garage where HP was started). It is very inspiring, and I think those rules should apply not just in business but to many other creative team efforts. One such creative team effort is, for example, a physics lab, just like the one I’m working in now. And indeed this group could use a philosophy infusion like this.
Believe you can change the world.
Work quickly, keep the tools unlocked, work whenever.
Know when to work alone and when to work together.
Share tools, ideas. Trust your colleagues.
No Politics. No bureaucracy. (These are ridiculous in a [lab]).
The customer defines a job well done. [?]
Radical ideas are not bad ideas.
Invent different ways of working.
Make a contribution every day.
If it doesn’t contribute, it doesn’t leave the [lab].
Believe that together we can do anything.
Invent.
These are so well formulated that they give more of an “aha!” feeling rather than a “yeah? why’s that?”. Researchers should have all of this drilled in. Or rather, everyone should.
The only one that I’m not satisfied with is #6: what constitutes a “job well done” in academia? There are no customers, so there is no direct, measurable response to the effort. The commonly used metrics are the amount of grant money won, the number of members who won tenure, the number of publications and/or citations, conference invitations, and so on, none of which really captures what science is about and why people should be doing it. But then what is the metric? Maybe this is a problem in general: no clear aim makes for unfocused effort. Or is it a strength, since people can define their own measures of success? I don’t think that’s the right way either. I usually go by the “amount of curiosity satisfied and new things learned”. But since I feel I have learned an awful lot lately while still barely moving forward, maybe I’m not the example to follow on this.
Anyway, I wanted to post this because it needs a bit more thinking and I have to start somewhere. And indeed there’s a lot to do…
Now back to work. But better this time, there’s a world to change.
I love programming competitions and try every new site I find. Those sites are worth another post later; this one is about a programming contest on CoderCharts I took part in recently. It finished about two days ago, and I didn’t do as well as I hoped, but much better than I would have a year ago. Out of the 8 puzzles I solved 5, gave up on 1 for now, never started 1, and attempted 1 but ran out of time. This last one is Lemmings mating, one of the “hard” puzzles, and when I say I ran out of time, I mean that at 4am I decided that for the 4 hours then left in the competition I’d rather sleep than try to think of another optimization. Still, it bothered me, and I did manage to write a solution afterwards (with a ~70% score, not too bad). Now I want to document how I got there so I won’t forget.
* Spoiler warning. If you haven’t solved the problem and want to do it independently, do not read on. *
Path to a solution
Looking at the problem statement, there’s clearly a graph problem behind the whole story of the poor creatures. When I’m not sure, not just about what the solution should be, but about what the right description of the problem even is, I usually open The Algorithm Design Manual (ADM). It’s a great book; so far I couldn’t find a problem it had nothing to say about. It is a bit dated, however, and while it sets me on the right path, there’s usually much more reading to do. In this case it quickly revealed that the problem I face is finding the maximum clique in the graph describing the problem. Well, I quickly learned that there’s a big can of worms waiting for me in there.
Maximum clique
So, to summarize: a clique is a group of vertices that are all connected to each other by edges. A clique is maximal when it is not possible to add any other vertex such that everyone is still connected. Finally, the maximum clique is the largest maximal clique. Note how logical the terminology is, yet how easy to mix up. It is relatively easy to find maximal cliques (start from any vertex and add more connected vertices until there’s none left that is connected to all those already selected). Finding the maximum clique (or any maximum clique, since there can be more than one) is, however, a properly NP-hard problem (its decision version is NP-complete).
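That greedy maximal-clique construction is simple enough to sketch in a few lines of Python (a minimal illustration of the idea, using a hypothetical graph-as-dict representation, the same one my code below uses):

def greedy_maximal_clique(graph, start):
    """Grow a maximal (not necessarily maximum!) clique from `start`,
    where `graph` maps each vertex to the set of its neighbours."""
    clique = set([start])
    # Candidates are the vertices connected to everything in the clique
    candidates = set(graph[start])
    while candidates:
        v = candidates.pop()
        clique.add(v)
        candidates &= graph[v]  # keep only the common neighbours
    return clique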
The ADM only says something along the lines of “well, then you just have to backtrack through the whole graph”. Which is a great idea and simple to implement, and I would have had a result, but it is terribly inefficient, because a naive implementation checks lots of cases which cannot possibly contain the result. So the first thing to do is pruning: eliminating paths that we are sure cannot improve what we already have. Wikipedia helps in finding just such a thing, the Bron–Kerbosch algorithm. The idea is quite simple (just wrap your head around recursive functions) and it can be implemented in Python following the pseudo-code version on Wikipedia. Here’s mine (not really optimized or anything):
def bronkerbosch(R, P, X, graph):
    """
    Want all vertices from R, some from P and none from X,
    where the graph dict defines the connections.
    """
    if len(P) == 0 and len(X) == 0:
        # No candidates left and nothing excluded: R is a maximal clique
        return R
    best = set()
    for v in P.copy():
        res = bronkerbosch(R.union([v]),
                           P.intersection(graph[v]),
                           X.intersection(graph[v]),
                           graph)
        if len(res) > len(best):
            best = res
        P.remove(v)
        X.add(v)
    return best
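To call it, start with everything in P and nothing in R and X. A tiny made-up example graph, a triangle plus a pendant vertex:

graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(bronkerbosch(set(), set(graph), set(), graph))  # -> the maximum clique {1, 2, 3}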
This really worked, and 3 out of 7 tests passed, but the rest timed out. I had to find something better. Looking around more on Wikipedia, I was tempted to go with the version of the algorithm that they say is “the fastest”, by J.M. Robson. It’s written up, but it is from the pre-MathJax era, so it’s just terrible work to figure out and keep up with all the strange notation. Also, it looks like a big collection of special cases and theoretical shortcuts. I’m sure it works, and maybe I’ll come back to it later, but for now I wanted a bit more insight as well, so I looked a bit further.
Vertex colouring
By this point I had really caught on to the fact that maximum clique finding is an important problem, so I went to Google Scholar to see what academia has written about it. A lot, apparently. I browsed through the papers in roughly reverse-chronological order, so I would find the “newest” algorithms first, since those should be the best. In the end a pattern emerged: most of the hard work can be done by employing a different technique, vertex colouring. The two are connected because a clique in a graph G is an independent set in the complement graph of G (two vertices are connected in the complement of G if and only if they are not connected in G), and vertex colouring is a good method for finding independent sets. What we want is to label the vertices with the smallest number of different labels (“colours”) such that vertices with the same label are independent of each other. The number of different labels we have to use helps prune our clique-finding algorithm, by setting an upper bound on how big a clique can possibly be made from the vertices we have at a particular step.
As an example we can start out with a graph like this (taken from the original lemmings problem):
When running this graph through a vertex colouring algorithm, we would get something along these lines:
The blues are not connected to each other, neither are the greens, and so on. In this case we need 4 colours to separate the graph into independent groups. Of course, when programming, one would use numbers instead of colours (thus this is often called “numbering”): blue was actually “1”, red was “2”, … By adjusting the order of the colours and the order of the vertices within each colour, the previous algorithm can be improved by orders of magnitude, since we get much better pruning. In the maximal clique every vertex has a different colour (though not all colours will necessarily be used).
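A simple greedy colouring is enough for this purpose: visit the vertices in some fixed order and give each one the lowest number not used by its neighbours. A minimal sketch (my own illustration, not the exact code from any of the papers):

def greedy_colouring(order, graph):
    """Assign each vertex the lowest colour number not already used
    by its neighbours; `order` fixes the visiting sequence."""
    colour = {}
    for v in order:
        used = set(colour[u] for u in graph[v] if u in colour)
        c = 1
        while c in used:
            c += 1
        colour[v] = c
    return colour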
Some notes on the algorithm to get started:
Try to keep the nodes ordered by their number of neighbours in the current sub-graph. Fewer is good, because those vertices can be quickly eliminated, reducing the search space more.
Go from colours with fewer members to those with more members, for similar reasons.
Use the colour number plus the current clique size for better pruning: if my largest clique so far has Q elements, my current clique has P elements, and the tested vertex comes with the highest colour number N, then if N + P <= Q it is futile to go on; I cannot improve the result, and it is time to backtrack. Less efficient methods use the number of elements C in the current candidate set instead of N, and since N <= C, the pruning bound set by C + P is “looser”, letting more tests run than necessary. (A sketch of this bound in code follows this list.)
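Putting the colouring and the N + P <= Q bound together, the core of the search looks something like this minimal sketch (in the spirit of the MCQ-style algorithms in the papers listed below, not any one of their exact algorithms; it reuses the greedy_colouring helper from above):

def max_clique(graph):
    """Colour-bounded branch and bound for the maximum clique."""
    best = [set()]  # one-element list so the inner function can rebind it

    def expand(clique, candidates):
        # Colour the current sub-graph, higher-degree vertices first
        sub = dict((v, graph[v] & candidates) for v in candidates)
        order = sorted(candidates, key=lambda v: len(sub[v]), reverse=True)
        colour = greedy_colouring(order, sub)
        # Try candidates from the highest colour number downwards
        for v in sorted(candidates, key=lambda v: colour[v], reverse=True):
            if len(clique) + colour[v] <= len(best[0]):
                return  # the N + P <= Q bound: nothing better down here
            rest = candidates & graph[v]
            if rest:
                expand(clique | set([v]), rest)
            elif len(clique) + 1 > len(best[0]):
                best[0] = clique | set([v])
            candidates = candidates - set([v])

    if graph:
        expand(set(), set(graph))
    return best[0]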
For example, the final solution of this given problem (the set of vertices highlighted by the edges) is shown in the next picture, and it is found in only 4 steps or so…
Different methods
In the papers I found a few different colouring schemes, and most of the improvements seem to be made in:
initial ordering, or
ordering within each of the colours, or
better upper-bound estimation for the possible clique size from the available colour information.
The hardest thing about reading these papers is that the pseudo-code is often a big mess, there are plenty of typos, and the examples are quite scarce.
In the end I used algorithms mixed from two different papers:
Tomita, Akutsu and Matsunaga, Efficient Algorithms for Finding Maximum and Maximal Cliques: Effective Tools for Bioinformatics, 2011 (link). This has a few different algorithms; the most effective one looked too complicated, so I went with MCQ, one of the improved but not perfect ones. With it, 6/7 tests passed: that had to be the right path. According to benchmarks, the colouring takes up the most time.
Segundo, Rodriguez-Losada and Jimenez, An exact bit-parallel algorithm for the maximum clique problem, Computers & Operations Research, Volume 38, Issue 2, 2011, pp. 571-581 (doi:10.1016/j.cor.2010.07.019). Their algorithm is a variation on the previous one, and I’m not using it to its full potential (I wonder how fast a C implementation would be). Also, maybe I misunderstood something, but the original form of the algorithm failed for me. There are nodes that don’t need colour information, because it does not matter at that stage of the algorithm, and their version of the colouring seems to remove those vertices from the colouring function’s output. This does not matter for their example (it comes out correct), but in my tests it gives the wrong result (the algorithm finishes too early). So I mixed in, from the previous paper, the practice of keeping all nodes whether they are coloured at that moment or not, because later they might be, and it works like a charm. It’s about a factor of 2-3 faster than before, enough to pass all 7 tests on CoderCharts with a reasonable score. I think the real improvement is not really in the pruning (this version seems to check a few more nodes than the previous one), but somehow the colouring function is faster – though that might say something about my Python skills too.
Lessons
This could probably be solved with other algorithms as well, I would be curious to see others’ code / hear their stories. What I took home from this:
Know your problem and know where to look for solutions. The best is to study and practice a lot.
Often pre-processing the data is a crucial step towards the solution. Since there’s usually a memory/speed tradeoff, it is worth experimenting: if an algorithm is too slow, what could be prepared and stored to exchange a calculation for a mere data lookup?
Different algorithms have bottlenecks in different sections, e.g. in this case the data preparation, the number of steps needed to test possible solutions, and the time it takes to re-colour. On the path to the final solution I had algorithms inefficient in each of these sections.
Test cases are important. The example in the problem setting and the examples in all the papers are too small to be helpful for optimization. The “bonus” test case on the problem site, however, is just too big. Actually, even my passing algorithm is too slow to produce a result for it in reasonable time. So, if there’s no suitable provided test case, make your own.
The example cases I generated are much smaller than the 7 tests I needed to pass, yet they run slower. Every paper said that maximum clique finding has very different running times on different (random) graphs, so I presume some of the test graphs are in fact specially prepared. Also, some of the changes that made my own test cases faster (sometimes by 20%) often failed more tests on the site. The message is that coding competitions like this are geared towards finding the optimal solution for their own tests – whatever your winning solution might be, it is not necessarily the best overall solution, and if I ever come across a similar problem in a real-life setting, I will likely have to do things differently.
Most of the Python optimization advice I found is outdated and sometimes outright hurts performance. Do plenty of performance testing. Start by optimizing the big slowdowns; the small shortcuts are rarely worth the time in a competition setting.
Language choice matters but does not matter that much. I’d think Python actually pushes me to be more efficient because many things are slower. The code on the other hand is more readable in the end (ideally).
Someone has to update Wikipedia. :)
I will share the code later, just need to wait until some time has passed after the competition.
I’m always looking for new and interesting magazines to read. I do believe there’s still a future for printed journalism, even if most of what I read now is online (Hacker News, myriads of blogs on Google Reader, links shared on Facebook and Twitter…). During my time at university, a perfect weekend program was having brunch in the Common Room and checking out the latest issue of The Economist. My interests are mostly analysis, world affairs, and insights from people with much more experience, not shoved down my throat as many dailies seem to do, but with space left for me to make up my own mind.
Recently I was checking out Monocle in my local Eslite Bookstore; it was shrink-wrapped, since it’s not one of the cheapest (NT$520, almost twice the cover price elsewhere). The cover promises ABCDE: Affairs, Business, Culture, Design and Edits – which is all good and could be very interesting. I checked on the web a little to see what others wrote about it, and it is all good. With all the accolades, their worldwide reach (apparently they have offices all around the globe), doing their own photography because they want the best, all the enthusiasm of the staff… I actually felt that this might be the real-life embodiment of the Millennium magazine. That’s certainly a lot to live up to, isn’t it?
So, last week I bought it. First impression: I was pretty underwhelmed. The features are thin on content, most of the content feels like an IKEA catalogue, and how can I relate to something that advertises £190 polo shirts? I wasn’t that excited about it anymore, and it certainly wasn’t the Millennium.
Nevertheless, I took it out every now and again, read more of it, and things did change. I think I was wrong to hype it up for myself; I should have judged it on its own merits. And on those, the writers are certainly clever. Maybe their bread and butter is short observations, but many of them. The topics are actually worthy. The photography is indeed top-notch. The designs they show are really cool (and living in Taiwan, where I’m quite spoiled with good design, that is a tough test). In the end, I’d say it is a good magazine. Maybe not for me, or not every single issue, but if I found it in a library I’m sure I’d check it out. If design were my business, I’d subscribe.
The One Idea
Nevertheless, there was one idea that stuck with me (and maybe that’s one reason I’ve started to change my mind about Monocle), which came from a rather short editorial titled “What Ireland can learn from Finland”. The writer argues that Ireland’s crisis was pretty much inevitable, since no economy can survive on the service industry alone. They should instead start to make things again, rolling up their sleeves and creating something tangible.
After reading this I felt a bit shocked, looking at the things I’m doing: reading the hymns of software developers and Web 2.0 startups every day, positioning myself to become a better programmer. And despite doing all that, I do remember now that I wanted to make things, and I haven’t. It feels like a wake-up call: there are things that I value more, but I had forgotten about them.
This got me thinking: instead of being an awesome programmer (good luck with that), I really should think out how I can leverage my maker background (every experimental physicist is a maker) and my programmer ambitions to create something new. Not giving up either of them (I couldn’t), but finding what unique combination of skills I might have. I really feel this is what would bring the much-desired sense of achievement.
It also got me thinking that I have all the issues of Make: from the last two years or so, but never actually made anything. That I planned to set up a hackerspace in Taipei, but never got beyond asking my friends who would be interested. That I admired and saved so many things on Instructables, but always let them linger on my Next-Action List.
“If not now, when? If not you, who?”
Now just stay tuned as I try to follow through. :)
Since I set up my little Virtual Private Server about two months ago, I keep reading and learning more about its administration. In particular I’m trying to make it more secure, since nobody likes losing data or having their things used behind their back. I know the Internet is a tough place. Most computer users are nicely isolated behind their routers and internal networks; nevertheless, I once had a freshly installed WinXP machine get infected less than 5 minutes after connecting to the Net. (Well, since then I don’t install anything Microsoft, and the first thing I take care of is security, so things are much better.)
Thwarting brute force attacks
One of the first things is securing remote login access to the machine: disabling root login for SSH is always a good idea. But since I’m interested in cleverer methods, I wanted to do something more potent and general. I found this blog post about how to limit brute-force attacks with iptables, so I set out to implement it. The basic idea is that if another computer is trying to connect too many times in short succession, it is likely an attack. Use the firewall to count how many connections are made to the sensitive ports in a given time interval, and if a threshold is passed, ban that host from connecting for a while. I liked it and had to implement it.
The information on the linked page is quite detailed and very useful. Just save the current iptables rules, edit them, and then restore.
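In command form, that is roughly (the file name here is just my own choice):

iptables-save > /root/firewall.rules
# edit /root/firewall.rules with your favourite editor, then:
iptables-restore < /root/firewall.rules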
For remote servers, one thing to be extra careful about is not to block your own SSH connections completely: keep the current connection open, try to make a new connection, and if you can log in, things should be fine.
The only thing I changed compared to the other site is the log level, so I can separate the messages better. In the following line there was originally --log-level 7 (debug); I’m using --log-level 4 (warning): -A ATTACKED -m limit --limit 5/min -j LOG --log-prefix "IPTABLES (Rule ATTACKED): " --log-level 4
Then update the line in /etc/syslog.conf to: kern.warning /var/log/warnings
Of course this might vary somewhat from Linux distro to distro: the above is for my CentOS install with syslog.
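For reference, the gist of the setup is something like the following sketch, using iptables’ recent module (the chain name matches my ATTACKED rule above, but the port and thresholds here are illustrative, not necessarily the linked post’s exact values):

# Track new connections to the SSH port by source address
-A INPUT -p tcp --dport 22 -m state --state NEW -m recent --set --name SSH
# More than 5 new connections within 60 seconds: hand over to ATTACKED
-A INPUT -p tcp --dport 22 -m state --state NEW -m recent --update --seconds 60 --hitcount 5 --name SSH -j ATTACKED
# ATTACKED logs the event (rate-limited) and drops the packet
-A ATTACKED -m limit --limit 5/min -j LOG --log-prefix "IPTABLES (Rule ATTACKED): " --log-level 4
-A ATTACKED -j DROP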
From the logs
Well, I’m not sure if my host was particularly busy or not – I assume it wasn’t, since I don’t rank high in Google, so fewer attackers would find my little “home”. Still, in the last month there’s a nice little collection of IP addresses which triggered that ATTACKED rule of the firewall.
Using Python, I extracted the IP addresses from the logs, ran them through the GeoIP Python API to get their locations, and fed that into the Google Maps Static API to get this picture:
Altogether, in about 1 month, I logged 110 ATTACKED triggers from 47 different hosts. Most of them tried only once; there was one that tried 48 times. According to the GeoIP database, it is from Varna, Bulgaria. Well, if there is one good thing that came out of this, it’s that Varna actually looks quite good and I’d be interested in visiting it. :) Talk about my strange reactions to things…
It seems Europe and China are up to no good. I’m not sure if there are fewer American baddies or they just mostly target Americans. I might investigate the regional differences some time later. This is just for curiosity and fun, though; if I were serious, I would set up a proper honeypot.
Some technical notes on making this picture:
The GeoIP Python API looks like one of the worst-documented pieces of code I’ve ever seen. I found a tutorial that helped me get the results I wanted: cities and locations, not just countries (see the sketch after these notes).
Static maps are quick, dirty and limited. I will try to figure out how to use the Google Maps API for a proper zoomable, scrollable, annotated map. I could imagine making a heat-map of threats, or a better colour-coding of the number of attempts from each IP/city.
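The whole pipeline fits in a short script; here is a minimal sketch of the idea (the log path and prefix match my setup above, while the database filename and marker styling are illustrative; the actual script is linked at the end of the post):

import re
import GeoIP

# Collect the source addresses from the firewall log
ips = set()
for line in open('/var/log/warnings'):
    if 'Rule ATTACKED' in line:
        match = re.search(r'SRC=(\d+\.\d+\.\d+\.\d+)', line)
        if match:
            ips.add(match.group(1))

# Look up each address in the GeoIP city database
gi = GeoIP.open('GeoLiteCity.dat', GeoIP.GEOIP_MEMORY_CACHE)
locations = []
for ip in ips:
    record = gi.record_by_addr(ip)
    if record:
        locations.append((record['latitude'], record['longitude']))

# Build a Static Maps URL with one marker per attacking host
markers = '|'.join('%.2f,%.2f' % loc for loc in locations)
print('http://maps.google.com/maps/api/staticmap?size=640x400'
      '&sensor=false&markers=size:tiny|' + markers)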
Anyway, at least there’s no sign of unauthorized entry so far, since most of these attacks are not sophisticated at all. I wonder whether I’d recognize it if I were ever targeted by a sophisticated attack, but that’s not something to fret over. Just keep the automated backups going and it will all be fine. :D
Update:
The Python script I used to get that map can be found over here.