python Archives - ClickedyClick

Adventures into Code Age with an LLM

Gergely Imreh — Sat, 09 Nov 2024 09:50:20 +0000

It’s a relaxed Saturday afternoon, and I just remembered some nerdy plots I’ve seen online for various projects, depicting “code age” over time: how does your repository change over the months and years, how much code still survives from the beginning till now, etc… Something like this made by the author of curl:

Curl’s code age distribution

It looks interesting and informative. And even though I don’t have codebases that have been around this long, there are plenty of codebases around me that are fast moving, so something like a month (or in some cases week) level cohorts could be interesting.

One way to take this challenge on is to actually sit down and write the code. Another is to take a Large Language Model, say Claude and try to get that to make it. Of course the challenge is different in nature. For this case, let’s put myself in the shoes of someone who says

I am more interested in the results than the process, and want to get to the results quicker.

Let’s see how far we can get with this attitude, and where it breaks down (probably no spoiler: it breaks down very quickly).

Note on the selection of the model: I’ve chosen Claude just because generally I have good experience with it these days, and it can share generated artefacts (like the relevant Python code) which is nice. And it’s a short afternoon. :) Otherwise anything else could work as well, though surely with varying results.

Version 1

Let’s kick it off with a quick prompt.

Prompt: How would you generate a chart from a git repository to show the age of the code? That is when the code was written and how much of it survives over time?

Claude quickly picked it up and made me a Python script, which is nice (that being my day-to-day programming language). I guess that’s generally a good assumption these days if one does data analytics anyways (asking for another language is left for another experiment).

The result is this code. I’ve skimmed it to check that it doesn’t just delete my whole repo or do something completely batshit, but otherwise just saved it in a repo that I have at hand. To make it easier on myself, I added some inline metadata with the dependencies:

# /// script
# dependencies = [
#   "pandas",
#   "matplotlib",
# ]
# ///

and from there I can just run the script with uv.

First it checked too few files (my repository is a mixture of Python and SQL scripts managed by dbt), so had to go in and change those filters, expanding them.

Then the thought struck me to remove the filter altogether (since it already checks only files that are tracked in git, it should be fine), but then it broke on a step where it reads a file as if it was text to find the line counts. I guess there could be a better way of filtering (say “do not read binary files”, if there’s a way to do that), but I just went with catching the problems:

# ....
    for file_path in tracked_files:
        try:
            timestamps = get_file_blame_data(file_path)
            for timestamp in timestamps:
                blame_data[timestamp] += 1
                total_lines += 1
        except UnicodeDecodeError:
            print(f"Error reading file: {file_path}")
            continue
#....

Hence I know that a favicon PNG was causing that UnicodeDecodeError hubbub in earlier runs. Now we are getting somewhere, and we have a graph like this:

Version 1

This is already quite fun to see. There are the sudden accelerations of development, there are the plateaus of me working on other projects, and generally feel like “wow, productive!” (with no facts backing that feeling ). Also pretty good ROI on maybe 15 mins of effort.
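On the “do not read binary files” idea from earlier: one common heuristic is to look for a NUL byte in the first few kilobytes of a file and skip the file if one turns up. The sketch below is an assumption about how such a filter could look, not what the generated script did; `looks_binary` and `sample_size` are made-up names.

```python
def looks_binary(path, sample_size=8192):
    """Guess whether a file is binary by looking for a NUL byte near its start.

    Text files practically never contain NUL bytes, while formats such as PNG
    do within their first few bytes. This is a heuristic, not a guarantee.
    """
    with open(path, "rb") as handle:
        return b"\0" in handle.read(sample_size)
```

With something like this, the loop over tracked files could `continue` early for anything where `looks_binary(file_path)` is true, instead of relying on the `UnicodeDecodeError`.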

Having said that, this is still far from what I wanted.

Version 2

Prompt: Could we change the code to have cohorts of time, that is configurable, say monthly, or yearly cohorts, and colour the chart to see how long each cohort survives?

This came back with another set of code. Adding the metadata, skimming it (it has the filter on the file extensions again, never mind), and running it once more to see the output, I get this:

Version 2

Because of the file extension filter in place, the numbers are obviously not aligning with the above, but it does something. The something is a bit unclear, but it feels like progress, so let’s give it the benefit of the doubt, and just change it once more.
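To make the cohort idea a bit more concrete, here is a minimal sketch of how surviving lines could be bucketed by the month or year they were written. It assumes the input is the text output of `git blame --line-porcelain <file>`, where every line of the file gets an `author-time <unix timestamp>` header; the function names are hypothetical and this is not the code Claude produced.

```python
from collections import Counter
from datetime import datetime, timezone


def parse_author_times(porcelain: str) -> list[int]:
    """Pull the author timestamps out of `git blame --line-porcelain` output.

    Source lines are prefixed with a TAB in that format, so file content that
    happens to contain "author-time" does not match the header check.
    """
    return [
        int(line.split()[1])
        for line in porcelain.splitlines()
        if line.startswith("author-time ")
    ]


def cohort_key(timestamp: int, period: str = "month") -> str:
    """Map a Unix timestamp to a cohort label such as '2024-01' or '2024'."""
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return moment.strftime("%Y" if period == "year" else "%Y-%m")


def surviving_lines_by_cohort(timestamps, period="month") -> Counter:
    """Count how many currently surviving lines belong to each cohort."""
    return Counter(cohort_key(stamp, period) for stamp in timestamps)
```

Running this at several points in the history (say, at the last commit of each month) would give one count per cohort per snapshot, which is the data a stacked, coloured chart needs.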

Version 3

Prompt: Now change this into a cumulative graph, please.

One more time Claude came back with this code. Adding the metadata again, same drill. Running this has failed with errors in numpy, though:

TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Now this needed some debugging. It turns out a column the code is trying to plot is actually numbers as strings rather than numbers as, you know, say floats…

# my "fix"
        df['cumulative_percentage'] = df['cumulative_percentage'].astype(float)
# end

        # Plot cumulative area
        plt.fill_between(df.index, df['cumulative_percentage'],
                        alpha=0.6, color='royalblue',
                        label='Cumulative Code')

It didn’t take too many tries, but it was confusing at first – and why shouldn’t it be, if I didn’t actually read the code, just skimmed it…

The result is then like this:

Version 3

Sort of meh, it feels like it’s not going in the right direction overall.

While debugging the above issues, I first tried to ask Claude about the error (maybe it can fix it itself), but it came back with “Your message exceeds the length limit. …” (for free users, that is). So I kinda stopped here for the time being.

Lessons learned

The first lesson is very much re-learned:

Garbage in, garbage out.

If I cannot express what I really want, it’s very difficult to make it happen. And my prompts were by no means expressing my wishes correctly, so no wonder Claude wasn’t really hitting the mark. Whether or not a human engineer would have fared better, I don’t know. I know, however, that this kind of “tell me exceedingly clearly what’s your idea” is an everyday conversation for me as an engineer (being on both ends of the convo).

The code provided by the model wasn’t really far off for some solution, so that was fun! On the other hand, when it hit any issues, I really had to have domain and language knowledge to fix things. This seems like an interesting place to be:

the results are quick and on the surface good-enough for a non/less technical person, probably
but they would also be the ones who couldn’t do anything if something goes wrong.

Even I feel that it would be hard to support the code as a software engineer if it was just generated like this. But that’s also a strange thought: so many times I have to support (debug, extend, explain, refactor) code that I haven’t had anything to do with before.

It seems to me that since Claude comes across as an eager junior engineer, writing decent code that always needs some adjustments, the trade-off is really in the dimension of spending time to get better at prompting vs better at coding.

If there’s a person with some amount of programming skills, mostly interested in the results not the process, and doubling down on prompting: they likely could get loads further than I did here. Good quality prompts and small amount of code adjustments being the sweet spot for them.

For others who have more programming expertise, and maybe more interested in the process, spending time on getting better at programming rather than getting really better at prompting: keeping to smaller snippets might be the sweet spot, or learning new languages, … Something as a starting point for digging in, a seed, is what this process can help with.

Future

Given the above notes on how this generated code is like a new codebase that I suddenly need to support, here’s a different, fun exercise to actually improve engineering skills:

Take AI generated code that is “good enough” for a small problem and refactor, extend, productionise it.

I’m not sure if this would work, or would get me into wrong habits, but if I wanted to have some quick ways of doing deliberate practice – and not Exercism, LeetCode, or similar, rather something that can be custom made – then this seems a way to get started.

Also, now that I’ve gotten even more interested in the problem, I’ll likely just dig into how to actually define that chart I was looking for and what kind of data I would need to get from git to make it happen. The example code made me pretty confident, that “all I need is Python” really, even though while prepping for this I found other useful tools like one allowing you to write SQL queries for your repo, that might be some further way to expand my understanding.

Either way, it’s just fun to mess with code on a lazy Saturday.

The post Adventures into Code Age with an LLM appeared first on ClickedyClick.

Programming challenge: Protohackers 3

Gergely Imreh — Sat, 24 Sep 2022 09:33:28 +0000

Protohackers is a server programming challenge, where various network protocols are set as a problem. It started not so long ago, and challenge No. 3 was just released yesterday, aiming at creating a simple (“Budget”) multi-user chat server. I thought I’d ~~sacrifice a decent part of my weekend~~ give it an honest try. This is the short story of trying, failing, then getting more knowledge out than I expected.

Definitely wanted to tackle it using Python as that’s my current utility language that I want to know most about. Since the aim of Protohackers, I think, is to go from scratch, I set to use only the standard library. With some poking around documentation I ended up choosing SocketServer as the basis of the work. It seemed suitable, but there was a severe dearth of non-dummy code and deeper explanation. In a couple of hours I did make some progress, though, that already felt exciting:

Figured out (to some extent) the purpose of the server / handler parts in practice
Made things multi-user with data shared across connections
Grokked a bit the lifecycle of the requests, but definitely not fully, especially not how disconnections happen.

Still it was working to some extent, I could make a server that functioned for a certain definition of “functioned”, as the logs attest:

Server logs from trying my Budget Chat Server

On the other hand, ended up in a relative dead-end, as some message ordering issues kicked in, and reliably failed the test here, not knowing much what to try next just yet:

Testing my in-progress solution, and failing.

Since it’s a learning exercise and definitely not a competition on my part, I started to procrastinate. It didn’t take long before I looked at the status of the leaderboard. Funnily enough, the top entries were linking to the repositories where their solutions were!

Shoulders of Giants

Here’s where my surprise and delight started, though. Within the first 7 entries there were 3 with Python implementations that included code! Even better, they actually covered 3 completely different ways of solving the task. Jackpot, really!

The first solution used pure sockets, which is quite versatile if I’d want to go all-in on low-level networking in the future. It had quite a lot of helper code, though which makes it look like a pretty decent effort to duplicate.
The second solution went with SocketServer just like I’ve tried, and that is nice to dig in a bit more, given how small the whole code is. The main thing here was that I should have understood from the problem description this being a Streaming TCP connection case. Looks like streaming is the part that takes care of a lot of details, including the connection/disconnection that plagued me. Bam!
The third solution then used asyncio, to take it in a different direction again. It’s amazing how simple it all is when the relevant components and abstractions are understood.

Which one is the most tempting solution to follow (and/or learn from)? Pure sockets are likely just a fallback option when there’s nothing else. On the SocketServer vs asyncio front however there was some useful StackOverflow discussion, even if a bit dated, coming from 2016. It pointed at the different use of threading and event loops. I guess this would make this answer a bit unsatisfying, but quite realistic: learn both and know when either is applicable for your use case.
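To illustrate the streaming point, here is a minimal sketch of a line-based server built on the standard library’s `socketserver` (the Python 3 spelling of the module). It is not any of the leaderboard solutions, and it only echoes lines back rather than implementing Budget Chat; `LineEchoHandler` and `start_server` are made-up names. The stream handler wraps the socket in file-like `rfile`/`wfile` objects, so iterating over `rfile` ends cleanly when the client disconnects.

```python
import socketserver
import threading


class LineEchoHandler(socketserver.StreamRequestHandler):
    """Handle one client: read newline-terminated messages until it hangs up."""

    def handle(self):
        # rfile is a buffered file object over the TCP stream; the loop ends
        # when the client closes the connection (end of stream).
        for raw_line in self.rfile:
            self.wfile.write(b"echo: " + raw_line)


def start_server(host="127.0.0.1", port=0):
    """Start a threaded server in the background; port 0 picks a free port."""
    server = socketserver.ThreadingTCPServer((host, port), LineEchoHandler)
    server.daemon_threads = True  # do not let client threads block shutdown
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Each connection gets its own thread here, which is the threading model the StackOverflow discussion contrasts with an event loop. Shared chat state would live on the server object (reachable as `self.server` in the handler) and would need a lock.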

What did we learn?

In the end I haven’t finished my code yet. Reading the existing solutions influences me and just adapting what others did and submit would feel like cheating (to myself). The way to resolve this is setting your own goals on top of the original challenge. Here I picked the following, and achieving these would complete things for me:

Use proper project structure and try out PDM
Figure out how to set up the project & code to be testable with pytest (basically grok testing of programs that run servers)

The combination of these focuses on something akin to “going to production”, besides obviously writing the actual code, which is very much relevant to my interests.

So far I haven’t seen many examples of testing SocketServer, though there’s Python’s own test suite that could be a starting place. It has a lot of super useful helper functions (such as finding an unused port to run the server on), but overall seems a lot of boilerplate too. For asyncio I haven’t looked around yet. It being “cooler”, there might be more discussion around it, but it’s by no means a given. Would be interesting to combine this with a Basic Chat client as well.
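For the asyncio side, a test could plausibly look like the sketch below: start the server on port 0 so the operating system picks an unused port, connect a client, and assert on the reply. The handler just upper-cases lines, and `handle_client` and `roundtrip` are hypothetical names; the same shape should fit a pytest test function that calls `asyncio.run`.

```python
import asyncio


async def handle_client(reader, writer):
    """Upper-case every line the client sends until it disconnects."""
    while line := await reader.readline():
        writer.write(line.upper())
        await writer.drain()
    writer.close()
    await writer.wait_closed()


async def roundtrip(message: bytes) -> bytes:
    """Start a throwaway server, send one line to it, and return the reply."""
    server = await asyncio.start_server(handle_client, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]  # the port the OS assigned
    async with server:
        reader, writer = await asyncio.open_connection("127.0.0.1", port)
        writer.write(message)
        await writer.drain()
        reply = await reader.readline()
        writer.close()
        await writer.wait_closed()
    return reply
```

Everything runs in one event loop and one thread, so there is no background server thread to clean up after the test.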

Another impression from today’s effort is that Python modules are documented to very varying levels. Their complexity definitely jumps when I try to go from dummy stuff to anything useful. For example here understanding the proper role and interaction of the Server and Handler parts of this multi-user environment.

I’m also acutely aware that my networking knowledge is very patchy regardless of doing networking-adjacent stuff for decades. It’s a very useful frontier to tackle when I have a chance.

Finally, ngrok is still a very cool tool, it’s nice to be able to sit in a cafe and safely expose a server to the internet.


Creating a Prometheus metrics exporter for a 4G router

Gergely Imreh — Tue, 14 Jun 2022 03:25:24 +0000

Recently I began working fully remotely from home, with the main network connectivity provided by a 4G mobile router. Very soon I experienced patchy connectivity, not the greatest thing when you are on video calls for half of each day. What does one do then (if not just straight replacing the whole setup with a wired ISP, if possible), other than monitor what’s going on and try to debug the issues?

The router I have is a less-common variety, an Alcatel Linkhub HH41 (I can’t even properly link to it on the manufacturer’s site, just on online retail stores). As it should, it at least has a web front-end that one can poke around in and gather metrics from – of course in an automated way.

The Alcatel HH41 LinkHub router that I had at hand to use

Looking at the router’s web interface, and checking the network activity through the browsers’ network monitor (Firefox, Chrome), the frontend’s API calls showed up, so I could collect a few that requested different things like radio receiving metrics, bandwidth usage, uptime, and so on… From here we are off to the races setting up our monitoring infrastructure, along these pretty standard lines:

Set up a Prometheus metrics exporter, pulling data from the router’s internal API (the same way the web interface does it)
Spin up a Prometheus + Grafana interface to actually monitor, alert on, and debug any metrics

Metrics Exporter

Given that I’m mostly working with Python, using the existing Prometheus Python client was an easy choice, in particular using their internal HTTP exporter to get started quickly. It was relatively straightforward to turn many of the metrics into various gauges (radio reception metrics, bandwidth used, &c.), though some were naturally info fields, such as mobile network name and cell ID. This latter would be very useful as my hunch was cell hopping by the router is what’s mainly affecting my network quality.

After some poking around I’ve also realised that the API exposed is just JSON-RPC (although the router’s backend doesn’t seem to implement everything in there, e.g. there’s no batch), which made a lot of things clearer, and potentially easier to use.

In the end, I’ve ended up with one class to do all the metrics gathering from a couple of JSON-RPC methods, working relatively robustly. The authentication was simplified very much: most need an auth token that can be extracted by manually observing some requests (more on this later) and some need a referrer header for the request to pretend to be coming from the admin console.
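For reference, a JSON-RPC 2.0 call is just a small JSON object with `jsonrpc`, `method`, `params` and `id` fields, and the reply carries either `result` or `error`. The sketch below shows that envelope using only the standard library. The method name `GetNetworkInfo` and the reply fields in the usage note are placeholders rather than confirmed router API names, and the auth token and referrer header handling are left out.

```python
import itertools
import json

_request_ids = itertools.count(1)


def build_request(method, params=None):
    """Build the JSON-RPC 2.0 envelope for a single (non-batch) call."""
    return {
        "jsonrpc": "2.0",
        "method": method,
        "params": params or {},
        "id": str(next(_request_ids)),
    }


def parse_response(raw):
    """Return the `result` of a JSON-RPC reply, or raise if it is an error."""
    reply = json.loads(raw)
    if "error" in reply:
        raise RuntimeError(f"JSON-RPC error: {reply['error']}")
    return reply["result"]
```

A call would then be `json.dumps(build_request("GetNetworkInfo"))` sent as the POST body, and `parse_response(response_text)` on whatever comes back.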

The resulting code is on GitHub: imrehg / linkhub_prometheus_exporter, and should be a full-featured server with most (though probably not all) the metrics available in the admin console as well.

Monitoring

With the metrics exporter running, I used a Docker Compose-based Prometheus + Grafana stack locally to have everything together, just adding an extra “linkhub” task in Prometheus to pull the data periodically, and a new dashboard in Grafana to have a quick overview.

Grafana view of some of the reception metrics

I also went a bit overboard and added some extra bits and pieces, like coloured regions for the signal metrics to show what’s bad / acceptable / good / excellent or so, based on some scouting, making it clearer when things are good or not good.

I also tried to use a bit more of Grafana’s tooling (not a lot, but a bit more), so added some different sections for signal quality and network metrics, as well as a running average on some of the noisy metrics.

Lessons learned

Learned a bunch of things as this was the first time I used, from scratch, many of the tools here. The very first one being: how to choose the right Prometheus metrics for various data streams? Now I see what it looks like in practice to plan, from a very early stage, for a metric that needs to be monitored. There are fewer varieties of metrics than I expected, and while there’s a lot of derivative stuff to make them a lot more useful, it’s not the case that everything one imagines can be made to work.

Used Poetry here more than previously, and set up poetry-dynamic-versioning plugin (as a candidate competitor to setuptools-scm). That meant also using poetry plugins and a beta Poetry release at this stage. It’s not bad, but sooo many gotchas in the process, and still have to figure out what would be a good reusable template for projects using these. (including __version__ variables, etc).

Figured out how to do good CI Docker image builds with libraries that rely on git for versioning, fortunately setuptools_scm did the work for us: bind mount of .git in the specific build step. I think in CI/CD all this reliance of repo data being available can still make things a bit trickier, but something’s gotta give, and it’s not much extra compared to the rest of the things.

Learned a bit about JSON-RPC (and how the router might or might not be fully compliant). Not sure if I’d go with that for any future project myself, but good to be aware of it, and potentially looking at its presence in other routers or interfaces’ communications channels.

Chance to use some Python 3.10-based features (match) and hit/fix some GitHub Actions related issues with 3.10: way to go, libraries that convert 3.10 to “3.1” because it’s a number so let’s round it, right? Or actually way to go, libraries/YAML, to allow both ‘x.y’ and x.y forms (quoted and unquoted): the former would have been the correct form all along, but people generally go with the latter to save a few keypresses. It’s subtle, but experience is expecting the subtleties and the reasons for them arising.
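The 3.10 versus 3.1 mix-up can be reproduced without any YAML library, since the core of it is that an unquoted `3.10` gets read as a floating point number. The sketch below only mimics that step with `float`; the function name is made up, and real YAML parsers and CI tooling have their own details on top.

```python
def version_as_unquoted_scalar(version_text: str) -> str:
    """Mimic reading an unquoted version as a number and printing it back."""
    return str(float(version_text))


def version_as_quoted_scalar(version_text: str) -> str:
    """A quoted version stays a string, so nothing is lost."""
    return version_text
```

Here `version_as_unquoted_scalar("3.10")` gives `"3.1"`, while `"3.9"` survives unchanged, which is why the problem only showed up once 3.10 arrived.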

Seen how mypy can actually benefit the code quality: while trying to fix all the reported issues I actually found stuff that was a clear benefit, and it came not from adding all the type hints (that’s good, but baseline), but rather from being smart about where it complains and thinking about what the underlying issue is (e.g. patterns of getting values out of dicts where there might not be a result, exhaustive matching of match and return values of functions, etc…)

The resulting Docker image is north of 1GB due to Python, and that’s not great considering that it doesn’t do that much work. Writing/rewriting this whole thing in Go could be interesting and would be a useful learning experience (or another compiled language, I guess, but Prometheus itself is written in Go, so there’s a connection). One step at a time, projects written in Python are a useful proof-of-concept to compare other stuff against later, so all is well.

Having said that, I’ve seen the best practices listed when writing Prometheus exporters, and given the current environment, I couldn’t apply all the best practices. For example: “Metrics should only be pulled from the application when Prometheus scrapes them, exporters should not perform scrapes based on their own timers.” The official Prometheus Python Exporter on the other hand seems to need to use exactly that sort of “while True” loop to keep getting/storing metrics, instead of running on demand. There might be a more subtle pattern to do on-demand work (which I see to be more correct), but I need to find it.
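The on-demand idea itself can be sketched with only the standard library: fetch the readings inside the HTTP request handler, so the router is queried exactly when Prometheus scrapes. This is a swapped-in illustration of the pattern, not the exporter’s actual code. `read_router_metrics`, the metric name and the port are placeholders, and the Python client library’s documented custom collectors may well be the more idiomatic route to the same behaviour.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def read_router_metrics():
    """Hypothetical stand-in for the JSON-RPC calls made to the router."""
    return {"linkhub_signal_rsrp_dbm": -95.0}


def render_metrics(readings):
    """Render gauge readings in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(readings.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The router is only queried here, i.e. once per scrape.
        body = render_metrics(read_router_metrics()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9877):
    """Serve metrics until interrupted (the port number is an arbitrary choice)."""
    HTTPServer(("127.0.0.1", port), MetricsHandler).serve_forever()
```

There is no timer or `while True` loop fetching data in the background; a slow router simply makes the scrape slow, which Prometheus can then report as a scrape timeout.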

So what have I learned about the actual network issues? Most of the instability seemed to be correlated with switching to specific cell towers (based on cell IDs). Certain cell towers would be pretty stable, but on some days the router was switching between towers, and that’s when most of my online calls were pretty futile.

Finally, I did think a lot about the adage that “something that isn’t worth doing isn’t worth doing well.” On the other hand there’s no kill like overkill, so here we are…

Future development

Compared to other projects I’ve done, this might be lighter maintenance, given that it’s sorta done for the moment (except if others start to use it and need other metrics, for example). Otherwise the Docker-based deployment and poetry.lock’d dependencies make bit-rot a bit slower, hopefully. In the meantime, I’ve switched to a wired connection, so unlikely to need this project much, but it could be that much of this will be repurposed for other monitoring projects.


Taiwan WWII Map Overlays

Gergely Imreh — Mon, 02 Nov 2015 13:47:43 +0000

A while ago I came across the Formosa (Taiwan) City Plans, U.S. Army Map Service, 1944-1945 collection, in the Perry-Castañeda Library Map Collection of the University of Texas in Austin. I’m a sucker for maps, enjoy learning about history a lot, and I have a lot of interest in my current home, Taiwan – so you can call this a magic mix of cool stuff.

There are 26 maps in the collection, made by the US Army by flying over different parts of the island, and mostly, I guess, stitching together aerial photographs. The maps themselves were not that easy to check in an image viewer, since there’s no context, zoom is clumsy, and I have no idea where about half the places should be located. Instead, I thought it would be great to have them as an overlay on top of current maps and satellite imagery on Google Maps.

The result is Taiwan City Maps overlays, which does exactly that. Feel free to click the link and explore right now! In the rest of this post, I try to first show how that page was made, and also some history lessons I gained by making it.

How it was made

The Google Maps Ground Overlay API does exactly what I needed, so can’t claim much (almost any) innovation in the page, though I stumbled on it after a few other “Overlays”, as Google Maps has a couple, all working slightly differently.

Ground Overlay basically takes the following parameters:

the map’s center,
zoom level,
an image,
the image’s N/E/W/S bounds in geographical coordinates.
whether to show satellite or maps view underneath

Even cooler, the opacity of the image can be adjusted, so the overlay and the base layer can be viewed together much easier!

In that list above, the trickiest part is of course figuring out the boundaries of the image – basically the way to overlay them – for multiple reasons. First, all the maps have borders with information, titles, legend, and so on. Here below is the Taipei city plan from 1944. I can’t really tell where the corner of the image should be geographically, if there’s nothing to compare to.

Taipei / Taihoku city plan

Then there’s also the problem that things have changed in the last ~70 years, so a lot of things are different on the maps, if I want to match them up. Lastly, the maps are not perfectly accurate either, they are stitched from multiple photographs of unknown quality.

I’ve been trying to match the maps by hand for a while, and coding some tools to make that process easier – but it didn’t really get easier. I needed to get around all these issues, and have a reasonable, objectively good fit. Objectivity is key, and math helps with that. What I ended up doing is a linear fit of geographical and image coordinates:

look for landmarks that are reasonably the same back then and now
note their geographical coordinates, and their pixel coordinates on the image
note the image size
finally: run a linear fit that gives back the geographical coordinates of the image edges that would best match the landmark data in both sources

In really ugly Python it would be something like this:

import numpy as np

# # For example the contents of "taoyuanfit.csv":
# # Landmark data: image vertical/horizontal (px), latitude/longitude
# 1677,2861,24.991962,121.322634
# 1597,524,24.992282,121.304856
# 2385,1348,24.986873,121.311122
# 764,1342,24.998116,121.311411
datafile = "taoyuanfit.csv"
points = np.loadtxt(datafile, delimiter=",", comments="#")
imgwidth, imgheight = 3975, 3209

# Points on the image, scaled to range 0 to 1
px, py = points[:, 1]/imgwidth, (imgheight-points[:, 0])/imgheight
# Points on the map
gx, gy = points[:, 3], points[:, 2]
# Do the fitting
lx, ly = np.polyfit(px, gx, 1), np.polyfit(py, gy, 1)

# Fitting parameters give map boundaries, results in this case
west = lx[1]          # 121.300983
east = lx[1] + lx[0]  # 121.331141
north = ly[1]+ly[0]   #  25.003429
south = ly[1]         #  24.981204

This is now pretty objective, and the quality of the overlay depends on the quality of the landmarks I find, and the maps that they are on.

The best landmarks I could turn up were notable roundabouts like below…

Matching roundabouts (Tainan, click to go to this section on the map)

… and recognizable streets and intersections, like below…

Matching recognizable streets (Taichung, click to go to this section on the map)

… but not too small and not too large streets. Large streets have fewer recognizable locations (unless some notable mid-sized street is intersecting with it), and small streets are usually not placed too carefully on the map, so can’t really trust their positions.

What I couldn’t trust, though: rivers (they changed a lot), bridges (usually can’t tell if they are the same bridge in both eras, or a new one built close to the old one that’s no longer there, as is common practice), coastlines, lakes, large structures.

The minimal number of landmarks needed for a fit (ideally) is just two, and surprisingly, there were a couple of maps where I could get along with about three. In most cases, though, I needed at least 4-6 landmarks, and there were some I needed to discard for overall better quality of fit. Some maps show pretty good accuracy, while others (e.g. Kaohsiung) are quite poor – compared to today’s satellite pictures, of course. It’s all enjoyable, though.
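One way to decide which landmarks to discard is to look at the residuals of the fit, that is, how far each landmark lands from the fitted line. The sketch below does this for one axis, reusing the Taoyuan longitude data from the snippet above. `fit_axis` is a made-up helper, and using the largest residual as the discard criterion is an assumption rather than the procedure actually used for the site.

```python
import numpy as np


def fit_axis(pixel_fraction, geo):
    """Linear fit of one image axis (scaled to 0..1) against one geo coordinate.

    Returns the geo coordinate at fraction 0, at fraction 1, and the
    per-landmark residuals (observed minus fitted).
    """
    pixel_fraction = np.asarray(pixel_fraction, dtype=float)
    geo = np.asarray(geo, dtype=float)
    slope, intercept = np.polyfit(pixel_fraction, geo, 1)
    residuals = geo - (slope * pixel_fraction + intercept)
    return intercept, intercept + slope, residuals


# Horizontal pixel positions and longitudes of the four Taoyuan landmarks
horizontal_px = np.array([2861, 524, 1348, 1342]) / 3975
longitudes = [121.322634, 121.304856, 121.311122, 121.311411]

west, east, residuals = fit_axis(horizontal_px, longitudes)
worst_landmark = int(np.argmax(np.abs(residuals)))  # candidate to re-check
```

With only two landmarks the residuals are always zero, so at least three are needed before a check like this says anything about fit quality.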

Then added a few more functions to the page, such as changing the opacity, being able to link to the exact location and zoom level I’m looking at, and some page-load animation (the maps are in the range of 4-10MB, not that quick to load).

The source code for the site is on Github, and pull requests are accepted, naturally.

History lessons learned

I was very interested to note the things that have changed, and things that haven’t in the decades passed.

Things that mostly stayed the same

Schools are usually at the same location, even stayed the same type (e.g. girls’ high school staying that in Changhua).

Airports are usually at the same location (duh!), though usually greatly expanded, while the old runways can be matched to the new ones (like in Hsinchu).

Train stations, of course, although tracks might change, or entire stations disappear (like in Beigang or in Hualien).

Military installations, which surprised me, can stay the same, looks like the KMT troops just took over the Japanese posts and continued to use them. Some turned into schools, or could disappear too.

Many roundabouts, outlines of city blocks (like the CKS Memorial Hall now, or Yilan’s city outline), mountains, also stayed mostly the same.

Things that changed a lot

City areas changed enormously (duh!), just look at the capital, Taipei: the west side stayed mostly the same, but about twice or three times as big an area got built up on the east. Most other cities have grown just like that, as is very clear for example for Taitung. It’s no wonder, with 5.8 million people on the island in 1940, and above 23 million by now (according to Wikipedia). This also brings changes to a lot of streets and roads.

Water is definitely very changeable. Rivers got diverted a lot, most notably on the east side of Taipei (near Neihu and Songshan districts) the Keelung river was straightened up, but the roads can still mirror the old path. Other rivers (or channels?) got covered up, and now only live on as roads – though I guess they might still flow under our feet (just like Fleet Street)? Some more dramatic change is an entire missing island in Danshui, I wonder what happened to it? Lakes change too, mostly I guess due to development around them (like in Zuoying). The sea shore is different for the many cities, as probably they were developed and also the shallows were naturally filled by the rivers (like in Hsinchu and in Tainan). Harbors don’t stay the same either (like in Penghu).

Some landmarks such as shrines have changed too. The most notable I could find so far is in Taipei / Yuanshan, where the current Grand Hotel / 圓山大飯店 replaced a shrine in 1952 (yellow rooftop on the left below), while the Taipei Martyrs’ Shrine seems to be built on the site of another previous shrine in 1969 (yellow rooftop and cross-shaped grounds on the right below). On the map there’s another temple-complex looking structure in between them, that’s now the Taipei American Club and the International Community Radio Taipei (ICRT).

Grand Hotel replacing a shrine (yellow rooftop on the left), while shrine becomes a Taipei Martyrs’ Shrine (yellow roof and cross-shaped ground on the right). In the centre, now the Taipei American Club and the ICRT radio station (click to go to this section on the map)

Race courses disappeared too – together with the cavalry, I guess (like in Tainan).

I’m sure there are a bunch of other things too, it will take some time to discover and understand things…

Other notes

The legends and information on the maps were pretty interesting. For example here’s the coverage map for Taipei / Taihoku, showing what aerial maps they stitched together, and when the flights that took the photos were:

Taipei aerial photo coverage

The Internet also taught me a few things, such as this very insightful comment on Reddit that traced back the Chinese / Japanese city names of these maps.

I was also amazed by how detailed the Americans’ information was about the functions of the buildings – down to the type of factory or the type of school each building was. I would not be surprised if the maps were supplemented by some real intelligence from the ground. That seems especially likely as the big cities had so many more such details than the smaller locations (where there might not be more than a few likely military installations noted). I wonder if there’s any good source to learn about this – about American intelligence behind enemy lines in the Pacific theatre of WWII.

To the Future

Now that everything kinda works, I think I’ll spend more time looking at the maps instead of working with them, try to learn more.

Hosting this project on Github is definitely pretty easy and convenient, though I start to see that their speed is not necessarily very good for big assets like the map images. Could add Cloudfront very easily to speed things up, just I don’t necessarily want to pay for that extra speed just yet. :P

To be able to show the loading animation I currently need to use jQuery to preload the image, then rely on browser caching so the image is not downloaded a second time when it is attached to the map object to create the overlay. It works pretty well… except in the Facebook Android app’s built-in browser, which happily ignores caching and downloads the image twice (ouch). It would be awesome to figure out how to learn directly from the Google Maps API when it has finished downloading the image.

Might revisit a couple of the maps to see if I can improve the fit, though really just a couple of them, most are totally fine for their quality. Should also add the Japanese writing of the city names for more authenticity and discoverability.

And in the meantime, keep checking the map, and any comment or feedback is appreciated! A big thanks to the University of Texas Libraries and the Perry-Castañeda Library Map Collection for the public domain maps!

The post Taiwan WWII Map Overlays appeared first on ClickedyClick.

Automating the hell out of it

Gergely Imreh — Mon, 16 Sep 2013 10:03:29 +0000

Even before the 4-Hour Work Week made me more serious about this, I really enjoyed automating tasks that I would otherwise have to remember to do, or that would be troublesome to do by hand. This frees up a lot of time, keeps a bunch of problems away, and it is actually quite fun when the information comes to me instead of me going to it.

Now I have automated checking my bank account and credit card balance, updating the dynamic IP of a server, watching ebook sales numbers, and synchronizing network clocks. Below I summarize some general ideas, then give an intro to each of those scripts.

Tools

Most of my scripts are written in bash, because it’s relatively straightforward to hammer out simple stuff, and it is surprisingly simple to do a lot of things once I have thought enough about a problem. The Advanced Bash-Scripting Guide is always on my reading list, but I usually get to check only the parts that are relevant to the given problem. You can get quite far with a few simple constructs.

The most common parts I seem to come across:

if-then-else constructs: if [ -d "directory" ]; then echo "Found!"; fi
for loops: for f in *.png; do optipng "$f"; done
loading the results of a command into a variable: VAR=$(command)

For most other problems with a little keyword-fu there’s always an answer on StackOverflow or on the web.

Another group of scripts uses Python, when a bit more data manipulation is needed, like web scraping or JSON parsing. Actually, all of the scripts could be rewritten in Python for consistency, and it would probably be simpler too, which is something for the future.
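As a rough illustration of what such a Python rewrite could look like, here is a sketch of the three bash constructs listed above using only the standard library. The function names are made up for this example, and the command to run on each file is passed in by the caller instead of hard-coding optipng.

```python
import subprocess
from pathlib import Path


def path_found(path):
    """if-then-else on a path test, similar to `if [ -e path ]; then ...; fi`."""
    if Path(path).exists():
        return "Found!"
    return "Missing!"


def run_on_each(directory, pattern, command):
    """Loop over matching files, similar to `for f in *.png; do cmd "$f"; done`.

    `command` is a list such as ["optipng"]; each file path is appended to it.
    Returns the names of the processed files in sorted order.
    """
    processed = []
    for f in sorted(Path(directory).glob(pattern)):
        subprocess.run(command + [str(f)], check=True)
        processed.append(f.name)
    return processed


def capture(command):
    """Load the output of a command into a variable, similar to VAR=$(command)."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout.rstrip("\n")
```

Passing commands as lists avoids the quoting problems that bash has with file names containing spaces.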

As a general tip, most of these scripts need tweaking, and all of them are sort of alpha-beta quality code. To facilitate hacking and reduce the heartache of mangled clever code, I keep everything in git repos. I share those repos online, so I have to make sure there are no secrets checked in, ever. It helps to strategically use .gitignore, keep the secrets in separate files, and include an example of how that secrets file should look on the inside.
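A minimal sketch of the separate-secrets-file idea, for the Python scripts: the real file (here hypothetically called secrets.ini) is listed in .gitignore, while a secrets.ini.example with dummy values is committed. The file name, section name, and keys are assumptions for illustration, not what the actual repos use.

```python
import configparser


def load_secrets(path, section="account"):
    """Read credentials from an INI-style file that is kept out of git."""
    # interpolation=None so that '%' characters in passwords are taken literally
    parser = configparser.ConfigParser(interpolation=None)
    # read() silently skips missing files, so check what was actually loaded
    if not parser.read(path, encoding="utf-8"):
        raise FileNotFoundError(f"secrets file not found: {path}")
    return dict(parser[section])
```

The committed example file would then contain just an [account] section with placeholder username and password lines.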

Most of these scripts are run periodically by cron, so it is worth having some basic knowledge about how to schedule it.

Some scripts send me emails under specific circumstances (some after every run, some when new information appears), and for good delivery I have set up postfix to use Gmail as an SMTP relay. This way I’m sure to receive the emails and receive them quickly.

Scripts

These are the scripts I have used most often and for the longest. Still, many of them are under development and I adjust them whenever I learn how to do things better. I list the links to all their repos, where they can be improved.

Banking account balances

My two main bank accounts are queried once a day for the available balance and I’m notified by email. Both accounts needed quite a bit of web scraping (I got them done at two different OpenHack Taipei events). The banks’ websites are pretty awfully organized (iframes within iframes within iframes; no CSS classes or ids), though they don’t have to be good for me, they have to be good for the bank.

Cathay United Bank

The cathaycheck (click for repo) script queries the available balance at Cathay United Bank by logging in with curl, and parsing the final page with Beautiful Soup. The script can be a skeleton for any other website where one has to log in and then navigate over a series of pages to get the information. The required HTML variable names can be extracted with the help of the Inspect Element tools in Chrome.

At the moment the credentials are stored in the crontab command, which is not really ideal. I should rewrite it to use a secrets file, though given that it runs on a server where I’m the only user (and root), for me there’s no practical difference at the moment. I have set it up so that I receive an email at the end of the day with the current balance.

ANZ Taiwan credit card

The anzcheck (click for repo) script queries my spending with the ANZ Taiwan credit card. Again bash for logging in and Beautiful Soup for parsing the final page. It needs a bit more logic for extracting information from a table, because the website’s developers added no classes or ids to the items to make it easier to understand – or for them to style, but that’s not my problem.
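To illustrate the kind of logic needed when a table has no classes or ids, here is a sketch that finds a value by the label in the neighbouring cell. Note that it uses the standard library's html.parser rather than Beautiful Soup (which the actual scripts use), and the labels in the usage below are invented, not the bank's real ones. It also assumes the table of interest is not nested inside another table row.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def find_value(html, label):
    """Return the text of the cell right after the cell equal to `label`."""
    parser = TableRows()
    parser.feed(html)
    for row in parser.rows:
        if label in row:
            idx = row.index(label)
            if idx + 1 < len(row):
                return row[idx + 1]
    return None
```

For example, find_value(page, "Available balance") would return the text of the cell next to a cell containing exactly that (hypothetical) label, or None if no such row exists.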

I just recently updated it so that it extracts the spending items added to my balance on a given day, so I will never be caught by surprise again (hopefully). Since many of my charges go to companies that have Chinese names, I quickly ran into the problem of having to tell Heirloom Mailx (which I use to send emails on my ArchLinux box) that the text I want to mail is plain text, not an attachment. With some hacking, the solution was to pass a few more options to “mail” so it knows that the text is UTF-8. From “sendthatmail.sh” in the repo, the parameters needed are:

-S sendcharsets=utf-8 -S ttycharset=utf-8 -S encoding=8bit
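For comparison, a sketch of how the same intent (plain text body, UTF-8, 8bit transfer encoding) could be expressed if the mail were built in Python with the standard email package. This is not what sendthatmail.sh does; the function name and addresses are placeholders.

```python
from email.message import EmailMessage


def build_report(subject, body, sender, recipient):
    """Build a plain-text email whose body is declared as UTF-8."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = recipient
    # Explicit charset and transfer encoding, mirroring the mailx options above
    msg.set_content(body, charset="utf-8", cte="8bit")
    return msg
```

The resulting message could then be handed to a local relay, for instance with smtplib.SMTP("localhost").send_message(msg), assuming a mail server such as postfix is listening there.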

I could still extract some more information from the bank’s website, though nothing really urgent.

No-IP address updater

At the Taipei Hackerspace we have a handful of servers running, but the residential internet connection provided by Chunghwa Telecom only gives us a dynamic IP address. Applying for a static IP seems to be pretty troublesome, so in the meantime I’m using a script on one of the servers to update the IP address associated with our dynamic tpehack.no-ip.biz address.

The no-ip-bash-updater (click for repo) script is forked originally from elsewhere, but I have rewritten it quite a bit so that it

needs no extra file to store the current IP address, but compares external IP with a DNS query
stores no secrets in the file

It uses a pretty straightforward API call with HTTP authentication, the only real logic in there is to check when that call actually needs to be made.
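The decision logic described above (compare the external IP with what DNS currently returns, and only call the update API when they differ) can be sketched in Python as follows. The real script is bash; here the external-IP lookup and the authenticated update call are passed in as callables, since their details are not given in this post.

```python
import socket


def update_if_changed(hostname, get_external_ip, send_update,
                      resolve=socket.gethostbyname):
    """Call send_update(hostname, ip) only when DNS lags behind the real IP.

    get_external_ip: hypothetical callable returning the current public IP.
    send_update: hypothetical callable performing the authenticated API call.
    Returns True if an update was sent, False if DNS was already correct.
    """
    external_ip = get_external_ip()
    if resolve(hostname) == external_ip:
        return False  # nothing to do, no state file needed
    send_update(hostname, external_ip)
    return True
```

Because the DNS record itself serves as the stored state, no extra file with the last known IP address is needed.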

E-book sales

Recently I have helped a friend to publish an ebook version of How to Start a Business in Taiwan on Leanpub, and of course I want to know when any sales are made (disclaimer: I don’t get a cut of the sales, all goes to the author). The leanpubsales (click for repo) script is written in Python, because using JSON there is easier than it would be with bash. The call otherwise is quite simple: just keep an external file around to check whether the sales number has increased, and if yes then send an email. To send an email conditional on the output of the script, the “ifne” command from moreutils is very useful (meaning: “if input is not empty”).

The query is run periodically, and lovely to receive the results. I will surely set up a script when I get my own book ideas published on Leanpub.
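The "external file plus ifne" pattern can be sketched like this: the script produces output only when the count has gone up, so that piping it through ifne sends a mail only in that case. The function name, state file format (a single integer), and message text are assumptions, not taken from the leanpubsales repo.

```python
from pathlib import Path


def sales_message(current_count, state_file):
    """Return a message if sales increased since the last run, else ''."""
    state = Path(state_file)
    # A missing state file is treated as zero previous sales
    previous = int(state.read_text().strip()) if state.exists() else 0
    if current_count <= previous:
        return ""
    state.write_text(str(current_count))
    return f"New sales: {current_count - previous} (total {current_count})"
```

A caller would print the message only when it is non-empty, so that an unchanged count results in no output at all and therefore no email.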

RTC correction

As a physicist in atomic physics, which is an area of science very much concerned with keeping precise time, I keep all my servers’ clocks synchronized with the network time protocol (NTP) using chrony. One difficulty is that the real-time clock (RTC) of those computers is pretty crappy and drifts away. It wouldn’t be a problem if I never restarted them, but it is a pain if I do: after a restart the clock can be tens of seconds off until the time is synchronized again.

Chrony can sync NTP and the RTC, but it doesn’t do that automatically, I have to trigger it manually. Instead I have written up an rtccorrect (click for repo) script that is run every 2 hours or so (could be done just once a day, actually), and eliminates the drift of the RTC.

Server backup

For backing up data between servers rsync has proven invaluable. I have a couple of scripts that do just that, though those are among my oldest ones and at that time I hadn’t separated out personal information (way too easy to inline every credential, email, login, and all that), so I need to sanitize them first. A couple of ideas about these backup scripts:

sometimes higher transfer speed can be achieved by messing with the ssh algorithms, e.g. passing "-e 'ssh -c arcfour'" to rsync
more often there’s even better performance when there’s an rsync daemon running on the remote computer (though with Raspberry Pi, both cases are still frustratingly slow)
can exclude some files if there is no need to transfer them, e.g. "--filter='- *.part'"
when using rsync not just to transfer but to mirror, the "--delete" (delete at target if it doesn’t exist at origin) and "--archive" options are pretty useful
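Put together, the options from the list above could be assembled as in the following sketch, which builds the rsync argument list in Python (for use with subprocess.run). The function and parameter names are made up for illustration. Also note that recent OpenSSH releases no longer ship the arcfour cipher, so the cipher name should be treated as a placeholder.

```python
def build_rsync_command(source, target, ssh_cipher=None,
                        exclude_patterns=(), mirror=False):
    """Build an rsync command line as a list of arguments."""
    cmd = ["rsync", "--archive"]
    if mirror:
        # delete files at the target that no longer exist at the origin
        cmd.append("--delete")
    if ssh_cipher:
        # as a list element, no extra shell quoting is needed around the ssh command
        cmd += ["-e", f"ssh -c {ssh_cipher}"]
    for pattern in exclude_patterns:
        # "- pattern" is rsync's filter rule syntax for an exclude
        cmd.append(f"--filter=- {pattern}")
    cmd += [source, target]
    return cmd
```

Since the command is a list rather than a shell string, the nested quoting seen in the bash versions disappears.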

For these backups I also use the Dead Man’s Snitch to know when things didn’t work out, e.g. by having a command like this in the cron list, where backup.sh is my script’s name and xxxxxxxx is the snitch ID from my account:

backup.sh && curl -s https://nosnch.in/xxxxxxx > /dev/null

This way I got to know when my backup server was dying all the time because of a bad heatsink, or my host server because of a flaky hosting company….
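The same "check in only on success" idea, as a sketch for scripts driven from Python instead of a cron one-liner. The helper names are hypothetical; the snitch URL would be the one from the cron line above, with the real ID filled in.

```python
import subprocess
import urllib.request


def ping_snitch(url):
    """Fetch the check-in URL and discard the response body."""
    with urllib.request.urlopen(url, timeout=30) as response:
        response.read()


def run_and_check_in(command, check_in):
    """Run `command`; call `check_in()` only if it exited with status 0."""
    result = subprocess.run(command)
    if result.returncode == 0:
        check_in()
        return True
    # A failed backup stays silent, so the missing check-in triggers the alert
    return False
```

The important property is the same as with the && in the shell version: a failing backup never reaches the check-in, and the absence of the ping is what raises the alarm.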

Afterword

I guess there will be just more automation in the future, and maybe many of these scripts can be ported onto a common base so new ones are made much easier. What else do you guys automate?

The post Automating the hell out of it appeared first on ClickedyClick.

Barometric recording of Typhoon Soulik

Gergely Imreh — Sun, 14 Jul 2013 01:47:44 +0000

It all started a few weeks ago with Sparkfun having a “20%-off” day, when I got myself (among other things) a BMP085 barometric pressure sensor. When it arrived, I soldered some pins on it, and set it up with an Arduino Nano to read it out easily.

BMP085 barometric pressure sensor breakout board from Sparkfun

Originally all I wanted is just some laid back pressure recording, so maybe I can use that to predict the weather a bit. “Pressure falls: bad weather comes, pressure rises: things will clear up”. I was recording for about a week, and nothing really noteworthy came out of that.

Then came the news that the year’s first typhoon was on the way to Taiwan, and it was supposed to be a big one. It was obvious that I would try to record the barometric pressure pattern of its passing, but I wanted to make it more interesting and informative, more visual than just a timeseries plot of pressures.

The Japanese Meteorological Agency (JMA) is a good place to watch for information about typhoons. They list path predictions and typhoon properties like strength, wind speeds, and central pressure, and they have satellite imagery. Putting these together, two days before the typhoon arrived, I set up a script to download the satellite imagery as it became available.

The morning before the typhoon arrived

The JMA publishes usually 2 satellite images in an hour for our North Western Quadrant (at :00 and :30), one of them covers the whole area, the other covers just the top 80% or so, leaving a dark band on the bottom. Nevertheless, matching up the pressure reading with the satellite pictures would be a good little project for this time.

Friday came, the government gave the afternoon off, though it turned out no landfall happened until after everyone was supposed to be off anyway, just a bit of on-and-off rain. People stocked up on convenience store food (I now have a good supply of instant noodles:) and water, taped over their glass windows, and took in their plants and BBQ equipment from outside – well, those who had planned ahead.

Around 10pm the big rain arrived, here’s a video of how it looked from my window. I went to sleep later, and got woken up around 3:30am by the rain having changed into pretty darn big wind. Here’s another video of the violent part of the typhoon at that time in the morning, which doesn’t even really do it justice. The houses around here are pretty tall, and I wonder if they protected us from the wind, or acted as artificial canyons channeling it. Some things got broken, though not as much as I expected – which is a very good thing.

In the meantime, by the power of the Internet, I checked how the pressure reading was going a few miles away in the Taipei Hackerspace, where I had left the barometric pressure sensor (the geolocation is 25.052993,121.516981)

Here’s the entire recording of the approximately 2 days of typhoon. It was pretty okay weather at the start and end of the plot.

Pressure reading during the passing of Typhoon Soulik, recorded at the Taipei Hackerspace

The readings have been corrected to sea level (from about 20m height, where the Taipei Hackerspace is), should be good within 1hPa or less.
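A sketch of such a sea-level correction, assuming the international barometric formula in the form commonly quoted for the BMP085 (the post does not say which formula was actually used for the plot). The function name is made up for this example.

```python
def sea_level_pressure(pressure_hpa, altitude_m):
    """Convert a station pressure reading to the equivalent sea-level pressure.

    Uses p0 = p / (1 - h / 44330) ** 5.255, with h in metres.
    """
    return pressure_hpa / (1.0 - altitude_m / 44330.0) ** 5.255
```

At 20m the correction comes out at a bit over 2hPa, so an error of a few metres in the assumed height changes the result by well under 1hPa.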

The pressure was indeed dropping like a rock, and the dip on the graph coincided with the most violent wind that woke me up. According to the JMA, the central area of the typhoon had pressures down to 950hPa, which means that the core must have passed pretty close to here, given the readings below 958hPa, though probably not directly overhead, as the pressure didn’t stay down there for long.

I made a video syncing up the pressure reading and the satellite picture. The red dot on the video marks the recording location. (Watching it in full screen and HD makes it clearer.)
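Syncing the two data sets boils down to finding, for each satellite image time, the pressure reading closest to it. A sketch of that lookup, assuming the readings are stored as a sorted, non-empty list of timestamps (in seconds) with a parallel list of values; the actual scripts in the gist may do this differently.

```python
from bisect import bisect_left


def nearest_reading(timestamps, values, when):
    """Return the value whose timestamp is closest to `when`.

    `timestamps` must be sorted ascending and non-empty; ties go to the
    earlier reading.
    """
    i = bisect_left(timestamps, when)
    if i == 0:
        return values[0]
    if i == len(timestamps):
        return values[-1]
    before, after = timestamps[i - 1], timestamps[i]
    if after - when < when - before:
        return values[i]
    return values[i - 1]
```

With images arriving every half hour and pressure logged much more often, each movie frame can get its reading from a binary search instead of a scan through the whole recording.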

I wonder what the flat part in the readings was while the typhoon was leaving. Maybe a sign of it changing direction, by the look of it.

Either way, this was fun to do, and I am glad that only a few people got hurt here, much fewer than even during the less powerful typhoons. Maybe getting people scared a little (like with this “super typhoon” stuff that went on) helps them keep safe? Just don’t use it too often.

Extra material

I put almost all the material used here into a gist: the satellite imagery download script, the plotting, the movie frame generation, the movie generation script, and the complete barometric recording. Because this last part is pretty big (5Mb), Github truncated the rest of the scripts. I guess it’s still okay to check it out. I will add the Arduino sketch to read the sensor and the logging script later.

The satellite imagery weighs about 60Mb, so I didn’t put it online, but if anyone wants it, let me know.

Keep safe!

The post Barometric recording of Typhoon Soulik appeared first on ClickedyClick.

Laboratory 2.0 – a monitoring system

Gergely Imreh — Sun, 28 Oct 2012 14:03:15 +0000

It looks like one of my specialties as a physicist, and my contribution to the labs where I have worked so far, is bringing different kinds of programming techniques and technologies to the table. I’m not saying I’m any better than many of the professors, post-docs, and students I’ve met so far (there are plenty of ingenious ones), it’s more that I experiment with different tools, have tried more of the cutting edge or recent technologies, did some web programming and can whip up something quick – that might not work very well at first, but does broaden the horizon for the rest of the people.

Also, I’m a lazy person, so I want to automate as much as possible. That was on my mind recently when we were preparing to do a vacuum-system bake-out. It’s essentially a procedure where a delicate experimental system, mostly made up of steel, glass, and stuff like that, is closed off from the atmosphere, all the air is pumped out, and then it is heated up to high temperature (~150-300°C). One has to be careful, because things can break: there are temperature limitations for some materials, and also limits on how quickly that temperature can change, requiring careful monitoring of the status of the system. And the whole thing takes something like two weeks or more. Perfect setting for automation.

Set up the electronics

The pressure measurements are done by some other, expensive equipment, so I didn’t have to bother with that yet, and set to work first on the temperature monitoring. Before, it was a bunch of thermocouples and multimeters, requiring manual intervention and lots of labour. Instead, I got some inspiration from Adafruit’s Thermocouple Breakout Board, using the MAX31855 chip, and also from the Thermocouple Multiplexer Shield. The Adafruit board can handle only one channel, but another chip can be used together with it to switch between the different thermocouples, so they can be read out one by one. The multiplexer shield, in turn, used an older measurement chip that I could not buy anymore. In the end, I found a good analog multiplexer that is sold in the computer market here in Taipei, the CD4067B, and it works pretty well.

Breadboard setup for temperature monitoring with Arduino

Of course, setting it all up was quite a bit of fun times, as there were way too many gotchas along the way.

MAX31855 is a surface-mount component, and I hadn’t worked with those before. Not too bad, and can be much neater, it just takes some practice
MAX31855 is a 3.3V circuit, so the CMOS voltage levels used by my Arduino Mega ADK had to be level shifted
Unlike the older chip, MAX31855 really needs differential input, and it’s much more sensitive to the environment. This required different kind of analog multiplexer than that board had
The Arduino Mega is a new model for me, and had some strange behaviour in terms of the serial communication
Surprisingly there are not too many options for 3.3V voltage regulators over here, just the LM1117, which is different from what others are using elsewhere
Lots of noise and stability issues until I figured out how things should be set up. For example, under no circumstance should the thermocouple touch conducting surfaces, and ground loops should be avoided
While MAX31855 says it’s “cold-point compensated”, meaning that it accounts for the chip’s local temperature when measuring the thermocouple, it doesn’t appear to be completely compensated, meaning that we can see unexpected measurement changes when the chip is heating up, for example by being in a closed box.
Figuring out the right amount of time to wait between switching channels (375ms seems to be good enough, 500ms is totally fine)
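Since the chip reports both the thermocouple temperature and its own local temperature, logging the two together makes compensation problems like the one above easier to spot. Below is a sketch of decoding one raw 32-bit MAX31855 frame, following the bit layout in the chip's datasheet (14-bit signed thermocouple value in 0.25°C steps, 12-bit signed internal value in 0.0625°C steps, fault flag in bit 16). It assumes the raw frame is forwarded to the computer as an integer, which is not necessarily how the actual Arduino sketch reports its readings.

```python
def _sign_extend(value, bits):
    """Interpret the lowest `bits` bits of `value` as a two's complement number."""
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


def decode_max31855(frame):
    """Decode a raw 32-bit frame into (thermocouple_C, internal_C, fault)."""
    fault = bool(frame & 0x00010000)                       # bit 16: any fault
    tc_raw = _sign_extend((frame >> 18) & 0x3FFF, 14)      # bits 31..18
    internal_raw = _sign_extend((frame >> 4) & 0x0FFF, 12) # bits 15..4
    return tc_raw * 0.25, internal_raw * 0.0625, fault
```

Plotting the internal temperature next to each channel would show whether a drifting reading follows the chip warming up inside its box.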

In the end, though, we did have a nice 16 channel thermocouple multiplexer, sending off the measurements onto an LCD screen and to the computer over a USB cable.

Temperature monitoring board in its lab setting with 16 thermocouple channels

This is then saved in a database, and can be accessed from elsewhere.

Visualize!

The thing that my co-workers were most amazed by wasn’t the electronics. Sure, they hadn’t worked with Arduinos, but they had done similar stuff. Instead they liked the monitoring interface much more, which is the one in the picture right here (can click to enlarge)

Bakeout Monitor interface (click image for full view)

It’s the schematic layout of our equipment, with the temperatures positioned where the actual sensors are. The change of the measured values over time is also displayed, with live scrolling.

I’m not saying it’s great. Thinking about it, the major insight that made it good for the rest of the people is that I realized how much more people understand visual data: the placement of the values to the corresponding locations on the schematics. That’s the only thing.

So inside it’s a MongoDB database (learned from previous mistakes, using a replica-set at least), with Python scripts talking to the sensors and saving the data, NodeJS / Smoothie Charts for visualization (and plain old CSS positioning of tags for the reading display), and nginx’s upstream module for running two monitoring servers just in case. It’s mostly in the Github repo of the monitoring code, as well as the Arduino sketch for talking to the electronics.

It was actually quite fun to write it all, with the gradual improvements, trying the new tech, and trying not to lose too much data, and I’m amazed how well it works. I especially had a good time learning about the database, scaling, fault tolerance, performance…

Of course there could be room for a lot more improvements.

My failover-restart bash scripts are awful, though they do seem to work more or less and counteract the USB unreliabilities
There were some changes to Smoothie Charts that I could improve on: logarithmic plotting, some display enhancements, wonder if it can be more optimized for performance
More efficient data loading. 12h data is about 30Mb in JSON format, that I send compressed, apparently it gets down to ~5% in size, but it still takes quite a bit of time to process on the frontend
The layout now can be changed from config files if the sensors change, so co-workers can do that without programming knowledge. I wonder if that can be simplified even more
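On the data loading point, the effect of compression on such repetitive JSON can be checked with a few lines of Python. This is only a sketch with a made-up record layout; the ratio for the real monitoring data depends on its actual structure and on the compression used by the server.

```python
import gzip
import json


def compressed_fraction(records):
    """Return gzip-compressed size divided by raw size for JSON-encoded records."""
    raw = json.dumps(records).encode("utf-8")
    packed = gzip.compress(raw)
    return len(packed) / len(raw)
```

Compression only helps with the transfer, though: the frontend still has to parse the full uncompressed JSON, which matches the observation that processing time remains the bottleneck.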

Of course, I’m a person who generally overengineers stuff, so maybe it’s good to stop somewhere. And the somewhere might be when I got to the point to use my Kindle for monitoring (craps out on 1h data already, but some real time things are good enough).

Bakeout Monitor running on Kindle 3, not perfect but does work

Get on with it

I did learn a lot along the way, and I’m sure that with this experience I will be allowed to do a little bit more in the lab in terms of programming ideas. I don’t like that the rest of the system is currently forced to be LabView, but that’s for another post, and there are so many things that can be improved in general as well. Let’s just go and do that.

The post Laboratory 2.0 – a monitoring system appeared first on ClickedyClick.