{"id":2691,"date":"2022-08-04T04:45:58","date_gmt":"2022-08-04T03:45:58","guid":{"rendered":"https:\/\/gergely.imreh.net\/blog\/?p=2691"},"modified":"2022-08-04T04:52:37","modified_gmt":"2022-08-04T03:52:37","slug":"a-personal-finance-data-pipeline-project","status":"publish","type":"post","link":"https:\/\/gergely.imreh.net\/blog\/2022\/08\/a-personal-finance-data-pipeline-project\/","title":{"rendered":"A personal finance data pipeline project"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">I recently received a (family) project brief. In Taiwan many credit\/debit cards have various promotions and deals, and many of them depend on one&#8217;s monthly spending, for example &#8220;below X <a href=\"https:\/\/en.wikipedia.org\/wiki\/New_Taiwan_dollar\">NTD<\/a> spending each month, get Y% cashback&#8221;. People also have a lot of different cards, so playing these off each other can be nice pocket change, but one has to keep an eye on where one&#8217;s spending is compared to the max limit (X). So the project comes from here: easy\/easier tracking of where one specific card&#8217;s spending is within the monthly period. That doesn&#8217;t sound too difficult, right? Except the options for this are:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>A banking website with CAPTCHAs and no programmatic access<\/li><li>An email received each day with a password-protected PDF containing the previous day&#8217;s transactions in a table<\/li><\/ol>\n\n\n\n<p class=\"wp-block-paragraph\">Neither of these is fully appetizing to tackle, and both are similar to bits that I do at #dayjob, but option 2 was a bit closer to what I&#8217;ve been doing recently, so that&#8217;s where I landed.
That is:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Forward the received email (the email provider does it)<\/li><li>Receive it in some compute environment<\/li><li>Decrypt the PDF<\/li><li>Extract the transaction data table<\/li><li>Clean and process the tabular data<\/li><li>Put the raw data in some data warehouse<\/li><li>Transform data to get the right aggregation<\/li><li>&#8230;<\/li><li>Literally profit?<\/li><\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">I was surprised how quickly this actually worked out in the end (if &#8220;half a weekend&#8221; is quick), and indeed this can be a first piece of a &#8220;personal finance data warehouse&#8221;.<\/p>\n\n\n\n<!--more-->\n\n\n\n<h2 class=\"wp-block-heading\">Technical implementation<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">I wanted to have the final setup run in &#8220;The Cloud&#8221;, as that&#8217;s one less thing to worry about. The most obvious arrangement, based on past experience, was combining <a href=\"https:\/\/docs.aws.amazon.com\/ses\/latest\/dg\/Welcome.html\">AWS Simple Email Service<\/a> (SES) to receive an email, and a <a href=\"https:\/\/aws.amazon.com\/lambda\/\">Lambda<\/a> to run serverless processing. On the data warehouse side, however, the really obvious choice is <a href=\"https:\/\/cloud.google.com\/bigquery\/\">GCP&#8217;s BigQuery<\/a>, so I looked into what would be a similar arrangement for the processing pieces if I wanted to put everything into a single cloud provider.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">After some docs diving the most natural arrangement on GCP seemed to be quite different: an <a href=\"https:\/\/cloud.google.com\/appengine\/\">App Engine<\/a> deployment with <a href=\"https:\/\/cloud.google.com\/appengine\/docs\/standard\/python3\/services\/mail\">Mail API<\/a> enabled. This gives a receiving domain name (@[Cloud-Project-ID].appspotmail.com), and every email sent there is just passed to the server that is running in App Engine. This seemed pretty simple!
App Engine also has a free tier, though that comes with pretty small memory limits, which feature in this story too.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The final result of the server part is <a href=\"https:\/\/github.com\/imrehg\/finance-extract\/\">shared on GitHub<\/a>, and should be easy to reuse or extend.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">PDF processing<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Getting the attachments out of the email was pretty straightforward with the Mail API, so the first heavier task was opening the encrypted PDF and getting the table out of it. Opening PDFs is quite a common task, but the table extraction was a bit of a journey.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">False try<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">First I searched around (as anyone else does) for someone else&#8217;s rundown of the options, <a href=\"https:\/\/www.geeksforgeeks.org\/how-to-extract-pdf-tables-in-python\/\">as an example<\/a>. From there I homed in on <a href=\"https:\/\/pikepdf.readthedocs.io\/en\/latest\/\">pikepdf<\/a> to open the password-protected files, and <a href=\"https:\/\/tabula-py.readthedocs.io\/en\/latest\/\">tabula-py<\/a>, which seemed handy to extract tables right into <a href=\"https:\/\/pandas.pydata.org\/\">Pandas<\/a> DataFrames. One subtlety was that tabula-py is just a wrapper around <a href=\"https:\/\/github.com\/tabulapdf\/tabula-java\">tabula-java<\/a> to do the extraction, and needs a Java environment installed. The free tier of App Engine uses their <a href=\"https:\/\/cloud.google.com\/appengine\/docs\/standard\/\">standard<\/a> environment where all I have is my code and &#8220;requirements.txt&#8221; to install my Python dependencies, so it&#8217;s not obvious how I would get Java into the deployment correctly.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Enter <a href=\"https:\/\/github.com\/jyksnw\/install-jdk\">install-jdk<\/a>, which can install the Java environment at runtime.
That was a sufficiently crazy hack to actually work, and <a href=\"https:\/\/github.com\/imrehg\/finance-extract\/blob\/19fa754b72b39aac246bf385e2ebeff7cf35b1ad\/main.py#L14-L28\">it did work<\/a>. Or so it seemed, since the data was processed and showing up in BigQuery when I sent test emails into the system.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Upon closer inspection, though, there were loads of duplicate lines. Between signing off in the evening and checking it in the morning, I had bunches of them, and more were still coming in&#8230;<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"609\" height=\"307\" src=\"https:\/\/gergely.imreh.net\/blog\/wp-content\/uploads\/2022\/08\/duplicated-data-extraction.png\" alt=\"BigQuery view of duplicated data\" class=\"wp-image-2705\" srcset=\"https:\/\/gergely.imreh.net\/blog\/wp-content\/uploads\/2022\/08\/duplicated-data-extraction.png 609w, https:\/\/gergely.imreh.net\/blog\/wp-content\/uploads\/2022\/08\/duplicated-data-extraction-500x252.png 500w\" sizes=\"auto, (max-width: 609px) 100vw, 609px\" \/><figcaption>Sometimes duplicated data sneaks in from software issues<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">I should have checked the logs earlier, because once I dug in, there were bunches of &#8220;server errors&#8221; listed that didn&#8217;t come from any programming errors that I might have made, but rather from (here comes the epiphany) instances being killed for blowing their memory budget (256MB for the free tier).
Thus what happened is:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>The Java run of tabula was just using too much memory while processing the PDFs<\/li><li>It finished processing and loaded the data, but that takes a bit of time<\/li><li>GCP catches up and kills the instance while that is still going on, and reports to the Mail API that the email <em>hasn&#8217;t been<\/em> properly handled (server error during that process)<\/li><li>Whatever is handling the incoming email queue in GCP just keeps the email and retries later<\/li><li>The cycle repeats&#8230;<\/li><\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">This didn&#8217;t seem very helpful and the repeat emails were piling up in whatever (opaque, to me) system GCP has, so I needed to quickly replace tabula with something lighter&#8230;<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Worse is better and actually good<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Going down the list of recommended libraries, next I looked at <a href=\"https:\/\/camelot-py.readthedocs.io\/en\/master\/\">camelot-py<\/a>, which looks great, but needs <a href=\"https:\/\/opencv.org\/\">OpenCV<\/a> on the machine to do its work, so back to the &#8220;how to install OS packages on standard App Engine?&#8221; question. For some extra inspiration I was looking at <a href=\"https:\/\/github.com\/camelot-dev\/camelot\/wiki\/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools\">camelot&#8217;s comparison with other similar tools page<\/a> and it was a bit disappointing (though not surprising) that pretty much every other library is &#8220;worse&#8221; on various PDFs compared to camelot.
Just for kicks I did try some out, and <a href=\"https:\/\/pypi.org\/project\/pdfplumber\/\">pdfplumber<\/a> actually delivered:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>It does actually work on the example PDFs I had from previous bank emails<\/li><li>Nothing else is needed besides pip install<\/li><li>It can actually handle decrypting the PDF as well, so helper libraries can be dropped<\/li><li>The extracted data comes as plain Python lists of rows, but it&#8217;s just an extra line to get DataFrames, so no sweat<\/li><li>The extracted data was actually better quality than tabula&#8217;s, so I had to do fewer cleanup steps!<\/li><\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">This was a pure win, and indeed it&#8217;s worth looking for stuff that works with the data at hand, not ignoring the edge cases, but also not overly emphasizing being able to do &#8220;everything&#8221; when there&#8217;s a clear target of what &#8220;thing&#8221; needs to work. (Potential technical debt considered too.)<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Data transformations and visibility<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Now the data sits in BigQuery properly:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"667\" height=\"330\" src=\"https:\/\/gergely.imreh.net\/blog\/wp-content\/uploads\/2022\/08\/bigquery-finance-data.png\" alt=\"BigQuery financial data table\" class=\"wp-image-2708\" srcset=\"https:\/\/gergely.imreh.net\/blog\/wp-content\/uploads\/2022\/08\/bigquery-finance-data.png 667w, https:\/\/gergely.imreh.net\/blog\/wp-content\/uploads\/2022\/08\/bigquery-finance-data-500x247.png 500w\" sizes=\"auto, (max-width: 667px) 100vw, 667px\" \/><figcaption>Actual data in the works.<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Getting the raw transaction data loaded into BigQuery was the first step, but I still need to answer the question: in this billing period, how much have I spent?<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Not
being a data analyst (or not yet?:), this took a bit of figuring out. As other novices share their bit of &#8220;clever code&#8221; when it&#8217;s actually trivial to the experts, I&#8217;m sharing here the bit of SQL queries in a similar &#8220;that was fun to figure out, wasn&#8217;t it?&#8221; way. I&#8217;m sure it can be much improved, but it&#8217;s a good reminder for myself as well.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Given that my billing period starts on the 23rd of the month, get the aggregated value of transactions for each billing period: <\/p>\n\n\n\n<pre title=\"monthly-aggregation.sql\" class=\"wp-block-code has-small-font-size\"><code lang=\"sql\" class=\"language-sql\">WITH\n  Aggregated AS (\n  SELECT\n    DATE(TransactionDate, 'Asia\/Taipei') AS day,\n    TransactionAmountNTD\n  FROM\n    `personal-data-warehouse.finance.huanan` ),\n  calendar AS (\n  SELECT\n    day,\n    -- Find the last day before the new interval\n    DATE_SUB(\n      DATE_ADD(\n        day,\n        INTERVAL 1 Month),\n      INTERVAL 1 DAY\n    ) AS endday\n  FROM\n    UNNEST (\n      GENERATE_DATE_ARRAY(\n        -- Start date in the past before any data,\n        -- on the right day of the month for\n        -- the billing cycle.\n        '2022-05-23', \n        CURRENT_DATE('Asia\/Taipei'),\n        INTERVAL 1 Month\n      )\n    ) AS day\n) SELECT\n  SUM(TransactionAmountNTD) AS `MonthlyTransactions`,\n  COUNT(*) AS `TransactionCount`,\n  EXTRACT(Year FROM c.day) AS `Year`,\n  EXTRACT(Month FROM c.day) AS `Month`,\n  FORMAT('%d-%02d', EXTRACT(Year\n    FROM\n      c.day), EXTRACT(Month\n    FROM\n      c.day)\n  ) AS `Interval`\nFROM\n  calendar AS c\nJOIN\n  Aggregated AS a\nON\n  a.day BETWEEN c.day AND c.endday\nGROUP BY\n  c.day<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Good stuff on the date array and joining with a &#8220;between&#8221; statement, those are the main TIL. 
They also already came up at #dayjob, which was very satisfying.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">From here I surface the data in a connected <a href=\"https:\/\/cloud.google.com\/bigquery\/docs\/connected-sheets\">Google Sheet<\/a>, which is pretty practical, though it leaves out the &#8220;being notified when I approach\/reach X&#8221; part, but that&#8217;s fine for now.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"874\" height=\"445\" src=\"https:\/\/gergely.imreh.net\/blog\/wp-content\/uploads\/2022\/08\/google-sheet-connected-to-bigquery.png\" alt=\"\" class=\"wp-image-2709\" srcset=\"https:\/\/gergely.imreh.net\/blog\/wp-content\/uploads\/2022\/08\/google-sheet-connected-to-bigquery.png 874w, https:\/\/gergely.imreh.net\/blog\/wp-content\/uploads\/2022\/08\/google-sheet-connected-to-bigquery-500x255.png 500w, https:\/\/gergely.imreh.net\/blog\/wp-content\/uploads\/2022\/08\/google-sheet-connected-to-bigquery-768x391.png 768w\" sizes=\"auto, (max-width: 874px) 100vw, 874px\" \/><figcaption>Connected tables view in Sheets<\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Testing and getting to &#8220;production&#8221;<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">One good thing about personal projects is that I can make them as &#8220;good&#8221; as I want to (or as &#8220;bad&#8221;, of course), which usually results in an unhealthy amount of tweaking, trying out various best practices to see if they work, and so on. Here I really wanted to get the system well tested, for example, which turned out to take <em>loads more time<\/em> than actually writing the original service.
Actually, there&#8217;s nothing surprising about that for software engineering professionals, but it can still catch people off-guard.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Here the tricky parts came from two areas: <a href=\"https:\/\/fastapi.tiangolo.com\/\">FastAPI<\/a> settings and cloud service integrations.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The former is always a bit of an issue, depending on how the code uses the settings (whether things can be patched well at testing time), but here I also used a trick for the server to pull the PDF decryption key from <a href=\"https:\/\/cloud.google.com\/secret-manager\">Secret Manager<\/a>, so I don&#8217;t have to deploy environment files, nor keep settings like that in version control, etc&#8230; But this meant a trickier flow of getting the FastAPI testing client up in a way that it worked without talking to the cloud backends (and stalling, and failing&#8230;). Nothing that some good mocking cannot solve (says the person with hindsight).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For the cloud services part it meant mocking BigQuery connections, so that the test can actually pretend to &#8220;receive&#8221; an email, all the way to looking at the &#8220;database&#8221; and seeing the right information there. Under the hood I&#8217;m using <a href=\"https:\/\/googleapis.dev\/python\/pandas-gbq\/latest\/index.html\">pandas-gbq<\/a>, and thus it was interesting to look <a href=\"https:\/\/github.com\/googleapis\/python-bigquery-pandas\/blob\/685d1c39f709a58a9bf59fb1cec9474d3e3c03c0\/tests\/unit\/conftest.py\">at their tests<\/a>, borrowing some of them. It took a bit more time, but that&#8217;s working pretty well now. I still need to do some extra bits and pieces to cover more of the workflow, but I&#8217;m already more confident about things working.
Also, all this will be very useful on other projects that are interacting with BigQuery in any way (not just through Pandas).<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"618\" height=\"404\" src=\"https:\/\/gergely.imreh.net\/blog\/wp-content\/uploads\/2022\/08\/test-run-goodness.png\" alt=\"\" class=\"wp-image-2707\" srcset=\"https:\/\/gergely.imreh.net\/blog\/wp-content\/uploads\/2022\/08\/test-run-goodness.png 618w, https:\/\/gergely.imreh.net\/blog\/wp-content\/uploads\/2022\/08\/test-run-goodness-500x327.png 500w\" sizes=\"auto, (max-width: 618px) 100vw, 618px\" \/><figcaption>A test run that&#8217;s nice<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Big evergreen lesson on testing: <strong>you have to write your code to be testable<\/strong>. Lots of code out there is not only untested, but also extremely difficult to actually test. This is worth remembering in every project. Also, <strong>test writing never really stops<\/strong>: there&#8217;s always more to test for. Beyond that, one can always try more advanced testing, such as automated test case generation (e.g. with <a href=\"https:\/\/hypothesis.readthedocs.io\/en\/latest\/\">hypothesis<\/a>), and fuzz testing (e.g. with <a href=\"https:\/\/gitlab.com\/gitlab-org\/security-products\/analyzers\/fuzzers\/pythonfuzz\">pythonfuzz<\/a>). That&#8217;s the next frontier, right after I&#8217;ve implemented the currently skipped tests. And finally, remember that code coverage is not case coverage, so the goal should be maximizing the latter, while the former is just a potential proxy for it.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Future outlook<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">It would be nice to take this idea of financial data analysis further and add some actual dashboard (say, deploying <a href=\"https:\/\/superset.apache.org\/\">Superset<\/a> somewhere, which is excellent for this).
It would help to get more information into the system as well, though; currently it&#8217;s very sparse. That would mean adding other financial sources, maybe finding an API in the end, or doing a bit of &#8220;pragmatic execution&#8221; with a CAPTCHA bypass (I quickly checked that my credit card provider&#8217;s CAPTCHA is completely readable by <a href=\"https:\/\/tesseract-ocr.github.io\/tessdoc\/Home.html\">Tesseract<\/a>, for example, so I <em>could<\/em> likely scrape things there if I really wanted).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I&#8217;m not holding my breath for having something like the UK&#8217;s <a href=\"https:\/\/www.openbanking.org.uk\/about-us\/\">Open Banking<\/a> here, which enables apps like <a href=\"https:\/\/emma-app.com\/\">Emma<\/a>, so that all this is accessible for people who don&#8217;t want to code. But where&#8217;s the fun in that (for me)? :) (In fact there&#8217;s a lot of fun in open access APIs, so this would be the real way of doing it&#8230;)<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Finally, it&#8217;s good to remember how easy it is to corrupt &#8220;production&#8221; data sets, but also that with the right tools (like snapshots), some of that pressure can be eased.
There are always bugs, the question is how to mitigate their effect.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>If your bank updates you through password-protected PDFs in emails and you are a programmer, you make some finance data extract lemonade.<\/p>\n","protected":false},"author":1,"featured_media":2702,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[8,9],"tags":[228,226,227],"class_list":["post-2691","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-prog","category-tw","tag-automation","tag-bigquery","tag-gcp"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.9 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>A personal finance data pipeline project - ClickedyClick<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/gergely.imreh.net\/blog\/2022\/08\/a-personal-finance-data-pipeline-project\/\" \/>\n<meta property=\"og:locale\" content=\"en_GB\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"A personal finance data pipeline project - ClickedyClick\" \/>\n<meta property=\"og:description\" content=\"If your bank updates you through password-protected PDFs in emails and you are a programmer, you make some finance data extract lemonade.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/gergely.imreh.net\/blog\/2022\/08\/a-personal-finance-data-pipeline-project\/\" \/>\n<meta property=\"og:site_name\" content=\"ClickedyClick\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/gergely.imreh\" \/>\n<meta property=\"article:author\" content=\"https:\/\/www.facebook.com\/gergely.imreh\" \/>\n<meta property=\"article:published_time\" content=\"2022-08-04T03:45:58+00:00\" \/>\n<meta 
property=\"article:modified_time\" content=\"2022-08-04T03:52:37+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/gergely.imreh.net\/blog\/wp-content\/uploads\/2022\/08\/credit-card-statement.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1158\" \/>\n\t<meta property=\"og:image:height\" content=\"333\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Gergely Imreh\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@imrehg\" \/>\n<meta name=\"twitter:site\" content=\"@imrehg\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Gergely Imreh\" \/>\n\t<meta name=\"twitter:label2\" content=\"Estimated reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"10 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/gergely.imreh.net\\\/blog\\\/2022\\\/08\\\/a-personal-finance-data-pipeline-project\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/gergely.imreh.net\\\/blog\\\/2022\\\/08\\\/a-personal-finance-data-pipeline-project\\\/\"},\"author\":{\"name\":\"Gergely Imreh\",\"@id\":\"https:\\\/\\\/gergely.imreh.net\\\/blog\\\/#\\\/schema\\\/person\\\/42391e2ae52c8ed76b37be509a5707b0\"},\"headline\":\"A personal finance data pipeline 
project\",\"datePublished\":\"2022-08-04T03:45:58+00:00\",\"dateModified\":\"2022-08-04T03:52:37+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/gergely.imreh.net\\\/blog\\\/2022\\\/08\\\/a-personal-finance-data-pipeline-project\\\/\"},\"wordCount\":1934,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\\\/\\\/gergely.imreh.net\\\/blog\\\/#\\\/schema\\\/person\\\/42391e2ae52c8ed76b37be509a5707b0\"},\"image\":{\"@id\":\"https:\\\/\\\/gergely.imreh.net\\\/blog\\\/2022\\\/08\\\/a-personal-finance-data-pipeline-project\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/gergely.imreh.net\\\/blog\\\/wp-content\\\/uploads\\\/2022\\\/08\\\/credit-card-statement.png\",\"keywords\":[\"automation\",\"bigquery\",\"gcp\"],\"articleSection\":[\"Programming\",\"Taiwan\"],\"inLanguage\":\"en-GB\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/gergely.imreh.net\\\/blog\\\/2022\\\/08\\\/a-personal-finance-data-pipeline-project\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/gergely.imreh.net\\\/blog\\\/2022\\\/08\\\/a-personal-finance-data-pipeline-project\\\/\",\"url\":\"https:\\\/\\\/gergely.imreh.net\\\/blog\\\/2022\\\/08\\\/a-personal-finance-data-pipeline-project\\\/\",\"name\":\"A personal finance data pipeline project - 
ClickedyClick\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/gergely.imreh.net\\\/blog\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/gergely.imreh.net\\\/blog\\\/2022\\\/08\\\/a-personal-finance-data-pipeline-project\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/gergely.imreh.net\\\/blog\\\/2022\\\/08\\\/a-personal-finance-data-pipeline-project\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/gergely.imreh.net\\\/blog\\\/wp-content\\\/uploads\\\/2022\\\/08\\\/credit-card-statement.png\",\"datePublished\":\"2022-08-04T03:45:58+00:00\",\"dateModified\":\"2022-08-04T03:52:37+00:00\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/gergely.imreh.net\\\/blog\\\/2022\\\/08\\\/a-personal-finance-data-pipeline-project\\\/#breadcrumb\"},\"inLanguage\":\"en-GB\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/gergely.imreh.net\\\/blog\\\/2022\\\/08\\\/a-personal-finance-data-pipeline-project\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-GB\",\"@id\":\"https:\\\/\\\/gergely.imreh.net\\\/blog\\\/2022\\\/08\\\/a-personal-finance-data-pipeline-project\\\/#primaryimage\",\"url\":\"https:\\\/\\\/gergely.imreh.net\\\/blog\\\/wp-content\\\/uploads\\\/2022\\\/08\\\/credit-card-statement.png\",\"contentUrl\":\"https:\\\/\\\/gergely.imreh.net\\\/blog\\\/wp-content\\\/uploads\\\/2022\\\/08\\\/credit-card-statement.png\",\"width\":1158,\"height\":333,\"caption\":\"A daily credit card transaction statement\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/gergely.imreh.net\\\/blog\\\/2022\\\/08\\\/a-personal-finance-data-pipeline-project\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/gergely.imreh.net\\\/blog\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"A personal finance data pipeline 
project\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/gergely.imreh.net\\\/blog\\\/#website\",\"url\":\"https:\\\/\\\/gergely.imreh.net\\\/blog\\\/\",\"name\":\"ClickedyClick\",\"description\":\"Life in real, complex and digital.\",\"publisher\":{\"@id\":\"https:\\\/\\\/gergely.imreh.net\\\/blog\\\/#\\\/schema\\\/person\\\/42391e2ae52c8ed76b37be509a5707b0\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/gergely.imreh.net\\\/blog\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-GB\"},{\"@type\":[\"Person\",\"Organization\"],\"@id\":\"https:\\\/\\\/gergely.imreh.net\\\/blog\\\/#\\\/schema\\\/person\\\/42391e2ae52c8ed76b37be509a5707b0\",\"name\":\"Gergely Imreh\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-GB\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/1d5be311c5d616a3f4c7dfbc6b736ec817d2508b8c420ec29edb950d33fb4946?s=96&d=retro&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/1d5be311c5d616a3f4c7dfbc6b736ec817d2508b8c420ec29edb950d33fb4946?s=96&d=retro&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/1d5be311c5d616a3f4c7dfbc6b736ec817d2508b8c420ec29edb950d33fb4946?s=96&d=retro&r=g\",\"caption\":\"Gergely Imreh\"},\"logo\":{\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/1d5be311c5d616a3f4c7dfbc6b736ec817d2508b8c420ec29edb950d33fb4946?s=96&d=retro&r=g\"},\"description\":\"Physicist, hacker. Enjoys avant-guarde literature probably a bit too much. Open source advocate and contributor, both for software and hardware. 
Follow these posts on the Fediverse by @gergely@gergely.imreh.net\",\"sameAs\":[\"https:\\\/\\\/gergely.imreh.net\\\/\",\"https:\\\/\\\/www.facebook.com\\\/gergely.imreh\",\"https:\\\/\\\/www.instagram.com\\\/imrehg\\\/\",\"https:\\\/\\\/www.linkedin.com\\\/in\\\/gergelyimreh\\\/\",\"https:\\\/\\\/www.youtube.com\\\/@GergelyImreh\"],\"url\":\"https:\\\/\\\/gergely.imreh.net\\\/blog\\\/author\\\/gergely\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"A personal finance data pipeline project - ClickedyClick","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/gergely.imreh.net\/blog\/2022\/08\/a-personal-finance-data-pipeline-project\/","og_locale":"en_GB","og_type":"article","og_title":"A personal finance data pipeline project - ClickedyClick","og_description":"If your bank updates you through password-protected PDFs in emails and you are a programmer, you make some finance data extract lemonade.","og_url":"https:\/\/gergely.imreh.net\/blog\/2022\/08\/a-personal-finance-data-pipeline-project\/","og_site_name":"ClickedyClick","article_publisher":"https:\/\/www.facebook.com\/gergely.imreh","article_author":"https:\/\/www.facebook.com\/gergely.imreh","article_published_time":"2022-08-04T03:45:58+00:00","article_modified_time":"2022-08-04T03:52:37+00:00","og_image":[{"width":1158,"height":333,"url":"https:\/\/gergely.imreh.net\/blog\/wp-content\/uploads\/2022\/08\/credit-card-statement.png","type":"image\/png"}],"author":"Gergely Imreh","twitter_card":"summary_large_image","twitter_creator":"@imrehg","twitter_site":"@imrehg","twitter_misc":{"Written by":"Gergely Imreh","Estimated reading time":"10 
minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/gergely.imreh.net\/blog\/2022\/08\/a-personal-finance-data-pipeline-project\/#article","isPartOf":{"@id":"https:\/\/gergely.imreh.net\/blog\/2022\/08\/a-personal-finance-data-pipeline-project\/"},"author":{"name":"Gergely Imreh","@id":"https:\/\/gergely.imreh.net\/blog\/#\/schema\/person\/42391e2ae52c8ed76b37be509a5707b0"},"headline":"A personal finance data pipeline project","datePublished":"2022-08-04T03:45:58+00:00","dateModified":"2022-08-04T03:52:37+00:00","mainEntityOfPage":{"@id":"https:\/\/gergely.imreh.net\/blog\/2022\/08\/a-personal-finance-data-pipeline-project\/"},"wordCount":1934,"commentCount":0,"publisher":{"@id":"https:\/\/gergely.imreh.net\/blog\/#\/schema\/person\/42391e2ae52c8ed76b37be509a5707b0"},"image":{"@id":"https:\/\/gergely.imreh.net\/blog\/2022\/08\/a-personal-finance-data-pipeline-project\/#primaryimage"},"thumbnailUrl":"https:\/\/gergely.imreh.net\/blog\/wp-content\/uploads\/2022\/08\/credit-card-statement.png","keywords":["automation","bigquery","gcp"],"articleSection":["Programming","Taiwan"],"inLanguage":"en-GB","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/gergely.imreh.net\/blog\/2022\/08\/a-personal-finance-data-pipeline-project\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/gergely.imreh.net\/blog\/2022\/08\/a-personal-finance-data-pipeline-project\/","url":"https:\/\/gergely.imreh.net\/blog\/2022\/08\/a-personal-finance-data-pipeline-project\/","name":"A personal finance data pipeline project - 
ClickedyClick","isPartOf":{"@id":"https:\/\/gergely.imreh.net\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/gergely.imreh.net\/blog\/2022\/08\/a-personal-finance-data-pipeline-project\/#primaryimage"},"image":{"@id":"https:\/\/gergely.imreh.net\/blog\/2022\/08\/a-personal-finance-data-pipeline-project\/#primaryimage"},"thumbnailUrl":"https:\/\/gergely.imreh.net\/blog\/wp-content\/uploads\/2022\/08\/credit-card-statement.png","datePublished":"2022-08-04T03:45:58+00:00","dateModified":"2022-08-04T03:52:37+00:00","breadcrumb":{"@id":"https:\/\/gergely.imreh.net\/blog\/2022\/08\/a-personal-finance-data-pipeline-project\/#breadcrumb"},"inLanguage":"en-GB","potentialAction":[{"@type":"ReadAction","target":["https:\/\/gergely.imreh.net\/blog\/2022\/08\/a-personal-finance-data-pipeline-project\/"]}]},{"@type":"ImageObject","inLanguage":"en-GB","@id":"https:\/\/gergely.imreh.net\/blog\/2022\/08\/a-personal-finance-data-pipeline-project\/#primaryimage","url":"https:\/\/gergely.imreh.net\/blog\/wp-content\/uploads\/2022\/08\/credit-card-statement.png","contentUrl":"https:\/\/gergely.imreh.net\/blog\/wp-content\/uploads\/2022\/08\/credit-card-statement.png","width":1158,"height":333,"caption":"A daily credit card transaction statement"},{"@type":"BreadcrumbList","@id":"https:\/\/gergely.imreh.net\/blog\/2022\/08\/a-personal-finance-data-pipeline-project\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/gergely.imreh.net\/blog\/"},{"@type":"ListItem","position":2,"name":"A personal finance data pipeline project"}]},{"@type":"WebSite","@id":"https:\/\/gergely.imreh.net\/blog\/#website","url":"https:\/\/gergely.imreh.net\/blog\/","name":"ClickedyClick","description":"Life in real, complex and 
digital.","publisher":{"@id":"https:\/\/gergely.imreh.net\/blog\/#\/schema\/person\/42391e2ae52c8ed76b37be509a5707b0"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/gergely.imreh.net\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-GB"},{"@type":["Person","Organization"],"@id":"https:\/\/gergely.imreh.net\/blog\/#\/schema\/person\/42391e2ae52c8ed76b37be509a5707b0","name":"Gergely Imreh","image":{"@type":"ImageObject","inLanguage":"en-GB","@id":"https:\/\/secure.gravatar.com\/avatar\/1d5be311c5d616a3f4c7dfbc6b736ec817d2508b8c420ec29edb950d33fb4946?s=96&d=retro&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/1d5be311c5d616a3f4c7dfbc6b736ec817d2508b8c420ec29edb950d33fb4946?s=96&d=retro&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/1d5be311c5d616a3f4c7dfbc6b736ec817d2508b8c420ec29edb950d33fb4946?s=96&d=retro&r=g","caption":"Gergely Imreh"},"logo":{"@id":"https:\/\/secure.gravatar.com\/avatar\/1d5be311c5d616a3f4c7dfbc6b736ec817d2508b8c420ec29edb950d33fb4946?s=96&d=retro&r=g"},"description":"Physicist, hacker. Enjoys avant-garde literature probably a bit too much. Open source advocate and contributor, both for software and hardware. 
Follow these posts on the Fediverse by @gergely@gergely.imreh.net","sameAs":["https:\/\/gergely.imreh.net\/","https:\/\/www.facebook.com\/gergely.imreh","https:\/\/www.instagram.com\/imrehg\/","https:\/\/www.linkedin.com\/in\/gergelyimreh\/","https:\/\/www.youtube.com\/@GergelyImreh"],"url":"https:\/\/gergely.imreh.net\/blog\/author\/gergely\/"}]}},"_links":{"self":[{"href":"https:\/\/gergely.imreh.net\/blog\/wp-json\/wp\/v2\/posts\/2691","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/gergely.imreh.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/gergely.imreh.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/gergely.imreh.net\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/gergely.imreh.net\/blog\/wp-json\/wp\/v2\/comments?post=2691"}],"version-history":[{"count":14,"href":"https:\/\/gergely.imreh.net\/blog\/wp-json\/wp\/v2\/posts\/2691\/revisions"}],"predecessor-version":[{"id":2715,"href":"https:\/\/gergely.imreh.net\/blog\/wp-json\/wp\/v2\/posts\/2691\/revisions\/2715"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/gergely.imreh.net\/blog\/wp-json\/wp\/v2\/media\/2702"}],"wp:attachment":[{"href":"https:\/\/gergely.imreh.net\/blog\/wp-json\/wp\/v2\/media?parent=2691"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/gergely.imreh.net\/blog\/wp-json\/wp\/v2\/categories?post=2691"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/gergely.imreh.net\/blog\/wp-json\/wp\/v2\/tags?post=2691"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}