My roommate needed help with building a web scraper. As a result of various constraints, we ended up using Heroku. There was, however, a database row limit that we had to work around. It turned out to be a nice opportunity to play with the new PostgreSQL 9.5 INSERT ... ON CONFLICT DO UPDATE, also known as UPSERT, functionality. The goal of this article is to introduce UPSERT to you via a practical example.

## Introduction

My roommate, an aspiring programmer, recently asked me for advice on how to build a web scraper. The scraper should run on a machine that is constanly connected to the internet, because it should check at least every hour or so. Running the scraper on his laptop is not an option for my roommate, because he frequently takes it with him on the road. Neither is running it on a remote server, because he possesses none nor is he planning to spend money on one. We tried out Heroku; a free platform as a service provider that embraces all of these constraints.

### Choosing the right storage

The Heroku Scheduler addon enabled us to run a script every 10 minutes (the highest available frequency) that scrapes the targeted website. With that requirement settled, we needed a way to compare the old state to the new state, so we could send ourselves an email as soon as the website changed. The simplest solution we could come up with involved saving the hashed HTML contents into a Heroku config variable. That solves the problem of notification, but will not allow us analyse the data later. For this reason, using a database made sense. PostgreSQL storage is cheap, free actually on Heroku, so we picked that one.

### Heroku limitations

Normally, I would insert one row per scraping run with the scraped contents (or a foreign key to a unique piece of content) and a timestamp into the database. In this case, however, it was not a good idea because the free Heroku PostgreSQL database offering has certain limitations. You are allowed to have a maximum of 10.000 rows in the database. A scraper that scrapes every 10 minutes reaches the limit after 40 days of service. We needed a way to circumvent the limitation, since we were planning on scraping longer than that.

What if we scaled horizontally instead of vertically? Instead of adding rows, we could grow columns and consequently circumvent the row limit. One row per unique piece of scraped content with an ever expanding array of timestamps. PostgreSQL arrays seem to be a good fit for this. One could implement this using two queries. First, a SELECT establishes whether we have seen the content before. Second, an INSERT or UPDATE respectively inserts a new piece of content with the current time to the array or updates an existing row by appending the current time. Or, you use the new INSERT … ON CONFLICT DO UPDATE operation instead, like I will show you in this article.

## Methods

In this article, I will only show you how to complete the task using UPSERT. If you would like to see how you could achieve similar behavior with PostgreSQL versions below 9.5, please check out this Stackoverflow answer.

### The query

We built the plumbing of the scraper using Python; the source code is available on GitHub. The Python script executes the following query–the interesting part–every time it runs.

### The test

We supply the query above to the script below via the \$QUERY variable. First, we initialize an empty database and run the query twice with the body variable set to content and ts set to the SQL function now(). After that, we simulate a change on the targeted website by running the query again after setting content to changed. In production, the body variable contains an HTML document. The output of the script is printed below it.

## Results

The scraper has been running flawlessly in production for almost two weeks. Let’s see how many rows have been added during that time.

After running every 10 minutes for almost two weeks, the scraper inserted just 24 records. Before we inspect the contents of the database, let’s make sure that we really are in compliance with the Heroku PostgreSQL maximum rows limitation:

We have to unnest the seen_at array to obtain the total count of scraper runs.

Let’s break the 2282 rows down by date and aggregate the count of checks and changes of content that occured that day. Please check out the appendix for the exact query that I used.

The sums of checks and changes match the unnested and total counts above, respectively. Note: the amount of checks differs between days probably because of the following:

Scheduler is a best-effort service. There is no guarantee that jobs will execute at their scheduled time, or at all. Scheduler has a known issue whereby scheduled processes are occasionally skipped.

## Discussion

This scraping method only works when the website is static and the content changes slowly, compared to dynamic websites with different HTML output on each request. For example, some websites return a different XSRF token at every visit. In that case, every scraping run inserts a new row into the database, negating the savings of our UPSERT horizontal expansion.

## Conclusion

The HTML output of our targeted website only changed when the content of interest changed, leading to our high checks vs. changes ratio. Using UPSERT for scraping turned out to be a good fit for this website because it enables us to scrape for some time ahead, while logging all of the captured data.

## Acknowledgements

Thank you people that made Org mode and Babel. :-)