Building Hacker News Alerts

I've always been a fan of companies that "dogfood" their own products, using them in-house as part of their normal operations. To me, it shows they've built a product that solves a real problem, and that they count themselves among the customers who need the solution.

Shipyard's flexibility means it could be "dogfooded" for a wide range of business applications. This time around, I wanted to see if I could create an alert for the marketing team.

Engaging with Hacker News

Most mornings when I'm at the gym, I'll scroll through Hacker News between sets. Occasionally, I'll find articles or discussions relevant to Shipyard and send them over to the team.

At best, if it's a relatively new post and the comments are active, we can enter the discussion to show how Shipyard fits into the picture. This tactic consistently generated traffic to our site. At worst, we would end up reading relevant articles and discussions that helped shape our perspective on the data operations landscape. However, this isn't a particularly scalable solution: I'm not guaranteed to find relevant posts, and if I do, they may be too old to get much traction.

A simple, repeatable, programmatic solution was called for.

Developing the Solution

I needed a script that could perform the required querying and filtering to fetch relevant Hacker News links and present them to our team, so I started my coding journey.

Writing Script Functionality

The script was a small program, roughly 150 lines of Go (the same language Shipyard is written in). You can find the code on our GitHub.

It's relatively simple and accepts a few flag parameters so the user can define which posts they'd like to see and how they'd like to receive the output. We also leveraged the Algolia Hacker News API to make the querying faster and more accurate.
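The script itself is written in Go, but the core approach is easy to sketch. Here's a rough Python equivalent (the flag names and CSV layout are illustrative, not the script's actual interface; the endpoint and parameters come from the public Algolia Hacker News Search API):

import argparse
import csv
import time

import requests

# Illustrative flags; the real Go script's flags may differ.
parser = argparse.ArgumentParser(description="Fetch recent Hacker News posts matching a keyword")
parser.add_argument("--query", required=True, help="search term, e.g. 'data orchestration'")
parser.add_argument("--hours", type=int, default=24, help="only keep posts newer than this many hours")
parser.add_argument("--output", default="hn_articles.csv", help="CSV file to write results to")
args = parser.parse_args()

# Algolia's Hacker News Search API is public and requires no API key.
cutoff = int(time.time()) - args.hours * 3600
resp = requests.get(
    "https://hn.algolia.com/api/v1/search",
    params={
        "query": args.query,
        "tags": "story",
        "numericFilters": f"created_at_i>{cutoff}",
    },
)
resp.raise_for_status()

with open(args.output, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["ID", "TITLE", "URL", "POINTS", "COMMENTS"])
    for hit in resp.json()["hits"]:
        writer.writerow([hit["objectID"], hit["title"], hit.get("url"), hit["points"], hit["num_comments"]])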

Ultimately, this got us part of the way to our final goal of being periodically alerted about new articles.

Developing Code Locally

Since it was a relatively small script and didn't need an extensive test suite, I ran it locally to ensure data was coming back as expected. Additionally, the team that would be using the completed script tested things out locally as we went back and forth on which features to include.

Building the script to run locally was easy, and automating it on Shipyard was ideal since the platform runs code exactly as if it were running locally.

Transferring to Shipyard

After we settled on a final design, launching the script on Shipyard only took a few minutes. Even with the script written in Go, which is not natively supported on the platform, the team could still use the Bash language option and install the required package to run the program.

With the code uploaded to the GitHub repository, we synced that repository to a brand new Blueprint. The Blueprint allowed our team to easily change the arguments passed to the script via a form.

This meant that even though the primary use would be for finding articles related to our product, we could build new Vessels from the Blueprint to look up whatever else we wanted.

Scheduling Periodic Triggers

The next setup step was to ensure that we could periodically trigger the script to collect new posts.

In my past work on AWS, this would have been accomplished by setting up a CloudWatch Events Rule to trigger an attached Lambda, which requires configuring and launching both resources. That seemed like overkill for a small script.

On Shipyard, scheduling is supported out of the box, so I was able to configure the script to run several times a day without additional infrastructure.

Sending A Message

Now that we had the script running on a periodic schedule to create a list of relevant articles each day, we needed to find a way to send that list to the entire team. We purposefully didn't build that logic into the script because we wanted to leverage our platform's features.

Using Shipyard's built-in no-code Slack Blueprints, we were able to build a Vessel to send a message directly to the team in just a few clicks. It only required us to provide a few inputs like the channel we wanted to send to, the team members we wanted to alert, and the name of the file we wanted to upload.

Shipyard's built-in Slack Blueprint
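For a sense of what that Blueprint replaces, a hand-rolled version with the slack_sdk package would look roughly like this (the token, channel, and user IDs are placeholders, and this is not Shipyard's actual implementation):

import os

from slack_sdk import WebClient

# Placeholder values; in the Blueprint these were just form fields.
client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])
client.files_upload_v2(
    channel="C0123456789",   # channel to post in
    file="hn_articles.csv",  # file produced by the scraper
    initial_comment="Today's relevant Hacker News articles <@U0123456789>",
)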

Linking it Together

With the Slack Vessel created, we linked the scraper script directly to the Slack Vessel as part of a Fleet. Because Fleets share data between Vessels, we were able to easily take the file full of Hacker News articles and have it uploaded directly to Slack.

And that's it!

We were now up and running with automated Slack alerts for relevant Hacker News articles to check out every day.

Refining the Solution

About a week in, we noticed a few flaws in our system.

  1. We kept sending ourselves articles, but half the time they were duplicates of ones from the previous alert.
  2. It was becoming harder to tell which articles were actually new, which decreased the value of our alerting system.
  3. Important high-growth articles sometimes went unnoticed because we were only running the script every few hours.

We wanted to find better ways to send articles without continuously adjusting the core functionality of the script.

Logging Articles in BigQuery

First, we created a single Python Vessel that imported the Hacker News articles as a pandas DataFrame and added an extra column with a load timestamp so we could tell which articles were new.

import pandas as pd
import datetime

# Read the scraper's output and stamp every row with the current load time.
df = pd.read_csv('hn_articles.csv')
df['load_time'] = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')

# Write the file back out; index=False keeps pandas from adding an extra index column.
df.to_csv('hn_articles.csv', index=False)

Next, we used a BigQuery Blueprint to upload the CSV of Hacker News articles to a BigQuery table. Since the files generated by the scraper were relatively small (~30 rows per run), we chose to continuously append data to the table rather than go through the hassle of making updates. This would allow us to track changes over time and understand when an article was first noticed.
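As a sketch of what that append-only load amounts to, here's the equivalent call with the google-cloud-bigquery client (credential setup is omitted; this illustrates the Blueprint's behavior, not its actual code):

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the CSV header row
    autodetect=True,      # infer the table schema from the file
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,  # always append, never overwrite
)

with open("hn_articles.csv", "rb") as f:
    job = client.load_table_from_file(
        f,
        "shipyard-solutions-internal.data_community.hn_articles",
        job_config=job_config,
    )
job.result()  # block until the load job finishes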

Returning the Newest Articles

With our logging in place, we wanted a way to send only "never before seen" articles. Naturally, with all of the articles now stored in BigQuery, we used a Blueprint to return a CSV file of query results.

We built the following query, which returns articles that have been loaded exactly once, on the current day (Chicago time).

SELECT
  MAX(TITLE) AS Title,
  MAX(URL) AS URL,
  MAX(POINTS) AS Score,
  MAX(COMMENTS) AS Comments
FROM
  `shipyard-solutions-internal.data_community.hn_articles`
GROUP BY
  ID
HAVING
  COUNT(ID) = 1
  AND EXTRACT(DAY FROM MIN(load_time) AT TIME ZONE "America/Chicago") = EXTRACT(DAY FROM CURRENT_TIMESTAMP() AT TIME ZONE "America/Chicago")
  AND EXTRACT(MONTH FROM MIN(load_time) AT TIME ZONE "America/Chicago") = EXTRACT(MONTH FROM CURRENT_TIMESTAMP() AT TIME ZONE "America/Chicago")
  AND EXTRACT(YEAR FROM MIN(load_time) AT TIME ZONE "America/Chicago") = EXTRACT(YEAR FROM CURRENT_TIMESTAMP() AT TIME ZONE "America/Chicago")
ORDER BY Title ASC

We saved this query as a view, so any time we need to update the logic, we can edit the view directly in BigQuery. Shipyard will always use the latest version of the query when pulling the newest articles.

While this query wouldn't be 100% accurate, it would be 100% easier than continuously tweaking the script's logic.

Surfacing Growth Articles

Over time, we noticed that commenting on the surfaced Hacker News posts usually generated little traction. Occasionally, though, a post would see rapid upvote growth, and the traction on our comment would be significantly higher.

To ensure we didn't miss out on those opportunities, we set up another Vessel to find high-growth articles and return those results in a file alongside the new articles. We created the following query to identify any articles that had first surfaced in the last 24 hours and gained more than 10 points since.

SELECT
  MAX(TITLE) AS Title,
  MAX(URL) AS URL,
  MAX(POINTS) - MIN(POINTS) AS point_growth,
  MAX(COMMENTS) - MIN(COMMENTS) AS comment_growth
FROM
  `project.dataset.hn_articles`
GROUP BY
  ID
HAVING
  COUNT(ID) > 1
  AND point_growth > 10
  AND MIN(load_time) > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY Title ASC

Now we had a reliable way to make sure we were highlighting articles that were rapidly gaining popularity.

Final Product

Through a month of steady iteration, we developed the exact solution that our team needed to find relevant Hacker News articles. The current version of the solution looks like this:

We have one Fleet with six Vessels set up to run every hour between 9 AM and 6 PM. Every time the Fleet runs, we scrape Hacker News articles, append a timestamp to the file, load the data into BigQuery, run our queries for new and growth articles, then send the resulting files to our internal Slack.

This process now runs seamlessly in the background, with automatic retries, in-depth logging and notifications for troubleshooting in the event that something goes wrong.

And the best part? Throughout this project, we never had to perform any additional development work to make the solution exactly what we needed. Hosting, scheduling, executing, connecting to other services, and tweaking the final solution were handled entirely through Shipyard configurations, without writing any additional code.

This is just a small example of how our team was able to quickly and easily begin using our own platform to tackle our own problems.

Want to build a solution like this on your own? Get started with our free Developer Plan.


About Shipyard:
Shipyard is a modern data orchestration platform for data engineers to easily connect tools, automate workflows, and build a solid data infrastructure from day one.

Shipyard offers low-code templates that are configured using a visual interface, replacing the need to write code to build data workflows while enabling data engineers to get their work into production faster. If a solution can’t be built with existing templates, engineers can always automate scripts in the language of their choice to bring any internal or external process into their workflows.

The Shipyard team has built data products for some of the largest brands in business and deeply understands the problems that come with scale. Observability and alerting are built into the Shipyard platform, ensuring that breakages are identified before being discovered downstream by business teams.

With a high level of concurrency and end-to-end encryption, Shipyard enables data teams to accomplish more without relying on other teams or worrying about infrastructure challenges, while also ensuring that business teams trust the data made available to them.

For more information, visit www.shipyardapp.com or get started for free.