Handling Hawk's Deluge of Data with AWS
When a user loads a web page, all the associated code runs on their computer. Unless communicated explicitly to a server, what they do with the code is unknown. Choices they make, errors and UX bugs are all locked away on their computer. And then they vanish forever when the user closes the browser window.
This is a problem for developers, who need to know those things.
At Cyber Care, we became dissatisfied with the available solutions to that problem. Most locked you into a funnel paradigm, or required you to interpret complex heatmaps. Some only handled error reporting. We wanted a solution that would do what we needed, without constraining us in ways we weren't committed to.
So we started developing Hawk: our answer to the problem.
One major technical difficulty with Hawk, from the beginning, was data volume. We didn't have the budget to spin up huge, inefficient servers. But as Hawk ran day by day during development, the pile of data mounted higher and higher.
We went through several server paradigms to handle it.
Starting Simple
In our first iteration, our "reporter" script (the JavaScript file that Hawk users included on their websites to facilitate reporting data to Hawk) sent data to REST endpoints on a NodeJS server, which then stored the data in MongoDB. This approach was simple and familiar, and allowed us to easily develop Hawk locally and deploy it in production.
For our production server, we picked AWS Lightsail for its low price. One downside is that, for the server size we selected, you're expected to average no more than 10% CPU usage. When you exceed the limit, the server slows to a crawl. On top of that, disk usage is throttled too.
Hawk's most important feature was providing a visual recording of what the user actually saw on their screen. Our initial implementation was to send the full HTML serialization of the DOM to the server each time something changed in the DOM. We were also:
capturing details about JS errors and failed XHR and fetch requests,
using WebSockets to send the HTML to site owners logged in on usehawk.com, so they could watch sessions live,
using Puppeteer to render the HTML to PNG images to include in email notifications,
so it was easy to overwhelm our server.
A Little Lambda
The first optimization we made was moving the storage of HTML snapshots to S3 and the rendering of PNG images into a Lambda function. We still had the data coming into our server, but rather than writing it to disk in our Mongo database, we sent the HTML to S3 and kicked off a Lambda function. The Lambda function loaded the HTML from S3 in Puppeteer, took a screenshot, and saved the PNG into S3.
Lambda was a much better choice for running Puppeteer and rendering HTML to PNGs than our resource-constrained Lightsail server. With Lambda, you're only charged for the time when your function is running, and Amazon handles scaling for you. If 100 requests come in at the same time, you can run 100 instances of your Lambda function concurrently. And if there's no traffic between midnight and 6:00am, you don't pay anything during that period. Lambda also allows your function to remain loaded in memory for a period of time after it completes. Amazon doesn't charge for this time, and in our case it meant we didn't have to wait for Chrome to launch each time our function ran.
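The warm-start benefit comes from keeping the browser handle in module scope, outside the handler. Here is a minimal sketch of that pattern, not Hawk's actual code: launchBrowser is a hypothetical stand-in for whatever starts headless Chrome, and the handler's event shape is an assumption.

```javascript
// Module scope survives between invocations while the Lambda container stays warm.
let browserPromise = null;
let launchCount = 0;

// Hypothetical stand-in for launching headless Chrome; the real launch is the slow part.
async function launchBrowser() {
  launchCount += 1;
  return {
    // Fake renderer so the sketch runs without Chrome installed.
    async screenshot(html) {
      return `fake-png-for-${html.length}-chars`;
    },
  };
}

// Launch on the first (cold) invocation only; later invocations reuse the same promise.
function getBrowser() {
  if (browserPromise === null) {
    browserPromise = launchBrowser();
  }
  return browserPromise;
}

// Assumed handler shape: receives the HTML and returns the rendered image.
async function handler(event) {
  const browser = await getBrowser();
  return browser.screenshot(event.html);
}
```

Calling handler twice in the same container launches the browser once. A cold container starts over with browserPromise set to null.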
This was a good first step, but we still had a lot of data flowing through our humble Lightsail server.
API Gateway, Lambda and S3
We knew that our Lightsail server was the center of our performance problems. We could have upgraded to a larger server (in EC2, for example), but while this might "fix" things in the short-term, it wouldn't scale well as Hawk grew. After our positive experience with Lambda, we knew that we wanted to move as much as possible off of our server.
We used AWS API Gateway as the new entry point for receiving data from our reporter script. We created an API for each category we needed to handle (snapshot [HTML] data, events, site-specific configuration, etc.). Each API had from one to a handful of REST endpoints that invoked a Lambda function. We unified the APIs under a single domain name using API Gateway's custom domain feature.
Data Storage
For data storage, we knew that we didn't want to use our MongoDB server. We considered using AWS's schemaless database, DynamoDB. However, querying in DynamoDB seemed very limited (dependent on indices, with additional charges for each index). We decided that S3 (although not designed to be a database) would work just as well for our purposes. S3 makes it easy to list files in ascending order, with an optional prefix. So for events (including page navigations and errors), we organized them in S3 by account, then by site, and then using an "inverted timestamp", for example:
5cd31ad21d581b0071b53abc/5cd31ad21d581b0071b53def/7980-94-90T78:68:…
The timestamp is generated by taking an ISO-8601 formatted date and subtracting each digit from 9; this results in the newest events being sorted at the top, which is the order Hawk wanted to display them in the UI.
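The digit inversion can be sketched in a few lines. The function names and the short "acct"/"site" ids below are illustrative, not taken from Hawk. The trick relies on ISO-8601 strings having a fixed width, so lexical order matches time order.

```javascript
// Replace every digit d with 9 - d; separators (-, T, :, ., Z) are left untouched.
function invertTimestamp(isoString) {
  return isoString.replace(/\d/g, (d) => String(9 - Number(d)));
}

// Key layout follows the article: account id, then site id, then inverted timestamp.
function eventKey(accountId, siteId, date) {
  return `${accountId}/${siteId}/${invertTimestamp(date.toISOString())}`;
}

const olderKey = eventKey("acct", "site", new Date("2019-05-09T21:31:00.000Z"));
const newerKey = eventKey("acct", "site", new Date("2019-05-09T21:32:00.000Z"));

// Ascending lexical order (the order S3 lists keys in) now puts the newer event first.
const listed = [olderKey, newerKey].sort();
console.log(listed[0] === newerKey); // true
```

Listing with the prefix "acct/site/" would then return the most recent events first, with no extra sorting step.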
Live View
To support live viewing of sessions, we needed a way to get the data from the visitor to the site owner. This involved a couple of Lambda functions and some SQS queues. On the visitor side of things:
Our html-to-s3 Lambda function accepts screenshot data (HTML, window dimensions, etc.) from a site visitor and stores it in S3.
It then checks for the existence of an SQS queue named live-viewers-requests-{siteId}-{sessionId}.
If such a queue exists, it reads messages off of that queue. Each message contains the name of a queue that should be notified of new screenshot data for the session.
It sends the name of the screenshot data in S3 to each queue named in those messages.
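The visitor-side fan-out can be sketched with an in-memory stand-in for SQS. Everything below except the live-viewers-requests-{siteId}-{sessionId} naming is hypothetical, including the helper names and the example ids. A real implementation would call SQS instead of a Map.

```javascript
// In-memory stand-in for SQS: queue name -> array of pending messages.
const queues = new Map();

function sendMessage(queueName, body) {
  if (!queues.has(queueName)) queues.set(queueName, []);
  queues.get(queueName).push(body);
}

// Drain and return all messages currently on a queue (empty array if none).
function receiveAll(queueName) {
  const messages = queues.get(queueName) || [];
  if (queues.has(queueName)) queues.set(queueName, []);
  return messages;
}

// After a snapshot is stored, tell every waiting viewer which S3 object to load.
function notifyLiveViewers(siteId, sessionId, snapshotKey) {
  const requestQueue = `live-viewers-requests-${siteId}-${sessionId}`;
  if (!queues.has(requestQueue)) return 0; // nobody is watching this session

  // Each message on the request queue is the name of a viewer's private queue.
  const viewerQueues = receiveAll(requestQueue);
  for (const viewerQueue of viewerQueues) {
    sendMessage(viewerQueue, snapshotKey);
  }
  return viewerQueues.length;
}

// A viewer registered its private queue earlier; a new snapshot then arrives.
sendMessage("live-viewers-requests-site1-sess1", "viewer-queue-abc");
const notified = notifyLiveViewers("site1", "sess1", "site1/sess1/snapshot-0001.html");
console.log(notified); // 1
```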
And then on the viewer side:
Our live-viewer Lambda function takes site and session ids as parameters, and the name of the last screenshot the viewer received.
It queries S3 to see if a newer screenshot is available. If so, it returns that screenshot.
Otherwise, it creates a randomly-named SQS queue, and sends a message containing that random queue name to live-viewers-requests-{siteId}-{sessionId}.
It then waits for a message on the randomly-named queue. If it receives one within 23 seconds, it returns the screenshot contained in the message.
Otherwise, it returns a specific error code to Hawk's front-end to indicate that no new data was available. The front-end can then retry the request.
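The viewer-side steps above can be sketched as a single function with its dependencies passed in. The deps object and its helpers (findNewerSnapshot, randomId, createQueue, sendMessage, receiveMessage) are hypothetical stand-ins for the S3 and SQS calls. The return shapes are also assumptions, since the article does not give the actual error code.

```javascript
// Sentinel returned when nothing new arrived; the front-end retries on seeing it.
const NO_NEW_DATA = { status: "no-new-data" };

// Seconds to wait for a notification, as described in the article.
const WAIT_SECONDS = 23;

async function liveViewer({ siteId, sessionId, lastSnapshot }, deps) {
  // A newer snapshot may already be in S3, in which case no queue is needed.
  const newer = await deps.findNewerSnapshot(siteId, sessionId, lastSnapshot);
  if (newer) return { status: "ok", snapshot: newer };

  // Otherwise register a private, randomly named queue with the session's request queue.
  const replyQueue = deps.randomId();
  await deps.createQueue(replyQueue);
  await deps.sendMessage(`live-viewers-requests-${siteId}-${sessionId}`, replyQueue);

  // Wait for the visitor side to announce new data, or give up after the timeout.
  const message = await deps.receiveMessage(replyQueue, WAIT_SECONDS);
  return message ? { status: "ok", snapshot: message } : NO_NEW_DATA;
}
```

Passing the helpers in keeps the control flow testable without AWS. In a deployed function they would wrap the AWS SDK.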
Keeping Lambda functions small and fast
API Gateway APIs time out after 29 seconds. In order to avoid having our requests time out, and to ensure we didn't slow down our customers' sites by having concurrent, long-running requests, we split some of our endpoints into multiple Lambda functions. For example, the html-to-s3 function we mentioned above is actually split into a couple of pieces. The first accepts the HTML from the client, stores it in S3, asynchronously kicks off the next Lambda function in the process, and returns as quickly as possible. The next Lambda function in the chain deals with SQS, and if necessary, kicks off a final Lambda function to render the HTML to a PNG image.
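A rough sketch of the first link in such a chain follows. The helpers on deps are hypothetical: putObject stands in for an S3 write and invokeAsync for a fire-and-forget Lambda invocation (the "Event" invocation type). The function name "notify-live-viewers", the key layout and the 204 response are also assumptions.

```javascript
async function acceptSnapshot(event, deps) {
  // Store the HTML first so later functions in the chain can load it by key.
  const key = `${event.siteId}/${event.sessionId}/${event.timestamp}.html`;
  await deps.putObject(key, event.html);

  // Hand off the slower work. We wait only for the invocation to be accepted,
  // not for the next function to finish.
  await deps.invokeAsync("notify-live-viewers", { key });

  // Respond right away so the visitor's browser is not kept waiting.
  return { statusCode: 204 };
}
```

The response time of this function then depends only on the S3 write and the hand-off, not on SQS polling or PNG rendering.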
Authentication
When a Hawk user wanted to retrieve data, our web app used the /signin endpoint on our Lightsail server, which returned a JWT. Subsequent requests were then sent to API Gateway, with the JWT included in a request header. Our Lambda functions could then validate the signature on the JWT and extract the necessary data (e.g. the account id) without needing to contact our server.
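To illustrate why no round trip to the server is needed, here is a minimal sketch of local JWT verification using Node's built-in crypto module. It assumes an HS256 (shared-secret) token, which the article does not specify, and the function names and payload fields are illustrative. In production a vetted JWT library would be the safer choice.

```javascript
const crypto = require("crypto");

const toBase64Url = (text) => Buffer.from(text).toString("base64url");

// Signing happens on the sign-in server; it is shown here only so the example is runnable.
function signJwt(payload, secret) {
  const header = toBase64Url(JSON.stringify({ alg: "HS256", typ: "JWT" }));
  const body = toBase64Url(JSON.stringify(payload));
  const signature = crypto
    .createHmac("sha256", secret)
    .update(`${header}.${body}`)
    .digest("base64url");
  return `${header}.${body}.${signature}`;
}

// Verification needs only the shared secret, so a Lambda function can do it locally.
// Returns the payload, or null if the token is malformed, forged, or expired.
function verifyJwt(token, secret) {
  const parts = token.split(".");
  if (parts.length !== 3) return null;
  const [header, body, signature] = parts;

  // Recompute the HMAC and compare in constant time.
  const expected = crypto.createHmac("sha256", secret).update(`${header}.${body}`).digest();
  const actual = Buffer.from(signature, "base64url");
  if (actual.length !== expected.length || !crypto.timingSafeEqual(actual, expected)) {
    return null;
  }

  const payload = JSON.parse(Buffer.from(body, "base64url").toString("utf8"));
  // exp is in seconds since the epoch, per the JWT specification.
  if (typeof payload.exp === "number" && payload.exp * 1000 <= Date.now()) return null;
  return payload;
}
```

A Lambda function would call verifyJwt with the token from the request header and read fields such as the account id from the returned payload.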
We're happy with where Hawk ended up. If your team is dealing with a project that produces an overwhelming amount of data, maybe something we did might work for you.







