Hypothesized Incident Onset: 2026-06-08 03:28AM PST
Oncall Ownership Onset: 2026-06-08 05:45AM PST
Amelioration Onset: 2026-06-08 05:50AM PST
Narrative: At 2026-06-08 03:28AM PST, the tumblr-rotator daemon running on minicomputer-2 attempted to POST the 208th avatar of the day to the Tumblr avatar upload endpoint.
During nominal functioning of the daemon, we fetch the avatar from Tumblr's servers after every upload attempt (without an intermediary sleep). To verify the fidelity of Tumblr's cache invalidation (and therefore their ability to faithfully broadcast avatars sent to them by the tumblr-rotator daemon), we use a perceptual hashing algorithm (pHash, via the ImageHash library) to check that they are not serving a stale or otherwise incorrect image. The acceptance threshold is tuned by hand: an upload is accepted when the pHash distance is less than 25, where the distance is computed by the snippet "abs(imagehash.phash(src, hash_size=16) - imagehash.phash(srv, hash_size=16))".
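A minimal sketch of the check described above, assuming Pillow and ImageHash are installed. The function names phash_distance and is_accepted, the file-path arguments, and the strict less-than comparison are illustrative assumptions, not taken from the daemon's source. In ImageHash, subtracting two hashes yields the number of differing bits, so with hash_size=16 the distance can range from 0 to 256.

```python
THRESHOLD = 25  # value quoted in this report prior to the incident


def phash_distance(src_path, srv_path, hash_size=16):
    """Hamming distance between the pHashes of the uploaded and served images."""
    # Imported lazily so the threshold helper below works without these packages.
    from PIL import Image
    import imagehash

    with Image.open(src_path) as src, Image.open(srv_path) as srv:
        src_hash = imagehash.phash(src, hash_size=hash_size)
        srv_hash = imagehash.phash(srv, hash_size=hash_size)
    # Subtracting two ImageHash objects counts differing bits (0..hash_size**2).
    return abs(src_hash - srv_hash)


def is_accepted(distance, threshold=THRESHOLD):
    """Assumed comparison: accept only when distance is strictly below threshold."""
    return distance < threshold
```

Under these assumptions, a distance of 26 is rejected at a threshold of 25 and accepted at a threshold of 27.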
The perceptual hash check has previously ensured that we re-attempt uploads as needed; Tumblr has been known to, among other things, reset the regexkind avatar to a days-old version. When a mismatch is detected, the upload is attempted a maximum of 5 times (including the initial attempt), with a fixed sleep of 5 seconds between each upload¹.
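The retry behaviour can be sketched as follows. This is an illustration only: upload_with_retries and its upload, verify and sleep parameters are hypothetical names, and the real run_avatar_stage may be structured differently.

```python
import time


def upload_with_retries(upload, verify, max_attempts=5, delay_s=5.0, sleep=time.sleep):
    """Upload, then verify, up to max_attempts times with a fixed delay.

    Returns the 1-based number of the attempt that verified, or None if
    every attempt ended in a mismatch.
    """
    for attempt in range(1, max_attempts + 1):
        upload()
        if verify():
            return attempt
        if attempt < max_attempts:
            sleep(delay_s)  # fixed sleep: no backoff, no jitter
    return None
```

Passing sleep as a parameter keeps the sketch testable without actually waiting between attempts.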
After 5 attempts elapse, the method run_avatar_stage surrenders control to the scheduler, which under nominal conditions sleeps so that 60 seconds pass between successful avatar pushes, giving viewers adequate time with each avatar. After that delay passes, the last unpushed image is scheduled for upload. This means that the retry logic in run_avatar_stage is somewhat misleading; in actuality, an image will be attempted indefinitely until one of two things happens:
A successful upload occurs, or
The scheduler's forfeit conditions are met².
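The report's footnote on forfeit conditions mentions one computation: whether the remaining images could still be pushed by midnight at the maximum rate of one upload per 30 seconds. A sketch of only that computation follows; must_forfeit and its parameters are hypothetical names, and the real conditions are described as more complex than this.

```python
from datetime import datetime, timedelta

MIN_INTERVAL_S = 30  # maximum upload rate quoted in the forfeit footnote


def must_forfeit(now, images_remaining, min_interval_s=MIN_INTERVAL_S):
    """True if the remaining images cannot all be pushed before midnight."""
    midnight = (now + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    seconds_left = (midnight - now).total_seconds()
    # Even at the fastest allowed cadence, each image costs min_interval_s.
    return images_remaining * min_interval_s > seconds_left
```

For example, at 23:59:00 there are 60 seconds left, so two remaining images still fit under this rule while three do not.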
The exact sequence of images for 2026-06-08 had been pushed on previous days, but not since the update to this new scheduler algorithm; previous versions of the scheduler were more willing to allow "transmission gaps" where an image never made it to the viewer, which would require them to interpolate said image. It is assumed that on these previous days, the scheduler also saw a pHash distance of 26 (as it did today during the upload of avatar #208), attempted 5 times, and then gave up transmitting; due to a gap in our operating procedure, such events would not have been adequately investigated.
At 2026-06-08 05:45AM PST, the oncaller for the service saw the (silent³) notification on her phone. Noticing that the lag metric had reached 8260s, she immediately investigated the logs, which revealed that upload attempts had been made steadily every 30s for the past few hours, but had "failed" due to the pHash distance being 26, exceeding the threshold of 25. She observed visually that the current image at the avatar GET endpoint was the 208th image, implying that this hairsplit was a false positive for an upload mismatch. She tuned the threshold by hand to 27, reasoning that the precise degree of acceptable pHash distance could be computed at a later time, so long as she unblocked the rotator.
She then ran "systemctl restart tumblr-rotator.service" as root to reload the rotator scripts, and the upload process immediately resumed and completed a successful reupload of the 208th avatar at 2026-06-08 05:50:58 PST, followed by a successful first upload of the 209th avatar at 2026-06-08 05:51:27 PST.
The following graphs illustrate the characteristics of the incident:
We anticipate the lag dropping to sub-minute intervals within two hours of this report, after which we will consider incident algol-0001 to be mitigated if not resolved.
The pHash algorithm is probably appropriate for detecting the difference between glyphs but is inappropriate for detecting other important features of the avatar, including the RGB value and [REDACTED]. Since RGB value and [REDACTED] can be more faithfully measured by other methods, we should ensure that the pHash algorithm is only used to measure the difference between glyphs, leading to a composite diff-measuring algorithm that is simultaneously more accurate at detecting differences between served and uploaded avatars, and less prone to false positives.
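One possible shape for such a composite check is sketched below, purely as an illustration. The function name composite_match, the per-channel RGB comparison, and both tolerance values are assumptions that do not come from the daemon, and the [REDACTED] feature is left out entirely.

```python
def composite_match(glyph_distance, src_rgb, srv_rgb,
                    glyph_threshold=27, rgb_tolerance=8):
    """Accept only if the glyph pHash distance and every RGB channel agree.

    glyph_distance: pHash distance measured on the glyph alone.
    src_rgb, srv_rgb: (r, g, b) tuples for the uploaded and served avatars.
    rgb_tolerance is a hypothetical per-channel allowance, not a tuned value.
    """
    glyph_ok = glyph_distance < glyph_threshold
    rgb_ok = all(abs(a - b) <= rgb_tolerance for a, b in zip(src_rgb, srv_rgb))
    return glyph_ok and rgb_ok
```

Measuring each feature separately means a large colour mismatch can no longer hide inside a small pHash distance, and a small pHash wobble on the glyph need not block an otherwise correct upload.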
As mentioned in this report, this issue likely has occurred before and not been surfaced due to a presumption of fault on the side of Tumblr API and CDN. Since the number of faulty uploads is normally less than a dozen per day, and we expect that the oncaller will be able to review data from the daemon at least once every few days, it seems reasonable to snapshot each offending diff and compile them into a running "report" for visual inspection.
The first of these fixes is considered accepted, pending implementation. The second is rejected due to a strong organizational value on never imposing toil except where absolutely required. Forcing our oncaller to view a report full of false positives every day does not align with our values.
¹Incidentally, this is not considered to be sound engineering practice (exponential backoff with jitter being the gold standard), but the author concurs with the writer of the script that this is "good enough" in a circumstance where we are likely the only client with the exact interaction characteristics of tumblr-rotator.
²The forfeit conditions are somewhat complex, but involve the scheduler computing that even if we uploaded an image at the maximum rate (30s between uploads) from now until midnight, we could not push every image in the schedule.
³All notifications for the tumblr-rotator service are silent; we cannot afford to pay our oncaller for the level of service implied by a paging alert system that can wake her up.