Watermarks for Google Cloud Pub/Sub
Pub/Sub
Messages can be out-of-order
Client pulls a message and acks by its ID
Pub/Sub tries to provide oldest messages first but with no hard gurantees
Heuristic watermark generation
Assume event timestamps are "well behaved", i.e., amount of out-of-order timestamps on the source data is bounded
Data with timestamp outside the allowed out-of-order bounds will be considered late data
In the current implementation, this bound is at least 10 seconds
We call this estimation band
If the stream pipeline is perfectly caught up, the watermark will 10 second behind real time
If not, the subscription has non-empty backlog and we use estimate the watermark using the backlog
How to estimate watermark using backlog
Pub/Sub provides the "oldest publish timestamp" in the backlog, where "publish timestamp" is the system timestamp when the message is ingested by Pub/Sub
We make another subscription to track watermark
We pull and acknowledge messages on the tracking subscription ASAP, storing publish timestamp and event time in a sparse histogram
We take a band of event times
with publish timestamp newer than the oldest unacknowldged publish timestamp
or in estimation band
The estimated watermark value is the minimum in the band.