🕕 Let’s talk about clock synchronization.
🗒️ This is essential for stable and secure Kubernetes clusters.
This is the story of leap seconds and the complexity of sub-second time synchronization.
Context: Regulations and software need accurate time
Sub-second clock accuracy is essential for establishing “happened-before” or “cause-effect” relationships in distributed systems.
- In some industries, e.g., FinTech, clock synchronization is regulated to ensure market fairness.
- Ceph (open-source storage cluster software) tolerates a drift of up to 0.05 seconds.
- etcd (the database fueling Kubernetes) complains if clocks drift by more than 1 second.
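To make these tolerances concrete, here is a minimal sketch that checks a measured clock offset against the two limits quoted above. The names `TOLERANCES` and `systems_complaining` are made up for illustration and are not part of Ceph or etcd.

```python
# Tolerances quoted in this post, in seconds (illustrative, not read from any config).
TOLERANCES = {"Ceph": 0.05, "etcd": 1.0}


def systems_complaining(offset_seconds: float) -> list[str]:
    """Return the systems whose tolerance is exceeded by the given clock offset."""
    return [
        name
        for name, limit in TOLERANCES.items()
        if abs(offset_seconds) > limit
    ]


print(systems_complaining(0.2))   # only Ceph's limit is exceeded
print(systems_complaining(0.01))  # within both limits
```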
But what “time” do you care about? Physical time or human time?
Let’s look at two timescales:
- TAI: This is the time given by atomic clocks worldwide. This is as close as one can get to the physical definition of time.
- UT1: This is given by Earth’s rotation, i.e., the passing of the sun above longitude 0°. This is as close as one can get to the human expectation of time.
Problem: Earth’s rotation is unpredictable
UT1 (“Earth time”) and TAI (“physical time”) have drifted apart by about 37 seconds: 10 seconds before leap seconds were introduced in 1972, plus 27 leap seconds since.
37 seconds?!?!?! That’s an eternity to both Ceph and etcd!
Solution: UTC, a compromise between physical and human time
UTC ticks at the same rate as TAI, but adds or removes leap seconds to stay within 0.9 seconds of UT1.
In essence, when Earth takes longer than 24 hours to rotate, a so-called leap second is added. When it takes less than 24 hours, a leap second is removed.
- NTP (Network Time Protocol) uses UTC.
- GPS uses its own timescale: TAI minus a constant 19 seconds, with no leap seconds.
- UNIX timestamps represent UTC … but cannot capture leap seconds. 🤦
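Here is a small sketch of that last point, using only Python's standard library. It shows that the leap second 2016-12-31 23:59:60 UTC and the following second 2017-01-01 00:00:00 UTC collapse into the same UNIX timestamp. The helper `datetime_accepts_leap_second` is a made-up name for illustration.

```python
import calendar
from datetime import datetime

# calendar.timegm does plain arithmetic on the fields, so it accepts second == 60.
leap_second = calendar.timegm((2016, 12, 31, 23, 59, 60, 0, 0, 0))
next_second = calendar.timegm((2017, 1, 1, 0, 0, 0, 0, 0, 0))

# Two distinct UTC instants, one UNIX timestamp.
print(leap_second, next_second, leap_second == next_second)


def datetime_accepts_leap_second() -> bool:
    """Check whether datetime can represent 23:59:60 at all."""
    try:
        datetime(2016, 12, 31, 23, 59, 60)
    except ValueError:
        return False
    return True


print(datetime_accepts_leap_second())
```

Python's `datetime` rejects the value outright, so whatever happened during that second cannot be told apart from the second that followed it.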
Which leads me to:
Next Problem: Leap second support is buggy
What did your application log on Dec 31st, 2016 at 23:59:60? “60” here is not a typo, but a leap second.
Leap seconds are:
- 🚩 rare
- 🚩 unpredictable
- 🚩 untested
The perfect recipe for a disaster.
More Problems: Clock smearing, or when the cure is worse than the disease
To avoid leap seconds, several orgs “smear” the leap second over a 24-hour interval.
- “Google” timescale: Leap second is smeared over 24 hours centered on the leap second.
- “Amazon” timescale: Leap second is smeared over 24 hours preceding the leap second.
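To see how far the two schemes can drift apart, here is a rough sketch that models both as linear smears with the windows described above. The linear shape and all function names are assumptions made for illustration, not either vendor's actual implementation. Time `t` is in seconds relative to the leap second (negative means before it).

```python
SMEAR_WINDOW = 86_400.0  # smear duration in seconds (24 hours)


def _linear_smear(t: float, window_start: float) -> float:
    """Fraction of the leap second already absorbed at time t, clamped to [0, 1]."""
    fraction = (t - window_start) / SMEAR_WINDOW
    return min(1.0, max(0.0, fraction))


def google_smear(t: float) -> float:
    """Assumed model: 24-hour window centered on the leap second."""
    return _linear_smear(t, -SMEAR_WINDOW / 2)


def amazon_smear(t: float) -> float:
    """Assumed model: 24-hour window ending at the leap second."""
    return _linear_smear(t, -SMEAR_WINDOW)


def smear_divergence(t: float) -> float:
    """Disagreement in seconds between the two smeared clocks at time t."""
    return abs(google_smear(t) - amazon_smear(t))


for hours in (-24, -18, -12, -6, 0, 6, 12):
    print(hours, smear_divergence(hours * 3600.0))
```

Under these assumptions the two clocks disagree by half a second for the 12 hours leading up to the leap second, which is ten times the 0.05-second drift Ceph tolerates.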
🚩 Wait, are you telling me that Google’s and Amazon’s VMs weren’t clock synced on Dec 31st, 2016?
That’s exactly what I’m saying. Did I mention that sub-second clock synchronization is complex?
The Real Solution: Abandon human time
While leap seconds were a fun exercise, their unpredictability and the general public’s unawareness of them make them unsuitable for the digital age.
✨ Therefore, in November 2022, the General Conference on Weights and Measures decided to abandon the leap second by 2035.
What about you?
Were you affected by leap seconds? Please share your story in the comments on this LinkedIn post.