Leap Seconds: When the cure is worse than the disease

🕕 Let’s talk about clock synchronization.

🗒️ This is essential for stable and secure Kubernetes clusters.

This is the story of leap seconds and the complexity of sub-second time synchronization.

Context: Regulations and software need accurate time

Clocks that are accurate to within a fraction of a second are essential for establishing a “happened-before” or “cause-effect” relationship in distributed systems.

But what “time” do you care about? Physical time or human time?

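Let’s look at two timescales: UT1, which follows Earth’s rotation (human time), and TAI, which is kept by atomic clocks (physical time).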

Problem: Earth’s rotation is unpredictable

UT1 (“Earth time”) and TAI (“physical time”) have drifted apart by about 37 seconds, 27 of them since leap seconds were introduced in 1972.

37 seconds?!?!?! That’s an eternity to both Ceph and etcd!

Solution: UTC, a compromise between physical and human time

UTC ticks at the same rate as TAI, but adds or removes leap seconds to stay within 0.9 seconds of UT1.

In essence, when Earth takes longer than 24 hours to complete a rotation, a so-called leap second is added. When it takes less than 24 hours, a leap second is removed.
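
To make the relationship between UTC and TAI concrete, here is a minimal Python sketch that looks up the TAI minus UTC offset for a given instant. The table is deliberately truncated to the four most recent leap seconds, and the names `TAI_MINUS_UTC` and `tai_minus_utc` are made up for this illustration; a real system would load the full, maintained leap second list instead.

```python
from bisect import bisect_right
from datetime import datetime, timezone

# (UTC instant from which the offset applies, TAI - UTC in whole seconds).
# Truncated for illustration: only the four most recent leap seconds.
TAI_MINUS_UTC = [
    (datetime(2009, 1, 1, tzinfo=timezone.utc), 34),
    (datetime(2012, 7, 1, tzinfo=timezone.utc), 35),
    (datetime(2015, 7, 1, tzinfo=timezone.utc), 36),
    (datetime(2017, 1, 1, tzinfo=timezone.utc), 37),
]


def tai_minus_utc(utc: datetime) -> int:
    """Return how many seconds TAI is ahead of UTC at the given UTC instant."""
    starts = [start for start, _ in TAI_MINUS_UTC]
    i = bisect_right(starts, utc) - 1
    if i < 0:
        raise ValueError("instant predates this truncated table")
    return TAI_MINUS_UTC[i][1]


before = tai_minus_utc(datetime(2016, 12, 31, 23, 59, 59, tzinfo=timezone.utc))
after = tai_minus_utc(datetime(2017, 1, 1, tzinfo=timezone.utc))
print(before, after)  # the offset grows by one across the leap second
```

The point of the sketch is that the offset is a step function: it only changes when a leap second is announced, so the table cannot be computed ahead of time.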

Which leads me to:

Next Problem: Leap second support is buggy

What did your application log on Dec 31st, 2016 at 23:59:60? “60” here is not a typo, but a leap second.
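
As a small illustration of how shaky the support is, the Python sketch below checks two things with the standard library only: whether `datetime` can hold 23:59:60 at all, and what POSIX-style epoch seconds make of it. The helper name `fits_in_datetime` is made up for this example.

```python
import calendar
from datetime import datetime


def fits_in_datetime(year, month, day, hour, minute, second) -> bool:
    """Return True if datetime accepts the given wall-clock fields."""
    try:
        datetime(year, month, day, hour, minute, second)
        return True
    except ValueError:
        return False


leap = (2016, 12, 31, 23, 59, 60)   # the leap second itself
after = (2017, 1, 1, 0, 0, 0)       # the second that follows it

# datetime only allows seconds in the range 0..59.
representable = fits_in_datetime(*leap)

# Epoch seconds have no slot for the leap second, so it collides
# with the first second of the new year.
same_posix = calendar.timegm(leap) == calendar.timegm(after)

print(representable, same_posix)
```

In other words, one part of the standard library refuses the timestamp outright, while another maps two distinct seconds to the same number.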

Leap seconds are:

The perfect recipe for a disaster.

More Problems: Clock smearing or when the cure is worse than the disease

To avoid leap seconds, several orgs “smear” the leap second over a 24-hour interval.
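
A rough Python sketch of a linear smear shows why two smeared clocks can disagree. It assumes the smear window is centered on the leap second, and it compares a 24-hour window with a hypothetical 20-hour one; both window lengths and the function name `absorbed` are assumptions for illustration, not a description of any particular provider’s setup.

```python
def absorbed(t_rel: float, window: float) -> float:
    """Fraction of the leap second a linearly smeared clock has absorbed.

    t_rel  -- seconds relative to the leap second (negative = before it)
    window -- total smear length in seconds, centered on the leap second
    """
    elapsed = t_rel + window / 2
    # Clamp: nothing absorbed before the window, everything after it.
    return min(1.0, max(0.0, elapsed / window))


HOUR = 3600.0
six_hours_before = -6 * HOUR

a = absorbed(six_hours_before, 24 * HOUR)  # 24-hour smear
b = absorbed(six_hours_before, 20 * HOUR)  # hypothetical 20-hour smear

# Each clock has shifted by its absorbed fraction of one second,
# so the difference is how far apart the two clocks are.
disagreement_ms = abs(a - b) * 1000
print(f"{disagreement_ms:.0f} ms apart")
```

Under these assumptions the two clocks are tens of milliseconds apart six hours before the leap second, and each of them is up to half a second away from unsmeared UTC at the leap second itself.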

🚩 Wait, are you telling me that Google’s and Amazon’s VMs weren’t clock synced on Dec 31st, 2016?

That’s exactly what I’m saying. Did I mention that sub-second clock synchronization is complex?

The Real Solution: Abandon human time

While leap seconds were a fun exercise, their unpredictability and the general public’s unawareness of them make them unsuitable for the digital age.

✨ Therefore, in November 2022, the General Conference on Weights and Measures decided to abandon the leap second by 2035.

What about you?

Were you affected by leap seconds? Please share your story in the comments on this LinkedIn post.