A look at the Year 2036/2038 problems and time proofness in various systems

14. March 2017 sturmflut

I remember the Y2K problem quite vividly. The world was going crazy for years, paying insane amounts of money to experts to fix critical legacy systems, and there was a never-ending stream of predictions from the media on how it was all going to fail. Most didn’t even understand what the problem was, and I remember one magazine writing something like the following:

Most systems store the current year as a two-digit value to save space. When the value rolls over on New Year’s Eve 1999, those two digits will be “00”, and “00” means “halt operation” in the machine language of many central processing units. If you’re in an elevator at this time, it will stop working and you may fall to your death.

I still don’t know why they thought a computer would suddenly interpret data as code, but people believed them. We could see a nearby hydropower plant from my parents’ house, and we expected it to go up in flames as soon as the clock passed midnight, while at least two airplanes crashed in our garden at the same time. Then nothing happened. I think one of the most “severe” problems was the police not being able to open their car garages the next day because their RFID tokens had both a start and end date for validity, and the system clock had actually rolled over to 1900, so the tokens were “not yet valid”.

That was 17 years ago. One of the reasons why Y2K wasn’t as bad as it could have been is that many systems had never used the “two-digit-year” representation internally, but used some form of “timestamp” relative to a fixed date (the “epoch”). Here’s a short comparison:

DCF77 and WWVB actually transmit the year as a two-digit number, so the device has to use information from other contexts (e.g. “product was not on the market before $YEAR”) to make an educated guess about the current century.
Function 2Ah (“Get Date”) offered by INT 21h on DOS returns a “year” field between 1980 and 2099.
UNIX and UNIX-like systems traditionally use the number of seconds passed since January 1, 1970, stored as a signed 32-bit value, for all APIs. The rollover will not happen before January 19, 2038, when the counter jumps back to December 13, 1901 (if the system can handle negative values correctly, which isn’t a given).
GetSystemTime() in the Windows WIN32 API returns a value between the years 1601 and 30827.
The standard C functions for timekeeping on Windows used to use 32-Bit UNIX timestamps, which also overflow in 2038.
FAT timestamps range between January 1, 1980 and December 31, 2107.
NTFS timestamps range between January 1, 1601 and May 28, 60056.
Linux filesystems like ext2/3/4, XFS, BTRFS etc. either use the UNIX 32-bit timestamps or have already switched to 64-Bit timestamps.
The Network Time Protocol (NTP) uses a 32-Bit value for the number of seconds passed since January 1, 1900, which overflows in 2036.
CD-ROMs following the ISO9660 format store the year as four ASCII digits.
Digital Video Broadcasting (DVB) uses a 16-bit counter to transmit the dates for the Electronic Program Guide in “Modified Julian Days” and overflows on April 22, 2038.
Early Apple Macintosh computers and the HFS/HFS+ filesystems count the number of seconds since January 1, 1904 as a 32-bit number. These values will wrap around on February 6, 2040.
The IBM S/370 mainframe and its successors count the number of 2^-12 microsecond (0.244 nanosecond) units since January 1, 1900 using a 64-Bit counter, rolling over on September 17, 2042.
The Global Positioning System (GPS) counts the number of weeks since January 6, 1980 using a 10-Bit counter (1024 weeks). The rollover has actually already happened once, at midnight between August 21 and 22, 1999, and the next will happen between April 6 and 7, 2019. It has always been the responsibility of the “user” to consult other sources and choose the correct current 1024-week block.
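Choosing the correct 1024-week block from outside knowledge, as the GPS entry describes, can be sketched in a few lines of Python. This is only an illustration and not how any particular receiver works: the function name resolve_gps_week and the not_before parameter (for example a firmware build date) are made up for this example.

```python
from datetime import date, timedelta

GPS_EPOCH = date(1980, 1, 6)  # start of GPS week 0

def resolve_gps_week(week_10bit, not_before):
    """Return the start date of a 10-bit GPS week number.

    Assumption: the device knows some date it cannot possibly be older than
    (`not_before`) and picks the first 1024-week block whose matching week
    has not ended before that date.
    """
    full_week = week_10bit
    while GPS_EPOCH + timedelta(weeks=full_week + 1) <= not_before:
        full_week += 1024
    return GPS_EPOCH + timedelta(weeks=full_week)

# The same transmitted week number 0, resolved with two different hints.
first_rollover = resolve_gps_week(0, not_before=date(1999, 1, 1))
next_rollover = resolve_gps_week(0, not_before=date(2017, 3, 14))
print(first_rollover, next_rollover)
```

The same trick applies to the two-digit years of DCF77 and WWVB: the received value is ambiguous, and only a hint from another context resolves it.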

What happens on a rollover?

The actual problem with time and dates rolling over is that systems calculate timestamp differences all day. Since a timestamp derived from the system clock seemingly only increases with each query, it is very common to just calculate diff = now - before and never care about the fact that now could suddenly be lower than before because the system clock has rolled over. In this case diff is suddenly negative, and if other parts of the code make further use of the suddenly negative value, things can go horribly wrong.
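To make this concrete, here is a minimal Python sketch. Python integers never overflow, so the helper to_int32 (a name invented for this example) emulates what a signed 32-bit counter would store.

```python
def to_int32(value):
    """Wrap an integer into the signed 32-bit range, like a 32-bit counter."""
    return (value + 2**31) % 2**32 - 2**31

INT32_MAX = 2**31 - 1

before = to_int32(INT32_MAX - 5)  # five seconds before the rollover
now = to_int32(INT32_MAX + 5)     # ten seconds later, after the counter wrapped

naive_diff = now - before              # the usual "diff = now - before"
wrapped_diff = to_int32(now - before)  # the same difference, reduced modulo 2^32

print(naive_diff)    # -4294967286
print(wrapped_diff)  # 10
```

Reducing the difference modulo 2^32 recovers the real interval, but only as long as the two timestamps are less than half the counter range apart. It is a workaround for short intervals, not a fix for the rollover itself.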

A good example was a bug in the generator control units (GCUs) aboard Boeing 787 “Dreamliner” aircraft, discovered in 2015. An internal timestamp counter would overflow roughly 248 days after the system had been powered on, triggering a shutdown to “safe mode”. The aircraft has four generator units, but if all were powered up at the same time, they would all fail at the same time. This sounds like an overflow caused by a signed 32-bit counter counting the number of centiseconds since boot, overflowing after 248.55 days, and luckily no airline had been using their Boeing 787 models for such a long time between maintenance intervals.
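The 248.55 days can be checked with a quick calculation. The tick rate is my guess, not something Boeing has confirmed, so treat the constant below as an assumption.

```python
TICKS_PER_SECOND = 100        # assumption: the counter ticks once per centisecond
ticks_until_overflow = 2**31  # a signed 32-bit counter overflows after 2^31 ticks

seconds = ticks_until_overflow / TICKS_PER_SECOND
days = seconds / 86400
print(round(days, 2))  # 248.55
```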

So it’s always good to know when your time and date values might overflow. For most GNU/Linux users there are currently four dates to “keep your eyes on”:

February 7, 2036 (Y2036) for the Network Time Protocol
January 19, 2038 (Y2038) for signed 32-Bit Unix epoch values (system time, filesystems)
February 7, 2106 (Y2106) for unsigned 32-Bit Unix epoch values (system time, filesystems)
January 1, 2108 for FAT timestamps
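The first three dates follow directly from the epoch and the counter width. The sketch below recomputes them in Python; the helper name first_invalid_moment is invented for this example, and leap seconds are ignored. FAT is left out because it stores separate date fields instead of a second counter.

```python
from datetime import datetime, timedelta

def first_invalid_moment(epoch, bits, signed):
    """Return the first moment a seconds counter of this width cannot represent."""
    usable_bits = bits - 1 if signed else bits
    return epoch + timedelta(seconds=2**usable_bits)

ntp = first_invalid_moment(datetime(1900, 1, 1), 32, signed=False)
unix_signed = first_invalid_moment(datetime(1970, 1, 1), 32, signed=True)
unix_unsigned = first_invalid_moment(datetime(1970, 1, 1), 32, signed=False)

# Where a signed 32-bit UNIX counter lands after wrapping to its minimum value.
wrapped_to = datetime(1970, 1, 1) + timedelta(seconds=-(2**31))

print(ntp)            # 2036-02-07 06:28:16
print(unix_signed)    # 2038-01-19 03:14:08
print(unix_unsigned)  # 2106-02-07 06:28:16
print(wrapped_to)     # 1901-12-13 20:45:52
```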

The “obvious” solution is to simply switch to 64-Bit values and call it a day, which would push overflow dates far into the future (as long as you don’t do it like the IBM S/370 mentioned before). But as we’ve learned from the Y2K problem, you have to assume that computer systems, computer software and stored data (which often contains timestamps in some form) will stay with us for much longer than we might think. The years 2036 and 2038 might be far in the future, but we have to assume that many of the things we make and sell today are going to be used and supported for more than just 19 years. Also, many systems have to store dates which are far in the future. A 30-year mortgage taken out in 2008 could have already triggered the bug, and for some banks it supposedly did.

I was once called to fix an SCO UNIX system back in 2009. The installation date seemed to have been around 1991, so I’m no stranger to 20-year-old legacy systems. Especially now, with all the “Internet of Things” (IoT) stuff, the world is flooded with billions of cheap 32-Bit ARM microcontrollers, and the media we store our filesystems on might still be around in 2038. So I decided to check if everything’s already fine or not.

Network Time Protocol rollover

NTP is used to synchronise a device’s clock with a number of very precise reference clocks over a network. It uses a two-part 64-Bit timestamp for most parts of the protocol. The first 32 bits are the number of seconds since January 1, 1900, and the second 32 bits are the fractional part. This gives NTP ~136 years between rollovers, leading to February 7, 2036, and a theoretical resolution of 233 picoseconds.
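Here is a rough Python sketch of how a UNIX time maps onto the two 32-bit fields. The function name to_ntp_timestamp is made up for this example, and real implementations take more care with rounding.

```python
NTP_UNIX_OFFSET = 2_208_988_800  # seconds from 1900-01-01 to 1970-01-01

def to_ntp_timestamp(unix_seconds, microseconds=0):
    """Return the (seconds, fraction) fields of a 64-bit NTP timestamp."""
    seconds_field = (unix_seconds + NTP_UNIX_OFFSET) % 2**32  # wraps every ~136 years
    fraction_field = (microseconds * 2**32) // 1_000_000      # units of 2^-32 seconds
    return seconds_field, fraction_field

print(to_ntp_timestamp(0))              # (2208988800, 0)
print(to_ntp_timestamp(0, 500_000))     # (2208988800, 2147483648)
print(to_ntp_timestamp(2_085_978_496))  # (0, 0), the rollover on February 7, 2036

resolution_ps = 1e12 / 2**32            # one fraction unit in picoseconds
print(round(resolution_ps, 1))          # 232.8
```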

The good news is that the 64-Bit timestamp is (literally) only half the truth. According to RFC 5905, the “Network Time Protocol Version 4: Protocol and Algorithms Specification”, an NTP timestamp is a truncated version of a full NTP date. An NTP date is a point on the NTP timescale and expressed as the signed 64-bit number of seconds relative to January 1, 1900, plus a 64-Bit fraction. Thus an NTP date is able to cover the lifetime of the universe, but is also precise enough to express the smallest amounts of time scientists can measure. When the first 32 bits in the NTP timestamp overflow, there are still another 32 bits in the full NTP date, and these bits specify the “NTP era”. Every ~136 years another era starts, and dates before January 1, 1900 are expressed in negative era values, so everything should be fine.

The actual NTP synchronization algorithm uses just the 64-Bit timestamps. When a rollover happens, messages received from some time sources might have rolled over already, while others haven’t. That can be handled, and most implementations (e.g. ntpd) seem to handle it. The bigger problem is that the current NTP era value is not exchanged as part of the protocol. Since NTP can handle a time difference between the local clock and the time sources of ±68 years (half of ~136 years), the local clock should already be within ~68 years of the actual time and date when the client is started, so it can latch on to the correct value. If it isn’t, or the implementation doesn’t handle all the boundary conditions correctly, the local clock might be “synchronized”, but to the wrong era, and might be a multiple of ~136 years off while still believing it is working correctly.

The NTP protocol could be extended to include the current NTP era, but many legacy devices will probably not get an update. You might think that this isn’t a real issue, but remember that 32-Bit UNIX systems might roll over in 2038 and jump back to December 13, 1901. NTP rolls over in 2036 and system time should be within 68 years of the current time and date. If such a 32-Bit system happens to jump back to 1901 in 2038 and then starts an NTP client, that’s more than 68 years off.
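The scenario can be played through with a small Python sketch. It is not taken from ntpd or any other implementation: resolve_full_ntp_seconds is a hypothetical helper that simply picks the full second count closest to the local clock, and the example assumes the 32-Bit system clock wrapped cleanly by 2^32 seconds.

```python
NTP_UNIX_OFFSET = 2_208_988_800  # seconds from 1900-01-01 to 1970-01-01
ERA_SECONDS = 2**32              # length of one NTP era in seconds

def resolve_full_ntp_seconds(seconds_field, local_ntp_seconds):
    """Return the full second count closest to the local clock whose low
    32 bits equal the received seconds field."""
    delta = (seconds_field - local_ntp_seconds + 2**31) % ERA_SECONDS - 2**31
    return local_ntp_seconds + delta

# Healthy client: local clock is 100 s into era 1, the server is 50 s into era 1.
healthy = resolve_full_ntp_seconds(50, ERA_SECONDS + 100)

# Wrapped client: the true UNIX time is one hour past the signed 32-bit limit,
# but the local clock has jumped back by 2**32 seconds to December 1901.
true_ntp = (2**31 + 3600) + NTP_UNIX_OFFSET
local_ntp = (2**31 + 3600 - 2**32) + NTP_UNIX_OFFSET
wrapped = resolve_full_ntp_seconds(true_ntp % ERA_SECONDS, local_ntp)

print(healthy // ERA_SECONDS)  # 1, the correct era
print(wrapped // ERA_SECONDS)  # 0, the wrong era
print(true_ntp - wrapped)      # 4294967296 seconds of error
```

Under these assumptions the wrapped clock is off by exactly one NTP era, so the 32-bit seconds fields agree and such a client would not even see an offset to correct.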

(Please notice all the “if”s and “might/could”s, but I find this a fun brain exercise.)

32-Bit UNIX timestamp rollover

sys_gettimeofday() is one of the most used system calls on a generic Linux system and returns the current time in the form of a UNIX timestamp (time_t data type) plus a fraction (suseconds_t data type). Many applications have to know the current time and date to do things, e.g. displaying it, using it in game timing loops, invalidating caches after their lifetime ends, performing an action after a specific moment has passed, etc. In a 32-Bit UNIX system, time_t is usually defined as a signed 32-Bit Integer.

When kernel, libraries and applications are compiled, the compiler will turn this assumption into machine code, and all components later have to match each other. So a 32-Bit Linux application or library still expects the kernel to return a 32-Bit value even if the kernel is running on a 64-Bit architecture and has 32-Bit compatibility. The same holds true for applications calling into libraries. This is a major problem, because there will be a lot of legacy software running in 2038. Systems which used an unsigned 32-Bit Integer for time_t push the problem back to 2106, but I don’t know of many of those.
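What such a mismatch looks like at the byte level can be imitated with Python’s struct module. This is only an analogy for two components that disagree about the width of time_t, not a reproduction of any real kernel interface.

```python
import struct

after_2038 = 2**31 + 10  # ten seconds past the signed 32-bit limit

# A component built with a 64-bit time_t writes 8 bytes for the seconds ...
written = struct.pack("<q", after_2038)

# ... while a component built with a 32-bit time_t reads only the first 4 bytes.
(misread,) = struct.unpack("<i", written[:4])
print(misread)  # -2147483638, a date in December 1901

# The value cannot be stored in a signed 32-bit field at all.
try:
    struct.pack("<i", after_2038)
    fits_in_32_bits = True
except struct.error:
    fits_in_32_bits = False
print(fits_in_32_bits)  # False
```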

The developers of the GNU C library (glibc), the default standard C library for many GNU/Linux systems, have come up with a design for year 2038 proofness for their library. Besides the time_t data type itself, a number of other data structures have fields based on time_t or the combined struct timespec and struct timeval types. Many methods besides those intended for setting and querying the current time use timestamps, among them:

msgctl() for operating on System V message queues can tell the application when a queue was last operated on
All methods operating on filesystem timestamps, e.g. ftime() and fstat()
All methods operating on timers
Methods operating on sockets, e.g. select()
Methods for process synchronisation, e.g. wait3() and wait4()
Methods for thread synchronisation, e.g. pthread_mutex_timedlock() and pselect()
Methods for accounting, e.g. getrusage() and getutent()

As you can see, time is an important thing. I’ll spare you all the technical details of the glibc design, but it seems to be clear that there won’t be duplicate method calls for 32-Bit and 64-Bit. Instead, the programmer will have to decide at compile time which API to use if a 32-Bit application wants the 64-Bit versions. Existing 32-Bit applications will still use the 32-Bit calls and still be susceptible to the Y2038 problem.

32-Bit Windows applications, or Windows applications defining _USE_32BIT_TIME_T, can be hit by the year 2038 problem too if they use the time_t data type. The __time64_t data type has been available since Visual C 7.1, but only Visual C 8 (default with Visual Studio 2005) expanded time_t to 64 bits by default. The change only takes effect after a recompilation, so legacy applications will continue to be affected.

If you live in a 64-Bit world and use a 64-Bit kernel with 64-Bit only applications, you might think you can just ignore the problem. In such a configuration all instances of the standard time_t data type for system calls, libraries and applications are signed 64-Bit Integers which will overflow in around 292 billion years. But many data formats, file systems and network protocols still specify 32-Bit time fields, and you might have to read/write this data or talk to legacy systems after 2038. So solving the problem on your side alone is not enough.
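The 292 billion years are easy to verify. The sketch assumes an average Gregorian year of 365.2425 days.

```python
SECONDS_PER_YEAR = 365.2425 * 86400  # assumption: average Gregorian year

seconds_until_overflow = 2**63       # positive range of a signed 64-bit counter
years = seconds_until_overflow / SECONDS_PER_YEAR
print(round(years / 1e9))            # 292 (billion years)
```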

File system rollovers

Linux filesystems with 32-Bit timestamps and HFS/HFS+ timestamps will overflow in 2038 and 2040, respectively. I fully expect to encounter such file systems in 2038, maybe while migrating some legacy system to a new platform or while cleaning up my attic. I don’t think archeologists will find and use one of my SD cards with a FAT file system after the year 2107, but FAT is still going strong and FAT32 filesystems might still be created for decades to come.

Many filesystems have added support for 64-Bit values or used them to begin with, not only to avoid timestamp overflows, but also because 32-Bit values can impose several limits on maximum file and maximum volume sizes. Depending on the tools used, the default for new filesystems might vary. I’m using e2fsprogs 1.43.4 on Arch Linux for example, and the default for mkfs.ext4 seems to be to enable the 64bit flag.

It’s hard to say what will happen to a running system if the system clock overflows to 1901 and all the applications are suddenly confronted with file creation/modification times as far as 137 years in the future. Also all new files will be created with the “current” date, so new files will be “older” than existing files and applications relying on file dates (Makefiles, synchronization solutions, backup etc.) might fail. There are simply too many running processes on a modern machine, and every single one may have to cope with file system timestamp rollovers on its own.

I wondered for a while whether the problem could be solved by asking the user if a removable drive contains a file system created before January 19, 2038 and then doing a transparent conversion of times and dates in the background, but the moment you write to such a filesystem the current date has to be stored, which wouldn’t be possible without tricks. You could for example say “this timestamp looks like it has overflowed, the file can’t really be from 1910, so this timestamp is probably relative to January 19, 2038”. But where do you draw the line between “looks like it has overflowed” and “is genuine”? Maybe a file actually has a creation date in 1969, but you draw the line at 1970.
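Such a heuristic could look like the following Python sketch. The function interpret_with_pivot and its pivot_year parameter are invented for this example, and the second call shows exactly the misjudgement described above.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def interpret_with_pivot(raw, pivot_year=1970):
    """Interpret a signed 32-bit timestamp. Values dated before `pivot_year`
    are assumed to have wrapped and are shifted forward by 2^32 seconds."""
    moment = EPOCH + timedelta(seconds=raw)
    if moment.year < pivot_year:
        moment = EPOCH + timedelta(seconds=raw + 2**32)
    return moment

# A value written one second after the rollover is recovered correctly ...
recovered = interpret_with_pivot(-(2**31))
print(recovered)  # 2038-01-19 03:14:08

# ... but a genuine timestamp from December 31, 1969 ends up in the year 2106.
misjudged = interpret_with_pivot(-86400)
print(misjudged)  # 2106-02-06 06:28:16
```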

So I guess one will have to mount existing drives read-only, back up all the data and then maybe create a new filesystem with larger timestamps on the existing drive.

How to avoid the problem

While reading up on the topic I have come across some solutions for avoiding the problem of overflowing timestamps:

Use at least 64-Bit values and make sure your timescale reaches much further into the future than just 100 years. NTFS overflows in the year 60056, which sounds like a reasonable starting point.
Operate on strings, as ISO 8601 suggests and SAP is already doing. Make sure all buffers and all methods used to perform date calculations can handle at least the lifetime of the universe (but you never know, your software might still be there after the next Big Bang.)
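As a small illustration of the string approach, the Python sketch below sorts and subtracts ISO 8601 strings on both sides of the 2038 boundary. Note that Python’s own datetime type only covers the years 1 to 9999, so it does not meet the “lifetime of the universe” requirement either.

```python
from datetime import datetime

stamps = ["2038-01-19T03:14:08", "1999-12-31T23:59:59", "2038-01-19T03:14:07"]

# ISO 8601 strings in the same format sort chronologically as plain text.
ordered = sorted(stamps)
print(ordered)

# Calculations across the 32-bit boundary work once the strings are parsed.
first = datetime.fromisoformat(ordered[1])
second = datetime.fromisoformat(ordered[2])
gap = (second - first).total_seconds()
print(gap)  # 1.0
```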
