Bug report: RTC and NTP fail to recover


#1

I had a bug where IoTaWatt was always failing the call to rtc.initialize() in Setup.cpp and also taking a very long time to obtain the current time via NTP.

The first problem is that my RTC started in a disabled state (Control_3 had a value of 0xE0). IoTaWatt would never reset the RTC into an enabled state (Typically I assume you want to use 0 for the top 3 bits of the Control_3 register of RTC to enable both “Standard Battery Switchover” and “Low Battery Detection”).

I see in timeServices.cpp the syncRtc state has a comment indicating that rtc.adjust() also changes the mode enabling both “Standard Battery Switchover” and “Low Battery Detection”. However, rtc.adjust() was never called.

You can reproduce this problem by changing the code to set the Control_3 register to 0xE0 in setup. After I did this once, the state was retained by my RTC and the problem showed up again: every time I rebooted, I needed a successful NTP query before anything would start working.
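The reproduction step might look roughly like the sketch below. It is written against a generic bus type so that it can be compiled off-device; on the hardware the bus would be the Arduino Wire object. The names forceControl3 and FakeBus are hypothetical and not part of the IoTaWatt firmware. The constants 0x68 and 0x02 are assumed to match PCF8523_ADDRESS and PCF8523_CONTROL_3.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint8_t kPcf8523Address = 0x68;  // assumed equal to PCF8523_ADDRESS
constexpr uint8_t kControl3Reg    = 0x02;  // assumed equal to PCF8523_CONTROL_3

// Write `value` into Control_3. `Bus` only needs the three
// Wire-style calls used here, so Wire itself can be passed on the device.
template <typename Bus>
void forceControl3(Bus& bus, uint8_t value) {
    bus.beginTransmission(kPcf8523Address);
    bus.write(kControl3Reg);  // select the register
    bus.write(value);         // new register contents
    bus.endTransmission();
}

// Hypothetical stand-in for Wire that records what was written,
// for trying the sketch on a desktop compiler.
struct FakeBus {
    bool open = false;
    uint8_t address = 0;
    std::vector<uint8_t> bytes;

    void beginTransmission(uint8_t addr) { open = true; address = addr; bytes.clear(); }
    void write(uint8_t b) { assert(open); bytes.push_back(b); }
    void endTransmission() { open = false; }
};
```

On the device, a single call such as forceControl3(Wire, 0xE0) in setup should be enough to put the RTC into the disabled state; the call can then be removed again, since the RTC retains the value.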

I don’t know how my RTC got into this disabled state in the first place (I think it just started off that way from the beginning). I think the IoTaWatt firmware should probably be made resilient against this problem.

My solution was to update the conditional for the rtc.adjust() call from:

if(timeDiff < -1 || timeDiff > 1)

to:

// Read Control_3
Wire.beginTransmission(PCF8523_ADDRESS);
Wire.write((byte)PCF8523_CONTROL_3);
Wire.endTransmission();
Wire.requestFrom(PCF8523_ADDRESS, 1);
uint8_t Control_3 = Wire.read();

if(timeDiff < -1 || timeDiff > 1 || ((Control_3 & 0xE0) != 0))

This forces the adjust call to run if the current state is not what we expect, which sets the RTC into the desired state (Standard Switchover and Low Detection), but only after getting a correct time from NTP.
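The decision in that conditional can also be pulled out into a small pure function, which makes it easy to check off-device. This is only a sketch: the names kPowerModeMask, rtcModeNeedsReset and shouldAdjustRtc are hypothetical, and the Control_3 value is assumed to have been read from the RTC already, as in the snippet above.

```cpp
#include <cstdint>

// The top three bits of Control_3 select the power-management mode.
// All zero means standard battery switchover plus low battery detection.
constexpr uint8_t kPowerModeMask = 0xE0;

// True when the RTC is not in the expected power-management mode.
constexpr bool rtcModeNeedsReset(uint8_t control3) {
    return (control3 & kPowerModeMask) != 0;
}

// True when rtc.adjust() should run: either the RTC has drifted from
// NTP by more than one second, or its mode bits are not what we expect.
constexpr bool shouldAdjustRtc(int32_t timeDiff, uint8_t control3) {
    return timeDiff < -1 || timeDiff > 1 || rtcModeNeedsReset(control3);
}

// Correct time but disabled mode: still adjust.
static_assert(shouldAdjustRtc(0, 0xE0), "disabled mode must force an adjust");
// Correct time and expected mode: leave the RTC alone.
static_assert(!shouldAdjustRtc(1, 0x00), "in-tolerance clock is not adjusted");
```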

The other problem I was having was with older firmware, where an NTP query would always time out. I have a high ping. The increased timeout helps somewhat, but a further improvement for me was switching to the NTP pool servers instead of the NIST ones. The NIST ones apparently work well in the USA but not so well outside.

That is, I changed the NTP host from “time.nist.gov” to “pool.ntp.org”. This greatly reduced the time required to synchronize with the NTP servers for me.

A more complete change would be to add this option to the config file. But that may be overkill.

Would you be willing to accept pull requests for these two changes?

Thanks,
Brendon.


#2

I noticed that this was first published in the OEM forum in the homebrew category. This may be a hardware problem that is exacerbated by the latency of accessing the North American NIST time servers from Australia.

If this is homebrew hardware, what schematic version was it based on? There was a problem, I believe in hardware prior to 4.2, in that the RTC was not properly powered. The datasheet calls for an RC circuit on the Vdd pin so that on power failure, the chip is powered down slowly enough to recognize the falling voltage and make the switch to battery. That might account for your setup always restarting with an uninitialized RTC.

The RTC is not considered initialized until it is set after a successful NTP query. Control_3 is set to 0xE0 by the clock on power-up if it has not been running on battery until then.

I can see how this would be a problem, because you are indicating that the rtc is uninitialized, yet it may still have a viable time. If that’s the case, you’re right that the RTC will not subsequently be adjusted until it differs from the NTP time by at least 2 seconds.

But the clock should never be in the “uninitialized” state yet have the correct time, unless the situation is artificially created as you have described.

Making these kinds of changes flirts with the laws of unintended consequences. Here’s what happens when the RTC is working properly:

The Setup code notices that the RTC is not yet initialized and refrains from setting the status RTCrunning = true;

Subsequent startup code runs fine, except that the messages that are issued are not time stamped.

At the end of setup, the various Services are started, including the basic timer service “timeSync”. TimeSync is dispatched straight away and, like most of the other services, is state driven. It initializes and drops into setRTC state. If RTCrunning is true, the state is advanced to syncRTC. If not, getNTPtime is invoked to try and get a time fix. If that fails, the process is repeated every ten seconds until it succeeds. We can’t do any data logging if we don’t know what time it is.

Once the time is acquired, RTCrunning is set to true and NTP time is calibrated to the ESP millisecond clock. From here forward, the IoTaWatt has a viable time reference and can proceed to measure and log power, its raison d’être. And the state is advanced to syncRTC to maintain synchronization with NTP.

The syncRTC state is invoked right away in an immediate redispatch of the Service and if the time in the RTC differs from the NTP time by more than a second, the RTC is adjusted, and in the process marked initialized.
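The flow described above can be sketched as a small state machine. This is a simplified model for illustration, not the code in timeServices.cpp: the TimeSync struct, the dispatch function and the kResyncSeconds interval are assumptions, and the NTP query and the RTC are simulated by plain values.

```cpp
#include <cassert>
#include <cstdint>

enum class SyncState { Initialize, SetRtc, SyncRtc };

struct TimeSync {
    SyncState state = SyncState::Initialize;
    bool rtcRunning = false;      // stands in for the RTCrunning flag
    bool rtcInitialized = false;  // set once the RTC has been adjusted
    uint32_t rtcTime = 0;         // seconds held by the simulated RTC
};

constexpr uint32_t kRetrySeconds  = 10;    // retry interval from the description
constexpr uint32_t kResyncSeconds = 3600;  // assumption, not taken from the firmware

// Run one dispatch of the service. `ntpOk` and `ntpTime` simulate the
// outcome of an NTP query. Returns the number of seconds until the next
// dispatch; 0 means redispatch immediately.
uint32_t dispatch(TimeSync& ts, bool ntpOk, uint32_t ntpTime) {
    switch (ts.state) {
    case SyncState::Initialize:
        ts.state = SyncState::SetRtc;
        return 0;

    case SyncState::SetRtc:
        if (!ts.rtcRunning) {
            if (!ntpOk) return kRetrySeconds;  // no time fix yet, try again later
            ts.rtcRunning = true;              // time reference acquired
        }
        ts.state = SyncState::SyncRtc;
        return 0;

    case SyncState::SyncRtc: {
        assert(ts.rtcRunning);  // syncRTC is only reached with a time reference
        if (ntpOk) {
            const int64_t diff = static_cast<int64_t>(ts.rtcTime) -
                                 static_cast<int64_t>(ntpTime);
            if (diff < -1 || diff > 1) {
                ts.rtcTime = ntpTime;      // corresponds to rtc.adjust()
                ts.rtcInitialized = true;  // adjusting also marks the RTC initialized
            }
        }
        return kResyncSeconds;
    }
    }
    return 0;  // not reached; keeps compilers quiet
}
```

In this model, an RTC that starts uninitialized but already holds a time within a second of NTP passes through syncRTC without ever being adjusted, so rtcInitialized stays false. That is the artificially created case described in the report.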

It may make sense to do an adjustment in setRTC. That would be the place to do it.

So I suspect that you don’t have the RC circuit on your RTC and that fixing that, along with the increased NTP timeout will eliminate your problem.

As for the NIST servers, I only know that they work well in and around the USA. I looked at the pool.ntp.org program and it looks as if there is less of a standard for those servers. It appears to be more or less a voluntary collection of servers. The time.nist.gov servers appear to be more directly linked to a standard. That said, IoTaWatt just needs to know about what second it is. I’ll look into that further and may make the change dependent on the time zone to recognize units not in the Americas. In the meantime, as a homebrew, you can make the change for yourself.

Let me know on that hardware.

UPDATE: I added a couple of lines of code to timeServices to initialize the clock on the first successful NTP query. This is an improvement because it eliminates the message:

timeSync: adjusting RTC by 170285420

when first setting the clock. I still think the real problem here is the lack of an RC circuit powering the RTC.


#3

I am actually using the Adafruit RTC breakout: https://www.adafruit.com/product/3295 in my IoTaWatt, which has an RC circuit on the VDD pin. I had used this in my prototyping stage in a breadboard, so I figured I could just reuse it and not have to buy more parts.

After I found the issue and initialized the Control_3 register once (then removed the code to initialize it), all was fine and I haven’t seen the problem again. Sorry I wasn’t clear. I have not seen the problem since unless I manually trigger it (in the reproduction steps shown above). I just thought if a production IoTaWatt happened to get into that state somehow it may also have the same problem of never recovering.

Yep, doing the adjustment in setRTC is a better solution. I will change mine to do that instead of in syncRTC.