Time synchronization sometimes goes crazy

One of the best features of Iotawatt is the local storage of data on the SD card. Unfortunately, this feature could not save me in this case. Two iotawatts are located in the very north of Germany, they are connected to the same WLAN. Since the alpha version 02_03_06 strange things happen here with the time correction. The time jumps back and forth between 2018 and 2036:

5/21/18 06:47:11z EmonService: retry count 6350
5/21/18 06:47:21z EmonService: retry count 6360
5/21/18 06:47:31z EmonService: retry count 6370
2/07/36 06:28:17z timeSync: adjusting RTC by 559081614
5/21/18 11:01:24z timeSync: adjusting RTC by -559081615
5/24/18 06:03:01z timeSync: adjusting RTC by -2
2/07/36 06:28:16z timeSync: adjusting RTC by 558750417
5/25/18 07:01:20z timeSync: adjusting RTC by -558750417
5/26/18 21:01:18z timeSync: adjusting RTC by -2
2/07/36 06:28:16z timeSync: adjusting RTC by 558494812
5/28/18 06:01:24z timeSync: adjusting RTC by -558494812
5/29/18 14:55:14z timeSync: adjusting RTC by -2
5/31/18 21:01:28z timeSync: adjusting RTC by -2
6/03/18 14:02:03z timeSync: adjusting RTC by -2

Unfortunately, this means that the history log and the current log no longer function correctly. I assume it helps here only a reset of the two data logs?

10

Here are the logs of both devices:

garage.txt (87.8 KB)

wohnung.txt (48.8 KB)

I’m not sure this is coincident with 02_03_06. Looking back, timer services were restructured for release 02_03_01 to be more asynchronous. That was back on March 3. But that’s not important.

From what I see in the logs, there were serious WiFi connectivity issues from sometime around May 21 through May 28. Both suffered long WiFi outages, and both appeared to get bogus time references several times over several days. My suspicion is that the bad time updates were actually convoluted error conditions in the UDP NTP transactions.

The UDP protocol doesn’t have error detection and correction like TCP, so in looking at the code, I think there needs to be some integrity checking on the response packet from UDP. I’ll do that.

Regarding your logs, I believe that you are right that it would be best to delete them and start over. There is probably some viable history in both history logs, but recovery would not be automatic and the integrity of the underlying file structure is questionable.

Although the time issue appears to be OK for the last week, there still appears to be a problem with the WiFi as the wohnung unit cannot download the ALPHA release. The one time it succeeded this morning, it took 45 seconds and failed to verify. Not good. I have a suspicion that the signature verification problems after successful download are caused by some kind of problem in the firmware. I’ve been through it multiple times and have an instrumented unit downloading here to try to get more information. But in the end it does succeed with a good connection. If not for the VPN connectivity I’d switch back to lwip 1.4. Most of these problems seem to be coincident with the change to lwip 2.

1 Like

Since version 02_03_10 already made it into the minor update class and I wanted to use NGINX, I set the two devices mentioned in this thread to alpha again. With version 02_02_30 both devices were rock-solid and without problems, but with the current alpha version the time synchronization error occurred again and damaged the data logs. Even if the error occurs only in very rare cases and with very bad WiFi connection, it is probably still not advisable to use this version for Major. The data logs probably can’t be recovered anymore, can they?

** Restart **

SD initialized.
8/04/18 04:13:12z Real Time Clock is running. Unix time 1533355992 
8/04/18 04:13:12z Version 02_03_10
8/04/18 04:13:12z Reset reason: Software/System restart
8/04/18 04:13:12z Trace:  18:2, 18:3, 18:2, 18:3, 18:2, 18:3, 18:2, 18:3, 18:2, 18:3, 18:2, 18:3, 18:2, 18:3, 18:2, 18:3, 18:2, 18:3, 18:2, 18:3, 18:2, 18:3, 18:2, 18:3, 18:2, 18:3, 18:4, 18:5, 1:6, 1:3, 1:4, 1:5[21]
8/04/18 04:13:12z ESP8266 ChipID: 6920630
8/04/18 04:13:12z Local time zone: 0
8/04/18 04:13:12z device name: garage, version: 3
8/04/18 04:13:15z Connecting with WiFiManager.
8/04/18 04:16:24z No WiFi connection.
8/04/18 04:16:24z MDNS responder started
8/04/18 04:16:24z You can now connect to http://garage.local
8/04/18 04:16:24z HTTP server started
8/04/18 04:16:24z dataLog: service started.
8/04/18 04:16:25z dataLog: Last log entry 8/4/18 04:13:10 
8/04/18 04:16:25z statService: started.
8/04/18 04:16:25z timeSync: service started.
8/04/18 04:16:25z historyLog: service started.
8/04/18 04:16:25z historyLog: Last log entry 8/4/18 04:13:00 
8/04/18 04:16:29z EmonService: started. url:my.rexometer.de:80,node:IotaWatt,interval:10, encrypted POST
8/04/18 05:00:54z WiFi connected. SSID: Klaus, IP: 192.168.2.4
8/04/18 05:00:54z Updater: service started. Auto-update class is ALPHA
8/04/18 05:01:09z Updater: Auto-update is current for class ALPHA.
8/04/18 05:01:13z EmonService: Start posting at 8/3/18 22:15:00 
8/04/18 05:01:15z EmonService: HTTP response 307, retrying.
8/04/18 06:01:01z Updater: Invalid response from server. HTTPcode: 307
8/04/18 07:57:30z EmonService: Retry successful after 261 attempts.
2/07/36 06:28:17z timeSync: adjusting RTC by 552605226
8/04/18 10:01:13z timeSync: adjusting RTC by -552605226
8/04/18 12:01:15z timeSync: adjusting RTC by 2

Full log: iotawatt_log.txt (19.7 KB)

08

Current local time 5.8.2018 12:06

@Simsala,
Just getting to this Elias. I don’t see where the upgrade from 02_02_30 happened. The log seems to begin on July 23 with 02_03_09. I see that there were some issues there on July 23rd, and later between July 24 and 26, and then probably on Aug 2. I don’t think it’s related, but is this using the NGINX proxy to access Emoncms.org?

I do see a lot of WiFi problems throughout. I have not done anything yet to try to mitigate the effects of errors in the UDP NTP responses. I’ll get on that this week.

Your datalogs look to be intact, although they are back to the future right now. The current and history logs end on August 8 and 6 respectively. If you were to delete the current log, it will be recreated as of August 6. I’m not exactly sure what will happen then. I think it will just not log until then then pick up like nothing happened (unless there is another NTP event). If you then delete the history log, it should start working again as of then.

Although the code may look substantially different, I think the reason the reduced blocking version is giving this trouble is that it has a much longer timeout, which opens the door wider for I/O problems. I think it’s reasonable to just not accept an 18 year time adjustment, so I’ll get that in first.

Can you do anything about your WiFi?

Thanks Bob, that sounds reasonable.

Unfortunately I can’t do anything at the moment, because I am more than 500 km away.

I am also not so sure anymore if it is really due to a bad Wi-Fi connection, because one of the two devices in the household is not far away from the router and is also affected by the strange time synchronization problem.

Good idea to just delete the current log and hope that it will continue to run normally tonight. I have now temporarily returned to the 02_02_30 firmware.

Yeah, sorry, I didn’t copy the whole log. I was there about two weeks ago when I changed the update class from Major back to Alpha, because I wanted to use NGINX.

Interestingly that someone else is affected, (RTC issues causing possible history log corruption - #4 by mabnz) had already feared that the local network structure caused the problem. I have now done some more research and it is no coincidence that the time jumps to 2036 (RTC rollover in 2036):

NTP uses an epoch of January 1, 1900 so the first rollover will be on February 7, 2036.

Source: Year 2038 problem - Wikipedia

In this thread the cause is mentioned:

I have seen in some case that NTP can and will reset a computers time to the
year 2036. this seems to be the case when a machne senses a NTP time packet
that contain the time 0000000000.0000000000.

Source: http://compgroups.net/comp.protocols.time.ntp/-ntp-sets-time-to-year-2036/205808

Perhaps another more precise solution would be not to allow this time? Even in 2036, this time stamp will be valid for a maximum of one day.

Actually, it would be valid for exactly one second. I’ve seen this several other times, even on a particularly WiFi stressed test system that I maintain here. Without getting into the nitty-gritty of what I’ve done about it, suffice to say that date won’t be appearing post 02_03_12.

At the heart of the matter is that the NTP system uses UDP which has no error checking. NTP has the ability to use an MD5 hash for validation, but that’s overkill and is really intended to authenticate rather than validate in secure systems.

The most important development is that 02_03_12 does seem to finally fix the memory leaks that have plagued recent development releases since the introduction of LwIP2, so I can now fast track that into general release.

1 Like