System freezes from time to time

I have a weird issue where the entire system freezes from time to time, for up to 8 minutes, resulting in data not being collected and logged. I first noticed it when I saw nulls in my data:

2022-06-19 06:54:00, 0.1, 1.9, 100.7, 5.6, 127.1, 13.2, 1.1, 5.8, 42.2, 0.1
2022-06-19 06:55:00, 0.1, 2.1, 99.8, 5.8, 127.2, 13.4, 1.2, 5.6, 43.3, 0.1
2022-06-19 06:56:00, null, null, null, null, null, null, null, null, null, null
2022-06-19 06:57:00, null, null, null, null, null, null, null, null, null, null
2022-06-19 06:58:00, null, null, null, null, null, null, null, null, null, null
2022-06-19 06:59:00, null, null, null, null, null, null, null, null, null, null
2022-06-19 07:00:00, 0.1, 2.0, 97.9, 5.6, 128.1, 13.3, 1.1, 5.9, 39.9, 0.1
2022-06-19 07:01:00, 0.1, 1.9, 100.0, 5.5, 131.2, 13.1, 1.1, 5.8, 39.8, 0.1
...
2022-06-19 08:00:00, 0.2, 1.2, 103.4, 842.4, 121.8, 13.0, 0.9, 5.8, 148.7, 0.1
2022-06-19 08:01:00, null, null, null, null, null, null, null, null, null, null
2022-06-19 08:02:00, null, null, null, null, null, null, null, null, null, null
2022-06-19 08:03:00, 1.8, 1.9, 103.4, 844.3, 120.7, 13.1, 1.0, 5.9, 133.5, 0.1

Every now and again when querying the data I noticed it lock up and I need to reconnect it on the router to bring it back. The system is new (a couple days old) and is being queried by Home Assistant and uploading to InfluxDB. This happened before the Influx upload. There is sometimes nothing in the Message Log during this time, sometimes there seems to be a message dataLog: datalog WDT - restarting:

6/18/22 21:23:36 LLMNR responder started for hostname IotaWatt
6/18/22 21:23:36 Updater: service started. Auto-update class is MINOR
6/18/22 21:23:38 influxDB_v2: Resume posting 06/18/22 21:18:00
6/18/22 21:23:40 Updater: Auto-update is current for class MINOR.
6/19/22 01:47:40 dataLog: datalog WDT - restarting

** Restart **

SD initialized.
6/18/22 23:47:41z Real Time Clock is running. Unix time 1655596061 
6/18/22 23:47:41z Reset reason: Software/System restart
6/18/22 23:47:41z Trace:  25:64, 25:65, 25:61, 25:64, 25:65, 25:61, 25:64, 25:65, 25:61, 25:64, 25:65, 25:61, 25:64, 25:65, 25:61, 25:64, 25:65, 25:61, 25:64, 25:65, 25:61, 25:64, 25:65, 25:61, 25:64, 25:65, 25:55, 25:56, 25:41, 25:42, 25:43, 10:57
6/18/22 23:47:41z ESP8266 ID: 15923351, RTC PCF8523 (68)
6/18/22 23:47:41z IoTaWatt 5.0, Firmware version 02_07_05
6/18/22 23:47:41z SPIFFS mounted.
6/19/22 01:47:41 Local time zone: +2:00
6/19/22 01:47:41 device name: IotaWatt
6/19/22 01:47:41 HTTP server started
6/19/22 01:47:41 influxDB_v2: Starting, interval:10, url:http://influx.wiersma.lan
6/19/22 01:47:41 timeSync: service started.
6/19/22 01:47:41 statService: started.
6/19/22 01:47:41 dataLog: service started.
6/19/22 01:47:41 dataLog: Last log entry 06/19/22 01:42:40
6/19/22 01:47:46 historyLog: service started.
6/19/22 01:47:46 historyLog: Last log entry 06/19/22 01:42:00
6/19/22 01:47:46 WiFi connected. SSID=StayOutIoT, IP=10.0.1.96, channel=1, RSSI -79db
6/19/22 01:47:46 MDNS responder started for hostname IotaWatt
6/19/22 01:47:46 LLMNR responder started for hostname IotaWatt
6/19/22 01:47:46 Updater: service started. Auto-update class is MINOR
6/19/22 01:47:48 influxDB_v2: Resume posting 06/19/22 01:42:40
6/19/22 01:47:50 Updater: Auto-update is current for class MINOR.
6/19/22 07:00:20 dataLog: datalog WDT - restarting

** Restart **

SD initialized.
6/19/22 05:00:21z Real Time Clock is running. Unix time 1655614821 
6/19/22 05:00:21z Reset reason: Software/System restart
6/19/22 05:00:21z Trace:  25:64, 25:65, 25:61, 25:64, 25:65, 25:61, 25:64, 25:65, 25:61, 25:64, 25:65, 25:55, 25:56, 25:41, 25:42, 25:41, 25:45, 25:41, 25:42, 25:41, 25:44, 10:57, 10:58, 25:20, 25:40, 25:41, 25:44, 10:56, 25:1, 25:1[2], 25:1[3], 10:59
6/19/22 05:00:21z ESP8266 ID: 15923351, RTC PCF8523 (68)
6/19/22 05:00:21z IoTaWatt 5.0, Firmware version 02_07_05
6/19/22 05:00:21z SPIFFS mounted.
6/19/22 07:00:21 Local time zone: +2:00
6/19/22 07:00:21 device name: IotaWatt
6/19/22 07:00:21 HTTP server started
6/19/22 07:00:21 influxDB_v2: Starting, interval:10, url:http://influx.wiersma.lan
6/19/22 07:00:21 timeSync: service started.
6/19/22 07:00:21 statService: started.
6/19/22 07:00:21 dataLog: service started.
6/19/22 07:00:21 dataLog: Last log entry 06/19/22 06:55:20
6/19/22 07:00:26 historyLog: service started.
6/19/22 07:00:26 historyLog: Last log entry 06/19/22 06:55:00
6/19/22 07:00:26 WiFi connected. SSID=StayOutIoT, IP=10.0.1.96, channel=1, RSSI -78db
6/19/22 07:00:26 MDNS responder started for hostname IotaWatt
6/19/22 07:00:26 LLMNR responder started for hostname IotaWatt
6/19/22 07:00:26 Updater: service started. Auto-update class is MINOR
6/19/22 07:00:28 influxDB_v2: Resume posting 06/19/22 06:55:10
6/19/22 07:00:30 Updater: Auto-update is current for class MINOR.
6/19/22 07:31:01 WiFi disconnected.
6/19/22 07:31:02 WiFi connected. SSID=StayOutIoT, IP=10.0.1.96, channel=1, RSSI -78db

My device is in the roof so not easily accessible to look at the lights during this time, but if needed I guess I could camp out there for an hour to see. Otherwise I have pocked around a little but not sure how to debug this further to work this out.
Any help would be appreciated. Thank you.

Not enough context to say for sure what’s happening. Your WiFi RSSI could be better, so anything that is dependent on WiFi can take longer.

You say you are doing queries. When the IoTaWatt is responding to a query, it is not sampling or recording data. In fact it does nothing else during the query. That’s because of the way the web server works. There is an async web server available but it doesn’t seem to be supported anymore and using it also causes other issues with interrupting sampling.

Because it monopolizes the processor, queries are limited to 1,000 lines (intervals), and that generally keeps queries short enough to cause no problems. There is a limit parameter that can be used to increase or completely override default limit. If you are using that, there is a good chance that is causing this issue.

First off, thanks for the quick reply and an awesome product.

I assumed that might be the case, and started uploading to influx to run my queries, so the only thing querying is the Home Assistant integration. I do however find it odd that it seems to occasionally not have data in the log for several minutes, which seems to be more than just a blocking query.

That said, it occurs to me that I ran the system for a couple days next to my couch with 1 input as a test, and that has no data gaps. From then to now it has moved location (South African houses are a pain for this kind of installation) and added 11 inputs. This is a good start for debugging the issue.

The 5 minute gap corresponds to the timeout interval for the datalog watchdog timer (WDT). So the system was monopolized for five minutes to trigger the WDT. The trace from the message log shows that the IoTaWatt had just completed a query. It’s suspicious that the query had just completed, and may indicate that there was a WiFi issue sending the last of the data.

The two minute gap at 08:01:00 has no corresponding message log activity, so I assume it is just a query that took about two minutes.

The HA queries are very short and should present no problems. Uploads to influxDB are not queries, operate asynchronously and would not monopolize the system or effect sampling. If you intend on doing large queries, influxDB is a faster, safer and more robust way to do it.

:man_facepalming: I found the issue and it is dumb.

When I setup the system I did it on my work Macbook Pro as it is across the hall from my DB board. I was showing the system to my Electrician (and he was very impressed, hopefully a growing market in South Africa for you) and had a query on refresh. The query was only for 1 hour, but Macbooks wake up every now and again to check things. They turn the Wifi on, do their checks and then go back to sleep. When the Wifi was turned on the browser with the query running would run the query, then the Wifi would be turned off again, I guess meaning the query could not complete, timing the Watchdog out obviously. I have not used that machine since I was setting it up. This did not occur to me until you mentioned that the query basically completes. After closing the tab there have been zero system restarts and no data gaps.

I am so sorry to have wasted your time for such a dumb thing although I am glad the issue was my stupidity and not something more difficult to fix.

Not a problem. Thanks for the closure.