Software Watchdog in a loop


#1

HI I have one device doing continuous watchdog reset…

SD initialized.
6/28/18 05:22:15z Real Time Clock is running. Unix time 1530163335
6/28/18 05:22:15z Version 02_03_08
6/28/18 05:22:15z Reset reason: Software Watchdog
6/28/18 05:22:15z Trace: 1:4, 1:3, 1:4, 1:3, 1:4, 1:3, 1:4, 1:3, 1:4, 1:3, 1:4, 1:3, 1:4, 1:3, 1:4, 1:3, 1:4, 1:3, 1:4, 1:3, 1:4, 1:1, 9:0, 9:1, 8:4, 8:6, 8:8, 9:3, 9:4, 9:5, 9:9, 1:2
6/28/18 05:22:15z ESP8266 ChipID: 6910350
6/28/18 07:22:15 Local time zone: 2
6/28/18 07:22:15 device name: pinguITW, version: 3
6/28/18 07:22:15 MDNS responder started
6/28/18 07:22:15 You can now connect to http://pinguITW.local
6/28/18 07:22:15 HTTP server started
6/28/18 07:22:15 dataLog: service started.
6/28/18 07:22:15 dataLog: Last log entry 6/28/18 07:22:10
6/28/18 07:22:15 statService: started.
6/28/18 07:22:15 timeSync: service started.
6/28/18 07:22:15 WiFi connected. SSID: pingu, IP: 192.168.18.73
6/28/18 07:22:16 Updater: service started. Auto-update class is ALPHA
6/28/18 07:22:16 historyLog: service started.
6/28/18 07:22:16 historyLog: Last log entry 6/28/18 07:22:00
6/28/18 07:22:20 EmonService: started. url:emoncms.org:80,node:pinguITW,interval:10, unsecure GET
6/28/18 07:22:20 influxDB: started.

** Restart **

SD initialized.
6/28/18 05:22:24z Real Time Clock is running. Unix time 1530163344
6/28/18 05:22:24z Version 02_03_08
6/28/18 05:22:24z Reset reason: Software Watchdog
6/28/18 05:22:24z Trace: 7:4, 1:6, 1:3, 1:4, 1:5, 7:0, 7:5, 1:6, 1:1, 9:0, 9:1, 8:4, 8:6, 8:8, 9:3, 9:4, 9:5, 9:9, 1:2, 1:3, 1:4, 1:1, 9:0, 9:1, 8:4, 8:6, 8:8, 9:3, 9:4, 9:5, 9:9, 1:2
6/28/18 05:22:24z ESP8266 ChipID: 6910350
6/28/18 07:22:24 Local time zone: 2
6/28/18 07:22:24 device name: pinguITW, version: 3
6/28/18 07:22:24 MDNS responder started
6/28/18 07:22:24 You can now connect to http://pinguITW.local
6/28/18 07:22:24 HTTP server started
6/28/18 07:22:24 dataLog: service started.
6/28/18 07:22:24 dataLog: Last log entry 6/28/18 07:22:20
6/28/18 07:22:24 statService: started.
6/28/18 07:22:24 timeSync: service started.
6/28/18 07:22:24 WiFi connected. SSID: pingu, IP: 192.168.18.73
6/28/18 07:22:25 Updater: service started. Auto-update class is ALPHA
6/28/18 07:22:25 historyLog: service started.
6/28/18 07:22:25 historyLog: Last log entry 6/28/18 07:22:00
6/28/18 07:22:29 EmonService: started. url:emoncms.org:80,node:pinguITW,interval:10, unsecure GET
6/28/18 07:22:29 influxDB: started.

** Restart **

SD initialized.
6/28/18 05:22:32z Real Time Clock is running. Unix time 1530163352
6/28/18 05:22:32z Version 02_03_08
6/28/18 05:22:32z Reset reason: Software Watchdog
6/28/18 05:22:32z Trace: 1:4, 1:3, 1:4, 1:3, 1:4, 1:3, 1:4, 1:3, 1:4, 1:3, 1:4, 1:3, 1:4, 1:3, 1:4, 1:3, 1:4, 1:3, 1:4, 1:3, 1:4, 1:1, 9:0, 9:1, 8:4, 8:6, 8:8, 9:3, 9:4, 9:5, 9:9, 1:2
6/28/18 05:22:32z ESP8266 ChipID: 6910350
6/28/18 07:22:32 Local time zone: 2
6/28/18 07:22:32 device name: pinguITW, version: 3
6/28/18 07:22:33 MDNS responder started
6/28/18 07:22:33 You can now connect to http://pinguITW.local
6/28/18 07:22:33 HTTP server started
6/28/18 07:22:33 dataLog: service started.
6/28/18 07:22:34 dataLog: Last log entry 6/28/18 07:22:25
6/28/18 07:22:34 statService: started.
6/28/18 07:22:34 timeSync: service started.
6/28/18 07:22:35 WiFi connected. SSID: pingu, IP: 192.168.18.73
6/28/18 07:22:35 Updater: service started. Auto-update class is ALPHA
6/28/18 07:22:35 historyLog: service started.
6/28/18 07:22:35 historyLog: Last log entry 6/28/18 07:22:00
6/28/18 07:22:37 updater: Auto-update is current for class ALPHA.
6/28/18 07:22:38 EmonService: started. url:emoncms.org:80,node:pinguITW,interval:10, unsecure GET
6/28/18 07:22:38 influxDB: started.

** Restart **

SD initialized.
6/28/18 05:22:42z Real Time Clock is running. Unix time 1530163362
6/28/18 05:22:42z Version 02_03_08
6/28/18 05:22:42z Reset reason: Software Watchdog
6/28/18 05:22:42z Trace: 1:4, 1:3, 1:4, 1:3, 1:4, 1:3, 1:4, 1:3, 1:4, 1:3, 1:4, 1:3, 1:4, 1:3, 1:4, 1:3, 1:4, 1:3, 1:4, 1:3, 1:4, 1:1, 9:0, 9:1, 8:4, 8:6, 8:8, 9:3, 9:4, 9:5, 9:9, 1:2
6/28/18 05:22:42z ESP8266 ChipID: 6910350
6/28/18 07:22:42 Local time zone: 2
6/28/18 07:22:42 device name: pinguITW, version: 3
6/28/18 07:22:42 MDNS responder started
6/28/18 07:22:42 You can now connect to http://pinguITW.local
6/28/18 07:22:42 HTTP server started
6/28/18 07:22:42 dataLog: service started.
6/28/18 07:22:42 dataLog: Last log entry 6/28/18 07:22:35
6/28/18 07:22:42 statService: started.
6/28/18 07:22:42 timeSync: service started.
6/28/18 07:22:42 WiFi connected. SSID: pingu, IP: 192.168.18.73
6/28/18 07:22:43 Updater: service started. Auto-update class is ALPHA
6/28/18 07:22:43 historyLog: service started.
6/28/18 07:22:43 historyLog: Last log entry 6/28/18 07:22:00
6/28/18 07:22:47 EmonService: started. url:emoncms.org:80,node:pinguITW,interval:10, unsecure GET
6/28/18 07:22:47 influxDB: started.

** Restart **

SD initialized.
6/28/18 05:22:50z Real Time Clock is running. Unix time 1530163370
6/28/18 05:22:50z Version 02_03_08
6/28/18 05:22:50z Reset reason: Software Watchdog
6/28/18 05:22:50z Trace: 9:0, 8:4, 8:6, 8:8, 1:2, 1:3, 1:4, 1:5, 7:0, 7:3, 1:6, 1:3, 1:4, 1:5, 7:0, 7:4, 15:0, 15:1, 7:4, 7:4, 1:6, 1:1, 9:0, 9:1, 8:4, 8:6, 8:8, 9:3, 9:4, 9:5, 9:9, 1:2
6/28/18 05:22:50z ESP8266 ChipID: 6910350
6/28/18 07:22:50 Local time zone: 2
6/28/18 07:22:50 device name: pinguITW, version: 3
6/28/18 07:22:50 MDNS responder started
6/28/18 07:22:50 You can now connect to http://pinguITW.local
6/28/18 07:22:50 HTTP server started
6/28/18 07:22:50 dataLog: service started.
6/28/18 07:22:50 dataLog: Last log entry 6/28/18 07:22:45
6/28/18 07:22:50 statService: started.
6/28/18 07:22:50 timeSync: service started.
6/28/18 07:22:50 WiFi connected. SSID: pingu, IP: 192.168.18.73
6/28/18 07:22:51 Updater: service started. Auto-update class is ALPHA
6/28/18 07:22:51 historyLog: service started.
6/28/18 07:22:51 historyLog: Last log entry 6/28/18 07:22:00
6/28/18 07:22:55 EmonService: started. url:emoncms.org:80,node:pinguITW,interval:10, unsecure GET
6/28/18 07:22:55 influxDB: started.

** Restart **


#2

As the last entry before restart is influxDb, I can tell that the second IoTaWatt is running and reporting the data to influxDb.


#3

Would you post the config.txt file of that IoTaWatt please?


#4

CONFIG-REMOVED

PS: The only relevant difference from my view is the additional
,
“stop”: false

at the end


#5

I’ve loaded your config into an IoTaWatt and started it up. I may have caused several entries to be posted to your emoncms.org node before I realized that would happen. I since changed theconfig to go to a different emoncms.org account. I created a database on my influx instance and it is posting there.

I’m not getting any failure with this setup. There are some difference:
I’m 60hz - I don’t believe this would make any difference.
I don’t have all of the inputs populated, but an unplugged input looks just like plugged zero power input.
I don’t have three phase - that is a significant difference, but I don’t think it bears on this problem.

The trace clearly narrows the failure to four lines of code in the main loop. There is a loop there that could potentially cause a WDT restart if it ran away. I don’t see that happenning unless there were some kind of buffer or memory issue that corrupted the input data structures. You have them all full, and so it’s a stretch that they would all be corrupted. Nevertheless, it’s a possibility. I’ve added another trace entry to distinguish that, but that doesn’t help here.

The other possibility, and what I believe is happening, is that the underlying lwip (lightweight ip) layer is chasing it’s tail somewhere. In addition to the loop mentioned above, the other two lines of code yield to the core and then explicitly feed the wdt. I think it’s looping in the lwip when the yield() is invoked.

There is a way to determine if that’s the case, but it involves running the IoTaWatt with a serial cable and recording the serial output. You could do this with putty, or probably any other terminal application that you have. What I need is the wdt reset stack trace that is generated. With that, I should be able to determine if it’s in lwip.

Since changing to lwip 2, I’ve seen two incidences of wdt resets in lwip. Those were single events and did not recur after the reset, so this would be something hard. While it may seem worse than the isolated random instances, something like this could actually be beneficial if I can get enough information from the stacktrace to report an issue with the ESP folks.

Another approach would be to try stopping the Emoncms and influxDB posting and see if the problem goes away. That would be diagnostic as well.


#6

I deleted the 2 web reports and looks like running.
Unfortunately I’m a bad user and I tried to edit the config.txt by addind the influx db config and now I have red blinking led ;-(


#7

I took out the SD card and replaced, up anbd running again.
Now I added both webservices and looks like running, I will re-add manually all the reports…

PS: How to watch with the serial? When I connected the USB to the PC I did not get any virtual COM ?


#8

I try to avoid editing the Json files because it’s easy to make a mistake. When I find it necessary, I always run it through a json syntax checker like this before I save it to the SD.


#9

You probably need to install CP2102 device drivers. Search for that for your particular environment. You should be able to find instructions for getting a CP210x device working.


#10

There is something with the influxDb I think,
I continued to fill and again the same.

One interesting point:

  1. saved intermediate config from gui
  2. I re-opened and I saw the one without the last changes
  3. once I think he finished to catchup the history upload I opened and saw the one with last changes…

There you have the log when it makes the reset:
20180628_Log1.txt (6.5 KB)


#11

That was very helpful. My suspicion was confirmed that the wdt was in the yield() rather than the firmware loop, but the stacktrace shows it’s in a call to my async HTTP code rather than lwip. I’ll dig deeper into that over the next few days.


#12

I do not think that is related, but I just started to try to have a look to the code to try to understand how it is working and what are the available libraries.
One point took my attention at line 314 of influxDB.cpp:
if((( ! request || request->readyState() == 4) && HTTPrequestFree) &&
(reqEntries >= influxBulkSend || reqData.available() >= reqDataLimit))

I do not know how it is on this compiler, but I find this a quite dangerous if…
please analyze the case where request is null, probably you have to split the if in 2 if’s.
You have a || therefore it will try to call request->readyState() even if request is null

Similar at line 241 and 248


#13

I agree that it could be dangerous if the compiler takes the approach of evaluating both expressions and then performing an or operation on the boolian results. If that were the case, I’d be in big trouble because I do this all over the place. My experience is that the compiler evaluates left to right and does not evaluate the “or” expression if the left expression is true. In fact, that’s exactly why the call to request->readyState() is preceded by that ! request, to protect against the case of request being null. My experience is that modern optimizing compilers can be relied on to do this.

I have been following up on your problem, and I have identified (and fixed) a potential problem in asyncHTTPrequest handling of chunked responses, which is where the trace that you provided led me. There is not enough information to definitively say that was the problem, but what I found definitely was a problem that would cause a wdt and it definitely was in the chunked respond handler.

Chunks are sent with a length header followed by \r\n. The total length of the transfer is not known, so the data payload is terminated with a chunk that has zero length. If that header consists of only a \r\n instead of 0\r\n, the wdt would result. I think that’s what was happening.

Fortunately, I also wrote asyncHTTPrequest so I was able to fix it and publish the new version. I will include that version in the pio dependencies on the next release.

But by all means, audit the code. I’m sure there are other things in there that could use another set of eyes. A lot of the code in IoTaWatt is in the second generation. I just learned C++ a year ago and had not before used OOP explicitely. I was born 50 years too soon.

BTW/ That code is at a different line# in the latest release.


#14

C++ is also something I do not know to use with the nice object facilities. I implement C using pseudo object with handles (in case https://www.linkedin.com/in/daniele-wyss-bb8b2a2)


#15

Any idea of when you are going to create the next release?