Occasional Hang

SandyB · September 19, 2018, 6:57pm

Hi Bob,
I’ve noticed that my system very occasionally stops reporting to (a local) emonCMS (I don’t have an InfluxDB - yet!). At this point the IoTaWatt UI is completely unresponsive. The light is green, and flickers only very slightly (maybe every 10 seconds). I just cycle the power & it uploads all the missing data & carries on for another week or two.
The:
Updater: Invalid response from server. HTTPcode: -4
message seems to be common to all instances - but not sure if this is cause or symptom.
Here’s the log from it happening a couple of times.
iotawatt_hang.txt (9.0 KB)
(An occasional power cycle with no data loss doesn’t bother me in the slightest. So I’m just letting you know in case it helps corroborate something else you are looking at)
Thanks,
Sandy

overeasy · September 19, 2018, 7:32pm

Thanks Sandy,

As you say, it’s intermittent and recovers fully, yet it shouldn’t happen at all. The log you posted has it running for two weeks at a time before the Emoncms upload stops. Clearly the unit is still sampling, because you are able to recover the data. What is curious is your report that the UI is unresponsive. That is a good clue with some implications that I can look into. I may be back to you with questions if I find something.

Meanwhile, your system updated to 02_03_14 yesterday and seems to be running fine on that. Time will tell if the problem persists. There are some changes that could be related and might just fix it.

Oh. The invalid response from HTTP server is a timeout. That happens. Not a big issue. The failure to update time over 24 hours is an issue that could be related to the Emoncms update.

SandyB · October 11, 2018, 1:59pm

Hi Bob,
Just FYI, I did get another one of these iotawatt_hang2.txt (9.7 KB) - after another 2 weeks, this time on 02_03_16. One additional observation - when hung, it doesn’t even respond to a ping. So, seems like your code is doing just fine - measuring, logging. But something’s up with the low-level communications?
Thanks,
Sandy

overeasy · October 11, 2018, 4:15pm

Yes, the update to 02_03_16 went very smoothly on 9/26. Except for a 12-minute communication problem with your Emoncms on 9/29, which recovered fine, the entire two weeks up to 10/10 were completely uneventful.

I can see that an error was logged on 10/10/18 at 20:19:18, and the status after the restart the next day indicates that’s when the updates stopped. Judging by other evidence in the log, I’m assuming the backlog fully recovered after the restart.

The 9 problems on 9/28 were timeouts, and they were overcome with a reasonable number of retries. That’s normal.

The error on 10/10 that caused the suspension of posting was a connect error, and as you say, could be indicative of a failure of lower level code. That is reinforced by the same type of failure of the updater service. This is all very useful information.

At this point I have to consider that it may not be possible to recover from every situation if the fault is in the lower-level code, so I’m considering some form of HTTP watchdog timer to restart the unit in these rare instances.
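To make the watchdog idea concrete, here is a minimal sketch of how such a timer could be structured. It is not IoTaWatt code: the `HttpWatchdog` class, its method names, and the timeout value are all hypothetical, and it assumes the main loop keeps running while HTTP is stuck (which the continued sampling suggests).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper for illustration only; not part of the IoTaWatt firmware.
// It records the time of the last successful HTTP transaction and reports
// when too long has passed without one.
class HttpWatchdog {
public:
    // timeoutMs: how long to tolerate no successful HTTP activity.
    // nowMs: current value of a free-running millisecond counter.
    explicit HttpWatchdog(uint32_t timeoutMs, uint32_t nowMs = 0)
        : timeoutMs_(timeoutMs), lastSuccessMs_(nowMs) {
        assert(timeoutMs > 0);
    }

    // Call whenever an HTTP transaction completes successfully.
    void feed(uint32_t nowMs) { lastSuccessMs_ = nowMs; }

    // True once at least timeoutMs has elapsed since the last success.
    // Unsigned subtraction keeps the result correct when the 32-bit
    // millisecond counter rolls over (roughly every 49.7 days).
    bool expired(uint32_t nowMs) const {
        return static_cast<uint32_t>(nowMs - lastSuccessMs_) >= timeoutMs_;
    }

private:
    uint32_t timeoutMs_;
    uint32_t lastSuccessMs_;
};
```

On an ESP8266 the main loop could call `expired(millis())` and, when it returns true, log the event and call `ESP.restart()`. The timeout would need to be generous (hours rather than minutes) so that an ordinary Emoncms outage does not trigger a restart.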

Thanks,
Bob

SandyB · October 11, 2018, 10:17pm

Hi Bob,
Exactly; incredibly smooth - the timeouts were all explicable, and nothing to do with IoTaWatt - mostly updates on the EmonCMS machine taking all the CPU.
And I have my own “watch dog” already - I watch EmonCMS, if it stops updating, I wander into the garage & pull the IoTaWatt power for a moment!!
So, no worries from me - just making observations in case they help!
Thanks again,
Sandy