EmonCMS update randomly stops, requiring manual stop/start to resync

OK, that 120 second interval explains a lot.

The apparent problem is that IoTaWatt does not seem to be getting an acknowledgement from Emoncms. Example:

18:43:11 you stop the service. It says that the last frame it knows was posted was timestamped 14:53:30.
18:43:22 you restart the service, it queries Emoncms, and that the next post should be 14:57:40 - indicating that Emoncms said it had 18:57:30.
18:43:39 you stop the service. It says the last frame it knows about was from 14:57:30.

So I am thinking that if you waited long enough, until after IoTaWatt timed out and retried 30 attempts, you would get the timeout -11 logged, and it would keep on retrying forever. Stopping breaks the cycle and when it restarts, posts the next batch but receives no response…

At this point I would be suggesting you try running the influx or PVoutput services to see if you get similar behavior, but you are doing that and they work fine. There is very little difference between the Emoncms and influx services with respect to the HTTP request/response handling.

The only difference I can see on the IoTaWatt side is the encrypted payload. So I would first suggest that you drop back to unencrypted by removing the account from the setup.

Whatever the cause, IoTaWatt doesn’t seem to be getting any response from Emoncms when the posts contain multiple frames. If switching to unencrypted doesn’t resolve the problem, I’m inclined to suspect Emoncms.

by stop/restart on pvoutput & influxdb goes ahead beautifully as expected.

when I was going forward, I was doing without waiting too much, but after realize before starting hitting the stop/start, that by not doing nothing was not going ahead for minutes.

what do you mean by disabling the encryption by removing the account at the setup? I have to leave empty the emoncms userid field? (now I’ve it set to 1) AFAIK I still kave to provide the R/W API key for sending the data and be able to write to emoncms right?

thanks!

Duh! I completely missed that in your original post. Yes, I mean remove the userid. I’m pretty sure that’s the problem. From the documentation:

Without a doubt specifying an invalid userid should not cause this failure mode and I will look into that.

ok, now without the userid, stopped the emoncms, wait about 30 mins, and restarted.
after about 2 and a half mins, then -11.
stop and restart loads another 4 mins, but not more, -11 again after 2.5 mins…

so leaving without userid and unecrypted doesn’t make a difference :frowning:

here the log:
11/25/20 22:23:54 EmonService: Stopped. Last post 11/25/20 21:52:10
11/25/20 22:24:08 EmonService: started. url=HTTP://emoncms.elserrat.cat/emoncms, node=IoTWGrge, interval=10
11/25/20 22:24:10 EmonService: Start posting at 11/25/20 21:56:20
11/25/20 22:26:40 EmonService: HTTP response -11, retrying.
11/25/20 22:27:20 EmonService: Stopped. Last post 11/25/20 21:56:10
11/25/20 22:27:31 EmonService: started. url=HTTP://emoncms.elserrat.cat/emoncms, node=IoTWGrge, interval=10
11/25/20 22:27:32 EmonService: Start posting at 11/25/20 22:00:20
11/25/20 22:30:01 EmonService: HTTP response -11, retrying.
11/25/20 22:30:33 EmonService: Stopped. Last post 11/25/20 22:00:10
11/25/20 22:30:38 EmonService: started. url=HTTP://emoncms.elserrat.cat/emoncms, node=IoTWGrge, interval=10
11/25/20 22:30:39 EmonService: Start posting at 11/25/20 22:04:10

after several stop/start, without changing the 10s interval, and once reached current time, the emoncms lgets loaded without any missing point and goes ahead normally.

seems that there is no problem on loading the data to emoncms once synched, neither in the very first bunch of records while loading the history, but -11 still appears while loading history and after the first packet of data being loaded.

OK, I’m stumped. Lets try just sending multiple current frames. Can you set Bulk Send to 2 and see if that works. If it does, try 10?

Just wake up today, haven’t tried now, is synched, but in the past ocurrences, IIRC what was really doing a difference was change the interval rather than the bulk send.

Now being able to reproduce, today I’ll try it again more carefully to observe this.

OK, here the tests done.
While being synched at current time, having interval at 10s all time, changing bulk send to 2, was going ahead every 20 secs, by changing to 10, was also going ahead every 100s as expected.

Then, I stopper the web server for about half hour, and then restarted.

Having bulk send still at 10, after resume, was not going forward except nearly 4 mins. , so similar behavuiour as it was bulk=1., and had to go through the stop/start itearions to go upo to the current time.

In a summary, changing the # of records of bulk send seems that does not affect at all, and while going forward manually, was still able to load nearly 4 mins, just llike what I observed by having interval to 10s, and bulk=1. So in conclusion looks like stops after a fixed number of records loaded, regardless the number of buckets, and by keeping waiting, then a -11 is thrown after around 2m30s.

So sending up to 10 frames at a time works OK.

Right, bulk send only works to delay sending when you are current. If there are more than bulk-send frames available, it will send the max.

So no difference was expected when there is a backlog. When there is a backlog, each post contains 24 or 25 frames (6/min * 4). Now I would lik eto try setting bulk-send to 24 and see what happens. If it works fine, then try 30. If both work fine, there is no need to stop the service to force failure, I will have the information I want. With bulk send set to 24 it should take about 4 minutes to update.

ok, I wasn’t able to try it before, now I did. Here the results.
Without stopping, having interval to 10s and changed buk to 24, keeps at current, and aparently sends an update every 1m30s (90s), which isn’t 24 records, but 9?
Changing bulk to 30, the same, doesn’t takes longer to get updated, an update every 1m30s (9 records)…
So ignores values greater than 9 for bulk?
I’m looking to the statuis page.

Ok, thanks. I’ll check that tomorrow. It’s Thanksgiving today, so I’m going to sign out.

Of course! I’m not in a hurry, nothing is burning and in the meantime my IotaWatt is doing the job beautifully :wink:
Enjoy the thanksgiving and big thanks for your attention!

Update: My iotawatt was uploading into emoncms fine until thursday 10th (not aware of that until today).
Now I’m in the process of loading the history manually (stop/start) until gets up-to-date.

Any other idea about what I can check for?

Did anything happen on the 10th? Any power failure? Could I see the IoTaWatt message log for then?

I don’t remember anything special, no power failure, neither at the emoncms site.
There is no log currently available from the 10th.
Can I somehow store permanently the log somewhere?

The entire log is a file /iotawatt/iotamsgs.txt accessible from the File Manager and Editor tab in Tools.

You can right click on it and download, but that isn’t the only way to get to near recent history.

When you display the message log, IoTaWatt simply asks for the ~10,000 last bytes of the file (breaks at a line boundary) and displays that in your browser. Look at the URL:

image

You can simply change that to say -100,000 and enter to go back 10 times farther, then locate the entries in question and copy/paste to the forum.

OK, thanks! Pasted below.
What I see for the 10th is that starts reporting the -11 from emoncms, and keeps reporting that in every retry since then (although emoncms whas online and working normally).

** Restart **

SD initialized.
12/09/20 21:42:48z Real Time Clock is running. Unix time 1607550168 
12/09/20 21:42:48z Reset reason: Software Watchdog
12/09/20 21:42:48z Trace:  1:4, 1:3, 1:4, 1:3, 1:4, 1:3, 1:4, 1:3, 1:4, 1:3, 1:4, 1:3, 1:4, 1:3, 1:4, 1:3, 1:4, 1:3, 1:4, 1:1, 1:2[1], 9:0[1], 9:0, 9:1, 8:4, 8:6, 8:8, 8:9, 9:3, 9:5, 9:9, 1:2
12/09/20 21:42:48z ESP8266 ChipID: 2528519
12/09/20 21:42:48z IoTaWatt 4.x, Firmware version 02_05_11
12/09/20 21:42:48z SPIFFS mounted.
12/09/20 22:42:49 Local time zone: +1:00
12/09/20 22:42:49 Using Daylight Saving Time (BST) when in effect.
12/09/20 22:42:49 device name: IoTWGrge
12/09/20 22:42:49 HTTP server started
12/09/20 22:42:49 timeSync: service started.
12/09/20 22:42:49 statService: started.
12/09/20 22:42:50 dataLog: service started.
12/09/20 22:42:56 dataLog: Last log entry 12/09/20 22:42:40
12/09/20 22:42:56 Updater: service started. Auto-update class is MINOR
12/09/20 22:42:57 WiFi connected. SSID=guifi.net-ElSerratFW, IP=192.168.19.188, channel=1, RSSI -65db
12/09/20 22:42:57 MDNS responder started for hostname IoTWGrge
12/09/20 22:42:57 LLMNR responder started for hostname IoTWGrge
12/09/20 22:42:57 EmonService: started. url=HTTP://emoncms.elserrat.cat/emoncms, node=IoTWGrge, interval=10
12/09/20 22:42:57 influxDB: started, url=HTTP://192.168.19.147:8086, db=iotawatt, interval=10
12/09/20 22:42:58 Updater: Auto-update is current for class MINOR.
12/09/20 22:43:00 EmonService: Start posting at 12/09/20 22:42:50
12/09/20 22:43:02 influxDB: Start posting at 12/09/20 22:42:50
12/09/20 22:43:50 historyLog: service started.
12/09/20 22:43:52 historyLog: Last log entry 12/09/20 22:42:00
12/09/20 22:43:59 PVoutput: started
12/09/20 22:43:59 PVoutput: System El Serrat del Serí, interval 5, donator mode  
12/09/20 22:44:00 PVoutput: Start status beginning 12/09/20 22:40:00
12/10/20 00:50:09 EmonService: HTTP response -11, retrying.
12/10/20 06:11:46 WiFi disconnected.
12/10/20 06:11:49 WiFi connected. SSID=guifi.net-ElSerratFW, IP=192.168.19.188, channel=1, RSSI -65db

** Restart **

SD initialized.
12/10/20 05:57:50z Real Time Clock is running. Unix time 1607579870 
12/10/20 05:57:50z Reset reason: Software Watchdog
12/10/20 05:57:50z Trace:  1:6, 1:3, 1:4, 1:5[3], 3:0, 3:11, 1:6, 1:3, 1:4, 1:5[3], 3:0, 3:11, 1:6, 1:3, 1:4, 1:5[3], 3:0, 3:11, 1:6, 1:1[2], 1:2[3], 9:0[3], 9:0, 9:1, 8:4, 8:6, 8:8, 8:9, 9:3, 9:5, 9:9, 1:2
12/10/20 05:57:50z ESP8266 ChipID: 2528519
12/10/20 05:57:50z IoTaWatt 4.x, Firmware version 02_05_11
12/10/20 05:57:50z SPIFFS mounted.
12/10/20 06:57:51 Local time zone: +1:00
12/10/20 06:57:51 Using Daylight Saving Time (BST) when in effect.
12/10/20 06:57:51 device name: IoTWGrge
12/10/20 06:57:51 HTTP server started
12/10/20 06:57:51 timeSync: service started.
12/10/20 06:57:51 statService: started.
12/10/20 06:57:52 dataLog: service started.
12/10/20 06:57:57 dataLog: Last log entry 12/10/20 06:57:45
12/10/20 06:57:57 Updater: service started. Auto-update class is MINOR
12/10/20 06:57:57 WiFi connected. SSID=guifi.net-ElSerratFW, IP=192.168.19.188, channel=1, RSSI -63db
12/10/20 06:57:57 MDNS responder started for hostname IoTWGrge
12/10/20 06:57:57 LLMNR responder started for hostname IoTWGrge
12/10/20 06:57:58 EmonService: started. url=HTTP://emoncms.elserrat.cat/emoncms, node=IoTWGrge, interval=10
12/10/20 06:57:58 influxDB: started, url=HTTP://192.168.19.147:8086, db=iotawatt, interval=10
12/10/20 06:57:59 Updater: Auto-update is current for class MINOR.
12/10/20 06:58:01 EmonService: Start posting at 12/10/20 00:49:10
12/10/20 06:58:05 influxDB: Start posting at 12/10/20 06:57:50
12/10/20 06:58:53 historyLog: service started.
12/10/20 06:58:54 historyLog: Last log entry 12/10/20 06:57:00
12/10/20 06:59:01 PVoutput: started
12/10/20 06:59:02 PVoutput: System El Serrat del Serí, interval 5, donator mode  
12/10/20 06:59:02 PVoutput: Start status beginning 12/10/20 06:55:00
12/10/20 07:00:31 EmonService: HTTP response -11, retrying.
12/10/20 10:41:46 WiFi disconnected.
12/10/20 10:41:49 WiFi connected. SSID=guifi.net-ElSerratFW, IP=192.168.19.188, channel=1, RSSI -67db

** Restart **

SD initialized.
12/10/20 11:11:50z Real Time Clock is running. Unix time 1607598710 
12/10/20 11:11:50z Reset reason: Software Watchdog
12/10/20 11:11:50z Trace:  1:4, 1:3, 1:4, 1:3, 1:4, 1:3, 1:4, 1:3, 1:4, 1:3, 1:4, 1:3, 1:4, 1:3, 1:4, 1:3, 1:4, 1:3, 1:4, 1:1[2], 1:2[3], 9:0[3], 9:0, 9:1, 8:4, 8:6, 8:8, 8:9, 9:3, 9:5, 9:9, 1:2
12/10/20 11:11:50z ESP8266 ChipID: 2528519
12/10/20 11:11:50z IoTaWatt 4.x, Firmware version 02_05_11
12/10/20 11:11:50z SPIFFS mounted.
12/10/20 12:11:51 Local time zone: +1:00
12/10/20 12:11:51 Using Daylight Saving Time (BST) when in effect.
12/10/20 12:11:51 device name: IoTWGrge
12/10/20 12:11:51 HTTP server started
12/10/20 12:11:51 timeSync: service started.
12/10/20 12:11:52 statService: started.
12/10/20 12:11:52 dataLog: service started.
12/10/20 12:11:57 dataLog: Last log entry 12/10/20 12:11:45
12/10/20 12:11:57 Updater: service started. Auto-update class is MINOR
12/10/20 12:11:57 WiFi connected. SSID=guifi.net-ElSerratFW, IP=192.168.19.188, channel=1, RSSI -70db
12/10/20 12:11:57 MDNS responder started for hostname IoTWGrge
12/10/20 12:11:57 LLMNR responder started for hostname IoTWGrge
12/10/20 12:11:58 EmonService: started. url=HTTP://emoncms.elserrat.cat/emoncms, node=IoTWGrge, interval=10
12/10/20 12:11:58 influxDB: started, url=HTTP://192.168.19.147:8086, db=iotawatt, interval=10
12/10/20 12:11:59 Updater: Auto-update is current for class MINOR.
12/10/20 12:12:00 EmonService: Start posting at 12/10/20 00:53:10
12/10/20 12:12:04 influxDB: Start posting at 12/10/20 12:11:50
12/10/20 12:12:53 historyLog: service started.
12/10/20 12:12:54 historyLog: Last log entry 12/10/20 12:11:00
12/10/20 12:13:01 PVoutput: started
12/10/20 12:13:02 PVoutput: System El Serrat del Serí, interval 5, donator mode  
12/10/20 12:13:02 PVoutput: Start status beginning 12/10/20 12:10:00
12/10/20 12:14:30 EmonService: HTTP response -11, retrying.

I can’t tell you why, but it appears that no response is being received when data is posted. That could be something in Emoncms, or something with the server running it. My suspicion is Emoncms itself because it seems to be responding to the initial query that determines the last posted time.

ok, but emoncms wasn’t restarted at all. it runs on an old raspberry, and takes care to a quite few other IoT devices (hvac, EV, solar PV, weather station, pool…); so tracks many feeds and node-red flows, but works.
could be the case that if for some reason when at some particular point emoncms did not respond so -11 is thrown, this exception don’t get cleared at the iotawatt for the next try, so appears falsely that -11 is thrown repeatedly? I’m saying because by the manual stop/start process is able to load the history manually, and thereafter also keeps updated normally once all the history has been loaded, and no changes were done at the emoncms at all…

Well, back at you. Nothing has changed with the IoTaWatt. That’s not a useful indicator.

I don’t rule out a problem with IoTaWatt, but I see no evidence at all of a problem. Your Influx and PVoutput are working fine. There is no difference with respect to receiving a response from a server. Moreover, it appears that Emoncms is responding to the initial query to determine last post time.

There’s little available, but one thing that stick out to me is the report that it works ok when current, but not when shoveling history up.

It appears that one buffer load of upload contains 9 “frames” or about 90 seconds worth of data. But at each restart, you are able to get about 4 minutes worth of data. That implies that at least two transactions complete and the third hangs with no response.

I don’t know the internals of Emoncms that well, but I believe incoming transactions are queued, the input processes are run on each input, and the feeds are updated. I don’t know where in that process the response to the sender is initiated, but i suspect that there is a problem when the queue of data to be processed to feeds grows. That is the only theory that I can come up with to explain why it works when current but not when behind.

That theory is based somewhat on your assertion that the data for those 4 minute chunks is actually there. IoTaWatt will, as part of error recovery, step over a buffer load of data after a number of unsuccessful retries. So if that were the reason for seeing a 4 minute advance, the data would not be there in Emoncms. But you report that it is.