Influx Service Restart Timing / Logic


#1

Hi Bob. I’m running several Iotawatts, all posting to InfluxDB. From time-to-time, influxdb fails. It restarts but in the meantime, IotaWatts do this:

1/03/19 04:08:05 influxDB: Start posting at 01/03/19 04:07:10
1/03/19 14:10:41 influxDB: Post Failed: -11
* 1/03/19 19:58:46 Heap memory has degraded below safe minimum, restarting.

(The * indicates heap memory problem, but this is possibly unrelated?)

Then after the restart:

SD initialized.
1/03/19 09:58:47z Real Time Clock is running. Unix time 1546509527
1/03/19 09:58:47z Version 02_03_20
1/03/19 09:58:47z Reset reason: Software/System restart
1/03/19 09:58:47z Trace:  1:5[7], 7:0, 7:9, 1:6, 1:3, 1:4, 1:5[7], 7:0, 7:9, 1:6, 1:1[1], 1:2[2], 9:0[2], 9:0, 9:1, 8:4, 8:6, 8:8, 8:9, 9:3, 9:5, 9:9, 1:2, 1:3, 1:4, 1:5[7], 7:0, 7:9, 1:6, 1:3, 1:4, 1:5[21]
1/03/19 09:58:47z ESP8266 ChipID: 6910648
1/03/19 09:58:47z SPIFFS mounted.
1/03/19 19:58:48 Local time zone: +10:00
1/03/19 19:58:48 device name: iotaa39
1/03/19 19:58:48 MDNS responder started for hostname iotaa39
1/03/19 19:58:48 LLMNR responder started for hostname iotaa39
1/03/19 19:58:48 HTTP server started
1/03/19 19:58:48 timeSync: service started.
1/03/19 19:58:49 statService: started.
1/03/19 19:58:49 dataLog: service started.
1/03/19 19:58:49 dataLog: Last log entry 01/03/19 19:58:45
1/03/19 19:58:49 historyLog: service started.
1/03/19 19:58:50 historyLog: Last log entry 01/03/19 19:58:00
1/03/19 19:58:53 influxDB: started, url=live.phisaver.com:8086, db=iotaa39, interval=10
1/03/19 19:58:53 WiFi connected. SSID=NetComm 0405, IP=10.1.1.4, channel=1, RSSI -79db
1/03/19 19:58:53 Updater: service started. Auto-update class is MINOR
1/03/19 19:58:58 Updater: Auto-update is current for class MINOR.
1/03/19 19:59:26 influxDB: last entry query failed: -11
1/03/19 19:59:26 influxDB: Stopped. Last post 01/03/19 19:56:50

Since the last entry query failed, the InfluxDB service stops. This query fails because, I guess, the InfluxDB is still down. However, a manual restart 12 hours later still results in a “last entry query failed: -11” even though the InfluxDB is running. Further manual restarts of the service eventually success and all is well.

The status of this unit is here, in the middle of back-filling. Looks fine to me:

Looking at the InfluxD logs on 1/3/19 at 19:59 local time (09:69 UTC):

ts=2019-01-03T09:59:06.980442Z lvl=info msg=“Executing query” log_id=0ClVDwel000 service=query query=“SELECT last(Watts) FROM iotaa39.autogen.iotaa39 WHERE device = ‘iotaa39’”
ts=2019-01-03T09:59:13.745951Z lvl=info msg=“Executing query” log_id=0ClVDwel000 service=query query=“SELECT last(Watts) FROM iotaa39.autogen.iotaa39 WHERE device = ‘iotaa39’”
ts=2019-01-03T09:59:36.328734Z lvl=info msg=“Executing query” log_id=0ClVDwel000 service=query query=“SELECT last(Watts) FROM iotaa39.autogen.iotaa39 WHERE device = ‘iotaa39’”

So it seems influxdb was getting the “last entry” query, but iotawatt didn’t get/understand the response.

Also in the log is a query on PF instead of Watts:

ts=2019-01-02T18:08:04.204275Z lvl=info msg=“Executing query” log_id=0CjPd8DG000 service=query query=“SELECT last(PF) FROM iotaa39.autogen.iotaa39 WHERE device = ‘iotaa39’”

Some of the IotaWatts (3 of 7) are posting again, but the rest are running but not posting. I couldn’t find the restart logic on github.

So, my question is: What’s the logic/timing for restarting InfluxDB service? And why is the “last entry” query failing?

In my case, I’ll like it to try to reconnect frequently (every 5 mins) so outages of InfluxDB are not too visible to users.

(Obviously a perfect WiFi and InfluxDB would be better! But that’s not feasible!)

Happy to provide further testing / debug as necessary.

Cheers, Brett


#2

That which does not kill us makes us stronger
Friedrich Nietzsche

Hi Brett,
I just love it when someone beats the crap out of these services. Fixing these high stress cases makes it so much more robust for the average user. So first pass:

Not sure. It did happen 6 hours later. Have you had any other such failures? There is an outstanding memory leak issue when using passwords. Have you set access passwords for the IoTaWatt?

The service issues a query for every measurement. There is a five second timeout on the transaction, so if it doesn’t get a response within that time window, it times out (-11) and stops the service.

That’s probably not the best way to handle the query failure. A periodic retry would be possible, but it will involve introducing a new state and reporting that so that when someone doesn’t see the service stopped but it’s not getting anywhere, we can identify the problem. Let me ponder that for awhile and put it on the list.

The recent new PVoutput service was structured as a class and offers some advantages in handling complicated situations like this. I do want to eventually transform influx and emoncms services to that structure, so maybe that will be the way to approach this.

As above, it queries on every measurement, so you must have output(s) with PF.

Two ways to restart:

  • Restart the IoTaWatt
  • Use the start button in the influx tab of the status display.

The apparent cause is a timeout (-11) in one of the last entry queries. Specifically, that means no response after 5 seconds. I’ve got three IoTaWatt posting to a RPi/SDcard influx instance and don’t experience these problems. You probably have higher loads. I’m not sure how influx handles queries vs posting data. It could be that posting data, at least to the memory database, takes a priority and that delays the responses. Once again, with the class structured service, I’d be able to keep retrying particular queries and eventually get through it if it’s at all possible.

Your WiFi is so-so at -79db, but I don’t think that’s the problem. The failure is typically -4 with connection problems, but faster influx might be the issue. Can you show me the influx configurations so I can get a feel for load? Also, is there anything else writing to your influx?


#3

Hiya. Yes, I’m trying to get things robust for further installs in remote locations, so these problems are better fixed now than 500km away!

Not sure. It did happen 6 hours later. Have you had any other such failures? There is an outstanding memory leak issue when using passwords. Have you set access passwords for the IoTaWatt?

No passwords. Haven’t have this as an issue before.

The service issues a query for every measurement. There is a five second timeout on the transaction, so if it doesn’t get a response within that time window, it times out (-11) and stops the service.

To clarify: My question is better phrased as "What’s the logic for restarting the InfluxDB service?". In this scenario, I’m unsure of what prompts #3 and #6.

  1. InfluxDB goes down.
  2. IotaWatt InfluxDBService fails to write a measurement, and stops.
  3. IotaWatt InfluxDBService is restarted. The trigger is … ? after 24 hours ?
  4. IotaWatt InfluxDBService runs “select last … from measurement”. This fails (as InfluxDB is still down). Service stops.
  5. InfluxDB is back up and running.
  6. IotaWatt InfluxDBService is restarted, at an unknown time
  7. IotaWatt InfluxDBService runs “select last … from measurement”. This succeeds and all is well.

faster influx might be the issue. Can you show me the influx configurations so I can get a feel for load? Also, is there anything else writing to your influx?

Some more details on the Influx setup:

  • Running on a EC2 Medium Instance on Amazon AWS
  • Total of 8 databases in influx, connected to 8 iotawatts.
  • Well monitored instance, with typically 5% CPU and 19% Memory (8Gb) for Influx. Long running queries are logged.
  • Crashes caused during file compaction and memory exhaustion (now fixed hopefully?), not from posting.
  • Config from config.txt (2.8 KB)
  • Posting at 10s intervals without bulk send so that users see “instant” results.
  • Hardware sized on these guidelines.
  • Some temperature sensors are writing each 10min to the database (low volume)

When the IotaWatts are running and posting there are no problems handling the load (it’s designed for much more).

I don’t think the timeout is caused by a lack of response for 5s. Even megaqueries complete in ~10ms. Note that I see the “last() …” query in the InfluxDB log, even when InfluxDB gets a timeout on the “last()” query and stops.
For example:

cat /var/log/influxdb/influxd.log-20190104 | grep a39 | grep last
ts=2019-01-02T18:08:03.310111Z lvl=info msg="Executing query" log_id=0CjPd8DG000 service=query query="SELECT last(Watts) FROM iotaa39.autogen.iotaa39 WHERE device = 'iotaa39'"
ts=2019-01-02T18:08:03.472580Z lvl=info msg="Executing query" log_id=0CjPd8DG000 service=query query="SELECT last(Watts) FROM iotaa39.autogen.iotaa39 WHERE device = 'iotaa39'"
ts=2019-01-02T18:08:03.610203Z lvl=info msg="Executing query" log_id=0CjPd8DG000 service=query query="SELECT last(Watts) FROM iotaa39.autogen.iotaa39 WHERE device = 'iotaa39'"
ts=2019-01-02T18:08:03.764180Z lvl=info msg="Executing query" log_id=0CjPd8DG000 service=query query="SELECT last(PF) FROM iotaa39.autogen.iotaa39 WHERE device = 'iotaa39'"
ts=2019-01-02T18:08:03.910729Z lvl=info msg="Executing query" log_id=0CjPd8DG000 service=query query="SELECT last(Watts) FROM iotaa39.autogen.iotaa39 WHERE device = 'iotaa39'"
ts=2019-01-02T18:08:04.062050Z lvl=info msg="Executing query" log_id=0CjPd8DG000 service=query query="SELECT last(Watts) FROM iotaa39.autogen.iotaa39 WHERE device = 'iotaa39'"
ts=2019-01-02T18:08:04.204275Z lvl=info msg="Executing query" log_id=0CjPd8DG000 service=query query="SELECT last(PF) FROM iotaa39.autogen.iotaa39 WHERE device = 'iotaa39'"
ts=2019-01-02T18:08:04.348239Z lvl=info msg="Executing query" log_id=0CjPd8DG000 service=query query="SELECT last(Watts) FROM iotaa39.autogen.iotaa39 WHERE device = 'iotaa39'"
ts=2019-01-02T18:08:04.480175Z lvl=info msg="Executing query" log_id=0CjPd8DG000 service=query query="SELECT last(Watts) FROM iotaa39.autogen.iotaa39 WHERE device = 'iotaa39'"
ts=2019-01-02T18:08:04.586165Z lvl=info msg="Executing query" log_id=0CjPd8DG000 service=query query="SELECT last(Watts) FROM iotaa39.autogen.iotaa39 WHERE device = 'iotaa39'"
ts=2019-01-02T18:08:04.725556Z lvl=info msg="Executing query" log_id=0CjPd8DG000 service=query query="SELECT last(Watts) FROM iotaa39.autogen.iotaa39 WHERE device = 'iotaa39'"
ts=2019-01-03T09:59:06.980442Z lvl=info msg="Executing query" log_id=0ClVDwel000 service=query query="SELECT last(Watts) FROM iotaa39.autogen.iotaa39 WHERE device = 'iotaa39'"
ts=2019-01-03T09:59:13.745951Z lvl=info msg="Executing query" log_id=0ClVDwel000 service=query query="SELECT last(Watts) FROM iotaa39.autogen.iotaa39 WHERE device = 'iotaa39'"
ts=2019-01-03T09:59:36.328734Z lvl=info msg="Executing query" log_id=0ClVDwel000 service=query query="SELECT last(Watts) FROM iotaa39.autogen.iotaa39 WHERE device = 'iotaa39'"

I think we see iotawatt trying to run “last” queries repeatedly. I’m assuming the query succeeds (no errors or long running query logs) but am unsure how to confirm this. I don’t understand the 3-6 tries.


#4

I’ve got to dig into this, so no quick resolution. I looked at you config and I see what’s going on with the multiple seeming identical queries. You are using tags to differentiate measurements and they are not includ3 in the “last” queries. That’s easy to add.

But I don’t think that is causing the timeouts. The more robust detection and retry that I mentioned earlier is probably needed. If you can recreate this issue, I might ask you to try an instrumented version of code as a diagnostic aid.

As an aside, im interested in what you are doing with influx and AWS. I spoke with the influx folks awhile back about provisioning @n AWS instance for multiple users. They didn’t have anything to offer except their enterprise system, but the general idea of having individual databases running in a single instance was discouraged as not very scalable. They recommended using one database and differentiating users with keys. Given that you have had problems with memory, you might want to look into that.

I’ll post again when I look deeper into this problem. Today is a down day.


#5

I think I can recreate, and am happy to try that 'instrumented" version. It would be great to understand the “last()” query appearing to run but a timeout occurring. Maybe the instrumented version (and/more more debug on InfluxDB) could help.

The road to AWS and InfluxDB was a bit tricky, but overall it’s been pretty good. I’m hoping to scale to dozens of users, but obviously need to get things reliable first. Memory use on InfluxDB is a ‘known issue’ and I’m trying different things to help. I’d be up for working together on a “shared tenancy” AWS/InfluxDB/Grafana model.

I’ve actually asked on Influx community about multiple databases and await a reply. Sounds like a “one db is better than many.”

Yes, a tag for the last query makes a lot of sense, particular in light of the above.

Regarding reliability and robustness, there will be times when InfluxDB gives a “client error” to throttle itself, so that it’s doesn’t run out of memory. This means that clients should backoff and try again. I think this is different to a timeout (ie. no response), but perhaps good to consider “robust, persistent but nice” client behavior.


#6

Those queries in your influx log succeeded, except maybe the last one. You have 11 measurements in the configuration. The first 11 in tha log are for a restart on Jan 2, and I think they all worked ok. The last three queries in the log are for Jan 3 at 9:10z and correspond to your IoTaWatt log with the timeout. I can’t tell if it was the last query in the influx log or the next that would have been made for PF. But the point is that the queries that you see logged are OK and the failure is another case that may be fixable in IoTaWatt (or asyncHTTPrequest which is mine as well).

So I’m optimistic.

[quote=“brettbeeson, post:5, topic:595”]
Regarding reliability and robustness, there will be times when InfluxDB gives a “client error” to throttle itself, so that it’s doesn’t run out of memory. This means that clients should backoff and try again. I think this is different to a timeout (ie. no response), but perhaps good to consider “robust, persistent but nice” client behavior[/quote]
I’ll check the handling for that, but I do think the normal retry mechanism for writing data once it’s started will work fine in this case. There is a flow control protocol in PVoutput as well and it works fine.

[quote=“brettbeeson, post:5, topic:595”]
I’d be up for working together on a “shared tenancy” AWS/InfluxDB/Grafana model.[/quote]

I had a conference call with influx a few months back, and they brought some developer talent into the conversation. I couldn’t follow through, but I would be happy to try to put you in contact with those folks to explore the possibilities. I think it’s all about a provisioning front end to administrate access, security, and billing.


#7

Ok, thanks, that sounds promising!

Can you explain the logic for restarting the InfluxDB service? Does it try to restart every 24 hrs or something similar? This is important to me because I have 2 IotaWatts not posting for 6 days and am deciding whether to drive out to them.

Regarding the last() queries: To confirm, if I have 11 measurements, then there are 11 “last()” queries, all of which need to succeed? Perhaps we can write a single, regex-based query to answer the question “what’s the last measurement in the database?” Due to the flexibility of the tag/measurement naming, this might be hard. Maybe one measurement is nominated as the ‘default’? In any case, for my setup (with a “ct” tag to identify each ct), you can get this, at least for Watts (=$units) with:

“select last(Watts) from iotawatt-db where device='iotaa39”


#8

There is no provision to automatically start the service after it stops. You might want to consider using port forwarding on remote systems to enable you to access them remotely until there is some resolution to this situation.

This is an old issue. I had a single query to extract the last post time for all of the measurements, but the response was too long for at least one user and exceeded available heap to receive the response and Json parse it. So this “divide and conquer” approach is intended to reduce heap demand. Moreover there is no reason to believe that one query has less chance of failing than eleven.

I’ve already looked at that code, and done some work to reduce the impact of failure. I will put in a perpetual retry for HTTP related failures. There are a class of errors however - no such database, invalid authentication, that will not resolve over time and will still cause the service to stop.

That really complicates the user interface, and I would hate to have to write up how it works. I think making the multiple queries work is the way forward.


#9

There is no provision to automatically start the service after it stops. You might want to consider using port forwarding on remote systems to enable you to access them remotely until there is some resolution to this situation.

Oh no! That’s a considerable problem for me. I now think the only reason the other units have restarted posting is a hard restart, or a heap-related-restart. If the DB is unreachable (eg failure, maintenance, WiFi off, or other temporary issue), I would like IotaWatt to “try again later”. That is, attempt to restart the service. That might be 1 hour or 1 week later, so a “once an hour” check would be reasonable. This is possible?

I’m open to trying to flash a dev version, or trying other ideas; at the moment I can’t roll out additional units to remote sites. I understand you’ve got a lot on your plate (biggest understatement of 2019 to date)!

Unfortunately, a port-forward would not be feasible for me (multiple router types, admin required and manual login and restart of every unit required) for more than a few units. [ I’m intending to test method of bypassing firewalls for IoT devicse by holding a websocket connection open, similar to reverse SSH forwarding, but that’s an additional service and whole new topic.]

This is an old issue.

That all sounds reasonable, although I would have hoped the response would just be a single time-the-last-measurement. But agreed that the additional complexity would be a pain and not more reliable.


#10

No quick fixes for that. We’re probably talking months before I can rework influx into the framework that was developed for PVoutput and properly test. I’ve already coded a fix to add the all the keys to the query, and will just retry HTTP errors forever. That will go out in the next ALPHA after I burn it in here for a week or more. That’s the best I can do.

AWS is supposed to be like 99.9% available, so maybe lightning won’t strike twice.

Too bad. That seems to work for another user with remote sites and also solves other administration issues.


#11

No quick fixes for that. We’re probably talking months before I can rework influx into the framework that was developed for PVoutput and properly test. I’ve already coded a fix to add the all the keys to the query, and will just retry HTTP errors forever. That will go out in the next ALPHA after I burn it in here for a week or more. That’s the best I can do.

No worries. The “retry HTTP error forever”: will that mean the InfluxDB service keeps trying to get a connection and therefore won’t stop? That would be adequate for me, as units would start posting again after a loss of DB connection. If this understanding is correct, I’ll switch over to ALPHA and can test.

Also, thanks for your quick responses - it makes all the difference!