Influxdb update slower on one vs the other

It’s been a very long time since anyone has reported SDcard problems with IoTaWatt. I’m not seeing any direct evidence of it here.

What you are displaying is the data stored in influx, not IoTaWatt. It may be that your datalog has a corrupted record for July 11, 2020 00:13:50UTC, but the question is how did IoTaWatt pass a value like 3.9759610513759245e+161 to influx? Before I buy into a problem with the IoTaWatt datalog, I’d like to at least look at what it has to say about this data.

So lets take a look. Using the old graph program, can you plot kWh (energy) on the washer on Jan 10-11, 2020? And then separately for MainA? If those big numbers are there, the kWh should jump dramatically at 00:13:50.

The datalog has some really bad data in it. It is about 1GB. I can put it on a share if you want to look at it.

7/10/2020 16:56	806.5	10.2	0.4	233.3
7/10/2020 17:06	-12581673.3	3157623544	0.3	233.9
7/10/2020 17:16	12583987.1	-3157684478	0.3	233.4
7/10/2020 17:26	1056.4	141.7	0.3	58.9

Seems like something got confused on July 10 at around 5 PM.

more better picture

I’d like to see your message log. Can you upload in a PM?

The old graph shows the problem at that instant (local time). I’d like to see that graph of Energy (kWh) not Power (Watts). What you see in Graph+ is not representative. This isn’t about the SDcard, it’s about one or two bad records in the datalog.

When you reformat or replace the SDcard, then copy the datalog files, the problem stayed with the file. There are a couple of solutions to this, but the cleanest is to wipe both datalogs and start over. What is the status of your upload? If you can get the data to influx, you can retain your history there. When you are ready, I’ll tell you how to do it.

I sent you a PM with the message log

Here we have a graph at 60sec samples covering the time period (looks fine)

Now here it is at 30sec samples:


Not sure if that is fine or not.

Here it is overlayed with the power:

I know how to wipe the data and history log and am planning on doing that after the upload is complete and I rearrange some inputs with my new design.

I know that you believe SD card corruption is exceedingly rare in IotaWatt. I won’t dispute that. I have now seen what I believe to be that on at least 3 occasions. But all have been on the same IotaWatt with the same SD card. I have another IotaWatt that has had no issues at all. At least one of those issues involved a significant number of power loss restarts (probably due to a loose USB cable). Pulling power at just the right/wrong time can really cause bad things to happen. So, that explains at least one of them. I have lost confidence in the SD card that was in the device, so I have now replaced it with another one. It is still a cheap Kingston 8GB card, so I will order a new high quality SD card (since I will feel better about that, the IotaWatt might not care :wink: ). I will also hook both of my IotaWatts up to the new mini UPS that I recently bought to eliminate fluctuating power as a potential source of issues.

Many people say that exceedingly rare events happen around me a lot, so don’t read this as I believe IotaWatt is a bad product. I have been very happy with it and it has worked very well and I expect it will continue to do so. Bob has also been very generous with debugging issues that I have experienced

If I try and get the graph to go at a smaller fixed interval, it appears to hang/do nothing. I am pretty sure it is doing a query to get the data. I can try that directly.

Data from this query:
/query?select=[time.local.unix,MainA.wh]&begin=1594409910&end=1594440750&group=5s

baddata.json (288.0 KB)
Looks like this is where it starts going bad:

[1594401090,1],[1594401095,1],[1594401100,1],[1594401105,1],[1594401110,1],[1594401115,1],[1594401120,1],[1594401125,1],[1594401130,1],[1594401135,1],[1594401140,1],[1594401145,1],[1594401150,1],[1594401155,1],[1594401160,1],[1594401165,1],[1594401170,1],[1594401175,1],[1594401180,1],[1594401185,1],
[1594401190,null],
[1594401195,3],
[1594401200,2],[1594401205,-2],
[1594401210,2],[1594401215,-2],
[1594401220,2],[1594401225,-2],
[1594401230,2],
[1594401235,19],[1594401240,-19],
[1594401245,19],[1594401250,-19],
[1594401255,19],[1594401260,-19],
[1594401265,19],

This query did not return a full set of data, so it probably got really confused by what is in the data log. It does appear to be able to skip around the bad part.

Query and Graph+ which uses query are not useful tools for figuring this out. Some of those data points come from the History log and some from the current log. When it all works its great, but what you have there is meaningless.

Take a look at the old plot above and ask yourself why the area after the incident is solid yellow from near zero to 10,000kWh? The answer is that you are getting 30 second intervals, and the even minutes are coming from the history log, and the half-minutes are coming from the current log. So It’s really just a whole lot of up-down-up-down. The question is which one is up and which one is down?

That’s why I asked for a plot at one minute intervals, to figure out exactly where the damage is. You may not need to delete all of your data. The power plot is derived from the energy plot, so that is of no help at all because all of the data is a combination of the two logs.

If you are going to diagnose and fix this yourself, I would caution you to better understand the whole dynamic. Without knowing what is wrong, you may find temporary relief only to have other problems in a few months. The gold standard is to delete both logs simultaneously. There is a command for that. Otherwise, the new log is created using data from the remaining log and you can perpetuate problems.

The IoTaWatt only writes each datalog block twice (holds two records). That’s it, until it wraps and then it writes each one two more times. SDcard wear, and damage, is a function of writes. Most cards can take 100,000 or more writes per block, and they have active wear leveling to remap blocks as they are used heavily. It works fine. There are probably billions of these things out there. If they didn’t work, an alternative would exist.

I didn’t see an ask for 1 min intervals here or in PM, but since you asked for it now, here it is:

Here is the same time period with 5 sec interval:

My plan for deleting was to remove the SD card and delete the datalog, but am open to the correct way to fix this.

As far as SD cards go, from what I have read (and experienced) they actually have a decently high failure rate. There are much better things out there, but they cost a lot more than SD cards. SD cards are designed to be cheap. There are also a large number of cards of questionable quality.

You might remember that the last time I had a problem with this device, the data in the directory/filesystem got corrupted. One of the files got a name with non-standard characters. I also had the data log get corrupted before. That can and does happen when power quality is poor and the logs show that was happening around the time of the last failure.

I am planning on moving some circuits around, so I can reset everything to get a clean start when I do that. I also replaced the SD card so if it was a defective card that problem should also be solved. If there is any more data you need/want, let me know.

Did a little more investigation on the file: iotalog.log
The directory lists the file as 722,109,696 bytes, but when I view the file with a hex editor, the data stops at e0e0dc0 which is 235,802,048. When I try the same thing on the older log file I have, I can see data to the end of the file (107.020,288 bytes). So, it definitely looks like it is not in a happy state.

This IotaWatt is on the Alpha release channel, so maybe that had something to do with it, but I am leaning more towards a combination of bad power and a marginal (at best) SD card.

The other weird thing was that I added Amps to the main channel as something to send to influxdb, but did not change anything else. I expected it would start from where it was currently (September), or from where I told it to (August). But, no, it started from the beginning of the datalog. I just happened to be watching it at the point at which the datalog is bad and it reset back to the beginning of the data log. I then updated the service again and this time it started at August (past the bad data). It looks like both IotaWatts are now progressing at about the same pace, so I think the issue is resolved.

Through some, possibly exceedingly rare, event, my datalog got corrupted. This caused the upload to influxdb to get in a loop. This is probably the edge of a corner case, so it is unlikely anyone else will run into it. It looks like most of the data is fine and the process is working okay now that I skipped over the bad data.

Thought I did, sorry. The one minute plot would be the History log only, and it looks fine for the period in question. It’s quite possible that the History log is OK. Can you plot the kWh of the Main from July to now, and then the an hour today using the old plot?

722Mb would be about 163 days. Sounds right. Don’t know what you are using for an editor, but the IoTaWatt does read parts of the file at startup and it’s unlikely you would still be running if it’s only 235Mb. Also, careful looking at the contents of files written by an ESP8266. It is a little-endian machine so some of the data can be backwards.

This is the kind of speculation I try to discourage here. There have been two Alpha releases since July and both were uneventful and went to full Minor quickly without incident. Maybe an SD card issue, but not at all conclusive.

Here is July to now (looks fine to me):

Here is the last hour (also fine)

There is some bad data in the “history”. Based on what I know about the way the data is saved, the HISTLOG.LOG file seems like it is okay. The iotalog.log appears to have some “not so good” data in it. How that data got in there will probably always be a mystery. Since I do have another IotaWatt and it was running at the same time. I looked at it to see if there was a power outage then (there wasn’t).

Going back to the message log around the time of the event there is nothing particularly interesting:

** Restart **

SD initialized.
6/22/20 20:45:10z Real Time Clock is running. Unix time 1592858710 
6/22/20 20:45:10z Power failure detected.
6/22/20 20:45:10z Reset reason: External System
6/22/20 20:45:10z ESP8266 ChipID: 2222239
6/22/20 20:45:10z IoTaWatt 4.x, Firmware version 02_05_09
6/22/20 20:45:10z SPIFFS mounted.
6/22/20 13:45:10 Local time zone: -8:00
6/22/20 13:45:10 Using Daylight Saving Time (BST) when in effect.
6/22/20 13:45:10 device name: IotaWatt
6/22/20 13:45:13 HTTP server started
6/22/20 13:45:13 WiFi connected. SSID=FrogmoreN66, IP=192.168.1.38, channel=1, RSSI -72db
6/22/20 13:45:13 MDNS responder started for hostname IotaWatt
6/22/20 13:45:13 LLMNR responder started for hostname IotaWatt
6/22/20 13:45:13 timeSync: service started.
6/22/20 13:45:13 statService: started.
6/22/20 13:45:13 Updater: service started. Auto-update class is ALPHA
6/22/20 13:45:14 dataLog: service started.
6/22/20 13:45:14 dataLog: New current log created.
6/22/20 13:45:14 dataLog: Last history entry: 06/15/20 21:39:00
6/22/20 13:45:15 Updater: Auto-update is current for class ALPHA.
6/22/20 13:47:13 historyLog: service started.
6/22/20 13:47:13 historyLog: Last log entry 06/15/20 21:39:00
6/24/20 07:49:29 Updater: Invalid response from server. HTTPcode: -4
7/10/20 22:29:32 Updater: Invalid response from server. HTTPcode: -14
7/15/20 02:39:22 Updater: Invalid response from server. HTTPcode: -4
8/03/20 14:32:16 timeSync: Six week routine restart.

The log starts when I powered up the IotaWatt with the datalog removed (from the previous time it got in a bad state). I did not clear the history log. It looks like a mystery. I am willing to accept that. Stuff happens.

Yes, that is just history log.

That is a combination of history log and current log data. It does also look fine. That means that the two logs are in sync. It also means that the damage probably happened after July 10 because the history log is just a subset of the current log records, unaltered except for a sequence number.

So given that you have a new SDcard, deleting only the current log should be OK and will retain all of your history in the history log. You can remove the SDcard and manually delete it, or you can use the command

http://<device name>.local/command?deletelog=current

Careful with this and be sure you address the correct unit as it is not reversible.

@frogmore Are you able to share your influx config? I’m also looking to set that up?
Thanks

Look here for a start:

Well, the plot thickens. I noticed that the uploading progress stalled again. Sure enough Grafana showed an annomoly at 11:10AM on 10/30. So I stopped the sending (which had restarted back at the beginning of the datalog and not at the point that I told it to start at). I set the start date a few days ahead and the upload is continuing. I ran some queries and this is what I see in the data:

> select * from iotawatt where Watts < 0
name: iotawatt
time                Amps Volts Watts                    channel   device
----                ---- ----- -----                    -------   ------
1592858730000000000            -2636                    Barn      IotaWatt
1592858730000000000            -1.6535469126360393e+51  Utility3  IotaWatt
1594426420000000000            -43.18                   HPWH_A    IotaWatt
1594426420000000000            -70208482881.11          Utility1  IotaWatt
1599354900000000000            -5.760780923401527e+164  Family    IotaWatB
1599354900000000000            -1.894185814057058e+280  k15       IotaWatB
1600551490000000000            -649.18                  Bed23     IotaWatt
1600551490000000000            -93103324898650900000    k13       IotaWatt
1600551490000000000            -29.48                   WH_Well_A IotaWatt
1600551490000000000            -63.59                   Utility3  IotaWatt
1600551490000000000            -29.95                   RangeVent IotaWatt
1600551490000000000            -23.32                   HPWH_A    IotaWatt
1600551490000000000            -639.95                  Dryer_A   IotaWatt
1602377000000000000            -5.439967369313442e+164  Family    IotaWatB
1602377000000000000            -1.7887000316273969e+280 k15       IotaWatB
1604081460000000000            -188951526.75            MB_Floor  IotaWatt
1604081470000000000            -1.3228952450799069e+51  Utility3  IotaWatt
1604081470000000000            -2093.83                 Barn      IotaWatt
> select * from iotawatt where Watts > 10000
name: iotawatt
time                Amps  Volts Watts                  channel  device
----                ----  ----- -----                  -------  ------
1592858730000000000             5.621438149643385e+227 Dryer_A  IotaWatt
1594426420000000000             779295162956.35        MB_Floor IotaWatt
1594426420000000000             1.025749877095174e+114 k13      IotaWatt
1594426420000000000             27401210191.42         Washer   IotaWatt
1599354900000000000             1.2710192031453963e+73 BarnB    IotaWatB
1602377000000000000             1.2002371002851973e+73 BarnB    IotaWatB
1604081460000000000             117286850979.92        Dryer_A  IotaWatt
1604081460000000000 4.603       23813.22               MainA    IotaWatt
1604081460000000000             8.991434343761321e+85  HPWH_A   IotaWatt
1604081470000000000             4.49734672892881e+227  Dryer_A  IotaWatt
1604081470000000000             1.6003708857643472e+79 HPWH_A   IotaWatt
> 

IotaWatt is LegA and IotaWatB (8 char limit on the name in IotaWatt) is LegB. This time it shows some clearly incorrect values from IotaWatB. This one has never given me a problem. There are two timestamps that have an issue on Leg B:
1599354900000000000 Saturday, September 5, 2020 6:15:00 PM [GMT-07:00] DST
1602377000000000000 Saturday, October 10, 2020 5:43:20 PM [GMT-07:00] DST

So, it looks like there is something going on. The Sept 5 one is a little before the 6 week automatic restart, but that is probably just a coincidence. The Oct 10 one does have anything interesting around that time, but the next day it did update the FW.

I think I will be done uploading all the data to influxdb in the next day or so. After that, I will rearrange things and after that delete the data and history logs on both devices. I am seeing the following issues:

  1. Bad data or bad measurements are either being written to the datalog, or the datalog is getting corrupted.
  2. When uploading data to influxdb, if it hits bad data, it restarts the upload process
  3. When it restarts the upload process it restarts from the beginning of the datalog, and not the specified start date
  4. It doesn’t log anything in the message log when it restarts the upload process

It seems the error does not hit all channels at the same time. It is possible that there are more errors. I just searched for obviously wrong values. I don’t have solar, so I left all the channels at the default (allow negative power is unchecked). This means all the negative values are definitely wrong.

Many of the timestamps occurred with nothing interesting around them, so not clear what caused them. Could be a bug, but I doubt this one is going to be easy to figure out.

This is a little unsettling, but I don’t there is enough here to support your conclusions. I have some questions that I’d like to resolve.

What version of influx are you using? The reason I ask is that I don’t know how I could post -5.760780923401527e+164 if I wanted to. Maybe C++ is taking liberties with printf, but I’ll need to see that.

I’m starting to suspect there may be a bug in the datalog search algorithm that is returning the wrong data. It’s definitely a possibility, although this would be the first incident noticed. There is no index for the datalog. It is chronological and so can be searched pretty efficiently. But perhaps there is a boundary condition…

The restarting at the beginning is a separate issue. That’s why I’m asking about the influx version. I’m wondering if there is a translation between the V1 query and flux, and if that might be working improperly because a flux query is being created with too short a look-back window.

There’s a lot going on here, and with a newer version of influx, it may be we’re looking at compound issues. I definitely want to look at the IoTaWatt data from the other machine. I’ll post the date/times and what I’d like to see.

I agree that it is unsettling and I am committed to helping figure out what happened. So I will do what I can to get you the information that you need. I haven’t looked at the data structures used in the log so can’t really explain what might have happened. I also agree that there is more than one issue here, which makes it complicated to figure out what is going on.

So, first question. I am using Influx 1.8.3,
https://dl.influxdata.com/influxdb/releases/influxdb_1.8.3_amd64.deb

I am running this on Ubuntu 20.04 in an LXC (container)

The datalog is showing strange numbers in the graphs for the sample points I looked at. But, tell me what you want to see and I will check it out.

Good news, IotaWatB completed the catch up and got current yesterday. IotaWatt got current sometime in the evening and now I have lots of data in influxdb.