Update on log file corruption issue

overeasy · October 9, 2018, 1:00am

This past weekend, I rolled out 02_03_16 to MINOR and MAJOR users, completing the conversion of all auto-update users to this release. It had been running without incident for several weeks in most ALPHA systems.

There followed three reports of various symptoms, all related to corruption of the current and/or history log. All of those users are back up and running, but they did lose some data and two lost all of their local history. It’s quiet now, and I think the worst is over.

Today I spent the day researching this issue. Here’s what I found:

Both of the systems for which I received diagnostic files (02_03_16 produces them when a log becomes corrupted) were total trainwrecks. I began to question the diagnostic file generator. I cannot understand how these systems were working at all, yet both reported running since the beginning of the year, and that was borne out by the diagnostic files.

So I took a look at two systems here in New England that have been running for about the same length of time. Both were fine.

Today I removed the SDcards from the three production-type systems that I have, extracted the datalogs, and ran the diagnostic scans. Both the current logs and history logs were perfect. My home system has been running for nearly two years, and I did finally lose the current log about two months ago due to an old corruption from a test firmware that never went public. The history log was fine. (Can you tell I use a heatpump?)

In analysing the failures, I discovered that 02_03_16 didn’t appear to cause the corruption; it just identified it and recognized that it couldn’t proceed.

I suspect one of two causes for the problems:

It could be that the damage was caused by some changes to the datalog class in 02_03_13 for the brief time it was out there, although none of my systems or the local ALPHA users had trouble. Those changes were backed out.

The other possibility, which I haven’t had a chance to follow up on, is that the SDcards were either defective or too small. I’ll check on that going forward. The only reason I suspect the card capacity is that they all seem to be systems that started up last January. I would expect a more even distribution if the cause were a software release. If that is the problem, it is already solved because new IoTaWatt units have an 8GB card from the factory.

So I’m now pretty confident that going forward, things will be stable. There may yet be another one or two users that haven’t noticed their systems stopped, but even those will probably be automatically recovered by 02_03_16.

I’ll keep you posted.