Broken firmware?

Hi,
My IotaWatt stopped working this morning. First symptoms were that I was unable to go into web configuration or status (graphs still worked, so did the built-in file viewer).
I have power cycled the device and it no longer listens on the web interface.
It still does connect to my WiFi, but as “ESP_xxxxxx” (6 hex digits), no longer under my custom name. It still gets the same IP address, but no web interface pops up.
The device has the following blink pattern: “green faint-red faint-red green” with slight delay between cycles.
I believe it had the firmware it came with, that is 02_04_00. Before power cycling it, I looked in the logs and noticed a logged exception. Unfortunately, I didn’t save the log before power cycling, so I have lost access to it.

How do I restore my device to a working condition? Can I just upload the latest firmware.bin from GitHub somehow?

Update: I’ve got esptool and dumped the SPI flash first - the firmware (first 0.5MB of the flash) is pristine, no point in reflashing it.
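For anyone repeating this check, here is a minimal Python sketch of comparing a flash dump against a released image. The file names in the comments are hypothetical, and it assumes the firmware image sits at offset 0 of the dump, as described above. A difference confined to the first few header bytes may be benign, since flashing tools can rewrite the flash mode and size fields there.

```python
from pathlib import Path


def first_mismatch(dump: bytes, firmware: bytes, offset: int = 0):
    """Return the index (relative to the firmware image) of the first byte
    that differs from the dump, or None if the whole image matches.

    If the dump is too short to contain the image, the index where the
    dump runs out is returned.
    """
    region = dump[offset:offset + len(firmware)]
    if len(region) < len(firmware):
        return len(region)
    for i, (a, b) in enumerate(zip(region, firmware)):
        if a != b:
            return i
    return None


# Example use (hypothetical file names):
#   dump = Path("dump.bin").read_bytes()          # e.g. from esptool's read_flash
#   firmware = Path("firmware.bin").read_bytes()  # released image
#   print(first_mismatch(dump, firmware))         # None means the image is intact
```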
I have also checked the sources, the blink pattern is apparently bad config:

#define LED_BAD_CONFIG "G.R.R.G..." // Could not parse config file

Where is the config file, on SPI flash or on the SD card? (Next, I’ll open the device and check the SD card.) Since the device could still connect to the WiFi (so apparently that part of the config is intact), it would be nice if it let me clear or fix the config over the web interface…

The config file is in the root directory of the SD card: CONFIG.TXT
If the file exists, it is probably corrupted. Use a JSON lint program to check the JSON.
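If no lint tool is at hand, the same check can be sketched in a few lines of Python using only the standard library. This only verifies that the file is present, non-empty, and parseable as JSON; it does not know anything about which keys IoTaWatt expects. The path in the usage comment is an assumption about where the SD card is mounted.

```python
import json
from pathlib import Path


def check_config(path):
    """Return (ok, message) describing whether `path` holds parseable JSON."""
    p = Path(path)
    if not p.exists():
        return False, "missing"
    text = p.read_text(encoding="utf-8", errors="replace")
    if not text.strip():
        return False, "empty"
    try:
        json.loads(text)
    except json.JSONDecodeError as e:
        return False, f"invalid JSON at line {e.lineno}, column {e.colno}: {e.msg}"
    return True, "ok"


# Example use (hypothetical mount point for the SD card):
#   print(check_config("/media/sdcard/CONFIG.TXT"))
```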

Also, look in the iotawatt directory for the message log: IOTAWATT/IOTAMSGS.TXT
Please upload that file.

This sounds like it started as a poor WiFi issue.

I found the file on the card; it was empty. I tried deleting it first, but that just triggered another error pattern (“G.R.R.R…”) which decoded to, lo and behold, missing config.
Next, I placed the default config from GitHub on the card, which finally made the IotaWatt happy, if a little lost, since I obviously had to reconstruct all my settings. Well, now I keep a backup of that precious file with all my inputs configured.

Now to the failure mode: Why do you think it was poor WiFi? I have had the setup working the same way (network-wise) for the last 3 weeks, with no connectivity issues, and the data has happily streamed to my influxDB. I was making some configuration changes when this happened, namely adding more streams to the influxDB export. I had wired the PV inverter CT the day before and wanted to adjust the reporting, so I modified both Outputs and Web several times before the issue.

Are you suggesting that a bad connection during config modification could cause a config file corruption? That could easily become a support nightmare for you as product sales grow, especially when such a failure mode brings the device to a state where the user can’t fix it over the web interface. At the risk of stating the obvious, it would be much better to boot to a minimum working state (no logging, web up allowing config reset or the file manager), or at least to reset to factory defaults.

(Not complaining, I can deal with stuff and in general I am very pleased with the functionality, especially since I haven’t had to design and code it all myself.)

Relevant parts of the log are below:

7/07/19 09:44:27 WiFi disconnected.
7/07/19 09:46:04 WiFi connected. SSID=Doma, IP=192.168.55.253, channel=11, RSSI -77db
7/07/19 11:27:38 influxDB: Restart. Last post 07/07/19 11:27:20
7/07/19 11:27:38 influxDB: started, url=192.168.55.23:8086, db=iotawatt, interval=10
7/07/19 11:27:38 influxDB: Start posting at 07/07/19 11:27:30
7/07/19 11:27:50 influxDB: Stopped. Last post 07/07/19 11:27:20
7/07/19 11:29:03 influxDB: started, url=192.168.55.23:8086, db=iotawatt, interval=10
7/07/19 11:29:04 influxDB: Start posting at 06/01/19 01:00:10
7/07/19 18:44:53 influxDB: Restart. Last post 07/07/19 18:44:40
7/07/19 18:44:53 influxDB: started, url=192.168.55.23:8086, db=iotawatt, interval=10
7/07/19 18:44:54 influxDB: Start posting at 07/07/19 18:44:50

** Restart **

SD initialized.
7/08/19 02:35:54z Real Time Clock is running. Unix time 1562553354 
7/08/19 02:35:54z Reset reason: Hardware Watchdog
7/08/19 02:35:54z Trace:  8:8, 8:9, 9:3, 9:5, 9:9, 1:2, 1:3, 1:4, 1:5[19], 1:6, 1:1[1], 1:2[2], 9:0[2], 9:0, 9:1, 8:4, 8:6, 8:8, 8:9, 9:3, 9:5, 9:9, 1:2, 1:3, 1:4, 1:5[21], 1:6, 1:1[2], 1:2[3], 9:0[3], 9:0, 9:1
7/08/19 02:35:54z ESP8266 ChipID: 2518867
7/08/19 02:35:54z IoTaWatt 5.0, Firmware version 02_04_00
7/08/19 02:35:54z SPIFFS mounted.
7/07/19 19:35:55 Local time zone: -8:00
7/07/19 19:35:55 Using Daylight Saving Time (BST) when in effect.
7/07/19 19:35:55 device name: Merak
7/07/19 19:35:55 MDNS responder started for hostname Merak
7/07/19 19:35:55 LLMNR responder started for hostname Merak
7/07/19 19:35:55 HTTP server started
7/07/19 19:35:55 timeSync: service started.
7/07/19 19:35:56 statService: started.
7/07/19 19:35:56 Updater: service started. Auto-update class is MINOR
7/07/19 19:35:56 dataLog: service started.
7/07/19 19:35:56 dataLog: Last log entry 07/07/19 19:35:45
7/07/19 19:35:56 historyLog: service started.
7/07/19 19:35:57 historyLog: Last log entry 07/07/19 19:35:00
7/07/19 19:35:57 WiFi connected. SSID=Doma, IP=192.168.55.253, channel=11, RSSI -81db
7/07/19 19:35:59 Updater: Auto-update is current for class MINOR.
7/07/19 19:36:00 influxDB: started, url=192.168.55.23:8086, db=iotawatt, interval=10
7/07/19 19:36:00 influxDB: Start posting at 07/07/19 19:35:50
7/07/19 23:36:34 Updater: Invalid response from server. HTTPcode: -11
7/08/19 21:38:38 Updater: Invalid response from server. HTTPcode: -11
7/09/19 08:39:33 Updater: Invalid response from server. HTTPcode: -4
7/09/19 20:35:34 WiFi disconnected.
7/09/19 20:37:01 WiFi connected. SSID=Doma, IP=192.168.55.253, channel=1, RSSI -80db
7/12/19 02:45:29 Updater: Invalid response from server. HTTPcode: -14
7/13/19 17:42:56 influxDB: Restart. Last post 07/13/19 17:42:40
7/13/19 17:42:56 influxDB: started, url=192.168.55.23:8086, db=iotawatt, interval=10
7/13/19 17:42:57 influxDB: Start posting at 07/13/19 17:42:50

** Restart **

SD initialized.
7/14/19 03:24:43z Real Time Clock is running. Unix time 1563074683 
7/14/19 03:24:43z Reset reason: Exception
7/14/19 03:24:43z Trace:  15:1, 15:1, 15:0, 15:1, 15:1, 15:0, 15:1, 15:1, 15:0, 15:1, 15:1, 15:0, 15:1, 15:1, 15:0, 15:1, 15:1, 15:0, 15:1, 15:1, 15:0, 15:1, 15:1, 15:0, 15:1, 15:1, 15:0, 15:1, 15:1, 15:0, 15:1, 15:1
7/14/19 03:24:43z ESP8266 ChipID: 2518867
7/14/19 03:24:43z IoTaWatt 5.0, Firmware version 02_04_00
7/14/19 03:24:43z SPIFFS mounted.
7/13/19 20:24:44 Local time zone: -8:00
7/13/19 20:24:44 Using Daylight Saving Time (BST) when in effect.
7/13/19 20:24:44 device name: Merak
7/13/19 20:24:44 MDNS responder started for hostname Merak
7/13/19 20:24:44 LLMNR responder started for hostname Merak
7/13/19 20:24:44 HTTP server started
7/13/19 20:24:44 timeSync: service started.
7/13/19 20:24:45 statService: started.
7/13/19 20:24:45 dataLog: service started.
7/13/19 20:24:45 dataLog: Last log entry 07/13/19 20:24:40
7/13/19 20:24:45 historyLog: service started.
7/13/19 20:24:46 historyLog: Last log entry 07/13/19 20:24:00
7/13/19 20:24:47 Updater: service started. Auto-update class is MINOR
7/13/19 20:24:48 WiFi connected. SSID=Doma, IP=192.168.55.253, channel=1, RSSI -78db
7/13/19 20:24:48 Updater: Auto-update is current for class MINOR.
7/13/19 20:24:49 influxDB: started, url=192.168.55.23:8086, db=iotawatt, interval=10
7/13/19 20:24:49 influxDB: Start posting at 07/13/19 20:23:50
7/14/19 08:08:15 influxDB: Restart. Last post 07/14/19 08:07:40
7/14/19 08:08:15 influxDB: started, url=192.168.55.23:8086, db=iotawatt, interval=10
7/14/19 08:08:16 influxDB: Start posting at 07/14/19 08:07:50

** Restart **

SD initialized.
7/14/19 16:29:02z Real Time Clock is running. Unix time 1563121742 
7/14/19 16:29:02z Power failure detected.
7/14/19 16:29:02z Reset reason: External System
7/14/19 16:29:02z ESP8266 ChipID: 2518867
7/14/19 16:29:02z IoTaWatt 5.0, Firmware version 02_04_00
7/14/19 16:29:02z SPIFFS mounted.
7/14/19 16:29:02z Config file parse failed.
7/14/19 16:29:02z Program halted.
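To pull the restart causes out of a long IOTAMSGS.TXT without reading it line by line, a small Python sketch like the following may help. The regular expression is an assumption based only on the timestamp and "Reset reason:" format visible in the excerpt above; other firmware versions may log differently.

```python
import re

# Matches lines such as "7/14/19 03:24:43z Reset reason: Exception".
RESET_RE = re.compile(
    r"^(\d{1,2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})z? Reset reason: (.+?)\s*$"
)


def reset_reasons(log_text):
    """Return a list of (timestamp, reason) for every 'Reset reason' line."""
    found = []
    for line in log_text.splitlines():
        m = RESET_RE.match(line.strip())
        if m:
            found.append((m.group(1), m.group(2)))
    return found
```

Applied to the excerpt above, this would list the Hardware Watchdog, Exception, and External System resets in order.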

BTW: Not sure if I did that before the issue or whether it could somehow trigger the issue, but during configuring my outputs and exports, I might have inadvertently typed A + B / C (think changing Grid = “Phase_1 + Phase_2 - Solar” into “Phase_1 + Phase_2 + Solar”, since my Solar is correctly negative), hitting the division sign instead of the plus sign on the calculator. Would the firmware handle division by zero gracefully in such a case?

Good to see you got it running.

Those codes are described in the documentation under “troubleshooting”.

I’d like to do that under the hood, but it’s not as easy as it sounds. There is no rename capability in the SD file system. So when updating the file, it’s not as simple as writing the new file, checking its integrity, then deleting the old file and renaming the new one. To be sure, there are other (and better) ways to accomplish that, and it’s on the long list, but this failure just hasn’t been a big issue. This is the first field report of it happening directly from the config app.

Because the config app works by writing an entirely new config each time anything changes. As above, there is no way to rename a file, so the old file is deleted, then the new file is received and written. You can see there is a window of exposure if the payload of the write doesn’t make it. Memory is a huge issue, and it’s not possible to reliably buffer the new file before deleting the old. I have a design to not only handle this, but also to make the switch an atomic operation. Just need to get around to it.
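The design mentioned here is not described in the thread, so the following is only an illustration of one common way to close such a window without a rename operation: keep two config slots and always overwrite the older one. It is a Python simulation, not firmware code, and the slot file names, header layout, and CRC check are all assumptions made for the sketch.

```python
import json
import zlib
from pathlib import Path

SLOTS = ("CONFIG_A.TXT", "CONFIG_B.TXT")  # hypothetical file names


def _read_slot(path):
    """Return (sequence, config) if the slot is intact, else None."""
    try:
        header, body = path.read_bytes().split(b"\n", 1)
        seq, crc = (int(x) for x in header.split())
        if zlib.crc32(body) != crc:
            return None  # truncated or corrupted payload
        return seq, json.loads(body)
    except (OSError, ValueError):
        return None  # missing, empty, or malformed slot


def load_config(directory):
    """Return the newest intact config, or None if no slot is valid."""
    slots = [_read_slot(Path(directory) / name) for name in SLOTS]
    valid = [s for s in slots if s is not None]
    return max(valid, key=lambda s: s[0])[1] if valid else None


def save_config(directory, config):
    """Overwrite the older (or invalid) slot; the newest intact slot is untouched."""
    directory = Path(directory)
    slots = [_read_slot(directory / name) for name in SLOTS]
    seqs = [s[0] if s else -1 for s in slots]
    target = seqs.index(min(seqs))  # oldest or invalid slot
    body = json.dumps(config).encode()
    header = f"{max(seqs) + 1} {zlib.crc32(body)}\n".encode()
    (directory / SLOTS[target]).write_bytes(header + body)
```

If a write is interrupted, the damaged slot fails its check and the loader falls back to the previous config. This sketch builds the whole payload in memory, which is exactly the constraint described above; on the device the checksum would presumably have to be computed while streaming.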

Absolutely, but there are some issues with bringing up the web server before the configuration is processed. Hindsight is 20-20. You’ve been digging through the code; it’s a WIP. Trading one problem for another isn’t progress. The way forward is to redesign some low-level components, but with thousands of users around the world auto-updating, I need to be careful.

case opDiv: return operand == 0 ? 0 : result / operand;

Who needs documentation when there is source code… /s
(No idea how I missed the LED codes in the troubleshooting section. Perhaps because I read all of the documentation a month ago, while eagerly awaiting the device.)
Fully understood your “tread carefully” approach.

Thanks for the great support!
