OC200's Batch Software Upgrade feature for APs may need some improvement in reliability
One of our controllers manages 48 EAP-225 Outdoor APs. This morning, when we logged into the controller's cloud interface, it showed that a new firmware build dated 2020-Jan-13 was available, and we used "Batch Upgrade" to upgrade all 48 EAPs. However, the upgrade process did not finish cleanly: it failed for 8 APs and succeeded for the rest. The controller wrote the following "alert" logs for these APs:
1 2020-03-09T20:58:37.329Z - Omada Controller - - Bougenvilla-C (Towards Park)(98-DA-C4-97-02-30) upgrade failed
1 2020-03-09T20:58:37.329Z - Omada Controller - - Bougenvilla-C(98-DA-C4-58-9F-F2) upgrade failed
1 2020-03-09T20:58:44.821Z - Omada Controller - - Bougenvilla-B(98-DA-C4-58-84-20) upgrade failed
1 2020-03-09T20:59:06.337Z - Omada Controller - - Jacaranda-K(98-DA-C4-96-F9-02) upgrade failed
1 2020-03-09T20:59:06.344Z - Omada Controller - - Jacaranda-H(98-DA-C4-97-02-2E) upgrade failed
1 2020-03-09T20:59:06.351Z - Omada Controller - - Clubhouse Entrance Front(98-DA-C4-97-09-8C) upgrade failed
2020-03-09T20:59:07.356Z - Omada Controller - - Mayflower-S(98-DA-C4-58-9C-FE) upgrade failed
1 2020-03-09T20:59:14.363Z - Omada Controller - - Mayflower-Q-R(98-DA-C4-96-FA-B8) upgrade failed
The above logs appeared early and clustered together, in the initial part of the overall upgrade process. In fact, no successful upgrade log message appeared before these failures. It almost looks as if the controller itself aborted the upgrade for some APs due to some type of overload condition.
For the remaining 40, the upgrade completed successfully, and the following types of "informational" logs (not the full list) were written subsequently:
1 2020-03-09T21:00:13.223Z - Omada Controller - - Hibiscus-M(98-DA-C4-58-88-8E) upgraded to 2.7.0 Build 20200113 Rel. 38287 successfully
1 2020-03-09T21:00:13.228Z - Omada Controller - - Lotus-Hibiscus Walkway(98-DA-C4-97-09-62) upgraded to 2.7.0 Build 20200113 Rel. 38287 successfully
1 2020-03-09T21:00:14.872Z - Omada Controller - - Lotus Approach(98-DA-C4-97-08-B0) upgraded to 2.7.0 Build 20200113 Rel. 38287 successfully
1 2020-03-09T21:00:16.143Z - Omada Controller - - Bougenvilla-A(98-DA-C4-96-FB-F4) upgraded to 2.7.0 Build 20200113 Rel. 38287 successfully
1 2020-03-09T21:00:16.156Z - Omada Controller - - Jacaranda-G-H(98-DA-C4-97-09-D2) upgraded to 2.7.0 Build 20200113 Rel. 38287 successfully
1 2020-03-09T21:00:17.171Z - Omada Controller - - Bougenvilla-F(98-DA-C4-58-84-18) upgraded to 2.7.0 Build 20200113 Rel. 38287 successfully
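For anyone who wants to tally an exported log the same way, here is a minimal Python sketch. It assumes the log lines look exactly like the ones pasted above (timestamp, "Omada Controller", AP name, MAC in parentheses, then the result text); the regular expression and the `summarize` function are my own illustration, not anything provided by the controller.

```python
import re
from collections import Counter

# Assumed line layout, taken from the log excerpts in this post:
#   <timestamp> - Omada Controller - - <AP name>(<MAC>) <result>
LINE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T[\d:.]+Z) - Omada Controller - - "
    r"(?P<name>.+)\((?P<mac>(?:[0-9A-F]{2}-){5}[0-9A-F]{2})\) "
    r"(?P<result>upgrade failed|upgraded to .+ successfully)$"
)


def summarize(lines):
    """Count failed and successful upgrades; also return the failed APs."""
    counts = Counter()
    failed = []
    for line in lines:
        match = LINE.search(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed layout
        if match["result"] == "upgrade failed":
            counts["failed"] += 1
            failed.append((match["ts"], match["name"], match["mac"]))
        else:
            counts["succeeded"] += 1
    return counts, failed
```

Sorting the `failed` list by timestamp makes it easy to see whether the failures all fall in the first seconds of the batch, as they did here.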
I then manually upgraded the 8 failed APs one by one from the web interface, and the upgrade completed successfully for all 8.
Firmware upgrade is a reasonably parallel operation, except where it is constrained by uploading the new firmware to each device in the early stages. If the controller hardware cannot handle that many concurrent upgrade tasks, it may be better for it to process the upgrades in blocks of 10 or 15 APs: complete one block reliably and safely, then move on to the next, until every AP queued for upgrade (the "batch upgrade" pool) has been upgraded successfully. Can something like this be done to improve the overall reliability of the batch upgrade process?
This is the second time I have observed this kind of failure on this WLAN network. The upgrade was done entirely during off-peak hours, with minimal network load.
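To make the suggestion concrete, here is a rough Python sketch of block-wise upgrading with a retry per block. It is only an illustration of the idea, not how the OC200 actually works: `upgrade_one` is a hypothetical callback standing in for whatever pushes firmware to a single AP and reports success, and `chunk_size` and `max_retries` are made-up parameter names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Tuple


def chunked(items: List[str], size: int) -> Iterable[List[str]]:
    """Yield consecutive blocks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def batch_upgrade(
    aps: List[str],
    upgrade_one: Callable[[str], bool],  # hypothetical: True on success
    chunk_size: int = 10,
    max_retries: int = 1,
) -> Tuple[List[str], List[str]]:
    """Upgrade APs block by block, retrying failures before moving on."""
    succeeded: List[str] = []
    failed: List[str] = []
    for chunk in chunked(aps, chunk_size):
        pending = list(chunk)
        for _attempt in range(max_retries + 1):
            if not pending:
                break
            # Only the APs of the current block are upgraded concurrently.
            with ThreadPoolExecutor(max_workers=len(pending)) as pool:
                results = list(pool.map(upgrade_one, pending))
            succeeded.extend(ap for ap, ok in zip(pending, results) if ok)
            pending = [ap for ap, ok in zip(pending, results) if not ok]
        # Whatever still fails after the retries is reported, not dropped.
        failed.extend(pending)
    return succeeded, failed
```

With 48 APs and `chunk_size=10`, this would run five blocks (10, 10, 10, 10, 8), so the controller never handles more than 10 uploads at once, and an AP that fails in the initial rush gets a second attempt automatically instead of needing a manual upgrade.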