Objectively Removing Producers
Put on your propeller hats - this might get a tad technical…
It is well known that the EOSIO software by default has 21 active block producers (BPs) and many standby BPs. These standby producers are ready to begin producing blocks should a hardware or network failure hit one of the top 21 producers. What is probably less well known is that there is no automated failover mechanism to trigger this. When one of the 21 BPs fails and misses one or more rounds, the network effectively pauses for 6 seconds every 126 seconds: a round gives each of the 21 BPs a 6-second turn, and the failed BP's turn passes without any blocks. During those 6 seconds, transactions can still be submitted, but they queue up and are not added to a block until the next BP in the schedule picks them up. The delay continues until the non-producing BP fixes the issue, gets voted out, or temporarily removes itself (unregprod). The only other way to remove a BP in this state is for at least 15 other BPs to agree to temporarily remove it (rmvproducer).
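The 6-second and 126-second figures can be reproduced with a little arithmetic. The sketch below assumes the usual EOSIO defaults of half-second blocks and 12 consecutive blocks per producer turn, and it assumes the 15-approval requirement comes from a two-thirds-plus-one rule over the 21 active BPs. Neither assumption is spelled out above, and the function names are ours, purely for illustration.

```python
# Assumed EOSIO defaults (not stated in this post): 0.5 s blocks, 12 blocks per turn.
BLOCK_INTERVAL_S = 0.5
BLOCKS_PER_TURN = 12
ACTIVE_PRODUCERS = 21


def turn_seconds() -> float:
    """Seconds a single producer is responsible for in each round."""
    return BLOCK_INTERVAL_S * BLOCKS_PER_TURN


def round_seconds() -> float:
    """Seconds for every active producer to take one turn."""
    return turn_seconds() * ACTIVE_PRODUCERS


def stalled_fraction(missing: int = 1) -> float:
    """Share of each round with no new blocks when `missing` producers are down."""
    return missing * turn_seconds() / round_seconds()


def removal_threshold(active: int = ACTIVE_PRODUCERS) -> int:
    """Approvals needed under an assumed two-thirds-plus-one rule."""
    return active * 2 // 3 + 1


print(turn_seconds(), round_seconds(), removal_threshold())
print(f"{stalled_fraction():.1%} of each round is stalled")
```

With one BP down, roughly one twenty-first of every round (a little under 5%) passes with no new blocks.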
Historically, BPs have done a good job recovering from these outages quickly. Using our Reliability Tracker tool, we can see that the average recovery time on the EOS Mainnet is about 3.8 rounds. Measured since launch, including some bumpy early days of bugs and instability, that works out to about 8 minutes from when a BP starts missing rounds to when it resumes. If we focus on the last 90 days, the time drops to about 4 minutes.
We watch this data closely at Aloha EOS. What concerns us are not the averages, but the outliers. On EOS Mainnet, for example, the longest single BP outage is 198 rounds in a row, or just under 7 hours. There have been 27 BP outages lasting over 100 rounds in a row, which is more than 3.5 hours each. These aren't common, but we think the network should have solutions in place to recover more quickly than that. An automated failover to a standby BP would be ideal, and there has been much discussion (and even some sidechain implementations), but these things take time to implement and mature on EOS Mainnet.
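For anyone who wants to check the round-to-time conversions above, here is a small sketch. It assumes the 126-second round mentioned earlier; the function name and labels are ours, purely for illustration.

```python
from datetime import timedelta

ROUND_SECONDS = 126  # one full pass through the 21-producer schedule


def outage_duration(missed_rounds: float) -> timedelta:
    """Convert a count of consecutive missed rounds into wall-clock time."""
    return timedelta(seconds=missed_rounds * ROUND_SECONDS)


figures = [
    ("average recovery", 3.8),
    ("100-round outage", 100),
    ("longest outage", 198),
]
for label, rounds in figures:
    print(f"{label}: {rounds} rounds = {outage_duration(rounds)}")
```

A 100-round outage comes out at exactly 3 hours 30 minutes, and 198 rounds at 6 hours 55 minutes 48 seconds, which matches the "just under 7 hours" figure.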
To help in the interim, today we are excited to announce a Telegram bot version of our Reliability Tracker tool. The bot announces each round missed by a BP and posts a summary when the BP either leaves the schedule or comes back online. Additionally, after three missed rounds, we automatically create a multisig proposal to temporarily remove the BP from the active schedule. These proposals still need to be approved by 15 or more BPs to be executed, but we feel that creating them in an objective, rule-based manner makes it more likely that BPs will approve quickly. It is important to note that removing a BP from the schedule (rmvproducer) is temporary and does not impact votes for the BP. After they fix the issue, they can simply register again (regproducer), and they will return to their voted position.
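The rule itself is simple enough to sketch in a few lines. What follows is an illustration of the logic described above, not the bot's actual code. The MissTracker class, its callbacks, and the producer names are all hypothetical, and the real Telegram posting and multisig proposal steps are replaced by plain Python callables.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set

PROPOSE_AFTER_ROUNDS = 3  # missed rounds before a removal proposal is created
APPROVALS_REQUIRED = 15   # BP approvals needed before the proposal can execute


@dataclass
class MissTracker:
    """Counts consecutive missed rounds per producer (hypothetical helper)."""
    propose: Callable[[str], None]           # stand-in for creating the multisig proposal
    announce: Callable[[str], None] = print  # stand-in for posting to Telegram
    missed: Dict[str, int] = field(default_factory=dict)

    def record_round(self, schedule: List[str], produced: Set[str]) -> None:
        """Update the counters after one completed round of the schedule."""
        for bp in schedule:
            if bp in produced:
                # Recovery: clear the counter and say so if the BP had been missing.
                if self.missed.pop(bp, 0):
                    self.announce(f"{bp} is producing again")
                continue
            count = self.missed.get(bp, 0) + 1
            self.missed[bp] = count
            self.announce(f"{bp} missed round {count} in a row")
            if count == PROPOSE_AFTER_ROUNDS:
                self.propose(bp)  # fires once per outage
        # Producers that dropped out of the schedule no longer need tracking.
        for bp in list(self.missed):
            if bp not in schedule:
                del self.missed[bp]
                self.announce(f"{bp} left the schedule")


def can_execute(approvals: Set[str]) -> bool:
    """True once enough distinct BPs have approved the proposal."""
    return len(approvals) >= APPROVALS_REQUIRED


# Usage: "bp2" misses three rounds in a row, which triggers a single proposal.
proposals: List[str] = []
tracker = MissTracker(propose=proposals.append)
for _ in range(3):
    tracker.record_round(["bp1", "bp2", "bp3"], produced={"bp1", "bp3"})
print(proposals)
```

Because the proposal fires only when the counter reaches exactly three, a long outage produces one proposal rather than one per missed round.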
Currently, our Telegram bot reporting supports EOS Mainnet, Jungle Testnet, and BOS Mainnet, with more networks coming soon. Here are the links to the Telegram groups:
If you are interested in monitoring BP reliability and recovery times, we urge you to join. We at Aloha are focused on making the EOS network as productive and performant as possible and will continue to create tools and resources to support this goal. We believe this tool will help improve the EOS Mainnet as a whole. We welcome feedback and general discussion in our Aloha EOS Telegram group.
Until next time, Mahalo!