Changes

Summary

Fix issue where node promotion could fail (details)

Commit f7ce828e9e1cd0df26ae04d49b8dca32dc24b654 by derny

Fix issue where node promotion could fail

If a node is promoted right after another node is demoted, there exists
the possibility of a race, by which the newly promoted manager attempts
to connect to the newly demoted manager for its initial Raft membership.
This connection fails, and the whole swarm Node object exits.

At this point, the daemon nodeRunner sees the exit and restarts the
Node.

However, if the address of the no-longer-manager is recorded in the
nodeRunner's config.joinAddr, the Node again attempts to connect to the
no-longer-manager, and crashes again. This repeats. The solution is to
remove the node entirely and rejoin the Swarm as a new node.

This change erases config.joinAddr from the restart of the nodeRunner,
if the node has previously become Ready. The node becoming Ready
indicates that at some point, it did successfully join the cluster, in
some fashion. If it has successfully joined the cluster, then Swarm has
its own persistent record of known manager addresses. If no joinAddr is
provided, then Swarm will choose from its persisted list of managers to
join, and will join a functioning manager.

Signed-off-by: Drew Erny <derny@mirantis.com>
(cherry picked from commit 16e5c41591bda30aa48214fd792c4dd893acbc72)
Signed-off-by: Drew Erny <derny@mirantis.com>

daemon/cluster/noderunner.go (diff)