[ejabberd] Mnesia ops questions...?

Felix GV felix at mate1inc.com
Sat Apr 13 22:39:15 MSK 2013


I have some questions about Mnesia.

If you feel this is not the right place for these and you know of a better
mailing list or forum to ask Mnesia questions on, please don't hesitate to
mention it.

We've been running ejabberd in production for several months and it has
been going mostly top notch, but every once in a while, we hit a problem
with Mnesia and we never got to the bottom of the issue.

We do rolling restarts every once in a while to update custom modules or
perform some other minor changes, and this usually goes well, but
sometimes, we hit a situation where one or more nodes cannot rejoin the
cluster properly. It seems to have to do with the order in which nodes are
stopped and/or brought back up, although it is very confusing to
troubleshoot, so I'm not sure about the details. In any case, it seems that
stopping and bringing back up the whole cluster solves the issue (but
unfortunately, this causes downtime).

On some of the occasions when I've hit this issue, the nodes printed the
following message during startup:

*Killing not allowed - living nodes in database*

But this does not always happen. When it does happen, the node sometimes
eventually starts up normally, and sometimes remains stalled until some
other node in the cluster is also restarted. What does this message mean
anyway?

This week we had the weirdest manifestation of these types of problems. We
changed some modules, then started doing a rolling restart of our 4 nodes:

   - The first one restarted properly.
   - The second one seemed to have restarted, but it was intermittently
   flaky:
      - ejabberdctl status did not always report the node as running,
      sometimes it said the node was "starting with status started".
      - Also, ejabberdctl mnesia info run on the second node sometimes
      reported that the first node was not running, even though at the
      very same moment, the first node was able to see both itself and
      the second node. This looked like some sort of one-way net split...!?!?
      - Even though the above two intermittent problems are weird, the
      second node appeared to be serving traffic properly.
   - The third and fourth nodes were unable to restart. They stayed stalled
   in "starting with status started" forever and never seemed to be able to
   finish initializing themselves. While they were in that state, we were able
   to run mnesia info and the nodes would be able to see the other running db
   nodes (mostly, that seemed to vary a bit as well), but they would have no
   tables at all! None! Normally, tables are replicated with the same copy
   types everywhere, but now they were just not appearing at all.
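
To make the "one-way net split" symptom above easier to spot, here is a
minimal sketch in Python. It assumes the "running db nodes" list has been
copied by hand from the mnesia info output of each node into a dict. The
node names and the asymmetric_views helper are hypothetical, made up for
illustration, and nothing here talks to ejabberd or Mnesia directly.

```python
def asymmetric_views(views):
    """Find one-way visibility between nodes.

    `views` maps each node name to the set of nodes it reports as running
    (hypothetical input, transcribed manually from mnesia info output).
    Returns sorted (observer, missing) pairs where `missing` reports
    `observer` as running, but `observer` does not report `missing`.
    """
    pairs = []
    for observer, seen in views.items():
        for other, other_seen in views.items():
            if other == observer:
                continue
            if observer in other_seen and other not in seen:
                pairs.append((observer, other))
    return sorted(pairs)


# Hypothetical snapshot: node1 sees both nodes, node2 only sees itself.
views = {
    "ejabberd@node1": {"ejabberd@node1", "ejabberd@node2"},
    "ejabberd@node2": {"ejabberd@node2"},
}
print(asymmetric_views(views))
```

Any pair that comes back means the second node of the pair sees the first
as running while the first does not see it in return, which is the
asymmetry described for the first and second nodes.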

In the end, we stopped the first and second nodes, and tried restarting
them, which didn't work (these two also ended up in the "starting with
status started" state seemingly forever). After that, we stopped them
again, killed the epmd process on each of them, and started them again. At
that point, one of the two nodes spat out the "Killing not allowed - living
nodes in database" message and the first two nodes eventually started up
normally (with proper status and mnesia info outputs). After that, we also
restarted the third and fourth nodes and they were able to join the cluster
properly as well (with full proper mnesia info output).

This is all very confusing, and quite opaque to troubleshoot. Even with
ejabberd set to the DEBUG logging level (5), nothing was logged, as if
some things were failing even before reaching the ejabberd code...

I'm kind of disappointed at how vaguely I can describe the issue, even
after troubleshooting it for a whole day.

In the end, we always ended up being able to recover after a full restart,
but this is not very nice for uptime reasons, and also not very reassuring
since we don't understand the root cause of the issue well enough and
cannot be certain that "this time the full restart will work".

Any insight would be appreciated!

Thanks :) !

--
Felix