[ejabberd] mnesia corruption with concurrent ejabberdctl usage

Martin Langhoff martin.langhoff at gmail.com
Tue Dec 29 14:51:21 MSK 2009

On Mon, Dec 28, 2009 at 8:33 PM, Konstantin Khomoutov
<flatworm at users.sourceforge.net> wrote:
>>  - So a small pool of names with some locking may work? Modern distros
>> carry flock, so we could say (pseudo-shell):
>>    CONNLOCKDIR=/var/lock/ejabberd/ejabberdctl
>>    for CONNID in 1..$MAXCONNECTIONS; do
>>        if flock -n "$CONNLOCKDIR/ctl-ejabberd-$CONNID at localhost" \
>>            erl -sname "ctl-ejabberd-$CONNID at localhost" ... ; then
>>            break

> I think that the idea proposed by Brian Cully, namely to stick to
> OS PIDs to generate unique names is OK.
> Refer to discussions archived as [1] and [2].

Thanks for the pointers! I gave the thread and your script a good read.
Still, a couple of things trouble me...

 - Using PIDs will still leak atoms (up to the PID wraparound) -- the
total can be rather large (and I deal with low-RAM servers)

 - The "not concurrent" mode seems to be a workaround for the leak above.
Defaulting to "not concurrent" ... is only safe in single-user mode
IMHO. It is an assumption that cannot be made safely by the
application code (in the case of apps that want to sync with ejabberd),
because the app can be one of many talking to that ejabberd node. It
also cannot be made by the sysadmin: in multi-user mode, other admins
may be logged in.
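
To put a rough number on the first point: Erlang node names are atoms, and atoms are never garbage-collected, so each distinct ctl node name the server sees stays in its atom table. The sketch below only compares how many distinct names each scheme can produce. It assumes Linux (the /proc path) and a pool size of 100, the figure used later in this message; it does not measure memory.

```shell
#!/bin/sh
# Rough upper bound on distinct ctl node names (one atom each) per scheme.
# Assumption: Linux, where the PID ceiling is exposed in /proc; fall back
# to the long-standing default of 32768 if the file cannot be read.
pid_max=$(cat /proc/sys/kernel/pid_max 2>/dev/null || echo 32768)

# Assumption: pool size of 100, the figure mentioned later in this message.
MAXCONNECTIONS=${MAXCONNECTIONS:-100}

echo "PID-based names: up to $pid_max distinct node names"
echo "pooled names:    up to $MAXCONNECTIONS distinct node names"
```

With a fixed pool the bound is set by the admin rather than by the kernel's PID range.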

To put it differently: the app glue code writer needs to know to use
--concurrent for sane use, or place a lock that assumes that his app
is the only thing talking to ejabberd. Many app instances break this
assumption. A sysadmin trying to monitor what's going on breaks it too.
And the failure modes when you hit this are strange and mysterious --
which is not my fave really.

In my case (the OLPC XS) I hit all those cases directly in code (have
1 app that syncs, will add a few more in the near future), and through
the practices I recommend when debugging (like monitoring the output
of some ejabberdctl commands over time).

So I cannot use the PID-based approach. Instead, I will be
implementing something based on the pseudocode I posted earlier. And
if 100 isn't enough for a given setup, the script can error out with a
sensible message ("raise your maxconcurrent").
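
A minimal sketch of that pool, not the eventual patch. CONNLOCKDIR and MAXCONNECTIONS come from the pseudocode quoted above; the function name run_in_slot, the default values, and the use of exit status 200 to mean "slot busy" are assumptions made for illustration. It relies on flock(1) from util-linux, including its -E (--conflict-exit-code) option, which older versions may lack.

```shell
#!/bin/sh
# Lock-guarded pool of connection slots (illustrative sketch).
CONNLOCKDIR=${CONNLOCKDIR:-/var/lock/ejabberd/ejabberdctl}
MAXCONNECTIONS=${MAXCONNECTIONS:-100}

# run_in_slot CMD [ARGS...]   (hypothetical helper)
# Claims the first free slot, runs CMD with the slot number exported as
# CONNID, and returns CMD's exit status. The lock is released when CMD exits.
run_in_slot() {
    mkdir -p "$CONNLOCKDIR" || return 1
    CONNID=1
    while [ "$CONNID" -le "$MAXCONNECTIONS" ]; do
        export CONNID
        status=0
        # -n: fail at once if the slot is taken instead of waiting.
        # -E 200: report a taken slot as status 200, so that a failing CMD
        # is not mistaken for a busy slot and run again in the next one.
        # (A CMD that itself exits with 200 would still be ambiguous.)
        flock -n -E 200 "$CONNLOCKDIR/ctl-ejabberd-$CONNID" "$@" || status=$?
        if [ "$status" -ne 200 ]; then
            return "$status"
        fi
        CONNID=$((CONNID + 1))
    done
    echo "all $MAXCONNECTIONS slots are busy: raise your maxconcurrent" >&2
    return 1
}
```

For example, run_in_slot sh -c 'echo "using slot $CONNID"' prints the slot that was claimed. A real wrapper would build the erl node name from CONNID, as in the pseudocode.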

Will post the resulting patch here.


 martin.langhoff at gmail.com
 martin at laptop.org -- School Server Architect
 - ask interesting questions
 - don't get distracted with shiny stuff
 - working code first
 - http://wiki.laptop.org/go/User:Martinlanghoff
