[ejabberd] Semi-frequent lockup / "crash" in random nodes in ejabberd cluster

Armando Di Cianno armando.dicianno at gmail.com
Tue May 31 19:18:52 MSD 2011

I'm having an odd case of freezing / "crashing" on seemingly random
nodes in my 10 machine ejabberd cluster.


 * Seemingly with no periodicity, one of the nodes in the cluster will
freeze (the erlang process inside the beam.smp, not the VM)
 * It doesn't crash (ergo the earlier "crash" scare-quotes), so
there's no good erl crash dump file to look at
 * The OS beam.smp process is still running, so there are some crash
logs coming from our monitoring agent as it tries to restart ejabberd
and *that* process crashes, since a node is already using that name
 * The few times I've been right at my workstation, and able to log in
and check what's going on by hand, `ejabberdctl status` fails to
connect to the ejabberd process
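
To make the failure mode above concrete, here is a minimal sketch of how a
watchdog could tell a hung node apart from a stopped one. It is hypothetical
and is not our actual monitoring agent: the function name `classify_node`, the
10 second timeout, and the assumption that an exit status of 0 means the node
answered are all illustrative.

```python
import subprocess


def classify_node(status_cmd, pid_alive, timeout_s=10.0):
    """Classify a node as 'running', 'hung', or 'stopped'.

    status_cmd -- argv list for the status check, e.g. ["ejabberdctl", "status"]
    pid_alive  -- True if the OS process named in the pid file still exists
    timeout_s  -- how long to wait for the status check (assumed value)
    """
    try:
        result = subprocess.run(
            status_cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        # Assumption: exit status 0 means the node answered the status call.
        status_ok = result.returncode == 0
    except subprocess.TimeoutExpired:
        # The status command itself never came back.
        status_ok = False
    except OSError:
        # The status command could not be started at all.
        status_ok = False

    if status_ok:
        return "running"
    # The OS process exists but does not answer: the "freeze" case.
    # Starting a second node here fails, since the node name is taken.
    return "hung" if pid_alive else "stopped"
```

With a check like this, only a "stopped" result would warrant a plain start;
a "hung" result means the old beam.smp has to be dealt with first.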

 * 10 machines?! Yeah ... this is running on a managed VM service,
where we control everything about the guest VMs, but nothing about the
host machines. Suffice to say, our web services don't seem to exhibit
related issues, and I believe I have nearly exhausted all routes to
put blame on the fact that we're using VMs (although, frankly, I'm
still suspicious).
 * The machines seem to be over-provisioned for RAM, running ~4GiB
each -- our stats aggregator shows that ejabberd rarely takes up
>1.8GiB of RAM per node
 * Average user count: ~4k
 * Average burst user count / peak periods: ~10k
 * Earlier tests showed we could handle ~40-50k users with that many VMs/nodes
 * We had async threads on at one point, e.g. +A 32, but have turned them off
 * SMP support is on
 * kernel polling is on
 * We do utilize `ejabberdctl reopen_log` as part of our log rotation
 * I have written our own ejabberd modules for authentication -
however, I'm fairly confident in them -- because our use case is
*extremely* specialized, most of the required auth functions return a
happy default, and only the main "is the password valid" function does
any work.
 * The monitoring agent uses both the pid file and `ejabberdctl
status` - status runs once every minute
 * `ejabberdctl connected_users_number` also runs periodically - about
once every 5 minutes
 * We do not store users in mnesia nor mysql/etc, since we have a
specialized method for authorizing users
 * We only use mnesia for whatever mnesia needs to do internally
 * Very few modules are turned on globally:
  {mod_adhoc,    []},
  {mod_caps,     []},
  {mod_disco,    []},
  {mod_ping,     []},
  {mod_privacy,  []},
  {mod_filter,   []}
 * A few more are turned on or specialized per-vhost:
  {{add, modules},
   [{mod_ping,[{send_pings, true},
               {ping_interval, 10},
               {timeout_action, kill} ]},
    {mod_muc,[{host, "lobbies.@HOST@"},
              {access, 'fakename_muc'},
              {access_create, 'fakename_muc'},
              {access_admin, 'fakename_muc_admin'},
              {access_persistent, 'fakename_muc_admin'},
              {history_size, 0},
              {max_users, 100},
              {max_users_admin_threshold, 2},
              {max_user_conferences, 1},
              {max_room_id, 128},
              {max_room_name, 256},
              {max_room_desc, 1024},
              {max_rooms_number, 99},
              {default_room_options,[{allow_change_subj, false},
                                     {allow_private_messages, false},
                                     {allow_visitor_nickchange, false},
                                     {public, true},
                                     {public_list, true},
                                     {allow_query_users, true},
                                     {anonymous, false},
                                     {logging, false},
                                     {members_by_default, true}

Any pointers, advice, avenues to research, or points of obvious
stupidity would be greatly appreciated.

