[ejabberd] Mnesia and Pids

Matthew Reilly matthew.reilly at sipphone.com
Wed Nov 16 00:32:26 MSK 2005


I discovered an odd behavior with Mnesia/OTP and I don't know if this is
a bug with Mnesia/Erlang/OTP or something I'm missing.

I was playing around with ejabberd configuration and I accidentally
started up an ejabberd node with odbc without first compiling ejabberd
with odbc support. This ended up causing a different clustered node to
crash with an out-of-memory error. The root cause of the crash seemed
to be either how Mnesia stores pids or how Erlang/OTP monitors them.

I had this set up:
system A - normal ejabberd setup

I added:
system B - clustered with the session/route/s2s/presence tables
replicated with ram_copies.

System B partially failed at start up in ejabberd_app:start/2.
I stopped System A.

When I tried starting up system A again, it would soon run out of
memory. It always ran out of memory in the ejabberd_router process.
The cause was an infinite recursion -- ejabberd_router:init/0 calls
erlang:monitor/2 (which is a BIF) that ended up calling
erlang:dmonitor_p/2 (which is erlang), which called erlang:monitor/2,
which called erlang:dmonitor_p/2, and so on...
This not only hung the ejabberd_router process, but since calls to BIFs
aren't tail recursive, the stack grew until we ran out of memory.

ejabberd_router tries to monitor all pids listed in the router table:
ejabberd_router.erl:
init() ->
    mnesia:subscribe({table, route, simple}),
    lists:foreach(
      fun(Pid) ->
              erlang:monitor(process, Pid)
      end,
      mnesia:dirty_select(route,
                          [{{route, '_', '$1', '_'}, [], ['$1']}])),
    loop().
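One possible workaround (a sketch only, not a tested patch -- it sidesteps
the symptom rather than fixing the underlying BIF behavior) is to skip any
pid that claims to be local but is not alive, since that is exactly the
kind of pid the stale replica handed back:

init() ->
    mnesia:subscribe({table, route, simple}),
    Pids = mnesia:dirty_select(route,
                               [{{route, '_', '$1', '_'}, [], ['$1']}]),
    lists:foreach(
      fun(Pid) ->
              %% A pid replicated from a stale node can decode as "local"
              %% without ever having existed on this node; monitoring it
              %% is what triggered the monitor/dmonitor_p recursion.
              case node(Pid) =:= node() andalso
                   not is_process_alive(Pid) of
                  true  -> ok;  %% stale "local" pid: skip it
                  false -> erlang:monitor(process, Pid)
              end
      end,
      Pids),
    loop().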

When system A came up again, it received this list of pids from system
B. The first pid on the list was '<0.473.0>'. This was not an existing
pid on system A (and had never existed in the newly started instance of
A).

System A believed this to be a local pid, since is_process_alive(Pid)
returned false (if the pid were external, is_process_alive/1 would have
thrown an exception). However, erlang:monitor/2 should only call
dmonitor_p/2 when the pid is remote.
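For comparison, monitoring a genuinely dead local pid is well-defined:
erlang:monitor/2 returns a reference and immediately delivers a 'DOWN'
message with reason noproc instead of looping. A quick shell sketch:

%% In an Erlang shell: monitoring a dead-but-valid local pid is safe.
Pid = spawn(fun() -> ok end),   % process exits immediately
timer:sleep(100),               % let it terminate
false = is_process_alive(Pid),
Ref = erlang:monitor(process, Pid),
receive
    {'DOWN', Ref, process, Pid, noproc} ->
        io:format("got DOWN with reason noproc~n")
after 1000 ->
    io:format("no DOWN message (unexpected)~n")
end.

That is the behavior I would have expected for the bogus pid from
system B as well.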

While the original cause was an application error, it seems that
Mnesia and/or Erlang/OTP should not allow a BIF to crash the emulator
because of an application error. Either Mnesia shouldn't return an
invalid pid, or erlang:monitor/2 should handle it gracefully.

Is this a true bug or am I not understanding how Mnesia is supposed to
handle Pids?
If it is a bug, is it in Mnesia or in OTP/erlang?


To reproduce this behavior:
1) Start ejabberd as normal on System A:
$ erl -sname ejabberd -s ejabberd

2) Force application error on System B:
Edit ejabberd_app and force start/2 to fail:
    start(normal, _Args) ->
        application:start(sasl),
        randoms:start(),
        db_init(),
        sha:start(),
        catch ssl:start(),
        translate:start(),
        acl:start(),
        gen_mod:start(),
        ejabberd_config:start(),
        Sup = ejabberd_sup:start_link(),
        1 = 2,       % <-- This line will force start/2 to fail
        ejabberd_auth:start(),
        cyrsasl:start(),
        start(),
        load_modules(),
        Sup;
    start(_, _) ->
        {error, badarg}.
$ make
$ make install

3) Add System B as a cluster:
$ erl -sname ejabberd -mnesia extra_db_nodes "['ejabberd@systemA']" -s mnesia
erl> mnesia:change_table_copy_type(schema, node(), disc_copies).
erl> q().

4) Start System B:
$ erl -sname ejabberd -s ejabberd

5) Stop System A:
erl> q().

6) Start System A:
$ erl -sname ejabberd -s ejabberd

This should cause the recursion, which will eventually crash beam.




-- 
Matthew Reilly
matthew.reilly at sipphone.com
Gizmo Project name: matt 
