[ejabberd] problem: ejabberdctl restore. solution: ejabberdctl install_fallback

Jan Koum jan.koum at gmail.com
Tue Dec 22 12:58:48 MSK 2009


hi there,

just stumbled into a problem with 'ejabberdctl restore' that we are hoping
somebody can give us a hand with.  what we want to do is rename our
ejabberd@localhost node to ejabberd@master.xmpp.example.net

so following the instructions in the guide, we did:

$ ejabberdctl --node ejabberd@localhost start
$ ejabberdctl --node ejabberd@localhost status
The node ejabberd@localhost is started with status: started
ejabberd 2.1.0 is running in that node

doing backup takes about 5 minutes and creates a 522MB file:

$ time ejabberdctl --node ejabberd@localhost backup /tmp/node.localhost

real    4m40.485s
user    0m0.187s
sys    0m0.091s

$ ls -l /tmp/node.localhost
-rw-r--r--  1 jkb  wheel  522707410 Dec 22 00:59 /tmp/node.localhost

$ ejabberdctl --node ejabberd@localhost stop

so far so good.  following the guide, we now move the old DCD/DAT/DCL
files out of the way and start the cluster with a new node name:

$ ejabberdctl start
$ ejabberdctl status
The node 'ejabberd@master.xmpp.example.net' is started with status: started
ejabberd 2.1.0 is running in that node
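
for reference, a minimal sketch of the "move the old DCD/DAT/DCL files out of
the way" step, which has to happen while the old node is stopped and before
the start above.  the function name and both directory arguments are
hypothetical: the real Mnesia spool directory depends on how ejabberd was
installed and is not stated in this message.

```shell
#!/bin/sh
# move_mnesia_files SPOOL_DIR ASIDE_DIR   (hypothetical helper)
# Moves Mnesia table files (*.DCD, *.DAT, *.DCL) from SPOOL_DIR into
# ASIDE_DIR so that a node started under a new name begins with an empty
# database. Both directories are placeholders, not real ejabberd paths.
move_mnesia_files() {
    spool_dir=$1
    aside_dir=$2
    mkdir -p "$aside_dir" || return 1
    for f in "$spool_dir"/*.DCD "$spool_dir"/*.DAT "$spool_dir"/*.DCL; do
        # An unmatched glob stays literal, so skip names that do not exist.
        [ -e "$f" ] || continue
        mv "$f" "$aside_dir"/ || return 1
    done
}

# Example call (paths are assumptions):
#   move_mnesia_files /path/to/ejabberd/spool /path/to/ejabberd/spool.old
```

moving the files rather than deleting them keeps the old database around in
case the rename has to be rolled back.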

everything still looks good.  time to do mnesia_change_nodename:

$ time ejabberdctl mnesia_change_nodename ejabberd@localhost \
    ejabberd@master.xmpp.example.net /tmp/node.localhost \
    /tmp/node.master.xmpp.example.net

mnesia_change_nodename goes through successfully:

[...]
 * Checking table: 'last_activity'
   + Checking key: 'ram_copies'
   + Checking key: 'disc_copies'
     - Replacing nodename: 'ejabberd@localhost' with: ''ejabberd@master.xmpp.example.net''
   + Checking key: 'disc_only_copies'
[...]
switched

real    0m31.713s

so next thing we do is 'ejabberdctl restore' and this is where everything
breaks:

$ time ejabberdctl restore  /tmp/node.master.xmpp.example.net
Failed RPC connection to the node 'ejabberd@master.xmpp.example.net': nodedown

=ERROR REPORT==== 22-Dec-2009::01:07:52 ===
** Node 'ejabberd@master.xmpp.example.net' not responding **
** Removing (timedout) connection **

real    1m58.774s

what actually happens is that beam eats up all available RAM (7GB), eats up
all available swap (2GB) and gets killed by the OS.

my guess is that this is because ejabberd/erlang/mnesia is trying to load
everything into memory first before writing it into the DCD/DAT/DCL files,
correct?  is there any way to modify this behavior or work around it somehow?

[... after 5 minutes of trying various things like mnesia:restore(...),
google searches, etc. ...]

AHA! there is an install_fallback command, for which the guide says:

*install_fallback ejabberd.backup* The binary backup file is installed as
fallback: it will be used to restore the database at the next ejabberd
start. Similar to restore, but requires less memory.

perfect -- just tried it and it seems to have worked, except for this scary
core:

=ERROR REPORT==== 2009-12-22 01:25:44 ===
Mnesia('ejabberd@master.xmpp.example.net'): ** ERROR ** (ignoring core) **
FATAL ** A fallback is installed and Mnesia must be restarted. Forcing
shutdown after mnesia_down from 'ejabberd@master.xmpp.example.net'...

[this fatal error comes with either the 'ejabberdctl restart' or the
'ejabberdctl stop' command after the install_fallback command -- is this
scary fatal error expected?]

i guess the one real question i have is: why does 'restore' not act like
'install_fallback' when it comes to memory consumption?  and more
importantly: maybe it makes sense to modify the documentation guide to
recommend people use install_fallback when doing cluster renames in
production.
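
to make that suggestion concrete, here is a rough sketch of the rename
sequence with install_fallback in place of restore.  it only strings together
the ejabberdctl commands already shown in this message; the helper names
(run, prepare_backup, load_backup) and the DRY_RUN switch are made up for
illustration, and a real run would probably need to wait for the node to
finish stopping or starting between steps.

```shell
#!/bin/sh
# Sketch only: the helper names and DRY_RUN are illustrative assumptions.
# With DRY_RUN=1 the commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Phase 1: dump the database of the old node, then stop that node.
# Usage: prepare_backup OLD_NODE OLD_BACKUP_FILE
prepare_backup() {
    run ejabberdctl --node "$1" backup "$2" || return 1
    run ejabberdctl --node "$1" stop
}

# Manual step between the phases: move the old DCD/DAT/DCL files out of
# the way and start ejabberd under the new node name.

# Phase 2: rewrite the node name inside the backup, install the result as
# a fallback, and restart so Mnesia loads it at the next start.
# Usage: load_backup OLD_NODE NEW_NODE OLD_BACKUP_FILE NEW_BACKUP_FILE
load_backup() {
    run ejabberdctl mnesia_change_nodename "$1" "$2" "$3" "$4" || return 1
    run ejabberdctl install_fallback "$4" || return 1
    # The restart is where the "A fallback is installed" FATAL report
    # quoted above showed up.
    run ejabberdctl restart
}

# Example (values taken from this message):
#   prepare_backup ejabberd@localhost /tmp/node.localhost
#   load_backup ejabberd@localhost ejabberd@master.xmpp.example.net \
#       /tmp/node.localhost /tmp/node.master.xmpp.example.net
```

running it once with DRY_RUN=1 prints the exact commands, which is a cheap
way to check node names and file paths before touching a production node.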

-- yan