[ejabberd] Clustered setup, problems after update (ejabberd 2.0.5 > 2.1.5)

Daniel Dormont dan at greywallsoftware.com
Thu Mar 10 20:17:25 MSK 2011


Hi,

I just ran into that situation myself. I had a two-node cluster that was working fine, but when I tried to activate ODBC on the second node (previously I'd been running it on only one node), I got exactly the situation you described. Not only that, but the errors seem to have caused an infinite loop: I was getting about 500 of them a second.

The issue has to do with the Mnesia table sql_pool. At least in my testing so far, if you create the second node by making only a disc copy of the schema, all other tables are treated as "remote" on the second node, and in the specific case of sql_pool, that caused the behavior you saw. The solution is to create a RAM copy of sql_pool on the second node *before* running ejabberd with ODBC enabled.
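
As a rough sketch of that fix, something along these lines could be entered in the Erlang shell on the second node. It assumes the shell was started the way the quoted message below describes (erl -name ... -mnesia extra_db_nodes ... -s mnesia), that the first node is up with ODBC enabled so that sql_pool already exists in the cluster, and that ejabberd is not yet running on the second node. The calls are standard Mnesia functions; whether this matches every setup is something to verify, not a definitive procedure.

```erlang
%% Run on the second node, in a shell already connected to the cluster.

%% Keep the schema on disc, as in the original join procedure.
%% If this was done earlier, Mnesia returns an {aborted, ...} tuple
%% saying the copy already exists, which is harmless here.
mnesia:change_table_copy_type(schema, node(), disc_copies).

%% List how this node stores each table. Tables that only live on
%% other nodes are reported with the storage type 'unknown'.
[{T, mnesia:table_info(T, storage_type)} || T <- mnesia:system_info(tables)].

%% Add a local RAM replica of sql_pool on this node.
mnesia:add_table_copy(sql_pool, node(), ram_copies).

%% Check the result; sql_pool should no longer be 'unknown' here.
mnesia:table_info(sql_pool, storage_type).
```

After leaving the shell with q()., ejabberd can be started on the second node with ODBC enabled.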

This does get to a larger question I would pose to the community: is there a guide on how to set up the storage type based on the cluster requirements? I found this: http://lists.jabber.ru/pipermail/ejabberd/2009-December/005535.html, which is a good start, but it seems a bit out of date and it also takes the configuration of the "master" node as a baseline without really getting into how *that* should be decided.

-Dan


On Mar 6, 2011, at 4:38 PM, Sven 'Darkman' Michels wrote:

> Hi,
> 
> short facts:
> - 2 nodes
> - mysql clustered on both
> - debian 5.0.8 on both nodes
> 
> Today we upgraded both of our nodes from 2.0.5 to 2.1.5. Before the upgrade both
> nodes were running fine for about one year without problems. The update was
> done via aptitude; all services on both servers were stopped beforehand and we
> removed the network connection between the two to avoid ending up in some "nasty" state
> when the server is automatically started after the upgrade. We've been doing it
> this way for a couple of years without problems, so far ;)
> 
> We upgraded both servers and, after a reboot, started the mysql cluster again. After
> a couple of seconds the cluster was back in sync and everything was fine so far.
> Then we started ejabberd on node1, which worked without problems (it's running right
> now). After testing the first node without any problems, we started the second
> one. But that one didn't come up. It doesn't even log any problems; we just found
> some "core" files like: MnesiaCore.ejabberd@node2.domain.tld_1299_440860_127056
> The core files stated something like "failed to merge schema". So we decided to
> remove the node from the cluster and rejoin it to get a clean state. But that
> also failed. We synced the cookie and did (as ejabberd):
> erl -name ejabberd@node2.domain.tld -mnesia dir '"/var/lib/ejabberd/"' -mnesia
> extra_db_nodes "['ejabberd@node1.domain.tld']" -s mnesia
> 
> After that, we verified the working connection with mnesia:info(). and checked
> the web interface on node1, which showed node2 just fine.
> 
> Then we issued the mnesia:change_table_copy_type(schema, node(), disc_copies).
> command, which succeeded. Then q(). to leave the mnesia shell. This is just what we
> did when node2 was joined the first time. Worked fine. But this time, ejabberd
> didn't come up when we tried to start it. Instead it's filling the logs with the
> following:
> =ERROR REPORT==== 2011-03-06 22:10:13 ===
> E(<0.37.0>:ejabberd_rdbms:67) : Start of supervisor 'ejabberd_odbc_sup_domain.tld'
> failed:
> {error,{{'EXIT',{badarg,[{ets,delete,[sql_pool,"domain.tld"]},
>                         {mnesia,delete,5},
>                         {mnesia_tm,non_transaction,5},
>                         {ejabberd_odbc_sup,start_link,1},
>                         {supervisor,do_start_child,2},
>                         {supervisor,handle_start_child,2},
>                         {supervisor,handle_call,3},
>                         {gen_server,handle_msg,5}]}},
>        {child,undefined,'ejabberd_odbc_sup_domain.tld',
>               {ejabberd_odbc_sup,start_link,["domain.tld"]},
>               transient,infinity,supervisor,
>               [ejabberd_odbc_sup]}}}
> Retrying...
> 
> (I get this message a couple of times within a second, running ejabberd at
> loglevel 5...). So ejabberd "tries" to start, but hangs at this problem. Google
> didn't help much with this problem and I'm not sure what's going wrong there.
> In fact, the same software, same version, same config etc. is working on the other
> node.
> 
> Is anyone aware of this issue? Anything I forgot? Anything I'm missing?
> 
> Thanks for your time and help!
> 
> Regards,
> Sven


