[ejabberd] ejabberd_router blocking
skeltoac at gmail.com
Tue Dec 1 21:05:36 MSK 2009
First the specs. I am running ejabberd trunk rev 2610, working on
upgrading to head. The server has 8 cores and 4GB RAM. Throughput is
typically 400Kbps in and 1Mbps out. Normally less than one full core is
in use and ejabberd takes about 5-10% of physical RAM.
Our throughput is almost all pubsub events. The inbound events are
received efficiently via a custom HTTP module. Outbound events go via
s2s and c2s alike. Online users average in the dozens, and s2s
connections number under ten.
Now the problem. Frequently (sometimes weekly, daily, or more often)
ejabberd memory use suddenly spikes to several gigabytes. When this
exceeds the physical RAM and it starts to swap, it's game over. The
server becomes almost totally unresponsive. Sometimes I can restart
ejabberd, sometimes the hardware has to be restarted.
I've been looking into this. When the memory problem occurs there is
only one process that is eating RAM: ejabberd_router. It builds up a
huge message queue which requires gigabytes of RAM.
Now let me tell you what I just figured out while writing this email.
I was going to ask for help but now I'm just telling a story for your
The module ejabberd_router is not the cause of the problem. Neither is
it a slow client. It's not even mnesia. The problem is that the
filter_packet hook blocks ejabberd_router. If anything on that hook
ever gets slow the entire router queue will wait. It must be one of my
packet filters blocking the router and causing the pile-up.
I wondered if I could rewrite ejabberd_router:do_route/3 to dispatch
messages asynchronously. A scan of RFC 3920, XMPP Core, provides the
    10. Server Rules for Handling XML Stanzas
    Compliant server implementations MUST ensure in-order processing of
    XML stanzas between any two entities.
Dang. I was hoping to keep the router going despite some packets
taking a long time to run through the filters but I can't do that by
simply spawning processes. Doing so would make it possible for some
messages to be delivered before other messages sent earlier.
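To make the ordering problem concrete, here is a minimal sketch in Python (a model of the idea, not Erlang and not ejabberd code). The names route_by_spawning, slow_first_filter and deliver are hypothetical, and the filter is rigged so that the first packet is held until the second one has gone out. One thread per packet stands in for one spawned process per packet.

```python
import threading

def route_by_spawning(packets, filter_packet, deliver):
    """Naive asynchronous router: one thread per packet, no ordering control."""
    def handle(packet):
        filter_packet(packet)   # may block for an arbitrary time
        deliver(packet)

    threads = [threading.Thread(target=handle, args=(p,)) for p in packets]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

delivered = []
lock = threading.Lock()
second_delivered = threading.Event()

def slow_first_filter(packet):
    # Hypothetical slow filter: hold the first packet until the second is out.
    if packet == "msg-1":
        second_delivered.wait(timeout=5)

def deliver(packet):
    with lock:
        delivered.append(packet)
    if packet == "msg-2":
        second_delivered.set()

# Both packets travel between the same two entities, msg-1 sent first.
route_by_spawning(["msg-1", "msg-2"], slow_first_filter, deliver)
print(delivered)
```

Here msg-2 is delivered ahead of msg-1 even though it was routed later, which is exactly the reordering the RFC forbids between two entities.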
End of story. Beginning of proposal. All invented while composing.
It would be possible to amend ejabberd_router so that it spawns new
processes to handle messages while honoring RFC 3920. The trick would
be to limit the spawning to one process per pair of entities. If the
router sent its messages to these other processes, which would then
run the filter_packet hook, then there would be no way to block the
router. The entity-pair routers would still be blocked by the hook but
the result would be a more robust system.
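A rough sketch of the entity-pair idea, again as a Python model and not ejabberd code. PairRouter, its methods and the filter/deliver callbacks are all hypothetical names. It assumes route() is only ever called from a single router thread (mirroring the single ejabberd_router process), so the table of queues needs no lock. Each (sender, recipient) pair gets its own FIFO queue and worker, so a slow filter stalls only that pair.

```python
import queue
import threading

class PairRouter:
    """Router that never runs filters itself; one worker per (from, to) pair."""

    def __init__(self, filter_packet, deliver):
        self._filter_packet = filter_packet
        self._deliver = deliver
        self._queues = {}    # (sender, recipient) -> queue.Queue
        self._threads = []

    def route(self, sender, recipient, packet):
        key = (sender, recipient)
        q = self._queues.get(key)
        if q is None:
            # First packet for this pair: start its dedicated worker.
            q = queue.Queue()
            self._queues[key] = q
            t = threading.Thread(target=self._worker, args=(q,), daemon=True)
            t.start()
            self._threads.append(t)
        q.put((sender, recipient, packet))   # cheap; never waits on a filter

    def _worker(self, q):
        # FIFO queue plus a single consumer keeps per-pair order intact.
        while True:
            item = q.get()
            if item is None:
                break
            sender, recipient, packet = item
            self._filter_packet(sender, recipient, packet)
            self._deliver(sender, recipient, packet)

    def shutdown(self):
        for q in self._queues.values():
            q.put(None)
        for t in self._threads:
            t.join()

delivered_pairs = []
pairs_lock = threading.Lock()
other_pair_done = threading.Event()

def slow_filter(sender, recipient, packet):
    # Hypothetical filter that is slow only for traffic from "a" to "b".
    if (sender, recipient) == ("a", "b"):
        other_pair_done.wait(timeout=5)

def deliver_pair(sender, recipient, packet):
    with pairs_lock:
        delivered_pairs.append((sender, recipient, packet))
    if (sender, recipient) == ("c", "d"):
        other_pair_done.set()

router = PairRouter(slow_filter, deliver_pair)
router.route("a", "b", "m1")
router.route("a", "b", "m2")
router.route("c", "d", "x1")   # not stuck behind the slow a->b filter
router.shutdown()
print(delivered_pairs)
```

The c->d packet gets through while a->b is stalled, and m1 still arrives before m2. A real implementation would also need to reap idle pair workers and bound their queues, which this sketch leaves out.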
So should I code this up? :-)