WebSocket Signaling Implementation: Room Routing, Reconnection & Backpressure

A WebRTC media session cannot begin until two peers have exchanged a session description and a stream of ICE candidates over a side channel. That side channel is the signalling server, and in production it is almost always a WebSocket: a persistent, full-duplex connection that can deliver SDP and candidate payloads in under 10 ms on a healthy network path, without the polling overhead of HTTP. This guide is part of the WebRTC Protocol Stack & Signaling Servers guide, and it walks through building a signalling layer that routes messages to the right room, survives the Wi-Fi-to-cellular transitions that drop sockets mid-call, applies backpressure before the event loop stalls, and rejects malformed payloads at the boundary. The goal is a server you can run behind a load balancer and trust to deliver every offer, answer, and candidate exactly once to exactly the right peers.

The signalling server is deliberately dumb about media. It never parses RTP, never terminates DTLS, and never sees decrypted audio or video. It is a typed message router that maps a roomId onto a set of live sockets and forwards opaque payloads between them. Keeping that boundary clean is what lets the same server handle a two-person call and a 50-person conference without code changes — the routing logic is identical, only the fan-out width differs.

SDP and ICE messages travel client → signaling server → room peers; the server holds only a room-to-socket map.

Step 1 — Room Routing

The core data structure is a Map<roomId, Set<WebSocket>>. Every inbound message names a room; the server looks up the set and forwards the payload to every member except the sender. The sender exclusion matters: a peer that receives its own offer will try to apply it as a remote description while it is still in the have-local-offer state, which raises InvalidStateError unless the client or browser rolls back the local offer first. Keep room membership a Set rather than an array so that joins are idempotent: a client that sends join twice on the same socket must not appear twice and receive every message in duplicate. A client that reconnects arrives on a new socket, so the stale socket must also be removed in the close handler.

Attach a stable, server-assigned identifier to every socket at connection time. Clients should never supply their own peer ID, because a malicious or buggy client could collide with another peer and hijack its routing slot. Generate the ID server-side with a UUID and stamp it onto each forwarded message as senderId so the receiving peer knows which remote peer, and therefore which RTCPeerConnection, the offer belongs to.

const { WebSocketServer } = require('ws');
const { randomUUID } = require('crypto');

const wss = new WebSocketServer({ port: 8080, maxPayload: 65536 });
const rooms = new Map(); // Map<roomId, Set<ws>>

function joinRoom(roomId, ws) {
  if (!rooms.has(roomId)) rooms.set(roomId, new Set());
  rooms.get(roomId).add(ws);
  ws.roomId = roomId; // remember for O(1) cleanup on close
}

function routeToRoom(roomId, payload, sender) {
  const room = rooms.get(roomId);
  if (!room) return;
  const data = JSON.stringify({ ...payload, senderId: sender.peerId });
  for (const peer of room) {
    // Exclude the sender; skip sockets mid-close to avoid throwing
    if (peer !== sender && peer.readyState === peer.OPEN) peer.send(data);
  }
}

wss.on('connection', (ws) => {
  ws.peerId = randomUUID(); // server-assigned identity, never client-supplied
});

Track ws.roomId on the socket so teardown is O(1): when a socket closes you delete it from one set, not by scanning every room. For rooms that span more than one server process, this in-memory map is no longer sufficient — a peer connected to node B must still receive a message published on node A. That fan-out across nodes is the subject of the Scaling WebSocket Signaling with Redis Pub/Sub deep-dive.
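
The idempotent join and O(1) teardown described above can be sketched as a pair of helpers. This is a minimal illustration, not the guide's canonical implementation: leaveRoom is a hypothetical helper, the rule that a socket lives in at most one room is an assumption, and plain objects stand in for sockets because the registry only relies on object identity.

```javascript
// Hypothetical sketch: rooms/joinRoom mirror the names used in Step 1, leaveRoom is new.
const rooms = new Map(); // Map<roomId, Set<socket>>

function leaveRoom(ws) {
  const room = rooms.get(ws.roomId);
  if (!room) return;
  room.delete(ws);
  if (room.size === 0) rooms.delete(ws.roomId); // drop empty rooms so the map cannot grow forever
  ws.roomId = undefined;
}

function joinRoom(roomId, ws) {
  if (ws.roomId === roomId) return;           // repeated join on the same socket is a no-op
  if (ws.roomId !== undefined) leaveRoom(ws); // assumption: a socket lives in at most one room
  if (!rooms.has(roomId)) rooms.set(roomId, new Set());
  rooms.get(roomId).add(ws);
  ws.roomId = roomId;
}

// Plain objects stand in for sockets here.
const a = { peerId: 'a' };
const b = { peerId: 'b' };
joinRoom('r1', a);
joinRoom('r1', a); // idempotent: still one entry
joinRoom('r1', b);
joinRoom('r2', b); // b moves from r1 to r2
leaveRoom(a);      // r1 is now empty and is removed from the map
```

Calling leaveRoom from the close handler keeps the cleanup path identical for explicit leaves and dropped connections.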

Step 2 — Reconnect & Backoff

Mobile clients change networks constantly. A handoff from Wi-Fi to LTE drops the TCP connection underneath the WebSocket, and the browser surfaces this as a close event with code 1006 (abnormal closure) — not a clean 1000. The client must distinguish 1006 (reconnect aggressively) from 1001 Going Away sent during a planned server drain (reconnect, but the server is healthy) and 1000 (intentional, do not reconnect).
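
The three-way distinction can be captured in a small classifier that the close handler consults before scheduling a retry. This is only a sketch: the function name and the returned labels are hypothetical, and treating every unlisted code as abnormal (and grouping 1013 with planned closes) is an assumption rather than a rule from any specification.

```javascript
// Hypothetical classifier: labels are illustrative, not part of the WebSocket API.
function classifyClose(code) {
  switch (code) {
    case 1000:
      return 'stop';      // intentional close, do not reconnect
    case 1001:            // Going Away: planned drain, server side is healthy
    case 1013:            // Try Again Later: server asked us to come back
      return 'planned';
    default:
      return 'abnormal';  // 1006 and anything unexpected: reconnect with backoff
  }
}

// Possible use inside a client close handler:
//   ws.onclose = (e) => {
//     const kind = classifyClose(e.code);
//     if (kind === 'stop') return;
//     scheduleReconnect(kind); // scheduleReconnect is a placeholder for your backoff logic
//   };
```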

Reconnect with exponential backoff plus jitter. Without jitter, a server restart causes every disconnected client to reconnect at the same instant — a thundering herd that knocks the server over again. A base of 500 ms doubling to a cap of 8–10 s, with ±30% randomised jitter, spreads the reconnect storm across a window.

// Client-side reconnect with exponential backoff and jitter
const roomId = 'demo-room'; // placeholder: supplied by your application
let lastSeq = 0;            // highest sequence number processed so far
let attempt = 0;

function connect() {
  const ws = new WebSocket('wss://signal.example.com/ws');

  ws.onopen = () => {
    attempt = 0;                       // reset backoff on a clean open
    ws.send(JSON.stringify({ type: 'rejoin', roomId, lastSeq }));
  };

  ws.onclose = (e) => {
    if (e.code === 1000) return;       // intentional close, do not reconnect
    const base = Math.min(500 * 2 ** attempt, 10000); // cap at 10 s
    const jitter = base * 0.3 * (Math.random() * 2 - 1); // ±30%
    attempt++;
    setTimeout(connect, base + jitter);
  };
}

connect();

Reconnecting the signalling socket does not by itself disturb the media session. An RTCPeerConnection keeps its ICE and DTLS state alive independently of the WebSocket — a 4-second signalling outage during an active call is invisible to the media plane. Only if ICE itself reports failed should the client trigger an ICE restart with createOffer({ iceRestart: true }), capped at 3 retries. On rejoin, send the last sequence number you processed so the server can detect whether you missed any messages while disconnected. The full set of reconnection-aware state transitions belongs to Signaling State Machine Patterns.
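
One way the server might act on that last sequence number is a small bounded replay log per room. The sketch below is an assumption about how such a log could look: the ReplayLog class, its capacity, and the convention of returning null when the gap is too old to replay are all hypothetical and not something this guide prescribes.

```javascript
// Hypothetical per-room replay log: keeps the most recent messages, numbered from 1.
class ReplayLog {
  constructor(capacity = 64) {
    this.capacity = capacity;
    this.nextSeq = 1;
    this.entries = []; // oldest first, each { seq, payload }
  }

  // Store a routed payload and return the sequence number assigned to it.
  append(payload) {
    const entry = { seq: this.nextSeq++, payload };
    this.entries.push(entry);
    if (this.entries.length > this.capacity) this.entries.shift(); // evict the oldest
    return entry.seq;
  }

  // Messages newer than lastSeq, or null when some missed messages were already evicted.
  since(lastSeq) {
    const oldest = this.entries.length ? this.entries[0].seq : this.nextSeq;
    if (lastSeq + 1 < oldest) return null; // gap cannot be filled: caller should renegotiate
    return this.entries.filter((e) => e.seq > lastSeq);
  }
}
```

On rejoin the server would call since(lastSeq) and either replay the returned entries or, on null, tell the client to start a fresh negotiation. A real implementation would also skip entries that the rejoining peer itself sent.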

Step 3 — Backpressure

A WebSocket send is not instantaneous. When a slow consumer cannot drain its receive buffer as fast as you push to it, the kernel’s socket buffer fills and the ws library queues the remaining bytes in process memory, reporting the backlog as ws.bufferedAmount. Ignore this and a single slow peer in a busy room will balloon your process heap until the event loop stalls and every other call degrades. Backpressure is the discipline of refusing to enqueue more than a peer can absorb.

Set a high-water mark on bufferedAmount. When a peer exceeds it, stop forwarding non-critical traffic to that peer — or, for a peer that stays saturated for several seconds, close it with code 1013 (Try Again Later) and let it reconnect to a less loaded node. Critically, never let one slow subscriber block the broadcast loop for the whole room; forward to fast peers immediately and drop or defer for the slow one.

const HIGH_WATER = 1 << 20; // 1 MiB of un-flushed bytes per socket

function safeSend(peer, data) {
  if (peer.readyState !== peer.OPEN) return;
  if (peer.bufferedAmount > HIGH_WATER) {
    // Slow consumer: shed load rather than growing the heap unbounded
    peer._slowSince ??= Date.now();
    if (Date.now() - peer._slowSince > 5000) peer.close(1013, 'backpressure');
    return; // skip this peer for this message
  }
  peer._slowSince = undefined;
  peer.send(data);
}
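
Note that routeToRoom from Step 1 calls peer.send directly, so the high-water check only helps once the broadcast loop goes through it. A minimal sketch of such a loop follows; broadcast is a hypothetical name, the threshold repeats the value used above, and returning the skipped peers for logging is an assumption about what the caller might want.

```javascript
const HIGH_WATER = 1 << 20; // 1 MiB, same threshold as above
const OPEN = 1;             // numeric value of WebSocket.OPEN

// Hypothetical broadcast loop: fast peers get the message immediately,
// saturated peers are skipped and reported back to the caller.
function broadcast(room, sender, data) {
  const skipped = [];
  for (const peer of room) {
    if (peer === sender || peer.readyState !== OPEN) continue;
    if (peer.bufferedAmount > HIGH_WATER) {
      skipped.push(peer); // slow consumer: do not enqueue more bytes for it
      continue;
    }
    peer.send(data);
  }
  return skipped;
}
```

Because each send only enqueues bytes and returns, the loop never waits on a slow peer; the skip branch is what stops that peer's queue from growing.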

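On the inbound side, the ws library does not wait for your message handler: 'message' is an ordinary EventEmitter event, so an async handler returns a promise that nothing awaits, and frames keep arriving while earlier ones are still being processed. To get real inbound backpressure, run each socket’s messages through a queue that you actually await and, once the backlog grows, stop reading with ws.pause() until it drains (available in recent ws 8.x releases; check your installed version). If you fire async validation without awaiting, inbound work piles up unbounded. Offload CPU-heavy validation (large SDP, schema checks) so the main thread keeps servicing heartbeats; a blocked main thread misses pong deadlines and the client wrongly concludes the connection is dead.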
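
A per-socket queue is one way to make "actually await the work" concrete. The sketch below is hypothetical throughout: makeInboundQueue, maxPending, and the pause/resume callbacks are names invented for illustration. Wiring the callbacks to ws.pause() and ws.resume() assumes a ws version that provides those methods.

```javascript
// Hypothetical helper: handles one message at a time and signals when the backlog is too deep.
function makeInboundQueue(handler, { maxPending = 32, pause = () => {}, resume = () => {} } = {}) {
  let pending = 0;
  let tail = Promise.resolve();

  return function enqueue(raw) {
    pending++;
    if (pending === maxPending) pause(); // backlog reached the limit: ask the source to stop
    tail = tail
      .then(() => handler(raw))
      .catch(() => {})                   // one failing message must not break the chain
      .then(() => {
        pending--;
        if (pending === maxPending - 1) resume(); // dropped below the limit again
      });
    return tail;
  };
}

// Possible wiring (assumes ws.pause/ws.resume exist in your ws version):
//   ws.on('message', makeInboundQueue(handleMessage, {
//     pause: () => ws.pause(),
//     resume: () => ws.resume(),
//   }));
```

Chaining on a single promise keeps messages in arrival order, which matters for signalling because an answer must not overtake the candidates that follow it.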

Step 4 — Message Validation & Verification

Every inbound frame is untrusted. Parse JSON inside a try/catch, reject anything that is not an object, and validate against a strict allow-list of message types and required fields before routing. A signalling server that forwards arbitrary client payloads is an open relay: an attacker can broadcast junk to every peer in a room, inject crafted SDP to crash peers, or amplify traffic. Validate type, roomId, and a bounded payload size; reject unknown types with an explicit error rather than silently dropping them, so clients fail loud during development.

const ALLOWED = new Set(['join', 'rejoin', 'offer', 'answer', 'candidate', 'leave']);

function validate(raw) {
  let msg;
  try { msg = JSON.parse(raw); } catch { return { error: 'INVALID_JSON' }; }
  if (typeof msg !== 'object' || msg === null) return { error: 'NOT_OBJECT' };
  if (!ALLOWED.has(msg.type)) return { error: 'UNKNOWN_TYPE' };
  if (typeof msg.roomId !== 'string' || msg.roomId.length > 128) return { error: 'BAD_ROOM' };
  return { msg };
}

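wss.on('connection', (ws) => {
  ws.on('message', (raw) => {
    const { msg, error } = validate(raw);
    if (error) { ws.send(JSON.stringify({ type: 'error', error })); return; }
    if (msg.type === 'join' || msg.type === 'rejoin') { joinRoom(msg.roomId, ws); return; }
    // Only route into the room this socket actually joined; otherwise any
    // client could inject messages into arbitrary rooms by naming them.
    if (ws.roomId !== msg.roomId) {
      ws.send(JSON.stringify({ type: 'error', error: 'NOT_IN_ROOM' })); // illustrative error code
      return;
    }
    routeToRoom(msg.roomId, msg, ws);
  });

  ws.on('close', () => {
    const room = rooms.get(ws.roomId);
    if (room) { room.delete(ws); if (room.size === 0) rooms.delete(ws.roomId); }
  });
});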
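
The validate function above checks type and roomId but not the type-specific fields. A per-type rule table is one way to extend it. The sketch below is hypothetical: the field names sdp and candidate are assumptions about the client protocol, and a protocol that signals end-of-candidates with a null candidate would need a looser rule.

```javascript
// Hypothetical rule table: message type -> required field -> expected typeof.
const REQUIRED_FIELDS = new Map([
  ['offer', { sdp: 'string' }],
  ['answer', { sdp: 'string' }],
  ['candidate', { candidate: 'object' }],
]);

// Runs after the type/roomId checks; returns { msg } or { error } like validate().
function checkFields(msg) {
  const rules = REQUIRED_FIELDS.get(msg.type) || {};
  for (const [field, expected] of Object.entries(rules)) {
    const value = msg[field];
    if (value === null || typeof value !== expected) {
      return { error: `BAD_FIELD:${field}` };
    }
  }
  return { msg };
}
```

Types without an entry, such as join or leave, pass through unchanged, so the table only needs to list messages that carry a payload.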

Verification checklist:

Two browser tabs join the same roomId; an offer from tab A reaches tab B and never echoes back to A.
Kill the WebSocket from DevTools; confirm the client reconnects within the backoff window and the active media stream is undisturbed.
Send { "type": "offer" } with no roomId and confirm the server replies with a BAD_ROOM error rather than crashing.
Throttle one peer’s network in DevTools and watch bufferedAmount on the server; confirm the slow peer is shed without stalling the others.
Load-test with k6 or artillery to 10k concurrent sockets and watch event-loop lag stay under a few milliseconds.
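
For the last item, event-loop lag can be approximated by measuring how late a repeating timer fires. The probe below is a rough sketch, not a substitute for a proper metrics library: startLagProbe, the 100 ms interval, and the onSample callback are all hypothetical names and defaults.

```javascript
const { performance } = require('node:perf_hooks');

// Hypothetical probe: reports how many milliseconds late each tick fired.
function startLagProbe(intervalMs = 100, onSample = () => {}) {
  let expected = performance.now() + intervalMs;
  const timer = setInterval(() => {
    const now = performance.now();
    onSample(Math.max(0, now - expected)); // lateness of this tick, never negative
    expected = now + intervalMs;
  }, intervalMs);
  timer.unref(); // do not keep the process alive just for the probe
  return () => clearInterval(timer);
}

// Possible use during a load test:
//   const stop = startLagProbe(100, (ms) => { if (ms > 5) console.warn('loop lag', ms.toFixed(1)); });
```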

A working signalling channel is the prerequisite for ICE Candidate Gathering & Filtering, since trickled candidates ride this same channel. For the framework-specific build on top of these four steps, see WebSocket Signaling with Node.js & Socket.IO.

Edge Cases & Browser Quirks

Concurrent connection caps. Browsers cap the number of simultaneous WebSocket connections at roughly 200–256 (Chrome historically around 256; Firefox governed by network.websocket.max-connections, default 200, which applies across the whole browser rather than per tab). A tab that opens a separate socket per call hits the ceiling fast; multiplex all signalling for a tab over one connection.

Safari close-code reporting. Safari (through 17) is less consistent than Chrome about surfacing distinct close codes; abnormal drops frequently arrive as 1006 with no reason string. Do not branch reconnection logic on the reason text — branch only on the numeric code, and treat any 1006 as “reconnect with backoff.”

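Firefox aggressive idle timeout on cellular. On some Android builds, Firefox’s underlying connection is reclaimed faster than Chrome’s during background tabs. Keep ping/pong heartbeats at 30–45 s so the TCP connection under the WebSocket never sits idle long enough for a carrier-grade NAT or the browser to reclaim it. The much shorter UDP mapping timeouts on such NATs (which can be under 30 s) affect the media path rather than the signalling socket, and are refreshed by ICE keepalives, not by these heartbeats. Browser JavaScript cannot send WebSocket ping frames, so either the server originates the pings or the client sends an application-level heartbeat message.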
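
A server-originated heartbeat is commonly written with an isAlive flag per socket. The sketch below follows that pattern with the ws methods ping() and terminate() and the 'pong' event. The function names trackLiveness and sweep are hypothetical, and the 30 s interval in the usage comment simply takes the low end of the range above.

```javascript
// Mark a socket alive on connect and whenever it answers a ping.
function trackLiveness(ws) {
  ws.isAlive = true;
  ws.on('pong', () => { ws.isAlive = true; });
}

// Hypothetical sweep: drop sockets that never answered the previous ping, ping the rest.
function sweep(clients) {
  for (const ws of clients) {
    if (ws.isAlive === false) {
      ws.terminate(); // no pong since the last sweep: assume the connection is dead
      continue;
    }
    ws.isAlive = false; // will flip back to true when the pong arrives
    ws.ping();
  }
}

// Possible wiring with the ws server from Step 1:
//   wss.on('connection', trackLiveness);
//   setInterval(() => sweep(wss.clients), 30000);
```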

perMessageDeflate memory on Chrome. Enabling per-message compression saves bandwidth on large SDP but allocates a compression context per connection; under tens of thousands of sockets this is real memory. Measure before enabling it server-wide.

mDNS .local candidates. Modern Chrome and Firefox mask host IPs behind .local mDNS hostnames in candidates. Your signalling server must forward these strings verbatim — do not “normalise” them, or peers cannot resolve the obfuscated host.

Common Implementation Mistakes

Echoing to the sender. Forgetting the peer !== sender guard makes a peer apply its own offer as a remote description and throw InvalidStateError.
Client-supplied peer IDs. Trusting a client-provided identity lets one client overwrite another’s routing slot. Assign IDs server-side.
Scanning every room on disconnect. Iterating all rooms on each close is O(rooms); store ws.roomId and delete in O(1).
Treating 1006 as fatal. Abnormal closure is the normal outcome of a network handoff, not an error to surface to the user — reconnect silently.
Unbounded bufferedAmount. No backpressure means one slow consumer grows the heap until the event loop stalls for everyone.
Renegotiating media on signalling reconnect. The RTCPeerConnection survives a signalling drop; only restart ICE if ICE itself reports failed.
Forwarding unvalidated payloads. An open relay lets attackers broadcast junk to every peer; validate type, roomId, and size at the boundary.

FAQ

Do I need sticky sessions if I run more than one signalling node?

Sticky sessions keep a given client pinned to one node so its in-memory room map stays consistent, but they do not solve cross-node fan-out: two peers in the same room may land on different nodes. You need either sticky routing plus a shared message bus, or a stateless design with a pub/sub backplane. The trade-offs are covered in the Scaling WebSocket Signaling with Redis Pub/Sub guide.
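
The shape of the cross-node fan-out can be illustrated with an in-process EventEmitter standing in for the shared bus. This is a toy sketch under stated assumptions: createNode, the 'signal' channel name, and the message envelope are hypothetical, and a real deployment would replace the emitter with a network backplane such as Redis Pub/Sub.

```javascript
const { EventEmitter } = require('node:events');

const bus = new EventEmitter(); // stand-in for the shared backplane

// Hypothetical node: holds only its own sockets and relays everything through the bus.
function createNode() {
  const rooms = new Map(); // roomId -> Set of sockets connected to THIS node

  bus.on('signal', ({ roomId, senderId, data }) => {
    const room = rooms.get(roomId);
    if (!room) return; // no local members for this room
    for (const peer of room) {
      if (peer.peerId !== senderId) peer.send(data); // never echo to the sender
    }
  });

  return {
    join(roomId, peer) {
      if (!rooms.has(roomId)) rooms.set(roomId, new Set());
      rooms.get(roomId).add(peer);
    },
    publish(roomId, sender, payload) {
      const data = JSON.stringify({ ...payload, senderId: sender.peerId });
      bus.emit('signal', { roomId, senderId: sender.peerId, data });
    },
  };
}
```

Every node, including the one the sender is attached to, delivers from the bus subscription, so local and remote peers follow the same code path.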

Does a dropped WebSocket drop the call?

No. Media flows peer-to-peer over a separate DTLS-SRTP path. The signalling socket is only needed to negotiate or renegotiate. A brief signalling outage during an established call is invisible to media; only an ICE failed state requires action.
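
Acting only on ICE failure, with a cap on retries, might look like the sketch below. It relies on the standard iceconnectionstatechange event and createOffer({ iceRestart: true }). The helper name watchIce, the sendOffer and onGiveUp callbacks, and resetting the counter once ICE recovers are assumptions made for illustration.

```javascript
// Hypothetical helper: pc is an RTCPeerConnection, sendOffer forwards SDP over the signalling socket.
function watchIce(pc, sendOffer, { maxRestarts = 3, onGiveUp = () => {} } = {}) {
  let restarts = 0;
  pc.addEventListener('iceconnectionstatechange', async () => {
    const state = pc.iceConnectionState;
    if (state === 'connected' || state === 'completed') {
      restarts = 0; // recovered: allow a fresh budget of restarts later
      return;
    }
    if (state !== 'failed') return; // 'disconnected' often heals by itself
    if (restarts >= maxRestarts) {
      onGiveUp();   // caller decides whether to tear down or prompt the user
      return;
    }
    restarts++;
    const offer = await pc.createOffer({ iceRestart: true });
    await pc.setLocalDescription(offer);
    sendOffer(pc.localDescription);
  });
}
```

The restart offer still has to travel over the signalling socket, so if the socket is down at that moment the offer should be queued until the reconnect completes.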

How large can a signalling message get?

SDP payloads for a multi-track session can reach a few kilobytes, and bundled simulcast offers run larger still. Cap maxPayload at 64 KiB to bound memory, which comfortably fits realistic SDP while rejecting abusive frames.

Should signalling be encrypted at the application layer?

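WSS gives you transport encryption between each client and the server, which is sufficient when you trust the signalling server. The SDP carries a DTLS fingerprint that each peer checks against the certificate presented in the DTLS handshake, so media cannot be intercepted without also altering the SDP. That guarantee is only as strong as the signalling path: the server terminates TLS and could rewrite fingerprints, so application-layer encryption or signing of the SDP is defence-in-depth for deployments that do not fully trust the relay, and is rarely required otherwise.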

Related: this guide sits under WebRTC Protocol Stack & Signaling Servers; build the concrete server with WebSocket Signaling with Node.js & Socket.IO, scale it horizontally via Scaling WebSocket Signaling with Redis Pub/Sub, model the transitions with Signaling State Machine Patterns, and feed candidates through ICE Candidate Gathering & Filtering.
