Building an SFU
Published on June 15, 2025
Okay. So here’s the thing.
Most people treat WebRTC like this magic box that somehow makes video calls work. You write some frontend code, call a few peer connection APIs, and suddenly you have a working call. But nobody tells you what’s actually happening underneath. It’s just “connect the peers” and “it works.”
That didn’t sit well with me. I didn’t want magic. I wanted to know how this thing actually works. Not just to use it, but to build with it. From scratch. At the protocol level. And then scale it properly.
So I started from first principles—peer-to-peer WebRTC, then built a fully custom SFU, and even piped it into HLS for streaming. No frameworks. No shortcuts. Just protocols and packets.
Here’s everything I figured out.
WebRTC Is Just a Set of Protocols. That’s It.
WebRTC isn’t a media library. It’s not a video framework. It’s just a bunch of protocols that work together to get media from one machine to another in real time and securely.
Here’s what’s actually involved:
- You send media as RTP packets
- A DTLS handshake negotiates the encryption keys; the media itself then flows as SRTP
- The transport is UDP, because latency matters more than reliability
- To get through NATs, you use ICE, which uses STUN (or TURN if you’re unlucky)
- And all the config and capabilities go through SDP, which is just a text blob
That’s it. There’s no “WebRTC black magic” going on. You’re just doing secure UDP streaming with some extra steps.
Signaling? That’s Your Job.
This tripped me up in the beginning. WebRTC doesn’t come with signaling.
You’re supposed to build that yourself. All that “offer”, “answer”, “candidate” stuff? You have to send it across some channel—could be WebSocket, HTTP, anything that gets the message across. I used raw WebSockets. It worked fine.
Point is: signaling is not part of WebRTC. It’s a sideband protocol you have to glue in yourself.
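To make that concrete, here’s a minimal sketch of the relay logic I mean, stripped of the actual WebSocket plumbing. The message shape (`type`, `to`, `from` fields, a `SignalingRoom` class) is my own invention for illustration; the only real rule is that the server shuttles opaque blobs between peers without ever interpreting the SDP.

```python
import json

# Hypothetical signaling relay: each message carries a "type"
# ("offer" | "answer" | "candidate") and a "to" peer id. The relay
# never parses SDP; it just moves blobs between peers.
class SignalingRoom:
    def __init__(self):
        self.peers = {}  # peer_id -> outbound queue (stand-in for a WebSocket)

    def join(self, peer_id):
        self.peers[peer_id] = []

    def handle(self, sender_id, raw):
        msg = json.loads(raw)
        assert msg["type"] in ("offer", "answer", "candidate")
        msg["from"] = sender_id          # tell the target who sent it
        self.peers[msg["to"]].append(json.dumps(msg))

room = SignalingRoom()
room.join("alice"); room.join("bob")
room.handle("alice", json.dumps({"type": "offer", "to": "bob", "sdp": "v=0..."}))
print(json.loads(room.peers["bob"][0])["type"])  # offer
```

Swap the queues for real WebSocket sends and you have a working signaling server.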
What Happens When a WebRTC Call Starts?
Roughly: signaling exchanges the SDP, ICE runs its connectivity checks, and then the DTLS handshake establishes keys. Everything up until that DTLS step is just setup. After that, RTP packets start flowing. Encrypted, timestamped, sequenced media packets. That’s your actual video/audio stream.
If something breaks during this—like no video shows up—it’s almost always something in that flow: ICE failed, DTLS didn’t finish, RTP never started, codecs didn’t match, etc.
I Started With Just P2P
I built a barebones 1:1 WebRTC app first. Literally just two peers:
- Each creates an RTCPeerConnection
- They exchange SDP + ICE over WebSocket
- Use getUserMedia() to get the camera
- Attach that stream to the peer connection
This helped a lot. You really get to see what the ICE negotiation is doing, when DTLS kicks in, how fast or slow the media starts, how NAT types affect the flow, etc.
Once that clicked, I was ready to scale.
P2P Doesn’t Scale. Period.
This is where most people stop. But once you try a group call, reality hits hard.
In peer-to-peer, each participant sends (n - 1) copies of their media, one to each other participant. So with 5 users, that’s 4 uploads per client. Multiply that by 1080p60 video and now your laptop is melting and your Wi-Fi is crying.
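The arithmetic is worth spelling out. The ~5 Mbps figure for a 1080p60 stream below is an illustrative assumption, not a measurement, but the shape of the curve is the point: P2P upload grows linearly, an SFU keeps it flat.

```python
# Upload streams per client: the core scaling difference between P2P and SFU.
BITRATE_MBPS = 5  # assumed bitrate for one 1080p60 stream (illustrative)

def p2p_uploads(n: int) -> int:
    return n - 1   # one copy to every other participant

def sfu_uploads(n: int) -> int:
    return 1       # one copy to the server, which fans it out

for n in (2, 5, 10):
    print(f"{n} users: P2P {p2p_uploads(n) * BITRATE_MBPS} Mbps up, "
          f"SFU {sfu_uploads(n) * BITRATE_MBPS} Mbps up")
```

At 10 users that’s 45 Mbps of upstream per client for P2P versus a constant 5 Mbps with an SFU.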
This is not a WebRTC issue. It’s just how peer-to-peer works. That’s why I started digging into SFU architecture.
What SFU Actually Means (It’s Not a Tool)
SFU = Selective Forwarding Unit. It’s just a fancy term for:
A server that receives media streams and forwards them to other participants.
That’s it. No transcoding. No mixing. Just smart packet forwarding.
- Each client sends one stream to the server
- The server forwards that stream to everyone else
No duplication on the client side. Upload bandwidth stays constant no matter how many people join. The server handles the heavy lifting. Clean separation.
What You Need to Build an SFU
You don’t need to use Janus or some “magic” server. If you understand the protocols, you can build your own SFU from scratch.
Here’s what the core needs to do:
- Accept UDP connections
- Do the DTLS handshake
- Parse and decrypt SRTP
- Read and forward RTP packets to the right clients
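The first bullet, accepting UDP, is plain socket work. Everything else (DTLS, SRTP) is layered on top of raw datagrams. A minimal sketch, assuming loopback and ignoring the ICE-negotiated ports a real SFU would use:

```python
import socket

# Tiny demonstration of the transport an SFU sits on: one UDP socket per
# media port. DTLS and SRTP are layered on top of these raw datagrams.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))          # OS picks a free port
host, port = server.getsockname()

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.sendto(b"\x80" + b"\x00" * 11, (host, port))  # RTP-shaped 12-byte datagram

data, addr = server.recvfrom(2048)
print(len(data))  # 12
client.close(); server.close()
```

No connections, no handshakes at this layer; just datagrams arriving on a port. That’s the floor everything else is built on.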
And when a user starts sending media, your server basically marks them as a producer. Everyone else becomes a consumer of that stream.
Now your job is just forwarding packets from producers to consumers, with correct SSRCs and timing.
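The per-packet work is smaller than it sounds. Here’s a sketch of the SSRC rewrite step, assuming a packet with no header extensions and no padding: read the incoming SSRC to find the producer, then stamp in the SSRC the server announced to each consumer.

```python
import struct

# Rewrite the SSRC field of an RTP packet before forwarding it to a consumer.
# Assumption: fixed 12-byte header, no extensions, no padding.
def rewrite_ssrc(rtp_packet: bytes, new_ssrc: int) -> bytes:
    header = bytearray(rtp_packet[:12])
    struct.pack_into("!I", header, 8, new_ssrc)  # SSRC lives at bytes 8-11
    return bytes(header) + rtp_packet[12:]

# Build a fake packet: V=2, PT=96, seq=1000, ts=160, SSRC=0xDEADBEEF
pkt = struct.pack("!BBHII", 0x80, 96, 1000, 160, 0xDEADBEEF) + b"payload"
out = rewrite_ssrc(pkt, 0x11223344)
print(hex(struct.unpack("!I", out[8:12])[0]))  # 0x11223344
```

The payload bytes pass through untouched; the SFU never decodes video, it only rewrites headers.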
The Hard Stuff
If you’re serious about this, you can’t just wing it with tutorials. You have to learn the low-level stuff. Here’s what I went through:
- RTP headers: how SSRC, timestamps, and sequence numbers work
- How STUN and ICE help find IP/port pairs
- DTLS over UDP and how to terminate it server-side
- How to trace traffic with Wireshark and tcpdump
- Codec negotiation, jitter buffering, NACKs, and PLI
- And yeah, a bit of SDP hell
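The RTP header item on that list is the one you’ll stare at most, so here’s the fixed 12-byte layout from RFC 3550 unpacked by hand. Sequence numbers detect loss and reordering, the timestamp drives playback, and the SSRC identifies the stream.

```python
import struct

# Parse the fixed 12-byte RTP header (RFC 3550).
def parse_rtp_header(pkt: bytes) -> dict:
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", pkt[:12])
    return {
        "version": b0 >> 6,            # always 2 for RTP
        "padding": bool(b0 & 0x20),
        "extension": bool(b0 & 0x10),
        "marker": bool(b1 & 0x80),     # e.g. last packet of a video frame
        "payload_type": b1 & 0x7F,     # maps to a codec via SDP negotiation
        "sequence": seq,
        "timestamp": ts,
        "ssrc": ssrc,
    }

pkt = struct.pack("!BBHII", 0x80, 96, 4242, 90000, 0xCAFEBABE)
h = parse_rtp_header(pkt)
print(h["version"], h["payload_type"], h["sequence"])  # 2 96 4242
```

Once you can decode this by hand, Wireshark captures stop being noise.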
Debugging this was hard. But once I could inspect and understand every packet, I stopped getting surprised by “why it’s not working.”
Streaming WebRTC to HLS
Another thing I wanted was streaming to non-participants—basically viewers.
So I took RTP streams from the SFU, piped them into FFmpeg, and converted them to HLS:
- .ts segments
- .m3u8 playlist
- Served via static HTTP
Now any passive viewer could just open a URL and watch the WebRTC stream with a regular HTML5 video tag, near real time (HLS adds a few seconds of latency). No peer connection needed.
It needed a bit of remuxing magic—syncing audio/video, handling keyframes—but once it worked, it felt very clean.
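For reference, here’s the shape of the FFmpeg invocation I mean, expressed as an argv list. It assumes the SFU has written an SDP file describing its outbound RTP streams to `stream.sdp` (a hypothetical filename); the exact codec flags depend on what the browsers negotiated.

```python
# Sketch of an RTP -> HLS FFmpeg command. Assumptions: input is described by
# stream.sdp, video was negotiated as H.264 (so it can be copied), audio is
# Opus and gets re-encoded to AAC since TS/HLS players expect AAC.
cmd = [
    "ffmpeg",
    "-protocol_whitelist", "file,udp,rtp",
    "-i", "stream.sdp",
    "-c:v", "copy",              # pass H.264 through without re-encoding
    "-c:a", "aac",               # Opus -> AAC for TS compatibility
    "-f", "hls",
    "-hls_time", "2",            # 2-second .ts segments
    "-hls_list_size", "5",       # rolling playlist of 5 segments
    "stream.m3u8",
]
print(" ".join(cmd))
```

If the browsers negotiated VP8 instead of H.264, `-c:v copy` won’t fly and you pay for a real transcode.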
Final Architecture (Protocol-Level View)
Here’s what the final system looks like:
- Signaling Server: Just a WebSocket server for SDP and ICE messages
- Media Ingest (SFU): Handles DTLS + SRTP, routes RTP streams
- RTP Router: Manages producer/consumer mapping, handles retransmissions
- HLS Pipeline (Optional): Uses FFmpeg to create segments from RTP
- HTTP Server: Serves m3u8 and ts files to viewers
No frameworks. Just protocol components stitched together with clarity.
Final Thoughts
If you’re just building a quick prototype or product, sure, use something like Mediasoup or LiveKit.
But if you want to understand WebRTC—like really get what’s happening under the hood—you have to build it yourself. At least once. Even a minimal version.
Don’t treat WebRTC as magic. It’s just:
- UDP
- DTLS
- SRTP
- RTP
- ICE/STUN
- SDP
That’s it. Learn those protocols, trace the flows, watch the packets, and everything clicks.
And trust me, that’s worth it.