Building an SFU
Published on June 15, 2025
Okay. So here’s the thing.
Most people treat WebRTC like this magic box that somehow makes video calls work. You write some frontend code, call a few peer connection APIs, and suddenly you have a working call. But nobody tells you what’s actually happening underneath. It’s just “connect the peers” and “it works.”
That didn’t sit well with me. I didn’t want magic. I wanted to know how this thing actually works. Not just to use it, but to build with it. From scratch. At the protocol level. And then scale it properly.
So I started from first principles—peer-to-peer WebRTC, then built a fully custom SFU, and even piped it into HLS for streaming. No frameworks. No shortcuts. Just protocols and packets.
Here’s everything I figured out.
WebRTC Is Just a Set of Protocols. That’s It.
WebRTC isn’t a media library. It’s not a video framework. It’s just a bunch of protocols that work together to get media from one machine to another in real time and securely.
Here’s what’s actually involved:
- You send media as RTP packets
- A DTLS handshake negotiates the encryption keys; the media itself then flows as SRTP
- The transport is UDP, because latency matters more than reliability
- To get through NATs, you use ICE, which uses STUN (or TURN if you’re unlucky)
- And all the config and capabilities go through SDP, which is just a text blob
That’s it. There’s no “WebRTC black magic” going on. You’re just doing secure UDP streaming with some extra steps.
Signaling? That’s Your Job.
This tripped me up in the beginning. WebRTC doesn’t come with signaling.
You’re supposed to build that yourself. All that “offer”, “answer”, “candidate” stuff? You have to send it across some channel—could be WebSocket, HTTP, anything that gets the message across. I used raw WebSockets. It worked fine.
Point is: signaling is not part of WebRTC. It’s a sideband protocol you have to glue in yourself.
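To make that concrete, here’s a minimal sketch of the relay logic I mean, stripped of the actual WebSocket plumbing. The message shape (`type`, `to`, `from` fields, a `SignalingRoom` class) is my own invention for illustration; the only real rule is that the server shuttles opaque blobs between peers without ever interpreting the SDP.

```python
import json

# Hypothetical signaling relay: each message carries a "type"
# ("offer" | "answer" | "candidate") and a "to" peer id. The relay
# never parses SDP; it just moves blobs between peers.
class SignalingRoom:
    def __init__(self):
        self.peers = {}  # peer_id -> outbound queue (stand-in for a WebSocket)

    def join(self, peer_id):
        self.peers[peer_id] = []

    def handle(self, sender_id, raw):
        msg = json.loads(raw)
        assert msg["type"] in ("offer", "answer", "candidate")
        msg["from"] = sender_id          # tell the target who sent it
        self.peers[msg["to"]].append(json.dumps(msg))

room = SignalingRoom()
room.join("alice"); room.join("bob")
room.handle("alice", json.dumps({"type": "offer", "to": "bob", "sdp": "v=0..."}))
print(json.loads(room.peers["bob"][0])["type"])  # offer
```

Swap the queues for real WebSocket sends and you have a working signaling server.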
What Happens When a WebRTC Call Starts?
Roughly: signaling exchanges the SDP, ICE runs its connectivity checks, and then the DTLS handshake establishes keys. Everything up until that DTLS step is just setup. After that, RTP packets start flowing. Encrypted, timestamped, sequenced media packets. That’s your actual video/audio stream.
If something breaks during this—like no video shows up—it’s almost always something in that flow: ICE failed, DTLS didn’t finish, RTP never started, codecs didn’t match, etc.
I Started With Just P2P
I built a barebones 1:1 WebRTC app first. Literally just two peers:
- Each creates an RTCPeerConnection
- They exchange SDP + ICE over WebSocket
- Use getUserMedia() to get the camera
- Attach that stream to the peer connection
This helped a lot. You really get to see what the ICE negotiation is doing, when DTLS kicks in, how fast or slow the media starts, how NAT types affect the flow, etc.
Once that clicked, I was ready to scale.
P2P Doesn’t Scale. Period.
This is where most people stop. But once you try a group call, reality hits hard.
In peer-to-peer, each participant sends (n - 1) copies of their media, one to each other participant. So with 5 users, that’s 4 uploads per client. Multiply that by 1080p60 video and now your laptop is melting and your Wi-Fi is crying.
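The arithmetic is worth spelling out. The ~5 Mbps figure for a 1080p60 stream below is an illustrative assumption, not a measurement, but the shape of the curve is the point: P2P upload grows linearly, an SFU keeps it flat.

```python
# Upload streams per client: the core scaling difference between P2P and SFU.
BITRATE_MBPS = 5  # assumed bitrate for one 1080p60 stream (illustrative)

def p2p_uploads(n: int) -> int:
    return n - 1   # one copy to every other participant

def sfu_uploads(n: int) -> int:
    return 1       # one copy to the server, which fans it out

for n in (2, 5, 10):
    print(f"{n} users: P2P {p2p_uploads(n) * BITRATE_MBPS} Mbps up, "
          f"SFU {sfu_uploads(n) * BITRATE_MBPS} Mbps up")
```

At 10 users that’s 45 Mbps of upstream per client for P2P versus a constant 5 Mbps with an SFU.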
This is not a WebRTC issue. It’s just how peer-to-peer works. That’s why I started digging into SFU architecture.
What SFU Actually Means (It’s Not a Tool)
SFU = Selective Forwarding Unit. It’s just a fancy term for:
A server that receives media streams and forwards them to other participants.
That’s it. No transcoding. No mixing. Just smart packet forwarding.
- Each client sends one stream to the server
- The server forwards that stream to everyone else
No duplication on the client side. Upload bandwidth stays constant no matter how many people join. The server handles the heavy lifting. Clean separation.
What You Need to Build an SFU
You don’t need to use Janus or some “magic” server. If you understand the protocols, you can build your own SFU from scratch.
Here’s what the core needs to do:
- Accept UDP connections
- Do the DTLS handshake
- Parse and decrypt SRTP
- Read and forward RTP packets to the right clients
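The first bullet, accepting UDP, is plain socket work. Everything else (DTLS, SRTP) is layered on top of raw datagrams. A minimal sketch, assuming loopback and ignoring the ICE-negotiated ports a real SFU would use:

```python
import socket

# Tiny demonstration of the transport an SFU sits on: one UDP socket per
# media port. DTLS and SRTP are layered on top of these raw datagrams.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))          # OS picks a free port
host, port = server.getsockname()

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.sendto(b"\x80" + b"\x00" * 11, (host, port))  # RTP-shaped 12-byte datagram

data, addr = server.recvfrom(2048)
print(len(data))  # 12
client.close(); server.close()
```

No connections, no handshakes at this layer; just datagrams arriving on a port. That’s the floor everything else is built on.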
And when a user starts sending media, your server basically marks them as a producer. Everyone else becomes a consumer of that stream.
Now your job is just forwarding packets from producers to consumers, with correct SSRCs and timing.
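The per-packet work is smaller than it sounds. Here’s a sketch of the SSRC rewrite step, assuming a packet with no header extensions and no padding: read the incoming SSRC to find the producer, then stamp in the SSRC the server announced to each consumer.

```python
import struct

# Rewrite the SSRC field of an RTP packet before forwarding it to a consumer.
# Assumption: fixed 12-byte header, no extensions, no padding.
def rewrite_ssrc(rtp_packet: bytes, new_ssrc: int) -> bytes:
    header = bytearray(rtp_packet[:12])
    struct.pack_into("!I", header, 8, new_ssrc)  # SSRC lives at bytes 8-11
    return bytes(header) + rtp_packet[12:]

# Build a fake packet: V=2, PT=96, seq=1000, ts=160, SSRC=0xDEADBEEF
pkt = struct.pack("!BBHII", 0x80, 96, 1000, 160, 0xDEADBEEF) + b"payload"
out = rewrite_ssrc(pkt, 0x11223344)
print(hex(struct.unpack("!I", out[8:12])[0]))  # 0x11223344
```

The payload bytes pass through untouched; the SFU never decodes video, it only rewrites headers.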
The Hard Stuff
If you’re serious about this, you can’t just wing it with tutorials. You have to learn the low-level stuff. Here’s what I went through:
- RTP headers: how SSRC, timestamps, and sequence numbers work
- How STUN and ICE help find IP/port pairs
- DTLS over UDP and how to terminate it server-side
- How to trace traffic with Wireshark and tcpdump
- Codec negotiation, jitter buffering, NACKs, and PLI
- And yeah, a bit of SDP hell
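The RTP header item on that list is the one you’ll stare at most, so here’s the fixed 12-byte layout from RFC 3550 unpacked by hand. Sequence numbers detect loss and reordering, the timestamp drives playback, and the SSRC identifies the stream.

```python
import struct

# Parse the fixed 12-byte RTP header (RFC 3550).
def parse_rtp_header(pkt: bytes) -> dict:
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", pkt[:12])
    return {
        "version": b0 >> 6,            # always 2 for RTP
        "padding": bool(b0 & 0x20),
        "extension": bool(b0 & 0x10),
        "marker": bool(b1 & 0x80),     # e.g. last packet of a video frame
        "payload_type": b1 & 0x7F,     # maps to a codec via SDP negotiation
        "sequence": seq,
        "timestamp": ts,
        "ssrc": ssrc,
    }

pkt = struct.pack("!BBHII", 0x80, 96, 4242, 90000, 0xCAFEBABE)
h = parse_rtp_header(pkt)
print(h["version"], h["payload_type"], h["sequence"])  # 2 96 4242
```

Once you can decode this by hand, Wireshark captures stop being noise.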
Debugging this was hard. But once I could inspect and understand every packet, I stopped getting surprised by “why it’s not working.”
Streaming WebRTC to HLS
Another thing I wanted was streaming to non-participants—basically viewers.
So I took RTP streams from the SFU, piped them into FFmpeg, and converted them to HLS:
- .ts segments
- .m3u8 playlist
- Served via static HTTP
Now any passive viewer could just open a URL and watch the WebRTC stream with a regular HTML5 video tag, near real time (HLS adds a few seconds of latency). No peer connection needed.
It needed a bit of remuxing magic—syncing audio/video, handling keyframes—but once it worked, it felt very clean.
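For reference, here’s the shape of the FFmpeg invocation I mean, expressed as an argv list. It assumes the SFU has written an SDP file describing its outbound RTP streams to `stream.sdp` (a hypothetical filename); the exact codec flags depend on what the browsers negotiated.

```python
# Sketch of an RTP -> HLS FFmpeg command. Assumptions: input is described by
# stream.sdp, video was negotiated as H.264 (so it can be copied), audio is
# Opus and gets re-encoded to AAC since TS/HLS players expect AAC.
cmd = [
    "ffmpeg",
    "-protocol_whitelist", "file,udp,rtp",
    "-i", "stream.sdp",
    "-c:v", "copy",              # pass H.264 through without re-encoding
    "-c:a", "aac",               # Opus -> AAC for TS compatibility
    "-f", "hls",
    "-hls_time", "2",            # 2-second .ts segments
    "-hls_list_size", "5",       # rolling playlist of 5 segments
    "stream.m3u8",
]
print(" ".join(cmd))
```

If the browsers negotiated VP8 instead of H.264, `-c:v copy` won’t fly and you pay for a real transcode.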
Final Architecture (Protocol-Level View)
Here’s what the final system looks like:
- Signaling Server: Just a WebSocket server for SDP and ICE messages
- Media Ingest (SFU): Handles DTLS + SRTP, routes RTP streams
- RTP Router: Manages producer/consumer mapping, handles retransmissions
- HLS Pipeline (Optional): Uses FFmpeg to create segments from RTP
- HTTP Server: Serves m3u8 and ts files to viewers
No frameworks. Just protocol components stitched together with clarity.
Final Thoughts
If you’re just building a quick prototype or product, sure, use something like Mediasoup or LiveKit.
But if you want to understand WebRTC—like really get what’s happening under the hood—you have to build it yourself. At least once. Even a minimal version.
Don’t treat WebRTC as magic. It’s just:
- UDP
- DTLS
- SRTP
- RTP
- ICE/STUN
- SDP
That’s it. Learn those protocols, trace the flows, watch the packets, and everything clicks.
And trust me, that’s worth it.