Low-Level Implementation Plan
1. Network Packet Anatomy (The Data Plane)
To minimize latency, we use a custom binary format for UDP voice data instead of JSON or Protobuf.
- UDP Voice Header (Fixed 16 Bytes):
  - `u32` (4 bytes): Session Token. Generated during the TCP handshake. The server drops any packet whose IP/Port does not match this token.
  - `u64` (8 bytes): Sequence Number. Monotonically increasing per user. Essential for the Jitter Buffer to reorder packets.
  - `u32` (4 bytes): Timestamp. Measured in audio samples (increments by 960 per 20 ms frame) to handle playback timing.
- Payload: Raw Opus-encoded bytes (variable length, typically 60–120 bytes). The bitrate is not hardcoded; it is dictated dynamically by the server's `ChannelConfig` (e.g., 16 kbps for voice, 96 kbps for music bots) when the user joins a room.
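The header layout above can be sketched as a fixed 16-byte pack/parse pair. Note that the big-endian byte order and the exact field ordering (token, then sequence, then timestamp) are assumptions of this sketch, not a settled wire spec:

```rust
// Sketch of the 16-byte UDP voice header: token (u32) + sequence (u64)
// + timestamp (u32). Byte order and field order are assumed here.

fn pack_header(token: u32, seq: u64, timestamp: u32) -> [u8; 16] {
    let mut buf = [0u8; 16];
    buf[0..4].copy_from_slice(&token.to_be_bytes());
    buf[4..12].copy_from_slice(&seq.to_be_bytes());
    buf[12..16].copy_from_slice(&timestamp.to_be_bytes());
    buf
}

fn parse_header(buf: &[u8]) -> Option<(u32, u64, u32)> {
    if buf.len() < 16 {
        return None; // Too short to contain a header: drop the packet.
    }
    let token = u32::from_be_bytes(buf[0..4].try_into().ok()?);
    let seq = u64::from_be_bytes(buf[4..12].try_into().ok()?);
    let timestamp = u32::from_be_bytes(buf[12..16].try_into().ok()?);
    Some((token, seq, timestamp))
}

fn main() {
    let header = pack_header(0xDEAD_BEEF, 42, 960);
    assert_eq!(parse_header(&header), Some((0xDEAD_BEEF, 42, 960)));
}
```

The Opus payload simply follows these 16 bytes in the same datagram, so the relay can validate the token without touching the audio data.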
2. Real-Time Audio Pipeline (client_node/audio)
Audio threads must be "lock-free" to prevent stuttering. We use a Single-Producer Single-Consumer (SPSC) ring buffer.
- Global Hotkeys / Push-to-Talk:
  - Use `global-hotkey` (or `rdev`) to hook OS-level key presses, allowing PTT even when the app is minimized.
- Microphone Thread (The Producer):
  - Initialize `cpal` with a 48 kHz input stream.
  - Rule: The hardware callback must only push raw `f32` samples into the `ringbuf`. No networking or heavy math is allowed here.
- DSP/Encoder Thread (The Consumer):
  - Pull samples from the `ringbuf`.
  - Process via `webrtc_audio_processing` (Echo Cancellation, Noise Suppression, and Voice Activity Detection/VAD). If VAD detects silence, stop transmitting to save bandwidth.
  - Accumulate exactly 960 samples (20 ms).
  - Pass to `audiopus::Encoder`.
  - Send the resulting bytes to the Network Task via an asynchronous MPSC channel.
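The consumer's framing step can be isolated as a small, std-only sketch: samples trickle in one at a time from the ring buffer, and a full 960-sample frame is emitted only when the 20 ms boundary is reached. The real pipeline pulls from `ringbuf` inside the DSP thread; here a plain loop stands in for that:

```rust
// Sketch of the consumer-side frame accumulator: collect exactly 960
// samples (20 ms at 48 kHz) before a frame is handed to the Opus encoder.
// In the real pipeline samples arrive via the lock-free SPSC ring buffer.

const FRAME_SIZE: usize = 960; // 48_000 Hz * 0.020 s

struct FrameAccumulator {
    pending: Vec<f32>,
}

impl FrameAccumulator {
    fn new() -> Self {
        Self { pending: Vec::with_capacity(FRAME_SIZE) }
    }

    /// Push one sample; returns a complete frame once 960 are buffered.
    fn push(&mut self, sample: f32) -> Option<Vec<f32>> {
        self.pending.push(sample);
        if self.pending.len() == FRAME_SIZE {
            Some(std::mem::replace(&mut self.pending, Vec::with_capacity(FRAME_SIZE)))
        } else {
            None
        }
    }
}

fn main() {
    let mut acc = FrameAccumulator::new();
    let mut frames = 0;
    for i in 0..4800 {
        // 100 ms of fake microphone input.
        if acc.push(i as f32).is_some() {
            frames += 1; // This frame would go to audiopus::Encoder.
        }
    }
    assert_eq!(frames, 5); // 100 ms / 20 ms = 5 frames
}
```

Keeping this accumulation on the consumer side is what lets the `cpal` hardware callback stay trivial: it only copies samples into the ring buffer and returns.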
3. Jitter Buffer & Playback Logic (client_node/network)
The Jitter Buffer compensates for unstable internet connections by adding a controlled "latency tax".
- The Sorting Mechanism: Incoming UDP packets are inserted into a `BinaryHeap` (Min-Heap) sorted by Sequence Number.
- The Watermark Strategy:
  - Wait until the heap contains at least 40 ms (2 frames) of audio before starting playback.
  - This buffer allows late-arriving packets to be inserted in the correct order.
- Playback Tick: Every 20 ms, the playback thread pops the next sequence number.
  - Success: Decode the packet. Before pushing to the master `cpal` speaker buffer, multiply that user's decoded `f32` array by their local volume scalar (e.g., 0.5 for 50% volume) to enable Per-User Volume Control.
  - Missing (Packet Loss): If the sequence number is missing, call `audiopus::Decoder::decode` with a `None` frame to trigger Packet Loss Concealment (PLC), which synthesizes a "guess" of the missing sound.
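The heap ordering and gap detection above can be sketched with std types only. Rust's `BinaryHeap` is a max-heap, so `cmp::Reverse` turns it into the min-heap the plan calls for; the `Lost` arm is where the real code would invoke the Opus decoder's PLC path:

```rust
// Sketch of the jitter-buffer tick: a BinaryHeap keyed by sequence number
// (wrapped in Reverse to get min-heap order). A gap in the sequence
// triggers the Packet Loss Concealment branch instead of a decode.

use std::cmp::Reverse;
use std::collections::BinaryHeap;

struct JitterBuffer {
    heap: BinaryHeap<Reverse<(u64, Vec<u8>)>>,
    next_seq: u64,
}

enum Tick {
    Packet(Vec<u8>), // Decode this Opus payload normally.
    Lost,            // Call the decoder with None -> PLC synthesizes audio.
    Starved,         // Heap empty; nothing to play this tick.
}

impl JitterBuffer {
    fn new() -> Self {
        Self { heap: BinaryHeap::new(), next_seq: 0 }
    }

    fn insert(&mut self, seq: u64, payload: Vec<u8>) {
        self.heap.push(Reverse((seq, payload)));
    }

    fn tick(&mut self) -> Tick {
        // Discard stale packets that arrived after their slot already played.
        while matches!(self.heap.peek(), Some(Reverse((seq, _))) if *seq < self.next_seq) {
            self.heap.pop();
        }
        match self.heap.peek() {
            None => Tick::Starved,
            Some(Reverse((seq, _))) if *seq > self.next_seq => {
                // The expected packet never arrived: conceal and move on.
                self.next_seq += 1;
                Tick::Lost
            }
            _ => {
                let Reverse((seq, payload)) = self.heap.pop().unwrap();
                self.next_seq = seq + 1;
                Tick::Packet(payload)
            }
        }
    }
}

fn main() {
    let mut jb = JitterBuffer::new();
    // Packets arrive out of order, and seq 1 is lost entirely.
    jb.insert(2, vec![0xB]);
    jb.insert(0, vec![0xA]);
    assert!(matches!(jb.tick(), Tick::Packet(p) if p == vec![0xA]));
    assert!(matches!(jb.tick(), Tick::Lost)); // seq 1 -> PLC
    assert!(matches!(jb.tick(), Tick::Packet(p) if p == vec![0xB]));
}
```

The watermark check (hold playback until ~40 ms is buffered) would simply gate the first call to `tick()` on `heap.len() >= 2`.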
4. Server Relay & Routing (server_node/udp_relay.rs)
The server acts as a high-speed traffic controller. It must be "Zero-Copy" where possible.
- Validation: Use
tokio::net::UdpSocket. On receipt, verify theu32 Session Tokenagainst theDashMapstate[cite: 1]. - Broadcast Logic:
- Identify the sender's current
ChannelId[cite: 1]. - Retrieve the list of
SocketAddrfor every other user in that channel[cite: 1]. - Iterate and send the exact byte buffer to each address. Use the
bytescrate to share the buffer via reference counting (Arc) instead of cloning[cite: 1].
- Identify the sender's current
- NAT Keep-Alives: The server must ignore empty 0-byte UDP packets (used by clients to keep router ports open)[cite: 1].
- TCP Control Lane & Chat Routing: The TCP router handles synchronized text messages and broadcasts them to users in the same
ChannelId[cite: 1]. - Stateful Auto-Reconnect: If the TCP socket drops, the client quietly reconnects and submits its existing
Session Tokento resume its channel presence without forcing a full re-login[cite: 1]. - Whisper Lists (Direct UDP Routing): The server supports targeted UDP forwarding. If a packet header contains a
Target_SessionToken, the server routes the audio strictly to that user, bypassing the standard channel broadcast.
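The validation and broadcast bookkeeping can be sketched without the async machinery. In this sketch a plain `HashMap` stands in for `DashMap`, and a shared `Arc<[u8]>` mirrors the `bytes`-crate idea of one reference-counted buffer sent to every peer; the struct and field names are illustrative:

```rust
// Sketch of relay routing: validate the session token against the source
// address, then collect every other SocketAddr in the sender's channel.
// HashMap stands in for DashMap; Arc<[u8]> stands in for bytes::Bytes.

use std::collections::HashMap;
use std::net::SocketAddr;
use std::sync::Arc;

struct Peer {
    addr: SocketAddr,
    channel: u32,
}

struct RelayState {
    peers: HashMap<u32, Peer>, // session token -> peer
}

impl RelayState {
    /// Returns the broadcast targets, or None if the packet must be dropped.
    fn route(&self, token: u32, src: SocketAddr) -> Option<Vec<SocketAddr>> {
        let sender = self.peers.get(&token)?;
        if sender.addr != src {
            return None; // Token does not match the IP/port: likely spoofed.
        }
        Some(
            self.peers
                .values()
                .filter(|p| p.channel == sender.channel && p.addr != src)
                .map(|p| p.addr)
                .collect(),
        )
    }
}

fn main() {
    let mut peers = HashMap::new();
    peers.insert(1, Peer { addr: "10.0.0.1:5000".parse().unwrap(), channel: 7 });
    peers.insert(2, Peer { addr: "10.0.0.2:5000".parse().unwrap(), channel: 7 });
    peers.insert(3, Peer { addr: "10.0.0.3:5000".parse().unwrap(), channel: 9 });
    let state = RelayState { peers };

    // One shared, reference-counted payload for all sends (no per-peer clone).
    let payload: Arc<[u8]> = Arc::from(&b"opus-frame"[..]);
    let targets = state.route(1, "10.0.0.1:5000".parse().unwrap()).unwrap();
    let expected: SocketAddr = "10.0.0.2:5000".parse().unwrap();
    assert_eq!(targets, vec![expected]); // channel 9 peer is excluded
    for addr in &targets {
        // socket.send_to(&payload, addr) would go here in the real relay.
        let _ = (Arc::clone(&payload), addr);
    }
    // Wrong source address for the token -> packet dropped.
    assert!(state.route(1, "6.6.6.6:1234".parse().unwrap()).is_none());
}
```

A whisper packet would skip the channel filter and look up the `Target_SessionToken`'s address directly.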
5. Wasm Plugin ABI (client_node/plugins)
Since the Wasm sandbox cannot access host memory directly, we use a shared "mailbox" system.
- The ABI Pattern:
  - Host (Rust) serializes event data (e.g., `OnMessage`) into JSON.
  - Host allocates a block of memory inside the Wasm instance and writes the JSON there.
  - Host calls the Wasm function, passing the memory pointer.
  - Guest (Wasm) processes the event and returns a pointer to its response.
- Audio Intercepts: For voice changers, the Host passes a raw `&mut [f32]` buffer to the plugin. The plugin modifies the samples in place before they reach the Opus encoder.
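The audio-intercept contract is easy to show in isolation: the host hands the plugin a mutable sample slice and expects it rewritten in place, with no allocation or resizing. A simple gain effect stands in here for a real voice changer, and the clamping behavior is an assumption of this sketch:

```rust
// Sketch of the plugin audio intercept: the host passes &mut [f32] and the
// plugin mutates the samples in place before they reach the Opus encoder.
// A gain "voice changer" stands in for a real DSP effect.

fn apply_intercept(samples: &mut [f32], gain: f32) {
    for s in samples.iter_mut() {
        // Modify in place; clamp to the valid [-1.0, 1.0] float-PCM range.
        *s = (*s * gain).clamp(-1.0, 1.0);
    }
}

fn main() {
    let mut frame = vec![0.1f32, -0.2, 0.9];
    apply_intercept(&mut frame, 2.0);
    assert_eq!(frame, vec![0.2, -0.4, 1.0]); // last sample clamped
}
```

In the Wasm setup, the same idea applies across the boundary: the host writes the frame into guest memory, calls the exported function, and reads the mutated samples back out of the same region.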
6. Persistence & State Management (server_node/database.rs)
The server uses `sqlx` for compile-time safe database interaction.
- Hashing: Use `Argon2id` with a salt of at least 16 bytes. Passwords should be hashed with a minimum of 3 passes and 64 MB of memory.
- Migrations: On startup, the server checks the `_sqlx_migrations` table. If the code expects a newer schema than the SQLite file has, it applies the `.sql` scripts in order before opening the network ports.
- Admin API: The `axum` web server requires a `Bearer` token (JWT) for all sensitive routes (`/api/kick`, `/api/ban`). This token is generated when the Admin logs into the dashboard.
- Permissions & Access Control: During TCP `ChannelJoin` events, the server checks the database for `Required_Role` and password locks before permitting entry.
- Client-Side Persistence (Bookmarks): The `client_node` maintains a local SQLite or `.toml` file to persist Server Bookmarks (IP, Port, Password, chosen Nickname) so users don't have to manually type connection details.
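The `ChannelJoin` gate can be sketched as pure logic. The structs below are illustrative stand-ins for rows fetched via `sqlx`, the role names are assumptions, and the plaintext password comparison is a simplification for channel locks (account passwords would go through `Argon2id` verification as described above):

```rust
// Sketch of the ChannelJoin permission gate: check Required_Role, then the
// optional channel password lock. Types are stand-ins for sqlx rows; the
// plain string comparison is a simplification for illustration only.

#[derive(PartialEq, PartialOrd)]
enum Role {
    Guest,
    Member,
    Admin, // Derived ordering: Guest < Member < Admin.
}

struct ChannelRow {
    required_role: Role,
    password: Option<String>, // None means the channel is unlocked.
}

enum JoinError {
    InsufficientRole,
    WrongPassword,
}

fn check_channel_join(
    channel: &ChannelRow,
    user_role: &Role,
    supplied_password: Option<&str>,
) -> Result<(), JoinError> {
    if user_role < &channel.required_role {
        return Err(JoinError::InsufficientRole);
    }
    match (&channel.password, supplied_password) {
        (Some(expected), Some(given)) if expected.as_str() == given => Ok(()),
        (Some(_), _) => Err(JoinError::WrongPassword),
        (None, _) => Ok(()),
    }
}

fn main() {
    let locked = ChannelRow { required_role: Role::Member, password: Some("hunter2".into()) };
    assert!(check_channel_join(&locked, &Role::Guest, Some("hunter2")).is_err()); // role too low
    assert!(check_channel_join(&locked, &Role::Member, Some("wrong")).is_err());  // bad password
    assert!(check_channel_join(&locked, &Role::Admin, Some("hunter2")).is_ok());
}
```

Running this check on the TCP control lane (rather than trusting the client) means the UDP relay never needs to re-verify permissions per packet.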
7. Zero-Conf Automation Logic (scripts/install.sh)
- Environment Check: The script verifies `systemd` availability.
- Permissioning: Creates a non-privileged `voiceapp` user to run the binary (security hardening).
- Auto-Update: `update.sh` compares the local binary hash against the `latest` release on GitHub via the API. If they differ, it downloads the new binary, replaces the old one, and runs `systemctl restart voice_app`.
8. Testing & Debugging Strategy
To ensure the real-time audio pipeline and network remain stable during development, several specific debugging tools are built directly into the workflow, completely avoiding the need for CLI flags or terminal commands.
- Developer Control Panel: A dedicated "Testing & Debugging" tab within the `egui` client settings. This provides a purely graphical interface for all diagnostic tools.
- UI-Driven Audio Dumper: A toggle in the Developer Panel that records and writes the DSP pipeline streams to `.wav` files (`raw_mic.wav`, `post_dsp.wav`, `post_opus_decode.wav`) to physically inspect audio quality degradation.
- UI-Driven Chaos Simulator: Sliders in the Developer Panel that dynamically inject artificial packet loss (%), latency (ms), and packet re-ordering into the outgoing UDP transport layer to stress-test the Jitter Buffer locally.
- In-App Debug Overlay: An `egui` diagnostic HUD toggled via a UI button (or `F3`) that overlays real-time metrics: Network Ping (TCP and UDP), Jitter Buffer depth (ms), packet loss percentage, and active Opus PLC triggers.
- Load Test Dashboard: The server's web admin dashboard (`axum`) will feature a "Stress Test" page. Instead of running terminal scripts, the server admin can click "Spawn 100 Bots", which dynamically spins up headless internal clients that broadcast `.wav` audio to verify the server's UDP routing capacity.
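The chaos simulator's loss stage is worth pinning down because reproducibility matters for debugging the Jitter Buffer. In this sketch a tiny linear congruential generator stands in for a real RNG so a given seed always drops the same packets; the constants and API are assumptions of the example:

```rust
// Sketch of the chaos simulator's packet-loss stage. A seeded LCG makes
// the drops reproducible, so a Jitter Buffer bug found at "30% loss,
// seed 42" can be replayed exactly.

struct Lcg(u64);

impl Lcg {
    /// Returns a pseudo-random value in [0, 100).
    fn next_pct(&mut self) -> u64 {
        // Multiplier/increment constants from Knuth's MMIX LCG.
        self.0 = self.0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.0 >> 33) % 100
    }
}

/// Drop each outgoing packet with probability `loss_pct` percent.
fn chaos_filter(packets: Vec<Vec<u8>>, loss_pct: u64, rng: &mut Lcg) -> Vec<Vec<u8>> {
    packets
        .into_iter()
        .filter(|_| rng.next_pct() >= loss_pct)
        .collect()
}

fn main() {
    let packets: Vec<Vec<u8>> = (0u8..100).map(|i| vec![i]).collect();
    let mut rng = Lcg(42);
    assert_eq!(chaos_filter(packets.clone(), 0, &mut rng).len(), 100); // 0% drops nothing
    let mut rng = Lcg(42);
    assert!(chaos_filter(packets, 100, &mut rng).is_empty()); // 100% drops everything
}
```

Latency injection and re-ordering would wrap the same stream: hold each surviving packet in a delay queue for a randomized interval before handing it to the UDP socket.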