TS3-vibed/Documentation/Low_level_plan/Implementation_Plan.md
2026-05-03 10:50:25 +02:00


Low-Level Implementation Plan

1. Network Packet Anatomy (The Data Plane)

To minimize latency, we use a custom binary format for UDP voice data instead of JSON or Protobuf.

  • UDP Voice Header (Fixed 16 Bytes):
    • u32 (4 bytes): Session Token. Generated during the TCP handshake. The server drops any packet whose source IP/port does not match this token.
    • u64 (8 bytes): Sequence Number. Monotonically increasing per user. Essential for the Jitter Buffer to reorder packets.
    • u32 (4 bytes): Timestamp. Measured in audio samples (increments by 960 per 20 ms frame) to handle playback timing.
  • Payload: Raw Opus-encoded bytes (variable length, typically 60–120 bytes). The bitrate is not hardcoded; it is dictated dynamically by the server's ChannelConfig (e.g., 16 kbps for voice, 96 kbps for music bots) when the user joins a room.
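
As a sketch, the 16-byte header above can be packed and parsed with plain std. Field names are illustrative (the plan only fixes the sizes and order), and network byte order (big-endian) is an assumption:

```rust
const HEADER_LEN: usize = 16;

#[derive(Debug, PartialEq)]
struct VoiceHeader {
    session_token: u32, // issued during the TCP handshake
    sequence: u64,      // monotonically increasing per user
    timestamp: u32,     // in samples; +960 per 20 ms frame at 48 kHz
}

impl VoiceHeader {
    fn to_bytes(&self) -> [u8; HEADER_LEN] {
        let mut buf = [0u8; HEADER_LEN];
        buf[0..4].copy_from_slice(&self.session_token.to_be_bytes());
        buf[4..12].copy_from_slice(&self.sequence.to_be_bytes());
        buf[12..16].copy_from_slice(&self.timestamp.to_be_bytes());
        buf
    }

    fn from_bytes(buf: &[u8]) -> Option<Self> {
        if buf.len() < HEADER_LEN {
            return None; // too short to be a voice packet (e.g. a keep-alive)
        }
        Some(VoiceHeader {
            session_token: u32::from_be_bytes(buf[0..4].try_into().ok()?),
            sequence: u64::from_be_bytes(buf[4..12].try_into().ok()?),
            timestamp: u32::from_be_bytes(buf[12..16].try_into().ok()?),
        })
    }
}
```

The Opus payload then follows the header verbatim in the same datagram.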

2. Real-Time Audio Pipeline (client_node/audio)

Audio threads must be "lock-free" to prevent stuttering. We use a Single-Producer, Single-Consumer (SPSC) ring buffer.

  • Global Hotkeys / Push-to-Talk:
    • Use global-hotkey (or rdev) to hook OS-level key presses, allowing PTT even when minimized.
  • Microphone Thread (The Producer):
    • Initialize cpal with a 48 kHz input stream.
    • Rule: The hardware callback must only push raw f32 samples into the ringbuf. No networking or heavy math is allowed here.
  • DSP/Encoder Thread (The Consumer):
    • Pull samples from ringbuf.
    • Process via webrtc_audio_processing (Echo Cancellation, Noise Suppression, and Voice Activity Detection/VAD). If VAD detects silence, stop transmitting to save bandwidth.
    • Accumulate exactly 960 samples (20 ms).
    • Pass to audiopus::Encoder.
    • Send the resulting bytes to the Network Task via an asynchronous MPSC channel.
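
The consumer's frame-accumulation step can be sketched with std only. Here an mpsc channel stands in for the SPSC ring buffer, and the `encode` closure stands in for the DSP + audiopus encode + network send; both names are illustrative:

```rust
use std::sync::mpsc;

const FRAME_SAMPLES: usize = 960; // 20 ms at 48 kHz, mono

/// Accumulates incoming f32 samples into exact 960-sample frames,
/// the unit the Opus encoder expects. `rx` stands in for the SPSC
/// ring buffer consumer; `encode` stands in for DSP + encode + send.
fn consume_frames(rx: mpsc::Receiver<f32>, mut encode: impl FnMut(&[f32])) {
    let mut frame = Vec::with_capacity(FRAME_SAMPLES);
    for sample in rx {
        frame.push(sample);
        if frame.len() == FRAME_SAMPLES {
            encode(&frame); // VAD would gate this call in the real pipeline
            frame.clear();
        }
    }
    // Any partial trailing frame is dropped here; a real pipeline
    // would hold it for the next tick instead.
}
```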

3. Jitter Buffer & Playback Logic (client_node/network)

The Jitter Buffer compensates for unstable internet connections by adding a controlled "latency tax".

  • The Sorting Mechanism: Incoming UDP packets are inserted into a BinaryHeap (Min-Heap) sorted by Sequence Number.
  • The Watermark Strategy:
    • Wait until the heap contains at least 40 ms (2 frames) of audio before starting playback.
    • This buffer allows late-arriving packets to be inserted in the correct order.
  • Playback Tick: Every 20 ms, the playback thread pops the next sequence number.
    • Success: Decode the packet. Before pushing to the master cpal speaker buffer, multiply the specific user's f32 decoded array by their local volume scalar (e.g., 0.5 for 50% volume) to enable Per-User Volume Control.
    • Missing (Packet Loss): If the sequence number is missing, call audiopus::Decoder::decode with a None frame to trigger Packet Loss Concealment (PLC), which synthesizes a "guess" of the missing sound.
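
A minimal std-only sketch of the heap, watermark, and loss-detection logic (types are illustrative; a real buffer would also carry the timestamp and handle sequence wraparound):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

const WATERMARK_FRAMES: usize = 2; // 2 x 20 ms = 40 ms

/// Min-heap jitter buffer keyed by sequence number, with a
/// 2-frame (40 ms) watermark before playback starts.
struct JitterBuffer {
    heap: BinaryHeap<Reverse<(u64, Vec<u8>)>>, // (sequence, opus payload)
    next_seq: u64,
    started: bool,
}

impl JitterBuffer {
    fn new() -> Self {
        Self { heap: BinaryHeap::new(), next_seq: 0, started: false }
    }

    fn insert(&mut self, seq: u64, payload: Vec<u8>) {
        self.heap.push(Reverse((seq, payload)));
    }

    /// Called every 20 ms by the playback tick. Returns Some(payload)
    /// on success, or None for a lost packet (the caller then feeds
    /// None to the Opus decoder to trigger PLC).
    fn pop_tick(&mut self) -> Option<Vec<u8>> {
        if !self.started {
            if self.heap.len() < WATERMARK_FRAMES {
                return None; // still filling the watermark
            }
            self.started = true;
            self.next_seq = self.heap.peek().map(|Reverse((s, _))| *s).unwrap();
        }
        let expected = self.next_seq;
        self.next_seq += 1;
        match self.heap.peek() {
            Some(Reverse((s, _))) if *s == expected => {
                let Reverse((_, payload)) = self.heap.pop().unwrap();
                Some(payload)
            }
            _ => None, // missing sequence number -> PLC
        }
    }
}
```

Per-user volume then multiplies the decoded f32 frame by the user's volume scalar before mixing into the master cpal buffer.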

4. Server Relay & Routing (server_node/udp_relay.rs)

The server acts as a high-speed traffic controller. It must be "Zero-Copy" where possible.

  • Validation: Use tokio::net::UdpSocket. On receipt, verify the u32 Session Token against the DashMap state.
  • Broadcast Logic:
    1. Identify the sender's current ChannelId.
    2. Retrieve the list of SocketAddr for every other user in that channel.
    3. Iterate and send the exact byte buffer to each address. Use the bytes crate to share the buffer via reference counting (Arc) instead of cloning.
  • NAT Keep-Alives: The server must ignore empty 0-byte UDP packets (used by clients to keep router ports open).
  • TCP Control Lane & Chat Routing: The TCP router handles synchronized text messages and broadcasts them to users in the same ChannelId.
  • Stateful Auto-Reconnect: If the TCP socket drops, the client quietly reconnects and submits its existing Session Token to resume its channel presence without forcing a full re-login.
  • Whisper Lists (Direct UDP Routing): The server supports targeted UDP forwarding. If a packet header contains a Target_SessionToken, the server routes the audio strictly to that user, bypassing the standard channel broadcast.
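
The broadcast step might look like this std-only sketch, where the `send` closure stands in for tokio's `UdpSocket::send_to` and a plain `Arc<Vec<u8>>` stands in for the bytes crate's shared buffer (all names illustrative):

```rust
use std::collections::HashMap;
use std::net::SocketAddr;
use std::sync::Arc;

type ChannelId = u32;

/// Looks up everyone in the sender's channel and hands each send the
/// *same* reference-counted buffer: Arc::clone bumps a counter, the
/// payload bytes are never copied per peer.
fn broadcast(
    channels: &HashMap<ChannelId, Vec<SocketAddr>>,
    channel: ChannelId,
    sender: SocketAddr,
    packet: Arc<Vec<u8>>,
    mut send: impl FnMut(SocketAddr, Arc<Vec<u8>>),
) {
    if let Some(peers) = channels.get(&channel) {
        for &addr in peers {
            if addr != sender {
                send(addr, Arc::clone(&packet)); // no byte copy, just a refcount
            }
        }
    }
}
```

A whisper would bypass this lookup and resolve the Target_SessionToken to a single SocketAddr instead.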

5. Wasm Plugin ABI (client_node/plugins)

Since the Wasm sandbox cannot access host memory directly, we use a shared "mailbox" system.

  • The ABI Pattern:
    1. Host (Rust) serializes event data (e.g., OnMessage) into JSON.
    2. Host allocates a block of memory inside the Wasm instance and writes the JSON there.
    3. Host calls the Wasm function, passing the memory pointer.
    4. Guest (Wasm) processes and returns a pointer to its response.
  • Audio Intercepts: For voice changers, the Host passes a raw &mut [f32] buffer to the plugin. The plugin modifies the samples "in-place" before they reach the Opus encoder.
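
The in-place intercept contract can be illustrated without a Wasm runtime. Here a simple gain effect stands in for a real voice-changer plugin; in the actual ABI this function body would live in the guest, operating on a slice reconstructed from a pointer into the guest's linear memory:

```rust
/// Host hands the plugin a mutable sample buffer; the plugin edits it
/// in place before the samples reach the Opus encoder. A gain effect
/// stands in for a real voice-changer here.
fn intercept_audio(samples: &mut [f32], gain: f32) {
    for s in samples.iter_mut() {
        *s = (*s * gain).clamp(-1.0, 1.0); // keep samples in the valid range
    }
}
```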

6. Persistence & State Management (server_node/database.rs)

The server uses sqlx for compile-time safe database interaction.

  • Hashing: Use Argon2id with a salt of at least 16 bytes. Passwords should be hashed with a minimum of 3 passes and 64 MB of memory.
  • Migrations: On startup, the server checks the _sqlx_migrations table. If the code expects a newer schema than the SQLite file has, it applies the .sql scripts in order before opening the network ports.
  • Admin API: The axum web server requires a Bearer token (JWT) for all sensitive routes (/api/kick, /api/ban). This token is generated when the Admin logs into the dashboard.
  • Permissions & Access Control: During TCP ChannelJoin events, the server checks the database for Required_Role and password locks before permitting entry.
  • Client-Side Persistence (Bookmarks): The client_node maintains a local SQLite or .toml file to persist Server Bookmarks (IP, Port, Password, chosen Nickname) so users don't have to manually type connection details.
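
One pitfall worth pinning down: most Rust Argon2 crates take the memory cost in KiB, so the "3 passes, 64 MB" policy above maps to the following values (illustrative constant names, not a specific crate's API):

```rust
/// Argon2id parameters matching the hashing policy above.
const ARGON2_TIME_COST: u32 = 3;          // passes (iterations)
const ARGON2_MEMORY_KIB: u32 = 64 * 1024; // 64 MB expressed in KiB
const ARGON2_PARALLELISM: u32 = 1;        // lanes; tune per server core count
const ARGON2_SALT_LEN: usize = 16;        // minimum salt length in bytes
```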

7. Zero-Conf Automation Logic (scripts/install.sh)

  • Environment Check: Script verifies systemd availability.
  • Permissioning: Creates a non-privileged voiceapp user to run the binary (security hardening).
  • Auto-Update: update.sh compares the local binary hash against the latest release on GitHub via the API. If they differ, it downloads the new binary, replaces the old one, and runs systemctl restart voice_app.

8. Testing & Debugging Strategy

To ensure the real-time audio pipeline and network remain stable during development, several specific debugging tools are built directly into the workflow, completely avoiding the need for CLI flags or terminal commands.

  • Developer Control Panel: A dedicated "Testing & Debugging" tab within the egui client settings. This provides a purely graphical interface for all diagnostic tools.
  • UI-Driven Audio Dumper: A toggle in the Developer Panel that instantly records and writes the DSP pipeline streams to .wav files (raw_mic.wav, post_dsp.wav, post_opus_decode.wav) to physically inspect audio quality degradation.
  • UI-Driven Chaos Simulator: Sliders in the Developer Panel that dynamically inject artificial packet loss (%), latency (ms), and packet re-ordering into the outgoing UDP transport layer to stress-test the Jitter Buffer locally.
  • In-App Debug Overlay: An egui diagnostic HUD toggled via a UI button (or F3) that overlays real-time metrics: Network Ping (TCP and UDP), Jitter Buffer depth (ms), packet loss percentage, and active Opus PLC triggers.
  • Load Test Dashboard: The Server's web admin dashboard (axum) will feature a "Stress Test" page. Instead of running terminal scripts, the server admin can click "Spawn 100 Bots", which dynamically spins up headless internal clients that broadcast .wav audio to verify the server's UDP routing capacity.
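
The Chaos Simulator's loss slider could drive something like this dependency-free sketch on the outgoing UDP path (a tiny LCG keeps it deterministic and crate-free; names are illustrative, and a real build would add similar hooks for latency and re-ordering):

```rust
/// Drops each outgoing packet with the configured probability.
struct ChaosInjector {
    loss_percent: u8, // 0..=100, driven by the Developer Panel slider
    rng_state: u64,
}

impl ChaosInjector {
    fn new(loss_percent: u8, seed: u64) -> Self {
        Self { loss_percent, rng_state: seed }
    }

    /// Returns true if the packet should actually be sent.
    fn should_send(&mut self) -> bool {
        // Linear congruential generator (Knuth/MMIX constants).
        self.rng_state = self
            .rng_state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        let roll = (self.rng_state >> 33) % 100; // uniform-ish 0..100
        roll >= self.loss_percent as u64
    }
}
```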