# dTelecom x402 Gateway

> Pay-per-use WebRTC, Speech-to-Text, and Text-to-Speech via x402 and MPP payment protocols.
> Base URL: https://x402.dtelecom.org

## Overview

This gateway provides real-time communication services billed through the x402 and MPP (Machine Payments Protocol) payment protocols. Clients buy microcredits with USDC on Solana, Base (EVM), or Tempo (MPP), or with USDT on TRON (via the Bank of AI facilitator), then spend those credits on WebRTC, STT, and TTS sessions. There are no API keys or accounts to create; authentication is wallet-based only.

## Services & Pricing

- WebRTC: $0.001 per audio participant per minute (1,000 microcredits/min)
- STT: $0.006/min (6,000 microcredits/min)
- TTS: $0.008/1K chars (8,000 microcredits/1K chars)
- Agent Session (bundle): ~$0.015/min (WebRTC + STT + TTS)
- 1 USD = 1,000,000 microcredits
- Minimum purchase: $0.10 (~6.5 min of bundled agent voice)
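
To make the rates above concrete, here is a minimal cost-estimation sketch in TypeScript. The function and type names are hypothetical, and it assumes usage is billed pro rata; the gateway's actual rounding rules are not stated here.

```typescript
// Rates copied from the pricing list above, expressed in microcredits.
const MICROCREDITS_PER_USD = 1000000;
const WEBRTC_PER_PARTICIPANT_MINUTE = 1000;
const STT_PER_MINUTE = 6000;
const TTS_PER_1K_CHARS = 8000;

interface UsageEstimate {
  webrtcParticipantMinutes: number;
  sttMinutes: number;
  ttsChars: number;
}

// Assumption: pro-rata billing, rounded up to a whole microcredit.
function estimateMicrocredits(usage: UsageEstimate): number {
  const webrtc = usage.webrtcParticipantMinutes * WEBRTC_PER_PARTICIPANT_MINUTE;
  const stt = usage.sttMinutes * STT_PER_MINUTE;
  const tts = (usage.ttsChars / 1000) * TTS_PER_1K_CHARS;
  return Math.ceil(webrtc + stt + tts);
}

function microcreditsToUsd(microcredits: number): number {
  return microcredits / MICROCREDITS_PER_USD;
}

// One bundled minute: 1 audio participant, 1 min of STT, 1K chars of TTS.
const oneBundledMinute = estimateMicrocredits({
  webrtcParticipantMinutes: 1,
  sttMinutes: 1,
  ttsChars: 1000,
});
```

Under these assumptions one bundled minute costs 15,000 microcredits ($0.015), so the $0.10 minimum covers between six and seven minutes, consistent with the rough figure above.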

## WebRTC — dTelecom DePIN SFU

- LiveKit fork with Solana-based node discovery and Ed25519 JWT signing
- Decentralized SFU nodes (not centralized servers) — horizontally scalable
- SVC/simulcast, adaptive bitrate, E2EE support
- Room-based multi-participant sessions with metadata
- SDKs: JavaScript, React (pre-built UI components: VideoConference, Chat, GridLayout)
- Features: speaker detection, selective subscription, moderation, data messages
- Webhooks: room lifecycle events, participant join/leave

## STT — Dual-Engine Speech-to-Text

### Engines
- **Parakeet-TDT 0.6B**: 25 European languages, 3-4x faster than Whisper
- **Whisper large-v3-turbo**: 99+ languages, contextual prompting

### Smart Routing
- `force_model=whisper` → always Whisper
- `language=auto` → Parakeet preferred (native auto-detect)
- Language not in the Parakeet set → Whisper
- Parakeet available → Parakeet (preferred); Parakeet busy → Whisper fallback
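
The routing rules above can be read as a small decision function. The sketch below is illustrative only and is not the gateway's implementation: `parakeetAvailable` is a hypothetical flag standing in for server load, and it assumes `language=auto` also falls back to Whisper when Parakeet is busy.

```typescript
type SttEngine = "parakeet" | "whisper";

// The 25 Parakeet language codes listed in this document.
const PARAKEET_LANGUAGES: string[] = [
  "en", "ru", "bg", "hr", "cs", "da", "nl", "et", "fi", "fr", "de", "el", "hu",
  "it", "lv", "lt", "mt", "pl", "pt", "ro", "sk", "sl", "es", "sv", "uk",
];

interface RoutingInput {
  language: string;           // language code or "auto"
  forceModel?: string;        // mirrors the force_model option
  parakeetAvailable: boolean; // hypothetical: whether a Parakeet worker is free
}

function pickSttEngine(input: RoutingInput): SttEngine {
  // Rule 1: an explicit override always wins.
  if (input.forceModel === "whisper") {
    return "whisper";
  }
  // Rules 2 and 3: Parakeet handles auto-detect and its 25 languages only.
  const parakeetCapable =
    input.language === "auto" || PARAKEET_LANGUAGES.indexOf(input.language) !== -1;
  if (!parakeetCapable) {
    return "whisper";
  }
  // Rule 4: prefer Parakeet, fall back to Whisper when it is busy.
  return input.parakeetAvailable ? "parakeet" : "whisper";
}
```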

### Parakeet Languages (25)
en, ru, bg, hr, cs, da, nl, et, fi, fr, de, el, hu, it, lv, lt, mt, pl, pt, ro, sk, sl, es, sv, uk

### Top WER (FLEURS benchmark)
Italian 3.0%, Spanish 3.5%, Portuguese 4.8%, English 4.9%, German 5.0%, French 5.2%, Russian 5.5%, Ukrainian 6.8%, Polish 7.3%, Dutch 7.5%

### Whisper Language Tiers
- High-resource (3-8% WER): en, es, fr, de, it, pt, zh, ja, ko, nl
- Medium-resource (8-15% WER): ru, pl, cs, tr, ar, hi, sv, id, vi, uk, ro, hu, fi, da, no, el, he, th, bg, hr, sk, ca, sl, lt, lv, et, sr, ms, gl, eu
- Long-tail: 60+ additional languages including af, sq, am, hy, az, bn, be, bs, ka, gu, ha, is, kn, kk, km, lo, mk, ml, mi, mr, mn, ne, fa, pa, so, sw, tl, ta, te, uz, cy, yi, yo

### Audio Pipeline
Silero VAD → GTCRN noise reduction → speech validation → silence trimming → Whisper/Parakeet → hallucination filter

### WebSocket Protocol
- Config (first message): `{"type": "config", "language": "en"}` (add `session_key` for auth)
- Mid-session reconfig: `{"type": "config", "language": "es"}`
- Client→Server: binary PCM16 audio, `flush`, `reset`, `ping`, `extend`
- Server→Client: `ready`, `vad_event` (speech_start/speech_end), `transcription`, `pong`, `error`, `session_expiring`, `session_extended`
- Reconnection: the session clock pauses on disconnect; reconnect with the same `session_key` to resume
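
As a rough illustration of the messages above, the following sketch builds the config message and filters incoming server events. The helper names are hypothetical, and it assumes server events are JSON objects that carry the event name in a `type` field, which the list above implies but does not state outright.

```typescript
// Server-to-client event names listed above.
const STT_SERVER_EVENTS: string[] = [
  "ready", "vad_event", "transcription", "pong",
  "error", "session_expiring", "session_extended",
];

interface SttConfigMessage {
  type: "config";
  language: string;
  session_key?: string;
}

// Builds the first (or a mid-session) config message as a JSON string.
function buildSttConfig(language: string, sessionKey?: string): string {
  const message: SttConfigMessage = { type: "config", language: language };
  if (sessionKey !== undefined) {
    message.session_key = sessionKey;
  }
  return JSON.stringify(message);
}

// Assumption: each text frame is a JSON object with a string "type".
// Returns null for malformed frames or unknown event names.
function parseSttServerEvent(raw: string): { type: string } | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (err) {
    return null;
  }
  if (typeof parsed !== "object" || parsed === null) {
    return null;
  }
  const eventType = (parsed as { type?: unknown }).type;
  if (typeof eventType !== "string") {
    return null;
  }
  return STT_SERVER_EVENTS.indexOf(eventType) !== -1 ? { type: eventType } : null;
}
```

A client would send `buildSttConfig("en", sessionKey)` as its first text frame, then stream binary audio and pass every text frame it receives through `parseSttServerEvent`.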

### Audio Format
PCM16, 16kHz, mono, little-endian, 20-100ms chunks recommended
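
Browser and Node audio sources often produce float samples, so a client usually has to convert and slice audio before sending it. The sketch below shows one way to produce the format above; the function names are hypothetical and resampling to 16 kHz is assumed to have happened already.

```typescript
const STT_SAMPLE_RATE_HZ = 16000;
const STT_BYTES_PER_SAMPLE = 2; // PCM16

// Converts float samples in [-1, 1] to little-endian PCM16 bytes.
function floatToPcm16le(samples: Float32Array): Uint8Array {
  const out = new Uint8Array(samples.length * STT_BYTES_PER_SAMPLE);
  const view = new DataView(out.buffer);
  for (let i = 0; i < samples.length; i++) {
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    const scaled = clamped < 0 ? clamped * 32768 : clamped * 32767;
    view.setInt16(i * STT_BYTES_PER_SAMPLE, Math.round(scaled), true); // true = little-endian
  }
  return out;
}

// Splits PCM bytes into chunks of chunkMs milliseconds (the last may be shorter).
function chunkPcm(pcm: Uint8Array, chunkMs: number): Uint8Array[] {
  if (!(chunkMs > 0)) {
    throw new RangeError("chunkMs must be positive");
  }
  const bytesPerChunk = ((STT_SAMPLE_RATE_HZ * chunkMs) / 1000) * STT_BYTES_PER_SAMPLE;
  const chunks: Uint8Array[] = [];
  for (let offset = 0; offset < pcm.length; offset += bytesPerChunk) {
    chunks.push(pcm.subarray(offset, Math.min(offset + bytesPerChunk, pcm.length)));
  }
  return chunks;
}
```

At 16 kHz mono PCM16, a 20 ms chunk is 640 bytes and a 100 ms chunk is 3,200 bytes.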

## TTS — Neural Text-to-Speech

### Model
Kokoro 82M (MLX, Apple Silicon optimized). Accepts text input with per-message voice, speed, and language overrides.

### 54 Voices across 9 Languages

**American English (lang_code "a"):** af_alloy, af_aoede, af_bella, af_heart (default), af_jessica, af_kore, af_nicole, af_nova, af_river, af_sarah, af_sky, am_adam, am_echo, am_eric, am_fenrir, am_liam, am_michael, am_onyx, am_puck, am_santa
**British English ("b"):** bf_alice, bf_emma, bf_isabella, bf_lily, bm_daniel, bm_fable, bm_george, bm_lewis
**Spanish ("e"):** ef_dora, em_alex, em_santa
**French ("f"):** ff_siwis
**Hindi ("h"):** hf_alpha, hf_beta, hm_omega, hm_psi
**Italian ("i"):** if_sara, im_nicola
**Japanese ("j"):** jf_alpha, jf_gongitsune, jf_nezumi, jf_tebukuro, jm_kumo
**Portuguese Brazilian ("p"):** pf_dora, pm_alex, pm_santa
**Mandarin Chinese ("z"):** zf_xiaobei, zf_xiaoni, zf_xiaoxiao, zf_xiaoyi, zm_yunjian, zm_yunxi, zm_yunxia, zm_yunyang

### Language Code Aliases
"a"=en-us, "b"=en-gb, "e"=es, "f"=fr, "h"=hi, "i"=it, "j"=ja, "p"=pt-br, "z"=zh
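
A client may want to normalize user-facing locale strings to the single-letter codes. The sketch below uses hypothetical helper names and assumes the single-letter form is what gets sent as `lang_code`. The voice-prefix helper relies on an observation about the voice list above (every voice ID starts with its language letter), not on a documented rule.

```typescript
// Alias table copied from the list above.
const LANG_CODE_BY_ALIAS: { [alias: string]: string } = {
  "en-us": "a", "en-gb": "b", "es": "e", "fr": "f", "hi": "h",
  "it": "i", "ja": "j", "pt-br": "p", "zh": "z",
};

// Accepts either a single-letter lang_code or one of its aliases.
function resolveLangCode(input: string): string | null {
  const key = input.toLowerCase();
  const aliases = Object.keys(LANG_CODE_BY_ALIAS);
  for (let i = 0; i < aliases.length; i++) {
    if (LANG_CODE_BY_ALIAS[aliases[i]] === key) {
      return key; // already a valid single-letter code
    }
  }
  return aliases.indexOf(key) !== -1 ? LANG_CODE_BY_ALIAS[key] : null;
}

// Observation only: listed voice IDs begin with their lang_code letter.
function langCodeForVoice(voiceId: string): string | null {
  return resolveLangCode(voiceId.charAt(0));
}
```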

### Voice Blending
Pass comma-delimited voice IDs to blend voices; the gateway averages their voice tensors: `"af_heart,af_bella"`

### Speed Control
Float multiplier from 0.5 to 2.0. It scales phoneme duration without shifting pitch.

### WebSocket Protocol
- Auth (first message on /v1/stream): `{"session_key": "jwt"}` with optional `config`
- Send text: `{"text": "...", "voice": "af_heart", "lang_code": "a", "speed": 1.0}`
- Session config: `{"config": {"voice": "...", "lang_code": "...", "speed": 1.0}}`
- Barge-in: `{"type": "clear"}` → server responds `{"type": "cleared"}`
- Server responses: `generating`, `done`, `cleared`, `error`, `extended`
- Binary audio out: PCM16, 48kHz, mono, 1920 bytes/chunk (20ms)
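
The message shapes above can be assembled with a small helper. This is a sketch with hypothetical names: the speed check mirrors the documented 0.5 to 2.0 range, and the voice list is joined with commas to request a blend. The duration helper shows where the 1920-byte figure comes from.

```typescript
const TTS_SAMPLE_RATE_HZ = 48000;
const TTS_BYTES_PER_SAMPLE = 2; // PCM16, mono

interface TtsTextMessage {
  text: string;
  voice: string;
  lang_code: string;
  speed: number;
}

// Builds a "send text" message; several voice IDs request a blended voice.
function buildTtsText(text: string, voices: string[], langCode: string, speed: number): string {
  if (voices.length === 0) {
    throw new RangeError("at least one voice ID is required");
  }
  if (!(speed >= 0.5 && speed <= 2.0)) {
    throw new RangeError("speed must be between 0.5 and 2.0");
  }
  const message: TtsTextMessage = {
    text: text,
    voice: voices.join(","),
    lang_code: langCode,
    speed: speed,
  };
  return JSON.stringify(message);
}

// Barge-in request: asks the server to drop queued audio.
const TTS_CLEAR_MESSAGE = JSON.stringify({ type: "clear" });

// Duration in milliseconds of a binary audio chunk of the given size.
function ttsChunkDurationMs(byteLength: number): number {
  return (byteLength * 1000) / (TTS_BYTES_PER_SAMPLE * TTS_SAMPLE_RATE_HZ);
}
```

At 48 kHz, 20 ms is 960 samples, and at 2 bytes per sample that gives the 1920 bytes per chunk listed above.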

### Chunking
English text is split at sentence boundaries when its phoneme string exceeds 510 characters. Non-English text is split at sentence boundaries into chunks of roughly 400 characters.
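
To illustrate sentence-boundary chunking, here is a naive client-side sketch. It is not the gateway's chunker. It counts plain characters rather than phonemes, so it approximates only the non-English rule, and its sentence detection is a simple punctuation regex that will mis-split abbreviations and decimals.

```typescript
// Groups sentences into chunks of at most maxChars characters.
// A single sentence longer than maxChars is kept whole as its own chunk.
function chunkBySentence(text: string, maxChars: number): string[] {
  // A sentence: text up to ., ! or ?, plus any trailing whitespace.
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) || [];
  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    // Start a new chunk when adding this sentence would exceed the limit.
    if (current.length > 0 && current.length + sentence.length > maxChars) {
      chunks.push(current.trim());
      current = "";
    }
    current += sentence;
  }
  if (current.trim().length > 0) {
    chunks.push(current.trim());
  }
  return chunks;
}
```

For example, `chunkBySentence("One. Two. Three.", 10)` returns `["One. Two.", "Three."]`. Calling it with a limit of 400 approximates the non-English behaviour described above.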

## Authentication

Wallet-based signature auth. Sign `METHOD\nPATH\nTIMESTAMP` with Ed25519 (Solana) or ECDSA (EVM).

Headers: Authorization, X-Wallet-Address, X-Wallet-Chain, X-Timestamp
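
The sketch below shows how the signed string and headers might be assembled. Several details are assumptions because this summary does not specify them: the timestamp unit, the signature encoding, the exact `Authorization` value format, and the accepted `X-Wallet-Chain` values. The `sign` callback is a hypothetical stand-in for the wallet's Ed25519 or ECDSA signing routine; consult /docs.md for the authoritative format.

```typescript
interface WalletAuthHeaders {
  Authorization: string;
  "X-Wallet-Address": string;
  "X-Wallet-Chain": string;
  "X-Timestamp": string;
}

// Joins the three fields with newlines: METHOD\nPATH\nTIMESTAMP.
function buildSigningPayload(method: string, path: string, timestamp: string): string {
  return [method, path, timestamp].join("\n");
}

// `sign` is supplied by the caller and returns an already-encoded signature.
// Assumption: the encoded signature is used directly as the Authorization value.
function buildAuthHeaders(
  method: string,
  path: string,
  timestamp: string,
  walletAddress: string,
  chain: string,
  sign: (payload: string) => string
): WalletAuthHeaders {
  const payload = buildSigningPayload(method, path, timestamp);
  return {
    Authorization: sign(payload),
    "X-Wallet-Address": walletAddress,
    "X-Wallet-Chain": chain,
    "X-Timestamp": timestamp,
  };
}
```

Keeping the signer as a callback lets the same helper serve both Solana and EVM wallets without pulling a crypto library into the example.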

## SDKs

- @dtelecom/x402-client — npm package, wraps all gateway APIs + x402 payment (npm: https://www.npmjs.com/package/@dtelecom/x402-client, GitHub: https://github.com/dTelecom/x402-client)
- @dtelecom/agents-js — voice agent framework with DtelecomSTT/DtelecomTTS providers (npm: https://www.npmjs.com/package/@dtelecom/agents-js, GitHub: https://github.com/dTelecom/agents-js)

## Example

- AI Language Tutor: live demo at https://ai-tutor-demo.dtelecom.org/, source at https://github.com/dTelecom/ai-tutor-demo

## Key Endpoints

- POST /v1/credits/purchase — Buy credits (x402 payment, USDC on Solana/Base)
- POST /v1/credits/purchase/mpp — Buy credits (MPP payment, USDC on Tempo)
- POST /v1/credits/purchase/tron — Buy credits (x402 payment via Bank of AI facilitator, USDT on TRON)
- GET /v1/account — Account info and balance
- POST /v1/webrtc/token — Create WebRTC session
- POST /v1/stt/session — Create STT session
- POST /v1/tts/session — Create TTS session
- POST /v1/agent-session — Create bundled session (WebRTC + STT + TTS). Optional `client_identity` param generates dual WebRTC tokens (agent + client) with independent geo-routing.
- POST /v1/agent-session/extend — Extend all sessions in a bundle
- GET /v1/pricing — Pricing info (public)
- GET /v1/servers/status — Server availability (public)

## Detailed API docs: /docs.md