Voice
Voice connections operate in a similar fashion to the Gateway connection. However, they use a different set of payloads and a separate UDP-based connection for voice data transmission. Because UDP is used for both receiving and transmitting voice data, your client must be able to receive UDP packets, even through a firewall or NAT (see UDP Hole Punching for more information). The Discord voice servers implement functionality (see IP Discovery) for discovering the local machine's remote UDP IP/Port, which can assist in some network configurations. While video from a camera feed is sent in the same connection as voice, streaming (the "Go Live" feature) uses seperate Gateway Opcodes and a second below voice connection.
Voice Gateway
To ensure that you have the most up-to-date information, please use version 7. Otherwise, we cannot guarantee that the events and commands documented here will reflect what you receive over the socket. Video is only fully supported on gateway v5 and above.
Gateway Versions
Version | Status | Default |
---|---|---|
7 | Available | âś“ Client |
6 | Available | |
5 | Available | |
4 | Available | |
3 | Available | |
2 | Available | |
1 | Available | âś“ API |
Gateway Commands
Name | Description |
---|---|
Identify | Start a new voice connection |
Resume | Resume a dropped connection |
Heartbeat | Maintain an active WebSocket connection |
Select Protocol | Select the voice protocol and mode |
Speaking | Indicate the user's speaking state |
Voice Backend Version | Request the current voice backend version |
Gateway Events
Name | Description |
---|---|
Hello | Defines the heartbeat interval |
Heartbeat ACK | Acknowledges a received client heartbeat |
Ready | Contains SSRC, IP/Port, experiment, and encryption mode information |
Resumed | Acknowledges a successful connection resume |
Session Description | Acknowledges a successful Select Protocol and contains the information needed to send/receive voice data |
Session Update | Client session description changed |
Speaking | User speaking state updated |
Client Disconnect | A user disconnected from voice |
Voice Backend Version | Current voice backend version information, as requested by the client |
Connecting to Voice
Retrieving Voice Server Information
The first step in connecting to a voice server (and in turn, a guild's voice channel or private channel) is formulating a request that can be sent to the Gateway, which will return information about the voice server we will connect to. Because Discord's voice platform is widely distributed, users should never cache or save the results of this call. To inform the gateway of our intent to establish voice connectivity, we first send an Update Voice State payload.
If our request succeeded, the gateway will respond with two events—a Voice State Update event and a Voice Server Update event—meaning you must properly wait for both events before continuing. The first will contain a new key, session_id
, and the second will provide voice server information we can use to establish a new voice connection.
With this information, we can move on to establishing a voice WebSocket connection.
Establishing a Voice WebSocket Connection
Once we retrieve a session_id
, token
, and endpoint
information, we can connect and handshake with the voice server over another secure WebSocket. Unlike the gateway endpoint we receive in an HTTP Get Gateway request, the endpoint received from our Voice Server Update payload does not contain a URL protocol, so some libraries may require manually prepending it with "wss://" before connecting. Once connected to the voice WebSocket endpoint, we can send an Opcode 0 Identify payload with our server_id
, user_id
, session_id
, and token
:
Identify Structure
Field | Type | Description |
---|---|---|
server_id | snowflake | The ID of the guild or private channel being connecting to |
user_id | snowflake | The ID of the current user |
session_id | string | The session ID of the current session |
token | string | The voice token for the current session |
video? | boolean | Whether or not this connection supports video |
streams? | array[stream object] | An array of video stream objects |
Example Identify
{"op": 0,"d": {"server_id": "41771983423143937","user_id": "104694319306248192","session_id": "my_session_id","token": "my_token","video": true,"streams": []}}
The voice server should respond with an Opcode 2 Ready payload, which informs us of the SSRC, connection IP/port, supported encryption modes, and experiments the voice server expects:
Ready Structure
Field | Type | Description |
---|---|---|
ssrc | integer | The SSRC of the user's voice connection |
ip | string | The IP address of the voice server |
port | integer | The port of the voice server |
modes | array[string] | An array of supported voice encryption modes |
experiments | array[string] | An array of available voice experiments |
streams | array[stream object] | An array of available video streams |
Example Ready
{"op": 2,"d": {"ssrc": 1,"ip": "127.0.0.1","port": 1234,"modes": ["aead_aes256_gcm_rtpsize","aead_aes256_gcm","aead_xchacha20_poly1305_rtpsize","xsalsa20_poly1305_lite_rtpsize","xsalsa20_poly1305_lite","xsalsa20_poly1305_suffix","xsalsa20_poly1305"],"experiments": ["fixed_keyframe_interval"],"streams": []}}
Heartbeating
In order to maintain your WebSocket connection, you need to continuously send heartbeats at the interval determined in Opcode 8 Hello:
Hello Structure
Field | Type | Description |
---|---|---|
v? | integer | The voice gateway version |
heartbeat_interval | integer | The minimum interval (in milliseconds) the client should heartbeat at |
Example Hello Before v3
{"heartbeat_interval": 41250}
Example Hello Since v3
{"op": 8,"d": {"v": 7,"heartbeat_interval": 41250}}
This is sent at the start of the connection. Be warned that the Opcode 8 Hello structure differs by gateway version as shown in the above examples. Versions below v3 do not have op
or d
fields. v3 and above was updated to be structured like other payloads. Be sure to expect this different format based on your version.
This heartbeat interval is the minimum interval you should heartbeat at. You can heartbeat at a faster interval if you wish. For example, the web client uses a heartbeat interval of min(heartbeat_interval, 5000)
if the gateway version is v4 or above, and heartbeat_interval * 0.1
otherwise. The desktop client uses the provided heartbeat interval if the gateway version is v4 or above, and heartbeat_interval * 0.25
otherwise.
After receiving Opcode 8 Hello, you should send Opcode 3 Heartbeat—which contains an integer nonce—every elapsed interval:
The gateway may request a heartbeat from the client in some situations by sending an Opcode 3 Heartbeat. When this occurs, the client should immediately send an Opcode 3 Heartbeat without waiting the remainder of the current interval.
Example Heartbeat
{"op": 3,"d": 1501184119561}
In return, you will be sent back an Opcode 6 Heartbeat ACK that contains the previously sent nonce:
Example Heartbeat ACK
{"op": 6,"d": 1501184119561}
Establishing a Voice Connection
Once we receive the properties of a voice server from our Ready payload, we can proceed to the final step of voice connections, which entails establishing and handshaking a connection for voice data. First, we connect to the IP and port provided in the Ready payload. We then send an Opcode 1 Select Protocol with details about our connection:
Select Protocol Structure
Field | Type | Description |
---|---|---|
protocol | string | The voice protocol to use (udp or webrtc ) |
data | ?protocol data | string | The voice connection data or WebRTC SDP |
rtc_connection_id? | string | The UUID4 RTC connection ID, used for analytics |
codecs? | array[codec object] | The supported audio/video codecs |
experiments? | array[string] | The received voice experiments to enable |
Protocol Data Structure
Field | Type | Description |
---|---|---|
address 1 | string | The discovered IP address of the client |
port 1 | integer | The discovered UDP port of the client |
mode | string | The encryption mode to use |
1 These fields are only used to receive voice data. If you do not care about receiving voice data, you can randomize these values.
Example Select Protocol
{"op": 1,"d": {"protocol": "udp","data": {"address": "127.0.0.1","port": 1337,"mode": "xsalsa20_poly1305_lite"},"experiments": ["fixed_keyframe_interval"]}}
Encryption Mode
Value | Name | Nonce Bytes | Generating Nonce |
---|---|---|---|
xsalsa20_poly1305 | XSalsa20 Poly1305 | The nonce bytes are the RTP header | Copy the RTP header |
xsalsa20_poly1305_suffix | XSalsa20 Poly1305 (Suffix) | The nonce bytes are 24 bytes appended to the payload of the RTP packet | 24 random bytes |
xsalsa20_poly1305_lite | XSalsa20 Poly1305 (Lite) | The nonce bytes are 4 bytes appended to the payload of the RTP packet | Incremental 4 bytes (32bit) int value |
xsalsa20_poly1305_lite_rtpsize | XSalsa20 Poly1305 (Lite, RTP Size) | The nonce bytes are 4 bytes appended to the payload of the RTP packet | Incremental 4 bytes (32bit) int value |
aead_aes256_gcm | AEAD AES256 GCM | The nonce bytes are 4 bytes appended to the payload of the RTP packet | Incremental 4 bytes (32bit) int value |
aead_aes256_gcm_rtpsize | AEAD AES256 GCM (RTP Size) | The nonce bytes are 4 bytes appended to the payload of the RTP packet | Incremental 4 bytes (32bit) int value |
aead_xchacha20_poly1305_rtpsize | AEAD XChaCha20 Poly1305 (RTP Size) | The nonce bytes are 4 bytes appended to the payload of the RTP packet | Incremental 4 bytes (32bit) int value |
Finally, the voice server will respond with an Opcode 4 Session Description that includes the mode
and secret_key
, a 32 byte array used for sending and receiving voice data:
Session Description Structure
Field | Type | Description |
---|---|---|
audio_codec | string | The audio codec to use |
video_codec | string | The video codec to use |
media_session_id | string | The media session ID, used for analytics |
mode? | string | The encryption mode to use, not applicable to WebRTC |
secret_key? | array[integer] | The 32 byte secret key used for encryption, not applicable to WebRTC |
sdp? | string | The WebRTC session description protocol |
keyframe_interval? | integer | The keyframe interval in milliseconds |
Example Session Description
{"op": 4,"d": {"audio_codec": "opus","media_session_id": "89f1d62f166b948746f7646713d39dbb","mode": "xsalsa20_poly1305_lite","secret_key": [ ...251, 100, 11...],"video_codec": "H264"}}
Sometimes, the voice server will later send an Opcode 14 Session Update to indicate an update to the voice session:
Session Update Structure
Field | Type | Description |
---|---|---|
audio_codec? | string | The new audio codec to use |
video_codec? | string | The new video codec to use |
media_session_id? | string | The new media session ID, used for analytics |
We can now start encrypting/decrypting and sending/receiving voice data over the previously established connection.
UDP Connections
UDP is the most likely protocol that clients will use. First, we open a UDP connection to the IP and port provided in the Ready payload. If required, we can now perform an IP Discovery using this connection. Once we've fully discovered our external IP and UDP port, we can then tell the voice WebSocket what it is by sending a Select Protocol as outlined above, and receive our Session Description to begin sending/receiving voice data.
IP Discovery
Generally routers on the Internet mask or obfuscate UDP ports through a process called NAT. Most users who implement voice will want to utilize IP discovery to find their external IP and port which will then be used for receiving voice communications. To retrieve your external IP and port, send the following UDP packet to your voice port (all numeric are big endian):
Field | Description | Size |
---|---|---|
Type | Values 0x1 and 0x2 indicate request and response, respectively | 2 bytes |
Length | Message length excluding Type and Length fields (value 70) | 2 bytes |
SSRC | Unsigned integer | 4 bytes |
Address | Null-terminated string in response | 64 bytes |
Port | Unsigned short | 2 bytes |
Sending and Receiving Voice
Voice data sent to and received from Discord should be encoded with Opus, using two channels (stereo) and a sample rate of 48kHz. Voice Data is sent using a RTP Header, followed by encrypted Opus audio data. Voice encryption uses the key passed in Session Description and the nonce formed with the 12 byte header appended with 12 null
bytes, if required. Discord encrypts with the libsodium encryption library.
When receiving data, the user who sent the voice packet is identified by caching the SSRC and user IDs received from Speaking events. At least one Speaking event for the user is received before any voice data is received, so the user ID should always be available.
Voice Packet Structure
Field | Type | Size |
---|---|---|
Version + Flags | Single byte value of 0x80 | 1 byte |
Payload Type | Single byte value of 0x78 | 1 byte |
Sequence | Unsigned short (big endian) | 2 bytes |
Timestamp | Unsigned integer (big endian) | 4 bytes |
SSRC | Unsigned integer (big endian) | 4 bytes |
Encrypted Audio | Binary data | n bytes |
WebRTC Connections
WebRTC allows for direct peer-to-peer voice connections, and is most commonly used in browsers. To use WebRTC, you must first send a Select Protocol payload as outlined above, with the protocol
field set to webrtc
, and data
set to the client's WebRTC SDP. The voice server will respond with a Session Description payload, with the sdp
field set to the server's WebRTC SDP. The client can then use this SDP to establish a WebRTC connection.
Speaking
To notify the voice server that you are speaking or have stopped speaking, send an Opcode 5 Speaking payload:
Speaking Structure
Field | Type | Description |
---|---|---|
speaking 1 | integer | The speaking flags |
ssrc | integer | The SSRC of the speaking user |
user_id 2 | snowflake | The user ID of the speaking user |
delay? 3 | integer | The speaking packet delay |
1 For gateway v4 and below, this field is a boolean.
2 Only sent by the voice server.
3 Not sent by the voice server.
Example Speaking (Send)
{"op": 5,"d": {"speaking": 5,"delay": 0,"ssrc": 1}}
When a different user's speaking state is updated, the voice server will send an Opcode 5 Speaking payload:
Example Speaking (Receive)
{"op": 5,"d": {"speaking": 5,"ssrc": 2,"user_id": "852892297661906993"}}
Speaking Flags
The following flags can be used as a bitwise mask. For example 5
would be priority and voice.
Value | Name | Description |
---|---|---|
1 << 0 | Microphone | Normal transmission of voice audio |
1 << 1 | Soundshare | Transmission of context audio for video, no speaking indicator |
1 << 2 | Priority | Priority speaker, lowering audio of other speakers |
Voice Data Interpolation
When there's a break in the sent data, the packet transmission shouldn't simply stop. Instead, send five frames of silence (0xF8, 0xFF, 0xFE
) before stopping to avoid unintended Opus interpolation with subsequent transmissions.
Likewise, when you receive these five frames of silence, you know that the user has stopped speaking.
Resuming Voice Connection
When your client detects that its connection has been severed, it should open a new WebSocket connection. Once the new connection has been opened, your client should send an Opcode 7 Resume payload:
Resume Structure
Field | Type | Description |
---|---|---|
server_id | snowflake | The ID of the guild or private channel being connecting to |
session_id | string | The session ID of the current session |
token | string | The voice token for the current session |
Example Resume
{"op": 7,"d": {"server_id": "41771983423143937","session_id": "my_session_id","token": "my_token"}}
If successful, the voice server will respond with an Opcode 9 Resumed to signal that your client is now resumed:
Example Resumed
{"op": 9,"d": null}
If the resume is unsuccessful—for example, due to an invalid session—the WebSocket connection will close with the appropriate close code. You should then follow the Connecting flow to reconnect.
Other Client Disconnection
When a user disconnects from voice, the voice server will send an Opcode 13 Client Disconnect payload:
When received, the SSRC of the user should be discarded.
Client Disconnect Structure
Field | Type | Description |
---|---|---|
user_id | snowflake | The ID of the user that disconnected |
Example Client Disconnect
{"op": 13,"d": {"user_id": "852892297661906993"}}
Voice Backend Version
For analytics, the client may want to receive information about the voice backend's current version. This is only available on gateway v6 and above. To do so, send an Opcode 16 Voice Backend Version with an empty payload:
Voice Backend Version Structure
Field | Type | Description |
---|---|---|
voice | string | The voice backend's version |
rtc_worker | string | The WebRTC worker's version |
Example Voice Backend Version (Send)
{"op": 16,"d": {}}
In response, the voice server will send an Opcode 16 Voice Backend Version payload with the versions:
Example Voice Backend Version (Receive)
{"op": 16,"d": {"voice": "0.9.1","rtc_worker": "0.3.35"}}