Voice

Voice connections operate in a similar fashion to the Gateway connection. However, they use a different set of payloads and a separate UDP-based connection for voice data transmission. Because UDP is used for both receiving and transmitting voice data, your client must be able to receive UDP packets, even through a firewall or NAT (see UDP Hole Punching for more information). The Discord voice servers implement functionality (see IP Discovery) for discovering the local machine's remote UDP IP/port, which can assist in some network configurations. While video from a camera feed is sent in the same connection as voice, streaming (the "Go Live" feature) uses separate Gateway Opcodes and a second, separate voice connection.

Voice Gateway

To ensure that you have the most up-to-date information, please use version 7. Otherwise, we cannot guarantee that the events and commands documented here will reflect what you receive over the socket. Video is only fully supported on gateway v5 and above.

Gateway Versions
| Version | Status | Default |
|---------|--------|---------|
| 7 | Available | ✓ Client |
| 6 | Available | |
| 5 | Available | |
| 4 | Available | |
| 3 | Available | |
| 2 | Available | |
| 1 | Available | ✓ API |
Gateway Commands
| Name | Description |
|------|-------------|
| Identify | Start a new voice connection |
| Resume | Resume a dropped connection |
| Heartbeat | Maintain an active WebSocket connection |
| Select Protocol | Select the voice protocol and mode |
| Speaking | Indicate the user's speaking state |
| Voice Backend Version | Request the current voice backend version |
Gateway Events
| Name | Description |
|------|-------------|
| Hello | Defines the heartbeat interval |
| Heartbeat ACK | Acknowledges a received client heartbeat |
| Ready | Contains SSRC, IP/port, experiment, and encryption mode information |
| Resumed | Acknowledges a successful connection resume |
| Session Description | Acknowledges a successful Select Protocol and contains the information needed to send/receive voice data |
| Session Update | Client session description changed |
| Speaking | User speaking state updated |
| Client Disconnect | A user disconnected from voice |
| Voice Backend Version | Current voice backend version information, as requested by the client |

Connecting to Voice

Retrieving Voice Server Information

The first step in connecting to a voice server (and in turn, a guild's voice channel or private channel) is formulating a request that can be sent to the Gateway, which will return information about the voice server we will connect to. Because Discord's voice platform is widely distributed, users should never cache or save the results of this call. To inform the gateway of our intent to establish voice connectivity, we first send an Update Voice State payload.
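The Update Voice State payload is sent over the main gateway, not the voice gateway. As a sketch, a builder for that payload might look like the following; the field names (`guild_id`, `channel_id`, `self_mute`, `self_deaf`) follow the main Gateway's Opcode 4 Update Voice State and should be checked against that documentation:

```python
import json


def voice_state_update(guild_id, channel_id, self_mute=False, self_deaf=False):
    """Build a main-gateway Update Voice State payload (Gateway Opcode 4).

    Passing channel_id=None signals a disconnect from voice; the guild ID
    here reuses the example ID from the Identify payload below.
    """
    return json.dumps({
        "op": 4,
        "d": {
            "guild_id": guild_id,
            "channel_id": channel_id,
            "self_mute": self_mute,
            "self_deaf": self_deaf,
        },
    })
```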

If our request succeeds, the gateway will respond with two events—a Voice State Update event and a Voice Server Update event—meaning you must properly wait for both events before continuing. The first will contain a new key, session_id, and the second will provide the voice server information we can use to establish a new voice connection.

With this information, we can move on to establishing a voice WebSocket connection.

Establishing a Voice WebSocket Connection

Once we retrieve a session_id, token, and endpoint information, we can connect and handshake with the voice server over another secure WebSocket. Unlike the gateway endpoint we receive in an HTTP Get Gateway request, the endpoint received from our Voice Server Update payload does not contain a URL protocol, so some libraries may require manually prepending it with "wss://" before connecting. Once connected to the voice WebSocket endpoint, we can send an Opcode 0 Identify payload with our server_id, user_id, session_id, and token:

Identify Structure

| Field | Type | Description |
|-------|------|-------------|
| server_id | snowflake | The ID of the guild or private channel being connected to |
| user_id | snowflake | The ID of the current user |
| session_id | string | The session ID of the current session |
| token | string | The voice token for the current session |
| video? | boolean | Whether or not this connection supports video |
| streams? | array[stream object] | An array of video stream objects |

Example Identify

```json
{
  "op": 0,
  "d": {
    "server_id": "41771983423143937",
    "user_id": "104694319306248192",
    "session_id": "my_session_id",
    "token": "my_token",
    "video": true,
    "streams": []
  }
}
```

The voice server should respond with an Opcode 2 Ready payload, which informs us of the SSRC, connection IP/port, supported encryption modes, and experiments the voice server expects:

Ready Structure

| Field | Type | Description |
|-------|------|-------------|
| ssrc | integer | The SSRC of the user's voice connection |
| ip | string | The IP address of the voice server |
| port | integer | The port of the voice server |
| modes | array[string] | An array of supported voice encryption modes |
| experiments | array[string] | An array of available voice experiments |
| streams | array[stream object] | An array of available video streams |

Example Ready

```json
{
  "op": 2,
  "d": {
    "ssrc": 1,
    "ip": "127.0.0.1",
    "port": 1234,
    "modes": [
      "aead_aes256_gcm_rtpsize",
      "aead_aes256_gcm",
      "aead_xchacha20_poly1305_rtpsize",
      "xsalsa20_poly1305_lite_rtpsize",
      "xsalsa20_poly1305_lite",
      "xsalsa20_poly1305_suffix",
      "xsalsa20_poly1305"
    ],
    "experiments": ["fixed_keyframe_interval"],
    "streams": []
  }
}
```

Heartbeating

In order to maintain your WebSocket connection, you need to continuously send heartbeats at the interval determined in Opcode 8 Hello:

Hello Structure

| Field | Type | Description |
|-------|------|-------------|
| v? | integer | The voice gateway version |
| heartbeat_interval | integer | The minimum interval (in milliseconds) the client should heartbeat at |

Example Hello Before v3

```json
{
  "heartbeat_interval": 41250
}
```

Example Hello Since v3

```json
{
  "op": 8,
  "d": {
    "v": 7,
    "heartbeat_interval": 41250
  }
}
```

This is sent at the start of the connection. Be warned that the Opcode 8 Hello structure differs by gateway version, as shown in the examples above: versions below v3 do not have the op or d fields, while v3 and above were updated to be structured like other payloads. Be sure to expect the format that matches your version.

This heartbeat interval is the minimum interval you should heartbeat at. You can heartbeat at a faster interval if you wish. For example, the web client uses a heartbeat interval of min(heartbeat_interval, 5000) if the gateway version is v4 or above, and heartbeat_interval * 0.1 otherwise. The desktop client uses the provided heartbeat interval if the gateway version is v4 or above, and heartbeat_interval * 0.25 otherwise.

After receiving Opcode 8 Hello, you should send Opcode 3 Heartbeat—which contains an integer nonce—every elapsed interval:

The gateway may request a heartbeat from the client in some situations by sending an Opcode 3 Heartbeat. When this occurs, the client should immediately send an Opcode 3 Heartbeat without waiting the remainder of the current interval.

Example Heartbeat

```json
{
  "op": 3,
  "d": 1501184119561
}
```

In return, you will be sent back an Opcode 6 Heartbeat ACK that contains the previously sent nonce:

Example Heartbeat ACK

```json
{
  "op": 6,
  "d": 1501184119561
}
```
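The interval rules described above can be sketched as a small helper; the send mechanism is left out, and using the current time in milliseconds as the nonce is an assumption (any integer the client can match against the ACK works):

```python
import time


def effective_interval_ms(heartbeat_interval: int, version: int) -> float:
    """Interval the web client heartbeats at, per the rules above:
    min(heartbeat_interval, 5000) on v4+, heartbeat_interval * 0.1 below."""
    if version >= 4:
        return min(heartbeat_interval, 5000)
    return heartbeat_interval * 0.1


def heartbeat_payload() -> dict:
    # Opcode 3 Heartbeat carries an integer nonce; the current time in
    # milliseconds (as in the example above) is a common choice.
    return {"op": 3, "d": int(time.time() * 1000)}
```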

Establishing a Voice Connection

Once we receive the properties of a voice server from our Ready payload, we can proceed to the final step of voice connections, which entails establishing and handshaking a connection for voice data. First, we connect to the IP and port provided in the Ready payload. We then send an Opcode 1 Select Protocol with details about our connection:

Select Protocol Structure

| Field | Type | Description |
|-------|------|-------------|
| protocol | string | The voice protocol to use (udp or webrtc) |
| data? | protocol data \| string | The voice connection data or WebRTC SDP |
| rtc_connection_id? | string | The UUID4 RTC connection ID, used for analytics |
| codecs? | array[codec object] | The supported audio/video codecs |
| experiments? | array[string] | The received voice experiments to enable |

Protocol Data Structure

| Field | Type | Description |
|-------|------|-------------|
| address 1 | string | The discovered IP address of the client |
| port 1 | integer | The discovered UDP port of the client |
| mode | string | The encryption mode to use |

1 These fields are only used to receive voice data. If you do not care about receiving voice data, you can randomize these values.

Example Select Protocol

```json
{
  "op": 1,
  "d": {
    "protocol": "udp",
    "data": {
      "address": "127.0.0.1",
      "port": 1337,
      "mode": "xsalsa20_poly1305_lite"
    },
    "experiments": ["fixed_keyframe_interval"]
  }
}
```
Encryption Mode

| Value | Name | Nonce Bytes | Generating Nonce |
|-------|------|-------------|------------------|
| xsalsa20_poly1305 | XSalsa20 Poly1305 | The RTP header | Copy the RTP header |
| xsalsa20_poly1305_suffix | XSalsa20 Poly1305 (Suffix) | 24 bytes appended to the payload of the RTP packet | 24 random bytes |
| xsalsa20_poly1305_lite | XSalsa20 Poly1305 (Lite) | 4 bytes appended to the payload of the RTP packet | Incremental 4-byte (32-bit) int value |
| xsalsa20_poly1305_lite_rtpsize | XSalsa20 Poly1305 (Lite, RTP Size) | 4 bytes appended to the payload of the RTP packet | Incremental 4-byte (32-bit) int value |
| aead_aes256_gcm | AEAD AES256 GCM | 4 bytes appended to the payload of the RTP packet | Incremental 4-byte (32-bit) int value |
| aead_aes256_gcm_rtpsize | AEAD AES256 GCM (RTP Size) | 4 bytes appended to the payload of the RTP packet | Incremental 4-byte (32-bit) int value |
| aead_xchacha20_poly1305_rtpsize | AEAD XChaCha20 Poly1305 (RTP Size) | 4 bytes appended to the payload of the RTP packet | Incremental 4-byte (32-bit) int value |
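For the incremental-nonce modes, the table above only specifies a 32-bit incremental value; a sketch of maintaining that counter follows. The big-endian serialization and the null-padding of the 4-byte suffix up to the cipher's full nonce length are assumptions common to existing client implementations, not guarantees from this document:

```python
import struct


class IncrementalNonce:
    """32-bit incrementing nonce for the *_lite / *_rtpsize modes.

    The 4-byte value is appended to the encrypted payload of the RTP
    packet, per the encryption mode table.
    """

    def __init__(self) -> None:
        self._counter = 0

    def next(self) -> bytes:
        value = struct.pack(">I", self._counter)
        self._counter = (self._counter + 1) & 0xFFFFFFFF  # wrap at 2**32
        return value


def pad_nonce(suffix: bytes, nonce_len: int = 24) -> bytes:
    # Full nonce handed to the cipher: the 4-byte counter padded with
    # null bytes to the cipher's nonce length (24 for XSalsa20).
    return suffix + b"\x00" * (nonce_len - len(suffix))
```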

Finally, the voice server will respond with an Opcode 4 Session Description that includes the mode and secret_key, a 32 byte array used for sending and receiving voice data:

Session Description Structure

| Field | Type | Description |
|-------|------|-------------|
| audio_codec | string | The audio codec to use |
| video_codec | string | The video codec to use |
| media_session_id | string | The media session ID, used for analytics |
| mode? | string | The encryption mode to use, not applicable to WebRTC |
| secret_key? | array[integer] | The 32 byte secret key used for encryption, not applicable to WebRTC |
| sdp? | string | The WebRTC session description protocol |
| keyframe_interval? | integer | The keyframe interval in milliseconds |

Example Session Description

```json
{
  "op": 4,
  "d": {
    "audio_codec": "opus",
    "media_session_id": "89f1d62f166b948746f7646713d39dbb",
    "mode": "xsalsa20_poly1305_lite",
    "secret_key": [ ...251, 100, 11...],
    "video_codec": "H264"
  }
}
```

Sometimes, the voice server will later send an Opcode 14 Session Update to indicate an update to the voice session:

Session Update Structure

| Field | Type | Description |
|-------|------|-------------|
| audio_codec? | string | The new audio codec to use |
| video_codec? | string | The new video codec to use |
| media_session_id? | string | The new media session ID, used for analytics |

We can now start encrypting/decrypting and sending/receiving voice data over the previously established connection.

UDP Connections

UDP is the most likely protocol that clients will use. First, we open a UDP connection to the IP and port provided in the Ready payload. If required, we can now perform an IP Discovery using this connection. Once we've fully discovered our external IP and UDP port, we can then tell the voice WebSocket what it is by sending a Select Protocol as outlined above, and receive our Session Description to begin sending/receiving voice data.

IP Discovery

Generally, routers on the Internet mask or obfuscate UDP ports through a process called NAT. Most users who implement voice will want to utilize IP discovery to find their external IP and port, which will then be used for receiving voice communications. To retrieve your external IP and port, send the following UDP packet to your voice port (all numeric fields are big-endian):

| Field | Description | Size |
|-------|-------------|------|
| Type | Values 0x1 and 0x2 indicate request and response, respectively | 2 bytes |
| Length | Message length excluding Type and Length fields (value 70) | 2 bytes |
| SSRC | Unsigned integer | 4 bytes |
| Address | Null-terminated string in response | 64 bytes |
| Port | Unsigned short | 2 bytes |
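The discovery packet above is 74 bytes in total (2 + 2 + 4 + 64 + 2). As a sketch, it can be built and parsed with the stdlib `struct` module using the big-endian layout from the table:

```python
import struct

# >HHI64sH = Type (2) + Length (2) + SSRC (4) + Address (64) + Port (2)
DISCOVERY_FORMAT = ">HHI64sH"


def build_discovery_request(ssrc: int) -> bytes:
    """Build the 74-byte IP discovery request (Type 0x1, Length 70)."""
    return struct.pack(DISCOVERY_FORMAT, 0x1, 70, ssrc, b"", 0)


def parse_discovery_response(packet: bytes) -> tuple[str, int]:
    """Extract the external address and port from a Type 0x2 response."""
    p_type, length, _ssrc, address, port = struct.unpack(DISCOVERY_FORMAT, packet)
    assert p_type == 0x2 and length == 70
    # Address is a null-terminated string inside a fixed 64-byte field
    return address.split(b"\x00", 1)[0].decode("ascii"), port
```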
Sending and Receiving Voice

Voice data sent to and received from Discord should be encoded with Opus, using two channels (stereo) and a sample rate of 48 kHz. Voice data is sent as an RTP header, followed by encrypted Opus audio data. Voice encryption uses the key passed in Session Description and the nonce formed from the 12-byte RTP header appended with 12 null bytes, where the mode requires it. Discord encrypts with the libsodium encryption library.

When receiving data, the user who sent the voice packet is identified by caching the SSRC and user IDs received from Speaking events. At least one Speaking event for the user is received before any voice data is received, so the user ID should always be available.

Voice Packet Structure

| Field | Type | Size |
|-------|------|------|
| Version + Flags | Single byte value of 0x80 | 1 byte |
| Payload Type | Single byte value of 0x78 | 1 byte |
| Sequence | Unsigned short (big endian) | 2 bytes |
| Timestamp | Unsigned integer (big endian) | 4 bytes |
| SSRC | Unsigned integer (big endian) | 4 bytes |
| Encrypted Audio | Binary data | n bytes |
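The 12-byte header above can be packed directly; for the plain xsalsa20_poly1305 mode, the 24-byte nonce is that header followed by 12 null bytes. A sketch (the actual libsodium secretbox call is omitted):

```python
import struct

RTP_VERSION_FLAGS = 0x80  # Version + Flags byte from the table
RTP_PAYLOAD_TYPE = 0x78   # Payload Type byte from the table


def rtp_header(sequence: int, timestamp: int, ssrc: int) -> bytes:
    # 12 bytes total: 1 + 1 + 2 + 4 + 4, numeric fields big-endian
    return struct.pack(">BBHII", RTP_VERSION_FLAGS, RTP_PAYLOAD_TYPE,
                       sequence, timestamp, ssrc)


def xsalsa20_poly1305_nonce(header: bytes) -> bytes:
    # 24-byte nonce: the RTP header appended with 12 null bytes
    return header + b"\x00" * 12
```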

WebRTC Connections

WebRTC allows for direct peer-to-peer voice connections, and is most commonly used in browsers. To use WebRTC, you must first send a Select Protocol payload as outlined above, with the protocol field set to webrtc, and data set to the client's WebRTC SDP. The voice server will respond with a Session Description payload, with the sdp field set to the server's WebRTC SDP. The client can then use this SDP to establish a WebRTC connection.

Speaking

To notify the voice server that you are speaking or have stopped speaking, send an Opcode 5 Speaking payload:

Speaking Structure
FieldTypeDescription
speaking 1integerThe speaking flags
ssrcintegerThe SSRC of the speaking user
user_id 2snowflakeThe user ID of the speaking user
delay? 3integerThe speaking packet delay

1 For gateway v4 and below, this field is a boolean.

2 Only sent by the voice server.

3 Not sent by the voice server.

Example Speaking (Send)

```json
{
  "op": 5,
  "d": {
    "speaking": 5,
    "delay": 0,
    "ssrc": 1
  }
}
```

When a different user's speaking state is updated, the voice server will send an Opcode 5 Speaking payload:

Example Speaking (Receive)

```json
{
  "op": 5,
  "d": {
    "speaking": 5,
    "ssrc": 2,
    "user_id": "852892297661906993"
  }
}
```
Speaking Flags

The following flags can be used as a bitwise mask. For example, a value of 5 (1 << 0 | 1 << 2) would be microphone and priority.

| Value | Name | Description |
|-------|------|-------------|
| 1 << 0 | Microphone | Normal transmission of voice audio |
| 1 << 1 | Soundshare | Transmission of context audio for video, no speaking indicator |
| 1 << 2 | Priority | Priority speaker, lowering audio of other speakers |
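The bitmask arithmetic above can be sketched as constants plus a small decoder (the constant and function names here are illustrative, not part of the protocol):

```python
# Speaking flag values from the table above
MICROPHONE = 1 << 0  # 1
SOUNDSHARE = 1 << 1  # 2
PRIORITY = 1 << 2    # 4


def decode_speaking(flags: int) -> list[str]:
    """Return the names of all flags set in a speaking value."""
    names = {MICROPHONE: "microphone", SOUNDSHARE: "soundshare",
             PRIORITY: "priority"}
    return [name for bit, name in names.items() if flags & bit]
```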

Voice Data Interpolation

When there's a break in the sent data, the packet transmission shouldn't simply stop. Instead, send five frames of silence (0xF8, 0xFF, 0xFE) before stopping to avoid unintended Opus interpolation with subsequent transmissions.

Likewise, when you receive these five frames of silence, you know that the user has stopped speaking.
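On the receive side, end of speech can be detected by counting trailing silence frames (the 3-byte sequence above); a minimal sketch:

```python
OPUS_SILENCE = b"\xF8\xFF\xFE"  # the 3-byte Opus silence frame


def trailing_silence_frames(frames: list[bytes]) -> int:
    """Count consecutive silence frames at the end of a frame list."""
    count = 0
    for frame in reversed(frames):
        if frame != OPUS_SILENCE:
            break
        count += 1
    return count


def end_of_speech(frames: list[bytes]) -> bool:
    # A sender appends five silence frames before stopping transmission
    return trailing_silence_frames(frames) >= 5
```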

Resuming Voice Connection

When your client detects that its connection has been severed, it should open a new WebSocket connection. Once the new connection has been opened, your client should send an Opcode 7 Resume payload:

Resume Structure

| Field | Type | Description |
|-------|------|-------------|
| server_id | snowflake | The ID of the guild or private channel being connected to |
| session_id | string | The session ID of the current session |
| token | string | The voice token for the current session |

Example Resume

```json
{
  "op": 7,
  "d": {
    "server_id": "41771983423143937",
    "session_id": "my_session_id",
    "token": "my_token"
  }
}
```

If successful, the voice server will respond with an Opcode 9 Resumed to signal that your client is now resumed:

Example Resumed

```json
{
  "op": 9,
  "d": null
}
```

If the resume is unsuccessful—for example, due to an invalid session—the WebSocket connection will close with the appropriate close code. You should then follow the Connecting flow to reconnect.

Other Client Disconnection

When a user disconnects from voice, the voice server will send an Opcode 13 Client Disconnect payload:

When received, the SSRC of the user should be discarded.

Client Disconnect Structure

| Field | Type | Description |
|-------|------|-------------|
| user_id | snowflake | The ID of the user that disconnected |

Example Client Disconnect

```json
{
  "op": 13,
  "d": {
    "user_id": "852892297661906993"
  }
}
```

Voice Backend Version

For analytics, the client may want to receive information about the voice backend's current version. This is only available on gateway v6 and above. To do so, send an Opcode 16 Voice Backend Version with an empty payload:

Voice Backend Version Structure

| Field | Type | Description |
|-------|------|-------------|
| voice | string | The voice backend's version |
| rtc_worker | string | The WebRTC worker's version |

Example Voice Backend Version (Send)

```json
{
  "op": 16,
  "d": {}
}
```

In response, the voice server will send an Opcode 16 Voice Backend Version payload with the versions:

Example Voice Backend Version (Receive)

```json
{
  "op": 16,
  "d": {
    "voice": "0.9.1",
    "rtc_worker": "0.3.35"
  }
}
```