
Trying end-to-end encryption with the Insertable Streams API in a SkyWay SFU Room

Last updated at 2020-11-12, posted at 2020-10-26

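This article is about SkyWay, but the same approach may also be applicable to the SDKs of other WebRTC platforms.

About the Insertable Streams API

The Insertable Streams API lets you access the encoded media frames that are sent and received over WebRTC.

It has many potential use cases, such as end-to-end encryption (E2EE) for multi-party calls through an SFU, attaching arbitrary data to media frames so that it reaches the other side in sync with the media, and analyzing media frames. Personally, it is the API I recommend most in the WebRTC world right now.

Here is an introductory article I recommend.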

Browser support

Chrome

According to Chrome Platform Status as of 2020-10-25, the API is available from the following platforms and versions.

Chrome for desktop release 86
Chrome for Android release 86

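M86 was released to the stable channel on 2020-10-06, so the API can be used as is.

Other browsers

Again quoting the page above, both Firefox and Safari appear to be positive about implementing it.
There is no signal for Edge, but since Edge is now Chromium-based, I expect it will almost certainly get the API too.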

Consensus & Standardization

Firefox: Positive
Edge: No signal
Safari: Positive

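Trying it out

If you just want to see it working quickly, the WebRTC samples are a good place to start. Two samples were listed at the time of writing.

Implementing a simple E2EE video chat with the Insertable Streams API in SkyWay's SFURoom

In this article I combine E2EE, which will probably be the most common use case for the Insertable Streams API, with SFURoom.

Please note that, at the time of writing, this requires a small modification to the SkyWay JavaScript SDK.

You can also use MediaConnection in the same JavaScript SDK to send data in sync with video and audio via the Insertable Streams API. If you are interested, see "SkyWayの公式チュートリアルでInsertable Streamsを試してみた" (Trying Insertable Streams with the official SkyWay tutorial).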

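Demo

First, here is a video showing what you get by following this article. In the video, a three-party call is established using a shared key agreed on in advance; then the participants change their keys one at a time, and you can see the video become garbled.

As the name says, E2EE encrypts media from one endpoint to the other. An SFU server normally removes the WebRTC encryption (SRTP), fans the media out to the other participants, and re-encrypts it before sending. Adding a second layer of encryption with the Insertable Streams API has the advantage that the content does not leak even when the SFU server removes the WebRTC encryption.

Modifying the SkyWay JavaScript SDK

Using the Insertable Streams API requires access to the RTCPeerConnection object, so I add an API that returns it.

Code to add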

  getSFUPeerConnection() {
    if (!this._connectionStarted) {
      return null;
    }
    return this._negotiator._pc;
  }

Where to add it
- https://github.com/skyway/skyway-js-sdk/blob/master/src/peer/sfuRoom.js

After adding the code, build the SDK by following the README.

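Preparing the encode/decode mechanism

This is not the core of the article, so I will skip the details. I use the WebWorker-based engine from the WebRTC samples "Peer connection end to end encryption" demo.

The "encryption" merely XORs the data with the key, so it should be treated strictly as a sample.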
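To illustrate why XOR with a shared key is reversible but weak, here is a minimal sketch. It is not the actual worker.js code: the function name `xorWithKey` and the sample values are assumptions made for illustration only, and nothing here is real cryptography.

```javascript
// Hypothetical helper (not from worker.js): XOR every payload byte with
// a character of the key, cycling through the key.
function xorWithKey(payload, key) {
  const out = new Uint8Array(payload.length);
  if (key.length === 0) {
    out.set(payload); // an empty key leaves the data unchanged
    return out;
  }
  for (let i = 0; i < payload.length; i++) {
    // Only the low 8 bits of the char code end up in the Uint8Array.
    out[i] = payload[i] ^ key.charCodeAt(i % key.length);
  }
  return out;
}

const frame = new Uint8Array([1, 2, 3, 4, 5]);
const scrambled = xorWithKey(frame, 'secret');
const restored = xorWithKey(scrambled, 'secret'); // same key: original bytes
const garbled = xorWithKey(scrambled, 'other');   // wrong key: still scrambled
console.log(Array.from(restored), Array.from(garbled));
```

Applying the same operation twice with the same key returns the original bytes, which is why one function can serve for both directions in a sketch like this.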

I made one modification in order to use it.

  } else if (operation === 'setCryptoKey') {
    if (event.data.currentCryptoKey !== currentCryptoKey) {
      // Comment this line out
      //currentKeyIdentifier++;
    }
    currentCryptoKey = event.data.currentCryptoKey;
    useCryptoOffset = event.data.useCryptoOffset;
  }

Source code
- https://github.com/webrtc/samples/blob/gh-pages/src/content/peerconnection/endtoend-encryption/js/worker.js

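Building an E2EE-capable multi-party video chat app based on the official sample

I start from the Room sample and modify it.

The full source code is available in a gist for reference.
Below I walk through the parts I changed.

Checking browser support

Check whether createEncodedStreams exists on RTCRtpSender (assuming that RTCRtpReceiver has it as well).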

const supportsInsertableStreams = !!RTCRtpSender.prototype.createEncodedStreams;

In the first version of the spec the API was called createEncodedVideoStreams, but that has already been deprecated.
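If you want to tell the current method apart from the deprecated one, the check can be wrapped in a small function. The following is only a sketch: `detectEncodedStreamsMethod` is a hypothetical helper name, and it takes the prototype as an argument so that it can also be exercised outside a browser.

```javascript
// Hypothetical helper: returns the name of the available factory method on
// the given RTCRtpSender-like prototype, or null if neither exists.
function detectEncodedStreamsMethod(senderPrototype) {
  if (!senderPrototype) {
    return null;
  }
  if (typeof senderPrototype.createEncodedStreams === 'function') {
    return 'createEncodedStreams';
  }
  if (typeof senderPrototype.createEncodedVideoStreams === 'function') {
    return 'createEncodedVideoStreams'; // deprecated name from the first spec
  }
  return null;
}

// In a browser this inspects the real prototype; elsewhere it yields null.
const senderPrototype =
  typeof RTCRtpSender !== 'undefined' ? RTCRtpSender.prototype : undefined;
console.log(detectEncodedStreamsMethod(senderPrototype));
```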

Enabling Insertable Streams

You need to pass the encodedInsertableStreams option when constructing the RTCPeerConnection object.
In SkyWay, you enable it by specifying it in config when calling new Peer.

const peer = (window.peer = new Peer({
    key: "API key",
    debug: 3,
    config: {
        encodedInsertableStreams: true
    },
}));

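Initializing the worker and setting the key

Initialize a worker for encoding and decoding the media data.
I chose to do this when the Room Join button is clicked. After initialization, setCryptoKey() is called to set the CryptoKey used for encoding and decoding. As a side note, to make the demo easier to follow, I follow the WebRTC samples and register an event listener so that the key is applied every time it is changed.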

joinTrigger.addEventListener('click', () => {
    if (!peer.open) {
        return;
    }

    const worker = new Worker('./libs/worker.js', {name: 'E2EE worker'});
    setCryptoKey();
    cryptoKey.addEventListener('change', setCryptoKey);

    // (omitted)

});

The function that sets the CryptoKey is shown below. I provide a UI for entering the key, pass the entered value to the worker, and run the setCryptoKey operation.

const cryptoKey = document.getElementById('crypto-key');

// (omitted)

function setCryptoKey() {
    messages.textContent += `=== CryptoKey is ${cryptoKey.value} ===\n`;
    currentCryptoKey = cryptoKey.value;
    const useCryptoOffset = true;
    worker.postMessage({
        operation: 'setCryptoKey',
        currentCryptoKey,
        useCryptoOffset,
    });
}

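worker.js has a comment about useCryptoOffset. It appears to be an option that leaves the control information at the start of each frame unencrypted, out of consideration for the decoder. Since SkyWay's SFU uses VP8 for video and Opus for audio, I enable this setting by default.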

// If using crypto offset (controlled by a checkbox):
// Do not encrypt the first couple of bytes of the payload. This allows
// a middle to determine video keyframes or the opus mode being used.
// For VP8 this is the content described in
//   https://tools.ietf.org/html/rfc6386#section-9.1
// which is 10 bytes for key frames and 3 bytes for delta frames.
// For opus (where encodedFrame.type is not set) this is the TOC byte from
//   https://tools.ietf.org/html/rfc6716#section-3.1
//
// It makes the (encrypted) video and audio much more fun to watch and listen to
// as the decoder does not immediately throw a fatal error.
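The effect of the offset can be sketched as follows. The byte counts (10 for key frames, 3 for delta frames, 1 TOC byte for Opus) come from the comment above; the names `CLEAR_BYTES`, `clearPrefixLength`, and `scrambleWithOffset` are hypothetical and this is not the code from worker.js.

```javascript
// Leading bytes left unencrypted per frame type (from the quoted comment).
// Audio frames have no `type`, so they are looked up under 'audio'.
const CLEAR_BYTES = { key: 10, delta: 3, audio: 1 };

function clearPrefixLength(frameType, useCryptoOffset) {
  if (!useCryptoOffset) {
    return 0;
  }
  const kind = frameType === undefined ? 'audio' : frameType;
  return CLEAR_BYTES[kind] ?? 0;
}

// XOR everything after the clear prefix with the (non-empty) key.
function scrambleWithOffset(payload, key, frameType, useCryptoOffset) {
  const out = new Uint8Array(payload); // copy, the input is not modified
  if (key.length === 0) {
    return out;
  }
  const offset = Math.min(clearPrefixLength(frameType, useCryptoOffset), out.length);
  for (let i = offset; i < out.length; i++) {
    out[i] ^= key.charCodeAt(i % key.length);
  }
  return out;
}

const deltaFrame = new Uint8Array([9, 9, 9, 9, 9]);
console.log(Array.from(scrambleWithOffset(deltaFrame, 'k', 'delta', true)));
```

With the offset enabled, the first three bytes of a delta frame stay readable, so a decoder can still parse the frame header even when the rest of the payload is scrambled.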

Encoding your own video and audio

Encoding of your own video and audio is triggered by the room's open event.

room.once('open', () => {
    messages.textContent += '=== You joined ===\n';

    if (supportsInsertableStreams){
        setTimeout(setLocalStreamEncoder, 1000);
    }

});

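Because the SkyWay SDK hides the raw WebRTC API, timing matters in this modification. When the room's open event fires, the WebRTC connection to SkyWay's SFU server has not been established yet, so getSFUPeerConnection() still returns null and the getSenders() call used for encoding fails with the error below. To work around this, I delay execution by 1000 ms.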

Uncaught TypeError: Cannot read property 'getSenders' of null
    at setLocalStreamEncoder (script.js:177)
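A fixed 1000 ms delay can be too short or needlessly long depending on the network. As an alternative, here is a sketch that polls until the connection object becomes available. `waitForPeerConnection` and its options are hypothetical names, not part of the SkyWay SDK; the only SDK-side assumption is the getSFUPeerConnection() method added earlier, which returns null until the connection has started.

```javascript
// Hypothetical helper: polls getPc() until it returns a non-null value,
// and rejects if nothing shows up within timeoutMs.
function waitForPeerConnection(getPc, { intervalMs = 100, timeoutMs = 5000 } = {}) {
  return new Promise((resolve, reject) => {
    const startedAt = Date.now();
    const check = () => {
      const pc = getPc();
      if (pc) {
        resolve(pc);
        return;
      }
      if (Date.now() - startedAt >= timeoutMs) {
        reject(new Error('Timed out waiting for the RTCPeerConnection'));
        return;
      }
      setTimeout(check, intervalMs);
    };
    check();
  });
}

// Possible usage inside the open handler (assumes `room` and
// setLocalStreamEncoder from this article):
//   waitForPeerConnection(() => room.getSFUPeerConnection())
//     .then(setLocalStreamEncoder)
//     .catch(console.error);
```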

The contents of setLocalStreamEncoder are shown below.
It uses the getSFUPeerConnection method added to the SDK earlier to get the RTCPeerConnection object and calls getSenders(). There are separate senders for video and audio, so forEach is used to process all of them. As for the encoding itself, createEncodedStreams() is called to obtain the interfaces for reading and writing encoded media frames (senderStreams.readableStream and senderStreams.writableStream), which are then handed to the encode operation in worker.js.

function setLocalStreamEncoder(){
    const pc = room.getSFUPeerConnection();
    pc.getSenders().forEach(sender => {
        const senderStreams = sender.createEncodedStreams();
        // Transfer both streams to the worker, which runs the encode transform
        worker.postMessage({
            operation: 'encode',
            readableStream: senderStreams.readableStream,
            writableStream: senderStreams.writableStream,
        }, [senderStreams.readableStream, senderStreams.writableStream]);
    });
}

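In worker.js, a TransformStream is created with the encoding function registered as its handler. The readableStream is then chained through it with pipeThrough, and finally connected to the writableStream with pipeTo. With this in place, the encoded media frames are what gets sent to the other side.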


  if (operation === 'encode') {
    const {readableStream, writableStream} = event.data;
    const transformStream = new TransformStream({
      transform: encodeFunction,
    });
    readableStream
        .pipeThrough(new TransformStream({
          transform: polyFillEncodedFrameMetadata, // M83 polyfill.
        }))
        .pipeThrough(transformStream)
        .pipeTo(writableStream);

A note on polyFillEncodedFrameMetadata: encodedFrame.getMetadata() was not implemented up to M83, so a polyfill appears to have been provided. In M86 it is implemented as shown below, so you can ignore it.

encodedFrame.getMetadata()
> dependencies: []
> frameId: 1
> height: 480
> spatialIndex: 0
> synchronizationSource: 653435377
> temporalIndex: 0
> width: 640
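The pipeThrough/pipeTo chaining used in worker.js can be tried without a camera or a peer connection. The sketch below feeds mock frames (plain objects with an ArrayBuffer in `data`, an assumption standing in for real encoded frames) through a chain of TransformStreams. It assumes a runtime that exposes the WHATWG Streams classes as globals, such as a current browser or Node.js 18 or later; `xorTransform` and `runPipeline` are hypothetical names.

```javascript
// Builds a TransformStream that XORs every byte of frame.data with keyByte.
function xorTransform(keyByte) {
  return new TransformStream({
    transform(frame, controller) {
      const bytes = new Uint8Array(frame.data); // view on the frame's buffer
      for (let i = 0; i < bytes.length; i++) {
        bytes[i] ^= keyByte;
      }
      frame.data = bytes.buffer;
      controller.enqueue(frame); // pass the frame on to the next stage
    },
  });
}

// Pushes the mock frames through all transforms and collects the results.
async function runPipeline(frames, transforms) {
  const received = [];
  let stream = new ReadableStream({
    start(controller) {
      frames.forEach(frame => controller.enqueue(frame));
      controller.close();
    },
  });
  for (const transform of transforms) {
    stream = stream.pipeThrough(transform);
  }
  await stream.pipeTo(new WritableStream({
    write(frame) {
      received.push(Array.from(new Uint8Array(frame.data)));
    },
  }));
  return received;
}

const mockFrames = [{ data: Uint8Array.from([1, 2, 3]).buffer }];
runPipeline(mockFrames, [xorTransform(0x5a), xorTransform(0x5a)])
  .then(result => console.log(result));
```

Chaining the same transform twice stands in for the sender and receiver sides using the same key; with a single transform the collected bytes differ from the input.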

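Decoding the remote video and audio

Decoding of the remote video and audio is triggered by the room's stream event. The stream event fires once WebRTC communication with the SFU server is established and a MediaStreamTrack has been received, so no delayed execution is needed.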

room.on('stream', async stream => {
    if (supportsInsertableStreams){
        setRemoteStreamDecoder();
    }

   // (omitted)

});

The contents of setRemoteStreamDecoder are shown below.
The decoding through worker.js is basically the same as the encoding described above, so I will skip it. Note that to work with received media frames you call getReceivers() to obtain the receivers.

function setRemoteStreamDecoder(){
    const pc = room.getSFUPeerConnection();
    pc.getReceivers().forEach(receiver => {
        try {
            const receiverStreams = receiver.createEncodedStreams();
            // Transfer both streams to the worker, which runs the decode transform
            worker.postMessage({
                operation: 'decode',
                readableStream: receiverStreams.readableStream,
                writableStream: receiverStreams.writableStream,
            }, [receiverStreams.readableStream, receiverStreams.writableStream]);
        } catch (error) {
            // TODO: createEncodedStreams() throws for receivers that were
            // already set up on an earlier stream event; ignored here.
        }
    });
}

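The point to note here is the try/catch around the processing. Or rather, it is where I cut corners in this sample, sorry. The Room's stream event fires every time a participant joins the Room. Two receivers are created per participant (when both video and audio are present), and forEach visits all of them every time. In the browser implementation, calling createEncodedStreams again on a receiver for which it has already been called appears to throw the following exception.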

DOMException: Failed to execute 'createEncodedStreams' on 'RTCRtpReceiver': Encoded video streams already created

For now, it seems developers need to keep track of which receivers have already been set up.

This was also being discussed in an issue on the W3C spec repository.

Add an API to know if createEncoded{Audio,Video}Streams was called #13
https://github.com/w3c/webrtc-insertable-streams/issues/13
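One way to keep track of receivers yourself, instead of swallowing the exception, is to remember them in a WeakSet. This is only a sketch under that assumption: `processedReceivers` and `attachDecoderOnce` are hypothetical names, and the callback is where the createEncodedStreams() and worker.postMessage() calls from setRemoteStreamDecoder would go.

```javascript
// Receivers that already had their encoded streams created. A WeakSet does
// not keep receivers alive once the browser discards them.
const processedReceivers = new WeakSet();

// Calls attach(receiver) only for receivers that have not been seen before
// and returns how many receivers were newly attached.
function attachDecoderOnce(receivers, attach) {
  let attached = 0;
  for (const receiver of receivers) {
    if (processedReceivers.has(receiver)) {
      continue; // createEncodedStreams() was already called on this one
    }
    attach(receiver);
    processedReceivers.add(receiver); // mark only after attach succeeded
    attached++;
  }
  return attached;
}

// Possible usage (assumes `room` and `worker` from this article):
//   attachDecoderOnce(room.getSFUPeerConnection().getReceivers(), receiver => {
//     const streams = receiver.createEncodedStreams();
//     worker.postMessage({
//       operation: 'decode',
//       readableStream: streams.readableStream,
//       writableStream: streams.writableStream,
//     }, [streams.readableStream, streams.writableStream]);
//   });
```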

That concludes the walkthrough of the changes to the Room sample.

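Closing

SDKs like SkyWay that hide the raw WebRTC API are easy to use, but I think that makes it harder to implement features that need the raw WebRTC API, not just the Insertable Streams API. If you are interested, try implementing it using this article as a reference.

Articles and repositories I referred to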

https://www.chromestatus.com/features/schedule
https://w3c.github.io/webrtc-insertable-streams/
https://github.com/w3c/webrtc-insertable-streams
https://github.com/webrtc/samples/tree/gh-pages/src/content/peerconnection/endtoend-encryption
https://qiita.com/massie_g/items/2b0b6d4f61f1865b4da5
