Introduction
I'm not sure how much demand there is for this, but I wrote a rough explanation of Robotics Transformer (RT-1) in English, so I'm posting it here. (It was originally written for a friend.)
The explanation itself is quite simplified, so it may differ slightly from the details of the paper. Please treat it as something to skim through.
What is RT-1?
RT-1 is a robot control model that performs tasks in real time from a sequence of images and a natural language instruction.
What’s cool about this model?
- Able to perform over 700 tasks with a 97% success rate
- Can operate in real time at 3 Hz
Original Paper
Video
Components of the architecture
The RT-1 architecture consists of the following components:
- Universal Sentence Encoder
- EfficientNet
- Feature-wise Linear Modulation (FiLM)
- TokenLearner
- Positional Encoding
- Transformer
Universal Sentence Encoder
We first want a way to encode the natural language instruction, so RT-1 adopts the Universal Sentence Encoder to do this:
This paper proposes two primary methods for sentence embedding:
- Transformer based
- Deep Average Network
What's good about this encoder?
- Different languages are mapped to the same vector space
- Strong accuracy on the STS Benchmark
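As a quick illustration (not taken from the RT-1 code), here is how the publicly released Universal Sentence Encoder on TensorFlow Hub can be used to embed instructions; the module URL and the 512-dimensional output are properties of that public module:

```python
# Minimal usage sketch: embedding instructions with the Universal Sentence
# Encoder from TensorFlow Hub (not the RT-1 codebase itself).
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
vectors = embed(["pick up the apple", "place the can in the drawer"])
print(vectors.shape)  # (2, 512): each instruction becomes a single 512-d vector
```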
EfficientNet
Now, we want to extract features from the input images, so we use a CNN architecture. RT-1 uses EfficientNet trained on ImageNet.
EfficientNet:
ImageNet:
Simply speaking, EfficientNet is just one of the best CNN architectures, with a relatively simple structure:
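As a hedged sketch, here is one way to grab an ImageNet-pretrained EfficientNet in PyTorch and pull out its convolutional feature map; torchvision is just an assumption for illustration (RT-1 itself uses EfficientNet-B3 as its image backbone):

```python
# Sketch: load an ImageNet-pretrained EfficientNet-B3 and extract features.
# torchvision is an assumption here; RT-1 has its own implementation.
import torch
import torchvision

backbone = torchvision.models.efficientnet_b3(
    weights=torchvision.models.EfficientNet_B3_Weights.IMAGENET1K_V1
)
backbone.eval()

with torch.no_grad():
    # `features` is the convolutional part of the network (no classification head)
    feature_map = backbone.features(torch.randn(1, 3, 300, 300))
print(feature_map.shape)  # a (batch, channels, H, W) spatial feature map
```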
Feature-wise Linear Modulation (FiLM)
RT-1 keeps the model efficient by using FiLM to combine natural language instructions and visual features.
FiLM is a convenient layer that learns to transform feature maps so that they incorporate external information. In RT-1, FiLM is used to keep only the visual features that are closely related to the natural language instruction.
To be more specific, FiLM learns two functions that output values ($\gamma$, $\beta$), which define the affine transformation (collinearity and distance ratios are preserved) applied to an activation:
$$\mathrm{FiLM}(\mathbf{F}_{i,c} \mid \gamma_{i,c}, \beta_{i,c}) = \gamma_{i,c} \odot \mathbf{F}_{i,c} + \beta_{i,c}$$
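To make this concrete, here is a heavily simplified FiLM layer sketch in PyTorch; the shapes and layer sizes are assumptions for illustration, not the ones used in RT-1:

```python
# Sketch of a FiLM layer: the language embedding predicts per-channel
# gamma and beta, which scale and shift the image features channel-wise.
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, text_dim: int, num_channels: int):
        super().__init__()
        self.to_gamma = nn.Linear(text_dim, num_channels)
        self.to_beta = nn.Linear(text_dim, num_channels)

    def forward(self, features: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # features: (B, C, H, W) visual features, text: (B, text_dim) instruction embedding
        gamma = self.to_gamma(text)[:, :, None, None]  # (B, C, 1, 1)
        beta = self.to_beta(text)[:, :, None, None]    # (B, C, 1, 1)
        return gamma * features + beta                 # feature-wise affine transform

film = FiLM(text_dim=512, num_channels=48)             # sizes are made up for the example
out = film(torch.randn(2, 48, 30, 30), torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 48, 30, 30])
```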
TokenLearner
RT-1 makes the model even more efficient by reducing the tokens to only the significant ones. This is done using a TokenLearner:
TokenLearner learns to compute weight maps that are multiplied element-wise with the original input to extract only the important tokens:
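A rough TokenLearner-style sketch is below; the real module scores positions with a small conv/MLP block (a single 1x1 convolution here is a simplification). In RT-1 this step reduces each image's 81 tokens down to 8:

```python
# Sketch of TokenLearner-style pooling: learn S spatial weight maps and
# pool the feature map into S tokens (simplified scoring head).
import torch
import torch.nn as nn

class TokenLearner(nn.Module):
    def __init__(self, num_channels: int, num_tokens: int = 8):
        super().__init__()
        self.score = nn.Conv2d(num_channels, num_tokens, kernel_size=1)  # one map per token

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (B, C, H, W) -> tokens: (B, S, C)
        weights = self.score(features).flatten(2).softmax(dim=-1)  # (B, S, H*W) weight maps
        flat = features.flatten(2)                                 # (B, C, H*W)
        return torch.einsum("bsn,bcn->bsc", weights, flat)         # weighted spatial pooling

tokens = TokenLearner(num_channels=512, num_tokens=8)(torch.randn(2, 512, 9, 9))
print(tokens.shape)  # torch.Size([2, 8, 512])
```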
Positional Encoding
Positional encoding adds positional information to the entities in a sequence. This is done simply by adding positional vectors to the original vector sequence (e.g. a different positional vector is added to every single word vector in a sentence). Unique position vectors are created using a combination of sine and cosine. Positional encoding is always applied before feeding the sequence into a transformer.
Why sine and cosine?
This is my intuition behind this:
First of all, sine and cosine are used because they are a simple way to generate values between -1 and 1. (This is important because the original vectors (e.g. word vectors) are normalized, so we can't add vectors with values that are too large.)
Can’t we just use sine then?
A: The most intuitive method would be to use a sine function with a very low frequency (long wavelength), because a sine function is periodic (we don't want the same values repeating, since we want unique vectors). However, this creates a problem where values generated for early positions don't differ much from each other (the sine wave is essentially flat near its trough). So, adding cosine lets the model capture a different phase of the wave at each position, which makes it easier to distinguish positions clearly.
This is an example of a positional encoding matrix (just the positional vectors stacked row by row). Each row represents a positional vector here, so it's pretty clear that unique vectors are being generated:
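Here is a short NumPy sketch of the standard sinusoidal encoding from the original Transformer paper (implementations may differ in details): even dimensions use sine, odd dimensions use cosine, and the wavelength grows across the embedding dimension:

```python
# Sketch of sinusoidal positional encoding (standard formula, not RT-1-specific).
import numpy as np

def positional_encoding(seq_len: int, dim: int) -> np.ndarray:
    positions = np.arange(seq_len)[:, None]              # (seq_len, 1)
    freqs = 1.0 / 10000 ** (np.arange(0, dim, 2) / dim)  # (dim/2,) decreasing frequencies
    pe = np.zeros((seq_len, dim))
    pe[:, 0::2] = np.sin(positions * freqs)              # sine on even dimensions
    pe[:, 1::2] = np.cos(positions * freqs)              # cosine on odd dimensions
    return pe

pe = positional_encoding(seq_len=48, dim=512)
print(pe.shape, pe.min(), pe.max())  # (48, 512), with every value inside [-1, 1]
```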
Transformer
There is a lot to a transformer, but to put it really simply, it is a collection of attention layers that can extract and highlight the salient parts of the data, i.e. the parts that deserve attention (a toy sketch of attention follows the list below).
Transformers usually have an encoder-decoder structure:
- The encoder learns to represent the input sequence by capturing dependencies among the elements of the sequence
- The decoder generates the output sequence using the representation learned with the encoder
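To make "attention" slightly more concrete, here is a toy NumPy sketch of scaled dot-product attention, the core operation inside each transformer layer (no multi-head split or masking, and the shapes are made up):

```python
# Toy scaled dot-product attention: each query scores every key, and the
# softmaxed scores weight a sum over the values.
import numpy as np

def attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    # q, k, v: (seq_len, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ v

q, k, v = np.random.randn(3, 10, 64)
print(attention(q, k, v).shape)  # (10, 64)
```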
Overall Flow of RT-1
- The natural language instruction is put through a Universal Sentence Encoder
- A history of 6 images is put into EfficientNet
- FiLM is used with the encoded instruction so that EfficientNet is conditioned on the language (not just the images)
- The output tokens from EfficientNet are reduced using TokenLearner
- The output tokens from each of the 6 images are concatenated
- These tokens are processed with positional encoding
- Then, the encoded tokens are put through a transformer (decoder-only)
- The output of the transformer is mapped to 11 action tokens, each with 256 bins
    - The word “bin” is used in the paper, but it simply means organizing the continuous action space of the joints into 256 discrete slots for convenience (see the sketch after this list)
- The 11 action tokens are:
    - mode - arm, base, termination
    - arm movement - x, y, z, roll, pitch, yaw, gripper opening
    - base movement - x, y, yaw
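Here is a hedged sketch of that binning idea; the value range below is made up for illustration (the paper defines its own per-dimension ranges):

```python
# Sketch of 256-bin action discretization: map a continuous value to one of
# 256 integer slots and approximately back again.
import numpy as np

NUM_BINS = 256

def to_bin(value: float, low: float, high: float) -> int:
    value = float(np.clip(value, low, high))
    return int(round((value - low) / (high - low) * (NUM_BINS - 1)))

def from_bin(bin_id: int, low: float, high: float) -> float:
    return low + bin_id / (NUM_BINS - 1) * (high - low)

b = to_bin(0.03, low=-0.1, high=0.1)        # e.g. an arm displacement; the range is assumed
print(b, round(from_bin(b, -0.1, 0.1), 4))  # 166 0.0302
```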