Introduction
I'm not sure how much demand there is for this, but I wrote a rough explanation of Robotics Transformer (RT-1) in English, so I'm posting it here. (It was originally written for a friend.)
The explanation itself is quite simplified, so it may differ slightly from the details of the paper. Please treat it as something to skim through.
What is RT-1?
RT-1 is a robot control model that performs tasks in real time from a sequence of images and a natural language instruction.
What’s cool about this model?
- Able to perform over 700 tasks with a 97% success rate
- Can operate in real time at 3 Hz
Original Paper
Video
Components of the architecture
The RT-1 architecture consists of the following components:
- Universal Sentence Encoder
- EfficientNet
- Feature-wise Linear Modulation (FiLM)
- TokenLearner
- Positional Encoding
- Transformer
Universal Sentence Encoder
We first want a way to encode the natural language instruction, so RT-1 adopts the Universal Sentence Encoder to do this:
This paper proposes two primary methods for sentence embedding:
- Transformer based
- Deep Average Network
What's good about this encoder?
- Different languages are mapped to the same vector space
- Strong accuracy on the STS Benchmark
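As a quick illustration (not taken from the RT-1 code), here is how the publicly released Universal Sentence Encoder on TensorFlow Hub can be used to embed instructions; the module URL and the 512-dimensional output are properties of that public module:

```python
# Minimal usage sketch: embedding instructions with the Universal Sentence
# Encoder from TensorFlow Hub (not the RT-1 codebase itself).
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
vectors = embed(["pick up the apple", "place the can in the drawer"])
print(vectors.shape)  # (2, 512): each instruction becomes a single 512-d vector
```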
EfficientNet
Now, we want to extract features from the input images, so we use a CNN architecture. RT-1 uses EfficientNet trained on ImageNet.
EfficientNet:
ImageNet:
Simply speaking, EfficientNet is just one of the best CNN architectures, with a relatively simple structure:
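As a hedged sketch, here is one way to grab an ImageNet-pretrained EfficientNet in PyTorch and pull out its convolutional feature map; torchvision is just an assumption for illustration (RT-1 itself uses EfficientNet-B3 as its image backbone):

```python
# Sketch: load an ImageNet-pretrained EfficientNet-B3 and extract features.
# torchvision is an assumption here; RT-1 has its own implementation.
import torch
import torchvision

backbone = torchvision.models.efficientnet_b3(
    weights=torchvision.models.EfficientNet_B3_Weights.IMAGENET1K_V1
)
backbone.eval()

with torch.no_grad():
    # `features` is the convolutional part of the network (no classification head)
    feature_map = backbone.features(torch.randn(1, 3, 300, 300))
print(feature_map.shape)  # a (batch, channels, H, W) spatial feature map
```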
Feature-wise Linear Modulation (FiLM)
RT-1 keeps the model efficient by using FiLM to combine natural language instructions and visual features.
FiLM is a convenient layer that learns to transform feature maps so that they incorporate external information. In RT-1, FiLM is used to keep only the visual features that are closely related to the natural language instruction.
To be more specific, FiLM learns two functions that output values ($\gamma$, $\beta$), which define the affine transformation (collinearity and distance ratios are preserved) applied to an activation:
$$\mathrm{FiLM}(\mathbf{F}_{i,c} \mid \gamma_{i,c}, \beta_{i,c}) = \gamma_{i,c} \odot \mathbf{F}_{i,c} + \beta_{i,c}$$
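To make this concrete, here is a heavily simplified FiLM layer sketch in PyTorch; the shapes and layer sizes are assumptions for illustration, not the ones used in RT-1:

```python
# Sketch of a FiLM layer: the language embedding predicts per-channel
# gamma and beta, which scale and shift the image features channel-wise.
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, text_dim: int, num_channels: int):
        super().__init__()
        self.to_gamma = nn.Linear(text_dim, num_channels)
        self.to_beta = nn.Linear(text_dim, num_channels)

    def forward(self, features: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # features: (B, C, H, W) visual features, text: (B, text_dim) instruction embedding
        gamma = self.to_gamma(text)[:, :, None, None]  # (B, C, 1, 1)
        beta = self.to_beta(text)[:, :, None, None]    # (B, C, 1, 1)
        return gamma * features + beta                 # feature-wise affine transform

film = FiLM(text_dim=512, num_channels=48)             # sizes are made up for the example
out = film(torch.randn(2, 48, 30, 30), torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 48, 30, 30])
```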
TokenLearner
RT-1 makes the model even more efficient by reducing the tokens to only the significant ones. This is done using a TokenLearner:
TokenLearner learns to compute weight maps that are multiplied element-wise with the original input to extract only the important tokens:
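A rough TokenLearner-style sketch is below; the real module scores positions with a small conv/MLP block (a single 1x1 convolution here is a simplification). In RT-1 this step reduces each image's 81 tokens down to 8:

```python
# Sketch of TokenLearner-style pooling: learn S spatial weight maps and
# pool the feature map into S tokens (simplified scoring head).
import torch
import torch.nn as nn

class TokenLearner(nn.Module):
    def __init__(self, num_channels: int, num_tokens: int = 8):
        super().__init__()
        self.score = nn.Conv2d(num_channels, num_tokens, kernel_size=1)  # one map per token

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (B, C, H, W) -> tokens: (B, S, C)
        weights = self.score(features).flatten(2).softmax(dim=-1)  # (B, S, H*W) weight maps
        flat = features.flatten(2)                                 # (B, C, H*W)
        return torch.einsum("bsn,bcn->bsc", weights, flat)         # weighted spatial pooling

tokens = TokenLearner(num_channels=512, num_tokens=8)(torch.randn(2, 512, 9, 9))
print(tokens.shape)  # torch.Size([2, 8, 512])
```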
Positional Encoding
Positional encoding adds positional information to the entities in a sequence. This is done simply by adding positional vectors to the original vector sequence (e.g. a different positional vector is added to every single word vector in a sentence). Unique position vectors are created using a combination of sine and cosine. Positional encoding is always applied before feeding the sequence into a transformer.
Why sine and cosine?
This is my intuition behind this:
First of all, sine and cosine are used because they are a simple way to generate values between -1 and 1. (This is important because the original vectors (e.g. word vectors) are normalized, so we can't add vectors with values that are too large.)
Can’t we just use sine then?
A: The most intuitive method would be to use a sine function with a very low frequency (long wavelength), because a sine function is periodic (we don't want the same values repeating, since we want unique vectors). However, this creates a problem where values generated for early positions don't differ much from each other (the sine wave is essentially flat near its trough). So, adding cosine lets the model capture a different phase of the wave at each position, which makes it easier to distinguish positions clearly.
This is an example of a positional encoding matrix (just the positional vectors stacked row by row). Each row represents a positional vector here, so it's pretty clear that unique vectors are being generated:
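Here is a short NumPy sketch of the standard sinusoidal encoding from the original Transformer paper (implementations may differ in details): even dimensions use sine, odd dimensions use cosine, and the wavelength grows across the embedding dimension:

```python
# Sketch of sinusoidal positional encoding (standard formula, not RT-1-specific).
import numpy as np

def positional_encoding(seq_len: int, dim: int) -> np.ndarray:
    positions = np.arange(seq_len)[:, None]              # (seq_len, 1)
    freqs = 1.0 / 10000 ** (np.arange(0, dim, 2) / dim)  # (dim/2,) decreasing frequencies
    pe = np.zeros((seq_len, dim))
    pe[:, 0::2] = np.sin(positions * freqs)              # sine on even dimensions
    pe[:, 1::2] = np.cos(positions * freqs)              # cosine on odd dimensions
    return pe

pe = positional_encoding(seq_len=48, dim=512)
print(pe.shape, pe.min(), pe.max())  # (48, 512), with every value inside [-1, 1]
```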
Transformer
There is a lot to a transformer, but to put it really simply, it is a collection of attention layers that can extract and highlight the salient parts of the data, i.e. the parts that deserve attention (a toy sketch of attention follows the list below).
Transformers usually have an encoder-decoder structure:
- The encoder learns to represent the input sequence by capturing dependencies among the elements of the sequence
- The decoder generates the output sequence using the representation learned with the encoder
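To make "attention" slightly more concrete, here is a toy NumPy sketch of scaled dot-product attention, the core operation inside each transformer layer (no multi-head split or masking, and the shapes are made up):

```python
# Toy scaled dot-product attention: each query scores every key, and the
# softmaxed scores weight a sum over the values.
import numpy as np

def attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    # q, k, v: (seq_len, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ v

q, k, v = np.random.randn(3, 10, 64)
print(attention(q, k, v).shape)  # (10, 64)
```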
Overall Flow of RT-1
- The natural language instruction is put through a Universal Sentence Encoder
- A history of 6 images is put into EfficientNet
- FiLM is used with the encoded instruction so that EfficientNet is conditioned on the language (not just the images)
- The output tokens from EfficientNet are reduced using TokenLearner
- The output tokens from each of the 6 images are concatenated
- These tokens are processed with positional encoding
- Then, the encoded tokens are put through a transformer (decoder-only)
- The output of the transformer is mapped to 11 action tokens, each with 256 bins
    - The word “bin” is used in the paper, but it simply means organizing the continuous action space of the joints into 256 discrete slots for convenience (see the sketch after this list)
- The 11 action tokens are:
    - mode - arm, base, termination
    - arm movement - x, y, z, roll, pitch, yaw, gripper opening
    - base movement - x, y, yaw
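Here is a hedged sketch of that binning idea; the value range below is made up for illustration (the paper defines its own per-dimension ranges):

```python
# Sketch of 256-bin action discretization: map a continuous value to one of
# 256 integer slots and approximately back again.
import numpy as np

NUM_BINS = 256

def to_bin(value: float, low: float, high: float) -> int:
    value = float(np.clip(value, low, high))
    return int(round((value - low) / (high - low) * (NUM_BINS - 1)))

def from_bin(bin_id: int, low: float, high: float) -> float:
    return low + bin_id / (NUM_BINS - 1) * (high - low)

b = to_bin(0.03, low=-0.1, high=0.1)        # e.g. an arm displacement; the range is assumed
print(b, round(from_bin(b, -0.1, 0.1), 4))  # 166 0.0302
```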