Converting the Nvidia Project GR00T Dataset to MCAP

Nvidia Isaac GR00T is “a foundation model for generalized humanoid robot reasoning and skills”, and in this post I’ll convert the dataset used to post-train GR00T N1 to MCAP, then visualize it in Foxglove.

Overview of the Dataset

We’ll be looking specifically at datasets in the Robotics Arm Kitchen Manipulation group, but the same process applies to all the other groups. We’ll download the logs for a specific task, single_panda_gripper.CloseDoubleDoor:

git clone --filter=blob:none --no-checkout https://huggingface.co/datasets/nvidia/PhysicalAI-Robotics-GR00T-X-Embodiment-Sim
cd PhysicalAI-Robotics-GR00T-X-Embodiment-Sim
git sparse-checkout init --cone
git sparse-checkout set single_panda_gripper.CloseDoubleDoor
git checkout main

and this is how the data is structured:

├── data
│   ├── chunk-000
│   │   ├── episode_000000.parquet
│   │   └── ...
│   └── ...
├── meta
│   ├── episodes.jsonl
│   ├── info.json
│   ├── modality.json
│   ├── stats.json
│   └── tasks.jsonl
└── videos
    ├── chunk-000
    │   ├── observation.images.left_view
    │   │   ├── episode_000000.mp4
    │   │   └── ...
    │   ├── observation.images.right_view
    │   │   ├── episode_000000.mp4
    │   │   └── ...
    │   └── observation.images.wrist_view
    │       ├── episode_000000.mp4
    │       └── ...
    └── ...

There’s a lot of stuff here, but we’re most interested in the .parquet episode files and the .mp4 videos, which are what we’ll be converting to MCAP.

Each episode is a series of rows, and the key elements of every row are a timestamp and a state vector. The structure of the state vector is described in the modality.json file:

{
    "state": {
        ...snip...
        "joint_position": {
            "start": 25,
            "end": 32
        },
        ...snip...
    }
    ...snip...
}

for example, this snippet tells us that the 7 joint positions occupy indices 25 through 31 of the state vector (the end index is exclusive).
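
Here’s a minimal sketch of how these indices get used, assuming we run it from inside the single_panda_gripper.CloseDoubleDoor directory; the state vector here is just a placeholder, the real rows come from the parquet files we load later in this post:

import json
import numpy as np

modality = json.load(open("meta/modality.json"))
desc = modality["state"]["joint_position"]

state = np.zeros(64)                                 # placeholder for one row's observation.state
joint_positions = state[desc["start"]:desc["end"]]   # slices out indices 25..31
print(len(joint_positions))                          # 7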

Setting up the Project

Before we install dependencies we’ll need to create a virtual environment, so that they aren’t installed globally. We’ll use virtualenv, which you can get on macOS like this:

python3 -m pip install virtualenv

and on Ubuntu:

sudo apt install python3-virtualenv

then we create the project directory, create a virtual environment inside it, and install our dependencies:

mkdir example
cd example
virtualenv venv 
source venv/bin/activate
python3 -m pip install mcap mcap-protobuf-support foxglove-schemas-protobuf numpy pandas pyarrow urchin 

we need mcap, mcap-protobuf-support, and foxglove-schemas-protobuf to write protobuf messages to MCAP files; numpy, pandas, and pyarrow to parse the dataset; and urchin (a modern fork of the ancient urdfpy) to compute the transforms between the robot’s joints.

Finally, we’ll need the URDF and meshes for the Franka Research 3 robot. You can create these files using this project.
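
As a quick sanity check, we can load the URDF with urchin and print the joint names we’ll need to match against later. This is just a sketch that assumes the URDF ends up at ./fr3_franka_hand/fr3_franka_hand.urdf, the path used later in this post:

from urchin import URDF

urdf = URDF.load("./fr3_franka_hand/fr3_franka_hand.urdf")
for joint in urdf.joints:
    # expect fr3_joint1..fr3_joint7 plus fr3_finger_joint1 and fr3_finger_joint2
    print(joint.name)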

Converting Joint Positions to MCAP

These are the imports we’ll need:

import json

import numpy as np
import pandas as pd
from foxglove_schemas_protobuf.Vector3_pb2 import Vector3
from foxglove_schemas_protobuf.Quaternion_pb2 import Quaternion
from foxglove_schemas_protobuf.FrameTransform_pb2 import FrameTransform
from foxglove_schemas_protobuf.FrameTransforms_pb2 import FrameTransforms
from google.protobuf.timestamp_pb2 import Timestamp
from pathlib import Path
from mcap_protobuf.writer import Writer
from urchin import URDF

from itertools import chain

and this is a helper function that converts a rotation matrix, which is how urchin represents rotations, to a quaternion, which is the representation Foxglove expects:

def rot_matrix_to_quat(R):
    # Convert a 3x3 rotation matrix to a quaternion in (x, y, z, w) order.
    trace = R[0, 0] + R[1, 1] + R[2, 2]
    if trace > 0:
        s = 0.5 / np.sqrt(trace + 1.0)
        w = 0.25 / s
        x = (R[2, 1] - R[1, 2]) * s
        y = (R[0, 2] - R[2, 0]) * s
        z = (R[1, 0] - R[0, 1]) * s
    elif (R[0, 0] > R[1, 1]) and (R[0, 0] > R[2, 2]):
        s = 2.0 * np.sqrt(1.0 + R[0, 0] - R[1, 1] - R[2, 2])
        w = (R[2, 1] - R[1, 2]) / s
        x = 0.25 * s
        y = (R[0, 1] + R[1, 0]) / s
        z = (R[0, 2] + R[2, 0]) / s
    elif R[1, 1] > R[2, 2]:
        s = 2.0 * np.sqrt(1.0 + R[1, 1] - R[0, 0] - R[2, 2])
        w = (R[0, 2] - R[2, 0]) / s
        x = (R[0, 1] + R[1, 0]) / s
        y = 0.25 * s
        z = (R[1, 2] + R[2, 1]) / s
    else:
        s = 2.0 * np.sqrt(1.0 + R[2, 2] - R[0, 0] - R[1, 1])
        w = (R[1, 0] - R[0, 1]) / s
        x = (R[0, 2] + R[2, 0]) / s
        y = (R[1, 2] + R[2, 1]) / s
        z = 0.25 * s
    return np.array([x, y, z, w], dtype=float)
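
A quick way to convince ourselves the conversion is right is to feed it a couple of known rotations; this check isn’t part of the conversion script, just a sketch:

import numpy as np

print(rot_matrix_to_quat(np.eye(3)))      # identity rotation -> [0. 0. 0. 1.]

R_z90 = np.array([[0.0, -1.0, 0.0],       # 90 degree rotation about the z axis
                  [1.0,  0.0, 0.0],
                  [0.0,  0.0, 1.0]])
print(rot_matrix_to_quat(R_z90))          # -> approximately [0, 0, 0.7071, 0.7071]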

next we’ll write the code that loads the episode and the modality.json file:

data_root = Path("~/PhysicalAI-Robotics-GR00T-X-Embodiment-Sim/single_panda_gripper.CloseDoubleDoor").expanduser()
chunk = "000"
episode = "000900"

urdf = URDF.load("./fr3_franka_hand/fr3_franka_hand.urdf")
data_frame = pd.read_parquet(f'{data_root}/data/chunk-{chunk}/episode_{episode}.parquet', engine="pyarrow")

modality = json.load(open(f'{data_root}/meta/modality.json'))
state_desc = modality["state"]

joint_pos_idx_start = state_desc["joint_position"]["start"]
joint_pos_idx_end = state_desc["joint_position"]["end"]
joint_names = [f"fr3_joint{i}" for i in range(1, 8)]

gripper_pos_idx_start = state_desc["gripper_qpos"]["start"]
gripper_pos_idx_end = state_desc["gripper_qpos"]["end"]
gripper_names = ["fr3_finger_joint1", "fr3_finger_joint2"]

and then we’ll write the code for iterating over the dataset rows, converting the joint positions to FrameTransform messages, and writing the data to an MCAP file:

with open(f"{data_root.name}-{chunk}-{episode}-state.mcap", "wb") as stream, Writer(stream) as writer:
    for _, row in data_frame.iterrows():
        sec_whole, sec_dec = divmod(row["timestamp"], 1)
        sec = int(sec_whole)
        nsec = int(sec_dec * 1_000_000_000)

        timestamp_ns = int(row["timestamp"] * 1_000_000_000)

        state = row["observation.state"]
        
        joint_pos_values = state[joint_pos_idx_start:joint_pos_idx_end]

        gripper_pos_values = state[gripper_pos_idx_start:gripper_pos_idx_end]

        joint_pos_dict = dict(chain(zip(joint_names, joint_pos_values), 
                                    zip(gripper_names, gripper_pos_values)))

        fk_poses = urdf.link_fk(cfg=joint_pos_dict)

        transforms = []
        transforms.append(
            FrameTransform(
                timestamp       = Timestamp(seconds=sec, nanos=nsec),
                parent_frame_id = "world",
                child_frame_id  = "base",
                translation     = Vector3(x=0.0, y=0.0, z=0.0),
                rotation        = Quaternion(x=0.0, y=0.0, z=0.0, w=1.0)
            )
        )

        for joint in urdf.joints:
            T_parent = fk_poses[urdf.link_map[joint.parent]]
            T_child = fk_poses[urdf.link_map[joint.child]]
            T_local = np.linalg.inv(T_parent) @ T_child
            trans = T_local[:3, 3]
            quat = rot_matrix_to_quat(T_local[:3, :3])
            transforms.append(
                FrameTransform(
                    timestamp       = Timestamp(seconds=sec, nanos=nsec),
                    parent_frame_id = joint.parent,
                    child_frame_id  = joint.child,
                    translation     = Vector3(x=trans[0], y=trans[1], z=trans[2]),
                    rotation        = Quaternion(x=quat[0], y=quat[1], z=quat[2], w=quat[3])
                )
            )

        writer.write_message(
            topic        = "/tf",
            message      = FrameTransforms(transforms=transforms),
            log_time     = timestamp_ns,
            publish_time = timestamp_ns
        )

The key part of the code is how we convert the data to a form urchin understands. To compute transforms between joints, urchin needs a dictionary mapping joint names in the URDF to their values.
We get joint values by slicing the state vector

joint_pos_values = state[joint_pos_idx_start:joint_pos_idx_end]

and we define joint names once at the start

joint_names = [f"fr3_joint{i}" for i in range(1, 8)]

then we zip these lists together and use them to create our dictionary

joint_pos_dict = dict(chain(zip(joint_names, joint_pos_values), 
                            zip(gripper_names, gripper_pos_values)))

in other words, "fr3_joint1" corresponds to joint_pos_values[0], "fr3_joint2" corresponds to joint_pos_values[1], and so on.
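
Here’s a tiny standalone version of the same chain/zip pattern, just to make the mapping concrete (the values are made up):

from itertools import chain

names          = ["fr3_joint1", "fr3_joint2"]
values         = [0.10, -0.50]
gripper_names  = ["fr3_finger_joint1", "fr3_finger_joint2"]
gripper_values = [0.04, 0.04]

cfg = dict(chain(zip(names, values), zip(gripper_names, gripper_values)))
print(cfg)
# {'fr3_joint1': 0.1, 'fr3_joint2': -0.5, 'fr3_finger_joint1': 0.04, 'fr3_finger_joint2': 0.04}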

Running this script creates an MCAP file containing the joint position data, and now we’ll move on to converting the camera data to MCAP.
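
If you want to double-check the output before opening it in Foxglove, a small sketch like this (using the reader that ships with the mcap package we installed earlier) counts the /tf messages we wrote:

from mcap.reader import make_reader

with open("single_panda_gripper.CloseDoubleDoor-000-000900-state.mcap", "rb") as f:
    reader = make_reader(f)
    # iter_messages yields (schema, channel, message) tuples
    count = sum(1 for _ in reader.iter_messages(topics=["/tf"]))

print(count)  # one FrameTransforms message per dataset row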

Converting Video to MCAP

There already exists a Rust utility written by the engineers at Foxglove, but it’ll be more fun to write our own in Python!

If we look in Foxglove’s documentation for CompressedVideo messages, we see that the encoded frames need to be supplied in Annex B format and that B-frames should be avoided, since each message must be decodable from the data received so far.

So this’ll require some work, but luckily FFmpeg will take care of most of it. The first thing we’ll do is install FFmpeg (in addition to our existing dependencies). On macOS:

brew install pkg-config ffmpeg

and on Ubuntu

sudo apt-get update
sudo apt-get install -qq --no-install-recommends \
  pkg-config \
  ffmpeg \
  libavutil-dev \
  libavcodec-dev \
  libavformat-dev \
  libswscale-dev \
  libswresample-dev \
  libavfilter-dev \
  libavdevice-dev

and we’ll also need to install av in our virtual environment.

python3 -m pip install av

we create another script for converting the mp4 files, and these are the imports we’ll need

import argparse
import av
import subprocess
from pathlib import Path
from foxglove_schemas_protobuf.CompressedVideo_pb2 import CompressedVideo
from google.protobuf.timestamp_pb2 import Timestamp
from mcap_protobuf.writer import Writer
from tempfile import NamedTemporaryFile

first we determine the codec of our mp4 file:

data_root = Path("~/PhysicalAI-Robotics-GR00T-X-Embodiment-Sim/single_panda_gripper.CloseDoubleDoor").expanduser()
chunk = "000"
episode = "000900"

# I've omitted it in this post, but the code below should be put in a function 
# and called for every camera angle
name = f"{data_root.name}-{chunk}-{episode}"
input_path = data_root / f"videos/chunk-{chunk}/observation.images.left_view/episode_{episode}.mp4"
output_path = f"{name}-left_view.mcap"
topic = "/camera/left"
frame_id = "/camera/left"

with av.open(input_path, "r") as container:
    video_stream = container.streams.video[0]
    codec_context = video_stream.codec_context
    codec_name = codec_context.name
    if codec_name not in ["h264", "h265", "hevc"]:
        raise ValueError(f"Unsupported codec: {codec_name}")

then we call FFmpeg using the subprocess module, re-encoding the video stream of our mp4 without B-frames and converting it to the Annex B format.

with NamedTemporaryFile(suffix=".ts") as temp_output:
    temp_output_path = Path(temp_output.name)

    cmd = [
        "ffmpeg",
        "-y",
        "-i", str(input_path),
        "-c:v", "libx264" if codec_name == "h264" else "libx265",
        "-bf", "0",
        "-bsf:v", "h264_mp4toannexb" if codec_name == "h264" else "hevc_mp4toannexb",
        str(temp_output_path)
    ]
    subprocess.run(cmd, check=True)

finally we’ll use av to iterate over the packets of this processed file and write them into the MCAP file as CompressedVideo messages

    with av.open(temp_output_path, "r") as container, open(output_path, "wb") as stream, Writer(stream) as writer:
        video_stream = container.streams.video[0]

        format = "h264" if codec_name == "h264" else "h265"

        for packet in container.demux(video_stream):
            if packet.pts is None:
                continue
            data = bytes(packet)
            # assumes that 1 packet corresponds to 1 frame https://ffmpeg.org/doxygen/2.0/structAVPacket.html
            timestamp_ns = int(packet.pts * 1_000_000_000 * packet.time_base.numerator / packet.time_base.denominator)
            message = CompressedVideo(
                timestamp   = Timestamp(seconds=timestamp_ns // 1_000_000_000, nanos=timestamp_ns % 1_000_000_000),
                data        = data,
                format      = format
            )
            writer.write_message(
                topic        = topic,
                message      = message,
                publish_time = timestamp_ns,
                log_time     = timestamp_ns
            )
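
The post only covers the left camera, but since each episode has three views, the natural next step is to wrap the conversion above in a function and loop over the views. Here’s a sketch of what that could look like, where convert_view is a hypothetical wrapper around the code above and the /camera/right and /camera/wrist topic names are my own choice:

# Assumes the directory layout we saw earlier; convert_view() is a
# hypothetical function wrapping the conversion code above.
views = {
    "left_view":  "/camera/left",
    "right_view": "/camera/right",
    "wrist_view": "/camera/wrist",
}

for view, topic in views.items():
    input_path  = data_root / f"videos/chunk-{chunk}/observation.images.{view}/episode_{episode}.mp4"
    output_path = f"{name}-{view}.mcap"
    convert_view(input_path, output_path, topic)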

Visualizing the Data

Now that we’ve converted the dataset to MCAP we can visualize it!

You’ll notice that I’m also visualizing the joint velocities, but converting this data to MCAP is essentially the same as joint positions, so I’ve omitted the code.

Discussion and feedback