How Tesla FSD Works Part 5: Modeling a Physical World Without LiDAR

A newly published Tesla patent filing describes in detail the inner mechanics of the company's vision-based occupancy network. The filing, titled Artificial Intelligence Modeling Techniques For Vision-based Occupancy Determination, was published on March 12, 2026.

Authored by a group of engineers that includes Ashok Elluswamy, the document explains how Tesla applies artificial intelligence to perceive and model the physical environment without using radar or LiDAR.

Understanding the Voxel Grid

The occupancy network is organized around voxels, the three-dimensional counterpart of pixels: each voxel represents a small cube of space within a volumetric grid surrounding the vehicle. To construct this grid, the model processes images from the vehicle’s eight exterior cameras and predicts whether each voxel is occupied by an object with mass.
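To make the data structure concrete, here is a minimal sketch of a dense voxel grid in Python. It is an illustration only, not the patent's implementation: the grid extents, the grid shape, the 0.5 occupancy threshold, and the names world_to_voxel and is_occupied are all assumptions, and the occupancy value is written by hand where a real system would use the network's prediction from camera images.

```python
import numpy as np

# All constants below are assumptions chosen for illustration.
VOXEL_SIZE_M = 0.33                          # voxel edge length in meters
GRID_MIN_M = np.array([-40.0, -40.0, -2.0])  # grid corner in the vehicle frame
GRID_SHAPE = (243, 243, 19)                  # number of voxels along x, y, z

# One occupancy probability per voxel; a real system would fill this
# from the network's output rather than by hand.
occupancy = np.zeros(GRID_SHAPE, dtype=np.float32)


def world_to_voxel(point_m):
    """Return the (i, j, k) index of the voxel containing a vehicle-frame
    point given in meters, or None if the point lies outside the grid."""
    offset = np.asarray(point_m, dtype=float) - GRID_MIN_M
    idx = np.floor(offset / VOXEL_SIZE_M).astype(int)
    if np.any(idx < 0) or np.any(idx >= np.array(GRID_SHAPE)):
        return None
    return tuple(int(v) for v in idx)


def is_occupied(point_m, threshold=0.5):
    """True if the voxel containing the point meets the occupancy threshold."""
    idx = world_to_voxel(point_m)
    return idx is not None and bool(occupancy[idx] >= threshold)


# Mark the voxel 5 m ahead of the vehicle as very likely occupied.
occupancy[world_to_voxel((5.0, 0.0, 0.5))] = 0.9
```

Any point that falls inside the same 33-centimeter cube maps to the same index, so a lookup for a nearby point returns the same answer.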

Labeling millions of 3D points by hand would be prohibitively time-consuming, so the patent notes that Tesla relies heavily on unsupervised training methods to scale model training.

Variable Resolution and Sub-Voxels

The patent describes a strategy to manage compute by dynamically resizing voxels. The default voxel measures 33 centimeters on each side, which is sufficient for distant regions or areas off the immediate driving surface.

For regions that are occupied and within a threshold distance from the vehicle, FSD can reduce the voxel size to 10 centimeters to capture finer detail. The neural networks can also represent partial occupancy by splitting occupied regions into smaller sub-voxels.
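The resizing rule can be sketched in a few lines of Python. The 33 and 10 centimeter sizes come from the description above, but the 20-meter threshold, the 2x2x2 split, and every function name here are assumptions made for illustration; the article does not state how the threshold or the sub-voxel layout are chosen.

```python
import math

COARSE_EDGE_M = 0.33      # default voxel edge length
FINE_EDGE_M = 0.10        # refined voxel edge length
NEAR_THRESHOLD_M = 20.0   # assumed value; the actual threshold is not given


def voxel_edge_for(center_m, occupied, near_threshold_m=NEAR_THRESHOLD_M):
    """Pick a voxel edge length: fine only for occupied regions near the vehicle."""
    distance_m = math.dist((0.0, 0.0, 0.0), center_m)
    if occupied and distance_m <= near_threshold_m:
        return FINE_EDGE_M
    return COARSE_EDGE_M


def split_into_subvoxels(origin_m, edge_m, n=2):
    """Split a cubic voxel into n**3 equal sub-voxels.

    Returns a list of (sub_origin, sub_edge) pairs."""
    sub = edge_m / n
    ox, oy, oz = origin_m
    return [
        ((ox + i * sub, oy + j * sub, oz + k * sub), sub)
        for i in range(n)
        for j in range(n)
        for k in range(n)
    ]


def partial_occupancy(subvoxel_flags):
    """Fraction of sub-voxels flagged as occupied."""
    return sum(subvoxel_flags) / len(subvoxel_flags)
```

With this layout, a voxel whose sub-voxels are half occupied reports a partial occupancy of 0.5 instead of a single yes-or-no value.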

This added granularity helps the system recover the precise shape of curved objects. In addition, an analytics server can apply trilinear interpolation to estimate the occupancy state of any specific point within a voxel.
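Trilinear interpolation itself is a standard technique, and a generic version is sketched below. It assumes occupancy values are stored at the integer lattice points of a NumPy array and that the query point is given in fractional index coordinates; the function name and that data layout are assumptions, not details from the patent.

```python
import numpy as np


def trilinear_occupancy(grid, point_idx):
    """Interpolate a 3D grid at fractional index coordinates (x, y, z).

    Values are assumed to sit at integer lattice points, and the grid must
    have at least two samples per axis. Points outside the grid are clamped
    to its boundary."""
    coords = []
    for value, size in zip(point_idx, grid.shape):
        v = min(max(float(value), 0.0), size - 1.0)   # clamp into the grid
        lo = min(int(np.floor(v)), size - 2)          # lower corner index
        coords.append((lo, v - lo))                   # index and blend weight
    (x0, tx), (y0, ty), (z0, tz) = coords

    # The 2x2x2 block of corner values surrounding the query point.
    c = grid[x0:x0 + 2, y0:y0 + 2, z0:z0 + 2].astype(float)
    c = c[0] * (1.0 - tx) + c[1] * tx   # blend along x -> shape (2, 2)
    c = c[0] * (1.0 - ty) + c[1] * ty   # blend along y -> shape (2,)
    return float(c[0] * (1.0 - tz) + c[1] * tz)   # blend along z -> scalar
```

For a 2x2x2 block in which only one corner is occupied, the value at the center of the cube is 1/8, since each of the eight corners contributes equally there.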

Temporal Fusion and 3D Semantics

The AI does not treat frames as isolated snapshots. A transformer aggregates 2D image data into a unified 3D representation and then fuses this with representations from previous timestamps. Combining spatial and temporal context enables the network to compute occupancy flow, an estimate of the velocity of whatever occupies each voxel.
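The idea behind occupancy flow can be illustrated with a deliberately simplified stand-in: instead of a learned network, the sketch below estimates one object's velocity from how far the centroid of its occupied voxels moves between two timestamps. The function name and the assumption that the voxels have already been grouped by object are hypothetical, and this is not the method the patent describes.

```python
import numpy as np


def centroid_velocity(voxel_centers_prev, voxel_centers_curr, dt_s):
    """Estimate velocity (m/s) from the centroid shift of an object's
    occupied voxel centers between two timestamps dt_s seconds apart.

    Both inputs are sequences of (x, y, z) positions in meters and are
    assumed to belong to the same object."""
    prev = np.asarray(voxel_centers_prev, dtype=float)
    curr = np.asarray(voxel_centers_curr, dtype=float)
    return (curr.mean(axis=0) - prev.mean(axis=0)) / dt_s
```

An object whose occupied voxels shift forward by half a meter over 0.1 seconds would be assigned a velocity of 5 meters per second along that axis.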

The system then applies 3D semantic understanding to infer object type, distinguishing, for example, a moving car, a static building, or a street curb. It also prioritizes certain semantic shapes: a moving vehicle near the ego vehicle is examined more thoroughly than a distant static structure.

Powering Vehicles and Optimus

The resulting information is continuously compiled into a queryable dataset. FSD can query this dataset to obtain occupancy states and make real-time driving decisions. The same dataset is also used to render the 3D environmental map presented in the vehicle’s user interface.
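One way to picture such a queryable dataset is a sparse store keyed by voxel index, sketched below. The record fields, the class names VoxelRecord and OccupancyStore, and the dictionary-based storage are all assumptions made for illustration; the article does not describe the dataset's actual schema or storage format.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class VoxelRecord:
    """Hypothetical per-voxel record combining occupancy, semantics, and flow."""
    occupied: bool
    label: str
    velocity_mps: tuple


# Returned for any voxel that has never been written.
EMPTY = VoxelRecord(occupied=False, label="free", velocity_mps=(0.0, 0.0, 0.0))


class OccupancyStore:
    """Sparse voxel store: only voxels that were updated take up memory."""

    def __init__(self, voxel_edge_m=0.33):
        self.voxel_edge_m = voxel_edge_m
        self._records = {}

    def _key(self, point_m):
        # Integer voxel index containing the vehicle-frame point.
        return tuple(math.floor(c / self.voxel_edge_m) for c in point_m)

    def update(self, point_m, record):
        self._records[self._key(point_m)] = record

    def query(self, point_m):
        return self._records.get(self._key(point_m), EMPTY)


store = OccupancyStore()
store.update((5.0, 0.0, 0.5), VoxelRecord(True, "vehicle", (3.0, 0.0, 0.0)))
```

A planner and a visualization layer could both call query on the same store, which mirrors the shared use of the dataset described above.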

Although the patent centers on autonomous vehicles, it emphasizes that the underlying approach is highly adaptable. The document specifies that the same vision-based occupancy network can be deployed on a general-purpose, bipedal humanoid robot to traverse varied terrain.

For more on Tesla's FSD-related patents: