
Tesla’s FSD has consistently followed a core approach: navigate and interact with the physical world using vision-only sensing and generalized artificial intelligence.
Earlier explanations of how FSD works have described how its neural networks construct three-dimensional spaces from two-dimensional pixels and track moving objects over time. Yet one of the toughest problems in autonomy is not dynamic obstacles but the implicit, shifting, human-centric logic of road design.
At an unmapped, complex, multi-lane intersection, an autonomous vehicle must infer which incoming lane connects to which outgoing path. Companies such as Waymo and Zoox address this by depending on fragile, high-definition localized maps. Tesla tackles the task as a dynamic inference problem.
A recent patent application, US 2026/0170852 A1, titled "Vision-Based Machine Learning Model for Lane Connectivity in Autonomous or Semi-Autonomous Driving," outlines a solution that adapts an architecture common to generative AI and LLMs: the autoregressive transformer.
Tokenizing Asphalt
Large Language Models generate text token by token. When a system like Grok produces text, it selects the most likely next token based on the tokens already emitted, rather than producing a full paragraph in one step.
Tesla applies a similar strategy to road geometry. As the car drives, the vision system feeds raw images into FSD’s neural networks. Backbone networks process the pixels, and a cross-attention transformer fuses multiple camera views into a three-dimensional, top-down vector space known as a bird’s-eye view (BEV).
The end-to-end flow is: [Camera Pixels] ➔ [Backbone Networks] ➔ [Multicam Fusion] ➔ [Autoregressive Transformers] ➔ [Lane Graph]
With the spatial representation established, the network begins to "read" the intersection. It tokenizes coordinates so that physical road positions are represented as discrete tokens.
Autoregressive blocks then pick a seed coordinate, such as a lane’s starting point, and predict successive spatial points (X, Y coordinates) along the drivable path.
The Autoregressive Loop
Initial Prediction: The network identifies an entry token at the mouth of an intersection.
Contextual Feeding: That token is fed back into the autoregressive blocks together with spatial feature maps.
Sequential Tracking: Using the accumulated context, the model predicts the next plausible point in the lane, chaining tokens into a precise path through the intersection.
Structural Attribute Labeling: In parallel, the system assigns attributes to each coordinate token, labeling whether a point corresponds to a standard travel path, a merge vector, a fork deviation, or an intersection interior without visible paint lines.
The loop repeats many times—typically between 64 and 108 inferences per cycle. When one lane is fully traced and terminated, the process restarts to model neighboring lanes.
By expressing lanes as a sequence of tokens, the network effectively writes a "sentence" that captures the whole intersection. Even at an unpredictable five-way split or a sharp curve, the transformer uses surrounding context to maintain a consistent trajectory so the vehicle does not switch paths mid-intersection.
Overcoming Amnesia Across Time and Space
The patent notes that real roads often present sensory disruptions: lane markings may be worn, obscured by construction debris, or hidden behind large lead vehicles such as commercial box trucks. To counter immediate spatial amnesia, the system incorporates a dedicated video queue module.
This video queue provides short-term spatial and temporal memory. As the vehicle advances, features from earlier timestamps are retained. A frame alignment step keeps them consistent over time; for example, if the car moves 20 meters, the historical feature maps in the queue are shifted to compensate for ego-motion.
With this alignment, if a lane line disappears under a nearby vehicle, the autoregressive blocks can reference historical features to continue predicting the lane’s connectivity without interruption.
Maps are a Hint, Not the Truth
The industry has long argued that safe driverless operation requires centimeter-accurate HD maps. Tesla Vision has already demonstrated a generalized alternative, and this patent details how that approach can work for lane connectivity.
Standard map data can be provided as an auxiliary input, but only as a hint. The architecture can inject a "don’t know" signal when localized map data appears unreliable or outdated for a given geographic area.
Prioritizing live vision ensures the system behaves more like a skilled human: it considers the scene, interprets intersection context, and infers the correct route through visual reasoning rather than depending on stale map entries.
As Tesla moves toward commercializing an unsupervised ride-hailing network, this work indicates the stack is not merely processing images—it is actively reading the world.
- How FSD Works Part 1
- How FSD Works Part 2
- How FSD Works Part 3
- How FSD Works Part 4
- How FSD Works Part 5
- How FSD Works Part 6 (this article)
- Tesla’s Occupancy Network
- Tesla’s Universal AI Translator
- How Tesla Optimizes FSD
- How Tesla Will Label Data with AI













































Share:
NHTSA Ends Investigation Into Tesla Model 3/Y Steering Failures