Latency to Insight (F1Muse)
I always wanted a go-to place for F1 statistics; every alternative seemed outdated and didn't have a compelling UX. Most current F1 websites share the same format: search for a driver or track, then look at basic statistics. I wanted to combine both into a single website, and the best model looked like Statmuse. The great part of its UX is that it lets Twitter fans pull statistics in a second for their 50-tweet-long threads, but it only covers a few sports. So the solution is to make an F1 equivalent!
The Architecture: Building F1Muse
Part 0: The Data
My first instinct was to collect and store raw lap times, but a raw time is a nothing-burger statistic. A 1:25.000 at Monaco is "faster" than a 1:45.000 at Spa, even though one might be at the top of the field while the other is at the tail end. But if I told you a driver was +0.7% relative to the median of the field in Monaco, you'd have way more context: the driver was consistently seven-tenths of a percent slower than the field. This yields the normalization formula:
$$\Delta_{\text{pace}} = \frac{t_{\text{lap}} - \tilde{t}}{\tilde{t}} \times 100\%$$

where $\tilde{t}$ is the session median lap time; using the median gives outlier resistance against safety car laps, for example.
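As a minimal sketch of this computation (the helper names are mine, and I assume lap times arrive as seconds):

```typescript
// Minimal sketch of session-median normalization.
// Assumes lap times in seconds; names are illustrative.
function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2
    : sorted[mid];
}

// Percentage delta of a lap versus the session median.
// +0.7 means the lap was 0.7% slower than the median.
function paceDelta(lapSeconds: number, sessionLaps: number[]): number {
  const sessionMedian = median(sessionLaps);
  return ((lapSeconds - sessionMedian) / sessionMedian) * 100;
}
```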
| Source | Scope | Volume | Primary Utility |
|---|---|---|---|
| FastF1 | 2018–2025 | ~161k laps | Millisecond-precision telemetry & timing |
| F1DB | 1950–2025 | ~243k results | Historical entity resolution & race context |
The ingestion eliminates noise through three specific filters:
| Filter | Mechanism | Description |
|---|---|---|
| Track Status | `TrackStatus != '1'` | Reject any lap not run under green flag. FastF1 encodes '2' (Yellow), '4' (SC), '6' (VSC), and '7' (Red Flag). |
| Accuracy | `IsAccurate` | Use FastF1's boolean flag to filter GPS/timing glitches and data corruption (typically pit entry/exit). |
| Clean Air | `IntervalToAhead` | Mark `is_clean_air = true` if the gap to the car ahead exceeds 1.5s. |
I chose 1.5s because the 1.0s DRS detection zone measures the gap at a single track point, while dirty air affects cars continuously throughout the lap. Even with the 2022 ground-effect regulations, cars still lose ~18% of downforce at one car length (10m) and ~4% at three car lengths (20m). The effect is also corner-speed dependent: high-speed sections like Copse or Maggotts show severe penalties up to 1.5s gaps, while low-speed hairpins can show minimal impact. 1.5s is not a magic number, but it works as a "catch all problems" threshold, as the sketch below shows.
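Here is how the three filters combine at ingestion; the row shape and field names are assumptions that mirror the FastF1 columns named above:

```typescript
// Illustrative lap row, mirroring the FastF1 columns used in the table above.
interface LapRow {
  trackStatus: string;     // '1' = green flag
  isAccurate: boolean;     // FastF1's timing-integrity flag
  intervalToAhead: number; // gap to the car ahead, in seconds
}

const CLEAN_AIR_GAP_SECONDS = 1.5;

// A lap survives ingestion only if it was run under green flag
// and FastF1 marks its timing data as accurate.
function isValidLap(lap: LapRow): boolean {
  return lap.trackStatus === '1' && lap.isAccurate;
}

// Clean air = the car ahead is further away than the 1.5s threshold.
function isCleanAir(lap: LapRow): boolean {
  return lap.intervalToAhead > CLEAN_AIR_GAP_SECONDS;
}
```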
While choosing the database schema, there were a lot of intricate design choices (and I probably missed some), but the main ones were:
| Choice | Implementation | Description |
|---|---|---|
| Identity | Composite PK | `(season, round, track_id, driver_id, lap_number)` gives the index a sort order that makes range scans instant. |
| Precision | `NUMERIC(8,3)` | Avoids floating-point rounding errors; preserves millisecond integrity for tight aggregations. |
| Filtering | Boolean state | Maps `TrackStatus` to four flags (`is_pit_lap`, `is_out_lap`, etc.) to enable binary filtering instead of string parsing. |
| Agility | Dynamic pace | Pace is computed at query time via session medians to allow for evolving "clean air" definitions. |
Additionally, the `idx_laps_normalized_clean_air` partial index is crucial for low latency. It only indexes rows where `is_valid_lap = true AND is_pit_lap = false AND is_out_lap = false AND is_in_lap = false`, which avoids full table scans and makes clean-air queries 2-3x faster.
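A sketch of this in Postgres DDL; only the composite PK, the `NUMERIC(8,3)` lap time, the boolean flags, and the partial-index predicate come from the text above, so the remaining column names are assumptions:

```sql
-- Sketch of the laps table; columns beyond those described above are illustrative.
CREATE TABLE laps (
    season        INTEGER      NOT NULL,
    round         INTEGER      NOT NULL,
    track_id      TEXT         NOT NULL,
    driver_id     TEXT         NOT NULL,
    lap_number    INTEGER      NOT NULL,
    lap_time_s    NUMERIC(8,3) NOT NULL,  -- exact to the millisecond
    is_valid_lap  BOOLEAN      NOT NULL DEFAULT false,
    is_pit_lap    BOOLEAN      NOT NULL DEFAULT false,
    is_out_lap    BOOLEAN      NOT NULL DEFAULT false,
    is_in_lap     BOOLEAN      NOT NULL DEFAULT false,
    is_clean_air  BOOLEAN      NOT NULL DEFAULT false,
    -- Composite PK matches the dominant access pattern (range scans).
    PRIMARY KEY (season, round, track_id, driver_id, lap_number)
);

-- Partial index: rows failing the filters are never indexed at all,
-- so clean-air queries touch a much smaller B-tree.
CREATE INDEX idx_laps_normalized_clean_air
    ON laps (season, round, driver_id, lap_time_s)
    WHERE is_valid_lap = true
      AND is_pit_lap = false
      AND is_out_lap = false
      AND is_in_lap = false;
```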
Part 1: The RAG Trap
My first instinct was that Statmuse was built on a standard RAG: store lap data as JSON, let the LLM query it, and summarize the results. However, trying it out was disastrous; there were a lot of hallucinations, my favorite being how Leclerc-biased it was every time:
User: "Leclerc vs Sainz qualifying gap 2024"
GPT-4 Response: "Leclerc out-qualified Sainz in 18 out of 22 races, with an average gap of 0.087%."
The actual answer was that Leclerc out-qualified Sainz in 14 out of 22 races, with a gap of 0.149%.
Additionally, the RAG architecture requires an embedding call (150ms) -> vector search (100ms) -> LLM processing of the JSON (up to 1,000+ laps for full-season comparisons, 3000ms). That's before the other latency factors of a real deployment, so a ~3.5s base case made me rethink my approach (especially since Statmuse felt much quicker than a handful of seconds).
This made me revisit Statmuse and try to "force" it to take at least 3.5s by asking questions I assumed would never have been cached; even then it returned almost instantaneously. While poking at it, I noticed an "interpreted as: ..." section after every query. That made me pivot to an Intent-to-SQL architecture.
Intent-to-SQL Semantic Routing
Just let SQL do all the heavy lifting, not the LLM! As such, a two-step process was developed.
In this architecture, the LLM's only job is to convert the natural language query to a structured intent in JSON:
```typescript
// src/types/query-intent.ts
export type QueryIntent =
  | SeasonDriverVsDriverIntent
  | TeammateGapSummarySeasonIntent
  | QualifyingResultsSummaryIntent
  | DriverSeasonSummaryIntent
  | TrackFastestDriversIntent;
// ... 22 more types in the full union

interface SeasonDriverVsDriverIntent {
  kind: 'season_driver_vs_driver';
  driver_1_id: string;
  driver_2_id: string;
  season: number;
  track_id?: string; // optional: narrow the comparison to one circuit
}
```
This also makes caching so much easier! Since the LLM only produces JSON rather than raw answers (as in the RAG approach), you can cache at many points. After the JSON is produced, we have to validate the intent.
Intent Validation
Since I don't want to be called out for fake statistics, every JSON has to be validated so the system never executes a query blindly.
```typescript
function validateIntent(intent: QueryIntent): boolean {
  // Both drivers must exist in the database.
  if (!driverExists(intent.driver_1_id) || !driverExists(intent.driver_2_id)) {
    return false;
  }
  // F1 seasons run from 1950 to the current season.
  if (intent.season < 1950 || intent.season > CURRENT_SEASON) {
    return false;
  }
  // track_id is optional, but if present it must be a real circuit.
  if (intent.track_id && !trackExists(intent.track_id)) {
    return false;
  }
  // Finally, the intent kind itself must be one we support.
  return SUPPORTED_INTENT_TYPES.includes(intent.kind);
}
```
This allows "Verstappen vs Schumacher 2024" to be rejected rather than executed, saving compute. Once we have a validated intent, we map it to a pre-approved SQL template.
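A sketch of that mapping; the registry shape and the SQL body are assumptions (only the intent kind and the clean-air/median ideas come from earlier sections), and the table name `laps_normalized` is hypothetical:

```typescript
// Hypothetical registry of pre-approved, parameterized SQL templates.
// The LLM never writes SQL; a validated intent can only select one of these.
const SQL_TEMPLATES: Record<string, string> = {
  season_driver_vs_driver: `
    SELECT driver_id,
           AVG((lap_time_s - session_median) / session_median) * 100 AS pace_delta_pct
    FROM laps_normalized
    WHERE season = $1
      AND driver_id IN ($2, $3)
      AND is_clean_air = true
    GROUP BY driver_id
  `,
  // ... one template per supported intent kind
};

// Bind the validated intent's fields to the template's parameters.
function toQuery(intent: SeasonDriverVsDriverIntent) {
  return {
    sql: SQL_TEMPLATES[intent.kind],
    params: [intent.season, intent.driver_1_id, intent.driver_2_id],
  };
}
```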
Performance
The latency breakdown of this architecture:
| Operation | Time | Notes |
|---|---|---|
| Intent parsing (Claude) | 350-500ms | Cached after first call |
| Validation queries | 50-100ms | Simple index lookups |
| Template execution | 200-400ms | Pre-optimized SQL |
| Result formatting | 50-100ms | In-memory transformation |
| Total (uncached) | 650-1100ms | ~3-5x faster than the 3.5s RAG baseline |
We are now getting somewhere!
Part 2: The Identity Problem
Users (obviously) don't type canonical IDs, and this was a pretty annoying problem to fix. First instinct: have the LLM resolve aliases in the prompt with a hardcoded roster. The main problem with this is scale; hardcoding every driver adds a lot of token overhead. So instead: let the LLM extract names as-is, and have the backend resolve them.
The Two-Layer Approach
Layer 1: LLM Extraction (No Resolution)
The LLM extracts driver names exactly as the user typed them. No alias lookup, no canonicalization:
Example:

```
Input:  "Mad Max vs Lando 2024"
Output: {
  "kind": "season_driver_vs_driver",
  "driver_1_id": "Mad Max",
  "driver_2_id": "Lando",
  "season": 2024
}
```
The LLM doesn't know that "Mad Max" maps to "max_verstappen." It just passes through the raw string.
Layer 2: Backend Resolution (Database)
The DriverResolver class performs strict literal matching against the database. Literal matching alone can be ambiguous: "Hamilton" in a 1960 query obviously doesn't mean Lewis Hamilton. As such, DriverResolver handles ambiguity with context. For example, "Sainz" maps to carlos_sainz_jr because he is the most recent driver by that name, which distinguishes him from Carlos Sainz Sr.
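A minimal sketch of that resolution step; the lookup function and tie-breaking rule are assumptions based on the behavior described above:

```typescript
// Hypothetical record shape; the real class queries Postgres.
interface DriverRecord {
  driverId: string;    // canonical ID, e.g. "carlos_sainz_jr"
  firstSeason: number;
  lastSeason: number;
}

class DriverResolver {
  constructor(private findBySurname: (surname: string) => DriverRecord[]) {}

  // Resolve a raw string like "Sainz", using the query's season as context.
  resolve(raw: string, season?: number): string | null {
    const matches = this.findBySurname(raw.trim().toLowerCase());
    if (matches.length === 0) return null;

    // Prefer drivers whose career covers the requested season;
    // otherwise fall back to the most recent driver with that name.
    const active = season
      ? matches.filter((d) => d.firstSeason <= season && season <= d.lastSeason)
      : [];
    const pool = active.length > 0 ? active : matches;
    return [...pool].sort((a, b) => b.lastSeason - a.lastSeason)[0].driverId;
  }
}
```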
This approach allows us to not hardcode rosters, which makes it easy to add new drivers by just inserting a row in the DB instead of constantly deploying code.
Part 3: Caching!
This was my favorite part of the project; seeing a query return a result instantaneously was extremely satisfying. It was also pretty cool to build, since it was the first time I'd implemented caching in a project.
Intent Cache
The first layer caches the normalized query -> intent JSON. Every query gets normalized (lowercased, punctuation stripped, stopwords removed, whitespace collapsed) so that variations like "Verstappen vs Norris 2024" and "verstappen norris 2024" all hit the same cache key. The intent cache is the most important one (that's why it's L1): the demographic of f1muse is going to be Twitter fans mid-argument, so if a thread blows up, the same query is expected to repeat until the drama dies down (in a day, maybe?).
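A sketch of that normalization; the stopword list here is an assumption:

```typescript
// Illustrative stopword list; the real one would be tuned for F1 queries.
const STOPWORDS = new Set(['vs', 'versus', 'the', 'in', 'at', 'and']);

// "Verstappen vs Norris 2024" and "verstappen norris 2024"
// both normalize to "verstappen norris 2024".
function normalizeQuery(query: string): string {
  return query
    .toLowerCase()
    .replace(/[^\w\s]/g, ' ')              // strip punctuation
    .split(/\s+/)                          // collapse whitespace
    .filter((w) => w && !STOPWORDS.has(w)) // drop stopwords and empties
    .join(' ');
}
```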
Query Cache
Once the intent is validated, it's mapped to a SQL template, and the results are cached in the second layer. The key is a hash of the intent JSON plus the SQL template version, so changes to a template automatically invalidate its results. This is where the heavy-lifting queries (clean-air pace, session-median normalization, etc.) get cached. It covers a lot of teammate and track comparisons, so it should come in handy when a big race weekend is coming up or two teammates are extremely close in 2026.
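A sketch of the key derivation; canonicalizing the JSON is my own addition, since `JSON.stringify` is sensitive to key order:

```typescript
import { createHash } from 'node:crypto';

// Stringify with sorted keys so logically equal intents hash identically.
// (Sufficient for flat intent objects like the ones above.)
function canonicalize(obj: Record<string, unknown>): string {
  return JSON.stringify(obj, Object.keys(obj).sort());
}

// Bumping the template version invalidates every cached result for it.
function queryCacheKey(intent: QueryIntent, templateVersion: string): string {
  return createHash('sha256')
    .update(canonicalize(intent as Record<string, unknown>) + ':' + templateVersion)
    .digest('hex');
}
```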
Redis Layer (In-Memory Hot Cache)
Redis lookups take around 1-2ms, compared to 10-20ms for Postgres. That might not sound like much, but when you're serving hundreds of requests it adds up fast. The Redis layer sits in front of both the intent and query caches, acting as a hot cache for frequently accessed data. It was also built with failure in mind: if Redis ever goes down (which is inevitable), the system falls back to Postgres.
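A sketch of that read path with graceful degradation; the client functions are hypothetical, and the pattern is simply try Redis, fall back to Postgres:

```typescript
// Hypothetical cache read: try Redis first, degrade to Postgres on any failure.
async function cachedLookup(
  key: string,
  redisGet: (k: string) => Promise<string | null>,
  postgresGet: (k: string) => Promise<string | null>,
): Promise<string | null> {
  try {
    const hot = await redisGet(key);
    if (hot !== null) return hot;
  } catch {
    // Redis is down: fall through to the Postgres-backed cache.
  }
  return postgresGet(key);
}
```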
| Cache Layer | What it Stores | TTL (Current Season) | TTL (Historical) |
|---|---|---|---|
| Intent Cache | Normalized user query → JSON | 1 hour | 24 hours |
| Query Cache | SQL query results | 5 minutes | 1 hour |
| Redis Layer | Hot in-memory cache of SQL results | 1 hour | 1 hour |
The TTLs were chosen based on how quickly Twitter drama dies down, so who knows, they might change after launch.
Future
Right now, the API delivers validated, normalized statistics with decent latency, but there are several features I'm excited to explore. First, I wish I had explored the front-end more, and what implementing graphs would have looked like. I'd also like stint analysis, since race pace is currently aggregated across the entire race, and multi-season trends to add more historical context where needed. Finally, I really believe in "making my own" statistics, but that would require a lot more dedication to the math side of things than the programming side; this is mostly motivated by seeing several NBA Twitter users do the same.