datalake: compile-time schema descriptors for table_definition#30170
datalake: compile-time schema descriptors for table_definition#30170andrwng merged 1 commit intoredpanda-data:devfrom
Conversation
4139da5 to
74d8b19
Compare
There was a problem hiding this comment.
Pull request overview
Introduces a compile-time schema descriptor system for datalake table definitions, replacing hand-built runtime schema construction and brittle positional access for the embedded redpanda system struct.
Changes:
- Added
schema_descriptor.hprovidingstruct_desc/field_descdescriptors, compile-timeindex_of, and typedtype_field/value_fieldaccessors. - Updated
table_definition.*to define the schemaless schema via descriptors and to centralize construction of theredpandastruct_value. - Updated
record_translator.ccto use typed accessors (rp_struct_type/rp_struct_value) instead offields[0]indexing.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| src/v/datalake/table_definition.h | Defines compile-time schema descriptors and adds typed accessors + build_rp_struct declaration. |
| src/v/datalake/table_definition.cc | Builds runtime schema via descriptors and implements centralized build_rp_struct. |
| src/v/datalake/schema_descriptor.h | New compile-time descriptor framework and typed field access helpers. |
| src/v/datalake/record_translator.cc | Replaces brittle positional access with typed accessors for the redpanda nested struct. |
| src/v/datalake/BUILD | Wires the new header and its dependencies into the build. |
06180af to
e4024cb
Compare
oleiman
left a comment
There was a problem hiding this comment.
Really cool. The structural ct_string is a nice touch. I need to review with fresh eyes, but my intuition is that if it builds and tests pass it's almost certainly correct?
Yeah that's my intuition too. |
bharathv
left a comment
There was a problem hiding this comment.
a nitpick but looks good to me, I guess if it compiles, its ok?
| template<typename T> | ||
| static constexpr bool is_nested_v = false; | ||
| template<typename... Fs> | ||
| static constexpr bool is_nested_v<struct_desc<Fs...>> = true; | ||
| template<typename E> | ||
| static constexpr bool is_nested_v<list_desc<E>> = true; | ||
|
|
||
| template<typename F> | ||
| static auto build_iceberg_type(int& next_id) { | ||
| if constexpr (is_nested_v<typename F::type>) { | ||
| return F::type::build_impl(next_id); | ||
| } else { | ||
| return typename F::type{}; | ||
| } | ||
| } |
There was a problem hiding this comment.
I wonder if this could be a concept, a new (future) type without a is_nested_v specialization defaults to the else branch, maybe a bit of surprise.
oleiman
left a comment
There was a problem hiding this comment.
+1 for "if it compiles it's good"
Replace the hand-rolled schemaless_struct_type() with a compile-time
schema descriptor system. The schema is defined as a type:
```
using schemaless_desc = struct_desc<
field_desc<"redpanda", rp_desc>>;
```
From this single definition:
- schemaless_struct_type() generates the runtime iceberg::struct_type
- index_of<"field_name"> provides compile-time field indices
- type_field/value_field accessors replace brittle fields[0]
- rp_struct_type()/rp_struct_value() provide named access
So code like this:
```
std::unique_ptr<iceberg::struct_value> build_rp_struct(
model::partition_id pid,
kafka::offset o,
std::optional<iobuf> key,
model::timestamp ts,
model::timestamp_type ts_t,
const chunked_vector<model::record_header>& headers) {
auto system_data = std::make_unique<iceberg::struct_value>();
system_data->fields.reserve(6);
system_data->fields.emplace_back(iceberg::int_value(pid));
system_data->fields.emplace_back(iceberg::long_value(o));
// NOTE: Kafka uses milliseconds, Iceberg uses microseconds.
system_data->fields.emplace_back(
iceberg::timestamptz_value(ts.value() * 1000));
... brittle, carefully ordered code ...
```
...becomes:
```
std::unique_ptr<iceberg::struct_value> build_rp_struct(
model::partition_id pid,
kafka::offset o,
std::optional<iobuf> key,
model::timestamp ts,
model::timestamp_type ts_t,
const chunked_vector<model::record_header>& headers) {
return rp_desc::build_value(
val<"partition">(iceberg::int_value(pid)),
val<"offset">(iceberg::long_value(o)),
val<"timestamp">(iceberg::timestamptz_value(ts.value() * 1000)),
```
and the definitions of compile-time schemas becomes more crisply defined
as well.
e4024cb to
9552c11
Compare
Replace the hand-rolled schemaless_struct_type() with a compile-time schema descriptor system. The schema is defined as a type:
From this single definition:
So code like this:
...becomes:
and the definitions of compile-time schemas becomes more crisply defined
as well.
Backports Required
Release Notes