-
Notifications
You must be signed in to change notification settings - Fork 135
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add builder to TableMetadata interface #62
Conversation
Hi, I'm unsure whether a generic builder is the best approach for building a I think that the main usage patterns of
So, maybe a 'builder' is required only for the Since Thank you. |
+1 with @zeodtr We should have a validation check in the final phase of build, so an auto generated builder maybe inappropricate. How about this:
|
How about a method like this for the 'CREATE TABLE' case? impl TableMetadata {
/// New struct to create a table
pub fn try_new_to_create_table(table_creation: TableCreation) -> Result<Self, Error> {
let last_column_id = table_creation.schema.highest_field_id();
let current_schema_id = table_creation.schema.schema_id();
let schemas = HashMap::from([(
table_creation.schema.schema_id(),
Arc::new(table_creation.schema),
)]);
let default_spec_id;
let partition_specs;
let last_partition_id;
match table_creation.partition_spec {
Some(partition_spec) => {
default_spec_id = partition_spec.spec_id;
last_partition_id =
max(partition_spec.fields.iter().map(|field| field.field_id)).unwrap_or(-1);
partition_specs = HashMap::from([(partition_spec.spec_id, partition_spec)]);
}
None => {
default_spec_id = 0;
last_partition_id = -1;
partition_specs = HashMap::from([(
default_spec_id,
PartitionSpec::builder()
.with_spec_id(0)
.with_fields(vec![])
.build()?,
)]);
}
}
let default_sort_order_id = table_creation.sort_order.order_id;
let sort_orders = HashMap::from([(default_sort_order_id, table_creation.sort_order)]);
Ok(Self {
format_version: FormatVersion::V2,
table_uuid: Uuid::new_v4(),
location: table_creation.location,
last_sequence_number: -1,
last_updated_ms: since_the_epoch_as_millis()?,
last_column_id,
schemas,
current_schema_id,
partition_specs,
default_spec_id,
last_partition_id,
properties: table_creation.properties,
current_snapshot_id: None,
snapshots: None,
snapshot_log: vec![],
metadata_log: vec![],
sort_orders,
default_sort_order_id,
refs: HashMap::new(),
})
}
...
}
fn since_the_epoch_as_millis() -> Result<i64, Error> {
let start = SystemTime::now();
let since_the_epoch = start.duration_since(UNIX_EPOCH)?;
Ok(since_the_epoch.as_millis().try_into()?)
} |
This is a builder tailored for |
Maybe the builder approach is ok. |
I prefer the builder approach since the transaction actions may also need to modify table metadata, such as add schema, add snapshot. |
Thanks for your feedback :) ! I will add a validation function to the builder, seems logical. From my point of view it should validate the inner field of the TableMetadata but not the inner struct that should be validated when build with their own builder (Schema, ...). I will update the pull request to have new discussion materials. |
Update builder default values Cast Builder Error to Iceberg Error
Strenghten builder validation
Ok i added some validation steps. Tell me if its ok |
impl TableMetadataBuilder { | ||
/// Initialize a TableMetadata with a TableCreation struct | ||
/// the Schema, sortOrder and PartitionSpec will be set as current | ||
pub fn with_table_creation(&mut self, tc: TableCreation) -> &mut Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pub fn with_table_creation(&mut self, tc: TableCreation) -> &mut Self { | |
pub fn from_table_creation(tc: TableCreation) -> &mut Self { |
Since all fields starts with with_
, I would suggest to change it to from
so that it's more readable.
/// Add a schema to the TableMetadata | ||
/// schema : Schema to be added or replaced | ||
/// current : True if the schema is the current one | ||
pub fn with_schema(&mut self, schema: Schema, current: bool) -> &mut Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also update last_updated_ms
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i could update it each time we have modification, but does it should be updated in theory at write time from what i read in spec ?
/// - map : the map to scan | ||
/// - default : default value for the key if any | ||
/// - field : map name for throwing a better error | ||
fn check_id_in_map<K: std::hash::Hash + Eq + std::fmt::Display + Copy, T>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need to check this at last? Why not check it during modification?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we restrict with only the given list of setter below we should not need this kind of validation as we will compute every value inside the builder. So they should be coherent.
/// Add a snapshot to the TableMetadata and update last_sequence_number | ||
/// snapshot : Snapshot to be added or replaced | ||
/// current : True if the snapshot is the current one | ||
pub fn with_snapshot(&mut self, snapshot: Snapshot, current: bool) -> &mut Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This in fact updates the main
branch, we also need to update snapshot ref?
Hi, @y0psolo Thanks for the effort! But I'm a little worried about the api here, since it's error prone, e.g. it's quite easy to construct an invalid table metadata. My suggestion would be to remove the derived builder, and for public apis of modification we can follow the java implementation: https://github.com/apache/iceberg/blob/abc3f746bfb00fdcd7ea1f170df242f1857ced75/core/src/main/java/org/apache/iceberg/TableMetadata.java#L842 |
Ok from what i read from the java class, below is the short list of functionalities you want to only provide on the builder :
I will leave only this part public and not allow the builder to generate useless method. I don't know yet if we should remove the builder as it seems to provide some internal struct builder and default value at least. We will see with what is left at the end under his responsability. |
The reason why I suggest removing derived builder is that |
do not allow generation by builder for other fields except few
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @y0psolo Thanks! I've left some comments
|
||
/// Add or replace a schema | ||
/// schema : Schema to be added or replaced | ||
pub fn with_schema(&mut self, schema: Schema) -> &mut Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pub fn with_schema(&mut self, schema: Schema) -> &mut Self { | |
pub fn add_schema(&mut self, schema: Schema) -> &mut Self { |
- Currently we use 0 as default schema id, e.g. the schema id is not set by user. I think we should check this and auto assign a new schema id for it.
- I think by default we should fail when id already exists. We can add another method for the
insert or update
behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the problematic with failing in the builder method is that we will not respect the builder pattern that should return "Self". Instead we will have a Result struct.
I think it's better to reject schema creation if no id is set if we want to avoid this corner case. For adding Schema the method should be on the TableMetadata struct itself as it may fail in case of duplicate id.
What do you think ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the problematic with failing in the builder method is that we will not respect the builder pattern that should return "Self". Instead we will have a Result struct.
Why builder pattern should return Self
rather Result
? In fact, I don't want to make a normal builder, since it's quite easy to construct an invalid table metadata.
I think it's better to reject schema creation if no id is set if we want to avoid this corner case.
This doesn't work. Thinking about the transaction api, the user wants to change the schema of the table, why he needs to know the scheme id in advance which should be allocated by catalog?
For adding Schema the method should be on the TableMetadata struct itself as it may fails in case of duplicate id.
I prefer to make TableMetadata
immutable and leave the states for validity checking in TableMetadataBuilder
.
Please be aware that the main use case of TableMetadataBuilder
is transaction api, so we should ensure that the result is valid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the problematic with failing in the builder method is that we will not respect the builder pattern that should return "Self". Instead we will have a Result struct.
Why builder pattern should return
Self
ratherResult
? In fact, I don't want to make a normal builder, since it's quite easy to construct an invalid table metadata.
to avoid code where we need to unwrap and test error at each call but have more idiomatic way to build a new struct.
I think it's better to reject schema creation if no id is set if we want to avoid this corner case.
This doesn't work. Thinking about the transaction api, the user wants to change the schema of the table, why he needs to know the scheme id in advance which should be allocated by catalog?
I think a miss a step here. The builder is here to build a struct from scratch. I f we want to update an existing struct, this is another use case and it should be handled differently. I agree with you if a user update a Schema he doesn't need to change the id and so this option should not be exposed.
If we have a look to Table Interface (https://iceberg.apache.org/javadoc/master/org/apache/iceberg/Table.html) all this functions are on the Table object, not the builder.
For adding Schema the method should be on the TableMetadata struct itself as it may fails in case of duplicate id.
I prefer to make
TableMetadata
immutable and leave the states for validity checking inTableMetadataBuilder
.Please be aware that the main use case of
TableMetadataBuilder
is transaction api, so we should ensure that the result is valid.
Thanks for letting me know this information, i don't fully know the iceberg format and internals and try to catch up. Having a TableMetadata immutable is not incompatible to have specific function for handling transaction on the TableMetadata struct itself. We could imagine having a transaction function that would return a new struct encapsulating the existing TableMetadataBuilder (and other struct if needed) and provide only the needed function to update the existing Table (like https://iceberg.apache.org/javadoc/master/org/apache/iceberg/Transaction.html). This object could queue all modification, leaving the encapsulated struct as is and when transaction is committed will deliver a new struct.
We keep immutability of underlying struct, we offer a way to restrict easily possible action on the update operation. We could reuse some part of the validation code between both.
This is just an example. I am trying to illustrate that updating and creating are different use cases and could be addressed differently in order to avoid more complexity, better control on action and better usability.
I am sorry for all these discussions but i really want to understand to best way to end this pull request. I leave below conversation unresolved as the decision on this one will certainly influence their outcome.
pub fn with_current_schema(&mut self, schema: Schema) -> &mut Self { | ||
self.current_schema_id = Some(schema.schema_id()); | ||
self.last_column_id = Some(schema.highest_field_id()); | ||
self.with_schema(schema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as problem mentioned above, we need to consider the default schema id problem.
|
||
/// Add or replace a partition_spec and update the last_partition_id accordingly | ||
/// partition_spec : PartitionSpec to be added or replaced | ||
pub fn with_partition_spec(&mut self, partition_spec: PartitionSpec) -> &mut Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pub fn with_partition_spec(&mut self, partition_spec: PartitionSpec) -> &mut Self { | |
pub fn add_partition_spec(&mut self, partition_spec: PartitionSpec) -> &mut Self { |
- There are same problems as scheme id.
- We need to validate spec with schema:
- If
source_id
exists in schema - If the field type supports such transformation.
/// Add or replace a partition_spec, update the last_partition_id accordingly | ||
/// and set the default spec id to the partition_spec id | ||
/// partition_spec : PartitionSpec to be added or replaced | ||
pub fn with_default_partition_spec(&mut self, partition_spec: PartitionSpec) -> &mut Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar problem to schema.
|
||
/// Add or replace a sort_order | ||
/// sort_order : SortOrder to be added or replaced | ||
pub fn with_sort_order(&mut self, sort_order: SortOrder) -> &mut Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar problem as partition_spec
|
||
/// Add or replace a snapshot to the main branch, update last_sequence_number | ||
/// snapshot : Snapshot to be added or replaced | ||
pub fn with_snapshot(&mut self, snapshot: Snapshot) -> &mut Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pub fn with_snapshot(&mut self, snapshot: Snapshot) -> &mut Self { | |
pub fn add_snapshot(&mut self, snapshot: Snapshot) -> &mut Self { |
if self.last_updated_ms < Some(snapshot.timestamp()) { | ||
self.last_updated_ms = Some(snapshot.timestamp()); | ||
} | ||
self.with_ref( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should copy existing snapshot ref's retention policy if it already exists.
related issue : #52
Add a builder for TableMetadata interface