-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature RFC: Flatbuffers support for Optional types. #6014
Comments
I mentioned A:
B:
|
A: B:
That way the user would have control of what sentinel to use. |
I agree this is an issue. I mentioned this in the FB2 discussion as analyze SQL NULL values. I have been thinking about this problem, A) using is present: this doesn't work well with tables and strings etc. because these have not default value. A null value would have to be an offset to a special null object that would have to be checked on access, or the current null value would have to change interpretation. On a related note I have been considering offering a dummy object instead of NULL when the C api discovers a null because that is a safer and simpler approach. You can ask if it is a dummy object explicitly, and otherwise access it safely but getting only dummy results. B) sentinel values are not good, and if they are, they are better done by the end user for specific use cases, i.e. FlatBuffers need not change. C) My proposal. Add an additional bit-vector field at vtable offset 0 for table that have optional values. The bit-vector is stored inline in the table, but could also be an external offset if is believed that a) the are large enough to make this worthwhile, and b) data sharing is likely. Now you can access data just as before, without overhead, but you can explicitly ask if the field is marked as NULL, if you care. This can be very efficient, and the builder can easily add this information as it already has vtable vector in memory that it writes out, and can add the null vector at the same time. |
The reason why B) is not good is because you cannot roundtrip SQL data transparently. This is an absolute requirement for this feature. |
One thing we can do for (A) and the "simultaneous implementation problem," which I agree is a huge problem, instead of implementing it in a huge PR, is to not generate code for optional fields (and warn) if the language isn't ready yet. I think this will at least make code review easier. Perhaps we can also give users the option to use existing codegen and zero-defaults if they really want it and promise they'll always check for field presence manually... but such an option is definitely a foot-gun. I agree with the use of nullable pointers over optional for performance reasons. Rust should prefer Option<&mut T> (if it had a mutation API) since raw pointers are frowned upon in that language and it'll compile down to a nullable pointer anyway. |
Oh, regarding C) it might be strange if a NULL field returns 4 because 4 happens to be the default value. For this reason it makes sense to support a default null value also. That is effectively option B), but without the hack since you have precise NULL information. EDIT: this would add runtime overhead because it would need to check the null table to see if default or null default should be returned. But at least it only happens for absent values. On Wouters suggestion to have a parallel set of accessors. That would add a lot of loud on compilers to strip out even more templates in C++ or macros generated inlines in C, and complicate code generation quite a bit in C. |
We could also require the default values only apply for non-optional fields, and always store values in optional fields, unless they are NULL, and have the bit vector. The default value then becomes the return value if NULL, and NULL can be checked separately. |
@paulovap the sentinal IS the default, so if you wanted to use @mikkelfj certainly do not want to add extra encoding to the binary to support this feature, I think that is too high a cost for such a simple feature, even compared to A). @CasperN we'd have to actively error-out if someone compiles with an unsupported language, rather than ignoring the field, I think, like we currently do with other unsupported features. Doing that in multiple PRs would be ok, as long there is an effort to support most languages. |
As @aardappel already said, most of the languages can return a null value (from the ones I’ve worked with) so why not introduce an optional flag ‘age: int (optional)’ and instead of returning a default value we just return null. Most of languages has some type of optional, if they don’t we can mark the name of the object as optional when we generate the code. This won’t require us to play with the binary data, and would only require small refactoring for each of the language generators. As as example swift already can return optional structs, tables, strings (we are just missing scaler optional) |
Well, if the cost is too high, I think we should not do it. I'd rather not have a half-baked solution on top of an already half-baked solution. |
@mikkelfj what is half baked about A) ? only scalars require this feature, all other types can already be null. |
Can you elaborate on
I'm not sure what you mean by special null object, can you give an example? |
With C++ the simplest way is to embed an optional scalar into a struct. table Foo{
bar : int (boxed);
}
// implicitly translated to:
table Foo{
struct _ { bar : int = 0 }; // anonymous transparent proxy
}
//, or translated to
table Foo{
struct bar { _ : int = 0 }; // named proxy
} Struct might be present or not. "Foo" : {
"bar" : [-12]
}
// or
"Foo" : {
"bar" : { "_" :-12 }
} |
I'm for option A, since option B is what we have now in practice. On how to implement A - well, that's another problem. As far as I know, all our supported languages (maybe except Lobster, whatever that is 😄 ) have some notion of optional data, so we're fine in terms of compatibility. Of course, this would require making changes everywhere, but I personally, am fine with such cost if we finally get proper optional data support for scalars. |
First, the current use of null and default values is half-baked because strings and other non-inline data do not have default values, different from scalars. We could add default values for strings. Second, hijacking presence for NULL is half-baked because it hijacks the current use of default values for a differen purposes. Granted, it can work if you change schema semantics for optional values, but now you have yet another layer of whatever presence and null offsets means, depending on a schema flag. I'd rather have null as a clearly stated non-negotiable data point for all imaginable data types. If not a bit vector, it could also be a special vtable offset value like 0xffff - it takes up more space, but only affects verifiers since non-supporting implementations would just see a future version of the format and choose the default value of the field, while supporting implementations would only take a performance hit if detecting the field is out of range, which it has to check anyway.
It was actually Wouter's suggestion originally when I once mentioned the lack of Null transparency for SQL data. The idea is that to avoid interfering with the current null interpretation, you store a special object in the buffer. The uoffset (not voffset) poins to this object for NULL entries. It can be added to the buffer as an object with just an empty vtable (+/- handling of required fields for the type, but ideally just one object for all types, but NULLable shouldn't have required fields?). Any user level code can do this today, except it is hard to tell if an object is the NULL object or some other object. One could argue that you could just store and actuall NULL instead, but that changes the semantics of the current system and would prevent adding default values for strings and tables etc. |
To extract a key point from my above comment: D) store NULL values as virtual table voffset 0xffff because it is transparent to current implementations and acts the same as a missing entry with voffset 0x0000 except it looks like a future version instead of entirely missing. in both cases a default value is returned. It is now possible to add a separate support function: IsNull() similar to IsPresent(), but it avoids overloading current semantics, and it does not introduce a new bitmap data structure as my proposal C). The benefit of C) is that you can have short vtables if all the last fields are NULL, but I think D) is probably preferable. A larger benefit of D is that you entirely avoid changing the schema semantics with an 'optional' setting because you can just store as value as NULL by adding an API function: SetNull() in the builder. Of course you could add attributes to block, or allow NULL, but it is not required. |
Sorry, I'm not thinking straight. You cannot transparently store 0xffff as a vtable offset. Future versioning requires a field in the vtable outside the currently expected max vtable size, but you cannot store invalid offsets within the table because common reader logic will not check the validity of this offset, only verifiers will. The approach could still work if all readers learned to check this similar to the 0 check they perform now, but that is not practical. That is why the bitmap approach is better. |
I see it as as having the same semantics, just that the default value is a value outside of the type of the scalar itself, NULL. I might be misinterpreting you but I think you're looking for a default vs null system for vectors, tables, and strings (in addition to scalars). The current state of the world is that scalars are default-y and the more complex objects are null-y. This proposal allows users to choose default-y or null-y semantics for scalars but you're worried that it may complicate a future feature where users can choose to make the complex stuff default-y. Therefore this proposal should also design a solution so tables, strings, and vectors can be default-y. On the bitmap thing, I'm not sure of the value of introducing bits to track "default value" vs "null value". It seems to me like a premature optimization of a case where some default and null values are very common, that has a cost on every other use case, though I'm not sure what cases you have in mind. |
I believe that is a separate issue. Not having defaults is maybe not great, but it doesn't affect the string's ability to be optional in the sense of this issue. In fact, non-inline data (and structs) are ahead of scalars in this sense. There are "issues" with supporting default values. For strings it is reasonably easy, but some form of default table is not exactly a lightweight feature. We're not hijacking anything, we're merely extending the existing semantics. Under proposal A), an And I am not at all familiar with SQL, so I really don't follow your concerns around null in general. I am mostly familiar with the nulls in C/C++/Java.. and I believe Flatbuffers follows a similar model. |
NULL in SQL is controversial because the original designers (see book by C.J. Date on SQL) considered it impure in an otherwise elegant and consistent model based on algebra. However, this leads to requiring a defined 0-value value (like an empty string, or a numeric -9999, or whatever) in the domain of any given type, such as integers or strings. Pragmatics found it better to explicitly have a NULL value outside the value domain when they had lots of tables with data in which only few contained actual data. You can also select records where fields are not NULL or vice versa without having to define what a 0-value is in a given context. In actual SQL use all of this can be condensed down to: If you have data extracted from a database that you want to process and reinsert into the database, you can expect to see NULL values, and you cannot consistently assume that you can map NULL to some other value (a 0-value) because that might overlap with a value meaning something else. So if you need to reinsert data without loss, you must keep track of the NULLs. Of course this is a mess, but a practical necessity. Now, for this to work with FlatBuffers, you need to know exactly what is a NULL and not just what is NULL by convention wether it be a NULL string, or a default scalar value or an absent scalar that is different from a present default value. If you add optional types to the schema, you can impose exact semantics for NULL simply by defining what NULL means for the given type and make sure that the equivalent type for SQL usage can represent all values in addition to NULL. This can actually work, but it is somewhat messy and hard to explain, but can be helped with API support. If you do not modify the schema but just decides what constitutes a NULL, you are open to all sorts of interpretations in present use which will not work. Optional types corresponds to SQL schema types where a field can be allowed to be NULLable, or not. You also need to consider how the data can be represented in JSON without loosing NULL, but this should be workable. I'd prefer a bitmap because it takes all guesswork away for any current and future types. It can be directly exported to other data formats such as Arrow that also uses bitmaps for NULLs, and it is fast to process because you can just scan the bitmap rather than special casing each kind of value that can be NULL. Bitmaps do not interfere with current implementations because they are just an extra field in a table that you are not forced to look at if you do not care, but of course you cannot write NULLs if your API does not support it, unless you opt to write the NULL bitmap yourself, which is possible because it would just appear as a ubyte array. So, if you do not want bitmaps, it can still be made to work if you introduce the optional attribute in the schema, but if you do not, it will be a mess. |
What about introducing the optional scalar types:
where compiler when sees those per default creates structs:
And languages which support actual optionals can replace this default behaviour by generating an API with optionals. This way there is an option to have a "small" non breaking change which will work for all languages and opens up gradual migration path for every language specific code generator. PS: If by any chance people already have a |
This seems like options (B) and (A) respectively.
I think we know exactly what is a NULL under (A) for optional scalars in a table. They are null if and only if IsPresent == False.
I'm focusing on the words "in present use": I think you mean fields that are currently scalars cannot safely be redefined into optional scalars because we cannot distinguish between NULL and default. This is correct and intended. For existing fields, changing the optional-ness is a bad idea (and users will need to be educated about that). Adding bits, as you suggest in option (c), will allow for retrofitting null onto existing scalars. Are you claiming retrofit-ability should be a requirement? Fwiw, my opinion is that allowing existing scalars to convert into being optional is not worth the extra bits of option (c) and educating users that "optional (or lack thereof) is permanent" is sufficient for this feature.
Good point. At first glance, json's I like your idea because it lets all languages generate a default API for optionals, though it is essentially syntactic sugar in the schema. What would happen to the Int/Uint after the language in question adds optional support? Would that be a code-breaking change? Regardless, I think we should start with the "error-out if someone compiles [optional scalars] with an unsupported language" approach, since its a lot easier. Your idea should be revisited in the scenario where we go ahead with option (A) but haven't implemented optional scalars in all languages and really need to. However, I don't think we'll end up in that situation since a prerequisite of accepting this feature is confirming we'll actually do the work in enough languages. @aardappel, what would you like to see before deciding between (a) and (b) and giving a go/no-go on this feature? What is the minimum set of languages that must support this feature for it to be worth it? I can volunteer for the following tasks (unless someone else wants them):
|
One the one hand it would be useful, on the other hand it still requires marking the table as having NULLable fields, as opposed to marking individual fields. Bitmaps also provide fast processing of data which can be rather helpful, and it takes up less space than forcing default writes on some cases. But mostly it rubs me the wrong way that you can declare -4 to be the default value, but it really means nothing, since -4 will be stored if not NULL, and if NULL, well, it doesn't really matter what the default value is. |
Here is how I imagined my proposal in detail:
Step one can be done by one person and will work even without step two, the "old" code generators will just generate code for the Potentially the
Where in all languages which have a native UUID type (e.g. Swift), the native type is used and for other languages a struct is generated. |
@mikkelfj from what you say, you'd be happy with A) also, it encodes a null value out of the domain explicitly. Adding bits wouldn't make this any different, they are redundant. @mzaks that could work, but I wouldn't want to clutter generated code with all these definitions for these auto generated structs. Option A) is what you get if you instead access the scalar directly as if it was a struct. I'd prefer that to adding new types. @CasperN option B) is trivial, it is merely a convenience feature on the status quo. It is just |
@aardappel, I can do tracking/coordination too. @mzaks I'm a little wary of making that part of making your SyntacticTypes feature part of optional scalars since it seems like a bit of additional work. I think it should be a separate feature request. Can everyone participating vote on A vs B vs do-nothing and maybe volunteer a language to implement? I support (A). I'll track work by editing the top comment |
Ok, @CasperN is in charge of this feature :) |
What is the plan for implementing of (A)?
That’s all. These changes are compatible with already existing binaries and user programs. Example: table Monster { mana : int = -1 (maybe); } class Monster {
int16_t mana() const { return value or default; }
flatbuffres::optional<int16_t> get_mana() const;
void add_mana(int16_t mana) {
fbb_.AddElement<int16_t>(Monster::VT_MANA, mana); // WITHOUT default value!
}
};
// Optional for default-constructable and copyable T.
// Can have a special implementation if T is a pointer.
template <class T > class optional {
optional(); // none with T{};
optional(bool none, const T& val); // only with explicit default value
T& value(); // return ref to value or ref to default value (never throw)
T& operator*(); // return value()
operator bool() const;
bool has_value() const;
std::optional<T> get() const; // down-cast to std::optional<T>
operator std::optional<T>() const; // implicit down-cast to std::optional<T>
}; |
Currently they're optional in the binary, but non-presence is interpreted as "return the default value according to the schema" so from the perspective of user's code, "all values are there". Proposal (A) allows non-presence to be interpreted to mean None (ie "Nothing" in the words of haskell's optional type, "maybe"). That is semantically at odds with "non-presence means default". I think your table:
should be an error because I think there shouldn't be a non-null default for optional types. The following is fine.
Minor bike-shedding idea, we could specify optional types like so:
This visually implies that optional types are at odds with default values and is "consistent" since the value to the right of the equals sign is what we interpret non-presence to mean.
As discussed above, changing a field from default-y to optional is a binary-semantic breaking change. It is also intended to be a code breaking change (in all languages except maybe python) since we want to show, in code, that the type is optional. And not just in the object API, even the normal accessor should change: e.g.
|
To me it appears that flatcc is another implementation of flatc in a
different language. There may be implementation differences, but the
overall concept remains the same.
My claim is that "required" is used with the same semantics. Not providing
that attribute is an error (crash and burn).
Except that we do this at object construction time since no deserialization
is involved.
…On Tue, Oct 27, 2020, 10:57 PM MikkelFJ ***@***.***> wrote:
@adsharma <https://github.com/adsharma>
I also use the flatcc (for C) compiler as an IDL for other purposes.
However, I think it is a bad idea to introduce different semantics. If you
have special needs you can use attributes to specify these.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#6014 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAFA2A7JSGCY43LSNDJMCH3SM6XF5ANCNFSM4OLQQVUQ>
.
|
flatcc generates flatbuffer code in C. It is also a schema compiler library that can be used for other purposes. I use it for other purposes in addition to processing flatbuffer files. |
Your original comment seems to suggest you're using
|
No, we are only focusing on table fields in this issue. |
@adsharma you are free to do as you please of course, but it would be nice for your users to alert them to the fact that your tool uses an incompatible dialect of FlatBuffer schemas, and how so, if you feel strongly that you want to change what things mean to |
@aardappel - I don't want to gratuitously do something different, but here's my thinking on the matter. The way we're using flatc.py, we're competing with:
If I push In most programming languages (rust, kotlin and swift among others), we have |
To me, people rejecting your solution because of such a small feature would be unlikely, and I think generally that a tool adheres to (and is compatible with) a well known standard is much more important. Also, "the billion dollar mistake" is about implicit nulls, and the new use here is explicit (requires checking to access, at least in language that support it, like C++, Rust and C#). We already had nulls before, and any implicitness is due to the language they're implemented in, not FlatBuffers. |
I'm not entirely comfortable with the way required and optional differ for different value types, so I understand @adsharmas concern. I still think schema compatibility is more important. However, if we improve FlatBuffers to more suitable as a DDL I'm all for it. That is part of my lengthy arguments in the past. But I fear that ship has sailed. |
@mikkelfj @aardappel - I can look into supporting the |
Voting/requesting support in C# and typescript :) |
Oh we did that! I just forgot to update the top comment (doing that now). |
Is there support for optional types in python? The top comment says there is not. Is that planned? |
works just fine for me in python3. Perhaps the issue is specific to how flatbuffers optional idea interacts with the python type system? |
If I attempt to compile schema that has optional scalars with
|
We would support it, there's no technical blocker, its just unstaffed. I'm happy to review a PR though |
It looks like the support for Python is in this PR, #7318 |
Note: This top comment is heavily edited. It was kept up with the current state until this issue closed.
Motivation
Flatbuffers should allow users to choose to use optional types. There has been some interest in distinguishing between default values (which are not stored in the binary) and some notion of None which the user controls.
Here are some links to previous interest in this idea:
Currently, a user can control a field's presence in the binary by specifying "force_defaults" and checking "IsFieldPresent" which is a bit of a hack. This proposal should define proper Flatbuffers optional types, which should be a better way of doing this. Use of this feature is only advisable for new fields, since changing default values is in general backwards-incompatible.
How do we represent this in the schema file?
We will specify it like so
This visually implies that optional types are at odds with default values and is "consistent" since the value to the right of the equals sign is what we interpret non-presence to mean.
Change to Schema Grammar:
field_decl = ident : type [ = (scalar |
null
) ] metadata ;We can add a field tag, e.g. "optional" or "no_default", that triggers this behavior. Hopefully no one is using those tags. Maybe we can make it specifiable to flatc, an "--optional-field-keyword-tag" flag, just in case people are using it and can't stop.How do we represent this in the binary?
We are going with option (A).
(A) Non-Presence means None
Instead of omitting zero-like values, the generated code must store them. Non-presence for optional fields no longer means "whatever the default is," now it means None. You can interpret it as "the default value is None". This also means we cannot specify both specify a non-null default and mark the field as optional.
Pros:
Cons:
(B) Some Sentinel value means NoneIn this scenario, zero-like values are still not stored. Instead we choose some "sentinel" value which we interpret to be None (e.g.
int
can useint_min
andfloat
can use some kind ofNan
).Pros:
Cons:
How do we represent this in every language API?
We'll need to change the type signature of all generated code (building/reading/mutating/object/etc) around the optional type to signal its optional-ness. I think we should use the language's local standard for optional types. Suggestions:
Optional[T]
.Option<T>
.std::optional<T>
but its not obvious what to use for earlier versions.T*
would work.Optional
shows up in Java 1.8 and triggers autoboxing, so idk :/The exact generated-API for a language should be discussed in the PR implementing this feature in that language.
Out of scope
(I'll add links if you make issues for these feature requests)
TODO
[edits]
= null
syntax in schema file (cross out alternatives)The text was updated successfully, but these errors were encountered: