Make row ordering deterministic #30

olivierdalang · 2021-10-12T10:16:54Z

Hi !

It seems the rows in the resulting SQL statement are not deterministic (two identical calls on the same database can result in different SQLs, containing the same rows but in different orders). This is a bit of a deal breaker when dumps must be versionned.

Postgres doesn't guarantee rows orders unless there's a ORDER BY clause.

So maybe instead of COPY $sample_table TO STDOUT we should have COPY (SELECT * FROM $sample_table ORDER BY id) TO STDOUT (with some logic to retrieve the primary key).

Thanks !!

The text was updated successfully, but these errors were encountered:

mla · 2021-10-17T17:11:21Z

Seems doable. Just to understand the use case, why are you versioning them?

There's no actual guarantee that the selected rows will be the same anyway, is there? I mean, you can explicitly request "random" results, but otherwise, there's no selection criteria, so we're just relying on whichever rows Pg decides to give us. Usually that's deterministic, but idk if you were to rebuild or change indexes or some such if that would affect which rows are returned. IOW, I'm not sure you can rely on the row selection being deterministic, if that's what you need...

olivierdalang · 2021-10-20T08:59:38Z

I'm versioning them because I use pg_sample to produce a seed data dump for development extracted from production data. It's an iterative process as I include more tables every now and then, and want to keep these changes as small as possible to it's easier to see what actually changed.

Of course if the rows are not the same (due to random selection or changes in the data) I don't expect it to be the same :-) But as you said, currently it outputs the row in order postgres returns them, and as you hinted it's indeed not guaranteed to be deterministic (vacuum, indices rebuilds, etc). The only way to get guaranteed deterministic results is to specify the ordering explicitly.

mla · 2021-11-07T20:57:04Z

Thank you for the commits @olivierdalang. I merged it, but see my changes. There were some issues when a table doesn't have a primary key defined (I expanded it to look for any candidate key, but try to use primary key first if avail).

And there were some issues with quoting if the index columns have reserved words and such (see my quote_identifier calls).

It's passing the tests with --ordered applied now.

I think we should rename the option though. We have a --random option, which randomizes the sample selection. I would have assumed that --ordered is the opposite... but it's not. --ordered orders just the output of the sample set, and the sample itself may still be random or may change if it's rebuilt, etc.

I'm still not clear on the use-case, tbh, but I'm happy to keep the option if it's helpful, but let's think of a different name.... --order-sample-output or --order-sample-result or some such?

olivierdalang · 2021-11-08T10:08:55Z

Thanks for the merge !

From your comment, I realize my pull request only does half of the job (as actually --ordered is supposed be the opposite of --random, and should sort on the source table). Just opened a followup PR for that : #33

Sorry, I went a bit too quickly over that 😳 I'd also be happy to include a test case for this new option, but a bit lost about how tests cases are organized (not familiar with perl at all)

mla · 2021-11-09T06:22:38Z

The tests are under the t/ subdir.

Perl uses TAP (Test Anything Protocol):
https://en.wikipedia.org/wiki/Test_Anything_Protocol

If you have Perl installed locally then you probably have "prove", which runs tests. If you run prove from the project root, it should try to run everything.

e..g, I get

$ prove
t/pg_sample.t .. ok     
All tests successful.
Files=1, Tests=12,  1 wallclock secs ( 0.01 usr  0.00 sys +  0.20 cusr  0.03 csys =  0.24 CPU)
Result: PASS

Run prove --verbose for verbosity.

If you can't find prove you should still be able to run the tests directly: perl t/pg_sample.t

If you take a look a pg_sample.t, you'll see all we're doing is creating a bunch of tables with FKs and such, populating them, then running pg_sample on it. We then load the resulting sample and do some basic checks on it.

Note that some of the naming is intentionally weird, to test using removed words and such (to make sure our quoting/escaping works).

mla · 2021-11-09T06:26:06Z

On PR #33 can you compare to the changes I made on dev? I added a find_candidate_key() method or some such which I think is more general.

I'm not sure about the '$table'::regclass thing. Will that work correctly if there's multiple tables with the same name in different schemas?

And my code will work on tables that do not have a primary key.

olivierdalang · 2021-11-09T09:01:26Z

I merged your changes in #33, so I there's no '$table'::regclass anymore (look at the changes in https://github.com/mla/pg_sample/pull/33/files -> I'll remove unrelated whitespace changes).

Ok for the tests, will give it a shot to try to make a testcase for this feature.

Thanks for willing to accept my patches and the primer about Perl testing :-)

olivierdalang · 2021-11-09T14:58:45Z

(continuing the discussion about the PR on the PR itself)

mla · 2021-11-29T00:03:49Z

Sorry for the delay on this @olivierdalang. Just merged into dev. Looks great to me, thank you. And thanks for adding the tests.

Will make a new release soon.

olivierdalang mentioned this issue Nov 3, 2021

Added --ordered flag to make order deterministic #32

Merged

mla added a commit that referenced this issue Nov 7, 2021

make --ordered more robust; #30

f0bc418

mla closed this as completed Nov 29, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make row ordering deterministic #30

Make row ordering deterministic #30

olivierdalang commented Oct 12, 2021

mla commented Oct 17, 2021 •

edited

Loading

olivierdalang commented Oct 20, 2021

mla commented Nov 7, 2021

olivierdalang commented Nov 8, 2021 •

edited

Loading

mla commented Nov 9, 2021

mla commented Nov 9, 2021

olivierdalang commented Nov 9, 2021

olivierdalang commented Nov 9, 2021

mla commented Nov 29, 2021

Make row ordering deterministic #30

Make row ordering deterministic #30

Comments

olivierdalang commented Oct 12, 2021

mla commented Oct 17, 2021 • edited Loading

olivierdalang commented Oct 20, 2021

mla commented Nov 7, 2021

olivierdalang commented Nov 8, 2021 • edited Loading

mla commented Nov 9, 2021

mla commented Nov 9, 2021

olivierdalang commented Nov 9, 2021

olivierdalang commented Nov 9, 2021

mla commented Nov 29, 2021

mla commented Oct 17, 2021 •

edited

Loading

olivierdalang commented Nov 8, 2021 •

edited

Loading