-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make row ordering deterministic #30
Comments
Seems doable. Just to understand the use case, why are you versioning them? There's no actual guarantee that the selected rows will be the same anyway, is there? I mean, you can explicitly request "random" results, but otherwise, there's no selection criteria, so we're just relying on whichever rows Pg decides to give us. Usually that's deterministic, but idk if you were to rebuild or change indexes or some such if that would affect which rows are returned. IOW, I'm not sure you can rely on the row selection being deterministic, if that's what you need... |
I'm versioning them because I use pg_sample to produce a seed data dump for development extracted from production data. It's an iterative process as I include more tables every now and then, and want to keep these changes as small as possible to it's easier to see what actually changed. Of course if the rows are not the same (due to random selection or changes in the data) I don't expect it to be the same :-) But as you said, currently it outputs the row in order postgres returns them, and as you hinted it's indeed not guaranteed to be deterministic (vacuum, indices rebuilds, etc). The only way to get guaranteed deterministic results is to specify the ordering explicitly. |
Thank you for the commits @olivierdalang. I merged it, but see my changes. There were some issues when a table doesn't have a primary key defined (I expanded it to look for any candidate key, but try to use primary key first if avail). And there were some issues with quoting if the index columns have reserved words and such (see my quote_identifier calls). It's passing the tests with --ordered applied now. I think we should rename the option though. We have a --random option, which randomizes the sample selection. I would have assumed that --ordered is the opposite... but it's not. --ordered orders just the output of the sample set, and the sample itself may still be random or may change if it's rebuilt, etc. I'm still not clear on the use-case, tbh, but I'm happy to keep the option if it's helpful, but let's think of a different name.... --order-sample-output or --order-sample-result or some such? |
Thanks for the merge ! From your comment, I realize my pull request only does half of the job (as actually Sorry, I went a bit too quickly over that 😳 I'd also be happy to include a test case for this new option, but a bit lost about how tests cases are organized (not familiar with perl at all) |
The tests are under the t/ subdir. Perl uses TAP (Test Anything Protocol): If you have Perl installed locally then you probably have "prove", which runs tests. If you run prove from the project root, it should try to run everything. e..g, I get
Run prove --verbose for verbosity. If you can't find prove you should still be able to run the tests directly: perl t/pg_sample.t If you take a look a pg_sample.t, you'll see all we're doing is creating a bunch of tables with FKs and such, populating them, then running pg_sample on it. We then load the resulting sample and do some basic checks on it. Note that some of the naming is intentionally weird, to test using removed words and such (to make sure our quoting/escaping works). |
On PR #33 can you compare to the changes I made on dev? I added a find_candidate_key() method or some such which I think is more general. I'm not sure about the '$table'::regclass thing. Will that work correctly if there's multiple tables with the same name in different schemas? And my code will work on tables that do not have a primary key. |
I merged your changes in #33, so I there's no Ok for the tests, will give it a shot to try to make a testcase for this feature. Thanks for willing to accept my patches and the primer about Perl testing :-) |
(continuing the discussion about the PR on the PR itself) |
Sorry for the delay on this @olivierdalang. Just merged into dev. Looks great to me, thank you. And thanks for adding the tests. Will make a new release soon. |
Hi !
It seems the rows in the resulting SQL statement are not deterministic (two identical calls on the same database can result in different SQLs, containing the same rows but in different orders). This is a bit of a deal breaker when dumps must be versionned.
Postgres doesn't guarantee rows orders unless there's a ORDER BY clause.
So maybe instead of
COPY $sample_table TO STDOUT
we should haveCOPY (SELECT * FROM $sample_table ORDER BY id) TO STDOUT
(with some logic to retrieve the primary key).Thanks !!
The text was updated successfully, but these errors were encountered: