
Sync hudi master #1

Merged
merged 102 commits
Jan 26, 2021
Conversation

loukey-lj
Owner

Tips

What is the purpose of the pull request

(For example: This pull request adds a quick-start document.)

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick one of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

leesf and others added 30 commits November 28, 2020 21:47
* add support for OpenJ9 VM
* add 32bit openJ9
* Pulled the memory layout specs into their own classes.
…ndex (#2248)

- Works only for overwrite payload (default)
- Does not alter current semantics otherwise 

Co-authored-by: Ryan Pifer <ryanpife@amazon.com>
… schema using spark-avro conversion (#2192)

Co-authored-by: liujh <liujh@t3go.cn>
#2278)

* [HUDI-1412] Make HoodieWriteConfig support setting different default value according to engine type
Co-authored-by: Xi Chen <chenxi07@qiyi.com>
* Fix flaky MOR unit test

* Update Spark APIs to make it be compatible with both spark2 & spark3

* Refactor bulk insert v2 part to make Hudi be able to compile with Spark3

* Add spark3 profile to handle fasterxml & spark version

* Create hudi-spark-common module & refactor hudi-spark related modules

Co-authored-by: Wenning Ding <wenningd@amazon.com>
Fixed the logic to get partition path in Copier and Exporter utilities.
Co-authored-by: zhang wen <wen.zhang@dmall.com>
Co-authored-by: zhang wen <steven@stevendeMac-mini.local>
…rce writing (#2233)

Co-authored-by: Wenning Ding <wenningd@amazon.com>
…streamer (#2264)

- Adds ability to list only recent date based partitions from source data.
- Parallelizes listing for faster tailing of DFSSources
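The idea above can be sketched in plain Java. This is a hypothetical illustration, not the DeltaStreamer source code: `listRecent` and the `lister` callback are made-up names standing in for a DFS list call, and the "most recent date partitions" filter simply assumes yyyy-MM-dd partition names. Each selected partition is listed on its own pool thread instead of sequentially.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;
import java.util.stream.Collectors;

public class ParallelListing {
    // Keep only the lookbackDays most recent date-based partitions, then
    // fan the per-partition listing calls out across a thread pool.
    public static List<String> listRecent(List<String> partitions, int lookbackDays,
                                          Function<String, List<String>> lister) {
        List<String> recent = partitions.stream()
                .sorted(Comparator.reverseOrder())   // newest yyyy-MM-dd names first
                .limit(lookbackDays)
                .collect(Collectors.toList());
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, recent.size()));
        try {
            List<Future<List<String>>> futures = recent.stream()
                    .map(p -> pool.submit(() -> lister.apply(p)))  // one listing task per partition
                    .collect(Collectors.toList());
            List<String> files = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                files.addAll(f.get());
            }
            return files;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The payoff is that listing latency is bounded by the slowest single partition rather than the sum over all partitions, which is what makes tailing a DFS source faster.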
…ombineAndGetUpdateValue (#2311)

* Added ability to pass in `properties` to payload methods, so they can perform table/record specific merges
* Added default methods so existing payload classes are backwards compatible. 
* Adding DefaultHoodiePayload to honor ordering while merging two records
* Fixing default payload based on feedback
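The backwards-compatibility trick described above can be shown with a minimal sketch. This is not the actual Hudi payload API; the interface, method names, and the `current.ordering` property are all simplified stand-ins. The point is the Java language mechanism: a default method that delegates to the old signature lets existing payload classes keep compiling, while new payloads can override the `Properties`-aware variant to honor ordering while merging two records.

```java
import java.util.Optional;
import java.util.Properties;

interface RecordPayload {
    // Older signature that existing payload classes already implement.
    Optional<String> combineAndGetUpdateValue(String currentValue);

    // New signature: table/record-specific settings arrive via Properties.
    // The default implementation ignores them, so old payloads stay source-compatible.
    default Optional<String> combineAndGetUpdateValue(String currentValue, Properties props) {
        return combineAndGetUpdateValue(currentValue);
    }
}

public class PayloadCompat {
    // An ordering-aware payload in the spirit of the default payload described
    // above: keep whichever value carries the higher ordering number.
    static class OrderingPayload implements RecordPayload {
        final String value;
        final long orderingVal;

        OrderingPayload(String value, long orderingVal) {
            this.value = value;
            this.orderingVal = orderingVal;
        }

        @Override
        public Optional<String> combineAndGetUpdateValue(String currentValue) {
            return Optional.of(value);  // old behavior: incoming value always wins
        }

        @Override
        public Optional<String> combineAndGetUpdateValue(String currentValue, Properties props) {
            long currentOrdering = Long.parseLong(props.getProperty("current.ordering", "-1"));
            return currentOrdering > orderingVal ? Optional.of(currentValue) : Optional.of(value);
        }
    }

    public static String merge(String current, long currentOrdering,
                               String incoming, long incomingOrdering) {
        Properties props = new Properties();
        props.setProperty("current.ordering", Long.toString(currentOrdering));
        return new OrderingPayload(incoming, incomingOrdering)
                .combineAndGetUpdateValue(current, props)
                .orElse(null);
    }
}
```

A payload class written against only the two-argument method still works unchanged, because the default method routes the new call to it.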
…nformation (#2354)

Co-authored-by: Xi Chen <chenxi07@qiyi.com>
umehrot2 and others added 28 commits January 10, 2021 21:19
…on paths (#2417)

* [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partition paths

* Adding testClass for FileSystemBackedTableMetadata

Co-authored-by: Nishith Agarwal <nagarwal@uber.com>
- Adds field to RollbackMetadata that capture the logs written for rollback blocks
- Adds field to RollbackMetadata that capture new logs files written by unsynced deltacommits

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
…ldSchema and newSchema in favor of using only new schema for record rewriting (#2424)
#2440)

* Fixed suboptimal implementation of a magic sequence search on GCS.

* Fix comparison.

* Added buffered reader around plugged storage plugin such as GCS.

* 1. Corrected some comments 2. Refactored GCS input stream check

Co-authored-by: volodymyr.burenin <volodymyr.burenin@cloudkitchens.com>
Co-authored-by: Nishith Agarwal <nagarwal@uber.com>
* Revert "[MINOR] Bumping snapshot version to 0.7.0 (#2435)"

This reverts commit a43e191.

* Fixing 0.7.0 snapshot bump
…2441)

Addresses leaks, perf degradation observed during testing. These were regressions from the original rfc-15 PoC implementation.

* Pass a single instance of HoodieTableMetadata everywhere
* Fix tests and add config for enabling metrics
- Removed special casing of assumeDatePartitioning inside FSUtils#getAllPartitionPaths()
  - Consequently, IOException is never thrown and many files had to be adjusted
- More diligent handling of open file handles in metadata table
  - Added config for controlling reuse of connections
  - Added config for turning off fallback to listing, so we can see tests fail
  - Changed all ipf listing code to cache/amortize the open/close for better performance
  - Timelineserver also reuses connections, for better performance
  - Without timelineserver, when metadata table is opened from executors, reuse is not allowed
  - HoodieMetadataConfig passed into HoodieTableMetadata#create as argument
  - Fix TestHoodieBackedTableMetadata#testSync
…eate_database (#2444)

fix dataSource cannot use hoodie.datasource.hive_sync.auto_create_database
* [HUDI-1512] Fix spark 2 unit tests failure with Spark 3

* resolve comments

Co-authored-by: Wenning Ding <wenningd@amazon.com>
* [HUDI] Add bloom index for hudi-flink-client

Co-authored-by: yangxiang <yangxiang@oppo.com>
- These are being deprecated
- Causes build issues when .m2 does not have this cached already
loukey-lj merged commit ee9ae14 into loukey-lj:master Jan 26, 2021
yihua pushed a commit that referenced this pull request Sep 22, 2024
…ernalWriterHelper::write(...) (apache#10272)

Issue:
There are two configs which, when set in a certain manner, throw exceptions or asserts:
1. Configs to disable populating metadata fields (for each row)
2. Configs to drop partition columns (to save storage space) from a row

With #1 and #2, partition paths cannot be deduced using partition columns (as the partition columns are dropped higher up the stack).
BulkInsertDataInternalWriterHelper::write(...) relied on metadata fields to extract partition path in such cases.
But with #1 this is not possible, resulting in asserts/exceptions.

The fix is to push the dropping of partition columns down the stack, after the partition path is computed.
It manipulates the raw 'InternalRow' structure by copying only the relevant fields into a new 'InternalRow'.
Each row is processed individually to drop the partition columns and copy the rest to a new 'InternalRow'.

Co-authored-by: Vinaykumar Bhat <vinay@onehouse.ai>
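The ordering described in the fix can be sketched without Spark. This is a hypothetical illustration, not the real BulkInsertDataInternalWriterHelper: plain arrays stand in for Spark's `InternalRow`, and `writeRow` is an invented name. The essential point matches the commit message: the partition path is computed from the full row first, and only afterwards are the partition columns dropped by copying the remaining fields into a new row.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DropPartitionCols {
    // Returns (partitionPath, rowWithoutPartitionColumns). The path is derived
    // from the partition column values BEFORE those columns are removed, so no
    // metadata fields are needed to recover it.
    public static Map.Entry<String, Object[]> writeRow(
            String[] schema, Object[] row, Set<String> partitionCols) {
        StringBuilder path = new StringBuilder();
        List<Object> kept = new ArrayList<>();
        for (int i = 0; i < schema.length; i++) {
            if (partitionCols.contains(schema[i])) {
                // Hive-style partition path segment, e.g. dt=2021-01-26
                if (path.length() > 0) {
                    path.append('/');
                }
                path.append(schema[i]).append('=').append(row[i]);
            } else {
                kept.add(row[i]);  // copy only the non-partition fields
            }
        }
        return new AbstractMap.SimpleEntry<>(path.toString(), kept.toArray());
    }
}
```

Doing the drop per row, after path computation, is what removes the dependency on populated metadata fields and avoids the assert/exception combination described above.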