HDDS-4315. Use Epoch to generate unique ObjectIDs #1480

hanishakoneru · 2020-10-06T23:09:36Z

What changes were proposed in this pull request?

In a non-Ratis OM, the transaction index used to generate ObjectIDs is reset on OM restart, which can lead to duplicate ObjectIDs after a restart. ObjectIDs should be unique.
HDDS-2939 and NFS are some of the features that depend on ObjectIDs being unique.

This Jira aims to introduce an epoch number in OM which is incremented on OM restarts. The epoch is persisted on disk. This epoch will be used to set the first 16 bits of the objectID to ensure that objectIDs are unique even after OM restart.
The highest epoch number is reserved for transactions coming through ratis. This will take care of the scenario where OM ratis is enabled on an existing cluster.

To ensure that objectIDs are unique across restarts in a non-Ratis OM cluster, the transaction index should be persisted to the DB on every flush. This can be done the same way it is done for Ratis-enabled clusters today: the TransactionInfo table is updated with the transaction index as part of every batch write operation to the DB.
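
As a rough illustration of the persist-on-flush idea, here is a minimal sketch. It is not the PR's code: `TxnIndexStore`, the `Map` standing in for the on-disk table, and the key name are all hypothetical, chosen only to show that an index reloaded from storage continues where it left off after a restart.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for an OM that persists its transaction index. */
final class TxnIndexStore {
  // Hypothetical key under which the last flushed index is stored.
  private static final String KEY = "#TRANSACTIONINFO";

  private final Map<String, Long> db; // stands in for the on-disk table
  private long lastIndex;

  /** "Starting" the OM: reload the last flushed index, or 0 on a fresh DB. */
  TxnIndexStore(Map<String, Long> db) {
    this.db = db;
    this.lastIndex = db.getOrDefault(KEY, 0L);
  }

  /** Hands out the next transaction index (in memory only). */
  long nextIndex() {
    return ++lastIndex;
  }

  /** Writes the current index alongside the batch being flushed. */
  void flush() {
    db.put(KEY, lastIndex);
  }
}
```

In this sketch, indices handed out after the last `flush()` are lost on restart together with the unflushed writes they belonged to, so reusing them does not create duplicates among persisted objects.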

Also, an epoch number is introduced to ensure that objectIDs do not clash with those from older clusters in which this fix does not exist. Of the 64 bits of the ObjectID (a long variable), 2 bits are reserved for the epoch and 8 bits for recursive directory creation, if required. The most significant 2 bits of an objectID are set to the epoch. For clusters before HDDS-4315 there is no epoch as such, but it can be safely assumed that the most significant 2 bits of the objectID will be 00 (as it is unlikely for an existing cluster to reach a trxn index > 2^62). From HDDS-4315 onwards, the epoch for non-Ratis OM clusters will be binary 01 (= decimal 1) and for Ratis-enabled OM clusters will be binary 10 (= decimal 2).
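
The bit layout can be sketched as follows. This is an illustrative sketch only: `ObjectIdLayout` and its constant names are hypothetical. It assumes that the 8 reserved bits are the low-order bits and that the transaction index occupies the 54 bits in between. For simplicity the sketch allows the full 54-bit range, whereas the PR's own MAX_TRXN_ID is slightly smaller.

```java
/** Hypothetical helper illustrating a 2 + 54 + 8 bit objectID layout. */
final class ObjectIdLayout {
  static final int EPOCH_BITS = 2;
  static final int RESERVED_BITS = 8; // assumption: low-order bits
  static final int TXN_BITS = 64 - EPOCH_BITS - RESERVED_BITS; // 54
  static final int EPOCH_SHIFT = 64 - EPOCH_BITS;              // 62
  static final long MAX_TXN_ID = (1L << TXN_BITS) - 1;

  private ObjectIdLayout() { }

  /** Packs the epoch into the top 2 bits and the txn index below it. */
  static long compose(long epoch, long txnId) {
    if (epoch < 0 || epoch >= (1L << EPOCH_BITS)) {
      throw new IllegalArgumentException("epoch out of range: " + epoch);
    }
    if (txnId < 0 || txnId > MAX_TXN_ID) {
      throw new IllegalArgumentException("txnId out of range: " + txnId);
    }
    // The low RESERVED_BITS bits are left as zero.
    return (epoch << EPOCH_SHIFT) | (txnId << RESERVED_BITS);
  }
}
```

With this layout, `ObjectIdLayout.compose(1, 5)` yields `0x4000000000000500L`: epoch 01 in the top two bits and the index 5 shifted past the 8 reserved bits. An epoch of 2 sets the sign bit, so such IDs are negative when printed as a signed long.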

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-4315

How was this patch tested?

Added unit tests.

linyiqun

@hanishakoneru , I'm +1 for this proposal. But I have one thought below.

linyiqun · 2020-10-08T01:28:48Z

hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/OmUtils.java

+   */
+  public static long getObjectIdFromTxId(long epoch, long id) {
+    Preconditions.checkArgument(id <= MAX_TRXN_ID, "TransactionID " +
+        "exceeds max limit of " + MAX_TRXN_ID);


I am thinking about this extreme case: the user cannot write objects anymore once the TransactionID exceeds MAX_TRXN_ID, right? What can we do then? Would we have to set up a new Ozone cluster?

prashantpogde · 2020-10-08T19:26:40Z

General comment on using an epoch id that increments with every OM restart: this can get tricky.
If the OM goes into a crash-restart loop, we have just 2^16 increments available, which is about 65K attempts. If it takes 1 second for the OM to come back online, that is 65K seconds' worth of epoch numbers, or roughly 18 hours of crash looping. This is a very pessimistic view, since it may take several seconds for the OM to restart, but it does show how a 16-bit space can be insufficient for this scheme.
The epoch need not depend on a restart-based increment alone. If it increments only when both of the following conditions hold
A) OM restart +
B) some object gets created after epoch id is incremented
then the epoch may last longer. But even then 16 bits looks insufficient. What if the OM creates one object and restarts in a loop?
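
The back-of-the-envelope estimate above can be checked with a few lines. This is only a sketch: `EpochBudget` and its method names are hypothetical, and the 1-second restart time is the pessimistic assumption from the comment.

```java
/** Hypothetical calculator for how long a per-restart epoch counter lasts. */
final class EpochBudget {
  private EpochBudget() { }

  /** Number of distinct epoch values an epoch field of this width can hold. */
  static long restartsAvailable(int epochBits) {
    return 1L << epochBits;
  }

  /** Hours of continuous crash looping before the epoch space runs out. */
  static double hoursUntilExhausted(int epochBits, double secondsPerRestart) {
    return restartsAvailable(epochBits) * secondsPerRestart / 3600.0;
  }
}
```

For 16 bits and one restart per second, `hoursUntilExhausted(16, 1.0)` gives 65536 / 3600, which is a little over 18 hours.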

hanishakoneru · 2020-10-09T16:20:32Z

Thank you @linyiqun and @prashantpogde for the reviews.

Agree that setting aside 16 bits for epoch doesn't work for both the epoch as well as the transaction ids. 16 bits would not be enough to cover restarts and 40 bits might not be enough for transaction ids.
The new proposal is to have only 2 bits set aside for epoch. For non-Ratis OM, the transactionIndex will be saved in DB with every sync operation. When OM is restarted, this transactionIndex will be read from DB so that new transactions do not have clashing indices.
The epoch would let us distinguish objects created before and after this upgrade. This would help if someone needs to fix the duplicate objectIDs in existing clusters.

Thank you @bharatviswa504 and @prashantpogde for the offline discussion.

hanishakoneru · 2020-10-20T20:54:32Z

@prashantpogde, @linyiqun, @bharatviswa504, I have updated the PR with the discussed approach. Please review when you get a chance. Thanks.

linyiqun

@hanishakoneru , the new implementation looks good to me. Only minor comments from me below.

linyiqun · 2020-10-22T15:46:38Z

hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/OmUtils.java

+   * when OM is started first time to add S3G volume. In call other cases,
+   * getObjectIdFromTxId() should be called to append epoch to objectID.
+   */
+  public static long addEpochToObjectId(long epoch, long id) {


Can we rename id -> trxnId?

linyiqun · 2020-10-22T15:52:01Z

hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java

@@ -394,6 +403,8 @@ private OzoneManager(OzoneConfiguration conf) throws IOException,
        OMConfigKeys.OZONE_OM_RATIS_ENABLE_KEY,
        OMConfigKeys.OZONE_OM_RATIS_ENABLE_DEFAULT);

+    omEpoch = OmUtils.getOMEpoch(isRatisEnabled);


I would prefer to reuse metadataManager#getOmEpoch to set the epoch value, so that the epoch number comes from a single place.

prashantpogde · 2020-10-30T21:36:05Z

hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/OmUtils.java

+  // reserved for creating S3G volume on OM start {@link
+  // OzoneManager#addS3GVolumeToDB()}.
+  public static final long EPOCH_ID_SHIFT = 62; // 64 - 2
+  public static final long MAX_TRXN_ID = (long) (Math.pow(2, 54) - 2);


Can we use 1L << 54 (a long shift) instead of Math.pow?
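
When replacing Math.pow with a shift in Java, the type of the left operand matters: on an int, the shift distance is masked to its low 5 bits, so a shift by 54 behaves like a shift by 22. The sketch below (the `ShiftDemo` class is hypothetical, for illustration only) contrasts the three forms.

```java
/** Hypothetical demo of int vs. long shifts when computing 2^54. */
final class ShiftDemo {
  private ShiftDemo() { }

  /** int shift: the distance 54 is masked to 54 & 31 = 22, giving 2^22. */
  static long intShift() {
    return 1 << 54;
  }

  /** long shift: the distance is masked to 6 bits, so this is really 2^54. */
  static long longShift() {
    return 1L << 54;
  }

  /** Math.pow form; 2^54 is exactly representable as a double. */
  static long viaPow() {
    return (long) Math.pow(2, 54);
  }
}
```

Here `intShift()` returns 4194304, while `longShift()` and `viaPow()` both return 18014398509481984.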

prashantpogde · 2020-10-30T21:38:01Z

hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/OmUtils.java

+  // OzoneManager#addS3GVolumeToDB()}.
+  public static final long EPOCH_ID_SHIFT = 62; // 64 - 2
+  public static final long MAX_TRXN_ID = (long) (Math.pow(2, 54) - 2);
+  public static final int EPOCH_WHEN_RATIS_NOT_ENABLED = 1;


Don't we want these values to be 0 and 1 instead of 1 & 2 ?

Wanted to avoid 0, as we can assume that it is currently 0 for existing clusters. This gives us a way to separate out objectIDs created before this fix. If these non-unique objectIDs ever need to be fixed, it would be easy to identify them.
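
To illustrate how a non-zero epoch makes pre-fix objectIDs identifiable, here is a minimal sketch. `EpochClassifier` and its labels are hypothetical. It assumes only what the PR description states: the epoch sits in the most significant 2 bits, with 00 for pre-HDDS-4315 IDs, 01 for non-Ratis and 10 for Ratis-enabled OMs.

```java
/** Hypothetical helper that reads the epoch back out of an objectID. */
final class EpochClassifier {
  private EpochClassifier() { }

  /** Returns the top 2 bits of the objectID as a value in 0..3. */
  static int epochOf(long objectId) {
    // Unsigned shift so that IDs with the sign bit set are handled correctly.
    return (int) (objectId >>> 62);
  }

  /** Maps the epoch to a human-readable origin label. */
  static String describe(long objectId) {
    switch (epochOf(objectId)) {
      case 0:
        return "created before HDDS-4315 (may not be unique)";
      case 1:
        return "created by a non-Ratis OM";
      case 2:
        return "created by a Ratis-enabled OM";
      default:
        return "unassigned epoch";
    }
  }
}
```

A scan that wanted to repair duplicates would then only need to look at IDs for which `epochOf` returns 0.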

prashantpogde · 2020-10-30T21:44:03Z

hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/OmUtils.java

+  public static long getObjectIdFromTxId(long epoch, long id) {
+    Preconditions.checkArgument(id <= MAX_TRXN_ID, "TransactionID " +
+        "exceeds max limit of " + MAX_TRXN_ID);
+    return addEpochToObjectId(epoch, id);


nit : s/addEpochToObjectId /addEpochToTxnId since your definition is ObjectId = EpochId+TxnId

prashantpogde · 2020-10-30T21:55:45Z

hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java

+  // objectIDs is set to this epoch. For clusters before HDDS-4315 there is
+  // no epoch as such. But it can be safely assumed that the most significant
+  // 2 bits of the objectID will be 00. From HDDS-4315 onwards, the Epoch for
+  // non-ratis OM clusters will be binary 01 (= decimal 1)  and for ratis


Why differentiate between the epoch before this change and the epoch for a non-Ratis OM cluster? Can't both be 0?

It would help if we ever wanted to update the non-unique objectIds to maintain uniqueness throughout.

hanishakoneru · 2020-11-03T23:01:05Z

Thank you @linyiqun and @prashantpogde for the reviews. I have addressed the comments. Please take a look.

linyiqun

+1. Thanks for addressing the comments, @hanishakoneru .

prashantpogde

+1 LGTM

hanishakoneru · 2020-11-12T19:42:25Z

Thanks @linyiqun and @prashantpogde for the reviews.

* master: (53 commits)
  HDDS-4458. Fix Max Transaction ID value in OM. (apache#1585)
  HDDS-4442. Disable the location information of audit logger to reduce overhead (apache#1567)
  HDDS-4441. Add metrics for ACL related operations.(Addendum for HA). (apache#1584)
  HDDS-4081. Create ZH translation of StorageContainerManager.md in doc. (apache#1558)
  HDDS-4080. Create ZH translation of OzoneManager.md in doc. (apache#1541)
  HDDS-4079. Create ZH translation of Containers.md in doc. (apache#1539)
  HDDS-4184. Add Features menu for Chinese document. (apache#1547)
  HDDS-4235. Ozone client FS path validation is not present in OFS. (apache#1582)
  HDDS-4338. Fix the issue that SCM web UI banner shows "HDFS SCM". (apache#1583)
  HDDS-4337. Implement RocksDB options cache for new datanode DB utilities. (apache#1544)
  HDDS-4083. Create ZH translation of Recon.md in doc (apache#1575)
  HDDS-4453. Replicate closed container for random selected datanodes. (apache#1574)
  HDDS-4408: terminate Datanode when Datanode State Machine Thread got uncaught exception. (apache#1533)
  HDDS-4443. Recon: Using Mysql database throws exception and fails startup (apache#1570)
  HDDS-4315. Use Epoch to generate unique ObjectIDs (apache#1480)
  HDDS-4455. Fix typo in README.md doc (apache#1578)
  HDDS-4441. Add metrics for ACL related operations. (apache#1571)
  HDDS-4437. Avoid unnecessary builder conversion in setting volume Quota/Owner request (apache#1564)
  HDDS-4417. Simplify Ozone client code with configuration object (apache#1542)
  HDDS-4363. Add metric to track the number of RocksDB open/close operations. (apache#1530)
  ...

hanishakoneru requested review from mukul1987 and bharatviswa504 October 6, 2020 23:09

hanishakoneru marked this pull request as draft October 7, 2020 23:10

linyiqun reviewed Oct 8, 2020

hanishakoneru force-pushed the HDDS-4315 branch from a0bef37 to 00959ae Compare October 20, 2020 20:52

hanishakoneru marked this pull request as ready for review October 20, 2020 20:52

hanishakoneru added 2 commits October 20, 2020 15:40

HDDS-4315. Ensure ObjectIDs are unique across restarts

bb1fd32

CI fixes

3c1526a

hanishakoneru force-pushed the HDDS-4315 branch from 00959ae to 3c1526a Compare October 20, 2020 22:55

linyiqun reviewed Oct 22, 2020

prashantpogde reviewed Oct 30, 2020

review comments

94d5056

compile fix

7b0caa2

linyiqun approved these changes Nov 4, 2020

prashantpogde reviewed Nov 6, 2020

prashantpogde approved these changes Nov 12, 2020

hanishakoneru removed request for mukul1987 and bharatviswa504 November 12, 2020 19:41

hanishakoneru merged commit e56d7bc into apache:master Nov 12, 2020

hanishakoneru deleted the HDDS-4315 branch December 1, 2020 21:25
