-
Notifications
You must be signed in to change notification settings - Fork 494
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HDDS-4315. Use Epoch to generate unique ObjectIDs #1480
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hanishakoneru , I'm +1 for this proposal. But I have one thought below.
*/ | ||
public static long getObjectIdFromTxId(long epoch, long id) { | ||
Preconditions.checkArgument(id <= MAX_TRXN_ID, "TransactionID " + | ||
"exceeds max limit of " + MAX_TRXN_ID); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am thinking for this extreme case, user cannot write object anymore when TransactionID exceeds MAX_TRXN_ID, right? So what can we do for this, have to setup a new Ozone cluster to use?
General comment on using the epoch id that increments with every OM restart. This can get tricky.
|
Thank you @linyiqun and @prashantpogde for the reviews. Agree that setting aside 16 bits for epoch doesn't work for both the epoch as well as the transaction ids. 16 bits would not be enough to cover restarts and 40 bits might not be enough for transaction ids. Thank you @bharatviswa504 and @prashantpogde for the offline discussion. |
a0bef37
to
00959ae
Compare
@prashantpogde, @linyiqun, @bharatviswa504, I have updated the PR with the discussed approach. Please review when you get a chance. Thanks. |
00959ae
to
3c1526a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hanishakoneru , the new implementation looks good to me. Only minor comments from me below.
* when OM is started first time to add S3G volume. In call other cases, | ||
* getObjectIdFromTxId() should be called to append epoch to objectID. | ||
*/ | ||
public static long addEpochToObjectId(long epoch, long id) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rename id -> trxnId
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -394,6 +403,8 @@ private OzoneManager(OzoneConfiguration conf) throws IOException, | |||
OMConfigKeys.OZONE_OM_RATIS_ENABLE_KEY, | |||
OMConfigKeys.OZONE_OM_RATIS_ENABLE_DEFAULT); | |||
|
|||
omEpoch = OmUtils.getOMEpoch(isRatisEnabled); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer to reuse metadataManager#getOmEpoch to set epoch value, so that epoch number is from one same place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
// reserved for creating S3G volume on OM start {@link | ||
// OzoneManager#addS3GVolumeToDB()}. | ||
public static final long EPOCH_ID_SHIFT = 62; // 64 - 2 | ||
public static final long MAX_TRXN_ID = (long) (Math.pow(2, 54) - 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use 1 << 54 instead of Math.pow ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
// OzoneManager#addS3GVolumeToDB()}. | ||
public static final long EPOCH_ID_SHIFT = 62; // 64 - 2 | ||
public static final long MAX_TRXN_ID = (long) (Math.pow(2, 54) - 2); | ||
public static final int EPOCH_WHEN_RATIS_NOT_ENABLED = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we want these values to be 0 and 1 instead of 1 & 2 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wanted to avoid 0 as we can assume that currently it is 0. This would give us a way to separate out objectIds created before this fix. If ever, these non-unique objectIds need to be fixed, it would be easy to identify them.
public static long getObjectIdFromTxId(long epoch, long id) { | ||
Preconditions.checkArgument(id <= MAX_TRXN_ID, "TransactionID " + | ||
"exceeds max limit of " + MAX_TRXN_ID); | ||
return addEpochToObjectId(epoch, id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit : s/addEpochToObjectId /addEpochToTxnId since your definition is ObjectId = EpochId+TxnId
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/OmUtils.java
Outdated
Show resolved
Hide resolved
// objectIDs is set to this epoch. For clusters before HDDS-4315 there is | ||
// no epoch as such. But it can be safely assumed that the most significant | ||
// 2 bits of the objectID will be 00. From HDDS-4315 onwards, the Epoch for | ||
// non-ratis OM clusters will be binary 01 (= decimal 1) and for ratis |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why differentiate between epoch before this change and non-ratis OM cluster ? both can be 0 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would help if we ever wanted to update the non-unique objectIds to maintain uniqueness throughout.
Thank you @linyiqun and @prashantpogde for the reviews. I have addressed the comments. Please take a look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. Thanks for addressing the comments, @hanishakoneru .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 LGTM
Thanks @linyiqun and @prashantpogde for the reviews. |
* master: (53 commits) HDDS-4458. Fix Max Transaction ID value in OM. (apache#1585) HDDS-4442. Disable the location information of audit logger to reduce overhead (apache#1567) HDDS-4441. Add metrics for ACL related operations.(Addendum for HA). (apache#1584) HDDS-4081. Create ZH translation of StorageContainerManager.md in doc. (apache#1558) HDDS-4080. Create ZH translation of OzoneManager.md in doc. (apache#1541) HDDS-4079. Create ZH translation of Containers.md in doc. (apache#1539) HDDS-4184. Add Features menu for Chinese document. (apache#1547) HDDS-4235. Ozone client FS path validation is not present in OFS. (apache#1582) HDDS-4338. Fix the issue that SCM web UI banner shows "HDFS SCM". (apache#1583) HDDS-4337. Implement RocksDB options cache for new datanode DB utilities. (apache#1544) HDDS-4083. Create ZH translation of Recon.md in doc (apache#1575) HDDS-4453. Replicate closed container for random selected datanodes. (apache#1574) HDDS-4408: terminate Datanode when Datanode State Machine Thread got uncaught exception. (apache#1533) HDDS-4443. Recon: Using Mysql database throws exception and fails startup (apache#1570) HDDS-4315. Use Epoch to generate unique ObjectIDs (apache#1480) HDDS-4455. Fix typo in README.md doc (apache#1578) HDDS-4441. Add metrics for ACL related operations. (apache#1571) HDDS-4437. Avoid unnecessary builder conversion in setting volume Quota/Owner request (apache#1564) HDDS-4417. Simplify Ozone client code with configuration object (apache#1542) HDDS-4363. Add metric to track the number of RocksDB open/close operations. (apache#1530) ...
What changes were proposed in this pull request?
In a non-Ratis OM, the transaction index used to generate ObjectID is reset on OM restart. This can lead to duplicate ObjectIDs when the OM is restarted. ObjectIDs should be unique.
For HDDS-2939 and NFS are some of the features which depend on ObjectIds being unique.
This Jira aims to introduce an epoch number in OM which is incremented on OM restarts. The epoch is persisted on disk. This epoch will be used to set the first 16 bits of the objectID to ensure that objectIDs are unique even after OM restart.The highest epoch number is reserved for transactions coming through ratis. This will take care of the scenario where OM ratis is enabled on an existing cluster.
To ensure that objectIDs are unique across restarts in non-ratis OM cluster, the transaction index should be updated in DB on every flush to DB. This can be done in a similar fashion to what is being done for ratis enabled cluster today. TransactionInfo table is updated with transaction index as part of every batch write operation to DB.
Also, and epoch number is introduced to ensure that objectIDs do not clash with older clusters in which this fix does not exist. From the 64 bits of ObjectID (long variable), 2 bits are reserved for epoch and 8 bits for recursive directory creation, if required. The most significant 2 bits of objectIDs is set to epoch. For clusters before HDDS-4315 there is no epoch as such. But it can be safely assumed that the most significant 2 bits of the objectID will be 00 (as it unlikely to reach trxn index > 2^62 in an existing cluster). From HDDS-4315 onwards, the Epoch for non-ratis OM clusters will be binary 01 (= decimal 1) and for ratis enabled OM cluster will be binary 10 (= decimal 2).
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-4315
How was this patch tested?
Added unit tests.