Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDDS-4315. Use Epoch to generate unique ObjectIDs #1480

Merged
merged 4 commits into from
Nov 12, 2020

Conversation

hanishakoneru
Copy link
Contributor

@hanishakoneru hanishakoneru commented Oct 6, 2020

What changes were proposed in this pull request?

In a non-Ratis OM, the transaction index used to generate ObjectID is reset on OM restart. This can lead to duplicate ObjectIDs when the OM is restarted. ObjectIDs should be unique.
For HDDS-2939 and NFS are some of the features which depend on ObjectIds being unique.

This Jira aims to introduce an epoch number in OM which is incremented on OM restarts. The epoch is persisted on disk. This epoch will be used to set the first 16 bits of the objectID to ensure that objectIDs are unique even after OM restart.
The highest epoch number is reserved for transactions coming through ratis. This will take care of the scenario where OM ratis is enabled on an existing cluster.

To ensure that objectIDs are unique across restarts in non-ratis OM cluster, the transaction index should be updated in DB on every flush to DB. This can be done in a similar fashion to what is being done for ratis enabled cluster today. TransactionInfo table is updated with transaction index as part of every batch write operation to DB.

Also, and epoch number is introduced to ensure that objectIDs do not clash with older clusters in which this fix does not exist. From the 64 bits of ObjectID (long variable), 2 bits are reserved for epoch and 8 bits for recursive directory creation, if required. The most significant 2 bits of objectIDs is set to epoch. For clusters before HDDS-4315 there is no epoch as such. But it can be safely assumed that the most significant 2 bits of the objectID will be 00 (as it unlikely to reach trxn index > 2^62 in an existing cluster). From HDDS-4315 onwards, the Epoch for non-ratis OM clusters will be binary 01 (= decimal 1) and for ratis enabled OM cluster will be binary 10 (= decimal 2).

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-4315

How was this patch tested?

Added unit tests.

Copy link
Contributor

@linyiqun linyiqun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hanishakoneru , I'm +1 for this proposal. But I have one thought below.

*/
public static long getObjectIdFromTxId(long epoch, long id) {
Preconditions.checkArgument(id <= MAX_TRXN_ID, "TransactionID " +
"exceeds max limit of " + MAX_TRXN_ID);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking for this extreme case, user cannot write object anymore when TransactionID exceeds MAX_TRXN_ID, right? So what can we do for this, have to setup a new Ozone cluster to use?

@prashantpogde
Copy link
Contributor

prashantpogde commented Oct 8, 2020

General comment on using the epoch id that increments with every OM restart. This can get tricky.
If OM goes in crash restart loop then we have just 2^16 increments available which is 65K attempts. if it takes 1 secs for OM to comeback online we have 65 K secs worth epoch number or 20 hours of crash looping. This is very pessimistic view, it may take several seconds for OM to restart but it does show how

  • 16 bit space can be insufficient for this scheme.
  • epoch need not be dependent on restart based increment. if it increments based on both of the following conditions
    A) OM restart +
    B) some object gets created after epoch id is incremented
    then epoch may last longer. But even then 16 bit looks insufficient. What if OM creates one object and restarts in a loop.

@hanishakoneru
Copy link
Contributor Author

Thank you @linyiqun and @prashantpogde for the reviews.

Agree that setting aside 16 bits for epoch doesn't work for both the epoch as well as the transaction ids. 16 bits would not be enough to cover restarts and 40 bits might not be enough for transaction ids.
The new proposal is to have only 2 bits set aside for epoch. For non-Ratis OM, the transactionIndex will be saved in DB with every sync operation. When OM is restarted, this transactionIndex will be read from DB so that new transactions do not have clashing indices.
The epoch would let us distinguish objects created before and after this upgrade. This would help if someone needs to fix the duplicate objectIDs in existing clusters.

Thank you @bharatviswa504 and @prashantpogde for the offline discussion.

@hanishakoneru
Copy link
Contributor Author

@prashantpogde, @linyiqun, @bharatviswa504, I have updated the PR with the discussed approach. Please review when you get a chance. Thanks.

Copy link
Contributor

@linyiqun linyiqun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hanishakoneru , the new implementation looks good to me. Only minor comments from me below.

* when OM is started first time to add S3G volume. In call other cases,
* getObjectIdFromTxId() should be called to append epoch to objectID.
*/
public static long addEpochToObjectId(long epoch, long id) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename id -> trxnId

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -394,6 +403,8 @@ private OzoneManager(OzoneConfiguration conf) throws IOException,
OMConfigKeys.OZONE_OM_RATIS_ENABLE_KEY,
OMConfigKeys.OZONE_OM_RATIS_ENABLE_DEFAULT);

omEpoch = OmUtils.getOMEpoch(isRatisEnabled);
Copy link
Contributor

@linyiqun linyiqun Oct 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to reuse metadataManager#getOmEpoch to set epoch value, so that epoch number is from one same place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// reserved for creating S3G volume on OM start {@link
// OzoneManager#addS3GVolumeToDB()}.
public static final long EPOCH_ID_SHIFT = 62; // 64 - 2
public static final long MAX_TRXN_ID = (long) (Math.pow(2, 54) - 2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use 1 << 54 instead of Math.pow ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// OzoneManager#addS3GVolumeToDB()}.
public static final long EPOCH_ID_SHIFT = 62; // 64 - 2
public static final long MAX_TRXN_ID = (long) (Math.pow(2, 54) - 2);
public static final int EPOCH_WHEN_RATIS_NOT_ENABLED = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we want these values to be 0 and 1 instead of 1 & 2 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wanted to avoid 0 as we can assume that currently it is 0. This would give us a way to separate out objectIds created before this fix. If ever, these non-unique objectIds need to be fixed, it would be easy to identify them.

public static long getObjectIdFromTxId(long epoch, long id) {
Preconditions.checkArgument(id <= MAX_TRXN_ID, "TransactionID " +
"exceeds max limit of " + MAX_TRXN_ID);
return addEpochToObjectId(epoch, id);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit : s/addEpochToObjectId /addEpochToTxnId since your definition is ObjectId = EpochId+TxnId

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// objectIDs is set to this epoch. For clusters before HDDS-4315 there is
// no epoch as such. But it can be safely assumed that the most significant
// 2 bits of the objectID will be 00. From HDDS-4315 onwards, the Epoch for
// non-ratis OM clusters will be binary 01 (= decimal 1) and for ratis
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why differentiate between epoch before this change and non-ratis OM cluster ? both can be 0 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would help if we ever wanted to update the non-unique objectIds to maintain uniqueness throughout.

@hanishakoneru
Copy link
Contributor Author

Thank you @linyiqun and @prashantpogde for the reviews. I have addressed the comments. Please take a look.

Copy link
Contributor

@linyiqun linyiqun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. Thanks for addressing the comments, @hanishakoneru .

Copy link
Contributor

@prashantpogde prashantpogde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 LGTM

@hanishakoneru
Copy link
Contributor Author

Thanks @linyiqun and @prashantpogde for the reviews.

@hanishakoneru hanishakoneru merged commit e56d7bc into apache:master Nov 12, 2020
errose28 added a commit to errose28/ozone that referenced this pull request Nov 18, 2020
* master: (53 commits)
  HDDS-4458. Fix Max Transaction ID value in OM. (apache#1585)
  HDDS-4442. Disable the location information of audit logger to reduce overhead (apache#1567)
  HDDS-4441. Add metrics for ACL related operations.(Addendum for HA). (apache#1584)
  HDDS-4081. Create ZH translation of StorageContainerManager.md in doc. (apache#1558)
  HDDS-4080. Create ZH translation of OzoneManager.md in doc. (apache#1541)
  HDDS-4079. Create ZH translation of Containers.md in doc. (apache#1539)
  HDDS-4184. Add Features menu for Chinese document. (apache#1547)
  HDDS-4235. Ozone client FS path validation is not present in OFS. (apache#1582)
  HDDS-4338. Fix the issue that SCM web UI banner shows "HDFS SCM". (apache#1583)
  HDDS-4337. Implement RocksDB options cache for new datanode DB utilities. (apache#1544)
  HDDS-4083. Create ZH translation of Recon.md in doc (apache#1575)
  HDDS-4453. Replicate closed container for random selected datanodes. (apache#1574)
  HDDS-4408: terminate Datanode when Datanode State Machine Thread got uncaught exception. (apache#1533)
  HDDS-4443. Recon: Using Mysql database throws exception and fails startup (apache#1570)
  HDDS-4315. Use Epoch to generate unique ObjectIDs (apache#1480)
  HDDS-4455. Fix typo in README.md doc (apache#1578)
  HDDS-4441. Add metrics for ACL related operations. (apache#1571)
  HDDS-4437. Avoid unnecessary builder conversion in setting volume Quota/Owner request (apache#1564)
  HDDS-4417. Simplify Ozone client code with configuration object (apache#1542)
  HDDS-4363. Add metric to track the number of RocksDB open/close operations. (apache#1530)
  ...
@hanishakoneru hanishakoneru deleted the HDDS-4315 branch December 1, 2020 21:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants