Question about data processing in Unsupervised NMT #171

ElliottYan · 2020-11-20T13:34:47Z

Hi, thanks for sharing your code.

I'm currently trying to reproduce your results on unsupervised NMT. I noticed that you mention filtering out tokenized sentences with more than 175 tokens. However, I couldn't find any code that does this in your data processing script, get-data-nmt.sh.

Can you confirm that the data script is up-to-date?
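
To make the question concrete, here is a minimal sketch of the kind of length filter described above. It assumes tokens are whitespace-separated and that each sentence sits on its own line; the function name `filter_by_length` and its `max_tokens` parameter are hypothetical and are not taken from get-data-nmt.sh or the repository.

```python
from typing import Iterable, Iterator


def filter_by_length(lines: Iterable[str], max_tokens: int = 175) -> Iterator[str]:
    """Yield only the lines with at most `max_tokens` whitespace-separated tokens.

    Hypothetical helper for illustration; the 175 default mirrors the
    threshold mentioned in the question, not a value read from the repo.
    """
    for line in lines:
        # str.split() with no arguments splits on any run of whitespace.
        if len(line.split()) <= max_tokens:
            yield line


if __name__ == "__main__":
    sample = ["a short sentence", " ".join(["tok"] * 176)]
    kept = list(filter_by_length(sample))
    print(len(kept))  # number of sentences that survive the filter
```

Note that a filter like this counts tokens after tokenization, which is a different unit from the sub-token (BPE) length a data loader would check.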

Also, I am using the pretraining script you provided in some of the issues. I found that the data loader in your code removes long sequences, with the limit set to 100 sub-tokens by default.
Did you filter out sequences longer than 175 tokens here?

Looking forward to your reply. Thanks!
