hash of URIRef is not the same across python runs #500

Closed
drewp opened this issue Jul 21, 2015 · 4 comments
Labels: discussion, enhancement

@drewp
Contributor

drewp commented Jul 21, 2015

python -c 'import rdflib; print hash(rdflib.URIRef("hi"))'
13312079974552043
python -c 'import rdflib; print hash(rdflib.URIRef("hi"))'
13312079974020587

It would be nice if this was stable. The fix is in term.py, where we should perhaps use hash(type(self).__name__) instead of hash(type(self)).

@joernhees
Member

i'm not entirely sure this is a must, but it's definitely a should...

i'll implement this with the fully qualified name though: type(self).__module__ + '.' + type(self).__name__

also the current implementation of Literal.__hash__ uses hash(None) if no language is specified, which at least on my system isn't stable.
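A minimal sketch of the idea above, written for Python 3 and not taken from rdflib: the classes `SketchIdentifier`, `SketchURIRef` and `SketchBNode` are hypothetical stand-ins for the real term classes. The hash combines the fully qualified class name with the text, so it no longer depends on where the class object lives in memory.

```python
class SketchIdentifier(str):
    """Hypothetical stand-in for rdflib.term.Identifier; not the real class."""

    def __eq__(self, other):
        # Equal only if both the class and the text match.
        return type(self) is type(other) and str.__eq__(self, other)

    def __ne__(self, other):
        return not self.__eq__(other)

    def __hash__(self):
        t = type(self)
        # Fully qualified class name, e.g. "__main__.SketchURIRef".
        fqn = t.__module__ + "." + t.__name__
        # Both operands are string hashes, so the result is exactly as
        # stable (or as randomized) across runs as hash() of a str.
        return hash(fqn) ^ str.__hash__(self)


class SketchURIRef(SketchIdentifier):
    """Hypothetical stand-in, for illustration only."""


class SketchBNode(SketchIdentifier):
    """Hypothetical stand-in, for illustration only."""


a = SketchURIRef("hi")
b = SketchURIRef("hi")
print(a == b, hash(a) == hash(b))  # same class, same text
print(a == SketchBNode("hi"))      # same text, different class
```

Within a single run, two instances built from the same string compare equal and hash identically; whether the value repeats across runs depends only on how the interpreter hashes strings.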

@joernhees
Member

@drewp: what's your actual use case for this? discussing this with @gromgull it turns out that python 3.3+ uses randomized hashing in order to counter DoS attacks (e.g., http://lemire.me/blog/archives/2014/04/23/do-you-realize-that-you-are-using-random-hashing/ ). Hence fixing this from that perspective doesn't seem to make much sense at all.

I'm still somewhat for merging #501 as the current implementation doesn't do what it probably was intended to do: currently it hashes type(i) for rdflib.term.Identifiers, so essentially their memory location, instead of just their fully qualified class names. The change in #501 fixes that. For Python <3.3 it makes hash stable as the hash of a unicode string and for Python 3.3+ it will be as randomized as hash of a str.

Any other thoughts on this?
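The randomized string hashing mentioned above can be observed by starting fresh interpreters with different `PYTHONHASHSEED` settings. This is a rough sketch for Python 3; the helper name `hash_in_fresh_interpreter` is made up for this illustration.

```python
import os
import subprocess
import sys


def hash_in_fresh_interpreter(text, seed):
    """Return hash(text) as computed by a newly started Python process.

    `seed` is passed via the PYTHONHASHSEED environment variable:
    "random" enables randomization, a decimal integer string fixes the seed.
    """
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.check_output(
        [sys.executable, "-c", "print(hash(%r))" % text], env=env
    )
    return int(out.decode().strip())


if __name__ == "__main__":
    # A fixed seed gives the same value in every process.
    print(hash_in_fresh_interpreter("hi", "0"))
    print(hash_in_fresh_interpreter("hi", "0"))
    # With "random", the two values will almost certainly differ.
    print(hash_in_fresh_interpreter("hi", "random"))
    print(hash_in_fresh_interpreter("hi", "random"))
```

Pinning `PYTHONHASHSEED` makes string hashes repeatable for a given interpreter build, but it gives up the DoS protection, so it is better suited to debugging than to production use.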

joernhees added a commit that referenced this issue Jul 28, 2015
make Identifier.__hash__ consistent with str.__hash__ stability over runs, fixes #500
@drewp
Contributor Author

drewp commented Aug 10, 2015

Use case: I was generating C code for an arduino, and to get a C variable name for a URIRef, I used 'pwm%s' % (hash(dev) % 99999) and hoped for the best. (And didn't think about negative numbers, apparently.) My system was rebuilding and reuploading on every run, since the generated C code didn't have the same checksum each time. Nothing was broken; it just ran slow. hash(str(dev)) is my workaround in python2.

Probably I should use hashlib.md5(dev).hexdigest() if I want a stable hash, or 'pwm_'+dev.encode('base64').strip('=\n') if I want to try something that's properly unique for each URI. Or re.sub(r'[^0-9a-zA-Z]', '_', dev) for something that's probably unique and also readable.
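For reference, a Python 3 sketch of two of the alternatives listed above (the comment itself uses Python 2 idioms). The function names and the example URI are invented for illustration; the `pwm_` prefix follows the comment. Note that `hashlib.md5` needs bytes in Python 3, so the URI is encoded first.

```python
import hashlib
import re


def stable_digest_name(uri, prefix="pwm_"):
    """C-identifier-safe name that is identical on every run and machine."""
    digest = hashlib.md5(uri.encode("utf-8")).hexdigest()
    return prefix + digest


def readable_name(uri, prefix="pwm_"):
    """Readable name; distinct URIs can collide (e.g. 'a/b' and 'a:b')."""
    return prefix + re.sub(r"[^0-9a-zA-Z]", "_", uri)


if __name__ == "__main__":
    dev = "http://example.com/dev1"  # made-up URI for the example
    print(stable_digest_name(dev))
    print(readable_name(dev))  # pwm_http___example_com_dev1
```

Neither function depends on the built-in `hash()`, so the generated C source keeps the same checksum from one run to the next.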

@rchateauneu
Contributor

Another use case is creating two rdflib.URIRef("string") objects from two strings with identical content. These two URIRefs are not equal due to different hash values, as far as I can tell.
