hash of URIRef is not the same across python runs #500

drewp · 2015-07-21T08:31:16Z

python -c 'import rdflib; print hash(rdflib.URIRef("hi"))'
13312079974552043
python -c 'import rdflib; print hash(rdflib.URIRef("hi"))'
13312079974020587

It would be nice if this was stable. The fix is in term.py, where we should perhaps use hash(type(self).__name__) instead of hash(type(self)).

joernhees · 2015-07-21T14:38:58Z

i'm not entirely sure this is a must, but it's definitely a should...

i'll implement this with the full qualified name though: type(self).__module__ + '.' + type(self).__name__

also the current implementation of Literal.__hash__ uses hash(None) if no language is specified, which at least on my system isn't stable.
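The idea above could be sketched roughly as follows. This is not rdflib's actual term.py: `Identifier`, `URIRef` and `BNode` here are hypothetical stand-in classes, and XOR-ing the two hashes is an assumption about how the parts might be combined.

```python
class Identifier(str):
    """Hypothetical stand-in for rdflib.term.Identifier (a str subclass)."""

    def __eq__(self, other):
        # Equal only when both the concrete type and the string content match.
        return type(self) is type(other) and str.__eq__(self, other)

    def __ne__(self, other):
        return not self.__eq__(other)

    def __hash__(self):
        cls = type(self)
        # Hash the fully qualified class name (a string) rather than the
        # class object itself, whose hash derives from its memory address.
        fqn = cls.__module__ + "." + cls.__name__
        return hash(fqn) ^ str.__hash__(self)


class URIRef(Identifier):
    """Stand-in subclass; inherits __eq__ and __hash__."""


class BNode(Identifier):
    """Second stand-in subclass, to show that the type is part of equality."""


if __name__ == "__main__":
    a, b = URIRef("hi"), URIRef("hi")
    print(a == b, hash(a) == hash(b))  # same type and content
    print(a == BNode("hi"))            # same content, different type
```

With this shape the hash depends only on string hashes, so it is exactly as stable across runs as the hash of a str on the interpreter in use.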

joernhees · 2015-07-21T21:56:12Z

@drewp: what's your actual use case for this? After discussing this with @gromgull, it turns out that Python 3.3+ uses randomized string hashing by default in order to counter DoS attacks (e.g., http://lemire.me/blog/archives/2014/04/23/do-you-realize-that-you-are-using-random-hashing/ ). Hence fixing this from that perspective doesn't seem to make much sense at all.

I'm still somewhat for merging #501 as the current implementation doesn't do what it probably was intended to do: currently it hashes type(i) for rdflib.term.Identifiers, so essentially their memory location, instead of just their fully qualified class names. The change in #501 fixes that. For Python <3.3 it makes hash stable as the hash of a unicode string and for Python 3.3+ it will be as randomized as hash of a str.

Any other thoughts on this?
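To see the randomization described above, one could compare string hashes computed by separate interpreter processes. The sketch below assumes Python 3 and uses the standard PYTHONHASHSEED environment variable; the helper name `hash_in_fresh_interpreter` is made up for illustration.

```python
import os
import subprocess
import sys


def hash_in_fresh_interpreter(text, seed):
    """Return hash(text) as computed by a newly started Python process.

    `seed` is passed through PYTHONHASHSEED: either "random" or a decimal
    integer string ("0" disables hash randomization).
    """
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
    )
    return int(out)


if __name__ == "__main__":
    # With a random seed the two values will almost always differ.
    print(hash_in_fresh_interpreter("hi", "random"))
    print(hash_in_fresh_interpreter("hi", "random"))
    # With a fixed seed the value is reproducible across processes.
    print(hash_in_fresh_interpreter("hi", "0"))
    print(hash_in_fresh_interpreter("hi", "0"))
```

Pinning PYTHONHASHSEED makes hash() reproducible for a given interpreter build, but it is a process-wide setting and not something a library can rely on its users having set.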

make Identifier.__hash__ consistent with str.__hash__ stability over runs, fixes #500

drewp · 2015-08-10T06:21:13Z

Use case: I was generating C code for an arduino, and to get a C variable name for a URIRef, I used 'pwm%s' % (hash(dev) % 99999) and hoped for the best. (And didn't think about negative numbers, apparently.) My system was rebuilding and reuploading on every run, since the generated C code didn't have the same checksum each time. Nothing was broken; it just ran slow. hash(str(dev)) is my workaround in python2.

Probably I should use hashlib.md5(dev).hexdigest() if I want a stable hash, or 'pwm_'+dev.encode('base64').strip('=\n') if I want to try something that's properly unique for each URI. Or re.sub(r'[^0-9a-zA-Z]', '_', dev) for something that's probably unique and also readable.
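For reference, the three alternatives above might look like this in Python 3, where strings must be encoded to bytes before hashing. The function names and the `pwm_` prefix default are illustrative. One deliberate change: base32 is used instead of base64, since the base64 alphabet contains `+` and `/`, which are not valid in C identifiers.

```python
import base64
import hashlib
import re


def md5_name(uri, prefix="pwm_"):
    """Stable across runs; collisions are possible in theory but unlikely."""
    return prefix + hashlib.md5(uri.encode("utf-8")).hexdigest()


def base32_name(uri, prefix="pwm_"):
    """Reversible encoding, so distinct URIs always give distinct names."""
    encoded = base64.b32encode(uri.encode("utf-8")).decode("ascii")
    # The base32 alphabet is A-Z and 2-7; drop the trailing '=' padding.
    return prefix + encoded.rstrip("=")


def readable_name(uri, prefix="pwm_"):
    """Readable, but different URIs can map to the same name."""
    return prefix + re.sub(r"[^0-9a-zA-Z]", "_", uri)


if __name__ == "__main__":
    dev = "http://example.com/dev1"  # example URI, not from the original post
    print(md5_name(dev))
    print(base32_name(dev))
    print(readable_name(dev))  # pwm_http___example_com_dev1
```

None of these depend on hash(), so the generated C code keeps the same checksum from one run to the next.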

rchateauneu · 2019-12-13T00:03:20Z

Another use case is creating two rdflib.URIRef("string") objects from two strings with identical content. These two URIRefs are not equal due to different hash values, as far as I can tell.

joernhees added the enhancement New feature or request label Jul 21, 2015

joernhees added this to the rdflib 4.2.1 milestone Jul 21, 2015

joernhees self-assigned this Jul 21, 2015

joernhees mentioned this issue Jul 21, 2015

make Identifier.__hash__ stable wrt. multi processes, fixes #500 #501

Merged

joernhees added the discussion label Jul 21, 2015

joernhees closed this as completed in ec93eed Jul 28, 2015

joernhees added a commit that referenced this issue Jul 28, 2015

Merge pull request #501 from joernhees/fix_hash

53b56b9

make Identifier.__hash__ consistent with str.__hash__ stability over runs, fixes #500

