SimHash is the hash most commonly used for web-page deduplication, and it is very fast.

Algorithm pseudocode

1. Initialize an f-dimensional vector V to all zeros, and an f-bit binary signature S to 0.

2. For each feature: use a conventional hash algorithm to produce an f-bit hash b for that feature. For i = 1 to f:

   if bit i of b is 1, add the feature's weight to the i-th element of V;

   otherwise, subtract the feature's weight from the i-th element of V.

3. If the i-th element of V is greater than 0, set bit i of S to 1; otherwise set it to 0.

4. Output S as the signature.
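The steps above can be sketched in Ruby as follows. This is a minimal illustration under my own assumptions (features passed as a map from feature string to weight, Ruby's built-in `String#hash` truncated to f bits), not an official implementation:

```ruby
# Weighted simhash, following the pseudocode: features is a hash mapping
# each feature string to its weight, f is the signature width in bits.
def simhash(features, f = 32)
  v = Array.new(f, 0)                  # step 1: f-dimensional vector V
  features.each do |feature, weight|
    b = feature.hash & ((1 << f) - 1)  # step 2: conventional f-bit hash
    f.times do |i|
      if b[i] == 1                     # Integer#[] reads bit i
        v[i] += weight                 # bit i set: add the weight
      else
        v[i] -= weight                 # bit i clear: subtract the weight
      end
    end
  end
  # steps 3-4: bit i of S is 1 iff V[i] > 0
  (0...f).inject(0) { |s, i| v[i] > 0 ? s | (1 << i) : s }
end
```

Note that Ruby's `String#hash` is seeded per process, so signatures are stable within a run but differ between runs.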

--------------------------------------------------------------------------------------------------------

shingling gives great results but the O(n^2) runtime is poor

a set of 1e6 records would require ~5e11 comparisons, and even the c++ impl can "only" do 5e6/sec

that's roughly 1e5 seconds — over a day of runtime, and still far too long in my mind.

we need a heuristic that can help us avoid all that brute force...

let's try making use of the hash algorithm simhash.

simhash was developed by Moses Charikar and is described in his paper "Similarity Estimation Techniques from Rounding Algorithms"

so what is so special about simhash?

a hash function usually hashes different values to totally different hash values

irb> p1 = 'the cat sat on the mat'
irb> p2 = 'the cat sat on a mat'
irb> p3 = 'we all scream for ice cream'
irb> p1.hash
=> 415542861
irb> p2.hash
=> 668720516
irb> p3.hash
=> 767429688

simhash is one where similar items are hashed to similar hash values

(by similar we mean the bitwise hamming distance between the hash values is small)

irb> p1.simhash
=> 851459198  00110010110000000011110001111110
irb> p2.simhash
=> 847263864  00110010100000000011100001111000
irb> p3.simhash
=> 984968088  00111010101101010110101110011000

in this case we can see the hamming distance of the similar items is (p1,p2)=4

whereas (p1,p3)=16 and (p2,p3)=12

the simhash of a phrase is calculated as follows:

- pick a hash size, let's say 32 bits
- let V = [0] * 32 (ie 32 zeros)
- break the phrase up into features

irb> 'the cat sat on the mat'.shingles
=> #<Set: {"th", "he", "e ", " c", "ca", "at", "t ", " s", "sa", " o", "on", "n ", " t", " m", "ma"}>

- hash each feature using a normal 32-bit hash algorithm

"th".hash = -502157718
"he".hash = -369049682
...

- for each hash: if bit i of the hash is set then add 1 to V[i]; if bit i is not set then subtract 1 from V[i]
- simhash bit i is 1 if V[i] > 0 and 0 otherwise
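Putting those steps together, here is one possible reconstruction of the `shingles` and `simhash` helpers seen in the irb transcripts. The method names match the transcript, but the bodies are my own sketch, not the article's actual code (and Ruby's per-process hash seeding means the exact values above won't reproduce):

```ruby
require 'set'

class String
  # character bigrams, as shown in the irb session
  def shingles(size = 2)
    chars.each_cons(size).map(&:join).to_set
  end

  # simhash over the shingles, each feature with weight 1
  def simhash(bits = 32)
    v = Array.new(bits, 0)
    shingles.each do |sh|
      h = sh.hash & ((1 << bits) - 1)              # truncate to a bits-wide hash
      bits.times { |i| v[i] += (h[i] == 1 ? 1 : -1) }
    end
    (0...bits).inject(0) { |s, i| v[i] > 0 ? s | (1 << i) : s }
  end
end
```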

simhash is useful because if the simhash bitwise hamming distance of two phrases is low then their jaccard coefficient is high

but how can we make use of this to try to find the similar items?

consider this other factoid....

if two numbers have a low bitwise hamming distance

and the bits in which they differ are the lower order bits

then it turns out that they will end up close to each other when the list is sorted.

consider these numbers (the last column lists the hamming distance to the previous entry)

1  37586  1001001011010010
2  50086  1100001110100110   7
3   2648  0000101001011000  11
4    934  0000001110100110   9
5  40957  1001111111111101   9
6   2650  0000101001011010   9
7  64475  1111101111011011   7
8  40955  1001111111111011   4

if we sort them

4    934  0000001110100110
3   2648  0000101001011000   9
6   2650  0000101001011010   1
1  37586  1001001011010010   5
8  40955  1001111111111011   6
5  40957  1001111111111101   2
2  50086  1100001110100110   9
7  64475  1111101111011011   9

then we notice that the two pairs with the smallest hamming distance

hdist(3,6)=1 and hdist(8,5)=2

have ended up adjacent to each other.
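The hamming distance and the adjacent-pair check can be sketched directly (a small illustration using the sample numbers above; the function names are mine):

```ruby
# bitwise hamming distance: popcount of the xor
def hamming(a, b)
  (a ^ b).to_s(2).count('1')
end

# after sorting, only adjacent pairs are candidates
def adjacent_candidates(fingerprints)
  fingerprints.sort.each_cons(2).to_a
end

nums = [37586, 50086, 2648, 934, 40957, 2650, 64475, 40955]
adjacent_candidates(nums).include?([2648, 2650])   # the hdist=1 pair is adjacent
```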

great! so in this case, rather than check every combination, we could just check the adjacent pairs of the list; each is a good candidate.

this reduces the runtime

from n*(n-1)/2 coefficient calculations, O(n^2),

to n fingerprint calculations O(n) + a sort O(n log n) + n coefficient calculations O(n), which is O(n log n) overall

alas there is another pair with a low hamming distance, hdist(4,2)=2, that has ended up at opposite ends of the list...

sorting only picked up the pairs that differed in their lower order bits.

to get around this, consider another convenient property of bitwise hamming distance: a permutation of the bits of two numbers preserves their hamming distance.

if we permute by 'rotating' the bits, ie bit shift left and replace the lowest order bit with the 'lost' highest order bit, we get 'new' fingerprints that have the same hamming distances

'rotate' bits left twice:

4   3736  0000111010011000
3  10592  0010100101100000   9
6  10600  0010100101101000   1
1  19274  0100101101001010   5
8  32750  0111111111101110   6
5  32758  0111111111110110   2
2   3739  0000111010011011   9
7  61295  1110111101101111   9

if we sort again by fingerprint

4   3736  0000111010011000
2   3739  0000111010011011   2
3  10592  0010100101100000  11
6  10600  0010100101101000   1
1  19274  0100101101001010   5
8  32750  0111111111101110   6
5  32758  0111111111110110   2
7  61295  1110111101101111   6

yay! this time the (2,4) pair ended up adjacent

we also identified the (3,6) and (5,8) pairs as candidates again

so regardless of whether the items differ in their higher or lower order bits, we can just do the following B times (where B is the bit length of the fingerprint)

- rotate the bits
- sort
- check adjacent
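Those three steps can be sketched as follows. This is a reconstruction under my own assumptions (16-bit fingerprints, a configurable check window, and index pairs returned into the input list), not the article's code:

```ruby
require 'set'

# rotate a bits-wide value left by one, wrapping the top bit around
def rotate_left(x, bits)
  mask = (1 << bits) - 1
  ((x << 1) & mask) | (x >> (bits - 1))
end

# repeat B times: rotate, sort, collect pairs within the window;
# the Set ensures each candidate pair is only checked once at the end
def candidate_pairs(fingerprints, bits: 16, window: 1)
  indexed = fingerprints.each_with_index.map { |f, i| [f, i] }
  pairs = Set.new
  bits.times do
    indexed = indexed.map { |f, i| [rotate_left(f, bits), i] }
    sorted = indexed.sort
    sorted.each_with_index do |(_, i), pos|
      (1..window).each do |off|
        break unless sorted[pos + off]
        pairs << [i, sorted[pos + off][1]].sort
      end
    end
  end
  pairs
end
```

Run on the eight sample numbers above, the returned set contains the (3,6), (8,5) and (4,2) pairs from the worked example (as 0-based index pairs).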

in the likely case that B << n, the entire algorithm, at O(B n log n), will still be well under O(n^2)

(in fact we wouldn't check each adjacent pair on every rotation; rather we collect the adjacent pairs into a set and check them at the end. very similar items will end up adjacent in a number of the rotations but only need to be checked once)

but it all hinges on the fact that similar items end up next to each other in the sorted lists...

let's see how well it does at finding similarities above 0.6

for simhash we give the runtime and its accuracy in finding solutions

eg 5s (90%) implies simhash took 5s to find 90% of all possible resemblances above 0.6

entries | shingling | simhash16 | simhash32 |
    100 |        1s |  1s (69%) |  1s (69%) |
    200 |        4s |  1s (61%) |  2s (67%) |
    400 |       16s |  2s (62%) |  3s (65%) |
    800 |     1m 6s |  3s (41%) |  6s (54%) |
   1600 |    4m 30s | 10s (31%) | 11s (31%) |

simhash observations

- it's much faster; furthermore simhash's runtime grows only slightly faster than linearly, not quadratically
- 32 bit performs better than 16 bit, with only a little extra runtime required
- it's getting more and more inaccurate... is this a function of only trying adjacent entries? perhaps. sampling a couple of the cases it missed, it would have found them by trying a window of the next 5 entries instead of just the next 1

if we change the code to consider the next 10 instead of just the next 1 we get....

entries | shingling | simhash16 | simhash32    |
    100 |        1s | 2s (100%) | 1s (100%)    |
    200 |        4s |  3s (91%) |  5s (96%)    |
    400 |       16s |  6s (88%) | 10s (93%)    |
    800 |     1m 6s | 13s (79%) | 22s (83%)    |
   1600 |    4m 30s | 27s (67%) | 48s (75%)    |
   3200 |   21m 30s | 58s (50%) | 1m 43s (61%) |

a 4x runtime for 2x the accuracy seems worth it.

but it appears the window will have to get bigger and bigger as the dataset grows. the idea of a window size being some magic function of the dataset size smells like a problem in the algorithm

an important thing to notice is that though the accuracy is dropping off, it's the items with lower similarity that are not being found. with a fixed window size of 10 entries we can see that precision is best at the high similarities

#entries |       >0.9 |         >0.8 |         >0.7 |          >0.6 |
     400 | 9/9 (100%) | 40/40 (100%) | 65/65 (100%) |  93/100 (93%) |
     800 | 19/19 (100%) | 77/77 (100%) | 143/147 (97%) | 190/227 (83%) |
    1600 | 40/42 (95%) | 133/139 (95%) | 234/255 (91%) | 307/406 (75%) |
    3200 | 64/75 (85%) | 218/265 (82%) | 397/546 (72%) | 614/1003 (61%) |

continuing to look at the 0.6 values in bigger sets

entries | shingling (c++) | simhash32     |
  6,400 |             13s | 5m (48%)      |
 12,800 |              1m | 8m 10s (42%)  |
 25,600 |          4m 31s | 17m 22s (37%) |
 51,200 |             19m | 45m (31%)     |
102,400 |          1h 31m | 1h 53m (27%)  |

so it seems the single-threaded ruby simhash is juuuust about to overtake the multi-threaded c++ brute force implementation, but alas simhash is dying with out-of-memory errors.

there are a number of things i could do to reduce memory consumption, but i've noticed that even with the rather large window size simhash32 is getting pretty terrible results for similarities over 0.6...

how about increasing the hash size from 32 to 64 bits? alas ruby's default hashing only gives 32-bit values, so we need a larger custom hash function.

how does swapping in a 64-bit universal hash help? it takes quite a bit longer (somewhat expected; there are twice as many bits to rotate and the hash function is now implemented in ruby instead of using ruby's c hash impl) but it seems more accurate...
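For illustration, a seeded hash in the universal-hashing spirit might look like the sketch below. The polynomial scheme, the Mersenne-prime modulus, and the function name are my assumptions, not the article's actual implementation:

```ruby
# polynomial rolling hash over the bytes, modulo a Mersenne prime;
# each random seed selects a different member of the hash family,
# and values fit comfortably in 64 bits
P61 = (1 << 61) - 1

def make_hash64(seed = rand(2...P61))
  a = seed
  lambda do |str|
    str.each_byte.inject(0) { |h, b| (h * a + b) % P61 }
  end
end
```

Because the seed picks the family member, hashes from the same seed agree across runs, while different seeds give independent fingerprints.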

entries | shingling c++ | simhash16 | simhash32    | usimhash64    |
    800 |            1s | 13s (79%) | 22s (83%)    | 72s (100%)    |
   1600 |            3s | 27s (67%) | 48s (75%)    | 2m 52s (97%)  |
   3200 |            5s | 58s (50%) | 1m 43s (61%) | 6m 38s (92%)  |
   6400 |           14s |           | 5m (48%)     | 19m 29s (83%) |

and the behaviour is as before: it captures the higher similarities better than the lower ones

usimhash64 | >0.9 | >0.8 | >0.7 | >0.6 |
      6400 | 100% |  97% |  92% |  83% |

the universal hash has another useful property: since it's seeded from a random value, it can be run a number of times with different results. so we can perform a number of independent runs in parallel on a multicore machine. this roughly makes execution 4 times faster (on a 4-core box), bringing the runtime back down to that of simhash32

how about trying a sketching algorithm?

Source: http://baike.baidu.com/link?url=g9JsAXckWqU2Q96cWkt_FWH3lKwQ-0IbOZeRg7kwq7BYrHdkUrpSk5BCDMn5At-qF5QtBwU6CI5raIjtv0wJGa

http://matpalm.com/resemblance/simhash/
