Simhash-文档去重算法简介 - 文章 - 开发者社区

在计算机科学领域，SimHash 是一种用于快速估算两个集合相似度的技术。谷歌利用该算法来查找近乎重复的网页（Detecting Near-Duplicates for Web Crawling）。我们将其用作 job ID 来识别近乎重复的工作。它是由摩西·查里卡（Moses Charikar）创建的。

SIMHASH 算法的有趣之处在于其具备两个特性：

Simhash 的特性：请注意，Simhash 具有两种相互矛盾的特性：（A）文档的指纹是其特征的“哈希值”，（B）相似的文档具有相似的哈希值。

Explain the algorithm

picture.image

The process of the SimHash algorithm.

• 设置哈希大小，例如 64 位，将其全部初始化为零
• 将短语分解为特征（滑动窗口）

“the cat sat on the mat”

-> {"th", "he", "e "， " c"， "ca"， "at"， "t "," s"， "sa"， " o"， "on"， "n "， " t"， " m"， "ma"}

• 使用常规的 64 位哈希算法（例如 MD5）对每个特征进行哈希处理

“th” -> 10010010...

“he” -> 10010110...

• 将 0 设为 -1，对每个位求和，得到一个类似这样的序列：[-4 4 0 5 6 -1 -4 ...]（64 位）
• 生成 Simhash，通过设置 T[i]>0 为 1，T[i]<0 为 0：[0 1 1 1 1 0 0 ...]（64 位）

Make simhash


 
 
 
 
   
pip install simhash


 
 
 
 
   
def make\_features(input\_str):  
    """break the long input string into features, with length = 3  
    """  
    length = 3  
    input\_str = input\_str.lower()  
    out\_str = re.sub(r'[^\w]+', '', input\_str)  
    return [out\_str[i:i + length] for i in range(max(len(out\_str) - length + 1, 1))]  
  
make\_features("hello world")  
# ['hel', 'ell', 'llo', 'low', 'owo', 'wor', 'orl', 'rld']


 
 
 
 
   
def make\_simhash(input\_str):  
    """make simhash from any input string"""  
    features = make\_features(input\_str)  
    return Simhash(features).value  
  
make\_simhash("hello world")  
# 13548364882372308181

Difference between two strings

计算两个字符串中不同位的数量。


 
 
 
 
   
from simhash import Simhash  
Simhash('aa').distance(Simhash('bb'))  
# 31  
Simhash('aa').distance(Simhash('aa'))  
# 0


 
 
 
 
   
text\_1 = "Good job"  
text\_2 = "Good job, Ray"  
  
Simhash(text\_1).distance(Simhash(text\_2))  
# 14

计算两个哈希值中不同位的数量。


 
 
 
 
   
hash\_1 = Simhash(text\_1).value  
hash\_2 = Simhash(text\_2).value  
  
def simhash\_diff(hash\_1, hash\_2):  
    """calcuate the difference from two simhash values.  
    """  
    x = (hash\_1 ^ hash\_2) & ((1 << 64) - 1)  
    ans = 0  
    while x:  
        ans += 1  
        x &= x - 1  
    return ans  
  
simhash\_diff(hash\_1, hash\_2)  
# 14

Use the Simhash Index

使用 SimhashIndex 可以以非常高效的方式查询近似重复的对象。


 
 
 
 
   
import re  
from simhash import Simhash, SimhashIndex  
  
defget\_features(s):  
    width = 3  
    s = s.lower()  
    s = re.sub(r'[^\w]+', '', s)  
    return [s[i:i + width] for i inrange(max(len(s) - width + 1, 1))]  
  
data = {  
    1: u'How are you? I Am fine. blar blar blar blar blar Thanks.',  
    2: u'How are you i am fine. blar blar blar blar blar than',  
    3: u'This is simhash test.',  
}  
objs = [(str(k), Simhash(get\_features(v))) for k, v in data.items()]  
index = SimhashIndex(objs, k=3)  
  
print(index.bucket\_size())  
# 11  
  
s1 = Simhash(get\_features(u'How are you i am fine. blar blar blar blar blar thank'))  
print(index.get\_near\_dups(s1))  
# ['1']  
  
index.add('4', s1)  
print(index.get\_near\_dups(s1))  
# ['1', '4']

参考文献

• Google paper - http://www.wwwconference.org/www2007/papers/paper215.pdf
• Simhash package - https://github.com/leonsim/simhash

点个「赞」+「在看」❤️

让我们知道这份文字有温暖到你，也是我们持续创作的最大动力！