Simhash-文档去重算法简介

在计算机科学领域,SimHash 是一种用于快速估算两个集合相似度的技术。谷歌利用该算法来查找近乎重复的网页(Detecting Near-Duplicates for Web Crawling)。我们将其用作 job ID 来识别近乎重复的工作。它是由摩西·查里卡(Moses Charikar)创建的。

SIMHASH 算法的有趣之处在于其具备两个特性:

Simhash 的特性:请注意,Simhash 具有两种相互矛盾的特性:(A)文档的指纹是其特征的“哈希值”,(B)相似的文档具有相似的哈希值。

Explain the algorithm

picture.image

The process of the SimHash algorithm.

  • • 设置哈希大小,例如 64 位,将其全部初始化为零
  • • 将短语分解为特征(滑动窗口)

“the cat sat on the mat”

-> {"th", "he", "e ", " c", "ca", "at", "t "," s", "sa", " o", "on", "n ", " t", " m", "ma"}

  • • 使用常规的 64 位哈希算法(例如 MD5)对每个特征进行哈希处理

“th” -> 10010010...

“he” -> 10010110...

  • • 将 0 设为 -1,对每个位求和,得到一个类似这样的序列:[-4 4 0 5 6 -1 -4 ...](64 位)
  • • 生成 Simhash,通过设置 T[i]>0 为 1,T[i]<0 为 0:[0 1 1 1 1 0 0 ...](64 位)

Make simhash


 
 
 
 
   
pip install simhash

 
 
 
 
   
def make\_features(input\_str):  
    """break the long input string into features, with length = 3  
    """  
    length = 3  
    input\_str = input\_str.lower()  
    out\_str = re.sub(r'[^\w]+', '', input\_str)  
    return [out\_str[i:i + length] for i in range(max(len(out\_str) - length + 1, 1))]  
  
make\_features("hello world")  
# ['hel', 'ell', 'llo', 'low', 'owo', 'wor', 'orl', 'rld']

 
 
 
 
   
def make\_simhash(input\_str):  
    """make simhash from any input string"""  
    features = make\_features(input\_str)  
    return Simhash(features).value  
  
make\_simhash("hello world")  
# 13548364882372308181

Difference between two strings

计算两个字符串中不同位的数量。


 
 
 
 
   
from simhash import Simhash  
Simhash('aa').distance(Simhash('bb'))  
# 31  
Simhash('aa').distance(Simhash('aa'))  
# 0

 
 
 
 
   
text\_1 = "Good job"  
text\_2 = "Good job, Ray"  
  
Simhash(text\_1).distance(Simhash(text\_2))  
# 14

计算两个哈希值中不同位的数量。


 
 
 
 
   
hash\_1 = Simhash(text\_1).value  
hash\_2 = Simhash(text\_2).value  
  
def simhash\_diff(hash\_1, hash\_2):  
    """calcuate the difference from two simhash values.  
    """  
    x = (hash\_1 ^ hash\_2) & ((1 << 64) - 1)  
    ans = 0  
    while x:  
        ans += 1  
        x &= x - 1  
    return ans  
  
simhash\_diff(hash\_1, hash\_2)  
# 14

Use the Simhash Index

使用 SimhashIndex 可以以非常高效的方式查询近似重复的对象。


 
 
 
 
   
import re  
from simhash import Simhash, SimhashIndex  
  
defget\_features(s):  
    width = 3  
    s = s.lower()  
    s = re.sub(r'[^\w]+', '', s)  
    return [s[i:i + width] for i inrange(max(len(s) - width + 1, 1))]  
  
data = {  
    1: u'How are you? I Am fine. blar blar blar blar blar Thanks.',  
    2: u'How are you i am fine. blar blar blar blar blar than',  
    3: u'This is simhash test.',  
}  
objs = [(str(k), Simhash(get\_features(v))) for k, v in data.items()]  
index = SimhashIndex(objs, k=3)  
  
print(index.bucket\_size())  
# 11  
  
s1 = Simhash(get\_features(u'How are you i am fine. blar blar blar blar blar thank'))  
print(index.get\_near\_dups(s1))  
# ['1']  
  
index.add('4', s1)  
print(index.get\_near\_dups(s1))  
# ['1', '4']

参考文献

点个「赞」+「在看」❤️

让我们知道这份文字有温暖到你,也是 我们持续 创作的最大动力!

推荐

亲测有效!如何用 Address Sanitizer 精准定位内存漏洞?附保姆级操作指南

要用 AI 裁员 50% 的千亿独角兽,公开认错,重启招聘!

一些文档去重算法

single codebook和dual codebook在LLM中向量量化上有什么区别?

胖东来与京东联手了

Qwen 的训练数据是怎么做的?

什么是置信度?置信度模型怎么做?

一些文档去重算法

FSQ的原理与VQ-VAE的区别和联系

最佳的指令数据应当是什么样的?

Prefill-Decode分离

CosyVoice 2:基于大型语言模型的可扩展流式语音合成技术

Mini-Omni2: with Vision, Speech and Duplex Capabilities

亲测有效!如何用 Address Sanitizer 精准定位内存漏洞?附保姆级操作指南

Telling gcc directly to link a library statically

Address Sanitizer in C++

Telling gcc directly to link a library statically

0
0
0
0
评论
未登录
暂无评论