在计算机科学领域,SimHash 是一种用于快速估算两个集合相似度的技术。谷歌利用该算法来查找近乎重复的网页(Detecting Near-Duplicates for Web Crawling)。我们将其用作 job ID 来识别近乎重复的工作。它是由摩西·查里卡(Moses Charikar)创建的。
SIMHASH 算法的有趣之处在于其具备两个特性:
Simhash 的特性:请注意,Simhash 具有两种相互矛盾的特性:(A)文档的指纹是其特征的“哈希值”,(B)相似的文档具有相似的哈希值。
Explain the algorithm
The process of the SimHash algorithm.
- • 设置哈希大小,例如 64 位,将其全部初始化为零
- • 将短语分解为特征(滑动窗口)
“the cat sat on the mat”
-> {"th", "he", "e ", " c", "ca", "at", "t "," s", "sa", " o", "on", "n ", " t", " m", "ma"}
- • 使用常规的 64 位哈希算法(例如 MD5)对每个特征进行哈希处理
“th” -> 10010010...
“he” -> 10010110...
- • 将 0 设为 -1,对每个位求和,得到一个类似这样的序列:[-4 4 0 5 6 -1 -4 ...](64 位)
- • 生成 Simhash,通过设置 T[i]>0 为 1,T[i]<0 为 0:[0 1 1 1 1 0 0 ...](64 位)
Make simhash
pip install simhash
def make\_features(input\_str):
"""break the long input string into features, with length = 3
"""
length = 3
input\_str = input\_str.lower()
out\_str = re.sub(r'[^\w]+', '', input\_str)
return [out\_str[i:i + length] for i in range(max(len(out\_str) - length + 1, 1))]
make\_features("hello world")
# ['hel', 'ell', 'llo', 'low', 'owo', 'wor', 'orl', 'rld']
def make\_simhash(input\_str):
"""make simhash from any input string"""
features = make\_features(input\_str)
return Simhash(features).value
make\_simhash("hello world")
# 13548364882372308181
Difference between two strings
计算两个字符串中不同位的数量。
from simhash import Simhash
Simhash('aa').distance(Simhash('bb'))
# 31
Simhash('aa').distance(Simhash('aa'))
# 0
text\_1 = "Good job"
text\_2 = "Good job, Ray"
Simhash(text\_1).distance(Simhash(text\_2))
# 14
计算两个哈希值中不同位的数量。
hash\_1 = Simhash(text\_1).value
hash\_2 = Simhash(text\_2).value
def simhash\_diff(hash\_1, hash\_2):
"""calcuate the difference from two simhash values.
"""
x = (hash\_1 ^ hash\_2) & ((1 << 64) - 1)
ans = 0
while x:
ans += 1
x &= x - 1
return ans
simhash\_diff(hash\_1, hash\_2)
# 14
Use the Simhash Index
使用 SimhashIndex 可以以非常高效的方式查询近似重复的对象。
import re
from simhash import Simhash, SimhashIndex
defget\_features(s):
width = 3
s = s.lower()
s = re.sub(r'[^\w]+', '', s)
return [s[i:i + width] for i inrange(max(len(s) - width + 1, 1))]
data = {
1: u'How are you? I Am fine. blar blar blar blar blar Thanks.',
2: u'How are you i am fine. blar blar blar blar blar than',
3: u'This is simhash test.',
}
objs = [(str(k), Simhash(get\_features(v))) for k, v in data.items()]
index = SimhashIndex(objs, k=3)
print(index.bucket\_size())
# 11
s1 = Simhash(get\_features(u'How are you i am fine. blar blar blar blar blar thank'))
print(index.get\_near\_dups(s1))
# ['1']
index.add('4', s1)
print(index.get\_near\_dups(s1))
# ['1', '4']
参考文献
-
• Google paper - http://www.wwwconference.org/www2007/papers/paper215.pdf
-
• Simhash package - https://github.com/leonsim/simhash
点个「赞」+「在看」❤️
让我们知道这份文字有温暖到你,也是 我们持续 创作的最大动力!
推荐
亲测有效!如何用 Address Sanitizer 精准定位内存漏洞?附保姆级操作指南
要用 AI 裁员 50% 的千亿独角兽,公开认错,重启招聘!
single codebook和dual codebook在LLM中向量量化上有什么区别?
CosyVoice 2:基于大型语言模型的可扩展流式语音合成技术
Mini-Omni2: with Vision, Speech and Duplex Capabilities
亲测有效!如何用 Address Sanitizer 精准定位内存漏洞?附保姆级操作指南
