BM25 Bows Out, BMX Takes the Stage!

Paper notes. Paper title: BMX: Entropy-weighted Similarity and Semantic-enhanced Lexical Search. Code repository: https://github.com/mixedbread-ai/baguetter

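The algorithm behind most text search engines is related to BM25 in one way or another. BM25's main strength is that it performs very well on out-of-distribution data; in other words, it copes well with data it has never seen before. But keyword search has its own limitations: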

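BM25 does not take into account the similarity between the query and a given document, a signal that would allow a more accurate assessment of that document's relevance to the query.
Lexical search algorithms lack semantic understanding, so they cannot handle linguistic nuances such as synonyms and homonyms. This limitation is a key reason lexical search underperforms semantic search built on domain-specific text embeddings.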
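
To make the second limitation concrete, here is a minimal sketch of a plain Okapi BM25 scorer. It is not Baguetter's implementation: the function name `bm25_scores`, the whitespace tokenizer, and the `k1`/`b` defaults are all assumptions chosen for illustration. It shows that a query made only of synonyms scores zero against every document, because BM25 only rewards exact token matches.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a plain BM25 formula.

    Illustrative only: lowercase whitespace tokenization, no stemming.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue  # no exact token match, so no contribution
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


docs = [
    "crusty sourdough bread",
    "artisanal baguettes with a crisp crust",
]

# Exact keywords: only the first document shares tokens with the query.
exact = bm25_scores("crusty bread", docs)

# Synonyms only: no shared tokens, so every score is zero.
synonym = bm25_scores("crunchy loaf", docs)

print(exact)
print(synonym)
```

The second query means roughly the same thing as the first, yet both documents score zero for it; this is the gap that the semantic enhancement in BMX is meant to narrow.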

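The paper therefore proposes BMX. It is simple to compute and, according to the authors, outperforms all BM25 variants: indexing and search are not noticeably slower, yet retrieval quality is clearly better.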



        
          
# pip install baguetter  
  
from baguetter.indices import BMXSparseIndex  
   
# Initialize BM𝒳 index  
bmx = BMXSparseIndex()  
   
# Add bakery items to the index  
docs = [  
    "Freshly crusty baked sourdough bread with a crispy crust",  
    "Flaky croissants made with French butter",  
    "Chocolate chip cookies with chunks of dark chocolate",  
    "Cinnamon rolls with cream cheese frosting",  
    "Artisanal baguettes with a soft interior and crusty exterior"  
]  
keys = list(range(len(docs)))  
   
bmx.add_many(keys=keys, values=docs)  
   
# Search for bread  
query = "crusty bread"  
results = bmx.search(query, top_k=2)  
   
print(results)  
# SearchResults(keys=[0, 4], scores=array([2.5519667 , 0.97304875], dtype=float32), normalized=False)