ChromaDB: An Open-Source Vector Embedding Database That Gives Your AI Applications Memory


1. Introduction

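The arrival of ChatGPT set off a new wave of generative-AI startups. Within a single month, four vector database startups raised new funding: Pinecone closed a $100 million Series B, the open-source database startup Weaviate B.V. raised a $50 million Series B, Chroma raised an $18 million seed round, and the open-source vector database startup Qdrant raised a $7.5 million seed round.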

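In today's digital age, an intelligent and efficient way to handle data is essential. This post looks at ChromaDB, an open-source vector embedding database that lets users perform semantic search. ChromaDB stores documents as dense vector embeddings, typically generated by transformer-based language models, which allows nuanced semantic retrieval of documents. In this post we demonstrate how to create and store embeddings in ChromaDB and how to retrieve semantically matching documents for a user query.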


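Chroma gives you tools to:

Store embeddings and their metadata
Embed documents and queries
Search embeddings

Chroma prioritizes:

Simplicity and developer productivity
Analysis on top of search
Speed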

Chroma consists of a client SDK and a server application, and is licensed under Apache 2.0.


2. Installation

We start by installing the required packages.


      
          

        
!pip install chromadb -q
!pip install sentence-transformers -q

For the demo we use a set of text files stored in a folder named "pets". Each file contains information about a different aspect of pet care.

Next, we connect to ChromaDB and create a collection. By default, ChromaDB uses the Sentence Transformers all-MiniLM-L6-v2 model to create embeddings.


        
import chromadb  
  
client = chromadb.Client()  
collection = client.create_collection("chroma_demo")


3. Adding Documents

We add a few documents to the collection, along with their metadata and unique IDs.


        
collection.add(  
    documents=["This is a document about cat", "This is a document about car"],  
    metadatas=[{"category": "animal"}, {"category": "vehicle"}],  
    ids=["id1", "id2"]  
)


4. Querying

Now we can query the collection. Let's search for the term "vehicle"; the result returned should be the document about the car.


        
results = collection.query(  
    query_texts=["vehicle"],  
    n_results=1  
)  
print(results)


        
{'ids': [['id2']],  
 'embeddings': None,  
 'documents': [['This is a document about car']],  
 'metadatas': [[{'category': 'vehicle'}]],  
 'distances': [[0.8069301247596741]]}
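The result is a dictionary of parallel, nested lists: the outer list has one entry per query text and the inner list has one entry per match. A minimal sketch of flattening that structure into one record per match, using the output printed above as a literal (the helper name `flatten_results` is hypothetical and not part of ChromaDB):

```python
def flatten_results(results):
    """Turn Chroma-style nested result lists into one record per match."""
    records = []
    # The outer index selects the query; the inner index selects the match.
    for q_idx, ids in enumerate(results["ids"]):
        for m_idx, doc_id in enumerate(ids):
            records.append({
                "query_index": q_idx,
                "id": doc_id,
                "document": results["documents"][q_idx][m_idx],
                "metadata": results["metadatas"][q_idx][m_idx],
                "distance": results["distances"][q_idx][m_idx],
            })
    return records


# Literal copy of the output printed above.
results = {
    "ids": [["id2"]],
    "embeddings": None,
    "documents": [["This is a document about car"]],
    "metadatas": [[{"category": "vehicle"}]],
    "distances": [[0.8069301247596741]],
}

records = flatten_results(results)
for record in records:
    print(record["id"], record["metadata"], round(record["distance"], 3))
```

With several query texts or a larger `n_results`, the same loop yields one record for every (query, match) pair.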

To load PDF documents, we additionally need the PyPDF2 library to parse them.


      
          

        
!pip install pypdf2

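The function below reads every PDF file in the "linux_doc" folder and stores the data in a list. Note that it extracts only the first page of each file.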


        
import os
from PyPDF2 import PdfReader

def read_pdf_files_from_folder(folder_path):
    """Return the file name and first-page text of every PDF in folder_path."""
    file_data = []

    for file_name in os.listdir(folder_path):
        if file_name.endswith(".pdf"):
            with open(os.path.join(folder_path, file_name), 'rb') as file:
                reader = PdfReader(file)
                # Only the first page is read; loop over reader.pages for the full text.
                text = reader.pages[0].extract_text()
                file_data.append({"file_name": file_name, "content": text})

    return file_data

folder_path = "linux_doc"  # your folder path
pdf_data = read_pdf_files_from_folder(folder_path)

for data in pdf_data:
    print(f"File Name: {data['file_name']}")
    print(f"Content: {data['content']}\n")


5. Reading Files from a Folder

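The output of the "vehicle" query is as expected: for the best-matching document it gives the id, document content, metadata, and distance value.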

Now let's add our pet documents to a collection. We first read all text files from the "pets" folder and store the data in a list.


        
import os

def read_files_from_folder(folder_path):
    """Return the file name and content of every .txt file in folder_path."""
    file_data = []

    for file_name in os.listdir(folder_path):
        if file_name.endswith(".txt"):
            # Assumes the files are UTF-8 encoded.
            with open(os.path.join(folder_path, file_name), 'r', encoding='utf-8') as file:
                content = file.read()
                file_data.append({"file_name": file_name, "content": content})

    return file_data

folder_path = "pets"
file_data = read_files_from_folder(folder_path)
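`os.listdir` returns entries in arbitrary order, so anything later derived from a file's position in the list can differ between runs. A sketch of a variant that sorts the file names first, assuming UTF-8 text files; `read_text_files_sorted` is a hypothetical name, and the temporary folder only stands in for "pets" so the example runs without the real data:

```python
import tempfile
from pathlib import Path


def read_text_files_sorted(folder_path):
    """Read every .txt file in folder_path, in sorted file-name order."""
    file_data = []
    for path in sorted(Path(folder_path).glob("*.txt")):
        file_data.append({
            "file_name": path.name,
            "content": path.read_text(encoding="utf-8"),
        })
    return file_data


# Demo on a throwaway folder with made-up files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "b_dogs.txt").write_text("Dogs need daily walks.", encoding="utf-8")
    Path(tmp, "a_cats.txt").write_text("Cats groom themselves.", encoding="utf-8")
    Path(tmp, "notes.md").write_text("ignored", encoding="utf-8")
    demo_data = read_text_files_sorted(tmp)

print([item["file_name"] for item in demo_data])
```

The returned list has the same shape as `file_data`, so it can be used as a drop-in replacement in the following steps.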


6. Adding File Contents to ChromaDB

We then build separate lists for the documents, metadata, and IDs, and add them to a collection.


        
documents = []  
metadatas = []  
ids = []  
  
for index, data in enumerate(file_data):  
    documents.append(data['content'])  
    metadatas.append({'source': data['file_name']})
    ids.append(str(index + 1))

pet_collection = client.create_collection("pet_collection")
  
pet_collection.add(  
    documents=documents,  
    metadatas=metadatas,  
    ids=ids  
)
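IDs identify records within a collection, so they should be unique, and positional IDs such as "1", "2" shift whenever files are added or removed. One possible alternative, sketched below, derives each ID from the file name and checks for duplicates before anything is sent to ChromaDB. The helper name `build_add_arguments` is hypothetical, and the sample entries are made up for illustration:

```python
def build_add_arguments(file_data):
    """Build the parallel documents/metadatas/ids lists for collection.add()."""
    documents, metadatas, ids = [], [], []
    for data in file_data:
        documents.append(data["content"])
        metadatas.append({"source": data["file_name"]})
        # Use the file name without its extension as a stable id.
        ids.append(data["file_name"].rsplit(".", 1)[0])
    if len(set(ids)) != len(ids):
        raise ValueError("ids must be unique within a collection")
    return documents, metadatas, ids


# Made-up sample entries in the same shape as file_data.
sample_file_data = [
    {"file_name": "Different Types of Pet Animals.txt",
     "content": "Pet animals come in all shapes and sizes."},
    {"file_name": "Training and Behaviour of Pets.txt",
     "content": "Training builds good habits."},
]

sample_documents, sample_metadatas, sample_ids = build_add_arguments(sample_file_data)
print(sample_ids)
```

The three returned lists can be passed to `add(documents=..., metadatas=..., ids=...)` in the same way as the lists built above.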


7. Performing Semantic Search

Now let's query the collection for the different kinds of pets people commonly own.


        
results = pet_collection.query(  
    query_texts=["What are the different kinds of pets people commonly own?"],  
    n_results=1  
)  
print(results)


        
{'ids': [['1']],  
 'embeddings': None,  
 'documents': [['Pet animals come in all shapes and sizes, each suited to different lifestyles and home environments. Dogs and cats are the most common, known for their companionship and unique personalities. Small mammals like hamsters, guinea pigs, and rabbits are often chosen for their low maintenance needs. Birds offer beauty and song, and reptiles like turtles and lizards can make intriguing pets. Even fish, with their calming presence, can be wonderful pets.']],  
 'metadatas': [[{'source': 'Different Types of Pet Animals.txt'}]],  
 'distances': [[0.7325009703636169]]}

Our query successfully retrieved the most relevant document, which discusses the different types of pet animals.
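Conceptually, the query text is embedded into a vector and the stored vectors closest to it are returned; a smaller value in `distances` means a closer match. A toy sketch with made-up 3-dimensional vectors and squared Euclidean distance, which is assumed here to mirror Chroma's default `l2` space. Real all-MiniLM-L6-v2 embeddings have 384 dimensions, and Chroma searches an approximate index rather than scanning every vector:

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query_vector, stored, n_results=1):
    """Return the n_results (id, distance) pairs closest to query_vector."""
    scored = [(doc_id, squared_l2(query_vector, vec)) for doc_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:n_results]


# Made-up vectors, for illustration only.
stored_vectors = {
    "pets_overview": [0.9, 0.1, 0.0],
    "pet_nutrition": [0.2, 0.8, 0.1],
    "pet_training": [0.1, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]

top = nearest(query_vector, stored_vectors, n_results=2)
print(top)
```

Here "pets_overview" ranks first because its vector lies closest to the query vector, which is the same reason the pet-types document is returned for the query above.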


8. Filtering Results

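To refine the search further, you can use the where_document parameter to specify a condition that the document text must satisfy. For example, to find documents about the emotional benefits of owning a pet that also mention reptiles, you can use the following query: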


        
results = pet_collection.query(  
    query_texts=["What are the emotional benefits of owning a pet?"],  
    n_results=1,  
    where_document={"$contains":"reptiles"}  
)  
print(results)

The result shows that the document about the emotional bond between humans and pets is the most relevant to our query.

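Similarly, to filter search results by metadata, you can use the where parameter. Suppose you want information about the emotional benefits of owning a pet, but retrieved specifically from the document about pet training and behaviour. You can do that with the following query: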


        
results = pet_collection.query(  
    query_texts=["What are the emotional benefits of owning a pet?"],  
    n_results=1,  
    where={"source": "Training and Behaviour of Pets.txt"}  
)  
print(results)

The result now shows the document about pet training and behaviour, as specified in the query.
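Both filters restrict which documents are eligible to be returned as matches. The sketch below imitates that behaviour in plain Python for the two simple forms used above: substring `$contains` on the document text and exact equality on a metadata field. It illustrates the semantics only and is not ChromaDB's implementation; the `matches` helper is hypothetical and the sample records are made up:

```python
def matches(record, where=None, where_document=None):
    """Check a record against simple equality and $contains filters."""
    if where is not None:
        # Every key in `where` must equal the record's metadata value.
        for key, expected in where.items():
            if record["metadata"].get(key) != expected:
                return False
    if where_document is not None:
        needle = where_document.get("$contains")
        if needle is not None and needle not in record["document"]:
            return False
    return True


# Made-up sample records.
sample_records = [
    {"id": "1",
     "document": "Dogs, cats and reptiles like turtles make good pets.",
     "metadata": {"source": "Different Types of Pet Animals.txt"}},
    {"id": "2",
     "document": "Consistent training shapes a pet's behaviour.",
     "metadata": {"source": "Training and Behaviour of Pets.txt"}},
]

by_text = [r["id"] for r in sample_records
           if matches(r, where_document={"$contains": "reptiles"})]
by_meta = [r["id"] for r in sample_records
           if matches(r, where={"source": "Training and Behaviour of Pets.txt"})]
print(by_text, by_meta)
```

Only the records that pass the filter would then be ranked by distance to the query embedding.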

8.1 Using a Different Embedding Model


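Although ChromaDB uses the Sentence Transformers all-MiniLM-L6-v2 model by default, you can use any other model to create embeddings. In this example we use the "paraphrase-MiniLM-L3-v2" model from Sentence Transformers.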

First, we load the model and create embeddings for the documents.


        
from sentence_transformers import SentenceTransformer  
  
model = SentenceTransformer('paraphrase-MiniLM-L3-v2')  
  
documents = []  
embeddings = []  
metadatas = []  
ids = []  
  
for index, data in enumerate(file_data):  
    documents.append(data['content'])  
    embedding = model.encode(data['content']).tolist()  
    embeddings.append(embedding)  
    metadatas.append({'source': data['file_name']})  
    ids.append(str(index + 1))

Then we create a new collection and add the documents, embeddings, metadata, and IDs to it.


        
pet_collection_emb = client.create_collection("pet_collection_emb")  
  
pet_collection_emb.add(  
    documents=documents,  
    embeddings=embeddings,  
    metadatas=metadatas,  
    ids=ids  
)

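Now, when we run a query, we need to supply the embedding of the query text rather than the text itself, and it must come from the same model that embedded the documents. Let's search again for the different kinds of pets people commonly own.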


        
query = "What are the different kinds of pets people commonly own?"  
input_em = model.encode(query).tolist()  
  
results = pet_collection_emb.query(  
    query_embeddings=[input_em],  
    n_results=1  
)  
print(results)

The result is similar to our earlier query: the same document about the different types of pet animals is returned.

Finally, let's run a more specific query about foods recommended for dogs.


        
query = "foods that are recommended for dogs?"  
input_em = model.encode(query).tolist()  
  
results = pet_collection_emb.query(  
    query_embeddings=[input_em],  
    n_results=1  
)  
print(results)

The result correctly returns the document about the nutritional needs of pet animals.


9. Conclusion

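ChromaDB is a powerful tool that lets us process and search data in a semantically meaningful way. It is flexible about which transformer model creates the embeddings and offers effective ways to narrow down search results. Whether you manage a small document collection or a large database, ChromaDB's semantic search can help you find the most relevant information quickly and accurately.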

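Machine learning and natural language processing open up a new world of possibilities in information retrieval, and ChromaDB is a great tool to have in your arsenal.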

The complete code as a Jupyter Notebook:

https://github.com/Crossme0809/langchain-tutorials/tree/main/ChromaDB
