[GoMate Framework Case Study] Top-10 Baseline for the iFLYTEK LLM RAG Intelligent Q&A Challenge


[RAG framework] GoMate: RAG Framework within Reliable input, Trusted output

[Project link]: https://github.com/gomate-community/GoMate

I. Competition Background

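RAG (retrieval-augmented generation) combines a retrieval model with a generation model: it retrieves relevant external knowledge to support text generation, which improves the accuracy and reliability of large language models (LLMs).

RAG is especially well suited to knowledge-intensive scenarios that need continually updated knowledge and to domain-specific applications. By bringing in external information sources, it mitigates the problems LLMs have with missing domain knowledge, factual inaccuracy, and fabricated content. This challenge aims to explore the limits of RAG and encourages developers, researchers, and enthusiasts to apply it to real problems and advance the field of artificial intelligence.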

II. Competition Task

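Participants must design and implement a RAG model that, given a question, retrieves relevant information from the knowledge base and then combines the retrieved information with the question itself to generate an accurate, comprehensive, and authoritative answer.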

III. Evaluation Rules

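1. Data description

The dataset may also include some unlabeled text, so participants need to use the retrieval-augmentation side of RAG to find the relevant information and generate an answer. This requires not only strong retrieval but also the ability to generate text that is accurate, coherent, and consistent with the context.

The test set consists of simulated user questions, which participants answer using the question and the corpus. Note that some of the questions cannot be answered, so participants need to design a suitable strategy for declining to answer.

• corpus.txt.zip: the corpus, one news article per line

• test_question.csv: the test questions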
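One simple way to handle the unanswerable questions mentioned above is to refuse whenever the best retrieval (or rerank) score falls below a threshold. The sketch below is only an illustration under assumptions: the `score` and `text` keys mirror the result dictionaries printed in the baseline, while `REFUSAL_ANSWER`, `min_score`, and the `generate` callable are hypothetical names, and the threshold would need tuning on real data.

```python
from typing import Callable, Dict, List

# Hypothetical refusal text; the wording the competition expects is not given in the source.
REFUSAL_ANSWER = "Unable to answer"


def should_refuse(docs: List[Dict], min_score: float = 0.3) -> bool:
    """Return True when no retrieved document scores at least min_score."""
    if not docs:
        return True
    return max(doc["score"] for doc in docs) < min_score


def answer_or_refuse(
    question: str,
    docs: List[Dict],
    generate: Callable[[str, str], str],
    min_score: float = 0.3,
) -> str:
    """Call the generator only when retrieval looks confident enough."""
    if should_refuse(docs, min_score):
        return REFUSAL_ANSWER
    # Join the retrieved passages into one context string for the generator.
    context = "\n".join(doc["text"] for doc in docs)
    return generate(question, context)
```

Because the score is character overlap against a reference answer, a wrong refusal and a wrong answer are both costly, so the threshold is worth validating on a small hand-labeled sample.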

2. Evaluation criteria

Answers to the test questions are scored by character overlap ratio, with a maximum score of 1.
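The exact formula behind the character overlap ratio is not given here, so the following is only one plausible reading, useful for rough local validation: count the characters shared by the prediction and the reference (as multisets) and divide by the longer of the two lengths. The function name `char_overlap` and the normalization choice are assumptions, not the official scorer.

```python
from collections import Counter


def char_overlap(prediction: str, reference: str) -> float:
    """Multiset character overlap divided by the longer string's length (range 0..1)."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        # Two empty strings are treated as identical.
        return 1.0
    # Counter intersection keeps the minimum count of each shared character.
    shared = sum((Counter(prediction) & Counter(reference)).values())
    return shared / longest
```

Dividing by the longer string penalizes both overly long answers and overly short ones, which keeps the score at 1 only for answers whose character multiset matches the reference.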

IV. Data Analysis

  • Retrieval corpus (figure not reproduced)
  • Text length (figure not reproduced)
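The text-length analysis can be reproduced roughly with a few lines of standard-library Python. This is a minimal sketch that assumes the unzipped corpus has one article per line, as described in the data section; the helper name `length_stats` is hypothetical.

```python
import statistics
from typing import Dict, Iterable


def length_stats(lines: Iterable[str]) -> Dict[str, float]:
    """Summarize character lengths of non-empty lines."""
    lengths = [len(line.strip()) for line in lines if line.strip()]
    if not lengths:
        return {"count": 0, "min": 0, "max": 0, "mean": 0.0, "median": 0.0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
    }


# Usage (the path is an assumption; adjust it to wherever corpus.txt was unzipped):
# with open("corpus.txt", encoding="utf-8") as f:
#     print(length_stats(f))
```

The length distribution is what motivates a chunk size such as the 1024 used in the baseline: articles much longer than that are the ones that get split.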

V. RAG Baseline Implementation


            
              
import pickle  
  
import pandas as pd  
from tqdm import tqdm  
  
from gomate.modules.document.chunk import TextChunker  
from gomate.modules.document.txt_parser import TextParser  
from gomate.modules.document.utils import PROJECT_BASE  
from gomate.modules.generator.llm import GLM4Chat  
from gomate.modules.reranker.bge_reranker import BgeRerankerConfig, BgeReranker  
from gomate.modules.retrieval.bm25s_retriever import BM25RetrieverConfig  
from gomate.modules.retrieval.dense_retriever import DenseRetrieverConfig  
from gomate.modules.retrieval.hybrid_retriever import HybridRetriever, HybridRetrieverConfig  
  
  
def generate_chunks():
    """Parse the corpus, split each article into chunks, and cache them to disk."""
    tp = TextParser()
    tc = TextChunker()
    paragraphs = tp.parse(r'H:/2024-Xfyun-RAG/data/corpus.txt', encoding="utf-8")
    print(len(paragraphs))
    chunks = []
    for content in tqdm(paragraphs):
        # Each article becomes a list of chunks of at most chunk_size
        chunk = tc.chunk_sentences([content], chunk_size=1024)
        chunks.append(chunk)

    with open(f'{PROJECT_BASE}/output/chunks.pkl', 'wb') as f:
        pickle.dump(chunks, f)
  
  
if __name__ == '__main__':

    # test_path="H:/2024-Xfyun-RAG/data/test_question.csv"
    # embedding_model_path="H:/pretrained_models/mteb/bge-m3"
    # llm_model_path="H:/pretrained_models/llm/Qwen2-1.5B-Instruct"

    test_path = "/data/users/searchgpt/yq/GoMate_dev/data/competitions/xunfei/test_question.csv"
    embedding_model_path = "/data/users/searchgpt/pretrained_models/bge-large-zh-v1.5"
    llm_model_path = "/data/users/searchgpt/pretrained_models/glm-4-9b-chat"
    # ==================== File parsing + chunking ====================
    generate_chunks()
    with open(f'{PROJECT_BASE}/output/chunks.pkl', 'rb') as f:
        chunks = pickle.load(f)
    # Flatten the per-article chunk lists into a single corpus list
    corpus = []
    for chunk in chunks:
        corpus.extend(chunk)
  
    # ==================== Retriever configuration ====================
    # BM25 and Dense Retriever configurations
    bm25_config = BM25RetrieverConfig(
        method='lucene',
        index_path='indexs/description_bm25.index',
        k1=1.6,
        b=0.7
    )
    bm25_config.validate()
    print(bm25_config.log_config())
    dense_config = DenseRetrieverConfig(
        model_name_or_path=embedding_model_path,
        dim=1024,
        index_path='indexs/dense_cache'
    )
    config_info = dense_config.log_config()
    print(config_info)
    # Hybrid Retriever configuration
    # BM25 and dense scores are not on the same scale, so a weighted merge is suggested
    hybrid_config = HybridRetrieverConfig(
        bm25_config=bm25_config,
        dense_config=dense_config,
        bm25_weight=0.7,  # weight of the BM25 retrieval results
        dense_weight=0.3  # weight of the dense retrieval results
    )
    hybrid_retriever = HybridRetriever(config=hybrid_config)
    # Build the index (uncomment on the first run)
    # hybrid_retriever.build_from_texts(corpus)
    # Save the index
    # hybrid_retriever.save_index()
    # Load the previously saved index
    hybrid_retriever.load_index()
  
    # ==================== Retrieval smoke test ====================
    query = "新冠肺炎疫情"
    results = hybrid_retriever.retrieve(query, top_k=5)
    # Output results
    for result in results:
        print(f"Text: {result['text']}, Score: {result['score']}")

    # ==================== Reranker configuration ====================
    reranker_config = BgeRerankerConfig(
        model_name_or_path="/data/users/searchgpt/pretrained_models/bge-reranker-large"
    )
    bge_reranker = BgeReranker(reranker_config)

    # ==================== Generator configuration ====================
    # qwen_chat = QwenChat(llm_model_path)
    glm4_chat = GLM4Chat(llm_model_path)
  
    # ==================== Retrieval-based question answering ====================
    test = pd.read_csv(test_path)
    answers = []
    for question in tqdm(test['question'], total=len(test)):
        # Retrieve candidates, then rerank them against the question
        search_docs = hybrid_retriever.retrieve(question)
        search_docs = bge_reranker.rerank(
            query=question,
            documents=[doc['text'] for doc in search_docs]
        )
        # print(search_docs)
        # Join the reranked passages with newlines (the original used '/n' by mistake)
        content = '\n'.join([f'信息[{idx}]:' + doc['text'] for idx, doc in enumerate(search_docs)])
        answer = glm4_chat.chat(prompt=question, content=content)
        answers.append(answer[0])
        print(question)
        print(answer[0])
        print("************************************\n")
    test['answer'] = answers

    test[['answer']].to_csv(f'{PROJECT_BASE}/output/gomate_baseline.csv', index=False)
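The comment in the hybrid configuration notes that BM25 and dense scores live on different scales. One common way to merge them, sketched below, is to min-max normalize each retriever's scores and then take a weighted sum using the same 0.7 / 0.3 weights as the baseline. This is an illustration only and is not claimed to be how GoMate's `HybridRetriever` works internally; `min_max` and `fuse_scores` are hypothetical helper names.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map them all to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse_scores(
    bm25: Dict[str, float],
    dense: Dict[str, float],
    bm25_weight: float = 0.7,
    dense_weight: float = 0.3,
) -> List[Tuple[str, float]]:
    """Weighted sum of normalized scores; a document missing from one list scores 0 there."""
    b, d = min_max(bm25), min_max(dense)
    fused = {
        doc_id: bm25_weight * b.get(doc_id, 0.0) + dense_weight * d.get(doc_id, 0.0)
        for doc_id in set(b) | set(d)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)
```

Normalizing before weighting matters because raw BM25 scores are unbounded while cosine-style dense scores are not, so without it the weights would not mean what they appear to mean.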
  

          