10. Model Evaluation

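Testing an AI application requires evaluating the generated content to make sure the AI model has not produced a hallucinated response.

One way to evaluate a response is to use an AI model itself. Choose the AI model best suited to the evaluation; it does not have to be the model that generated the response.

The Spring AI interface for evaluating responses is Evaluator, defined as: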

@FunctionalInterface
public interface Evaluator {
    EvaluationResponse evaluate(EvaluationRequest evaluationRequest);
}

The input to an evaluation is the EvaluationRequest, defined as:

public class EvaluationRequest {

    private final String userText;

    private final List<Content> dataList;

    private final String responseContent;

    public EvaluationRequest(String userText, List<Content> dataList, String responseContent) {
        this.userText = userText;
        this.dataList = dataList;
        this.responseContent = responseContent;
    }

  ...
}

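userText: the raw input from the user, as a String.
dataList: contextual data, for example from Retrieval Augmented Generation, appended to the raw input.
responseContent: the AI model's response content, as a String.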
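
Because Evaluator is a functional interface, an evaluator can be written as a single lambda. The sketch below illustrates that shape with stand-in types (SketchRequest, SketchResponse, SketchEvaluator) that only mirror the definitions above; they are hypothetical and are not the Spring AI classes. The keyword-overlap rule is a toy assumption chosen so the example runs without a model; a real evaluator would typically delegate the judgment to an AI model.

```java
import java.util.List;
import java.util.Locale;

// Stand-in types that mirror the shapes shown above. They are NOT the Spring AI classes.
record SketchRequest(String userText, List<String> dataList, String responseContent) {}

record SketchResponse(boolean pass, String feedback) {}

@FunctionalInterface
interface SketchEvaluator {
    SketchResponse evaluate(SketchRequest request);
}

class EvaluatorSketch {

    // Toy rule (an assumption for illustration): pass when the response contains
    // at least one word longer than three characters taken from the context data.
    static final SketchEvaluator CONTEXT_OVERLAP = request -> {
        String response = request.responseContent().toLowerCase(Locale.ROOT);
        for (String doc : request.dataList()) {
            for (String word : doc.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (word.length() > 3 && response.contains(word)) {
                    return new SketchResponse(true, "matched context word: " + word);
                }
            }
        }
        return new SketchResponse(false, "no overlap with context");
    };

    public static void main(String[] args) {
        SketchRequest request = new SketchRequest(
            "Where is the Eiffel Tower?",
            List.of("The Eiffel Tower is located in Paris."),
            "It is in Paris.");
        SketchResponse result = CONTEXT_OVERLAP.evaluate(request);
        System.out.println(result.pass() + " (" + result.feedback() + ")");
    }
}
```

Keeping the evaluator behind a one-method interface means a test can swap a cheap rule like this for a model-backed evaluator without changing the calling code.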

Usage in Integration Tests

The following example uses RelevancyEvaluator in an integration test to validate the result of a RAG flow built with RetrievalAugmentationAdvisor:

@Test
void evaluateRelevancy() {
    String question = "Where does the adventure of Anacletus and Birba take place?";

    RetrievalAugmentationAdvisor ragAdvisor = RetrievalAugmentationAdvisor.builder()
        .documentRetriever(VectorStoreDocumentRetriever.builder()
            .vectorStore(pgVectorStore)
            .build())
        .build();

    ChatResponse chatResponse = ChatClient.builder(chatModel).build()
        .prompt(question)
        .advisors(ragAdvisor)
        .call()
        .chatResponse();

    EvaluationRequest evaluationRequest = new EvaluationRequest(
        // The original user question
        question,
        // The retrieved context from the RAG flow
        chatResponse.getMetadata().get(RetrievalAugmentationAdvisor.DOCUMENT_CONTEXT),
        // The AI model's response
        chatResponse.getResult().getOutput().getText()
    );

    RelevancyEvaluator evaluator = new RelevancyEvaluator(ChatClient.builder(chatModel));

    EvaluationResponse evaluationResponse = evaluator.evaluate(evaluationRequest);

    assertThat(evaluationResponse.isPass()).isTrue();
}

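The Spring AI project contains several integration tests that use RelevancyEvaluator to test the functionality of QuestionAnswerAdvisor (see tests) and RetrievalAugmentationAdvisor (see tests).

Custom Template

RelevancyEvaluator uses a default prompt template to ask the AI model for an evaluation. You can customize this behavior by supplying your own PromptTemplate object through the .promptTemplate() builder method.

A custom PromptTemplate can use any TemplateRenderer implementation (by default it uses StPromptTemplate, which is based on the StringTemplate engine). The important requirement is that the template contains the following placeholders:

A query placeholder to receive the user's question.
A response placeholder to receive the AI model's response.
A context placeholder to receive the context information.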
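
The placeholder requirement can be checked before a template is handed to the evaluator. The following is a minimal sketch using only the Java standard library: the class TemplatePlaceholderSketch, its methods, and the sample template wording are hypothetical, and the naive {name} substitution is only a stand-in for a real TemplateRenderer, not how Spring AI renders templates.

```java
import java.util.List;
import java.util.Map;

class TemplatePlaceholderSketch {

    // The three placeholders the text above says a custom template must contain.
    static final List<String> REQUIRED = List.of("query", "response", "context");

    // Returns the required placeholders that do not appear in the template as {name}.
    static List<String> missingPlaceholders(String template) {
        return REQUIRED.stream()
            .filter(name -> !template.contains("{" + name + "}"))
            .toList();
    }

    // Naive substitution of {name} tokens (illustration only, no escaping).
    static String render(String template, Map<String, String> values) {
        String result = template;
        for (Map.Entry<String, String> entry : values.entrySet()) {
            result = result.replace("{" + entry.getKey() + "}", entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical template wording; only the placeholder names come from the text.
        String template = "Is the response relevant to the query, given the context?\n"
            + "Query: {query}\nResponse: {response}\nContext: {context}\n";

        List<String> missing = missingPlaceholders(template);
        if (!missing.isEmpty()) {
            throw new IllegalArgumentException("Template is missing placeholders: " + missing);
        }
        System.out.println(render(template, Map.of(
            "query", "Where is the Eiffel Tower?",
            "response", "It is in Paris.",
            "context", "The Eiffel Tower is located in Paris.")));
    }
}
```

Failing fast on a missing placeholder in a test setup is cheaper than discovering it through a confusing evaluation result.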

FactCheckingEvaluator

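FactCheckingEvaluator is another implementation of the Evaluator interface, designed to assess the factual accuracy of AI-generated responses against provided context. It helps detect and reduce hallucinations in AI output by verifying whether a given statement (the claim) is logically supported by the provided context (the document).

The claim and the document are submitted to an AI model for evaluation. Smaller, more efficient AI models dedicated to this purpose are available, such as Bespoke's Minicheck, which helps reduce the cost of these checks compared with flagship models like GPT-4. Minicheck is also available through Ollama.

Usage

The FactCheckingEvaluator constructor takes a ChatClient.Builder as its parameter: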

public FactCheckingEvaluator(ChatClient.Builder chatClientBuilder) {
  this.chatClientBuilder = chatClientBuilder;
}

The evaluator uses the following prompt template for fact checking:

Document: {document}
Claim: {claim}

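Here, {document} is the context information and {claim} is the AI model's response to be evaluated.

Example

The following example shows how to use FactCheckingEvaluator with an Ollama-based ChatModel, specifically the Bespoke-Minicheck model: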
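
To make the template concrete, here is a minimal sketch of how such a prompt could be filled in and how a reply could be turned into a pass/fail result. It uses only the Java standard library. The class FactCheckPromptSketch, the Function<String, String> stand-in for a model, and the assumption that the model answers "yes" or "no" are all illustrative and are not taken from the FactCheckingEvaluator source.

```java
import java.util.function.Function;

class FactCheckPromptSketch {

    // Same two-line template as shown above.
    static final String TEMPLATE = "Document: {document}\nClaim: {claim}";

    // Fills the template with naive string replacement (illustration only).
    static String buildPrompt(String document, String claim) {
        return TEMPLATE.replace("{document}", document).replace("{claim}", claim);
    }

    // Assumption: the model replies "yes" when the document supports the claim.
    // Any other reply, including null, is treated as unsupported.
    static boolean isSupported(Function<String, String> model, String document, String claim) {
        String reply = model.apply(buildPrompt(document, claim));
        return reply != null && reply.trim().equalsIgnoreCase("yes");
    }

    public static void main(String[] args) {
        // Fake model for demonstration only; a real setup would call a chat model.
        Function<String, String> fakeModel =
            prompt -> prompt.contains("fourth planet") ? "no" : "yes";

        String document = "The Earth is the third planet from the Sun.";
        System.out.println(isSupported(fakeModel, document, "The Earth is the fourth planet from the Sun."));
        System.out.println(isSupported(fakeModel, document, "The Earth is the third planet from the Sun."));
    }
}
```

Treating every reply other than an explicit "yes" as a failure is a conservative choice: an unexpected or empty answer from the model does not silently pass the check.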

@Test
void testFactChecking() {
    // Set up the Ollama API
    OllamaApi ollamaApi = OllamaApi.builder().baseUrl("http://localhost:11434").build();

    ChatModel chatModel = OllamaChatModel.builder().ollamaApi(ollamaApi).defaultOptions(
            OllamaOptions.builder().model(BESPOKE_MINICHECK).numPredict(2).temperature(0.0d).build()).build();

    // Create the FactCheckingEvaluator
    var factCheckingEvaluator = new FactCheckingEvaluator(ChatClient.builder(chatModel));

    // Example context and claim
    String context = "The Earth is the third planet from the Sun and the only astronomical object known to harbor life.";
    String claim = "The Earth is the fourth planet from the Sun.";

    // Create an EvaluationRequest
    EvaluationRequest evaluationRequest = new EvaluationRequest(context, Collections.emptyList(), claim);

    // Perform the evaluation
    EvaluationResponse evaluationResponse = factCheckingEvaluator.evaluate(evaluationRequest);

    Assertions.assertFalse(evaluationResponse.isPass(), "The claim should not be supported by the context");

}

Author: Jeebiz. Created: 2025-08-03 23:52
Last edited by: Jeebiz. Updated: 2025-09-28 09:15
