Weaviate

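This section walks you through setting up the Weaviate VectorStore to store document embeddings and perform similarity searches.

What is Weaviate?

Weaviate is an open-source vector database. It lets you store data objects and vector embeddings from your favorite ML models and scale seamlessly to billions of data objects. It provides tools to store document embeddings, content, and metadata, and to search through those embeddings, including metadata filtering.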

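Prerequisites

  • An EmbeddingClient instance to compute the document embeddings. Several options are available:
    • Transformers Embedding - computes the embeddings in your local environment. Follow the ONNX Transformers Embedding instructions.
    • OpenAI Embedding - if you use OpenAI to create the document embeddings, you also need to create an account at OpenAI Signup and generate a token at API Keys.
    • You can also use the Azure OpenAI Embedding or the PostgresML Embedding Client.
  • A Weaviate cluster. You can start a cluster locally in a Docker container or create a Weaviate Cloud Service. For the latter, you need to create a Weaviate account, set up a cluster, and get your access API key from the dashboard details.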

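On startup, the WeaviateVectorStore creates the required SpringAiWeaviate object schema if it has not been provisioned already.

Dependencies

Add these dependencies to your project:

  • An Embedding Client boot starter, required for computing embeddings.
  • Transformers Embedding (local); follow the ONNX Transformers Embedding instructions.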
<dependency>
  <groupId>org.springframework.ai</groupId>
  <artifactId>spring-ai-transformers-spring-boot-starter</artifactId>
</dependency>

or use OpenAI (cloud):

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
</dependency>

You also need to provide your OpenAI API key. Set it as an environment variable like so:

export SPRING_AI_OPENAI_API_KEY='Your_OpenAI_API_Key'
  • Add the Weaviate VectorStore dependency:
<dependency>
  <groupId>org.springframework.ai</groupId>
  <artifactId>spring-ai-weaviate-store</artifactId>
</dependency>

Usage

Create a WeaviateVectorStore instance connected to the local Weaviate cluster:

@Bean
public VectorStore vectorStore(EmbeddingClient embeddingClient) {
  WeaviateVectorStoreConfig config = WeaviateVectorStoreConfig.builder()
     .withScheme("http")
     .withHost("localhost:8080")
     // Define the metadata fields to be used
     // in the similarity search filters.
     .withFilterableMetadataFields(List.of(
        MetadataField.text("country"),
        MetadataField.number("year"),
        MetadataField.bool("active")))
     // Consistency level can be: ONE, QUORUM, or ALL.
     .withConsistencyLevel(ConsistentLevel.ONE)
     .build();

  return new WeaviateVectorStore(config, embeddingClient);
}

[NOTE]
You must explicitly list the name and type (BOOLEAN, TEXT, or NUMBER) of every metadata key used in a filter expression.
The withFilterableMetadataFields call above registers three filterable metadata fields: country of type TEXT, year of type NUMBER, and active of type BOOLEAN.

If you extend the filterable metadata fields with new entries, you have to (re)upload or update the documents that carry this metadata.

You can use the following Weaviate system metadata fields (https://weaviate.io/developers/weaviate/api/graphql/filters#special-cases) without explicit definition: id, _creationTimeUnix, and _lastUpdateTimeUnix.
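
The rule in the note can be pictured with a small plain-Java check. This is only a sketch under stated assumptions: the class FilterableFieldCheckSketch, its Type enum, and both helper methods are hypothetical names made up for illustration and are not part of Spring AI or the Weaviate client. It treats a key as usable in a filter when it is either declared explicitly or one of the three system fields listed above.

```java
import java.util.Map;
import java.util.Set;

/** Illustrative only: hypothetical helper, not a Spring AI or Weaviate API. */
final class FilterableFieldCheckSketch {

    /** The three field types named in the note. */
    enum Type { TEXT, NUMBER, BOOLEAN }

    /** System fields the note lists as usable without explicit definition. */
    static final Set<String> SYSTEM_FIELDS =
        Set.of("id", "_creationTimeUnix", "_lastUpdateTimeUnix");

    /** True if the key is declared explicitly or is a system field. */
    static boolean isFilterable(Map<String, Type> declared, String key) {
        return declared.containsKey(key) || SYSTEM_FIELDS.contains(key);
    }

    /** True if a metadata value matches the declared type of its key. */
    static boolean matchesType(Type type, Object value) {
        return switch (type) {
            case TEXT -> value instanceof String;
            case NUMBER -> value instanceof Number;
            case BOOLEAN -> value instanceof Boolean;
        };
    }

    public static void main(String[] args) {
        // Mirrors the fields registered in the configuration above.
        Map<String, Type> declared = Map.of(
            "country", Type.TEXT, "year", Type.NUMBER, "active", Type.BOOLEAN);
        System.out.println("country filterable: " + isFilterable(declared, "country"));
        System.out.println("city filterable: " + isFilterable(declared, "city"));
    }
}
```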

Then, in your main code, create some documents:

List<Document> documents = List.of(
   new Document("Spring AI rocks!! Spring AI rocks!! Spring AI rocks!! Spring AI rocks!! Spring AI rocks!!", Map.of("country", "UK", "active", true, "year", 2020)),
   new Document("The World is Big and Salvation Lurks Around the Corner", Map.of()),
   new Document("You walk forward facing the past and you turn back toward the future.", Map.of("country", "NL", "active", false, "year", 2023)));

Now add the documents to your vector store:

vectorStore.add(documents);

Finally, retrieve documents similar to a query:

List<Document> results = vectorStore.similaritySearch(
   SearchRequest
      .query("Spring")
      .withTopK(5));

If all goes well, you should retrieve the document containing the text "Spring AI rocks!!".
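
To build some intuition for what a top-K similarity search with a threshold returns, here is a minimal plain-Java sketch. It is an assumption-laden illustration, not how Weaviate or Spring AI implement the search: it ranks by brute-force cosine similarity, whereas a vector database typically relies on an approximate index. The class CosineTopKSketch, the Entry record, and the two-dimensional toy vectors are all hypothetical.

```java
import java.util.Comparator;
import java.util.List;

/** Illustrative only: brute-force cosine ranking, not Weaviate's actual index. */
final class CosineTopKSketch {

    /** A stored text together with its (hypothetical) embedding vector. */
    record Entry(String text, double[] vector) {}

    /** Cosine similarity of two equal-length, non-zero vectors. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns up to topK texts with similarity >= threshold, most similar first. */
    static List<String> search(List<Entry> store, double[] query, int topK, double threshold) {
        return store.stream()
            .filter(e -> cosine(e.vector(), query) >= threshold)
            .sorted(Comparator.comparingDouble((Entry e) -> -cosine(e.vector(), query)))
            .limit(topK)
            .map(Entry::text)
            .toList();
    }

    public static void main(String[] args) {
        // Toy vectors standing in for real embeddings.
        List<Entry> store = List.of(
            new Entry("Spring AI rocks!!", new double[] {1.0, 0.0}),
            new Entry("The World is Big", new double[] {0.0, 1.0}),
            new Entry("You walk forward", new double[] {0.7, 0.7}));
        System.out.println(search(store, new double[] {1.0, 0.1}, 2, 0.0));
    }
}
```

Raising the threshold narrows the result set before the top-K cut is applied, which is the same role withSimilarityThreshold plays in a SearchRequest.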

Metadata filtering

You can also use the generic, portable metadata filters with the WeaviateVectorStore.

For example, you can use the text expression language:

vectorStore.similaritySearch(
   SearchRequest
      .query("The World")
      .withTopK(TOP_K)
      .withSimilarityThreshold(SIMILARITY_THRESHOLD)
      .withFilterExpression("country in ['UK', 'NL'] && year >= 2020"));

or build the filter programmatically with the expression DSL:

FilterExpressionBuilder b = new FilterExpressionBuilder();

vectorStore.similaritySearch(
   SearchRequest
      .query("The World")
      .withTopK(TOP_K)
      .withSimilarityThreshold(SIMILARITY_THRESHOLD)
      .withFilterExpression(b.and(
         b.in("country", "UK", "NL"),
         b.gte("year", 2020)).build()));

The portable filter expressions are automatically converted into Weaviate where filters.

For example, the following portable filter expression:

country in ['UK', 'NL'] && year >= 2020

is converted into this Weaviate GraphQL where filter expression:

operator:And
   operands:
      [{
         operator:Or
         operands:
            [{
               path:["meta_country"]
               operator:Equal
               valueText:"UK"
            },
            {
               path:["meta_country"]
               operator:Equal
               valueText:"NL"
            }]
      },
      {
         path:["meta_year"]
         operator:GreaterThanEqual
         valueNumber:2020
      }]
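
The shape of that conversion can be sketched in a few lines of plain Java. This is not Spring AI's converter: the class WhereFilterSketch and its methods are hypothetical, and the "meta_" prefix is an assumption inferred from the paths in the output above. The sketch shows the one non-obvious step, namely that an "in" clause expands into an Or over one Equal operand per value.

```java
import java.util.List;

/** Illustrative only: mimics the shape of the conversion; not Spring AI's converter. */
final class WhereFilterSketch {

    /** Assumed prefix, inferred from the "meta_country" paths in the example output. */
    static final String META_PREFIX = "meta_";

    /** key == value for a text field. */
    static String equalText(String key, String value) {
        return "{path:[\"" + META_PREFIX + key + "\"] operator:Equal valueText:\"" + value + "\"}";
    }

    /** key >= value for a numeric field. */
    static String gteNumber(String key, long value) {
        return "{path:[\"" + META_PREFIX + key + "\"] operator:GreaterThanEqual valueNumber:" + value + "}";
    }

    /** Combines operands under a logical operator such as And or Or. */
    static String combine(String operator, List<String> operands) {
        return "{operator:" + operator + " operands:[" + String.join(",", operands) + "]}";
    }

    /** "key in [v1, v2, ...]" expands to an Or over one Equal per value. */
    static String inText(String key, List<String> values) {
        return combine("Or", values.stream().map(v -> equalText(key, v)).toList());
    }

    public static void main(String[] args) {
        // country in ['UK', 'NL'] && year >= 2020
        System.out.println(combine("And", List.of(
            inText("country", List.of("UK", "NL")),
            gteNumber("year", 2020))));
    }
}
```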

Run a Weaviate cluster in a Docker container

Start Weaviate in a Docker container:

docker run -it --rm --name weaviate -e AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED=true -e PERSISTENCE_DATA_PATH=/var/lib/weaviate -e QUERY_DEFAULTS_LIMIT=25 -e DEFAULT_VECTORIZER_MODULE=none -e CLUSTER_HOSTNAME=node1 -p 8080:8080 semitechnologies/weaviate:1.22.4

This starts a Weaviate cluster at http://localhost:8080/v1 with scheme=http, host=localhost:8080, and apiKey="".
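
Before wiring up the vector store, it can help to confirm that the container is accepting requests. The sketch below uses the JDK's built-in HTTP client against Weaviate's readiness endpoint, /v1/.well-known/ready. The class WeaviateReadinessSketch and its method names are hypothetical, and the scheme and host simply mirror the values used in this guide; adjust them if your setup differs.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/** Illustrative only: a small readiness probe for a local Weaviate instance. */
final class WeaviateReadinessSketch {

    /** Builds the readiness URL from the same scheme and host used in the store config. */
    static URI readinessUri(String scheme, String host) {
        return URI.create(scheme + "://" + host + "/v1/.well-known/ready");
    }

    /** Returns true only if the endpoint answers with HTTP 200 within the timeout. */
    static boolean isReady(String scheme, String host) {
        HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(2))
            .build();
        HttpRequest request = HttpRequest.newBuilder(readinessUri(scheme, host))
            .timeout(Duration.ofSeconds(2))
            .GET()
            .build();
        try {
            return client.send(request, HttpResponse.BodyHandlers.discarding()).statusCode() == 200;
        } catch (IOException e) {
            // Connection refused or timed out: treat as not ready.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("Weaviate ready: " + isReady("http", "localhost:8080"));
    }
}
```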

Author: Jeebiz  Created: 2024-04-05 23:19
Last edited by: Jeebiz  Updated: 2024-07-06 19:00