3.3.1 Getting embeddings from a dataset

Source article: https://cookbook.openai.com/examples/get_embeddings_from_dataset


1. Load the dataset

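The dataset used in this example is fine-food reviews from Amazon. It contains a total of 568,454 food reviews left by Amazon users up to October 2012. For illustration purposes, we use a subset of the 1,000 most recent reviews. The reviews are in English and tend to be either positive or negative. Each review has a ProductId, a UserId, a Score, a review title (Summary), and a review body (Text).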

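We combine the review summary and the review text into a single combined text. The model encodes this combined text and outputs a single vector embedding.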
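
As a minimal sketch of that combination step, the plain-Python function below builds the "Title: ...; Content: ..." string for a single review, in the format the notebook code uses for the `combined` column. The function name `combine_review` and the sample strings are illustrative and not part of the original notebook.

```python
def combine_review(summary: str, text: str) -> str:
    """Join a review's summary and body into one string to embed."""
    # Strip surrounding whitespace from each part, mirroring str.strip() on the columns
    return "Title: " + summary.strip() + "; Content: " + text.strip()


# Made-up sample review, for illustration only
example = combine_review("  Great snack ", "Crunchy and not too sweet.\n")
print(example)  # Title: Great snack; Content: Crunchy and not too sweet.
```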

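To run this notebook, you need to install: pandas, openai, transformers, plotly, matplotlib, scikit-learn, torch (a transformers dependency), torchvision, and scipy. The code below also imports tiktoken, so that package must be available as well.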

# imports
import pandas as pd
import tiktoken

from utils.embeddings_utils import get_embedding

# embedding model parameters
embedding_model = "text-embedding-ada-002"
embedding_encoding = "cl100k_base"  # this is the encoding for text-embedding-ada-002
max_tokens = 8000  # the maximum for text-embedding-ada-002 is 8191

# load & inspect dataset
input_datapath = "data/fine_food_reviews_1k.csv"  # to save space, we provide a pre-filtered dataset
df = pd.read_csv(input_datapath, index_col=0)
df = df[["Time", "ProductId", "UserId", "Score", "Summary", "Text"]]
df = df.dropna()
df["combined"] = (
    "Title: " + df.Summary.str.strip() + "; Content: " + df.Text.str.strip()
)
df.head(2)

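	Time	ProductId	UserId	Score	Summary	Text	combined
0	1351123200	B003XPF9BO	A3R7JR3FMEBXQB	5	where does one start...and stop... with a...	Wanted to save some for my Chicago family...	Title: where does one start...and stop... wit...
1	1351123200	B003JK537S	A3JBPC3WFUT5ZP	1	Arrived in pieces	Not pleased at all. When I opened the box, most...	Title: Arrived in pieces; Content: Not pleased...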

# subsample to 1k most recent reviews and remove samples that are too long
top_n = 1000
df = df.sort_values("Time").tail(top_n * 2)  # first cut to the 2k most recent entries, assuming less than half will be filtered out
df.drop("Time", axis=1, inplace=True)

encoding = tiktoken.get_encoding(embedding_encoding)

# omit reviews that are too long to embed
df["n_tokens"] = df.combined.apply(lambda x: len(encoding.encode(x)))
df = df[df.n_tokens <= max_tokens].tail(top_n)
len(df)

1000
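
The subsampling logic above can be sketched in plain Python, without pandas or tiktoken. This is an illustration under stated assumptions: `count_tokens` is a hypothetical whitespace-based stand-in that does not reproduce cl100k_base token counts, and the sample records are made up.

```python
def count_tokens(text: str) -> int:
    """Hypothetical stand-in tokenizer: counts whitespace-separated words."""
    return len(text.split())


def most_recent_within_limit(reviews, top_n, max_tokens):
    """Keep the top_n most recent reviews whose text fits within max_tokens."""
    # Sort oldest to newest, then keep a 2x buffer of the most recent entries
    recent = sorted(reviews, key=lambda r: r["Time"])[-top_n * 2:]
    # Drop reviews that are too long to embed
    short_enough = [r for r in recent if count_tokens(r["combined"]) <= max_tokens]
    # Of what remains, keep the top_n most recent
    return short_enough[-top_n:]


# Made-up records for illustration
reviews = [
    {"Time": 1, "combined": "old review"},
    {"Time": 2, "combined": "short"},
    {"Time": 3, "combined": "this one is far too long to keep here"},
    {"Time": 4, "combined": "newest review"},
]
kept = most_recent_within_limit(reviews, top_n=2, max_tokens=3)
print([r["Time"] for r in kept])  # [2, 4]
```

The 2x buffer is what lets the review at Time 2 fill the slot left by the over-long review at Time 3.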

2. Get embeddings and save them for future reuse
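
The `get_embedding` helper imported in the first code cell comes from `utils/embeddings_utils.py`, which is not reproduced here. The sketch below shows roughly what such a helper can look like, assuming the v1-style `openai` Python client, whose `embeddings.create` method returns a response with a `data` list. The name `get_embedding_sketch` and the `FakeEmbeddings` stub are illustrative; the stub exists only so the example runs without an API key or network access.

```python
from types import SimpleNamespace


def get_embedding_sketch(text, client, model="text-embedding-ada-002"):
    """Return the embedding vector for one string using an OpenAI-style client."""
    # Replace newlines with spaces before sending the text (an assumption of this sketch)
    text = text.replace("\n", " ")
    response = client.embeddings.create(input=[text], model=model)
    return response.data[0].embedding


class FakeEmbeddings:
    """Illustrative stub that mimics the shape of client.embeddings."""

    def create(self, input, model):
        # Return one fake 3-dimensional vector per input string
        data = [SimpleNamespace(embedding=[float(len(s)), 0.0, 1.0]) for s in input]
        return SimpleNamespace(data=data)


fake_client = SimpleNamespace(embeddings=FakeEmbeddings())
vector = get_embedding_sketch("Title: ok;\nContent: fine", fake_client)
print(vector)
```

With the real service, the `client` argument would be an `OpenAI()` instance from the `openai` package, which by default reads the API key from the `OPENAI_API_KEY` environment variable.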

# Ensure you have your API key set in your environment per the README: https://github.com/openai/openai-python#usage

# This may take a few minutes
df["embedding"] = df.combined.apply(lambda x: get_embedding(x, model=embedding_model))
df.to_csv("data/fine_food_reviews_with_embeddings_1k.csv")
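
When the DataFrame is written to CSV, each embedding list is stored as its string representation, so it has to be parsed back into numbers when the file is reloaded. The sketch below illustrates that round trip with the standard-library `csv` module and an in-memory buffer, used here as stand-ins for pandas and a file on disk; the two-dimensional vectors are made up. With pandas, the same parsing can be applied to the column, for example with `df["embedding"].apply(ast.literal_eval)`.

```python
import ast
import csv
import io

# Made-up rows standing in for reviews with their embeddings
rows = [
    {"Score": 5, "embedding": [0.1, -0.2]},
    {"Score": 1, "embedding": [0.3, 0.4]},
]

# Write to an in-memory CSV; each list is serialized via str(), e.g. "[0.1, -0.2]"
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Score", "embedding"])
writer.writeheader()
writer.writerows(rows)

# Read back: every field is a string, so parse the embedding into a list of floats
buffer.seek(0)
loaded = []
for row in csv.DictReader(buffer):
    row["embedding"] = ast.literal_eval(row["embedding"])
    loaded.append(row)

print(loaded[0]["embedding"])  # [0.1, -0.2]
```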

Author: Jeebiz  Created: 2023-12-27 23:10
Last edited by: Jeebiz  Updated: 2023-12-28 16:43