Llama.cpp 快速开始

基本用法

首先，你需要获取二进制文件。你可以使用以下几种方法：

方法 1：克隆此存储库并在本地构建，查看如何构建
方法 2：如果你使用的是 MacOS 或 Linux，则可以通过brew、flox 或 nix安装 llama.cpp
方法 3：使用 Docker 映像，请参阅Docker 文档
方法 4：从发行版下载预构建的二进制文件

您可以使用此命令运行基本补全：

llama-cli -m your_model.gguf -p "I believe the meaning of life is" -n 128

# Output:
# I believe the meaning of life is to find your own truth and to live in accordance with it. For me, this means being true to myself and following my passions, even if they don't align with societal expectations. I think that's what I love about yoga – it's not just a physical practice, but a spiritual one too. It's about connecting with yourself, listening to your inner voice, and honoring your own unique journey.

请参阅此页面以获取完整的参数列表。

对话模式

如果您想要更类似 ChatGPT 的体验，您可以通过传递-cnv参数以对话模式运行：

llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv

# Output:
# > hi, who are you?
# Hi there! I'm your helpful assistant! I'm an AI-powered chatbot designed to assist and provide information to users like you. I'm here to help answer your questions, provide guidance, and offer support on a wide range of topics. I'm a friendly and knowledgeable AI, and I'm always happy to help with anything you need. What's on your mind, and how can I assist you today?
#
# > what is 1+1?
# Easy peasy! The answer to 1+1 is... 2!

默认情况下，聊天模板将从输入模型中获取。如果您想使用其他聊天模板，请将其作为参数传递。查看支持的模板–chat-template NAME列表

./llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv --chat-template chatml

您还可以通过前缀、后缀和反向提示参数使用您自己的模板：

./llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv --in-prefix 'User: ' --reverse-prompt 'User:'

Web 服务器

llama.cpp 网络服务器是一个轻量级的OpenAI API兼容 HTTP 服务器，可用于为本地模型提供服务并轻松将它们连接到现有客户端。

使用示例：

./llama-server -m your_model.gguf --port 8080

# Basic web UI can be accessed via browser: http://localhost:8080
# Chat completion endpoint: http://localhost:8080/v1/chat/completions

交互模式
笔记

如果你喜欢基本使用，请考虑使用对话模式而不是交互模式

在此模式下，您可以随时按 Ctrl+C 并输入一行或多行文本来中断生成，这些文本将转换为标记并附加到当前上下文中。您还可以使用参数指定反向-r “reverse prompt string”提示。这将导致在生成过程中遇到反向提示字符串的精确标记时提示用户输入。典型的用法是使用提示，使 LLaMA 模拟多个用户（例如 Alice 和 Bob）之间的聊天，然后传递-r “Alice:”。

以下是使用命令调用的几次交互的示例

# default arguments using a 7B model
./examples/chat.sh

# advanced chat with a 13B model
./examples/chat-13B.sh

# custom arguments using a 13B model
./llama-cli -m ./models/13B/ggml-model-q4_0.gguf -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt

注意使用–color来区分用户输入和生成的文本。其他参数在示例程序的READMEllama-cli中有更详细的说明。

图像

持续互动

通过利用和，可以在./llama-cli调用时保存和恢复提示、用户输入和模型生成。该脚本演示了对长时间运行、可恢复聊天会话的支持。要使用此示例，您必须提供一个文件来缓存初始聊天提示和一个目录来保存聊天会话，并且可以选择提供与相同的变量。相同的提示缓存可以重复用于新的聊天会话。请注意，提示缓存和聊天目录都与初始提示（）和模型文件相关联。–prompt-cache–prompt-cache-all./examples/chat-persistent.shchat-13B.shPROMPT_TEMPLATE

# Start a new chat
PROMPT_CACHE_FILE=chat.prompt.bin CHAT_SAVE_DIR=./chat/default ./examples/chat-persistent.sh

# Resume that chat
PROMPT_CACHE_FILE=chat.prompt.bin CHAT_SAVE_DIR=./chat/default ./examples/chat-persistent.sh

# Start a different chat with the same prompt/model
PROMPT_CACHE_FILE=chat.prompt.bin CHAT_SAVE_DIR=./chat/another ./examples/chat-persistent.sh

# Different prompt cache for different prompt/model
PROMPT_TEMPLATE=./prompts/chat-with-bob.txt PROMPT_CACHE_FILE=bob.prompt.bin \
    CHAT_SAVE_DIR=./chat/bob ./examples/chat-persistent.sh

使用语法进行约束输出

llama.cpp支持语法来约束模型输出。例如，你可以强制模型仅输出 JSON：

./llama-cli -m ./models/13B/ggml-model-q4_0.gguf -n 256 --grammar-file grammars/json.gbnf -p 'Request: schedule a call at 8pm; Command:'

该grammars/文件夹包含一些示例语法。若要编写自己的语法，请查看GBNF 指南。

要编写更复杂的 JSON 语法，您还可以查看 https://grammar.intrinsiclabs.ai/ ，这是一个浏览器应用程序，可让您编写 TypeScript 接口，并将其编译为 GBNF 语法，您可以将其保存以供本地使用。请注意，该应用程序是由社区成员构建和维护的，请在其存储库中提交任何问题或 FR ，而不是在此存储库中。

作者：Jeebiz 创建时间：2024-11-28 16:16
最后编辑：Jeebiz 更新时间：2024-11-28 16:22