Deploying Llama 2 as an API Service


My Mac Studio M2 Ultra has 24 cores and 192 GB of RAM.

Deploying a llama-2 model on it and enabling remote API access takes just two simple steps:

1. Deploy the Llama 2 model with llama.cpp

Most open-source LLMs can be deployed easily with llama.cpp:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j

Follow this article to download the Llama-2 models. Several model variants can be downloaded into a local folder:

llama-2-13b/       
llama-2-13b-chat/  
llama-2-70b/       
llama-2-70b-chat/  
llama-2-7b/        
llama-2-7b-chat/

Convert the llama-2 model to llama.cpp's GGUF format and quantize it to 4 bits (q4_0). For example, for llama-2-70b:

python3 convert.py llama-2-70b
./quantize llama-2-70b/ggml-model-f16.gguf llama-2-70b/ggml-model-q4_0.gguf q4_0
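
As a rough sanity check on the quantization step, the size of a q4_0 file can be estimated from the parameter count. The sketch below assumes llama.cpp's q4_0 layout of 32 weights per block, stored as one 16-bit scale plus 32 four-bit values (18 bytes per block, or 4.5 bits per weight), and a round figure of 69 billion parameters for llama-2-70b. The function names and the parameter count are illustrative assumptions, and real files differ slightly because some tensors are stored in other formats.

```python
# Rough size estimate for a q4_0-quantized model (a sketch, not llama.cpp code).
QK4_0 = 32                            # weights per q4_0 block
BLOCK_BYTES = 2 + QK4_0 // 2          # fp16 scale + 32 packed 4-bit values = 18 bytes


def q4_0_size_bytes(n_params: int) -> int:
    """Estimated bytes needed to store n_params weights in q4_0 blocks."""
    n_blocks = -(-n_params // QK4_0)  # ceiling division
    return n_blocks * BLOCK_BYTES


def to_mib(n_bytes: int) -> float:
    """Convert bytes to MiB (2**20 bytes)."""
    return n_bytes / 2**20


# Assumption: llama-2-70b has roughly 69 billion parameters.
estimate = q4_0_size_bytes(69_000_000_000)
print(f"{to_mib(estimate):.0f} MiB")
```

Under these assumptions the estimate comes out at roughly 37,000 MiB, which is in line with the 37071.47 MB 'data' buffer reported in the server log.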

The model can now be served:

./bin/server -m  llama-2-70b/ggml-model-q4_0.gguf

....................................................................................................
llama_new_context_with_model: kv self size  =  160.00 MB
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Ultra
ggml_metal_init: picking default device: Apple M2 Ultra
...
ggml_metal_init: recommendedMaxWorkingSetSize  = 147456.00 MB
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: maxTransferRate               = built-in GPU
llama_new_context_with_model: compute buffer total size =  145.47 MB
llama_new_context_with_model: max tensor size =   205.08 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size = 37071.47 MB, (37071.91 / 147456.00)
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =     1.48 MB, (37073.39 / 147456.00)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =   162.00 MB, (37235.39 / 147456.00)
ggml_metal_add_buffer: allocated 'alloc           ' buffer, size =   144.02 MB, (37379.41 / 147456.00)

llama server listening at http://127.0.0.1:8080
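
The "kv self size = 160.00 MB" line in the log can be reproduced by hand. The sketch below assumes the published llama-2-70b architecture (80 layers, 8 key/value heads under grouped-query attention, head dimension 128), a 16-bit KV cache, and a context of 512 tokens, which matches the n_ctx value the server reports in its API response. The function name is illustrative.

```python
def kv_cache_bytes(n_layer: int, n_kv_head: int, head_dim: int,
                   n_ctx: int, bytes_per_value: int = 2) -> int:
    """Bytes for keys and values across all layers and context positions."""
    per_token_per_layer = 2 * n_kv_head * head_dim * bytes_per_value  # K and V
    return n_layer * n_ctx * per_token_per_layer


# Assumed llama-2-70b hyperparameters; n_ctx = 512 as reported by the server.
size = kv_cache_bytes(n_layer=80, n_kv_head=8, head_dim=128, n_ctx=512)
print(size / 2**20)  # 160.0
```

Because the cache grows linearly with n_ctx, a longer context window increases this figure proportionally.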

Next, test the API with curl: open another terminal and run the following command:

curl --request POST --url http://localhost:8080/completion --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'

You will see something like the following:

{
  "content": "\nStep 2 - Designing your site\nStep 3 - Developing it with HTML and CSS\nStep 4 - Creating dynamic content using JavaScript or other technologies such as PHP, Ruby on Rails, etc.\nStep 5 - Choosing an appropriate hosting service for the type of website you have built (shared vs dedicated). This will depend largely upon how much traffic to expect from users visiting your site and what kind of security measures need to be taken into account when choosing a host.\nStep 6 - Registering domain names if needed\nStep 7- Setting up email accounts so that people who visit your",
  "generation_settings": {
    "frequency_penalty": 0.0,
    "grammar": "",
    "ignore_eos": false,
    "logit_bias": [],
    "mirostat": 0,
    "mirostat_eta": 0.10000000149011612,
    "mirostat_tau": 5.0,
    "model": "llama-2-70b/ggml-model-q4_0.gguf",
    "n_ctx": 512,
    "n_keep": 0,
    "n_predict": 128,
    "n_probs": 0,
    "penalize_nl": true,
    "presence_penalty": 0.0,
    "repeat_last_n": 64,
    "repeat_penalty": 1.100000023841858,
    "seed": 4294967295,
    "stop": [],
    "stream": false,
    "temp": 0.800000011920929,
    "tfs_z": 1.0,
    "top_k": 40,
    "top_p": 0.949999988079071,
    "typical_p": 1.0
  },
  "model": "llama-2-70b/ggml-model-q4_0.gguf",
  "prompt": "Building a website can be done in 10 simple steps:",
  "stop": true,
  "stopped_eos": false,
  "stopped_limit": true,
  "stopped_word": false,
  "stopping_word": "",
  "timings": {
    "predicted_ms": 9224.515,
    "predicted_n": 127,
    "predicted_per_second": 13.767661497650554,
    "predicted_per_token_ms": 72.63397637795275,
    "prompt_ms": 413.057,
    "prompt_n": 14,
    "prompt_per_second": 33.893627271780886,
    "prompt_per_token_ms": 29.50407142857143
  },
  "tokens_cached": 141,
  "tokens_evaluated": 14,
  "tokens_predicted": 128,
  "truncated": false
}
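
The same request can also be sent from a script. The following is a minimal sketch using only the Python standard library; it assumes the server is listening on the address shown in the log and accepts the same JSON body as the curl command. The helper names build_completion_request and complete are illustrative and not part of llama.cpp.

```python
import json
import urllib.request


def build_completion_request(base_url: str, prompt: str,
                             n_predict: int = 128) -> urllib.request.Request:
    """Build a POST request for the server's /completion endpoint."""
    payload = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/completion",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url: str, prompt: str, n_predict: int = 128,
             timeout: float = 120.0) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_completion_request(base_url, prompt, n_predict)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Usage (requires the server to be running):
# result = complete("http://localhost:8080",
#                   "Building a website can be done in 10 simple steps:")
# print(result["content"])
```

Keeping request construction separate from sending makes it easy to inspect the URL and body without a running server.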

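In the response above, the timings block shows a generation speed of about 72.63 ms per token, or 13.77 tokens per second. The server log reports the same figures: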

eval time = 9224.51 ms/127 runs(72.63 ms per token, 13.77 tokens per second)
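
Both figures follow directly from the predicted_ms and predicted_n fields of the timings block. A small sketch of the arithmetic (the function name is illustrative):

```python
def throughput(predicted_ms: float, predicted_n: int) -> tuple[float, float]:
    """Return (milliseconds per token, tokens per second)."""
    ms_per_token = predicted_ms / predicted_n
    tokens_per_second = 1000.0 * predicted_n / predicted_ms
    return ms_per_token, tokens_per_second


# Values copied from the "timings" block of the response.
ms_per_token, tokens_per_second = throughput(predicted_ms=9224.515, predicted_n=127)
print(f"{ms_per_token:.2f} ms per token, {tokens_per_second:.2f} tokens per second")
# 72.63 ms per token, 13.77 tokens per second
```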

2. Enable remote API access

To allow API access from another machine on the same local network, simply set the server host to 0.0.0.0:

./bin/server -m  llama-2-70b/ggml-model-q4_0.gguf --host 0.0.0.0

Assuming your M2 Ultra's address is 192.168.4.24, you can access the API remotely from another machine with:

curl --request POST --url http://192.168.4.24:8080/completion --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'
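
If the remote request hangs or is refused, a quick first check is whether the server's port is reachable at all from the client machine. The following is a minimal sketch using only the Python standard library; the helper name is_port_open is illustrative, and the address in the usage comment is the example one from the text.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


# Usage, run from the other machine:
# print(is_port_open("192.168.4.24", 8080))
```

A False result usually points to the server still being bound to 127.0.0.1, a firewall rule, or a wrong address, rather than a problem with the request itself.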

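If you want to reach the API from anywhere on the internet, you can configure port forwarding on your WiFi router; see this article. Suppose your WAN IP address (that is, your public IP address) is 171.67.215.81 and external port 43931 is forwarded to internal port 8080. You can then access the API remotely with: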

curl --request POST --url http://171.67.215.81:43931/completion --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'

Original article: How to deploy Llama 2 as API on Mac Studio M2 Ultra and enable remote API access?

Translated and compiled by BimAnt; please credit the source when reposting.
