A Concise GGML Tutorial

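ggml is a machine learning (ML) library written in C and C++ with a focus on Transformer inference. The project is open source and is being actively developed by a growing community. ggml is similar to ML libraries such as PyTorch and TensorFlow, but it is still at an early stage of development and some of its fundamentals are still changing rapidly.

Over time, ggml has gained popularity alongside other projects such as llama.cpp and whisper.cpp. Many other projects also use ggml under the hood to enable on-device LLMs, including ollama, jan, LM Studio, and GPT4All.

The main reasons people choose ggml over other libraries are: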

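Minimalism: The core library is self-contained in fewer than 5 files. You may want to include additional files for GPU support, but that is optional.
Easy compilation: You don't need fancy build tools. Without GPU support, you only need GCC or Clang!
Lightweight: The compiled binary is smaller than 1 MB, which is tiny compared to PyTorch (which usually takes hundreds of MB).
Good compatibility: It supports many types of hardware, including x86_64, ARM, Apple Silicon, and CUDA.
Support for quantized tensors: Tensors can be quantized to save memory (similar to JPEG compression) and, in some cases, to improve performance.
Extremely memory efficient: The overhead of storing tensors and performing computations is minimal.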
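
To make the quantization point more concrete, here is a minimal sketch of block-wise 8-bit quantization in plain C. It is an illustration under stated assumptions, not ggml's actual storage format: the names block_q8, quantize_block, dequantize_block, and max_roundtrip_error, as well as the block size of 32, are hypothetical choices made for this example. Each block stores one float scale plus one signed byte per value, so 32 floats (128 bytes) shrink to 36 bytes on typical platforms, at the cost of a small rounding error.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE 32 /* values per block (an assumption for this sketch) */

/* One quantized block: a shared scale plus one signed byte per value. */
typedef struct {
    float  scale;
    int8_t q[BLOCK_SIZE];
} block_q8;

/* Quantize BLOCK_SIZE floats so that the largest magnitude maps to 127. */
static void quantize_block(const float *x, block_q8 *out) {
    float amax = 0.0f;
    for (int i = 0; i < BLOCK_SIZE; i++) {
        const float a = x[i] < 0.0f ? -x[i] : x[i];
        if (a > amax) amax = a;
    }
    out->scale = amax / 127.0f;
    const float inv = out->scale > 0.0f ? 1.0f / out->scale : 0.0f;
    for (int i = 0; i < BLOCK_SIZE; i++) {
        const float v = x[i] * inv;
        /* round to nearest by adding +/-0.5 before truncation */
        out->q[i] = (int8_t) (v >= 0.0f ? v + 0.5f : v - 0.5f);
    }
}

/* Recover approximate floats from a block. */
static void dequantize_block(const block_q8 *in, float *y) {
    for (int i = 0; i < BLOCK_SIZE; i++) {
        y[i] = in->scale * (float) in->q[i];
    }
}

/* Largest absolute error after quantizing and dequantizing one block. */
static float max_roundtrip_error(const float *x) {
    block_q8 b;
    float y[BLOCK_SIZE];
    quantize_block(x, &b);
    dequantize_block(&b, y);
    float err = 0.0f;
    for (int i = 0; i < BLOCK_SIZE; i++) {
        const float e = x[i] < y[i] ? y[i] - x[i] : x[i] - y[i];
        if (e > err) err = e;
    }
    return err;
}
```

The round-trip error of each value is bounded by roughly half the block's scale, which is why the comparison with lossy JPEG compression is a reasonable mental model.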

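However, ggml also has some disadvantages that you should keep in mind when using it (this list may change in future versions of ggml):

Not all tensor operations are supported on all backends. For example, some may work on CPU but not on CUDA.
Development with ggml may not be straightforward and may require deep knowledge of low-level programming.
The project is under active development, so breaking changes are expected.

This article focuses on the fundamentals of ggml for developers who want to get started with the library. It does not cover higher-level tasks such as LLM inference with llama.cpp, which builds upon ggml. Instead, it explores the core concepts and basic usage of ggml, providing a solid foundation for further learning and development.

1. Getting started with ggml

Great, so how do you start?

For simplicity, this guide shows how to compile ggml on Ubuntu. In reality, you can compile ggml on virtually any platform (including Windows, macOS, and BSD).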

# Start by installing build dependencies
# "gdb" is optional, but is recommended
sudo apt install build-essential cmake git gdb

# Then, clone the repository
git clone https://github.com/ggerganov/ggml.git
cd ggml

# Try compiling one of the examples
cmake -B build
cmake --build build --config Release --target simple-ctx

# Run the example
./build/bin/simple-ctx

The expected output:

mul mat (4 x 3) (transposed result):
[ 60.00 55.00 50.00 110.00
 90.00 54.00 54.00 126.00
 42.00 29.00 28.00 64.00 ]

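If you see the expected result, everything is working!

2. Terminology and concepts

Before diving into ggml, we should understand some key concepts. If you come from high-level libraries such as PyTorch or TensorFlow, these may seem hard to grasp at first. Keep in mind, however, that ggml is a low-level library, and understanding these terms gives you much more control over performance:

ggml_context: A "container" that holds objects such as tensors, graphs, and optionally data.
ggml_cgraph: Represents a computation graph. Think of it as the "order of computation" that will be handed to the backend.
ggml_backend: Represents an interface for executing computation graphs. There are many types of backends: CPU (default), CUDA, Metal (Apple Silicon), Vulkan, RPC, and so on.
ggml_backend_buffer_type: Represents a buffer type. Think of it as a "memory allocator" attached to each ggml_backend. For example, to perform computation on a GPU, you need to allocate memory on the GPU via its buffer_type (usually abbreviated as buft).
ggml_backend_buffer: Represents a buffer allocated by a buffer_type. Remember: a single buffer can hold the data of multiple tensors.
ggml_gallocr: Represents a graph memory allocator, used to efficiently allocate the tensors used in a computation graph.
ggml_backend_sched: A scheduler that enables the concurrent use of multiple backends. When dealing with large models or multiple GPUs, it can distribute computation across different hardware (such as GPU and CPU). The scheduler can also automatically assign operations that the GPU does not support to the CPU, ensuring good resource utilization and compatibility.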
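
The fallback behavior described for ggml_backend_sched can be pictured with a small, self-contained sketch. This is not the scheduler's real algorithm; the op set, the gpu_supports predicate, and assign_devices are hypothetical names invented for illustration. The sketch places each operation on the GPU when a kernel exists, falls back to the CPU otherwise, and counts how many contiguous runs ("splits") the graph is cut into.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { OP_ADD, OP_MUL_MAT, OP_CUSTOM } op_kind; /* hypothetical op set */
typedef enum { DEV_CPU, DEV_GPU } device;

/* Assumption for this sketch: the GPU backend lacks a kernel for OP_CUSTOM. */
static bool gpu_supports(op_kind op) {
    return op != OP_CUSTOM;
}

/* Assign each op to the GPU when supported, otherwise fall back to the CPU.
 * Returns the number of splits: maximal runs of consecutive ops on one device. */
static size_t assign_devices(const op_kind *ops, size_t n, device *out) {
    size_t splits = 0;
    for (size_t i = 0; i < n; i++) {
        out[i] = gpu_supports(ops[i]) ? DEV_GPU : DEV_CPU;
        if (i == 0 || out[i] != out[i - 1]) {
            splits++;
        }
    }
    return splits;
}
```

For the sequence MUL_MAT, ADD, CUSTOM, MUL_MAT this yields three splits (GPU, CPU, GPU). Every boundary between splits implies copying tensor data between devices, which is one reason an unsupported operation in the middle of a graph can be costly.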

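3. Simple example

In this example, we go through the steps needed to reproduce the code we ran in the getting-started section. We need to create 2 matrices, multiply them, and get the result. Using PyTorch, the code looks like this: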

import torch

# Create two matrices
matrix1 = torch.tensor([
  [2, 8],
  [5, 1],
  [4, 2],
  [8, 6],
])
matrix2 = torch.tensor([
  [10, 5],
  [9, 9],
  [5, 4],
])

# Perform matrix multiplication
result = torch.matmul(matrix1, matrix2.T)
print(result.T)

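With ggml, the following steps are needed to achieve the same result:

Allocate a ggml_context to store tensor data
Create tensors and set data
Create a ggml_cgraph for the mul_mat operation
Run the computation
Retrieve results (output tensors)
Free memory and exit

NOTE: In this example, we allocate the tensor data inside the ggml_context for simplicity. In practice, memory should be allocated as a device buffer, as shown in the next section.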
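
Because a context is a single block of memory whose size must be decided up front, it helps to see why the size calculation in this example adds some slack on top of the raw tensor sizes. The following is a minimal sketch of a fixed-size bump allocator, assuming 16-byte alignment; the arena type and its functions are hypothetical and do not reflect ggml's internal layout, which also stores per-object headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* A fixed-size arena: one malloc up front, then bump-pointer allocation. */
typedef struct {
    unsigned char *base;
    size_t size;
    size_t used;
} arena;

/* Returns 1 on success, 0 if the underlying malloc failed. */
static int arena_init(arena *a, size_t size) {
    a->base = (unsigned char *) malloc(size);
    a->size = size;
    a->used = 0;
    return a->base != NULL;
}

/* Hands out n bytes at the next 16-byte boundary.
 * Returns NULL when the arena was sized too small. */
static void *arena_alloc(arena *a, size_t n) {
    const size_t align = 16; /* alignment is an assumption of this sketch */
    const size_t start = (a->used + align - 1) / align * align;
    if (start > a->size || n > a->size - start) {
        return NULL;
    }
    a->used = start + n;
    return a->base + start;
}

/* Everything is released at once, like freeing a whole context. */
static void arena_free(arena *a) {
    free(a->base);
    a->base = NULL;
    a->size = 0;
    a->used = 0;
}
```

With the matrix sizes used in this article (32, 24, and 48 bytes of float data), an arena of exactly 104 bytes cannot fit the third allocation, because alignment padding moves it to offset 64. Adding a little headroom, as the example's size calculation does, avoids that failure.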

First, create a new directory examples/demo:

cd ggml # make sure you're in the project root

# create the directory, then the C source and CMakeLists files
mkdir -p examples/demo
touch examples/demo/demo.c
touch examples/demo/CMakeLists.txt

The code for this example is based on simple-ctx.cpp.

Edit examples/demo/demo.c so that it has the following content:

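#include "ggml.h"
#if defined(__has_include)
#  if __has_include("ggml-cpu.h")
#    include "ggml-cpu.h" // recent ggml versions declare ggml_graph_compute_with_ctx() here
#  endif
#endif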
#include <string.h>
#include <stdio.h>

int main(void) {
    // initialize data of matrices to perform matrix multiplication
    // enum constants are integer constant expressions in C, so the array below is not a VLA
    enum { rows_A = 4, cols_A = 2 };
    float matrix_A[rows_A * cols_A] = {
        2, 8,
        5, 1,
        4, 2,
        8, 6
    };
    enum { rows_B = 3, cols_B = 2 };
    float matrix_B[rows_B * cols_B] = {
        10, 5,
        9, 9,
        5, 4
    };

    // 1. Allocate `ggml_context` to store tensor data
    // Calculate the size needed to allocate
    size_t ctx_size = 0;
    ctx_size += rows_A * cols_A * ggml_type_size(GGML_TYPE_F32); // tensor a
    ctx_size += rows_B * cols_B * ggml_type_size(GGML_TYPE_F32); // tensor b
    ctx_size += rows_A * rows_B * ggml_type_size(GGML_TYPE_F32); // result
    ctx_size += 3 * ggml_tensor_overhead(); // metadata for 3 tensors
    ctx_size += ggml_graph_overhead(); // compute graph
    ctx_size += 1024; // some overhead (exact calculation omitted for simplicity)

    // Allocate `ggml_context` to store tensor data
    struct ggml_init_params params = {
        /*.mem_size   =*/ ctx_size,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // 2. Create tensors and set data
    struct ggml_tensor * tensor_a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, cols_A, rows_A);
    struct ggml_tensor * tensor_b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, cols_B, rows_B);
    memcpy(tensor_a->data, matrix_A, ggml_nbytes(tensor_a));
    memcpy(tensor_b->data, matrix_B, ggml_nbytes(tensor_b));


    // 3. Create a `ggml_cgraph` for mul_mat operation
    struct ggml_cgraph * gf = ggml_new_graph(ctx);

    // result = a*b^T
    // Pay attention: ggml_mul_mat(A, B) ==> B will be transposed internally
    // the result is transposed
    struct ggml_tensor * result = ggml_mul_mat(ctx, tensor_a, tensor_b);

    // Mark the "result" tensor to be computed
    ggml_build_forward_expand(gf, result);

    // 4. Run the computation
    int n_threads = 1; // Optional: number of threads to perform some operations with multi-threading
    ggml_graph_compute_with_ctx(ctx, gf, n_threads);

    // 5. Retrieve results (output tensors)
    float * result_data = (float *) result->data;
    printf("mul mat (%d x %d) (transposed result):\n[", (int) result->ne[0], (int) result->ne[1]);
    for (int j = 0; j < result->ne[1] /* rows */; j++) {
        if (j > 0) {
            printf("\n");
        }

        for (int i = 0; i < result->ne[0] /* cols */; i++) {
            printf(" %.2f", result_data[j * result->ne[0] + i]);
        }
    }
    printf(" ]\n");

    // 6. Free memory and exit
    ggml_free(ctx);
    return 0;
}

In the examples/demo/CMakeLists.txt file you created, write the following lines:

set(TEST_TARGET demo)
add_executable(${TEST_TARGET} demo.c)
target_link_libraries(${TEST_TARGET} PRIVATE ggml)

Edit examples/CMakeLists.txt and add this line at the end:

add_subdirectory(demo)

Compile and run:

cmake -B build
cmake --build build --config Release --target demo

# Run it
./build/bin/demo

The expected output:

mul mat (4 x 3) (transposed result):
[ 60.00 55.00 50.00 110.00
 90.00 54.00 54.00 126.00
 42.00 29.00 28.00 64.00 ]
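
The "(transposed result)" label is easier to follow next to a plain C reference. The sketch below is not ggml code; mul_mat_ref is a hypothetical helper that reproduces the numbers above, assuming the layout implied by the printing loop in the example: element (i, j) of the result is the dot product of row i of A and row j of B, stored at index j * rows_a + i.

```c
#include <assert.h>

/* Reference for result = A * B^T with row-major inputs that share the inner
 * dimension k. A is (rows_a x k) and B is (rows_b x k). The output holds
 * rows_b rows of rows_a values: out[j * rows_a + i] = dot(A row i, B row j). */
static void mul_mat_ref(const float *a, int rows_a,
                        const float *b, int rows_b,
                        int k, float *out) {
    for (int j = 0; j < rows_b; j++) {
        for (int i = 0; i < rows_a; i++) {
            float sum = 0.0f;
            for (int c = 0; c < k; c++) {
                sum += a[i * k + c] * b[j * k + c];
            }
            out[j * rows_a + i] = sum;
        }
    }
}
```

Calling mul_mat_ref with the 4 x 2 and 3 x 2 matrices from this section fills a 12-element array with 60, 55, 50, 110, then 90, 54, 54, 126, then 42, 29, 28, 64, which matches the three printed rows. The first printed row is therefore row 0 of B multiplied against every row of A.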

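4. Example with a backend

A "backend" in ggml refers to an interface that can handle tensor operations. A backend can be CPU, CUDA, Vulkan, and so on.

The backend abstracts the execution of computation graphs. Once a graph is defined, it can be computed on the available hardware using the corresponding backend implementation. Note that ggml automatically reserves memory for any intermediate tensors required by the computation and optimizes memory usage based on the lifetime of these intermediate results.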
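
The idea of optimizing memory by tensor lifetime can be sketched in a few lines of plain C. This is only an illustration of the principle, not the strategy ggml_gallocr actually implements; node_info, peak_with_reuse, and total_without_reuse are hypothetical names. Each intermediate result records the index of the last node that reads it, and its memory is considered free again right after that node runs.

```c
#include <assert.h>
#include <stddef.h>

/* One intermediate result, listed in execution order (hypothetical type). */
typedef struct {
    size_t size;     /* bytes produced by this node */
    int    last_use; /* index of the last node that reads this result;
                        use n (the node count) for the final output */
} node_info;

/* Peak bytes needed if each result is released right after its last use. */
static size_t peak_with_reuse(const node_info *nodes, int n) {
    size_t live = 0, peak = 0;
    for (int i = 0; i < n; i++) {
        live += nodes[i].size;         /* allocate the output of node i */
        if (live > peak) peak = live;
        for (int t = 0; t <= i; t++) { /* release results no longer needed */
            if (nodes[t].last_use == i) live -= nodes[t].size;
        }
    }
    return peak;
}

/* Bytes needed if every intermediate result keeps its own allocation. */
static size_t total_without_reuse(const node_info *nodes, int n) {
    size_t total = 0;
    for (int i = 0; i < n; i++) total += nodes[i].size;
    return total;
}
```

For a chain of four 100-byte results where each one is read only by the next node, the peak with reuse is 200 bytes, compared with 400 bytes when nothing is reused. Deep models have long chains of this kind, so the savings grow with depth.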

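When doing computation or inference with a backend, the common steps are:

Initialize a ggml_backend
Allocate a ggml_context to store tensor metadata (we don't need to allocate tensor data right away)
Create tensor metadata (only their shapes and data types)
Allocate a ggml_backend_buffer to store all tensors
Copy tensor data from main memory (RAM) to the backend buffer
Create a ggml_cgraph for the mul_mat operation
Create a ggml_gallocr for cgraph allocation
Optionally: schedule the cgraph using ggml_backend_sched
Run the computation
Retrieve results (output tensors)
Free memory and exit

The code for this example is based on simple-backend.cpp.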

#include "ggml.h"
#include "ggml-alloc.h"
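#include "ggml-backend.h"
#if defined(__has_include)
#  if __has_include("ggml-cpu.h")
#    include "ggml-cpu.h" // recent ggml versions declare the CPU backend functions here
#  endif
#endif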
#ifdef GGML_USE_CUDA
#include "ggml-cuda.h"
#endif

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

int main(void) {
    // initialize data of matrices to perform matrix multiplication
    // enum constants are integer constant expressions in C, so the array below is not a VLA
    enum { rows_A = 4, cols_A = 2 };
    float matrix_A[rows_A * cols_A] = {
        2, 8,
        5, 1,
        4, 2,
        8, 6
    };
    enum { rows_B = 3, cols_B = 2 };
    float matrix_B[rows_B * cols_B] = {
        10, 5,
        9, 9,
        5, 4
    };

    // 1. Initialize backend
    ggml_backend_t backend = NULL;
#ifdef GGML_USE_CUDA
    fprintf(stderr, "%s: using CUDA backend\n", __func__);
    backend = ggml_backend_cuda_init(0); // init device 0
    if (!backend) {
        fprintf(stderr, "%s: ggml_backend_cuda_init() failed\n", __func__);
    }
#endif
    // if no GPU backend is available, fall back to the CPU backend
    if (!backend) {
        backend = ggml_backend_cpu_init();
    }

    // Calculate the size needed to allocate
    size_t ctx_size = 0;
    ctx_size += 2 * ggml_tensor_overhead(); // tensors
    // no need to allocate anything else!

    // 2. Allocate `ggml_context` to store tensor metadata
    struct ggml_init_params params = {
        /*.mem_size   =*/ ctx_size,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true, // the tensors will be allocated later by ggml_backend_alloc_ctx_tensors()
    };
    struct ggml_context * ctx = ggml_init(params);

    // 3. Create tensor metadata (only their shapes and data types)
    struct ggml_tensor * tensor_a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, cols_A, rows_A);
    struct ggml_tensor * tensor_b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, cols_B, rows_B);

    // 4. Allocate a `ggml_backend_buffer` to store all tensors
    ggml_backend_buffer_t buffer = ggml_backend_alloc_ctx_tensors(ctx, backend);

    // 5. Copy tensor data from main memory (RAM) to backend buffer
    ggml_backend_tensor_set(tensor_a, matrix_A, 0, ggml_nbytes(tensor_a));
    ggml_backend_tensor_set(tensor_b, matrix_B, 0, ggml_nbytes(tensor_b));

    // 6. Create a `ggml_cgraph` for mul_mat operation
    struct ggml_cgraph * gf = NULL;
    struct ggml_context * ctx_cgraph = NULL;
    {
        // create a temporary context to build the graph
        struct ggml_init_params params0 = {
            /*.mem_size   =*/ ggml_tensor_overhead()*GGML_DEFAULT_GRAPH_SIZE + ggml_graph_overhead(),
            /*.mem_buffer =*/ NULL,
            /*.no_alloc   =*/ true, // the tensors will be allocated later by ggml_gallocr_alloc_graph()
        };
        ctx_cgraph = ggml_init(params0);
        gf = ggml_new_graph(ctx_cgraph);

        // result = a*b^T
        // Pay attention: ggml_mul_mat(A, B) ==> B will be transposed internally
        // the result is transposed
        struct ggml_tensor * result0 = ggml_mul_mat(ctx_cgraph, tensor_a, tensor_b);

        // Add "result" tensor and all of its dependencies to the cgraph
        ggml_build_forward_expand(gf, result0);
    }

    // 7. Create a `ggml_gallocr` for cgraph computation
    ggml_gallocr_t allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
    ggml_gallocr_alloc_graph(allocr, gf);

    // (we skip step 8. Optionally: schedule the cgraph using `ggml_backend_sched`)

    // 9. Run the computation
    int n_threads = 1; // Optional: number of threads to perform some operations with multi-threading
    if (ggml_backend_is_cpu(backend)) {
        ggml_backend_cpu_set_n_threads(backend, n_threads);
    }
    ggml_backend_graph_compute(backend, gf);

    // 10. Retrieve results (output tensors)
    // in this example, output tensor is always the last tensor in the graph
    // ggml_graph_node(gf, -1) returns the last node; older ggml versions exposed gf->nodes directly
    struct ggml_tensor * result = ggml_graph_node(gf, -1);
    float * result_data = malloc(ggml_nbytes(result));
    // because the tensor data is stored in device buffer, we need to copy it back to RAM
    ggml_backend_tensor_get(result, result_data, 0, ggml_nbytes(result));
    printf("mul mat (%d x %d) (transposed result):\n[", (int) result->ne[0], (int) result->ne[1]);
    for (int j = 0; j < result->ne[1] /* rows */; j++) {
        if (j > 0) {
            printf("\n");
        }

        for (int i = 0; i < result->ne[0] /* cols */; i++) {
            printf(" %.2f", result_data[j * result->ne[0] + i]);
        }
    }
    printf(" ]\n");
    free(result_data);

    // 11. Free memory and exit
    ggml_free(ctx_cgraph);
    ggml_gallocr_free(allocr);
    ggml_free(ctx);
    ggml_backend_buffer_free(buffer);
    ggml_backend_free(backend);
    return 0;
}

Compile and run it. You should get the same result as in the previous example:

cmake -B build
cmake --build build --config Release --target demo

# Run it
./build/bin/demo

The expected result:

mul mat (4 x 3) (transposed result):
[ 60.00 55.00 50.00 110.00
 90.00 54.00 54.00 126.00
 42.00 29.00 28.00 64.00 ]

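5. Printing the computation graph

A ggml_cgraph represents the computation graph, which defines the order of operations that the backend will execute. Printing the graph can be a helpful debugging tool, especially when working with more complex models and computations.

You can add ggml_graph_print to print the cgraph: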

...

// Mark the "result" tensor to be computed
ggml_build_forward_expand(gf, result0);

// Print the cgraph
ggml_graph_print(gf);

The output:

=== GRAPH ===
n_nodes = 1
 -   0: [     4,     3,     1]          MUL_MAT  
n_leafs = 2
 -   0: [     2,     4]     NONE           leaf_0
 -   1: [     2,     3]     NONE           leaf_1
========================================
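
The split into one node and two leafs follows from how the graph is collected: inputs that carry no operation are recorded as leafs, and tensors produced by an operation are recorded as nodes, each after its sources. The sketch below mimics that idea with a post-order traversal. It is a simplified model with hypothetical types (tensor, graph, graph_expand), not the implementation behind ggml_build_forward_expand.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ITEMS 32

typedef enum { T_NONE, T_MUL_MAT, T_ADD } t_op; /* T_NONE marks an input tensor */

typedef struct tensor {
    t_op op;
    struct tensor *src[2]; /* operands, NULL when unused */
    int visited;
} tensor;

typedef struct {
    tensor *nodes[MAX_ITEMS]; int n_nodes; /* operations, in execution order */
    tensor *leafs[MAX_ITEMS]; int n_leafs; /* inputs with no operation */
} graph;

/* Post-order walk: sources are recorded before the tensor that consumes them,
 * so the nodes array ends up in a valid execution order. */
static void graph_expand(graph *g, tensor *t) {
    if (t == NULL || t->visited) return;
    t->visited = 1;
    graph_expand(g, t->src[0]);
    graph_expand(g, t->src[1]);
    if (t->op == T_NONE) {
        assert(g->n_leafs < MAX_ITEMS);
        g->leafs[g->n_leafs++] = t;
    } else {
        assert(g->n_nodes < MAX_ITEMS);
        g->nodes[g->n_nodes++] = t;
    }
}
```

Expanding a MUL_MAT tensor whose two sources are plain inputs gives one node and two leafs, mirroring the n_nodes = 1 and n_leafs = 2 lines in the printed graph.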

Additionally, you can dump the cgraph in Graphviz DOT format:

ggml_graph_dump_dot(gf, NULL, "debug.dot");

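You can use the dot command (for example, dot -Tpng debug.dot -o debug.png) or an online Graphviz viewer to render debug.dot into a final image.

6. Conclusion

This article has provided an introductory overview of ggml, covering the key concepts, a simple usage example, and an example using a backend. While we have covered the basics, there is much more to explore when it comes to ggml.

In upcoming articles, we will dive deeper into other ggml-related topics, such as the GGUF format, quantization, and how the different backends are organized and used. Additionally, you can visit the ggml examples directory to see more advanced use cases and sample code. Stay tuned for more ggml content in the future!

Original article: Introduction to ggml

Translated and edited by BimAnt. Please credit the source when reposting.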
