Magika: AI-Powered File Type Detection
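Magika is a novel AI-powered file type detection tool that relies on recent advances in deep learning to provide accurate detection. Under the hood, Magika employs a custom, highly optimized Keras model that weighs only about 1MB and identifies files precisely within milliseconds, even when running on a single CPU.
In an evaluation on over 1 million files spanning more than 100 content types (covering both binary and textual file formats), Magika achieves over 99% precision and recall. Magika is used at scale to help improve Google users' safety by routing Gmail, Drive, and Safe Browsing files to the proper security and content policy scanners.
You can try Magika without installing anything by using our web demo, which runs locally in your browser!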
Sample Magika command-line output appears in the usage examples of the Quick Start section.
1. Quick Start
Magika is available on PyPI as magika:
$ pip install magika
Usage examples:
$ magika -r tests_data/
tests_data/README.md: Markdown document (text)
tests_data/basic/code.asm: Assembly (code)
tests_data/basic/code.c: C source (code)
tests_data/basic/code.css: CSS source (code)
tests_data/basic/code.js: JavaScript source (code)
tests_data/basic/code.py: Python source (code)
tests_data/basic/code.rs: Rust source (code)
...
tests_data/mitra/7-zip.7z: 7-zip archive data (archive)
tests_data/mitra/bmp.bmp: BMP image data (image)
tests_data/mitra/bzip2.bz2: bzip2 compressed data (archive)
tests_data/mitra/cab.cab: Microsoft Cabinet archive data (archive)
tests_data/mitra/elf.elf: ELF executable (executable)
tests_data/mitra/flac.flac: FLAC audio bitstream data (audio)
...
$ magika code.py --json
[
    {
        "path": "code.py",
        "dl": {
            "ct_label": "python",
            "score": 0.9940916895866394,
            "group": "code",
            "mime_type": "text/x-python",
            "magic": "Python script",
            "description": "Python source"
        },
        "output": {
            "ct_label": "python",
            "score": 0.9940916895866394,
            "group": "code",
            "mime_type": "text/x-python",
            "magic": "Python script",
            "description": "Python source"
        }
    }
]
$ cat doc.ini | magika -
-: INI configuration file (text)
$ magika -h
Usage: magika [OPTIONS] [FILE]...

  Magika - Determine type of FILEs with deep-learning.

Options:
  -r, --recursive                 When passing this option, magika scans every
                                  file within directories, instead of
                                  outputting "directory"
  --json                          Output in JSON format.
  --jsonl                         Output in JSONL format.
  -i, --mime-type                 Output the MIME type instead of a verbose
                                  content type description.
  -l, --label                     Output a simple label instead of a verbose
                                  content type description. Use --list-output-
                                  content-types for the list of supported
                                  output.
  -c, --compatibility-mode        Compatibility mode: output is as close as
                                  possible to `file` and colors are disabled.
  -s, --output-score              Output the prediction score in addition to
                                  the content type.
  -m, --prediction-mode [best-guess|medium-confidence|high-confidence]
  --batch-size INTEGER            How many files to process in one batch.
  --no-dereference                This option causes symlinks not to be
                                  followed. By default, symlinks are
                                  dereferenced.
  --colors / --no-colors          Enable/disable use of colors.
  -v, --verbose                   Enable more verbose output.
  -vv, --debug                    Enable debug logging.
  --generate-report               Generate report useful when reporting
                                  feedback.
  --version                       Print the version and exit.
  --list-output-content-types     Show a list of supported content types.
  --model-dir DIRECTORY           Use a custom model.
  -h, --help                      Show this message and exit.

  Magika version: "0.5.0"

  Default model: "standard_v1"

  Send any feedback to magika-dev@google.com or via GitHub issues.
For detailed documentation, see the Python documentation.
Python API example:
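The --json output shown earlier is convenient to consume from a script. The following is a minimal sketch, not part of Magika itself: it parses a sample copied from the `magika code.py --json` output above and keeps the entries whose score clears a threshold. The helper name `summarize` and the `min_score` parameter are hypothetical, introduced only for illustration; the field names (`path`, `output`, `ct_label`, `score`) are taken from the sample output of version 0.5.0 and may differ in other releases.

```python
import json

# Sample copied from the `magika code.py --json` output shown earlier.
SAMPLE_JSON = """
[
    {
        "path": "code.py",
        "dl": {
            "ct_label": "python",
            "score": 0.9940916895866394,
            "group": "code",
            "mime_type": "text/x-python",
            "magic": "Python script",
            "description": "Python source"
        },
        "output": {
            "ct_label": "python",
            "score": 0.9940916895866394,
            "group": "code",
            "mime_type": "text/x-python",
            "magic": "Python script",
            "description": "Python source"
        }
    }
]
"""


def summarize(raw: str, min_score: float = 0.5) -> list:
    """Return (path, label, score) tuples for entries scoring at least min_score.

    `min_score` is a hypothetical threshold for this sketch, not a Magika setting.
    """
    rows = []
    for entry in json.loads(raw):
        out = entry["output"]  # the final result reported by the tool
        if out["score"] >= min_score:
            rows.append((entry["path"], out["ct_label"], out["score"]))
    return rows


if __name__ == "__main__":
    for path, label, score in summarize(SAMPLE_JSON):
        print(f"{path}: {label} ({score:.3f})")
```

In a real pipeline the string would presumably come from running the CLI with --json (for example via `subprocess.run` with `capture_output=True`) rather than from an embedded sample.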
>>> from magika import Magika
>>> m = Magika()
>>> res = m.identify_bytes(b"# Example\nThis is an example of markdown!")
>>> print(res.output.ct_label)
markdown
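Building on the session above, labeling several in-memory buffers at once could look like the sketch below. It assumes only what the example shows for version 0.5.0: an object with an `identify_bytes` method whose result exposes `output.ct_label`. The helper `label_samples` is hypothetical and not part of the library, and attribute names may differ in other releases.

```python
def label_samples(identifier, samples):
    """Map each sample name to the content-type label reported for its bytes.

    `identifier` is any object with an `identify_bytes(data)` method whose
    result has `output.ct_label` (as in the Magika 0.5.0 example above).
    `samples` maps a name to a bytes object.
    """
    labels = {}
    for name, data in samples.items():
        result = identifier.identify_bytes(data)
        labels[name] = result.output.ct_label
    return labels


# Usage with the real library (requires `pip install magika`):
#
#   from magika import Magika
#   samples = {"readme": b"# Example\nThis is an example of markdown!"}
#   print(label_samples(Magika(), samples))
```

Passing the identifier in as a parameter means a single Magika instance can be created once and reused across calls, and the helper can be exercised with a stub object in tests.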
2. Development Setup
We use poetry for development and packaging:
$ git clone https://github.com/google/magika
$ cd magika/python
$ poetry shell && poetry install
$ magika -r ../tests_data
Run the tests:
$ cd magika/python
$ poetry shell
$ pytest tests/
Original article: Detect file content types with deep learning
Translated and compiled by BimAnt. Please credit the source when reposting.