add ck fix docs
421
docs/ck-embedding-model-fix.md
Normal file
@@ -0,0 +1,421 @@
# CK Embedding Model Download Fix Guide

## Problem Description

Running `ck --index` fails with the following error:

```
▸ Indexing Repository
ℹ Scanning files in .
ℹ 🤖 Model: BAAI/bge-small-en-v1.5 (alias 'bge-small', 384 dims)
ℹ 📏 FastEmbed Config: 512 token limit
ℹ 📄 Chunk Config: 400 tokens target, 80 token overlap (~20%)
DETAILED ERROR: Header Content-Range is missing
DEBUG: Error occurred in main
```

## Root Cause Analysis

### Error Source

The error is raised by the `hf-hub` crate (the Rust client for the HuggingFace Hub), in this file:

```
~/.cargo/registry/src/*/hf-hub-0.4.3/src/api/sync.rs
```

Key code snippet:

```rust
// Lines 534-536
let content_range = response
    .header(CONTENT_RANGE)
    .ok_or(ApiError::MissingHeader(CONTENT_RANGE))?;
```

### Technical Background

**Normal flow:**

1. `hf-hub` sends an HTTP request to fetch the metadata of the model file.
2. The request carries the header `Range: bytes=0-0`, which asks for only the first byte.
3. The server is expected to return:
   - HTTP status code 206 (Partial Content)
   - Header `Content-Range: bytes 0-0/123456` (the number after `/` is the total file size)
   - Header `ETag` (file version identifier)
   - Header `x-repo-commit` (Git commit hash)

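To make the `Content-Range` format concrete, here is a minimal shell sketch that extracts the total size from a header value. `content_range_total` is a hypothetical helper written for this guide; it is not part of `hf-hub` or `ck`.

```bash
# Hypothetical helper: print the total size from a Content-Range value
# such as "bytes 0-0/123456". Fails if the value has another shape.
content_range_total() {
  case "$1" in
    bytes\ *-*/[0-9]*) printf '%s\n' "${1##*/}" ;;   # keep what follows the last "/"
    *) echo "unexpected Content-Range value: $1" >&2; return 1 ;;
  esac
}

content_range_total "bytes 0-0/123456"   # prints 123456
```

When the header is absent there is nothing to parse, which is the situation the `Header Content-Range is missing` error reports.
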
**Cause of the failure:**

The server did not return a `Content-Range` header. Possible reasons include:

| Cause | Description |
|-------|-------------|
| **CDN configuration** | Some CDNs do not support HTTP Range requests |
| **Proxy restrictions** | Corporate firewalls or proxies may filter response headers |
| **Mirror compatibility** | A HuggingFace mirror may not fully support Range requests |
| **Network device interference** | Intermediate network devices may modify or drop headers |

### Dependency Chain

```
ck (application)
  ↓
ck-embed (embedding module)
  ↓
fastembed (embedding inference library)
  ↓
hf-hub (HuggingFace download client)  ← where the error is raised
  ↓
HuggingFace CDN / mirror site  ← server-side problem
```

---

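## Solutions

### Option 1: Use a HuggingFace Mirror (Try This First)

```bash
# Set an environment variable to use a mirror hosted in China
export HF_ENDPOINT=https://hf-mirror.com

# Then run ck
ck --index ./your-project
```

**Check whether the mirror supports Range requests:**

```bash
# Send a GET with a Range header, as hf-hub does, and print only the response headers
curl -sL -o /dev/null -D - -H "Range: bytes=0-0" \
  "https://hf-mirror.com/Xenova/bge-small-en-v1.5/resolve/main/tokenizer.json"
```

The final response should be `206 Partial Content` and carry a `Content-Range` header, because that is the header `hf-hub` requires. An `accept-ranges: bytes` header on its own only advertises Range support.
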
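Since the failing check in `hf-hub` is for `Content-Range`, the manual inspection can be scripted. The sketch below reads raw response headers on standard input and succeeds only when the last response is a 206 that includes `Content-Range`. `supports_range` is a hypothetical helper, and the sample headers are made up for illustration.

```bash
# Hypothetical helper: exit 0 only if the last response in the header dump
# has status 206 and a Content-Range header.
supports_range() {
  awk '
    { sub(/\r$/, "") }                            # strip CR from CRLF line endings
    /^HTTP\// { status = $2; has_range = 0 }      # a new response starts (after a redirect)
    tolower($0) ~ /^content-range:/ { has_range = 1 }
    END { exit !(status == 206 && has_range) }
  '
}

# Offline demonstration with made-up headers:
printf 'HTTP/1.1 206 Partial Content\r\nContent-Range: bytes 0-0/123456\r\n\r\n' \
  | supports_range && echo "Range requests supported"

# Against a real endpoint (requires network access):
# curl -sL -o /dev/null -D - -H "Range: bytes=0-0" "$URL" | supports_range
```

Resetting the state on every status line means that, when redirects are followed, only the final response decides the result.
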
### Option 2: Pre-download the Model Manually (Most Reliable)

If the mirror approach still fails, download the model files manually into the local cache directory.

#### Step by Step

**1. Identify the model**

```bash
# Default model: BAAI/bge-small-en-v1.5
# fastembed uses Xenova/bge-small-en-v1.5 (the ONNX build)
MODEL_ID="Xenova/bge-small-en-v1.5"
```

**2. Understand the cache directory layout**

```
~/.cache/ck/models/
└── models--Xenova--bge-small-en-v1.5/
    ├── blobs/                        # model binaries (named by SHA hash)
    │   └── 828e1496d7fabb...         # ONNX model file
    ├── refs/
    │   └── main                      # contains the commit hash
    └── snapshots/
        └── ea104dacec62c0de.../      # organized by commit hash
            ├── tokenizer.json
            ├── config.json
            ├── special_tokens_map.json
            ├── tokenizer_config.json
            └── onnx/
                └── model_quantized.onnx
```

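The top-level directory name is derived from the model ID by replacing `/` with `--` and adding a `models--` prefix. A minimal sketch of that mapping, using a hypothetical helper `hf_cache_dir_name`:

```bash
# Hypothetical helper: map a model ID ("org/name") to the hf-hub style
# cache directory name ("models--org--name").
hf_cache_dir_name() {
  local id="$1"
  case "$id" in
    */*) printf 'models--%s--%s\n' "${id%%/*}" "${id#*/}" ;;
    *)   printf 'models--%s\n' "$id" ;;
  esac
}

hf_cache_dir_name "Xenova/bge-small-en-v1.5"   # prints models--Xenova--bge-small-en-v1.5
```

The function uses only POSIX parameter expansion, so it does not depend on bash-specific substitution syntax.
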
**3. Run the download script**

```bash
#!/bin/bash
# Manually download the embedding model files

export HF_ENDPOINT=https://hf-mirror.com

MODEL_ID="Xenova/bge-small-en-v1.5"
COMMIT="ea104dacec62c0de699686887e3f920caeb4f3e3"
CACHE_ROOT="$HOME/.cache/ck/models"
# Replace "/" in the model ID with "--" (Xenova/bge-... -> models--Xenova--bge-...)
MODEL_DIR="$CACHE_ROOT/models--${MODEL_ID//\//--}"
SNAPSHOT_DIR="$MODEL_DIR/snapshots/$COMMIT"

# Create the directory structure
mkdir -p "$MODEL_DIR/blobs" "$MODEL_DIR/refs" "$SNAPSHOT_DIR/onnx"

# Write the ref file
echo "$COMMIT" > "$MODEL_DIR/refs/main"

# Required model files
# Note: the model table in this guide lists onnx/model.onnx for bge-small;
# swap the last entry if ck asks for that file instead.
FILES=(
    "tokenizer.json"
    "config.json"
    "special_tokens_map.json"
    "tokenizer_config.json"
    "onnx/model_quantized.onnx"
)

# Download each file
for file in "${FILES[@]}"; do
    target_path="$SNAPSHOT_DIR/$file"

    if [ -s "$target_path" ]; then
        echo "Already exists: $file"
    else
        echo "Downloading: $file ..."
        # wget -O creates the output file even when the download fails,
        # so test the exit status and remove any empty or partial file
        if wget -q "$HF_ENDPOINT/$MODEL_ID/resolve/main/$file" -O "$target_path"; then
            echo "OK: $file"
        else
            echo "FAILED: $file" >&2
            rm -f "$target_path"
        fi
    fi
done

echo "Download script finished."
```

**4. Verify the download**

```bash
# Check that the files are present
ls -la ~/.cache/ck/models/models--Xenova--bge-small-en-v1.5/snapshots/*/onnx/

# Check the model file size (model_quantized.onnx is about 34MB)
du -h ~/.cache/ck/models/models--Xenova--bge-small-en-v1.5/snapshots/*/onnx/model_quantized.onnx
```

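A failed download can leave an empty file or a small error page behind, so the size check is worth automating. The sketch below assumes a threshold of 30,000,000 bytes, chosen only because the file is described as about 34MB; `file_at_least` is a hypothetical helper.

```bash
# Hypothetical helper: succeed only if FILE exists and holds at least MIN_BYTES bytes.
file_at_least() {
  local file="$1" min_bytes="$2" size
  [ -f "$file" ] || return 1
  size=$(( $(wc -c < "$file") ))   # arithmetic expansion strips any padding from wc
  [ "$size" -ge "$min_bytes" ]
}

# Path taken from the download script; the threshold is an assumption.
MODEL_FILE="$HOME/.cache/ck/models/models--Xenova--bge-small-en-v1.5/snapshots/ea104dacec62c0de699686887e3f920caeb4f3e3/onnx/model_quantized.onnx"
if file_at_least "$MODEL_FILE" 30000000; then
  echo "model file looks complete"
else
  echo "model file missing or too small"
fi
```

A size check only catches truncated or empty files; it does not prove the content is intact.
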
---

## Downloading Other Models

Other embedding models can be downloaded the same way:

| Model alias | HuggingFace ID | Characteristics | Model file |
|-------------|----------------|-----------------|------------|
| `bge-small` | Xenova/bge-small-en-v1.5 | Default model, 384 dims, 512-token limit | `onnx/model.onnx` (~130MB) |
| `nomic-v1.5` | nomic-ai/nomic-embed-text-v1.5 | 768 dims, 8192-token context | `onnx/model.onnx` (~270MB) |
| `jina-code` | jinaai/jina-embeddings-v2-base-code | Code-specific, 768 dims, 8192-token context | `onnx/model.onnx` (~520MB) ⚠️ |

### ⚠️ Notes on the jina-code Model

**Important:** the `model.onnx` file for jina-code is about **520MB**, much larger than the other models, and may be hard to download on an unstable network.

For jina-code, fastembed uses the **non-quantized build** (`onnx/model.onnx`), not `model_quantized.onnx`.

**Download the full jina-code model:**

```bash
#!/bin/bash
export HF_ENDPOINT=https://hf-mirror.com

MODEL_ID="jinaai/jina-embeddings-v2-base-code"
COMMIT="516f4baf13dec4ddddda8631e019b5737c8bc250"
MODEL_DIR="$HOME/.cache/ck/models/models--jinaai--jina-embeddings-v2-base-code"
SNAPSHOT_DIR="$MODEL_DIR/snapshots/$COMMIT"
MODEL_URL="$HF_ENDPOINT/$MODEL_ID/resolve/main/onnx/model.onnx"

# Create the directory structure
mkdir -p "$MODEL_DIR/refs" "$SNAPSHOT_DIR/onnx"
echo "$COMMIT" > "$MODEL_DIR/refs/main"

# Download the tokenizer and config files (small)
FILES="tokenizer.json config.json special_tokens_map.json tokenizer_config.json"
for f in $FILES; do
    wget -q "$HF_ENDPOINT/$MODEL_ID/resolve/main/$f" -O "$SNAPSHOT_DIR/$f"
done

# Download model.onnx (~520MB). The three methods below are alternatives:
# run only one of them.

# Method 1: plain wget (may time out)
# wget "$MODEL_URL" -O "$SNAPSHOT_DIR/onnx/model.onnx"

# Method 2: aria2c (recommended; downloads in parallel segments)
aria2c -x 16 -s 16 "$MODEL_URL" -d "$SNAPSHOT_DIR/onnx" -o model.onnx

# Method 3: curl with resume support (-C - continues a partial download)
# curl -L -C - "$MODEL_URL" -o "$SNAPSHOT_DIR/onnx/model.onnx"
```

**If model.onnx cannot be downloaded:**

- Use `bge-small` or `nomic-v1.5` instead.
- `bge-small` has been verified to work; it is small (~130MB) and downloads quickly.

### Cache Directory Naming Difference ⚠️

**ck and hf-hub use different directory naming formats:**

| Component | Directory format | Example |
|-----------|------------------|---------|
| hf-hub | `models--{org}--{name}` | `models--jinaai--jina-embeddings-v2-base-code` |
| ck lookup | `{org}_{name}` | `jinaai_jina-embeddings-v2-base-code` |

**Fix: create a symlink**

```bash
# The hf-hub style directory already exists; create a ck style symlink to it
HF_DIR=~/.cache/ck/models/models--jinaai--jina-embeddings-v2-base-code
CK_DIR=~/.cache/ck/models/jinaai_jina-embeddings-v2-base-code

ln -s "$HF_DIR" "$CK_DIR"

# Do the same for bge-small
ln -s ~/.cache/ck/models/models--Xenova--bge-small-en-v1.5 \
      ~/.cache/ck/models/Xenova_bge-small-en-v1.5
```

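Both names can be derived from the model ID, so the link step generalizes to any model. The following is a sketch under that assumption; `link_ck_cache_dir` is a hypothetical helper. It uses `ln -sfn` so that re-running it replaces an existing link instead of failing.

```bash
# Hypothetical helper: create (or refresh) the ck style symlink for a model
# whose hf-hub style directory already exists under CACHE_ROOT.
link_ck_cache_dir() {
  local cache_root="$1" id="$2"
  local hf_name="models--${id%%/*}--${id#*/}"
  local ck_name="${id%%/*}_${id#*/}"
  if [ ! -d "$cache_root/$hf_name" ]; then
    echo "missing directory: $cache_root/$hf_name" >&2
    return 1
  fi
  # The link target is relative, so the cache directory can be moved as a whole
  ln -sfn "$hf_name" "$cache_root/$ck_name"
}

# Example (run after the model has been downloaded):
# link_ck_cache_dir "$HOME/.cache/ck/models" "Xenova/bge-small-en-v1.5"
```
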
### Complete Cache Layout Example

```
~/.cache/ck/models/
├── Xenova_bge-small-en-v1.5 -> models--Xenova--bge-small-en-v1.5   # symlink
├── jinaai_jina-embeddings-v2-base-code -> models--jinaai--...      # symlink
└── models--Xenova--bge-small-en-v1.5/
    ├── blobs/
    │   └── 828e1496...               # ONNX model
    ├── refs/
    │   └── main                      # commit hash
    └── snapshots/
        └── ea104dacec62c0de.../
            ├── tokenizer.json
            ├── config.json
            ├── special_tokens_map.json
            ├── tokenizer_config.json
            └── onnx/
                └── model.onnx -> ../../../blobs/828e1496...   # symlink to blob
```

**Example: downloading another model**

```bash
# Download nomic-v1.5
MODEL_ID="nomic-ai/nomic-embed-text-v1.5"
# First look up the commit hash (the "sha" field of the API response)
curl -s "https://hf-mirror.com/api/models/$MODEL_ID" | grep -o '"sha": *"[^"]*"' | head -n 1

# Then download with the script above
```

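The commit lookup can be wrapped in a small filter that prints only the 40-character hash. This is a sketch: `extract_sha` is a hypothetical helper, and the JSON below is a trimmed, made-up sample rather than a real API response. If `jq` is installed, `jq -r .sha` is a more robust alternative.

```bash
# Hypothetical helper: read JSON on stdin and print the first 40-hex-digit "sha" value.
extract_sha() {
  grep -o '"sha" *: *"[0-9a-f]\{40\}"' | head -n 1 | grep -o '[0-9a-f]\{40\}'
}

# Offline demonstration with a made-up sample:
printf '{"id":"Xenova/bge-small-en-v1.5","sha":"ea104dacec62c0de699686887e3f920caeb4f3e3"}\n' \
  | extract_sha

# Against the mirror (requires network access):
# COMMIT=$(curl -s "https://hf-mirror.com/api/models/$MODEL_ID" | extract_sha)
```
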
---

## Troubleshooting Checklist

When a download fails, check the following in order:

1. **Check network connectivity**
   ```bash
   curl -I https://huggingface.co
   curl -I https://hf-mirror.com
   ```

2. **Check proxy settings**
   ```bash
   echo $HTTP_PROXY
   echo $HTTPS_PROXY
   # To disable the proxy temporarily: unset HTTP_PROXY HTTPS_PROXY
   ```

3. **Clean up incomplete downloads**
   ```bash
   # Remove lock files
   find ~/.cache/ck/models -name "*.lock" -delete
   # Remove partial downloads
   find ~/.cache/ck/models -name "*.part" -delete
   ```

4. **Check the HF_ENDPOINT setting**
   ```bash
   echo $HF_ENDPOINT
   # Expected output when using the mirror: https://hf-mirror.com
   ```

5. **Verify Range request support**
   ```bash
   curl -sL -o /dev/null -D - -H "Range: bytes=0-0" "$HF_ENDPOINT/Xenova/bge-small-en-v1.5/resolve/main/tokenizer.json"
   # Look for a 206 status and a "content-range" header in the final response
   ```

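Before deleting anything in step 3, it can help to list what would be removed. A minimal dry-run sketch, using a hypothetical helper `list_stale_downloads`:

```bash
# Hypothetical helper: print leftover *.lock and *.part files under DIR
# without deleting them. Prints nothing if DIR does not exist.
list_stale_downloads() {
  [ -d "$1" ] || return 0
  find "$1" -type f \( -name '*.lock' -o -name '*.part' \) -print
}

list_stale_downloads "$HOME/.cache/ck/models"
```

If the list looks right, the `find ... -delete` commands from the checklist remove the same files.
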
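---

## Technical Details: Why Are Range Requests Needed?

**What Range requests are used for:**

1. **Getting the file size** - requesting only the first byte reveals the total size without downloading the whole file
2. **Resuming downloads** - an interrupted download can continue from where it stopped
3. **Partial downloads** - only the required portion is fetched
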
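For the resume case, the client asks for everything from the current size of the partial file onward. The sketch below builds that header; `resume_range_header` is a hypothetical helper, and this is the kind of request `curl -C -` issues automatically.

```bash
# Hypothetical helper: print the Range header needed to resume a download
# into FILE. A missing file means the download starts at byte 0.
resume_range_header() {
  local file="$1" offset=0
  if [ -f "$file" ]; then
    offset=$(( $(wc -c < "$file") ))   # bytes already on disk
  fi
  printf 'Range: bytes=%s-\n' "$offset"
}

resume_range_header /nonexistent/model.onnx   # prints "Range: bytes=0-"
```
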
**Flow of the `metadata` function in hf-hub:**

```rust
// Simplified sketch of the flow, not the verbatim hf-hub source
fn metadata(&self, url: &str) -> Result<Metadata, ApiError> {
    // Send a request with `Range: bytes=0-0`
    let response = self.client.get(url)
        .set(RANGE, "bytes=0-0")
        .call()?;

    // Parse the file size from Content-Range
    // Content-Range format: "bytes 0-0/123456"
    let content_range = response
        .header(CONTENT_RANGE)
        .ok_or(ApiError::MissingHeader(CONTENT_RANGE))?;
    // Error handling is simplified here
    let size = content_range
        .split('/')
        .next_back()
        .ok_or(ApiError::MissingHeader(CONTENT_RANGE))?
        .parse()?;

    // commit_hash and etag are read from the `x-repo-commit` and `ETag`
    // response headers (omitted here)
    Ok(Metadata {
        commit_hash,
        etag,
        size, // total file size
    })
}
```

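**Manual download bypasses this check:**

- The complete files are downloaded directly, so no metadata request is needed first.
- The files are placed straight into the expected cache directory layout.
- At run time, ck finds the model already in the cache and skips the download step.
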
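Whether the cache is complete can be checked before running ck. The sketch below follows the layout described in this guide (`refs/main` names the commit, files live under `snapshots/<commit>/`); how ck itself detects a cached model is not shown in the source, so treat this as an approximation. `model_cached` is a hypothetical helper.

```bash
# Hypothetical helper: succeed if MODEL_DIR has a refs/main entry and every
# listed file exists and is non-empty in the matching snapshot directory.
model_cached() {
  local model_dir="$1" commit snapshot f
  shift
  [ -f "$model_dir/refs/main" ] || { echo "missing refs/main" >&2; return 1; }
  commit=$(cat "$model_dir/refs/main")
  snapshot="$model_dir/snapshots/$commit"
  for f in "$@"; do
    [ -s "$snapshot/$f" ] || { echo "missing or empty: $f" >&2; return 1; }
  done
}

if model_cached "$HOME/.cache/ck/models/models--Xenova--bge-small-en-v1.5" \
     tokenizer.json config.json 2>/dev/null; then
  echo "cache looks complete"
else
  echo "cache incomplete"
fi
```
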
---

## Verifying the Model Works

After the download finishes, confirm that the model can be used:

```bash
# Remove the old index, if any
ck --clean .

# Index with a specific model (use the one you downloaded)
ck --index --model bge-small .
ck --index --model nomic-v1.5 .
ck --index --model jina-code .

# Try semantic search
ck --sem "function definition" .
ck --sem "error handling" .

# Check the index status
ck --status .
```

**Expected output:**

```
▸ Indexing Repository
ℹ Scanning files in .
ℹ 🤖 Model: BAAI/bge-small-en-v1.5 (alias 'bge-small', 384 dims)
ℹ 📏 FastEmbed Config: 512 token limit
ℹ 📄 Chunk Config: 400 tokens target, 80 token overlap (~20%)
✓ 🚀 Indexed 174 files
ℹ ➕ 174 new files added
```

---

## References

- [HuggingFace Hub documentation](https://huggingface.co/docs/hub)
- [fastembed source code](https://github.com/Anush008/fastembed-rs)
- [hf-hub Rust crate](https://github.com/huggingface/hf-hub)
- [HF-Mirror mirror site](https://hf-mirror.com)