Preface

The ERNIE 4.5 models have just been open-sourced, and not just one model but the entire series, which came as a pleasant surprise. The ERNIE 4.5 series consists of 10 open-source models, including Mixture-of-Experts (MoE) models with activated parameter counts of 47B and 3B (the largest model has 424B total parameters), as well as a 0.3B dense model. Below, we'll walk through how to quickly run inference with ERNIE 4.5 models and deploy an API for clients such as Android apps and WeChat Mini Programs. Note that this article focuses on the text models; ERNIE 4.5 also includes multimodal models.

Environment:
- PaddlePaddle 3.1.0
- Python 3.11
- CUDA 12.6
- GPU: NVIDIA RTX 4090 (24 GB)
- Ubuntu 22.04

Setting Up the Environment

  1. First, install PaddlePaddle (skip if already installed; a quick verification snippet follows this list):
python -m pip install paddlepaddle-gpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
  2. Then install the FastDeploy tool:
python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-86_89/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
  3. Install the aistudio-sdk to download models:
pip install --upgrade aistudio-sdk
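
Before moving on, you can optionally verify the PaddlePaddle installation and GPU visibility with PaddlePaddle's built-in self-check:

import paddle

# Runs PaddlePaddle's installation self-check (compiles and runs a small test program)
paddle.utils.run_check()
# Prints the default device; should be something like "gpu:0" if CUDA is set up correctly
print(paddle.device.get_device())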

Quick Usage

The following Python code allows you to quickly start a conversation. We’ll use the smallest model first, but there are larger options available:

  • ERNIE-4.5-0.3B-Paddle
  • ERNIE-4.5-21B-A3B-Paddle
  • ERNIE-4.5-300B-A47B-Paddle

When you run the code, it will automatically download the model and start a terminal-based chat. The quantization parameter sets the quantization type, supporting wint4 and wint8 (a wint8 example follows the code below).

from aistudio_sdk.snapshot_download import snapshot_download
from fastdeploy import LLM, SamplingParams

# Model name
model_name = "PaddlePaddle/ERNIE-4.5-0.3B-Paddle"
save_path = "./models/ERNIE-4.5-0.3B-Paddle/"
# Download the model
res = snapshot_download(repo_id=model_name, revision='master', local_dir=save_path)
# Chat parameters
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Load the model
llm = LLM(model=save_path, max_model_len=32768, quantization=None)

messages = []

while True:
    prompt = input("Enter your question: ")
    if prompt == 'exit':
        break
    messages.append({"role": "user", "content": prompt})
    # llm.chat returns one result per conversation; take the first
    output = llm.chat(messages, sampling_params)[0]
    text = output.outputs.text
    messages.append({"role": "assistant", "content": text})
    print(text)
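
If the model is too large for your GPU at full precision, pass a quantization type when loading. A minimal variation of the loading line above, assuming the same save_path (wint8 quantizes weights to 8-bit integers; wint4 works the same way):

# Same as above, but with 8-bit weight quantization to reduce GPU memory usage
llm = LLM(model=save_path, max_model_len=32768, quantization="wint8")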

Sample output:

INFO     2025-07-01 14:20:26,232 4785  engine.py[line:206] Waitting worker processes ready...
Loading Weights: 100%|█████████████████████████████████| 100/100 [00:03<00:00, 33.26it/s]
Loading Layers: 100%|██████████████████████████████████| 100/100 [00:01<00:00, 66.54it/s]
INFO     2025-07-01 14:20:36,753 4785  engine.py[line:276] Worker processes are launched with 12.627224445343018 seconds.
Enter your question: Hello, what's your name?
Processed prompts: 100%|███████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.12it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Hi there! I'm **Xiaotian**, nice to meet you! Is there anything I can help you with?
Enter your question: What can you do?
Processed prompts: 100%|███████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.44s/it, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
I can do a lot! I'm good at **organizing knowledge trees**, analyzing historical events, and explaining scientific principles. I can also help you with **brain teasers** and **creative little inventions**, or tell you funny jokes. Want to give it a try?
Enter your question: What question did I just ask you?
Processed prompts: 100%|███████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  3.49it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Sure! What would you like to ask? About my name, my hobbies, or some other fun topic?
Enter your question:

Deploying the API

First, download the model (you can replace it with the model of your choice at any time):

aistudio download --model PaddlePaddle/ERNIE-4.5-0.3B-Paddle --local_dir ./models/ERNIE-4.5-0.3B-Paddle/

After downloading the model, run the following command to start the service on port 8180. max-model-len specifies the maximum context length supported during inference, max-num-seqs sets the maximum number of concurrent sequences during decoding, and specifying quantization enables weight quantization (wint4 or wint8). For more parameter documentation, see: https://paddlepaddle.github.io/FastDeploy/parameters/

python -m fastdeploy.entrypoints.openai.api_server \
       --model ./models/ERNIE-4.5-0.3B-Paddle/ \
       --port 8180 \
       --quantization wint8 \
       --max-model-len 32768 \
       --max-num-seqs 32

Sample output:

INFO     2025-07-01 14:25:22,033 5239  engine.py[line:206] Waitting worker processes ready...
Loading Weights: 100%|█████████████████████████████████| 100/100 [00:03<00:00, 33.26it/s]
Loading Layers: 100%|██████████████████████████████████| 100/100 [00:02<00:00, 49.91it/s]
INFO     2025-07-01 14:25:33,060 5239  engine.py[line:276] Worker processes are launched with 16.20948576927185 seconds.
INFO     2025-07-01 14:25:33,061 5239  api_server.py[line:91] Launching metrics service at http://0.0.0.0:8001/metrics
INFO     2025-07-01 14:25:33,061 5239  api_server.py[line:94] Launching chat completion service at http://0.0.0.0:8180/v1/chat/completions
INFO     2025-07-01 14:25:33,061 5239  api_server.py[line:97] Launching completion service at http://0.0.0.0:8180/v1/completions
INFO:     Started server process [5239]
INFO:     Waiting for application startup.
[2025-07-01 14:25:34,089] [    INFO] - Loading configuration file ./models/ERNIE-4.5-0.3B-Paddle/generation_config.json
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8180 (Press CTRL+C to quit)
INFO:     127.0.0.1:53716 - "POST /v1/chat/completions HTTP/1.1" 200 OK
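
Before writing a full client, you can sanity-check the service with a plain HTTP request. A minimal sketch using the requests library (adjust the host and port to match your deployment; the model field is a placeholder since the server loads a single model):

import requests

# The server exposes an OpenAI-compatible chat completions endpoint
resp = requests.post(
    "http://127.0.0.1:8180/v1/chat/completions",
    json={
        "model": "null",  # placeholder; the server ignores the model name
        "messages": [{"role": "user", "content": "Hello"}],
    },
)
print(resp.json()["choices"][0]["message"]["content"])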

Calling the API

The API is OpenAI-compatible, so you can call it from Python with the openai library; the server ignores the model name and API key, so any placeholder values will do.

import openai
host = "192.168.0.100"
port = "8180"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")

messages = []

while True:
    prompt = input("Enter your question: ")
    if prompt == 'exit':
        break
    messages.append({"role": "user", "content": prompt})
    response = client.chat.completions.create(
        model="null",
        messages=messages,
        stream=True,
    )
    output = ""
    for chunk in response:
        # The final chunk may carry an empty delta, so guard against None content
        if chunk.choices[0].delta and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end='', flush=True)
            output += chunk.choices[0].delta.content
    print()
    messages.append({"role": "assistant", "content": output})

Sample output:

Enter your question: Hello
Hi there! 😊 I'm glad to be able to help you~ Is there anything I can solve for you? Whether it's a study question or a little everyday worry, I'm right here! 🧐
Enter your question: