
How to Run Multi-Node Inference on DeepSeek-R1


Last time we looked at DeepSeek-R1, we explored what makes the model so powerful. In that blog post we covered:

  • The training paradigm that produced the now-famous “Aha moment” during reinforcement learning, which led to the model developing reasoning capabilities;
  • How DeepSeek trained DeepSeek-R1 from DeepSeek-R1-Zero;
  • How to run the model using SGLang to harness the power of a SOTA reasoning Large Language Model.

In this post, we explore how to run DeepSeek-R1 on a distributed machine setup. We walk through:

  • Considerations when choosing a distributed setup for DeepSeek-R1
  • Setting up the environment for each node
  • Downloading DeepSeek-R1 onto your machines
  • Serving the model on a multi-node distributed deployment

Distributed vs. Single-Node Machines for DeepSeek-R1

Since DeepSeek-R1 can run on a single 8xH200 machine, it is worth asking why running the model on a distributed setup is worthwhile before we continue. There are two main considerations when choosing your setup:


Cost

Using less expensive hardware to run the model is a benefit of DeepSeek-R1, but there is a trade-off when it comes to…


Speed

Increasing the number of nodes can increase the number of tokens generated per second, but realizing this gain typically requires scaling beyond two nodes because of the communication overhead between nodes.

Clearly the two are linked: higher cost generally buys higher speed, and vice versa. We should therefore weigh this balance against our eventual deployment when choosing a setup. In short, if you want to optimize for inference speed, more properly configured nodes will accomplish that, but at a higher cost.

Running DeepSeek-R1 with SGLang on a Multi-Node Setup

  1. SSH into your machine and mount your storage volume at /mnt/. Next, download R1 to the mounted volume. This may take a while given the size of R1.
git-lfs clone https://huggingface.co/deepseek-ai/DeepSeek-R1

or

huggingface-cli download deepseek-ai/DeepSeek-R1 --local-dir <path to /mnt/ dir>
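Before kicking off the download, it is worth a quick sanity check that the mounted volume has room for the weights (the full R1 checkpoint is roughly 700 GB), and afterwards that the shards actually arrived. A minimal sketch, assuming the files land in /mnt/DeepSeek-R1:

# Check free space on the mounted volume before downloading
df -h /mnt

# After the download completes, confirm the weight shards are present
ls /mnt/DeepSeek-R1/*.safetensors | wc -l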

Once complete, install SGLang in a virtual environment.

python -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install uv
uv pip install "sglang[all]>=0.4.3.post2" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python

This will complete the machine setup. Be sure to do this for each node.
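As a quick sanity check before moving on, the one-liner below (run inside the venv on each node) simply confirms that SGLang imports cleanly and that PyTorch can see the GPUs; the exact version printed will depend on what uv resolved.

# Verify the install: print the SGLang version and whether CUDA is visible
python -c "import sglang, torch; print(sglang.__version__, torch.cuda.is_available())"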

  2. Identify your host IP

You can find the host IP by inspecting /etc/hosts (for example with cat /etc/hosts or vim /etc/hosts). Locate your machine by name, grab its IP, and save this value for later; we will use it as the host machine IP when we launch.
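If the /etc/hosts entry is ambiguous, standard Linux tooling can confirm the address; the address you want is the one on the network the nodes use to talk to each other.

# Print all IP addresses assigned to this machine
hostname -I

# Or inspect the hosts file directly
cat /etc/hosts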

  3. Identify your GLOO_SOCKET_IFNAME variable value. You can find this with the command sudo lshw -C network. Find the first logical name and save that value, then set it with export GLOO_SOCKET_IFNAME=<logical name> (note: no spaces around the =).
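For example, on a typical Linux node (the interface name eth0 below is purely illustrative; use whatever logical name lshw reports on your machine):

# List network interfaces; note the first "logical name" entry
sudo lshw -C network | grep "logical name"

# Export it with no spaces around the equals sign
export GLOO_SOCKET_IFNAME=eth0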

With that, we have completed our machine setup and can launch our distributed, SGLang server.

  4. Launch the server

From the host node, launch the server with the following command. Replace the host machine IP with the value you saved earlier, and paste the path to the DeepSeek-R1 model files into the --model-path parameter.

python3 -m sglang.launch_server --model-path <path to deepseek-R1> --tp 16 --dist-init-addr <your host machine ip>:5000 --nnodes 2 --node-rank 0 --trust-remote-code

From the additional node, launch the server with:

python3 -m sglang.launch_server --model-path <path to deepseek-R1> --tp 16 --dist-init-addr <your host machine ip>:5000 --nnodes 2 --node-rank 1 --trust-remote-code

This will take a few minutes to load. Note that the deployment can be scaled up further by adjusting the --nnodes, --node-rank, and --tp values to add more machines, as sketched below.
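As a sketch of that scaling, a hypothetical four-node deployment with 8 GPUs per node would raise --tp to 32 and run one launch command per machine, incrementing --node-rank each time:

# On the host node (rank 0); repeat on the other nodes with --node-rank 1, 2, and 3
python3 -m sglang.launch_server --model-path <path to deepseek-R1> --tp 32 --dist-init-addr <your host machine ip>:5000 --nnodes 4 --node-rank 0 --trust-remote-code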

  5. Much like we showed in the previous article, we can interact with our new SGLang endpoint using Python or cURL. Inference with the deployed, distributed model is actually quite simple.

To send a request with cURL, paste the following command into a new terminal window connected to the distributed network. Note that SGLang serves HTTP requests on port 30000 by default (the 5000 above is only the distributed init address), and the model value should match your served model:

curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-ai/DeepSeek-R1", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'

You can then adjust the results by changing various settings in the -d payload, such as the messages or temperature.
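For instance, a variant that lowers the sampling temperature and caps the response length might look like the following (both fields are standard OpenAI-compatible parameters; check the SGLang docs for the full list):

curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-ai/DeepSeek-R1", "messages": [{"role": "user", "content": "What is the capital of France?"}], "temperature": 0.6, "max_tokens": 512}'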

Alternatively, we can make requests using Python. This can be done in numerous ways, including with the Python requests library or OpenAI-style syntax. For simplicity, we will use Python requests for this demo. Paste the following into a Python session on a machine connected to the distributed network:

Python
import requests

# SGLang serves on port 30000 by default
port = 30000
url = f"http://localhost:{port}/v1/chat/completions"

data = {
    "model": "deepseek-ai/DeepSeek-R1",  # should match the served model
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}

response = requests.post(url, json=data)
print(response.json())
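Because the endpoint is OpenAI-compatible, the official openai client works as well. Here is a minimal sketch, assuming the openai package is installed and the server is on its default port of 30000 (the api_key value is a placeholder, since SGLang does not require one by default):

Python
from openai import OpenAI

# Point the client at the local SGLang server; the key is a dummy value
client = OpenAI(base_url="http://localhost:30000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",  # should match the served model
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)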

With either approach, we should get a long reasoning response in which the model works its way to the answer of the simple, single-answer question: “What is the capital of France?”. We recommend testing R1 with more complicated problem-solving or math tasks to see if it fits your deployment use case.

We also recommend that all readers check out the SGLang send-requests page (https://docs.sglang.ai/backend/send_request.html) for more in-depth details on how to format requests to the distributed deployment.

Closing Thoughts

In conclusion, adding more nodes can significantly increase throughput and therefore the deployment efficacy of the DeepSeek-R1 model. We recommend using a distributed setup wherever feasible for production workloads, as it will almost always outperform deployment on a single node. Of course, if you are just experimenting and learning, the beauty of DeepSeek-R1 is that you can accomplish this with limited hardware overhead.