
Building a streaming LLM with Next.js, FastAPI & Docker: the complete stack (part 1)

Alex Arvanitidis (@alarv) · 10 min read

[Cover image: Deploy LLM]

Want to build a modern LLM application with real-time streaming responses? Here's a complete guide covering backend, frontend, and deployment. We'll be using some of the most popular technologies in the modern web development ecosystem: Next.js with the Next UI component library, FastAPI for our backend, and Docker for containerization.

Today we'll be deploying Llama, an open-source large language model developed by Meta, with real-time streaming responses. This setup lets users interact with the model and watch responses appear token by token as they're generated. We'll cover the complete stack: backend, frontend, and deployment.

The rise of ChatGPT made many people interested in large language models, but its code stayed private. This led to the development of open alternatives like LLaMA, which anyone can use and modify. LLaMA, like other language models, creates text one piece at a time. Here's how it works:

When you type a question or prompt, LLaMA processes it token by token. A token is basically a piece of text - it could be a word, part of a word, or even a single character. The model looks at all the tokens you've given it and predicts what should come next.

For example, if you write "The cat 🐈 sat on the", LLaMA will:

  1. Read your input
  2. Predict the next most likely token (maybe "mat")
  3. Add "mat" to your input
  4. Use the new, longer text to predict the next token
  5. Keep going until it reaches a stopping point

This step-by-step process is called autoregressive generation. It's like building a sentence one brick at a time, where each new brick depends on all the ones that came before it.
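To make this concrete, here's a minimal sketch of that loop written with the Hugging Face transformers library. Treat it as an illustration, not the article's deployment code: it assumes you already have a converted model directory like the ./llama folder we create in the next section, and it uses plain greedy decoding (always pick the single most likely token).

# Illustrative autoregressive loop (assumes the converted ./llama directory from
# the next section; any causal LM loaded through transformers works the same way).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./llama")
model = AutoModelForCausalLM.from_pretrained("./llama")
model.eval()

input_ids = tokenizer("The cat sat on the", return_tensors="pt").input_ids

for _ in range(20):  # generate at most 20 new tokens
    with torch.no_grad():
        logits = model(input_ids).logits                         # scores over the whole vocabulary
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy: pick the most likely token
    input_ids = torch.cat([input_ids, next_token], dim=-1)       # append it and go again
    if next_token.item() == tokenizer.eos_token_id:              # stop when the model signals the end
        break

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))

In practice you would call model.generate() rather than writing the loop yourself; the loop above is roughly what greedy decoding inside generate() boils down to.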

The autoregressive process is the same one used by all large language models, including ChatGPT and Claude. The speed differences you might notice when running LLaMA locally versus using commercial APIs usually come down to hardware resources, deployment optimization, and specific techniques used to speed up inference. A single response typically needs hundreds of predictions to complete, which is why having the right deployment setup matters for getting good performance.

Anyway, enough talk, let's jump into the code!

Preparing your LLM weights

First, let's talk about the model weights. Llama models come in various sizes (from 1B up to 70B+ parameters) and versions (Llama 3, Llama 3.2, Llama 3.2 Vision, etc.). For this tutorial, we'll use the Llama 3.2-1B-Instruct model, which strikes a good balance between performance and resource requirements.

Before we can use the model, we need to convert the weights to a format that works with the Hugging Face transformers library. This conversion process is necessary because Llama models are originally distributed in their own format, but we want to use them with the transformers ecosystem.

Here's the command to convert the weights using the conversion script provided by the Hugging Face team:

python convert_llama_weights_to_hf.py \
  --input_dir ~/.llama/checkpoints/Llama3.2-1B-Instruct/ \
  --model_size 1B \
  --output_dir ./llama \
  --llama_version 3.2

Let's break down what's happening:

  • --input_dir: points to where your original Llama weights are stored
  • --model_size: specifies the model size (1B in our case)
  • --output_dir: where the converted weights will be saved
  • --llama_version: specifies which version of Llama we're using

The conversion process creates a model that's compatible with the transformers library's LlamaForCausalLM architecture. This enables us to use all the powerful features of the transformers library, including:

  • Efficient tokenization
  • Model parallelism
  • Gradient checkpointing
  • Different generation strategies

For more details about Llama model usage and optimization tips, check out the official Hugging Face documentation.
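As a quick sanity check after the conversion (and to see one of those generation strategies in action), you can load the converted weights and sample a short completion. This is just a sketch, assuming the ./llama output directory from the command above:

# Hypothetical sanity check: load the converted weights and sample a completion.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./llama")
model = AutoModelForCausalLM.from_pretrained("./llama")

inputs = tokenizer("Tell me a fun fact about cats.", return_tensors="pt")
# do_sample=True with a temperature is one generation strategy; greedy decoding
# and beam search are others, all available through the same generate() call.
output_ids = model.generate(**inputs, max_new_tokens=30, do_sample=True, temperature=0.7)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))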

Setting up the backend

Our backend is built with FastAPI, a modern Python web framework known for its speed and ease of use. Note that this is just a simple way to deploy your model. For production, you might want to consider more advanced setups like ExecuTorch or vLLM. These frameworks are outside the scope of this tutorial but offer more advanced features for deploying large language models and we might explore them in future tutorials.

Here's our complete model service that handles predictions and streaming:

import asyncio
from queue import Queue
from threading import Thread

import joblib
import pandas as pd
import torch
from fastapi.responses import StreamingResponse
from jaqpot_api_client.models.prediction_request import PredictionRequest

from src.streamer import CustomStreamer


class ModelService:
    def __init__(self):
        self.model = joblib.load('model.pkl')
        self.tokenizer = joblib.load('tokenizer.pkl')

    def infer(self, request: PredictionRequest) -> StreamingResponse:
        # Convert input list to DataFrame
        input_data = pd.DataFrame(request.dataset.input)
        last_index = input_data.index[-1]
        input_row = input_data.iloc[last_index]
        prompt = input_row['prompt']

        return StreamingResponse(
            self.response_generator(prompt), media_type='text/event-stream'
        )

    def start_generation(self, query: str, streamer):
        prompt = f"""{query}"""
        device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = self.model.to(device)
        inputs = self.tokenizer([prompt], return_tensors="pt").to(device)
        generation_kwargs = dict(
            inputs, streamer=streamer, max_new_tokens=4096, temperature=0.1
        )
        thread = Thread(target=self.model.generate, kwargs=generation_kwargs)
        thread.start()

    async def response_generator(self, query: str):
        streamer_queue = Queue()
        streamer = CustomStreamer(streamer_queue, self.tokenizer, True)
        self.start_generation(query, streamer)

        while True:
            value = streamer_queue.get()
            if value is None:
                break
            yield value
            streamer_queue.task_done()
            await asyncio.sleep(0.1)

Let's examine the key components of our service:

1. Model initialization: We load both the model and tokenizer from saved files. This happens once when the service starts, keeping them in memory for quick access.

2. Inference endpoint: The infer method handles incoming requests. It expects a prompt and returns a streaming response.

3. Generation process: Text generation happens in a separate thread to prevent blocking. We use GPU acceleration when available.

4. Streaming setup: We use FastAPI's StreamingResponse to send tokens back to the client as they're generated, providing a real-time experience. The CustomStreamer that feeds the queue is sketched below.
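One piece not shown above is the CustomStreamer imported from src/streamer.py. The actual implementation lives in the project repository; here's a minimal sketch of what such a streamer typically looks like, built on the transformers TextStreamer class. The constructor arguments follow the usage in ModelService above, but treat this as an approximation rather than the project's exact code.

# Minimal sketch of a queue-backed streamer (approximation of src/streamer.py).
from queue import Queue

from transformers import TextStreamer


class CustomStreamer(TextStreamer):
    def __init__(self, queue: Queue, tokenizer, skip_prompt: bool, **decode_kwargs):
        super().__init__(tokenizer, skip_prompt=skip_prompt, **decode_kwargs)
        self._queue = queue

    def on_finalized_text(self, text: str, stream_end: bool = False):
        # model.generate() calls this every time a chunk of decoded text is ready
        self._queue.put(text)
        if stream_end:
            # Sentinel value: tells response_generator that the stream is finished
            self._queue.put(None)

Note that recent transformers versions also ship a TextIteratorStreamer that wraps a queue for you, which may be enough if you don't need any custom behaviour.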

Now, let's look at how we set up our FastAPI application to serve the model:

from contextlib import asynccontextmanager
import logging

import uvicorn
from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from jaqpot_api_client.models.prediction_request import PredictionRequest
from jaqpot_api_client.models.prediction_response import PredictionResponse

# Project-specific imports: adjust the paths to match your repository layout
from src.log_middleware import LogMiddleware
from src.model_service import ModelService

logger = logging.getLogger(__name__)

model_service: ModelService = None


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load the ML model once at startup
    global model_service
    model_service = ModelService()
    yield


app = FastAPI(title="ML Model API", lifespan=lifespan)
app.add_middleware(LogMiddleware)


@app.post("/infer", response_model=PredictionResponse)
def infer(req: PredictionRequest) -> StreamingResponse:
    try:
        return model_service.infer(req)
    except Exception as e:
        logger.error("Prediction request for model failed " + str(e))
        raise HTTPException(status_code=500, detail=str(e))


@app.get("/health")
def health_check():
    return {"status": "OK"}


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8005, log_config=None)

This FastAPI application:

  • uses FastAPI's lifespan management to initialize our model service when the app starts
  • provides an /infer endpoint that accepts prediction requests and returns streaming responses
  • includes a health check endpoint for monitoring
  • handles errors gracefully with proper HTTP status codes
  • runs on port 8005 with uvicorn server

The lifespan context manager ensures our model is loaded only once when the application starts, and the middleware adds logging capabilities to track requests and responses.
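Before touching the frontend, you can exercise the streaming endpoint directly from Python. This is a hypothetical smoke test using the httpx library; the exact request body depends on the PredictionRequest schema from jaqpot_api_client, so the simplified {"prompt": ...} payload below is an assumption that mirrors what the frontend sends.

# Hypothetical client-side test of the streaming endpoint (pip install httpx).
import httpx

with httpx.stream(
    "POST",
    "http://localhost:8005/infer",
    json={"prompt": "Tell me a joke about cats"},  # simplified payload (assumption)
    timeout=None,  # keep the connection open while tokens stream in
) as response:
    for chunk in response.iter_text():
        print(chunk, end="", flush=True)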

Creating the frontend

Our frontend is built with Next.js and the Next UI component library, providing a modern and responsive interface. Next UI gives us a set of beautiful, accessible components that work great with Next.js 13+.

We'll break down our frontend into two main components:

Navigation component

'use client';

import { useState } from 'react';
// NextUI components (the package name may differ depending on your NextUI version)
import { Button, Table, TableBody, TableCell, TableRow } from '@nextui-org/react';
// Adjust this import to wherever your ChatForm component lives
import { ChatForm } from './ChatForm';

export default function LLMNavigation() {
  const [conversations, setConversations] = useState<string[]>([]);

  return (
    <div className="flex flex-row gap-5">
      <div className="flex max-w-48 flex-col gap-3">
        <Button
          color="primary"
          onPress={() => setConversations([...conversations, 'New Chat'])}
        >
          start new chat
        </Button>
        <Table selectionMode="single">
          <TableBody>
            {conversations.map((chat, index) => (
              <TableRow key={index}>
                <TableCell>{chat}</TableCell>
              </TableRow>
            ))}
          </TableBody>
        </Table>
      </div>
      <ChatForm />
    </div>
  );
}

The navigation component manages chat history and session navigation. It uses Next UI's Table and Button components for a polished look.

Chat form component

'use client';

import { useState } from 'react';
import { Button, Card, CardBody, Textarea } from '@nextui-org/react';
// Illustrative import; use whichever icon component your project provides
import { ArrowUpIcon } from './icons';

export function ChatForm() {
  const [isLoading, setIsLoading] = useState(false);
  const [currentResponse, setCurrentResponse] = useState<string>();
  const [chatHistory, setChatHistory] = useState<
    Array<{ prompt: string; response: string }>
  >([]);

  const createStreamingPrediction = async (prompt: string) => {
    const response = await fetch('http://localhost:8005/infer', {
      method: 'POST',
      // The request body is JSON; the response comes back as a text/event-stream
      headers: {
        'Content-Type': 'application/json',
        Accept: 'text/event-stream',
      },
      body: JSON.stringify({ prompt }),
    });
    if (!response.body) return;

    const reader = response.body
      .pipeThrough(new TextDecoderStream())
      .getReader();

    let fullResponse = '';
    while (true) {
      const { value, done } = await reader.read();
      if (done) break;
      if (value) {
        fullResponse += value + ' ';
        setCurrentResponse(fullResponse);
      }
    }
    return fullResponse;
  };

  const handleSubmit = async (prompt: string) => {
    try {
      setIsLoading(true);
      const response = await createStreamingPrediction(prompt);
      setChatHistory((prev) => [...prev, { prompt, response: response ?? '' }]);
      // Clear the in-progress bubble once the finished turn is stored in history
      setCurrentResponse(undefined);
    } catch (error) {
      console.error('Failed to get response:', error);
    } finally {
      setIsLoading(false);
    }
  };

  return (
    <Card>
      <CardBody>
        <div className="flex flex-col gap-4">
          {chatHistory.map((chat, index) => (
            <div key={index} className="space-y-2">
              <div className="font-medium">You: {chat.prompt}</div>
              <div>AI: {chat.response}</div>
            </div>
          ))}
          {currentResponse && <div>AI: {currentResponse}</div>}
          <form
            onSubmit={(e) => {
              e.preventDefault();
              const input = e.currentTarget.elements.namedItem(
                'prompt',
              ) as HTMLInputElement;
              handleSubmit(input.value);
              input.value = '';
            }}
          >
            <Textarea
              name="prompt"
              placeholder="Type your message..."
              endContent={
                <Button isIconOnly type="submit" isLoading={isLoading}>
                  <ArrowUpIcon />
                </Button>
              }
            />
          </form>
        </div>
      </CardBody>
    </Card>
  );
}

The chat form component is where the magic happens. It:

  • Manages the chat interface using Next UI components (Card, Button, Textarea)
  • Handles real-time streaming using the web streams API
  • Maintains chat history and UI state
  • Provides a smooth typing experience with loading states
  • Implements auto-scrolling for new messages
  • Supports keyboard shortcuts for better usability

Key features of our frontend implementation:

  • Real-time streaming using the web streams API
  • Responsive design with Next UI components
  • Chat history management with state persistence
  • Clean, accessible UI with loading states
  • Auto-scrolling chat container
  • Keyboard shortcuts (enter to send)
  • Error handling and recovery

After wiring all of this together, we end up with a UI that looks like this:

[Image: the Llama chat UI]

Deployment

Let's look at containerizing our model service. We'll explore two approaches: a multi-stage build that includes training, and our current production setup with pre-trained models.

Approach 1: Multi-stage build (work in progress)

Here's a multi-stage Dockerfile that attempts to include the training process:

First, here's the training code, train.py:

import joblib
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./llama"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)
model.to("cpu")

# Save the model
joblib.dump(model, 'model.pkl')
joblib.dump(tokenizer, 'tokenizer.pkl')

and the Dockerfile:

# Training stage
FROM python:3.10 AS trainer

WORKDIR /code
COPY ./requirements.txt /code/requirements.txt
COPY ./src /code/src
COPY ./train.py /code/train.py
COPY ./llama /code/llama

RUN pip install --no-cache-dir --upgrade -r /code/requirements.txt
RUN python train.py

# Deployment stage
FROM python:3.10

WORKDIR /code
COPY ./requirements.txt /code/requirements.txt
RUN pip install --no-cache-dir --upgrade -r /code/requirements.txt

COPY ./src /code/src
COPY ./main.py /code/main.py
# Pull the serialized model artifacts out of the training stage
COPY --from=trainer /code/model.pkl /code/model.pkl
COPY --from=trainer /code/tokenizer.pkl /code/tokenizer.pkl

EXPOSE 8005
CMD ["python", "-m", "main", "--port", "8005"]

However, training during the Docker build process proved challenging due to memory constraints and build timeouts.

Instead, we perform the training offline using the train.py script and move on to the second approach for now.

This script:

  • loads the converted Llama model and tokenizer
  • moves the model to CPU for compatibility
  • saves both as pickle files for easy loading in our service

Approach 2: Production setup

For production, we use a simpler Dockerfile that focuses on deployment:

FROM python:3.10

WORKDIR /code
COPY ./requirements.txt /code/requirements.txt
RUN pip install --no-cache-dir --upgrade -r /code/requirements.txt

COPY ./src /code/src
COPY ./main.py /code/main.py
COPY ./model.pkl /code/model.pkl
COPY ./tokenizer.pkl /code/tokenizer.pkl

EXPOSE 8005
CMD ["python", "-m", "main", "--port", "8005"]

This production Dockerfile:

  • uses Python 3.10 base image
  • sets up a clean working directory
  • installs requirements first for better caching
  • copies only the necessary files
  • runs our FastAPI service on port 8005

Next steps:

  1. Run the training script locally to generate the model files
  2. Build your Docker image: docker build -t llm-service .
  3. Set up your Next.js frontend
  4. Configure your environment variables
  5. Deploy using your preferred hosting solution (a quick health-check snippet to verify the running service follows below)
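Once the container is running, a quick way to confirm the service is up is to hit the /health endpoint we defined earlier. A small, hypothetical smoke test, assuming the default port 8005:

# Hypothetical smoke test for the deployed container (pip install httpx).
import httpx

resp = httpx.get("http://localhost:8005/health", timeout=5)
print(resp.status_code, resp.json())  # expect: 200 {'status': 'OK'}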

In part 2, I will explain how to deploy this setup on AWS, using a GPU EC2 instance (g4dn.xlarge) in an EKS cluster with a GPU node group dedicated to your LLM pod.

P.S. This exact setup is currently running on the Jaqpot platform at https://app.jaqpot.org/dashboard/models/1983/description. Check it out to see it in action!

P.S.2 The code for this tutorial is available on GitHub at https://github.com/ntua-unit-of-control-and-informatics/jaqpot-Llama3.2-1B-Instruct.