
Building a streaming LLM with Next.js, FastAPI & Docker: the complete stack (part 1)

Alex Arvanitidis (@alarv) · 10 min read

[Cover image: Deploy LLM]

Want to build a modern LLM application with real-time streaming responses? Here's a complete guide covering backend, frontend, and deployment. We'll be using some of the most popular technologies in the modern web development ecosystem: Next.js with the Next UI component library, FastAPI for our backend, and Docker for containerization.

Today we'll be deploying Llama, an open-source large language model developed by Meta, with real-time streaming responses. This setup lets users interact with the model and watch responses appear token by token as they're generated. We'll cover the complete stack: backend, frontend, and deployment.

The rise of ChatGPT made many people interested in large language models, but its code stayed private. This led to the development of open alternatives like LLaMA, which anyone can use and modify. LLaMA, like other language models, creates text one piece at a time. Here's how it works:

When you type a question or prompt, LLaMA processes it token by token. A token is basically a piece of text - it could be a word, part of a word, or even a single character. The model looks at all the tokens you've given it and predicts what should come next.

For example, if you write "The cat 🐈 sat on the", LLaMA will:

  1. Read your input
  2. Predict the next most likely token (maybe "mat")
  3. Add "mat" to your input
  4. Use the new, longer text to predict the next token
  5. Keep going until it reaches a stopping point

This step-by-step process is called autoregressive generation. It's like building a sentence one brick at a time, where each new brick depends on all the ones that came before it.
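To make this concrete, here's a minimal sketch of that loop written with the Hugging Face transformers library. Treat it as an illustration, not the article's deployment code: it assumes you already have a converted model directory like the ./llama folder we create in the next section, and it uses plain greedy decoding (always pick the single most likely token).

# Illustrative autoregressive loop (assumes the converted ./llama directory from
# the next section; any causal LM loaded through transformers works the same way).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./llama")
model = AutoModelForCausalLM.from_pretrained("./llama")
model.eval()

input_ids = tokenizer("The cat sat on the", return_tensors="pt").input_ids

for _ in range(20):  # generate at most 20 new tokens
    with torch.no_grad():
        logits = model(input_ids).logits                         # scores over the whole vocabulary
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy: pick the most likely token
    input_ids = torch.cat([input_ids, next_token], dim=-1)       # append it and go again
    if next_token.item() == tokenizer.eos_token_id:              # stop when the model signals the end
        break

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))

In practice you would call model.generate() rather than writing the loop yourself; the loop above is roughly what greedy decoding inside generate() boils down to.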

The autoregressive process is the same one used by all large language models, including ChatGPT and Claude. The speed differences you might notice when running LLaMA locally versus using commercial APIs usually come down to hardware resources, deployment optimization, and specific techniques used to speed up inference. A single response typically needs hundreds of predictions to complete, which is why having the right deployment setup matters for getting good performance.

Anyway, enough talk, let's jump into the code!

Preparing your LLM weights

First, let's talk about the model weights. Llama models come in various sizes (from 1B up to 70B+ parameters) and versions (Llama 3, Llama 3.2, Llama 3.2 Vision, etc.). For this tutorial, we'll use the Llama 3.2-1B-Instruct model, which strikes a good balance between performance and resource requirements.

Before we can use the model, we need to convert the weights to a format that works with the Hugging Face transformers library. This conversion process is necessary because Llama models are originally distributed in their own format, but we want to use them with the transformers ecosystem.

Here's the command to convert the weights using the conversion script provided by the Hugging Face team:

python convert_llama_weights_to_hf.py \
  --input_dir ~/.llama/checkpoints/Llama3.2-1B-Instruct/ \
  --model_size 1B \
  --output_dir ./llama \
  --llama_version 3.2

Let's break down what's happening:

  • --input_dir: points to where your original Llama weights are stored
  • --model_size: specifies the model size (1B in our case)
  • --output_dir: where the converted weights will be saved
  • --llama_version: specifies which version of Llama we're using

The conversion process creates a model that's compatible with the transformers library's LlamaForCausalLM architecture. This enables us to use all the powerful features of the transformers library, including:

  • Efficient tokenization
  • Model parallelism
  • Gradient checkpointing
  • Different generation strategies

For more details about Llama model usage and optimization tips, check out the official Hugging Face documentation.
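As a quick sanity check after the conversion (and to see one of those generation strategies in action), you can load the converted weights and sample a short completion. This is just a sketch, assuming the ./llama output directory from the command above:

# Hypothetical sanity check: load the converted weights and sample a completion.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./llama")
model = AutoModelForCausalLM.from_pretrained("./llama")

inputs = tokenizer("Tell me a fun fact about cats.", return_tensors="pt")
# do_sample=True with a temperature is one generation strategy; greedy decoding
# and beam search are others, all available through the same generate() call.
output_ids = model.generate(**inputs, max_new_tokens=30, do_sample=True, temperature=0.7)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))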

Setting up the backend

Our backend is built with FastAPI, a modern Python web framework known for its speed and ease of use. Note that this is just a simple way to deploy your model. For production, you might want to consider more advanced setups like ExecuTorch or vLLM. These frameworks are outside the scope of this tutorial but offer more advanced features for deploying large language models and we might explore them in future tutorials.

Here's our complete model service that handles predictions and streaming:

import asyncio
from queue import Queue
from threading import Thread

import joblib
import pandas as pd
import torch
from fastapi.responses import StreamingResponse
from jaqpot_api_client.models.prediction_request import PredictionRequest

from src.streamer import CustomStreamer


class ModelService:
    def __init__(self):
        self.model = joblib.load('model.pkl')
        self.tokenizer = joblib.load('tokenizer.pkl')

    def infer(self, request: PredictionRequest) -> StreamingResponse:
        # Convert input list to DataFrame
        input_data = pd.DataFrame(request.dataset.input)
        last_index = input_data.index[-1]
        input_row = input_data.iloc[last_index]
        prompt = input_row['prompt']

        return StreamingResponse(
            self.response_generator(prompt), media_type='text/event-stream'
        )

    def start_generation(self, query: str, streamer):
        prompt = f"""{query}"""
        device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = self.model.to(device)
        inputs = self.tokenizer([prompt], return_tensors="pt").to(device)
        generation_kwargs = dict(
            inputs, streamer=streamer, max_new_tokens=4096, temperature=0.1
        )
        thread = Thread(target=self.model.generate, kwargs=generation_kwargs)
        thread.start()

    async def response_generator(self, query: str):
        streamer_queue = Queue()
        streamer = CustomStreamer(streamer_queue, self.tokenizer, True)
        self.start_generation(query, streamer)

        while True:
            value = streamer_queue.get()
            if value is None:
                break
            yield value
            streamer_queue.task_done()
            await asyncio.sleep(0.1)

Let's examine the key components of our service:

1. Model initialization: We load both the model and tokenizer from saved files. This happens once when the service starts, keeping them in memory for quick access.

2. Inference endpoint: The infer method handles incoming requests. It expects a prompt and returns a streaming response.

3. Generation process: Text generation happens in a separate thread to prevent blocking. We use GPU acceleration when available.

4. Streaming setup: We use FastAPI's StreamingResponse to send tokens back to the client as they're generated, providing a real-time experience. The CustomStreamer that feeds the queue is sketched below.
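One piece not shown above is the CustomStreamer imported from src/streamer.py. The actual implementation lives in the project repository; here's a minimal sketch of what such a streamer typically looks like, built on the transformers TextStreamer class. The constructor arguments follow the usage in ModelService above, but treat this as an approximation rather than the project's exact code.

# Minimal sketch of a queue-backed streamer (approximation of src/streamer.py).
from queue import Queue

from transformers import TextStreamer


class CustomStreamer(TextStreamer):
    def __init__(self, queue: Queue, tokenizer, skip_prompt: bool, **decode_kwargs):
        super().__init__(tokenizer, skip_prompt=skip_prompt, **decode_kwargs)
        self._queue = queue

    def on_finalized_text(self, text: str, stream_end: bool = False):
        # model.generate() calls this every time a chunk of decoded text is ready
        self._queue.put(text)
        if stream_end:
            # Sentinel value: tells response_generator that the stream is finished
            self._queue.put(None)

Note that recent transformers versions also ship a TextIteratorStreamer that wraps a queue for you, which may be enough if you don't need any custom behaviour.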

Now, let's look at how we set up our FastAPI application to serve the model:

from contextlib import asynccontextmanager
import logging

import uvicorn
from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from jaqpot_api_client.models.prediction_request import PredictionRequest
from jaqpot_api_client.models.prediction_response import PredictionResponse

# Project-specific imports: adjust the paths to match your repository layout
from src.log_middleware import LogMiddleware
from src.model_service import ModelService

logger = logging.getLogger(__name__)

model_service: ModelService = None


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load the ML model once at startup
    global model_service
    model_service = ModelService()
    yield


app = FastAPI(title="ML Model API", lifespan=lifespan)
app.add_middleware(LogMiddleware)


@app.post("/infer", response_model=PredictionResponse)
def infer(req: PredictionRequest) -> StreamingResponse:
    try:
        return model_service.infer(req)
    except Exception as e:
        logger.error("Prediction request for model failed " + str(e))
        raise HTTPException(status_code=500, detail=str(e))


@app.get("/health")
def health_check():
    return {"status": "OK"}


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8005, log_config=None)

This FastAPI application:

  • uses FastAPI's lifespan management to initialize our model service when the app starts
  • provides an /infer endpoint that accepts prediction requests and returns streaming responses
  • includes a health check endpoint for monitoring
  • handles errors gracefully with proper HTTP status codes
  • runs on port 8005 with uvicorn server

The lifespan context manager ensures our model is loaded only once when the application starts, and the middleware adds logging capabilities to track requests and responses.
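Before touching the frontend, you can exercise the streaming endpoint directly from Python. This is a hypothetical smoke test using the httpx library; the exact request body depends on the PredictionRequest schema from jaqpot_api_client, so the simplified {"prompt": ...} payload below is an assumption that mirrors what the frontend sends.

# Hypothetical client-side test of the streaming endpoint (pip install httpx).
import httpx

with httpx.stream(
    "POST",
    "http://localhost:8005/infer",
    json={"prompt": "Tell me a joke about cats"},  # simplified payload (assumption)
    timeout=None,  # keep the connection open while tokens stream in
) as response:
    for chunk in response.iter_text():
        print(chunk, end="", flush=True)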

Creating the frontend

Our frontend is built with Next.js and the Next UI component library, providing a modern and responsive interface. Next UI gives us a set of beautiful, accessible components that work great with Next.js 13+.

We'll break down our frontend into two main components:

Navigation component

'use client';

import { useState } from 'react';
// NextUI components (the package name may differ depending on your NextUI version)
import { Button, Table, TableBody, TableCell, TableRow } from '@nextui-org/react';
// Adjust this import to wherever your ChatForm component lives
import { ChatForm } from './ChatForm';

export default function LLMNavigation() {
  const [conversations, setConversations] = useState<string[]>([]);

  return (
    <div className="flex flex-row gap-5">
      <div className="flex max-w-48 flex-col gap-3">
        <Button
          color="primary"
          onPress={() => setConversations([...conversations, 'New Chat'])}
        >
          start new chat
        </Button>
        <Table selectionMode="single">
          <TableBody>
            {conversations.map((chat, index) => (
              <TableRow key={index}>
                <TableCell>{chat}</TableCell>
              </TableRow>
            ))}
          </TableBody>
        </Table>
      </div>
      <ChatForm />
    </div>
  );
}

The navigation component manages chat history and session navigation. It uses Next UI's Table and Button components for a polished look.

Chat form component

'use client';

import { useState } from 'react';
import { Button, Card, CardBody, Textarea } from '@nextui-org/react';
// Illustrative import; use whichever icon component your project provides
import { ArrowUpIcon } from './icons';

export function ChatForm() {
  const [isLoading, setIsLoading] = useState(false);
  const [currentResponse, setCurrentResponse] = useState<string>();
  const [chatHistory, setChatHistory] = useState<
    Array<{ prompt: string; response: string }>
  >([]);

  const createStreamingPrediction = async (prompt: string) => {
    const response = await fetch('http://localhost:8005/infer', {
      method: 'POST',
      // The request body is JSON; the response comes back as a text/event-stream
      headers: {
        'Content-Type': 'application/json',
        Accept: 'text/event-stream',
      },
      body: JSON.stringify({ prompt }),
    });
    if (!response.body) return;

    const reader = response.body
      .pipeThrough(new TextDecoderStream())
      .getReader();

    let fullResponse = '';
    while (true) {
      const { value, done } = await reader.read();
      if (done) break;
      if (value) {
        fullResponse += value + ' ';
        setCurrentResponse(fullResponse);
      }
    }
    return fullResponse;
  };

  const handleSubmit = async (prompt: string) => {
    try {
      setIsLoading(true);
      const response = await createStreamingPrediction(prompt);
      setChatHistory((prev) => [...prev, { prompt, response: response ?? '' }]);
      // Clear the in-progress bubble once the finished turn is stored in history
      setCurrentResponse(undefined);
    } catch (error) {
      console.error('Failed to get response:', error);
    } finally {
      setIsLoading(false);
    }
  };

  return (
    <Card>
      <CardBody>
        <div className="flex flex-col gap-4">
          {chatHistory.map((chat, index) => (
            <div key={index} className="space-y-2">
              <div className="font-medium">You: {chat.prompt}</div>
              <div>AI: {chat.response}</div>
            </div>
          ))}
          {currentResponse && <div>AI: {currentResponse}</div>}
          <form
            onSubmit={(e) => {
              e.preventDefault();
              const input = e.currentTarget.elements.namedItem(
                'prompt',
              ) as HTMLInputElement;
              handleSubmit(input.value);
              input.value = '';
            }}
          >
            <Textarea
              name="prompt"
              placeholder="Type your message..."
              endContent={
                <Button isIconOnly type="submit" isLoading={isLoading}>
                  <ArrowUpIcon />
                </Button>
              }
            />
          </form>
        </div>
      </CardBody>
    </Card>
  );
}

The chat form component is where the magic happens. It:

  • Manages the chat interface using Next UI components (Card, Button, Textarea)
  • Handles real-time streaming using the web streams API
  • Maintains chat history and UI state
  • Provides a smooth typing experience with loading states
  • Implements auto-scrolling for new messages
  • Supports keyboard shortcuts for better usability

Key features of our frontend implementation:

  • Real-time streaming using the web streams API
  • Responsive design with Next UI components
  • Chat history management with state persistence
  • Clean, accessible UI with loading states
  • Auto-scrolling chat container
  • Keyboard shortcuts (enter to send)
  • Error handling and recovery

After wiring all of this together, we end up with a UI that looks like this:

[Image: the Llama chat UI]

Deployment

Let's look at containerizing our model service. We'll explore two approaches: a multi-stage build that includes training, and our current production setup with pre-trained models.

Approach 1: Multi-stage build (work in progress)

Here's a multi-stage Dockerfile that attempts to include the training process:

First, here's the training code, train.py:

import joblib
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./llama"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)
model.to("cpu")

# Save the model
joblib.dump(model, 'model.pkl')
joblib.dump(tokenizer, 'tokenizer.pkl')

and the Dockerfile:

# Training stage
FROM python:3.10 AS trainer

WORKDIR /code
COPY ./requirements.txt /code/requirements.txt
COPY ./src /code/src
COPY ./train.py /code/train.py
COPY ./llama /code/llama

RUN pip install --no-cache-dir --upgrade -r /code/requirements.txt
RUN python train.py

# Deployment stage
FROM python:3.10

WORKDIR /code
COPY ./requirements.txt /code/requirements.txt
RUN pip install --no-cache-dir --upgrade -r /code/requirements.txt

COPY ./src /code/src
COPY ./main.py /code/main.py
# Pull the serialized model artifacts out of the training stage
COPY --from=trainer /code/model.pkl /code/model.pkl
COPY --from=trainer /code/tokenizer.pkl /code/tokenizer.pkl

EXPOSE 8005
CMD ["python", "-m", "main", "--port", "8005"]

However, training during the Docker build process proved challenging due to memory constraints and build timeouts.

Instead, we perform the training offline using the train.py script and move on to the second approach for now.

This script:

  • loads the converted Llama model and tokenizer
  • moves the model to CPU for compatibility
  • saves both as pickle files for easy loading in our service

Approach 2: Production setup

For production, we use a simpler Dockerfile that focuses on deployment:

FROM python:3.10

WORKDIR /code
COPY ./requirements.txt /code/requirements.txt
RUN pip install --no-cache-dir --upgrade -r /code/requirements.txt

COPY ./src /code/src
COPY ./main.py /code/main.py
COPY ./model.pkl /code/model.pkl
COPY ./tokenizer.pkl /code/tokenizer.pkl

EXPOSE 8005
CMD ["python", "-m", "main", "--port", "8005"]

This production Dockerfile:

  • uses Python 3.10 base image
  • sets up a clean working directory
  • installs requirements first for better caching
  • copies only the necessary files
  • runs our FastAPI service on port 8005

Next steps:

  1. Run the training script locally to generate the model files
  2. Build your Docker image: docker build -t llm-service .
  3. Set up your Next.js frontend
  4. Configure your environment variables
  5. Deploy using your preferred hosting solution (a quick health-check snippet to verify the running service follows below)
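Once the container is running, a quick way to confirm the service is up is to hit the /health endpoint we defined earlier. A small, hypothetical smoke test, assuming the default port 8005:

# Hypothetical smoke test for the deployed container (pip install httpx).
import httpx

resp = httpx.get("http://localhost:8005/health", timeout=5)
print(resp.status_code, resp.json())  # expect: 200 {'status': 'OK'}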

In part 2, I will explain how to deploy this setup on AWS, using a GPU EC2 instance (g4dn.xlarge) in an EKS cluster with a GPU node group dedicated to your LLM pod.

P.S. This exact setup is currently running on the Jaqpot platform at https://app.jaqpot.org/dashboard/models/1983/description. Check it out to see it in action!

P.S.2 The code for this tutorial is available on GitHub at https://github.com/ntua-unit-of-control-and-informatics/jaqpot-Llama3.2-1B-Instruct.