Lotus Reader - The Hackernews Client

0xjunhao, 20 hours ago

Hi, I'm the author of this post. Writing it was a great learning experience. I gained a lot of insight into vLLM. If you have any feedback or questions, feel free to drop a comment below!

criemen (replying to 0xjunhao), 20 hours ago

Thanks for writing the article!

I didn't quite get this part:

"Note that during the prefill phase, all prompt tokens from a request can be processed in one batch. This is possible because the query (Q) tensors, calculated from the tokens immediately before them, are available for each prompt token position."

I know that in practice prefill is much faster than inference. Would watching the 2h video from Karpathy help me understand why?

criemen (replying to criemen), 19 hours ago

And on the topic of prefill: do you know how the role of GPUs in prefill differs from their role in inference?

animan (replying to criemen), 17 hours ago

Prefill is part of inference. It's the first major step, where you calculate all the keys and values for the input tokens.

Decode is the next major step, where you generate output tokens one at a time.

Both run on GPUs but have somewhat different workloads:

1. Prefill is compute-heavy and needs relatively little I/O from VRAM (HBM) per token processed.
2. Decode is light on compute but has to read the keys and values computed in the prefill stage for every output token.

dist-epoch (replying to animan), 11 hours ago

Doesn't decode also need to stream in all of the model weights, making it very I/O heavy?

animan (replying to criemen), 17 hours ago

That snippet is trying to say that you can calculate KV for all the input tokens at once, and you don't need to loop over them since you have them all available.

For decode, by contrast, you need to generate each token sequentially.
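For anyone who wants to see the difference concretely, here is a minimal sketch of a single attention head in plain Python. It is a toy under stated assumptions, not vLLM's code: the weights are random, the "embeddings" are made-up vectors, and feeding each decode output back in as the next input is only a stand-in for sampling a token and embedding it. The point it illustrates is that prefill can build K and V for every prompt position in one pass, while decode has to go one step at a time because each input depends on the previous output.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    # W is a list of rows; returns W @ x.
    return [dot(row, x) for row in W]

def attend(q, keys, values):
    # One query attends over the given keys/values (scaled dot-product softmax).
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[i] for e, v in zip(exps, values))
            for i in range(len(values[0]))]

def prefill(prompt, Wq, Wk, Wv):
    # Every prompt vector is known up front, so K and V for all positions
    # can be computed together (on a GPU this is one batched matmul).
    keys = [matvec(Wk, x) for x in prompt]
    values = [matvec(Wv, x) for x in prompt]
    # Causal masking: position t only attends to positions 0..t.
    outputs = [attend(matvec(Wq, x), keys[:t + 1], values[:t + 1])
               for t, x in enumerate(prompt)]
    return outputs, (keys, values)

def decode_step(x, cache, Wq, Wk, Wv):
    # Only one new position: append its K/V to the cache, then attend over
    # everything cached so far.
    keys, values = cache
    keys.append(matvec(Wk, x))
    values.append(matvec(Wv, x))
    return attend(matvec(Wq, x), keys, values)

random.seed(0)
D = 4  # toy hidden size (assumption)

def rand_matrix():
    return [[random.uniform(-1, 1) for _ in range(D)] for _ in range(D)]

Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()
prompt = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(3)]

prefill_out, cache = prefill(prompt, Wq, Wk, Wv)

# Decode: each step's input is only available after the previous step ran.
x = prefill_out[-1]
decode_inputs, decode_out = [], []
for _ in range(2):
    decode_inputs.append(x)
    x = decode_step(x, cache, Wq, Wk, Wv)
    decode_out.append(x)

# Sanity check: recomputing everything from scratch gives the same outputs,
# so the cache saves work without changing the result.
full_out, _ = prefill(prompt + decode_inputs, Wq, Wk, Wv)
max_diff = max(abs(a - b)
               for row_a, row_b in zip(full_out[3:], decode_out)
               for a, b in zip(row_a, row_b))
```

The loop over decode steps cannot be collapsed into one batch the way prefill can, because `decode_inputs[1]` does not exist until the first `decode_step` has finished.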

longbeachbass (replying to 0xjunhao), 13 hours ago

Thanks for this! Learnt a lot.

Curious how we ensure that the same model instance gets requests from the same client/user, since conversations are stateful and the model needs context from previous turns of the conversation.

Is this happening at the load balancer layer?

cyanf (replying to longbeachbass), 13 hours ago

It's either sticky sessions or a load balancer that keeps track of prior sequences and routes each request to the instance with the largest match. https://docs.sglang.ai/router/router.html
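A rough sketch of the second option, assuming requests arrive as lists of token IDs. The class name `PrefixAwareRouter`, the instance names, and the tie-breaking rule are all hypothetical and chosen for illustration; this is not the SGLang router's implementation, and a real router would likely use a tree structure rather than the linear scan used here.

```python
def shared_prefix_len(a, b):
    # Number of leading tokens that two sequences have in common.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class PrefixAwareRouter:
    """Toy router: send each request to the instance that has already seen
    the longest matching prefix, breaking ties by lowest load."""

    def __init__(self, instances):
        # instance name -> token sequences previously routed to it
        self.history = {name: [] for name in instances}

    def route(self, tokens):
        def score(name):
            seqs = self.history[name]
            match = max((shared_prefix_len(tokens, s) for s in seqs), default=0)
            # Longer match wins; on a tie, prefer the instance with fewer requests.
            return (match, -len(seqs))

        chosen = max(self.history, key=score)
        self.history[chosen].append(list(tokens))
        return chosen

router = PrefixAwareRouter(["gpu-0", "gpu-1"])
first = router.route([1, 2, 3])          # first turn of conversation A
second = router.route([9, 8, 7])         # unrelated conversation B
third = router.route([1, 2, 3, 4, 5])    # second turn of A, extends the first turn
```

Here the second turn of conversation A lands on the same instance as its first turn, which is where any cached keys and values for the shared prefix would live.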

hhh (replying to longbeachbass), 12 hours ago

They’re not stateful; you submit the entire history with every call. Prompt caching and the like make it important for performance to have sticky sessions or something similar at the load balancer layer.

3abiton (replying to 0xjunhao), 10 hours ago

Great write-up! It would be interesting to see how many of the features covered compare with those in other frameworks.

zackangelo (replying to 0xjunhao), an hour ago

In your forward pass section you give a lot of emphasis to FlashAttention, but it might be worth mentioning PagedAttention as well (it was introduced in the paper written by the vLLM authors, and I believe it was the genesis of the project). PagedAttention-style block tables are now supported in most fused attention kernels, but vLLM originally came up with the idea, and it's the main reason why vLLM has such high throughput!
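To make the block-table idea a bit more tangible, here is a small sketch of the bookkeeping only (no attention math). It assumes a fixed pool of equally sized KV blocks; the class `BlockAllocator`, its method names, and the block size are hypothetical and are not vLLM's actual data structures. Each request gets a table mapping its logical token positions to physical blocks, so its cache does not need to be contiguous in memory.

```python
class BlockAllocator:
    """Toy KV-cache allocator with per-request block tables."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # physical block ids not in use
        self.block_tables = {}               # request id -> list of physical block ids
        self.lengths = {}                    # request id -> number of tokens stored

    def append_token(self, req_id):
        # Reserve a slot for one more token; grab a new block only when the
        # request's last block is full (or it has no block yet).
        table = self.block_tables.setdefault(req_id, [])
        n = self.lengths.get(req_id, 0)
        if n % self.block_size == 0:
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())
        self.lengths[req_id] = n + 1
        return table[-1], n % self.block_size  # (physical block, offset in block)

    def locate(self, req_id, pos):
        # Translate a logical token position into a physical slot.
        table = self.block_tables[req_id]
        return table[pos // self.block_size], pos % self.block_size

    def release(self, req_id):
        # Return all of a finished request's blocks to the free pool.
        self.free.extend(self.block_tables.pop(req_id, []))
        self.lengths.pop(req_id, None)

alloc = BlockAllocator(num_blocks=4, block_size=2)
# Two requests grow in an interleaved fashion, as they would in a shared batch.
for req in ["A", "B", "A", "B", "A"]:
    alloc.append_token(req)

a_table = list(alloc.block_tables["A"])  # physical blocks backing request A
a_slot = alloc.locate("A", 2)            # where A's third token lives
alloc.release("B")                       # B finishes; its block is reusable at once
```

Because memory is handed out one small block at a time, a request wastes at most one partially filled block, and blocks freed by a finished request can be reused by any other request.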

mhlakhani, 17 hours ago

Thanks for writing this up! I learnt a bunch from it. I noticed this didn’t discuss additional layers of caching - I can see how it would fit in, but is prompt caching out of the scope of this system?

gdiamos, 16 hours ago

Great write-up. We use vLLM's KV cache and continuous batching as a foundation for requests in ScalarLM, and we also add batching optimizations in a centralized queue and explicit batching support in our client.

https://www.scalarlm.com

There is more perf you can squeeze out of vLLM.

r0b05, 14 hours ago

Great write up!

Does batching add data from multiple requests into the same context, potentially hurting perplexity? If so, are we trading off perplexity for lower operating costs?

ethan_smith (replying to r0b05), 9 hours ago

Batching in vLLM doesn't combine prompts into the same context - it processes separate requests in parallel while sharing compute resources, so there's no perplexity tradeoff, just efficiency gains.
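A toy scheduler can illustrate the separation. Everything here is an assumption made for the example: `toy_next_token` is a hypothetical stand-in for a model, and the loop is a simplified picture of continuous batching, not vLLM's scheduler. Each request keeps its own context, and the "model" only ever sees that one context, so outputs are identical whether a request runs alone or alongside others.

```python
from collections import deque

def toy_next_token(context):
    # Hypothetical stand-in for a model: the next token depends only on
    # this request's own tokens, never on other requests in the batch.
    return sum(context) % 100

def run_continuous_batching(requests, max_batch_size):
    # requests: id -> (prompt tokens, number of new tokens, assumed >= 1)
    waiting = deque(requests.items())
    running = {}    # id -> [context, tokens still to generate]
    finished = {}
    steps = 0
    while waiting or running:
        # Admit waiting requests whenever a batch slot is free.
        while waiting and len(running) < max_batch_size:
            rid, (prompt, max_new) = waiting.popleft()
            running[rid] = [list(prompt), max_new]
        # One engine step: every running request advances by one token.
        for rid in list(running):
            context, remaining = running[rid]
            context.append(toy_next_token(context))
            running[rid][1] = remaining - 1
            if running[rid][1] <= 0:
                finished[rid] = context
                del running[rid]   # slot frees up for the next step
        steps += 1
    return finished, steps

requests = {"a": ([1, 2, 3], 2), "b": ([10, 20], 1), "c": ([5], 2)}

batched, batched_steps = run_continuous_batching(requests, max_batch_size=2)

# Run each request on its own for comparison.
solo = {}
solo_steps = 0
for rid, req in requests.items():
    out, n = run_continuous_batching({rid: req}, max_batch_size=1)
    solo.update(out)
    solo_steps += n
```

In this toy run the batched schedule needs fewer engine steps than running the requests one after another, while producing the same tokens for every request.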

zettabomb (replying to ethan_smith), 6 hours ago

It's worth noting that the reason this works is that basically every LLM architecture currently in use is severely limited by memory bandwidth, not by compute. So it's trivial to run several requests at a time while waiting for the next weights to arrive from VRAM.
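A back-of-the-envelope version of that argument, as a sketch. It uses the common approximation of about 2 FLOPs per parameter per generated token, assumes the weights are read from memory once per decode step and shared by the whole batch, and ignores KV cache reads and attention FLOPs. The hardware figures are round illustrative numbers, not the specs of any particular GPU.

```python
def decode_step_time(num_params, batch_size, bytes_per_param, mem_bw, peak_flops):
    # Time to stream the weights once; this cost does not grow with batch size.
    t_mem = num_params * bytes_per_param / mem_bw
    # Time for the matmuls: roughly 2 FLOPs per parameter per token in the batch.
    t_compute = 2 * num_params * batch_size / peak_flops
    # The step takes as long as the slower of the two.
    return max(t_mem, t_compute), t_mem, t_compute

NUM_PARAMS = 7e9        # assumed model size
BYTES_PER_PARAM = 2     # 16-bit weights
MEM_BW = 2e12           # bytes/s, illustrative
PEAK_FLOPS = 4e14       # FLOP/s, illustrative

results = {}
for batch in (1, 64, 400):
    step, t_mem, t_compute = decode_step_time(
        NUM_PARAMS, batch, BYTES_PER_PARAM, MEM_BW, PEAK_FLOPS)
    results[batch] = step
    bound = "memory" if t_mem >= t_compute else "compute"
    print(f"batch={batch:4d} step={step * 1e3:.2f} ms "
          f"tokens/s={batch / step:,.0f} ({bound}-bound)")

# Batch size at which compute time catches up with weight-streaming time.
crossover = BYTES_PER_PARAM * PEAK_FLOPS / (2 * MEM_BW)
```

Under these assumptions the step time stays flat until the batch reaches the crossover point, so every extra request below that point is close to free in terms of latency per step.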

StochasticLi (replying to r0b05), 7 hours ago

I would like to know what inference speeds they are achieving exactly on what hardware. I skimmed and searched the article and didn't find that info.

geoffbp, 12 hours ago

Thanks, good read!