Inference is the process of using a trained AI model to generate predictions or outputs from new inputs. For LLMs, inference means sending a prompt to the model and receiving a generated response. Inference is distinct from training: training adjusts model weights using data, while inference uses fixed weights to produce outputs. Cost, latency (the time to produce a response), and throughput (the number of requests or tokens processed per unit of time) are major considerations when deploying AI systems. Optimization techniques include quantization, speculative decoding, batching, and KV-cache optimization, along with purpose-built inference hardware such as Groq LPUs. The term "inference-time compute" refers to spending more computation during inference, for example by letting a model generate longer reasoning before it answers, in order to improve output quality.
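The difference between training and inference can be illustrated with a deliberately tiny model. The sketch below is an assumption-laden toy, not how an LLM is built: it fits a one-variable linear model with gradient descent (training), then applies the resulting fixed weights to an input it has never seen (inference). The function names `train` and `predict` and the sample data are hypothetical and chosen only for illustration.

```python
# Toy model y = w * x + b, used only to contrast training with inference.
# All names and data here are illustrative assumptions.

def predict(weights, x):
    """Inference: apply fixed weights to a new input. The weights are not changed."""
    w, b = weights
    return w * x + b


def train(data, steps=1000, lr=0.05):
    """Training: repeatedly adjust the weights to reduce mean squared error."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        # Gradients of mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)


# Training data generated from y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

weights = train(data)                # weights change during training
prediction = predict(weights, 10.0)  # weights stay fixed during inference

print(f"learned weights: w={weights[0]:.3f}, b={weights[1]:.3f}")
print(f"prediction for x=10: {prediction:.3f}")
```

Training loops over the data many times and is the expensive step here, while each call to `predict` is a single cheap calculation. In an LLM the same split applies, but the weights number in the billions and one response requires many such forward passes, one per generated token.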
Frequently Asked Questions
What is inference in AI?
Inference is running a trained AI model to generate outputs from new inputs. When you send a message to ChatGPT or Claude, the model performs inference to generate a response.
What affects inference speed?
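Model size, hardware (GPU, TPU, or other accelerator), quantization, batching strategy, and context length all affect inference speed. Larger models and longer contexts require more computation and memory traffic per generated token, while quantization and batching reduce the cost per token. Specialized inference chips such as Groq LPUs are designed to generate tokens faster than general-purpose hardware.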
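To see how model size, quantization, and hardware interact, a common back-of-the-envelope estimate treats single-request token generation as limited by memory bandwidth: each new token requires reading roughly every weight once. The sketch below applies that rule of thumb. It is a simplification that ignores KV-cache reads, compute time, and batching, and the 7-billion-parameter model and 1 TB/s bandwidth are hypothetical figures, not measurements of any real system.

```python
def estimate_decode_tokens_per_second(num_params, bytes_per_param, bandwidth_bytes_per_s):
    """Rough upper bound on single-request decode speed.

    Assumes every weight is read from memory once per generated token and
    ignores KV-cache traffic, compute time, and batching.
    """
    model_bytes = num_params * bytes_per_param
    return bandwidth_bytes_per_s / model_bytes


NUM_PARAMS = 7e9   # hypothetical 7-billion-parameter model
BANDWIDTH = 1e12   # hypothetical accelerator with 1 TB/s memory bandwidth

# 16-bit weights use 2 bytes per parameter; 4-bit quantized weights use 0.5.
fp16_speed = estimate_decode_tokens_per_second(NUM_PARAMS, 2.0, BANDWIDTH)
int4_speed = estimate_decode_tokens_per_second(NUM_PARAMS, 0.5, BANDWIDTH)

print(f"16-bit weights: about {fp16_speed:.0f} tokens/s")
print(f"4-bit weights:  about {int4_speed:.0f} tokens/s")
```

Under these assumptions, shrinking the weights from 16 bits to 4 bits raises the estimated ceiling by a factor of four, and doubling memory bandwidth would double it. Real systems fall below this ceiling, and batching changes the picture because one pass over the weights can serve several requests at once.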