AI Glossary

What is Inference?

Inference is the process of using a trained AI model to generate predictions or outputs from new inputs. For large language models (LLMs), inference means sending a prompt to the model and receiving a generated response. Inference is distinct from training: training adjusts model weights using data, while inference uses fixed weights to produce outputs. Cost, latency (how long a response takes), and throughput (how many requests or tokens are served per unit of time) are major considerations for AI deployment. Optimization techniques include quantization, speculative decoding, batching, and KV-cache optimization, alongside purpose-built inference hardware such as Groq LPUs. The term "inference-time compute" refers to spending more computation while the model is producing a response, for example by letting it generate more intermediate reasoning steps before answering, rather than spending that computation during training.
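
The split between training and inference can be illustrated with a toy model. The sketch below is not how an LLM is implemented; it assumes a hypothetical two-parameter linear model (y = w * x + b), and the names `predict`, `train_step`, and `train` are made up for illustration. The point is only that training returns changed weights, while inference reads the weights and leaves them untouched.

```python
# Toy model: y = w * x + b. All names here are illustrative assumptions.

def predict(weights, x):
    """Inference: read the weights and compute an output; never modify them."""
    w, b = weights
    return w * x + b

def train_step(weights, x, y_true, lr=0.1):
    """Training: one gradient-descent step on squared error; returns NEW weights."""
    w, b = weights
    error = predict(weights, x) - y_true
    # Gradient of error**2 is 2 * error * x for w and 2 * error for b.
    return (w - lr * 2 * error * x, b - lr * 2 * error)

def train(weights, data, epochs=200):
    """Repeat training steps over the data; the weights change on every step."""
    for _ in range(epochs):
        for x, y_true in data:
            weights = train_step(weights, x, y_true)
    return weights

# Training phase: weights are adjusted using data sampled from y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
trained = train((0.0, 0.0), data)

# Inference phase: the trained weights are fixed and applied to a new input.
prediction = predict(trained, 10.0)
print(f"weights={trained}, prediction for x=10: {prediction:.3f}")
```

Calling `predict` any number of times gives the same answer for the same input, because nothing in the inference path updates `trained`.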
Frequently Asked Questions

What is inference in AI?

Inference is running a trained AI model to generate outputs from new inputs. When you send a message to ChatGPT or Claude, the model performs inference to generate a response.
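
For a language model, generating a response usually means producing one token at a time, with each new token appended to the input for the next step. The sketch below shows that loop under a strong simplifying assumption: the "model" is a hypothetical fixed lookup table (`NEXT_TOKEN_SCORES`) keyed on the previous token only, whereas a real LLM scores the next token with a neural network that looks at the whole context. The decoding rule used here is greedy selection (always take the highest-scoring token).

```python
# Hypothetical stand-in for trained weights: fixed next-token scores.
NEXT_TOKEN_SCORES = {
    "<start>": {"the": 0.9, "a": 0.1},
    "the": {"model": 0.7, "prompt": 0.3},
    "a": {"prompt": 1.0},
    "model": {"responds": 0.8, "the": 0.2},
    "prompt": {"<end>": 1.0},
    "responds": {"<end>": 1.0},
}

def generate(prompt_tokens, max_new_tokens=10):
    """Greedy autoregressive decoding over the fixed score table."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        last = tokens[-1] if tokens else "<start>"
        scores = NEXT_TOKEN_SCORES.get(last)
        if scores is None:
            break  # unknown token: this toy model has nothing to say
        next_token = max(scores, key=scores.get)  # greedy choice
        if next_token == "<end>":
            break  # the model signals that the response is complete
        tokens.append(next_token)
    return tokens

print(generate(["the"]))  # prints ['the', 'model', 'responds']
```

The table is never modified while generating, which mirrors the fixed weights used during inference.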

What affects inference speed?

Model size, hardware (such as GPUs or TPUs), quantization, batching strategy, and context length all affect inference speed. Specialized inference chips such as Groq LPUs are built to raise the rate at which tokens are generated.
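
Of the factors listed, quantization is the easiest to show in a few lines. The sketch below assumes one simple scheme, symmetric per-tensor int8 quantization, applied to a small list of made-up weights; production libraries use more elaborate variants, and the function names here are illustrative. Storing each weight as an 8-bit integer instead of a 16-bit or 32-bit float reduces the amount of data that has to be moved, which is one reason quantization tends to help inference speed, at the cost of a small rounding error.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.003, 1.0]  # made-up example values
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# The rounding error per weight is bounded by half of one quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, scale, max_error)
```

Small weights such as 0.003 can round to zero under this scheme, which is why finer-grained scales (per channel or per group) are often preferred in practice.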
