A large language model is a statistical next-token predictor. A large language model is trained on text. A large language model produces probabilities over tokens.

A token is a unit of text. A token can be a word or part of a word. A token has no meaning by itself.

A corpus is a collection of text. A corpus is chosen by people. A corpus determines what a model can learn.

Training is repeated error minimization. Training adjusts numeric parameters. Training stops when improvement slows.

A model is a mathematical function. A model maps inputs to outputs. A model does not understand meaning.

An embedding is a numeric vector. An embedding represents a token. An embedding is learned during training.

A weight is a numeric parameter. A weight scales an input. A weight is updated during training.

A bias is a numeric offset. A bias shifts a prediction. A bias helps fit the data.

A neural network is a collection of weighted sums. A neural network applies nonlinear functions. A neural network is trained using gradients.

Gradient descent is an optimization method. Gradient descent reduces prediction error. Gradient descent updates parameters.

Attention is a weighting mechanism. Attention emphasizes relevant tokens. Attention does not create understanding.

Inference is using a trained model. Inference does not change parameters. Inference produces probabilities.

A probability is a number between zero and one. A probability represents likelihood. Probabilities across all tokens sum to one.

A prompt is input text. A prompt conditions prediction. A prompt does not convey intent.

An output is a predicted token. An output is chosen from probabilities. An output is not a fact.

A language model does not reason. A language model does not know truth. A language model predicts patterns in text.
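The statement that an embedding is a numeric vector representing a token can be sketched as a lookup table. The vector values and token ids below are invented for illustration; in a real model they are learned during training.

```python
# An embedding table maps token ids to numeric vectors.
# These values are illustrative; real values come from training.
embedding_table = {
    0: [0.1, -0.3, 0.5],   # hypothetical vector for token id 0
    1: [0.7, 0.2, -0.1],   # hypothetical vector for token id 1
}

def embed(token_id):
    # Look up the learned vector for a token id.
    return embedding_table[token_id]

vec = embed(1)
```

The model never sees the token's characters directly; it sees only this vector.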
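A neural network built from weighted sums, biases, and nonlinear functions reduces, at its smallest scale, to a single unit like the one below. The input values, weights, and the choice of tanh as the nonlinearity are assumptions for illustration, not the text's specification.

```python
import math

def neuron(inputs, weights, bias):
    # Weighted sum: each weight scales an input.
    s = sum(w * x for w, x in zip(weights, inputs))
    # The bias shifts the sum before the nonlinearity.
    s += bias
    # tanh is one example of a nonlinear function.
    return math.tanh(s)

y = neuron([1.0, -2.0], [0.5, 0.25], 0.1)
```

Stacking many such units, layer after layer, is what makes the overall function expressive.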
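Gradient descent reducing prediction error by updating parameters can be shown with a one-parameter toy problem. The data point, learning rate, and squared-error loss are illustrative choices, not taken from the text.

```python
# Fit a single weight w so that w * x approximates y,
# by repeatedly stepping against the gradient of squared error.
x, y_target = 2.0, 6.0   # illustrative data; the exact answer is w = 3.0
w = 0.0                  # initial parameter value
lr = 0.05                # learning rate

for _ in range(200):
    pred = w * x
    grad = 2 * (pred - y_target) * x   # derivative of (w*x - y)^2 w.r.t. w
    w -= lr * grad                     # update the parameter
```

Each step makes the prediction error a little smaller; training a large model is this loop repeated over billions of parameters.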
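The claim that probabilities lie between zero and one and sum to one is enforced mechanically: a model's raw scores (logits) are passed through the softmax function. The scores and tiny vocabulary below are made up for illustration.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability, exponentiate, normalize.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative raw scores for a three-token vocabulary.
vocab = ["the", "cat", "sat"]
probs = softmax([2.0, 1.0, 0.1])
```

Whatever the input scores, the output is always a valid probability distribution.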
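Attention as a weighting mechanism can be sketched in a few lines: score each key against a query, turn the scores into weights that sum to one, and return the weighted sum of values. The toy vectors are invented, and scaling and multiple heads used in real models are omitted.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    # Score each key by its dot product with the query.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Softmax turns scores into weights; relevant keys get larger weights.
    weights = softmax(scores)
    # Output is the weighted sum of the value vectors.
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# The second key matches the query, so the second value dominates the output.
out = attention([1.0, 0.0],
                [[0.0, 1.0], [1.0, 0.0]],
                [[10.0, 0.0], [0.0, 10.0]])
```

Nothing here is understanding; it is arithmetic that emphasizes some tokens over others.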
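That an output is a token chosen from probabilities, not a fact, is visible in the sampling step itself. The vocabulary and distribution below are invented; the point is that the choice is a random draw.

```python
import random

def sample_token(vocab, probs, rng):
    # Draw one token according to the probability distribution.
    return rng.choices(vocab, weights=probs, k=1)[0]

# Illustrative distribution over a three-token vocabulary.
vocab = ["the", "cat", "sat"]
probs = [0.7, 0.2, 0.1]
rng = random.Random(0)   # seeded for reproducibility
token = sample_token(vocab, probs, rng)
```

A higher-probability token is merely more likely to be emitted; the model has no notion of whether the result is true.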