Do two variables move together? Covariance measures it: the average product of their deviations from their means. When both tend to be above (or both below) average at the same time, the products are positive and covariance is positive.
Positive covariance means they rise together. Negative means one rises as the other falls. Zero means no linear tendency either way, though a nonlinear relationship can still exist. But covariance is expressed in the product of the two variables' units, and its size depends on their scale, so it's hard to interpret on its own.
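To make the definition and the scale problem concrete, here is a minimal sketch in plain Python. The function name `covariance` and the study-hours data are made up for illustration, and the code divides by n (the "average product" described above) rather than by n − 1 as a sample estimator would.

```python
def covariance(xs, ys):
    """Population covariance: the average product of deviations from the means."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


# Hypothetical data: hours studied and exam score for five students.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
score = [52.0, 55.0, 61.0, 64.0, 68.0]

cov_hours = covariance(hours, score)

# Same relationship, but time measured in minutes instead of hours.
minutes = [h * 60 for h in hours]
cov_minutes = covariance(minutes, score)

print(cov_hours)    # 8.2
print(cov_minutes)  # 492.0
```

The relationship between study time and score has not changed, yet the covariance is 60 times larger after the unit change. That is the interpretability problem in practice.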
Divide covariance by both standard deviations and you get the correlation coefficient ρ, a clean number always between −1 and +1:
ρ = Cov(X, Y) / (σ_X · σ_Y)
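A small sketch of that normalization, again in plain Python. The helper name `correlation` and the data are illustrative assumptions, and population (divide-by-n) statistics are used throughout so the covariance and the standard deviations are consistent with each other.

```python
import math


def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / (std_x * std_y)


# Hypothetical data: hours studied and exam score for five students.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
score = [52.0, 55.0, 61.0, 64.0, 68.0]
minutes = [h * 60 for h in hours]

rho_hours = correlation(hours, score)
rho_minutes = correlation(minutes, score)

# A perfectly linear relationship sits at the edge of the range.
rho_perfect = correlation(hours, [2 * h + 1 for h in hours])
rho_inverse = correlation(hours, [-3 * h for h in hours])
```

Here `rho_hours` and `rho_minutes` agree up to floating-point rounding, because rescaling a variable rescales its standard deviation by the same factor. `rho_perfect` and `rho_inverse` come out at +1 and −1 respectively, again up to rounding.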
Where this lives in ML
The covariance matrix Σᵢⱼ = Cov(Xᵢ, Xⱼ) packages all pairwise covariances of a feature vector. PCA diagonalizes it to find the directions of greatest variance. Highly correlated input features cause multicollinearity and unstable weights, and the pattern of "what attends to what" in a transformer's attention map is, loosely, a learned correlation structure across tokens.
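As a rough illustration of the covariance matrix and the PCA idea, the sketch below builds Σ for a tiny two-feature dataset and finds only its leading direction. It uses power iteration as a simple stand-in for a full eigendecomposition, which a real PCA would normally get from a linear algebra library. The function names and the data are hypothetical.

```python
import math


def covariance_matrix(rows):
    """Population covariance matrix of samples given as rows of d feature values."""
    n = len(rows)
    d = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    return [
        [
            sum((row[i] - means[i]) * (row[j] - means[j]) for row in rows) / n
            for j in range(d)
        ]
        for i in range(d)
    ]


def leading_direction(matrix, iterations=200):
    """Power iteration: approximate the top eigenvalue and unit eigenvector.

    Assumes the start vector is not orthogonal to the leading eigenvector
    and that the top eigenvalue is strictly larger than the others.
    """
    d = len(matrix)
    v = [1.0 / (j + 1) for j in range(d)]  # deliberately asymmetric start
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            raise ValueError("matrix maps the start vector to zero")
        v = [x / norm for x in w]
    # Rayleigh quotient v^T M v gives the variance along direction v.
    mv = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(v[i] * mv[i] for i in range(d))
    return eigenvalue, v


# Hypothetical samples where the second feature is roughly twice the first.
samples = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.0]]

sigma = covariance_matrix(samples)
top_variance, direction = leading_direction(sigma)
total_variance = sum(sigma[i][i] for i in range(len(sigma)))
explained = top_variance / total_variance
```

Because the two features are strongly correlated, almost all of the total variance (the trace of Σ) lies along a single direction, pointing roughly along (1, 2). This is also why such features are nearly redundant as model inputs.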