Logit

One problem of terminology: logit

In statistics, the logit (/ˈloʊdʒɪt/ LOH-jit) function, or log-odds, is the logarithm of the odds p/(1 − p), where p is a probability. It maps probability values from $(0,1)$ to $(-\infty, +\infty)$, and it is the inverse of the standard logistic function (sigmoid).

In deep learning, the term logits layer is popularly used for the last layer of a neural network used for a classification task, which produces raw prediction values as real numbers in $(-\infty, +\infty)$.

In other words, when one talks about deep learning for classification tasks, the output values of the layer just before the final softmax activation are called logits.
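
As a concrete illustration, here is a minimal PyTorch sketch (the layer sizes and variable names are made up for this example) of a classifier whose final linear layer produces the logits, and where softmax converts them into probabilities:

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical 3-class classifier; the last Linear layer is the "logits layer".
model = nn.Sequential(
    nn.Linear(4, 16),
    nn.ReLU(),
    nn.Linear(16, 3),  # raw, unbounded real-valued outputs (logits)
)

x = torch.randn(2, 4)               # a batch of 2 made-up inputs
logits = model(x)                   # shape (2, 3), values in (-inf, +inf)
probs = F.softmax(logits, dim=-1)   # shape (2, 3), each row sums to 1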


History of this term:

There have been several efforts to adapt linear regression methods to domains where the output is a probability in $[0,1]$ rather than an arbitrary real number in $(-\infty, +\infty)$. Many of these efforts modeled the problem by somehow mapping the interval $(0,1)$ to $(-\infty, +\infty)$ and then running linear regression on the transformed values. In 1934, Chester Ittner Bliss used the cumulative normal distribution function to perform this mapping and called his model probit, an abbreviation for “probability unit”. However, this mapping is computationally more expensive to work with. In 1944, Joseph Berkson used the log of the odds instead and called this function logit, an abbreviation for “logistic unit”, by analogy with probit. Log odds had already been used extensively by Charles Sanders Peirce in the late 19th century. In 1949, G. A. Barnard coined the commonly used term log-odds; the log-odds of an event is the logit of the probability of the event.

So “logit” actually means “logistic unit”.


Here is a piece of code from https://github.com/pytorch/pytorch/blob/master/torch/distributions/utils.py (as of 2019-06-05):

import torch
import torch.nn.functional as F


def logits_to_probs(logits, is_binary=False):
    r"""
    Converts a tensor of logits into probabilities. Note that for the
    binary case, each value denotes log odds, whereas for the
    multi-dimensional case, the values along the last dimension denote
    the log probabilities (possibly unnormalized) of the events.
    """
    if is_binary:
        return torch.sigmoid(logits)
    return F.softmax(logits, dim=-1)
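
Assuming this module can be imported as torch.distributions.utils (the tensor values below are made up), a quick usage sketch:

import torch
from torch.distributions.utils import logits_to_probs

# Multi-class case: values are unnormalized log-probabilities; softmax normalizes them.
logits = torch.tensor([2.0, 0.5, -1.0])
print(logits_to_probs(logits))                # roughly tensor([0.786, 0.175, 0.039])

# Binary case: each value is a log-odds; sigmoid maps it back to a probability.
print(logits_to_probs(torch.tensor([0.0, 2.0]), is_binary=True))  # tensor([0.5000, 0.8808])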

Logits is an overloaded term that can mean different things in different contexts.

In mathematics, the logit is the log-odds; it is the inverse of the (standard) logistic function.

\[l = \log \frac{p}{1-p}\] \[p = \frac{1}{1+ e^{-l}}\]

The odds of an event are the ratio of the probability that the event happens to the probability that it does not happen.
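
As a quick sanity check, here is a small Python sketch (plain math module only) showing that the logit and the logistic function undo each other:

import math

def logit(p):
    # log-odds: maps a probability in (0, 1) to a real number
    return math.log(p / (1 - p))

def sigmoid(l):
    # standard logistic function: maps a real number back to (0, 1)
    return 1.0 / (1.0 + math.exp(-l))

p = 0.8
l = logit(p)        # log(0.8 / 0.2) = log(4) ≈ 1.386
print(sigmoid(l))   # 0.8 (up to floating-point error)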

In neural networks, it is the vector of raw (non-normalized) predictions. In the context of deep learning, the logits layer means the layer that feeds into the softmax.

Unfortunately, the term logits is abused in deep learning. From a purely mathematical perspective, the logit is a function that performs the log-odds mapping. In deep learning, people started calling the layer that feeds into the softmax the “logits layer”, since its raw outputs play the role of log-odds (applying the sigmoid, or softmax in the multi-class case, turns them back into probabilities). Then people started calling the output values of this layer “logits”, which creates the confusion around this term.

For details, see:

https://stackoverflow.com/questions/41455101/what-is-the-meaning-of-the-word-logits-in-tensorflow