Sun Haozhe's Blog
Entropy:
\[H(X) = - \int_\mathcal{X} P(x) \log P(x) dx\]
Cross entropy:
In information theory, the cross entropy between two probability distributions $P$ and $Q$ over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set if a coding scheme used for the set is optimized for an estimated probability distribution $Q$, rather than the true $P$.
\[H(P, Q) = - \int_\mathcal{X} P(x) \log Q(x) dx\]
The definition may be formulated using the Kullback–Leibler divergence $D_{\text{KL}}(P || Q)$ of $Q$ from $P$ (also known as the relative entropy of $P$ with respect to $Q$):
\[H(P, Q) = H(P) + D_{\text{KL}}(P || Q)\]
Kullback–Leibler divergence (also known as the relative entropy of $P$ with respect to $Q$):
\[D_{\text{KL}}(P || Q) = \int_{\mathcal{X}} P(x) \log \frac{P(x)}{Q(x)} dx\]
$D_{\text{KL}}$ achieves its minimum value of zero when $P(x) = Q(x)$ everywhere.
It is clear from the formula that the KL divergence is asymmetric. In cases where $P(x)$ is close to zero but $Q(x)$ is significantly non-zero, $Q$'s effect is disregarded. This can give misleading results when we just want to measure the similarity between two equally important distributions.
However, if we are trying to approximate a complex (intractable) distribution $Q(x)$ by a (tractable) distribution $P(x)$, we want to be absolutely sure that any $x$ that would be very improbable under $Q(x)$ is also very improbable under $P(x)$. When $P(x)$ is small but $Q(x)$ is not, that is fine. But when $Q(x)$ is small, the integrand $P(x) \log \frac{P(x)}{Q(x)}$ grows very rapidly unless $P(x)$ is also small. So if we look for $P(x)$ by minimizing the KL divergence $D_{\text{KL}}(P || Q)$, it is very unlikely that $P(x)$ will assign a lot of mass to regions where $Q(x)$ is near zero.
The Jensen–Shannon divergence is another measure of similarity between two probability distributions. The JS divergence is symmetric and smoother. It is defined by:
\[\text{JSD}(P || Q) = \frac{1}{2}D_{\text{KL}}(P || \frac{P+Q}{2}) + \frac{1}{2}D_{\text{KL}} (Q || \frac{P+Q}{2})\]
Some believe (Huszar, 2015) that one reason behind GANs' big success is switching the loss function from the asymmetric KL divergence in the traditional maximum-likelihood approach to the symmetric JS divergence.
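As a small illustration of the asymmetry of KL and the symmetry of JS on discrete distributions, here is a minimal NumPy sketch (the distributions `p` and `q` are arbitrary examples, not from the original post):

```python
import numpy as np

def kl_divergence(p, q):
    # D_KL(p || q) for discrete distributions, assuming q > 0 wherever p > 0
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js_divergence(p, q):
    m = 0.5 * (np.asarray(p, dtype=float) + np.asarray(q, dtype=float))
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = np.array([0.1, 0.2, 0.7])
q = np.array([0.3, 0.3, 0.4])

print(kl_divergence(p, q), kl_divergence(q, p))  # asymmetric: the two values differ
print(js_divergence(p, q), js_divergence(q, p))  # symmetric: the two values are equal
```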
Arithmetic between a PyTorch tensor and a NumPy array (without explicit casting) is not allowed.
For the following experiments:
PyTorch version: 1.2.0.dev20190611
Numpy version: 1.16.4
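As a minimal sketch (not the original experiment code), explicit casting between the two types can be done in either direction; the behavior of un-casted mixed arithmetic has changed across releases, so only the explicit conversions are shown:

```python
import numpy as np
import torch

t = torch.ones(3)                        # torch.FloatTensor (float32)
a = np.ones(3)                           # numpy array (float64)

# Explicit casting in both directions:
print(t + torch.from_numpy(a).float())   # cast the numpy array to a float32 tensor first
print(t.numpy() + a)                     # or cast the tensor to a numpy array first
```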
According to Can't call numpy() on Variable that requires grad: "Moving to numpy will break the computation graph and so no gradient will be computed. If you don't actually need gradients, then you can explicitly `.detach()` the Tensor that requires grad to get a tensor with the same content that does not require grad. This other Tensor can then be converted to a numpy array."
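A minimal sketch of that conversion (assuming a CPU tensor; add `.cpu()` before `.numpy()` for a CUDA tensor):

```python
import torch

x = torch.ones(3, requires_grad=True)
y = (x * 2).detach()   # same values, but detached from the computation graph
a = y.numpy()          # now the conversion to a numpy array is allowed
print(a)
```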
`size_t` is defined by the standard C library; it is `unsigned int`, and on 64-bit systems it is `long unsigned int`.
`size_t` is an unsigned integer type: it never takes negative values (array indices never need to be negative), and its positive range is twice that of the corresponding signed integer type.
Uses of typeid in C++:
In C++, `*` and `&` each play two roles: pointer declaration `*`, reference declaration `&`, dereference `*`, and address-of `&`.

`&` can be used to take an address: `&a` returns the address of `a`, where `a` is an object. `&` maps an object to its address. In this case, `&` can be called the address-of operator.

`*` can be used as the dereference operator: `*p` returns the object that `p` points to, where `p` is an address. `*` maps an address to an object. In this case, `*` can be called the indirection operator. The indirection operator returns the value of the variable located at the address specified by its operand.

The indirection operator `*` is the complement of the address-of operator `&`.
Experiments using Python 3.7.3
From https://stackoverflow.com/questions/29854398/seeding-random-number-generators-in-parallel-programs
If no seed is provided explicitly, `numpy.random` will seed itself using an OS-dependent source of randomness. Usually it will use `/dev/urandom` on Unix-based systems (or some Windows equivalent), but if this is not available for some reason then it will seed itself from the wall clock. Since self-seeding occurs at the time when a new subprocess forks, it is possible for multiple subprocesses to inherit the same seed if they forked at the same time, leading to identical random variates being produced by different subprocesses.
The following text is adapted from [Python, NumPy, Pytorch中的多进程中每个进程的随机化种子误区](https://blog.csdn.net/xiaojiajia007/article/details/90207113) with some modifications.
Python's built-in `random` module ends up with different seeds in different subprocesses, whereas with `numpy.random` each subprocess forks (inherits) the same seed from the main process. In PyTorch, the `DataLoader` class's `__getitem__()` runs with a different torch seed in each subprocess, and that seed depends on the worker id (see the `worker_init_fn` argument). These three RNGs do not affect one another and must be handled independently. Therefore, when writing your own data-preparation code, if you use NumPy's random utilities, be sure to explicitly re-seed them in each subprocess, or generate randomness with Python's `random` module instead.
Experiments were run on Linux-4.9.125-linuxkit-x86_64-with-Ubuntu-18.04-bionic (in fact, inside a Docker container) with Python 3.6.8; the system had 4 physical cores with 4 hyperthreads, thus 8 logical cores.
Using the `numpy.random` module, without seeding. Identical random sequences across subprocesses; the experiment is not reproducible:
Using the `numpy.random` module, seeding with no arguments. Different random sequences across subprocesses; the experiment is not reproducible:
Using the `numpy.random` module, seeding with `None`. Different random sequences across subprocesses; the experiment is not reproducible:
Using the `numpy.random.RandomState` class, seeding with no arguments. Different random sequences across subprocesses; the experiment is not reproducible:
Using the `numpy.random.RandomState` class, seeding with `None`. Different random sequences across subprocesses; the experiment is not reproducible:
Calling `np.random.seed()` within a subprocess forces the thread-local RNG (Random Number Generator) instance to seed itself again from `/dev/urandom` or the wall clock, which will (probably) prevent you from seeing identical output from multiple subprocesses. Best practice is to explicitly pass a different seed (or `numpy.random.RandomState` instance) to each subprocess.
Using the `numpy.random.RandomState` class, seeding with different seeds explicitly passed to the subprocesses. Different random sequences across subprocesses; the experiment is reproducible:
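A minimal sketch of this pattern (the worker function and seed values are illustrative, not the original experiment code):

```python
import multiprocessing as mp
import numpy as np

def worker(seed):
    rng = np.random.RandomState(seed)   # one independent, explicitly seeded RNG per subprocess
    print(seed, rng.randint(0, 100, size=3))

if __name__ == "__main__":
    seeds = [10, 11, 12, 13]             # a different, fixed seed for each subprocess
    with mp.Pool(processes=4) as pool:
        pool.map(worker, seeds)
```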
Using Python's default `random` module, without seeding. Different random sequences across subprocesses; the experiment is not reproducible:
Using Python's default `random` module, seeding with no arguments. Different random sequences across subprocesses; the experiment is not reproducible:
Solutions for getting truly random samples from a multi-worker `DataLoader` in PyTorch:
https://discuss.pytorch.org/t/does-getitem-of-dataloader-reset-random-seed/8097/7. Besides the option of solving it with Python's `random` module, you can instead add this line to the top of your main script (and you need to use Python 3):
Solution for getting truly random samples from a multi-worker `DataLoader` in PyTorch (updated 2022-07-16):

This bug only shows up when both of the following conditions are met:

- PyTorch < 1.9. Generating random numbers with `torch` or with the `random` library is fine; only `numpy` is problematic. For PyTorch >= 1.9, after the official fix, nobody is affected anymore.
- The `__getitem__` method uses NumPy's random numbers.

The `DataLoader` constructor has an optional argument `worker_init_fn`. Before loading data, each worker subprocess calls this function first, so we can set NumPy's seed inside `worker_init_fn`. Another point to note: by default, each worker subprocess is killed at the end of an epoch and all of its resources are lost. When a new epoch starts, the random state of the main process has not changed and is used to initialize the workers again, so the workers' random seeds are exactly the same as in the previous epoch. We therefore need a seed that changes with the epoch number, which is hard to achieve in practice, because inside `worker_init_fn` there is no way to know which epoch we are in. Fortunately, `torch.initial_seed()` meets this need. This is in fact also the officially recommended practice: https://pytorch.org/docs/stable/notes/randomness.html#dataloader

Why does `torch.initial_seed()` work? In a worker, `torch.initial_seed()` returns torch's current random seed, namely `base_seed + worker_id`. Because the main process regenerates a `base_seed` at the beginning of every epoch, `base_seed` is a random number that changes from epoch to epoch. In addition, `torch.initial_seed()` returns a `long int`, whereas NumPy only accepts a `uint` in `[0, 2**32 - 1]`, so we need to take it modulo `2**32`. If you generate random numbers with `torch` or `random` instead of `numpy`, you do not need to worry about this problem at all, because PyTorch already sets the `torch` and `random` seeds of each worker to `base_seed + worker_id`.
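A minimal sketch of this recommendation (the dataset is a dummy placeholder; the key part is the `worker_init_fn`):

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class RandomDataset(Dataset):
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        # NumPy randomness inside __getitem__: exactly the problematic case described above
        return np.random.randint(0, 1000)

def seed_worker(worker_id):
    # torch.initial_seed() == base_seed + worker_id, and base_seed changes every epoch
    worker_seed = torch.initial_seed() % 2**32   # NumPy only accepts seeds in [0, 2**32 - 1]
    np.random.seed(worker_seed)

if __name__ == "__main__":
    loader = DataLoader(RandomDataset(), batch_size=4, num_workers=2,
                        worker_init_fn=seed_worker)
    for epoch in range(2):
        for batch in loader:
            print(epoch, batch)
```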
Selected comments under the Zhihu (知乎) article 可能95%的人还在犯的PyTorch错误: one commenter notes that nowadays you can simply use `np.random.default_rng` to obtain random numbers.

Backward compatibility, sometimes also called downward compatibility (向下兼容), is a property of a system, product, or technology that allows for interoperability with an older legacy system, or with input designed for such a system. Modifying a system in a way that does not allow backward compatibility is sometimes called "breaking" backward compatibility.
Forward compatibility or upward compatibility is a design characteristic that allows a system to accept input intended for a later version of itself.
This experiment was run on Linux-4.9.125-linuxkit-x86_64-with-Ubuntu-18.04-bionic (in fact, inside a Docker container) with Python 3.6.8; the system had 4 physical cores with 4 hyperthreads, thus 8 logical cores.
An incorrect way to do it:
The output was
By doing so, only 1 of the 8 cores was used at 100%, whereas the other 7 cores were almost at 0% (checked with the Linux command `top`). At any given time, only 100% (instead of 800%) of CPU was used, even though this 100% of CPU load could move from one core to another each time a new process started.
The correct way to do it:
The output of the correct way was:
By using the correct way, all 8 cores were used at 100% (checked with the Linux command `top`).
The difference is the following:
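One common way to end up with the serialized behavior described above is to join each process right after starting it. The following sketch (an illustrative guess using the standard multiprocessing module, not the original code) contrasts the two patterns:

```python
import multiprocessing as mp

def burn_cpu(n):
    # a pure-Python loop that keeps one core busy
    s = 0
    for i in range(n):
        s += i * i
    return s

if __name__ == "__main__":
    # Incorrect pattern: each process is joined immediately after being started,
    # so only one process runs at a time (one core at 100%, the others idle).
    for _ in range(8):
        p = mp.Process(target=burn_cpu, args=(10_000_000,))
        p.start()
        p.join()

    # Correct pattern: start all processes first, then join them,
    # so the 8 processes run in parallel (all cores at 100%).
    procs = [mp.Process(target=burn_cpu, args=(10_000_000,)) for _ in range(8)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```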
- `a[start:stop]`: items start through stop-1
- `a[start:]`: items start through the rest of the array
- `a[:stop]`: items from the beginning through stop-1
- `a[:]`: a copy of the whole array
- `a[-1]`: last item in the array
- `a[-2:]`: last two items in the array
- `a[:-2]`: everything except the last two items
- `a[::-1]`: all items in the array, reversed
- `a[1::-1]`: the first two items, reversed
- `a[-3::-1]`: everything except the last two items, reversed
- `a[:-3:-1]`: the last two items, reversed
- `a[::2]`: elements of the list at even positions
- `a[1::2]`: elements of the list at odd positions
- `a[::3]`: every third element, starting from position 0
- `a[1::3]`: every third element, starting from position 1

If we adopt the notation `[start:stop:step]`, `start` is always inclusive and `stop` is always exclusive.
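A quick sketch checking a few of these rules (the list `a` is an arbitrary example):

```python
a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

print(a[2:5])     # [2, 3, 4]                    items 2 through 4
print(a[-2:])     # [8, 9]                       last two items
print(a[::-1])    # [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]  whole list reversed
print(a[:-3:-1])  # [9, 8]                       last two items, reversed
print(a[1::2])    # [1, 3, 5, 7, 9]              elements at odd positions
```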
One can substitute `None` for any of the empty spaces. For example, `a[None:None]` makes a whole copy. This is useful when you need to specify the end of the range using a variable and need to include the last item.
Slicing built-in types returns a copy, but that's not universal. Notably, [slicing NumPy arrays](https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html) returns a view that shares memory with the original.
| English | French | Chinese | Scientific notation |
|---|---|---|---|
| million | million | 百万 | $10^6$ |
| billion | milliard | 十亿 | $10^9$ |
| trillion | billion | 万亿 (兆) | $10^{12}$ |
| quadrillion | - | 千万亿 (千兆) | $10^{15}$ |
| quintillion | trillion | 百亿亿 (百京) | $10^{18}$ |
For English and French:
The long scale (长级差制) and the short scale (短级差制) are two of several large-number naming systems for integer powers of ten that use the same words with different meanings. The long scale is based on powers of one million, whereas the short scale is based on powers of one thousand.
Most English-language countries and regions use the short scale with $10^9$ being billion.
The traditional long scale is used by most Continental European countries and by most other countries whose languages derive from Continental Europe (with the notable exceptions of Albania, Greece, Romania, and Brazil). These countries use a word similar to billion to mean $10^{12}$. Some use a word similar to milliard to mean $10^9$, while others use a word or phrase equivalent to thousand millions.
For Chinese:
The names for large numbers above 亿 come from the Sunzi Suanjing (《孙子算经》). The Sunzi Suanjing advances these large-number names by factors of 万万 (i.e. $10^8$), but later generations also used other progressions.
| Chinese character | Pinyin | Value | Notes |
|---|---|---|---|
| 千 | qiān | $10^3$ | |
| 万 | wàn | $10^4$ | |
| 亿 | yì | $10^8$ | In ancient times, 亿 could also stand for $10^5$ (see the various large-number systems). Also written 万万, as in the Treaty of Shimonoseki (马关条约): "平银贰万万两交与日本,作为赔偿军费" (two hundred million taels of silver paid to Japan as war indemnity). |
| 兆 | zhào | $10^{12}$ | In ancient times, 兆 could also stand for $10^6$ or $10^{16}$. Because 兆 can also mean "million" (mega), its usage is disputed; see the SI prefixes (国际单位制词头). |
| 京 | jīng | $10^{16}$ | In ancient times, 京 could also stand for $10^7$, $10^{24}$, or $10^{32}$. Also written 经. |
| 垓 | gāi | $10^{20}$ | In ancient times, 垓 could also stand for $10^8$, $10^{32}$, or $10^{64}$. |
| 秭 | zǐ | $10^{24}$ | In ancient times, 秭 could also stand for $10^9$, $10^{40}$, or $10^{128}$. Also written 杼. The character corresponding to the SI prefix (yotta) is actually 尧. |
| 穰 | ráng | $10^{28}$ | In ancient times, 穰 could also stand for $10^{10}$, $10^{48}$, or $10^{256}$. Also written 壤. |
| 沟 | gōu | $10^{32}$ | In ancient times, 沟 could also stand for $10^{11}$, $10^{56}$, or $10^{512}$. |
| 涧 | jiàn | $10^{36}$ | In ancient times, 涧 could also stand for $10^{12}$, $10^{64}$, or $10^{1024}$. |
| 正 | zhèng | $10^{40}$ | In ancient times, 正 could also stand for $10^{13}$, $10^{72}$, or $10^{2048}$. |
| 载 | zài | $10^{44}$ | In ancient times, 载 could also stand for $10^{14}$, $10^{80}$, or $10^{4096}$. |
兆 is a Chinese numeral. Depending on the system, it can represent one million ($10^6$), one trillion ($10^{12}$), or $10^{16}$. In Taiwan, Japan, and Korea, 兆 commonly stands for $10^{12}$. In mainland China, however, the meaning of 兆 often depends on context: as a unit in computing contexts such as network traffic or binary data sizes, 兆 is usually used for mega, i.e. one million, as in 兆字节 (megabyte, MB) or 兆字节每秒 (MB/s); when counting quantities, $10^{12}$ is usually called 万亿 instead, as in "中国电子讯息产业总收入达人民币5.6兆元" (i.e. 5.6 万亿 yuan, $5.6 \times 10^{12}$ yuan).
In probability theory, the multinomial distribution is a generalization of the binomial distribution. For example, it models the probability of counts for rolling a k-sided die n times. For n independent trials each of which leads to a success for exactly one of k categories, with each category having a given fixed success probability, the multinomial distribution gives the probability of any particular combination of numbers of successes for the various categories.
The Bernoulli distribution models the outcome of a single Bernoulli trial. In other words, it models whether flipping a (possibly biased) coin one time will result in either a success (obtaining a head) or failure (obtaining a tail). The binomial distribution generalizes this to the number of heads from performing n independent flips (Bernoulli trials) of the same coin. The multinomial distribution models the outcome of n experiments, where the outcome of each trial has a categorical distribution, such as rolling a k-sided die n times.
In probability theory and statistics, a categorical distribution (also called a generalized Bernoulli distribution, multinoulli distribution[1]) is a discrete probability distribution that describes the possible results of a random variable that can take on one of K possible categories, with the probability of each category separately specified.
The parameters specifying the probabilities of each possible outcome are constrained only by the fact that each must be in the range 0 to 1, and all must sum to 1.
The categorical distribution is the generalization of the Bernoulli distribution for a categorical random variable, i.e. for a discrete variable with more than two possible outcomes, such as the roll of a die. On the other hand, the categorical distribution is a special case of the multinomial distribution, in that it gives the probabilities of potential outcomes of a single drawing rather than multiple drawings.
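A small NumPy sketch of how these distributions relate (the parameters are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(0)
p_heads = 0.3                      # biased coin
die_probs = [1/6] * 6              # fair 6-sided die

# Bernoulli: one coin flip (binomial with n=1)
print(rng.binomial(n=1, p=p_heads))

# Binomial: number of heads in n=10 flips of the same coin
print(rng.binomial(n=10, p=p_heads))

# Categorical ("multinoulli"): one roll of a k-sided die (multinomial with n=1)
print(rng.multinomial(n=1, pvals=die_probs))

# Multinomial: counts per face over n=10 rolls of the die
print(rng.multinomial(n=10, pvals=die_probs))
```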
One problem of terminology: logit
In statistics, the logit (/ˈloʊdʒɪt/ LOH-jit) function or the log-odds is the logarithm of the odds p/(1 − p) where p is the probability. It is a type of function that creates a map of probability values from $[0,1]$ to $(-\infty, +\infty)$. It is the inverse of the standard logistic function (sigmoid).
In deep learning, the term logits layer is popularly used for the last neuron layer of a neural network used for a classification task, which produces raw prediction values as real numbers ranging over $(-\infty, +\infty)$.
When one talks about deep learning for classification tasks, the output values of the next to last layer (the layer before the final softmax activation) are called logits.
History of this term:
There have been several efforts to adapt linear regression methods to domains where the output is a probability value in $[0,1]$ instead of an arbitrary real number in $(-\infty, +\infty)$. Many of these efforts focused on mapping the range $[0,1]$ to $(-\infty, +\infty)$ and then running linear regression on the transformed values. In 1934, Chester Ittner Bliss used the cumulative normal distribution function to perform this mapping and called his model probit, an abbreviation for "probability unit". However, this is computationally more expensive. In 1944, Joseph Berkson used the log of odds and called this function logit, an abbreviation for "logistic unit", following the analogy with probit. Log-odds had been used extensively by Charles Sanders Peirce (late 19th century). G. A. Barnard coined the commonly used term log-odds in 1949; the log-odds of an event is the logit of the probability of the event.
So “logit” actually means “logistic unit”.
Here is one piece of code from https://github.com/pytorch/pytorch/blob/master/torch/distributions/utils.py (2019-06-05)
Logits is an overloaded term which can mean many different things.
In mathematics, logit is log-odds, it is the inverse function of (standard) logistic function.
\[l = \log \frac{p}{1-p}\]
\[p = \frac{1}{1+ e^{-l}}\]
The odds of an event is the ratio of the probability that the event occurs to the probability that it does not occur.
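A quick numerical check that the logit and the standard logistic (sigmoid) function are inverses of each other (a minimal sketch):

```python
import numpy as np

def logit(p):
    # log-odds: maps probabilities in (0, 1) to real numbers
    return np.log(p / (1 - p))

def sigmoid(l):
    # standard logistic function: maps real numbers back to (0, 1)
    return 1 / (1 + np.exp(-l))

p = np.array([0.1, 0.5, 0.9])
print(logit(p))            # approximately [-2.197, 0., 2.197]
print(sigmoid(logit(p)))   # recovers [0.1, 0.5, 0.9]
```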
In neural networks, it is the vector of raw (non-normalized) predictions. In context of deep learning the logits layer means the layer that feeds in to softmax.
Unfortunately, the term logits is abused in deep learning. From a purely mathematical perspective, logit is a function that performs the log-odds mapping. In deep learning, people started calling the layer that feeds into the softmax (the inverse of the logit mapping) the "logits layer". Then people started calling the output values of this layer "logits", which creates the confusion around this term.
For details, see:
https://stackoverflow.com/questions/41455101/what-is-the-meaning-of-the-word-logits-in-tensorflow
| English | French | Chinese | Property |
|---|---|---|---|
| queue | file | 队列 | First In First Out |
| stack | pile | 栈, 堆栈 | Last In First Out |
| heap | tas | 堆 | Tree-based data structure. In a heap, the highest (or lowest) priority element is always stored at the root. |
In Python 3 & PyTorch 1.0.0, `torch.LongTensor` and `torch.cuda.LongTensor` mean `int64`.
- `HalfTensor`: `float16`
- `FloatTensor`: `float32`
- `DoubleTensor`: `float64`
- `ByteTensor`: `uint8` (unsigned)
- `CharTensor`: `int8` (signed)
- `ShortTensor`: `int16` (signed)
- `IntTensor`: `int32` (signed)
- `LongTensor`: `int64` (signed)
One example of conversion from `LongTensor` to `FloatTensor`:
Attention:
`b.type()` equals `LongTensor`. The implicit type casting did not work because `type(a)` is `torch.Tensor` rather than a raw Python number or a NumPy array.
The solution is as follows:
Here, `b.type()` equals `FloatTensor`.
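A minimal sketch of such an explicit conversion (not the original snippet; the exact implicit-casting behavior depends on the PyTorch version):

```python
import torch

a = torch.tensor([1, 2, 3])    # torch.LongTensor (int64)
b = a * 2                      # still a LongTensor: no implicit cast to float here

c = a.float()                  # explicit conversion to FloatTensor (float32)
# equivalently: c = a.type(torch.FloatTensor)
print(b.type(), c.type())      # torch.LongTensor  torch.FloatTensor
```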
Don't forget to reset the indices of a pandas DataFrame after slicing operations. Otherwise, there might be key errors later.
Let's say `Xtr` is a pandas DataFrame with 6000 rows × 2 columns.
The correct way is the following:
The reason is that `.loc` is a label-based indexer: `0` is not interpreted as position `0` but as the label `0`.
However, after slicing, the index of each row remains the same.
When using `.loc` with slices, both the start point and the end point are included.
When using `.iloc` with slices, the start point is included while the end point is excluded (like the standard Python slicing convention).
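A minimal sketch of the pitfall and the fix (the DataFrame here is a small dummy, not the original `Xtr`):

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30, 40], "b": [1, 2, 3, 4]})

sliced = df[df["a"] > 20]               # keeps rows with labels 2 and 3; the old index remains
# sliced.loc[0] would raise a KeyError: there is no label 0 anymore

fixed = sliced.reset_index(drop=True)   # indices become 0, 1 again
print(fixed.loc[0])                     # works: label 0 now exists

print(df.loc[1:2])    # .loc slice: labels 1 and 2, end point included
print(df.iloc[1:2])   # .iloc slice: position 1 only, end point excluded
```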
In Python, a `list` can contain objects of different types. For example:
Note, however, that when we convert a `list` to a `numpy array` via `np.array()`, `np.zeros_like()`, etc., if the `list` contains only `int` values, the resulting `numpy array` will have `dtype` `int64`, whereas if the `list` contains at least one `float`, the resulting `numpy array` will have `dtype` `float64`.
For example:
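A minimal sketch of this behavior (the dtypes shown are those of a 64-bit platform):

```python
import numpy as np

ints_only = [1, 2, 3]
with_float = [1, 2, 3.0]

print(np.array(ints_only).dtype)        # int64
print(np.array(with_float).dtype)       # float64
print(np.zeros_like(ints_only).dtype)   # int64
print(np.zeros_like(with_float).dtype)  # float64
```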
I discovered this issue through the following piece of code: