Sun Haozhe's Blog


ssh memo

Posted on 2021-04-11 | In Misc

Basics

https://www.freecodecamp.org/news/how-to-create-and-connect-to-google-cloud-virtual-machine-with-ssh-81a68b8f74dd/

Mac and Linux support SSH natively. You just need to generate an SSH key pair (public key/private key) to connect securely to the virtual machine.

The private key is equivalent to a password. Thus, it is kept private, residing on your computer, and should not be shared with any entity. The public key is shared with the computer or server to which you want to establish the connection.

To generate the SSH key pair to connect securely to the virtual machine:

cd path/to/your/directory/which/stores/the/key/pair
# username: preferably your Gmail prefix (the part before @);
# this may be required for GCP
ssh-keygen -t rsa -C [username]

It will start the key generation process. You will be prompted to choose the file name (location, path) to store the SSH key pair. For example, if you enter gcp_xxx, the generated private key will be called gcp_xxx and the generated public key gcp_xxx.pub (both under path/to/your/directory/which/stores/the/key/pair). The default path of the generated private key is ~/.ssh/id_rsa on macOS. Then you will be prompted to choose a passphrase protecting the private key. If you do not want to use a passphrase, just press ENTER.

To visualize the public key:

cat gcp_xxx.pub

# if the public key is in the default location 
cat ~/.ssh/id_rsa.pub

Google Cloud Platform

  • https://cloud.google.com/compute/docs/instances/adding-removing-ssh-keys#createsshkeys
  • https://cloud.google.com/compute/docs/instances/connecting-advanced#provide-key

Copy the returned string from the terminal and paste it into the SSH key field of the remote server (e.g. Google Cloud).

Important: you must ensure that (instance-level) OS Login is disabled! Otherwise you will get a Permission denied (publickey) error when trying to connect via ssh.

  • Enabling OS Login on instances disables metadata-based SSH key configurations on those instances; disabling OS Login restores SSH keys that you have configured in project or instance metadata. In the Google Cloud Console, go to the VM instances page, click the name of the instance, click Edit on the instance details page, and under Custom metadata set enable-oslogin to false (manually type false), adding the metadata entry if it does not already exist.

You can use the External IP of the virtual machine you just created to connect to Google Cloud virtual machines using SSH.

# here, use 192.168.1.123 as an 
# example of the external IP 

ssh -i [private_key] [username]@192.168.1.123

To exit the virtual machine, just type exit.

If you want to use Docker on Google Cloud Platform with CUDA (NVIDIA GPU), you need to install the NVIDIA GPU device drivers on the VM instance yourself, see https://cloud.google.com/container-optimized-os/docs/how-to/run-gpus#install. To summarize:

  • Create a Container-Optimized OS VM instance with one or more GPUs
  • ssh to the VM instance itself (do not docker exec into the docker container)
  • Run sudo cos-extensions install gpu
  • In the above step, you may get the following error: The container name "/cos-gpu-installer" is already in use by container "xxx". You have to remove (or rename) that container to be able to reuse that name. If so, just docker rm -f that container and redo the above step.
  • In order to verify the installation, run the following commands:
  • sudo mount --bind /var/lib/nvidia /var/lib/nvidia
  • sudo mount -o remount,exec /var/lib/nvidia
  • /var/lib/nvidia/bin/nvidia-smi (running plain nvidia-smi will give command not found)
  • After the GPU drivers are installed, you can configure Docker containers to consume GPUs. The following example shows you how to run a simple CUDA application in a Docker container that consumes /dev/nvidia0:
  • docker run -dit --ipc=host --volume /var/lib/nvidia/lib64:/usr/local/nvidia/lib64 --volume /var/lib/nvidia/bin:/usr/local/nvidia/bin --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidiactl:/dev/nvidiactl --name my_env -v ~/xxx/yyy:/zzz [imageName]

Short version (you need to reinstall the CUDA driver and manually docker run a container each time you restart a stopped VM instance that uses Docker):

  • sudo cos-extensions install gpu
  • sudo mount --bind /var/lib/nvidia /var/lib/nvidia
  • sudo mount -o remount,exec /var/lib/nvidia
  • /var/lib/nvidia/bin/nvidia-smi
  • docker run -dit --ipc=host --volume /var/lib/nvidia/lib64:/usr/local/nvidia/lib64 --volume /var/lib/nvidia/bin:/usr/local/nvidia/bin --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidiactl:/dev/nvidiactl --name my_env -v ~/xxx/yyy:/zzz [imageName]

Or in one line:

  • sudo cos-extensions install gpu && sudo mount --bind /var/lib/nvidia /var/lib/nvidia && sudo mount -o remount,exec /var/lib/nvidia && /var/lib/nvidia/bin/nvidia-smi

scp

# here, use 192.168.1.123 as an 
# example of the external IP 

# from remote to local 
scp -i [private_key] [username]@192.168.1.123:remote/path/ ./local/path

# from local to remote 
scp -i [private_key] ./local/path [username]@192.168.1.123:remote/path/ 
  • Add -r when we want to copy a directory instead of a file.
  • For the remote path, the / in the beginning of absolute path can be omitted.
  • For the local path, this can be either relative path or absolute path.
  • There is no need to create a tmp directory in the local path when copying a remote directory, because the remote directory will be copied with its directory structure.
  • If you omit the file name, the file will be copied with the original name. For example, the destination local path can be just ./
  • If possible, transfer a zipped directory instead of the directory as-is; transferring many small files can take much more time than transferring a single large file.
  • If ssh on the remote host is listening on a port other than the default 22, you can specify the port using the -P option: scp -P 2322 file.txt remote_username@10.10.0.2:/remote/directory.
  • The colon : is how scp distinguishes between local and remote locations.
  • Be careful when copying files that share the same name and location on both systems; scp will overwrite files without warning.
  • When transferring large files, it is recommended to run the scp command inside a screen or tmux session.
  • Adding -p preserves file modification and access times.

Troubleshooter

How to fix warning about ECDSA host key

https://superuser.com/a/421024/1231508

Remove the cached key for 192.168.1.123 (this is just an example) on the local machine:

ssh-keygen -R 192.168.1.123

Permission denied (publickey)

  • https://cloud.google.com/compute/docs/instances/managing-instance-access#enable_oslogin
  • https://stackoverflow.com/a/63696203/7636942

If you repeatedly get Permission denied (publickey) (even after spending hours debugging it), just disable OS Login! Ensure that OS Login is not enabled. You can do so in instance metadata (project-wide metadata is another option):

  • In the Metadata section, add a metadata entry where the key is enable-oslogin and the value is FALSE to exclude the instance from the feature.

PyTorch memo

Posted on 2021-04-06 | In Python

Imports

import torch 
import torch.nn as nn
import torch.nn.functional as F  

Image format convention

In PyTorch, images are represented as [channels, height, width]. During training you will get batches of images, so the shape in the forward method gets an additional batch dimension at dim 0: [batch size, channels, height, width], also called NCHW.

x = x.permute(0, 3, 1, 2) # from NHWC to NCHW 

Creation & initialization

In PyTorch 1.8.1, torch.Tensor is a class (<class 'torch.Tensor'>), torch.tensor is a function:

According to https://pytorch.org/docs/stable/tensors.html#torch.Tensor:

  • To create a tensor with pre-existing data, use torch.tensor()
  • To create a tensor with specific size, use torch.* tensor creation ops (see Creation Ops)
  • To create a tensor with the same size (and similar types) as another tensor, use torch.*_like tensor creation ops (see Creation Ops)
  • To create a tensor with similar type but different size as another tensor, use tensor.new_* creation ops
torch.tensor([[0.1, 1.2], [2.2, 3.1], [4.9, 5.2]])
tensor([[ 0.1000,  1.2000],
        [ 2.2000,  3.1000],
        [ 4.9000,  5.2000]])
# the input can be a variable number of 
# integers or a collection-like a list or tuple
torch.zeros(2, 3)
tensor([[ 0.,  0.,  0.],
        [ 0.,  0.,  0.]])

However, it is also possible to use the constructor of torch.Tensor even though this is not documented. The following is an example. The generated torch.Tensor instance contains small random values, sometimes many 0s. It is not clear to me how it is initialized; my guess is that the initial values are just whatever happened to be in the uninitialized memory blocks. If this guess is correct, this way of creating torch.Tensor objects should be followed by an initialization phase such as Kaiming uniform, etc.

torch.__version__ # 1.8.1 

a = torch.Tensor(1, 2, 3, 4, 5)
a.shape  # torch.Size([1, 2, 3, 4, 5])
a.size() # torch.Size([1, 2, 3, 4, 5]) 
a.dtype  # torch.float32 

torch.randn returns a torch.Tensor object filled with random numbers from a normal distribution with mean 0 and variance 1 (also called the standard normal distribution):

# By default, requires_grad == False 
torch.randn(2, 3)
# or equivalently
torch.randn((2, 3))
tensor([[ 1.5954,  2.8929, -1.0923],
        [ 1.1719, -0.4709, -0.1996]])

Please note the difference between torch.Tensor and torch.nn.parameter.Parameter (nn.Parameter(torch.randn(2, 3)), by default requires_grad == True).

Converting numpy Array to torch Tensor

The created tensor and numpy ndarray share the same memory. Modifications to the tensor will be reflected in the ndarray and vice versa. The returned tensor is not resizable.

a = numpy.array([1, 2, 3])
t = torch.from_numpy(a)
t
tensor([ 1,  2,  3])
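A quick demonstration of this shared memory (a minimal sketch):

import numpy as np
import torch

a = np.array([1, 2, 3])
t = torch.from_numpy(a)

a[0] = 100  # modify the ndarray in place
print(t)    # tensor([100,   2,   3]), the tensor sees the change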

Converting torch Tensor to numpy Array

The tensor and the returned numpy ndarray share the same underlying storage. Changing one will change the other.

It has been firmly established that we should call .detach() before calling .numpy() (xxx.detach().numpy()). The logic is that np.ndarray does not store/represent the computational graph associated with the array, this graph should be explicitly removed using detach().

a = torch.ones(5) # tensor([1., 1., 1., 1., 1.])
b = a.numpy()     # [1. 1. 1. 1. 1.] 
# one example 
b = model(torch.from_numpy(a)) 
c = b.cpu().detach().numpy()
d = b.detach().cpu().numpy()

Shape manipulation

transpose()

torch.transpose(input, dim0, dim1) → Tensor returns a tensor that is a transposed version of input. The given dimensions dim0 and dim1 are swapped. The resulting out tensor shares its underlying storage with the input tensor, so changing the content of one would change the content of the other.

a = torch.randn(1, 2, 3, 4) # torch.Size([1, 2, 3, 4])

# Swaps 2nd and 3rd dimension
b = a.transpose(1, 2)       # torch.Size([1, 3, 2, 4])

transpose_() is the in-place version of transpose().

T returns this Tensor with its dimensions reversed. If n is the number of dimensions in x, x.T is equivalent to x.permute(n-1, n-2, ..., 0).
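For example (a minimal sketch):

x = torch.randn(2, 3, 4)
x.T.shape                # torch.Size([4, 3, 2])
x.permute(2, 1, 0).shape # torch.Size([4, 3, 2])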

contiguous()

https://stackoverflow.com/a/52229694/7636942

There are a few operations on Tensor in PyTorch that do not really change the content of the tensor, but only change how indices are converted into byte locations. These operations include: narrow(), view(), expand() and transpose(). For example: when you call transpose(), PyTorch doesn't generate a new tensor with a new layout, it just modifies meta information in the Tensor object so that the offset and stride describe the new shape. The transposed tensor and the original tensor are indeed sharing the memory!

The word "contiguous" is a bit misleading, because it is not that the content of the tensor is spread out around disconnected blocks of memory. Here the bytes are still allocated in one block of memory, but the order of the elements is not "contiguous". When you call contiguous(), it actually makes a copy of the tensor such that the order of its elements in memory is the same as if a tensor of the same shape had been created from scratch. Normally you don't need to worry about this. If PyTorch expects a contiguous tensor but it isn't, you will get RuntimeError: input is not contiguous, and then you just add a call to contiguous(). If the self tensor is already contiguous, this function returns the self tensor.

view()

view(*shape) → Tensor returns a new tensor with the same data as the self tensor but of a different shape. The returned tensor shares the same data and must have the same number of elements, but may have a different size. For a tensor to be viewed, the new view size must be compatible with its original size and stride, i.e., each new view dimension must either be a subspace of an original dimension, or only span across original dimensions that satisfy some contiguity-like condition (see https://pytorch.org/docs/stable/tensors.html#torch.Tensor.view). Otherwise, it will not be possible to view self tensor as shape without copying it (e.g., via contiguous()). When it is unclear whether a view() can be performed, it is advisable to use reshape(), which returns a view if the shapes are compatible, and copies (equivalent to calling contiguous()) otherwise.

x = torch.randn(4, 4) # torch.Size([4, 4])

y = x.view(16)        # torch.Size([16])

# the size -1 is inferred from other dimensions
z1 = x.view(-1, 8)     # torch.Size([2, 8]) 
z2 = x.view((-1, 8))   # torch.Size([2, 8]) 

reshape()

When possible, the returned tensor will be a view of input. Otherwise, it will be a copy. Contiguous inputs and inputs with compatible strides can be reshaped without copying, but you should not depend on the copying vs. viewing behavior.

This means that torch.reshape() may return either a copy or a view of the original tensor; you cannot count on it to return one or the other. According to the developers: if you need a copy, use clone(); if you need the same storage, use view(). The semantics of reshape() are that it may or may not share the storage, and you don't know beforehand.

a = torch.randn(2, 3, 4) # torch.Size([2, 3, 4])
b = a.reshape(3, -1)     # torch.Size([3, 8])
c = a.reshape((3, -1))   # torch.Size([3, 8])

Difference between view(), reshape(), transpose(), permute().

https://jdhao.github.io/2019/07/10/pytorch_view_reshape_transpose_permute/

Both view() and transpose() return a new tensor sharing the data with the original tensor. One difference between them is that view() can only operate on contiguous tensors and the returned tensor is still contiguous, while transpose() can operate on both contiguous and non-contiguous tensors, and the returned tensor may no longer be contiguous.

But what does contiguous mean? Here is a reference which also applies to PyTorch. A lot of tensor operations require the tensor to be contiguous, otherwise an error will be thrown. To make a non-contiguous tensor contiguous, just use contiguous(), which will allocate new memory for the new tensor and copy the values from the non-contiguous tensor into it.

x = torch.tensor([[1, 2, 3], [4, 5, 6]]) 
y = x.transpose(0, 1) 

# x and y share the same memory space
print(x.data_ptr())      # 140340426148352
print(y.data_ptr())      # 140340426148352 

print(x.is_contiguous()) # True 
print(y.is_contiguous()) # False

permute() and transpose() are similar. transpose() can only swap two dimensions, but permute() can permute all the dimensions. Note that, in permute(), you must provide the new order of all the dimensions; in transpose(), you can only provide two dimensions. transpose() can be thought of as a special case of permute().

x = torch.rand(16, 32, 3) # torch.Size([16, 32, 3])
y = x.transpose(0, 2)     # torch.Size([3, 32, 16])
z = x.permute(2, 1, 0)    # torch.Size([3, 32, 16])
w = x.permute(1, 2, 0)    # torch.Size([32, 3, 16]) 

torch.equal() returns True if two tensors have the same size and elements, False otherwise.

x = torch.tensor([1, 2]) # torch.Size([2])
y = torch.tensor([1, 2]) # torch.Size([2]) 

torch.equal(x, y) # True

x.data_ptr()      # 140340397035200
y.data_ptr()      # 140340370049664

#################################################

a = torch.randn(1, 2, 3, 4) # torch.Size([1, 2, 3, 4])
b = a.transpose(1, 2)       # torch.Size([1, 3, 2, 4])
c = a.view(1, 3, 2, 4)      # torch.Size([1, 3, 2, 4])

torch.equal(a, b)    # False
torch.equal(a, c)    # False
torch.equal(b, c)    # False 

a.data_ptr()      # 140340397022336
b.data_ptr()      # 140340397022336
c.data_ptr()      # 140340397022336

detach(), detach_(), clone()

clone()

clone() returns a copy. This function is differentiable, so gradients will flow back from the result of this operation to the original tensor object. To create a tensor without an autograd relationship, see detach().

clone() vs copy.deepcopy()

For Tensors, in most cases you should go for clone(), since this is a PyTorch operation that will be recorded by autograd. When it comes to Module, there is no clone() method available, so you can either use copy.deepcopy or create a new instance of the model and just copy the parameters.

In the following example, both are equivalent, but there might be a (small) speed difference. When you use .data, you get a new Tensor with requires_grad=False, so cloning it won’t involve autograd.

x = model.encoder[0].weight.data.clone() 
x = copy.deepcopy(model.encoder[0].weight.data) 

The .data field is an old field that is kept for backward compatibility but should not be used anymore, as its usage is dangerous and can make computations wrong. You should use .detach() and/or with torch.no_grad() instead now.

detach()

detach() returns a new Tensor, detached from the current graph. The result will never require gradient. Returned Tensor shares the same storage with the original one. In-place modifications on either of them will be seen, and may trigger errors in correctness checks.
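A minimal sketch of this storage sharing:

a = torch.tensor([1.0, 2.0], requires_grad=True)
b = a.detach()  # b shares storage with a, requires_grad == False
b[0] = 9.0      # in-place modification is visible through a
print(a)        # tensor([9., 2.], requires_grad=True)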

detach_()

detach_() detaches the Tensor from the graph that created it, making it a leaf. Views cannot be detached in-place.

squeeze()/unsqueeze()

squeeze()

squeeze() returns a tensor with all the dimensions of input of size 1 removed. The returned tensor shares the storage with the input tensor, so changing the contents of one will change the contents of the other.

When dim is given, a squeeze operation is done only in the given dimension. If input is of shape: (A×1×B), squeeze(input, 0) leaves the tensor unchanged, but squeeze(input, 1) will squeeze the tensor to the shape (A×B).

x = torch.zeros(2, 1, 2, 1, 2) # torch.Size([2, 1, 2, 1, 2])
x.squeeze()                    # torch.Size([2, 2, 2])
torch.squeeze(x)               # torch.Size([2, 2, 2]) 
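An example with dim given (a minimal sketch):

x = torch.zeros(2, 1, 2, 1, 2) # torch.Size([2, 1, 2, 1, 2])
x.squeeze(0).shape             # torch.Size([2, 1, 2, 1, 2]), dim 0 is not of size 1
x.squeeze(1).shape             # torch.Size([2, 2, 1, 2])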

unsqueeze()

unsqueeze() returns a new tensor with a dimension of size one inserted at the specified position. The returned tensor shares the same underlying data with this tensor. A dim value within the range [-input.dim() - 1, input.dim() + 1) can be used. Negative dim will correspond to unsqueeze() applied at dim = dim + input.dim() + 1.

x = torch.tensor([1, 2, 3, 4]) # torch.Size([4])

x.unsqueeze(0)        # torch.Size([1, 4])
torch.unsqueeze(x, 0) # torch.Size([1, 4])
tensor([[1, 2, 3, 4]])
x.unsqueeze(1) # torch.Size([4, 1])
tensor([[1],
        [2],
        [3],
        [4]])
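A negative dim example (a minimal sketch):

x.unsqueeze(-1).shape # torch.Size([4, 1]), same as dim = -1 + x.dim() + 1 = 1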

CPU, GPU

if torch.cuda.is_available():
    pass 

cuda() is used to move a tensor to GPU memory; cpu() moves it back to memory accessible to the CPU. cpu() and cuda() work differently for models and tensors: for tensors they return a new tensor (hence the assignment below), while for models they move the parameters in place.

For tensors:

tensor = tensor.cpu()
# or equivalently 
tensor = tensor.to("cpu")

tensor = tensor.cuda()
tensor = tensor.to(torch.device("cuda:0"))

Some operations on tensors cannot be performed on cuda tensors. In these cases, you need to move them to cpu first.

For models:

model.cuda() # in-place operation for models 

Get current device name:

torch.cuda.get_device_name(torch.cuda.current_device())

The first GPU can be identified by the following string:

"cuda:0"

Save and load

Using state_dict is the recommended way. A Python dictionary can easily be pickled, unpickled, updated, and restored, so saving a model via state_dict offers more flexibility. We can also save the optimizer state, hyperparameters, etc., as key-value pairs along with the model's state_dict. The drawback is that we need the model definition to load the state_dict.

It is also possible to save and load the entire model; this is however not recommended. This approach saves/loads models with the least amount of code and is also more intuitive. The drawbacks are:

  • Since Python's pickle module is used internally, the serialized data is bound to the specific classes and the exact directory structure used when the model was saved. Pickle simply saves a path to the file containing the specific class, which is used during load time.
  • The code might break after refactoring, as the saved model might not link to the same path. Using such a model in another project is hard as well, since the path structure needs to be maintained.

Save

# recommended
torch.save(model.state_dict(), PATH)
# not recommended 
torch.save(model, PATH)

Load

PATH is the path to a .pth/.pt file. model.load_state_dict(PATH) will not work. torch.load(PATH) returns a collections.OrderedDict containing the parameters.

model = ClassBlaBlaBla(*args, **kwargs)

# recommended
model.load_state_dict(torch.load(PATH))

If we save on GPU, load on CPU:

device = torch.device("cpu")
model.load_state_dict(torch.load(PATH, map_location=device))
# not recommended
model = torch.load(PATH)

sum

When using sum(), if dim is given (analogous to axis in NumPy), the sum is only done over the given dimension dim. If dim is a list of dimensions, reduce over all of them.

If keepdim is True, the output tensor is of the same size as input except in the dimension(s) dim where it is of size 1. Otherwise, dim is squeezed (see torch.squeeze()), resulting in the output tensor having 1 (or len(dim)) fewer dimension(s). By default, keepdim=False.
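For example (a minimal sketch):

x = torch.ones(2, 3)
x.sum(dim=0)               # tensor([2., 2., 2.]), shape [3]
x.sum(dim=0, keepdim=True) # tensor([[2., 2., 2.]]), shape [1, 3]
x.sum(dim=[0, 1])          # tensor(6.)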

einsum

  • https://pytorch.org/docs/stable/generated/torch.einsum.html
  • https://stackoverflow.com/a/55894780/7636942

torch.einsum(equation, *operands) → Tensor sums the product of the elements of the input operands along dimensions specified using a notation based on the Einstein summation convention.

For example, matrix multiplication can be computed using einsum as torch.einsum("ij,jk->ik", A, B). Here, j is the summation subscript and i and k the output subscripts.
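A quick check of this example (a minimal sketch):

A = torch.randn(2, 3)
B = torch.randn(3, 4)
C = torch.einsum("ij,jk->ik", A, B) # shape [2, 4]
torch.allclose(C, A @ B)            # True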

Tensor unfold()/fold()

https://pytorch.org/docs/stable/generated/torch.nn.Unfold.html

torch.nn.Unfold() extracts sliding local blocks from a batched input tensor. Convolution is equivalent to Unfold + Matrix Multiplication + Fold (or view to output shape).
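A minimal sketch illustrating this equivalence (the shapes are arbitrary; F is torch.nn.functional as imported above):

x = torch.randn(1, 3, 8, 8)          # NCHW input
w = torch.randn(5, 3, 3, 3)          # 5 output channels, 3x3 kernel

patches = F.unfold(x, kernel_size=3) # [1, 3*3*3, 36], one column per 3x3 patch
out = w.view(5, -1) @ patches        # [1, 5, 36], matrix multiplication
out = out.view(1, 5, 6, 6)           # view to the convolution output shape

torch.allclose(out, F.conv2d(x, w), atol=1e-5) # True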


Biases

Posted on 2021-03-05 | In Misc

Confirmation bias

  • Confirmation bias is the tendency to search for, interpret, favor, and recall information in a way that confirms or supports one’s prior beliefs or values. People display this bias when they select information that supports their views, ignoring contrary information, or when they interpret ambiguous evidence as supporting their existing attitudes.
  • Biais de confirmation
  • 确认偏误,确认偏差,证实偏差,肯证偏误,验证偏误,验证性偏见,我方偏见

Survivorship bias

  • Survivorship bias or survival bias is the logical error of concentrating on the people or things that made it past some selection process and overlooking those that did not, typically because of their lack of visibility. This can lead to some false conclusions in several different ways. It is a form of selection bias.
  • Biais des survivants
  • 幸存者偏差

Selection bias

  • Selection bias is the bias introduced by the selection of individuals, groups or data for analysis in such a way that proper randomization is not achieved, thereby ensuring that the sample obtained is not representative of the population intended to be analyzed.
  • Biais de sélection
  • 选择性偏差

Hindsight bias

  • Hindsight bias, also known as the knew-it-all-along phenomenon or creeping determinism, is the common tendency for people to perceive past events as having been more predictable than they actually were.
  • Biais rétrospectif
  • 后见之明偏误,事后诸葛亮

MacOS hidden files

Posted on 2021-01-19 | In Misc

This post talks about hidden files (or folders) created by Apple MacOS.

.DS_Store file

In the Apple MacOS operating system, .DS_Store is a file that stores custom attributes of its containing folder, such as the position of icons or the choice of a background image. The name is an abbreviation of Desktop Services Store. It is created and maintained by the Finder application in every folder, and has functions similar to the file desktop.ini in Microsoft Windows. Starting with a full stop (period) character, it is hidden in Finder and many Unix utilities. Its internal structure is proprietary, but has since been reverse-engineered. Starting at macOS 10.12 16A238m, Finder will not display .DS_Store files (even with com.apple.finder AppleShowAllFiles YES set).

Pretty much every folder on your hard disk is likely to contain a .DS_Store file.

To delete .DS_Store files in the current folder and all subfolders from the command line

The find utility walks a file hierarchy: it recursively descends the directory tree for each path listed, evaluating an expression in terms of each file in the tree. . represents the current directory. The option -name specifies the pattern to match. The option -delete deletes the found files and/or directories. The option -type f is for extra caution: it excludes directories. Here, f means regular file, d means directory, s means socket, l means symbolic link, etc.

find . -name '.DS_Store' -type f -delete 

__MACOSX folder

The __MACOSX folder is created when a Mac user creates an archive (also called a zip file) using the Mac. If the Mac user sends the zip file to another Mac user, the folder will not appear - it is a hidden folder. Many files on the Mac have two parts: a data fork and a resource fork. The built-in zip utility on the Mac sequesters all of the resource forks into this __MACOSX folder when creating a zip archive. For certain files (like some font files), these resource forks need to be left intact.

When the Mac user sends the zip file to a PC user, however, all of the hidden files are shown. PC users are often confused by these (seemingly superfluous) files and folders.

References

  • https://en.wikipedia.org/wiki/.DS_Store
  • https://gotoes.org/sales/Zip_Mac_Files_For_PC/What_Is__MACOSX.php#:~:text=The%20__MACOSX%20folder%20is,fork%2C%20and%20a%20resource%20fork.
  • https://www.publicspace.net/ABetterFinderAttributes/Manual_RemoveInvisibleFiles.html

Anaconda memo

Posted on 2021-01-18 | In Misc

Managing Conda and Anaconda

Verify conda is installed, check version

Get the version

conda --version
conda info

Update conda package and environment manager

conda update conda

Update the anaconda meta package

conda update anaconda

Managing Environments

Get a list of all my Anaconda environments

conda env list

conda info --envs

# The above two commands are equivalent

Create an Anaconda environment

conda create -n env_name

Create an Anaconda environment with a specific version of Python

conda create -n env_name python=3.6

conda create --name env_name python=3.9

Create an Anaconda environment with a specific package

conda create -n env_name scipy

# with a specific version of a package 
conda create -n env_name scipy=0.15.0

Create an Anaconda environment from an existing environment (fork / branch)

conda create --clone <existing environment> -n <new environment>

Create an Anaconda environment from an environment.yml file

conda env create -f environment.yml

# The first line of the yml file sets 
# the new environment's name.

Copy the Anaconda environment into an environment.yml file

conda env export > environment.yml

In order to create an Anaconda environment file manually, see https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#create-env-file-manually

Activate an Anaconda environment to use it

conda activate env_name

Deactivate the Anaconda environment

conda deactivate 

Delete an Anaconda environment

conda env remove --name env_name

Installing python packages

conda install [PackageName]

If you encounter the error PackagesNotFoundError: The following packages are not available from current channels, do the following:

conda config --append channels conda-forge

conda-forge is a GitHub organization containing repositories of conda recipes. Thanks to some awesome continuous integration providers, each repository automatically builds its own recipe in a clean and repeatable way on Windows, Linux and OSX.

If a package cannot be easily installed with pip because of issues with system libraries (e.g. version of gcc), a possible solution is to install the package with conda, which is less dependent on system libraries.

Troubleshooter

  • “Collecting package metadata” cannot proceed and never end #9221 https://github.com/conda/conda/issues/9221
  • failed to create anaconda environment ResolvePackageNotFound https://stackoverflow.com/questions/48439159/failed-to-create-anaconda-environment-resolvepackagenotfound

References

  • https://kapeli.com/cheat_sheets/Conda.docset/Contents/Resources/Documents/index
  • https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#activating-an-environment

Docker memo

Posted on 2021-01-02 | In Misc

General information

Show the Docker version information

docker version

Display system-wide information

docker info 

Display a live stream of container(s) resource usage statistics

docker stats

Show the history of an image

docker history [IMAGE_ID]

Display low-level information on Docker objects using docker inspect. For example, if one wants to compare the content of images, one can look at the section RootFS. If all layers are identical, then the images contain identical content.

docker inspect <imageName>
"RootFS": {
        "Type": "layers",
        "Layers": [
            "sha256:eda7136a91b7b4ba57aee64509b42bda59e630afcb2b63482d1b3341bf6e2bbb",
            "sha256:c4c228cb4e20c84a0e268dda4ba36eea3c3b1e34c239126b6ee63de430720635",
            "sha256:e7ec07c2297f9507eeaccc02b0148dae0a3a473adec4ab8ec1cbaacde62928d9",
            "sha256:38e87cc81b6bed0c57f650d88ed8939aa71140b289a183ae158f1fa8e0de3ca8",
            "sha256:d0f537e75fa6bdad0df5f844c7854dc8f6631ff292eb53dc41e897bc453c3f11",
            "sha256:28caa9731d5da4265bad76fc67e6be12dfb2f5598c95a0c0d284a9a2443932bc"
        ]
    }

Remove unused data (remove all unused containers, networks, images (both dangling and unreferenced), and optionally, volumes.)

docker system prune

Login to registry (in order to be able to push to DockerHub)

docker login --username "XXX" --password "YYY"

Docker image

Build an image from the Dockerfile in the current directory and tag the image

docker build -t myimage:1.0 .

Pull an image from a registry

docker pull myimage:1.0 

Retag a local image with a new image name and tag

docker tag myimage:1.0 myrepo/myimage:2.0 

Push an image to a registry

docker push myrepo/myimage:2.0 

List all images that are locally stored with the Docker Engine

# both are equivalent 
docker image ls
docker images

Delete an image from the local image store

docker image rm [imageName]
docker rmi [imageName] # equivalent 
docker image rm alpine:3.4 # an example 

Clean up unused images

# remove dangling images, a dangling image is one that is not tagged and is not referenced by any container
docker image prune

# remove all images which are not used by existing containers 
docker image prune -a

Docker container

List containers (only shows running)

docker ps

List all containers (including non-running)

docker ps -a

List the running containers (add --all to include stopped containers)

docker container ls 

Run a container from the Alpine version 3.9 image, name the running container web and expose port 5000 externally, mapped to port 80 inside the container

docker container run --name web -p 5000:80 alpine:3.9

The commands docker stop docker start docker restart:

# to exit a container
docker stop [container]

# to restart a stopped container 
docker start [container]

# to restart a running container 
docker restart [container]

Copy files/folders between a container and the local filesystem

docker cp [OPTIONS] CONTAINER:SRC_PATH DEST_PATH|-
docker cp [OPTIONS] SRC_PATH|- CONTAINER:DEST_PATH

An example (the first line is run in the docker container; the second and third lines are run in the local terminal):

root@92d3e93c5606:/# mkdir -p /xxx/yyy/zzz 
% CONTAINER=92d3e93c5606
% docker cp ./upload_turbo.zip $CONTAINER:/xxx/yyy/zzz/hello_world.zip

Another example:

docker run --name my_container ...
docker cp my_container:/xxx/yyy/ .

When you stop a container, it is not automatically removed unless you started it with the --rm flag. To see all containers on the Docker host, including stopped containers, use docker ps -a. You may be surprised how many containers exist, especially on a development system! A stopped container’s writable layers still take up disk space. To clean this up, you can use the docker container prune command.

# remove all stopped containers 
docker container prune

Remove one or more containers (stop plus “prune”)

# remove one container 
docker rm CONTAINER_ID

# remove several containers
docker rm CONTAINER_ID_1 CONTAINER_ID_2

# the following two lines remove a container
# even if it is still running 
docker stop CONTAINER_ID
docker rm CONTAINER_ID

# remove the container even if it is still running
docker rm -f CONTAINER_ID

# remove a container and its volumes 
## Option -v removes anonymous volumes 
## associated with the container
docker rm -v CONTAINER_ID

Volume

Volumes can be used by one or more containers, and take up space on the Docker host. Volumes are never removed automatically, because to do so could destroy data.

# remove all volumes not used by at least one container 
docker volume prune

Network

List the networks

docker network ls 
# network access is disabled
docker run -dit --network none alpine:3.9 /bin/sh

The command: docker run

docker run runs a command in a new container.

Use -dit for docker run; in this way, the container runs in the background and one can use docker exec -it ... to enter it. If one runs exit there, one detaches from the container and leaves it running.

# -v: Bind mount a volume
## Map host's directory ~/xxx/yyy  
## to docker container's directory /zzz 
## Both are absolute path
docker run -dit --name my_env -v ~/xxx/yyy:/zzz [imageName]
# network access is disabled
docker run -dit --network none [imageName] /bin/bash

Run docker which uses cuda GPU

docker run -dit --gpus all [imageName] bash

If you use DataLoader of PyTorch with num_workers greater than 0 in a docker container, you probably need to raise the shared memory limit by using --shm-size=2gb or --ipc=host or -v /dev/shm:/dev/shm for docker run.

The command: docker commit

docker commit creates a new image from a container’s changes. The commit operation will not include any data contained in volumes mounted inside the container. By default, the container being committed and its processes will be paused while the image is committed. This reduces the likelihood of encountering data corruption during the process of creating the commit. If this behavior is undesired, set the --pause option to false.

# Using the command "docker images",
# sunhaozhe/IMAGE_NAME corresponds to REPOSITORY,
# TAG_NAME corresponds to TAG. 

docker commit -a "John Hannibal Smith <hannibal@a-team.com>" -m "blablabla" [CONTAINER_ID] sunhaozhe/IMAGE_NAME:TAG_NAME

docker commit --author "John Hannibal Smith <hannibal@a-team.com>" --message "blablabla" [CONTAINER_ID] sunhaozhe/IMAGE_NAME:TAG_NAME

Docker Hub

docker push pushes an image or a repository to a registry

You may need to create the repository first on Docker Hub's web interface, before running the following command:

# push local image to Docker Hub
docker push sunhaozhe/IMAGE_NAME:TAG_NAME

Example

Docker image sunhaozhe/pytorch-cu100-jupyter-gym:

* linux Ubuntu 18.04
* python 3.6.8
* pytorch for GPU (cuda 10.0)
* jupyter, configure jupyter notebook for remote connection
* pip install gym 
* apt-get update
* apt-get install nano
* apt install tmux
* pip install visdom 
* pip install --upgrade pip 
* pip install tensorboard 
* pip install tensorboardx (No! Do not do this; install from source as in the next line)
* git clone https://github.com/lanpa/tensorboardX && cd tensorboardX && python setup.py install 

Docker image sunhaozhe/pytorch-cu100-gym-tmux-tensorboardx:

* pip install --upgrade pip
* pip install gym
* apt-get update
* apt-get install nano
* apt install tmux
* pip install tensorflow (not only tensorboard, because tensorboard alone has a reduced feature set)
* git clone https://github.com/lanpa/tensorboardX && cd tensorboardX && python setup.py install

all-in-one with jupyter, CPU-only / Python 3.

docker pull ufoym/deepo:all-py36-jupyter-cpu

all-in-one with jupyter, CUDA 10.0 / Python 3.6

docker pull ufoym/deepo:all-jupyter-py36-cu100
docker run -dit --name my_env --mount type=bind,source=C:\xxx\yyy\zzz\workspace,target=/workspace sunhaozhe/pytorch-cu100-gym-tmux-tensorboardx
  • -p 18888:8888 for jupyter notebook
  • -p 8097:8097 for visdom server
  • -p 6006:6006 for tensorboardx
docker exec -it my_env bash

Missing packages

Vim

  • apt update
  • apt search vim
  • apt install vim
  • vim --version

Troubleshooter

  • WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
    • Meaning: docker run won't impose any limitations on the use of swap space. However, the warning message is also trying to say that the option -m, --memory will still take effect, and the maximum amount of user memory (including file cache) will be set as intended.
    • https://stackoverflow.com/a/63726105/7636942
  • ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm). Reference
    • Two possibilities:
      • Set num_workers to 0 (according to the doc of PyTorch torch.utils.data.DataLoader, but this will slow down training)
      • Use the flag --ipc=host when executing docker run ..., be careful of potential security issues.
    • If you got a Killed message within a docker container, it probably means that you don’t have enough memory/RAM (use the command free -h to check the amount of available memory). Either increase the amount of memory of the host machine, or increase the amount of memory docker is allowed to use.

Shared memory

Background

Shared memory is memory that may be simultaneously accessed by multiple programs with an intent to provide communication among them or avoid redundant copies. Shared memory is an efficient means of passing data between programs. Depending on context, programs may run on a single processor or on multiple separate processors. Using memory for communication inside a single program, e.g. among its multiple threads, is also referred to as shared memory.

The shared memory device, /dev/shm, provides a temporary file storage filesystem using RAM for storing files. It’s not mandatory to have /dev/shm, although it’s probably desirable since it facilitates inter-process communication (IPC). Why would you use /dev/shm instead of just stashing a temporary file under /tmp? Well, /dev/shm exists in RAM (so it’s fast), whereas /tmp resides on disk (so it’s relatively slow). The shared memory behaves just like a normal file system, but it’s all in RAM.

In order to see how big it is in /dev/shm:

df -h

In order to check what’s currently under /dev/shm:

ls -l /dev/shm

# you may get: total 0
# which means there is currently nothing there

Shared memory for Docker containers

Docker containers are allocated 64 MB of shared memory by default.

We can change the amount of shared memory allocated to a container by using the --shm-size option of docker run:

# 2 GB shared memory
docker run --shm-size=2gb ...

Unit can be b (bytes), k (kilobytes), m (megabytes), or g (gigabytes). If you omit the unit, the system uses bytes.

Note that in the above example, the container is getting its own /dev/shm, separate from that of the host.

What about sharing memory between the host and a container or between containers? This can be done by mounting /dev/shm:

docker run -v /dev/shm:/dev/shm ...

The option --ipc=host can be used instead of specifying --shm-size=XXX. --shm-size=XXX would be enough, but you'd need to set a shared memory size sufficient for your workload. --ipc=host sets shared memory to the same value it has on bare metal. However, --ipc=host may raise security concerns: it removes a layer of security and creates new attack vectors, as any application running on the host that misbehaves when presented with malicious data in shared memory segments can become a potential attack vector.

If you use DataLoader of PyTorch with num_workers greater than 0 in a docker container, you probably need to raise the shared memory limit because the default is only 64 MB:

RuntimeError: DataLoader worker (pid 585) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.

If you still get errors like RuntimeError: DataLoader worker (pid 4161) is killed by signal: Killed. or segmentation faults even though:

  • you already set --ipc=host
  • your shared memory (type df -h in docker container) is as big as 15G or even 40G+
  • you even tried to use 1 for the batch size…
  • your code run perfectly with num_workers=0 but breaks when it becomes larger than 0 (even with num_workers=1)

Then you can possibly solve this issue by simplifying the __getitem__() method of your custom PyTorch dataset class. By simplifying, I mean making the code in __getitem__() as simple as possible. For example, you'd better avoid using a pandas DataFrame in __getitem__(), for a reason that I did not understand. If originally your __getitem__() maps idx to img_path using a pd.DataFrame, you'd better create a Python list items which stores the same information in the PyTorch dataset's __init__ stage; __getitem__() will then only access the list items and never touch the pd.DataFrame (a minimal sketch is shown below). If this still does not solve the issue, then maybe try to increase the host's total memory or simplify the data augmentation techniques used in __getitem__(). I tried to reduce the amount of compute related to data augmentation and dataset loading before the training loop begins (outside __getitem__(), hence outside the dataloader multiprocessing), and mysteriously this solved the issue, no clue why this happened.
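A minimal sketch of this idea (the class, file and column names are hypothetical):

import pandas as pd
from PIL import Image
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, csv_path):
        df = pd.read_csv(csv_path)
        # copy what we need out of the DataFrame once, here
        self.img_paths = list(df["img_path"])
        self.labels = list(df["label"])

    def __len__(self):
        return len(self.img_paths)

    def __getitem__(self, idx):
        # only plain Python lists are touched here, never pd.DataFrame
        img = Image.open(self.img_paths[idx]).convert("RGB")
        return img, self.labels[idx]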

Some other clues:

  • import torch before sklearn
  • upgrade sklearn to the latest version
  • moving import pytorch before other imports
  • reduce input image size
  • the data you read and the python environment should be on the same mount. In one reported case, the python environment was in the user's home directory (which resided on a remote mount shared across multiple compute nodes) and the data was local to the compute node. After moving everything to be local with respect to the compute node, the segfault went away.
  • One user removed all the pandas logic from a project and still encountered the issue, and suspected cv2 (or modifying images with numpy operations rather than PIL operations), since cv2 is known to be not nice with multiprocessing. Removing the usage of cv2 did not help either (leaving random or scipy.ndimage as suspects). What was puzzling is that Python itself was segfaulting, not cv2 or the pytorch libraries; upgrading from Python 2.7 to 3.6 seemed to finally fix it.

References

  • http://www.ruanyifeng.com/blog/2018/02/docker-tutorial.html
  • https://docs.docker.com/engine/reference/commandline/docker/
  • https://datawookie.dev/blog/2021/11/shared-memory-docker/#:~:text=Docker%20containers%20are%20allocated%2064%20MB%20of%20shared%20memory%20by%20default.
  • https://github.com/pytorch/pytorch/issues/1158
  • https://github.com/pytorch/pytorch/issues/8976#issuecomment-401564899
  • https://github.com/pytorch/pytorch/issues/4969

Shell programming memo

Posted on 2021-01-01 | In Misc
  • https://learnxinyminutes.com/docs/bash/
  • https://devhints.io/bash
  • https://www.gnu.org/savannah-checkouts/gnu/bash/manual/bash.html

Shell programming in Linux/Unix

Steps to create a shell script

#!/bin/bash
  • Create a text file with any text editor, and name this script file with the extension .sh, for example xxx.sh
  • Start the script with #!/bin/zsh or #!/bin/bash or #!/bin/sh
  • Write some code and save this file.
  • At this point, execute permission has not yet been granted to this script, so it cannot be run directly from a shell (./xxx.sh). Instead, one can do:
    • zsh xxx.sh
    • bash xxx.sh
    • sh xxx.sh

#! is an operator called shebang, which directs the script to the interpreter location.

Each command starts on a new line, or after a semicolon, for example:

# simple hello world example
echo Hello world!
echo 'This is the first line'; echo 'This is the second line'

Adding shell comments

# is used to add a comment in shell programming.

Shell variables

By convention, Unix shell variables will have their names in UPPERCASE.

Variable Types

When a shell is running, three main types of variables are present

  • Local Variables − A local variable is a variable that is present within the current instance of the shell. It is not available to programs that are started by the shell. They are set at the command prompt.
  • Environment Variables − An environment variable is available to any child process of the shell. Some programs need environment variables in order to function correctly. Usually, a shell script defines only those environment variables that are needed by the programs that it runs.
  • Shell Variables − A shell variable is a special variable that is set by the shell and is required by the shell in order to function correctly. Some of these variables are environment variables whereas others are local variables.

Special variables

# The following line prints the process 
# ID number, or PID, of the current shell 
echo $$ 

Similarly, the following table shows a number of special variables that one can use in shell scripts:

  • $0 The filename of the current script.
  • $n These variables correspond to the arguments with which a script was invoked. Here n is a positive decimal number corresponding to the position of an argument (the first argument is $1, the second argument is $2, and so on).
  • $# The number of arguments supplied to a script.
  • $? The exit status of the last command executed.
  • $$ The process number of the current shell. For shell scripts, this is the process ID under which they are executing.
  • $! The process number of the last background command.

Array variable

Besides scalar variables, there are also array variables.

zsh arrays start at index position 1, bash arrays start at index position 0.

For the following script, bash and sh will execute properly but zsh will generate an error.

NAME[0]="Zara"
NAME[1]="Qadir"
NAME[2]="Mahnaz"
NAME[3]="Ayan"
NAME[4]="Daisy"
echo "First Index: ${NAME[0]}"
echo "Second Index: ${NAME[1]}"

Examples

Create a shell variable and then prints it:

VARIABLE="Hello" # declaring a variable
echo $VARIABLE # this line prints the variable to the shell 

= cannot be surrounded by any space, which is different from Python programming. Otherwise, the shell will decide that VARIABLE (or the other part) is a command it must execute and give an error because it can’t be found.

Shell enables you to store any value you want in a variable.

VAR1="Zara Ali"
VAR2=100 

Shell provides a way to mark variables as read only:

NAME="Zara Ali"
readonly NAME
NAME="Qadiri" # this will generate an error 

Unsetting or deleting a variable directs the shell to remove the variable from the list of variables that it tracks. Once you unset a variable, you cannot access the stored value in the variable:

NAME="Zara Ali"
unset NAME
echo $NAME

The above example does not print anything (just an empty line). You cannot use the unset command to unset variables that are marked readonly.

Read variable from user’s input:

echo "what is your name?"
read NAME
echo "How do you do, $NAME?"
read REMARK
echo "I am $REMARK too!"
% zsh xxx.sh
what is your name?
sf
How do you do, sf?
gsgs
I am gsgs too!

Our current directory is available through the command pwd. We can also use the built-in variable $PWD.

# The following are equivalent

echo "I'm in $(pwd)" # execs `pwd` and interpolates output
echo "I'm in $PWD" # interpolates the variable

String quotes:

NAME="John"
echo "Hi $NAME"  # Hi John
echo 'Hi $NAME'  # Hi $NAME

set command in bash

set command is used by bash (see help set in bash), not by zsh

Functions

Defining functions

myfunc() {
    echo "hello $1"
}

Alternate syntax:

function myfunc() {
    echo "hello $1"
}

Returning values

# Only the last line will print to user's screen 

myfunc() {
    local myresult='some value'
    echo $myresult
}

result="$(myfunc)" 
echo $result # some value 

Python debug tool - pdb

Posted on 2020-11-22 | In Python

https://www.cnblogs.com/klb561/p/12057436.html

python3 -m pdb xxx.py
  • c - continue: resume execution from the current position until the end (or the next breakpoint)
  • q - quit: exit the debugger
  • r - return: quickly run until the current function returns (must be used inside a function)
  • n - next: execute the next statement; if it is a function call, execute the whole function and stop at the statement following the current one
  • s - step: execute the next instruction; if it is a function call, step into the first line of that function
  • p + the name of a variable: show the content of this variable. For example,
    • p a
    • p a, b, c
  • h - help
  • w - where: print the current call stack
  • u - up: move one level up the current stack

pdb is capable of interpreting any Python code, not only those special commands. For example, locals() or globals() display all the variables in scope with their values.
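One can also drop into the debugger programmatically with the standard library (a minimal sketch):

import pdb

def f(x):
    y = x * 2
    pdb.set_trace() # execution pauses here; try p y or p locals()
    return y

f(3)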


Git memo

Posted on 2020-11-11 | In Misc

Stage, commit, push

https://stackoverflow.com/questions/572549/difference-between-git-add-a-and-git-add

git add -A # stages all changes
git commit -m "xxx"
git push

We can undo git add before commit. The following will remove file.txt from the current index (the “about to be committed” list) without changing anything else

git reset file.txt

To unstage all changes

git reset

To push a local branch to the remote repo

git push --set-upstream <remote> <branch>
git push --set-upstream origin new_branch_name

# -u is the shortcut for --set-upstream
git push -u <remote> <branch>
git push -u origin new_branch_name

To delete a remote branch

git push origin --delete new_branch_name

To discard or unstage uncommitted local changes, to restore working tree files

git restore file_name.txt

Index and the working tree

To tell Git not to track changes to your local file/folder (meaning git status won’t detect changes to it)

git update-index --skip-worktree xxx/yyy/file_name

There is no “recursive” option to git update-index, thus if we want to git skip worktree on all tracked files inside a directory and its subdirectories, we should first cd into that directory, then run the following command (without any modification):

find . -maxdepth 1 -type d \( ! -name . \) -exec bash -c "cd '{}' && pwd && git ls-files -z ${pwd} | xargs -0 git update-index --skip-worktree" \;

To tell Git to track changes to your local version once again (so you can commit the changes)

git update-index --no-skip-worktree xxx/yyy/file_name

Similarly, to the above operation recursively, run the following command:

find . -maxdepth 1 -type d \( ! -name . \) -exec bash -c "cd '{}' && pwd && git ls-files -z ${pwd} | xargs -0 git update-index --no-skip-worktree" \;

Show information about files in the index and the working tree

# list all files in the repo 
# (assuming we are in the root folder)
git ls-files .

# -v makes the output verbose, 
# meaning that it will abbreviate 
# the file status with a letter in 
# front of the filename
git ls-files -v .

The options (git ls-files -v .) are:

H cached
S skip-worktree
M unmerged
R removed/deleted
C modified/changed
K to be killed
? other

List files ignored with skip-worktree

# On Linux or Mac
git ls-files -v . | grep ^S

# On Windows
git ls-files -v . | findstr "^S"

Do you want git to ignore deleted files? The command below works on deleted as well as modified files, ignoring them when you run git status.

git update-index --assume-unchanged xxx/yyy/

https://stackoverflow.com/a/3453788/7636942

Show which files were changed by a given commit

git show --name-only <commit_hash>

Restore a file to its version at a given commit

git checkout <commit_hash> -- <file_path>

Check who (which commit, when, author) changed which lines in a file

git blame [file_path]

Undo the most recent commit in Git (GitHub)

https://stackoverflow.com/a/927386/7636942

git commit -m "Something terribly misguided" # (0: Your Accident)
git reset HEAD~                              # (1)
[ edit files as necessary ]                  # (2)
git push origin master --force               # (3)
  • #1 This command is responsible for the undo. It will undo your last commit while leaving your working tree (the state of your files on disk) untouched. You'll need to add the files again before you can commit them again.
  • #2 Make corrections to working tree files, for example, delete the wrongly committed files.
  • #3 To remove (not revert) a commit that has been pushed to the server, this command rewrites history on the GitHub server.

Replace git reset HEAD~ by git reset HEAD~3 to undo the 3 most recent commits.

Branching and merging

Warning: when creating a pull request on GitHub, it should always be preferred to create a new branch dedicated to it, for example a branch called pull_request_branch. In this way, we can always commit to the master/main branch without affecting the created pull request.

Lists existing branches, the current branch will be highlighted in green and marked with an asterisk

git branch

# lists remote-tracking branches
git branch -r

# lists both local and remote branches
git branch -a

# checks tracking branches on the remote repo
git branch -vv

Creates a new branch and switch to it at the same time

git branch new_branch_name
git checkout new_branch_name

This is equivalent to

git checkout -b new_branch_name

Switches (back) to the master branch (It’s best to have a clean working state when one switches branches)

git checkout master 

Makes some changes in a new branch, then merges that branch back into the master branch

# creates a new branch to work on
git checkout -b new_branch_name

...
git commit -m "blablabla"

# merges the new branch 
git checkout master
git merge new_branch_name 

# deletes the new branch, because one no longer 
# needs it, the master branch points to the same place
git branch -d new_branch_name

If a branch branch_name_1 is not fully merged, git branch -d branch_name_1 raises an error which prevents this deletion. If we are sure we want to delete it, do the following

git branch -d --force branch_name_1

# -D is the shortcut for --delete --force
git branch -D branch_name_1

Shows changes

# between branches
git diff master new_branch_name

# between commits 
git diff 8496675381 93788e32b

Git fetch remote branch

# fetch a remote branch
git fetch <remote> <rbranch>:<lbranch>

# if needed, switch to the newly fetched branch 
git checkout <lbranch>

…where <rbranch> is the remote branch or source ref and <lbranch> is the as yet non-existent local branch or destination ref you want to track and which you probably want to name the same as the remote branch or source ref. If we are working with GitHub, <remote> can just be the https address that is usually used to do git clone (without any further modification).
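For example, a hypothetical fetch of a remote branch called dev (the repository URL and branch name are placeholders):

git fetch https://github.com/<USERNAME>/<REPO>.git dev:dev
git checkout dev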

How to create a new remote branch out of an old commit?

git checkout -b new_branch_name
git reset --hard <old_commit_id>
git push origin new_branch_name

How to clone a specific branch?

It seems that in some cases, only the specified branch will be cloned even if --single-branch is not used.

git clone --branch <branchname> <remote-repo>
git clone --single-branch --branch <branchname> <remote-repo>

Find the most recent common ancestor of two branches

git merge-base [branch1] [branch2]

This command outputs a commit hash: the most recent common ancestor of the two branches.

Selectively copy the content of a commit (with hash [commit_hash]) from one branch to another

# first switch to the target branch
git checkout [target_branch]

# then apply the commit to the target branch
git cherry-pick [commit_hash]

The commit created by cherry-pick on the new branch gets its own independent hash; it only carries over the content of the original [commit_hash].

Squash or edit the most recent N commits of the current branch

git rebase -i HEAD~N

In the interactive mode of git rebase -i, the historical commits are listed from top to bottom in the order they were made (oldest first). You operate on them by editing the keyword in front of each commit.

  • pick keeps the commit unchanged
  • squash melds the commit into the previous commit and merges the commit messages
  • reword edits the commit message but keeps the commit content

N can be chosen somewhat larger than necessary, since you are not required to modify every commit in the list.

Example:

HEAD~3: here HEAD denotes the latest commit of the current branch, and HEAD~3 denotes the commit reached by going back 3 commits from HEAD:

  • commitA (HEAD, the latest commit)
  • commitB (HEAD~1)
  • commitC (HEAD~2)
  • commitD (HEAD~3)

git rebase -i HEAD~3 enters an interactive mode to edit the commits from commitC to commitA, i.e. the 3 most recent commits (commitD serves as the base and is not touched). In this example, if one of these commits (say commitC) is a merge commit (with commit message Merge branch 'A' into B), the interactive mode of git rebase -i expands every commit of branch A, so the number of commit lines you see may be greater than 3.
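As a sketch, the todo list opened by git rebase -i HEAD~3 could look like the following (hashes and messages are made up; the oldest commit is at the top):

pick c3c3c3c message of commitC
squash b2b2b2b message of commitB
reword a1a1a1a message of commitA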

Submodules

After git clone, initialize and update the submodules:

git submodule init && git submodule update

Why submodules in git? It often happens that while working on one project, you need to use another project from within it. Perhaps it’s a library that a third party developed or that you’re developing separately and using in multiple parent projects. A common issue arises in these scenarios: you want to be able to treat the two projects as separate yet still be able to use one from within the other.

To add a new submodule you use the git submodule add command with the absolute or relative URL of the project you would like to start tracking. The given URL is recorded into .gitmodules. This file contains one subsection per submodule. This file is version-controlled with your other files, like your .gitignore file. It’s pushed and pulled with the rest of your project. This is how other people who clone this project know where to get the submodule projects from.

git submodule init reads the .gitmodules file and records the submodule information in .git/config: it creates a record about the submodules in the .git/config file, but does not check out the contents of the submodule repositories.

git submodule update updates and checks out the content of each submodule:

  • To keep the submodules in a specific known state, git submodule update updates the submodules to the specific commits recorded in the parent project. When you run git submodule update, Git checks each submodule out at the exact commit hash recorded in the parent project, which ensures that the submodule state is consistent with the parent project's record.
  • To get the latest submodule versions, git submodule update --remote updates the submodules to the latest commit of a given branch of the remote repository.

How is the specific submodule commit recorded in the parent project determined?

In Git, the submodule's target commit is actually recorded inside specific commits of the parent project. These records live in the parent project's Git objects, not in a separate configuration file. Concretely, each commit of the parent project contains the specific commit hash of every submodule. When you add a submodule to the parent project, the parent project records the submodule's commit hash at that point in time. You can find the submodule's commit hash by inspecting the tree object of a parent-project commit.

How do you view the submodule's target commit? The .gitmodules and .git/config files do not contain the submodule's specific commit hash. Use git ls-tree HEAD to view it:

git ls-tree <HEAD/commit_hash> 

git ls-tree HEAD shows the content of the tree object of the given commit (here HEAD); like the ls command, it lists the files and subdirectories under the current path. In its output, the hash displayed on the lines corresponding to submodules is exactly the submodule commit hash we are looking for; it points to a specific commit of the submodule repository.
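For illustration, the output may look like the following (hashes and names are made up); the mode 160000 marks a gitlink, i.e. a submodule entry whose hash is the recorded submodule commit:

100644 blob 8ab686eafeb1f44702738c8b0f24f2567c36da6d	README.md
040000 tree 99f1a6d12cb4b6f19c8655fca46c3ecf317074e0	src
160000 commit 2f9d1e9f6f3ad60a7d05ca58e1ebf98b2a5c8b47	third_party/somelib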

The word tree is rather overloaded in Git (and in computing in general).

  • A tree object, in Git, is an internal data structure that records a directory tree or sub-tree. It contains one entry per file or sub-directory (or, for submodules, a gitlink entry for that submodule).
  • The work-tree or working tree (or other variations of this spelling) refers to the place in which you do your work.

Checking states

Displays the state of the working directory and the staging area

git status 

Shows commit logs

# shows the current HEAD and its ancestry 
git log 

# compact mode
git log --oneline

Reference logs. git reflog doesn't traverse HEAD's ancestry at all. The reflog is an ordered list of the commits that HEAD has pointed to: it's undo history for your repo. The reflog isn't part of the repo itself (it's stored separately from the commits themselves) and isn't included in pushes, fetches or clones; it's purely local.

git reflog

Checks the version of Git

git --version

To view a file's diff before commit, for a file that has not been git added yet:

git diff myfile.txt

The above command does not work for files that are already git added. To see already staged changes:

git diff --cached myfile.txt 

Visualize the commit history of different branches

# shows the commit history of all local branches, not only the current one
git log --graph --decorate --branches --oneline

# shows the commit history of all refs: local branches, remote branches, tags, and other references (such as stash)
git log --graph --decorate --all --oneline


# shows only the current branch's history (of limited interest: with a single branch, --graph is unnecessary)
git log --graph --decorate --oneline

--oneline prints each commit on a single line.

git stash

git stash is handy if you need to quickly switch context and work on something else, but you're mid-way through a code change and aren't quite ready to commit. git stash saves modified content that you do not want to commit yet onto a stack; later you can restore the stacked content on some branch. Stashed content can be restored not only on the branch where it was created, but also on any other branch.

# stash changes including untracked files. 
git stash --include-untracked 

# removes the changes from your stash and reapplies them to your working copy. 
git stash pop

# reapply the changes to your working copy and keep them in your stash, 
# useful if you want to apply the same stashed changes to multiple branches.
git stash apply

# view all stashes
git stash list

# view the diff of a stash
git stash show

# view the full diff of a stash
git stash show -p
git stash show --patch

The git stash command takes your uncommitted changes (both staged and unstaged), saves them away for later use. The stash is local to your Git repository; stashes are not transferred to the server when you push. By default, running git stash will stash: (1) changes that have been added to your index (staged changes); (2) changes made to files that are currently tracked by Git (unstaged changes). But it will not stash: (1) new files in your working copy that have not yet been staged; (2) files that have been ignored. Adding the -u option (or --include-untracked) tells git stash to also stash your untracked files. You can include changes to ignored files as well by passing the -a option (or --all) when running git stash.

1
2
3

![git_stash.png](/blogs/assets/images/blog/git_stash.png)

You are not limited to a single stash. You can create multiple stashes, and then use git stash list to view them. By default, stashes are identified simply as a “WIP” – work in progress – on top of the branch and commit that you created the stash from. You can annotate your stashes with a description, using git stash save "message".
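For example (the stash message is made up); note that newer versions of Git deprecate git stash save in favor of the equivalent git stash push -m:

git stash push -m "WIP: refactor parser"
git stash list
# stash@{0}: On master: WIP: refactor parser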

By default, git stash pop will re-apply the most recently created stash: stash@{0}. You can choose which stash to re-apply by doing e.g. git stash pop stash@{2}.

Delete references to branches that no longer exist on the remote

git fetch -p (-p is short for --prune) cleans up the local list of remote-tracking references by deleting those whose branches no longer exist on the remote repository.

Before fetching, remove any remote-tracking references that no longer exist on the remote.
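The command itself:

git fetch -p

# equivalent long form
git fetch --prune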

Config

git config --list
git config --global user.name
git config --global user.email

git ignore

Ignore some files

vim .gitignore

Ignore some files locally

vim .git/info/exclude

Don't forget the Index and the working tree section above.

Reset cached credentials on MacOS (password, personal access tokens, etc.)

On MacOS, we can use command line to delete existing credentials and then re-enter our new username and/or password when prompted.

To delete existing credentials, enter the following command:

git credential-osxkeychain erase
host=github.com
protocol=https
> [Press Return (twice)]

If successful, nothing will print out.

Now when we try to clone a GitHub repository (using https), we will be prompted to enter our credentials.

How to Change Your GitHub Remote Authentication from Username + Password to Personal Access Token

https://medium.com/geekculture/how-to-change-your-github-remote-authentication-from-username-password-to-personal-access-token-64e527a766cf

  • cd to the local directory, run git remote -v to check the current remote.
  • Remove the old remote. For example, run git remote remove origin to remove the origin remote.
  • Add your new remote in the following format: git remote add origin https://<TOKEN>@github.com/<USERNAME>/<REPO>.git

That's all. This can be verified by running git remote -v.

How to clone a private GitHub repo

git clone https://<pat>@github.com/<your account or organization>/<repo>.git

where <pat> means Personal Access Token.


Vim memo

Posted on 2020-10-11 | In Misc
| Words count in article 652

Vim modes

When you launch the Vim editor, you’re in normal mode.

To type text, you need to enter the insert mode by pressing the i key. This mode allows you to insert and delete characters in the same way you do in a regular text editor.

To go back to the normal mode from any other mode, just press the Esc key.

Opens a file in Vim / Vi

vim file.txt

Saves the file without exiting Vim / Vi

  • Press Esc
  • Type :w
  • Press Enter

To save the file under a different name, type :w new_filename and hit Enter.

Saves a file and quits Vim / Vi

:wq

Quits Vim / Vi without saving the file

:q!

Navigating

Moves the cursor within the whole file

Places the cursor at the start of the file

gg

Places the cursor at the end of the file

G

Moves one page forward

Ctrl + f

Moves one page backward

Ctrl + b

Places the cursor at line 14 (go to line 14)

:14

Moves the cursor within one line

Places the cursor at the beginning of a line

0

Places the cursor at the end of a line

$

Moves the cursor forward to the beginning of a word

w

Move forward three words

3w

Move forward a WORD (any non-whitespace characters)

W

Move backward to the beginning of a word

b

Move backward three words

3b

Move backward a WORD (any non-whitespace characters)

B

Jump to the 42nd column of the current line

42|

Moves the cursor within the screen

Places the cursor at the top of the screen

H

Places the cursor at the middle of the screen

M

Places the cursor at the bottom of the screen

L

Searches for text in the document

Searches for text in the document where keyword is whatever keyword, phrase or string of characters you’re looking for.

/[keyword]

Searches the text again in whatever direction the last search was.

n

Searches your text again in the opposite direction.

N

To turn off highlighting (of text search) until the next search.

:noh

Makes the vi/vim text editor show or hide line numbers

:set number

:set nonumber

Delete one line

Press ESC to go to Normal mode, place the cursor on the line you need to delete, press dd

Clear the whole document (select all and then delete)

ggdG

Display jsonl in a structured way in Vim

:%!jq .

Character-by-character explanation:

  • : enters Vim's command-line mode, in which you can type and execute Vim commands.
  • % selects the whole buffer: in Vim, % is a range specifier meaning from the first line to the last line of the file.
  • ! is the filter operator: it pipes the currently selected text through the given external command.
  • jq is the name of the external command; jq is a lightweight yet powerful command-line JSON processor.
  • . is jq's filter expression: . denotes the top-level element of the current object or data stream.
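The same pattern works with any Vim range; for instance, in a .jsonl file where each line is one JSON object, you can format only the line under the cursor (. is Vim's current-line range):

:.!jq .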

Compare two files with Vim

vim -d file1 file2

In Vim's compare (diff) mode:

  • ]c jumps to the next difference (no need to press Esc or :)
  • [c jumps to the previous difference (no need to press Esc or :)

Turn off auto indent when pasting text into Vim

To turn off autoindent when you paste code, there’s a special “paste” mode:

:set paste

Then paste your code. Note that the text in the tooltip now says -- INSERT (paste) --.

After you pasted your code, turn off the paste-mode, so that auto-indenting when you type works correctly again.

:set nopaste

Spacy memo

Posted on 2020-08-16 | In Python
| Words count in article 128
nlp = spacy.load('en_core_web_sm') 
nlp = spacy.load('en_core_web_sm', disable=['tagger', 'parser', 'ner']) 

To use spaCy's lemmas, consider including the tagger pipe, because it influences the result of lemmatization:

nlp = spacy.load('en_core_web_sm') 
doc = nlp("This is my second time.")
for token in doc:
    print(token, token.lemma_) 
This this
is be
my -PRON-
second second
time time
. .
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner']) 
doc = nlp("This is my second time.")
for token in doc:
    print(token, token.lemma_) 
This this
is be
my -PRON-
second second
time time
. .
nlp = spacy.load('en_core_web_sm', disable=['tagger', 'parser', 'ner']) 
doc = nlp("This is my second time.")
for token in doc:
    print(token, token.lemma_) 
This This
is be
my my
second 2
time time
. .

Packing and unpacking arguments in Python

Posted on 2020-08-12 | In Python
| Words count in article 213

Unpacking

We can use * to unpack a list.

def func(a, b, c, d): 
    pass 
    
a = [1, 2, 3, 4] 

# unpacks list into four arguments 
func(*a) 
list(range(3, 6)) # [3, 4, 5] 

args = [3, 6] 
list(range(*args)) # [3, 4, 5] 

** is used for dictionaries.

def func(a, b, c): 
    pass 

# unpacking of dictionary 
d = {'a':2, 'b':4, 'c':10} 
func(**d) 

Packing

When we don't know how many arguments need to be passed to a Python function, we can pack all arguments in a tuple.

# This function uses packing to take 
# an unknown number of arguments 
def func(*args): 
    sum = 0
    for i in range(0, len(args)): 
        sum = sum + args[i] 
    return sum 

print(func(1, 2, 3, 4, 5)) 
print(func(10, 20)) 
def func(**kwargs): 
    # kwargs is a dict 
    print(type(kwargs)) 
    # Printing dictionary items 
    for key in kwargs: 
        print("%s = %s" % (key, kwargs[key])) 

func(name="geeks", ID="101", language="Python") 

Python memo

Posted on 2020-07-28 | In Python
| Words count in article 1302

Main file

if __name__ == "__main__":
    pass

Reading files

Reads file as a string:

with open(file_path, "r") as file:
    content_as_str = file.read()

Reads file line by line:

# approach 1
with open(file_path, "r") as file:
    content = file.readlines()
content = [x.strip() for x in content] 

# approach 2 
with open(file_path, "r") as file:
    for line in file:
        print(line.strip()) 

String operations

Remove spaces at the beginning and at the end of the string:

txt = "     d     "
x = txt.strip()
y = "abc" + x + "efg" # "abcdefg"

Split strings:

a = "This is Mike."
b = a.split() # ['This', 'is', 'Mike.']
a = "Hello, my name is Mike, I am from Australia."
b = a.split(",") # ['Hello', ' my name is Mike', ' I am from Australia.'] 

Join:

a = ["1", "2", "3", "4"]  
b = "-".join(a) # "1-2-3-4"

File path operations

os.curdir # '.'
os.pardir # '..'
os.sep # '/'
os.getcwd() # '/Users/ZZZ/Desktop/XXX/YYY'

# Returns the base name without the directory name
os.path.basename(file_path)
# Returns the directory name without the base name
os.path.dirname(file_path)
# Returns True if path is an existing regular file
os.path.isfile(path)
# Returns True if path is an existing directory
os.path.isdir(path)
# Returns a normalized absolutized version of the pathname path
os.path.abspath(path)
# Splits the path into a pair (root, ext)
os.path.splitext("XXX/YYY/tmp.txt") # ('XXX/YYY/tmp', '.txt') 
# Returns True if path is an absolute pathname
os.path.isabs(path)

Makes a new directory if it does not exist:

dir_path = os.path.join(XXX, YYY)
if not os.path.exists(dir_path):
    os.makedirs(dir_path)

In order to be able to import modules from packages in the parent directory:

import os
import sys

# in order to be able to import modules from packages in the parent directory
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

Gets a list of all base names (does not distinguish subdirectories and files) in the current directory:

# name_ is base name, not path
for name_ in os.listdir(X_dir):
    print(name_) 

Gets a list of all subdirectories in the current directory:

# name_ is base name, not path 
for name_ in os.listdir(X_dir):
    if os.path.isdir(os.path.join(X_dir, name_)):
        print(name_)

Generates the file names in a directory tree by walking the tree. For each directory in the tree rooted at directory top (including top itself), it yields a 3-tuple. dir_path is a string, the path to the directory. dir_names is a list of the names of the subdirectories in dir_path (excluding . and ..). file_names is a list of the names of the non-directory files in dir_path.

for dir_path, dir_names, file_names in os.walk(X_dir):
    print(dir_path)
    print(dir_names)
    print(file_names)

Uses glob to find all the path names matching a specified pattern according to the rules used by the Unix shell:

import glob

# To search file paths
for file_path in glob.glob(os.path.join("XXX", "*.txt")):
    print(file_path) 
   
# To search file names (base name without the directory)
for file_path in glob.glob(os.path.join("XXX", "*.txt")):
    print(os.path.basename(file_path)) 

# File paths in the above code are not sorted. 
# If needed, use the following:

for file_path in sorted(glob.glob(os.path.join("XXX", "*.txt")), key=lambda x: TODO):
    print(file_path) 
    
# To search file paths in subdirectories:

for file_path in glob.glob(os.path.join("XXX", "*", "*.txt")):
    print(file_path) 

# To search file paths recursively

# If recursive is True, the pattern ** will match any files and 
# zero or more directories and subdirectories. If the pattern is 
# followed by an os.sep, only directories and subdirectories match.
for file_path in glob.glob(os.path.join("XXX", "**", "*.txt"), recursive=True):
    print(file_path)

Pickle

import pickle

# save_path ends with ".pkl"
with open(save_path, "wb") as f:
    pickle.dump(obj_, f)
    #pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)

with open(save_path, "rb") as f:
    obj_ = pickle.load(f)

Inheritance

super().__init__() 

Reimport a module in an interactive Python session

import importlib
importlib.reload(module_name) 

Exceptions

try:
    print(x)
except NameError:
    print("Variable x is not defined")
except:
    print("Something else went wrong")
try:
    print(x) 
except Exception as e:
    print(e) 
try:
    print("Hello")
except:
    print("Something went wrong")
else:
    print("Nothing went wrong")
try:
    print(x)
except:
    print("Something went wrong")
finally:
    print("The 'try except' is finished")
raise Exception("Blablabla")

Gets object attributes in Python

dir(obj_)

Sorting

a = [2, 8, 1, 4, 3, 9, 7]
print(a)

b = sorted(a)                # ascending order 
c = sorted(a, reverse=True)  # descending order

print(a)
print(b)
print(c)
[2, 8, 1, 4, 3, 9, 7]
[2, 8, 1, 4, 3, 9, 7]
[1, 2, 3, 4, 7, 8, 9]
[9, 8, 7, 4, 3, 2, 1]
a = ["cccc", "b", "dd", "aaa"] 
sorted(a)           # ['aaa', 'b', 'cccc', 'dd']
sorted(a, key=len)  # ['b', 'dd', 'aaa', 'cccc']
def func(x): 
    return x % 7
  
a = [15, 3, 11, 7] 
sorted(a)            # [3, 7, 11, 15]
sorted(a, key=func)  # [7, 15, 3, 11]
a = [('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]

b = sorted(a, key=lambda x: x[2]) 
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]
import operator 

a = [('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]

b = sorted(a, key=operator.itemgetter(2)) 
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)] 
a = {"Allen": 3, "Tom": 5, "John": 0, "Bob": -2}

# sort dict by key 
b = sorted(a)             # ['Allen', 'Bob', 'John', 'Tom']

# sort dict by value 
c = sorted(a, key=a.get)  # ['Bob', 'John', 'Allen', 'Tom'] 

Dict comprehension to reverse key-value pair in a dictionary

id2word = {v: k for k, v in word2id.items()} 

Built-in functions

  • globals() current global symbol table. It always returns the dictionary of the module namespace. This is always the dictionary of the current module (inside a function or method, this is the module where it is defined, not the module from which it is called).
  • locals() current local symbol table. It always returns a dictionary of the current namespace (see the sketch below).
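A minimal runnable sketch of the difference (function and variable names are made up):

x = 1

def f():
    y = 2
    # locals(): symbol table of the current (function) namespace
    print(sorted(locals()))  # ['y']
    # globals(): symbol table of the enclosing module
    print("x" in globals())  # True

f()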

Parsing arguments

import argparse 

parser = argparse.ArgumentParser("blablabla...")

# positional argument 
parser.add_argument("bar")

# optional arguments
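# note: the add_argument calls below are independent illustrations;
# registering the same option string twice in one parser raises argparse.ArgumentError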

# numerical values
parser.add_argument("--xx", type=int, default=0, help="""""")
parser.add_argument("--yy", type=int, default=None, help="""""")
parser.add_argument("--yy", type=float, default=None, help="""""")

# binary switch 
parser.add_argument("--zz", action="store_true", default=False, help="""""")

# shorthand (using -a is equivalent to using --blabla)
parser.add_argument("-a", "--blabla", type=float, default=None, help="""""")

# variable number of command-line parameters
parser.add_argument("--foo", nargs=2)
parser.add_argument("--foo", nargs="?", const="c", default="d")

# to make an option required
parser.add_argument('--foo', required=True)

args = parser.parse_args()

For nargs:

  • N (an integer). N arguments from the command line will be gathered together into a list.
  • ? One argument will be consumed from the command line if possible, and produced as a single item. If no command-line argument is present, the value from default will be produced. Note that for optional arguments, there is an additional case - the option string is present but not followed by a command-line argument. In this case the value from const will be produced.
  • * All command-line arguments present are gathered into a list.
  • + Just like *, all command-line args present are gathered into a list. Additionally, an error message will be generated if there wasn't at least one command-line argument present. (See the sketch below.)
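A minimal runnable sketch of these nargs behaviors (option names are made up):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--pair", nargs=2)                    # exactly 2 values -> list
parser.add_argument("--opt", nargs="?", const="c", default="d")
parser.add_argument("--rest", nargs="*")                  # zero or more values -> list

print(parser.parse_args([]))                    # Namespace(pair=None, opt='d', rest=None)
print(parser.parse_args(["--opt"]))             # opt='c' (option present without a value)
print(parser.parse_args(["--opt", "x"]))        # opt='x'
print(parser.parse_args(["--pair", "a", "b"]))  # pair=['a', 'b']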

Python virtual environment

Posted on 2020-07-26 | In Python
| Words count in article 195

To create a virtual environment:

# navigate to the dedicated directory 
# this command creates venv_X in the current directory 
virtualenv venv_X

# create virtual env with a specific version of Python
## here /usr/bin/python3 is Python 3.8.2
virtualenv --python=/usr/bin/python3 venv_msr

To install the kernel for Jupyter notebook:

cd venv_X/
source bin/activate 
pip3 install --upgrade pip 
pip3 install ipykernel

python3 -m ipykernel install --user --name=venv_X

# install packages
pip3 install --upgrade matplotlib 
pip3 install --upgrade ipywidgets 
pip3 install --upgrade pandas 
pip3 install --upgrade sklearn 
pip3 install --upgrade opencv-python 

# after completing the steps above in order, you can run jupyter notebook (with or without the virtual environment activated) and select the kernel corresponding to the virtual environment

To activate the virtual environment:

source venv_X/bin/activate

To deactivate the virtual environment:

deactivate

To delete the virtual environment:

rm -r venv_X

To uninstall the kernel for Jupyter notebook:

jupyter kernelspec uninstall venv_X

To see which kernels are available for Jupyter notebook:

jupyter kernelspec list

Translation of some technical terms

Posted on 2020-07-19 | In Misc
| Words count in article 1109
| English | French | Chinese |
| --- | --- | --- |
| pre-trained networks | réseaux pré-entraînés | - |
| retraining the last layer | le recyclage de la dernière couche | - |
| fine-tuning | le réglage fin | - |
| convolutional neural network; CNN; ConvNet | réseau de neurones convolutifs; réseau de neurones à convolution | - |
| graph neural networks | réseaux de neurones graphiques | - |
| prior knowledge | connaissances antérieures | - |
| feature engineering | ingénierie des caractéristiques | - |
| overfitting | surapprentissage; sur-ajustement (surajusté); surinterprétation | - |
| training set | jeu de données d'apprentissage; données d'apprentissage; ensemble d'apprentissage; échantillon d'apprentissage | - |
| validation set | jeu de données de validation | - |
| test set | jeu de données de test | - |
| backpropagation; backprop; BP | rétropropagation du gradient | 反向传播算法 |
| vanishing gradient | disparition du gradient | 梯度消失 |
| loss function | fonction objectif | - |
| chain rule | théorème de dérivation des fonctions composées; règle de dérivation en chaîne; règle de la chaîne | 链式法则 |
| skip connection; residual connection | connexion saute-couche; connexion résiduelle; saut de connexion | - |
| stochastic gradient descent | algorithme du gradient stochastique | - |
| gradient descent | algorithme du gradient; méthode de descente de gradient | - |
| machine learning | apprentissage automatique; apprentissage statistique | - |
| optimizer | méthode d'optimisation; optimiseur | - |
| moving average; rolling average; running average | moyenne mobile; moyenne glissante | - |
| exponential moving average | moyenne mobile exponentielle | 指数移动平均 |
| audio frame | trame audio | - |
| tune the hyperparameters | régler les hyperparamètres | 调参 |
| short-time Fourier transform; STFT | transformée de Fourier à court terme; transformée de Fourier locale; transformée de Fourier à fenêtre glissante | - |
| discrete cosine transform | transformée en cosinus discrète | 离散余弦变换 |
| word embedding | word embedding; plongement de mots; plongement lexical | - |
| toy model | modèle-jouet | - |
| mesh; polygon mesh | mesh; maillage | 多边形网格 |
| disentanglement | - | 解耦 |
| bounding box | boîte englobante | - |
| anchor boxes; anchor | boîtes d'ancrage; ancre | - |
| region proposal | proposition de région | - |
| domain adaptation | adaptation de domaine | - |
| domain shift | changement de domaine | - |
| feature map | carte de caractéristiques | - |
| likelihood | vraisemblance | - |
| grid cell | cellule de grille | - |
| veil of ignorance | le voile d'ignorance | 无知之幕 |
| thought experiment | expérience en imagination | 思想实验 |
| perplexity | perplexité | 困惑度 |
| BLEU; bilingual evaluation understudy | BLEU | 双语替换评测 |
| setpoint; set point; set-point | valeur de consigne | 设定值; 目标值; 参考值 |
| non-maximum suppression | suppression des non-maxima | 非极大值抑制 |
| object detection | détection d'objet; reconnaissance d'objet | 物体识别 |
| human pose estimation | - | 人体姿态估计 |
| face detection | détection de visage | 人脸检测 |
| aspect ratio | format d'image; rapport de forme; rapport de cadre | 长宽比; 宽高比 |
| ground truth | vérité terrain | 基准真相 |
| clustering | partitionnement | 聚类 |
| cross entropy | entropie croisée | 交叉熵 |
| Kullback–Leibler divergence; relative entropy | divergence de Kullback-Leibler; divergence K-L; entropie relative | KL散度; 相对熵 |
| color depth; bit depth | profondeur de couleur | 色彩深度; 色深 |
| histogram of oriented gradients; HOG | histogramme de gradient orienté | 方向梯度直方图 |
| cross validation; rotation estimation; out-of-sample testing | validation croisée | 交叉验证; 循环估计 |
| t-SNE; t-distributed stochastic neighbor embedding | algorithme t-SNE | t分布随机邻域嵌入 |
| facial recognition system | système de reconnaissance faciale | 人脸识别系统 |
| facial landmark | repère facial; repère du visage | 人脸关键点 |
| FLOPS; floating point operations per second; flops; flop/s | FLOPS; opérations en virgule flottante par seconde | 每秒浮点运算次数; 每秒峰值速度 |
| radial basis function; RBF | fonction à base radiale; fonction radiale de base | 径向基函数 |
| spline | spline | 样条(函数) |
| evolutionary computation | - | 进化计算 |
| evolutionary algorithm; EA | algorithme évolutionniste; algorithme évolutionnaire | 进化算法 |
| particle swarm optimization; PSO | optimisation par essaims particulaires; OEP | 粒子群优化; 微粒群算法 |
| swarm intelligence; SI | intelligence distribuée; intelligence en essaim | 群体智能 |
| genetic algorithm; GA | algorithme génétique | 遗传算法 |
| metaheuristic | métaheuristique | 元启发算法 |
| evolution strategy; ES | stratégie d'évolution | 进化策略 |
| differential evolution; DE | (algorithme à) évolution différentielle | 差分进化算法; 微分进化算法 |
| low-discrepancy sequence; quasirandom sequence | suite à discrépance faible; suite quasi aléatoire; suite sous-aléatoire | 低差异序列 |
| inductive bias; learning bias | biais d'induction; biais inductif; biais d'apprentissage | 归纳偏置 |
| morphism | morphisme | 态射 |
| isomorphism | isomorphisme | 同构 |
| homomorphism | homomorphisme | 同态 |
| automorphism | automorphisme | 自同构 |
| endomorphism | endomorphisme | 自同态 |
| homeomorphism; topological isomorphism; bicontinuous function | homéomorphisme | 同胚 |
| isometry | isométrie | 等距同构; 保距映射; 等距 |
| diffeomorphism | difféomorphisme | 微分同胚 |
| abstract algebra; modern algebra | algèbre générale; algèbre abstraite | 抽象代数 |
| algebraic structure | structure algébrique | 代数结构 |
| group action | action de (d'un) groupe | 群作用 |
| generative adversarial network; GAN | réseau antagoniste génératif; réseau adverse génératif | 生成对抗网络 |
| manifold | variété | 流形 |
| diminishing returns | loi des rendements décroissants | 报酬递减 |
| simulated annealing | recuit simulé | 模拟退火 |
| dihedral group | groupe diédral | 二面体群 |
| countable set | ensemble dénombrable | 可数集; 可列集 |
| polyhedron | polyèdre | 多面体 |
| polygon | polygone | 多边形 |
| antiderivative; inverse derivative; primitive function; primitive integral; indefinite integral | primitive | 不定积分 |
| faculty psychology | psychologie des facultés | 官能心理学 |
| interpolation | interpolation | 插值, 内插 |
| extrapolation | extrapolation | 外推 |
| convex hull; convex envelope; convex closure | enveloppe convexe | 凸包 |
| simplex | simplexe | 单纯形 |
| simplicial complex | complexe simplicial | 单纯复形 |

Excerpt

Posted on 2020-06-29 | In Misc
| Words count in article 449

尔曹身与名俱灭,不废江河万古流。—— 杜甫《戏为六绝句》

伯乐一顾

枝干相持

暗室逢灯,绝渡逢舟。 —— 清·夏敬渠《野叟曝言》

将伯之助,义不敢忘。 —— 清·蒲松龄《聊斋志异·连琐》

大漠孤烟直,长河落日圆。 —— 唐·王维《使至塞上》

醉后不知天在水,满船清梦压星河。—— 元·唐珙《题龙阳县青草湖》

最是人间留不住,朱颜辞镜花辞树。—— 清·王国维《蝶恋花·阅尽天涯离别苦》

流水不腐,户枢不蠹。—— 《吕氏春秋·尽数》liú shuǐ bù fǔ,hù shū bú dù

莫唱当年长恨歌,人间亦自有银河。石壕村里夫妻别,泪比长生殿上多。—— 清·袁枚《马嵬》

小舟从此逝,江海寄余生。—— 宋·苏轼《临江仙·夜归临皋》

腰缠十万贯,骑鹤下扬州。—— 南朝·殷芸《殷芸小说·吴蜀人》

All happy families are alike; each unhappy family is unhappy in its own way. —— Anna Karenina by Leo Tolstoy

幸福的家庭都是相似的,而不幸的家庭则各有各的不幸。 —— 列夫托尔斯泰《安娜卡列妮娜》

Toutes les familles heureuses se ressemblent ; chaque famille malheureuse est malheureuse à sa façon.

J’en doutais pas un seul instant. Gloire à notre guide !

Il attend avec impatience le prochain confinement pour se remettre au jogging.

La réalité n’est pas aussi simple qu’elle en a l’air.

太阳底下无新鲜事。

Nothing new under the sun.

Il n’y a rien de nouveau sous le soleil.

what doesn’t kill you makes you stronger. —— Friedrich Nietzsche

Ce qui ne tue pas rend plus fort.

X, Y, Z, …, to name a few

X, Y, Z, …, pour n’en nommer que quelques-uns

料敌从宽,御敌从严。

郑伯克段于鄢 ——《左传》

Peser le pour et le contre

适者生存

Survival of the fittest

survie du plus apte

I am too full of life to be half-loved. —— Ijeoma Umebinyuo (Nigerian poet)


Latex memo

Posted on 2020-06-17 | In Misc
| Words count in article 1128

how to write the header (all in one)


\usepackage[utf8]{inputenc}


\usepackage{float}

\usepackage{appendix}

\usepackage{makecell}

% bold table headers, maybe optional  
\renewcommand\theadfont{\bf} 


% hyperlinks
\usepackage{hyperref}


\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{bbm}
\usepackage{amsthm}
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}


\newcommand{\eg}{{\em e.g.,~}}
\newcommand{\ie}{{\em i.e.,~}}
\newcommand{\vs}{{\em vs.~}}
\renewcommand{\iff}{{\em iff~}}
\newcommand{\iid}{{\em i.i.d.}}
\newcommand{\etal}{{\em ~et al.}}

% references
\bibliographystyle{plain}

references

% in the header
\bibliographystyle{plain}
{
\small
\bibliography{refs}
}

e.g. i.e., etc.


\newcommand{\eg}{{\em e.g.,~}}
\newcommand{\ie}{{\em i.e.,~}}
\newcommand{\vs}{{\em vs.~}}
\renewcommand{\iff}{{\em iff~}}
\newcommand{\iid}{{\em i.i.d.}}
\newcommand{\etal}{{\em ~et al.}}

Appendix

% in the header
\usepackage{appendix}
% in the main text
\appendix

Line spacing and margins

Change line spacing for the whole document:

\renewcommand{\baselinestretch}{1.0}

Change margins:

\usepackage[left=0.75in,top=0.6in,right=0.75in,bottom=0.6in]{geometry}

Microsoft engineers set the defaults for an A4 page (210 mm x 297 mm) at 1 inch (25.4 mm) top and bottom, with 1 and a 1/4 inch (31.7 mm) on each side.

Citations

Easily switch between (number) and (number+author) citation styles in the same document. The year is not shown. With plainnat, the numbering differs from unsrt: the citation numbers are not determined by the order of appearance.

\usepackage[numbers]{natbib}
\bibliographystyle{plainnat}
% Hyperlinks bib references.
% This should be used together with \bibliographystyle{plainnat}
\usepackage{hyperref}  

\cite{} % citation with number
\citet{} % citation with number and author

1-column abstract in 2-column document

https://www.texfaq.org/FAQ-onecolabs

\documentclass[twocolumn]{article}


...


\begin{document}

% 1-column abstract in 2-column document 
\twocolumn[
  \begin{@twocolumnfalse}
    \maketitle
    \begin{abstract}
    Blablabla
    \end{abstract}
  \end{@twocolumnfalse}
]


%\section{Introduction}

\end{document}

Be careful when using long equations that would span across two columns.

Figures

\begin{figure}[h]
    \centering
    \includegraphics[width=0.8\linewidth]{images/xxx.png}
    \caption{}
    \label{fig:xxx}
\end{figure}

Positioning of floating elements

In order to be able to use the option [H], do the following:

\usepackage{float}

[figure: table of figure position specifiers]

In order to place a figure spanning the two columns of a twocolumn document, we need to use the starred version of the environment, {figure*}, so that the figure occupies the two columns. Example:

\begin{figure*}[h]
    \centering
    \includegraphics[width=0.8\linewidth]{Figures/xxx.png}
    \caption{}
    \label{fig:xxx}
\end{figure*}

{figure*} should not be used together with [H].

Table

In order to make a table with 3 rows and 4 columns:

\usepackage{adjustbox}


\begin{table}[h]
	\centering
	\begin{adjustbox}{}
	\begin{tabular}{|c|c|c|c|}
	\hline
     &  &  &  \\
    \hline
     &  &  &  \\
    \hline
     &  &  &  \\
    \hline
    \end{tabular}
	\end{adjustbox}
	\caption{}
	\label{tab:}
\end{table}

If the above table has too many columns that do not correctly fit the page width, one can use \begin{adjustbox}{width=\linewidth}

The meaning of the parameters of \begin{tabular}:

  • l left-justified column
  • c centered column
  • r right-justified column
  • | vertical line
  • || double vertical line
  • ||| triple vertical line

Once in the tabular environment, & is the column separator, \\ starts a new row, and \hline draws a horizontal line.

\checkmark and $\times$ can be used for binary information.

\usepackage{makecell}

% bold table headers, maybe optional  
\renewcommand\theadfont{\bf} 

\makecell{A \\B \\C} can be used to break lines in table cells. \thead{A \\B \\C} additionally makes the text bold.

If one only needs the header to be bold, one can just put \bf before the name (no need to put any { or } around).

Import a csv file as table

The solution I found is to use a tool to transform a .csv file to a .tex file and then copy the text content onto our document. A tool that works well is CSV2LaTeX (http://brouits.free.fr/csv2latex/).

The stable version can be downloaded here: http://brouits.free.fr/csv2latex/csv2latex-0.22.tar.gz Unzip the downloaded file, then run make within the unzipped directory. This make command will not install the software in the system (at least not on my MacOS), it only generates an executable file called csv2latex. Put the executable file called csv2latex and the target .csv file under the same directory, do:

./csv2latex table.csv > document.tex

We can also specify the format of the csv file, and some output disposition.

  • The separator is a comma and the block delimiter is a double quote; this produces a LaTeX document with 40 lines per table, where the text is flushed left in each cell. 40 lines is a good average for A4 paper:
./csv2latex --separator c --block d --lines 40 --position l table.csv > document.tex
  • The separator is a semicolon and the block delimiter is a single quote; this produces a LaTeX document with 20 lines per table, where the text is centered in each cell:
./csv2latex --separator s --block q --lines 20 --position c table.csv > document.tex
  • If you have a tricky block delimiter such as $ and separators like tabs, let the program guess:
./csv2latex --guess table.csv > document.tex

The generated document.tex contains tabular code. We can copy-paste it into something like this:

\begin{table}[h]
\centering
\begin{adjustbox}{}
% TODO
\end{adjustbox}
\caption{}
\label{tab:}
\end{table}

Below are some other solutions. These solutions do not work well.

This solution will ignore rows whose row index contains special symbols such as %

\usepackage{csvsimple,booktabs,siunitx}


\begin{table}[h]
\centering
\begin{adjustbox}{}
\csvreader[tabular={|c|c|c|c|c|c|},
           table head=\toprule A & B & C & D & E & F \\ \midrule,
           table foot=\bottomrule]
           {csv_files/figure1.csv}
           {1=\a,2=\b,3=\c,4=\d,5=\e,6=\f}
           {\a & \b & \c & \d & \e & \f}
\end{adjustbox}
\caption{}
\label{tab:}
\end{table}

This solution does not ignore rows whose row index contains special symbols such as %. It also automatically sets up column names and columns. However, it is not clear how to specify the table format. Another problem is that if the special symbol _ appears in a column name, it becomes another symbol in the compiled pdf.

\usepackage{csvsimple,booktabs,siunitx}

\csvreader[
  respect all,
  autotabular
]{csv_files/figure1.csv}{}{\csvlinetotablerow}

This solution automatically generates column names as Column1, Column2, etc. The generated table has no lines inside the table.

% This solution does not need \usepackage{csvsimple}

\usepackage{datatool}

\begin{document}
...

\DTLloadrawdb[noheader,keys={},headers={}]{figure1}{csv_files/figure1.csv}

\begin{table}[h]
    \centering
\begin{tabular}{|c|c|c|c|}\hline
    \DTLdisplaydb{figure1}
    \ \hline
    \end{tabular}
\end{table}

\end{document}

Hyperlinks

\href{URL}{Text}

If the Text part also contains an URL, you may want to use \url{} for that part. Otherwise you may get the error Missing $ inserted if your URL contains symbols like _. However, don’t use \url{} for the URL part (the first part) of \href{}{}, otherwise you will get fatal compilation error.
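A minimal sketch of this advice (the URL is a placeholder): the link text contains an underscore, so the text part goes through \url{}, while the first argument of \href{}{} stays plain:

\href{https://example.com/some_page}{\url{https://example.com/some_page}}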

Multi-line (block) comments

\iffalse
blablabla...
\fi

Mathematics

\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{bbm}
\usepackage{amsthm}
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}

Glossary

\usepackage{glossaries}
\usepackage[nonumberlist]{glossaries} % disable page numbers after each glossaries

\makeglossaries

% does not allow line break in description
\newglossaryentry{}
{
    name=,
    description={}
}

% allows line break in description
\longnewglossaryentry{}
{
    name=,
    description={}
}

% lists all glossaries even if they are not used
\glsaddall

\begin{document}

...

\printglossaries

...

% Using defined terms
%% <label> is the X in \newglossaryentry{X}
%% which can be different from the "name" 
%% (Y in name=Y,)
\gls{<label>}
\glspl{<label>}
\Gls{<label>}
\Glspl{<label>}
\glslink{<label>}{<alternate text>}
\glssymbol{<label>}
\glsdesc{<label>}

\end{document}

Multicollinearity

Posted on 2020-06-09 | In Mathematics
| Words count in article 1301

Effect on prediction

Multicollinearity does not reduce the predictive power or reliability of the model as a whole, at least within the sample data set. It only affects calculations regarding individual features (interpretability). That is, a multivariate regression model with collinear features can indicate how well the entire set of features predicts the target variable, but it may not give valid results about any individual feature [5]. Strictly speaking, multicollinearity in the training set should only reduce predictive performance in the test set if the covariance between features in the training set and test set is different. If the covariance structure (and consequently the multicollinearity) is similar in both training and test datasets, then it does not pose a problem for prediction. Since a test set is typically a random subset of the full dataset, it’s generally reasonable to assume that the covariance structure is the same. Therefore, multicollinearity is typically not an issue for the prediction [4].

Effect on interpretation

Multicollinearity is a problem because it undermines the statistical significance of an independent feature. Other things being equal, the larger the standard error of a regression coefficient, the less likely it is that this coefficient will be statistically significant [1]. The presence of multicollinearity increases standard error measures, which lowers statistical significance (this may lead to failing to reject the false null hypothesis, type II error) [10].

Multicollinearity makes it difficult to determine the actual relationship between the target variable and the features, in the sense that the calculated value of the coefficients associated with features that are (nearly) collinear with other features may not be reliable and can dramatically change with slightly different input data.

To give some intuition, if we have some features that are exactly/perfectly collinear, we can still find coefficients minimizing the sum of squared errors, however the coefficients won’t be unique, in fact there will be an infinite range of possible coefficient combinations that are equally valid. If the features are nearly collinear, we have a soft version of this same problem. We may be able to fit a model and find unique regression coefficients, but they don’t mean much. With slightly different versions of the same data, the coefficients would slide around in an arbitrary way among the nearly collinear features [3].

Notice that the comments above have little to do with least squares and apply generally to many kinds of machine learning models.

Condition number

In linear regression, the condition number (条件数) can be used as a diagnostic for multicollinearity. In the field of numerical analysis, the condition number of a function measures how much the output value of the function can change for a small change in the input argument. A problem with a low condition number is said to be well-conditioned, while a problem with a high condition number is said to be ill-conditioned [6, 7].
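For a design matrix $X$, the condition number used by this diagnostic is computed from the singular values of $X$; a minimal statement of the definition (with respect to the 2-norm) is:

\[\kappa(X) = \frac{\sigma_\text{max}(X)}{\sigma_\text{min}(X)}\]

where $\sigma_\text{max}(X)$ and $\sigma_\text{min}(X)$ are the largest and smallest singular values of $X$. As a common rule of thumb (a heuristic, not a strict threshold), condition numbers above roughly 30 are taken as a sign of multicollinearity.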

Variance inflation factor (VIF)

Variance inflation factor (方差扩大因子) quantifies the severity of multicollinearity in an OLS regression analysis.
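For reference, the VIF of the $j$-th feature is obtained by regressing that feature on all the other features:

\[\text{VIF}_j = \frac{1}{1 - R_j^2}\]

where $R_j^2$ is the coefficient of determination of this auxiliary regression. A VIF of 1 indicates no linear relation with the other features; common rules of thumb treat values above 5 or 10 as problematic.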

Check Sun Haozhe’s Blog - statsmodels memo

Mathematical foundation

Recall the formulation of ordinary least squares (OLS):

\[y = X\theta^* + \epsilon\] \[\hat{\theta} = (X^\intercal X)^{-1} X^\intercal y\]

To construct confidence intervals or perform significance tests in OLS, we need to calculate the standard error of coefficients corresponding to each feature, that is

\[\text{SE}(\hat{\theta}) = \begin{pmatrix} \sqrt{\mathbb{Var}[\hat{\theta}_1]} \\ \sqrt{\mathbb{Var}[\hat{\theta}_2]} \\ ... \\ \sqrt{\mathbb{Var}[\hat{\theta}_p]} \end{pmatrix}\]

which is simply the square root of the diagonal of the covariance matrix of $\hat{\theta}$.

\[\begin{equation} \begin{split} \hat{\theta} &= (X^\intercal X)^{-1} X^\intercal y \\ &= (X^\intercal X)^{-1} X^\intercal (X\theta^* + \epsilon) \\ &= \theta^* + (X^\intercal X)^{-1} X^\intercal \epsilon \end{split} \end{equation}\] \[\hat{\theta} - \theta^* = (X^\intercal X)^{-1} X^\intercal \epsilon\]

By assuming $\mathbb{E}[\epsilon] = 0$,

\[\begin{equation} \begin{split} \mathbb{E}[\hat{\theta}] &= \mathbb{E}[\theta^* + (X^\intercal X)^{-1} X^\intercal \epsilon] \\ &= \theta^* + (X^\intercal X)^{-1} X^\intercal \mathbb{E}[\epsilon] \\ &= \theta^* \end{split} \end{equation}\] \[\mathbb{Var}[\epsilon] = \mathbb{E}[\epsilon \epsilon^\intercal]\]

Let’s now calculate the covariance matrix $\mathbb{Var}[\hat{\theta}]$:

\[\begin{equation} \begin{split} \mathbb{Var}[\hat{\theta}] &= \mathbb{E}[(\hat{\theta} - \mathbb{E}[\hat{\theta}] )(\hat{\theta} - \mathbb{E}[\hat{\theta}] )^\intercal] \\ &= \mathbb{E}[(\hat{\theta} - \theta^* )(\hat{\theta} - \theta^* )^\intercal] \\ &= \mathbb{E}[((X^\intercal X)^{-1} X^\intercal \epsilon )((X^\intercal X)^{-1} X^\intercal \epsilon )^\intercal] \\ &= (X^\intercal X)^{-1} X^\intercal \mathbb{E}[\epsilon \epsilon^\intercal] X (X^\intercal X)^{-1} \\ &= (X^\intercal X)^{-1} X^\intercal \mathbb{Var}[\epsilon] X (X^\intercal X)^{-1} \end{split} \end{equation}\]

If the errors are independent and have constant variance $\sigma^2$ (principal assumptions of linear regression), then

\[\mathbb{Var}[\epsilon] = \begin{pmatrix} \sigma^2 & 0 & \dots & 0 \\ 0 & \sigma^2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \sigma^2 \end{pmatrix} = \sigma^2 I\] \[\begin{equation} \begin{split} \mathbb{Var}[\hat{\theta}] &= (X^\intercal X)^{-1} X^\intercal \mathbb{Var}[\epsilon] X (X^\intercal X)^{-1} \\ &= \sigma^2 (X^\intercal X)^{-1} \end{split} \end{equation}\]

In order to estimate $\mathbb{Var}[\hat{\theta}]$, we need an estimate of $\sigma^2$.

The residual sum of squares (RSS) is the following:

\[\text{RSS} = (y - X \hat{\theta})^\intercal (y - X \hat{\theta}) = \sum_i (y_i - \hat{y}_i)^2\] where $\hat{y} = X \hat{\theta}$ denotes the vector of fitted values.

It can be proved [8] that

\[\frac{\text{RSS}}{\sigma^2} \sim \chi_{n-p}^2\]

In consequence,

\[\mathbb{E}[\frac{\text{RSS}}{n - p}] = \sigma^2\]

$\widehat{\sigma^2} = \frac{\text{RSS}}{n - p}$ is thus an unbiased estimate of $\sigma^2$.

\[\widehat{\mathbb{Var}[\hat{\theta}]} = \widehat{\sigma^2} (X^\intercal X)^{-1} = \frac{(y - X \hat{\theta})^\intercal (y - X \hat{\theta})}{n - p} (X^\intercal X)^{-1}\]

One result that remains to be confirmed is the following [9]:

[figure: the meaning behind $(X^\intercal X)^{-1}$, from [9]]

References

[1] (1997) The problem of multicollinearity. In: Understanding Regression Analysis. Springer, Boston, MA

[2] Standard Errors in OLS. (n.d.). Home. https://lukesonnet.com/teaching/inference/200d_standard_errors.pdf

[3] Michael Hochster’s answer to why is multicollinearity bad in Layman’s terms? In feature selection for a regression model (intended for use in prediction), why is it a bad thing to have multicollinearity, or highly correlated independent variables? (n.d.). Quora - A place to share knowledge and better understand the world. https://www.quora.com/Why-is-multicollinearity-bad-in-laymans-terms-In-feature-selection-for-a-regression-model-intended-for-use-in-prediction-why-is-it-a-bad-thing-to-have-multicollinearity-or-highly-correlated-independent-variables/answer/Michael-Hochster

[4] Multicollinearity and predictive performance. (n.d.). Cross Validated. https://stats.stackexchange.com/questions/361247/multicollinearity-and-predictive-performance

[5] Multicollinearity. (2005, February 9). Wikipedia, the free encyclopedia. Retrieved June 9, 2020, from https://en.wikipedia.org/wiki/Multicollinearity

[6] Condition number. (2001, October 29). Wikipedia, the free encyclopedia. Retrieved June 9, 2020, from https://en.wikipedia.org/wiki/Condition_number

[7] https://www.statsmodels.org/dev/examples/notebooks/generated/ols.html#Condition-number

[8] Why is RSS distributed CHI square times n-P? (n.d.). Cross Validated. https://stats.stackexchange.com/questions/20227/why-is-rss-distributed-chi-square-times-n-p

[9] The meaning behind $(X^TX)^{-1}$. (n.d.). Mathematics Stack Exchange. https://math.stackexchange.com/questions/2624986/the-meaning-behind-xtx-1/2625661#2625661

[10] Model selection: Regression models. (n.d.). https://campus.datacamp.com/courses/practicing-machine-learning-interview-questions-in-python/model-selection-and-evaluation-4?ex=9


Python assignment, mutable vs. immutable

Posted on 2020-05-29 | In Python
| Words count in article 864

Everything in Python is an object. An object’s mutability is determined by its type. Objects whose value can change are said to be mutable, objects whose value is unchangeable once they are created are called immutable.

Among Python built-in types:

  • Mutable objects include: list dict set
  • Immutable objects include: tuple str int float bool

User-defined classes are generally mutable. There are some exceptions, such as simple sub-classes of an immutable type.

Arguments are passed by assignment in Python. Since assignment just creates references to objects, the parameter passed in is actually a reference to an object, but the reference is passed by value. When we call a function with a parameter, a new reference is created that refers to the object passed in. This is separate from the reference that was used in the function call.

If we pass a mutable object into a function, the function gets a reference to that same object and we can mutate it, the change in the function (if any) will be reflected in the outer scope. But if we rebind the reference in the function, the outer reference will still point to the original object.

If we pass an immutable object into a function, the function gets a reference to that same object but we cannot mutate it as this object does not provide such methods. If we rebind the reference in the function, the outer reference will still point to the original object.
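A minimal runnable sketch of both cases (names are made up):

def mutate(lst):
    lst.append(4)  # mutates the shared object: visible to the caller

def rebind(lst):
    lst = [0]      # rebinds the local reference only: invisible to the caller

a = [1, 2, 3]
mutate(a)
print(a)  # [1, 2, 3, 4]
rebind(a)
print(a)  # still [1, 2, 3, 4]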

id() function

Assignment in Python means that the variable named on the left should now refer to the value (object) on the right. If the right-hand side is also a name (variable), for example y = x, this means y should now refer to the object x refers to.

Let’s consider this example:

a = 1
a = 2

In C, we believe that a is a memory location that stores the value 1, then is updated to store the value 2. However in Python, a starts as a reference to an object with the value 1, then gets reassigned as a reference to an object with the value 2. Those two objects may continue to coexist even though a doesn’t refer to the first one anymore; in fact they may be shared by any number of other references within the program.

Some links to help understand this:

  • Understanding Python Variables https://mathieularose.com/python-variables/
  • http://pythontutor.com/visualize.html#mode=display

Whenever an object is instantiated, it is assigned a unique object id. The id() function returns the identity of an object. This identity is unique and constant for the object during its lifetime. Two objects with non-overlapping lifetimes may have the same id() value. If we relate this to C, the id plays the role of the memory address; in Python it is simply a unique id. This function is generally used internally in Python.

str1 = "abc"
print(id(str1)) # 4330114944
  
str2 = "abc"
print(id(str2)) # 4330114944

str3 = "abcd"
print(id(str3)) # 4333274648

print(id(str1) == id(str2)) # True
print(id(str1) == id(str3)) # False
a = [1, 2, 3]
print(id(a)) # 4340769096

a[1] = 4
print(a) # [1, 4, 3]
print(id(a)) # 4340769096

a = [1, 4, 3]
print(id(a)) # 4340468680

print(id([1, 4, 3])) # 4340769096

One trap about tuple’s immutability

tuples are immutable, but their values may change. This may happen when a tuple holds a reference to any mutable object, such as a list. What is immutable is the physical content of a tuple, consisting of the object references only. The content/value of a list referenced by a name/variable in a tuple can be changed, but the id of the referenced list object remains the same. A tuple has no way of preventing changes to the values of its items, which are independent objects and may be reached through references outside of the tuple.

For example:

a = ([1, 2, 3], 4) 
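A minimal sketch of the trap: mutating the list inside the tuple succeeds, while assigning to the tuple slot raises an error:

a = ([1, 2, 3], 4)

a[0].append(5)  # OK: mutates the list object referenced by the tuple
print(a)        # ([1, 2, 3, 5], 4)

# a[0] = [9]    # would raise TypeError: 'tuple' object does not support item assignment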

String

Strings are immutable in Python:

a = "abcd"
a[0] = "h"
TypeError: 'str' object does not support item assignment

Concatenating strings in loops wastes lots of memory. As strings are immutable, concatenating two strings actually creates a third string which is the combination of the previous two. If we are iterating a lot and building a large string, we will waste a lot of memory creating and throwing away objects. Use the join technique instead, for example:

"".join(["first", "second", "other"])

Tuple

a = (1, 2, 3)
a[1] = 4
TypeError: 'tuple' object does not support item assignment

List

a = [1, 2, 3]
a[1] = 4
print(a)
[1, 4, 3]

Currency devaluation and revaluation

Posted on 2020-05-28 | In Misc
| Words count in article 389

Currency devaluation (货币贬值) and currency revaluation (货币升值).

Depreciation of the domestic currency:

  • Favors exports and hurts imports
  • Helps attract foreign investment
  • Helps expand foreign demand
  • Helps recover foreign debts
  • Helps reduce the country's international financial risk
  • Helps attract foreign tourists and grow the tourism industry
  • Raises the domestic prices of imported goods
  • After the domestic currency depreciates, the country's trade income usually improves (but this hurts its trading partners and may worsen international relations); the share of the foreign-trade sector in the whole economy grows, which increases the country's openness and lets more domestic products compete with foreign products
  • If the depreciation trend keeps developing, people will move funds from the country to other countries, causing capital outflows
  • Lowers the international credibility of the domestic currency, exposing it to selling pressure in financial markets

Appreciation of the domestic currency:

  • Favors imports and hurts exports
  • Favors outward foreign investment
  • Increases the international purchasing power of domestic firms and residents, which helps travel and study abroad
  • Makes it easier to repay foreign debts

Exchange-rate changes in a small country only slightly affect its trading partners' economies, but a devaluation by a major industrial country affects other countries' trade balances, which can trigger trade wars and currency wars and affect the development of the world economy.
