General information
Show the Docker version information
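```bash
docker version
```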
Display system-wide information
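```bash
docker info
```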
Display a live stream of container(s) resource usage statistics
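```bash
docker stats
```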
Show the history of an image
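```bash
docker history IMAGE
```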
Display low-level information on Docker objects using docker inspect. For example, to compare the content of images, one can look at the RootFS section: if all layers are identical, then the images contain identical content.
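For example (`my_image:tag` is a placeholder):

```bash
docker inspect my_image:tag
# print only the RootFS section
docker inspect --format '{{json .RootFS}}' my_image:tag
```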
Remove unused data: all stopped containers, unused networks, images (both dangling and unreferenced), and optionally, volumes.
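```bash
docker system prune
# also remove unused volumes
docker system prune --volumes
```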
Log in to a registry (required in order to push to Docker Hub)
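```bash
docker login
```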
Docker image
Build an image from the Dockerfile in the current directory and tag the image
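```bash
docker build -t myimage:1.0 .
```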
Pull an image from a registry
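```bash
docker pull myimage:1.0
```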
Retag a local image with a new image name and tag
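```bash
docker tag myimage:1.0 myrepo/myimage:2.0
```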
Push an image to a registry
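```bash
docker push myrepo/myimage:2.0
```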
List all images that are locally stored with the Docker Engine
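```bash
docker images
```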
Delete an image from the local image store
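```bash
docker image rm myimage:1.0
```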
Clean up unused images
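```bash
# remove dangling images
docker image prune
# remove all images not referenced by any container
docker image prune -a
```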
Docker container
List containers (only shows running)
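```bash
docker ps
```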
List all containers (including non-running)
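```bash
docker ps -a
```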
List the running containers (add --all to include stopped containers)
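```bash
docker container ls
```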
Run a container from the Alpine version 3.9 image, name the running container web and expose port 5000 externally, mapped to port 80 inside the container
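```bash
docker run --name web -p 5000:80 alpine:3.9
```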
The commands docker stop, docker start, docker restart:
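For example, with a container named `my_container`:

```bash
docker stop my_container     # gracefully stop a running container
docker start my_container    # start a stopped container
docker restart my_container  # stop, then start it again
```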
Copy files/folders between a container and the local filesystem
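```bash
docker cp CONTAINER:SRC_PATH DEST_PATH  # container -> local filesystem
docker cp SRC_PATH CONTAINER:DEST_PATH  # local filesystem -> container
```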
An example (the first line is run inside the docker container; the second and third lines are run on the local terminal):
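Something along these lines, assuming a running container named `web` (names and paths are made up):

```bash
# inside the container: create a file
echo "hello" > /home/data.txt
# on the local terminal: copy the file out of the container, then back in
docker cp web:/home/data.txt ./data.txt
docker cp ./data.txt web:/home/data.txt
```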
Another example:
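```bash
# copy an entire local directory into the container (names are made up)
docker cp ./my_dir web:/home/my_dir
```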
When you stop a container, it is not automatically removed unless you started it with the --rm flag. To see all containers on the Docker host, including stopped containers, use docker ps -a. You may be surprised how many containers exist, especially on a development system! A stopped container’s writable layers still take up disk space. To clean this up, you can use the docker container prune command.
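```bash
docker container prune
```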
Remove one or more containers (effectively a stop plus a "prune" for the targeted containers)
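```bash
docker rm my_container     # remove a stopped container
docker rm -f my_container  # force-remove a running container (stop + remove)
```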
Volume
Volumes can be used by one or more containers, and take up space on the Docker host. Volumes are never removed automatically, because to do so could destroy data.
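```bash
docker volume prune
```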
Network
List the networks
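```bash
docker network ls
# remove all unused networks (a likely companion command)
docker network prune
```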
The command: docker run
docker run runs a command in a new container.
Use -dit with docker run so that the container runs in the background (detached), and use docker exec -it ... to get a shell inside it. Typing exit in that shell detaches from the container and leaves it running.
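For example (names are placeholders):

```bash
# start a container in the background with an interactive TTY
docker run -dit --name my_container ubuntu bash
# get a shell inside it; typing `exit` here only ends this shell,
# the container keeps running
docker exec -it my_container bash
```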
Run a container that uses CUDA GPUs
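One way, assuming the NVIDIA Container Toolkit is installed (Docker 19.03+; the image tag is just an example):

```bash
docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
```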
If you use PyTorch's DataLoader with num_workers greater than 0 inside a docker container, you probably need to raise the shared memory limit by passing --shm-size=2gb, --ipc=host, or -v /dev/shm:/dev/shm to docker run.
The command: docker commit
docker commit creates a new image from a container’s changes. The commit operation will not include any data contained in volumes mounted inside the container. By default, the container being committed and its processes will be paused while the image is committed. This reduces the likelihood of encountering data corruption during the process of creating the commit. If this behavior is undesired, set the --pause option to false.
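For example (names are placeholders):

```bash
docker commit my_container my_image:v2
# optionally with an author and a commit message
docker commit -a "author" -m "message" my_container my_image:v2
```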
Docker Hub
docker push pushes an image or a repository to a registry.
You may need to create the repository on the Docker Hub website first, before running the following command:
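```bash
docker push username/repository:tag
```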
Example
Docker image sunhaozhe/pytorch-cu100-jupyter-gym:
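Presumably pulled with:

```bash
docker pull sunhaozhe/pytorch-cu100-jupyter-gym
```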
Docker image sunhaozhe/pytorch-cu100-gym-tmux-tensorboardx:
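Presumably pulled with:

```bash
docker pull sunhaozhe/pytorch-cu100-gym-tmux-tensorboardx
```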
all-in-one with jupyter, CPU-only / Python 3.
all-in-one with jupyter, CUDA 10.0 / Python 3.6
- -p 18888:8888 for jupyter notebook
- -p 8097:8097 for visdom server
- -p 6006:6006 for tensorboardx
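A typical run command combining these mappings might look like this (the container name is a placeholder):

```bash
docker run -dit --name my_env \
  -p 18888:8888 -p 8097:8097 -p 6006:6006 \
  --ipc=host sunhaozhe/pytorch-cu100-gym-tmux-tensorboardx bash
```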
Missing packages
Vim
```bash
apt update
apt search vim
apt install vim
vim --version
```
Troubleshooter
- WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
  - Meaning: docker run won't impose any limitation on the use of swap space. However, the warning also says that the -m/--memory option will still take effect, and the maximum amount of user memory (including file cache) will be set as intended. See https://stackoverflow.com/a/63726105/7636942
- ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm). (See the References below.)
  - Two possibilities:
    - Set num_workers to 0 (according to the doc of PyTorch torch.utils.data.DataLoader), but this will slow down training.
    - Use the flag --ipc=host when executing docker run ...; be careful of potential security issues.
- If you get a Killed message within a docker container, it probably means that you don't have enough memory/RAM (use the command free -h to check the amount of available memory). Either increase the amount of memory of the host machine, or increase the amount of memory docker is allowed to use.
Shared memory
Background
Shared memory is memory that may be simultaneously accessed by multiple programs with an intent to provide communication among them or avoid redundant copies. Shared memory is an efficient means of passing data between programs. Depending on context, programs may run on a single processor or on multiple separate processors. Using memory for communication inside a single program, e.g. among its multiple threads, is also referred to as shared memory.
The shared memory device, /dev/shm, provides a temporary file storage filesystem using RAM for storing files. It’s not mandatory to have /dev/shm, although it’s probably desirable since it facilitates inter-process communication (IPC). Why would you use /dev/shm instead of just stashing a temporary file under /tmp? Well, /dev/shm exists in RAM (so it’s fast), whereas /tmp resides on disk (so it’s relatively slow). The shared memory behaves just like a normal file system, but it’s all in RAM.
To see how big /dev/shm is:
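```bash
df -h /dev/shm
```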
In order to check what’s currently under /dev/shm:
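```bash
ls -l /dev/shm
```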
Shared memory for Docker containers
Docker containers are allocated 64 MB of shared memory by default.
We can change the amount of shared memory allocated to a container by using the --shm-size option of docker run:
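```bash
# give the container 2 GB of shared memory instead of the default 64 MB
docker run --rm --shm-size=2g ubuntu df -h /dev/shm
```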
Unit can be b (bytes), k (kilobytes), m (megabytes), or g (gigabytes). If you omit the unit, the system uses bytes.
Note that in the above example, the container is getting its own /dev/shm, separate from that of the host.
What about sharing memory between the host and a container or between containers? This can be done by mounting /dev/shm:
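```bash
# the container now shares /dev/shm with the host
docker run --rm -v /dev/shm:/dev/shm ubuntu df -h /dev/shm
```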
The option --ipc=host can be used instead of --shm-size=XXX. --shm-size=XXX is enough, but you need to pick a shared memory size large enough for your workload, whereas --ipc=host sets the shared memory to the same value it has on bare metal. However, --ipc=host removes a layer of security and creates new attack vectors: any application running on the host that misbehaves when presented with malicious data in shared memory segments can become a potential attack vector.
If you use DataLoader of PyTorch with num_workers greater than 0 in a docker container, you probably need to raise the shared memory limit because the default is only 64 MB:
RuntimeError: DataLoader worker (pid 585) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
If you still get errors like RuntimeError: DataLoader worker (pid 4161) is killed by signal: Killed. or segmentation faults, even though:
- you already set --ipc=host
- your shared memory (type df -h in the docker container) is as big as 15G or even 40G+
- you even tried a batch size of 1...
- your code runs perfectly with num_workers=0 but breaks as soon as it is larger than 0 (even with num_workers=1)
Then you can possibly solve this issue by simplifying the __getitem__() method of your custom PyTorch dataset class, i.e. by making the code in __getitem__() as simple as possible. For example, it is better to avoid using a pandas DataFrame in __getitem__(), for reasons I do not understand. If your __getitem__() originally maps idx to img_path using a pd.DataFrame, create a plain Python list items that stores the same information during the dataset's __init__() stage, so that __getitem__() only ever reads from the list and never touches the pd.DataFrame. If this still does not solve the issue, try increasing the host's total memory or simplifying the data augmentation used in __getitem__(). In my case, I reduced the amount of computation related to data augmentation and dataset loading before the training loop begins (outside __getitem__(), thus outside the dataloader's multiprocessing), and mysteriously this solved the issue; I have no clue why.
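A minimal sketch of this DataFrame-to-list trick (the class name, CSV column, and paths are hypothetical):

```python
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, csv_path):
        df = pd.read_csv(csv_path)  # pandas is only touched here, in __init__
        # Materialize the idx -> img_path mapping into a plain Python list,
        # so that __getitem__() never accesses a pd.DataFrame from the
        # forked DataLoader worker processes.
        self.items = df["img_path"].tolist()

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        # Keep this as simple as possible: a list lookup plus image loading.
        return Image.open(self.items[idx]).convert("RGB")
```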
Some other clues:
- import torch before sklearn
- upgrade sklearn to the latest version
- move the PyTorch import before other imports
- reduce the input image size
- keep the data you read and the Python environment on the same mount. In one reported case, the Python environment lived in the user's home directory (on a remote mount shared across multiple compute nodes) while the data was local to the compute node; after moving everything to be local to the compute node, the segfault went away.
- one user removed all the pandas logic from a project and still encountered the issue, then suspected cv2 (which is known to not play nicely with multiprocessing) or numpy-based image operations rather than PIL ones; removing cv2 did not help either. What was most puzzling was that Python itself was segfaulting, not cv2 or the PyTorch libraries; upgrading from Python 2.7 to Python 3.6 finally seemed to fix it.
References
- http://www.ruanyifeng.com/blog/2018/02/docker-tutorial.html
- https://docs.docker.com/engine/reference/commandline/docker/
- https://datawookie.dev/blog/2021/11/shared-memory-docker/
- https://github.com/pytorch/pytorch/issues/1158
- https://github.com/pytorch/pytorch/issues/8976#issuecomment-401564899
- https://github.com/pytorch/pytorch/issues/4969