<h1>Seccomp security profiles for Docker</h1>
<p>Secure computing mode (<code class="language-plaintext highlighter-rouge">seccomp</code>) is a Linux kernel feature that restricts the system calls a process can make. The <code class="language-plaintext highlighter-rouge">seccomp()</code> system call operates on the seccomp state of the calling process. Docker uses this feature to restrict the actions available within a container.</p> <p>This feature is available only if Docker has been built with <code class="language-plaintext highlighter-rouge">seccomp</code> and the kernel is configured with <code class="language-plaintext highlighter-rouge">CONFIG_SECCOMP</code> enabled. To check if your kernel supports <code class="language-plaintext highlighter-rouge">seccomp</code>:</p> <div class="highlight"><pre class="highlight" data-language="">$ grep CONFIG_SECCOMP= /boot/config-$(uname -r)
CONFIG_SECCOMP=y
</pre></div> <h2 id="pass-a-profile-for-a-container">Pass a profile for a container</h2> <p>The default <code class="language-plaintext highlighter-rouge">seccomp</code> profile provides a sane default for running containers with seccomp and disables around 44 system calls out of 300+. It is moderately protective while providing wide application compatibility. The default Docker profile can be found <a href="https://github.com/moby/moby/blob/master/profiles/seccomp/default.json">here</a>.</p> <p>In effect, the profile is an allowlist which denies access to system calls by default, then allowlists specific system calls. The profile works by defining a <code class="language-plaintext highlighter-rouge">defaultAction</code> of <code class="language-plaintext highlighter-rouge">SCMP_ACT_ERRNO</code> and overriding that action only for specific system calls. The effect of <code class="language-plaintext highlighter-rouge">SCMP_ACT_ERRNO</code> is to cause a <code class="language-plaintext highlighter-rouge">Permission Denied</code> error. Next, the profile defines a specific list of system calls which are fully allowed, because their <code class="language-plaintext highlighter-rouge">action</code> is overridden to be <code class="language-plaintext highlighter-rouge">SCMP_ACT_ALLOW</code>. Finally, some rules apply to individual system calls, such as <code class="language-plaintext highlighter-rouge">personality</code>, to allow variants of those system calls with specific arguments.</p> <p><code class="language-plaintext highlighter-rouge">seccomp</code> is instrumental for running Docker containers with least privilege. It is not recommended to change the default <code class="language-plaintext highlighter-rouge">seccomp</code> profile.</p> <p>When you run a container, it uses the default profile unless you override it with the <code class="language-plaintext highlighter-rouge">--security-opt</code> option. 
For example, the following explicitly specifies a policy:</p> <div class="highlight"><pre class="highlight" data-language="">$ docker run --rm \
-it \
--security-opt seccomp=/path/to/seccomp/profile.json \
hello-world
</pre></div> <h3 id="significant-syscalls-blocked-by-the-default-profile">Significant syscalls blocked by the default profile</h3> <p>Docker’s default seccomp profile is an allowlist which specifies the calls that are allowed. The table below lists the significant (but not all) syscalls that are effectively blocked because they are not on the allowlist. The table includes the reason each syscall is blocked rather than allowlisted.</p> <table> <thead> <tr> <th>Syscall</th> <th>Description</th> </tr> </thead> <tbody> <tr> <td><code class="language-plaintext highlighter-rouge">acct</code></td> <td>Accounting syscall which could let containers disable their own resource limits or process accounting. Also gated by <code class="language-plaintext highlighter-rouge">CAP_SYS_PACCT</code>.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">add_key</code></td> <td>Prevent containers from using the kernel keyring, which is not namespaced.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">bpf</code></td> <td>Deny loading potentially persistent bpf programs into kernel, already gated by <code class="language-plaintext highlighter-rouge">CAP_SYS_ADMIN</code>.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">clock_adjtime</code></td> <td>Time/date is not namespaced. Also gated by <code class="language-plaintext highlighter-rouge">CAP_SYS_TIME</code>.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">clock_settime</code></td> <td>Time/date is not namespaced. Also gated by <code class="language-plaintext highlighter-rouge">CAP_SYS_TIME</code>.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">clone</code></td> <td>Deny cloning new namespaces. 
Also gated by <code class="language-plaintext highlighter-rouge">CAP_SYS_ADMIN</code> for CLONE_* flags, except <code class="language-plaintext highlighter-rouge">CLONE_NEWUSER</code>.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">create_module</code></td> <td>Deny manipulation and functions on kernel modules. Obsolete. Also gated by <code class="language-plaintext highlighter-rouge">CAP_SYS_MODULE</code>.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">delete_module</code></td> <td>Deny manipulation and functions on kernel modules. Also gated by <code class="language-plaintext highlighter-rouge">CAP_SYS_MODULE</code>.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">finit_module</code></td> <td>Deny manipulation and functions on kernel modules. Also gated by <code class="language-plaintext highlighter-rouge">CAP_SYS_MODULE</code>.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">get_kernel_syms</code></td> <td>Deny retrieval of exported kernel and module symbols. Obsolete.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">get_mempolicy</code></td> <td>Syscall that modifies kernel memory and NUMA settings. Already gated by <code class="language-plaintext highlighter-rouge">CAP_SYS_NICE</code>.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">init_module</code></td> <td>Deny manipulation and functions on kernel modules. Also gated by <code class="language-plaintext highlighter-rouge">CAP_SYS_MODULE</code>.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">ioperm</code></td> <td>Prevent containers from modifying kernel I/O privilege levels. Already gated by <code class="language-plaintext highlighter-rouge">CAP_SYS_RAWIO</code>.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">iopl</code></td> <td>Prevent containers from modifying kernel I/O privilege levels. 
Already gated by <code class="language-plaintext highlighter-rouge">CAP_SYS_RAWIO</code>.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">kcmp</code></td> <td>Restrict process inspection capabilities, already blocked by dropping <code class="language-plaintext highlighter-rouge">CAP_SYS_PTRACE</code>.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">kexec_file_load</code></td> <td>Sister syscall of <code class="language-plaintext highlighter-rouge">kexec_load</code> that does the same thing, slightly different arguments. Also gated by <code class="language-plaintext highlighter-rouge">CAP_SYS_BOOT</code>.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">kexec_load</code></td> <td>Deny loading a new kernel for later execution. Also gated by <code class="language-plaintext highlighter-rouge">CAP_SYS_BOOT</code>.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">keyctl</code></td> <td>Prevent containers from using the kernel keyring, which is not namespaced.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">lookup_dcookie</code></td> <td>Tracing/profiling syscall, which could leak a lot of information on the host. Also gated by <code class="language-plaintext highlighter-rouge">CAP_SYS_ADMIN</code>.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">mbind</code></td> <td>Syscall that modifies kernel memory and NUMA settings. 
Already gated by <code class="language-plaintext highlighter-rouge">CAP_SYS_NICE</code>.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">mount</code></td> <td>Deny mounting, already gated by <code class="language-plaintext highlighter-rouge">CAP_SYS_ADMIN</code>.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">move_pages</code></td> <td>Syscall that modifies kernel memory and NUMA settings.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">name_to_handle_at</code></td> <td>Sister syscall to <code class="language-plaintext highlighter-rouge">open_by_handle_at</code>. Already gated by <code class="language-plaintext highlighter-rouge">CAP_DAC_READ_SEARCH</code>.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">nfsservctl</code></td> <td>Deny interaction with the kernel nfs daemon. Obsolete since Linux 3.1.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">open_by_handle_at</code></td> <td>Cause of an old container breakout. Also gated by <code class="language-plaintext highlighter-rouge">CAP_DAC_READ_SEARCH</code>.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">perf_event_open</code></td> <td>Tracing/profiling syscall, which could leak a lot of information on the host.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">personality</code></td> <td>Prevent container from enabling BSD emulation. 
Not inherently dangerous, but poorly tested, potential for a lot of kernel vulns.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">pivot_root</code></td> <td>Deny <code class="language-plaintext highlighter-rouge">pivot_root</code>, should be privileged operation.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">process_vm_readv</code></td> <td>Restrict process inspection capabilities, already blocked by dropping <code class="language-plaintext highlighter-rouge">CAP_SYS_PTRACE</code>.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">process_vm_writev</code></td> <td>Restrict process inspection capabilities, already blocked by dropping <code class="language-plaintext highlighter-rouge">CAP_SYS_PTRACE</code>.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">ptrace</code></td> <td>Tracing/profiling syscall. Blocked in Linux kernel versions before 4.8 to avoid seccomp bypass. Tracing/profiling arbitrary processes is already blocked by dropping <code class="language-plaintext highlighter-rouge">CAP_SYS_PTRACE</code>, because it could leak a lot of information on the host.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">query_module</code></td> <td>Deny manipulation and functions on kernel modules. Obsolete.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">quotactl</code></td> <td>Quota syscall which could let containers disable their own resource limits or process accounting. Also gated by <code class="language-plaintext highlighter-rouge">CAP_SYS_ADMIN</code>.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">reboot</code></td> <td>Don’t let containers reboot the host. 
Also gated by <code class="language-plaintext highlighter-rouge">CAP_SYS_BOOT</code>.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">request_key</code></td> <td>Prevent containers from using the kernel keyring, which is not namespaced.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">set_mempolicy</code></td> <td>Syscall that modifies kernel memory and NUMA settings. Already gated by <code class="language-plaintext highlighter-rouge">CAP_SYS_NICE</code>.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">setns</code></td> <td>Deny associating a thread with a namespace. Also gated by <code class="language-plaintext highlighter-rouge">CAP_SYS_ADMIN</code>.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">settimeofday</code></td> <td>Time/date is not namespaced. Also gated by <code class="language-plaintext highlighter-rouge">CAP_SYS_TIME</code>.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">stime</code></td> <td>Time/date is not namespaced. Also gated by <code class="language-plaintext highlighter-rouge">CAP_SYS_TIME</code>.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">swapon</code></td> <td>Deny start/stop swapping to file/device. Also gated by <code class="language-plaintext highlighter-rouge">CAP_SYS_ADMIN</code>.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">swapoff</code></td> <td>Deny start/stop swapping to file/device. Also gated by <code class="language-plaintext highlighter-rouge">CAP_SYS_ADMIN</code>.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">sysfs</code></td> <td>Obsolete syscall.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">_sysctl</code></td> <td>Obsolete, replaced by /proc/sys.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">umount</code></td> <td>Should be a privileged operation. 
Also gated by <code class="language-plaintext highlighter-rouge">CAP_SYS_ADMIN</code>.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">umount2</code></td> <td>Should be a privileged operation. Also gated by <code class="language-plaintext highlighter-rouge">CAP_SYS_ADMIN</code>.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">unshare</code></td> <td>Deny cloning new namespaces for processes. Also gated by <code class="language-plaintext highlighter-rouge">CAP_SYS_ADMIN</code>, with the exception of <code class="language-plaintext highlighter-rouge">unshare --user</code>.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">uselib</code></td> <td>Older syscall related to shared libraries, unused for a long time.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">userfaultfd</code></td> <td>Userspace page fault handling, largely needed for process migration.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">ustat</code></td> <td>Obsolete syscall.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">vm86</code></td> <td>In-kernel x86 real mode virtual machine. Also gated by <code class="language-plaintext highlighter-rouge">CAP_SYS_ADMIN</code>.</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">vm86old</code></td> <td>In-kernel x86 real mode virtual machine. Also gated by <code class="language-plaintext highlighter-rouge">CAP_SYS_ADMIN</code>.</td> </tr> </tbody> </table> <h2 id="run-without-the-default-seccomp-profile">Run without the default seccomp profile</h2> <p>You can pass <code class="language-plaintext highlighter-rouge">unconfined</code> as the seccomp value of <code class="language-plaintext highlighter-rouge">--security-opt</code> to run a container without the default seccomp profile.</p> <div class="highlight"><pre class="highlight" data-language="">$ docker run --rm -it --security-opt seccomp=unconfined debian:jessie \
unshare --map-root-user --user sh -c whoami
</pre></div>
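<p>As an alternative to running unconfined, the allowlist structure described under "Pass a profile for a container" can be sketched in a few lines of Python. This is a simplified illustration, not Docker's implementation: the <code class="language-plaintext highlighter-rouge">PROFILE</code> contents and the <code class="language-plaintext highlighter-rouge">action_for</code> helper are hypothetical, the syscall list is far shorter than the real default profile, and argument-specific rules (such as those for <code class="language-plaintext highlighter-rouge">personality</code>) are ignored.</p>

```python
import json

# Hypothetical, heavily trimmed profile for illustration only.
# It is NOT Docker's default profile, which allows several hundred syscalls.
PROFILE = {
    # Any syscall not matched below fails with an error.
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {
            # Syscalls whose action is overridden to "allow".
            "names": ["read", "write", "exit", "exit_group"],
            "action": "SCMP_ACT_ALLOW",
        },
    ],
}


def action_for(profile, syscall):
    """Return the action the profile assigns to a syscall name.

    Simplification: only the name is matched; per-argument rules are ignored.
    """
    for rule in profile.get("syscalls", []):
        if syscall in rule.get("names", []):
            return rule["action"]
    # No rule matched, so the allowlist's default applies.
    return profile["defaultAction"]


if __name__ == "__main__":
    for name in ("read", "mount"):
        print(name, "->", action_for(PROFILE, name))
    # Serialized form, as it could be saved to a profile file.
    print(json.dumps(PROFILE, indent=2))
```

<p>A profile saved in this JSON shape could then be passed with <code class="language-plaintext highlighter-rouge">--security-opt seccomp=/path/to/seccomp/profile.json</code> as shown earlier. A list as short as the one in this sketch would most likely prevent a real container from starting, so it is meant only to show how <code class="language-plaintext highlighter-rouge">defaultAction</code> and per-syscall overrides interact.</p>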
<div class="_attribution">
<p class="_attribution-p">
© 2019 Docker, Inc.<br>Licensed under the Apache License, Version 2.0.<br>Docker and the Docker logo are trademarks or registered trademarks of Docker, Inc. in the United States and/or other countries.<br>Docker, Inc. and other parties may also have trademark rights in other terms used herein.<br>
<a href="https://docs.docker.com/engine/security/seccomp/" class="_attribution-link">https://docs.docker.com/engine/security/seccomp/</a>
</p>
</div>