Nftables quick howto

Introduction

This document is between a dirty howto and a cheat sheet. For a short description of some interesting nftables features, you can read Why you will love nftables.

For a description of the architecture and ideas behind nftables, please read the announcement of the first release of nftables. For more general information, you can also watch the talk I gave at Kernel Recipes: Eric Leblond, OISF – Nftables.

Building nftables

Libraries

The following libraries are needed:

libmnl: git://git.netfilter.org/libmnl
libnftnl: git://git.netfilter.org/libnftnl

Your distribution may already include libmnl, but both libraries are easy to build, as they use the standard sequence:

./autogen.sh
./configure
make
make install
ldconfig

nftables

First install dependencies:

aptitude install libgmp-dev libreadline-dev

If you want to build the documentation:

aptitude install docbook2x docbook-utils

Then fetch the sources and build nftables:

git clone git://git.netfilter.org/nftables
cd nftables
./autogen.sh
./configure
make
make install

kernel

If you do not already have a Linux git tree, run:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

If you already have a Linux git tree, you can just update to the latest sources:

cd linux
git pull --rebase

Now that you have the sources, you can select the nftables options:

$ make oldconfig

Netfilter Xtables support (required for ip_tables) (NETFILTER_XTABLES) [M/y/?] m
Netfilter nf_tables support (NF_TABLES) [N/m] (NEW) m
  Netfilter nf_tables payload module (NFT_PAYLOAD) [N/m] (NEW) m
  Netfilter nf_tables IPv6 exthdr module (NFT_EXTHDR) [N/m] (NEW) m
  Netfilter nf_tables meta module (NFT_META) [N/m] (NEW) m
  Netfilter nf_tables conntrack module (NFT_CT) [N/m] (NEW) m
  Netfilter nf_tables rbtree set module (NFT_RBTREE) [N/m] (NEW) m
  Netfilter nf_tables hash set module (NFT_HASH) [N/m] (NEW) m
  Netfilter nf_tables counter module (NFT_COUNTER) [N/m] (NEW) m
  Netfilter nf_tables log module (NFT_LOG) [N/m] (NEW) m
  Netfilter nf_tables limit module (NFT_LIMIT) [N/m] (NEW) m
  Netfilter nf_tables nat module (NFT_NAT) [N/m] (NEW) m
  Netfilter x_tables over nf_tables module (NFT_COMPAT) [N/m/?] (NEW) m

IPv4 nf_tables support (NF_TABLES_IPV4) [N/m] (NEW) m
  nf_tables IPv4 reject support (NFT_REJECT_IPV4) [N/m] (NEW) m
  IPv4 nf_tables route chain support (NFT_CHAIN_ROUTE_IPV4) [N/m] (NEW) m
  IPv4 nf_tables nat chain support (NFT_CHAIN_NAT_IPV4) [N/m] (NEW) m

IPv6 nf_tables support (NF_TABLES_IPV6) [M/n] m
  IPv6 nf_tables route chain support (NFT_CHAIN_ROUTE_IPV6) [M/n] m
  IPv6 nf_tables nat chain support (NFT_CHAIN_NAT_IPV6) [M/n] m

Ethernet Bridge nf_tables support (NF_TABLES_BRIDGE) [N/m/y] (NEW) m

Now, you can build your kernel with the usual commands.

On Debian, with a dual-core machine, you can run:

make -j 2 deb-pkg

Alternatively, you can use the old method:

CONCURRENCY_LEVEL=2 make-kpkg --revision 0.1 --rootcmd fakeroot  --initrd   --append-to-version nftables kernel_image kernel_headers

Debian users can also get kernel build from git sources:

Other related packages are available in this directory.

Running it

Initial setup

To get an iptables-like chain setup, use the ipv4-filter file provided in the source:

nft -f files/nftables/ipv4-filter

You can then list the resulting chain:

nft list table filter

Note that filter, as well as output and input, are just the names chosen here for the table and its chains. Any other string could have been used.
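
If you do not have the source tree at hand, an equivalent file is easy to write yourself. The following is only a sketch, assuming the same file syntax as the onechain example later in this document; the temporary file name is arbitrary and the content is hand-written, not copied from the nftables sources:

```shell
# Write a hand-made equivalent of the ipv4-filter file to a temporary file.
# The three chains mirror the iptables filter table: input, forward, output.
ruleset=$(mktemp)
cat > "$ruleset" <<'EOF'
#! nft -f

table filter {
        chain input {
                type filter hook input priority 0;
        }
        chain forward {
                type filter hook forward priority 0;
        }
        chain output {
                type filter hook output priority 0;
        }
}
EOF
# Loading the file needs root and an nftables-enabled kernel, so only print the command.
echo "wrote $ruleset; load it with: nft -f $ruleset"
```

The script only writes the file; loading it is left to you so that it can be reviewed first.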

Basic rule handling

To drop output to a destination:

nft add rule ip filter output  ip daddr 1.2.3.4 drop

Rule counters are optional with nftables, and the counter keyword needs to be used to activate them:

nft add rule ip filter output  ip daddr 1.2.3.4 counter drop

To add a rule matching a whole network, you can directly use:

nft add rule ip filter output ip daddr 192.168.1.0/24 counter

To drop packets to port 80, the syntax is the following:

nft add rule ip filter input tcp dport 80 drop

To accept ICMP echo request:

nft add rule  filter input icmp type echo-request accept

To combine filters, you just have to specify several match expressions in the same rule:

nft add rule ip filter output ip protocol icmp  ip daddr 1.2.3.4 counter drop

To delete all rules in a chain:

nft delete rule filter output

To delete one specific rule, you need to use the -a flag on nft to get the handle number:

# nft list table filter -a
table filter {
        chain output {
                 ip protocol icmp ip daddr 1.2.3.4 counter packets 5 bytes 420 drop # handle 10
...

You can then delete rule 10 with:

nft delete rule filter output handle 10
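
When this has to be scripted, the handle can be extracted from the listing with standard text tools. This is a minimal sketch, assuming the listing format shown above (handle at the end of the line after "# handle"); handle_of is a made-up helper name and the listing is a pasted sample rather than live nft output:

```shell
# handle_of PATTERN: read an "nft list table ... -a" listing on stdin and print
# the handle of the first rule containing the fixed string PATTERN.
# (handle_of is a hypothetical helper, not part of nft.)
handle_of() {
    grep -F -- "$1" | sed -n 's/.*# handle \([0-9][0-9]*\)$/\1/p' | head -n 1
}

# Sample listing copied from the output above; on a real system you would pipe
# the output of "nft list table filter -a" instead.
listing='table filter {
        chain output {
                 ip protocol icmp ip daddr 1.2.3.4 counter packets 5 bytes 420 drop # handle 10
        }
}'

h=$(printf '%s\n' "$listing" | handle_of 'ip daddr 1.2.3.4')
# Print the delete command instead of running it, so it can be checked first.
echo "nft delete rule filter output handle $h"
```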

You can also flush the filter table:

nft flush table filter

It is possible to insert a rule:

nft insert rule filter input tcp dport 80 counter accept

It is possible to insert or add a rule at a specific position. To do so you need to get the handle of the rule where you want to insert or add a new one. This is done by using the -a flag in the list operation:

# nft list table filter -n  -a
table filter {
        chain output {
                 type filter hook output priority 0;
                 ip protocol tcp counter packets 82 bytes 9680 # handle 8
                 ip saddr 127.0.0.1 ip daddr 127.0.0.6 drop # handle 7
        }
}
# nft  add rule filter output position 8 ip daddr 127.0.0.8 drop 
# nft list table filter -n -a
table filter {
        chain output {
                 type filter hook output priority 0;
                 ip protocol tcp counter packets 190 bytes 21908 # handle 8
                 ip daddr 127.0.0.8 drop # handle 10
                 ip saddr 127.0.0.1 ip daddr 127.0.0.6 drop # handle 7
        }
}

Here, we’ve added a rule after the rule with handle 8. To add before the rule with a given handle, you can use:

nft insert rule filter output position 8 ip daddr 127.0.0.12 drop

If you only want to match on a protocol, you can use something like:

nft insert rule filter output ip  protocol tcp counter

IPv6

As for IPv4, you need to create some chains. For that you can use:

nft -f files/nftables/ipv6-filter

You can then add a rule:

nft add rule ip6 filter output ip6 daddr home.regit.org counter

The rules can be listed with:

nft list table ip6 filter

To accept dynamic IPv6 configuration and neighbor discovery, one can use:

nft add rule ip6 filter input icmpv6 type nd-neighbor-solicit accept
nft add rule ip6 filter input icmpv6 type nd-router-advert accept

Connection tracking

To accept all incoming packets of an established connection:

nft insert rule filter input ct state established accept

Filter on interface

To accept all packets going out on loopback interface:

nft insert rule filter output oif lo accept

And for packets coming in on eth2:

nft insert rule filter input iif eth2 accept

Please note that oif is in reality a match on an integer: the index of the interface inside the kernel. Userspace converts the given name to the interface index when the nft rule is evaluated (before it is sent to the kernel). One consequence is that the rule cannot be added if the interface does not exist. Another consequence is that if the interface is removed and created again, the rule will no longer match, because the index given to newly added interfaces in the kernel increases monotonically. Thus, oif is a fast filter, but it can lead to issues when dynamic interfaces are used. It is possible to filter on the interface name instead, but this has a performance cost because a string match is done instead of an integer match. To filter on the interface name, use oifname:

nft insert rule filter output oifname ppp0 accept
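
To see the index behind an interface name, you can read it from sysfs. The sketch below assumes a Linux system with /sys mounted; ifindex_of is a made-up helper name, and ppp0 is just the example interface from above, which may or may not exist on your machine:

```shell
# ifindex_of NAME: print the kernel index of interface NAME, read from sysfs.
# Prints nothing and returns non-zero if the interface does not exist.
# (ifindex_of is a hypothetical helper, not part of nft.)
ifindex_of() {
    [ -r "/sys/class/net/$1/ifindex" ] && cat "/sys/class/net/$1/ifindex"
}

for dev in lo ppp0; do
    if idx=$(ifindex_of "$dev"); then
        echo "$dev: index $idx, so 'oif $dev' would be resolved to a match on $idx"
    else
        echo "$dev: no such interface, so 'oif $dev' cannot be resolved; oifname still works"
    fi
done
```

Running it before and after recreating a dynamic interface should show the index changing, which is exactly the case where an oif rule stops matching.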

Logging

Logging is done with the log keyword. A typical log and accept rule will look like:

nft add rule filter input tcp dport 22 ct state new log prefix \"SSH for ever\" group 2 accept

With nftables, it is possible to do in one rule what was split in two with iptables (NFLOG and ACCEPT). The prefix option is just the standard log prefix; the group option gives the nfnetlink_log group, and only matters if nfnetlink_log is used as the logging framework.

In fact, logging in nftables uses the Netfilter logging framework. This means that logging depends on which kernel module is loaded. The available kernel modules are:

xt_LOG: printk-based logging, outputting everything to syslog (same module as the one used for the iptables LOG target)
nfnetlink_log: netlink-based logging, which requires setting up ulogd2 to get the events (same module as the one used for the iptables NFLOG target)

To use one of the two modules, load them with modprobe.

You can then set up logging on a per-protocol basis. The configuration is available in /proc:

# cat /proc/net/netfilter/nf_log 
 0 NONE (nfnetlink_log)
 1 NONE (nfnetlink_log)
 2 nfnetlink_log (nfnetlink_log,ipt_LOG)
 3 NONE (nfnetlink_log)
 4 NONE (nfnetlink_log)
 5 NONE (nfnetlink_log)
 6 NONE (nfnetlink_log)
 7 nfnetlink_log (nfnetlink_log)
 8 NONE (nfnetlink_log)
 9 NONE (nfnetlink_log)
10 nfnetlink_log (nfnetlink_log,ip6t_LOG)
11 NONE (nfnetlink_log)
12 NONE (nfnetlink_log)

Here nfnetlink_log was loaded first and ulogd was started. For example, if you want to use ipt_LOG for IPv4 (2 in the list), you can do:

echo "ipt_LOG" >/proc/sys/net/netfilter/nf_log/2

This will activate ipt_LOG for IPv4 logging:

# cat /proc/net/netfilter/nf_log 
 0 NONE (nfnetlink_log)
 1 NONE (nfnetlink_log)
 2 ipt_LOG (nfnetlink_log,ipt_LOG)
 3 NONE (nfnetlink_log)
 4 NONE (nfnetlink_log)
 5 NONE (nfnetlink_log)
 6 NONE (nfnetlink_log)
 7 nfnetlink_log (nfnetlink_log)
 8 NONE (nfnetlink_log)
 9 NONE (nfnetlink_log)
10 nfnetlink_log (nfnetlink_log,ip6t_LOG)
11 NONE (nfnetlink_log)
12 NONE (nfnetlink_log)

If you want to do some easy testing, simply load the xt_LOG module before nfnetlink_log. It will bind to the IPv4 and IPv6 protocols and provide logging.
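
To check quickly which logger is bound to a given protocol family, the nf_log table can be filtered with awk. This is a minimal sketch, assuming the three-column format shown above (family number, active logger, available loggers); active_logger is a made-up helper name:

```shell
# active_logger PF: read an nf_log table on stdin and print the logger bound to
# protocol family PF (2 is IPv4 and 10 is IPv6 in the listings above).
# (active_logger is a hypothetical helper.)
active_logger() {
    awk -v pf="$1" '$1 == pf { print $2 }'
}

nf_log=/proc/net/netfilter/nf_log
if [ -r "$nf_log" ]; then
    echo "IPv4 logger: $(active_logger 2 < "$nf_log")"
    echo "IPv6 logger: $(active_logger 10 < "$nf_log")"
else
    echo "$nf_log is not readable on this system"
fi
```

A result of NONE means that no logger is bound for that family, so log rules for it will not produce anything.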

Using one single chain

Chains are defined by the user and can be arranged in any way. For example, on a single box, it is possible to use one single chain for input. To do so, create a file named onechain with:

#! nft -f

table global {
        chain one { 
                type filter hook input priority   0;
        }
}

and run:

nft -f onechain

You can then add rules like:

nft add rule ip global one ip daddr 192.168.0.0/24

The advantage of this setup is that Netfilter filtering will only be active for packets coming to the box.

Set

You can use anonymous (unnamed) sets with the following syntax:

nft add rule ip filter output ip daddr {192.168.1.1, 192.168.1.4} drop

A set can also be defined as a variable in a file. For example, you can create a file named simple containing:

define ip_set = {192.168.1.2, 192.168.2.3}
add rule filter output ip daddr $ip_set counter

and run it with:

nft -f simple
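
If the address list comes from somewhere else (another script, an inventory), the define line can be generated instead of typed. The following is only a sketch; make_define is a made-up helper name, and the generated file is written to a temporary path instead of simple so nothing is overwritten:

```shell
# make_define NAME ADDR...: print an nft "define" line for the given addresses.
# (make_define is a hypothetical helper, not part of nft.)
make_define() {
    name=$1
    shift
    # Join the addresses with ", " and strip the trailing separator.
    list=$(printf '%s, ' "$@")
    printf 'define %s = {%s}\n' "$name" "${list%, }"
}

simple=$(mktemp)
{
    make_define ip_set 192.168.1.2 192.168.2.3
    # Single quotes keep $ip_set literal: it is expanded by nft, not by the shell.
    echo 'add rule filter output ip daddr $ip_set counter'
} > "$simple"
cat "$simple"
```

The resulting file has the same two lines as the hand-written one and can be loaded the same way with nft -f.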

It is also possible to use named sets. To declare a set containing IPv4 addresses:

nft add set filter ipv4_ad { type ipv4_address\;}

To add elements to the set:

nft add element filter ipv4_ad { 192.168.3.4 }
nft add element filter ipv4_ad { 192.168.1.4, 192.168.1.5 }

Listing the set is done via:

nft list set filter ipv4_ad

The set can then be used in rule:

nft add rule ip filter input ip saddr @ipv4_ad drop

It is possible to remove an element from an existing set:

nft delete element filter ipv4_ad { 192.168.1.5 }

and to delete a set:

nft delete set filter ipv4_ad

Mapping

Mappings are a specific type of set that behaves like a dictionary. For example, it is possible to map an ipv4_address to a verdict:

# nft -i
nft> add map filter verdict_map { type ipv4_address : verdict; }
nft> add element filter verdict_map { 1.2.3.5 : drop}
nft> add element filter verdict_map { 1.2.3.4 : accept}

nft> add rule filter output ip daddr vmap @verdict_map

To delete one element of a mapping, you can use the same syntax as the set operation:

nft> delete element filter verdict_map { 1.2.3.5 }

To delete the whole mapping (a mapping being a kind of set), you can use:

nft delete set filter verdict_map

Mappings can also be used anonymously:

nft add rule filter output ip daddr vmap {192.168.0.0/24 : drop, 192.168.0.1 : accept}

To list a specific mapping:

nft list set filter nat_map -n

NAT

First of all, the nat module is needed:

modprobe nft_nat

Next, you need to make the kernel aware of NAT for the protocol (here IPv4):

modprobe nft_chain_nat_ipv4

Now, we can create the chains dedicated to NAT:

nft add table nat
nft add chain nat post { type nat hook postrouting priority 0 \; }
nft add chain nat pre { type nat hook prerouting priority 0 \; }

We can now add NAT rules:

nft add rule nat post ip saddr 192.168.56.0/24 oif wlan0 snat 192.168.1.137
nft add rule nat pre udp dport 53 ip saddr 192.168.56.0/24 dnat 8.8.8.8:53

The first one NATs all traffic from 192.168.56.0/24 going out on the wlan0 interface to the IP 192.168.1.137. The second one redirects all DNS traffic from 192.168.56.0/24 to the 8.8.8.8 server. It is possible to NAT to a range of addresses:

nft add rule nat post ip saddr 192.168.56.0/24 oif wlan0 snat 192.168.1.137-192.168.1.140

IPv6 NAT is possible too. First, you need to load the module to declare the NAT capability for IPv6:

modprobe nft_chain_nat_ipv6

Once done, you can add rules like:

table ip6 nat {
    chain postrouting {
        type nat hook postrouting priority -150; 
        ip6 saddr 2::/64 snat 1::3;
    }
}

Building a basic ruleset

The following is a typical ruleset to protect a laptop for both IPv4 and IPv6:

# IPv4 filtering
table Filter {
        chain Input {
                 type filter hook input priority 0;
                 ct state established accept
                 ct state related accept
                 iif lo accept
                 tcp dport ssh counter accept
                 counter log drop
        }

        chain Output {
                 type filter hook output priority 0;
                 ct state established accept
                 ct state related accept
                 oif lo accept
                 ct state new counter accept
        }
}
#IPv6 filtering
table ip6 Filter {
        chain Input {
                 type filter hook input priority 0;
                 ct state established accept
                 ct state related accept
                 iif lo accept
                 tcp dport ssh counter accept
                 icmpv6 type { nd-neighbor-solicit, echo-request, nd-router-advert, nd-neighbor-advert } accept
                 counter log drop
        }

        chain Output {
                 type filter hook output priority 0;
                 ct state established accept
                 ct state related accept
                 oif lo accept
                 ct state new counter accept
        }

}

tcp: auto corking

With the introduction of TCP Small Queues, TSO auto sizing, and TCP pacing, we can implement Automatic Corking in the kernel, to help applications doing small write()/sendmsg() to TCP sockets.

Idea is to change tcp_push() to check if the current skb payload is under skb optimal size (a multiple of MSS bytes).

If under 'size_goal', and at least one packet is still in Qdisc or NIC TX queues, set the TCP Small Queue Throttled bit, so that the push will be delayed up to TX completion time.

This delay might allow the application to coalesce more bytes in the skb in following write()/sendmsg()/sendfile() system calls.

The exact duration of the delay is depending on the dynamics of the system, and might be zero if no packet for this flow is actually held in Qdisc or NIC TX ring.

Using FQ/pacing is a way to increase the probability of autocorking being triggered.

Add a new sysctl (/proc/sys/net/ipv4/tcp_autocorking) to control this feature and default it to 1 (enabled).

Add a new SNMP counter : nstat -a | grep TcpExtTCPAutoCorking

This counter is incremented every time we detected skb was under used and its flush was deferred.

Tested:

Interesting effects when using line buffered commands under ssh.

Excellent performance results in term of cpu usage and total throughput.

lpq83:~# echo 1 >/proc/sys/net/ipv4/tcp_autocorking
lpq83:~# perf stat ./super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128
9410.39

 Performance counter stats for './super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128':

      35209.439626 task-clock                #    2.901 CPUs utilized
             2,294 context-switches          #    0.065 K/sec
               101 CPU-migrations            #    0.003 K/sec
             4,079 page-faults               #    0.116 K/sec
    97,923,241,298 cycles                    #    2.781 GHz                     [83.31%]
    51,832,908,236 stalled-cycles-frontend   #   52.93% frontend cycles idle    [83.30%]
    25,697,986,603 stalled-cycles-backend    #   26.24% backend cycles idle     [66.70%]
   102,225,978,536 instructions              #    1.04  insns per cycle
                                             #    0.51  stalled cycles per insn [83.38%]
    18,657,696,819 branches                  #  529.906 M/sec                   [83.29%]
        91,679,646 branch-misses             #    0.49% of all branches         [83.40%]

      12.136204899 seconds time elapsed

lpq83:~# echo 0 >/proc/sys/net/ipv4/tcp_autocorking
lpq83:~# perf stat ./super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128
6624.89

 Performance counter stats for './super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128':

      40045.864494 task-clock                #    3.301 CPUs utilized
               171 context-switches          #    0.004 K/sec
                53 CPU-migrations            #    0.001 K/sec
             4,080 page-faults               #    0.102 K/sec
   111,340,458,645 cycles                    #    2.780 GHz                     [83.34%]
    61,778,039,277 stalled-cycles-frontend   #   55.49% frontend cycles idle    [83.31%]
    29,295,522,759 stalled-cycles-backend    #   26.31% backend cycles idle     [66.67%]
   108,654,349,355 instructions              #    0.98  insns per cycle
                                             #    0.57  stalled cycles per insn [83.34%]
    19,552,170,748 branches                  #  488.244 M/sec                   [83.34%]
       157,875,417 branch-misses             #    0.81% of all branches         [83.34%]

      12.130267788 seconds time elapsed
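
To put the two runs side by side, the relative change can be computed from the figures quoted above. This is just a back-of-the-envelope sketch; pct_change is a made-up helper name, and the numbers are copied from the benchmark output, not measured again:

```shell
# pct_change OLD CUR: print the relative change from OLD to CUR in percent.
# (pct_change is a hypothetical helper.)
pct_change() {
    awk -v old="$1" -v cur="$2" 'BEGIN { printf "%.1f\n", (cur - old) / old * 100 }'
}

# Figures from the runs above: autocorking disabled first, then enabled.
echo "throughput change: $(pct_change 6624.89 9410.39)%"
echo "task-clock change: $(pct_change 40045.864494 35209.439626)%"
```

With these figures, enabling autocorking gives about 42% more throughput for about 12% less task-clock time in this particular test.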

netfilter: x_tables: lightweight process control group matching

It would be useful e.g. in a server or desktop environment to have
a facility in the notion of fine-grained "per application" or "per
application group" firewall policies. Probably, users in the mobile,
embedded area (e.g. Android based) with different security policy
requirements for application groups could have great benefit from
that as well. For example, with a little bit of configuration effort,
an admin could whitelist well-known applications, and thus block
otherwise unwanted "hard-to-track" applications like [1] from a
user's machine. Blocking is just one example, but it is not limited
to that, meaning we can have much different scenarios/policies that
netfilter allows us than just blocking, e.g. fine grained settings
where applications are allowed to connect/send traffic to, application
traffic marking/conntracking, application-specific packet mangling,
and so on.

Implementation of PID-based matching would not be appropriate
as they frequently change, and child tracking would make that
even more complex and ugly. Cgroups would be a perfect candidate
for accomplishing that as they associate a set of tasks with a
set of parameters for one or more subsystems, in our case the
netfilter subsystem, which, of course, can be combined with other
cgroup subsystems into something more complex if needed.

As mentioned, to overcome this constraint, such processes could
be placed into one or multiple cgroups where different fine-grained
rules can be defined depending on the application scenario, while
e.g. everything else that is not part of that could be dropped (or
vice versa), thus making life harder for unwanted processes to
communicate to the outside world. So, we make use of cgroups here
to track jobs and limit their resources in terms of iptables
policies; in other words, limiting, tracking, etc what they are
allowed to communicate.

In our case we're working on outgoing traffic based on which local
socket that originated from. Also, one doesn't even need to have
an a-priori knowledge of the application internals regarding their
particular use of ports or protocols. Matching is *extremely*
lightweight as we just test for the sk_classid marker of sockets,
originating from net_cls. net_cls and netfilter do not contradict
each other; in fact, each construct can live as standalone or they
can be used in combination with each other, which is perfectly fine,
plus it serves Tejun's requirement to not introduce a new cgroups
subsystem. Through this, we result in a very minimal and efficient
module, and don't add anything except netfilter code.

One possible, minimal usage example (many other iptables options
can be applied obviously):

 1) Configuring cgroups if not already done, e.g.:

  mkdir /sys/fs/cgroup/net_cls
  mount -t cgroup -o net_cls net_cls /sys/fs/cgroup/net_cls
  mkdir /sys/fs/cgroup/net_cls/0
  echo 1 > /sys/fs/cgroup/net_cls/0/net_cls.classid
  (resp. a real flow handle id for tc)

 2) Configuring netfilter (iptables-nftables), e.g.:

  iptables -A OUTPUT -m cgroup ! --cgroup 1 -j DROP

 3) Running applications, e.g.:

  ping 208.67.222.222  <pid:1799>
  echo 1799 > /sys/fs/cgroup/net_cls/0/tasks
  64 bytes from 208.67.222.222: icmp_seq=44 ttl=49 time=11.9 ms
  [...]
  ping 208.67.220.220  <pid:1804>
  ping: sendmsg: Operation not permitted
  [...]
  echo 1804 > /sys/fs/cgroup/net_cls/0/tasks
  64 bytes from 208.67.220.220: icmp_seq=89 ttl=56 time=19.0 ms
  [...]

Of course, real-world deployments would make use of cgroups user
space toolsuite, or own custom policy daemons dynamically moving
applications from/to various cgroups.

  [1] http://www.blackhat.com/presentations/bh-europe-06/bh-eu-06-biondi/bh-eu-06-biondi-up.pdf

ipv6: Add support for IPsec virtual tunnel interfaces

This patch adds IPv6  support for IPsec virtual tunnel interfaces
(vti). IPsec virtual tunnel interfaces provide a routable interface
for IPsec tunnel endpoints.

Labels

2014-05-09

linux kernel 3.13 nftables auto_corking x_tables