
Code View of

LINUX KERNEL NW – Packet Flow and Net Filters

NOMUS COMM SYSTEMS PVT LTD, HYDERABAD
Presented by NSS MURTHY on 20-01-2020
 Netfilter: A Linux kernel framework.

 Allows various networking-related operations to be implemented in the form of customized handlers.

 Netfilter offers various functions and operations for Packet filtering, NAT, Port translation.
 Netfilter represents a set of hooks inside the Linux kernel,
 Allowing specific kernel modules to register callback functions with the kernel's networking stack.

 Those functions, usually applied to the traffic in the form of filtering and modification rules, are called for every
packet that traverses the respective hook within the networking stack (see the registration sketch below).
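As a concrete illustration of callback registration, the following is a minimal sketch of a kernel module hooking NF_INET_PRE_ROUTING. It assumes the older 2.6-era API (nf_register_hook() and the five-argument hook prototype) that matches the code shown in these slides; newer kernels use nf_register_net_hook() and a different prototype. The callback name is a hypothetical example.

#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>

static unsigned int my_prerouting_hook(unsigned int hooknum,
                                       struct sk_buff *skb,
                                       const struct net_device *in,
                                       const struct net_device *out,
                                       int (*okfn)(struct sk_buff *))
{
    /* Inspect or modify skb here, then return a verdict. */
    return NF_ACCEPT;   /* let the packet continue towards ip_rcv_finish() */
}

static struct nf_hook_ops my_ops = {
    .hook     = my_prerouting_hook,   /* hypothetical example callback */
    .pf       = PF_INET,
    .hooknum  = NF_INET_PRE_ROUTING,
    .priority = NF_IP_PRI_FIRST,
};

/* nf_register_hook(&my_ops) in module init; nf_unregister_hook(&my_ops) on exit. */

iptables, NAT and IPsec all plug into the stack through exactly this kind of registration, each at its own hook and priority.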
Netfilter Hooks:

PREROUTING: Packets will enter this chain before a routing decision is made (point 1 in Figure 6).

INPUT: The packet is going to be locally delivered. It does not have anything to do with processes having an open socket; local
delivery is controlled by the "local-delivery" routing table: ip route show table local (point 2 in Figure 6).

FORWARD: All packets that have been routed and were not for local delivery will traverse this chain (point 3 in Figure 6).

OUTPUT: Packets sent from the machine itself will visit this chain (point 5 in Figure 6).

POSTROUTING: Routing decision has been made. Packets enter this chain just before handing them off to the hardware (point
4 in Figure 6).

 The packet continues to traverse the chain until either

 A rule matches the packet and decides its ultimate fate, for example by issuing one of the verdicts ACCEPT or
DROP, or a module returns such an ultimate fate; or

 A rule calls the RETURN verdict, in which case processing returns to the calling chain; or

 The end of the chain is reached; traversal either continues in the parent chain (as if RETURN was used), or the base
chain policy, which is an ultimate fate, is used.
netfilter hooks and packet processing
https://www.netfilter.org/documentation/HOWTO/netfilter-hacking-HOWTO-3.html

Predefined chains have a policy, for example DROP, which is applied to the packet if it reaches the end of the chain.

The system administrator can create as many other chains as desired.

These chains have no policy; if a packet reaches the end of the chain it is returned to the chain which called it. A chain may
be empty.
 IPTABLES:

 iptables is a user-space application program.

 Allows an administrator to configure the tables provided by the Linux kernel firewall (implemented as different Netfilter
modules) and the chains and rules it stores.

 Different kernel modules and programs are currently used for different protocols;
 iptables applies to IPv4, ip6tables to IPv6, arptables to ARP, and ebtables to Ethernet frames.

 IPSEC modules in Linux Kernel:

IPSEC module registers with PRE-ROUTING Hook to decrypt the packet and then it returns the decrypted packet to the
hook if there is an inbound policy configured.

IPSEC module registers with the POST-ROUTING hook to encrypt the packet. It needs to find the route before returning the
encrypted packet back to the hook. Encryption is done based on the configured policy.
https://www.digitalocean.com/community/tutorials/a-deep-dive-into-iptables-and-netfilter-architecture
Packet flow paths. Packets start at a given box and will flow along a certain path, depending on the circumstances.

The origin of the packet determines which chain it traverses initially. There are five predefined chains (mapping to the five
available Netfilter hooks, see figure 5), though a table may not have all chains.
netif_receive_skb(struct sk_buff *skb): // net/core/dev.c
• Receive data processing function.
• Called from softirq context with interrupts enabled.
{
    skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
    skb = handle_macvlan(skb, &pt_prev, &ret, orig_dev);
    Gives a copy of the frame to each registered L3 protocol handler (based on skb->protocol).
}

ip_rcv(struct sk_buff *skb, struct net_device *dev) // ip_input.c, IPv4 packet reception.
{
    IP checksum verification. Sanity checks on IP header length and IP version.
    return NF_HOOK(PF_INET, NF_INET_PRE_ROUTING, skb, dev, NULL, ip_rcv_finish);
}

http://www.embeddedlinux.org.cn/linux_net/0596002556/understandlni-CHP-19-SECT-2.html

 NF_INET_PRE_ROUTING  point within the network stack:

1. By this time no routing decision has been taken yet.

2. An application like a firewall (Linux iptables) registers with this NF hook.
     It checks its rule database and may allow or drop the packet.

3. The IPsec module may register with this hook and check whether a matching inbound SA or an IPsec
   policy is present. If an SA is not present, IKE negotiation is started and inbound and outbound
   SAs are formed as a result of the negotiation with the peer.
     If the incoming packet is an encrypted packet, it is decrypted and returned to
      the netfilter infrastructure using NF_HOOK().

4. Once all the registered applications are called, and if they process and return the packet
   without dropping or consuming it,
     ip_rcv_finish() is executed by the netfilter hook mechanism (a conceptual sketch of NF_HOOK() follows below).
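The following is a conceptual sketch (not the real macro) of what NF_HOOK() does at a hook point such as NF_INET_PRE_ROUTING: run the registered callbacks in priority order and, only if the combined verdict is ACCEPT, call the continuation function okfn, here ip_rcv_finish(). run_registered_hooks() is a hypothetical placeholder for the kernel's internal hook iteration (nf_hook_slow()).

static inline int nf_hook_sketch(int pf, int hooknum, struct sk_buff *skb,
                                 struct net_device *in, struct net_device *out,
                                 int (*okfn)(struct sk_buff *))
{
    int verdict = run_registered_hooks(pf, hooknum, skb, in, out); /* hypothetical helper */

    if (verdict == NF_ACCEPT)
        return okfn(skb);          /* continue normal stack processing, e.g. ip_rcv_finish() */

    if (verdict == NF_DROP)
        kfree_skb(skb);            /* packet is discarded */

    return -EPERM;                 /* dropped, stolen or consumed by a hook */
}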
ip_rcv_finish(struct sk_buff *skb) // ip_input.c
{
    Decides whether the packet is to be
        1. Locally delivered, or
        2. Forwarded // needs to find both the egress device and the next hop.

    Initialise the virtual path cache for the packet.

    struct iphdr *iph = ip_hdr(skb);

    if (skb_dst(skb) == NULL)   // skb->dst contains route information for the packet destination.
    {
        int err = ip_route_input(skb, iph->daddr, iph->saddr, iph->tos, skb->dev); // get it from the routing subsystem
        if (err) goto drop;     // Drop if destination is unreachable
    }

    // Parse and process the optional IP header contents, if IP header length > 5 x 4 = 20 bytes.
    if (iph->ihl > 5 && ip_rcv_options(skb))   // Parsing and processing some of the IP options.
        goto drop;

    return dst_input(skb);   // skb->dst->input is set to ip_local_deliver() or ip_forward(),
                             // depending on the destination address of the packet.
drop:
    kfree_skb(skb);
    return NET_RX_DROP;
}

inline int dst_input(struct sk_buff *skb)
{
    return skb_dst(skb)->input(skb);
}
 When a host needs to route a packet,
    o It first consults its cache and then,
         In the case of a cache miss, it queries the routing table.
     Every time the host queries the routing table, the result is saved into the cache.

 The IPv4 routing cache is composed of rtable structures.
    o Each instance is associated with a different destination IP address.
    o Among the fields of the rtable structure are
         The next hop (router), and
         A structure of type dst_entry that is used to store the protocol-independent information.
             dst_entry includes a pointer to the neighbour structure (ARP related) associated with the next hop.
int ip_route_input(struct sk_buff *skb, __be32 daddr, __be32 saddr,
                   u8 tos, struct net_device *dev)
/* Ingress lookup: this function is used to route ingress (incoming) packets.
   Inputs: destination and source IP addresses, TOS and the interface on which the packet is received.
   The skb may also contain an ingress ARP request. */
{
    struct rtable *rth; unsigned hash; int iif = dev->ifindex; struct net *net; net = dev_net(dev);

    // Searches the route_cache hash table to find the matching route.
    hash = rt_hash(daddr, saddr, iif, rt_genid(net));   // Select one bucket of the route cache hash table.

    // Browses the list of routes in that bucket one by one and compares all the necessary fields
    // until it either finds a match or gets to the end without a match.
    // The lookup fields passed as input to ip_route_input() are compared to the fields stored
    // in the fl field of the routing cache entry's rtable.
    // Binary XOR operator ^: (0,0 = 0), (1,1 = 0), (0,1 = 1), (1,0 = 1).
    for (rth = (rt_hash_table[hash].chain); rth; rth = (rth->u.dst.rt_next))
    {
        if ( ( (rth->fl.fl4_dst ^ daddr) | (rth->fl.fl4_src ^ saddr) |
               (rth->fl.iif ^ iif) | rth->fl.oif |
               (rth->fl.fl4_tos ^ tos) ) == 0
             && rth->fl.mark == skb->mark
             && net_eq(dev_net(rth->u.dst.dev), net)
             && !rt_is_expired(rth) )
        {
            dst_use(&rth->u.dst, jiffies);
            skb_dst_set(skb, &rth->u.dst);
            return 0;
        }
    }   // end of for loop

    // Cache miss.
    if (ipv4_is_multicast(daddr))   // Check if it is a multicast address (addr = 0xExxxxxxx).
    {
        int our; struct in_device *in_dev;

        // "our" is true if the interface is a member of the multicast group identified by daddr
        // (is the interface configured to allow this mcast address by any local multicast application?).
        if (our
#ifdef CONFIG_IP_MROUTE
            // daddr is not locally configured, but the kernel is compiled with
            // support for multicast routing (CONFIG_IP_MROUTE).
            || (!ipv4_is_local_multicast(daddr) && IN_DEV_MFORWARD(in_dev))
#endif
           )
        {
            return ip_route_input_mc(skb, daddr, saddr, tos, dev, our);
        }
    }

    // No cache entry found for unicast packets.
    // Get the route from the routing table and copy it to the route cache.
    return ip_route_input_slow(skb, daddr, saddr, tos, dev);
}

skb_dst_set(skb, struct dst_entry *dst)
{
    skb->_skb_dst = (unsigned long)dst;   // update skb->_skb_dst
}
struct rtable
{
    union
    {
        struct dst_entry dst;
    } u;

    // Cache lookup keys
    struct flowi fl;

    struct in_device *idev;

    int rt_genid;
    unsigned rt_flags;
    __u16 rt_type;

    __be32 rt_dst;               // Path destination
    __be32 rt_src;               // Path source
    int rt_iif;

    // Info on neighbour
    __be32 rt_gateway;

    // Miscellaneous cached information
    __be32 rt_spec_dst;          // RFC1122 specific destination
    struct inet_peer *peer;      // long-living peer info
};

struct flowi
{
    int oif, iif;
    __u32 mark;

    union
    {
        struct
        {
            __be32 daddr, saddr;
            __u8 tos, scope;
        } ip4_u;

        struct
        {
            struct in6_addr daddr, saddr;
            __be32 flowlabel;
        } ip6_u;

        struct
        {
            __le16 daddr, saddr;
            __u8 scope;
        } dn_u;
    } nl_u;

#define fl4_dst   nl_u.ip4_u.daddr
#define fl4_src   nl_u.ip4_u.saddr
#define fl4_tos   nl_u.ip4_u.tos
#define fl4_scope nl_u.ip4_u.scope

#define fl6_dst       nl_u.ip6_u.daddr
#define fl6_src       nl_u.ip6_u.saddr
#define fl6_flowlabel nl_u.ip6_u.flowlabel

#define fld_dst   nl_u.dn_u.daddr
#define fld_src   nl_u.dn_u.saddr
#define fld_scope nl_u.dn_u.scope

    __u8 proto;
    __u8 flags;

    union
    {
        struct { __be16 sport, dport; } ports;
        struct { __u8 type, code; } icmpt;
        struct { __le16 sport, dport; } dnports;
        __be32 spi;
        struct { __u8 type; } mht;
    } uli_u;

#define fl_ip_sport  uli_u.ports.sport
#define fl_ip_dport  uli_u.ports.dport
#define fl_icmp_type uli_u.icmpt.type
#define fl_icmp_code uli_u.icmpt.code
#define fl_ipsec_spi uli_u.spi
#define fl_mh_type   uli_u.mht.type

    __u32 secid;   // used by xfrm; see secid.txt
};   // end of struct flowi
struct dst_entry
{
    struct dst_entry *child;
    struct net_device *dev;
    short error, obsolete, flags;
    unsigned long expires;

    unsigned short header_len;    // more space at head required
    unsigned short trailer_len;   // space to reserve at tail

    unsigned int rate_tokens;
    unsigned long rate_last;      // rate limiting for ICMP

    struct dst_entry *path;

    struct neighbour *neighbour;  // ARP related
    struct hh_cache *hh;

    int (*input)(struct sk_buff*);    // route lookup initializes them.
    int (*output)(struct sk_buff*);

    struct dst_ops *ops;

    u32 metrics[RTAX_MAX];

    union
    {
        struct dst_entry *next;
        struct rtable *rt_next;
        struct rt6_info *rt6_next;
        struct dn_route *dn_next;
    };
};
RTCF_BROADCAST: The destination address of the route is a broadcast address.
RTCF_MULTICAST : The destination address of the route is a multicast address.
RTCF_LOCAL : The destination address of the route is local (i.e., configured on one of the local interfaces). This flag is also set for local broadcast and multicast addresses.
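As a small illustrative use of these flags (a sketch, not taken from the slides): after a successful ingress lookup, the cached rtable attached to the skb can be tested to see whether the packet is for local delivery or is broadcast/multicast.

struct rtable *rt = skb_rtable(skb);          /* (struct rtable *)skb_dst(skb) on older kernels */

if (rt->rt_flags & RTCF_LOCAL)
    /* destination is one of our own addresses (also set for local bcast/mcast) */ ;
else if (rt->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST))
    /* broadcast or multicast destination */ ;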

int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr,
                      u8 tos, struct net_device *dev, int our)
{
    // Lookup routine used for multicast destinations.

    // Updates the route cache with the route read from the multicast routing table.

    // Details TBD
}
int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
                        u8 tos, struct net_device *dev)
{   // Unicast routing

    // Perform a route lookup for the packet in skb.
    // The routing table can be a hash table or a "TRIE" tree.
    // Forwarding Information Base, FIB
    fib_lookup(net, &fl, &res);

    // In case of forwarding traffic:
    // create a routing cache entry and update the route cache rt_cache
    ip_mkroute_input(skb, &res, &fl, in_dev, daddr, saddr, tos);

    // In case of traffic for local consumption:
    // create a routing cache entry and update the route cache rt_cache
    rth->u.dst.input = ip_local_deliver;
}

int ip_mkroute_input(struct sk_buff *skb,
                     struct fib_result *res, struct flowi *fl,
                     struct in_device *in_dev,
                     __be32 daddr, __be32 saddr, u32 tos)
{
    struct rtable *rth = NULL; unsigned hash;

    // create a routing cache entry
    __mkroute_input(skb, res, in_dev, daddr, saddr, tos, &rth);

    // put it into the cache
    hash = rt_hash(daddr, saddr, fl->iif, rt_genid(dev_net(rth->u.dst.dev)));
    return rt_intern_hash(hash, rth, NULL, skb);
}

int __mkroute_input(struct sk_buff *skb, struct fib_result *res, struct in_device *in_dev,
                    __be32 daddr, __be32 saddr, u32 tos,
                    struct rtable **result)
{
    rth->fl.fl4_dst = daddr;  rth->rt_dst = daddr;
    rth->fl.fl4_src = saddr;  rth->rt_src = saddr;  rth->rt_gateway = daddr;
    rth->fl.fl4_tos = tos;    rth->fl.mark = skb->mark;

    rth->rt_iif = rth->fl.iif = in_dev->dev->ifindex;
    rth->u.dst.dev = (out_dev)->dev;
    rth->idev = in_dev_get(rth->u.dst.dev);
    rth->fl.oif = 0;

    rth->u.dst.input  = ip_forward;
    rth->u.dst.output = ip_output;
    rth->rt_genid = rt_genid(dev_net(rth->u.dst.dev));

    rt_set_nexthop(rth, res, itag);

    rth->rt_flags = flags;

    *result = rth;
}

rt_set_nexthop()
{
    Inputs: a routing cache entry rtable and a routing table lookup result res.
    Updates rtable's fields, such as rt_gateway and the metrics vector of the
    embedded dst_entry structure.
}
struct iphdr
{
    __u8 ihl:4,
         version:4;
    __u8 tos;
    __be16 tot_len;
    __be16 id;
    __be16 frag_off;   // includes Flags
    __u8 ttl;
    __u8 protocol;
    __sum16 check;
    __be32 saddr;
    __be32 daddr;
    // The IP options start here.
};

// IP Header Flags:
#define IP_MF 0x2000        // More Fragments
#define IP_DF 0x4000        // Do Not Fragment

// IP Header Fragment Offset part
#define IP_OFFSET 0x1FFF

// Deliver IP packets to the higher protocol layers.
int ip_local_deliver(struct sk_buff *skb)
{
    // Reassemble IP fragments if the packet is fragmented.
    if (ip_hdr(skb)->frag_off & htons(IP_MF | IP_OFFSET))
    {
        if (ip_defrag(skb, IP_DEFRAG_LOCAL_DELIVER))
            return 0;
    }

    return NF_HOOK(PF_INET, NF_INET_LOCAL_IN, skb, skb->dev, NULL,
                   ip_local_deliver_finish);

    // ip_local_deliver_finish() sends the packet to the raw IP socket or
    // Transport Layer socket registrations.
}
struct sk_buff // skb
{
    --------------------------
    __u32 priority;   // The 'priority' field of skb is used in QoS.
    --------------------------
}

• The priority field of skb can be determined by the TOS field setting in the IPv4 header.

• This priority value of skb is used by the packet scheduler and classifier layer of QoS to
    o Schedule or classify the packet, as configured by the administrator.

• The Linux kernel translates the TOS value to a priority using an array.
    o The priority affects how and when a packet is transmitted from a queuing discipline.

char rt_tos2priority(u8 tos)   // include/net/route.h
{
    return ip_tos2prio[IPTOS_TOS(tos) >> 1];
}

#define IPTOS_TOS(tos)   ((tos) & IPTOS_TOS_MASK)
#define IPTOS_TOS_MASK   0x1E

int ip_forward(struct sk_buff *skb)   // net/ipv4/ip_forward.c
{
    ----------------------
    // Present towards the end of the function.
    skb->priority = rt_tos2priority(iph->tos);
    return NF_HOOK(PF_INET, NF_INET_FORWARD, skb, skb->dev,
                   rt->u.dst.dev, ip_forward_finish);
}
// Finally ip_forward_finish() is called, which internally calls dst_output().

__u8 ip_tos2prio[16] =   // net/ipv4/route.c
{
    TC_PRIO_BESTEFFORT,
    ECN_OR_COST(BESTEFFORT),
    TC_PRIO_BESTEFFORT,
    ECN_OR_COST(BESTEFFORT),
    TC_PRIO_BULK,
    ECN_OR_COST(BULK),
    TC_PRIO_BULK,
    ECN_OR_COST(BULK),
    TC_PRIO_INTERACTIVE,
    ECN_OR_COST(INTERACTIVE),
    TC_PRIO_INTERACTIVE,
    ECN_OR_COST(INTERACTIVE),
    TC_PRIO_INTERACTIVE_BULK,
    ECN_OR_COST(INTERACTIVE_BULK),
    TC_PRIO_INTERACTIVE_BULK,
    ECN_OR_COST(INTERACTIVE_BULK)
};
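Worked example of the translation above (illustrative values): an IPv4 header carrying TOS = 0x10 (IPTOS_LOWDELAY).

u8 tos = 0x10;                       /* IPTOS_LOWDELAY                                   */
/* IPTOS_TOS(0x10) = 0x10 & 0x1E = 0x10; 0x10 >> 1 = 8                                   */
/* ip_tos2prio[8]  = TC_PRIO_INTERACTIVE                                                 */
char prio = rt_tos2priority(tos);    /* so skb->priority = TC_PRIO_INTERACTIVE when forwarding */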
Locally generated packet path:

ip_queue_xmit() is used by TCP and other Transport layer protocols and raw socket based Applications.

int ip_queue_xmit(struct sk_buff *skb, int ipfragok)
{
    Update the variable struct flowi fl of the route cache with the source and destination IP addresses and ports.

    // This function returns both the next hop gateway and the egress device to use.
    ip_route_output_flow(&rt, &fl, sk, 0);
    skb_dst_set(skb, dst_clone(&rt->u.dst));

    Allocate and build the IP header, as we now know where to send the packet.
    Fragment the packet if required.

    ip_local_out(skb);
}

int ip_local_out(struct sk_buff *skb)   // Output packet to network from transport.
{
    __ip_local_out(skb);
}

int __ip_local_out(struct sk_buff *skb)
{
    return NF_HOOK(PF_INET, NF_INET_LOCAL_OUT, skb, NULL, skb_dst(skb)->dev, dst_output);
}

int dst_output(struct sk_buff *skb)
{
    return skb_dst(skb)->output(skb);
}
int ip_forward_finish(struct sk_buff *skb)
{
    // Traffic generated locally or forwarded from other hosts passes through the
    // dst_output() function on its way to a destination host.

    // skb->dst->output() is initialized to
    //     ip_output()    if the destination address is unicast, and
    //     ip_mc_output() if it is multicast.

    return dst_output(skb);
}

int dst_output(struct sk_buff *skb)
{
    return skb_dst(skb)->output(skb);
}

int ip_output(struct sk_buff *skb)
{
    struct net_device *dev = skb_dst(skb)->dev;
    skb->dev = dev;
    skb->protocol = htons(ETH_P_IP);

    return NF_HOOK_COND(PF_INET, NF_INET_POST_ROUTING, skb, NULL,
                        dev, ip_finish_output,
                        !(IPCB(skb)->flags & IPSKB_REROUTED));
}

IPsec may register with the POST_ROUTING hook to get the packet,
encrypt it, and call the routing function to update routing information
before returning the packet to the netfilter infrastructure.

int ip_finish_output(struct sk_buff *skb)
{
    if (skb->len > ip_skb_dst_mtu(skb))
        return ip_fragment(skb, ip_finish_output2);   // fragment the IP packet.
    else
        return ip_finish_output2(skb);
}
int ip_finish_output2(struct sk_buff *skb)
{
    struct dst_entry *dst = skb_dst(skb);
    struct rtable *rt = (struct rtable *)dst;
    struct net_device *dev = dst->dev;
    unsigned int hh_len = LL_RESERVED_SPACE(dev);

    // skb head room related code here.

    if (dst->hh)                 // If the L2 cache contains the MAC header for the destination, send the packet.
        return neigh_hh_output(dst->hh, skb);
    else if (dst->neighbour)     // ARP may need to be resolved.
        return dst->neighbour->output(skb);

    if (net_ratelimit())
        printk(KERN_DEBUG "ip_finish_output2: No header cache and no neighbour!\n");
    kfree_skb(skb);
    return -EINVAL;
}

 The skb buffer input to ip_finish_output2() includes
    • The packet data (but without an L2 header),
    • along with information such as the device to use for transmission and
    • the routing table cache entry (dst) that was used by the kernel to make the forwarding decision.
    • The dst entry includes a pointer to the neighbour entry associated with the next hop (which can be either a router or the final destination itself).

 The decisions made by ip_finish_output2():
    • If a cached L2 header is available (hh is not NULL), it is copied into the skb buffer.
      (skb->data points to the start of the user data, which is where the L2 header should be placed.)
        o Finally, neigh_hh_output() is invoked.
    • If no cached L2 header is available, ip_finish_output2() invokes the neigh->output method.
        o The precise function associated with neigh->output depends on the state of the neighbour entry.
        o If the L2 address is ready, the function will probably be neigh_connected_output(), so the header can be filled in right away and the packet transmitted.
        o Otherwise, neigh->output will probably be initialized to neigh_resolve_output(), which will put the packet in the arp_queue queue, try to resolve the address by sending a
          solicitation request, and wait until the solicitation reply arrives, whereupon the packet is transmitted.
        o Whether the packet is sent immediately or queued, ip_finish_output2() returns the same value, indicating success.
        o The packet is not the IP subsystem's responsibility after this point;
             When the solicitation reply arrives, the neighbouring subsystem dequeues the packet from arp_queue and sends it to the device.
int neigh_hh_output(struct hh_cache *hh, struct sk_buff *skb)
{
    unsigned seq; int hh_len; int hh_alen;   // hardware header
    hh_len = hh->hh_len; hh_alen = HH_DATA_ALIGN(hh_len);
    memcpy(skb->data - hh_alen, hh->hh_data, hh_alen);   // Prepare and copy the L2 header into the skb.
    skb_push(skb, hh_len);

    return hh->hh_output(skb);   // net/ipv4/arp.c initializes hh_output with dev_queue_xmit(), which transmits/queues the skb buffer.
}
dev_queue_xmit():

• It is the interface between the neighbouring subsystem (ARP) and the Traffic Control subsystem,
o It stands between the neighbouring protocol and the device driver.

• Queue a buffer for transmission to a network device.


o The caller must have set the device and priority in skb before calling.
o A return value of success does not guarantee the frame will be transmitted as it may be dropped due to congestion or traffic shaping.
o This method can also return errors from the queue disciplines, including NET_XMIT_DROP, which is a positive value. So, errors can also be positive.

• In case the frame is fragmented,


o check whether the device can handle them through scatter/gather DMA; Combine or linearize the fragments if the device is not capable of doing so.

• Makes sure the L4 checksum (that is, TCP/UDP) is computed, unless the device computes the checksum in hardware.
• Selecting which frame to transmit (the one pointed to by the input sk_buff may not be the one to transmit, because there is a queue to honour).

// Two options are possible.

• If the device is configured to support the Traffic Control infrastructure, it will be associated with a queuing discipline (Qdisc).
    o In this case, interfacing to Traffic Control (the QoS layer) is done through the qdisc_run() function.
    o Based on the status of the outgoing queue, this buffer may or may not be the one that will actually be sent next, and it may be queued in the Qdisc.
    o It seems to NSM that a default Qdisc is attached to every interface, which can be changed by Linux tc commands.

• Invoke hard_start_xmit() directly if the device is not using the Traffic Control infrastructure (i.e., virtual devices).
int dev_queue_xmit(struct sk_buff *skb)
{
    struct net_device *dev = skb->dev;
    struct netdev_queue *txq; struct Qdisc *q;

    // Handle fragments and TCP/UDP checksum as described in the previous slide.
    // Disable soft irqs for the various locks below. Also stops pre-emption for RCU.
    rcu_read_lock_bh();

    txq = dev_pick_tx(dev, skb);       // Select one of the multiple TX queues of the physical dev.
    q = rcu_dereference(txq->qdisc);   // Get the Qdisc attached.

#ifdef CONFIG_NET_CLS_ACT
    skb->tc_verd = SET_TC_AT(skb->tc_verd, AT_EGRESS);
#endif

    if (q->enqueue)   // NSM: a Qdisc is attached to the interface.
    {
        rc = __dev_xmit_skb(skb, q, dev, txq);
        goto out;
    }

    // The device has no queue. Common case for software devices: loopback and tunnels...
    if (dev->flags & IFF_UP)
    {
        if (!netif_tx_queue_stopped(txq))
        {
            rc = dev_hard_start_xmit(skb, dev, txq);   // NSM: transmit the packet and get out.
            if (dev_xmit_complete(rc))
            {
                HARD_TX_UNLOCK(dev, txq); goto out;
            }
        }

        if (net_ratelimit())
            printk(KERN_CRIT "Virtual device %s asks to queue packet!\n", dev->name);
        // NSM: We need to add a Qdisc of a suitable type and queue the packets on this
        // interface based on the traffic.
    }
}

struct netdev_queue
{
    // read mostly part
    struct net_device *dev;
    struct Qdisc *qdisc;
    struct Qdisc *qdisc_sleeping;
    unsigned long state;

    // write mostly part
    spinlock_t _xmit_lock ____cacheline_aligned_in_smp;
    int xmit_lock_owner;
    unsigned long trans_start;   // please use this field instead of dev->trans_start
    unsigned long tx_bytes, tx_packets, tx_dropped;
} ____cacheline_aligned_in_smp;
int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
                   struct net_device *dev, struct netdev_queue *txq)
{
    spinlock_t *root_lock = qdisc_lock(q);   // Serialize access to this queue.

    if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
    {
        kfree_skb(skb);
        rc = NET_XMIT_DROP;
    }
    else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
             !test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
    {
        // This is a work-conserving queue; there are no old skbs waiting to
        // be sent out; and the qdisc is not running - xmit the skb directly.
        if (sch_direct_xmit(skb, q, dev, txq, root_lock))
            __qdisc_run(q);
        else
            clear_bit(__QDISC_STATE_RUNNING, &q->state);
        rc = NET_XMIT_SUCCESS;
    }
    else
    {
        rc = qdisc_enqueue_root(skb, q);   // NSM: enqueue the skb to the Qdisc.
        qdisc_run(q);                      // Try sending something from the device's queue.
    }
    spin_unlock(root_lock);
    return rc;
}

struct qdisc_skb_cb
{
    int pkt_len;
    char data[];
};

struct qdisc_skb_cb* qdisc_skb_cb(struct sk_buff *skb)
{
    return (struct qdisc_skb_cb *)skb->cb;
}

int qdisc_enqueue_root(struct sk_buff *skb, struct Qdisc *sch)
{
    qdisc_skb_cb(skb)->pkt_len = skb->len;
    return qdisc_enqueue(skb, sch) & NET_XMIT_MASK;
}

int qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch)
{
#ifdef CONFIG_NET_SCHED
    if (sch->stab)
        qdisc_calculate_pkt_len(skb, sch->stab);
#endif
    return sch->enqueue(skb, sch);   // Enqueue packet to the Qdisc (tbf, htb, prio etc.).
}
 QDISCs (Queueing Disciplines): schedulers used by traffic control.

 Whenever the kernel needs to send a packet out on an interface,


o It is enqueued to the Qdisc configured on that output interface.

 Immediately afterwards, the kernel tries to get as many packets as possible from the Qdisc,
to give them to the network adapter driver.

 A simple QDISC is the 'pfifo' one, which does no processing at all and is a pure First In, First Out queue.
o It does however store traffic when the network interface can't handle it momentarily.

 Qdisc is the major building block on which all of Linux traffic control is built.

o Classful Qdiscs:
• They can contain classes, and provide a handle to which to attach filters.

o Classless Qdiscs:
• They contain no classes, nor is it possible to attach filters to a classless Qdisc.
• Because a classless qdisc contains no children of any kind, there is no utility in classifying.
 This means that no filter can be attached to a classless Qdisc. (A minimal classless Qdisc sketch follows below.)
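To make the enqueue/dequeue interface concrete, below is a minimal sketch of a classless qdisc, loosely modeled on the kernel's pfifo (net/sched/sch_fifo.c). The name "toyfifo" and the drop threshold are illustrative, and field names and registration details vary between kernel versions.

#include <linux/module.h>
#include <net/pkt_sched.h>
#include <net/sch_generic.h>

static int toy_enqueue(struct sk_buff *skb, struct Qdisc *sch)
{
    /* Tail-drop once the internal queue holds more than tx_queue_len packets. */
    if (skb_queue_len(&sch->q) < qdisc_dev(sch)->tx_queue_len)
        return qdisc_enqueue_tail(skb, sch);   /* queue skb and update stats */

    return qdisc_drop(skb, sch);               /* over limit: drop */
}

static struct sk_buff *toy_dequeue(struct Qdisc *sch)
{
    /* Pure FIFO: hand the oldest queued skb back to qdisc_restart(). */
    return qdisc_dequeue_head(sch);
}

static struct Qdisc_ops toy_qdisc_ops = {
    .id        = "toyfifo",
    .priv_size = 0,
    .enqueue   = toy_enqueue,
    .dequeue   = toy_dequeue,
    .peek      = qdisc_peek_head,
    .owner     = THIS_MODULE,
};

/* register_qdisc(&toy_qdisc_ops) in module init; unregister_qdisc() on exit. */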
 The Networking Subsystem of Linux is assigned two different softirqs:
    • NET_RX_SOFTIRQ - handles incoming traffic.
    • NET_TX_SOFTIRQ - handles outgoing traffic.

 As different instances of the same softirq handler can run concurrently on different CPUs (unlike tasklets),
    • Networking code is both low latency and scalable.

 softnet_data structure: both softirqs refer to this structure.
    • Each CPU owns a separate data structure, so no lock is required.

struct softnet_data
{
    ---------------------------------------------------------------------
    struct sk_buff_head input_pkt_queue;   // Ingress frames.
    struct list_head poll_list;            // Related to receive frames.
    struct Qdisc *output_queue;            // List of devices ready to transmit.
    struct sk_buff *completion_queue;      // Transmitted skbs for release.
    ------------------------------------------------------------------
    struct napi_struct backlog;
};

Registration of the softirqs:

int __init net_dev_init(void)
{
    -----------------------------------------------------------------
    for_each_possible_cpu(i)   // Init softnet_data per CPU.
    {
        struct softnet_data *queue;

        queue = &per_cpu(softnet_data, i);
        skb_queue_head_init(&queue->input_pkt_queue);
        queue->completion_queue = NULL;
        INIT_LIST_HEAD(&queue->poll_list);
    }

    // Register the network softirq handlers.
    open_softirq(NET_TX_SOFTIRQ, net_tx_action);
    open_softirq(NET_RX_SOFTIRQ, net_rx_action);

    hotcpu_notifier(dev_cpu_callback, 0);
}

enum
{
    HI_SOFTIRQ = 0,
    TIMER_SOFTIRQ,
    NET_TX_SOFTIRQ,   // NW
    NET_RX_SOFTIRQ,   // NW
    BLOCK_SOFTIRQ, BLOCK_IOPOLL_SOFTIRQ,
    TASKLET_SOFTIRQ, SCHED_SOFTIRQ,
    HRTIMER_SOFTIRQ, RCU_SOFTIRQ,

    NR_SOFTIRQS
};

#define NR_CPUS 1

• Ingress frames are queued to input_pkt_queue.

• Egress frames are placed into the specialized queues handled by Traffic
Control (QoS layer) instead of being handled by softirqs and the
softnet_data structure.

• But softirqs are still used to clean up transmitted buffers afterward, to keep
that task from slowing transmission.
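For the ingress side, the following is a minimal sketch of how a NAPI driver hands frames to NET_RX_SOFTIRQ. The driver name "foo", its private structure and its ring-access helpers are hypothetical placeholders; only napi_schedule(), napi_complete() and netif_receive_skb() are real kernel APIs.

static irqreturn_t foo_interrupt(int irq, void *dev_id)
{
    struct foo_priv *priv = dev_id;          /* hypothetical private data   */

    foo_disable_rx_irq(priv);                /* hypothetical helper         */
    napi_schedule(&priv->napi);              /* raises NET_RX_SOFTIRQ       */
    return IRQ_HANDLED;
}

static int foo_poll(struct napi_struct *napi, int budget)
{
    struct foo_priv *priv = container_of(napi, struct foo_priv, napi);
    struct sk_buff *skb;
    int done = 0;

    while (done < budget && (skb = foo_rx_ring_next(priv)) != NULL) {
        netif_receive_skb(skb);              /* enters ip_rcv() via the registered L3 handlers */
        done++;
    }

    if (done < budget) {
        napi_complete(napi);                 /* re-enable interrupts once the ring is drained  */
        foo_enable_rx_irq(priv);             /* hypothetical helper         */
    }
    return done;
}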
 The net_tx_action() function is a softirq handler and is called in two cases:
    1. To do housekeeping with the buffers that are not needed anymore.
    2. When there are devices (Qdiscs) waiting to transmit something.

void net_tx_action(struct softirq_action *h)
{
    struct softnet_data *sd = &__get_cpu_var(softnet_data);

    if (sd->completion_queue)   // 1. Free the skbs in the completion queue.
    {
        struct sk_buff *clist;

        local_irq_disable();
        clist = sd->completion_queue;
        sd->completion_queue = NULL;
        local_irq_enable();

        while (clist)
        {
            struct sk_buff *skb = clist;
            clist = clist->next;

            WARN_ON(atomic_read(&skb->users));
            __kfree_skb(skb);
        }
    }

    if (sd->output_queue)   // 2. Run the scheduled Qdiscs.
    {
        struct Qdisc *head;

        local_irq_disable();
        head = sd->output_queue;
        sd->output_queue = NULL;
        local_irq_enable();

        while (head)
        {
            struct Qdisc *q = head;
            spinlock_t *root_lock;

            head = head->next_sched;
            root_lock = qdisc_lock(q);

            if (spin_trylock(root_lock))
            {
                smp_mb__before_clear_bit();
                clear_bit(__QDISC_STATE_SCHED, &q->state);
                qdisc_run(q);   // Dequeue skb(s) from the Qdisc and send.
                spin_unlock(root_lock);
            }
            else
            {
                if (!test_bit(__QDISC_STATE_DEACTIVATED, &q->state))
                {
                    __netif_reschedule(q);   // Re-schedule if the lock is not available.
                }
                else   // if deactivated
                {
                    smp_mb__before_clear_bit();
                    clear_bit(__QDISC_STATE_SCHED, &q->state);
                }
            }
        }
    }
}

struct softnet_data
{
    struct sk_buff_head input_pkt_queue;   // Ingress frames
    struct list_head poll_list;            // Related to rx frames
    struct Qdisc *output_queue;            // List of devices ready to transmit.
    struct sk_buff *completion_queue;      // Transmitted skbs for release.
    struct napi_struct backlog;
};

struct Qdisc
{
    --------------------
    struct netdev_queue *dev_queue;
    struct Qdisc *next_sched;
    unsigned long state;
    ---------------------
};

enum qdisc_state_t
{
    __QDISC_STATE_RUNNING,
    __QDISC_STATE_SCHED,
    __QDISC_STATE_DEACTIVATED,
};
• Whenever a device is scheduled for transmission,
    o the next frame to transmit is selected by qdisc_run(),
    o which indirectly calls the dequeue function of the associated queuing discipline and sends the frame to the selected netdev_queue of the device.

Note that this procedure can also be called by a watchdog timer.
__QDISC_STATE_RUNNING guarantees that only one CPU can process this qdisc at a time.
netif_tx_lock serializes accesses to the device driver.
qdisc_lock(q) and netif_tx_lock are mutually exclusive; if one is grabbed, the other must be free.
void qdisc_run(struct Qdisc *q)
{
    if (!test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
        __qdisc_run(q);   // If the flag is not already set, set it and run.
}

void __qdisc_run(struct Qdisc *q)
{
    unsigned long start_time = jiffies;

    while (qdisc_restart(q))   // Exit if the queue is empty or throttled.
    {
        // Postpone processing if
        //   1. another process needs the CPU;
        //   2. we've been doing it for too long (only one jiffy!!! NSM).
        if (need_resched() || jiffies != start_time)
        {
            __netif_schedule(q);
            break;
        }
    }
    clear_bit(__QDISC_STATE_RUNNING, &q->state);
}

int qdisc_restart(struct Qdisc *q)
{
    struct netdev_queue *txq;
    struct net_device *dev;
    struct sk_buff *skb;
    spinlock_t *root_lock;

    skb = dequeue_skb(q);   // Dequeue a packet from the associated Qdisc (tbf, prio, htb etc.).
    if (unlikely(!skb)) return 0;

    root_lock = qdisc_lock(q);   // Serializes accesses for this queue.
    dev = qdisc_dev(q);
    txq = netdev_get_tx_queue(dev, skb_get_queue_mapping(skb));

    return sch_direct_xmit(skb, q, dev, txq, root_lock);   // The dequeued packet is transmitted.
}

Returns to the caller:
    0   - queue is empty or throttled.
    > 0 - queue is not empty.

u16 skb_get_queue_mapping(struct sk_buff *skb)
{
    return skb->queue_mapping;
}

struct netdev_queue* netdev_get_tx_queue(struct net_device *dev, unsigned int index)
{
    return &dev->_tx[index];
}

int need_resched(void)
{
    return (test_thread_flag(TIF_NEED_RESCHED));
}
// Transmit one skb, and handle the return status as required.
//
// Holding the __QDISC_STATE_RUNNING bit guarantees that only one CPU can execute this function.
//
// Returns to the caller:
//   0   - queue is empty or throttled.
//   > 0 - queue is not empty.

int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
                    struct net_device *dev, struct netdev_queue *txq,
                    spinlock_t *root_lock)
{
    int ret = NETDEV_TX_BUSY;

    spin_unlock(root_lock);   // release qdisc
    HARD_TX_LOCK(dev, txq, smp_processor_id());

    if (!netif_tx_queue_stopped(txq) && !netif_tx_queue_frozen(txq))
        ret = dev_hard_start_xmit(skb, dev, txq);   // Final write in sw.

    HARD_TX_UNLOCK(dev, txq);
    spin_lock(root_lock);

    if (dev_xmit_complete(ret))
    {
        // Driver sent out skb successfully or skb was consumed.
        ret = qdisc_qlen(q);
    }
    else if (ret == NETDEV_TX_LOCKED)
    {
        // Driver try lock failed.
        ret = handle_dev_cpu_collision(skb, txq, q);
    }
    else
    {
        // Driver returned NETDEV_TX_BUSY - requeue skb.
        if (unlikely(ret != NETDEV_TX_BUSY && net_ratelimit()))
            printk(KERN_WARNING "BUG %s code %d qlen %d\n",
                   dev->name, ret, q->q.qlen);

        ret = dev_requeue_skb(skb, q);   // Put it back into the queue.
    }

    if (ret && (netif_tx_queue_stopped(txq) || netif_tx_queue_frozen(txq)))
        ret = 0;

    return ret;
}
void __netif_schedule(struct Qdisc *q)
{
    if (!test_and_set_bit(__QDISC_STATE_SCHED, &q->state))
        __netif_reschedule(q);
}

void __netif_reschedule(struct Qdisc *q)
{
    struct softnet_data *sd;
    unsigned long flags;

    local_irq_save(flags);
    sd = &__get_cpu_var(softnet_data);

    q->next_sched = sd->output_queue;
    sd->output_queue = q;   // Add q to output_queue.

    raise_softirq_irqoff(NET_TX_SOFTIRQ);
    local_irq_restore(flags);
}

// This function must run with irqs disabled!
raise_softirq_irqoff(unsigned int nr)
{
    __raise_softirq_irqoff(nr);

    // If we're in an interrupt or softirq, we're done. Otherwise we wake up ksoftirqd to
    // make sure we schedule the softirq soon.
    if (!in_interrupt())
        wakeup_softirqd();
}

#define __raise_softirq_irqoff(nr) \
    do { or_softirq_pending(1UL << (nr)); } while (0)

// We cannot loop indefinitely here to avoid userspace starvation, but we also don't want to
// introduce a worst-case 1/HZ latency to the pending events, so let the scheduler balance
// the softirq load for us.
void wakeup_softirqd(void)
{
    // Interrupts are disabled: no need to stop pre-emption.
    struct task_struct *tsk = __get_cpu_var(ksoftirqd);   // ksoftirqd is a kernel thread.

    if (tsk && tsk->state != TASK_RUNNING)
        wake_up_process(tsk);   // Wakes up ksoftirqd, which in turn calls net_tx_action().
}

enum qdisc_state_t
{
    __QDISC_STATE_RUNNING,
    __QDISC_STATE_SCHED,
    __QDISC_STATE_DEACTIVATED,
};

struct Qdisc
{
    --------------------
    struct netdev_queue *dev_queue;
    struct Qdisc *next_sched;
    unsigned long state;
    ---------------------
};
} // It is a kernel thread.
 Device drivers control whether the kernel may send packets to their transmit buffers, by stopping and starting their egress queues.

• When a device driver realizes that it does not have enough space in its buffer
  to store a frame of maximum size (MTU),
    o it calls one of the following APIs to STOP queue 0 / a specific queue / all queues from sending.
    o Some devices may support multiple egress or TX queues (sw: struct netdev_queue).

• But when it has enough space in its buffer for one MTU-sized packet,
    o it calls one of the following APIs to START queue 0 / a specific queue / all queues to send.

 The status of each egress queue is represented by the flag __QUEUE_STATE_XOFF.
    o Clearing this flag allows transmission for the specific egress queue of the network device.
    o Setting this flag stops transmission for the specific egress queue of the network device.
    o The status of this flag can be read using an API.

struct netdev_queue
{
    // read mostly part
    struct net_device *dev;
    struct Qdisc *qdisc, *qdisc_sleeping;
    unsigned long state;   // bit map

    // write mostly part
    spinlock_t _xmit_lock ____cacheline_aligned_in_smp;
    int xmit_lock_owner;
    unsigned long trans_start;
    unsigned long tx_bytes, tx_packets, tx_dropped;
} ____cacheline_aligned_in_smp;

enum netdev_queue_state_t
{
    __QUEUE_STATE_XOFF,
    __QUEUE_STATE_FROZEN,
};

// Called when the device is activated / to restart a stopped device queue.
void netif_start_queue(struct net_device *dev)
    - Enables egress queue 0 of the device to send.

void netif_tx_start_queue(struct netdev_queue *dev_queue)
{
    clear_bit(__QUEUE_STATE_XOFF, &dev_queue->state);
}   // Enable the specified queue to send.

netif_tx_queue_stopped(struct netdev_queue *dev_queue)
{
    return test_bit(__QUEUE_STATE_XOFF, &dev_queue->state);
}
    - Returns the status of the specified egress queue of a network device.
    - Status returned: Enabled or Disabled.

void netif_stop_queue(struct net_device *dev):
    - Disables egress queue 0 of the device from sending.

void netif_tx_stop_queue(struct netdev_queue *dev_queue)
{
    set_bit(__QUEUE_STATE_XOFF, &dev_queue->state);
}
    - Disable the specified queue from sending.

Drivers may also use the following function to restart a specific queue of a network
device. It will also check if anything needs to be transmitted. Immediate effect.

void netif_tx_wake_queue(struct netdev_queue *dev_queue)
{
    if (test_and_clear_bit(__QUEUE_STATE_XOFF, &dev_queue->state))
        __netif_schedule(dev_queue->qdisc);   // net_tx_action() is triggered.
}

A typical driver-side usage pattern is sketched below.
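The following sketch shows the usual pattern for a hypothetical driver "foo" (its private structure and ring helpers are placeholders; only the netif_* calls are real kernel APIs): stop the queue from ndo_start_xmit() when the TX ring is nearly full, and wake it from the TX-completion interrupt once descriptors have been reclaimed.

static netdev_tx_t foo_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
    struct foo_priv *priv = netdev_priv(dev);     /* hypothetical private data */

    foo_post_to_tx_ring(priv, skb);               /* hypothetical DMA setup    */

    if (foo_tx_ring_free(priv) < MAX_SKB_FRAGS + 1)
        netif_stop_queue(dev);                    /* sets __QUEUE_STATE_XOFF on queue 0 */

    return NETDEV_TX_OK;
}

static void foo_tx_complete_irq(struct net_device *dev)
{
    struct foo_priv *priv = netdev_priv(dev);

    foo_reclaim_tx_descriptors(priv);             /* hypothetical cleanup      */

    if (netif_queue_stopped(dev) && foo_tx_ring_free(priv) > MAX_SKB_FRAGS)
        netif_wake_queue(dev);                    /* clears XOFF and calls __netif_schedule() */
}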
struct qdisc_watchdog
{
    struct hrtimer timer;
    struct Qdisc *qdisc;
};

// The Qdisc private data (e.g. struct htb_sched, struct tbf_sched_data) embeds:
//     struct qdisc_watchdog watchdog;

• qdisc_watchdog_schedule() is called from dequeue functions such as htb_dequeue(),
  tbf_dequeue(), cbq_dequeue() and hfsc_dequeue().

• After the provided time expires, the timer callback function qdisc_watchdog() calls
  __netif_schedule(), which raises the softirq whose handler net_tx_action() then runs.

• As a result, one or more skbs are dequeued from the Qdisc and sent to the driver.

static int htb_init(struct Qdisc *sch, struct nlattr *opt)
{
    struct htb_sched *q = qdisc_priv(sch);
    --------------------------------------------------------
    qdisc_watchdog_init(&q->watchdog, sch);
    INIT_WORK(&q->work, htb_work_func);
}

int tbf_init(struct Qdisc *sch, struct nlattr *opt)
{
    struct tbf_sched_data *q = qdisc_priv(sch);
    qdisc_watchdog_init(&q->watchdog, sch);
}

void qdisc_watchdog_init(struct qdisc_watchdog *wd, struct Qdisc *qdisc)
{
    hrtimer_init(&wd->timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
    wd->timer.function = qdisc_watchdog;
    wd->qdisc = qdisc;
}

void qdisc_watchdog_schedule(struct qdisc_watchdog *wd, psched_time_t expires)
{
    ktime_t time;

    if (test_bit(__QDISC_STATE_DEACTIVATED, &qdisc_root_sleeping(wd->qdisc)->state))
        return;

    wd->qdisc->flags |= TCQ_F_THROTTLED;
    time = ktime_set(0, 0);
    time = ktime_add_ns(time, PSCHED_TICKS2NS(expires));
    hrtimer_start(&wd->timer, time, HRTIMER_MODE_ABS);
}

enum hrtimer_restart qdisc_watchdog(struct hrtimer *timer)
{
    struct qdisc_watchdog *wd = container_of(timer, struct qdisc_watchdog, timer);

    wd->qdisc->flags &= ~TCQ_F_THROTTLED;
    __netif_schedule(qdisc_root(wd->qdisc));   // Raise the softirq, whose handler calls net_tx_action().

    return HRTIMER_NORESTART;
}

// htb
htb_work_func(struct work_struct *work)
{
    struct htb_sched *q = container_of(work, struct htb_sched, work);
    struct Qdisc *sch = q->watchdog.qdisc;

    __netif_schedule(qdisc_root(sch));
}
pfifo_fast Qdisc:
 pfifo_fast is the default qdisc of each Linux network interface.
• Whenever an interface is created, the pfifo_fast qdisc is automatically used as a queue.
• If another qdisc is attached, it pre-empts the default pfifo_fast.
o pfifo_fast qdisc automatically returns to function when the existing qdisc is detached.

• A three-band first in, first out queue.

• It performs no shaping or rearranging of packets.
• It can be used inside a leaf class of a classful qdisc.

 ALGORITHM:

• The algorithm is very similar to that of the classful tc-prio qdisc.


• pfifo_fast is like three tc-pfifo queues side by side.
• Packets can be enqueued in any of the three bands based on
o Their Type of Service (tos) bits or assigned priority.

• Not all three bands are dequeued simultaneously.


o As long as lower bands have traffic, higher bands are never dequeued.
o This can be used to prioritize interactive traffic or penalize 'lowest cost' traffic.

• Each band can be txqueuelen packets long.


o Configurable using ifconfig or ip command.
o Additional packets coming in are not enqueued but are instead dropped.

• Go through tc-prio slides on how TOS bits are translated into bands.

• Does not maintain statistics and does not show up in tc qdisc ls.
This is because it is the automatic default in the absence of a configured qdisc.
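For reference, the band selection described above can be sketched as follows; the prio2band table matches net/sched/sch_generic.c in 2.6-era kernels (verify against your kernel source before relying on it).

static const u8 prio2band[TC_PRIO_MAX + 1] = {
    1, 2, 2, 2, 1, 2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1
};

/* e.g. TC_PRIO_INTERACTIVE (6) -> band 0 (dequeued first),
 *      TC_PRIO_BESTEFFORT  (0) -> band 1,
 *      TC_PRIO_BULK        (2) -> band 2 (dequeued last). */
int band = prio2band[skb->priority & TC_PRIO_MAX];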
Jiffies and HZ:

 A jiffy is a kernel unit of time declared in <linux/jiffies.h> (HZ itself comes from <asm/param.h>).

 HZ is a constant in Linux and is the number of times jiffies is incremented in one second.
• Each increment is called a tick.
• In other words, HZ determines the size of a jiffy: one jiffy lasts 1/HZ seconds.
• HZ depends on the hardware and on the kernel version.
o It also determines how frequently the clock interrupt fires. This is configurable on some architectures, fixed on
other ones.

 If HZ = 1,000, then jiffies is incremented 1,000 times per second (that is, one tick every 1/1,000 seconds).
• Once defined, the programmable interrupt timer (PIT) is programmed with that value.

• A value of HZ = 1000 gives a jiffy of 0.001 seconds.
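A small usage sketch of jiffies and HZ in kernel code: express intervals in ticks and compare them with the overflow-safe helpers.

#include <linux/jiffies.h>
#include <linux/kernel.h>

unsigned long timeout = jiffies + HZ / 10;          /* 100 ms from now (HZ ticks = 1 second) */

if (time_after(jiffies, timeout))
    printk(KERN_DEBUG "100 ms elapsed\n");

/* msecs_to_jiffies() hides the HZ dependence: */
unsigned long t2 = jiffies + msecs_to_jiffies(100);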

 To achieve perfection, the second bucket of TBF may contain only a single packet, which leads to the 1mbit/s limit.

 This limit is caused by the fact that the kernel can only throttle for at minimum 1 'jiffy', which depends on HZ as 1/HZ.
For perfect shaping, only a single packet can get sent per jiffy - for HZ=100, this means 100 packets of on average 1000
bytes each, which roughly corresponds to 1mbit/s.

 https://unix.stackexchange.com/questions/100785/bucket-size-in-tbf
 From the Linux TC man page, the only constraint on burst is that it must be high enough to allow your configured rate:
• Burst must be at least rate / HZ.
• HZ is a kernel configuration parameter.

• If HZ on the Linux system is 1000, then to hit a rate of 10 mbit/s I'd need a burst of at least 10,000,000 bits/sec ÷ 1000
Hz = 10,000 bits = 1250 bytes.

 Configure burst to be at least large enough to achieve the desired rate.


• Beyond that, you may increase it further, depending on what you're trying to achieve.
• But beyond this, burst is also a policy tool.
o It configures the extent to which you can use less bandwidth now to "save" it for future use.
o One common thing here is that you may want to allow small downloads (say, a web page) to go very fast, while
throttling big downloads. You do this by increasing burst to the size you consider a small download.
• (Though, you'd often switch to a classful qdisc like htb, so you can segment out the different traffic types.)
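Another worked application of the burst >= rate / HZ rule, with illustrative values (rate = 100 Mbit/s, HZ = 250, chosen only for the example):

unsigned long rate_bps    = 100000000UL;        /* 100 Mbit/s                      */
unsigned long hz          = 250;                /* kernel tick rate (illustrative) */
unsigned long burst_bits  = rate_bps / hz;      /* 400,000 bits                    */
unsigned long burst_bytes = burst_bits / 8;     /* 50,000 bytes, roughly 49 KiB    */
/* A tbf burst much smaller than ~50 kB could not sustain 100 Mbit/s on such a kernel. */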
