QoS - Linux - NSM - PACKET - FLOW
Netfilter provides functions and operations for packet filtering, NAT, and port translation.
Netfilter is a set of hooks inside the Linux kernel that allows kernel modules to register callback functions with the kernel's networking stack.
These callbacks, usually applied to traffic in the form of filtering and modification rules, are invoked for every packet that traverses the respective hook in the networking stack.
Netfilter Hooks:
PREROUTING: packets enter this chain before a routing decision is made (point 1 in Figure 6).
INPUT: the packet is going to be locally delivered. This has nothing to do with processes having an open socket; local delivery is controlled by the "local" routing table: ip route show table local (point 2 in Figure 6).
FORWARD: all packets that have been routed and are not for local delivery traverse this chain (point 3 in Figure 6).
OUTPUT: packets sent by the machine itself visit this chain (point 5 in Figure 6).
POSTROUTING: the routing decision has been made; packets enter this chain just before being handed off to the hardware (point 4 in Figure 6).
A packet keeps traversing a chain until one of the following happens:
• A rule matches the packet and decides its ultimate fate, for example by issuing an ACCEPT or DROP verdict (or a module returns such an ultimate fate); or
• A rule issues the RETURN verdict, in which case processing returns to the calling chain; or
• The end of the chain is reached; traversal either continues in the parent chain (as if RETURN was used) or, for a base chain, the chain policy — which is an ultimate fate — is applied.
netfilter hooks and packet processing:
https://www.netfilter.org/documentation/HOWTO/netfilter-hacking-HOWTO-3.html
Predefined chains have a policy, for example DROP, which is applied to the packet if it reaches the end of the chain.
User-defined chains have no policy; if a packet reaches the end of such a chain, it is returned to the chain that called it. A chain may be empty.
IPTABLES:
iptables allows an administrator to configure the tables provided by the Linux kernel firewall (implemented as different Netfilter modules) and the chains and rules they store.
Different kernel modules and programs are used for different protocols: iptables applies to IPv4, ip6tables to IPv6, arptables to ARP, and ebtables to Ethernet frames.
The IPsec module registers with the PREROUTING hook to decrypt packets; if an inbound policy is configured, it returns the decrypted packet to the hook.
The IPsec module registers with the POSTROUTING hook to encrypt packets. It needs to find the route before returning the encrypted packet to the hook. Encryption is done according to the configured policy.
https://www.digitalocean.com/community/tutorials/a-deep-dive-into-iptables-and-netfilter-architecture
Packet flow paths: packets start at a given box and flow along a certain path, depending on the circumstances.
The origin of the packet determines which chain it traverses first. There are five predefined chains (mapping to the five available Netfilter hooks, see Figure 5), though a table may not have all chains.
netif_receive_skb(struct sk_buff *skb): // net/core/dev.c
• Receive data processing function.
• Called from softirq context with interrupts enabled.
{
    skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
    skb = handle_macvlan(skb, &pt_prev, &ret, orig_dev);
    // Gives a copy of the frame to each registered L3 protocol handler (for skb->protocol).
}
http://www.embeddedlinux.org.cn/linux_net/0596002556/understandlni-CHP-19-SECT-2.html

ip_rcv(struct sk_buff *skb, struct net_device *dev) // ip_input.c: IPv4 packet reception.
{
    // IP checksum verification; sanity checks on IP header length and IP version.
    return NF_HOOK(PF_INET, NF_INET_PRE_ROUTING, skb, dev, NULL, ip_rcv_finish);
}
3. The IPsec module may register with this hook and check whether a matching inbound SA or IPsec policy is present. If an SA is not present, IKE negotiation is started, and inbound and outbound SAs are formed as a result of the negotiation with the peer.
4. Once all the registered applications have been called, if they process and return the packet without dropping or consuming it, the ip_rcv_finish() function is executed by the netfilter hook mechanism.
ip_rcv_finish(struct sk_buff *skb) // ip_input.c
{
    // Decides whether the packet is to be
    // 1. locally delivered, or
    // 2. forwarded — in which case it needs to find both the egress device and the next hop.

    // Parse and process the optional IP header contents if the IP header length > 5 x 4 = 20 bytes.
    if (iph->ihl > 5 && ip_rcv_options(skb)) // Parsing and processing some of the IP options.
        goto drop;
}

Every time the host queries the routing table, the result is saved into the route cache. Each cached route embeds a dst_entry (excerpt):

    u32 metrics[RTAX_MAX];
    union
    {
        struct dst_entry *next;
        struct rtable *rt_next;
        struct rt6_info *rt6_next;
        struct dn_route *dn_next;
    };

RTCF_BROADCAST: the destination address of the route is a broadcast address.
RTCF_MULTICAST: the destination address of the route is a multicast address.
RTCF_LOCAL: the destination address of the route is local (i.e., configured on one of the local interfaces). This flag is also set for local broadcast and multicast addresses.

// Updates the route cache with the route read from the multicast routing table.
// Details TBD
int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
                        u8 tos, struct net_device *dev)
{ // Unicast routing.
    // Perform a route lookup for the packet in skb.
    // The routing table can be a hash table or a "TRIE" tree:
    // the Forwarding Information Base (FIB).
    fib_lookup(net, &fl, &res);

    // Put the result into the cache.
    hash = rt_hash(daddr, saddr, fl->iif, rt_genid(dev_net(rth->u.dst.dev)));
    return rt_intern_hash(hash, rth, NULL, skb);
}

int __mkroute_input(struct sk_buff *skb, struct fib_result *res, struct in_device *in_dev,
                    __be32 daddr, __be32 saddr, u32 tos, struct rtable **result)
{
    rth->fl.fl4_dst = daddr;  rth->rt_dst = daddr;
    rth->fl.fl4_src = saddr;  rth->rt_src = saddr;  rth->rt_gateway = daddr;
    rth->fl.fl4_tos = tos;    rth->fl.mark = skb->mark;
    // Updates rtable's fields, such as rt_gateway and the metrics vector of the
    // embedded dst_entry structure.
}
struct iphdr
{
    __u8 ihl:4,
         version:4;
    __u8 tos;
    __be16 tot_len;
    __be16 id;
    __be16 frag_off;
    __u8 ttl;
    __u8 protocol;
    __sum16 check;
    __be32 saddr;
    __be32 daddr;
};
• This priority value of the skb is used by the packet scheduler and classifier layer of QoS to
  o schedule or classify the packet, as configured by the administrator.
• The Linux kernel translates the specified TOS value to a priority using an array.
  o The priority affects how and when a packet is transmitted from a queuing discipline.

int ip_forward(struct sk_buff *skb) // net/ipv4/ip_forward.c
{
    // Present towards the end of the function.
    skb->priority = rt_tos2priority(iph->tos);
    return NF_HOOK(PF_INET, NF_INET_FORWARD, skb, skb->dev,
                   rt->u.dst.dev, ip_forward_finish);
}
// Finally ip_forward_finish() is called, which internally calls dst_output().

__u8 ip_tos2prio[16] = // net/ipv4/route.c
{
    TC_PRIO_BESTEFFORT,
    ECN_OR_COST(BESTEFFORT),
    TC_PRIO_BESTEFFORT,
    ECN_OR_COST(BESTEFFORT),
    TC_PRIO_BULK,
    ECN_OR_COST(BULK),
    TC_PRIO_BULK,
    ECN_OR_COST(BULK),
    TC_PRIO_INTERACTIVE,
    ECN_OR_COST(INTERACTIVE),
    TC_PRIO_INTERACTIVE,
    ECN_OR_COST(INTERACTIVE),
    TC_PRIO_INTERACTIVE_BULK,
    ECN_OR_COST(INTERACTIVE_BULK),
    TC_PRIO_INTERACTIVE_BULK,
    ECN_OR_COST(INTERACTIVE_BULK)
};
Locally generated packet path:
ip_queue_xmit() is used by TCP, other transport layer protocols, and raw-socket-based applications.
{
    // This function returns both the next-hop gateway and the egress device to use.
    ip_route_output_flow(&rt, &fl, sk, 0);
    skb_dst_set(skb, dst_clone(&rt->u.dst));
    // Allocate and build the IP header, as we now know where to send the packet.
    // If required, fragment the packet.
    ip_local_out(skb);
}
// ip_finish_output2(), error path when there is no cached header and no neighbour:
if (net_ratelimit())
    printk(KERN_DEBUG "ip_finish_output2: No header cache and no neighbour!\n");
kfree_skb(skb);
return -EINVAL;

o The precise function associated with neigh->output depends on the state of the neighbour entry.
o If the L2 address is ready, the function will probably be neigh_connected_output(), so the header can be filled in right away and the packet transmitted.
o Otherwise, neigh->output will probably be initialized to neigh_resolve_output(), which puts the packet in the arp_queue queue, tries to resolve the address by sending a solicitation request, and waits until the solicitation reply arrives, whereupon the packet is transmitted.
o Whether the packet is sent immediately or queued, ip_finish_output2() returns the same value, indicating success.
o The packet is no longer the IP subsystem's responsibility after this point: when the solicitation reply arrives, the neighbouring subsystem dequeues the packet from arp_queue and sends it to the device.
int neigh_hh_output(struct hh_cache *hh, struct sk_buff *skb)
{
    unsigned seq;
    int hh_len;   // hardware header length
    int hh_alen;

    hh_len = hh->hh_len;
    hh_alen = HH_DATA_ALIGN(hh_len);
    memcpy(skb->data - hh_alen, hh->hh_data, hh_alen); // Prepare and copy the L2 header into the skb.
    skb_push(skb, hh_len);
    return hh->hh_output(skb); // net/ipv4/arp.c initializes hh_output with dev_queue_xmit(), which transmits/queues the skb.
}
dev_queue_xmit():
• It is the interface between the neighbouring subsystem (ARP) and the Traffic Control subsystem.
  o It stands between the neighbouring protocol and the device driver.
• Makes sure the L4 checksum (TCP/UDP) is computed, unless the device computes the checksum in hardware.
• Selects which frame to transmit (the one pointed to by the input sk_buff may not be the one transmitted, because there is a queue to honour).
• If the device is configured to use the Traffic Control infrastructure, it is associated with a queuing discipline (Qdisc).
  o In this case, interfacing to Traffic Control (the QoS layer) is done through the qdisc_run() function.
  o Based on the status of the outgoing queue, this buffer may or may not be the one actually sent next; it may be queued in the Qdisc.
  o It appears (NSM) that a default Qdisc is attached to every interface, which can be changed with Linux tc commands.
• Invokes hard_start_xmit() directly if the device does not use the Traffic Control infrastructure (i.e., virtual devices).
int dev_queue_xmit(struct sk_buff *skb)
{
    struct net_device *dev = skb->dev;
    struct netdev_queue *txq;
    struct Qdisc *q;
    int rc;

    // Handle fragments and the TCP/UDP checksum as described above.

    // Disable soft irqs for the various locks below; also stops preemption for RCU.
    rcu_read_lock_bh();

    txq = dev_pick_tx(dev, skb); // Select one of the multiple TX queues of a physical device.
    q = rcu_dereference(txq->qdisc); // Get the attached Qdisc.
#ifdef CONFIG_NET_CLS_ACT
    skb->tc_verd = SET_TC_AT(skb->tc_verd, AT_EGRESS);
#endif

    // The device has no queue. Common case for software devices: loopback and tunnels...
    if (dev->flags & IFF_UP)
    {
        if (!netif_tx_queue_stopped(txq))
        {
            rc = dev_hard_start_xmit(skb, dev, txq); // NSM: transmit the packet and get out.
            if (dev_xmit_complete(rc))
            {
                HARD_TX_UNLOCK(dev, txq);
                goto out;
            }
        }
        if (net_ratelimit())
            printk(KERN_CRIT "Virtual device %s asks to queue packet!\n", dev->name);
        // NSM: we need to add a Qdisc of a suitable type and queue the packets on this
        // interface based on the traffic.
    }
}

struct netdev_queue
{
    // read-mostly part
    struct net_device *dev;
    struct Qdisc *qdisc;
    struct Qdisc *qdisc_sleeping;
    unsigned long state;

    // write-mostly part
    spinlock_t _xmit_lock ____cacheline_aligned_in_smp;
    int xmit_lock_owner;
    unsigned long trans_start; // please use this field instead of dev->trans_start
    unsigned long tx_bytes, tx_packets, tx_dropped;
};
int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
                   struct net_device *dev, struct netdev_queue *txq)
{
    spinlock_t *root_lock = qdisc_lock(q); // Serialize access to this queue.
    int rc;

    spin_lock(root_lock);
    if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
    {
        kfree_skb(skb);
        rc = NET_XMIT_DROP;
    }
    else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
             !test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
    {
        // This is a work-conserving queue; there are no old skbs waiting to
        // be sent out, and the qdisc is not running - xmit the skb directly.
        if (sch_direct_xmit(skb, q, dev, txq, root_lock))
            __qdisc_run(q);
        else
            clear_bit(__QDISC_STATE_RUNNING, &q->state);
        rc = NET_XMIT_SUCCESS;
    }
    else
    {
        rc = qdisc_enqueue_root(skb, q); // NSM: enqueue the skb to the Qdisc.
        qdisc_run(q); // Try sending something from the device's queue.
    }
    spin_unlock(root_lock);
    return rc;
}

struct qdisc_skb_cb
{
    int pkt_len;
    char data[];
};

struct qdisc_skb_cb *qdisc_skb_cb(struct sk_buff *skb)
{
    return (struct qdisc_skb_cb *)skb->cb;
}

int qdisc_enqueue_root(struct sk_buff *skb, struct Qdisc *sch)
{
    qdisc_skb_cb(skb)->pkt_len = skb->len;
    return qdisc_enqueue(skb, sch) & NET_XMIT_MASK;
}

int qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch)
{
#ifdef CONFIG_NET_SCHED
    if (sch->stab)
        qdisc_calculate_pkt_len(skb, sch->stab);
#endif
    return sch->enqueue(skb, sch); // Enqueue the packet to the Qdisc (tbf, htb, prio, etc.).
}
QDISCS (Queueing Disciplines): a scheduler for traffic control flow.
Immediately afterwards, the kernel tries to get as many packets as possible from the Qdisc to give them to the network adaptor driver.
A simple Qdisc is 'pfifo', which does no processing at all and is a pure First In, First Out queue.
  o It does, however, store traffic when the network interface can't handle it momentarily.
The Qdisc is the major building block on which all of Linux traffic control is built.
  o Classful Qdiscs:
    • They can contain classes and provide a handle to which to attach filters.
  o Classless Qdiscs:
    • They can contain no classes, nor is it possible to attach a filter to a classless Qdisc.
    • Because a classless Qdisc contains no children of any kind, there is no utility in classifying; this means that no filter can be attached to a classless Qdisc.
The networking subsystem of Linux is assigned two different softirqs:
• NET_RX_SOFTIRQ - handles incoming traffic.
• NET_TX_SOFTIRQ - handles outgoing traffic.
As different instances of the same softirq handler can run concurrently on different CPUs (unlike tasklets),
• the networking code is both low latency and scalable.

softnet_data structure: both softirqs refer to this structure.
• Each CPU owns a separate instance, so no lock is required.

struct softnet_data
{
    struct sk_buff_head input_pkt_queue; // Ingress frames.
    struct list_head poll_list; // Related to receive frames.
    struct Qdisc *output_queue; // List of devices ready to transmit.
    struct sk_buff *completion_queue; // Transmitted skbs awaiting release.
    struct napi_struct backlog;
};

Registration of the softirqs:

int __init net_dev_init(void)
{
    for_each_possible_cpu(i) // Initialize softnet_data per CPU.
    {
        struct softnet_data *queue;

        queue = &per_cpu(softnet_data, i);
        skb_queue_head_init(&queue->input_pkt_queue);
        queue->completion_queue = NULL;
        INIT_LIST_HEAD(&queue->poll_list);
    }
    // Register the network softirq handlers.
    open_softirq(NET_TX_SOFTIRQ, net_tx_action);
    open_softirq(NET_RX_SOFTIRQ, net_rx_action);
    hotcpu_notifier(dev_cpu_callback, 0);
}

enum
{
    HI_SOFTIRQ = 0,
    TIMER_SOFTIRQ,
    NET_TX_SOFTIRQ, // NW
    NET_RX_SOFTIRQ, // NW
    BLOCK_SOFTIRQ,
    BLOCK_IOPOLL_SOFTIRQ,
    TASKLET_SOFTIRQ,
    SCHED_SOFTIRQ,
    HRTIMER_SOFTIRQ,
    RCU_SOFTIRQ,
    NR_SOFTIRQS
};
• Egress frames are placed into the specialized queues handled by Traffic
Control (QoS layer) instead of being handled by softirqs and the
softnet_data structure.
• But softirqs are still used to clean up transmitted buffers afterward, to keep
that task from slowing transmission.
The net_tx_action() function is a softirq handler and is called in two cases:
1. To do housekeeping on the buffers that are no longer needed.
2. When there are devices (Qdiscs) waiting to transmit something.

void net_tx_action(struct softirq_action *h)
{
    struct softnet_data *sd = &__get_cpu_var(softnet_data);

    if (sd->completion_queue) // 1. Free the skbs in the completion queue.
    {
        struct sk_buff *clist;

        local_irq_disable();
        clist = sd->completion_queue;
        sd->completion_queue = NULL;
        local_irq_enable();

        while (clist)
        {
            struct sk_buff *skb = clist;
            clist = clist->next;
            WARN_ON(atomic_read(&skb->users));
            __kfree_skb(skb);
        }
    }

    if (sd->output_queue) // 2. Run the scheduled Qdiscs.
    {
        struct Qdisc *head;

        local_irq_disable();
        head = sd->output_queue;
        sd->output_queue = NULL;
        local_irq_enable();

        while (head)
        {
            struct Qdisc *q = head;
            spinlock_t *root_lock;

            head = head->next_sched;
            root_lock = qdisc_lock(q);

            if (spin_trylock(root_lock))
            {
                smp_mb__before_clear_bit();
                clear_bit(__QDISC_STATE_SCHED, &q->state);
                qdisc_run(q); // Dequeue skb(s) from the Qdisc and send.
                spin_unlock(root_lock);
            }
            else
            {
                if (!test_bit(__QDISC_STATE_DEACTIVATED, &q->state))
                {
                    __netif_reschedule(q); // Re-schedule if the lock is not available.
                }
                else // if deactivated
                {
                    smp_mb__before_clear_bit();
                    clear_bit(__QDISC_STATE_SCHED, &q->state);
                }
            }
        }
    }
}

struct Qdisc // excerpt
{
    --------------------
    struct netdev_queue *dev_queue;
    struct Qdisc *next_sched;
    unsigned long state;
    ---------------------
};

enum qdisc_state_t
{
    __QDISC_STATE_RUNNING,
    __QDISC_STATE_SCHED,
    __QDISC_STATE_DEACTIVATED,
};
• Whenever a device is scheduled for transmission, the next frame to transmit is selected by qdisc_run().
  o qdisc_run() indirectly calls the dequeue function of the associated queuing discipline and sends the frame to the selected netdev_queue of the device.
  o Note that this procedure can also be called by a watchdog timer.
• __QDISC_STATE_RUNNING guarantees that only one CPU processes this qdisc at a time.
• netif_tx_lock serializes accesses to the device driver.
• qdisc_lock(q) and netif_tx_lock are mutually exclusive: if one is grabbed, the other must be free.
void qdisc_run(struct Qdisc *q)
{
    if (!test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
        __qdisc_run(q); // If the flag was not set, set it and run.
}

void __qdisc_run(struct Qdisc *q)
{
    unsigned long start_time = jiffies;

    while (qdisc_restart(q)) // Exit if the queue is empty or throttled.
    {
        // Postpone processing if
        // 1. another process needs the CPU;
        // 2. we've been doing it for too long (only one jiffy! NSM).
        if (need_resched() || jiffies != start_time)
        {
            __netif_schedule(q);
            break;
        }
    }
    clear_bit(__QDISC_STATE_RUNNING, &q->state);
}

int qdisc_restart(struct Qdisc *q)
{
    struct netdev_queue *txq;
    struct net_device *dev;
    struct sk_buff *skb;
    spinlock_t *root_lock;

    skb = dequeue_skb(q); // Dequeue a packet from the associated Qdisc (tbf, prio, htb, etc.).
    if (unlikely(!skb))
        return 0;

    root_lock = qdisc_lock(q); // Serializes accesses to this queue.
    dev = qdisc_dev(q);
    txq = netdev_get_tx_queue(dev, skb_get_queue_mapping(skb));

    return sch_direct_xmit(skb, q, dev, txq, root_lock); // The dequeued packet is transmitted.
}
// Returns to the caller: 0 - queue is empty or throttled; > 0 - queue is not empty.

u16 skb_get_queue_mapping(struct sk_buff *skb)
{
    return skb->queue_mapping;
}

struct netdev_queue *netdev_get_tx_queue(struct net_device *dev, unsigned int index)
{
    return &dev->_tx[index];
}

int need_resched(void)
{
    return test_thread_flag(TIF_NEED_RESCHED);
}
// Transmit one skb, and handle the return status as required.
// Holding the __QDISC_STATE_RUNNING bit guarantees that only one CPU can execute this function.
// Returns to the caller: 0 - queue is empty or throttled; > 0 - queue is not empty.

int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
                    struct net_device *dev, struct netdev_queue *txq,
                    spinlock_t *root_lock)
{
    int ret = NETDEV_TX_BUSY;

    spin_unlock(root_lock); // Release the qdisc lock.

    HARD_TX_LOCK(dev, txq, smp_processor_id());
    if (!netif_tx_queue_stopped(txq) && !netif_tx_queue_frozen(txq))
        ret = dev_hard_start_xmit(skb, dev, txq); // Final write in software.
    HARD_TX_UNLOCK(dev, txq);

    spin_lock(root_lock);

    if (dev_xmit_complete(ret))
    {
        // The driver sent out the skb successfully, or the skb was consumed.
        ret = qdisc_qlen(q);
    }
    else if (ret == NETDEV_TX_LOCKED)
    {
        // The driver try-lock failed.
        ret = handle_dev_cpu_collision(skb, txq, q);
    }
    else
    {
        // The driver returned NETDEV_TX_BUSY - requeue the skb.
        if (unlikely(ret != NETDEV_TX_BUSY && net_ratelimit()))
            printk(KERN_WARNING "BUG %s code %d qlen %d\n",
                   dev->name, ret, q->q.qlen);
        ret = dev_requeue_skb(skb, q); // Put it back into the queue.
    }

    if (ret && (netif_tx_queue_stopped(txq) || netif_tx_queue_frozen(txq)))
        ret = 0;

    return ret;
}
void __netif_schedule(struct Qdisc *q)
{
    if (!test_and_set_bit(__QDISC_STATE_SCHED, &q->state))
        __netif_reschedule(q);
}

void __netif_reschedule(struct Qdisc *q)
{
    struct softnet_data *sd;
    unsigned long flags;

    local_irq_save(flags);
    sd = &__get_cpu_var(softnet_data);
    q->next_sched = sd->output_queue;
    sd->output_queue = q; // Add q to output_queue.
    raise_softirq_irqoff(NET_TX_SOFTIRQ);
    local_irq_restore(flags);
}

void raise_softirq_irqoff(unsigned int nr) // This function must run with irqs disabled!
{
    __raise_softirq_irqoff(nr);
    // If we're in an interrupt or softirq, we're done. Otherwise we wake up
    // ksoftirqd to make sure we schedule the softirq soon.
    if (!in_interrupt())
        wakeup_softirqd();
}

#define __raise_softirq_irqoff(nr) \
    do { or_softirq_pending(1UL << (nr)); } while (0)

// We cannot loop indefinitely here to avoid userspace starvation, but we also
// don't want to introduce a worst-case 1/HZ latency to the pending events, so
// we let the scheduler balance the softirq load for us.

void wakeup_softirqd(void)
{
    // Interrupts are disabled: no need to stop preemption.
    struct task_struct *tsk = __get_cpu_var(ksoftirqd); // ksoftirqd is a kernel thread.

    if (tsk && tsk->state != TASK_RUNNING)
        wake_up_process(tsk); // Wakes up ksoftirqd, which in turn calls net_tx_action().
}
Device drivers control the packet transmission logic by gating when packets are sent to the device buffer.
• When a device driver realizes that it does not have enough space in its buffer to store a frame of maximum size (MTU),
  o it calls an API (such as netif_stop_queue(), below) to stop queue 0, a specific queue, or all queues from sending.
  o Some devices may support multiple egress (TX) queues (in software: struct netdev_queue).
• When it again has enough space in its buffer for one MTU-sized packet,
  o it calls the corresponding API to allow queue 0, a specific queue, or all queues to send.

The status of each egress queue is represented by the flag __QUEUE_STATE_XOFF:
  o Clearing this flag allows transmission on the specific egress queue of the network device.
  o Setting this flag stops transmission on the specific egress queue of the network device.
  o The status of this flag can be read using an API.

enum netdev_queue_state_t // bit map
{
    __QUEUE_STATE_XOFF,
    __QUEUE_STATE_FROZEN,
};

void netif_stop_queue(struct net_device *dev):
- Disables egress queue 0 of the device from sending.

Drivers may also use a wake/restart function to restart a specific queue of a network device. It also checks whether there is anything to transmit, and takes effect immediately.
• As a result, one or more skbs are dequeued from the Qdisc and sent to the driver.

static int htb_init(struct Qdisc *sch, struct nlattr *opt)
{
    struct htb_sched *q = qdisc_priv(sch);
    --------------------------------------------------------
    qdisc_watchdog_init(&q->watchdog, sch);
    INIT_WORK(&q->work, htb_work_func);
}

int tbf_init(struct Qdisc *sch, struct nlattr *opt)
{
    struct tbf_sched_data *q = qdisc_priv(sch);
    qdisc_watchdog_init(&q->watchdog, sch);
}

void qdisc_watchdog_schedule(struct qdisc_watchdog *wd, psched_time_t expires)
{
    ktime_t time;

    if (test_bit(__QDISC_STATE_DEACTIVATED, &qdisc_root_sleeping(wd->qdisc)->state))
        return;

    wd->qdisc->flags |= TCQ_F_THROTTLED;
    time = ktime_set(0, 0);
    time = ktime_add_ns(time, PSCHED_TICKS2NS(expires));
    hrtimer_start(&wd->timer, time, HRTIMER_MODE_ABS);
}

// When the watchdog fires (or the htb work function runs), the Qdisc is
// scheduled for transmission again:
__netif_schedule(qdisc_root(sch));
pfifo_fast Qdisc:
pfifo_fast is the default qdisc of each Linux network interface.
• Whenever an interface is created, the pfifo_fast qdisc is automatically used as its queue.
• If another qdisc is attached, it pre-empts the default pfifo_fast.
  o The pfifo_fast qdisc automatically returns to duty when the other qdisc is detached.
ALGORITHM:
• See the tc-prio slides for how the TOS bits are translated into bands.
• pfifo_fast maintains no statistics and does not show up in tc qdisc ls, because it is the automatic default in the absence of a configured qdisc.
Jiffies and HZ:
HZ is a kernel constant: the number of times jiffies is incremented in one second.
• Each increment is called a tick.
• In other words, 1/HZ is the length of a jiffy.
• HZ depends on the hardware and on the kernel version.
  o It also determines how frequently the clock interrupt fires. This is configurable on some architectures and fixed on others.
If HZ = 1,000, then jiffies is incremented 1,000 times per second (that is, one tick every 1/1,000 seconds).
• Once defined, the programmable interrupt timer (PIT) is programmed with that value ...
To achieve perfect shaping, the second bucket of TBF may contain only a single packet, which leads to the 1 mbit/s limit.
This limit exists because the kernel can throttle for at minimum one jiffy, whose length is 1/HZ seconds.
For perfect shaping, only a single packet can be sent per jiffy; for HZ = 100, this means 100 packets per second of on average 1,000 bytes each, which roughly corresponds to 1 mbit/s.
https://unix.stackexchange.com/questions/100785/bucket-size-in-tbf
From the Linux tc man page, the only constraint on burst is that it must be high enough to allow the configured rate:
• burst must be at least rate / HZ.
• HZ is a kernel configuration parameter.
• If HZ on the Linux system is 1,000, then to hit a rate of 10 mbit/s you need a burst of at least 10,000,000 bits/s ÷ 1,000 Hz = 10,000 bits = 1,250 bytes.