[翻译]linux 任务调度

本文译自 University of Edinburgh 的 Volker Seeker 的 Process Scheduling in Linux ，介绍了 Linux kernel 3.1 的任务调度机制。

原文 : PDF 中英文混排 : git

1. 进程调度

现在的 Linux 内核是一个多任务内核。因此，任何时刻都可以有不止一个进程存在，并且每个进程运行起来就像系统只存在它自己一个进程。进程调度器负责何时运行那个进程。这种情况下，调度其要完成下面的任务：

为所有运行的进程公平的分配处理器资源
如果需要的话挑选合适的进程作在下一次进程切换后运行，这需要考虑调度类型和策略，以及进程优先级
在 SMP 系统中要在多个处理器核心之间均衡进程运行

1.2. Linux 的进程和线程

在 Linux 中进程就是一组共享了线程组 ID （TGID）的线程和所需的必要资源，并且两者之间并没有什么不同。内核调度的是不同的线程而不是进程。因此名词 “任务（task）” 在本文中表示的是线程。

在 Linux 中 task_struct（include/linux/sched.h）是用来表示一个特定任务的全部信息的结构体。

2. 任务分类

2.1. CPU 限制 vs. I/O 限制

任务要么是限制在 CPU 要么限制在 I/O。也就是说，一些线程会使用 CPU 进行大量的运算，而另一些则会花费很多时间等待相对较慢的 I/O 操作完成。在本文， I/O 操作可能是等待用户输入，磁盘或网络访问。

同时这种分类也强烈依赖于任务运行的系统。一个服务器或者 HPC （超级计算机）的负载通常是 CPU 限制任务，而桌面或者移动平台的负载主要是 I/O 限制任务。

Linux 操作系统运行在这些系统之上，也因此设计时就是要应对所有类型的任务。要解决这个问题， Linux 的调度器需要负责及时响应 I/O 限制型任务，提高 CPU 限制型任务的效率。如果任务要运行更长的时间周期，这样他们呢就能完成更多的工作，但是响应性会受影响。如果每个任务的时间周期短一些，则系统能更快的相应 I/O 事件，但是这又会导致花费更多的时间在运行调度算法在任务切换上，效率会会受到影响。这就要求调度器需要有取有舍，并且要在两个需求之间取得平衡。

2.2. 实时 vs. 普通

更进一步，运行在 Linux 上的任务可以明确的分为实时（RT）任务和普通任务。实时任务有严格的时序要求，也因此比系统上的其他任务优先级要高。通过调度分类，针对 RT 任务和普通任务采用了不同的调度策略实现调度器。

2.3. 任务优先级值

在内核里，高的优先级在数值上更小。实时任务的优先级从 1（最高）到 99，而普通任务的优先级从 100 到 139（最低）。然而这样又会在使用系统调用或者调度器库函数修改任务优先级时产生混乱。因为数字的顺序可以被反转和/或被映射到不同的值（即 nice 值）

3. 调度类

Linux 的进程调度器是模块化的，它使用不同算法和策略调度不同类型任务。一个调度算法的实现被封装在一个所谓的调度类（scheduling class）。一个调度类提供接口给主调度器框架，框架就可以根据算法的实现来处理对应的任务。

在文件 include/linux/sched.h 可以找到 sched_class 数据结构：

struct sched_class {

    const struct sched_class *next;
    void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
    void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
    void (*yield_task) (struct rq *rq);
    bool (*yield_to_task) (struct rq *rq, struct task_struct *p, bool preempt);
    void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
    struct task_struct * (*pick_next_task) (struct rq *rq);
    void (*put_prev_task) (struct rq *rq, struct task_struct *p);
#ifdef CONFIG_SMP
    int (*select_task_rq)(struct task_struct *p, int sd_flag, int flags);
    void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
    void (*post_schedule) (struct rq *this_rq);
    void (*task_waking) (struct task_struct *task);
    void (*task_woken) (struct rq *this_rq, struct task_struct *task);
    void (*set_cpus_allowed)(struct task_struct *p, const struct cpumask *newmask);
    void (*rq_online)(struct rq *rq);
    void (*rq_offline)(struct rq *rq);
#endif
    void (*set_curr_task) (struct rq *rq);
    void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
    void (*task_fork) (struct task_struct *p);
    void (*switched_from) (struct rq *this_rq, struct task_struct *task);
    void (*switched_to) (struct rq *this_rq, struct task_struct *task);
    void (*prio_changed) (struct rq *this_rq, struct task_struct *task,  int oldprio);
    unsigned int (*get_rr_interval) (struct rq *rq,  struct task_struct *task);
#ifdef CONFIG_FAIR_GROUP_SCHED
    void (*task_move_group) (struct task_struct *p, int on_rq);
#endif
};

除了第一个成员，结构体全部成员都是函数指针，可以被调度器框架用来调用对应的策略实现钩子函数。

内核中所有存在的调度类都按它们的照优先级放在一个链表上。结构体上的名字是 next 的第一个成员就一个指针，指向链表中一个较低优先级的调度类。

这个链表用来对不同类型任务进行排序。在本文所用的内核版本中，完整的链表如下所述：

stop_sched_class → rt_sched_class → fair_sched_class → idle_sched_class → NULL

停止类和空闲类是特殊的调度类。停止类是用来来掉调度每个 CPU 上那些抢占一切任务但不可以被抢占的停止任务，而空闲类是用来调度每个 CPU 上那些没有可运行任务时才会运行的空闲任务（同时也叫交换任务）其它两个就是之前提到的实时任务和普通任务。

4. 主运行队列

主运行队列在每个 CPU 上都有一份，数据结构定义在 kernel/sched.c。它记录了特定 CPU 上所有的可运行的任务，并且还管理和每个处理器负载或域负载相关的不同调度的统计信息。更进一步来说，它包括下面几项：

一个锁，用来同步当前 CPU 的调度操作
```
raw_spinlock_t lock;
```
分别指向当前运行的任务、空闲任务、停止任务的 task_struct 结构体指针
```
struct task_struct *curr, *idle, *stop;
```
公平调度和实时调度类的运行队列数据结构
```
struct cfs_rq cfs;
struct rt_rq rt;
```

5. 调度框架

5.1. 调度器入口

进入进程调度器的主入口是函数 schedule() ，定义在文件 kernel/sched.c。内核其余部分都要使用这个函数来调用进程调度器，决定那个进程运行以及接下来运行那个进程。

schedule() 的主要目标是挑选下一个运行的任务，并且把这个进程传给本地变量 next。最后它就执行上下文切换进入新的任务。如果只有 prev 一个任务并且 prev 还可以运行，那么重新调度基本上就意味着 schedule() 什么都没改变。

/*
* __schedule() is the main scheduler function.
*/
static void __sched __schedule(void)
{
    struct task_struct *prev, *next;
    unsigned long *switch_count;
    struct rq *rq;
    int cpu;
need_resched:
    preempt_disable();
    cpu = smp_processor_id();
    rq = cpu_rq(cpu);
    rcu_note_context_switch(cpu);
    prev = rq->curr;
    schedule_debug(prev);
    if (sched_feat(HRTICK))
        hrtick_clear(rq);
    raw_spin_lock_irq(&rq->lock);
    switch_count = &prev->nivcsw;
    if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
        if (unlikely(signal_pending_state(prev->state, prev))) {
            prev->state = TASK_RUNNING;
        } else {
            deactivate_task(rq, prev, DEQUEUE_SLEEP);
            prev->on_rq = 0;
            /*
            * If a worker went to sleep, notify and ask workqueue
            * whether it wants to wake up a task to maintain
            * concurrency.
            */
            if (prev->flags & PF_WQ_WORKER) {
                struct task_struct *to_wakeup;
                to_wakeup = wq_worker_sleeping(prev, cpu);
                if (to_wakeup)
                    try_to_wake_up_local(to_wakeup);
            }
        }
        switch_count = &prev->nvcsw;
    }
    pre_schedule(rq, prev);
    if (unlikely(!rq->nr_running))
        idle_balance(cpu, rq);
    put_prev_task(rq, prev);
    next = pick_next_task(rq);
    clear_tsk_need_resched(prev);
    rq->skip_clock_update = 0;
    if (likely(prev != next)) {
        rq->nr_switches++;
        rq->curr = next;
        ++*switch_count;
        context_switch(rq, prev, next); /* unlocks the rq */
        /*
        * The context switch have flipped the stack from under us
        * and restored the local variables which were saved when
        * this task called schedule() in the past. prev == current
        * is still correct, but it can be moved to another cpu/rq.
        */
        cpu = smp_processor_id();
        rq = cpu_rq(cpu);
    } else
        raw_spin_unlock_irq(&rq->lock);
    post_schedule(rq);
    preempt_enable_no_resched();
    if (need_resched())
        goto need_resched;
}

因为 Linux 内核是可以抢占的，所以在内核空间执行的任务代码是可以被高优先级任务随时打断的。这就会使得被抢占的任务在内核空间没有执行完的操作只能等待下次调度执行了。因此调度函数要做的第一件事是调用 preempt_disable() 禁止抢占，这样以来调度中的线程就不会在执行临界区操作时被打断。

第二步，系统会通过锁定当前 CPU 的运行队列锁来建立另一个保护，因为某一时刻只允许一个线程修改运行队列。

接下来，schedule() 要检查 prev 指向的之前执行的任务。如果这个任务在内核模式是不可运行的并且还没有被抢占，那么它就应该被移出运行队列。然而如果该任务有非阻塞的等待信号，那就把它的运行状态设置为 TASK_RUNNING 然后继续留在运行队列里。这就意味着 prev 指向的任务又获取了一次被调度执行的机会。

要将一个任务移出运行队列需要在内部调用 deactivate_task() ，这个函数会调用该任务所属的调度类的钩子函数 dequeue_task()。

static void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
{
    update_rq_clock(rq);
    sched_info_dequeued(p);
    p->sched_class->dequeue_task(rq, p, flags);
}

下一个动作是检查当前 CPU 的运行队列是否还有可以运行的任务。如果没有的话就调用 idle_balance() 从其他 CPU 获取一些可以运行的任务（参见负载均衡）。

put_prev_task() 是一个调度类的钩子函数，是用来告诉任务所属的类这个任务将要被切换出当前 CPU 。

现在就要通过调用 pick_next_task() 来命令相应的调度类挑选下一个适合在当前 CPU 上运行的任务。紧接着就是清掉之前最开始为了调用 schedule() 而设置的 need_resched 标志。

need_resched 标志位于结构体 task_struct ，并且定期内核会检查该标志。这是一种告诉内核其它任务应该运行了、schedule() 函数需要尽快被调用的途径。

/*
* Pick up the highest-prio task:
*/
static inline struct task_struct *
pick_next_task(struct rq *rq)
{
    const struct sched_class *class;
    struct task_struct *p;
    /*
    * Optimization: we know that if all tasks are in
    * the fair class we can call that function directly:
    */
    if (likely(rq->nr_running == rq->cfs.nr_running)) {
        p = fair_sched_class.pick_next_task(rq);
        if (likely(p))
            return p;
    }
    for_each_class(class) {
        p = class->pick_next_task(rq);
        if (p)
            return p;
    }
    BUG(); /* the idle class will always have a runnable task */
}

pick_next_task() 也是在 sched.c 中实现的。它会通过我们的链表遍历全部调度类，找出拥有可运行任务的优先级最高的调度类（参见上文的调度类）。如果这个类找到了，就会调用对应的钩子函数。因为大部分的任务都是由 sched_fair 调度类处理的，所以在函数开始会有一个快捷方式直接调用公平调度类的钩子。

现在 schedule() 要检查 pick_next_task() 是否找出一个可以运行的新任务，或者找出的任务和当前运行的任务是同一个。如果是后者的话，就不会进行任务切换，只需要让当前任务保持运行就行了。如果找到一个新任务，大多数情况下是这样的，那么实际的任务切换是通过调用 contex_switch() 执行的。在函数 context_switch() 中会执行切换到新任务的内存映射表以及交换寄存器状态和任务栈操作。

完成之后，就要解锁运行队列，重新允许优先级抢占。如果抢占被禁止时有抢占被发起，那么 schedule() 会立刻执行。

5.2. 调用调度器

在看了调度器入口代码之后，让我们来看看实际中何时会调用 schedule() 函数。在内核代码中有三个主要的时机会发生任务切换：

1. 周期性更新当前调度的任务

函数 scheduler_tick() 会被时钟中断周期性调用。它的目标是更新运行队列的始终，CPU 负载以及当前任务的运行时计数器。

/*
* This function gets called by the timer code, with HZ frequency.
* We call it with interrupts disabled.
*/
void scheduler_tick(void)
{
    int cpu = smp_processor_id();
    struct rq *rq = cpu_rq(cpu);
    struct task_struct *curr = rq->curr;
    sched_clock_tick();
    raw_spin_lock(&rq->lock);
    update_rq_clock(rq);
    update_cpu_load_active(rq);
    curr->sched_class->task_tick(rq, curr, 0);
    raw_spin_unlock(&rq->lock);
    perf_event_task_tick();
#ifdef CONFIG_SMP
    rq->idle_at_tick = idle_cpu(cpu);
    trigger_load_balance(rq, cpu);
#endif
}

你可以看到 scheduler_tick() 调用了调度类的钩子 task_tick()，它是相应的调度类用来进行周期性任务更新。在其内部，调度类可以决定一个新任务是否需要被调度，以及为任务设置 need_resched 标志来告诉内核尽快调用 schedule().

在 scheduler_tick() 结尾，你也看到了，如果内核设置了 SMP 那么还会调用负载均衡。

2. 当前任务需要睡眠

在 Linux 内核的实现中，如果一个进程要等待某个特定事件发生，那么它就要进入睡眠状态。这通常都会遵循一个特定的模式：

/* ‘q’ is the wait queue we wish to sleep on */
DEFINE_WAIT(wait);
add_wait_queue(q, &wait);
while (!condition)   /* condition is the event that we are waiting for */
{
    prepare_to_wait(&q, &wait, TASK_INTERRUPTIBLE);
    if (signal_pending(current))
        /* handle signal */
        schedule();
}
finish_wait(&q, &wait);

任务会创建一个等待队列，然后把自己放进去。之后会启动一个循环来等待一个特定的条件变成真，在这个循环中，任务会把自己的状态设置成 TASK_INTERRUPTIBLE（可中断的睡眠状态）或 TASK_UNITERRUPTIBLE（不可中断的睡眠状态）。如果是前者，则任务会被它能处理的等待信号唤醒。

如果需要的时间还没有发生，任务就会调用 schedule() 然后进入睡眠。 schedule() 将会吧这个任务从运行队列中移出（参见调度器入口）。如果条件成真则跳出循环，将任务从等待队列中删除。

从这里你可以看到 schedule() 总是在任务进入睡眠之前被调用，用来挑选下一个要运行的任务。

3. 唤醒睡眠任务

产生唤醒睡眠任务的事件的代码通常都会在对应的等待队列上调用 wake_up() ，最后终止在调度器函数 try_to_wake_up()（ttwu）。

这个函数干三件事：

把将要唤醒的任务放回运行队列。
通过把任务的状态设为 TASK_RUNNING 来唤醒任务。
如果唤醒的任务比当前运行的任务的优先级高，则还要设置 need_resched 标志来调用 schedule()

/**
* try_to_wake_up - wake up a thread
* @p: the thread to be awakened
* @state: the mask of task states that can be woken
* @wake_flags: wake modifier flags (WF_*)
*
* Put it on the run-queue if it's not already there. The "current"
* thread is always on the run-queue (except when the actual
* re-schedule is in progress), and as such you're allowed to do
* the simpler "current->state = TASK_RUNNING" to mark yourself
* runnable without the overhead of this.
*
* Returns %true if @p was woken up, %false if it was already running
* or @state didn't match @p's state.
*/
static int
try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
{
    unsigned long flags;
    int cpu, success = 0;
    smp_wmb();
    raw_spin_lock_irqsave(&p->pi_lock, flags);
    if (!(p->state & state))
        goto out;
    success = 1; /* we're going to change ->state */
    cpu = task_cpu(p);
    if (p->on_rq && ttwu_remote(p, wake_flags))
        goto stat;
#ifdef CONFIG_SMP
    /*
    * If the owning (remote) cpu is still in the middle of schedule() with
    * this task as prev, wait until its done referencing the task.
    */
    while (p->on_cpu) {
#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
        /*
        * In case the architecture enables interrupts in
        * context_switch(), we cannot busy wait, since that
        * would lead to deadlocks when an interrupt hits and
        * tries to wake up @prev. So bail and do a complete
        * remote wakeup.
        */
        if (ttwu_activate_remote(p, wake_flags))
            goto stat;
#else
        cpu_relax();
#endif
    }
    /*
    * Pairs with the smp_wmb() in finish_lock_switch().
    */
    smp_rmb();
    p->sched_contributes_to_load = !!task_contributes_to_load(p);
    p->state = TASK_WAKING;
    if (p->sched_class->task_waking)
        p->sched_class->task_waking(p);
    cpu = select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
    if (task_cpu(p) != cpu) {
        wake_flags |= WF_MIGRATED;
        set_task_cpu(p, cpu);
    }
#endif /* CONFIG_SMP */
    ttwu_queue(p, cpu);
stat:
    ttwu_stat(p, cpu, wake_flags);
out:
    raw_spin_unlock_irqrestore(&p->pi_lock, flags);
    return success;
}

在做了一些初始错误检查和 SMP 技巧之后就调用 ttwu_queue() 进行唤醒工作。

static void ttwu_queue(struct task_struct *p, int cpu)
{
    struct rq *rq = cpu_rq(cpu);
#if defined(CONFIG_SMP)
    if (sched_feat(TTWU_QUEUE) && cpu != smp_processor_id()) {
        sched_clock_cpu(cpu); /* sync clocks x-cpu */
        ttwu_queue_remote(p, cpu);
        return;
    }
#endif
    raw_spin_lock(&rq->lock);
    ttwu_do_activate(rq, p, 0);
    raw_spin_unlock(&rq->lock);
}
static void
ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags)
{
#ifdef CONFIG_SMP
    if (p->sched_contributes_to_load)
        rq->nr_uninterruptible--;
#endif
    ttwu_activate(rq, p, ENQUEUE_WAKEUP | ENQUEUE_WAKING);
    ttwu_do_wakeup(rq, p, wake_flags);
}

ttwu_queue() 函数锁定了运行队列并调用 ttwu_do_activate() ，而 ttwu_do_activate() 又会调用 ttwu_activate() 执行步骤 1 ，而 ttwu_do_wakeup() 会去执行步骤 2 和 3。

static void ttwu_activate(struct rq *rq, struct task_struct *p, int en_flags)
{
    activate_task(rq, p, en_flags);
    p->on_rq = 1;
    /* if a worker is waking up, notify workqueue */
    if (p->flags & PF_WQ_WORKER)
        wq_worker_waking_up(p, cpu_of(rq));
}
/*
* activate_task - move a task to the runqueue.
*/
static void activate_task(struct rq *rq, struct task_struct *p, int flags)
{
    if (task_contributes_to_load(p))
        rq->nr_uninterruptible--;
    enqueue_task(rq, p, flags);
    inc_nr_running(rq);
}
static void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
{
    update_rq_clock(rq);
    sched_info_queued(p);
    p->sched_class->enqueue_task(rq, p, flags);
}

如果你跟随 ttwu_activate() 的调用链，最终在会停在 enqueue_task() 调用相应的调度类的钩子函数，而 enqueue_task() 就是我们之前在 schedule() 看到的用来把任务放回运行队列的函数。

/*
* Mark the task runnable and perform wakeup-preemption.
*/
static void
ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
{
    trace_sched_wakeup(p, true);
    check_preempt_curr(rq, p, wake_flags);
    p->state = TASK_RUNNING;
#ifdef CONFIG_SMP
    if (p->sched_class->task_woken)
        p->sched_class->task_woken(rq, p);
    if (rq->idle_stamp) {
        u64 delta = rq->clock - rq->idle_stamp;
        u64 max = 2*sysctl_sched_migration_cost;
        if (delta > max)
            rq->avg_idle = max;
        else
            update_avg(&rq->avg_idle, delta);
        rq->idle_stamp = 0;
    }
#endif
}
static void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
{
    const struct sched_class *class;
    if (p->sched_class == rq->curr->sched_class) {
        rq->curr->sched_class->check_preempt_curr(rq, p, flags);
    } else {
        for_each_class(class) {
            if (class == rq->curr->sched_class)
                break;
            if (class == p->sched_class) {
                resched_task(rq->curr);
                break;
            }
        }
    }
    /*
    * A queue event has occurred, and we're going to schedule. In
    * this case, we can save a useless back to back clock update.
    */
    if (rq->curr->on_rq && test_tsk_need_resched(rq->curr))
        rq->skip_clock_update = 1;
}

ttwu_do_wakeup() 要检查当前运行的任务是否要被刚唤醒的现在处于运行队列的任务抢占。 check_preempt_curr() 函数内部最后会调用相应调度类的钩子函数，可能会设置 need_resched 标志。在任务的标志被设置成 TASK_RUNNING 之后，唤醒过程就完了。

6. 调度算法简史

这里是一份 Linux 的调度算法的开发历史：

1995 1.2 循环运行队列与采用循环调度进程调度算法
1999 2.2 引入了调度类和实时任务，不可抢占任务和非实时任务，开始支持 SMP
2001 2.4 O(N) 复杂度的调度器，将时间分片，每个任务只被允许一段特定时间片，遍历 N 可运行任务，使用 goodness 函数来决定下一个运行的任务
2003 2.6 O(1) 复杂度的调度器，针对每个优先级使用了多个运行队列，是一个更有效且可扩展的 O(N) 版调度算法，引入了奖励系统来应对交互式 vs 批处理任务
2008 2.6.23 完全公平调度器

7. 完全公平调度器

现在让我们描述一下用于 NORMAL 类型任务的调度算法：目前的调度器是 Ingo Molnar 基于 Con Kolivas 开发的 Rotating Staircase Deadline Scheduler（RSDL）开发的完全公平调度算法。 CFS 的实现就隐藏在 /kernel/sched_fair.c 的 fair_sched_class 调度类。

CFS 算法是基于一个完美的多任务处理器理念实现的。这样一个处理器系统表面上会同时运行当前所有的活动任务，而每个任务都会获取一份属于自己的处理器运算能力。举个例子，如果两个任务都是可运行的，那么他们就可以同时运行，分别占有 50% 的处理器运算能力。这也就意味着，加入这两个任务同时启动，每个任务的执行时间或者运行时间在任何时候都是相同的，因此是完全公平的。你也可以说每个任务都运行一个无穷小的时间但是完全占用全部的处理器运算能力。既然物理上是不可能以这种方式使用处理器，并且这样子也是极度没有效率的，因为任务切换的损耗会导致实际任务运行时间很短。调度器会记录每个可运行的任务的运行时，即所谓的虚拟运行时，并且尝试随时在全部可运行任务之间维护均衡。

这个原则的基础是指定的全部可运行任务是公平分享处理器的。全部份额会随着运行中的处理器个数变化而增加或者减少，而相对份额会基于任务的优先级变化而增加或减少。

7.1. 时间片 vs. 虚拟运行时

在 CFS 调度算法中，将每个任务的时间划分成固定长度的时间片这种传统模型已经不再存在了。作为替代，每个任务都引入了虚拟运行时（vruntime）。在为每一个使用 CFS 调度的任务执行 scheduler_tick() 时，任务被调度以后，它的 vruntime 会和已经流逝的时间一起更新。一旦运行队列中另一个任务的 vruntime 比当前任务的小，那么就会执行一次任务调度，接下来 vruntime 最小的任务就会得到执行。

为了避免因为频繁进行任务切换而造成任务调度开销过大， Linux 引入了调度延迟和最小任务运行时间粒度。

目标调度延迟（TSL）是一个用来计算最小任务运行时间的常量。假如 TSL 是 20 ms，而我们有两个同等优先级任务在运行。每个任务都会在被对方抢占前运行 10ms。

如果任务的数量在增加，则这部分的运行时会减少，甚至为 0 ，这就会导致一个毫无效率的切换频率。要应对这点，CFS 引入了常量“最小粒度”，根据运行中任务的数量来扩大 TSL 。

7.2. 优先级

调度中任务的 vruntime 增长的速度来决定了一个任务是否比另一个任务优先级高。时间差产生的原因是任务的调度是根据任务的优先级决定的。这也意味着高优先级任务的 vruntime 增长的速度要比低优先级任务慢。（译注：即增长的越慢，可能运行的时间越长）

7.3. 应对 I/O 限制型和 CPU 限制型任务

在之前的调度算法中， O(1) 调度其尝试使用启发式的睡眠和运行时来决定一个任务是 I/O 限制型任务还是 CPU 限制型任务。这样就可以让某个比另一个更占优势。根据之后的结果发现这个原则并不能满足预期的工作要求，因为启发法太复杂了而且容易出错。

而 CFS 的原则是为每个任务提供公平的计算机使用，实际工作效果相当好，也容易应用。我打算用 Robert Love 的 Linux Kernel Development 中的例子来解释 CFS 是怎么工作的：

考虑一个系统有两个可运行的任务：一个文本编辑器和一个视频编码器。文本编辑器是 I/O 限制型任务因为它花费了几乎全部的时间在等待用户的按键输入。然而当文本编辑器捕获到键盘按下后，用户就期待它立即响应。对应的，视频编码器就是一个处理器限制型任务。

除了从磁盘读取原始的数据流和将结果写入视频文件，编码器几乎全部的时间都用在了编解码视频原始数据上，很轻松的就会使用了 100% 的处理器。视频编码对何时运行并没有很强的限制——用户是不可能区分出是立刻运行还是半秒之后运行的，他们也不关心这一点。当然了，越快完成越好，但是延迟并不一个主要考虑的因素。在这种场景下，理想的调度其会给文本编辑器相较于视频编码器更多的处理器时间，因为文本编辑器是交互型的。我们对文本编辑器有两个目标。第一，我们希望他有大量可用的处理器时间；不是因为他需要大量处理器时间（它并不需要）而是因为我们希望他在需要处理器时总可以使用处理器。第二，我们希望文本编辑器在被唤醒时可以抢占视频编码器（也就是在当用户按下按键时）。这一点可以保证文本编辑器能够提供很好的交互性能和及时的响应用户输入。

CFS 通过下面的办法解决了这个问题：调度器保证文本编辑器有特定的处理器时间比例而不是赋给他特定的优先级和时间片。如果视频编码器和文本编辑器是仅有的运行中的进程，并且有着相同的友好级（nice level），这个比例可以是 50% —— 每个任务都保证占有一半的处理器时间。因为文本编辑器把大部分的时间花费到了阻塞上，等待用输入，所以他它没有用尽 10% 的处理器。相对的，视频编码器能够自由使用超过分配给自己的 50% 的处理器，这就是他能够快速的完成编码工作。

一个重要的原则当文本编辑器被唤醒时。我们的主要目标是保证它能根据用户的输入立即运行。在这种情况下，当编辑器唤醒后， CFS 注意到它分配了 50% 的处理器但是相当大的部分并没有使用。特别的是， CFS 确定了文本编辑器已经运行的时间比视频编码器要少。调度器会尝试给全部进程分配公平的的处理器时间，然后抢占了视频编码器并让文本编辑器开始运行。文本编辑器开始运行，迅速的处理用户的按下的按键，然后就进入睡眠等待更多的输入。因为文本编辑器并没有消耗完分配给它的 50% 处理器，我们继续这个方式， CFS 总会让文本编辑器能够在它想运行的时刻运行，然后视频编码器会运行在剩余的时间。

7.4. 创建新任务

CFS 维护了一个 vruntime_min 值，这是个单调递增的值，记录了运行队列中全部任务最小的 vruntime 值。这个值会加到运行队列上以此来给新任务一个更快的调度机会。

7.5. 运行队列 - 红黑树

CFS 在运行队列中使用红黑树数据结构。这个数据结构是一个自平衡二叉搜索树。遵循一套确定的规则，树中不会有一条路径会是另一条的两倍长。更进一步，这种树的操作时间复杂度是 O(log n)，这就允许快速、高效的插入和删除节点的操作。

树中没一个节点代表一个而任务，他们根据任务的 vruntime 排序。这就意味着二叉树最左边的节点总是指向 vruntime 最小的任务，也就是说这个任务最需要处理器。缓存最左边的节点可以更快和更方便的访问节点。

7.6. 公平组调度

组调度是另一种给任务调度增加公平性的途径，特别是面对任务又生成很多任务的情况。考虑到服务器会生成很多任务来并行处理收到的接入连接。 CFS 引入了组来描述这种行为而不是统一的公平对待全部任务。生成任务的服务器进程在组内（根据层级）共享他们的 vruntime ，而单独任务会维护自己独立的 vruntime。在这种办法中，但任务接收和组粗略相等调度时间。

让我们假设机器上除了服务器任务，另一个用户只有一个运行中的任务。如果没有组调度，第二个用户相对于服务器会受到不公平的对待。通过组调度， CFS 会首先尝试对组公平，然后再对组内进程公平。因此两个用户都获得 50% 的 CPU。

组可以通过 /proc 接口进行配置。

8. CFS 实现细节

8.1. 数据结构

CFS 调度器为 Linux 的进程调度器引入了一个名叫 sched_entity 的结构体。它最主要的用处是记录单个任务的运行时间，并且作为一个成员添加到了每个任务的 task_struct 结构体中。它定义在文件 include/linux/sched.h 中：

struct sched_entity {

    struct load_weight load; /* for load-balancing */
    struct rb_node run_node;
    struct list_head group_node;
    unsigned int on_rq;
    u64 exec_start;
    u64 sum_exec_runtime;
    u64 vruntime;
    u64 prev_sum_exec_runtime;
    u64 nr_migrations;

#ifdef CONFIG_SCHEDSTATS
    struct sched_statistics statistics;
#endif

#ifdef CONFIG_FAIR_GROUP_SCHED
    struct sched_entity *parent;
    /* rq on which this entity is (to be) queued: */
    struct cfs_rq *cfs_rq;
    /* rq "owned" by this entity/group: */
    struct cfs_rq *my_q;
#endif

};

另一个和 CFS 相关的字段是一个添加到运行队列的数据结构的成员变量 cfs。它的类型是 cfs_rq ，在文件 kernel/sched.c 中实现。它包含一个由指向全部运行中的 CFS 任务的指针的链表，CFS 调度器的红黑树的根，一个指向最左边节点的指针，最小虚拟运行时（min_vruntime)，指向前一个任务和当前调度的任务的指针以及额外的用于公平组调度和 SMP 调度和负载运行的成员变量。任务的优先级已经编码到了 load_weight 数据结构中。

/* CFS-related fields in a runqueue */

struct cfs_rq {
    struct load_weight load;
    unsigned long nr_running;
    u64 exec_clock;
    u64 min_vruntime;

#ifndef CONFIG_64BIT
    u64 min_vruntime_copy;
#endif

    struct rb_root tasks_timeline;
    struct rb_node *rb_leftmost;
    struct list_head tasks;
    struct list_head *balance_iterator;

    /*
    * 'curr' points to currently running entity on this cfs_rq.
    * It is set to NULL otherwise (i.e when none are currently running).
    */
    struct sched_entity *curr, *next, *last, *skip;

#ifdef CONFIG_SCHED_DEBUG
    unsigned int nr_spread_over;
#endif

#ifdef CONFIG_FAIR_GROUP_SCHED
    struct rq *rq; /* cpu runqueue to which this cfs_rq is attached */
    /*
    * leaf cfs_rqs are those that hold tasks (lowest schedulable entity in
    * a hierarchy). Non-leaf lrqs hold other higher schedulable entities
    * (like users, containers etc.)
    *
    * leaf_cfs_rq_list ties together list of leaf cfs_rq's in a cpu. This
    * list is used during load balance.
    */
    int on_list;
    struct list_head leaf_cfs_rq_list;
    struct task_group *tg;/* group that "owns" this runqueue */

#ifdef CONFIG_SMP
    /*
    * the part of load.weight contributed by tasks
    */
    unsigned long task_weight;

    /*
    * h_load = weight * f(tg)
    *
    * Where f(tg) is the recursive weight fraction assigned to
    * this group.
    */

    unsigned long h_load;

    /*
    * Maintaining per-cpu shares distribution for group scheduling
    *
    * load_stamp is the last time we updated the load average
    * load_last is the last time we updated the load average and saw load
    * load_unacc_exec_time is currently unaccounted execution time
    */
    u64 load_avg;
    u64 load_period;
    u64 load_stamp, load_last, load_unacc_exec_time;
    unsigned long load_contribution;
#endif
#endif

};

8.2. 时间审计

如上所述， vruntime 是用来记录 CFS 的红黑树中的可运行任务的虚拟时间的。调度器框架的函数 scheduler_tick() 周期性调用 CFS 跟新任务的钩子函数 task_tick() ：

/*
* scheduler tick hitting a task of our scheduling class:
*/
static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
    struct cfs_rq *cfs_rq;
    struct sched_entity *se = &curr->se;
    for_each_sched_entity(se) {
        cfs_rq = cfs_rq_of(se);
        entity_tick(cfs_rq, se, queued);
    }

}

task_tick_fair() 调用 entity_tick() 是为了任务调度实体和相应的运行队列。entity_tick() 有两个主要任务：第一是为当前调度的任务更新运行时数据，第二是检查当前任务是否需要被抢占。

entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
    /*
    * Update run-time statistics of the 'current'.
    */
    update_curr(cfs_rq);
    /*
    * Update share accounting for long-running entities.
    */
    update_entity_shares_tick(cfs_rq);
#ifdef CONFIG_SCHED_HRTICK
    /*
    * queued ticks are scheduled to match the slice, so don't bother
    * validating it and just reschedule.
    */
    if (queued) {
        resched_task(rq_of(cfs_rq)->curr);
        return;
    }
    /*
    * don't let the period tick interfere with the hrtick preemption
    */
    if (!sched_feat(DOUBLE_TICK) &&
        hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
        return;
#endif
    if (cfs_rq->nr_running > 1 || !sched_feat(WAKEUP_PREEMPT))
        check_preempt_tick(cfs_rq, curr);
}

函数 update_curr() 负责更新当前任务的运行时数据。它会计算当前任务自从上次被调用以后使用了的时间，然后使用函数 __update_curr() 将结果 delta_exec 传给出去。

static void update_curr(struct cfs_rq *cfs_rq)
{
    struct sched_entity *curr = cfs_rq->curr;
    u64 now = rq_of(cfs_rq)->clock_task;
    unsigned long delta_exec;
    if (unlikely(!curr))
        return;
    /*
    * Get the amount of time the current task was running
    * since the last time we changed load (this cannot
    * overflow on 32 bits):
    */
    delta_exec = (unsigned long)(now - curr->exec_start);
    if (!delta_exec)
        return;
    __update_curr(cfs_rq, curr, delta_exec);
    curr->exec_start = now;
    if (entity_is_task(curr)) {
        struct task_struct *curtask = task_of(curr);
        trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
        cpuacct_charge(curtask, delta_exec);
        account_group_exec_runtime(curtask, delta_exec);
    }
}

运行时差值可以加权当前任务的优先级，而这个值是编码在 load_weigth 当中的，而且这个结果会加到当前任务的 vruntime 。这就是 vruntime 增长的更快或者更慢的地方，这依赖于任务的优先级，同时你也可以看到 __update_curr() 更新了 min_vruntime 。

/*
 * Update the current task's runtime statistics. Skip current tasks that
 * are not in our scheduling class.
*/
static inline void __update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr,unsigned long delta_exec)
{
    unsigned long delta_exec_weighted;
    schedstat_set(curr->statistics.exec_max,max((u64)delta_exec, curr->statistics.exec_max));
    curr->sum_exec_runtime += delta_exec;
    schedstat_add(cfs_rq, exec_clock, delta_exec);
    delta_exec_weighted = calc_delta_fair(delta_exec, curr);
    curr->vruntime += delta_exec_weighted;
    update_min_vruntime(cfs_rq);
#if defined CONFIG_SMP && defined  CONFIG_FAIR_GROUP_SCHED
    cfs_rq->load_unacc_exec_time += delta_exec;
#endif
}

除了周期性更新外，在对应的任务变成可运行或者将要睡眠的时候 update_current() 也会被调用。

回到 entity_tick()。在更新了当前任务之后，会调用 check_preempt_tick() 来满足 CFS 的基本原则：为每个任务提供一个公平的 CPU 份额。当前进程的 vruntime 会和红黑树中最左边的任务进行比较，以此来决定进程切换是否是必须的。

/*
* Preempt the current task with a newly woken task if needed:
*/
static void
check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
    unsigned long ideal_runtime, delta_exec;
    ideal_runtime = sched_slice(cfs_rq, curr);
    delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
    if (delta_exec > ideal_runtime) {
        resched_task(rq_of(cfs_rq)->curr);
        /*
        * The current task ran long enough, ensure it doesn't get
        * re-elected due to buddy favours.
        */
        clear_buddies(cfs_rq, curr);
        return;
    }
    /*
    * Ensure that a task that missed wakeup preemption by a
    * narrow margin doesn't have to wait for a full slice.
    * This also mitigates buddy induced latencies under load.
    */
    if (!sched_feat(WAKEUP_PREEMPT))
        return;
    if (delta_exec < sysctl_sched_min_granularity)
        return;
    if (cfs_rq->nr_running > 1) {
        struct sched_entity *se = __pick_first_entity(cfs_rq);
        s64 delta = curr->vruntime - se->vruntime;
        if (delta < 0)
            return;
        if (delta > ideal_runtime)
            resched_task(rq_of(cfs_rq)->curr);
    }
}

删除一个节点的方法也是类似的。首先更新 vruntime 使用然后 __dequeue_entity() 将任务从树上删除掉。

static void dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
    /*
    * Update run-time statistics of the 'current'.
    */
    update_curr(cfs_rq);
    update_stats_dequeue(cfs_rq, se);
    if (flags & DEQUEUE_SLEEP) {
#ifdef CONFIG_SCHEDSTATS
        if (entity_is_task(se)) {
            struct task_struct *tsk = task_of(se);
            if (tsk->state & TASK_INTERRUPTIBLE)
                se->statistics.sleep_start = rq_of(cfs_rq)->clock;
            if (tsk->state & TASK_UNINTERRUPTIBLE)
                se->statistics.block_start = rq_of(cfs_rq)->clock;
        }
#endif
    }
    clear_buddies(cfs_rq, se);
    if (se != cfs_rq->curr)
        __dequeue_entity(cfs_rq, se);
    se->on_rq = 0;
    update_cfs_load(cfs_rq, 0);
    account_entity_dequeue(cfs_rq, se);
    /*
    * Normalize the entity after updating the min_vruntime because the
    * update can refer to the ->curr item and we need to reflect this
    * movement in our normalized position.
    */
    if (!(flags & DEQUEUE_SLEEP))
        se->vruntime -= cfs_rq->min_vruntime;
    update_min_vruntime(cfs_rq);
    update_cfs_shares(cfs_rq);
}

你所看到的其它函数调用是组调度，更新调度数据和伙伴系统所必须的，伙伴系统用来挑选下一个要运行的任务。

8.4. 挑选下一个可运行的任务

主调度函数 schedule() 会调用拥有可运行任务的最高优先级的调度类的 pick_next_task() 。如果调用的是 CFS 调度类的这个函数，调度类钩子函数会调用 pick_next_task_fair()。

如果这个调度类中没有任务，则直接返回 NULL。否则调用 pick_next_entity() 从树中挑选下一个任务。之后会转发给 set_next_entity() 来将任务从树上删除，因为调度了的进程是不允许继续存在这里。这个 while 循环使用来进行公平组调度的。

static struct task_struct *pick_next_task_fair(struct rq *rq)
{
    struct task_struct *p;
    struct cfs_rq *cfs_rq = &rq->cfs;
    struct sched_entity *se;
    if (!cfs_rq->nr_running)
        return NULL;
    do {
        se = pick_next_entity(cfs_rq);
        set_next_entity(cfs_rq, se);
        cfs_rq = group_cfs_rq(se);
    } while (cfs_rq);
    p = task_of(se);
    hrtick_start_fair(rq, p);
    return p;
}

在 pick_next_entity() 中，树中最左边的任务并不一定会被选出来运行。伙伴系统会给出选择下一个要运行的任务的等级。这被认为是有益于缓存位置和组调度。

/*
* Pick the next process, keeping these things in mind, in this order:
* 1) keep things fair between processes/task groups
* 2) pick the "next" process, since someone really wants that to run
* 3) pick the "last" process, for cache locality
* 4) do not run the "skip" process, if something else is available
*/
static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
{
    struct sched_entity *se = __pick_first_entity(cfs_rq);
    struct sched_entity *left = se;
    /*
    * Avoid running the skip buddy, if running something else can
    * be done without getting too unfair.
    */
    if (cfs_rq->skip == se) {
        struct sched_entity *second = __pick_next_entity(se);
        if (second && wakeup_preempt_entity(second, left) < 1)
            se = second;
    }
    /*
    * Prefer last buddy, try to return the CPU to a preempted task.
    */
    if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
        se = cfs_rq->last;
    /*
    * Someone really wants this to run. If it's not unfair, run it.
    */
    if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
        se = cfs_rq->next;
    clear_buddies(cfs_rq, se);
    return se;
}

__pick_first_entity() 从树中挑选出最左边的节点，你可以从下面的代码看出来这是非常简单和快速的：

static struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
{
    struct rb_node *left = cfs_rq->rb_leftmost;
    if (!left)
        return NULL;
    return rb_entry(left, struct sched_entity, run_node);
}

9. 软实时调度

Linux 调度器支持软实时（RT）任务调度。这意味着内核可以有效的调度有严格时间限制的任务。虽然内核又能力满足对截止时间要求非常严格的实时任务调度，但是并不能保证一定能满足这个截止时间。

对应的调度类是在 kernel/sched_rt.c 实现的 rt_sched_class 。 RT 任务拥有比 CFS 任务高的优先级。

9.1. 调度模式

由 RT 调度器处理的任务可以配置成下面两种不同的模式：

SCHED_FIFO ：一个按照 FIFO（先入先出）模式调度的任务并没有时间片，并且只会在任务终止、放弃处理器时被切换。
SCHED_RR ：一个 RR 任务会按照固定的时间片进行调度，然后会按照循环方式被相同优先级的任务抢占。这就是说只要任务的时间片用完了就会被放到任务队列的末尾并且时间片会重新填满。在时间片用完之前，任务同样也可能被高优先级的任务抢占。

9.2. 优先级

根据优先级的实现内容，RT 调度类也遵循之前 O(1) （复杂度）调度器同样的规则。内核使用多个运行队列，每个队列对应一个优先级。按照这种方式，添加、删除或者寻找最高优先级任务的操作可以在 O(1) 时间内完成。

9.3. 实时调节

你偶尔可以看到 RT 调度器实现中的调节和带宽操作。添加这些操作是为了提高 SCHED_FIFO 任务的安全性。在任务自己想要或不想要被打断之前，这些操作会为每个处理器分配一个由 FIFO 任务组成的 RT 任务组。这就在内核开发者当中引起了一些讨论，而讨论的结果还没有定下来。

9.4. 实现细节

数据结构

和 CFS 类似， RT 调度器有它自己的调度实体和运行队列数据结构，这些已经作为成员添加到了 task_struct 和主运行队列。

sched_rt_entity 在 /include/linux/sched.h 实现。它有记录时间片的字段，一个指向它所属优先级列表的指针，以及和组调度相关的成员。

struct sched_rt_entity {
    struct list_head run_list;
    unsigned long timeout;
    unsigned int time_slice;
    int nr_cpus_allowed;
    struct sched_rt_entity *back;
#ifdef CONFIG_RT_GROUP_SCHED
    struct sched_rt_entity*parent;
    /* rq on which this entity is (to be) queued: */
    struct rt_rq *rt_rq;
    /* rq "owned" by this entity/group: */
    struct rt_rq *my_q;
#endif
};

rt_rq 在 kernel/sched.c 实现。第一个字段记录了优先级数组。其他几乎所有的字段都是与 SMP 和组调度相关的。

/* Real-Time classes' related field in a runqueue: */
struct rt_rq {
    struct rt_prio_array active;
    unsigned long rt_nr_running;
#if defined CONFIG_SMP || defined CONFIG_RT_GROUP_SCHED
    struct {
        int curr; /* highest queued rt task prio */
#ifdef CONFIG_SMP
        int next; /* next highest */
#endif
    } highest_prio;
#endif
#ifdef CONFIG_SMP
    unsigned long rt_nr_migratory;
    unsigned long rt_nr_total;
    int overloaded;
    struct plist_head pushable_tasks;
#endif
    int rt_throttled;
    u64 rt_time;
    u64 rt_runtime;
    /* Nests inside the rq lock: */
    raw_spinlock_t rt_runtime_lock;
#ifdef CONFIG_RT_GROUP_SCHED
    unsigned long rt_nr_boosted;
    struct rq *rq;
    struct list_head leaf_rt_rq_list;
    struct task_group *tg;
#endif
};

记录时间

scheduler_tick() 调用 task_tick_rt() 更新当前任务的时间片。实现相当简单：一开始就使用 update_curr_rt() 更新当前任务的运行时数据和对应的运行队列。然后这个函数返回当前任务是否是 FIFO 任务。

如果不是， RR 任务的时间片就会减 1 。如果减到 0 了就设回默认值然后如果还任务队列中有其他任务的话就把当前任务放到队列的末尾。此外还要设置 need_resched 标志。

static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
{
    update_curr_rt(rq);
    watchdog(rq, p);
    /*
    * RR tasks need a special form of timeslice management.
    * FIFO tasks have no timeslices.
    */
    if (p->policy != SCHED_RR)
        return;
    if (--p->rt.time_slice)
        return;
    p->rt.time_slice = DEF_TIMESLICE;
    /*
    * Requeue to the end of queue if we are not the only element
    * on the queue:
    */
    if (p->rt.run_list.prev != p->rt.run_list.next) {
        requeue_task_rt(rq, p, 0);
        set_tsk_need_resched(p);
    }
}

挑选下一个运行的任务

如之前所说的，挑选下一个运行的任务可以在常数时间复杂度内完成。以函数 pick_next_task_rt() 开始，它会直接调用 _pick_next_task_rt() 完成实际工作。

static struct task_struct *pick_next_task_rt(struct rq *rq)
{
    struct task_struct *p = _pick_next_task_rt(rq);
    /* The running task is never eligible for pushing */
    if (p)
        dequeue_pushable_task(rq, p);
#ifdef CONFIG_SMP
    /*
    * We detect this state here so that we can avoid taking the RQ
    * lock again later if there is no need to push
    */
    rq->post_schedule = has_pushable_tasks(rq);
#endif
    return p;
}

如果没有可以运行的任务，就返回 NULL，然后其它调度类会寻找可运行的任务。

pick_next_rt_entity() 会从运行队列中选出优先级最高的任务作为下一个运行的任务。

static struct task_struct *_pick_next_task_rt(struct rq *rq)
{
    struct sched_rt_entity *rt_se;
    struct task_struct *p;
    struct rt_rq *rt_rq;
    rt_rq = &rq->rt;
    if (!rt_rq->rt_nr_running)
        return NULL;
    if (rt_rq_throttled(rt_rq))
        return NULL;
    do {
        rt_se = pick_next_rt_entity(rq, rt_rq);
        BUG_ON(!rt_se);
        rt_rq = group_rt_rq(rt_se);
    } while (rt_rq);
    p = rt_task_of(rt_se);
    p->se.exec_start = rq->clock_task;
    return p;
}

在 pick_next_rt_entity() 中你可以看到如何使用位图来在全部优先级中快速的找出拥有可运行任务的优先级最高的队列。优先级队列本事是一个简单的链表，而下一个可运行的任务可以在它的表头找到。

static struct sched_rt_entity *pick_next_rt_entity(struct rq *rq,
        struct rt_rq *rt_rq)
{
    struct rt_prio_array *array = &rt_rq->active;
    struct sched_rt_entity *next = NULL;
    struct list_head *queue;
    int idx;
    idx = sched_find_first_bit(array->bitmap);
    BUG_ON(idx >= MAX_RT_PRIO);
    queue = array->queue + idx;
    next = list_entry(queue->next, struct sched_rt_entity, run_list);
    return next;
}

在优先级队列中添加、删除任务也是非常的简单。仅仅就是找出正确的队列然后设置或者清掉优先级位图中对应位置。因此我就不在这里提供更进一步的细节了。

10. SMP 系统下的负载均衡

引入负载均衡的主要目的是通过将任务从负载重的处理器转移到负载轻的处理器上来提高整个 SMP 系统的性能。Linux 的任务调度器会周期性检查任务负载是如何分配到整个系统的各个处理器，以及是否有必要执行负载均衡。

负载均衡的复杂之处在于要服务各种各样拓扑结构的 SMP 系统。有些包含多个物理处理器核心的系统，要将任务调度到不同的 CPU 上要比在继续运行在本身已经负载重的 CPU 上多执行一次清洗缓存（cache）的操作。而对支持超线程的系统，因为是共享缓存的缘故，相同的场景下调度任务到不同的处理器更灵活。NUMA 架构创造了一个场景：不同的节点访问不同区域的主内存速度不同。

为了解决拓扑结构多样性的问题，在 2.6 版的 Linux 内核引入了调度域的概念。这种方案按层次结构将系统中所有可用的 CPU 分组，这就使得内核有了一种办法描述下层的处理器核心拓扑结构。

10.1. 调度域和调度组

一个调度域是一组共享调度属性和策略的 CPU，可以用来和其它域进行负载平衡。

每个域包含一个或多个调度组，每个组在域内都被视为一个单元，调度器会尝试使各个组的负载均等而不考虑组内到底发生了什么。

想象一下，一个系统有两个物理处理器，每个都支持超线，程这就给我们提供了 4 个逻辑处理器。如果系统启动了，内核会如图所示把逻辑核心分到两个调度域层级。

每个超线程处理器会被精确的放到一个组内而且每两个会放到一个域内。这两个域然后会被放到控制整个处理器的顶级域里。

整个系统包含两个组，每个组又有两个处理器核心。

如果是 NUMA 系统，就会像上面图标所示的那样包含多个域；每个域代表一个 NUMA 节点。系统会分三级，系统级域会包含所有的 NUMA 节点。

10.2. 负载均衡

每个调度域都有一个只在本层级中有效的均衡策略集。这个策略的参数包含每隔多长时间要尝试在整个域内进行一次负载均衡，在尝试执行负载均衡之前成员处理器的负载在达到多少之前可以处于不同步状态，一个处理器处于空闲状态多久才会被认为不再具有明显的缓存亲和性。

最重要的是不同的的策略标志指定了特定情况下的调度行为，比如：一个 CPU 进入空闲，负载均衡应该给它提供一个任务吗？唤醒或创建一个任务，应该调度到那个 CPU 上执行？

系统会周期性的进行积极负载均衡将调度域层级上移，按顺序检查所有的组是否失去均衡了。如果失去均衡则会尝试使用对应域的调度策略的规则执行一次负载均衡。

10.3. 实现细节

数据结构

include/linux/sched.h 添加了两个数据结构来将处理器核心分配到不同的负载均衡层级。sched_domain 代表一个调度域，sched_group 代表一个调度组：

struct sched_domain {
    /* These fields must be setup */
    struct sched_domain *parent; /* top domain must be null terminated */
    struct sched_domain *child; /* bottom domain must be null terminated */
    struct sched_group *groups; /* the balancing groups of the domain */
    unsigned long min_interval; /* Minimum balance interval ms */
    unsigned long max_interval; /* Maximum balance interval ms */
    unsigned int busy_factor; /* less balancing by factor if busy */
    unsigned int imbalance_pct; /* No balance until over watermark */
    unsigned int cache_nice_tries; /* Leave cache hot tasks for # tries */
    unsigned int busy_idx;
    unsigned int idle_idx;
    unsigned int newidle_idx;
    unsigned int wake_idx;
    unsigned int forkexec_idx;
    unsigned int smt_gain;
    int flags; /* See SD_* */
    int level;
    /* Runtime fields. */
    unsigned long last_balance; /* init to jiffies. units in jiffies */
    unsigned int balance_interval; /* initialise to 1. units in ms. */
    unsigned int nr_balance_failed; /* initialise to 0 */
    u64 last_update;
    ...
    unsigned int span_weight;
    /*
    * Span of all CPUs in this domain.
    *
    * NOTE: this field is variable length. (Allocated dynamically
    * by attaching extra space to the end of the structure,
    * depending on how many CPUs the kernel has booted up with)
    */
    unsigned long span[0];
};
struct sched_group {
    struct sched_group *next; /* Must be a circular list */
    atomic_t ref;
    unsigned int group_weight;
    struct sched_group_power *sgp;
    /*
    * The CPUs this group covers.
    *
    * NOTE: this field is variable length. (Allocated dynamically
    * by attaching extra space to the end of the structure,
    * depending on how many CPUs the kernel has booted up with)
    */
    unsigned long cpumask[0];
};

在 include/linux/topology.h或对应的处理器架构的代码中你可以看到如何设置调度域的标志和值。

在 sched_group 中你会看到一个 sched_group_power 类型的名为 sgp 的字段。系统引入了全组 CPU 能力的概念来指定处理器的拓扑结构。甚至一个超线程核心都被看作一个独立的单元，它实际上比另一个物理核心明显缺少很多处理器能力。两个分开的处理器会有两个 CPU power 而一对超线程核心拥有接近 1.1 个 CPU 能力。在进行负载均衡的过程中，内核会尝试最大化 CPU 能力的值来提高整个系统的吞吐量。

积极均衡

系统会周期性的在每个 CPU 上进行积极平衡。在积极平衡中，内核会遍历整个域层级，从当前 CPU 所在的域开始检查每个调度域是否需要进行均衡，是的话就需要执行一次均衡操作。

在调度器初始化的时候系统会注册一个软中断处理程序来定期的执行负载均衡。在 scheduler_tick() 调用 trigger_load_balance() （参见调度框架）时会触发积极平衡。 trigger_load_balance() 会检查定时器，如果到了进行均衡的时间点则会通过对应的标志 SCHED_SOFTIRQ 启动这个软中断。

/*
* Trigger the SCHED_SOFTIRQ if it is time to do periodic load balancing.
*/
static inline void trigger_load_balance(struct rq *rq, int cpu)
{
    /* Don't need to rebalance while attached to NULL domain */
    if (time_after_eq(jiffies, rq->next_balance) &&
        likely(!on_null_domain(cpu)))
        raise_softirq(SCHED_SOFTIRQ);
#ifdef CONFIG_NO_HZ
    else if (nohz_kick_needed(rq, cpu) && likely(!on_null_domain(cpu)))
        nohz_balancer_kick(cpu);
#endif
}

这个被注册为中断处理程序的函数是 run_rebalance_domains() ，它会调用 rebalance_domains() 进行实际工作。

/*
* run_rebalance_domains is triggered when needed from the scheduler tick.
* Also triggered for nohz idle balancing (with nohz_balancing_kick set).
*/
static void run_rebalance_domains(struct softirq_action *h)
{
    int this_cpu = smp_processor_id();
    struct rq *this_rq = cpu_rq(this_cpu);
    enum cpu_idle_type idle = this_rq->idle_at_tick ?
                              CPU_IDLE : CPU_NOT_IDLE;
    rebalance_domains(this_cpu, idle);
    /*
    * If this cpu has a pending nohz_balance_kick, then do the
    * balancing on behalf of the other idle cpus whose ticks are
    * stopped.
    */
    nohz_idle_balance(this_cpu, idle);
}

rebalance_domains() 之后会遍历整个域层级，然后如果某个域的标志 SD_LOAD_BALANCE 被设置了并且均衡间隔到期了，那么就调用 load_balance() 执行一次均衡操作。一个域的均衡间隔是以 jiffies 计算，每次执行均衡之后会更新。

注意，积极均衡对于执行它的 CPU 来说是一个拉取操作。他将或从一个过载的 CPU 上拉取一个任务到当前的处理器以此来重新让任务分配平衡，但是他不会把自己的任务推送给其他处理器。执行这个拉取操作的函数是 load_balance() 。如果它可以找到一个不平衡的组，他也会移动一个或更多的任务到当前 CPU 并且返回一个比 0 大的值。

/*
 * It checks each scheduling domain to see if it is due to be balanced,
 * and initiates a balancing operation if so.
 *
 * Balancing parameters are set up in arch_init_sched_domains.
 */
static void rebalance_domains(int cpu, enum cpu_idle_type idle)
{
    int balance = 1;
    struct rq *rq = cpu_rq(cpu);
    unsigned long interval;
    struct sched_domain *sd;
    /* Earliest time when we have to do rebalance again */
    unsigned long next_balance = jiffies + 60*HZ;
    int update_next_balance = 0;
    int need_serialize;
    update_shares(cpu);
    rcu_read_lock();
    for_each_domain(cpu, sd) {
        if (!(sd->flags & SD_LOAD_BALANCE))
            continue;
        interval = sd->balance_interval;
        if (idle != CPU_IDLE)
            interval *= sd->busy_factor;
        /* scale ms to jiffies */
        interval = msecs_to_jiffies(interval);
        interval = clamp(interval, 1UL, max_load_balance_interval);
        need_serialize = sd->flags & SD_SERIALIZE;
        if (need_serialize) {
            if (!spin_trylock(&balancing))
                goto out;
        }
        if (time_after_eq(jiffies, sd->last_balance + interval)) {
            if (load_balance(cpu, rq, sd, idle, &balance)) {
                /*
                 * We've pulled tasks over so either we're no
                 * longer idle.
                 */
                idle = CPU_NOT_IDLE;
            }
            sd->last_balance = jiffies;
        }
        if (need_serialize)
            spin_unlock(&balancing);
out:
        if (time_after(next_balance, sd->last_balance + interval)) {
            next_balance = sd->last_balance + interval;
            update_next_balance = 1;
        }
        /*
         * Stop the load balance at this level. There is another
         * CPU in our sched group which is doing load balancing more
         * actively.
         */
        if (!balance)
            break;
    }
    rcu_read_unlock();
    /*
     * next_balance will be updated only when there is a need.
     * When the cpu is attached to null domain for ex, it will not be
     * updated.
     */
    if (likely(update_next_balance))
        rq->next_balance = next_balance;
}

load_balance() 调用 find_busiest_group() 在给定的 sched_domain（调度域）中寻找不均衡然后如果能找到最繁忙的组的话就返回这个组。如果系统处于均衡状态而且没有找到不均衡的组， load_balance() 返回。

如果返回了一个组，这个组就被传给 find_busiest_queue() 来寻找组中最忙的逻辑 CPU 的运行队列。

load_balance() 接着搜寻返回的运行队列，通过 move_tasks() 挑选一个要从当前 CPU 的队列移出的任务。在 find_busiest_group() 中设置的不均衡参数指定了要移出的任务的数量。可能会发生这样的情况：因为缓存亲和力的缘故队列中的全部任务都被固定了。这种情况下 load_balance() 会再次搜索但是会排除之前找到的 CPU 。

/*
 * Check this_cpu to ensure it is balanced within domain. Attempt to move
 * tasks if there is an imbalance.
 */
static int load_balance(int this_cpu, struct rq *this_rq,
                        struct sched_domain *sd, enum cpu_idle_type idle,
                        int *balance)
{
    int ld_moved, all_pinned = 0, active_balance = 0;
    struct sched_group *group;
    unsigned long imbalance;
    struct rq *busiest;
    unsigned long flags;
    struct cpumask *cpus = __get_cpu_var(load_balance_tmpmask);
    cpumask_copy(cpus, cpu_active_mask);
    schedstat_inc(sd, lb_count[idle]);
redo:
    group = find_busiest_group(sd, this_cpu, &imbalance, idle,
                               cpus, balance);
    if (*balance == 0)
        goto out_balanced;
    if (!group) {
        schedstat_inc(sd, lb_nobusyg[idle]);
        goto out_balanced;
    }
    busiest = find_busiest_queue(sd, group, idle, imbalance, cpus);
    if (!busiest) {
        schedstat_inc(sd, lb_nobusyq[idle]);
        goto out_balanced;
    }
    BUG_ON(busiest == this_rq);
    schedstat_add(sd, lb_imbalance[idle], imbalance);
    ld_moved = 0;
    if (busiest->nr_running > 1) {
        /*
         * Attempt to move tasks. If find_busiest_group has found
         * an imbalance but busiest->nr_running <= 1, the group is
         * still unbalanced. ld_moved simply stays zero, so it is
         * correctly treated as an imbalance.
         */
        all_pinned = 1;
        local_irq_save(flags);
        double_rq_lock(this_rq, busiest);
        ld_moved = move_tasks(this_rq, this_cpu, busiest,
                              imbalance, sd, idle, &all_pinned);
        double_rq_unlock(this_rq, busiest);
        local_irq_restore(flags);
        /*
         * some other cpu did the load balance for us.
         */
        if (ld_moved && this_cpu != smp_processor_id())
            resched_cpu(this_cpu);
        /* All tasks on this runqueue were pinned by CPU affinity */
        if (unlikely(all_pinned)) {
            cpumask_clear_cpu(cpu_of(busiest), cpus);
            if (!cpumask_empty(cpus))
                goto redo;
            goto out_balanced;
        }
    }
    ...
    goto out;
out_balanced:
    schedstat_inc(sd, lb_balanced[idle]);
    sd->nr_balance_failed = 0;
out_one_pinned:
    /* tune up the balancing interval */
    if ((all_pinned && sd->balance_interval < MAX_PINNED_INTERVAL) ||
        (sd->balance_interval < sd->max_interval))
        sd->balance_interval *= 2;
    ld_moved = 0;
out:
    return ld_moved;
}

一个和节省能源相关的方法隐藏在函数 find_busiest_group() 中。如果设置了域策略的 SD_POWERSAVINGS_BALANCE 标志，而且没有找到最繁忙的组，则 find_busiest_group() 会寻找调度域中负载最低的组，所以这个处理器就可以进入空闲模式了。而在 Android 使用的内核中这个特性并没有激活，而且 Ubuntu 也取消了这个功能。

空闲平衡

一旦 CPU 进入空闲模式就会执行空闲平衡操作。因此如果运行队列变空（参见调度框架），则执行当前调度线程的 CPU 会在 schedule() 中执行空闲平衡操作。

和积极平衡类似， idle_balance() 的实现位于 sched_fair.c 中。首先，它会检查空闲的运行队列的平均空闲时间是否比将任务迁移过去还要长，这实际上就是说系统会检查是否值得迁移任务或者仅仅保持现状就好了，因为下一个任务很快就会被唤醒了。

如果迁移任务起作用，idle_balance() 的功能更类似 rebalance_domains() 。它会遍历域层级然后调用在设置了标志 SD_LOAD_BALANCE 和 SD_BALANCE_NEWIDLE 的域中执行 idle_balance() 。

如果一个或多个任务被拉过来，则遍历层级会终止然后 idle_balance() 返回。

/*
 * idle_balance is called by schedule() if this_cpu is about to become
 * idle. Attempts to pull tasks from other CPUs.
 */
static void idle_balance(int this_cpu, struct rq *this_rq)
{
    struct sched_domain *sd;
    int pulled_task = 0;
    unsigned long next_balance = jiffies + HZ;
    this_rq->idle_stamp = this_rq->clock;
    if (this_rq->avg_idle < sysctl_sched_migration_cost)
        return;
    /*
     * Drop the rq->lock, but keep IRQ/preempt disabled.
     */
    raw_spin_unlock(&this_rq->lock);
    update_shares(this_cpu);
    rcu_read_lock();
    for_each_domain(this_cpu, sd) {
        unsigned long interval;
        int balance = 1;
        if (!(sd->flags & SD_LOAD_BALANCE))
            continue;
        if (sd->flags & SD_BALANCE_NEWIDLE) {
            /* If we've pulled tasks over stop searching: */
            pulled_task = load_balance(this_cpu, this_rq,
                                       sd, CPU_NEWLY_IDLE, &balance);
        }
        interval = msecs_to_jiffies(sd->balance_interval);
        if (time_after(next_balance, sd->last_balance + interval))
            next_balance = sd->last_balance + interval;
        if (pulled_task) {
            this_rq->idle_stamp = 0;
            break;
        }
    }
    rcu_read_unlock();
    raw_spin_lock(&this_rq->lock);
    if (pulled_task || time_after(jiffies, this_rq->next_balance)) {
        /*
         * We are going idle. next_balance may be set based on
         * a busy processor. So reset next_balance.
         */
        this_rq->next_balance = next_balance;
    }
}

从运行队列中挑选一个新任务

第三个需要平衡操作做决定的地方是唤醒一个任务或者创建任务并且需要将之放入运行队列。选择这个运行队列需要考虑整个系统的负载均衡。

每个调度类实现了自己的处理各自任务的策略并且提供了可以被调度框架（kernel/sched.c）执行的钩子（select_task_rq()）函数。它会在三种不同的场景下被调用，每一个都会被各自域的标志进行标记。

1. sched_exec() 中的标志 SD_BALANCE_EXEC 。当一个任务通过 exec() 系统调用启动新任务时会调用该函数。一个新任务在此刻会占用很少的内存和缓存，这就给了内核一个很好的进行平衡的机会。
1. wake_up_new_task() 中的标志 SD_BALANCE_FORK 。这个函数是在新创建的任务第一次被唤醒时调用。
1. try_to_wake_up() 中的标志 SD_BALANCE_WAKE。如果一个正在运行的任务再被唤醒之前通常会有一些缓存亲和性要考虑，以此来决定将它调度到那一个适合的队列中运行。

11. 实时负载均衡

CFS 调度类中实现的积极平衡和空闲平衡能确保只有 CFS 任务受影响。实时任务的负载均衡是由实时调度器处理的。在超载的 RT 队列上执行拉取和推送操作时要考虑下面几种情况：

1. 一个任务唤醒和需要被调度：这种场景下是由实现 RT 调度算法的函数 select_task_rq() 处理的。
1. 队列中有一个高优先级的任务在运行，然后一个低优先级的任务被唤醒了：会对这个低优先级任务执行推送操作。
1. 队列中有一个低优先级的任务在运行，然后一个高优先级的任务被唤醒了：同样也对低优先级任务执行推送操作。
1. 运行队里的优先级发生变化，因为一个任务降低了自己的优先级，从而之前一个更低优先级的任务变成相对高的优先级：会执行推送操作来寻找一个高优先级的任务并推送到这个运行队列。

实时负载均衡这种设计追求的目标是在整个系统中严格的执行实时优先级调度。这也就意味着实时任务调度器需要确保 N 个最高优先级的 RT 任务能够在任何时间点都能及时运行，其中 N 是 CPU 的个数。

11.1. 根域和 CPU 优先级

给定的设计目标要求调度器能快速且有效的获取一个关于系统中所有运行队列的综述，以此来做出调度决策。这就引出了随着 CPU 个数增而加造成的扩容问题。因此引入了根域这个概念，它就是把多个 CPU 划分成几个子集，每个子集包括一个或一组处理器。所有的实时任务调度决策都是在单个根域的范围生效的。

另一个为了处理给定的设计目标而引入的概念是 CPU 优先级。CPU 优先级反映的是根域中最高优先级的 RT 任务的优先级。是另一个维度的优先级，和处于这个优先级的一个 CPU。CPU 优先级的实现在 /kernel/sched_cpupri.c 和 /kernel/sched_cpupri.h 中。

每个 root_domain 结构体都有一个位数组（bit array）记录它对应的根域中有多少超载的 CPU ， cpupri 结构体含有一个 CPU 优先级位图（bitmap）。如果一个 CPU 的运行队列上有不止一个可运行的 RT 任务则它被认为是超载的。

/*
 * We add the notion of a root-domain which will be used to define per-domain
 * variables. Each exclusive cpuset essentially defines an island domain by
 * fully partitioning the member cpus from any other cpuset. Whenever a new
 * exclusive cpuset is created, we also create and attach a new root-domain
 * object.
 *
 */
struct root_domain {
    atomic_t refcount;
    atomic_t rto_count;
    struct rcu_head rcu;
    cpumask_var_t span;
    cpumask_var_t online;
    /*
     * The "RT overload" flag: it gets set if a CPU has more than
     * one runnable RT task.
     */
    cpumask_var_t rto_mask;
    struct cpupri cpupri;
};
struct cpupri_vec {
    raw_spinlock_t lock;
    int count;
    cpumask_var_t mask;
};
struct cpupri {
    struct cpupri_vec pri_to_cpu[CPUPRI_NR_PRIORITIES];
    long pri_active[CPUPRI_NR_PRI_WORDS];
    int cpu_to_pri[NR_CPUS];
};

函数 cpupri_find() 可以通过上面提到的 cpupri 的位图快速的找出一个可以将高优先级任务推送过去的低优先级的 CPU 。如果一个优先级等级非空，而且比这个被推送的任务优先级还要低，则标志 lowest_mask 会设置已经选择了的等级对应的掩码。这个掩码是被推送算法用来寻找最合适推送任务的 CPU，算法是基于亲和性，拓扑结构和缓存特性进行计算的。

/**
 * cpupri_find - find the best (lowest-pri) CPU in the system
 * @cp: The cpupri context
 * @p: The task
 * @lowest_mask: A mask to fill in with selected CPUs (or NULL)
 *
 * Note: This function returns the recommended CPUs as calculated during the
 * current invocation. By the time the call returns, the CPUs may have in
 * fact changed priorities any number of times. While not ideal, it is not
 * an issue of correctness since the normal rebalancer logic will correct
 * any discrepancies created by racing against the uncertainty of the current
 * priority configuration.
 *
 * Returns: (int)bool - CPUs were found
 */
int cpupri_find(struct cpupri *cp, struct task_struct *p,
                struct cpumask *lowest_mask)
{
    int idx = 0;
    int task_pri = convert_prio(p->prio);
    for_each_cpupri_active(cp-> pri_active, idx)
    {
        struct cpupri_vec *vec = &cp->pri_to_cpu[idx];
        if (idx >= task_pri)
            break;
        if (cpumask_any_and(&p->cpus_allowed, vec->mask) >= nr_cpu_ids)
            continue;
        if (lowest_mask) {
            cpumask_and(lowest_mask, &p->cpus_allowed, vec->mask);
            /*
             * We have to ensure that we have at least one bit
             * still set in the array, since the map could have
             * been concurrently emptied between the first and
             * second reads of vec->mask. If we hit this
             * condition, simply act as though we never hit this
             * priority level and continue on.
             */
            if (cpumask_any(lowest_mask) >= nr_cpu_ids)
                continue;
        }
        return 1;

    }
    return 0;

}

find_lowest_rq() 是用来挑选出最终要被推送任务的处理器的函数，而这个任务对应的是最低的掩码。它的实现位于 kernel/sched_rt.c 。如果 cpupri_find() 返回一个包含可能的目标 CPU 的掩码，find_lowest_rq() 将会首先检查这个 CPU 是否属于推送上次执行的任务的目标 CPU 。这个 CPU 通常有最热门的缓存（hottest cache），因此也是最佳选择。

如果不是的话，find_lowest_rq() 会遍历整个调度域层级来找出一个逻辑上最接近当前 CPU 的热门缓存数据，同时也位于最低优先级映射中的 CPU。

如果这次搜索同样没有返回任何可用结果，那么就返回掩码对应的任意一个 CPU 。

static int find_lowest_rq(struct task_struct *task)
{
    struct sched_domain *sd;
    struct cpumask *lowest_mask = __get_cpu_var(local_cpu_mask);
    int this_cpu = smp_processor_id();
    int cpu = task_cpu(task);
    /* Make sure the mask is initialized first */
    if (unlikely(!lowest_mask))
        return -1;
    if (task->rt.nr_cpus_allowed == 1)
        return -1; /* No other targets possible */
    if (!cpupri_find(&task_rq(task)->rd->cpupri, task, lowest_mask))
        return -1; /* No targets found */
    /*
     * At this point we have built a mask of cpus representing the
     * lowest priority tasks in the system. Now we want to elect
     * the best one based on our affinity and topology.
     *
     * We prioritize the last cpu that the task executed on since
     * it is most likely cache-hot in that location.
     */
    if (cpumask_test_cpu(cpu, lowest_mask))
        return cpu;
    /*
     * Otherwise, we consult the sched_domains span maps to figure
     * out which cpu is logically closest to our hot cache data.
     */
    if (!cpumask_test_cpu(this_cpu, lowest_mask))
        this_cpu = -1; /* Skip this_cpu opt if not among lowest */
    rcu_read_lock();
    for_each_domain(cpu, sd) {
        if (sd->flags & SD_WAKE_AFFINE) {
            int best_cpu;
            /*
             * "this_cpu" is cheaper to preempt than a
             * remote processor.
             */
            if (this_cpu != -1 &&
                cpumask_test_cpu(this_cpu, sched_domain_span(sd))) {
                rcu_read_unlock();
                return this_cpu;
            }
            best_cpu = cpumask_first_and(lowest_mask,
                                         sched_domain_span(sd));
            if (best_cpu < nr_cpu_ids) {
                rcu_read_unlock();
                return best_cpu;
            }
        }
    }
    rcu_read_unlock();
    /*
     * And finally, if there were no matches within the domains
     * just give the caller *something* to work with from the compatible
     * locations.
     */
    if (this_cpu != -1)
        return this_cpu;
    cpu = cpumask_any(lowest_mask);
    if (cpu < nr_cpu_ids)
        return cpu;
    return -1;

}    

11.2. 推送操作

RT 调度器中的任务推送操作是在调度框架调度新任务到当前 CPU 时执行的。调度类的钩子 post_schedule() 会被调用，目前它只在 RT 调度类中实现了。在函数内部，它会为当前 CPU 的运行队列中所有的任务调用 push_rt_task() 。

这个函数会检查运行队列是否超载，如果超载了就尝试将不是正在运行的 RT 任务迁移到另一个可用的 CPU 直到迁移失败。如果一个任务被迁移了，则之后都会通过设置标志 need_resched 来调用 schedule()。

内核会使用内部调用了 find_lowest_rq() 的 find_lock_lowest_rq() ，但是如果发现了队列就会对这个队列加锁。

/*
 * If the current CPU has more than one RT task, see if the non
 * running task can migrate over to a CPU that is running a task
 * of lesser priority.
 */
static int push_rt_task(struct rq *rq)
{
    struct task_struct *next_task;
    struct rq *lowest_rq;
    if (!rq->rt.overloaded)
        return 0;
    next_task = pick_next_pushable_task(rq);
    if (!next_task)
        return 0;
retry:
    if (unlikely(next_task == rq->curr)) {
        WARN_ON(1);
        return 0;
    }
    /*
     * It's possible that the next_task slipped in of
     * higher priority than current. If that's the case
     * just reschedule current.
     */
    if (unlikely(next_task->prio < rq->curr->prio)) {
        resched_task(rq->curr);
        return 0;
    }
    /* We might release rq lock */
    get_task_struct(next_task);
    /* find_lock_lowest_rq locks the rq if found */
    lowest_rq = find_lock_lowest_rq(next_task, rq);
    if (!lowest_rq) {
        struct task_struct *task;
        /*
         * find lock_lowest_rq releases rq->lock
         * so it is possible that next_task has migrated.
         *
         * We need to make sure that the task is still on the same
         * run-queue and is also still the next task eligible for
         * pushing.
         */
        task = pick_next_pushable_task(rq);
        if (task_cpu(next_task) == rq->cpu && task == next_task) {
            /*
             * If we get here, the task hasn't moved at all, but
             * it has failed to push. We will not try again,
             * since the other cpus will pull from us when they
             * are ready.
             */
            dequeue_pushable_task(rq, next_task);
            goto out;
        }
        if (!task)
            /* No more tasks, just exit */
            goto out;
        /*
         * Something has shifted, try again.
         */
        put_task_struct(next_task);
        next_task = task;
        goto retry;
    }
    deactivate_task(rq, next_task, 0);
    set_task_cpu(next_task, lowest_rq->cpu);
    activate_task(lowest_rq, next_task, 0);
    resched_task(lowest_rq->curr);
    double_unlock_balance(rq, lowest_rq);
out:
    put_task_struct(next_task);
    return 1;
}         

11.3. 拉取操作

拉取操作是通过调用钩子 pre_schedule() 执行的。这个函数目前也只是在 RT 调度类里面实现了，并且在函数 schedule() 的开始被调用。它会去检查当前 CPU 的运行队列中的任务的最高优先级是否小于上次运行的任务的优先级。如果是的话，内核就会调用 pull_rt_task() 从另一个 CPU 上拉取一个比（当前 CPU 上）将被调度的任务优先级高的任务。

static void pre_schedule_rt(struct rq *rq, struct task_struct *prev)
{
    /* Try to pull RT tasks here if we lower this rq's prio */
    if (rq->rt.highest_prio.curr > prev->prio)
        pull_rt_task(rq);
}

pull_rt_task() 会检查和当前 CPU 处于同一个根域的所有 CPU。它会在潜在的源 CPU 上的寻找第二高优先级任务，然后拉取到当前的 CPU。如果找到一个任务就拉过来。因为 Schedule() 会在很多情况下被执行，所以此处没有调用。

static int pull_rt_task(struct rq *this_rq)
{
    int this_cpu = this_rq->cpu, ret = 0, cpu;
    struct task_struct *p;
    struct rq *src_rq;
    if (likely(!rt_overloaded(this_rq)))
        return 0;
    for_each_cpu(cpu, this_rq->rd->rto_mask) {
        if (this_cpu == cpu)
            continue;
        src_rq = cpu_rq(cpu);
        /*
         * Don't bother taking the src_rq->lock if the next highest
         * task is known to be lower-priority than our current task.
         * This may look racy, but if this value is about to go
         * logically higher, the src_rq will push this task away.
         * And if its going logically lower, we do not care
         */
        if (src_rq->rt.highest_prio.next >=
            this_rq->rt.highest_prio.curr)
            continue;
        /*
         * We can potentially drop this_rq's lock in
         * double_lock_balance, and another CPU could
         * alter this_rq
         */
        double_lock_balance(this_rq, src_rq);
        /*
         * Are there still pullable RT tasks?
         */
        if (src_rq->rt.rt_nr_running <= 1)
            goto skip;
        p = pick_next_highest_task_rt(src_rq, this_cpu);
        /*
         * Do we have an RT task that preempts
         * the to-be-scheduled task?
         */
        if (p && (p->prio < this_rq->rt.highest_prio.curr)) {
            WARN_ON(p == src_rq->curr);
            WARN_ON(!p->on_rq);
            /*
             * There's a chance that p is higher in priority
             * than what's currently running on its cpu.
             * This is just that p is wakeing up and hasn't
             * had a chance to schedule. We only pull
             * p if it is lower in priority than the
             * current task on the run queue
             */
            if (p->prio < src_rq->curr->prio)
                goto skip;
            ret = 1;
            deactivate_task(src_rq, p, 0);
            set_task_cpu(p, this_cpu);
            activate_task(this_rq, p, 0);
            /*
             * We continue with the search, just in
             * case there's an even higher prio task
             * in another runqueue. (low likelihood
             * but possible)
             */
        }

skip:

        double_unlock_balance(this_rq, src_rq);

    }
    return ret;

}             

11.4. 为唤醒的任务挑选一个运行队列

如之前在介绍 CFS 任务负载均衡的章节中描述的，一旦一个任务再次被唤醒或者是首次被创建内核就会立即调用 select_task_rq()。除了额外的推送和拉取操作，这个钩子函数也在 RT 调度器中实现了。

如果当前 CPU 的运行队列中正在运行的是 RT 任务，加入他的优先级更高，并且我们可以移动一个将要被唤醒的任务，然后尝试寻找另一个运行队列可以安置我们唤醒的任务。如果不是的话，就在同一个 CPU 上唤醒任务，然后让调度器完成剩余的工作。

这个函数同时也使用 find_lowest_rq() 寻找一个适合这个任务的 CPU。

static int select_task_rq_rt(struct task_struct *p, int sd_flag, int flags)
{
    struct task_struct *curr;
    struct rq *rq;
    int cpu;
    if (sd_flag != SD_BALANCE_WAKE)
        return smp_processor_id();
    cpu = task_cpu(p);
    rq = cpu_rq(cpu);
    rcu_read_lock();
    curr = ACCESS_ONCE(rq->curr); /* unlocked access */
    if (curr && unlikely(rt_task(curr)) &&
        (curr->rt.nr_cpus_allowed < 2 ||
         curr->prio <= p->prio) &&
        (p->rt.nr_cpus_allowed > 1)) {
        int target = find_lowest_rq(p);
        if (target != -1)
            cpu = target;
    }
    rcu_read_unlock();
    return cpu;
}

12. 参考资料

Robert Love, Linux Kernel Development, 3rd edition, 2010
Daniel P. Bovet, Marco Cesati, Understanding the Linux Kernel, 3rd edition, Nov 2005
Tim Jones, Inside the Linux 2.6 Completely Fair Scheduler, 15.12.2009 - http://www.ibm.com/developerworks/library/l-completely-fair-scheduler/
Chandandeep Singh Pabla, Completely Fair Scheduler, 01.08.2009 - http://www.linuxjournal.com/magazine/completely-fair-scheduler
CFS Scheduler - https://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt -Josh Aas, Understanding the Linux 2.6.8.1 CPU Scheduler, 17.02.2005 - http://joshaas.net/linux/linux_cpu_scheduler.pdf
Jonathan Corbet, Schedulers: the plot thickens, 17.04.2007 - http://lwn.net/Articles/230574/
Jonathan Corbet, SCHED_FIFO and realtime throttling, 01.09.2008 - http://lwn.net/Articles/296419/
Jonathan Corbet, Scheduling domains, 19.04.2004 - http://lwn.net/Articles/80911/
Ankita Garg, Real-Time Linux Kernel Scheduler, 01.08.2009 - http://www.linuxjournal.com/magazine/real-time-linux-kernel-scheduler?page=0,4