More than 5 years have passed since last update.

【読解入門】Linuxカーネル（スケジューラ編その3-3）

Last updated at 2019-04-03Posted at 2019-03-25

・本稿で読解の対象にしているLinuxカーネルの版数は5.1-rc2です。

FAIRクラスについて

FAIRクラス（以下、一般クラスと記載）は、Linuxで一番使用頻度が高いプロセスのクラスになります。何も指定せずにプログラムを実行すると本クラスで実行されます。コードを読む前に要点を押さえて、その後にコードを見ましょう。まずは要点を箇条書きで述べます。本連載では、カーネルの基本的な概要を理解しやすくするためにCGROUP及びSMP向けのコードについては触れません。

・一般クラスのプロセス群のスケジューラは、2.6.23でマージされたCFS（Completely Fair Scheduler）というスケジューラで管理している。それまでのアルゴリズムは経験則（I/Oバウンドプロセスの判定など）に基づくアルゴリズムであったため、完全な公平性とは言えなかった。

・CFSでは、各プロセスを赤黒木で管理する。（RTクラスの時は優先度のbitmapに基づいたリスト（キュー）でしたね）
赤黒木の詳細については、他のWebページを参照ください。平衡二分木で計算量はO(log n)です。
Linuxカーネルのスケジューラは、常に赤黒木の「一番左に存在するノードを次に実行するノード」とする。

*wikipediaより抜粋赤黒木

各ノードは、プロセスの仮想実行時間vruntimeを用いて管理する。vruntimeの値が一番小さいプロセスが左端に位置し、次の実行プロセスとして変数にキャッシュされている。
ユーザーが設定するnice値（-20～19）は、本スケジューラで定義されている値（本ページの最下部の配列）に基づいて重み付けを行い、vruntimeに反映されて木の適切な位置に追加される。

・一般プロセスを全て実行する周期の目標は、/proc/sys/kernel/sched_latency_nsで決定する。CPUが1つの場合、6msが目標値となる。下記に定義部を示す。

kernel/sched/fair.c

unsigned int sysctl_sched_latency                       = 6000000ULL;
static unsigned int normalized_sysctl_sched_latency     = 6000000ULL;

・CPUを得たプロセスがプリエンプションされない期間の長さは/proc/sys/kernel/sched_min_granularity_nsで決定する。CPU数が1の場合、750usが最小実行時間の長さとなる。（RTプロセスによるプリエンプションがない場合）

kernel/sched/fair.c

unsigned int sysctl_sched_min_granularity                       = 750000ULL;
static unsigned int normalized_sysctl_sched_min_granularity     = 750000ULL;

・一般プロセスが起床後にプリエンプションするまでの時間の粒度。デフォルトでは、1msの粒度でプリエンプションのチェックが行われる。本値を小さくしすぎるとプリエンプションチェック処理回数が多くなるため、オーバーヘッドが増加することに注意が必要。

kernel/sched/fair.c

unsigned int sysctl_sched_wakeup_granularity                    = 1000000UL;
static unsigned int normalized_sysctl_sched_wakeup_granularity  = 1000000UL;

・（前回も述べましたが）RTクラスのプロセスが実行可能状態になると、一般クラスのプロセスはプリエンプションされる。（RTプロセスに勝つことは絶対にない）

・子プロセス生成時に親と子、どちらのプロセスを先に実行するかは/proc/sys/kernel/sched_child_runs_firstで決定する。1を設定していれば、子プロセスが先に実行される。0の場合は、親プロセスが先に実行される。

・FAIRクラスのプロセス群はcfs_rq構造体（CPU毎に存在）を用いて管理する。
各メンバは追々説明します。

kernel/sched/sched.h

struct cfs_rq {
        struct load_weight      load;
        unsigned long           runnable_weight;
        unsigned int            nr_running;
        unsigned int            h_nr_running;

        u64                     exec_clock;
        u64                     min_vruntime;

        struct rb_root_cached   tasks_timeline;

        struct sched_entity     *curr;
        struct sched_entity     *next;
        struct sched_entity     *last;
        struct sched_entity     *skip;
};

・sched_entity構造体は、基本的にプロセス単位と考えて良いが、CGROUPを使用するときにはプロセス単位でなく、グループ単位となる。cfs_rq構造体のnr_runningメンバとh_nr_runningメンバの違いは以下の通り。
nr_running：実行可能なsched_entity数（CGROUPのグループ数）
h_nr_running：実行可能なプロセス数

4.FAIRクラスのsched_class構造体メンバ

enqueue_taskメンバ

実行可能状態になったプロセスを赤黒木のノードとして追加する処理です。

kernel/sched/fair.c

/*
 * The enqueue_task method is called before nr_running is
 * increased. Here we update the fair scheduling stats and
 * then put the task into the rbtree:
 */
static void
enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
        struct cfs_rq *cfs_rq;
        struct sched_entity *se = &p->se;

 1       util_est_enqueue(&rq->cfs, p);

 2       if (p->in_iowait)
 3               cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);

 4       for_each_sched_entity(se) {
 5               if (se->on_rq)
 6                       break;
 7               cfs_rq = cfs_rq_of(se);
 8               enqueue_entity(cfs_rq, se, flags);

 9               if (cfs_rq_throttled(cfs_rq))
10                        break;
11                cfs_rq->h_nr_running++;

12                flags = ENQUEUE_WAKEUP;
13        }

14        for_each_sched_entity(se) {
15                cfs_rq = cfs_rq_of(se);
16                cfs_rq->h_nr_running++;

17                if (cfs_rq_throttled(cfs_rq))
18                        break;

19                update_load_avg(cfs_rq, se, UPDATE_TG);
20                update_cfs_group(se);
21        }

22        if (!se) {
23                add_nr_running(rq, 1);
24                if (flags & ENQUEUE_WAKEUP)
25                        update_overutilized_status(rq);
26        }

27        if (cfs_bandwidth_used()) {
28                for_each_sched_entity(se) {
29                        cfs_rq = cfs_rq_of(se);

30                        if (list_add_leaf_cfs_rq(cfs_rq))
31                                break;
32                }
33        }

34        assert_list_leaf_cfs_rq(rq);

35        hrtick_update(rq);
36}

概要を掴むためにポイントのみを抜き出して追っていきます。
本関数では8行目を追う必要があります。（enqueue_entity(cfs_rq, se, flags); ）

その他の処理を簡単に述べると以下の通りです。
4行目のfor_each_sched_entity( )はCGROUPを使用していない場合、下記の通り一回処理して抜ける。
#define for_each_sched_entity(se) 　 for (; se; se = NULL)
そして14行目のfor_each_sched_entity( )を処理する時はseにNULLが入っている為、14-21行目まではスキップ。
5、6行目：すでに対象プロセスが実行キュー（赤黒木）に存在していれば4ー13行目のforループをbreakする
9、10行目：CGRUPの帯域制御用。（なので、無関係の処理として扱う）
11行目：cfs_rq構造体の実行可能プロセス数をインクリメント

kernel/sched/fair.c

static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
        bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags &　ENQUEUE_MIGRATED);
        bool curr = cfs_rq->curr == se;

 1        if (renorm && curr)
 2                se->vruntime += cfs_rq->min_vruntime;

 3        update_curr(cfs_rq);

 4        if (renorm && !curr)
 5                se->vruntime += cfs_rq->min_vruntime;

 6        update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
 7        update_cfs_group(se);
 8        enqueue_runnable_load_avg(cfs_rq, se);
 9        account_entity_enqueue(cfs_rq, se);

10        if (flags & ENQUEUE_WAKEUP)
11                place_entity(cfs_rq, se, 0);

12        check_schedstat_required();
13        update_stats_enqueue(cfs_rq, se, flags);
14        check_spread(cfs_rq, se);
15        if (!curr)
16               __enqueue_entity(cfs_rq, se);
17        se->on_rq = 1;

18        if (cfs_rq->nr_running == 1) {
19                list_add_leaf_cfs_rq(cfs_rq);
20                check_enqueue_throttle(cfs_rq);
21        }
}

本関数で追う必要があるのは16行目です。（__enqueue_entity(cfs_rq, se); ）

その他の処理は以下の通り。
3行目：yield_taskメンバの部分で説明する。currnetプロセスの実行時間の統計情報を更新する処理。
2 or 5行目：対象のプロセスのvruntimeにcfs_rq->min_vruntime（初期値は(-(1LL << 20))）加算する。
6、7、8行目：Uni-ProcessorでCGROUPが無効の場合、空関数です。
9行目：cfs_rq構造体のnr_runningをインクリメントする処理などがあります。
17行目：対象プロセスが赤黒木に登録されていることを設定します（se->on_rqを1にセット）
19、20行目：CGROUPが無効の場合、空関数です。

kernel/sched/fair.c

/*
 * Enqueue an entity into the rb-tree:
 */
static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
        struct rb_node **link = &cfs_rq->tasks_timeline.rb_root.rb_node;
        struct rb_node *parent = NULL;
        struct sched_entity *entry;
        bool leftmost = true;

        /*
         * Find the right place in the rbtree:
         */
 1        while (*link) {
 2                parent = *link;
 3                entry = rb_entry(parent, struct sched_entity, run_node);
                /*
                 * We dont care about collisions. Nodes with
                 * the same key stay together.
                 */
 4                if (entity_before(se, entry)) {
 5                        link = &parent->rb_left;
 6                } else {
 7                        link = &parent->rb_right;
 8                        leftmost = false;
 9                }
        }

10        rb_link_node(&se->run_node, parent, link);
11        rb_insert_color_cached(&se->run_node,
                               &cfs_rq->tasks_timeline, leftmost);
}

1行目から9行目までの処理で赤黒木の適切な場所を見つけた後、11行目で対象のプロセスを赤黒木に追加します。
この時、rb_root_cached構造体であるcfs_rq->tasks_timelineのrb_leftmostメンバを更新する可能性があります。このrb_leftmostメンバは赤黒木の一番左のノードを常に保持することで、次に実行するプロセスの決定を瞬時に行う役割を担っています。

dequeue_taskメンバ

指定のプロセスを赤黒木（実行キュー）から外す処理です。

kernel/sched/fair.c

/*
 * The dequeue_task method is called before nr_running is
 * decreased. We remove the task from the rbtree and
 * update the fair scheduling stats:
 */
static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
        struct cfs_rq *cfs_rq;
        struct sched_entity *se = &p->se;
        int task_sleep = flags & DEQUEUE_SLEEP;

1        for_each_sched_entity(se) {
2                cfs_rq = cfs_rq_of(se);
3                dequeue_entity(cfs_rq, se, flags);

                /*
                 * end evaluation on encountering a throttled cfs_rq
                 *
                 * note: in the case of encountering a throttled cfs_rq we will
                 * post the final h_nr_running decrement below.
                */
4                if (cfs_rq_throttled(cfs_rq))
5                        break;
6                cfs_rq->h_nr_running--;

                /* Don't dequeue parent if it has other entities besides us */
7                if (cfs_rq->load.weight) {
                        /* Avoid re-evaluating load for this entity: */
8                        se = parent_entity(se);

9                        if (task_sleep && se && !throttled_hierarchy(cfs_rq))
10                                set_next_buddy(se);
11                        break;
12                }
13                flags |= DEQUEUE_SLEEP;
14        }

15        for_each_sched_entity(se) {
16                cfs_rq = cfs_rq_of(se);
17                cfs_rq->h_nr_running--;

18                if (cfs_rq_throttled(cfs_rq))
19                        break;

20                update_load_avg(cfs_rq, se, UPDATE_TG);
21                update_cfs_group(se);
22        }

23        if (!se)
24                sub_nr_running(rq, 1);

25        util_est_dequeue(&rq->cfs, p, task_sleep);
26        hrtick_update(rq);
}

3行目を追う必要があります。

その他の処理としては以下の通りです。
4、5行目：CGROUPが無効の場合、空関数です。
6行目：cfs_rq構造体のnr_runningをデクリメントします。（実行可能なプロセス数のデクリメント）
15-22行目：この段階でseはNULLのため、本処理はスキップされます。

kernel/sched/fair.c

static void
dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
        /*
         * Update run-time statistics of the 'current'.
         */
 1        update_curr(cfs_rq);

        /*
         * When dequeuing a sched_entity, we must:
         *   - Update loads to have both entity and cfs_rq synced with now.
         *   - Subtract its load from the cfs_rq->runnable_avg.
         *   - Subtract its previous weight from cfs_rq->load.weight.
         *   - For group entity, update its weight to reflect the new share
         *     of its group cfs_rq.
         */
 2        update_load_avg(cfs_rq, se, UPDATE_TG);
 3        dequeue_runnable_load_avg(cfs_rq, se);

 4        update_stats_dequeue(cfs_rq, se, flags);

 5        clear_buddies(cfs_rq, se);

 6        if (se != cfs_rq->curr)
 7               __dequeue_entity(cfs_rq, se);
 8        se->on_rq = 0;
 9        account_entity_dequeue(cfs_rq, se);

        /*
         * Normalize after update_curr(); which will also have moved
         * min_vruntime if @se is the one holding it back. But before doing
         * update_min_vruntime() again, which will discount @se's position and
         * can move min_vruntime forward still more.
         */
10        if (!(flags & DEQUEUE_SLEEP))
11                se->vruntime -= cfs_rq->min_vruntime;

        /* return excess runtime on last dequeue */
12        return_cfs_rq_runtime(cfs_rq);

13        update_cfs_group(se);

        /*
         * Now advance min_vruntime if @se was the entity holding it back,
         * except when: DEQUEUE_SAVE && !DEQUEUE_MOVE, in this case we'll be
         * put back on, and if we advance min_vruntime, we'll be placed back
         * further than we started -- ie. we'll be penalized.
         */
14        if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) != DEQUEUE_SAVE)
15                update_min_vruntime(cfs_rq);
}

7行目がポイントのため、追います。

kernel/sched/fair.c

static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
        rb_erase_cached(&se->run_node, &cfs_rq->tasks_timeline);
}

本関数を実行することで赤黒木から対象のプロセスを削除します。
この時、rb_erase_cached( )で、次に実行するプロセスrb_leftmostも必要に応じて変更します。

yield_taskメンバ

currentプロセスが一度CPUの実行権を手放すメンバです。

kernel/sched/fair.c

static void yield_task_fair(struct rq *rq)
{
        struct task_struct *curr = rq->curr;
        struct cfs_rq *cfs_rq = task_cfs_rq(curr);
        struct sched_entity *se = &curr->se;

        /*
         * Are we the only task in the tree?
         */
1        if (unlikely(rq->nr_running == 1))
2                return;

3        clear_buddies(cfs_rq, se);

4        if (curr->policy != SCHED_BATCH) {
5                update_rq_clock(rq);
                /*
                 * Update run-time statistics of the 'current'.
                 */
6                update_curr(cfs_rq);
                /*
                 * Tell update_rq_clock() that we've just updated,
                 * so we don't do microscopic update in schedule()
                 * and double the fastpath cost.
                 */
7                rq_clock_skip_update(rq);
8        }

9        set_skip_buddy(se);
}

1、2行目：実行可能状態のプロセスがcurrentプロセスのみの場合は何もせずに復帰。つまり、currentプロセスがそのまま実行されることを示します。
6行目のupdate_curr( )がポイントです。

kernel/sched/fair.c

static void update_curr(struct cfs_rq *cfs_rq)
{
        struct sched_entity *curr = cfs_rq->curr;
        u64 now = rq_clock_task(rq_of(cfs_rq));
        u64 delta_exec;

 1        if (unlikely(!curr))
 2                return;

 3        delta_exec = now - curr->exec_start;
 4        if (unlikely((s64)delta_exec <= 0))
 5                return;

 6        curr->exec_start = now;

 7        schedstat_set(curr->statistics.exec_max,
                      max(delta_exec, curr->statistics.exec_max));

 8        curr->sum_exec_runtime += delta_exec;
 9        schedstat_add(cfs_rq->exec_clock, delta_exec);

10        curr->vruntime += calc_delta_fair(delta_exec, curr);
11        update_min_vruntime(cfs_rq);

12        if (entity_is_task(curr)) {
13                struct task_struct *curtask = task_of(curr);

14                trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
15                cgroup_account_cputime(curtask, delta_exec);
16                account_group_exec_runtime(curtask, delta_exec);
17        }

18        account_cfs_rq_runtime(cfs_rq, delta_exec);
}

1、2行目：currentプロセスでない場合は、即復帰
3行目：currentプロセスがディスパッチされてからの実行時間を算出
10行目：currentプロセスのvruntimeを3行目の実行時間を基に更新
上記処理で、次にリスケジューリングするタイミングでrb_leftmostのプロセスがディスパッチされます。

check_preempt_currメンバ

currentプロセスをプリエンプションする必要があるか否かをチェックする処理です。

kernel/sched/fair.c

/*
 * Preempt the current task with a newly woken task if needed:
 */
static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
flags)
{
        struct task_struct *curr = rq->curr;
        struct sched_entity *se = &curr->se, *pse = &p->se;
        struct cfs_rq *cfs_rq = task_cfs_rq(curr);
        int scale = cfs_rq->nr_running >= sched_nr_latency;
        int next_buddy_marked = 0;

 1        if (unlikely(se == pse))
 2                return;

 3        if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
 4                return;

 5        if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK)) {
 6                set_next_buddy(pse);
 7                next_buddy_marked = 1;
 8        }

 9        if (test_tsk_need_resched(curr))
                return;

        /* Idle tasks are by definition preempted by non-idle tasks. */
10        if (unlikely(task_has_idle_policy(curr)) &&
             likely(!task_has_idle_policy(p)))
11                goto preempt;

12        if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
13                return;
14        find_matching_se(&se, &pse);
15        update_curr(cfs_rq_of(se));
16        BUG_ON(!pse);
17        if (wakeup_preempt_entity(se, pse) == 1) {
18                if (!next_buddy_marked)
19                        set_next_buddy(pse);
20                goto preempt;
21        }

22        return;

23preempt:
24        resched_curr(rq);

25        if (unlikely(!se->on_rq || curr == rq->idle))
26                return;

27        if (sched_feat(LAST_BUDDY) && scale && entity_is_task(se))
28                set_last_buddy(se);
}

9行目：すでにリスケジューリングフラグが設定されている場合は本関数から復帰します。
10、11行目：currentプロセスがSCHED_IDLEで、指定したプロセスがSCHED_IDLEでない場合はプリエンプション設定処理（24行目）を実施します。
24行目のresched_curr( )でリスケジューリングフラグをセットします。

pick_next_taskメンバ

次に実行するプロセスを決定し、呼び出し元にプロセスを返す処理メンバです。

kernel/sched/fair.c

static struct task_struct *
pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
        struct cfs_rq *cfs_rq = &rq->cfs;
        struct sched_entity *se;
        struct task_struct *p;
        int new_tasks;

 1 again:
 2        if (!cfs_rq->nr_running)
 3                goto idle;

 4        put_prev_task(rq, prev);

 5        do {
 6                se = pick_next_entity(cfs_rq, NULL);
 7                set_next_entity(cfs_rq, se);
 8                cfs_rq = group_cfs_rq(se);
 9        } while (cfs_rq);

10        p = task_of(se);

11 done: __maybe_unused;

12        if (hrtick_enabled(rq))
13                hrtick_start_fair(rq, p);

14        update_misfit_status(p, rq);

15        return p;

16 idle:
17        update_misfit_status(NULL, rq);
18        new_tasks = idle_balance(rq, rf);

19        if (new_tasks < 0)
20                return RETRY_TASK;

21        if (new_tasks > 0)
22                goto again;

23        update_idle_rq_clock_pelt(rq);

24        return NULL;
}

6行目で次に実行するプロセスを決定します。
pick_next_entity( )では、__pick_first_entity( )を呼び出します。
__pick_first_entity( )では、
rb_first_cached(&cfs_rq->tasks_timeline);
を実行して、rb_leftmostメンバを取得します。

19 - 22行目：Uni-Processorシステムではidle_balanceは0を返すため、これらのgoto文が実行されることはありません。

put_prev_taskメンバ

kernel/sched/fair.c

static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
{
1        struct sched_entity *se = &prev->se;
2        struct cfs_rq *cfs_rq;

3        for_each_sched_entity(se) {
4                cfs_rq = cfs_rq_of(se);
5                put_prev_entity(cfs_rq, se);
6        }
}

前述した通り、3行目のループ処理は1回のみ実行されます。
5行目でput_prev_entity( )を呼び出しています。

kernel/sched/fair.c

static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
{
        /*
         * If still on the runqueue then deactivate_task()
         * was not called and update_curr() has to be done:
         */
1        if (prev->on_rq)
2                update_curr(cfs_rq);

        /* throttle cfs_rqs exceeding runtime */
3        check_cfs_rq_runtime(cfs_rq);

4        check_spread(cfs_rq, prev);

5        if (prev->on_rq) {
6                update_stats_wait_start(cfs_rq, prev);
                /* Put 'current' back into the tree. */
7                __enqueue_entity(cfs_rq, prev);
                /* in !on_rq case, update occurred at dequeue */
8                update_load_avg(cfs_rq, prev, 0);
9        }
        cfs_rq->curr = NULL;
}

3行目：CGROUP用の処理です。
5-9行目：prevプロセスのon_rqが設定されていれば、prevプロセスを赤黒木に追加します。

set_curr_taskメンバ

kernel/sched/fair.c

static void set_curr_task_fair(struct rq *rq)
{
        struct sched_entity *se = &rq->curr->se;

1        for_each_sched_entity(se) {
2                struct cfs_rq *cfs_rq = cfs_rq_of(se);

3                set_next_entity(cfs_rq, se);
                 /* ensure bandwidth has been allocated on our new cfs_rq */
4                account_cfs_rq_runtime(cfs_rq, 0);
        }
}

前述した通り、1行目のループ処理は1回のみ実行されます。
set_next_entity( )がポイントですが、set_next_entity( )では以下を行います。
1)赤黒木から外す。
2)実行開始時刻を記憶する。
3)cfs_rq->currにcurrentプロセスを設定する。

task_tickメンバ

tick割込み契機で実行する処理です。

kernel/sched/fair.c

static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
        struct cfs_rq *cfs_rq;
        struct sched_entity *se = &curr->se;

1        for_each_sched_entity(se) {
2                cfs_rq = cfs_rq_of(se);
3                entity_tick(cfs_rq, se, queued);
4        }

5        if (static_branch_unlikely(&sched_numa_balancing))
6                task_tick_numa(rq, curr);

7        update_misfit_status(curr, rq);
8        update_overutilized_status(task_rq(curr));
}

ポイントは3行目のentity_tick( )です。

kernel/sched/fair.c

static void
entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
        /*
         * Update run-time statistics of the 'current'.
         */
1        update_curr(cfs_rq);

        /*
         * Ensure that runnable average is periodically updated.
         */
2        update_load_avg(cfs_rq, curr, UPDATE_TG);
3        update_cfs_group(curr);

4        if (cfs_rq->nr_running > 1)
5                check_preempt_tick(cfs_rq, curr);
}

1行目でcurrentプロセスの実行時間などの統計情報を加算しています。（yield_taskメンバの説明の中でupdate_curr( )を説明しています。）
実行可能プロセスが2以上であれば（4行目）、プリエンプションの必要性を確認し、必要に応じてリスケジューリングフラグを設定します（5行目）。

get_rr_intervalメンバ

kernel/sched/fair.c

static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task)
{
        struct sched_entity *se = &task->se;
        unsigned int rr_interval = 0;

        /*
         * Time slice is 0 for SCHED_OTHER tasks that are on an otherwise
         * idle runqueue:
         */
1        if (rq->cfs.load.weight)
2                rr_interval = NS_TO_JIFFIES(sched_slice(cfs_rq_of(se), se));

3        return rr_interval;
}

1行目のrq->cfs.load.weightは、SMPの負荷分散のために存在する変数のため、Uni-Processorシステムでは常に0です。従って本関数の戻り値は常に0になります。

prio_changedメンバ

プロセスの優先度が変更された時の処理です。

kernel/sched/fair.c

static void
prio_changed_fair(struct rq *rq, struct task_struct *p, int oldprio)
{
1        if (!task_on_rq_queued(p))
2                return;

3        if (rq->curr == p) {
4                if (p->prio > oldprio)
5                        resched_curr(rq);
6        } else
7                check_preempt_curr(rq, p, 0);
}

1行目は対象のプロセスのtask_struct構造体のon_rqメンバにTASK_ON_RQ_QUEUEDが設定されていなければ、2行目で復帰します。このTASK_ON_RQ_QUEUEDが設定されていると実行可能でキュー（赤黒木）に存在することを示します。

3行目から5行目で対象プロセスがcurrentプロセスかつ、優先度が低くなる場合、リスケジューリングフラグを設定します。対象プロセスがcurrentプロセスでない場合は対象プロセスに関するプリエンプションの必要性有無を確認します。

switched_toメンバ

kernel/sched/fair.c

static void switched_to_fair(struct rq *rq, struct task_struct *p)
{
1        attach_task_cfs_rq(p);

2        if (task_on_rq_queued(p)) {
3                if (rq->curr == p)
4                        resched_curr(rq);
5                else
6                        check_preempt_curr(rq, p, 0);
7        }
}

対象プロセスが実行キュー（木）に存在し（2行目）、currentプロセスの場合はリスケジューリングフラグを設定します（4行目）。currentプロセスでない場合は、対象プロセスのプリエンプションの必要性の有無を確認します。

update_currメンバ

static void update_curr(struct cfs_rq *cfs_rq)
{
        struct sched_entity *curr = cfs_rq->curr;
        u64 now = rq_clock_task(rq_of(cfs_rq));
        u64 delta_exec;

        if (unlikely(!curr))
                return;

        delta_exec = now - curr->exec_start;
        if (unlikely((s64)delta_exec <= 0))
                return;

        curr->exec_start = now;

        schedstat_set(curr->statistics.exec_max,
                      max(delta_exec, curr->statistics.exec_max));

        curr->sum_exec_runtime += delta_exec;
        schedstat_add(cfs_rq->exec_clock, delta_exec);

        curr->vruntime += calc_delta_fair(delta_exec, curr);
        update_min_vruntime(cfs_rq);

        if (entity_is_task(curr)) {
                struct task_struct *curtask = task_of(curr);

                trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
                cgroup_account_cputime(curtask, delta_exec);
                account_group_exec_runtime(curtask, delta_exec);
        }

        account_cfs_rq_runtime(cfs_rq, delta_exec);
}

yield_taskメンバの説明でも出てきました。詳細はそちらを参照ください。
本メンバはcurrentプロセスの実行時間の更新を含む統計情報を更新します。

nice値とvruntimeの関係

nice値と仮想実行時間vruntimeの関係はどのようになっているのでしょうか。
vruntimeには、nice値のような優先度という概念は存在しません。
下記を参照ください。

kernel/sched/core.c

/*
 * Nice levels are multiplicative, with a gentle 10% change for every
 * nice level changed. I.e. when a CPU-bound task goes from nice 0 to
 * nice 1, it will get ~10% less CPU time than another CPU-bound task
 * that remained on nice 0.
 *
 * The "10% effect" is relative and cumulative: from _any_ nice level,
 * if you go up 1 level, it's -10% CPU usage, if you go down 1 level
 * it's +10% CPU usage. (to achieve that we use a multiplier of 1.25.
 * If a task goes up by ~10% and another task goes down by ~10% then
 * the relative distance between them is ~25%.)
 */
const int sched_prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};

/*
 * Inverse (2^32/x) values of the sched_prio_to_weight[] array, precalculated.
 *
 * In cases where the weight does not change often, we can use the
 * precalculated inverse to speed up arithmetics by turning divisions
 * into multiplications:
 */
const u32 sched_prio_to_wmult[40] = {
 /* -20 */     48388,     59856,     76040,     92818,    118348,
 /* -15 */    147320,    184698,    229616,    287308,    360437,
 /* -10 */    449829,    563644,    704093,    875809,   1099582,
 /*  -5 */   1376151,   1717300,   2157191,   2708050,   3363326,
 /*   0 */   4194304,   5237765,   6557202,   8165337,  10153587,
 /*   5 */  12820798,  15790321,  19976592,  24970740,  31350126,
 /*  10 */  39045157,  49367440,  61356676,  76695844,  95443717,
 /*  15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
};

上記の通り、nice値に応じて重みという概念を用いており、コメントにあるようにnice値が1異なるごとに重みが25%前後異なります。この重みを基にvruntimeを決定し、赤黒木の挿入位置を求めることで、結果的にnice値に応じた実行時間を実現しています。

載せているコードの量に対して説明文が不足していると自覚していますので、本記事の説明文を徐々に更新（充実）させていきます。

初回：【読解入門】Linuxカーネル（概要編）
前回：【読解入門】Linuxカーネル（スケジューラ編その3-2）
次回：【読解入門】Linuxカーネル（スケジューラ編その4-1）

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up

【読解入門】Linuxカーネル （スケジューラ編その3-3）

FAIRクラスについて

4.FAIRクラスのsched_class構造体メンバ

enqueue_taskメンバ

dequeue_taskメンバ

yield_taskメンバ

check_preempt_currメンバ

pick_next_taskメンバ

put_prev_taskメンバ

set_curr_taskメンバ

task_tickメンバ

get_rr_intervalメンバ

prio_changedメンバ

switched_toメンバ

update_currメンバ

nice値とvruntimeの関係

【読解入門】Linuxカーネル（スケジューラ編その3-3）