Merge tag 'sched-core-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:

 - Add cpufreq pressure feedback for the scheduler

 - Rework misfit load-balancing wrt affinity restrictions

 - Clean up and simplify the code around ::overutilized and
   ::overload access.

 - Simplify sched_balance_newidle()

 - Bump SCHEDSTAT_VERSION to 16 due to a cleanup of CPU_MAX_IDLE_TYPES
   handling that changed the output.

 - Rework & clean up <asm/vtime.h> interactions wrt arch_vtime_task_switch()

 - Reorganize, clean up and unify most of the higher level
   scheduler balancing function names around the sched_balance_*()
   prefix

 - Simplify the balancing flag code (sched_balance_running)

 - Miscellaneous cleanups & fixes

* tag 'sched-core-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits)
  sched/pelt: Remove shift of thermal clock
  sched/cpufreq: Rename arch_update_thermal_pressure() => arch_update_hw_pressure()
  thermal/cpufreq: Remove arch_update_thermal_pressure()
  sched/cpufreq: Take cpufreq feedback into account
  cpufreq: Add a cpufreq pressure feedback for the scheduler
  sched/fair: Fix update of rd->sg_overutilized
  sched/vtime: Do not include <asm/vtime.h> header
  s390/irq,nmi: Include <asm/vtime.h> header directly
  s390/vtime: Remove unused __ARCH_HAS_VTIME_TASK_SWITCH leftover
  sched/vtime: Get rid of generic vtime_task_switch() implementation
  sched/vtime: Remove confusing arch_vtime_task_switch() declaration
  sched/balancing: Simplify the sg_status bitmask and use separate ->overloaded and ->overutilized flags
  sched/fair: Rename set_rd_overutilized_status() to set_rd_overutilized()
  sched/fair: Rename SG_OVERLOAD to SG_OVERLOADED
  sched/fair: Rename {set|get}_rd_overload() to {set|get}_rd_overloaded()
  sched/fair: Rename root_domain::overload to ::overloaded
  sched/fair: Use helper functions to access root_domain::overload
  sched/fair: Check root_domain::overload value before update
  sched/fair: Combine EAS check with root_domain::overutilized access
  sched/fair: Simplify the continue_balancing logic in sched_balance_newidle()
  ...
torvalds committed May 14, 2024
2 parents 17ca7fc + 97450eb commit 6e5a0c3
Showing 42 changed files with 550 additions and 441 deletions.
1 change: 1 addition & 0 deletions Documentation/admin-guide/kernel-parameters.txt
@@ -5826,6 +5826,7 @@
but is useful for debugging and performance tuning.

sched_thermal_decay_shift=
+[Deprecated]
[KNL, SMP] Set a decay shift for scheduler thermal
pressure signal. Thermal pressure signal follows the
default decay period of other scheduler pelt
12 changes: 6 additions & 6 deletions Documentation/scheduler/sched-domains.rst
@@ -31,21 +31,21 @@ is treated as one entity. The load of a group is defined as the sum of the
load of each of its member CPUs, and only when the load of a group becomes
out of balance are tasks moved between groups.

-In kernel/sched/core.c, trigger_load_balance() is run periodically on each CPU
-through scheduler_tick(). It raises a softirq after the next regularly scheduled
+In kernel/sched/core.c, sched_balance_trigger() is run periodically on each CPU
+through sched_tick(). It raises a softirq after the next regularly scheduled
rebalancing event for the current runqueue has arrived. The actual load
-balancing workhorse, run_rebalance_domains()->rebalance_domains(), is then run
+balancing workhorse, sched_balance_softirq()->sched_balance_domains(), is then run
in softirq context (SCHED_SOFTIRQ).

The latter function takes two arguments: the runqueue of current CPU and whether
-the CPU was idle at the time the scheduler_tick() happened and iterates over all
+the CPU was idle at the time the sched_tick() happened and iterates over all
sched domains our CPU is on, starting from its base domain and going up the ->parent
chain. While doing that, it checks to see if the current domain has exhausted its
-rebalance interval. If so, it runs load_balance() on that domain. It then checks
+rebalance interval. If so, it runs sched_balance_rq() on that domain. It then checks
the parent sched_domain (if it exists), and the parent of the parent and so
forth.

-Initially, load_balance() finds the busiest group in the current sched domain.
+Initially, sched_balance_rq() finds the busiest group in the current sched domain.
If it succeeds, it looks for the busiest runqueue of all the CPUs' runqueues in
that group. If it manages to find such a runqueue, it locks both our initial
CPU's runqueue and the newly found busiest one and starts moving tasks from it
37 changes: 21 additions & 16 deletions Documentation/scheduler/sched-stats.rst
@@ -2,6 +2,11 @@
Scheduler Statistics
====================

+Version 16 of schedstats changed the order of definitions within
+'enum cpu_idle_type', which changed the order of [CPU_MAX_IDLE_TYPES]
+columns in show_schedstat(). In particular the position of CPU_IDLE
+and __CPU_NOT_IDLE changed places. The size of the array is unchanged.
+
Version 15 of schedstats dropped counters for some sched_yield:
yld_exp_empty, yld_act_empty and yld_both_empty. Otherwise, it is
identical to version 14.
@@ -72,53 +77,53 @@ domain<N> <cpumask> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

The first field is a bit mask indicating what cpus this domain operates over.

-The next 24 are a variety of load_balance() statistics in grouped into types
+The next 24 are a variety of sched_balance_rq() statistics in grouped into types
of idleness (idle, busy, and newly idle):

-1) # of times in this domain load_balance() was called when the
+1) # of times in this domain sched_balance_rq() was called when the
cpu was idle
-2) # of times in this domain load_balance() checked but found
+2) # of times in this domain sched_balance_rq() checked but found
the load did not require balancing when the cpu was idle
-3) # of times in this domain load_balance() tried to move one or
+3) # of times in this domain sched_balance_rq() tried to move one or
more tasks and failed, when the cpu was idle
4) sum of imbalances discovered (if any) with each call to
-load_balance() in this domain when the cpu was idle
+sched_balance_rq() in this domain when the cpu was idle
5) # of times in this domain pull_task() was called when the cpu
was idle
6) # of times in this domain pull_task() was called even though
the target task was cache-hot when idle
-7) # of times in this domain load_balance() was called but did
+7) # of times in this domain sched_balance_rq() was called but did
not find a busier queue while the cpu was idle
8) # of times in this domain a busier queue was found while the
cpu was idle but no busier group was found
-9) # of times in this domain load_balance() was called when the
+9) # of times in this domain sched_balance_rq() was called when the
cpu was busy
-10) # of times in this domain load_balance() checked but found the
+10) # of times in this domain sched_balance_rq() checked but found the
load did not require balancing when busy
-11) # of times in this domain load_balance() tried to move one or
+11) # of times in this domain sched_balance_rq() tried to move one or
more tasks and failed, when the cpu was busy
12) sum of imbalances discovered (if any) with each call to
-load_balance() in this domain when the cpu was busy
+sched_balance_rq() in this domain when the cpu was busy
13) # of times in this domain pull_task() was called when busy
14) # of times in this domain pull_task() was called even though the
target task was cache-hot when busy
-15) # of times in this domain load_balance() was called but did not
+15) # of times in this domain sched_balance_rq() was called but did not
find a busier queue while the cpu was busy
16) # of times in this domain a busier queue was found while the cpu
was busy but no busier group was found

-17) # of times in this domain load_balance() was called when the
+17) # of times in this domain sched_balance_rq() was called when the
cpu was just becoming idle
-18) # of times in this domain load_balance() checked but found the
+18) # of times in this domain sched_balance_rq() checked but found the
load did not require balancing when the cpu was just becoming idle
-19) # of times in this domain load_balance() tried to move one or more
+19) # of times in this domain sched_balance_rq() tried to move one or more
tasks and failed, when the cpu was just becoming idle
20) sum of imbalances discovered (if any) with each call to
-load_balance() in this domain when the cpu was just becoming idle
+sched_balance_rq() in this domain when the cpu was just becoming idle
21) # of times in this domain pull_task() was called when newly idle
22) # of times in this domain pull_task() was called even though the
target task was cache-hot when just becoming idle
-23) # of times in this domain load_balance() was called but did not
+23) # of times in this domain sched_balance_rq() was called but did not
find a busier queue while the cpu was just becoming idle
find a busier queue while the cpu was just becoming idle
24) # of times in this domain a busier queue was found while the cpu
was just becoming idle but no busier group was found
10 changes: 5 additions & 5 deletions Documentation/translations/zh_CN/scheduler/sched-domains.rst
@@ -34,17 +34,17 @@ CPU共享。任意两个组的CPU掩码的交集不一定为空,如果是这
调度域中的负载均衡发生在调度组中。也就是说,每个组被视为一个实体。组的负载被定义为它
管辖的每个CPU的负载之和。仅当组的负载不均衡后,任务才在组之间发生迁移。

-在kernel/sched/core.c中,trigger_load_balance()在每个CPU上通过scheduler_tick()
+在kernel/sched/core.c中,sched_balance_trigger()在每个CPU上通过sched_tick()
周期执行。在当前运行队列下一个定期调度再平衡事件到达后,它引发一个软中断。负载均衡真正
-的工作由run_rebalance_domains()->rebalance_domains()完成,在软中断上下文中执行
+的工作由sched_balance_softirq()->sched_balance_domains()完成,在软中断上下文中执行
(SCHED_SOFTIRQ)。

-后一个函数有两个入参:当前CPU的运行队列、它在scheduler_tick()调用时是否空闲。函数会从
+后一个函数有两个入参:当前CPU的运行队列、它在sched_tick()调用时是否空闲。函数会从
当前CPU所在的基调度域开始迭代执行,并沿着parent指针链向上进入更高层级的调度域。在迭代
过程中,函数会检查当前调度域是否已经耗尽了再平衡的时间间隔,如果是,它在该调度域运行
-load_balance()。接下来它检查父调度域(如果存在),再后来父调度域的父调度域,以此类推。
+sched_balance_rq()。接下来它检查父调度域(如果存在),再后来父调度域的父调度域,以此类推。

-起初,load_balance()查找当前调度域中最繁忙的调度组。如果成功,在该调度组管辖的全部CPU
+起初,sched_balance_rq()查找当前调度域中最繁忙的调度组。如果成功,在该调度组管辖的全部CPU
的运行队列中找出最繁忙的运行队列。如能找到,对当前的CPU运行队列和新找到的最繁忙运行
队列均加锁,并把任务从最繁忙队列中迁移到当前CPU上。被迁移的任务数量等于在先前迭代执行
中计算出的该调度域的调度组的不均衡值。
30 changes: 15 additions & 15 deletions Documentation/translations/zh_CN/scheduler/sched-stats.rst
@@ -75,42 +75,42 @@ domain<N> <cpumask> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
繁忙,新空闲):


-1) 当CPU空闲时,load_balance()在这个调度域中被调用了#次
-2) 当CPU空闲时,load_balance()在这个调度域中被调用,但是发现负载无需
+1) 当CPU空闲时,sched_balance_rq()在这个调度域中被调用了#次
+2) 当CPU空闲时,sched_balance_rq()在这个调度域中被调用,但是发现负载无需
均衡#次
-3) 当CPU空闲时,load_balance()在这个调度域中被调用,试图迁移1个或更多
+3) 当CPU空闲时,sched_balance_rq()在这个调度域中被调用,试图迁移1个或更多
任务且失败了#次
-4) 当CPU空闲时,load_balance()在这个调度域中被调用,发现不均衡(如果有)
+4) 当CPU空闲时,sched_balance_rq()在这个调度域中被调用,发现不均衡(如果有)
#次
5) 当CPU空闲时,pull_task()在这个调度域中被调用#次
6) 当CPU空闲时,尽管目标任务是热缓存状态,pull_task()依然被调用#次
-7) 当CPU空闲时,load_balance()在这个调度域中被调用,未能找到更繁忙的
+7) 当CPU空闲时,sched_balance_rq()在这个调度域中被调用,未能找到更繁忙的
队列#次
8) 当CPU空闲时,在调度域中找到了更繁忙的队列,但未找到更繁忙的调度组
#次
-9) 当CPU繁忙时,load_balance()在这个调度域中被调用了#次
-10) 当CPU繁忙时,load_balance()在这个调度域中被调用,但是发现负载无需
+9) 当CPU繁忙时,sched_balance_rq()在这个调度域中被调用了#次
+10) 当CPU繁忙时,sched_balance_rq()在这个调度域中被调用,但是发现负载无需
均衡#次
-11) 当CPU繁忙时,load_balance()在这个调度域中被调用,试图迁移1个或更多
+11) 当CPU繁忙时,sched_balance_rq()在这个调度域中被调用,试图迁移1个或更多
任务且失败了#次
-12) 当CPU繁忙时,load_balance()在这个调度域中被调用,发现不均衡(如果有)
+12) 当CPU繁忙时,sched_balance_rq()在这个调度域中被调用,发现不均衡(如果有)
#次
13) 当CPU繁忙时,pull_task()在这个调度域中被调用#次
14) 当CPU繁忙时,尽管目标任务是热缓存状态,pull_task()依然被调用#次
-15) 当CPU繁忙时,load_balance()在这个调度域中被调用,未能找到更繁忙的
+15) 当CPU繁忙时,sched_balance_rq()在这个调度域中被调用,未能找到更繁忙的
队列#次
16) 当CPU繁忙时,在调度域中找到了更繁忙的队列,但未找到更繁忙的调度组
#次
-17) 当CPU新空闲时,load_balance()在这个调度域中被调用了#次
-18) 当CPU新空闲时,load_balance()在这个调度域中被调用,但是发现负载无需
+17) 当CPU新空闲时,sched_balance_rq()在这个调度域中被调用了#次
+18) 当CPU新空闲时,sched_balance_rq()在这个调度域中被调用,但是发现负载无需
均衡#次
-19) 当CPU新空闲时,load_balance()在这个调度域中被调用,试图迁移1个或更多
+19) 当CPU新空闲时,sched_balance_rq()在这个调度域中被调用,试图迁移1个或更多
任务且失败了#次
-20) 当CPU新空闲时,load_balance()在这个调度域中被调用,发现不均衡(如果有)
+20) 当CPU新空闲时,sched_balance_rq()在这个调度域中被调用,发现不均衡(如果有)
#次
21) 当CPU新空闲时,pull_task()在这个调度域中被调用#次
22) 当CPU新空闲时,尽管目标任务是热缓存状态,pull_task()依然被调用#次
-23) 当CPU新空闲时,load_balance()在这个调度域中被调用,未能找到更繁忙的
+23) 当CPU新空闲时,sched_balance_rq()在这个调度域中被调用,未能找到更繁忙的
队列#次
24) 当CPU新空闲时,在调度域中找到了更繁忙的队列,但未找到更繁忙的调度组
#次
6 changes: 3 additions & 3 deletions arch/arm/include/asm/topology.h
@@ -22,9 +22,9 @@
/* Enable topology flag updates */
#define arch_update_cpu_topology topology_update_cpu_topology

-/* Replace task scheduler's default thermal pressure API */
-#define arch_scale_thermal_pressure topology_get_thermal_pressure
-#define arch_update_thermal_pressure topology_update_thermal_pressure
+/* Replace task scheduler's default HW pressure API */
+#define arch_scale_hw_pressure topology_get_hw_pressure
+#define arch_update_hw_pressure topology_update_hw_pressure

#else

2 changes: 1 addition & 1 deletion arch/arm/kernel/topology.c
@@ -42,7 +42,7 @@
 * can take this difference into account during load balance. A per cpu
 * structure is preferred because each CPU updates its own cpu_capacity field
 * during the load balance except for idle cores. One idle core is selected
- * to run the rebalance_domains for all idle cores and the cpu_capacity can be
+ * to run the sched_balance_domains for all idle cores and the cpu_capacity can be
* updated during this sequence.
*/

6 changes: 3 additions & 3 deletions arch/arm64/include/asm/topology.h
@@ -35,9 +35,9 @@ void update_freq_counters_refs(void);
/* Enable topology flag updates */
#define arch_update_cpu_topology topology_update_cpu_topology

-/* Replace task scheduler's default thermal pressure API */
-#define arch_scale_thermal_pressure topology_get_thermal_pressure
-#define arch_update_thermal_pressure topology_update_thermal_pressure
+/* Replace task scheduler's default HW pressure API */
+#define arch_scale_hw_pressure topology_get_hw_pressure
+#define arch_update_hw_pressure topology_update_hw_pressure

#include <asm-generic/topology.h>

1 change: 0 additions & 1 deletion arch/powerpc/include/asm/Kbuild
@@ -6,5 +6,4 @@ generic-y += agp.h
generic-y += kvm_types.h
generic-y += mcs_spinlock.h
generic-y += qrwlock.h
-generic-y += vtime.h
generic-y += early_ioremap.h
13 changes: 0 additions & 13 deletions arch/powerpc/include/asm/cputime.h
@@ -32,23 +32,10 @@
#ifdef CONFIG_PPC64
#define get_accounting(tsk) (&get_paca()->accounting)
#define raw_get_accounting(tsk) (&local_paca->accounting)
-static inline void arch_vtime_task_switch(struct task_struct *tsk) { }

#else
#define get_accounting(tsk) (&task_thread_info(tsk)->accounting)
#define raw_get_accounting(tsk) get_accounting(tsk)
-/*
- * Called from the context switch with interrupts disabled, to charge all
- * accumulated times to the current process, and to prepare accounting on
- * the next process.
- */
-static inline void arch_vtime_task_switch(struct task_struct *prev)
-{
-	struct cpu_accounting_data *acct = get_accounting(current);
-	struct cpu_accounting_data *acct0 = get_accounting(prev);
-
-	acct->starttime = acct0->starttime;
-}
#endif

/*
22 changes: 22 additions & 0 deletions arch/powerpc/kernel/time.c
@@ -354,6 +354,28 @@ void vtime_flush(struct task_struct *tsk)
	acct->hardirq_time = 0;
	acct->softirq_time = 0;
}
+
+/*
+ * Called from the context switch with interrupts disabled, to charge all
+ * accumulated times to the current process, and to prepare accounting on
+ * the next process.
+ */
+void vtime_task_switch(struct task_struct *prev)
+{
+	if (is_idle_task(prev))
+		vtime_account_idle(prev);
+	else
+		vtime_account_kernel(prev);
+
+	vtime_flush(prev);
+
+	if (!IS_ENABLED(CONFIG_PPC64)) {
+		struct cpu_accounting_data *acct = get_accounting(current);
+		struct cpu_accounting_data *acct0 = get_accounting(prev);
+
+		acct->starttime = acct0->starttime;
+	}
+}
#endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */

void __no_kcsan __delay(unsigned long loops)
2 changes: 0 additions & 2 deletions arch/s390/include/asm/vtime.h
@@ -2,8 +2,6 @@
#ifndef _S390_VTIME_H
#define _S390_VTIME_H

-#define __ARCH_HAS_VTIME_TASK_SWITCH
-
static inline void update_timer_sys(void)
{
S390_lowcore.system_timer += S390_lowcore.last_update_timer - S390_lowcore.exit_timer;
1 change: 1 addition & 0 deletions arch/s390/kernel/irq.c
@@ -29,6 +29,7 @@
#include <asm/hw_irq.h>
#include <asm/stacktrace.h>
#include <asm/softirq_stack.h>
+#include <asm/vtime.h>
#include "entry.h"

DEFINE_PER_CPU_SHARED_ALIGNED(struct irq_stat, irq_stat);
1 change: 1 addition & 0 deletions arch/s390/kernel/nmi.c
@@ -31,6 +31,7 @@
#include <asm/crw.h>
#include <asm/asm-offsets.h>
#include <asm/pai.h>
+#include <asm/vtime.h>

struct mcck_struct {
unsigned int kill_task : 1;
26 changes: 13 additions & 13 deletions drivers/base/arch_topology.c
@@ -22,7 +22,7 @@
#include <linux/units.h>

#define CREATE_TRACE_POINTS
-#include <trace/events/thermal_pressure.h>
+#include <trace/events/hw_pressure.h>

static DEFINE_PER_CPU(struct scale_freq_data __rcu *, sft_data);
static struct cpumask scale_freq_counters_mask;
@@ -160,26 +160,26 @@ void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity)
	per_cpu(cpu_scale, cpu) = capacity;
}

-DEFINE_PER_CPU(unsigned long, thermal_pressure);
+DEFINE_PER_CPU(unsigned long, hw_pressure);

/**
- * topology_update_thermal_pressure() - Update thermal pressure for CPUs
+ * topology_update_hw_pressure() - Update HW pressure for CPUs
 * @cpus : The related CPUs for which capacity has been reduced
 * @capped_freq : The maximum allowed frequency that CPUs can run at
 *
- * Update the value of thermal pressure for all @cpus in the mask. The
+ * Update the value of HW pressure for all @cpus in the mask. The
 * cpumask should include all (online+offline) affected CPUs, to avoid
 * operating on stale data when hot-plug is used for some CPUs. The
 * @capped_freq reflects the currently allowed max CPUs frequency due to
- * thermal capping. It might be also a boost frequency value, which is bigger
+ * HW capping. It might be also a boost frequency value, which is bigger
 * than the internal 'capacity_freq_ref' max frequency. In such case the
 * pressure value should simply be removed, since this is an indication that
- * there is no thermal throttling. The @capped_freq must be provided in kHz.
+ * there is no HW throttling. The @capped_freq must be provided in kHz.
 */
-void topology_update_thermal_pressure(const struct cpumask *cpus,
+void topology_update_hw_pressure(const struct cpumask *cpus,
				      unsigned long capped_freq)
{
-	unsigned long max_capacity, capacity, th_pressure;
+	unsigned long max_capacity, capacity, hw_pressure;
	u32 max_freq;
	int cpu;

@@ -189,21 +189,21 @@ void topology_update_thermal_pressure(const struct cpumask *cpus,

	/*
	 * Handle properly the boost frequencies, which should simply clean
-	 * the thermal pressure value.
+	 * the HW pressure value.
	 */
	if (max_freq <= capped_freq)
		capacity = max_capacity;
	else
		capacity = mult_frac(max_capacity, capped_freq, max_freq);

-	th_pressure = max_capacity - capacity;
+	hw_pressure = max_capacity - capacity;

-	trace_thermal_pressure_update(cpu, th_pressure);
+	trace_hw_pressure_update(cpu, hw_pressure);

	for_each_cpu(cpu, cpus)
-		WRITE_ONCE(per_cpu(thermal_pressure, cpu), th_pressure);
+		WRITE_ONCE(per_cpu(hw_pressure, cpu), hw_pressure);
}
-EXPORT_SYMBOL_GPL(topology_update_thermal_pressure);
+EXPORT_SYMBOL_GPL(topology_update_hw_pressure);

static ssize_t cpu_capacity_show(struct device *dev,
struct device_attribute *attr,