sched_ule(4): Improve long-term load balancer.
Before this change long-term load balancer was unable to migrate running threads, only ones waiting on run queues. But with growing number of CPU cores it is quite typical now for system to not have many waiting threads. But same time if due to some coincidence two long-running CPU-bound threads ended up sharing same physical CPU core, they could suffer from the SMT penalty indefinitely, and the load balancer couldn't help. Improve that by teaching the load balancer to hint running threads to migrate by marking them with TDF_NEEDRESCHED and new TDF_PICKCPU flag, making sched_pickcpu() to search for better CPU later, when it is convenient. Fix CPU search logic when balancing to limit round-robin migrations in case of almost equal load to the group of physical cores. The previous code bounced threads across all the system, that should be pretty bad for caches and NUMA affinity, while additional fairness was almost invisible, diminishing with number of cores in the group. MFC after: 1 month