This document contains a set of frequently asked questions (FAQ) and answers. It covers the implementation details of the scheduling extensions contained in QNX Neutrino Adaptive Partitioning, as well as any common questions related to partitions in general.
This chapter includes:
The information contained in this Frequently Asked Questions section is subject to change at any time without notice. QSS makes no representation or warranty regarding the information and is not responsible whatsoever for reliance on the information contained herein.
The thread scheduler guarantees a minimum CPU budget by ensuring that other partitions don't overrun their budget. This determination is made every clock tick.
The clock interrupt handler invokes the thread scheduler. That means it runs a minimum of every clock period (typically every millisecond). On each clock tick:
On exit from the Neutrino clock interrupt handler, the handler examines the flag. If set, it causes the system to immediately enter the kernel and invoke the full scheduling algorithm.
The full thread scheduler algorithm examines all partitions. It stops running the current partition if it is about to go out of budget (i.e. it no longer has enough budget to pay for another quarter clock period, plus one clock period for each additional CPU if the system is multicore). In other words, the thread scheduler guarantees that budgets are met by forcing a partition to temporarily stop running if it would run over its budget before the next time the scheduler is in control of the system. This also requires that some other partition has budget and threads that are ready to run.
The thread scheduler guarantees that budgets are met by forcing a partition to temporarily stop running if it would otherwise run over its budget before the next time the scheduler is in control of the system.
The thread scheduler makes sure that a partition gets at least its budget in the current averaging window when:
Then the thread scheduler guarantees that partition p gets B(p) percent of CPU time over the last averaging window if:
R(p) >= N * B(p)/100
In other words, the partition must have enough ready-to-run threads to occupy its share of the processors in the system.
In other words, budgets are guaranteed if the system is busy enough and no partition has used its critical budget.
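As a purely illustrative check (not scheduler code), the condition R(p) >= N * B(p)/100 can be restated as a minimum thread count:

    /* Illustrative only: the minimum number of ready-to-run threads that
     * partition p needs so it can consume its full budget of B(p) percent
     * on an N-CPU system; this is R(p) >= N * B(p)/100 in integer form. */
    static int min_ready_threads(int num_cpus, int budget_percent)
    {
        return (num_cpus * budget_percent + 99) / 100;   /* ceiling of N*B/100 */
    }
    /* e.g. min_ready_threads(2, 50) == 1, min_ready_threads(4, 60) == 3 */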
See the next answer.
A 100-ms averaging window doesn't store an exact, continuous record of CPU usage. Rather, it stores a history of CPU usage, with detail for each of the last 100 one-millisecond intervals. The window rotates, or slides forward in time, on every clock tick. So the window provides precise information about the average CPU consumption every millisecond (or clock period).
Between clock ticks, when the thread scheduler algorithm is called, CPU usage of each partition is approximated with the assumption that each partition will likely run continuously at least until the next clock tick.
In other words, the thread scheduler computes the used CPU time and enforces the budgets, many times per millisecond.
In order to guarantee that the partitions get their guaranteed minimum CPU budgets, the design assumes:
Never. It avoids doing division in order to execute quickly.
The scheduler only compares a partition's CPU usage with its budget, expressed as a total time over the last averaging window rather than as a percentage. To make a quick comparison, both usage and budgets are treated internally as counts of ClockCycles(), not as percentages.
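For illustration, a sketch of that internal representation (not the kernel's actual code) converts the percentage once, so the recurring test is a single comparison of counters:

    #include <stdint.h>

    /* Sketch: a percentage budget converted to a ClockCycles() count over
     * the averaging window, computed once at configuration time. */
    static uint64_t budget_in_cycles(uint64_t window_cycles, unsigned budget_percent)
    {
        return (window_cycles / 100) * budget_percent;
    }

    /* Per-scheduling-operation test: no division, just a comparison. */
    static int has_budget(uint64_t used_cycles, uint64_t budget_cycles)
    {
        return used_cycles < budget_cycles;
    }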
At least once every clock period (typically every millisecond). However, it also does so on kernel calls, such as message and pulse sending or mutex releases. For example, on a 733 MHz x86 machine that performs a lot of I/O, the scheduler computes CPU usage around 50 times every millisecond.
Within a single partition, the thread scheduler always follows POSIX scheduling rules, i.e. preemptive priority-based scheduling with FIFO and sporadic policies. So a partition looks somewhat like a complete system in POSIX.
However, the CPU time seen by a partition may be time-sliced with threads running in other partitions.
So the question remains: when does a partition get continuous realtime behavior? Since our definition of realtime is to schedule strictly by priority, the answer is that the thread scheduler schedules strictly by priority whenever the partitions have used less than their budgets over the last averaging window. This implies that all threads run by priority-preemption rules as long as their partitions haven't exhausted their budgets in the current averaging window. In brief, it's realtime when a partition is using less than its budget.
See the next answer.
Free-time mode is a specific budget situation when at least one partition with a nonzero budget isn't using all of its budget. Free-time mode means other partitions may use up the free time even if they exceed their own budgets. This is one of the reasons why adaptive partitioning is adaptive.
The extra time a partition gets in free time mode is called “free time,” but it isn't always free; sometimes it must be paid back.
Partly. In general, only the free time during the last averaging window needs to be paid back.
For example, suppose that partition p1 has exhausted its budget, and another partition p2 has available budget. Therefore partition p2 is running. Now assume that partition p2 becomes idle (i.e. goes to sleep) for 10 milliseconds. Because partition p2 has no competition and is in free-time mode, partition p1 begins running and exceeds its budget by 10 milliseconds.
Now, say partition p2 wakes up. Partition p1 won't run until the averaging window rotates enough to carry the history of its CPU over-usage past 100 milliseconds into the past. So, p1 may not run until window-size − budget milliseconds pass. This interval, where p1 is suspended, is effectively paying back the free time.
In general, only free time that amounts to less than window size − budget must be paid back.
In a different example, suppose partition p2 goes to sleep for a minute. In this situation, partition p1 runs opportunistically and subsequently consumes 100% of the CPU. When partition p2 wakes up, it will have available budget, and partition p1 will be over budget, so partition p2 will run.
Partition p1 won't run again until window rotation removes the history of its CPU over-usage past 100 milliseconds into the past. So in this case, partition p1 needs to pay back only window-size − budget milliseconds of the minute of CPU time it ran while partition p2 was asleep.
While the partition is over budget (because of the free time it received), it won't run at all until enough time has passed for the total usage recorded in the averaging window to fall below its budget. In effect, the partition stops running until its stopped time compensates for the free time it took earlier.
An exception is free time that occurred just before a call to SchedCtl(SCHED_APS_SET_PARMS,...) to change the window size. Changing the window size wipes the scheduler's memory so free time just before a change in window size isn't paid back.
Adaptive partitioning treats a two-headed HT processor as a multicore system with two CPUs. It assumes that each virtual processor has equal and constant throughput. Whereas this is true for SMP machines, it's true on HT machines only when the system is sufficiently loaded to keep both pseudo-CPUs busy. Adaptive partitioning requires that a system's throughput be proportional to the ClockCycles() function.
Without the thread scheduler (i.e. using classic Neutrino scheduling), a round-robin thread:
With the thread scheduler, a round-robin thread:
The scheduler effectively overrides the time slice for a round-robin thread. When a partition has more than 4 ticks of available time left in its budget, the thread scheduler's behavior is the same as with classic Neutrino scheduling. However, on a loaded system, it's best to assume that a round-robin thread may be sliced every tick.
When a round-robin thread is preempted by the scheduler, the scheduler may run a thread in a different partition. In other words, round-robin behavior is unchanged relative to the other threads in the same partition.
Without the thread scheduler, if not preempted by a higher priority thread, a FIFO thread runs until it gives up control voluntarily.
With the thread scheduler, a FIFO thread, if not preempted by a higher-priority thread in the same partition, runs until it gives up control voluntarily or its partition runs out of budget.
FIFO behavior is unchanged as long as your partition has budget. On a loaded system, it's best to assume that a FIFO thread may be time sliced every millisecond with threads in other partitions. However, relative to all other threads in the same partition, FIFO behavior is the same as in classic Neutrino scheduling.
Without the thread scheduler, if not preempted by a higher priority thread, an SS thread runs until it gives up control voluntarily. Since the priority of an SS thread alternates between normal and low priorities, it's likely to be preempted when running at its low priority.
With the thread scheduler, an SS thread, if not preempted by a higher-priority thread in the same partition, runs until it gives up control voluntarily or its partition runs out of budget.
Some developers set the higher priority of a sporadic-scheduled thread to be the highest priority in the system, in order to make the thread nonpreemptible during its high-priority mode. With the thread scheduler, the thread is non-preemptible only as long as its partition hasn't exhausted its budget.
Sporadic scheduling behavior is unchanged as long as your partition has budget. On a loaded system, it's best to assume that an SS thread may be time-sliced every millisecond with threads in other partitions. However, relative to all other threads in the same partition, SS behavior is the same as in classic Neutrino scheduling.
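For reference, this is roughly how an SS thread is created with the standard POSIX sporadic-server attributes; the priorities and times below are arbitrary example values, and the partition-related behavior described above applies on top of this:

    #include <pthread.h>
    #include <sched.h>

    /* Example only: start a thread under SCHED_SPORADIC.  Whether its
     * high-priority phase can be preempted by other partitions depends on
     * its partition's remaining budget, as described above. */
    void start_sporadic_thread(void *(*fn)(void *), void *arg)
    {
        pthread_attr_t attr;
        struct sched_param param = { 0 };
        pthread_t tid;

        pthread_attr_init(&attr);
        pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
        pthread_attr_setschedpolicy(&attr, SCHED_SPORADIC);

        param.sched_priority = 50;                      /* normal (high) priority */
        param.sched_ss_low_priority = 10;               /* background priority */
        param.sched_ss_max_repl = 10;                   /* max pending replenishments */
        param.sched_ss_repl_period.tv_sec = 0;
        param.sched_ss_repl_period.tv_nsec = 40000000;  /* 40 ms replenishment period */
        param.sched_ss_init_budget.tv_sec = 0;
        param.sched_ss_init_budget.tv_nsec = 10000000;  /* 10 ms execution budget */
        pthread_attr_setschedparam(&attr, &param);

        pthread_create(&tid, &attr, fn, arg);
    }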
See the next answer.
The thread scheduler runs and enforces budgets:
The frequency depends on how often messaging occurs.
If the system suspends, the thread scheduler is unaware of the interruption. Upon resumption, partitions have the same percentage consumption they had at suspension.
If the system varies the processor speed to conserve power, the thread scheduler is unaware of the variation. Although the scheduler guarantees that all partitions get their budget percentages, it assumes that each millisecond has the same throughput. This means that partition budget enforcement is effectively inaccurate for the 100 milliseconds (or window size) after the CPU changes speed. Thereafter, it's accurate.
If you change the clock period, the thread scheduler can't schedule accurately because it's unaware of the change in the size of the tick. However, calling SchedCtl(SCHED_APS_SET_PARMS,...) with the existing window size causes the scheduler to recalculate all internal parameters that depend on the size of the clock period, which restores accuracy.
As described in the Adaptive Partitioning User's Guide, you should set the window size after changing the clock period.
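A sketch of that sequence follows. It assumes the <sys/sched_aps.h> interface in which the parameter structure for SCHED_APS_SET_PARMS is sched_aps_parms with a windowsize_ms field and an APS_INIT_DATA() initializer; those names should be verified against the header on your target.

    #include <time.h>
    #include <sys/neutrino.h>
    #include <sys/sched_aps.h>

    /* Sketch: change the clock period, then re-issue the current window size
     * so the thread scheduler recomputes the parameters that depend on the
     * clock period.  Struct/field names are assumptions; see the lead-in. */
    int set_clock_period_and_refresh_aps(unsigned long period_ns, int window_ms)
    {
        struct _clockperiod period = { 0 };
        sched_aps_parms parms;

        period.nsec = period_ns;
        if (ClockPeriod(CLOCK_REALTIME, &period, NULL, 0) == -1)
            return -1;

        APS_INIT_DATA(&parms);
        parms.windowsize_ms = window_ms;     /* same window size as before */
        return SchedCtl(SCHED_APS_SET_PARMS, &parms, sizeof(parms));
    }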
Microbilling refers to accounting for the CPU time used by a thread at a much finer resolution than the clock period between tick interrupts.
The thread scheduler is designed for systems in which threads send or receive messages many times (as opposed to a single time) per clock period. Adaptive partitioning scheduling would not be possible if the scheduler were limited to counting integer ticks of CPU time; that's because most threads send or receive messages, or otherwise block, many times per clock period.
Microbilling works by taking a fine-resolution timestamp every time a thread changes state from ready to not-ready, and charging differences between sequential timestamps against that partition's used CPU cycles count.
Microbilling uses the system call ClockCycles() to get that fine-resolution timestamp.
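Conceptually (a simplified sketch, not the kernel's implementation), microbilling looks like this:

    #include <stdint.h>
    #include <sys/neutrino.h>        /* ClockCycles() */

    static uint64_t last_timestamp;  /* time of the last thread state change */

    /* Sketch: on every running -> not-running transition, charge the time
     * since the last state change to the partition that was running. */
    void microbill(uint64_t *partition_used_cycles)
    {
        uint64_t now = ClockCycles();

        *partition_used_cycles += now - last_timestamp;
        last_timestamp = now;
    }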
The thread scheduler microbills each time that:
How ClockCycles() is implemented depends on the processor being used. On x86 processors, Neutrino uses a free-running counter that is implemented on the CPU chip itself. This counter is read with a single instruction.
On PowerPC targets, Neutrino reads a similar free-running counter with just a few instructions. In these cases, ClockCycles() typically increments at about the processor's clock rate (i.e. ClockCycles() increases by 3 billion counts every second on a 3 GHz machine).
On both x86 and PowerPC processors, ClockCycles() increases by about 1 billion counts every second on a 1 GHz processor.
On processors that don't have a free-running counter suitable for use as a fine-grained clock, Neutrino emulates ClockCycles(). For example, on ARM processors, Neutrino reads the intermediate value of the countdown timer that's used to trigger the clock interrupts. This value tells how far you are into the current clock tick. Neutrino adds a scaled version of that value to a constant determined at the last clock tick to get an emulated ClockCycles() value.
On some processors, such as ARM, the countdown timer used for emulating ClockCycles() is located off-chip and requires slow I/O operations to read it. On other processors, such as MIPS, the countdown timer is located on-chip, and can be quickly read.
See the next answer.
The accuracy of microbilling or ClockCycles() is determined by the accuracy of the clock oscillator source used in the CPU. However, since the scheduling is relative between partitions, it doesn't require ClockCycles() be equal to the absolute time; it requires only that ClockCycles() be proportional to the work done by CPU. In fact, a wrongly calibrated ClockCycles() has no effect on the accuracy of the thread scheduler.
It's the resolution of the ClockCycles() function. The resolution of ClockCycles() varies from platform to platform. In most cases, the resolution is much finer than a clock tick.
The thread scheduler requires a resolution of 1/200 of a tick to meet its specification for accuracy. On some platforms, such as x86, the resolution is on the order of nanoseconds.
The averaging window consists of tables. There are two tables per partition, one for the CPU time spent while critical, and another for any CPU time spent. The tables have one slot per timer tick. So a 100-ms averaging window, with a 1-ms clock period, has 100 slots. Each slot is used to hold the CPU time spent during a particular tick interval. For example:
[99ms ago][98 ms ago][97 ms ago]....[1 ms ago][current ms]
The slots hold the total CPU times of all threads in that partition as measured by consecutive calls to ClockCycles(). Note that total CPU times are then scaled by a carefully chosen factor so that all numbers fit into a 32-bit unsigned integer register.
At any time, the sum of the elements of a table represents the total CPU time used by that partition over the averaging period.
When the scheduler stops running a thread, it adds the time that thread has spent since it started running, or since the last tick, into the [current ms] slot of the table. If the thread was running as critical, the scheduler also adds the time to the [current ms] slot of that partition's critical time table. The scheduler also does this when a clock tick occurs.
However, on a clock tick, after billing the current thread to its partition's [current ms] slot, the scheduler also rotates the table. To rotate the table, it does the following:
This is called window rotation. Each rotation effectively provides available budget back to the partition that ran 99 ms ago. Window rotation is implemented without summing the entire table, shifting the table, or calls to the malloc() or free() functions.
The averaging window isn't physically rotated. It's logically rotated:
usage_hist[cur_hist_index] += delta_time; used_cycles += delta_time;
used_cycles -= usage_hist[(cur_hist_index +1) MOD 100]
cur_hist_index = (cur_hist_index+1) MOD 100
This is done for every partition, for both normal and critical CPU time.
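A minimal sketch of the same idea in C (illustrative only; it adds the implied step of clearing the slot that is being reused, and is not the kernel's actual data layout):

    #include <stdint.h>

    #define SLOTS 100                      /* one slot per tick in a 100-ms window */

    typedef struct {
        uint32_t usage_hist[SLOTS];        /* scaled CPU time per tick interval */
        uint32_t used_cycles;              /* running total over the whole window */
        unsigned cur_hist_index;           /* slot for the current millisecond */
    } aps_window;

    /* Microbilling: charge time to the current slot and the running total. */
    static void bill(aps_window *w, uint32_t delta_time)
    {
        w->usage_hist[w->cur_hist_index] += delta_time;
        w->used_cycles += delta_time;
    }

    /* Window rotation on each clock tick: drop the oldest slot from the total
     * and reuse it as the new current slot.  No summing of the table, no
     * shifting, and no malloc()/free(). */
    static void rotate(aps_window *w)
    {
        unsigned next = (w->cur_hist_index + 1) % SLOTS;

        w->used_cycles -= w->usage_hist[next];   /* forget what ran 100 ms ago */
        w->usage_hist[next] = 0;                 /* slot becomes the new current ms */
        w->cur_hist_index = next;
    }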
See the next answer.
You can change the window size with SchedCtl(SCHED_APS_SET_PARMS,...) on the fly. The scheduler doesn't malloc() new tables, but it does zero the history in all tables, zero all the totals, and zero the table indexes.
The effect is to wipe the scheduler's memory; the scheduler then assumes that no partition has run in the last x ms, where x is the new window size.
We recommend you leave the window size at the default, or set it during startup. Also, you shouldn't change the window size often.
In general, the longer the averaging window, the longer the partition has to wait before it gets the CPU time.
For example, with a 100-millisecond averaging window and a partition p with a 10% budget, partition p exhausts its budget if it runs continuously for 10 milliseconds. It then has to wait another 90 milliseconds before window rotations cause the averaging window to lose the memory of its past execution. So, it will be 90 milliseconds before partition p gets some available budget back and runs again.
However, in most real systems that engage in inter-partition interaction, partition p's 10 milliseconds of running time is likely to get spread out in the averaging window. So even if p exhausts the budget soon, it will most likely get available budget back in much less than 90 milliseconds.
The Adaptive Partitioning User's Guide describes an unlikely scenario where two interacting partitions result in a latency larger than window size − budget.
See the next answer.
The thread scheduler evaluates a merit function on each partition and chooses the partition with the highest merit. It then picks the highest-priority thread in that partition. A partition with budget has more merit than a partition that has exhausted its budget.
First, let's look at a few helper functions. The details are provided below:
Or:
cycles_used(p) + cycles_left_in_current_tick <= budget_cycles(p)
where:
1 - used_cycles(p)/budget_cycles(p)
If the partition has a zero budget, then RFF(p) is defined to be a constant smaller than the smallest possible value of RFF() for all other nonzero partitions.
Some operating modes, defined by these boolean expressions, are also defined:
The scheduler picks one of the merit functions, depending on the operating mode:
If the mode is idle, the scheduler chooses to run the idle thread in the System partition.
Otherwise, the scheduler chooses to run the highest-priority thread that has a runmask compatible with the CPU on which the scheduler was invoked, from the partition p such that:
merit(p) > merit(p')
for all p' not equal to p.
Merit functions return tuples, and are compared as tuples. For example:
(a,b) < (c,d) if (a<c) || ( (a=c) && (b<d) )
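For instance, a two-element merit tuple could be compared like this (illustrative only):

    /* Illustrative lexicographic comparison of two-element merit tuples:
     * nonzero if (a1,b1) is greater than (a2,b2). */
    static int merit_greater(int a1, long long b1, int a2, long long b2)
    {
        return (a1 > a2) || ((a1 == a2) && (b1 > b2));
    }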
It does this very quickly. Each partition has a bitmap that tracks the priority levels (from 0 to 255) that are in use by some ready-to-run thread in that partition.
Each time the scheduler makes a thread ready to run, it sets the bit corresponding to that thread's priority. When the scheduler runs a thread (i.e. its state changes from ready to running), the scheduler examines the queue of threads in that partition that are ready to run at the same priority. If there are no other threads at that priority, it clears the bit for that thread's priority.
When the scheduler needs to know the highest-priority thread that is ready to run in a partition, it uses the bitmap to index a table that maps integers to the number of their highest 1 bit. This is done with a set of tables to avoid the need for 2^255 table elements.
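The following is a hedged sketch of that table-driven lookup, using eight 32-bit words and an 8-bit table rather than the kernel's actual layout:

    #include <stdint.h>

    static int8_t top_bit_of_byte[256];   /* highest 1 bit in a byte, -1 if zero */

    static void init_top_bit_table(void)
    {
        int v, bit;

        top_bit_of_byte[0] = -1;
        for (v = 1; v < 256; v++)
            for (bit = 7; bit >= 0; bit--)
                if (v & (1 << bit)) { top_bit_of_byte[v] = bit; break; }
    }

    /* Sketch: highest priority (0..255) with a ready thread, given a 256-bit
     * ready bitmap stored as eight 32-bit words; -1 if no thread is ready. */
    static int highest_ready_priority(const uint32_t ready_bits[8])
    {
        int word, byte;

        for (word = 7; word >= 0; word--) {
            if (ready_bits[word] == 0)
                continue;
            for (byte = 3; byte >= 0; byte--) {
                uint8_t b = (uint8_t)(ready_bits[word] >> (byte * 8));
                if (b != 0)
                    return word * 32 + byte * 8 + top_bit_of_byte[b];
            }
        }
        return -1;
    }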
The same mechanism is also used in classic Neutrino scheduling. The macros are:
For the scheduling algorithm, the computation of RFF() requires floating-point division. However, Neutrino doesn't perform floating-point operations inside the kernel, or even fixed-point division; these operations are very slow on some platforms.
Neutrino computes a function equivalent to RFF() that requires only addition and multiplication.
For the scheduling algorithm, the computation of RFF() requires floating-point division. However, Neutrino doesn't need the absolute values of RFF(); it needs to know only the relative ordering of RFF(p1), RFF(p2), .... RFF(pn).
Therefore, Neutrino computes a different function that has the same ordering properties as RFF(). This function is computable with only addition and 16×16 bit multiplication.
The idea is:
RFF(p) = 1 - cycles_used/budget_cycles
However, instead of finding partition p such that RFF(p) > RFF(p') for all p' not equal to p, define relative_fraction_used(p) = RFU(p) = cycles_used/budget_cycles, and find partition p such that RFU(p) < RFU(p') for all p' not equal to p.
used_cycles(p0)/budget_cycles(p0) < used_cycles(p1)/budget_cycles(p1) < .... < used_cycles(pn)/budget_cycles(pn)
If we define k = budget_cycles(p0) * budget_cycles(p1) * ... * budget_cycles(pn), and c(p) = k/budget_cycles(p), then
f(p) = used_cycles(p) * c(p)
and compare f(p) to f(p') to find which has the better RFF().
However, there are two complications:
f(p) = (used_cycles(p)>>scaling_factor) * scaled_c(p)
Therefore, Neutrino can set f() for a zero-budget partition as:
f_zero = 1 + window size*c(pm)
and then scale it as described for running out of bits.
The window size is expressed in cycles.
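Putting this together, a hedged sketch of the division-free comparison (with a right shift standing in for the scaling that keeps the products within the available bits):

    #include <stdint.h>

    /* Sketch: c(p) = k / budget_cycles(p) is precomputed when budgets change
     * (k is the product of all partitions' budgets), so the ordering of
     * f(p) = used_cycles(p) * c(p) matches the ordering of used/budget. */
    static uint64_t f_of(uint64_t used_cycles, uint32_t scaled_c, unsigned scaling_factor)
    {
        return (used_cycles >> scaling_factor) * (uint64_t)scaled_c;
    }

    /* Partition a has a better (larger) RFF() than partition b if f(a) < f(b). */
    static int better_rff(uint64_t f_a, uint64_t f_b)
    {
        return f_a < f_b;
    }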
See the next answer.
When the scheduling algorithm picks a thread that is allowed to run as critical, it doesn't always charge its CPU time to its partition's critical budget. A thread t charges its CPU time to the critical budget of its partition p only when the following are true when the scheduling algorithm is invoked:
COMPETING(p') && (HAS_BUDGET(p') || MAY_RUN_CRITICAL(p')) == True
For definitions of COMPETING(), HAS_BUDGET(), and MAY_RUN_CRITICAL(), see the topic How does the scheduling algorithm work?.
The mathematics of the algorithm are extendable to any number of partitions. However, these are the limitations of the current implementation:
Accuracy refers to the closeness of the scheduler's guarantee or limit that a partition can consume only its budget on a loaded system. For Neutrino, the accuracy is measured based on whichever is greater:
Or
When you change the averaging window size to x ms, the accuracy is undefined for the next x ms.
The first limitation comes from the precision with which the RFF() calculation is carried out: RFF() is calculated to a limited number of bits, specifically to speed up the scheduling algorithm.
The second limitation comes from the uncertainty in predicting how long a thread runs before it voluntarily blocks, is preempted by a higher-priority thread, or when the next tick interrupt occurs. This limitation comes from the fact that the thread scheduler has guaranteed control of the system only every tick (but may run more often).
In practice, the last limitation implies that when the window size is changed, the scheduler clears its history of used CPU time. So the partition (p) with the highest-priority thread runs for budget(p) × window size milliseconds before another partition runs. After the window size (in milliseconds) has elapsed, all budgets are again guaranteed. So a partition configured for a budget of 40%, with a 100-millisecond averaging window, is considered to be scheduled accurately when its usage over the last 100 ms was 39 to 41 ms, provided the window size hasn't changed in the last 100 milliseconds. In practice, the scheduling accuracy is usually much better.
In order to save overhead, a very short version of the scheduling algorithm is used on some paths involved in message passing. This short version is implemented with the internal scheduler functions, such as ready_ppg(), block_and_ready_ppg() and adjust_priority_ppg().
Let's consider all kernel calls that switch threads, such as messaging and mutex operations, to be overhead. Call the initial running thread t1, and the next thread t2. Let's consider the kernel calls that are initiated by t1 and cause t1 to stop running and t2 to start running.
The overhead is split between t1 and t2, but is charged mostly to t1, as follows:
Time to: | Is charged to the partition of:
---|---
Enter the kernel | t1
Run the scheduling algorithm | t1
Do a context switch | t2
Exit the kernel | t2
There are two parts of interrupt servicing: the interrupt handler and the interrupt thread.
If you service interrupts with an interrupt thread, most of the time spent servicing the interrupt is the thread's time, and only a small part of the time is spent in the interrupt handler. The interrupt handler determines the thread to which the interrupt event should be delivered.
If you service interrupts with an interrupt handler, all of the time spent servicing the interrupt is in the handler.
The time spent in an interrupt thread is charged against the partition of that thread.
The time spent in an interrupt handler is charged against the partition that's running at that time.
Since interrupts occur at random times, the time spent in interrupt handlers is spread evenly over all running partitions.
Our results indicate that heavy compile benchmarks, which involve a lot of filesystem-related messaging, are about 1% slower on x86 platforms when using the thread scheduler.
Both of these are in the kernel space.
In approximate order of importance, the cost of the thread scheduler increases with:
In all the above cases, the increase is approximately linear.
The following factors don't affect the cost of scheduling at all:
See the next answer.
Neutrino maintains a data block, the thread_entry, representing the state of each thread. It contains three state bits for handling critical threads, which indicate whether or not the thread is:
These state bits are turned on as follows:
State bit | Turned on when:
---|---
Always allowed | The user calls SchedCtl() with the SCHED_APS_MARK_CRITICAL command on that thread.
Until blocked | The thread receives an event from an interrupt handler, a message from another thread marked either “always allowed to run critical” or “allowed to run critical until it blocks”, or an event on which the user has previously called the macro SIGEV_MAKE_CRITICAL().
Currently running as critical | The scheduling algorithm decides that the thread would not have been eligible to run if it hadn't been allowed to run as critical.
No.
You can set your own thread to be critical, or receive a critically tagged event or message; either way, the thread acquires the “allowed to run critical” property. However, you must configure the partition with a nonzero critical budget to:
Setting a nonzero critical budget on a partition is controlled. For the recommended scheduler partition security settings, only root, running in the parent partition of a target partition, can set a nonzero critical budget.
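For example, the tagging described above might look like the following sketch. SIGEV_MAKE_CRITICAL() and the SCHED_APS_MARK_CRITICAL command are named in this FAQ; the parameter structure used with the command (assumed here to be sched_aps_mark_critical with pid and tid fields) should be verified against <sys/sched_aps.h>.

    #include <sys/neutrino.h>
    #include <sys/siginfo.h>
    #include <sys/sched_aps.h>

    /* Sketch: tag a pulse event so its receiver may run as critical until it
     * blocks, and mark the calling thread as always allowed to run critical. */
    void critical_tagging_example(int coid)
    {
        struct sigevent ev;
        sched_aps_mark_critical mark;      /* struct/field names assumed */

        SIGEV_PULSE_INIT(&ev, coid, 10, _PULSE_CODE_MINAVAIL, 0);
        SIGEV_MAKE_CRITICAL(&ev);          /* receiver runs critical until it blocks */

        APS_INIT_DATA(&mark);
        mark.pid = 0;                      /* 0,0 is assumed to mean the calling thread */
        mark.tid = 0;
        SchedCtl(SCHED_APS_MARK_CRITICAL, &mark, sizeof(mark));
    }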
To save time, the thread scheduler polls partitions for bankruptcy only on each clock tick (rather than on every scheduling operation). So typically, bankruptcy is detected one millisecond (or one clock period) after a partition's critical budget has been exhausted.
Neutrino compares the total critical time (over the last averaging window) to the partition's configured maximum critical time budget. Each partition maintains a separate rotating window for tracking critical time usage. The history window for this critical time identifies, for each millisecond of the last 100 milliseconds, which part of the total CPU time was considered to be critical time.
Partition inheritance occurs when the scheduler bills the CPU time of a thread not to its own partition, but to the partition of a different thread. This feature makes the thread scheduler adaptive.
Partition inheritance occurs under two situations:
When a client thread sends a message to a server thread, that server thread is considered to be working on the client thread's behalf. In this case, Neutrino charges the execution time of the receiving thread, from the time it receives the message and up to the time it waits for the next message, to the partition of the sending thread.
This means that resource managers, such as filesystems, automatically bill their time to their appropriate clients. This implies that partitions containing only resource managers don't need to be reengineered every time a new client is added to the system.
When threads line up for access to a mutex, Neutrino doesn't consider the thread holding the mutex to be waiting on behalf of the threads waiting for the mutex. So, there is no inheritance of partitions.
However, there is a special scenario when the thread holding the mutex is in a partition that has run out of available budget. Then the thread can't run and release the mutex. All the threads waiting for that mutex are stalled until enough window rotations have occurred for the mutex-holding partition to regain some available budget. This is particularly nasty if the user has configured that partition to have a zero budget.
So, when a thread t1 holds a mutex in a partition that has exhausted its budget, and another thread t2 attempts to seize the mutex, Neutrino puts thread t2 to sleep until thread t1 releases the mutex (which is classic mutex handling), and changes the partition of thread t1 to that of thread t2 (p2) until t1 releases the mutex, provided the budget of partition p2 is nonzero. This prevents extended delays should the current mutex holder run out of budget.
Very fast.
The data block that Neutrino keeps for each thread, the thread_entry, has a pointer to its containing partition. So inheritance is simply a matter of swapping the pointer. Often, Neutrino doesn't even need to update the microbilling because the same partition is executing before and after the inheritance.
Sending a message to a process effectively gives the sender's partition budget to the receiver thread (temporarily). However, to receive threads in that manner, the receiver process must have been started under the root user.
You can change a partition's budget any time.
See the next answer.
The operation is quick and doesn't reset the scheduler or cause any change to the partition's history of CPU usage that is stored in the averaging window.
However, if you change the budget of a partition from 90% to 10%, the partition could suddenly become over budget. In this situation, the partition may not run again until enough window rotations have occurred to lower the partition's used cycles below its budget.
A change in budget takes effect at the next tick interrupt or the next scheduling operation, i.e. typically in less than one millisecond.
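A hedged sketch of such a change (SCHED_APS_MODIFY_PARTITION is the APS command for this; the sched_aps_modify_parms structure and its field names here are assumptions to check against <sys/sched_aps.h>):

    #include <sys/neutrino.h>
    #include <sys/sched_aps.h>

    /* Sketch: change a partition's budget at run time.  The change takes
     * effect at the next tick or scheduling operation; the usage history
     * stored in the averaging window is not reset. */
    int change_partition_budget(int partition_id, int new_percent)
    {
        sched_aps_modify_parms mod;        /* struct/field names assumed */

        APS_INIT_DATA(&mod);
        mod.id = partition_id;
        mod.new_budget_percent = new_percent;
        return SchedCtl(SCHED_APS_MODIFY_PARTITION, &mod, sizeof(mod));
    }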
Threads in a partition with a defined budget of zero run if all partitions with nonzero budgets are sleeping. These threads also run if they inherit the partition of a thread that sends them a message. Zero-budget partitions are most useful for resource managers with no internal daemon threads. They're also useful for turning off unused partitions.
At startup, Neutrino creates the first partition (the System partition) with a budget of 100%. Thereafter, when a thread running in a partition creates a new partition, the current partition is considered as the parent and the new partition is the child. The budget of the child is always taken from the budget of the parent, and may never reduce the parent's budget below zero. So creating partitions produces a hierarchy of partitions that subdivide the System's original budget of 100%.
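For illustration, carving a child partition out of the caller's partition might look like this (the command exists in the APS API; the sched_aps_create_parms field names are assumptions to verify against <sys/sched_aps.h>):

    #include <sys/neutrino.h>
    #include <sys/sched_aps.h>

    /* Sketch: create a child partition with a 20% budget; the 20% is taken
     * from the budget of the creating (parent) partition. */
    int create_child_partition(const char *name, int *id_out)
    {
        sched_aps_create_parms create;     /* struct/field names assumed */

        APS_INIT_DATA(&create);
        create.name = (char *)name;
        create.budget_percent = 20;
        create.critical_budget_ms = 0;
        if (SchedCtl(SCHED_APS_CREATE_PARTITION, &create, sizeof(create)) == -1)
            return -1;
        *id_out = create.id;               /* id of the new partition (output field) */
        return 0;
    }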
For any change to occur, the scheduler partition security would have to be:
Note that a thread in a partition can't increase its partition's budget to more than the budget of its parent partition.
You can:
As the root user, unlock the scheduler partition configuration and turn off the scheduler partition security.
In order to do either of these, you must be the root user, unlock the scheduler partition configuration, and turn off the scheduler partition security.
The following ideas look promising, but:
See the next answer.
Each thread_entry (the control block that Neutrino maintains for each thread) has a pointer to its containing partition. Joining a thread means only changing this pointer. The act of joining is very fast. Most of the time is spent in entering the kernel in order to swap the pointer.
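A sketch of joining the calling thread to another partition (SCHED_APS_JOIN_PARTITION is the APS command; the sched_aps_join_parms fields used here are assumptions to verify against <sys/sched_aps.h>):

    #include <sys/neutrino.h>
    #include <sys/sched_aps.h>

    /* Sketch: move the calling thread into the given partition.  Internally,
     * this amounts to changing the thread_entry's partition pointer. */
    int join_partition(int partition_id)
    {
        sched_aps_join_parms join;         /* struct/field names assumed */

        APS_INIT_DATA(&join);
        join.id = partition_id;
        join.pid = 0;                      /* 0,0 is assumed to mean the calling thread */
        join.tid = 0;
        return SchedCtl(SCHED_APS_JOIN_PARTITION, &join, sizeof(join));
    }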
It's safer and much more efficient not to delete a partition. A suggested alternative is to set the partition's budget to zero.
To delete a partition, Neutrino would have to locate all threads (or assert that there are none) in a partition and move them to some other partition.
Threads are mapped to their partitions with a single pointer. There is no back pointer; a many-to-one mapping would require a linked list chaining together all the threads in a partition.
That would also require additional kernel memory for a two-way queue through all thread_entries. Neutrino would then have to do two-way queue extractions every time it inherited partitions (e.g. on message sending), while guarding against the simultaneous destruction of other threads.
See the next answer.
The adaptive partitioning thread scheduler is part of the kernel.
It is shipped as a library module (libmod) that is built into the image along with procnto. procnto also contains the code for the classic Neutrino scheduler, which is used when the thread scheduler module is not present. When the thread scheduler module is present, procnto initializes the thread scheduler instead of the classic scheduler. The thread scheduler then directs a set of function pointers, one for each primitive scheduling operation (such as ready(), block(), and so on), to its own functions. It then creates the system partition, which it returns to procnto.
Yes. The thread scheduler calls InterruptDisable() for slightly longer than the time required to call ClockCycles() each time it must microbill. However, it doesn't inhibit interrupts to get mutual exclusion between the clock interrupt handler, the scheduling algorithm, getting partition statistics, or changing budgets.
Other than the cost of the SchedCtl() kernel call, the answer is no.
Getting statistics doesn't inhibit interrupts, or delay window rotations or the scheduling algorithm (on other SMP processors). Consistent retrieval of statistics is accomplished by detecting collisions and having the API withdraw and retry. Note that the call to SchedCtl(SCHED_APS_PARTITION_STATS,...) fails with EINTR only in the unlikely case of three consecutive collisions. In general, this can occur only if the user has set the clock period to such a short value that it's likely unsafe for the rest of the system.
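For example, a hedged sketch of fetching statistics with a retry on collision (the command name appears above; the sched_aps_partition_stats layout, and passing the partition id in the structure, are assumptions to verify against <sys/sched_aps.h>):

    #include <errno.h>
    #include <sys/neutrino.h>
    #include <sys/sched_aps.h>

    /* Sketch: read one partition's statistics, retrying if SchedCtl()
     * reports a collision with a concurrent update (EINTR). */
    int get_partition_stats(int partition_id, sched_aps_partition_stats *stats)
    {
        int rc;

        do {
            APS_INIT_DATA(stats);
            stats->id = partition_id;      /* field name assumed */
            rc = SchedCtl(SCHED_APS_PARTITION_STATS, stats, sizeof(*stats));
        } while (rc == -1 && errno == EINTR);

        return rc;
    }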