Telecon 03-29-07 - OSR


(Meeting notes at bottom)

Agenda:

   1) Brief round table and status updates
   2) Today's Topic: Job Load


Trammell's job load proposal:

On Thu, Mar 22, 2007 at 03:06:42PM -0600, Kevin Pedretti wrote:
> In this scheme, more goes on in the kernel.  The idea, while not
> explicitly said in the document, is that the load program is a sort of
> script that the LWK interprets.  It does essentially what crt0 is doing
> now.  Another difference is that the fan-out broadcast tree is
> configured at the LWK level.

Here is my proposal for a two-phase job load protocol, in which the
first stage is a very simple init code that does the rest of the job
load.  You could think of this piece as sort of like the PCT, in that
it does all of the setup for the user application, but it also handles
ELF parsing, dynamic linking, and heterogeneous load, which removes this
complexity from the kernel.  I think we're probably about on the same
page, with only slight disagreements over what is in-kernel versus
in-user space.


1. Service node does private mmap of user init code from disk.

2. Service node adds nid map and parameters to image at end of image
(in what will effectively become the bottom of its BSS).

3. Service node adds filename of actual user code / argc / argv
into segment as well.

4. Service node PUTS job load message to rank 0.
        - Service node puts user ID/job ID in header
        - Message body is just the init code

5. Rank 0 kernel sends a NACK if there is a process already running,
otherwise it creates process with user ID of source, creates init code
linear segment, sends ACK reply and jumps into run_user().

6. Rank 0 init user code GETS user executable from service node
via fopen(), perhaps using the user space filesystems.
This is where heterogeneous load would occur as well (nid map
might specify different images for some nodes).

7. Rank 0 computes fan-out tree based on nid map, taking into
account any heterogeneous load.

8. Rank 0 PUTS job load message to its children (with init image)

9. Rank N sets up init user code.

10. Rank N init user code GETS executable image from parent

11. User code fan-in of status: each sub-tree parent collects ACKs or
NACKs from its children and forwards them to its parent.

12. Rank 0 user code PUTS status to service node

13. Service node PUTS go message to rank 0 user code

14. Rank 0 user code fan-out of go message

15. All ranks jump into main()
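
Steps 7 and 11 leave the shape of the tree open.  A minimal sketch,
assuming a k-ary tree laid out over rank indices (the nid for each rank
would be looked up in the nid map) and ignoring heterogeneous load.  The
names FANOUT, tree_parent(), tree_children() and merge_status() are
hypothetical and not part of the proposal:

```c
#include <stddef.h>

#define FANOUT 2   /* assumed tree arity; the proposal does not specify one */

enum load_status { LOAD_ACK = 0, LOAD_NACK = 1 };

/* Parent of `rank` in a k-ary tree rooted at rank 0; -1 for the root. */
static int tree_parent(int rank)
{
    return rank == 0 ? -1 : (rank - 1) / FANOUT;
}

/* Fill `out` with the child ranks of `rank` in a job of `nranks` ranks.
 * Returns the number of children written (0..FANOUT). */
static int tree_children(int rank, int nranks, int out[FANOUT])
{
    int n = 0;
    for (int i = 1; i <= FANOUT; i++) {
        long child = (long)rank * FANOUT + i;
        if (child < nranks)
            out[n++] = (int)child;
    }
    return n;
}

/* Fan-in of status (step 11): a sub-tree reports ACK only if this rank
 * and every child sub-tree reported ACK. */
static enum load_status merge_status(enum load_status self,
                                     const enum load_status *children,
                                     size_t nchildren)
{
    for (size_t i = 0; i < nchildren; i++)
        if (children[i] == LOAD_NACK)
            return LOAD_NACK;
    return self;
}
```

With FANOUT set to 2 and seven ranks, rank 0 sends to ranks 1 and 2,
rank 1 to ranks 3 and 4, and so on.  Collapsing the sub-tree status to a
single ACK/NACK is one possible reading of step 11; forwarding each
child's status individually would also fit the text.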



Since this entire protocol is built at the user level, other than the
initial process creation, it is possible to write a user-level one-phase
protocol if the user executable has the init image linked in:

1. Service node does private mmap.

2. Service node adds nid map to image at end of image.

3. Service node fills in argc/argv at end of image.

4. Service node PUTS job load message to rank 0 with entire
executable.

5. Rank 0 creates process and jumps into it.

6. Rank 0 PUTS job load message with full executable to its children in
fan out tree.

7. Rank N sets up user code and fans out to children.

8. Rank N PUTS status to parent

9. Rank 0 PUTS status to service node

10. Service node PUTS go message to rank 0

11. Rank 0 fans out go message

12. All ranks jump from init image into main().
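
In both variants the service node appends the nid map and argc/argv to
the end of the image before sending it.  The proposal does not define a
layout, so the following is only a sketch under assumed conventions: a
small header, then the nid map as 32-bit entries, then the argv strings
back to back.  struct load_trailer and append_trailer() are hypothetical
names:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical trailer header; not defined by the proposal. */
struct load_trailer {
    uint32_t nranks;   /* number of entries in the nid map */
    uint32_t argc;     /* number of NUL-terminated argv strings */
};

/* Append header, nid map and argv strings at image[image_len].
 * Returns the new total length, or 0 if `cap` bytes is too small. */
static size_t append_trailer(uint8_t *image, size_t image_len, size_t cap,
                             const uint32_t *nids, uint32_t nranks,
                             int argc, char *const argv[])
{
    size_t map_bytes = (size_t)nranks * sizeof(uint32_t);
    size_t need = sizeof(struct load_trailer) + map_bytes;

    for (int i = 0; i < argc; i++)
        need += strlen(argv[i]) + 1;
    if (image_len > cap || need > cap - image_len)
        return 0;

    struct load_trailer hdr = { nranks, (uint32_t)argc };
    uint8_t *p = image + image_len;

    /* memcpy avoids alignment assumptions about where the image ends. */
    memcpy(p, &hdr, sizeof hdr);
    p += sizeof hdr;
    memcpy(p, nids, map_bytes);
    p += map_bytes;
    for (int i = 0; i < argc; i++) {
        size_t n = strlen(argv[i]) + 1;
        memcpy(p, argv[i], n);
        p += n;
    }
    return (size_t)(p - image);
}
```

The init code on the compute node would read the trailer back the same
way, starting from the known end of the original image.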


Only certain messages are destined for or interpreted by kernel:

- Job load (ACK/NACK response from kernel)
- App status (stack dump of all CPU contexts)
- App kill (affirmative response from kernel)

The kernel does not do any store-and-forward of messages or fan-out of
its own.  This means that the app kill message must be sent 1-to-n from
the service node.
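
The kernel-visible message set above is small enough to sketch as a
single dispatch routine.  This is an illustration only: the header
layout, the reply codes and the names kernel_handle() and
service_kill_all() are assumptions, and the node_state array merely
stands in for the remote kernels that the service node would message:

```c
#include <stdbool.h>
#include <stdint.h>

/* The three kernel-interpreted message types; values are arbitrary. */
enum kmsg_type  { KMSG_JOB_LOAD, KMSG_APP_STATUS, KMSG_APP_KILL };
enum kmsg_reply { KREPLY_ACK, KREPLY_NACK, KREPLY_STACK_DUMP,
                  KREPLY_KILLED, KREPLY_NOT_KERNEL };

struct kmsg_hdr {          /* hypothetical header layout */
    enum kmsg_type type;
    uint32_t uid;          /* user ID placed in the header by the service node */
    uint32_t job_id;
};

struct node_state {        /* per-node kernel state, reduced to the minimum */
    bool running;
    uint32_t uid;
    uint32_t job_id;
};

static enum kmsg_reply kernel_handle(struct node_state *node,
                                     const struct kmsg_hdr *m)
{
    switch (m->type) {
    case KMSG_JOB_LOAD:
        if (node->running)
            return KREPLY_NACK;          /* a process is already running */
        node->running = true;            /* create process with source uid */
        node->uid = m->uid;
        node->job_id = m->job_id;
        return KREPLY_ACK;
    case KMSG_APP_STATUS:
        return KREPLY_STACK_DUMP;        /* stack dump of all CPU contexts */
    case KMSG_APP_KILL:
        node->running = false;
        return KREPLY_KILLED;            /* affirmative response */
    }
    return KREPLY_NOT_KERNEL;            /* everything else is user level */
}

/* Service-node side: with no kernel fan-out, the kill message goes to
 * each node directly (1-to-n).  Returns the number of affirmative replies. */
static int service_kill_all(struct node_state *nodes, int n)
{
    int acked = 0;
    for (int i = 0; i < n; i++) {
        struct kmsg_hdr kill = { KMSG_APP_KILL, nodes[i].uid, nodes[i].job_id };
        if (kernel_handle(&nodes[i], &kill) == KREPLY_KILLED)
            acked++;
    }
    return acked;
}
```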


Meeting Notes

1) Discussed recent interrupt changes.

   - lapic_eoi() should be moved into the interrupt wrapper and removed from the end of each ISR.
   - Probably want to write a C interrupt wrapper, similar to Linux do_IRQ().

2) Discussed pthreads hang on nway1:

   - Decided we probably need an IPI to flush remote CPUs' TLBs.

3) Discussed plans for serial/keyboard interrupt handlers

   - The handlers are currently pretty chummy with the architecture, passing in irq_frame as a sometimes-hidden arg.  Probably need to come up with an alternative method.

4) Discussed today's topic, Job Load

   - Trammell pushing to keep everything at user level
   - Spent most of the time discussing whether having everything at user-level is feasible.
       - Job exit is potentially problematic.  How does the kernel know when to notify the loader?
           - May need to track thread groups in kernel for pthreads exit semantics
           - Alternative may be user-level exit handlers, similar to the proposed page fault handlers
       - Potential for N-1 exit messages, not scalable
           - One solution is a kernel level fan-in of exit messages.
           - No other obvious solutions.

5) Briefly discussed device driver initialization

   - Device drivers could register a system call handler.
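
The registration idea in item 5 was not worked out in the meeting.  One
possible shape, purely as a sketch: a fixed-size table of handlers that
drivers fill in from their init routines.  The table size, error codes,
and all names below (register_syscall(), do_syscall(), console_init())
are hypothetical:

```c
#include <stddef.h>

#define NR_SYSCALLS 64        /* assumed table size */
#define ERR_NOSYS   (-38)     /* hypothetical: no handler registered */
#define ERR_INVAL   (-22)     /* hypothetical: bad number or handler */
#define ERR_BUSY    (-16)     /* hypothetical: slot already taken */

typedef long (*syscall_fn)(long a0, long a1, long a2);

static syscall_fn syscall_table[NR_SYSCALLS];

/* Called by a driver during initialization to claim a system call number. */
static int register_syscall(unsigned nr, syscall_fn fn)
{
    if (nr >= NR_SYSCALLS || fn == NULL)
        return ERR_INVAL;
    if (syscall_table[nr] != NULL)
        return ERR_BUSY;
    syscall_table[nr] = fn;
    return 0;
}

/* Called from the system call entry path to dispatch to the handler. */
static long do_syscall(unsigned nr, long a0, long a1, long a2)
{
    if (nr >= NR_SYSCALLS || syscall_table[nr] == NULL)
        return ERR_NOSYS;
    return syscall_table[nr](a0, a1, a2);
}

/* Example driver: pretends to write `len` bytes and reports the count. */
static long console_sys_write(long fd, long buf, long len)
{
    (void)fd;
    (void)buf;
    return len;
}

static int console_init(void)
{
    return register_syscall(1, console_sys_write);
}
```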