Nway API - OSR

Nway API

From OSR


NWay Lightweight Kernel API

This is an RFC for the new LWK memory and process management API. It is not in any way a complete system at this point. The overriding philosophy is "mechanism, not policy", since we do not yet have a full picture of how to handle the different CPU / Core / Thread / NUMA issues.

Execution Unit addressing

Thread unit id

A system consists of some number of CPUs, each with some number of cores, each with some number of execution units (threads). The naming scheme is a tuple of CPU.Core.Thread. Not all cores need to share the same virtual memory mappings, so a virtual address on its own does not uniquely identify a memory location; it is only meaningful together with the core on which it is used.
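The calls later in this document take a cpu_t argument, but its layout is not defined here. As a minimal sketch of how the CPU.Core.Thread tuple could be carried in a single integer, the code below packs it into 32 bits. The type eu_id_t, the field widths (16/8/8 bits) and all helper names are assumptions made for illustration, not part of the proposal.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for cpu_t: the RFC does not define its layout.
 * Assumed packing: bits 31..16 = CPU, bits 15..8 = core, bits 7..0 = thread. */
typedef uint32_t eu_id_t;

static eu_id_t eu_pack(unsigned cpu, unsigned core, unsigned thread)
{
    return ((eu_id_t)(cpu & 0xffffu) << 16)
         | ((eu_id_t)(core & 0xffu) << 8)
         |  (eu_id_t)(thread & 0xffu);
}

static unsigned eu_cpu(eu_id_t id)    { return id >> 16; }
static unsigned eu_core(eu_id_t id)   { return (id >> 8) & 0xffu; }
static unsigned eu_thread(eu_id_t id) { return id & 0xffu; }

/* Format an id in the CPU.Core.Thread notation used in this document.
 * Returns the value of snprintf (number of characters needed). */
static int eu_format(eu_id_t id, char *buf, size_t len)
{
    return snprintf(buf, len, "%u.%u.%u",
                    eu_cpu(id), eu_core(id), eu_thread(id));
}
```

With this packing the boot execution unit 0.0.0 is simply the value 0, and eu_pack(1, 2, 3) formats as "1.2.3".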

Resource discovery

How to tell the application about the state of the node?

Memory Management

The core function of the operating system is to manage blocks of memory for the applications. It does not handle any policy for how these regions are used, only the mechanism for creating them and preventing the user from overwriting the OS itself. The application can, as it chooses, allocate virtual memory in a centralized fashion from one core, or each core may manage its own regions. Again, there is no OS requirement for how the memory regions are laid out or used by the application.

Initial memory maps

The kernel reserves the top N MB of memory for its own use. Each core is capable of being in the kernel.

Borrowing the idea from the Alpha's kseg, the process is loaded into a linear memory map of the entire physical address space (minus that reserved for the kernel) with only one core active. The initial memory region is available via the global region_t *region_linear. This linear segment is shared with all of the CPUs and is read/write/execute. Core 0.0.0 is responsible for creating the rest of the memory maps on each CPU and starting the other cores. This requires that each core be able to manipulate the memory maps of the other cores.

Memory Regions

A memory region consists of a virtually contiguous region of memory with a virtual start and extent. It may be shared between all CPUs or private to a single one. If it is shared between CPUs then it must reside at the same virtual address in all CPUs, which may require the application to manage virtual regions to prevent conflicts. It may be readable, writable and executable (or some combination, depending on the underlying hardware). Physically the region may not be present, may be contiguous in one NUMA region or, alternatively and depending on hardware support, may stride across several NUMA regions.

A region whose pages are not present may have its own page fault handler installed. The handler runs in the context of the thread that caused the fault.

Creating regions

To create a region the application calls region_create:

   region_t *
   region_create(
       void * virtual_start,
       size_t extent,
       cpu_t core,
       int mode,
       void * priv
   );

virtual_start is either the virtual address to use, or a NULL pointer in which case the kernel will find a region that is available on all of the requested cores.

core is either CPU_ALL or the specific CPU that will have the region created. If the region would overlap an existing region on any of the cores, the return result will be a NULL pointer and no memory maps will be modified. The virtual region that is created has no backing store and no region-specific page fault handler installed.

The region_t structure is opaque to the application and resides in the kernel space.
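Since region_t is opaque, the overlap rule can only be illustrated with a user-space model. The sketch below shows the kind of all-or-nothing check described above: a candidate range is accepted only if it touches none of the existing ranges. The struct vrange type and both function names are hypothetical, the rejection of empty ranges is an assumption, and address wrap-around is ignored.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space model of a region: [start, start + extent).
 * Wrap-around of start + extent is not handled in this sketch. */
struct vrange {
    uintptr_t start;
    size_t    extent;
};

/* Two half-open ranges overlap iff each one starts before the other ends. */
static int vrange_overlaps(const struct vrange *a, const struct vrange *b)
{
    return a->start < b->start + b->extent
        && b->start < a->start + a->extent;
}

/* Returns 1 if `candidate` can be created without touching any existing
 * range, 0 otherwise. Nothing is modified either way, mirroring the
 * "no memory maps will be modified" behaviour on failure. */
static int vrange_can_create(const struct vrange *existing, size_t count,
                             const struct vrange *candidate)
{
    if (candidate->extent == 0)
        return 0;               /* assumption: empty regions are rejected */
    for (size_t i = 0; i < count; i++)
        if (vrange_overlaps(&existing[i], candidate))
            return 0;
    return 1;
}
```

For a region created with CPU_ALL, the same check would have to pass against the region list of every core before any map is touched.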

Splitting Regions

If we want to support POSIX calls like mprotect(), we may need the ability to split regions. I'm not 100% sure whether we want the complexity of doing so, or whether we want to just change the flags on that one page inside the region and make it inconsistent with the rest of the region.

Tearing down regions

Regions are normally created and not altered due to spooky action at a distance issues. If an application wants to remove a region, it may do so with:

   int
   region_destroy(
       region_t * region,
       cpu_t cpu
   );

This will unmap any pages that are currently in use by the region, delete the structure and likely cause page faults on other cores. Be careful!

Allocating pages

To generate the backing store, the application then calls region_allocate_pages:

   int
   region_allocate_pages(
       region_t * region,
       void * virtual_addr,
       void * physical_addr,
       size_t extent_in_pages,
       int overwrite
   );

This will build the memory map for every core that uses this region, mapping the physical pages. It is not necessary to allocate all pages in a region; page faults will be handled by the system default page fault handler or by a region-specific one.

If the virtual_addr does not reside in the region or if the extent_in_pages would put the result outside of the region or if the physical_addr is a page reserved to the kernel, an error code is returned.

There are no problems with aliasing multiple virtual addresses to the same page. Since the entire physical memory is already mapped into the linear kseg, there will frequently be such duplicate mappings. This does mean that there is no easy way to translate a physical address to a virtual one, other than to the linear region. Virtual addresses, however, may be translated to pages quite easily by the user level code.

The overwrite parameter must be set if the application wishes to overwrite an already mapped portion of the region. Otherwise an error will be returned if any pages in the requested range are already allocated.
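The argument checks described for region_allocate_pages can be sketched with a small user-space model. Everything below is hypothetical: the struct model_region layout, the error codes, the 4 KiB page size, the page-alignment requirement on the virtual address, and the kernel_base parameter standing in for the start of the kernel-reserved physical memory. The model rejects a request before changing any state, so a failed call leaves the mappings untouched.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u     /* assumption: 4 KiB pages */
#define MAX_PAGES 64u       /* toy limit for this model */

/* Hypothetical user-space model; the real region_t is opaque. */
struct model_region {
    uintptr_t     start;               /* page-aligned virtual start */
    size_t        pages;               /* extent in pages, <= MAX_PAGES */
    uintptr_t     phys[MAX_PAGES];     /* physical page backing each slot */
    unsigned char mapped[MAX_PAGES];   /* 1 if the slot has backing store */
};

/* Hypothetical error codes; the RFC only says "an error code is returned". */
enum { MODEL_OK = 0, MODEL_ERANGE = -1, MODEL_EKERNEL = -2, MODEL_EBUSY = -3 };

static int model_allocate_pages(struct model_region *r, uintptr_t vaddr,
                                uintptr_t paddr, size_t npages, int overwrite,
                                uintptr_t kernel_base)
{
    /* virtual_addr must lie inside the region (alignment is an assumption) */
    if (vaddr < r->start || (vaddr - r->start) % PAGE_SIZE != 0)
        return MODEL_ERANGE;
    size_t first = (vaddr - r->start) / PAGE_SIZE;
    if (first >= r->pages || npages > r->pages - first)
        return MODEL_ERANGE;

    /* the physical range must not reach into kernel-reserved memory */
    if (paddr >= kernel_base || npages * PAGE_SIZE > kernel_base - paddr)
        return MODEL_EKERNEL;

    /* without overwrite, any already-mapped page fails the whole call */
    if (!overwrite)
        for (size_t i = 0; i < npages; i++)
            if (r->mapped[first + i])
                return MODEL_EBUSY;

    for (size_t i = 0; i < npages; i++) {
        r->phys[first + i]   = paddr + i * PAGE_SIZE;
        r->mapped[first + i] = 1;
    }
    return MODEL_OK;
}
```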

Custom page fault handlers

   typedef
   uint64_t
   (*region_pagefault_handler_t)(
       context_t * context,
       region_t * region,
       void * priv,
       void * virtual_address,
       void * eip,
       int mode
   );

   int
   region_pagefault_handler(
       region_t * region,
       region_pagefault_handler_t handler,
       void * priv
   );

When an execution unit page faults it will trap into the kernel, which will suspend processing of that thread. The context will be saved on the stack and, if the region containing the faulting virtual address has its own handler, that handler will be called in user space as if it were a function call with normal ABI constraints.

For read-faults, the handler may return up to a quad-word of data that will be patched into the result of the read. For write-faults, the return value is ignored.

The context structure can be used to kill the process or to do other fixup routines.

If the region specific page fault handler page faults, then the system default one will be run. If that one also faults then the app will be killed.
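As an illustration of the handler contract, here is a minimal sketch of a handler that serves read faults from a backing buffer and merely counts write faults. The typedef is copied from above; the empty declarations of context_t and region_t, the FAULT_READ / FAULT_WRITE mode bits, and the struct shadow passed through priv are all assumptions, since this document does not define them.

```c
#include <stdint.h>
#include <string.h>

/* Opaque in the RFC; declared here only so the sketch compiles. */
typedef struct context context_t;
typedef struct region  region_t;

typedef uint64_t (*region_pagefault_handler_t)(
    context_t *context, region_t *region, void *priv,
    void *virtual_address, void *eip, int mode);

/* Hypothetical mode bits: the encoding of `mode` is not specified. */
#define FAULT_READ  0x1
#define FAULT_WRITE 0x2

/* Hypothetical per-region state handed to the handler through `priv`. */
struct shadow {
    uintptr_t            base;          /* virtual start of the region */
    size_t               size;          /* bytes available in `data` */
    const unsigned char *data;          /* backing bytes for read faults */
    unsigned long        write_faults;  /* number of write faults seen */
};

static uint64_t shadow_fault(context_t *context, region_t *region, void *priv,
                             void *virtual_address, void *eip, int mode)
{
    struct shadow *s = priv;
    uintptr_t off = (uintptr_t)virtual_address - s->base;
    uint64_t value = 0;

    (void)context; (void)region; (void)eip;

    if (mode & FAULT_WRITE) {
        s->write_faults++;
        return 0;                       /* return value is ignored for writes */
    }
    if (off < s->size) {
        /* return up to one quad-word, zero-padded at the end of the buffer */
        size_t n = s->size - off;
        if (n > sizeof value)
            n = sizeof value;
        memcpy(&value, s->data + off, n);
    }
    return value;                       /* patched into the faulting read */
}
```

Under the naming used above, such a handler would be installed with region_pagefault_handler(region, shadow_fault, &state), where state is a struct shadow owned by the application.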

Thread Management

There exists at most one thread per execution unit to simplify scheduling. Threads are not preemptible by other threads, nor do they migrate between cores or CPUs. After bootup all the cores other than 0.0.0 are spinning in an internal wait_for_work loop, reading a mailbox structure that resides in the global linear address space.

Starting threads

To start a new thread on another core, the application calls:

   int
   thread_create(
       cpu_t core,
       void * (*start_address)( void * priv ),
       void * stack_pointer,
       void * priv,
       void ** result
   );

The void ** result will be used to fill in any return codes when the thread exits. It must be a pointer that is valid in the destination core's address space or else the thread will page fault on exit. A NULL pointer will be interpreted to mean that no return result will be stored.
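One plausible way for thread_create to hand work to a core spinning in wait_for_work is through the mailbox mentioned above. The sketch below models a single mailbox slot with C11 atomics and runs one iteration of the polling loop at a time so that it can be exercised on one thread. The struct mailbox layout, mailbox_post, wait_for_work_once and the echo test function are all hypothetical; this document only states that a mailbox exists and that a NULL result pointer means no return value is stored.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical mailbox layout, one per execution unit. */
struct mailbox {
    atomic_int ready;                   /* 0 = empty, 1 = work posted */
    void    *(*start_address)(void *);
    void      *priv;
    void     **result;                  /* may be NULL: discard the result */
};

/* Producer side (for example 0.0.0): fill in the work, then publish it.
 * Returns 0 on success, -1 if the target already has work pending. */
static int mailbox_post(struct mailbox *m, void *(*fn)(void *),
                        void *priv, void **result)
{
    if (atomic_load_explicit(&m->ready, memory_order_acquire))
        return -1;
    m->start_address = fn;
    m->priv          = priv;
    m->result        = result;
    /* release: the fields above become visible before `ready` does */
    atomic_store_explicit(&m->ready, 1, memory_order_release);
    return 0;
}

/* Consumer side: one iteration of the spin loop on the target core.
 * Returns 1 if a thread function was run, 0 if the mailbox was empty. */
static int wait_for_work_once(struct mailbox *m)
{
    if (!atomic_load_explicit(&m->ready, memory_order_acquire))
        return 0;
    void *ret = m->start_address(m->priv);
    if (m->result != NULL)
        *m->result = ret;               /* store the thread's return code */
    atomic_store_explicit(&m->ready, 0, memory_order_release);
    return 1;
}

/* Trivial thread function used to exercise the model. */
static void *echo(void *priv)
{
    return priv;
}
```

A real implementation would also have to switch to the supplied stack pointer before calling start_address, which this model leaves out.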

Stopping threads

To stop a thread on another core:

   int
   thread_kill(
       cpu_t core
   );

There are no permission checks in this call and no state on the remote core will be unwound.

Interprocess Communication

Signals

How to implement signals? Destination of signals to thread units?

MPI

MPI support.

pthreads / clone / OpenMP compatibility

File API

Files are implemented at the user level by memory mapped regions and may be lazily demand-paged (via a user level page fault handler) from the file server or loaded all at once. Write access to the files is implemented at the block level and can either be immediate writes via a page fault handler or require explicit sync commands.

There should also be a collective way to open files and fan the data out to the other nodes to avoid the all-to-one effect of:

   #include <stdio.h>

   int main( void )
   {
       /* every node opens the same file on the file server */
       FILE * f = fopen( "/etc/timezone", "r" );
       ...
   }

Once mapped, the user level library translates read(2) and write(2) calls into accesses to the blocks. It trivially translates mmap(2) calls into a copy of the region, although the mmap(/dev/zero) trick that glibc uses to get anonymous memory really isn't necessary in the LWK.
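To make the read(2) translation concrete, here is a minimal sketch of a user level read over an already mapped file. The struct mapped_file descriptor and the mapped_read name are hypothetical; the sketch assumes the whole file is present in memory, whereas a lazily paged file would take page faults inside the memcpy and have them resolved by the region's handler.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>      /* ssize_t */

/* Hypothetical per-descriptor state kept by the user level library. */
struct mapped_file {
    const unsigned char *base;      /* start of the mapped region */
    size_t               size;      /* file length in bytes */
    size_t               offset;    /* current file position, <= size */
};

/* read(2)-like call: copy out of the mapping and advance the position.
 * Returns the number of bytes copied, 0 at end of file. */
static ssize_t mapped_read(struct mapped_file *f, void *buf, size_t count)
{
    size_t left = f->size - f->offset;

    if (count > left)
        count = left;               /* short read at end of file */
    memcpy(buf, f->base + f->offset, count);
    f->offset += count;
    return (ssize_t)count;
}
```

A matching write path would copy in the other direction and either rely on a write-fault handler or mark the touched blocks dirty for a later explicit sync.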

Shared libraries

Most calls to dlopen(3) are also translated into collective file opens and remapping to regions with proper permissions. Since the user level application is able to manipulate its own memory mappings as described above, this is a fairly simple operation.

Unfortunately, not all dlopen calls are collective, so we must also support a unitary dlopen.

Since there may be multiple NUMA domains on each node, it may be efficient for an application to make a per-thread region that has the code segment in its own domain. This wastes some amount of memory, but reduces inter-core memory bandwidth requirements. Again, since this is all done at user level it is trivial for the application to support this model. The library should have a runtime switch to control the duplication of all text segments across NUMA domains, even the main text of the application (set up, ideally, in the startup code that runs on 0.0.0).