Developers Guide

This page documents the architecture and code of the SAFE project. It is intended for users who want to develop for SAFE or adapt SAFE to their own needs.

Architecture Overview

Figure: SAFE high-level architecture.

SAFE has three major components: the launcher daemon, the execution manager, and the simulation client. The launcher daemon is responsible for bootstrapping both the clients and the execution manager. This process starts when the daemon receives a tarball from the client through SFTP. It automatically untars this file, reads the configuration inside it, and starts the clients over SSH (using the Fabric library) and the execution manager by creating a subprocess.

The execution manager is responsible for generating all design points. The experimental design is expressed through XML models, which can be written by the user or, eventually, generated through a graphical user interface. The design points are then distributed across all connected simulation clients, which set up and run the desired simulation. This architecture allows an arbitrary number of processors to contribute in parallel to the simulation of all configured design points.

The simulator itself communicates statistical samples to its simulation client. The simulation client is then responsible for computing the transient, or initialization bias. Once the simulation has reached steady state, as determined by the simulation client, further statistics are forwarded to the execution manager. The execution manager can then aggregate all statistical samples for each design point across any number of parallel simulation instances. Once the desired confidence level has been reached for the desired metrics, the execution manager contacts each of the simulation clients working on the design point and has them terminate the simulation run gracefully. Once a simulation client has successfully terminated the simulator, it requests another design point from the execution manager. This process repeats until all design points have been computed to within the desired confidence interval. This architecture, often referred to as Multiple Replications in Parallel (MRIP), comes from the Akaroa project.

Additionally, a web-based management system is in place to guide the novice user through the configuration and execution of an experiment using SAFE. In the future, this tool will help the user develop the XML models through an easy-to-use graphical interface. Such XML models will then be generated automatically and forwarded directly to the execution manager. The web manager provides an interface for executing an experiment with existing XML models, as well as a graphical view of the configured experiment and the computed results.

Execution Manager

The execution manager coordinates all of the components in the automation framework. It has three primary responsibilities: computing the design points from the XML models, dispatching and managing connected simulation clients, and managing the database.

Design Point Generation

Given the experiment description and the restriction inputs, the execution manager can compute all of the design points. Design point generation is based on a recursive backtracking algorithm with constraint propagation. The algorithm has been unrolled into an iterative form, which allows the execution manager to generate design points one by one as requests for new simulations arrive.
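
The idea of unrolling the backtracking search so that design points are produced on demand can be sketched with a Python generator. This is a minimal illustration, not SAFE's implementation: the function name design_points, the dictionary-based factor description, and the restriction callables are all assumptions, and simple early pruning of partial assignments stands in for full constraint propagation.

```python
from typing import Callable, Dict, Iterator, List, Sequence

Assignment = Dict[str, object]
# Hypothetical restriction type: returns False as soon as a (possibly
# partial) assignment violates the restriction.
Restriction = Callable[[Assignment], bool]


def design_points(factors: Dict[str, Sequence[object]],
                  restrictions: List[Restriction]) -> Iterator[Assignment]:
    """Lazily yield every design point that satisfies all restrictions."""
    names = list(factors)
    if not names:
        return
    # indices[d] is the next level to try for the factor at depth d.
    # This explicit stack replaces the recursion of classic backtracking.
    indices = [0]
    partial: Assignment = {}
    while indices:
        depth = len(indices) - 1
        name = names[depth]
        if indices[depth] >= len(factors[name]):
            # All levels tried at this depth: backtrack one factor.
            indices.pop()
            partial.pop(name, None)
            continue
        partial[name] = factors[name][indices[depth]]
        indices[depth] += 1
        # Prune early: skip this branch if any restriction already fails.
        if not all(ok(partial) for ok in restrictions):
            continue
        if depth == len(names) - 1:
            yield dict(partial)   # complete design point
        else:
            indices.append(0)     # descend to the next factor
```

Because the function is a generator, a caller can ask for one design point per incoming request with next(), and no work is done for points that have not yet been requested.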

This approach has several advantages. First, the execution manager can begin serving requests immediately after it has been spawned; there is no need to wait for all of the design points to be computed. Second, because SAFE is built on the asynchronous library Twisted, the code that computes design points blocks while it runs, halting any other services running on the same reactor. If all design points were computed up front and the execution manager were integrated with the web manager, the web interface would become unresponsive, as would any other execution managers running on the same reactor. Generating design points one at a time keeps each blocking step short.

Manage Simulation Clients

The execution manager handles all of the connected simulation clients. This includes several responsibilities.

Messages

Messages are exchanged between the clients and the execution manager (EEM, the experiment execution manager) to keep their states synchronized. The full list of messages is shown below, and more details are given in the following sections.

  • REGISTER: The client sends this to the EEM to register itself.
  • NEXT_REQUEST: The client sends this to the EEM to ask for a new design point.
  • NEXT_REPLY: The EEM replies to NEXT_REQUEST with the design point information.
  • FINISHED: The EEM tells clients to finish the current simulation.
  • TERMINATE: The EEM tells clients to terminate themselves.

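
The message set above could be represented as follows. This is only a sketch: the five message names come from the list, but the MessageType enum, the encode and decode helpers, and the newline-delimited JSON framing are assumptions made for illustration, since the wire format is not described here.

```python
import json
from enum import Enum
from typing import Tuple


class MessageType(Enum):
    REGISTER = "REGISTER"          # client -> EEM
    NEXT_REQUEST = "NEXT_REQUEST"  # client -> EEM
    NEXT_REPLY = "NEXT_REPLY"      # EEM -> client
    FINISHED = "FINISHED"          # EEM -> client
    TERMINATE = "TERMINATE"        # EEM -> client


def encode(msg_type: MessageType, payload: dict) -> bytes:
    """Serialize one message as a newline-terminated JSON object.

    The framing is an assumption of this sketch, not SAFE's protocol.
    """
    body = json.dumps({"type": msg_type.value, "payload": payload})
    return (body + "\n").encode("utf-8")


def decode(line: bytes) -> Tuple[MessageType, dict]:
    """Parse one framed line back into (MessageType, payload)."""
    obj = json.loads(line.decode("utf-8"))
    return MessageType(obj["type"]), obj["payload"]
```

An unknown message name raises ValueError in decode, which gives the receiver a single place to reject malformed traffic.
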
Client Registration

The first thing a simulation client should do when it connects is send a REGISTER message. This registration message should include information about the client itself, such as the simulator version or configuration, the kernel version, and the architecture (x86, x86_64, etc.). This information is stored in the database and associated with all simulations and results reported by that simulation client. This is useful for ensuring that all clients have compatible configurations, so that results are not biased by differing software versions. Furthermore, associating this information with subsequent results allows the framework to remove any results reported by a given client and then re-collect them from a more trustworthy host.
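
A client might gather the registration details as in the sketch below. The function name registration_info and the dictionary keys are hypothetical; only the kinds of information (simulator version, kernel version, architecture) come from the text. The standard platform module supplies the host details.

```python
import platform


def registration_info(simulator_version: str) -> dict:
    """Collect host details to send along with a REGISTER message.

    The key names are assumptions made for this sketch.
    """
    return {
        "simulator_version": simulator_version,
        "kernel_version": platform.release(),  # e.g. the OS kernel release
        "architecture": platform.machine(),    # e.g. "x86_64"
    }
```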

Simulation Dispatching

The execution manager is responsible for dispatching ready design points to the simulation clients. Each simulation client is responsible for requesting a simulation when it is ready, which it does by sending a NEXT_REQUEST message. This signals the execution manager to query the database for the next ready design point. The configuration for that design point is encapsulated in a NEXT_REPLY message and sent back to the simulation client, which processes the XML and runs the configured simulation.

Most often, the execution manager will have many ready design points to choose from when deciding which to dispatch. There are several methods which can be employed for this purpose:

  • One design point at a time: Keep all active simulation clients working on the same design point. This is how Akaroa works, and it is ideal when there is one design point, or relatively few design points, to compute. The drawback of this approach is that every simulation client must simulate the transient, which is later discarded.
  • Dispatch to N hosts: Send a design point to at most N hosts. This is how SWAN-Tools works, with N = 1. This worked well in SWAN-Tools partly because it lacked features such as automatic run length detection and dynamic transient calculation. When choosing a value for N, it is important to consider that every simulation client must spend time computing the transient before any useful work can be done. For larger N, more time is wasted in transient computation.
  • Round Robin: Dispatch design points in a round-robin fashion. This is the method being implemented in SAFE. It helps to minimize the number of times each design point is dispatched, and therefore the amount of time spent computing the transient.
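
The round-robin policy can be sketched with a rotating queue. The class name RoundRobinDispatcher and its methods are hypothetical, and the sketch keeps the ready design points in memory, whereas the text describes the execution manager querying the database for them.

```python
from collections import deque
from typing import Deque, Hashable, Iterable, Optional


class RoundRobinDispatcher:
    """Hand out ready design points in rotation (illustrative only)."""

    def __init__(self, design_points: Iterable[Hashable]) -> None:
        self._ready: Deque[Hashable] = deque(design_points)

    def next_design_point(self) -> Optional[Hashable]:
        """Return the next ready design point, or None when none remain."""
        if not self._ready:
            return None
        point = self._ready[0]
        self._ready.rotate(-1)  # move the dispatched point to the back
        return point

    def mark_complete(self, point: Hashable) -> None:
        """Stop dispatching a point once it has reached its target."""
        try:
            self._ready.remove(point)
        except ValueError:
            pass  # already removed, nothing to do
```

Each NEXT_REQUEST would map to one call of next_design_point, so a design point is only handed to a second client after every other ready point has been handed out once.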

The experiment execution manager (EEM) consists of one server and multiple clients. The EEM server creates EEM client/simulator pairs on each remote processor that is configured to work with the automation system. To do this, the EEM server establishes an SSH connection to each remote processor, where it starts an EEM client and a corresponding instance of ns-3 for each individual core. On each processor, a single Twisted daemon manages the collection of EEM client/simulator pairs for all cores. The EEM server communicates with each EEM client over a TCP socket, whereas each EEM client communicates with its instance of ns-3 over a Unix pipe (the most reasonable IPC option, since both live on the same processor or core). The architecture of the EEM is represented in the figure below.

Run Length Determination

Periodically, as results are reported, the execution manager checks whether enough samples have been collected for a design point to reach the desired confidence level on the user-specified metrics. When the design point is calculated to be complete, it is marked as such in the database, and a FINISHED message is sent to all simulation clients contributing to that design point. The simulation client should handle this message by terminating the simulator and requesting another design point to execute. Finally, when the execution manager determines that all design points have been completed, it sends a TERMINATE message to all connected clients, signaling each client to terminate its active simulation and then itself.
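
One common way to phrase such a stopping check is to compare the half-width of a confidence interval for the mean against a relative precision target. The sketch below takes that approach as an assumption; the exact rule SAFE applies is not specified here. The function name is_complete and its parameters are hypothetical, and a normal approximation is used for the critical value, which is reasonable only for moderately large sample counts.

```python
import math
from statistics import NormalDist, mean, stdev
from typing import Sequence


def is_complete(samples: Sequence[float],
                confidence: float = 0.95,
                relative_precision: float = 0.05,
                min_samples: int = 30) -> bool:
    """Return True when the CI half-width is small relative to the mean."""
    if len(samples) < min_samples:
        return False
    m = mean(samples)
    # Two-sided critical value from the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = z * stdev(samples) / math.sqrt(len(samples))
    if m == 0:
        return half_width == 0
    return half_width <= relative_precision * abs(m)
```

The execution manager would run a check of this kind per metric on the samples pooled from all clients working on the design point, and send FINISHED once every metric passes.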

Simulation Client

The simulation client provides the link between the simulator itself and the execution manager. The simulation client is responsible for configuring and setting up the simulator based on the configuration given by the execution manager. It is also responsible for transient detection, since the results sent to the execution manager are assumed to come from after the transient has passed.

Simulator Configuration

The ns-3 simulator can be configured using an XML-based language for attributes. If the simulation experiment only varies attributes of a model, the XML file sent from the execution manager can be read directly into the ns-3 simulator. If the experiment modifies the simulation model itself (for example, by changing the number of nodes in the simulation), then the simulation script must be generated dynamically based on the XML passed from the execution manager.

Simulator IPC

A one-way communication link from the simulator to the client must be established so that statistical samples can be transmitted during simulation execution. This is done by writing statistical samples to a pipe during execution. Most commonly this strategy is employed with the STDOUT pipe, but SAFE has been built to create and listen on an alternative pipe. The simulation itself must open this pipe using fdopen(3). File descriptor 3 is used by default, which keeps statistical samples separate from any logging or debugging output that may otherwise be printed to STDOUT or STDERR.
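
The pattern of a dedicated sample pipe can be sketched in Python as follows. The ns-3 simulator itself is C++, so this only illustrates the idea. The helper names and the "metric value" line format are assumptions; the default descriptor number 3 is taken from the text.

```python
import os
from typing import Iterator, TextIO, Tuple

SAMPLE_FD = 3  # default descriptor named in the text


def open_sample_stream(fd: int = SAMPLE_FD) -> TextIO:
    """Simulator side: wrap an inherited descriptor in a text stream.

    Line buffering pushes each sample to the client as soon as it is written.
    """
    return os.fdopen(fd, "w", buffering=1)


def write_sample(stream: TextIO, metric: str, value: float) -> None:
    """Emit one sample as 'metric value' on its own line (assumed format)."""
    stream.write(f"{metric} {value!r}\n")


def read_samples(stream: TextIO) -> Iterator[Tuple[str, float]]:
    """Client side: parse samples until the simulator closes the pipe."""
    for line in stream:
        metric, value = line.split()
        yield metric, float(value)
```

Anything the simulator prints to STDOUT or STDERR never reaches read_samples, so log output cannot be mistaken for a sample.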

Transient Detection

Transient detection is done in an external process. Samples are received on STDIN, and the end of the transient is signaled on STDERR.
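
The shape of such an external process can be sketched as below. Only the STDIN and STDERR roles come from the text. The detection rule (declare the transient over when the means of two adjacent windows agree within a tolerance) is a deliberately simple placeholder chosen for this sketch, not the method SAFE uses, and the TRANSIENT_END marker is a hypothetical signal format.

```python
import sys
from collections import deque
from statistics import mean
from typing import Iterable, Optional, TextIO


def transient_end(samples: Iterable[float], window: int = 50,
                  tolerance: float = 0.02) -> Optional[int]:
    """Return how many samples were consumed when two adjacent windows
    of `window` samples have matching means, or None if that never happens.
    """
    recent = deque(maxlen=2 * window)
    for count, value in enumerate(samples, start=1):
        recent.append(value)
        if len(recent) < 2 * window:
            continue
        values = list(recent)
        older, newer = mean(values[:window]), mean(values[window:])
        scale = max(abs(older), abs(newer), 1e-12)  # avoid dividing by zero
        if abs(newer - older) <= tolerance * scale:
            return count
    return None


def run(stdin: TextIO = sys.stdin, stderr: TextIO = sys.stderr) -> None:
    """Read one sample per line from STDIN; report the result on STDERR."""
    end = transient_end(float(line) for line in stdin if line.strip())
    if end is not None:
        # Hypothetical marker; the real signal format is not documented here.
        print(f"TRANSIENT_END {end}", file=stderr, flush=True)
```

Calling run() with its defaults turns this into the external filter described above, with the simulation client writing samples to the process's STDIN and watching its STDERR.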

Results Handling

Once an experiment has completed, the results of interest are stored in the database. Facilities to access and process the data are important for making use of the results of our computations. Because the database schema has been built to handle different types of experiments with different numbers of factors and levels, extra care has to be taken to ensure that results are queried properly.

When querying for design points, which we can later associate with results, we want to find design points with given factor levels. It is easy to search for all design points where some factor takes on some level, but given a tuple of factors and levels that defines a design point, it is more difficult to find the results efficiently. Doing so would require either searching through the XML configuration or joining a table in which factors and levels are stored, and that table would have to be joined once for each element in the tuple. Instead, a unique identifier, separate from the primary key, is associated with each design point. This identifier can be decoded to recover all of the factors and levels, and it can be computed from a tuple of factors and levels. More formally, there exists a bijective function between the tuples of the design points and the unique identifiers. This allows for easier querying of specific design points.
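
One way to obtain such a bijection is a mixed-radix encoding, in which each factor contributes one "digit" whose base is its number of levels. The sketch below assumes that scheme for illustration; the actual identifier format used by SAFE is not described here, and the function names and the ordered list of (factor, levels) pairs are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Ordered list of (factor name, levels); the order must stay fixed
# for identifiers to remain meaningful.
Factors = List[Tuple[str, Sequence[object]]]


def encode_design_point(factors: Factors, point: Dict[str, object]) -> int:
    """Map a design point to a unique integer using mixed-radix digits."""
    ident = 0
    for name, levels in factors:
        ident = ident * len(levels) + levels.index(point[name])
    return ident


def decode_design_point(factors: Factors, ident: int) -> Dict[str, object]:
    """Invert encode_design_point: recover each factor's level."""
    point = {}
    for name, levels in reversed(factors):
        ident, index = divmod(ident, len(levels))
        point[name] = levels[index]
    # Return the factors in their original order.
    return {name: point[name] for name, _ in factors}
```

With this scheme, looking up a specific design point becomes a single equality test on the identifier column rather than one join per factor.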

Code Documentation

Data Models

Boolean Expressions

Execution Manager

Simulation Client