Developing modules¶
General implementation approach¶
ExaDiS modules are typically implemented as classes which follow a generic template, e.g.:
class MyExaDisModule : public BaseClass {
public:
MyExaDisModule(System* system) : BaseClass(system) {
// Initialize
}
void function(System* system, ...) {
// Do some stuff
}
...
};
where System
is the main class that contains all information about the simulated dislocation system, including the parameters, the crystal instance, and the dislocation network object, see System class.
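For instance, inside a module method, the relevant data can be pulled directly from the System object. A minimal sketch, using only members that appear later on this page:
void function(System* system) {
    // Simulation parameters, e.g. the Burgers vector magnitude (in m)
    double burgmag = system->params.burgmag;
    // Request an instance of the dislocation network (see next section)
    SerialDisNet* net = system->get_serial_network();
    // Do some stuff
}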
Accessing the dislocation network¶
The System
object provides the get_serial_network()
and get_device_network()
methods to request instances of the dislocation network for use in different execution spaces. These methods are implemented in the DisNetManager class.
Method get_serial_network()
returns a pointer to a SerialDisNet instance of the network to be used for execution in the host execution space and whose data is allocated on the host memory space.
// Request a `SerialDisNet` instance of the dislocation network
SerialDisNet* net = system->get_serial_network();
// Perform operations in the serial execution space
...
Method get_device_network()
returns a pointer to a DeviceDisNet instance of the network to be used for execution in the device execution space and whose data is allocated on the device memory space or accessible from the device (e.g. via the use of unified memory).
// Request a `DeviceDisNet` instance of the dislocation network
DeviceDisNet* net = system->get_device_network();
// Perform operations in the device execution space
...
Note
The device memory/execution spaces refer to the highest spaces available for the current build. If the code is compiled for GPU (e.g. using build option -DKokkos_ENABLE_CUDA=On
), then the device execution/memory spaces refer to the GPU device spaces. If the code is compiled for CPU execution only, then the device execution/memory spaces will default to the host execution/memory spaces, and any part of the code operating on a DeviceDisNet
instance will simply be executed on the host CPU.
The DisNetManager
object keeps track of the last requested instance of the dislocation network by marking it as active. When calling the get_*_network()
methods, a memory copy is then only triggered if the requested network instance type is not currently marked as active, which eliminates unnecessary memory transfers between the host and device memory spaces.
For instance, in the following code example
void function1(System* system) {
DeviceDisNet* net = system->get_device_network();
// do some stuff
}
void function2(System* system) {
DeviceDisNet* net = system->get_device_network();
// do some stuff
}
// Main body
function1(system);
function2(system);
no memory copy is triggered in function2
because the DeviceDisNet
instance was already marked as active when executing function1
.
Conversely, in the following code example
void function1(System* system) {
SerialDisNet* net = system->get_serial_network();
// do some stuff
}
void function2(System* system) {
DeviceDisNet* net = system->get_device_network();
// do some stuff
}
// Main body
function1(system);
function2(system);
a copy of the network instance from the host space to the device space will be triggered in function2
to ensure that an up-to-date instance of the network (e.g. which may have been modified in function1
) is accessible from the device.
Example 1: computing the total dislocation density¶
As a concrete example, let’s implement a function to compute the total dislocation density in the system. Calculating the dislocation density is a good starting example because it involves looping over the segments of the network, accessing their end-nodes to compute the segment lengths, summing up the lengths, and dividing the total length by the simulation cell volume (using appropriate units).
Serial implementation¶
As a prototype, we may start by implementing this function for serial/host execution for simplicity. We could write the following function:
double compute_dislocation_density(System* system) {
// Request a `SerialDisNet` instance of the dislocation network
SerialDisNet* net = system->get_serial_network();
// Loop over the local segments, compute and sum their lengths
double density = 0.0;
for (int i = 0; i < net->Nsegs_local; i++) {
auto nodes = net->get_nodes(); // generic node accessor
auto segs = net->get_segs(); // generic segment accessor
auto cell = net->cell;
// Get segment end nodes indices
int n1 = segs[i].n1; // end-node 1 of segment i
int n2 = segs[i].n2; // end-node 2 of segment i
// Compute segment length
Vec3 r1 = nodes[n1].pos;
Vec3 r2 = cell.pbc_position(r1, nodes[n2].pos); // account for PBC
double Lseg = (r2-r1).norm();
// Increment density value by segment length
density += Lseg;
}
// Normalize by the volume and convert to 1/m^2 units
double burgmag = system->params.burgmag; // Burgers vector magnitude in m
density *= 1.0/net->cell.volume()/burgmag/burgmag; // 1/m^2 units
// Return the dislocation density value
return density;
}
Parallel implementation¶
For high performance, we may need to implement the same function for device spaces (e.g. GPU execution). To achieve this, we need to replace the serial for loop by a parallel loop and operate on a DeviceDisNet instance of the network. Here, since we are summing the segment lengths into a single variable density, we also need to be careful to avoid race conditions (several parallel threads attempting to modify the same variable at once would produce incorrect results). For this, Kokkos provides the parallel_reduce() algorithm. Our parallel implementation could look like:
double compute_dislocation_density(System* system) {
// Request a `DeviceDisNet` instance of the dislocation network
DeviceDisNet* net = system->get_device_network();
// Loop over the local segments, compute and sum their lengths
double density = 0.0;
Kokkos::parallel_reduce(net->Nsegs_local, KOKKOS_LAMBDA(const int& i, double& density_sum) {
auto nodes = net->get_nodes(); // generic node accessor
auto segs = net->get_segs(); // generic segment accessor
auto cell = net->cell;
// Get segment end nodes indices
int n1 = segs[i].n1; // end-node 1 of segment i
int n2 = segs[i].n2; // end-node 2 of segment i
// Compute segment length
Vec3 r1 = nodes[n1].pos;
Vec3 r2 = cell.pbc_position(r1, nodes[n2].pos); // account for PBC
double Lseg = (r2-r1).norm();
// Increment density value by segment length
density_sum += Lseg;
}, density);
// Normalize by the volume and convert to 1/m^2 units
double burgmag = system->params.burgmag; // Burgers vector magnitude in m
density *= 1.0/net->cell.volume()/burgmag/burgmag; // 1/m^2 units
// Return the dislocation density value
return density;
}
which only requires a few modifications of the serial version above.
Templated implementation¶
As a more advanced implementation, it may be beneficial to have our function work interchangeably with both host/device spaces and the corresponding instance types of the network. For this, we could implement the following templated version of our function:
template<class N>
double compute_dislocation_density(System* system, N* net) {
// Define execution policy based on network instance type
using policy = Kokkos::RangePolicy<typename N::ExecutionSpace>;
// Loop over the local segments, compute and sum their lengths
double density = 0.0;
Kokkos::parallel_reduce(policy(0, net->Nsegs_local), KOKKOS_LAMBDA(const int& i, double& density_sum) {
auto nodes = net->get_nodes(); // generic node accessor
auto segs = net->get_segs(); // generic segment accessor
auto cell = net->cell;
// Get segment end nodes indices
int n1 = segs[i].n1; // end-node 1 of segment i
int n2 = segs[i].n2; // end-node 2 of segment i
// Compute segment length
Vec3 r1 = nodes[n1].pos;
Vec3 r2 = cell.pbc_position(r1, nodes[n2].pos); // account for PBC
double Lseg = (r2-r1).norm();
// Increment density value by segment length
density_sum += Lseg;
}, density);
// Normalize by the volume and convert to 1/m^2 units
double burgmag = system->params.burgmag; // Burgers vector magnitude in m
density *= 1.0/net->cell.volume()/burgmag/burgmag; // 1/m^2 units
// Return the dislocation density value
return density;
}
which works equally well when called for serial/host execution:
SerialDisNet* net = system->get_serial_network();
double density = compute_dislocation_density(system, net);
or parallel/device execution:
DeviceDisNet* net = system->get_device_network();
double density = compute_dislocation_density(system, net);
Example 2: implementing a simple force module¶
As another example, let’s assume we want to create a new force module ForceConstant
that sets all nodal forces to a constant vector fval
.
Force base class¶
For force calculation modules, the base class is the Force
class defined in src/force.h
, and we can create the following new class:
class ForceConstant : public Force {
private:
// Member properties
Vec3 fval;
public:
// Force parameters
struct Params {
Vec3 fval = 0.0;
Params() {}
Params(Vec3 val) : fval(val) {}
};
// Constructor
ForceConstant(System* system, Params params) {
// Initialize
fval = params.fval;
}
// Pre-compute operation
void pre_compute(System* system) {} // nothing to pre-compute here
// Global compute
void compute(System* system, bool zero=true) {
// Set nodal forces to fval
...
}
// Individual compute
Vec3 node_force(System* system, const int& i) {
// Compute individual force at node i
Vec3 fi = ...
return fi;
}
// Team individual node force implementation (optional)
template<class N>
KOKKOS_INLINE_FUNCTION
Vec3 node_force(System* system, N* net, const int& i, const team_handle& team)
{
// Compute individual force at node i using a team of workers
Vec3 fi = ...
return fi;
}
};
The guideline in ExaDiS is for each module to possess its own Params struct that defines the dedicated parameters of the module and is provided to the constructor. After that, we need to implement the various methods associated with the Force
base class.
For a Force
module, the 3 required methods to implement are:
- pre_compute(): method to perform any pre-computation that may be required to compute nodal forces. In the traditional cycle, the pre-compute function is called once at the beginning of each simulation time step. For instance, for the ForceFFT module, the pre-computation step is used to compute and tabulate the long-range stress field on a grid. In the ForceSegSegList module, the pre-computation step is used to build the list of segment pair interactions.
- compute(): method to perform a global computation of nodal forces, i.e. to compute the forces on all nodes of the network.
- node_force(): method to perform an individual force computation, i.e. to compute the force on a given node of the network. Optionally, this function can also be implemented when using a team of workers for parallel execution on device (GPU), where the individual node force computation can itself be parallelized across several concurrent threads using hierarchical parallelization.
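To illustrate how these methods are typically invoked, a hypothetical driver snippet for our ForceConstant module could look like the sketch below (the construction and calling sequence are illustrative, not taken from an actual ExaDiS driver):
// Instantiate the module with its dedicated parameters (example values)
ForceConstant::Params fparams(Vec3(0.0, 0.0, 1.0));
ForceConstant* force = new ForceConstant(system, fparams);
// Within a simulation time step
force->pre_compute(system); // once per step (a no-op for ForceConstant)
force->compute(system);     // global computation of all nodal forces
// Vec3 fi = force->node_force(system, i); // individual force on some node i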
Serial implementation¶
In our simple example, there is no need to perform any pre-computation, so we can leave the pre_compute()
method empty. The next step is to implement the compute()
method. Recall we want to set the nodal forces to a constant value. As a first attempt/prototype, we may want to implement this method in serial on the CPU for simplicity. Here, we could do something like this:
// Global compute (serial implementation)
void compute(System* system, bool zero=true) {
// Request a `SerialDisNet` instance of the dislocation network
SerialDisNet* net = system->get_serial_network();
// Zero out the nodal forces if requested (mandatory instruction)
if (zero) zero_force(net);
// Set nodal forces to fval using a simple for loop
for (int i = 0; i < net->Nnodes_local; i++) {
auto nodes = net->get_nodes(); // generic node accessor
nodes[i].f = fval;
}
}
where we loop over the local nodes and assign fval
to their force property. We’re done, and we can check with a test case that this is working properly.
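For instance, a quick sanity check could look like the sketch below, where force stands for a hypothetical instance of our module:
// Sanity check (sketch): after calling compute(), every local node
// is expected to carry the constant force value fval
force->compute(system);
SerialDisNet* net = system->get_serial_network();
auto nodes = net->get_nodes();
Vec3 f0 = nodes[0].f; // e.g. inspect the force on the first node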
Parallel implementation¶
Now, let’s imagine that we are happy with our implementation, but want to use this module in a production run on GPU. In this case, executing the compute()
method in the serial execution space is going to be very inefficient. (This would be the case even for such a trivial force module: other force/mobility modules used in the simulation will likely be executed on GPU, so the call to system->get_serial_network() will trigger memory copies from the GPU to the CPU, while the system->get_device_network() calls in the other GPU modules will trigger the reverse memory copies. These back-and-forth transfers may significantly degrade performance.) Alternatively, we can now implement a new version of the compute()
method that will be executed on the device space. Here, the implementation is trivial because each index i
of the loop is independent and thus can be parallelized. As such, all we need to do is to replace the serial for
loop with a Kokkos::parallel_for
loop, while now operating on a DeviceDisNet
object:
// Global compute (parallel implementation)
void compute(System* system, bool zero=true) {
// Request a `DeviceDisNet` instance of the dislocation network
DeviceDisNet* net = system->get_device_network();
// Zero out the nodal forces if requested (mandatory instruction)
if (zero) zero_force(net);
// Set nodal forces to fval using a parallel_for loop
Vec3 f = fval; // set vector variable in local scope
Kokkos::parallel_for(net->Nnodes_local, KOKKOS_LAMBDA(const int i) {
auto nodes = net->get_nodes();
nodes[i].f = f;
});
Kokkos::fence();
}
Here we are using a lambda expression to define the parallel kernel using the KOKKOS_LAMBDA
macro. Note that the vector fval
is copied to a local vector f
in the method scope in order to avoid an implicit capture of a class member in the lambda function. As a slightly more advanced implementation, we could use a class operator() or a dedicated functor to define the same parallel kernel without using a lambda function, as sketched below. Note that we also need to call the Kokkos::fence() function to make sure the potentially asynchronous parallel_for kernel launch has finished executing before leaving the method.
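For reference, a minimal sketch of the dedicated functor option is shown below; the ConstantForceFunctor name is hypothetical and not part of ExaDiS:
// Hypothetical functor defining the same parallel kernel without a lambda
struct ConstantForceFunctor {
    DeviceDisNet* net;
    Vec3 f;
    ConstantForceFunctor(DeviceDisNet* _net, const Vec3& _f) : net(_net), f(_f) {}
    KOKKOS_INLINE_FUNCTION
    void operator()(const int i) const {
        auto nodes = net->get_nodes();
        nodes[i].f = f;
    }
};
// In compute(), the kernel launch then becomes
Kokkos::parallel_for(net->Nnodes_local, ConstantForceFunctor(net, fval));
Kokkos::fence();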
Finally, we can also implement the node_force()
method to compute the force on a single node. Here this is trivial and we can just have:
// Individual compute
Vec3 node_force(System* system, const int& i) {
// Compute individual force at node i
Vec3 fi = fval;
return fi;
}
and we can implement the same for the node_force()
method using team workers. The team implementation is optional but allows other modules that use node_force()
calls in parallel kernels (e.g. module TopologyParallel
) to be used with our ForceConstant
module.
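For our constant force module, the team version is trivial since no work actually needs to be shared among the team of workers. A minimal sketch could be:
// Team individual node force (sketch): each calling thread of the team
// simply returns the constant force value
template<class N>
KOKKOS_INLINE_FUNCTION
Vec3 node_force(System* system, N* net, const int& i, const team_handle& team)
{
    return fval;
}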
Example 3: implementing a segment-based force module¶
As a slightly more complex example, let’s now imagine that we want to implement a force module ForceSegConstant
for which the force on each segment is the constant fval
multiplied by the length of the segment, and that the segment force is equally distributed between the two end-nodes.
Serial implementation¶
In a serial fashion, we could write the compute()
function as:
// Global compute (serial implementation)
void compute(System* system, bool zero=true) {
// Request a `SerialDisNet` instance of the dislocation network
SerialDisNet* net = system->get_serial_network();
// Zero out the nodal forces if requested (mandatory instruction)
if (zero) zero_force(net);
// Loop over segments to aggregate forces at the nodes
for (int i = 0; i < net->Nsegs_local; i++) {
auto nodes = net->get_nodes(); // generic node accessor
auto segs = net->get_segs(); // generic segment accessor
auto cell = net->cell;
// Get segment end nodes indices
int n1 = segs[i].n1; // end-node 1 of segment i
int n2 = segs[i].n2; // end-node 2 of segment i
// Compute segment length
Vec3 r1 = nodes[n1].pos;
Vec3 r2 = cell.pbc_position(r1, nodes[n2].pos); // account for PBC
double L = (r2-r1).norm();
// Distribute segment force at nodes
nodes[n1].f += 0.5*L*fval;
nodes[n2].f += 0.5*L*fval;
}
}
Here, we loop over the local segments, compute the segment lengths from their end-node positions, and distribute each segment force equally by incrementing the force values of its end-nodes.
Parallel implementation¶
If we want to implement the same method in a parallel fashion, we need to be careful. This is because indices i
in the loop are no longer fully independent: two distinct segments i
may need to access some of the same nodes n1
or n2
. When running the above loop kernel in parallel, this may create race conditions and yield incorrect results.
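Without any abstraction, one would typically need to protect the nodal force updates with atomic operations. A sketch of what the loop body could look like is shown below, assuming here that the components of Vec3 are accessible as x, y and z:
// Sketch: distribute the segment force at the end-nodes using atomic
// updates to avoid race conditions when two segments share a node
Vec3 fn = 0.5*L*fval; // force assigned to each end-node
Kokkos::atomic_add(&nodes[n1].f.x, fn.x);
Kokkos::atomic_add(&nodes[n1].f.y, fn.y);
Kokkos::atomic_add(&nodes[n1].f.z, fn.z);
Kokkos::atomic_add(&nodes[n2].f.x, fn.x);
Kokkos::atomic_add(&nodes[n2].f.y, fn.y);
Kokkos::atomic_add(&nodes[n2].f.z, fn.z);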
To simplify the implementation, ExaDiS provides additional base classes that abstract away this type of complexity. In this particular case, base class ForceSeg
in src/force.h
provides a simple way to implement our desired parallel kernel without having to worry about the parallelism aspects. All that base class ForceSeg requires is for the kernel inside the parallel loop to be provided in the form of a segment_force() method that returns the end-node forces. To provide it, we simply need to define a struct that implements our desired segment_force()
kernel:
struct SegConstant
{
// Flags to instruct what kernels are implemented in the struct
static const bool has_pre_compute = false;
static const bool has_compute_team = false;
static const bool has_node_force = false;
Vec3 fval;
// Force parameters
struct Params {
Vec3 fval = 0.0;
Params() {}
Params(Vec3 val) : fval(val) {}
};
// Constructor
SegConstant(System* system, Params params) {
// Initialize
fval = params.fval;
}
// Segment force kernel
template<class N>
KOKKOS_INLINE_FUNCTION
SegForce segment_force(System* system, N* net, const int& i)
{
auto nodes = net->get_nodes(); // generic node accessor
auto segs = net->get_segs(); // generic segment accessor
auto cell = net->cell;
// Get segment end nodes indices
int n1 = segs[i].n1; // end-node 1 of segment i
int n2 = segs[i].n2; // end-node 2 of segment i
// Compute segment length
Vec3 r1 = nodes[n1].pos;
Vec3 r2 = cell.pbc_position(r1, nodes[n2].pos); // account for PBC
double L = (r2-r1).norm();
Vec3 f1 = 0.5*L*fval;
Vec3 f2 = 0.5*L*fval;
return SegForce(f1, f2);
}
static constexpr const char* name = "SegConstant";
};
and then declare our ForceSegConstant
force module as the base class ForceSeg
templated with our struct SegConstant
implementing the kernel:
typedef ForceSeg<SegConstant> ForceSegConstant;
In struct SegConstant, all we had to do was essentially copy the inside of our serial loop into the segment_force() method. We have also templated the network instance type N so that the same kernel can be compiled for serial or device execution spaces, using either SerialDisNet or DeviceDisNet instances of the network interchangeably. When compiled for GPU, the segment forces will be computed in a highly parallel fashion on the device (GPU) space by default, and forces at nodes will be properly aggregated while avoiding race conditions, following the machinery implemented in base class ForceSeg and here abstracted from the user. In addition, when using this approach the associated node_force() method is automatically generated as well, without us having to explicitly define it. However, if we want to implement a dedicated node_force()
method, we could also do that by setting flag has_node_force = true
and implementing method node_force()
in our SegConstant
struct, in which case its base class implementation will be overridden. Similarly, we could provide a team implementation or pre-compute methods.
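Putting it all together, the new module can then be used like any other force module. A hypothetical usage sketch, assuming that the ForceSeg base class forwards the kernel Params type to its constructor:
// Hypothetical usage sketch of the new segment-based force module
SegConstant::Params fparams(Vec3(0.0, 0.0, 1.0)); // example values
ForceSegConstant* force = new ForceSegConstant(system, fparams);
force->pre_compute(system); // no-op here (has_pre_compute = false)
force->compute(system);     // segment forces aggregated at the nodes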
In the code, base class ForceSeg
is for instance used to compute the core force in src/force_types/force_lt.h
, or implement the N^2 force model in src/force_types/force_n2.h
.