gmxapi.mdrun violates assumptions in MPI-enabled GROMACS when Python is launched in a multi-rank context.

Summary

mpiexec -n $N python my_gmxapi_script.py with N>=2 will hang if the script includes gmxapi.mdrun and the Python package is built against an MPI-enabled GROMACS installation.

GROMACS version

All

Steps to reproduce

mpiexec -n 2 $(which python) -m mpi4py -c 'import gmxapi; gmxapi.mdrun("sometprfile.tpr").run()'
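
Because the failure mode is a hang, it can help to run the reproducer under a timeout. The following is only a minimal sketch: `hangs` and `REPRODUCER` are hypothetical names introduced here, and the timeout is an arbitrary choice that has to exceed the normal run time of the simulation.

```python
import subprocess
import sys


def hangs(command, timeout):
    """Return True if `command` is still running after `timeout` seconds."""
    try:
        subprocess.run(command, timeout=timeout, check=False)
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the direct child process before re-raising.
        return True
    return False


# The reproducer from this report. "sometprfile.tpr" is the same placeholder
# input name used in the command above.
REPRODUCER = [
    "mpiexec", "-n", "2", sys.executable, "-m", "mpi4py", "-c",
    'import gmxapi; gmxapi.mdrun("sometprfile.tpr").run()',
]
```

A call such as `hangs(REPRODUCER, timeout=600)` would then report whether the run was still going after ten minutes. Only the `mpiexec` process itself is killed on timeout, so stray ranks may still need to be cleaned up by hand.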

The problem is clearer with reference to the proof-of-concept script below.

What is the current bug behavior?

Since we still aren't explicitly passing an MPI communicator from gmxapi to libgromacs_mpi, the simulator detects COMM_WORLD and tries to use it.

However, when the script is launched with mpiexec -n <N> for N greater than or equal to 2, gmxapi still makes API calls with the expectation of running one simulation per rank.

If there is a single input, then the simulation launches and hangs, with the library waiting for communication with the non-root ranks that will never come.

With multiple inputs, gmxapi thinks it is launching one simulation per rank, but instead it is actually only running the first simulation, since libgromacs ignores inputs on the non-root ranks and broadcasts inputs during launch.

What did you expect the correct behavior to be?

In the long run, gmxapi should pass a subcommunicator to the library (#4422 (closed)).

For gmxapi 0.3.x, we should do our best to allocate resources, issue warnings or errors as appropriate, and by all means avoid hanging.
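
As a rough illustration of "warn or error rather than hang", a pre-launch check could look like the sketch below. Everything here is an assumption made for illustration: `check_mdrun_resources` is a hypothetical function (not part of gmxapi), the caller is assumed to know whether the GROMACS library was built with MPI, and the choice of raising for a single input while only warning for multiple inputs is just one possible policy.

```python
import warnings


def check_mdrun_resources(comm_size, n_inputs, library_uses_mpi):
    """Hypothetical guard to run before launching a simulation.

    comm_size: number of ranks in the communicator the script was launched with.
    n_inputs: number of simulation inputs given to the mdrun task.
    library_uses_mpi: whether the GROMACS library is MPI-enabled (assumed to be
        known by the caller).
    """
    if not library_uses_mpi or comm_size < 2:
        # The situation described in this report cannot occur.
        return
    if n_inputs == 1:
        # Per this report, the launch would hang waiting on non-root ranks.
        raise RuntimeError(
            f"A single simulation was requested on {comm_size} ranks with an "
            "MPI-enabled GROMACS; refusing to launch because the run would hang."
        )
    # Per this report, only the first input would actually be simulated.
    warnings.warn(
        f"{n_inputs} simulations were requested on {comm_size} ranks with an "
        "MPI-enabled GROMACS; only the first input is expected to run."
    )
```

With mpi4py available, `comm_size` could be taken from `mpi4py.MPI.COMM_WORLD.Get_size()`.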

Possible fixes

The proper fix is to implement #4422 (closed).

For gmxapi 0.3.x, we could try to clean the environment to prevent libgromacs_mpi from expecting more than one rank per simulation and spawn a subprocess for the gmxapi.mdrun task.

This has been tested in concept with the script below, but a gmxapi patch has not yet been prepared and tested.

# Execute with (e.g.)
#     mpiexec -n 2 $(which python) -m mpi4py test.py
import contextlib
import os
import multiprocessing as mp
import typing
from multiprocessing import Process

# Environment variable prefixes known to be associated with MPI implementations,
# which may affect MPI context detection, and which should not matter outside of
# MPI contexts.
_filtered_prefixes = (
    'DCMF_',    # IBM
    'MPICH_',
    'MPIEXEC_',
    'MPIO_',    # IBM
    'MV2_',     # MVAPICH2 and some forks
    'MVAPICH_',
    'HYDRA_',   # MPICH
    'OMPI_',    # OpenMPI
    'PMI_',     # Process Management Interface
    'PMIX_',    # Newer PMI and batch systems
)


def filtered(key):
    """Return True if the variable name starts with a known MPI-related prefix."""
    # str.startswith() accepts a tuple of prefixes.
    return key.startswith(_filtered_prefixes)


def filtered_items(mapping: typing.Mapping[str, str]):
    """Yield the (key, value) pairs whose keys are not MPI-related."""
    for key, value in mapping.items():
        if filtered(key):
            continue
        yield (key, value)


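@contextlib.contextmanager
def filtered_environ():
    """Temporarily remove MPI-related variables from os.environ."""
    environ_original = os.environ.copy()
    # Iterate over the copy so that os.environ can be modified safely.
    for key in environ_original:
        if filtered(key):
            del os.environ[key]
    try:
        yield
    finally:
        # Restore only the variables that are missing on exit.
        for key, value in environ_original.items():
            if key not in os.environ:
                os.environ[key] = value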


def task():
    import mpi4py
    mpi4py.rc.initialize = False  # do not initialize MPI automatically
    mpi4py.rc.finalize = False    # do not finalize MPI automatically
    import mpi4py.MPI
    assert not mpi4py.MPI.Is_initialized()
    mpi4py.MPI.Init()
    print(f'subtask comm size: {mpi4py.MPI.COMM_WORLD.Get_size()}')
    mpi4py.MPI.Finalize()


if __name__ == '__main__':
    mp.set_start_method('spawn')

    # By default, mpi4py initializes and finalizes MPI automatically.
    import mpi4py.MPI

    # This would fail:
    # p = Process(target=task, args=())
    # p.start()
    # p.join()

    with filtered_environ():
        p = Process(target=task, args=())
        p.start()
        p.join()

    assert mpi4py.MPI.Is_initialized()
    print(f'Parent rank: {mpi4py.MPI.COMM_WORLD.Get_rank()}/{mpi4py.MPI.COMM_WORLD.Get_size()}')
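
The same idea can also be expressed without modifying the parent's os.environ, by passing a filtered copy of the environment to the child process. This is only a sketch under that assumption: `env_without_mpi` and `run_isolated` are hypothetical helpers, not gmxapi functions, and the prefix list is copied from the script above.

```python
import os
import subprocess
import sys

# Same prefixes as in the proof-of-concept script above.
MPI_ENV_PREFIXES = (
    'DCMF_', 'MPICH_', 'MPIEXEC_', 'MPIO_', 'MV2_',
    'MVAPICH_', 'HYDRA_', 'OMPI_', 'PMI_', 'PMIX_',
)


def env_without_mpi(environ=None):
    """Return a copy of the environment without MPI-related variables."""
    if environ is None:
        environ = os.environ
    return {
        key: value for key, value in environ.items()
        if not key.startswith(MPI_ENV_PREFIXES)
    }


def run_isolated(args, **kwargs):
    """Run a command in a subprocess that does not see the MPI launch context."""
    return subprocess.run(args, env=env_without_mpi(), check=True, **kwargs)
```

For example, `run_isolated([sys.executable, 'worker.py'])` would start a hypothetical worker script with the launcher variables removed, while the parent rank keeps its own environment untouched.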

Edited Mar 08, 2022 by M. Eric Irrgang