7. Verbs Interface

rdma.ibverbs implements a set of Python extension objects and functions that wrap the OFA verbs interface from libibverbs. The wrapper presents the verbs interface in an object-oriented style and exposes most of its functionality to Python.

A basic example for getting a verbs instance and a protection domain is:

import rdma
import rdma.ibverbs as ibv

end_port = rdma.get_end_port()
with rdma.get_verbs(end_port) as ctx:
    pd = ctx.pd()

Verbs objects that have an underlying kernel allocation are all context managers and have a close() method, and each object also keeps track of its own children. For example, closing a rdma.ibverbs.Context will close all rdma.ibverbs.PD and rdma.ibverbs.CQ objects created by it. This makes resource cleanup quite straightforward in most cases.

As with file objects, users should be careful to call the close() method once the instance is no longer needed. Generally, focusing on the rdma.ibverbs.Context and rdma.ibverbs.PD is sufficient due to the built-in resource cleanup.
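
The parent/child cleanup described above can be pictured with a small pure-Python sketch. The Resource class and its child() method are hypothetical stand-ins and not the rdma.ibverbs implementation; they only illustrate how closing a parent can close everything created from it.

```python
import weakref


class Resource(object):
    """Hypothetical stand-in for a verbs object that owns a kernel allocation."""

    def __init__(self, name, parent=None):
        self.name = name
        self.closed = False
        # Weak references: a child that is garbage collected drops out by itself.
        self._children = weakref.WeakSet()
        if parent is not None:
            parent._children.add(self)

    def child(self, name):
        """Create a resource owned by this one (compare ctx.pd())."""
        return Resource(name, parent=self)

    def close(self):
        """Close the children first, then this object; safe to call twice."""
        if self.closed:
            return
        for child in list(self._children):
            child.close()
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False


with Resource("ctx") as ctx:
    pd = ctx.child("pd")
    cq = ctx.child("cq")
    mr = pd.child("mr")

# Leaving the with block closed ctx, which closed pd and cq; pd closed mr.
closed_states = (ctx.closed, pd.closed, cq.closed, mr.closed)
```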

The IB verbs structures (e.g. ibv_qp_attr) are mapped into Python objects (e.g. rdma.ibverbs.qp_attr). As Python objects they work similarly to the C syntax with structure member assignment, but they can also be initialized with a keyword argument list to the constructor. This can save a considerable number of lines.
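
To make the two initialization styles concrete, here is a minimal sketch using a hypothetical Struct class. It is not one of the real mapped structures; the field list and the zero defaults are assumptions chosen for illustration.

```python
class Struct(object):
    """Hypothetical stand-in for a mapped verbs structure."""

    _fields = ("qp_state", "port_num", "qkey", "pkey_index")

    def __init__(self, **kwargs):
        # Assumed default: every field starts at zero.
        for name in self._fields:
            setattr(self, name, 0)
        for name, value in kwargs.items():
            if name not in self._fields:
                raise TypeError("unknown field %r" % name)
            setattr(self, name, value)


# C-like style: create the object, then assign members one by one.
a = Struct()
a.port_num = 1
a.qkey = 0x80010000

# Keyword style: the same result in a single expression.
b = Struct(port_num=1, qkey=0x80010000)

same = all(getattr(a, n) == getattr(b, n) for n in Struct._fields)
```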

There are efficient wrapper functions that create qp_attr, ah_attr and sge objects with a reduced number of arguments.

Errors from verbs are raised as a rdma.SysError which includes the libibverbs function that failed and the associated errno.

Note

Despite the name ‘ibverbs’ the verbs interface is a generic interface that is supported by all RDMA devices. Different technologies have various limitations, and support for anything but IB through this library is not complete.

7.1. Verbs and rdma.path.IBPath

The raw verbs interface for creating QPs is simplified to rely on the standard IBPath structure, which should be filled in with all the necessary parameters. The wrapper QP modify methods modify_to_init(), modify_to_rtr(), and modify_to_rts() can set up a QP without additional information.

The attributes in an IBPath are used as follows when modifying a QP:

Path Attribute      Usage
------------------  ---------------------------
end_port.port_id    qp_attr.port_num
pkey                qp_attr.pkey_index
qkey                qp_attr.qkey
MTU                 qp_attr.path_mtu
retries             qp_attr.retry_cnt
min_rnr_timer       qp_attr.min_rnr_timer
packet_life_time    qp_attr.timeout
dack_resp_time      qp_attr.timeout
sack_resp_time      (none)
dqpn                qp_attr.dest_qp_num
sqpn                (none)
dqpsn               qp_attr.rq_psn
sqpsn               qp_attr.sq_psn
drdatomic           qp_attr.max_dest_rd_atomic
srdatomic           qp_attr.max_rd_atomic

IBPath structures can also be used any place where an ah_attr could be used, including for creating AH instances and with modify(). With this usage the IBPath caches the created AH, so getting the AH for a path a second time does not rebuild the AH. This means callers generally don’t have to worry about creating and maintaining AHs explicitly.
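
The caching behaviour can be sketched in pure Python. FakePath and FakePD are hypothetical stand-ins, and keying the cache on the protection domain is an assumption made for the illustration; the real library may organize its cache differently.

```python
class FakePath(object):
    """Hypothetical stand-in for an IBPath that can hold cached objects."""

    def __init__(self, **attrs):
        self.__dict__.update(attrs)
        self._cache = {}

    def drop_cache(self):
        """Forget every cached object so it can be released."""
        self._cache.clear()


class FakePD(object):
    """Hypothetical stand-in for a protection domain that builds AHs."""

    def __init__(self):
        self.ah_created = 0

    def ah(self, path):
        # Assumed policy: one cached AH per (path, PD) pair.
        cached = path._cache.get(self)
        if cached is None:
            self.ah_created += 1
            cached = ("AH", path.DLID, path.SL)  # placeholder for a handle
            path._cache[self] = cached
        return cached


pd = FakePD()
path = FakePath(DLID=5, SL=0)
first = pd.ah(path)
second = pd.ah(path)      # served from the cache, nothing is rebuilt
built_before_drop = pd.ah_created
path.drop_cache()
third = pd.ah(path)       # the cache was dropped, so a new AH is built
```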

The attributes in an IBPath are used as follows when creating an AH:

Path Attribute      Usage
------------------  -------------------------
has_grh             ah_attr.is_global
DGID                ah_attr.grh.dgid
SGID                ah_attr.grh.sgid_index
flow_label          ah_attr.grh.flow_label
hop_limit           ah_attr.grh.hop_limit
traffic_class       ah_attr.grh.traffic_class
DLID                ah_attr.dlid
SLID                ah_attr.SLID_bits
SL                  ah_attr.SL
rate                ah_attr.static_rate
end_port.port_id    ah_attr.port_num

7.2. Usage Examples

This is not intended to be a verbs primer. Generally the API follows that of the normal OFA verbs (with the ibv_ prefixes removed), which in turn follows the API documented by the IBA specification. Many helper functions are provided to handle common situations in a standard way, and these are generally preferred.

7.2.1. UD QP Setup

Setting up a QP for UD communication is very simple. There are two major cases: communication with a single end port, and communication with multiple end ports. The single case is:

import rdma.path
import rdma.IBA as IBA
from rdma.path import IBPath

# Resolve the path to the single destination through the SA.
path = IBPath(end_port,dqpn=1,qkey=IBA.IB_DEFAULT_QP1_QKEY,DGID=...)
with rdma.get_gmp_mad(path.end_port,verbs=ctx) as umad:
    rdma.path.resolve_path(umad,path,reversible=True)
with ctx.pd() as pd:
    depth = 16
    # CQs are created from the context, see rdma.ibverbs.Context.cq
    cq = ctx.cq(2*depth)
    qp = pd.qp(ibv.IBV_QPT_UD,depth,cq,depth,cq)
    path.sqpn = qp.qp_num
    # Post receive work requests to qp here
    qp.establish(path)

    qp.post_send(ibv.send_wr(opcode=ibv.IBV_WR_SEND,
                             send_flags=ibv.IBV_SEND_SIGNALED,
                             ah=pd.ah(path),
                             remote_qpn=path.dqpn,
                             remote_qkey=path.qkey,
                             ...))

Notice that the path is used to configure the pkey and qkey values of the UD QP during initialization, and is also used to create the AH for the send work request.

The case for multiple destinations is very similar, however all destinations must share the same PKey and QKey. For instance, assuming there is a list of DGIDs:

with rdma.get_gmp_mad(path.end_port) as umad:
    paths = [rdma.path.resolve_path(umad,IBPath(end_port,DGID=I,
                                                qkey=IBA.IB_DEFAULT_QP1_QKEY),
                                    reversible=True,
                                    properties={'PKey': IBA.DEFAULT_PKEY})
             for I in destinations]

This resolves all the DGIDs into paths with the same QKey and PKey. paths[-1] can be used to set up the QP, and all the paths can be used interchangeably in work requests.

7.2.2. UD response path

Constructing the reply path and generating a send WR from a UD WC is very straightforward:

wcs = cq.poll()
for wc in wcs:
    path = ibv.WCPath(self.end_port,wc,
                      buf,0,
                      pkey=qp_pkey,
                      qkey=qp_qkey)
    path.reverse()
    ah = pd.ah(path)
    wr = ibv.send_wr(opcode=ibv.IBV_WR_SEND,
                     ah=ah,
                     remote_qpn=path.dqpn,
                     remote_qkey=path.qkey,
                     ...)

buf,0 is the buffer and offset of the memory posted in the recv request. Remember that on UD QPs the first 40 bytes of the receive buffer are reserved for a GRH, which is accessed by rdma.ibverbs.WCPath().
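
As a small illustration of the reserved GRH area, the sketch below extracts the message bytes from a simulated UD receive buffer. The helper ud_payload() is hypothetical, and it assumes that the byte count reported by the completion includes the 40 byte GRH area, as wc.byte_len does for UD QPs in libibverbs.

```python
GRH_LEN = 40  # bytes reserved at the start of every UD receive buffer


def ud_payload(buf, off, byte_len):
    """Return the message bytes of a UD receive, skipping the GRH area.

    byte_len is assumed to count the GRH area as well as the payload.
    """
    if byte_len < GRH_LEN:
        raise ValueError("UD completion shorter than the GRH area")
    return bytes(buf[off + GRH_LEN:off + byte_len])


# Simulate a posted receive buffer holding a GRH area and a 5 byte message.
buf = bytearray(256)
buf[GRH_LEN:GRH_LEN + 5] = b"hello"
payload = ud_payload(buf, 0, GRH_LEN + 5)
```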

7.2.3. No CM QP Setup

The library has built in support for correctly establishing IB connections without using a CM by exchanging information over a side channel (eg a TCP socket). Side A would do this:

import pickle

qp = pd.qp(ibv.IBV_QPT_RC,...)
path = rdma.path.IBPath(end_port,SGID=end_port.default_gid)
rdma.path.fill_path(qp,path)
path.reverse(for_reply=False)
send_to_side_b(pickle.dumps(path))
path = pickle.loads(recv_from_side_b())
path.reverse(for_reply=False)
path.end_port = end_port

qp.establish(path,ibv.IBV_ACCESS_REMOTE_WRITE)

# Synchronize transition to RTS
send_to_side_b(True)
recv_from_side_b()

Side B would do this:

import pickle

qp = pd.qp(ibv.IBV_QPT_RC,...)
path = pickle.loads(recv_from_side_a())
path.end_port = end_port
rdma.path.fill_path(qp,path)
with rdma.get_gmp_mad(path.end_port) as umad:
    rdma.path.resolve_path(umad,path)
send_to_side_a(pickle.dumps(path))

qp.establish(path,ibv.IBV_ACCESS_REMOTE_WRITE)

# Synchronize transition to RTS
recv_from_side_a()
send_to_side_a(True)

rdma.path.fill_path() sets up most of the QP related path parameters and rdma.path.resolve_path() gets the path record(s) from the SA.

This procedure implements the same process and information exchange that the normal IB CM would do, including negotiating responder resources and having the capability to set up asymmetric paths (unimplemented today).

Any QP type is supported by this basic procedure; the extra information exchanged is simply not used.
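
The send_to_side_*() and recv_from_side_*() functions in the listings are placeholders for the side channel. One possible shape for them, sketched here under the assumption of a stream socket and a 4 byte length prefix, is shown below. The names send_msg() and recv_msg() are hypothetical, and a socket pair stands in for the TCP connection so the sketch runs in a single process.

```python
import socket
import struct

_HEADER = struct.Struct("!I")  # 4 byte big-endian length prefix


def send_msg(sock, payload):
    """Send one length-prefixed message."""
    sock.sendall(_HEADER.pack(len(payload)) + payload)


def _recv_exact(sock, count):
    """Read exactly count bytes, or raise EOFError if the peer closes."""
    chunks = []
    while count:
        chunk = sock.recv(count)
        if not chunk:
            raise EOFError("peer closed the side channel")
        chunks.append(chunk)
        count -= len(chunk)
    return b"".join(chunks)


def recv_msg(sock, max_len=65536):
    """Receive one length-prefixed message, refusing oversized ones."""
    (length,) = _HEADER.unpack(_recv_exact(sock, _HEADER.size))
    if length > max_len:
        raise ValueError("message too large")
    return _recv_exact(sock, length)


side_a, side_b = socket.socketpair()
try:
    send_msg(side_a, b"path from A")
    from_a = recv_msg(side_b)
    send_msg(side_b, b"path from B")
    from_b = recv_msg(side_a)
finally:
    side_a.close()
    side_b.close()
```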

Note

Pickle is only used as an easy example here. Real cases should do something else as unpickling untrusted data is dangerous. The Path object has a __reduce__() method which can be used to implement a protocol appropriate encoding.
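
As one illustration of a safer encoding, the sketch below sends plain path values as JSON and validates them against an allow list on receipt. It does not call __reduce__(); it assumes the caller has already extracted a dict of integer and string values from the path. The field selection and the function names are hypothetical.

```python
import json

# Hypothetical subset of path attributes to exchange.
ALLOWED = frozenset(["DGID", "SGID", "DLID", "SLID", "dqpn", "sqpn",
                     "dqpsn", "sqpsn", "drdatomic", "srdatomic"])


def encode_fields(fields):
    """Encode a dict of plain path values for the side channel."""
    unknown = set(fields) - ALLOWED
    if unknown:
        raise ValueError("unexpected fields: %s" % sorted(unknown))
    return json.dumps(fields, sort_keys=True).encode("ascii")


def decode_fields(data):
    """Decode and validate; only known names with int or str values pass."""
    fields = json.loads(data.decode("ascii"))
    if not isinstance(fields, dict):
        raise ValueError("expected a JSON object")
    for name, value in fields.items():
        if name not in ALLOWED:
            raise ValueError("unexpected field: %r" % name)
        if isinstance(value, bool) or not isinstance(value, (int, str)):
            raise ValueError("bad value for %s" % name)
    return fields


wire = encode_fields({"DLID": 5, "dqpn": 77, "DGID": "fe80::1"})
decoded = decode_fields(wire)
```

Unlike unpickling, decoding JSON cannot construct arbitrary objects, so the receiver only has to check names and value types.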

7.2.4. WC Error handling

The class rdma.ibverbs.WCError is an exception that can be thrown when a WC error is detected. It formats the information in the WC and provides a way for the catcher to determine the failed QP:

wcs = cq.poll()
for wc in wcs:
    if wc.status != ibv.IBV_WC_SUCCESS:
        raise ibv.WCError(wc,cq)

Depending on the situation, QP errors may not be recoverable, in which case the whole QP should be torn down.

7.2.5. Completion Channels

Additional helpers are provided to simplify completion channel processing, suitable for single threaded applications. The basic usage for a completion channel is:

import select

# Set up the completion channel
cc = ctx.comp_channel()
poll = select.poll()
cc.register_poll(poll)
cq = ctx.cq(2*depth,cc)

def get_wcs():
    cq.req_notify()
    while True:
        ret = poll.poll()
        for I in ret:
            if cc.check_poll(I) is not None:
                wcs = cq.poll()
                if wcs is not None:
                    return wcs

wcs = get_wcs()

Obviously the methodology becomes more complex if additional things are polled for. The basic idea is that rdma.ibverbs.CompChannel.check_poll() takes care of all the details and returns the CQ that has available work completions.

Using CQPoller the above example can be further simplified:

cc = ctx.comp_channel()
cq = ctx.cq(2*depth,cc)
poller = rdma.vtools.CQPoller(cq)

for wc in poller.iterwc(timeout=1):
    print(wc)

CQPoller also monitors for asynchronous events and will call rdma.ibverbs.Context.handle_async_event() which will produce exceptions for failure conditions and update the end port cache as necessary.

7.2.6. Memory

Memory registrations are made explicit, as with verbs everything that is passed into a work request must have an associated memory registration. A MR object can be created for anything that supports the Python buffer protocol, and writable MRs require a mutable Python buffer. Some useful examples:

s = "Hello"
mr = pd.mr(s,ibv.IBV_ACCESS_REMOTE_READ)
s = bytearray(256)
mr = pd.mr(s,ibv.IBV_ACCESS_REMOTE_WRITE)
s = mmap.mmap(-1,256)
mr = pd.mr(s,ibv.IBV_ACCESS_REMOTE_WRITE)
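
Whether an object offers a mutable buffer can be checked with the standard memoryview type before any registration is attempted. The sketch below uses only the Python standard library (with a bytes literal in place of the string above) and makes no rdma calls.

```python
import mmap

read_only = memoryview(b"Hello")        # immutable bytes object
writable = memoryview(bytearray(256))   # mutable, zero filled
anon_map = mmap.mmap(-1, 256)           # anonymous mapping, also mutable
mapped = memoryview(anon_map)

# readonly is True only for the immutable bytes object.
flags = (read_only.readonly, writable.readonly, mapped.readonly)

# Release the view before closing the mapping; closing a mapping that
# still has an exported buffer raises BufferError.
mapped.release()
anon_map.close()
```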

SGEs are constructed through the MR:

sge = mr.sge()
sge = mr.sge(length=128,off=10)

A tool is provided for managing a finite pool of fixed size buffers. This construct is very useful for applications using the SEND verb:

pool = rdma.vtools.BufferPool(pd,count=100,size=1024)
pool.post_recvs(qp,50)

buf_idx = pool.pop()
pool.copy_to("Hello message!",buf_idx)
qp.post_send(pool.make_send_wr(buf_idx,pool.size,path))

7.3. rdma.vtools module

rdma.vtools provides various support functions to make verbs programming easier.

class rdma.vtools.BufferPool(pd, count, size)

Bases: object

Hold onto a block of fixed size buffers and provide some helpers for using them as send and receive buffers with a QP.

This can be used to provide send buffers for a QP, as well as receive buffers for a QP or a SRQ. Generally the qp argument to methods of this class can be a rdma.ibverbs.QP or rdma.ibverbs.SRQ.

A rdma.ibverbs.MR is created in pd with count buffers of size bytes.

BUF_ID_MASK

Mask to convert a wr_id back into a buf_idx.

NO_WR_ID

Constant value to set wr_id to when it is not being used.

RECV_FLAG

Constant value to or into wr_id to indicate it was posted as a recv.
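
The idea behind BUF_ID_MASK and RECV_FLAG can be sketched as simple bit packing. The numeric values below are hypothetical and chosen only for the illustration; real code should use the attributes of BufferPool rather than these numbers.

```python
# Hypothetical values, not the ones used by rdma.vtools.BufferPool.
RECV_FLAG = 1 << 31
BUF_ID_MASK = RECV_FLAG - 1


def make_wr_id(buf_idx, is_recv):
    """Pack a buffer index and a send/recv marker into one wr_id."""
    if not 0 <= buf_idx <= BUF_ID_MASK:
        raise ValueError("buffer index out of range")
    return buf_idx | (RECV_FLAG if is_recv else 0)


def split_wr_id(wr_id):
    """Recover (buf_idx, is_recv) from a wr_id built by make_wr_id()."""
    return wr_id & BUF_ID_MASK, bool(wr_id & RECV_FLAG)


recv_id = make_wr_id(7, True)
send_id = make_wr_id(7, False)
```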

close()

Close held objects

copy_from(buf_idx, offset=0, length=4294967295)

Return a copy of buffer buf_idx. buf_idx may be a wr_id.

Return type: bytearray

copy_to(buf, buf_idx, offset=0, length=4294967295)

Copy buf into the buffer buf_idx

count

Number of buffers.

finish_wcs(qp, wcs)

Process work completion list wcs to recover buffers attached to completed work and re-post recv buffers to qp. Every work request with an attached buffer must have a signaled completion to recover the buffer.

wcs may be a single wc.

Raises rdma.ibverbs.WCError: For WCs marked as error.

make_send_wr(buf_idx, buf_len, path=None)

Return a rdma.ibverbs.send_wr for buf_idx and path. If path is None then the wr does not contain path information (e.g. for connected QPs).

make_sge(buf_idx, buf_len)

Return a rdma.ibverbs.sge for buf_idx.

pop()

Return a new buffer index.

post_recvs(qp, count)

Post count buffers for receive to qp, which may be any object with a post_recv method.

size

Size of a single buffer.
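
The index arithmetic behind a fixed size buffer pool can be modelled without any verbs objects. TinyPool below is a hypothetical, registration-free model; it is not the BufferPool implementation, and its push() method is an invented helper for returning a buffer to the free list.

```python
class TinyPool(object):
    """Hypothetical, registration-free model of a fixed size buffer pool."""

    def __init__(self, count, size):
        self.count = count
        self.size = size
        self._mem = bytearray(count * size)  # one block, as one MR would cover
        self._free = list(range(count))      # indexes not currently in use

    def pop(self):
        """Return a free buffer index; raises IndexError when exhausted."""
        return self._free.pop()

    def push(self, buf_idx):
        """Return a buffer index to the free list."""
        self._free.append(buf_idx)

    def copy_to(self, buf, buf_idx, offset=0):
        """Copy buf into buffer buf_idx starting at offset."""
        if offset < 0 or offset + len(buf) > self.size:
            raise ValueError("data does not fit in one buffer")
        start = buf_idx * self.size + offset
        self._mem[start:start + len(buf)] = buf

    def copy_from(self, buf_idx, offset=0, length=None):
        """Return a copy of buffer buf_idx, clipped to the buffer end."""
        start = buf_idx * self.size + offset
        end = (buf_idx + 1) * self.size
        if length is not None:
            end = min(end, start + length)
        return bytearray(self._mem[start:end])


pool = TinyPool(count=4, size=16)
idx = pool.pop()
pool.copy_to(b"Hello message!", idx)
message = pool.copy_from(idx, length=14)
pool.push(idx)
```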

class rdma.vtools.CQPoller(cq, async_events=True, solicited_only=False)

Bases: object

Simple wrapper for a rdma.ibverbs.CQ and rdma.ibverbs.CompChannel to provide a blocking API for getting work completions.

cq is the completion queue to read work completions from. If the cq does not have a completion channel then this will spin loop on cq otherwise it sleeps on the completion channel.

If async_events is True then the async event queue will be monitored while sleeping.

iterwc(count=None, timeout=None, wakeat=None)

Generator that returns work completions from the CQ. If not None at most count wcs will be returned. timeout is the number of seconds this function can run for, and wakeat is the value of rdma.tools.clock_monotonic() after which iteration stops.

Return type: rdma.ibverbs.wc

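
The interplay of count, timeout and wakeat can be sketched with a plain generator. iter_until() is a hypothetical model, not the CQPoller code: it assumes that timeout is converted to an absolute deadline, that the earlier of the two deadlines wins, and it uses time.monotonic() in place of rdma.tools.clock_monotonic(). The clock is injectable so the example runs deterministically.

```python
import time


def iter_until(fetch, count=None, timeout=None, wakeat=None,
               clock=time.monotonic):
    """Yield items from fetch() until count items or the deadline passes.

    fetch() returns the next ready item, or None if nothing is ready.
    """
    if timeout is not None:
        deadline = clock() + timeout
        wakeat = deadline if wakeat is None else min(wakeat, deadline)
    yielded = 0
    while count is None or yielded < count:
        item = fetch()
        if item is not None:
            yield item
            yielded += 1
        elif wakeat is not None and clock() >= wakeat:
            return


# Deterministic demo: a fake clock that advances one unit per call.
ticks = iter(range(100))
pending = ["wc0", "wc1", "wc2"]


def fake_clock():
    return next(ticks)


def fake_fetch():
    return pending.pop(0) if pending else None


first = list(iter_until(fake_fetch, count=1))
rest = list(iter_until(fake_fetch, timeout=3, clock=fake_clock))
```
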
sleep(wakeat)

Go to sleep until the cq gets a completion. wakeat is the value of rdma.tools.clock_monotonic() after which the function returns None. Returns True if the completion channel triggered.

If no completion channel is in use this just returns True.

Note: It is necessary to call rdma.ibverbs.CQ.req_notify() on the CQ, then poll the CQ before calling sleep(). Otherwise the edge triggered nature of the completion channels can cause deadlock.
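
The ordering in the note above can be demonstrated with a simulated CQ. FakeCQ is a hypothetical model that uses a pipe as its notification channel and writes an event only while armed; wait_for_wcs() is likewise an invented helper. Neither is library code, and the sketch relies on select.poll(), which is not available on every platform.

```python
import os
import select


class FakeCQ(object):
    """Hypothetical CQ with an edge-triggered notification pipe."""

    def __init__(self):
        self.rfd, self._wfd = os.pipe()
        self._queue = []
        self._armed = False

    def req_notify(self):
        self._armed = True

    def poll(self):
        wcs, self._queue = self._queue, []
        return wcs

    def complete(self, wc):
        """Device side: queue a completion, notify only if armed."""
        self._queue.append(wc)
        if self._armed:
            self._armed = False          # one event per arming
            os.write(self._wfd, b"x")

    def close(self):
        os.close(self.rfd)
        os.close(self._wfd)


def wait_for_wcs(cq, poller, timeout_ms):
    """Arm, poll once, and only then sleep on the notification FD."""
    cq.req_notify()
    wcs = cq.poll()        # catches completions that arrived before arming
    if wcs:
        return wcs
    if not poller.poll(timeout_ms):
        return []
    os.read(cq.rfd, 1)     # consume the event
    return cq.poll()


cq = FakeCQ()
poller = select.poll()
poller.register(cq.rfd, select.POLLIN)

cq.complete("early")                  # not armed yet: no event is written
missed = poller.poll(0)               # sleeping here would never wake up
first = wait_for_wcs(cq, poller, 0)   # the poll after arming finds it
cq.complete("late")                   # armed now, so an event is written
second = wait_for_wcs(cq, poller, 0)
cq.close()
```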

timedout

True if iteration was stopped due to a timeout.

wakeat

Value of rdma.tools.clock_monotonic() to stop iterating. This can be altered while iterating.

7.4. rdma.ibverbs module

Note

Unfortunately Sphinx does not do a very good job auto documenting extension modules, and all the function arguments are stripped out. Until this is resolved the documentation after this point is incomplete.

The rdma.ibverbs module wraps all of the functions in libibverbs that are not duplicated elsewhere in the library. For instance, device discovery uses the rdma.devices module, not the functions from libibverbs.

class rdma.ibverbs.AH

Bases: object

Address handle, this is a context manager.

close

Free the verbs AH handle.

exception rdma.ibverbs.AsyncError

Bases: rdma.RDMAError

Raised when an asynchronous error event is received.

class rdma.ibverbs.CQ

Bases: object

Completion queue, this is a context manager.

close

Free the verbs CQ handle.

comp_chan

comp_events

ctx

poll

Perform the poll_cq operation and return a list of work completions.

req_notify

Request event notification for CQEs added to the CQ.

resize

Resize the CQ to have at least cqes entries.

class rdma.ibverbs.CompChannel

Bases: object

Completion channel, this is a context manager.

check_poll

Returns a rdma.ibverbs.CQ that got at least one completion event, or None. This updates the comp channel, keeps track of received events, and calls ibv_ack_cq_events internally as appropriate. After this call the CQ must be re-armed via rdma.ibverbs.CQ.req_notify().

close

Free the verbs completion channel handle.

ctx

fileno

Return the FD associated with this completion channel.

register_poll

Add the FD associated with this object to select.poll object poll.

class rdma.ibverbs.Context

Bases: object

Verbs context handle, this is a context manager. Call rdma.get_verbs() to get an instance of this.

check_poll

Return True if pevent indicates that get_async_event() will return data.

close

Free the verbs context handle and all resources allocated by it.

comp_channel

Create a new rdma.ibverbs.CompChannel for this context.

cq

Create a new rdma.ibverbs.CQ for this context.

end_port

from_qp_num

Return a rdma.ibverbs.QP for the qp number num or None if one was not found.

get_async_event

Get a single async event for this context. The return result is a namedtuple of (event_type, obj), where obj will be the rdma.ibverbs.CQ, rdma.ibverbs.QP, rdma.ibverbs.SRQ, rdma.devices.EndPort or rdma.devices.RDMADevice associated with the event.

handle_async_event

This provides a generic handler for async events. Depending on the event it will:

- Raise a rdma.ibverbs.AsyncError exception
- Reload cached information in the end port

node

pd

Create a new rdma.ibverbs.PD for this context.

query_device

Return a rdma.ibverbs.device_attr for the device.

Return type: rdma.ibverbs.device_attr

query_port

Return a rdma.ibverbs.port_attr for the port_id. If port_id is None then the port info is returned for the end port this context was created against.

Return type: rdma.ibverbs.port_attr

register_poll

Add the async event FD associated with this object to select.poll object poll.

class rdma.ibverbs.MR

Bases: object

Memory registration, this is a context manager.

addr

close

Free the verbs MR handle.

ctx

length

lkey

pd

rkey

sge

Create a rdma.ibverbs.sge referring to length bytes of this MR starting at off. If length is -1 (default) then the entire MR from off to the end is used.
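
The length and off rule can be written out as plain arithmetic. sge_range() is a hypothetical helper that works on numbers only; the bounds checks are an assumption about sensible behaviour, not a description of what the library does with out-of-range values.

```python
def sge_range(mr_addr, mr_length, length=-1, off=0):
    """Return (addr, length) for a slice of a registration.

    length=-1 selects everything from off to the end of the registration.
    """
    if not 0 <= off <= mr_length:
        raise ValueError("offset outside the registration")
    if length == -1:
        length = mr_length - off
    if length < 0 or off + length > mr_length:
        raise ValueError("range outside the registration")
    return mr_addr + off, length


whole = sge_range(0x1000, 256)
tail = sge_range(0x1000, 256, off=10)
part = sge_range(0x1000, 256, length=128, off=10)
```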

class rdma.ibverbs.PD

Bases: object

Protection domain handle, this is a context manager.

ah

Create a new rdma.ibverbs.AH for this protection domain. attr may be a rdma.ibverbs.ah_attr or rdma.path.IBPath. When used with an IBPath this function will cache the AH in the IBPath. rdma.path.Path.drop_cache() must be called to release all references to the AH.

close

Free the verbs PD handle.

ctx

from_qp_num

Return a rdma.ibverbs.QP for the qp number num or None if one was not found.

mr

Create a new rdma.ibverbs.MR for this protection domain.

qp

Create a new rdma.ibverbs.QP for this protection domain. This version expresses the QP creation attributes as keyword arguments.

qp_raw

Create a new rdma.ibverbs.QP for this protection domain. init is a rdma.ibverbs.qp_init_attr.

srq

Create a new rdma.ibverbs.SRQ for this protection domain. init is a rdma.ibverbs.srq_init_attr.

class rdma.ibverbs.QP

Bases: object

Queue pair, this is a context manager.

attach_mcast

Attach this QP to receive the multicast group described by path.DGID and path.DLID.

close

Free the verbs QP handle.

ctx

detach_mcast

Detach this QP from the multicast group described by path.DGID and path.DLID.

establish

Perform modify_to_init(), modify_to_rtr() and modify_to_rts(). This function is most useful for UD QPs which do not require any external sequencing.

max_recv_sge

max_recv_wr

max_send_sge

max_send_wr

modify

When modifying a QP the value attr.ah_attr may be a rdma.ibverbs.ah_attr or rdma.path.IBPath.

modify_to_init

Modify the QP to the INIT state.

modify_to_rtr

Modify the QP to the RTR state.

modify_to_rts

Modify the QP to the RTS state.

pd

post_recv

wrlist may be a single rdma.ibverbs.recv_wr or a list of them.

post_send

wrlist may be a single rdma.ibverbs.send_wr or a list of them.

qp_num

qp_type

query

Return information about the QP. mask selects which fields to return.

Return type: tuple(rdma.ibverbs.qp_attr, rdma.ibverbs.qp_init_attr)

state

class rdma.ibverbs.SRQ

Bases: object

Shared receive queue, this is a context manager.

close

Free the verbs SRQ handle.

ctx

modify

Modify the srq_limit and max_wr values of SRQ. If the argument is None it is not changed.

pd

post_recv

wrlist may be a single rdma.ibverbs.recv_wr or a list of them.

query

Return a rdma.ibverbs.srq_attr.

exception rdma.ibverbs.WCError

Bases: rdma.RDMAError

Raised when a WC is completed with error. Note: Not all adaptors support returning the opcode and qp_num in an error WC. For those that do the values are decoded.

wc is the error wc, msg is an additional descriptive message, cq is the CQ the error WC was received on and obj is a rdma.ibverbs.SRQ or rdma.ibverbs.QP if one is known. is_rq is True if the WC is known to apply to the receive queue of the QP, False if the WC is known to apply to the send queue of the QP, and None if unknown.

rdma.ibverbs.WCPath()

Create a rdma.path.IBPath from a work completion. buf should be the receive buffer when this is used with a UD QP, since the first 40 bytes of that buffer could be a GRH. off is the offset into buf. kwargs are applied to rdma.path.IBPath.

Note: wc.pkey_index is not used. If the WC is associated with a GSI QP (unlikely) then the caller can pass pkey_index=wc.pkey_index as an argument.

exception rdma.ibverbs.WRError

Bases: rdma.SysError

Raised when an error occurs posting work requests. bad_index is the index into the work request list that failed to post.

rdma.ibverbs.WeakSet

alias of _my_weakset

rdma.ibverbs.wc_status_str()

Convert a rdma.ibverbs.wc.status value into a string.