rdma.ibverbs implements a set of Python extension objects and functions that wrap the OFA verbs interface from libibverbs. The wrapper presents the verbs interface in an object-oriented style and exposes most of its functionality to Python.
A basic example for getting a verbs instance and a protection domain is:
import rdma
import rdma.ibverbs as ibv
end_port = rdma.get_end_port()
with rdma.get_verbs(end_port) as ctx:
    pd = ctx.pd()
Verbs objects that have an underlying kernel allocation are all context managers and have a close() method, and each object also keeps track of its own children. For example, closing a rdma.ibverbs.Context closes all rdma.ibverbs.PD and rdma.ibverbs.CQ objects created by it. This makes resource cleanup straightforward in most cases.
As with file objects, users should take care to call close() once an instance is no longer needed. Because of the built-in resource cleanup, it is generally sufficient to manage the rdma.ibverbs.Context and rdma.ibverbs.PD.
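The parent/child cleanup described above can be illustrated without any RDMA hardware. The following is a minimal sketch only: the Resource class and its attributes are hypothetical stand-ins, not the rdma.ibverbs implementation, and the use of weak references is an assumption made for the example.

```python
import weakref


class Resource(object):
    """Hypothetical stand-in for a verbs object (not the rdma.ibverbs code)."""

    def __init__(self, parent=None):
        self.closed = False
        # Weak references let a child be garbage collected before its parent closes.
        self._children = weakref.WeakSet()
        if parent is not None:
            parent._children.add(self)

    def close(self):
        if self.closed:
            return
        # Close children first so nothing outlives the object that created it.
        for child in list(self._children):
            child.close()
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False


with Resource() as ctx:
    pd = Resource(parent=ctx)
    cq = Resource(parent=ctx)
    mr = Resource(parent=pd)

# Leaving the with block closed ctx, which cascaded to pd, cq and mr.
print(ctx.closed, pd.closed, cq.closed, mr.closed)
```

Only the outermost object needs explicit management here, which mirrors the advice to focus on the context and the protection domain.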
The IB verbs structures (e.g. ibv_qp_attr) are mapped into Python objects (e.g. rdma.ibverbs.qp_attr). These objects support structure member assignment much like the C syntax, but they can also be initialized by passing keyword arguments to the constructor, which can save a considerable number of lines.
There are efficient wrapper functions that create qp_attr, ah_attr and sge objects with a reduced number of arguments.
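To make the difference between member assignment and keyword initialization concrete, here is a small self-contained sketch. Struct and QPAttr are hypothetical classes written for this example; they only imitate the behaviour described above and are not the rdma.ibverbs types, and the field list is an illustrative subset.

```python
class Struct(object):
    """Hypothetical struct-like base: only declared fields may be set."""

    _fields = ()

    def __init__(self, **kwargs):
        # Every declared field starts at zero, as a zeroed C structure would.
        for name in self._fields:
            setattr(self, name, 0)
        for name, value in kwargs.items():
            if name not in self._fields:
                raise TypeError("unknown field %r" % name)
            setattr(self, name, value)


class QPAttr(Struct):
    _fields = ("qp_state", "port_num", "pkey_index", "qkey")


# C-like style: one assignment per member.
a = QPAttr()
a.port_num = 1
a.qkey = 0x80010000

# Keyword style: the same object built in a single expression.
b = QPAttr(port_num=1, qkey=0x80010000)

print(vars(a) == vars(b))
```

Rejecting unknown keyword names is a design choice of the sketch; it catches misspelled field names at construction time.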
Errors from verbs are raised as a rdma.SysError which includes the libibverbs function that failed and the associated errno.
Note
Despite the name ‘ibverbs’ the verbs interface is a generic interface that is supported by all RDMA devices. Different technologies have various limitations, and support in this library for anything but IB is incomplete.
The raw verbs interface for creating QPs is simplified to rely on the standard IBPath structure, which should be filled in with all the necessary parameters. The wrapper QP modify methods modify_to_init(), modify_to_rtr(), and modify_to_rts() can set up a QP without additional information.
The attributes in an IBPath are used as follows when modifying a QP:
Path Attribute | Usage |
---|---|
end_port.port_id | qp_attr.port_num |
pkey | qp_attr.pkey_index |
qkey | qp_attr.qkey |
MTU | qp_attr.path_mtu |
retries | qp_attr.retry_cnt |
min_rnr_timer | qp_attr.min_rnr_timer |
packet_life_time | qp_attr.timeout |
dack_resp_time | qp_attr.timeout |
sack_resp_time | |
dqpn | qp_attr.dest_qp_num |
sqpn | |
dqpsn | qp_attr.rq_psn |
sqpsn | qp_attr.sq_psn |
drdatomic | qp_attr.max_dest_rd_atomic |
srdatomic | qp_attr.max_rd_atomic |
IBPath structures can also be used anywhere an ah_attr could be used, including for creating AH instances and with modify(). With this usage the IBPath caches the created AH, so getting the AH for a path a second time does not rebuild it. This means callers generally don’t have to worry about creating and maintaining AHs explicitly.
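The caching behaviour can be sketched with plain Python objects. FakeAH, FakePath and FakePD below are hypothetical stand-ins, and keying the cache on the PD is an assumption of this sketch rather than a statement about how the library stores its cache.

```python
class FakeAH(object):
    """Hypothetical stand-in for an address handle."""

    created = 0  # counts how many handles were actually built

    def __init__(self, path):
        FakeAH.created += 1
        self.dlid = path.DLID


class FakePath(object):
    """Hypothetical stand-in for a path that carries a private cache."""

    def __init__(self, DLID):
        self.DLID = DLID
        self._cache = {}

    def drop_cache(self):
        # Release every cached object so it can be freed.
        self._cache.clear()


class FakePD(object):
    def ah(self, path):
        # Key on the PD so one path could be used with several PDs (assumption).
        key = ("ah", id(self))
        if key not in path._cache:
            path._cache[key] = FakeAH(path)
        return path._cache[key]


pd = FakePD()
path = FakePath(DLID=5)
first = pd.ah(path)
second = pd.ah(path)   # served from the cache, no new handle is built
print(first is second, FakeAH.created)
```

Because the path holds a reference to the handle, dropping the cache is what lets the handle be released, which matches the drop_cache() remark in the PD reference later in this page.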
The attributes in an IBPath are used as follows when creating an AH:
Path Attribute | Usage |
---|---|
has_grh | ah_attr.is_global |
DGID | ah_attr.grh.dgid |
SGID | ah_attr.grh.sgid_index |
flow_label | ah_attr.grh.flow_label |
hop_limit | ah_attr.grh.hop_limit |
traffic_class | ah_attr.grh.traffic_class |
DLID | ah_attr.dlid |
SLID | ah_attr.SLID_bits |
SL | ah_attr.SL |
rate | ah_attr.static_rate |
end_port.port_id | ah_attr.port_num |
This is not intended to be a verbs primer. Generally the API follows that of the normal OFA verbs (with ibv_ prefixes removed), which in turn follows the API documented by the IBA specification. Many helper functions are provided to handle common situations in a standard way, and these are generally preferred.
Setting up a QP for UD communication is very simple. There are two major cases: communication with a single end port, and communication with multiple end ports. The single case is:
path = IBPath(end_port,dqpn=1,qkey=IBA.IB_DEFAULT_QP1_QKEY,DGID=...)
with rdma.get_gmp_mad(path.end_port,verbs=ctx) as umad:
    rdma.path.resolve_path(umad,path,reversible=True)
with ctx.pd() as pd:
    depth = 16
    # CQs are created from the context, not from the PD
    cq = ctx.cq(2*depth)
    qp = pd.qp(ibv.IBV_QPT_UD,depth,cq,depth,cq)
    path.sqpn = qp.qp_num
    # Post receive work requests to qp here
    qp.establish(path)
    qp.post_send(ibv.send_wr(opcode=ibv.IBV_WR_SEND,
                             send_flags=ibv.IBV_SEND_SIGNALED,
                             ah=pd.ah(path),
                             remote_qpn=path.dqpn,
                             remote_qkey=path.qkey,
                             ...))
Notice that the path is used to configure the pkey and qkey values of the UD QP during initialization, and is also used to create the AH for the send work request.
The case for multiple destinations is very similar, however all destinations must share the same PKey and QKey. For instance, assuming there is a list of DGIDs:
with rdma.get_gmp_mad(end_port) as umad:
    paths = [rdma.path.resolve_path(umad,
                                    IBPath(end_port,DGID=I,
                                           qkey=IBA.IB_DEFAULT_QP1_QKEY),
                                    reversible=True,
                                    properties={'PKey': IBA.DEFAULT_PKEY})
             for I in destinations]
This resolves all the DGIDs into paths with the same QKey and PKey. paths[-1] can be used to set up the QP, and all the paths can be used interchangeably in work requests.
Constructing the reply path and generating a send WR from a UD WC is very straightforward:
wcs = cq.poll()
for wc in wcs:
    path = ibv.WCPath(self.end_port,wc,
                      buf,0,
                      pkey=qp_pkey,
                      qkey=qp_qkey)
    path.reverse()
    ah = pd.ah(path)
    wr = ibv.send_wr(opcode=ibv.IBV_WR_SEND,
                     ah=ah,
                     remote_qpn=path.dqpn,
                     remote_qkey=path.qkey,
                     ...)
buf,0 is the buffer and offset of the memory posted in the recv request. Remember that on UD QPs the first 40 bytes of the receive buffer are reserved for a GRH, which is accessed by rdma.ibverbs.WCPath().
The library has built-in support for correctly establishing IB connections without using a CM by exchanging information over a side channel (e.g. a TCP socket). Side A would do this:
import pickle

qp = pd.qp(ibv.IBV_QPT_RC,...)
path = rdma.path.IBPath(end_port,SGID=end_port.default_gid)
rdma.path.fill_path(qp,path)
path.reverse(for_reply=False)
send_to_side_b(pickle.dumps(path))
path = pickle.loads(recv_from_side_b())
path.reverse(for_reply=False)
path.end_port = end_port
qp.establish(path,ibv.IBV_ACCESS_REMOTE_WRITE)
# Synchronize transition to RTS
send_to_side_b(True)
recv_from_side_b()
Side B would do this:
import pickle

qp = pd.qp(ibv.IBV_QPT_RC,...)
path = pickle.loads(recv_from_side_a())
path.end_port = end_port
rdma.path.fill_path(qp,path)
with rdma.get_gmp_mad(path.end_port) as umad:
    rdma.path.resolve_path(umad,path)
send_to_side_a(pickle.dumps(path))
qp.establish(path,ibv.IBV_ACCESS_REMOTE_WRITE)
# Synchronize transition to RTS
recv_from_side_a()
send_to_side_a(True)
rdma.path.fill_path() sets up most of the QP related path parameters and rdma.path.resolve_path() gets the path record(s) from the SA.
This procedure implements the same process and information exchange that the normal IB CM would do, including negotiating responder resources and having the capability to set up asymmetric paths (unimplemented today).
Any QP type is supported by this basic procedure; for QP types that do not need it, the extra information exchanged is simply not used.
Note
Pickle is only used as an easy example here. Real applications should use a different encoding, because unpickling untrusted data is dangerous. The Path object has a __reduce__() method which can be used to implement a protocol appropriate encoding.
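As one possible alternative to pickle, the sketch below encodes a fixed whitelist of path fields as JSON and validates everything it decodes. It operates on a plain dict rather than a real Path object, and the choice of fields and the function names encode_path() and decode_path() are assumptions of this example, not part of the library. How to obtain the field values from __reduce__() is left out because its return value is not described here.

```python
import json

# Hypothetical whitelist of fields carried over the side channel.
INT_FIELDS = ("dqpn", "sqpn", "dqpsn", "sqpsn", "drdatomic", "srdatomic")
GID_FIELDS = ("DGID", "SGID")  # carried as 16 raw bytes in this sketch


def encode_path(fields):
    """Serialize the whitelisted entries of fields to ASCII JSON bytes."""
    msg = {}
    for name in INT_FIELDS:
        if name in fields:
            msg[name] = int(fields[name])
    for name in GID_FIELDS:
        if name in fields:
            msg[name] = bytes(fields[name]).hex()
    return json.dumps(msg, sort_keys=True).encode("ascii")


def decode_path(data):
    """Parse bytes from the peer, rejecting anything unexpected."""
    msg = json.loads(data.decode("ascii"))
    if not isinstance(msg, dict):
        raise ValueError("expected a JSON object")
    fields = {}
    for name, value in msg.items():
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if name in INT_FIELDS and is_int:
            fields[name] = value
        elif name in GID_FIELDS and isinstance(value, str) and len(value) == 32:
            fields[name] = bytes.fromhex(value)
        else:
            raise ValueError("unexpected field %r" % name)
    return fields


original = {"dqpn": 0x123, "sqpsn": 77, "DGID": bytes(range(16))}
wire = encode_path(original)
print(decode_path(wire) == original)
```

Unlike unpickling, decoding this format can only ever produce integers and byte strings, so a hostile peer cannot cause arbitrary objects to be constructed.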
The class rdma.ibverbs.WCError is an exception that can be raised when a WC error is detected. It formats the information in the WC and provides a way for the catcher to determine the failed QP:
wcs = cq.poll()
for wc in wcs:
    if wc.status != ibv.IBV_WC_SUCCESS:
        raise ibv.WCError(wc,cq)
Depending on the situation QP errors may not be recoverable so the whole QP should be torn down.
Additional helpers are provided to simplify completion channel processing, suitable for single threaded applications. The basic usage for a completion channel is:
import select

# To set up the completion channel
cc = ctx.comp_channel()
poll = select.poll()
cc.register_poll(poll)
cq = ctx.cq(2*depth,cc)

def get_wcs():
    cq.req_notify()
    while True:
        ret = poll.poll()
        for I in ret:
            if cc.check_poll(I) is not None:
                wcs = cq.poll()
                if wcs is not None:
                    return wcs

wcs = get_wcs()
Obviously the methodology becomes more complex if additional things are polled for. The basic idea is that rdma.ibverbs.CompChannel.check_poll() takes care of all the details and returns the CQ that has available work completions.
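When several event sources share one select.poll object, a common approach is to dispatch on the file descriptor. The sketch below shows that pattern using pipes as stand-ins for a completion channel FD and a second event source. It uses only the standard library (select.poll is not available on every platform), and the register() and run_once() helpers are hypothetical names introduced for the example.

```python
import os
import select

poller = select.poll()
handlers = {}  # fd -> callable invoked when that fd becomes readable


def register(fd, handler):
    poller.register(fd, select.POLLIN)
    handlers[fd] = handler


def run_once(timeout_ms):
    """Wait once and return the results of every handler that fired."""
    results = []
    for fd, event in poller.poll(timeout_ms):
        if event & select.POLLIN:
            results.append(handlers[fd](fd))
    return results


# Two pipes stand in for a completion channel FD and some other source.
cc_r, cc_w = os.pipe()
other_r, other_w = os.pipe()
register(cc_r, lambda fd: ("completion", os.read(fd, 16)))
register(other_r, lambda fd: ("other", os.read(fd, 16)))

os.write(cc_w, b"c")   # simulate a completion event
fired = run_once(100)
idle = run_once(10)    # nothing was written since, so this call times out

for fd in (cc_r, cc_w, other_r, other_w):
    os.close(fd)

print(fired, idle)
```

In a real application the handler registered for the completion channel would call check_poll() and then poll the returned CQ, as in the example above.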
Using CQPoller the above example can be further simplified:
cc = ctx.comp_channel()
cq = ctx.cq(2*depth,cc)
poller = rdma.vtools.CQPoller(cq)
for wc in poller.iterwc(timeout=1):
    print(wc)
CQPoller also monitors for asynchronous events and will call rdma.ibverbs.Context.handle_async_event() which will produce exceptions for failure conditions and update the end port cache as necessary.
Memory registrations are explicit: as with verbs, everything that is passed into a work request must have an associated memory registration. An MR object can be created for anything that supports the Python buffer protocol, and writable MRs require a mutable Python buffer. Some useful examples:
s = "Hello"
mr = pd.mr(s,ibv.IBV_ACCESS_REMOTE_READ)
s = bytearray(256)
mr = pd.mr(s,ibv.IBV_ACCESS_REMOTE_WRITE)
s = mmap.mmap(-1,256)
mr = pd.mr(s,ibv.IBV_ACCESS_REMOTE_WRITE)
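Whether an object offers a mutable buffer can be checked with memoryview from the standard library. The helper name is_writable_buffer() is hypothetical, and the sketch is written for Python 3, where a read-only buffer must be a bytes object because str does not support the buffer protocol.

```python
import mmap


def is_writable_buffer(obj):
    """True if obj exposes a writable buffer through the buffer protocol."""
    with memoryview(obj) as view:
        return not view.readonly


checks = {
    "bytes": is_writable_buffer(b"Hello"),
    "bytearray": is_writable_buffer(bytearray(256)),
}

m = mmap.mmap(-1, 256)   # anonymous mapping, writable by default
checks["mmap"] = is_writable_buffer(m)
m.close()

print(checks)
```

Objects that report a read-only buffer here are candidates only for registrations that the remote side or the local HCA never writes to.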
SGEs are constructed through the MR:
sge = mr.sge()
sge = mr.sge(length=128,off=10)
A tool is provided for managing a finite pool of fixed size buffers. This construct is very useful for applications using the SEND verb:
pool = rdma.vtools.BufferPool(pd,count=100,size=1024)
pool.post_recvs(qp,50)
buf_idx = pool.pop()
pool.copy_to("Hello message!",buf_idx)
qp.post_send(pool.make_send_wr(buf_idx,pool.size,path))
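The idea behind a fixed size buffer pool can be shown in a few lines of plain Python. TinyPool below is a hypothetical illustration, not rdma.vtools.BufferPool: it has no memory registration and no QP integration, and the push() method and the copy_from() signature are assumptions of the sketch.

```python
class TinyPool(object):
    """Hypothetical fixed size buffer pool backed by one bytearray."""

    def __init__(self, count, size):
        self.count = count
        self.size = size
        # One contiguous allocation means one registration would cover it all.
        self._mem = bytearray(count * size)
        self._free = list(range(count))

    def pop(self):
        """Return a free buffer index; IndexError when the pool is empty."""
        return self._free.pop()

    def push(self, buf_idx):
        """Give a buffer back once its work request has completed."""
        self._free.append(buf_idx)

    def copy_to(self, data, buf_idx):
        if len(data) > self.size:
            raise ValueError("data does not fit in one buffer")
        start = buf_idx * self.size
        self._mem[start:start + len(data)] = data
        return len(data)

    def copy_from(self, buf_idx, length):
        start = buf_idx * self.size
        return bytearray(self._mem[start:start + length])


pool = TinyPool(count=4, size=16)
idx = pool.pop()
n = pool.copy_to(b"Hello message!", idx)
print(idx, pool.copy_from(idx, n))
pool.push(idx)
```

Addressing buffers by index rather than by object keeps the identifier small enough to travel inside a work request ID.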
rdma.vtools provides various support functions to make verbs programming easier.
class rdma.vtools.BufferPool(pd, count, size)
Bases: object
Hold onto a block of fixed size buffers and provide some helpers for using them as send and receive buffers with a QP.
This can be used to provide send buffers for a QP, as well as receive buffers for a QP or a SRQ. Generally the qp argument to methods of this class can be a rdma.ibverbs.QP or rdma.ibverbs.SRQ.
A rdma.ibverbs.MR is created in pd with count buffers of size bytes.
Mask to convert a wr_id back into a buf_idx.
Constant value to set wr_id to when it is not being used.
Constant value to or into wr_id to indicate it was posted as a recv.
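The three constants above imply that a wr_id carries a buffer index plus a flag bit for receives. The sketch below shows one way such a scheme can work. The constant names and values are assumptions chosen for illustration (the names are not shown in this page), and recv_wr_id() and decode_wr_id() are hypothetical helpers.

```python
RECV_FLAG = 1 << 31            # assumed bit position for "posted as a recv"
BUF_IDX_MASK = RECV_FLAG - 1   # every bit below the flag holds the index
NO_WR_ID = 0xFFFFFFFF          # assumed sentinel for "no buffer attached"


def recv_wr_id(buf_idx):
    """wr_id for a buffer posted to a receive queue."""
    return buf_idx | RECV_FLAG


def decode_wr_id(wr_id):
    """Return (buf_idx, is_recv), or (None, False) for the sentinel."""
    if wr_id == NO_WR_ID:
        return None, False
    return wr_id & BUF_IDX_MASK, bool(wr_id & RECV_FLAG)


print(decode_wr_id(recv_wr_id(5)), decode_wr_id(5), decode_wr_id(NO_WR_ID))
```

Packing both pieces into the wr_id lets completion processing recover the buffer and decide whether to re-post it without any lookup table.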
Close held objects
Return a copy of buffer buf_idx. buf_idx may be a wr_id.
Return type: bytearray
Copy buf into the buffer buf_idx
Number of buffers.
Process work completion list wcs to recover buffers attached to completed work and re-post recv buffers to qp. Every work request with an attached buffer must have a signaled completion to recover the buffer.
wcs may be a single wc.
Raises: rdma.ibverbs.WCError for WCs marked as error.
Return a rdma.ibverbs.send_wr for buf_idx and path. If path is None then the wr does not contain path information (eg for connected QPs)
Return a rdma.ibverbs.SGE for buf_idx.
Return a new buffer index.
Post count buffers for receive to qp, which may be any object with a post_recv method.
Size of a single buffer.
class rdma.vtools.CQPoller
Bases: object
Simple wrapper for a rdma.ibverbs.CQ and rdma.ibverbs.CompChannel to provide a blocking API for getting work completions.
cq is the completion queue to read work completions from. If the cq does not have a completion channel then this will spin loop on cq; otherwise it sleeps on the completion channel.
If async_events is True then the async event queue will be monitored while sleeping.
Generator that returns work completions from the CQ. If count is not None, at most count wcs will be returned. timeout is the number of seconds this function can run for, and wakeat is the value of rdma.tools.clock_monotonic() after which iteration stops.
Return type: rdma.ibverbs.wc
Go to sleep until the cq gets a completion. wakeat is the value of rdma.tools.clock_monotonic() after which the function returns None. Returns True if the completion channel triggered.
If no completion channel is in use this just returns True.
Note: It is necessary to call rdma.ibverbs.CQ.req_notify() on the CQ, then poll the CQ before calling sleep(). Otherwise the edge triggered nature of the completion channels can cause deadlock.
True if iteration was stopped due to a timeout
Value of rdma.tools.clock_monotonic() to stop iterating. This can be altered while iterating.
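The interplay of count, timeout and wakeat can be modelled with a generator over a simple queue. DeadlinePoller is a hypothetical stand-in written for this sketch: it polls a deque instead of a CQ, uses time.monotonic() in place of rdma.tools.clock_monotonic(), and the attribute name timedout is an assumption.

```python
import collections
import time


class DeadlinePoller(object):
    """Hypothetical stand-in for a blocking completion iterator."""

    def __init__(self, source):
        self._source = source   # deque holding pending "completions"
        self.timedout = False
        self.wakeat = None

    def iterwc(self, count=None, timeout=None):
        self.timedout = False
        self.wakeat = None if timeout is None else time.monotonic() + timeout
        returned = 0
        while count is None or returned < count:
            if self._source:
                returned += 1
                yield self._source.popleft()
                continue
            # wakeat is re-read on every pass, so the caller may move it.
            if self.wakeat is not None and time.monotonic() >= self.wakeat:
                self.timedout = True
                return
            time.sleep(0.001)   # a real poller would sleep on the channel


poller = DeadlinePoller(collections.deque([1, 2, 3]))
drained = list(poller.iterwc(timeout=0.05))
print(drained, poller.timedout)
```

Here the generator yields the three queued items, then waits until the deadline passes and stops with timedout set. Passing count=2 instead would stop after two items without waiting for the deadline.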
Note
Unfortunately Sphinx does not do a very good job auto documenting extension modules, and all the function arguments are stripped out. Until this is resolved the documentation after this point is incomplete.
The rdma.ibverbs module wraps all of the functions in libibverbs that are not duplicated elsewhere in the library. For instance, device discovery uses the rdma.devices module, not the functions from libibverbs.
class rdma.ibverbs.AH
Bases: object
Address handle, this is a context manager.
Free the verbs AH handle.
exception rdma.ibverbs.AsyncError
Bases: rdma.RDMAError
Raised when an asynchronous error event is received.
class rdma.ibverbs.CQ
Bases: object
Completion queue, this is a context manager.
Free the verbs CQ handle.
Perform the poll_cq operation and return a list of work completions.
Request event notification for CQEs added to the CQ.
Resize the CQ to have at least cqes entries.
class rdma.ibverbs.CompChannel
Bases: object
Completion channel, this is a context manager.
Returns a rdma.ibverbs.CQ that got at least one completion event, or None. This updates the comp channel, keeps track of received events, and calls ibv_ack_cq_events internally as appropriate. After this call the CQ must be re-armed via rdma.ibverbs.CQ.req_notify().
Free the verbs completion channel handle.
Return the FD associated with this completion channel.
Add the FD associated with this object to select.poll object poll.
class rdma.ibverbs.Context
Bases: object
Verbs context handle, this is a context manager. Call rdma.get_verbs() to get an instance of this.
Return True if pevent indicates that get_async_event() will return data.
Free the verbs context handle and all resources allocated by it.
Create a new rdma.ibverbs.CompChannel for this context.
Create a new rdma.ibverbs.CQ for this context.
Return a rdma.ibverbs.QP for the qp number num or None if one was not found.
Get a single async event for this context. The return result is a namedtuple of (event_type, obj), where obj will be the rdma.ibverbs.CQ, rdma.ibverbs.QP, rdma.ibverbs.SRQ, rdma.devices.EndPort or rdma.devices.RDMADevice associated with the event.
This provides a generic handler for async events. Depending on the event it will either raise a rdma.ibverbs.AsyncError exception or reload cached information in the end port.
Create a new rdma.ibverbs.PD for this context.
Return a rdma.ibverbs.device_attr for the device.
Return type: rdma.ibverbs.device_attr
Return a rdma.ibverbs.port_attr for the port_id. If port_id is None then the port info is returned for the end port this context was created against.
Return type: rdma.ibverbs.port_attr
Add the async event FD associated with this object to select.poll object poll.
class rdma.ibverbs.MR
Bases: object
Memory registration, this is a context manager.
Free the verbs MR handle.
Create a rdma.ibverbs.sge referring to length bytes of this MR starting at off. If length is -1 (default) then the entire MR from off to the end is used.
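The length and off semantics described above reduce to a small range computation. sge_range() is a hypothetical helper for illustration only; in particular, raising ValueError for out-of-range requests is an assumption, since the library's behaviour in that case is not described here.

```python
def sge_range(mr_len, length=-1, off=0):
    """Return (offset, length) of the MR region an SGE would cover."""
    if not 0 <= off <= mr_len:
        raise ValueError("off is outside the registration")
    if length == -1:
        # Default: everything from off to the end of the registration.
        length = mr_len - off
    if length < 0 or off + length > mr_len:
        raise ValueError("range exceeds the registration")
    return off, length


print(sge_range(256))
print(sge_range(256, length=128, off=10))
print(sge_range(256, off=200))
```

For a 256 byte registration the three calls cover the whole region, 128 bytes starting at offset 10, and the last 56 bytes respectively.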
class rdma.ibverbs.PD
Bases: object
Protection domain handle, this is a context manager.
Create a new rdma.ibverbs.AH for this protection domain. attr may be a rdma.ibverbs.ah_attr or rdma.path.IBPath. When used with an IBPath this function will cache the AH in the IBPath. rdma.path.Path.drop_cache() must be called to release all references to the AH.
Free the verbs pd handle.
Return a rdma.ibverbs.QP for the qp number num or None if one was not found.
Create a new rdma.ibverbs.MR for this protection domain.
Create a new rdma.ibverbs.QP for this protection domain. This version expresses the QP creation attributes as keyword arguments.
Create a new rdma.ibverbs.QP for this protection domain. init is a rdma.ibverbs.qp_init_attr.
Create a new rdma.ibverbs.SRQ for this protection domain. init is a rdma.ibverbs.srq_init_attr.
class rdma.ibverbs.QP
Bases: object
Queue pair, this is a context manager.
Attach this QP to receive the multicast group described by path.DGID and path.DLID.
Free the verbs QP handle.
Detach this QP from the multicast group described by path.DGID and path.DLID.
Perform modify_to_init(), modify_to_rtr() and modify_to_rts(). This function is most useful for UD QPs which do not require any external sequencing.
When modifying a QP the value attr.ah_attr may be a rdma.ibverbs.ah_attr or rdma.path.IBPath.
Modify the QP to the INIT state.
Modify the QP to the RTR state.
Modify the QP to the RTS state.
wrlist may be a single rdma.ibverbs.recv_wr or a list of them.
wrlist may be a single rdma.ibverbs.send_wr or a list of them.
Return information about the QP. mask selects which fields to return.
Return type: tuple(rdma.ibverbs.qp_attr, rdma.ibverbs.qp_init_attr)
class rdma.ibverbs.SRQ
Bases: object
Shared Receive queue, this is a context manager.
Free the verbs SRQ handle.
Modify the srq_limit and max_wr values of SRQ. If the argument is None it is not changed.
wrlist may be a single rdma.ibverbs.recv_wr or a list of them.
Return a rdma.ibverbs.srq_attr.
exception rdma.ibverbs.WCError
Bases: rdma.RDMAError
Raised when a WC is completed with error. Note: Not all adaptors support returning the opcode and qp_num in an error WC. For those that do the values are decoded.
wc is the error wc, msg is an additional descriptive message, cq is the CQ the error WC was received on and obj is a rdma.ibverbs.SRQ or rdma.ibverbs.QP if one is known. is_rq is True if the WC is known to apply to the receive queue of the QP, False if the WC is known to apply to the send queue of the QP, and None if unknown.
rdma.ibverbs.WCPath()
Create a rdma.path.IBPath from a work completion. buf should be the receive buffer when this is used with a UD QP; the first 40 bytes of that buffer could be a GRH. off is the offset into buf. kwargs are applied to rdma.path.IBPath.
Note: wc.pkey_index is not used. If the WC is associated with a GSI QP (unlikely) then the caller can pass pkey_index=wc.pkey_index as an argument.
Bases: rdma.SysError
Raised when an error occurs posting work requests. bad_index is the index into the work request list that failed to post.
Convert a rdma.ibverbs.wc.status value into a string.