A High-Level View of RDMA Programming and Its Vocabulary

Recently I came across a pretty cool technology called RDMA (Remote Direct Memory Access). It enables direct access from the memory of one computer into that of another without involving either one's operating system. This permits high-throughput, low-latency networking, which is especially useful in massively parallel computer clusters. In this blog I will note down a few terms that come in handy when dealing with RDMA.

A Queue Pair (QP) consists of a Send Queue (SQ) and a Receive Queue (RQ). When we want the adapter to send data, we post a request to the SQ; when we want it to receive data, we post a request to the RQ. Both queues can be associated with a Completion Queue (CQ). The network adapter uses the CQ to report the status of completed work requests. Each Completion Queue Entry (CQE) holds the completion status of one or more completed work requests.

When we want an adapter to send or receive, we need to post a request; these are called work requests. A send request specifies how much data will be sent and the memory buffer where the data is located (both for connected and unconnected transports), where the data should be sent, and the type of the send request. A receive request specifies the maximum data size to be received and the memory buffer where the data should be placed. Completions of a send queue and a receive queue can be assigned to the same or to different completion queues.

Work requests within a single work queue are processed in the order in which they were posted, but no order is maintained between different work queues. Every work request has its own user-defined ID, wr_id, and flags. For example, wr.send_flags = IBV_SEND_SIGNALED requests generation of a completion element once the data is transmitted. Work requests can be chained by pointing wr.next at another work request.
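To make the wr_id, send_flags, and next fields a bit more concrete, here is a small plain-C model of a work-request chain. It does not use libibverbs: struct model_wr, MODEL_SEND_SIGNALED, chain_requests, and count_signaled are hypothetical names invented for this sketch, loosely mirroring the fields mentioned above. It links requests through next and counts how many of them asked for a completion element.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a send-flag bit; not the libibverbs constant. */
#define MODEL_SEND_SIGNALED 0x1u

/* Hypothetical model of a work request: an ID, flags, and a chain link. */
struct model_wr {
    uint64_t wr_id;          /* user-defined ID chosen by the application */
    unsigned send_flags;     /* bitmask, e.g. MODEL_SEND_SIGNALED */
    struct model_wr *next;   /* next request in the chain, or NULL */
};

/* Link an array of n requests into one chain and return its head. */
struct model_wr *chain_requests(struct model_wr *wrs, size_t n)
{
    assert(wrs != NULL && n > 0);
    for (size_t i = 0; i < n; i++)
        wrs[i].next = (i + 1 < n) ? &wrs[i + 1] : NULL;
    return &wrs[0];
}

/* Count the requests in a chain that ask for a completion element. */
size_t count_signaled(const struct model_wr *head)
{
    size_t count = 0;
    for (const struct model_wr *wr = head; wr != NULL; wr = wr->next)
        if (wr->send_flags & MODEL_SEND_SIGNALED)
            count++;
    return count;
}
```

With three chained requests of which only the last two carry the signaled flag, count_signaled returns 2, which is the number of completion elements this model would expect to see for that chain.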

ibv_create_cq is the function that creates a CQ. A work request can complete successfully or with an error; either way, the result is reported through a completion queue entry (CQE). Polling the CQ (with ibv_poll_cq) retrieves CQEs from it, and the outcome is reported in the status field of each completion entry.
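Polling can be pictured with a tiny plain-C model of a completion queue. This is not libibverbs code: model_cq, model_wc, model_cq_push, and model_cq_poll are hypothetical names for this sketch, and the capacity of 8 is an arbitrary assumption. The poll function follows the same general shape as polling a real CQ: the caller asks for up to a given number of entries and gets back how many were actually retrieved, with 0 meaning the queue is currently empty.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_CQ_CAPACITY 8   /* arbitrary size chosen for this sketch */

/* Hypothetical completion statuses; a real adapter reports many more. */
enum model_wc_status { MODEL_WC_SUCCESS = 0, MODEL_WC_ERROR = 1 };

/* Hypothetical completion entry: which request finished, and how. */
struct model_wc {
    uint64_t wr_id;
    enum model_wc_status status;
};

/* Hypothetical completion queue: a fixed-size FIFO ring of entries. */
struct model_cq {
    struct model_wc entries[MODEL_CQ_CAPACITY];
    size_t head;    /* index of the oldest entry */
    size_t count;   /* number of entries currently queued */
};

/* Adapter side: append a completion. Returns 0, or -1 if the CQ is full. */
int model_cq_push(struct model_cq *cq, uint64_t wr_id,
                  enum model_wc_status status)
{
    assert(cq != NULL);
    if (cq->count == MODEL_CQ_CAPACITY)
        return -1;
    size_t tail = (cq->head + cq->count) % MODEL_CQ_CAPACITY;
    cq->entries[tail].wr_id = wr_id;
    cq->entries[tail].status = status;
    cq->count++;
    return 0;
}

/* Application side: copy up to max entries into out, oldest first.
 * Returns how many entries were retrieved; 0 means the CQ was empty. */
int model_cq_poll(struct model_cq *cq, int max, struct model_wc *out)
{
    assert(cq != NULL && out != NULL);
    int n = 0;
    while (n < max && cq->count > 0) {
        out[n++] = cq->entries[cq->head];
        cq->head = (cq->head + 1) % MODEL_CQ_CAPACITY;
        cq->count--;
    }
    return n;
}
```

An application loop would call model_cq_poll repeatedly and inspect the status field of every entry it gets back, treating anything other than MODEL_WC_SUCCESS as a failed work request.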

We create a QP using the ibv_create_qp function. As parameters it takes a Protection Domain (PD) and a set of attributes. A protection domain gathers resources into groups. Resources from the same protection domain, e.g. a QP and an MR, are allowed to work with each other; resources from different protection domains are not. A protection domain is allocated by calling ibv_alloc_pd. The attribute struct would look something like this:

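#include <string.h>             /* memset */
#include <infiniband/verbs.h>   /* libibverbs declarations */

struct ibv_cq *cq;              /* assumed to be created earlier with ibv_create_cq */
struct ibv_qp_init_attr qp_init_attr;
memset(&qp_init_attr, 0, sizeof(qp_init_attr));  /* zero the fields we do not set */
qp_init_attr.send_cq = cq;            /* CQ for send completions */
qp_init_attr.recv_cq = cq;            /* CQ for receive completions (same one here) */
qp_init_attr.qp_type = IBV_QPT_UD;    /* unreliable datagram */
qp_init_attr.cap.max_send_wr = 2;     /* outstanding send work requests */
qp_init_attr.cap.max_recv_wr = 2;     /* outstanding receive work requests */
qp_init_attr.cap.max_send_sge = 1;    /* scatter/gather entries per send request */
qp_init_attr.cap.max_recv_sge = 1;    /* scatter/gather entries per receive request */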

more at: https://www.rdmamojo.com/2012/12/21/ibv_create_qp/

Here max_send_wr is the maximum number of work requests that we allow to be outstanding in the send queue, that is, posted but not yet completed. By the way, it is wise to note that it should not exceed the maximum number of CQ entries (cqe).
max_recv_wr is the same limit for the receive queue.
In max_send_sge and max_recv_sge, sge is shorthand for scatter/gather entry; these set the maximum number of scatter/gather entries per work request. The maximum number the device supports can be queried using ibv_query_device.

As you can see, we have set qp_type to IBV_QPT_UD, which stands for Unreliable Datagram. With the Reliable Connection (RC) type, a QP is connected to exactly one other RC QP, whereas an Unreliable Datagram QP can communicate one-to-many with other UD QPs without requiring any previous connection setup.

Like many things in network programming, a QP goes through a series of states before it can process sends and receives.

RESET: Upon creation, a QP is in the RESET state by default. The QP exists, but it cannot process any work requests in this state.
INIT: After its initial configuration, the QP moves from RESET to INIT. Once in INIT, receive buffers can be posted to the receive queue using ibv_post_recv, but these buffers won't be used until the QP is in the RTR state.
RTR: Next comes the Ready To Receive state. In this state the QP is configured to receive data.
RTS: Finally comes the Ready To Send state. In this state the QP is configured to send data, and send requests can be posted using ibv_post_send.
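The four states above can be sketched as a small state machine in plain C. This is only a model, not libibverbs code: the enum and the model_* function names are hypothetical, and real QPs also have further states (such as an error state) that this sketch leaves out. It encodes which step may follow which, and what each state permits: posting receive buffers from INIT onward, actually receiving from RTR onward, and posting sends only in RTS.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical state names for this model; not the libibverbs enum. */
enum model_qp_state { MODEL_RESET, MODEL_INIT, MODEL_RTR, MODEL_RTS };

/* Only the forward path described above is modeled, one step at a time. */
bool model_can_transition(enum model_qp_state from, enum model_qp_state to)
{
    return (from == MODEL_RESET && to == MODEL_INIT) ||
           (from == MODEL_INIT  && to == MODEL_RTR)  ||
           (from == MODEL_RTR   && to == MODEL_RTS);
}

/* Receive buffers may be posted from INIT onward. */
bool model_can_post_recv(enum model_qp_state s)
{
    return s == MODEL_INIT || s == MODEL_RTR || s == MODEL_RTS;
}

/* Incoming data is only processed once the QP is ready to receive. */
bool model_can_receive(enum model_qp_state s)
{
    return s == MODEL_RTR || s == MODEL_RTS;
}

/* Send requests may only be posted once the QP is ready to send. */
bool model_can_post_send(enum model_qp_state s)
{
    return s == MODEL_RTS;
}

/* Advance the QP if the step is allowed. Returns 0 on success, -1 otherwise. */
int model_modify_qp(enum model_qp_state *state, enum model_qp_state to)
{
    assert(state != NULL);
    if (!model_can_transition(*state, to))
        return -1;
    *state = to;
    return 0;
}
```

In this model, trying to jump straight from MODEL_RESET to MODEL_RTS fails and leaves the state unchanged, while walking RESET, INIT, RTR, RTS in order succeeds at every step.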

After creation, you can modify a QP using ibv_modify_qp. When modifying a QP, pkey_index, port_num, and qkey (for unreliable datagram only) might be necessary. All QPs that want to communicate over unreliable datagram must share the same qkey.

To make RDMA do things, the network adapter needs permission to access local data. This is done through an MR (memory region). A memory region has an address, a size, and a set of permissions that control access to the memory the region points to.

To register a memory region we call ibv_reg_mr. It takes the Protection Domain, the starting virtual memory address, the size, and access bits such as local read, local write, remote read, remote write, and atomic operation. Local read access is necessary when the adapter has to read local memory to gather data while an RDMA operation is being processed. Local write access is necessary when the adapter has to scatter data on receiving a send operation. Remote access is necessary when the adapter has to access local data on behalf of an RDMA operation received from a remote process.
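To illustrate how an address, a size, and a set of permissions together control access, here is a plain-C sketch of a memory region check. It is a model only: struct model_mr, the MODEL_* bits, and both functions are hypothetical names, and the bit values are not taken from libibverbs. The validity rule it encodes comes from my reading of the ibv_reg_mr documentation: remote write and remote atomic access both require local write access to be set as well, while local read is always allowed and so has no bit here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical permission bits for this sketch. The names echo the
 * permissions listed above; the values are NOT taken from libibverbs. */
enum model_access {
    MODEL_LOCAL_WRITE   = 1 << 0,
    MODEL_REMOTE_WRITE  = 1 << 1,
    MODEL_REMOTE_READ   = 1 << 2,
    MODEL_REMOTE_ATOMIC = 1 << 3
};

/* Hypothetical memory region: start address, size, and permissions. */
struct model_mr {
    const char *addr;
    size_t length;
    unsigned access;
};

/* Remote write and remote atomic both modify local memory, so the
 * region must also allow local write for the flag set to be valid. */
bool model_access_is_valid(unsigned flags)
{
    bool needs_local_write =
        (flags & (MODEL_REMOTE_WRITE | MODEL_REMOTE_ATOMIC)) != 0;
    return !needs_local_write || (flags & MODEL_LOCAL_WRITE) != 0;
}

/* Would a remote write of len bytes at offset off into the region be
 * allowed? It needs the permission bit and must stay inside the region. */
bool model_remote_write_ok(const struct model_mr *mr, size_t off, size_t len)
{
    assert(mr != NULL);
    if (!(mr->access & MODEL_REMOTE_WRITE))
        return false;
    return off <= mr->length && len <= mr->length - off;
}
```

For a 16-byte region registered with local write and remote write, a remote write of 8 bytes at offset 8 passes the check, while 9 bytes at offset 8 runs past the end of the region and is rejected.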

To open a device for communication we call ibv_open_device, which takes a device and returns a context pointer used by later calls. The cq_context, channel, and comp_vector parameters of ibv_create_cq are necessary when dealing with completion events.

If we want to send data from A to B, we need a source and a destination address (or a destination GID, destination_gid). This addressing information is known as the address vector.

We can collect our device details using the ibstat command. Please note that before this works we need to connect the two devices, install mlnx_ofed and the ibstat command, set the port type (ib/eth), and check that the ports are enabled and in the LinkUp state; if running on InfiniBand, opensm must also be running. We can also collect the same information programmatically using the ibv_get_device_list function.

Under the hood, libibverbs handles RDMA network operations such as creating, modifying, querying, and destroying resources. It also handles sending and receiving data through QPs and retrieving completions from Completion Queues.

Because tcpdump does not work with InfiniBand traffic, which bypasses the OS layer, we can use ibdump for debugging.