bus1 is a local IPC system, which provides a decentralized infrastructure to share objects between local peers. The main building blocks are nodes and handles. Nodes represent objects of a local peer, while handles represent descriptors that point to a node. Nodes can be created and destroyed by any peer, and they will always remain owned by their respective creator. Handles, on the other hand, are used to refer to nodes and can be passed around with messages as auxiliary data. Whenever a handle is transferred, the receiver will get its own handle allocated, pointing to the same node as the original handle.
Any peer can send messages directed at one of their handles. This will transfer the message to the owner of the node the handle points to. If a peer does not possess a handle to a given node, it cannot send messages to that node. That is, handles provide exclusive access management. Anyone who somehow acquired a handle to a node is privileged to pass this handle on to other peers; access management is thus transitive. Once a peer has acquired a handle, it cannot be revoked. However, a node owner can, at any time, destroy a node. This effectively unbinds all existing handles to that node on any peer, notifying each one of the destruction.
Unlike nodes and handles, peers cannot be addressed directly. In fact, peers are completely disconnected entities. A peer is merely an anchor for a set of nodes and handles, including an incoming message queue for those. Whether multiple nodes are part of the same peer or of different peers does not affect how they are seen remotely. Peers exist solely as a management entity and command dispatcher for local processes.
The set of actors on a system is completely decentralized. There is no global component involved that provides a central registry or discovery mechanism. Furthermore, communication between peers only involves those peers, and does not affect any other peer in any way. No global communication lock is taken. However, any communication is still globally ordered, including unicasts, multicasts, and notifications.
A peer context provides access to the bus1 system. A peer itself is not a routable entity, but rather only a local anchor to serve as gateway to the bus. To participate on the bus, you need to allocate a peer. This peer manages all your state on the bus, including all allocated nodes, owned handles, incoming messages, and more.
A peer is split into 3 sections:
- A static section that is initialized at peer creation and never changes.
- A peer-local section that is only ever accessed by ioctls done by the peer itself.
- A data section that might be accessed by remote peers when interacting with this peer.
All peers on the system operate on the same level. There is no context a peer is linked into. Hence, you can never lock multiple peers at the same time. Instead, peers provide active references. Before performing an operation on a peer, an active reference must be acquired and held for as long as the operation runs. When done, the reference is released. Once a peer is disconnected, no new active references can be acquired, and any outstanding operation is waited for before the peer is destroyed.
In addition to active references, there are 2 locks: a peer-local lock and a data lock. The peer-local lock is used to synchronize operations done by the peer itself. It is never acquired by a remote peer. The data lock protects the data of the peer, which might be modified by remote peers. The data lock nests underneath the local lock. Furthermore, data-lock critical sections must be kept small and must never block indefinitely: remote peers might wait on data-locks, so they must be able to rely on not being DoSed. The local peer lock, however, is private to the peer itself; no such restrictions apply. It is mostly used to give the impression of atomic operations (i.e., making the API appear consistent and coherent).
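As a rough illustration of these rules, a minimal sketch of an operation on a peer, using the acquire/release helpers documented below (the operation body and the error code are illustrative):

static int foo_peer_op(struct bus1_peer *peer)
{
	/* enter the peer; fails once the peer has been disconnected */
	if (!bus1_peer_acquire(peer))
		return -ESHUTDOWN;	/* illustrative error code */

	/* ... operate on the peer; it cannot be destroyed meanwhile ... */

	bus1_peer_release(peer);
	return 0;
}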
struct bus1_peer — peer context
struct bus1_peer {
	u64 id;
	u64 flags;
	const struct cred *cred;
	struct pid_namespace *pid_ns;
	struct bus1_user *user;
	struct rcu_head rcu;
	wait_queue_head_t waitq;
	struct bus1_active active;
	struct dentry *debugdir;
	struct { /* ... */ } local;	/* peer-local section; members elided in this excerpt */
};
bus1_peer_acquire — acquire active reference to peer
struct bus1_peer *bus1_peer_acquire(struct bus1_peer *peer);
bus1_peer_release — release an active reference
struct bus1_peer *bus1_peer_release(struct bus1_peer *peer);
bus1_peer_new — allocate new peer
struct bus1_peer *bus1_peer_new(void);
bus1_peer_free — destroy peer
struct bus1_peer *bus1_peer_free(struct bus1_peer *peer);
bus1_peer_ioctl — handle peer ioctls
long bus1_peer_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
XXX
struct bus1_factory — message factory
struct bus1_factory {
	struct bus1_peer *peer;		/* sending peer */
	struct bus1_cmd_send *param;	/* factory parameters */
	const struct cred *cred;	/* sender credentials */
	struct pid *pid;		/* sender PID */
	struct pid *tid;		/* sender TID */
	bool on_stack:1;		/* whether object lives on stack */
	bool has_secctx:1;		/* whether secctx has been set */
	size_t length_vecs;		/* total length of data in vectors */
	size_t n_vecs;			/* number of vectors */
	size_t n_handles;		/* number of handles */
	size_t n_handles_charge;	/* number of handles to charge on commit */
	size_t n_files;			/* number of files */
	u32 n_secctx;			/* length of secctx */
	struct iovec *vecs;		/* vector array */
	struct file **files;		/* file array */
	char *secctx;			/* allocated secctx */
	struct bus1_flist handles[];	/* handle array */
};
struct bus1_message — data messages
struct bus1_message {
	struct kref ref;		/* reference counter */
	struct bus1_queue_node qnode;	/* embedded queue node */
	struct bus1_handle *dst;	/* destination handle */
	struct bus1_user *user;		/* sending user */
	u64 flags;			/* message flags */
	uid_t uid;			/* sender UID */
	gid_t gid;			/* sender GID */
	pid_t pid;			/* sender PID */
	pid_t tid;			/* sender TID */
	size_t n_bytes;			/* number of user-bytes transmitted */
	size_t n_handles;		/* number of handles transmitted */
	size_t n_handles_charge;	/* number of handle charges */
	size_t n_files;			/* number of files transmitted */
	size_t n_secctx;		/* number of bytes of security context transmitted */
	struct bus1_pool_slice *slice;	/* actual message data */
	struct file **files;		/* passed file descriptors */
	struct bus1_flist handles[];	/* passed handles */
};
bus1_message_ref — acquire object reference
struct bus1_message *bus1_message_ref(struct bus1_message *m);
bus1_message_unref — release object reference
struct bus1_message *bus1_message_unref(struct bus1_message *m);
bus1_factory_new — create new message factory
struct bus1_factory *bus1_factory_new(struct bus1_peer *peer, struct bus1_cmd_send *param, void *stack, size_t n_stack);
peer: peer to operate as
param: factory parameters
stack: optional stack for factory, or NULL
n_stack: size of space at stack
This allocates a new message factory. It imports data from param and prepares the factory for a transaction. From this factory, messages can be instantiated. This is used both for unicasts and multicasts.
If stack is given, this tries to place the factory on the specified stack space. The caller must guarantee that the factory does not outlive the stack frame. If this is not wanted, pass 0 as n_stack.
In either case, if the stack frame is too small, this will allocate the factory on the heap.
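A minimal sketch of the intended calling convention (the stack-buffer size is arbitrary, and ERR_PTR-style returns are assumed):

{
	char buf[512];			/* arbitrary scratch space */
	struct bus1_factory *f;

	f = bus1_factory_new(peer, param, buf, sizeof(buf));
	if (IS_ERR(f))
		return PTR_ERR(f);	/* assumes ERR_PTR-style returns */

	/* ... seal the factory and instantiate messages ... */

	bus1_factory_free(f);		/* correct for stack and heap placement */
}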
bus1_factory_free — destroy message factory
struct bus1_factory *bus1_factory_free(struct bus1_factory *f);
bus1_factory_seal — charge and commit local resources
int bus1_factory_seal(struct bus1_factory *f);
bus1_factory_instantiate — instantiate a message from a factory
struct bus1_message *bus1_factory_instantiate(struct bus1_factory *f, struct bus1_handle *handle, struct bus1_peer *peer);
bus1_message_stage — stage message
void bus1_message_stage(struct bus1_message *m, struct bus1_tx *tx);
This acquires all resources of the message m and then stages the message on tx. Like all stage operations, this cannot be undone. Hence, you must make sure you can continue to commit the transaction without erroring-out in between.
This consumes the caller's reference on m, plus the active reference on the destination peer.
bus1_message_install — install message payload into target process
int bus1_message_install(struct bus1_message *m, struct bus1_cmd_recv *param);
struct bus1_tx — transaction context
struct bus1_tx {
	struct bus1_peer *origin;
	struct bus1_queue_node *sync;
	struct bus1_queue_node *async;
	struct bus1_queue_node *postponed;
	unsigned long flags;
	u64 timestamp;
	u64 async_ts;
};
bus1_tx_init — initialize transaction context
void bus1_tx_init(struct bus1_tx *tx, struct bus1_peer *origin);
bus1_tx_deinit — deinitialize transaction context
void bus1_tx_deinit(struct bus1_tx *tx);
bus1_tx_stage_sync — stage message
void bus1_tx_stage_sync(struct bus1_tx *tx, struct bus1_queue_node *qnode);
This stages qnode on the transaction tx. It is an error to call this on a qnode that is already staged. The caller must set qnode->owner to the destination peer and acquire it. If it is NULL, it is assumed to be the same as the origin of the transaction.
The caller must hold the data-lock of the destination peer.
This consumes qnode. The caller must increment the required reference counts to make sure qnode does not vanish.
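Putting these rules together, a staging sequence might look like the following sketch (the commit step and the destination data-lock are elided; qnode and dst are assumed to be set up by the caller):

{
	struct bus1_tx tx;

	bus1_tx_init(&tx, origin);

	qnode->owner = dst;			/* destination peer, already acquired */
	bus1_tx_stage_sync(&tx, qnode);		/* consumes qnode */

	/* ... stage further entries, then commit the transaction ... */

	bus1_tx_deinit(&tx);
}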
bus1_tx_stage_later — postpone message
void bus1_tx_stage_later(struct bus1_tx *tx, struct bus1_queue_node *qnode);
This queues qnode on tx, but does not stage it. It will be staged just before the transaction is committed. This can be used over bus1_tx_stage_sync if no immediate staging is necessary, or if required locks cannot be taken.
It is a caller-error if qnode is already part of a transaction.
bus1_tx_join — HIC SUNT DRACONES!
bool bus1_tx_join(struct bus1_queue_node *whom, struct bus1_queue_node *qnode);
This makes qnode join the on-going transaction of whom. That is, it is semantically equivalent to calling:
bus1_tx_stage_sync(whom->group, qnode);
However, you can only dereference whom->group while it is still ongoing. Once committed, it might be a stale pointer. This function safely checks for the required conditions and bails out if too late.
The caller must hold the data-locks of both peers (the target of whom, and qnode). qnode->owner must not be NULL! Furthermore, qnode must not be staged into any transaction, yet.
In general, this function is not what you want. There is no guarantee that you can join the transaction, hence a negative return value must be expected by the caller and handled gracefully. In that case, this function guarantees that the clock of the holder of qnode is synced with the transaction of whom, and as such is correctly ordered against the transaction.
If this function returns false, you must settle on the transaction before visibly reacting to it. That is, user-space must not see that you failed to join the transaction before the transaction is settled!
The object system on a bus is based on 'nodes' and 'handles'. Any peer can allocate new, local objects at any time. The creator automatically becomes the sole owner of the object. References to objects can be passed as payload of messages. The recipient will then gain their own reference to the object as well. Additionally, an object can be the destination of a message, in which case the message is always sent to the original creator (and thus the owner) of the object.
Internally, objects are called 'nodes'. A reference to an object is a 'handle'. Whenever a new node is created, the owner implicitly gains a handle as well. In fact, handles are the only way to refer to a node. The node itself is entirely hidden in the implementation and visible in the API only as an “anchor handle”.
Whenever a handle is passed as payload of a message, the target peer will gain a handle linked to the same underlying node. This works regardless of whether the sender is the owner of the underlying node, or not.
Each peer can identify all its handles (both owned and un-owned) by a 64-bit integer. The namespace is local to each peer, and the numbers cannot be compared with those of other peers (in fact, the same numbers are very likely to occur on different peers while referring to *different* underlying nodes). However, if a peer receives a reference to the same node multiple times, the resulting handle will be the same. The kernel keeps count of how often each peer owns a handle.
If a peer no longer requires a specific handle, it can release it. If the peer releases its last reference to a handle, the handle will be destroyed.
The owner of a node (and *only* the owner) can trigger the destruction of a node (even if other peers still own handles to it). In this case, all peers that own a handle are notified of this fact. Once all handles to a specific node have been released (except for the handle internally pinned in the node itself), the owner of the node is notified of this, so it can potentially destroy both any linked state and the node itself.
Node destruction is fully synchronized with any transaction. That is, a node and all its handles are valid in every message that is transmitted *before* the notification of its destruction, and no message after this notification will carry the ID of such a destroyed node. Note that message transactions are asynchronous: there is no unique point in time that a message is synchronized with another message. Hence, whether a specific handle passed with a message is still valid cannot be predicted by the sender, but only by one of the receivers.
enum bus1_handle_bits — node flags
enum bus1_handle_bits {
	BUS1_HANDLE_BIT_RELEASED,
	BUS1_HANDLE_BIT_DESTROYED
};
struct bus1_handle — object handle
struct bus1_handle {
	struct kref ref;
	atomic_t n_weak;
	atomic_t n_user;
	struct bus1_peer *holder;
	struct bus1_handle *anchor;
	struct bus1_handle *tlink;
	struct rb_node rb_to_peer;
	u64 id;
	struct bus1_queue_node qnode;
	union { /* unnamed */ };
};
bus1_handle_is_anchor — check whether handle is an anchor
bool bus1_handle_is_anchor(struct bus1_handle *h);
bus1_handle_is_live — check whether handle is live
bool bus1_handle_is_live(struct bus1_handle *h);
bus1_handle_is_public — check whether handle is public
bool bus1_handle_is_public(struct bus1_handle *h);
bus1_handle_ref — acquire object reference
struct bus1_handle *bus1_handle_ref(struct bus1_handle *h);
bus1_handle_unref — release object reference
struct bus1_handle *bus1_handle_unref(struct bus1_handle *h);
bus1_handle_acquire — acquire weak/strong reference
struct bus1_handle *bus1_handle_acquire(struct bus1_handle *h, bool strong);
bus1_handle_release — release weak/strong reference
struct bus1_handle *bus1_handle_release(struct bus1_handle *h, bool strong);
bus1_handle_release_n — release multiple references
struct bus1_handle *bus1_handle_release_n(struct bus1_handle *h, unsigned int n, bool strong);
h: handle to operate on, or NULL
n: number of references to release
strong: whether to release strong references
bus1_handle_new_anchor — allocate new anchor handle
struct bus1_handle *bus1_handle_new_anchor(struct bus1_peer *holder);
bus1_handle_new_remote — allocate new remote handle
struct bus1_handle *bus1_handle_new_remote(struct bus1_peer *holder, struct bus1_handle *other);
bus1_handle_acquire_owner — acquire owner of a handle
struct bus1_peer *bus1_handle_acquire_owner(struct bus1_handle *handle);
bus1_handle_ref_by_other — lookup handle on a peer
struct bus1_handle *bus1_handle_ref_by_other(struct bus1_peer *peer, struct bus1_handle *handle);
This looks for a handle held by peer which points to the same node as handle (i.e., it is linked to handle->anchor). If peer does not hold such a handle, this returns NULL. Otherwise, an object reference is acquired and returned as pointer.
The caller must hold an active reference to peer.
bus1_handle_acquire_locked — acquire strong reference
struct bus1_handle *bus1_handle_acquire_locked(struct bus1_handle *handle, bool strong);
bus1_handle_acquire_slow — slow-path of handle acquisition
struct bus1_handle *bus1_handle_acquire_slow(struct bus1_handle *handle, bool strong);
bus1_handle_release_slow — slow-path of handle release
void bus1_handle_release_slow(struct bus1_handle *handle, bool strong);
bus1_handle_destroy_locked — stage node destruction
void bus1_handle_destroy_locked(struct bus1_handle *handle, struct bus1_tx *tx);
This stages a destruction on handle. That is, it marks handle as destroyed and stages a release-notification for all live handles via tx. It is the responsibility of the caller to commit tx.
The given handle must be an anchor and not destroyed, yet. Furthermore, the caller must hold the local-lock and data-lock of the owner.
bus1_handle_is_live_at — check whether handle is live at a given time
bool bus1_handle_is_live_at(struct bus1_handle *h, u64 timestamp);
This checks whether the handle h is live at the time of timestamp. The caller must make sure that timestamp was acquired on the clock of the holder of h.
Note that this does not synchronize on the node owner. That is, usually you want to call this at the time of RECV, so it is guaranteed that there is no staging message in front of timestamp. Otherwise, a node owner might acquire a commit-timestamp for the destruction of h lower than timestamp.
The caller must hold the data-lock of the holder of h.
bus1_handle_import — import handle
struct bus1_handle *bus1_handle_import(struct bus1_peer *peer, u64 id, bool *is_newp);
This searches the ID-namespace of peer for a handle with the given ID. If found, it is referenced, returned to the caller, and is_newp is set to false.
If not found and id is a remote ID, then an error is returned. But if it is a local ID, a new handle is created and placed in the lookup tree. In this case is_newp is set to true.
bus1_handle_forget — forget handle
void bus1_handle_forget(struct bus1_handle *h);
If h is not public, but linked into the ID-lookup tree, this will remove it from the tree and clear the ID of h. It basically undoes what bus1_handle_import and bus1_handle_export do.
Note that there is no counter in bus1_handle_import or bus1_handle_export. That is, if you call bus1_handle_import multiple times, a single bus1_handle_forget undoes it. It is the caller's responsibility not to release the local-lock randomly, and to properly detect cases where the same handle is used multiple times.
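For illustration, a typical import-and-undo sequence under the local-lock might look like this sketch (locking and the actual use of the handle are elided; ERR_PTR-style returns are assumed):

{
	struct bus1_handle *h;
	bool is_new;

	h = bus1_handle_import(peer, id, &is_new);
	if (IS_ERR(h))
		return PTR_ERR(h);

	/* ... use h; if the operation fails: ... */

	if (is_new)
		bus1_handle_forget(h);	/* undoes the single import */
	bus1_handle_unref(h);
}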
Different users can communicate via bus1, and many resources are shared between multiple users. The bus1_user object represents the UID of a user, like “struct user_struct” does in the kernel core. It is used to account global resources, apply limits, and calculate quotas if different UIDs communicate with each other.
All dynamic resources have global per-user limits, which cannot be exceeded by a user. They prevent a single user from exhausting local resources. Each peer that is created is always owned by the user that initialized it, and all resources allocated on that peer are accounted on that pinned user. In addition to the global limits, there are local limits per peer that can be controlled by each peer individually (e.g., specifying a maximum pool size). Those local limits allow a user to distribute the globally available resources across its peer instances.
Since bus1 allows communication across UID boundaries, any such transmission of resources must be properly accounted. Bus1 employs dynamic quotas to fairly distribute available resources. Those quotas make sure that available resources of a peer cannot be exhausted by remote UIDs, but are fairly divided among all communicating peers.
struct bus1_user_usage — usage counters
struct bus1_user_usage {
	atomic_t n_slices;
	atomic_t n_handles;
	atomic_t n_bytes;
	atomic_t n_fds;
};
struct bus1_user_limits — resource limit counters
struct bus1_user_limits {
	atomic_t n_slices;		/* remaining quota for owned slices */
	atomic_t n_handles;		/* remaining quota for owned handles */
	atomic_t n_inflight_bytes;	/* remaining quota for inflight bytes */
	atomic_t n_inflight_fds;	/* remaining quota for inflight FDs */
	unsigned int max_slices;	/* maximum number of owned slices */
	unsigned int max_handles;	/* maximum number of owned handles */
	unsigned int max_inflight_bytes;	/* maximum number of inflight bytes */
	unsigned int max_inflight_fds;	/* maximum number of inflight FDs */
	struct idr usages;		/* idr of usage entries per uid */
};
struct bus1_user — resource accounting for users
struct bus1_user {
	struct kref ref;
	kuid_t uid;
	struct mutex lock;
	union { /* unnamed */ };
};
bus1_user_modexit — clean up global resources of user accounting
void bus1_user_modexit(void);
This function cleans up any remaining global resources that were allocated by the user accounting helpers. The caller must make sure that no user object is referenced anymore, before calling this. This function just clears caches and verifies nothing is leaked.
This is meant to be called on module-exit.
bus1_user_limits_init — initialize resource limit counter
void bus1_user_limits_init(struct bus1_user_limits *limits, struct bus1_user *source);
bus1_user_limits_deinit — deinitialize resource limit counter
void bus1_user_limits_deinit(struct bus1_user_limits *limits);
bus1_user_ref_by_uid — get a user object for a uid
struct bus1_user *bus1_user_ref_by_uid(kuid_t uid);
bus1_user_ref — acquire reference
struct bus1_user *bus1_user_ref(struct bus1_user *user);
bus1_user_unref — release reference
struct bus1_user *bus1_user_unref(struct bus1_user *user);
bus1_user_charge — charge a user resource
int bus1_user_charge(atomic_t *global, atomic_t *local, int charge);
global: global resource to charge on
local: local resource to charge on
charge: charge to apply
bus1_user_discharge — discharge a user resource
void bus1_user_discharge(atomic_t *global, atomic_t *local, int charge);
bus1_user_charge_quota — charge quota resources
int bus1_user_charge_quota(struct bus1_user *user, struct bus1_user *actor, struct bus1_user_limits *limits, int n_slices, int n_handles, int n_bytes, int n_fds);
user: user to charge on
actor: user to charge as
limits: local limits to charge on
n_slices: number of slices to charge
n_handles: number of handles to charge
n_bytes: number of bytes to charge
n_fds: number of FDs to charge
This charges the given resources on user and limits. It does both local and remote charges. Everything is charged on behalf of user actor.
Negative charges always succeed. Positive charges might fail if quota is denied. Note that a single call is always atomic, so either all succeed or all fail. Hence, it makes little sense to mix negative and positive charges in a single call.
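For example, a transmission path might charge up-front and discharge if the operation is aborted (a sketch; the counts and objects are whatever the caller tracks):

{
	int r;

	r = bus1_user_charge_quota(user, actor, limits,
				   n_slices, n_handles, n_bytes, n_fds);
	if (r < 0)
		return r;	/* quota denied; nothing was charged */

	/* ... if the operation is aborted later, undo the charges ... */

	bus1_user_discharge_quota(user, actor, limits,
				  n_slices, n_handles, n_bytes, n_fds);
}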
bus1_user_discharge_quota — discharge quota resources
void bus1_user_discharge_quota(struct bus1_user *user, struct bus1_user *actor, struct bus1_user_limits *l_local, int n_slices, int n_handles, int n_bytes, int n_fds);
bus1_user_commit_quota — commit quota resources
void bus1_user_commit_quota(struct bus1_user *user, struct bus1_user *actor, struct bus1_user_limits *l_local, int n_slices, int n_handles, int n_bytes, int n_fds);
The bus1_active object implements active references. They work similarly to plain object reference counters, but allow disabling any new references from being taken.
Each bus1_active object goes through a set of states:
  NEW: Initial state, no active references can be acquired
  ACTIVE: Live state, active references can be acquired
  DRAINING: Deactivated but lingering, no active references can be acquired
  DRAINED: Deactivated and all active references were dropped
  RELEASED: Fully drained and synchronously released
Initially, all bus1_active objects are in state NEW. As soon as they're activated, they enter ACTIVE and active references can be acquired. This is the normal, live state. Once the object is deactivated, it enters state DRAINING. No new active references can be acquired, but some threads might still own active references. Once all those are dropped, the object enters state DRAINED. Now the object can be released a *single* time, before it enters state RELEASED and is finished. It cannot be re-used anymore.
Active-references are very useful to track threads that call methods on an object. As long as a method is running, an active reference is held, and as such the object is usually protected from being destroyed. The destructor of the object needs to deactivate *and* drain the object, before releasing resources.
Note that active-references cannot be used to manage their own backing memory. That is, they do not replace normal reference counts.
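The full lifecycle, expressed with the helpers documented below (a sketch; obj, its wait-queue, and the obj_cleanup callback are illustrative):

{
	bus1_active_init_private(&obj->active);		/* state: NEW */
	bus1_active_activate(&obj->active);		/* state: ACTIVE */

	/* ... users call bus1_active_acquire() / bus1_active_release() ... */

	bus1_active_deactivate(&obj->active);		/* state: DRAINING */
	bus1_active_drain(&obj->active, &obj->waitq);	/* state: DRAINED */
	bus1_active_cleanup(&obj->active, &obj->waitq,
			    obj_cleanup, NULL);		/* state: RELEASED */
	bus1_active_deinit(&obj->active);
}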
struct bus1_active — active references
struct bus1_active {
	atomic_t count;
#ifdef CONFIG_DEBUG_LOCK_ALLOC
	struct lockdep_map dep_map;
#endif
};
bus1_active_acquire — acquire active reference
struct bus1_active *bus1_active_acquire(struct bus1_active *active);
This acquires an active reference to the passed object. If the object was not activated yet, or if it was already deactivated, this will fail and return NULL. If a reference was successfully acquired, this will return active.
If NULL is passed, this is a no-op and always returns NULL.
This behaves as a down_read_trylock. Use bus1_active_release to release the reference again and get the matching up_read.
bus1_active_release — release active reference
struct bus1_active *bus1_active_release(struct bus1_active *active, wait_queue_head_t *waitq);
active: object to release active reference of, or NULL
waitq: wait-queue linked to active, or NULL
bus1_active_init_private — initialize object
void bus1_active_init_private(struct bus1_active *active);
bus1_active_deinit — destroy object
void bus1_active_deinit(struct bus1_active *active);
Destroy an active-object. The object must have been initialized via bus1_active_init, deactivated via bus1_active_deactivate, drained via bus1_active_drain, and cleaned via bus1_active_cleanup, before you can destroy it. Alternatively, it can also be destroyed if still in state NEW.
This function only does sanity checks, it does not modify the object itself. There is no allocated memory, so there is nothing to do.
bus1_active_is_new — check whether object is new
bool bus1_active_is_new(struct bus1_active *active);
bus1_active_is_active — check whether object is active
bool bus1_active_is_active(struct bus1_active *active);
bus1_active_is_deactivated — check whether object was deactivated
bool bus1_active_is_deactivated(struct bus1_active *active);
bus1_active_is_drained — check whether object is drained
bool bus1_active_is_drained(struct bus1_active *active);
bus1_active_activate — activate object
bool bus1_active_activate(struct bus1_active *active);
bus1_active_deactivate — deactivate object
bool bus1_active_deactivate(struct bus1_active *active);
bus1_active_drain — drain active references
void bus1_active_drain(struct bus1_active *active, wait_queue_head_t *waitq);
This waits for all active-references on active to be dropped. It uses the passed wait-queue to sleep. It must be the same wait-queue that is used when calling bus1_active_release.
The caller must guarantee that bus1_active_deactivate was called before.
This function can be safely called in parallel on multiple CPUs.
Semantically (and also enforced by lockdep), this call behaves like a down_write, followed by an up_write, on this active object.
bus1_active_cleanup — cleanup drained object
bool bus1_active_cleanup(struct bus1_active *active, wait_queue_head_t *waitq, void (*cleanup)(struct bus1_active *, void *), void *userdata);
active: object to release
waitq: wait-queue linked to active, or NULL
cleanup: cleanup callback, or NULL
userdata: userdata for callback
This performs the final object cleanup. The caller must guarantee that the object is drained, by calling bus1_active_drain.
This function invokes the passed cleanup callback on the object. However, it guarantees that this is done exactly once. If there are multiple parallel callers, this will pick one randomly and make all others wait until it is done. If you call this after it was already cleaned up, this is a no-op and only serves as a barrier.
If waitq is NULL, the wait is skipped and the call returns immediately. In this case, another thread has entered before, but there is no guarantee that it has finished executing the cleanup callback, yet.
If waitq is non-NULL, this call behaves like a down_write, followed by an up_write, just like bus1_active_drain. If waitq is NULL, this rather behaves like a down_write_trylock, optionally followed by an up_write.
bus1_active_lockdep_acquired — acquire lockdep reader
void bus1_active_lockdep_acquired(struct bus1_active *active);
Whenever you acquire an active reference via bus1_active_acquire, this function is implicitly called afterwards. It enables the lockdep annotations and tells lockdep that you acquired the active reference.
However, lockdep cannot support arbitrary depths, hence we allow temporarily dropping the lockdep-annotation via bus1_active_lockdep_released, and acquiring it again later via bus1_active_lockdep_acquired.
If you need to pin a large number of objects, you would acquire each of them individually via bus1_active_acquire. Then you would perform state tracking, etc., on that object. Before you continue with the next, you call bus1_active_lockdep_released, to pretend you released the lock (but you still retain your active reference). Now you continue with pinning the next object, and so on, until you have pinned all objects you need. If you then need to access one of your pinned objects (or want to release them eventually), you call bus1_active_lockdep_acquired before accessing the object. This enables the lockdep annotations again. This cannot fail, ever. You still own the active reference at all times. Once you're done with the single object, you either release your entire active reference via bus1_active_release, or you temporarily disable lockdep via bus1_active_lockdep_released again, in case you need the pinned object again later.
Note that you can acquire multiple active references just fine. The only reason those lockdep helpers are provided is if you need to acquire a *large* number at the same time. Lockdep is usually limited to a depth of 64, so you cannot hold more locks at the same time.
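A condensed sketch of that pinning pattern (the object array, counts, and member names are illustrative):

{
	size_t i;

	/* pin all objects, parking the lockdep annotation after each */
	for (i = 0; i < n; ++i) {
		if (bus1_active_acquire(&objs[i]->active))
			bus1_active_lockdep_released(&objs[i]->active);
	}

	/* to access one pinned object, re-enable its annotation first */
	bus1_active_lockdep_acquired(&objs[k]->active);
	/* ... use objs[k] ... */
	bus1_active_release(&objs[k]->active, &objs[k]->waitq);
}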
This implements a fixed-size list called bus1_flist. The size of the list must be constant over the lifetime of the list. The list can hold one arbitrary pointer per node.
Fixed lists are a combination of a linked list and a static array. That is, fixed lists behave like linked lists (no random access, but arbitrary size), but compare in speed with arrays (consecutive accesses are fast). Unlike fixed arrays, fixed lists can hold huge numbers of elements without requiring vmalloc, relying solely on small-size kmalloc allocations.
Internally, fixed lists are a singly-linked list of static arrays. This guarantees that iterations behave almost like on an array, except when crossing a batch-border.
Fixed lists can replace fixed-size arrays whenever you need to support large numbers of elements, but don't need random access. Fixed lists have ALMOST the same memory requirements as fixed-size arrays, except for one pointer of state per 'BUS1_FLIST_BATCH' elements. If only a small number of elements is stored in a fixed list (i.e., it requires only one batch), its memory requirements and iteration times are equivalent to fixed-size arrays.
bus1_flist_inline_size — calculate required inline size
size_t bus1_flist_inline_size(size_t n);
When allocating storage for an flist, this calculates the size of the initial array in bytes. Use bus1_flist_new directly if you want to allocate an flist on the heap. This helper is only needed if you embed an flist into another struct like this:

struct foo {
	...
	struct bus1_flist list[];
};

In that case the flist must be the last element, and the size in bytes required by it is returned by this function.
The inline-size of an flist is always bound to a fixed maximum. That is, regardless of n, this will always return a reasonable number that can be allocated via kmalloc.
bus1_flist_init — initialize an flist
void bus1_flist_init(struct bus1_flist *list, size_t n);
This initializes an flist of size n. It does NOT preallocate the memory, but only initializes list in a way that bus1_flist_deinit can be called on it. Use bus1_flist_populate to populate the flist.
This is only needed if the backing memory of list is shared with another object. If possible, use bus1_flist_new to allocate an flist on the heap and avoid this dance.
bus1_flist_deinit — deinitialize an flist
void bus1_flist_deinit(struct bus1_flist *list, size_t n);
bus1_flist_next — flist iterator
struct bus1_flist *bus1_flist_next(struct bus1_flist *iter, size_t *pos);
This advances an flist iterator by one position. iter must point to the current position, and the new position is returned by this function. pos must point to a variable that contains the current index position. That is, pos must be initialized to 0 and iter to the flist head.
Neither pos nor iter may be modified by anyone but this helper. In the loop body you can use iter->ptr to access the current element.
This iterator is normally used like this:
size_t pos, n = 128;
struct bus1_flist *e, *list = bus1_flist_new(n, GFP_KERNEL);

...

for (pos = 0, e = list; pos < n; e = bus1_flist_next(e, &pos)) {
	/* ... access e->ptr ... */
}
bus1_flist_walk — walk flist in batches
size_t bus1_flist_walk(struct bus1_flist *list, size_t n, struct bus1_flist **iter, size_t *pos);
This walks an flist in batches of size up to BUS1_FLIST_BATCH. It is normally used like this:
size_t pos, z, n = 65536;
struct bus1_flist *e, *list = bus1_flist_new(n, GFP_KERNEL);

...

pos = 0;
while ((z = bus1_flist_walk(list, n, &e, &pos)) > 0) {
	/* ... access e[0 ... z-1].ptr ... */
	/* invariant: z <= BUS1_FLIST_BATCH */
	/* invariant: e[i].ptr is the i'th element of the current batch */
}
bus1_flist_populate — populate an flist
int bus1_flist_populate(struct bus1_flist *list, size_t n, gfp_t gfp);
bus1_flist_new — allocate new flist
struct bus1_flist *bus1_flist_new(size_t n, gfp_t gfp);
A pool is a shmem-backed memory pool shared between userspace and the kernel. The pool is used to transfer memory from the kernel to userspace without requiring userspace to allocate the memory.
The pool is managed in slices, which are published to userspace when they are ready to be read and must be released by userspace when userspace is done with them.
Userspace has read-only access to its pools and the kernel has read-write access, but published slices are not altered.
struct bus1_pool_slice — pool slice
struct bus1_pool_slice {
	u32 offset;		/* relative offset in parent pool */
	u32 free:1;		/* whether this slice is in-use or not */
	u32 ref_kernel:1;	/* whether a kernel reference exists */
	u32 ref_user:1;		/* whether a user reference exists */
	struct list_head entry;	/* link into linear list of slices */
	struct rb_node rb;	/* link to busy/free rb-tree */
};
Each chunk of memory in the pool is managed as a slice. A slice can be accessible by both the kernel and user-space, and their access rights are managed independently. As long as the kernel has a reference to a slice, its offset and size can be accessed freely and will not change. Once the kernel drops its reference, it must not access the slice, anymore.
To allow user-space access, the slice must be published. This marks the slice as referenced by user-space. Note that all slices are always readable by user-space, since the entire pool can be mapped. Publishing a slice only marks the slice as referenced by user-space, so it will not be modified or removed. Once user-space releases its reference, it should no longer access the slice as it might be modified and/or overwritten by other data.
Only if neither kernel nor user-space have a reference to a slice, the slice is released. The kernel reference can only be acquired/released once, but user-space references can be published/released several times. In particular, if the kernel retains a reference when a slice is published and later released by userspace, the same slice can be published again in the future.
Note that both kernel-space and user-space must be aware that slice references are not ref-counted. They are simple booleans. For the kernel-side this is obvious, as no ref/unref functions are provided. But user-space must be aware that the same slice being published several times does not increase the reference count.
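To make these reference rules concrete, here is a sketch of a slice's kernel-side lifecycle (error handling, locking, and the data copy are elided; ERR_PTR-style returns are assumed):

{
	struct bus1_pool_slice *slice;

	slice = bus1_pool_alloc(pool, size);		/* kernel ref only */
	if (IS_ERR(slice))
		return PTR_ERR(slice);

	/* ... fill the slice, e.g., via bus1_pool_write_kvec() ... */

	bus1_pool_publish(pool, slice);			/* add user-space ref */
	bus1_pool_release_kernel(pool, slice);		/* drop kernel ref */

	/* later, on behalf of user-space ('offset' identifies the slice): */
	bus1_pool_release_user(pool, offset, NULL);	/* now actually freed */
}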
struct bus1_pool — client pool
struct bus1_pool {
	struct file *f;			/* backing shmem file */
	size_t allocated_size;		/* currently allocated memory in bytes */
	struct list_head slices;	/* all slices sorted by address */
	struct rb_root slices_busy;	/* tree of allocated slices */
	struct rb_root slices_free;	/* tree of free slices */
};
A pool is used to allocate memory slices that can be shared between kernel-space and user-space. A pool is always backed by a shmem-file and puts a simple slice-allocator on top. User-space gets read-only access to the entire pool, kernel-space gets read/write access via accessor-functions.
Pools are used to transfer large sets of data to user-space, without requiring a round-trip to ask user-space for a suitable memory chunk. Instead, the kernel simply allocates slices in the pool and tells user-space where it put the data.
All pool operations must be serialized by the caller. No internal lock is provided. Slices can be queried/modified unlocked. But any pool operation (allocation, release, flush, ...) must be serialized.
bus1_pool_slice_is_public — check whether a slice is public
bool bus1_pool_slice_is_public(struct bus1_pool_slice *slice);
bus1_pool_init — create memory pool
int bus1_pool_init(struct bus1_pool *pool, const char *filename);
bus1_pool_deinit — destroy pool
void bus1_pool_deinit(struct bus1_pool *pool);
This destroys a pool that was previously created via bus1_pool_init. If NULL is passed, or if pool->f is NULL (i.e., the pool was initialized to 0 but not yet created via bus1_pool_init), then this is a no-op.
The caller must make sure that no kernel reference to any slice exists. Any pending user-space reference to any slice is dropped by this function.
bus1_pool_alloc — allocate memory
struct bus1_pool_slice *bus1_pool_alloc(struct bus1_pool *pool, size_t size);
This allocates a new slice of size bytes from the memory pool at pool. The slice must be released via bus1_pool_release_kernel by the caller. All slices are aligned to 8 bytes (both offset and size).
If no suitable slice can be allocated, an error is returned.
Each pool slice can have two different references, a kernel reference and a user-space reference. Initially, it only has a kernel-reference, which must be dropped via bus1_pool_release_kernel. However, if you previously published the slice via bus1_pool_publish, it will also have a user-space reference, which user-space must (indirectly) release via a call to bus1_pool_release_user.
A slice is only actually freed once neither reference exists anymore. Hence, a pool slice can be held by both the kernel and user-space, and both can rely on it staying around as long as they wish.
bus1_pool_release_kernel — release kernel-owned slice reference
struct bus1_pool_slice *bus1_pool_release_kernel(struct bus1_pool *pool, struct bus1_pool_slice *slice);
This releases the kernel-reference to a slice that was previously allocated via bus1_pool_alloc. This only releases the kernel reference to the slice. If the slice was already published to user-space, then their reference is left untouched. Once both references are gone, the memory is actually freed.
bus1_pool_publish — publish a slice
void bus1_pool_publish(struct bus1_pool *pool, struct bus1_pool_slice *slice);
Publish a pool slice to user-space, so user-space can get access to it via the mapped pool memory. If the slice was already published, this is a no-op. Otherwise, the slice is marked as public and will only get freed once both the user-space reference *and* kernel-space reference are released.
bus1_pool_release_user — release a public slice
int bus1_pool_release_user(struct bus1_pool *pool, size_t offset, size_t *n_slicesp);
pool: pool to operate on
offset: offset of slice to release
n_slicesp: output variable to store number of released slices, or NULL
Release the user-space reference to a pool-slice, specified via the offset of the slice. If both the user-space reference *and* the kernel-space reference to the slice are gone, the slice will actually be freed.
If no slice exists with the given offset, or if there is no user-space reference to the specified slice, an error is returned.
bus1_pool_flush — flush all user references
void bus1_pool_flush(struct bus1_pool *pool, size_t *n_slicesp);
bus1_pool_mmap — mmap the pool
int bus1_pool_mmap(struct bus1_pool *pool, struct vm_area_struct *vma);
bus1_pool_write_iovec — copy user memory to a slice
ssize_t bus1_pool_write_iovec(struct bus1_pool *pool, struct bus1_pool_slice *slice, loff_t offset, struct iovec *iov, size_t n_iov, size_t total_len);
pool: pool to operate on
slice: slice to write to
offset: relative offset into slice memory
iov: iovec array, pointing to data to copy
n_iov: number of elements in iov
total_len: total number of bytes to copy
bus1_pool_write_kvec — copy kernel memory to a slice
ssize_t bus1_pool_write_kvec(struct bus1_pool *pool, struct bus1_pool_slice *slice, loff_t offset, struct kvec *iov, size_t n_iov, size_t total_len);
pool: pool to operate on
slice: slice to write to
offset: relative offset into slice memory
iov: kvec array, pointing to data to copy
n_iov: number of elements in iov
total_len: total number of bytes to copy
(You are highly encouraged to read up on 'Lamport Timestamps', the concept of 'happened-before', and 'causal ordering'. The queue implementation has its roots in Lamport Timestamps, treating a set of local CPUs as a distributed system to avoid any global synchronization.)
A message queue is a FIFO, i.e., messages are linearly ordered by the time they were sent. Moreover, atomic delivery of messages to multiple queues is supported, without any global synchronization, i.e., the order of message delivery is consistent across queues.
Messages can be destined for multiple queues, hence, we need to be careful that all queues get a consistent order of incoming messages. We define the concept of 'global order' to provide a basic set of guarantees. This global order is a partial order on the set of all messages. The order is defined as:
1) If a message B was queued *after* a message A, then: A < B
2) If a message B was queued *after* a message A was dequeued, then: A < B
3) If a message B was dequeued *after* a message A on the same queue, then: A < B
(Note: Causality is honored. 'after' and 'before' do not refer to the same task, nor the same queue, but rather any kind of synchronization between the two operations.)
The queue object implements this global order in a lockless fashion. It solely relies on a distributed clock on each queue. Each message to be sent causes a clock tick on the local clock and on all destination clocks. Furthermore, all clocks are synchronized, meaning they're fast-forwarded in case they're behind the highest of all participating peers. No global state tracking is involved.
During a message transaction, we first queue a message as 'staging' entry in each destination with a preliminary timestamp. This timestamp is explicitly odd numbered. Any odd numbered timestamp is considered 'staging' and causes *any* message ordered after it to be blocked until it is no longer staging. This allows us to queue the message in parallel with any racing multicast, and be guaranteed that all possible conflicts are blocked until we eventually commit a transaction. To commit a transaction (after all staging entries are queued), we choose the highest timestamp we have seen across all destinations and re-queue all our entries on each peer using that timestamp. Here we use a commit timestamp (even numbered).
With this in mind, we define that a client can only dequeue messages from its queue that have an even timestamp. Furthermore, if there is a message queued with an odd timestamp that is lower than the even timestamp of another message, then neither message can be dequeued. They're considered to be in-flight conflicts. This guarantees that two concurrent multicast messages can be queued without any *global* locks, but either can only be dequeued by a peer if their ordering has been established (via commit timestamps).
NOTE: A fully committed message is not guaranteed to be ready to be dequeued, as it may be blocked by a staging entry. This means that for an arbitrary (though bounded) time after a message transaction completes, the queue may still appear to be empty. In other words, message transmission is not instantaneous. It would be possible to change this at the cost of shortly blocking each message transaction on all other conflicting tasks.
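A sketch of the two-phase protocol for a single destination queue (real code iterates all destinations under their respective data-locks, which is elided; the return-value semantics of bus1_queue_stage and the rounding to an even commit timestamp are assumptions for illustration):

{
	u64 ts, max_ts = 0;

	/* phase 1: stage the entry with a preliminary (odd) timestamp */
	ts = bus1_queue_tick(queue);
	ts = bus1_queue_stage(queue, node, ts);	/* assumed to return the effective timestamp */
	if (ts > max_ts)
		max_ts = ts;

	/* ... repeat for every destination queue, tracking the maximum ... */

	/* phase 2: commit everywhere with a single even commit timestamp */
	max_ts = (max_ts + 1) & ~1ULL;		/* round up to even (illustrative) */
	bus1_queue_sync(queue, max_ts);
	bus1_queue_commit_staged(queue, waitq, node, max_ts);
}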
The queue implementation uses an rb-tree (ordered by timestamps and sender), with a cached pointer to the front of the queue.
struct bus1_queue_node — node into message queue
struct bus1_queue_node {
	union { /* unnamed */ };
	u64 timestamp_and_type;
	struct bus1_queue_node *next;
	void *group;
	void *owner;
};
struct bus1_queue — message queue
struct bus1_queue {
	u64 clock;
	u64 flush;
	struct rb_node *leftmost;
	struct rb_node __rcu *front;
	struct rb_root messages;
};
bus1_queue_node_init — initialize queue node
void bus1_queue_node_init(struct bus1_queue_node *node, unsigned int type);
bus1_queue_node_deinit — destroy queue node
void bus1_queue_node_deinit(struct bus1_queue_node *node);
bus1_queue_node_get_type — query node type
unsigned int bus1_queue_node_get_type(struct bus1_queue_node *node);
bus1_queue_node_get_timestamp — query node timestamp
u64 bus1_queue_node_get_timestamp(struct bus1_queue_node *node);
bus1_queue_node_is_queued — check whether a node is queued
bool bus1_queue_node_is_queued(struct bus1_queue_node *node);
bus1_queue_node_is_staging — check whether a node is marked staging
bool bus1_queue_node_is_staging(struct bus1_queue_node *node);
bus1_queue_tick — increment queue clock
u64 bus1_queue_tick(struct bus1_queue *queue);
bus1_queue_sync — sync queue clock
u64 bus1_queue_sync(struct bus1_queue *queue, u64 timestamp);
bus1_queue_is_readable_rcu — check whether a queue is readable
bool bus1_queue_is_readable_rcu(struct bus1_queue *queue);
bus1_queue_compare — comparator for queue ordering
int bus1_queue_compare(u64 a_ts, void *a_g, u64 b_ts, void *b_g);
a_ts: timestamp of first node to compare
a_g: group of first node to compare
b_ts: timestamp of second node to compare against
b_g: group of second node to compare against
Messages on a message queue are ordered. This function implements the comparator used for all message ordering in queues. Two tags are used for ordering, the timestamp and the group-tag of a node. Both must be passed to this function.
This compares the tuples (a_ts, a_g) and (b_ts, b_g).
bus1_queue_deinit — destroy queue
void bus1_queue_deinit(struct bus1_queue *queue);
This destroys a queue that was previously initialized via bus1_queue_init.
The caller must make sure the queue is empty before calling this.
This function is a no-op, and only does safety checks on the queue. It is safe to call this function multiple times on the same queue.
The caller must guarantee that the backing memory of queue is freed in an rcu-delayed manner.
bus1_queue_flush — flush message queue
struct bus1_queue_node *bus1_queue_flush(struct bus1_queue *queue, u64 ts);
This flushes all committed entries from queue and returns them as a singly-linked list for the caller to clean up. Staged entries are left in the queue.
You must acquire a timestamp before flushing the queue (e.g., tick the clock). This timestamp must be given as ts. Only entries lower than, or equal to, this timestamp are flushed. The timestamp is remembered as queue->flush.
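For instance, a peer-reset path might drain its queue like this (a sketch; cleanup of each entry depends on its type and is elided):

{
	struct bus1_queue_node *node, *list;

	list = bus1_queue_flush(queue, bus1_queue_tick(queue));
	while ((node = list)) {
		list = node->next;	/* singly-linked, see the struct above */
		/* ... release the entry's resources ... */
	}
}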
bus1_queue_stage — stage queue entry with fresh timestamp
u64 bus1_queue_stage(struct bus1_queue *queue, struct bus1_queue_node *node, u64 timestamp);
Link a queue entry with a new timestamp. The staging entry blocks all messages with timestamps synced on this queue in the future, as well as any messages with a timestamp greater than timestamp. However, it does not block any messages already committed to this queue.
The caller must provide an even timestamp and the entry may not already have been committed.
bus1_queue_commit_staged — commit staged queue entry with new timestamp
void bus1_queue_commit_staged(struct bus1_queue *queue, wait_queue_head_t *waitq, struct bus1_queue_node *node, u64 timestamp);
queue: queue to operate on
waitq: wait-queue to wake up on change, or NULL
node: queue entry to commit
timestamp: new timestamp for node
Update a staging queue entry according to timestamp. The timestamp must be even and the entry may not already have been committed.
Furthermore, the queue clock must be synced with the new timestamp *before* staging an entry. Similarly, the timestamp of an entry can only be increased, never decreased.
bus1_queue_commit_unstaged — commit unstaged queue entry with new timestamp
void bus1_queue_commit_unstaged(struct bus1_queue *queue, wait_queue_head_t *waitq, struct bus1_queue_node *node);
bus1_queue_commit_synthetic — commit synthetic entry
bool bus1_queue_commit_synthetic(struct bus1_queue *queue, struct bus1_queue_node *node, u64 timestamp);
This inserts the unqueued entry node into the queue with the commit timestamp timestamp (just like bus1_queue_commit_unstaged). However, it only does so if the new entry would NOT become the new front. It thus allows inserting fake synthetic entries somewhere in the middle of a queue, but accepts the possibility of failure.
bus1_queue_remove — remove entry from queue
void bus1_queue_remove(struct bus1_queue *queue, wait_queue_head_t *waitq, struct bus1_queue_node *node);
bus1_queue_peek — peek first available entry
struct bus1_queue_node *bus1_queue_peek(struct bus1_queue *queue, bool *morep);
This returns a pointer to the first available entry in the given queue, or NULL if there is none. The queue stays unmodified and the returned entry remains on the queue.
This only returns entries that are ready to be dequeued. Entries that are still in staging mode will not be considered.
If a node is returned, its group-state is stored in morep. That is, if there are more messages queued as part of the same transaction, true is stored in morep; if the returned node is the last part of the transaction, false is stored.