CTF FILE FORMAT
---------------

Format: v2 + slices (current v3, trunk pre-1.2)

A CTF file ("container", since it is usually not a file, but an ELF section or
something or that sort) is divided into a number of sections internally,
identified by offset from the header. In order, the sections are:

 - Type label section
 - Data object section
 - Function info section
 - Variable info section
 - Data type section
 - String table

We'll consider these in order of importance (not the same as order in the file).

Other things in the header:
  - a preamble containing a magic number (used to determine container
    endianness: libctf will endian-flip foreign-endian containers into the
    native endianness at open time), a version number, whose current value is
    the CTF_VERSION constant, and a set of CTF_F global flags
  - a parent container name and label, which indicates (in some
    consumer-dependent way) the name of the container containing types whose ID
    has its MSB turned on (the "parent container"): it is only nonzero if this
    container is not itself a parent.  This allows types to be shared between
    containers: with one container being the parent of potentially many others.
    (The parent label has space allocated in the header, but is not used by any
    code in libctf at present.)

This does mean that a container cannot be used both as a parent and as a child
container at the same time, because type IDs referring to types within the same
container will have their MSB turned on if this was constructed as a parent
container.  While there is a parent name and parent label in the header, it is
purely up to the CTF consumer and convention how this is interpreted: neither
libctf nor the format prohibits ctf_import()ing any container at all as a parent
container, though you should in general import the same parent at consumption
time as you did when you generated the container, or things wil misbehave.


Data type section
-----------------

This is the core section in a CTF file, an array of variable-length entries,
each entry a struct ctf_stype or struct ctf_type followed by optional
variable-length data.  Each array index is transformed into a type ID by
flipping on the MSB iff this is a parent type container.  These type IDs are how
types are referenced within CTF containers.  The ID of each type is not stored
witih the type, but is implied by its array index.

The ctf_type_t and ctf_stype_t act as a discriminated union with an identical
first few members:

typedef struct ctf_stype
{
  uint32_t ctt_name;		/* Reference to name in string table.  */
  uint32_t ctt_info;		/* Encoded kind, variant length (see below).  */
  union
  {
    uint32_t ctt_size;		/* Size of entire type in bytes.  */
    uint32_t ctt_type;		/* Reference to another type.  */
  };
} ctf_stype_t;

All types are represented by an instance of one of these structures: ctt_name is
0 for unnamed types, while ctt_info is a tiny bitfielded structure accessed via
masking:

               ------------------------
   ctt_info:   | kind | isroot | vlen |
               ------------------------
               31    26    25  24     0
where

kind: a CTF_K_* constant indicating whether this type is an int, a float, an array,
      a pointer, a structure or what-have-you (see below)
isroot: is 1 if this type has a name, 0 otherwise
vlen: the length of a kind-specific variable data region ("variant data") which
      immediately follows the ctf_stype or ctf_type structure, and contains 
      type-kind-specific properties (array length, an array of structure
      members, or whatever). The data in the vlen region is the closest thing to
      most of the attributes used by DWARF to describe types.  In general, only
      kinds for which the vlen is actually variable can be trusted to have
      useful values in this field: for all other kinds, the vlen is meaningless
      and is usually hardwwiired for that kind where needed.  ctf.h defines the
      currently-valid set of kinds:

#define CTF_K_UNKNOWN	0	/* Unknown type (used for padding).  */
#define CTF_K_INTEGER	1	/* Variant data is CTF_INT_DATA (see below).  */
#define CTF_K_FLOAT	2	/* Variant data is CTF_FP_DATA (see below).  */
#define CTF_K_POINTER	3	/* ctt_type is referenced type.  */
#define CTF_K_ARRAY	4	/* Variant data is single ctf_array_t.  */
#define CTF_K_FUNCTION	5	/* ctt_type is return type, variant data is
				   list of argument types (unsigned short's for v1,
				   uint32_t's for v2).  */
#define CTF_K_STRUCT	6	/* Variant data is list of ctf_member_t's.  */
#define CTF_K_UNION	7	/* Variant data is list of ctf_member_t's.  */
#define CTF_K_ENUM	8	/* Variant data is list of ctf_enum_t's.  */
#define CTF_K_FORWARD	9	/* No additional data; ctt_name is tag.  */
#define CTF_K_TYPEDEF	10	/* ctt_type is referenced type.  */
#define CTF_K_VOLATILE	11	/* ctt_type is base type.  */
#define CTF_K_CONST	12	/* ctt_type is base type.  */
#define CTF_K_RESTRICT	13	/* ctt_type is base type.  */
#define CTF_K_SLICE	14	/* Variant data is a ctf_slice_t.  */

#define CTF_K_MAX	63	/* Maximum possible (V2) CTF_K_* value.  */

Most of these obviously relate directly to specific C types: the only strange
one is 'slice', which allows you to take an integral type and modify its
bitness, for easy construction of bitfields (a slice of a CTF_K_ENUM is the only
way to specify an enum bitfield).


Looking at the rest of the ctf_stype_t, the ctt_size / ctt_type union is a trick
to reduce sizes. Most type-kinds that refer to another type (like pointers, or
cv-quals) have a fixed size, defined by the platform ABI (libctf calls this the
'machine model'): most types that have a variable size do not refer to another
type: all the most voluminous type kinds either do one or the other. So the
ctt_size / ctt_type contains whichever of these is applicable to the type in
question. (A few kinds, like structures or function pointers, refer to more than
one type ID: in this case, relevant type IDs are carried in the vlen data.)

For very large types the ctf_stype is not enough: the size of types can exceed
that representable by a uint32_t.  For these, we use a ctf_type_t instead:

typedef struct ctf_type
{
  uint32_t ctt_name;		/* Reference to name in string table.  */
  uint32_t ctt_info;		/* Encoded kind, variant length (see below).  */
  union
  {
    uint32_t ctt_size;		/* Always CTF_LSIZE_SENT.  */
    uint32_t ctt_type;		/* Do not use.  */
  };
  uint32_t ctt_lsizehi;		/* High 32 bits of type size in bytes.  */
  uint32_t ctt_lsizelo;		/* Low 32 bits of type size in bytes.  */
} ctf_type_t;

As noted above, this overlays on top of the ctf_stype_t, so almost all code can
just deal directly with whichever it prefers and check ctt_size to see if this
is a ctf_type or ctf_stype. You distinguish a ctf_type_t from a ctf_stype_t
because ctf_type_t has ctt_size == CTF_LSIZE_SENT (which is an invalid value for
a type ID).

Structure members use a similar trick. Almost all the time, the size of the
structure (the ctt_size) is less than 2^32 bytes, and the variable data is an
array of ctf_member_t's:

typedef struct ctf_member_v2
{
  uint32_t ctm_name;		/* Reference to name in string table.  */
  uint32_t ctm_offset;		/* Offset of this member in bits.  */
  uint32_t ctm_type;		/* Reference to type of member.  */
} ctf_member_t;

But if the structure is really huge (above CTF_LSTRUCT_THRESH bytes in length),
the ctt_size overflows the range of the ctm_offset, and every member in this
structure is instead described by the larger ctf_lmember_t:

typedef struct ctf_lmember_v2
{
  uint32_t ctlm_name;		/* Reference to name in string table.  */
  uint32_t ctlm_offsethi;	/* High 32 bits of member offset in bits.  */
  uint32_t ctlm_type;		/* Reference to type of member.  */
  uint32_t ctlm_offsetlo;	/* Low 32 bits of member offset in bits.  */
} ctf_lmember_t;

Unions are identical, and you can represent unnamed structure and union fields
as well with no extensions, by just adding members at the appropriate bit offset
in the containing struct/union (which is how unnamed structs/unions appear to
the programmer, and thus how they should appear to debuggers).


Structure members show the general theme for variant data: in most cases, the
variant data is some sort of structure, or an array of structures, or is not
present at all (things like typedefs don't have one): but function types, and
integral and floating-point types, use different sorts of vlen.  Function types
use a list of argument types with vlen / sizeof (uint32_t) members, with the
ctt_type being the return type; integer and floating-point types use flags
packed into a single uint32_t in the variant data encoding things like format,
bitness, etc:

#define CTF_INT_ENCODING(data) (((data) & 0xff000000) >> 24)
#define CTF_INT_OFFSET(data)   (((data) & 0x00ff0000) >> 16)
#define CTF_INT_BITS(data)     (((data) & 0x0000ffff))

#define CTF_INT_DATA(encoding, offset, bits) \
       (((encoding) << 24) | ((offset) << 16) | (bits))

#define CTF_INT_SIGNED	0x01	/* Integer is signed (otherwise unsigned).  */
#define CTF_INT_CHAR	0x02	/* Character display format.  */
#define CTF_INT_BOOL	0x04	/* Boolean display format.  */
#define CTF_INT_VARARGS	0x08	/* Varargs display format.  */

Or, for floats:

#define CTF_FP_ENCODING(data)  (((data) & 0xff000000) >> 24)
#define CTF_FP_OFFSET(data)    (((data) & 0x00ff0000) >> 16)
#define CTF_FP_BITS(data)      (((data) & 0x0000ffff))

#define CTF_FP_DATA(encoding, offset, bits) \
       (((encoding) << 24) | ((offset) << 16) | (bits))

/* Variant data when kind is CTF_K_FLOAT is an encoding in the top eight bits.  */
#define CTF_FP_ENCODING(data)	(((data) & 0xff000000) >> 24)

#define CTF_FP_SINGLE	1	/* IEEE 32-bit float encoding.  */
#define CTF_FP_DOUBLE	2	/* IEEE 64-bit float encoding.  */
#define CTF_FP_CPLX	3	/* Complex encoding.  */
#define CTF_FP_DCPLX	4	/* Double complex encoding.  */
#define CTF_FP_LDCPLX	5	/* Long double complex encoding.  */
#define CTF_FP_LDOUBLE	6	/* Long double encoding.  */
#define CTF_FP_INTRVL	7	/* Interval (2x32-bit) encoding.  */
#define CTF_FP_DINTRVL	8	/* Double interval (2x64-bit) encoding.  */
#define CTF_FP_LDINTRVL	9	/* Long double interval (2x128-bit) encoding.  */
#define CTF_FP_IMAGRY	10	/* Imaginary (32-bit) encoding.  */
#define CTF_FP_DIMAGRY	11	/* Long imaginary (64-bit) encoding.  */
#define CTF_FP_LDIMAGRY	12	/* Long double imaginary (128-bit) encoding.  */

#define CTF_FP_MAX	12	/* Maximum possible CTF_FP_* value */

Some of the formats, particularly in the floating-point realm, are somewhat
debatable, and we hope for discussion of what formats are appropriate (C99
complex types appear to be provided for, but not much else).

It is notable that there are two redundant ways to encode the bitness of
bitfield types, and three redundant ways to encode their offset: you can put
either directly into the encoding, or put it into a slice, or specify the offset
via bit-specific values in the containing structure or union. libctf hides as
much of this as possible by making it appear that slices are the same kind as
the kind they point to, contributing only an encoding: the only difference
between the slice and its underlying type is that you can call
ctf_type_reference() on the slice to get that underlying type, which you cannot
do on an int.

(In the header alone, but not in the data format, there is an additional
feature: the CTF_CHAR macro is an integral type of the same signedness as the
build target's char type, turning on CTF_INT_SIGNED, nor not, appropriately.)


Function info and data object sections
--------------------------------------

These two sections, taken together, map 1:1 to the symbols of type STT_OBJECT
and STT_FUNC in an ELF symbol table (usually the symbol table in the ELF object
in which the CTF section is embedded). It is generated by traversing the symbol
table, and whenever a suitable symbol is encountered, adding an entry for it
to the data object or function info sections, depending on whether this is a
STT_OBJECT or STT_FUNC symbol.

Both producer and consumer must agree on the definition of 'suitable', since
there is no cross-checking here, and if even one symbol is treated differently,
all symbols following it will be misattributed.

For both STT_FUNC and STT_OBJECT symbols, symbols that have a name that _START_
or _END_ or that is SHN_UNDEF are omitted; for STT_OBJECT symbols, we further
omit zero-valued SHN_ABS symbols.

The data object section is an array of type IDs, one entry per suitable entry in
the symbol table: each type ID is the type of the corresponding symbol.

The function object section is an array of things that (if they were in
structures rather than just a stream of bytes) would look fairly similar to the
variant data for CTF_K_FUNCTION types, described above:

uint32_t ctt_info; # vlen is number of args
ctf_id_t ctc_return;
ctf_id_t args[vlen];

If the last arg is zero, this is a varargs function, and libctf will flip on the
CTF_FUNC_VARARG flag in the funcinfo on return.


Variable info section
---------------------

This is a very simple section, an array of ctf_varent_t sorted in ascending
strcmp() order by ctv_name.  It is used for systems in which there is nothing
resembling a string table, in which address -> name lookup for data objects is
done by machinery outside the purview of CTF, and the caller wants to resolve
string names to types.  This covers data objects only: there is currently
nothing resembling the function info section with manual lookup like this.


Label section
-------------

This section is an array of ctf_lblent, which can be used to tile the type space
into named regions.  It might be useful for parallel deduplicators, or to have
distinct parent containers for different regions of the type space (with names
denoted by the label), or such things.


String table
------------

This is a perfectly normal ELF string table, with a first entry which is simply
\0 (so unnamed items can be denoted by the integer 0): it is specific to the CTF
contianer alone.  String table references in CTF have an MSB which, when 1
(CTF_STRTAB_1), means to use a specific ELF string table (usually the one
accompanying the symbol table used for the function info and data object
sections).