# DHT2 First prototype design

This is the design elaboration for the first prototype of DHT2.

For initial reading on the design and motivation for some of the basic on-disk constructs (i.e. directory on a single brick, MDS and DS segregation, flatter backend structure with no ancestors), read:

- https://docs.google.com/document/d/1nJuG1KHtzU99HU9BK9Qxoo1ib9VXf2vwVuHzVQc_lKg
- https://docs.google.com/document/d/1rcqAPxMAjSJfbUQmqKD1Eo_lswQ71KwWDI2w2Srzx0Y

There are parallels to local file systems and their on-disk and in-memory representations of the file system structure. For example, the entry name and GFID separation achieved in the design below is the classic filesystem separation of the directory inode (i.e. file names and inode numbers) from the file inode. A brief understanding of these concepts would improve understanding of the content presented here.

## On disk view:

NOTE: The absolute root (i.e. / in the examples below) could be any part of the underlying filesystem, as presented to Gluster for its store.

### MetaDataServer (MDS):

1. Consolidated file system view across all MDS bricks
   - /0x00/0x00/0x00000001/ (Gluster root inode)
     - Dir1 (GFID: 0xDD000001)
     - Dir2 (GFID: 0xDD000002)
     - File1 (GFID: 0xFF000001)
   - /0xDD/0x00/0xDD000001/ (Dir1)
     - File2 (GFID: 0xFF000002)
     - Dir3 (GFID: 0xDD000003)
   - /0xDD/0x00/0xDD000002/ (Dir2)
     - File3 (GFID: 0xFF000003)
   - /0xDD/0x00/0xDD000003/ (Dir3)

   NOTE: Values in round brackets above are either informational or stored as extended attributes on the file or directory object as its metadata.

2. Object types in MDS
   - Gluster Directory inode
     - In the above example, 0x00000001 is a Gluster directory inode
     - 0x00000001 itself is a GFID (or an inode number) in Gluster FS
     - It stores the metadata about the directory inode that it refers to, i.e. attrs, times, size, xattrs, etc.
     - IOW, it is a regular inode, but of type directory, and hence created as a regular directory on the underlying local file system
   - Gluster Name entry
     - In the above disk view, 'Dir1' or 'File1' are Gluster Name entries
     - It stores the name of the entry, and which inode it points to (i.e. its GFID)
     - The inode that it points to is stored as a GFID, kept as an xattr on the file that is created on the local file system
     - The name-to-parent relationship is defined by virtue of this name being present under the parent GFID Gluster directory inode
     - IOW, it is a regular file, with no attr information stored, created as a regular file on the local file system under the parent GFID hierarchy

### DataServer (DS):

1. Consolidated view across all DS bricks
   - /0xFF/0x00/0xFF000001
   - /0xFF/0x00/0xFF000002
   - /0xFF/0x00/0xFF000003

2. Object types in DS
   - Gluster File inode
     - A regular file on the local file system, created with the GFID as its name
     - Like a regular filesystem inode, it stores the metadata regarding this file inode
     - File data is written/read to/from this object, leaving the local file system to maintain the data block mappings
     - IOW, it is a regular file that Gluster brick processes perform data and metadata operations on
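Both the MDS and DS views above use the same backend naming scheme: a two-level fan-out directory derived from the first two bytes of the GFID, followed by the full GFID. The following is a minimal sketch of that mapping, assuming a per-brick root such as /bricks/mds0 and the shortened 4-byte GFIDs used in these examples; the function name and brick paths are hypothetical, not DHT2 APIs.

```c
/*
 * Illustrative sketch only (not DHT2 code): build the backend path for a
 * GFID under the two-level fan-out shown above, i.e.
 *   <brick-root>/<byte0>/<byte1>/<full-gfid>
 * The shortened 4-byte GFIDs from the examples are used here; a real GFID
 * is longer, but the fan-out scheme stays the same.
 */
#include <stdio.h>
#include <stdint.h>

static void
gfid_to_backend_path(const char *brick_root, uint32_t gfid,
                     char *out, size_t len)
{
    uint8_t byte0 = (gfid >> 24) & 0xff;   /* most significant GFID byte */
    uint8_t byte1 = (gfid >> 16) & 0xff;   /* next GFID byte */

    snprintf(out, len, "%s/0x%02X/0x%02X/0x%08X",
             brick_root, byte0, byte1, gfid);
}

int
main(void)
{
    char path[256];

    /* Dir1's directory inode on an MDS brick */
    gfid_to_backend_path("/bricks/mds0", 0xDD000001, path, sizeof(path));
    printf("%s\n", path);   /* /bricks/mds0/0xDD/0x00/0xDD000001 */

    /* File1's file inode on a DS brick */
    gfid_to_backend_path("/bricks/ds0", 0xFF000001, path, sizeof(path));
    printf("%s\n", path);   /* /bricks/ds0/0xFF/0x00/0xFF000001 */

    return 0;
}
```

The two-byte fan-out presumably serves the same purpose as similar schemes in local file systems: it bounds the number of entries in any single backend directory without needing the ancestor hierarchy that this design removes.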
## Details of inode distribution

These are presented as highlights rather than in detail, as the intention is to keep layout generation and its association to subvolumes a pluggable, evolving entity that helps reduce data motion in the event of cluster expansion or contraction.

- Layouts are at the volume level, and stored centrally
- Hashing is based on GFID, hence the subvol is chosen based on the hash of the GFID
- Post lookup, the cached inode information helps redirect FOPs to the right subvolume
- To begin with, the layout range is divided into as many equal parts as there are subvolumes on the MDS or DS side of the DHT2 volume graph

## Virtual file system construction by DHT2 on the client

The root inode has a special reserved GFID, which is 0x00000001 (shortened form). As a result, looking up root is done by the client-side DHT xlator based on the root GFID hash. Once root is determined, further lookups build the inode and dentry tables on the client to provide the file system view.

## Details on FOP implementation

### Limitations in initial implementation

A. We do not consider rebalance, hence nothing is out of balance and everything is always in the right place
B. We need to determine the order of operations when doing composite operations, like create and unlink
   - Meaning, what is created first: the inode or the name entry on the parent
   - What is unlinked first
   - We should be able to find some useful patterns by looking at other local file system implementations for the same (a sketch of the inode-first ordering used by the mkdir/create flows below follows the rmdir section)

### lookup

- Named lookup (parent gfid (pgfid)/parent inode (pinode) + bname)
  - subvol = layout_search (mds, pgfid) || subvol = pinode->subvol
  - lookup (subvol, loc)
    - Success
      - subvol = layout_search (ds, xattr->gfid)
      - lookup (subvol, gfid)
        - Success
          - update inode
          - return results
        - Failure
          - [A] prevents this from occurring
          - [B] cases may cause a failure to find the GFID and need to be handled
          - IOW, not expected except in race conditions
          - Call this a dorphan (dentry orphan) for now
    - Failure
      - [A] prevents any possibility other than ENOENT
      - return ENOENT
- Nameless lookup (gfid) (can be a file or a directory, so we need to search in both spaces)
  - subvols[1] = layout_search (mds, gfid)
  - subvols[2] = layout_search (ds, gfid)
  - lookup (subvols, gfid)
    - Success
      - update inode
      - return results
    - Failure
      - If found on both, then critical error
      - If not found anywhere, return ESTALE
      - Again, due to [A] we do not need to look up everywhere on a failure

### stat

- subvol = layout_search (inode->type ? mds : ds, gfid) || inode->subvol
- return stat (subvol, gfid)

### mkdir

- subvol = layout_search (mds, gfid)
- mkdir (subvol, gfid)
  - Success
    - mkdir (pinode->subvol, name, gfid)
      - Success
        - return 0
      - Failure
        - Race to create the name entry was won by another client
        - Async cleanup of the oinode (orphan inode), i.e. the created gfid on subvol
        - return EEXIST
  - Failure
    - return ESYS

### rmdir

- subvol = layout_search (mds, gfid) || inode->subvol
- rmdir (subvol, gfid)
  - Success
    - Directory is empty; also, no further creation of dentries is possible as the parent inode (i.e. GFID) no longer exists, so the dentry for this directory is at this point a dorphan
    - rmdir (pinode->subvol, name, gfid)
      - Success
        - return 0
      - Failure
        - This should not fail, other than for connectivity/brick unavailability reasons; at this point we leave a dorphan behind
  - Failure
    - Either not empty, or unable to remove due to other systemic reasons
    - return error
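To make the composite-operation ordering above concrete (limitation [B] and the mkdir flow; create follows the same pattern on the DS side), below is a minimal sketch assuming equal division of a 32-bit hash range across a fixed set of MDS subvolumes. All names here (layout_search_mds, mds_mkdir_gfid, mds_link_name, async_cleanup_orphan) and the error mapping are hypothetical stand-ins, not DHT2 code.

```c
/*
 * Minimal, hypothetical sketch of the "inode first, name entry second"
 * ordering used by mkdir/create above.  The brick RPCs are stubbed out;
 * only the control flow and the orphan-cleanup path are of interest.
 */
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

#define N_MDS_SUBVOLS 4 /* assumed subvolume count for the example */

/* Equal division of the 32-bit hash range across MDS subvolumes, per the
 * initial layout description above (GFID used directly as the hash here). */
static int
layout_search_mds(uint32_t gfid_hash)
{
    uint32_t range = UINT32_MAX / N_MDS_SUBVOLS;
    int subvol = (int)(gfid_hash / range);

    return subvol >= N_MDS_SUBVOLS ? N_MDS_SUBVOLS - 1 : subvol;
}

/* Stubs standing in for the actual brick operations. */
static int mds_mkdir_gfid(int subvol, uint32_t gfid) { return 0; }
static int mds_link_name(int psubvol, uint32_t pgfid, const char *name,
                         uint32_t gfid) { return 0; }
static void async_cleanup_orphan(int subvol, uint32_t gfid) { /* queued */ }

static int
dht2_mkdir(uint32_t pgfid, int psubvol, const char *name, uint32_t gfid)
{
    int subvol = layout_search_mds(gfid);

    /* 1. Create the directory inode (GFID object) on its MDS subvolume. */
    if (mds_mkdir_gfid(subvol, gfid) != 0)
        return -EIO; /* "ESYS" in the flow above */

    /* 2. Create the name entry under the parent GFID directory. */
    if (mds_link_name(psubvol, pgfid, name, gfid) != 0) {
        /* Lost the race to another client: clean up the orphan inode later. */
        async_cleanup_orphan(subvol, gfid);
        return -EEXIST;
    }

    return 0;
}

int
main(void)
{
    /* Create Dir1 (GFID 0xDD000001) under the root GFID from the examples. */
    printf("mkdir returned %d\n",
           dht2_mkdir(0x00000001, 0, "Dir1", 0xDD000001));
    return 0;
}
```

One apparent benefit of creating the inode before the name entry is that a lost race leaves behind only an orphan GFID inode, which can be cleaned up asynchronously, rather than a visible name entry pointing at a GFID that does not exist.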
### readdirp

- inodes = readdir (fd->subvol, fd, args...)
- for each inode in inodes
  - subvols[1] = layout_search (mds, inode->gfid)
  - subvols[2] = layout_search (ds, inode->gfid)
  - stat (subvols, gfid)
  - fill up the stat info in the return to the readdirp call

### create

- subvol = layout_search (ds, gfid)
- create (subvol, gfid)
  - Success
    - create (pinode->subvol, name, gfid)
      - Success
        - return 0
      - Failure
        - Race to create the name entry was won by another client
        - Async cleanup of the oinode (orphan inode), i.e. the created gfid on subvol
        - return EEXIST
  - Failure
    - return ESYS

### setattr

- subvol = layout_search (inode->type ? mds : ds, gfid) || inode->subvol
- return setattr (subvol, gfid)

### open(dir)

- subvol = layout_search (inode->type ? mds : ds, gfid) || inode->subvol
- return open(dir) (subvol, gfid)

### readv

- subvol = inode->subvol
- return readv (subvol, fd, args...)

### writev

- subvol = inode->subvol
- return writev (subvol, fd, args...)

### unlink

- subvol = layout_search (ds, gfid) || inode->subvol
- unlink (subvol, gfid)
  - Success
    - An open fd is preserved by the underlying POSIX fs, so no harm done there
    - unlink (pinode->subvol, name, gfid)
      - Success
        - return 0
      - Failure
        - This should not fail, other than for connectivity/brick unavailability reasons; at this point we leave a dorphan behind
  - Failure
    - return error

### rename

- Simple rename, i.e. rename a directory or a file within the same parent where newname does not exist
  - rename (pgfid->subvol, oldname, newname)

### link

- This is more involved, as we need to keep a link count on the gfid inode

### fsync

### fsyncdir

### lk

## Open problems, issues and limitations

1. readdirp would be very slow in the approach above; we need ways to alleviate this
2. Quota needs a rethink, as ancestry within a brick is not present for accounting and other such needs. We possibly need to rework quota in any approach, to be able to adapt to this change on disk. One promising approach seems to be using XFS project-quota-like ideas in Gluster; see
   - http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf
     - See "Internal inodes" -> "Quota inodes"
   - http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/xfs-quotas.html
3. lookup also translates into two network RPCs per FOP, and needs improvement

## Appendix A: Directory as a file at the POSIX layer

### High level requirements

- Cannot hard link the GFID name space to the real name, as a file does not exist on the MDS as a real file; it is just an offset into a file that represents the directory
- Also, even if such a hard link were possible, it would be a link to the name entry for the file, not to its inode, as that exists as a separate object in the local file system anyway
- The bottom line is that, with directory as a file or not, the current on-disk format prevents a .glusterfs name space from existing in parallel on the disk, and it is actually not needed anyway
- Need to have a case-insensitive name index for the same; this helps SAMBA, if at all possible
- Can be sharded to provide hot directory immunity (Future)
- Should retain d_off across replicas, i.e. a d_off should be continuable on any available replica; this is one of the primary needs for this feature as well
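The last requirement above, stable d_off values across replicas, follows naturally if the directory file is laid out as fixed-size name-entry slots, so that a d_off is simply a byte offset into the file and is identical on every replica. The sketch below is purely illustrative of that idea; the structure, field names, and sizes are assumptions for this example, not a proposed on-disk format.

```c
/*
 * Purely illustrative sketch (not a proposed format): if every name entry
 * in the "directory as a file" occupies a fixed-size slot, the d_off handed
 * back by readdir can be the byte offset of the slot, which is the same on
 * every replica of the directory file.
 */
#include <stdint.h>
#include <stdio.h>

#define DHT2_NAME_MAX 255 /* assumed maximum entry-name length */

struct dht2_dirent_rec {
    uint8_t  gfid[16];                /* GFID this name entry points to */
    uint8_t  in_use;                  /* slot allocated or free */
    uint8_t  pad;
    uint16_t name_len;                /* length of the entry name */
    char     name[DHT2_NAME_MAX + 1]; /* NUL-terminated entry name */
};

/* d_off of slot N: a plain byte offset, continuable on any replica. */
static inline uint64_t
dht2_dirent_offset(uint64_t slot)
{
    return slot * sizeof(struct dht2_dirent_rec);
}

int
main(void)
{
    /* Offsets of the first few slots; the same on every replica. */
    for (uint64_t slot = 0; slot < 3; slot++)
        printf("slot %llu -> d_off %llu\n",
               (unsigned long long)slot,
               (unsigned long long)dht2_dirent_offset(slot));
    return 0;
}
```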