OCFS2 - Frequently Asked Questions
==================================

General
-------

Q01 How do I get started?
A01 a) Download and install the module and tools rpms.
    b) Create cluster.conf and propagate it to all nodes.
    c) Configure and start the O2CB cluster service.
    d) Format the volume.
    e) Mount the volume.

Q02 How do I know which version is running?
A02 # cat /proc/fs/ocfs2/version
    OCFS2 1.2.1 Fri Apr 21 13:51:24 PDT 2006 (build bd2f25ba0af9677db3572e3ccd92f739)

Q03 How do I configure my system to auto-reboot after a panic?
A03 To auto-reboot the system 60 secs after a panic, do:
    # echo 60 > /proc/sys/kernel/panic
    To enable the above on every reboot, add the following to /etc/sysctl.conf:
    kernel.panic = 60

==============================================================================

Download and Install
--------------------

Q01 Where do I get the packages from?
A01 For Novell's SLES9, upgrade to SP3 to get the required modules installed.
    Also, install the ocfs2-tools and ocfs2console packages.
    For Red Hat's RHEL4, download and install the appropriate module package
    and the two tools packages, ocfs2-tools and ocfs2console. Appropriate
    module refers to one matching the kernel version, flavor and
    architecture. Flavor refers to smp, hugemem, etc.

Q02 What are the latest versions of the OCFS2 packages?
A02 The latest module package version is 1.2.2. The latest tools/console
    package version is 1.2.1.

Q03 How do I interpret the package name ocfs2-2.6.9-22.0.1.ELsmp-1.2.1-1.i686.rpm?
A03 The package name is comprised of multiple parts separated by '-'.
    a) ocfs2               - Package name
    b) 2.6.9-22.0.1.ELsmp  - Kernel version and flavor
    c) 1.2.1               - Package version
    d) 1                   - Package subversion
    e) i686                - Architecture

Q04 How do I know which package to install on my box?
A04 After one identifies the package name and version to install, one still
    needs to determine the kernel version, flavor and architecture.
    To know the kernel version and flavor, do:
    # uname -r
    2.6.9-22.0.1.ELsmp
    To know the architecture, do:
    # rpm -qf /boot/vmlinuz-`uname -r` --queryformat "%{ARCH}\n"
    i686

Q05 Why can't I use "uname -p" to determine the kernel architecture?
A05 "uname -p" does not always provide the exact kernel architecture. Case
    in point the RHEL3 kernels on x86_64. Even though Red Hat has two
    different kernel architectures available for this port, ia32e and
    x86_64, "uname -p" identifies both as the generic "x86_64".

Q06 How do I install the rpms?
A06 First install the tools and console packages:
    # rpm -Uvh ocfs2-tools-1.2.1-1.i386.rpm ocfs2console-1.2.1-1.i386.rpm
    Then install the appropriate kernel module package:
    # rpm -Uvh ocfs2-2.6.9-22.0.1.ELsmp-1.2.1-1.i686.rpm

Q07 Do I need to install the console?
A07 No, the console is not required but is recommended for ease of use.

Q08 What are the dependencies for installing ocfs2console?
A08 ocfs2console requires e2fsprogs, glib2 2.2.3 or later, vte 0.11.10 or
    later, pygtk2 (EL4) or python-gtk (SLES9) 1.99.16 or later, python 2.3
    or later and ocfs2-tools.

Q09 What modules are installed with the OCFS2 1.2 package?
A09 a) configfs.ko
    b) ocfs2.ko
    c) ocfs2_dlm.ko
    d) ocfs2_dlmfs.ko
    e) ocfs2_nodemanager.ko
    f) debugfs
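
Note: One quick way to confirm that the expected module package is in place
for the running kernel is to query the module directly. A minimal sketch;
the exact fields printed depend on how the module was built:
    # modinfo ocfs2        (prints details of the installed ocfs2.ko,
                            including its version when one is set)
    # lsmod | grep ocfs2   (lists the OCFS2 modules currently loaded, if any)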

Q10 What tools are installed with the ocfs2-tools 1.2 package?
A10 a) mkfs.ocfs2
    b) fsck.ocfs2
    c) tunefs.ocfs2
    d) debugfs.ocfs2
    e) mount.ocfs2
    f) mounted.ocfs2
    g) ocfs2cdsl
    h) ocfs2_hb_ctl
    i) o2cb_ctl
    j) o2cb         - init service to start/stop the cluster
    k) ocfs2        - init service to mount/umount ocfs2 volumes
    l) ocfs2console - installed with the console package

Q11 What is debugfs and is it related to debugfs.ocfs2?
A11 debugfs is an in-memory filesystem developed by Greg Kroah-Hartman. It
    is useful for debugging as it allows kernel space to easily export data
    to userspace. For more, see http://kerneltrap.org/node/4394. It is
    currently used by OCFS2 to dump the list of filesystem locks and could
    be used for more in the future. It is bundled with OCFS2 as the various
    distributions are currently not bundling it. While debugfs and
    debugfs.ocfs2 are unrelated in general, the latter is used as the
    front-end for the debugging info provided by the former. For an example,
    refer to the Troubleshooting section.

==============================================================================

Configure
---------

Q01 How do I populate /etc/ocfs2/cluster.conf?
A01 If you have installed the console, use it to create this configuration
    file. For details, refer to the user's guide. If you do not have the
    console installed, check the Appendix in the user's guide for a sample
    cluster.conf and the details of all its components. (A minimal sample is
    also shown at the end of this section.) Do not forget to copy this file
    to all the nodes in the cluster. If you ever edit this file on any node,
    ensure the other nodes are updated as well.

Q02 Should the IP interconnect be public or private?
A02 Using a private interconnect is recommended. While OCFS2 does not take
    much bandwidth, it does require the nodes to be alive on the network and
    sends regular keepalive packets to ensure that they are. To avoid a
    network delay being interpreted as a node disappearing on the net, which
    could lead to a node self-fencing, a private interconnect is
    recommended. One could use the same interconnect for Oracle RAC and
    OCFS2.

Q03 What should the node name be and should it be related to the IP address?
A03 The node name needs to match the hostname. The IP address need not be
    the one associated with that hostname. As in, any valid IP address on
    that node can be used. OCFS2 will not attempt to match the node name
    (hostname) with the specified IP address.

Q04 How do I modify the IP address, port or any other information specified
    in cluster.conf?
A04 While one can use ocfs2console to add nodes dynamically to a running
    cluster, any other modification requires the cluster to be offlined.
    Stop the cluster on all nodes, edit /etc/ocfs2/cluster.conf on one node
    and copy it to the rest, and restart the cluster on all nodes. Always
    ensure that cluster.conf is the same on all the nodes in the cluster.
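
Note: For reference, a minimal two-node cluster.conf could look like the
sketch below. The cluster name, node names, node numbers and IP addresses
are placeholders (the node names must match the hostnames), the attribute
lines must be indented, and the authoritative sample is the one in the
user's guide appendix:
    node:
            ip_port = 7777
            ip_address = 192.168.1.101
            number = 0
            name = node1
            cluster = ocfs2

    node:
            ip_port = 7777
            ip_address = 192.168.1.102
            number = 1
            name = node2
            cluster = ocfs2

    cluster:
            node_count = 2
            name = ocfs2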

==============================================================================

O2CB Cluster Service
--------------------

Q01 How do I configure the cluster service?
A01 # /etc/init.d/o2cb configure
    Enter 'y' if you want the service to load on boot, and enter the name of
    the cluster (as listed in /etc/ocfs2/cluster.conf).

Q02 How do I start the cluster service?
A02 a) To load the modules, do:
       # /etc/init.d/o2cb load
    b) To online it, do:
       # /etc/init.d/o2cb online [cluster_name]
    If you have configured the cluster to load on boot, you could combine
    the two as follows:
       # /etc/init.d/o2cb start [cluster_name]
    The cluster name is not required if you have specified the name during
    configuration.

Q03 How do I stop the cluster service?
A03 a) To offline it, do:
       # /etc/init.d/o2cb offline [cluster_name]
    b) To unload the modules, do:
       # /etc/init.d/o2cb unload
    If you have configured the cluster to load on boot, you could combine
    the two as follows:
       # /etc/init.d/o2cb stop [cluster_name]
    The cluster name is not required if you have specified the name during
    configuration.

Q04 How can I learn the status of the cluster?
A04 To learn the status of the cluster, do:
    # /etc/init.d/o2cb status

Q05 I am unable to get the cluster online. What could be wrong?
A05 Check whether the node name in cluster.conf exactly matches the
    hostname. For the cluster to come online on a node, that node needs to
    be one of the nodes listed in cluster.conf.

==============================================================================

Format
------

Q01 How do I format a volume?
A01 You could either use the console or use mkfs.ocfs2 directly to format
    the volume. For the console, refer to the user's guide.
    # mkfs.ocfs2 -L "oracle_home" /dev/sdX
    The above formats the volume with default block and cluster sizes, which
    are computed based upon the size of the volume.
    # mkfs.ocfs2 -b 4K -C 32K -L "oracle_home" -N 4 /dev/sdX
    The above formats the volume for 4 nodes with a 4K block size and a 32K
    cluster size.

Q02 What does the number of node slots during format refer to?
A02 The number of node slots specifies the number of nodes that can
    concurrently mount the volume. This number is specified during format
    and can be increased using tunefs.ocfs2 (see the example at the end of
    this section). This number cannot be decreased.

Q03 What should I consider when determining the number of node slots?
A03 OCFS2 allocates system files, like the journal, for each node slot. So
    as not to waste space, one should specify a number in the ballpark of
    the actual number of nodes. Also, as this number can be increased later,
    there is no need to specify a number much larger than the number of
    nodes one plans to mount the volume on.

Q04 Does the number of node slots have to be the same for all volumes?
A04 No. This number can be different for each volume.

Q05 What block size should I use?
A05 A block size is the smallest unit of space addressable by the file
    system. OCFS2 supports block sizes of 512 bytes, 1K, 2K and 4K. The
    block size cannot be changed after the format. For most volume sizes, a
    4K block size is recommended. On the other hand, a 512-byte block size
    is never recommended.

Q06 What cluster size should I use?
A06 A cluster size is the smallest unit of space allocated to a file to hold
    the data. OCFS2 supports cluster sizes of 4K, 8K, 16K, 32K, 64K, 128K,
    256K, 512K and 1M. For database volumes, a cluster size of 128K or
    larger is recommended. For Oracle home, 32K to 64K.

Q07 Any advantage of labelling the volumes?
A07 As the device name (/dev/sdX) for a particular device in a shared disk
    environment can be different on different nodes, labelling becomes a
    must for easy identification. You could also use labels to identify
    volumes during mount.
    # mount -L "label" /dir
    The volume label is changeable using the tunefs.ocfs2 utility.
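
Note: As a rough sketch of the two tunefs.ocfs2 changes referred to above
(the device, slot count and label below are placeholders; consult the
tunefs.ocfs2 man page of your tools version for the exact requirements):
    # tunefs.ocfs2 -N 8 /dev/sdX
    The above increases the number of node slots to 8.
    # tunefs.ocfs2 -L "new_label" /dev/sdX
    The above changes the volume label.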

==============================================================================

Mount
-----

Q01 How do I mount the volume?
A01 You could either use the console or use mount directly. For the console,
    refer to the user's guide.
    # mount -t ocfs2 /dev/sdX /dir
    The above command will mount device /dev/sdX on directory /dir.

Q02 How do I mount by label?
A02 To mount by label, do:
    # mount -L "label" /dir

Q03 What entry do I add to /etc/fstab to mount an ocfs2 volume?
A03 Add the following:
    /dev/sdX  /dir  ocfs2  noauto,_netdev  0 0
    The _netdev option indicates that the device needs to be mounted after
    the network is up.

Q04 What do I need to do to mount OCFS2 volumes on boot?
A04 a) Enable the o2cb service using:
       # chkconfig --add o2cb
    b) Enable the ocfs2 service using:
       # chkconfig --add ocfs2
    c) Configure o2cb to load on boot using:
       # /etc/init.d/o2cb configure
    d) Add entries into /etc/fstab as follows:
       /dev/sdX  /dir  ocfs2  _netdev  0 0

Q05 How do I know my volume is mounted?
A05 a) Enter mount without arguments, or
       # mount
    b) List /etc/mtab, or
       # cat /etc/mtab
    c) List /proc/mounts, or
       # cat /proc/mounts
    d) Run the ocfs2 init service.
       # /etc/init.d/ocfs2 status
    The mount command reads /etc/mtab to show the information.

Q06 What are the /config and /dlm mountpoints for?
A06 OCFS2 comes bundled with two in-memory filesystems, configfs and
    ocfs2_dlmfs. configfs is used by the ocfs2 tools to communicate to the
    in-kernel node manager the list of nodes in the cluster and to the
    in-kernel heartbeat thread the resource to heartbeat on. ocfs2_dlmfs is
    used by the ocfs2 tools to communicate with the in-kernel dlm to take
    and release clusterwide locks on resources.

Q07 Why does it take so much time to mount the volume?
A07 It takes around 5 secs for a volume to mount. It does so to let the
    heartbeat thread stabilize. In a later release, we plan to add support
    for a global heartbeat, which will make most mounts instant.

==============================================================================

Oracle RAC
----------

Q01 Any special flags to run Oracle RAC?
A01 OCFS2 volumes containing the voting diskfile (CRS), cluster registry
    (OCR), data files, redo logs, archive logs and control files must be
    mounted with the "datavolume" and "nointr" mount options. The
    "datavolume" option ensures that the Oracle processes open these files
    with the o_direct flag. The "nointr" option ensures that the I/Os are
    not interrupted by signals. (An example /etc/fstab entry is shown at the
    end of this section.)
    # mount -o datavolume,nointr -t ocfs2 /dev/sda1 /u01/db

Q02 What about the volume containing Oracle home?
A02 The Oracle home volume should be mounted normally, that is, without the
    "datavolume" and "nointr" mount options. These mount options are only
    relevant for the Oracle files listed above.
    # mount -t ocfs2 /dev/sdb1 /software/orahome

Q03 Does that mean I cannot have my data files and Oracle home on the same
    volume?
A03 Yes. The Oracle data files, redo logs, etc. should never be on the same
    volume as the distribution (including the trace logs like alert.log).
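
Note: To have such a volume mounted on boot, the same options can be listed
in /etc/fstab alongside _netdev. A sketch, assuming the example device and
mountpoint used above:
    /dev/sda1  /u01/db  ocfs2  _netdev,datavolume,nointr  0 0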

==============================================================================

Moving data from OCFS (Release 1) to OCFS2
------------------------------------------

Q01 Can I mount OCFS volumes as OCFS2?
A01 No. OCFS and OCFS2 are not on-disk compatible. We had to break the
    compatibility in order to add many of the new features. At the same
    time, we have added enough flexibility in the new disk layout so as to
    maintain backward compatibility in the future.

Q02 Can OCFS volumes and OCFS2 volumes be mounted on the same machine
    simultaneously?
A02 No. OCFS only works on 2.4 linux kernels (Red Hat's AS2.1/EL3 and SuSE's
    SLES8). OCFS2, on the other hand, only works on the 2.6 kernels (Red
    Hat's EL4 and SuSE's SLES9).

Q03 Can I access my OCFS volume on 2.6 kernels (SLES9/RHEL4)?
A03 Yes, you can access the OCFS volume on 2.6 kernels using the FSCat
    tools, fsls and fscp. These tools can access the OCFS volumes at the
    device layer, to list and copy the files to another filesystem. The
    FSCat tools are available on oss.oracle.com.

Q04 Can I in-place convert my OCFS volume to OCFS2?
A04 No. The on-disk layouts of OCFS and OCFS2 are sufficiently different
    that an in-place upgrade of the volume would require a third disk (as a
    temporary buffer). With that in mind, it was decided not to develop such
    a tool but instead provide tools to copy data from OCFS without one
    having to mount it.

Q05 What is the quickest way to move data from OCFS to OCFS2?
A05 Quickest would mean having to perform the minimal number of copies. If
    you have a current backup on a non-OCFS volume accessible from the 2.6
    kernel install, then all you would need to do is to restore the backup
    on the OCFS2 volume(s). If you do not have a backup but have a setup in
    which the system containing the OCFS2 volumes can access the disks
    containing the OCFS volume, you can use the FSCat tools to extract data
    from the OCFS volume and copy it onto OCFS2.

==============================================================================

Coreutils
---------

Q01 Like with OCFS (Release 1), do I need to use o_direct enabled tools to
    perform cp, mv, tar, etc.?
A01 No. OCFS2 does not need the o_direct enabled tools. The file system
    allows processes to open files in both o_direct and buffered modes
    concurrently.

==============================================================================

Troubleshooting
---------------

Q01 How do I enable and disable filesystem tracing?
A01 To list all the debug bits along with their statuses, do:
    # debugfs.ocfs2 -l
    To enable tracing the bit SUPER, do:
    # debugfs.ocfs2 -l SUPER allow
    To disable tracing the bit SUPER, do:
    # debugfs.ocfs2 -l SUPER off
    To totally turn off tracing the SUPER bit, as in, turn off tracing even
    if some other bit is enabled for the same, do:
    # debugfs.ocfs2 -l SUPER deny
    To enable heartbeat tracing, do:
    # debugfs.ocfs2 -l HEARTBEAT ENTRY EXIT allow
    To disable heartbeat tracing, do:
    # debugfs.ocfs2 -l HEARTBEAT off ENTRY EXIT deny

Q02 How do I get a list of filesystem locks and their statuses?
A02 OCFS2 1.0.9+ has this feature. To get this list, do:
    a) Mount debugfs at /debug.
       # mount -t debugfs debugfs /debug
    b) Dump the locks.
       # echo "fs_locks" | debugfs.ocfs2 /dev/sdX >/tmp/fslocks
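
Note: To have debugfs mounted automatically on every boot, an entry along
the following lines can be added to /etc/fstab (a sketch; /debug is simply
the mountpoint this FAQ uses):
    debugfs  /debug  debugfs  defaults  0 0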
# echo "locate <419616>" | debugfs.ocfs2 /dev/sdX debugfs.ocfs2 1.2.0 debugfs: 419616 /linux-2.6.15/arch/i386/kernel/semaphore.c To get a lockname from a directory entry, do: # echo "encode /linux-2.6.15/arch/i386/kernel/semaphore.c" | debugfs.ocfs2 /dev/sdX M000000000000000006672078b84822 D000000000000000006672078b84822 W000000000000000006672078b84822 The first is the Metadata lock, then Data lock and last ReadWrite lock for the same resource. The DLM supports 3 lock modes: NL no lock, PR protected read and EX exclusive. If you have a dlm hang, the resource to look for would be one with the "Busy" flag set. The next step would be to query the dlm for the lock resource. Note: The dlm debugging is still a work in progress. To do dlm debugging, first one needs to know the dlm domain, which matches the volume UUID. # echo "stats" | debugfs.ocfs2 -n /dev/sdX | grep UUID: | while read a b ; do echo $b ; done 82DA8137A49A47E4B187F74E09FBBB4B Then do: # echo R dlm_domain lockname > /proc/fs/ocfs2_dlm/debug For example: # echo R 82DA8137A49A47E4B187F74E09FBBB4B M000000000000000006672078b84822 > /proc/fs/ocfs2_dlm/debug # dmesg | tail struct dlm_ctxt: 82DA8137A49A47E4B187F74E09FBBB4B, node=75, key=965960985 lockres: M000000000000000006672078b84822, owner=79, state=0 last used: 0, on purge list: no granted queue: type=3, conv=-1, node=79, cookie=11673330234144325711, ast=(empty=y,pend=n), bast=(empty=y,pend=n) converting queue: blocked queue: It shows that the lock is mastered by node 75 and that node 79 has been granted a PR lock on the resource. This is just to give a flavor of dlm debugging. ============================================================================== Limits ------ Q01 Is there a limit to the number of subdirectories in a directory? A01 Yes. OCFS2 currently allows up to 32000 subdirectories. While this limit could be increased, we will not be doing it till we implement some kind of efficient name lookup (htree, etc.). Q02 Is there a limit to the size of an ocfs2 file system? A02 Yes, current software addresses block numbers with 32 bits. So the file system device is limited to (2 ^ 32) * blocksize (see mkfs -b). With a 4KB block size this amounts to a 16TB file system. This block addressing limit will be relaxed in future software. At that point the limit becomes addressing clusters of 1MB each with 32 bits which leads to a 4PB file system. ============================================================================== System Files ------------ Q01 What are system files? A01 System files are used to store standard filesystem metadata like bitmaps, journals, etc. Storing this information in files in a directory allows OCFS2 to be extensible. These system files can be accessed using debugfs.ocfs2. To list the system files, do: # echo "ls -l //" | debugfs.ocfs2 /dev/sdX 18 16 1 2 . 18 16 2 2 .. 19 24 10 1 bad_blocks 20 32 18 1 global_inode_alloc 21 20 8 1 slot_map 22 24 9 1 heartbeat 23 28 13 1 global_bitmap 24 28 15 2 orphan_dir:0000 25 32 17 1 extent_alloc:0000 26 28 16 1 inode_alloc:0000 27 24 12 1 journal:0000 28 28 16 1 local_alloc:0000 29 3796 17 1 truncate_log:0000 The first column lists the block number. Q02 Why do some files have numbers at the end? A02 There are two types of files, global and local. Global files are for all the nodes, while local, like journal:0000, are node specific. The set of local files used by a node is determined by the slot mapping of that node. The numbers at the end of the system file name is the slot#. 

==============================================================================

System Files
------------

Q01 What are system files?
A01 System files are used to store standard filesystem metadata like
    bitmaps, journals, etc. Storing this information in files in a directory
    allows OCFS2 to be extensible. These system files can be accessed using
    debugfs.ocfs2. To list the system files, do:
    # echo "ls -l //" | debugfs.ocfs2 /dev/sdX
        18    16     1   2   .
        18    16     2   2   ..
        19    24    10   1   bad_blocks
        20    32    18   1   global_inode_alloc
        21    20     8   1   slot_map
        22    24     9   1   heartbeat
        23    28    13   1   global_bitmap
        24    28    15   2   orphan_dir:0000
        25    32    17   1   extent_alloc:0000
        26    28    16   1   inode_alloc:0000
        27    24    12   1   journal:0000
        28    28    16   1   local_alloc:0000
        29  3796    17   1   truncate_log:0000
    The first column lists the block number.

Q02 Why do some files have numbers at the end?
A02 There are two types of system files, global and local. Global files are
    for all the nodes, while local files, like journal:0000, are node
    specific. The set of local files used by a node is determined by the
    slot mapping of that node. The number at the end of the system file name
    is the slot#. To list the slot maps, do:
    # echo "slotmap" | debugfs.ocfs2 -n /dev/sdX
        Slot#   Node#
            0      39
            1      40
            2      41
            3      42

==============================================================================

Heartbeat
---------

Q01 How does the disk heartbeat work?
A01 Every node writes every two secs to its block in the heartbeat system
    file. The block offset is equal to its global node number. So node 0
    writes to the first block, node 1 to the second, etc. All the nodes also
    read the heartbeat sysfile every two secs. As long as the timestamp is
    changing, that node is deemed alive.

Q02 When is a node deemed dead?
A02 An active node is deemed dead if it does not update its timestamp for
    O2CB_HEARTBEAT_THRESHOLD (default=7) loops. Once a node is deemed dead,
    the surviving node which manages to cluster lock the dead node's journal
    recovers it by replaying the journal.

Q03 What about self fencing?
A03 A node self-fences if it fails to update its timestamp for
    ((O2CB_HEARTBEAT_THRESHOLD - 1) * 2) secs. The [o2hb-xx] kernel thread,
    after every timestamp write, sets a timer to panic the system after that
    duration. If the next timestamp is written within that duration, as it
    should be, it first cancels that timer before setting up a new one. This
    way it ensures the system will self fence if for some reason the
    [o2hb-xx] kernel thread is unable to update the timestamp and would thus
    be deemed dead by the other nodes in the cluster.

Q04 How can one change the parameter value of O2CB_HEARTBEAT_THRESHOLD?
A04 This parameter value could be changed by adding it to
    /etc/sysconfig/o2cb and RESTARTING the O2CB cluster. This value should
    be the SAME on ALL the nodes in the cluster.

Q05 What should one set O2CB_HEARTBEAT_THRESHOLD to?
A05 It should be set to the timeout value of the io layer. Most multipath
    solutions have a timeout ranging from 60 secs to 120 secs. For 60 secs,
    set it to 31. For 120 secs, set it to 61.
    O2CB_HEARTBEAT_THRESHOLD = (((timeout in secs) / 2) + 1)
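
Note: Putting the two previous answers together, for a 120 sec io timeout
one would add a line along the following lines to /etc/sysconfig/o2cb on
every node and then restart the cluster (the value shown is an example;
compute yours with the formula above):
    O2CB_HEARTBEAT_THRESHOLD=61

    # /etc/init.d/o2cb stop
    # /etc/init.d/o2cb start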

Q06 What if a node umounts a volume?
A06 During umount, the node will broadcast to all the nodes that have
    mounted that volume to drop that node from their node maps. As the
    journal is shutdown before this broadcast, any node crash after this
    point is ignored as there is no need for recovery.

Q07 I encounter "Kernel panic - not syncing: ocfs2 is very sorry to be
    fencing this system by panicing" whenever I run a heavy io load?
A07 We have encountered a bug with the default "cfq" io scheduler which
    causes a process doing heavy io to temporarily starve out other
    processes. While this is not fatal for most environments, it is for
    OCFS2 as we expect the hb thread to be able to read/write the hb area at
    least once every 12 secs (default). A bug with the fix has been filed
    with Red Hat and Novell. For more, refer to the tracker bug filed on
    bugzilla:
    http://oss.oracle.com/bugzilla/show_bug.cgi?id=671
    Till this issue is resolved, one is advised to use the "deadline" io
    scheduler. To use deadline, add "elevator=deadline" to the kernel
    command line as follows:
    1. For SLES9, edit the command line in /boot/grub/menu.lst.
       title Linux 2.6.5-7.244-bigsmp elevator=deadline
           kernel (hd0,4)/boot/vmlinuz-2.6.5-7.244-bigsmp root=/dev/sda5 vga=0x314 selinux=0 splash=silent resume=/dev/sda3 elevator=deadline showopts console=tty0 console=ttyS0,115200 noexec=off
           initrd (hd0,4)/boot/initrd-2.6.5-7.244-bigsmp
    2. For RHEL4, edit the command line in /boot/grub/grub.conf:
       title Red Hat Enterprise Linux AS (2.6.9-22.EL)
           root (hd0,0)
           kernel /vmlinuz-2.6.9-22.EL ro root=LABEL=/ console=ttyS0,115200 console=tty0 elevator=deadline noexec=off
           initrd /initrd-2.6.9-22.EL.img
    To see the current kernel command line, do:
    # cat /proc/cmdline

==============================================================================

Quorum and Fencing
------------------

Q01 What is a quorum?
A01 A quorum is a designation given to a group of nodes in a cluster which
    are still allowed to operate on shared storage. It comes up when there
    is a failure in the cluster which breaks the nodes up into groups which
    can communicate in their groups and with the shared storage but not
    between groups.

Q02 How does OCFS2's cluster service define a quorum?
A02 The quorum decision is made by a single node based on the number of
    other nodes that are considered alive by heartbeating and the number of
    other nodes that are reachable via the network.
    A node has quorum when:
    * it sees an odd number of heartbeating nodes and has network
      connectivity to more than half of them, or
    * it sees an even number of heartbeating nodes and has network
      connectivity to at least half of them *and* has connectivity to the
      heartbeating node with the lowest node number.

Q03 What is fencing?
A03 Fencing is the act of forcefully removing a node from a cluster. A node
    with OCFS2 mounted will fence itself when it realizes that it doesn't
    have quorum in a degraded cluster. It does this so that other nodes
    won't get stuck trying to access its resources. Currently OCFS2 will
    panic the machine when it realizes it has to fence itself off from the
    cluster. As described in Q02, it will do this when it sees more nodes
    heartbeating than it has connectivity to and fails the quorum test.

Q04 How does a node decide that it has connectivity with another?
A04 When a node sees another come to life via heartbeating, it will try and
    establish a TCP connection to that newly live node. It considers that
    other node connected as long as the TCP connection persists and the
    connection is not idle for 10 seconds. Once that TCP connection is
    closed or idle, it will not be reestablished until heartbeat thinks the
    other node has died and come back alive.

Q05 How long does the quorum process take?
A05 First a node will realize that it doesn't have connectivity with another
    node. This can happen immediately if the connection is closed but can
    take a maximum of 10 seconds of idle time. Then the node must wait long
    enough to give heartbeating a chance to declare the node dead. It does
    this by waiting two iterations longer than the number of iterations
    needed to consider a node dead (see Q02 in the Heartbeat section of this
    FAQ). The current default of 7 iterations of 2 seconds results in
    waiting for 9 iterations or 18 seconds. By default, then, a maximum of
    28 seconds can pass from the time a network fault occurs until a node
    fences itself.

Q06 How can one prevent a node from panicking when one shuts down the other
    node in a 2-node cluster?
A06 This typically means that the network is shutting down before all the
    OCFS2 volumes are being umounted. Ensure the ocfs2 init script is
    enabled. This script ensures that the OCFS2 volumes are umounted before
    the network is shutdown. To check whether the service is enabled, do:
    # chkconfig --list ocfs2
    ocfs2     0:off   1:off   2:on    3:on    4:on    5:on    6:off

Q07 How does one list out the startup and shutdown ordering of the OCFS2
    related services?

A07 To list the startup order for runlevel 3 on RHEL4, do:
    # cd /etc/rc3.d
    # ls S*ocfs2* S*o2cb* S*network*
    S10network  S24o2cb  S25ocfs2
    To list the shutdown order on RHEL4, do:
    # cd /etc/rc6.d
    # ls K*ocfs2* K*o2cb* K*network*
    K19ocfs2  K20o2cb  K90network
    To list the startup order for runlevel 3 on SLES9, do:
    # cd /etc/init.d/rc3.d
    # ls S*ocfs2* S*o2cb* S*network*
    S05network  S07o2cb  S08ocfs2
    To list the shutdown order on SLES9, do:
    # cd /etc/init.d/rc3.d
    # ls K*ocfs2* K*o2cb* K*network*
    K14ocfs2  K15o2cb  K17network
    Please note that the default ordering in the ocfs2 scripts only includes
    the network service and not any shared-device specific service, like
    iscsi. If one is using iscsi or any shared device requiring a service to
    be started and shut down, please ensure that that service starts before
    and shuts down after the ocfs2 init service.

==============================================================================

Novell SLES9
------------

Q01 Why are OCFS2 packages for SLES9 not made available on oss.oracle.com?
A01 OCFS2 packages for SLES9 are available directly from Novell as part of
    the kernel. The same is true for the various Asianux distributions and
    for Ubuntu. As OCFS2 is now part of the mainline kernel
    (http://lwn.net/Articles/166954/), we expect more distributions to
    bundle the product with the kernel.

Q02 What versions of OCFS2 are available with SLES9 and how do they match
    with the Red Hat versions available on oss.oracle.com?
A02 As both Novell and Oracle ship OCFS2 on different schedules, the package
    versions do not match. We expect this to resolve itself over time as the
    number of patch fixes reduces. Novell is shipping two SLES9 releases,
    viz., SP2 and SP3. The latest kernel with the SP2 release is
    2.6.5-7.202.7. It ships with OCFS2 1.0.8. The latest kernel with the SP3
    release is 2.6.5-7.257. It ships with OCFS2 1.2.1.

==============================================================================

What's New in 1.2
-----------------

Q01 What is new in OCFS2 1.2?
A01 OCFS2 1.2 has two new features:
    a) It is endian-safe. With this release, one can mount the same volume
       concurrently on x86, x86-64, ia64 and the big endian architectures
       ppc64 and s390x.
    b) It supports readonly mounts. The fs uses this feature to auto remount
       ro when encountering on-disk corruptions (instead of panic-ing).

Q02 Do I need to re-make the volume when upgrading?
A02 No. OCFS2 1.2 is fully on-disk compatible with 1.0.

Q03 Do I need to upgrade anything else?
A03 Yes, the tools need to be upgraded to ocfs2-tools 1.2. ocfs2-tools 1.0
    will not work with OCFS2 1.2 nor will the 1.2 tools work with the 1.0
    modules.

==============================================================================

Upgrading to the latest release
-------------------------------

Q01 How do I upgrade to the latest release?
A01 1. Download the latest ocfs2-tools and ocfs2console for the target
       platform and the appropriate ocfs2 module package for the kernel
       version, flavor and architecture. (For more, refer to the "Download
       and Install" section above.)
    2. Umount all OCFS2 volumes.
       # umount -at ocfs2
    3. Shutdown the cluster and unload the modules.
       # /etc/init.d/o2cb offline
       # /etc/init.d/o2cb unload
    4. If required, upgrade the tools and console.
       # rpm -Uvh ocfs2-tools-1.2.1-1.i386.rpm ocfs2console-1.2.1-1.i386.rpm
    5. Upgrade the module.
       # rpm -Uvh ocfs2-2.6.9-22.0.1.ELsmp-1.2.2-1.i686.rpm
    6. Ensure the init services ocfs2 and o2cb are enabled.
       # chkconfig --add o2cb
       # chkconfig --add ocfs2

    7. To check whether the services are enabled, do:
       # chkconfig --list o2cb
       o2cb      0:off   1:off   2:on    3:on    4:on    5:on    6:off
       # chkconfig --list ocfs2
       ocfs2     0:off   1:off   2:on    3:on    4:on    5:on    6:off
    8. At this stage one could either reboot the node or simply restart the
       cluster and mount the volume.

Q02 Can I do a rolling upgrade from 1.0.x/1.2.x to 1.2.2?
A02 Rolling upgrade to 1.2.2 is not recommended. Shutdown the cluster on all
    nodes before upgrading the nodes.

Q03 After upgrade I am getting the following error on mount "mount.ocfs2:
    Invalid argument while mounting /dev/sda6 on /ocfs".
A03 Do "dmesg | tail". If you see the error:
    >> ocfs2_parse_options:523 ERROR: Unrecognized mount option
    >> "heartbeat=local" or missing value
    it means that you are trying to use the 1.2 tools with the 1.0 modules.
    Ensure that you have unloaded the 1.0 modules and installed and loaded
    the 1.2 modules. Use modinfo to determine the version of the module
    installed and/or loaded.

Q04 The cluster fails to load. What do I do?
A04 Check "dmesg | tail" for any relevant errors. One common error is as
    follows:
    >> SELinux: initialized (dev configfs, type configfs), not configured for labeling
    >> audit(1139964740.184:2): avc: denied { mount } for ...
    The above error indicates that you have SELinux activated. A bug in
    SELinux does not allow configfs to mount. Disable SELinux by setting
    "SELINUX=disabled" in /etc/selinux/config. The change is activated on
    reboot.

==============================================================================

Processes
---------

Q01 List and describe all OCFS2 threads?
A01 [o2net]
    One per node. Is a workqueue thread started when the cluster is brought
    online and stopped when it is offlined. It handles the network
    communication for all threads. It gets the list of active nodes from the
    o2hb thread and sets up tcp/ip communication channels with each active
    node. It sends regular keepalive packets to detect any interruption on
    the channels.

    [user_dlm]
    One per node. Is a workqueue thread started when dlmfs is loaded and
    stopped on unload. (dlmfs is an in-memory file system which allows user
    space processes to access the dlm in kernel to lock and unlock
    resources.) It handles lock downconverts when requested by other nodes.

    [ocfs2_wq]
    One per node. Is a workqueue thread started when the ocfs2 module is
    loaded and stopped on unload. It handles blockable file system tasks
    like truncate log flush, orphan dir recovery and local alloc recovery,
    which involve taking dlm locks. Various code paths queue tasks to this
    thread. For example, ocfs2rec queues orphan dir recovery so that while
    the task is kicked off as part of recovery, its completion does not
    affect the recovery time.

    [o2hb-14C29A7392]
    One per heartbeat device. Is a kernel thread started when the heartbeat
    region is populated in configfs and stopped when it is removed. It
    writes every 2 secs to its block in the heartbeat region to indicate to
    other nodes that that node is alive. It also reads the region to
    maintain a nodemap of live nodes. It notifies o2net and the dlm of any
    changes in the nodemap.

    [ocfs2vote-0]
    One per mount. Is a kernel thread started when a volume is mounted and
    stopped on umount. It downgrades locks when requested by other nodes in
    response to blocking ASTs (BASTs). It also fixes up the dentry cache in
    response to files unlinked or renamed on other nodes.

    [dlm_thread]
    One per dlm domain. Is a kernel thread started when a dlm domain is
    created and stopped when it is destroyed.
    This is the core dlm thread which maintains the list of lock resources
    and handles the cluster locking infrastructure.

    [dlm_reco_thread]
    One per dlm domain. Is a kernel thread which handles dlm recovery
    whenever a node dies. If the node is the dlm recovery master, it
    remasters all the locks owned by the dead node.

    [dlm_wq]
    One per dlm domain. Is a workqueue thread. o2net queues dlm tasks on
    this thread.

    [kjournald]
    One per mount. Is used as OCFS2 uses JBD for journalling.

    [ocfs2cmt-0]
    One per mount. Is a kernel thread started when a volume is mounted and
    stopped on umount. It works in conjunction with kjournald.

    [ocfs2rec-0]
    Is started whenever another node needs to be recovered. This could be
    either on mount when it discovers a dirty journal or during operation
    when hb detects a dead node. ocfs2rec handles the file system recovery
    and it runs after the dlm has finished its recovery.

==============================================================================