OCFS2 - Frequently Asked Questions
==================================

General
-------

Q01 How do I get started?
A01 a) Download and install the module and tools rpms.
    b) Create cluster.conf and propagate it to all nodes.
    c) Configure and start the O2CB cluster service.
    d) Format the volume.
    e) Mount the volume.

Q02 How do I know which version is running?
A02 # cat /proc/fs/ocfs2/version
    OCFS2 1.2.1 Fri Apr 21 13:51:24 PDT 2006 (build bd2f25ba0af9677db3572e3ccd92f739)

Q03 How do I configure my system to auto-reboot after a panic?
A03 To auto-reboot the system 60 secs after a panic, do:
    # echo 60 > /proc/sys/kernel/panic
    To enable the above on every reboot, add the following to /etc/sysctl.conf:
    kernel.panic = 60

==============================================================================

Download and Install
--------------------

Q01 Where do I get the packages from?
A01 For Novell's SLES9, upgrade to SP3 to get the required modules installed.
    Also, install the ocfs2-tools and ocfs2console packages.
    For Red Hat's RHEL4, download and install the appropriate module package
    and the two tools packages, ocfs2-tools and ocfs2console. Appropriate
    module refers to one matching the kernel version, flavor and
    architecture. Flavor refers to smp, hugemem, etc.

Q02 What are the latest versions of the OCFS2 packages?
A02 The latest module package version is 1.2.2. The latest tools/console
    package version is 1.2.1.

Q03 How do I interpret the package name ocfs2-2.6.9-22.0.1.ELsmp-1.2.1-1.i686.rpm?
A03 The package name is comprised of multiple parts separated by '-'.
    a) ocfs2               - Package name
    b) 2.6.9-22.0.1.ELsmp  - Kernel version and flavor
    c) 1.2.1               - Package version
    d) 1                   - Package subversion
    e) i686                - Architecture

Q04 How do I know which package to install on my box?
A04 After one identifies the package name and version to install, one still
    needs to determine the kernel version, flavor and architecture.
    To know the kernel version and flavor, do:
    # uname -r
    2.6.9-22.0.1.ELsmp
    To know the architecture, do:
    # rpm -qf /boot/vmlinuz-`uname -r` --queryformat "%{ARCH}\n"
    i686

Q05 Why can't I use "uname -p" to determine the kernel architecture?
A05 "uname -p" does not always provide the exact kernel architecture. Case
    in point the RHEL3 kernels on x86_64. Even though Red Hat has two
    different kernel architectures available for this port, ia32e and
    x86_64, "uname -p" identifies both as the generic "x86_64".

Q06 How do I install the rpms?
A06 First install the tools and console packages:
    # rpm -Uvh ocfs2-tools-1.2.1-1.i386.rpm ocfs2console-1.2.1-1.i386.rpm
    Then install the appropriate kernel module package:
    # rpm -Uvh ocfs2-2.6.9-22.0.1.ELsmp-1.2.1-1.i686.rpm

Q07 Do I need to install the console?
A07 No, the console is not required but is recommended for ease of use.

Q08 What are the dependencies for installing ocfs2console?
A08 ocfs2console requires e2fsprogs, glib2 2.2.3 or later, vte 0.11.10 or
    later, pygtk2 (EL4) or python-gtk (SLES9) 1.99.16 or later, python 2.3
    or later and ocfs2-tools.

Q09 What modules are installed with the OCFS2 1.2 package?
A09 a) configfs.ko
    b) ocfs2.ko
    c) ocfs2_dlm.ko
    d) ocfs2_dlmfs.ko
    e) ocfs2_nodemanager.ko
    f) debugfs
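
Note: One quick way to confirm that the expected module package is in place
for the running kernel is to query the module directly. A minimal sketch;
the exact fields printed depend on how the module was built:
    # modinfo ocfs2        (prints details of the installed ocfs2.ko,
                            including its version when one is set)
    # lsmod | grep ocfs2   (lists the OCFS2 modules currently loaded, if any)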

Q10 What tools are installed with the ocfs2-tools 1.2 package?
A10 a) mkfs.ocfs2
    b) fsck.ocfs2
    c) tunefs.ocfs2
    d) debugfs.ocfs2
    e) mount.ocfs2
    f) mounted.ocfs2
    g) ocfs2cdsl
    h) ocfs2_hb_ctl
    i) o2cb_ctl
    j) o2cb         - init service to start/stop the cluster
    k) ocfs2        - init service to mount/umount ocfs2 volumes
    l) ocfs2console - installed with the console package

Q11 What is debugfs and is it related to debugfs.ocfs2?
A11 debugfs is an in-memory filesystem developed by Greg Kroah-Hartman. It
    is useful for debugging as it allows kernel space to easily export data
    to userspace. For more, see http://kerneltrap.org/node/4394. It is
    currently used by OCFS2 to dump the list of filesystem locks and could
    be used for more in the future. It is bundled with OCFS2 as the various
    distributions are currently not bundling it. While debugfs and
    debugfs.ocfs2 are unrelated in general, the latter is used as the
    front-end for the debugging info provided by the former. For an example,
    refer to the Troubleshooting section.

==============================================================================

Configure
---------

Q01 How do I populate /etc/ocfs2/cluster.conf?
A01 If you have installed the console, use it to create this configuration
    file. For details, refer to the user's guide. If you do not have the
    console installed, check the Appendix in the user's guide for a sample
    cluster.conf and the details of all its components. (A minimal sample is
    also shown at the end of this section.) Do not forget to copy this file
    to all the nodes in the cluster. If you ever edit this file on any node,
    ensure the other nodes are updated as well.

Q02 Should the IP interconnect be public or private?
A02 Using a private interconnect is recommended. While OCFS2 does not take
    much bandwidth, it does require the nodes to be alive on the network and
    sends regular keepalive packets to ensure that they are. To avoid a
    network delay being interpreted as a node disappearing on the net, which
    could lead to a node self-fencing, a private interconnect is
    recommended. One could use the same interconnect for Oracle RAC and
    OCFS2.

Q03 What should the node name be and should it be related to the IP address?
A03 The node name needs to match the hostname. The IP address need not be
    the one associated with that hostname. As in, any valid IP address on
    that node can be used. OCFS2 will not attempt to match the node name
    (hostname) with the specified IP address.

Q04 How do I modify the IP address, port or any other information specified
    in cluster.conf?
A04 While one can use ocfs2console to add nodes dynamically to a running
    cluster, any other modification requires the cluster to be offlined.
    Stop the cluster on all nodes, edit /etc/ocfs2/cluster.conf on one node
    and copy it to the rest, and restart the cluster on all nodes. Always
    ensure that cluster.conf is the same on all the nodes in the cluster.
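
Note: For reference, a minimal two-node cluster.conf could look like the
sketch below. The cluster name, node names, node numbers and IP addresses
are placeholders (the node names must match the hostnames), the attribute
lines must be indented, and the authoritative sample is the one in the
user's guide appendix:
    node:
            ip_port = 7777
            ip_address = 192.168.1.101
            number = 0
            name = node1
            cluster = ocfs2

    node:
            ip_port = 7777
            ip_address = 192.168.1.102
            number = 1
            name = node2
            cluster = ocfs2

    cluster:
            node_count = 2
            name = ocfs2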

==============================================================================

O2CB Cluster Service
--------------------

Q01 How do I configure the cluster service?
A01 # /etc/init.d/o2cb configure
    Enter 'y' if you want the service to load on boot, and enter the name of
    the cluster (as listed in /etc/ocfs2/cluster.conf).

Q02 How do I start the cluster service?
A02 a) To load the modules, do:
       # /etc/init.d/o2cb load
    b) To online it, do:
       # /etc/init.d/o2cb online [cluster_name]
    If you have configured the cluster to load on boot, you could combine
    the two as follows:
       # /etc/init.d/o2cb start [cluster_name]
    The cluster name is not required if you have specified the name during
    configuration.

Q03 How do I stop the cluster service?
A03 a) To offline it, do:
       # /etc/init.d/o2cb offline [cluster_name]
    b) To unload the modules, do:
       # /etc/init.d/o2cb unload
    If you have configured the cluster to load on boot, you could combine
    the two as follows:
       # /etc/init.d/o2cb stop [cluster_name]
    The cluster name is not required if you have specified the name during
    configuration.

Q04 How can I learn the status of the cluster?
A04 To learn the status of the cluster, do:
    # /etc/init.d/o2cb status

Q05 I am unable to get the cluster online. What could be wrong?
A05 Check whether the node name in cluster.conf exactly matches the
    hostname. For the cluster to come online on a node, that node needs to
    be one of the nodes listed in cluster.conf.

==============================================================================

Format
------

Q01 How do I format a volume?
A01 You could either use the console or use mkfs.ocfs2 directly to format
    the volume. For the console, refer to the user's guide.
    # mkfs.ocfs2 -L "oracle_home" /dev/sdX
    The above formats the volume with default block and cluster sizes, which
    are computed based upon the size of the volume.
    # mkfs.ocfs2 -b 4K -C 32K -L "oracle_home" -N 4 /dev/sdX
    The above formats the volume for 4 nodes with a 4K block size and a 32K
    cluster size.

Q02 What does the number of node slots during format refer to?
A02 The number of node slots specifies the number of nodes that can
    concurrently mount the volume. This number is specified during format
    and can be increased using tunefs.ocfs2 (see the example at the end of
    this section). This number cannot be decreased.

Q03 What should I consider when determining the number of node slots?
A03 OCFS2 allocates system files, like the journal, for each node slot. So
    as not to waste space, one should specify a number in the ballpark of
    the actual number of nodes. Also, as this number can be increased later,
    there is no need to specify a number much larger than the number of
    nodes one plans to mount the volume on.

Q04 Does the number of node slots have to be the same for all volumes?
A04 No. This number can be different for each volume.

Q05 What block size should I use?
A05 A block size is the smallest unit of space addressable by the file
    system. OCFS2 supports block sizes of 512 bytes, 1K, 2K and 4K. The
    block size cannot be changed after the format. For most volume sizes, a
    4K block size is recommended. On the other hand, a 512-byte block size
    is never recommended.

Q06 What cluster size should I use?
A06 A cluster size is the smallest unit of space allocated to a file to hold
    the data. OCFS2 supports cluster sizes of 4K, 8K, 16K, 32K, 64K, 128K,
    256K, 512K and 1M. For database volumes, a cluster size of 128K or
    larger is recommended. For Oracle home, 32K to 64K.

Q07 Any advantage of labelling the volumes?
A07 As the device name (/dev/sdX) for a particular device in a shared disk
    environment can be different on different nodes, labelling becomes a
    must for easy identification. You could also use labels to identify
    volumes during mount.
    # mount -L "label" /dir
    The volume label is changeable using the tunefs.ocfs2 utility.
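
Note: As a rough sketch of the two tunefs.ocfs2 changes referred to above
(the device, slot count and label below are placeholders; consult the
tunefs.ocfs2 man page of your tools version for the exact requirements):
    # tunefs.ocfs2 -N 8 /dev/sdX
    The above increases the number of node slots to 8.
    # tunefs.ocfs2 -L "new_label" /dev/sdX
    The above changes the volume label.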

==============================================================================

Mount
-----

Q01 How do I mount the volume?
A01 You could either use the console or use mount directly. For the console,
    refer to the user's guide.
    # mount -t ocfs2 /dev/sdX /dir
    The above command will mount device /dev/sdX on directory /dir.

Q02 How do I mount by label?
A02 To mount by label, do:
    # mount -L "label" /dir

Q03 What entry do I add to /etc/fstab to mount an ocfs2 volume?
A03 Add the following:
    /dev/sdX  /dir  ocfs2  noauto,_netdev  0 0
    The _netdev option indicates that the device needs to be mounted after
    the network is up.

Q04 What do I need to do to mount OCFS2 volumes on boot?
A04 a) Enable the o2cb service using:
       # chkconfig --add o2cb
    b) Enable the ocfs2 service using:
       # chkconfig --add ocfs2
    c) Configure o2cb to load on boot using:
       # /etc/init.d/o2cb configure
    d) Add entries into /etc/fstab as follows:
       /dev/sdX  /dir  ocfs2  _netdev  0 0

Q05 How do I know my volume is mounted?
A05 a) Enter mount without arguments, or
       # mount
    b) List /etc/mtab, or
       # cat /etc/mtab
    c) List /proc/mounts, or
       # cat /proc/mounts
    d) Run the ocfs2 init service.
       # /etc/init.d/ocfs2 status
    The mount command reads /etc/mtab to show the information.

Q06 What are the /config and /dlm mountpoints for?
A06 OCFS2 comes bundled with two in-memory filesystems, configfs and
    ocfs2_dlmfs. configfs is used by the ocfs2 tools to communicate to the
    in-kernel node manager the list of nodes in the cluster and to the
    in-kernel heartbeat thread the resource to heartbeat on. ocfs2_dlmfs is
    used by the ocfs2 tools to communicate with the in-kernel dlm to take
    and release clusterwide locks on resources.

Q07 Why does it take so much time to mount the volume?
A07 It takes around 5 secs for a volume to mount. It does so to let the
    heartbeat thread stabilize. In a later release, we plan to add support
    for a global heartbeat, which will make most mounts instant.

==============================================================================

Oracle RAC
----------

Q01 Any special flags to run Oracle RAC?
A01 OCFS2 volumes containing the voting diskfile (CRS), cluster registry
    (OCR), data files, redo logs, archive logs and control files must be
    mounted with the "datavolume" and "nointr" mount options. The
    "datavolume" option ensures that the Oracle processes open these files
    with the o_direct flag. The "nointr" option ensures that the I/Os are
    not interrupted by signals. (An example /etc/fstab entry is shown at the
    end of this section.)
    # mount -o datavolume,nointr -t ocfs2 /dev/sda1 /u01/db

Q02 What about the volume containing Oracle home?
A02 The Oracle home volume should be mounted normally, that is, without the
    "datavolume" and "nointr" mount options. These mount options are only
    relevant for the Oracle files listed above.
    # mount -t ocfs2 /dev/sdb1 /software/orahome

Q03 Does that mean I cannot have my data files and Oracle home on the same
    volume?
A03 Yes. The Oracle data files, redo logs, etc. should never be on the same
    volume as the distribution (including the trace logs like alert.log).
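
Note: To have such a volume mounted on boot, the same options can be listed
in /etc/fstab alongside _netdev. A sketch, assuming the example device and
mountpoint used above:
    /dev/sda1  /u01/db  ocfs2  _netdev,datavolume,nointr  0 0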

==============================================================================

Moving data from OCFS (Release 1) to OCFS2
------------------------------------------

Q01 Can I mount OCFS volumes as OCFS2?
A01 No. OCFS and OCFS2 are not on-disk compatible. We had to break the
    compatibility in order to add many of the new features. At the same
    time, we have added enough flexibility in the new disk layout so as to
    maintain backward compatibility in the future.

Q02 Can OCFS volumes and OCFS2 volumes be mounted on the same machine
    simultaneously?
A02 No. OCFS only works on 2.4 linux kernels (Red Hat's AS2.1/EL3 and SuSE's
    SLES8). OCFS2, on the other hand, only works on the 2.6 kernels (Red
    Hat's EL4 and SuSE's SLES9).

Q03 Can I access my OCFS volume on 2.6 kernels (SLES9/RHEL4)?
A03 Yes, you can access the OCFS volume on 2.6 kernels using the FSCat
    tools, fsls and fscp. These tools can access the OCFS volumes at the
    device layer, to list and copy the files to another filesystem. The
    FSCat tools are available on oss.oracle.com.

Q04 Can I in-place convert my OCFS volume to OCFS2?
A04 No. The on-disk layouts of OCFS and OCFS2 are sufficiently different
    that an in-place upgrade of the volume would require a third disk (as a
    temporary buffer). With that in mind, it was decided not to develop such
    a tool but instead provide tools to copy data from OCFS without one
    having to mount it.

Q05 What is the quickest way to move data from OCFS to OCFS2?
A05 Quickest would mean having to perform the minimal number of copies. If
    you have a current backup on a non-OCFS volume accessible from the 2.6
    kernel install, then all you would need to do is to restore the backup
    on the OCFS2 volume(s). If you do not have a backup but have a setup in
    which the system containing the OCFS2 volumes can access the disks
    containing the OCFS volume, you can use the FSCat tools to extract data
    from the OCFS volume and copy it onto OCFS2.

==============================================================================

Coreutils
---------

Q01 Like with OCFS (Release 1), do I need to use o_direct enabled tools to
    perform cp, mv, tar, etc.?
A01 No. OCFS2 does not need the o_direct enabled tools. The file system
    allows processes to open files in both o_direct and buffered modes
    concurrently.

==============================================================================

Troubleshooting
---------------

Q01 How do I enable and disable filesystem tracing?
A01 To list all the debug bits along with their statuses, do:
    # debugfs.ocfs2 -l
    To enable tracing the bit SUPER, do:
    # debugfs.ocfs2 -l SUPER allow
    To disable tracing the bit SUPER, do:
    # debugfs.ocfs2 -l SUPER off
    To totally turn off tracing the SUPER bit, as in, turn off tracing even
    if some other bit is enabled for the same, do:
    # debugfs.ocfs2 -l SUPER deny
    To enable heartbeat tracing, do:
    # debugfs.ocfs2 -l HEARTBEAT ENTRY EXIT allow
    To disable heartbeat tracing, do:
    # debugfs.ocfs2 -l HEARTBEAT off ENTRY EXIT deny

Q02 How do I get a list of filesystem locks and their statuses?
A02 OCFS2 1.0.9+ has this feature. To get this list, do:
    a) Mount debugfs at /debug.
       # mount -t debugfs debugfs /debug
    b) Dump the locks.
       # echo "fs_locks" | debugfs.ocfs2 /dev/sdX >/tmp/fslocks
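
Note: To have debugfs mounted automatically on every boot, an entry along
the following lines can be added to /etc/fstab (a sketch; /debug is simply
the mountpoint this FAQ uses):
    debugfs  /debug  debugfs  defaults  0 0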
# echo "locate <419616>" | debugfs.ocfs2 /dev/sdX debugfs.ocfs2 1.2.0 debugfs: 419616 /linux-2.6.15/arch/i386/kernel/semaphore.c To get a lockname from a directory entry, do: # echo "encode /linux-2.6.15/arch/i386/kernel/semaphore.c" | debugfs.ocfs2 /dev/sdX M000000000000000006672078b84822 D000000000000000006672078b84822 W000000000000000006672078b84822 The first is the Metadata lock, then Data lock and last ReadWrite lock for the same resource. The DLM supports 3 lock modes: NL no lock, PR protected read and EX exclusive. If you have a dlm hang, the resource to look for would be one with the "Busy" flag set. The next step would be to query the dlm for the lock resource. Note: The dlm debugging is still a work in progress. To do dlm debugging, first one needs to know the dlm domain, which matches the volume UUID. # echo "stats" | debugfs.ocfs2 -n /dev/sdX | grep UUID: | while read a b ; do echo $b ; done 82DA8137A49A47E4B187F74E09FBBB4B Then do: # echo R dlm_domain lockname > /proc/fs/ocfs2_dlm/debug For example: # echo R 82DA8137A49A47E4B187F74E09FBBB4B M000000000000000006672078b84822 > /proc/fs/ocfs2_dlm/debug # dmesg | tail struct dlm_ctxt: 82DA8137A49A47E4B187F74E09FBBB4B, node=75, key=965960985 lockres: M000000000000000006672078b84822, owner=79, state=0 last used: 0, on purge list: no granted queue: type=3, conv=-1, node=79, cookie=11673330234144325711, ast=(empty=y,pend=n), bast=(empty=y,pend=n) converting queue: blocked queue: It shows that the lock is mastered by node 75 and that node 79 has been granted a PR lock on the resource. This is just to give a flavor of dlm debugging. ============================================================================== Limits ------ Q01 Is there a limit to the number of subdirectories in a directory? A01 Yes. OCFS2 currently allows up to 32000 subdirectories. While this limit could be increased, we will not be doing it till we implement some kind of efficient name lookup (htree, etc.). Q02 Is there a limit to the size of an ocfs2 file system? A02 Yes, current software addresses block numbers with 32 bits. So the file system device is limited to (2 ^ 32) * blocksize (see mkfs -b). With a 4KB block size this amounts to a 16TB file system. This block addressing limit will be relaxed in future software. At that point the limit becomes addressing clusters of 1MB each with 32 bits which leads to a 4PB file system. ============================================================================== System Files ------------ Q01 What are system files? A01 System files are used to store standard filesystem metadata like bitmaps, journals, etc. Storing this information in files in a directory allows OCFS2 to be extensible. These system files can be accessed using debugfs.ocfs2. To list the system files, do: # echo "ls -l //" | debugfs.ocfs2 /dev/sdX 18 16 1 2 . 18 16 2 2 .. 19 24 10 1 bad_blocks 20 32 18 1 global_inode_alloc 21 20 8 1 slot_map 22 24 9 1 heartbeat 23 28 13 1 global_bitmap 24 28 15 2 orphan_dir:0000 25 32 17 1 extent_alloc:0000 26 28 16 1 inode_alloc:0000 27 24 12 1 journal:0000 28 28 16 1 local_alloc:0000 29 3796 17 1 truncate_log:0000 The first column lists the block number. Q02 Why do some files have numbers at the end? A02 There are two types of files, global and local. Global files are for all the nodes, while local, like journal:0000, are node specific. The set of local files used by a node is determined by the slot mapping of that node. The numbers at the end of the system file name is the slot#. 

==============================================================================

System Files
------------

Q01 What are system files?
A01 System files are used to store standard filesystem metadata like
    bitmaps, journals, etc. Storing this information in files in a directory
    allows OCFS2 to be extensible. These system files can be accessed using
    debugfs.ocfs2. To list the system files, do:
    # echo "ls -l //" | debugfs.ocfs2 /dev/sdX
        18    16     1   2   .
        18    16     2   2   ..
        19    24    10   1   bad_blocks
        20    32    18   1   global_inode_alloc
        21    20     8   1   slot_map
        22    24     9   1   heartbeat
        23    28    13   1   global_bitmap
        24    28    15   2   orphan_dir:0000
        25    32    17   1   extent_alloc:0000
        26    28    16   1   inode_alloc:0000
        27    24    12   1   journal:0000
        28    28    16   1   local_alloc:0000
        29  3796    17   1   truncate_log:0000
    The first column lists the block number.

Q02 Why do some files have numbers at the end?
A02 There are two types of system files, global and local. Global files are
    for all the nodes, while local files, like journal:0000, are node
    specific. The set of local files used by a node is determined by the
    slot mapping of that node. The number at the end of the system file name
    is the slot#. To list the slot maps, do:
    # echo "slotmap" | debugfs.ocfs2 -n /dev/sdX
        Slot#   Node#
            0      39
            1      40
            2      41
            3      42

==============================================================================

Heartbeat
---------

Q01 How does the disk heartbeat work?
A01 Every node writes every two secs to its block in the heartbeat system
    file. The block offset is equal to its global node number. So node 0
    writes to the first block, node 1 to the second, etc. All the nodes also
    read the heartbeat sysfile every two secs. As long as the timestamp is
    changing, that node is deemed alive.

Q02 When is a node deemed dead?
A02 An active node is deemed dead if it does not update its timestamp for
    O2CB_HEARTBEAT_THRESHOLD (default=7) loops. Once a node is deemed dead,
    the surviving node which manages to cluster lock the dead node's journal
    recovers it by replaying the journal.

Q03 What about self fencing?
A03 A node self-fences if it fails to update its timestamp for
    ((O2CB_HEARTBEAT_THRESHOLD - 1) * 2) secs. The [o2hb-xx] kernel thread,
    after every timestamp write, sets a timer to panic the system after that
    duration. If the next timestamp is written within that duration, as it
    should be, it first cancels that timer before setting up a new one. This
    way it ensures the system will self fence if for some reason the
    [o2hb-xx] kernel thread is unable to update the timestamp and would thus
    be deemed dead by the other nodes in the cluster.

Q04 How can one change the parameter value of O2CB_HEARTBEAT_THRESHOLD?
A04 This parameter value could be changed by adding it to
    /etc/sysconfig/o2cb and RESTARTING the O2CB cluster. This value should
    be the SAME on ALL the nodes in the cluster.

Q05 What should one set O2CB_HEARTBEAT_THRESHOLD to?
A05 It should be set to the timeout value of the io layer. Most multipath
    solutions have a timeout ranging from 60 secs to 120 secs. For 60 secs,
    set it to 31. For 120 secs, set it to 61.
    O2CB_HEARTBEAT_THRESHOLD = (((timeout in secs) / 2) + 1)
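
Note: Putting the two previous answers together, for a 120 sec io timeout
one would add a line along the following lines to /etc/sysconfig/o2cb on
every node and then restart the cluster (the value shown is an example;
compute yours with the formula above):
    O2CB_HEARTBEAT_THRESHOLD=61

    # /etc/init.d/o2cb stop
    # /etc/init.d/o2cb start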

Q06 What if a node umounts a volume?
A06 During umount, the node will broadcast to all the nodes that have
    mounted that volume to drop that node from their node maps. As the
    journal is shutdown before this broadcast, any node crash after this
    point is ignored as there is no need for recovery.

Q07 I encounter "Kernel panic - not syncing: ocfs2 is very sorry to be
    fencing this system by panicing" whenever I run a heavy io load?
A07 We have encountered a bug with the default "cfq" io scheduler which
    causes a process doing heavy io to temporarily starve out other
    processes. While this is not fatal for most environments, it is for
    OCFS2 as we expect the hb thread to be able to read/write the hb area at
    least once every 12 secs (default). A bug with the fix has been filed
    with Red Hat and Novell. For more, refer to the tracker bug filed on
    bugzilla:
    http://oss.oracle.com/bugzilla/show_bug.cgi?id=671
    Till this issue is resolved, one is advised to use the "deadline" io
    scheduler. To use deadline, add "elevator=deadline" to the kernel
    command line as follows:
    1. For SLES9, edit the command line in /boot/grub/menu.lst.
       title Linux 2.6.5-7.244-bigsmp elevator=deadline
           kernel (hd0,4)/boot/vmlinuz-2.6.5-7.244-bigsmp root=/dev/sda5 vga=0x314 selinux=0 splash=silent resume=/dev/sda3 elevator=deadline showopts console=tty0 console=ttyS0,115200 noexec=off
           initrd (hd0,4)/boot/initrd-2.6.5-7.244-bigsmp
    2. For RHEL4, edit the command line in /boot/grub/grub.conf:
       title Red Hat Enterprise Linux AS (2.6.9-22.EL)
           root (hd0,0)
           kernel /vmlinuz-2.6.9-22.EL ro root=LABEL=/ console=ttyS0,115200 console=tty0 elevator=deadline noexec=off
           initrd /initrd-2.6.9-22.EL.img
    To see the current kernel command line, do:
    # cat /proc/cmdline

==============================================================================

Quorum and Fencing
------------------

Q01 What is a quorum?
A01 A quorum is a designation given to a group of nodes in a cluster which
    are still allowed to operate on shared storage. It comes up when there
    is a failure in the cluster which breaks the nodes up into groups which
    can communicate in their groups and with the shared storage but not
    between groups.

Q02 How does OCFS2's cluster service define a quorum?
A02 The quorum decision is made by a single node based on the number of
    other nodes that are considered alive by heartbeating and the number of
    other nodes that are reachable via the network.
    A node has quorum when:
    * it sees an odd number of heartbeating nodes and has network
      connectivity to more than half of them, or
    * it sees an even number of heartbeating nodes and has network
      connectivity to at least half of them *and* has connectivity to the
      heartbeating node with the lowest node number.

Q03 What is fencing?
A03 Fencing is the act of forcefully removing a node from a cluster. A node
    with OCFS2 mounted will fence itself when it realizes that it doesn't
    have quorum in a degraded cluster. It does this so that other nodes
    won't get stuck trying to access its resources. Currently OCFS2 will
    panic the machine when it realizes it has to fence itself off from the
    cluster. As described in Q02, it will do this when it sees more nodes
    heartbeating than it has connectivity to and fails the quorum test.

Q04 How does a node decide that it has connectivity with another?
A04 When a node sees another come to life via heartbeating, it will try and
    establish a TCP connection to that newly live node. It considers that
    other node connected as long as the TCP connection persists and the
    connection is not idle for 10 seconds. Once that TCP connection is
    closed or idle, it will not be reestablished until heartbeat thinks the
    other node has died and come back alive.

Q05 How long does the quorum process take?
A05 First a node will realize that it doesn't have connectivity with another
    node. This can happen immediately if the connection is closed but can
    take a maximum of 10 seconds of idle time. Then the node must wait long
    enough to give heartbeating a chance to declare the node dead. It does
    this by waiting two iterations longer than the number of iterations
    needed to consider a node dead (see Q02 in the Heartbeat section of this
    FAQ). The current default of 7 iterations of 2 seconds results in
    waiting for 9 iterations or 18 seconds. By default, then, a maximum of
    28 seconds can pass from the time a network fault occurs until a node
    fences itself.

Q06 How can one prevent a node from panicking when one shuts down the other
    node in a 2-node cluster?
A06 This typically means that the network is shutting down before all the
    OCFS2 volumes are being umounted. Ensure the ocfs2 init script is
    enabled. This script ensures that the OCFS2 volumes are umounted before
    the network is shutdown. To check whether the service is enabled, do:
    # chkconfig --list ocfs2
    ocfs2     0:off   1:off   2:on    3:on    4:on    5:on    6:off

Q07 How does one list out the startup and shutdown ordering of the OCFS2
    related services?

A07 To list the startup order for runlevel 3 on RHEL4, do:
    # cd /etc/rc3.d
    # ls S*ocfs2* S*o2cb* S*network*
    S10network  S24o2cb  S25ocfs2
    To list the shutdown order on RHEL4, do:
    # cd /etc/rc6.d
    # ls K*ocfs2* K*o2cb* K*network*
    K19ocfs2  K20o2cb  K90network
    To list the startup order for runlevel 3 on SLES9, do:
    # cd /etc/init.d/rc3.d
    # ls S*ocfs2* S*o2cb* S*network*
    S05network  S07o2cb  S08ocfs2
    To list the shutdown order on SLES9, do:
    # cd /etc/init.d/rc3.d
    # ls K*ocfs2* K*o2cb* K*network*
    K14ocfs2  K15o2cb  K17network
    Please note that the default ordering in the ocfs2 scripts only includes
    the network service and not any shared-device specific service, like
    iscsi. If one is using iscsi or any shared device requiring a service to
    be started and shut down, please ensure that that service starts before
    and shuts down after the ocfs2 init service.

==============================================================================

Novell SLES9
------------

Q01 Why are OCFS2 packages for SLES9 not made available on oss.oracle.com?
A01 OCFS2 packages for SLES9 are available directly from Novell as part of
    the kernel. The same is true for the various Asianux distributions and
    for Ubuntu. As OCFS2 is now part of the mainline kernel
    (http://lwn.net/Articles/166954/), we expect more distributions to
    bundle the product with the kernel.

Q02 What versions of OCFS2 are available with SLES9 and how do they match
    with the Red Hat versions available on oss.oracle.com?
A02 As both Novell and Oracle ship OCFS2 on different schedules, the package
    versions do not match. We expect this to resolve itself over time as the
    number of patch fixes reduces. Novell is shipping two SLES9 releases,
    viz., SP2 and SP3. The latest kernel with the SP2 release is
    2.6.5-7.202.7. It ships with OCFS2 1.0.8. The latest kernel with the SP3
    release is 2.6.5-7.257. It ships with OCFS2 1.2.1.

==============================================================================

What's New in 1.2
-----------------

Q01 What is new in OCFS2 1.2?
A01 OCFS2 1.2 has two new features:
    a) It is endian-safe. With this release, one can mount the same volume
       concurrently on x86, x86-64, ia64 and the big endian architectures
       ppc64 and s390x.
    b) It supports readonly mounts. The fs uses this feature to auto remount
       ro when encountering on-disk corruptions (instead of panic-ing).

Q02 Do I need to re-make the volume when upgrading?
A02 No. OCFS2 1.2 is fully on-disk compatible with 1.0.

Q03 Do I need to upgrade anything else?
A03 Yes, the tools need to be upgraded to ocfs2-tools 1.2. ocfs2-tools 1.0
    will not work with OCFS2 1.2 nor will the 1.2 tools work with the 1.0
    modules.

==============================================================================

Upgrading to the latest release
-------------------------------

Q01 How do I upgrade to the latest release?
A01 1. Download the latest ocfs2-tools and ocfs2console for the target
       platform and the appropriate ocfs2 module package for the kernel
       version, flavor and architecture. (For more, refer to the "Download
       and Install" section above.)
    2. Umount all OCFS2 volumes.
       # umount -at ocfs2
    3. Shutdown the cluster and unload the modules.
       # /etc/init.d/o2cb offline
       # /etc/init.d/o2cb unload
    4. If required, upgrade the tools and console.
       # rpm -Uvh ocfs2-tools-1.2.1-1.i386.rpm ocfs2console-1.2.1-1.i386.rpm
    5. Upgrade the module.
       # rpm -Uvh ocfs2-2.6.9-22.0.1.ELsmp-1.2.2-1.i686.rpm
    6. Ensure the init services ocfs2 and o2cb are enabled.
       # chkconfig --add o2cb
       # chkconfig --add ocfs2

    7. To check whether the services are enabled, do:
       # chkconfig --list o2cb
       o2cb      0:off   1:off   2:on    3:on    4:on    5:on    6:off
       # chkconfig --list ocfs2
       ocfs2     0:off   1:off   2:on    3:on    4:on    5:on    6:off
    8. At this stage one could either reboot the node or simply restart the
       cluster and mount the volume.

Q02 Can I do a rolling upgrade from 1.0.x/1.2.x to 1.2.2?
A02 Rolling upgrade to 1.2.2 is not recommended. Shutdown the cluster on all
    nodes before upgrading the nodes.

Q03 After upgrade I am getting the following error on mount "mount.ocfs2:
    Invalid argument while mounting /dev/sda6 on /ocfs".
A03 Do "dmesg | tail". If you see the error:
    >> ocfs2_parse_options:523 ERROR: Unrecognized mount option
    >> "heartbeat=local" or missing value
    it means that you are trying to use the 1.2 tools with the 1.0 modules.
    Ensure that you have unloaded the 1.0 modules and installed and loaded
    the 1.2 modules. Use modinfo to determine the version of the module
    installed and/or loaded.

Q04 The cluster fails to load. What do I do?
A04 Check "dmesg | tail" for any relevant errors. One common error is as
    follows:
    >> SELinux: initialized (dev configfs, type configfs), not configured for labeling
    >> audit(1139964740.184:2): avc: denied { mount } for ...
    The above error indicates that you have SELinux activated. A bug in
    SELinux does not allow configfs to mount. Disable SELinux by setting
    "SELINUX=disabled" in /etc/selinux/config. The change is activated on
    reboot.

==============================================================================

Processes
---------

Q01 List and describe all OCFS2 threads?
A01 [o2net]
    One per node. Is a workqueue thread started when the cluster is brought
    online and stopped when it is offlined. It handles the network
    communication for all threads. It gets the list of active nodes from the
    o2hb thread and sets up tcp/ip communication channels with each active
    node. It sends regular keepalive packets to detect any interruption on
    the channels.

    [user_dlm]
    One per node. Is a workqueue thread started when dlmfs is loaded and
    stopped on unload. (dlmfs is an in-memory file system which allows user
    space processes to access the dlm in kernel to lock and unlock
    resources.) It handles lock downconverts when requested by other nodes.

    [ocfs2_wq]
    One per node. Is a workqueue thread started when the ocfs2 module is
    loaded and stopped on unload. It handles blockable file system tasks
    like truncate log flush, orphan dir recovery and local alloc recovery,
    which involve taking dlm locks. Various code paths queue tasks to this
    thread. For example, ocfs2rec queues orphan dir recovery so that while
    the task is kicked off as part of recovery, its completion does not
    affect the recovery time.

    [o2hb-14C29A7392]
    One per heartbeat device. Is a kernel thread started when the heartbeat
    region is populated in configfs and stopped when it is removed. It
    writes every 2 secs to its block in the heartbeat region to indicate to
    other nodes that that node is alive. It also reads the region to
    maintain a nodemap of live nodes. It notifies o2net and the dlm of any
    changes in the nodemap.

    [ocfs2vote-0]
    One per mount. Is a kernel thread started when a volume is mounted and
    stopped on umount. It downgrades locks when requested by other nodes in
    response to blocking ASTs (BASTs). It also fixes up the dentry cache in
    response to files unlinked or renamed on other nodes.

    [dlm_thread]
    One per dlm domain. Is a kernel thread started when a dlm domain is
    created and stopped when it is destroyed.
    This is the core dlm thread which maintains the list of lock resources
    and handles the cluster locking infrastructure.

    [dlm_reco_thread]
    One per dlm domain. Is a kernel thread which handles dlm recovery
    whenever a node dies. If the node is the dlm recovery master, it
    remasters all the locks owned by the dead node.

    [dlm_wq]
    One per dlm domain. Is a workqueue thread. o2net queues dlm tasks on
    this thread.

    [kjournald]
    One per mount. Is used as OCFS2 uses JBD for journalling.

    [ocfs2cmt-0]
    One per mount. Is a kernel thread started when a volume is mounted and
    stopped on umount. It works in conjunction with kjournald.

    [ocfs2rec-0]
    Is started whenever another node needs to be recovered. This could be
    either on mount when it discovers a dirty journal or during operation
    when hb detects a dead node. ocfs2rec handles the file system recovery
    and it runs after the dlm has finished its recovery.

==============================================================================