From:	CSBVAX::MRGATE!RELAY-INFO-VAX@CRVAX.SRI.COM@SMTP 21-SEP-1988 02:22
To:	ARISIA::EVERHART
Subj:	Thoughts On Dump Files. (Creation, Management And Analysis Of)...


Received: From KL.SRI.COM by CRVAX.SRI.COM with TCP; Tue, 20 SEP 88 21:30:23 PDT
Received: from central.cis.upenn.edu by KL.SRI.COM with TCP; Tue, 20 Sep 88 21:15:13 PDT
Received: from LINC.CIS.UPENN.EDU by central.cis.upenn.edu
	id AA07047; Tue, 20 Sep 88 22:02:29 EST
Received: from XRT.UPENN.EDU by linc.cis.upenn.edu
	id AA27511; Tue, 20 Sep 88 22:02:21 EDT
Posted-Date: Tue, 20 Sep 88 22:03 EDT
Message-Id: <8809210202.AA27511@linc.cis.upenn.edu>
Date: Tue, 20 Sep 88 22:03 EDT
From: "Clayton, Paul D." <CLAYTON@xrt.upenn.edu>
Subject: Thoughts On Dump Files. (Creation, Management And Analysis Of)...
To: INFO-VAX@KL.SRI.COM
X-Vms-To: @INFOVAX,CLAYTON


Up to three VMS SYSGEN parameters control how VMS performs a memory dump in 
the event of a fatal error. They are listed below.

	SAVEDUMP	- ignored if the dump is written to 
                                SYS$SYSTEM:SYSDUMP.DMP; otherwise the following 
                                actions are taken.
			- when set to '0', and the crash dump is written to 
                                the system page file, SYS$SYSTEM:PAGEFILE.SYS, 
                                because there is no separate dump file, 
                                SYS$SYSTEM:SYSDUMP.DMP, the dump will NOT be 
                                kept on the next system boot. This eliminates 
                                any chance of analyzing it to determine what 
                                went wrong.
			- when set to '1', and the crash dump is written to 
                                the system page file, it is mandatory to 
                                perform the following step as part of the next 
                                system boot.

		$ANALYZE/CRASH_DUMP SYS$SYSTEM:PAGEFILE.SYS
		COPY XXX.YY
		EXIT
			where: XXX.YY is the file to use for storing the dump

				This should be done first thing in the boot 
                                process so that the space in the page file can 
                                be used for its intended purpose. If this is 
                                not done, the space holding the dump is not 
                                available for paging, and the system may not be 
                                able to boot completely without problems. For 
				those still running V3 of VMS, note that a bug 
				in the COPY command resulted in the SP register 
				being increased by eight (8) for each COPY 
				command issued on the same dump file. This was 
				corrected in V4 of VMS.
	DUMPBUG		- when set to '0', neither the memory/processor status 
                                nor the error log buffers at the time of the 
                                shutdown are written to a file for later 
                                analysis. In other words, there will be nothing 
                                to help prevent the problem from recurring, 
                                since this information is not available.
			- when set to '1', a memory dump, the current 
                                processor status, and the error log buffers are 
                                written to one of two places. If the file 
                                SYS$SYSTEM:SYSDUMP.DMP was present at boot 
                                time, it is the first choice to write to. If 
                                this file is not present, the information is 
                                written to the primary system page file, 
                                SYS$SYSTEM:PAGEFILE.SYS.
	DUMPSTYLE	- when set to '0', all of physical memory is written 
	(VMS V5 only)		to the dump file, whichever one is used. This 
                                is the same behavior as under VMS V4.
			- when set to '1', only those portions of physical 
                                memory that were marked as 'valid' at the time 
                                of the shutdown are written to the dump file. 
                                Physical memory used by and allocated to VMS is 
                                written first; then, if there is more space in 
                                the dump file, memory taken by user processes 
                                is written as well. If the dump file is too 
                                small to hold the pages VMS is using, no later 
                                analysis can be performed. Note that you have 
                                no control over which user processes get 
                                written to the dump and which do not when the 
                                dump file is too small to hold everything.
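
How the three parameters interact can be sketched in a few lines of Python. 
This is only a model of the rules as described above, not anything VMS itself 
runs. The names plan_dump and DumpPlan are made up for illustration, and 
DUMPSTYLE is left out because it affects what is written, not where.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DumpPlan:
    target: Optional[str]       # file that receives the dump, None if no dump is written
    kept_after_reboot: bool     # is the dump still available for analysis after the boot?
    needs_copy_at_boot: bool    # must the dump be copied out of the page file at boot?


def plan_dump(dumpbug: int, savedump: int, sysdump_present_at_boot: bool) -> DumpPlan:
    """Hypothetical model of the DUMPBUG / SAVEDUMP rules described in the text."""
    if dumpbug == 0:
        # No memory dump, processor status, or error log buffers are written.
        return DumpPlan(None, False, False)
    if sysdump_present_at_boot:
        # A separate dump file is the first choice; SAVEDUMP is ignored.
        return DumpPlan("SYS$SYSTEM:SYSDUMP.DMP", True, False)
    # No separate dump file: the dump goes to the primary page file and is
    # only kept across the boot when SAVEDUMP is 1.
    keep = savedump == 1
    return DumpPlan("SYS$SYSTEM:PAGEFILE.SYS", keep, keep)


if __name__ == "__main__":
    for args in [(0, 1, True), (1, 0, True), (1, 0, False), (1, 1, False)]:
        print(args, plan_dump(*args))
```

The last case (DUMPBUG=1, SAVEDUMP=1, no SYSDUMP.DMP) is the one that calls 
for copying the dump out of the page file early in the boot.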

There are several important issues to understand here. Problems with any one of 
them can result in useless, or no, information to aid in future analysis.

1. System dump files are not created on the 'fly' like those for terminal 
	servers, job controller and printer symbionts. These files have to be 
	created and maintained by the system manager.
2. Under VMS 4, the dump file must be at least the size given by the 
	following equation.
		# of physical pages of memory + 4
	It does not matter whether the dump file is the primary system page 
	file or the separate SYS$SYSTEM:SYSDUMP.DMP file; the required size is 
	the same. It does not hurt anything if the file is larger than this 
	value. If the page file is used, the primary page file must be at 
	least this big; all other page files are ignored for this purpose. The 
	bottom line here is that under V4, you have to save all the information 
	to be able to use any of it.
3. Under VMS 5, the dump file does not have to be big enough to hold all the 
	information, regardless of which file is used. The amount of 
	information that is stored is determined by the size of the file. If 
	it is too small, the particular piece of information required to 
	determine the exact cause of the crash may not be available, or the 
	contents of the dump file could be totally useless. It should also be 
	noted that you have no control over what parts are saved when a 
	'compressed' dump is performed. A good starting point for sizing a 
	very usable 'compressed' dump file is to average the maximum amount of 
	physical memory USED over a given time period, then add between 2,000 
	and 7,000 to this. An increase in system usage could change this value 
	over time. The bottom line here is that a partial dump is supported 
	and an analysis can be performed on the parts that are saved.
4. If there is no SYS$SYSTEM:SYSDUMP.DMP file present at the time of the boot, 
        then the dump will be to the primary page file. If the file, 
        SYS$SYSTEM:SYSDUMP.DMP, was created after the time of the boot, it will 
        not be used until after the system has been rebooted and then taken 
	down again.
5. The file SYS$SYSTEM:SYSDUMP.DMP is not kept open by VMS during the course of 
	normal system operation. In other words, the command
		$SHOW DEVICE/FILE SYS$SYSDEVICE
	would not show the file as open. This also means that if the DCL 
	command DELETE is issued against this file, it will not report any 
	error about the file being 'locked' by another user, and the file will 
	in fact be deleted from the disk. This is a problem area that must be 
	avoided at all costs. If this file exists at boot time and is then 
	deleted, for whatever reason, and the system is later shut down or 
	crashes, the system disk is in all likelihood corrupted. The amount of 
	corruption depends largely on how heavily the disk is used for purposes 
	other than the VMS operating system itself. When VMS writes the dump 
	file, 'normal' VMS disk I/O operations are not used. The bootstrap 
	device driver, which is a bare-bones subsystem, writes the information 
	to the file. The true starting logical block number of the dump file, 
	which VMS stores at boot time, is used to locate where to write the 
	information. No directory lookups are performed to 'find' the file, 
	and the information is written directly. The implication is that no 
	checks are made during shutdown to determine whether the dump file 
	present at boot time is still around at shutdown time. If it was 
	deleted and the system continued operation, then the space that the 
	dump file had reserved would be used, as needed, for new files on the 
	disk. These new files, and maybe some file headers themselves, WILL be 
	overwritten when it comes time for VMS to write to the dump file it 
	knew of at the previous boot.
6. If the dump file is deleted, by whatever cause and for whatever reason, the 
	only safe way to bring down the VAX processor that was to use that 
	dump file is to HALT the machine. Do not do a normal shutdown or 
	'@CRASH'. Before this drastic action is taken, you can dismount the 
	disks yourself, provided no files are open, and stop the queue manager. 
	A new version of the dump file should also be created before the 
	machine is halted and rebooted. Note that this new dump file does not 
	replace the deleted dump file for the purpose of this system shutdown 
	and will not be used for it. The new dump file will be used the next 
	time the system is shut down.
7. In order to conserve space on the system disk, the dump file(s) can be 
	shared between processors by placing the dump file in 
	SYS$COMMON:[SYSEXE], instead of the usual node specific, 
	SYS$SPECIFIC:[SYSEXE] directory. While this can save considerable 
	space, there are several drawbacks to this method.
	a.	This only works on 'Cluster Common System Disks', which are 
		used for both CI and NI based VAXClusters, and only for 
		processors that are using the same disk as their system disk.
	b.	Under VMS V4, and under V5 when compression is disabled, the 
		dump file has to be large enough to accommodate the largest 
		memory size of any single processor in the group that is 
		sharing it. Under VMS 5 with compression enabled, the size of 
		the dump file has to be the 'best guess' of what will hold all 
		the needed information.
	c.	If multiple VAX processors that are sharing a dump file crash 
		or otherwise come down, the contents of the dump file are 
		questionable. The Distributed Lock Manager is not used when 
		the information is written out, so the result may be a 
		'mixture' from several processors, which renders it unusable 
		for later analysis.
	If the above conditions are acceptable, then the dump file can be 
	shared. In order to share the dump file, pick the largest one and issue 
	the following command.
	   $RENAME/LOG SYS$SPECIFIC:[SYSEXE]SYSDUMP.DMP SYS$COMMON:[SYSEXE]*
	This command should be issued when logged into the machine with the 
	largest dump file. Note that the RENAME command does not 'move' the 
	file header or the file itself, so the system whose dump file is being 
	moved can still be taken down normally without corrupting the system 
	disk. The other processors that are to share the dump file should be 
	taken down normally, and their dump file in the directory:
		disk:[SYSx.SYSEXE]SYSDUMP.DMP
			where:
				disk: is the disk that the group uses as their 
					system disk
				x     is the 'root number' for the processor(s)
					that no longer need a dump file specific
					to them.
	can be deleted from any remaining VAX processor(s) in the VAXCluster.
8. The 'compression' feature available under VMS V5 will save disk space as 
	well, but the size of the dump file needed to hold the information 
	required for a complete analysis can change from one crash to the 
	next. It depends on the problem that caused the system to crash in the 
	first place.
9. If there are problems with the dump file (it is not present or was 
	deleted), and/or the page file dump was not saved, then the ERRLOG 
	buffers from the time of the crash are also not available for analysis. 
	These error buffers may hold vital information about the hardware 
	failure that actually caused the problem. After a good system dump and 
	later reboot, the contents of the error buffers from the dump file are 
	written to the ERRLOG process for recording purposes and later use by 
	maintenance personnel.
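
The sizing rules in items 2, 3 and 7b come down to a little arithmetic, shown 
here as a rough Python sketch. The function names and the node names in the 
example are made up. Two assumptions are mine rather than from the rules 
above: that the 2,000 to 7,000 margin is counted in pages, and that one VAX 
page (512 bytes) occupies one disk block.

```python
from typing import Dict, List, Tuple

PAGE_BYTES = 512  # VAX page size, the same as one disk block


def full_dump_pages(physical_pages: int) -> int:
    """Size for a full dump (V4, or V5 without compression): pages + 4."""
    return physical_pages + 4


def shared_full_dump_pages(pages_per_node: Dict[str, int]) -> int:
    """A shared full dump file must fit the largest-memory node in the group."""
    return max(full_dump_pages(p) for p in pages_per_node.values())


def compressed_dump_range(peak_used_pages: List[int]) -> Tuple[int, int]:
    """Starting-point range for a V5 'compressed' dump file.

    peak_used_pages holds the maximum physical memory in use, sampled over
    some period; the margin of 2,000 to 7,000 is assumed to be in pages.
    """
    average = sum(peak_used_pages) // len(peak_used_pages)
    return average + 2000, average + 7000


if __name__ == "__main__":
    pages_16mb = 16 * 1024 * 1024 // PAGE_BYTES          # 32768 pages
    print(full_dump_pages(pages_16mb))                    # 32772
    print(shared_full_dump_pages({"NODEA": 32768, "NODEB": 65536}))  # 65540
    print(compressed_dump_range([20000, 22000, 24000]))   # (24000, 29000)
```

Since the 'compressed' figure is only a starting point, it is worth checking 
again whenever system usage grows.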

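The corruption scenario in item 5 can be shown with a toy model in Python. 
This is purely illustrative: ToyDisk and write_dump_by_lbn are invented names, 
the disk is just a list of block owners, and allocation is simple first fit, 
which is not how the real file system works. The only point is that a writer 
going by a remembered starting block number tramples whatever now lives there.

```python
class ToyDisk:
    """A disk modelled as a list of blocks, each tagged with its owner."""

    def __init__(self, nblocks):
        self.blocks = ["free"] * nblocks
        self.files = {}  # file name -> (starting block, length)

    def create(self, name, length):
        # First-fit contiguous allocation (a simplification).
        run = 0
        for lbn, owner in enumerate(self.blocks):
            run = run + 1 if owner == "free" else 0
            if run == length:
                start = lbn - length + 1
                for i in range(start, start + length):
                    self.blocks[i] = name
                self.files[name] = (start, length)
                return start
        raise RuntimeError("no contiguous free space")

    def delete(self, name):
        start, length = self.files.pop(name)
        for i in range(start, start + length):
            self.blocks[i] = "free"


def write_dump_by_lbn(disk, start_lbn, length, dump_name="SYSDUMP.DMP"):
    """Write the dump at a remembered block number, with no directory lookup.

    Returns the set of other files whose blocks were overwritten.
    """
    clobbered = set()
    for i in range(start_lbn, start_lbn + length):
        if disk.blocks[i] not in ("free", dump_name):
            clobbered.add(disk.blocks[i])
        disk.blocks[i] = "dump data"
    return clobbered


if __name__ == "__main__":
    disk = ToyDisk(20)
    remembered = (disk.create("SYSDUMP.DMP", 8), 8)  # noted at "boot" time
    disk.delete("SYSDUMP.DMP")                       # deleted while running
    disk.create("USER.DAT", 5)                       # reuses the freed space
    print(write_dump_by_lbn(disk, *remembered))      # {'USER.DAT'}
```

In the model, halting instead of shutting down corresponds to never calling 
write_dump_by_lbn, which is why item 6 recommends it.
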
Hope this helps some in understanding how dump files work, and how to manage
them in the future. :-)

pdc

Paul D. Clayton 
Address - CLAYTON%XRT@RELAY.UPENN.EDU

Disclaimer:  All thoughts and statements here are my own and NOT those of my 
employer, and are also not based on, or contain, restricted information.