MEASURING MEMORY ON LINUX AND UNIX
Roger Snowden, Center of Expertise, Oracle Support
November 14, 2007
ABSTRACT
Although modern server platforms use virtual memory managers to provide resilience and
robustness of operation, severe shortages of physical memory can have negative and even
catastrophic consequences for production servers. In order to provide sufficient physical memory for
optimal operation, it is necessary to have a basic understanding of how virtual memory is used, and a
means to measure current and historic memory usage.
This article provides a brief, elementary explanation of virtual memory managers, as typically
implemented on Unix and Linux systems; and introduces readers to simple methods and tools to
ascertain memory usage. The reader is also offered simple techniques to detect and diagnose
problems associated with physical memory shortfalls.
This article is intended for system administrators, database administrators, and managers who wish to
determine virtual and physical memory utilization on Unix and Linux platforms and to conduct basic
capacity planning.
WHAT IS VIRTUAL MEMORY?
In very simple terms, virtual memory is a technique whereby an application program is able to use
large amounts of memory, which can exceed the physical memory on a machine. Essentially, physical
memory used by a process is extended transparently, by using disk resources.
In order for a process to access more memory than exists on a machine, physical memory is divided
into uniform-sized pages. When a process needs to allocate memory, it obtains that memory from the
virtual memory manager. The virtual memory manager obtains a reference to a page of physical
memory from a pool of memory reserved for that purpose by the operating system, and places that
reference in the page table.
The application process does not need to do anything special to access or manage the memory other
than use the appropriate operating system call to make the allocation request. Memory use via virtual
memory is meant to be entirely transparent to application processes. A diagram of processes and
virtual memory components follows, with some discussion of those components and their functions.
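The lazy, page-granular nature of allocation described above can be observed directly. The following Python sketch is Linux-specific (it reads the VmSize and VmRSS fields of /proc/self/status); it reserves a large anonymous mapping and shows that virtual size grows immediately, while resident (physical) size grows only once the pages are actually touched. The 64 MiB figure is an arbitrary illustration.

```python
import mmap
import os

def vm_stats():
    """Read this process's virtual and resident sizes (kB) from
    /proc/self/status (Linux-specific field names VmSize and VmRSS)."""
    stats = {}
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith(("VmSize:", "VmRSS:")):
                key, value = line.split()[:2]
                stats[key.rstrip(":")] = int(value)  # value is in kB
    return stats

before = vm_stats()

# Reserve 64 MiB of anonymous memory. Page-table entries are created
# lazily, so virtual size grows at once while resident size does not.
region = mmap.mmap(-1, 64 * 1024 * 1024)
after_reserve = vm_stats()

# Touch every page: the first write to each page faults in a physical page.
page_size = os.sysconf("SC_PAGE_SIZE")
for offset in range(0, len(region), page_size):
    region[offset] = 1
after_touch = vm_stats()

print("VmSize growth after mmap (kB):", after_reserve["VmSize"] - before["VmSize"])
print("VmRSS growth after touching (kB):", after_touch["VmRSS"] - after_reserve["VmRSS"])
```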
[Diagram: two application processes (Application 1 and Application 2), each with system-reserved, code, stack, data, and allocated segments, mapped through the page table to physical memory, with the swap file backing pages that have been written out.]
PAGE TABLE
In the diagram above, in addition to application processes that consume memory, physical memory is
shown, along with the swap file and the page table. When more memory pages are allocated than
physically exist, some previously allocated page of memory must be reused. A page of memory from
the page table is chosen, preferring pages least recently used by their owning application process.
When a process attempts to access memory, the actual page lookup and address translation are
performed by a memory management unit (MMU). The MMU is a hardware device that makes virtual
memory feasible and transparent to application processes.
In order to preserve the contents of that reused page, the original page is written to a unique location
within the swap file, for later retrieval. This is known as a page-out operation. The paging mechanism
may involve physical I/O, but permits application programs to allocate nearly unlimited amounts of
memory, transparently. The event of requesting a page from the page table, when the page is not
present, is known as a page fault.
Once the page-out operation is complete, the memory page in the page table is then granted to the
process that requested memory— that is, the process that incurred the page fault. That requesting
process can then modify the memory. By having been written to the swap file, the paged-out
memory becomes “clean” and safe to modify. Neither the allocating process nor the process whose
page was written to the swap file is aware of the page-out operation.
Since the virtual memory management code does introduce some overhead for execution, paging
does incur some process execution time. Transparency to the application program is the point of
virtual memory, although the flexibility and resilience gained by use of virtual memory is not “free”.
When the process that owns the previously paged-out memory needs to access that memory page
again, the page from the swap file must then be read back into physical memory, into an available
entry from the page table. If an unused page is not available, another least-recently-used page of
memory belonging to another process must now be paged out, and the cycle continues.
Some paging is normal in a busy system. However, when memory demands become heavy, and free
unused physical memory becomes exceptionally low, then more drastic measures must be taken to
make physical memory available to processes.
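A process can observe its own paging activity through the standard resource accounting counters. The Python sketch below (Unix-specific, using the standard resource and mmap modules) touches each page of a fresh anonymous mapping and reports the page faults incurred. Because the backing pages come from free memory rather than the swap file, these show up as minor faults; a major fault would indicate a page had to be read from disk.

```python
import mmap
import os
import resource

def fault_counts():
    """Minor and major page-fault counts accumulated by this process."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_minflt, usage.ru_majflt

minor_before, major_before = fault_counts()

# Touch each page of a fresh 16 MiB anonymous mapping. The first write to
# each page is a page fault; since the page comes from free memory rather
# than the swap file, it is counted as a minor fault.
region = mmap.mmap(-1, 16 * 1024 * 1024)
page_size = os.sysconf("SC_PAGE_SIZE")
for offset in range(0, len(region), page_size):
    region[offset] = 1

minor_after, major_after = fault_counts()
print("minor faults incurred:", minor_after - minor_before)
print("major faults incurred:", major_after - major_before)
```

Note the fault count per mapping varies with the page size and with kernel features such as transparent huge pages, so the absolute number is less interesting than the fact that first-touch writes, not the mmap call itself, generate the faults.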
SWAPPING
Unix and Linux systems have kernel-owned processes responsible for monitoring overall free
physical memory. Known as swappers, these processes will detect situations when free physical
memory drops below a predetermined threshold. At that point, those swappers— one per CPU—
will begin to grab multiple pages from entire processes and write those pages to disk in order to free
up large chunks of memory. When this happens, all other CPU activity is typically suspended until
some higher threshold of memory becomes available. System administrators set these threshold
values at system configuration time.
While light paging activity is considered normal, and not necessarily performance impairing,
swapping results in severe performance degradation. This is not only because of the extreme and
time-consuming I/O involved, but also because swapped processes cannot run until their memory is
swapped back into the page table, which often means another process must then be swapped out.
Moreover, the swapper process dominates CPU resources, noticeably blocking other processes from
execution during the time of extreme swapping activity.
MEMORY PAGE STATES
In Linux, all pages of memory will be in one of five states, shown in the diagram below. Other
operating system memory states will vary, but are similar to this:
[State diagram: allocation moves a Free page to Active; the kscand scanner moves an unaccessed Active page to Inactive Dirty; kupdated/bdflush pages it out through the Inactive Laundry state to Inactive Clean, from which it may be deallocated back to Free; an access at any inactive state returns the page to Active.]
During the lifecycle of a memory page, each page will move from one state to another as needed.
Those states are:
Free: A free page is not being used, and is available for allocation to a process.
Active: An active page is in use by a process.
Inactive Dirty: When a page is unused for a particular period of time, it is marked as inactive dirty,
and is a candidate for reuse by another process. A kernel process periodically scans all memory pages
and tracks how recently each page has been used. A busy page is left in the active state, while an
unused page is moved to the inactive dirty list.
Inactive Laundered: A page on the inactive laundry list has its contents written to disk for
preservation, should the owning process need to access it later. Once the write operation is complete,
the page enters the inactive laundered state, which is a transitional state.
Inactive Clean: An inactive laundered page is moved to the inactive clean state to indicate that page is
now eligible for reuse. It may be deallocated or overwritten as needed.
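On Linux, aggregate counts of active and inactive pages are exposed in /proc/meminfo, alongside total, free, and swap figures. The field names reported by modern kernels (Active, Inactive, and finer-grained variants) map only loosely onto the five states described above, so the following Python sketch should be treated as illustrative rather than a one-to-one accounting of those states.

```python
def read_meminfo(wanted):
    """Return selected /proc/meminfo counters as a dict of kB values
    (Linux-specific; fields absent on a given kernel are simply omitted)."""
    values = {}
    with open("/proc/meminfo") as f:
        for line in f:
            name, rest = line.split(":", 1)
            if name in wanted:
                values[name] = int(rest.split()[0])  # counters are in kB
    return values

counters = read_meminfo({"MemTotal", "MemFree", "Active", "Inactive",
                         "SwapTotal", "SwapFree"})
for name in sorted(counters):
    print(f"{name:>10}: {counters[name]} kB")
```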
REACHING CRITICAL LEVELS
When free memory is drawn down to some predetermined critical level, the operating system will
move inactive memory pages to disk in order to satisfy increased memory demands. This is the
swapping process described earlier in this article. The determination of “critical” and the mechanism
for dealing with the situation vary by operating system, but generally, an operating system process
known as the swapper will begin to swap the memory of entire processes to disk, such that the swapped
process enters a suspended state. If that process is not entirely idle, then when it gets its next
execution opportunity (time slice) and wakes up, its memory is read back in from disk, while perhaps
another process is forced to have its memory swapped out.
When a system is in a state where far more memory is demanded by processes than is physically
available, and those processes must alternately become swapped, the system begins to thrash. This is a
desperately serious state in which overall machine performance is severely impaired, since it is
spending more time managing memory than executing application code. Therefore, it is essential that
enough physical memory be available on a system to avoid swapping. Swapping, if it continues, can
lead to complete memory exhaustion and a system halt, at which time the machine becomes
unavailable altogether, until rebooted.
Swappers, also known as swap daemons, are operating system processes that exist for each CPU on the
machine. When they are actively trying to reclaim memory by swapping other processes to disk, they
usually run at a sufficiently high priority such that normal application processes have to wait in a run
queue for CPU time to become available once the swapper has resolved the temporary memory
shortage.
When a process is waiting for CPU in a run queue, it is not executing. For time-critical services, such
as the cssd daemon of Oracle Portable Clusterware (also known as CRS), this situation can be fatal for a
cluster node, since it may be unable to respond to the heartbeat messages from other nodes in the
cluster. This situation can lead to unexpected node evictions in an Oracle RAC cluster.
MEASURING MEMORY HEALTH
To avoid critical performance and availability issues for servers, some commonly available utilities
can be helpful. On Unix and Linux systems, vmstat provides a useful picture of the current memory
situation. Vmstat operates by taking samples of operating system information at regular intervals,
settable as a command line parameter. While a thorough discussion of vmstat is outside the scope of
this article, a sample of vmstat obtained from a server undergoing memory exhaustion is included for
discussion:
[root@ceintcb14 proc]# vmstat 2 30
procs memory swap io system cpu
r b swpd free buff cache si so bi bo in cs us sy id wa
1 0 142120 312400 19468 142540 0 1 1 2 18 10 2 3 7 0
1 0 142120 244800 19468 142540 0 0 0 24 109 43 0 100 0 0
1 0 142120 195196 19476 142540 0 0 0 44 109 44 0 100 0 0
2 0 142120 137240 19476 142540 0 0 0 0 106 31 0 100 0 0
1 0 142120 61688 19480 142540 0 0 0 32 108 44 0 100 0 0
2 0 241896 20904 14256 134448 0 404 0 428 124 164 0 100 0 0
2 0 271212 20376 13572 120260 0 448 0 448 118 31 0 100 0 0
4 0 365144 19956 12800 114376 0 6800 2 6838 140 48 1 99 0 0
4 0 442024 18904 10888 109484 0 10768 0 10768 113 98 0 100 0 0
5 2 587060 18860 10648 105564 42 0 48 12 188 35 0 100 0 0
1 0 177624 1874148 10268 102684 40 0 76 18 118 47 0 95 5 0
0 0 177624 1874340 10268 102696 0 0 6 24 118 51 0 2 98 0
0 0 177624 1874340 10272 102720 6 0 18 64 111 42 0 0 100 0
In the example above, the leftmost “r” column represents CPU run queue length, a symptom of
processes waiting for CPU, and thus CPU resource exhaustion. Together with the “id” (CPU percent
idle) column to the far right, we can tell this machine is CPU-bound. The run queue average spikes
upward suddenly, while idle percentage drops to zero.
While a casual observer might conclude the worst bottleneck on this system is CPU and not memory,
the reason CPU waits are high is because of memory exhaustion. As discussed earlier, under duress, a
system’s swapper kernel process will dominate process execution time until enough memory is free
for current memory demand. All other processes must wait in a run queue until the swapper’s task is
complete.
Note the “swpd” and “free” memory columns, representing total system swapped and free memory
respectively. The free memory drops rapidly until a critical threshold is reached, at which time the
swapping activity begins. The “so” column indicates memory pages swapping out to disk, while “si”
indicates memory being swapped back in. As one might expect, after swapping out much memory,
the “free” value jumps significantly. However, this is not always apparent as other processes may be
consuming that memory as soon as it becomes free.
The “si” activity burst following the “so” swapping-out activity is the result of some processes whose
memory was previously swapped out, now being swapped back in. Those processes are using some
of the memory freed up by the swap-out operation.
As for the CPU run queue and percent idle values, note the run queue size drops to zero and the
percent idle climbs quickly to 100 as the swapping increases the amount of free
memory and the memory-starvation crisis is resolved. The swapper daemon no longer dominates
CPU and other processes can get sufficient execution time for the run queue length to become zero.
The amount of memory swapped to disk will remain high until the swapped out processes need to
run again, and access memory pages that were previously swapped.
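Spotting such swap bursts in captured vmstat output can be automated. The Python sketch below embeds a few of the sample lines shown above and flags samples whose si (swap-in) or so (swap-out) columns exceed a threshold; the threshold of 100 is an arbitrary illustration, not a recommended value.

```python
# Each sample line is in vmstat's column order:
# r b swpd free buff cache si so bi bo in cs us sy id wa
SAMPLE = """\
1 0 142120 312400 19468 142540 0 1 1 2 18 10 2 3 7 0
2 0 241896 20904 14256 134448 0 404 0 428 124 164 0 100 0 0
4 0 365144 19956 12800 114376 0 6800 2 6838 140 48 1 99 0 0
5 2 587060 18860 10648 105564 42 0 48 12 188 35 0 100 0 0
0 0 177624 1874340 10268 102696 0 0 6 24 118 51 0 2 98 0
"""

def swap_events(text, threshold=100):
    """Return (si, so) pairs for samples whose swap-in or swap-out
    traffic exceeds the (arbitrarily chosen) threshold."""
    events = []
    for line in text.splitlines():
        fields = line.split()
        si, so = int(fields[6]), int(fields[7])  # si and so columns
        if si > threshold or so > threshold:
            events.append((si, so))
    return events

for si, so in swap_events(SAMPLE):
    print(f"swap activity: si={si} so={so}")
```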
DIAGNOSTICS
In cases where sudden and extreme memory consumption leads to swapping, the precise cause of the
problem may not be obvious to the system administrator. In such cases, a tool such as top
may be invoked to analyze relative memory usage among processes.
Top collects information from processes consuming either CPU or virtual memory resources, with
some useful details. A complete discussion of top is outside the scope of this article, and the reader is
encouraged to read appropriate Unix or Linux documentation to fully understand top, and similar
utilities. Here is a test case example, designed to deliberately “leak” memory, to illustrate use of top:
top 17:16:58 up 12 days, 17:45, 3 users, load average: 0.94, 0.77, 0.42
Tasks: 143 total, 1 running, 142 sleeping, 0 stopped, 0 zombie
Cpu(s): 3.1% us, 0.5% sy, 0.2% ni, 95.3% id, 1.0% wa, 0.0% hi, 0.0% si
Mem: 4072172k total, 4055640k used, 16532k free, 920k buffers
Swap: 4144760k total, 452076k used, 3692684k free, 1838948k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ SWAP COMMAND
19586 root 15 0 394m 391m 348 S 0 9.9 0:07.76 2256 yyksd
19585 root 15 0 328m 325m 348 S 0 8.2 0:08.40 2760 yyksd
19588 root 15 0 306m 305m 348 S 1 7.7 0:07.11 1376 yyksd
19587 root 15 0 284m 283m 348 S 0 7.1 0:07.89 1228 yyksd
19584 root 15 0 270m 268m 348 S 2 6.8 0:08.29 1648 yyksd
19589 root 15 0 236m 234m 348 S 2 5.9 0:09.11 1840 yyksd
20708 oracle 16 0 596m 111m 108m S 0 2.8 1:02.49 484m oracle
30959 oracle 15 0 595m 109m 106m S 0 2.7 1:15.14 486m oracle
30965 oracle 16 0 604m 103m 95m S 0 2.6 0:26.17 500m oracle
13297 oracle 15 0 595m 103m 100m S 0 2.6 1:58.85 492m oracle
20703 oracle 16 0 596m 88m 84m S 0 2.2 2:11.92 508m oracle
20674 oracle 16 0 610m 83m 11m S 0 2.1 5:29.31 526m java
In this test case example, we see a single process is currently running and 142 processes are sleeping,
or suspended. In the tabular part of the output, we see the “COMMAND” column on the right,
listing process names in order of physical memory consumption (“RES”, the resident set size). The top
memory-consuming processes are all named “yyksd”. As mentioned, this is a contrived case to
illustrate this specific memory diagnostic technique.
As we can see, the first “yyksd” has consumed 391 megabytes of physical memory, which is 9.9
percent of all memory on the system. The “S” column indicates process state; the “S” value here
means the process is sleeping. All other processes in the list are also sleeping.
Note the “SWAP” column, which shows every process listed as having at least some memory
swapped to disk. In particular, the processes following the “yyksd” entries in this partial display of
output, starting with the “oracle” processes, have hundreds of megabytes of memory swapped out.
A logical starting point for diagnosing this situation would be to investigate the nature of “yyksd”
and determine why it is consuming so much physical memory, forcing others to be swapped.
A full diagnostic discussion is out of the scope of this article, but such efforts might include truss or
strace capture, perhaps some process stack trace captures with pstack or a similar utility, and a detailed
analysis of memory consumed by this process, as contained within the /proc filesystem for the
process in question. For such purposes, the PID column shows the process id of each process listed.
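On Linux kernels that expose a VmSwap field in /proc/<pid>/status (2.6.34 and later), per-process swapped memory can also be ranked directly from the /proc filesystem, without top. The following Python sketch is one way to do so; processes that exit mid-scan or are unreadable are simply skipped.

```python
import os

def swapped_kb(pid):
    """Return VmSwap (kB) for a pid from /proc/<pid>/status, or None if
    the process has exited, is unreadable, or the field is absent."""
    try:
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith("VmSwap:"):
                    return int(line.split()[1])  # value is in kB
    except OSError:
        return None
    return None

# Rank all visible processes by swapped-out memory, largest first.
results = []
for entry in os.listdir("/proc"):
    if entry.isdigit():
        kb = swapped_kb(int(entry))
        if kb is not None:
            results.append((kb, int(entry)))

for kb, pid in sorted(results, reverse=True)[:10]:
    print(f"pid {pid}: {kb} kB swapped")
```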
For further clarification of the problem case, here is top output sorted by CPU consumption at some
earlier point during this test case:
top 17:14:53 up 12 days, 17:43, 3 users, load average: 0.79, 0.60, 0.32
Tasks: 143 total, 2 running, 141 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.7% us, 9.8% sy, 0.0% ni, 44.8% id, 42.7% wa, 0.0% hi, 0.0% si
Mem: 4072172k total, 4055808k used, 16364k free, 268k buffers
Swap: 4144760k total, 24012k used, 4120748k free, 2148344k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ SWAP COMMAND
56 root 15 0 0 0 0 D 8 0.0 0:22.80 0 kswapd0
19587 root 15 0 223m 222m 348 S 5 5.6 0:05.21 1356 yyksd
19584 root 15 0 236m 235m 348 S 3 5.9 0:06.27 1616 yyksd
19586 root 15 0 315m 313m 348 S 3 7.9 0:05.78 2256 yyksd
19585 root 15 0 265m 263m 348 S 2 6.6 0:06.16 2856 yyksd
19589 root 15 0 201m 199m 348 S 2 5.0 0:06.80 1744 yyksd
19588 root 15 0 243m 242m 348 S 1 6.1 0:05.34 1376 yyksd
36 root 5 10 0 0 0 S 0 0.0 0:02.62 0 kblockd/1
332 root 16 0 0 0 0 S 0 0.0 7:58.64 0 kjournald
14113 oracle 25 10 39128 22m 9.9m S 0 0.6 4:52.29 16m rhnappletgui
30957 oracle 16 0 594m 18m 16m S 0 0.5 16:09.45 575m oracle
Note the top CPU-consuming process is kswapd0, the “swapper” daemon for one of the CPUs. The
presence of a swapper process as the top CPU-consumer is clear evidence of heavy swap activity, and
of potential performance problems resulting from the inability of other processes to run while
waiting for the swapper to complete its high priority task. The process state (“S” column, with the
“D” value) indicates the swapper is in an uninterruptible sleep state. This means it is sleeping, most likely
because it is waiting for completion of an I/O request from its most recent swap file write or read. It
cannot be interrupted. So, since it is itself waiting for I/O, the entire machine is indirectly waiting for
that same I/O. Not a good thing for performance, obviously.
The rest of the processes in the list are in the “S” state, which means they are sleeping in an
interruptible state.
TOOLS FOR MONITORING MEMORY
While vmstat and top are excellent tools for monitoring overall virtual memory health, they need to be
run at appropriate intervals and their output captured for later analysis. In cases where a problem is
reported after-the-fact, such tools are not helpful unless they were running at the time the problem
occurred, and their output archived for later analysis. To address such situations, Oracle Support’s
Center of Expertise has developed OSWatcher, a script-based tool for Unix and Linux systems that
runs and archives output from a number of operating system monitoring utilities, such as vmstat, top,
iostat, mpstat and ps.
OSWatcher is available from Metalink as note 301137.1. It is a shell script tool and will run on Unix
and Linux servers. It operates as a background process and runs the native operating system utilities
at user-settable intervals, by default 30 seconds, and retains an archive of the output for a user-
settable period, defaulting to 48 hours. This value may be increased in order to retain more
information when evaluating performance, and to capture baseline information during important
cycle-end periods.
Oracle recommends customers download and install OSWatcher on all production and test servers
that need to be monitored.
For upgrade and migration planning, as well as informal capacity planning, the vmstat archive files
from OSWatcher can be gathered from production and test systems and analyzed for symptoms such
as those illustrated above. If any sign of swapping is noted, more memory needs to be made
available, or the memory-consuming processes analyzed in order to bring critical memory
consumption under control.
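Once an archive of vmstat output exists, it can be scanned after the fact for swap activity. The Python sketch below assumes plain text files of vmstat lines in a directory; OSWatcher's actual archive layout and file naming may differ, so this is only an illustrative starting point. The demonstration builds a throwaway archive directory with a hypothetical sample file.

```python
import os
import tempfile

def scan_vmstat_archive(directory, threshold=0):
    """Scan a directory of archived vmstat output for swap-out activity.

    Assumes plain-text files of vmstat lines; header and timestamp lines
    are skipped by requiring 16 all-numeric fields per data line.
    Returns (filename, so) pairs where the "so" column exceeds threshold.
    """
    hits = []
    for name in sorted(os.listdir(directory)):
        with open(os.path.join(directory, name)) as f:
            for line in f:
                fields = line.split()
                if len(fields) == 16 and all(x.isdigit() for x in fields):
                    so = int(fields[7])  # "so": memory swapped out
                    if so > threshold:
                        hits.append((name, so))
    return hits

# Demonstration against a throwaway archive with one hypothetical file.
with tempfile.TemporaryDirectory() as archive:
    with open(os.path.join(archive, "vmstat_sample.dat"), "w") as f:
        f.write("procs memory swap io system cpu\n")
        f.write("2 0 241896 20904 14256 134448 0 404 0 428 124 164 0 100 0 0\n")
        f.write("0 0 177624 1874340 10268 102696 0 0 6 24 118 51 0 2 98 0\n")
    for name, so in scan_vmstat_archive(archive):
        print(f"{name}: swap-out activity of {so} observed")
```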