Guide to the POSIX Threads Library

Document revision date: 30 March 2001

Guide to the POSIX Threads Library

Contents

Index

3.4.3 Diagnosing Stack Overflow Errors

A process can produce a memory access violation (or segmentation fault) when it overflows its stack. As a first step in debugging this behavior, it is often necessary to run the program under the control of your system's debugger to determine which thread's stack has overflowed. However, if the debugger shares resources with the target process (as under OpenVMS), perhaps allocating its own data objects on the target process' stack, the debugger might not operate properly when the stack overflows. In this case, you might be required to analyze the target process by means other than the debugger.

If a thread receives a memory access exception either during a routine call or when accessing a local variable, increase the size of the thread's stack. However, not all memory access violations indicate a stack overflow.

For programs that you cannot run under a debugger, determining a stack overflow is more difficult. This is especially true if the program continues to run after receiving a memory access exception. For example, if a stack overflow occurs while a mutex is locked, the mutex might not be released as the thread recovers or terminates. When the program attempts to lock that mutex again, it could hang.

To set the stacksize attribute in a thread attributes object, use the pthread_attr_setstacksize() routine. (See Section 2.3.2.4 for more information.)

3.5 Scheduling Issues

The scheduling attributes of threads have unique programming issues.

3.5.1 Real-Time Scheduling

Use care when writing code that uses real-time scheduling (such as FIFO and RR policies) to control the priority of threads:

Review Section 3.1. Scheduling of threads is not the same as synchronization of threads.
Giving threads higher priority does not necessarily make your code run faster. Real-time priority adds overhead that can slow a program down, especially when interfacing with other libraries. For example, a higher-priority thread that polls for keyboard input may block work being done by other threads.
Watch for pitfalls like priority inversion. It is best to avoid relying on real-time scheduling, except where necessary to meet design goals. On the other hand, most systems that interact with external devices have some real-time aspect.
Avoiding multiple priorities and/or policies increases the complexity of the program: this complexity may cost more in performance than the addition of priorities provides, resulting in a performance loss over an application which does not use priorities.

3.5.2 Priority Inversion

Priority inversion occurs when the interaction among a group of three or more threads causes that group's highest-priority thread to be blocked from executing. For example, a higher-priority thread waits for a resource locked by a low-priority thread, and the low-priority thread waits while a middle-priority thread executes. The higher-priority thread is made to wait while a thread of lower priority (the middle-priority thread) executes.

You can address the phenomenon of priority inversion as follows:

To avoid priority inversion, associate a priority (at least as high as the highest-priority thread that will use it) with each resource and force any thread using that object to first increase its priority to that associated with the object.
To minimize the chance that an occurrence of priority inversion will cause a complete blockage of higher-priority threads, use the (default) throughput scheduling policy. The throughput scheduling policy allows even low-priority threads to execute eventually and to release the resources they hold. The FIFO and RR scheduling policies do not provide for resumption of the low-priority thread if the middle-priority thread executes indefinitely.

3.5.3 Dependencies Among Scheduling Attributes and Contention Scope

On Tru64 UNIX systems, to use high (real-time) thread scheduling priorities, a thread with system contention scope must run in a process with root privileges. On the other hand, a thread with process contention scope has access to all levels of priority without requiring special privileges.

Thus, if a process that is not privileged attempts to create another high priority thread with system contention scope, the creation will fail.

3.6 Using Synchronization Objects

The following sections discuss how to determine when to use a mutex with or without a condition variable, and how to prevent two erroneous behaviors that are common in multithreaded programs: race conditions and deadlocks.

Also discussed is why you should signal a condition variable with the associated mutex locked.

3.6.1 Distinguishing Proper Usage of Mutexes and Condition Variables

Use a mutex for tasks with short duration waits and fine-grained synchronization (memory access). Examples of a "fine-grained" task are those that serialize access to shared memory or make simple modifications to shared memory. This typically corresponds to a critical section of a few program statements or less.

Mutex waits are not interruptible. Threads waiting to acquire a mutex cannot be canceled.

Use a condition variable to wait for data to assume a desired state. Condition variables should be used for tasks with longer duration waits and coarse-grained synchronization (routine and system calls) Always use a condition variable with a mutex that protects the shared data being waited for. Condition variable waits are interruptible.

See Section 2.4.1 and Section 2.4.2 for more information about mutexes and condition variables.

3.6.2 Avoiding Race Conditions

A race condition occurs when two or more threads perform an operation and the result of the operation depends on unpredictable timing factors, specifically, the points at which each thread executes and waits and the point when each thread completes the operation.

For example, if two threads execute routines and each increments the same variable (such as x = x + 1), the variable could be incremented twice and one of the threads could use the wrong value. For example:

Thread A increments variable x.
Thread A is interrupted (or blocked, or scheduled off), and thread B is started.
Thread B starts and increments variable x.
Thread B is interrupted (or blocked, or scheduled off), and thread A is started.
Thread A checks the value of x and performs an action based on that value.
The value of x differs from when thread A incremented it, and the program's behavior is incorrect.

Race conditions result from the lack of (or ineffectual) synchronization. To avoid race conditions, ensure that any variable modified by more than one thread has only one mutex associated with it, and ensure that all accesses to the variable are made after acquiring that mutex. You can also use hardware features such as Alpha land-locked/store-conditional instruction sequences.

See Section 3.6.4 for another example of a race condition.

3.6.3 Avoiding Deadlocks

A deadlock occurs when a thread holding a resource is waiting for a resource held by another thread, while that thread is also waiting for the first thread's resource. Any number of threads can be involved in a deadlock if there is at least one resource per thread. A thread can deadlock on itself. Other threads can also become blocked waiting for resources involved in the deadlock.

Following are three techniques you can use to avoid deadlocks:

Use sequence numbers with mutexes. Associate a sequence number with each mutex and acquire mutexes in sequence. Never attempt to acquire a mutex with a sequence number lower than that of a mutex the thread already holds.
If a thread needs to acquire a mutex with a lower sequence number, it must first release all mutexes with a higher sequence number (after ensuring that the protected data is in a consistent state).
Use a "try and back off" algorithm when acquiring multiple mutexes. Use pthread_mutex_trylock() to lock each additional mutex. If any call to pthread_mutex_trylock() returns EBUSY , unlock all of the mutexes (including the first one locked with pthread_mutex_trylock() ), and start over.
Avoid locking more than one mutex at the same time.

3.6.4 Signaling a Condition Variable

Signaling the condition variable while holding the lock allows the Threads Library to perform certain optimizations which can result in more efficient behaviors in the working thread. In addition, doing so resolves a race condition which results if that signal might cause the condition variable to be deleted.

The following C code fragment is executed by a releasing thread (Thread A) to wake a blocked thread:

pthread_mutex_lock (m); ... /* Change shared variables to allow another thread to proceed */ predicate = TRUE; pthread_mutex_unlock (m); (1) pthread_cond_signal (cv); (2)

The following C code fragment is executed by a potentially blocking thread (thread B):

pthread_mutex_lock (m); while (!predicate ) pthread_cond_wait (cv, m); pthread_mutex_unlock (m); pthread_cond_destroy (cv);

If thread B is allowed to run while thread A is at this point, it finds the predicate true and continues without waiting on the condition variable. Thread B might then delete the condition variable with the pthread_cond_destroy() routine before thread A resumes execution.
When thread A executes this statement, the condition variable does not exist and the program fails.

These code fragments also demonstrate a race condition; that is, the routine, as coded, depends on a sequence of events among multiple threads, but does not enforce the desired sequence. Signaling the condition variable while still holding the associated mutex eliminates the race condition. Doing so prevents thread B from deleting the condition variable until after thread A has signaled it.

This problem can occur when the releasing thread is a worker thread and the waiting thread is a boss thread, and the last worker thread tells the boss thread to delete the variables that are being shared by boss and worker.

Code the signaling of a condition variable with the mutex locked as follows:

pthread_mutex_lock (m); ... /* Change shared variables to allow some other thread to proceed */ pthread_cond_signal (cv); pthread_mutex_unlock (m);

3.6.5 Static Initialization Inappropriate for Stack-Based Synchronization Objects

Although it is acceptable to the compiler, you cannot use the following standard macros (or any other equivalent mechanism) to initialize synchronization objects that are allocated on the stack:

PTHREAD_MUTEX_INITIALIZER
PTHREAD_COND_INITIALIZER
PTHREAD_RWLOCK_INITIALIZER

The Threads Library detects some cases of misuse of static initialization of automatically allocated (stack-based) thread synchronization objects. For instance, if the thread on whose stack a statically initialized mutex is allocated attempts to access that mutex, the operation fails and returns [EINVAL]. If the application does not check status returns from Threads Library routines, this failure can remain unidentified. Further, if the operation was a call to pthread_mutex_lock() , the program can encounter a thread synchronization failure, which in turn can result in unexpected program behavior including memory corruption. (For performance reasons, the Threads Library does not currently detect this error when a statically initialized mutex is accessed by a thread other than the one on whose stack the object was automatically allocated.)

If your application must allocate a thread synchronization object on the stack, the application must initialize the object before it is used by calling one of the routines pthread_mutex_init() , pthread_cond_init() , or pthread_rwlock_init() , as appropriate for the object. Your application must also destroy the thread synchronization object before it goes out of scope (for instance, due to the routine's returning control or raising an exception) by calling one of the routines pthread_mutex_destroy() , pthread_cond_destroy() , or pthread_rwlock_destroy() , as appropriate for the object.

3.7 Granularity Considerations

Granularity refers to the smallest unit of storage (that is, bytes, words, longwords, or quadwords) that a host computer can load or store in one machine instruction. Granularity considerations can affect the correctness of a program in which concurrent or asynchronous access can occur to separate pieces of data stored in the same memory granule. This can occur in a multithreaded program, where different threads access the data, or in any program that has any of the following characteristics:

Accesses data in memory that is shared with other processes
Accesses data that can be accessed by asynchronous device drivers, signal handlers (on Tru64 UNIX), or ASTs (on OpenVMS)
Accesses data objects that can be accessed by a continuable exception handler

The subsections that follow explain the granularity concept, the way it can affect the correctness of a multithreaded program, and techniques the programmer can use to prevent the granularity-related race condition known as word tearing.

3.7.1 Determinants of a Program's Granularity

A computer's processor typically makes available some set of granularities to programs, based on the processor's architecture, cache architecture, and instruction set. However, the computer's natural granularity also depends on the organization of the computer's memory and its bus architecture. For example, even if the processor "naturally" reads and writes 8-bit memory granules, a program's memory transfers may, in fact, occur in 32- or 64-bit memory granules.

On a computer that supports a set of granularities, the compiler determines a given program's actual granularity by the instructions it produces for the program to execute. For example, a given compiler on Alpha systems might generate code that causes every memory access to load or store a 64-bit word, regardless of the size of the data object specified in the application's source code. In this case, the application has a 64-bit word actual granularity. For this application, 8-bit, 16-bit, and 32-bit writes are not atomic with respect to other memory operations that overlap the same 64-bit memory granule.

To provide a run-time environment for applications that is consistent and coherent, an operating system's services and libraries should be built so that they provide the same actual granularity. When this is the case, an operating system can be said to provide a system granularity to the applications that it hosts. (A system's system granularity is typically reflected in the default actual granularity that the system's compilers encode when producing an object file.)

When preparing to port a multithreaded application from one system to another, you should determine whether there is a difference in the system granularities between the source and target systems. If the target system has a larger system granularity than the source system, you should become informed about the programming techniques presented in the sections that follow.

3.7.1.1 Alpha Processor Granularity

Systems based on the Alpha processor family have a quadword (64-bit) natural granularity.

Versions EV4 and EV5 of the Alpha processor family provide instructions for only longword- and quadword-length atomic memory accesses. Newer Alpha processors (EV5.6 and later) support byte- and word-length atomic memory accesses as well as longword- and quadword-length atomic memory accesses. (However, there is no way to ensure that a compiler uses the byte or word memory references when generating code for your application.)

Note

On systems using Tru64 UNIX Version 4.0 and later:

If you use Compaq C or Compaq C++ to compile your application's modules on a system that uses the EV4 or EV5 version of the Alpha processor, you can use the -arch56 compiler switch to request the compiler to produce instructions available in the Alpha processor version EV5.6 or later, including instructions for byte- and word-length atomic memory access, as needed.

When an application compiled with the -arch56 switch runs under Tru64 UNIX Version 4.0 or later, with a newer Alpha processor (that is, EV5.6 or later), it utilizes that processor's full instruction set. When that same application runs under Tru64 UNIX Version 4.0 or later, with an older Alpha processor (that is, EV4 or EV5), the operating system performs a software emulation of each instruction that is not available to the older processor; however, this is considerably slower than if the same application was run on a newer Alpha processor.

See the Compaq C and Compaq C++ compiler documentation for more information about the -arch56 switch.

On Tru64 UNIX systems, use the /usr/sbin/psrinfo -v command to determine the version(s) of your system's Alpha processor(s).

3.7.1.2 VAX Processor Granularity

Systems based on the VAX processor family have longword (32-bit) natural granularity, but all instructions can access unaligned data safely (though perhaps with a substantial performance penalty).

For more information about the granularity considerations of porting an application from an OpenVMS VAX system to an OpenVMS Alpha systems, consult the document Migrating to an OpenVMS System¹.

3.7.2 Compiler Support for Determining the Program's Actual Granularity

Table 3-1 summarizes the actual granularities that are provided by the respective compilers on the respective Compaq platforms.

Table 3-1 Default and Optional Granularities
Platform Compiler Default Granularity Setting Optional Granularity Settings

Tru64 UNIX Versions 4.0D and later (Alpha only) C/C++ quadword longword, byte/word on EV5.6

OpenVMS Alpha Version 7.3 C/C++ quadword byte, word

OpenVMS VAX Version 7.3 C/C++ longword None

**Table 3-1 Default and Optional Granularities**
Platform	Compiler	Default Granularity Setting	Optional Granularity Settings
Tru64 UNIX Versions 4.0D and later (Alpha only)	C/C++	quadword	longword, byte/word on EV5.6
OpenVMS Alpha Version 7.3	C/C++	quadword	byte, word
OpenVMS VAX Version 7.3	C/C++	longword	None

Of course, for compilers that support an optional granularity setting, it is possible to compile different modules in your application with different granularity settings. You might do so either to avoid the possibility of word-tearing race condition, as described in Section 3.7.3, or to improve the application's performance.

3.7.3 Word Tearing

In a multithreaded application, concurrent access by different threads to data that occupy the same memory granule can lead to a race condition known as word tearing. This situation occurs when two or more threads independently read the same granule of memory, update different portions of that granule, then independently (that is, asynchronously) store their respective copies of that granule. Because the order of the store operations is indeterminate, it is possible that only the last thread to write the granule continues with a correct "view" of the granule's contents, and earlier writes could be "undone".

In a multithreaded program the potential for a word-tearing race condition exists only when both of the following conditions are met:

Two or more threads can concurrently write distinct pieces of data that occupy the same memory granule G, where G is a byte, word, longword, or quadword.
The application's actual granularity is sizeof(G) or larger.

For instance, given a multithreaded program that has been compiled to have longword actual granularity, if any two of the program's threads can concurrently update different bytes or words in the same longword, then that program is, in theory, at risk for encountering a word-tearing race condition. However, in practice, language-defined restrictions on the alignments of data may limit the actual number of candidates for a word-tearing scenario, as described in the next section.

3.7.4 Alignments of Members of Composite Data Objects

The only data objects that are candidates for participating in a word-tearing race condition are members of composite data objects---that is, C language structures, unions, and arrays. In other words, the application's threads might access different data objects that are members of structures or unions, where those members occupy the same byte, word, longword, or quadword. Similarly, the application might access arrays whose elements occupy the same word, longword, or quadword.

On the other hand, the C language specification allows the compiler to allocate scalar data objects so that each is aligned on a boundary for the memory granule that the compiler prefers, as follows:

For Compaq C and Compaq C++ on Tru64 UNIX Version 4.0D and higher (Alpha only), and OpenVMS Alpha Version 7.3 systems, alignment of scalars is always on quadword boundaries.
For Compaq C and Compaq C++ on OpenVMS VAX Version 7.3 systems, alignment of scalars is always on longword boundaries.

For the details of the compiler's rules for aligning scalar and composite data objects, see the Compaq C and C++ compiler documentation for your application's host platforms.

Note

¹ This manual has been archived but is available on the OpenVMS Documentation CD-ROM.

Contents

Index

privacy and legal statement

6101PRO_007.HTML

Guide to the POSIX Threads Library

3.4.3 Diagnosing Stack Overflow Errors

3.5.2 Priority Inversion

3.5.3 Dependencies Among Scheduling Attributes and Contention Scope

3.6.4 Signaling a Condition Variable

1 This manual has been archived but is available on the OpenVMS Documentation CD-ROM.

¹ This manual has been archived but is available on the OpenVMS Documentation CD-ROM.