MTA documentation

MTA is a tool that analyzes use of Posix threads in a program, and detects incorrect use of mutexes and events, performance problems, and potential deadlocks.

Table of contents

License blurb
System requirements (Linux, HP/UX, Solaris)
Build & Installation
How MTA works
Compiling your program
Running with MTA event logging (LD_PRELOAD, mtarun, direct linking, naming objects)
Running the analyzer (options, output, messages)

License blurb

MTA is Copyright 2004 Ivan Skytte Jørgensen.

This software is provided 'as-is', without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software.

Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions:

  1. The origin of this software must not be misrepresented; you must not claim that you wrote the original software. If you use this software in a product, an acknowledgment in the product documentation is required.
  2. Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software.
  3. This notice may not be removed or altered from any source distribution.

System Requirements

Neither the interposer shared library nor the analyzer interacts directly with the compiler so there are no requirements regarding the compiler unless you want to build MTA yourself. The compiler used for the pre-built binaries may have inserted references to compiler specific shared libraries and these dependencies are listed.

Linux on x86, x86-64, Alpha and Itanium platform

Requirements:

Build requirements:

HP-UX on PA-RISC 2.0 platform

Requirements:

Both 32-bit and 64-bit programs are supported.

Build requirements:

Solaris on SPARC platform

Requirements:

Build requirements:

Build & installation

This section assumes that you are installing MTA version 0.3 on Solaris, and that you want MTA to be installed in /opt/mta-0.3-sparc-solaris. The installation procedure is the same for other operating systems except that the name of the shared librariy is different.

Build MTA

  ./configure
  make
  make test
  make install

The configure script will detect the operating system and the architecture and select a suitable compiler. The configure script supports a few command-line arguments:

-prefix=... or --prefix=...
Selects the installation base directory. The default is /opt/mta-version-architecture-os
--usesunworkshop or --usesunworkshop (Solaris only)
Use Sun Workshop C++ compiler instead of g++
-32
Compile as 32-bit instead of 64-bit. 32-bit compilation is the default for x86 and Sparc systems
-64 or --use64bits
Compile as 64-bit instead of 32 bit. 64-bit compilation is the default for PA-RISC, Alpha and x86-64 systems

Test failures

Failures in the make test step means that you have some digging to do. The most common case is that all the test cases fails with "mta interposer could not load libpthread" which indicates that your OS has developed faster than MTA and you will have to find out which library is the correct library to load.

On newer Linux distributions with the TLS-based libc mta does not always know where the real pthread library is, as some of the libpthread.so files is actually ld.so scripts which dlopen() cannot load. In this case the tests will fail and you should override the PTHREAD_LIB_NAME in mta.c

Installation

The make install step is optional. If you know how to juggle LD_PRELOAD and LD_LIBRARY_PATH then it is fine with me.

Add /opt/mta-0.3-sparc-solaris/bin to your PATH

Add /opt/mta-0.3-sparc-solaris/lib to your LD_LIBRARY_PATH

There are no "man" pages or "info" files.

Overview of the package:

How MTA works

A diagram showing how MTA components interact

MTA works by intercepting most pthread_???? calls and logging them to an event log mta.log. libmta.so then calls the real functions in libpthread.so. libmta.so also logs the return from the pthread functions including timestamps and return codes.

The event log can later be analyzed with mtanalyze which can find potential and real performance problems, incorrect use, scalability problems, non-portable behavior, potential and real deadlocks, uncommon behavior, optimization possibilities, etc.

Compiling your program

Compile and link your program as usual, including -lpthread on the linker line.

MTA does not currently support programs that fork().
Slightly longer description: If you link directly with libmta.so and the child process exec()s immediately without calling pthread functions it should work. The problem is that events from the parent process and the child process ends up in the same log file which will confuse mtanalyze.

Running with MTA event logging

Before you run with MTA event logging you should have debugged your program. Running a few times with Purify, Insure++, IDH, ElectricFence, Valgrind, MPatrol or the like is recommended.

Run a few tests on you program. You do not usually need to run extensive test - just enough to get good coverage. libmta.so will generate a complete event log called "mta.log" which can become quite large (500MB is not unheard of). Make sure you have enough space in the current directory. For best results shut down your program properly.

You can run your program with MTA event logging in 3 ways as described in the following sections

Using LD_PRELOAD=

You can use the LD_PRELOAD feature on Linux, Solaris and HP/UX. Assuming that you program is called "threadtest" you can force libmta.so to be loaded before the program is run (and override symbols in libpthread.so) with:

  LD_PRELOAD=libmta.so ./threadtest

Using mtarun

mtarun is a shell script which uses the LD_PRELOAD feature, but also removes old logs, checks for write permission to the current directory, etc. :

  mtarun ./threadtest

Linking with libmta.so

You can also link directly with libmta.so. Just insert libmta.so before -lpthread:

  CC ... -L/where_mta_is_installed/lib -lmta -lpthread

This does not work under HP/UX as it causes libpthread.1 to loop forever calling ksleep()

Naming objects (threads, mutexes and conditions)

libmta.so can only capture the address of mutexes and conditions, and the thread-id of threads. So mtanalyze can only report that e.g. a mutex at address 08fe34004 were not unlocked by thread 17. You can usually find out what is at address 08fe34004 using a debugger, and what thread 17 is using a simple log file. This can be a bit tedious. An alternative to this is to call mtanalyze_name_{thread,mutex,condition}() prototyped in mta_naming.h and located in mta_naming.o. This requires you to modify your program.

Running the analyzer

The program mtanalyze reads the event log file and analyzes it:

  mtanalyze

You will usually want to pipe the output to a pager (more/less):

  mtanalyze | more

Or send the output to a file::

  mtanalyze -of analysis.txt

Options to mtanalyze

Options can be specified to mtanalyze. If you are using recursive mutexes in you program you will have to specify +r. If you renamed the mta.log file you have to specify -if otherfilename. If you want mtanalyze to output its report to something else than stdout your can specify -of outputfilename.

This is a brief summary of the most common options that mtanalyze accepts. The list is not complete. For a complete list run mtanalyze with the --help option:

  mtanalyze --help

Options requiring an argument must be separated from the argument. e.g.. +w32 is not valid.

Option(s) Description
-of filename Set output filename. Default is standard output.
-if filename Set event log filename. Default is mta.log
+r
-r
Enable/disable recursive mutexes.
If recursive mutexes are not used more strict analysis can be done and more errors can be caught.
+w x
-w x
Enable/disable warning x.
Individual warnings and suggestions can be enabled and disabled.
-wmcr x Warn mutex contention ratio.
If the ratio of contended lock operations is above this level a warning is generated.
-wmlht x Warn mutex long hold time.
Generate warning if a mutex is held for more than the specified time.
-hm x Dump x most used mutexes.
By default the 5 most frequently used mutexes are written to the report.
+ad
-ad
Enable/disable deadlock analysis.
mtanalyze can find real and potential deadlocks.

Example:

  mtanalyze +r -wmcr 0.4 +ad -w 32

This makes mtanalyze allow recursive mutexes, sets the threshold for warning about mutex contention to 0.4 (40% of the lock operations), enables deadlock analysis and disables warning 32.

Output from mtanalyze

mtanalyze will output event log reading progress to standard output so you can see when it has finished reading the event log and started analyzing it.
The report produced could look like this:

===============================================================================
===== Start of pre-analysis

-------------------------------------------------------------------------------
----- Start of mutex-discovery
---
Found 2 mutexes
----- End of mutex-discovery
-------------------------------------------------------------------------------
----- Start of condition-discovery
---
Found 0 conditions
----- End of condition-discovery
-------------------------------------------------------------------------------
----- Start of thread-discovery
---
Found 2 threads
----- End of thread-discovery
===== End of pre-analysis
===============================================================================
===============================================================================
===== Start of mutex analysis

-------------------------------------------------------------------------------
----- Start of mutex use analysis
----- End of mutex use analysis
-------------------------------------------------------------------------------
----- Start of mutex performance analysis
---
Suggestion (message code 33): Mutex 400679d4 is only used by thread 1024. It may be possible to eliminate the mutex
---
Warning (message code 27): High contention ratio for mutex 08049a04 (50.00% lock operations were contended)
---
Warning found in event
thread-id: 1026
timestamp=  999943270.736070
mtx-addr = 08049a04
cond-addr = 00000000
event = mutex_lock
Warning (message code 28): In at least one occurrence it took 2.001 seconds to lock mutex
---
Warning found in event
thread-id: 1024
timestamp=  999943270.735746
mtx-addr = 08049a04
cond-addr = 00000000
event = mutex_locked
Warning (message code 29): Mutex 08049a04 were held for 2.001 seconds here
----- End of mutex performance analysis
-------------------------------------------------------------------------------
----- Start of mutex deadlock analysis
Building lock traces
To be examined: 2 threads, 0 traces and 0 lock operations
Skipping analysis since deadlocks are not possible
----- End of mutex deadlock analysis
-------------------------------------------------------------------------------

Messages generated by mtanalyze

mtanalyze can generate a lot of different errors, warnings and suggestions. The following table lists all the message that mtanalyze currently generates.

1 pthread_mutex_init() called for address where a mutex already exists
pthread_mutex_init() is not necessarily idempotent and could cause a resource leak. It will also create problems if another thread currently owns the mutex.
2 pthread_mutex_init() returned for address where a mutex already exists (race condition in mutex creation?)
This message usually follows message 1 if the pthread library do not check if a mutex already exists
3 pthread_mutex_init failed for address where a mutex already exists
This message usually follows message 1 if the pthread library checks if a mutex already exists at the specified address.
4 pthread_mutex_destroy() called for mutex which has already been destroyed
pthread_mutex_destroy is not necessarily idempotent and could cause all sorts of problems ranging from nothing to core dumps.
5 pthread_mutex_destroy() returned for address where a mutex do not exits (now)
This could indicate that two threads tries to destroy a mutex simultaneously.
6 pthread_mutex_lock/trylock() called for address where a mutex do not exist anymore
Don't do that.
7 pthread_mutex_lock/trylock() returned for address where a mutex do not exist anymore
This could indicate that a thread was trying to lock a mutex while another thread destroyed it.
8 pthread_mutex_lock/trylock() returned for address where a mutex has never existed (?)
This is impossible. This most likely indicates that something has overwritten the stack of the calling thread.
9 pthread_mutex_trylock() failed for address where a mutex do not exist
Also impossible. This most likely indicates that something has overwritten the stack of the calling thread.
10 pthread_mutex_unlock() called for address where a mutex do not exist, or has never been locked
pthread_mutex_unlock() is called for an unowned mutex AND that mutex has not been pthread_mutex_init()'ed. If the mutex were known message 24 would have been generated instead.
11 pthread_mutex_unlock() returned for address where a mutex do not exist anymore (race condition on pthread_mutex_destroy?)
A thread tried to unlock a mutex which has been destroyed in the meantime. At least the pthread_mutex_unlock() call returned after another thread had destroyed the mutex. This may or may not be a problem.
12 pthread_mutex_lock() returned for address where a mutex has never existed (?)
Very strange and most likely impossible.
13 pthread_mutex_lock() called for already held mutex (and recursive mutexes are not allowed)
A threads attempts to lock a mutex it already owns.
14 pthread_mutex_lock/trylock() succeeded even though the lock is currently held by another thread (tid=%ld) (bug in pthread library?)
Most likely a bug in the pthread implementation or another thread has overwritten the mutex structure.
15 pthread_mutex_unlock() called for mutex which currently held by another thread (tid=%ld)
Don't do that.
16 pthread_cond_wait/timedwait() called with mutex owned by other thread (tid=%ld)
The mutex specified in pthread_cond_wait/timedwait() must be held by the calling thread.
17 pthread_cond_wait/timedwait() called with mutex not held.
The mutex specified in pthread_cond_wait/timedwait() must be held by the calling thread.
18 pthread_cond_wait/timedwait() called with recursive mutex with lock level!=1 (%u)
Posix does not guarantee this to work and pthread libraries differ a great deal on how they treat this situation. It invokes undefined behavior.
19 pthread_cond_wait/timedwait() returned for mutex currently owned by other thread (tid=%ld)
This probably indicates a bug in the pthread library (or someone has overwritten the calling thread's stack)
20 pthread_mutex_lock() called but it never returned (canceled thread? incomplete log?)
The event log is probably not complete, or exit() has been called while a thread tried to lock a mutex.
21 pthread_mutex_trylock() called but it never returned (canceled thread? incomplete log?)
Same as message 20.
22 pthread_mutex_trylock() called for already held recursive mutex (lock count = %u). This may not work as expected.
pthread_mutex_trylock() called for already held recursive mutex. This may not work as expected.
pthread_mutex_trylock() called for already held mutex. This may not work as expected.
pthread_mutex_trylock() called for already held mutex.
The behavior of pthread_mutex_trylock() on an already held recursive mutex differs in the various pthread implementations. Don't rely on this to work in the same way in other implementations. For non-recursive mutexes it could indicate a flaw in the program logic.
23 pthread_mutex_lock() failed - this is unusual and probably indicates an error. Return value from pthread_mutex_lock() was %d
pthread_mutex_lock() rarely fails and programs do not usually test the return value. Look into why it failed.
24 pthread_mutex_unlock() called for mutex which currently not held.
Don't do that.
25 pthread_mutex_unlock() returned an error. Very unusual. Return code = %d
This is a very unlikely error and a program does not usually test for the return value. Investigate why it failed.
26 Mutex still locked at end of event log, mutex=%08lx, thread=%ld
The event log is either incomplete or the program does not shut down releasing all resources.
27 High contention ratio for mutex %08lx (%5.2f%% lock operations were contended)
Many threads wants a mutex and have to wait.
28 In at least one occurrence it took %.3f seconds to lock mutex
It took a non-trivial amount of time to lock a mutex. This could indicate a system pushed beyond its limits. This message is usually followed by message 29.
29 Mutex %08lx were held for %.3f seconds here
Mutexes are not meant to be held for a long time.
30 On at least one occurrence it took %.3f seconds to lock the mutex (average=%.3f seconds)
On at least on occasion a mutex were held for unusually long time. This could indicate unwanted task switches or that something during the hold time took unexpectedly long time.
31 Mutex %08lx has many failed pthread_mutex_trylock() operations (%ld failed out of %ld operations)
Many pthread_mutex_trylock() operations fails on a mutex and thus wasting CPU time.
32 Recursive mutex %08lx has deep lock counts (max=%ld)
Consider using a non-recursive mutex and locking it before manipulating what the mutex protects.
33 Mutex %08lx is only used by thread %ld. It may be possible to eliminate the mutex
A mutex was only used by one thread so the mutex is unnecessary. If the mutex operations are in your program and not in a system library you may want to consider removing the mutex.
34 pthread_cond_wait/timedwait() used a mutex which do not exist (anymore)
A call to pthread_cond_wait/timedwait() specified a mutex which has been destroyed.
35 The mutex has many waiting threads. At this event there is %u waiting threads
The message says it all.
36 Mutex has already been named '%s'
The call to mtanalyze_name_mutex() specified a mutex which has already been named
37 Mutex %08lx is never used. It is possible to eliminate the mutex
The mutex was created (and possibly destroyed) but never used
99 (internal error) Mutex address given, but unknown event
internal error in mtanalyze
101 pthread_cond_broadcast/signal/wait/timedwait() called for address where a condition do not exist anymore
The specified condition has been destroyed prior to the call.
102 pthread_cond_broadcast/signal/wait/timedwait() returned for address where a condition do not exist anymore
This message is either a result of message 101, or someone destroyed the condition while threads were waiting for it.
103 pthread_cond_broadcast/signal/wait/timedwait() returned for address where a condition has never existed (?)
Probably impossible. Or something has overwritten the stack of the calling thread.
104 pthread_cond_init() called for address where a condition already exists
pthread_cond_init is not necessarily idempotent and this could lead to resource leaks.
105 pthread_cond_init() returned for address where a condition already exists (race condition in condition creation?)
pthread_cond_init() returned creating a condition where a condition already exists.
106 pthread_cond_init failed for address where a condition already exists
-
107 pthread_cond_destroy() called for condition which has already been destroyed
A thread is trying to destroy a condition which has already been destroyed. pthread_cond_destroy is not necessarily idempotent and this could lead to errors.
108 pthread_cond_destroy() returned for address where a condition do not exist (now)
This could indicate that two threads were trying to destroy the same condition simultaneously.
109 pthread_cond_broadcast() returned an error %d. This is unusual
Programs usually do not test the return code from pthread_cond_broadcast. Investigate the error and why it failed.
110 pthread_cond_destroy() returned an error %d. This is unusual
pthread_cond_destroy() could not destroy a condition. Investigate why.
111 Condition destroyed while %ld waiters exist
A condition were destroyed while other threads were waiting on it.
112 pthread_cond_init() returned an error %d. This is unusual
pthread_cond_init() failed and programs do not usually test the return value. Investigate why it failed.
113 pthread_cond_signal() called without holding the mutex (addr=%08lx) which is used in pthread_cond_wait/timedwait() for this condition
The decision to wake just one thread with pthread_cond_signal() can usually not be made without holding the mutex. This is most likely a programming error.
114 pthread_cond_signal() returned an error. This is unusual.
Programs usually do not test the return code from pthread_cond_signal(). Investigate the error and why it failed.
115 pthread_cond_wait() called with a different mutex than usual (usual mutex = %08lx). This is probably not intended.
Programs usually use the same mutex for a condition. While it is well-defined not to do so it is usually an error.
116 pthread_cond_wait/timedwait() is called with mutex different from specified in other current wait/timedwait operations (mutex addr=%08lx)
This leads to undefined behavior.
pthread_cond_wait/timedwait() were called specifying a mutex different from what other waiting threads specified. This is undefined.
117 pthread_cond_wait() returned the error %d. This is unusual.
Programs do not usually test the return value from pthread_cond_wait() as it rarely fails. Investigate why it happened.
118 %5.0f%% of the pthread_cond_signal operations on condition %08lx were without waiting threads
Many of the pthread_cond_signal() calls did not wake any threads. You can probably implement stronger tests before calling pthread_cond_signal()
119 Condition %08lx has %ld waiting threads at the end of the event log
Some threads were still waiting on a condition when the program terminated. Or maybe the event log is incomplete.
120 Condition has already been named '%s'
The call to mtanalyze_name_condition() specified a condition which has already been named
121 Condition %08lx were broadcasted %u times waking %lu threads, of which %lu went to sleep on the condition again shortly. This could indicate a "thundering herd" problem
Some of the threads awakened with pthread_cond_broadcast() did very little work before waiting on the condition again. Maybe you can use pthread_cond_signal() instead.
122 Spurious wakeup detected - pthread_cond_wait() returned but nothing signalled the condition.
A call to pthread_cond_wait() or pthread_cond_timedwait() returned but nothing has called pthread_cond_signal() or pthread_cond_broadcast() on the condition since the thread went to sleep. This is spurious wakeup and explicitly allowed by the Posix standard. However, spurious wakeups are generally rare and it is more likely that a bug has caused the thread to wake up.
199 (internal error) condition address set, but unknown event type
Internal error in mtanalyze
201 Thread has already been named '%s'
The call to mtanalyze_name_thread() specified a thread which has already been named. You can only name a thread once.
202 At this point %u threads were running. This is a lot.
A unreasonably large number of threads were present. Consider implementing a "worker pool" or in the case of thread-per-connection making the clients re-use the connections or implement shorter idle time-outs.
203 At this point %u threads were running, which exceeds _POSIX_THREAD_THREADS_MAX. This is not portable.
POSIX only guarantees 64 threads in a process. Your program exceeded that limit and is not portable.
204 This thread were short-lived (%.6f seconds)
A thread lived for a very short time. This could indicate a thread-per-request program which does not scale well and uses unnecessary resources.
205 %u short-lived threads were found. This could indicate 'thread-per-connection' implementation which in general does not scale as well as 'worker pool'
The message says it all.
206 The thread %s was created detached but was manually detached by thread %lu at %17.6f. This invokes undefined behaviour.
The pthread_attr_t supplied to the pthread_create() call specified PTHREAD_CREATE_DETACHED, but the thread detached itself before pthread_create() returned. It causes undefined behaviour to detached a thread more than once.
207 Detaching unknown thread (id=%lu)
A call to pthread_detach() specified an invalid thread-id. The thread may have been created detached and finished already, already been detached or already been joined. In any case, this causes undefined behaviour.
208 Thread %s is already detached (it was detached by thread %lu at %17.6f). This invokes undefined behaviour.
An already-detached thread-id was specified to pthread_detach().
209 pthread_detach() called on thread %s which is currently being joined by thread %lu at %17.6f. This invokes undefined behaviour.
210 Joining unknown thread %lu
A pthread_join() call specified an invalid thread-id. The thread-id may be uninitialized, overwritten, but can also be a thread-id which is no longer valid because the thread has been detached or joined and exited.
211 Joining thread %s which has been detached (by thread %lu at %17.f).
A pthread_join() call specified an invalid thread-id. The thread has been detached and cannot be the target for a join.
212 pthread_join() called on thread %s which has already been joined (by thread %lu at %17.6f).
You cannot join a thread twice.
213 The thread %s (created here) was never joined nor detached.
A thread was created and finished, but it was never joined nor detached. This may cause a resource leak (eg. stack).
214 Thread %s (created here) never performed a clean exit.
A thread was created but it never returned from its thread function nor called pthread_exit(). This may cause problem if you use shared libraries and the thread wakes up duing shutdown.
215 pthread_create() returned with thread-id %lu, but that thread could not be found.
(fatal) The analyzer suspects that the trace file is corrupted or other foul play.
216 Thread %ul which exited at %17.6f had its ID reused at this point.
A thread performed pthread_exit() or returned, and while looking for lingering events generated by the thread library or cleanup handlers, a pthread_create() returned with the thread ID. This can either be the original pthread_create() returning, or the the thread ID being reused. mtanalyze detected that it was a re-use, in which case the thread lifetime analysis may be slightly off.
217 Thread %s (created here) seems stuck at %17.6f.
A thread never returned from pthread_mutex_lock(), pthread_cond_(timed)wait, or pthread_join.
218 Thread %s has called pthread_exit() multiple times, previously at %17.6f
A thread is calling pthread_exit() from its cleanup handlers. This causes undefined behavior.
219 While scanning for thread-ID reuse (pthread_create returned at this point), an event for the creating thread which is not pthread_create() call was found at %17.6f.
While scanning for lingering operations of an exiting thread, mtanalyze found a pthread_create() call returning in a (other thread) which did not call pthread_create(). It is most likely a corrupt trace file, or the createing thread was interrupted in the middle of a pthread_create() call by a signal or a debugger. Re-run your application.