race condition on rmm for module ldap (ldap cache) - part II
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Apache2 Web Server | Confirmed | Medium | |
apache2 (Ubuntu) | Triaged | Medium | Unassigned |

Bug Description
[Impact]
* Apache users of the ldap module might hit this when running multiple threads with shared memory enabled for the apr memory allocator (the default in Ubuntu).
[Test Case]
* Configure apache to use the ldap module (e.g. for authentication) and wait for the race condition to happen.
* Analysis made out of a dump from a production environment.
* Bug has been reported multiple times upstream in the past 10 years.
[Regression Potential]
* ldap module has broken locking mechanism when using apr mem mgmt.
* ldap would continue to have broken locking mechanism.
* race conditions could still exist.
* could break ldap module.
* patch is upstreamed in next version to be released.
[Other Info]
ORIGINAL CASE DESCRIPTION:
This is related to LP: #1752683. The locking mechanism for LDAP was fixed, since the lock is now obtained from the server config merge as it was supposed to be. The problem is that its logic likely still has a race condition, causing the ldap module to fail under some conditions.
Problem summary:
apr_rmm_init() initializes a relocatable memory management (RMM) segment.
It is used in: mod_auth_digest and util_ldap_cache
From the dump that was brought to my attention, in the following sequence:
- util_ldap_
- util_ald_strdup()
- apr_rmm_calloc()
- find_block_
Had a "cache->rmm_addr" with no lock at "find_block_
cache->
And an invalid "next" offset (out of rmm->base-
This rmm_addr was initialized with NULL as a locking mechanism:
From apr-utils:
apr_rmm_init()
if (!lock) { <-- 2nd argument to apr_rmm_init()
lock = &nulllock;
}
From apache:
# mod_auth_digest
sts = apr_rmm_
# util_ldap_cache
result = apr_rmm_
It appears that the ldap module chose to use "rmm" for memory allocation, using
the shared memory approach, but without explicitly defining a lock for it.
Without it, it's up to the caller to guarantee that there are locks for rmm
synchronization (just like mod_auth_digest does, using global mutexes).
Because of that, there was a race condition in "find_block_
touching "rmm->base-
apache environment, since there were no lock guarantees inside rmm logic (lock
was "apr_anylock_none" and the locking calls don't do anything).
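The effect of the NULL lock can be illustrated with a small stand-alone model. This is only a sketch, not APR code: `toy_anylock`, `toy_rmm`, and the other `toy_` names are hypothetical stand-ins for apr_anylock_t and the RMM handle, and a plain pthread mutex replaces APR's lock types. It shows that a "none" lock reports success while excluding nobody.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for apr_anylock_t: a tagged lock that may be "none". */
typedef enum { TOY_LOCK_NONE, TOY_LOCK_THREADMUTEX } toy_lock_type;

typedef struct {
    toy_lock_type type;
    pthread_mutex_t *tm; /* only meaningful for TOY_LOCK_THREADMUTEX */
} toy_anylock;

/* Hypothetical stand-in for the RMM handle; only the lock matters here. */
typedef struct {
    toy_anylock lock;
} toy_rmm;

static int toy_anylock_lock(toy_anylock *l)
{
    if (l->type == TOY_LOCK_THREADMUTEX)
        return pthread_mutex_lock(l->tm);
    return 0; /* "none": reports success without excluding anyone */
}

static int toy_anylock_unlock(toy_anylock *l)
{
    if (l->type == TOY_LOCK_THREADMUTEX)
        return pthread_mutex_unlock(l->tm);
    return 0;
}

/* Mirrors the quoted fallback: a NULL lock argument becomes the null lock. */
static void toy_rmm_init(toy_rmm *rmm, const toy_anylock *lock)
{
    static const toy_anylock nulllock = { TOY_LOCK_NONE, NULL };
    assert(rmm != NULL);
    rmm->lock = lock ? *lock : nulllock;
}

/* Returns 1 if holding the allocator's own lock actually keeps others out. */
static int toy_rmm_is_guarded(toy_rmm *rmm)
{
    int guarded = 0;

    toy_anylock_lock(&rmm->lock);
    if (rmm->lock.type == TOY_LOCK_THREADMUTEX) {
        /* A second acquisition attempt must fail while the mutex is held. */
        int rc = pthread_mutex_trylock(rmm->lock.tm);
        if (rc == EBUSY)
            guarded = 1;
        else if (rc == 0)
            pthread_mutex_unlock(rmm->lock.tm);
    }
    toy_anylock_unlock(&rmm->lock);
    return guarded;
}
```

In this model, a handle initialized with NULL is never guarded by its own lock, so any protection has to come from the caller.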
In find_block_of_size:
apr_rmm_off_t next = rmm->base-
We have:
rmm-
Decimal:356400
Hex:0x57030
But "next" turned into:
Name : next
Decimal:
Hex:0x73797363
Causing:
struct rmm_block_t *blk = (rmm_block_
if (blk->size == size)
To segfault.
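Why a corrupted offset ends in a segfault can be sketched with a bounds check applied before each offset is turned into a pointer. This is an illustration, not the apr_rmm.c code: `toy_block`, `toy_offset_valid`, and `toy_find_block` are hypothetical names, and treating the 0x57030 value above as the segment size is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical block header: links are offsets from the segment base. */
typedef struct {
    size_t size;
    size_t next; /* offset of the next block; 0 terminates the list */
} toy_block;

#define TOY_CORRUPT ((size_t)-1)

/* An offset is usable only if a whole header fits inside the segment. */
static int toy_offset_valid(size_t off, size_t seg_size)
{
    return off != 0
        && seg_size >= sizeof(toy_block)
        && off <= seg_size - sizeof(toy_block);
}

/*
 * Walk the list starting at offset `first`, looking for an exact size match.
 * Returns the matching offset, 0 if none was found, or TOY_CORRUPT if a
 * link points outside the segment (the case that crashed in the dump).
 */
static size_t toy_find_block(const unsigned char *base, size_t seg_size,
                             size_t first, size_t want)
{
    size_t next = first;

    assert(base != NULL);
    while (next) {
        toy_block blk;

        if (!toy_offset_valid(next, seg_size))
            return TOY_CORRUPT;
        /* memcpy avoids alignment assumptions about base + next. */
        memcpy(&blk, base + next, sizeof blk);
        if (blk.size == want)
            return next;
        next = blk.next;
    }
    return 0;
}
```

Under the stated assumption, an offset of 0x73797363 against a 0x57030-byte segment fails the check. Such a check would only detect the corruption; it would not remove the race that caused it.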
Upstream bugs:
https:/
https:/
https:/
Upstream bugs fixed the "outer" lock, which is now obtained from the config merge function in all conditions. Since apr_rmm_init() is called with no lock, the caller should take care of the locking. Unfortunately, the "outer" lock does not seem to be working:
---- This new bug explanation:
LDAP_CACHE_LOCK() is either missing a barrier or it is not enough for subsequent calls to APR with NULL locking (passed to APR_RMM_INIT). After patch for this bug has been applied, https:/
In the dump, in find_block_
apr_rmm_off_t next = rmm->base-
...
while(next) {
struct rmm_block_t *blk = (rmm_block_
blk gets value 0x5772e56b36226557 because "next" was corrupted (value: 0x57726573553d5
APR_
in apr_rmm_calloc() is apr_anylock_none, like previously reported by me.
For the sake of exercising possibilities, if mod_ldap is calling APR RMM with external locking, it would be using LDAP_CACHE_LOCK. My current stack trace is this:
Thread #19 7092 (Suspended : Container)
kill() at syscall-
<signal handler called>() at 0x7ff7e9cb7390
find_block_
apr_rmm_calloc() at apr_rmm.c:342 0x7ff7ea10ea68
util_ald_alloc() at util_ldap_
util_ldap_
util_ald_
uldap_
ldapgroup_
apply_
apply_
authorize_
ap_run_
ap_process_
ap_process_
ap_process_
ap_process_
ap_process_
ap_run_
ap_process_
process_socket() at worker.c:631 0x7ff7e2f51f8b
worker_thread() at worker.c:990 0x7ff7e2f51f8b
start_thread() at pthread_
clone() at clone.S:109 0x7ff7e99e341d
Which means uldap_cache_
LDAP_CACHE_LOCK() translates into:
do {
if (st->util_
} while (0);
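The general shape of such an "outer" lock macro can be sketched as follows. The real macro body is not reproduced here: `toy_state`, `TOY_CACHE_LOCK`, `TOY_CACHE_UNLOCK`, and `toy_cache_insert` are hypothetical names, and a pthread mutex stands in for the APR global mutex.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical module state: the lock pointer may be NULL. */
typedef struct {
    pthread_mutex_t *cache_lock;
    size_t entries;
    int last_insert_was_locked;
} toy_state;

/* Conditional lock wrapped in do/while(0) so it behaves as one statement. */
#define TOY_CACHE_LOCK(st) do {                        \
        if ((st)->cache_lock)                          \
            pthread_mutex_lock((st)->cache_lock);      \
    } while (0)

#define TOY_CACHE_UNLOCK(st) do {                      \
        if ((st)->cache_lock)                          \
            pthread_mutex_unlock((st)->cache_lock);    \
    } while (0)

/* Returns 1 if the mutex cannot be acquired right now, i.e. it is held. */
static int toy_is_held(pthread_mutex_t *m)
{
    int rc = pthread_mutex_trylock(m);
    if (rc == 0)
        pthread_mutex_unlock(m);
    return rc == EBUSY;
}

/* The caller, not the allocator, is responsible for mutual exclusion. */
static void toy_cache_insert(toy_state *st)
{
    assert(st != NULL);
    TOY_CACHE_LOCK(st);
    st->last_insert_was_locked =
        st->cache_lock != NULL && toy_is_held(st->cache_lock);
    st->entries++; /* stands in for the unlocked allocator call */
    TOY_CACHE_UNLOCK(st);
}
```

When the lock pointer is NULL, both macros silently do nothing and the insert runs unprotected, which is why the value of the lock pointer in the dump matters.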
After the change proposed for this bug (where "util_ldap_
Name : util_ldap_
Hex:0x7ff7ea75aee0
Name : util_ldap_cache
Hex:0x7ff7e0e51038
Meaning that it got the ldap_cache and ldap_cache_lock from the merge config function.
From the mutex acquire logic, for the apr_global_
apr_status_t apr_proc_
{
return mutex->
}
And it would translate into:
st->util_
And from that logic:
static apr_status_t proc_mutex_
{
int rc;
do {
rc = fcntl(mutex-
} while (rc < 0 && errno == EINTR);
if (rc < 0) {
return errno;
}
mutex-
return APR_SUCCESS;
}
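For reference, the retry-on-EINTR pattern used by an fcntl()-based acquire can be sketched in isolation. This is not the APR implementation: `toy_fcntl_op`, `toy_fcntl_acquire`, and `toy_fcntl_release` are hypothetical names, and locking the whole file with F_SETLKW is an assumption made for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Apply a whole-file record lock operation, retrying if interrupted. */
static int toy_fcntl_op(int fd, short type)
{
    struct flock fl;
    int rc;

    assert(fd >= 0);
    memset(&fl, 0, sizeof fl);
    fl.l_type = type;       /* F_WRLCK to acquire, F_UNLCK to release */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;           /* 0 means "to end of file" */

    do {
        rc = fcntl(fd, F_SETLKW, &fl); /* blocks until the lock is granted */
    } while (rc < 0 && errno == EINTR);

    return rc < 0 ? errno : 0;
}

static int toy_fcntl_acquire(int fd)
{
    return toy_fcntl_op(fd, F_WRLCK);
}

static int toy_fcntl_release(int fd)
{
    return toy_fcntl_op(fd, F_UNLCK);
}
```

Note that POSIX record locks taken with fcntl() belong to the process, so on their own they serialize processes rather than threads inside one process.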
We would guarantee mutex lock through a file descriptor to the file:
"/var/lock/
And the "mutex-
Unfortunately, considering my stack trace, during the cache insertion:
find_block_
apr_rmm_calloc() at apr_rmm.c:342 0x7ff7ea10ea68
util_ald_alloc() at util_ldap_
util_ldap_
util_ald_
uldap_cache_
Name : st->util_
Details:
Default:
Decimal:
Hex:0x7ff7ea75aee0
Binary:
Octal:
Name : proc_mutex
Details:
Default:
Decimal:
Hex:0x7ff7ea75aef8
Binary:
Octal:
Name : curr_locked
Details:0
Default:0
Decimal:0
Hex:0x0
Binary:0
Octal:0
I have curr_locked = 0
Changed in apache2 (Ubuntu):
status: New → In Progress
assignee: nobody → Rafael David Tinoco (inaddy)
importance: Undecided → Medium
Changed in apache2 (Ubuntu):
assignee: Rafael David Tinoco (inaddy) → nobody
Changed in apache2 (Ubuntu):
status: In Progress → Triaged
Changed in apache2:
importance: Unknown → Medium
status: Unknown → Incomplete
Changed in apache2:
status: Incomplete → Confirmed
(I originally filed this with Debian's apache team because when I searched for related information on this issue I turned up something in their bug tracker. Here is the link for more context: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=814980)
The anylock structure provided by mod_ldap was set to the type apr_anylock_none, which resulted in multiple threads mutating the shared RMM state at the same time without any concurrency guards. The issue I was seeing was that all the threads on the server were stuck inside the RMM internal function find_block_of_size every 2 or 3 days, requiring the server processes to be killed and restarted.
We are using Apache on Windows.
I made the following patch which passes in a lock to the RMM pool created in mod_ldap when APR_HAS_THREADS is defined. Since doing so we have not encountered any hangs:
diff -Naur httpd-2.4.16\include\util_ldap.h httpd-2.4.16-ea\include\util_ldap.h
--- httpd-2.4.16\include\util_ldap.h Mon Jul 14 05:07:55 2014
+++ httpd-2.4.16-ea\include\util_ldap.h Mon Aug 29 10:20:08 2016
@@ -169,6 +169,10 @@
 #if APR_HAS_SHARED_MEMORY
     apr_shm_t *cache_shm;
     apr_rmm_t *cache_rmm;
+#if APR_HAS_THREADS
+    apr_thread_mutex_t *lock;
+    apr_anylock_t cache_rmm_anylock;
+#endif
 #endif

     /* cache ald */
diff -Naur httpd-2.4.16\modules\ldap\util_ldap_cache.c httpd-2.4.16-ea\modules\ldap\util_ldap_cache.c
--- httpd-2.4.16\modules\ldap\util_ldap_cache.c Mon Aug 19 04:45:19 2013
+++ httpd-2.4.16-ea\modules\ldap\util_ldap_cache.c Mon Aug 29 10:23:04 2016
@@ -410,6 +410,14 @@
         st->cache_shm = NULL;
         return result;
     }
+
+#if APR_HAS_THREADS
+    apr_thread_mutex_destroy(st->lock);
+    st->lock = NULL;
+    st->cache_rmm_anylock.type = apr_anylock_none;
+    st->cache_rmm_anylock.lock.pm = NULL;
+#endif
+
 #endif
     return APR_SUCCESS;
 }
@@ -436,8 +444,18 @@
     /* Determine the usable size of the shm segment. */
     size = apr_shm_size_get(st->cache_shm);

+#if APR_HAS_THREADS
+    apr_thread_mutex_create(&st->lock, APR_THREAD_MUTEX_DEFAULT, st->pool);
+    st->cache_rmm_anylock.type = apr_anylock_threadmutex;
+    st->cache_rmm_anylock.lock.tm = st->lock;
+#else
+    st->lock = NULL;
+    st->cache_rmm_anylock.type = apr_anylock_none;
+    st->cache_rmm_anylock.lock.pm = NULL;
+#endif
+
     /* This will create a rmm "handler" to get into the shared memory area */
-    result = apr_rmm_init(&st->cache_rmm, NULL,
+    result = apr_rmm_init(&st->cache_rmm, &st->cache_rmm_anylock,
                           apr_shm_baseaddr_get(st->cache_shm), size,
                           st->pool);
     if (result != APR_SUCCESS) {
Thanks!