blk gets value 0x5772e56b36226557 because "next" was corrupted (value: 0x57726573553d554f). This happens because the lock
APR_ANYLOCK_LOCK(&rmm->lock);
in apr_rmm_calloc() is apr_anylock_none, like previously reported by me.
For the sake of exercising possibilities, if mod_ldap is calling APR RMM with external locking, it would be using LDAP_CACHE_LOCK. My current stack trace is this:
Thread #19 7092 (Suspended : Container)
kill() at syscall-template.S:84 0x7ff7e9911767
<signal handler called>() at 0x7ff7e9cb7390
find_block_of_size() at apr_rmm.c:106 0x7ff7ea10e25a
apr_rmm_calloc() at apr_rmm.c:342 0x7ff7ea10ea68
util_ald_alloc() at util_ldap_cache_mgr.c:105 0x7ff7e3369b3d
util_ldap_compare_node_copy() at util_ldap_cache.c:257 0x7ff7e3369784
util_ald_cache_insert() at util_ldap_cache_mgr.c:501 0x7ff7e336a310
uldap_cache_compare() at util_ldap.c:1,183 0x7ff7e33662d3
ldapgroup_check_authorization() at mod_authnz_ldap.c:925 0x7ff7e8459937
apply_authz_sections() at mod_authz_core.c:737 0x7ff7e4bb99fa
apply_authz_sections() at mod_authz_core.c:751 0x7ff7e4bb9c01
authorize_user_core() at mod_authz_core.c:840 0x7ff7e4bb9dca
ap_run_auth_checker() at request.c:91 0x56127e692f00
ap_process_request_internal() at request.c:335 0x56127e695d57
ap_process_async_request() at http_request.c:408 0x56127e6b4690
ap_process_request() at http_request.c:445 0x56127e6b4850
ap_process_http_sync_connection() at http_core.c:210 0x56127e6b091e
ap_process_http_connection() at http_core.c:251 0x56127e6b091e
ap_run_process_connection() at connection.c:41 0x56127e6a6bf0
ap_process_connection() at connection.c:213 0x56127e6a7000
process_socket() at worker.c:631 0x7ff7e2f51f8b
worker_thread() at worker.c:990 0x7ff7e2f51f8b
start_thread() at pthread_create.c:333 0x7ff7e9cad6ba
clone() at clone.S:109 0x7ff7e99e341d
Which means uldap_cache_compare() would have synchronized access to APR RMM calls through LDAP_CACHE_LOCK() macro. This doesn't seem to be the case as the lock doesn't seem to be acquired.
LDAP_CACHE_LOCK() translates into:
do {
if (st->util_ldap_cache_lock) apr_global_mutex_lock(st->util_ldap_cache_lock);
} while (0);
After the change proposed for this bug (where "util_ldap_cache_lock" would come from the ldap_merge_config), it seems that st has util_ldap_cache_lock and util_ldap_cache all set:
Name : util_ldap_cache_lock
Hex:0x7ff7ea75aee0
Name : util_ldap_cache
Hex:0x7ff7e0e51038
Meaning that it got the ldap_cache and ldap_cache_lock from the merge config function.
From the mutex acquire logic, for the apr_global_mutex_lock() -> apr_proc_mutex_lock():
Unfortunately, considering my stack trace, during the cache insertion:
find_block_of_size() at apr_rmm.c:106 0x7ff7ea10e25a
apr_rmm_calloc() at apr_rmm.c:342 0x7ff7ea10ea68
util_ald_alloc() at util_ldap_cache_mgr.c:105 0x7ff7e3369b3d
util_ldap_compare_node_copy() at util_ldap_cache.c:257 0x7ff7e3369784
util_ald_cache_insert() at util_ldap_cache_mgr.c:501 0x7ff7e336a310
uldap_cache_compare() at util_ldap.c:1,183 0x7ff7e33662d3
Name : st->util_ldap_cache_lock
Details:0x7ff7ea75aee0
Default:0x7ff7ea75aee0
Decimal:140702767230688
Hex:0x7ff7ea75aee0
Binary:11111111111011111101010011101011010111011100000
Octal:03777375235327340
Name : proc_mutex
Details:0x7ff7ea75aef8
Default:0x7ff7ea75aef8
Decimal:140702767230712
Hex:0x7ff7ea75aef8
Binary:11111111111011111101010011101011010111011111000
Octal:03777375235327370
Name : curr_locked
Details:0
Default:0
Decimal:0
Hex:0x0
Binary:0
Octal:0
I have curr_locked = 0
So far.. I can see:
The functions (from fcntl type of locking) touching curr_locked are:
Meaning that likely _release() might have raced with _acquire() somehow. One thing that comes to my thoughts, right away, is the fact that function proc_mutex_fcntl_release(), from apr, sets "curr_locked" to 0 before actually the fcntl() F_SETLKW succeeds BUT the locking is guaranteed using the file lock with fcntl() call and not due to the variable value change, so it would be okay-ish.
I can't see any other obvious reason why the lock is not acquired in this current execution. Still investigating... any ideas ?
Eric,
Even after your fix...
LDAP_CACHE_LOCK() is either missing a barrier or it is not enough for subsequent calls to APR with NULL locking (passed to APR_RMM_INIT). After patch for this bug has been applied, https:/ /bugs.launchpad .net/ubuntu/ +source/ apache2/ +bug/1752683 and the new bug https:/ /bugs.launchpad .net/ubuntu/ +source/ apache2/ +bug/1767105, end user still complains about seg faults and core dumps show same issue: race condition for rmm->base- >firstfree in function "find_block_ of_size" .
In the dump, in find_block_ of_size( ):
apr_rmm_off_t next = rmm->base- >firstfree; t*)((char* )rmm->base + next);
...
while(next) {
struct rmm_block_t *blk = (rmm_block_
blk gets value 0x5772e56b36226557 because "next" was corrupted (value: 0x57726573553d5 54f). This happens because the lock
APR_ ANYLOCK_ LOCK(&rmm- >lock);
in apr_rmm_calloc() is apr_anylock_none, like previously reported by me.
For the sake of exercising possibilities, if mod_ldap is calling APR RMM with external locking, it would be using LDAP_CACHE_LOCK. My current stack trace is this:
Thread #19 7092 (Suspended : Container) template. S:84 0x7ff7e9911767 of_size( ) at apr_rmm.c:106 0x7ff7ea10e25a cache_mgr. c:105 0x7ff7e3369b3d compare_ node_copy( ) at util_ldap_ cache.c: 257 0x7ff7e3369784 cache_insert( ) at util_ldap_ cache_mgr. c:501 0x7ff7e336a310 cache_compare( ) at util_ldap.c:1,183 0x7ff7e33662d3 check_authoriza tion() at mod_authnz_ ldap.c: 925 0x7ff7e8459937 authz_sections( ) at mod_authz_ core.c: 737 0x7ff7e4bb99fa authz_sections( ) at mod_authz_ core.c: 751 0x7ff7e4bb9c01 user_core( ) at mod_authz_ core.c: 840 0x7ff7e4bb9dca auth_checker( ) at request.c:91 0x56127e692f00 request_ internal( ) at request.c:335 0x56127e695d57 async_request( ) at http_request.c:408 0x56127e6b4690 request( ) at http_request.c:445 0x56127e6b4850 http_sync_ connection( ) at http_core.c:210 0x56127e6b091e http_connection () at http_core.c:251 0x56127e6b091e process_ connection( ) at connection.c:41 0x56127e6a6bf0 connection( ) at connection.c:213 0x56127e6a7000 create. c:333 0x7ff7e9cad6ba
kill() at syscall-
<signal handler called>() at 0x7ff7e9cb7390
find_block_
apr_rmm_calloc() at apr_rmm.c:342 0x7ff7ea10ea68
util_ald_alloc() at util_ldap_
util_ldap_
util_ald_
uldap_
ldapgroup_
apply_
apply_
authorize_
ap_run_
ap_process_
ap_process_
ap_process_
ap_process_
ap_process_
ap_run_
ap_process_
process_socket() at worker.c:631 0x7ff7e2f51f8b
worker_thread() at worker.c:990 0x7ff7e2f51f8b
start_thread() at pthread_
clone() at clone.S:109 0x7ff7e99e341d
Which means uldap_cache_ compare( ) would have synchronized access to APR RMM calls through LDAP_CACHE_LOCK() macro. This doesn't seem to be the case as the lock doesn't seem to be acquired.
LDAP_CACHE_LOCK() translates into:
do { ldap_cache_ lock)
apr_global_ mutex_lock( st->util_ ldap_cache_ lock);
if (st->util_
} while (0);
After the change proposed for this bug (where "util_ldap_ cache_lock" would come from the ldap_merge_config), it seems that st has util_ldap_ cache_lock and util_ldap_cache all set:
Name : util_ldap_ cache_lock
Hex:0x7ff7ea75aee0
Name : util_ldap_cache
Hex:0x7ff7e0e51038
Meaning that it got the ldap_cache and ldap_cache_lock from the merge config function.
From the mutex acquire logic, for the apr_global_ mutex_lock( ) -> apr_proc_ mutex_lock( ):
apr_status_t apr_proc_ mutex_lock( apr_proc_ mutex_t *mutex) meth->acquire( mutex);
{
return mutex->
}
And, from my dump, it would translate into:
st->util_ ldap_cache_ lock->proc_ mutex-> meth->acquire == proc_mutex_ fcntl_acquire( )
And from that logic:
static apr_status_t proc_mutex_ fcntl_acquire( apr_proc_ mutex_t *mutex)
{
int rc;
do { >interproc- >filedes, F_SETLKW, &proc_mutex_ lock_it) ; >curr_locked= 1;
rc = fcntl(mutex-
} while (rc < 0 && errno == EINTR);
if (rc < 0) {
return errno;
}
mutex-
return APR_SUCCESS;
}
Again, from the dump, We would guarantee mutex lock through a file descriptor to the file:
"/var/lock/ apache2/ ldap-cache. 1368" (filedes == 15)
And the "mutex- >curr_locked" would be 1.
Unfortunately, considering my stack trace, during the cache insertion:
find_block_ of_size( ) at apr_rmm.c:106 0x7ff7ea10e25a cache_mgr. c:105 0x7ff7e3369b3d compare_ node_copy( ) at util_ldap_ cache.c: 257 0x7ff7e3369784 cache_insert( ) at util_ldap_ cache_mgr. c:501 0x7ff7e336a310 compare( ) at util_ldap.c:1,183 0x7ff7e33662d3
apr_rmm_calloc() at apr_rmm.c:342 0x7ff7ea10ea68
util_ald_alloc() at util_ldap_
util_ldap_
util_ald_
uldap_cache_
Name : st->util_ ldap_cache_ lock 0x7ff7ea75aee0 0x7ff7ea75aee0 140702767230688 111111111110111 111010100111010 110101110111000 00 037773752353273 40
Details:
Default:
Decimal:
Hex:0x7ff7ea75aee0
Binary:
Octal:
Name : proc_mutex 0x7ff7ea75aef8 0x7ff7ea75aef8 140702767230712 111111111110111 111010100111010 110101110111110 00 037773752353273 70
Details:
Default:
Decimal:
Hex:0x7ff7ea75aef8
Binary:
Octal:
Name : curr_locked
Details:0
Default:0
Decimal:0
Hex:0x0
Binary:0
Octal:0
I have curr_locked = 0
So far.. I can see:
The functions (from fcntl type of locking) touching curr_locked are:
proc_mutex_ fcntl_acquire( ) -> fcntl(proc_ mutex_lock_ it), curr_locked = 1 fcntl_tryacquir e() -> unblocking fcntl(proc_ mutex_lock_ it), curr_locked = 1
proc_mutex_
proc_mutex_ fcntl_create( ) -> apr_file_open(), curr_locked = 0 fcntl_release( ) -> curr_locked = 0, fcntl(proc_ mutex_unlock_ it)
proc_mutex_
proc_mutex_ fcntl_cleanup( ) -> if (cur_locked = 1) { mutex_fcntl_ release( ) } -> apr_file_close()
Meaning that likely _release() might have raced with _acquire() somehow. One thing that comes to my thoughts, right away, is the fact that function proc_mutex_ fcntl_release( ), from apr, sets "curr_locked" to 0 before actually the fcntl() F_SETLKW succeeds BUT the locking is guaranteed using the file lock with fcntl() call and not due to the variable value change, so it would be okay-ish.
I can't see any other obvious reason why the lock is not acquired in this current execution. Still investigating... any ideas ?