
[Bug] Closing a thread across cores in an SMP environment #10554

@eatvector

Description

RT-Thread Version

master

Hardware Type/Architectures

all

Develop Toolchain

GCC

Describe the bug

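1 In an SMP environment, is there currently a concurrency problem when rt_thread_detach or rt_thread_delete is used to detach or delete a target thread that runs on another core? Below is an example. Thread A takes the same mutex twice and then spins in an infinite loop. Thread B detaches thread A and then tries to take the mutex.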

#include <rtthread.h>

#define TEST_ITERATIONS      100  // number of test iterations
#define THREAD_PRIORITY      25
#define THREAD_STACK_SIZE    4096
#define THREAD_TIMESLICE     100


static struct rt_thread thread[TEST_ITERATIONS];
static rt_uint8_t thread_stack[TEST_ITERATIONS][THREAD_STACK_SIZE];


/* Thread entry function */
static void thread_entry(void *parameter)
{
    rt_mutex_t mutex = (rt_mutex_t)parameter;
    
    /* Take the mutex for the first time */
    rt_kprintf("thread try to take mutex first time\n");
    rt_mutex_take(mutex, RT_WAITING_FOREVER);
    rt_kprintf("thread take mutex first time success\n");
     
    /* Take the mutex a second time (recursive take) */
    rt_kprintf("thread try to take mutex second time\n");
    rt_mutex_take(mutex, RT_WAITING_FOREVER);
    rt_kprintf("thread take mutex second time success\n");
    
    /* Deliberately do not release the mutex */
    rt_kprintf("thread exit without release mutex\n");
    
    while(1) {} // keep the thread running until it is detached
}

int mutex_recursive_test(void)
{
    int success_count = 0;
    int fail_count = 0;
    
    rt_kprintf("\nStarting mutex recursive stress test (%d iterations)...\n", TEST_ITERATIONS);
    
    for(int i = 0; i < TEST_ITERATIONS; i++) {
        rt_kprintf("\n=== Iteration %d ===\n", i+1);
        
        /* Create the mutex */
        rt_mutex_t mutex = rt_mutex_create("test_mutex", RT_IPC_FLAG_PRIO);
        if (mutex == RT_NULL) {
            rt_kprintf("[ERROR] create mutex failed at iteration %d\n", i+1);
            fail_count++;
            continue;
        }

        
        rt_kprintf("Init thread\n");

        /* Initialize the thread */
        rt_thread_init(&thread[i],
                     "test_thread",
                     thread_entry,
                     (void *)mutex,
                     thread_stack[i],
                     sizeof(thread_stack[i]),
                     THREAD_PRIORITY,
                     THREAD_TIMESLICE);
        
        /* Start the thread */
        rt_thread_startup(&thread[i]);
   
        /* Detach the thread */
        rt_kprintf("main thread detach thread\n");
        rt_thread_detach(&thread[i]);
        
        /* Try to take the mutex */
        rt_kprintf("main thread try to take mutex after detach\n");
        if (rt_mutex_take(mutex, 100) == RT_EOK) {
            rt_kprintf("main thread take mutex success after detach\n");
            rt_mutex_release(mutex);
            success_count++;
        } else {
            rt_kprintf("[BUG] main thread still cannot take mutex!\n");
            fail_count++;
        }
        
        /* Delete the mutex */
        rt_mutex_delete(mutex);
        
       
    }
    
    /* Print the statistics */
    rt_kprintf("\nTest completed:\n");
    rt_kprintf("  Success: %d\n", success_count);
    rt_kprintf("  Failed:  %d\n", fail_count);
    /* Integer percentage: rt_kprintf may not support %f in the default configuration */
    rt_kprintf("  Success rate: %d%%\n", success_count * 100 / TEST_ITERATIONS);
    
    return 0;
}

/* Export to the msh command list */
MSH_CMD_EXPORT(mutex_recursive_test, mutex recursive stress test);

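First, note that the example above also fails in a single-core environment. This is because rt_thread_detach on the current master does not release the mutexes held by the detached target thread. I described this problem in #10547 and proposed a change there so that, in a single-core environment, the mutexes held by the detached thread are released correctly.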

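Once the example ran correctly in a single-core environment, I found that it can still fail intermittently in a multi-core environment. After investigating, I think the following sequence may occur when thread A and thread B run on two different cores: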

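t: Thread A takes the mutex.

t+1: Thread B detaches thread A, which releases the mutex that A holds. The problem is that the call chain rt_thread_detach -> _thread_detach -> rt_thread_close only removes thread A's control block from the scheduler queue and sets its state to closed. Thread A can keep running on its core, and it is not fully stopped until that core triggers a reschedule.

t+2: Thread A takes the mutex again.

t+3: Thread B tries to take the mutex and fails because A holds it.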
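To make the ordering easier to follow, here is a minimal single-threaded model of the four steps in plain C. It is only a sketch: the `model_*` types and functions are hypothetical names invented for this illustration, not RT-Thread structures or APIs, and the "still on its core" condition is reduced to a boolean flag.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types; these are NOT RT-Thread structures. */
struct model_thread {
    bool closed;   /* set by the modelled detach */
    bool on_cpu;   /* thread is still executing on its core */
};

struct model_mutex {
    struct model_thread *owner; /* NULL when the mutex is free */
    int hold;                   /* recursive hold count */
};

/* Recursive take: succeeds if the mutex is free or already owned by t. */
static bool model_take(struct model_mutex *m, struct model_thread *t)
{
    if (m->owner != NULL && m->owner != t)
        return false;
    m->owner = t;
    m->hold++;
    return true;
}

/* Modelled detach: mark the thread closed and force-release its mutex,
 * but do NOT take the thread off its core. */
static void model_detach(struct model_thread *t, struct model_mutex *m)
{
    t->closed = true;
    if (m->owner == t) {
        m->owner = NULL;
        m->hold = 0;
    }
}

/* Replays steps t .. t+3; returns true only if B obtains the mutex. */
static bool model_replay_timeline(void)
{
    struct model_thread a = { false, true };
    struct model_thread b = { false, true };
    struct model_mutex m = { NULL, 0 };

    model_take(&m, &a);         /* t:   A takes the mutex               */
    model_detach(&a, &m);       /* t+1: B detaches A, mutex is released */
    if (a.on_cpu)               /* A's core has not rescheduled yet     */
        model_take(&m, &a);     /* t+2: A takes the mutex again         */
    return model_take(&m, &b);  /* t+3: B's take fails, A is the owner  */
}
```

In this model `model_replay_timeline()` returns false, because A was marked closed but never left its core between t+1 and t+2.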

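The problem can be summarized as follows: in the cross-core case, the target thread may still be running after rt_thread_detach or rt_thread_delete has returned, and this can lead to bugs that are very hard to understand.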
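One ordering that avoids the race, at least in a simplified model, is to release the target's resources only after it has actually left its core. The sketch below is a single-threaded model in plain C under that assumption. All `sk_*` names are hypothetical and invented here; this is not RT-Thread code and I am not claiming it matches any existing patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types; these are NOT RT-Thread structures. */
struct sk_thread {
    bool closed;  /* a detach has been requested */
    bool on_cpu;  /* thread is still executing on its core */
};

struct sk_mutex {
    struct sk_thread *owner; /* NULL when the mutex is free */
    int hold;                /* recursive hold count */
};

/* Recursive take: succeeds if the mutex is free or already owned by t. */
static bool sk_take(struct sk_mutex *m, struct sk_thread *t)
{
    if (m->owner != NULL && m->owner != t)
        return false;
    m->owner = t;
    m->hold++;
    return true;
}

/* Step 1 of the modelled detach: only mark the target as closed. */
static void sk_detach_request(struct sk_thread *t)
{
    t->closed = true;
}

/* Models the target's core rescheduling: a closed thread leaves the CPU. */
static void sk_core_reschedule(struct sk_thread *t)
{
    if (t->closed)
        t->on_cpu = false;
}

/* Step 2: release the mutex only once the target is off its core.
 * Returns false if the target is still running and the caller must retry. */
static bool sk_detach_finish(struct sk_thread *t, struct sk_mutex *m)
{
    if (t->on_cpu)
        return false;
    if (m->owner == t) {
        m->owner = NULL;
        m->hold = 0;
    }
    return true;
}

/* Replays the same scenario with the deferred release;
 * returns true only if B obtains the mutex at the end. */
static bool sk_replay_ordered(void)
{
    struct sk_thread a = { false, true };
    struct sk_thread b = { false, true };
    struct sk_mutex m = { NULL, 0 };

    sk_take(&m, &a);                 /* A takes the mutex                 */
    sk_detach_request(&a);           /* B requests the detach             */
    if (sk_detach_finish(&a, &m))    /* must not finish while A still runs */
        return false;
    sk_take(&m, &a);                 /* A retakes; it is still the owner  */
    sk_core_reschedule(&a);          /* A's core reschedules, A stops     */
    if (!sk_detach_finish(&a, &m))   /* now the release can happen        */
        return false;
    return sk_take(&m, &b);          /* B takes the mutex                 */
}
```

Here `sk_replay_ordered()` returns true: A's second take happens while A still owns the mutex, and the release only takes effect after A can no longer run. A real kernel would need some way for the detaching core to learn that the target core has rescheduled, which this model does not attempt to show.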

2 A possible solution is proposed in #10549. Feedback and suggestions are welcome.

Other additional context

No response
