Consumer randomly gets stuck and times out when communicating with Confluent Cloud #2667

@roman-bartusiak-yohana

Description

Issue:

We observe random consumer timeouts. We have contacted Confluent about this, and they do not see anything wrong on the broker side. CPU and memory usage of the consumer are neither high nor hitting their limits during these events. The only correlation we observe is that it happens more often when we start a lot of consumers (50) per pod. Increasing the timeout is not really an option, as it is already 5 minutes. The timeouts do not correlate with high load.

Sample logs


Level | Time | Instance | Service | Message
-- | -- | -- | -- | --
warn | Jul 24 20:35:17.924 | i-094c46e288d48e5ce | aidp-api | Marking the coordinator dead (node coordinator-5) for group aidp-chat_completion-request-consumer: [Error 7] RequestTimedOutError: Request timed out after 305000.0 ms.
error | Jul 24 20:35:17.924 | i-094c46e288d48e5ce | aidp-api | Error sending JoinGroupRequest_v4 to node coordinator-5 [[Error 7] RequestTimedOutError: Request timed out after 305000.0 ms]
error | Jul 24 20:35:17.924 | i-094c46e288d48e5ce | aidp-api | [IPv4 ('34.211.165.150', 9092)]>: Closing connection. [Error 7] RequestTimedOutError: Request timed out after 305000.0 ms
warn | Jul 24 20:35:17.924 | i-094c46e288d48e5ce | aidp-api | [IPv4 ('34.211.165.150', 9092)]> timed out after 305000.0 ms. Closing connection.

  1. Stack
    • kafka-python: 2.2.9
    • python 3.12
    • against Confluent Cloud
    • multiple consumers in separate threads - not sharing consumer instances, every thread has a separate consumer
    • 200 partitions on the topic
    • up to 50 listener threads
    • poll interface with timeout
    • consumer config:
      Creating Kafka consumer for topic: pwell_dev_us-west-2_aidp-chat_completion-result with config: {'client.id': 'aidp-sdk', 'bootstrap.servers': '', 'sasl.plain.username': '', 'sasl.plain.password': '', 'enable.auto.commit': False, 'partition.assignment.strategy': [<class 'kafka.coordinator.assignors.sticky.sticky_assignor.StickyPartitionAssignor'>], 'enable.incremental.fetch.sessions': False, 'api.version.auto.timeout.ms': 60000} and extra config: {'group.id': 'aidp-chat_completion-result-consumer', 'auto.offset.reset': 'earliest', 'enable.auto.commit': False, 'key.deserializer': <function CallbackContainer.<lambda> at 0x7f5367a41c60>, 'value.deserializer': <function CallbackContainer.<lambda> at 0x7f5367a41f80>, 'max.poll.records': 150, 'max.poll.interval.ms': 30000000, 'max.partition.fetch.bytes': 100}
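
For readers trying to reproduce this, the threading layout described in the list above can be sketched roughly as follows. This is not the actual application code: `run_listener`, `start_listeners`, `consumer_factory`, and `handle_record` are hypothetical names, and the factory is assumed to return a fresh `KafkaConsumer` built from the config above. Only `poll(timeout_ms=...)`, `commit()`, and `close()` are called on the consumer.

```python
import threading


def run_listener(consumer_factory, handle_record, stop_event, poll_timeout_ms=1000):
    """Poll loop for one thread; the consumer is created and used by this thread only."""
    consumer = consumer_factory()  # hypothetical factory, e.g. lambda: KafkaConsumer(...)
    try:
        while not stop_event.is_set():
            # poll() returns a dict mapping partition -> list of records;
            # it is empty when nothing arrived within the timeout.
            batches = consumer.poll(timeout_ms=poll_timeout_ms)
            for records in batches.values():
                for record in records:
                    handle_record(record)
            if batches:
                # Auto commit is disabled in the config above, so commit manually.
                consumer.commit()
    finally:
        consumer.close()


def start_listeners(consumer_factory, handle_record, count):
    """Start `count` listener threads and return (stop_event, threads)."""
    stop_event = threading.Event()
    threads = [
        threading.Thread(
            target=run_listener,
            args=(consumer_factory, handle_record, stop_event),
            daemon=True,
        )
        for _ in range(count)
    ]
    for thread in threads:
        thread.start()
    return stop_event, threads
```

With up to 50 such threads per pod, each consumer joins the group on its own, which is the situation in which the timeouts show up more often.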

It is hard to provide more details. If I could enable some debugging I would try it; please provide instructions. When this happens, we need to restart the pods to get things working again.
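
If plain Python logging is enough, something like the following might be a starting point for capturing more detail. It is a sketch under the assumption that the library's loggers live under the `kafka` namespace; `enable_kafka_debug_logging` and the log file name are hypothetical. Only the standard `logging` module is used.

```python
import logging


def enable_kafka_debug_logging(log_path="kafka-debug.log"):
    """Send DEBUG output of all loggers under 'kafka' to a file (assumed namespace)."""
    logger = logging.getLogger("kafka")
    logger.setLevel(logging.DEBUG)
    handler = logging.FileHandler(log_path)
    # Include the thread name so the up to 50 listener threads can be told apart.
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(threadName)s %(name)s: %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

Calling it once at startup, before the consumers are created, should be sufficient. Debug output for 50 consumers is likely to be large, so writing to a file rather than the pod's stdout seems safer.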
