IMPORTANT: Metrics issue (Abnormal status of task) - master node automatically restarting, each worker running tasks generate abnormal tasks #1539

Open
KrystianJanas opened this issue Jan 7, 2025 · 2 comments
Labels: bug, performance

Comments

KrystianJanas commented Jan 7, 2025

Describe the bug
For a long time we have been seeing a problem caused by the refreshing of metrics that the master node collects from the worker nodes. We are currently running on Docker, in the AWS cloud, with 1 master node and 8 worker nodes.

The problem is that the master node often restarts for no apparent reason. After longer analysis, it turned out that the cause is the metrics, which cannot be turned off in any way because such an option has not been implemented. Having one would be very useful.

Sometimes it is possible to knock the metrics out temporarily by restarting the entire infrastructure or adding one more worker node, but this is not a permanent solution: it only postpones the problem for 1-2 days.

Because of the metrics, a worker node often loses its connection to the master node while a task is running, so the task ends up with the "abnormal" status and we have to check manually whether it has actually completed or is still running. This is very burdensome for us at the moment, as each worker has at least 4-5 tasks running.

We're running the master node and each worker node on the crawlab-pro:latest image.

Master-node configuration:

version: '3.4'
services:
  crawlab:
    image: crawlabteam/crawlab-pro:latest
    container_name: crawlab
    restart: always
    environment:
      - CRAWLAB_LICENSE
      - CRAWLAB_NODE_MASTER
      - CRAWLAB_MONGO_DB
      - CRAWLAB_MONGO_URI
      - CRAWLAB_DISABLE_METRICS
    volumes:
      - "/opt/.crawlab/master:/root/.crawlab"  # persistent crawlab metadata
      - "/opt/crawlab/master:/data"  # persistent crawlab data
    ports:
      - "9666:9666"  # exposed grpc port
    mem_limit: 7G
    logging:
      options:
        max-size: "15g"
        max-file: "4"


  auth:
    build: .
    container_name: auth
    environment:
      - CRAWLAB_FORWARD_PORT
      - HTPASSWD
    ports:
      - "80:8080"  # crawlab
    depends_on:
      - crawlab
    mem_limit: 1G
    logging:
      options:
        max-size: "2g"
        max-file: "5"

Worker-node configuration:

version: '3.5'
services:
  worker:
    image: crawlabteam/crawlab-pro:latest
    container_name: crawlab_worker
    restart: always
    environment:
      CRAWLAB_LICENSE: "${CRAWLAB_LICENSE}"
      CRAWLAB_NODE_MASTER: "N"  # N: worker node
      CRAWLAB_GRPC_ADDRESS: "${MASTER_NODE_IP}:9666"  # grpc address
      CRAWLAB_FS_FILER_URL: "http://${MASTER_NODE_IP}/api/filer"  # seaweedfs api
    volumes:
      - "/opt/.crawlab/worker:/root/.crawlab"  # persistent crawlab metadata
      - "/opt/crawlab/worker:/data"  # persistent crawlab data
      - "/opt/crawlab/worker/download:/download" # folder for storing downloaded files
    mem_limit: 7G
    logging:
      options:
        max-size: "3g"
        max-file: "3"

Expected behavior
Add a flag to disable/enable metrics, or fix this issue.

Screenshots
[two screenshots attached, not reproduced here]

KrystianJanas added the bug label on Jan 7, 2025
KrystianJanas (Author) commented

@tikazyq please take a look at this. We created a similar issue a few months ago, but it was unfortunately forgotten.

tikazyq (Collaborator) commented Jan 9, 2025

Hi @KrystianJanas, thanks for using Crawlab Pro and for your invaluable feedback. I have noticed the issue as well, but unfortunately there is no quick fix for the performance problem potentially caused by the metrics module, as the engine behind it is Prometheus. If you can, please record resource consumption (memory, CPU, disk I/O) for the main processes such as crawlab-server, prometheus, weed, etc., so that we can precisely locate the root cause; one possible way to capture this is sketched below.
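
A minimal sketch of one way to record this, assuming the containers are named crawlab and crawlab_worker as in the compose files above; the log paths and the 60-second interval are arbitrary choices:

# On each host: log container-level CPU / memory / block I/O once a minute
while true; do
  {
    date -Is
    docker stats --no-stream --format "{{.Name}}  {{.CPUPerc}}  {{.MemUsage}}  {{.BlockIO}}"
  } >> /tmp/crawlab-container-stats.log
  sleep 60
done

# Per-process snapshot inside the master container (crawlab-server, prometheus, weed, ...),
# assuming ps is available in the image
docker exec crawlab ps aux >> /tmp/crawlab-process-stats.log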

In the meantime, we are close to a new major release (0.7.0), which is in the final stage of testing before the formal announcement. It should address the issue you mentioned, since we have removed most third-party middleware dependencies such as Prometheus and SeaweedFS and replaced them with native Golang code. If you are interested in early access, please let me know and I'll push the latest "test" version for you to try.

tikazyq added the performance label on Jan 9, 2025