
Unable to scale down task groups independently in nomad #24947

Open

sijo13 opened this issue Jan 25, 2025 · 1 comment

sijo13 commented Jan 25, 2025

Hi,

We are using the open-source version of Nomad, v1.9.3.

There is an issue where we are unable to scale down task groups in a job independently.

When a task group is scaled down or up, the task belonging to another group of the same job restarts.

Reproduction steps

In the job specification below, we have two task groups, nginx-group1 and nginx-group2.

job "nginx" {
  namespace = "platforms"
  node_pool = "platforms"
  group "nginx-group1" {
    count = 1
    network {
      port "http" {
        to = "80"
      }
    }
    task "nginx-task" {
     driver = "docker"
     config {
        image = "nginx:latest"
        ports = ["http"]

    }

      service {
        name     = "platforms-nginx-service"
        port     = "http"
        provider = "consul"

        check {
          type     = "http"
          port     = "http"
          path     = "/"
          interval = "2s"
          timeout  = "2s"
        }
      }
    }
  }
  group "nginx-group2" {
    count = 1
    network {
      port "http" {
        to = "80"
      }
    }
    task "nginx-task" {
     driver = "docker"
     config {
        image = "nginx:latest"
        ports = ["http"]

    }

      service {
        name     = "platforms-nginx-service"
        port     = "http"
        provider = "consul"

        check {
          type     = "http"
          port     = "http"
          path     = "/"
          interval = "2s"
          timeout  = "2s"
        }
      }
    }
  }
}

Scaling down nginx-group1 to 0:

sijo.george@macblr0263 nomad_job % nomad job scale nginx nginx-group1 0
==> 2025-01-25T14:10:59+05:30: Monitoring evaluation "648b5bb9"
    2025-01-25T14:10:59+05:30: Evaluation triggered by job "nginx"
    2025-01-25T14:10:59+05:30: Allocation "fb80a82b" modified: node "86c82a4a", group "nginx-group2"
    2025-01-25T14:11:00+05:30: Evaluation within deployment: "2538df26"
    2025-01-25T14:11:00+05:30: Evaluation status changed: "pending" -> "complete"
==> 2025-01-25T14:11:00+05:30: Evaluation "648b5bb9" finished with status "complete"
==> 2025-01-25T14:11:00+05:30: Monitoring deployment "2538df26"
  ✓ Deployment "2538df26" successful

    2025-01-25T14:11:12+05:30
    ID          = 2538df26
    Job ID      = nginx
    Job Version = 3
    Status      = successful
    Description = Deployment completed successfully

    Deployed
    Task Group    Desired  Placed  Healthy  Unhealthy  Progress Deadline
    nginx-group2  1        1       1        0          2025-01-25T14:21:09+05:30

This results in a restart of the allocation belonging to nginx-group2:

sijo.george@macblr0263 nomad_job % nomad job status nginx
ID            = nginx
Name          = nginx
Submit Date   = 2025-01-25T14:10:59+05:30
Type          = service
Priority      = 50
Datacenters   = *
Namespace     = platforms
Node Pool     = platforms
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group    Queued  Starting  Running  Failed  Complete  Lost  Unknown
nginx-group1  0       0         0        0       2         0     0
nginx-group2  0       0         1        0       0         0     0

Latest Deployment
ID          = 2538df26
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group    Desired  Placed  Healthy  Unhealthy  Progress Deadline
nginx-group2  1        1       1        0          2025-01-25T14:21:09+05:30

Allocations
ID        Node ID   Task Group    Version  Desired  Status    Created     Modified
9b31b80a  86c82a4a  nginx-group1  2        stop     complete  14m13s ago  2m13s ago
e8687ec7  86c82a4a  nginx-group1  0        stop     complete  15m18s ago  14m36s ago
fb80a82b  86c82a4a  nginx-group2  3        run      running   15m18s ago  2m2s ago
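
For reference, the nomad job scale command used above goes through Nomad's job scale HTTP API; a rough equivalent of the same per-group request (assuming a local agent at the default address, with any ACL token omitted) would look like:

curl -X POST "http://localhost:4646/v1/job/nginx/scale?namespace=platforms" \
  -d '{"Count": 0, "Target": {"Group": "nginx-group1"}, "Message": "scale down nginx-group1"}'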

Expected Result

Nomad task groups should scale up and down independently.

gulducat (Member) commented Jan 28, 2025

Hey there @sijo13, thanks for the report!

I'm afraid I'm not able to replicate this with Nomad 1.9.3.

I used your example job specification with just a couple of changes:

  • Set namespace and node_pool both to "default"
  • Set the services' provider to "nomad"

so that I could test them on a dev agent on its own: sudo nomad agent -dev

my jobspec: nginx.nomad.hcl
job "nginx" {
  namespace = "default"
  node_pool = "default"
  group "nginx-group1" {
    count = 1
    network {
      port "http" {
        to = "80"
      }
    }
    task "nginx-task" {
     driver = "docker"
     config {
        image = "nginx:latest"
        ports = ["http"]

    }

      service {
        name     = "platforms-nginx-service"
        port     = "http"
        provider = "nomad"

        check {
          type     = "http"
          port     = "http"
          path     = "/"
          interval = "2s"
          timeout  = "2s"
        }
      }
    }
  }
  group "nginx-group2" {
    count = 1
    network {
      port "http" {
        to = "80"
      }
    }
    task "nginx-task" {
     driver = "docker"
     config {
        image = "nginx:latest"
        ports = ["http"]

    }

      service {
        name     = "platforms-nginx-service"
        port     = "http"
        provider = "nomad"

        check {
          type     = "http"
          port     = "http"
          path     = "/"
          interval = "2s"
          timeout  = "2s"
        }
      }
    }
  }
}

Running the job:

nomad job run nginx.nomad.hcl
==> 2025-01-28T17:07:03-05:00: Monitoring evaluation "3862ae30"
    2025-01-28T17:07:03-05:00: Evaluation triggered by job "nginx"
    2025-01-28T17:07:04-05:00: Evaluation within deployment: "dcfcd7ba"
    2025-01-28T17:07:04-05:00: Allocation "8d1494eb" created: node "fe84903d", group "nginx-group2"
    2025-01-28T17:07:04-05:00: Allocation "97090987" created: node "fe84903d", group "nginx-group1"
    2025-01-28T17:07:04-05:00: Evaluation status changed: "pending" -> "complete"
==> 2025-01-28T17:07:04-05:00: Evaluation "3862ae30" finished with status "complete"
==> 2025-01-28T17:07:04-05:00: Monitoring deployment "dcfcd7ba"
  ✓ Deployment "dcfcd7ba" successful

    2025-01-28T17:07:16-05:00
    ID          = dcfcd7ba
    Job ID      = nginx
    Job Version = 0
    Status      = successful
    Description = Deployment completed successfully

    Deployed
    Task Group    Desired  Placed  Healthy  Unhealthy  Progress Deadline
    nginx-group1  1        1       1        0          2025-01-28T17:17:14-05:00
    nginx-group2  1        1       1        0          2025-01-28T17:17:14-05:00

produced 2 allocations:

$ nomad status nginx | tail -n4
Allocations
ID        Node ID   Task Group    Version  Desired  Status   Created   Modified
8d1494eb  fe84903d  nginx-group2  0        run      running  2m4s ago  1m52s ago
97090987  fe84903d  nginx-group1  0        run      running  2m4s ago  1m52s ago

Scaling group1 down to 0:

nomad job scale nginx nginx-group1 0
==> 2025-01-28T17:09:58-05:00: Monitoring evaluation "c32b3547"
    2025-01-28T17:09:58-05:00: Evaluation triggered by job "nginx"
    2025-01-28T17:09:58-05:00: Evaluation within deployment: "42ff6634"
    2025-01-28T17:09:58-05:00: Allocation "8d1494eb" modified: node "fe84903d", group "nginx-group2"
    2025-01-28T17:09:58-05:00: Evaluation status changed: "pending" -> "complete"
==> 2025-01-28T17:09:58-05:00: Evaluation "c32b3547" finished with status "complete"
==> 2025-01-28T17:09:58-05:00: Monitoring deployment "42ff6634"
  ✓ Deployment "42ff6634" successful

    2025-01-28T17:10:10-05:00
    ID          = 42ff6634
    Job ID      = nginx
    Job Version = 1
    Status      = successful
    Description = Deployment completed successfully

    Deployed
    Task Group    Desired  Placed  Healthy  Unhealthy  Progress Deadline
    nginx-group2  1        1       1        0          2025-01-28T17:20:08-05:00

left one allocation remaining:

$ nomad status nginx | tail -n4
Allocations
ID        Node ID   Task Group    Version  Desired  Status    Created    Modified
8d1494eb  fe84903d  nginx-group2  1        run      running   3m39s ago  33s ago
97090987  fe84903d  nginx-group1  0        stop     complete  3m39s ago  43s ago

Allocation 8d1494eb for nginx-group2 from the initial job run is still running. The job Version increases when the scale changes, because the group count is part of the job specification, so the alloc shows as Modified with the new job version, but the workload itself continued running uninterrupted.
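
To double-check that the task itself was never restarted (a rough sketch; 8d1494eb is the allocation ID from the output above), the allocation's recent task events can be inspected:

nomad alloc status 8d1494eb

If the workload kept running across the scale, the task's Recent Events table should still show only the original Received / Task Setup / Driver / Started entries, with no Restarting or Killing events after the scale operation.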

Although I can't see what your allocs looked like before the job scale, I believe your allocation fb80a82b may be the same as my 8d1494eb?

Let me know if there's some aspect of this that I'm missing, and I'll be happy to take a look!

@gulducat gulducat self-assigned this Jan 28, 2025
@gulducat gulducat moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Jan 28, 2025