Solved

How to Debug catalog-svc


Hi,

I am testing Kasten K10 in a newly installed AWS EKS cluster running Kubernetes 1.29. I followed the latest documentation and installed it with Helm, using the extra set parameters secrets.awsAccessKeyId and secrets.awsSecretAccessKey.

Almost all Pods and PVCs come up, except one Pod and its PVC: catalog-svc.

When I run kubectl logs -n kasten-io <PODNAME> --all-containers=true it gives me nothing, and a describe of the pod doesn't give me a lot of information either. The commands and the describe output are below.
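These are roughly the commands I used (<PODNAME> is a placeholder for the actual catalog-svc pod name):

# Logs from all containers in the catalog pod (returns nothing)
kubectl logs -n kasten-io <PODNAME> --all-containers=true

# Describe the pod to check its spec, conditions and events
kubectl describe pod -n kasten-io <PODNAME>

The describe output: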

 Containers:

  catalog-svc:

    Image:      gcr.io/kasten-images/catalog:6.5.10

    Port:       8000/TCP

    Host Port:  0/TCP

    Requests:

      cpu:      200m

      memory:   780Mi

    Liveness:   http-get http://:8000/v0/healthz delay=300s timeout=1s period=10s #success=1 #failure=3

    Readiness:  http-get http://:8000/v0/healthz delay=3s timeout=1s period=10s #success=1 #failure=3

    Environment:

      AWS_ACCESS_KEY_ID:                              <set to the key 'aws_access_key_id' in secret 'aws-creds'>      Optional: false

      AWS_SECRET_ACCESS_KEY:                          <set to the key 'aws_secret_access_key' in secret 'aws-creds'>  Optional: false

      VERSION:                                        <set to the key 'version' of config map 'k10-config'>           Optional: false

      K10_CAPABILITIES:                               mc

      K10_HOST_SVC:                                   catalog

      MODEL_STORE_DIR:                                <set to the key 'modelstoredirname' of config map 'k10-config'>  Optional: false

      LOG_LEVEL:                                      <set to the key 'loglevel' of config map 'k10-config'>           Optional: false

      POD_NAMESPACE:                                  kasten-io (v1:metadata.namespace)

      CONCURRENT_SNAP_CONVERSIONS:                    <set to the key 'concurrentSnapConversions' of config map 'k10-config'>               Optional: false

      CONCURRENT_WORKLOAD_SNAPSHOTS:                  <set to the key 'concurrentWorkloadSnapshots' of config map 'k10-config'>             Optional: false

      K10_DATA_STORE_PARALLEL_UPLOAD:                 <set to the key 'k10DataStoreParallelUpload' of config map 'k10-config'>              Optional: false

      K10_DATA_STORE_GENERAL_CONTENT_CACHE_SIZE_MB:   <set to the key 'k10DataStoreGeneralContentCacheSizeMB' of config map 'k10-config'>   Optional: false

      K10_DATA_STORE_GENERAL_METADATA_CACHE_SIZE_MB:  <set to the key 'k10DataStoreGeneralMetadataCacheSizeMB' of config map 'k10-config'>  Optional: false

      K10_DATA_STORE_RESTORE_CONTENT_CACHE_SIZE_MB:   <set to the key 'k10DataStoreRestoreContentCacheSizeMB' of config map 'k10-config'>   Optional: false

      K10_DATA_STORE_RESTORE_METADATA_CACHE_SIZE_MB:  <set to the key 'k10DataStoreRestoreMetadataCacheSizeMB' of config map 'k10-config'>  Optional: false

      K10_LIMITER_GENERIC_VOLUME_SNAPSHOTS:           <set to the key 'K10LimiterGenericVolumeSnapshots' of config map 'k10-config'>        Optional: false

      K10_LIMITER_GENERIC_VOLUME_COPIES:              <set to the key 'K10LimiterGenericVolumeCopies' of config map 'k10-config'>           Optional: false

      K10_LIMITER_GENERIC_VOLUME_RESTORES:            <set to the key 'K10LimiterGenericVolumeRestores' of config map 'k10-config'>         Optional: false

      K10_LIMITER_CSI_SNAPSHOTS:                      <set to the key 'K10LimiterCsiSnapshots' of config map 'k10-config'>                  Optional: false

      K10_LIMITER_PROVIDER_SNAPSHOTS:                 <set to the key 'K10LimiterProviderSnapshots' of config map 'k10-config'>             Optional: false

      AWS_ASSUME_ROLE_DURATION:                       <set to the key 'AWSAssumeRoleDuration' of config map 'k10-config'>                   Optional: false

      KANISTER_TOOLS:                                 <set to the key 'KanisterToolsImage' of config map 'k10-config'>                      Optional: false

      K10_RELEASE_NAME:                               k10

      KANISTER_FUNCTION_VERSION:                      <set to the key 'kanisterFunctionVersion' of config map 'k10-config'>  Optional: false

    Mounts:

      /mnt/k10-features from k10-features (rw)

      /mnt/k10state from catalog-persistent-storage (rw)

      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-646kx (ro)

  kanister-sidecar:

    Image:      gcr.io/kasten-images/kanister-tools:6.5.10

    Port:       <none>

    Host Port:  <none>

    Limits:

      cpu:     1200m

      memory:  800Mi

    Requests:

      cpu:        100m

      memory:     800Mi

    Environment:  <none>

    Mounts:

      /mnt/k10state from catalog-persistent-storage (rw)

      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-646kx (ro)

Conditions:

  Type           Status

  PodScheduled   False 

Volumes:

  k10-features:

    Type:      ConfigMap (a volume populated by a ConfigMap)

    Name:      k10-features

    Optional:  false

  catalog-persistent-storage:

    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)

    ClaimName:  catalog-pv-claim

    ReadOnly:   false

  kube-api-access-646kx:

    Type:                    Projected (a volume that contains injected data from multiple sources)

    TokenExpirationSeconds:  3607

    ConfigMapName:           kube-root-ca.crt

    ConfigMapOptional:       <nil>

    DownwardAPI:             true

QoS Class:                   Burstable

Node-Selectors:              <none>

Tolerations:                 node.kubernetes.io/not-ready:NoExecute for 300s

                             node.kubernetes.io/unreachable:NoExecute for 300s

Events:

  Type     Reason            Age                     From               Message

  ----     ------            ----                    ----               -------

  Warning  FailedScheduling  3m55s (x77 over 6h23m)  default-scheduler  0/3 nodes are available: 1 Too many pods, 3 Insufficient memory. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod.

 

And if I take a look at the PVC, it doesn't give me that much information either (command and output below).
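The command, for reference (the claim name comes from the pod's Volumes section above):

kubectl describe pvc -n kasten-io catalog-pv-claim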

 

Name:          catalog-pv-claim

Namespace:     kasten-io

StorageClass:  gp2

Status:        Pending

Volume:        

Labels:        app=k10

               app.kubernetes.io/instance=k10

               app.kubernetes.io/managed-by=Helm

               app.kubernetes.io/name=k10

               component=catalog

               helm.sh/chart=k10-6.5.10

               heritage=Helm

               release=k10

Annotations:   meta.helm.sh/release-name: k10

               meta.helm.sh/release-namespace: kasten-io

Finalizers:    [kubernetes.io/pvc-protection]

Capacity:      

Access Modes:  

VolumeMode:    Filesystem

Mounted By:    catalog-svc-5b6b7bbf4-8f2xw

Events:

  Type    Reason               Age                     From                         Message

  ----    ------               ----                    ----                         -------

  Normal  WaitForPodScheduled  46s (x1561 over 6h30m)  persistentvolume-controller  waiting for pod catalog-svc-5b6b7bbf4-8f2xw to be scheduled

And if I add the missing annotations:

   pv.kubernetes.io/bind-completed: "yes"

    pv.kubernetes.io/bound-by-controller: "yes"

    volume.beta.kubernetes.io/storage-provisioner: ebs.csi.aws.com

    volume.kubernetes.io/selected-node: NODE.eu-north-1.compute.internal

    volume.kubernetes.io/storage-provisioner: ebs.csi.aws.com

the PVC at least gets bound, but the Pod still reports the same scheduling issue even if I delete the Pod. A sketch of one way to apply the annotations is below.
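One way to apply those annotations to the pending PVC (a sketch; NODE stays a placeholder for the real node name, as above):

# Add the provisioner and node-selection annotations to the catalog PVC
kubectl annotate pvc catalog-pv-claim -n kasten-io \
  pv.kubernetes.io/bind-completed=yes \
  pv.kubernetes.io/bound-by-controller=yes \
  volume.beta.kubernetes.io/storage-provisioner=ebs.csi.aws.com \
  volume.kubernetes.io/storage-provisioner=ebs.csi.aws.com \
  volume.kubernetes.io/selected-node=NODE.eu-north-1.compute.internal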

 

My questions are:

  1. How can I easily debug this kind of issue?
  2. Has anyone seen this before?

Thanks

 


Best answer by Hagag 27 March 2024, 07:22


3 comments


Hello @issen007, if you check the warning message when describing the catalog pod, it indicates that your EKS cluster is having trouble scheduling the pod due to resource limitations on your nodes (the worker machines in your cluster).

None of your 3 available nodes is currently able to run the pod: all three lack sufficient memory, meaning the pod's resource requests exceed the memory still available on those nodes.


 

Warning  FailedScheduling  3m55s (x77 over 6h23m)  default-scheduler  0/3 nodes are available: 1 Too many pods, 3 Insufficient memory. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod.
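To confirm it, you can compare the pod's memory requests with what each node has allocatable; a rough sketch, assuming the catalog pod carries the same component=catalog label as its PVC:

# Allocatable capacity and already-requested resources on each node
kubectl describe nodes | grep -A 7 "Allocated resources"

# Memory requests of the catalog pod's containers
# (from the describe above: 780Mi for catalog-svc + 800Mi for the
# kanister-sidecar, roughly 1.5Gi for this single pod before counting
# the rest of K10)
kubectl get pod -n kasten-io -l component=catalog \
  -o jsonpath='{.items[*].spec.containers[*].resources.requests.memory}'

If those requests do not fit into any node's remaining allocatable memory, the scheduler reports exactly this FailedScheduling event.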


BR,
Ahmed Hagag

But that is really strange for a totally new cluster. Yes, it is a 3+2 node cluster with t3.small instances, but that should be OK for running only K10 and nothing else.

But I will give it a try and set up t3.large nodes to see if that works better.

Thanks
Christian 
 

@Hagag you were right; when I scaled up to 3+2 t3.large nodes it works better.

 

Thx
Christian
