
Hi! I’ve been kicking the tires of Kasten K10 in my home k8s environment and had it mostly working until I upgraded to 4.5.1 recently. I’m not sure whether the K10 upgrade broke things or something else I did around the same time (upgrading Kubernetes from 1.21.3 to 1.22.3?).

 

All policy runs fail now. Clicking through to the details in the dashboard gives me an error like this:

cause:
  cause:
    cause:
      cause:
        ErrStatus:
          code: 404
          details:
            causes:
              - message: 404 page not found
                reason: UnexpectedServerResponse
          message: the server could not find the requested resource
          metadata: {}
          reason: NotFound
          status: Failure
      fields:
        - name: Resource
          value:
            Group: extensions
            Resource: ingresses
            Version: v1beta1
      function: kasten.io/k10/kio/exec/phases/phase.k8sObjectTypeObjects
      linenumber: 302
      message: Failed to list resource
    function: kasten.io/k10/kio/exec/phases/backup.backupNamespaceToCatalog
    linenumber: 258
    message: Failed to snapshot objects in the K10 Namespace
  fields:
    - name: namespace
      value: kasten-io
  function: kasten.io/k10/kio/exec/phases/backup.(*BackupK10Phase).Run
  linenumber: 72
  message: Failed to backup namespace specs
message: Job failed to be executed
fields: []

 

I completely removed (helm uninstall, but also kubectl delete namespace), and re-installed back to “factory” settings (all my policies are removed and I wiped all snapshots made), but the problem persists, even with just the single k10-disaster-recovery-policy.

I don’t have any ingress resources.
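The error above fails while listing `ingresses` in group `extensions`, version `v1beta1`. Kubernetes 1.22 no longer serves Ingress under `extensions/v1beta1` (it is served under `networking.k8s.io/v1` instead), so a list call against the old version returns 404 whether or not any Ingress objects exist. A minimal sketch for confirming which group/versions a cluster serves follows; the function name `api_version_served` is hypothetical, and it assumes the output of `kubectl api-versions` is piped in on stdin.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the given group/version appears in the
# list read from stdin (one entry per line, as `kubectl api-versions` prints).
api_version_served() {
  # -F: fixed string, -x: match the whole line, -q: quiet (exit status only)
  grep -Fxq -- "$1"
}

# Example usage against a live cluster (requires kubectl access):
#   kubectl api-versions | api_version_served extensions/v1beta1 \
#     && echo "served" || echo "not served"
```

If `extensions/v1beta1` is not served, the 404 in the policy run is consistent with the Kubernetes upgrade rather than with a permissions problem.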

Looking through the logs from all of the Kasten pods, the only recurring error I could see came from the aggregatedapis-svc pod:

[aggregatedapis-svc-584b7f4799-xkrfc] E1031 20:33:34.328266       1 retrywatcher.go:130] Watch failed: the server could not find the requested resource
[aggregatedapis-svc-584b7f4799-xkrfc] E1031 20:33:34.328273       1 retrywatcher.go:130] Watch failed: the server could not find the requested resource
[aggregatedapis-svc-584b7f4799-xkrfc] E1031 20:33:35.333158       1 retrywatcher.go:130] Watch failed: the server could not find the requested resource

 

Kubernetes environment:

v1.22.3
6 nodes (3 master/control plane)

storage: rook-ceph

network: calico

metallb: load balancing
NFS set up as a location for Kasten.

 

All pods show as Running or Completed.

ceph status shows HEALTHY

All node syslogs messages look routine.

 

Any advice on how I can further dig into this?  Maybe it’s a service account permission thing?

 

k10tools primer output:

I1031 20:48:11.872606       7 request.go:655] Throttling request took 1.032036175s, request: GET:https://10.96.0.1:443/apis/rook.io/v1alpha2?timeout=32s
Kubernetes Version Check:
  Valid kubernetes version (v1.22.3)  -  OK

RBAC Check:
  Kubernetes RBAC is enabled  -  OK

Aggregated Layer Check:
  The Kubernetes Aggregated Layer is enabled  -  OK

         Found multiple snapshot API group versions, using preferred.
CSI Capabilities Check:
  Using CSI GroupVersion snapshot.storage.k8s.io/v1  -  OK

         Found multiple snapshot API group versions, using preferred.
Validating Provisioners:
rook-ceph.rbd.csi.ceph.com:
  Is a CSI Provisioner  -  OK
  Missing/Failed to Fetch CSIDriver Object
  Storage Classes:
    rook-ceph-block
      Valid Storage Class  -  OK
  Volume Snapshot Classes:
    csi-rbdplugin-snapclass
      Has k10.kasten.io/is-snapshot-class annotation set to true  -  OK
      Has deletionPolicy 'Delete'  -  OK
    k10-clone-csi-rbdplugin-snapclass

Validate Generic Volume Snapshot:
  Pod Created successfully  -  OK
  GVS Backup command executed successfully  -  OK
  Pod deleted successfully  -  OK

Looks like your cluster version is 1.22.3, I think Kasten only supports up to 1.21

I say that only because I upgraded to 1.22 and had issues with the dashboard and then when I went back to 1.21 everything was fine. 

Still that might not have anything to do with it especially if it was working for you on 4.13 but thought I would mention it.

https://docs.kasten.io/latest/operating/support.html?highlight=support

 

cheers


That page also has a bunch of debugging suggestions too which might be of help.

 

cheers


Looks like your cluster version is 1.22.3, I think Kasten only supports up to 1.21

 

How did I miss that!  lol, well, if that turns out to be the case, my first PR will be to update their k10primer tool to error out if version >= 1.22.0
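A rough sketch of the kind of version gate described above, assuming the 1.22.0 cutoff mentioned in this thread (not a verified support matrix). The function names `version_ge` and `check_supported` are hypothetical and are not part of k10tools; the comparison relies on `sort -V` from GNU coreutils.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if version $1 is >= version $2.
# A leading "v" (as in "v1.22.3") is stripped before comparing.
version_ge() {
  local have="${1#v}" want="${2#v}"
  # With version sort, the smaller of the two comes first; if that is
  # the wanted minimum, then have >= want.
  [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n1)" = "$want" ]
}

# Hypothetical check: reject 1.22.0 and newer (cutoff assumed from the thread).
check_supported() {
  if version_ge "$1" "1.22.0"; then
    echo "unsupported: $1"
    return 1
  fi
  echo "ok: $1"
}
```

For example, `check_supported v1.22.3` returns non-zero, while `check_supported v1.21.8` succeeds.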

I ran the debug script -- looks like it’s returning the stuff I already looked at (pod logs, pod details).  There was something I missed for the gateway service though -- it was complaining about not having CRDs for Ambassador Edge Stack in the beginning of the logs, but didn’t seem to stop it from eventually working (I think I access the dashboard gui through it?).  It is interesting though since I think Ambassador does ingress stuff, and it’s the ingress lookup that seems to be failing here.


Hi everyone,

 

Any info about when support for 1.22.3 (and OpenShift 4.9, which brings 1.22.3 to the table) will be released?
Thank you!

Mattia


I’m facing the same issue on a freshly installed single-node k3s cluster, v1.21.8. When you browse the web application you don’t see any errors and it looks like the application is running. But when you look at the log files you can see that the aggregated APIs are not working. In the nodes’ system logs I can also see that specific APIs are constantly queried but are missing. So it seems like some CRDs are missing in the helm package v4.5.6.

Pod aggregatedapis-svc logs:
 
E0102 09:06:49.837482 1 retrywatcher.go:130] Watch failed: the server could not find the requested resource
E0102 09:06:50.837374 1 retrywatcher.go:130] Watch failed: the server could not find the requested resource
E0102 09:06:50.837397 1 retrywatcher.go:130] Watch failed: the server could not find the requested resource
E0102 09:06:50.837445 1 retrywatcher.go:130] Watch failed: the server could not find the requested resource
E0102 09:06:50.837455 1 retrywatcher.go:130] Watch failed: the server could not find the requested resource
E0102 09:06:50.837447 1 retrywatcher.go:130] Watch failed: the server could not find the requested resource
E0102 09:06:50.837474 1 retrywatcher.go:130] Watch failed: the server could not find the requested resource
E0102 09:06:50.837474 1 retrywatcher.go:130] Watch failed: the server could not find the requested resource
 
System logs:
I0102 10:13:12.084062 1487 controller.go:129] OpenAPI AggregationController: action for item v1alpha1.apps.kio.kasten.io: Rate Limited Requeue.
E0102 10:13:12.084049 1487 controller.go:116] loading OpenAPI spec for "v1alpha1.apps.kio.kasten.io" failed with: OpenAPI spec does not exist
I0102 10:13:12.082927 1487 controller.go:129] OpenAPI AggregationController: action for item v1alpha1.actions.kio.kasten.io: Rate Limited Requeue.
E0102 10:13:12.082913 1487 controller.go:116] loading OpenAPI spec for "v1alpha1.actions.kio.kasten.io" failed with: OpenAPI spec does not exist
I0102 10:13:12.081747 1487 controller.go:129] OpenAPI AggregationController: action for item v1alpha1.vault.kio.kasten.io: Rate Limited Requeue.
E0102 10:13:12.081721 1487 controller.go:116] loading OpenAPI spec for "v1alpha1.vault.kio.kasten.io" failed with: OpenAPI spec does not exist
I0102 10:11:12.088633 1487 controller.go:129] OpenAPI AggregationController: action for item v1alpha1.actions.kio.kasten.io: Rate Limited Requeue.
E0102 10:11:12.088612 1487 controller.go:116] loading OpenAPI spec for "v1alpha1.actions.kio.kasten.io" failed with: OpenAPI spec does not exist
I0102 10:11:12.087444 1487 controller.go:129] OpenAPI AggregationController: action for item v1alpha1.vault.kio.kasten.io: Rate Limited Requeue.
E0102 10:11:12.087418 1487 controller.go:116] loading OpenAPI spec for "v1alpha1.vault.kio.kasten.io" failed with: OpenAPI spec does not exist
I0102 10:11:12.086301 1487 controller.go:129] OpenAPI AggregationController: action for item v1alpha1.apps.kio.kasten.io: Rate Limited Requeue.
E0102 10:11:12.086278 1487 controller.go:116] loading OpenAPI spec for "v1alpha1.apps.kio.kasten.io" failed with: OpenAPI spec does not exist
I0102 10:09:12.086698 1487 controller.go:129] OpenAPI AggregationController: action for item v1alpha1.actions.kio.kasten.io: Rate Limited Requeue.
E0102 10:09:12.086655 1487 controller.go:116] loading OpenAPI spec for "v1alpha1.actions.kio.kasten.io" failed with: OpenAPI spec does not exist
I0102 10:09:12.081189 1487 controller.go:129] OpenAPI AggregationController: action for item v1alpha1.vault.kio.kasten.io: Rate Limited Requeue.
E0102 10:09:12.081162 1487 controller.go:116] loading OpenAPI spec for "v1alpha1.vault.kio.kasten.io" failed with: OpenAPI spec does not exist
I0102 10:09:12.079924 1487 controller.go:129] OpenAPI AggregationController: action for item v1alpha1.apps.kio.kasten.io: Rate Limited Requeue.
E0102 10:09:12.079896 1487 controller.go:116] loading OpenAPI spec for "v1alpha1.apps.kio.kasten.io" failed with: OpenAPI spec does not exist
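The item names in the system log lines above (for example `v1alpha1.apps.kio.kasten.io`) follow the naming of Kubernetes APIService objects, so one way to narrow this down is to check whether those APIServices report as available. The following is a sketch only: the function name `unavailable_kasten_apis` is hypothetical, and it assumes the default table output of `kubectl get apiservices` (columns NAME, SERVICE, AVAILABLE, AGE) on stdin.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `kubectl get apiservices` table output on stdin
# and print the names of *.kio.kasten.io entries whose AVAILABLE column
# is anything other than "True".
unavailable_kasten_apis() {
  # NR > 1 skips the header row; $1 is NAME, $3 is AVAILABLE.
  awk 'NR > 1 && $1 ~ /kio\.kasten\.io$/ && $3 != "True" { print $1 }'
}

# Example usage against a live cluster (requires kubectl access):
#   kubectl get apiservices | unavailable_kasten_apis
```

An empty result would suggest the aggregated APIs are registered and reachable, which points away from missing API registrations as the cause of the log noise.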

We are facing the same issue with Kasten 4.5.7 and Kubernetes 1.21, which should be supported. Did you finally manage to solve this, @flo-mic?


@KelianSB No, I still have the issue, and 4.5.7 solved it only partly. I can perform backup and restore operations, but the log still shows thousands of these messages.
 

I have an open support case for this


Ok thanks for your answer. Please let us know if support finds anything to solve this issue, it’s very annoying.


@KelianSB I was able to fix this error. Maybe the below steps can help you as well:

  1. Uninstall the k10 helm package

    helm uninstall k10 -n kasten-io
  2. Add the following two lines to “/etc/sysctl.conf” on the nodes to raise the limits for open file descriptors and inotify instances (or create a custom config such as “/etc/sysctl.d/30-file-system.conf” with the content below)

    fs.file-max = 100000
    fs.inotify.max_user_instances = 512
  3. Set higher soft and hard open-file limits for all users in “/etc/security/limits.conf” on the nodes (add the lines at the end of the file)

    * soft nofile 65532
    * hard nofile 100000
  4. Reboot the node (complete reboot required due to the sysctl modifications)

  5. Install the helm chart without any custom values

    helm upgrade --install --atomic k10 kasten/k10 -n kasten-io
  6. Now update the chart with the custom settings needed, e.g.

    helm upgrade --install --atomic k10 kasten/k10 -n kasten-io -f values.yaml

Step 5 may not be required, and you may be able to go straight to step 6, but this is the sequence that got it working for me.
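Before and after changing the node settings, it can help to compare the current kernel values with the ones from steps 2 and 3. The sketch below assumes a Linux node with the usual `/proc/sys` layout; the function names `check_min` and `check_node_limits` are hypothetical, and the thresholds are simply the values quoted in the steps above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a current value meets a required minimum.
check_min() {
  local name="$1" current="$2" required="$3"
  if [ "$current" -ge "$required" ]; then
    echo "OK   $name=$current (>= $required)"
  else
    echo "LOW  $name=$current (< $required)"
    return 1
  fi
}

# Compare this node's kernel settings with the values from step 2.
# Returns non-zero if any value is below its threshold.
check_node_limits() {
  local rc=0
  check_min fs.file-max \
    "$(cat /proc/sys/fs/file-max)" 100000 || rc=1
  check_min fs.inotify.max_user_instances \
    "$(cat /proc/sys/fs/inotify/max_user_instances)" 512 || rc=1
  return "$rc"
}
```

Run `check_node_limits` on each node. The per-user limits from step 3 can be inspected separately with `ulimit -Sn` (soft) and `ulimit -Hn` (hard) in a fresh login session.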


Hello @flo-mic, thanks for your response. I didn't expect to have to change system settings on Kubernetes nodes for an error like this in k10... Is this the answer you got from support?


@KelianSB Support was not able to help. The K10 configuration was already correct, and the primer tool validated the setup successfully.
 

These OS-level changes are only needed if you have set up your own cluster on e.g. Debian 11. If you use AKS or GKE, these settings are not needed, as the nodes are already configured properly and the file limits are already increased.

