
Detail:
A month ago I posted a question referencing version 7.0.10 specifically in the topic. I had issues and needed to revert, so I went back to my working 7.0.6 version.

Today I went to try again. The Kasten Upgrade page says to apply 7.0.10 before 7.0.14. Of course, 7.0.10 was pulled and doesn't exist. 7.0.11 says it was released specifically to address the issues in 7.0.10, so I read this as a re-release.
Since 7.0.6 doesn't know about 7.0.10 or 7.0.14, this upgrade path information must come from an external system. It should be corrected to reflect that 7.0.10 no longer exists.

So I apply 7.0.11. Totally fine. Then move on to 7.0.14.

Issue:
The state-svc pod is failing to start with a CreateContainerConfigError state. Specifically, it's the admin-svc container; the event-svc and state-svc containers in the same pod are starting fine.

The admin-svc container is demanding a specific set of env variables, including:
    CONCURRENT_SNAP_CONVERSIONS 
        <set to the key 'concurrentSnapConversions' of config map 'k10-config'>
    CONCURRENT_WORKLOAD_SNAPSHOTS
        <set to the key 'concurrentWorkloadSnapshots' of config map 'k10-config'>

These keys do not exist in my k10-config ConfigMap. One would think this is the reason 7.0.10 (7.0.11) is an upgrade step on the way to 7.0.14, and that these keys would have been created in the ConfigMap during that update.

The entire container is failing with 
Error: couldn't find key concurrentSnapConversions in ConfigMap kasten-io/k10-config
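For context on the failure mode: CreateContainerConfigError is what you get when a container's env references a ConfigMap key marked Optional: false that doesn't exist. A minimal sketch of the wiring the admin-svc container spec presumably contains (structure inferred from the error message, not copied from the actual manifest):

```yaml
# Env entry wired to a ConfigMap key. With optional: false, a missing key
# makes container creation fail with CreateContainerConfigError.
env:
  - name: CONCURRENT_SNAP_CONVERSIONS
    valueFrom:
      configMapKeyRef:
        name: k10-config
        key: concurrentSnapConversions
        optional: false
```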


I do not see any parameters matching “conversion” under the Advanced Install - Complete Listing of Helm Options, so I do not know what the default value of these should be.


Ironically, the only reference to this key I can find in the v7 release notes is that a parameter named limiter.concurrentSnapConversions is deprecated. Could this be the same concurrentSnapConversions missing from the ConfigMap?

I see that the replacement for this limiter.concurrentSnapConversions is limiter.snapshotExportsPerAction, which has a default value of “3” in the Advanced Install section.
However, I cannot find anything referencing concurrentWorkloadSnapshots or what its default value should be.


TL;DR:
The Kasten upgrade path information needs to be corrected to remove 7.0.10, and
the 7.0.14 state-svc pod's admin-svc container is demanding environment variables that are apparently deprecated in this version and not populated by default or on upgrade. If these keys/vars are mandatory (i.e. Optional: false), they should be created and assigned their default values when they don't exist.

Update:




So I installed Kasten 7.0.6 again on my test cluster just to check out the CM, and the k10-config keys and values were indeed there:

  concurrentSnapConversions: "3"

  concurrentWorkloadSnapshots: "5"

Now I can't confirm or deny whether they were in my 7.0.6 config and were somehow removed during an upgrade, or whether they were never there to begin with because my original install was back in v5. So I just threw these two keys into my prod ConfigMap and deleted the state-svc pod.
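For anyone following along, this is roughly the fragment I added to the prod ConfigMap (values copied from the fresh 7.0.6 lab install; treat them as assumptions, not documented defaults):

```yaml
# kasten-io/k10-config fragment: legacy keys re-added by hand.
# "3" and "5" are simply what a clean 7.0.6 install contained.
data:
  concurrentSnapConversions: "3"
  concurrentWorkloadSnapshots: "5"
```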

Lo and behold, the container fails again on another missing mandatory config env var. This time it wants K10LimiterGenericVolumeSnapshots.

I do not have K10LimiterGenericVolumeSnapshots in my ConfigMap, but I DO have K10LimiterGenericVolumeBackupsPerCluster.

And I see in the 7.0.14 release notes that there is an option deprecation where

limiter.genericVolumeSnapshots

is replaced by

limiter.genericVolumeBackupsPerCluster

So this is another specifically deprecated key/var that 7.0.14 is requiring.
Again, this is just the admin-svc container of the state-svc pod. The event-svc and state-svc containers in the same pod list the correct K10LimiterGenericVolumeBackupsPerCluster key/var.

Let's compare:
state-svc container:
K10_LIMITER_SNAPSHOT_EXPORTS_PER_ACTION:         <set to the key 'K10LimiterSnapshotExportsPerAction' of config map 'k10-config'>        Optional: false

K10_LIMITER_WORKLOAD_SNAPSHOTS_PER_ACTION:       <set to the key 'K10LimiterWorkloadSnapshotsPerAction' of config map 'k10-config'>      Optional: false

K10_LIMITER_GENERIC_VOLUME_BACKUPS_PER_CLUSTER:  <set to the key 'K10LimiterGenericVolumeBackupsPerCluster' of config map 'k10-config'>  Optional: false

 

admin-svc container:
CONCURRENT_SNAP_CONVERSIONS:  <set to the key 'concurrentSnapConversions' of config map 'k10-config'>               Optional: false
CONCURRENT_WORKLOAD_SNAPSHOTS:  <set to the key 'concurrentWorkloadSnapshots' of config map 'k10-config'>             Optional: false
K10_LIMITER_GENERIC_VOLUME_SNAPSHOTS:  <set to the key 'K10LimiterGenericVolumeSnapshots' of config map 'k10-config'>        Optional: false


events-svc container:
K10_LIMITER_SNAPSHOT_EXPORTS_PER_ACTION:         <set to the key 'K10LimiterSnapshotExportsPerAction' of config map 'k10-config'>        Optional: false

K10_LIMITER_WORKLOAD_SNAPSHOTS_PER_ACTION:       <set to the key 'K10LimiterWorkloadSnapshotsPerAction' of config map 'k10-config'>      Optional: false

K10_LIMITER_GENERIC_VOLUME_BACKUPS_PER_CLUSTER:  <set to the key 'K10LimiterGenericVolumeBackupsPerCluster' of config map 'k10-config'>  Optional: false

Only two of the three containers' requirements match, and the odd one out is causing the pod to fail. The failing keys were all present in a fresh 7.0.6 installation's ConfigMap, but not in my 7.0.14 ConfigMap. It's highly improbable that I could have missed all of these keys and still have been running without issue. So it seems the deprecated keys were removed from the ConfigMap entirely (read: already removed, not "will be removed in an upcoming release").

TL;DR:
I think the deprecated keys/vars were incorrectly removed from the k10-config ConfigMap in 7.0.14, and the admin-svc container in the state-svc pod was not updated to use the new keys/vars while the other two containers in that pod were, causing the state-svc pod to fail to deploy.


Update:

So I grabbed all the old keys that the admin-svc container was specifically requesting, put them into the k10-config CM, and restarted the pod. The admin-svc container started successfully this time.
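For completeness, the full set of legacy entries ended up looking roughly like this (a sketch; the last value is one I chose by mirroring the replacement key K10LimiterGenericVolumeBackupsPerCluster already in my ConfigMap, which is an assumption on my part, not a documented default):

```yaml
# kasten-io/k10-config fragment: all legacy keys the old admin-svc demanded.
data:
  concurrentSnapConversions: "3"
  concurrentWorkloadSnapshots: "5"
  K10LimiterGenericVolumeSnapshots: "<same value as K10LimiterGenericVolumeBackupsPerCluster>"
```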

Now the event-svc container is failing…


Update: 

So the final answer is: somehow the admin-svc container in my state-svc deployment did not get removed, as it should have, during my original upgrade path (7.0.6 > 7.0.11). Removing the whole admin-svc container from the state-svc deployment was the fix.
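If anyone needs to do the same, this is roughly how the stale container can be removed (a sketch using the names from this post; check the container's position first, since the JSON-patch path is positional and removing the wrong index would break the deployment):

```shell
# See where admin-svc sits in the container list (0-based index)
kubectl -n kasten-io get deploy state-svc \
  -o jsonpath='{.spec.template.spec.containers[*].name}'

# Assuming admin-svc is the first container (index 0), remove it
kubectl -n kasten-io patch deploy state-svc --type=json \
  -p='[{"op":"remove","path":"/spec/template/spec/containers/0"}]'
```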

I ran through each version change since 7.0.6 in my lab. That answered the question of why the admin-svc container in the state-svc pod was still at 7.0.6 (not 7.0.7) while the event-svc and state-svc containers were updated to 7.0.x at each step.

Apparently, even though there is an admin-svc image for 7.0.7, the admin-svc container was removed from the state-svc deployment in 7.0.7. This is why mine was still looking for the old values.

Further, the event-svc container changed to listen on port 8001 in 7.0.12, instead of port 8002 as in 7.0.6-7.0.11. This is why fixing the CM so the admin-svc (which listens on 8001) would run correctly caused the event-svc to fail with the port already bound.

After the incremental upgrades, I re-ran my original direct 7.0.6 > 7.0.11 upgrade in my lab, and this time it did successfully remove the admin-svc container from the deployment, and I was able to upgrade 7.0.11 > 7.0.14 without issue after that. Not sure how my prod upgrade went wrong, but now we know what happened, and it's all fixed and running correctly.

(ref: I have removed the old keys I manually added to the k10-config ConfigMap and manually removed the admin-svc container from my state-svc deployment. Now it's all running as it should be, matching a clean deployment of 7.0.14.)

