Solved

Policy run fails even though all actions succeed


Userlevel 3

All snapshots and actions in my policy run succeed, yet the policy run is marked Failed, and I cannot find an error message anywhere at the policy level. How can I troubleshoot this?

Best answer by FRubens 18 May 2022, 10:00

19 comments

Userlevel 3

@Hagag, deleting and recreating the policy did not fix this.

I cannot share the debug logs as there is information in them that would be a privacy breach. However, I am happy to share specifics from them or redact files if you can tell me what you’re looking for. 

`executor-svc*` appears to be the only file that is relevant. Clearly K10 is attempting to “retire a policy” and manifest "c374f458-cf86-11ec-a067-0ef546421892" has an unexpected entry.

Too bad it doesn’t say what the unexpected entry is.

How can I find and clean up this manifest and any related data manually?

 

{
  "File": "kasten.io/k10/kio/exec/phases/phase/retire_policy.go",
  "Function": "kasten.io/k10/kio/exec/phases/phase.(*retirePolicyPhase).checkRetirePolicyRun",
  "JobID": "c378bd89-cf86-11ec-ac01-6e30f4d8aeb4",
  "Line": 316,
  "ManifestID": "c374f458-cf86-11ec-a067-0ef546421892",
  "QueuedJobID": "c378bd89-cf86-11ec-ac01-6e30f4d8aeb4",
  "SubjectRef": "kasten-io:cloud-daily-backup",
  "cluster_name": "053cf100-3f34-47f3-a4b5-0d18f067dc50",
  "hostname": "executor-svc-5c97596977-5vf59",
  "level": "info",
  "manifestID": "e9cfdcbd-ce69-11ec-a067-0ef546421892",
  "msg": "Retiring policy run manifest",
  "time": "20220509-11:04:01.082Z",
  "version": "4.5.14"
}
{
  "File": "kasten.io/k10/kio/exec/internal/runner/runner.go",
  "Function": "kasten.io/k10/kio/exec/internal/runner.(*Runner).maybeExecJob",
  "JobID": "c378bd89-cf86-11ec-ac01-6e30f4d8aeb4",
  "Line": 177,
  "ManifestID": "c374f458-cf86-11ec-a067-0ef546421892",
  "QueuedJobID": "c378bd89-cf86-11ec-ac01-6e30f4d8aeb4",
  "SubjectRef": "kasten-io:cloud-daily-backup",
  "cluster_name": "053cf100-3f34-47f3-a4b5-0d18f067dc50",
  "error": {
    "message": "Unexpected manifest entry",
    "function": "kasten.io/k10/kio/exec/phases/phase.(*retirePolicyPhase).retireMultiActionPolicyEntries",
    "linenumber": 177,
    "file": "kasten.io/k10/kio/exec/phases/phase/retire_policy.go:177",
    "fields": [
      {
        "name": "entry",
        "value": {
          "type": "ArtifactReferenceGroup"
        }
      }
    ]
  },
  "hostname": "executor-svc-5c97596977-5vf59",
  "level": "error",
  "msg": "Job failed",
  "time": "20220509-11:04:01.786Z",
  "version": "4.5.14"
}
{
  "File": "kasten.io/k10/kio/daemon/daemon.go",
  "Function": "kasten.io/k10/kio/daemon.(*Daemon).run",
  "JobID": "c378bd89-cf86-11ec-ac01-6e30f4d8aeb4",
  "Line": 133,
  "QueuedJobID": "c378bd89-cf86-11ec-ac01-6e30f4d8aeb4",
  "cluster_name": "053cf100-3f34-47f3-a4b5-0d18f067dc50",
  "hostname": "executor-svc-5c97596977-5vf59",
  "level": "info",
  "msg": "Daemon Shutting Down",
  "time": "20220509-11:04:01.854Z",
  "version": "4.5.14"
}
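
(For anyone hitting the same problem: I found these entries by grepping the extracted debug bundle for the failing manifest ID; the directory name in the sketch below is just wherever you unpacked the bundle.)

```bash
# search every file in the extracted debug bundle for the failing manifest ID
grep -r "c374f458-cf86-11ec-a067-0ef546421892" ./k10-debug-logs/

# narrow to the executor logs and show context around the retirement failure
grep -r -A 5 "Unexpected manifest entry" ./k10-debug-logs/executor-svc*
```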

 

Userlevel 3

It appears these manifests are kept in the `model-store.db` of either the `catalog-pv-claim` or `jobs-pv-claim`. What format is the database?

Userlevel 3

I still don’t know what database format `model-store.db` is, so I can’t correct any records, but I managed to extract the manifest from the catalog with a hex dump (a rough sketch of the approach follows the manifest below). The error indicates that `ArtifactReferenceGroup` is not a valid entry in the `entries` collection. My backup cluster is still failing due to this. Please advise.

{
  "creationTime": "2022-05-09T10:57:05.341Z",
  "destructionTime": "2022-05-10T10:59:57.196Z",
  "id": "c374f458-cf86-11ec-a067-0ef546421892",
  "meta": {
    "manifest": {
      "action": "snapshot",
      "apiKeys": [
        "/actions.kio.kasten.io/runactions/run-m9dp9v4pwg"
      ],
      "apiMeta": {
        "annotations": null,
        "labels": [
          {
            "key": "k10.kasten.io/policyName",
            "value": "cloud-daily-backup"
          },
          {
            "key": "k10.kasten.io/policyNamespace",
            "value": "kasten-io"
          }
        ]
      },
      "endTime": "2022-05-09T11:04:44.306Z",
      "entries": [
        {
          "artifactReferenceGroup": [
            "c5957844-cf86-11ec-a067-0ef546421892",
            "c5a31150-cf86-11ec-a067-0ef546421892",
            "c5b0ee68-cf86-11ec-a067-0ef546421892",
            "c5b82f2f-cf86-11ec-a067-0ef546421892",
            "c5cb0bf5-cf86-11ec-a067-0ef546421892",
            "c5d53747-cf86-11ec-a067-0ef546421892",
            "c5dab78b-cf86-11ec-a067-0ef546421892",
            "c5dfdfc4-cf86-11ec-a067-0ef546421892",
            "c5e420e2-cf86-11ec-a067-0ef546421892",
            "c5e96d04-cf86-11ec-a067-0ef546421892",
            "c5ee36dd-cf86-11ec-a067-0ef546421892",
            "c5f512d0-cf86-11ec-a067-0ef546421892",
            "c5fac0de-cf86-11ec-a067-0ef546421892",
            "c600262f-cf86-11ec-a067-0ef546421892",
            "c60621bb-cf86-11ec-a067-0ef546421892",
            "c60b5118-cf86-11ec-a067-0ef546421892",
            "c6104b10-cf86-11ec-a067-0ef546421892",
            "c614738b-cf86-11ec-a067-0ef546421892",
            "c6277635-cf86-11ec-a067-0ef546421892",
            "c62e03f3-cf86-11ec-a067-0ef546421892",
            "c6460e3f-cf86-11ec-a067-0ef546421892",
            "c64abda5-cf86-11ec-a067-0ef546421892",
            "c65a41f9-cf86-11ec-a067-0ef546421892"
          ],
          "type": "ArtifactReferenceGroup"
        }
      ],
      "exceptions": null,
      "finalFailure": {
        "cause": {
          "fields": [
            {
              "name": "entry",
              "value": {
                "type": "ArtifactReferenceGroup"
              }
            }
          ],
          "file": "kasten.io/k10/kio/exec/phases/phase/retire_policy.go:177",
          "function": "kasten.io/k10/kio/exec/phases/phase.(*retirePolicyPhase).retireMultiActionPolicyEntries",
          "linenumber": 177,
          "message": "Unexpected manifest entry"
        },
        "fields": [],
        "message": "Job failed to be executed"
      },
      "jobID": "c378bd89-cf86-11ec-ac01-6e30f4d8aeb4",
      "originatingPolicies": [
        {
          "id": "8fec6a66-7136-4cfe-9819-d9ffccf90c41"
        }
      ],
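
For reference, this is roughly how I pulled that manifest text out of the catalog: copy the database off the catalog pod and dump the printable strings around the manifest ID. The pod name suffix and the /mnt/k10state mount path below are assumptions from my cluster, so verify them against your pod spec.

```bash
# find the catalog pod (the name suffix will differ per cluster)
kubectl get pods -n kasten-io | grep catalog-svc

# copy the catalog database out of the pod (kubectl cp needs tar in the
# container); the path is where catalog-pv-claim was mounted in my install
kubectl cp kasten-io/catalog-svc-<pod-suffix>:/mnt/k10state/kasten-io/catalog/model-store.db ./model-store.db

# dump printable strings and show context around the failing manifest ID
strings ./model-store.db | grep -A 40 "c374f458-cf86-11ec"
```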

 

Userlevel 3
Badge +1

Hello Aaron,

 

Could you please provide your debug logs so that we can look into this further? You can find them by going to Settings > Support > Download Logs.

 

If you could attach the executor-svc logs (all three), the logging-svc log file, and lastly the k10_debug file, we will have a good basis for further troubleshooting.
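
If the dashboard download gives you trouble, the same service logs can usually be pulled directly with kubectl; the commands below assume a standard install in the kasten-io namespace:

```bash
# one log file per executor pod (a default install runs three replicas)
for pod in $(kubectl get pods -n kasten-io -o name | grep executor-svc); do
  kubectl logs -n kasten-io "${pod#pod/}" > "${pod#pod/}.log"
done

# the logging service aggregates the K10 job logs
kubectl logs -n kasten-io deployment/logging-svc > logging-svc.log
```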

 

Thanks 

Emmanuel

Userlevel 3

These were submitted. Backups still failing.

Userlevel 4
Badge +2

> These were submitted. Backups still failing.

Hello @Aaron Oneal .

Kasten K10 4.5.15 is out; please check the release notes here: https://docs.kasten.io/latest/releasenotes.html

There is a fix in this release for cases where policies with selective export and an independent export retention schedule fail even though all the run actions succeeded; the failure was in the automatic retirement step.

I would recommend trying this latest version and, if possible, recreating your policy or creating a new one to test.
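
If you installed with Helm, upgrading in place is usually enough; the release name k10 and the kasten-io namespace below are the common defaults, so substitute your own:

```bash
# refresh the chart index, then upgrade the existing release to 4.5.15
# while keeping the values from the current install
helm repo update
helm upgrade k10 kasten/k10 --namespace kasten-io --version 4.5.15 --reuse-values
```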

Thanks

Fernando

Userlevel 3

@FRubens I am already using that version, and it still fails after running snapshots when it is not time for an export.

E.g. hourly snapshot, daily export.

Userlevel 3

Actually, it still fails when it is time for an export too. It looks like I was able to get a single manual run to work, but anything scheduled still fails.

Userlevel 3

Thanks. I am storing to S3; however, the error indicates an unexpected manifest entry, not an error communicating with the object store or a full bucket.

Userlevel 3

K10 support, @Hagag, @Satish: what manifest is K10 complaining about, and how can I fix it? My critical backups are currently failing.

Userlevel 5
Badge +2

@Aaron Oneal I think you might fix this by recreating the policy; please try this workaround and let me know.
It would also be valuable if you could share the debug logs so we can try to understand this error.

 

Thanks

Ahmed Hagag

Userlevel 4
Badge +2

Hello @Aaron Oneal ,

Thank you for the information.

It would be great to have your debug logs from the new version (4.5.15), or at least the same executor-svc error output you provided in this post but from 4.5.15. Since we shipped a fix in the same function, I would like to see whether anything changed in the error message between 4.5.14 and the latest version that would help us investigate.

Regards
Fernando

Userlevel 4
Badge +2

Hi @Aaron Oneal ,

Also, if possible, could you provide the policy YAML or a screenshot of the policy setup from the dashboard? I would like to check the retention/export retention settings you selected so I can try to replicate this on our side.
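
If it is easier than the dashboard, the policy can also be dumped with kubectl; the name and namespace below come from the SubjectRef kasten-io:cloud-daily-backup in your logs:

```bash
# K10 policies are custom resources; this prints the full spec, including
# the snapshot frequency, retention, and any export retention overrides
kubectl get policies.config.kio.kasten.io cloud-daily-backup -n kasten-io -o yaml
```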

Thank you

Regards

Fernando

Userlevel 3

Attached to the support case. What I noted there is that creating a new policy with a new name works; creating a policy with the old name does not. It appears there must be some stale state in the catalog related to the original policy.

Userlevel 7
Badge +20

Do you have a screenshot of the error message? That might help to get an answer.

Userlevel 3

There is no error message, just a failed policy run state.

Userlevel 7
Badge +20

Anything under “Show Details” or the logs?

Userlevel 3

I didn’t see anything on the details page, but I just discovered that if I view the YAML there is an error listed.

    cause: '{"fields":[{"name":"entry","value":{"type":"ArtifactReferenceGroup"}}],"file":"kasten.io/k10/kio/exec/phases/phase/retire_policy.go:177","function":"kasten.io/k10/kio/exec/phases/phase.(*retirePolicyPhase).retireMultiActionPolicyEntries","linenumber":177,"message":"Unexpected
manifest entry"}'
message: Job failed to be executed
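
For the record, the same YAML can be fetched without the dashboard. The run action name below is the one from the manifest's apiKeys field earlier in this thread; for a newer failed run, substitute the name shown by a plain kubectl get runactions.

```bash
# fetch the failed run action via the K10 actions API and inspect the
# failure cause embedded in its status
kubectl get runactions.actions.kio.kasten.io run-m9dp9v4pwg -o yaml
```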

 

Userlevel 7
Badge +20

Not sure if this will help, but I found this error a couple of times on the Kasten troubleshooting page. It references object storage, though, so I am not sure whether that applies to your setup.

K10 Troubleshooting Guide (kasten.io)
