Leveraging IaC to Streamline Data Protection

  • 19 February 2024
  • 8 comments
  • 195 views

Userlevel 4
Badge

Hello Veeam Community!

As you might know if you’ve seen other topics and articles from me, I am a big proponent of using Infrastructure as Code (IaC) tools to manage resource and application configuration. Several exciting developments in this space have occurred in the past few months, and I thought it would be good to start off this week by asking you:

How are you using Infrastructure as Code tools and principles to manage data protection operations in your organization?

I’m curious to hear what problems you’re solving today using IaC, as well as your own roadmaps, wishlists, and barriers to adopting IaC.

Sound off in the comments, and have a great week!


8 comments

Userlevel 3
Badge

I’m interested to hear some community opinions. Calling out @Chris.Childerhose and @Rick Vanover: what types of automation use cases are you seeing?

Userlevel 7
Badge +10

I too am curious. This isn’t an area I have done much work in, but I can say: if @ericeberg and @Cloud_dizzle are thinking about it… NOW is the time to get your point of view in!

Userlevel 7
Badge +20

I have not done much work in this area either, but I am always looking for ways to make upgrades easier for Veeam products: VBR, VCC, and VB365.

Userlevel 6
Badge +10

I’m in the middle of forming a Product Engineering team at my current company to investigate and work on various projects to see how we can streamline internal systems and/or offer additional services to customers using new technologies and processes like this. I think currently we’re doing almost the exact opposite of what you’re asking about. Due to various acquisitions and consolidations in the last 4 years, we are working to get everyone globally working on the same systems and processes. Some of this includes consolidating project management and development systems; across different departments and products, we have probably close to a dozen of each: multiple Jira instances, GitLab and GitHub instances, monday.com, etc. We’re finding ourselves implementing backup solutions for these SaaS offerings not just to protect against deletions, but to migrate historical data into the new, company-wide instances to at least get everyone in the same workspaces.


At my last VCSP, I wrote and/or modified some scripts on VeeamHub to fully automate Veeam updates and Windows upgrades. We had about 500 systems, a majority of which were at customer sites, all of which were managed by RMM software. In 2019, just before the end of life of Windows Server 2008 R2, I used PowerShell scripts to remotely perform unattended upgrades of 75 or so 2008 R2 systems to 2012 R2 and to re-enable Windows Firewall where customers had disabled it. Once I had the script and answer file fully built and working, I managed to do all systems in 1 or 2 weeks, just kicking off about 10 at a time and QCing them as I received SMTP alerts that the upgrades had finished (or failed). When I left, we were working to do the same to bring all 2012 R2 and 2016 systems to 2019.
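
The batch kickoff looked roughly like the sketch below. This is a minimal reconstruction, assuming WinRM access and pre-staged upgrade media; the host list, paths, and setup switches are illustrative rather than the original script.

```powershell
# Kick off unattended in-place upgrades in small batches. Assumes WinRM is
# enabled and the upgrade media plus unattend.xml are already staged on each
# target. Host list, paths, and setup switches are illustrative.
$targets = Get-Content -Path 'C:\Upgrades\batch01.txt'   # ~10 hosts per batch

foreach ($server in $targets) {
    Invoke-Command -ComputerName $server -AsJob -ScriptBlock {
        # Launch Windows Setup unattended; exact switches vary by OS version
        Start-Process -FilePath 'C:\Staging\setup.exe' `
            -ArgumentList '/unattend:C:\Staging\unattend.xml'
    }
}
```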


For every major Veeam version and patch from about 10a to 11a, we did the same unattended upgrades of customer VBRs, with some additional goodies added to the code, like making sure no Instant Recovery (IR) sessions were running at the time.
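
A minimal sketch of that pre-flight check, assuming the Veeam PowerShell module naming from v11+ (older versions used the VeeamPSSnapin snap-in, so verify cmdlet availability against your release):

```powershell
# Pre-upgrade guard: abort if any Instant Recovery session is still mounted.
Import-Module Veeam.Backup.PowerShell -ErrorAction Stop

$irSessions = Get-VBRInstantRecovery
if ($irSessions) {
    Write-Warning 'Active Instant Recovery sessions found; skipping upgrade.'
    exit 1
}
```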


We then adapted some of the logic from that script to build pre- and post-reboot scripts for all tenant and provider VBR servers. At this particular VCSP, all VCC VBRs were all-in-one boxes hosting the VCC gateway, the VBR, the database, and the repository. We learned in previous years that as long as you didn’t reboot the VCC VBR while a merge or health check was running, the backup chain wouldn’t corrupt and jobs would pick back up just fine. With ReFS, we found out that in certain situations, if you rebooted the server while Veeam was running a merge on a ReFS volume, ReFS itself would think the VBK was corrupted and delete it, with no easy way to restore it. Once we had this all working on tenant VBRs and VCCB VBRs, we scheduled weekly reboots so pre-staged updates put in place by our patch management software would install.
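
A pre-reboot guard along those lines might look like the sketch below; the session and state filtering is an assumption, so adapt it to your VBR version.

```powershell
# Defer the reboot while any Veeam job session is still active, since
# interrupting a merge on a ReFS volume risked losing the backup file.
Import-Module Veeam.Backup.PowerShell

$active = Get-VBRBackupSession | Where-Object { $_.State -ne 'Stopped' }
if ($active) {
    Write-Warning 'Veeam sessions still running; deferring reboot.'
    exit 1
}

Restart-Computer -Force
```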


After that, we turned our attention to the VCCR hosts. Since we had VSPC in use, we generated an API key and wrote logic to stop and disable each tenant’s replication jobs and re-enable them at the end of the reboot window, to make sure replicas were not being written to while the VCC server rebooted. We also compiled all PowerShell scripts into .exe files so that any credentials in them were not sitting in plaintext on the servers; at the end of each run, the .exe was deleted from the server and not downloaded again until the next run.
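
A hypothetical sketch of that VSPC interaction is below. The hostname, port, endpoint paths, and field names are illustrative rather than the exact VSPC REST contract; check the VSPC REST API reference for your version.

```powershell
# Disable tenant replication jobs via the VSPC REST API before a
# maintenance window. Endpoints and field names are illustrative.
$vspc    = 'https://vspc.example.com:1280/api/v3'
$headers = @{ Authorization = "Bearer $env:VSPC_API_KEY" }

$jobs = Invoke-RestMethod -Uri "$vspc/infrastructure/backupServers/jobs" -Headers $headers

foreach ($job in ($jobs.data | Where-Object { $_.type -eq 'ReplicationJob' })) {
    Invoke-RestMethod -Method Post -Headers $headers `
        -Uri "$vspc/infrastructure/backupServers/jobs/$($job.instanceUid)/stop"
}
```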


Any time something failed, such as a cancellation due to a running merge or IR, or jobs failing to re-enable, we’d write to a custom area of the Windows Event Log with an ID and message for each error. Our SIEM, which already did Event Log monitoring, then emailed our support team a daily report showing which machines had triggered Event Log errors in this area, so the machines could be manually remediated to maintain patch compliance.
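
Registering and writing to a custom event log in PowerShell looks roughly like this; the log name, source, event ID, and message are hypothetical.

```powershell
# Log automation failures to a custom Windows Event Log area so the SIEM
# can alert on them. Log name, source, ID, and message are hypothetical.
$log    = 'VeeamAutomation'
$source = 'RebootOrchestrator'

# One-time registration of the custom log and source (requires elevation)
if (-not [System.Diagnostics.EventLog]::SourceExists($source)) {
    New-EventLog -LogName $log -Source $source
}

Write-EventLog -LogName $log -Source $source -EntryType Error `
    -EventId 4201 -Message 'Reboot canceled: merge session still running.'
```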


Finally, I used MDT, slipstreaming, and WSUS servers with various deployment templates for both internal and external systems that were commonly deployed, including customer VBRs. The base image installed drivers and our standard RMM and security software. More specialized templates would install VBR and, on VCC hosts, pre-configure the VCC gateways, install the license files, and create the ReFS volume and folder structures. For physical servers, you’d also be prompted by a custom screen asking for the customer ID and the server name; a script would then rename the machine and add it (including the service tag and customer assignment) to the SQL database that did all of our asset tracking. This got our average server build time down from about 2 business days to 2 hours, and it gave us the ability to easily run multiple installations at the same time.
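
That rename-and-track step might be sketched like this; the prompts, database, and table names are illustrative, and Invoke-Sqlcmd assumes the SqlServer module is installed.

```powershell
# Hypothetical post-deployment step: rename the machine and record it in
# the asset-tracking database. Names and connection details are illustrative.
$customerId = Read-Host 'Customer ID'
$serverName = Read-Host 'Server name'
$serviceTag = (Get-CimInstance -ClassName Win32_BIOS).SerialNumber

Invoke-Sqlcmd -ServerInstance 'assets-sql01' -Database 'AssetDB' -Query @"
INSERT INTO dbo.Servers (Name, ServiceTag, CustomerId)
VALUES ('$serverName', '$serviceTag', '$customerId');
"@

Rename-Computer -NewName $serverName -Restart
```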

Userlevel 7
Badge +10

@leduardoserrano do you have any PoV here for our Product Management team?

Userlevel 7
Badge +6

@leduardoserrano do you have any PoV here for our Product Management team?


Hi @Rick Vanover, I apologize for the delay in getting back to you!

I created a small illustration of integration possibilities using Terraform, Ansible Automation Platform, and Event-Driven Ansible. I also divided the scope into four types of interaction with the production environment and the Veeam Data Platform:

  • Creation
  • Configuration
  • Remediation
  • Notification

I also mapped out some of the central teams that use automation in organizations, from the teams involved in DevSecOps processes (developers, testers, Yellow Team, Ops Team), through the security teams (Blue, Red, and Purple Teams), to the operational and SRE teams.

[Illustration: integration possibilities using Terraform, Ansible Automation Platform, and Event-Driven Ansible, mapped across Creation, Configuration, Remediation, and Notification]

I believe that the availability of the Veeam Ansible Collections is a big step toward facilitating the implementation of an automation strategy for data protection using Veeam.

Congratulations to everyone responsible for providing and maintaining the Veeam Ansible Collections:

https://galaxy.ansible.com/ui/repo/published/veeamhub/veeam/

Terraform is another widely used platform in the market, and it is becoming the preferred tool for developers to deploy infrastructure, especially in public clouds.

Investing in this integration is worthwhile to ensure that data protection automation is modeled as Infrastructure as Code (IaC) and integrated into a CI/CD pipeline.

Although Ansible is also capable of implementing/creating infrastructure, it is most compelling and flexible when implementing, changing, and upgrading configurations.

In my opinion, Terraform is gaining more ground when we talk about deploying and updating infrastructure, not necessarily the application code. This makes a lot of sense in Kubernetes environments, for example, where an update destroys containers and entirely replaces them using canary and blue/green deployment strategies.

Here is sample code for a Veeam + Terraform integration:

https://github.com/VeeamHub/veeam-terraform

Last year, Ansible started to incorporate a significant piece of functionality called Event-Driven Ansible. It allows Ansible to respond to real-time events, enabling quick remediation actions across environments.

This capability, combined with the wide availability of automation collections for different types of systems and the capacity for agentless automation via REST APIs, makes it a valuable tool for operations teams, SREs, and Blue Teams (security).

As a suggestion, I believe the Veeam Ansible Collections can evolve in this direction, allowing not only creation/deployment but also remediation actions on the Veeam Data Platform.

Veeam recently released an integration with Sophos XDR that sends event logs from VBR to XDR. If I'm not mistaken, the reverse direction, in which Sophos XDR sends a request for an action in VBR, such as checking the integrity of a VM backup or even automatically recovering it to the production environment, is not yet available (coming soon).

Perhaps Ansible, as an open-source project, could be the tool that enables this type of integration in a more agnostic way, detecting and responding to security incidents across a broad spectrum of tools.

As a reference, I wrote a short article on my blog showing the integration between Kafka and Event-Driven Ansible:

https://cloudnroll.com/2023/02/11/event-driven-automation-for-ansible-and-a-kafka-integration-example/

For more information:

Red Hat Ansible Automation Platform - Event-Driven Ansible


Event-Driven Ansible supports several types of event sources; the most used source type is webhooks, but you can also build your own event source plugin in Python. A plugin is a single Python file; before writing one, take a look at some best practices and patterns:

https://ansible.readthedocs.io/projects/rulebook/en/latest/sources.html

Due to the automation capabilities already implemented in Veeam ONE, such as the configurable actions for alarm remediation in the backup and virtualized environment, I see Veeam ONE as a great candidate to fulfill the role of an “automation hub” for the entire Veeam Data Platform.

Another point that caught my attention was the potential for Veeam ONE to become the platform’s “Security Hub”, given the recent availability of the Threat Center.

I hope I have contributed in some way, and thank you very much for asking for my point of view!

👍🏻👏🏻🙏🏻

Userlevel 4
Badge

@JonahMay & @leduardoserrano, thank you for the thorough replies.

@JonahMay - Can you provide some details about your objectives for your new team and the tools and services you’re using or planning to adopt? Even if it’s still conceptual at the moment, since you’re in the process of building the team, I’m interested to hear how you plan to streamline some of those internal processes that formerly required custom scripting. Thanks again!

Userlevel 6
Badge +10

@JonahMay & @leduardoserrano, thank you for the thorough replies.

@JonahMay - Can you provide some details about your objectives for your new team and the tools and services you’re using or planning to adopt? Even if it’s still conceptual at the moment, since you’re in the process of building the team, I’m interested to hear how you plan to streamline some of those internal processes that formerly required custom scripting. Thanks again!

We have gotten a lot of interview candidates with Ansible, Terraform, and CI/CD experience. At least half of the interviewees seem to come from a DevOps background, which I think bodes well for what we want the team to be doing. We have a few rough goals defined for the next few years:

  • Streamline code/software rollout from dev to product engineering to production. Some of this is in-house developed projects; the rest is third-party software like Veeam and vSphere
  • Develop automated performance testing so we can run baselines against software versions. For example, if we see average replication runtimes increase from Veeam patch/release A to B, we want to be alerted automatically so we can investigate why before we roll the update to production (see the sketch after this list)
  • Create VM/public cloud machine templates and deployments for easy rollout of test servers to run pre-release beta/RC software on in the “beta” part of the PE lab. We’ll also have a “release” area, but that will be more static and mainly for testing GA updates before rolling them to production, where we can usually perform in-place upgrades, unlike with a beta release
  • See where we can implement automations and better process workflows to reduce downtime. For example, how can we use some of our product offerings to manage maintenance on components not capable of HA? Purely hypothetically (no idea if this would actually work), we could try to use CDP failovers to keep VCC servers online while we patch the main VBR. Then, once the main VBR is patched, we power off the replica, reconnect the source to networking, and quickly patch the components on other servers (gateways, proxies, repos, etc.). Could we do something similar for installing Windows updates?
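
For the runtime-regression alerting in the second bullet, a hedged sketch might look like this, assuming job durations are exported to CSV; the threshold, file names, and mail settings are all illustrative.

```powershell
# Compare average replication runtimes between a baseline and a candidate
# release and alert on regression. Data source, threshold, and mail
# settings are illustrative; durations could come from Get-VBRBackupSession
# or a Veeam ONE / VSPC report instead of CSV exports.
$baselineAvg = (Import-Csv 'baseline_v12.csv'    | Measure-Object -Property DurationSec -Average).Average
$currentAvg  = (Import-Csv 'candidate_v12_1.csv' | Measure-Object -Property DurationSec -Average).Average

# Alert if the candidate build is more than 15% slower than baseline
if ($currentAvg -gt ($baselineAvg * 1.15)) {
    Send-MailMessage -To 'pe-team@example.com' -From 'lab@example.com' `
        -SmtpServer 'smtp.example.com' `
        -Subject 'Replication runtime regression detected' `
        -Body "Average runtime rose from $baselineAvg to $currentAvg seconds."
}
```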
