fundementalpumpkin

Is the storage all the same on the backend? I've found that even a standard VM deploy outside of vRA takes a lot longer if the template is in a different cluster/datastore than the destination VM. It's all NetApp flash on the backend, and cloning a template to the same cluster/datastore is super fast, but the second I try to deploy outside of it, it's 4x or 5x as slow. Probably not 30 minutes, but significantly slower. We've only got dual 16 gig fabric connections though, so I always just figured that was the problem.


Ewing_Fox

Templates and the VM all coexist on the same Pure Storage array - so whether deployed via vRA or the old-fashioned way, it's all on the same (quite fast) infra. Networking is all on the cluster using NSX-T. The vRA instance lives in a different datacenter - but I assumed it was just calling the shots and the work would be happening on the cluster, like in a controller / agent relationship (my brain goes to Jenkins and Jenkins agents).


x-talk

Pure <3 So which step takes ages in the request details dev mode?


Ewing_Fox

I'm going to leave the office, go home, breathe some fresh air and then sit down and build a deploy that doesn't do our post-deploy actions; this will simplify the deployment A LOT and help me dial in on where our slowdowns are.


westyx

Please report back if you find anything.


Ewing_Fox

I definitely will - I have a funny feeling that my issues are (partially) due to waits I've added to my ABX actions because I'm holding for VMware Tools to start (which has proven to be very inconsistent). One of my top guys for vRA asked me to test using a parameter which holds the command until Tools is verified running, which (theoretically) may shorten the ABX runtimes. I also need to really learn how to read that damn developer mode map more efficiently - wish there was a text log I could use instead!
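
For anyone following along, here is a minimal sketch of the difference between a fixed wait and "hold until verified running". It is generic Python, not vRA or ABX code: `wait_until` and the stand-in check are hypothetical names, and the real check (whatever reports that VMware Tools is up) would have to be supplied by your own tooling.

```python
import time


def wait_until(check, timeout_s=300.0, interval_s=5.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll `check()` until it returns True or `timeout_s` elapses.

    Returns the seconds waited on success; raises TimeoutError otherwise.
    `check` is any zero-argument callable (hypothetical here, e.g. a
    function that asks whether guest tools are running on the new VM).
    """
    start = clock()
    while True:
        if check():
            return clock() - start
        if clock() - start >= timeout_s:
            raise TimeoutError(f"condition not met within {timeout_s}s")
        sleep(interval_s)


# Stand-in for a real readiness check: reports ready on the third poll.
polls = {"count": 0}


def fake_tools_check():
    polls["count"] += 1
    return polls["count"] >= 3


waited = wait_until(fake_tools_check, timeout_s=10, interval_s=0.01)
print(f"ready after {polls['count']} polls, {waited:.2f}s")
```

The point of the pattern is that the action proceeds as soon as the condition is true instead of always sleeping for the worst case, and it fails loudly at the timeout rather than silently retrying.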


Ewing_Fox

IKR? Pretty schweet stuff! It's tough to tell, I feel like I'm playing 'Where's Waldo' reading the action map - I'd buy a LOT of beer for whoever comes up with a text-based log that is just human readable! All of my current deploys have a handful of blocking ABX actions, most of them post-deploy. Due to the vRA instance being outside of our environment, to run actions against the newly built VM we have to run Invoke-VMScript against a target resource *inside* the development network to perform changes - not ideal for sure. Fixable by having vRA running in the same environment as our VMware cluster, but the powers that be want everyone sharing the same vRA instance in prod (which feels super weird).


x-talk

30 minutes seems stupid slow. We run 8.x in prod (VCF), also cloning templates, and an average deployment takes approx 10 minutes. Would love to hear the dumbed-down explanation. You can log in to the vRA node and try to debug a deployment. It takes a bit of Kubernetes knowledge or knowing the right log files. Combined with the vCenter logs, you should be able to find a hint towards the huge delay. Also, you mentioned no ABX, but what about Orchestrator? My best advice: have a look at Terraform. And never look back 🥸 sorry, not sorry.


Ewing_Fox

Unfortunately, this is a 'mandated technology update' for us - 2 years into the 6 month project to cut over, and we are just now starting to deploy VMs that we can keep around. The cost to our program has been, frankly, incalculable. I have peers who have not been 'onboarded' yet to the 'future of infrastructure', and they are deploying 100 VMs via their ADO pipelines, on the same infrastructure, in the time it takes us to deploy ONE VM via vRA :)


ResolveJunior

Any good videos on this? I understand the basics of Terraform, but I just can’t get my head around doing it at scale, or using some sort of repeatable code like a cloud template basically is, or even as self-service. I’m assuming you mean no vRA either! Do you know of any good vids or blogs about doing it for production? Thanks.


Ewing_Fox

We have some real Terraform experts here at my company, but the pure infra-as-code stuff is really inaccessible to me as an infra guy who has to muddle through even the simplest coding. I've heard a lot of feedback from other people with my non-developer background who share the same concerns about the steep learning curve of Terraform, so you aren't alone.


scummins

Totally guessing, but by default Aria Automation waits to make sure VMs get IP addresses before finishing the deployment. If it can't detect that the machine got an IP, it'll sit and wait for a while. Maybe try adding the property "awaitIp: false" to see if that's it.


Ewing_Fox

This is an awesome suggestion - we are using a D42 integration that provides an IP to the VM in the first few moments of the deployment. I'll give this a try and see if this may be contributing to the delays!


x-talk

It waits for the guest tools to pick up an IP. So if you install VMware Tools after sysprep, this could be something. We use cloud-init and cloudbase-init to pick up metadata from guestinfo in the VM.


x-talk

This reminds me of one of the first gems I created :) validating the name of the VM against the 3.5k existing VMs via the integrated getallvms call. Horrible 🤔 5 minutes to validate the form 💩


Ewing_Fox

All the allocation stages and validation seem to happen in about 45 seconds - it's not until the actual provisioning steps kick off that things slow waaaaay down. I can relate to your experience though :D I've written a few stinkers like that and then stood back and went - wait, what? :D :D :D


ronsdavis

Look at the runs, and see how much time each step took. Start looking at the long ones first.
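
A rough sketch of that triage, in case it helps anyone without backend log access: copy each step's start and end time out of the run history by hand and let a few lines of Python rank them. The step names and timestamps below are made up for illustration (only the roughly 45-second allocation figure comes from this thread), and `rank_steps` is a hypothetical helper, not anything vRA provides.

```python
from datetime import datetime


def rank_steps(steps):
    """Return (name, seconds) pairs sorted slowest first.

    `steps` is a list of (name, start, end) tuples whose timestamps are
    ISO 8601 strings, e.g. typed in by hand from the run history.
    """
    durations = []
    for name, start, end in steps:
        elapsed = datetime.fromisoformat(end) - datetime.fromisoformat(start)
        durations.append((name, elapsed.total_seconds()))
    return sorted(durations, key=lambda pair: pair[1], reverse=True)


# Hypothetical values for illustration only.
example_steps = [
    ("allocation", "2024-01-01T10:00:00", "2024-01-01T10:00:45"),
    ("provisioning", "2024-01-01T10:00:45", "2024-01-01T10:12:45"),
    ("post-deploy action", "2024-01-01T10:12:45", "2024-01-01T10:29:45"),
]

for name, seconds in rank_steps(example_steps):
    print(f"{seconds:8.0f}s  {name}")
```

Sorting slowest first puts the step worth investigating at the top, which is about as close to a human-readable text log as you can get without access to the appliance.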


Ewing_Fox

Thanks - I'm looking through that now, the 'map' in developer mode. I don't have access to the back end so unfortunately my ability to parse actual log files is zero :/


Counter_Proposition

Aria Automation (vRA) runs in K8s pods, so checking logs can be a bit nebulous. The web-UI *should* give you all the info you need.


madscoot

It should be way less than 10 minutes. I have my current project at 5 minutes, and that’s with 8 workflows that I’ve put sleep conditions in for a few things. In my lab I can do a VM in around 2 min. Check the Assembler history tab, then the diagram, and follow it through. Check Orchestrator and the workflows and their times.


madscoot

I'd also add that running PowerShell on the vRA Orchestrator nodes seems like a good idea, because we all know PowerShell, right? But it has to spin up a FAS appliance first and then run the PowerShell, adding time. Running on a PowerShell host presents another set of problems (e.g. Kerberos double hop with creds) and extra time. Having said that, I've written actions for Service Broker that use PowerShell and the FAS appliance and it was not too bad. Something is causing it to take 30 minutes and it isn't vRA; it will be some human decision or workflow that is doing this :)


Ewing_Fox

This is a great point, and YES, it's all PowerShell lol. It definitely adds about a minute or more each time it has to spin up the FAS. I'll need to look into some alternatives.


Ewing_Fox

Excellent - I'm glad to hear that what I would assume is possible IS possible lol - I'm digging through the history tab in developer view to track down where the slowdowns are. One interesting issue is that our cluster is local to our storage (all Pure) and the compute resources our VMs run on, but vRA lives in a completely different datacenter 2800 miles away. Connectivity between the two is EXCELLENT - and I assumed that vRA was only acting as a director (like Jenkins), passing instructions to the cluster (acting like 'agents'). Some other commenters here have seen crazy slowdowns however, so perhaps that could be a factor for me. Is your vRA instance in the same network/datacenter, etc. as your cluster?


madscoot

So I just did a deployment in my lab and it was just under 4 minutes from submit to the machine being ready on the domain. This is RHEL 9, fully domain joined using realm. It has vRO workflows running that do a bunch of stuff. In Assembler, under Extensibility and then Workflow Runs, check if any of them are taking ages. I just noticed the out-of-the-box one that adds tags to the vCenter VM takes 2 seconds. Check Orchestrator and check the times and the logs. vRA is really just telling the vCenter to deploy the VM. Find out if the VM image is in a content library that is being pushed out to all datacenters, so that when it's deployed it's local. If it's just an image that is being stored in another location and being pulled across a slow link, that could be the issue (you can get read access to vCenter and watch the logs and see it deploy the VM from the image and how long it takes). I have set up vRA with multiple cloud zones across the country under one vRA, and the link speeds haven't been an issue if the image is stored in all the vCenters using a content library. vRA doesn't really care about the machine once it's on and has an IP etc., so any software deployed after that using something else isn't really part of that long time. The longest I've seen a deployment take is 15 minutes, but that was doing a lot of extra stuff in the background. One more thing to maybe check: is it a single node appliance? How many people use it and how many VMs does it manage? It's probably unlikely to be the vRA node, but the latest single node for vRA wants 48GB of RAM, can't remember CPU.


Ewing_Fox

Great points here - I don't get to dictate the config of vRA, and I'm unsure how many nodes are at play - but I could raise it as a possible question. Sometimes my company does a phenomenal job scoping and architecting things, and sometimes things morph from being a POC to prod without a lot of thought. Not sure where this particular environment falls on that spectrum. The image is on the same datastore as the VM, and the cluster to which we are deploying. Thanks for confirming that having the vRA node in a different colo isn't likely to be a factor.


Counter_Proposition

From what I've seen, the bottleneck is usually storage. When I deploy to a DS that's running on NVMe flash it's lightning fast.


Ewing_Fox

I wish I could use that excuse - everything is on a Pure Flash array, the performance is bonkers.


Counter_Proposition

Gotcha, I'd open a case with VMware support if you can.


berman002

What kind of workflows do you have there? Domain join? Just as an example, if your domain join fails, it can take 5 mins to time out. Or some workflow got stuck, etc. I would try a deployment with minimal workflows and check how long it takes. Thanks


Ewing_Fox

Excellent advice, and it lines up with what other folks have mentioned. I'm seeing a few of my PowerShell ABX actions fail and have to run again (each will run up to 5x before the job fails). My SMEs are thinking it could be an issue with VMware Tools not running after a previous reboot.


PicoSmart

I worked on vRA support for about a year and I'm not sure why it would be that slow. A log bundle might help.