# 350: It looks like you’re trying to send an email from 250,000 miles away! Would you like help with that?

Duration: 62 minutes
Speakers: Matt, Jonathan, Justin
Date: 2026-04-16

## Chapters

1. [00:00] Episode 350 recorded for April 7, 2026 on GCP podcast
   Episode 350 recorded for April 7, 2026. We talk weekly about all things AWS, GCP and Azure. Happy birthday. It's a first for the podcast.
2. [00:51] NASA astronauts encountered common Outlook configuration issue on their first day in space
   NASA astronauts dealt with an Outlook hiccup in deep space. The incident raises questions about offline readiness for software deployed in connectivity-constrained environments, and offline-mode reliability remains an important consideration for software selection.
3. [03:36] Iran has declared AWS, Google and Microsoft data centers as military targets
   Iran has declared AWS, Google and Microsoft data centers as military targets, which means these data centers now operate under a new paradigm. This will definitely make insurance and operating in the cloud more expensive for companies.
4. [07:53] OpenAI introduces pay-as-you-go pricing for Codex-only seats within ChatGPT
   OpenAI is introducing pay-as-you-go pricing for Codex-only seats within ChatGPT Business and Enterprise workspaces, billing on token consumption with no rate limits instead of a fixed per-seat fee. If you want the best performance, you're going to have to pay for what you use.
5. [09:50] Are these vendors going to start to prioritize customers?
   Are these vendors going to start to prioritize customers? The higher-paid plans get access to newer models sooner. Will your included usage get lower priority than the usage you're paying for per API call? Will that cost more?
6. [13:34] Anthropic is requiring Claude Code subscribers to pay separately for third-party usage
   Starting April 4, Anthropic is requiring Claude Code subscribers to pay separately for usage through third-party tools like OpenClaw.
This affects all third-party harnesses, with more platforms to follow. This signals a broader industry pattern where AI providers may separate subscription pricing from API-level or harness-level consumption.
7. [16:04] Anthropic is expanding their partnership with Google and Broadcom for next generation compute
   Anthropic is expanding their partnership with Google and Broadcom for multiple gigawatts of next-generation compute. This builds on the existing October 2025 TPU expansion and deepens Anthropic's reliance on Google Cloud. Anthropic's run-rate revenue has grown from roughly $9 billion at the end of 2025 to over $30 billion.
8. [17:42] Anthropic announced a new cybersecurity model called Claude Mythos Purview
   A coalition including AWS, Google, Microsoft, Apple, Cisco, Nvidia and others built around a new unreleased model called Claude Mythos Purview. The model is focused specifically on finding and fixing software vulnerabilities in critical infrastructure. Is it as scary as they make it out to be? Maybe, maybe not.
9. [21:51] Managed daemons let engineers deploy monitoring, logging and tracing agents independently
   Managed daemons are now available, which let platform engineers deploy and update monitoring, logging and tracing agents independently from application teams. Amazon SES is adding support for optional StartTLS configurations. Pricing for SES Mail Manager follows existing SES usage-based pricing.
10. [25:35] Bedrock guardrails now support cross-account safeguards and general availability
    Bedrock guardrails now support cross-account safeguards and are generally available. Amazon Q now supports natural language queries powered by Amazon Q Developer. Cost Explorer now supports additional data sets beyond raw cost and usage data.
11. [33:20] Amazon announced a new EFS proxy for Amazon S3 files this morning
    A new EFS proxy for Amazon S3 files was announced this morning.
It allows you to take hot files that the operating system will actually need to use and access them as EFS. There are lots of use cases for this, particularly in AI/ML. It's a bit pricey if you're not careful.
12. [35:10] Heather: Finally, finally no S3Fs. Finally no crappy user-mode interface to S3
    NetApp is moving away from S3Fs to cast OS and WordPress for AI workloads. The company's stock price is down from $104 to $96 in the past 24 hours. There are advantages to moving to a static site.
13. [39:09] GKE Inference Gateway now supports both real-time and async inference workloads
    GCP's GKE Inference Gateway now supports both real-time and async inference workloads on the same shared GPU/TPU accelerator pool. This addresses a common infrastructure inefficiency where real-time clusters sit idle during off-peak hours while async jobs run on underutilized secondary hardware.
14. [40:36] MCP connects agents to live Gemini API documentation via the MCP protocol
    Google is releasing two new tools to address a core limitation of coding agents: outdated API knowledge. MCP connects agents to live Gemini API documentation via the MCP protocol, and Gemini API Developer Skills adds best-practice patterns and SDK guidance. Using both tools together shows measurable improvements in evals.
15. [43:35] Azure Network Watcher now offers a public preview feature called Rule Impact Analysis
    Azure Network Watcher now offers a public preview feature called Rule Impact Analysis. It lets a network admin simulate the effect of security admin rules before applying them to their environment. It just shows how rudimentary a lot of the Azure firewall stuff is.
16. [50:33] DigitalOcean is launching a native cloud security posture management tool
    DigitalOcean is launching a native cloud security posture management tool, built directly into DigitalOcean's dashboard and API.
A centralized CSPM is definitely the way to go if you are managing a more enterprise environment.
17. [52:28] This week in Cloud is our 350th episode. Yeah, 350 episodes down
    We've reached the end of another fantastic week here on the cloud. 350 episodes down. Go leave a comment, message us on Twitter or Bluesky or Mastodon, or join our Slack team. Thanks for listening and we'll catch you on the next episode.
18. [53:41] A senior Microsoft engineer describes problems with Azure's infrastructure
    A senior Microsoft engineer who rejoined Azure core in May 2023 describes a node management stack suffering millions of unattributed crashes per month, memory leaks, resource leaks and zombie VMs, allegedly tied to poor infrastructure management. How often do you really see how cloud providers operate?
19. [62:20] All right, gentlemen, we'll see you next week. Happy birthday
    All right, gentlemen, we'll see you next week. I'm going to have a birthday dinner. Happy birthday. Thank you. Bye.

## Transcript

[00:08] Jonathan: The forecast is always Claude. We talk weekly about all things AWS, GCP and Azure.
[00:14] Justin: We are your hosts, Justin, Jonathan, Ryan and Matthew.
[00:18] Matt: Episode 350 recorded for April 7, 2026. It looks like you're trying to send an email from 250,000 miles away. Would you like help with that? Good evening Matt and Jonathan. How are you guys doing?
[00:30] Jonathan: Good, thanks Justin.
[00:32] Justin: I'm good.
[00:33] Matt: Well, thank you for joining me on my birthday. I appreciate it.
[00:36] Jonathan: Happy birthday.
[00:37] Matt: I think we haven't recorded on my birthday before. It's a first for the podcast. Yeah, but my wife isn't here, my kids are here driving me crazy. So yeah, it's a great birthday, but we get to record a podcast, so that's all that really matters. Well, we have some follow up news here.
First up, suffice it to say that even if you leave this planet, you will still deal with Outlook and Microsoft, as apparently the Artemis 2 astronauts dealt with an Outlook hiccup in deep space. The Artemis 2 astronauts aboard NASA's Orion spacecraft encountered a common Outlook configuration issue on their first day in space, requiring remote IT support from mission control to resolve it by reloading the commander's files. NASA used commercial off-the-shelf software like Microsoft Outlook for crew scheduling and personal communications, while keeping primary flight systems on separate radiation-hardened hardware, illustrating a practical separation of concerns in mission-critical environments. The Outlook issues stem from the app having configuration problems when no direct network connection is available, which the flight director noted is not uncommon, raising questions about offline readiness for software deployed in connectivity-constrained environments. This incident is a useful reminder for cloud and enterprise software users that applications heavily dependent on network activity can behave unpredictably in low- or no-connectivity scenarios, and offline-mode reliability remains an important consideration for software selection. I did watch Project Hail Mary, if you haven't seen that, but they did not have any Microsoft Outlook software in that. So, you know, clearly in the future when the world is going to be ending from a weird, you know, astrophage, we'll be fine because they won't have Outlook anymore. So I'm hoping there's still a chance for me.
[02:06] Jonathan: I haven't seen the movie yet.
[02:07] Matt: Oh, it's so good.
[02:09] Jonathan: I'm looking forward to seeing it.
[02:10] Matt: Really, the book was good. The book was great. I read the book when it came out. I'm a big Andy Weir fan in general, but the movie was really good. It's close enough to the book that I have no quibbles, so that's good.
[02:22] Jonathan:
I heard some comments that although the stories are the same, the takeaway from the experience of the book is different than the takeaway from the movie.
[02:31] Matt: Well, I mean, there's a big twist that you find out in, you know, basically the beginning of the third act about how he ends up in space that's different in the movie, and I think in the movie it actually plays differently than it does in the book. And so you have a different emotion leaving it in the movie versus the book. But both of them are excellent. So again, I highly recommend seeing it.
[02:54] Jonathan: Wonderful.
[02:54] Matt: It's probably only in the theater for a few more weeks, so do it quick.
[02:58] Justin: I thought it just came out.
[02:59] Matt: I mean, movies are only in the theater for like four weeks, five weeks and they're out again.
[03:04] Justin: I don't think I've seen a movie in the theater since before COVID. So I guess my sense of how long things are in theaters is not there anymore.
[03:12] Jonathan: Is it COVID or having kids? For me it's both. The perception of time is crazy.
[03:18] Matt: Yeah. I don't go as often as I used to, but I try to go to the big movies that I'm excited about. But yeah, this one is definitely one. I know Mario kicked it out of the Dolby theater, which is where I saw it. I think it might be still in the IMAX theater at our local place, Jonathan.
[03:32] Jonathan: Ah, yeah, I'll check that out.
[03:34] Matt: Yeah, worth checking out. Iran has declared AWS, Google and Microsoft data centers as military targets. This happened in April with their declaration naming the Joint Warfighting Cloud Capability (JWCC) contract specifically as the reason why AWS, Google, Microsoft and Oracle data centers hosting Pentagon AI and intelligence workloads have lost civilian status under the Geneva Convention principles of distinction.
This means that these data centers now exist in a new paradigm in the world, especially with the war going on in the Middle East, and they're considered valid targets for war-type attacks. Now, in the case of FedRAMP and the JWCC, those workloads are typically in the FedRAMP data centers in the US, so it's a little bit of an interesting distinction, but there's no guarantee that they're not putting FedRAMP-type workloads into regions closer to the war zone. There's no conversation about that. So I mean, I can see Iran's point in this. And this will definitely make insurance and operating in clouds more expensive for companies who are very politically sensitive, and geopolitical tensions are only rising, as we almost had World War Three today, but he blinked at the last minute.
[04:37] Jonathan: Yeah, two more weeks.
[04:39] Matt: Two more weeks. I mean, we won't get into this too much, but 90 seconds after that came out, they attacked Israel. So I mean, the ceasefire is going great.
[04:46] Justin: Yeah, I have a friend that does data center build-outs for the cloud vendors, and I'm waiting to talk to her again and be like, yeah, so our newest data centers, we have, you know, anti-drone things on the roof and things like that.
[05:02] Matt: I mean, that's definitely a consideration that people have to make now. Yeah, you know, how do you tackle those types of issues, and do you have to put defensive things on them? I mean, people laugh about the Switch data center in Las Vegas because they have armed guards, but, you know, they allegedly have sensitive, you know, government contracts as well. And so I don't think you see that today with Amazon data centers. I mean, I don't drive by them very often, so maybe they have that today, but they don't today. They will in the future.
[05:27] Justin: Yeah, yeah.
[05:28] Jonathan: I remember visiting an AT&T data center and it was very high security, you know, multiple rings of fences with barbed wire and security at each ring. You check in all your electronic equipment as you walk in the door. It's worse than airport screening. It was a very interesting experience. I'm not quite sure how I feel about the military classification of something which is a shared utility, though. I mean, I get the intent obviously, but then, you know, a nuclear power plant is also shared infrastructure, and roads are shared infrastructure because the military sometimes drives on them. So where does the line get drawn?
[06:07] Justin: I was gonna say, the highways in the US were originally designed
[06:12] Jonathan: For military movement and landing planes on interstates. Yeah, right.
[06:18] Justin: So at that point the lines get blurry really fast, and it's not like, you know, AWS is going to say this data center over here is where we keep our high security stuff and that's the customer one. I feel like they're not going to go that far out. Also, you know, telling the world where their data centers are is not normally in their list of things to do.
[06:38] Matt: I mean, the reality is the status quo of military warfare and diplomacy has completely changed after the last few weeks here. So, you know, it's all going to change. And this is a new paradigm in that change that we'll have to now deal with as practitioners of cloud and compute resources. And it'll only get more difficult in the future. You know, sovereignty is a big issue right now too for the same reason. You know, if things go wrong between the United States and Europe, how do you make sure that all these hyperscalers, all US entities, can't access European data, and what are the protections on those sorts of things?
So lots of these questions are being asked globally right now due to geopolitical challenges in the world.
[07:17] Jonathan: Yeah, there's something really interesting though, because in certain places, like KSA for example, if the government is a customer of that region and that data center gets attacked, then you are also attacking Saudi Arabia's own infrastructure, not just the US military infrastructure.
[07:37] Matt: Yeah, well, I mean, even attacking something like Aramco, which is partially owned by the Crown, potentially puts you into an attack on the state. So there's all kinds of fun wrinkles as you peel this onion, if you will.
[07:53] Jonathan: Yep.
[07:53] Matt: All right, let's move on to tokens because that's much more exciting. Woo. OpenAI is introducing pay-as-you-go pricing for Codex-only seats within ChatGPT Business and Enterprise workspaces, billing on token consumption with no rate limits instead of a fixed per-seat fee, giving teams more cost visibility across their workflow. ChatGPT Business annual pricing drops from $25 to $20 per seat for teams that want standard ChatGPT access with Codex use limits included, while the new Codex-only seat option serves teams that want dedicated coding agent access without the broader ChatGPT bundle. OpenAI is offering eligible business workspaces $100 in credits per new Codex-only team member added, capped at $500 per team for a limited time, which lowers the barrier for a pilot. Codex now supports plugins and automations through its macOS and Windows app, allowing teams to connect the coding agent to existing internal systems and tooling rather than treating it as a standalone tool. OpenAI has reported over 2 million weekly active Codex builders and 6x growth in Codex users within business and enterprise accounts since January, with named customers including Notion, Ramp and Braintrust standardizing engineering workflows on it.
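One detail in the promo worth pencilling out: if the "100 credits per new Codex-only team member, capped out at $500 per team" offer is read as dollar-denominated credits (my assumption; OpenAI's exact credit unit isn't stated here), the cap means only the first five new seats per team are subsidized. A quick hedged sketch:

```python
CREDIT_PER_SEAT = 100  # promotional credits per new Codex-only seat (as read on air)
TEAM_CAP = 500         # per-team cap on promotional credits (as read on air)

def covered_seats(new_seats: int) -> int:
    """How many newly added Codex-only seats actually receive the promo credit."""
    return min(new_seats, TEAM_CAP // CREDIT_PER_SEAT)

print(covered_seats(3))  # 3: all seats under the cap get the credit
print(covered_seats(8))  # 5: the $500 cap limits the promo to five seats
```

So a team adding eight Codex-only seats would still pay full freight for three of them under this reading.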
It's interesting on this one for me, because with Claude Code, you know, I like the fact that I get access to Claude chat and to Cowork and to Claude Code. Although there is an API-only option with Claude Code that I can use, which we'll talk about here in a moment. But, you know, it's interesting to give you the flexibility, and I wonder if bundling will be the way these things typically get sold, or if it'll really be, you know, pay for what you use.
[09:17] Jonathan: I think if you want the best performance, you're going to have to pay for what you use. I think anyone who's paying for a bundle is always going to be a second-class user and will always take sort of second place in the queue to somebody who's paying usage-based pricing. I mean, it's pitched as a good thing for businesses, but pay-as-you-go pricing is expensive. So a minor discount on the per-seat pricing plus pay-as-you-go pricing is really not cheaper. You may have cost visibility per user, but it's super expensive.
[09:50] Justin: All you made me think about was queues and high-priority and low-priority queues and everything along those lines. And my brain went into: are these vendors going to start to prioritize customers? They already do some at, you know, free tier versus non-free tier.
[10:06] Matt: I think they're doing that a lot today.
[10:08] Justin: But do you think they're doing it at the higher tiers also, versus, you know, your standard credit versus your add-on credits and things along those lines? Because I know they're doing it with free and non-free. But are they going to start to prioritize within the other tiers and say the 20x plan gets priority if there's capacity issues over the 5x?
[10:27] Matt: I mean, I think they always have done that. I think that's true, you know, and I think you saw it when OpenClaw kind of blew up and there were a bunch of people buying Mac Minis and running it out there.
You know, you saw a bunch of constraints happening inside of Max plans and in other Pro plans. So I think you see that as they roll out new models as well: the higher-paid plans get access to the newer models sooner, and they get access to more tokens on those models because they're willing to spend more money on those particular models. So that's just the reality of these things. I think you will end up in a situation where you realize that a lot of these bundled packages are actually very subsidized by these vendors, and you're getting subsidized access, and if you really want to be competitive, you have to pay by the API call.
[11:10] Justin: So I mean, they've proven that. I think there was something I read where it's like 3 or 4x the number of tokens if you're doing full add-on. But I guess my question is, if you're on, let's say, the $200 plan for Claude, is that going to be lower priority than someone who's already hit their $200 quota and is now moving on to the add-on amounts? That's, I guess, the nuance I'm thinking about inside of that.
[11:35] Matt: Yeah, I don't think that distinction would be there today. But that'll be interesting too, if your included usage gets lower priority than the usage you're paying for per API call. I mean, I don't notice the difference in speed when I jump between the two. But also it's very difficult to know when you're leaving the Pro plan and you're going into your API usage; it's not very clear when that happens. So it's hard to say.
[11:59] Justin: I keep mine off until I hit the limit and then I manually turn it on each time, for that reason.
[12:05] Matt: Yeah, I do the same.
[12:06] Justin: I want to know when I hit that buffer limit. Because like today, I was working on something and it was like 15 minutes until it reloads, and I was like, I don't need to pay. I can wait 15 minutes. I'm patient enough for that sometimes.
[12:17] Matt: Yeah, it sounds like coffee break time.
[12:19] Justin: Yeah, you know, I don't need to spend money for the sake of it, because I can wait 15 minutes, you know.
[12:25] Matt: But the other thing is, I'm using a lot of sub-agents these days, and a lot of spinning off sub-agents and work trees and things like that. And so the ability to burn through a token usage cap for me is much faster now than it used to be.
[12:39] Jonathan: Apparently I'm the only sucker who pays for the $200 plan.
[12:43] Matt: No, I pay for the $200 plan too. I'm saying I'm paying for the $200 plan plus I'm paying for usage after that.
[12:49] Justin: I thought you were using Bedrock for a lot of things.
[12:52] Matt: I was, but I've pivoted because I wanted the Chrome integration and some of the other features that are now only coming out in the more advanced plans. So I switched back to the Pro plan. But yeah, got it. But honestly, when I get to the API usage limit, I should just use the API on Bedrock. If it didn't have the limitations, like not being able to do auto mode and a couple other things, I probably would.
[13:15] Jonathan: You might. I'm sure they have decent cache support in Bedrock.
[13:19] Matt: They don't. That's one thing I did notice when I switched back: my token usage dramatically dropped, going from, you know, new tokens to cache-hit tokens. So yeah, the token caching is not as good, for sure.
[13:31] Jonathan: Yeah, that will definitely cost more. A lot more money in the end.
[13:34] Matt: Yeah. Anthropic says that Claude Code subscribers will need to pay extra for OpenClaw usage now. Starting April 4, which was a little bit earlier this week, Anthropic is requiring Claude Code subscribers to pay separately on a pay-as-you-go basis for usage through third-party tools like OpenClaw, rather than drawing from their existing subscription limits.
This affects all third-party harnesses, with more platforms to follow. Anthropic's head of Claude Code cited infrastructure constraints and unsustainable usage patterns from third-party tools as the reason for the change, and the company is offering full refunds to subscribers who were unaware of the policy shift. The timing is notable given that OpenClaw's creator, Peter Steinberger, recently joined OpenAI, and OpenClaw continues as an open source project with OpenAI backing. Steinberger publicly stated he attempted to negotiate with Anthropic and only managed to delay the pricing change by one week. For developers building or using AI coding assistants through third-party integrations, this signals a broader industry pattern where AI providers may separate subscription pricing from API-level or harness-level consumption, adding cost complexity for teams relying on open source tooling around proprietary models.
[14:36] Jonathan: I'm really on the fence about this. I understand why they're doing it, because there's a big difference between somebody having a conversation, or somebody doing coding where you are mostly using cache hits for the majority of the work, versus OpenClaw, where the context changes constantly and you're making a call every 60 seconds. It's a completely different type of workload. At the same time, I'm paying $200 a month and I'd really like to be able to split that between, you know, the Claude apps, Claude Desktop, Claude Code, and perhaps some API usage as well, so I can run some other things. Like, as a developer of services, it would be nice to be able to consume some of that to test the services I'm building, rather than paying separately.
[15:21] Matt: It's pretty opaque to see the usage, even in Claude and in Anthropic's console: what you're using across different models, and how much of your usage is Claude Code versus how much is Cowork versus how much is chat.
It's not easy, and I do wish there was more granularity in some of that pricing capability to help you make that decision. But they have to get that before they can give you what you want, Jonathan.
[15:42] Justin: So yeah, we need all these AI vendors to follow up and go the FOCUS route.
[15:50] Matt: Yeah, I mean, I suspect that a lot of tooling is going to come out at FinOps X this year in June around AI token tracking. I think it's going to be a big area of announcements in particular, so we'll see. Anthropic is expanding their partnership with Google and Broadcom for multiple gigawatts of next-generation compute. This is a new TPU capacity agreement with Google and Broadcom, with capacity expected to come online starting in 2027, not a moment too soon. This builds on the existing October 2025 TPU expansion and deepens Anthropic's reliance on Google Cloud alongside AWS and Nvidia hardware. Anthropic's run-rate revenue has grown from roughly $9 billion at the end of 2025 to over $30 billion, with enterprise customers spending over $1 million annually doubling from 500 to over a thousand in under two months. The compute expansion is a direct response to this accelerating demand. Anthropic continues a multi-cloud hardware strategy, running Claude on AWS Trainium, Google TPUs and Nvidia GPUs to match workloads to appropriate chips. Amazon remains the primary cloud and training partner, with work ongoing on Project Rainier. Claude is currently the only frontier AI model available across all three major cloud platforms (Bedrock, Vertex and Azure Foundry), and this broad availability has practical implications for enterprises already committed to any of the three major cloud providers.
[17:03] Jonathan: Yeah, I read recently that they're doing something like half a billion dollars a week in sales right now.
[17:08] Matt: Yeah, that's crazy.
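The two revenue figures in this segment are at least roughly consistent with each other, which is worth a quick sanity check: half a billion dollars a week annualizes to about $26 billion, the same ballpark as the "over $30 billion" run rate quoted above. A back-of-the-envelope sketch:

```python
weekly_sales = 0.5e9             # "half a billion dollars a week" (as quoted on air)
annualized = weekly_sales * 52   # simple run-rate annualization

print(f"${annualized / 1e9:.0f}B")  # $26B: same order of magnitude as the $30B+ run rate
```

The gap between $26B and $30B+ is plausibly just growth between the two measurement points, since run rate is usually annualized from the most recent period.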
We were just doing a contract recently, trying to get the enterprise plan, and like, you know, we'd email the sales rep and then a week and a half later she'd respond, and you'd respond back to her within five minutes, and then another week and a half later she'd respond to you again. And you know, she was super apologetic. She's like, we're just so slammed right now with all these deals coming through. And so, you know, if you weren't spending millions of dollars, I think you were getting lower priority in the queue, as Matthew pointed out earlier.
[17:35] Justin: I mean, they're hot, and clearly their sales are showing it.
[17:42] Matt: And our final Anthropic story this week. Big Anthropic week. So last week a rumor dropped about Anthropic having created a new model called Mythos, which the rumors were saying was potentially the scariest model created yet. And of course, because it leaked, they had to announce it officially. And so they're officially announcing Project Glasswing, a coalition including AWS, Google, Microsoft, Apple, Cisco, Nvidia and others, built around a new unreleased model called Claude Mythos Purview that is focused specifically on finding and fixing software vulnerabilities in critical infrastructure. Mythos Purview has already identified thousands of high-severity vulnerabilities autonomously, including a 27-year-old flaw in OpenBSD, a 16-year-old bug in FFmpeg that survived 5 million automated test runs, and a Linux kernel privilege escalation chain, all of which have since been patched. The model will not be generally available, but partners can access it via the Claude API, Amazon Bedrock, Google Cloud Vertex AI and Azure Foundry at $25 per million input tokens and $125 per million output tokens, after an initial period covered by $100 million in Anthropic usage credits.
Anthropic is donating $4 million to open source security organizations, including Alpha-Omega and OpenSSF through the Linux Foundation, and the Apache Software Foundation, to help maintainers respond to the vulnerabilities the model surfaces. The initiative signals a shift in how AI safety and capability trade-offs are being handled in practice, with Anthropic planning to test new cybersecurity safeguards on an upcoming Claude Opus model before considering any broader deployment of Mythos-class capabilities.
[19:10] Jonathan: What a great position to be in, though. To be in the position to sell the model that can find every vulnerability and cause havoc in the world, and also be the company that has the smart AI that can solve the problems.
[19:21] Matt: Exactly. Double-edged sword: we detect it and we fix it. There you go. Yeah, I mean, it's cool. I mean, we definitely expect to see newer and better models still coming out. And so the fact that they think this model is this scary is impressive, and I'm a little bit skeptical as well, as one of the things that Anthropic has a tendency to do is be a little bit overly ambitious about what they think their things can do. You know, like the fact that, you know, SaaS-ageddon is all because of them, and that, you know, all companies are going to destroy their SaaS contracts because they're going to just be able to build it with Opus. Like, I mean, some of that stuff is hyperbole in the strongest sense. And so, you know, is it probably really great at finding stuff? Is it really good at chaining things together to find these attacks? Yes. Is it as scary as they make it out to be? Maybe, maybe not. I don't know. Time will tell. I'm not going to be spending money on Mythos tokens to find out, but I am curious to see what people come out with now that it's out in the wild; it just dropped this morning. So we'll keep an eye on this one for sure.
[20:22] Jonathan: Yeah, yeah, I'll bow out of that one too.
That's a little pricey.
[20:26] Justin: Yeah, it's getting really expensive. But if you're running something that, you know, either runs on a lot of systems or is.
[20:34] Matt: I mean, if I'm an operating system vendor or I'm a SaaS vendor, you bet your butt I'm using this, because this thing can save you a ton of pain in security vulnerabilities. And if you're not using it, nation-state hacking groups are definitely going to be using it. So there is a risk to not using it, but I just don't know that my hobby projects require it.
[20:52] Justin: Yeah, I know. My hobby projects I could care less about. Please hack them. Have fun.
[20:56] Matt: I don't think Bolt will require this level of security scanning. And if someone hacks Bolt, they got my API key to Anthropic. Oops.
[21:09] Justin: There are a lot of cloud cost management tools out there, but only Archera provides insured commitments. It sounds fancy, but it's really simple. Archera gives you the cost savings of a one- or three-year AWS savings plan with a commitment as short as 30 days. If you do not use all the cloud resources you've committed to, Archera will literally cover the difference. Other cost management tools may say they offer insured commitments, but remember to ask: will you actually give me my rebate? Archera will. Check out thecloudpod.net/archera to schedule a demo today.
[21:51] Matt: AWS news this week. First up, I noticed this in the console because I'm working on some heavy-duty ECS development. Managed daemons are now available, which let platform engineers deploy and update monitoring, logging and tracing agents independently from application teams, eliminating the need to coordinate task definition changes or service redeployments across hundreds of services. Daemons are guaranteed to start before application tasks.
Last, ensuring operational tooling like cloudwatch Agent is always available throughout the application lifecycle, including during rolling updates. A new daemon bridge network mode keeps daemon containers isolated from application networking while still allowing communication. And daemons support privileged container access and host file system mounts for deep system level visibility. Each instance runs exactly one daemon copy shared across all application tasks on that instance, which optimizes resource utilization and allows CPU and memory parameters to be managed centrally without rebuilding AMIs or modifying application task definitions. The feature is available now in all AWS regions at no additional charge around standard compute costs as long as ECS exists there. [22:49] Jonathan: Nice. Like what a useful feature. [22:52] Matt: Yeah. Popped up. I said what is a daemon? And I was like looking at, I was like oh, this is where you run your CrowdStrike or you run your, you know, your penetration testing or other things that you know are always kind of a, you know, they're a, they're typically privileged so you, you know, they're always a little scary and not having it run with the application container is really nice. So I think this is a great enhancement in general. [23:11] Jonathan: Yeah, the network bridge thing reminds me of sort of the olden days as it were, when we used to have a backup or management network in addition to the customer facing network for that kind of monitoring traffic. So that's, it's like we weren't totally wrong back then. [23:27] Matt: We weren't crazy, just premature. Amazon SES is adding support for optional Start TLS configurations, allowing legacy systems that lack full START TTLS support to still connect to Mail Manager without requiring a full infrastructure overhaul. 
Mutual TLS adds certificate based authentication at the ingress endpoint level, giving organizations a stronger identity verification layer for inbound email connections beyond standard encryption. Two new rule actions add email processing flexibility: Invoke Lambda lets you trigger custom code directly from rule sets for advanced routing or transformation logic, and the Bounce action sends RFC compliant SMTP rejection responses back to sending servers. These features are now available across most SES Mail Manager regions, with the notable exception of the Middle East (UAE) and Middle East (Bahrain) regions. So don't expect it there. Pricing for SES Mail Manager follows existing SES usage based pricing. I mean, it's only taken them 15 years to give us StartTLS optionality. Thanks, I guess. [24:23] Jonathan: So you could either have it or you couldn't before, but there was no, there's no problem. Okay. Yeah, that's interesting. [24:31] Matt: Which, you know, is a big issue. They've actually added a lot of features to Mail Manager recently. You know, the fact that it now can handle bouncing, bounce protection, and all that stuff that you used to build your own for, it's nice that that stuff's there now. The deliverability management stuff is pretty nice. So there's definitely been a big investment in SES capabilities over the last couple years, which I appreciate. [24:52] Justin: I think it goes back to our conversation about how much Kiro is kind of helping take out that backlog of items, letting the interns or a dev just go at all these backlog PFRs that are out there. [25:06] Matt: I mean, that could definitely be part of it, but it also causes more outages apparently, so don't worry about that. Minor details, details, details. [25:15] Jonathan: I think with the Lambda support they've kind of turned it into an if-this-then-that for email, for AWS. It's very powerful.
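For the Invoke Lambda rule action, the handler is just your own code. A minimal sketch might look like this, where the event shape (an `envelope` object with a `from` address) is an assumption for illustration, not the documented Mail Manager payload:

```python
def handler(event, context):
    """Hypothetical SES Mail Manager rule-action handler that routes
    mail by sender domain. The event fields here are assumed for
    illustration; check the actual Mail Manager payload shape."""
    sender = event.get("envelope", {}).get("from", "")
    domain = sender.rsplit("@", 1)[-1].lower()

    # Archive internal mail; deliver everything else onward.
    if domain == "example.com":
        return {"action": "archive"}
    return {"action": "deliver"}
```

The Bounce action, by contrast, needs no code at all: the rule set itself sends the RFC-compliant SMTP rejection.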
[25:22] Matt: I'd love for them to pull it into the workflow engine, which I'm forgetting the name of at this moment. Step Functions, Step Functions. Thank you. I'd love for them to be able to put it into Step Functions, because then you can do some really cool simple rule sets versus just Lambda invocation, which is fine too. [25:35] Justin: But didn't I see that they were shutting down WorkMail? [25:39] Matt: WorkMail is no longer taking new customers, so it's in maintenance at this point. Okay, I think they gave up on the dream of being able to compete [25:50] Justin: with Outlook, especially after AWS, or Amazon, was supposed to move to O365. [25:57] Matt: Yeah, exactly, Amazon. Bedrock guardrails now support cross account safeguards in general availability, letting organizations enforce a single guardrail policy across all AWS accounts and organizational units from a central management account, covering every Bedrock model invocation automatically. There are two enforcement levels: organizational enforcement applies the guardrails via AWS Organizations policies to all member accounts and OUs, while account level enforcement applies guardrails to all Bedrock inference calls within a single account, giving teams flexibility to layer the controls. A notable configuration option lets admins choose between Comprehensive mode, which enforces guardrails on all content regardless of caller tag, and Selective mode, which only applies guardrails to content that callers explicitly tag. Useful for mixed workloads with pre validated and user generated content. One practical gotcha worth flagging though: specifying an incorrect guardrail ARN in the policy does not fail silently. It blocks all Bedrock model inference for the accounts, so ARN accuracy is critical before attaching policies to production OUs. This is like group policies on Windows.
If you are not careful, you can lock yourself out of your domain, so be careful how you apply these. But good to see centralized control. [27:01] Justin: I feel like you've done that in the past. [27:03] Matt: Oh yeah, I've messed up a group policy many a time and had to recover in horrible ways. [27:10] Justin: So yes, I'm just thinking of the time I blocked all access to an S3 bucket and the only way to fix it was to go into the root account in order to, you know, fix the bucket policy. It's the same type of thing. [27:25] Matt: Yeah, so group policy has weird inheritance rules, and the one I enabled was a disable Windows interactive login, but it was at an OU level that ended up accidentally cascading down to the user OU, which, you know, if you can't log in interactively to Windows boxes, you can't do much of anything. And the only reason I was saved is because I hadn't logged out of the server. I was able to reuse my session that hadn't been updated yet. So. But yeah, no, that was a fun outage and RCA that I had to write and be accountable for. AWS Cost Explorer now supports natural language queries powered by Amazon Q Developer, letting users ask plain English questions like "show me my top spending services this month" and receive both written insights and automatically updated charts, filters and groupings simultaneously. The feature supports conversational follow up questions with maintained context, meaning users can move from a quick cost check to a detailed investigation without switching tools or manually reconfiguring visualizations. When Amazon Q pulls from additional data sets beyond raw cost and usage data, such as pricing catalogs or anomaly detection, those results appear in a separate artifacts panel rather than the main Cost Explorer view. I did play with this because I was curious, and I've done a lot of really cool things with AI for cost management recently. It's not very good.
Like most Amazon Q things, it's not great. So it's April, but when I was playing with this it was closer to March. And so I asked it, why is March 10% more expensive than February? And it didn't pick up the fact that there are more days in March, three more days, which is kind of a big deal in billing in particular. But also, because it's in the console, it doesn't have the application-side insight, which is where I think the actual real value of AI in FinOps comes into play. Because if I have access to the source code and I have access to the AWS CLI tools, I can basically ask Claude, hey, why was my Amazon bill so high this month? And it can look at the source code, look at what I'm doing in the application. There were things that I was doing, like, oh, you're not caching SSM parameters properly in your code, so you're making an SSM parameter call on every call, which is costing you a lot of money once you pass the free tier. Just basic things like that. You have to have enough application insight, that connection, to know that that's why you're burning so much money on parameters. And that combination of having the source code of the applications running plus the cost data together is really powerful. Amazon Q, because it's contained to Amazon, doesn't have that insight unless it's in, like, serverless functions or something where it has the source code. But that's very rare; in most use cases you don't have just a serverless function at play. [30:09] Jonathan: Yeah, you'd think they could actually look at the fact that you're pulling the same key all the time and maybe point it out. [30:16] Matt: I mean, you could. Again, you should have CloudTrail turned on, so yes, it should be in CloudTrail that you're calling the same thing from this principal.
So, but then that would require Amazon Q to basically go pull CloudTrail data at scale, which isn't easy to do in context. So, you know, how much control do you want to have to do those things? But yes, you're right, it could. If you had all the other data already plumbed and you're already paying for it, then yes, it could in theory do that, but it's not that smart. That's my reality. [30:47] Jonathan: Did you ever call Moviefone years ago? [30:49] Matt: I did, yeah. [30:51] Jonathan: Remember Moviefone? All these voice interactions with AI in front of pretty much every service the cloud offers now remind me of Moviefone. I'm wondering if we'll end up with 800 numbers to call to get your cloud bill. Oh God, this Friday your bill will be ridiculously large. [31:13] Matt: You used Mythos, you idiot. [31:19] Justin: Turn on pissed off CFO mode. Yeah. [31:24] Matt: Well, if you guys were excited about the fact that NotebookLM exists on Google, where you can put any article you want in there and get a generated two person podcast you can ask live questions to, as a studio call-in thing, and you said to yourself, man, I really wish Amazon had that capability, let me tell you: they've given you one piece of the puzzle with the Amazon Nova 2 Sonic speech-to-speech model, available through Bedrock, that handles real time conversational AI with support for 7 languages and a 1 million token context window, making it practical for voice first applications like customer support and interactive learning. The AWS blog post demonstrates a proof of concept podcast generator that uses two Nova Sonic instances to simulate a host and expert dialogue, streaming audio in real time using a Flask and asyncio architecture with RxPY for reactive event processing. So if you want a really bad way to do this, you can use this blog post and spend a lot more money versus just going and using NotebookLM.
But it's nice to have a model architecture, I suppose. [32:18] Justin: Still not out of a podcasting job. Yeah, got it. [32:20] Matt: Not yet. [32:21] Jonathan: Not yet. It's getting close though. [32:24] Matt: Jonathan's working on it. [32:28] Jonathan: What do we talk about every week, though, if we had AI to do the news for us? That's the question. [32:33] Matt: I don't know what it would talk about. That's a question. I feel like our [32:36] Justin: chat would be a lot less of who's available to actually do the podcast this week and more of real topics. [32:45] Matt: I mean, the real thing is that the articles that we talk about are not really the important part of the show. It's really the insight that the four of us bring to the conversation, about how you might actually use this or what are some use cases. And that's what listeners tell me all the time: the news is great, but it's really the conversations you guys have about the tech that they like. So I don't think the AI could do that yet. [33:04] Justin: Maybe it's our level of cynicism. [33:06] Matt: I mean, that's probably part of the enjoyment as well. It's a good time to say, hey, if you'd like to review the podcast, we haven't asked for reviews in a bit, but we'd love to get iTunes reviews or Google podcast reviews if you're listening and hearing us out there. Finally, this article was announced this morning for Amazon S3 files, and we are linking to, of course, the Amazon blog post, as we typically do. And I will tell you that I read that blog post and I had no idea what they were announcing. Not a clue. But thank God for Corey Quinn, as he posted a whole write-up on Last Week in AWS, which we've also linked to, that basically tells you what this actually is. And so once I read his article, I was like, oh, okay, I understand.
So basically what they've done is, instead of trying to do dumb things like FUSE to make your application talk to object storage, they've added a layer on top of your S3 bucket that allows you to take hot files that the operating system will actually need and access them as EFS. So basically it's an EFS proxy on top of S3, which allows you to, depending on the size of the object you want, read it from the S3 bucket, put it into the hot tier, let the application use the hot tier appropriately, and then write it back. There are lots of use cases for this, particularly inside of AI/ML use cases, which is why they built it, of course. But there are also some other interesting legacy use cases too. I was thinking about a CMS application I was using that didn't have native S3 support, but I wanted to be able to use buckets to do CloudFront backing of images. So using this, you can actually create a hybrid file system between an EFS and an S3 system, where the image manipulation happens by the CMS, gets stored on the disk, and then replicated to S3, where CloudFront could then pick it up like you expect in a web application. So there are some interesting use cases I think will come out of this. It's a bit pricey if you're not careful, if you're doing file scans. So you wouldn't want to do a lot of file operations across your system via the EFS mount to S3, because every time you pick those files up, even if it's one kilobyte, you're paying for 32 kilobytes, I believe, as the minimum. Corey has some good examples of where you want to be careful, but definitely a great article. [35:10] Jonathan: Yeah. In other possibly related news, the NetApp stock price is down from $104 to $96 in the past 24 hours. Because that's basically what NetApp tiered storage does, correct? [35:22] Matt: Yes. [35:27] Jonathan: Yeah. Finally, finally, finally no S3FS.
Finally no crappy user-mode interface to S3, and sensible handling of incremental updates and things. It's great. Like, why did it take them so long? [35:40] Matt: Exactly. I think it's a good middle ground, though, because the people trying to do FUSE are always trying to recreate POSIX, and it's like, well, that's not really what you need. What is your use case? And the use case you have for AI or ML training workloads is, well, I have a massive amount of data in an S3 bucket that I need to be able to access on a hot tier across multiple machines and potentially orchestrate across them. It makes sense in an AI use case. And so I think AI is driving the use case. I think there were use cases for this before; they just didn't have maybe a solution that would work as well. But, you know, maybe Kiro designed this. Who knows? [36:16] Justin: I mean, as long as people don't try to run SQL on S3. Because, I think it was like a year into my cloud career, and a customer I was talking to was like, we're getting really bad performance on our database. And I looked at it and I was like, oh, you're using S3FS, running SQL on S3? And my brain just exploded. [36:39] Matt: Yeah, I mean, I would tell you that trying to run MySQL on EFS is not a great experience either. [36:44] Justin: No, it's not. It's a lot of IOPS. [36:47] Matt: There's really not a great option other than local disk on Amazon for running databases in general. So, yeah, S3FS, though, that's, that's real bad. [36:54] Justin: That was a very quick migration for them. [36:58] Matt: Yeah, I mean, I was using EFS a bit for The Cloud Pod website, and then I was tired of how much it was costing, so I moved away from it. And our website doesn't need to be up all the time, so if it goes down for a couple hours, it's not the end of the world, because we have Castos, who basically hosts the podcast files for us if the website's down.
And that's where all the RSS feeds pull from anyway. So really the website's there for people to go check out, and we publish there, but it gets synced to Castos, so it's not critical in our infrastructure. So I was like, there's no reason to pay for EFS to have multiple ECS nodes talking to the EFS thing. But this is an interesting use case for some of the WordPress stuff I do, where potentially I could get away from using one of the S3 offload plugins that I use and use this instead. So I'm gonna check it out a little bit. I might use this for some of our WordPress sites, but not the database. The database will stay on local disk, because every time I've monkeyed with that, that's when performance goes real bad for the website. [37:51] Jonathan: I kind of wonder if we should look at a static site instead. I mean, mostly static, maybe with some supplemental stuff. [37:57] Matt: You know, it's on my list of many, many AI projects to do. Yeah, because we could definitely move to a static site, but there are advantages to what we're doing with Castos and WordPress that I, again, how much do I really want to spend on it? It always comes down to that question. [38:12] Jonathan: The podcast plugins for WordPress are just, I mean, they do a lot of heavy lifting. [38:17] Matt: Yeah, when I first was creating it, I was looking at Jekyll a little bit, and there were some podcast plugins, but there was not great support on them, and they weren't really what I was looking for. And to do a CI/CD pull to do a simple podcast post is a bit clunky. But there are advantages, like being able to take our show notes from Heather and automatically upload them and trigger off the whole podcast production. There's things that would make it worthwhile. But also I could use the WordPress CLI. So yeah, it's one of these.
Like, you get to that point of, well, I can just use the WordPress CLI, and it's problem solved too. But maybe, maybe someday. The redesign we did this year on the website dramatically simplified our template, and it's pretty much a static website at this point. It does use WordPress to generate it, but then it's mostly static, and I could potentially offload a lot of that to CloudFront and not even have it hit the website. All right, moving on to GCP. GKE Inference Gateway now supports both real time and async inference workloads on the same shared GPU/TPU accelerator pool, eliminating the need to maintain separate clusters for each traffic type. This addresses a common infrastructure inefficiency where real time clusters sit idle during off peak hours while async jobs run on underutilized secondary hardware. The async component works by integrating a batch processing agent with Cloud Pub/Sub, where latency tolerant requests are pulled from a queue and routed to the inference gateway as lower priority, sheddable traffic that fills unused compute cycles between real time spikes. Testing shows that without the async processing agent, unmanaged multiplexing of low priority requests causes a 99% message drop, while using the agent resulted in 100% of latency tolerant requests being served during available capacity windows, demonstrating the priority enforcement mechanism doing meaningful work. This is all open sourced in the llm-d async project, meaning teams can use it in multiple cloud environments rather than being locked into GKE. So if this is a problem you're having on AWS or Azure, you can take a look at this code. [40:10] Jonathan: That's neat. Yeah, that's very cool. Especially since it works in the interests of the cloud vendors now, who can maximize their utilization of the GPUs. There's still a lot of CPU sitting idle, though. [40:23] Matt: A lot of CPU idle.
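The scheduling idea behind that 100% figure is easy to sketch: latency-tolerant work only drains when real-time traffic leaves capacity free. Here's a toy model of priority fill, not the llm-d implementation:

```python
from collections import deque

def serve_tick(capacity, realtime_requests, batch_queue):
    """One scheduling tick of a shared accelerator pool (toy model).
    Real-time requests are served first; queued batch requests fill
    whatever capacity is left. Returns (realtime_served, batch_served)."""
    served_rt = min(capacity, realtime_requests)
    spare = capacity - served_rt
    served_batch = min(spare, len(batch_queue))
    for _ in range(served_batch):
        batch_queue.popleft()  # a latency-tolerant request finally runs
    return served_rt, served_batch

# During a real-time spike the pool is full, so batch work waits in the
# queue; in an off-peak tick the same pool drains it.
queue = deque(["job-a", "job-b", "job-c"])
```

The point of the agent in the article is exactly this enforcement: without something holding the queue and metering it in, low-priority requests arrive during spikes and get dropped instead of deferred.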
[40:24] Jonathan: 60 to 70% CPU idle while those GPUs are full on. But [40:31] Matt: yeah, there's lots of opportunity to take advantage of the CPU that's there. Google is releasing two new tools to address a core limitation of coding agents: outdated API knowledge due to training data cutoffs. The Gemini API Docs MCP connects agents to live Gemini API documentation via the MCP protocol, while Gemini API Developer Skills adds best practice patterns and SDK guidance. Using both tools together shows measurable improvements in evals, achieving a 96.3% pass rate with 60% fewer tokens per correct answer versus standard prompting. The MCP server is accessible at Gemini API docs mcp-dev and works with any MCP compatible coding agent, including Claude Code or Codex. The approach of pairing a live documentation server with a skills layer is a practical pattern other API providers could adopt, and I wish they would. [41:17] Justin: This definitely seems like a good middle ground to get at that latest API stuff, because a lot of the stuff I work on, you know, the Azure Terraform provider is definitely out of date in areas, or it takes time for it to update, just like anything does. So if it's able to link over to the actual live docs, plus with how fast all these providers go, it's a game changer, at least where I use it on Terraform. So seeing it here will just make life better too. [41:49] Jonathan: Yeah, it's no coincidence this is from Google, though, because their documentation sucks. [41:53] Matt: It's terrible. [41:54] Jonathan: Yeah, even if they're trying their best. [41:56] Justin: Have you ever read. Hold on, I've just got to interrupt. Have you ever read Microsoft documentation? [42:01] Jonathan: Yes. [42:02] Matt: I mean, I used to really like Amazon's documentation, but it's kind of gone to crap in the last few years too, so.
[42:07] Jonathan: Yeah, it must be an embarrassment for them, though, to have, you know, a flagship AI model that can solve the deepest problems in physics and everything else, but it can't tell you how to spin up a VM with a Terraform provider because the documentation sucks. It's quite, [42:23] Matt: it's quite the conundrum for sure. [42:24] Jonathan: It really is. [42:26] Justin: The problem with all the documentation is it just gets outdated so quickly. We do this, we've added this new feature, okay, we gotta update the documentation. And with how fast these cloud providers update stuff, the documentation, the API, everything just isn't updated. So I remember on AWS, Terraform honestly was where I got most features from when I was just looking at different things it can do, because it was so well updated in that way, or looking at the Boto API. That was kind of the way, back in the day, I kept up to date on everything. If the MCP is just reading the straight-up Swagger docs on the API, it has the information right away. [43:06] Matt: It's true. [43:08] Justin: Yeah. [43:08] Matt: I have a project that I'm working on where they just released an MCP that connects directly to their documentation, with a bunch of skills around it, and that has made it so much better. Just like, oh, this is so much nicer than reminding it all the time, like, oh, that was a V1 thing, this is not the way it is in V2 anymore. And then, like, oh, what do you mean, your knowledge isn't from V1? I know, here's the docs. It's really nice. And then you're not burning context on things that aren't necessary. So I do appreciate that. And let's move on to Azure. Azure Network Watcher now offers a public preview feature called Rule Impact Analysis, which lets a network admin simulate the effect of security admin rules before actually applying them to their environment, reducing the risk of unintended connectivity disruptions.
The feature is particularly useful for teams managing Azure Virtual Network Manager security configurations, as it helps identify rule conflicts and validate that connectivity requirements are met prior to deployment. I mean, 2026 and we still deal with rule conflicts in firewalls. [44:05] Justin: Hey, you can now stage them. [44:07] Matt: That's great. [44:08] Justin: Yeah, no, they're getting up there. [44:10] Matt: Takes me back to my Check Point days, where I was like, oh no, the rules are duplicated and causing all kinds of havoc. [44:17] Justin: I mean, it just shows how, I don't want to say immature, but how rudimentary a lot of the Azure firewall stuff is, and all their network monitoring and everything. They're still releasing a lot of the basic features to kind of get it up to speed so it can be more comparable to a real vendor versus where they're at today. [44:40] Matt: I mean, okay, I have some pain. [44:45] Justin: Let's, let's just go with that. [44:47] Matt: Clearly. I just, you know, if you can't build it yourself, just buy it from somebody else. That's what Google basically did with the next gen firewall. You know, partner with Palo Alto, build something that works in your cloud, and then this is a solved problem for you, versus giving you bad products. [45:04] Jonathan: So they have no kind of dry run or test mode where you can create a rule that just creates log events instead of actually enforcing? [45:11] Justin: I know they do on the WAF. I believe they do on the firewall too. But now I'm questioning if it's on both WAFs, because the WAF for essentially the ALB, the load balancer for the App Gateway, is actually different than the WAF for Front Door, just to make life more confusing. [45:32] Matt: But is it in the Azure Application Gateway, the Azure Front Door, or the Azure CDN that you need the Azure Web Application Firewall? And where are those rules previewed? I mean, it's overly complex.
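As a generic illustration of what a dry-run "rule impact analysis" has to compute (this is just the concept, not the Azure Network Watcher API): evaluate rules in priority order, and flag rules that can never fire because a higher-priority rule shadows them.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    priority: int  # lower number wins, as in most firewall engines
    action: str    # "allow" or "deny"
    dest_port: int

def effective_action(rules, port):
    """Simulate what would happen to traffic on `port` if this rule
    set were deployed, without actually deploying it."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.dest_port == port:
            return rule.action
    return "deny"  # assume a default-deny baseline

def shadowed_rules(rules):
    """Flag rules that can never fire because a higher-priority rule
    already matches the same port -- a classic rule conflict."""
    first_match, dead = {}, []
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.dest_port in first_match:
            dead.append(rule)
        else:
            first_match[rule.dest_port] = rule
    return dead
```

Staging a rule set through a check like this before enforcement is the "simulate before you apply" workflow the feature promises; real engines also match on source, protocol, and address ranges, which makes the overlap analysis much harder than this sketch.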
[45:45] Justin: You can tell it's, you know, no different than AWS. Two different teams were told to build a WAF and they did it, and there was no communication between the two to say, hey, maybe we should have a team that makes the general, hey, here's our standard rules that you recommend, here's our OWASP Top 10 rule protection, and then let that apply to both places. [46:08] Matt: Yeah, I mean, they just need to get a premium firewall product to then solve problems. [46:11] Justin: They have that. [46:12] Matt: Oh, do they? Okay. [46:13] Justin: Yeah, Azure Firewall Premium. There are three firewalls. I think there's Basic, Standard and Premium. Premium does the IDS/IPS. It has higher throughput, I believe, than Standard. I would have to double check all my notes. [46:27] Matt: Of course it does. [46:28] Justin: But the big one is the IPS and IDS versus just IDS, because it's always good to be told that you're having issues without actually being able to block anything. That's why you need the premium one to actually do the blocking. [46:43] Jonathan: So Aviatrix had a feature where you kind of had this cross cloud visibility into rules end to end. [46:51] Matt: Yeah, that's more for cross cloud connectivity on the private network side, not the public side. I assume Aviatrix can do public as well. It's been a little bit since I've looked at it, but it was really designed more for, hey, I want to connect Azure to AWS and then run workloads between those and to my data center, and so I need visibility end to end of that. But yeah, the Front Door side, I don't know, maybe they've improved it. Well, you know, we've always talked about premium tiers and sort of had this, is it really giving you anything or any value question. And so I found this great article on Reddit from a company called GO-EUC, and they did benchmark research comparing premium SSD and standard SSD disks for Azure VDI workloads in particular.
And they found that premium SSD delivers up to eight times higher IOPS and 89% lower latency than standard SSD, with the performance gap widening as disk sizes increase. Standard SSD shows a fixed performance ceiling of roughly 850 to 980 IOPS regardless of disk size, while premium SSD scales from 1,800 IOPS at 128 gigabytes up to 8,100 IOPS at 2,048 gigabytes, making disk sizing a meaningful architecture lever only for premium SSD. The cost comparison is less straightforward than it appears, because standard SSD carries transaction fees that can push its total cost close to premium SSD pricing under heavy VDI workloads, making premium SSD a more predictable cost option despite its higher base price. The 2,048 gigabyte premium SSD at $284.94 per month emerges as the recommended sweet spot, since moving to a 4,096 gigabyte disk at $545 yields only marginal performance gains. At a 2,500 seat scale, that sizing decision translates to over $7.8 million in annual cost difference. The research used synthetic disk speed tests rather than real user load simulation, so results reflect maximum disk capabilities under controlled conditions and may differ from production environments. So, I mean, it's not something Microsoft paid for; it's something that was done independently. So I approve. [48:44] Jonathan: You know, when you see 2,048 gigabytes for a month, that sounds like a lot. But 2 terabytes per month for $284 is actually quite expensive. [48:54] Matt: Yes, yes. I mean, premium SSD is a lot. VDI workloads are definitely their own, you know, nightmare to scale and work on. So it's definitely valuable. But it's just nice to see that there is actual value in these premium things you're getting, not just marketing, which is what we maybe have accused Microsoft of doing in the past, and now we have factual data that that's not true. [49:20] Jonathan: But what about Ultra Premium? Now that's the question.
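The arithmetic behind that $7.8 million figure checks out, using the per-disk monthly prices quoted above:

```python
# Monthly Azure premium SSD prices quoted in the GO-EUC research.
price_2048_gb = 284.94  # recommended sweet-spot disk, per month
price_4096_gb = 545.00  # next size up, only marginal performance gain
seats = 2500            # VDI seats in the worked scenario

# Annual cost difference of picking the larger disk for every seat.
annual_delta = (price_4096_gb - price_2048_gb) * 12 * seats
print(f"${annual_delta:,.0f} per year")  # roughly $7.8M, matching the article
```

Per seat that's about $260 a month of difference, which is why the sizing decision, not the raw per-gigabyte rate, dominates at fleet scale.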
[49:23] Matt: Yeah, they didn't ask about Ultra Premium, so, yeah, fair enough. [49:26] Justin: GO-EUC can't afford Ultra Premium. [49:29] Matt: Right. [49:30] Justin: I mean, the no-brainer for me in here at least is that with everything in Azure, as you go to a higher tier, it's more predictable pricing. It's kind of a lot of what they have. Same thing with Blob storage, same thing with, you know, disks and everything. And even AWS and a lot of the other cloud providers. As you go to the more premier, hot, whatever storage, the price per use goes down but the base price goes up. And that's kind of the way all of Azure is structured. [50:08] Matt: And the worst part is that one little feature will push you into the higher tier, and now you pay a premium when you only need one feature. It makes sense when you need two or three of the features; it's a problem when you only need one. [50:19] Justin: That's 100% where the sharp edges are, like Front Door, and the only way to connect to a storage account that is private is to be on Premium versus Standard. [50:30] Matt: Yep, just saying, just saying. In our emerging cloud section, DigitalOcean is launching a native cloud security posture management tool that continuously evaluates resources like your Droplets and databases for misconfigurations, without requiring agents or third party tools, making it accessible to smaller teams without dedicated security staff. The tool is built directly into DigitalOcean's dashboard and API, addressing a common pain point where security visibility requires separate tooling and context switching across platforms. I mean, I like this. Like, cool, you gave us a CSPM, but per-cloud CSPM means that you're trying to manage it separately across multiple cloud providers. I mean, DigitalOcean for most companies is a development environment versus a production environment, although some use it for production, and I won't dismiss that capability.
But a centralized CSPM adds a lot of value in my mind, so I appreciate it. If all you're using is DigitalOcean, then, you know, you're getting value out of it. But, you know, nice to see it; not something I'm super jazzed about. [51:29] Jonathan: Do you have to pay for it? [51:30] Justin: It's definitely a nice feature to kind of give the general developer or, you know, security person that might not know the intricacies of DigitalOcean a heads up: hey, here's a red flag, go look at this. Hey, your database is public facing. That's bad. But if you are managing a more enterprise environment, yeah, a centralized CSPM is definitely the way to go. [51:52] Matt: So, Jonathan, to your question about price: it is a freemium model. It gives you free baseline scans for everyone, and the premium tier unlocks workload rules, advanced prioritization, API access, and automations to fix the things the CSPM discovers. Which I think is a good place to put a paywall. [52:10] Jonathan: Yeah, that's fair. At least they're giving the important bits to everybody. I suppose there's no reason you can't hook some kind of CSPM into Terraform if that's what you're deploying with. So it's nice that they're covering the bases. That's nice. [52:28] Matt: Well, gentlemen, we've reached the end of another fantastic week here in the cloud. [52:33] Jonathan: Yeah, 350 episodes down. It's crazy. [52:35] Matt: I know, it's crazy. That's again why I was mentioning: go leave a review, go leave a comment, message us on Twitter or Bluesky or Mastodon, or join our Slack team and give us a hello and let us know what you're doing in the cloud, what you like, and what you don't like as well, because we always like to improve. But yeah, 350 episodes. It's continuing to count down the days of my life. It's my birthday now too. All right, see you guys next week. [53:05] Jonathan: See you later. [53:06] Justin: Bye.
[53:09] Jonathan: And that's all for this week in Cloud. We'd like to thank our sponsor, Archera. Be sure to click the link in our show notes to learn more about their services. While you're at it, head over to our website at thecloudpod.net, where you can subscribe to our newsletter, join our Slack community, send us your feedback, and ask any questions you might have. Thanks for listening, and we'll catch you on the next episode. [53:41] Matt: Matt, you brought us an after show today: How Microsoft Vaporized a Trillion Dollars. That's a lot of money, I will tell you that. I did not read this, so I'm going to turn it over to you to summarize it as summarizer in training. [53:55] Justin: Oh, I was definitely not prepared for that. You should have. [53:58] Matt: I mean, I could have given you a heads up on that one. I mean, I can read the basics of what you put here. The author, a senior Microsoft engineer who rejoined Azure Core in May 2023, discovered on his first day that a 122-person org was seriously planning to port large portions of Windows to a tiny low-powered ARM chip on the Azure Boost accelerator card, a plan he immediately recognized as physically impossible given the hardware constraints. Now, as an executive, I know that I don't want to be told things are impossible, so I'm sure this is going to go well. Nobody at Microsoft could explain why up to 173 agents were needed to manage each Azure node, what they all did, or how they interacted: sprawl that created enormous fragility in the system orchestrating VMs for OpenAI, government clouds, and other mission-critical workloads. After the elimination of dedicated testers in 2014 and a talent exodus of original Azure architects, much of the org was staffed by junior engineers with one to two years of experience, led by managers who lacked deep systems backgrounds, creating a persistent gap in senior technical leadership.
The node management stack suffered millions of unattributed crashes per month, memory leaks, resource leaks, and zombie VMs, with each monthly release introducing more bugs than it fixed and most rollouts ending in panicked rollbacks. A publicly exposed web server, the WireServer, running on the secure host OS held unencrypted tenant data for multiple customers in shared memory caches, a serious security liability in a hostile multi-tenancy environment, while crashing 300,000 to 500,000 times per month. Just a few times. Despite public claims at Ignite conferences from 2023 to 2025 that key components had been offloaded to Azure Boost and rewritten in Rust, the author states that as of late 2024, zero of the 64 identified work items had been completed, and work hadn't started on roughly 60 of them. Digital escort sessions, where $18-an-hour employees executed commands on production nodes under direction of overseas support staff, including from China, became routine, with nearly 200 just-in-time access requests per day observed over a two-month period, directly contradicting the original no-human-touch design vision. The author proposed an incremental component strategy to modernize the node stack from first principles, including a cross-platform component model, a new message bus, and a security-hardened cache. His lower-level manager responded with defensiveness, and the org eventually terminated his employment. Consequences materialized over time, with OpenAI signing an $11.9 billion deal with CoreWeave in March 2025 and later a $300 billion deal with Oracle, moves the author alleges are tied to this poor infrastructure management.
[56:08] Justin: So the way I read the story was, this senior developer came back after years away from Microsoft and just saw what we'd call a slight disaster, based on the story he laid out: essentially, back when Azure was created, it was done very quickly and wasn't set up at all to scale, definitely not to the levels that it's at today. And there's been a ton of manual intervention, including essentially just-in-time requests that were just auto-approved, basically. And the whole hands-off vision, hey, everything's auto-built and it's going to be hands-off and automatically fix itself, just wasn't there. The tech debt is so large, basically, from what he's saying, that it's almost insurmountable, and they've essentially outsourced all the issues to what I'd just call, you know, help desk personnel that go in and manually touch things all the time to fix it. And what made everything much worse was the deal Azure signed with OpenAI back in the day when it was first announced, because they needed so much capacity, and when Microsoft had the first right to refuse, they obviously didn't want to refuse such a massive customer; that just made every other project drop off even more. And I don't know how accurate it is, but it's an interesting glimpse behind the scenes of these cloud providers. How often do you really get to see how the cloud providers operate? And this is kind of saying, yeah, Microsoft is a little bit of a disaster. [57:54] Jonathan: Doesn't surprise me. I've spoken to several people who work for Oracle, or who did work for Oracle in the past, and there's an awful lot of manual tweaking that goes on there as well. The guy I talked to was on the load balancer team, and, you know, he told me about people literally logging in and tweaking settings and doing, you know, live changes on customer production load balancers because they were so buggy.
I've also worked with people who, you know, staffed a 200-person NOC and paid them to log into every Windows machine and reboot it once a week. So I'm not surprised. [58:29] Justin: I'm not surprised either. The part that was interesting to me, and he talks about it more in the article, was that essentially there was a big shift at one point where they got rid of their entire dedicated QA team, and it just caused code quality to go down, and they hired a lot of junior developers. And with such a complex environment, like any of these hyperscalers, you at least need some visionaries and some people that can help, you know, make sure that everything is going in the right direction. And it sounds like the talent drain that occurred, you know, in 2020 to 2024 just kind of destroyed Azure. And while they're able to keep growing, their tech debt is just slowly killing them on the back end, and at some point they're going to hit a breaking point. [59:15] Jonathan: Well, the real story is the fact that the guy was let go. Again, you know, you'd think if he was bringing these issues to light in a sensible way that he would not get that kind of treatment. [59:26] Matt: But I mean, you just know he's you, Jonathan. [59:31] Jonathan: Yeah, so no, he's not me. There's somebody else out there like me, but it's not me. [59:37] Matt: I mean, maybe he's more diplomatic, unlike myself, who's not diplomatic at these types of things. Unsubscribe from all. [59:44] Jonathan: Yeah. [59:47] Matt: You know, I mean, I definitely think it feels a little bit like, you know, sour grapes that he got terminated or whatever. But, you know, I also question some of the Azure stuff, because you can see it from the outside: you can see outages, you can see some of these problems. So I don't dismiss them completely out of hand, but I definitely wonder how much is sour grapes versus real.
But it is interesting, you know, the idea of being able to move portions of Windows to low-powered ARM chips. Yeah, that's never going to work. I mean, you can't move containers on Windows to Linux, for God's sakes, because it doesn't work. And Microsoft has already shown multiple times that the ARM transition is a problem for them. The first version of Windows they ported to ARM was terrible, and they canceled an entire product line because of it. They've now done it again with Windows 11, and it's much better, but it's still nowhere near what they had promised it would be. [60:41] Justin: So, I mean, the other interesting part to me was the 173 agents that run on an Azure node to keep it up and running, with no one able to define what those were and why they're there. Which on one level doesn't surprise me, but 173 feels like a lot. Feels like a lot of overhead, too. [61:01] Matt: I mean, 170 of them are definitely security agents. [61:06] Justin: God damn it, Ryan. [61:11] Matt: That's where all agents come from, is security. It feels like, you know, unified agents aren't a thing. [61:15] Justin: But yeah, I mean, over time it'd be interesting to see, you know, the sour grapes from someone that leaves AWS or GCP. [61:23] Matt: Oh yeah, you find them out there as well, you know. They don't always get a multi-part, you know, Substack thread about it, but, you know, it's definitely interesting and worth a read. I only read the first one, which is why I didn't have all the context of parts two, three, and four. [61:39] Justin: No, I think it goes up to six or seven. It kept going, and it's a little long-winded, but it's an interesting read. You know, it's kind of like reading RCAs for me: it's what's going on behind the scenes that you don't really know about, but it's kind of interesting to see. [61:55] Matt: Yeah.
I mean, I feel like he just keeps piling on at this point, and it's interesting, but I'm like, okay, we know that these things happened, and, yeah, you've already done the damage you wanted to do. So we'll see, when the next parts come out, if it's any better. But yeah, Substacks are always kind of fun to read, because there's a little bit of whininess to them. All right, gentlemen, we'll see you next week. I'm going to have a birthday dinner. [62:23] Jonathan: Yep. See you later. [62:24] Justin: Happy birthday. [62:25] Matt: Thank you. [62:26] Jonathan: Bye.