Wednesday, July 24, 2024

Hedgehog is the AI network solution builder - plus more

If you are actively looking to build out AI network infrastructure and want to utilize white box, cost effective switching, one of the challenges you have is what software you will use to design, deploy, operate and manage those network switches, because doing that by hand via a CLI will not be fun. Hedgehog wants to be the software you use to solve this problem, along with a few more use cases.

They kicked off their Network Field Day 35 (#NFD35) presentation with the pitch that Hedgehog is the AI network solution builder. They are the software to make this happen, but you can definitely use them for more than that, so you have to read between the lines to get the bigger picture of where they fit in the market.

I don't ding them for riding the AI wave; that is what is happening at the moment. And they aren't claiming they are "using AI" to build or deploy the fabric. They are saying you can use their solution to easily build out network infrastructure that makes running AI workloads faster and easier, with better performance, without necessarily needing a network engineering team to do it. That is pretty attractive for companies that are mainly investing their money in teams that can build interesting AI models and data sets.



The pitch is that folks who are building and running AI workloads likely grew up with, and are used to, cloud-first tooling and services. They want an API and, for networking, a VPC-like concept to build out what they need to run their AI workloads. Because Hedgehog provides that user experience, but for white box, on-premises environments, the argument is that your existing cloud AI teams can easily get things up and running with a look, feel, and workflow that matches what they do in a public cloud.

I must admit, I have no idea how many companies are running AI workloads, want to avoid recurring cloud costs, and have decided to deploy colocated or dedicated infrastructure on their own to help drive down costs. I think that is the intersection of two customer markets in a small Venn diagram overlap, but clearly Hedgehog can provide a solution for those folks.

I think the compelling story is the fact that you can build an Ethernet fabric easily with the same workflow and concepts you are used to with a public cloud service like AWS or Azure for networking. This is important for organizations that are trying to streamline their AI workload processes and want to leverage the cost savings of running AI workloads on fixed capital investments versus recurring costs in public cloud. It still leverages VXLAN and EVPN, but that is abstracted away from the operator. The reality is, this is where network automation is going: a controller that builds out Ethernet fabric solutions in a standard way, where you don't need to touch or maintain the underlay, plus an API that those who want to consume the fabric can use to set it up the way they need it with a simple abstraction.
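To make the "VPC on top, VXLAN/EVPN underneath" idea a bit more concrete, here is a minimal Python sketch of what such an abstraction could look like. It is not Hedgehog's API: the `FabricController` class, its method names, the starting VNI of 10000, and the rendered output format are all assumptions made up for illustration. The point is only that the consumer deals in VPCs and subnets while the controller decides the VNI and VRF details.

```python
from dataclasses import dataclass, field
from ipaddress import IPv4Network


@dataclass
class Vpc:
    """Tenant-facing object: a name plus the subnets inside it."""
    name: str
    subnets: dict = field(default_factory=dict)  # subnet name -> IPv4Network


class FabricController:
    """Hypothetical controller that hides VXLAN/EVPN details behind a VPC API."""

    def __init__(self, first_vni: int = 10000):
        self._next_vni = first_vni   # assumed VNI numbering scheme
        self._vni_by_vpc = {}
        self._vpcs = {}

    def create_vpc(self, name: str) -> Vpc:
        if name in self._vpcs:
            raise ValueError(f"VPC {name!r} already exists")
        # Each VPC gets its own VNI, so tenants stay isolated on the fabric.
        self._vni_by_vpc[name] = self._next_vni
        self._next_vni += 1
        self._vpcs[name] = Vpc(name)
        return self._vpcs[name]

    def add_subnet(self, vpc_name: str, subnet_name: str, cidr: str) -> None:
        vpc = self._vpcs[vpc_name]
        new_net = IPv4Network(cidr)
        # Reject overlapping address space inside one VPC.
        for existing in vpc.subnets.values():
            if existing.overlaps(new_net):
                raise ValueError(f"{cidr} overlaps {existing} in VPC {vpc_name!r}")
        vpc.subnets[subnet_name] = new_net

    def render(self, vpc_name: str) -> dict:
        """Underlay-facing view the operator never has to write by hand."""
        vpc = self._vpcs[vpc_name]
        return {
            "vrf": f"vrf-{vpc.name}",
            "l3vni": self._vni_by_vpc[vpc_name],
            "subnets": sorted(str(net) for net in vpc.subnets.values()),
        }
```

A consumer would call `create_vpc("training")` and `add_subnet("training", "gpu", "10.0.1.0/24")` and never see a VNI; only `render()` exposes what a controller might push to the switches.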

Let me repeat that - this is where network automation is going. I'm not saying Hedgehog is the only way to do this, but they demonstrate where the industry is going to go and how the network industry is going to change.

Hedgehog VPC - multi-tenancy the cloud way


It is also worth noting that the solution is built on Kubernetes but you don't need to know anything about that, it is just under the hood. The other part that is interesting is that the software is all open source. Their goal is to provide a fully automated network operations solution at a reasonable cost.

So who is the target?

I think, given what Hedgehog can do and what capabilities it provides, the target customer is really organizations that want to run networks that can scale and change without having to spend a crazy amount on network architects and engineers to design and deploy them. In addition, the solution is cost conscious, so those that are willing to purchase white box networking hardware to help drive down their capital costs, compared to buying a Cisco, Juniper, Arista, or other mainstream network switch solution, will be ideal customers. With open source software, low cost commodity hardware, and the cost focused on annual software maintenance, it will be very appealing to shops that run lean, with low overhead and staff.

So where is Hedgehog going?

Hedgehog - What is next?

It is clear that Hedgehog can only really innovate at the pace at which white box network platforms gain the right supported capabilities. That being said, having a service gateway to make their cloud-like experience consistent with what the public clouds provide will be critical. The DPU integration will give them scale and security beyond what most commercial network solutions are providing today. And finally, having flow scheduling ability from a DPU gives them even more capabilities to optimize AI workloads.

Is there room in the market for Hedgehog? Yes. Will they likely get bought up to help a larger network vendor who wants to own that market? Yep, that will likely happen at some point. For now, if you are looking for where the network automation landscape is going, you might want to check out what Hedgehog is providing the industry with open source, industry standard network automation through API, and the key value provided via annual software maintenance. I'll be keeping an eye on what they do.

- Ed


In a spirit of fairness (and also because it is legally required by the FTC), I am posting this Disclosure Statement. It is intended to alert readers to funding or gifts that might influence my writing. My participation in Network Field Day, a Tech Field Day event, was voluntary and I was invited to participate. Tech Field Day events are hosted by Gestalt IT (part of The Futurum Group) and my hotel, transportation, food and beverage was/is paid for by Gestalt IT for the duration of the event. In addition, small swag gifts or donations were/are provided by some of the sponsors of the event to delegates (I don't accept gifts but I do ask the sponsors to donate to causes that support Mental Health). It should be noted that there was/is no requirement to produce content about the sponsors and any content produced does not require review or editing by Gestalt IT or the sponsors of the event. So all the spelling mistakes, technical missteps, incorrect opinions, and grammar errors are my own.

Tuesday, July 16, 2024

Intel - Is it an IPU or a DPU or what?

Intel has developed and sold classic Ethernet Network Interface Cards (NICs) for a long time, but many might not be as familiar with their product offerings in the SmartNIC and more advanced NIC categories. Intel breaks down their product offering as follows:

Intel Ethernet Connectivity Solutions

Intel presented on the work they are doing around their Infrastructure Processing Unit (IPU), which they refer to as an "Improved DPU" (Data Processing Unit). It fits in the general bucket of "SmartNIC" and is developed by the Intel NEX Cloud Connectivity Group. This post focuses on the IPU, since that is what they presented at #NFD35. I must admit, I am interested in hearing more about their AI Optimized solutions, as I am sure they are being leveraged by some very large organizations for interesting workloads. Perhaps a future NFD?

There were two features in the IPU that infrastructure engineers should know about. The first is the capability to build out reliable transport between two hosts that both have Intel IPUs. The Ethernet fabric no longer needs to run special queuing and management to deal with congestion and microburst issues; instead, the IPU running Falcon, and leveraging programmable congestion control, deals with it. Effectively, Falcon is the method for reliable transport over an existing lossy fabric, which brings a lot of options to companies who may not want to build out a dedicated fabric for running storage or AI workloads.

In a shared fabric environment, it can be difficult to structure and provision all the access ports with the right queuing and policies. Given that difficulty, it might make sense, for smaller networks and diverse compute environments, to simply purchase the more advanced IPUs for the servers that require them and have the IPU deal with the lossy fabric issues.
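As a rough illustration of what "reliable transport over a lossy fabric" means at the endpoints, the Python sketch below simulates a sender that retransmits until it sees an acknowledgment, while the receiver accepts each sequence number only once. This is a generic stop-and-wait toy, not the Falcon protocol, and it leaves out congestion control entirely; the function name, the loss rate, the seed, and the retry limit are all assumptions chosen for the example.

```python
import random


def send_reliably(payloads, loss_rate=0.2, seed=1, max_attempts=50):
    """Deliver payloads in order over a channel that randomly drops packets.

    Returns (delivered, transmissions). Illustrative only: a stop-and-wait
    toy, not Falcon and not any vendor's implementation.
    """
    rng = random.Random(seed)       # seeded so a run is repeatable
    delivered = []
    transmissions = 0
    for seq, payload in enumerate(payloads):
        for _ in range(max_attempts):
            transmissions += 1
            if rng.random() < loss_rate:
                continue            # data packet lost: sender times out, retries
            # Receiver accepts a given sequence number only once, so a
            # retransmission after a lost ACK does not create a duplicate.
            if len(delivered) == seq:
                delivered.append(payload)
            if rng.random() >= loss_rate:
                break               # ACK arrived: move on to the next packet
        else:
            raise RuntimeError(f"packet {seq} could not be delivered")
    return delivered, transmissions


if __name__ == "__main__":
    data = [f"chunk-{i}" for i in range(10)]
    received, sent = send_reliably(data, loss_rate=0.3)
    print(received == data, sent)
```

The fabric in this model does nothing special: all of the recovery logic lives at the endpoints, which is the general shape of the argument for pushing the problem into the NIC.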

The other feature was demonstrating the use of the IPU for general compute capabilities and also AI inference, and the markets that could potentially use the solution. They are definitely targeting a wide audience of infrastructure engineers who might need to run services and workloads but might not have the capacity, budget, or fabric design to support what they are trying to do. Intel sees the following areas as potential good use cases for their IPU.

IPUs In & Beyond the Data Center

 

You can watch the overview presentation from Thomas Scheibe w/ Intel at:



If you want more information about their Reliable Transport over Lossy Fabrics, which is called Falcon, then check out:



Intel also provided some actual demos and you can watch those at:



What is always a little interesting about Intel and their solutions, is that typically, you and I aren't buying directly from Intel. You are normally purchasing their products through a distributor or hardware supplier like HPE, or Dell, or SuperMicro. But Intel still wants infrastructure engineers to know what their products are capable of, so when you are building out the next server, you are picking the right SmartNIC for your specific needs. So it makes sense they are out providing this information directly to the public, or NFD events in this case, so you can pick and choose the right solution for your Data Center and Enterprise server networking needs.

- Ed




Tuesday, July 02, 2024

Network Field Day 35

Network Field Day 35 (#NFD35) is happening July 10-11, 2024 and I am fortunate enough to be a delegate for the event. You can check out the full event schedule at the NFD35 website. The sponsor list has been growing, so checking the site is the best option until the event starts. I recommend watching live if you can, and I believe LinkedIn is likely the best place to catch the sessions.

So far the sponsor line up is:

I will be attending in person, and I will be doing my best to take notes and ask interesting questions. Obviously, there is no way we can cover all the questions that those who are watching remote might have, but hit us up on X/Twitter using #NFD35 or via the Tech Field Day slack channel or even via LinkedIn and we will all do our best to try and bring up the point.

So, there you go, let's get ready to have some serious fun with NFD35 as the delegate line up is pretty impressive! If you are at all into networking then I encourage you to follow along live for the events on the Tech Field Day website or via LinkedIn. If you are interested in being a delegate, you can check out the website, they have all the details up there.

- Ed


In a spirit of fairness (and also because it is legally required by the FTC), I am posting this Disclosure Statement. It is intended to alert readers to funding or gifts that might influence my writing. My participation in Tech Field Day events was voluntary and I was invited to participate. Tech Field Day is hosted by Gestalt IT (part of The Futurum Group) and my hotel, transportation, food and beverage was/is paid for by Gestalt IT for the duration of the event. In addition, small swag gifts or donations were/are provided by some of the sponsors of the event to delegates (I don't accept gifts but I do ask the sponsors to donate to causes that support Mental Health). It should be noted that there was/is no requirement to produce content about the sponsors and any content produced does not require review or editing by Gestalt IT or the sponsors of the event. So all the spelling mistakes, technical missteps, incorrect opinions, and grammar errors are my own.

Monday, August 07, 2023

Nile's changing up how Enterprises design, build, and consume Access Networks at Network Field Day 32

Nile presented their Enterprise Networking solutions at Networking Field Day 32 on July 26, 2023. Nile has built out a set of networking solutions that focus on the enterprise and commercial market, and they are selling the solution in a Network as a Service model. The overview of what they provide:

  • Wired and Wireless LAN as a Service
  • Guaranteed Network Performance
  • Zero Trust
  • IT Simplicity

It seems they are competitors to Meraki, Mist, and Aruba from an enterprise solution offering and to Ubiquiti and MikroTik in the commercial market. All of these competitors have strong market positions and install bases. This is a simplistic comparison, but for the purpose of understanding what market groups they are potentially suited for, it works just fine.

Here is their overview:


There are several more YouTube videos available, you can find them all over at the Tech Field Day 32 Nile page.

But in typical NFD fashion, the most interesting and relevant session ended up being the last video and the poor presenter was given the least amount of time because everyone else was unable to keep on track prior.

Note: If Nile presents at another field day, I suggest they START with this demo, focus on doing Q&A around it and expand everything else after it. Honestly, the first 30-45 mins of the overall timeslot was a waste of time and could have been cut (except the marketing people likely wanted that content - stop listening to them, you can record that stuff on your own, you don't need a bunch of delegates in the room for that part). If you are going to watch anything, watch this one:



My quick thoughts on what Nile presented:
Of course the IPv6 question was asked and they built a new generation of networking gear and solution without IPv6 as a first class citizen. I don't know if that is really forgivable in the current market. While I understand the US Federal Government is not their primary customer, or even a secondary, there will definitely be organizations that need IPv6. It is just such a glaring misstep I can't really take the rest of the product seriously, so you know my bias going into this. 

They also need to explain and position their place in the market a bit more clearly. A simple elevator pitch that says something like: "We are Meraki or Mist generation 2.0" or something similar to give a reference point. I get that they are doing Network as a Service (NaaS) and their billing/revenue model is slightly different but it puts them in front of the right general audience. The current pitch and explanation is too broad and doesn't narrow the field for buyers to understand what they do and why.

Effectively, they are wrapping together hardware, software, support, and installer/operator ease of administration in a recurring revenue model. I'm not sure that is revolutionary at this point. They did invest to brand their own hardware solution. I'm not sure putting simple diagrams on the equipment makes it unique in terms of IT Simplicity. Their management UI looks like a combo of Mist, Meraki and Ubiquiti, so nothing super unique is going on there, though that might be a plus: people who have used those other solutions can figure theirs out a bit faster.

I will be honest, I am not 100% sure what the large/important differentiator is for Nile. I either missed the key points in the presentation or they need to hone their message of how they are different, unique, and valuable for a customer. It just wasn't clear to me why I would want them versus any other product solution set out there right now. It should be the first, second, and third thing they talk about. I'm not even sure it was mentioned specifically.

I will keep an eye on Nile and what they are doing, but honestly, just like with Meraki, I won't take them seriously until they can work with IPv6 as a fully supported networking protocol.


 - Ed

In a spirit of fairness (and also because it is legally required by the FTC), I am posting this Disclosure Statement. It is intended to alert readers to funding or gifts that might influence my writing. My participation in Tech Field Day events was voluntary and I was invited to participate in NFD32. Tech Field Day is hosted by Gestalt IT and my hotel, transportation, food and beverage was/is paid for by Gestalt IT for the duration of the event, if travel was involved. In addition, small swag gifts or donations were/are provided by some of the sponsors of the event to delegates (I didn't accept the swag gifts offered). It should be noted that there was/is no requirement to produce content about the sponsors and any content produced does not require review or editing by Gestalt IT or the sponsors of the event. So all the spelling mistakes and grammar errors are my own along with the ideas and thoughts.

Tuesday, August 01, 2023

Broadcom's AI Networking Solutions at Networking Field Day 32

Broadcom presented their AI Networking solutions at Networking Field Day 32 on July 26, 2023. These are products and architectures that address the needs of those building out AI-focused data center networks. Obviously the design will work for regular data center workloads too, albeit suboptimally, because it is focused on addressing AI workloads and not a more general workload. The attributes that Broadcom defines as making an AI network unique are:

  • Fewer flows (low entropy)
  • High bandwidth flows (elephant flows due to the large amount of data sets being moved around)
  • Synchronized and bursty traffic
  • Links are saturated in micro-seconds (<<RTT)
  • Training jobs run for long periods of time (hours/days)
  • Tail latency impacts job completion time significantly
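To see why "fewer flows" and "high bandwidth flows" are a problem for a traditional fabric, the Python sketch below compares classic per-flow ECMP hashing with per-packet spraying across a set of uplinks. It is a toy model under stated assumptions: the CRC32 hash, the four example flows, the 8 uplinks, and the 4096-byte packet size are invented for illustration and do not describe any particular switch.

```python
import zlib


def ecmp_link(flow, n_links):
    """Pick an uplink by hashing the flow tuple (classic per-flow ECMP)."""
    key = "|".join(map(str, flow)).encode()
    return zlib.crc32(key) % n_links


def per_flow_load(flows, n_links):
    """Bytes per link when every packet of a flow follows the same hash."""
    load = [0] * n_links
    for flow, size in flows:
        load[ecmp_link(flow, n_links)] += size
    return load


def sprayed_load(flows, n_links, packet_size=4096):
    """Bytes per link when packets are sprayed round-robin over all links.

    Spraying reorders packets, so the receiver has to put them back in order.
    """
    load = [0] * n_links
    next_link = 0
    for _, size in flows:
        full, rest = divmod(size, packet_size)
        for chunk in [packet_size] * full + ([rest] if rest else []):
            load[next_link] += chunk
            next_link = (next_link + 1) % n_links
    return load


# Four elephant flows of 1 MB each (hypothetical addresses and ports).
flows = [
    (("10.0.0.1", "10.0.1.1", 49152, 4791, "udp"), 1_000_000),
    (("10.0.0.2", "10.0.1.2", 49153, 4791, "udp"), 1_000_000),
    (("10.0.0.3", "10.0.1.3", 49154, 4791, "udp"), 1_000_000),
    (("10.0.0.4", "10.0.1.4", 49155, 4791, "udp"), 1_000_000),
]

if __name__ == "__main__":
    print("per-flow ECMP:", per_flow_load(flows, 8))
    print("packet spray :", sprayed_load(flows, 8))
```

With only four flows, per-flow hashing can use at most four of the eight links, and any two flows that hash to the same link share its bandwidth while other links sit idle. Spraying spreads the same bytes almost evenly, at the cost of reordering.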
And they shared some interesting info about how "time spent in network" is impacted by:

  • Transient oversubscription
  • Flow collisions and link failures
  • Incast - many GPUs send into one or a few GPU(s)
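Incast is the easiest of these to model. The Python sketch below compares senders that all burst at once with senders that only transmit when the receiver has granted them credit. It is a generic illustration of receiver-driven pacing, not Broadcom's implementation: the `run_incast` function, the tick-based timing, and every number in the example are assumptions, and dropped packets are simply counted rather than retransmitted.

```python
def run_incast(senders, burst, buffer_size, drain_per_tick, use_credits):
    """Simulate many senders targeting one receiver port.

    Returns (ticks, dropped). Toy model only: one shared queue in front of
    the receiver, fixed drain rate, and no retransmission of dropped packets.
    """
    pending = [burst] * senders      # packets each sender still has to send
    queue = dropped = ticks = 0
    while any(pending) or queue:
        if use_credits:
            credits = drain_per_tick     # receiver grants only what it can drain
        else:
            credits = sum(pending)       # everyone sends everything right away
        i = 0
        while credits and any(pending):
            sender = i % senders         # hand out send opportunities round-robin
            if pending[sender]:
                pending[sender] -= 1
                credits -= 1
                if queue < buffer_size:
                    queue += 1
                else:
                    dropped += 1         # buffer overflow at the congested port
            i += 1
        queue -= min(queue, drain_per_tick)
        ticks += 1
    return ticks, dropped


if __name__ == "__main__":
    print("no credits  :", run_incast(8, 100, 64, 16, use_credits=False))
    print("with credits:", run_incast(8, 100, 64, 16, use_credits=True))
```

In the uncontrolled case the queue empties quickly only because most of the burst was dropped, and in a real network those packets would have to be sent again. With credits, arrivals never exceed what the receiver can drain, so nothing is dropped as long as the buffer can hold one tick of traffic.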
Broadcom says the solution is to build a Clos fabric that makes use of a receiver-based credit control process that can pace the senders accurately. This means it is impossible to oversubscribe the Clos fabric and therefore you can leverage techniques like packet spraying with receiver ordering. It is worth watching the presentation on YouTube to understand what they are doing and why. You can check that out here:



There are specific videos on the Tomahawk AI Interconnect here:



And also on Jericho3 AI here:


And their wrap up on AI/ML Data Center Fabric solutions can be found here:


My quick thoughts on what Broadcom presented:
I wasn't aware (more likely I haven't been paying attention to what is happening in AI/ML like I should be) that there was this much specific network design work going into addressing AI workloads. While I understand there are a lot of AI/ML projects, I wasn't aware that so many private firms might want this solution architecture for their own needs versus running on leased cloud models.

Clearly there is a pricing advantage to running stuff at scale on your own hardware (in terms of reduced network data ingress/egress costs, compute cycles, and having dedicated GPU access) otherwise Broadcom wouldn't be building these sorts of solutions. It seems most of the large scale cloud providers have built something similar on their own or have requested that Broadcom address a gap in what traditional Ethernet fabrics can provide.

What will be interesting to me is whether this is a short term industry change to address a narrow vertical, or whether this will become the new default Ethernet fabric architecture because AI/ML workloads will become commonplace DC workloads. I'm not convinced it will go that way. Perhaps we will see a hybrid: specific AI/ML Ethernet fabrics that are L3 connected to traditional DC-focused Ethernet fabrics, in an attempt to give an organization the best of both worlds.

You can also get Drew Conry-Murray's thoughts on Broadcom's presentation over at his Packet Pushers blog post.

 - Ed
