NANOG has had a recurring presentation by Richard Steenbergen called "Everything You Always Wanted to Know About Optical Networking – But Were Afraid to Ask"; here is last year's:

* https://www.youtube.com/watch?v=Y-MfLsnqluM

Contrary to the "highlights" section (which seems to be the only place calling it a "standard" 19-core optical fiber), this is not in fact a 'standard' fiber; rather, the word seems to come from the standard (125µm) cladding diameter ("Sumitomo Electric was responsible for the design and manufacture of a coupled 19-core optical fiber with a standard cladding diameter (see Figure 1)"). It looks like "diameter" simply got dropped in the highlights section.

(Nonetheless impressive, and multi-core fiber seems to be maturing as technology.)

Alright, I have a dumb question...

How come with a LAG group on Ethernet I can get "more total bandwidth", but any single TCP flow is limited to the max speed of one of the LAG members (gigabit, let's say), yet these guys are somehow combining multiple fibers into one overall faster stream? What gives? Even round-robin mode on LAG groups doesn't do that.

What are they doing differently and why can't we do that?

  • ryao · 2 hours ago
I do not know exactly what is being done here, but I can say that I am aware of two techniques for sending bit streams over parallel wires while keeping the bits in order:

1. The lengths of all wires are meticulously matched so that signals arrive at the same time. The hardware then simply assumes that the bits coming off the wires are in sequence by wire order. This is done in computers for high-speed interfaces such as memory or graphics. If you have ever seen squiggly traces on a PCB going to a high-speed device, they were done to make the lengths exactly the same so the signals arrive at the same time on each trace. This is how data transfers from dual-channel DDR4 RAM, where 64 bits per channel are received simultaneously, occur without reordering bits.

2. The lengths of the wires are not matched and may differ up to some tolerance. Deskew buffers are then used to emulate matched lengths. In the case of twisted-pair Ethernet, the wire pairs are not of equal length because the twist rates are varied to avoid interference between pairs with the same twist rate. As a result, the Ethernet PHY must implement a deskew buffer to compensate for the mismatched lengths and present the illusion of matched wire lengths (see the sketch after the link below). This is part of the Ethernet standard and likely applies to Ethernet over fiber too. The IEEE has a PDF discussing this for 800 Gb/s Ethernet:

https://www.ieee802.org/3/df/public/23_01/0130/ran_3df_03_23...
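To make the deskew idea concrete, here is a toy sketch (my own illustration in Python, not the actual 802.3 PCS logic): bits are striped across four lanes, each lane arrives with a different delay, and the receiver absorbs the skew before re-interleaving. Matched-length wiring is just the special case where all the skews are equal.

    # Toy model of lane striping plus deskew; real PHYs locate alignment
    # markers, here the skew is simply modelled as leading IDLE symbols.
    from itertools import zip_longest

    IDLE = None

    def stripe(bits, lanes=4):
        """Transmitter: distribute bits across lanes in round-robin order."""
        out = [[] for _ in range(lanes)]
        for i, b in enumerate(bits):
            out[i % lanes].append(b)
        return out

    def add_skew(lane_streams, skews):
        """Channel: each lane arrives late by a different number of bit times."""
        return [[IDLE] * s + lane for s, lane in zip(skews, lane_streams)]

    def deskew_and_merge(received):
        """Receiver: the deskew buffer soaks up each lane's delay (drop the
        IDLEs), then the bits are re-interleaved in lane order."""
        aligned = [[b for b in lane if b is not IDLE] for lane in received]
        merged = []
        for column in zip_longest(*aligned):
            merged.extend(b for b in column if b is not None)
        return merged

    bits = list("011010011101001")
    rx = add_skew(stripe(bits), skews=[0, 3, 1, 2])
    assert deskew_and_merge(rx) == bits  # order preserved despite unequal delays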

LAG was never intended to have the sequence in which data is sent be preserved, so no effort was made to enable that in the standard.

That said, you would get a better answer from an electrical engineer, especially one that builds networking components.

They're not combining anything; they're sending 19 copies of one signal down 19 strands (with some offsets so they interfere in awkward ways), applying some signal processing to correct the interference (which they say makes it realistic), and declaring that they've calculated the total capacity of the medium.

What you do with it at the higher layers is entirely up to you.

But Ethernet could totally do that, by essentially demuxing the parts of an individual packet and sending them in parallel across a bunch of links, then remuxing them at the other end. I'm not aware of anyone having bothered to implement it.
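As a rough sketch of what that hypothetical demux/remux could look like (nothing standardized works this way; all names here are made up):

    # Hypothetical packet spraying: split one frame into per-link fragments
    # tagged with (frame_id, index), then reassemble regardless of arrival order.
    def demux(frame: bytes, frame_id: int, links: int):
        """Split one frame into `links` tagged fragments."""
        size = -(-len(frame) // links)  # ceiling division
        return [(frame_id, i, frame[i * size:(i + 1) * size]) for i in range(links)]

    def remux(fragments):
        """Reassemble fragments of one frame that arrived in any order."""
        return b"".join(chunk for _, _, chunk in sorted(fragments, key=lambda f: f[1]))

    frame = b"one ethernet frame worth of payload"
    assert remux(reversed(demux(frame, frame_id=7, links=4))) == frame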

> What are they doing differently and why can't we do that?

You're (incorrectly) assuming they're doing Ethernet/IP in that test setup. They aren't (this is implied by the results section discussing various FEC, which is below even Ethernet framing), so it's just a petabit of raw layer 1 bandwidth.

It's also important to note that many optical links don't use Ethernet as a protocol either (SDH/SONET being the common ones), although this is changing more and more.
  • wmf · 4 hours ago
Looks like SDH/SONET topped out at 40 Gbps, which means it died 10 years ago.

SONET is widely used in the US.

Used, maybe, but [citation needed].

Built? No, definitely not voluntarily¹. Ethernet is the only non-legacy thing surviving for new installations for anything more than short-range (few-kilometer) runs. InfiniBand, CPRI and SDI are dying too and getting replaced with various over-Ethernet things; even for low-layer line aggregation there's FlexE these days.

¹ some installations are the exception that proves the rule; but for a telco, sinking more money into keeping an old SONET installation alive is totally the choice of last resort. You'll have problems getting hardware too.

Disclaimer: I don't know what military installations do.

  • ryao · 2 hours ago
Infiniband is alive and well in HPC. 327 out of the top 500 machines use it according to this:

https://www.top500.org/statistics/sublist/

It is a nice interconnect. It is a shame that the industry does not revisit the idea of connecting all components of a computer over Infiniband. NVLink Fusion is a spiritual successor in that regard.

Yes, IB hasn't died… yet. But the writing's probably on the wall with Ultra Ethernet; the corporate trajectory of Mellanox (now Nvidia) is not a great sign either.

(Also you don't use IB for 1800km WAN links.)

FWIW I actually run IB and think it's a nice interconnect too :)

  • ryao · 1 hour ago
This seems oddly appropriate:

https://xkcd.com/927/

There is one difference, however. As far as I know, they did not make Ultra Ethernet because the existing Infiniband standard did not cover everyone’s use cases. They made Ultra Ethernet because Intel killed QLogic’s Infiniband business in an attempt to replace an open standard (Infiniband) with a proprietary one (Omni-Path), built out of the corpse of QLogic’s Infiniband business, in a bid for a monopoly (which failed spectacularly in true Intel fashion). NVIDIA then purchased Mellanox, becoming the dominant Infiniband vendor; that move turned out to be advantageous for AI training, and everyone else wanted an industry association in which NVIDIA would not be the dominant vendor. The main reason people outside of HPC care about Infiniband-level performance is AI training, and Nvidia’s dominance is not going anywhere. Now that NVIDIA has joined the UEC, it is unclear to me what the point was. NVIDIA will be dominant in Ultra Ethernet as soon as it ships Ultra Ethernet hardware. Are Nvidia’s competitors going to make a third industry association once they realize that Nvidia is effectively in control of the UEC, because nobody can sell anything if it is not compatible with Nvidia’s hardware?

Had they just used Infiniband, which they had spent billions of dollars developing only a few decades prior, they would have been further along in developing competing solutions. Reinventing the wheel with Ultra Ethernet was a huge gift to Nvidia. If they somehow succeed in switching people to Ultra Ethernet, what guarantee do we have that they will not repeat this cycle in a few decades, after they have let the technology become a single-vendor solution through myopic decisions and decide to reinvent the wheel again? We have already been through this with Infiniband, and I do not see much reason for anyone to follow them down this road again.

  • wmf · 1 hour ago
Mellanox already kind of controlled Infiniband, so Intel/QLogic could either chase Mellanox or fork, and IMO forking into Omni-Path wasn't necessarily a bad choice. Omni-Path got zero adoption, but that could have been for any number of reasons.

So many customers have a mental block against anything that isn't called Ethernet that the industry cannot "just use Infiniband"; they really had no choice but to create UEC. I would predict that Broadcom ends up de facto controlling UEC, since their market share is much higher than Nvidia's and people now want anything but Nvidia in order to create competition.

  • ryao · 51 minutes ago
I doubt anyone has a mental block against anything that is not Ethernet, given that Infiniband has been taking market share from Ethernet in HPC since Nvidia acquired Mellanox, if we go by the Top500 list:

https://www.top500.org/statistics/overtime/

Nvidia announced the acquisition of Mellanox on March 11, 2019. In June 2019, Ethernet had 54.4% market share to Infiniband’s 24.6%. NVIDIA completed the acquisition on April 27, 2020. In June 2025, Ethernet had 33.4% market share to Infiniband’s 54.2%. Note that I combined the Ethernet and Gigabit Ethernet fields when getting the data. It would be interesting to find broader market data comparing Ethernet and Infiniband, but that is hard to come by, since people stopped comparing them long ago when Infiniband famously retreated into HPC and Ethernet took over everywhere else. The only thing I could find was an advertisement for a paywalled analyst report, which says:

> When we first initiated our coverage of AI Back-end Networks in late 2023, the market was dominated by InfiniBand, holding over 80 percent share

https://www.delloro.com/news/ai-back-end-networks-to-drive-8...

They express optimism that Ethernet will overtake Infiniband, but the full rationale is not publicly accessible. It also does not say, outside of the paywall, to what extent the market share taken by Ethernet would be held by Nvidia. Another article I found gives a hint: it shows Nvidia gaining market share rapidly in Ethernet switches:

https://www.nextplatform.com/2025/06/23/nvidia-passes-cisco-...

The numbers on Ethernet switches do not show that "people now want anything but Nvidia to create competition". Instead, they show that there is a competitive market, that Nvidia is the current driver of competition in that market, and that there is amazing demand for Nvidia's Ethernet switches over switches from the incumbents, who were behind UEC. Broadcom appears to be included under "others" there.

The public data that I can find largely does not support what you said.

Broadcom introduced their first Ultra Ethernet switch chip just yesterday, the Tomahawk Ultra: https://investors.broadcom.com/news-releases/news-release-de...

Let's see how things evolve, but it is very clear that the market wants alternatives to Nvidia.

You don't really want to, but if you configure all of the LAG participants on the path to do round-robin or similar balancing rather than hashing based on addresses, you can have a single flow exceed an individual link. You'll also be pretty likely to get out-of-order data, and TCP receivers will exercise their reassembly buffers, which will kill performance, and you'll rapidly wish you hadn't done all that configuration work. If you do need more than one link's worth of throughput, you'll almost always do better by running multiple flows, but you may still need to configure your network so it hashes in a way that gives you diverse paths between two hosts; the defaults might not give you diversity even on different flows.
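For context, the default hash-based behaviour looks roughly like this (a sketch, not any particular switch's hash function):

    # Flow hashing: the same flow identifiers always map to the same LAG
    # member, so a single TCP flow never exceeds one member's speed but
    # stays in order.
    import zlib

    def pick_link(src_ip, dst_ip, src_port, dst_port, proto, n_links):
        key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
        return zlib.crc32(key) % n_links

    # One flow -> one member; more flows may (or may not) spread out.
    print(pick_link("10.0.0.1", "10.0.0.2", 40000, 443, "tcp", 4))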

The data being out of order is the key bit.

How do these guys get the data in order and we don't?

Consider that a QSFP28 module uses four 25 Gbps lanes to carry a single 100 Gbps flow, so electronics that can do what you are asking clearly exist. I think it is just the economics of doing it for the various ports on a switch, lack of a standard, etc.

SFP/QSFP/PCIe etc. combine multiple lanes originating from a physical bundle of limited size; the transmitters can easily share a single clock source. The wire protocol includes additional signalling that lets the receiver know how to recombine the bits coming over each lane in the correct order.

In contrast, Ethernet link aggregation lets you combine ports that can be arbitrarily far apart -- maybe not even within the same rack (see MC-LAG). Ethernet link aggregation doesn't add any encapsulation or sequencing information to the data flows it manages.

You can imagine an alternate mechanism which added a small header to each packet with a sequence number; the other end of the aggregation would then have to sort the packets into order and remove that header (rough sketch below).
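A minimal sketch of that imagined mechanism (not an 802.1AX feature; the 4-byte header is arbitrary):

    # Tag each frame with a sequence number before spraying it across links,
    # then sort and strip the tag at the far end.
    import struct

    def encap(frames):
        """Prepend a 4-byte big-endian sequence number to each frame."""
        return [struct.pack("!I", seq) + f for seq, f in enumerate(frames)]

    def decap(received):
        """Restore the original order, then strip the added header."""
        ordered = sorted(received, key=lambda p: struct.unpack("!I", p[:4])[0])
        return [p[4:] for p in ordered]

    frames = [b"frame-%d" % i for i in range(5)]
    assert decap(list(reversed(encap(frames)))) == frames  # links reordered them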

Plus the NIC/PHY is likely assuming only a small range of propagation delay differences between the lanes/links.

Probably falls down if one link is 1cm and the other is 100km.

A LAG could be done with different medium/speeds, though perhaps not likely in practice.

> A LAG could be done with different medium/speeds, though perhaps not likely in practice.

802.1AX in fact requires all members of a LAG to use the same speed.

> How do these guys get the data in order and we don't?

LAGs stripe traffic across links at the packet level, whereas QSFP/OSFP lanes do so at the bit level.

Different-sized packets on different LAG links take different amounts of time to transmit. So when striping bits, you effectively have a single ordered queue, whereas when striping packets across links, there are multiple independent queues (quick illustration below).
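A quick illustration of the effect (hypothetical numbers, 1 Gb/s members):

    # A big packet on link 1 and a small packet sent slightly later on link 2:
    # the small one finishes serializing first, so it arrives out of order.
    LINK_SPEED_BPS = 1_000_000_000  # 1 Gb/s per LAG member

    def arrival(send_time_s, size_bytes):
        """Time at which the packet's last bit leaves its link."""
        return send_time_s + size_bytes * 8 / LINK_SPEED_BPS

    big = arrival(0.0, 1500)    # sent first, 1500-byte frame, link 1
    small = arrival(1e-7, 64)   # sent 100 ns later, 64-byte frame, link 2
    print(big > small)          # True: the later packet arrives earlier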

Because your switch is mapping a 4-tuple to a certain link and these people aren't, is my guess.
  • wmf · 5 hours ago
I assume this is just a PHY-level test and no real switches or traffic were involved.

As others have mentioned, this is mostly a proof of concept for a high-core-count, weakly coupled fibre from Sumitomo. I also want to highlight the use of a 19-channel MIMO receiver structure, which is completely impractical. The linked article also fails to mention a figure for MIMO gain.

Worse, it's offline MIMO processing! ;D

I would guesstimate that if you try to run it live, the receiver [or rather its DSPs] would consume >100W of power, maybe even >1000W. (These things evolve & improve though.)

(Also, a kilowatt for the receiver is entirely acceptable for a submarine cable.)

To get a ballpark power usage, we can look at comparable (for some definition thereof) commercial offerings. Take a public datasheet from Arista[1]: they quote 16W typical for a 400Gbps module with 120km of reach. You would need 2500 modems at 16W (38kW), jointly decoding (i.e. very close together), to process this data rate. GPU compute has really pushed the boundaries on thermal management, but this would be far more thermally dense.

[1] https://www.arista.com/assets/data/pdf/Datasheets/400ZR_DCI_...

It's important to note that wavelength channels are not coupled, so modems on different wavelengths don't need to be terribly close together (in fact, one could theoretically do wavelength switching, so they could be hundreds of km apart). So the scaling we need to consider is the scaling of the MIMO, which in current modems is 2x2. The difficulty is not necessarily just power consumption (also, the power envelope of long-haul modems is higher than the DCI modem you link, up to 70W IIRC), but also resourcing on the ASIC: your MIMO block (which needs to be highly parallel) will take up significant floorspace, and you need to balance the delays.

The 38kW is not a very high number, btw; the switches at the endpoints of submarine links are already quite a bit more power hungry.

Depending on the phase-matching criteria of the lambdas on a given core, I would mostly agree that the various wavelengths are not significantly coupled. I also agree there is a different power budget for LH modems vs. DCI, but power on LH modems is not something that often gets publicly disclosed. I am not too concerned with the overall power, more with the power density (and component density) that 19-channel MIMO would require.

The main point I was trying to make is the impracticality of MIMO SDM. The topic has been discussed to death (see the endless papers from Nokia) and has yet to be deployed because the spatial gain is never worth the real world implementation issues.

Man, when switches are using tens of kilowatts, it must be nice to have cheap water cooling all around!

I think the scaling parameters are a bit different here, since the primary concern is the DSP power for processing and correlating 19 MIMO signals simultaneously. But the 16W figure for a 120km 400Gbps module includes a high-powered¹ transmitter amplifier & laser, as well as receive amplifiers, on top of the DSP. My estimate is based on O(n²) scaling for 19×19 MIMO (=361) and then assuming 2–3W of DSP power per unit factor.

[but now that I think about it… I think my estimate is indeed too low; I was assuming commonplace transceivers for the unit factor, i.e. ≤1 Tb/s; but a petabit on 19 cores is still ~53 Tb/s per core…]

¹ note the setup in this paper has separate amplifiers in 86.1km steps, so the transmitter doesn't need to be particularly high powered.
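Spelling out that back-of-envelope estimate (every number here is an assumption from the guesstimate above, not a measured figure):

    # O(n^2) scaling for full 19x19 MIMO, at an assumed 2-3 W per unit factor.
    n = 19
    mimo_factor = n * n                        # 361
    dsp_estimate_w = [mimo_factor * p for p in (2, 3)]
    print(dsp_estimate_w)                      # [722, 1083] -> roughly 0.7-1.1 kW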

38kW ~= 50 HP ~= 45A at 480V three-phase, which is a relatively light load handled by 3#6 AWG conductors and a #10 equipment ground.
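Those equivalences check out, for what it's worth (assuming unity power factor):

    # Three-phase power: P = sqrt(3) * V_line-to-line * I_line.
    from math import sqrt

    p_watts = sqrt(3) * 480 * 45     # ~37.4 kW from 45 A at 480 V three-phase
    p_hp = p_watts / 745.7           # ~50 horsepower
    print(round(p_watts / 1e3, 1), round(p_hp))   # 37.4 50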

I mean, it’s a shitload more power than a simple media converter that takes in fiber and outputs to a RJ-45 but not all that much compared to other commercial electrical loads. This Eaton/Tripplite unit draws ~40W at 120V - https://tripplite.eaton.com/gigabit-multimode-fiber-to-ether...

A smallish commercial heat pump/CRAC unit (~12kW) can handle the cooling requirements (assuming a COP of 3).

  • bcrl · 8 hours ago
Interesting work, but 19 cores is very much not standard. Multiples of 12 cores are the gold standard in the telecommunications industry. Ribbon fibre is typically 12, sometimes 24 fibres per ribbon, and high-count cables these days are 864 cores or more, using a more flexible ribbon structure that improves density while still using standard tooling.

You're confusing multiple cores within a single cladding with multiple strands, each with its own cladding. This is 19 cores inside a single 125µm cladding (which is quite impressive manufacturing from Sumitomo).
  • bcrl · 4 hours ago
I wasn't confusing anything. To interoperate with industry-standard fibre optic cables it should have a multiple of 12 or 24 cores, not the complete oddball number of 19. Yes, it's cool that it's that small, but that is not the limiting factor in the deployment of long-haul fibre optic telecommunications networks.

Sumitomo sells a lot of fusion splicers at very high margins. It is in their best interest to introduce new types of fibre that require customers to buy new and more expensive fusion splicers. Any fibre built in this way will need rotational alignment that the existing fusion splicers used in telecom do not do (they only align the cores horizontally, vertically and by the gap between the ends). Maybe they can build ribbon fibres where the required alignment is provided by the structure of the ribbon, but I think that is unlikely.

Given that it does not interoperate with any existing cables or splicers, the only place this kind of cable is likely to see deployment in the near term is in undersea cables where the cost of the glass is completely insignificant compared to everything that goes around it and the increased capacity is useful. Terrestrial telecom networks just aren't under the kind of pressure needed to justify the incompatibility with existing fibre optic cables. Data centers are another possibility when they can figure out how to produce the optics at a reasonable cost.

It's impossible for this to interoperate with "normal" fiber; this is "coupled-core" fiber: the cores don't operate independently, and a MIMO-style setup like the paper describes is really the only thing you can do.

Additionally, the setup uses optical amplifiers in 86.1km steps.

Both of these things put this squarely in submarine-cable-only space. Total cable size is in fact a problem in that space; 4-core claddings are roughly the state of the art there, but there's enough room for those to be non-coupled. Higher core counts require research like this.

Either way, you'd be terminating a 19-core setup like this with an OEO (and probably FlexE and/or an OTN solution) at both ends.

While fascinating, I'm still waiting for that transformative move away from electrical. Whichever optical route you take, at the beginning and at the end of it there has to be an electrical conversion, which hinders speed, consumes power and produces (sometimes tons of) heat. Wen optical switching?
  • wmf · 4 hours ago
There's been a ton of research on optical computing and it just isn't impressive.
  • 4 hours ago
yet
  • ksec · 7 hours ago
The actual figure is 1,808 km. For reference, the US is 2,800 miles (4,500 km) wide from east to west, and 1,650 miles (2,660 km) from north to south.

For us Americans, that's about 295,680 toilet paper rolls or 2,956 KDC (kilo donkey kicks).

Or about 3 MAG (mega Ariana Grandes). https://x.com/GatorsDaily/status/1504570772873904130

She is apparently 1.55 m tall; quite short to be called Grande...