> The plane manufacturer says it has found that intense radiation from the Sun could corrupt data crucial to flight controls.
> It’s thought most will be able to undergo a simple software update.
> The issue was discovered after a JetBlue aircraft en-route from Mexico to the United States in October experienced a ‘sudden drop in altitude’.
> The plane made an emergency landing, with reports at the time suggesting 15 to 20 people suffered minor injuries.
> It’s thought the incident was caused by intense solar radiation, which corrupted data in a computer used to help control the aircraft.
Radiation-driven bit flips would be Poisson distributed in time and energy. So that is one way to find out.
If it was really 'solar radiation' there would be more small details.
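A rough sketch of what the timing half of that check could look like, assuming you had a log of upset timestamps (which nobody outside the manufacturer has). For a homogeneous Poisson process the count of events per fixed window has variance roughly equal to its mean, so the ratio of the two should sit near 1. The function names, the rate and the simulated data are all made up for illustration, and this says nothing about the energy distribution.

```python
import random
import statistics

def dispersion_index(event_times, window, total_time):
    """Variance-to-mean ratio of event counts per fixed window (about 1 for Poisson)."""
    n_windows = int(total_time // window)
    counts = [0] * n_windows
    for t in event_times:
        idx = int(t // window)
        if idx < n_windows:
            counts[idx] += 1
    mean = statistics.fmean(counts)
    if mean == 0:
        raise ValueError("no events in the observation period")
    return statistics.pvariance(counts) / mean

def simulate_poisson_events(rate, total_time, rng):
    """Generate event times with exponential gaps (a homogeneous Poisson process)."""
    times, t = [], 0.0
    while True:
        t += rng.expovariate(rate)
        if t >= total_time:
            return times
        times.append(t)

rng = random.Random(0)
# Hypothetical upset log: 2 events per unit time on average, 5000 time units.
upsets = simulate_poisson_events(rate=2.0, total_time=5000.0, rng=rng)
ratio = dispersion_index(upsets, window=1.0, total_time=5000.0)

# A bursty log (every upset repeated three times) is clearly over-dispersed.
bursty = [t for t in upsets for _ in range(3)]
bursty_ratio = dispersion_index(bursty, window=1.0, total_time=5000.0)
```

A ratio well above 1 would point at clustered faults (a flaky component, a software path) rather than independent random hits.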
My concern would be what error correction mechanism did or did not catch the corruption in memory and why did it not recover without critical impact to operations?
This sounds like a software bug.
Something like - {copy a to b, checksum a--b}
Instead of - {copy a to t, checksum a--t, copy t to b, checksum a--b}
I bet the fix is along these lines, with the caveat of real time systems/etc.
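To make the two variants concrete, here is a minimal sketch of the second one, assuming CRC32 as the checksum and byte buffers for a, t and b. This is purely my reading of the pseudocode above, not anything known about the actual flight software; `verified_copy` is a hypothetical name.

```python
import zlib

def verified_copy(src: bytearray, dst: bytearray) -> None:
    """Copy src to dst through a staging buffer, checking the checksum at each hop."""
    expected = zlib.crc32(src)
    tmp = bytes(src)                    # copy a to t
    if zlib.crc32(tmp) != expected:     # checksum a--t
        raise ValueError("staging copy does not match source")
    dst[:] = tmp                        # copy t to b
    if zlib.crc32(dst) != expected:     # checksum a--b
        raise ValueError("destination does not match source")

source = bytearray(b"pitch=1.5;roll=0.2")
target = bytearray(len(source))
verified_copy(source, target)
```

The point of the extra hop is that the live destination is only overwritten with data that has already been checked once.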
The software update is probably more along the lines of 'let's just introduce a watchdog task which resets the system if the output deviates too far from the input for too long'.
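Something like the following, as a toy sketch. The thresholds, the class name and the idea of counting consecutive bad samples are assumptions for illustration; a real-time system would do this against a hardware timer, not in Python.

```python
class DeviationWatchdog:
    """Signal a reset once output has strayed from input for too many samples in a row."""

    def __init__(self, max_deviation: float, max_consecutive: int):
        self.max_deviation = max_deviation
        self.max_consecutive = max_consecutive
        self.bad_samples = 0

    def update(self, commanded: float, actual: float) -> bool:
        """Feed one sample pair; return True when a reset should be triggered."""
        if abs(commanded - actual) > self.max_deviation:
            self.bad_samples += 1
        else:
            self.bad_samples = 0  # a single good sample clears the counter
        return self.bad_samples >= self.max_consecutive

watchdog = DeviationWatchdog(max_deviation=1.0, max_consecutive=3)
outputs = [0.5, 2.0, 2.0, 0.5, 2.0, 2.0, 2.0]
decisions = [watchdog.update(0.0, value) for value in outputs]
```

Here only the last sample trips the reset, because the earlier excursion was interrupted by a good reading.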
Solar radiation event led to alpha particle induced data corruption in a flight control computer memory (could be DRAM, SRAM, on-chip cache, registers...). These failures are supposed to be transient (reboot and all is well).
This is an anticipated failure mode. Only one (of three?) computers should be affected by such a failure and therefore the remaining two keep on running the plane.
But what happened is <something> went wrong with the failover/voting mechanism (as often happens with one-off seldom-executed failover code). The result was no flight control computer functionality until the entire system was rebooted. Hence the emergency landing.
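For anyone who hasn't seen one, a majority voter over redundant channels is conceptually tiny, which is part of why the rarely exercised failover path around it is where bugs hide. A minimal sketch, assuming three channels and exact-match comparison (real voters compare within a tolerance, and the channel count here is the guess from above):

```python
from collections import Counter

def vote(outputs):
    """Return (majority_value, indices_of_dissenting_channels) or raise if no majority."""
    counts = Counter(outputs)
    value, n = counts.most_common(1)[0]
    if n * 2 <= len(outputs):
        raise RuntimeError("no majority among channels")
    dissent = [i for i, v in enumerate(outputs) if v != value]
    return value, dissent

# Channel 2 has a corrupted value; the other two outvote it.
value, faulty = vote([10, 10, 14])
```

The interesting question is what the system does with the dissenting channel afterwards (isolate, reboot, readmit), and that is the part this sketch leaves out.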
The fix is to address that software error, with perhaps a secondary fix TBD to harden the hardware (add some shielding perhaps).
The fact that they talk about data corruption and not just a malfunction suggests alpha bit flip rather than latch-up.
Then send the whole statement through a French to English translator to make it a bit more confusing.
There is a slightly different level of discipline and engineering ethics at play.
Considering those units were designed back when they did not have EDAC mandated, I can believe it could have been a bit flip (along with some other stuff they will probably address to take into consideration this failure mode). Nowadays, most MCUs have ECC on them so the time of this excuse is mostly gone now. :)
That's kind of a misleading statement. I assume you mean planes being built nowadays, because we can clearly see that planes still flying today (6K of them at least) still have issues. We don't need hand-wavy comments trying to make it sound like modern-day aviation is no longer susceptible, especially in a thread on an article showing that's just not true.
That even though they’re in widespread operation today, the aircraft types in question were designed (and certified) many years ago, before ECC was the norm. My impression is that, once their type is certified, new airframes are built to pretty much exactly that specification even all these years later.
Yes, that's my point. Just because new aircraft are designed with improved hardware does not automatically mean the issue is resolved industry wide. Existing equipment will still have issues. So the statement is misleading. Is the number of aircraft with ECC "most" of the equipment in the skies?
[1] Technically EDAC is the correct name of the whole subsystem, and ECC is the name of the algorithm. But I've only heard it referred to as ECC in my industry. I was even initially confused when I read EDAC, so TIL.
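For a feel of what the ECC part does, here is a textbook Hamming(7,4) code, which corrects any single flipped bit in a 7-bit word. This is an illustrative sketch only; I have no idea which code any given avionics memory uses, and real memories typically use wider SECDED codes with an extra parity bit so that double flips are detected instead of miscorrected.

```python
def hamming74_encode(nibble: int) -> int:
    """Encode 4 data bits into 7 bits; bit i of the result is codeword position i+1."""
    d = [(nibble >> i) & 1 for i in range(4)]
    # Positions: 1=p1, 2=p2, 3=d0, 4=p4, 5=d1, 6=d2, 7=d3
    p1 = d[0] ^ d[1] ^ d[3]  # covers positions 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # covers positions 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]  # covers positions 5, 6, 7
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]
    return sum(b << i for i, b in enumerate(bits))

def hamming74_decode(word: int):
    """Return (nibble, corrected_position); position is 0 when no error was seen."""
    bits = [(word >> i) & 1 for i in range(7)]
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos - 1]:
            syndrome ^= pos  # XOR of set-bit positions is 0 for a valid codeword
    if syndrome:
        bits[syndrome - 1] ^= 1  # flip the single bit the syndrome points at
    nibble = bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)
    return nibble, syndrome

stored = hamming74_encode(0b1011)
upset = stored ^ (1 << 4)            # simulate a bit flip at position 5
recovered, position = hamming74_decode(upset)
```

The detection and scrubbing logic wrapped around a code like this is the "EDAC" part.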
You would pretty much be logging, every millisecond, the minimum, maximum and mean voltage for every 1ms period (and the same for current).
Then any failing solid state relay would be obvious in the collected data, far before you start to get word corruption!
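The per-window summary itself is trivial, roughly like this sketch. The sample rate, window size and voltage values are made up; on real hardware this would run in firmware or an FPGA rather than Python.

```python
def summarize_windows(samples, samples_per_window):
    """Return a (min, max, mean) tuple for each consecutive window of samples."""
    summaries = []
    for start in range(0, len(samples), samples_per_window):
        window = samples[start:start + samples_per_window]
        summaries.append((min(window), max(window), sum(window) / len(window)))
    return summaries

# Hypothetical bus voltage samples, three samples per 1 ms window.
voltage = [1.0, 2.0, 3.0, 4.0, 5.0, 9.0]
stats = summarize_windows(voltage, samples_per_window=3)
```

A relay on its way out would show up as a widening gap between min and max long before the mean moves.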
https://www.pprune.org/rumours-news/669424-airbus-a320-recal...
[1] https://www.researchgate.net/publication/26587285_Challenges...
There wasn’t a software fix per se, but we were able to quickly add a check to verify that the Kalman Filter’s position variance estimate was on the same order of magnitude as the accuracy level that the receivers were reporting and put a big red warning up. This wasn’t a flight-critical system, but it is the first time we’d ever seen that behaviour from those receivers and we’ve used them for 5 years.
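The check was conceptually along these lines. This is a reconstruction for illustration, not the actual code: the function name, the units (metres) and the one-decade threshold are all assumptions.

```python
import math

def accuracy_consistent(filter_variance_m2: float,
                        reported_accuracy_m: float,
                        max_decades: float = 1.0) -> bool:
    """True if the filter's position sigma is within max_decades orders of
    magnitude of the accuracy the receiver claims."""
    filter_sigma_m = math.sqrt(filter_variance_m2)
    return abs(math.log10(filter_sigma_m / reported_accuracy_m)) <= max_decades

# Filter sigma of 2 m against a claimed 1.5 m: plausible.
normal = accuracy_consistent(4.0, 1.5)
# Filter sigma of 50 m against a claimed 0.5 m: raise the big red warning.
suspicious = accuracy_consistent(2500.0, 0.5)
```

Comparing on a log scale keeps the check insensitive to the normal factor-of-two disagreements between the two estimates.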
Without going too far into the weeds, the fact that the receivers in question were reporting high accuracy under uncertainty is definitely a software bug in the receivers from my perspective. There was a different receiver with a completely different chipset in it on-site too that was experiencing similar issues but was reporting low accuracy. Without going into too much detail, I’ve got pretty good reasons to believe it wasn’t spoofing.
I don't work on the A320 but solar radiation is a well-known issue in avionics, generally speaking.
Edit: deleted some speculation
Now imagine if it were an over-the-air update; then maybe there would be no disruption?
I believe it could be solar radiation, but I also believe that solar radiation could be a catch-all for unexplained phenomena.
Unless they had total component failure, it's most likely localized, and if you create redundancy like RAID you may be able to counter whatever they are seeing as a failure mode. Or at least reduce the likelihood of impact on the flight, giving them time to replace components on the ground.
> But EasyJet says it has already completed the required software update and is planning on operating its flights as normal on Saturday
There was a television show (episode) about another design issue (which was fatal) some time ago: https://en.wikipedia.org/wiki/Air_France_Flight_447
Quoting your link, "Final Report" section:
> Temporary inconsistency between the measured speeds, likely as a result of the obstruction of the pitot tubes by ice crystals, caused autopilot disconnection and [flight control mode] reconfiguration to "alternate law (ALT)".
- The crew made inappropriate control inputs that destabilized the flight path.
- The crew failed to follow appropriate procedure for loss of displayed airspeed information.
- The crew were late in identifying and correcting the deviation from the flight path.
- The crew lacked understanding of the approach to stall.
- The crew failed to recognize the aircraft had stalled, and consequently did not make inputs that would have made recovering from the stall possible.
Note the numerous "the crew"
Accident studies and, in particular, books like _Normal Accidents_[1] push back on this assumption:
"... It made the case for examining technological failures as the product of highly interacting systems, and highlighted organizational and management factors as the main causes of failures. Technological disasters could no longer be ascribed to isolated equipment malfunction, operator error, or acts of God."
It is well accepted - and I believe - that there were a multitude of operator errors during the Air France 447 flight but none of them were unpredictable or exotic and the system they were tasked with operating was poorly designed and unhelpfully hid layers of complexity that suddenly re-emerged during tremendous "production pressure".
But don't take my word for it - I appeal to authority[2]:
"Automation dependent pilots allowed their airplanes to get much closer to the edge of the envelope than they should have ..."[3].
or:
@ 14:15: "... we see automation dependent crews, lacking confidence in their own ability to fly an airplane are turning to their autopilot ..."[4].
[1] https://en.wikipedia.org/wiki/Normal_Accidents
[2] Captain Vanderburgh
[3] Children of Magenta: https://www.youtube.com/watch?v=dTwB94yOrRQ
There is a design flaw though: the sidesticks in modern Airbus planes are independent, so the other pilot didn’t get any tactile feedback when the second officer was pulling back.
[1] https://safetyfirst.airbus.com/app/themes/mh_newsdesk/docume...
Unfortunately, sometimes they also fail in ways that even a trained crew isn't able to recover the aircraft. That could be a failure that wasn't anticipated, training that was inadequate, design flaws, the human element, you name it. Actions of the crew being put in an accident report isn't an assignment of blame, it's a statement of facts - the recommendations that come from those facts are all that matters.
Taking this with a grain of salt since it's from a movie, but one of the things about Sully setting the plane down in the river was that it came from his experience of not just the aircraft itself but also the situational awareness to realize he was too low to safely divert to an airport. He instinctively "skipped" several steps in the procedures to engage the APU, which turned out to be pretty key. The implication being that the procedure was so long that they might not have gotten to the APU in time going step-by-step.
Part of the sales pitch of the Airbus is that the computer does A LOT of handholding for the pilots. In many configurations, including the one that the plane was flying in at the start of the incident, the inputs that caused the crash would have been harmless.
In that incident the airspeed feed was lost to the computer and it literally changed the flight controls and turned off the safety limits, and none of the three people in the cockpit noticed. When an Airbus changes flight control modes, it does not keep inputs idempotent. Something harmless under one set of "laws" could crash the plane under another set of laws. In this case, what the pilot with the working control stick was doing would not have caused a crash, except that the computer had taken off the training wheels without anyone noticing.
As a result of changing the primary controls one pilot was able to unintentionally place the plane in an unrecoverable state without the other pilots even noticing that he was making control inputs.
Tack on that the computer intentionally disregarded the stall warning emanating from the AOA sensor as erroneous at a certain point and did not alert the pilots that the plane was stalled. You are taught from day one of flight training that if you hear the stall alarm you push the power in, and push the nose down until the alarm stops. In this case the stall warning came on, and then as the stall got worse, it turned itself off, with the computer under the mistaken belief that the plane could not actually be that far stalled. So the one alarm that they are trained to respond to in a certain way to recover the plane from a stall was silenced. If I was flying and I heard the stall alarm, then heard it stop, I would assume that I was no longer stalled, not that the plane was so far stalled that the stall alarm was convinced it had broken itself.
So yes, the pilots flew the aircraft into the ground, but the computer suffered a partial failure and then changed how the primary flight controls operated.
Imagine if the brake pedal, steering wheel, and accelerator all started responding to inputs differently when your car had a sensor issue. That causes the cruise control to fail. Add in that the cruise control failure turns off ABS, auto-brakes, lane assist, and stability control for some reason. Oh yeah, there's a steering control on the other side of the car on the armrest and the person sitting there can now make steering inputs, but it won't give feedback in your steering wheel, and also your steering wheel still can be manipulated when the other guy is steering, but it is completely disconnected from the tires while the other guy is steering. All of the controls are also more sensitive now, and allow you to do things that wouldn't have been possible a few seconds ago. Also, it's a storm in the middle of the night, so you don't have a good visual reference for speed. So now your car is slipping, at night, in a storm, lights are flashing everywhere, nothing makes sense since the instruments are not reading correctly. However, the car is working exactly as described in the manual. When the car ends up in a ditch, the investigation will find that the cause of the crash was driver error since the car was operating exactly as it was designed.
Worth noting that Boeing (and just about every other aircraft on earth) has linked flight controls between the two pilots' positions that always behave in the exact same way so this type of failure could have never happened on a 737 for example.
At the end of the day, this was pilot error, but more in a "You're holding it wrong, I didn't design it wrong" kind of way. After all, there were three people with a combined 20k flying hours, including thousands of hours in that design.
If three extremely qualified pilots that have literal years of experience in that cockpit, who are rigorously trained and tested on a regular basis for emergencies in that cockpit, can fly the thing into the ground due to a cascade from a single human error... maybe the design of the user interface needs a look.
You also conveniently skipped over the parts of the wikipedia article where they charged the manufacturer with manslaughter, and documented dozens of similar incidents, and the entire section outlining the Human Computer Interface concerns.
Just to be clear, I’m not faulting Airbus. I take issues with the shallow snark at Boeing. The JetBlue incident was serious.
Airbus isn’t immune to controversies, like AF447 or the Habsheem air show crash in 1988.
On the other hand, the software development practices were slow to modernize in many cases e.g. FORTRAN 66 (but eventually with a preprocessor).
Surprisingly Google couldn't handle the typo: Habsheim. https://en.wikipedia.org/wiki/Air_France_Flight_296Q
3 dead, 133 survivors, started (undisputedly) with pilots intentionally and with approval trying to fly a plane full of passengers 30 meters off the ground, possibly with safety systems intentionally disabled for demonstration purposes.
Past that, there are disputes about what kept them from applying enough power and elevator to get away from the trees that they were flying towards, including allegations of the black box data being swapped/faked.
Testing is (should be!) extremely robust, but only tested to the required parameters. If this incident put the subsystem in some atmospheric conditions nobody expected and nobody tested for, that does not suggest that the entire QA chain was garbage. It was a missed case -- and one that I expect would be covered going forward.
Aviation systems are not tested to work indefinitely or infinitely -- that's impossible to test, impossible to prove or disprove. You wouldn't claim that some subsystem works (say, for a quick easy example) in all temperature ranges; you would define a reasonable operational range, and then test to that range. What may have happened here is that something occurred outside of an expected range.
Has anybody kept count of "fly by wire" failures in aircraft?
It fills me with dread that a computer programme is between the pilot's controls and the control surfaces.
I am amazed that it works at all.