What you are describing technologically is passthrough (aka VST, Video See Through) and it’s only a small component of Mixed Reality.
The real challenges of Mixed Reality are the features that help merge the real world into VR in order to augment it. Things like controller-free interaction (hand tracking, eye tracking, voice recognition), environment understanding (scene mapping, recognizing walls to project holograms onto, detecting obstacles for navigation and creating collision meshes), unbounded spatial location (creating persistent anchors in space and being able to locate your device beyond the preset “guardian” or room scale) and sharing of the experience (observer cameras and scene graphs to share your AR experience with other devices, whether they are also headsets or maybe just a 2D screen).
These functionalities should not be underestimated: without them, your Mixed Reality experience will be extremely poor (if your device only supports video see-through and none of the features above, you won’t have AR). Seeing your surroundings without the ability for an application to accurately project content onto the environment is not that useful. Being able to move around without the ability to interact isn’t that useful. Not being able to leave your predefined “play area” without getting a warning and having to reconfigure it is quite annoying. Being alone in your experience and not able to invite other people to see what you are experiencing is going to get boring and lonely.
These challenges are an order of magnitude harder to solve, for both device vendors and application developers, than just the passthrough bit.
These features aren’t really must-haves for VR, and most VR devices today do not implement them (the exception being the controller-free interaction features, which are more and more common, but mostly for comfort rather than necessity). Hence I don’t quite agree with your statement that MR is nothing but VR with extra layers of visuals.
Sure, the Quest 2 supports some of these features. So do other devices like HoloLens or Magic Leap. These are still areas where it’s very hard to create good applications, and there is still poor commonality between vendors: legacy APIs like OpenVR support none of these features. Before OpenXR, each of them needed a dedicated SDK that wasn’t portable from one device to another. With OpenXR, some of them are in the process of being implemented cross-vendor, but the road ahead is still very long.
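To make the portability point a bit more concrete, here is a rough Python sketch of the kind of capability check an app ends up doing. The extension names are real OpenXR extensions, but the feature grouping is my own simplification, and the two runtime lists (`RUNTIME_A`, `RUNTIME_B`) are made up for illustration rather than taken from any actual device. In a real application the list would come from `xrEnumerateInstanceExtensionProperties`.

```python
# Illustrative mapping from MR features to OpenXR extensions that can
# expose them. The grouping is an assumption and far from exhaustive.
FEATURE_EXTENSIONS = {
    "passthrough": {"XR_FB_passthrough"},
    "hand tracking": {"XR_EXT_hand_tracking"},
    "eye tracking": {"XR_EXT_eye_gaze_interaction"},
    "spatial anchors": {"XR_MSFT_spatial_anchor", "XR_FB_spatial_entity"},
    "scene understanding": {"XR_MSFT_scene_understanding", "XR_FB_scene"},
}

# Made-up extension lists standing in for what two runtimes might report.
RUNTIME_A = ["XR_EXT_hand_tracking", "XR_FB_passthrough", "XR_FB_spatial_entity"]
RUNTIME_B = [
    "XR_EXT_hand_tracking",
    "XR_EXT_eye_gaze_interaction",
    "XR_MSFT_spatial_anchor",
    "XR_MSFT_scene_understanding",
]


def missing_features(runtime_extensions):
    """Features for which the runtime reports none of the known extensions."""
    supported = set(runtime_extensions)
    return [
        feature
        for feature, candidates in FEATURE_EXTENSIONS.items()
        if not candidates & supported
    ]


def portable_features(*runtimes):
    """Features reachable through the same extension on every runtime."""
    portable = []
    for feature, candidates in FEATURE_EXTENSIONS.items():
        common = set(candidates)
        for extensions in runtimes:
            common &= set(extensions)
        if common:
            portable.append(feature)
    return portable


if __name__ == "__main__":
    print("Missing on A:", missing_features(RUNTIME_A))
    print("Missing on B:", missing_features(RUNTIME_B))
    print("Portable across A and B:", portable_features(RUNTIME_A, RUNTIME_B))
```

With these made-up lists, both runtimes offer spatial anchors, but through different vendor extensions, so only hand tracking comes out as portable. That is the situation an app developer has to code around with device-specific paths.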
The story for middleware is also quite far behind. The best engine today for supporting all these features is probably Unity, but there are still areas where devices like Quest 2 and HoloLens 2 cannot be supported well without writing device-specific code. This is a huge blocker for app developers, who cannot create and distribute consistent experiences for all platforms.
VR is pretty much known and solved at this point. All you’re going to get now is “better performance”, “better optics”, “better battery life”.
There are a lot of really hard problems that still need to be solved to provide some of the basic features of AR.