Glasses-free 3D display with ultrawide viewing range using deep learning

SBP-utilization analysis
Owing to inherent SBP scarcity, existing 3D display approaches have been forced into static compromises, each emphasizing specific aspects at the expense of others in their display outcomes (see Supplementary Table 1 and Supplementary Information for more analysis and comparison details). Holographic displays, for instance, preserve complete 3D reconstruction by significantly compressing the displayed light field to centimetre-scale regions (about 1–2 cm²), ensuring wide-angle, high-quality optical content but becoming practically unscalable30. By contrast, automultiscopic displays maintain common display sizes (about 0.1–0.2 m²) more suitable for natural viewing scenarios but must limit their effective viewing angles. Within this category, view-dense solutions use multilayer architectures to provide continuous and realistic optical generation at the cost of highly restricted viewing zones.
Alternatively, view-segmented solutions achieve broad, horizontal viewing angles using single-panel optics21,23,51,52 to discretely spread out available SBP, sacrificing the stereo parallax across vertical and radial dimensions as well as the focal parallax, a loss of full parallax that inevitably compromises immersion and visual comfort37,40.
Fundamentally, the limited practicality of these existing approaches arises from their passive use of scarce SBP, attempting to statically accommodate various viewing scenarios simultaneously. These static approximations inherently conflict with the extreme scarcity of SBP itself, and this remains unaltered even with AI enhancement (Supplementary Table 2). Recognizing this scientific constraint, it becomes clear that a proactive, dynamic use of limited SBP is necessary, that is, using optical resources precisely where they are most crucially needed at each moment. In practice, this means reconstructing accurate binocular light fields around target eye positions, as binocular parallax is the essential basis for human depth perception. Notably, this dynamic model does not rely on eye tracking to synthesize virtual disparities as is commonly done in conventional eye-tracked systems, as these systems respond only to instantaneous viewpoint positions, with responses typically exhibiting significant errors due to tracking noise and random eye movements.
Instead, the rational and effective solution here requires the accurate and consistent generation of real physical light fields for both binocular viewpoints and their neighbourhoods, with eye tracking primarily serving to guide directional delivery rather than generating virtual content severely dependent on tracking precision. Although SBP, in principle, supports this localized generation, it remains challenging to precisely adapt optical output to arbitrary and extensive views within the neighbourhood of the eyes. To address this, we develop a physically accurate binocular geometric modelling and a deep-learning-based mathematical model that enable real-time computation of light-field outputs. To this end, EyeReal precisely adapts optical output to arbitrary binocular positions within an extensive viewing range, validated by a light-field delivery setup featuring large-scale imaging, wide-angle viewing and full-parallax attributes.
This dynamic SBP-utilization strategy thereby makes the long-desired glasses-free 3D display achievable.
Eye camera modelling and calibration
Given an ocular position in the light-field coordinate system, we use the pinhole camera model (Supplementary Fig. 3) to simulate the retinal imaging process of the light field. In general, we align the centre of the screen with the centre of the light field where the object is located, and by default, the eye is directed towards the centre of the light field, which is the origin of the coordinate system. For standardization, we define the z-axis of the camera model to be opposite to the direction of sight. Moreover, to simulate normal viewing conditions, we stipulate that the x-axis of the camera is parallel to the ground on which the object is situated, consistent with the relative position of the observer and the object in the same world.
Consequently, the y-axis of the eye camera is the normal to the plane formed by the z- and x-axes.
We first obtain the relative ocular positions captured by the RGB-D camera. To transfer the eye positions into the light-field coordinate system, we obtain their two-dimensional (2D) pixel coordinates by using a lightweight face detector. Combining the inherent camera intrinsic parameters and the detected pixel-level depth information, we can obtain the 3D coordinates of the eyes in the camera coordinate system. For one eye, this process can be formulated by
$$\left(\begin{array}{c}x_{\mathrm{c}} \\ y_{\mathrm{c}} \\ z_{\mathrm{c}}\end{array}\right)=z_{\mathrm{c}}\left(\begin{array}{ccc}f_{x} & 0 & c_{x} \\ 0 & f_{y} & c_{y} \\ 0 & 0 & 1\end{array}\right)^{-1}\left(\begin{array}{c}u_{\mathrm{e}} \\ v_{\mathrm{e}} \\ 1\end{array}\right)$$
(4)
where ue and ve are the pixel-wise positions of the eye; (cx, cy) is the optical centre of the image, which represents the projection coordinates of the image plane centre in the camera coordinate system; fx and fy are the focal lengths of the camera in the x-axis and y-axis directions; and xc, yc and zc represent the transformed camera coordinates.
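As a concrete illustration, the back-projection in equation (4) can be written in a few lines of NumPy; the function and variable names below are ours, and this is only a minimal sketch rather than the authors' implementation.

```python
# Minimal sketch of equation (4): back-projecting a detected eye pixel (u_e, v_e)
# with its measured depth z_c into 3D camera coordinates. The intrinsics
# fx, fy, cx, cy and all names are illustrative assumptions.
import numpy as np

def eye_pixel_to_camera(u_e, v_e, z_c, fx, fy, cx, cy):
    K = np.array([[fx, 0.0, cx],
                  [0.0, fy, cy],
                  [0.0, 0.0, 1.0]])
    pixel_h = np.array([u_e, v_e, 1.0])       # homogeneous pixel coordinates
    return z_c * np.linalg.inv(K) @ pixel_h   # (x_c, y_c, z_c)
```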
Then comes the alignment from the real-world eye coordinates to the digital light-field world. Given the fixed spatial configuration between the camera and the display setup, this alignment reduces to estimating a projection matrix $M_{\mathrm{c}}=(A_{\mathrm{c}}\mid t_{\mathrm{c}})\in\mathbb{R}^{3\times 4}$, which transforms coordinates from the camera to the light field. Exploiting the reversibility of light paths that is characteristic of autostereoscopy, we design a simple and convenient calibration method (Extended Data Fig. 1). We select N calibration points in the light-field coordinate system that also lie within the visual field of the RGB-D camera. We replace the light-field images corresponding to the viewpoints with calibration marks (Supplementary Fig. 4) and provide them as input to the neural network to generate the corresponding layered patterns.
Because the patterns form the best stereo effect only at the input viewpoint, when the viewer, looking with one eye from a certain angle at the screen of the hardware device, sees the rectangles overlap completely (at which point the superposed colour is also at its deepest), the current 3D eye-camera coordinates $c_{i}\in\mathbb{R}^{3}$ captured by the camera and the world coordinates $w_{i}\in\mathbb{R}^{3}$ of the calibration point form a one-to-one correspondence. We solve for Mc using least squares regression (Supplementary Fig. 5) based on K pairs of corresponding calibration points
$$A_{\mathrm{c}},t_{\mathrm{c}}=\mathop{\arg\min}\limits_{A_{\mathrm{c}}\in\mathbb{R}^{3\times 3},\,t_{\mathrm{c}}\in\mathbb{R}^{3}}\sum_{i=1}^{K}\left\Vert A_{\mathrm{c}}c_{i}^{\mathrm{T}}+t_{\mathrm{c}}-w_{i}^{\mathrm{T}}\right\Vert_{2}^{2}$$
(5)
where $c_{i}\in\mathbb{R}^{3}$ and $w_{i}\in\mathbb{R}^{3}$ denote the ith calibration point in the camera and light-field coordinate systems, respectively. Once Mc is obtained, the eye position in the light-field coordinate system, Pe, is computed by homogeneous transformation
$$(P_{\mathrm{e}},1)^{\mathrm{T}}=M_{\mathrm{c}}\left(\begin{array}{c}x_{\mathrm{c}} \\ y_{\mathrm{c}} \\ z_{\mathrm{c}} \\ 1\end{array}\right)=\left(\begin{array}{cc}A_{\mathrm{c}} & t_{\mathrm{c}} \\ 0 & 1\end{array}\right)\left(\begin{array}{c}x_{\mathrm{c}} \\ y_{\mathrm{c}} \\ z_{\mathrm{c}} \\ 1\end{array}\right)$$
(6)
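The calibration in equations (5) and (6) amounts to an affine least-squares fit; a minimal NumPy sketch, with assumed array shapes and helper names, is given below.

```python
# Minimal sketch of equations (5) and (6): fit the map (A_c, t_c) from K matched
# camera-frame calibration points to their light-field coordinates, then apply it
# to a new eye position. Shapes and function names are assumptions.
import numpy as np

def fit_camera_to_lightfield(C, W):
    """C, W: (K, 3) arrays of matched points in camera / light-field coordinates."""
    K_pts = C.shape[0]
    C_h = np.hstack([C, np.ones((K_pts, 1))])    # (K, 4): append 1 for the translation term
    M, *_ = np.linalg.lstsq(C_h, W, rcond=None)  # (4, 3), rows = [A_c^T; t_c^T]
    A_c, t_c = M[:3].T, M[3]                     # A_c: (3, 3), t_c: (3,)
    return A_c, t_c

def camera_to_lightfield(p_cam, A_c, t_c):
    # Equation (6) without the explicit homogeneous form.
    return A_c @ p_cam + t_c
```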
Eye–light-field correspondence
According to the geometric conventions of the above eye camera model, we can calculate the projection matrix Me = (Re|te) from the constructed eye camera coordinate system to the light-field coordinate system. As shown in Extended Data Fig. 2a, the centre of the screen is the origin O of the light-field coordinate system. For general cases, we assume that all the coordinate systems are right-handed and the ground plane is parallel to the xOy plane. We can directly obtain a pair of trivial vectors rz and rx along the z-axis and x-axis, respectively, from their special positional relation. In detail, the z-axis of the eye camera coordinate system is the OPe direction, and the x-axis is parallel to the xOy plane of the light-field coordinate system
$$\mathbf{r}_{z}=\mathbf{OP}_{\mathrm{e}},\quad \mathbf{r}_{x}=\mathbf{Oz}\times\mathbf{r}_{z},\quad \mathbf{r}_{y}=\mathbf{r}_{z}\times\mathbf{r}_{x}$$
(7)
The rotation matrix from the light-field coordinate system to the eye camera can be constructed from the unit vectors of these three trivial vectors as its column vectors
$$R_{\mathrm{e}}=\left(\frac{\mathbf{r}_{x}}{\Vert\mathbf{r}_{x}\Vert_{2}},\frac{\mathbf{r}_{y}}{\Vert\mathbf{r}_{y}\Vert_{2}},\frac{\mathbf{r}_{z}}{\Vert\mathbf{r}_{z}\Vert_{2}}\right)^{\mathrm{T}}$$
(8)
Here, ∥⋅∥p denotes the ℓp vector norm applied to these trivial vectors, and the translation vector is simply the eye position
$$t_{\mathrm{e}}=\mathbf{OP}_{\mathrm{e}}=\mathbf{r}_{z}$$
(9)
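Equations (7)–(9) can be assembled directly from the eye position; the sketch below (NumPy, with illustrative names) follows this construction.

```python
# Minimal sketch of equations (7)-(9): build the eye-camera rotation R_e and
# translation t_e from the eye position P_e in the light-field frame, keeping the
# camera x-axis parallel to the ground (xOy) plane. Names are assumptions.
import numpy as np

def eye_camera_pose(P_e):
    r_z = np.asarray(P_e, dtype=float)               # z-axis along O -> P_e
    r_x = np.cross(np.array([0.0, 0.0, 1.0]), r_z)   # x-axis: light-field z-axis x r_z
    r_y = np.cross(r_z, r_x)                         # y-axis completes the right-handed frame
    # Rows of R_e are the unit camera axes expressed in light-field coordinates.
    R_e = np.stack([v / np.linalg.norm(v) for v in (r_x, r_y, r_z)], axis=0)
    t_e = r_z                                        # translation is the eye position itself
    return R_e, t_e
```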
Then we project the light-field images corresponding to the binocular viewing onto each layer plane (Supplementary Fig. 6). Under the predefined FOV of the eye camera with an H × W-pixel imaging plane, we first derive the focal length fpix in pixel units
$$f_{\mathrm{pix}}=\frac{\max(H,W)}{2\tan(\mathrm{FOV}/2)}$$
(10)
The screen planar positions $\mathbb{P}_{n}:=\{(x_{i},y_{i},z_{i})\}|_{i=1}^{4}$ are treated as hyperparameters. For convenience, we define the dimension of depth to be parallel to one axis of the light field, namely the x-axis shown in Fig. 2b, so that we can determine xi by
$$x_{i}=\frac{n-1}{N-1}(d_{\mathrm{near}}-d_{\mathrm{far}})+d_{\mathrm{far}}$$
(11)
where the index of pattern planes n ∈ {1, …, N}; dnear and dfar denote the nearest and farthest depth of the light field, respectively. We can determine the relative coordinates $\mathbb{P}_{n}^{\prime}:=\{(x_{i}^{\prime},y_{i}^{\prime},z_{i}^{\prime})\}|_{i=1}^{4}$ at each eye camera corresponding to the four corner points of each pattern plane:
$$(x_{i}^{\prime},y_{i}^{\prime},z_{i}^{\prime},1)^{\mathrm{T}}=\left(\begin{array}{cc}R_{\mathrm{e}} & t_{\mathrm{e}} \\ 0 & 1\end{array}\right)^{-1}(x_{i},y_{i},z_{i},1)^{\mathrm{T}}$$
(12)
Based on equation (4), their pixel coordinates are calculated as
$$u_{i}^{\prime}=-\frac{f_{\mathrm{pix}}x_{i}^{\prime}}{z_{i}^{\prime}}+\frac{W}{2},\quad v_{i}^{\prime}=\frac{f_{\mathrm{pix}}y_{i}^{\prime}}{z_{i}^{\prime}}+\frac{H}{2}$$
(13)
Here, the minus sign compensates for the opposite x-axis directions of the two coordinate systems, and we let cx = W/2 and cy = H/2 for general cases. The 2D differences of the new corner coordinates $\mathbb{Q}_{n}^{\prime}:=\{(u_{i}^{\prime},v_{i}^{\prime})\}|_{i=1}^{4}$ from the original positions $\mathbb{Q}_{n}:=\{(0,0),(W,0),(W,H),(0,H)\}$ denote the imaging offsets. In this way, we can establish the equations for the eight unknowns of the perspective transformation based on these four corner pairs. The solved transformation matrix represents the 2D correspondence from the patterns to the eyes.
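A minimal sketch of this corner projection and perspective-transform solve (equations (10)–(13)) is shown below; the use of OpenCV's getPerspectiveTransform and all variable names are our assumptions, and the sketch follows the geometric intent rather than the exact matrix conventions of the text.

```python
# Minimal sketch of equations (10)-(13): project the four corners of one pattern
# plane into the eye camera and solve the resulting 2D perspective transform from
# the four corner correspondences (eight unknowns).
import numpy as np
import cv2

def pattern_to_eye_homography(corners_lf, R_e, t_e, fov_deg, H, W):
    """corners_lf: (4, 3) corner points of a pattern plane in light-field coordinates."""
    f_pix = max(H, W) / (2.0 * np.tan(np.radians(fov_deg) / 2.0))   # equation (10)

    # Express the corners in the eye-camera frame (cf. equation (12)); rows of R_e
    # are assumed to be the unit camera axes in light-field coordinates.
    corners_cam = (corners_lf - t_e) @ R_e.T

    # Project to pixel coordinates (equation (13)); the minus sign flips the x-axis.
    u = -f_pix * corners_cam[:, 0] / corners_cam[:, 2] + W / 2.0
    v = f_pix * corners_cam[:, 1] / corners_cam[:, 2] + H / 2.0
    Q_new = np.stack([u, v], axis=1).astype(np.float32)

    Q_orig = np.array([[0, 0], [W, 0], [W, H], [0, H]], dtype=np.float32)
    return cv2.getPerspectiveTransform(Q_orig, Q_new)  # 3x3 homography
```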
Neural network architecture
The ocular geometric encoding warps the view images from each eye camera onto multilayer screens based on binocular poses, establishing geometrically unified normalized projections. The network input is this set of normalized planar warpings at multilayer depths. Each warping represents the expected luminous intensity under a single viewpoint only. The network decomposes these warpings into phase values at each depth through the expectation space of binocular views, which can be regarded as the inverse process of equation (3) during a single period. As the backlight source is uniformly illuminated, the light-field variation can be mapped to a finite integral of phases within a period. This makes the inverse decomposition expressible in a differentiable hidden space built by successively applying a set of learned 3 × 3 convolutional kernels, ensuring that the pixel-level phase arrangement not only meets the expectation through one viewpoint but also remains independent across the binocular viewpoints.
The nonlinear activation used in the network (that is, rectified linear unit or ReLU) further filters out negative phase components through intermediate non-negative screening during the forward propagation.
For the specific design, the network is a fully convolutional architecture. It comprises an initial input layer, followed by five downsampling blocks and five corresponding upsampling blocks and concludes with a final output layer. Each block consists of two convolutional layers, all using uniform 3 × 3 convolution kernels. During downsampling, max pooling is applied to expand the receptive field, whereas bilinear interpolation is used in the upsampling stage to restore spatial resolution. To enable residual learning, skip connections are established between convolutional layers of matching spatial dimensions across the downsampling and upsampling paths. The input layer is configured to accept binocular RGB images, resulting in a six-channel input. To maintain computational efficiency, the number of channels at the input layer is set to 32, with the channel width increasing progressively in the downsampling layers according to the formula 32 × 2i, where i denotes the index of the downsampling block.
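For concreteness, a minimal PyTorch sketch of such a fully convolutional encoder–decoder is given below; the output head, exact block composition and other unstated details are assumptions, and this is not the authors' released code.

```python
# Minimal sketch of the described architecture: six-channel binocular input, five
# downsampling and five upsampling blocks of 3x3 convolutions with ReLU, max pooling,
# bilinear upsampling and skip connections; channel widths follow 32 * 2**i.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBlock(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.body(x)

class EyeRealNetSketch(nn.Module):
    def __init__(self, n_layers=3, base=32, depth=5):
        super().__init__()
        chans = [base * 2 ** i for i in range(depth + 1)]   # 32, 64, ..., 1024
        self.inc = ConvBlock(6, base)                        # binocular RGB -> 6 channels
        self.down = nn.ModuleList(ConvBlock(chans[i], chans[i + 1]) for i in range(depth))
        self.up = nn.ModuleList(
            ConvBlock(chans[i + 1] + chans[i], chans[i]) for i in reversed(range(depth)))
        self.outc = nn.Conv2d(base, 3 * n_layers, 1)         # assumed head: RGB pattern per layer

    def forward(self, x):
        feats = [self.inc(x)]
        for block in self.down:
            feats.append(block(F.max_pool2d(feats[-1], 2)))  # downsample, expand receptive field
        y = feats[-1]
        for block, skip in zip(self.up, reversed(feats[:-1])):
            y = F.interpolate(y, size=skip.shape[-2:], mode="bilinear", align_corners=False)
            y = block(torch.cat([y, skip], dim=1))           # skip connection at matching scale
        return torch.sigmoid(self.outc(y))                   # layered patterns in [0, 1]

# Usage (display resolution): patterns = EyeRealNetSketch()(torch.randn(1, 6, 1080, 1920))
```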
By capitalizing on the swift advancements in graphics processing unit (GPU) computing, the neural architecture embedded with these lightweight elements can execute computations orders of magnitude faster.
Structured loss optimization
Beyond the proposed physics-based mathematical model, the optimization objectives of the AI model must be carefully designed for accurate light-field approximation. We divide the structured loss design into three parts for multi-faceted constraints. The basic loss function gauges the consistency of the aggregated image formed by the superposition of light paths from each viewpoint based on the predicted hierarchical phase maps. Here, we model the data fidelity by the ℓ1 norm, whose sparsity aids in recovering high-frequency phase details at the edges and contours of the light field, whereas its insensitivity to outliers helps prevent overfitting to specific viewpoints53.
In detail, we calculate the element-wise difference between the aggregated result $I^{\prime}\in\mathbb{R}^{S_{k}\times C}$ from the predicted patterns and the expected ocular light intensity $I\in\mathbb{R}^{S_{k}\times C}$, where C denotes the RGB channels, and we use Sk for simplicity to denote the value of the emitted cross-sectional area Ft ∩ dk. The basic loss can be formulated as
$$\mathcal{L}_{\mathrm{intensity}}=\frac{1}{S_{k}}\sum_{\rho^{\prime}\in I^{\prime},\,\rho\in I}\Vert\rho^{\prime}-\rho\Vert_{1}$$
(14)
where ∥⋅∥p denotes the ℓp vector norm applied to the pixel-wise luminous intensity vector $\rho^{\prime}\in\mathbb{R}^{C}$ and its matching ground truth ρ.
The normalized planar warpings as inputs reflect only perspective light intensities from individual viewpoints, and merely enforcing intensity consistency with the ground truth cannot effectively constrain the mutual exclusivity between the binocular views. View-specific information from one eye inevitably leaks into the other as noise, which can be mitigated through mutual-exclusion constraints. Following the structural assessment of image quality54, the second loss function physically considers the local contrast and structure of the emitted light field and takes their product as the overall mutual-exclusion measurement, which should approach 1. This second loss $\mathcal{L}_{\mathrm{mutex}}$ can be formulated as
$$\mathcal{L}_{\mathrm{mutex}}=1-\left(\frac{2\sigma_{I}\sigma_{I^{\prime}}+\xi_{1}}{\sigma_{I}^{2}+\sigma_{I^{\prime}}^{2}+\xi_{1}}\right)^{p}\left(\frac{\sigma_{II^{\prime}}+\xi_{2}}{\sigma_{I}\sigma_{I^{\prime}}+\xi_{2}}\right)^{q}\underset{\xi_{1}=2\xi_{2}=\xi}{\overset{p=q=1}{=}}\frac{\sigma_{I}^{2}+\sigma_{I^{\prime}}^{2}-2\sigma_{II^{\prime}}}{\sigma_{I}^{2}+\sigma_{I^{\prime}}^{2}+\xi}$$
(15)
which establishes a connection between the variances $\sigma_{I},\sigma_{I^{\prime}}$ and the covariance $\sigma_{II^{\prime}}$ of the aggregated result and the target image. Here, p and q represent the relative importance of contrast and structure, respectively. We set them both equal to 1, arguing that these two aspects should be weighted equally. ξ is a small constant that prevents a zero denominator; for simplicity, we assume ξ1 = 2ξ2 = ξ. By constraining the differences in pixel distribution and fluctuation trends in local regions of both images, the phase approximation for the current viewpoint becomes attentive to the noise artefacts coming from the other viewpoint and smooths and erases them.
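A minimal PyTorch sketch of the intensity loss (equation (14)) and the simplified mutual-exclusion loss (equation (15)) follows; for brevity the statistics are computed globally rather than over local windows, which is a simplifying assumption, and all names are ours.

```python
# Minimal sketch of equations (14) and (15) with p = q = 1 and xi_1 = 2*xi_2 = xi.
import torch

def intensity_loss(I_pred, I_gt):
    # Mean l1 difference between aggregated and target luminous intensities.
    return (I_pred - I_gt).abs().mean()

def mutex_loss(I_pred, I_gt, xi=(0.003 * 1.0) ** 2):   # xi = (kL)^2 with k = 0.003, L = 1
    mu_p, mu_g = I_pred.mean(), I_gt.mean()
    var_p = ((I_pred - mu_p) ** 2).mean()
    var_g = ((I_gt - mu_g) ** 2).mean()
    cov = ((I_pred - mu_p) * (I_gt - mu_g)).mean()
    # Simplified right-hand side of equation (15).
    return (var_p + var_g - 2.0 * cov) / (var_p + var_g + xi)
```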
Owing to the periodicity of the light phase, model training admits infinitely many trivial solutions that do not generalize, because optimization can fall into locally optimal fits. Therefore, in the early stage of model training, we calculate the frustum element-wise difference $\mathcal{L}_{\mathrm{lowfreq}}$ from pure black patterns, which form the starting point of the first positive period, forcing the model to converge within the lowest-frequency representation space so that the phase diagram of the layered patterns also conforms to the RGB distribution. The auxiliary loss function can be written as
$$\mathcal{L}_{\mathrm{lowfreq}}=\frac{\alpha}{\sum_{i=1}^{k}S_{i}}\sum_{d\in D}\sum_{\phi_{d}\in\varPhi_{d}}\Vert\phi_{d}\Vert_{1}$$
(16)
Here, Φd represents the total phase set of all light paths in the intersection area of the current frustum field Ft and the planar depth d. The auxiliary regularization term will be multiplied by a factor α that decays exponentially with training time as
$$\alpha:=\begin{cases}10^{1-4\gamma} & 0<\gamma\le r \\ 0 & r<\gamma\le 1\end{cases}$$
(17)
where γ is the proportion of the current iteration to the total number of iterations. The factor reaches 0 at a preset early step, completely removing the suppression so that training is not affected in the middle and late stages of convergence.
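A minimal sketch of the auxiliary term (equation (16)) and its decaying weight (equation (17)) is given below; the per-frustum normalization is collapsed to a mean, which is a simplifying assumption, and the names are ours.

```python
# Minimal sketch of equations (16)-(17): an l1 pull of the predicted phases towards
# the all-black (zero-phase) pattern, weighted by an exponentially decaying alpha.
import torch

def alpha_schedule(step, total_steps, r=0.3):
    gamma = step / total_steps               # fraction of training completed
    return 10.0 ** (1.0 - 4.0 * gamma) if gamma <= r else 0.0

def lowfreq_loss(phase_patterns, step, total_steps, r=0.3):
    # phase_patterns: tensor of predicted layered phases for the current frustum.
    return alpha_schedule(step, total_steps, r) * phase_patterns.abs().mean()
```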
Ablation studies of these optimization components with visualizations are conducted to further understand how EyeReal functions in optical display and its underlying behaviours. The basic loss emphasizing intensity consistency proves crucial for maintaining fidelity, whereas the exclusivity-measure loss enhances structural consistency by improving resistance to noise from the other viewpoint (Extended Data Fig. 3a). Additionally, we visualized the ablation results of the low-frequency regularization loss (Extended Data Fig. 3b), showing its effectiveness in guiding the network to focus on universal phase distributions rather than counterintuitive overfitting patterns. Further visualization of the network-computed phase patterns at various depths shows distinct depth-aligned highlights in each pattern (Extended Data Fig. 3c). These attentive areas show that the neural network with structured optimization has accurately learnt an effective representation of local depth information, which is consistent with the physical depth structure of the light field.
Light-field dataset construction
A key requirement for a light-field dataset suitable for realistic viewing lies in the inclusion of stereo camera pairs captured from varying viewpoints while focusing on the same spatial point. However, these data characteristics are not directly available in existing public light-field datasets, as their multi-view data are limited to pixel-level differences, which fail to adequately simulate the way human eyes perceive scenes. To ensure the robust generalization and effectiveness of our learning-based mathematical model across diverse real-world viewing scenarios, we meticulously developed a large-scale dataset characterized by complexity and diversity. As the basis for generalization, the foundational component of our dataset focuses on capturing a broad spectrum of object geometries and appearances. We incorporated a large assortment of geometrically rich and uncommon objects from uCO3D55 for its distinctive variety of object collections.
We curated a collection of 3,000 diverse objects and generated 500 stereo image pairs per object, as the priority for generalization is the number of scenes involving different objects rather than the number of viewpoints. This part serves as a robust basis for ensuring diversity in colour, texture and shape. To further enrich the complexity and scale of the dataset, we integrated additional selected representative scenes from relevant studies48,49,50,56,57,58,59,60 and online resources, each comprising thousands of stereo image pairs. These supplemental scenes greatly broaden the environmental complexity and spatial scales of the dataset, covering scenarios ranging from synthetic virtual environments to real-world captures. The scenes vary substantially in scale, encompassing intricate room-level interiors and expansive city-level landscapes. Moreover, they exhibit diverse lighting conditions and reflective materials, including indoor artificial illumination, outdoor natural lighting and scenarios with dim or subdued illumination.
Experimental results validate that the model trained with this rigorously constructed dataset achieves remarkable generalization ability, including various unseen scenes and unknown head poses, maintaining inference speed and output quality without any notable compromise.
We develop a data preparation approach for light fields to achieve a more appropriate viewing simulation, and we use the polar coordinate system in 3D space to facilitate data configuration. For general cases, people stand facing the screen for viewing, ensuring that the line connecting their eyes remains parallel to the ground, thus perpendicular to the shorter side of the screen. As shown in Extended Data Fig. 2b, we define the screen-to-eye direction as the x-axis, sample multiple depth planes along this axis, and designate the horizontal and vertical axes as the y-axis and z-axis, respectively. We initiate a front viewpoint cloud shaped like a truncated frustum, in which each point signifies the midpoint between the eyes.
The distance from the centre to this midpoint is denoted as R, the angle from the midpoint to the z-axis as φ and the angle to the y-axis as θ. Thus, assuming the interpupillary distance is d, we can derive the coordinates of both eyes in the Cartesian coordinate system as follows:
$$r=\sqrt{(R\sin\varphi)^{2}+(d/2)^{2}},\quad \delta=\arctan\frac{d}{2R\sin\varphi}$$
(18)
$$x_{r}=r\sin(\theta-\delta),\quad x_{l}=r\sin(\theta+\delta)$$
(19)
$$y_{r}=r\cos(\theta-\delta),\quad y_{l}=r\cos(\theta+\delta)$$
(20)
$$z_{r}=z_{l}=R\cos\varphi$$
(21)
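A minimal NumPy sketch of this binocular sampling (equations (18)–(21)) follows; the default interpupillary distance and the function name are illustrative assumptions.

```python
# Minimal sketch of equations (18)-(21): convert a sampled mid-point viewpoint
# (R, phi, theta) and interpupillary distance d into left/right eye positions.
import numpy as np

def binocular_positions(R, phi, theta, d=0.065):   # d in metres; default is an assumed value
    r = np.sqrt((R * np.sin(phi)) ** 2 + (d / 2.0) ** 2)      # equation (18)
    delta = np.arctan2(d, 2.0 * R * np.sin(phi))
    right = np.array([r * np.sin(theta - delta),               # equations (19)-(21)
                      r * np.cos(theta - delta),
                      R * np.cos(phi)])
    left = np.array([r * np.sin(theta + delta),
                     r * np.cos(theta + delta),
                     R * np.cos(phi)])
    return left, right
```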
Owing to scenario-specific variations in dataset acquisition and inconsistencies in the spatial dimensions of light-field display subjects, both the scaling factor that maps the physical world to the digital light-field domain and the longitudinal thickness of the light-field volume exhibit significant variability. Specifically, we denote the scaling factor as s, which converts the physical screen width of the light field to its corresponding digital representation, and the physical depth extent of the light field as dthick, which varies with subject distance across scenes (Supplementary Table 3). Before this scaling, we applied a compensation matrix Mcomp to each scene to standardize the orientation of the reconstructed light fields (Supplementary Table 4). This transformation realigns the originally unstructured coordinate systems such that the principal viewing axis of the target object consistently faces the positive x-direction.
Training and implementation details
The network was trained on our constructed light-field dataset using 32 NVIDIA Tesla A800 GPUs for 40 epochs. A learning rate warm-up strategy is used during the first epoch, followed by a cosine decay schedule for the remaining training period. The batch size is set to eight, comprising four object-level and four scene-level samples in each batch to preserve a balanced learning signal across both fine-grained and global spatial contexts. Given the relatively smaller size of the scene-level dataset, it is cyclically reused once fully traversed to ensure continued exposure and a balanced contribution to the optimization process.
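A minimal sketch of such a warm-up-plus-cosine schedule in PyTorch is shown below; the optimizer, base learning rate and per-epoch stepping are assumptions not specified in the text.

```python
# Minimal sketch of the described schedule: linear warm-up over the first epoch,
# then cosine decay over the remaining epochs.
import math
import torch

def lr_lambda(epoch, warmup_epochs=1, total_epochs=40):
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs                     # linear warm-up
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))          # cosine decay towards zero

# Usage (assumed optimizer and base learning rate):
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# ... after each epoch: scheduler.step()
```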
To capture the diversity of real-world 3D structures and enhance the generalization ability of the model, we construct a training corpus that integrates both object-level and scene-level data under heterogeneous geometric and photometric conditions. Specifically, we randomly sample 3,000 object-level scenes from the uCO3D dataset and include 15 additional scene-level environments reconstructed from publicly available sources. The validation set comprises 150 unseen object-level instances and 2 unseen scene-level environments. For the training dataset, each object-level scene is rendered into 500 stereo image pairs from diverse, randomly sampled viewpoints. Each scene-level environment contributes 1,500 stereo pairs, resulting in a wide coverage of spatial configurations and view-dependent visual appearances. In the validation dataset, each object-level instance is rendered into 20 stereo pairs, whereas each scene-level environment contributes 1,500 pairs, yielding a total of 6,000 stereo images for evaluation.
For the ablation study, we curated a training set of 6,000 stereo pairs spanning 150 object-level instances with 20 pairs each and 6 scene-level environments with 500 pairs each. We trained three model variants, each using only the intensity loss, only the mutual-exclusion loss and a combination of both, and evaluated them quantitatively on the validation set. To evaluate generalizability, we constructed equivalent datasets from identical scenes but with perturbed head poses. Random perturbations of up to ±10° were applied independently across yaw, pitch and roll axes, introducing pose diversity to simulate realistic viewing variations. For global-scale spatial performance comparison, we constructed an image dataset with 3,000 pairs across multiple distance–orientation combinations. For the IVD benchmark, we sampled 1,400 pairs at 20 cm intervals from 10 cm to 150 cm. For the NVD benchmark, we categorized 1,600 pairs by viewing angles, including frontal and oblique perspectives and distances across four intervals spanning 30–130 cm.
We designated 30–70 cm as the near range and 90–130 cm as the far range.
For human eyes, the region beyond 30° from the fixation point is called peripheral vision (commonly described as seeing out of the corner of the eye), to which the human eye is relatively insensitive. Therefore, when we build the eye camera model, we set its FOV to 40° to achieve a better sense of visual presence. We set φ ∈ (60°, 120°), θ ∈ (40°, 140°) and R ∈ (0.3, 1.5) in metres to adapt to normal viewing situations. We use an efficient neural rendering approach59 to generate abundant training data from 3D targets. For the binocular localization part, we use the lightweight face detector61 built into OpenCV to obtain each eye position. The variation constant ξ of the mutual-exclusion loss, which avoids system errors caused by zero denominators, is formulated as ξ = (kL)2, where k = 0.003 and L denotes the dynamic range of pixels, normalized to 1.
The suppression cancellation time ratio of r in the low-frequency loss is set to 0.3. All experiments are evaluated on inputs with a resolution of 1,920 × 1,080 pixels, and we use a single NVIDIA RTX 4090 as the algorithm execution GPU for practical inference.
Hardware design of the display system
The display prototype for real-world demonstration (Extended Data Fig. 7) uses a BOE TFT-LCD with a resolution of 1,080 × 1,920 as the screen used for light-field display, and the pitch of one LCD pixel is 0.27 mm. The effective physical imaging area is 518.4 mm × 324 mm, and the actual physical size is 528 mm × 337.9 mm, with a manufacturing error of ±0.7 mm. We attached orthogonally oriented polarizing films to the front of the frontmost screen and the back of the rearmost screen to generate a polarized light field. The screen uses a white light source as the backlight source. The RGB-D camera we use is the Microsoft Xbox Kinect V2. Its colour camera has a resolution of 1,920 × 1,080, and the depth camera has a resolution of 512 × 424 with a depth measurement range of 0.5–4.5 m.
For the hardware, we use acrylic plates 5 mm thick to fix and align each screen, and aluminium profiles as the load-bearing structure. The conceptual display for demonstration uses N = 3 LCD screens with a 3-cm layer spacing, transmits the imaging information using the HDMI (high-definition multimedia interface) protocol and runs on a single NVIDIA RTX 4090 GPU.