علوم وتكنولوجيا

An integrated view of structure and function of human 4D nucleome

GAM methods and data processing

Preparation of cryosections

H1 cells were fixed and processed for cryosectioning as described previously140. In brief, H1 cells were grown to 70% confluency, the medium was removed and cells were fixed in 4% and 8% paraformaldehyde in 250 mM HEPES-NaOH (pH 7.6; 10 min and 2 h, respectively), gently scrapped and softly pelleted, before embedding (>2 h) in saturated 2.1 M sucrose in PBS and frozen in liquid nitrogen on copper sample holders. Frozen samples were stored in liquid nitrogen. Ultrathin cryosections were cut using a Leica ultracryomicrotome (UltraCut EM UC7, Leica Microsystems) at approximately 220 nm thickness, captured on sucrose-PBS drops and transferred to 4 µm PEN steel frame slide for laser microdissection (Leica Microsystems, 11600289). Sucrose embedding medium was removed by washing with 0.2-μm-filtered molecular biology grade PBS (3 × 5 min each) and filtered ultrapure water (5 min).

For laser microdissection, cryosections on PEN membranes were washed, permeabilized and incubated (2 h, room temperature) in blocking solution (1% BSA (w/v), 5% FBS (w/v, Gibco 10270), 0.05% Triton X-100 (v/v) in PBS). After incubation (overnight, 4 °C) with primary anti-pan-histone (1:50) antibody (Merck, MAB3422) in blocking solution, the cryosections were washed three to five times for 30 min in 0.025% Triton X-100 in PBS (v/v) and immunolabelled (1 h, room temperature) with secondary antibodies in blocking solution, followed by three (15 min) washes in PBS.

Isolation of NPs

Nuclear staining was visualized using a Leica laser microdissection microscope (Leica Microsystems, LMD7000) using a ×63 dry objective. Individual nuclear profiles (NPs) were laser microdissected from the PEN membrane, and collected into PCR adhesive caps (AdhesiveStrip 8C opaque; Carl Zeiss Microscopy 415190-9161-000). GAM data were collected in multiplexGAM mode, where three NPs are collected into each adhesive cap. The presence of NPs in each lid was confirmed with a ×5 objective using a 420–480 nm emission filter. Control lids not containing NPs (water controls) were included for each dataset collection to keep track of contamination and noise amplification of whole genome amplification and library reactions. Collected NPs were kept at −20 °C until whole-genome amplification.

WGA

Whole genome amplification (WGA) was performed as described previously141 with minor modifications. In brief, DNA was extracted from NPs at 60 °C in the lysis buffer (20 mM Tris-HCl pH 8.0, 1.4 mM EDTA, 560 mM guanidinium-HCl, 3.5% Tween-20, 0.35% Triton X-100) containing 0.75 U ml−1 Qiagen protease (Qiagen, 19155). After 24 h of DNA extraction, the protease was heat-inactivated at 75 °C for 30 min and the extracted DNA was amplified by two rounds of PCR. First, quasi-linear amplification was performed with random hexamer GAT-7N primers with an adaptor sequence. The lysis buffer containing the extracted genomic DNA was mixed with 2× DeepVent mix buffer (2× Thermo polymerase buffer (10×), 400 µm dNTPs, 4 mM MgSO4 in ultrapure DNase-free water), 0.5 µM GAT-7N primers (5′-GTGAGTGATGGTTGAGGTAGTGTGGAGNNNNNNN) and 2 U µl−1 DeepVent (exo-) DNA polymerase (New England Biolabs, M0259L) and incubated for 11 cycles in the BioRad thermocycler.

The second exponential PCR amplification was performed in presence of 1x DeepVent mix, 10 mM dNTPs, 0.4 µM GAM-COM primers (5′-GTGAGTGATGGTTGAGGTAGTGTGGAG) and 2 U µl−1 DeepVent (exo-) DNA polymerase in the programmable thermal cycler for 26 cycles. WGA was performed in 96-well plates using Microlab STARLine liquid handling workstation (Hamilton).

Preparation of GAM libraries for high-throughput sequencing

After WGA, the samples were purified using the SPRI magnetic beads (1.7× ratio of beads per sample volume). The DNA concentration of each purified sample was measured using the Quant-iT PicoGreen dsDNA assay kit (Invitrogen, P7589). Sequencing libraries were then made using the in-house tagmentation-based protocol. After library preparation, the DNA concentration for each sample was measured using the Quant-iT PicoGreen dsDNA assay, and equal amounts of DNA from each sample was pooled together. The final pool of libraries was analysed using DNA High Sensitivity on-chip electrophoresis on an Agilent 2100 Bioanalyzer and sequenced on Illumina NextSeq 500 machine.

GAM data sequence alignment

Sequenced reads from each GAM library were mapped to the human genome assembly GRCh38 (December 2013, hg38) with bowtie2 (v.2.3.4.3) using the default settings. All non-uniquely mapped reads, reads with mapping quality <20 and PCR duplicates were excluded from further analyses.

GAM data window calling and sample quality control

Positive genomic windows present within ultrathin nuclear slices were identified for each GAM library as previously described141. In brief, the genome was split into equal-sized windows, and the number of nucleotides sequenced in each bin was calculated for each GAM sample with bedtools. Next, we determined the percentage of orphan windows (that is, positive windows that were flanked by two adjacent negative windows) for every percentile of the nucleotide coverage distribution. The number of nucleotides that corresponds to the percentile with the lowest percentage of orphan windows in each sample was used as an optimal coverage threshold for window identification in each sample. Windows were called positive if the number of nucleotides sequenced in each bin was greater than the determined optimal threshold.

The sample quality was assessed by the percentage of orphan windows in each sample, total genomic coverage in percent of positive windows, the number of uniquely mapped reads to the mouse genome and the correlations from cross-well contamination for every sample. Each sample was considered to be of good quality if it had ≤40% orphan windows, ≤60% of total genome coverage, >50,000 uniquely mapped reads and a cross-well contamination score determined per collection plate of <0.4 (Jaccard index).

GAM data curation

To exclude genomic windows which were under- or oversampled in the GAM collection, we computed a GAM-specific parameter, the window detection efficiency (WDF)31 as previously described142. To detect genomic bins with outlying detection frequency, a smoothing algorithm was applied to the WDF values per chromosome in stretches of eleven equally sized genomic windows. Next, normalized delta (ND) was calculated for each window, according to: ND = (raw_Signal − smoothed_Signal)/smoothed Signal. If the ND was larger than a fold change of 5, the window was removed from the final dataset. Next, the four adjacent windows (2 upstream and 2 downstream) were also removed, to ensure good quality of sampling in the final GAM data used for further analyses.

Genomic bins with an average mappability score below 0.2 were also removed. Genome mappability for the hg38 human genome assembly was computed using GEM-Tools suite143 setting read length to 75 nucleotides. The mean mappability score was computed for each genomic bin with bigWigAverageOverBed utility from Encode.

GAM data normalization

GAM contact matrices for all pairs of windows genome-wide were generated as previously described to produce pairwise co-segregation maps and pointwise mutual information (NPMI) maps that consider window detectability141. For visualization of the contact matrices, scale bars are adjusted in each genomic region displayed to a range between 0 and the 99th percentile of NPMI values in the region.

GAM insulation scores

The insulation scores were calculated from NPMI-normalized pairwise chromatin contact matrices, as previously described25 with minor modifications adjusted for GAM input data by keeping both positive and negative values in the matrix141.

Identification of compartments A and B in GAM data

Compartments were calculated using co-segregation matrices, as previously described142. In brief, each chromosome was represented as a matrix of observed interactions O(i,j) between locus i and locus j. The expected interactions E(i,j) matrix was calculated, where each pair of genomic windows is the mean number of contacts with the same distance between i and j. A matrix of observed over expected values O/E(i,j) was produced by dividing O by E. A correlation matrix C(i,j) was calculated between column i and column j of the O/E matrix. PCA was performed for the first three components on matrix C before extracting the component with the highest correlation to GC content. Loci with eigenvector values with the same sign were designated as A compartments, whereas those with the opposite sign were identified as B compartments.

For chromosomes 3 and 22, we manually picked PC1 and PC2, respectively, as the PC that correlated most with GC content did not display a typical AB compartmentalization pattern and good correlation with transcription levels.

Data processing for Hi-C, Micro-C, ChIA-PET, PLAC-seq and SPRITE

Cooler files for Hi-C, Micro-C, ChIA-PET, PLAC-seq and SPRITE were downloaded from the DCIC Data Portal (links to all data are provided in the Supplementary Information). Contact matrices were normalized using the iterative correction procedure from a previous study144. Interaction heat maps were created using Python. The colour map is ‘YlOrRd’ and the colour scales are created taking the 10th and 90th percentile of the interaction frequencies of individual datasets. No additional processing was applied to GAM data.

HiCRep correlations

HiCRep is used to calculate distance-corrected correlations of the multiple methods145. Correlation is calculated in two steps. First, interaction maps are stratified by genomic distances and the correlation coefficients are calculated for each distance separately. Second, the reproducibility is determined by a novel stratum-adjusted correlation coefficient statistic by aggregating stratum-specific correlation coefficients using a weighted average. Chromosome-specific correlation was performed for pairwise protocols and the correlations across those chromosomes were then averaged. Averaged pairwise correlations of chromosomes 1–22 and X were used to calculate the HiCRep correlations between Hi-C, Micro-C, ChIA-PET, PLAC-seq and SPRITE datasets; averaged correlations of chromosomes 1–22 were used to calculate HiCRep correlations between GAM and all other datasets. We used 50-kb binned interaction matrices to calculate HiCRep correlations.

Compartment analysis

Compartments were assessed for Hi-C, Micro-C, ChIA-PET, PLAC-seq and SPRITE using eigenvector decomposition on observed-over-expected contact maps at 100-kb resolution using the cooltools package derived scripts146. An eigenvector that has the strongest correlation with gene density is selected, then A and B compartments were assigned based on the gene density profiles such that A compartment has high gene density and B compartment has low gene density profile. A and B compartment assignments of GAM were provided by the data producers.

Spearman correlation was used to correlate the eigenvectors of different experiments performed with various protocols and cell states. Saddle plots were generated as follows49: the interaction matrix of an experiment was sorted based on the eigenvector values from lowest to highest (B to A). Sorted maps were then normalized for their expected interaction frequencies; the top left corner of the interaction matrix represents the strongest B–B interactions, the bottom right represents the strongest A–A interactions, the top right and bottom left are B–A and A–B, respectively. To quantify saddle plots, we took the strongest 20% of BB and strongest 20% of AA interactions and normalized them by the sum of AB and BA (top(AA)/(AB + BA) and top(BB)/(BA + AB)). Saddle quantifications were used to create the scatter plots.

The parameters that were used for the saddle plot are as follows: –strength, –vmin 0.5, –vmax 2, –regions hg38_chromsizes.bed, –qrange 0.02 0.98, –contact-type cis.

Preferential interactions

Bigwig or bedgraph files for lamin B1 DamID-seq, TSA–seq and Repli-seq were downloaded from the DCIC Data Portal (a full table of links is provided in the Supplementary Information).

Heat maps that integrate 3D methods with genome activity plots were generated as follows: first, the data were binned into 50-kb bins for aforementioned assays and sorted from the highest to the lowest value. Additional filters were applied for early/late replication ratio. For early/late replication timing data, bins with no values and the bins with value of 0 were removed, and the outlier bins that have values >98th quantile were also removed and the minimum value for the first bin was kept as 0.

Second, the interaction matrices (Hi-C, Micro-C, ChIA-PET, PLAC-seq, SPRITE and GAM) are sorted based on the 1D tracks generated from the aforementioned assays from the highest to the lowest.

Next, sorted matrices were then normalized for their expected interaction frequencies; the upper left corner of the interaction matrix represents the strongest signal for non-preferential interactions, lower right represents strongest preferential interactions. To quantify these plots we took the strongest 20% of the preferential interactions. Saddle plot parameters are listed below for this quantifications: –strength, –vmin 0.5, –vmax 2, –regions hg38_chromsizes.bed, –qrange 0.02 0.98, –range min(Sorted 1D data) max(Sorted 1D data) –contact-type cis. For GAM, –strength, –vmin 0.1, –vmax 0.4, –regions hg38_chromsizes.bed, –qrange 0.02 0.98, –range min(Sorted 1D data) max(Sorted 1D data) –contact-type cis.

Insulation score

For Hi-C, Micro-C, ChIA-PET, PLAC-seq and SPRITE, we calculated diamond insulation scores using cooltools (https://github.com/open2c/cooltools/blob/master/cooltools/cli/insulation.py) as implemented previously25. We defined the insulation and boundary strengths of each 25-kb bin by detecting the local minima of 25-kb binned data with a 100-kb window size. We used cooltools’s diamond-insulation function with the following parameters: –ignore-diags 2. Insulation scores of GAM were provided by data producers. We separated weak and strong log2-transformed insulation scores using the mean insulation score of all protocols (that is, weak insulation scores < mean < strong insulation scores). We piled up strong insulation scores to compare the average insulation score strengths of methods.

Identification of chromatin loops in different platforms

We used different strategies for detecting chromatin loops in different platforms.

For Hi-C and Micro-C, we combined results from HiCCUPS23 and Peakachu63. To identify chromatin loops using HiCCUPS, we ran cooltools dots (v.0.5.1)146 at 5-kb and 10-kb resolutions with the default parameters. Peakachu is a machine-learning based framework that learns contact patterns of pre-defined chromatin loops from a genome-wide contact map and applies trained models to predict loops on other maps generated by the same/similar experimental protocol. Here, we first trained Peakachu models on GM12878 Hi-C data at 2-kb, 5-kb and 10-kb resolutions, using a high-confidence loop set detected by at least two platforms among Hi-C, CTCF ChIA-PET, Pol2 ChIA-PET, CTCF HiChIP, H3K27ac HiChIP, SMC1A HiChIP, H3K4me3 PLAC-seq and TrAC-loop.

These models were then used to predict chromatin loops on Hi-C and Micro-C maps of H1 and HFFc6 cell lines at corresponding resolutions. The probability cut-offs were manually adjusted to balance sensitivity and specificity based on visual inspection.

For ChIA-PET, we combined loop predictions from ChiaSig147 and Peakachu. For each ChIA-PET dataset, we conducted multiple runs of ChiaSig with varying parameter settings, specifically adjusting the -M, -C and -c parameters while keeping other parameters constant (-m 8000 -S 4 -s 6 -A 0.01 -a 0.1 -n 1000). The -M value was selected from 1000000, 2000000 and 4000000, while both the -C and -c values were set to either 2 or 3. Only chromatin loops consistently identified across all parameter settings were retained, while others were discarded. As ChiaSig heavily relies on 1D peak annotation for loop detection, chromatin interactions outside peak regions are not identified as loops. To capture loops with similar contact patterns to those detected by ChiaSig but outside peak regions, we again used Peakachu to learn the patterns. For each ChIA-PET dataset, we trained 23 Peakachu models using interactions detected by ChiaSig, with each model trained on data from different combinations of 22 chromosomes.

During prediction, loops on each chromosome were predicted using the model trained on the other 22 chromosomes. The probability cut-offs were determined to ensure that Peakachu-predicted loops covered 90% of ChiaSig-detected interactions. Training and prediction were conducted separately at 2-kb and 5-kb resolutions, and the final loop predictions for each ChIA-PET dataset were obtained by combining ChiaSig-predicted interactions and Peakachu predictions.

For PLAC-seq, we combined outputs from MAPS148 and a pipeline similar to that used for ChIA-PET. For MAPS, loops were identified at 10-kb resolution using the default parameters. For the ChiaSig-Peakachu pipeline, loop prediction was conducted at 5-kb resolution, using the same training, prediction and filtering strategies as described above for ChIA-PET.

To calculate the union of loops identified from different platforms, methods and resolutions, two chromatin loops (i, j) and (i′, j′) were considered overlapping if and only if |i − i′| < min(0.2|i − j |, 15 kb) and |j − j′| < min(0.2|i − j|, 15 kb). If two loops were overlapping, only the one predicted at a higher resolution (that is, with more precise coordinates) was retained.

Calculation of overlap between capture Hi-C interactions and loops from other assays

For each capture Hi-C interaction (i, j), if there exists a loop (i′, j′) identified by another assay (one of Hi-C, Micro-C, CTCF ChIA-PET, Pol2 ChIA-PET or H3K4me3 PLAC-seq) such that the Euclidean distance between (i, j) and (i′, j′) is less than min(0.3|i − j|, 80 kb), we define (i, j) as overlapping with a loop from the corresponding assay.

Consensus chromatin-state annotations for H1 and HFFc6 cells

We computed epigenomic annotations using ChromHMM (v.1.23) on 14 observed and 2 imputed ChIP–seq datasets for 8 marks (H3K36me3, H3K4me1, H3K27ac, H3K9ac, H3K3me3, H3K4me2, H3K27me3 and CTCF) in both H1 and HFFc6 cells. All the ChIP–seq datasets were obtained from the WashU Epigenome Browser (https://epigenome.wustl.edu/epimap/data/) in bigwig format, and the coordinates were transformed from hg19 to hg38 using CrossMap (v.0.5.2, http://crossmap.sourceforge.net/).

To prepare the data for ChromHMM, we divided the whole genome into 200-bp bins and calculated the average signals within each bin. For the observed data, values were binarized with a −log10(P) value cut-off of 2. For the imputed data (H3K9ac and H3K4me2 in HFFc6), we downloaded both the imputed and observed data in H1 cells for the same marks. Then, for each mark, we set the binarization cut-off for the imputed data to match the quantile in the observed data corresponding to the −log10(P) > 2 cut-off, enabling comparison with the observed data.

Finally, we ran the ChromHMM LearnModel command on the binarized data to segment both the H1 and HFFc6 genomes into 12 states. The name of each state was manually annotated based on prior knowledge about each mark. The 12_Heterochrom state was excluded from further analysis, as it did not contain signals of any marks (Supplementary Fig. 2a).

Enrichment analysis of chromatin states for chromatin loops and loop anchors

To characterize the chromatin states of loop anchors detected by specific combinations of chromatin interaction methods, we calculated fold-enrichment scores by comparing the overlap with each ChromHMM state between the observed loop anchors and 100 randomly generated control sets. Specifically, for each chromatin state, we iterated through the loop anchor list and counted the number of anchors overlapping at least one region with that state. We then randomly shuffled the anchors in the genome to generate 1,000 control sets and repeated the same procedure for each control. For each control, we kept the size distribution and the number of random regions on each chromosome the same as the observed loop anchors, and the intervals of each region did not overlap with any gaps in the hg38 reference genome.

Finally, the fold-enrichment score was calculated by dividing the number of anchors with a specific chromatin state by the average number of random loci with the same state.

We used a similar method to characterize chromatin states for a specific cluster of chromatin loops. In brief, for each pair of chromatin states, we iterated through the loop list and counted the number of loops with one anchor overlapping regions marked by one chromatin state and the other anchor overlapping regions marked by the other chromatin state. Again, we generated 1,000 random control sets for chromatin loops. Each random loop set maintained the same genomic distance distribution between loop anchors and the same number of random loops on each chromosome, ensuring that the interval between the two ends of each loop did not overlap any gaps in hg38. Finally, the fold-enrichment score was calculated by dividing the number of loops between a specific pair of chromatin states by the average number of random loops between the same states.

Enrichment analysis of TFs for different loop clusters

To explore whether different loop clusters exhibit differential binding of various TFs at their anchors, we downloaded the ENCODE ChIP–seq peak files for 62 TFs in H1 cells. A fold-enrichment score was computed for each TF at loop anchors using a procedure analogous to the one described above. In brief, we first identified non-redundant loop anchors from each loop cluster in H1 cells. For each TF, we iterated through this anchor list and counted the number of anchors overlapping at least one ChIP–seq peak. Subsequently, we generated 1,000 random control sets by shuffling the loops and repeated the same procedure for each control set. The fold-enrichment score was then calculated by dividing the number of anchors containing ChIP–seq peaks by the average number of random loci containing ChIP–seq peaks for the same TF.

APA

To evaluate the overall enrichment of chromatin loop signals in contact maps, we performed APA. For a given list of chromatin loops or interactions, we extracted contact frequencies from 21 × 21 submatrices centred at the 2D coordinates of each loop. Each submatrix was normalized by dividing each value by the average contact frequency at the corresponding genomic distance. To reduce the influence of outliers, submatrices with average signal intensities above the 99th percentile or below the 1st percentile were excluded. The average signal at each position was then computed across all retained submatrices and visualized as a heat map. The APA score shown on the plot is defined as the signal intensity at the centre pixel of the aggregated heat map.

UMAP projection and clustering of chromatin loops

To construct the input feature matrix for projecting chromatin loops, we calculated the proportion of each ChromHMM state at the loop anchors in each cell line. This yielded a feature matrix Mij of size 141,365 × 22 for H1 cells and 146,140 × 22 for HFFc6 cells. Each row represents a chromatin loop, with the first 11 columns corresponding to features from one anchor and the next 11 columns corresponding to features from the other anchor.

Next, we standardized (z-score normalized) the matrix Mij. For each row in the normalized matrix, we swapped the order of the two anchors, if necessary, to ensure that the highest value appeared in the first 11 columns. The resulting matrix served as input for UMAP projection (https://github.com/lmcinnes/umap), using the parameters “n_neighbors=60, min_dist=0, n_components=2, metric=euclidean” for H1 cells, and “n_neighbors=65, min_dist=0, n_components=2, metric=euclidean” for HFFc6 cells.

For loop clustering, we first applied principal component analysis (PCA) to the swapped feature matrix, resulting in a transformed matrix of the same shape. We then constructed multiple k-nearest neighbour (k-NN) graphs using values of k ranging from 50 to 2,000 in steps of 50. The Leiden algorithm was applied to each graph for community detection, with the resolution parameter set to 0.5. Consensus clusters were derived by integrating clustering results across all k-NN graphs.

Calculation of the average contact strength for different loop clusters

To calculate the average contact strength for each loop cluster across different experimental platforms, we used distance-normalized (observed/expected) contact frequencies. Specifically, for Hi-C, Micro-C and DNA SPRITE datasets, we computed this value using interaction frequencies normalized by matrix balancing or iterative correction and eigenvector decomposition (ICE) at the 5-kb resolution. By contrast, for ChIA-PET and PLAC-seq datasets, we calculated the value using raw interaction frequencies at the same 5-kb resolution. For GAM, we used the NPMI-normalized co-segregation frequencies at the 25-kb resolution.

Annotation of enhancer regions in different cell types

To define candidate enhancer regions in each cell type, we first downloaded the total set of human cis-regulatory elements (cCREs) from the ENCODE data portal website using the following link https://screen.encodeproject.org/. We then extracted all regions marked as ELS (enhancer-like signatures) from the downloaded file. Finally, enhancer regions in each cell type were defined as a subset of these regions that overlap with ATAC–seq or DNase-seq peaks in corresponding cells, based on data availability for those cells.

Gene expression breadth analysis

To explore the gene expression profiles of a specific gene set across a diverse range of cell type or tissues, we collected RNA-seq datasets for 116 human cell types or tissues (from ENCODE; see the ‘Datasets’ table in Supplementary Note 1, section ‘Methods for relating chromatin loops to gene expression’). The TPM values were used to measure gene transcription levels. To normalize the RNA-seq data, we first applied a logarithm transformation to the original TPM values using the formula log2(TPM + 1) for each sample, and then quantile-normalized the transformed TPM values across all samples.

In each sample, genes with a normalized TPM value greater than 3 were considered expressed in the corresponding sample, and the gene expression breadth is defined as the number of samples in which a gene is expressed.

Identifying SPIN states for large-scale genome compartmentalization

In this work, we used a modified SPIN method to perform joint modelling across multiple cell types. To ensure TSA–seq and DamID scores across different cell types are comparable, we first applied a data normalization method to transform data into a Gaussian or more-Gaussian-like distribution. To do that we identified genomic bins that are spatially stable by calculating the Pearson correlation of interchromosomal Hi-C interactions for each non-overlapping 25-kb genomic bin. The bins were then ranked on the basis of the average Pearson correlation coefficient, and the top 25% were selected as spatially conserved regions. We then obtained TSA–seq or DamID scores for these bins in all cell types and standardized each data track by fitting a power-transformation function. We used the Yeo–Johnson transformation function with the default parameters from the Python scikit-learn package.

Next, we modified the framework of SPIN by jointly modelling the probability across multiple cell types. The hidden Markov random field model is defined on an undirected graph ({G}^{c}=(V,{E}^{c})) for each cell type, where in our case V represents non-overlapping 25-kb genomic bins and Ec represents the cell-type-specific edges (that is, significant Hi-C interactions) in cell type c. For each node (iin V,{O}_{i}^{c}in {{mathbb{R}}}^{d}) is a vector with dimension d indicating the observed TSA–seq and DamID signal of this bin in cell type c. Each node i in C also has a hidden state ({H}_{i}^{c}) for each cell type, representing its underlying spatial environment relative to different nuclear landmarks that we want to estimate. In this work, we assume that the set of hidden states are shared across cell types.

Edges (({i}^{{c}},{j}^{{c}},)in {E}^{{c}}) in the graph are cell-type specific and there are no edges that are connecting nodes from different cell types. Therefore, the hidden state ({H}_{i}^{c}) is only dependent on cell-type-specific observation ({O}_{i}^{c}) and the neighbours of node (i({N}^{c}(i)={j|jin V,(i,j)in {E}^{c}})) in cell type c. The overall objective is to estimate the hidden states ({H}_{i}^{c}) for all nodes in all cell types that maximize the following joint probability as shown below:

$$P(vec{H},vec{O})propto frac{1}{Z}prod _{cin 1ldots 4}(prod _{iin V}{P}_{V}({O}_{i}^{c}|{H}_{i}^{c})prod _{({i}^{c},{j}^{c})in {E}^{c}}{P}_{{E}^{c}}({H}_{i}^{c},{H}_{j}^{c}))$$

To estimate the number of SPIN states, we applied the same approaches as we used in the previous version of the SPIN method72. We used both the Elbow method based on k-means clustering and AIC/BIC scores to search for the optimal number of SPIN states. Both AIC and BIC scores decrease as the number of states increases. We found that the slope of the curve drops close to zero as the number of states exceeds 9. So, we chose 9.

Data acquisition and processing

We obtained TSA–seq, lamin B1 DamID-seq and Hi-C data for H1 and HFFc6 cells from the 4D Nucleome data portal (http://data.4dnucleome.org). The data generation and processing pipeline for TSA–seq data was described previously32,59. The data generation and processing pipeline for DamID data was described previously104. For the SPIN states inference, we used Hi-C data generated by the formaldehyde–disuccinimidyl glutarate Hi-C protocol (1% formaldehyde followed by incubation with 3 mM disuccinimidyl glutarate) using restriction enzyme DnpII42. We binned TSA–seq, lamin B1 DamID-seq and Hi-C mapped reads at 25-kb resolution. We then identified significant interactions from the normalized Hi-C data in each cell type previously described72.

Processing other epigenomic data

We downloaded or processed other epigenomic data and compared SPIN states with these datasets. For ChIP–seq datasets, we downloaded the processed P-value tracks from the ENCODE website for H1 cells and Avocado73 imputed P-value tracks from the 4D Nucleome data portal. Multi-fraction Repli-seq data were collected from a previous study106. For CUT&RUN data, we downloaded raw sequencing reads from the 4D Nucleome data portal and processed them using a similar procedure according to the standard ENCODE ChIP–seq pipeline. First, we mapped reads to the hg38 reference genome using Bowtie2 (v.2.2.9) with the default parameters. We then used MACS3 to generate P-value tracks as well as peaks for CUT&RUN data. The enrichment score of epigenomic data on SPIN states is determined by the log2-transformed ratio between the average observed signals on each SPIN states over genome-wide expectation.

SPIN states enriched caRNA sequence features

Processing of iMARGI data was performed with iMARGI-Docker77. To quantify enrichment of repetitive element (RE)-containing caRNAs with specific chromatin states, we computed an enrichment score (log2-transformed observed/expected interaction frequencies) for each RE caRNA class and SPIN state. Observed frequencies were derived from the number of iMARGI read pairs with RNA ends mapped to RE class of interest and DNA ends mapped to SPIN states. Expected frequencies are computed as the total number of iMARGI read pairs multiplied by the product of the marginal probabilities of RE class abundance (the proportion of all caRNAs mapping to each RE class, irrespective of DNA mapping locations) and SPIN state abundance (the proportion of DNA reads mapping to each SPIN state). Only interchromosomal iMARGI pairs were analysed to mitigate potential biases from nascent RNA transcripts interacting with proximal genomic regions.

Measuring nuclear RNA with iMARGI

iMARGI RNA end coverage is derived from RNA, DNA interactions represented in bedpe files generated by iMARGI docker77. The RNA end abundance bigwig file is generated by calculating the pileup reads coverage (R, coverage function) on the genome using RNA ends only in a strand-specific manner.

Methods for integrated genome structure modelling

Data preprocessing was performed as described previously81, with the exception of the parameter fmaxation = 16 for Hi-C preprocessing.

Genome representation

The genome is represented at 200-kbp resolution as described previously81,83, resulting in n = 29,838 chromatin regions, modelled as hard spheres of radius r0 = 118 nm. The HFFc6 nucleus is modelled as an ellipsoid of semiaxes (a,b,c) = (7,840, 6,480, 2,450) nm, whereas the H1 cell envelope is represented by a sphere of radius R = 5,000 nm.

We define the population of single-cell genome structures as a set of S diploid genome structures X = {X1, …, Xs}; A genome structure Xs is a set of three-dimensional vectors ({X}_{s}={{vec{x}}_{{is}}:{vec{x}}_{{is}}in {R}^{3},{i}=mathrm{1,2},ldots ,N}) representing the centre coordinates of each chromatin domain within the structure s:, N being the total number of chromatin domains in the genome and ({vec{x}}_{{is}}=({x}_{{is}},{y}_{{is}},{z}_{{is}})in {R}^{3}) indicates the coordinates of a 200,000-bp genomic region i in structure s. We use the notation I = (i,i′) to indicate the genomic region, where i and i′ represent the two alleles of genomic region I.

Data-driven simulation of genome structures

Genome structure populations were generated with IGM following procedures described previously81. The goal is to determine a population of 1,000 diploid 3D genome structures X statistically consistent with all input data from different available genomics experiments. Given a collection of input data D from different data sources, (D={{D}_{k}{|k}=1,ldots ,3}) (here, ensemble Hi-C, lamina DamID and SPRITE; see data in Supplementary Note 1, section 4 ‘Methods for integrated genome structure modeling’), we aim to estimate the structure population X such that the likelihood (P({{D}_{k}}{|X})) is maximized. To represent missing information at the single-cell and homologous chromosome level, we introduce data indicator tensors ({D}^{* }={{D}_{k}^{* }{|k}=1,ldots ,K=3}), which augment missing information about allelic copies in single cells. Thus, the latent variables are a detailed expansion of the diploid and single-structure representation.


■ مصدر الخبر الأصلي

نشر لأول مرة على: www.nature.com

تاريخ النشر: 2025-12-17 02:00:00

الكاتب: Job Dekker

تنويه من موقع “yalebnan.org”:

تم جلب هذا المحتوى بشكل آلي من المصدر:
www.nature.com
بتاريخ: 2025-12-17 02:00:00.
الآراء والمعلومات الواردة في هذا المقال لا تعبر بالضرورة عن رأي موقع “yalebnan.org”، والمسؤولية الكاملة تقع على عاتق المصدر الأصلي.

ملاحظة: قد يتم استخدام الترجمة الآلية في بعض الأحيان لتوفير هذا المحتوى.

بتوقيت بيروت — An integrated view of structure and function of human 4D nucleome
اظهر المزيد

مقالات ذات صلة

زر الذهاب إلى الأعلى