Semantic design of functional de novo genes from a genomic language model

Evo 1.5 pretraining

Evo 1.5 was generated by extending the pretraining of the Evo 1 model, originally trained at a sequence context of 8,192 tokens with an initial learning rate of 0.003 after a warmup of 1,200 iterations; a cosine decay schedule with a maximum decay iteration of 120,000 and a minimum learning rate of 0.0003. Training used a global batch size of 4,194,304 tokens and 75,000 iterations, processing a total of 315 billion tokens. Other details on hyperparameters related to the model architecture and optimizer can be found in ref. ¹. Pretraining of the model, including all model states, optimizer states, data loading schedule and learning rate schedule, was resumed from 75,000 iterations to 112,000 iterations (processing a total of 470 billion tokens). The model trained up to 112,000 iterations is referred to as Evo 1.5.

As in Nguyen et al.¹, Evo 1.5 was trained on the OpenGenome dataset, a comprehensive collection of prokaryotic genomic sequences. In brief, the dataset consists of approximately 80,000 prokaryotic genomes from bacteria and archaea and over 2 million phages and plasmid sequences, totalling approximately 300 billion nucleotides. OpenGenome was carefully curated to provide diverse, non-redundant genomic sequences from three primary sources: (1) bacterial and archaeal genomes from the Genome Taxonomy Database⁶¹, (2) prokaryotic viral sequences from the IMG/VR database⁶², and (3) plasmid sequences from the IMG/PR database⁶³. To reduce redundancy, only representative genomes for each species, viral operational taxonomic unit, or plasmid taxonomic unit were retained. The dataset was further filtered to exclude potential eukaryotic viruses and sequences with poor taxonomic specificity. The training data include both positive and negative strands of the genomic sequences, allowing the model to learn the complementary nature of DNA. A detailed description of the dataset curation process is provided in Nguyen et al.¹.

Autoregressive sampling

To sample from Evo, a standard low-temperature autoregressive sampling algorithm was used. Sampling code (https://github.com/evo-design/evo/) based on the reference implementation (https://github.com/togethercomputer/stripedhyena) that leverages kv-caching of Transformer layers and the recurrent formulation of hyena layers^64,65 was used to achieve efficient, low-memory autoregressive generation. Parameter optimization was performed across various temperatures (0.1–1.5, increments of 0.1), top-k values (1–4) and top-p values (0.1–1.0, increments of 0.1) using the Evo 1.5 model. For each parameter combination, 100 1,000-nt sequences were generated from a test set of five gene prompts encoding 50% of a highly conserved protein. Following identification of ORFs in generated sequences using Prodigal (v2.6.3, default parameters, -p meta) with default parameters in metagenome mode (-p meta)⁶⁶, generated proteins were aligned against the full-length prompt protein sequence using MAFFT (v7.526)⁶⁷ for sequence identity calculations. For evaluating sequence degeneracy, DustMasker (v2.14.1+galaxy2)^68,69,70 was run across the full-length generations using default parameters and the proportion of masked nucleotides was calculated. Final parameters (temperature = 0.7, top-k = 4, top-p = 1.0) were selected based on maximizing sequence completion accuracy while maintaining DustMasker proportions below 0.2, a value chosen to be slightly higher than the typical frequency of non-coding DNA in prokaryotic genomes^71,72. All code for sampling and downstream analysis using Evo was written in Python (v3.11.8).

Sequence completion prompt compilation and evaluation

Sequences of highly conserved genes from across prokaryotic biology were downloaded in FASTA format from NCBI GenBank⁷³. Selected genes included rpoS from E. coli K-12 (GenBank: NC_000913.3, coordinates c2866559–2867551), gyrA from S. enterica LT2 (GenBank: NC_003197.2, coordinates c2373710–2376422) and ftsZ from H. volcanii DS2 (GenBank: NC_013967.1, coordinates 643397–644536). Prompts were prepared by extracting 30%, 50% and 80% sequence lengths from the 5′ end. For the analysis of the gene completion ability of Evo 1.5 using sequences with varying levels of conservation, sequences encoding moderately conserved (gloA and pilA) and poorly conserved (tnsA and yagL) genes gloA from E. coli K-12 (GenBank: U57363.1, coordinates 1–408), pilA from Pseudomonas aeruginosa (GenBank: AE004091.2, coordinates c5069082–5069531), tnsA from E. coli O3 (GenBank: NZ_JALKIH010000010.1, coordinates 27218–28039) and yagL from E. coli K-12 (GenBank: U73857.1, coordinates c1018–1716) were used.

Sequence completion performance was evaluated across varying prompt lengths (30%, 50% and 80% of input sequence) using optimal sampling parameters for the Evo 1.5, Evo 1 8K (previous version of Evo trained with context length of 8,192 tokens) and Evo 1 131K (previous version of Evo trained with context length of 131,072 tokens) models (temperature = 0.7, top-k = 4, top-p = 1)¹. For each prompt, 500 sequences of length 2,500 nt were generated and filtered to remove generations with DustMasker proportions above 0.2. Prompts were subsequently appended to the start of each generated sequence and ORFs were identified using Prodigal (v2.6.3, default parameters, -p meta) with default parameters in metagenome mode (-p meta). Generated proteins were then aligned against their corresponding natural sequences using MAFFT (v7.526) with default parameters for sequence identity calculations.

Operon completion prompt compilation and evaluation

Sequences encoding the trp operon and modABC operon from E. coli K-12 (GenBank: NC_000913.3) were downloaded in FASTA format from NCBI GenBank. For modABC, prompts were prepared from the full coding sequences for modA (coordinates 795089–795862), modB (coordinates 795862–796551), modC (coordinates 796554–797612) and acrZ (coordinates 794773–794922). For trp, prompts were prepared from the full coding sequences for trpE (coordinates c1321384–1322946), trpD (coordinates c1319789–1321384), trpC (coordinates c1318427–1319788), trpB (coordinates c1317222–1318415) and trpA (coordinates c1316416–1317222). For testing the ability of the model to generate sequences on the antisense strand, the reverse complement of each sense strand-derived prompt sequence was generated using Biopython⁷⁴.

Sequence completion performance was evaluated across the compiled operon completion prompts using previously identified optimal sampling parameters for Evo 1.5. For each prompt, 5,000 sequences of length 2,500 nt were generated. Following filtering of generations with DustMasker proportions above 0.2 and identification of ORFs using Prodigal (v2.6.3, default parameters, -p meta), directional completion was assessed by searching trpE-prompted generations for trpD-like ORFs, trpD reverse complement-prompted generations for trpE-like ORFs, and similar pairing combinations across both modABC and trp operons. Protein sequences were then aligned against their corresponding wild-type proteins for sequence identity calculations using MAFFT (v7.526). Structural similarity was evaluated by generating protein structure predictions using AlphaFold 3 (ref. ⁷⁵) for both generated and wild-type sequences, with structural alignments and TM-score calculations performed using TM-align⁷⁶. Natural and predicted protein structures were subsequently visualized using ChimeraX⁷⁷.

Positional entropy evaluation

Per-position amino acid and nucleotide entropies were calculated from multiple sequence alignments of 500 generated and natural modB and trpA sequences. Natural modB and trpA sequences were fetched by querying ‘ModB’ and ‘TrpA’ in NCBI protein, filtering by bacteria, and downloading the corresponding amino acid and nucleotide sequences in format ‘FASTA’ and ‘FASTA CDS’, respectively. Generated modB and trpA sequences were chosen by selecting a random sample of 500 modB and trpA ORFs (more than 80% sequence identity to E. coli sequence) from the modB and trpA sequences generated by prompting with modA and trpB, respectively, during the operon completion evaluation. First, nucleotide and amino acid sequences were aligned with MAFFT (v7.526) and trimmed to remove gaps appearing in more than 80% of sequences. For each position i, the Shannon entropy was then calculated as H = –Σ(p_i × log₂(p_i)) and normalized by dividing the calculated entropy by the maximum Shannon entropy (2 for nucleotides meaning all four bases are equally present, 4.32 for amino acids meaning all 20 standard amino acids are equally present), where p_i represents the frequency of each amino acid or nucleotide at that position.

Analysis of species tag-prompting methods

For evaluation of the effect of species-specific prompting on sequence completion, sequences encoding dnaK (GenBank: NC_000913.3, coordinates 12163–14079) and recA (GenBank: U00096.3, coordinates c2822708–2823769) in E. coli K-12 and secY (GenBank: AF395886.1, coordinates 203–1669) and tfb2 (GenBank: AF143693.1, coordinates 140–1138) in H. volcanii were downloaded in FASTA format from NCBI GenBank⁷³. Base prompts were prepared by extracting 50% of the sequence lengths from the 5′ end. To generate species-specific prompt tags, the specific domain, phylum, class, family, genus, order and species information was extracted for each species from the Global Biodiversity Information Facility API (https://api.gbif.org/) and appended to the start of each base prompt. The species-specific prompts used are shown below:

(1)

|d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia | |
(2)

|d__Archaea;p__Halobacteriota;c__Halobacteria;o__Haloferacales;f__Haloferacaceae;g__Haloferax;s__Haloferax volcanii | |

Sequence completion performance was evaluated by sampling the Evo 1 131K model using prompts with and without appended species tags following the method described in the section ‘Sequence completion prompt compilation and evaluation’ above.

To evaluate the effect of species-specific prompts on sequence entropy, prompts encoding fusA (GenBank: AH002539.2, coordinates 1243–1521) upstream of the evaluated tufA gene encoding EF-Tu from E. coli K-12 were fetched as described in the section ‘Operon completion prompt compilation and evaluation’. Sampling was performed as described before, but with species-specific prompt tags appended to the start of each prompt. Positional entropy was determined for tufA sequences generated with and without species tags as described in the section ‘Positional entropy evaluation’.

Amino acid substitution analysis

Generated (see ‘Operon completion prompt compilation and evaluation’) and natural sequences encoding genes in the trp operon and modABC operon were first filtered to select for those with over 95% and less than 100% minimum amino acid sequence identity to their respective wild-type E. coli K-12 reference sequence using MMseqs2. From these filtered sequences, 100 sequences encoding each of modA, modB, modC, acrZ, trpA, trpB, trpC, trpD and trpE were randomly selected for both the Evo-generated and natural sequence groups. These sequences were subsequently aligned against their respective E. coli K-12 reference sequence using pairwise alignment with MAFFT (v7.526) and default parameters. Following alignment, amino acid substitutions were identified at each aligned position by comparing variant residues to the E. coli K-12 reference, excluding gap positions from analysis. Each substitution was then scored using the BLOSUM62 (ref. ⁷⁸) matrix, with scores greater than or equal to 0 indicating biochemically conservative changes and scores less than 0 indicating non-conservative changes. BLOSUM scores for substitutions across all evaluated genes in the trp and modABC operons were then aggregated and plotted to get the final distribution of BLOSUM scores.

T2TA prompt compilation and analysis

Genomic loci and sequences encoding T2TA system sequences were obtained by downloading the nucleotide sequence information for all experimentally validated T2TAs from the TADB 3.0 database²⁴. Using the NCBI Entrez Programming Utilities API (EFetch from the nuccore database using the genomic loci from TADB 3.0)⁷³, the 500 bp of upstream and downstream flanking sequence were extracted for each T2TA locus. In total, for each T2TA system, eight types of prompts were prepared: (1) individual toxin sequences, (2) individual antitoxin sequences, (3) the reverse complement of individual toxin sequences, (4) the reverse complement of individual antitoxin sequences, (5) the upstream context of the toxin loci, (6) the downstream context of the toxin loci, (7) the upstream context of the antitoxin loci, and (8) the downstream context of the antitoxin loci. Following successful identification of an Evo-generated toxin (see ‘Evaluation of types II and III toxin activity’ below), conjugate antitoxins were subsequently generated via prompting with the generated DNA sequence encoding the toxin.

To evaluate the frequency with which each prompt type generated toxin and antitoxin sequences, remaining protein sequences following sequence complexity and pDockQ filtering (see ‘T2TA sampling and filtering’) were evaluated using HMMER (v3.3.0) hmmscan (https://hmmer.org) against the Pfam-A database (v35.0)^79,80. Generations with Pfam matches against known type II toxin or antitoxin-related families were counted as hits and mapped back to the prompt type used to generate the sequence, with the generation frequency calculated as the number of toxin or antitoxin hits divided by the total number of remaining generations for each prompt classification.

T3TA prompt compilation and analysis

Genomic loci and sequences encoding T3TA system sequences were obtained by downloading the nucleotide sequence information for all experimentally validated and computationally predicted T3TAs from the TADB 3.0 database²⁴. Using the NCBI Entrez Programming Utilities API (EFetch from the nuccore database using the genomic loci from TADB 3.0), the 1,000 bp of upstream and downstream flanking sequence were extracted for each T3TA locus. For each T3TA system, eight types of prompts were prepared: (1) individual toxin sequences, (2) individual antitoxin sequences, (3) the reverse complement of individual toxin sequences, (4) the reverse complement of individual antitoxin sequences, (5) the upstream context of the toxin loci, (6) the downstream context of the toxin loci, (7) the upstream context of the antitoxin loci, and (8) the downstream context of the antitoxin loci.

To evaluate the frequency with which each prompt type generated toxin and antitoxin sequences, generations with Rfam or Pfam matches against known T3TA-related families (see ‘T3TA sampling and filtering’) were mapped back to their prompt type and classified as hits. The overall generation frequency was calculated by dividing the number of hits by the total number of remaining generations for each prompt classification.

T2TA sampling and filtering

To generate T2TA candidates, 53,104 sequences of 2,000 nucleotides each were first generated using Evo 1.5 (temperature = 0.7, top-k = 4, top-p = 1.0) from our compiled T2TA prompts. A multi-stage filtering pipeline was then applied to identify promising candidates. First, Prodigal (v2.6.3, default parameters, -p meta) was used to identify ORFs, excluding sequences containing proteins over 300 amino acids or less than to amino acids, resulting in 130,754 called proteins total⁶⁶. Next, SegMasker (v2.14.1+galaxy2) with default parameters⁶⁹ was used to remove sequences containing low-complexity regions with limited amino acid diversity, with 58,704 proteins remaining post-filtering. Next, any proteins that belonged to generations with only one passing protein were removed, resulting in 32,181 remaining protein candidates.

Protein–protein interaction potential of co-generated ORFs by co-folding all ORF pairs within each remaining generation was then assessed using ESMFold¹⁷. Generations were retained if they contained paired proteins with pDockQ⁸¹ scores greater than 0.23 and individual pLDDT scores greater than 0.3, resulting in 945 remaining pairs of proteins. Following the removal of any protein pairs with more than 40% sequence identity (MMseqs2⁸²) between both proteins, 777 proteins remained. To identify novel candidates, the remaining sequences were searched against the non-redundant protein sequence database using BLAST⁶⁸ (e-value cut-off of 0.05), selecting for generations containing at least one component with no significant BLAST hits to known toxins or antitoxins and the other component matching a known toxin or antitoxin. Following this filtering step, a total of 36 protein pairs remained. Ten final toxin candidates were then selected based on high-confidence interaction prediction using AlphaFold 3 (ref. ⁷⁵).

Following the identification of functional toxin candidates via experimental testing, in which two toxins were found to be active, four were unable to be successfully cloned and three were inactive, the Evo-generated sequence encoding the strongest Evo-generated toxin, EvoRelE1, was used as a prompt to generate further diversified antitoxin candidates. After generating a total of 3,000 sequences from the EvoRelE1 prompt, 7,708 generated ORFs were filtered as above (744 remaining) before being co-folded with EvoRelE1. As with the first round of generations, candidates were filtered for high pDockQ scores, moderate pLDDT scores using ESMFold-derived co-folds (122 candidates remaining) and less than 40% identity to known antitoxins using BLAST (43 candidates remaining) before being evaluated for strong predicted co-folds using AlphaFold 3 (ipTM > 0.7). Remaining antitoxin candidates were further characterized using Foldseek Search Server⁸³ searches of the AlphaFold 3-predicted structures (probability threshold of 0.6), blastp searches against the non-redundant protein database (e-value threshold of 1) and HHpred searches (probability threshold of more than 90%)⁸⁴ to select a final of ten antitoxin candidates.

T3TA sampling and filtering

To generate T3TA candidates, 25,960 sequences of 3,000 nucleotides each were first generated using Evo 1.5 (temperature = 0.7, top-k = 4, top-p = 1.0) from our compiled T3TA prompts. A multi-stage filtering pipeline was then applied to identify promising candidates. First, Prodigal (v2.6.3, default parameters, -p meta) was used to identify ORFs⁶⁶, excluding sequences containing proteins over 400 amino acids or less than 50 amino acids, resulting in 80,298 called proteins total. Next, SegMasker (v2.14.1+galaxy2) with default parameters⁶⁹ was used to remove sequences containing low-complexity regions with limited amino acid diversity, with 34,131 proteins remaining post-filtering. On sequences with at least one high-quality ORF present, ESMFold and Tandem Repeats Finder⁸⁵ were used to identify generations with at least one ORFs with a pLDDT > 0.3 and at least one tandem repeat respectively. Tandem Repeat Finder was run using parameters: match = 2, mismatch = 7, delta = 7, PM = 80, PI = 10, minscore = 50, maxperiod = 2,000, resulting in 3,847 remaining generations. Consensus and full repeats were subsequently folded using ViennaRNA’s RNAfold^86,87. Following filtering to remove any called tandem repeats with no hairpins, minimum free energies of more than −3.0 and without all four nucleotides present, a total of 428 sequences remained.

Remaining ORFs and identified tandem repeats from the filtered generations were subsequently evaluated using HMMER (v3.3.0) hmmscan against the Pfam-A database⁸⁰ (v35.0) and rnascan⁸⁸ against the Rfam database (v15.0)⁸⁹, respectively. Tandem repeats were also run against the AbiF5_iter3.CM and the diverse_rna_xinsi.CM files from Zilberzwige-Tal et al.⁹⁰ using Infernal’s cmscan⁹¹. As a point of comparison, natural T3TA sequences were also run against Pfam-A, Rfam and the AbiF-related covariance models. Generated sequences were retained if they had a match (E < 0.05) to a Pfam domain that overlapped with Pfam domains found in natural type III toxin sequences or if they had a match to a Rfam or AbiF family annotation found in natural type III antitoxin sequences.

To account for natural type III antitoxin sequences that did not have existing covariance models and sequence divergent-generated RNA sequences, an RNA secondary structure-based filtering on the generated sequences was also implemented. In brief, consensus repeat and full secondary structures of the generated sequences were compared with that of type III antitoxin sequences from TADB 3.0 and Zilberzwige-Tal et al.⁹⁰. First, structures of generated tandem repeats were pre-filtered to exclude candidates with significantly different consensus repeat lengths and base-pairing ratio similarities, using a weighted average of length similarity (0.6 weight; a score of 1.0 indicates perfect match) and base-pairing ratio similarity (0.4 weight; a score of 1.0 indicates perfect match) against natural type III antitoxin sequences (retained sequences with scores ≥ 0.3). Remaining candidates underwent structural motif comparison using vectorized representations of extracted patterns including hairpins, stems, bulges and unpaired regions, with Jaccard similarity scoring applied to binary motif occurrence matrices (retained sequences with scores ≥ 0.4). Following motif extraction, final candidates were determined by calculating cosine similarity of the generated sequences against all natural type III antitoxins using ten-dimensional feature vectors encoding other RNA properties including base-pair counts, pairing ratios, average and maximum stem and loop lengths, number of stems, minimum free energy values, minimum free energy per nucleotide, and hairpin counts (retained sequences with similarities ≥ 0.7).

Sequences passing the Pfam, Rfam, AbiF CM, or RNA secondary structure filtering metrics were retained as final generations, for a total of 125 candidates. From this pool, a total of 36 toxin candidates and 10 antitoxin candidates were selected to experimentally validate, with selection criteria including sequence novelty for toxin candidates and IDT synthesizability for antitoxin candidates.

Cloning of T2TA sequences

Sequences encoding Evo-generated T2TAs were codon optimized for E. coli expression using the codon optimization tool in IDT and synthesized as eBlocks. The toxin plasmid backbone was prepared by PCR amplification (New England Biosciences) and gel purification (Qiagen) of pAraSpCas9 + spMu⁴⁴ to exclude the Cas9, CRISPR RNA, trans-activating CRISPR RNA and guide RNA sequences (Supplementary Data 1), creating an empty arabinose-inducible vector with spectinomycin resistance. Codon-optimized toxin sequences were subsequently inserted into the modified backbone using Gibson Assembly (New England Biosciences) according to the manufacturer’s recommendations.

For antitoxin cloning, the pZE21_tetR-AcrIIA4_kanR vector⁴⁴ (Supplementary Data 1) was digested with EcoRI and HindIII (New England Biosciences) and gel purified (Qiagen) to remove the AcrIIA4 sequence. Codon-optimized antitoxin sequences were then inserted into the digested vector using Gibson Assembly (New England Biosciences).

Assembled plasmids were transformed into chemically competent Stellar E. coli HST08 cells (Takara Biosciences), and positive clones were selected on LB agar plates containing either 50 μg ml⁻¹ spectinomycin (for the toxin constructs) or 50 μg ml⁻¹ kanamycin (for the antitoxin constructs). Plasmid sequences were then confirmed by Sanger sequencing (Elim Biosciences).

Cloning of T3TA sequences

Type III Evo-generated toxin candidates were codon optimized, synthesized and cloned following the methods outlined in ‘Cloning of T2TA sequences’.

For type III antitoxin cloning, sequences encoding the J23119(SpeI) promoter and a rho-independent terminator from the pAraSpCas9 + spMu plasmid were appended immediately upstream and downstream of the Evo-generated type III antitoxin sequences. The combined sequences were subsequently synthesized by IDT as eBlocks. For type III antitoxin cloning, the pZE21_tetR-AcrIIA4_kanR vector⁴⁴ (Supplementary Data 1) was digested with MfeI and PvuI (New England Biosciences) and gel purified (Qiagen) to remove the AcrIIA4 and pLtetO-1 promoter sequences. Synthesized type III antitoxin sequences were then inserted into the digested vector using Gibson Assembly (New England Biosciences).

Evaluation of types II and III toxin activity

Toxin activity was assessed through growth inhibition assays in E. coli NEB Turbo cells (New England Biosciences) for type II toxin candidates and Stellar E. coli HST08 (Takara Biosciences) for type III toxin candidates. Cells transformed with toxin constructs were first grown overnight in LB medium containing 50 μg ml⁻¹ spectinomycin. Cultures were then diluted 1:10 into 1 ml fresh LB medium supplemented with 50 μg ml⁻¹ spectinomycin and 2 mg ml⁻¹ arabinose to induce toxin expression. After 2 h of induction, cultures were further diluted 1:40 into 200 μl of the same medium in triplicate wells of a 96-well plate. Growth was monitored using a Tecan Spark plate reader at 37 °C with orbital shaking, measuring an optical density at 600 nm (OD₆₀₀) at 30-min intervals over a 12 h period. For each toxin, an uninduced control grown in arabinose-free LB–spectinomycin media was used. Growth curves were analysed to evaluate the extent of toxin-mediated growth inhibition. To calculate relative survival, the final OD₆₀₀ measurement following the 12 h growth period for each toxin was divided by the final OD₆₀₀ value of its respective uninduced control. For comparisons between EvoT1 and natural ToxN, statistical comparisons were done using a one-sided Student’s t-test against the induced ToxN negative control, with the P value for EvoT1 being 0.1105.

Evaluation of type II antitoxin activity

For antitoxin evaluation, E. coli NEB Turbo cells were co-transformed with both toxin and antitoxin constructs (1:2 ratio of toxin to antitoxin) and grown overnight in LB medium containing 50 μg ml⁻¹ spectinomycin and 50 μg ml⁻¹ kanamycin. Cultures were then diluted 1:10 into 1 ml fresh LB medium supplemented with both antibiotics and 2 mg ml⁻¹ arabinose to induce toxin expression. After 2 h of induction, cultures were further diluted 1:40 into 200 μl of the same medium in triplicate wells of a 96-well plate per condition. For each antitoxin, an uninduced control grown in arabinose-free LB–spectinomycin + kanamycin media was used. Growth was monitored using a Tecan Spark plate reader at 37 °C with orbital shaking, measuring OD₆₀₀ at 30-min intervals over a 12 h period. Growth curves were analysed to evaluate antitoxin-mediated rescue of growth. To calculate relative survival, the final OD₆₀₀ measurement following the 12 h growth period for each toxin–antitoxin pair was divided by the final OD₆₀₀ value of its respective uninduced control. Statistical comparisons were done using a one-sided Student’s t-test against the EvoRelE1 + eGFP negative control, with P values for EvoAT1–4 being 1.61 × 10⁻⁷, 1.42 × 10⁻⁷, 2.30 × 10⁻⁵ and 1.69 × 10⁻⁵, respectively.

Evaluation of type III antitoxin activity

For type III antitoxin evaluation, Stellar E. coli HST08 cells (Takara Biosciences) were co-transformed with both toxin and antitoxin constructs (1:3 ratio of toxin to antitoxin) and grown overnight in LB medium containing 50 μg ml⁻¹ spectinomycin and 50 μg ml⁻¹ kanamycin. Cultures were then diluted 1:10 into 1 ml fresh LB medium supplemented with both antibiotics and 2 mg ml⁻¹ arabinose to induce toxin expression. After 3 h of induction, cultures were further diluted 1:40 into 200 μl of the same medium in triplicate wells of a 96-well plate per condition. For each antitoxin, an uninduced control grown in arabinose-free LB–spectinomycin + kanamycin media was used. Growth was monitored using a Tecan Spark plate reader at 37 °C with orbital shaking, measuring OD₆₀₀ at 15-min intervals over a 12 h period. Growth curves were analysed to evaluate antitoxin-mediated rescue of growth. To calculate relative survival, the final OD₆₀₀ measurement following the 12 h growth period for each toxin–antitoxin pair was divided by the final OD₆₀₀ value of its respective uninduced control. Statistical comparisons were done using a one-sided Student’s t-test against the induced ToxN negative control, with the P value for EvoAT6 being 1.84 × 10⁻⁸.

Anti-CRISPR prompt compilation and analysis

Genomic sequences containing known type II Cas9-targeting anti-CRISPR (acr) genes and their associated operons were obtained from previously characterized Acr systems annotated in AcrDB³⁶. Using the Entrez API, Acr coding sequences along with 500 bp of flanking sequence both upstream and downstream of each acr locus were extracted. For each Acr system, six types of prompts were prepared: (1) individual Acr sequences, (2) Acr-associated (aca) gene sequences, (3) the reverse complement of individual Acr sequences, (4) the reverse complement of Acr-associated (aca) gene sequences, (5) the upstream context of the Acr loci, and the (6) the downstream context of the acr loci.

To evaluate the frequency with which each prompt type generated likely anti-CRISPR sequences, remaining generations following PaCRISPR filtering (see ‘Acr sampling and filtering’) were mapped back to their prompt type and classified as hits. The overall generation frequency was calculated by dividing the number of hits for each prompt type by the total number of remaining generations of that prompt classification.

Anti-CRISPR sampling and filtering

To generate Acr candidates, a total of 3,160 sequences of 1,000 nucleotides each were sampled using Evo 1.5 (temperature = 0.7, top-k = 4, top-p = 1.0) from our compiled Acr prompts. A multi-stage filtering pipeline was then applied to identify promising candidates. First, Prodigal (v2.6.3, default parameters, -p meta) was used to identify ORFs, excluding sequences containing proteins over 200 amino acids or less than 50 amino acids long, resulting in a total of 4,223 potential ORFs. Next, SegMasker (v2.14.1+galaxy2, default parameters) was used to remove sequences containing over 50% low-complexity regions or limited amino acid diversity, filtering the generations down to 1,391 candidates. Sequences were subsequently folded using ESMFold to remove any candidates with very low-confidence folds (pLDDT < 0.25) or no secondary structure, reducing the total candidates to 468. Candidate sequences were then evaluated using PaCRISPR, a machine learning model trained to identify potential Acr proteins based on sequence features⁴³. Following guidelines established by PaCRISPR, candidates scoring above 0.75 were considered potential Acrs and advanced to the next step, resulting in 131 remaining generations. Generated sequences classified as potential Acrs by PaCRISPR were then searched against the non-redundant protein sequence database using blastp (e-value cut-off of 1) to identify candidates with varying degrees of sequence novelty. From this pool of 131 candidates, 84 were selected for further screening based on sequence novelty, similarity to other screening candidates and PaCRISPR scores. For comparison, this filtration pipeline was also applied to evaluate sequences generated by prompting with randomly generated DNA sequences and non-Acr-related genomic sequences. The sequence diversity among the predicted Acrs was assessed by performing pairwise alignments using MAFFT (v7.526) on a set of 56 randomly selected sequences that scored above 0.75 in PaCRISPR. The resulting pairwise sequence identities were visualized using a matplotlib heatmap.

Cloning of anti-CRISPR sequences

Sequences encoding Evo-generated Acr proteins were codon optimized for E. coli expression using the codon optimization tool in IDT and synthesized as eBlocks. To generate the cloning backbone, the pZE21_tetR-AcrIIA4-Coli_kanR vector (Supplementary Data 1) was digested with EcoRI and HindIII (New England Biosciences) to remove the AcrIIA4 sequence and gel purified (Qiagen). Codon-optimized Acr sequences were then inserted into the digested vector using Gibson Assembly (New England Biosciences) according to the manufacturer’s recommendations. Assembled plasmids were then transformed into chemically competent Stellar E. coli HST08 cells (Takara Biosciences), and positive clones were selected on LB agar plates containing 50 μg ml⁻¹ kanamycin. Plasmid sequences were confirmed by Sanger sequencing (Elim Biosciences).

Liquid culture assay for measuring anti-CRISPR activity

Acr activity was assessed through protection assays against SpCas9-mediated DNA cleavage in E. coli. NEB Turbo cells were first co-transformed with both the Acr expression plasmid and the CRISPR-targeting plasmid containing SpCas9 and a KanR-targeting guide RNA⁴⁴ (pAraCas9 + Sp2 + Sp6 + I-SceI; Supplementary Data 1) and grown overnight in LB medium containing both 50 g ml⁻¹ kanamycin and spectinomycin.

For liquid culture survival assays, 30 μl of overnight culture was diluted into 1 ml fresh LB medium containing spectinomycin and 0.2 mg ml⁻¹ arabinose to induce SpCas9 expression and deplete the kanamycin resistance plasmid. After 7 h of growth, cultures were normalized to equal optical density and diluted 1:40 into 200 μl fresh LB medium containing 50 μg ml⁻¹ kanamycin and spectinomycin in a 96-well plate to select for cells with active Acr proteins. Growth was monitored using a Tecan Spark plate reader at 37 °C with orbital shaking, measuring OD₆₀₀ at 30-min intervals over an 8 h period. For each Acr, an uninduced control grown in arabinose-free LB–spectinomycin + kanamycin media was used to normalize growth values. Relative survival values were calculated by dividing the final OD₆₀₀ measurement for each Acr following the 8 h growth period by the final OD₆₀₀ value of its respective uninduced control.

Statistical comparisons were done using a one-sided Student’s t-test against the random sequence negative control and AcrIIA2, with P values for EvoAcr1–5 being 8.31 × 10⁻⁶, 6.19 × 10⁻⁶, 8.19 × 10⁻⁶, 1.43 × 10⁻⁶ and 8.18 × 10⁻⁷ against the negative control, respectively, and the P value for EvoAcr5 against AcrIIA2 being 5.07 × 10⁻⁴.

Phage plaque assay for measuring anti-CRISPR activity

For phage protection assays, E. coli NEB Turbo cells were co-transformed with both the Acr expression plasmid and a modified SpCas9 plasmid containing a guide RNA targeting T4(GT7) phage (SpCas9-mrh2; Supplementary Data 1) at a 5:1 Acr:Cas9 ratio. Co-transformed cells were grown overnight in LB medium containing 50 μg ml⁻¹ kanamycin and spectinomycin. Cultures were then diluted 1:10 into fresh LB medium supplemented with spectinomycin, kanamycin and 0.2 mg ml⁻¹ arabinose to induce Cas9 expression. When cultures reached an OD₆₀₀ of 0.4, 300 μl was mixed into 10 ml of 0.7% soft agar containing 50 μg ml⁻¹ kanamycin, 50 μg ml⁻¹ spectinomycin and 0.02 mg ml⁻¹ arabinose. Plates were allowed to harden, and T4(GT7) phage at a titre of 1.7 × 10⁸ PFU ml⁻¹ was serially diluted 1:5 and spotted onto the bacterial lawn. Plates were then incubated overnight at 37 °C before being imaged to visualize plaque formation.

Measurement of Evo-generated sequence diversity

Sequence and structural diversity of generated toxin–antitoxin pairs and Acrs were assessed through a combination of sequence and structure-based searches. Protein sequences were searched against the NCBI non-redundant protein database using blastp under default parameters (e-value threshold of 1.0) to identify similar natural sequences. Overall sequence identity was defined as the percent identity of a given BLAST match multiplied by the query cover of the alignment. RNA sequences were searched against the NCBI non-redundant nucleotide database using blastn under default parameters (e-value threshold of 1.0). Reported BLAST results are from a run conducted on 14 July 2025. An additional search against OpenGenome was performed using MMseqs2 (v15.6f452) with maximum sensitivity (-s 7) and other parameters set to the default values⁸² to evaluate similarity to sequences in the training data. For more-sensitive HMM-based characterization of generated genes, all Evo-generated proteins were searched against PDB_mmCIF70_25_May using the HHpred^84,92 webserver (https://toolkit.tuebingen.mpg.de/tools/hhpred), with probabilities greater than 90 or E less than 0.05 being deemed significant. Foldseek-based structural similarity was evaluated by searching AlphaFold 3-predicted structures against the AFDB-Swissprot and AFDB50 protein databases^93,94 using the Foldseek Search Server (https://search.Foldseek.com/search)⁸³, with values greater than 0.5 reported as significant hits following Foldseek guidelines⁹⁵. Dali-based structural similarity was evaluated by searching AlphaFold 3-predicted structures against PDB100 and the AF-DB V2 E. coli subset using the Dali webserver with hits with Z > ((n/10) − 4) (n = residues in query structure) being labelled as strong significant matches following Dali guidelines (http://ekhidna2.biocenter.helsinki.fi/dali/)^49,96. For each generated protein, the closest sequence and structural matches were identified from all searches (BLAST, MMseqs2, HHpred, Dali and Foldseek) and were evaluated to determine the degree of novelty compared with known proteins. Pairwise sequence identities were calculated from BLAST alignments, broader sequence similarity was calculated using HHpred probabilities and e-values, and structural similarities were quantified Foldseek TM-scores and Dali z-scores. For RNA HMM-based sequence similarity evaluation, generated type III antitoxins were searched against Rfam using rnascan^87,89 using an e-value threshold of 0.05. For structural evaluation of generated RNA and natural RNA antitoxins, individual secondary structure predictions for full and consensus repeat sequences were obtained using ViennaRNA’s RNAfold⁸⁷. Secondary structure similarity for the consensus repeats in EvoAT6 and Rfam ToxI were calculated using LocARNA⁹⁷. The results of all these analyses can be found in Supplementary Data 2–5.

For the per-residue compositionality analyses (Extended Data Figs. 3a, 4l and 6a), the top 1,000 BLAST hits (E < 1.0) and MMseqs2 hits for each protein were compiled using the parameters described above, with the inclusion of an MMseqs2 run against TADB 3.0 (ref. ²⁴) for toxin–antitoxin generations. Residue-level assignments were first determined by exact amino acid matching within the compiled BLAST and MMseqs2 results, with each query position assigned to the best-scoring hit. To avoid self-hits and confounding high-similarity alignments, matches with more than 95% sequence identity were excluded. Any unassigned regions were then integrated with a k-mer-based sequence matching algorithm to improve sequence coverage. To construct the k-mer database, all overlapping k-mers (lengths of 3–12 amino acids) were extracted from a random sample of 100,000 UniRef30 proteins shorter than 800 amino acids. For each unassigned region, the optimal set of non-overlapping exact k-mer matches that maximized sequence coverage while minimizing the number of distinct source proteins was identified. In brief, all possible exact k-mer matches for each unassigned position were enumerated and scored based on k-mer length, with longer matches receiving higher scores. These matches were treated as intervals, and an optimization algorithm was used to select the highest-scoring, non-overlapping subset. To encourage coherent assignments and minimize the number of proteins used, additional weighting was applied to favour proteins contributing multiple k-mers across all unassigned regions.

SynGenome prompt compilation

Protein sequences from prokaryotic and phage organisms were retrieved from UniProt⁹⁸, with all Swiss-Prot proteins with associated coding regions and a random sample of TrEMBL proteins with associated coding regions being used as starting points for prompts. Genomic contexts were retrieved via NCBI’s Entrez Programming Utilities (E-utilities) API, specifically using EFetch with the nuccore database using the coding sequence (CDS) annotations associated with the Protein Sequence accession in UniProt^73,98. For the genomic identifier associated with each protein, three API calls were constructed: one to extract the coding sequence using sequence feature coordinates and two to extract the flanking regions by calculating positions 500 nucleotides upstream and downstream of the CDS boundaries. For CDS regions that were longer than 500 nucleotides, the prompt was derived from the final 500 nucleotides in the coding sequence. For CDS regions that were shorter than 500 nucleotides, the CDS sequence was trimmed to the nearest 100 bp. In addition, the reverse complement sequences for each region were generated using the Biopython Seq module, resulting in six distinct prompts per protein: upstream, CDS, downstream and their respective reverse complements. Associated functional annotations including Gene Ontology terms⁴⁷, species names and InterPro domains⁴⁸ were retrieved from REST API in UniProt for each protein using their UniProt accession numbers and linked to their corresponding prompts.

SynGenome sampling

Sequences were generated using Evo 1.5 with optimized sampling parameters (temperature = 0.7, top-k = 4, top-p = 1.0). For each prompt, two sequences of 5,000 nucleotides in length were generated, yielding a total database size of over 120 billion base pairs. Generation was performed in parallel across multiple compute nodes to facilitate large-scale sequence production.

Data filtering of SynGenome sequences

Generated sequences underwent a multi-step filtering pipeline to remove low-complexity regions while preserving biologically plausible features. Initial validation removed any invalid characters from the nucleotide sequences. These sequences were then processed using DustMasker (NCBI BLAST+ v2.16.0) with a masking level of 30 (-level 30) and FASTA output format (-outfmt fasta) to identify low-complexity regions. Following DustMasker processing, two additional filtering steps were implemented: removal of successive 100 nucleotide chunks from the sequence end if they contained more than 40% masked bases, and elimination of any continuous masked regions longer than 800 nucleotides. Sequences were excluded from the final database based on several criteria: length below 100 nucleotides, masked base content exceeding 80% in sequences shorter than 2,000 nucleotides, complete masking of all bases or empty/NaN values. The sequences passing these filtering criteria were retained in the final database.

Prediction of SynGenome ORFs

ORFs were identified in the filtered sequences using Prodigal (v2.6.3, default parameters, -p meta) with default parameters and metagenomic prediction mode (-p meta). Predictions were initially refined by excluding sequences shorter than 40 amino acids or longer than 1,200 amino acids and sequences with incomplete protein sequences. Following basic filtration, low-complexity regions were identified using SegMasker (v2.14.1+galaxy2, window size of 15, locut of 1.8, hicut of 3.4), with sequences containing more than 20% masked regions being excluded. Additional complexity filters removed sequences with fewer than 12 unique amino acids or highly repetitive k-mer patterns (k = 3–10, threshold > 40% coverage). This multi-step filtering process ensured the retention of higher-quality protein predictions while removing potentially spurious or low-complexity sequences.

Creation of the SynGenome domain association network

Protein family domains in filtered protein predictions from SynGenome (see ‘Prediction of SynGenome ORFs’) were identified using HMMER (v3.3.0) hmmscan against the Pfam-A database⁸⁰ (v35.0) with default parameters (e-value cut-off of 0.1). Next, to analyse domain annotation patterns across SynGenome, a directed graph connecting Pfam domains found in prompt sequences to those found in generated response sequences (nodes) was constructed, with edge weights representing co-occurrence frequencies. The resultant network was visualized using Pyvis⁹⁹, with the top 10,000 strongest edges being visualized. To ensure that no single highly connected node dominated the network, each node was restricted to a maximum of five top edges.

To identify connections between Pfam clans, Pfam domains in both prompt and response sequences were mapped back to their corresponding clan using metadata in Pfam-A (v35.0). As before, a network connecting Pfam clans found in prompts to those found in responses was created, with nodes representing individual Pfam clans and edge weights corresponding to co-occurrence frequencies in prompt–response pairs. Statistics on the network were calculated using igraph (v0.11.6)¹⁰⁰.

Evaluation of SynGenome DUF associations

To identify enriched DUF associations from the SynGenome domain association network (see ‘Creation of the SynGenome domain association network’), first, a series of DUFs present in the top 500,000 most connected nodes were identified. These nodes were then analysed for their associations within the network. For each domain pair involving at least one of these identified DUFs, the co-occurrence frequency, representing the observed edge weight, was directly determined from the network. Next, the relative enrichment of each DUF-connected edge was calculated by dividing the observed edge weight by the expected weight. This expected weight was calculated as the product of the sums of incident edge weights of the two connected domains, normalized by the total weight of the entire network. Co-occurrence frequency was then plotted against relative enrichment to identify domains exhibiting both high co-occurrence rates and strong association strength with each DUF.

Identification of structurally chimeric proteins in SynGenome

To identify structurally chimeric proteins among the proteins generated in SynGenome, first, all proteins in SynGenome that had significant Pfam hits (see ‘Creation of the SynGenome domain association network’) were folded using ESMFold¹⁷, resulting in approximately 3.7 million predicted structures. Sequences were then filtered for those that had pLDDT scores greater than 0.4 and clustered at 90% sequence identity using MMseqs2 (ref. ⁸²), with a representative sequence being chosen from each cluster. Predicted structures of the remaining sequences were then sharded into individual batches of 100,000 proteins and turned into Foldseek databases⁸³. Each database was subsequently aligned against itself and all other shard databases using Foldseek (v10-941cd33; e-value of 1, sensitivity of 7 and maximum sequences of 300), effectively resulting in a structural all-by-all alignment of SynGenome proteins against each other. The resultant alignments were then filtered to exclude alignments with TM-scores less than 0.3 and turned into a DuckDB for simplified querying. To rapidly identify proteins that appeared to be chimeras of other unrelated SynGenome proteins, 1% of query proteins with moderate structural connectivity (2–10 Foldseek hits) to other SynGenome proteins were randomly sampled as candidate chimeras. For each potential chimera, all pairwise combinations of its corresponding target proteins were enumerated as candidate parent pairs. These parent pairs were then filtered to remove any pairs that showed direct structural similarity to each other, retaining only those where both proteins were structurally similar to the chimera candidate protein but not to each other.

To find chimera candidates containing potential domain fusions, candidates were further filtered to retain groups where the chimera candidate contained domains from both parents while the parents had no overlapping domains. The resulting chimera candidates were then queried against the NCBI non-redundant protein database (E < 0.05), with proteins with greater than 70% sequence identity to any single database hit being removed. Domain architectures of the resultant chimeras were also cross-referenced against InterPro to determine chimera candidates that appeared to be novel domain fusions. This yielded a final set of structurally chimeric proteins with limited sequence identity to any one protein.

Creation of SynGenome website

The SynGenome website was implemented with HTML, CSS and JavaScript to provide a searchable web interface (https://evodesign.org/syngenome/). The interface allows users to query sequences using protein names, UniProt IDs, InterPro domains, species names or Gene Ontology terms. The database was structured to maintain associations between generated sequences and their corresponding prompt annotations. The website also includes the SynGenome domain association network (see ‘Creation of the SynGenome domain association network’), which can be queried using Pfam IDs, clans and keywords, and displays the 10,000 strongest prompt–response domain connections across SynGenome (https://evodesign.org/syngenome/network). The raw SynGenome data and ESMFold-predicted structures of generated proteins with significant Pfam hits are hosted on Hugging Face for public access (https://huggingface.co/datasets/evo-design/syngenome-protein-structures/tree/main and https://huggingface.co/datasets/evo-design/syngenome-uniprot).

Creation of SynGenome UMAP and Leiden clusters

To generate the UMAP, first, a random sample of 50,000 sequences encoding at least one ORF with prompts derived from the CDS was extracted from SynGenome. This random sample was subsequently filtered to remove generations with less than 40% or more than 60% GC content, resulting in a final set of 36,762 generations. Embeddings were generated for both the prompt and corresponding generated sequences by extracting activations from the MLP layer 3 in the 6th hyena block of Evo 1.5’s architecture before being mean pooled along the sequence dimension to create fixed-length representations. These high-dimensional embeddings were subsequently reduced to two dimensions using UMAP (n_neighbours = 15, min_dist = 0.1). Sequences were coloured according to their percent identity to sequences in the training data, calculated using MMseqs2 against a database of prokaryotic genomes used in model training. These high-dimensional embeddings were reduced to two dimensions using UMAP with ScanPy (v1.10.3)¹⁰¹ default parameters. Sequences were coloured according to their percent identity to sequences in the training data, calculated using MMseqs2 against OpenGenome. Graph-based clustering was performed using the default Leiden algorithm implemented in Scanpy. The resolution parameter was optimized by evaluating cluster stability across resolutions from 0.1 to 0.5, measuring both the number of clusters and the coefficient of variation of cluster sizes. A final resolution of 0.2 was selected for clustering based on these metrics.

Comparison of SynGenome to natural prokaryotic sequences

To evaluate how representative SynGenome proteins were compared with natural prokaryotic sequences, first, a random sample of 36,762 5,000-nt sequences was taken from OpenGenome to match the number of sequences used in the representative sample of SynGenome (see ‘Creation of SynGenome UMAP and Leiden Clusters’). Following the procedure used for prediction of ORFs in SynGenome (see ‘Prediction of SynGenome ORFs’), Prodigal (v2.6.3, default parameters, -p meta) was used to identify potential ORFs in the sampled sequences and minimal filtering with SegMasker was applied to remove incorrectly called sequences. Length distributions of predicted SynGenome ORFs were compared with those from the random OpenGenome sample. Codon usage bias of the prompts and SynGenome generations were analysed by calculating the total frequency of each codon across all prompt ORFs and all generated ORFs and normalizing the frequencies to the total number of codons per dataset. Protein family domains were identified in both datasets using HMMER (v3.3.0) hmmscan against the Pfam-A database (v35.0)⁸⁰ with default parameters (e-value cut-off of 0.01). Domain frequencies were compared between datasets using Fisher’s exact test with Benjamini–Hochberg multiple testing correction (minimum occurrence threshold of ten domains in both datasets). Domain frequency relationships were visualized using a log-scale scatterplot, with points coloured by absolute log₂ fold change between datasets. Correlation between domain frequencies was assessed using the Pearson correlation coefficient calculated using the pearsonr function in SciPy (v1.11.4)¹⁰². The distribution of domain occurrence frequencies was analysed by binning domains found in SynGenome and OpenGenome proteins into log-scale bins (2ⁿ occurrences) and visualizing the count of unique domains per frequency bin for each.

To evaluate the associations captured by the SynGenome domain association network compared with natural prokaryotic syntenic associations, first, a set of 15 representative bacterial and archaeal genomes and their corresponding proteins were downloaded from the NCBI. These included E. coli K-12 MG1655 (GCF_000005845.2), Bacillus subtilis subsp. subtilis str. 168 (GCF_000009045.1), Mycobacterium tuberculosis H37Rv (GCF_000195955.2), P. aeruginosa PAO1 (GCF_000006765.1), Staphylococcus aureus NCTC 8325 (GCF_000013425.1), Cyanobacterium aponinum PCC 10605 (GCF_000317675.1), Clostridium acetobutylicum ATCC 824 (GCF_000008765.1), Mycoplasma pneumoniae M129 (GCF_910574535.1), Helicobacter pylori 26695 (GCF_000307795.1), Wolbachia pipientis wMel (GCF_016584425.1), Methanocaldococcus jannaschii DSM 2661 (GCF_000091665.1), H. volcanii DS2 (GCF_000025685.1), Sulfolobus acidocaldarius DSM 639 (GCF_000012285.1), Pyrococcus furiosus DSM 3638 (GCF_008245085.1) and Nitrosopumilus maritimus SCM1 (GCF_000018465.1). All proteins from these genomes were annotated using HMMER (v3.3.0) hmmscan against the Pfam-A database (v35.0)⁸⁰ and mapped back to their corresponding Pfam clan. To characterize natural genomic co-occurrence in a manner reminiscent of the generation methods in SynGenome, co-occurring genes were defined as those found within 5,000 bp of each other, with domain associations added between all Pfam clan hits from genes meeting this criterion. Each time a given Pfam clan association was observed, the weight of the edge connecting the domains was incremented, akin to the SynGenome network construction process (see ‘Creation of the SynGenome domain association network’). To account for the much larger number of proteins and associations in SynGenome, z-score normalization was applied to the Pfam clan edge weights in both SynGenome and natural genomes. z-scores were calculated as z = (x − μ)/σ, where x represents the original edge weight, μ is the mean edge weight and σ is the standard deviation of all edge weights within the SynGenome or natural genome co-occurrence network. z-scores for shared associations were subsequently plotted against each other for all shared Pfam clan associations occurring in frequencies of more than 10.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

■ مصدر الخبر الأصلي

نشر لأول مرة على: www.nature.com

تاريخ النشر: 2025-11-19 02:00:00

الكاتب: Aditi T. Merchant

تنويه من موقع “yalebnan.org”:

تم جلب هذا المحتوى بشكل آلي من المصدر:
www.nature.com
بتاريخ: 2025-11-19 02:00:00.
الآراء والمعلومات الواردة في هذا المقال لا تعبر بالضرورة عن رأي موقع “yalebnan.org”، والمسؤولية الكاملة تقع على عاتق المصدر الأصلي.

ملاحظة: قد يتم استخدام الترجمة الآلية في بعض الأحيان لتوفير هذا المحتوى.