Epitope-based Multi-variant SARS-Cov-2 Vaccine Design: Shared Epitopes Among the Natural SARS-Cov-2 Spike Glycoprotein and 5 of its Variants (D614G, α , β , γ , δ ) with High in Silico Binding Affinity to Human Leukocyte Antigen (HLA) Class II Molecules

Page 9 of 14 Epitope-based Multi-variant SARS-Cov-2 Vaccine Design: Shared Epitopes Among the Natural SARS-Cov-2 Spike Glycoprotein and 5 of its Variants (D614G, α, β, γ, δ) with High in Silico Binding Affinity to Human Leukocyte Antigen (HLA) Class II Molecules Spyros A. Charonis1,2, Apostolos P. Georgopoulos1,2* 1The HLA SARS-CoV-2 Research Group, Brain Sciences Center, Department of Veterans Affairs Health Care System, Minneapolis, MN 55417, USA 2Department of Neuroscience, University of Minnesota Medical School, Minneapolis, MN 55455, USA

mutations and deletions of important SARS-CoV-2 variants are documented in several online repositories, with the CDC database (https://www.cdc.gov/coronavirus/2019ncov/variants/variant-info.html) being used in this study.

HLA Alleles
For this study, we selected the more frequent alleles of classical HLA Class II genes (DPB1, DQB1, DRB1), namely all alleles with frequencies ≥ 0.01, an arbitrary but reasonable threshold. For that purpose, we obtained an Estimation of Global Allele Frequencies by querying the relevant website 20 . The alleles with frequencies ≥ 0.01 that we used are listed in Table 2. They comprised 21, 15 and 30 alleles of DPB1, DQB1 and DRB1 genes, respectively.

Partitioning the SARS-CoV-2 Spike Glycoprotein Variants
The amino acid sequences of five viral spike proteins (S natural , S D614G , S B.1.1.7 , S B.1.351 , and S P.1 ) ( Table 1) were retrieved from the UniprotKB database 21 . The amino acid sequence of the more recent Indian variant spike glycoprotein S P.167.2 was retrieved from the NCBI SARS-CoV-2 data hub 22 . The retrieved sequence (GenBank Acc: QWU05442.1) was matched by filtering the search for SARS-CoV-2 spike glycoprotein sequences from the B.167.2 lineage originating in India.
Each viral sequence was queried for binding affinity against 66 common HLA Class II alleles (Table 2). A sliding epitope window approach 23 was used to partition the sequence of the spike glycoprotein for each variant. Partitioning was done in a manner to obtain all possible consecutive linear 15-, 18-and 22-mers (e.g. for 15-mers residues 1-15, 2-16, …, n-15 where n=sequence length) that cover the entire sequence length (Fig 1). These peptide lengths are in the range of suitable lengths for binding with HLA Class II molecules 6 .
The partitioning was implemented in a Python script (version 3.8). All n-mers were queried in the IEDB database 24 in order to determine their binding affinities to a set of 66 HLA Class II receptor molecules. Binding affinity predictions were obtained using the NetMHCIIpan method 25 . For each n-mer, a binding affinity score was predicted and reported as a percentile rank by comparing the peptide's score against the scores of five million random n-mers selected from the UniProt database 21 . Smaller percentile ranks indicate higher binding affinity. For each gene locus (e.g. DRB1) and spike protein variant (e.g. Indian delta variant), all alleles and n-mers (formatted as a FASTA sequence alignment) were entered as a single query and, thus, the same set of 5 million random n-mers was employed to rank all queried alleles. Altogether, for each allele, 7544 15-mers, 7526 18-mers, and 7502 22mers were tested for a total of 22572 n-mers x 66 alleles = 1498464 tests. Smaller percentile ranks indicate higher binding affinity; therefore, the lowest (i.e. minimum) percentile rank (LPR) for each allele (corresponding to the highest binding affinity) and n-mer of each spike glycoprotein was retrieved. Finally, we employed the most conservative threshold of LPR = 0.01 (the lowest LPR returned by NetMHCIIpan) to identify selected epitope sequences (n-mers) with LPR = 0.01 (highest affinity) for further analysis.
The locations in the primary sequence of all 15-, 18-and 22-mers with LPR = 0.01 were tabulated and the n-mers that overlapped with the receptor-binding domain (RBD) of the spike glycoprotein identified. This sequence-based quantity was calculated as the number of

Results
We identified 18 epitope sequences which occurred in all 6 spike glycoproteins and had HLA binding affinities of LPR = 0.01 for at least one allele. Table 3 shows the AA sequence of each epitope and their position in the relevant glycoprotein sequence. Table 4 shows the alleles for which a sequence had very high in silico binding affinity of LPR = 0.01, together with the sum of the population frequencies of those alleles as an estimate of global population coverage; this estimate will vary among different populations, depending on the allele frequency specific for a particular population. Finally, the fraction of overlap of each sequence with the RBD region is given in Table 5. Remarkably, all 4 sequences with most high binding affinities (#2, 3, 4, 12, in bold in Table 4) highly overlapped with the RBD region.

Alleles Involved
Of the total 66 alleles tested (

Epitope n-mer AA Sequence
Allele number (from Table 2 Table 4. HLA alleles for which a sequence had a very high in silico binding affinity of LPR = 0.01. The cumulative allele frequency is the sum of the global frequencies of the corresponding alleles, given in Table 2. The sequences with higher population coverage are in bold. Charonis Table 4) offering high coverage; estimates of population coverage would vary for different populations, depending on the frequencies of the alleles involved in a particular population. Remarkably, the position of all those 4 sequences in each glycoprotein greatly overlaps with the RBD region.
Finally, with respect to the HLA Class II genes involved in this high affinity binding to epitopes of the 6 SARS-CoV-2 spike glycoproteins, the DPB1 gene provided the highest percentages of alleles involved (Table 6). In contrast, both the DQB1 and DRB1 genes were much less involved. The prominent involvement of the DPB1 gene is in keeping with our previous finding that the frequency of alleles of the DPB1 gene is positively associated with the binding affinity of epitopes of the SARS-CoV-2 natural spike glycoprotein 23 . This may reflect an evolutionary pressure favoring the selection of the DPB1 gene due to its presumed success in binding to coronaviruses, thus aiding survival and conferring evolutionary advantage.

Limitations
The main limitation of these results is that they were derived using an in silico methodology. We believe that this approach is justified given the infeasibility of performing those binding affinity assessments experimentally.

Acknowledgments
Partial funding for this study was provided by the University of Minnesota (the Brain and Genomics Fund and the American Legion Brain Sciences Chair). The sponsors had no role in the current study design, analysis or interpretation, or in the writing of this paper. The contents do not represent the views of the U.S. Department of Veterans Affairs or the United States Government.  Table 5. RBD overlap of the 18 epitopes for S natural and its 5 variants in Table 1.  Table 6. HLA alleles involved in very high in silico binding affinity of LPR = 0.01 with the 18 sequences.