SARS-CoV-2 Virus and Human Leukocyte Antigen (HLA) Class II: Investigation in silico of Binding Affinities for COVID-19 Protection and Vaccine Development

SARS-CoV-2 causes COVID-19, urgently requiring the development of effective vaccine(s). Much of current efforts focus on the SARS-CoV-2 spike-glycoprotein by identifying highly antigenic epitopes as good vaccine candidates. However, high antigenicity is not sufficient, since the activation of relevant T cells depends on the presence of the complex of the antigen with a suitably matching Human Leukocyte Antigen (HLA) Class II molecule, not the antigen alone: in the absence of such a match, even a highly antigenic epitope in vitro will not elicit antibody formation in vivo. Here we assessed systematically in silico the binding affinity of epitopes of the spike-glycoprotein to 66 common HLA-Class-II alleles (frequency ≥ 0.01). We used a sliding epitope window of 22-amino-acid-width to scan the entire protein and determined the binding affinity of each subsequence to each HLA allele. DPB1 had highest binding affinities, followed by DRB1 and DQB1. Higher binding affinities were concentrated in the initial part of the glycoprotein (S1-S460), with a peak at S223-S238. This region would be well suited for effective vaccine development by ensuring high probability for successful matching of the vaccine antigen from that region to a HLA Class II molecule for CD4+ T cell activation by the antigen-HLA molecule complex. SARS-CoV-2 Virus and Human Leukocyte Antigen (HLA) Class II: Investigation in silico of Binding Affinities for COVID-19 Protection and Vaccine Development


Introduction
SARS-CoV-2 causes COVID-19, a disease that has now become a global pandemic 1,2 . Hence, there is an urgent need to develop an effective vaccine against the virus. Many efforts in that direction focus on the SARS-CoV-2 spike-glycoprotein, seeking to identify highly antigenic epitopes as good vaccine candidates 3 . However, high antigenicity is not sufficient, since the activation of relevant T cells depends on the presence of the complex of the antigen with a suitably matching Human Leukocyte Antigen (HLA) Class II molecule, not the antigen alone: in the absence of such a match, even a highly antigenic epitope in vitro will not elicit antibody formation in vivo. Therefore, there are two key factors for a vaccine to be effective: first, there should be a good match with a HLA Class II molecule, and second, the antigen should be adequately immunogenic, assuming that the individual is immunocompetent for antibody production. In the case of an actual viral infection, the mounting of antibodies against the virus protein is called "specific (or adaptive) immunity". The key agents for specific immunity and antibody production are the HLA Class II molecules. These bind to ~18-22-mer AA peptides resulting from the cleavage of the viral proteins by proteases in antigen presenting cells (APC), which include macrophages 4 . When a match occurs, the HLA Class II-peptide complex moves to the cell surface, attracts CD4 + T cells, and a complex process of antibody production is initiated 5 . Thus, the most crucial first important step is the successful match between the virus protein peptide and a Class II molecule. An individual carries 6 classical Class II alleles (2 from each DRB1, DQB1 and DPB1 genes) 4 . If these provide a good match to the peptide, the antibody production process starts immediately and, assuming that the person is immunocompetent, antibodies will be produced in time to eliminate the virus and protect from future exposures to that virus. If, on the other hand, there is no match, antibodies cannot be produced and there will not be specific immunity for that virus. In fact, complete lack of Class II molecules, as it occurs in a rare genetic disorder (primary immunodeficiency disorder 6 ) is a devastating disease characterized by uncontrolled viral infections and poor prognosis. Finally, the possibility and quality of the match critically depend on the affinity of the virus peptides and the HLA Class II molecule. Indeed, in this study, we examined systematically and exhaustively the SARS-CoV-2 virus spike glycoprotein to determine the binding affinities of epitopes to 66 common HLA Class II alleles and then identify regions in the protein with high such affinities as good candidates for vaccine development. The structure of the spike protein (S) of SARS-CoV-2 has been recently solved using cryo-electron microscopy 7 and its sequence is known 8 . Specific attention has been directed towards the spike glycoprotein of the SARS-CoV-2 since it represents a major binding site for cells via their ACE2 receptors in a number of recent studies that even revealed its crystal structure 2,9,10 .

Materials and Methods
The main objective of this study was to exhaustively assess the binding affinities of HLA Class II molecules to the SARS-CoV-2 spike glycoprotein. For that purpose, we assessed the binding affinities of 66 Class II alleles, as follows.

HLA alleles
For this study, we selected the more frequent alleles of classical HLA Class II genes (DPB1, DQB1, DRB1), namely all alleles with global frequencies ≥ 0.01, an arbitrary but reasonable threshold. For that purpose, we obtained an Estimation of Global Allele Frequencies by querying the website http://www.allelefrequencies.net/ 11 . The alleles with frequencies ≥ 0.01 that we used are given in Table 1. They comprised 21, 15 and 30 alleles of DPB1, DQB1 and DRB1 genes, respectively.

SARS-CoV-2 spike glycoprotein
The amino acid (AA) sequence of the SARS-CoV-2 spike glycoprotein ("glycoprotein") was retrieved from the UniprotKB database 12 . It consists of 1273 AAs.

Partitioning the SARS-CoV-2 spike glycoprotein
As mentioned above, the main objective of this study was to assess exhaustively the binding affinities of HLA Class I and II molecules to the SARS-CoV-2 spike glycoprotein. For that purpose, we used a sliding epitope window approach 13 to partition the sequence of the spike glycoprotein into subsequences of all possible consecutive 22-mers for (e.g. residues S1-S22, S2-S23, …, S1252-S1273) that covered the entire sequence length (1273 AA). The method is illustrated in Figure 1. More specifically, a set of 22-AA-length subsequences was generated (number of subsequences = length of spike glycoprotein -22); thus the number of subsequences was 1273-22 = 1251. Subsequences were collected and queried in the IEDB database (www.iedb.org) 14 in order to determine their binding affinity to a specific HLA Class II molecule. Binding affinity predictions were obtained using the NetMHCIIpan method 14 . For each 22-mer, a binding affinity score was predicted and reported as a percentile rank by comparing the peptide's score against the scores of five million random n-mers selected from the SwissProt database 11 . Smaller percentile ranks indicate higher binding affinity. Therefore, the minimum percentile rank (i.e. highest affinity) for each allele and 22-mer of the spike glycoprotein was found and  kept. This yielded 1251 values for each allele.

Data analysis
We used two measures to quantify the binding affinity of an allele to the glycoprotein: (a) the value with the lowest 22-mer minimum rank (of the total 1251 values), called Lowest Percentile Rank (LPR), and (b) the percent of percentile rank values less or equal 5% (PR5). LPR is a unique value, and, therefore, prone to chance fluctuations, whereas PR5 is a more stable measure with the reasonable interpretation as the probability of high binding affinity of the allele at the 5% threshold. (Exploratory analyses using a 10% threshold yielded very similar results.) The data were then analyzed to assess (a) the effect of Gene on LPR and PR5, and (b) the differential variation of Generelated effect on LPR and PR5 along the AA sequence of the glycoprotein to determine epitopes of high binding affinity to specific alleles.

Analyses using bootstrap
Since each individual carries 6 HLA Class II alleles of the classical genes (two of each DPB1, DQB1, DRB1), we simulated by bootstrap sets of individual cases by combining randomly six alleles (two from each of the three genes, DPB1, DQB1, and DRB1) using ad hoc computer programs in Microsoft Visual Intel FORTRAN (version 2019).
Application to single individuals. As mentioned above, every individual carries 6 classical HLA Class II alleles, 2 from each gene (DPB1, DQB1, DRB1). For a specific individual k, a quantitative estimate of their ability to make antibodies against SARS-CoV-2 would be the average PR5 of the 6 alleles carried by that individual. Let be the PR5 for allele i. Then (2) We estimated basic statistics for using a bootstrap procedure, as follows. Let M DPB1 , M DQB1, M DRB1 be the number of available alleles in the DPB1, DQB1 and DRB1 genes, respectively, in the population; in this study, for the most common (f ≥ 0.01) alleles globally, we have: M DPB1 = 21, M DQB1 = 15 , and M DRB1 = 30. We generated 1000 bootstrap values by selecting randomly (with replacement, allowing for homozygotes) two alleles from each set of the three genes (3) and (4) where the asterisk (*) denotes a bootstrap sample.
Application to populations. Here we extended the analysis above on single individuals to populations, by taking into account the frequency of allele i in the population. For that purpose, we computed an estimate of the weighted (by ) binding affinity for an individual k in the population using the bootstrapping procedure above: (5) where the asterisk (*) denotes a bootstrap sample. We then computed the sum of the corresponding frequency weights: (6) and (7) We then computed the mean of weighted as follows (N = 1000 bootstrap samples).
Association between binding affinity and allele frequency in populations. Here we investigated the possible association between the average binding affinity (equation 2) and the corresponding average allele frequency (equation 7) using a linear regression analysis.
As a control, we paired the observed to randomly chosen (from the same gene set) and performed the same analysis.
Association between binding affinity and allele frequency in single genes. Here we investigated the possible association between allele frequency and binding affinity (PR5) for each one of the 3 genes above using a linear regression analysis.

General statistical methods
Standard statistical methods were employed in these analyses, including analysis of variance (ANOVA) and linear regression, using the IBM-SPSS statistical package (version 26). Figure 2 illustrates the results of the sliding epitope window analysis for 3 HLA Class II alleles. Fig. 2A shows the percentile rank values for all 1251 epitopes for an allele with high binding affinities (DRB1*13:02) and another with low binding activities (DQB1*02:02). Fig. 2B plots the percentile rank values for 2 HLA Class II alleles that differ only by one amino acid (DRB1*13:01, DRB1*13:02) and attests to the importance of HLA Class II molecule structure on binding affinity to SARS-CoV-2: DRB1*13:02 has much higher binding activities than DRB1*13:01. Figure 3 illustrates results on LPR and PR5 for two alleles, one with high (left panel) and another with low (right panel) binding affinities. Figure 4 shows the frequency distributions of LPR (Fig. 4A) and PR5 (Fig. 4B), and their joint distribution (Fig. 4C). LPR and PR5 were significantly associated; more specifically, PR decreased with the logarithm of PR5 (linear regression of LPR against ln(PR5), F [1,64] = 145.2, P = 4.1x10 −18 ). Tables 2 and 3 give the LPR and PR5 values of the 66 alleles studied.

Localization of allele binding affinities along the SARS-CoV-2 spike glycoprotein sequence
Lower LPR values (i.e. higher binding affinities) were concentrated in the initial part of the glycoprotein (approximately between S1-S480 AA residues) ( Figure 6, left panel); a cumulative plot is shown in Fig. 6, right panel. Similarly, high PR5 values were more frequent in the initial part of the glycoprotein (approximately S1-S501; Figure 7).

Relevance of allele-SARS-CoV-2 binding affinities to single individuals
The frequency distribution of 1000 average binding   Figure 8 (left panel) and their rank-ordered values in Fig. 8 (right panel). The mean and median of were 4.97 and 4.96, respectively. These results indicate that ~50% of the individuals in the simulated population possess binding affinities to SARS-CoV-2 spike glycoprotein greater than the average binding affinity of the whole sample.

Relevance of allele-SARS-CoV-2 binding affinities to whole populations
In this analysis, we extended the findings above on single individuals to populations, by taking into account the frequency of allele i in the population by computing an estimate of the weighted (by ) binding affinity for an individual k in the population using the bootstrapping procedure described in the Methods section. We found that = 5.01, a value very close to the mean (= 4.97) and median (= 4.96) of the distribution of .
In the analyses above, we used the global frequencies of the 66 alleles. However, those frequencies differ from population to population (country, ethnicity, etc.). In general, the quantitative ability to mount antibodies in a population depends (a) on the frequency of those alleles in the population, and (b) on the binding affinity of alleles in the population. Therefore, for accurate estimates of in various populations, the frequencies of all HLA Class II alleles in the population should be known.

Association between binding affinity and allele frequency in populations
Here we investigated the possible association between the average binding affinity (equation 2) and the corresponding average allele frequency (equation 7). We found that the average binding affinity was positively associated with average allele frequency : the higher the average binding affinity of the 6-allele set, the higher the average frequency of the alleles in the set (Figure 9, left panel; r = 0.129; slope (beta) ± SE = 0.0348 ± 0.000846 t [998] = 4.12, P = 0.000042). This means that a 1% increase in is associated with an increase of 0.0348 in . This can be interpreted as reflecting an evolutionary pressure for increasing allele frequency with overall higher affinities to SARS-CoV-2, with a presumed survival advantage. As a control, we paired the observed to randomly chosen (from the same gene set); no significant association was found (Fig. 9, right panel; r = 0.027, P = 0.389). The results above came from a single bootstrap. We carried out 1000 such bootstraps to obtain a long-term expected estimate of the slope of the dependence of on . The mean slope (± SD) was 0.002815 ± 0.00087 per .

Association between binding affinity and allele frequency in single genes
The results above come from simulated HLA makeup of individuals, consisting of 6 alleles (2 from each gene DPB1, DQB1, DRB1). The significant dependence of average allele frequency on the average binding affinity (Figure 9, left panel) suggests that specific genes might be driving this dependence. We tested this hypothesis by performing a linear regression analysis between allele frequency and binding affinity for each one of the 3 genes above. We found a highly significant correlation only for the DPB1 gene ( Figure 10; r = 0.589, P = 0.005, N = 21 alleles). In    contrast no significant association was found for the DQB1 gene (r = 0.022, P = 0.938, N = 15) or the DRB1 gene (r = 0.153, P = 0.421, N = 30). These findings indicate that the hypothesized evolutionary pressure has been exerted through the DPB1 gene.

Discussion
Here we investigated the relations between SARS-CoV-2 spike glycoprotein and 66 common HLA Class II alleles. Our study was inspired by the known dependence of antibody production against a foreign antigen on a good match between the antigen and HLA Class II molecules. There were four major aims of this study, as follows. First, to investigate exhaustively the binding affinity of suitable fragments (epitopes) of the spike glycoprotein to the different 66 HLA Class II alleles; second, to identify portions of the spike glycoprotein with high binding affinity to HLA Class II molecules, as good candidates for vaccine development; third, to derive a quantitative estimate of the ability of an individual to mount antibodies against that protein, as happens during an epidemic or following vaccination; and fourth, to gain an insight into a possible evolutionary pressure by alleles with high binding affinity (and, therefore, the successful production of antibodies and protection from COVID-19) to occur at higher frequencies, i.e. a positive association between binding affinity and allele frequency. We discuss the findings on these objectives below.

Methodological considerations
The predicted binding affinities were based on an algorithm that considers existing experimental data 14 . Moreover, synthetic peptides which contain amino acid sequences of proteins without post-translational modifications such as glycosylation have been used extensively and have proven an excellent type of molecule for the mimicry of protein sites: such peptides can be generated as exact copies of protein fragments, as well as in diverse chemical modifications 16 . An example of the usefulness of this approach 17 was the identification of peptides mimicking antibody binding to the Epidermal Growth Factor Receptor.
Studies of binding affinities of HLA Class II molecules to a protein typically focus on specific fragments (epitopes) of the protein, based on a priori considerations of interest to the particular fragment. In contrast, here we sought to assess systematically the binding affinities of all continuous epitopes of the SARS-CoV-2 spike glycoprotein with each of 66 common (frequency ≥ 0.01) HLA Class II alleles. For that purpose, we used a sliding epitope window approach 13 in which a sliding epitope of 22 AA sequence length (suitable for HLA Class II molecule binding) scanned the whole glycoprotein sequence of 1273 AA, for a total of 1251 epitopes. For each epitope, we determined the minimum rank of its binding to each of the 66 alleles used in this study, as compared to 5 million peptides of the same length; the smaller the minimum rank, the better the binding affinity. This procedure enabled us to (a) evaluate binding affinities of individual alleles, (b) compare binding affinities among the 3 classical HLA Class II genes (DPB1, DQB1, DRB1), (c) identify epitopes with high binding affinity along the glycoprotein sequence, and thus (d) evaluate the ability of individuals and populations carrying these alleles to mount antibodies against the spike glycoprotein, assuming that those epitopes are of adequate antigenicity and that individuals are immunocompetent.

Location of glycoprotein epitopes with high binding affinity to HLA Class II molecules
We found that more than one-half of high affinity binding epitopes were located within the first 500 AA of the glycoprotein (Figs. 6 and 7), with two hot spots in the interval S200-S400 AA. Therefore, vaccines containing epitopes from that part of the glycoprotein would have a high chance of good match with HLA Class II molecules, the first step in initiating antibody production. These findings are in contrast with a recent paper 3 which described the most "immunodominant" epitopes in amino acid positions S510-S570 of the S protein, all of which have low affinities in our analysis (Figs. 6 and 7). Although the authors discuss the concept of immunodominance, it should be pointed out that effective immunogenicity which they assessed on T and B cells from mice is necessary but not sufficient. The reason lies in the fact that T cells are the first to interact with the antigen-HLA molecule complex, not the antigen alone: in the absence of such a match, even a highly antigenic epitope in vitro will not elicit antibody formation in vivo. Therefore, binding affinities of glycoprotein epitopes to HLA Class II molecules are crucial for effective vaccine development.

Application to individuals and populations
An important aspect of this study focused on determining the ability of an individual for a good match between spike glycoprotein epitopes and their own HLA Class II molecules. Assuming that individuals are immunocompetent otherwise (i.e. they are not immunocompromised due to disease, age, or drug intake), the goodness of the match above is crucial in initiating antibody production because it the antigen epitope -HLA Class II molecule complex that activates CD4+ T lymphocytes. Now, every individual carries 2 from each of the 3 classical HLA Class II genes (DPB1, DQB1, DRB1), for a total of 6 such alleles. We combined that information with the binding affinity of each allele we determined in this study, to derive estimates of the "goodness of match" of an individual, and then extended this analysis to populations of individuals using a bootstrap procedure. There were three major findings from this analysis. First, the goodness of match of an individual can be estimated and ranked within the population, thus providing information concerning the rank of that individual among others with respect to that measure. This, in turn, will help identify individuals who may not be able to produce an adequate number of antibodies following SARS-CoV-2 infection or vaccination. Second, the goodness of match of a population overall can be estimated and used as a predictor of the expected antibody response in a natural setting (epidemic) or following vaccination. It should be noted that both cases above, regarding individuals and populations, rely on two factors, namely the HLA Class II genetic makeup and the binding affinity of an allele in that makeup. For an individual, both of these factors are known and, therefore, it is straightforward to compute the goodness of match. However, for a population, an additional critical element is the frequency of occurrence of an allele, information that has a major bearing on estimating the goodness of match in the population as a whole, since allele affinities will have to be weighted by their frequencies in the population (see equation 5 above). In the present study, we used 66 HLA Class II alleles with global frequencies ≥0.01. However, there are many more alleles with lower frequencies, and, even more important, the frequencies of various alleles differ across populations (countries, ethnicities, etc.). Unfortunately, complete lists of HLA Class II alleles are not available for any population because it has never been a useful goal to obtain such a list; instead, most information comes from studies related to organ transplantation. Be that as it may, here we outlined a rigorous procedure that will provide the population estimate of goodness of spike glycoprotein-HLA Class II needed: additional information about added alleles will naturally increase the accuracy of the estimate but will not affect the procedure.

Association between allele frequency and binding affinity
We found a significant positive association between allele frequency and binding affinity, such that average higher allele frequencies (across 6 alleles of the HLA Class II genetic makeup) were associated with higher average binding affinities (Fig. 9). This suggests an evolutionary pressure such that alleles conferring higher protection against SARS-CoV-2 spike glycoprotein ultimately became more frequent. A further investigation of this relation in individual genes revealed a highly significant positive association between single allele frequency and binding affinity only for the DPB1 gene (Fig. 10). The source of this postulated evolutionary drive is unclear, since SARS-CoV-2 is a novel virus. A possible explanation would be that epitopes of the SARS-CoV-2 spike glycoprotein are shared with those of other proteins of virus that have been present for a long time. This could be a plausible explanation for the current observation that exposure to common cold (presumably caused by other corona viruses) may protect against COVID-19 19 . We are currently investigating in our lab this general idea by comparing the SARS-CoV-2 spike glycoprotein to proteins of other upper respiratory viruses, and evaluating their binding affinities to the same 66 alleles to find out a possible common association with alleles of DPB1. If this hypothesis is true, age-long upper respiratory epidemics may have provided the evolutionary pressure for selecting HLA Class II molecules with high binding affinity to such epitopes, thus conferring a survival advantage to their carriers.

Relevance to the COVID-19 pandemic
A pivotal role of HLA Class II in the ability to mount antiSARS-CoV-2 antibodies was also indicated in a study of a cohort of convalescent patients who developed antibodies against different areas of the spike protein of the virus (S1, RBD, and S2 regions) with variable titers of antibodies. More specifically, ≈30% of recovered mild COVID-19 patients generated a deficient level of neutralizing antibody titers; in 10 of the 175 patients, the level was below the limit of detection while in contrast, two patients showed very high titers. This finding strongly suggests that the variable ability to mount antibodies to the virus should depend on HLA genotype, at least in part. The authors concluded that the interplay between virus and host immune response in coronavirus infections should be further explored for the development of effective vaccine against SARS-CoV-2 20 , a point addressed in detail herein.
As to the duration of anti-virus antibodies, these reached a maximum in the first 20 days of infection in a cohort of 70 patients who had variable antibody titers, being highest during day 31-40 since onset, and decreasing slightly afterwards 20 . Thus, changes of antibody levels and their duration were individual-specific. The authors concluded that the titers of antibodies and how long any immunity in patients will last is a key question to be addressed for safe and effective antiviral treatments and vaccines in the future 20 ; this further indicates differences between individuals depending on their ability to mount and sustain an optimal antibody response pointing to the paramount significance of the HLA genotype.
Finally, the published information regarding different antibodies raised by COVID-19 patients strengthens the importance of our data which demonstrates specific S-domain virus peptides binding with high affinity to different HLA class II alleles. This information is critical for future vaccine development, which will serve for (1) strategic clinical management; (2) evaluation of vaccination efficiency in different individuals in the general population; (3) assignment of clinical professional and managerial teams interacting with COVID-19 patients 20 .