The authors have declared that no competing interests exist.
An algorithm is designed to determine the possible mutations that can occur in the S protein of SARS-CoV-2, the virus responsible for Covid-19 in humans. To do that, nine three-dimensional structures available in the Protein Data Bank, with sequences similar to the initial strain sequenced in Wuhan (December 2019), are identified. The conditions driving a potential mutation are: (1) an accumulated number of mutations greater than or equal to 5 at each position; (2) an accumulated variation of Gibbs free energy below -2.0 kcal/mol; and (3) a squared fluctuation greater than 1.6 Å, obtained from normal mode analysis based on anisotropic network models (ANM) after averaging the first 20 vibration modes. The result is that 491 positions can mutate, while 424 positions did not yield any mutation. Finally, the results reveal that some observed mutations cannot be predicted, so further studies are needed to determine why they are present in the human population.
At the end of December 2019, the first cases of Covid-19 were registered in patients from the Huanan Seafood Wholesale Market in the city of Wuhan (China), who presented with a new atypical pneumonia, fever, cough and, in the most severe cases, dyspnea and bilateral lung infiltration. In view of this, on December 31, the Wuhan Municipal Health Commission reported the incident to the World Health Organization (WHO).
The genome of the virus was made public in January 2020.
This paper focuses on the S protein because it is involved in the entry of the virus into the host cell through its receptor, and it is therefore a target for the design of possible vaccines against Covid-19. Recall that the S protein of SARS-CoV-2 consists of approximately 1,273 amino acids (aa), slightly more than that of SARS-CoV (1,255 aa).
To date, the emergency use of several vaccines, including Sputnik V from Russia and Sinopharm from China, has been authorized. They are based on the sequences obtained in the initial studies of the virus, but a range of mutations can occur that may render them less effective over time. Hence the need to predict possible mutations that can modify the effectiveness of vaccines.
The first mutations registered in Brazil were found in a patient in Rio de Janeiro infected in October 2020.
The second variant in Brazil has the following mutations in the S protein: L18F, T20N, P26S, D138Y, R190S, K417T, N501Y, D614G, H655Y, and V1176F.
In view of this, nine three-dimensional (3D) structures of the S protein were selected to avoid any ambiguity in the results. Subsequently, all the possible mutations at each position of the S protein were explored using calculations of the variation of Gibbs free energy (ΔΔG).
Starting from one of the sequences published after the incident at the Wuhan market in December 2019, nine structures of the S protein were selected from the Protein Data Bank (www.pdb.org) using the tools of the NCBI portal (blast.ncbi.nlm.nih.gov/). The selected sequences were required to have a similarity higher than 98% with respect to the Wuhan sequence.
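The 98% filter can be illustrated with a minimal sketch. It uses percent identity over aligned positions as a simple stand-in for the similarity reported by BLAST, which is an assumption; the function names and the toy sequences are hypothetical and are not taken from the original work.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity between two pre-aligned sequences of equal length.

    Columns containing a gap ('-') in either sequence are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def select_hits(query: str, hits: dict, threshold: float = 98.0) -> list:
    """Names of hits whose identity to the query is strictly above the threshold."""
    return [name for name, seq in hits.items()
            if percent_identity(query, seq) > threshold]


# Toy data only: a 100-residue query and two made-up hits.
query = "A" * 100
hits = {
    "hit_99": "A" * 99 + "G",     # 99% identical, kept
    "hit_97": "A" * 97 + "GGG",   # 97% identical, discarded
}
selected = select_hits(query, hits)
```

In practice the identity values would come directly from the Blastp report rather than being recomputed, so this sketch only shows the thresholding step.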
To consider a mutation as valid, two of the following three conditions should be met: (1) the position accumulates five or more possible mutations (about half the number of selected sequences); (2) the amino acid presents a squared fluctuation equal to or greater than 1.6 Å, obtained from anisotropic network model (ANM) calculations; and (3) the accumulated variation of Gibbs free energy is less than -2.0 kcal/mol.
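The selection rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the scripts used in the work: the class and function names are hypothetical, "two of three" is read as "at least two", and the two example rows reuse the values reported for D614 and Y200 in the results, assuming the last column of that table is the squared fluctuation.

```python
from dataclasses import dataclass

# Thresholds quoted in the text; the constant names are illustrative.
MIN_MUTATIONS = 5      # accumulated number of predicted mutations
MAX_DDG_ACCUM = -2.0   # kcal/mol; accumulated ΔΔG must be below this value
MIN_SQ_FLUCT = 1.6     # Å; ANM squared fluctuation must reach this value
REQUIRED = 2           # assumption: "two of three" means at least two


@dataclass(frozen=True)
class PositionSummary:
    position: str
    n_mutations: int
    ddg_accum: float
    sq_fluct: float


def criteria_met(p: PositionSummary) -> tuple:
    """Evaluate the three conditions in the order (1), (3), (2) of the text."""
    return (
        p.n_mutations >= MIN_MUTATIONS,
        p.ddg_accum < MAX_DDG_ACCUM,
        p.sq_fluct >= MIN_SQ_FLUCT,
    )


def is_candidate(p: PositionSummary, required: int = REQUIRED) -> bool:
    """True when at least `required` of the three conditions hold."""
    return sum(criteria_met(p)) >= required


d614 = PositionSummary("D614", 42, -12.03, 1.7)
y200 = PositionSummary("Y200", 0, 0.0, 3.1)
print(is_candidate(d614), is_candidate(y200))
```

With these inputs D614 satisfies all three conditions, while Y200 satisfies only the fluctuation condition and is therefore not flagged.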
The first mutations registered after the initial outbreak in Wuhan are analyzed.
The amino acid sequence of the S protein used in this work corresponds to one of the first cases registered in Wuhan (China) in December 2019, with NCBI ID MN908947. Nine structures deposited in the Protein Data Bank (PDB) were then randomly selected from the results returned by the Blastp program.
The nine selected sequences correspond to the following PDB identifiers: 7JJI, 6VSB, 7KDI, 7KDJ, 6ZOW, 6XCM, 7CWL, 7K8S, and 7C2L. The next step was to calculate, for each of them, all the possible mutations from the variation of Gibbs free energy computed with the PoPMuSiC program.
It should be noted that a total of 177,463 different Gibbs free energy calculations had to be analyzed for the nine selected sequences, so small scripts were written in Python to process those results.
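A script of this kind might look like the following sketch. The record layout (structure, position, mutant residue, ΔΔG) is hypothetical and does not reproduce the actual PoPMuSiC output format, and the rule that a substitution counts as a possible mutation when its ΔΔG is negative is an assumption, since the text does not state the cutoff. The records are made up for illustration.

```python
def accumulate(records, ddg_cutoff=0.0):
    """Per-position count of accepted substitutions and their accumulated ΔΔG.

    records: iterable of (pdb_id, position, mutant, ddg) tuples.
    ddg_cutoff: a substitution is accepted when ddg < ddg_cutoff
                (assumed rule, not stated in the text).
    Positions with no accepted substitution are reported as (0, 0.0).
    """
    totals = {}
    for _pdb_id, position, _mutant, ddg in records:
        count, ddg_sum = totals.setdefault(position, (0, 0.0))
        if ddg < ddg_cutoff:
            totals[position] = (count + 1, ddg_sum + ddg)
    return {pos: (n, round(s, 2)) for pos, (n, s) in totals.items()}


# Made-up records for two structures and two positions.
records = [
    ("PDB_A", "X10", "GLY", -0.50),
    ("PDB_A", "X10", "CYS", -0.25),
    ("PDB_A", "X10", "TRP", 1.30),   # rejected: ΔΔG is not negative
    ("PDB_B", "X10", "GLY", -0.75),
    ("PDB_B", "X11", "ALA", 0.40),   # rejected: position ends with no mutation
]
summary = accumulate(records)
print(summary)
```

Accumulating over all structures at once gives one row per position, which is the shape of the summary needed to apply the selection conditions.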
The calculations revealed that 424 positions do not present any mutation in the nine sequences selected (indicated in
Position | Predicted mutations | ΔΔG accum (kcal/mol) | Squared fluctuation (Å) |
Y28 | 14 | -1.58 | 1.8 |
A67 | 16 | -2.96 | 3.2 |
N74 | 14 | -2.29 | 4.7 |
W152 | 1 | -0.04 | 4.8 |
Y200 | 0 | 0.00 | 3.1 |
R273 | 44 | -6.29 | 1.4 |
F275 | 0 | 0.00 | 1.1 |
L276 | 0 | 0.00 | 1.1 |
E298 | 14 | -1.50 | 1.0 |
K300 | 63 | -31.64 | 1.1 |
T302 | 38 | -6.62 | 1.2 |
G485 | 6 | -0.91 | 23.2 |
A570 | 2 | -0.27 | 4.8 |
D614 | 42 | -12.03 | 1.7 |
A653 | 36 | -10.00 | 1.3 |
L752 | 0 | 0.00 | 6.5 |
P812 | 6 | -0.37 | 2.3 |
I818 | 0 | 0.00 | 1.1 |
G838 | 15 | -5.18 | 3.5 |
F1103 | 0 | 0.00 | 3.4 |
V1104 | 0 | 0.00 | 3.1 |
Some of these mutations cannot be predicted, namely those at positions Y200, F275, L276, L752, I818, F1103, and V1104, for which none of the nine sequences yields a predicted mutation.
To illustrate the accumulated value of the Gibbs free energy variations (ΔΔG accum), the results obtained at position D614 are examined. This position has been found to mutate to Gly (G). The results obtained in each of the nine sequences are as follows:
6VSB: -1.04, CYS, PHE, GLY, HIS, ASN
6XCM: -2.11, CYS, PHE, GLY, HIS, ASN, TYR
7K8S: -1.35, CYS, GLY, HIS, ASN
7C2L: -0.58, CYS, GLY, HIS, ASN, VAL
7KDJ: GLY, 0,
7KDI: GLY, 0,
7JJI: -2.61, CYS, PHE, LEU, TRP, TYR
6ZOW: -1.48, CYS, PHE, GLY, HIS, ASN, PRO, THR, VAL, TYR
7CWL: -2.86, CYS, PHE, GLY, HIS, LEU, MET, ASN, TYR
Here, the 6VSB sequence predicts five possible mutations (Cys, Phe, Gly, His, and Asn), and the sum of the five corresponding variations of Gibbs free energy is -1.04 kcal/mol. The 7K8S sequence predicts four mutations with a cumulative Gibbs energy variation of -1.35 kcal/mol, and so on. Therefore, at position 614, 42 mutations are predicted across seven sequences, while two sequences (7KDI and 7KDJ) do not predict any mutation.
When reviewing the mutations predicted in each sequence, the most frequent amino acid is Gly (G), which occurs 8 times in the list above: in six of the seven sequences with predictions, plus the GLY entries listed for 7KDJ and 7KDI. Cys (C) appears 7 times, once in each of the seven sequences. The accumulated Gibbs energy is -12.03 kcal/mol, the sum of -1.04, -2.11, -1.35, -0.58, -2.61, -1.48, and -2.86.
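The arithmetic for position D614 can be checked with a short sketch that uses only the values listed above. The dictionary layout and variable names are illustrative.

```python
from collections import Counter

# Per-structure results at position D614, copied from the list above.
# 7KDJ and 7KDI are omitted: they are listed with GLY and no predicted mutation.
d614 = {
    "6VSB": (-1.04, ["CYS", "PHE", "GLY", "HIS", "ASN"]),
    "6XCM": (-2.11, ["CYS", "PHE", "GLY", "HIS", "ASN", "TYR"]),
    "7K8S": (-1.35, ["CYS", "GLY", "HIS", "ASN"]),
    "7C2L": (-0.58, ["CYS", "GLY", "HIS", "ASN", "VAL"]),
    "7JJI": (-2.61, ["CYS", "PHE", "LEU", "TRP", "TYR"]),
    "6ZOW": (-1.48, ["CYS", "PHE", "GLY", "HIS", "ASN", "PRO", "THR", "VAL", "TYR"]),
    "7CWL": (-2.86, ["CYS", "PHE", "GLY", "HIS", "LEU", "MET", "ASN", "TYR"]),
}

# Total number of predicted mutations over the seven structures.
n_mutations = sum(len(mutants) for _, mutants in d614.values())

# Accumulated ΔΔG (kcal/mol), rounded to the two decimals used in the text.
ddg_accum = round(sum(ddg for ddg, _ in d614.values()), 2)

# How often each residue is predicted across the seven structures.
freq = Counter(res for _, mutants in d614.values() for res in mutants)

print(n_mutations, ddg_accum)
print(freq["CYS"], freq["GLY"])
```

This prints 42 and -12.03 on the first line, then 7 and 6 on the second. The Gly count here covers only the seven structures with predictions and does not include the GLY entries of 7KDJ and 7KDI.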
Finally, it is verified that the mutations registered in the new Brazil variants also appear in the results of this work, namely T20N, D138Y, R190S, K417T, N501Y, D614G, and H655Y. Positions L18 and P26 have no predicted mutations, and unfortunately position V1176 was not present in the sequences.
This work determines the positions where a mutation can occur in the S protein in order to explain the variants that are emerging in SARS-CoV-2. It is interesting to note that most of the mutations observed in the new variant of Brazil can be predicted, but it was not possible to explain some of the mutations detected at the beginning of the Covid-19 outbreak (L18, P26, Y200, F275, L276, L752, I818, F1103, and V1104).
I’d like to acknowledge Rafael Mayo-Garcia for his comments on this manuscript.