In the journal Nature Methods, a team of researchers from the AMU Faculty of Biology and the Faculty of Automatic Control, Electronics and Computer Science at the Silesian University of Technology, in collaboration with a specialist from Friedrich Schiller University in Jena, has described a tool that distinguishes known viruses from new ones and analyses their diversity in different environments, a key issue for monitoring new pathogens and researching the microbiome.
Polish researchers, including three AMU scientists, Andrzej Zieleziński, Jakub Barylski and Piotr Rozwalak, have developed a computer programme called Vclust that makes it possible to compare millions of viral sequences in just a few hours and to order them according to their similarity. Analysing such enormous genetic datasets with traditional methods could take up to several years.
"With Vclust, analysing a set of 15 million sequences takes approximately four hours, while the most accurate tools used so far would take about four years. It is an essential step for the progress of virology and metagenomics, as it will facilitate the identification and classification of new viruses, which have been discovered in huge numbers in recent years thanks to modern sequencing technologies," the developers of the solution stressed in an interview with the Polish Press Agency.
The researchers explained that modern microbiology is struggling to cope with the data deluge. Up to one million new viruses are discovered each year, resulting in such extensive collections that their analysis and classification are becoming increasingly challenging for research teams.
"This explosion of data is due to metagenomics, a method that makes it possible to read all the DNA present in a given environmental sample, such as from the ocean, soil or human intestine. Until now, there has been a lack of tools to efficiently analyse and group such a large number of sequences. Very precise methods existed, but they could not cope with data on this scale. Therefore, we decided to create a programme that would be just as precise, but far more efficient, and could cope with millions of genomes at once," explained Andrzej Zieleziński, PhD, of Adam Mickiewicz University, Poznań, a co-author of the publication.
What makes viruses so difficult?
As he added, in biology, the classification of organisms - i.e. taxonomy - is usually based on comparing specific genes present in all representatives of a group. In this way, phylogenetic trees of organisms can be created, organisms grouped, families or species distinguished, and their degree of relatedness determined. With viruses, it is quite different.
"Viruses, unlike, for example, bacteria, do not have a single common gene that can be compared. They differ too much from each other. This is why classical phylogenetic methods do not work. Nor did the approach based on their morphology, e.g. the shape of the capsids, which proved too slow and not scalable. So we were left with one thing - to compare the sequences of whole genomes, letter by letter," said Dr Zieleziński.
This is difficult to achieve when there are millions of such genomes. As Prof Sebastian Deorowicz of the Silesian University of Technology, the project leader, explained, tools for grouping such enormous datasets already exist, but they do so at a huge computational cost that is hard to sustain in everyday research work. "It is not that no one has done it before, but it required such large resources (e.g. supercomputers) that the process would be difficult to repeat regularly, especially if we are dealing with ever-growing datasets," he pointed out.
"That is why we prioritised optimisation, i.e. designing the most efficient algorithms and the most efficient code possible, which made it possible to reduce computation time by several orders of magnitude. All this was done to bring the calculations from a supercomputer to a regular workstation," he added.
Three steps to organise viruses
Vclust works in three steps. The first is pre-filtering, in which the programme quickly identifies pairs of sequences that show at least minimal similarity. Instead of comparing every sequence with every other sequence, which would mean trillions of possible combinations, the algorithm limits the analysis to a much smaller number, on the order of hundreds of millions of the most promising pairs.
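The idea behind pre-filtering can be sketched in a few lines of Python. This is only an illustration and not Vclust's actual code: it assumes that "minimal similarity" means sharing a few short substrings (k-mers), and the function names, the parameters `k` and `min_shared`, and the toy sequences are all hypothetical. The sketch also shows why comparing everything with everything is impractical for 15 million sequences.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs_count(n):
    """Number of unordered pairs among n sequences."""
    return n * (n - 1) // 2


def kmers(seq, k):
    """Set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_pairs(seqs, k=4, min_shared=2):
    """Keep only pairs of sequences that share at least min_shared k-mers.

    Assumption for illustration: shared k-mers stand in for the
    'minimal similarity' test; the real tool may work differently.
    """
    # Inverted index: k-mer -> names of the sequences that contain it.
    index = defaultdict(set)
    for name, seq in seqs.items():
        for km in kmers(seq, k):
            index[km].add(name)

    # Count shared k-mers only for pairs that co-occur under some k-mer,
    # so entirely unrelated pairs are never examined.
    shared = defaultdict(int)
    for names in index.values():
        for a, b in combinations(sorted(names), 2):
            shared[(a, b)] += 1

    return {pair for pair, n in shared.items() if n >= min_shared}


# Toy data (made up): two near-identical sequences and one unrelated one.
toy = {
    "virus_a": "ACGTACGTGGA",
    "virus_b": "ACGTACGTGGC",
    "virus_c": "TTTTCCCCAAAA",
}
pairs = candidate_pairs(toy)
print(pairs)
print(all_pairs_count(15_000_000))
```

Only the pair that actually shares material survives the filter, while the all-against-all count for 15 million sequences runs to more than a hundred trillion pairs.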
The second stage is a precise comparison of the selected sequences. The team's own algorithm, called LZ-ANI, is applied here. It is based on techniques inspired by the data compression algorithms used in the ZIP or RAR formats. Its principle is simple: the more similar two sequences are, the better they "compress" together, i.e. they require less space after processing. This effect serves as a measure of similarity.
In the final step, clustering occurs, i.e. the grouping of sequences based on their similarity. Viruses whose genomes are most similar to each other are put into the same group. This makes it easier to identify which viruses are related and form "families" together and which are entirely separate, and so gives a better understanding of the diversity of viruses and their evolutionary relationships.
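The "compress together" principle can be tried out with a general-purpose compressor. The sketch below is not LZ-ANI, whose details the article does not give. It swaps in a generic measure, the normalised compression distance, computed with Python's standard `zlib` module; the sequences are randomly generated and the helper names are hypothetical.

```python
import random
import zlib


def compressed_size(text):
    """Size in bytes of text after zlib compression."""
    return len(zlib.compress(text.encode("ascii"), 9))


def ncd(x, y):
    """Normalised compression distance: near 0 for similar inputs,
    near 1 for unrelated ones. A generic stand-in, not LZ-ANI."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


rng = random.Random(0)  # fixed seed so the toy data is reproducible


def random_genome(length):
    return "".join(rng.choice("ACGT") for _ in range(length))


def mutate_every(seq, step):
    """Replace every step-th letter with a different one."""
    out = list(seq)
    for i in range(0, len(out), step):
        out[i] = "A" if out[i] != "A" else "C"
    return "".join(out)


x = random_genome(2000)
y = mutate_every(x, 100)   # close relative of x
z = random_genome(2000)    # unrelated sequence

print(ncd(x, y), ncd(x, z))
```

The related pair compresses far better together than the unrelated pair, so its distance is the smaller of the two.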
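The grouping step can be illustrated with one of the simplest clustering rules: put two sequences in the same group whenever a chain of sufficiently similar pairs connects them (single linkage). The article does not say which clustering methods Vclust offers, so this is an assumption made for illustration; the names, the threshold and the similarity values below are invented.

```python
def cluster(names, similarities, threshold=0.95):
    """Single-linkage grouping with a union-find structure.

    similarities maps (name_a, name_b) -> similarity in [0, 1].
    The threshold is a hypothetical cut-off chosen for the example.
    """
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the group representative,
        # shortening the path along the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    # Merge the groups of every pair that is similar enough.
    for (a, b), sim in similarities.items():
        if sim >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


# Toy input (made-up values): v1, v2 and v3 are linked by high
# similarity, v4 stands apart.
names = ["v1", "v2", "v3", "v4"]
sims = {("v1", "v2"): 0.98, ("v2", "v3"): 0.96, ("v3", "v4"): 0.40}
clusters = cluster(names, sims)
print(clusters)
```

Here v1 and v3 end up together even though they were never compared directly, because both are close to v2. Stricter rules that require every member to resemble a chosen representative would avoid such chaining at the cost of more groups.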
"As a result, the programme uses the power of the computer to its maximum potential. Everyone who tested Vclust was amazed by its speed," emphasised Dr Zieleziński.
The creators of Vclust have ensured that the tool is fully free and open to the public. It can be downloaded from the internet and run on a computer. For those without advanced hardware, a browser-based version has been prepared: vclust.org.
The tool works in a simple way: users can paste their sequences, run the analysis and, after a short time, receive the result, without the need to log in or register. Currently, the browser-based version can analyse up to a thousand sequences at once, which in many cases proves to be sufficient.
Prof Deorowicz and Dr Zieleziński have given assurances that the project will be developed further. "We plan to add more functions, and in the future, we would like to extend Vclust to also include the possibility of analysing bacterial genomes," they announced.
Nauka w Polsce, Katarzyna Czechowicz (PAP)
Source: https://naukawpolsce.pl/