PLOS ONE: [sortOrder=DATE_NEWEST_FIRST, sort=Date, newest first, q=subject:"Catalogs"]PLOShttps://journals.plos.org/plosone/webmaster@plos.orgaccelerating the publication of peer-reviewed sciencehttps://journals.plos.org/plosone/search/feed/atom?sortOrder=DATE_NEWEST_FIRST&unformattedQuery=subject:%22Catalogs%22&sort=Date,+newest+firstAll PLOS articles are Open Access.https://journals.plos.org/plosone/resource/img/favicon.icohttps://journals.plos.org/plosone/resource/img/favicon.ico2024-03-28T20:04:23ZSynthetic images aid the recognition of human-made art forgeriesJohann OstmeyerLudovica SchaerfPavel BuividovichTessa CharlesEric PostmaCarina Popovici10.1371/journal.pone.02959672024-02-14T14:00:00Z2024-02-14T14:00:00Z<p>by Johann Ostmeyer, Ludovica Schaerf, Pavel Buividovich, Tessa Charles, Eric Postma, Carina Popovici</p>
Previous research has shown that Artificial Intelligence is capable of distinguishing between authentic paintings by a given artist and human-made forgeries with remarkable accuracy, provided sufficient training. However, with the limited amount of existing known forgeries, augmentation methods for forgery detection are highly desirable. In this work, we examine the potential of incorporating synthetic artworks into training datasets to enhance the performance of forgery detection. Our investigation focuses on paintings by Vincent van Gogh, for which we release the first dataset specialized for forgery detection. To reinforce our results, we conduct the same analyses on the artists Amedeo Modigliani and Raphael. We train a classifier to distinguish original artworks from forgeries. For this, we use human-made forgeries and imitations in the style of well-known artists and augment our training sets with images in a similar style generated by Stable Diffusion and StyleGAN. We find that the additional synthetic forgeries consistently improve the detection of human-made forgeries. In addition, we find that, in line with previous research, the inclusion of synthetic forgeries in the training also enables the detection of AI-generated forgeries, especially if created using a similar generator.The art of valuation: Using visual analysis to price classical paintings by Swedish MastersAdri De RidderSteffen EriksenBert Scholtens10.1371/journal.pone.02969062024-01-19T14:00:00Z2024-01-19T14:00:00Z<p>by Adri De Ridder, Steffen Eriksen, Bert Scholtens</p>
This study seeks to address the difficulty of pricing art and the limitations of conventional valuation models by using visual analysis to determine the price of paintings. We examine a large hand-collected sample of classical paintings by Swedish Masters, categorize them based on various dimensions, and reduce measurement error by visually examining and classifying each painting into a theme. We compare this ‘visual’ approach with the conventional ‘terminological’ approach. We find that the technique, theme, and auction house all have a substantial impact on the price. We argue that a visual inspection should take precedence over analysis based on the artwork’s title. This is because the latter leaves many artworks unclassified and results in a systematic bias. The study demonstrates the importance of using art-informed characteristics to reduce measurement error in pricing paintings.Genome-wide association studies on coronary artery disease: A systematic review and implications for populations of different ancestriesSarah SilvaDorothea NitschSegun Fatumo10.1371/journal.pone.02943412023-11-29T14:00:00Z2023-11-29T14:00:00Z<p>by Sarah Silva, Dorothea Nitsch, Segun Fatumo</p>
Background <p>Cardiovascular diseases are some of the leading causes of death worldwide, with coronary artery disease leading as one of the primary causes of mortality in both the developing and developed worlds. Despite its prevalence, there is a disproportionately small number of studies conducted in populations of non-European ancestry, with the limited sample sizes of such studies further restricting the power and generalizability of respective findings. This research aimed at understanding the differences in the genetic architecture of coronary artery disease (CAD) in populations of diverse ancestries in order to contribute towards the understanding of the pathophysiology of coronary artery disease.</p> Methods <p>We performed a systematic review on the 6<sup>th</sup> of October, 2022 summarizing genome-wide association studies on coronary artery disease, while employing the GWAS Catalog as an independent database to support the search. We developed a framework to assess the methodological quality of each study. We extracted and grouped associated single nucleotide polymorphisms and genes according to ancestry groups of participants.</p> Results <p>We identified 3100 studies, of which, 36 relevant studies were included in this research. Three of the studies that were included were not listed in the GWAS Catalog, highlighting the value of conducting an independent search alongside established databases in order to ensure the full research landscape has been captured. 743,919 CAD case participants from 25 different countries were analysed, with 61% of the studies identified in this research conducted in populations of European ancestry. No studies investigated populations of Africans living in continental Africa or admixed American ancestry groups besides African-Americans, while limited sample sizes were included of population groups besides Europeans and East Asians. 
This observed disproportionate population representation highlights the gaps in the literature, which limits our ability to understand coronary artery disease as a global disease. Seventy-one genetic loci were identified as associated with coronary artery disease in more than one article, with ancestry-specific genetic loci identified in each respective population group which were not detected in studies of other ancestries.</p> Conclusions <p>Although the replication and validation of these variants are still warranted, these findings are indicative of the value of including diverse ancestry populations in GWAS reference panels, as a more comprehensive understanding of the genetic architecture and pathophysiology of CAD can be achieved.</p>The MAGMA pipeline for comprehensive genomic analyses of clinical <i>Mycobacterium tuberculosis</i> samplesTim H. HeupinkLennert VerbovenAbhinav SharmaVincent RennieMiguel de Diego FuertesRobin M. WarrenAnnelies Van Rie10.1371/journal.pcbi.10116482023-11-29T14:00:00Z2023-11-29T14:00:00Z<p>by Tim H. Heupink, Lennert Verboven, Abhinav Sharma, Vincent Rennie, Miguel de Diego Fuertes, Robin M. Warren, Annelies Van Rie</p>
Background <p>Whole genome sequencing (WGS) holds great potential for the management and control of tuberculosis. Accurate analysis of samples with low mycobacterial burden, which are characterized by low (<20x) coverage and high (>40%) levels of contamination, is challenging. We created the MAGMA (Maximum Accessible Genome for <i>Mtb</i> Analysis) bioinformatics pipeline for analysis of clinical <i>Mtb</i> samples.</p> Methods and results <p>High-accuracy variant calling is achieved by using a long seed length during read mapping to filter out contaminants, variant quality score recalibration with machine learning to identify genuine genomic variants, and joint variant calling for low <i>Mtb</i> coverage genomes. MAGMA automatically generates a standardized and comprehensive output of drug resistance information and resistance classification based on the WHO catalogue of <i>Mtb</i> mutations. MAGMA automatically generates phylogenetic trees with drug resistance annotations and trees that visualize the presence of clusters. Drug resistance and phylogeny outputs from sequencing data of 79 primary liquid cultures were compared between the MAGMA and MTBseq pipelines. The MTBseq pipeline reported only a proportion of the variants in candidate drug resistance genes that were reported by MAGMA. Notable differences were in structural variants, variants in highly conserved <i>rrs</i> and <i>rrl</i> genes, and variants in candidate resistance genes for bedaquiline, clofazimine, and delamanid. Phylogeny results were similar between pipelines but only MAGMA visualized clusters.</p> Conclusion <p>The MAGMA pipeline could facilitate the integration of WGS into clinical care as it generates clinically relevant data on drug resistance and phylogeny in an automated, standardized, and reproducible manner.</p>Focused analysis of RNFL decay in glaucomatous eyes using circular statistics on high-resolution OCT dataMd. Hasnat AliMeghana RayS. Rao JammalamadakaSirisha SenthilM. B. 
SrinivasSaumyadipta Pyne10.1371/journal.pone.02929152023-10-18T14:00:00Z2023-10-18T14:00:00Z<p>by Md. Hasnat Ali, Meghana Ray, S. Rao Jammalamadaka, Sirisha Senthil, M. B. Srinivas, Saumyadipta Pyne</p>
We generated Optical Coherence Tomography (OCT) data of much higher resolution than usual on retinal nerve fiber layer (RNFL) thickness of a given eye. These consist of measurements made at hundreds of angular-points defined on a circular coordinate system. Traditional analysis of OCT RNFL data does not utilize insightful characteristics such as its circularity and granularity for common downstream applications. To address this, we present a new circular statistical framework that defines an Angular Decay function and thereby provides a directionally precise representation of an eye with attention to patterns of focused RNFL loss. By applying to a clinical cohort of Asian Indian eyes, the generated circular data were modeled with a finite mixture of von Mises distributions, which led to an unsupervised identification in different age-groups of recurrent clusters of glaucomatous eyes with distinct directional signatures of RNFL decay. New indices of global and local RNFL loss were computed for comparing the structural differences between these glaucoma clusters across the age-groups and improving classification. Further, we built a catalog of directionally precise statistical distributions of RNFL thickness for the said population of normal eyes as stratified by their age and optic disc size.eQTL Catalogue 2023: New datasets, X chromosome QTLs, and improved detection and visualisation of transcript-level QTLsNurlan KerimovRalf TambetsJames D. HayhurstIda RahuPeep KolbergUku RaudvereIvan KuzminAnshika ChowdharyAndreas VijaHans J. TerasMasahiro KanaiJacob UlirschMina RytenJohn HardySebastian GuelfiDaniah TrabzuniSarah Kim-HellmuthWilliam RaynerHilary FinucaneHedi PetersonAbayomi MosakuHelen ParkinsonKaur Alasoo10.1371/journal.pgen.10109322023-09-18T14:00:00Z2023-09-18T14:00:00Z<p>by Nurlan Kerimov, Ralf Tambets, James D. Hayhurst, Ida Rahu, Peep Kolberg, Uku Raudvere, Ivan Kuzmin, Anshika Chowdhary, Andreas Vija, Hans J. 
Teras, Masahiro Kanai, Jacob Ulirsch, Mina Ryten, John Hardy, Sebastian Guelfi, Daniah Trabzuni, Sarah Kim-Hellmuth, William Rayner, Hilary Finucane, Hedi Peterson, Abayomi Mosaku, Helen Parkinson, Kaur Alasoo</p>
The eQTL Catalogue is an open database of uniformly processed human molecular quantitative trait loci (QTLs). We are continuously updating the resource to further increase its utility for interpreting genetic associations with complex traits. Over the past two years, we have increased the number of uniformly processed studies from 21 to 31 and added X chromosome QTLs for 19 compatible studies. We have also implemented Leafcutter to directly identify splice-junction usage QTLs in all RNA sequencing datasets. Finally, to improve the interpretability of transcript-level QTLs, we have developed static QTL coverage plots that visualise the association between the genotype and average RNA sequencing read coverage in the region for all 1.7 million fine mapped associations. To illustrate the utility of these updates to the eQTL Catalogue, we performed colocalisation analysis between vitamin D levels in the UK Biobank and all molecular QTLs in the eQTL Catalogue. Although most GWAS loci colocalised both with eQTLs and transcript-level QTLs, we found that visual inspection could sometimes be used to distinguish primary splicing QTLs from those that appear to be secondary consequences of large-effect gene expression QTLs. While these visually confirmed primary splicing QTLs explain just 6/53 of the colocalising signals, they are significantly less pleiotropic than eQTLs and identify a prioritised causal gene in 4/6 cases.Spatial mapping of b-value and fractal dimension prior to November 8, 2022 Doti Earthquake, NepalRam Krishna TiwariHarihar Paudyal10.1371/journal.pone.02896732023-08-09T14:00:00Z2023-08-09T14:00:00Z<p>by Ram Krishna Tiwari, Harihar Paudyal</p>
An earthquake of magnitude 5.6 mb (6.6 ML) hit western Nepal (Doti region) in the early hours of Wednesday morning local time (2:12 AM, 2022.11.08), killing at least six people. The Gutenberg-Richter b-value of the earthquake distribution and the correlation fractal dimension (D<sub>2</sub>) are estimated for 493 earthquakes with magnitude of completeness 3.6 prior to this earthquake. We consider earthquakes in the western Nepal Himalaya and adjoining region (80.0–83.5°E and 27.3–30.5°N) for the period of 1964 to 2022 for the analysis. The b-value of 0.68±0.03 implies a high stress zone and the spatial correlation dimension of 1.81±0.02 implies a highly heterogeneous region where the epicenters are spatially distributed. Low b-values and high D<sub>2</sub> values identify the study region as a high hazard zone. Focal mechanism styles and low b-values correlate with the thrust nature of the earthquakes and show that the earthquake’s occurrence is associated with the dynamics of the faults responsible for generating the past earthquakes.Testing three seismic hazard models for Italy via multi-site observationsIunio IervolinoEugenio ChioccarelliPasquale Cito10.1371/journal.pone.02849092023-04-27T14:00:00Z2023-04-27T14:00:00Z<p>by Iunio Iervolino, Eugenio Chioccarelli, Pasquale Cito</p>
Probabilistic seismic hazard analysis (PSHA) is widely employed worldwide as the rational way to quantify the uncertainty associated with earthquake occurrence and effects. When PSHA is carried out for a whole country, its results are typically expressed in the form of maps of ground motion intensities that all have the same exceedance return period. Classical PSHA relies on data that continuously increase due to instrumental seismic monitoring, and on models that continuously evolve with the knowledge on each of its many aspects. Therefore, it can happen that different, equally legitimate, hazard maps for the same region show apparently irreconcilable differences, sparking public debate. This situation is currently ongoing in Italy, where the process of governmental enforcement of a new hazard map is delayed. The discussion is complicated by the fact that the events of interest to hazard assessment are intentionally rare at any of the sites the maps refer to, thus impeding empirical validation at any specific site. The present study instead pursues a regional approach, which overcomes the issues of site-specific PSHA validation, and evaluates three different authoritative PSHA studies for Italy. Formal tests were performed that directly compare the output of PSHA, that is, probabilistic predictions, against the observed ground shaking exceedance frequencies, obtained from about fifty years of continuous monitoring of seismic activities across the country. The bulk of the analyses reveals that apparently alternative hazard maps are, in fact, hardly distinguishable in the light of observations.TBProfiler for automated calling of the association with drug resistance of variants in <i>Mycobacterium tuberculosis</i>Lennert VerbovenJody PhelanTim H. HeupinkAnnelies Van Rie10.1371/journal.pone.02796442022-12-30T14:00:00Z2022-12-30T14:00:00Z<p>by Lennert Verboven, Jody Phelan, Tim H. Heupink, Annelies Van Rie</p>
Following a huge global effort, the first World Health Organization (WHO)-endorsed catalogue of 17,356 variants in the <i>Mycobacterium tuberculosis</i> complex, along with their classification as associated with resistance (interim), not associated with resistance (interim) or uncertain significance, was made public in June 2021. This marks a critical step towards the application of next generation sequencing (NGS) data for clinical care. Unfortunately, the variant format used makes it difficult to look up variants when NGS data is generated by other bioinformatics pipelines. Furthermore, the large number of variants of uncertain significance in the catalogue hampers its usability in clinical practice. We successfully converted 98.3% of variants from the WHO catalogue format to the standardized HGVS format. We also created TBProfiler version 4.4.0 to automate the calling of all variants located in the tier 1 and 2 candidate resistance genes along with their classification when listed in the WHO catalogue. Using a representative sample of 339 clinical isolates from South Africa containing 691 variants in a tier 1 or 2 gene, TBProfiler classified 105 (15%) variants as conferring resistance, 72 (10%) as not conferring resistance and 514 (74%) as unclassified, with an average of 29 unclassified variants per isolate. Using a second cohort of 56 clinical isolates from a TB outbreak in Spain containing 21 variants in the tier 1 and 2 genes, TBProfiler classified 13 (61.9%) as unclassified, 7 (33.3%) as not conferring resistance, and a single variant (4.8%) as conferring resistance. 
Continued global efforts using standardized methods for genotyping, phenotyping and bioinformatic analyses will be essential to ensure that knowledge on genomic variants translates into improved patient care.CureSCi Metadata Catalog–Making sickle cell studies findableHuaqin PanCataia IvesMeisha MandalYing QinTabitha HendershotJen PopovicDonald BrambillaJeran StratfordMarsha TreadwellXin WuBarbara Kroner10.1371/journal.pone.02562482022-12-12T14:00:00Z2022-12-12T14:00:00Z<p>by Huaqin Pan, Cataia Ives, Meisha Mandal, Ying Qin, Tabitha Hendershot, Jen Popovic, Donald Brambilla, Jeran Stratford, Marsha Treadwell, Xin Wu, Barbara Kroner</p>
Objectives <p>To adopt the FAIR principles (Findable, Accessible, Interoperable, Reusable) to enhance data sharing, the Cure Sickle Cell Initiative (CureSCi) MetaData Catalog (MDC) was developed to make Sickle Cell Disease (SCD) study datasets more Findable by curating study metadata and making them available through an open-access web portal.</p> Methods <p>Study metadata, including study protocol, data collection forms, and data dictionaries, describe information about study patient-level data. We curated key metadata of 16 SCD studies in a three-tiered conceptual framework of category, subcategory, and data element using ontologies and controlled vocabularies to organize the study variables. We developed the CureSCi MDC by indexing study metadata to enable effective browse and search capabilities at three levels: study, Patient-Reported Outcome (PRO) Measures, and data element levels.</p> Results <p>The CureSCi MDC offers several browse and search tools to discover studies by study level, PRO Measures, and data elements. The “Browse Studies,” “Browse Studies by PRO Measures,” and “Browse Studies by Data Elements” tools allow users to identify studies through pre-defined conceptual categories. “Search by Keyword” and “Search Data Element by Concept Category” can be used separately or in combination to provide more granularity to refine the search results. This resource helps investigators find information about specific data elements across studies using public browsing/search tools, before going through data request procedures to access controlled datasets. 
The MDC makes SCD studies more Findable through browsing/searching study information, PRO Measures, and data elements, aiding in the reuse of existing SCD data.</p>A data compendium associating the genomes of 12,289 <i>Mycobacterium tuberculosis</i> isolates with quantitative resistance phenotypes to 13 antibioticsThe CRyPTIC Consortium10.1371/journal.pbio.30017212022-08-09T14:00:00Z2022-08-09T14:00:00Z<p>by The CRyPTIC Consortium </p>
The Comprehensive Resistance Prediction for Tuberculosis: an International Consortium (CRyPTIC) presents here a data compendium of 12,289 <i>Mycobacterium tuberculosis</i> global clinical isolates, all of which have undergone whole-genome sequencing and have had their minimum inhibitory concentrations to 13 antitubercular drugs measured in a single assay. It is the largest matched phenotypic and genotypic dataset for <i>M</i>. <i>tuberculosis</i> to date. Here, we provide a summary detailing the breadth of data collected, along with a description of how the isolates were selected, collected, and uniformly processed in CRyPTIC partner laboratories across 23 countries. The compendium contains 6,814 isolates resistant to at least 1 drug, including 2,129 samples that fully satisfy the clinical definitions of rifampicin resistant (RR), multidrug resistant (MDR), pre-extensively drug resistant (pre-XDR), or extensively drug resistant (XDR). The data are enriched for rare resistance-associated variants, and the current limits of genotypic prediction of resistance status (sensitive/resistant) are presented by using a genetic mutation catalogue, along with the presence of suspected resistance-conferring mutations for isolates resistant to the newly introduced drugs bedaquiline, clofazimine, delamanid, and linezolid. Finally, a case study of rifampicin monoresistance demonstrates how this compendium could be used to advance our genetic understanding of rare resistance phenotypes. The data compendium is fully open source and it is hoped that it will facilitate and inspire future research for years to come.Deep learning exoplanets detection by combining real and synthetic dataSara CuéllarPaulo GranadosErnesto FabregasMichel CuréHéctor VargasSebastián Dormido-CantoGonzalo Farias10.1371/journal.pone.02681992022-05-25T14:00:00Z2022-05-25T14:00:00Z<p>by Sara Cuéllar, Paulo Granados, Ernesto Fabregas, Michel Curé, Héctor Vargas, Sebastián Dormido-Canto, Gonzalo Farias</p>
Scientists and astronomers have attached great importance to the task of discovering new exoplanets, even more so if they are in the habitable zone. To date, more than 4300 exoplanets have been confirmed by NASA, using various discovery techniques, including planetary transits, in addition to the use of various databases provided by space and ground-based telescopes. This article proposes the development of a deep learning system for detecting planetary transits in Kepler Telescope light curves. The approach is based on related work from the literature and enhanced to validation with real light curves. A CNN classification model is trained from a mixture of real and synthetic data. The model is then validated only with unknown real data. The best ratio of synthetic data is determined by the performance of an optimisation technique and a sensitivity analysis. The precision, accuracy and true positive rate of the best model obtained are determined and compared with other similar works. The results demonstrate that the use of synthetic data in the training stage can improve the transit detection performance on real light curves.Genetic differentiation in East African ethnicities and its relationship with endurance running successAndré L. S. ZaniMateus H. GouveiaMarla M. AquinoRodrigo QuevedoRodrigo L. MenezesCharles RotimiGerald O. LwandeCollins OumaEphrem MekonnenNelson J. R. Fagundes10.1371/journal.pone.02656252022-05-19T14:00:00Z2022-05-19T14:00:00Z<p>by André L. S. Zani, Mateus H. Gouveia, Marla M. Aquino, Rodrigo Quevedo, Rodrigo L. Menezes, Charles Rotimi, Gerald O. Lwande, Collins Ouma, Ephrem Mekonnen, Nelson J. R. Fagundes</p>
Since the 1960s, East African athletes, mainly from Kenya and Ethiopia, have dominated long-distance running events in both the male and female categories. Further demographic studies have shown that two ethnic groups are overrepresented among elite endurance runners in each of these countries: the Kalenjin, from Kenya, and the Oromo, from Ethiopia, raising the possibility that this dominance results from genetic and/or cultural factors. However, looking at the life history of these athletes or at loci previously associated with endurance athletic performance, no compelling explanation has emerged. Here, we used a population approach to identify peaks of genetic differentiation for these two ethnicities and compared the list of genes close to these regions with a list, manually curated by us, of genes that have been associated in GWAS with traits possibly relevant to endurance running, and found a significant enrichment in both populations (Kalenjin, <i>P</i> = 0.048, and Oromo, <i>P</i> = 1.6x10<sup>-5</sup>). Those traits are mainly related to anthropometry, circulatory and respiratory systems, energy metabolism, and calcium homeostasis. Our results reinforce the notion that endurance running is a systemic activity with a complex genetic architecture, and indicate new candidate genes for future studies. 
Finally, we argue that a deterministic relationship between genetics and sports must be avoided, as it is both scientifically incorrect and prone to reinforcing population (racial) stereotyping.Sharing datasets of the COVID-19 epidemic in the Czech RepublicMartin KomendaJiří JarkovskýDaniel KlimešPetr PanoškaOndřej ŠancaJakub GregorJan MužíkMatěj KarolyiOndřej MájekMilan BlahaBarbora MackováJarmila RážováVěra AdámkováVladimír ČernýJan BlatnýLadislav Dušek10.1371/journal.pone.02673972022-04-21T14:00:00Z2022-04-21T14:00:00Z<p>by Martin Komenda, Jiří Jarkovský, Daniel Klimeš, Petr Panoška, Ondřej Šanca, Jakub Gregor, Jan Mužík, Matěj Karolyi, Ondřej Májek, Milan Blaha, Barbora Macková, Jarmila Rážová, Věra Adámková, Vladimír Černý, Jan Blatný, Ladislav Dušek</p>
At the time of the COVID-19 pandemic, providing access to data (properly optimised regarding personal data protection) plays a crucial role in providing the general public and media with up-to-date information. Open datasets also represent one of the means for evaluation of the pandemic on a global level. The primary aim of this paper is to describe the methodological and technical framework for publishing datasets describing characteristics related to the COVID-19 epidemic in the Czech Republic (epidemiology, hospital-based care, vaccination), including the use of these datasets in practice. Practical aspects and experience with data sharing are discussed. As a reaction to the epidemic situation, a new portal <i>COVID-19</i>: <i>Current Situation in the Czech Republic</i> (https://onemocneni-aktualne.mzcr.cz/covid-19) was developed and launched in March 2020 to provide a fully-fledged and trustworthy source of information for the public and media. The portal also contains (i) a section for the publication of public open datasets available for download in CSV and JSON formats and (ii) an authorised-access-only section where the authorised persons can (through an online generated token) safely visualise or download regional datasets with aggregated data at the level of the individual municipalities and regions. The data are also provided to the local open data catalogue (covering only open data on healthcare, provided by the Ministry of Health) and to the National Catalogue of Open Data (covering all open data sets, provided by various authorities/publishers, and harvesting all data from local catalogues). The datasets have been published in various authentication regimes and widely used by the general public, scientists, public authorities and decision-makers. The total number of API calls since its launch in March 2020 to 15 December 2020 exceeded 13 million. 
The datasets have been adopted as an official and guaranteed source for outputs of third parties, including public authorities, non-governmental organisations, scientists and online news portals. Datasets currently published as open data meet the 3-star open data requirements, which makes them machine-readable and facilitates their further usage without restrictions. This is essential for making the data more easily understandable and usable for data consumers. In conjunction with the strategy of the MH in the field of data opening, additional datasets meeting the already implemented standards will be also released, both on COVID-19 related and unrelated topics.Comprehensive mouse microbiota genome catalog reveals major difference to its human counterpartSilas KieserEvgeny M. ZdobnovMirko Trajkovski10.1371/journal.pcbi.10099472022-03-08T14:00:00Z2022-03-08T14:00:00Z<p>by Silas Kieser, Evgeny M. Zdobnov, Mirko Trajkovski</p>
Mouse is the most used model for studying the impact of microbiota on its host, but the repertoire of species from the mouse gut microbiome remains largely unknown. Accordingly, the similarity between human and mouse microbiomes at a low taxonomic level is not clear. We construct a comprehensive mouse microbiota genome (CMMG) catalog by assembling all currently available mouse gut metagenomes and combining them with published reference and metagenome-assembled genomes. The 41’798 genomes cluster into 1’573 species, of which 78.1% are uncultured, and we discovered 226 new genera, seven new families, and one new order. CMMG enables an unprecedented coverage of the mouse gut microbiome exceeding 86%, increases the mapping rate over four-fold, and allows functional microbiota analyses of human and mouse linking them to the driver species. Comparing CMMG to microbiota from the unified human gastrointestinal genomes shows an overlap of 62% at the genus but only 10% at the species level, demonstrating that human and mouse gut microbiota are largely distinct. CMMG contains the most comprehensive collection of consistently functionally annotated species of the mouse and human microbiome to date, setting the ground for analysis of new and reanalysis of existing datasets at an unprecedented depth.