PSR Data Science
As you may have noted, we have not hosted a Midday at the Oasis webinar in recent months. We have instead been helping to plan coordinated webinar series with our colleagues across the network. These series will cover information relevant to the broad community of librarians in the network membership.
We will still host Midday at the Oasis on an as-needed basis, when we identify great content to share that is specific to our region or does not fit into the series created with our colleagues. We welcome suggestions for content that you would like to see or present.
We invite you to plan to attend the webinars in the series listed below:
- RDM Webinar Series
- NNLM Resource Picks
- Public Health
- Public Libraries
- Health Sciences Libraries (including hospitals)
- Emerging Trends and Topics
More information will follow. When webinars are scheduled, we will continue to send announcements via PSR-News so that you can attend and earn MLA CE credits, as available.
Dr. Patti Brennan has announced that NLM has received $10 million as part of the Coronavirus Aid, Relief, and Economic Security (CARES) Act, which provides emergency funding for federal agencies to combat the coronavirus outbreak. The funding is being used to support activities to improve the quality of clinical data for research and care; accelerate research including phenotyping, image analysis, and real-time surveillance; and to enhance access to COVID-19 literature and molecular data resources. The following activities highlight many of the investments that NLM is making with this emergency funding.
The novel coronavirus is driving a need for standardized COVID-19 terminology and data exchange that will allow clinicians and scientists to communicate more effectively and consistently. NLM will use the supplemental funds to support the addition of codes for COVID-19-related laboratory tests within LOINC (Logical Observation Identifiers Names and Codes) and to provide implementation guidelines and training in use of the standards. NLM is also enabling sharing of COVID-19 terminology updates through the Value Set Authority Center (VSAC), which makes available value sets and clinical terminologies. Value sets are codes from standard terminologies around specific concepts or conditions and are used as part of electronic clinical quality measures or to define patient cohorts, classes of interventions, or patient outcomes. This important work will facilitate the analysis of electronic health record data and support effective and interoperable health information exchange.
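As a rough illustration of how a value set can define a patient cohort, the sketch below treats a value set as a set of (terminology, code) pairs and selects patients whose test results carry one of those codes. Everything in it is an assumption made for illustration: the `Coding` and `Observation` types, the `cohort` function, and the placeholder codes are hypothetical and are not drawn from VSAC or LOINC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Coding:
    system: str  # name of the terminology, e.g. "LOINC"
    code: str    # code within that terminology


@dataclass(frozen=True)
class Observation:
    patient_id: str
    coding: Coding
    result: str


# Hypothetical value set: placeholder codes standing in for
# COVID-19-related laboratory tests. These are NOT real LOINC codes.
COVID_TEST_VALUE_SET = frozenset({
    Coding("LOINC", "PLACEHOLDER-1"),
    Coding("LOINC", "PLACEHOLDER-2"),
})


def cohort(observations, value_set, result="positive"):
    """Return the IDs of patients with a matching result for any code in the value set."""
    return {
        obs.patient_id
        for obs in observations
        if obs.coding in value_set and obs.result == result
    }


observations = [
    Observation("p1", Coding("LOINC", "PLACEHOLDER-1"), "positive"),
    Observation("p2", Coding("LOINC", "PLACEHOLDER-2"), "negative"),
    Observation("p3", Coding("LOINC", "UNRELATED-9"), "positive"),
    Observation("p4", Coding("LOINC", "PLACEHOLDER-2"), "positive"),
]

covid_positive = cohort(observations, COVID_TEST_VALUE_SET)
print(sorted(covid_positive))
```

Because the cohort logic refers only to the value set, adding a newly issued test code means updating the value set rather than the query, which is the kind of consistency shared terminology updates are meant to support.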
NLM is updating terminology for coronavirus-related drugs and chemicals through resources such as the Medical Subject Headings (MeSH) used for indexing and cataloging biomedical literature, and ChemIDplus, a dictionary of over 400,000 chemicals (names, synonyms, and structures). This work aligns terminology to facilitate the identification of chemicals and drugs used to treat, detect, and prevent COVID-19 and other coronavirus-related infections, including severe acute respiratory syndrome (SARS) and Middle East respiratory syndrome (MERS).
NLM’s intramural research program is using virus genomics, health data, and social media data to identify community spread of COVID-19. Researchers are applying machine learning and artificial intelligence techniques to chest X-rays to differentiate viral pneumonia from bacterial pneumonia – expanding knowledge of the process of the SARS-CoV-2 viral infection and assisting in the identification of best practices for diagnosis and care of COVID-19 patients. NLM research in natural language processing contributed to development of LitCovid, a curated literature hub for tracking scientific publications about the novel coronavirus. It provides centralized access to more than 13,500 relevant articles in PubMed, categorizes them by research topic and geographic location, and is updated daily.
NLM’s extramural research program is focusing on novel informatics and data science methods to rapidly improve the understanding of SARS-CoV-2 infection and of COVID-19. In April, NLM issued two Notices of Special Interest (NOT-LM-010 and NOT-LM-011) seeking applications (due in June) in these areas: the mining of clinical data for ‘deep phenotyping’ (gathering fine-grained details about how a disease presents itself in an individual) to identify or predict the presence of COVID-19; and public health surveillance methods that mine genomic, viromic, health, and environmental data, or data from other pertinent sources such as social media, to identify the spread and impact of SARS-CoV-2.
NLM is also improving access to published coronavirus literature via PubMed Central (PMC). In response to a call by science and technology advisors from a dozen countries to have publishers and scholarly societies make their COVID-19 and coronavirus-related publications immediately accessible in PMC, along with the available data supporting them, nearly 50 publishers have deposited more than 46,000 coronavirus-related articles in PMC with licenses that allow re-use and secondary analysis. Articles in the collection have been accessed more than 8 million times since March 18. NLM will use supplemental funds to improve the article-submission system to better accommodate publisher submissions and accelerate release of these critically important articles. On the PubMed side of literature offerings, NLM supplemental funds will support integrating LitCovid metadata. Novel sensors are being developed to leverage LitCovid metadata when directing users to curated COVID-19 content. The new infrastructure will permit PubMed to rapidly add additional disease-specific sensors in the future.
As of May 7, NLM’s GenBank resource has 3,893 publicly available SARS-CoV-2 sequences from 42 different countries. NLM created a special site, the “Severe acute respiratory syndrome coronavirus 2 data hub,” where people can search, retrieve, and analyze sequences of the virus that have been submitted to the GenBank database. In late March, NLM joined the CDC-led SPHERES consortium, a national genomics consortium which aims to coordinate U.S. SARS-CoV-2 sequencing efforts and make data publicly available in NLM’s GenBank and Sequence Read Archive (SRA), and other appropriate repositories. Supplemental funds will allow GenBank to further enhance the submission workflow, establish and promote use of metadata sample standards, and develop a fully automated SARS-CoV-2 submission workflow that incorporates quality checks, as well as ‘automated curation’, to provide standardized annotation of the SARS-CoV-2 genomes submitted to GenBank.
SRA is positioned as a ready-made computational environment for public health surveillance pipelines and tool development. SRA metagenomic datasets from both environmental samples and patients diagnosed with COVID-19 can reveal patterns of co-occurring pathogens, newly emerging outbreaks, and viral evolution. NLM supplemental funds are being used to prototype SRA cloud-based analysis tools to search the entirety of the SRA database. These tools can provide efficient search for SARS-CoV-2, identify genetic patterns, and monitor newly submitted data for specific viral patterns.
NLM supplemental funding also supports the identification and selection of web and social media content documenting COVID-19 as part of NLM’s Global Health Events web archive collection. This content documents life in quarantine, prevention measures, the experiences of health care workers, patients, and more. NLM is also participating as an institutional contributor to a broader International Internet Preservation Consortium (IIPC) Novel Coronavirus outbreak web archive collection.
National Library of Medicine Director Patti Brennan, RN, PhD, has named Stephen Sherry, PhD, Acting Director of the National Center for Biotechnology Information (NCBI) at the National Library of Medicine effective March 31, 2020. As Acting Director of NCBI, Dr. Sherry oversees a large suite of online resources for biological information and data, including the GenBank® nucleic acid sequence database and the PubMed database of citations and abstracts for published life science journals. He is also responsible for developing and operating all NCBI production services, with program areas spanning literature, sequences, chemistry, clinical research, and medical genetics.
Dr. Sherry also leads an NLM program to migrate NCBI’s largest resource, the Sequence Read Archive, into the cloud with the transfer and management of petabyte-scale sequence data on two commercial cloud platforms. He conducts research on the architecture of population genetic information to ensure human genetic information systems are both useful to researchers and respectful to the privacy of study participants.
Dr. Sherry earned his Ph.D. in Anthropology at the Pennsylvania State University in 1996 and completed postdoctoral work at the Louisiana State University Medical Center before joining NLM in 1998.
The U.S. Department of Commerce’s National Institute of Standards and Technology (NIST) and the White House Office of Science and Technology Policy (OSTP) have just launched a joint effort to support the development of search engines for research that will help in the fight against COVID-19. The project was developed in response to the March 16 White House Call to Action to the Tech Community on New Machine Readable COVID-19 Dataset.
In this effort, NIST will work initially with the Allen Institute for Artificial Intelligence, the National Library of Medicine, Oregon Health & Science University (OHSU), and the University of Texas Health Science Center at Houston (UT Health). The team will apply the successful, long-running program of expert engagement and technology assessment called the Text Retrieval Conference, or TREC, to the COVID-19 Open Research Dataset (CORD-19), a resource of more than 44,000 research articles and related data about COVID-19 and the coronavirus family of viruses. The TREC-COVID program goals include creating datasets and using an independent assessment process that will help search engine developers to evaluate and optimize their systems in meeting the needs of the research and health-care communities.
The team will first release a series of sample queries for the biomedical research community, developed by team members at the National Library of Medicine, OHSU and UT Health. Registered participants in TREC-COVID will use their information retrieval and search systems to run the queries against the CORD-19 document set and return their results to NIST. Biomedical experts will then review test results, including document relevance rankings, to assess the overall performance of the retrieval systems.
Using proven TREC protocols, NIST will score the submissions and post the scores, the retrieval results themselves, and the lists of key reference documents to the TREC-COVID website. These “test collections” can then be used by information retrieval researchers to evaluate and enhance the performance of their own search engines. This effort is intended to help researchers understand how search systems could best support medical researchers when available information is developing quickly, as in the current pandemic.
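The scoring step can be sketched in a few lines, under stated assumptions. The example below computes precision at k, one common retrieval measure; the announcement does not say which measures NIST reports. The line layouts (`topic iteration doc_id relevance` for judgments, `topic Q0 doc_id rank score run_tag` for results) follow conventions widely used with TREC tooling, and their use here is an assumption, as are the function names and the made-up document IDs.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Parse relevance judgments: 'topic iteration doc_id relevance'."""
    relevant = defaultdict(set)
    for line in lines:
        topic, _iteration, doc_id, rel = line.split()
        if int(rel) > 0:  # any positive grade counts as relevant
            relevant[topic].add(doc_id)
    return relevant


def parse_run(lines):
    """Parse system output: 'topic Q0 doc_id rank score run_tag'."""
    ranked = defaultdict(list)
    for line in lines:
        topic, _q0, doc_id, rank, _score, _tag = line.split()
        ranked[topic].append((int(rank), doc_id))
    # Order each topic's documents by rank, best first.
    return {t: [doc for _, doc in sorted(pairs)] for t, pairs in ranked.items()}


def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k retrieved documents judged relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


# Made-up judgments and results for two topics.
qrels = parse_qrels([
    "1 0 docA 1",
    "1 0 docB 0",
    "1 0 docC 2",
    "2 0 docD 1",
])
run = parse_run([
    "1 Q0 docA 1 9.1 demo",
    "1 Q0 docC 2 8.7 demo",
    "1 Q0 docB 3 5.0 demo",
    "2 Q0 docE 1 7.3 demo",
    "2 Q0 docD 2 6.9 demo",
])

scores = {t: precision_at_k(docs, qrels.get(t, set()), k=2) for t, docs in run.items()}
mean_p_at_2 = sum(scores.values()) / len(scores)
print(scores, mean_p_at_2)
```

In this sketch an unjudged document (`docE`) counts as not relevant, a simplifying choice that matters when a collection is growing faster than experts can assess it.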
The Allen Institute for Artificial Intelligence has been releasing an expanded CORD-19 document set each Friday to capture the most recent articles on COVID-19 and related coronaviruses. Later rounds of TREC-COVID will use the larger releases of CORD-19 and expanded query sets. Participants will have one week to submit their search results, and NIST will post results within about a week after that, so new dataset rounds are expected to be released about two weeks apart. The team initially anticipates conducting five consecutive rounds of search system assessments. Interested organizations are invited to register to participate in the TREC-COVID program on the NIST website.