Scientific knowledge is one of the greatest assets of humankind. This knowledge is recorded and disseminated in scientific publications, and the body of scientific literature is growing at an enormous rate. Automatic methods for processing and cataloguing this information are needed to help scientists navigate the vast literature, and to facilitate automated reasoning, discovery, and decision making over these data.
The ESSP workshop focuses on processing scientific articles and creating structured repositories, such as knowledge graphs, for finding new information and making scientific discoveries. The workshop aims to identify the representations needed to facilitate automated reasoning over scientific information, and to bring together experts in natural language processing and information extraction with scientists from other domains (e.g., materials science and biomedical research) who want to leverage the vast amount of information stored in scientific publications.
The advent of big data promises to revolutionize medicine by making it more personalized and effective, but big data also presents a grand challenge of information overload. For example, tumor sequencing has become routine in cancer treatment, yet interpreting the genomic data requires painstakingly curating knowledge from a vast biomedical literature, which grows by thousands of papers every day. Electronic medical records contain valuable information for drug development and clinical trial matching, but curating such real-world data from clinical notes can take hours for a single patient. NLP can play a key role in interpreting big data for precision medicine. In particular, machine reading can help unlock knowledge from text by substantially improving curation efficiency. However, standard supervised methods require labeled examples, which are expensive and time-consuming to produce at scale.
In this talk, I'll present Project Hanover, in which we overcome the annotation bottleneck by combining deep learning with probabilistic logic, and by exploiting indirect supervision from readily available resources such as ontologies and databases. This enables us to extract knowledge from millions of publications, to reason efficiently with the resulting knowledge graph by learning neural embeddings of biomedical entities and relations, and to apply the extracted knowledge and learned embeddings to support precision oncology.
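As a concrete, if simplified, illustration of the indirect-supervision idea mentioned above, the Python sketch below labels training data distantly: any sentence that mentions an entity pair already linked in a curated database is treated as a noisy positive example for that relation. This is a generic sketch of the technique, not the Project Hanover implementation; all entities, relations, and sentences are hypothetical.

```python
# Minimal sketch of distant supervision for relation extraction.
# All data and names are hypothetical; this is NOT the Project
# Hanover implementation, only the general idea it builds on.

# A toy knowledge base of known (drug, gene) relations.
known_relations = {
    ("gefitinib", "EGFR"): "inhibits",
    ("vemurafenib", "BRAF"): "inhibits",
}

# Unlabeled sentences mined from the literature.
sentences = [
    "Gefitinib suppressed EGFR signaling in lung cancer cells.",
    "BRAF mutations were common in the melanoma cohort.",
    "Patients received vemurafenib, a selective BRAF inhibitor.",
]

def distant_labels(sentences, kb):
    """Attach noisy labels: if a sentence mentions both entities of a
    known relation, assume it expresses that relation."""
    examples = []
    for sent in sentences:
        lowered = sent.lower()
        for (e1, e2), rel in kb.items():
            if e1.lower() in lowered and e2.lower() in lowered:
                examples.append((sent, e1, e2, rel))
    return examples

for sent, e1, e2, rel in distant_labels(sentences, known_relations):
    print(f"{rel}({e1}, {e2}) <- {sent!r}")
```

The resulting noisy examples can then train a relation extractor without manual annotation; the price is label noise, which is where mechanisms such as probabilistic logic come in.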
Bio: Hoifung Poon is the Director of Precision Health NLP at Microsoft Research and an affiliate faculty member at the University of Washington Medical School. He leads Project Hanover, with the overarching goal of advancing machine reading for precision health by combining probabilistic logic with deep learning. He has given tutorials on this topic at top conferences such as the Association for Computational Linguistics (ACL) and the Association for the Advancement of Artificial Intelligence (AAAI). His research spans a wide range of problems in machine learning and natural language processing (NLP), and his prior work has been recognized with Best Paper Awards from premier venues such as the North American Chapter of the Association for Computational Linguistics (NAACL), Empirical Methods in Natural Language Processing (EMNLP), and Uncertainty in AI (UAI). He received his PhD in Computer Science and Engineering from the University of Washington, specializing in machine learning and NLP.
The National Library of Medicine (NLM) plays a pivotal role in translating biomedical research into practice. One of the foundational tasks in supporting NLM goals is biomedical language processing. The talk will introduce NLM's resources for biomedical and clinical NLP -- natural language processing methods that support healthcare by operationalizing clinical information contained in the biomedical literature and in clinical narrative. The talk will then focus on approaches to extracting structured knowledge from biomedical publications, highlighting the automation and support of manual indexing of the biomedical literature with Medical Subject Headings (MeSH), as well as the extraction of information from tables and the full text of articles.
Bio: Dina Demner-Fushman, Investigator, leads research in information retrieval and natural language processing at the National Library of Medicine. Dina earned a Doctor of Medicine degree from Kazan State Medical Institute, a clinical research doctorate (PhD in Medical Sciences) from Moscow Medical and Stomatological Institute, and MS and PhD degrees in Computer Science from the University of Maryland. She is the author of more than 190 articles and book chapters in the fields of information retrieval, natural language processing, and biomedical and clinical informatics. She is a Fellow of the American College of Medical Informatics (ACMI), an Associate Editor of the Journal of the American Medical Informatics Association, and one of the founding members of the Association for Computational Linguistics Special Interest Group on biomedical natural language processing (SIGBioMed).
Knowledge extraction technologies have improved dramatically in recent years and have started to make an impact on practical scientific discovery. This talk will survey recent work on applying extraction methods to scientific problems, focusing in particular on two recent efforts in the social sciences.
The first is RaccoonDB, a declarative nowcasting data management system that enables users to predict real-world time-series phenomena from extracted social media signals. RaccoonDB's novel query optimization methods allow it to generate useful social science predictions 123 times faster than competing systems while using just 10% of the computational resources. When applied to unemployment phenomena, the system yields predictions comparable in accuracy to those of real-world economists.
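To make the nowcasting setting concrete, here is a minimal sketch of the core statistical idea: regress an official time series on an extracted social-media signal, then use the freshest signal value to estimate the figure that has not yet been released. This illustrates the problem RaccoonDB addresses, not its query optimizer; all numbers and signal names are hypothetical.

```python
# Toy nowcasting sketch: regress an official statistic on an extracted
# social-media signal. Hypothetical data throughout; RaccoonDB's
# contribution is optimizing queries over many candidate signals,
# which is not shown here.
import numpy as np

# Weekly counts of an extracted signal, e.g. posts matching
# "lost my job" (hypothetical numbers; latest week included).
signal = np.array([120, 135, 160, 190, 210, 205], dtype=float)

# Official unemployment claims (millions) for the same weeks,
# released with a lag, so the final week is the value to nowcast.
claims = np.array([3.1, 3.2, 3.5, 3.9, 4.2], dtype=float)

# Fit claims ~ a * signal + b on the weeks where both are known.
a, b = np.polyfit(signal[: len(claims)], claims, deg=1)

# Nowcast the not-yet-released figure from the freshest signal value.
print(f"nowcast for current week: {a * signal[-1] + b:.2f}M claims")
```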
The second system is an information extraction system designed to analyze online text and help law enforcement officers identify potential human trafficking victims. This system has been successfully applied to real-world cases. In addition, the resulting extracted dataset enables several novel social science findings about behavior in an illicit and often opaque market.
Bio: Michael Cafarella is an Associate Professor of Computer Science and Engineering at the University of Michigan. His research interests include databases, information extraction, data integration, and data mining. He has published extensively in venues such as SIGMOD and VLDB. Mike received his PhD from the University of Washington in 2009 with advisors Oren Etzioni and Dan Suciu. His academic awards include the NSF CAREER award, the Sloan Research Fellowship, and the VLDB Test of Time Award. In addition to his academic work, Mike cofounded (with Doug Cutting) the Hadoop open-source project. In 2015 he cofounded (with Chris Re and Feng Niu) Lattice Data, Inc., which is now part of Apple.
Automatic construction of knowledge bases is an exciting synthesis of research in machine learning, knowledge representation, natural language processing, and, more recently, crowdsourcing. Due to the ever-increasing over-specialization of researchers in AI, most work in the area strongly emphasizes one of these four areas and, as a result, de-emphasizes the others. Bringing these diverse elements together requires the fields to learn from each other; when they are treated as equals, this can force us to re-evaluate tacit assumptions that one field makes and another throws away. My own journey from formal knowledge representation and ontologies through building Watson and more recent AI systems at Google Research led me to re-evaluate the long-held tacit assumptions of the knowledge representation field and to arrive at a new understanding of its role in AI.
9:00 -- 10:30 | |
9:00 -- 9:15 | Welcome |
9:15 -- 10:10 | Invited talk: Machine Reading for Precision Medicine. Hoifung Poon |
10:10 -- 10:30 | Distantly Supervised Biomedical Knowledge Acquisition via Knowledge Graph Based Attention. Qin Dai, Naoya Inoue, Paul Reisert, Ryo Takahashi and Kentaro Inui |
10:30 -- 11:00 | Coffee break |
11:00 -- 12:30 | |
11:00 -- 11:50 | Invited talk: Extraction-Intensive Systems for the Social Sciences. Michael Cafarella |
11:50 -- 12:10 | Scalable, Semi-Supervised Extraction of Structured Information from Scientific Literature. Kritika Agrawal, Aakash Mittal and Vikram Pudi |
12:10 -- 12:30 | Understanding the Polarity of Events in the Biomedical Literature: Deep Learning vs. Linguistically-informed Methods. Enrique Noriega-Atala, Zhengzhong Liang, John Bachman, Clayton Morrison and Mihai Surdeanu |
12:30 -- 14:00 | Lunch break |
14:00 -- 15:30 | |
14:00 -- 14:50 | Invited talk: Extracting structured knowledge from biomedical publications. Dina Demner-Fushman |
14:50 -- 15:15 | 5-minute presentations for posters and the demo: |
| Dataset Mentions Extraction and Classification. Animesh Prasad, Chenglei Si and Min-Yen Kan |
| Annotating with Pros and Cons of Technologies in Computer Science Papers. Hono Shirai, Naoya Inoue, Jun Suzuki and Kentaro Inui |
| Browsing Health: Information Extraction to Support New Interfaces for Accessing Medical Evidence. Soham Parikh, Elizabeth Conrad, Oshin Agarwal, Iain Marshall, Byron Wallace and Ani Nenkova |
| An Analysis of Deep Contextual Word Embeddings and Neural Architectures for Toponym Mention Detection in Scientific Publications. Matthew Magnusson and Laura Dietz |
| STAC: Science Toolkit Based on Chinese Idiom Knowledge Graph (demo). Changliang Li, Meiling Wang, Yu Guo, Zhixin Zhao and Xiaonan Liu |
15:15 -- 16:00 | Coffee break and Poster session |
16:00 -- 17:30 | |
16:00 -- 16:50 | Invited talk: Just when I thought I was out, they pull me back in: The role of knowledge representation in automatic knowledge base construction. Chris Welty |
16:50 -- 17:10 | Playing by the Book: An Interactive Game Approach for Action Graph Extraction from Text. Ronen Tamari, Hiroyuki Shindo, Dafna Shahaf and Yuji Matsumoto |
17:10 -- 17:30 | Textual and Visual Characteristics of Mathematical Expressions in Scholar Documents. Vidas Daudaravicius |