Active Learning for NLP (ALNLP)
Workshop at NAACL-HLT 2010
When: Sunday June 6, 2010
Location: Millennium Biltmore Hotel, Los Angeles
Webpage: http://active-learning.net/alnlp2010
Proceedings
The ALNLP 2010 Proceedings are now available from ACL Press.
Overview
Labeled training data is required to achieve state-of-the-art performance for many machine learning solutions to NLP tasks. While traditional supervised methods rely exclusively on existing labeled data to induce a model, active learning allows the learner to select unlabeled data for labeling in an effort to reduce annotation costs without sacrificing performance. Thus, active learning appears promising for NLP applications where unlabeled data is readily available (e.g., web pages, audio recordings, minority language data), but obtaining labels is cost-prohibitive.
Ample recent work has demonstrated the effectiveness of active learning over a diverse range of applications. Despite these findings, active learning has not yet been widely adopted in many ongoing large-scale corpus annotation efforts, resulting in a dearth of real-world case studies and many open research questions. The machine learning literature has primarily focused on active learning in the context of classification, devoting less attention to issues specific to NLP, including annotation user studies, incorporation of semantic information, and more complex prediction tasks (e.g., parsing, machine translation). The aim of this workshop is to foster innovation and discussion that advance our understanding of these and other practical issues for active learning in NLP.
Also see the 2009 ALNLP workshop: program / proceedings.
Schedule
| Sunday, June 6, 2010 | | |
| 1:00-1:15 | Introduction | Burr Settles and Kevin Small |
| Invited Talk | | |
| 1:15-2:10 | Active and Proactive Machine Learning: From Fundamentals to Applications in Language Technologies and Beyond | Jaime Carbonell |
| Research Papers I | | |
| 2:10-2:35 | Using Variance as a Stopping Criterion for Active Learning of Frame Assignment | Masood Ghayoomi |
| 2:35-3:00 | Active Semi-Supervised Learning for Improving Word Alignment | Vamshi Ambati, Stephan Vogel and Jaime Carbonell |
| 3:00-3:30 | Break | |
| Research Papers II | | |
| 3:30-3:55 | D-Confidence: An Active Learning Strategy which Efficiently Identifies Small Classes | Nuno Escudeiro and Alipio Jorge |
| 3:55-4:20 | Domain Adaptation meets Active Learning | Piyush Rai, Avishek Saha, Hal Daume and Suresh Venkatasubramanian |
| 4:20-4:55 | Parallel Active Learning: Eliminating Wait Time with Minimal Staleness | Robbie Haertel, Paul Felt, Eric Ringger and Kevin Seppi |
| 4:55-5:30 | Discussion | |
Invited Talk
Jaime Carbonell
Language Technologies Institute
Carnegie Mellon University
Active and Proactive Machine Learning:
From Fundamentals to Applications
in Language Technologies and Beyond
Active learning is a well-recognized paradigm for addressing the paucity of labeled data (e.g., topic labels on web pages, parallel text for rare languages) and the relative abundance of unlabeled data (e.g., web crawls, monolingual text). Recent advances include robust ensemble approaches and methods for active rank learning. However, although active learning has gained prominence, we advocate a major extension that relaxes restrictive assumptions such as the existence of a single omniscient oracle. Instead, we investigate more realistic settings, such as the presence of multiple potentially fallible or reluctant external information sources with variable cost and unknown reliability. Crowdsourcing and games-with-a-purpose are examples of settings with multiple fallible sources. Proactive learning reaches out to external sources, jointly optimizing the learning of source properties (e.g., accuracy, expertise area, the minimal cost at which they are willing to label) and the selection of the most informative sources and instances for the learning task at hand. The proactive sampling methods trade off cost vs. information value vs. reliability, and amortized benefit vs. immediate reward, while remaining largely agnostic to the base-level learning algorithms. We have applied these methods to synthetic data and benchmark test data, and most recently are applying them to new challenges such as low-resource machine translation, co-reference resolution, and inferring the human interactome (and host-pathogen interactomes). The presentation will focus on the underlying active and proactive learning methods and touch on these applications.
Bio: Dr. Jaime Carbonell is the Director of the Language Technologies Institute and Allen Newell Professor of Computer Science at Carnegie Mellon University. He received SB degrees in Physics and Mathematics from MIT, and MS and PhD degrees in Computer Science from Yale University. His current research spans several areas, including virtually all aspects of Language Technologies: text mining, natural language processing, question answering, machine translation, and automated summarization (where he invented the MMR search-diversity method). He is also an expert in Machine Learning, having edited 3 books and served as editor-in-chief of the Machine Learning Journal for 4 years. He recently invented Proactive Machine Learning, including its underlying decision-theoretic framework. Overall, he has published over 250 articles and books. His research includes text and data mining, corpus-based approaches to machine translation such as context-based MT, reasoning under uncertainty, and computational proteomics, where he investigates machine learning and language technologies for predicting protein 3D structure (a.k.a. “the folding problem”) and function from protein primary sequences and biophysical constraints. He also contributes to autonomous agents that learn via observation and experience. Dr. Carbonell has served on multiple governmental advisory committees, such as the Human Genome Committee of the National Institutes of Health, the Oak Ridge National Laboratory Scientific Advisory Board, the National Institute of Standards and Technology Interactive Systems Scientific Advisory Board, and the German Research Center for Artificial Intelligence (DFKI) Scientific Advisory Board. He is also the chairman of Carnegie Speech Company, which produces intelligent language tutoring software.
Organizers
- Burr Settles - Carnegie Mellon University, USA
- Kevin Small - Tufts University, USA
- Katrin Tomanek - University of Jena, Germany
Program Committee
- Markus Becker - SPSS (an IBM company), UK
- Claire Cardie - Cornell University, USA
- Hal Daume III - University of Utah, USA
- Ben Hachey - Macquarie University, Australia
- Robbie Haertel - Brigham Young University, USA
- Udo Hahn - University of Jena, Germany
- Eric Horvitz - Microsoft Research, USA
- Rebecca Hwa - University of Pittsburgh, USA
- Ashish Kapoor - Microsoft Research, USA
- Prem Melville - IBM T.J. Watson Research Center, USA
- Ray Mooney - University of Texas at Austin, USA
- Fredrik Olsson - SICS, Sweden
- Foster Provost - New York University, USA
- Eric Ringger - Brigham Young University, USA
- Dan Roth - University of Illinois at Urbana-Champaign, USA
- Burr Settles - Carnegie Mellon University, USA
- Kevin Small - Tufts University, USA
- Katrin Tomanek - University of Jena, Germany