Call for Papers

It is well known that a large proportion of the time in data mining and, especially, data science projects is devoted to data acquisition, integration, transformation, cleansing and other highly tedious tasks. These tasks are tedious largely because they are repetitive and, hence, automatable. As a consequence, progress in automating this process can dramatically reduce the cost and duration of data-oriented projects. Recently, inductive programming in general (and the learning of declarative rules and programs from a few examples of user interaction in particular) has shown great potential for this automation. The releases of FlashFill as a plug-in inductive programming tool for Microsoft Excel and ConvertFrom-String as a PowerShell command on Windows 10 are impressive demonstrations that inductive programming research has matured to the point where commercial applications become feasible.

The aim of this workshop is to gather practitioners and researchers around the use of inductive programming techniques, programming by example and other learning techniques to automate the data wrangling process.

We welcome regular papers, demo papers about benchmarks or tools, and position papers, and encourage discussions over a broad list of topics (not exhaustive):

Topics

  • Automation applied to data cleaning, data transformation, and data acquisition.
  • Visual interfaces to accelerate the automation of data wrangling.
  • Domain-specific languages for data wrangling vs general-purpose languages.
  • Explanation of data wrangling rules into natural language.
  • Automation in ETL (Extraction/Transformation/Load) tools.
  • Learning actionable rules automating other parts of the KDD process: model evaluation and deployment.
  • Abstraction mechanisms from inductive programming for metadata creation and handling.
  • Data wrangling showcases.

Key Dates

Full Paper Submissions: August 12, 2016
Full Paper Notification: September 13, 2016
Camera-ready for accepted papers: September 20, 2016
Workshop date: December 12, 2016
All deadlines are at 11:59PM Pacific Daylight Time.

Submission

Paper submissions are limited to a maximum of eight (8) pages in the IEEE 2-column format, including the bibliography and any appendices. Submissions longer than 8 pages will be rejected without review. All papers must be formatted according to the IEEE Computer Society proceedings manuscript style, following the IEEE ICDM 2016 submission guidelines.

All submissions will be triple-blind reviewed by the Program Committee on the basis of technical quality, relevance to data mining, originality, significance, and clarity. Author names and affiliations must not appear in the submissions, and bibliographic references must be adjusted to preserve author anonymity. Authors of accepted papers will be asked to give a presentation (short or long) at the workshop. Accepted papers will be published in the IEEE ICDM 2016 Workshops Proceedings volume by IEEE Computer Society Press, and will also be included in the IEEE Xplore Digital Library. After the workshop, contributing authors will be invited to submit a paper to a special issue (journal to be announced).

Manuscripts must be submitted electronically through the IEEE ICDM CyberChair system. We do not accept email submissions.

Program Committee

Luc De Raedt, Katholieke Universiteit Leuven, Belgium
Peter Flach, University of Bristol, United Kingdom
José Hernández-Orallo, Technical University of Valencia, Spain
Bongshin Lee, Microsoft Research, Redmond, USA
Ute Schmid, Otto-Friedrich-Universität Bamberg, Germany
Mary Roth, IBM Research, San Jose, CA, USA
Armando Solar-Lezama, Massachusetts Institute of Technology, USA
Rishabh Singh, Microsoft Research, Redmond, USA
Gemma C. Garriga, Allianz SE, Munich, Germany
Janis Voigtländer, University of Bonn, Germany
Ricardo Aler Mur, Universidad Carlos III de Madrid, Spain
Umair Z. Ahmed, Indian Institute of Technology, Kanpur, India

Invited Talk

Charles Parker, Ph.D.

VP Machine Learning Algorithms at BigML, Inc

Talk

ML services are quickly becoming a commodity, and they will be taken for granted by developers and computer users alike in the near future. The building blocks for ML as a ubiquitous service are already in place, almost always in the form of remote APIs that provide a first level of abstraction over ML problem-solving and, especially, obviate scalability and resource allocation issues. But that's not enough: those building blocks still leak implementation details that are inessential to the application developer who needs to provide domain-specific solutions. We need to ascend a couple of rungs on the abstraction ladder and provide domain-specific languages that describe ML solutions without nitty-gritty details unrelated to the problem at hand, offering non-experts the possibility of automating their ML solutions. In this talk, we'll discuss our experience designing and developing BigML's data wrangling and ML workflow DSLs, Flatline and WhizzML, and how they generalize to similar ML services and APIs.

Bio

Dr. Charles Parker is the Vice President of Machine Learning Algorithms at BigML. He holds a Ph.D. in computer science from Oregon State University. He was previously a research associate at the Eastman Kodak Company, where he applied machine learning to image, audio, video, and document analysis. He also worked as a research analyst for Allston Holdings, a proprietary stock trading company, developing statistically based trading strategies for U.S. and European futures markets. His current work for BigML is in the areas of Deep Learning and Bayesian Parameter Optimization.

Workshop Programme

Data Mining in Emerging Domains I (December 12, 2016. 09:00h-13:00h, Room 11)

9:00 - 10:30 Maritime Domain Data Mining Session
10:30 - 11:00 Coffee Break
11:00 - 12:00 WhizzML: Designing and developing BigML's data wrangling, Charles Parker
12:00 - 12:20 Using Machine Learning to accelerate Data Wrangling, Shilpi Ahuja, Mary Roth, Rashmi Gangadharaiah, Peter Schwarz, and Rafael Zujur
12:20 - 12:40 Toward Representation Independent Analytics Over Structured Data, Jose Picado, Yodsawalai Chodpathumwan, Arash Termehchy, Alan Fern, and Yizhou Sun
12:40 - 13:00 Mining the Dark Web: drugs and fake ids, Andres Baravalle, Mauro Sanchez Lopez, and Sin Wee Lee
13:00 - 14:30 Lunch Break
The conference will be held at the World Trade Center Barcelona (Barcelona's Port Vell).

Workshop Chairs

Ben Zorn

Microsoft Research

Cèsar Ferri

Technical University of Valencia (Contact Person)

Atakan Cetinsoy

BigML

Gustavo Soares

University of California, Berkeley

Fernando Martínez-Plumed

Technical University of Valencia

Collaborators

The mission of Microsoft's Research in Software Engineering (RiSE) group is to advance the state of the art in Software Engineering and to bring those advances to Microsoft's businesses.

BigML offers cloud-based and on-premises machine learning services, distributed systems, and data visualization.