ICGI 2010 Tutorial Day

September 13th, 2010

The Lectures and the Lecturers

Graph Grammars: Representations, Algorithms, and Induction
Tim Oates, with Sourav Mukherjee (University of Maryland Baltimore County, USA)

Abstract Many domains are naturally relational and are best represented computationally as graphs. Examples of such domains include online social networks, in which nodes correspond to people and edges to friend relationships, protein structure prediction, in which nodes correspond to alpha-helices and beta-sheets and edges to their relative spatial relationships, and financial fraud, in which nodes correspond to people, companies, and financial institutions and edges to business transactions.
Just as string grammars are compact representations of the generative structure underlying sequence data, graph grammars are compact representations of the generative structure underlying relational data. The ability to learn both the structure (productions) and parameters (production probabilities) of graph grammars from data makes it possible to, for example, learn a graph grammar from a dataset of chemical compounds known to inhibit the AIDS virus and (1) design new compounds by sampling from the grammar, (2) evaluate proposed compounds by computing their probability given the grammar, and (3) gain deeper insight into why the compounds are effective by examining the grammar's productions.
This tutorial will introduce the fundamental concepts of graph grammars, including representations such as Node Controlled Embedding in which non-terminals label nodes and Hyper-Edge Replacement in which non-terminals label hyperedges, algorithms such as parsing, and induction of both the structure and parameters of graph grammars. We will use artificial examples when introducing the core concepts, and real-world examples from the long history of work on graph grammar applications to cement the ideas.


Dr. Tim Oates is an Associate Professor in the CSEE Department at the University of Maryland Baltimore County. He received his Ph.D. in Computer Science from the University of Massachusetts Amherst in 2001, spent a year as a Postdoctoral Fellow at the Massachusetts Institute of Technology Artificial Intelligence Lab, and then joined the Computer Science and Electrical Engineering Department at the University of Maryland Baltimore County. In 2005 he received a prestigious National Science Foundation CAREER award. His research interests are in the areas of grammar induction, language acquisition, statistical natural language processing, and developmental learning. Dr. Oates has published extensively in the area of graph grammar induction, including the first formally sound (maximum likelihood) algorithm for parameter estimation for hyperedge-relacement graph grammars.

Sourav Mukherjee is a final year PhD student in the Department of Computer Science at the University of Maryland Baltimore County. He received his MS is Computer Science in 2007 from the University of Maryland Baltimore County, and his Bachelor's degree in Computer Science and Engineering from Jadavpur University, Calcutta, India, in 2005. Sourav's research advisor is Dr. Tim Oates, and his current research focusses on stochastic graph grammars. He aims at developing efficient algorithms both for learning graph grammars, and for extracting useful information once a such grammar has been learned. A part of his work has been published in the International Colloquium on Grammatical Inference (ICGI) in 2008. His past research (as an MS student) focussed on Distributed Data Mining, and has lead to a publication in the Journal of Parallel and Distributed Computing (JPDC) in 2008. Sourav has served as a reviewer for the International Conference on Machine Learning (ICML) in 2009, and interned at Microsoft SQL Server Integration Services (SSIS) during the summers of 2008 and 2009.


Active Learning
Colin de la Higuera (University of Nantes, France)

Abstract In a number of situations, the learning algorithm can actively interact with its environment, instead of passively receiving masses of data. Instead of using given data, the algorithm may be able to perform tests, create new strings and experiment with these, explore paths, ask a human user to evaluate a current hypothesis, ask for a specific translation, evaluate how far it is from the solution. The set of all such situations can be formalised in the active (or Oracle, or query) learning paradigm. In this setting, the learning algorithm can query an Oracle (representing the environment). A number of possible queries have been studied, corresponding to theoretical motivations or practical ones.
The theoretical motivations usually attempt to answer the question can I learn as much with less information?, whereas the practical ones are attempts to use active learning in concrete situations, such as game playing, robotics, web applications or data labelling. In this tutorial we cover positive and negative aspects of this important paradigm in grammatical inference, with a special focus on the case of learning deterministic finite automata. We will describe applications, go through the theory, study the algorithms and propose open problems and challenges.


Dr. Colin de la Higuera has worked in the field of grammatical inference for the past 15 years. He has published more than 40 refereed papers on topics dealing with grammatical inference and is the author of Grammatical Inference. Learning automata and Grammars (Cambridge University Press, 2010). He has also organised a number of workshops on the topic and given several tutorials in conferences. He is currently running the Zulu related conference on active learning of DFA.


Modelling Biological Sequences by Grammatical Inference
Francois Coste (INRIA/IRISA, France)

Abstract Recent sequencing projects and technological progress in the field are giving access to an ever increasing amount of DNA, RNA and proteins macromolecular sequences. To annotate these sequences, fundamental and important tasks in Bioinformatics are the comparisons of sequences to determine common or consensus patterns among a family of sequences, discriminate members of the family from non-members, and discover new members of the family. In this tutorial, we propose a quick introduction to the world of the biological macromolecules and we will survey the approaches related to grammatical inference which have been developed in Bioinformatics to model these sequences, from well established weighting schemes for Profile HMM and Stochastic Context-Free grammars to approaches learning (also) the structure or topology of the grammars.

Dr. Francois Coste is an INRIA Research Scientist in the bioinformatics team Symbiose. He joined IRISA in 2000 after receiving his Ph.D. in Computer Science from the University of Rennes. His research interest is in learning grammars and application to linguistic modelling of biological sequences. Dr Coste contributed to the field of grammatical inference mainly on practical learning of automata and has been involved in several grammatical inference competitions. He is the author with Goulven Kerbellec of Protomata-Learner (http://protomata-learner.genouest.org/), a successful grammatical inference algorithm for learning automata on proteins.


Inference of Finite Automata
Damián López, with Pedro García (Universidad Politécnica de Valencia, Spain)

Abstract The results by Thrakhtenbrot-Barzdin and Gold can be considered as the first works on the identification of Finite Automata from given data. The main drawback of these results is that both may obtain hypotheses which can be inconsistent with the provided data. This drawback was solved by RPNI and Lang algorithms. Taking into account these previous results, some works introduced more efficient algorithms with respect to the training data. The direct consequence of this improvement lead to algorithms with lower error rates. Recently, some works tackle the identification of NFAs instead of the traditional DFA model. In this research direction, the inference of Residual Finite State Automata (RFSA) provides a canonical non-deterministic model. Other results considered the inference of teams of NFAs as a suitable method to solve the grammatical inference of finite automata. In this tutorial we will review the main approaches to solve the inference of finite automata by describing the previously related formalisms and their induction techniques.

Dr. Damián López is an Associate Professor in the SIC Departament at the Universidad Politécnica de Valencia (Spain) where he joined in 1996. Currently, he teaches at the Escuela Técnica Superior de Informática. He received his PhD in Computer Science from the same University in 2003. His research interests focus on several aspects of grammatical inference. They include the inference of multidimensional languages as well as the application of grammatical inference techniques to bioinformatic and syntactic pattern recognition tasks. He is a member of the European Association for Theoretical Computer Science (EATCS).

Dr. Pedro García is an Associate Professor in the SIC Department at the Universidad Politécnica de Valencia (Spain) where he joined in 1987. Currently, he teaches at the Escuela Técnica Superior de Informática. He received the B.S. degree in physics from the Universidad de Valencia (Spain) and the Ph.D. degree in Computer Science from the Universidad Politécnica de Valencia (Spain) in 1976 and 1988, respectively. He has a broad experience on grammatical inference and he has published many results on that field during the last twenty years. He is one of the authors of the RPNI algorithm which has been widely mentioned in the Grammatical Inference community. His current research interests are deeply related with grammatical inference, such as automata theory and formal languages.