Abstract
Many domains are naturally relational and are best represented
computationally as graphs. Examples of such domains include online
social networks, in which nodes correspond to people and edges to
friend relationships, protein structure prediction, in which nodes
correspond to alpha-helices and beta-sheets and edges to their
relative spatial relationships, and financial fraud, in which nodes
correspond to people, companies, and financial institutions and edges
to business transactions.
Just as string grammars are compact representations of the generative
structure underlying sequence data, graph grammars are compact
representations of the generative structure underlying relational
data. The ability to learn both the structure (productions) and
parameters (production probabilities) of graph grammars from data
makes it possible to, for example, learn a graph grammar from a
dataset of chemical compounds known to inhibit the AIDS virus and (1)
design new compounds by sampling from the grammar, (2) evaluate
proposed compounds by computing their probability given the grammar,
and (3) gain deeper insight into why the compounds are effective by
examining the grammar's productions.
This tutorial will introduce the fundamental concepts of graph
grammars, including representations such as Node Controlled Embedding
in which non-terminals label nodes and Hyper-Edge Replacement in which
non-terminals label hyperedges, algorithms such as parsing, and
induction of both the structure and parameters of graph grammars. We
will use artificial examples when introducing the core concepts, and
real-world examples from the long history of work on graph grammar
applications to cement the ideas.
Dr. Tim Oates is an Associate Professor in the CSEE Department at the
University of Maryland Baltimore County. He received his Ph.D. in
Computer Science from the University of Massachusetts Amherst in 2001,
spent a year as a Postdoctoral Fellow at the Massachusetts Institute
of Technology Artificial Intelligence Lab, and then joined the
Computer Science and Electrical Engineering Department at the
University of Maryland Baltimore County. In 2005 he received a
prestigious National Science Foundation CAREER award. His research
interests are in the areas of grammar induction, language acquisition,
statistical natural language processing, and developmental learning.
Dr. Oates has published extensively in the area of graph grammar
induction, including the first formally sound (maximum likelihood)
algorithm for parameter estimation for hyperedge-relacement graph
grammars.
Sourav Mukherjee is a final year PhD student in the Department of
Computer Science at the University of Maryland Baltimore County. He
received his MS is Computer Science in 2007 from the University of
Maryland Baltimore County, and his Bachelor's degree in Computer
Science and Engineering from Jadavpur University, Calcutta, India, in
2005. Sourav's research advisor is Dr. Tim Oates, and his current
research focusses on stochastic graph grammars. He aims at developing
efficient algorithms both for learning graph grammars, and for
extracting useful information once a such grammar has been learned. A
part of his work has been published in the International Colloquium on
Grammatical Inference (ICGI) in 2008. His past research (as an MS
student) focussed on Distributed Data Mining, and has lead to a
publication in the Journal of Parallel and Distributed Computing
(JPDC) in 2008. Sourav has served as a reviewer for the International
Conference on Machine Learning (ICML) in 2009, and interned at
Microsoft SQL Server Integration Services (SSIS) during the summers of
2008 and 2009.
Abstract
In a number of situations, the learning algorithm can actively interact with its environment, instead of
passively receiving masses of data. Instead of using given data, the algorithm may be able to perform tests, create new
strings and experiment with these, explore paths, ask a human user to evaluate a current hypothesis, ask for a specific
translation, evaluate how far it is from the solution.
The set of all such situations can be formalised in the active (or Oracle, or query) learning paradigm. In this setting, the
learning algorithm can query an Oracle (representing the environment). A number of possible queries have been studied,
corresponding to theoretical motivations or practical ones.
The theoretical motivations usually attempt to answer the question can I learn as much with less information?, whereas the
practical ones are attempts to use active learning in concrete situations, such as game playing, robotics, web applications or
data labelling.
In this tutorial we cover positive and negative aspects of this important paradigm in grammatical inference, with a special
focus on the case of learning deterministic finite automata. We will describe applications, go through the theory, study the
algorithms and propose open problems and challenges.
Dr. Colin de la Higuera has worked in the field of grammatical inference for the past 15 years. He has published more than 40
refereed papers on topics dealing with grammatical inference and is the author of
Grammatical Inference. Learning automata and Grammars
(Cambridge University Press, 2010). He has also organised a number of workshops on the topic and given several tutorials in
conferences. He is currently running the Zulu related conference on active learning of DFA.
Abstract
Recent sequencing projects and technological progress in the field are giving access to an ever increasing amount of DNA, RNA and
proteins macromolecular sequences. To annotate these sequences, fundamental and important tasks in Bioinformatics are the comparisons
of sequences to determine common or consensus patterns among a family of sequences, discriminate members of the family from
non-members, and discover new members of the family. In this tutorial, we propose a quick introduction to the world of the biological
macromolecules and we will survey the approaches related to grammatical inference which have been developed in Bioinformatics to model
these sequences, from well established weighting schemes for Profile HMM and Stochastic Context-Free grammars to approaches learning
(also) the structure or topology of the grammars.
Dr. Francois Coste is an INRIA Research Scientist in the bioinformatics team Symbiose.
He joined IRISA in 2000 after receiving his Ph.D. in Computer Science from the University of Rennes.
His research interest is in learning grammars and application to linguistic modelling of biological sequences.
Dr Coste contributed to the field of grammatical inference mainly on practical learning of automata
and has been involved in several grammatical inference competitions. He is the author with Goulven Kerbellec of Protomata-Learner
(http://protomata-learner.genouest.org/), a successful grammatical
inference algorithm for learning automata on proteins.
Abstract
The results by Thrakhtenbrot-Barzdin and Gold can be considered as the first works on the identification
of Finite Automata from given data. The main drawback of these results is that both may obtain
hypotheses which can be inconsistent with the provided data. This drawback was solved by RPNI and Lang algorithms.
Taking into account these previous results, some works introduced more efficient algorithms with respect
to the training data. The direct consequence of this improvement lead to algorithms with lower error
rates. Recently, some works tackle the identification of NFAs instead of the traditional DFA model. In
this research direction, the inference of Residual Finite State Automata (RFSA) provides a canonical
non-deterministic model. Other results considered the inference of teams of NFAs as a suitable
method to solve the grammatical inference of finite automata. In this tutorial we will review the main approaches
to solve the inference of finite automata by describing the previously related formalisms and their
induction techniques.
Dr. Damián López is an Associate Professor in the SIC Departament
at the Universidad Politécnica de Valencia (Spain) where he joined in 1996. Currently, he teaches
at the Escuela Técnica Superior de
Informática. He received his PhD in Computer Science from the same University in 2003. His research
interests focus on several aspects of grammatical inference. They include the inference of
multidimensional languages as well as the application of grammatical inference techniques to
bioinformatic and syntactic pattern recognition tasks. He is a member of the European Association for
Theoretical Computer Science (EATCS).
Dr. Pedro García is an Associate Professor in the SIC Department at the
Universidad Politécnica de Valencia (Spain) where he joined in 1987. Currently, he teaches
at the Escuela Técnica Superior de Informática. He received the B.S. degree in physics from the
Universidad de Valencia (Spain) and the Ph.D. degree in Computer Science from the Universidad
Politécnica de Valencia (Spain) in 1976 and 1988, respectively.
He has a broad experience on grammatical inference and he has published many results on that field during the
last twenty years. He is one of the authors of the RPNI algorithm which has been
widely mentioned in the Grammatical Inference community.
His current research interests are deeply related with grammatical inference, such as automata theory and
formal languages.