Tutorials

University of Washington CSE

Background

This data set was created by Pedro Domingos et al for demonstrating the capabilities of Markov logic networks. Anonymous data was obtained from the University of Washington Department of Computer Science and Engineering. Basic entities include persons (students or professors), scientific papers, and academic courses. Available relations specify e.g. whether a person was the author of a certain paper or whether he/she participated in the teaching activities of a certain course. The learning task consists of predicting students' advisors, i.e. to predict the binary relation "advised_by" between students and professors.

Data format

Data comes in the form of true ground atoms, under the closed-world assumption. Since (first-order logic) functions are not allowed in the language, a ground atom is essentially like a tuple in a relational database. For example the ground atomtaught_by(course170,person211,winter_0102) says that course course170 was taught by person211 in term winter_0102.

kLog learns from interpretations. This means that the data is given as a set of interpretations (or logical worlds) where each interpretation is a set of ground atoms which are true in that world. In this example there are five interpretations: ai graphics language systems and theory, corresponding to different research groups in the department. kLog data files are valid Prolog programs where true ground atoms are asserted using the special predicate intepretation/2: the first argument is the interpretation identifier and the second argument is a ground atom. You may think that all the ground atoms in a particular interpretation form an instance of a relational data base and that the overall data set consists of several (disjoint) databases. Below is a small fragment of the UWCSE data set.

interpretation(ai,advised_by(person21,person211)).
interpretation(ai,advised_by(person45,person211)).
interpretation(ai,has_position(person211,faculty)).
interpretation(ai,has_position(person407,faculty)).
interpretation(ai,in_phase(person14,post_generals)).
interpretation(ai,in_phase(person21,post_generals)).
interpretation(ai,in_phase(person284,post_quals)).
interpretation(ai,in_phase(person45,post_generals)).
interpretation(ai,professor(person211)).
interpretation(ai,professor(person407)).
interpretation(ai,publication(title110,person14)).
interpretation(ai,publication(title110,person407)).
interpretation(ai,publication(title25,person21)).
interpretation(ai,publication(title25,person211)).
interpretation(ai,publication(title25,person284)).
interpretation(ai,publication(title25,person407)).
interpretation(ai,publication(title44,person211)).
interpretation(ai,publication(title44,person415)).
interpretation(ai,publication(title44,person45)).
interpretation(ai,student(person14)).
interpretation(ai,student(person21)).
interpretation(ai,student(person284)).
interpretation(ai,student(person45)).
interpretation(ai,ta(course12,person21,winter_0203)).
interpretation(ai,ta(course24,person21,spring_0203)).
interpretation(ai,ta(course24,person70,autumn_0304)).
interpretation(ai,taught_by(course12,person211,autumn_0001)).
interpretation(ai,taught_by(course143,person211,winter_0001)).
interpretation(ai,taught_by(course170,person211,winter_0102)).
interpretation(ai,taught_by(course24,person211,autumn_0102)).
interpretation(ai,taught_by(course24,person211,spring_0203)).
interpretation(ai,taught_by(course24,person407,spring_0304)).
interpretation(ai,taught_by(course82,person407,winter_0102)).

Data Modeling

The first step in kLog modeling is to describe the domain using a classic database tool: Entity relationship diagrams. We begin by modeling two entity sets: student and professor, two unary relations: in_phase and has_position, and one binary relation: advised_by (which is the target in this example).

E/R diagram for the UWCSE domain

The kLog model is written in this case using the following fragment of embedded Prolog code. The model must be declared within a code section delimited by begin_domain and end_domain in a regular Yap Prolog file. use_module(klog). is necessary to inform Yap of the kLog syntax and to load the necessary library predicates.

:- use_module(klog).

begin_domain.
  signature has_position(professor_id::professor,
                         position::property
                        )::extensional.
  signature advised_by(p1::student,
                       p2::professor
                      )::extensional.
  signature student(student_id::self)::extensional.
  signature professor(professor_id::self)::extensional.
end_domain.

Every entity or relationship that kLog will later use to generate features (see feature generation below) is declared with the special keyword signature. There are two kinds of signatures, annotated by the special reserved words extensional and intensional. In the extensional case, all ground atoms have to be listed explicitly in the data file; in the intensional case, ground atoms are defined implicitly using Prolog definite clauses. A signature has a name and a list of arguments with types. A type is either the name of an antity set (declared in some other signature) or property. In the ground atoms, constants that are not properties are regarded as identifiers and are simply used to connect ground atoms (these constants disappear in the result of the graphicalization procedure explained below). The special type name self is used to denote the name of the signature being declared, as in lines 14-15 of the above code. Thus, a signature containing an argument of type self introduces a new entity set while other signatures introduce relationships.

Intensional signatures

One of the powerful features of kLog is its ability to introduce novel relations using a mechanism resembling deductive databases. Such relations are typically a means of injecting domain knowledge. In our example, it may be argued that the likelihood that a professor advises a student increases if the two persons have been engaged in some form of collaboration, such as co-authoring a paper, or working together in teaching activities. Below we show two intensional signatures for this purpose. An intensional signature declaration must be complemented by a predicate (written in pure Prolog) which defines the new relation. When working on a given interpretation, kLog asserts all the true ground atoms in that interpretation in the Prolog database and collects all true groundings for the predicate associated with the intensional signature. Although not shown in the example, intensional predicates may consist of several clauses and may refer each other. True groundings are defined by the standard Prolog semantics.

begin_domain.
... % same as above
signature on_same_course(s::student,p::professor)::intensional.
on_same_course(S,P) :-
    professor(P), student(S),
    ta(Course,S,Term), taught_by(Course,P,Term).

signature on_same_paper(s::student,p::professor)::intensional.
on_same_paper(S,P) :-
    student(S), professor(P),
    publication(Pub, S), publication(Pub,P).
end_domain.

Graphicalization

This procedure maps a set of ground atoms into a bipartite undirected graph whose nodes are true ground atoms and whose edges connect an entity atom to a relationship atom if the identifier of the former appears as an argument in the latter.

Graphicalization on the tiny data set of the example. Identifiers (such as person45) are only shown for the sake of clarity, kLog does not use them to generate features. All other information shown in the graph (including edge labels, representing positions of the arguments) do contribute to the generated features.

Graphicalization occurs when a data set is first loaded into memory through the library predicate attach/1. In the example below, extensional facts are contained into the Prolog file called uwcse_ext.pl.

experiment :-
    set_klog_flag(klog_master,verbosity,4),
    attach(uwcse_ext),
    ...

Feature generation

Features are obtained from the ground graph using a graph kernel. In principle, any graph kernel can be adapted for this purpose. Currently, however, only a plugin for a generalized version of the fast NSPDK (neighborhood subgraph pairwise distance kernel) of Costa and De Grave (2010) is implemented. Details on the extended NSPDK can be found in the kLog paper. Briefly, the graph kernel comes in two versions. In the hard version, features correspond to subgraphs of the ground graph and in the soft version they correspond to statistics of smaller fragments of such subgraphs (based on vertex and edge labels). A feature generator is declared using the library predicate new_feature_generator/2 whose arguments are two Prolog atoms representing the generator handle and the generator type, respectively. Under the hoods, feature generators are coded in C++ and the handle is used as a C++ string which acts as the key in a lookup table retrieving the pointer to a C++ object. The code fragment below introduces a new generator called my_fg of type nspdk, and selects some of the kernel parameters (distance, radius, match type).

    ...
    new_feature_generator(my_fg,nspdk),
    set_klog_flag(my_fg,distance,2),
    set_klog_flag(my_fg,match_type,soft),
    set_klog_flag(my_fg,radius,2),
    ...

The NSPD kernel in kLog uses the important notion of "kernel points" to specify the parts extracted from the ground graph and used as features. The Prolog predicate kernel_points/1 needs to be used for this purpose (otherwise all feature vectors would be empty). Its argument is a list of atoms that are signature names. NSPDK balls are centered around each node in the ground graph if that node has the type of a signature specified as a kernel point.

  kernel_points([student,professor,on_same_course,on_same_paper]).

Learning

A learning problem is declared by designating one (or more) signature(s) as target (in this domain the target relation is advised_by//2). Precisely, this means the following: given a set of training interpretations (e.g. ai, theory, graphics, systems), with fully observed ground atoms, learn a function capable of filling-in the target relation.

The learnable function is called a model and declared via the library predicate new_model/2. Like model generators, new_model/2 takes two atoms, a model handle and a specification of the model type. Below is the complete predicate for the UW-CSE experiment

experiment :-
    set_klog_flag(klog_master,verbosity,4),
    new_feature_generator(my_fg,rnspdk),
    set_klog_flag(my_fg,distance,2),
    set_klog_flag(my_fg,match_type,soft),
    set_klog_flag(my_fg,radius,2),
    set_klog_flag(referential_integrity_repair,ignore),
    attach(uwcse_ext),
    new_model(my_model,svm_sgd),
    set_klog_flag(my_model,lambda,0.0001),
    set_klog_flag(my_model,epochs,12),
    set_klog_flag(my_model,lossratio,0.2),
    kfold(advised_by,5,my_model,my_fg).

In this example, svm_sgd is a Support Vector Machine classifier trained with the Stochastic Gradient Descent approach (C++ code from Leon Bottou is integrated into the kLog library). The model accepts some flags such as the regularization parameter lambda, the number of training epochs, and the lossratio which in this example assigns positive examples a cost five times bigger than to negative examples. The library predicate kfold/4 (line 13) specifies a k-fold cross validation procedure. Cross validation happens at the level of interpretations, i.e. if there are $n$ interpretations in total, ${(k-1)n}/{k}$ interpretations are used for training and $k/n$ for testing. Since in this example there are exactly 5 interpretations, the 5-fold cross validation corresponds precisely to the "leave-one-research-group-out" validation that has been customarily applied to this particular dataset in the literature. Note that the target relation advised_by//2 is specified as the first argument to kfold/4. The other arguments are the number of folds, and the two handles, the model handle, and the feature generator handle.

Running it

First install kLog following these instructions. Then download the whole kLog script uwcse.pl and the data file uwcse_ext.pl and place both in the same folder. From the shell prompt type:

yap -l wucse.pl
% Restoring file /Users/paolo/local/lib/Yap/startup.yss
YAP 6.2.2 (i386-darwin10.8.0): Thu 26 Jan 2012 16:42:44 CET
MYDDAS version MYDDAS-0.9.1
 % reconsulting /Users/paolo/.yaprc...
 % reconsulted /Users/paolo/.yaprc in module user, 0 msec 9680 bytes

kLog version 0.1
   ?-

From the Prolog prompt you may now run the main predicate experiment. This will prepare the model and run the 5-fold cross validation procedure. If everything goes right you should see messages indicating the progress of computation and finally a summary of performance results.

Bursi (classification of small molecules)

Background

Here we consider the problem of discriminating between mutagenic and non-mutagenic compounds, a classic task in relational learning. For this purpose we consider the data set developed by Kazius et al (2005) and commonly referred to as the "Bursi" data set. It consists of 4,337 small molecules labeled according to mutagenicity (2,401 mutagens and 1,936 nonmutagens).

Every molecule corresponds to one interpretation so the kLog job is binary classification of interpretations. The extensional predicates a/2 and b/3 are used to describe atoms and chemical bonds, respectively. The first argument of a/2 is the atom ID and the second argument the chemical element (e.g. c for carbon). The first two arguments of b/3 are atom IDs and the third argument the bond type (where 1 means single bond, 2 double bond, 3 triple bond, and 4 resonant bond). Below is the code fragment describing one particular molecule.

interpretation(i788,a(a1,c)).
interpretation(i788,a(a2,c)).
interpretation(i788,a(a3,cl)).
interpretation(i788,a(a4,br)).
interpretation(i788,a(a5,n)).
interpretation(i788,a(a6,h)).
interpretation(i788,b(a1,a2,1)).
interpretation(i788,b(a1,a3,1)).
interpretation(i788,b(a1,a4,1)).
interpretation(i788,b(a2,a5,3)).
interpretation(i788,b(a1,a6,1)).
interpretation(i788,sub(f1,halide,1)).
interpretation(i788,sub(f2,halide,1)).
interpretation(i788,sub(f3,nitrile,2)).
interpretation(i788,sub(f4,aliphatic_chain,1)).
interpretation(i788,subat(f1,a3,1)).
interpretation(i788,subat(f2,a4,1)).
interpretation(i788,subat(f3,a2,1)).
interpretation(i788,subat(f3,a5,2)).
interpretation(i788,subat(f4,a1,1)).
interpretation(i788,
               linked(f4,
                      [link(f3,a2,a1),link(f1,a3,a1),link(f2,a4,a1)],
                      [branch(1,3)],saturated)).
interpretation(i788,target(mutagen)).

Some additional predicates are sub/3, subat/3, and linked/2 which describe functional groups, membership of atoms in functional groups, and how functional groups are linked together. These predicates could be defined intensionally using special Prolog rules but in this data set they had been already generated outside kLog using DMax Chemistry Assistant and we therefore put them in the set of extensional ground atoms. Note thatlinked/2 takes structured arguments, which cannot be handled directly by kLog and, in facts, we define no extensional signature associated with this predicate (but we use it in an intensional signature, see below). Finally, target/1 describes the category of the molecule (mutagen is the positive class, nonmutagen the negative class).

Declarations

:- use_module('klog').

describeBondType(1,single).
describeBondType(2,double).
describeBondType(3,triple).
describeBondType(4,aromatic).

begin_domain.

signature atm(atom_id::self, element::property)::intensional.
atm(Atom, Element) :-
    a(Atom,Element), \+(Element=h).

signature bnd(atom_1@b::atm, atom_1@b::atm,
              type::property)::intensional.
bnd(Atom1,Atom2,Type) :-
    b(Atom1,Atom2,NType),
    describeBondType(NType,Type),atm(Atom1,_),atm(Atom2,_).

signature fgroup(fgroup_id::self, group_type::property)::intensional.
fgroup(Fg,Type) :-  sub(Fg,Type,_).

signature fgmember(fg::fgroup, atom::atm)::intensional.
fgmember(Fg,Atom):- subat(Fg,Atom,_), atm(Atom,_).

signature fg_fused(fg1@nil::fgroup,
                   fg2@nil::fgroup, nrAtoms::property)::intensional.
fg_fused(Fg1,Fg2,NrAtoms) :- fus(Fg1,Fg2,NrAtoms,_AtomList).

signature fg_connected(fg1@nil::fgroup,fg2@nil::fgroup,
                       bondType::property)::intensional.
fg_connected(Fg1,Fg2,BondType) :-
    con(Fg1,Fg2,Type,_AtomList),
    describeBondType(Type,BondType).

signature fg_linked(fg::fgroup, alichain::fgroup,
                    saturation::property)::intensional.
fg_linked(FG,AliChain,Sat) :-
    linked(AliChain,Links,_BranchesEnds,Saturation),
    ( Saturation = saturated ->
      Sat = saturated
    ;
      Sat = unsaturated
    ),
    member(link(FG,_A1,_A2),Links).

signature mutagenic::intensional.
mutagenic :- target(mutagen).

kernel_points([atm,fgroup]).

end_domain.

The result of graphicalizing molecule i788 is shown below:

Graphicalization of one molecule

Learning

The following code fragment repeats a k-fold cross-validation for each combination of kernel parameters and SVM regularization constant. Note that graphicalization happens just once, when calling attach/1, while feature generation may happen multiple times for the same molecule. Feature vectors, once generated, are cached for subsequent reuse. However if one of the graph kernel parameters (e.g. the NSPDK radius) change, they are destroyed and need to be re-created when needed. Hence, the code below would run slower if the sweep over the SVM regularization parameter (which does not affect feature vectors) was the outer loop.

sweep(RValues,DValues,CValues,K) :-
  set_klog_flag(klog_master,verbosity,3),
  new_feature_generator(my_fg,nspdk),
  new_model(model1,libsvm_c_svc),
  set_klog_flag(model1, kernel_type, polynomial),
  attach(bursi_ext),
  forall(member(Radius,RValues),
         (set_klog_flag(my_fg,radius,Radius),
          forall(member(Distance,DValues),
                 (set_klog_flag(my_fg,distance,Distance),
                  forall(member(C,CValues),
                         (set_klog_flag(model1,c,C),
                          stratified_kfold(mutagenic,K,model1,my_fg,mstrat)
                         )
                        )
                 )
               )
         )
       ),
  true.

The library predicate stratified_kfold/4 ensures that the percentage of positive and negative examples is roughly constant across the k folds. kLog employs a rather general solution to stratification and the fourth argument (mstrat in the above code) is actually the name of a user-defined predicate which should group objects of the same stratum together. In our case the solution is very simple:

mstrat(Interpretation,Stratum) :-
  (db:interpretation(Interpretation,mutagenic) ->
    Stratum = muta
  ;
    Stratum = nonmuta
  ).
A more interesting case could be in the case of regression, e.g.
mstrat(Interpretation,Stratum) :-
  db:interpretation(Interpretation,activity(A)),
  (A > 6.3 ->
    Stratum = active
  ;
    Stratum = inactive
  ).
where perhaps 6.3 is the median of the target attribute activity in the data set.

Gathering results

kLog saves results and log files in a folder called results/$HASH where $HASH is a cryptographic hash of the sequence of characters obtained by concatenating the kLog script and the values of all the flags. Thus, in the above example, one results folder is generated for each parameter combination. If results were already computed before using the same script and the same flags, the k-fold procedure is not executed and results are not overwritten. This default behavior may be overridden by calling set_klog_flag(clobber,yes).