learn.pl -- Learning support in kLog

Formulate kLog learning jobs. Below are some examples of target relations that kLog may be asked to learn. The currently implemented models cover only a fraction of these.

Jobs with relational arity = 0:

Binary classification of each interpretation.

Example: classification of small molecules.

signature mutagenic::extensional.

Regression for each interpretation.

Example: predict the binding affinity of small molecules.

signature affinity(strength::property)::extensional.

Multitask learning on individual interpretations.

Example: predict mutagenicity level and logP for small molecules.

signature
  molecule_properties(mutagenicity::property(real),
                      logp::property(real))::extensional.

Multiclass classification for each interpretation.

Example: image categorization.

signature image_category(cat::property)::extensional.

Jobs with relational arity = 1:

Binary classification of entities in each interpretation.

Examples: detect spam webpages as in the Web Spam Challenge (http://webspam.lip6.fr/wiki/pmwiki.php), predict blockbuster movies in IMDb.

signature spam(url::page)::extensional.
signature blockbuster(m::movie)::extensional.

Multiclass classification of entities in each interpretation.

Examples: WebKB, POS tagging, NER, protein secondary structure prediction.

signature page(url::page,category::property)::extensional.
signature pos_tag(word::position,tag::property)::extensional.
signature named_entity(word::position,ne::property)::extensional.
signature secondary_structure(r::residue,ss::property)::extensional.

Regression on entities in each interpretation.

Example: traffic flow forecasting.

signature flow_value(s::station,flow::property(real))::extensional.

Jobs with relational arity = 2:

Link prediction tasks.

Examples: protein beta partners, UW-CSE, entity resolution, protein-protein interactions.

signature partners(r1::residue,r2::residue)::extensional.
signature advised_by(p1::person,p2::person)::extensional.
signature same_venue(v1::venue,v2::venue)::extensional.
signature phosphorylates(p1::kinase,p2::protein)::extensional.
signature regulates(g1::gene,g2::gene)::extensional.

Regression on pairs of entities.

Examples: traffic flow forecasting at different stations and different lead times, prediction of the distance between protein secondary structure elements.

signature congestion(s::station,lead::time,
                     flow::property(float))::extensional.
signature distance(sse1::sse,sse2::sse,d::property(float))::intensional.

Pairwise hierarchical classification.

Example: traffic congestion level forecasting at different stations and different lead times.

signature congestion_level(s::station,lead::time,
                           level::property)::extensional.

Jobs with relational arity > 2:

Prediction of hyperedges.

Example: spatial role labeling.

signature ttarget(w1::sp_ind_can, w2::trajector_can,
                  w3::landmark_can)::intensional.

Example: metal binding geometry.

signature binding_site(r1::residue,r2::residue,
                       r3::residue,r4::residue)::extensional.

Classification of hyperedges.

Example: metal binding geometry with ligand prediction.

signature binding_site(r1::residue,r2::residue,
                       r3::residue,r4::residue,
                       metal::property)::extensional.

Subsampling

The following predefined dynamic predicate fails by default:

  user:klog_reject(+Case,+TrainOrPredict)

By defining your own version of it, you can implement selective subsampling of cases. A typical use is dealing with a highly imbalanced data set where you want to retain only a fraction of the negatives. Case is a callable Prolog goal associated with a training or prediction case. When kLog generates cases, it calls klog_reject/2 to determine whether the case should be rejected. TrainOrPredict is either 'train' or 'predict', enabling subsampling at training or prediction time, respectively. For example, suppose we have a binary classification task and we want to reject 90 percent of the negatives at training time. Then the following code should be added to the kLog script.

klog_reject(Case,train) :-
  \+ call(Case),   % Case does not hold, i.e. this is a negative case
  random(R),       % R is uniformly distributed in (0,1)
  R > 0.1.         % succeed (reject the case) with probability 0.9
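
Subsampling at prediction time works in the same way through the 'predict' mode. As a purely illustrative variant (not part of kLog), the following clause would skip half of all cases at prediction time:

klog_reject(_Case,predict) :-
  random(R),       % R is uniformly distributed in (0,1)
  R > 0.5.         % reject with probability 0.5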

Predicates

author
- Paolo Frasconi
depends_transitive(?S1:atom, ?S:atom) is det[private]
Tabled. Contains the results of the dependency analysis used to properly kill vertices in the graphs. The predicate succeeds if S1 "depends" on S. The mechanism actually goes a bit beyond a simple transitive closure of the call graph: when several targets are present in the same file, it is necessary to extend the definition of dependencies. The set of dependent signatures is defined as follows:
  1. All ancestors, including S itself (obvious)
  2. All the descendants (less obvious; however, if a target signature S is intensional and calls some predicate Y in its definition, then Y presumably (although not necessarily) contains some supervision information, which should be removed).
  3. All the ancestors of the descendants (maybe even less obvious, but if some other signature T calls Y, then T will also contain supervision information).

    The descendants of the ancestors need not be removed.

    Some of the ancestors of S might be legitimate predicates that have nothing to do with supervision. Therefore we restrict ourselves to the subset of the call graph containing predicates that are either in the user: namespace (where the 'user' could cheat by accessing supervision information) or in the data set file (in the db: namespace). In any case, since removal of vertices from the graph might surprise the user, kLog issues a warning if any signatures besides the target signature of the current training or test procedure are killed. Finally, one can declare a signature to be safe using safe/1; in this case it is assumed that the Prolog code inside it does not exploit any supervision information. safe/1 should be used with care.

    The table depends on current_depends_directly/2 which is asserted by record_dependencies/0, called at the end of graphicalize:attach/1.
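
The closure above can be pictured with the following sketch, assuming current_depends_directly(S1,S) holds when S1 directly calls S (ancestor_of/2, descendant_of/2 and dependent_signature/2 are hypothetical names; the actual implementation is tabled, which also copes with cycles in the call graph):

ancestor_of(S,A) :- current_depends_directly(A,S).
ancestor_of(S,A) :- current_depends_directly(B,S), ancestor_of(B,A).
descendant_of(S,D) :- current_depends_directly(S,D).
descendant_of(S,D) :- current_depends_directly(S,B), descendant_of(B,D).

% dependent_signature(+S,-S1): S1 must be killed together with S.
dependent_signature(S,S).                           % rule 1: S itself
dependent_signature(S,S1) :- ancestor_of(S,S1).     % rule 1: all ancestors
dependent_signature(S,S1) :- descendant_of(S,S1).   % rule 2: all descendants
dependent_signature(S,S1) :-                        % rule 3: ancestors of descendants
  descendant_of(S,D),
  ancestor_of(D,S1).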

train(+S:atom, +Examples:list_of_atoms, +Model:atom, +Feature_Generator:atom) is det[private]
Train Model on a data set of interpretations, using Feature_Generator as the feature generator. Examples is a list of interpretation identifiers. S is a signature. For each possible combination of identifiers matching the types in S, a set of "cases" is generated, where a case is just a pseudo-iid example. If S has no properties, positive cases are those for which a tuple exists in the interpretation and negative cases are all the rest (see the sketch after this entry). If S has one property, this property acts as a label for the case (this can also be a regression task if the property is a real number). Two or more properties define a multitask problem (currently handled as a set of independent tasks). Problems like small-molecule classification have a target signature with zero relational arity, which works as a special case of relationship. Model should have the ability to solve the task specified in the target signature S.
Errors
- if the target signature is a kernel point (training on it would be cheating)
- if Model cannot solve the task specified by S.
To be done
- Multitask taking correlations into account
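
For a property-free target such as spam(url::page), the labeling rule described in train/4 amounts to the following sketch (case_label/3 is a hypothetical helper; the actual case generation happens inside cases_loop/7, and the label encoding may differ):

% A case is positive iff the target tuple is a fact of the interpretation.
case_label(Ex,Fact,pos) :- db:interpretation(Ex,Fact), !.
case_label(_Ex,_Fact,neg).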
kill_present(+TargetSignature:atom, +Examples:list) is det[private]
Predicates kill_present/2 and kill_future/1 are needed to set up training and test data. Let's call the vertices associated with the target signature, plus all their dependents (defined by depends_transitive/2), the set of "query" vertices. In the case of unsliced interpretations, all query vertices must be killed. The case of sliced interpretations is trickier and is handled as follows:

Example using IMDb, with slice_preceq/2 defined by time (years):

Suppose the data set contains imdb(1953),...,imdb(1997) and that we want to train on [imdb(1992),imdb(1993)] and test on [imdb(1995),imdb(1996)]. During training we first take the most recent year in the training set (1993) and kill everything strictly in the future (i.e. 1994, 1995, 1996, 1997) using kill_future/1, plus the query vertices in the present (i.e. 1992 and 1993) using kill_present/2. Testing is similar: we take the most recent year in the test set (1996) and kill everything strictly in the future (i.e. 1997), plus the query vertices in the present (i.e. 1995 and 1996). Thus, for example, movies of 1994 are not used for training, but during prediction their labels are (rightfully) accessible for computing feature vectors. In a hypothetical transductive setting we might keep vertices in the future alive for "evidence" signatures; however, this is not currently supported.

kill_present/2 kills vertices associated with the TargetSignature (and its dependents) in the present slices. If interpretations are not sliced, the vertices are killed anyway. A combined sketch of this setup phase is given after kill_future/1 below.

kill_future(+Examples:list) is det[private]
Kill entire slices in the strict future. If there are no sliced interpretations, kill_future/1 does nothing since max_slice will fail.
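
Putting the two predicates together, the setup phase described in kill_present/2 can be sketched as follows (db_slice/1, kill_slice/1 and kill_query_vertices/2 are hypothetical placeholders for the actual graph operations):

setup_phase(TargetSignature,Examples) :-
  list_of_slices(Examples,Slices),
  preceq_max_list(Slices,Max),          % most recent slice among Examples
  forall(( db_slice(S),                 % hypothetical: enumerates all slices
           slice_preceq(Max,S),
           S \== Max ),
         kill_slice(S)),                % strict future: kill the entire slice
  forall(member(Ex,Examples),           % present: kill only the query vertices
         kill_query_vertices(TargetSignature,Ex)).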
list_of_slices(+SlicedInterpretations, ?Slices) is det[private]
Slices is unified with the list of slices found in SlicedInterpretations.
preceq_max_list(+List, ?M) is det[private]
M is unified with the max element in List according to the total order defined by slice_preceq/2.
preceq_min_list(+List, ?M) is det[private]
M is unified with the min element in List according to the total order defined by slice_preceq/2.
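
A minimal sketch of preceq_max_list/2 consistent with this description, assuming slice_preceq(A,B) succeeds when A precedes or equals B (preceq_min_list/2 is symmetric; the actual implementation may differ):

preceq_max_list([X],X) :- !.
preceq_max_list([X|Xs],M) :-
  preceq_max_list(Xs,M0),
  ( slice_preceq(M0,X) -> M = X ; M = M0 ).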
predict(+S:atom, +Examples:list_of_atoms, +Model:atom, +Feature_Generator:atom) is det[private]
Test Model on a data set of interpretations, using Feature_Generator as the feature generator. Examples is a list of interpretation identifiers. S is a signature. Cases are generated as in train/4. The predicate asserts induced facts as follows:
induced(InterpretationId,db:interpretation(InterpretationId,Fact)).
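
For instance, after predicting a binary link such as advised_by, the induced facts could be enumerated with a query like the following (the interpretation identifier and fact are purely illustrative):

?- induced(ai, db:interpretation(ai,Fact)).
Fact = advised_by(person20,person240).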
get_task(+TargetSignature:atom, -TaskIndex:integer, -TaskName:atom, -Values:list) is nondet[private]
Given TargetSignature, retrieve the i-th task (starting from 0), unifying TaskIndex with i, TaskName with its name, and Values with the list of target values found in the data set for this task. TaskName is either of the form N#P, where N is the property name and P its position in the argument list, or the atom 'callme', meaning that the task is to learn a relationship.
cases_loop(+TS, +Examples, +Model, +FG, +TName, +TType, +TOrP) is det[private]
Core procedure for training (if TOrP=train) or testing (if TOrP=predict). TS is the target signature. Examples is a list of interpretations (from which cases are built). Model is a kLog model. FG is a kLog feature generator. TName and TType are the task name and type as returned by get_task/4. The loop generates all cases for each interpretation (using tuple_of_identifiers/3) and creates (if necessary) all required feature vectors and output labels. In prediction mode, cases are predicted immediately. In training mode, the C++-level predicates train_model/2 and test_dataset/3 are called at the end of the loop. In both modes, results are accumulated in both the local and the global reporters; however, the local reporter is reset when this loop starts. This is useful for obtaining the training set accuracy and test set accuracy of individual folds in k-fold cross-validation.
tuple_of_identifiers(+Ex, +S, -List) is nondet[private]
Unify List with a tuple of identifiers in Ex whose types appear in signature S. On backtracking, all possible tuples are retrieved. For tasks such as link prediction, this predicate is used to generate all pairs of candidates.
identifier(+Ex, +S, -ID) is nondet[private]
Unify ID with one of the identifiers of S in Ex. On backtracking, all data identifiers for signature S in interpretation Ex are returned. The predicate fails if S is not an entity signature.
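
Taken together, tuple_of_identifiers/3 can be pictured as one identifier/3 call per argument of the target signature. A sketch for the binary case (arg_entity/3 is a hypothetical accessor returning the entity type of the i-th argument; the actual implementation generalizes to any arity):

tuple_of_identifiers_2(Ex,S,[ID1,ID2]) :-
  arg_entity(S,1,T1),       % entity type of the first argument, e.g. person
  arg_entity(S,2,T2),       % entity type of the second argument
  identifier(Ex,T1,ID1),    % backtracks over all identifiers of that type
  identifier(Ex,T2,ID2).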
clean_internals(IntId)[private]
prolog_make_sparse_vector(Model, Feature_Generator, Ex, CaseID, ViewPoint) is det[private]
Wrapper around the C++ method for making feature vectors. Feature_Generator is the name of the feature generator object that will be used. Ex is the interpretation name, possibly sliced. CaseID identifies the case for which the feature vector is generated. ViewPoint is a list of vertex IDs (C++ integer codes) around which the feature vector is constructed. For unsliced interpretations, the feature vector is registered under a slash-separated string like ai/advised_by/person20/person240, created from the interpretation identifier (e.g. ai) followed by the target signature (e.g. advised_by) and the identifiers of the entities that define the case (e.g. person20 and person240). For sliced interpretations the slice name is also used to construct the identifiers, e.g. imdb_1997 and imdb_1997/m441332. Model is used to determine the internal format of the feature vector being generated.
make_case_id(Ex, S, IDTuple, CaseID) is det[private]
Unifies CaseID with a unique identifier for (sliced) interpretation Ex, target signature S, and tuple of identifiers IDTuple. See prolog_make_sparse_vector/5 for details on how this is formatted.
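
An illustrative call, following the format described in prolog_make_sparse_vector/5 (it is assumed here, for illustration, that the resulting identifier is a single atom):

?- make_case_id(ai, advised_by, [person20,person240], CaseID).
CaseID = 'ai/advised_by/person20/person240'.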
save_induced_facts(Signatures, Filename) is det[private]
Every 'induced' fact (asserted during testing) is saved to Filename, providing a rough implementation of iterative relabeling. The target signature is renamed by prefixing it with 'pred_'.