learn.pl -- Learning support in kLog
Formulate kLog learning jobs. Below are some examples of target
relations that kLog may be asked to learn. The currently implemented
models cover only a fraction of these.
Jobs with relational arity = 0:
- Binary classification of each interpretation.
  Example: classification of small molecules.
  signature mutagenic::extensional.
- Regression for each interpretation.
  Example: predict the binding affinity of small molecules.
- Multitask learning on individual interpretations.
  Example: predict mutagenicity level and logP for small molecules.
- Multiclass classification for each interpretation.
  Example: image categorization.
Jobs with relational arity = 1:
- Binary classification of entities in each interpretation.
  Examples: detect spam webpages as in the Web Spam Challenge
  (http://webspam.lip6.fr/wiki/pmwiki.php), predict blockbuster
  movies in IMDb.
- Multiclass classification of entities in each interpretation.
  Examples: WebKB, POS-tagging, NER, protein secondary structure
  prediction.
- Regression on entities in each interpretation.
  Example: traffic flow forecasting.
Jobs with relational arity = 2:
- Link prediction tasks.
  Examples: protein beta partners, UW-CSE, entity resolution.
- Regression on pairs of entities.
  Examples: traffic flow forecasting at different stations and
  different lead times, prediction of distance between protein
  secondary structure elements.
- Pairwise hierarchical classification.
  Example: traffic congestion level forecasting at different stations
  and different lead times.
Jobs with relational arity > 2:
- Prediction of hyperedges.
  Example: spatial role labeling
  signature target(w1::sp_ind_can, w2::trajector_can,
  Example: metal binding geometry.
- Classification of hyperedges.
  Example: metal binding geometry with ligand prediction.
The predefined dynamic predicate klog_reject(Case, TrainOrPredict)
fails by default. By defining your own version of it, it is easy to
implement selective subsampling of cases. A typical use is dealing
with a highly imbalanced data set, where you want to subsample only a
fraction of the negatives. Case is a Prolog callable goal associated
with a training or testing case. When kLog generates cases, it calls
klog_reject/2 to determine whether the case should be rejected.
TrainOrPredict should be either 'train' or 'predict', to do
subsampling either at training or at prediction time. For example,
suppose we have a binary classification task and we want to reject
90 percent of the negatives at training time. Then the following
code should be added to the kLog script.
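A minimal sketch of such a definition, assuming that a negative case
is exactly one whose associated callable goal fails in the current
interpretation (an assumption about case polarity, not stated above),
and using SWI-Prolog's random/1:

```prolog
% Sketch: reject ~90% of negative training cases.
% random/1 binds a float drawn uniformly from [0,1).
klog_reject(Case, train) :-
    \+ call(Case),     % only negatives are candidates for rejection
    random(X),
    X < 0.9.           % reject with probability 0.9
```

Positive cases and all prediction-time cases fall through to the
default (failing) clause and are therefore kept.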
-- Paolo Frasconi
- depends_transitive(?S1:atom, ?S:atom) is det[private]
- Tabled. Contains dependency analysis results to properly kill vertices in
the graphs. The predicate succeeds if S1 "depends" on S. The
mechanism actually goes a bit beyond a simple transitive closure of
the call graph: when several targets are present in the same file it
is necessary to extend the definition of dependencies. The set of
dependent signatures is defined as follows:
- All ancestors, including S itself (obvious)
- All the descendants (less obvious; however, if a target signature
S is intensional and calls some predicate Y in its definition, then
Y supposedly (although not necessarily) contains some supervision
information, which should be removed).
- All the ancestors of the descendants (maybe even less obvious,
but if now there is some other signature T which calls Y, then T
will also contain supervision information).
The descendants of the ancestors need not be removed.
Some of the ancestors of S might be legitimate predicates that have
nothing to do with supervision. Therefore we restrict ourselves to
the subset of the call graph with predicates that are either in the
user: namespace (where the 'user' could cheat by accessing
supervision information) or in the data set file (in the db:
namespace). In any case, since removal of vertices from the graph
might surprise the user, kLog issues a warning if any signatures
besides the target signature of the current training or test
procedure are killed. Finally, one can declare a signature to be safe
using safe/1. In this case it is assumed that the Prolog code inside
it does not exploit any supervision information. safe/1 should
therefore be used with care.
The table depends on current_depends_directly/2, which is asserted by
record_dependencies/0, called at the end of graphicalize:attach/1.
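As an illustration of the safety declaration (the signature name is
hypothetical, and the exact way safe/1 facts are recorded may differ
from this sketch):

```prolog
% Declare that the code of signature lives_in does not exploit any
% supervision information, so it is exempted from vertex removal.
safe(lives_in).
```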
- train(+S:atom, +Examples:list_of_atoms, +Model:atom, +Feature_Generator:atom) is det[private]
- Train Model on a data set of interpretations, using
Feature_Generator as the feature generator. Examples is a list of
interpretation identifiers. S is a signature. For each possible
combination of identifiers in S, a set of "cases" is generated where
a case is just a pseudo-iid example. If S has no properties,
positive cases are those for which a tuple exists in the
interpretation and negative cases are all the rest. If S has one
property, this property acts as a label for the case (the task can
also be regression if the property is a real number). Two or more
properties define a multitask problem (currently handled as a set of
independent tasks). Problems like small molecule classification
have a target signature of zero arity, which works fine as a
special case of relationship. Model must be able to solve the task
specified in the target signature S.
Throws an error:
- if the target signature is a kernel point (training on it would be cheating);
- if Model cannot solve the task specified by S.
To be done:
- Multitask learning taking correlations into account.
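A hypothetical training call, assuming a model and a feature
generator have already been created (all identifiers below are
illustrative, not part of any data set shipped with kLog):

```prolog
% Train on three interpretations; my_model and my_fg must name an
% existing kLog model and feature generator.
?- train(mutagenic, [mol_1, mol_2, mol_3], my_model, my_fg).
```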
- kill_present(+TargetSignature:atom, +Examples:list) is det[private]
Predicates kill_present/2 and kill_future/1 are needed to set up
training and test data. Let us call the vertices associated with the
target signature, plus all their dependents (defined by
depends_transitive/2), the set of "query" vertices. In the case of
unsliced interpretations, all query vertices must be killed. The
case of sliced interpretations is trickier and is handled as
follows.
Example using IMDb, with slice_preceq/2 defined by time (years).
Suppose data set contains imdb(1953),...,imdb(1997) and that we want
to train on [imdb(1992),imdb(1993)] and test on
[imdb(1995),imdb(1996)]. Then during training we first take the most
recent year in the training set (1993) and kill everything
strictly in the future (i.e. 1994, 1995, 1996, 1997) using
kill_future/1 plus the query vertices in the present (i.e. 1992 and
1993) using kill_present/2. Testing is similar: we take the most recent
year in the test set (1996) and kill everything strictly in the
future (i.e. 1997) plus the query vertices in the present (i.e. 1995
and 1996). Thus, for example, movies of 1994 are not used for
training but during prediction their labels are (rightfully)
accessible for computing feature vectors. In a hypothetical
transductive setting we might keep alive vertices in the future for
"evidence" signatures. However this is not currently supported.
kill_present/2 kills vertices associated with the TargetSignature
(and dependents) in the present slices. If interpretations are not
sliced, the query vertices are killed anyway.
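The IMDb scenario above, written as hypothetical goals (the target
signature name blockbuster and the model/feature-generator names are
illustrative):

```prolog
% Train on 1992-1993: years after 1993 are killed by kill_future/1,
% query vertices of 1992-1993 by kill_present/2.
?- train(blockbuster, [imdb(1992), imdb(1993)], my_model, my_fg).
% Predict on 1995-1996: only 1997 is killed, so 1994 labels remain
% visible for feature construction.
?- predict(blockbuster, [imdb(1995), imdb(1996)], my_model, my_fg).
```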
- kill_future(+Examples:list) is det[private]
Kill entire slices in the strict future (if there are no sliced
interpretations, kill_future/1 does nothing since max_slice will
fail).
- list_of_slices(+SlicedInterpretations, ?Slices) is det[private]
- Slices is unified with the list of slices found in SlicedInterpretations.
- preceq_max_list(+List, ?M) is det[private]
- M is unified with the max element in List according to the total
order defined by slice_preceq/2.
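A possible implementation sketch, assuming slice_preceq(A, B)
succeeds when A precedes or equals B in the slice order (this is a
sketch, not necessarily the actual implementation):

```prolog
% Fold over the list, keeping the running maximum under slice_preceq/2.
preceq_max_list([X|Xs], M) :-
    foldl([Y, Acc, Out]>>
              ( slice_preceq(Acc, Y) -> Out = Y ; Out = Acc ),
          Xs, X, M).
```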
- preceq_min_list(+List, ?M) is det[private]
- M is unified with the min element in List according to the total
order defined by slice_preceq/2.
- predict(+S:atom, +Examples:list_of_atoms, +Model:atom, +Feature_Generator:atom) is det[private]
- Test Model on a data set of interpretations. Use Feature_Generator as the feature
generator. Examples is a list of interpretation identifiers. S is a
signature. Cases are generated as in train/4. The predicate asserts
induced facts; see save_induced_facts/2 for how they are saved.
- get_task(+TargetSignature:atom, -TaskIndex:integer, -TaskName:atom, -Values:list) is nondet[private]
- Given TargetSignature, retrieve the i-th task (starting from 0),
unify TaskIndex with i, TaskName with its name and Values with the
list of target values found in the data set for this task. TaskName
is either in the form N#P where N is the property name and P the
position in the argument list, or the atom 'callme' meaning that the
task is to learn a relationship.
- cases_loop(+TS, +Examples, +Model, +FG, +TName, +TType, +TOrP) is det[private]
- Core procedure for training (if TOrP=train) or testing (if
TOrP=predict). TS is the target signature. Examples is a list of
interpretations (from which cases are built). Model is a kLog
model. FG is a kLog feature generator. TName and TType are the task
name and type as returned by get_task/4. The loop generates all
cases for each interpretation (using tuple_of_identifiers/3) and
creates (if necessary) all required feature vectors and output
labels. In prediction mode, cases are immediately predicted. In
training mode, C++-level predicates train_model/2 and test_dataset/3
are called at the end of the loop. In both cases, results are
accumulated in both the local and the global reporters. However, the
local reporter is reset when this loop starts. This is useful for
obtaining training set accuracy and test set accuracy of individual
folds in k-fold-CV.
- tuple_of_identifiers(+Ex, +S, -List) is nondet[private]
Unify List with a tuple of identifiers in Ex whose type appears in
signature S. On backtracking will retrieve all possible tuples. For
tasks such as link prediction, this predicate is used to generate
all pairs of candidates.
- identifier(+Ex, +S, -ID) is nondet[private]
- Unify ID with one of the identifiers of S in Ex. On backtracking
will return all data identifiers for signature S in interpretation
Ex. The predicate fails if S is not an entity.
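For instance, with a UW-CSE-style interpretation (names borrowed from
the examples elsewhere in this file; purely illustrative):

```prolog
% Enumerate all entities of signature person in interpretation ai.
?- identifier(ai, person, ID).
% Enumerate candidate tuples for the link-prediction target advised_by.
?- tuple_of_identifiers(ai, advised_by, Pair).
```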
- prolog_make_sparse_vector(Model, Feature_Generator, Ex, CaseID, ViewPoint) is det[private]
- Wrapper around C++ method for making feature vectors.
Feature_Generator is the name of the feature generator object that
will be used. Ex is the interpretation name, possibly sliced. CaseID
identifies the case for which the feature vector is
generated. ViewPoint is a list of vertex IDs (C++ integer code)
around which the feature vector is constructed. For unsliced
interpretations, the feature vector is registered under a
slash-separated string like ai/advised_by/person20/person240,
created from the interpretation identifier (e.g. ai) followed by the
target signature (e.g. advised_by) and the identifiers of the
entities that define the case (e.g. person20 and person240). For
sliced interpretations the slice name is also used to construct the
identifiers, e.g. imdb_1997 and imdb_1997/m441332. Model is used to
determine the internal format of the feature vector being generated.
- make_case_id(Ex, S, IDTuple, CaseID) is det[private]
- Unifies CaseID with a unique identifier for (sliced) interpretation
Ex, target signature S, and tuple of identifiers IDTuple. See
prolog_make_sparse_vector/5 for details on how this is formatted.
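A sketch of the construction using SWI-Prolog's atomic_list_concat/3
(the actual implementation may differ, e.g. in how sliced
interpretation names are flattened):

```prolog
make_case_id(Ex, S, IDTuple, CaseID) :-
    atomic_list_concat([Ex, S | IDTuple], '/', CaseID).
% e.g. make_case_id(ai, advised_by, [person20, person240], C)
% gives C = 'ai/advised_by/person20/person240'.
```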
- save_induced_facts(Signatures, Filename) is det[private]
- Every 'induced' fact (asserted during test) is saved to Filename for
a rough implementation of iterative relabeling. The target signature
is renamed by prefixing it with 'pred_'.