Loading Data
The c_clause.Loader class loads datasets (knowledge graphs) and rulesets. The first step when using PyClause is to load at least one dataset into the Loader, which creates an internal index that maps entity and relation string tokens to integer idx’s.
The loader will later be passed to the respective handler classes which can work on the specified data and rulesets.
Input Datasets
The loader is initialized with a dict from the options key “loader”, and the Loader.load_data(..) function takes up to three arguments.
from clause import Options
from c_clause import Loader
opts = Options()
loader = Loader(options=opts.get("loader"))
# any of the three load_data calls below will work
loader2 = Loader(options=opts.get("loader"))
loader3 = Loader(options=opts.get("loader"))
# data is the base KG on which rules are applied for all handlers;
# it is often called, and used as, 'train'
loader.load_data(data="path/to/dataKG")
loader2.load_data(data="path/to/dataKG", filter="path/to/filter")
loader3.load_data(data="path/to/dataKG", filter="path/to/filter", target="path/to/target")
The dataset data is always required. It is the base knowledge graph on which rules are applied for all the handlers.
Filter can be used as a set of triples that automatically filters out candidates calculated by c_clause.RankingHandler and c_clause.QAHandler. Target is only needed when creating ranking files with the c_clause.RankingHandler (often termed the test set).
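The role of the filter set can be pictured with a small pure-Python sketch. This is a conceptual illustration only, not PyClause's implementation: the function filter_candidates and the candidate list are hypothetical, and the sketch assumes that a candidate is dropped when completing the query with it yields a triple contained in the filter set.

```python
# Conceptual sketch only; the names below are hypothetical and not part of PyClause.
def filter_candidates(head, relation, candidates, known_triples):
    """Keep only tail candidates whose completed triple is not in known_triples."""
    return [
        tail
        for tail in candidates
        if (head, relation, tail) not in known_triples
    ]

# Assumed filter set: triples that should not be proposed again.
filter_set = {("lisa", "knows", "max")}

# Hypothetical candidates for the query (lisa, knows, ?).
candidates = ["max", "john", "anna"]

remaining = filter_candidates("lisa", "knows", candidates, filter_set)
print(remaining)  # ['john', 'anna']
```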
Note that you can only load data once. If you want to use multiple dataset specifications you can use multiple loaders.
Data Types
There are three ways to specify the input datasets: from a path (as shown above), from Python as strings, or from Python as indices.
1) From path
The data, filter and target arguments must be paths to files containing tab-separated triples of string tokens, one triple per line, e.g., train.txt:
lisa knows max
max likes jon
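A file in this format can be produced with a few lines of standard-library Python. The following is a minimal sketch: the helper write_triples is hypothetical, and the file is written to a temporary directory only so that the example is self-contained.

```python
import os
import tempfile

# Hypothetical helper, not part of PyClause.
def write_triples(path, triples):
    """Write one triple per line as head<TAB>relation<TAB>tail."""
    with open(path, "w", encoding="utf-8") as f:
        for head, relation, tail in triples:
            f.write(f"{head}\t{relation}\t{tail}\n")

# The two example triples from train.txt above.
triples = [("lisa", "knows", "max"), ("max", "likes", "jon")]

train_path = os.path.join(tempfile.mkdtemp(), "train.txt")
write_triples(train_path, triples)
```

The resulting path can then be passed as the data argument of Loader.load_data(..), as in the first example on this page.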
2) From Python as strings
Data, filter and target arguments have to be Python lists of string triples.
dataset = [["lisa", "knows", "max"], ["max", "likes", "john"]]
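If the triples already live in a tab-separated file but some preprocessing in Python is wanted first, the file can be parsed into this list format. The sketch below uses a hypothetical helper read_triples and assumes exactly three tab-separated fields per non-empty line; the demo file is created on the fly.

```python
import os
import tempfile

# Hypothetical helper, not part of PyClause.
def read_triples(path):
    """Parse a tab-separated triple file into a list of [head, relation, tail] lists."""
    triples = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.rstrip("\r\n")
            if not line.strip():
                continue  # skip blank lines
            parts = line.split("\t")
            if len(parts) != 3:
                raise ValueError(
                    f"{path}:{line_no}: expected 3 tab-separated fields, got {len(parts)}"
                )
            triples.append(parts)
    return triples

# Create a small demo file so the example is self-contained.
demo_path = os.path.join(tempfile.mkdtemp(), "train.txt")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("lisa\tknows\tmax\nmax\tlikes\tjohn\n")

dataset = read_triples(demo_path)
print(dataset)  # [['lisa', 'knows', 'max'], ['max', 'likes', 'john']]
```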
Note
Using your own entity and relation index. For the previous two methods the loader will create an internal global index that maps entity and relation strings to integer idx’s.
If you already have such a mapping and later want to specify handler inputs as idx’s according to it, you can force the loader to use your index. Before loading data, simply call loader.set_entity_index(index) and loader.set_relation_index(index).
The argument index is a list of strings that maps idx’s to strings by list[idx]=string. The index does not need to be complete; data containing new entities and relations can still be loaded.
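The list[idx]=string convention can be illustrated without PyClause. The sketch below is an assumption-laden illustration: the names entity_to_idx, relation_to_idx and encode are hypothetical helpers that invert such an index list and translate string triples into idx triples.

```python
# Index lists in the list[idx] = string convention described above.
entity_index = ["lisa", "max", "john"]
relation_index = ["knows", "likes"]

# Hypothetical reverse lookups: string -> idx.
entity_to_idx = {token: idx for idx, token in enumerate(entity_index)}
relation_to_idx = {token: idx for idx, token in enumerate(relation_index)}

def encode(triples):
    """Translate [head, relation, tail] string triples into idx triples."""
    return [
        [entity_to_idx[head], relation_to_idx[relation], entity_to_idx[tail]]
        for head, relation, tail in triples
    ]

string_triples = [["lisa", "knows", "max"], ["max", "likes", "john"]]
idx_triples = encode(string_triples)
print(idx_triples)  # [[0, 0, 1], [1, 1, 2]]
```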
Note
Retrieving the index from the loader. When data is loaded as strings, the loader can return the entity and relation index it constructed, in case you want to continue working with idx’s.
Simply call loader.get_entity_index() and loader.get_relation_index() to obtain a mapping from strings to idx’s.
3) From Python as indices
It is also possible to use lists or numpy arrays containing indices. PyClause will, however, always need an internal index that maps indices to token strings. Even if the user works with indices, PyClause can therefore always output results in a human-readable string representation.
from c_clause import Loader
from clause import Options
import numpy as np
options = Options()
loader = Loader(options.get("loader"))
# maps entities
# 0: "lisa"
# 1: "max"
# 2: "john"
entity_index = ["lisa", "max", "john"]
# maps relations
# 0: knows
# 1: likes
relation_index = ["knows", "likes"]
# set entity/relation index; should only be done once
loader.set_entity_index(entity_index)
loader.set_relation_index(relation_index)
# (lisa knows max)
# (max likes john)
dataset = np.array(
    [
        [0, 0, 1],
        [1, 1, 2]
    ]
)
# filter set of reflexive triples (x knows x): "know yourself"
filter_set = np.array(
    [
        [0, 0, 0],
        [1, 0, 1],
        [2, 0, 2],
    ]
)
loader.load_data(data=dataset, filter=filter_set)
In this case, you can only load data containing idx’s that already exist in the entity and relation index. E.g., loader.load_data(data=[[0,3,1]]) would throw an error in the example above, because the relation idx 3 is not in the relation index.
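To catch such problems before calling load_data, the idx triples can be checked against the sizes of the two index lists. This is a minimal sketch with a hypothetical helper find_out_of_range; it assumes triples are ordered (head, relation, tail) and is not a PyClause function.

```python
# Hypothetical pre-check, not part of PyClause.
def find_out_of_range(triples, num_entities, num_relations):
    """Return the row numbers of triples that use an idx outside the given index sizes."""
    bad_rows = []
    for row, (head, relation, tail) in enumerate(triples):
        entities_ok = 0 <= head < num_entities and 0 <= tail < num_entities
        relation_ok = 0 <= relation < num_relations
        if not (entities_ok and relation_ok):
            bad_rows.append(row)
    return bad_rows

entity_index = ["lisa", "max", "john"]
relation_index = ["knows", "likes"]

ok = find_out_of_range([[0, 0, 1], [1, 1, 2]], len(entity_index), len(relation_index))
bad = find_out_of_range([[0, 3, 1]], len(entity_index), len(relation_index))
print(ok)   # []
print(bad)  # [0]
```

The same loop also works when the triples are rows of a two-dimensional numpy array, since each row unpacks into three values.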