Loading Data
============


The ``c_clause.Loader`` class loads datasets (knowledge graphs) and rulesets. The first step when using PyClause is to load at least one dataset into the loader, which creates an internal index that maps entity and relation string tokens to integer idx's.
The loader is later passed to the respective handler classes, which work on the loaded data and rulesets.


Input Datasets
~~~~~~~~~~~~~~~

The loader is initialized with the options dict stored under the key ``"loader"``. The ``Loader.load_data(..)`` function takes up to three arguments: ``data``, ``filter``, and ``target``.

.. code-block:: python

   from clause import Options
   from c_clause import Loader

   opts = Options()
   loader = Loader(options=opts.get("loader"))

Any of the following three calls works:

.. code-block:: python

   loader2 = Loader(options=opts.get("loader"))
   loader3 = Loader(options=opts.get("loader"))

   # data is the base KG on which rules are applied for all handlers
   # it is often called, and used as, 'train'
   loader.load_data(data="path/to/dataKG")

   loader2.load_data(data="path/to/dataKG", filter="path/to/filter")
   loader3.load_data(data="path/to/dataKG", filter="path/to/filter", target="path/to/target")

- The dataset **data** is always required. It is the base knowledge graph on which rules are applied for all the handlers.
- **Filter** is an optional set of triples. Candidates calculated by ``c_clause.RankingHandler`` and ``c_clause.QAHandler`` that appear in it are filtered out automatically.
- **Target** is only needed when creating ranking files with the ``c_clause.RankingHandler`` (often termed the test set).
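
To make the role of **filter** more concrete, here is a conceptual sketch in plain Python. It is not PyClause's implementation: it assumes a tail query ``(lisa, knows, ?)`` with a made-up candidate list and drops every candidate whose completed triple occurs in the filter set. The helper name ``filter_candidates`` and all example values are hypothetical.

.. code-block:: python

   def filter_candidates(head, relation, candidates, filter_triples):
       """Drop tail candidates whose completed triple occurs in filter_triples."""
       known = {tuple(triple) for triple in filter_triples}
       return [c for c in candidates if (head, relation, c) not in known]

   # hypothetical filter set and candidate answers for the query (lisa, knows, ?)
   filter_set = [["lisa", "knows", "max"]]
   candidates = ["max", "john"]

   remaining = filter_candidates("lisa", "knows", candidates, filter_set)
   print(remaining)

With the handlers, this step happens automatically once a filter set has been passed to ``load_data``.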


Note that each loader can load data only once. To work with multiple dataset specifications, use multiple loaders.

Data Types
~~~~~~~~~~~~~~~

There are three ways to specify the input datasets: from a path (as shown above), from Python as strings, or from Python as indices.

**1) From path**

The **data**, **filter** and **target** arguments have to be paths to files containing **tab-separated** triples of string tokens, e.g., **train.txt**:

.. code-block:: text

   lisa	knows	max
   max	likes	jon
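
If the triples are available in Python and a file is needed, a file in the format above could be produced as in the following minimal sketch. It uses only the standard library; the helper name ``write_triples`` and the temporary directory are illustrative assumptions, not part of PyClause.

.. code-block:: python

   import os
   import tempfile

   def write_triples(path, triples):
       """Write one tab-separated (head, relation, tail) triple per line."""
       with open(path, "w", encoding="utf-8") as f:
           for head, relation, tail in triples:
               f.write("\t".join((head, relation, tail)) + "\n")

   triples = [("lisa", "knows", "max"), ("max", "likes", "jon")]

   # a temporary location is used here; any writable path works
   path = os.path.join(tempfile.mkdtemp(), "train.txt")
   write_triples(path, triples)

The resulting ``path`` can then be passed as the ``data`` argument of ``loader.load_data``.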


**2) From Python as strings**

The **data**, **filter** and **target** arguments have to be Python lists of string triples.

.. code-block:: python

   dataset = [["lisa", "knows", "max"], ["max", "likes", "john"]]

.. note::

   **Using your own entity and relation index.** For the previous two methods, the loader creates an internal global index that maps entity and relation strings to integer idx's.
   If you already have such a mapping and later want to specify handler inputs as idx's according to it, you can force the loader to use your index: call ``loader.set_entity_index(index)`` and ``loader.set_relation_index(index)`` **before** loading data.
   The argument ``index`` is a list of strings that maps idx's to strings by **list[idx]=string**. The index does not need to be complete; data with new entities and relations can still be loaded.

.. note::

   **Retrieving the index from the loader.** When data is loaded from strings, the loader can return the constructed entity and relation index, which is useful if you want to continue working with idx's.
   Call ``loader.get_entity_index()`` and ``loader.get_relation_index()`` to obtain a mapping from strings to idx's.
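
The **list[idx]=string** convention can be illustrated with a short sketch that builds index lists from string triples. The helper ``build_index`` is hypothetical, and the first-appearance order used here is an assumption for illustration only. It does not claim to reproduce the order in which the loader assigns idx's internally.

.. code-block:: python

   def build_index(tokens):
       """Return a list of unique tokens, so that index[idx] == token."""
       index = []
       seen = set()
       for token in tokens:
           if token not in seen:
               seen.add(token)
               index.append(token)
       return index

   dataset = [["lisa", "knows", "max"], ["max", "likes", "john"]]

   # entities come from the head and tail positions, relations from the middle
   entity_index = build_index(tok for h, _, t in dataset for tok in (h, t))
   relation_index = build_index(r for _, r, _ in dataset)

   print(entity_index)
   print(relation_index)

Lists of this shape are what ``loader.set_entity_index`` and ``loader.set_relation_index`` expect as their argument.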

**3) From Python as indices**

It is also possible to use lists or NumPy arrays containing indices. However, PyClause always needs an internal index that maps indices to token strings, so that results can be output in human-readable string representations even when the user works with indices.

.. code-block:: python

   from c_clause import Loader
   from clause import Options
   import numpy as np

   options = Options()
   loader = Loader(options.get("loader"))

   # maps entities
   # 0: "lisa"
   # 1: "max"
   # 2: "john"
   entity_index = ["lisa", "max", "john"]
   # maps relations
   # 0: "knows"
   # 1: "likes"
   relation_index = ["knows", "likes"]

   # set entity/relation index; should only be done once
   loader.set_entity_index(entity_index)
   loader.set_relation_index(relation_index)

   # (lisa knows max)
   # (max likes john)
   dataset = np.array(
       [
           [0, 0, 1],
           [1, 1, 2]
       ]
   )
   # filter set: (x knows x) for every entity x
   filter_set = np.array(
       [
           [0, 0, 0],
           [1, 0, 1],
           [2, 0, 2],
       ]
   )
   loader.load_data(data=dataset, filter=filter_set)

In this case, you can only load data whose idx's already exist in the entity and relation index. For example, ``loader.load_data(data=[[0,3,1]])`` would raise an error in the example above, because the relation index has no idx 3.
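
To avoid such errors, idx triples can be derived from string triples and checked against the index sizes before loading. The following is a minimal sketch in plain Python that assumes the (head, relation, tail) column order used in the example above. The helpers ``encode`` and ``out_of_range`` are hypothetical and not part of PyClause.

.. code-block:: python

   entity_index = ["lisa", "max", "john"]
   relation_index = ["knows", "likes"]

   def encode(triples, entity_index, relation_index):
       """Translate string triples to idx triples using list[idx]=string indices."""
       ent = {token: idx for idx, token in enumerate(entity_index)}
       rel = {token: idx for idx, token in enumerate(relation_index)}
       return [[ent[h], rel[r], ent[t]] for h, r, t in triples]

   def out_of_range(triples, num_entities, num_relations):
       """Return the idx triples that refer to idx's missing from the indices."""
       return [
           [h, r, t]
           for h, r, t in triples
           if not (0 <= h < num_entities
                   and 0 <= r < num_relations
                   and 0 <= t < num_entities)
       ]

   encoded = encode(
       [["lisa", "knows", "max"], ["max", "likes", "john"]],
       entity_index,
       relation_index,
   )
   invalid = out_of_range([[0, 3, 1]], len(entity_index), len(relation_index))

   print(encoded)
   print(invalid)

An empty result from ``out_of_range`` indicates that every idx in the data is covered by the entity and relation index.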