ORKG Harvesters

Warning

This is an experimental feature. Bugs may occur, and the API may change completely in future releases.

The Python client offers a variety of functions on top of ORKG content, one of which is the harvesters. Harvesters hide complex logic and let users integrate ORKG content ingestion into their own systems and workflows.

At the moment, the client supports the following harvesters:

[x] DOI Harvester
[x] Directory Harvester
[ ] Other harvesters are coming soon!

We start by defining our entry point for the harvesters.

from orkg import ORKG, Hosts # import required classes from package

orkg = ORKG(host=Hosts.SANDBOX, creds=('email-address', 'password')) # create the connector to the ORKG
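
Hard-coded credentials are easy to leak when a script is shared. As a minimal sketch, the helper below reads the email and password from the environment and returns them as a tuple in the shape the creds argument takes. The variable names ORKG_EMAIL and ORKG_PASSWORD and the helper load_credentials are assumptions made for this example; the orkg package does not define them.

```python
import os


def load_credentials(env=None):
    """Return an (email, password) tuple read from environment variables.

    ORKG_EMAIL and ORKG_PASSWORD are hypothetical names chosen for this
    sketch; they are not defined by the orkg package.
    """
    env = os.environ if env is None else env
    missing = [name for name in ("ORKG_EMAIL", "ORKG_PASSWORD") if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return env["ORKG_EMAIL"], env["ORKG_PASSWORD"]


# Usage (commented out because it needs network access and a real account):
# from orkg import ORKG, Hosts
# orkg = ORKG(host=Hosts.SANDBOX, creds=load_credentials())
```

Passing a plain dict as env makes the helper easy to exercise without touching the real environment.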

We can access the harvesters manager directly to do the following:

DOI Harvesting

You need to know two things: the DOI where the content is located, and the ORKG research field to add the paper under.

# Passing down the EXACT label of the ORKG's research field
orkg.harvesters.doi_harvest(doi="https://doi.org/10.1002/eap.1695", orkg_rf="Computer Sciences")
>>> {'id': 'R507726', 'label': 'Some label here', 'classes': ['Paper'], 'shared': 0, 'featured': False, 'unlisted': False, 'verified': False, 'extraction_method': 'UNKNOWN', '_class': 'resource', 'created_at': '2023-05-31T11:04:33.499726+02:00', 'created_by': '18a48c35-0a9d-4d35-b276-fe293f7d79c7', 'observatory_id': '00000000-0000-0000-0000-000000000000', 'organization_id': '00000000-0000-0000-0000-000000000000', 'formatted_label': None}

# If you know the resource ID of the ORKG's research field you can pass it down as well
from orkg import OID # import the ORKG ID class
orkg.harvesters.doi_harvest(doi="https://doi.org/10.1002/eap.1695", orkg_rf=OID("R11"))
>>> {'id': 'R507726', 'label': 'Some label here', 'classes': ['Paper'], 'shared': 0, 'featured': False, 'unlisted': False, 'verified': False, 'extraction_method': 'UNKNOWN', '_class': 'resource', 'created_at': '2023-05-31T11:04:33.499726+02:00', 'created_by': '18a48c35-0a9d-4d35-b276-fe293f7d79c7', 'observatory_id': '00000000-0000-0000-0000-000000000000', 'organization_id': '00000000-0000-0000-0000-000000000000', 'formatted_label': None}

# You can also use the `slow_mode` parameter to get around timeouts caused by massive paper payloads
orkg.harvesters.doi_harvest(doi="https://doi.org/10.1002/eap.1695", orkg_rf="Science", slow_mode=True)
>>> {'id': 'R507726', 'label': 'Some label here', 'classes': ['Paper'], 'shared': 0, 'featured': False, 'unlisted': False, 'verified': False, 'extraction_method': 'UNKNOWN', '_class': 'resource', 'created_at': '2023-08-31T11:04:33.499726+02:00', 'created_by': '18a4825-0a9d-4d35-b276-fe293f7d79c7', 'observatory_id': '00000000-0000-0000-0000-000000000000', 'organization_id': '00000000-0000-0000-0000-000000000000', 'formatted_label': None}
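
The outputs above contain more fields than most callers need. The sketch below picks out a few of them, assuming the harvester returns a plain dict shaped like those outputs (if the client wraps the result in a response object, the dict would need to be extracted first). The helper name summarize_resource is hypothetical.

```python
def summarize_resource(resource):
    """Pick the commonly needed fields out of a harvested-resource dict."""
    return {
        "id": resource["id"],
        "label": resource["label"],
        "is_paper": "Paper" in resource.get("classes", []),
        "extraction_method": resource.get("extraction_method"),
    }


# A trimmed copy of the output shown above, used here as sample input.
sample = {
    "id": "R507726",
    "label": "Some label here",
    "classes": ["Paper"],
    "extraction_method": "UNKNOWN",
    "_class": "resource",
}
summary = summarize_resource(sample)
print(summary["id"], summary["is_paper"])  # prints: R507726 True
```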

Note: Matching on an exact label can fail if no research field carries that label. To avoid this, look up the research field in the ORKG and pass its ID instead.
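
When harvesting several DOIs in one run, a single failure (for example, an unmatched research field label) should not abort the rest. The sketch below wraps any harvest callable and records failures per DOI. The helper harvest_many is hypothetical, and catching the broad Exception is an assumption, since the exception types raised by the client are not described here.

```python
def harvest_many(harvest, dois, research_field):
    """Call `harvest` once per DOI and split the results into successes and failures.

    `harvest` is any callable accepting `doi` and `orkg_rf` keyword arguments,
    for example `orkg.harvesters.doi_harvest`.
    """
    harvested, failed = {}, {}
    for doi in dois:
        try:
            harvested[doi] = harvest(doi=doi, orkg_rf=research_field)
        except Exception as exc:  # assumption: the exact exception type is not documented here
            failed[doi] = str(exc)
    return harvested, failed


# Usage (commented out because it needs a connected client):
# from orkg import OID
# ok, errors = harvest_many(
#     orkg.harvesters.doi_harvest,
#     ["https://doi.org/10.1002/eap.1695"],
#     OID("R11"),  # passing the ID sidesteps label matching
# )
```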

Directory Harvesting

Instead of running orkg.harvesters.doi_harvest with the directory parameter, you can use directory harvesting to ingest all the contributions in a directory. This is particularly useful during development and modeling.

orkg.harvesters.directory_harvest(
    directory="/path/to/data/",
    research_field="Computer Sciences",
    title="Test paper",
    doi="https://doi.org/10.7784/x06c-3d98",
    authors=["John Doe", OID("R12221")],
    publication_year=2021,
    publication_month=1,
    url="https://www.example.com",
    extraction_method="MANUAL",
    published_in="Journal of Example",
    slow_mode=False
)
>>> {'id': 'R50726', 'label': 'Test paper', 'classes': ['Paper'], 'shared': 0, 'featured': False, 'unlisted': False, 'verified': False, 'extraction_method': 'MANUAL', '_class': 'resource', 'created_at': '2023-11-24T11:04:33.499726+02:00', 'created_by': '18a48c35-0a9d-4d35-b276-fe293f7d29c7', 'observatory_id': '00000000-0000-0000-0000-000000000000', 'organization_id': '00000000-0000-0000-0000-000000000000', 'formatted_label': None}

Note that directory_harvest has more parameters than shown here and accepts **kwargs, which are passed down to the backend API directly.
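
Before pointing the harvester at a directory, it can help to check what is in it. The sketch below lists the regular files directly inside a directory. It is only a pre-flight check: which files directory_harvest actually reads, and in which format, is not stated here, and the helper name list_harvest_candidates is hypothetical.

```python
from pathlib import Path


def list_harvest_candidates(directory):
    """Return the regular files directly inside `directory`, sorted by path."""
    root = Path(directory)
    if not root.is_dir():
        raise NotADirectoryError(f"{root} is not a directory")
    # Subdirectories are skipped; this sketch does not recurse.
    return sorted(p for p in root.iterdir() if p.is_file())


# Usage:
# for path in list_harvest_candidates("/path/to/data/"):
#     print(path.name)
```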