Datasets

A dataset in Cradl is a collection of Documents, preferably from a single source. Once you have uploaded documents to a dataset, the dataset can be included in Data bundles for training a Model.

Creating a dataset

Datasets are created independently of the documents they contain. You can create a dataset directly in the Cradl datasets UI, or programmatically.

las datasets create --name "Invoices 2020" --description "From accounting system"
{
  "datasetId": "<datasetId>",
  "description": "From accounting system",
  "name": "Invoices 2020",
  "numberOfDocuments": 0,
  "storageLocation": "EU",
  "containsPersonallyIdentifiableInformation": true,
  "version": 0
}

The datasetId is used to include datasets in Data bundles and to add documents to datasets. The version field identifies changes to a dataset, that is, when contained documents are added, removed, or updated.
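
As a rough illustration of how a script might read this response, the Python sketch below parses the JSON and pulls out the fields discussed above. The `example-dataset-id` value is a made-up stand-in for `<datasetId>`, and the helper names `summarize_dataset` and `has_changed` are hypothetical; they are not part of any Cradl SDK.

```python
import json

# Sample response modeled on the `las datasets create` output above.
# "example-dataset-id" is a made-up placeholder, not a real dataset ID.
RESPONSE_TEXT = """
{
  "datasetId": "example-dataset-id",
  "description": "From accounting system",
  "name": "Invoices 2020",
  "numberOfDocuments": 0,
  "storageLocation": "EU",
  "containsPersonallyIdentifiableInformation": true,
  "version": 0
}
"""


def summarize_dataset(raw: str) -> dict:
    """Parse a dataset response and keep the fields needed later."""
    data = json.loads(raw)
    return {
        "dataset_id": data["datasetId"],
        "version": data["version"],
        "document_count": data["numberOfDocuments"],
    }


def has_changed(last_seen_version: int, summary: dict) -> bool:
    """Report whether the dataset version differs from the one last seen."""
    return summary["version"] != last_seen_version


summary = summarize_dataset(RESPONSE_TEXT)
print(summary["dataset_id"])
print(has_changed(0, summary))
```

Storing the last seen version alongside the dataset ID is one simple way to notice that a dataset's documents have changed since a model was last trained on it.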

caution

Give your datasets clear names and descriptions. This makes it easier to keep track of which data your model was trained on.

Adding documents to a dataset

Documents can be assigned to a dataset either at creation time or in an update.

las documents create path/to/my/document.pdf --dataset-id <datasetId>
las documents update <documentId> --dataset-id <datasetId>

caution

A document cannot be added to more than one dataset.
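
When many files need to go into the same dataset, the `las documents create` command shown above can be generated once per file. The Python sketch below only builds the argument lists and does not run them; the file paths and the `example-dataset-id` value are hypothetical placeholders, and `build_create_commands` is an illustrative helper, not part of the CLI or SDKs.

```python
def build_create_commands(paths, dataset_id):
    """Return one `las documents create` argument list per document path."""
    return [
        ["las", "documents", "create", str(path), "--dataset-id", dataset_id]
        for path in paths
    ]


commands = build_create_commands(
    ["invoices/a.pdf", "invoices/b.pdf"],  # hypothetical file paths
    "example-dataset-id",  # placeholder for <datasetId>
)
for command in commands:
    print(" ".join(command))
```

To actually execute the commands, each list could be passed to `subprocess.run(command, check=True)`, assuming the `las` CLI is installed and configured on the machine.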

Deleting a dataset

A dataset cannot be deleted until all of its documents have been deleted. The CLI and SDKs can do both in a single command. For instructions on deleting all documents from a dataset separately, see the Documents page.

las datasets delete <datasetId> --delete-documents
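
The deletion rule can be pictured with a small in-memory model. This Python sketch is a toy stand-in written for illustration, not Cradl's implementation or SDK: the class `ToyDatasetStore`, the exception `DatasetNotEmptyError`, and the IDs are all hypothetical. The `delete_documents` argument mirrors the `--delete-documents` flag above.

```python
class DatasetNotEmptyError(Exception):
    """Raised when deleting a dataset that still contains documents."""


class ToyDatasetStore:
    """In-memory stand-in for illustration only; not Cradl's implementation."""

    def __init__(self):
        self.datasets = {}  # dataset ID -> set of document IDs

    def create_dataset(self, dataset_id):
        self.datasets[dataset_id] = set()

    def add_document(self, dataset_id, document_id):
        self.datasets[dataset_id].add(document_id)

    def delete_dataset(self, dataset_id, delete_documents=False):
        documents = self.datasets[dataset_id]
        if documents and not delete_documents:
            raise DatasetNotEmptyError(
                f"{dataset_id} still has {len(documents)} document(s)"
            )
        # Documents are removed first, then the dataset itself.
        documents.clear()
        del self.datasets[dataset_id]


store = ToyDatasetStore()
store.create_dataset("example-dataset-id")
store.add_document("example-dataset-id", "example-document-id")

# A plain delete is refused while the dataset still holds a document.
try:
    store.delete_dataset("example-dataset-id")
    plain_delete_succeeded = True
except DatasetNotEmptyError:
    plain_delete_succeeded = False

# Opting in to document deletion removes the documents and the dataset.
store.delete_dataset("example-dataset-id", delete_documents=True)
```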