Datasets

A dataset in Cradl is a collection of Documents, preferably from a single source. Once documents have been uploaded to a dataset, they can be grouped into Data bundles for training a Model.

Creating a dataset

Datasets are created independently of the documents they contain. You can create a dataset directly in the Cradl datasets UI, or programmatically.

CLI

las datasets create --name "Invoices 2020" --description "From accounting system"

cURL

curl -X POST 'https://api.lucidtech.ai/v1/datasets' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer eyJra...' \
--data-raw '{
    "name": "Invoices 2020",
    "description": "From accounting system"
}'

Python

dataset = client.create_dataset(name='Invoices 2020', description='From accounting system')

Example output:

{
  "datasetId": "<datasetId>",
  "description": "From accounting system",
  "name": "Invoices 2020",
  "numberOfDocuments": 0,
  "storageLocation": "EU",
  "containsPersonallyIdentifiableInformation": true,
  "version": 0
}

The datasetId is used to include a dataset in Data bundles and to add documents to the dataset. The version field identifies changes to a dataset, that is, when contained documents are added, removed, or updated.
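
One way to use the version field is to check whether a dataset has changed since it was last inspected, for instance before deciding whether to rebuild a Data bundle. The sketch below works on plain dictionaries shaped like the JSON above. The helper name `dataset_changed` is hypothetical, and because this page does not say how version values progress, the check only tests for inequality.

```python
def dataset_changed(cached: dict, latest: dict) -> bool:
    """Return True if the dataset's version differs from the cached copy."""
    if cached["datasetId"] != latest["datasetId"]:
        raise ValueError("cannot compare two different datasets")
    # Assumption: any change to the contained documents yields a new version value.
    return cached["version"] != latest["version"]


# Illustrative values only; a real datasetId comes from the create call.
cached = {"datasetId": "<datasetId>", "name": "Invoices 2020", "version": 0}
latest = {"datasetId": "<datasetId>", "name": "Invoices 2020", "version": 1}
print(dataset_changed(cached, latest))
```

Comparing for inequality rather than "greater than" avoids relying on the version being a monotonically increasing counter, which is not confirmed here.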

caution

Give your datasets clear names and descriptions. This makes it easier to keep track of which data your model was trained on.

Adding documents to a dataset

Documents can be assigned to a dataset either at creation time or in an update.

CLI

las documents create path/to/my/document.pdf --dataset-id <datasetId>
las documents update <documentId> --dataset-id <datasetId>

cURL

curl -X POST 'https://api.lucidtech.ai/v1/documents' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer eyJra...' \
--data-raw '{
    "content": "JVBERi0xLjQ...",
    "contentType": "application/pdf",
    "datasetId": "<datasetId>"
}'

curl -X PATCH 'https://api.lucidtech.ai/v1/documents/<documentId>' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer eyJra...' \
--data-raw '{
    "datasetId": "<datasetId>"
}'

Python

# Assign the dataset when the document is created
document = client.create_document(b'<bytes data>', 'application/pdf', dataset_id='<datasetId>')
# or assign it to an existing document
document = client.update_document('<documentId>', dataset_id='<datasetId>')
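
The content value in the cURL example (JVBERi0xLjQ...) looks like the Base64 encoding of a PDF header, so the request body can presumably be built by Base64-encoding the file bytes. The following is a minimal sketch under that assumption, using only the standard library. The helper name `build_create_document_body` is hypothetical and not part of the SDK.

```python
import base64
import json


def build_create_document_body(content: bytes, content_type: str, dataset_id: str) -> str:
    """Serialize a JSON body with the fields shown in the cURL example."""
    body = {
        # Assumption: the API expects the file content as a Base64 string.
        "content": base64.b64encode(content).decode("ascii"),
        "contentType": content_type,
        "datasetId": dataset_id,
    }
    return json.dumps(body)


# Eight bytes of a PDF header stand in for a real file here.
body = build_create_document_body(b"%PDF-1.4", "application/pdf", "<datasetId>")
print(body)
```

In practice the bytes would come from reading the document file in binary mode, and the resulting string would be sent as the body of the POST request.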

caution

A document cannot be added to more than one dataset.

Deleting a dataset

A dataset cannot be deleted until all documents it contains have been deleted. The CLI and SDKs can do both in a single command; with the REST API, delete the documents first and then the dataset. For instructions on how to delete all documents from a dataset, see the Documents page.

CLI

las datasets delete <datasetId> --delete-documents

cURL (delete the documents first, then the dataset)

curl -X DELETE 'https://api.lucidtech.ai/v1/documents' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer eyJra...' \
--data-raw '{
    "datasetId": "<datasetId>"
}'

curl -X DELETE 'https://api.lucidtech.ai/v1/datasets/<datasetId>' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer eyJra...'

Python

client.delete_dataset(dataset_id='<datasetId>', delete_documents=True)
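
The ordering constraint described above (documents first, then the dataset) can be illustrated without touching the real API. The sketch below uses `InMemoryClient`, a hypothetical stand-in written for this example only. It is not the SDK client, and its method names are assumptions chosen to mirror the two REST calls.

```python
class InMemoryClient:
    """Hypothetical stand-in for an API client; not the real SDK."""

    def __init__(self, datasets: dict):
        # Maps dataset id -> set of document ids.
        self.datasets = {k: set(v) for k, v in datasets.items()}

    def delete_documents(self, dataset_id: str) -> list:
        """Remove every document in the dataset and return their ids."""
        deleted = sorted(self.datasets[dataset_id])
        self.datasets[dataset_id].clear()
        return deleted

    def delete_dataset(self, dataset_id: str) -> None:
        """Refuse to delete a dataset that still contains documents."""
        if self.datasets[dataset_id]:
            raise RuntimeError("dataset still contains documents")
        del self.datasets[dataset_id]


def delete_dataset_and_documents(client, dataset_id: str) -> list:
    """Delete the documents first, then the now-empty dataset."""
    deleted = client.delete_documents(dataset_id)
    client.delete_dataset(dataset_id)
    return deleted


client = InMemoryClient({"<datasetId>": {"doc-1", "doc-2"}})
print(delete_dataset_and_documents(client, "<datasetId>"))
```

Calling `delete_dataset` on the stand-in before its documents are removed raises an error, which mirrors the rule that a dataset must be empty before it can be deleted.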
