Datasets
A dataset in Cradl is a collection of Documents, preferably from a single source. When you have uploaded documents to your dataset they can be bundled together in Data bundles for training a Model.
Creating a dataset
Datasets are created independently of the documents they contain. You can create a dataset directly in the Cradl datasets UI, or programmatically.
- CLI
- cURL
- Python
las datasets create --name "Invoices 2020" --description "From accounting system"
curl -X POST 'https://api.lucidtech.ai/v1/datasets' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer eyJra...' \
--data-raw '{
"name": "Invoices 2020",
"description": "From accounting system"
}'
dataset = client.create_dataset(name='Invoices 2020', description='From accounting system')
{
"datasetId": "<datasetId>",
"description": "From accounting system",
"name": "Invoices 2020",
"numberOfDocuments": 0,
"storageLocation": "EU",
"containsPersonallyIdentifiableInformation": true,
"version": 0
}
The datasetId
is used to include datasets in Data bundles and to add documents to datasets. The version
field is used to identify changes to a dataset, i.e. when adding/removing/updating contained documents.
Give your datasets clear names and descriptions. This will be helpful when keeping track of which data you train your model from.
Adding documents to a dataset
Documents can be assigned to a dataset either at creation time or in an update.
- CLI
- cURL
- Python
las documents create path/to/my/document.pdf --dataset-id <datasetId>
las documents update <documentId> --dataset-id <datasetId>
curl -X POST 'https://api.lucidtech.ai/v1/documents' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer eyJra...' \
--data-raw '{
"content": "JVBERi0xLjQ...",
"contentType": "application/pdf",
"datasetId": "<datasetId>"
}'
curl -X PATCH 'https://api.lucidtech.ai/v1/documents/\<documentId\>' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer eyJra...' \
--data-raw '{
"datasetId": "<datasetId>"
}'
document = client.create_document(b'<bytes data>', 'application/pdf', datasetId='<datasetId>')
# or
document = client.update_document('<documentId>', datasetId='<datasetId>')
A document cannot be added to more than one dataset.
Deleting a dataset
A dataset may not be deleted unless all documents contained in the dataset are deleted first. Our CLI and SDKs support doing this in a single command. For instructions on how to delete all documents from a dataset see the Documents page.
- CLI
- cURL
- Python
las datasets delete <datasetId> --delete-documents
curl -X DELETE 'https://api.lucidtech.ai/v1/documents' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer eyJra...' \
--data-raw '{
"datasetId": "<datasetId>"
}'
curl -X DELETE 'https://api.lucidtech.ai/v1/datasets/<datasetId>' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer eyJra...' \
client.delete_dataset(dataset_id='<datasetId>', delete_documents=True)