Documents
A document in Cradl consists of an image or a pdf, along with some additional information. Documents serve two purposes: as data points when training models and as inputs when making predictions. If a document is going to be used for training, it needs to have a ground truth. To help you organize your documents from different sources we recommend that you group them in separate Datasets.
Creating a Document
Allowed formats for documents are PDF, JPEG, PNG and TIFF.
- CLI
- cURL
- Python
las documents create path/to/my/document.pdf
curl -X POST 'https://api.lucidtech.ai/v1/documents' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer eyJra...' \
--data-raw '{
"content": "JVBERi0xLjQ...",
"contentType": "application/pdf"
}'
document = client.create_document(b'<bytes data>', 'application/pdf')
{
"documentId": "las:document:84ed1bb2d2634072bd3134274ed56ebe",
"contentType": "application/pdf"
}
The returned documentId
can be used together with a modelId
to make a prediction on the document once a model has been trained. You can also set a ground truth for the document and add it to a Dataset to use it as training data for a model.
Assigning ground truth to to a document
To use a document as training data, it must have an attached ground truth; since our models learn by example, you must provide both the example input (the file) and its expected output (the ground truth). The ground truth can be provided when you create the document, or it can be added as an update to an existing document.
- CLI
- cURL
- Python
- ground_truth.json
las documents create path/to/document.pdf --ground-truth-fields amount=100.00 due_date='2021-05-20'
las documents create path/to/document.pdf --ground-truth-path path/to/ground_truth.json
las documents update <documentId> --ground-truth-fields amount=100.00 due_date='2021-05-20'
las documents update <documentId> --ground-truth-path path/to/ground_truth.json
curl -X POST 'https://api.lucidtech.ai/v1/documents' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer eyJra...' \
--data-raw '{
"content": "JVBERi0xLjQ...",
"contentType": "application/pdf"
"groundTruth": [
{
"label": "amount",
"value": "100.00"
},
{
"label": "due_date",
"value": "2021-05-20"
}
]
}'
curl -X PATCH 'https://api.lucidtech.ai/v1/documents/<documentId>' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer eyJra...' \
--data-raw '{
"groundTruth": [
{
"label": "amount",
"value": "100.00"
},
{
"label": "due_date",
"value": "2021-05-20"
}
]
}'
ground_truth = [
{ 'label': 'total_amount', 'value': '100.00' },
{ 'label': 'due_date', 'value': '2020-02-28' }
]
document = client.create_document(b'<bytes data>', 'application/pdf', ground_truth=ground_truth)
# or
document = client.update_document('<documentId>', ground_truth=ground_truth)
[
{
"label": "amount",
"value": "100.00"
},
{
"label": "due_date",
"value": "2021-05-20"
}
]
The JSON format for a ground truth file is an array of objects containing label
and value
keys. See below for examples. Values in the objects must be strings.
The label name is used as a key in several places. Make sure you are consistent in using the same label names across documents and models.
- example_1.png
- example_1_ground_truth.json
- example_2.png
- example_2_ground_truth.json
[
{
"label": "category",
"value": "taxi"
},
{
"label": "currency",
"value": "EUR"
},
{
"label": "date",
"value": "2019-12-31"
},
{
"label": "total_amount",
"value": "43.90"
}
]
[
{
"label": "lading_number",
"value": "s00158002"
},
{
"label": "point_of_loading",
"value": "Chicago, United States"
},
{
"label": "place_of_delivery",
"value": "Sydney, Australia"
},
{
"label": "date",
"value": "2015-06-23"
},
{
"label": "carrier",
"value": "Lucid Logistics"
},
{
"label": "weight",
"value": "20000"
}
]
Documents with personal consents
In addition to grouping documents in datasets, documents can be assigned a consentId
to facilitate deletion of single-user data. If your application requires users to register data use consent, you should label this consent by a user-unique ID, and label all user data uploaded to Cradl with a corresponding consentId
at creation time.
A consentId
must be formatted as "las:consent:[a-f0-9]{32}"
.
- CLI
- cURL
- Python
las documents create path/to/document.pdf --consent-id <consentId>
curl -X POST 'https://api.lucidtech.ai/v1/documents' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer eyJra...' \
--data-raw '{
"content": "JVBERi0xLjQ...",
"contentType": "application/pdf",
"consentId": "<consentId>"
}'
document = client.create_document(b'<bytes data>', 'application/pdf')
Deleting documents
Documents may be deleted one-by-one:
- CLI
- cURL
- Python
las documents delete <documentId>
curl -X DELETE 'https://api.lucidtech.ai/v1/documents/<documentId>' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer eyJra...' \
client.delete_document('<documentId>')
Or using a group identifier (consentId
or datasetId
):
- CLI
- cURL
- Python
las documents delete-all --dataset-id <datasetId>
curl -X DELETE 'https://api.lucidtech.ai/v1/documents/<documentId>' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer eyJra...' \
--data-raw '{
"datasetId": "<datasetId>"
}'
client.delete_documents(dataset_id='<datasetId>', delete_all=True)
{
"documents": [...],
"consentId": [...]
}
The delete-all command will delete all documents with the given group identifier.