Documents

A document in Cradl consists of an image or a pdf, along with some additional information. Documents serve two purposes: as data points when training models and as inputs when making predictions. If a document is going to be used for training, it needs to have a ground truth. To help you organize your documents from different sources we recommend that you group them in separate Datasets.

Creating a Document

tip

Allowed formats for documents are PDF, JPEG, PNG and TIFF.

CLI
cURL
Python

las documents create path/to/my/document.pdf

curl -X POST 'https://api.lucidtech.ai/v1/documents' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer eyJra...' \
--data-raw '{
    "content": "JVBERi0xLjQ...",
    "contentType": "application/pdf"
 }'

document = client.create_document(b'<bytes data>', 'application/pdf')

{
  "documentId": "las:document:84ed1bb2d2634072bd3134274ed56ebe",
  "contentType": "application/pdf"
}

The returned documentId can be used together with a modelId to make a prediction on the document once a model has been trained. You can also set a ground truth for the document and add it to a Dataset to use it as training data for a model.

Assigning ground truth to to a document

To use a document as training data, it must have an attached ground truth; since our models learn by example, you must provide both the example input (the file) and its expected output (the ground truth). The ground truth can be provided when you create the document, or it can be added as an update to an existing document.

CLI
cURL
Python
ground_truth.json

las documents create path/to/document.pdf --ground-truth-fields amount=100.00 due_date='2021-05-20'
las documents create path/to/document.pdf --ground-truth-path path/to/ground_truth.json
las documents update <documentId> --ground-truth-fields amount=100.00 due_date='2021-05-20'
las documents update <documentId> --ground-truth-path path/to/ground_truth.json

curl -X POST 'https://api.lucidtech.ai/v1/documents' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer eyJra...' \
--data-raw '{
    "content": "JVBERi0xLjQ...",
    "contentType": "application/pdf"
    "groundTruth": [
      {
        "label": "amount",
        "value": "100.00"
      },
      {
        "label": "due_date",
        "value": "2021-05-20"
      }
    ]
  }'


curl -X PATCH 'https://api.lucidtech.ai/v1/documents/<documentId>' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer eyJra...' \
--data-raw '{
    "groundTruth": [
      {
        "label": "amount",
        "value": "100.00"
      },
      {
        "label": "due_date",
        "value": "2021-05-20"
      }
    ]
  }'

ground_truth = [
  { 'label': 'total_amount', 'value': '100.00' },
  { 'label': 'due_date', 'value': '2020-02-28' }
]
document = client.create_document(b'<bytes data>', 'application/pdf', ground_truth=ground_truth)
# or
document = client.update_document('<documentId>', ground_truth=ground_truth)

[
  {
    "label": "amount",
    "value": "100.00"
  },
  {
    "label": "due_date",
    "value": "2021-05-20"
  }
]

The JSON format for a ground truth file is an array of objects containing label and value keys. See below for examples. Values in the objects must be strings.

caution

The label name is used as a key in several places. Make sure you are consistent in using the same label names across documents and models.

example_1.png
example_1_ground_truth.json
example_2.png
example_2_ground_truth.json

[
  {
    "label": "category",
    "value": "taxi"
  },
  {
    "label": "currency",
    "value": "EUR"
  },
  {
    "label": "date",
    "value": "2019-12-31"
  },
  {
    "label": "total_amount",
    "value": "43.90"
  }
]

[
  {
    "label": "lading_number",
    "value": "s00158002"
  },
  {
    "label": "point_of_loading",
    "value": "Chicago, United States"
  },
  {
    "label": "place_of_delivery",
    "value": "Sydney, Australia"
  },
  {
    "label": "date",
    "value": "2015-06-23"
  },
  {
    "label": "carrier",
    "value": "Lucid Logistics"
  },
  {
    "label": "weight",
    "value": "20000"
  }
]

Documents with personal consents

In addition to grouping documents in datasets, documents can be assigned a consentId to facilitate deletion of single-user data. If your application requires users to register data use consent, you should label this consent by a user-unique ID, and label all user data uploaded to Cradl with a corresponding consentId at creation time.

A consentId must be formatted as "las:consent:[a-f0-9]{32}".

CLI
cURL
Python

las documents create path/to/document.pdf --consent-id <consentId>

curl -X POST 'https://api.lucidtech.ai/v1/documents' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer eyJra...' \
--data-raw '{
    "content": "JVBERi0xLjQ...",
    "contentType": "application/pdf",
    "consentId": "<consentId>"
 }'

document = client.create_document(b'<bytes data>', 'application/pdf')

Deleting documents

Documents may be deleted one-by-one:

CLI
cURL
Python

las documents delete <documentId>

curl -X DELETE 'https://api.lucidtech.ai/v1/documents/<documentId>' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer eyJra...' \

client.delete_document('<documentId>')

Or using a group identifier (consentId or datasetId):

CLI
cURL
Python

las documents delete-all --dataset-id <datasetId>

curl -X DELETE 'https://api.lucidtech.ai/v1/documents/<documentId>' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer eyJra...' \
--data-raw '{
    "datasetId": "<datasetId>"
 }'

client.delete_documents(dataset_id='<datasetId>', delete_all=True)

{
  "documents": [...],
  "consentId": [...]
}

The delete-all command will delete all documents with the given group identifier.

Documents

Creating a Document​

Assigning ground truth to to a document​

Documents with personal consents​

Deleting documents​

Creating a Document

Assigning ground truth to to a document

Documents with personal consents

Deleting documents