API reference
REST API for uploading documents and pulling typed JSON results. Version 0.1.0. Base path /api/v1.
Authentication
Every request is authenticated with the API key issued to your project. Send it in the Authorization header:
Authorization: Bearer YOUR_API_KEY
Treat the key as a secret. Don't commit it to source control or ship it in client-side code. If a key is exposed, rotate it from the Pathnovo console.
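One way to keep the key out of source control is to read it from the environment at runtime. The sketch below assumes the key lives in a PATHNOVO_API_KEY environment variable (the same name the curl examples in this reference use); the helper name auth_headers is hypothetical.

```python
import os


def auth_headers(api_key=None):
    """Return the Authorization header for a Pathnovo request."""
    # Fall back to the environment so the key never lives in source code.
    key = api_key or os.environ.get("PATHNOVO_API_KEY")
    if not key:
        raise RuntimeError("Set PATHNOVO_API_KEY or pass api_key explicitly")
    return {"Authorization": f"Bearer {key}"}
```

Pass the returned dict as the headers of whatever HTTP client you use.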
Errors
Errors return a JSON body with a detail field and the relevant HTTP status:
{
"detail": "Project not found"
}
| Status | Meaning |
|---|---|
| 400 | Bad request. The body or query is malformed. |
| 401 | Missing or invalid API key. |
| 403 | The key does not have access to the requested project or resource. |
| 404 | Resource does not exist. |
| 409 | Conflict. Usually a duplicate upload that has been deduplicated. |
| 413 | File or archive is too large. |
| 415 | Unsupported file type. |
| 422 | Validation error. Field-level details in the response body. |
| 429 | Rate limit exceeded. |
| 500 | Server error. Safe to retry with backoff. |
Rate limits
Default limits: 60 requests per minute per API key for read endpoints, 30 per minute for upload endpoints. Limits are returned in the X-RateLimit-* headers on every response. If you need higher limits, contact your account team.
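A client can absorb occasional 429 and 500 responses by retrying with exponential backoff. The sketch below is one possible shape, not a prescribed client: the function names, the delay values, and the decision to retry 429 as well as 500 are all assumptions. It does not read the X-RateLimit-* headers, because their exact names are not listed here.

```python
import random
import time

RETRYABLE = {429, 500}  # assumption: 429 is retried too, after waiting


def backoff_delay(attempt, base=1.0, cap=30.0):
    """Exponential delay for a zero-based attempt number, capped at `cap`."""
    return min(cap, base * (2 ** attempt))


def call_with_retry(send, max_attempts=5, sleep=time.sleep, jitter=random.random):
    """Call send() until it returns a non-retryable status or attempts run out.

    `send` is any zero-argument callable returning (status_code, body).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        if status not in RETRYABLE or attempt == max_attempts - 1:
            return status, body
        # Up to one second of jitter keeps parallel clients from retrying in lockstep.
        sleep(backoff_delay(attempt) + jitter())
```

The sleep and jitter parameters are injectable so the loop can be exercised without real waiting.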
Conventions
- Request and response bodies are JSON. Uploads use multipart/form-data.
- Upload endpoints return 202 Accepted with a document ID. Poll the status endpoint or subscribe to the SSE stream.
- Timestamps are ISO 8601 UTC. IDs are UUIDv4.
- All endpoints are versioned under /api/v1.
Documents
Upload files or import them from a URL. Each upload returns a document ID and starts a background classification + extraction job.
Upload a document
Upload a single file as multipart/form-data. The endpoint returns immediately with a document ID; classification and extraction run in the background.
Use this for one document at a time. The response is 202 Accepted, not 200, because the file has only been received and queued. The pipeline runs in the background and can take anywhere from 30 seconds for a small datasheet to several minutes for a multi-page P&ID.
Pathnovo deduplicates by content hash, scoped to the project. If you upload the same file (byte-for-byte) into the same project twice, the second response returns the original document_id with deduplicated set to true and no new job is queued. This makes uploads idempotent enough to retry safely.
Supported file types: PDF, PNG, JPG, TIFF, XLSX, DOCX. Max size per file is 100 MB.
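Checking type and size locally before uploading avoids a round trip that would end in 413 or 415. This is a rough sketch under stated assumptions: the extension set mirrors the list above, the .jpeg and .tif aliases are a guess, and 100 MB is taken as 100,000,000 bytes. The name check_upload is hypothetical.

```python
from pathlib import PurePath

# Extensions for the supported types listed above. Treating .jpeg and .tif
# as aliases is an assumption; the reference only names JPG and TIFF.
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".tif", ".xlsx", ".docx"}
MAX_FILE_BYTES = 100 * 1000 * 1000  # assumption: 1 MB = 1,000,000 bytes


def check_upload(filename, size_bytes):
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    ext = PurePath(filename).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file type: {ext or '(none)'}")
    if size_bytes > MAX_FILE_BYTES:
        problems.append("file is larger than 100 MB")
    return problems
```

The server remains the authority on what it accepts; this only catches the obvious cases early.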
After upload, watch progress on the SSE stream or poll the status endpoint. Don't keep the upload connection open waiting for completion.
Pass the document_id from the response to GET /documents/{id}/progress (live SSE) or GET /documents/{id} (polling) to track the job.
Treat the 202 as the final response on the upload connection, which closes as soon as the file lands. Waiting on the upload socket for progress will only time out.
Request body
Content-Type: multipart/form-data
| Field | Type | Description |
|---|---|---|
| file (required) | binary | The document file (PDF, PNG, JPG, TIFF, XLSX, DOCX). Max 100 MB. |
| project_id (required) | UUID | Project the document belongs to. Your API key must have access to it. |
Responses
{
"document_id": "UUID",
"status": "queued",
"deduplicated": false
}
curl -X POST "https://api.pathnovo.com/api/v1/documents/upload" \
  -H "Authorization: Bearer $PATHNOVO_API_KEY" \
  -F "file=@/path/to/file" \
  -F "project_id=<value>"
Upload a ZIP of documents
Upload a ZIP archive. Every file inside is unpacked and treated as a separate document. Each gets its own document_id and runs through the pipeline independently.
Use this when you have a folder of related documents (a handover package, a vendor deliverable, a turnaround set) and want to push them all in one request. Pathnovo extracts the archive on its side, then queues each file as a normal upload.
The response includes the count and an array of UploadResponse objects, one per file, in the order they appeared in the archive. Track each document_id individually after that.
Nested folders inside the ZIP are flattened. The archive must not exceed 1 GB. For very large sets, split them into multiple ZIPs and call this endpoint once per chunk.
Files with unsupported extensions (.bak, .tmp, .DS_Store, etc.) are silently ignored. Hidden files starting with `.` are skipped.
If your archive has more than ~50 files, processing happens in parallel server-side. The order of completion will not match the order in the response.
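Splitting a large set into several archives can be planned before any ZIP is written. The sketch below groups files greedily by their uncompressed sizes, which is only a rough stand-in for the final archive size, so leave some headroom. Taking 1 GB as 10^9 bytes and the name plan_archives are both assumptions.

```python
ZIP_LIMIT_BYTES = 1000 ** 3  # assumption: 1 GB = 10^9 bytes


def plan_archives(files, limit=ZIP_LIMIT_BYTES):
    """Group (name, size_bytes) pairs into batches whose sizes sum to at most `limit`.

    Input order is preserved, so each batch is a contiguous run of the input.
    """
    batches, current, current_size = [], [], 0
    for name, size in files:
        if size > limit:
            raise ValueError(f"{name} alone exceeds the archive limit")
        if current and current_size + size > limit:
            # The next file would overflow this batch, so start a new one.
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each returned batch can then be zipped and sent to the batch upload endpoint as its own request.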
Request body
Content-Type: multipart/form-data
| Field | Type | Description |
|---|---|---|
| zip_file (required) | binary | ZIP archive of documents. Max 1 GB. Nested folders are flattened. |
| project_id (required) | UUID | Project the documents belong to. |
Responses
{
"total_files": 12,
"documents": [
{ "document_id": "UUID", "status": "queued", "deduplicated": false }
]
}
curl -X POST "https://api.pathnovo.com/api/v1/documents/batch/upload" \
  -H "Authorization: Bearer $PATHNOVO_API_KEY" \
  -F "zip_file=@/path/to/file" \
  -F "project_id=<value>"
Import from a URL
Hand Pathnovo a URL and we'll fetch the file ourselves, then run the standard pipeline. Useful when files live in S3, SharePoint, or an internal share that's reachable over the public internet.
Pathnovo's importer follows redirects, respects standard auth on the URL (basic auth and query-string tokens), and verifies the file type before queuing the job. The download happens once; the file is then stored in our document store like any other upload.
For private files, generate a short-lived pre-signed URL on your side (e.g. an S3 presigned URL with a 15-minute TTL). The URL only needs to be valid long enough for us to fetch the file, which usually takes a few seconds.
We cannot pull from inside your VPC or behind a corporate firewall. If the URL is private and you can't expose it, use POST /documents/upload instead and stream the bytes directly.
The same content-hash deduplication that applies to /documents/upload applies here. Re-importing the same URL into the same project returns the original document_id, provided the file content has not changed.
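For clients that do not shell out to curl, the import request can be assembled with Python's standard library. This is a minimal sketch: the function name build_import_request is hypothetical, and rejecting anything other than http(s) URLs is an assumption based on the importer fetching over the public internet.

```python
import json
import urllib.request
from urllib.parse import urlsplit

BASE_URL = "https://api.pathnovo.com/api/v1"


def build_import_request(url, project_id, api_key):
    """Build (but do not send) the POST /documents/import-url request."""
    if urlsplit(url).scheme not in ("http", "https"):
        raise ValueError("url must be an http(s) URL reachable from the public internet")
    body = json.dumps({"url": url, "project_id": project_id}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/documents/import-url",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
```

Sending it is then a matter of passing the request to urllib.request.urlopen and decoding the JSON response.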
Request body
Model: UrlImportRequest · Content-Type: application/json
{
"url": "https://example.com/spec.pdf",
"project_id": "UUID"
}
Responses
{
"document_id": "UUID",
"status": "queued",
"deduplicated": false
}
curl -X POST "https://api.pathnovo.com/api/v1/documents/import-url" \
  -H "Authorization: Bearer $PATHNOVO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://example.com/spec.pdf", "project_id": "UUID" }'
Get document status
Read the current state of a document. Use this when you want a single-point-in-time check, for example from a cron job or when refreshing a UI page.
The status field moves through these values in order, ending at either extracted or failed:
- queued: uploaded, waiting for a worker.
- classifying: the classifier is running.
- extracting: the right pipeline is pulling fields.
- extracted: done, result available.
- failed: something blew up; see error_detail on the extraction job.
The response also includes the original filename, mime type, and three timestamps: when it was uploaded, when classification finished, and when extraction finished. The classification and extraction timestamps are null until those steps complete.
If you need live updates rather than snapshots, prefer the SSE progress stream on /documents/{id}/progress so you don't burn rate limit on a tight poll loop.
Most documents finish within 60 seconds. A 5-second poll interval is plenty. Don't poll faster than once per second; the status almost never changes that quickly and you'll hit rate limits.
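A polling loop along these lines keeps to the 5-second interval and stops at a terminal state. It is a sketch, not a reference client: fetch_status stands for whatever code you write to call GET /documents/{id} and return the status string, and the 600-second timeout is an arbitrary assumption.

```python
import time

TERMINAL_STATES = {"extracted", "failed"}


def wait_for_document(fetch_status, interval=5.0, timeout=600.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll until the document reaches a terminal state, then return that state.

    `fetch_status` is any zero-argument callable that performs
    GET /documents/{id} and returns the `status` string from the response.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        if clock() + interval > deadline:
            raise TimeoutError(f"document still {status!r} after {timeout} seconds")
        sleep(interval)
```

A return value of failed is still a normal return; the caller decides whether to inspect error_detail on the extraction job.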
Path parameters
| Name | Type | Description |
|---|---|---|
| document_id | UUID | Document ID returned by an upload endpoint. |
Responses
{
"id": "UUID",
"project_id": "UUID",
"original_filename": "spec.pdf",
"mime_type": "application/pdf",
"status": "extracted",
"uploaded_at": "2026-04-25T08:00:00Z",
"classified_at": "2026-04-25T08:00:14Z",
"extracted_at": "2026-04-25T08:01:02Z"
}
curl -X GET "https://api.pathnovo.com/api/v1/documents/{document_id}" \
  -H "Authorization: Bearer $PATHNOVO_API_KEY"
Stream live progress (SSE)
A Server-Sent Events stream that emits an event every time the document moves to a new stage or the extractor reports a percentage. The connection closes automatically when the job hits a terminal state (extracted or failed).
Each event is a JSON object with a `status` field and an optional `progress` field (0-100) for stages that report incremental progress. Browsers can subscribe with the standard EventSource API; on the server side use any HTTP client that supports streaming.
If the connection drops, reconnect with the same URL. Pathnovo will replay the latest known state immediately so you never miss the terminal event. There's no event ID; reconnection is stateless.
If the document already completed before you connected, you'll get one event with the final status, then the connection closes. This makes it safe to subscribe even after a job is done.
SSE is better for live UIs (the user is staring at a progress bar). Polling /documents/{id} is better for batch jobs or async workflows where you only need to check in occasionally.
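On the server side, the stream can be consumed by parsing the standard SSE wire format: data: lines accumulate and a blank line ends the event. The sketch below assumes each event's data payload is the JSON object described above; the function name iter_progress_events is hypothetical.

```python
import json

TERMINAL_STATES = {"extracted", "failed"}


def iter_progress_events(lines):
    """Yield one dict per SSE event from an iterable of decoded text lines.

    Stops after the first event whose status is terminal.
    """
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            # Per the SSE format, one optional space follows the colon.
            value = line[5:]
            data_parts.append(value[1:] if value.startswith(" ") else value)
        elif line == "" and data_parts:
            # A blank line ends the event; multi-line data is joined with "\n".
            event = json.loads("\n".join(data_parts))
            data_parts = []
            yield event
            if event.get("status") in TERMINAL_STATES:
                return
        # Comment lines (":...") and other fields (event:, id:, retry:) are ignored.
```

Feed it the decoded lines of a streaming HTTP response. With urllib, for instance, iterating over the response object yields lines as bytes, which you would decode as UTF-8 first.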
Path parameters
| Name | Type | Description |
|---|---|---|
| document_id | UUID | Document ID. |
Responses
curl -X GET "https://api.pathnovo.com/api/v1/documents/{document_id}/progress" \
  -H "Authorization: Bearer $PATHNOVO_API_KEY"
// No body
Classification
Read the document type Pathnovo assigned to a file, or override it manually if needed.
Get classification
Returns the document type Pathnovo assigned to this file, the confidence score, the method (auto or manual), and the bucket the type belongs to.
Confidence is an integer from 0 to 100. In practice, anything above 90 is rock-solid, 70 to 90 is usually right but worth a glance from a human reviewer, and below 70 is worth manual confirmation. The classifier surfaces low-confidence calls in our own UI for review; you can do the same on your side.
Bucket is a coarse grouping (drawings, datasheets, registers, certificates) that lets you filter without knowing the exact doc type. Useful for dashboards.
If the classification was overridden manually, is_manual_override is true and confidence reads 100. The original auto-classification is still recorded in our system but not exposed here.
Run this right after the document hits status 'extracted'. The extraction result already encodes the doc type implicitly, but this endpoint gives you the confidence number which is what you want for review queues.
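The confidence bands above translate directly into a routing rule for a review queue. In this sketch the three labels are made up for illustration, and a score of exactly 90 is placed in the middle band because the text says "above 90" for the top one.

```python
def review_action(confidence):
    """Map a 0-100 classification confidence to a review decision."""
    if confidence > 90:
        return "accept"
    if confidence >= 70:
        return "spot-check"
    return "confirm-manually"
```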
Path parameters
| Name | Type | Description |
|---|---|---|
| document_id | UUID | Document ID. |
Responses
{
"id": "UUID",
"document_id": "UUID",
"doc_type_id": "UUID",
"doc_type_name": "P&ID",
"bucket": "drawings",
"confidence": 96,
"method": "auto",
"is_manual_override": false,
"created_at": "2026-04-25T08:00:14Z"
}
curl -X GET "https://api.pathnovo.com/api/v1/classification/documents/{document_id}" \
  -H "Authorization: Bearer $PATHNOVO_API_KEY"
Override classification
Manually set the document type. This re-queues extraction with the new schema and returns the updated classification record.
Use this when the auto-classifier got it wrong and the result you got back from /extraction/documents/{id}/result is using the wrong shape. Common case: a vendor datasheet that visually looks like a P&ID, or two doc types that share the same template.
When you call this endpoint, Pathnovo deletes the previous extraction result, looks up the schema for the new doc_type_name, and queues a fresh extraction job. The original document file is reused, no upload needed. Watch the new job through the standard progress endpoints.
The reason field is stored on our side for your audit trail. We don't use it to retrain anything automatically; if you want classifier improvements based on overrides, talk to your integration engineer.
If you've already pulled the extraction result and stored it on your side, snapshot it before overriding. The previous result is deleted as soon as the new job is queued.
doc_type_name must match exactly one of the names returned by GET /schemas. Case-sensitive.
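Because the match is exact and case-sensitive, it can help to validate the name on your side before sending the override. The sketch below assumes you have already collected the doc_type_name values from GET /schemas; build_override_body is a hypothetical helper, not part of the API.

```python
import json


def build_override_body(doc_type_name, reason, known_names):
    """Return the JSON body for the override request as bytes.

    `known_names` is the collection of doc_type_name values from GET /schemas.
    """
    if doc_type_name not in known_names:  # exact, case-sensitive comparison
        close = [n for n in known_names if n.lower() == doc_type_name.lower()]
        hint = f"; did you mean {close[0]!r}?" if close else ""
        raise ValueError(f"unknown doc_type_name {doc_type_name!r}{hint}")
    return json.dumps({"doc_type_name": doc_type_name, "reason": reason}).encode("utf-8")
```

The case-insensitive lookup is only used to build a friendlier error message; the value sent is always the exact name you passed in.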
Path parameters
| Name | Type | Description |
|---|---|---|
| document_id | UUID | Document ID. |
Request body
Model: ManualOverrideRequest · Content-Type: application/json
{
"doc_type_name": "P&ID",
"reason": "auto-classified as isometric, but it is a P&ID"
}
Responses
{
"id": "UUID",
"document_id": "UUID",
"doc_type_id": "UUID",
"doc_type_name": "P&ID",
"bucket": "drawings",
"confidence": 100,
"method": "manual",
"is_manual_override": true,
"created_at": "2026-04-25T08:00:14Z"
}
curl -X PATCH "https://api.pathnovo.com/api/v1/classification/documents/{document_id}" \
  -H "Authorization: Bearer $PATHNOVO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "doc_type_name": "P&ID", "reason": "auto-classified as isometric, but it is a P&ID" }'
Extraction
Track extraction jobs and pull the typed JSON result for a document.
Get job status
Returns the lightweight status record for one extraction job. Use this when you already hold a job_id and only need to know if it's done.
The status field can be queued, running, completed, or failed. resolved_scope tells you which config layer was used to extract this document (default, org, or project) so you can debug surprising results during a rollout of new prompts.
Compared to GET /extraction/documents/{id}/jobs (which returns all jobs for the document including the full result), this endpoint is cheap and small. Reach for it inside tight loops.
Path parameters
| Name | Type | Description |
|---|---|---|
| job_id | UUID | Extraction job ID. Returned by /extraction/documents/{id}/jobs. |
Responses
{
"id": "UUID",
"status": "completed",
"resolved_scope": "project",
"started_at": "2026-04-25T08:00:14Z",
"completed_at": "2026-04-25T08:01:02Z"
}
curl -X GET "https://api.pathnovo.com/api/v1/extraction/jobs/{job_id}/status" \
  -H "Authorization: Bearer $PATHNOVO_API_KEY"
List jobs for a document
Returns every extraction job that has run for this document, newest first, including the full result and any error detail.
A document can have more than one job in two cases: classification was overridden (so we re-extracted with the new schema), or the first extraction failed and was retried. This endpoint shows the full history, so you can compare results across attempts.
Each job record includes the resolved_scope (default / org / project) used at the time. If you change project-level config and re-extract, you'll see the new scope on the latest job.
If you only care about the latest successful result, hit /extraction/documents/{id}/result instead. This endpoint is for when you need history.
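When you already hold the job history and want the most recent successful attempt out of it, a scan that relies on the newest-first ordering is enough. This is a sketch over the parsed JSON list; the name latest_completed is hypothetical.

```python
def latest_completed(jobs):
    """Return the newest job with status "completed", or None if there is none.

    Relies on the documented ordering: the list is newest first.
    """
    for job in jobs:
        if job.get("status") == "completed":
            return job
    return None
```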
Path parameters
| Name | Type | Description |
|---|---|---|
| document_id | UUID | Document ID. |
Responses
[
{
"id": "UUID",
"document_id": "UUID",
"status": "completed",
"resolved_scope": "project",
"result": { "title_block": { "...": "..." } },
"error_detail": null,
"started_at": "2026-04-25T08:00:14Z",
"completed_at": "2026-04-25T08:01:02Z",
"created_at": "2026-04-25T08:00:00Z"
}
]
curl -X GET "https://api.pathnovo.com/api/v1/extraction/documents/{document_id}/jobs" \
  -H "Authorization: Bearer $PATHNOVO_API_KEY"
Get extraction result
Returns the latest completed extraction as typed JSON. The shape depends on the document type; look up the schema for that type if you need to know the exact fields.
This is the endpoint you'll call most often. After a document reaches the extracted status, hit this to pull the structured data and load it into your downstream system.
Every result includes the embedded title_block header (28 fields shared across every doc type) plus a body that depends on the doc type. For example, a P&ID returns lines, instruments, equipment, valves; a mill certificate returns chemical composition, mechanical tests, dimensions.
If the document hasn't finished extracting yet, the response is 202 Accepted with an empty body. Don't treat that as a result; wait for the job to complete.
To know the exact fields you'll get back, find the document type in /schemas and use the version returned. The schema includes every field with its type.
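Client code needs to tell "not ready yet" apart from a real result, since a 202 carries an empty body that would fail JSON parsing. The sketch below shows one way to handle that; it assumes a finished result comes back with status 200, which this reference does not state explicitly, and the function name is hypothetical.

```python
import json


def parse_result_response(status_code, body):
    """Return the extraction result as a dict, or None if it is not ready yet."""
    if status_code == 202:
        # Still extracting: the body is empty, so there is nothing to parse.
        return None
    if status_code != 200:  # assumption: a finished result comes back as 200
        raise RuntimeError(f"unexpected status {status_code}")
    return json.loads(body)
```

A None return is the cue to keep polling or to wait for the terminal SSE event before asking again.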
Path parameters
| Name | Type | Description |
|---|---|---|
| document_id | UUID | Document ID. |
Responses
curl -X GET "https://api.pathnovo.com/api/v1/extraction/documents/{document_id}/result" \
  -H "Authorization: Bearer $PATHNOVO_API_KEY"
// No body
Get extraction status
Compact status flags for the latest extraction attempt on this document, without the full result payload.
Use this when you need a quick yes/no on whether the latest extraction is ready, and you don't want to pull the full result. Common pattern: poll this from a worker that's deciding whether to fan out the result-fetch job.
Path parameters
| Name | Type | Description |
|---|---|---|
| document_id | UUID | Document ID. |
Responses
curl -X GET "https://api.pathnovo.com/api/v1/extraction/documents/{document_id}/status" \
  -H "Authorization: Bearer $PATHNOVO_API_KEY"
// No body
Schemas
List the document types Pathnovo can extract and pull the JSON schema for any one of them.
List supported document types
List every document type Pathnovo can extract, with the schema version and field count.
Responses
[
{
"doc_type_id": "UUID",
"doc_type_name": "P&ID",
"bucket": "drawings",
"version": "1.4.0",
"field_count": 78
}
]
curl -X GET "https://api.pathnovo.com/api/v1/schemas" \
  -H "Authorization: Bearer $PATHNOVO_API_KEY"
Get a schema
Get the JSON schema for a specific document type, including the embedded title_block header.
Path parameters
| Name | Type | Description |
|---|---|---|
| doc_type_id | UUID | Document type ID. |
Responses
{
"doc_type_id": "UUID",
"doc_type_name": "P&ID",
"version": "1.4.0",
"schema": {
"title_block": "<common>",
"lines": [
{ "tag": "string", "size_in": "number", "service": "string" }
]
}
}
curl -X GET "https://api.pathnovo.com/api/v1/schemas/{doc_type_id}" \
  -H "Authorization: Bearer $PATHNOVO_API_KEY"
List document type IDs
List the raw document type records used by the classifier. Use this to map document type names to IDs.
Responses
[
{
"id": "UUID",
"name": "P&ID",
"bucket": "drawings",
"description": "Piping and instrumentation diagram"
}
]
curl -X GET "https://api.pathnovo.com/api/v1/extraction/document-types" \
  -H "Authorization: Bearer $PATHNOVO_API_KEY"
Analytics
Project-level usage and accuracy metrics. Useful for dashboards and billing reconciliation.
Project overview
Summary counts for a project. Documents uploaded, classified, extracted, failed, plus pages processed for billing.
Path parameters
| Name | Type | Description |
|---|---|---|
| project_id | UUID | Project ID. |
Responses
{
"project_id": "UUID",
"documents_uploaded": 1284,
"documents_classified": 1280,
"documents_extracted": 1271,
"documents_failed": 9,
"by_status": {
"queued": 0,
"classifying": 4,
"extracting": 13,
"extracted": 1271,
"failed": 9
},
"pages_processed": 8420,
"billable_pages": 8420
}
curl -X GET "https://api.pathnovo.com/api/v1/analytics/projects/{project_id}/overview" \
  -H "Authorization: Bearer $PATHNOVO_API_KEY"
Classification accuracy
Accuracy and method breakdown for a project. Shows the auto vs manual split, average confidence, and counts by document type.
Path parameters
| Name | Type | Description |
|---|---|---|
| project_id | UUID | Project ID. |
Responses
{
"total_classified": 1280,
"auto_classified": 1267,
"manual_overrides": 13,
"avg_confidence": 94.6,
"by_method": { "auto": 1267, "manual": 13 },
"by_bucket": { "drawings": 612, "datasheets": 318, "registers": 350 },
"by_doc_type": { "P&ID": 184, "Isometric": 230, "Pressure Vessel Datasheet": 88 }
}
curl -X GET "https://api.pathnovo.com/api/v1/analytics/projects/{project_id}/classification" \
  -H "Authorization: Bearer $PATHNOVO_API_KEY"
Extraction throughput
Daily document and page counts for a project. Use the from and to query params to set a range. Defaults to the last 30 days.
Path parameters
| Name | Type | Description |
|---|---|---|
| project_id | UUID | Project ID. |
Query parameters
| Name | Type | Description |
|---|---|---|
| from | date | ISO date, inclusive. Defaults to 30 days ago. |
| to | date | ISO date, inclusive. Defaults to today. |
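The from and to parameters go in the query string as ISO dates. The sketch below builds the URL with the standard library; the YYYY-MM-DD format follows the dates shown in this endpoint's response, and throughput_url is a hypothetical helper name.

```python
from datetime import date
from urllib.parse import urlencode

BASE_URL = "https://api.pathnovo.com/api/v1"


def throughput_url(project_id, start=None, end=None):
    """Build the throughput URL; omitted dates fall back to the server defaults."""
    params = {}
    if start is not None:
        params["from"] = start.isoformat()
    if end is not None:
        params["to"] = end.isoformat()
    url = f"{BASE_URL}/analytics/projects/{project_id}/throughput"
    return f"{url}?{urlencode(params)}" if params else url
```

Leaving both dates out yields the bare URL, which returns the last 30 days.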
Responses
{
"from": "2026-04-01",
"to": "2026-04-25",
"buckets": [
{ "date": "2026-04-01", "uploaded": 42, "extracted": 41, "pages": 268 },
{ "date": "2026-04-02", "uploaded": 60, "extracted": 60, "pages": 401 }
],
"totals": { "uploaded": 1284, "extracted": 1271, "pages": 8420 }
}
curl -X GET "https://api.pathnovo.com/api/v1/analytics/projects/{project_id}/throughput" \
  -H "Authorization: Bearer $PATHNOVO_API_KEY"