Knowledge base — documents (power user).
If you have existing content (a help-centre URL, a product manual PDF, a Word policy doc), you can ingest it directly instead of typing FAQs by hand.
Supported formats
| Format | Extension | Notes |
|---|---|---|
| URL | — | Public web page; Tenlo extracts main content via Mozilla Readability. Sitemap-driven multi-page crawl is available (see below). |
| PDF | .pdf | Text-based PDFs are handled natively. Scanned PDFs are now OCR’d automatically via Mistral OCR (V2.3, shipped 2026-05): the first 50 pages are processed and the rest are skipped, with a notice on the document card. |
| Word | .docx | Modern Word format. Legacy .doc is rejected — convert to .docx first. |
| Markdown | .md | Plain Markdown |
How ingestion works
Knowledge Base tab → Add a source. Either paste a URL or pick a file. Click Ingest.
The dashboard shows the document immediately in the Imported documents list with status pending. Behind the scenes:
- Fetch: Tenlo downloads the URL or reads the uploaded file
- Parse: extract clean text (URL: Readability; PDF: text extraction; Word: raw text; MD: as-is)
- Chunk: split into ~500-token chunks with a 50-token overlap, so a sentence that straddles a chunk boundary is not cut off from its context
- Embed: generate a vector for each chunk
- Index: save to your private search index
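The Chunk step can be pictured as a sliding window. The sketch below is only an illustration of 500-token windows with a 50-token overlap; the function name `chunk_tokens` and the placeholder tokens are assumptions, and Tenlo's actual tokenizer and chunker are not described here.

```python
def chunk_tokens(tokens, size=500, overlap=50):
    """Split a token list into windows of `size` tokens, each sharing
    `overlap` tokens with the window before it."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # each new window starts 450 tokens after the last
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Placeholder tokens stand in for the output of a real tokenizer.
tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])  # [500, 500, 300]
```

The last 50 tokens of one chunk are the first 50 of the next, which is what lets a short passage near a boundary show up whole in at least one chunk.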
The status pill moves through pending → processing → embedded (success) or failed (with an error message). Typical end-to-end time is 30 seconds to 2 minutes per document. Scanned PDFs show a Processing (OCR)… badge while Mistral OCR runs (usually 1–3 minutes for a 10-page scan).
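The status lifecycle reads naturally as a small state machine. This is a sketch only: the `TRANSITIONS` table and the helper names are hypothetical, and it assumes a document can fail from either pending or processing, which the description above does not spell out.

```python
# Allowed status moves, as read from the status pill description.
# Assumption: failure is possible from both pending and processing.
TRANSITIONS = {
    "pending": {"processing", "failed"},
    "processing": {"embedded", "failed"},
    "embedded": set(),  # terminal: success
    "failed": set(),    # terminal: shown with an error message
}


def advance(current: str, new: str) -> str:
    """Return `new` if the move is allowed, otherwise raise ValueError."""
    if new not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {new}")
    return new


def is_terminal(status: str) -> bool:
    """A status is terminal when no further moves are allowed from it."""
    return not TRANSITIONS[status]


status = "pending"
for step in ("processing", "embedded"):
    status = advance(status, step)
print(status, is_terminal(status))  # embedded True
```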
Limits
- 100 documents per business (lifetime), shared across Support Chatbot (P01) and Sales Assistant (P05). Deleting a document from either product frees a slot that either product can use.
- 25 MB max per file
- 20 ingest operations per hour — a sitemap crawl that fans out into 80 pages still counts as 1 operation against this cap. Single URL or single file = 1 operation.
- Scanned PDFs — first 50 pages are OCR’d; beyond that, processing stops and the document card shows how many pages were skipped. Re-upload a split version if you need the tail.
- Sitemap crawl — same-host pages only. CDN subdomains or off-host URLs found in the sitemap are filtered out. Already-ingested URLs are detected and shown as such (no double-ingest).
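If you prepare uploads with a script, a pre-flight check against these limits can save a failed attempt. The sketch below is illustrative: the function names are hypothetical, the counters are supplied by the caller rather than read from Tenlo, and it assumes 1 MB means 1024 × 1024 bytes.

```python
MAX_DOCS = 100
MAX_FILE_BYTES = 25 * 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes
MAX_OPS_PER_HOUR = 20
OCR_PAGE_CAP = 50


def preflight(file_bytes, docs_used, ops_last_hour):
    """Return a list of limit violations; an empty list means clear to ingest."""
    problems = []
    if file_bytes > MAX_FILE_BYTES:
        problems.append("file exceeds 25 MB")
    if docs_used >= MAX_DOCS:
        problems.append("no document slots left")
    if ops_last_hour >= MAX_OPS_PER_HOUR:
        problems.append("hourly ingest limit reached")
    return problems


def ocr_split(total_pages):
    """Return (pages OCR'd, pages skipped) for a scanned PDF."""
    processed = min(total_pages, OCR_PAGE_CAP)
    return processed, total_pages - processed


print(preflight(file_bytes=3_000_000, docs_used=42, ops_last_hour=5))  # []
print(ocr_split(72))  # (50, 22)
```

A 72-page scan would have its last 22 pages skipped, so splitting it into two files of at most 50 pages each keeps the tail.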
Deleting an ingested document
The Imported documents list has a Delete button per row. Clicking it:
- Removes all the document’s chunks from the search index
- Deletes the file from storage (if uploaded)
- Removes the row from the list
The deletion is clean — the bot stops citing that document immediately.
FAQs vs. documents — when to use each
| Use FAQs when… | Use document ingestion when… |
|---|---|
| The answer is a single short paragraph | You have a long-form policy or guide |
| The question is highly specific | The content is reference material |
| You want tight control over wording | You trust your existing content to be accurate |
| You’re starting from scratch | You already maintain a help centre |
You can mix them freely. The bot searches both at once and returns the best match, regardless of source type.
What’s not yet supported
- JavaScript-rendered pages (single-page apps that need JS execution to show content)
- Scheduled re-ingest — today, you re-ingest manually after content changes
- Multi-file batch upload — sitemap crawl covers the URL side; file batches are still one-at-a-time
- OCR for scanned PDFs beyond 50 pages — anything over the cap is truncated with a notice on the doc card
These are on the roadmap. For now, upload files one at a time, re-ingest manually after content changes, and split scanned PDFs longer than 50 pages before uploading.
Sitemap crawl (V2.1)
Shipped 2026-05. Lets you bring in a whole help centre or docs site in one go without pasting URLs one at a time.
How it works
- Knowledge Base tab → Sitemap crawl card. Paste any URL on the target site (e.g. `https://docs.example.com/getting-started`).
- Tenlo probes `/sitemap.xml` and `/robots.txt` for a sitemap and parses it (recursing one level into sitemap-index files).
- You get a preview list of every URL the sitemap exposes. Pages under the same section as the seed URL are preselected; URLs you’ve already ingested are flagged and disabled.
- Tick the pages you want, click Ingest selected. Each URL becomes its own `kb_documents` row and runs through the same fetch / parse / chunk / embed pipeline as a single URL.
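The discovery step above can be sketched with the Python standard library. This is an assumption-laden illustration, not Tenlo's implementation: the function names are hypothetical, and `fetch` is a caller-supplied callable (a dictionary lookup in the example) so the sketch runs without network access.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemaps_from_robots(robots_txt):
    """Return sitemap URLs declared on `Sitemap:` lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


def parse_sitemap(xml_text):
    """Return (kind, locs), where kind is 'index' or 'urlset'."""
    root = ET.fromstring(xml_text)
    kind = "index" if root.tag == SITEMAP_NS + "sitemapindex" else "urlset"
    locs = [el.text.strip() for el in root.iter(SITEMAP_NS + "loc") if el.text]
    return kind, locs


def collect_urls(xml_text, fetch, depth=1):
    """Collect page URLs, recursing at most `depth` levels into index files."""
    kind, locs = parse_sitemap(xml_text)
    if kind == "urlset":
        return locs
    if depth == 0:
        return []  # nested index beyond the allowed depth is ignored
    pages = []
    for child in locs:
        pages.extend(collect_urls(fetch(child), fetch, depth - 1))
    return pages


NS = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
INDEX = (f"<sitemapindex {NS}><sitemap>"
         "<loc>https://docs.example.com/sitemap-help.xml</loc>"
         "</sitemap></sitemapindex>")
HELP = (f"<urlset {NS}>"
        "<url><loc>https://docs.example.com/help/a</loc></url>"
        "<url><loc>https://docs.example.com/help/b</loc></url>"
        "</urlset>")
fake_site = {"https://docs.example.com/sitemap-help.xml": HELP}

pages = collect_urls(INDEX, fake_site.__getitem__)
print(pages)
```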
Constraints
- Same host only — sitemap entries on different hosts (CDN subdomains, partner domains) are filtered out
- Slot accounting — if a crawl would push you past the 100-doc cap, the UI tells you how many slots remain and caps the selection
- One operation against the rate limit regardless of how many pages were enqueued
- Atomic enqueue — if the queue rejects the batch, all just-inserted rows are rolled back so they don’t ghost-occupy the cap
- No sitemap found? Tenlo can’t discover one ~30% of the time (especially on hand-rolled marketing sites). Fall back to single-URL ingestion.
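Three of these constraints (same host, slot accounting, atomic enqueue) can be sketched in a few lines. Everything here is hypothetical: the `rows` list stands in for the `kb_documents` table, `enqueue` stands in for the job queue, and the real system presumably does this inside a database transaction.

```python
from urllib.parse import urlsplit

MAX_DOCS = 100


def same_host(urls, seed):
    """Keep only URLs whose host matches the seed URL's host exactly."""
    host = urlsplit(seed).hostname
    return [u for u in urls if urlsplit(u).hostname == host]


def cap_selection(selected, docs_used, max_docs=MAX_DOCS):
    """Trim the selection to the remaining slots; return (kept, remaining)."""
    remaining = max(0, max_docs - docs_used)
    return selected[:remaining], remaining


def enqueue_batch(urls, rows, enqueue):
    """Insert one row per URL, then enqueue; undo the inserts if enqueue fails."""
    start = len(rows)
    rows.extend(urls)
    try:
        enqueue(urls)
    except Exception:
        del rows[start:]  # roll back so rejected rows don't occupy slots
        raise


found = [
    "https://docs.example.com/help/a",
    "https://cdn.example.com/logo.png",
    "https://partner.example.org/page",
]
print(same_host(found, "https://docs.example.com/getting-started"))
print(cap_selection(["a", "b", "c", "d", "e"], docs_used=97))
```

With 97 of 100 slots used, a five-page selection is trimmed to three, which mirrors the UI capping the selection.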
Common patterns
- Help-centre import: paste your `/help` landing page → ingest the section in one click
- Product-docs migration: paste the docs root → tick the categories that map to support-ish questions, ignore developer-only pages
- Refresh: re-run the crawl periodically; already-ingested URLs show as such and stay unticked by default
After ingestion, use the Retrieval Inspector on the same Knowledge Base tab to probe how chunks are matching — see Brand voice & confidence threshold.