Full-Text Sources
Use retrieval.full_text_sources when your full text already exists on disk or when PubGet is only part of the retrieval path.
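As a minimal sketch, assuming the dotted name retrieval.full_text_sources means a full_text_sources list nested under a top-level retrieval key, the fragments on this page would sit in a configuration file roughly as follows. The nesting is inferred from the dotted name and is not confirmed here; the values are reused from the examples on this page.

```yaml
# Assumption: "retrieval.full_text_sources" denotes this nesting.
retrieval:
  full_text_sources:
    - root_path: "/data/fulltexts"
      pmid_source: "file_name"
      allowed_extensions: [".txt", ".html"]
```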
Purpose
Each item in full_text_sources maps PMIDs to local files and, optionally, to associated coordinate/table data.
Common Fields
- root_path: Root directory containing the content source
- pmid_source: One of file_name, folder_name, json
- text_path_templates: Relative file templates used to locate full text for folder_name and json sources
- allowed_extensions: Allowed file extensions when pmid_source is file_name
- json_filename: Metadata filename for json mode
- json_pmid_key: Key in the JSON file that contains the PMID
- processed_data_path: Path to pubget-like processed coordinate/table CSVs
- coordinates_path_templates: Relative coordinate-file templates when coordinates are stored near each source item
Source Patterns
pmid_source: file_name
Use this when files are named by PMID:
full_text_sources:
  - root_path: "/data/fulltexts"
    pmid_source: "file_name"
    allowed_extensions: [".txt", ".html"]
pmid_source: folder_name
Use this when each publication is stored in a PMID-named directory:
full_text_sources:
  - root_path: "/data/fulltexts"
    pmid_source: "folder_name"
    text_path_templates:
      - "fulltext.txt"
      - "text.txt"
pmid_source: json
Use this when each publication directory contains metadata describing the PMID:
full_text_sources:
  - root_path: "/data/fulltexts"
    pmid_source: "json"
    json_filename: "identifiers.json"
    json_pmid_key: "pmid"
    text_path_templates:
      - "processed/pubget/text.txt"
      - "text.txt"
Coordinates and Tables
You have two ways to attach coordinate/table context:
- processed_data_path: Use a directory containing pubget-like processed CSV outputs such as coordinates.csv and tables.csv.
- coordinates_path_templates: Use direct coordinate-file templates inside each source item.
Do not use both in the same source entry. Current validation treats them as mutually exclusive.
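The two options might look like this in practice. This is a sketch under stated assumptions: the paths "/data/processed" and "/data/publications" and the "coordinates.csv" template are hypothetical, and coordinates_path_templates is assumed to take a list in the same way text_path_templates does. Each entry sets only one of the two fields, in line with the mutual-exclusivity rule above.

```yaml
full_text_sources:
  # Entry 1: coordinate/table data comes from one shared directory
  # of pubget-like CSVs (coordinates.csv, tables.csv).
  - root_path: "/data/fulltexts"
    pmid_source: "file_name"
    allowed_extensions: [".txt", ".html"]
    processed_data_path: "/data/processed"   # hypothetical path

  # Entry 2: coordinate files are stored next to each publication's text.
  - root_path: "/data/publications"          # hypothetical path
    pmid_source: "folder_name"
    text_path_templates:
      - "text.txt"
    coordinates_path_templates:              # assumed to be a list of templates
      - "coordinates.csv"                    # hypothetical template
```

Listing both kinds of entry in one full_text_sources list keeps each source self-describing while no single entry mixes the two mechanisms.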
When to Use This Feature
- you already scraped or licensed the full text separately
- you want Autonima to screen a local corpus
- you want to attach coordinate/table data from a custom preprocessing pipeline