losalamos.ingestion
Automated ingestion utilities for bibliographic files.
This module provides tools for processing incoming collections of PDF and BibTeX files and converting them into standardized reference notes. The ingestion pipeline detects PDF–BibTeX pairs, extracts and normalizes metadata, generates unique filenames, creates Markdown reference notes, and copies the associated PDFs into a managed library structure.
Quick start
The ingestion system converts incoming PDF and BibTeX files into standardized reference notes and library assets.
Run the ingestion pipeline
from losalamos.ingestion import Ingester
ing = Ingester(
    src="path/to/incoming/article",
    dst="path/to/library/papers",
)
ing.run()
This command will:
explode multi-entry .bib files if needed
pair PDF files with their corresponding BibTeX metadata
generate standardized filenames
create Markdown reference notes
copy PDFs into the destination library
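The pairing step listed above can be pictured as matching files by name. The following is a minimal sketch, assuming that a PDF and its BibTeX file share the same file stem; `pair_by_stem` is a hypothetical helper written for illustration and is not part of losalamos.

```python
from pathlib import Path


def pair_by_stem(pdfs, bibs):
    """Pair PDF and BibTeX paths that share the same file stem.

    Hypothetical helper for illustration; not part of losalamos.
    """
    # Index the BibTeX files by stem for constant-time lookup.
    bib_by_stem = {Path(b).stem: Path(b) for b in bibs}
    pairs = []
    for pdf in map(Path, pdfs):
        bib = bib_by_stem.get(pdf.stem)
        if bib is not None:  # keep only PDFs that have a matching .bib
            pairs.append((pdf, bib))
    return pairs
```

Files without a counterpart are simply left out of the result, which mirrors the idea that only valid pairs are processed.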
Classes
- class losalamos.ingestion.Ingester(src, dst)
Bases: object
- run(cleanup=True)
Executes the processing pipeline that converts incoming PDF and BibTeX files into standardized reference notes.
Note
This method orchestrates the full ingestion workflow: it explodes multi-entry BibTeX files, identifies pairs of PDF and .bib files, and generates unique filenames. For each valid pair, it instantiates a NoteReference, populates it with standardized metadata (including subject tagging), saves the resulting Markdown file, and copies the PDF to the destination library. If cleanup is enabled, the original source files are deleted after successful processing.
- Parameters:
cleanup (bool) – Determines whether to delete the source BibTeX and PDF files after processing. Default value = True
- Returns:
No value is returned.
- Return type:
None
- explode_bib_files()
Splits multi-entry BibTeX files into individual .bib files named after each reference.
- Returns:
No value is returned.
- Return type:
None
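To illustrate what splitting a multi-entry BibTeX file involves, here is a naive sketch. It assumes that each entry starts at the beginning of a line and that the citation key is what each output file is named after; `split_bib_entries` is a hypothetical function, not the implementation used by `explode_bib_files()`.

```python
import re

# Matches the start of an entry such as "@article{doe2020," and captures
# the entry type and the citation key.
ENTRY_START = re.compile(r"^@(\w+)\s*\{\s*([^,\s]+)\s*,", re.MULTILINE)


def split_bib_entries(text):
    """Map each citation key to the text of its BibTeX entry.

    Naive sketch: assumes every entry starts at the beginning of a line
    and that "@" never starts a line inside a field value.
    """
    matches = list(ENTRY_START.finditer(text))
    entries = {}
    for i, m in enumerate(matches):
        # An entry runs until the next entry starts, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        entries[m.group(2)] = text[m.start():end].strip()
    return entries
```

Each value could then be written to a file such as `<key>.bib`. A dedicated BibTeX parser would be more robust for entries such as `@string` or `@comment`, which this sketch does not treat as separate entries.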
- get_incoming_files()
Retrieves a DataFrame mapping all PDF files in the source folder to their respective subjects.
- Returns:
A DataFrame with columns file and subject.
- Return type:
pandas.DataFrame
- get_reference_object()
Identifies the entry type from the source folder name and returns the corresponding reference class instance.
- Returns:
An instance of the reference class associated with the folder’s name.
- Return type:
Reference
- list_files(extension='*')
Lists all files within the source directory and subdirectories that match a specific extension.
- Parameters:
extension (str) – The file extension to filter by. Default value = *
- Returns:
A list of paths to the discovered files.
- Return type:
list[pathlib.Path]
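A recursive listing of this kind can be sketched with `pathlib.Path.rglob`. The function below is an illustration only: `list_files_sketch` is a hypothetical name, and it assumes the extension is given without a leading dot, which the documentation above does not specify.

```python
from pathlib import Path


def list_files_sketch(src, extension="*"):
    """Recursively list files under src that match the given extension.

    Assumption: extension is passed without a leading dot (e.g. "pdf").
    """
    # "*" matches every file; otherwise match on the file suffix.
    pattern = "*" if extension == "*" else f"*.{extension}"
    # rglob also yields directories, so keep regular files only.
    return [p for p in Path(src).rglob(pattern) if p.is_file()]
```

Under the same assumption, `list_bibs()` and `list_pdfs()` would correspond to calling this with `"bib"` and `"pdf"`.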
- list_bibs()
Recursively lists all BibTeX (.bib) files found within the source directory.
- Returns:
A list of paths to the discovered BibTeX files.
- Return type:
list[pathlib.Path]
- list_pdfs()
Recursively lists all PDF files found within the source directory.
- Returns:
A list of paths to the discovered PDF files.
- Return type:
list[pathlib.Path]
- list_subjects(list_paths)
Extracts the parent directory names for a list of file paths to serve as subject labels.
- Parameters:
list_paths (list[pathlib.Path]) – A list of file paths to analyze for subject extraction.
- Returns:
A list of strings containing the stem of each file’s parent directory.
- Return type:
list[str]
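The subject extraction described above amounts to reading the name of each file's parent folder. A minimal sketch, with `subjects_from_paths` as a hypothetical stand-in for the method:

```python
from pathlib import Path


def subjects_from_paths(list_paths):
    """Return the stem of each path's parent directory as a subject label."""
    return [Path(p).parent.stem for p in list_paths]
```

For example, a file stored at `incoming/article/hydrology/a.pdf` would get the subject `hydrology`. Because the stem is used rather than the full name, a folder called `v1.2` would yield `v1`.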
- static parse_pdf_metadata(file_parse)
Extracts standard metadata fields from a PDF file using the pymupdf library.
- Parameters:
file_parse (str | pathlib.Path) – The path to the PDF file to be processed.
- Returns:
A dictionary containing extracted metadata such as DOI, title, author, and dates.
- Return type:
dict
- static parse_pdf_pages(file_parse: str, n_pages: int = 3) -> str
Reads and concatenates text content from the first few pages of a PDF document.
- Parameters:
file_parse (str) – The path to the PDF file to be read.
n_pages (int) – The maximum number of initial pages to extract text from. Default value = 3
- Returns:
A single string containing the combined text of the extracted pages.
- Return type:
str