losalamos.ingestion

Automated ingestion utilities for bibliographic files.

This module provides tools for processing incoming collections of PDF and BibTeX files and converting them into standardized reference notes. The ingestion pipeline detects PDF–BibTeX pairs, extracts and normalizes metadata, generates unique filenames, creates Markdown reference notes, and copies the associated PDFs into a managed library structure.

Quick start

The ingestion system converts incoming PDF and BibTeX files into standardized reference notes and library assets.

Run the ingestion pipeline:

from losalamos.ingestion import Ingester

ing = Ingester(
    src="path/to/incoming/article",
    dst="path/to/library/papers"
)

ing.run()

Calling run() will:

  • explode multi-entry .bib files if needed

  • pair PDF files with their corresponding BibTeX metadata

  • generate standardized filenames

  • create Markdown reference notes

  • copy PDFs into the destination library
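The pairing step can be illustrated with a minimal sketch. It assumes that a PDF and its BibTeX file share the same filename stem; the page does not state the actual matching rule, and `pair_by_stem` is a hypothetical helper, not part of losalamos.

```python
from pathlib import Path


def pair_by_stem(pdfs, bibs):
    """Pair each PDF with the .bib file that shares its filename stem.

    Hypothetical helper: matching on the stem is an assumption, not
    necessarily the rule Ingester applies.
    """
    bib_by_stem = {Path(b).stem: Path(b) for b in bibs}
    pairs = []
    for pdf in map(Path, pdfs):
        bib = bib_by_stem.get(pdf.stem)
        if bib is not None:  # PDFs without a matching .bib are skipped
            pairs.append((pdf, bib))
    return pairs


pairs = pair_by_stem(
    [
        "incoming/article/hydrology/smith2020.pdf",
        "incoming/article/hydrology/orphan.pdf",
    ],
    ["incoming/article/hydrology/smith2020.bib"],
)
```

In this sketch an unpaired PDF such as `orphan.pdf` is left out of the result rather than raising an error.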

Classes

Ingester(src, dst)

class losalamos.ingestion.Ingester(src, dst)

Bases: object

run(cleanup=True)

Executes the processing pipeline to convert incoming PDF and BibTeX files into standardized reference notes.

Note

This method orchestrates the full ingestion workflow. It explodes multi-entry BibTeX files, identifies PDF and .bib pairs, and generates unique filenames. For each valid pair, it instantiates a NoteReference, populates it with standardized metadata (including subject tagging), saves the resulting Markdown file, and copies the PDF to the destination library. If cleanup is enabled, the original source files are deleted after successful processing.

Parameters:

cleanup (bool) – Determines whether to delete the source BibTeX and PDF files after processing. Default value = True

Returns:

No value is returned.

Return type:

None

explode_bib_files()

Splits multi-entry BibTeX files into individual .bib files named after each reference.

Returns:

No value is returned.

Return type:

None
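The splitting step might look like the sketch below, written with the standard library only. It assumes each output file is named after the entry's citation key; `split_bib_entries` and `explode` are hypothetical names, and the brace matching is deliberately simple (it does not handle escaped braces).

```python
import re
import tempfile
from pathlib import Path


def split_bib_entries(text):
    """Return (citation_key, entry_text) tuples for each entry in a BibTeX string.

    Hypothetical helper: balances braces by hand and skips
    @comment, @string and @preamble blocks.
    """
    entries = []
    for match in re.finditer(r"@(\w+)\s*\{\s*([^,\s]+)\s*,", text):
        if match.group(1).lower() in {"comment", "string", "preamble"}:
            continue
        depth = 0
        for pos in range(match.start(), len(text)):
            if text[pos] == "{":
                depth += 1
            elif text[pos] == "}":
                depth -= 1
                if depth == 0:  # closing brace of the whole entry
                    entries.append((match.group(2), text[match.start():pos + 1]))
                    break
    return entries


def explode(bib_path, out_dir):
    """Write one <citation_key>.bib file per entry and return the new paths."""
    out_dir = Path(out_dir)
    written = []
    text = Path(bib_path).read_text(encoding="utf-8")
    for key, entry in split_bib_entries(text):
        target = out_dir / f"{key}.bib"
        target.write_text(entry + "\n", encoding="utf-8")
        written.append(target)
    return written


sample = """@article{smith2020,
  title = {Flood {R}outing},
  year = {2020},
}
@book{jones2018,
  title = {Hydrology},
  year = {2018},
}
"""

with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "refs.bib"
    source.write_text(sample, encoding="utf-8")
    names = sorted(p.name for p in explode(source, tmp))
```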

get_incoming_files()

Retrieves a DataFrame mapping all PDF files in the source folder to their respective subjects.

Returns:

A DataFrame with columns file and subject.

Return type:

pandas.DataFrame

get_reference_object()

Identifies the entry type from the source folder name and returns the corresponding reference class instance.

Returns:

An instance of the reference class associated with the folder’s name.

Return type:

Reference
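One common way to map a folder name to a class is a small registry, sketched below. The class names, the `REGISTRY` dictionary, and the fallback to the base class are all assumptions made for illustration; the real Reference hierarchy is not described on this page.

```python
from pathlib import Path


class Reference:
    """Hypothetical stand-in for the library's Reference base class."""
    entry_type = "misc"


class ArticleReference(Reference):
    entry_type = "article"


class BookReference(Reference):
    entry_type = "book"


# Hypothetical registry keyed by the source folder's name.
REGISTRY = {"article": ArticleReference, "book": BookReference}


def reference_for_folder(src):
    """Instantiate the class registered for the last component of src."""
    folder = Path(src).name.lower()
    # Falling back to the base class is an assumption of this sketch.
    return REGISTRY.get(folder, Reference)()


ref = reference_for_folder("path/to/incoming/article")
```

With the Quick start layout, a source folder ending in `article` would yield an `ArticleReference` in this sketch.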

list_files(extension='*')

Lists all files within the source directory and subdirectories that match a specific extension.

Parameters:

extension (str) – The file extension to filter by. Default value = *

Returns:

A list of paths to the discovered files.

Return type:

list[pathlib.Path]
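A recursive listing of this kind can be sketched with `pathlib.Path.rglob`. The standalone function below is an illustration under the assumption that the extension is passed without a leading dot; it is not the method's actual implementation.

```python
import tempfile
from pathlib import Path


def list_files(src, extension="*"):
    """Recursively collect files under src whose suffix matches extension."""
    # Note: the pattern "*.*" skips files that have no extension at all.
    return sorted(p for p in Path(src).rglob(f"*.{extension}") if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    subject = Path(tmp) / "hydrology"
    subject.mkdir()
    (subject / "smith2020.pdf").write_bytes(b"")
    (subject / "smith2020.bib").write_text("", encoding="utf-8")
    pdfs = [p.name for p in list_files(tmp, "pdf")]
    everything = [p.name for p in list_files(tmp)]
```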

list_bibs()

Recursively lists all BibTeX (.bib) files found within the source directory.

Returns:

A list of paths to the discovered BibTeX files.

Return type:

list[pathlib.Path]

list_pdfs()

Recursively lists all PDF files found within the source directory.

Returns:

A list of paths to the discovered PDF files.

Return type:

list[pathlib.Path]

list_subjects(list_paths)

Extracts the parent directory names for a list of file paths to serve as subject labels.

Parameters:

list_paths (list[pathlib.Path]) – A list of file paths to analyze for subject extraction.

Returns:

A list of strings containing the stem of each file’s parent directory.

Return type:

list[str]
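As a minimal sketch of the behavior described above, the stem of each parent directory can be read directly from `pathlib`. The paths are made-up examples and the function is a standalone illustration, not the method itself.

```python
from pathlib import Path


def list_subjects(list_paths):
    """Return the stem of each path's parent directory."""
    return [Path(p).parent.stem for p in list_paths]


subjects = list_subjects([
    Path("incoming/article/hydrology/smith2020.pdf"),
    Path("incoming/article/remote_sensing/jones2018.pdf"),
])
```

Because `stem` drops a trailing suffix, a folder named `v1.2` would be reported as `v1` in this sketch.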

static parse_pdf_metadata(file_parse)

Extracts standard metadata fields from a PDF file using the pymupdf library.

Parameters:

file_parse (str | pathlib.Path) – The path to the PDF file to be processed.

Returns:

A dictionary containing extracted metadata such as DOI, title, author, and dates.

Return type:

dict
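The page does not say where the DOI comes from, and the standard PDF information dictionary has no dedicated DOI field. One plausible approach, sketched here as an assumption, is to search a text field for a DOI-like pattern. `find_doi` and `DOI_PATTERN` are hypothetical names; the pattern is modeled on the one Crossref has published for modern DOIs.

```python
import re

# DOI-like pattern (assumption of this sketch, modeled on Crossref's guidance).
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def find_doi(text):
    """Return the first DOI-like string in text, or None if there is none."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Trailing sentence punctuation is allowed by the pattern, so trim it.
    return match.group(0).rstrip(".,;")


doi = find_doi("Available at https://doi.org/10.1000/xyz123. Received 2020.")
```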

static parse_pdf_pages(file_parse: str, n_pages: int = 3) → str

Reads and concatenates text content from the first few pages of a PDF document.

Parameters:
  • file_parse (str) – The path to the PDF file to be read.

  • n_pages (int) – The maximum number of initial pages to extract text from. Default value = 3

Returns:

A single string containing the combined text of the extracted pages.

Return type:

str

static parse_bib(file_parse)

Parses a BibTeX file and returns the metadata of the first entry found.

Parameters:

file_parse (str | pathlib.Path) – The path to the BibTeX file to be parsed.

Returns:

A dictionary representing the first bibliographic entry in the file.

Return type:

dict
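To make the "first entry only" behavior concrete, here is a small standard-library sketch. It is not the library's parser: `parse_first_entry` is a hypothetical helper, the key names `entry_type` and `key` are assumptions, and only simple `name = {value}` or `name = "value"` fields without nested braces are handled.

```python
import re


def parse_first_entry(text):
    """Parse the first entry of a BibTeX string into a flat dict."""
    head = re.search(r"@(\w+)\s*\{\s*([^,\s]+)\s*,", text)
    if head is None:
        return {}
    entry = {"entry_type": head.group(1).lower(), "key": head.group(2)}
    # Stop at the next entry, assumed to start with "@" on a new line.
    nxt = text.find("\n@", head.end())
    body = text[head.end():nxt if nxt != -1 else len(text)]
    for name, braced, quoted in re.findall(
        r'(\w+)\s*=\s*(?:\{([^{}]*)\}|"([^"]*)")', body
    ):
        entry[name.lower()] = braced or quoted
    return entry


sample = """@article{smith2020,
  title = {Flood routing},
  author = "Smith, J.",
  year = {2020},
}
@book{jones2018,
  title = {Hydrology},
}
"""

entry = parse_first_entry(sample)
```

Later entries in the same file, such as `jones2018` here, are ignored, which mirrors the description above.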