Scripts Directory

This directory contains scripts for the amilib project, organized by functionality.

Scripts for creating, managing, and analyzing document corpora.

ingest_breward2023.py - Ingest PDF files into AmiCorpus structure
extract_phrases_keybert_breward2023.py - Extract phrases using KeyBERT analysis
extract_phrases_breward2023.py - Extract phrases using TF-IDF analysis
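
To give a feel for what TF-IDF analysis does, here is a minimal standard-library sketch. It is not the code in extract_phrases_breward2023.py: the function names (tokenize, tfidf_scores, top_terms) are hypothetical, it scores single words rather than multi-word phrases, and it assumes the plain tf * log(N / df) weighting, which may differ from what the script uses.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-letter characters (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_scores(documents):
    """Return one {term: score} dict per document.

    tf is the term count divided by document length; idf is log(N / df),
    where df is the number of documents that contain the term.
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: count each term once per document.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def top_terms(documents, k=3):
    """Return the k highest-scoring terms per document (ties broken alphabetically)."""
    return [
        [term for term, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for s in tfidf_scores(documents)
    ]
```

Terms that occur in every document get a score of zero, which is why common words drop out of the ranking without an explicit stop-word list.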

Scripts for adding hyperlinks and annotations to PDFs and HTML documents.

markup_climate_pdfs_with_glossary.py - Add IPCC glossary links to climate PDFs
markup_ipcc_executive_summary.py - Add glossary links to IPCC HTML documents
process_biorxiv_pdfs_simple.py - Simplified PDF annotation (alternative)
process_remaining_biorxiv.py - Background processing of bioRxiv PDFs
create_flat_glossary.py - Convert HTML glossary to flat text format
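
As an illustration of the HTML-to-flat-text conversion, the sketch below turns a glossary into one tab-separated "term, definition" line per entry. It is not taken from create_flat_glossary.py: the GlossaryParser class and flatten_glossary function are hypothetical names, and the sketch assumes the glossary is marked up as `<dt>`/`<dd>` pairs, which may not match the real IPCC glossary HTML.

```python
from html.parser import HTMLParser


class GlossaryParser(HTMLParser):
    """Collect (term, definition) pairs from <dt>/<dd> elements (assumed markup)."""

    def __init__(self):
        super().__init__()
        self.entries = []      # finished (term, definition) pairs
        self._current = None   # "dt", "dd", or None when outside both
        self._term = []
        self._definition = []

    def handle_starttag(self, tag, attrs):
        if tag == "dt":
            self._flush()      # a new term closes the previous entry
            self._current = "dt"
        elif tag == "dd":
            self._current = "dd"

    def handle_endtag(self, tag):
        if tag in ("dt", "dd"):
            self._current = None
        elif tag == "dl":
            self._flush()

    def handle_data(self, data):
        # Text inside nested inline tags (e.g. <em>) is still collected.
        if self._current == "dt":
            self._term.append(data)
        elif self._current == "dd":
            self._definition.append(data)

    def _flush(self):
        # Join the collected pieces and collapse runs of whitespace.
        term = " ".join("".join(self._term).split())
        definition = " ".join("".join(self._definition).split())
        if term:
            self.entries.append((term, definition))
        self._term, self._definition = [], []

    def close(self):
        super().close()
        self._flush()


def flatten_glossary(html):
    """Return one 'term<TAB>definition' line per glossary entry."""
    parser = GlossaryParser()
    parser.feed(html)
    parser.close()
    return "\n".join(f"{term}\t{definition}" for term, definition in parser.entries)
```

A flat format like this is easy to load into a lookup table when adding glossary links to other documents.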

Utility scripts for system maintenance and file operations.

Each subdirectory contains its own README with specific usage instructions for the scripts in that category.

All scripts should follow the established style guide.
