This repository contains the orgxtract package and CLI.
We develop with Pipenv. A guide for setting up a development environment is available in The Hitchhiker’s Guide to Python.
Running the project requires Python 3 and a package installer or manager that can handle pyproject.toml. The following installation steps use Pipenv as an example.
Install all dependencies in a virtual environment.
pipenv install -e .
Install the spaCy German model.
pipenv run python -m spacy download de_core_news_md
This is a basic example to extract data from the first page of a PDF.
from orgxtract import Document, TextPipeline
import orgxtract.pdf as pdf
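# Get the first page of the PDF.
# NOTE: the call around the path is missing in the source; pdf.extract is an assumed name.
drawing = next(pdf.extract("examples/orgchart.pdf"))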
document = Document.extract(drawing)
# The with statement is only necessary when using threads.
with TextPipeline() as text_pipeline:
    texts = document.text_contents.values()
    for content in text_pipeline.pipe(texts):
        print(content)  # placeholder body; the original loop body is missing
This package contains a command-line tool. It can be executed by running the package as a script.
The input can be either a PDF or a directory containing multiple PDFs.
pipenv run python -m orgxtract path/to/input -o path/to/output
Use --help to see all parameters.
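To illustrate what "either a PDF or a directory containing multiple PDFs" can mean in practice, here is a minimal sketch of how such an input path might be resolved. The helper `collect_pdfs` is hypothetical and not part of orgxtract. It also assumes that a directory is scanned non-recursively and that files are matched by their `.pdf` extension; the real CLI may behave differently.

```python
from pathlib import Path


def collect_pdfs(input_path):
    """Return the PDF files named by input_path.

    Hypothetical helper for illustration only; not part of orgxtract.
    """
    path = Path(input_path)
    if path.is_dir():
        # Directory: take every PDF directly inside it, in a stable order.
        return sorted(
            p for p in path.iterdir()
            if p.is_file() and p.suffix.lower() == ".pdf"
        )
    if path.is_file() and path.suffix.lower() == ".pdf":
        # Single file: wrap it in a list so callers can always iterate.
        return [path]
    raise ValueError(f"not a PDF or a directory: {input_path}")
```

Returning a list in both cases lets the calling code loop over the result without checking which kind of input it received.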
The package uses the Python logging module. Logging is enabled in the CLI and the level can be configured.
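When the package is used as a library rather than through the CLI, logging has to be set up by the calling code. The following is a minimal sketch that uses only the standard logging module. It assumes that orgxtract names its loggers after its modules, so that they all sit below a logger called "orgxtract"; this is a common convention but is not confirmed here.

```python
import logging

# Send log records to stderr; keep third-party libraries at WARNING.
logging.basicConfig(
    level=logging.WARNING,
    format="%(asctime)s %(name)s %(levelname)s: %(message)s",
)

# Assumption: the package's loggers live under the "orgxtract" name.
# Child loggers such as "orgxtract.pdf" inherit this level.
logging.getLogger("orgxtract").setLevel(logging.DEBUG)
```

Setting the level on the package logger rather than on the root logger keeps debug output from other dependencies out of the log.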
We used a subset of the data from these websites.
The code is licensed under the MIT license. For distribution, the licenses of the dependencies must be consulted.