Skip to content

This repo is for a student project concerning the conversion from organigrams to a machine readable format.

License

Notifications You must be signed in to change notification settings

FDS-HTW-2024/fds_orgchart

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Extraction of Structure of German Organizational Charts

This repository contains the orgxtract package and CLI.

Development

We develop with Pipenv. A guide for setting up a dev environment is on the The Hitchhiker’s Guide to Python.

Installation

To run the project Python 3 and a package installer/manager must be installed that can handle pyproject.toml. The following installation steps use Pipenv as example.

Install all dependencies in a virtual environment.

pipenv install -e .

Install the spacy German model.

pipenv run python -m spacy download de_core_news_md

Usage

This is a basic example to extract data from the first page of a PDF.

from orgxtract import Document, TextPipeline
import orgxtract.pdf as pdf

# Return the first page of the PDF
drawing = next(pdf.open("examples/orgchart.pdf"))
document = Document.extract(drawing)

# The with statement is only necessary when using threads.
with TextPipeline() as text_pipeline:
	texts = document.text_contents.values()

	for content in text_pipeline.pipe(texts):
		print(content)

CLI

This package contains a command line tool. It can be executed by running it as script.

The input can be either a PDF or a directory containing multiple PDFs.

pipenv run python -m orgxtract path/to/input -o path/to/output

Use --help to see all parameters

Logging

The package does use the Python logging module. It is enabled in the CLI and the level can be configured.

Data Set

We used a subset from the data from these websites.

License

The code is licensed under the MIT license. For distribution, the licenses of the dependencies must be consulted.

About

This repo is for a student project concerning the conversion from organigrams to a machine readable format.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •  

Languages