tika

Apache Tika - a content detection and extraction toolkit that parses metadata and text from over 1400 file types through a single interface. Written in Java, but usable from any language via its REST server and CLI. Does not yet support GPS or BigTIFF formats.
apache, parsing, content-extraction
pdfbox, poi, kreuzberg
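
The REST access mentioned in the description could look roughly like the sketch below. It assumes a Tika server is already running locally on its default port 9998 and uses the server's `/tika` (plain text) and `/meta` (metadata) endpoints. The helper names `build_tika_request`, `extract_text`, and `extract_metadata` are hypothetical and exist only for this illustration.

```python
import json
import urllib.request

# Assumption: a Tika server is running locally on its default port.
TIKA_URL = "http://localhost:9998"


def build_tika_request(data, endpoint="tika", accept="text/plain", base_url=TIKA_URL):
    """Build (but do not send) a PUT request carrying the raw file bytes."""
    return urllib.request.Request(
        f"{base_url}/{endpoint}",
        data=data,
        method="PUT",
        headers={"Accept": accept},
    )


def extract_text(path):
    """Send a file to /tika and return the extracted plain text."""
    with open(path, "rb") as f:
        request = build_tika_request(f.read())
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def extract_metadata(path):
    """Send a file to /meta and return its metadata as a dict."""
    with open(path, "rb") as f:
        request = build_tika_request(f.read(), endpoint="meta", accept="application/json")
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running, `extract_text("report.pdf")` would return the document body as a string, where `report.pdf` stands in for any local file. Building the request separately from sending it keeps the HTTP details easy to inspect without a live server.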