Manipulating PDFs with iText in PowerShell
So, I can write a PDF...
In the last post I managed to generate an empty PDF with PowerShell using iText, after working through the dependencies and load order needed to run some .NET libraries in PowerShell. It's made me wonder whether there's a better way to include libraries in a PowerShell script, and whether there's an alternative to Install-Package that actually handles dependencies correctly.
However. The point of this whole thing was to automate the processing of some large collections of PDFs.
At work, we've got this... ancient electronic library. It's in some proprietary data format, so extracting the data isn't as straightforward as we'd hoped it would be. What we are able to do, however, is print documents to PDF. Provided the PDFs aren't encrypted or secured (they aren't), I should be able to automate this.
First thought was to mimic keyboard input with something like AutoHotKey, and step through the motions of reading some control's title, going through the menu motions to export a single document to PDF, and repeating until it's all exported. While this works, there are 10,000+ documents in this library, and keyboard/mouse automation is fragile at best.
Second approach was to extract the content from the PDF with some sort of automation. iText seems to be a current-ish library to handle this, which led down the trail of the last post.
How about reading them?
First step - can I open a PDF and get a page count? Maybe get some content?
Turns out this was... surprisingly easy:
    function main() {
        $pdfFile = (Join-Path $PSScriptRoot ".\input\Test 1.pdf")

        $pdfReader = [iText.Kernel.Pdf.PdfReader]::new($pdfFile)
        [iText.Kernel.Pdf.PdfDocument]$pdfDocument = [iText.Kernel.Pdf.PdfDocument]::new($pdfReader)

        # Total page count
        $totalPages = $pdfDocument.GetNumberOfPages()
        Write-Output "Number of pages: $($totalPages)"

        # Iterate pages (iText page numbers start at 1)
        for ($page = 1; $page -le $totalPages; $page++) {
            # Use a fresh extraction strategy for each page
            $strategy = [iText.Kernel.Pdf.Canvas.Parser.Listener.SimpleTextExtractionStrategy]::new()

            $pageContent = [iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor]::GetTextFromPage($pdfDocument.GetPage($page), $strategy)

            Write-Output $pageContent
        }

        # Close the document once finished with it, releasing the input file
        $pdfDocument.Close()
    }
So, that was easy. What's printed out is the text content of the PDF. Images are omitted, but for my application this is OK; right now I'm only trying to read the PDFs.
A few things of immediate note:
- Each PDF contains one or more documents from the library.
- The first line of content tells me what document it is, and this changes as the document within the PDF changes.
- This makes finding the start and end of a document "easy", since I can watch for changes on the first line.
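The boundary detection described in the list above could be sketched roughly as follows. This is not the logic actually used here (that part is skipped in this post); Get-SectionLimits and its $PageTexts parameter are hypothetical names, and the sketch assumes the text of each page has already been extracted into an array, one string per page, in page order. It compares the raw first line rather than a regex-extracted document number.

```powershell
# Hypothetical helper (not from the original script): groups consecutive
# pages that share the same first line into sections.
# ASSUMPTION: $PageTexts holds one extracted text string per page, in order.
function Get-SectionLimits {
    param(
        [string[]] $PageTexts
    )

    $sections = [System.Collections.Generic.List[object]]::new()
    $current = $null

    for ($i = 0; $i -lt $PageTexts.Count; $i++) {
        $pageNumber = $i + 1   # page numbers are 1-based

        # First non-empty line of the page
        $header = $PageTexts[$i] -split '\r?\n' |
            Where-Object { $_.Trim() } |
            Select-Object -First 1
        if ($null -eq $header) { $header = '' }
        $header = $header.Trim()

        if ($null -eq $current -or $current.PageHeader -ne $header) {
            # First line changed: a new document starts on this page
            $current = [pscustomobject]@{
                PageHeader = $header
                StartPage  = $pageNumber
                EndPage    = $pageNumber
            }
            $sections.Add($current)
        }
        else {
            # Same first line: the current document continues
            $current.EndPage = $pageNumber
        }
    }

    return $sections.ToArray()
}
```

Wrapping the call in @( ), as in @(Get-SectionLimits -PageTexts $texts), keeps the result an array even when the PDF holds a single document. Each returned object carries PageHeader, StartPage and EndPage, which is the shape the splitting loop further down reads from.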
My objective is to split the PDF up into multiple PDFs, each containing one document from the library. iText's API supports this out of the box, so it's a matter of implementing the API correctly.
Stack Overflow gives some really great direction here (and by direction, I mean, a solution) which just needs implementing in PowerShell. Newer versions of PowerShell support classes (for more than just DSC) including inheritance, so adapting the posted solution to PowerShell and my application was straightforward:
    class CustomSplitter : iText.Kernel.Utils.PdfSplitter {
        [string]$_destination

        CustomSplitter([iText.Kernel.Pdf.PdfDocument] $pdfDocument, [string] $destination) : base($pdfDocument) {
            $this._destination = $destination
        }

        [iText.Kernel.Pdf.PdfWriter] GetNextPdfWriter([iText.Kernel.Utils.PageRange] $documentPageRange) {
            return [iText.Kernel.Pdf.PdfWriter]::new($this._destination)
        }
    }
Instead of taking an output directory as was done in the Stack Overflow solution, I'm outright accepting a destination filename.
Skipping the logic for finding the first and last page of a document within the PDF, as well as the document title (this comes from a regex, and is actually the document number, which was much easier to test for), splitting the PDF up was "as easy" as:
    $outputDir = Join-Path $PSScriptRoot "output"

    foreach ($section in $newPageLimits) {
        $fileTitle = $section.PageHeader
        $fileName = Join-Path $outputDir "$($fileTitle).pdf"
        $pageRangeString = "$($section.StartPage)-$($section.EndPage)"

        $splitter = [CustomSplitter]::new($pdfDocument, $fileName)
        $pageRange = [iText.Kernel.Utils.PageRange]::new($pageRangeString)
        $splitDocs = $splitter.ExtractPageRange($pageRange)

        foreach ($doc in $splitDocs) {
            $doc.Close()
        }
    }
$newPageLimits is an array of objects with three members: StartPage, EndPage and PageHeader (which becomes the file title).
Does it work?
Oh does it ever, and it's fast too!
Gotchas?
I have no idea why, but when I run the PS script on its own, it literally does not load the DLLs that are defined at the top of the script. It just skips them. The result is that the custom class I wrote is also not loaded, and the whole thing falls apart.
If I manually load the DLLs outside of the script (or using F8 in VSCode), then run the script, it's successful.
For the sake of this application, I basically... don't care... but it's certainly bizarre.
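One plausible explanation, not confirmed against this script: PowerShell resolves the types named in a class definition when the script is parsed, before any statement in it runs. A script that declares CustomSplitter with iText.Kernel.Utils.PdfSplitter as its base class would then fail to parse unless the iText assemblies are already loaded in the session, which would match the symptoms above. If that is the cause, a small launcher that loads the DLLs first and only then runs the main script should get around it. The sketch below assumes exactly that; Invoke-ScriptAfterLoading and both parameter names are hypothetical, and the DLL paths and their order are left to the caller.

```powershell
# Hypothetical launcher function (all names here are assumptions).
# Loads assemblies first, then runs a script whose classes depend on them.
function Invoke-ScriptAfterLoading {
    param(
        # Assembly paths, in the order they need to be loaded
        [string[]] $AssemblyPaths,

        # The script that declares classes deriving from types in those assemblies
        [Parameter(Mandatory)]
        [string] $ScriptPath
    )

    foreach ($path in $AssemblyPaths) {
        Add-Type -Path $path
    }

    # The target script is only parsed at this point, after the loads above
    & $ScriptPath
}
```

This amounts to what loading the DLLs by hand and then running the script already does, just wrapped up so it can be repeated.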