Extract Text from PCL with Searchable Text and Indexes

Document archiving requirements often include full text extraction for fully-searchable PDFs and extracted data to use with content aggregation applications.

PageTech is the leading software development company that provides custom, extensible and fast solutions for applications requiring the manipulation, optimization, transformation or re-purposing of complex PCL print streams.

What is Complex PCL?

There are hundreds of PCL parsers that can extract text from very simple PCL. The question is… do you have very simple PCL? If not, then PCL Tool SDK is the only product that can extract text from complex and problematic PCL.

PCL Tool SDK can extract text from applications that generate legacy or complex PCL print streams. It can deconstruct old-style, mainframe-generated bank statements that contain cancelled check images into text and an individual TIFF image of each cancelled check. With the disassembled statement text and image objects, you can use output management/mail optimization software to:

  • Re-design the print stream for an updated look or to re-engineer into TransPromo documents
  • Add Intelligent Mail Barcodes (IMB), DataMatrix 2-D Barcodes or other barcodes
  • Apply address corrections
  • Apply OMR Marks for inserters, folders, etc.

PageTech .TNX File Format

Our Text Object Extraction (*.TNX) file format is not just text parsed from a PCL file. It contains all of the text objects found on the internal logical display list just prior to imaging each page in the interpreter. A .TNX file can be generated as a by-product of the conversion process by one of the following methods:

  • Use our sample TNXDemo.tpt script located in the sample script folder.
  • Use the Convert -> Extract Text function in either PCLTool.exe or PCLWorks.exe to extract all the text information available in the file.

We include TNXDumpG.exe to extract and format the text through a GUI, and TNXDump.exe for console operations. You can also call TNXDump.dll directly.

Not only do we capture the text, we capture the absolute positioning, current symbol set, font and font metrics used in the file. All this information is written to our proprietary .TNX format. We encourage everyone to test our many text extraction solutions by downloading the PCL Tool SDK Live Evaluation and reviewing the Text Extraction Methods page for examples.
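To illustrate why the absolute positioning matters, here is a minimal Python sketch that rebuilds reading order from positioned text objects. It is not the .TNX record layout, which is proprietary and not described here. The `TextObject` class, its field names, the units and the sample values are all hypothetical, chosen only to mirror the kinds of information listed above (position, symbol set, font, text).

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class TextObject:
    """Hypothetical positioned text run; NOT the actual .TNX record layout."""
    page: int
    x: int            # horizontal position, in assumed device units
    y: int            # vertical (baseline) position, in assumed device units
    symbol_set: str   # symbol set identifier, e.g. "8U"
    font: str
    text: str


def lines_in_reading_order(objects):
    """Group text objects that share a page and baseline, ordered left to right."""
    # Sort top to bottom, then left to right, page by page.
    ordered = sorted(objects, key=lambda o: (o.page, o.y, o.x))
    lines = []
    # Objects with the same page and y value are treated as one line of text.
    for (page, _y), run in groupby(ordered, key=lambda o: (o.page, o.y)):
        lines.append((page, " ".join(o.text for o in run)))
    return lines


# Sample objects deliberately listed out of visual order, as a print stream
# may emit them in any sequence.
sample = [
    TextObject(1, 900, 300, "8U", "Courier", "$1,250.00"),
    TextObject(1, 150, 300, "8U", "Courier", "Balance:"),
    TextObject(1, 150, 150, "8U", "Courier", "Account Statement"),
]

result = lines_in_reading_order(sample)
for page, line in result:
    print(page, line)
```

The sketch assumes exact baseline matches. Real print streams often need a tolerance when comparing vertical positions, and font metrics when deciding whether adjacent runs should be separated by a space.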

How do Clients Take Advantage of PCLTool SDK Text Extraction Functionality?

  1. Systems Integrators and MIS Departments usually end their search for a PCL transformation tool when they find the first product that can simply convert a complete multi-document PCL file into a one page PDF file. Unfortunately, they usually end up purchasing multiple tools and often have to write custom code to decollate multiple document data streams into individual PDF files with the correct file-naming convention and external rapid batch indexes.

    We see many clients try to extract the text AFTER the PCL has been converted. The best chance of retrieving searchable text from PCL is when it’s in the NATIVE PCL file format. This is why PCLTool SDK extracts text during the conversion process and can easily optimize, transform and re-purpose the extracted text into many other formats.

  2. Service Bureaus sometimes want to extract all the text from legacy applications in order to re-construct it using various third-party output management/mail management tools to update the look of the document, add color, graphics, OMR marks, IMB barcodes, DataMatrix 2-D barcodes, etc. Not only can PCL Tool SDK extract all the text objects for this purpose, it can also extract all the raster objects (e.g., cancelled check images and graphs) for these custom applications.

  3. Medical/Pharmacy/Laboratory Imaging Solution providers that need to capture the serial or parallel printer output from an ophthalmologic, EKG, pharmacy or other device can use our PCL Tool SDK to capture and re-direct the print stream for conversion and/or text extraction. We can also convert the difficult PCL3GUI format that is used to print to HP DeskJets used by most of these devices. HP has never provided technical documentation for PCL3GUI format, so it’s rarely supported by PCL transformation tools from other vendors.

Customers who need assistance selecting the right option or integrating our products into a custom workflow can contact our Technical Sales team at (+1) 858.794.6884 or send an e-mail to info@pagetech.com for a quote.

PCL Tool SDK v11 Live Evaluation!

PCL to PDF Products

PCL Transformation products for Developers, Systems Integrators and MIS departments.

PCLTool SDK

PCLTool SDK converts complex PCL into PDF, PDF/A, XPS, TIF, BMP, JPG, PNG, WMF, and EMF.

PCLWorks Program

PCLWorks provides a subset of essential PCLTool SDK GUI-only programs that can view, convert, debug, and analyze PCL print streams.

PageTech News and Product Information

PageTech Announces Release of its New PCLMagic Printer Drivers

New industry-exclusive PCL printer drivers embed searchable text into the PCL print streams before the printer driver generates them, saving all the text in its natural unscrambled state.

PageTech Announces Release of PCLTool SDK v11.0

With the release of PCLTool SDK v11.0, PageTech now offers its flagship product in two flavors, PCLTool SDK 32-bit and PCLTool SDK 64-bit, plus .NET versions of its major programs for each platform.