JPedal: a Java library for content users

There are many types of tools for working with PDF technology. To the end user, many of these tools may seem undifferentiated. Displaying a PDF? How could that be a big deal, right?

In point of fact it’s not obvious; the software that creates a PDF is worlds apart from software that physically converts PDF code into pixels on your screen or printer.

Trust me on that.

JPedal, the library developed by IDR Solutions, is decidedly all about parsing and rendering rather than PDF creation. It offers Java developers a range of high-powered building blocks for making end-user tools that display, print, search and reuse PDF files, for example in web-based document management systems.

I interviewed Mark Stephens, the founder of IDR Solutions, to talk about PDF technology from his vantage-point.

Duff Johnson: Please describe, in a couple of sentences, your product or suite of products.

Mark Stephens: We wrote a Java library making it easy for Java Developers to view, print, search and rasterize PDF files.

We have expanded our product range to offer a converter that turns PDF into HTML5 and SVG for use on the web. It can be used not only by Java developers but also from other languages via web services, and by non-technical users via a control panel on our website (http://convert.idrsolutions.com).

You could probably assume the 2 products have an overlapping code base…

DJ: Please describe how you see your organization’s role in the electronic document industry.

MS: We allow Java developers (there are a few of those) to make extensive use of PDF content in their applications. Some clients even use PDF as their internal file format: it looks great, it’s well supported, it is very easy to embed content inside, and their export to PDF function is lightning fast!

We see the ability to host your content on your website as a big gap in the market as most web services force you to host your content on their sites, rather like the AOL/Yahoo walled garden/portal concept at the start of the Internet boom. We cannot see why most customers would not want their content on their own site where they get all the SEO, analytics and control.

DJ: What PDF ISO standards do you support?

MS: Most of ISO 32000-1. Some things in Java are not technically possible. For example Java works in sRGB so overprinting in CMYK is ‘interesting’ and some of the complex transparency effects possible in PDF are beyond the capabilities of the current Java technologies.
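To illustrate why overprinting is awkward once colours have been reduced to RGB, here is a minimal sketch. It is not JPedal code: the class and method names are hypothetical, the CMYK-to-RGB formula is the naive one with no ICC profile, and the overprint rule is a simplified model in which a zero ink value in the top colour leaves the ink underneath in place.

```java
import java.util.Arrays;

public class OverprintSketch {

    /** Naive DeviceCMYK to RGB conversion (no ICC profile); inputs in [0, 1]. */
    public static int[] cmykToRgb(double[] cmyk) {
        double k = cmyk[3];
        return new int[] {
            (int) Math.round(255 * (1 - cmyk[0]) * (1 - k)),
            (int) Math.round(255 * (1 - cmyk[1]) * (1 - k)),
            (int) Math.round(255 * (1 - cmyk[2]) * (1 - k))
        };
    }

    /** Simplified overprint: a zero ink in the top colour keeps the ink underneath. */
    public static double[] overprint(double[] bottom, double[] top) {
        double[] result = new double[4];
        for (int i = 0; i < 4; i++) {
            result[i] = top[i] > 0 ? top[i] : bottom[i];
        }
        return result;
    }

    public static void main(String[] args) {
        double[] yellow = {0, 0, 1, 0};
        double[] cyan = {1, 0, 0, 0};

        // Without overprint, cyan simply replaces whatever was underneath.
        int[] knockout = cmykToRgb(cyan);
        // With overprint, the yellow ink survives and the result is green.
        int[] overprinted = cmykToRgb(overprint(yellow, cyan));

        System.out.println("knockout:    " + Arrays.toString(knockout));
        System.out.println("overprinted: " + Arrays.toString(overprinted));
    }
}
```

The point of the sketch is that the overprint decision needs the individual ink values. A renderer that has already flattened the page to RGB pixels no longer knows which inks were underneath, so it has to carry that information separately.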

DJ: What was your principal motivation in utilizing ISO standards for PDF, whichever you use?

MS: Customers want clear open standards these days and will not settle for less. ISO does a brilliant job of maintaining and developing all the standards which people take for granted but which drive our increasingly electronic lives.

DJ: If you do support, or plan to support, PDF/UA, how are you relating this support to WCAG 2.0, if at all? And how do you view the relationship between PDF/UA and WCAG 2.0 generally (if you have a view)?

MS: Providing access to content is important and mobile is really driving forward people’s expectations of how they can use technology. We added a pageflow mode to our viewer and have a speech option as well. We find customers do not tend to mention the specs but providing better access to content (both via the spec and via the application) is a hot topic.

DJ: If you don’t support tagged PDF for content extraction / reuse, why not?

MS: The big issue we find with tagged PDF is that it needs to be baked into the PDF at creation time and lots of people do not understand this or do not have control over the creation. We had a ‘bug report’ last week (from a US government agency no less) because they got this output from one of their files.

<?xml version="1.0" encoding="UTF-8"?>

<!-- Created from JPedal -->

<!-- http://www.idrsolutions.com -->

<TaggedPDF-doc/>

<!-- There is NO Structured text in the file to extract!! -->

<!-- JPedal can only extract it if it has been added when PDF created -->

<!-- Please read our blog post at http://www.jpedal.org/PDFblog/2010/09/the-easy-way-to-discover-if-a-pdf-file-contains-structured-content/ -->

If your readers can suggest an improvement to the message we would love to get some feedback!

When it is there it works really well.

DJ: My advice would be to rewrite that message to read, more or less: “The contents of this PDF are not structured or tagged. Please add tags to facilitate text extraction.”

Ok, so back to the questions. 😉

DJ: I see you claim “XFA support”. Does that include rendering dynamic XFA?

MS: Those expandable forms render nicely across multiple pages but allowing the user to expand them further in the viewer and support for all the JavaScript is still a work in progress. Java now has a really nice shiny new JavaScript engine called Nashorn which has been allowing us to improve JavaScript support.

DJ: Which layout attributes (ISO 32000-1, 14.8.5.4), if any, do you use as part of your HTML5 conversion, and if none, why not? (NOTE: I’m not suggesting in any way that you should, I’m genuinely curious).

MS: Not at the moment. We have not found them generic enough in most content we are converting.

DJ: I’ve observed that most ECM / DMS implementations don’t do much with PDF; there’s no real integration – PDF files are treated more-or-less as TIFFs. Do you feel there’s a big future for PDF in ECM implementations, or is the growth in the PDF space going to be elsewhere?

MS: There have been lots of innovative ideas for DMS with PDF which have never secured the success they deserved. I think most developers now are looking to mine the content from the PDF and provide it indirectly.

DJ: PDF viewing and rasterization is clearly a major feature in your offerings. Do you see a lot of problems with 3rd-party generated input PDF quality?

MS: My heart drops when I see PDFs from some tools or ‘home brewed’ PDFs. Personally I am not a fan of Ghostscript PDF output.

I still have to spend some time each month tweaking our parser because the spec does not say you cannot do something in a PDF, and the file opens in Acrobat. Lots of rules in the spec get broken with impunity. You can actually have some pretty screwed-up PDFs, with the xref table incorrectly set up, random carriage returns all through the data, missing endobj tags, no EOF in the last 1024 bytes, or incorrectly encoded fonts, which Acrobat will still open correctly.

We see a lot of badly written and inefficient PDFs as well (like an image drawn inside an Xform which is nested inside 50 levels of Xform, each of which just calls the next level). Some versions of OpenOffice embed every font on every document page whether it is used or not.

Some PDF tools also have their own sets of little tricks, like a 1×1 pixel scaled up to fill in a rectangle of coloured background.
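Two of the malformations mentioned above, a missing EOF marker in the last 1024 bytes and a broken pointer to the xref table, can be spotted with a few lines of plain Java. The following is a minimal sketch and not JPedal code: the class and method names are hypothetical, and it only looks for the `%%EOF` and `startxref` keywords in the tail of the file rather than validating the xref table itself.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class PdfTrailerCheck {

    private static final int TAIL_SIZE = 1024;

    /** Decodes the last 1024 bytes (or fewer) as Latin-1 so bytes map 1:1 to chars. */
    private static String tail(byte[] pdf) {
        int start = Math.max(0, pdf.length - TAIL_SIZE);
        return new String(pdf, start, pdf.length - start, StandardCharsets.ISO_8859_1);
    }

    /** True if the %%EOF marker appears somewhere in the last 1024 bytes. */
    public static boolean hasEofMarkerNearEnd(byte[] pdf) {
        return tail(pdf).contains("%%EOF");
    }

    /** True if the startxref keyword appears somewhere in the last 1024 bytes. */
    public static boolean hasStartXrefNearEnd(byte[] pdf) {
        return tail(pdf).contains("startxref");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java PdfTrailerCheck <file.pdf>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("%%EOF near end:    " + hasEofMarkerNearEnd(data));
        System.out.println("startxref near end: " + hasStartXrefNearEnd(data));
    }
}
```

A file that fails either check may still open in a tolerant viewer, which is the situation described above. A parser that wants to match that behaviour would typically need a fallback, such as scanning the whole file for objects, rather than rejecting the document.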

DJ: Do you have any views on ISO 32000-2 (PDF 2.0) at this time? Do you plan to support it, or do your customers have no need for the sorts of features PDF 2.0 will bring?

MS: It needs to ship! It will become important to most customers once it is live. As a viewer we really need to support the full spec as much as possible. But the PDF specification has become large and unwieldy and I especially like the idea of subsets of the full spec for different usages/groups.

I’m also really pleased to see ‘predatory’ patents disappearing for the PDF spec so that everyone can use features like hinting and JBIG2.

It is the first ISO release so I think it is very much a learning game for all. I would like to see us start to really ‘deprecate’ things in the spec now we are on 2.0. Do we really need Type3 fonts in 2013?

The Internet, and mobile in particular, are changing people’s perceptions and requirements with ideas which never existed when the PDF specification was created. We need to think how this will impact the way people work if PDF is to remain important. How can we share and link content better? I would like location to appear in PDF so my PDF can show where I am on the map or what I can see from my current location. That would be very cool!

DJ: What would you say is your particular speciality, as a developer?

MS: At the start it would be determination. Getting to grips with the PDF spec takes a lot of time and even the ‘hello world’ version of any PDF library is a lot of work. As we have grown IDR Solutions my job has been to stay on top of the spec and our code base while sharing those skills. I still get to deal with lots of the nasty issues (usually in my code!) but I now have some very bright developers to help me. I still learn new things each day, which keeps me motivated.

DJ: Looking at the IDR Solutions website I’m struck by the fact that IDR’s products do many of the things iText – probably the biggest name in Java PDF – doesn’t do. Have you considered partnering with iText, or are your marketplaces just very different, as you see it?

MS: We are big fans of iText and we actually put our stands next to each other in a ‘PDF corner’ at JavaOne 2012 as they are very complementary products. We have a lot of overlapping customers and we both have plenty to work on in our own spaces. Creation/editing and rendering/extraction are very different and both require a huge amount of work so it makes sense to co-exist and customers benefit.

DJ: Your JPedal product line feels underpriced, especially given that your software renders PDF. I’m curious to know whatever you’d want to say about your pricing strategy.

MS: We had some people complaining last week how excessive it was! I think it is all a question of whether you are demonstrating value. I am sure as part of ISO you get complaints about the 238 Swiss Francs price tag on the PDF spec (and calls for it to be both higher and lower).

Joel Spolsky has a brilliant article on pricing (http://www.joelonsoftware.com/articles/CamelsandRubberDuckies.html) and we all want to be in the sweet spot on the curve! We are in several overlapping markets and the Java market has a different take on pricing. Essentially we found that there were 2 main uses of our product, in-house projects and OEM developers, and tried to position the product at points where it fitted their general discretionary spend budgets so that it was an accessible solution. I would say we choose not to make pricing at either extreme our USP but focus on features and regular updates.

With our PDF to HTML5 converter we have gone for a usage model as some customers want a couple of pages converted and others want millions.

DJ: How did you get into the PDF technology space?

MS: The usual roundabout route… I did a degree in Mediaeval History, went to work in education for a few years, moved into Industry and ended up on a big publishing project at News International working on system integration for the Times Newspaper. That took me into PDF and I was hooked.

DJ: Your talks have a certain reputation….

MS: I have done quite a few talks at conferences (Seybold, JavaOne, Business of Software) and always like to slip in interesting stuff and some dry humour. I entered a Lightning talk at the Business of Software conference in 2010 where you had to talk for 20 seconds on 20 slides and decided to go for broke with the catchy title “ASTEROID IMPACT: ARE YOU A BIG LIZARD OR SMALL AND FURRY?” (link http://businessofsoftware.org/2010/03/asteroid-impact-are-you-a-big-lizard-or-small-and-furry/). I am told some people are still in therapy….

DJ: What’s with the name “JPedal”?

MS: We wanted a blue sky name we could effectively own rather than a slight variation on PDF easily confused with lots of other products. It stands for Java Pdf Extraction Decoding Access Library (so I’m told).

DJ: Thanks for all your time, Mark!


