How reusable content in PDF could be: deriving HTML from PDF

By | January 29, 2020


(upbeat music) – You have to actually explain how rich PDF could be, how it could be used, how responsive it could be, and how the content in PDF could be reused. And this document is a sort of guideline for creating these types of documents, helping developers understand and use all the capabilities of PDF.

So I’m a software developer, and I’ve been working with PDF implementations, applications, and libraries for 20 years now, and it’s long enough to have actually witnessed all the evolution of PDF over the years. And let me explain how we got
where we are at the moment. So we started back in ’93 with a representation of paper. So the basic PDF model, the original one, was based on a graphics model. We wanted to accurately represent the content on paper, either on a screen or on a printout. So we adopted rendering. We wanted to precisely render the content that is in a PDF file.

After that, we slowly started to realize that it’s not enough, that we need more richness in PDF, and we introduced marked content. So, a logical structure. We started to mark content, and we also realized that it is not enough. We needed to start properly tagging files, and when tagged PDF was born, that was actually the very first time we started to consume PDF a little differently. Right now assistive technology, screen readers and these types of tools, are not consuming the PDF or the document as we ordinary people do, as it is on paper. They simply read the content based on the structure that is written in the PDF file.

We have, with PDF, some problems now, and PDF is an essential
part of the web, right? It’s the most common file format you can find on the web, except HTML. And so everywhere you go, you meet PDFs, but how good is the end user experience with PDF?

HTML users hate PDF, right? They don’t know how to use PDF on a mobile. You have to pinch and zoom. If you are in the browser, you usually experience a PDF in the form of a link, so it’s not an essential part of the experience on a website. You have to go somewhere else, into a different viewer, into a different user experience generally.

HTML developers hate PDF as well. They don’t know how to access data in PDF. They can’t control the user experience at all. You just have to go away from that portion of the website, you have to go into a different application usually, and they can’t navigate into a PDF, out of the PDF into the HTML side, and so on. So it’s basically something different. It’s not there. And the question is, is it possible to somehow interpret
PDF a little differently so it gives HTML users and HTML developers a better experience with PDF? And we believe, yes, we do, we can do a lot better. The document I was mentioning at the beginning is something I was working on with a bunch of great people in the Next Generation technical working group at the PDF Association, and we simply believe that there is a way out of this with the current language we know in PDF.

And we believe that tagging is the best way of capturing the author’s intent. So authors decide how the content would be consumed, and it also gives enough freedom for developers and implementers to actually adjust these experiences as well. We believe that the provisions in ISO 32000-2, which is the standard for PDF 2.0, are rich enough to be interpreted properly as HTML. We call it derivation. And we also believe that derivation might be the new word for rendering.

And by giving away this document, we wanted to encourage all the developers, all the implementers and authors to think about the proper way of doing this. Because if we don’t do anything, people will simply start doing it, and if we provide the guidance, we believe that people would adopt it. And of course, PDF has
always been an open format. It’s very important that it’s community driven, so I’d like to invite you to join. First of all, you can go to the PDF Association website, download the publication, download the document, read it. If you are not part of the PDF Association, join us. If you are part of the PDF Association already, join us in the working group. We also have our, you know, testing products, so to speak, where we test this idea, and I’ll show a little later how that works in real life. Share your files so we can adapt and come up with improvements to this document. We can discuss techniques, and I’d love to hear your feedback.

So, now I warn you, this is starting to be a little more technical from now on, but stay here please, don’t leave. Whoever needs to take a nap, now is a good time. (everyone laughing)

What actually is a tagged PDF? So, we have this structure tree. We have marked content. We have form fields, links, everything is connected into the structure tree, and all the elements are accessible, in Adobe Acrobat, via the
panel on the left, right? And you have a lot more language in PDF you can use to express the quality of the content and the semantics, and you have a lot of ways to actually make the PDF richer. There are attributes you can assign to any specific structure element. You have classes; a class is a set of attributes, something you might know from the CSS world, for example. We have a very similar language in PDF. You have associated files; you might have heard of associated files on a couple of occasions during the presentations here. And you have actions in PDF that are also taken into account.

So this is a screenshot for those of you who don’t know what tagging is. It’s captured on the left here, in the panel, how you describe the semantic meaning of the content you see next to it. So if you are on the TR, which is the table row, you know, it’s highlighted in the content.

This is how the tag element looks in the PDF way. So you have A, which is the dictionary of attributes. The full richness you might have there. You have AF, associated files attached to, for example, a table. And then you have something that points to the content in the PDF. So you’re not losing anything by tagging the file. You still don’t have to use the tags; PDF is still able to be printed, right? You can do whatever you want
as you used to, but much more.

PDF describes a lot of standard structure element types, so that’s the language we use in PDF to describe the semantics. And ISO 32000-2, so PDF 2.0, knows these specific structure element types. They might look familiar from the PDF world, from the HTML world, like Div, Span, paragraph, H1, list, list item, and so on. Some are harder to imagine. In PDF we might have form fields, we might have annotations, and so on. And of course you may also come up with your own language, your own set of tags, so to speak, but they have to be used properly.

How does the derivation then work? I said we wanna actually use tagging as a starting point, as a point of contact with the PDF file. So we have a PDF file, we have this simple H1 structure element. It points into the content, and you may derive the new tree in HTML, and that would look something like this at the end, okay?

A more complex structure, like a table: a table has table rows, a table row has THs or TDs, so you have to fully describe the table properly. And with quite a straightforward derivation, it looks something like this, okay? So you have standard HTML, and that would look something like this. You see, it doesn’t look nice. We didn’t apply styling so far. The styling might come from the PDF, however this is not the way we wanted to actually do it, because in a PDF, when you draw some paths and curves, they are meant to be for the
representation on paper. If you, or the author, wanted to represent the table differently in the HTML world, I would suggest using attributes. The general concept of attributes is something that you attach to any structure element, and again, the PDF specification knows a bunch of different types of attributes. And you may also assign a class to a structure element. So you see the structure element properties in the middle, and then you have the associated class with a C. On the right you see something like a proper class. On the left you see role mapping. So, your own language for expressing tags. You might use heading two or something which is closer to your workflows and your systems. You may use your own language, okay?

In the PDF world, and this is actually a screenshot from the specification, these are all the attributes we may use and apply to a specific structure element, a specific tag. There are a bunch of, we call them owners, and each owner is a set of specific attributes. So we have layout attributes, we have attributes specific to lists, like continuous lists and things like that, like list numbering and so on. And if you look closer, we may also introduce in PDF attributes
from a different world. So you may carry information from the HTML world, and you may put it into PDF quite properly. This information is there, an essential part of the PDF, and could be reused afterwards, which is very cool, and people don’t know this and are not using it. So if your authoring application, say Word or whatever is your choice of authoring application, has this type of information, you have a way to put it into PDF and not lose it.

An associated file might have information that is relevant for processing to HTML. It might be a supplement, it might be something that you need to actually look at, like additional information, for example. If we have a table, it might be a chart of the table. If we have an image, it might be an SVG representation of the image. If we have a math formula in the form of an image on paper, it might be MathML for the processing. And we, in our derivation, are taking this into account, and are looking at this as an interesting and important way of
conveying information. In this particular case, we have this PDF, which is tagged, and we have associated files, and you may see that there is a JavaScript associated, as well as styling associated with the header. And you might guess, based on the name of the JavaScript, what the resulting HTML would look like after the derivation. So after the derivation, we receive HTML which looks identical to the PDF; that’s how we styled it. We decided we want the same experience from the resulting HTML, however we decided that we want this table to be sortable. So that’s done, and that’s all carried in the PDF file. Nothing is added after the derivation. We didn’t apply anything special to the resulting file; it’s all contained in the PDF file. And the end user, well, the author, decided he wants this experience after the derivation.

Another cool feature of
PDF 2.0 is namespaces. Matthew mentioned this. So we are not playing only in our sandbox, saying that the PDF tag set is the only way to express information in PDF. We are allowing other namespaces to join the party, right? So in this case, we have a made-up PDF file. It’s a regular PDF file. You may print it, you may search it, it’s fine, you wouldn’t notice any difference. And for tagging, we use the HTML namespace. So we are literally using HTML tags to describe the semantics of the PDF file.

Technically, that would look something like this. So this is our structure element dictionary, so to speak, and the structure name is div, and we are saying which namespace we are using through the NS dictionary there. And we can still apply attributes, even PDF attributes. We can combine this information together, so we can role map these HTML tags into PDF tags so the file is properly accessible for ATs, for screen readers, everything. So we are not losing anything; we are enriching PDF with these new, cool features.

And we are using these features during the derivation, yeah? So as we derive HTML from this PDF, we would generate HTML with all this information that is hidden here. And the result might look a lot richer; the user experience is a lot better. The HTML developer does have a choice of what to do with this produced HTML. He may put it into a separate div, for example, as part of the natural flow of the web page. And it’s coming from PDF on the fly; you don’t have to do
anything special there.

We do have another sample which also shows the richness of the derivation itself, the richness of the language we have. PDFs can contain form data, all the form data powered by JavaScript inside of the PDF. So you have calculations, you have formatting, everything. The derivation algorithm, as described, is taking this into account. We know already how to properly tag forms like this. So the tagging used for accessibility is very good in this use case as well. So you tag this properly, and of course you have to think about deriving the JavaScript, and that’s not as easy as it might look. JavaScript in PDF is very different from JavaScript in HTML. Well, the language is the same, of course, but in PDF, JavaScript knows a lot about the insides of the file. You can access form fields, you can access annotations, you can access things inside of the PDF, and based on that you may change the values of specific form fields. You may also react to a button: the button was clicked, you entered the button, and so on and so on.

Our implementation at the moment takes all the JavaScript and flushes it into the resulting HTML. Of course, the HTML has to be bundled with a specific JavaScript library that interprets these PDF specifics. We do have that library included, and at the end of the day, the user has the same experience filling out forms inside of PDF as inside of HTML.

Okay, so you may decide how you want to consume this content. I’m not talking about how you want to consume PDF, because PDF is just the container for your content, for whatever you want to present to the users. If that’s something you want to be presented as HTML, you have a choice at the moment.

It was not an easy task to write
the derivation algorithm. We are still part of ISO processes and committees and everything, so we always have to make sure that we’re doing the right thing. Even though HTML browsers these days consume almost any HTML file, we wanted to make sure we are doing the right thing, and we needed to actually be sure that the HTML that we produce during the derivation is valid, valid according to the HTML standards. And that was not easy, because the PDF language is different from HTML, and we see some specifics that don’t bother us in the PDF world; for example, we may have some structures that HTML doesn’t allow.

And what you are looking at is actually a mega table from ISO 32000-2, PDF 2.0, and it says how the nesting of tags works, what is valid in PDF. This is normative language in PDF. And I merged this with our work on derivation, and I looked at what the results would look like. So the green spots are what is allowed in PDF and what would be okay after the derivation into HTML. However, the red spots, there is the problem. Like, for example, if we look here, somewhere here, TH, which is the table header, in PDF may contain, among others, a paragraph, or an H1 heading. In PDF, we don’t mind that a table header would contain an H1. Yeah, however strange that sounds, we are allowing this. The HTML world doesn’t. You can’t really use a heading inside of a table header. So we had to find a way to solve these problems, and there
are a couple of those, and to be honest, sometimes we had to actually take a step back and look at whether we are not doing something wrong in PDF itself. Maybe we just missed something, and the work that we were doing on this derivation is actually good input for work on the PDF standards as well. We are validating that PDF is ready for the years to come, and HTML and PDF have to play nicely at some point if we want to be successful in this world.

So, when we were writing this document, we simply had to resolve these issues by, you know, providing some special treatment in these particular cases, and we kind of described how that would work if you actually face this in a real-world example. Of course, we are still thinking and believing that we would be consuming a well-tagged PDF file. So it is quite mandatory to use a well-tagged PDF file.

In a nutshell, and now is the time to wake up because this is important: we believe that the world
of a single representation of PDF is kind of over, and let’s be honest, it is happening everywhere. People are consuming PDF in different ways. They don’t want to rely on the single page representation. They simply don’t. And if we look even closer, assistive technologies right now, and for many years, have already been consuming the content differently than we do, right? So relying on a single, one and only representation might be just a blocker for years to come.

Of course, whatever I described doesn’t change the core value proposition of PDF. If you want to, you have the output in the form of PDF, in the form of paper, so to speak. And not just if you want to; it’s actually mandatory. You have to have that there, right?

And I think that the strong message is also, and I’ve heard this over and over, we should simply say it: it’s time to stop producing bad PDFs. Of course, I’d love to say that the good PDF is tagged
PDF, but I’d be happy to see PDF that contains fonts, that contains a Unicode representation of the text. Of course, structure, if that’s possible. And I believe we would need to move away from old, bad software, tools, and implementations that don’t actually use the proper language and are not reusable.

Another thing: we have a choice. The author may decide how the PDF is going to be consumed. They may tell, inside of the file, I want this to be properly consumed as an HTML file. There’s nothing bad about this. The author may decide, and can actually give you a choice. They would still need to provide the standard page description, of course. The PDF still needs to be rendered properly, as we know, so we’re not changing this. And instead of saying that there is only one way to interpret the PDF, we will be saying that there is always a deterministic way of interpreting PDF. Because I’m a developer; we tend to use shortcuts, and if there is no guidance, then we would try something special, some magic that would happen. And I do think that this is also a bad practice, and we simply have to stop producing bad files, producing bad experiences, and we need to follow standards.

And again, there is this link, of course; I mentioned the document you may grab on the PDF
Association website. You may read it, you may get in touch with us through the technical working group, or through the GitHub repository we have there. We shared a couple of samples there; you can download them, you can look at them, you can see how they are derived.

We have an implementation as well. The implementation is in the form of a command line tool. It’s also in the form of an application itself. There’s a screenshot of the application, so you may open a PDF file on the left. There are two panels: on the left you see the original PDF rendered with some kind of renderer. On the right you have the HTML representation of it. In this case, it looks almost identical, because the author wanted it to be like that. And I guess I’ve run out of time, so if you have any questions. Okay.

– Just trying to understand how tagged PDFs work. So the PDF has the XObjects describing the content on the page.

– Mm-hmm.

– That’s how PDFs work. When you tag it, are you
duplicating all that text?

– Not at all, no, no, no.

– Just applying tags to the XObject.

– Exactly. If you look at this. (tapping) Let me show you. So we are talking about PDF language here. (tapping) So okay, so this one’s good. So this is a structure element, this is a tag, and you have information about the page where the content is, and there is K, which is a pointer to the content. And the content is marked, and contains some information saying, I am the one. It’s a number, basically. So it points into the content. Whatever you change in the content is actually objectable. So you’re not duplicating things. Of course, you may apply additional information to this. Like, for example, you’re pointing to an image, but you want to add an alternate text, so you have to put it in here. So you’re not putting that into the content, of course. But if you have an image, and under the image there is a title or a caption, you associate both of them in the structure tree. So the semantics, the richness, is expressed in the structure tree, with the pointers to the PDF content.

– So do many PDF
producers have the ability to add tags while creating the PDF, or?

– Well, yes and no. Most of the applications now are able to produce tagged PDF, most of the important ones, of course. Like, for example, Microsoft Word, or Excel. They can do that, and we also included on the GitHub an example of how a standard Word document would look after the conversion, the derivation. One thing is that they do it; however, how well they do it is another. So far. (man laughing) And I don’t want to blame anyone, it’s just the reality of where we are now, okay? People have so far been producing or using tags for accessibility purposes. This is something that we believe is going to change. People need to access the data, and this is the way.

– I can see how they might help easier.

– Yes, please.

– I’m curious as to how it’s changed from PDF 1.7 to 2.0. That might be too big of a topic or question, but how does it simplify the process?

– So, 1.7. I mentioned a couple of new language features we have in 2.0, like associated files. In 1.7, there was only a little bit of them. Right now you may attach an associated file almost anywhere; specifically, it is important that you may attach an associated file to a structure element. So it’s new language. When it comes to tags, they are a little better in 2.0, I would say, so we deprecated a couple of tags or structure elements. We introduced a couple of new ones. We also changed the language a little bit; the nesting table I showed, that is something that wasn’t there in 1.7. So we are more precise in 2.0. There are, of course, other things, like namespaces, that’s new in 2.0. Some attributes are new in 2.0, but it’s not that dramatic. Structure destination, that’s one type of action you may use in 2.0. It was not in 1.7. So, an action that takes you from one structure element to another. Obviously if you click on that link, it would take you to the content that is associated with the structure. So, off the top of my head, I guess that’s pretty much all.

– Thank you.

– Namespaces.

– Namespaces, yes of course, namespaces. We are bringing in language from the outside world, like MathML. Natively you may use MathML inside of the structure tree. We are trying to make friends with HTML, for example, right? Stuff like that. There are no further questions. Thank you very much. (audience applauding) (upbeat music)
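
Editor's sketch: the derivation walked through in the talk (an H1 structure element, then a table with TR, TH and TD, role mapping of custom tags, and the TH-containing-H1 nesting problem) can be illustrated with a small toy model. Everything named here is hypothetical and is not taken from the PDF Association document or from any PDF library: `StructElem` is a stand-in for a structure element with its type, kids and attributes, `HTML_TAG` is an assumed mapping for a handful of standard structure types, and demoting a heading inside TH to a paragraph is only one possible "special treatment", chosen for illustration.

```python
from dataclasses import dataclass, field
from html import escape
from typing import Dict, List, Optional, Union

# Assumed mapping from a few PDF standard structure types to HTML tags.
HTML_TAG: Dict[str, str] = {
    "Document": "div", "Div": "div", "Span": "span", "P": "p",
    "H1": "h1", "H2": "h2", "L": "ul", "LI": "li",
    "Table": "table", "TR": "tr", "TH": "th", "TD": "td",
}
HEADINGS = {"H1", "H2"}


@dataclass
class StructElem:
    """Toy stand-in for a structure element: type (S), kids (K), attributes (A)."""
    s: str
    k: List[Union["StructElem", str]] = field(default_factory=list)
    a: Dict[str, str] = field(default_factory=dict)


def derive(elem: StructElem, role_map: Optional[Dict[str, str]] = None,
           inside_th: bool = False) -> str:
    """Walk the structure tree depth-first and return an HTML string."""
    role_map = role_map or {}
    std_type = role_map.get(elem.s, elem.s)  # custom tag -> standard type
    tag = HTML_TAG.get(std_type, "div")      # unknown types fall back to div
    attrs = dict(elem.a)
    if inside_th and std_type in HEADINGS:
        # HTML does not allow heading content inside <th>, so demote to <p>
        # (illustrative choice only) and keep the original type as a class.
        tag = "p"
        attrs["class"] = std_type
    attr_text = "".join(
        f' {escape(name)}="{escape(value)}"' for name, value in sorted(attrs.items())
    )
    inner = "".join(
        escape(kid) if isinstance(kid, str)
        else derive(kid, role_map, inside_th or std_type == "TH")
        for kid in elem.k
    )
    return f"<{tag}{attr_text}>{inner}</{tag}>"


table = StructElem("Table", [
    StructElem("TR", [StructElem("TH", [StructElem("H1", ["Name"])]),
                      StructElem("TH", ["Size"])]),
    StructElem("TR", [StructElem("TD", ["a.pdf"]), StructElem("TD", ["12 KB"])]),
])
heading = StructElem("Heading2", ["Intro"])

print(derive(heading, role_map={"Heading2": "H2"}))  # <h2>Intro</h2>
print(derive(table))
```

In this sketch the custom tag `Heading2` is resolved through the role map before the HTML tag is chosen, and the H1 nested in the first TH comes out as `<p class="H1">` so that the produced table stays valid HTML. A real derivation would also have to follow the pointers into marked content, and take attributes, classes, namespaces and associated files into account, none of which is modelled here.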
