Wednesday, 14 August 2013

Pdf.js text overlay only adds to canvas. How to log it to console?

Today I found this fantastic code example on how to extract text from a
PDF file using pdf.js. The functionality is perfect, but the script has
more code than is really needed to call the pdf.js library, and it is
rather messy. One example is a seemingly unnecessary Ajax call to
another page. For this reason, I decided to clean up the code so that I,
and others, can use it in future projects.
I've spent all day cleaning up this code and making it work with the HTML5
File API, but there is one slight issue with my rewrite...
For the life of me I can't figure out how to log the extracted page text
to the browser console after looping over each text layer within a PDF
page. The text is added to the page DOM, but it's inside a canvas with
its original font, which is no good. I want to achieve the same
functionality as the existing script, that is, adding the text to a div so
it looks like plain text. I don't want the canvas HTML or the original
font attached to the text.
The problem lies here in filedrag.js:
page.startRendering(context, function(){
    if (++self.complete == total){
        window.setTimeout(function(){
            var layers = [];
            var nodes = document.querySelectorAll(".textLayer > div");
            for (var j = 0; j < nodes.length; j++){
                layers.push(nodes[j].textContent + "\n");
            }
            console.log("testing logging");
            console.log(layers.join("\n").replace(/\s+/g, " "));
        }, 1000);
    }
}, textLayer);
As you can see, I have added a debug line to the code:
console.log("testing logging");
But if you run my version of the script and upload the PDF file
supplied, nothing is output to the browser console, yet the text
is still appended to the DOM as a parent of a canvas element. Grr!
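One way to narrow this down is to pull the text-joining step out of the rendering callback so it can be checked on its own. The following is a minimal sketch, not code from the project: joinLayerText is a hypothetical helper name, and it takes plain strings instead of DOM nodes so that it runs without a browser or pdf.js.

```javascript
// Hypothetical helper (not in filedrag.js): joins the text of each
// text-layer fragment and collapses runs of whitespace into one space,
// mirroring the loop in the snippet above.
function joinLayerText(fragments) {
  var layers = [];
  for (var j = 0; j < fragments.length; j++) {
    layers.push(fragments[j] + "\n");
  }
  return layers.join("\n").replace(/\s+/g, " ");
}

// Example strings standing in for the textContent of ".textLayer > div" nodes.
console.log(joinLayerText(["Hello", "world"]));
```

In the browser, the same helper could be fed the real nodes by first mapping them to strings, for example with Array.prototype.map.call(nodes, function (n) { return n.textContent; }). If the helper behaves as expected on its own, the joining logic is probably not what keeps the console empty.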
Why isn't the text being output to the browser console? I have a
feeling this is some sort of threading problem, but I'm really not
experienced enough with writing JavaScript to be able to tell. The
original code sends variables through window by using an iframe, but I
find this really ugly and I'd much prefer to use my version of the code as
it's neater and easier to extend.
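The timing suspicion can be tested without pdf.js by exercising the page-counting check in isolation. This is a minimal sketch under assumptions: makePageCounter and onAllDone are hypothetical names, and the function only mirrors the ++self.complete == total test from the snippet. It says nothing about how pdf.js itself schedules rendering.

```javascript
// Hypothetical helper: returns a function to call once per rendered page.
// onAllDone fires exactly once, when the call count reaches total.
function makePageCounter(total, onAllDone) {
  var complete = 0;
  return function pageDone() {
    complete += 1;
    if (complete === total) {
      onAllDone();
    }
  };
}

// Simulate three pages finishing one after another.
var fired = 0;
var pageDone = makePageCounter(3, function () { fired += 1; });
pageDone();
pageDone();
pageDone();
console.log(fired);
```

Since even the "testing logging" line never prints, one possibility worth checking is that the callback passed to page.startRendering is not being invoked once per page, in which case a counter like this would never reach total and the setTimeout would never be scheduled.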
I have posted this project to GitHub, as it requires extra dependencies,
which meant that I couldn't post it on JSFiddle etc. This is my first
time using GitHub, so I hope the repo is accessible to everyone.
If anyone can help me out here, I would really appreciate it. Any
suggestions for a fix would be nice. I'm not expecting someone to just go
ahead and fix this code for me; I'd just love some pointers as to how I
can fix this.
Cheers.
