OCR Questions
Posted: 17 October 2013 03:32 AM
Newbie
Total Posts:  10
Joined  2013-09-22

(Before I list my main questions, is there a PDF manual for PDF Nomad? The built-in OS X help mechanism is painfully slow, and some topics [such as “Auto Deskew” and “Resolution for searchable pages”] aren’t covered.)


I’ve OCRed a scanned book from the 60s, and I’m encountering various issues while attempting to edit the OCRed text. While the onboard help mentions that text can be edited, it offers no details on the procedures for doing so. The PDF is of decent, but not stellar, quality. So I do believe the software did a decent job, given the source material. (And let me add that the book I OCRed a few weeks ago was processed essentially flawlessly. So, I’m pleased with the technology in general.) However, I’m experiencing some hiccups with the editing process.


1. I figured out that I can select words in the body of the page and correct them. I also figured out that in the area to the right, words that are unrecognized are underlined in red. That’s a nice touch. However, it’s sometimes difficult to manually scan and find the words on the page that correspond to the underlined words at the right. It would be nice to be able to click on an underlined word at the right and have the corresponding word automatically highlight on the page, ready for correction.


2. Due to the way this book was typeset, and due to the scan quality, PDF Nomad had trouble with the spacing of a number of words. For example, PDF Nomad interprets the word “THIRTEENTHS” precisely like this:

TH I RTEENTHS

So, PDFN thinks they’re three separate words. As such, how does one join the segments into one word?

UPDATE: While typing this post, I was experimenting with PDFN, and I discovered you can select and delete segments, and you can elongate segments. So, I deleted the 2nd and 3rd segments above, then elongated the 1st one, so I could type the full word. But this is cumbersome. I potentially have hundreds of such corrections to make—and highlighting the smaller segments is tricky. When approaching the edge of a segment, the cursor changes to the “extend segment” cursor. As a result, when a segment is the length of one letter, it’s nearly impossible to select it.

Suggestion: To join segments, we should be able to drag a selection box around them (which we can already do), then issue a command to join them (which I don’t believe we can already do).


3. It would also be nice to be able to adjust PDFN’s threshold of spacing in a document like this. If we had a slider to tell it “A real space is at least this wide, and anything smaller than this is not a space,” that would fix this problem. If we could do this after the document’s been scanned, we could issue that command and have PDFN reprocess the current data with much more accurate results. (I think I’ll reprocess this document using the “Background Level” setting to make the letters thicker. Perhaps that will help.)
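
To illustrate what I mean by a spacing threshold, here is a minimal Python sketch. It is only a guess at how such a setting could behave, not PDF Nomad's actual code: the Segment class, the merge_segments function, and the min_space parameter are hypothetical names, and the coordinates are made up for the “THIRTEENTHS” example.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A recognized piece of text and its horizontal extent on the page (hypothetical)."""
    text: str
    x0: float  # left edge
    x1: float  # right edge


def merge_segments(segments, min_space):
    """Join neighbouring segments whose gap is narrower than min_space.

    A gap of at least min_space counts as a real space; anything
    smaller is treated as part of the same word.
    """
    merged = []
    for seg in sorted(segments, key=lambda s: s.x0):
        if merged and seg.x0 - merged[-1].x1 < min_space:
            prev = merged[-1]
            # Concatenate the text and extend the box to cover both pieces.
            merged[-1] = Segment(prev.text + seg.text, prev.x0, max(prev.x1, seg.x1))
        else:
            merged.append(Segment(seg.text, seg.x0, seg.x1))
    return merged


# Made-up coordinates: small gaps inside the word, a wide gap before "OF".
line = [
    Segment("TH", 0, 20),
    Segment("I", 24, 28),
    Segment("RTEENTHS", 32, 110),
    Segment("OF", 125, 145),
]
print([s.text for s in merge_segments(line, min_space=8)])
```

With a threshold of 8, the three pieces are joined into one word while “OF” stays separate. Dragging the slider would simply rerun this step with a different min_space on data that has already been recognized.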

(By the way, the manual states: “Lowering the background level often has the effect of making the text heavier and fuller [and vice versa].” However, I found the results to be the opposite: A higher level makes text heavier, and a lower level makes text lighter.)


4. When PDFN completely misses a word, how does one select that word, then enter the missing text? In my current document, I’ll need to do quite a bit of this, but haven’t yet found a way to do so.


5. Finally, after editing text, the enter key doesn’t confirm the window. So, one has to reach for the mouse and click the button each time, which is time-consuming. Please make the enter key confirm the window.


Thanks!

Posted: 17 October 2013 04:40 AM   [ # 1 ]
Administrator
Total Posts:  475
Joined  2007-03-23

Re 1: Ideally, it would be possible to just edit in the overview on the right. We hope to be able to support that eventually, but for now we have to make do with things as they are.

Re 2: With “Scope: Words” selected, select the three constituents that make up “THIRTEENTHS”, then right-click and choose “Join selected words”.

Re 3: I’ll look into that.

Re 4: One way to do that is to edit a neighbouring text box and type the missing word before or after the existing one, separated by a space. PDF Nomad will create separate boxes for the separate words. You can then drag the new box into place if needed.

Re 5: I’ll look into that too.

Thanks for the elaborate feedback.

António Nunes
SintraWorks

Posted: 17 October 2013 05:06 AM   [ # 2 ]
Newbie

Yes, editing in the overview would be great!

I appreciate the reply and the tips. I’m glad to know we can join words, and I’ll try the “neighboring text box” method.

Thanks again.

Posted: 17 October 2013 10:04 PM   [ # 3 ]
Newbie

Hello. A couple more questions:

1. Once editing has been “finalized,” is it possible to go back and make further corrections? This particular project is going to take some time to complete—perhaps even weeks. But I cannot simply leave this document open all that time. Does PDFN keep the data, so that it can be reopened and modified?

2. Sometimes I cannot make out the correct text, because of the red OCRed text that’s superimposed upon it. Is there a keyboard shortcut to toggle the display of the overlay? If not, that would be very helpful. (I think of Apple’s Aperture app for editing photos. At any point, you can hit a keyboard shortcut to see the original photo, then hit the same shortcut again to return to the edited version.)

Thanks.

Posted: 18 October 2013 05:55 AM   [ # 4 ]
Administrator

Ad 1: Short answer: NO. Slightly longer answer: No, but you can OCR only the pages you can/want to finish in the current session, and leave the rest for later.
Ad 2: That’s a good suggestion. I’ll put it on my list. Meanwhile, if you’re in word or letter scope, you can change to line scope. That should help clear the view a bit.

António Nunes
SintraWorks