Bullnose Enthusiasts Forum

Any Web or Document Gurus Out There?

Gary Lewis

Any Web or Document Gurus Out There?

Administrator

I'm struggling with how best to present information on the website, and wonder if any of y'all have suggestions.

First, my goal with the website is to make it the premier Bullnose documentation site in the world. Certainly not for my glory or gain, but to help our fellow Bullnosians. (Or is that Bullnosers?)

But in order to help our friends, they need to find the site, and for them to find the site Google needs to know about it. Therein lies the problem: Google doesn't read the words in pictures, and many of our ~550 pages are pictures of words. And pdfs don't really work either.

So I've been trying to convert pdfs into HTML, as I've been told that Google will find HTML - although I've yet to prove that's the case. And I'm not at all happy with the results. But you can be the judge of that by going to Driveline/Wheel Covers and looking at these tabs, working from left to right:

Pin To ID # Cross-Ref: This tab has two forms of the wheel cover cross-reference table from the MPC.
Up top is a screenshot of the table, which is a picture of words/numbers. Try to highlight some of them and copy them. Nope, no dice.

But on the bottom is a pdf of the document, and you can copy those numbers. In fact, you can hit the Full Screen icon on the bottom right and pop it out into a new browser window and search it. Or, you can download it. But, Google won't find it.

Cross-Ref Foxit: This tab shows an HTML version of the table that was generated by my pdf editor, Foxit. Compare it to the previous tab. Not very good. But you can search it by hitting Ctrl-F. However, for whatever reason the software decided that the prefix, like "D5TA", needed to be in a different box than the rest of the part number, so you can't find "D5TA 1000-BA" because there are extra characters in between.

Cross-Ref ABBYY: This HTML was generated by ABBYY FineReader, which is software that I've downloaded for a 30-day trial, and which is fairly expensive should I decide to buy. The code is better than the Foxit code since the whole part number was kept together. In fact, you can search for and find "D5TA 1000-BA". But the headers for the columns aren't where they should be, and if it does that on a small table like this, what'll it do on a 30-page file?

PDF Online: This is an on-line service that was supposed to be good. But the headers are messed up, it didn't convert the whole table, and you can't find "D5TA 1000-BA", in spite of it being there in plain sight.

Cross-Ref Adobe: This was generated by the much-heralded Adobe Acrobat program, which I've downloaded for a 7-day trial, and which is very expensive should I decide to buy. And the headers are messed up, there appear to be a couple of images missing, and you cannot find "D5TA" anything because it has translated it as "DSTA"!
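
For what it's worth, the two specific failures above (Foxit putting the prefix in its own box, and Adobe reading "D5TA" as "DSTA") look like things a small cleanup script might repair after the conversion. Here is a minimal sketch in Python, assuming the table has already been pulled out as rows of cell text. The function names and both patterns are hypothetical, based only on the "D5TA 1000-BA" example in this post and not on the full MPC numbering rules:

```python
import re

# Assumed shape of a prefix such as "D5TA": a letter, a digit, then two or
# three letters/digits. Illustrative only, not a complete rule for the MPC.
PREFIX = re.compile(r"^[A-Z][0-9][A-Z0-9]{2,3}$")

# Assumed shape of the rest of the number, such as "1000-BA".
BASE = re.compile(r"^\d{4,5}-[A-Z]{1,3}$")

# Letters that OCR tends to confuse with digits (assumed mapping).
DIGIT_LOOKALIKES = {"S": "5", "O": "0", "I": "1"}


def fix_prefix(token):
    """Repair a misread second character, e.g. 'DSTA' -> 'D5TA'."""
    if len(token) >= 4 and token[0].isalpha() and token[1] in DIGIT_LOOKALIKES:
        candidate = token[0] + DIGIT_LOOKALIKES[token[1]] + token[2:]
        if PREFIX.match(candidate):
            return candidate
    return token


def merge_split_part_numbers(row):
    """Join a prefix cell with the cell that follows it when the pair looks
    like one part number that the converter split in two."""
    merged = []
    i = 0
    while i < len(row):
        cell = row[i].strip()
        nxt = row[i + 1].strip() if i + 1 < len(row) else ""
        if BASE.match(nxt):
            # Only attempt the digit fix when a base number follows, so
            # ordinary words in other columns are left alone.
            prefix = fix_prefix(cell)
            if PREFIX.match(prefix):
                merged.append(f"{prefix} {nxt}")
                i += 2
                continue
        merged.append(cell)
        i += 1
    return merged
```

With that, `merge_split_part_numbers(["DSTA", "1000-BA"])` gives back a row with the single cell "D5TA 1000-BA". It does nothing for the misplaced headers or missing images, which would still need fixing by hand.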

I'm at my wit's end. Nothing I can find can convert a fairly simple pdf into accurate HTML. Am I doing something wrong? Is there better software out there? Is there a better approach? Am I tilting at windmills? Or perfecting ways of making sealing wax?

If you can help, please do. All suggestions appreciated.

Gary, AKA "Gary fellow"

Dad's: '81 F150 Ranger XLT 4x4: Down for restomod: Full-roller "stroked 351M" w/Trick Flow heads & intake, EEC-V SEFI/E4OD/3.50 gears w/Kevlar clutches
Blue: 2015 F150 Platinum 4x4 SuperCrew wearing Blue Jeans & sporting a 3.5L EB & Max Tow
Big Blue: 1985 F250HD 4x4: 460/ZF5/3.55's, D60 w/Ox locker & 10.25 Sterling/Trutrac, Blue Top & Borgeson, & EEC-V MAF/SEFI

Steve83

Re: Any Web or Document Gurus Out There?

Banned User

This is why I avoid text in images, and use SMN's captions so heavily. It's probably also why my SMN registries get so many hits. Google can find the plain-text captions.

But they're plain-text; no rich text, or formatting, or fonts, or bold, or italic, or underline, or tables (other than crude ASCII, which looks like crap in that site's font).

But I still think real ASCII text is the way to go for technical web pages whose text needs to be searchable. I don't know how the big-boys set up tables and spreadsheets on their web pages, so I can't help you on that. But I still think when you OCR a document, you should pull the text out, and then display it on your page so it looks similar to the original's fonts & formatting (unless you find a better layout for that particular data), but NOT an image of the original text.
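
As a rough illustration of "pull the text out, then display it on your page": once the OCR'd text is in rows, a few lines of Python can write it out as a plain HTML table, which is real text rather than an image. This is only a sketch under assumptions. The function name `rows_to_html_table` is made up, and the column headers in the usage note are placeholders, not the actual MPC headers:

```python
from html import escape


def rows_to_html_table(headers, rows):
    """Build a bare HTML table from plain-text headers and rows.

    Every cell goes through escape() so characters like & and < in the
    source text can't break the markup.
    """
    head = "".join(f"<th>{escape(h)}</th>" for h in headers)
    body = "\n".join(
        "  <tr>" + "".join(f"<td>{escape(c)}</td>" for c in row) + "</tr>"
        for row in rows
    )
    return f"<table>\n  <tr>{head}</tr>\n{body}\n</table>"
```

Calling `rows_to_html_table(["ID #", "Part Number"], [["1", "D5TA 1000-BA"]])` (placeholder values) produces a table whose part number sits in a single cell as searchable text. Fonts and layout would then be a matter of styling the page, separate from the text itself.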

I've downloaded some really good OCR software, but it's part of the install pack for a really old legal-size scanner (HP C7710A) whose drivers haven't been supported in Windows for the past 4 or 5 versions, and that pack isn't even on the HP site any more. I'd have to try to pull it off one of my old HDDs, without picking up the malware that made me replace them. So I haven't tried to load it on this machine to find out if it will work without detecting that scanner. I've wanted to dual-boot the last OS that supported that scanner, but I'm not comfortable or desperate enough to risk it yet.

Gary Lewis

Re: Any Web or Document Gurus Out There?

Administrator

Steve - I don't have a problem OCR'ing things. In fact, at the moment I have three applications loaded that will do it: Foxit, the app I've had for several years and the one that OCR'd the MPC; ABBYY FineReader, which I have on a 30-day trial; and Adobe Acrobat DC, supposedly the king of kings, which is on a 7-day trial.

But it looks like Foxit will be the winner, given a few problems encountered with the others and the fact that they are $300 and up. Foxit, by contrast, is already paid for and does a good, albeit not perfect, job. And it runs the scanner very nicely, creating nicely-straightened and OCR'd results in one go.

Concerning the page, as you know we aren't displaying pictures of text for the TSBs, nor for much of anything else we are doing going forward. Instead we are using pdfs with searchable text.

Last night I was discussing this with Keith Dickson, Mr FORDification, and he told me that he's been searching for years for a way to do "this", meaning get the search engines to find things like TSBs. But he's ruled out using HTML, basically for the same reason I have just now - nothing does the conversion well and it takes way too much time to edit the results - and editing results in HTML is neither my forte nor my desire.

But, having the pages in HTML certainly would be nice. Those misshapen HTML pages I put up last night have already been found and you can find "D5TA 1000-BA" on the website as of this morning. So one option would be to put both the pdf and the misshapen HTML on the page. The pdf would give the user a clean view of the TSB, and the HTML would be found by the search engines.

To test that theory I searched for "Rear Spring Squeak - Tip Liner And", which is a phrase that's in TSB 80-1-12-S REAR SPRING SQUEAK. Sure enough, even though that TSB has been in place for a week or so the search engines haven't found it - because it is actually a file that resides elsewhere with a link to it from the page, even though it looks like it is on the page. And now I've added the Adobe version of HTML for that TSB to the bottom of the page and have asked Google to crawl and index the page. So in a few hours we should be able to find anything on that TSB with a Google search, and later we'll be able to find it with other search engines as well.
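
Before waiting on a crawl, the same kind of test could be run locally: check whether the phrase exists as real text in the page's HTML, as opposed to living only in an image or a linked file. A minimal sketch using Python's standard `html.parser` module; the names `TextExtractor` and `page_contains` are made up for illustration, and this only looks at the HTML you hand it, so it says nothing about what Google will actually index:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of a page, ignoring script and style blocks."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def page_contains(html_text, phrase):
    """Return True if the phrase appears in the page's text, ignoring
    letter case and differences in whitespace."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    text = " ".join(" ".join(parser.parts).split()).lower()
    return " ".join(phrase.split()).lower() in text
```

Fed the saved HTML of a page, `page_contains(html, "Rear Spring Squeak - Tip Liner And")` would come back False for a page that only links to the pdf or shows a picture of it, and True once the converted HTML has been added.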

So please take a look and see what y'all think of doing it that way. It is ugly as the formatting is all wrong, there's an image missing, and so on. But it should work.

THOUGHTS?

Gary Lewis

Re: Any Web or Document Gurus Out There?

Administrator

Gary Lewis wrote

To test that theory I searched for "Rear Spring Squeak - Tip Liner And", which is a phrase that's in TSB 80-1-12-S REAR SPRING SQUEAK. Sure enough, even though that TSB has been in place for a week or so the search engines haven't found it - because it is actually a file that resides elsewhere with a link to it from the page, even though it looks like it is on the page. And now I've added the Adobe version of HTML for that TSB to the bottom of the page and have asked Google to crawl and index the page. So in a few hours we should be able to find anything on that TSB with a Google search, and later we'll be able to find it with other search engines as well.

And, just like clockwork you can now do a Google search for "Rear Spring Squeak - Tip Liner And" and the one and only hit you'll get is TSB 80-1-12-S REAR SPRING SQUEAK. So adding the HTML, as ugly as it is, does work.

I wonder if I can get Adobe to help me since I'm testing their software.