Finding the Perfect OCR


How I Learned to Stop Worrying and Love Apple (just kidding)

I tried many different ways to convert photos of news articles into machine-readable text before I found one that worked. Skip ahead to “Apple Monterey” if you don’t want to read the obligatory life story before the recipe!

I started my digital archive project with much bravado and enthusiasm! Five engrossing sessions at the Brooklyn College archives and 1,607 photos later, I realized I faced the daunting task of converting my photos of newspaper articles into text so I could code them qualitatively and construct categories.

From: 

Into:

“Tension Erupts between Caucus and Delegates”
By ELIZABETH LU SAN FRANCISCO – Hints of underlying dissatisfaction with the ” decision-making process • The Asian PacifiC Caucus leadership broke into open confrontation on the caucus ‘ last day of meetings during the Democratic Convention. Throughout most of the week, Asian Pacific delegates and alternates from several were dissatisfied with the fact that many caucus platform pavilions were presented to the Democratic National Committee with little input from the majority of caucus members delegates. Another Complaint was that the format of the daily caucus meetings precluded members from meeting and discussing issues of common concern.
The carefully but strongly worded challenges from over half a dozen delegates prompted the caucus to continue the meeting in closed sessions so that members could try to resolve their differences. The participants did not emerge for over an hour and a half, but when they did, most of them agreed that the session had been constructive and that tension had been eased. Uneasiness about the format of the caucus meetings was first expressed by California delegate Ying Lee Kelley, who raised a point of personal information by asking the chair for an opportunity to meet with other caucus members. Thomas Hsieh, president of the Asian Pacific Caucus was at first reluctant to deviate from the  morning’s agenda. However, many of the delegates in attendance persisted. Michael Yamaki. another California delegate, stood up and asserted to a round of applause that although the other caucus members stand up and briefly introduce themselves so that they could identify each other in the small crowd of approximately 60 persons. The caucus president also complied with a request to allocate 15 minutes near the end of the meeting for delegates to meet.

In other words, I needed to figure out an optical character recognition (OCR) solution. 

The first solution I thought about was Adobe Acrobat. It is a powerful, industry-standard tool that can convert a PDF you can’t search with Control F into a file that is editable and machine-readable.

I tried googling “CUNY adobe creative suite” and did find resources that suggested that due to Covid-19, Adobe was extending Creative Suite access to students at home in March 2020. However, by this winter, it seems like the license is once again restricted to “only…to students identified by their campus as being enrolled in a class requiring the use of Adobe Creative Suite.” Alas, not me.

Next, I tried three more wacky ideas: uploading my photos of newspaper articles to Evernote, Google Cloud Vision, and an open-source ocr2text package.

Evernote was a bust: even though its embedded OCR system does make images text searchable, it does not spit out editable text (explained here).

Google Cloud Vision was interesting; try it here: https://cloud.google.com/vision (demo). As you can see, the resulting text has a space between each word! Although it can recognize the elements of the article (paragraphs, titles, etc.), it also produced spacing issues (see picture below), and all that text was unwieldy for me to handle. Unfortunately, this was not very helpful, but I also didn’t really want to help a big tech company get better at reading.

Finally, ocr2text! It seemed promising and everything was smooth sailing. Even using the command line to install Tesseract-OCR was a piece of cake. But as soon as I ran the .py file, what I got out was gibberish (there’s a specific Japanese word for it)!
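If you want to try Tesseract directly rather than through ocr2text, a minimal sketch might look like the one below. It is not the script I ran. It assumes the `tesseract` command-line program is installed and on your PATH, and the file name `article.jpg` is hypothetical. The `--psm 6` setting (treat the image as a single uniform block of text) is also an assumption: it may help if you crop each photo to a single column first, but I can’t promise it will cure gibberish.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, lang="eng", psm=6):
    """Build the argument list for the tesseract command-line program."""
    # Passing "stdout" as the output base makes tesseract print the text
    # instead of writing a .txt file.
    return ["tesseract", str(image_path), "stdout", "-l", lang, "--psm", str(psm)]


def ocr_image(image_path, lang="eng", psm=6):
    """Run tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on the PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang, psm),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    photo = Path("article.jpg")  # hypothetical file name (assumption)
    if photo.exists():
        print(ocr_image(photo))
```

Building the command in its own function keeps the options easy to tweak (for example, trying a different `psm` value) without touching the code that runs the program.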

Apple Monterey

At my wits’ end, I found out that the newest version of the Apple operating system, Monterey, actually includes an OCR function in Preview. And it works!

Here’s how:

  1. Select the section you want to convert with the “Rectangular Selection” tool (Preview → Tools → Rectangular Selection).
  2. If you select “Automatic Selection,” the program should recognize which step you’re doing but you might need to toggle between the text selection and rectangular selection tool.
  3. It’s important that, in this example, we copy one column at a time because the program doesn’t recognize the separated columns — if you just selected the whole article, the program would read the article as one massive text rather than three separate columns. 

4. Copy the selection and paste it into a new file with “New From Clipboard” (Preview → File → “New From Clipboard”)

5. Use text selection to highlight the text you want to copy (Preview → Tools → Text Selection)

6. Paste text selection into a TextEdit file 

7. If applicable (your pasted text might not have paragraph breaks), use Find (Command F) to search for the Paragraph Breaks, which you can find by clicking the little magnifying glass → “Insert Pattern”, and replace them with a blank space “ “.

8. Repeat the process until you’re done! And clean up the TextEdit document : )
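If you would rather not do the find-and-replace in step 7 by hand for every column, the same cleanup can be sketched in a few lines of Python. This is only an illustration, not part of the Preview workflow above: the function name `join_pasted_lines` and the sample text are made up, and it assumes you want every line break turned into a single space.

```python
def join_pasted_lines(text):
    """Replace every line break with one space and collapse extra spaces."""
    # str.split() with no arguments splits on any run of whitespace,
    # including newlines, so joining with " " leaves one space per gap.
    return " ".join(text.split())


# Hypothetical pasted column with a break after each printed line.
pasted = "Tension Erupts between\nCaucus and  Delegates\n"
print(join_pasted_lines(pasted))
# prints: Tension Erupts between Caucus and Delegates
```

Note that this also removes real paragraph breaks, just like the manual replace does, so you would still need to put those back while cleaning up.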

Even though this resource is not open-source, I recommend it if you are in a pinch and need your photos converted. There are Macs in the Mina Rees Library that will have Adobe Acrobat Pro and would have Monterey installed.

For those on Windows, Capture2Text is an open-source tool that performs a similar function. If you try it, let us know how it works out, and if you find better open-source OCR solutions, please reach out to gc.digitalfellows@gmail.com! I would love to edit this post to reflect the best ways for people to convert photos into machine-readable text!
