Monday, October 17, 2011

Document Scanning and Capture Planning - Part 4 - Document Scanning Models


Document Scanning Models

After doing some planning on the hardware types and document scanning volumes, the next step is to examine which model you need to deploy.  There are typically three standard models for document scanning and capture: Centralized, Decentralized and Distributed.
Each model has its own pros/cons, and below I will examine each, and dive into some detail.
Centralized
Ah, the centralized model.  Some call this old school scanning and capture, as for many years, this was the only way to get the job done, and convert your paper to digital form.  This model provides a centralized scanning center to provide mass conversion for the organization.  The operation can be run by in house personnel, be managed by a services provider in house, or be outsourced to a scanning service bureau.  It requires high volume/high speed hardware, and typically utilizes advanced capture software to allow for the utmost in automation and efficiency.  The software and hardware operators are typically highly trained, and there are usually only a few of them.  Paper and/or digital media is shipped to the centralized location and processed through a set, standardized capture workflow.
Centralized Pros
  • Easily standardized process due to a limited number of skilled/trained scan operators
  • High speed hardware/software results in minimal processing time once paper is received
  • Centralized reporting and control of overall process
  • No loading on WAN infrastructure
  • Centralized backup and restore
Centralized Cons
  • Usually a high time delay for availability of documents
  • High cost due to shipping of documents
  • High maintenance costs
  • High training costs to bring on new operators
  • Disaster recovery planning issues if centralized site is down
  • Operators are typically not knowledgeable in the documents they are indexing
Decentralized
Over time, as bandwidth and scanning hardware/software prices went down, the obvious move was to decentralize the whole scanning and capture process.  This move placed scanning in the branches, and allowed the whole document capture process to be performed by those who had working knowledge of the documents.  Smaller, desktop class hardware could be used, and most capture companies made batch scanning and upload to the centralized repository simple to accomplish.
Decentralized Pros
  • Scan operators are well versed in the documents they scan
  • Documents are available almost immediately
  • No shipping or transfer costs for documents
  • Branch control of the whole scanning process
Decentralized Cons
  • Standardization can be an issue
  • No centralized control or reporting
  • WAN Bandwidth consumption can be high
  • Licensing costs can be high depending on software utilized
Distributed
The advance of network-based scanning devices and the lowering of bandwidth pricing led to the newest model, the Distributed Model.  Distributed Scanning allows for just about anyone in the organization to walk up to a network scanning device/scanning copier/fax machine and send documents to a repository.  The devices are typically multi-faceted, and along with repository integration, can provide scan to network folder, FTP and email.  Collaborative back-end systems, like Microsoft SharePoint, lend themselves nicely to this model, as they allow anyone to participate in a Document Workspace.
Distributed Pros
  • Put scanning in the hands of everyone in the organization
  • Provides a great launching pad for collaborative solutions
  • Simple, easy to use interfaces allow for minimal training and quick adoption
  • Capture and indexing is now in the hands of the true document owner
  • One-to-many solution provides a single device to service many users
Distributed Cons
  • Lack of standardization without software addition
  • Security and document control can be major issues
  • Bandwidth from smaller branches can be a problem with larger scans
  • Lack of hardware integrations with back-end systems
So, most organizations today are combining the above models to create a Hybrid Scanning and Capture solution, and leveraging all the strengths together to minimize the weaknesses of any one model.   Another strategy is to tie scanning models to specific business processes, as most lend themselves nicely to specific scanning and capture workflows.

Hardware and Choosing Your Scanning Model


Most organizations will choose their model to leverage their existing hardware investment, but this can lead to decisions that seem good at the time; on deeper examination, it can make sense to realign hardware with the best model.  Take, for example, a company that instantly leans toward a distributed model, and attempts to leverage its copier fleet that is currently under lease.  As the part of this guide that covers scanning hardware explains, copiers will not always fit the type of scanning you need to perform.  Consider a branch accounting department that is looking to scan receipts or check stubs.  Will the copier perform well with mixed original sizes?  Just a word of caution: examine the paper, workflow, and document types to get the best feel and adopt the best model.

Tuesday, August 2, 2011

Document Scanning and Capture Planning - Part 3 - Scanning Hardware

Now that I have covered Sizing and Storage in Part 1, and Document Separation in Part 2, we can start to take a look at scanning hardware.  There are several key questions you need to answer:  Can I use pre-existing hardware such as copiers or fax machines?  Do I need a dedicated scanner?  If I choose to buy a scanner, what features/characteristics are important?

Some may argue you need to decide on a scanning model before you dive into hardware (distributed, centralized, or decentralized), but I will cover this in the next section.

So let’s start with a key question:

Scanning Copier or Dedicated Scanner??

Scanning Multifunction Peripherals (MFPs/copiers) have become standard in most offices. I receive the same question all the time from prospects and customers: Can’t I just use my copier for scanning? In many cases, for a typical office, with typical documents, a copier can be an appropriate component to any scanning solution. As offices become more complex in the way they handle their documents, or they expand their scanning efforts to other departments, dedicated scanners are usually required to achieve the desired result.

Below are some interesting statistics provided by InfoTrends:

· 65% of office workers use digital copiers/MFPs
· Over 50% use the “scan” feature daily
· 71% expect scanning requirements to increase from year to year
· 72% believe it is necessary to view images before processing
· 36% will require dedicated scanners versus MFP devices
· 36% believe they will need both scanners and MFPs

So what are the benefits/drawbacks to scanning with both types of devices? Below is a summary:

Benefits of MFPs as scanners:

  • Leverage your existing investment in the MFP
  • Most copier maintenance plans do not charge for scans, so you get “free” maintenance for the scanning function (no print/copy, no click charge)
  • MFP manufacturers are really focusing on scanning capabilities: fast speeds, better quality and enhanced drivers, etc.
  • Network scanning functions:
      • Scan to email
      • Scan to Windows Folders
      • Scan to FTP
  • One-to-Many relationship: all workers can use one device.

Drawbacks of MFPs:

  • Contention – copying, scanning and printing may cause “a line at the copier”
  • Poor performance with differing paper sizes
  • Lack of color dropout (Scanning blue or black backgrounds will result in a black page)
  • Lack of image correction capabilities (auto deskew, despeckle, black border removal, streak removal, etc.)
  • Small Document Feeder sizes (50 – 100 pages)
  • On average, file sizes are 10-20% larger
  • Duplex scanning/DPI increase greatly slows down rated speed
  • Black and White scanning only on some models

Benefits of Dedicated Scanners:

  • Convenience – scan at your desk
  • Duplexing does not slow down scanner
  • Color dropout
  • Superior image quality due to enhancement features
  • Ease in handling differing paper sizes/types
  • Larger document feeder selections (up to 1000+ pages)
  • Smaller file sizes
  • Ability to preview scanned documents at scan time

Drawbacks of Dedicated Scanners:
  • One to One relationship – directly connected to PC
  • Additional Maintenance costs

Above are all the pluses and minuses, but in a nutshell, when should you use a dedicated scanner?

  • Scanning 50+ documents per day
  • Workers that are constantly scanning throughout the day
  • Mixed paper sizes, weights and colors
  • Poor quality, older documents or when image enhancement is required
  • OCR or ICR applications
  • High volume copying and printing environments
  • Large Document scanning
  • High security environments

Now that you have an idea of the pros/cons of both types of scanning devices, let’s take a look at the different features of scanning devices, and what to look for when purchasing a dedicated scanner.


Scanning Speed

Scanning speed is a main area of focus when researching scanning hardware. A scanner’s speed is usually directly proportional to its price, but you have to ask yourself one question: How long do you have to accomplish your scanning tasks? If you buy that cheapo scanner at an office products store that scans at 8 pages per minute, good luck in getting those 10 file cabinets scanned. Another note to mention is that all the manufacturers rate their scanner speeds at 200 DPI. If you need high quality images, or are performing OCR, 300 DPI will probably be necessary. This will significantly slow down your scanning speed, as will color scanning and duplex (2-sided) scanning on some models.
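To put some numbers behind the speed question, here is a minimal sketch of the arithmetic.  It assumes the planning figure of 10,000 pages per 4-drawer file cabinet from Part 1 of this series; the function name, the 60 pages per minute comparison scanner, and the speed_factor parameter are my own illustrative assumptions, not manufacturer figures.

```python
def scan_hours(pages: int, rated_ppm: float, speed_factor: float = 1.0) -> float:
    """Hours of continuous feeding needed to scan `pages` sheets.

    speed_factor is a hypothetical derating of the rated speed
    (1.0 = rated speed at 200 DPI). Measure your own scanner to find
    the real value at 300 DPI, in color, or in duplex.
    """
    if rated_ppm <= 0 or speed_factor <= 0:
        raise ValueError("rated_ppm and speed_factor must be positive")
    effective_ppm = rated_ppm * speed_factor
    return pages / effective_ppm / 60


PAGES_PER_CABINET = 10_000  # planning figure from Part 1 of this series
backlog = 10 * PAGES_PER_CABINET  # the 10 file cabinets mentioned above

slow = scan_hours(backlog, rated_ppm=8)
fast = scan_hours(backlog, rated_ppm=60)  # 60 ppm is an illustrative figure

print(f"8 ppm:  {slow:.1f} hours")
print(f"60 ppm: {fast:.1f} hours")
```

This only counts time with paper moving through the feeder.  Document preparation, reloading, indexing and rescans all come on top of it.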

Document Feeder Capacity

The document feeder provides you the ability to load anywhere from 1-1000+ sheets into the scanner. The feeder capacity you require all depends on the volume of paperwork you are scanning, and if you are using an intelligent capture application that provides the ability to use separator sheets to split documents automatically. If you are a Law Firm that routinely scans 200 page documents, then that is a good starting point for your feeder size requirements. This allows you to load your documents, and then let the scanner do the work.

Another focus area related to the feeder is the maximum and minimum paper sizes. If you intend to scan legal size paper or insurance cards, make sure the scanner can handle them.

Daily Duty Cycle

The Duty Cycle (DC) is a rating of the scanner’s durability, and defines just how much paper you can feed through the hardware in a day. If you are scanning 3000 pages per day, you do not want to buy a small desktop scanner with a DC of 750. What happens if you exceed this number? Nothing to begin with, but as time goes on the wear and tear on the unit will begin to show in the form of jams, misfeeds, skewing, etc. This number is also tied to the replacement of consumables (rollers and pads). If you continually exceed the DC, you will more than pay for a higher level scanner in consumables over time, and your maintenance costs may go way up.
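The duty cycle check is simple enough to write down.  The sketch below compares a planned daily volume against a few scanner classes; the class names and ratings are hypothetical placeholders, so plug in the numbers from the spec sheets you are actually comparing.

```python
def overloaded(daily_pages: int, duty_cycle: int) -> bool:
    """True when the planned daily volume exceeds the rated daily duty cycle."""
    return daily_pages > duty_cycle


# Hypothetical scanner classes and daily duty cycle ratings, for illustration only.
candidates = {"small desktop": 750, "workgroup": 4000, "production": 20000}

daily_pages = 3000
suitable = [name for name, dc in candidates.items() if not overloaded(daily_pages, dc)]
print(suitable)
```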

Scanning Mode

Most scanners nowadays can scan both sides of your document, but there are still some lingering models that will only do simplex scanning. Also, if you have the requirement to scan color documents, ensure that color scanning is supported.

Warranty and Service

All warranties are not created equal. Some scanner manufacturers provide “depot” type service where you have to ship your scanner for warranty service. Others will provide onsite warranty service for a specified period of time. Along with this, the time period on the warranty also varies everywhere from 30 days, to a full year. Scanner service is a separate purchase, and in some cases, can be a shock to the purchaser. A basic service plan on a mid-range scanner can cost over $1000 per year. Get an advanced plan that provides Preventative Maintenance visits, and you could be in the $1500 - $2000 range, depending on your model. Get all the details up front, and some manufacturers will provide multi-year discounts on service.

Image Processing

Definitely investigate the image processing software that comes bundled with your scanner.  This software will improve the quality of your images, remove shading, borders, etc.  Many of the manufacturers now provide third party image processing software (Kofax VRS), but several have their own built into their drivers.  Most capture software also has built in image processing components as well.

So hopefully this will answer the majority of your questions on hardware.  Remember, hardware is just part of the overall capture solution.  Follow on articles will cover information on software selection and required features.

Friday, July 8, 2011

Document Capture and Scanning Planning - Part 2

Document Examination and Separation


One of the key steps in preparing for document scanning and capture is to identify how you will separate or split documents.  What is separation and how does it work?  Details below:

For those of you that are new to document management and capture, document separation is the notion of how we can determine when a document begins and ends.  With most simple scanning software, this process is easy.  You load a single document in the feeder, click scan, and when it is done, you name it and save it.  With advanced capture, you can load multiple documents into the feeder, scan them all at once, and use a separation method to split them into individual digital documents.    This is a massive time saver.  Imagine loading 20 individual documents into a scanner one at a time, scanning each individually, and then entering information about each.   Below are some key separation methods any advanced capture suite should have:

Fixed Page Count Separation – This allows you to split based on a certain page count.  So if you scan a stack of 100 pages made up of two page forms, you will have 50 separate documents in your capture interface.
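As a rough illustration of fixed page count separation, the sketch below splits a scanned batch into documents of a set length.  The function name and the page labels are made up for the example; a real capture application does this internally.

```python
def split_fixed(pages: list, pages_per_doc: int) -> list[list]:
    """Cut a scanned batch into consecutive documents of pages_per_doc pages."""
    if pages_per_doc < 1:
        raise ValueError("pages_per_doc must be at least 1")
    return [pages[i:i + pages_per_doc] for i in range(0, len(pages), pages_per_doc)]


batch = [f"page-{n}" for n in range(1, 101)]  # 100 scanned pages
docs = split_fixed(batch, 2)
print(len(docs), docs[0])
```

Note that a single missing or extra page shifts every document after it, which is why this method works best on clean, uniform forms.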

Barcode Separation – probably the most pervasive separation method is a barcode separator.  Place a sheet with a specific barcode pattern between each document, and you are off to the races.  To give you the most flexibility, applications should support the following enhanced barcode separation methods:

  • Separate on any barcode
  • Separate on specific barcode terms and patterns
  • Separate on barcode type
  • Separate on barcode count
  • Separate on a certain number of barcodes on a page
  • Separate when a barcode changes

You want to make sure your barcode engine supports 1D and 2D barcodes without the purchase of any expensive modules or add-ons, and it should also have a simple feature that lets you split 2D barcodes and identify separation terms.
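To make two of the barcode methods concrete, here is a minimal sketch of separator sheet splitting and "separate when a barcode changes".  It assumes the barcodes have already been decoded by the capture engine; the Page class, the "SEP" value and the invoice numbers are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    name: str
    barcodes: list[str] = field(default_factory=list)  # values decoded upstream


def split_on_separator(pages, is_separator):
    """Close the current document at each separator sheet and drop the sheet."""
    docs, current = [], []
    for page in pages:
        if is_separator(page):
            if current:
                docs.append(current)
            current = []
        else:
            current.append(page)
    if current:
        docs.append(current)
    return docs


def split_on_change(pages):
    """Start a new document when the first barcode on a page differs from the
    last value seen. Pages without a barcode stay with the current document."""
    docs, current, last = [], [], None
    for page in pages:
        value = page.barcodes[0] if page.barcodes else None
        if value is not None and value != last and current:
            docs.append(current)
            current = []
        if value is not None:
            last = value
        current.append(page)
    if current:
        docs.append(current)
    return docs


batch = [Page("a1"), Page("a2"), Page("sep", ["SEP"]), Page("b1"),
         Page("sep", ["SEP"]), Page("c1"), Page("c2"), Page("c3")]
by_sheet = split_on_separator(batch, lambda p: "SEP" in p.barcodes)
print([[p.name for p in doc] for doc in by_sheet])

invoices = [Page("p1", ["INV-1"]), Page("p2"), Page("p3", ["INV-1"]),
            Page("p4", ["INV-2"]), Page("p5")]
by_change = split_on_change(invoices)
print([[p.name for p in doc] for doc in by_change])
```

The other methods in the list (specific terms, barcode type, barcode count) are variations on the is_separator test passed to the first function.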

Patch Code Separation – So what the heck is a patch code?  Just an old school horizontal barcode.  If you work in the medical field, most medical billing forms will have these on them, and some scanners actually support using patch codes to shift scanner settings during the scanning process.  For flexibility, choose an application that supports patch code separation.

Optical Character Recognition (OCR) Separation – OCR is the process of converting a scanned or imported image into searchable text.  OCR separation searches for a key word, term or phrase on the document, and will recognize that page as the first page in a new document.  This is a preferred method, as you don’t have to kill trees to print cover sheets, and it makes document preparation simple (no inserting separator sheets).  For example, if you are scanning contracts, and you want to split when you find an 8 digit contract number in the right hand corner, this comes in very handy.  There are several key requirements in this feature that are absolutely required in your application to make sure you get high separation accuracy:


  • Scan at 200 or 300DPI and use an app that has image processing software to clean up the page.  Also, your image processing engine must allow processing of imported PDFs and TIFFs if you plan to harvest documents.  Some image correction/processing engines only work with scanners.
  • Ensure your capture application allows you to use expression matching (regular expressions) so you have the utmost flexibility in finding separation patterns.
  • Character sets are key.  These provide the ability to tell the OCR engine the type of characters you are looking for (A-Z, 0-9, etc), so if it misidentifies a character, it auto-corrects the information.
  • Finally, top line applications also allow you to separate when OCR terms change.  So you can look for that contract number, and only split when you find a new one.
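The requirements above can be sketched in a few lines.  This example assumes the OCR engine has already produced text for each page; it uses a regular expression to find an 8 digit contract number, a small digits-only "character set" correction for common misreads, and splits only when the number changes.  The lookalike table and the sample page text are illustrative assumptions, not the behavior of any particular OCR product.

```python
import re

# An 8 character token made of digits or letters that OCR often confuses with digits.
CANDIDATE = re.compile(r"\b[0-9OIlSB]{8}\b")
# Digits-only character set: map the lookalike letters back to digits.
TO_DIGITS = str.maketrans("OIlSB", "01158")


def contract_number(text):
    """Return the first 8 digit contract number on the page, or None."""
    match = CANDIDATE.search(text)
    return match.group().translate(TO_DIGITS) if match else None


def split_when_number_changes(page_texts):
    """Group page indexes into documents, splitting when a new number appears."""
    docs, current, last = [], [], None
    for index, text in enumerate(page_texts):
        number = contract_number(text)
        if number is not None and number != last:
            if current:
                docs.append(current)
            current, last = [], number
        current.append(index)
    if current:
        docs.append(current)
    return docs


pages = [
    "Contract 10000001 page 1",
    "terms and conditions",
    "Contract 1000000l signature page",  # OCR misread the final 1 as a lowercase L
    "Contract 20000002 page 1",
]
print(split_when_number_changes(pages))
```

In practice you would also restrict the search to the zone of the page where the number is printed, which cuts down on false matches.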
Intelligent Character Recognition (ICR) Separation – ICR is the process of converting scanned images of hand printing to text.  This method can be utilized to split pages when certain patterns in hand printing are detected.  Note:  all of the features required to ensure accuracy for OCR separation should also be considered if you utilize this method.

Document Import and Separation – There are several separation methods that can be key to success if you need to import large volumes of documents, or you want to process documents scanned from copiers, network scanners, or fax machines.  Below are several separation methods required for any document capture from imported files:
  • New File Separation – This method of separation will look at a directory, pick up files, and maintain each new file as its own digital document.
  • Folder-based separation – This is a key method if you are importing documents and want to combine them based on the folder.  One example might be a law firm that has a folder structure of case documents on different subjects for the case and wants to combine each folder into a single PDF file.
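Here is a minimal sketch of the folder-based grouping step.  It only decides which files belong together; actually merging each group into a single PDF needs a PDF library and is left out.  The folder and file names are made up, and the temporary directory is there only so the example runs anywhere.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def group_by_folder(root: Path, pattern: str = "*.pdf") -> dict[str, list[str]]:
    """Map each folder (relative to root) to the sorted file names inside it."""
    groups = defaultdict(list)
    for path in sorted(root.rglob(pattern)):
        folder = path.parent.relative_to(root).as_posix()
        groups[folder].append(path.name)
    return dict(groups)


# Build a throwaway folder tree so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in ("case-001/complaint.pdf", "case-001/exhibit-a.pdf", "case-002/motion.pdf"):
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(b"")  # empty placeholder, not a real PDF
    groups = group_by_folder(root)

for folder, files in groups.items():
    print(folder, "->", files)
```

New file separation is the degenerate case: every file returned by the directory scan becomes its own document.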


Blank Page Separation – I only mention this as I would always, always avoid it unless absolutely necessary, especially if you are scanning in duplex.  Most implementations of this method, unless run under strict preparation by knowledgeable operators, become an absolute mess. (Just my humble opinion ;)  )

Separation Scripting – Finally, for those rare and special occasions, you always want a product that has a pre-built scripting interface for customizing the whole process if necessary.  Now let me be clear: not a sales rep “Yeah we can do that” (which usually means $20,000 in professional services), but a product that has simple hooks into the separation function and allows a simple “yes or no” decision, based on some parameter or criteria, that anyone with basic scripting skills can write.  When would you use something like this?  Usually for very complex jobs where the original documents cannot be modified, but you need to put some logic in place to split documents.
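To show what a simple "yes or no" hook might look like, here is a sketch of a separation loop that asks a user-written rule whether each page starts a new document.  This is not the scripting interface of any specific capture product; the function names and the page_no field are hypothetical.

```python
from typing import Callable, Sequence


def separate(pages: Sequence[dict],
             starts_new_document: Callable[[dict, dict], bool]) -> list[list[dict]]:
    """Walk the batch and ask the hook a yes/no question for each page.

    The hook receives the previous page and the current page and returns
    True when the current page should begin a new document.
    """
    docs: list[list[dict]] = []
    for page in pages:
        if not docs or starts_new_document(docs[-1][-1], page):
            docs.append([page])
        else:
            docs[-1].append(page)
    return docs


# Example rule: split whenever the page number read from the page resets to 1.
def page_counter_reset(previous: dict, current: dict) -> bool:
    return current.get("page_no") == 1


batch = [{"page_no": n} for n in (1, 2, 3, 1, 1, 2)]
docs = separate(batch, page_counter_reset)
print([len(doc) for doc in docs])
```

The point of the design is that the rule is the only part the customer writes; the loop, the batch handling and the output stay inside the product.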

The last separation topic I want to cover is something called triggered separation.  Let me set the stage on this one, and describe a process which is near and dear to every accounting manager’s heart: invoices.  So you have a stack of invoices, some single page, some multi-page, and you are struck with a dilemma.  If I use barcode separators, and I have 100 single page invoices, do I really have to put 100 barcode separators between them all?  Separation triggers allow you to scan single page and multi-page documents all together.  So in this example, you can stack your single page invoices, and then put barcode separators between your variable length invoices.  Put a trigger sheet between the two stacks (this tells the capture software to switch from single page separation to barcode-based separation), and scan the whole stack in one fell swoop.  This is a huge time saver in high volume environments, and can allow you to also build redundant separation logic, so you get the highest accuracy in separation with the least amount of document preparation.  Phewwww.  That was geeky.
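The invoice example can be sketched as a small state machine.  It assumes each sheet has already been classified as a normal page, a trigger sheet or a barcode separator; the sheet kinds and names below are hypothetical labels for the example.

```python
def triggered_split(sheets):
    """Split a mixed stack: one document per page until the trigger sheet,
    then barcode separator sheets mark the document boundaries."""
    docs, current, mode = [], [], "single"
    for kind, name in sheets:
        if kind == "trigger":
            # Switch from single page separation to barcode-based separation.
            if current:
                docs.append(current)
                current = []
            mode = "barcode"
        elif kind == "separator":
            if current:
                docs.append(current)
                current = []
        elif mode == "single":
            docs.append([name])
        else:
            current.append(name)
    if current:
        docs.append(current)
    return docs


stack = [
    ("page", "s1"), ("page", "s2"), ("page", "s3"),  # single page invoices
    ("trigger", ""),
    ("page", "m1a"), ("page", "m1b"),
    ("separator", ""),
    ("page", "m2a"), ("page", "m2b"), ("page", "m2c"),
]
print(triggered_split(stack))
```

Three single page invoices and two multi-page invoices come out as five documents with one trigger sheet and one separator, instead of four separators.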


Do you really need all of this?  Does separation have to be that complex?  The whole goal here is to have as much as you possibly can in the tool kit to ensure you can meet all the capture needs within your organization.  I liken it to buying a base model with no accessories, and then wishing every day you had one feature or another.

So now you have examined your documents, and figured out how to efficiently scan and split.

Wednesday, July 6, 2011

Document Scanning and Capture Planning - Part 1 - Sizing and Storage

Been wanting to do this for quite some time, and finally had some time to sit down and put thoughts together.  I find that many of the scanning and capture implementations lack overall direction, structure and standardization.  I wanted to put together a manual from my experiences, and ask the community to add so we can build a reference for everyone to use.  This will be composed of many parts, including all different topics like storage, hardware, designing your index fields, etc.

Sizing and Storage Planning for Document Management and Scanning



One of the key areas of planning for any scanning/capture implementation is sizing and storage.   Many of the customers we work with have no real grasp on the volume of paper they deal with on a day to day basis, and when they make the migration to digitizing their paper, they are often quite surprised at the amount of paper they push through the system.  Obviously, this can cause some serious issues on many different fronts.   So how do you estimate the amount of paper?  There are several key conversion factors used by the document management industry, as outlined below:

Description                               Number of Pages      Storage
1 Scanned Page – 8.5 x 11                 1                    50KB
1 Scanned Page – 11 x 17                  1                    100KB
1 File Cabinet – 4 drawers                10,000               500MB
1 Box                                     2,500                125MB
1 Linear Inch                             100                  5MB
1 E Size Engineering Drawing (48 x 36)    16 (as 8.5 x 11)     800KB



This table is a basic planning tool, and can be used as a starting point.  One thing to remember is that these are all standard pages.  Not full image magazine pages, but full text pages.  The other thing to keep in mind is that we have listed for boxes and file cabinets, the average number of pages contained within.  In the imaging world, we deal with images, not pages.  What is the difference?  A page may have 2 sides, which are converted digitally into 2 images.  So effectively, if you have a box with double sided pages you are scanning, you will have to double the storage required.
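Here is a minimal sketch of how the conversion factors and the page versus image distinction combine into a storage estimate.  It reads the 4-drawer cabinet figure as 10,000 pages (consistent with 500MB at 50KB per page) and treats 1MB as 1000KB, as the table does.  The function name, the duplex_fraction parameter and the sample inventory are my own assumptions for illustration.

```python
KB_PER_IMAGE = 50  # one side of a standard 8.5 x 11 text page
PAGES_PER = {"file_cabinet": 10_000, "box": 2_500, "linear_inch": 100}


def estimate(counts: dict[str, float], duplex_fraction: float = 0.0):
    """Return (pages, images, megabytes) for an inventory of paper.

    duplex_fraction is the share of pages printed on both sides;
    each of those pages produces two images instead of one.
    """
    if not 0.0 <= duplex_fraction <= 1.0:
        raise ValueError("duplex_fraction must be between 0 and 1")
    pages = sum(PAGES_PER[unit] * n for unit, n in counts.items())
    images = pages * (1 + duplex_fraction)
    megabytes = images * KB_PER_IMAGE / 1000
    return pages, images, megabytes


# Example inventory: 3 file cabinets and 12 boxes, half the pages double sided.
pages, images, megabytes = estimate({"file_cabinet": 3, "box": 12}, duplex_fraction=0.5)
print(f"{pages:,} pages -> {images:,.0f} images -> {megabytes:,.0f} MB")
```

Treat the result as a starting point only; the DPI, color mode and image cleanup factors discussed next can move it considerably.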
Some other key factors that can contribute to storage and sizing:

DPI Setting – one of the key questions we always receive is What DPI should I set on my scanner?  For most basic scanning and archive applications, you can set your scanner to 200 DPI.  If you are doing OCR or any type of advanced data extraction, you always want a 300 DPI image for maximum accuracy.  Anything beyond that is just a space killer, will slow down your process and really bloat your files.

Black and White, Greyscale and Color – always use black and white scanning to keep file sizes at an absolute minimum.  Greyscale and color scanning should only be used when absolutely necessary, as file sizes are just crazy.  Below is a table of file sizes for the same letter.  The letter was about 50% page coverage.

Scanning Mode/DPI              File Size
Black and White – 200 DPI      26K
Black and White – 300 DPI      38K
Black and White – 400 DPI      51K
Black and White – 600 DPI      80K
Greyscale – 300 DPI            301K
Color – 300 DPI                577K
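To see how quickly these numbers add up, the short sketch below expresses each mode as a multiple of the 200 DPI black and white file.  The sizes are the ones measured for the sample letter above; your own documents will vary.

```python
# File sizes in KB for the same sample letter, taken from the table above.
SIZES_KB = {
    "Black and White 200 DPI": 26,
    "Black and White 300 DPI": 38,
    "Black and White 400 DPI": 51,
    "Black and White 600 DPI": 80,
    "Greyscale 300 DPI": 301,
    "Color 300 DPI": 577,
}

baseline = SIZES_KB["Black and White 200 DPI"]
ratios = {mode: kb / baseline for mode, kb in SIZES_KB.items()}

for mode, kb in SIZES_KB.items():
    print(f"{mode:<26}{kb:>5} KB  {ratios[mode]:5.1f}x")
```

For this letter, color at 300 DPI is roughly 22 times the size of black and white at 200 DPI, which is why color should be reserved for documents that really need it.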

Image Processing – image cleanup can significantly reduce file sizes, and it is very important to use this feature whenever you can.  Despeckle, deshade, border removal, etc. will eliminate unnecessary noise in scanned images, and reduce your storage requirement by 10-30% depending on the quality of your documents.

Image Format – There is a lot of misinformation on the market about TIFF versus PDF.  I always hear “We want to store as TIFF because PDFs are just too big.”  Just not the case.  An image scanned to PDF is just a TIFF in PDF clothing (or a PDF wrapper, to be more exact).  The PDF overhead is almost negligible.  The de facto standard in imaging today is rapidly becoming the PDF image with hidden text.  This gives you a nice little file with the pristine image, and converted OCR text in the background.  The text layer adds negligible size to the file.

So now, with all this info, you can estimate volume in images, and then come up with required storage on a monthly, yearly or project basis.

Sunday, July 3, 2011

SharePoint and the Document Management Industry

We are talking denial, and I ain't talking about a river in Egypt (Sorry for the bad joke)

I see it every day, and the misinformation out there about Microsoft SharePoint is just crazy.  First off, let me establish my position.  SharePoint has mapped to the typical Microsoft pattern from a product perspective.  Version 1.0 is usually lacking, causes great pain, and sours many IT folks.  2.0 starts to really get some traction, and people start taking notice, early adopters (also known as gluttons for punishment) go all in, and they continue to gather information for Version 3.  Version 3, they knock it out of the park, address needs, and most IT jump in after service pack 1.  This probably accounts for the slow adoption rates (Some interesting SharePoint Stats here)

SharePoint 2010 and all its wonder is taking business by storm.  It is an incredible tool, when used correctly, as a collaboration tool, document repository and overall business automation tool.  Depending on the business size, structure and industry, what I am finding is that it solves pain points.  Take for example the document capture implementation we just finished.  The customer was looking to eliminate Xerox DocuShare from their organization as they were having too many issues, and could not get adequate support from the vendor.  As a large mining company operating in several large countries in South America, they were having a hell of a time dealing with all their paper invoices, and were looking to automate their scanning and invoice processing. They took a leap, and the pilot project was designed to capture and process invoices within their Chilean AP department.  The project was an absolute success, and was implemented within a week's time.

Simple.  Effective.  Done.

Now if you were to talk to a traditional Document Management Reseller, or perhaps a vendor, they would have instilled the customer with great fear and doubt:

"SharePoint is not a real document management system."
"The resources required to manage SharePoint will kill any ROI you can glean from automating a process."
"SharePoint cannot handle high volume of documents."

I find the attitude is pervasive, and I think it just comes from a lack of understanding, and truly a lack of effort to research the competition.  Is SharePoint for everyone?  No.  Just as Documentum or FileNet is not for every organization.  But the momentum is absolutely mind numbing.  Watch for profits to fall in the Document Management segment...

Do you think SharePoint is the DM industry killer?