Wednesday, July 6, 2011

Document Scanning and Capture Planning - Part 1 - Sizing and Storage

I have been wanting to do this for quite some time, and finally had a chance to sit down and put my thoughts together.  I find that many scanning and capture implementations lack overall direction, structure and standardization.  I wanted to put together a manual from my experiences, and ask the community to add to it so we can build a reference for everyone to use.  It will come in many parts, covering topics like storage, hardware, designing your index fields, etc.

Sizing and Storage Planning for Document Management and Scanning



One of the key areas of planning for any scanning/capture implementation is sizing and storage.  Many of the customers we work with have no real grasp of the volume of paper they deal with on a day-to-day basis, and when they start digitizing their paper, they are often quite surprised at how much of it they push through the system.  An underestimate here causes serious problems on several fronts, starting with running out of storage.  So how do you estimate the amount of paper?  There are several key conversion factors used by the document management industry, as outlined below:

| Description | Number of Pages | Storage |
|---|---|---|
| 1 scanned page, 8.5 x 11 | 1 | 50 KB |
| 1 scanned page, 11 x 17 | 1 | 100 KB |
| 1 file cabinet, 4 drawers | 10,000 | 500 MB |
| 1 box | 2,500 | 125 MB |
| 1 linear inch | 100 | 5 MB |
| 1 E size engineering drawing (48 x 36) | Equivalent to 16 pages of 8.5 x 11 | 800 KB |



This table is a basic planning tool and a starting point.  Keep in mind that these figures assume standard pages of text, not image-heavy pages like those in a magazine.  Also note that for boxes and file cabinets we have listed the average number of pages they contain, but in the imaging world we deal with images, not pages.  What is the difference?  A page may have 2 sides, and each side is converted into its own image.  So if you are scanning a box of double-sided pages, you have twice as many images and need double the storage.
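
As a rough illustration, the conversion factors and the pages-versus-images rule above can be turned into a small calculator. This is only a sketch: the function names and the `duplex_fraction` parameter (the share of pages printed on both sides) are made up for this example, and it assumes every image is a standard 8.5 x 11 black and white page at 50 KB, with 1 MB counted as 1000 KB to match the table.

```python
from __future__ import annotations

# Planning factors from the conversion table (average pages per container).
PAGES_PER_UNIT = {
    "file_cabinet": 10_000,  # 4-drawer cabinet
    "box": 2_500,
    "linear_inch": 100,
}
KB_PER_IMAGE = 50  # one standard 8.5 x 11 text image


def estimate_images(pages: int, duplex_fraction: float = 0.0) -> int:
    """Convert a page count to an image count.

    duplex_fraction is an assumed share of double-sided pages;
    each double-sided page produces two images instead of one.
    """
    if not 0.0 <= duplex_fraction <= 1.0:
        raise ValueError("duplex_fraction must be between 0 and 1")
    return round(pages * (1 + duplex_fraction))


def estimate_storage_mb(inventory: dict[str, float],
                        duplex_fraction: float = 0.0) -> float:
    """Estimate storage in MB for an inventory such as {"box": 12}."""
    pages = sum(PAGES_PER_UNIT[unit] * count
                for unit, count in inventory.items())
    images = estimate_images(round(pages), duplex_fraction)
    return images * KB_PER_IMAGE / 1000  # 1 MB = 1000 KB, as in the table


if __name__ == "__main__":
    # Hypothetical inventory: 3 cabinets and 12 boxes, 40% double sided.
    print(estimate_storage_mb({"file_cabinet": 3, "box": 12}, 0.4), "MB")
```

A single box with no double-sided pages comes back as 125 MB, matching the table, and a fully double-sided box comes back as 250 MB.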
Some other key factors that can contribute to storage and sizing:

DPI Setting – one of the key questions we always receive is “What DPI should I set on my scanner?”  For most basic scanning and archive applications, you can set your scanner to 200 DPI.  If you are doing OCR or any type of advanced data extraction, you want a 300 DPI image for maximum accuracy.  Anything beyond that is just a space killer: it bloats your files and slows down your process.

Black and White, Greyscale and Color – use black and white scanning whenever you can to keep file sizes at an absolute minimum.  Greyscale and color scanning should only be used when absolutely necessary, because the files are dramatically larger: at 300 DPI, the greyscale image below is roughly 8 times the size of the black and white one, and the color image roughly 15 times.  Below is a table of file sizes for the same letter, which had about 50% page coverage.

| Scanning Mode / DPI | File Size |
|---|---|
| Black and White, 200 DPI | 26 KB |
| Black and White, 300 DPI | 38 KB |
| Black and White, 400 DPI | 51 KB |
| Black and White, 600 DPI | 80 KB |
| Greyscale, 300 DPI | 301 KB |
| Color, 300 DPI | 577 KB |
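
To see what these numbers mean for planning, here is a small sketch that turns the measurements above into size multipliers. Bear in mind the figures come from one letter with about 50% page coverage, so treat the ratios as rough guides, not constants. The `relative_size` helper and the mode labels are just names chosen for this example.

```python
from __future__ import annotations

# File sizes in KB for the same letter, keyed by (mode, dpi),
# copied from the table of scanning modes.
MEASURED_KB = {
    ("bw", 200): 26,
    ("bw", 300): 38,
    ("bw", 400): 51,
    ("bw", 600): 80,
    ("greyscale", 300): 301,
    ("color", 300): 577,
}


def relative_size(mode: str, dpi: int,
                  baseline: tuple[str, int] = ("bw", 200)) -> float:
    """How many times larger a setting is than the baseline setting."""
    return MEASURED_KB[(mode, dpi)] / MEASURED_KB[baseline]


if __name__ == "__main__":
    for (mode, dpi) in MEASURED_KB:
        print(f"{mode:>9} @ {dpi} DPI: "
              f"{relative_size(mode, dpi):.1f}x black and white at 200 DPI")
```

Multiplying a black and white storage estimate by one of these factors gives a first guess at what the same volume would cost in greyscale or color.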

Image Processing – image cleanup can significantly reduce file sizes, and it is very important to use this feature whenever you can.  Despeckle, deshade, border removal, etc. will eliminate unnecessary noise in scanned images, and reduce your storage requirement by 10-30% depending on the quality of your documents.
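
Since the savings from cleanup depend on document quality, it is safer to plan with a range than a single number. A minimal sketch, assuming the 10-30% reduction quoted above applies evenly across the whole collection (the function name is illustrative only):

```python
from __future__ import annotations


def storage_after_cleanup(raw_mb: float,
                          low: float = 0.10,
                          high: float = 0.30) -> tuple[float, float]:
    """Return (best_case_mb, worst_case_mb) after image cleanup.

    low and high are the smallest and largest expected reductions,
    expressed as fractions of the raw storage estimate.
    """
    if not 0.0 <= low <= high <= 1.0:
        raise ValueError("expected 0 <= low <= high <= 1")
    return raw_mb * (1 - high), raw_mb * (1 - low)


if __name__ == "__main__":
    best, worst = storage_after_cleanup(500.0)  # one file cabinet, raw
    print(f"Plan for between {best:.0f} MB and {worst:.0f} MB")
```

Sizing for the worst case (the smaller reduction) is the conservative choice.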

Image Format – There is a lot of misinformation on the market about TIFF versus PDF.  I always hear “We want to store as TIFF because PDFs are just too big.”  That is just not the case.  An image scanned to PDF is just a TIFF in PDF clothing (or, to be more exact, in a PDF wrapper).  The PDF overhead is almost negligible.  The de facto standard in imaging today is rapidly becoming the PDF image with hidden text.  This gives you a nice little file with the pristine image, and the converted OCR text in the background.  The text layer adds negligible size to the file.

So now, with all this info, you can estimate volume in images, and then come up with required storage on a monthly, yearly or project basis.
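
To make that last step concrete, here is a rough projection from a daily image count to monthly and yearly storage. The default of 21 working days per month is my assumption, not an industry figure, and `kb_per_image` should come from your own scanning mode and DPI rather than the 50 KB default. Units are decimal (1 GB = 1,000,000 KB).

```python
from __future__ import annotations


def project_storage_gb(images_per_day: int,
                       kb_per_image: float = 50.0,
                       working_days_per_month: int = 21) -> tuple[float, float]:
    """Return (monthly_gb, yearly_gb) for a steady daily scanning volume."""
    monthly_kb = images_per_day * kb_per_image * working_days_per_month
    monthly_gb = monthly_kb / 1_000_000  # 1 GB = 1,000,000 KB
    return monthly_gb, monthly_gb * 12


if __name__ == "__main__":
    # Hypothetical shop scanning 4,000 images a day.
    monthly, yearly = project_storage_gb(4_000)
    print(f"About {monthly:.1f} GB per month, {yearly:.1f} GB per year")
```

For a one-off backfile project, estimate the total from boxes or cabinets instead and add it to the ongoing yearly figure.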
