Showing posts with label Capture. Show all posts

Thursday, May 31, 2012

How do you want to find your documents?


Document Capture Drives Search
One of the first stages in planning for any scanned image repository is to ask the question: How do you want to find your documents?  Theories vary on best practices, but here are a few tips when designing a document capture implementation for any ECM system:
  1. Limit your number of fields to five or fewer. So many times I see document scanning customers use way too many fields during capture.  The more fields you have, the more time end users spend indexing their documents, and the more chances fields will get skipped.  Take the time to interview the end users and truly find out how they need to search for their documents.
  2. Always use a date.  Dates are the ultimate filter that can be a life saver when searching for that needle in a haystack in a scanned document repository.  Invoice date, purchase order date, contract date, etc. give you the power to narrow down your search results to a specified period and can be a huge help in audit based searches or searches for legal support.
  3. Use automation to reduce indexing time.  Document capture applications provide automation and efficiency, and can reduce end user keying requirements on documents.  Strong, accurate OCR technology, and Advanced Data Extraction (ADE) are absolutely required.
  4. Ensure your technology has a QA step.  If you are going to go to all the trouble of scanning, capturing and migrating documents to a repository, make sure you can check your work.  Misfiling a document can be a painful experience.
  5. Full text search is the insurance policy.  Always, I repeat always, convert your scanned documents to a searchable format such as PDF Image with Hidden Text.  This will allow for granular searches beyond your index fields/columns, and can help you in the "find a needle in the haystack" tasks.  But do not, I say, do NOT rely on full text search as your primary search method.  Full text does not let you sort by document-specific dates, does not let you run range-based searches on specific criteria, and restricts sorting and viewing in most repositories.
Just a few tips when designing your document scanning index fields.
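
To make tips 1 and 2 concrete, here is a minimal Python sketch of a five-field index record and a date range filter. The field names and sample values are hypothetical, chosen only for illustration; a real ECM system would run this kind of query against its own index columns.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class InvoiceIndex:
    # Hypothetical five-field index for a scanned invoice (illustration only)
    vendor: str
    invoice_number: str
    invoice_date: date
    po_number: str
    amount: float


def find_by_date_range(records, start, end):
    """Return records dated within [start, end], oldest first."""
    hits = [r for r in records if start <= r.invoice_date <= end]
    return sorted(hits, key=lambda r: r.invoice_date)


# Made-up sample records
records = [
    InvoiceIndex("Acme Supply", "INV-1001", date(2012, 1, 15), "PO-77", 1250.00),
    InvoiceIndex("Acme Supply", "INV-1042", date(2012, 4, 2), "PO-81", 310.50),
    InvoiceIndex("Bolt & Nut Co", "INV-2207", date(2011, 11, 30), "PO-64", 89.99),
]

# Narrow the results to the first quarter of 2012
q1_2012 = find_by_date_range(records, date(2012, 1, 1), date(2012, 3, 31))
print([r.invoice_number for r in q1_2012])  # ['INV-1001']
```

This is the kind of range-based, sortable search that a dedicated date field gives you and that full text search alone does not.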

Wednesday, July 6, 2011

Document Scanning and Capture Planning - Part 1 - Sizing and Storage

Been wanting to do this for quite some time, and finally had some time to sit down and put thoughts together.  I find that many of the scanning and capture implementations lack overall direction, structure and standardization.  I wanted to put together a manual from my experiences, and ask the community to add so we can build a reference for everyone to use.  This will be composed of many parts, including all different topics like storage, hardware, designing your index fields, etc.

Sizing and Storage Planning for Document Management and Scanning



One of the key areas of planning for any scanning/capture implementation is sizing and storage.   Many of the customers we work with have no real grasp on the volume of paper they deal with on a day to day basis, and when they make the migration to digitizing their paper, they are often quite surprised at the amount of paper they push through the system.  Obviously, this can cause some serious issues on many different fronts.   So how do you estimate the amount of paper?  There are several key conversion factors used by the document management industry, as outlined below:

Description | Number of Pages | Storage
1 Scanned Page (8.5x11) | 1 | 50KB
1 Scanned Page (11x17) | 1 | 100KB
1 File Cabinet (4 drawers) | 10,000 | 500MB
1 Box | 2,500 | 125MB
1 Linear Inch | 100 | 5MB
1 E Size Engineering Drawing (48x36) | 16 (8.5x11 equivalent) | 800KB



This table is a basic planning tool, and can be used as a starting point.  One thing to remember is that these are all standard pages.  Not full image magazine pages, but full text pages.  The other thing to keep in mind is that we have listed for boxes and file cabinets, the average number of pages contained within.  In the imaging world, we deal with images, not pages.  What is the difference?  A page may have 2 sides, which are converted digitally into 2 images.  So effectively, if you have a box with double sided pages you are scanning, you will have to double the storage required.
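
As a rough sketch of how the conversion factors and the page-versus-image distinction combine, the Python snippet below turns a count of cabinets, boxes or linear inches into images and megabytes. The function name and the `duplex_fraction` parameter are my own for illustration, and the sketch assumes 1 MB = 1,000 KB, which is how the table above rounds.

```python
# Planning factors taken from the table above (pages per unit)
PAGES_PER_UNIT = {"file_cabinet": 10_000, "box": 2_500, "linear_inch": 100}
KB_PER_IMAGE = 50  # one standard 8.5x11 text page


def estimate_storage(unit, count, duplex_fraction=0.0):
    """Return (images, megabytes) for `count` units of paper.

    duplex_fraction is the share of pages printed on both sides
    (0.0 = all single sided, 1.0 = all double sided). Hypothetical parameter.
    """
    pages = PAGES_PER_UNIT[unit] * count
    images = pages * (1 + duplex_fraction)  # each two-sided page yields two images
    megabytes = images * KB_PER_IMAGE / 1000
    return images, megabytes


# Three boxes where half of the pages are double sided
images, mb = estimate_storage("box", 3, duplex_fraction=0.5)
print(f"{images:,.0f} images, about {mb:,.1f} MB")
```

A box that is entirely double sided comes out at twice the table figure, which is the doubling described above.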
Some other key factors that can contribute to storage and sizing:

DPI Setting – one of the key questions we always receive is What DPI should I set on my scanner?  For most basic scanning and archive applications, you can set your scanner to 200 DPI.  If you are doing OCR or any type of advanced data extraction, you always want a 300 DPI image for maximum accuracy.  Anything beyond that is just a space killer, will slow down your process and really bloat your files.

Black and White, Greyscale and Color – always use black and white scanning to keep file sizes at an absolute minimum.  Greyscale and color scanning should only be used when absolutely necessary, as file sizes are just crazy.  Below is a table of file sizes for the same letter.  The letter was about 50% page coverage.

Scanning Mode/DPI | File Size
Black and White, 200 DPI | 26K
Black and White, 300 DPI | 38K
Black and White, 400 DPI | 51K
Black and White, 600 DPI | 80K
Greyscale, 300 DPI | 301K
Color, 300 DPI | 577K

Image Processing – image cleanup can significantly reduce file sizes, and it is very important to use this feature whenever you can.  Despeckle, deshade, border removal, etc. will eliminate unnecessary noise in scanned images, and reduce your storage requirement by 10-30% depending on the quality of your documents.

Image Format – There is a lot of misinformation on the market about TIFF versus PDF.  I always hear “We want to store as TIFF because PDFs are just too big.”  Just not the case.  An image scanned to PDF is just a TIFF in PDF clothing (or a PDF wrapper, to be more exact).  The PDF overhead is almost negligible.  The de facto standard in imaging today is rapidly becoming the PDF image with hidden text.  This gives you a nice little file with the pristine image, and converted OCR text in the background.  The text layer adds negligible size to the file.

So now, with all this info, you can estimate volume in images, and then come up with required storage on a monthly, yearly or project basis.
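
Here is a minimal Python sketch of that last step: projecting yearly storage from a monthly image count. The per-image sizes come from the scanning mode table earlier in this post, which measured one sample letter at about 50% page coverage, so treat them as ballpark figures. The function name, the mode keys and the `cleanup_reduction` parameter are hypothetical.

```python
# Sample file sizes (KB per image) from the scanning mode table in this post
KB_PER_IMAGE_BY_MODE = {
    "bw_200": 26,
    "bw_300": 38,
    "bw_400": 51,
    "bw_600": 80,
    "grey_300": 301,
    "color_300": 577,
}


def yearly_storage_gb(images_per_month, mode, cleanup_reduction=0.0):
    """Estimate a year of storage in GB (1 GB = 1,000,000 KB).

    cleanup_reduction is the fraction saved by image processing,
    somewhere in the 0.10 to 0.30 range quoted above.
    """
    kb_per_image = KB_PER_IMAGE_BY_MODE[mode] * (1 - cleanup_reduction)
    return images_per_month * 12 * kb_per_image / 1_000_000


# Compare black and white against color for 20,000 images a month
for mode in ("bw_300", "color_300"):
    print(mode, round(yearly_storage_gb(20_000, mode), 2), "GB per year")
```

Running the same volume through both modes shows why color should be the exception: the color estimate is roughly 15 times the black and white one.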

Saturday, March 6, 2010

SharePoint and the Document Management Industry

It has been a crazy few years in the Document Management and Enterprise Content Management space, and one of the key factors, in my opinion, is the emergence of SharePoint as a viable option for use as a repository for scanned images. All of the key players are touting their integration with SharePoint, but I am a bit confused: why would I want to purchase a six-figure DM/ECM system when I can just use SharePoint? I know, I know, let me see...as I recollect the comments I hear on a daily basis:

  • "SharePoint is not really a Document Management System"
  • "SharePoint doesn't allow searching by columns"
  • "SharePoint lacks the controls that my ECM system offers"
  • "SharePoint uses BLOB storage which is scary"
  • The list goes on....
The industry is in denial, and I hear more and more objections to using the system every day, which makes me think one thing:  Traditional ECM is being challenged constantly, and they are upping their game to try and compete.

Just this week I had two calls where the customers were trying to move off Documentum, and migrate to SharePoint.  Why?  The main reasons are cost and flexibility.   I believe there will soon be a total reset in the industry, both in price and features, as Big ECM struggles to survive in the current environment.  2010 will be an interesting year.

Sunday, February 15, 2009

The Power of Advanced Capture

In any Document Management or Enterprise Content Management System, there are four basic components: Hardware, Capture, Archive, and Search and Retrieve. So what is the most important piece? Everyone nowadays seems to have the hardware. All of the copiers today have scanning capability, with the newer ones scanning at 70 pages per minute. The simplest archive is a series of folders on your server or workstation. And with files on the network, Windows Search or the search capabilities within Adobe allow you to find what you are looking for quickly (sometimes).

For the more advanced organization, they may have a Document Management System, or utilize Microsoft SharePoint for their archive and search and retrieve functions. But what seems to be lacking in most organizations, is a structured, automated way of capturing files. The argument of this BLOG entry is that Capture is the most important piece to any ECM or DM System.

As mentioned before, when we look at just about any office or organization today, they are scanning with a copier or desktop scanner. But inevitably, they take their paper mess and recreate it digitally. Why? No standardization in the process. Joe scans his files to his email and stores them in Outlook folders, Betty scans to her My Documents on her laptop, and uses a convoluted naming scheme that only she can decipher. They take their paper problem, and create a huge problem for IT. Disparate archives now pose a disaster recovery problem, along with the issues of accessibility.

So what is the answer? Advanced Capture. Advanced Capture applications provide the ability to set structure, and harness the capabilities of all the scanning hardware within the organization. They can provide standardization and structure, along with fantastic efficiency improvements. Take for example, PSIGEN's PSI:Capture. With its Microsoft SharePoint Migration feature, and auto-import capability, you can set all your scanning copiers to scan to a processing folder. Utilizing the barcode routing capability, you can create cover sheets for each library within your SharePoint site. When you scan, the software will pick up, process, rename and folder files automatically. The end result is a standardized folder structure, standardized naming scheme, and a searchable PDF all within your SharePoint site.

The other major contributor to efficiency within Capture applications is the ability to use separation technology. I see it all the time...the office that has 20 documents to scan. They walk up to the copier, and scan them one by one; a very time consuming process. With document separators, you can scan the entire stack and let the software split the documents, rename and folder them. Let the technology do all the hard work!
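
To show the idea behind separation, here is a minimal Python sketch that splits one scanned stack into documents wherever a separator sheet appears. It is not how PSI:Capture or any other product implements it; the `SEPARATOR` marker stands in for whatever a barcode reader would report for a separator page, and the page names are made up.

```python
# Hypothetical marker standing in for a detected separator sheet
SEPARATOR = "SEPARATOR"


def split_on_separators(pages):
    """Split a scanned stack into documents at each separator sheet.

    Separator sheets themselves are dropped from the output.
    """
    documents, current = [], []
    for page in pages:
        if page == SEPARATOR:
            if current:  # close out the document in progress
                documents.append(current)
            current = []
        else:
            current.append(page)
    if current:  # the last document has no trailing separator
        documents.append(current)
    return documents


stack = [
    SEPARATOR, "invoice-p1", "invoice-p2",
    SEPARATOR, "po-p1",
    SEPARATOR, "contract-p1", "contract-p2", "contract-p3",
]
docs = split_on_separators(stack)
print(len(docs), "documents from one pass through the feeder")
```

One trip to the copier yields three separate documents, each ready to be renamed and foldered.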


Thursday, June 19, 2008

What is OCR and how can it help me in my scanning project?

Ah, OCR, also known as Optical Character Recognition. Is it really necessary to use OCR software after scanning files to TIFF or PDF? What are the key benefits of OCR? How can I use OCR to create searchable or editable documents?

OCR technology has come a long way in the past few years, and the OCR engines on the market today utilize intelligence and speed to quickly and accurately convert scanned paper documents from plain old images, into searchable or editable documents. For a quick overview of OCR, ICR and OMR, click here.

When looking at OCR technologies, you need to determine your end goal: is it searchability, or a cleanly formatted, editable document? Is your goal speed, or accuracy?

There are a number of desktop applications (eCopy Desktop, Adobe, OmniPage, ReadIRIS) that can provide the ability to create searchable files, as well as Word Processing files, or even spreadsheets. These are perfect for low-volume, daily conversions.

If you are scanning a large volume of paper, and need rapid and accurate conversion, most of the Advanced Capture applications on the market can accomplish the task (Psigen PsiCapture is an example). This capture software utilizes either the Expervision or OmniPage production OCR engines, and can convert 1,000 pages to searchable PDF in 10 minutes.

For more info on OCR and how it can work for you, see the links below:

OCR Software Links

Scanning and Document Management Articles and Research

Sunday, March 30, 2008

eCopy, SharePoint and Scanning in the Enterprise

More and more organizations are taking the leap into a centralized document repository, through the use of Microsoft SharePoint. SharePoint is an outstanding tool for collaboration, and excels when utilizing Microsoft Office documents. But what happens when you create a SharePoint site that will require the uploading of numerous scanned documents? How can you standardize file naming, metadata population and the overall document imaging process?

One of the best solutions I have found is eCopy ShareScan. eCopy is a document capture solution that can be connected to just about any Multi-Function Copier or Dedicated Scanner. The challenge with traditional scanning solutions is that they usually require dedicated scanner hardware that is connected to a PC. This provides a one to one solution, allowing only the PC user to scan documents. With eCopy, you can provide a one-to-many relationship, and share the scanning capabilities of your copier or scanner with an entire office or department. It has a simple, easy to use touch screen interface that even the most technically challenged user can learn to use. It can provide a rapidly deployable solution, with quick adoption, and low training requirements.

eCopy provides a SharePoint Connector that allows quick integration into any site, with all security intact, and the ability to require metadata fields. For larger organizations, you can configure one ScanStation, and then publish the configuration to all the others on your network. For more info on eCopy and additional tools, go to What is eCopy? or for other SharePoint Scanning Utilities, click on the following link SharePoint Scanning Utilities.

Thursday, January 31, 2008

Document Management and Integration with Business Applications

In the beginning, most of the Document Management and ECM solutions I sold were stand alone applications. Users would scan documents (say invoices) into the repository, and would use the search client to bring up documents by index field, or perhaps do a full-text search of the OCR'ed contents.

What I am finding today, as more and more IT folks get involved in the decision making process, is that integration is King. Applications must play well, and play easy with all other business applications within the organization. What does that mean? What does integration truly mean?

I have found that it means many things, to many different people. Below is a summary:

Basic Metadata Population
Wow, that is a mouthful. Basic Metadata Population, or BMP, involves pulling index field information from an existing source, and allowing the user to manually pick the value for a field, such as the vendor. The most common use case here is to present a popup list of information for the user to choose from. Take, for example, one of my customers that has PeopleSoft Financials. One of my engineers created a view within the PeopleSoft DB of all the vendors. When Purchasing is indexing their Purchase Orders, they see a listing of vendors directly from the financial system. This prevents rekeying of data that has already been keyed, prevents duplicate names or misspellings, and ensures standardization.

Advanced Metadata Population
Another mouthful, but AMP takes population of index fields a step further, and provides autopopulation of fields based on a database lookup. For example, you might have a Vendor Number field that is entered, and the capture application will go and look up all the information on that vendor and assign it to the document.
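
A minimal Python sketch of that lookup is below, using an in-memory SQLite table as a stand-in for the financial system. The table name, column names and vendor data are all hypothetical; a real capture application would query a view in the business application's own database.

```python
import sqlite3

# In-memory stand-in for the financial system's vendor view (made-up data)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vendors (vendor_number TEXT PRIMARY KEY, name TEXT, city TEXT)"
)
conn.executemany(
    "INSERT INTO vendors VALUES (?, ?, ?)",
    [("V1001", "Acme Supply", "Denver"), ("V1002", "Bolt & Nut Co", "Austin")],
)


def autopopulate(conn, vendor_number):
    """Look up a keyed vendor number and return the remaining index fields.

    Returns None when the number is unknown, so the operator can be prompted.
    """
    row = conn.execute(
        "SELECT name, city FROM vendors WHERE vendor_number = ?",
        (vendor_number,),
    ).fetchone()
    if row is None:
        return None
    return {"vendor_number": vendor_number, "vendor_name": row[0], "vendor_city": row[1]}


print(autopopulate(conn, "V1001"))
```

The operator keys one value and the rest of the fields fill themselves in, which is where the time savings come from.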

Screen Scraping
This technology is usually used to "scrape" information off the screen from an application, and use it to populate index fields, or to perform a search. Different functions can be tied to hotkeys, or some advanced applications can have a quicklaunch bar that will perform certain operations. For example, if you are in your financial software looking at a particular vendor, you can hit a hotkey and have all the documents for that particular vendor presented.

True Integration
True Integration requires application programming interfaces that will allow two applications to talk directly to each other. For instance, you can create a button in the tool bar of your financial application that will link to a function within your document management system, or ECM system. So with one quick click, you can find all the associated documents, or scan a document to a particular vendor file with all the fields populated.

Integrations always present some challenge, and it is important to make sure you are on the same page when talking to a customer or vendor to ensure everyone is satisfied in the end.

For more info on Document Management and ECM, go to the following link:

ScanGuru Document Management/ECM Portal

Sunday, November 11, 2007

How do I scan my file cabinets?

How do I go paperless?
What type of scanner should I buy?

These questions are becoming common in today's business world as the inefficiencies of paper can be eliminated through the use of the proper hardware and software.
The back-scanning of large file rooms can be a huge task, requiring the purchase of equipment, software and usually, some extra employees. Below is an outline of steps to take before moving out on this "paperless" adventure:

Evaluate your current paper files.

How much paper do you have? A good benchmark is that each four-drawer file cabinet holds about 10,000 to 12,000 pages, depending on how tightly the drawers are packed (if you cannot fit any more files in, lean towards the upper number). Now, each page consumes approximately 50K of server space, so an entire cabinet is about a CD of data, or 500-600 MB. Also take a look at the prep work that will be involved. It amazes me the number of staples folks use when archiving paper files. One simple staple in the left corner never seems to be enough, and I have even seen 5 per packet or document. This will all add to the prep time as you remove staples, post-its, etc. How long will it take to scan the files? Do a benchmark test on a single file and then multiply it out.

Should I outsource the scanning of my file cabinets?

Once you have evaluated your file room, and seen how many files you have, you can get a feel for how long the task will take. With a 90 page per minute scanner, a file drawer will take about half an hour to scan (that includes a few jams). Then there are the index fields, which, depending on how many you have per document, can add some additional time. After looking at all these factors, and what it would cost in time, some folks just decide to outsource. Document Scanning Bureaus usually charge a per page fee, plus some additional labor charges. Depending on your market, the cost will range from 5-12 cents per page, plus an hourly labor fee. So that 4 drawer cabinet will run you anywhere from $500-1200 plus some labor fees.
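
The arithmetic behind those figures is simple enough to sketch in a few lines of Python. The function names are my own, the per page rates are the 5-12 cent range quoted above, and the hourly labor fee is left out because it varies by bureau.

```python
def scan_minutes(pages, pages_per_minute):
    """Pure feed time, before prep, jams and indexing."""
    return pages / pages_per_minute


def outsourcing_cost_range(pages, low_cents=5, high_cents=12):
    """Per page bureau fees in dollars as (low, high); labor fees not included."""
    return pages * low_cents / 100, pages * high_cents / 100


drawer_pages = 2_500    # one drawer of a 10,000 page, four-drawer cabinet
cabinet_pages = 10_000

print(f"One drawer at 90 ppm: about {scan_minutes(drawer_pages, 90):.0f} minutes")
low, high = outsourcing_cost_range(cabinet_pages)
print(f"One cabinet outsourced: ${low:,.0f} to ${high:,.0f} plus labor")
```

Plug in your own page counts and scanner speed, then compare the in-house time against the bureau quote.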

How do I figure out what scanner and software to buy?

There are a ton of options out there, and several websites that can help. http://www.scanguru.com/ is a good place to start. There are some additional articles, and links to many of the vendor sites.
I will go deeper into each of the questions above in separate entries in the future.

Tuesday, April 24, 2007

Key Features- Scanning/Capture Applications for Law Firms

What should a Law Firm look for in a scanning application? Here are some suggestions:

Barcode Separator Functionality - Separator pages allow the user to insert a specially coded page between documents in a stack. Once scanned, the software uses these pages to determine when a document begins and ends. This allows the scanning of many documents at once, rather than scanning one at a time. There is also the notion of "intelligent separators" which allow you to encode data on the separator page, such as case, matter, attorney, etc.

Image Enhancement - These tools, such as Kofax's Virtual Rescan, will automatically adjust contrast and brightness, remove problematic colors, remove speckles, and thicken fonts. If you want the highest quality image, with the least amount of scanning operator intervention, this is a key component to any scanning system.

Indexing - The application should allow for the entry of case and matter information, and this should allow you to automatically rename the files based on these values, and create folders. Rapid indexing features should allow quick entry of these fields for multiple documents.

Optical Character Recognition (OCR) - OCR takes the scanned image, and converts it to a text-based format. When looking at this feature, it should allow conversion to the following 3 formats: Adobe Image + Hidden Text, Word/WordPerfect and plain text. If you can test the software, see what type of results it provides with several sample firm documents.

Export - Depending on how you are managing your cases, the application should offer maximum flexibility on where you can direct the end product. I have several firms that use multiple case/document management systems, depending on the case type and size. Folder Export, Summation, Alchemy, SharePoint, etc. should all be supported.

Bates Numbering - Get rid of that old stamp!! Most Advanced Capture Applications provide the ability to digitally Bates Stamp your documents. Huge time saver.

Obviously this is just a starting point, but these are some necessary features that will make processing documents easier, and much more efficient.
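
As an illustration of the Indexing feature above, here is a minimal Python sketch that builds a folder path and file name from case and matter values. The naming pattern, the function name and the sample values are hypothetical; each capture application has its own way of configuring this.

```python
from pathlib import PurePosixPath


def build_export_path(root, case, matter, doc_type, sequence):
    """Build root/case/matter/case_matter_doctype_NNNN.pdf from index values."""

    def clean(value):
        # Keep letters, digits, hyphens and underscores; replace anything else
        return "".join(c if c.isalnum() or c in "-_" else "_" for c in value.strip())

    filename = f"{clean(case)}_{clean(matter)}_{clean(doc_type)}_{sequence:04d}.pdf"
    return PurePosixPath(root) / clean(case) / clean(matter) / filename


path = build_export_path("/archive", "Smith-Jones", "2007-0142", "Deposition", 3)
print(path)
```

Every operator ends up with the same folder structure and naming scheme, no matter who scanned the document.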

For more info, go to www.scanguru.com

Monday, February 26, 2007

What to Look for When Buying a Scanner

The “Paperless” office is the hot topic today, and there are so many choices when it comes to scanning hardware, it can be difficult at best to sort through all the models and features to make the right choice. Below is a breakdown of the different scanner features, and an explanation of what they mean in layman’s terms:

Scanning Speed

Scanning speed is a main area of focus when researching scanning hardware. A scanner’s speed is usually directly proportional to its price, but you have to ask yourself one question: How long do you have to accomplish your scanning tasks? If you buy that cheapo scanner at an office products store that scans at 8 pages per minute, good luck in getting those 10 file cabinets scanned. Another note to mention is that all the manufacturers rate their scanner speeds at 200 DPI. If you need high quality images, or are performing OCR, 300 DPI will probably be necessary. This will significantly slow down your scanning speed, as will color scanning and duplex (2-sided) scanning on some models.

Document Feeder Capacity

The document feeder provides you the ability to load anywhere from 1-1000+ sheets into the scanner. The feeder capacity you require all depends on the volume of paperwork you are scanning, and if you are using an intelligent capture application that provides the ability to use separator sheets to split documents automatically. If you are a Law Firm that routinely scans 200 page documents, then that is a good starting point for your feeder size requirements. This allows you to load your documents, and then let the scanner do the work.

Another focus area related to the feeder is the maximum and minimum paper sizes. If you intend to scan legal size paper or insurance cards, make sure the scanner can handle them.

Daily Duty Cycle

The Duty Cycle (DC) is a rating of the scanner’s durability, and defines just how much paper you can feed through the hardware in a day. If you are scanning 3000 pages per day, you do not want to buy a small desktop scanner with a DC of 750. What happens if you exceed this number? Nothing to begin with, but as time goes on the wear and tear on the unit will begin to show in the form of jams, misfeeds, skewing, etc. This number is also tied to the replacement of consumables (rollers and pads). If you continually exceed the DC, you will more than pay for a higher level scanner in consumables over time.
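
A quick way to sanity check a model against your volume is to compare the two numbers directly, as in this small Python sketch. The function names are my own; the 3,000 pages per day and the DC of 750 are the figures from the example above.

```python
def duty_cycle_load(pages_per_day, daily_duty_cycle):
    """Daily volume expressed as a multiple of the rated duty cycle."""
    return pages_per_day / daily_duty_cycle


def within_duty_cycle(pages_per_day, daily_duty_cycle):
    """True when the planned daily volume fits the scanner's rating."""
    return pages_per_day <= daily_duty_cycle


load = duty_cycle_load(3000, 750)
print(f"Daily volume is {load:.1f}x the rated duty cycle")
print("OK to buy" if within_duty_cycle(3000, 750) else "Look at a higher level scanner")
```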

Scanning Mode

Most scanners nowadays can scan both sides of your document, but there are still some lingering models that will only do simplex scanning. Also, if you have the requirement to scan color documents, ensure that color scanning is supported.

Warranty and Service

All warranties are not created equal. Some scanner manufacturers provide “depot” type service where you have to ship your scanner for warranty service. Others will provide onsite warranty service for a specified period of time. Along with this, the time period on the warranty also varies everywhere from 30 days, to a full year. Scanner service is a separate purchase, and in some cases, can be a shock to the purchaser. A basic service plan on a mid-range scanner can cost over $1000 per year. Get an advanced plan that provides Preventative Maintenance visits, and you could be in the $1500 - $2000 range, depending on your model. Get all the details up front, and some manufacturers will provide multi-year discounts on service.

Others

Definitely investigate the software that comes bundled with your scanner. Many of the manufacturers now provide image processing software (Kofax VRS) and scanning utilities, along with Optical Character Recognition Software. Also, if you require the ability to scan to PDF, make sure that is an output option with the scanner you purchase.