Hardcopy Document Processing Workshop
Conference on Information and Knowledge Management (CIKM 2004)
November 12, 2004
Hyatt Arlington Hotel, Washington D.C.

Workshop Topic

A stated purpose of CIKM 2004 is to identify challenging problems facing the development of future knowledge and information systems. A major (and growing) challenge for industry and government is the need to access and process the content of hardcopy documents. The focus of the workshop is to present current research and development addressing this challenge. Topics include, but are not limited to:

- Information retrieval in the image domain
- Information extraction in the image domain
- Optical correlation for processing of text
- Information retrieval of noisy OCR documents
- Innovative OCR techniques
- Robust OCR for degraded images
- OCR post-processing techniques
- Automatic categorization of noisy OCR documents
- Clustering of noisy OCR documents
- Entity extraction in noisy OCR documents
- Processing of text (overlaid and in-scene text) in video images
- Visualization of noisy OCR document collections
- Processing of fax documents
- Machine translation of noisy OCR
- Automatic summarization of noisy OCR
- Forms recognition of noisy text
- Duplicate detection

Relevance of Topics

During the 1990s, CEOs realized that their intellectual assets represented the majority of corporate wealth. The discipline of knowledge management was born to leverage these resources. Intellectual assets are primarily contained in hardcopy document collections. Many of these collections were scanned, indexed, OCRed, and placed on corporate intranets to gain competitive advantage.

Businesses, particularly those in highly regulated industries such as pharmaceutical, environmental, and transportation, generate hardcopy records that must be retrievable to demonstrate compliance. Accurate document retrieval requires sufficient indexing.
Unfortunately, sufficient indexing requires a priori knowledge of future, unknown requirements.

Many government applications need to retrieve and process hardcopy documents on an ongoing basis. Law enforcement and national defense organizations have a critical need to process hardcopy document content. In many cases, documents must be exploited in near-real time for their content to be actionable. Further, documents of interest to the Government tend to be very noisy and often contain multiple handwritten annotations or other marks. Currently, the only viable solution is to retrieve and process the content of OCRed documents. In virtually all of these examples, the cost of correcting OCR, in either time or capital, is prohibitive. Therefore, either OCR accuracy must be improved, the ability to process noisy OCR must be improved, or new, innovative techniques must be developed to process text in the image domain.

The ability to process hardcopy documents is a challenge of international importance and an appropriate workshop topic for this CIKM Conference.

Target Audience

The focus of the Hardcopy Document Processing Workshop is to bring together the text processing and information retrieval research communities along with the users who face the challenge of processing information from hardcopy documents. The purpose is to gain a better understanding of the current state of the art and of the needs of the user community through the exchange of ideas. The target audience will be a mixture of academia and researchers from the user communities.

Organizing Committee

- Kirk Lubbes - Workshop Chair, SAIC Division Chief Scientist
- Dr. David Doermann - Co-Director, Laboratory for Language and Media Processing, University of Maryland
- Ramana Rao - Founder, CTO and Senior VP of InXight Software
- Dr. Kazem Taghva - Associate Director, Information Science Research Institute, University of Nevada, Las Vegas
- Dr. Wei Gen Yee - Assistant Professor, Computer Science, Illinois Institute of Technology
- Dr. Shih-Fu Chang - Professor, Electrical Engineering, Columbia University
- Dr. Nancy Chinchor - Program Manager, Advanced Technology Programs, CIA
- Prem Natarajan - Distinguished Member of Technical Staff, Speech and Language Department, BBN Technologies
- Dr. Heather McCallum-Bayliss - Senior Computational Linguist, Lockheed Martin
- Dr. Kristen Summers - Senior Scientist, CACI

Agenda

- Introduce Workshop agenda
- Keynote speeches on the challenge
- Papers of ten to twenty minutes duration followed by topical discussions
- End with a panel discussion

Submission Requirements and Evaluation Criteria

Abstracts of three to four pages will be submitted for review by the organizing committee. Presentations should be planned to run ten to twenty minutes. Papers should address research agendas, ongoing research, novel solutions, or case studies related to the application of state-of-the-art technologies to the topics outlined above. Presentations are to be oriented toward advanced technology, not marketing or advertising. Abstracts are to be submitted by e-mail to klubbes@recordsengineering.com, preferably in .pdf format (MS Word format is acceptable if .pdf is not available).

Final Submissions

Final submissions of approved papers and presentations are due by 8 September. They may be in the form of technical papers, preferably in .pdf format (MS Word format is acceptable if .pdf is not available), or presentations in MS PowerPoint format. Papers and presentations are to be e-mailed to klubbes@recordsengineering.com.

Schedule

- Approval of Hardcopy Document Processing Workshop by CIKM Workshop Chair by 1 June
- Submission of 3-4 page abstracts to Organizing Committee by 15 July
- Approvals from Organizing Committee by 1 August
- Submission of camera-ready copy of papers by 8 September