Task 2: Automatic metadata capture
Develop software that will automatically identify properties of an image. These data “facets” will be captured without human intervention and will provide categories of information that allow users to search and browse virtual collections more effectively.
Specimen labels will be processed with Optical Character Recognition (OCR) software to extract the text, and methods to improve the accuracy of OCR on handwritten labels will be researched. Text extracted by OCR from handwritten labels will need further processing and validation, for example through crowdsourcing.
The Deliverable for Task 1.2 resulted in a broad-ranging report covering the use of Optical Character Recognition (OCR), Natural Language Processing (NLP) and Handwritten Text Recognition (HTR) for automating the capture of text data, as well as the automated capture of character data from images.
Eight SYNTHESYS3 Beneficiaries collaborated to review, trial and develop tools for OCR, NLP, HTR, template matching and pattern recognition. There was further collaboration with the US-based iDigBio and the EU-based tranScriptorium.
Subtask 1: Review development of tools and workflows which incorporate automatic or semi-automatic metadata capture using OCR
A range of currently available OCR options was reviewed and trialled. The trials used images representing a variety of material, including plants, insects, molluscs and fossils.
The OCR study confirmed the value of OCR technology in the digitisation workflow and identified three options that gave the best results: one server-based (ABBYY Recognition Server v3), one that runs on a PC (ABBYY FineReader v12 Professional) and one online service (Onlineocr.net). The cost of the server-based solution is prohibitive for most institutes. Although ABBYY FineReader is neither free nor open source, the results suggest that it would be the best solution for many institutes. The quality of its output was consistently high and it coped well with specimen images that had not been cropped. Such images generally include plant material, which produces ‘noise’ in the output of some OCR software, often several pages of meaningless characters, as the software attempts to recognise words in the plant specimen. Being able to process uncropped images saves a significant amount of time in image preparation. The accuracy of the output was also tested against pre-entered labels, and it performed well.
ABBYY is currently used by several institutes in the USA, including the New York Botanical Garden, and ABBYY Recognition Server v3 is used in one Beneficiary institute (RBGE). Some institutes also use it in conjunction with Symbiota, as described below. Based on these results, the JRA team decided to include this software in a training workshop for users who would like to set it up in their institutes.
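The check of OCR output against pre-entered labels can be illustrated with a character error rate. This is a minimal sketch only: the report does not state which accuracy measure was used, so the metric, the function names and the sample strings are all assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance divided by the length of the pre-entered (reference) label.

    Runs of whitespace are collapsed first, an assumption made here so that
    line breaks in the OCR output are not counted as errors.
    """
    reference = " ".join(reference.split())
    ocr_output = " ".join(ocr_output.split())
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, ocr_output) / len(reference)


if __name__ == "__main__":
    # Hypothetical label text; the OCR output confuses the letter "o" with a zero.
    print(character_error_rate("Quercus robur L.", "Quercus r0bur L."))
```

A lower value means closer agreement with the pre-entered label; a value of 0 means the two strings are identical after whitespace is collapsed.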
Subtask 2: Review of development of NLP for parsing OCR text into Darwin Core fields
The review of NLP solutions presented some of the key projects and individuals working in this area. Contact was made with Ed Gilbert of Arizona State University and with Symbiota, which was developed through a collaboration between the University of Wisconsin and Arizona State University as a platform for supporting communities of natural history (NH) collection curators. It enables these communities to build shared online portals for their collections and associated data and to share their work on the Web, and has resulted in a number of networks covering themes such as lichens, mycology, American Myrtaceae and North American plants. Symbiota is currently one of the best available solutions for parsing OCR text, so it was decided to include it in a training workshop, with the aim of extending the largely US-based Symbiota community into Europe. This would enable European institutes to work together on OCR and on parsing their specimen label text. Symbiota can be used effectively in conjunction with ABBYY FineReader.
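To show what parsing OCR text into Darwin Core fields involves, the following minimal sketch pulls a collector, collector number and date out of raw label text using regular expressions. It is not Symbiota's parser. The patterns, the `parse_label` function and the sample label are hypothetical; only the output keys (`recordedBy`, `recordNumber`, `eventDate`, `verbatimEventDate`) are Darwin Core term names.

```python
import re
from datetime import date

# Three-letter English month abbreviations mapped to month numbers.
MONTHS = {name: number for number, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}

# Assumed label convention: a line starting with "Coll." or "Leg.",
# followed by a name and an optional "No. <digits>".
COLLECTOR_RE = re.compile(
    r"^(?:Coll|Leg)\.?:?[ \t]+(.+?)(?:[ \t]+No\.?[ \t]*(\d+))?[ \t]*$",
    re.MULTILINE)

# Assumed date convention: day, month name (full or abbreviated), four-digit year.
DATE_RE = re.compile(r"\b(\d{1,2})\s+([A-Za-z]{3})[a-z]*\.?\s+(\d{4})\b")


def parse_label(ocr_text: str) -> dict:
    """Return a dict of Darwin Core terms found in the OCR text (possibly empty)."""
    record = {}

    collector = COLLECTOR_RE.search(ocr_text)
    if collector:
        record["recordedBy"] = collector.group(1)
        if collector.group(2):
            record["recordNumber"] = collector.group(2)

    for match in DATE_RE.finditer(ocr_text):
        day, month_name, year = match.groups()
        month = MONTHS.get(month_name.lower())
        if month is None:
            continue  # three letters that are not a month name
        try:
            iso_date = date(int(year), month, int(day)).isoformat()
        except ValueError:
            continue  # impossible calendar date, probably an OCR error
        record["eventDate"] = iso_date
        record["verbatimEventDate"] = match.group(0)
        break  # keep the first valid date only

    return record


if __name__ == "__main__":
    label = ("Flora of Scotland\nQuercus robur L.\nPerthshire, near Dunkeld\n"
             "Coll. J. Smith No. 1234\n12 May 1987")
    print(parse_label(label))
```

Fixed patterns like these break as soon as label layouts vary, which is one reason the review looked at NLP approaches and community tools instead of hand-written rules.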
Subtask 3: Review of development of Natural Handwriting Recognition (NHR) in workflows to automate the identification of collectors’ handwriting and to train software to recognise text for prioritised collectors
Work was carried out to determine whether specimens could be classified automatically by classifying the features on them that hold data. A case study was based at the herbarium of the Botanic Garden and Botanical Museum Berlin-Dahlem (BGBM) as part of StanDAP-Herb, a joint project with the University of Applied Sciences, Hannover. ‘Linienextraktor’, the software used for this study, implements feature recognition algorithms that can be used on herbarium specimens. The study produced a series of recommendations and guidelines for future work using this software. These cover the required image resolution, the number of templates needed to cover variation, keeping the number of words in a template to a minimum, and the processing set-up.
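The idea of classifying a specimen image by comparing it against a set of templates can be sketched as follows. This is a generic sliding-window template match on small binary grids, given as an illustration only; it is not the algorithm implemented in Linienextraktor, and the function names and toy data are hypothetical.

```python
def match_score(image, template, top, left):
    """Fraction of template pixels that equal the image pixels beneath them."""
    height, width = len(template), len(template[0])
    agree = sum(
        image[top + r][left + c] == template[r][c]
        for r in range(height)
        for c in range(width))
    return agree / (height * width)


def best_match(image, template):
    """Slide the template over every position; return (score, top, left) of the best fit."""
    t_height, t_width = len(template), len(template[0])
    i_height, i_width = len(image), len(image[0])
    if t_height > i_height or t_width > i_width:
        raise ValueError("template is larger than the image")
    best = (-1.0, 0, 0)
    for top in range(i_height - t_height + 1):
        for left in range(i_width - t_width + 1):
            score = match_score(image, template, top, left)
            if score > best[0]:  # ties keep the first position found
                best = (score, top, left)
    return best


def classify(image, templates):
    """Return the name of the template whose best match scores highest."""
    scores = {name: best_match(image, tpl)[0] for name, tpl in templates.items()}
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Toy binary image (1 = ink) and two hypothetical templates.
    image = [
        [0, 0, 0, 0, 0],
        [0, 1, 1, 0, 0],
        [0, 1, 0, 0, 0],
        [0, 0, 0, 0, 0],
    ]
    templates = {
        "corner": [[1, 1], [1, 0]],
        "bar": [[1, 1], [1, 1]],
    }
    print(best_match(image, templates["corner"]), classify(image, templates))
```

Because every template is compared at every position, the cost grows with both the number of templates and the image resolution, which is consistent with the study issuing recommendations on both.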
In addition, software developed for historical handwritten documents was tested on NH specimen labels. Contact was made with a separate EU-funded FP7 project, tranScriptorium, which has developed software incorporating HTR technology. Transkribus, one of the tools developed within the tranScriptorium project, was installed locally by four SYNTHESYS3 Beneficiaries for testing. The results were promising and suggest that further collaboration between tranScriptorium and NH collections would be beneficial, including exploring the use of crowdsourcing to help with the mark-up process.
Existing handwriting resources include:
- Chirographum historicum
- iDigBio Handwriting Samples and Resources Page
- Global Plants Initiative Botanists
- Conservatoire et Jardin Botaniques Ville de Geneve Auxilium ad Botanicorum Graphicem
Subtask 4: Review of automatic capture of character including colour, shape as well as EXIF data
Work focussed on analysing specimen images to capture non-text specimen data. A series of open source prototypes were developed by NHM to do the following:
- segment specimens from their backgrounds and segment regions of interest (e.g. particular body parts)
- detect morphological features to be used for classification (e.g. markings that indicate gender)
- calculate physical dimensions from images (e.g. wing length)
- analyse colour for use in classification (e.g. wing colours)
- generate heat maps for regions of interest
The code for these tools is available in a GitHub repository. In conjunction with Work Package NA2, trials are being carried out by RBGK and RBGE using the colour analysis algorithm to identify any correlation between leaf colour and quality of DNA and to determine whether the tool can be used to aid material selection for sequencing.
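The segmentation, measurement and colour analysis steps listed above can be illustrated with a minimal sketch that uses nested lists of RGB tuples in place of a real image. This is not the NHM code: the background-threshold approach, the function names, the `tolerance` default and the `pixels_per_mm` scale (assumed to come from a scale bar in the image) are all assumptions made for illustration.

```python
def segment(image, background, tolerance=30):
    """Return a boolean mask that is True where a pixel differs from the
    background colour by more than `tolerance` in at least one channel."""
    return [
        [any(abs(channel - bg) > tolerance for channel, bg in zip(pixel, background))
         for pixel in row]
        for row in image
    ]


def bounding_box_mm(mask, pixels_per_mm):
    """Width and height in millimetres of the box enclosing all foreground pixels."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        raise ValueError("no foreground pixels found")
    width_px = cols[-1] - cols[0] + 1
    height_px = rows[-1] - rows[0] + 1
    return width_px / pixels_per_mm, height_px / pixels_per_mm


def mean_colour(image, mask):
    """Mean (R, G, B) of the foreground pixels."""
    pixels = [pixel
              for row, mask_row in zip(image, mask)
              for pixel, keep in zip(row, mask_row) if keep]
    if not pixels:
        raise ValueError("no foreground pixels found")
    count = len(pixels)
    return tuple(sum(pixel[i] for pixel in pixels) / count for i in range(3))


if __name__ == "__main__":
    W = (255, 255, 255)  # white background
    G = (40, 120, 40)    # hypothetical healthy leaf pixel
    D = (60, 100, 20)    # hypothetical discoloured leaf pixel
    image = [
        [W, W, W, W],
        [W, G, D, W],
        [W, G, G, W],
    ]
    mask = segment(image, background=W)
    print(bounding_box_mm(mask, pixels_per_mm=4))
    print(mean_colour(image, mask))
```

A single mean colour per specimen is the simplest possible summary; a trial relating leaf colour to DNA quality would probably need a richer description, such as a colour histogram.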
Haston, E, Albenga, L, Chagnoux, S, Cubey, R, Drinkwater, R, Durrant, J, Gilbert, G, Glöckler, F, Green, L, Harris, D, Holetschek, J, Hudson, L, Kahle, P, King, S, Kirchhoff, A, Kroupa, A, Kvaček, J, Le Bras, G, Livermore, L, Mühlberger, G, Paul, D, Phillips, S, Smirnova, L & Vacek, F (2016). Automating capture of metadata for natural history specimens. Deliverable 4.2 for SYNTHESYS3.
Meetings
13 March 2015
17 April 2015