JRA/objective1/task1

Task 1: Automatic processing (segmentation) of digital images

Research and develop edge detection technology to locate and classify multiple regions of interest within images of NH specimens. Using the principle that pixels in a segment are similar with respect to some characteristic or computed property (e.g. colour, intensity, or texture), develop a method to semi-automatically detect, crop and classify these regions of interest such that they can be subject to appropriate additional processing.
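
As a rough illustration of this principle, the sketch below separates foreground regions from a light background by intensity thresholding and crops each connected region. It is a minimal example of the general approach (assuming Python with OpenCV 4.x), not the algorithm implemented in the project's tools, and the file names, blur kernel and area cut-off are placeholder values.

    # Minimal region-of-interest sketch: threshold, find connected regions, crop.
    # Assumes specimens are darker than a fairly uniform, light background.
    import cv2

    image = cv2.imread("drawer_scan.jpg")            # placeholder input image
    grey = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(grey, (5, 5), 0)

    # Otsu's method picks an intensity threshold separating foreground from background.
    _, mask = cv2.threshold(blurred, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Each external contour of the mask is a candidate region of interest (OpenCV 4.x API).
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    for i, contour in enumerate(contours):
        x, y, w, h = cv2.boundingRect(contour)
        if w * h < 2500:                             # ignore small specks of noise
            continue
        cv2.imwrite(f"crop_{i:04d}.png", image[y:y + h, x:x + w])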


Subtask 1: Development of software for identifying, and potentially cropping, single specimens in a multi-specimen item


The Deliverable for Task 1.1 resulted in Inselect, a desktop software application that automates the cropping of individual images of specimens from whole-drawer scans and similar images that are generated by digitisation of museum collections. It combines image processing, barcode reading, validation of user-defined metadata and batch processing to offer a high level of automation. Inselect runs on Windows and Mac OS X and is open-source. Inselect was developed by the Natural History Museum, London (NHM) and was publicly released in September 2014.
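
As an aside on the barcode-reading step, the hedged sketch below shows how a barcode in a cropped specimen image can be decoded with the open-source pyzbar library. This is an illustration only, not Inselect's own implementation, and the file name is a placeholder.

    # Read any barcodes present in an already-cropped specimen image.
    from PIL import Image
    from pyzbar.pyzbar import decode

    crop = Image.open("crop_0001.png")               # placeholder crop
    for symbol in decode(crop):
        # symbol.data holds the raw bytes of the barcode, e.g. a catalogue number.
        print(symbol.type, symbol.data.decode("utf-8"))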

Since its release Inselect has been in almost continual development, testing and refinement. In the current reporting period (since September 2015), more than 18 major Inselect issues (both bug fixes and new features) have been closed (a complete list is available here). A major output was the launch of a new website for Inselect, which provides greatly improved user documentation and a gallery of examples.

In November 2015 an article on Inselect was published in PLOS ONE: Hudson LN, Blagoderov V, Heaton A, Holtzhausen P, Livermore L, Price BW, van der Walt S and Smith VS. 2015. Inselect: automating the digitization of natural history collections. PLOS ONE 10(11): e0143402. doi:10.1371/journal.pone.0143402.

In collaboration with the US-based iDigBio, a major national digitisation project funded by the National Science Foundation (NSF), Lawrence Hudson (research software engineer at NHM) and Ben Price (entomology curator at NHM) presented an introductory training webinar on Inselect on 29 March 2016: “Insights into Inselect Software: automating image processing, barcode reading, and validation of user-defined metadata.” Recordings are available online.

Talks were given by both Lawrence Hudson (“Inselect - applying computer vision to facilitate rapid record creation and metadata capture”) and Natalie Dale-Skey (entomology curator at NHM; “Streamlining specimen digitisation through the use of Inselect - a curator's perspective”) at the June 2016 SPNHC conference in Berlin. Immediately following the conference, Lawrence Hudson, Natalie Dale-Skey and Ben Price delivered a training session on Inselect at the SYNTHESYS3 and iDigBio joint workshop “Selected tools for automated metadata capture from specimen images.”

Since its launch Inselect has been downloaded over 400 times. It is now being used or evaluated by more than ten NH organisations across at least six countries to assist the digitisation of microscope slides (in excess of 100,000 at NHM), pinned insect specimens, malaise trap samples and palaeontological specimens.

Subtask 2: Review of tools to select regions of interest in individual specimens to identify different labels, particularly to help with Task 1.2.

This subtask investigates the identification of regions of interest when preparing images for OCR, feeding into Task 1.2 and Obj. 3.

Participants: Jörg Holetschek (BGBM), Elspeth Haston (RBGE), Sarah Phillips (RBGK)

The Deliverable for Task 1.2 resulted in a broad-ranging report covering the use of Optical Character Recognition (OCR), Natural Language Processing (NLP) and Handwritten Text Recognition (HTR) for automating text data capture, as well as the automated capture of character data from images.

OCR

A range of currently available OCR options was reviewed and trialled. The trials involved images representing a variety of material including plants, insects, molluscs and fossils. This work resulted in recommendations for OCR solutions, which included:

  • A server-based option (ABBYY Recognition Server v3)
  • A PC option (ABBYY FineReader v12 Professional)
  • Two online service options (Onlineocr.net and Newocr.com), which performed best of the online services tested
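
The recommended ABBYY products are licensed desktop or server applications and the online services are used interactively, so purely as an illustrative stand-in (not one of the recommended options) the sketch below shows how OCR of a label image can be scripted with the open-source Tesseract engine via pytesseract. The file name is a placeholder and a local Tesseract installation is assumed.

    # OCR a cropped label image and print the recognised text.
    from PIL import Image
    import pytesseract   # requires the Tesseract OCR engine to be installed

    label = Image.open("label_crop.png")             # placeholder label image
    text = pytesseract.image_to_string(label, lang="eng")
    print(text)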

NLP

A short review of the current state of progress in NLP was carried out, identifying some of the key projects and individuals involved in this area. Contact was made with Ed Gilbert of Arizona State University and Symbiota, and arrangements were made to test three portals which have incorporated NLP in their workflow. This software is used to manage NH specimen data and images, but it also contains functionality which processes specimen images through OCR software, or uses uploaded external OCR output, and then semi-automatically parses the text output using NLP into database fields corresponding to international standards for biodiversity data. This functionality can result in much of the data being entered into the database automatically, thereby reducing the time required by curators or citizen scientists to enter the data. This is considered an important area for future development.
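
As a toy illustration of the parsing step only (Symbiota's NLP is considerably more sophisticated), the sketch below maps a few lines of invented label OCR output onto Darwin Core-style fields using simple rules; the label text, patterns and field choices are placeholders for illustration.

    # Rule-based parsing of OCR text into Darwin Core-style fields.
    import re

    # Invented example of OCR output from a herbarium label (illustration only).
    ocr_text = "\n".join([
        "ROYAL BOTANIC GARDEN EDINBURGH",
        "Coll. J. Smith  No. 1234",
        "12 May 1987",
        "Altitude 1200 m",
    ])

    record = {}
    collector = re.search(r"Coll\.\s*(.+?)\s+No\.\s*(\d+)", ocr_text)
    if collector:
        record["recordedBy"] = collector.group(1)
        record["recordNumber"] = collector.group(2)
    date = re.search(r"\b(\d{1,2}\s+\w+\s+\d{4})\b", ocr_text)
    if date:
        record["verbatimEventDate"] = date.group(1)
    elevation = re.search(r"Altitude\s+(\d+)\s*m\b", ocr_text)
    if elevation:
        record["verbatimElevation"] = elevation.group(1) + " m"

    print(record)   # e.g. {'recordedBy': 'J. Smith', 'recordNumber': '1234', ...}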

HTR

Work was carried out to determine whether specimens could be automatically classified based on the recognition and classification of the features holding data. A case study was based at the herbarium of the Botanic Garden and Botanical Museum Berlin-Dahlem (BGBM) as part of StanDAP-Herb, a joint project with the University of Applied Sciences, Hannover. ‘Linienextraktor’, the software used for this study, implements feature-recognition algorithms that can be applied to herbarium specimens. The study produced a series of recommendations and guidelines for future work using this software, including the required resolution of the images, the number of templates required to cover variation, minimising the number of words in a template, and recommendations on the processing set-up.
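
Linienextraktor itself is not described in detail here, so purely as a hedged sketch of the underlying idea of template matching, the example below uses OpenCV's matchTemplate to locate a small template image (for instance a recurring printed word or stamp) on a herbarium sheet. The file names and match threshold are placeholders.

    # Locate a small template image within a larger herbarium sheet image.
    import cv2

    sheet = cv2.imread("herbarium_sheet.jpg", cv2.IMREAD_GRAYSCALE)
    template = cv2.imread("word_template.png", cv2.IMREAD_GRAYSCALE)

    # Normalised cross-correlation: scores close to 1.0 indicate a strong match.
    scores = cv2.matchTemplate(sheet, template, cv2.TM_CCOEFF_NORMED)
    _, best_score, _, best_loc = cv2.minMaxLoc(scores)

    if best_score > 0.8:                             # placeholder threshold
        h, w = template.shape
        x, y = best_loc
        print(f"Template found at ({x}, {y})-({x + w}, {y + h}), score {best_score:.2f}")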

In addition, software developed for historical handwritten documents was tested on NH specimen labels. Contact was made with a separate EU-funded FP7 project, tranScriptorium, which has developed software incorporating HTR technology. One of the tools developed within the tranScriptorium project, Transkribus, was installed locally by four SYNTHESYS3 Beneficiaries for testing. The results were promising, suggesting that further collaboration between tranScriptorium and NH collections would be beneficial, including exploring the use of crowdsourcing to help with the mark-up process.

Character data capture

Work focussed on analysing specimen images to capture non-text specimen data. A series of open-source prototypes was developed by NHM to do the following:

  • segment specimens from their backgrounds and segment regions of interest (e.g. particular body parts)
  • detect morphological features to be used for classification (e.g. markings that indicate gender)
  • calculate physical dimensions from images (e.g. wing length; a calibration sketch follows this list)
  • analyse colour for use in classification (e.g. wing colours)
  • generate heat maps for regions of interest
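
As a hedged sketch of the calibration behind such measurements (not the NHM prototype code), the example below converts a pixel distance between two landmark points into millimetres using a scale bar of known length that appears in the image; all coordinates and lengths are placeholder values.

    # Convert a pixel measurement to millimetres using an in-image scale bar.
    import math

    scale_bar_pixels = 482.0        # measured length of the scale bar in the image
    scale_bar_mm = 10.0             # known physical length of the scale bar
    mm_per_pixel = scale_bar_mm / scale_bar_pixels

    # Two landmark points on the specimen (e.g. wing base and wing tip), in pixels.
    (x1, y1), (x2, y2) = (1034, 511), (1617, 890)
    wing_length_px = math.hypot(x2 - x1, y2 - y1)
    print(f"Wing length is approximately {wing_length_px * mm_per_pixel:.1f} mm")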

The code for these tools is available in a GitHub repository. In conjunction with Work Package NA2, trials are being carried out by RBGK and RBGE using the colour analysis algorithm to identify any correlation between leaf colour and quality of DNA and to determine whether the tool can be used to aid material selection for sequencing.
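
As an illustration of the kind of colour statistic such a trial might use, the minimal sketch below summarises leaf colour as mean hue and saturation. It assumes OpenCV and NumPy and a leaf image already segmented so that non-leaf pixels are black, and it is not the NHM colour-analysis algorithm.

    # Summarise the colour of a segmented leaf image as mean hue and saturation.
    import cv2
    import numpy as np

    leaf = cv2.imread("leaf_crop.png")               # placeholder segmented image
    hsv = cv2.cvtColor(leaf, cv2.COLOR_BGR2HSV)

    # Keep only pixels belonging to the leaf (ignore the black background).
    leaf_pixels = hsv[leaf.sum(axis=2) > 0]

    # Browner, drier material tends towards lower saturation and a hue shifted away from green.
    mean_hue = float(np.mean(leaf_pixels[:, 0]))
    mean_saturation = float(np.mean(leaf_pixels[:, 1]))
    print(f"mean hue: {mean_hue:.1f}, mean saturation: {mean_saturation:.1f}")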

Eight SYNTHESYS3 Beneficiaries collaborated to review, trial and develop tools for OCR, NLP, HTR, template matching and pattern recognition. There was further collaboration with the US-based iDigBio and the EU-based tranScriptorium.

The results of the OCR study emphasised the value of OCR technology in the digitisation workflow and identified three options which provided the best results: one server-based (ABBYY Recognition Server v3), one that runs on a PC (ABBYY FineReader v12 Professional) and one online service (Onlineocr.net). The cost of the server-based solution is prohibitive for most institutes. Although ABBYY FineReader is not free and open-source, the trials found that this software would potentially be the best solution for many institutes. The quality of its output was consistently high and it coped well with specimen images which had not been cropped. Such images generally include plant material, which can produce ‘noise’ in some OCR software output, often resulting in several pages of meaningless characters as the software attempts to recognise words in the plant specimen itself; being able to process uncropped images therefore saves a significant amount of time in preparing the images. The accuracy of the output was also tested against pre-entered labels and it performed well. ABBYY is currently being used by several institutes in the USA, including the New York Botanic Garden, and ABBYY Recognition Server v3 is being used in one Beneficiary institute (RBGE). It is also being used by some institutes in conjunction with Symbiota, as described below. Based on these results, the JRA team decided to include this software in a training workshop to help users who would like to set it up for use in their institutes.

The review of NLP solutions presented some of the key projects and individuals involved in this area. Contact was made with Ed Gilbert of Arizona State University and with Symbiota, a platform developed in collaboration between the University of Wisconsin and Arizona State University to support communities of NH collection curators and enable them to build shared online portals for their collections and associated data. This has resulted in a number of networks covering different themes such as lichens, mycology, American Myrtaceae and North American Plants, and allows research communities to build virtual collection portals to share their work on the web. Symbiota is currently one of the best available solutions for parsing OCR text, and it was therefore decided to include this tool in a training workshop (see NA3 Task 3.3) with the aim of broadening the largely US-based Symbiota community into Europe, enabling European institutes to work together to apply OCR and parse their specimen data. Symbiota can be used effectively in conjunction with ABBYY FineReader.

The workshop included 16 participants, 3 instructors, 2 facilitators, 1 recording technician and 2 chairs. A post-workshop survey was carried out. The feedback was very positive overall and endorsed the need for this kind of training and the approach used.

A report on the workshop, which was held at the SPNHC annual conference in Berlin in June 2016, has been submitted within SYNTHESYS3 NA3 (Deliverable 3.8).

Haston, E, Albenga, L, Chagnoux, S, Cubey, R, Drinkwater, R, Durrant, J, Gilbert, G, Glöckler, F, Green, L, Harris, D, Holetschek, J, Hudson, L, Kahle, P, King, S, Kirchhoff, A, Kroupa, A, Kvaček, J, Le Bras, G, Livermore, L, Mühlberger, G, Paul, D, Phillips, S, Smirnova, L & Vacek, F (2016). Automating capture of metadata for natural history specimens. Deliverable 4.2 for SYNTHESYS3.