Difference between revisions of "JRA/objective1/task2"

From Synthesys3

Revision as of 13:33, 10 September 2014

Task 2: Automatic metadata capture

Develop software that automatically identifies properties of an image. These data “facets” will be captured without human intervention and will provide categories of information that let users search and browse virtual collections more effectively.

Specimen labels will be processed with Optical Character Recognition (OCR) software to extract their text, and methods to improve the accuracy of OCR on handwritten labels will be researched. OCR-extracted text from handwritten labels will need further processing and validation, for example via crowdsourcing methodologies (objective 2).


Subtask 1: Review development of tools and workflows which incorporate automatic or semi-automatic metadata capture using OCR

This will cover formats, standards (including Darwin Core) and links with other digitisation groups working on OCR. It may also cover registers, field books and card catalogues.


Subtask 2: Review of development of NLP for parsing OCR text into Darwin Core fields

This is currently outside scope but could be included if it becomes prioritised during the project.
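
To illustrate what parsing OCR text into Darwin Core fields could involve, the following is a minimal rule-based sketch, not a method adopted by the project. The label layout, the "Leg."/"Coll." collector prefixes, the day-first date order and the function name `parse_label` are all assumptions made for the example; `recordedBy`, `eventDate` and `verbatimEventDate` are existing Darwin Core terms.

```python
import re

# Hypothetical patterns for a typed label; real labels vary widely.
COLLECTOR_RE = re.compile(r"(?:Leg\.|Coll\.)\s*(?P<name>[^,\n]+)", re.IGNORECASE)
# Assumes day-first numeric dates such as 12.06.1923 or 12/06/1923.
DATE_RE = re.compile(r"\b(?P<day>\d{1,2})[./](?P<month>\d{1,2})[./](?P<year>\d{4})\b")


def parse_label(ocr_text: str) -> dict:
    """Map OCR text to a few Darwin Core terms; unmatched terms are omitted."""
    record = {}
    collector = COLLECTOR_RE.search(ocr_text)
    if collector:
        record["recordedBy"] = collector.group("name").strip()
    date = DATE_RE.search(ocr_text)
    if date:
        # Keep the text as written, and add an ISO 8601 form for eventDate.
        record["verbatimEventDate"] = date.group(0)
        record["eventDate"] = "{:04d}-{:02d}-{:02d}".format(
            int(date.group("year")), int(date.group("month")), int(date.group("day"))
        )
    return record


sample = "Flora of Kent\nLeg. J. Smith, 12.06.1923\nChalk grassland near Dover"
print(parse_label(sample))
```

Fixed patterns like these break quickly on abbreviations, OCR errors and handwriting, which is the gap NLP approaches would be reviewed to fill.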


Subtask 3: Review of development of Natural Handwriting Recognition (NHR) in workflows to automate the identification of collectors’ handwriting and to train software to recognise text for prioritised collectors

This will include communicating with other groups, including Hannover and iDigBio. It may also cover field books and training the software.


Subtask 4: Review of automatic capture of characters, including colour and shape, as well as EXIF data

This is currently outside scope but could be included if it becomes prioritised during the project.