Human & Machine Network of software components for data extraction


In the era of data-intensive scientific discovery, Big Data scientists across all communities spend the majority of their time and effort collecting, integrating, curating, and transforming data, and assessing its quality, before actually performing discovery analysis. Some endeavors may even start from information that is not available or accessible in digital form; when it is available, it is often unstructured and incompatible with analytics tools that require structured, uniformly formatted data. The two main methods for dealing with the volume and variety of data, and for accelerating the rate of digitization, have been crowdsourcing and machine-learning solutions. However, very little has been done to take advantage of both types of solutions simultaneously, or to make it easier for different efforts to share and reuse the software elements they develop.

The vision of the Human- and Machine-Intelligent Network (HuMaIN) project is to accelerate scientific data digitization through fundamental advances in the integration and mutual cooperation of human and machine processing, addressing the practical hurdles and bottlenecks present in scientific data digitization. Even though HuMaIN concentrates on digitization tasks faced by the biodiversity community, the software elements being developed are generic in nature and are expected to be applicable to other scientific domains. For example, surveying the surface of the moon for craters requires the same type of crowdsourcing tool as finding words in text, and in both cases the same question can be tested: whether machine-learning tools could produce similar results.

The HuMaIN project has conducted research on, and developed, the following software elements (or aspects thereof):

  • Configurable Machine-Learning applications for scientific data digitization (e.g., Optical Character Recognition and Natural Language Processing)
  • Workflows for coordinated human-machine systems that exploit feedback loops (e.g., based on the consensus and quality of crowdsourced data) for self-adaptation to change and for the sustainability of the overall system
  • An experimentation platform for extracting information from images, which can be used to promote and accelerate the experimental validation of information-extraction methods within the biodiversity community
  • Optimization of crowdsourcing sessions, through the study of the most efficient data-entry methods and of task complexity, while taking the opinions of the volunteers into account
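The feedback loop mentioned above can be illustrated with a minimal sketch: crowdsourced transcriptions of the same item are merged by majority vote, and low-agreement items are routed back for further review. All names and the 0.7 agreement threshold are illustrative assumptions, not HuMaIN's actual API.

```python
from collections import Counter

def consensus(transcriptions, threshold=0.7):
    """Return (majority value, whether agreement meets the threshold)."""
    counts = Counter(t.strip().lower() for t in transcriptions)
    value, votes = counts.most_common(1)[0]
    agreement = votes / len(transcriptions)
    return value, agreement >= threshold

def route(task_id, transcriptions, threshold=0.7):
    """Accept high-agreement answers; feed the rest back for more labels."""
    value, agreed = consensus(transcriptions, threshold)
    if agreed:
        return ("accept", task_id, value)    # quality goal met
    return ("recollect", task_id, value)     # feedback loop: request more input

# Example: three volunteers transcribe one specimen-label field.
# Only 2 of 3 answers match after normalization (2/3 < 0.7), so the
# task is routed back for additional volunteer (or machine) input.
print(route("label-042", ["Quercus alba", "quercus alba", "Quercus alba L."]))
```

A real deployment would also weight votes by each volunteer's historical accuracy, or hand persistently low-agreement items to a machine-learning component, but the accept/recollect decision above captures the core self-adaptation mechanism.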

A set of workflows was deployed to provide the necessary execution environment, with traceability of the task executions involved in human-machine workflows. A cost-effectiveness analysis of all the software elements developed in this project will assess and evaluate long-standing what-if scenarios pertaining to human- and machine-intelligent tasks. The crowdsourcing activities attracted a wide range of users with tasks requiring little expertise, while at the same time exposing volunteers, including graduate and undergraduate students, to applied science and engineering.