Last edited by kashif at 24/08/2017 2:36 PM
Data mining, also called data or knowledge discovery, refers to the process of sifting through large sets of data. Using artificial intelligence techniques and complex digital programming, data mining attempts to uncover the relevant information in a large data set and present that knowledge in a form that is readily available and easily accessible. However, little attention has been paid to the process itself; the focus has been primarily on the data and the displayed information. This research would provide a variety of creative solutions for the whole big data workflow by using provenance to capture the footprints of the process. The provenance captures the key information needed for reuse and reproducibility.
This software aims to incorporate both provenance and persistent identifiers into the data analytics processes in big data. A provenance system responsible for creating entities and recording processes will be developed. The entities will be made available via persistent identifiers, allowing the knowledge to be represented in both graphical and textual forms.
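As an illustration only, the idea of creating entities and recording the processes that connect them could be sketched as follows. The layout loosely follows the PROV-JSON serialization (top-level "entity", "activity", "used" and "wasGeneratedBy" maps). The class name ProvenanceRecorder, the "ex" prefix, the example.org namespace, the "ex:persistentIdentifier" attribute and the sample DOI are all hypothetical and are not part of the project.

```python
import itertools
import json
from datetime import datetime, timezone


class ProvenanceRecorder:
    """Hypothetical helper that collects a PROV-JSON-style record in a dict."""

    def __init__(self, prefix="ex", namespace="http://example.org/"):
        # The prefix and namespace are placeholders chosen for this sketch.
        self.prefix = prefix
        self._ids = itertools.count(1)
        self.doc = {
            "prefix": {prefix: namespace},
            "entity": {},
            "activity": {},
            "used": {},
            "wasGeneratedBy": {},
        }

    def _qualified(self, local_name):
        return f"{self.prefix}:{local_name}"

    def entity(self, local_name, pid=None):
        """Register a data item, optionally tagged with a persistent identifier."""
        name = self._qualified(local_name)
        attributes = {}
        if pid is not None:
            # Assumed attribute name; the real schema is not defined in the text.
            attributes[self._qualified("persistentIdentifier")] = pid
        self.doc["entity"][name] = attributes
        return name

    def activity(self, local_name, inputs, outputs, start, end):
        """Record a process step together with what it used and generated."""
        name = self._qualified(local_name)
        self.doc["activity"][name] = {
            "prov:startTime": start.isoformat(),
            "prov:endTime": end.isoformat(),
        }
        for used in inputs:
            self.doc["used"][f"_:r{next(self._ids)}"] = {
                "prov:activity": name,
                "prov:entity": used,
            }
        for generated in outputs:
            self.doc["wasGeneratedBy"][f"_:r{next(self._ids)}"] = {
                "prov:entity": generated,
                "prov:activity": name,
            }
        return name

    def to_json(self):
        """Textual form of the record; a graph view could be built from the same dict."""
        return json.dumps(self.doc, indent=2)


if __name__ == "__main__":
    recorder = ProvenanceRecorder()
    raw = recorder.entity("raw-data")
    clean = recorder.entity("cleaned-data", pid="doi:10.1234/example")  # made-up DOI
    now = datetime.now(timezone.utc)
    recorder.activity("cleaning", [raw], [clean], start=now, end=now)
    print(recorder.to_json())
```

Keeping the record as a plain dictionary means the same data can feed both a textual export and a graph rendering, which matches the two representations mentioned above.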
Definition of adequate standards for infrastructure requirements related to (at least) the following aspects: cloud and Puppet-based infrastructure; secure and replicated database storage for high availability; an architecture that supports auto-scaling and load balancing; provenance database requirements.
Definition of cyber security requirements related to (at least) the following aspects: a secured framework for both applications and network; secure authentication using LDAP, two-factor authentication or another identity management solution; access control lists (ACLs).
Definition of requirements and standards for such a provenance system dealing with persistent identifiers related to (at least) the following aspects: usability in terms of data formats; different sources for data by using both push and pull technology; a data acquisition API complying with W3C PROV standards; a utility to convert non-standard data formats so that they comply with W3C PROV standards; generation of unique persistent identifiers for identifying the location of the data or physical objects (preferably DOI); definition of a monitoring routine to check the validity of the persistent identifiers; the system must be subject to licensing conditions, so that the user can restrict data sharing for either the whole provenance record or any physical object; display of the data on the secure web interface in different formats (e.g., use of SQL and NoSQL query languages); use of elastic search functionality.
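The monitoring routine for persistent identifiers mentioned above might look roughly like the sketch below. It assumes DOIs are used, applies only a loose syntactic check (the regular expression is an approximation, not the official DOI grammar), and takes the actual network lookup as a caller-supplied function so the routine itself stays testable. The names check_identifiers, doi_to_url and the status labels are hypothetical.

```python
import re
from typing import Callable, Dict, Iterable
from urllib.parse import quote

# Loose shape check: "10.", a 4-9 digit registrant code, "/", then a suffix.
# This is an assumption for the sketch, not a full DOI validator.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_to_url(doi: str) -> str:
    """Build the resolver URL for a DOI using the public doi.org proxy."""
    return "https://doi.org/" + quote(doi)


def check_identifiers(
    dois: Iterable[str], resolves: Callable[[str], bool]
) -> Dict[str, str]:
    """Classify each DOI as 'malformed', 'ok' or 'unresolved'.

    `resolves` is supplied by the caller and should return True when the
    given URL can be reached (for example via an HTTP HEAD request).
    """
    report = {}
    for doi in dois:
        if not DOI_PATTERN.match(doi):
            report[doi] = "malformed"
        elif resolves(doi_to_url(doi)):
            report[doi] = "ok"
        else:
            report[doi] = "unresolved"
    return report


if __name__ == "__main__":
    # Stand-in resolver for demonstration; a real one would query the network.
    known = {"https://doi.org/10.1234/abc.def"}
    result = check_identifiers(
        ["10.1234/abc.def", "10.1234/missing", "not-a-doi"],
        resolves=lambda url: url in known,
    )
    print(result)
```

Passing the resolver in as a function keeps the scheduling and network policy (timeouts, retries, rate limits) outside the check itself, so those decisions can follow whatever the infrastructure requirements end up specifying.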