Last edited by kashif at 24/08/2017 3:02 PM
Provenance refers to the process of recording the origin and history of data, detailing the chronology of ownership of an object or entity. The evidence provenance produces can be applied to a wide range of areas, including auditing, error detection and recovery, and data-dependent analysis. Provenance allows the authenticity of data to be assessed, making the data reproducible and reusable from any stage of the process. Provenance in big data is an extremely challenging open problem: preserving exact copies of whole data sets that are constantly changing (as is done with backups of small data sets) may not be feasible due to infrastructure requirements, especially in terms of resource availability and cost.
When web service-generated data is used by a process to create a dataset instead of a static input, we need to use sophisticated provenance representations of the web service request as we can no longer just link to data stored in a repository. A graph-based provenance representation, such as the W3C’s PROV standard, can be used to model the web service request as a single conceptual dataset and also as a small workflow with a number of components within the same provenance report. This dual representation does more than just allow simplified or detailed views of a dataset’s production to be used where appropriate.
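To make the dual representation concrete, here is a minimal sketch in plain Python. It models a web service request both as a single conceptual dataset and as a small workflow of components within one report, loosely following PROV concepts (entities, activities, and wasGeneratedBy/wasInformedBy relations). The `ex:` names, the endpoint URL, and the dict layout are illustrative assumptions; a real system would use a PROV serialization such as PROV-N or PROV-JSON.

```python
def build_provenance_report(request_url, params):
    """Model a web service request as both a single conceptual dataset
    and a small workflow of components, within one provenance report.
    Uses plain dicts as a stand-in for a PROV serialization; all
    'ex:' identifiers are illustrative, not from any standard."""
    return {
        # Simplified view: the whole request's output as one entity.
        "entity": {
            "ex:result-dataset": {
                "prov:type": "prov:Entity",
                "ex:sourceRequest": request_url,
            },
        },
        # Detailed view: the request broken into workflow components.
        "activity": {
            "ex:build-query": {"ex:params": params},
            "ex:invoke-service": {"ex:endpoint": request_url},
            "ex:parse-response": {},
        },
        # Relations tying the dataset to the steps that produced it.
        "wasGeneratedBy": [
            {"entity": "ex:result-dataset", "activity": "ex:parse-response"},
        ],
        "wasInformedBy": [
            {"informed": "ex:invoke-service", "informant": "ex:build-query"},
            {"informed": "ex:parse-response", "informant": "ex:invoke-service"},
        ],
    }

report = build_provenance_report(
    "https://api.example.org/v1/records",  # hypothetical endpoint
    {"year": 2017, "format": "json"},
)
print(list(report["entity"]))     # simplified view of the dataset
print(sorted(report["activity"]))  # detailed workflow view
```

Because both views live in the same report, a consumer can start from the single dataset entity and drill down into the workflow components only when the extra detail is needed.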
It also allows persistent identifiers to be assigned to instances of a web service request, thus enabling one form of dynamic data citation, and allows those identifiers to resolve to whatever level of detail implementers think appropriate in order for that web service request to be reproduced.
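One way to assign such an identifier is to derive it deterministically from the canonical form of the request, so that the same endpoint and parameters always yield the same identifier. The sketch below is an assumption-laden illustration: the `ex:request/` prefix, the hypothetical endpoint, and the truncated SHA-256 digest are not part of any real PID scheme.

```python
import hashlib
import json

def mint_request_pid(endpoint, params):
    """Derive a stable identifier for one instance of a web service
    request by hashing its canonical JSON form. The 'ex:request/'
    prefix is an illustrative assumption, not a real PID scheme."""
    canonical = json.dumps(
        {"endpoint": endpoint, "params": params},
        sort_keys=True,            # parameter order must not change the PID
        separators=(",", ":"),     # canonical, whitespace-free encoding
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return "ex:request/" + digest

pid = mint_request_pid("https://api.example.org/v1/records",  # hypothetical
                       {"year": 2017, "format": "json"})

# The same endpoint and parameters always yield the same identifier,
# regardless of the order in which the parameters were supplied.
assert pid == mint_request_pid("https://api.example.org/v1/records",
                               {"format": "json", "year": 2017})
print(pid)
```

A resolver keyed on such identifiers could then return either the simplified dataset view or the full workflow description, at whatever level of detail the implementers deem sufficient for reproducing the request.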
Provenance is important because it allows us to assess the authenticity of data and to make it reproducible and reusable from any stage of the process. In practice, however, researchers have paid insufficient attention to provenance services, and much of the relevant research remains at the conceptual level. For example, reproducing exact data extracts is a daunting task, and for very large datasets, preserving a copy of each data extract may be out of the question.