-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathmain-icbc2020.tex
160 lines (110 loc) · 22.7 KB
/
main-icbc2020.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
\documentclass[conference]{IEEEtran}
\IEEEoverridecommandlockouts
% The preceding line is only needed to identify funding in the first footnote. If that is unneeded, please comment it out.
\usepackage{cite}
\usepackage{amsmath,amssymb,amsfonts}
\usepackage{algorithmic}
\usepackage{graphicx}
\usepackage{textcomp}
\usepackage{xcolor}
\def\BibTeX{{\rm B\kern-.05em{\sc i\kern-.025em b}\kern-.08em
T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}}
\input{definitions.tex}
\begin{document}
\title{Proof of Novelty\\
\thanks{This short paper was written for the VIRTUAL DESIGN CHALLENGE FOR AUTHENTICATING AND PROTECTING FULL MOTION VIDEOS: https://blockchain.ubc.ca/news/virtual-design-challenge-authenticating-and-protecting-full-motion-videos}}
\author{\IEEEauthorblockN{Daniel Severo}
\IEEEauthorblockA{
\textit{Independent Scientist}\\
S\~{a}o Paulo, Brazil\\
\maketitle
\begin{abstract}
We propose a design for securing novelty of archived content in distributed ledgers, called \emph{Proof of Novelty}. What constitutes as novel is decided through a consensus mechanism together with a similarity function, which is selected according to the content type (e.g. full-motion videos, textual documents). Scalability is guaranteed by forming a validation committee with cryptographic sortition, which use statistical hypothesis testing to decide on the probability of a content being novel or not. The system can trade-off computational with statistical performance by manipulating parameters. We discuss the usage of this design to secure the novelty of full-motion videos and end with a proposal of future lines of research that can extend the systems capabilities.
% UNTITLED is a Distributed Patent System design proposal equipped with a Prior Art search mechanism. Patentability requirements are defined through a consensus protocol using Smart Contracts and collaborative training of machine learning algorithms. Removing the patent clerk and shifting the burden of proof of patentability to the patentee, we guarantee scalability and efficiency of the blockchain. The system is highly sensitive to the invention’s medium (e.g. audio files, textual documents) as it defines the class of algorithms employed to calculate similarity, which in turn defines what constitutes as prior art. We show how UNTITLED can be applied to full-motion video archives to protect against tampering and dissemination of false information through Deepfakes.
\end{abstract}
\begin{IEEEkeywords}
blockchain, patent, smart contracts, prior art, full-motion videos, proof of novelty
\end{IEEEkeywords}
\section{Introduction}
Guaranteeing novelty and authenticity of modern media content (e.g. full-motion videos, textual documents, audio files) is crucial for decision making in areas such as court trials and journalism. Failing to do so risks undermining the credibility of the decision maker. For example, journalists fact check their findings before publishing news articles \cite{graves2016deciding}; patent clerks exhaustively gather evidence of absence of prior art before emitting patent certificates \cite{callaert2006traces}; and media archives (e.g. YouTube) protect artists by taking down copyright-infringing content \cite{o2006digital}. Scholarly peer review systems also share this characteristic, where a reviewer is tasked with evaluating the claims made in an academic paper during the process of submission to a conference or journal.
The central concepts underlying the aforementioned examples are \emph{novelty} and \emph{authenticity}. An item is deemed authentic when its provenance is indisputable. What constitutes as novel is often the cause of discourse due to its subjective nature and the trust bestowed upon a centralized agent to discriminate on what is, and what is is not, genuinely novel. More recently, a new issue has emerged due to the proliferation of digital content in the internet age; verifying novelty against large archives. Even if the definition of novelty can be agreed upon by interested parties, exhaustively comparing candidate and existing contents can be computationally intractable due to large data volumes (e.g. 500 hours of video images are uploaded to YouTube every minute \cite{robertson2015500}). Defining and securing authenticity is comparably more mature and has benefited considerably in recent years with the advent of Digital Signatures \cite{pointcheval2000security}.
In the case of media archives, checking for novelty prior to acceptance can be seen as a way of preserving the moral integrity of \emph{currently} archived content. A malicious agent can attack content by submitting a modified version and leverage social media as a vector for propagating false information. Considerable time may pass until the false content is debunked and may cause irreversible damage to the owner of the original content. Complexity of attacks range from audio manipulation and video context clipping to modern computer vision techniques (e.g. Deepfakes) \cite{chesney2018deep}.
Pior work has tackled the issue of detecting video tampering \emph{from within} the archive by using content-sensitive identifiers (e.g. digests from cryptographic hash functions) at the moment of content ingestion and storing them in permanent ledgers (e.g. blockchains). The same techniques are also employed for combating spoofing, where content provenance is contested; closely relating to issues with authenticity \cite{hasan2019combating}. To the best of our knowledge, little to no work has been done in preventing the ill-usage of created content.
In this paper, we contribute with
\begin{enumerate}
\item a consensus mechanism for securing genuity, called \emph{Proof of Novelty}\footnote{https://github.com/dsevero/Proof-of-Novelty} (PoN);
\item an approach to combat false media content in digital archives with PoN.
\end{enumerate}
Our design draws inspiration from \emph{Patent Systems}. We discuss shortcomings of our approach as well as model details in the following sections.
\section{Background}
\subsection{Patent Systems and Prior Art}
A patent is a legal document that provides proof of ownership of intellectual property (IP) \cite{callaert2006traces}. It is commonly issued by government run agencies to individuals or organizations. Its main function is to secure legal exclusivity regarding sales, production and distribution of the IP to the owner. The procedure of emission is initiated with a formal request by the party interested in obtaining the patent, called a \emph{patentee}. A \emph{patent clerk}, representing the emitting agency, is then attributed with the task of verifying patentability conditions such as (but not limited to) novelty and non-obviousness of the invention. This is done by collecting evidence of absence of previous similar work, called \emph{Prior Art}. Searching is done against niche databases specialized in storing technical documents \cite{callaert2006traces}.
Granting a patent is an attestation of novelty, conditioned on trust in the emitting agency. An improper Prior Art search can lead to erroneous conclusions of novelty regarding the patentee's invention \cite{bashir2010improving}.
\subsection{Similarity Measures}
A similarity measure is any function that quantifies the degree of similarity between objects \cite{lesot2009similarity}. For example, the Jaccard index measures similarity between sets $\sA$ and $\sB$, by comparing the number of common elements. Formally, it is the ratio between the the intersection and union, and varies between 0 and 1; $J(\sA, \sB) = \vert\sA\cap\sB\vert/\vert\sA\cup\sB\vert$. These metrics are common in applications regarding information retrieval (e.g. search systems) and recommendation systems \cite{metzler2007similarity}.
Distance functions $d$ are common throughout literature and can be converted into a similarity measure by a simple transformation such as $d \to \frac{1}{1+d}$. Both distance and similarity measures are sensitive to their domain. For example, the Jaccard index previously mentioned can not be used directly on real valued data (what would $J(5, 2.4)$ mean?) \cite{lesot2009similarity}.
There is a vast amount of literature addressing metrics for text, full-motion videos and audio applications \cite{kulis2013metric}. The field of \emph{Similarity Learning} is concerned with the task of learning similarity and distance functions through the usage of labeled examples and Machine Learning techniques (i.e. Supervised Machine Learning) \cite{kulis2013metric}.
\subsection{Blockchains}
A Blockchain is a growing list of ordered records that contain digitally signed transactions \cite{antonopoulos2018mastering}. Appending a record to the chain is done by cryptographically hashing the list and adding it to the incoming record. Blockchains are considered immutable, since altering a transaction anywhere in the list would require recomputing the hash of all records that came after it \cite{antonopoulos2018mastering}. This data structure often lives on a distributed public peer-to-peer network, where all peers hold a copy of the list and must reach consensus on which incoming blocks should be appended.
Multiple consensus mechanism have been proposed throughout literature. Examples are Proof of Work \cite{dwork1992pricing}, Stake \cite{king2012ppcoin} and Authority \cite{de2018pbft} (PoW, PoS, and PoA; respectively). Some protocols require partial centralization to be secure, such as PoS and PoA. PoW can be fully distributed, but is inefficient in the time it takes to append a new block to the chain and requires a surplus amount of resources \cite{dwork1992pricing}.
A \emph{Smart Contract} usually refers to a computer program that is executed collaboratively on a blockchain \cite{szabo1997idea}. Consensus on the state of execution of the program is reached through the consensus mechanisms previously mentioned. The name comes from its usage in enforcing terms of agreements between parties in a way that doesn't require trust in a centralized moderator, such as a government agency.
The \emph{Ethereum Virtual Machine} is a distributed blockchain system that implements a set of machine-level instructions similar to a general purpose computer \cite{wood2014ethereum}. It supports Smart Contracts, which are usually written in languages such as Solidity, Vyper and LLL. Programs running on the EVM can be used as the back-end (i.e. server-side) for software applications called \emph{Decentralized Apps} (Dapps).
\subsection{Cryptographic Sortition}
Sortition is the act of randomly selecting elements of a group to form a committee. It is usually associated with political systems as an alternative to elections, where representatives are selected by sampling from the population of eligible citizens. For element $i$, the probability $p_i$ of being selected is proportional to some non-negative real valued weight $\omega_i$, such as $p_i = \omega_i/\sum_i\omega_i$. Choosing equal weights (i.e. $\omega_i = \omega_j; \forall i,j$) results in a uniform distribution, where every element has the same chance of being selected.
Sortition can be used to scale blockchain systems, by forming a committee of peers to reach consensus on new blocks; possibly replacing PoW \cite{gilad2017algorand}. Although promising, a naive implementation can cause security concerns due to interactions between committee members and their exposure to malicious agents. Algorand \cite{gilad2017algorand} created a consensus protocol which forms committees without the need of interactions or exposure, using \emph{Cryptographic Sortition}. Any peer holding a private key can verify and prove self-membership in the committee using a \emph{Verifiable Random Function} \cite{micali1999verifiable}.
\subsection{Content-Addressable Storage}
Location-addressable systems reference resources based on location. Examples are everyday systems such as bank accounts, housing addresses, traditional databases, and the internet. Content-addressable systems use the content itself (usually in the form of a hash) as a locator \cite{atkin2010system}.
The \emph{InterPlanetary File System} (IPFS) \cite{benet2014ipfs} is a distributed content-addressable storage system that is commonly used with blockchain technology and Smart Contracts. Referencing stored content by its cryptographic hash allows IPFS to remove duplicated files, as any two files that are exactly the same will have equal hashes.
\section{Problem Statement}
Let $\ContentSet$ represent a collection of content archived in a distributed content-addressable storage system such as IPFS. For each content $\content \in \ContentSet$ there exists some owner $\owner \in \OwnerSet$ such that the mapping $\content \to \owner$ is one-to-one but the inverse $\owner \to \content$ is one-to-many. In other words, one owner can have multiple contents, but each content has a unique owner.
A \emph{similarity measure} on $\ContentSet$ is a function $\simfunc \colon \ContentSet^2 \to \mathbb{R}$ that takes a pair of contents as input and returns a scalar that captures the notion of what constitutes as similar. $s$ can be anything from an Artificial Neural Network to a simple comparison of descriptive statistics.
The system will accept or reject the insertion of a candidate $\content^\prime \notin \ContentSet$ based on a subset of similar content $$\sS(\content^\prime, \sV) = \left\{ \simfunc(\content,\content^\prime) \mid \forall \content \in \sV \subseteq \ContentSet \right\}$$
with some rule $\mathcal{H}$
$$\sS(\content^\prime, \sV) \xrightarrow{\mathcal{H}} r \in \text{\{accept, reject\}}$$
where $(\mathcal{H}, s)$ is defined through consensus.
Accepting $\content^\prime$ can be viewed as the equivalent of emitting a \emph{certificate of novelty}. The objective is to minimize the rate of false positives while maintaining a reasonable rate of acceptance. An owner $\owner$ wishing to prove novelty of $\content$ can do so by showing that $\content \in \ContentSet$ and $\content \xrightarrow{} \owner$ to any entity that trusts the system $\left(\mathcal{H}, \simfunc \right)$.
\section{Related Work}
Previous work has tackled the problem of authenticity (i.e. proving provenance) by securing files on content-addressable storage (e.g. IPFS) with blockchain technology. \cite{hasan2019combating} implements a Smart Contract on EVM where artists can register their content hosted on IPFS to be shared, modified, and purchased. The blockchain tracks a history of interactions and creates a hierarchy of parent-child relationships that makes it possible to trace content provenance. Bernstein \cite{Bernstein} is a decentralized app that implements a Patent System. It can prove ownership, existence and integrity (i.e. tamper-proof) of stored files. ARCHANGEL \cite{bui2019archangel} is a tamper-proof video archive equipped with a content hashing mechanism insensitive to most lossy compression algorithms.
None of these are capable of verifying novelty of files submitted to the system. A malicious agent can download content, modify it for ill-usage, and submit it back to the blockchain. Even though it is possible to prove that the modified content was inserted after the original (since blockchains are ordered lists), there is still the issue of locating the original content for comparison.
MediaChain was a peer-to-peer database for users to collaborate on open-media. It contained a metadata resolution protocol that was intended to map similar content to the same metadata. The project was halted when bought by Spotify in 2017 and the solver was never fully implemented \cite{perez2017spotify}.
% TODO \\
% - ARCHANGEL: TCH needs overfitting; doesn't address discovery\\
% - Mediachain\\
% - Locality-sensitive hashing\\
% - https://arxiv.org/abs/1901.03136\\
\section{Design Proposal}
An overview of the system is presented through narrative. Details are addressed in later subsections.
\begin{figure}
\centering
\includegraphics[width=0.45\textwidth, keepaspectratio]{assets/pon-diagram-full-icbc2020.png}
\caption{System overview. The set $\ContentSet$ represents the content archive; $\sV$ is a subset of $\ContentSet$ and is created by the blockchain to be validated by the owner of $\content$; $\sF$ is chosen with cryptographic sortition as a random subset of $\sV$ and is used a sample for validation; $\simfunc$ is a similarity function and $\mathcal{H}$ is the rule (most likely a hypothesis test) that accepts or rejects $\content$.}
\label{fig:diagram}
\end{figure}
\subsection{Overview}\label{overview}
Consider a credible archive $\ContentSet$ of similar content, with content-addressable hashes secured on a smart contract enabled blockchain, an associated similarity measure $\simfunc$, and an acceptance/rejection rule $\mathcal{H}$. For example, arXiv articles stored on IPFS with community defined $s$ and $\mathcal{H}$ secured by Ethereum. Note that credibility is a consensus amongst peers and not an intrinsic property of the archive (e.g. scholarly peer-reviewed venues). An owner $\owner^\prime$, wishing to prove novelty of content $\content^\prime \notin \ContentSet$, makes a transaction on the blockchain and receives a random subset of content-hashes of elements $\content \in \sV \subseteq \ContentSet$. The owner now uses $\simfunc$ to calculate the similarity of $\content^\prime$ with the elements of $\sV$ (i.e. $\sS(\content^\prime, \sV)$) and submits the calculations back to the smart contract for evaluation. Using cryptographic sortition, the smart contract chooses a random committee that is tasked with off-chain verification of a subset of the results, $\sS(\content^\prime, \sF \subseteq \sV)$. Consensus is reached by the committee regarding the legitimacy of $\sS(\content^\prime, \sV)$, and $\content^\prime$ is accepted or rejected into the archive based on the rule $\mathcal{H}$.
\subsection{Choosing $\sV$ and $\sF$}
The degree of novelty verified by the system is directly related to the cardinality (i.e. number of elements) of $\sV$, represented by $\vert\sV\vert$. Making $\sV = \ContentSet$ will guarantee that all files in the archive are compared against the candidate content $\content^\prime$, but will make the computation of $\sS(\content^\prime, \sV)$ resource intensive. Similarly, raising $\vert\sF\vert$ decreases the probability that an owner can successfully manipulate the values of $\sS(\content^\prime, \sV)$, but requires more computational power during the verification process by network peers.
Scalability of this solution comes from the fact that $\vert\sF\vert < \vert\sV\vert$, which is possible only if $\sF$ is hidden from the owner $\owner^\prime$. This means that the candidate does not know which values of $\sV$ will be verified by his peers, forcing him to compute $\simfunc(\content,\content^\prime)$ for all values of $\content \in \sV$. The probability of randomly guessing which elements are in $\sF$ can be made as low as required.
\subsection{Consensus of Peers Regarding $\sS(\content^\prime, \sV)$}
Using only $\sS(\content^\prime, \sF \subseteq \sV)$, the committee must infer if $\sS(\content^\prime, \sV)$ is valid. Inference through statistical hypothesis testing \cite{newey1994large} can be leveraged to decide if candidate content is novel.
\subsection{Certificate of Novelty}
Showing that the content-hash of $\content$ is in the distributed archive (i.e. $\content \in \ContentSet$) provides credibility with respect to the novelty of $\content$. Agencies can request that candidates provide PoN on distinct $(\mathcal{H}, \simfunc)$ for different tasks. For example, peer-reviewed journals can use arXiv to build a blockchain where researchers submit their written work for tests against plagiarism before submission.
Notice that the burden of proof of novelty is bestowed upon the owner of the candidate content, and not the the network. This design can also provide varying levels of proof. For example, a content validated against a larger value of $\vert\sF\vert$ will result in a higher confidence of novelty. This can be extended to the point where there are varying or even continuous levels of confidence, where the candidate owner can resubmit the same content to the blockchain when demanded by a third party that trusts the system.
\section{Application to Full-motion Video Archives}
An advantage of PoN is that changing the content type (e.g. full-motion videos, audio tracks, images) only requires us to choose a different similarity function. Each content type has a unique property that usually implies different types of techniques for the same task. For example, although full-motion videos are a sequence of image frames, the similarity between sequential frames allows us to use specific compression algorithms that would not perform as well on a set of random images \cite{le1991mpeg}. Literature in Similarity Learning for full-motion videos as well as Near-Video Duplicate Detection \cite{li2019fast} has evolved considerably with the use of Convolutional and Recursive Neural Network \cite{saadatpanah2019adversarial} and can be explored to find candidate similarity functions.
A single similarity function is not expected to perform well for all types of videos. The novelty certificate can be segmented by topic (e.g. famous speeches, car collisions) and the entity wishing to validate a candidate video can specify which certificate is needed.
Videos with long duration (i.e. tens of minutes) can be broken down with scene detection algorithms, such as in \cite{bui2019archangel}. This framework can be extended to sample frames or scenes from a candidate video, and novelty can be secured stochastically.
\section{Conclusions and Future Work}
In this paper we presented a design proposal for a Smart Contract compatible blockchain that can secure \emph{novelty} of currently existing archives. Our work can scale by leveraging a cryptographic sortition algorithm to form a committee of validators during insertion of new content. It is easily applicable to full-motion videos, audio files, textual documents and any other type of content for which a similarity function can be defined. Through statistical hypothesis testing, guarantees can be made with respect to the degree of novelty (within the scope of the archive).
There are multiple points of improvement that future work can tackle, such as:
\begin{itemize}
\item implementing and running experiments on Smart Contract blockchains such as Ethereum;
\item applying decentralized and collaborative machine learning to progressively update a similarity function \cite{harris2019decentralized};
\item investigating statistical hypothesis tests and guarantees for different content and similarity functions.
\end{itemize}
\section*{Acknowledgment}
We would like to thank Professor Chen Feng for the invitation to participate in the VIRTUAL DESIGN CHALLENGE FOR AUTHENTICATING AND PROTECTING FULL MOTION VIDEOS, hosted by Patriot One Technologies Inc. in collaboration with the Blockchain research cluster at The University of British Columbia.
\bibliography{bibliography.bib}{}
\bibliographystyle{plain}
\end{document}