Network Working Group P. Liljenberg Internet-Draft Commons Machinery Intended status: Experimental December 2015 Expires: June 3, 2016 The blockhash perceptual image URN scheme draft-commonsmachinery-urn-blockhash-00 Abstract This document describes a URN namespace for identifying images by their likeness, encoded as a perceptual hash based on the image content. The naming scheme relies on an algorithm called blockhash. The hash is easy to compute but still has a high probability of positively identifying an image after it has been shared through various Internet services, which typically strip other identifying metadata. Status of This Memo This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at http://datatracker.ietf.org/drafts/current/. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." This Internet-Draft will expire on June 3, 2016. Copyright Notice Copyright (c) 2015 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of Liljenberg Expires June 3, 2016 [Page 1] Internet-Draft Blockhash URN December 2015 the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. 1. Conventions used in this document The key words "MUST", "MUST NOT", "SHOULD", "SHOULD NOT", and "MAY" in this document are to be interpreted as defined in "Key words for use in RFCs to Indicate Requirement Levels" [RFC2119]. 2. Introduction Most URN namespaces are intended to identify a unique resource [RFC1630], e.g. by a unique ID that is assigned by a registry, or from a cryptographic hash derived from the resource itself. The blockhash URN namespace specified in this document instead identifies a unique likeness of a resource, specifically of an image. It is intended to be used in situations where an exact match may not be possible, e.g. when comparing two instances of an image that might have been resized or recoded in different file formats. This is achieved by using a perceptual hash based on the blockhash algorithm [BLOCKHASH]. A small change in an image also results in a correspondingly small change in the blockhash. This is in contrast to a cryptographic hash where even small changes to a file result in unpredictable and often large changes in the hash. This allows not only the exact comparison of two blockhashes, but also makes it possible to determine how similar two images are. The similary is measured as the hamming distance, i.e. the number of bits in the two hashes that are different. Tools using blockhashes can then choose a maximum hamming distance that is relevant for the particular use case and the precision needed. Efficient algorithms exist to query a set of hashes to find the ones within a given distance from a particular hash, e.g. HmSearch [HMSEARCH]. The primary use case for blockhashes is to implement services that can perform a reverse image lookup, i.e. based on the image itself find out information about it or find similar images. This namespace specification is for a formal namespace. The specification adheres to the guidelines given in "Uniform Resource Names (URN) Namespace Definition Mechanisms" [RFC3406]. Liljenberg Expires June 3, 2016 [Page 2] Internet-Draft Blockhash URN December 2015 3. Specification Template 3.1. Namespace ID "blockhash" requested. 3.2. Registration information Registration version number: 1 Registration date: 201?-??-?? 3.3. Declared registration of the namespace Registering organisation: Commons Machinery in Sweden AB Aesgatan 5 646 32 Gnesta Sweden http://commonsmachinery.se/ Designated Contact Person: Jonas Oeberg jonas@commonsmachinery.se 3.4. Declaration of syntactic structure The Namespace Specific Strings (NSS) of all URNs assigned by the schema described in this document will conform to the syntax defined in section 2.2 of [RFC2141]. The formal syntax of the NSS is defined by the following normative ABNF [RFC5234] rules for "blockhash-nss": blockhash-nss = hexhash hexhash = 1*hex hex = %x30-39 / %x41-46 / %x61-66 ; 0-9, A-F, a-f The following are comments and restrictions not captured by the above grammar. Though both upper-case and lowercase characters are allowed in the hexadecimal value, lowercase characters SHOULD be used. The number of bits encoded in "hexhash" MUST be M, where, M = N * N and N is a multiple of 4. It is not expected that hash lengths other than 64, 144 and 256 bits will be useful. Blockhashes SHOULD be 256 bits long unless there is a specific need for a different size. Liljenberg Expires June 3, 2016 [Page 3] Internet-Draft Blockhash URN December 2015 3.5. Relevant ancillary documentation None as yet. 3.6. Identifier uniqueness considerations A blockhash URN identifies the unique likeness of an image, rather than a specific image file. Within the blockhash scheme, all images that produce the same blockhash are considered identicial. 3.7. Identifier persistence considerations The algorithm that calculate a blockhash from an image also establish a persistent identifier. The process can later be repeated on the to reproduce the identifier at any point in the lifetime of the image. 3.8. Process of identifier assignment A blockhash identifier is calculated from a digital image, and only depend on access to the pixel representation of the image. It can be read from a local or remote file, from an in-memory pixmap or canvas, or another suitable source. A blockhash SHOULD be calculated following these steps: 1. Convert the image to RGB format with an optional alpha channel (if not already done). The standard conversion method of the file format SHOULD be used, avoiding any file- or system-specific colour profiles. Any orientation tag in the image metadata SHOULD be applied. 2. Construct a grid of N by N blocks covering the image. N MUST be a multiple of 4. 3. Initialise each block to 0. 4. For each pixel: A. Decide which blocks the pixel overlaps. It can be one, two or four blocks, depending on the pixel coordinate. B. If the image has an alpha channel and the alpha for the pixel is 0 (indicating transparency), set the RGB components to their maximum values, otherwise use them as they are. C. Add the sum of the RGB components from the previous step to each of the blocks the pixel covers, in proportion to the area of the pixel that is in each block. Liljenberg Expires June 3, 2016 [Page 4] Internet-Draft Blockhash URN December 2015 5. Divide the block grid into four horizontal groups, each containing N columns and N/4 rows. For each group, determine the median value of all the blocks contained in the group. 6. Let H be the theoretically maximum cumulative value of all pixels in a block divided by two (half the value space). 7. For each block, starting with the upper left and progressing row by row until the lower right, add a bit to the result hash: A. If the block value is higher than the median value of the horizontal group the block belongs to, add a 1. B. If the block value and median have the same value, and the median is above above H, add a 1. Note that imprecisions from floating point calculations or other implementation details must be taken into consideration when determining if the block value and the median are the same value. C. Otherwise, add a 0. 8. Encode the hash in hexadecimal, interpreting the sequence of bits as eight-bit bytes. 9. Construct a blockhash URN according to the grammar above using the hexadecimal representation of the hash. Implementations MAY use a different process which yield similar results, but SHOULD make users aware if the modified process can introduce differences to the standard algorithm which may result in false negatives. One possible alternative to the algorithm described above is to resize the image to ensure that each block holds an even number of pixels to avoid the calculations in 4a and 4c. 3.9. Process of identifier resolution Not specified. 3.10. Rules for lexical equivalence Lexical equivalence of two blockhash identifiers are determined by first converting both hexadecimal strings to lower-case or upper- case, and then comparing the resulting strings. Alternatively, the hexadecimal strings can be converted to byte or bit sequences, which are then compared directly. Liljenberg Expires June 3, 2016 [Page 5] Internet-Draft Blockhash URN December 2015 3.11. Conformance with URN syntax No special considerations. 3.12. Validation mechanism None specified. 3.13. Scope Global. 4. IANA Considerations This document includes a URN namespace registration that is to be entered into the IANA registry for URN NIDs. 5. Namespace Considerations A new URN namespaces is needed as there are no existing namespaces for representing perceptual hashes as URNs in general, and none specifically for the blockhash algorithm. [I-D.thiemann-hash-urn] proposes a namespace for cryptographic hashes, but it is intentionally limited to only allow the hash algorithms specified in the internet draft. In a similar vein this namespace is limited to blockhash itself, to avoid having to establish a secondary level of namespace assignments for different perceptual hash algorithms. The blockhash algorithm was choosen to allow even relatively restricted environments (e.g. javascript code in a web browser) to produce an image hash, that is still sufficiently precise for identifying images as they are shared across the Internet, even after resizes, recodings and smaller edits. More complex perceptual hashes exist that can track more complex changes, e.g. large crops or rotations, but they also put higher requirements on the implementation. Such larger changes would often mean that the edit is no longer the same image but a derivate work. Blockhash is intended to only match images that can be considered the same work, so not matching derivatives is actually a benefit. Representing the hash as a hexadecimal string rather than as a more compact base32 is choosen to align the format more closely with the underlying bit structure. This allow code to determine a rough hamming distance between two blockhashes just by comparing the Liljenberg Expires June 3, 2016 [Page 6] Internet-Draft Blockhash URN December 2015 hexadecimal strings, rather than first having to convert them into binary sequences. The restriction that the grid width and height must be the same allow some scope to implementations to choose a reasonable hash size while ensuring that the hash size can always be reconstructed from the hash length. Requiring that the grid width and height must be a multiple of 4 ensures that the hash size is a multiple of eight, and thus will fill up a sequence of bytes exactly. The original blockhash algorithm [BLOCKHASH] calculated a single median value for the entire block grid. During testing it was found that this often resulted collisions in images with a lot of contrast between different parts of the image. An example is outdoor images of landscapes with a bright sky and a darker ground. An image-wide median results in the bright areas mostly being represented by 0 bits and the darker with 1 bits, losing a lot of detail that could help discern different images with similar contrast structure. By calculating a median value for each of four horizontal groups and comparing the block values to the median in the corresponding group help retain such contrast and reduce the collision frequency. 6. Community Considerations Internet services today make it very simple to share an image, but most loses the context of the original image in the process. The context is where and when the image was created or a photo taken, by whom, and what the conditions are for properly sharing the image. It is needed both to avoid copyright infringements, and to give credit to the original creator. There already exist several reverse image search services, but they are generally proprietary, closed implementations, and require users to upload the image to the service, potentially violating the user's privacy. Having a standardised way to refer to a perceptual hash of an image will provide new capabilities directly to Internet users to find out the context of images, even when they have passed through several steps of sharing. The blockhash on its own is not sufficient, but it will enable new services that can index large collections of images. Client-side tools can then calculate a blockhash and query these indices to find more information about the image, preserving the user privacy. Liljenberg Expires June 3, 2016 [Page 7] Internet-Draft Blockhash URN December 2015 The scope of this document does not cover processes for finding and querying such indices, but enable such future work to build on blockhash URNs. 7. Security Considerations Blockhashes are generally only useful when they relate to images that are already public and being shared. As such, producing a blockhash of the image and sharing that will not change the security or privacy situation significantly. For all images that are a few times larger than the grid size, it will be practically impossible to reconstruct the original image based on its blockhash. Thus sharing a blockhash of a private image does not infringe on the user's privacy directly. However, if the image is later shared, the previously shared blockhash can be used to infer that the user had access of the image at the earlier time. Images that are not much bigger than the grid size could be reproduced in sufficient detail to be identified from a blockhash. Such images are usually icons and other non-sensitive content, but implementation can also mitigate against this by requiring the user to confirm that such a small image should be hashed. 8. References 8.1. Normative References [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. [RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997. [RFC3406] Daigle, L., van Gulik, D., Iannella, R., and P. Faltstrom, "Uniform Resource Names (URN) Namespace Definition Mechanisms", BCP 66, RFC 3406, October 2002. [RFC5234] Crocker, D. and P. Overell, "Augmented BNF for Syntax Specifications: ABNF", STD 68, RFC 5234, January 2008. 8.2. Informative References [RFC1630] Berners-Lee, T., "Universal Resource Identifiers in WWW: A Unifying Syntax for the Expression of Names and Addresses of Objects on the Network as used in the World-Wide Web", RFC 1630, June 1994. Liljenberg Expires June 3, 2016 [Page 8] Internet-Draft Blockhash URN December 2015 [BLOCKHASH] Yang, B., Gu, F., and X. Niu, "Block Mean Value Based Image Perceptual Hashing", December 2006. [HMSEARCH] Zhang, X., Qin, J., Wang, W., Sun, Y., and J. Lu, "HmSearch: an efficient hamming distance query processing algorithm", July 2013. [I-D.thiemann-hash-urn] Thiemann, P., "A URN Namespace For Identifiers Based on Cryptographic Hashes", draft-thiemann-hash-urn-01 (work in progress), September 2003. Appendix A. Experimental results On a sample of 100,000 random images from Wikimedia Commons, the algorithm generated colliding matches for 1,036 images, meaning that it has a collision rate of ca 1%. The vast majority of such collisions are where between two to four images generate the same hash; only rarely does a hash match more than four images. +--------------------------------+-------------+ | Number of images in collision | Collisions | +--------------------------------+-------------+ | 2 | 247 | | | | | 3 | 51 | | | | | 4 | 19 | | | | | 5 | 8 | | | | | 6 | 9 | | | | | 7 | 3 | | | | | 8 or more | 16 | +--------------------------------+-------------+ Table 1: Blockhash collisions rate Taking 4,000 unique sample images (not counting collisions) and doing a crosswise comparison of them, calculating the hamming distance between one hash and every other hash, we get the following distribution of hamming distances (up to 10 bits): Liljenberg Expires June 3, 2016 [Page 9] Internet-Draft Blockhash URN December 2015 +-------------------+----------------+--------------------+ | Hamming distance | Cross matches | Cross matches (%) | +-------------------+----------------+--------------------+ | 1 | 0 | 0% | | | | | | 2 | 2 | 0,05% | | | | | | 3 | 2 | 0,05% | | | | | | 4 | 7 | 0,175% | | | | | | 5 | 7 | 0,175% | | | | | | 6 | 25 | 0,625% | | | | | | 7 | 30 | 0,75% | | | | | | 8 | 47 | 1,175% | | | | | | 9 | 55 | 1,375% | | | | | | 10 | 73 | 1,825% | +-------------------+----------------+--------------------+ Table 2: False positive match rate Author's Address Peter Liljenberg Commons Machinery Aesgatan 5 Gnesta 64632 SE Email: dev@commonsmachinery.se URI: http://commonsmachinery.se/ Liljenberg Expires June 3, 2016 [Page 10]