{"title": "Quantum Embedding of Knowledge for Reasoning", "book": "Advances in Neural Information Processing Systems", "page_first": 5594, "page_last": 5604, "abstract": "Statistical Relational Learning (SRL) methods are the most widely used techniques to generate distributional representations of the symbolic Knowledge Bases (KBs). These methods embed any given KB into a vector space by exploiting statistical similarities among its entities and predicates but without any guarantee of preserving the underlying logical structure of the KB. This, in turn, results in poor performance of logical reasoning tasks that are solved using such distributional representations. We present a novel approach called Embed2Reason (E2R) that embeds a symbolic KB into a vector space in a logical structure preserving manner. This approach is inspired by the theory of Quantum Logic. Such an embedding allows answering membership based complex logical reasoning queries with impressive accuracy improvements over popular SRL baselines.", "full_text": "Quantum Embedding of Knowledge for Reasoning\n\nDinesh Garg1\u2217, Shajith Ikbal1\u2217, Santosh K. Srivastava1, Harit Vishwakarma2\u2020,\n\nHima Karanam1, L Venkata Subramaniam1\n\n1IBM Research AI, India\n\n2Dept. of Computer Sciences, University of Wisconsin-Madison, USA\n\ngarg.dinesh, shajmoha, sasriva5@in.ibm.com, hvishwakarma@cs.wisc.edu, hkaranam, lvsubram@in.ibm.com\n\nAbstract\n\nStatistical Relational Learning (SRL) methods are the most widely used techniques\nto generate distributional representations of the symbolic Knowledge Bases (KBs).\nThese methods embed any given KB into a vector space by exploiting statistical\nsimilarities among its entities and predicates but without any guarantee of pre-\nserving the underlying logical structure of the KB. This, in turn, results in poor\nperformance of logical reasoning tasks that are solved using such distributional\nrepresentations. 
We present a novel approach called Embed2Reason (E2R) that\nembeds a symbolic KB into a vector space in a logical structure preserving\nmanner. This approach is inspired by the theory of Quantum Logic. Such an\nembedding allows answering membership based complex logical reasoning queries\nwith impressive accuracy improvements over popular SRL baselines.\n\n1 Introduction\n\nWe consider the problem of embedding a given symbolic Knowledge Base (KB) into a vector space that preserves its logical structure. Such embeddings are popularly known as distributional representations (e.g., Word2Vec [1] and GLOVE [2]) and are meant to be leveraged by several non-symbolic (e.g. neural and vector) methods to accomplish various tasks (e.g. KB completion/link prediction and logical reasoning under a noisy/incomplete KB) on which symbolic methods struggle.\n\nFigure 1 (best viewed in color) depicts a toy example of the kind of symbolic KBs that we wish to embed into a vector space. The red-colored oval-shaped nodes of the left tree denote a hierarchy (aka ontology) of the unary predicates (aka concepts), and the blue-colored rectangle-shaped nodes denote the memberships of entities to various concepts. Similarly, the right tree denotes an ontology of binary predicates (aka relations) with ordered pairs of entities as their members. In the right tree, having (Bob, Alice) as a child node of the relation Father_of means Bob is Father_of Alice. In such ontologies, a predicate node logically implies (aka entails) any of its ancestral predicate nodes. For example, cardiologist \u21d2 physician \u21d2 doctor, and similarly, Mother_of \u21d2 Parent_of \u21d2 Blood_relation. In the framework of mathematical logic, such a logical structure can be expressed via a subset of Description Logic (DL) statements. 
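The entail-your-ancestors behavior of such ontologies can be sketched in a few lines of Python. The dictionary below is an illustrative encoding of the Figure 1 toy hierarchy (our own names, not an artifact of the paper):

```python
# Minimal sketch of the Figure 1 toy ontology: a predicate entails
# every predicate reachable by following parent links in the T-Box.
# The encoding and function names are illustrative only.

PARENT = {  # child -> parent in the predicate hierarchies
    "cardiologist": "physician",
    "pulmonologist": "physician",
    "physician": "doctor",
    "surgeon": "doctor",
    "dentist": "doctor",
    "doctor": "everything",
    "Mother_of": "Parent_of",
    "Father_of": "Parent_of",
    "Parent_of": "blood_relation",
}

def entails(p, q):
    """True iff predicate p logically implies predicate q,
    i.e. p == q or q is an ancestor of p in the hierarchy."""
    while p is not None:
        if p == q:
            return True
        p = PARENT.get(p)  # None once we pass the root
    return False

assert entails("cardiologist", "doctor")        # cardiologist => physician => doctor
assert not entails("doctor", "cardiologist")    # entailment is not symmetric
```

Such a symbolic check is exact but brittle under noise; the point of the paper is to recover this behavior inside a vector space.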
In the DL parlance, the logical structure among predicates (red oval nodes) is commonly known as the T-box, whereas the logical structure connecting entities (or entity tuples) to the predicates is commonly known as the A-box [3].\n\nAs far as vector embeddings of symbolic KBs are concerned, the last couple of years have witnessed a surge of newer methods whose end goal is KB completion or logical inferencing to answer reasoning queries [4, 5, 6, 7, 8, 9, 10, 11, 12]. While the details differ, the high-level idea behind all these methods is as follows \u2013 ingest a given symbolic KB and embed it into a vector space by using a neural network, followed by conversion of logical queries into score-based algebraic operations over the vector space.\n\n\u2217The first two authors contributed equally.\n\u2020This work was done when the author was with IBM Research, New Delhi, India.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFigure 1: A toy example of knowledge base. The left (right) figure depicts the unary (binary) predicates ontology.\n\nUnlike pure symbolic logic frameworks [13, 14, 3], these neural methods are robust against input noise but, at the same time, they suffer from a major drawback, namely no guarantee that the embeddings maintain the sanctity of the logical structure present in the input A-Box and T-Box. Therefore, unlike pure symbolic reasoners, a machine that uses such embeddings for complex reasoning tasks struggles with accuracy. The embeddings produced by some of these neural methods [4, 5, 6, 7, 8] capture, at best, the logical structure of just the A-box but not the T-box. 
Some other methods [9, 10, 11], which do provide for maintaining the T-box structure, suffer from the fact that the resulting logical structure in the embedding is not sound by design (unlike ours, which is based on Quantum Logic), and hence they often fail to achieve high accuracy on complex reasoning tasks.\n\nIn this paper, we propose a novel approach, called Embed2Reason (E2R), that can embed a given symbolic KB (such as the one shown in Figure 1) into a finite dimensional vector space. E2R is inspired by the theory of Quantum Logic (QL), which guides us in constraining the embedding in such a way that the set of all logical propositions in the input A-Box and T-Box becomes isomorphic to a lattice structure over a set of subspaces of the vector space. Such an embedding satisfies the axioms of Quantum Logic [15, 16] and allows one to perform logical operations (e.g. negation, conjunction, disjunction, and implication) directly over the vectors in a manner similar to Boolean Logic, except that the distributive law does not hold true [17]. We call such an embedding a Quantum Embedding. We formulate an unconstrained optimization program that captures all such quantum logical (as well as regularity) constraints. We solve this program via the Stochastic Gradient Descent (SGD) technique, and the resulting embedding maintains the input logical structure with good accuracy. Next, we show that these quantum embeddings can solve complex deductive as well as predictive reasoning tasks (formally defined later) with much superior performance compared to other kinds of embeddings. Specifically, quantum embeddings show better performance on both link prediction and reasoning tasks. In our experiments, we found that on the FB15K dataset, quantum embeddings obtained via the proposed E2R method exhibit a 57% improvement on MRR and a 95% improvement on HITS@1. 
Further, for the LUBM1U dataset, we found E2R exhibiting a 76% improvement on MRR and a 139% improvement on HITS@1 relative to the closest competitor.\n\n2 Preliminaries\n\nDescription Logic (DL) Syntax  Description Logics (DLs) are a family of logics that are fragments of First Order Logic (FOL) [3, 18]. In this paper, we confine ourselves to the simplest form of DL, namely Attributive Language with Complements (ALC). A triple (NO, NC, NR) is called the signature of the DL where NO = {O1, O2, . . . , O|NO|} denotes a finite set of entities (aka objects or elements) \u2013 blue rectangular nodes in the left tree of Figure 1. NC = {C1, C2, . . . , C|NC|} denotes a finite set of unary predicates (aka (atomic) concepts, classes, or types) \u2013 red oval nodes in the left tree of Figure 1. NR = {R1, R2, . . . , R|NR|} denotes a finite set of binary predicates (aka (atomic) relations or roles) \u2013 red oval nodes in the right tree of Figure 1. Complex logical statements in DL are formed by applying one or more of the following constructors to one or more of the atomic concepts: negation (\u00ac) of a concept, equality (=), intersection (\u2293), union (\u2294), and logical inclusion (\u2291) between two concepts. Also, there are two special concepts called \u22a4 (every concept) and \u22a5 (empty concept). 
Finally, the operators universal restriction (\u2200R.C) and existential restriction (\u2203R.C) correspond to the concepts whose members are given by the sets {O \u2208 NO | \u2200O\u2032 having R(O, O\u2032) we must have O\u2032 \u2208 C} and {O \u2208 NO | \u2203O\u2032 having R(O, O\u2032) \u2227 (O\u2032 \u2208 C)}, respectively.\n\nFormal Concept Analysis (FCA) and Distributional Representation  FCA [19] is a branch of lattice theory [20] focusing on a special type of lattices that are obtained from a set of entities, a set of attributes, and a binary relation called a context that determines whether an entity has an attribute. Although not directly related, FCA has many similarities with DL and can be exploited to get a naive and partial solution to the central question of this paper. FCA theory allows embedding the entities of any DL into a binary space, say {0, 1}d. Such an embedding can also be used as a device to perform Boolean logical operations on unary predicates. The supplementary material provides further details.\n\n3 Quantum Embeddings\n\nAlthough FCA based embedding partially preserves the logical structure of DL, it has several limitations - (i) FCA makes the closed world assumption \u2013 all the missing membership assertions are treated as non-members in FCA. In the real world, this assumption is rarely true. In fact, we don\u2019t even know how many other atomic concepts/relations are missing in the given KB. 
(ii) In many real world setups, an entity may have a soft/fuzzy membership in certain concepts - for example, an apple may have both red and green color, each to a varying extent. (iii) Lastly, FCA embeds only unary predicates, but we want to embed both unary and binary predicates. These limitations of FCA based embeddings motivated us to explore the use of Quantum Logic.3\n\nOriginally, QL [22, 17] was proposed by [15] to model the quantum behavior of subatomic particles. We, however, leverage it here to embed the A-box and T-box of a DL while preserving the logical structure. In QL, every predicate (unary or binary, atomic or compound) is represented by a linear subspace of a complex vector space \u03a3, where \u03a3 = Cd for some integer d.4 All the entities and entity pairs are denoted by (complex) vectors, and they lie in the subspaces of the predicates to which they belong (see Figure 2 for a graphical illustration of the idea). The axes of such an embedding space represent latent semantic attributes of the entities and entity pairs. The set of all linear subspaces of \u03a3, known as the Grassmannian, can be put in correspondence with the set of all possible predicates - infinitely many of them are not even supplied as part of a typical input KB. The subspaces of \u03a3 naturally have a partial order relation induced over them by the set theoretic inclusion operation. That is, for any two subspaces Si, Sj \u2286 \u03a3, we say Si \u2291 Sj iff Si \u2286 Sj, or equivalently, Si is also a subspace of Sj. Because the origin is a zero dimensional subspace common to every subspace, by letting \u22a5 = {0} and \u22a4 = \u03a3, the resulting partial order over the subspaces becomes a bounded lattice of infinite size. The quantum logical operations on this lattice are defined by its inventors [15] as follows.\n\u2022 Si \u2227 Sj = Si \u2293 Sj := Si \u2229 Sj. 
That is, logical AND or intersection between two subspaces is equal to the largest common subspace.\n\u2022 Si \u2228 Sj = Si \u2294 Sj := Si + Sj. Here the + sign means the vector sum and not the set theoretic union. That is, logical OR or union between two subspaces is the smallest subspace encompassing both these subspaces. This is where QL differs from Boolean Logic. This axiom also results in the non-distributive nature of QL. That is, the distributive law Si \u2227 (Sj \u2228 Sk) = (Si \u2227 Sj) \u2228 (Si \u2227 Sk) no longer holds true in QL. However, we have proven (refer to the supplementary material) that by restricting each predicate subspace to be parallel to the axes, we can avoid this limitation of QL.\n\u2022 \u00acS := S\u22a5, where S\u22a5 is known as the orthogonal complement of the subspace S. Every vector in S\u22a5 is perpendicular to every vector in S, and also S + S\u22a5 = \u03a3. This means the logical NOT or negation of any subspace is given by the set of all those vectors that are normal to this subspace.\n\nFigure 2 depicts an illustrative example of Quantum Logic based embedding of the concept ontology into, say, a \u03a3 = C3 space. The embedding of the relation hierarchy would be similar, except that an ordered entity pair such as (Bob, Alice) would be mapped to the vector (V Bob + \u03b9V Alice), where V Bob + \u03b90 and V Alice + \u03b90 are the embedding vectors for the entities Bob and Alice, respectively. In other words, the imaginary component of an entity would be zero, but for a pair of entities it would be non-zero. The relations would be mapped to different linear subspaces. 
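Under the axis-parallel restriction mentioned above, each subspace reduces to a set of axes, and the QL lattice operations become elementwise operations on 0/1 indicator vectors. A minimal sketch of this (our own encoding, with illustrative names; not code from the paper):

```python
import numpy as np

# Axis-parallel subspaces encoded as 0/1 indicator vectors over the axes:
# meet = common axes, join = span (union of axes),
# negation = orthogonal complement (bit flip), inclusion = subset of axes.

def meet(s, t):            # Si AND Sj := Si intersect Sj
    return np.minimum(s, t)

def join(s, t):            # Si OR Sj := Si + Sj (vector sum of subspaces)
    return np.maximum(s, t)

def neg(s):                # NOT S := orthogonal complement of S
    return 1 - s

def included(s, t):        # Si is a subspace of Sj iff its axes are a subset
    return bool(np.all(s <= t))

a = np.array([1, 1, 0, 0])
b = np.array([0, 1, 1, 0])
c = np.array([0, 0, 1, 1])

# For axis-parallel subspaces the distributive law holds again:
lhs = meet(a, join(b, c))
rhs = join(meet(a, b), meet(a, c))
assert np.array_equal(lhs, rhs)
assert included(meet(a, b), a)
assert np.array_equal(neg(neg(a)), a)
```

This is only a combinatorial shadow of QL; for general (oblique) subspaces the meet/join operate on actual linear subspaces and distributivity genuinely fails.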
In the next section, we present the E2R learning scheme, which automatically maps these predicates and (pairs of) entities to subspaces and vectors, respectively, in the space \u03a3 = Cd for some integer d.\n\n3A similar motivation was advocated by [21] in the context of designing a document retrieval system.\n4In general, QL allows the embedding space to be any finite or infinite dimensional Hilbert space.\n\n\fFigure 2: An illustrative example of Quantum embedding\n\n4 Embed2Reason\n\nE2R writes down a quantum logical constraint for every logical structure present in the A-Box/T-Box so as to ensure its preservation during embedding. The overall problem thus becomes a constraint satisfaction problem, which we convert into an unconstrained optimization problem via loss functions. In E2R, we assume the underlying field of the embedding space Cd is R. Under this assumption - (i) The vector space \u03a3 = Cd becomes isomorphic to the Euclidean space R2d. That means there exists a bijection map T : Cd \u21a6 R2d which preserves vector addition and scalar multiplication [23]. (ii) The subspace corresponding to any single axis of Cd can be viewed as the span of two subspaces \u2013 a real part and an imaginary part. This, in turn, implies that the set of vectors {xR + \u03b90} forms a valid subspace of Cd. (iii) Any axis-parallel subspace of Cd can be mapped to a unique axis-parallel subspace of R2d and vice versa. Due to these properties, we can simulate Cd via R2d on a digital computer. Specifically, any axis parallel subspace S of Cd can be simulated in the space R2d by means of defining a 2d-dimensional 0/1 indicator vector y having the following property \u2013 the coordinates y[i] and y[i + d] would be 1 (or 0) depending on whether the real and imaginary part, respectively, of the ith axis (1 \u2264 i \u2264 d) of Cd is part (or not part) of the subspace S. To this end, we make a
To this end, we make a\nsimplifying assumption that E2R always outputs an embedding where subspace corresponding to\nany predicate is parallel to the axes of the space Cd. We call such an embedding as axis-parallel\nembedding. The axis-parallel embedding assumption alleviates the limitation of the QL regarding\ndistributive law (refer supplementary material). Lastly, because the underlying \ufb01eld is R, we use the\nfollowing valid inner product formula [23]: (cid:104)x, y(cid:105) = [(y\u2217(cid:62)x)\u2217(y\u2217(cid:62)x)]1/2 for x, y \u2208 Cd, where \u2217\ndenotes the complex conjugate operation. This inner product is essentially the length of the complex\nnumber obtained by taking the standard inner product between two complex vectors [23]. With this\nsetup, we now describe our E2R technique.\nEntities, Pair of Entities, and Predicates E2R maps individual entities Oi, Oj \u2208 NO to the unit\nlength vectors (under (cid:96)2 norm) of the form xi = ( xiR\n0 ), respectively, and the\nordered pair (Oi, Oj) to the vector xij = ( xiR\nxjR ). This can automatically be ensured by provisioning\nfollowing loss terms, for each entity Oi, in the optimization program of E2R.\n\n0 ) and xj = ( xjR\n\n(1)\nwhere, 0d and 1d are d-dimensional vectors of all 0s and 1s, and (cid:12) denotes element wise multiplica-\ntion. Observe that having LOi = 0 is both necessary and suf\ufb01cient condition for the xi to have the\nform ( xiR\nRecall, QL theory requires entities (and pair of entities) to lie in the subspaces of concepts (and\nrelations) to which they belong. Therefore, above representation conventions would immediately\n\n0 ). 
Therefore, including LOi as a loss term will push its value to go towards zero.\n\n1d\n\nLOi =(cid:13)(cid:13)xi (cid:12)(cid:0) 0d\n\n(cid:1)(cid:13)(cid:13)2\n\n4\n\naxis 1axis 2axis 3physician pulmonologistdentistcardiologist\ud835\udc49\"#$+\ud835\udf040\ud835\udc49()*+,+\ud835\udc560\ud835\udc49./0+\ud835\udf040\ud835\udc49(),1+\ud835\udf040everythingdoctorphysiciansurgeondentistpulmonologistcardiologistAliceBobSamAlex\fentail that the subspace corresponding to any unary predicate, say Ci, must be given by an indicator\nvector yi having property that yi[t] = 0 \u2200(d + 1) \u2264 t \u2264 2d. Similarly, the subspace corresponding\nto any binary predicate, say Ri, must be given by the indicator vector zi having property that\n\u2203 (1 \u2264 t1 \u2264 d) and (d + 1 \u2264 t2 \u2264 2d) such that zi[t1] = zi[t2] = 1. These constraints can be\nimplemented via following loss terms.\n\n(cid:0) 0d\n\n(cid:1) \u2212 1(cid:1)(cid:9)(cid:1)2\n\n+(cid:0)min(cid:8)0,(cid:0)z(cid:62)\n\n(cid:0) 1d\n\n(cid:1) \u2212 1(cid:1)(cid:9)(cid:1)2\n\nLCi =(cid:13)(cid:13)yi (cid:12)(cid:0) 0d\n\n(cid:1)(cid:13)(cid:13)2\n\n; LRi =(cid:0)min(cid:8)0,(cid:0)z(cid:62)\n\n1d\n\n(2)\nIt is noteworthy that above constraints would not make sense unless we ensure indicator vectors\nyi and zi are binary vectors. Such combinatorial constraints can be approximately enforced by\nprovisioning the following two losses.\n\n1d\n\n0d\n\ni\n\ni\n\nLyi = (cid:107)yi (cid:12) yi(cid:107)2 ; Lzi = (cid:107)zi (cid:12) zi(cid:107)2\n\n(3)\nwhere, yi (or zi) is the bit \ufb02ipped version of yi (or zi) given by yi = 12d \u2212 yi (or zi = 12d \u2212 zi).\nThe dimensions of the subspaces for Ci and Ri, and the set of corresponding basis vectors (given by\nindicator vectors yi and zi) are learnable parameters and are learned automatically.\nMembership For any given Oi \u2208 Cj, or Ri(Op, Oq), E2R aspires to ensure that vectors xi or xpq\nlie in the subspaces corresponding to Ci and Ri, respectively. 
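Assuming the R2d simulation described above (first d coordinates real, last d imaginary), structural losses of the kind in (1)-(3) translate directly into NumPy. The function and variable names below are illustrative, and this is a sketch under our reading of the equations, not the paper's implementation:

```python
import numpy as np

d = 4  # dimension of C^d, simulated as vectors in R^{2d}

IMAG_MASK = np.concatenate([np.zeros(d), np.ones(d)])  # selects the imaginary half

def loss_entity(x):
    # Eq.-(1)-style loss: penalize any mass on the imaginary half,
    # so an entity vector is pushed toward the form (x_R; 0).
    return float(np.sum((x * IMAG_MASK) ** 2))

def loss_concept_indicator(y):
    # Eq.-(2)-style loss (unary case): concept subspaces use only real axes.
    return float(np.sum((y * IMAG_MASK) ** 2))

def loss_binary(v):
    # Eq.-(3)-style loss: push an indicator vector toward {0,1} coordinates,
    # since v * (1 - v) vanishes exactly on binary vectors.
    return float(np.sum((v * (1 - v)) ** 2))

x = np.concatenate([np.random.randn(d), np.zeros(d)])  # purely "real" entity
assert loss_entity(x) == 0.0
assert loss_binary(np.array([0., 1., 1., 0., 0., 0., 0., 0.])) == 0.0
```

In training these terms would be summed over all entities and indicator vectors and minimized jointly with the membership losses below.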
We felt a natural way to model such constraints would be via defining the residual length of the projection as a loss metric. The same idea is also endorsed by Quantum Logic, where the projection length is indeed used as the probability of such a membership assertion. This can be ensured by means of the following loss terms, which essentially project the vectors xi and xpq onto the subspaces yj and zi in an orthogonal manner and enforce zero residual components for these projections. These loss terms also take care of the unit lengths of the entity vectors. Below, (A, B; C, D) denotes a 2d \u00d7 2d block matrix with blocks A, B in its top row and C, D in its bottom row, where 0d (as a matrix block) is the d \u00d7 d all-zeros matrix and Id the d \u00d7 d identity:\n\nL(Oi\u2208Cj) = \u2016(12d \u2212 yj) \u2299 xi\u2016\u00b2 + \u2016(0d; 1d) \u2299 yj \u2299 xi\u2016\u00b2 + (1 \u2212 xi\u22a4xi)\u00b2   (4)\n\nLRi(Op,Oq) = \u2016(12d \u2212 zi) \u2299 xpq\u2016\u00b2 + \u2016(0d; 1d) \u2299 zi \u2299 xpq\u2016\u00b2 + \u2016(1d; 0d) \u2299 xpq \u2212 xp\u2016\u00b2 + \u2016(1d; 0d) \u2299 ((0d, Id; Id, 0d) xpq) \u2212 xq\u2016\u00b2\n\nThe last term in the expression of L(Oi\u2208Cj) tries to enforce the unit length constraint for each entity vector. The last two terms in the expression of LRi(Op,Oq) try to enforce the structure of xpq to be equal to (xpR; xqR) (as defined earlier). The second term in each of the above losses represents the squared length of the complex part of the projected vector. The second term would be zero for L(Oi\u2208Cj) due to the assumptions about the structures of xi and yj.\n\nLogical Inclusion  As per the axioms of QL, for any given Ci \u2291 Cj (or Ri \u2291 Rj), the subspace corresponding to Ci (or Ri) must be a subset of the subspace corresponding to Cj (or Rj). 
Under the axis-parallel assumption, such a constraint can be implemented via the following loss terms:\n\nLCi\u2291Cj = \u2016yi \u2299 (12d \u2212 yj)\u2016\u00b2 ; LRi\u2291Rj = \u2016zi \u2299 (12d \u2212 zj)\u2016\u00b2   (5)\n\nwhere yi, yj, zi, zj are the indicator vectors for the subspaces corresponding to Ci, Cj, Ri, Rj, respectively. The rationale behind the above loss terms is as follows. If Ci \u2291 Cj then a necessary condition would be to have a 1 in the vector yj at all those positions where there is a 1 in the vector yi, which means we must have yi\u22a4(12d \u2212 yj) = 0. Similarly, we can argue that yi\u22a4(12d \u2212 yj) = 0 is also a sufficient condition for Ci \u2291 Cj. The argument for Ri \u2291 Rj is similar.\n\nLogical Conjunction (Disjunction)  For any given logical conjunction (disjunction) of the form Ci = Cj \u2293 Ck (Ci = Cj \u2294 Ck), and similarly for Ri, Rj, Rk, we can provision the following loss terms, whose value being zero provides necessary and sufficient conditions for satisfying the QL axioms. Here, max(\u00b7) denotes the component-wise max operation. 
Extension to multiple concepts (or predicates), such as Ci = Cj \u2293 Ck \u2293 C\u2113 (Ci = Cj \u2294 Ck \u2294 C\u2113), is straightforward:\n\nL(Ci=Cj\u2293Ck) = \u2016yi \u2212 (yj \u2299 yk)\u2016\u00b2 ; L(Ri=Rj\u2293Rk) = \u2016zi \u2212 (zj \u2299 zk)\u2016\u00b2   (6)\n\nL(Ci=Cj\u2294Ck) = \u2016yi \u2212 max(yj, yk)\u2016\u00b2 ; L(Ri=Rj\u2294Rk) = \u2016zi \u2212 max(zj, zk)\u2016\u00b2   (7)\n\nLogical Negation  Necessary and sufficient conditions are\n\nL(Ci=\u00acCj) = (yi\u22a4yj)\u00b2 + ((12d \u2212 yi)\u22a4(12d \u2212 yj))\u00b2 ; L(Ri=\u00acRj) = (zi\u22a4zj)\u00b2 + ((12d \u2212 zi)\u22a4(12d \u2212 zj))\u00b2   (8)\n\nUniversal Type Restriction  For each given universal type restriction \u2200Ri \u00b7 Cj, we need to ensure that for any Oi, Oj \u2208 NO having the relation Ri(Oi, Oj), we must have Oj \u2208 Cj. A necessary and sufficient condition for this constraint would be as follows - whenever we have Oj \u2208 Ck, where Ck is a non-descendant of Cj in the concept ontology, then we must not have Ri(Oi, Oj). The loss term for this can be given as follows. 
For all Ck \u2208 DCj, where DCj denotes the set of concepts that are non-descendants of Cj, we can have the following loss term:\n\nL(\u2200Ri\u00b7Cj)(yk) = (yk\u22a4(0d, Id; 0d, 0d) zi)\u00b2   (9)\n\nHere, (0d, Id; 0d, 0d) denotes the 2d \u00d7 2d block matrix whose top row of blocks is (0d, Id) and bottom row is (0d, 0d), with 0d the d \u00d7 d all-zeros matrix and Id the d \u00d7 d identity. The set DCj can be shrunk considerably by considering only the following concepts in the concept hierarchy - (i) any child Ck of the root concept \u22a4 that is a non-ancestor of Cj (children of Ck would automatically be taken care of due to the quantum logical constraints related to inclusion), (ii) any concept Ck that is either a sibling of Cj or a child of an ancestor but is not itself an ancestor.\n\nBy putting all these losses together, we get the following unconstrained non-convex optimization problem as the E2R learning problem, and we call it the E2R model. Here, LE2R is the sum of all the loss terms (each averaged over its corresponding logical assertions) defined from Equation (1) through (9):\n\nminimize over xi, yi, zi :  LE2R   (10)\n\nChoosing the Embedding Dimension d  Under the axis-parallel assumption, we have a maximum of 2^(2d) different axis-parallel subspaces when we embed into Cd (the field being R). Therefore, if we have |NC|, |NR| many unique concepts and relations, then we must have d > log2(\u221a(|NC| + |NR|)). This is just a trivial lower bound; in practice, we require a much higher d to accommodate all the constraints (especially the regularity constraints given below). If we allowed oblique subspaces, we could embed into a much smaller dimensional space. However, we still prefer axis-parallel embeddings because the distributive law, which in general does not hold true for quantum embeddings, surprisingly holds true under axis-parallel quantum embeddings (proof given in the supplementary material).\n\nThe Problem of Subspace Collapse  The E2R formulation (10) is susceptible to one serious problem, namely subspace collapse. Imagine there are no logical negation and universal type restriction statements in the input KB. 
In such a case, the loss terms (8) and (9) would be missing, and there would be a degenerate solution, namely all the subspaces (i.e. yi and zi) being the same, because that would drive all the loss terms related to A-Box and T-Box assertions, namely (4) through (7), to zero. We can avoid this by including two kinds of regularity constraints - (i) for any two predicates (unary/binary) that are siblings of each other, we should encourage their subspaces to be as orthogonal as possible; (ii) for any two predicates (unary/binary) that have a (parent, child) relationship in the T-Box hierarchy, we should encourage a certain minimum gap between the dimensions of their subspaces. Both these constraints can be achieved via the following losses:\n\n\u03a9orth = (1/Nsib) \u03a3(Ci sib Cj) (yi\u22a4yj)\u00b2 + (1/Msib) \u03a3(Ri sib Rj) (zi\u22a4zj)\u00b2   (11)\n\n\u03a9sep = (1/N\u2291) \u03a3(Ci\u2291Cj) (\u221ad \u2212 (yj \u2212 yi)\u22a41)\u00b2 + (1/M\u2291) \u03a3(Ri\u2291Rj) (\u221ad \u2212 (zj \u2212 zi)\u22a41)\u00b2   (12)\n\nHere, \u03a3(\u00b7) denotes a sum over the indicated pairs. The pairs (Ci sib Cj) and (Ri sib Rj) denote pairs of sibling concepts/relations. Nsib and Msib denote the counts of such sibling pairs, while N\u2291 and M\u2291 denote the counts of the respective logical inclusion assertions. The choice of \u221ad is flexible and can be replaced with an appropriate positive constant \u2265 1. This choice allows us to stuff roughly \u221ad levels of the inclusion hierarchy, which usually suffices.\n\nInstead of using the regularization losses (11) and (12), one can achieve a similar effect by having an alternative loss term which is simpler and effective in practice (we witnessed this in our experiments). This alternative loss term works as follows \u2013 take each valid membership assertion in the given A-box. 
Replace one of its entities with an invalid (aka negatively sampled) entity. Take the negation of the corresponding membership loss of this newly constructed fake assertion (aka negative sample). In fact, this way of regularizing helps us achieve good convergence during model training. Note, we use only a tiny fraction of invalid entities, just to avoid the degenerate solution. This should not be confused with the closed-world assumption, because there one would use a huge number of negative assertions.\n\nReasoning Tasks  While the applicability of quantum embeddings is quite broad, in what follows we highlight a few reasoning tasks that we think are important and apt for quantum embeddings. Each of these tasks could be deductive or predictive in nature, and our approach need not even know which.\n\n1. Membership [Find all the entities belonging to the concept C, where C could be a complex logical formula.] For this, we convert the formula C into an appropriate subspace S, followed by finding all those entities whose length of orthogonal projection onto S is more than some threshold. The projection length is used to assign the confidence score/probability of the answer being correct.\n2. Property Testing [Does entity Oi belong to the concept C, where C could be a complex logical formula?] The previous trick can be applied here as well, and the answer would be probabilistic, which can be made deterministic by setting an appropriate threshold.\n3. Property Listing [Given entity Oi, list all the atomic concepts to which it belongs.] Find the set, say Bi, of all those standard basis vectors which are non-orthogonal to the embedding vector xi of Oi. Find all the atomic concepts Cj whose subspaces lie within the span of Bi. 
Each such Cj would be an answer, with confidence/probability proportional to the projection length.\n\n5 Experiments\n\nDatasets and Setup  We evaluated the performance of E2R on two different kinds of tasks \u2013 (i) the link prediction task, and (ii) the reasoning task. For link prediction, we chose the FB15K and WN18 datasets because they are standard in the literature [5, 6, 24]. FB15K is a subset of the Freebase dataset, whereas WN18 is a subset of the WordNet dataset featuring lexical relations between words. The train and test sets of these datasets are respectively used for training and testing our proposed model.\n\nTo evaluate the reasoning capabilities, we chose the LUBM (Lehigh University Benchmark) dataset (http://swat.cse.lehigh.edu/projects/lubm/), which consists of a university domain ontology and customizable, repeatable synthetic data. We generated data for one university (LUBM1U) for the evaluation. We used all the 69628 triples produced by the LUBM generator, along with the ontology in triple form (unary and binary predicates), as our training set. It is important to mention that the majority of the A-Box assertions in the LUBM dataset concern the bottom-most predicates (concepts and relations) in the hierarchy. Starting with these assertions, we custom designed eight new test queries that comprise higher level predicates in the LUBM hierarchy, so that answering such queries would explicitly require deductive reasoning. These test queries involved deducing the members of the following unary predicates \u2013 Professor, Faculty, Person, Student, Course, Organization, and the following binary predicates \u2013 MemberOf, WorksFor.\n\nWe implemented the E2R model using PyTorch. We used SGD (Stochastic Gradient Descent) with the ADAM optimizer [25] to solve the proposed E2R formulation (10) together with the regularization terms described in Section 4. In all our experiments we used d = 100 for the E2R model. 
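The negative-sampling regularization described in Section 4 (with several negatives per positive, as used in our setup) can be sketched as follows. The sampler and all names are hypothetical, not the paper's code:

```python
import random

# Sketch: for each valid membership assertion (entity, concept) in the A-Box,
# draw k corrupt entities that are not members of the concept; these fake
# assertions are then penalized with the negated membership loss.

def corrupt(assertion, all_entities, members, k=3, rng=random):
    """Return k negatively sampled assertions for one positive assertion."""
    entity, concept = assertion
    negatives = []
    while len(negatives) < k:
        cand = rng.choice(all_entities)
        if cand not in members[concept]:      # keep only true non-members
            negatives.append((cand, concept))
    return negatives

members = {"doctor": {"Alice", "Bob"}}        # toy A-Box memberships
entities = ["Alice", "Bob", "Sam", "Alex"]

negs = corrupt(("Alice", "doctor"), entities, members, k=3)
assert len(negs) == 3
assert all(e not in members["doctor"] for e, _ in negs)
```

Only a tiny number of such negatives per positive is needed to break the degenerate all-subspaces-equal solution, which is what distinguishes this from a closed-world assumption.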
E2R did not show significant sensitivity to the value of d in our experimental range of 100 to 300. As specified in Section 4, the use of negative membership candidates during training helped in achieving good convergence. Using more than one negative candidate per positive membership candidate achieved better convergence. We used 3 different negative entities per positive entity in our experimental setup. For the baseline approaches, we used the optimal parameter setup obtained through grid search. For example, we used embedding dimension d = 100. Note, the datasets FB15K and WN18 do not contain a T-Box. Therefore, we discarded the loss terms corresponding to T-Box assertions and retained the loss terms corresponding to A-Box assertions while training E2R on these datasets. Similarly, while training the baseline approaches on the LUBM1U dataset, we simply added the T-box assertions as additional triples, because these approaches have no provision for handling T-box assertions in an explicit manner. Our experiments were performed on a Tesla K80 GPU machine.\n\nBaselines, Evaluation Metrics, and Results  We used TransE [5] and ComplEx [24] as baselines to illustrate the effectiveness of our proposed model. TransE is a simple but effective model, and ComplEx is one of the current state-of-the-art approaches. We used the OpenKE (https://github.com/thunlp/OpenKE) implementations of these approaches for our evaluation. We compared E2R with these approaches both on the link prediction task (using the FB15K and WN18 datasets) and on the reasoning task (using the LUBM1U dataset). Tuning of the hyper-parameters for the baseline approaches was performed on the test set for the FB15K and WN18 datasets but on the training set for LUBM1U. For E2R, the tuning was always
For E2R, the tuning was always done on the training set.

Data    |        MEAN RANK        |        MRR        |    HITS@1 (%)     |    HITS@10 (%)
        |   TE      CE     E2R    |  TE    CE    E2R  |  TE    CE    E2R  |  TE     CE     E2R
FB15K   |  68.4   114.0    72.0   | 0.49  0.61  0.96  | 34.8  49.8  96.4  | 76.7   81.2   96.4
WN18    | 409.9   468.1  5780.2   | 0.63  0.90  0.71  | 41.0  87.4  71.1  | 93.2   95.25  71.1
LUBM1U  | 220.1  1292.6  5742.9   | 0.26  0.12  0.46  | 18.97 12.5  45.4  | 49.1   12.59  45.4

Table 1: Datasets and Experimental Results. Acronyms used are TE := TransE, CE := ComplEx. The LUBM1U dataset has 69628 triples in its A-Box, and 43 concepts and 25 relations in its T-Box.

For all the evaluations, we used four standard metrics, namely MEAN RANK (the lower the better), MRR (Mean Reciprocal Rank), HITS@1, and HITS@10 [5]. These metrics capture how well an approach extracts the correct entities for a given membership query, and they improve as the correct answer entities move up in the ranked list of predicted answer entities. Additionally, while computing these metrics, we considered only the filtered setting (as advocated in [5]), which avoids conflicts among multiple correct answers competing for the top rank. For the reasoning task on the LUBM1U dataset, where our goal is to evaluate the inferred memberships of concepts and relations, we computed all the metrics at the query level. A query-level metric is an average over instance-level metrics, where the instances are the true members of the query; we further averaged these metrics over all the queries. Note that we did not use any symbolic reasoner as a baseline because our objective is not to compete with them but to demonstrate logical reasoning capability in vector spaces.
In E2R, the predicted candidate ranked list for any query is generated simply by projecting each entity onto the respective subspace and assigning the length of its residual component as its non-fitment score.
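The projection-based scoring just described can be sketched as follows. The toy dimensions, the random subspace, and the placement of three "member" entities inside it are illustrative assumptions, not the learned E2R embedding; the point is only that residual length after orthogonal projection separates members from non-members and induces a ranking.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_entities = 100, 50

# A predicate's subspace is spanned by the columns of B (5 random
# directions here). P is the orthogonal projector onto span(B).
B = rng.standard_normal((d, 5))
P = B @ np.linalg.pinv(B)

# Random entity vectors; place the first three inside the subspace
# so that they play the role of true members.
E = rng.standard_normal((n_entities, d))
E[:3] = (P @ E[:3].T).T

# Non-fitment score = length of the residual after projection.
residual = E - (P @ E.T).T
scores = np.linalg.norm(residual, axis=1)
ranking = np.argsort(scores)          # best (smallest residual) first

# Rank-based metrics for one true answer entity (index 0).
rank = int(np.where(ranking == 0)[0][0]) + 1
mrr = 1.0 / rank
hits1, hits10 = rank <= 1, rank <= 10
```

With real E2R embeddings, ranking all entities by this non-fitment score and locating the true answers in the ranked list yields the MEAN RANK, MRR, and HITS@k numbers reported in Table 1.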
For TransE/ComplEx, a membership query is answered by rank-ordering the candidate triples based on the triple scores obtained from the corresponding model.

Insights Table 1 shows the experimental results, where we highlight the numbers reflecting superior performance of E2R. It is evident from the results that E2R outperforms both TransE and ComplEx on the MRR and HITS@1 metrics for all the tasks and datasets, except for WN18. Interestingly, E2R achieves exceptional accuracy improvements over the baseline approaches both for the link prediction task on the FB15K dataset and for the reasoning task on the LUBM1U dataset. This illustrates the effectiveness of E2R in learning and representing the logical structure of the input KB and leveraging it for the reasoning task. On the other hand, the baseline approaches, which are fundamentally designed to address the link prediction task, find it hard to do the same. As far as the WN18 dataset is concerned, it is a collection of binary relation instances where most of the relations satisfy the transitivity property (e.g., hypernym) and have inverse relations (e.g., hypernym/hyponym). The baseline methods, which are primarily distance based, can probably capture such transitivity/inversion properties a bit better, whereas E2R somewhat struggles with them because it is not distance based. Lastly, E2R consistently maintains a high score on the HITS@1 metric, and moreover HITS@1 = HITS@10 for all the datasets. This implies that our algorithm typically ranks a ground truth entity either at Rank-1 or quite far down the list. We believe this is an artifact of our approach assigning different subspaces to different predicates and pushing non-members away from them. This phenomenon also results in a poor (average) MEAN RANK score despite a high HITS@1 score.
Such a behavior makes E2R practically interesting, because its Rank-1 candidate can be chosen with high confidence.

6 Related Work
• [Quantum Logic] Widdows [26, 21] advocated using the QL framework to fix the problem of word meaning disambiguation in keyword-based search engines. The conceptual framework suggested by Widdows and the follow-up work, however, does not prescribe any recipe for embedding knowledge into a vector space satisfying the QL axioms. Our work aims to bridge this gap.
• [Deep Nets for Word Embeddings] In recent times, deep-net based techniques have emerged that can embed word phrases [2, 27] and natural language sentences [28, 29] into a vector space. Such embeddings have offered a remarkable performance boost for several downstream tasks, including question answering [30, 31]. However, a key concern about these techniques is their black-box nature, which makes the underlying models unexplainable and non-interpretable.
• [Statistical Relational Learning (SRL) and KG Embeddings] For the majority of the SRL techniques, the goal is to embed a Knowledge Graph (KG) into a vector space, followed by the discovery of new triples or links [7, 32, 12]. The prediction of links is done via a scoring function that exploits statistical measures of similarity between the joined entities. While effective for KG completion tasks, this simple form of reasoning, based on similarities among entities, faces difficulties when handling complex reasoning queries [11].
One striking difference between E2R and KG embedding approaches is that the latter, unlike E2R, are not capable of handling an ontology of relations/concepts.
• [Hierarchy Aware KB Embedding] Recently, there have been attempts to embed ontological hierarchies [33, 34, 35, 36, 10] as well as to infer ontological hierarchies from embeddings [37, 38]. The work in [33] is less about embedding a KB and more about inducing an ontology. [34] embeds both entities and predicates in the form of vectors, which is very different from our philosophy. Unlike us, in [35] the training set upfront includes the assertions inferred (using a symbolic reasoner) on a select set of LUBM predicates, because their aim is to compete with symbolic reasoners (in speed and memory) during inference. [36] falls under the category of distance-translation based embeddings (e.g., TransE), except that it uses 'type information' to improve the quality of the entity and relation embeddings; our HITS@10 score outperforms theirs on FB15K. In [10], entities are embedded based on type similarities as well as binary relations, but the ontology of the binary relations themselves is not explicitly preserved. In addition to ontology embedding, there has also been some work on embedding the logic structure [39, 40, 41] of a KB into a vector space so as to enable reasoning. Unlike the QL axioms in our approach, the underlying principles in all of these previous works do not offer any guarantee of preserving the logical structure in the embedding space.

7 Conclusions

A novel approach, namely Embed2Reason (E2R), has been proposed, which can embed a DL based symbolic KB into a vector space while preserving the structure of all logical propositions. The idea is inspired by the Quantum Logic axioms. The key advantage is the ability to answer complex reasoning queries (both deductive and predictive) in an accurate manner using a distributional representation of the KB.
The experimental results illustrate the effectiveness of E2R relative to standard baselines. Future directions include extensions to more expressive logics as well as to natural language KBs.

Acknowledgments

We would like to thank Mukuntha N.S. for pointing out corrections at a few places.

References
[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.
[2] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.
[3] Franz Baader, Ian Horrocks, Carsten Lutz, and Uli Sattler. An Introduction to Description Logic. Cambridge University Press, 2017.
[4] Antoine Bordes, Jason Weston, Ronan Collobert, and Yoshua Bengio. Learning structured embeddings of knowledge bases. In AAAI, 2011.
[5] Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In NIPS, pages 2787–2795, 2013.
[6] Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. Knowledge graph embedding by translating on hyperplanes. In AAAI, pages 1112–1119, 2014.
[7] Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, 104(1):11–33, 2016.
[8] William L. Hamilton, Payal Bajaj, Marinka Zitnik, Dan Jurafsky, and Jure Leskovec. Embedding logical queries on knowledge graphs. In NeurIPS, 2018.
[9] Gonzalo I. Diaz, Achille Fokoue, and Mohammad Sadoghi. EmbedS: Scalable, ontology-aware graph embeddings. In EDBT, pages 433–436, 2018.
[10] Shoaib Jameel and Steven Schockaert. Entity embeddings with conceptual subspaces as a basis for plausible reasoning. In ECAI, 2016.
[11] Tim Rocktäschel and Sebastian Riedel. End-to-end differentiable proving.
In NIPS, pages 3791–3803, 2017.
[12] Stephen H. Bach, Matthias Broecheler, Bert Huang, and Lise Getoor. Hinge-loss Markov random fields and probabilistic soft logic. JMLR, 18:109:1–109:67, 2017.
[13] Alain Colmerauer and Philippe Roussel. The birth of Prolog. In History of Programming Languages Conference (HOPL-II), Preprints, pages 37–52, 1993.
[14] Stephen Muggleton. Inductive logic programming. New Generation Computing, 8(4):295–318, 1991.
[15] Garrett Birkhoff and John von Neumann. The logic of quantum mechanics. The Annals of Mathematics, 37(4):823–843, 1936.
[16] V. S. Varadarajan. Geometry of Quantum Theory. Springer-Verlag, 1985.
[17] M. L. Dalla Chiara and R. Giuntini. Quantum Logic. 2008.
[18] Ronald J. Brachman and Hector J. Levesque. Knowledge Representation and Reasoning. Elsevier, 2004.
[19] Bernhard Ganter and Rudolf Wille. Formal Concept Analysis: Mathematical Foundations. Springer, 1999.
[20] Garrett Birkhoff. Lattice Theory, volume 25 of AMS Colloquium Publications. American Mathematical Society, 3rd edition, 1965.
[21] Dominic Widdows. Geometry and Meaning, volume 172 of CSLI Lecture Notes Series. CSLI Publications, 2004.
[22] David W. Cohen. An Introduction to Hilbert Space and Quantum Logic. Springer-Verlag, 1989.
[23] Kenneth Hoffman and Ray Kunze. Linear Algebra. Prentice Hall India, 2nd edition, 2015.
[24] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. Complex embeddings for simple link prediction. In ICML, pages 2071–2080, 2016.
[25] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[26] Dominic Widdows. Orthogonal negation in vector spaces for modelling word-meanings and document retrieval. In ACL, pages 136–143, 2003.
[27] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin.
Attention is all you need. In NIPS, pages 6000–6010, 2017.
[28] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. Universal sentence encoder. CoRR, abs/1803.11175, 2018.
[29] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.
[30] Siva Reddy, Danqi Chen, and Christopher D. Manning. CoQA: A conversational question answering challenge. CoRR, abs/1808.07042, 2018.
[31] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, pages 2383–2392, 2016.
[32] Lise Getoor and Ben Taskar. Introduction to Statistical Relational Learning. Adaptive Computation and Machine Learning. MIT Press, 2007.
[33] Zied Bouraoui, Shoaib Jameel, and Steven Schockaert. Inductive reasoning about ontologies using conceptual spaces. In AAAI, 2017.
[34] Gonzalo I. Diaz, Achille Fokoue, and Mohammad Sadoghi. EmbedS: Scalable, ontology-aware graph embeddings. In EDBT, pages 433–436, 2018.
[35] Patrick Hohenecker and Thomas Lukasiewicz. Deep learning for ontology reasoning. CoRR, abs/1705.10342, 2017.
[36] Ruobing Xie, Zhiyuan Liu, and Maosong Sun. Representation learning of knowledge graphs with hierarchical types. In IJCAI, pages 2965–2971, 2016.
[37] Maximilian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical representations. In NIPS, pages 6338–6347, 2017.
[38] Maximilian Nickel and Douwe Kiela. Learning continuous hierarchies in the Lorentz model of hyperbolic geometry. In ICML, pages 3776–3785, 2018.
[39] Tim Rocktäschel, Sameer Singh, and Sebastian Riedel.
Injecting logical background knowledge into embeddings for relation extraction. In NAACL-HLT, pages 1119–1129, 2015.
[40] Tim Rocktäschel, Matko Bosnjak, Sameer Singh, and Sebastian Riedel. Low-dimensional embeddings of logic. In Workshop on Semantic Parsing, co-located with ACL, 2014.
[41] Thomas Demeester, Tim Rocktäschel, and Sebastian Riedel. Lifted rule injection for relation embeddings. In EMNLP, pages 1389–1399, 2016.