SIG:GraphTechnologies

From TEIWiki
Andreas Kuczera, Iian Neill, Stefan Armbruster
Latest revision as of 13:52, 3 March 2022


Introduction

TEI is not a format, though many people think it is. It is a de facto standard that specifies Guidelines for document interchange. The Guidelines are currently based on XML, but XML is only one possible technical way of expressing the phenomena.

The aim of the Graph-SIG is to find a way of expressing the language phenomena of the TEI in graphs.

  • In a graph you can use multi-hierarchical annotation layers.
  • Graph models are easy to read and understand, so DH people and “normal” scientists share a common level of discussion.
  • A graph can be expressed as RDF, so the step from a graph to linked open data is easy to make.

The main goal of the TEI-Graph-SIG is to model the textual phenomena of the TEI in a graph and to develop routines to import TEI-encoded XML files into graph databases.

Convert DTA-XML with neo4j to Standoff Property JSON

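In a first step we import a small XML example into a neo4j (https://neo4j.com) instance using apoc.import.xml (https://neo4j.com/labs/apoc/4.4/import/xml/).

The example is a page (https://seafile.rlp.net/f/6282a26504cc4f079ab9/?dl=1) from the Deutsches Textarchiv (DTA, https://www.deutschestextarchiv.de). The XML is also available on the wiki page XML-Testfile, and the same page can be viewed in the DTA version at http://www.deutschestextarchiv.de/book/view/patzig_msgermfol841842_1828/?hl=welcher;p=11.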

<TEI>
  <text>
    <body>
      <div xml:id="Ms_germ_fol_841" next="#Ms_germ_fol_842">
        <div type="session" n="1">
          <p><pb facs="#f0011" n="7."/>
            erblicken wir einen großen Unter&#x017F;chied zwi-<lb/>
            &#x017F;chen den entferntern u. nähern Planeten<lb/>
            <note place="left">
              <hi rendition="#u">Zwei be&#x017F;ondere Planeten-Sy&#x017F;teme</hi><lb/></note>von der Son&#x0303;e.
            <hi rendition="#u">Dies giebt zwei be&#x017F;ondre<lb/>
              Sÿ&#x017F;teme</hi>. Die Scheide machen die kleinern<lb/>
            Körper die &#x017F;ich zwi&#x017F;chen Mars u. Jupiter<lb/>
            bewegen, die ein ganz eignes Sy&#x017F;tem<lb/>
            bilden, von denen die Ve&#x017F;ta als die<lb/>
            <hi rendition="#u" hand="#pencil">größte</hi>
            <choice>
              <sic>ungefahr</sic>
              <corr resp="#CT">ungefähr</corr>
            </choice>
            die
            <choice>
              <abbr>Oberfl.</abbr>
              <expan resp="#CT">Oberfläche</expan>
            </choice>
            von Deut&#x017F;ch-<lb/>
            land hat. Sie haben eine translative Be-<lb/>
            wegung von We&#x017F;ten nach O&#x017F;ten, &#x017F;ind ihrer<lb/>
            Stellung nach ähnlich den
            <choice>
              <sic>Com&#x0303;eten</sic>
              <corr resp="#BF">Cometen</corr>
            </choice>; obgleich<lb/>
            doch keine A<subst>
              <del rendition="#ow"><gap reason="illegible" unit="chars" quantity="1"/></del>
              <add place="across">e</add>
            </subst>hnlichkeit anderweit zwi&#x017F;chen<lb/>
            ihnen u. den
            <choice>
              <abbr>
                <choice>
                  <sic>Com&#x0303;et&#xFFFC;</sic>
                  <corr resp="#BF">Comet&#xFFFC;</corr>
                </choice>
              </abbr>
              <expan resp="#BF">
                <choice>
                  <sic>Com&#x0303;eten</sic>
                  <corr resp="#BF">Cometen</corr>
                </choice>
              </expan>
            </choice>
            i&#x017F;t, wie überhaupt<lb/>
            kein Uebergang zwi&#x017F;chen Planeten u. Co-<lb/>
            meten gefunden wird u. keine po&#x017F;itive<lb/>
            <note place="left">
              <hi rendition="#u">Er&#x017F;tes Sy&#x017F;tem<lb/>
                characteri&#x017F;t. Merkmale</hi><lb/></note>Aehnlichkeit.
            <hi rendition="#u">Jn die&#x017F;em doppelten Sy&#x017F;tem<lb/>
              der Planeten gehören zu&#x017F;am&#x0303;en: Merkur,<lb/>
              Venus, Erde, Mars.</hi>
            Sie haben das<lb/>
            gemein&#x017F;ame der be&#x017F;ondern Dichtigkeit,<lb/>
            wie
            <hi rendition="#aq">Platina</hi>, Magnet&#x017F;tein u. dgl.; &#x017F;ie<lb/>
            <subst>
              <del rendition="#s">rotiren</del>
              <add place="superlinear">bewegen &#x017F;ich</add>
            </subst>
            viel ge&#x017F;chwinder
            <metamark/>
            <add place="superlinear">um die Son&#x0303;e</add>, &#x017F;ind mond-<lb/>
            armer |: bloß die Erde hat einen
            <choice>
              <abbr>Trabant&#xFFFC;</abbr>
              <expan resp="#BF">Trabanten</expan>
            </choice>
            :|,<lb/>
            an den Polen abgeplattet. Anders<lb/>
            verhält es &#x017F;ich mit den Planeten auf der<lb/>
            Bahn jen&#x017F;eits der kleinen Planeten.</p><lb/>
        </div>
      </div>
    </body>
  </text>
</TEI>

Import into neo4j

The import into neo4j runs with:

// Import xml-example from DTA to neo4j
call apoc.import.xml('https://seafile.rlp.net/f/6282a26504cc4f079ab9/?dl=1', {connectCharacters: true, charactersForTag:{lb:' '}, filterLeadingWhitespace: true}) yield node 
return node;
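
To make the effect of the charactersForTag option easier to picture, here is a minimal Python sketch that flattens a TEI fragment into text chunks and inserts a space for every <lb/>. It is not APOC's implementation; the function name flatten_text, its characters_for_tag parameter, and the short sample string are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET


def flatten_text(xml_source, characters_for_tag=None):
    """Return the text chunks of an XML string in document order.

    Tags listed in characters_for_tag contribute a replacement string,
    loosely imitating the charactersForTag option (hypothetical helper).
    """
    characters_for_tag = characters_for_tag or {}
    chunks = []

    def walk(element):
        # An element such as <lb/> may stand for a replacement string.
        if element.tag in characters_for_tag:
            chunks.append(characters_for_tag[element.tag])
        if element.text:
            chunks.append(element.text)
        for child in element:
            walk(child)
            # Text that follows the child still belongs to this element.
            if child.tail:
                chunks.append(child.tail)

    walk(ET.fromstring(xml_source))
    return chunks


# Shortened sample in the style of the DTA page (not the full test file).
sample = "<p>Unter&#x017F;chied zwi-<lb/>&#x017F;chen den <hi>Planeten</hi></p>"
chunks = flatten_text(sample, {"lb": " "})
plain_text = "".join(chunks)
print(plain_text)
```

Each chunk corresponds roughly to one text node in the graph; joining the chunks gives the running text that the later export works on.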

The next picture shows a small part of the graph:

[Figure: TEI-Graph.png]

Export from neo4j to Standoff Property JSON

The next step is to export the data to the Standoff Property JSON format with a Cypher query.

// Export TEI-Graph to Standoff-Property-JSON-Format by Stefan Armbruster
// Follow the NE chain from the document node to the last text node
match path=(d:XmlDocument)-[:NE*]->(e:XmlCharacters)
where not (e)-[:NE]->()
// Concatenate the text of the nodes on that chain (tail() drops the document node)
with tail(nodes(path)) as words, d
with reduce(s="", x in words| s + x.text ) as allText, d
// Visit every XmlTag below the document, depth first
call apoc.path.expandConfig(d,{
  relationshipFilter: '<IS_CHILD_OF',
  labelFilter: 'XmlTag',
  bfs: false,
  minLevel: 1
}) yield path
with allText, path, nodes(path)[-1] as this
// For each tag, follow NEXT up to its last descendant and keep the text nodes on that path
MATCH p=(this)-[:NEXT*]->(x)
where (x)-[:LAST_CHILD_OF*]->(this) and any(x in nodes(p) WHERE x:XmlCharacters)
with allText, this, collect(p)[-1] as longest
with allText, this, [x in nodes(longest) where x:XmlCharacters] as xmlCharacters
// Derive the offsets and the covered text from those text nodes
with allText, this,
  apoc.coll.min([x in xmlCharacters | x.startIndex]) as min,
  apoc.coll.max([x in xmlCharacters | x.endIndex]) as max,
  apoc.text.join([x in xmlCharacters | x.text], "") as text
// Build one standoff property per tag; keys starting with "_" are not exported as attributes
with allText, {
  index:id(this),
  startIndex: min,
  endIndex: max,
  text: text,
  type: this._name,
  attributes: apoc.map.fromPairs([x in keys(this) WHERE not x starts with "_" | [x, this[x]] ])
} as standoffProperty
return {text: allText, properties: collect(standoffProperty)};
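
The shape of the result may be easier to follow on a tiny example. The following Python sketch builds a comparable structure directly from an XML string, reusing the field names of the query (text, properties, index, startIndex, endIndex, type, attributes). It is only an illustration under stated assumptions: the helper to_standoff is hypothetical, index is a simple counter instead of a neo4j node id, endIndex is treated as an exclusive offset, and empty elements are kept, none of which is guaranteed to match the Cypher export.

```python
import json
import xml.etree.ElementTree as ET


def to_standoff(xml_source):
    """Turn an XML string into a plain text plus standoff properties."""
    text_parts = []
    properties = []
    position = 0  # current offset in the concatenated text

    def add_text(value):
        nonlocal position
        if value:
            text_parts.append(value)
            position += len(value)

    def walk(element):
        start = position
        add_text(element.text)
        for child in element:
            walk(child)
            add_text(child.tail)
        properties.append({
            "index": len(properties),   # assumption: counter, not a node id
            "startIndex": start,
            "endIndex": position,       # assumption: exclusive end offset
            "text": "".join(text_parts)[start:position],
            "type": element.tag,
            "attributes": dict(element.attrib),
        })

    walk(ET.fromstring(xml_source))
    return {"text": "".join(text_parts), "properties": properties}


sample = ('<p>die <choice><sic>ungefahr</sic>'
          '<corr resp="#CT">ungefähr</corr></choice> die</p>')
result = to_standoff(sample)
print(json.dumps(result, ensure_ascii=False, indent=2))
```

As in the query, the readings of both sic and corr end up in the running text, and each element becomes one property that points into that text by offsets instead of wrapping it in markup.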


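This JSON can then be imported into the SPEEDy Standoff Property Editor, which can be found on GitHub (https://github.com/argimenes/standoff-properties-editor).

At the end of the README there is a link to a test instance hosted on GitHub Pages (https://argimenes.github.io/standoff-properties-editor/).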

Copy the JSON export into the window below the UNBIND button of SPEEDy and press BIND.
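
Before pasting, it can help to check that the export is structurally sound. The sketch below is a loose sanity check in Python, assuming only the field names produced by the query above; the helper check_standoff and the small sample document are hypothetical, and the range test is deliberately permissive because it makes no assumption about whether endIndex is inclusive or exclusive.

```python
import json


def check_standoff(document):
    """Return a list of problems found in a standoff property document."""
    text = document.get("text")
    if not isinstance(text, str):
        return ["'text' is missing or not a string"]
    problems = []
    for i, prop in enumerate(document.get("properties", [])):
        start, end = prop.get("startIndex"), prop.get("endIndex")
        if not isinstance(start, int) or not isinstance(end, int):
            problems.append(f"property {i}: startIndex/endIndex must be integers")
        elif not 0 <= start <= end <= len(text):
            problems.append(f"property {i}: range {start}-{end} is outside the text")
        if not prop.get("type"):
            problems.append(f"property {i}: missing 'type'")
    return problems


# Hypothetical miniature export, not taken from the real test file.
export = json.loads(
    '{"text": "die Oberfl.", "properties": ['
    '{"index": 7, "startIndex": 4, "endIndex": 10, "type": "abbr",'
    ' "text": "Oberfl.", "attributes": {}}]}'
)
print(check_standoff(export) or "no problems found")
```

An empty list means the offsets at least stay within the text; it does not prove that SPEEDy will render every annotation as intended.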

The next picture shows SPEEDy with the test.json loaded. You can also choose the example file in the top selection box of SPEEDy.

[Figure: TEIinSPEEDy.png]

I want to thank Stefan Armbruster from neo4j for the export Cypher query and for implementing the XML import functions in APOC (apoc.import.xml, https://neo4j.com/labs/apoc/4.4/import/xml/), and Iian Neill for his work on SPEEDy (https://argimenes.github.io/standoff-properties-editor/).