Text analysis methods based on word co-occurrence have yielded useful results in humanities and social sciences research. For instance, Venturini et al. (2012) describe the use of concept co-occurrence networks in social sciences. Grimmer and Stewart (2013) survey clustering and topic modeling applied to political science corpora. Whereas these methods provide a useful overview of a corpus, they cannot determine the predicates relating co-occurring elements (a predicate being understood here as an expression relating a set of arguments). For instance, if "France" and the phrase "binding commitments" co-occur within a sentence, how are both elements related? Is France in favour of, or against, binding commitments?
Different natural language processing (NLP) technologies can identify related elements in text, and the predicates relating them. A recent approach is open relation extraction (Mausam et al., 2012, among others), where relations are derived from the corpus in a data-driven manner, without having to pre-specify a vocabulary of predicates or actors. We are developing a workflow to analyze the Earth Negotiations Bulletin (vol. 12, http://www.iisd.ca/vol12). Our system identifies points supported and opposed by negotiating actors and extracts keyphrases and DBpedia concepts (wiki.dbpedia.org; Auer et al., 2007).
The abstract is structured as follows. First, related work and the corpus are presented. Then, our system is described. Finally, evaluation is discussed.
Material supplementing the paper and access information to the system will be available on the project’s website (https://sites.google.com/site/nlp4climate).
Venturini et al. (2014) created concept co-occurrence networks for the ENB corpus, using Cortext Manager (http://docs.cortext.net). Grammar induction has also been applied to ENB to identify recurrent actor/predicate patterns; it could be tested whether results with that approach complement ours.
Some studies have used syntactic and semantic parsing for text-analysis of social sciences and humanities corpora. Diesner (2012, 2014) examines the contribution of NLP to the construction of text-based networks. Van Atteveldt (2015) used dependency parsing to apply co-occurrence based methods within sentence elements related to an actor or a predicate. These studies rely mostly on syntactic dependencies and verbal predicates. We are using semantic role labeling as the basis for relation extraction, and treating nominal predicates besides verbal ones. We also developed an interface to navigate the results.
Finally, a relevant resource for text mining on climate corpora is the climatetagger API (API: http://api.climatetagger.net ; thesaurus: http://www.climatetagger.net/glossary/).
Each ENB issue is a 2000-word summary of one day of negotiations. The issues are written by domain experts, who strive for an objective tone and, to avoid biases, use similar expressions when reporting on all participants’ interventions (Venturini et al., 2014). The COP meetings are covered in 255 ENB issues, with ca. 35,000 sentences. The original corpus format is HTML, which we preprocessed into clean text. We dated each issue based on ENB’s table of contents.
The system helps analyze patterns of support and opposition between negotiating parties, and the issues about which parties agree or disagree. To achieve this, the system extracts propositions of shape ‹actor, predicate, negotiation point›. Terminology adopted: ‹Norway, preferred, legally-binding commitments› is a proposition, with "Norway" as the actor, "preferred" as the predicate and "legally-binding commitments" as the negotiation point.
We used the IXA Pipes library (http://ixa2.si.ehu.es/ixa-pipes/) for tokenization and part-of-speech tagging. We resolved some types of pronominal anaphora based on CorefGraph (https://bitbucket.org/Josu/corefgraph). Semantic Role Labeling (SRL) (Surdeanu et al., 2008) identifies a predicate’s arguments and their semantic functions or roles (e.g. agent). SRL was performed with ixa-pipe-srl (https://github.com/newsreader/ixa-pipe-srl), which provides a wrapper to mate-tools (Björkelund et al., 2010).
Keyphrase Extraction: YaTeA (http://search.cpan.org/~thhamon/Lingua-YaTeA/lib/Lingua/YaTeA.pm) was used.
Entity Linking (EL): The tool from Ruiz and Poibeau (2015) was used. It combines outputs from several public EL services, selecting the best outputs with a weighted vote.
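To illustrate what a weighted vote over several EL services can look like, here is a minimal sketch. It is not the implementation from Ruiz and Poibeau (2015): the service names, the weights, and the data layout are hypothetical and chosen only for the example.

```python
from collections import defaultdict

# Hypothetical per-service weights; names and values are illustrative
# assumptions, not taken from the actual tool.
SERVICE_WEIGHTS = {"service_a": 0.4, "service_b": 0.35, "service_c": 0.25}


def weighted_vote(candidates, weights=SERVICE_WEIGHTS):
    """Select one entity per mention by summing the weights of the
    services that proposed it.

    `candidates` maps a mention string to a list of (service, entity) pairs.
    """
    selected = {}
    for mention, proposals in candidates.items():
        scores = defaultdict(float)
        for service, entity in proposals:
            # Unknown services contribute nothing to the vote.
            scores[entity] += weights.get(service, 0.0)
        if scores:
            selected[mention] = max(scores, key=scores.get)
    return selected


# Two lower-weighted services outvote the single highest-weighted one.
outputs = {
    "Kyoto": [
        ("service_a", "dbpedia:Kyoto"),
        ("service_b", "dbpedia:Kyoto_Protocol"),
        ("service_c", "dbpedia:Kyoto_Protocol"),
    ],
}
linked = weighted_vote(outputs)
```

In this sketch a service's weight stands in for how much its output is trusted; how the weights are estimated is left open.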
The domain model contains actors (negotiating countries and groups) and verbal or nominal predicates. Verbal predicates (from PropBank) can be neutral reporting verbs (e.g. "stated"), or verbs related to support and opposition ("recommended", "criticized"). The nominal predicates (from NomBank) express similar notions to the verbs (e.g. "proposal", "objection"). The model also specifies a predicate type: report, support, or oppose.
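One possible way to represent such a domain model is sketched below. The class name, field names, and lemma inventory are assumptions made for illustration, and so is the assignment of a type to each example predicate (e.g. treating "proposal" as a support predicate).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    lemma: str
    pos: str    # "verb" (PropBank) or "noun" (NomBank)
    ptype: str  # "report", "support" or "oppose"


# Illustrative entries only; the type assigned to each lemma is an assumption.
DOMAIN_PREDICATES = {
    p.lemma: p
    for p in [
        Predicate("state", "verb", "report"),
        Predicate("recommend", "verb", "support"),
        Predicate("criticize", "verb", "oppose"),
        Predicate("proposal", "noun", "support"),
        Predicate("objection", "noun", "oppose"),
    ]
}

# A few negotiating countries and groups, used here only as examples.
ACTORS = {"Norway", "China", "EU"}


def predicate_type(lemma):
    """Return the predicate type for a lemma, or None if it is not in the model."""
    predicate = DOMAIN_PREDICATES.get(lemma)
    return predicate.ptype if predicate else None
```

Keying the model by lemma lets inflected forms such as "recommended" be looked up after lemmatization.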
Analysis rules were implemented to identify propositions based on the semantic roles of predicates’ arguments, previously obtained with SRL. In SRL, A0 corresponds to a predicate’s agent and A1 is the patient or theme; AM roles represent adjuncts (time, location etc.) or negation (see Palmer et al., 2005). Most domain predicates involve an agent and a message expressed by the agent (who agrees with the message, objects to it, or just reports it). Thus, actor mentions in a predicate’s A0 argument are treated as the proposition’s actor, and the A1 argument often represents the negotiation point addressed by the actor. The generic rule to identify propositions is in Figure 1.
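The idea behind mapping A0 to the actor and A1 to the negotiation point can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of the rule in Figure 1: the frame layout (a dict with a predicate and role-labelled argument strings), the actor list, and the predicate list are all hypothetical.

```python
import re
from typing import NamedTuple


class Proposition(NamedTuple):
    actor: str
    predicate: str
    point: str


# Hypothetical, simplified domain model for the example.
ACTORS = ["China", "EU", "Norway"]
DOMAIN_PREDICATES = {"preferred", "recommended", "criticized"}


def generic_rule(frame):
    """Create one proposition per actor mentioned in the A0 argument.

    `frame` is assumed to look like:
    {"predicate": "preferred", "roles": {"A0": "...", "A1": "..."}}
    """
    if frame["predicate"] not in DOMAIN_PREDICATES:
        return []
    a0 = frame["roles"].get("A0", "")
    a1 = frame["roles"].get("A1")
    if not a1:
        return []  # without a message there is no negotiation point
    return [
        Proposition(actor, frame["predicate"], a1)
        for actor in ACTORS
        if re.search(r"\b" + re.escape(actor) + r"\b", a0)
    ]


frame = {
    "predicate": "preferred",
    "roles": {"A0": "Norway", "A1": "legally-binding commitments"},
}
propositions = generic_rule(frame)
```

When several actors are mentioned in A0 (e.g. a joint statement), the sketch emits one proposition per actor.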
Sentences with "opposed by" constructions require a different analysis (e.g. "China, opposed by the EU, recommended…"). In such sentences, a different rule creates, for the opposing actors, propositions where the predicate contradicts the main clause’s predicate (see Table 1 for an example). Proposition-creation rules for more specific cases have also been implemented.
The treatment of negation relies on finding AM-NEG roles (the SRL role for negation) attached to a predicate, or negative items ("not", "lack") in a window of two tokens preceding a predicate.
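The negation check described above can be sketched as a small function. The function name, the token-list input, and the way roles are passed in are assumptions for the example; only the two cues (an AM-NEG role, or "not"/"lack" within two tokens before the predicate) come from the text.

```python
NEGATIVE_ITEMS = {"not", "lack"}
WINDOW = 2  # number of tokens inspected before the predicate


def is_negated(tokens, predicate_index, roles=()):
    """Return True if the predicate at `predicate_index` is negated.

    `roles` is assumed to hold the role labels attached to the predicate
    by the SRL component (e.g. {"A0", "A1", "AM-NEG"}).
    """
    if "AM-NEG" in roles:
        return True
    start = max(0, predicate_index - WINDOW)
    preceding = tokens[start:predicate_index]
    return any(token.lower() in NEGATIVE_ITEMS for token in preceding)


tokens = ["The", "EU", "did", "not", "support", "the", "proposal"]
negated = is_negated(tokens, tokens.index("support"))
```

A negative item further away than two tokens is ignored by the window check, so it is only caught if SRL attached an AM-NEG role.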
Pronominal anaphora was treated via custom rules operating on the output of the coreference resolver (CorefGraph). We created custom rules since, in the corpus, "he" and "she" (besides "it") can refer back to a country (pronoun gender depends on the country’s delegate).
To facilitate searches by date-range, propositions are assigned their documents’ date.
The UI (Figure 2) helps analyze actors’ negotiation positions. It allows searching for documents matching a text query (Text search box), and for propositions matching a given actor (Actors box) or a given predicate (Actions box). Propositions matching a query are displayed on the left panel, documents for a query on the right. Aggregated keyphrases and DBpedia concepts for the content matching a query (documents or propositions) are displayed in tabs on the right panel. The AgreeDisagree view provides an overview of keyphrases and concepts from propositions where selected actors agree or disagree. Simultaneous access on the UI to the corpus and the annotations helps researchers validate results.
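One way an agree/disagree overview could be computed from extracted propositions is sketched below. The criterion is an assumption made for illustration, not the system's documented logic: two actors agree on a keyphrase when both support or both oppose it, and disagree when one supports and the other opposes. The input layout is also hypothetical.

```python
from collections import defaultdict


def agree_disagree(propositions, actor_a, actor_b):
    """Split keyphrases into those the two actors agree and disagree on.

    `propositions` is an iterable of (actor, predicate_type, keyphrase)
    triples; "report" predicates carry no stance and are skipped.
    """
    stance = defaultdict(dict)  # keyphrase -> {actor: predicate_type}
    for actor, ptype, keyphrase in propositions:
        if actor in (actor_a, actor_b) and ptype in ("support", "oppose"):
            # A later proposition overwrites an earlier one by the same actor.
            stance[keyphrase][actor] = ptype
    agree, disagree = [], []
    for keyphrase, by_actor in stance.items():
        if len(by_actor) == 2:  # both actors took a stance
            same = by_actor[actor_a] == by_actor[actor_b]
            (agree if same else disagree).append(keyphrase)
    return sorted(agree), sorted(disagree)


props = [
    ("China", "support", "technology transfer"),
    ("EU", "support", "technology transfer"),
    ("China", "oppose", "binding commitments"),
    ("EU", "support", "binding commitments"),
    ("Norway", "support", "binding commitments"),
]
agreed, disagreed = agree_disagree(props, "China", "EU")
```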
The implementation framework is Django (https://www.djangoproject.com/), used together with Solr (https://lucene.apache.org/solr/).
It is important to assess whether the system can help domain experts gain insights they would not have otherwise obtained, e.g. detect previously unnoticed generalizations (see e.g. Berry, 2012). This type of evaluation is ongoing; we are collaborating with political scientists, whose initial feedback on the tool has been positive. User validation of the interface is also ongoing.
The system’s NLP components were evaluated in the literature cited above. Results are state-of-the-art or competitive, and are available on our project’s website (sites.google.com/site/nlp4climate).
To evaluate the model and analysis rules that create domain-relevant propositions, we have manually annotated a set of corpus sentences with propositions. Details about the test set, evaluation metrics and results are on the website. We consider the results satisfactory.
A useful feature would be an annotation confidence score that users could employ to prioritize manual revision of results. A useful application of the extracted propositions would be creating network graphs with different types of edges representing support and opposition among parties, and between parties and issues.
We thank Tommaso Venturini, Audrey Baneyx, Kari de Pryck and Diégo Antolinos-Basso from the Sciences Po médialab in Paris for domain-expert feedback on the system. Pablo Ruiz is supported by a PhD grant from Région Ile-de-France.