An Investigation of Entailment and Narrative with AI Techniques (Generative Models)

Many storytelling-generation problems stem from the difficulty of modeling the sequence of sentences. Language models are generally able to assign high scores to well-formed text, especially short texts, but they fail when they try to simulate human textual inference. Although automatically generated text sometimes sounds bland, incoherent, repetitive, or unrelated to the context, in other cases the process reveals a capacity to surprise the reader and to avoid being boring or predictable, even when the generated text satisfies the requirements of the entailment task. The lyric tradition often does not proceed by strict logical inference, but exploits alternatives such as unexpectedness, which is useful for predicting when a narrative story will be perceived as interesting. To better grasp narrative variety, we propose a novel measure based on two components, inference and unexpectedness, whose different weights can give readers different experiences of the functionality of a generated story. We propose a supervised validation procedure that compares the original authorial text, learned by the model, with the generated one.


Introduction
To begin, let us discuss what we consider an existing disconnection between the study of Artificial Intelligence and the analysis of storytelling. During the decade 2000-2010 there was a period of confluence, a more or less mutual interest, between the two research domains mentioned above. As a testament to this, we may mention some bibliographical data, for example the work by Cavazza and Pizzi (2006) on narratology. Interestingly, from the nineties onward we have not experienced significant technological improvements in the study of Artificial Intelligence and storytelling: the models have not changed to a great extent. They have continued to adopt a mixture of classification, decision making and logic rules: not so different, one could say, from the classical AI patterns of the eighties (Expert Systems), which were echoed and reverberated in the literature over the next two decades.
A bibliographical testimony of that is the book by Bringsjord and Ferrucci (1999). The two authors showed a solid mathematical background, and their challenge was to demonstrate «that logic is forever closed off from the emotional world of creativity». Their approach was typical of the first chatbots that tried to pass the Turing Test, namely, building a knowledge repository where an "agent" answered a human interlocutor: the agent ran through decision trees to query the repository. Alternatively, several agents interacted with each other, without human support, and gave life to narrative paths.
When implementing these deductive-inductive schemes, which are enclosed and self-sufficient, we necessarily limit the range of events and stories to be narrated. However, these systems can easily be plunged into crisis by unexpected questions, whereby they need continuous extensions and generalizations, as well as ways of appending new rules and new events and stories: it is, if you will pardon the pun, a never-ending story.
Despite considerable advances in neural language modeling during the last twenty years (Note 1), the more promising branches of text generation and automated storytelling have simply addressed the statistical-learning domain. The early goals of the general computational linguistics of the nineties have long been forgotten. Indeed, the promise of a machine that dives deeply into text comprehension (using logic and other linguistic models) has not been kept. By training at large scale on millions of webpages and heterogeneous documents, current language models pursue this law: the more information ingested, the more knowledge (reasoning ability) acquired. An example is offered by OpenAI's GPT-2 model, followed by its successor GPT-3, which has recently passed the Turing Test in some light versions. These models are based on the assumption that the probability distribution of a word sequence can be decomposed into the product of conditional next-word distributions. Simply put, text generation tries to discover the most probable distributions of word chains, with respect to the global knowledge learned (paradigmatic level) and to what is locally inferred from the previous sentence (syntagmatic level). In these cases, the inference is drawn by assigning the most probable maximum score to the words' context and to the relationships among pieces of text.
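To make the factorization concrete, here is a toy sketch: a hypothetical bigram model (the probabilities below are invented for the example, not taken from any trained model) in which the score of a word chain is the product of its conditional next-word probabilities.

```python
import math

# Toy bigram "language model": the conditional probabilities are
# invented for this illustration only.
bigram = {
    ("<s>", "the"): 0.5,
    ("the", "cat"): 0.2,
    ("cat", "sleeps"): 0.4,
}

def sequence_logprob(words, model):
    """log P(w1..wn) = sum of log conditional next-word probabilities."""
    logp = 0.0
    prev = "<s>"
    for w in words:
        logp += math.log(model[(prev, w)])
        prev = w
    return logp

lp = sequence_logprob(["the", "cat", "sleeps"], bigram)
# exp(lp) = 0.5 * 0.2 * 0.4 = 0.04
```

A neural model replaces the lookup table with a learned conditional distribution, but the chain-rule decomposition is the same.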
The current metrics used for measuring the quality of generated text follow the same concept.
www.scholink.org/ojs/index.php/csm Communication, Society and Media Vol. 3, No. 4, 2020 63
Although the output is grammatical, and even impressively idiomatic, its comprehension of the world is strange, and its appeal to common sense is nonsensical. The machine offers a seriously costly solution, destroying the home, to a trivial problem. This is absolutely at odds with the effort-minimizing attitude expected of such an (automatic) problem-solving machine. It is not a singular case: many generated texts, when understandable, are curious and seem rich in accidental irony: probably useful for creative storytelling, but definitely inadequate for the logical entailment that the NLP metrics attempt to measure.
In light of this, the purpose of our research is to extend entailment to other linguistic connectives, because the narrative and lyric tradition often does not proceed by strict logical inference. The coherence and readability of storytelling are conditioned by a complex way of reasoning that spans a spectrum of solutions: from an antecedent statement completely untied from the next sentence, up to inferring the truth of the consequent from the antecedent (hypothesis). It is evident that this spectrum collects a rich variety of sentence relations, some inferred by cause and effect, some not. It differs from the simple opposition between entailment and contradiction that current evaluation systems insist on measuring. From this point of view, it is extremely difficult to reduce all the relations that do not belong to cause-effect inference to an easy "not entailed" label, which also fails to account for the ability to capture the reader's attention. To achieve this scope, we propose a novel measure of narrative inference and, more generally, of the concatenation of a pair of sentences, based on two components: inference and unexpectedness, whose different weights can give readers different experiences of the functionality of a story.

Related Literature
It is not simple to collect the literature related to our research, for several reasons. From one perspective, the new horizon of NLG (Natural Language Generation) research was born in 2017 with the Transformer models: probably too short a time to allow new measures of narrative functionality to emerge. The BLEU score (Note 4) and derivative metrics like BLEURT (Note 5) are used to evaluate the consistency of a text generated from a trigger (the previous words), but they fall into the trap of adopting the same technological level for the language (the story under examination) and the meta-language (the metric measuring the properties of the story). Since the automatic approach is limited by this lack of a metalevel, human evaluation is often preferred, though usually limited to classical inference proof.
From another perspective, as anticipated above, during the last decade we have witnessed a detachment of NLP (Natural Language Processing) research, performed mostly by computer scientists who are unrelated to the digital humanities and not interested in literary studies and narrative experience. Despite the names, NLG and Computational Creativity (CC) convey different meanings: the former relies on advanced, specific algorithms from neural networks and statistical learning, whereas CC (Note 6) approaches creativity from a less formalized point of view, drawing on psychology, philosophy and sometimes neuroscience.
In this panorama, Mark Riedl (Note 7) is an author who merges the two areas, engaged both in Intelligent Narrative Technology (Note 8) and in the new improvements of NLP. In "Computational Narrative Intelligence: Past, Present, and Future" (Note 9) he traces the insights of this approach: "Narrative intelligence is the ability to craft, tell, understand, and respond effectively to stories". The purpose is to highlight how the knowledge required to understand a story can be used to create new stories. Once the domain model has been ingested, a story generation system can produce an infinite number of stories involving characters, places, and actions that are known to the system. In "Scheherazade: Crowd-Powered Interactive Narrative Generation" (Note 10) the author uses crowdsourced information to automatically learn the domain knowledge needed to construct and understand stories about daily activities, such as going to a restaurant or to a movie theater. The approach is similar to the basic principles implemented in the pre-trained models of NLG, but the difference is essentially the level of human contribution. A system like Scheherazade uses a plot graph for predicting possible story developments, driven by human supervisors (Note 11), whereas systems like GPT-2 learn from texts and generate new texts in a completely automatic way, driven only by statistical distribution laws.
In a recent paper, Riedl (Note 12) has advanced his research, with the goal of integrating the NLG perspective, by decomposing neural plot generation into two issues: a) the generation of a sequence of events (event-to-event) and b) the transformation of these events into natural language sentences (event-to-sentence), so as to offer a solution to the current limits of NLG (the tendency to produce grammatically correct but semantically inconsistent sentences). The paper, remarkable for measuring the topic alignment of sentences, fails to adopt a suitable evaluation of the inference problem, limiting the measure to the BLEU score. In another paper (Note 13), Riedl focuses on the ability to learn normative natural language descriptions, exploiting predetermined narrative events, but without measuring entailment in detail.
In the end, except for particular cases like the quoted studies of Riedl and his collaborators, we observe a fairly small intersection between the NLG vein and Narrative Intelligence (or other sub-streams of the Digital Humanities). The first branch remains outside the field of narrative theories and avoids engaging with the deep role that inference (and its variations) plays in the literary tradition. The second branch is often not updated to Deep Learning models and is too tied to rules drawn by humans, thus introducing a bias that confounds model evaluation.
In addition, recent Digital Humanities studies, specifically those concerning literary texts, have largely stopped at the stage of (perhaps more sophisticated) forms of querying textual corpora (a development of the Italian "linguistica computazionale" arising from the DBT pattern by Picchi-Stoppelli) and little else, except for examples of digital philology: critical editions of texts available online that are more or less interactive. We have evident testimony of this backwardness in the insight into this topic now offered by an outstanding review such as Italian Studies: what should be, and actually is, an overview of "Italian Studies and the Digital" (Armstrong & Patti, 2020). This gives us a very thorough frame of the existing international debate.
Nonetheless, in this paper we find only a single reference to Artificial Intelligence as a future possibility related to DH, presented almost as a science-fiction projection (p. 206) (Note 14). The rest is silence, a silence overflowing with bibliography that has truly little to do with new digital perspectives.

Research Question
Our research is focused on evaluating the narrative functionality of a story (generated through some AI tools). To achieve this scope, we propose a novel measure of narrative inference, extended to all the concatenation modes of a pair of sentences. By functionality we mean the encoding of a sequence of dynamic components (also in the diachronic sense) that aim to achieve a successful value judgement, approved and confirmed by a community of readers. In layman's terms, by narrative functionality we basically identify the property of a story (written as well as audiovisual, etc.) of being successful for much of its public. There is no need here to make matters more complicated by quoting the concept of Erwartungshorizont and Reception Theory (Rezeptionsästhetik), or the thought of Gadamer and so on. We prefer to remain on an elementary level, at most evoking and invoking common-sense theory.
Our attention is drawn by the so-called NLG task of prompt-based text generation. In simple terms: the human premise is followed by a consequent sentence generated by the machine through a statistical-learning mechanism. The purpose of this mechanism is to predict the next probable word based on the earlier sequence of words. The quality of this prediction depends on several factors, mainly the model (the neural network that implements the text generation) and the features the model exploits to learn from a training set of text data. In any case, the tool must meet stringent requirements about building understandable stories, so that reader satisfaction can be reached.
There is therefore the requirement for an evaluation measure.
The method that we propose is based on two components: inference and unexpectedness (Note 15), whose different weights can give readers different experiences of the functionality of a story. The inference part is related to the prediction capability of the reader, while unexpectedness conveys a feeling of surprise, astonishment and curiosity about the evolution of the storytelling. Our idea is supported by the belief that this functionality can be described by feasible features, quantitatively measurable by weights <τi: inference, τu: unexpectedness> applied to the concatenation of two contiguous sentences (x, y). Better said, a weight τ (0 < τ < 1) sets the leverage on the story provided, respectively, by the inference and the unexpectedness of y given x. The sum of the two weights s = τi + τu represents a sort of aggregative index, able to capture different opportunities for generating a story. For example, τi + τu ≈ 0 sounds like a free concatenation of the sentences, with a lack of useful information for a story, whereas τi + τu ≈ 2 represents a sort of contradiction, because it is a concatenation marked by high inference (involving high predictability) and high unpredictability at the same time. The best equilibrium for the readability of the story is assessed around τi + τu ≈ 1 (see just below).
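A minimal sketch of how the aggregative index s = τi + τu could be read in code; the eps margin and the category names are illustrative assumptions of ours, not values fixed by the measure itself.

```python
def interpret(tau_i, tau_u, eps=0.25):
    """Read the aggregative index s = tau_i + tau_u (each weight in [0, 1]).

    The eps margin is an illustrative assumption, not a value fixed by
    the measure.
    """
    s = tau_i + tau_u
    if s < eps:
        return "free concatenation"  # s ~ 0: no useful information for a story
    if s > 2 - eps:
        return "contradiction"       # s ~ 2: high inference and high surprise at once
    if abs(s - 1.0) <= eps:
        return "equilibrium"         # s ~ 1: best readability
    return "intermediate"
```

For instance, a pair rated τi = 0.5, τu = 0.6 would fall in the equilibrium band, while τi = 0.95, τu = 0.9 would be flagged as contradictory.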
The challenge is the evaluation of the inference-unexpectedness attributes of the writer's style, compared with the machine's prediction style. The evaluation is based on the comparison between one pair of sentences (x, y) and another pair (x, z), where x and z are written by the author and y is generated by the machine. In this way, we can assess the quality of the text generation and its adherence to a replicable style, from the point of view of narrative functionality. Moreover, from a theoretical point of view, it is simple to generalize this mechanism to longer chains (x1, …, xn), where (x, y) = (xj, xj+1) for some j, 0 < j <= n-1.
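The generalization to chains can be sketched in one line: every contiguous pair of the chain becomes a candidate (x, y) for evaluation (the function name is ours).

```python
def adjacent_pairs(chain):
    """Each contiguous pair (x_j, x_{j+1}) of a sentence chain becomes a
    candidate (x, y) for the <inference, unexpectedness> evaluation."""
    return list(zip(chain, chain[1:]))
```

A chain of n sentences thus yields n-1 pairs to be weighed.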
Let us give some examples of the first case (relating to texts written by an author), such as Alice's Adventures in Wonderland by Carroll. The analysis may be quite intriguing, because the logical writing process of that novel will almost surely generate causal consistencies but, at the same time, the unrealistic world Carroll describes will determine something unexpected and amazing. Perhaps the values of <mean τi, mean τu> in Alice will be similar to Pickwick's, but the "variance" (for instance, the variance within each chapter) will be diversified, because unexpectedness in Carroll's novel does not have the same distribution as in Pickwick.

Methodological Proposal
In order to illustrate the method, we now propose a classification with five possible cases. This allows us to partition the range [0, 1] into three levels for each index: small (≈ 0), medium (≈ 0.5), large (≈ 1). This trisection allows a supervised evaluation in the light of a prior approximation.
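The trisection can be sketched as a nearest-level mapping; snapping to the nearest anchor is our own implementation choice, since the text only fixes the three anchor values.

```python
def trisect(tau):
    """Snap a weight tau in [0, 1] to the nearest of the three levels
    small (~0), medium (~0.5), large (~1)."""
    levels = {"small": 0.0, "medium": 0.5, "large": 1.0}
    return min(levels, key=lambda name: abs(levels[name] - tau))
```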

5) inference < unexpectedness (0.3 < 0.7)
Some examples:
1) I go to kiss my mother -> and my mother stabs me
2) I go to kiss my mother -> and she hugs me
3) I go to kiss my mother -> but she walks away, irritated
4) I go to kiss my mother -> and she invites me to lunch
5) I go to kiss my mother -> and she breaks down in tears
Certainly, the narrative context makes the sentences clearer; to be less abstract,

The Model
Nowadays there are two possible approaches for generating stories from an NLG perspective. The first, and easier, is to adopt pre-trained models available through an open-access provider (like Hugging Face, Google and so forth). The second concerns the creation of a dedicated model from scratch, or else the customization of a pre-trained model via a transfer-learning process (transferring knowledge across tasks or domains), in order to make the machine learn the contents and style of a writer.
For the scope of our research, we essentially require a pre-trained model that has learned from a default set of text data (e.g., blogs or web pages), plus a mechanism for fine-tuning this model with new information. In this way, we can compare some text-generation results obtained with the default model against others obtained by the transfer-learning method. Our attempt is to show an improvement of the customized model with respect to the trustworthiness of generating text in accordance with the original writer's style.
Basically, within NLP neural networks we have two options for transferring new knowledge: a) using task-driven information or b) using domain-driven information. The original pre-trained network is trained on a general data domain for different source tasks: we must specialize the original knowledge to assist the target task (in our case, learning supervised information about the classification of sentence pairs with respect to the two indexes <inference, unexpectedness>) and the target domain (leveraging the original model and adding the contents of the reference writer under study).
Let us discuss some examples:
• Task driven: this learns from labeled pairs of the form <premise → hypothesis>, for example: A soccer game with multiple males playing → Some men are playing a sport. It takes into account the supervised labels from an open-source dataset (e.g., https://nlp.stanford.edu/projects/snli/):
o Entailment (the hypothesis follows from the premise);
o Contradiction (the hypothesis contradicts the premise);
o Neutral (the output does not rely on premises and consequences).
In a similar way, we proceed to learn the unexpectedness task. Since it is not a standard evaluation in the traditional NLP process, and it is hard to find by navigating the public repositories, we will compile a dedicated dataset.
• Domain driven: this also embodies the reference novels (The Catcher in the Rye and possibly other works by Salinger) to be mapped onto the neural network.
We observe that the simple task-driven process is not enough to allocate the inference/unexpectedness structure of a story, because it has not been set on the linguistic experience derived from a specific writer. It captures the insight of the inference property, compliant with natural language understanding systems, but it also needs to move this abstract knowledge toward concrete examples of the specific author's utterances, depicted through the domain-driven task.
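How such doubly labeled pairs might be represented can be sketched as follows; the class and the unexpectedness label set are our own assumptions (only the inference labels are the standard SNLI ones), since the dedicated unexpectedness dataset is still to be compiled.

```python
from dataclasses import dataclass

# The inference labels are the standard SNLI ones; the unexpectedness
# label set and the class itself are hypothetical, sketched here only
# to show the shape of the doubly supervised data.
INFERENCE_LABELS = {"entailment", "contradiction", "neutral"}
UNEXPECTEDNESS_LABELS = {"small", "medium", "large"}

@dataclass
class SentencePair:
    premise: str
    hypothesis: str
    inference: str       # one of INFERENCE_LABELS
    unexpectedness: str  # one of UNEXPECTEDNESS_LABELS

    def __post_init__(self):
        assert self.inference in INFERENCE_LABELS
        assert self.unexpectedness in UNEXPECTEDNESS_LABELS

pair = SentencePair(
    premise="A soccer game with multiple males playing",
    hypothesis="Some men are playing a sport",
    inference="entailment",
    unexpectedness="small",
)
```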
The building of the library (dataset) for our fine-tuning is a particularly delicate task. The literary texts we need, in order to put together a homogeneous and consistent corpus for successful queries, must be at least partially, if not totally, oriented towards a narrative open to unexpectedness or, in any case, somehow distant from a classical (Note 19), traditional way of storytelling.
Furthermore, we have decided to include poetic texts and dramas as well. The former subgroup is a selection of lyrical and epic poems whose main characteristic is to surprise readers through unexpected relationships and metaphors, or unforeseen events. The latter subset has been chosen among quite "original" dramatists' pieces that renew, more or less, the classical theatrical traditions.
Almost one hundred thousand pages of this databank should be enough to yield interesting findings, in our opinion. Obviously, we shall extend the corpus in question if necessary.

Results
We intend to measure the weights of the two indexes <inference, unexpectedness> using samples. In more detail, the sequence of experiments is as follows: A. Using a pre-trained model from an open-source library (e.g., GPT-2 or XLNet from the Hugging Face framework), we extract N samples (e.g., 1000) of prompt-generated pairs of sentences <premise, hypothesis> (e.g., they are always asking you to do them a big favor -> because you are a very handsome guy).
a. Submitting the sample to a supervised evaluation of the weight level for each index <τ-inference, τ-unexpectedness>, taking into account three levels: small (≈ 0), medium (≈ 0.5), large (≈ 1).
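Step A and its supervised evaluation a. can be sketched as a pipeline skeleton; the generator and the rating function below are stubs standing in for the real model (e.g., GPT-2) and for the human judges, so every returned value is hypothetical.

```python
import random

def generate_hypothesis(premise):
    """Stub standing in for a real generator (e.g., GPT-2 via the
    Hugging Face framework); it returns a canned continuation so the
    shape of the pipeline can be shown without a trained model."""
    return "because you are a very handsome guy"

def rate_pair(premise, hypothesis):
    """Stub for the supervised rating step: human judges would assign
    the two index levels here. The returned values are hypothetical."""
    return ("small", "large")

def evaluate_sample(premises, n, seed=0):
    """Draw n prompts, generate a hypothesis for each, and attach the
    supervised <tau-inference, tau-unexpectedness> levels."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        p = rng.choice(premises)
        h = generate_hypothesis(p)
        rows.append((p, h) + rate_pair(p, h))
    return rows

rows = evaluate_sample(
    ["they are always asking you to do them a big favor"], n=3)
```

Swapping the stubs for real components leaves the surrounding bookkeeping unchanged.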

Discussion
So far, we have been trying to measure the coefficients of inference and unexpectedness: a maximum of the first coefficient, say 0.9, implies a minimal presence of the second, and vice versa. So we have two extreme ends of a spectrum along which the two coefficients find a reciprocal distribution.
This is not the place to draw a historical diagram of western narrative grounded upon the average equilibrium between the two coefficients, age after age. However, we can say that, for example, the so-called Baroque period had a particular orientation towards the meraviglia (anything that is met with curiosity or wonder, or is unusual, odd, rare, etc.), and this statement, with its clarifications and details, is essentially basic knowledge. In post-modernist narrative we may find something similar, though obviously not identical. We are seeing an increase in the desire for surprise and for a kind of "functional inconsequentiality"; we use the term functional in the practical meaning: in short, when a tale or a novel functions, i.e., works. We may exemplify with Pynchon's or DeLillo's novels, with Carver's and Salinger's stories, with Foster Wallace's inventions or, within the cinematographic field, with the Coen brothers' movies, or with Lynch's masterworks. In fact, an artist such as David Lynch has accustomed his viewers, in the TV series Twin Peaks, to narrative possibilities which are genuinely unexpectedness-oriented, and he has managed to familiarize the average audience with what, on the face of it, appears to be oddity or absurdity.
Obviously, the balancing act between the two coefficients proposed in this work has been a constant over the centuries. Nevertheless, a possible objection could be raised: is it true that all readers/viewers always aspire to something unexpected? The answer, we think, is not at all certain.
Often, in fact, the audience's pleasure grows from the solace given by a verisimilar cause-effect order (remember Aristotle). Then, regardless of historical epochs and the tastes of the times, the equilibrium/disequilibrium between inference and unexpectedness cannot be subjected to a deterministic law. We can say, to conclude provisionally, that periods of "high classicistic orientation" privilege the entailment coefficient, whereas "baroque-oriented" ages (let us say) prefer to enhance the unexpectedness coefficient.
Our research stands in the line of recent studies on neural-network applications for classifying humanities and art works. A computer scientist who collaborates with museums and galleries to classify artwork from the point of view of creativity is Ahmed Elgammal (Note 22). Conversely, on the side of the history of literature and of literary movements, we do not notice an analogous improvement. Our preliminary work aims to bridge this gap.
the geometry) or to take the legs off the table, if they are detachable. Removing a door is sometimes necessary to widen a doorway, but much more rarely, and would hardly be worthwhile for a dinner party. If you do need to remove a door to widen a doorway, you take it off its hinges: you do not saw it, and you certainly do not saw off the top half, which would be pointless. Finally, a "