Sensing Earthquake Disaster Information: A Named Entity Recognition Approach using Twitter Collaborative Data

In recent years, online social networks have received considerable attention in spatial modelling, given the critical information that can be extracted from them about events in real time; one of the most pressing applications concerns natural disasters such as earthquakes. Although data with embedded geographic information provided by GPS can sometimes be retrieved from these social networks, in many cases this is not possible. An alternative is to reconstruct specific locations using probabilistic language models, more specifically those based on Named Entity Recognition (NER), which extracts names from a user's description of an event occurring in a specific place (e.g., a collapsed building on a specific avenue). In this work, we present a methodology to use Twitter as a social sensor system for disasters. The methodology scores NER locations with a kernel density estimation function for different subtopics originating from a natural disaster and maps them into a geographic space. The proposed methodology is evaluated with tweets related to the 2017 earthquake in Mexico.


Introduction
Although there are sensors that can detect various natural disasters such as earthquakes in advance (Mexico City's Alarm System senses quakes coming from southern states) [1], the consequences in urban areas remain severe and hard to quantify. The relief efforts after such disasters can inspire the participation of civil society, which, together with rescue teams and civil protection and security institutions, can help to inform, rescue, and provide restoration as quickly as possible. The active participation of civil society in the aftermath can not only strengthen the society itself but also improve the trust in information obtained from non-traditional sources [2,3]. For example, thanks to the widespread availability of wireless communication networks and mobile technologies, the dissemination of information, such as news and reports, has served as a vital way to contact aid services and make appropriate decisions in a timely and more flexible manner [4]. Similarly, other data from personal mobile devices have played an important role in disaster relief efforts. As evidence, in the 2010 earthquake in Haiti, instant messages sent by civilians in different locations facilitated the reporting of trapped people and the provision of emergency medical assistance and basic needs such as food, water, and shelter [5]. Personal mobile phones can also be used by survivors to send messages about their current status to relatives and the community. These messages can then be forwarded to rescue teams. Figure 1 illustrates an example of an earthquake survivor using their mobile phone to communicate with their relatives.
Thanks to the ubiquitous availability of mobile Internet [7], personal mobile devices can be linked to online social networks, providing synchronisation between applications (e.g., between Twitter and Facebook), which allows behaviours and activities to be communicated in real time across various platforms [6]. Since online social networks are applications based on the Internet protocol and enable the creation and exchange of user-generated content [8], information extracted from these networks can include temporal and spatial data related to different events [9]. This information can then be represented as geo-referenced patterns that establish relationships between the publication and the geographical and temporal characteristics of the publishing entity. As an example, a post (tweet) on Twitter with temporal and spatial data is shown in Figure 2. The behaviour of users on social networks and their posted information can be considered a social sensor system, since the data generated on a large scale closely resemble those acquired by traditional sensor systems [10,11]. Below are some characteristics that reinforce the notion of social networks as sensor systems [12,13]:

• Sensor operation: Sensors acquire data related to various events as a result of real-time observations. Mobile phones are equipped with various sensors, such as cameras, that can also acquire data related to various events [11,14,15].

• Processing of sensed data: When information is processed, metadata such as geographic information can be included if navigation systems, e.g., GPS, are available. If the information is then published on a social network, users can comment on it and retransmit it along with the geographic information [16].
Twitter, one of the most popular social networks, can be used as an alternative communication channel for exchanging information related to disasters such as fires, floods, hurricanes, and earthquakes. In this work, we propose a methodology to use Twitter as a social sensor system for disasters by exploiting the data resulting from observations and experiences posted by the users. Within this context, the collected data are expected to be in textual form (sentences of at most 140 characters, referred to as tweets) and to carry well-defined geographic information (spatial attributes). However, it has been shown that only a very small percentage of users use navigation systems such as GPS to reference their information [17]. Some studies propose estimating the event location by exploiting features available in Twitter, such as searching for updates related to the event within a known geographical region [18]. Other studies propose to detect and group textual patterns to approximate the location of the event in a given time period [19], or to use geographic coders to approximate the geographic coordinates by accounting for the frequency of occurrence of the event in a well-defined spatial region [20]. In this work, our proposed methodology is based on the fact that the location of the event can be implicitly described in a tweet. A probabilistic language model known as Named Entity Recognition (NER) is then applied [21,22] to extract entities from the text that can be used to calculate spatial features, such as latitude and longitude coordinates. To detect these entities, the non-grammatical nature of tweets in Spanish as well as informal abbreviations and lexicons (for example, describing a locality using a hashtag) are accounted for. Temporal features are computed by clustering similar tweets related to an event [23]. The proposed methodology employs a Kernel Density Estimation (KDE) algorithm, a well-known statistical technique, to detect and monitor the hotspot dynamics of detected events [24].

Related Work
The detection of events related to natural disasters is the subject of recent research in several fields, including sensors, natural language processing, and machine and statistical learning. The main aim is to detect, monitor, and disseminate information in a timely manner with some degree of trust. This section describes some works related to this growing field of research. As described in [25], Twitter has recently been used as a platform for disseminating diverse information related to natural disasters such as wildfires [26], floods [25], hurricanes [27], and earthquakes [28], achieving situational awareness. Table 1 lists works that employ data extracted from Twitter and other online social networks to sense events related to natural disasters.

Table 1. Works that employ data extracted from Twitter and other online social networks to sense events related to natural disasters.

Description: To detect a target event, this work classifies tweets based on features such as keywords, the number of words, and their context. It then estimates a probabilistic spatio-temporal model to find the centre and the trajectory of the target event. To this end, each Twitter user is assumed to be a sensor, and Kalman filtering and particle filtering are applied for location estimation, as in ubiquitous/pervasive computing. The authors claim that a 96% probability of correctly detecting an earthquake can be achieved by monitoring tweets [29].

Title: Public health implications of social media use during natural disasters, environmental disasters, and other environmental concerns
Description: This work analyses how social media can be used to disseminate information, predict data, and provide early warnings within the context of environmental awareness and health promotion. The work also analyses how social media can be used as an indicator of public participation during environmental concerns. The authors found evidence supporting social media as a useful surveillance tool during natural disasters, environmental disasters, and other environmental concerns. Public health officials can use social media to gain insight into public opinions and perceptions. Social media allows public health workers and emergency responders to act more quickly and efficiently during crises [30].

Title: Real-Time Crisis Mapping of Natural Disasters Using Social Media
Description: In this work, the authors propose a social media crisis mapping platform for natural disasters that uses statistical analysis with geoparsed real-time tweet data streams matched to locations from gazetteers, street maps, and volunteered geographic information (VGI). Geoparsing results are benchmarked against existing published work and evaluated across multilingual datasets. Two case studies compare five-day tweet crisis maps to official post-event impact assessments from the US National Geospatial Agency (NGA), compiled from verified satellite and aerial imagery sources [31].

Title: Tweedr: Mining twitter to inform disaster response
Description: In this paper, the authors introduce Tweedr, a Twitter-mining tool that extracts actionable information for disaster relief workers during natural disasters. The Tweedr pipeline consists of three main parts: classification, clustering, and extraction. In the classification phase, classification methods (sLDA, SVM, and logistic regression) are used to identify tweets reporting damage or casualties. In the clustering phase, filters are used to merge similar tweets. Finally, in the extraction phase, tokens and phrases are extracted that report specific information about different classes of infrastructure damage, types of damage, and casualties [32].

Title: A linguistically-driven approach to cross-event damage assessment of natural disasters from social media messages
Description: In this work, the authors focus on the analysis of Italian social media messages for disaster management. Their aim is to detect messages conveying critical information for the damage assessment task. A main novelty of this study is the focus on out-of-domain and cross-event damage detection, and the investigation of the most relevant tweet-derived features for these tasks. They conduct different experiments using a wide set of linguistic features qualifying the lexical and grammatical structure of a text, as well as ad hoc features specifically extracted for this task [33].

Title: Combining machine learning topic models and spatio-temporal analysis of social media data for disaster footprint and damage assessment
Description: The authors propose a crisis mapping system that analyses the textual content of disaster reports from a twofold perspective. A damage detection component employs an SVM classifier to detect mentions of damage among emergency reports. A novel geoparsing technique is proposed and used to perform message geolocation. They report on a case study to show how the information extracted through damage detection and message geolocation can be combined to produce accurate crisis maps. The crisis maps detect both highly and lightly damaged areas, thus opening up the possibility to prioritise rescue efforts where they are most needed [34].

Title: From social sensor data to collective human behaviour patterns: Analysing and visualising spatio-temporal dynamics in urban environments
Description: This paper presents an approach to analyse social media posts to assess the footprint of and the damage caused by natural disasters by combining machine-learning techniques (Latent Dirichlet Allocation) for semantic information extraction with spatial and temporal analysis (local spatial autocorrelation) for hot spot detection. The results demonstrate that earthquake footprints can be reliably and accurately identified. The results also show that a number of relevant semantic topics can be automatically identified without a priori knowledge, revealing clearly differing temporal and spatial signatures. Furthermore, a damage map that indicates where significant losses have occurred is also presented [35].

Proposed Methodology
This section describes the proposed methodology to use Twitter as a social sensor system for natural disasters, as depicted in Figure 3. First, data are gathered with entity classes related to locations, people, organisations, and other denominations (that do not belong to any rigid designator) [36]. In the training stage, the entity classes are analysed at the word level and then expanded to the sentence level using an n-gram model. Features are then extracted to train a classifier using supervised learning. In the sensing stage, tweets are collected through the Twitter API and grouped into one of three topics T ∈ {disaster areas, missing individuals, shelters}. Features are then extracted from the tweets to predict entity classes. A request is sent to Google Maps [37] to obtain the spatial features (latitude and longitude coordinates) of the predicted entities. The set of spatial features is then scored by the occurrences of entities in the same spatial region. Temporal features correspond to the number of mentions of the entity in the same spatial region during a time period. These features are used by KDE to estimate the entity class location. Estimated locations are finally plotted on a Basemap surface.

Data Gathering
Data gathering is achieved by querying well-identified terms in Spanish related to natural disasters. Here, we scrape tweets using the Twitter API (Search and Stream). The purpose of this step is to create a training dataset, $X_t$, comprising named entities and classes, $y$ (which are described in the next section), and to create clusters of tweets collected over a time period. The querying terms used are well-known urban spaces of a city [38], as well as descriptive terms related to natural disasters. Twitter characteristics, such as retweets and mentions, can contribute to the propagation strength of each tweet and can thus serve as a temporal feature [39]. A sample query, q, sent to the Twitter API Search to collect tweets to construct $X_t$ is shown next, with its terms given in English: q = #earthquake, help, tlalpan avenue

Word Representations: Word Level Analysis and the n-gram Model
Word-level analysis is a widely used language model [40,41] to describe words given a certain context. Each word is mapped to a new representation based on its co-occurrences and semantics. In this work, a word-level classifier for the Spanish language, known as Polyglot, is adapted to perform the entity extraction for each tweet. Table 2 tabulates the set of classes, $y$, used in this work. Based on the word-level classification proposed in [42], let us define a tweet by $T_i^n = \{w_{i-n}, \ldots, w_i, \ldots, w_{i+n}\}$ as a phrase centred on the word $w_i$ with a window size of $2n + 1$, and a target function $F : T \to y$. First, the tweet $T_i^n$ is mapped to its embedding representation $\phi_i^n$. Next, a discriminative model $\psi_y$ is trained on a set $X_t$ of word embeddings to score a tag $y$ using a neural network with one hidden layer of size $h$. As a one-vs-all classifier, the function $F$ is constructed and penalised by Equation (2), where $t_i$ is the correct class of the word $w_i$ and $m$ is the size of the training set. According to [43], an n-gram model is a suitable implementation to capture patterns of local co-occurrence within sentences comprising contiguous words [44]. Unlike the word-level analysis, an n-gram model allows us to construct a word vocabulary $w \in T$ in which $n - 1$ independent words coexist with probability of the form $P(w_n \mid w_i^{n-1})$, where $w_n$ is the word occurrence given the $w_i^{n-1}$ word particles in a tweet. Table 3 exemplifies the n-gram model with the corresponding entity classes. A training set $X_t$ of size $m$ is designed with a pre-classification for both word-level and sentence-level (n-gram) representations. Each element of $X_t$ is denoted as a document $d_i \in X_t$, with terms $v$ defined by a single occurrence of a term such that $v_k \in d_i$, mapped to a vocabulary $V = \{v_1, \ldots, v_k\}$ of size $k$.
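The two representations described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the padding token, and the toy tokenised tweets are hypothetical, and the conditional probability is estimated for the bigram case only (one preceding word) by simple maximum likelihood.

```python
from collections import Counter


def context_window(tokens, i, n, pad="<PAD>"):
    """Return T_i^n: the 2n+1 tokens centred on tokens[i], padded at the edges."""
    padded = [pad] * n + list(tokens) + [pad] * n
    return padded[i : i + 2 * n + 1]


def bigram_probabilities(sentences):
    """Maximum-likelihood estimate of P(w_n | w_{n-1}) from tokenised sentences."""
    context_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        for prev, curr in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            pair_counts[(prev, curr)] += 1
    return {pair: c / context_counts[pair[0]] for pair, c in pair_counts.items()}


# Hypothetical tokenised tweets, used only to exercise the functions.
tweets = [
    ["edificio", "colapsado", "en", "avenida", "tlalpan"],
    ["ayuda", "en", "avenida", "tlalpan"],
    ["albergue", "en", "parque", "mexico"],
]

window = context_window(tweets[0], i=3, n=1)  # window centred on "avenida"
probs = bigram_probabilities(tweets)
print(window)
print(probs[("en", "avenida")])
```

Longer contexts would follow the same counting pattern with tuples of preceding words as keys, at the cost of sparser counts on short texts such as tweets.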

Feature Extraction
The representation of $X_t$ can be done by vector space modelling, so that the importance of each term $v_j$ contained in a document $d_i$ can be approximated. The algorithm used in this work is based on Term Frequency-Inverse Document Frequency (TF-IDF), which normalises the term frequencies over the set of documents $X_t$ to assign weights with respect to their importance and relevance. Based on the vocabulary $V$ defined with $k$ terms and the training set with $m$ documents, the algorithm estimates the number of occurrences $\forall d_i \in X_t$, producing a set $D \in \mathbb{R}^{m \times k}$, in which each element is the frequency of a vocabulary term in a document $d_i$, such that $D = \{f(d_{1,1}), \ldots, f(d_{m,k})\}$. The term frequency, $tf_{i,k}$, is defined by a vocabulary term $v_k$ in a document $d_i$ as described in Equation (4). For each $v_i \in V$, the number of documents in which the term occurs at least once is computed. This is known as the document frequency, $df_i$, of the term $v_i$. To retrieve the inverse document frequency, $idf_i$, for $m$ documents, Equation (4) is used. Each document-term pair is then represented by its weight $W_{i,k} = tf_{i,k} \times idf_i$. The resulting set is a weighted feature array $W = \{W_{1,1}, \ldots, W_{m,k}\}$.
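The weighting scheme above can be sketched in a few lines of Python. Since the exact equations are not reproduced in this text, the sketch assumes one common variant: the term frequency is the raw count divided by the document length, and the inverse document frequency is $\log(m / df)$ without smoothing. The function name and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(documents):
    """Return (vocabulary, W), where W[i][k] = tf[i][k] * idf[k].

    Assumed variant: tf = count / document length, idf = log(m / df).
    Documents are assumed to be non-empty lists of tokens.
    """
    vocabulary = sorted({term for doc in documents for term in doc})
    m = len(documents)
    # Document frequency: number of documents containing the term at least once.
    df = Counter(term for doc in documents for term in set(doc))
    idf = {term: math.log(m / df[term]) for term in vocabulary}
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append([(counts[term] / len(doc)) * idf[term] for term in vocabulary])
    return vocabulary, weights


# Hypothetical tokenised documents, used only to exercise the function.
docs = [["sismo", "edificio", "edificio"], ["sismo", "albergue"]]
vocabulary, weights = tfidf_weights(docs)
print(vocabulary)
print(weights)
```

With this unsmoothed variant, a term that appears in every document (here "sismo") receives a weight of zero, which is why many libraries add smoothing terms to the logarithm.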

Supervised Learning
Supervised learning has been extensively used in research on NER [45][46][47][48] as an alternative to classic adaptations of probabilistic language models, such as conditional random fields and hidden Markov models. In this section, we present an algorithm based on random forests to create a classification model to address the sensor design. Random forests is an algorithm that assembles decision trees and combines weak learners to build a more robust classification model [49,50]. In this way, the generalisation error is reduced and overfitting to the training data is avoided. In this work, the algorithm for the classification of entities of the set of weighted documents, $X_{tW}$, is described in Algorithm 1. Algorithm 1 uses a statistical technique called bootstrapping, which aims to reduce the variance of random forest learning given $k$ independent weighted documents in $X_{tW}$, each with variance $\sigma^2_{tW}$.
Bootstrapping considers the variance of the mean $\bar{X}_{tW}$ of each document $d_i$, which is given by $\sigma^2_{tW} / n$, where $n$ is the number of weights. To reduce this variance and increase the prediction accuracy of the entity classes, $X_{tW}$ is split into several bootstrapped training sets $B \subset X_{tW}$. A decision tree is trained on the $b$-th bootstrapped training set in order to attain a predictive function $f^{*b}(d_i)$. The resulting predictions are averaged, as specified by Equation (5). Finally, to predict an entity class, a majority voting classifier combines the predicted classes of each individual tree, $C$, and selects the entity class $\hat{y}$ that receives the most votes.
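For reference, the averaging and voting steps can be written in the usual bagging form. The expressions below are a reconstruction under that assumption and are not necessarily the authors' exact equations: $k$ is taken to be the number of trees, as in the loop of Algorithm 1, and $C_b(d_i)$, the class predicted by the $b$-th tree, is notation introduced here for illustration.

```latex
\hat{f}_{\mathrm{avg}}(d_i) = \frac{1}{k} \sum_{b=1}^{k} f^{*b}(d_i),
\qquad
\hat{y} = \arg\max_{y} \sum_{b=1}^{k} \mathbb{1}\left[ C_b(d_i) = y \right]
```

Averaging $k$ predictors that were trained on different bootstrap samples reduces the variance of the combined prediction, which is the motivation given for bootstrapping in the text.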

Spatial and Temporal Information
To find the spatial information, the entities, $\hat{y}$, classified by the random forest model are used to query the Google API for locations, obtaining as a result an address described as a string of characters and geographic coordinates. The entities are first compared by querying the space model representation, $\hat{Y}$, of entities using the cosine similarity measure [51] to find similar entities (given some threshold $\theta \in [0, 1]$). If an entity is similar to an already found entity, it is assigned the same spatial information. Figure 4 shows the proposed spatial information extraction. For the temporal features, a window of time is taken from the first round of tweets collected. When a tweet is denoted as initial, the corresponding spatial information is extracted along with the corresponding time-stamp (date of creation), $ts$. All subsequent retweets of the initial tweet are considered to have a threshold $\theta = 1$ (total similarity) and are assigned a creation time-stamp equal to the difference between the time-stamp of the parent tweet and their actual time-stamp, i.e., $(ts_{parent} - ts_{children})$, $\{\forall ts_{parent} \in \hat{y}_{parent}, \forall ts_{children} \in \hat{y}_{children}\}$. When a tweet is similar to others according to the threshold $\theta$, the date of creation is then calculated based on that of the similar tweets. Tweets are then re-ordered by date of creation, from the oldest to the most recent, i.e., $\mathrm{sort}(\hat{y}_1 \to ts_1, \ldots, \hat{y}_n \to ts_n)$.
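The similarity-based reuse of spatial information and the temporal re-ordering can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: entities are compared as simple bag-of-words vectors, `geocode` is a hypothetical callable standing in for the location request (no real API is called), the threshold value 0.8 is arbitrary, and the time-stamp adjustment for retweets is left out.

```python
import math
from collections import Counter
from datetime import datetime


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_locations(entities, geocode, theta=0.8):
    """Reuse coordinates of a similar, already located entity; otherwise geocode.

    entities: list of (text, timestamp) pairs.
    Returns (text, coordinates, timestamp) tuples sorted from oldest to newest.
    """
    located = []  # (vector, coordinates) of entities already resolved
    result = []
    for text, timestamp in entities:
        vec = Counter(text.lower().split())
        coords = next(
            (c for v, c in located if cosine_similarity(vec, v) >= theta), None
        )
        if coords is None:
            coords = geocode(text)
            located.append((vec, coords))
        result.append((text, coords, timestamp))
    return sorted(result, key=lambda item: item[2])


calls = []


def fake_geocode(text):
    # Hypothetical stand-in for a geocoding request; (0.0, 0.0) is a placeholder.
    calls.append(text)
    return (19.4162205, -99.1705947) if "obregón" in text.lower() else (0.0, 0.0)


entities = [
    ("Av Álvaro Obregón 286", datetime(2017, 9, 20, 10, 0)),
    ("av álvaro obregón 286", datetime(2017, 9, 19, 13, 46)),
    ("Parque México", datetime(2017, 9, 19, 15, 0)),
]
ordered = assign_locations(entities, fake_geocode)
print(len(calls), [text for text, _, _ in ordered])
```

Caching coordinates for similar entities in this way keeps the number of external requests proportional to the number of distinct places rather than the number of tweets.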

Kernel Density Estimation
Several works [52-55] have employed KDE [56], a statistical method to visualise hotspots from estimated spatial points distributed on a two-dimensional spatial probability density function. In this work, this method is applied to spatial information from predicted entity classes. Based on Basemap [57], we can, for example, identify geographic areas prone to be dangerous, detect areas with the most missing individuals, and locate shelters. In order to quantify the incoming entities of a certain class from the random forest model at a spatial point $g$, Equation (7) is computed as suggested in [55], where $h$ is the bandwidth, $P$ is the number of spatial features of the topic $T \in \{$disaster areas, missing individuals, shelters$\}$ within the window of time, $w_i$ is the entity of spatial information from the temporal window, $K$ is a density function, $\|\cdot\|_2$ is the vector norm, and $g_i$ is a location related to the disaster.
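A weighted kernel density score of this kind can be sketched in Python. Because the equation itself is not reproduced in this text, the sketch assumes a common form, $\hat{f}(g) = \frac{1}{P h^2} \sum_{i=1}^{P} w_i \, K\left(\|g - g_i\|_2 / h\right)$, with a Gaussian kernel for $K$, numeric weights $w_i$ (for example, mention counts), and coordinates treated as planar. All function names and sample values are hypothetical.

```python
import math


def gaussian_kernel(u):
    """Two-dimensional standard Gaussian kernel evaluated at scaled distance u."""
    return math.exp(-0.5 * u * u) / (2.0 * math.pi)


def kde_score(g, points, weights, h):
    """Weighted kernel density estimate at location g = (x, y).

    Assumed form: (1 / (P * h^2)) * sum_i w_i * K(||g - g_i||_2 / h),
    where P is the number of points and h is the bandwidth.
    """
    p = len(points)
    acc = 0.0
    for (x, y), w in zip(points, weights):
        dist = math.hypot(g[0] - x, g[1] - y)  # Euclidean norm ||g - g_i||_2
        acc += w * gaussian_kernel(dist / h)
    return acc / (p * h * h)


# Hypothetical locations (planar units) and mention counts for one topic.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
weights = [3, 2, 2, 1]
near_cluster = kde_score((0.05, 0.05), points, weights, h=0.5)
far_away = kde_score((10.0, 10.0), points, weights, h=0.5)
print(near_cluster > far_away)
```

Evaluating `kde_score` on a regular grid of locations would give the surface from which hotspots are drawn; the bandwidth `h` controls how far each report spreads its influence.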

Sensing Information: A Case study of the 2017 Mexico Earthquake
On 19 September 2017, at 1:14 p.m., an earthquake of magnitude 7.1 on the Richter scale with its epicentre in Axochiapan, Morelos, a state adjacent to Mexico City, impacted the urban infrastructure of the city and surrounding areas. Although the alarm system is efficient when epicentres occur on the Pacific Ocean coast, in the case of this natural disaster, the sensors were so close to the city that the evacuations took place 11 seconds after the earthquake started. Twitter users reported events related to the disaster zones in real time. In addition to army and navy personnel, a large number of individuals took to the streets to offer humanitarian aid to people who were in risk areas. Days later, a number of official and collaborative shelters were set up in churches, parks, schools, and other places to offer help to the victims. Figure 5a-c show sample tweets collected over a three-day observation window by the proposed methodology.
We compare our proposed classifier based on random forest learning with the work presented in [58]. For our classifier, we use a 3-day observation window. In [58], the extraction of localities embedded in social network messages relies on lexical, morpho-syntactic, and lexical-expansion features, which are associated with three classes: damage, non-damage, and non-relevant. For the events that occurred during the earthquake of 20 May 2012 in Northern Italy, that work trains a Support Vector Machine (SVM) on word embeddings (EMB) of the messages. Table 4 tabulates the performance metrics of the class damage from [58] and the entity classes LOC, O, ORG, and PER of this work (note that the entity class LOC is the most comparable with the class damage from [58]). Figure 6a-c present the hotspots obtained from the KDE of spatial features collected over a span of three days, as described in Section 5.4. These results are compared with those obtained using official data from the 19 September earthquake, which are publicly available as Mapeo Verificado19s [59]. The following categories are shown in the figures using official data (note that there are no official reports of missing people, so it was not possible to compare this category):

• Official Damages: includes collapsed buildings, major risks, minor risks, and wall collapses;

Conclusions
In this work, a methodology to use Twitter as a social sensor system was proposed. The methodology extracts features from tweets by analysing entities at the word and sentence levels (n-gram model). It then uses a random forest classifier, which combines different small decision trees and uses majority voting to classify entities extracted from newly seen tweets. Classified entities are then used to extract spatial features using the Google Maps API, which allows the entity's location to be reconstructed by assigning it geographic coordinates. Using KDE, the compendium of spatial features collected over a temporal window is also used to visualise, on a two-dimensional map, the hotspots associated with important information related to the disaster, such as highly affected areas or areas with the most missing individuals. The methodology is evaluated using tweets posted during the earthquake of 19 September 2017 in Mexico. A 3-day observation window is used to collect spatial and temporal features. Classification results are compared with those of a similar work that uses a Support Vector Machine with word embeddings to classify damaged areas after an earthquake. Results showed that the random forest classifier accurately classified the entities over the 3-day observation window.

Figure 1. Through a messaging system, an earthquake survivor describes their situation inside a collapsed building. The messages translated to English are: My love. The roof fell. We are trapped. My love I love you. I love you so much. We are on the 4th floor. Near the emergency stair. There are 4 of us. My love are you ok?

Figure 2. A tweet providing the location (spatial information) of a collapsed building along with a time stamp (temporal information) one day after the 2017 earthquake in Mexico City.

Figure 3. Proposed methodology to use Twitter as a social sensor system for disasters.

Algorithm 1:
Input: Training Samples
function RANDOMFORESTTRAINING($X_{tW}$)
    for i = 1 : k do
        B ← a bootstrap sample of $X_{tW}$ with size n
        C ← a grown decision tree from the bootstrapped sample B
        foreach node in C do
            Select f(d) features from C
            Split the node by maximising the information gain of f(d)
    Aggregate the prediction by each tree to assign the class label by majority voting
end function

Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 15 August 2018 doi:10.20944/preprints201808.0269.v1
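The steps of Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: it assumes numeric feature rows (such as TF-IDF weights), threshold splits of the form feature ≤ value, a random subset of roughly the square root of the number of features at each node, and a fixed maximum depth. All function names, parameter defaults, and the toy data are hypothetical.

```python
import math
import random
from collections import Counter


def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def best_split(rows, labels, features):
    """Return (gain, feature, threshold) maximising information gain, or None."""
    base, best = entropy(labels), None
    for f in features:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            child = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            if best is None or base - child > best[0]:
                best = (base - child, f, threshold)
    return best


def grow_tree(rows, labels, n_features, rng, depth=0, max_depth=5):
    """Grow a tree as nested (feature, threshold, left, right) tuples; leaves are labels."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or depth == max_depth:
        return majority
    features = rng.sample(range(len(rows[0])), n_features)  # random feature subset
    split = best_split(rows, labels, features)
    if split is None:
        return majority
    _, f, threshold = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= threshold]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > threshold]
    return (
        f,
        threshold,
        grow_tree([r for r, _ in left], [y for _, y in left], n_features, rng, depth + 1, max_depth),
        grow_tree([r for r, _ in right], [y for _, y in right], n_features, rng, depth + 1, max_depth),
    )


def predict_tree(tree, row):
    while isinstance(tree, tuple):
        f, threshold, left, right = tree
        tree = left if row[f] <= threshold else right
    return tree


def random_forest_training(rows, labels, k=25, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    n_features = max(1, int(math.sqrt(len(rows[0]))))
    forest = []
    for _ in range(k):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample B of size n
        forest.append(grow_tree([rows[i] for i in idx], [labels[i] for i in idx], n_features, rng))
    return forest


def predict(forest, row):
    """Aggregate the per-tree predictions by majority voting."""
    return Counter(predict_tree(tree, row) for tree in forest).most_common(1)[0][0]


# Hypothetical weighted feature rows and entity classes.
rows = [[0.9, 0.1], [0.8, 0.0], [0.7, 0.2], [0.1, 0.9], [0.0, 0.8], [0.2, 0.7]]
labels = ["LOC", "LOC", "LOC", "O", "O", "O"]
forest = random_forest_training(rows, labels)
print(predict(forest, [0.9, 0.0]), predict(forest, [0.0, 0.9]))
```

In practice an established library implementation would normally be preferred; the sketch only makes the bootstrap, node-splitting, and voting steps explicit.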

Figure 5. The event starts at 1:46 p.m., almost half an hour after the earthquake. The localised entity corresponds to the street of Av. Álvaro Obregón, number 286, with geographic coordinates 19.4162205, −99.1705947. The other classified entities have the same similarity and are ordered temporally until the last report at 4:22 p.m. of the third observation day. (a) The event starts, and users start reporting that a person is trapped in a collapsed building; (b) a day later, users continue reporting that a person is in the rubble, and information is already disseminated in a retweet; (c) on the third day, the event ends with the victim reported as rescued.

Figure 6. Estimations using KDE with data collected over a 3-day window. (a) The hotspots of estimated spatial features related to damages and collapses are shown and compared with official reports. (b) The hotspots of estimated spatial features related to official and collaborative shelters are shown and compared with official reports. (c) Although there are no official reports of missing persons, the hotspots of estimated spatial features related to this category are plotted.

Table 2. Entity classes for extraction and classification.

Table 3. N-gram model examples with their corresponding entity classes.

Table 4. Performance metrics for entity classification.