Review on Geolocation Prediction in Social Media Data


Geolocation information from social media data has great potential for various analyses, such as traffic analysis, location-based sentiment analysis, and disaster detection and management. However, the limited amount of geo-tagged data available as a basis for geolocation analysis remains a challenge in this research field. This paper reviews geolocation prediction approaches for social media data in which geo-tagged data is limited. It groups these approaches into three categories: Content-based Geolocation Prediction, User-profiling-based Geolocation Prediction, and Topic-based Geolocation Prediction. The review also finds that geolocation information is used in two types of analysis, called Event-based Location Type and Distribution Location Type. Suggested geolocation prediction approaches for each type of analysis are also discussed.


Geolocation information from social media data has great potential for various analyses [1], [2], [3], such as traffic analysis [4], disaster detection and management [5], location-based sentiment analysis [6], trending local topics [7], demographic analysis [8], targeted advertising [9], and tourism analysis [10].

Geolocation information from social media is generally obtained from geo-tagged data. Geo-tagged data is data, in the form of text, images, or video, that is associated with a geographical location. The problem is that geo-tagged data from social media is still scarce [4], [7], [11]. According to [12], [13], only about 0.87% to 3% of the data carries geo-tag information. This problem encourages researchers to predict geolocation information from non-geo-tagged social media data [1], [14].

Various studies on geolocation prediction [2], [7], [14], [15] have been conducted to address the limited amount of geo-tagged data from social media. These studies have proposed a range of approaches for predicting geolocation. Each approach has its own characteristics and its own way of predicting geolocation information for use in various analyses. Therefore, this paper reviews these approaches and analyses so that it can recommend the most suitable geolocation prediction approach for each kind of analysis.


Geolocation prediction is conducted by finding location information in text data and then converting it into geographical coordinates [2]. Various approaches to this task are discussed as follows.

A. Named Entity Recognition Approach

Named Entity Recognition (NER) is part of the information extraction task, one of the important tasks in the field of Natural Language Processing (NLP). NER handles the task of recognizing entities in a text input. Entity types that NER can identify include People, Organization, and Location.

NER can be used to predict geolocation by considering the entities whose type is associated with a location. Research [2] mentioned that at least four entity types can be associated with a location, namely “Location”, “Facility”, “Organization”, and “Company”.

NER has so far been used in several geolocation prediction studies. Research [2] used NER to ascertain the type of location entity in the text input of social media users. Research [13] used NER to recognize location and time entities in the sentences of the text input. Research [12] stated that NER could be used to extract valuable geographic information from text data.

Geolocation prediction studies that use NER follow almost the same workflow, differing only in the text input and the NER dataset used. A dataset often used when applying NER is DBpedia, a database containing structured information from Wikipedia. Entity recognition is conducted by giving the text input to the NER classifier; Figure 1 [16] illustrates this process.

Fig. 1. Pipeline Architecture for an Information Extraction Process [16]

  • Sentence Segmentation is a normalization process that splits the text input into separate sentences [17].
  • Tokenization is the stage that separates a sentence into individual words [16].
  • Part-of-Speech Tagging (POS Tagger) is the stage that labels each word with its word class, such as noun, adjective, or preposition [17]. Example: go/VB (verb), politic/NN (noun).
  • Entity Recognition (ER) is the process of finding and labeling the entity type of words, such as People, Places, or Organizations [18], [19].
  • Relation Recognition is the process of finding relationships between entities [16].

NER’s ability to discover entity types provides promising results for location prediction. The output of the NER process can be narrowed to the words whose entity type is associated with a location [2].
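The pipeline stages and the location filter described in this subsection can be sketched in a few lines of Python. This is only a toy illustration, not the method of any cited study: the dictionary `TOY_ENTITIES` is a hypothetical stand-in for a trained NER classifier, the function names are invented for this example, and the POS tagging and relation recognition stages are omitted for brevity.

```python
import re

# Hypothetical toy gazetteer: surface form -> entity type.
# A real system would use a trained NER classifier instead of a lookup table.
TOY_ENTITIES = {
    "jakarta": "Location",
    "gelora bung karno stadium": "Facility",
    "persija": "Organization",
    "budi": "Person",
}

# Entity types that research [2] associates with a location.
LOCATION_TYPES = {"Location", "Facility", "Organization", "Company"}


def segment_sentences(text):
    """Split text into sentences after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Split a sentence into lowercase alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def recognize_entities(tokens, max_len=4):
    """Greedy longest-match lookup of token n-grams in the toy gazetteer."""
    entities, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in TOY_ENTITIES:
                entities.append((phrase, TOY_ENTITIES[phrase]))
                i += n
                break
        else:
            i += 1  # no entity starts at this token
    return entities


def location_candidates(text):
    """Run the toy pipeline and keep only location-related entities."""
    found = []
    for sentence in segment_sentences(text):
        for phrase, etype in recognize_entities(tokenize(sentence)):
            if etype in LOCATION_TYPES:
                found.append((phrase, etype))
    return found
```

For the input "Budi watched Persija at Gelora Bung Karno Stadium. Traffic in Jakarta was heavy!", `location_candidates` keeps the Organization, Facility, and Location entities and drops the Person entity, which mirrors the filtering step attributed to [2].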

B. Location Indicative Word Approach

Location Indicative Word (LIW) is a location extraction approach that uses a list of location names. LIW works by searching for words that can indicate the name of a location. Research [14] used an initial dataset of city names and words that could indicate a location. This dataset is divided into several labels: (1) local words, which are used specifically in a particular area to indicate a location, for example “Jakal”, which refers to “Jalan Kaliurang”; and (2) semi-local words, which are used more commonly across several areas, for example “Chinatown”, which refers to a kind of location found in many cities around the world.

Research [13] used generative Multinomial Naïve Bayes to apply the LIW approach. This method was chosen because it can still work well with limited training data [14]. The LIW process simply works by calculating, for each word in the test data, the probability that it indicates a location.
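To make the word-probability idea concrete, the following is a minimal from-scratch sketch of a Multinomial Naïve Bayes classifier that assigns a city to a tokenized message. It is an illustration under stated assumptions, not the implementation used in the cited studies: the training messages in `TRAIN`, the function names, and the choice of Laplace smoothing with `alpha=1.0` are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """docs: list of (tokens, city). Returns the counts needed for prediction."""
    city_docs = Counter()               # number of training messages per city
    word_counts = defaultdict(Counter)  # word frequencies per city
    vocab = set()
    for tokens, city in docs:
        city_docs[city] += 1
        word_counts[city].update(tokens)
        vocab.update(tokens)
    return city_docs, word_counts, vocab


def predict_nb(model, tokens, alpha=1.0):
    """Return the city with the highest log posterior (Laplace smoothing)."""
    city_docs, word_counts, vocab = model
    total_docs = sum(city_docs.values())
    best_city, best_score = None, -math.inf
    for city in sorted(city_docs):
        score = math.log(city_docs[city] / total_docs)  # log prior
        total_words = sum(word_counts[city].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore words never seen in training
            likelihood = (word_counts[city][tok] + alpha) / (
                total_words + alpha * len(vocab)
            )
            score += math.log(likelihood)
        if score > best_score:
            best_city, best_score = city, score
    return best_city


# Hypothetical toy training data: tokenized messages labeled with a city.
TRAIN = [
    (["macet", "di", "jakal"], "Yogyakarta"),
    (["kopi", "malioboro"], "Yogyakarta"),
    (["macet", "di", "sudirman"], "Jakarta"),
    (["banjir", "kemang"], "Jakarta"),
]
model = train_nb(TRAIN)
```

In this toy setup, a local word such as "jakal" appears only in messages labeled Yogyakarta, so it pulls the prediction toward that city, while a common word such as "di" contributes equally to both cities and has no effect on the decision.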

The LIW approach was developed further in research [1]. This research added the textual features #Hashtag and @Mention as words that can also indicate the name of a location. Using Multinomial Naïve Bayes, research [1] obtained better results than research [14], which used LIW only.
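As a rough illustration of how hashtags and mentions can be kept as separate textual features, the sketch below splits a message into plain words, #hashtags, and @mentions. The function name `extract_features` and the regular expression are assumptions made for this example and are not taken from research [1].

```python
import re


def extract_features(text):
    """Split a message into plain words, #hashtags, and @mentions."""
    words, hashtags, mentions = [], [], []
    # Match an optional leading '#' or '@' followed by word characters.
    for tok in re.findall(r"[#@]?\w+", text.lower()):
        if tok.startswith("#"):
            hashtags.append(tok)
        elif tok.startswith("@"):
            mentions.append(tok)
        else:
            words.append(tok)
    return {"words": words, "hashtags": hashtags, "mentions": mentions}
```

Keeping the `#` and `@` prefixes on the tokens lets a classifier treat "#jogja" and "jogja" as different features, which is one possible design choice; merging them is equally plausible.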

One advantage of the LIW approach is that it can recognize various expressions that refer to a location, including expressions understood only in the local area [14]. However, the LIW approach is highly dependent on the quality and completeness of the dataset. For cases where the dataset is limited, LIW could potentially be developed further using semi-supervised learning techniques.

C. User Location Profile Approach

On social media such as Twitter, users can enter their home location in their profile [2], [4]. This field is generally filled with the name of the city the user comes from [20]. Location information from a user’s social media profile can therefore indicate the user’s home location [11]. However, in some cases, users write a location name that does not exist [2]. Location information obtained from user profiles should therefore be checked before it is used further [12].

The User Location Profile approach is conducted by collecting profile information from the target user. On social media such as Twitter, this can be done using the Twitter API [7], [21]. The acquired home location is then processed using NER to verify the entity type of the data. Only words associated with a location type are retrieved as location information for the target user [2], [4], [21]. Entity types that can relate to a location include “Location”, which states the location explicitly; “Facility”, which can sometimes refer to a location, such as “Stadium”; and “Organization” and “Company”, which can be entities such as the name of a team whose location can also be mapped.
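The checking step can be sketched as a lookup of the free-text profile field against a list of known places. Note that this sketch swaps in a simple gazetteer lookup in place of the NER verification described above, and that it does not call the Twitter API. The names `GAZETTEER`, `ALIASES`, and `resolve_profile_location` are hypothetical, and the coordinates are approximate values included only for illustration.

```python
import re

# Hypothetical gazetteer: city name -> approximate (latitude, longitude).
GAZETTEER = {
    "jakarta": (-6.2088, 106.8456),
    "yogyakarta": (-7.7956, 110.3695),
}

# Assumed informal spellings mapped to gazetteer names.
ALIASES = {"jogja": "yogyakarta", "dki jakarta": "jakarta"}


def resolve_profile_location(raw):
    """Return (city, (lat, lon)) for a profile string, or None if unverified."""
    if not raw:
        return None
    # Profile fields often look like "City, Country", so check each part.
    for part in re.split(r"[,/|]", raw.lower()):
        name = re.sub(r"[^a-z ]", " ", part)  # drop digits and punctuation
        name = " ".join(name.split())         # collapse repeated spaces
        name = ALIASES.get(name, name)
        if name in GAZETTEER:
            return name, GAZETTEER[name]
    return None
```

A profile value such as "Jogja, Indonesia" resolves to Yogyakarta with its coordinates, while a value that matches no known place returns `None` and would be discarded rather than used as a home location.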
