Methodological guidelines to investigate how gender and migrants are represented on social media – Insights from D4.1

Work Package 4 – Exclusion: Platformization of Media Representations aims to carry out an in-depth, first-hand analysis of media representations of two main issues: immigration and gender.

Before the analysis began, the research questions and a shared methodology were defined by the Fundació per a la Universitat Oberta de Catalunya, leader of the WP, in D4.1 Methodological Guidelines, A Framework and Methodological Protocol for Work Package 4 – Analysing the Europeanisation and Platformization of Media Representations.

Here are some insights by Jim Ingebretsen Carlson, researcher at Open University of Catalonia.


The main aim of work package 4 is to analyse how the two topics of gender and migration are represented on social media. The methodological guidelines had to outline the whole process of how the work would be conducted in the work package. In short, the guidelines consisted of the following steps:

  1. Define the research questions.
  2. Develop a theoretical framework of media representations for the two topics from which the most relevant dimensions of representation could be extracted.
  3. Retrieve and filter social media posts.
  4. Manually code a small sample of posts from each of the 10 countries analysed.
  5. Train machine learning models based on the manually coded posts to automatically code a large sample of posts from each country.
  6. Quantitatively analyse all manually and automatically coded posts.

In this way, we combined a solid theoretical foundation of media representations with data mining and statistical techniques for retrieving, coding, and analysing social media data. Moreover, the guidelines were developed through a joint effort of several consortium partners, especially FUOC (Barcelona), IULM (Rome) and UGent (Ghent).

The main research questions of the work package are the following:

  • RQ1: Are there similar debates about im/migration and gender across Europe – can we find hints of a ‘European public sphere’ – or is coverage dominated by the national perspective?
  • RQ2: Are there similar debates about im/migration and gender across Europe when the perspective is European compared to when it is not?
  • RQ3: How are representations of Europe in relation to gender and im/migration affected by new modes of consumption and production?

While RQ1 would be answered by comparing media representations across the 10 countries analysed, RQ2 and RQ3 would be answered by studying both within- and between-country differences in media representations. To answer the research questions, a framework for selecting the most relevant dimensions of media representations had to be developed.

Since we wanted the framework of media representations to have a solid theoretical foundation, it was developed based on scientific literature relating to the topics of gender and im/migration. Furthermore, to limit the workload of the manual coding as much as possible, we selected only the dimensions of media representations that we believed would be most relevant for social media data. The work singled out four dimensions common to the two topics, plus three topic-specific dimensions for each topic. The common dimensions of media representations were Law, Culture, People, and Values. The additional dimensions were Identity, New social movements, and Public sphere for gender, and Territory, Institutions, and Interactions & dialogue for im/migration.

To retrieve social media data, we used CrowdTangle to get Facebook posts and the Twitter API v2 to download tweets. Both tools select posts based on keywords supplied by the user. Consequently, the better the keywords relate to the social media posts you want to analyse, the more relevant data you will get. Therefore, we extracted the keywords from the theoretical frameworks of media representations to the largest extent possible. This generated 44 keywords for im/migration and 51 keywords for gender, which were translated into each of the 10 languages by the partners of the consortium. Using these keywords together with language and geolocation filters, the data was downloaded for each of the 10 countries.
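As a rough illustration of how keywords, language and country can be combined into a single search, the sketch below builds a query string in the syntax of the Twitter API v2 search endpoints, using the `lang:` and `place_country:` operators. The function name `build_query` and the example keywords are hypothetical, and the deliverable does not state that the queries were built this way. It only constructs the string and does not call the API or CrowdTangle.

```python
def build_query(keywords, lang, country):
    """Combine keywords with OR and add language and country filters.

    Multi-word keywords are wrapped in double quotes so that they are
    matched as exact phrases. This is an illustrative helper, not the
    project's actual retrieval code.
    """
    terms = [f'"{kw}"' if " " in kw else kw for kw in keywords]
    return f'({" OR ".join(terms)}) lang:{lang} place_country:{country}'


# Example with two keywords mentioned in the text, for Spanish-language
# posts geolocated in Spain.
query = build_query(["gender equality", "identity"], lang="es", country="ES")
print(query)
```

Query length limits apply to the API, so a list of 44 or 51 keywords would probably need to be split across several queries.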

Many downloaded posts tend to be unrelated to the topics you want to analyse even though relevant keywords are used. Therefore, we created a filtering method to retain the most relevant posts. The filtering was done by calculating a relevance score for each post. A post got a higher relevance score if it:

  1. Contained more keywords.
  2. Contained more non-generic keywords.
  3. Had a higher sentiment score.

The second point acknowledges the fact that some keywords are better than others. For example, keywords such as “inclusion” and “identity” are fairly generic and contributed less to the relevance score than more specific keywords like “gender violence” and “gender equality” (for gender posts). Moreover, we also included a previously calculated sentiment score in order to favour posts that expressed stronger positive or negative sentiment. The relevance score was validated through manual coding.
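The three criteria above can be sketched as a simple additive score. This is a minimal illustration under stated assumptions, not the formula used in the deliverable: the keyword weights, the `sentiment_weight` parameter, and the use of the absolute value of the sentiment score are all hypothetical choices made for the example.

```python
# Hypothetical weights: generic keywords contribute less than specific ones.
KEYWORD_WEIGHTS = {
    "gender violence": 1.0,
    "gender equality": 1.0,
    "inclusion": 0.5,
    "identity": 0.5,
}


def relevance_score(text, sentiment, sentiment_weight=0.5):
    """Score a post by the keywords it contains and its sentiment strength.

    `sentiment` is assumed to be a precomputed value in [-1, 1]; its
    absolute value is used so that strongly positive and strongly
    negative posts both score higher than neutral ones.
    """
    lowered = text.lower()
    keyword_part = sum(
        weight for keyword, weight in KEYWORD_WEIGHTS.items() if keyword in lowered
    )
    return keyword_part + sentiment_weight * abs(sentiment)


def most_relevant(posts, top_n):
    """Return the top_n (text, sentiment) pairs ranked by relevance score."""
    ranked = sorted(posts, key=lambda p: relevance_score(p[0], p[1]), reverse=True)
    return ranked[:top_n]
```

With these weights, a neutral post mentioning only “identity” scores 0.5, while a neutral post mentioning both “gender violence” and “identity” scores 1.5, so the second would be kept first.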

Since we wanted to analyse a sample that was as large and representative as possible, we aimed to code posts automatically using machine learning models. Therefore, each partner had to manually code a small set of social media posts, from which the machine learning models would learn how to code automatically. This in turn required the operationalisation of the theoretical concepts of media representations into a codebook.
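To make the idea of learning from manually coded posts concrete, here is a small sketch of a multinomial Naive Bayes text classifier written with the Python standard library only. It is a stand-in chosen for brevity: the text above does not say which models were trained, and the example posts are invented. The labels “Law” and “Culture” are two of the common dimensions listed earlier.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def train(coded_posts):
    """Count label and word frequencies from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in coded_posts:
        label_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocabulary.add(token)
    return label_counts, word_counts, vocabulary


def predict(model, text):
    """Return the most probable label, using add-one (Laplace) smoothing."""
    label_counts, word_counts, vocabulary = model
    total_posts = sum(label_counts.values())
    best_label, best_logp = None, -math.inf
    for label, n_posts in label_counts.items():
        logp = math.log(n_posts / total_posts)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # words never seen in training carry no information
            count = word_counts[label][token]
            logp += math.log((count + 1) / (total_words + len(vocabulary)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


# Invented examples standing in for a manually coded sample.
coded = [
    ("new law on asylum passed", "Law"),
    ("court ruling on equality law", "Law"),
    ("festival celebrates migrant culture", "Culture"),
    ("music and food traditions", "Culture"),
]
model = train(coded)
print(predict(model, "parliament debates asylum law"))
```

In practice the manually coded sample would be far larger, and part of it would be held out to check how well the automatic coding agrees with the human coders.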

Finally, the methodological guidelines included the general steps for pre-processing of the data and the training of the machine learning models as well as how the resulting data would be quantitatively analysed.

Download the Deliverable 4.1 “Methodological Guidelines”