John Palmer, Universitat Pompeu Fabra (UPF)
Ramona Ottow, Universitat Pompeu Fabra (UPF)
Frederic Bartumeus, Centre d’Estudis Avançats de Blanes (CEAB-CSIC) & CREAF
José J. Ramasco (coordinator), Instituto de Física Interdisciplinar y Sistemas Complejos (IFISC-CSIC)
Frederic Bartumeus (coordinator), Centre d’Estudis Avançats de Blanes (CEAB-CSIC) & CREAF
Alvaro López García, Instituto de Física de Cantabria (IFCA-CSIC)
Diego Ramiro Fariñas, Instituto de Economía, Geografía y Demografía (IEDG-CSIC)
Sandro Meloni, Instituto de Física Interdisciplinar y Sistemas Complejos (IFISC-CSIC)
David Alonso, Centre d’Estudis Avançats de Blanes (CEAB-CSIC)
John Palmer (external PI), Universitat Pompeu Fabra (UPF)
This report provides results from the first three waves of the Distancia-Covid Survey launched on 14 May 2020 under the CSIC-funded project “Impacto de las medidas de distanciamiento social sobre la expansión de la epidemia de Covid-19 en España.” It relies on the survey responses received from the launch date through 10 January 2021. This period encompasses three “waves” during which the survey was disseminated through social media and other channels. As described further below, Survey Wave 1 ran from 14 May 2020 through 10 June 2020, Survey Wave 2 ran from 24 July 2020 through 31 August 2020, and Survey Wave 3 ran from 14 December 2020 through 10 January 2021. (Note that the Survey Waves should not be confused with the waves of the pandemic.)
The vast majority of the responses received during Survey Wave 1 came during the time in which Spain still had social distancing measures in effect but was transitioning away from the extensive restrictions on mobility and social contacts that had been put into place with the state of alarm decreed on 14 March 2020. The state of alarm lasted until 21 June 2020 and Spanish territories were moving, at varying rates, through the three phases of the de-escalation process during the Wave 1 period analyzed here. All of the responses in Survey Wave 2 were received when most restrictions had been lifted and there was no longer a state of alarm in effect. In addition, Survey Wave 2 ends on 31 August in order to coincide with the end of the traditional summer vacation period and avoid overlapping with the September transition back to work and school. Survey Wave 3 brackets Spain’s winter holiday period, starting on 14 December, when schools were still open and most people were still working, overlapping all of the school holiday period, and ending on 10 January, after which most schools were open again and people were back at work.
The survey was designed by the Distancia-Covid team in order to better understand changing patterns of human mobility and social contacts in Spain in the context of the Covid-19 pandemic. Many of the questions draw on the approach taken by the POLYMOD study (Mossong et al. 2008; Prem, Cook, and Jit 2017), and were developed in coordination with researchers in other countries working on similar surveys related to social mixing (Del Fava et al. 2020; Feehan and Mahmud 2020; Perrotta et al. 2020).
The survey was distributed in Spanish, Catalan, Galician, Basque, and English using Kobo Toolbox. Respondents accessed the survey at https://distancia-covid.csic.es/encuesta and it remains available at present at that URL. Respondents are able to access the survey questions only if they first provide informed consent.
The sampling design was non-random, based entirely on people self-selecting into the respondent pool by connecting to the survey URL online. The survey URL was distributed through press releases, Twitter, WhatsApp, and other channels by members of the project team and institutional press offices, and it appears to have propagated through digital networks reasonably well, reaching all provinces in Spain and a relatively wide segment of the population (see further below).
As of 11 January 2021 there were 10127 valid submissions, 4402 in Survey Wave 1, 2560 in Survey Wave 2, and 3165 in Survey Wave 3. Initial data cleaning was done to improve the interpretability of variable names and generate additional variables calculated from the original ones. Among other things, an imputed usual postal code variable was created based on the two postal code questions in the survey, which asked respondents to list their current and usual postal codes. The imputed variable takes the value of the usual postal code when this has been provided. When it has not been provided it takes the value of the current postal code on the assumption that these are the same in these cases. In addition, province variables were created based on the first two digits of the postal code responses.
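The imputation rule and the province derivation can be sketched as follows. This is a minimal illustration in Python rather than the project's own cleaning code; the function names, the record layout, and the five-digit validation are assumptions introduced here, and the postal codes are examples only.

```python
from typing import Optional


def impute_usual_postal_code(usual: Optional[str], current: Optional[str]) -> Optional[str]:
    """Return the usual postal code if given, otherwise fall back to the current one."""
    for code in (usual, current):
        if code is not None and code.strip():
            return code.strip()
    return None


def province_code(postal_code: Optional[str]) -> Optional[str]:
    """Return the first two digits of a postal code, which identify the province.

    Assumption: only well-formed five-digit codes are accepted; anything else
    is treated as missing rather than repaired.
    """
    if postal_code is None:
        return None
    code = postal_code.strip()
    if len(code) != 5 or not code.isdigit():
        return None
    return code[:2]


# Hypothetical responses covering the three cases described in the text.
responses = [
    {"usual": "08003", "current": "17300"},  # both given: the usual code is kept
    {"usual": None, "current": "28012"},     # usual missing: current code is used
    {"usual": "", "current": None},          # neither given: stays missing
]
for r in responses:
    r["imputed"] = impute_usual_postal_code(r["usual"], r["current"])
    r["province"] = province_code(r["imputed"])
```

Treating malformed codes as missing is a conservative choice for the sketch; real data might instead call for repairs such as restoring a dropped leading zero.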
This section provides descriptive statistics of the survey submissions received to date, with distinctions made between the three waves as appropriate. Throughout the text and plots, “NA” is used to denote missing data due to respondents declining to answer certain questions on the survey. It should be noted that these statistics are not necessarily representative of the population given the non-random sampling design. Population estimates are now being made using multilevel regression with poststratification, as described in the next section.
Most survey submissions were made soon after the survey was released and promoted in each wave. Figure 1 shows the submission time pattern on a histogram with the data aggregated in 1-hour bins. As can be seen, there were several sub-waves of submissions within each of the three main Survey Waves. There is also a clear daily cycle of submissions, which drop off at night (as one would expect); this can be seen by clicking on the plot to zoom in.
Based on the imputed usual postal code variable, survey respondents appear to have had their usual places of residence distributed across Spain, with at least one respondent in each province. (This mostly also corresponded to their current places of residence, although 1108 respondents listed different current and usual postal codes, and of these, 690 are in different provinces.)
In absolute terms, most respondents reported their usual places of residence in Madrid or Barcelona, as shown in Figure 2. Relative to the province residential populations (taken from the padrón), the greatest sampling fraction is from Girona, followed by Toledo, Bizkaia, Barcelona, and Castellón, as shown in Figure 3.
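To make the distinction between absolute counts and sampling fractions concrete, the sketch below computes respondents per 100,000 residents for two hypothetical provinces. The province labels, counts, populations, and the per-100,000 scaling are all assumptions for illustration and are not taken from the survey or the municipal register.

```python
def sampling_fraction_per_100k(respondents: int, population: int) -> float:
    """Respondents per 100,000 residents of a province."""
    return 100_000 * respondents / population


# Hypothetical (respondents, resident population) pairs.
provinces = {
    "large_province": (500, 5_000_000),
    "small_province": (60, 300_000),
}

fractions = {
    name: sampling_fraction_per_100k(n, pop) for name, (n, pop) in provinces.items()
}

# Rank provinces by sampling fraction, highest first.
ranked = sorted(fractions, key=fractions.get, reverse=True)
```

Here the larger province contributes far more respondents in absolute terms, yet the smaller one has the higher sampling fraction, which is the same kind of reversal seen between Figures 2 and 3.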
The survey respondents also represented a broad cross-section of ages, ranging from 18 (the requirement for participation) up to 92. The median age of respondents was 46 in Survey Wave 1, 47 in Survey Wave 2, and 43 in Survey Wave 3. The middle 50% of ages of respondents was 36 to 57 in Survey Wave 1, 38 to 56 in Survey Wave 2, and 36 to 51 in Survey Wave 3. The survey’s gender question provided binary response options of male or female in order to match the phrasing of Spain’s labor force survey (Encuesta de población activa), which is being used for poststratification. There were both male and female respondents in nearly every age group in all waves. In Survey Wave 1, 62% of respondents identified themselves as female, 35% as male, and 2% declined to respond to the gender question. In Survey Wave 2, 65% of respondents identified themselves as female, 34% as male, and 1% declined to respond to the gender question. In Survey Wave 3, 67% of respondents identified themselves as female, 32% as male, and 1% declined to respond to the gender question. Figure 4 provides a population pyramid of male and female respondents. Note that Survey Wave 2 had relatively fewer respondents below 38 years old than the other two Survey Waves.
The survey asked respondents to report their highest level of education, divided into four levels. Submissions were received from people reporting all four levels, with most reporting undergraduate or graduate level. Figure 5 shows reported education levels by gender. A relatively large proportion of the respondents had high education levels.
The survey also asked respondents to report their “occupation or type of work” as well as “the activity of the establishment in which [they] work.” The distribution of responses to the occupation question is shown in Figure 6, with the labels on the x-axis corresponding to the following response options (abbreviated version for chart in italics, followed by full response option shown in English version to respondents):
The categories in Figure 6 are ordered according to their relative prevalence during Wave 1. We observe a relatively large proportion of scientists in all three Survey Waves, likely related to the dissemination strategy of the survey, which mainly relied on academic social networks. We can also see similar distributions of the other occupational categories, with the exception of the “other” category, which dropped from Wave 1 to Wave 2 (in proportion to the others), and the non-response “NA” category, which rose. Presumably this reflects some combination of (1) variation by respondents in the decision of whether to choose “other” or simply not to respond when they did not see a category fitting their occupation, (2) an increase in unemployment and job instability leading respondents to see themselves less attached to a particular occupational category, (3) survey fatigue or loss of motivation to respond to all questions due to the length and changing nature of the pandemic, and (4) changes in the networks through which the survey propagated.
The distribution of responses to the work activity question is shown in Figure 7, with the labels on the x-axis corresponding to the following response options (abbreviated version for chart in italics, followed by full response option shown in English version to respondents):
As with the occupation plot, Figure 7 shows the activity categories on the x-axis in the order of their prevalence in the Wave 1 responses. In this case, we see the highest proportion of responses coming in the Public category, again very likely reflecting the networks through which the survey was distributed. We see somewhat less stability in the relative proportion of other categories across the waves and we again see a large increase in the non-response “NA” category, which may be explained in the same way as in the occupation case above.
Most respondents reported that they were born in Spain (94%). Of those who reported being born outside Spain, the top 5 countries of birth were Argentina (12% of non-natives), Italy (8% of non-natives), Germany (6% of non-natives), the UK (4% of non-natives), and France (6% of non-natives).
For Survey Wave 1, the survey asked respondents, “Are you continuing to work during the lockdown?” The distribution of responses is summarized in Figure 8 with the labels on the x-axis corresponding to the following response options (abbreviated version for chart in italics, followed by full response option shown in English version to respondents):
For Survey Waves 2 and 3, the question was modified to reflect the ending of the “lockdown” and also to better account for the variety of working/non-working situations. The question in these waves was, “What is your current employment status?” The distribution of responses is summarized in Figure 9 with the labels on the x-axis corresponding to the following response options (abbreviated version for chart in italics, followed by full response option shown in English version to respondents):
Nearly all respondents reported owning or living with someone who owns an information and communication technology (ICT) device, with personal computers being most prevalent, followed by smart phones and then tablets (Figure 10). Respondents mostly reported multiple devices. Most (>60%) of respondents also reported being constantly connected to the internet, and most of the rest reported being connected several times per day (Figure 11).
As one way of assessing levels of mobility, respondents were asked about the trips they had taken out of their dwellings during the past week. Figure 12 shows the distribution of number of trips reported. The maximum value listed in the responses was 10,000, but this was omitted from the analysis as obviously erroneous. Several people reported 50 or more trips (including two reporting 100) and these were retained, as they reflect plausible behavioral patterns (e.g., delivery work). In Survey Wave 1, the mean and median number of reported trips were both 5. For Survey Wave 2, the mean was 9 and the median was 7. For Survey Wave 3, the mean was 8 and the median was 7. Overall, 80% reported having gone out between 1 and 7 times during Survey Wave 1, 61% reported this during Survey Wave 2, and 65% reported this during Survey Wave 3. The mode of the distribution (most frequent value) in all three Survey Waves was 7 trips, presumably because many people actually tend to go out once per day (even during the confinement period) or because 7 is simply the rough estimate many people use to answer the question. Reports of more than 7 trips accounted for 15% of responses in Survey Wave 1, 37% in Survey Wave 2, and 33% in Survey Wave 3.
Respondents were also asked about the farthest distance they had traveled on any of these trips as well as all of their destinations and safety precautions. The distributions of responses are shown in Figures 13, 14, and 15.
In Survey Wave 1, nearly 80% of respondents reported having traveled less than 10 km from their home and nearly 40% reported having traveled less than 1 km. The most frequent destination was stores, followed by public spaces and workplaces. Nearly all respondents reported taking some sort of safety precaution, with masks, social distancing, and handwashing being the most frequent. In Survey Waves 2 and 3, there were proportionally fewer displacements below 1 and 10 km, and proportionally more displacements above 10 km among the respondents. Final destinations in Survey Waves 2 and 3 were more diverse compared to Survey Wave 1, but stores remained the most frequent destination. In terms of safety precautions while traveling during Survey Waves 2 and 3, again masks, social distancing, and handwashing were the most frequent. There was also a decrease in the proportion of respondents reporting use of gloves in Survey Waves 2 and 3 compared to Survey Wave 1.
An important source of information about social mixing comes from the sizes and age structures of people’s households (defined here as the group of people with whom they were residing at the time of the survey submission). Figure 16 shows the number of co-residents reported by each respondent by autonomous community and city. This raw data is very noisy due in part to the non-random sampling design and the small number of respondents from some autonomous communities/cities (particularly, for example, Ceuta and Melilla). (Modeled population estimates are provided in Figure 18.)
Relevant social mixing also occurs outside the home. Respondents were asked to report the number and ages of the people with whom they had contact on the previous day. Following the POLYMOD approach, contacts were defined for respondents as: “EITHER a two-way conversation with three or more words in the physical presence of another person, OR physical skin-to-skin contact (for example a handshake, hug, kiss or contact sports).” The distribution of the reported numbers of contacts is shown in Figure 17. Note the relatively large proportion of respondents reporting 0 contacts in Survey Wave 1 compared to Survey Waves 2 and 3. Although all of the responses in Wave 1 were received at the time of the de-escalation process, this appears to reflect the effect of the extensive restrictions on mobility and social contacts of the previous months. As with all of these descriptive statistics, however, we need to be extremely cautious in making any population inferences directly from the raw data as we know the samples are not representative. (Modeled population estimates are provided below in Figures 20, 21, and 22.)
The project team is now using multilevel regression with poststratification (MRP) (Zhang et al. 2014; Downes et al. 2018; Park, Gelman, and Bafumi 2004) to make population-level estimates from the survey data. Preliminary results are offered here and have already been incorporated into several epidemiological models. We focus here on social mixing patterns because of the obvious relevance to understanding Covid-19 dynamics. We consider in and out of home contacts, distinguishing between co-residents and non-co-residents.
MRP is a statistical method that has the potential to produce reliable population-level estimates from non-representative samples (Downes et al. 2018; Wang et al. 2015; Del Fava et al. 2020). The approach relies on multilevel modeling to first estimate an outcome of interest for different combinations or cells of respondent characteristics. MRP then uses model predictions and poststratification to generate population-level estimates based on knowledge of the relative proportion of each cell in the total population (Downes et al. 2018).
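The poststratification step can be illustrated with a small worked example. The sketch below is not the project's analysis code: the cells, the cell-level estimates (standing in for multilevel model predictions), and the sample and population counts are all hypothetical values chosen so the arithmetic is easy to follow.

```python
# cell -> (model-based estimate, respondents in sample, people in population)
# All values are hypothetical.
cells = {
    ("female", "20-39"): (4.0, 60, 250),
    ("female", "40+"): (2.0, 20, 250),
    ("male", "20-39"): (5.0, 15, 250),
    ("male", "40+"): (3.0, 5, 250),
}


def weighted_mean(estimates_and_weights):
    """Weighted average of (estimate, weight) pairs."""
    total_weight = sum(w for _, w in estimates_and_weights)
    return sum(est * w for est, w in estimates_and_weights) / total_weight


# Weighting by sample composition reproduces the bias of the sample,
# in which young women are heavily over-represented.
sample_weighted = weighted_mean([(est, n) for est, n, _ in cells.values()])

# Poststratification instead weights each cell by its population share.
poststratified = weighted_mean([(est, pop) for est, _, pop in cells.values()])
```

With these numbers the sample-weighted mean is 3.7 while the poststratified mean is 3.5, because the population cells are equally sized whereas the sample is not.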
In our case, key outcomes of interest are (1) the number of co-residents in each household, (2) the probability of having had an out-of-home contact during a given 24-hour period, and (3) the number of such contacts in that period. The respondent characteristics used to create the cells are taken from survey questions that provide information also obtained from Spain’s large, representative labor force survey (Encuesta de población activa), from which the population proportions needed for poststratification are taken.
We use multilevel negative binomial regression models for the mean of the count response variables — both in-home co-residents and out-of-home contacts — conditional on poststratification cells. We use a multilevel logistic regression model to estimate the probability of having any out-of-home contact (again conditional on these cells).
We assume the random variable representing the number of co-residents or out-of-home contacts for each individual \(i\) follows a negative binomial distribution. We further transform the scale of the expectation into non-negative values with a log link and define the multilevel model for the expected number of co-residents or out-of-home contacts using random intercepts for occupation, province of residence, and response date. (The date intercept is included to account for potential temporal autocorrelation arising from the network structure along which the survey was distributed; model predictions are then made for a hypothetical unobserved date within each wave.)
In the co-resident model for all contact ages pooled, fixed effects are included for gender and five-year age group, whereas in the out-of-home contacts model with pooled contact ages, fixed effects are included for education level and five-year age group. Age-specific contact models include random effects for respondent, education, occupation, respondent’s five-year age group, contact’s 10-year age group, gender, the crossed effects of gender, respondent’s age group, and contact’s age group, as well as for province of residence and response date as discussed above.
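In symbols, the pooled count models described above can be written roughly as follows. The notation is an illustrative assumption and is not taken from the fitted code: \(Y_i\) is the count for respondent \(i\), \(\mu_i\) its expectation, \(\phi\) an overdispersion parameter (assuming the parameterization in which the variance is \(\mu_i + \mu_i^2/\phi\)), \(\mathbf{x}_i\) the fixed-effect covariates of the model in question, and \(a\), \(b\), \(c\) random intercepts for occupation, province, and response date, assumed here to be normally distributed.

\[
Y_i \sim \mathrm{NegBin}(\mu_i, \phi), \qquad
\log \mu_i = \mathbf{x}_i^{\top}\boldsymbol{\beta} + a_{\mathrm{occ}[i]} + b_{\mathrm{prov}[i]} + c_{\mathrm{date}[i]},
\]
\[
a_k \sim \mathcal{N}(0, \sigma_a^2), \qquad
b_l \sim \mathcal{N}(0, \sigma_b^2), \qquad
c_t \sim \mathcal{N}(0, \sigma_c^2).
\]

The age-specific contact models would extend the linear predictor with the additional random effects listed above; they are omitted here to keep the expression compact.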
In order to model the probability of any out-of-home contact, we first defined a random variable representing the occurrence of any contact for any individual \(i\), following a Bernoulli distribution with probability \(\pi_i\). We then fit a multilevel logistic regression with random intercepts for occupation, province of residence, and date (as in the count models) and fixed effects of education and five-year age group.
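Written out, a model of this form would look roughly like the following. The symbols other than \(\pi_i\) are introduced here for illustration only: \(Z_i\) indicates whether respondent \(i\) had any out-of-home contact, \(\mathbf{x}_i\) holds the education and age-group covariates, and \(a\), \(b\), \(c\) are random intercepts for occupation, province, and response date.

\[
Z_i \sim \mathrm{Bernoulli}(\pi_i), \qquad
\operatorname{logit}(\pi_i) = \log\frac{\pi_i}{1-\pi_i}
= \mathbf{x}_i^{\top}\boldsymbol{\gamma} + a_{\mathrm{occ}[i]} + b_{\mathrm{prov}[i]} + c_{\mathrm{date}[i]}.
\]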
We fit all models in R (R Core Team 2020) using Stan and the rstanarm package (Stan Development Team 2015, 2016), with the default priors described in the rstanarm 2.21.1 documentation.
After fitting these models, we made population level estimates by sampling from the posterior predictive distributions according to the corresponding cell size in the labor force survey data. As a comparison, we also modeled the co-resident outcome directly from the labor force survey, using the same count model described above.
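One way to picture sampling from the posterior predictive distributions in proportion to cell size is sketched below. This is a simplified illustration under stated assumptions, not the procedure actually implemented in R: the function name, the two cells, their posterior predictive draws, and the population counts are hypothetical, and each cell's share of the pooled draws is simply rounded to the nearest integer.

```python
import random
import statistics


def poststratified_draws(cell_draws, cell_population, total_draws, seed=0):
    """Pool posterior predictive draws, taking from each cell a number of
    draws proportional to that cell's population size."""
    rng = random.Random(seed)
    total_pop = sum(cell_population.values())
    pooled = []
    for cell, draws in cell_draws.items():
        n = round(total_draws * cell_population[cell] / total_pop)
        pooled.extend(rng.choices(draws, k=n))  # resample with replacement
    return pooled


# Hypothetical posterior predictive draws of contact counts for two cells.
cell_draws = {
    "cell_a": [0, 0, 1, 2, 1],   # low-contact cell
    "cell_b": [3, 5, 8, 4, 20],  # high-contact cell
}
# Hypothetical population counts for the same cells.
cell_population = {"cell_a": 750, "cell_b": 250}

pooled = poststratified_draws(cell_draws, cell_population, total_draws=1000)
pooled_mean = statistics.mean(pooled)
pooled_median = statistics.median(pooled)
```

The pooled draws can then be summarized (mean, median, quantiles) as an estimate of the population-level distribution, since each cell contributes in proportion to its population share rather than its share of the sample.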
Starting with the number of co-residents each respondent reported, we estimate a population-level distribution of co-resident counts for people aged 20 and over. This is shown in Figure 18. As a comparison, Figure 19 shows the same estimates based directly on Spain’s labor force survey (Encuesta de población activa) for each quarter during 2019 and 2020. Comparing Figures 18 and 19, we see that the Distancia-Covid survey estimates (using MRP) match very closely with the estimates obtained from the much larger, more representative labor force survey. Looking across the waves of the Distancia-Covid survey in Figure 18, we see very little difference in the distribution of co-residents. Looking at the labor force survey estimates in Figure 19, we see that this pattern appears to have been quite stable over the past two years.
For out-of-home contacts we use the survey responses to estimate the distribution and age-structured contact matrix for the population aged 20 and over.
Since a large number of respondents in Wave 1 reported no out-of-home contacts at all on the previous day, we start by simply estimating the probability of any out-of-home contact. Figure 20 shows the estimated probabilities and the 90% credible intervals for these estimates for each province in each wave. We see a clear increase in all provinces in the probability of having had any out-of-home contact.
Figure 21 shows the estimated distribution of the number of out-of-home contacts for the total population aged 20 and over. The mean number of contacts increases with each Survey Wave, from 3 (Survey Wave 1), to 5 (Survey Wave 2), to 6 (Survey Wave 3). More interesting, however, is how the distribution changes, with greater variability in Survey Waves 1 and 3 than in Survey Wave 2. Survey Wave 3 has the highest variability (the standard deviation is 10, compared with 6 in Survey Waves 1 and 2), with a long upper tail representing people with many contacts. But the distribution of Survey Wave 3 also has a lot of weight on low contact numbers. (Note that the y-axis is in log scale.) Thus, while the mean of this Survey Wave is higher than the others, the median is only 2. In contrast, the median number of contacts estimated in Survey Wave 2 is 4, and for Survey Wave 1 the median is 1. We see similar patterns when we examine these estimated distributions by age and occupation in Figures 22 and 23.
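The gap between mean and median follows from the long upper tail: a few very high values pull the mean up while leaving the median almost untouched. The short sketch below illustrates this with invented contact counts (not survey data or model output), chosen only so that the summary statistics are easy to check by hand.

```python
import statistics

# Hypothetical daily contact counts for ten people: most report few
# contacts, while one reports a very large number.
contacts = [0, 1, 1, 2, 2, 2, 3, 4, 5, 40]

mean_contacts = statistics.mean(contacts)      # pulled up by the single value of 40
median_contacts = statistics.median(contacts)  # unaffected by the size of the tail

# Replacing the extreme value with a smaller one changes the mean
# but leaves the median as it was.
capped = [min(c, 10) for c in contacts]
capped_mean = statistics.mean(capped)
capped_median = statistics.median(capped)
```

In this toy sample the mean is 6 and the median is 2, and capping the largest value at 10 lowers the mean to 3 without moving the median.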
It should be noted that the contact distribution estimates for Survey Wave 3 include some unrealistically high values (the maximum is 1392), which result from the high variability of the responses and the stochastic nature of the model. The plots shown here have the x-axis truncated at 300 to aid visualization, since only a tiny proportion of estimates (0.0003%) exceed this value. Even if we truncate the estimates at this value – or even at the maximum number of out-of-home contacts actually reported on the survey, which is 120 – the mean, median, and standard deviation of the distributions remain the same (when rounded as above; and only slightly different if we include additional digits).
Apart from this technical modeling issue, however, the question of high contact numbers is of great interest. High numbers of out-of-home contacts would be consistent with certain occupations, particularly in the service sector or manufacturing jobs involving large numbers of workers on factory floors. This can be seen in the raw survey responses as well as in the model estimates (Figure 23), and we are currently exploring this question further.
For epidemiological models, an age-specific estimate of total contacts (both co-residents and non-co-residents, in-home and out-of-home) is often most useful. We make such estimates for all Survey Waves by combining the EPA co-resident contact estimates with estimates of non-co-resident contacts drawn from the survey. We include here very rough estimates for the age groups not included in the survey (under 18 years old), which are based on proportionally distributing contact ages from the other age groups across these younger ages. This is a reasonable starting point in the absence of other sources of information about these younger age groups, but the estimates should be treated with caution. In particular, for periods when schools were in session, these age groups surely had higher contacts than estimated here, and this may be best approximated using average classroom sizes in schools.
Figure 24 shows the estimated age-structured total contact matrix for the population in each wave. The x-axis here indicates the 5-year age group of reference for the estimate (“self age group”), while the y-axis indicates the 10-year age groups of the estimated contacts for the reference groups. Cell colors indicate the mean number of contacts each of the respective reference groups is estimated to have had with each of the respective contact age groups on some hypothetical day during the wave period. Hovering the cursor over the cells will also show medians and the central 90% of the contact distributions. These matrices are not symmetrical because population sizes vary by age group.
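The asymmetry can be seen from a simple accounting argument: every contact between two groups is shared by both, so the total number of contacts is the same from either side, but the mean per person depends on the size of the group it is divided by. The sketch below uses two hypothetical age groups of matching width and invented numbers; it is a simplification of the 5-year by 10-year layout in Figure 24, and the function name is introduced here for illustration.

```python
def mean_contacts(total_pair_contacts: float, group_size: int) -> float:
    """Mean daily contacts per group member, given the total number of
    contacts the group takes part in with some other group."""
    return total_pair_contacts / group_size


# Hypothetical numbers, not survey estimates.
pop_young = 1_000      # people in the smaller age group
pop_old = 4_000        # people in the larger age group
pair_contacts = 2_000  # young-old contact pairs on one day

young_with_old = mean_contacts(pair_contacts, pop_young)  # 2.0
old_with_young = mean_contacts(pair_contacts, pop_old)    # 0.5
```

The two means differ by the ratio of the group sizes, while the products `young_with_old * pop_young` and `old_with_young * pop_old` both recover the same total of 2,000 contacts.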
We can observe here high mean numbers of daily contacts on the diagonal (as is also the case for the estimated household contact matrix). That is, people from one age group tend to have contact with people from the same age group. We also observe increases in estimated mean contacts moving from Survey Wave 1 to Survey Wave 2 and then Survey Wave 3. The differences between Survey Waves 1 and 2 are difficult to distinguish from the colors but can be seen from the values shown when the cursor is hovered over each cell. The differences are more evident from the colors in the matrix for Survey Wave 3, with the highest values appearing for contacts between people in their 40s, presumably reflecting a combination of household structure and work activity during this period.
In general, the estimated mean numbers of daily contacts are rather small. It should also be noted (as with the descriptive statistics) that these estimates are based on cross-sectional data that does not incorporate information about variation in the number of contacts each person has over time. Thus, an estimate of 0.5 could reflect variation within individual contact patterns over time, with a person of age X having contact with a person of age Y on average every 2 days. Alternatively, it could reflect variation at the population level, with some people having one or more contacts on a daily basis and others having no contacts on a daily basis. In fact, these estimates surely reflect variation at both levels, but it is not possible to differentiate between them from the available data.
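The two interpretations can be made concrete with a toy example. Both hypothetical populations below (four people observed over four days, with invented values) yield a daily mean of 0.5 contacts, yet they differ entirely in how contacts are distributed across people, and a single-day cross-section cannot tell them apart.

```python
import statistics

# Rows are people, columns are days; 1 means a contact on that day.
# Population A: everyone has a contact every second day (staggered).
within_person = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]
# Population B: half the people have a contact every day, half never do.
between_person = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]


def daily_means(population):
    """Mean contacts across people, computed separately for each day."""
    return [statistics.mean(day) for day in zip(*population)]


def person_means(population):
    """Mean contacts across days, computed separately for each person."""
    return [statistics.mean(person) for person in population]


# What a one-day cross-sectional survey would observe in each population.
day0_a = sorted(person[0] for person in within_person)
day0_b = sorted(person[0] for person in between_person)
```

On any given day both populations show two people with a contact and two without, so only repeated observations of the same individuals would separate the two scenarios.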
The Distancia-Covid Group continues to analyze this data to better understand social mixing patterns across ages, occupations and other collected variables, to build up a network-focused analysis, and to prepare data to feed into a variety of epidemiological models, ranging from agent-based models to classical SEIR compartmental models. The contact estimates shown here are already being used to make a number of epidemiological models more realistic, provide insight into how they may be affected by changing contact networks, and improve predictions about future scenarios. These estimates are available upon request and will soon be placed in an open access repository.
Special thanks to Ane Calvo, Jose A. Costoya, and Manuel Pereira, for translating the survey into Basque and Galician, and to Wiebke Weber, Dennis Feehan, Ayesha Mahmud, Emilio Zagheni, and Jorge Cimentada for suggestions and feedback on the survey design.