A data scientist's tale - ECHR violations data preparation

rstats human rights

Hello dear reader!

This post tells a data scientist’s story about working with (human rights violations) data. It is not about presenting results on human rights violations (some are of course included), but about the experience of preparing data for analysis. It is often said that 80% of a data analyst’s work is data preparation. So this post shows what is usually hidden: the struggle to get a dataset into the right shape for analysis. As Hadley Wickham says, based on Leo Tolstoy’s famous line about happy families: Tidy datasets are all alike, but every messy dataset is messy in its own way.1

I often work with human rights data. One of the most important human rights institutions is the European Court of Human Rights (ECtHR), which rules on individual or state applications alleging violations of the rights set out in the European Convention on Human Rights (ECHR). In the past years the court has stepped up its efforts to provide data on judgments, on human rights violations, in its HUDOC database. So, I made an effort and downloaded data on human rights violations from the Council of Europe HUDOC database, including all Chamber and Grand Chamber judgments. Not many people have analysed those data quantitatively, so this can nicely add to our understanding of human rights violations in Europe.

Data collection - the HUDOC database of judgments

The first challenge was that the database only allows you to download 500 cases at once. There are, however, many thousands of cases. I took advantage of the fact that the download link from the database changes according to the filter you set. I made systematic sub-selections of the database, making sure each stayed below 500 cases, looped through the links to download the data, and put the pieces together later on. For the years 2007 to 2020, I downloaded quarterly data by setting the time period of the judgments. Let’s load the full dataset and look at it. Should be ready for analysis, right? … not really
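The quarterly windows can be sketched roughly as follows. This is only an illustration of the looping idea, not the code I actually ran: `download_all()` and `download_one` are hypothetical names, and the real filtered HUDOC link is deliberately left out, so you have to supply your own function that fetches one window.

```r
# Quarter start dates from Q1 2007 to Q4 2020
starts <- seq(as.Date("2007-01-01"), as.Date("2020-10-01"), by = "quarter")
# Each quarter ends the day before the next quarter starts
ends <- seq(as.Date("2007-04-01"), as.Date("2021-01-01"), by = "quarter") - 1
windows <- data.frame(start = starts, end = ends)

# Hypothetical helper: 'download_one' is a user-supplied function that takes
# a start and an end date and returns a data frame for that window.
download_all <- function(windows, download_one, cap = 500) {
  parts <- lapply(seq_len(nrow(windows)), function(i) {
    res <- download_one(windows$start[i], windows$end[i])
    # A window that reaches the cap was probably cut off by the database
    if (nrow(res) >= cap) {
      warning("Window starting ", windows$start[i], " reached the download cap")
    }
    res
  })
  do.call(rbind, parts)
}

# Stand-in for a real download, just to show the mechanics
fake_download <- function(start, end) {
  data.frame(start = format(start), end = format(end))
}
all_parts <- download_all(windows, fake_download)
```

If a window triggers the warning, the sub-selection needs to be narrowed (for example to months) before the pieces are combined.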

Data preparation

In the code below, I load the compiled dataset of ECtHR judgments between 2007 and 2020, as downloaded from the database.

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.0.5     v dplyr   1.0.3
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.0.5
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(countrycode)
library(knitr)

# put your path into the object 'pth'
dat <- read.csv(paste0(pth, "hudoc_final1-2021-03-28.csv"), stringsAsFactors = FALSE)

head(dat) %>% kable()
| Document.Title | Application.Number | Document.Type | Originating.Body | Date | Conclusion | datetype | year |
|---|---|---|---|---|---|---|---|
| CASE OF ANHEUSER-BUSCH INC. v. PORTUGAL | 73049/01 | HEJUD | Court (Grand Chamber) | 11/01/2007 | No violation of P1-1 | dd/mm/yyyy | 2007 |
| CASE OF SISOJEVA AND OTHERS v. LATVIA | 60654/00 | HEJUD | Court (Grand Chamber) | 15/01/2007 | No violation of Article 8 - Right to respect for private and family life (Article 8-1 - Respect for family life);No violation of Article 34 - Individual applications (Article 34 - Hinder the exercise of the right of petition) | dd/mm/yyyy | 2007 |
| CASE OF TATISHVILI v. RUSSIA | 1509/02 | HEJUD | Court (First Section) | 22/02/2007 | Violation of P4-2;Violation of Art. 6-1;Pecuniary damage - claim dismissed;Non-pecuniary damage - financial award;Costs and expenses partial award - domestic and Convention proceedings | dd/mm/yyyy | 2007 |
| CASE OF TYSIĄC v. POLAND | 5410/03 | HEJUD | Court (Fourth Section) | 20/03/2007 | Preliminary objection dismissed (Article 35-1 - Exhaustion of domestic remedies);No violation of Article 3 - Prohibition of torture (Article 3 - Degrading treatment;Inhuman treatment);Violation of Article 8 - Right to respect for private and family life (Article 8-1 - Respect for private life);Pecuniary damage - claim dismissed;Non-pecuniary damage - award | dd/mm/yyyy | 2007 |
| CASE OF AKPINAR AND ALTUN v. TURKEY | 56760/00 | HEJUD | Court (Second Section) | 27/02/2007 | No violation of Art. 2 (substantive aspect);Violation of Art. 2 (procedural aspect);No violation of Art. 3;Violation of Art. 3;Not necessary to examine Art. 3 (procedural aspect);Pecuniary damage - claim dismissed;Non-pecuniary damage - financial award;Costs and expenses award - Convention proceedings | dd/mm/yyyy | 2007 |
| CASE OF CASTRAVET v. MOLDOVA | 23393/05 | HEJUD | Court (Fourth Section) | 13/03/2007 | Preliminary objection dismissed (non-exhaustion of domestic remedies);Violation of Art. 5-3;Violation of Art. 5-4;Non-pecuniary damage - financial claim;Costs and expenses partial award | dd/mm/yyyy | 2007 |

Oh, it includes a lot of information: document title, application number, document type, originating body, date and conclusion. But there is no direct information on whether a violation occurred or not, or on which article was violated. There is also no proper country column. All this information is provided in a more or less structured way under ‘Conclusion’. Let’s see what we can do about it. First we need to get a sense of the structure of the dataset.

nobs <- nrow(dat)
mn <- min(dat$year, na.rm = TRUE) 
mx <- max(dat$year, na.rm = TRUE)
dupno <- sum(duplicated(dat$Application.Number))
dups <- sum(duplicated(dat))

uapp <- length(unique(dat$Application.Number))

emptyconclusions <- dat %>% dplyr::filter(Conclusion == "") %>% nrow()

dat$Date <- str_sub(dat$Date, 1, 10) %>% as.Date(format = "%d/%m/%Y")
dat$id <- 1:nrow(dat)

# check on duplicated application numbers
dups <- dat %>%
  dplyr::filter(Conclusion != "") %>%
  group_by(Application.Number) %>%
  mutate(nApps = n()) %>%
  ungroup() %>%
  dplyr::filter(nApps > 1) %>%
  arrange(Application.Number)

# we have 
# number of cases with more than one conclusion
morecons <- length(unique(dups$Application.Number))

# summary(dups$nApps)

# they range from 2 to 4
rm(dups)

There are 12282 rows and 11779 unique application numbers in the data. The data are from 2007 to 2020.

The dataset also includes 200 cases without any information in the conclusions. I delete all cases where no conclusions are provided in the database.

Then, we have 385 cases with more than one conclusion. This includes cases that went on to the Grand Chamber. For ease of analysis, I only keep the most recent decisions.

Finally, I’ll save a random selection of cases in a separate file, which I can view and follow through the data transformation process. This is needed because we need quite some data wrangling before the dataset is ready to use for analysis.

dat <- dat %>%
  dplyr::filter(Conclusion != "") # concerns 200 cases, where there is no text in the conclusions

dat <- dat %>%
  group_by(Application.Number) %>%
  mutate(nApps = n(),
         maxDate = max(Date)) %>%
  ungroup() %>%
  arrange(Application.Number, Date) %>% # view()
  dplyr::filter(Date == maxDate) %>%
  dplyr::select(-nApps, -maxDate)

nobs2 <- nrow(dat)
# how many cases left with more than one decision
nobs2-length(unique(dat$Application.Number))
## [1] 1
# follow:
# 24322/02 poland no viol 5 and viol 8
# 69582/12 romania no viol 3 and viol 3
# dat[dat$Application.Number == "24322/02", ]
# and a random number of three other cases sample(dat$Application.Number, 6)

# save physical file for checks:
checks1 <- dat %>% dplyr::filter(Application.Number %in% c("24322/02", "69582/12", "35312/02",
                                         "33210/07;41866/08", "37715/11", "12306/04",
                                         "81277/12", "44753/12"))
# write_csv2(checks1, "violation_checks2.csv")

We are left with 11685 cases.

Now we need some tidying up and preparation of variables, including the date format, and identifying the names of countries. Careful attention is needed for two countries that often appear under different two-letter codes. The United Kingdom is coded as UK or sometimes as GB for Great Britain (I know there is an important difference. British people explained it to me several times, but I always forget it … it does not matter for country-level analysis anyway). Additionally, the official two-letter abbreviation used in the EU for Greece is EL, but GR is sometimes used as well. The database uses GB and GR, but we need to be careful, for example, when matching with Eurostat data (where UK and EL are used).
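As a small sketch of what that care could look like before a merge with Eurostat data: the helper below recodes the two ISO codes to the EU-style codes. The function name `to_eurostat()` is made up for this illustration and is not used anywhere else in this post.

```r
# Hypothetical helper: recode ISO 3166 two-letter codes to the codes used
# in Eurostat data for the United Kingdom and Greece.
to_eurostat <- function(iso2c) {
  out <- iso2c
  out[!is.na(out) & out == "GB"] <- "UK"
  out[!is.na(out) & out == "GR"] <- "EL"
  out
}

codes <- to_eurostat(c("GB", "GR", "AT", NA))
```

All other codes, including missing values, pass through unchanged.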

To have a column that indicates the country covered, I have to create a list of country names and paste it into a string of country names separated by | (OR) to detect which country is covered. This is needed because in some cases there is more than one country. As there is no proper country column, I detect the name from the document title. This requires some checking and string detection. Fortunately, the way countries are named in the titles is very consistent, with only a few exceptions.

dat$Document.Title <- tolower(dat$Document.Title)

COElist <- c("AL", "AD", "AM", "AT", "AZ", "BE", "BA", "BG", "HR", "CY", "CZ", "DK", "EE", "FI",
             "FR", "GE", "DE", "GR", "HU", "IS", "IE", "IT", "LV", "LI", "LT", "LU", "MT", "MD",
             "MC", "ME", "NL", "NO", "PL", "PT", "RO", "RU", "SM", "RS", "SK", "SI", "ES", "SE",
             "CH", "MK", "TR", "UA", "GB")
# yes still including Russia, the biggest human rights violator of the CoE.

COElist2 <- countrycode(COElist, "iso2c", "country.name")
COElist2[COElist2 == "Macedonia, the former Yugoslav Republic of"] <- "Macedonia" 
# actually now called "North Macedonia"
COElist2[COElist2 == "Russian Federation"] <- "Russia"
COElist2[COElist2 == "Moldova, Republic of"] <- "Moldova"
COElist2[COElist2 == "Czechia"] <- "Czech Republic"
COElist2[COElist2 == "Bosnia & Herzegovina"] <- "Bosnia and Herzegovina"

Up to seven countries are covered in one case. In the code below, I assign a separate column per country and then make each of them a separate row in the dataset. There is surely a nicer way to do this, but who cares if it works.

cs <- str_c(COElist2, collapse = "|")

country <- str_extract_all(dat$Document.Title, tolower(cs), simplify = TRUE)

dat$country1 <- country[,1]
dat$country2 <- country[,2]
dat$country3 <- country[,3]
dat$country4 <- country[,4]
dat$country5 <- country[,5]
dat$country6 <- country[,6]
dat$country7 <- country[,7]
# names(dat)
rm(country)

# a few outliers to change
dat$country1[str_detect(dat$Document.Title, "lettonie")] <- "latvia"
dat$country1[str_detect(dat$Document.Title, "italie")] <- "italy"
dat$country1[str_detect(dat$Document.Title, "luxemburg")] <- "luxembourg"
dat$country1[str_detect(dat$Document.Title, "turquie")] <- "turkey"

dat <- dat %>%
  pivot_longer(cols = contains("country"), names_to = "country_no", values_to = "country") %>% 
  select(-country_no) %>%
  dplyr::filter(country != "") %>% # as there would be 7 times each case with empty countries 
  dplyr::filter(Conclusion != "") # deleting those without any information in the conclusions column

# count(dat, country, sort = TRUE) %>% view()

ns2 <- nrow(dat)

dat$iso2c <- countrycode(dat$country, "country.name", "iso2c")

# check if it worked for all observations
# count(dat, country, iso2c) %>% arrange(desc(n)) %>% view()
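For the curious, one candidate for the "nicer way" is to keep the matches as a list column and unnest it, which avoids hard-coding seven columns. This is a sketch on made-up titles with a shortened country pattern, not a drop-in replacement for the code above (the French-language outliers would still need their manual fix).

```r
library(dplyr)
library(tidyr)
library(stringr)

# Toy titles, invented for illustration
toy <- tibble(
  id = 1:3,
  Document.Title = c("case of a v. russia",
                     "case of b and others v. moldova and russia",
                     "case of c v. poland")
)

# Shortened pattern; the real one would be tolower(cs)
pattern <- str_c(c("russia", "moldova", "poland"), collapse = "|")

toy_long <- toy %>%
  # one character vector of matches per title, stored as a list column
  mutate(country = str_extract_all(Document.Title, pattern)) %>%
  # one row per matched country; titles without a match are dropped
  unnest(cols = c(country))
```

The result has one row per case and country, just like the pivoted version, however many countries a title names.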

Text pre-processing - ‘Violation of …’ and ‘No violation of …’

Now, finally, the fun part starts. The conclusions text includes a lot of information, which needs to be extracted and brought into a format that can be analysed. Some typical rows look like the following:

checks1 %>% select(Conclusion) %>% kable()
Conclusion
Violation of Article 6 - Right to a fair trial;Violation of Article 1 of Protocol No. 1 - Protection of property
No violation of Article 5 - Right to liberty and security;Violation of Article 8 - Right to respect for private and family life
Preliminary objection joined to merits and dismissed (non-exhaustion of domestic remedies);Remainder inadmissible;Violation of Art. 5-1;Violation of Art. 5-4;Violation of Art. 5-5;Non-pecuniary damage - award
Violation of Article 6 - Right to a fair trial
Violation of Article 2 - Right to life (Article 2-1 - Effective investigation) (Procedural aspect)
Violation of Article 3 - Prohibition of torture (Article 3 - Degrading treatment;Inhuman treatment) (Substantive aspect);Violation of Article 13 - Right to an effective remedy (Article 13 - Effective remedy)
Violation of Article 3 - Prohibition of torture (Article 3 - Effective investigation) (Procedural aspect);No violation of Article 3 - Prohibition of torture (Article 3 - Degrading treatment;Inhuman treatment) (Substantive aspect)
No violation of Article 2 - Right to life (Article 2-1 - Life) (Substantive aspect);Violation of Article 2 - Right to life (Article 2-1 - Effective investigation) (Procedural aspect)

Okay, the table above looks relatively structured. If a violation occurs, it says “Violation of Article …”, if not, it says “No violation of Article …”. Additionally, there is text on preliminary objections and if non-pecuniary damages were awarded. This is the information I am interested in to do some data analysis on human rights violations.

Everything is separated by semicolons (;). Yeah, great, let’s just split by semicolon and … no, wait. Look at the sixth line above, including the text: Prohibition of torture … and the text in brackets. It also includes a semicolon. F|$"§(/°^9ß8# (that’s me cursing)

It seems we have to do some manual adjustments. The code below is the result of me bringing everything into a structure for analysis. It is the result of many checks and tests, a lot of work which is usually hidden from the outside world when data analysis is presented. At first, I thought I really had to clean up all possible cases manually. You will see the code I started below. Fortunately, I soon realised that almost all semicolons which I don’t need for splitting sit inside brackets. Good. I just get rid of all text in brackets.

Before I figured this out, I actually started all manual cleaning, i.e. finding all cases with semi-colons. This would also have solved almost all of the problems, but I prefer the cleaner way obviously. I still show the manual part for completeness below.

# not run
dat <- dat %>%
 mutate(Conclusion = str_replace_all(Conclusion, "Art\\.", "Article"),
        Conclusion = str_replace(Conclusion, "liberty;Lawful", "liberty;Lawful"),
        Conclusion = str_replace(Conclusion, "correspondence;Respect", "correspondence;Respect"),
        Conclusion = str_replace(Conclusion, "property;Peaceful", "property;Peaceful"),
        Conclusion = str_replace(Conclusion, "obligations;Article", "obligations;Article"),
        Conclusion = str_replace(Conclusion, ";Article", ",Article"),
        Conclusion = str_replace(Conclusion, "; Article", ", Article"),
        Conclusion = str_replace(Conclusion, ";Inhuman", ",Inhuman"),
        Conclusion = str_replace(Conclusion, ";Respect", ",Respect"),
        Conclusion = str_replace(Conclusion, ";\\(Art", ",\\(Art"),
        Conclusion = str_replace(Conclusion, ";Equality of", ",Equality of"),
        Conclusion = str_replace(Conclusion, ";Prohibition", ",Prohibition"),
        Conclusion = str_replace(Conclusion, ";Reasonableness", ",Reasonableness"),
        Conclusion = str_replace(Conclusion, ";Independent", ",Independent"),
        Conclusion = str_replace(Conclusion, ";Just satis", ",Just satis"),
        Conclusion = str_replace(Conclusion, ";Procedure", ",Procedure"),
        Conclusion = str_replace(Conclusion, ";Speediness", ",Speediness"),
        Conclusion = str_replace(Conclusion, ";Possessions", ",Possessions"),
        Conclusion = str_replace(Conclusion, ";Review", ",Review"),
        Conclusion = str_replace(Conclusion, ";Degrading", ",Degrading"),
        Conclusion = str_replace(Conclusion, ";Review", ",Review"),
        Conclusion = str_replace(Conclusion, ";Adversarial", ",Adversarial"),
        Conclusion = str_replace(Conclusion, ";Fair", ",Fair"),
        Conclusion = str_replace(Conclusion, "; non-exhaustion", ", non-exhaustion"),
        Conclusion = str_replace(Conclusion, ";Positive obligation", ",Positive obligation"),
        Conclusion = str_replace(Conclusion, ";Security", ",Security"),
        Conclusion = str_replace(Conclusion, ";Criminal", ",Criminal"),
        Conclusion = str_replace(Conclusion, "; General measures", ",   General measures"),
        Conclusion = str_replace(Conclusion, ";Freedom", ",Freedom"),
        Conclusion = str_replace(Conclusion, ";Security", ",Security"))

But then, as mentioned above, I realised that all the information that I do not want to split by ; is included in brackets. So, with a little regex, I just get rid of all brackets and this solves most of my problems.

dat <- dat %>%
  mutate(Conclusion = str_replace_all(Conclusion, "\\([^()]+\\)", ""),
         Conclusion = str_replace_all(Conclusion, "Art\\.", "Article"),
         Conclusion = str_trim(Conclusion))

That’s much quicker, easier and more elegant.
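To see what the regex buys us, here is a quick check on one of the conclusions shown in the table above. The pattern is the same as in the code; only the variable names are made up for the example.

```r
library(stringr)

x <- paste0(
  "Violation of Article 3 - Prohibition of torture ",
  "(Article 3 - Degrading treatment;Inhuman treatment) (Substantive aspect);",
  "Violation of Article 13 - Right to an effective remedy ",
  "(Article 13 - Effective remedy)"
)

# Naive split: the semicolon inside the brackets creates a bogus third piece
naive <- str_split(x, ";")[[1]]

# Remove every bracketed chunk first, then split and trim
cleaned <- str_replace_all(x, "\\([^()]+\\)", "")
parts <- str_trim(str_split(cleaned, ";")[[1]])
```

The naive split yields three pieces, while the cleaned version yields the two conclusions we actually want. Note that the pattern only matches brackets without further brackets inside, so nested brackets would need a second pass.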

Now, let’s create a long version of the data, where we have one line per conclusion (and more cleaning):

dat <- dat %>%
  mutate(con2 = str_split(Conclusion, ";")) %>%
  unnest(cols = c(con2)) %>%
  dplyr::select(-datetype) %>%
  mutate(con2 = str_trim(con2),
         con2 = str_replace(con2, " - .*", ""), # here I get rid of everything after ' - ' which I dont need for analysis
         con2 = str_replace(con2, "P1-1", "Article 1 of Protocol No. 1"), #cleaning a few other more inconsistencies (yes, could be done more elegantly)
         con2 = str_replace(con2, "P4-2", "Article 2 of Protocol No. 4"),
         con2 = str_replace(con2, "P1-2", "Article 2 of Protocol No. 1"),
         con2 = str_replace(con2, "P1-3", "Article 3 of Protocol No. 1"),
         con2 = str_replace(con2, "P7-4", "Article 4 of Protocol No. 7"),
         con2 = str_replace(con2, "P7-1", "Article 1 of Protocol No. 7"),
         con2 = str_replace(con2, "P7-2", "Article 2 of Protocol No. 7"),
         con2 = str_replace(con2, "P6-1", "Article 1 of Protocol No. 6"),
         con2 = str_replace(con2, "P3-1", "Article 1 of Protocol No. 3"),
         con2 = str_replace(con2, " and ", "+"),
         con2 = str_replace(con2, "de l'art.", "of Article")) 

After all the above cleaning, the data are still not perfect. However, they are structured enough for some general analysis of violation versus no violation and for getting a sense of the main articles violated and not violated. Let’s create a few variables for analysis, most notably a variable that indicates whether or not a violation was found. As several articles can be violated, or not violated, we will have several instances per case. There is much more that can be done, but I’ll leave it at that (not sure how many readers are actually left at this stage).

dat <- dat %>%
  mutate(violation = ifelse(str_detect(con2, "Violation"), 1,
                            ifelse(str_detect(con2, "No violation"), 0, NA)),
         otherdecision = ifelse(str_detect(con2, "Violation"), 1,
                                ifelse(str_detect(con2, "No violation"), 1, 0)),
         article = str_extract(con2, "iolations?.*"),
         article = str_replace(article, "iolations? of ", "") %>% str_trim())

dat <- dat %>% group_by(id) %>%
  mutate(otherdecision = as.integer(sum(otherdecision) == 0)) %>%
  ungroup() %>%
  rename(year_decision = year) %>%
  mutate(year_lodged = str_sub(Application.Number, -2, -1),
         year = ifelse(year_lodged > 20, paste0(19, year_lodged),
                              paste0(20, year_lodged)) %>% as.numeric())
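Two details in the code above are easy to miss, so here is a small check on toy values. First, `str_detect()` is case-sensitive, which is why "Violation" does not accidentally match "No violation". Second, the year of lodging comes from the last two characters of the application number. The application number "12345/98" is invented for the example, and the numeric comparison via `as.integer()` is my variation for the sketch, not the exact code used above.

```r
library(stringr)

con2 <- c("Violation of Article 6", "No violation of Article 5",
          "Remainder inadmissible", "Violation of Article 1 of Protocol No. 1")

# Capital "V" only matches findings of a violation
violation <- ifelse(str_detect(con2, "Violation"), 1,
                    ifelse(str_detect(con2, "No violation"), 0, NA))

# Keep only the article name after "(No) violation of "
article <- str_extract(con2, "iolations?.*")
article <- str_trim(str_replace(article, "iolations? of ", ""))

# Year of lodging from the two-digit suffix of the application number
app <- c("73049/01", "33210/07;41866/08", "12345/98")
yy <- str_sub(app, -2, -1)
year_lodged <- as.numeric(ifelse(as.integer(yy) > 20,
                                 paste0("19", yy), paste0("20", yy)))
```

For joined applications such as "33210/07;41866/08", this takes the year of the last application number listed.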

Some results

We are done! Now we have a dataset ready for analysing human rights violations. Are you as excited as I am?

It includes information and data, which - to my knowledge - not many people have analysed so far. It is a long dataset with several observations per case, separately for each article violation decision and country involved.

What could be analysed? A few examples follow.

How many cases found a violation?

n_viols <- filter(dat, violation == 1) %>% nrow()

# overall judgments finding a violation
d2 <- dat %>% 
  group_by(id) %>% 
  summarise(n_viol = sum(violation, na.rm = TRUE),
            if_viol = ifelse(n_viol > 0, 1, 0),
            year_decision = mean(year_decision),
            year_lodged = mean(year))

t1 <- count(d2, n_viol) %>% mutate(prp = round(n/sum(n), 2)) %>% head()

t1
## # A tibble: 6 x 3
##   n_viol     n   prp
##    <dbl> <int> <dbl>
## 1      0  1654  0.14
## 2      1  6183  0.53
## 3      2  2565  0.22
## 4      3   576  0.05
## 5      4   264  0.02
## 6      5   188  0.02
no_viol <- 100*(t1 %>% filter(n_viol == 0) %>% pull(prp))
viol <- 100-no_viol

Only 14 percent of the judgments in our dataset found no violation. The other 86 percent found at least one violation, and a sizeable share of them found more than one article violated.

Which articles are most often violated?

dat %>% 
  filter(violation == 1) %>% 
  count(article, sort = TRUE) %>% 
  mutate(prp = round(n/sum(n), 2)) %>% 
  head()
## # A tibble: 6 x 3
##   article                         n   prp
##   <chr>                       <int> <dbl>
## 1 Article 6                    2627  0.16
## 2 Article 3                    2521  0.16
## 3 Article 6-1                  1730  0.11
## 4 Article 1 of Protocol No. 1  1608  0.1 
## 5 Article 5                    1574  0.1 
## 6 Article 8                    1007  0.06

Article 6 (right to a fair trial and more) is violated most often, followed by Article 3 (prohibition of torture and inhuman or degrading treatment or punishment) and Article 6-1 (fair trial again: Article 6 has several elements, and the conclusions sometimes name the specific paragraph of an article, which opens up opportunities for more detailed analysis).

Which countries most often violate the ECHR?

NB: this is on the level of decisions on articles, not judgment, hence more than one per case possible.

dat %>% 
  filter(violation == 1) %>% 
  count(country, sort = TRUE) %>% 
  mutate(prp = round(n/sum(n), 2)) %>% 
  head()
## # A tibble: 6 x 3
##   country      n   prp
##   <chr>    <int> <dbl>
## 1 russia    3822  0.24
## 2 turkey    2638  0.16
## 3 ukraine   1378  0.09
## 4 romania   1297  0.08
## 5 bulgaria   760  0.05
## 6 poland     702  0.04

Not surprisingly: Russia

How long does it take on average from lodging a case at the ECtHR to the judgment of a violation?

This is something you don’t easily get from other data sources.

ave_time <- d2 %>% 
  filter(if_viol == 1) %>% 
  mutate(time_to_violation_judgment = year_decision-year_lodged) %>% 
  pull(time_to_violation_judgment) %>% 
  mean(.) %>% 
  round(digits = 1)

It takes an average of 5.3 years from lodging a case at the ECtHR to obtaining a judgment that finds a violation.

How well can we predict the outcome of a judgment on merits, when simply using information on the year of lodging the application, the country involved and article of the ECHR?

Very cool question in a world of AI and machine learning [yes, I know, I am doing just a simple logistic regression with very few variables … who cares … I still call it AI]

dat <- dat %>% 
  filter(!is.na(violation)) %>% 
  mutate(article = fct_lump_min(article, 20))

s <- floor(nrow(dat) * 0.8) # size of the 80% training sample, as a whole number
set.seed(220618)
idx <- sample(1:nrow(dat), size = s)

dat80 <- dat[idx, ]
dat20 <- dat[-idx, ]

m1 <- glm(violation ~ factor(year) + article + iso2c, 
          data = dat80, family = binomial(link = "logit"))

mn_correct <- dat20 %>% 
  mutate(preds = predict(m1, newdata = dat20, type = "response"),
         correct = ifelse((preds > 0.5 & violation == 1) | (preds < 0.5 & violation == 0),
                          1, 0)) %>% 
  summarise(mean(correct))

mn_correct <- round(100*mn_correct)

Hoooray. We have a machine learning model that correctly predicts violations of articles 83 percent of the time! ;-)
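A figure like this is easier to judge next to what a trivial guess would achieve, namely always predicting the more common outcome. The sketch below shows the mechanics on simulated data only: the variable `group`, the probabilities and the seed are all invented, and no number it produces says anything about the ECtHR data.

```r
set.seed(1) # arbitrary seed for the simulated data
n <- 1000
toy <- data.frame(group = factor(sample(c("a", "b", "c"), n, replace = TRUE)))
# Invented violation probabilities per group
p <- c(a = 0.9, b = 0.7, c = 0.3)[as.character(toy$group)]
toy$violation <- rbinom(n, 1, p)

# 80/20 split
n_train <- floor(n * 0.8)
idx <- sample(seq_len(n), size = n_train)
train <- toy[idx, , drop = FALSE]
test <- toy[-idx, , drop = FALSE]

# Logistic regression and its accuracy on the held-out 20%
m <- glm(violation ~ group, data = train, family = binomial(link = "logit"))
preds <- predict(m, newdata = test, type = "response")
accuracy <- mean((preds > 0.5) == (test$violation == 1))

# Baseline: always predict the majority class seen in the training data
majority <- as.integer(mean(train$violation) > 0.5)
baseline <- mean(test$violation == majority)
```

Comparing `accuracy` with `baseline` shows how much the predictors add beyond the base rate.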

Conclusions - Other ways and data sources for data on violations found by the European Court of Human Rights

If you have read the post up to here, I am quite impressed. The post showed how difficult it is to work with messy data, and how much work and how many decisions lie behind data cleaning.

So far, I haven’t really made a fuller analysis of the data, as I am not 100% sure if everything is perfectly in order. It would require a few more case checks. I am confident that the results are relevant for research purposes, but this was done relatively quickly and should be further checked before being fully used.

Until now, I have co-authored two publications that include an analysis of ECtHR violations data. In both cases, I used a different dataset: in one because the research question was a bit different, and in the other because I wasn’t sure about the above dataset. The latter included an analysis of violations with a focus on comparing Austria to other countries.2 We still analysed other countries and built clusters of countries according to violations of articles. For this case, I actually resorted to plain old copying and pasting data on violations for several years from PDFs published by the court into Excel. It took a few hours (data compilation can be quite relaxing actually) and allowed me to create a dataset including information on the number of judgments with (1) at least one violation, (2) no violation, or (3) any other decision, as well as the articles found to be in violation. The data come from PDFs published by the court, such as the one for the year 2020 or the aggregated one covering all years since 1959.3

The other analysis looked into violations of the ECHR based on the database on the execution of judgments, HUDOC EXEC. This dataset presents articles and other information more neatly, but only includes violations. See the analysis for details.4 It highlights structural human rights issues in certain countries, as there are many repetitive cases that in some countries link to exactly the same leading case, but in others link more diversely to different leading cases, making it more difficult to pinpoint the issues.

A quick check showed that the numbers of violations in these three datasets do not match up 100%. However, they are very similar and preserve the statistical correlations among countries etc.

Conclusion

If you liked reading this post, you are a natural-born data scientist. If not, think twice about whether you want to work in ‘one of the sexiest jobs of the decade’.5


  1. Wickham H. (2014): Tidy Data. Journal of Statistical Software, Vol. 59 (10).

  2. Forthcoming book chapter: Toggenburg G. N. and Reichel D. (2022, forthcoming): ‘Vienna found guilty in Strasbourg: a look at (statistical) patterns in ECtHR judgements’

  3. If you are interested in using the dataset, just drop me a line, and I will send it to you.

  4. Reichel, David and Grimheden, Jonas (2018), A Decade of Violations of the European Convention on Human Rights: Exploring Patterns of Repetitive Violations, In: Benedek, Wolfgang et al. (eds), European Yearbook on Human Rights, Cambridge et al., Intersentia, p. 272.

  5. Google’s Chief Economist Hal Varian is known for having said that.