Sex Distributions in Research
Gender imbalance in academia is a huge topic these days, especially in the STEM fields. Computer Science, with its close connection to tech/Web/geek culture, is buzzing with all kinds of related activities, funding programs, and (sometimes heated) debates. The statistics we get to see in these discussions are often very coarse, talking about all of Computer Science. Researchers thus tend to get the impression that their own (specific) field is merely suffering from wider issues beyond their control. So I was wondering: how strong is gender imbalance in (really) specific research fields? Considering the conferences I go to, I can certainly say that some are more balanced than others. For journals, I have no idea. So I did a little data mining over the weekend to find out, using DBLP and Wikidata as main data sources.
Sexing Academic Authors
What I am asking is: what is the gender distribution of authors who publish at certain conferences or in certain journals? Good quality data on publications is available for free from DBLP. The data includes author names, publication titles, and labels for conference (series) and journals.
DBLP does not have sex information for authors. Moreover, authors are identified by their name string, possibly with an appended number if there are several people of that name (and DBLP is aware of it). This makes it hard, if not impossible, to match the records to other person databases, such as VIAF; besides, most of the authors in DBLP do not have VIAF identifiers anyway.
Mining Wikidata for Girls' and Boys' Names
So all we have is a name. We need to guess the gender. Humans normally do this by looking at the first name. To automate this, I needed a list of first names together with sex association. The list should be international, include many spelling variants, and take into account that some names may be used for either gender. I decided to extract this from Wikidata, since it contains hundreds of thousands of people taken from Wikipedias of many languages, together with sex information.
I extended my Wikidata Analysis script to do this. Wikidata does not distinguish first/given and last/family names, which would be rather futile in an international context anyway. So I guess the "first name" from the main label by picking the first word. I ignore weird names and single-word person labels (like "Oedipus"). I also filter abbreviated initials and some other junk. I only consider the first label I find in the languages English, German, French, Spanish, Italian, Dutch, and Polish, since I want to get names in Latin script like in DBLP.
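The label filtering can be sketched as follows. This is a simplified illustration with assumed names (`first_name_candidate`, the `LANGS` preference list), not the author's actual Wikidata Analysis script:

```python
import re

# Languages whose labels we consider, in order of preference (Latin script).
LANGS = ["en", "de", "fr", "es", "it", "nl", "pl"]

def first_name_candidate(labels):
    """Guess a first name from a dict mapping language codes to person labels.

    Returns None for single-word labels (like "Oedipus"), abbreviated
    initials ("J. Smith"), and other junk, as described in the text.
    """
    for lang in LANGS:
        label = labels.get(lang)
        if not label:
            continue
        words = label.split()
        if len(words) < 2:  # ignore single-word person labels
            return None
        first = words[0]
        # filter abbreviated initials such as "J." or "J.-P."
        if re.fullmatch(r"([A-Z]\.(-)?)+", first):
            return None
        # require a plausible name: letters (or hyphens), upper-case start
        if not first[0].isupper() or not first.replace("-", "").isalpha():
            return None
        return first
    return None
```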
The result is a list of 54915 names: 37812 used by men, 14337 used by women, and 2545 used by either. Here are the most popular names (with the number of people so named on Wikipedia):
- Women: Maria (1638), Anna (1628), Mary (1165), Barbara (1067), Anne (840), Princess (779), Marie (760), Elisabeth (690), and Elizabeth (650).
- Men: John (15040), William (8033), David (7319), Paul (6742), Michael (6361), Robert (6258), Charles (6239), Peter (6081), and Johann (5985).
Things to note:
- There is a strong male bias. This reflects the distribution of Wikipedia articles across languages.
- There is a strong Western bias. This is related to the relative sizes of Wikipedias in various languages, but also to the fact that we only extract labels from (some) Latin alphabet labels.
- The data still contains some junk: Princess is the first word in many persons' labels who do not have it as a first name (though some might well). Note that such bogus names do not impair our application.
Another issue is that Chinese and other Asian names in Wikipedia are written in their native ordering, family name first. I make no attempt to detect this, since there is often no sufficient information to tell where a person comes from. Moreover, some Chinese flip their names when living in Western countries, and again we have no ways to know what is going on here. I think Asian names simply need a different approach; I leave this to future work.
The above list contains many names that are used by some women and some men. For example, there are four women called John. Names like Andrea (F399/M314) and even Evelyn (F151/M17) are just not gender-specific on an international level. I consider a name to be "female" if at least four times as many women as men have the name. For "male" names, I require a ten times greater number of men, taking Wikidata's gender bias into account. All other names are "ambiguous". I end up with 14807 female, 38241 male, and 1867 ambiguous first names. Not too bad.
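As a minimal sketch, the threshold rule could look like this (the function name and the handling of zero counts are my assumptions):

```python
def classify_name(female_count, male_count):
    """Classify a first name as "female", "male", or "ambiguous".

    A name counts as female if at least 4x as many women as men bear it,
    and as male if at least 10x as many men as women do; the higher male
    threshold compensates for Wikidata's overall male bias.
    """
    if female_count >= 4 * male_count and female_count > 0:
        return "female"
    if male_count >= 10 * female_count and male_count > 0:
        return "male"
    return "ambiguous"
```

For example, Andrea (F399/M314) falls below both thresholds and is classified as ambiguous.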
I now inspect all DBLP data and try to assign a sex to each author. To do this, I look at the first word of the name string. If it is not a gender-specific first name, I check whether it contains a "-" and use only the part before the hyphen as a candidate name. If this still fails, I optimistically try the next word in the name (as long as it is not the last one yet) and proceed as before. If everything fails, the author's sex remains "unknown".
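The lookup procedure just described can be sketched as follows, assuming a hypothetical `name_sex` table built from the Wikidata name list:

```python
def sex_of_author(name, name_sex):
    """Guess an author's sex from a DBLP name string.

    `name_sex` maps gender-specific first names to "female" or "male".
    We try each word except the last (presumably the family name):
    first the word itself, then, for hyphenated names, the part
    before the hyphen.
    """
    words = name.split()
    # DBLP appends numbers to disambiguate homonyms, e.g. "Wei Wang 0002"
    if words and words[-1].isdigit():
        words = words[:-1]
    for word in words[:-1]:
        if word in name_sex:
            return name_sex[word]
        head = word.split("-")[0]
        if head in name_sex:
            return name_sex[head]
    return "unknown"
```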
I then compute the following scores for each journal:
- number of authorships (appearances of a person in a list of authors): female authorships, male authorships, unknown authorships
- number of papers: papers with at least one female/male/unknown author
Two derived ratios are relevant for gender bias:
- female authorships/total authorships: the probability that a randomly chosen name on any paper in that journal belongs to a woman
- female papers/total papers: the probability that a randomly chosen paper from that journal involves at least one woman.
Moreover, one also needs to consider the number of unknown authors in each case, as they could also be women. Finally, I also compute the average number of authors per paper, since it seems interesting and (of course) has a strong effect on "female papers/total papers": the more people on each paper, the more likely it is that a woman is among them.
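A minimal sketch of the per-journal score computation, over a hypothetical list of papers given as author-sex lists (not the actual DBLP pipeline):

```python
def journal_scores(papers):
    """Compute authorship counts, paper counts, and derived ratios.

    `papers` is a non-empty list of author-sex lists, one per paper,
    e.g. [["female", "male"], ["male", "unknown", "male"]].
    """
    authorships = {"female": 0, "male": 0, "unknown": 0}
    papers_with = {"female": 0, "male": 0, "unknown": 0}
    total_authorships = 0
    for authors in papers:
        for sex in authors:
            authorships[sex] += 1
            total_authorships += 1
        for sex in set(authors):  # papers with at least one such author
            papers_with[sex] += 1
    n = len(papers)
    return {
        "female_authorship_ratio": authorships["female"] / total_authorships,
        "female_paper_ratio": papers_with["female"] / n,
        "avg_authors_per_paper": total_authorships / n,
        "authorships": authorships,
        "papers_with": papers_with,
    }
```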
It seemed unfair to compare journals that have existed for decades with very recent ones: there were significantly fewer female researchers in the past than there are now. Therefore, I also recomputed all scores restricted to publications that appeared since 2003. This should better reflect the current publication culture in a field.
Moreover, I also computed the same scores grouped by year (summing up the publications from all journals). This should give an interesting overall trend.
Finally, I computed all three scores also for conferences: conference publications total, conference publications since 2003, and conference publications by year. Here, "conference" includes any event that publishes proceedings (workshops, doctoral consortia, summer schools, even some book collections).
How Gender-Biased is Your Research Field?
The results can be found in two Google spreadsheets: DBLP gender biases for journals and conferences and DBLP gender biases for conferences (total). The second is just one sheet, but Google has a size limit on spreadsheets, so it would not fit into the first document.
The DBLP data contained publications from 1,379 different journals and 6,319 conferences. Of the 1,319,894 different author names, 154,822 were identified as female and 681,204 as male; another 483,868 could not be assigned to any sex. It turned out that the vast majority of the authors of unknown sex have Asian names. This is to be expected, given the deficiencies of my sexing method for these names. Another relevant part of the unknown names is accounted for by thousands of people with genuinely ambiguous names like Andrea, who just cannot be sexed based on names alone. For Western names, however, recall was relatively good; e.g., more than 90% of authors could be sexed for some German journals, where few international authors would publish.
Looking at the general trend over time (analysis by years) first, we see some mildly encouraging developments. Almost 10% of all journal authorships in 2013 are likely to come from women, compared to only 4% in 1980. With the increased average number of authors per paper, almost one in four papers has a female coauthor today. However, we also see an increase in authors of unknown sex, which may reflect increased contributions from Asian researchers. The study reveals nothing about the gender distribution among these authors. If the distribution in this group were similar to that among the known authors, it would contribute another 3%-4% to the ratio of female journal authorships. But I have no reason to assume that this is the case.
Looking at the figures for individual journals, one can see a lot of variance. I will focus on the figures that are based on the past ten years of publications, since they seem more relevant for the current state of affairs. Female authorship rates range from 56% (Library Trends) down to 1%, though the lower end of the scale is affected by a large amount of authors of unknown sex. Moreover, care is needed when comparing the ratio of papers with at least one female author, since the average author numbers per paper vary widely.
It is instructive, however, to pick a few journals that one is familiar with to get an impression of their relative gender bias. For me, this is the following list:
| Name | female authorships | unknown authorships |
|------|--------------------|---------------------|
| IEEE Intelligent Systems | 12% | 23% |
| ACM Trans. Comput. Log. | 12% | 14% |
| J. Web Sem. | 12% | 12% |
| J. Artif. Intell. Res. (JAIR) | 11% | 18% |
| Theor. Comput. Sci. | 10% | 22% |
| J. Autom. Reasoning | 10% | 11% |
| Logical Methods in Computer Science | 8% | 12% |
| ACM Trans. Database Syst. | 8% | 28% |
While these numbers are close together, it should be kept in mind that the global average female authorship ratio grew by only 5 to 6 percentage points over the past 30 years. So a difference of 1% is a lot here.
It is interesting to note that magazines like Commun. ACM and IEEE Intelligent Systems achieve a comparatively high female participation, whereas the flagship Computer Science research journal J. ACM is at the bottom. Indeed, the last time we had a global average of 7% female authorships was in 1994, and even there the uncertainty of 24% unknown authors may have hidden a larger number. On the other hand, it is interesting that leading journals in the more specific fields of AI, Semantic Web, and computational logic are above the global average of 2013.
Another noteworthy point is that Informatik Spektrum, roughly the German equivalent of Commun. ACM, achieves only 9% female participation in spite of a very low uncertainty of just 2% unknown authors.
Of course, the numbers must be considered with due scepticism. The uncertainty is rather high in some cases, and may add significantly to the number of female authorships. Moreover, some journals, such as Commun. ACM, include significant amounts of editorial content that is possibly indexed by DBLP, so they are not necessarily a good indication of research contributions. Finally, many individual effects may account for slightly altered rates.
This is just a first study, a work of a weekend. No testing has been done to validate the predictive power of the instruments that were used, so even significant errors in sex guessing are conceivable.
Nevertheless, I think that this outlines an interesting line of thinking. Editors of journals, but also organisers of conferences, should at least be aware of the actual figures of female participation in their venue. They may indicate a systematic bias in their field that is not explained by how students choose their first field of study, and which is thus within the reach of the researchers themselves to address.
For authors, the impact factor is probably still the more important measure when deciding which journal to submit to, yet a look at gender bias could be interesting for them as well.
Comments and feedback for this article can be sent to me via email: markus at this domain.