License choice in the R community
This week a post on the RSE Slack sparked a lot of discussion on how to choose a license for your research software. The website https://choosealicense.com/ is a helpful resource and starts with an important point, raised by Hugo Gruson, that a good place to start is to consider the license(s) commonly used in your community. But how do you find out this information? This blog post explores the licenses used in the R and Bioconductor communities by demonstrating how to obtain licensing information on CRAN and Bioconductor packages.
Licenses on CRAN
The Comprehensive R Archive Network (CRAN) repository is the main repository for R packages and the default repository used when installing add-on packages in R. The tools package that comes with the base distribution of R provides the CRAN_package_db() function to download a data frame of metadata on CRAN packages. Using this function, we can see that there are currently 19051 packages on CRAN.
library(tools)
pkgs <- CRAN_package_db()
nrow(pkgs)
## [1] 19051
The license information is in the License column of this data frame. Counting the distinct values shows 164 different license specifications:
library(dplyr)
n_distinct(pkgs$License)
## [1] 164
However, many of these are different versions of a license, e.g.
pkgs |>
  filter(grepl("^MIT", License)) |>
  distinct(License)
## License
## 1 MIT + file LICENSE
## 2 MIT License + file LICENSE
## 3 MIT + file LICENCE
## 4 MIT + file LICENSE | Apache License 2.0
## 5 MIT +file LICENSE
## 6 MIT + file LICENSE | Unlimited
The above output also illustrates that
- An additional LICENSE (or LICENCE) file can be used to add additional terms to the license (the year and copyright holder in the case of MIT).
- Packages can have more than one license (the user can choose any of the alternatives).
- Authors do not always provide the license in a standard form!
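To make the second point concrete, here is a minimal sketch of how the alternatives in a license specification could be separated on the `|` character. The strings in `spec` are made up for illustration in the style of the License field; they are not taken from the CRAN data.

```r
# Toy license strings in the style of the CRAN `License` field
# (made up for illustration)
spec <- c("GPL-2 | GPL-3",
          "MIT + file LICENSE",
          "MIT + file LICENSE | Unlimited")

# Split on the literal "|" and trim surrounding whitespace
alternatives <- lapply(strsplit(spec, "|", fixed = TRUE), trimws)
alternatives[[3]]
## [1] "MIT + file LICENSE" "Unlimited"

# Number of alternatives offered by each package
lengths(alternatives)
## [1] 2 1 2
```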
A LICENSE file can also be used on its own to specify a non-standard license. Given this variation in license specification, we will use transmute() to create a new set of variables, counting the number of times each type of license appears in the specification. We create a helper function n_match() to count the number of matches for a regular expression, which helps to deal with variations in the form provided. Finally, we compare the total against the expected number of licenses for each package to confirm that we have covered all the options.
n_match <- function(s, x) lengths(regmatches(x, gregexpr(s, x)))
licenses <- pkgs |>
  transmute(
    ACM = n_match("ACM", License),
    AGPL = n_match("(Affero General Public License)|(AGPL)", License),
    Apache = n_match("Apache", License),
    Artistic = n_match("Artistic", License),
    BSD = n_match("(^|[^e])BSD", License),
    BSL = n_match("BSL", License),
    CC0 = n_match("CC0", License),
    `CC BY` = n_match("(Creative Commons)|(CC BY)", License),
    CeCILL = n_match("CeCILL", License),
    CPL = n_match("(Common Public License)|(CPL)", License),
    EPL = n_match("EPL", License),
    EUPL = n_match("EUPL", License),
    FreeBSD = n_match("FreeBSD", License),
    GPL = n_match("((^|[^ro] )General Public License|(^|[^LA])GPL)", License),
    LGPL = n_match("(Lesser General Public License)|(LGPL)", License),
    LICENSE = n_match("(^|[|] *)file LICEN[SC]E", License),
    LPL = n_match("(Lucent Public License)", License),
    MIT = n_match("MIT", License),
    MPL = n_match("(Mozilla Public License)|(MPL)", License),
    Unlimited = n_match("Unlimited", License))
n_license <- n_match("[|]", pkgs$License) + 1
all(rowSums(licenses) == n_license)
## [1] TRUE
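To see what the helper and a couple of the patterns above do, it may help to run them on a handful of strings. This is a small sketch: the vector `x` is made up for illustration, while `n_match()` and the regular expressions are copied from the code above.

```r
# Helper from the post: count matches of regular expression `s`
# in each element of `x`
n_match <- function(s, x) lengths(regmatches(x, gregexpr(s, x)))

# Toy license strings (made up for illustration)
x <- c("GPL-2 | GPL-3", "LGPL (>= 2)", "MIT + file LICENSE", "file LICENSE")

# GPL pattern: "GPL" only counts when it is not preceded by "L" or "A"
n_match("((^|[^ro] )General Public License|(^|[^LA])GPL)", x)
## [1] 2 0 0 0

# A LICENSE file only counts as a license of its own at the start
# of the specification or directly after a "|"
n_match("(^|[|] *)file LICEN[SC]E", x)
## [1] 0 0 0 1

# Expected number of licenses: one more than the number of "|" separators
n_match("[|]", x) + 1
## [1] 2 1 1 1
```

Note that "MIT + file LICENSE" is not counted under LICENSE, because there the file only adds terms to the MIT license.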
Now we can tally the counts for each license, discounting version differences (i.e., GPL-2 | GPL-3 would only count once for GPL). We convert the license variable into a factor so that we can order by descending frequency in a plot.
tally <- colSums(licenses > 0)
tally_data <-
  tibble(license = names(tally),
         count = tally) |>
  arrange(desc(count)) |>
  mutate(license = factor(license, levels = license))
The vast majority are GPL (73%), followed by MIT (18%). All other licenses are represented in fewer than 3% of packages. This is consistent with R itself being licensed under GPL-2 | GPL-3. The only licenses in the top 10 that are not mentioned as "in use" on https://www.r-project.org/Licenses/ are the Apache and CC0 licenses, used by 1.7% and 1.1% of packages, respectively. The Apache license is a modern permissive license similar to MIT or the older BSD license, while CC0 is often used for data packages where attribution is not required. A separate LICENSE file is the third most common license among CRAN packages; without exploring further, it is unclear whether this is always a stand-alone alternative license (as the specification implies) or whether it might sometimes be adding further terms to another license.
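Percentages like these can be derived by dividing the tally by the number of packages. The sketch below assumes this is how the shares were calculated; the post does not show that step. It uses a toy `licenses` table with made-up counts rather than the real CRAN data.

```r
# Toy match counts: one row per package, one column per license
# (values made up for illustration)
licenses <- data.frame(GPL = c(2, 1, 0, 1),
                       MIT = c(0, 0, 1, 1),
                       LICENSE = c(0, 0, 0, 1))

# Number of packages mentioning each license at least once
tally <- colSums(licenses > 0)

# Share of packages as a percentage, most common first
pct <- sort(round(100 * tally / nrow(licenses), 1), decreasing = TRUE)
pct
# GPL is 75, MIT is 50 and LICENSE is 25 for this toy data
```

Because a package can offer several alternative licenses, the percentages can add up to more than 100.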
Licenses on Bioconductor
Bioconductor is the second largest repository of R packages (excluding GitHub, which acts as a more informal repository). Bioconductor specialises in packages for bioinformatics. We can conduct a similar analysis to that for CRAN using the BiocPkgTools package. The function to obtain metadata on Bioconductor packages is biocPkgList(). With this we find there are currently 2041 packages on Bioconductor:
library(BiocPkgTools)
pkgs <- biocPkgList()
nrow(pkgs)
## [1] 2041
Even with far fewer packages, there are still 89 distinct license specifications among them:
n_distinct(pkgs$License)
## [1] 89
We can use the same code as before to tally each license and create a plot. The only change made to create the plot below was to exclude licenses that were not represented on Bioconductor.
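One way to exclude unrepresented licenses is to drop the zero counts from the tally before building the plot data. This is a sketch of that idea with made-up counts, not necessarily the exact change used for the original plot.

```r
# Toy tally with some licenses that never appear
# (counts made up for illustration)
tally <- c(GPL = 30, Artistic = 12, MIT = 5, ACM = 0, BSL = 0)

# Keep only licenses used by at least one package
tally <- tally[tally > 0]
names(tally)
## [1] "GPL"      "Artistic" "MIT"
```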
GPL is still a popular license, represented by 55% of packages. However, the Artistic license is also popular in this community (23%). This reflects the fact that the Bioconductor core packages are typically licensed under Artistic-2.0 and community members may follow the lead of the core team. Third to fifth place are taken by MIT (9%), LGPL (7%) and LICENSE (4%), respectively, with the remaining licenses represented in fewer than 2% of packages. The ACM, BSL, CC0, EUPL, FreeBSD and LPL licenses are unrepresented here.
Summary
Although the Bioconductor community is a subset of the R community, it has different norms regarding package licenses. In both communities, though, the GPL is a common choice for licensing R packages, consistent with the license choice for R itself.