All 2 entries tagged R

View all 4 entries tagged R on Warwick Blogs | View entries tagged R at Technorati | There are no images tagged R on this blog

April 08, 2022

License choice in the R community

This week a post on the RSE Slacksparked a lot of discussion on how to choose a license for your research software. The website https://choosealicense.com/is a helpful resource and starts with an important point raised by Hugo Grusonthat a good place to start is to consider the license(s) commonly used in your community. But how do you find out this information? This blog post explores the licenses used in the R and Bioconductor communities, by demonstrating how to obtain licencing information on CRAN and Bioconductor packages.

Licenses on CRAN

The Comprehensive R Archive Network (CRAN) repository is the main repository for R packages and the default repository used when installing add-on packages in R. The tools package that comes with the base distribution of R provides the CRAN_package_db() function to download a data frame of metadata on CRAN packages. Using this function, we can see that there are currently 19051 packages on CRAN.

library(tools)
pkgs <- CRAN_package_db()
nrow(pkgs)

## [1] 19051

The license information is in the License column of the data frame. We'll use the dplyr package to help summarise this variable. With n_distinct() we find that there are 164 unique licenses!

library(dplyr)
n_distinct(pkgs$License)

## [1] 164

However, many of these are different versions of a license, e.g.

pkgs |>
  filter(grepl("^MIT", License)) |>
  distinct(License)


##                                   License
## 1                      MIT + file LICENSE
## 2              MIT License + file LICENSE
## 3                      MIT + file LICENCE
## 4 MIT + file LICENSE | Apache License 2.0
## 5                       MIT +file LICENSE
## 6          MIT + file LICENSE | Unlimited

The above output also illustrates that

An additional LICENSE (or LICENCE) file can be used to add additional terms to the license (the year and copyright holder in the case of MIT).
Packages can have more than one license (the user can choose any of the alternatives).
Authors do not always provide the license in a standard form!

A LICENSE file can also be used to on its own to specify a non-standard license. Given this variation in license specification, we will use transmute() to create a new set of variables, counting the number of times each type of license appears in the specification. We create a helper function n_match() to count the number of matches for a regular expression, which helps to deal with variations in the form provided. Finally we check against the expected number of licenses for each package to check we have covered all the options.

n_match <- function(s, x) lengths(regmatches(x, gregexpr(s, x)))
licenses <- pkgs |>
  transmute(
    ACM = n_match("ACM", License),
    AGPL = n_match("(Affero General Public License)|(AGPL)", License),
    Apache = n_match("Apache", License),
    Artistic = n_match("Artistic", License),
    BSD = n_match("(^|[^e])BSD", License),
    BSL = n_match("BSL", License),
    CC0 = n_match("CC0", License),
    `CC BY` = n_match("(Creative Commons)|(CC BY)", License),
    CeCILL = n_match("CeCILL", License),
    CPL = n_match("(Common Public License)|(CPL)", License),
    EPL = n_match("EPL", License),
    EUPL = n_match("EUPL", License),
    FreeBSD = n_match("FreeBSD", License),
    GPL = n_match("((^|[^ro] )General Public License|(^|[^LA])GPL)", License),
    LGPL = n_match("(Lesser General Public License)|(LGPL)", License),
    LICENSE = n_match("(^|[|] *)file LICEN[SC]E", License),
    LPL = n_match("(Lucent Public License)", License),
    MIT = n_match("MIT", License),
    MPL = n_match("(Mozilla Public License)|(MPL)", License),
    Unlimited = n_match("Unlimited", License))
n_license <- n_match("[|]", pkgs$License) + 1
all(rowSums(licenses) == n_license)

## TRUE

Now we can tally the counts for each license, discounting version differences (i.e., GPL-2 | GPL-3 would only count once for GPL). We convert the license variable into a factor so that we can order by descending frequency in a plot.

tally <- colSums(licenses > 0)
tally_data <- 
  tibble(license = names(tally),
         count = tally) |>
  arrange(desc(count)) |>
  mutate(license = factor(license, levels = license))

Bar chart of license frequencies on CRAN as a percentage of the number of packages. The vast majority are GPL (73%), followed by MIT (18%). All other licenses are represented in less than 3% of packages.

tally_data.csv

The vast majority are GPL (73%), followed by MIT (18%). All other licenses are represented in less than 3% of packages. This is consistent with R itself being licensed under GPL-2 | GPL-3. The only licenses in the top 10 that are not mentioned as "in use" on https://www.r-project.org/Licenses/, are the Apache and CC0 licenses, used by 1.7% and 1.1% of packages, respectively. The Apache license is a modern permissive license similar to MIT or the older BSD license, while CC0 is often use for data packages where attribution is not required. A separate LICENSE file is the 3rd most common license among CRAN packages; without exploring further it is unclear if this is always a stand-alone alternative license (as the specification implies) or if it might sometimes be adding further terms to another license.

Licenses on Bioconductor

Bioconductor is the second largest repository of R packages (excluding GitHub, which acts as a more informal repository). Bioconductor specialises in packages for bioinformatics. We can conduct a similar analysis to that for CRAN using the BiocPkgToolspackage. The function to obtain metadata on Bioconductor packages is biocPkgList(). With this we find there are currently 2041 packages on Bioconductor:

library(BiocPkgTools)
pkgs <- biocPkgList()
nrow(pkgs)

## [1] 2041

Still, there are 89 distinct licenses among these pckages:

n_distinct(pkgs$License)

## [1] 89

We can use the same code as before to tally each license and create a plot - the only change made to create the plot below was to exclude licenses that were not represented on Bioconductor.

Bar chart of license frequencies on Bioconductor as a percentage of the number of packages. GPL is still a popular license, represented by 55% of packages. However the Artistic license is also popular in this community (23%). Third to fifth place are taken by MIT (9%), LGPL (7%) and LICENSE (4%), respectively, with the remaining licenses represented in less than 2% of packages.

tally_data_bioc.csv

GPL is still a popular license, represented by 55% of packages. However the Artistic license is also popular in this community (23%). This reflects the fact that the Bioconductor core packages are typically licensed under Artistic-2.0 and community members may follow the lead of the core team. Third to fifth place are taken by MIT (9%), LGPL (7%) and LICENSE (4%), respectively, with the remaining licenses represented in less than 2% of packages. The ACM, BSL, CC0, EUPL, FreeBSD and LPL licenses are unrepresented here.

Summary

Although the Biconductor community is a subset of the R community, it has different norms regarding package licenses. In both communities though, the GPL is a common choice for licensing R packages, consistent with the license choice for R itself.

Heather Turner : 08 Apr 2022 10:32 | Tags: Code Licenses R | Comments (0) | Close comments | Report a problem

December 08, 2021

Advent of Code 2021: First days with R

The Advent of Code is a series of daily programming puzzles running up to Christmas. On 3 December, the Warwick R User Group met jointly with the Manchester R-thritis Statistical Computing Group to informally discuss our solutions to the puzzles from the first two days. Some of the participants shared their solutions in advance as shared in this slide deck.

In this post, Heather Turner (RSE Fellow, Statistics) shares her solutions and how they can be improved based on the ideas put forward at the meetup by David Selby (Manchester R-thritis organizer) and others, while James Tripp (Senior Research Software Engineer, Information and Digital Group) reflects on some issues raised in the meetup discussion.

R Solutions for Days 1 and 2

Heather Turner

Day 1: Sonar Sweep

For a full description of the problem for Day 1, see https://adventofcode.com/2021/day/1.

Day 1 - Part 1

How many measurements are larger than the previous measurement?

199  (N/A - no previous measurement)
200  (increased)
208  (increased)
210  (increased)
200  (decreased)
207  (increased)
240  (increased)
269  (increased)
260  (decreased)
263  (increased)

First create a vector with the example data:

x <- c(199, 200, 208, 210, 200, 207, 240, 269, 260, 263)

Then the puzzle can be solved with the following R function, that takes the vector x as input, uses diff() to compute differences between consecutive values of x, then sums the differences that are positive:

f01a <- function(x) {
  dx <- diff(x)
  sum(sign(dx) == 1)
}
f01a(x)

## [1] 7

Inspired by David Selby’s solution, this could be made slightly simpler by finding the positive differences with dx > 0, rather than using the sign() function.

Day 1 - Part 2

How many sliding three-measurement sums are larger than the previous sum?

199  A           607  (N/A - no previous sum)
200  A B         618  (increased)
208  A B C       618  (no change)
210    B C D     617  (decreased)
200  E   C D     647  (increased)
207  E F   D     716  (increased)
240  E F G       769  (increased)
269    F G H     792  (increased)
260      G H
263        H

This can be solved by the following function of x. First, the rolling sums of three consecutive values are computed in a vectorized computation, i.e. creating three vectors containing the first, second and third value in the sum, then adding the vectors together. Then, the function from Part 1 is used to sum the positive differences between these values.

f01b <- function(x) {
  n <- length(x)
  sum3 <- x[1:(n - 2)] + x[2:(n - 1)] + x[3:n]
  f01a(sum3)
}
f01b(x)

## [1] 5

David Schoch put forward a solution that takes advantage of the fact that the difference between consecutive rolling sums of three values is just the difference between values three places apart (the second and third values in the first sum cancel out the first and second values in the second sum). Putting what we’ve learnt together gives this much neater solution for Day 1 Part 2:

f01b_revised <- function(x) {
  dx3 <- diff(x, lag = 3)
  sum(dx3 > 0)
}
f01b_revised(x)

## [1] 5

Day 2: Dive!

For a full description of the problem see https://adventofcode.com/2021/day/2.

Day 2 - Part 1

forward X increases the horizontal position by X units.
down X increases the depth by X units.
up X decreases the depth by X units.

                  horizontal  depth
forward 5   -->            5      -
down 5      -->                   5
forward 8   -->           13
up 3        -->                   2
down 8      -->                  10
forward 2   -->           15

==> horizontal = 15, depth = 10

First create a data frame with the example data

x <- data.frame(direction = c("forward", "down", "forward",
                              "up", "down", "forward"),
                amount = c(5, 5, 8, 3, 8, 2))

Then the puzzle can be solved with the following function, which takes the variables direction and amount as input. The horizontal position is the sum of the amounts where the direction is “forward”. The depth is the sum of the amounts where direction is “down” minus the sum of the amounts where direction is “up”.

f02a <- function(direction, amount) {
  horizontal <- sum(amount[direction == "forward"])
  depth <- sum(amount[direction == "down"]) - sum(amount[direction == "up"])
  c(horizontal = horizontal, depth = depth)
}
f02a(x$direction, x$amount)

## horizontal      depth 
##         15         10

The code above uses logical indexing to select the amounts that contribute to each sum. An alternative approach from David Selby is to coerce the logical indices to numeric (coercing TRUE to 1 and FALSE to 0) and multiply the amount by the resulting vectors as required:

f02a_selby <- function(direction, amount) {
  horizontal_move <- amount * (direction == 'forward')
  depth_move <- amount * ((direction == 'down') - (direction == 'up'))
  c(horizontal = sum(horizontal_move), depth = sum(depth_move))
}

Benchmarking on 1000 datasets of 1000 rows this alternative solution is only marginally faster (an average run-time of 31 μs vs 37 μs), but it has an advantage in Part 2!

Day 2 - Part 2

down X increases your aim by X units.
up X decreases your aim by X units.
forward X does two things:
- It increases your horizontal position by X units.
- It increases your depth by your aim multiplied by X.

                  horizontal  aim  depth
forward 5   -->            5    -      - 
down 5      -->                 5      
forward 8   -->           13          40
up 3        -->                 2
down 8      -->                10
forward 2   -->           15          60

==> horizontal = 15, depth = 60

The following function solves this problem by first computing the sign of the change to aim, which is negative if the direction is “up” and positive otherwise. Then for each change in position, if the direction is “forward” the function adds the amount to the horizontal position and the amount multiplied by aim to the depth, otherwise it adds the sign multiplied by the amount to the aim.

f02b <- function(direction, amount) {
  horizontal <- depth <- aim <- 0
  sign <- ifelse(direction == "up", -1, 1)
  for (i in seq_along(direction)){
    if (direction[i] == "forward"){
      horizontal <- horizontal + amount[i]
      depth <- depth + aim * amount[i]
      next
    }
    aim <- aim + sign[i]*amount[i]
  }
  c(horizontal = horizontal, depth = depth)
}
f02b(x$direction, x$amount)

## horizontal      depth 
##         15         60

As an interpreted language, for loops can be slow in R and vectorized solutions are often preferable if memory is not an issue. David Selby showed that his solution from Part 1 can be extended to solve the problem in Part 2, by using cumulative sums of the value that represented depth in Part 1 to compute the aim value in Part 2.

f02b_revised <- function(direction, amount) {
  horizontal_move <- amount * (direction == "forward")
  aim <- cumsum(amount * (direction == "down") - amount * (direction == "up"))
  depth_move <- aim * horizontal_move
  c(horizontal = sum(horizontal_move), depth = sum(depth_move))
}
f02b_revised(x$direction, x$amount)

## horizontal      depth 
##         15         60

Benchmarking these two solutions on 1000 data sets of 1000 rows, the vectorized solution is ten times faster (on average 58 μs vs 514 μs).

Reflections

James Tripp

How do we solve a problem with code? Writing an answer requires what some educators call computational thinking. We systematically conceptualise the solution to a problem and then work through a series of steps, drawing on coding conventions, to formulate an answer. Each answer is different and, often, a reflection of our priorities, experience, and domains of work. In our meeting, it was wonderful to see people with a wide range of experience and differing interests.

Our discussion considered the criteria of a ‘good solution’.

Speed is one criteria of success - a solution which takes 100 μs (microseconds) is better than a solution taking 150 μs.
Readability for both sharing with others (as done above) and to help future you, lest you forget the intricacies of your own solution.
Good practice such as variable naming and, perhaps, avoiding for loops where possible. Loops are slower and somewhat discouraged in the R community. However, some would argue they are more explicit and helpful for those coming from other languages, such as Python.
Debugging friendly.Some participants, including Heather Turner and David Selby, checked their solutions with tests comparing known inputs and outputs. I drew on my Psychology experience and opted for an explicit DataFrame where I can see each operation. Testing is almost certainly a better solution which I adopt in my packages.
Generalisability. A solution tailored for the Part 1 task on a given day may not be easily generalisable for the Part 2 task. It seemed desirable to refactor one’s code to create a solution which encompasses both tasks. However, the effort and benefits of doing so are certainly debatable.

We also discussed levels of abstraction. The tidyverse family of R packages is powerful, high-level and quite opinionated. Using tidyverse functions returned some intuitive, but slower solutions where we were unsure of the bottlenecks. Solutions built on R base (the functions which come with R) were somewhat faster, though others using libraries such as data.table were also rather quick. These reflections are certainly generalisations and prompted some discussion.

How does one produce fast, readable, debuggable, generalisable code which follows good practice and operates at a suitable level of abstraction? Our discussions did not produce a definitive answer. Instead, our discussions and sharing solutions helped us understand the pros and cons of different approaches and I certainly learned a few useful tricks.

Heather Turner : 08 Dec 2021 10:00 | Tags: Code Programming R | Comments (0) | Close comments | Report a problem