Research Software Engineering at Warwick
https://blogs.warwick.ac.uk/researchsoftware
We are Research Software Engineering at the University of Warwick. Our daily work brings us up against odd problems to do with computers and programming. We document them here so that we can say we leave things a bit better than we found them.

License choice in the R community by Heather Turner
https://blogs.warwick.ac.uk/researchsoftware/entry/license_choice_in/
<p>This week a post on the <a href="https://society-rse.org/join-us/#slack">RSE Slack</a> sparked a lot of discussion on how to choose a license for your research software. The website <a href="https://choosealicense.com/">https://choosealicense.com/</a> is a helpful resource and starts with an important point raised by <a href="https://github.com/Bisaloo">Hugo Gruson</a>: a good place to start is to consider the license(s) commonly used in your community. But how do you find out this information? This blog post explores the licenses used in the <a href="https://www.r-project.org/">R</a> and <a href="https://www.bioconductor.org/">Bioconductor</a> communities, by demonstrating how to obtain licensing information on <a href="https://cran.r-project.org/">CRAN</a> and Bioconductor packages.</p>
<h2>Licenses on CRAN</h2>
<p>The Comprehensive R Archive Network (CRAN) repository is the main repository for R packages and the default repository used when installing add-on packages in R. The <strong>tools</strong> package that comes with the base distribution of R provides the <tt>CRAN_package_db()</tt> function to download a data frame of metadata on CRAN packages. Using this function, we can see that there are currently 19051 packages on CRAN.</p>
<pre><code>library(tools)
pkgs <- CRAN_package_db()
nrow(pkgs)
</code></pre>
<pre><code>## [1] 19051</code></pre>
<p>The license information is in the <tt>License</tt> column of the <tt>pkgs</tt> data frame. We'll use the <strong>dplyr</strong> package to help summarise this variable. With <tt>n_distinct()</tt> we find that there are 164 unique licenses!</p>
<pre><code>library(dplyr)
n_distinct(pkgs$License)
</code></pre>
<pre><code>## [1] 164
</code></pre>
<p> However, many of these are different versions of a license, e.g. </p>
<pre><code>pkgs |>
filter(grepl("^MIT", License)) |>
distinct(License)
</code></pre>
<pre><code>
## License
## 1 MIT + file LICENSE
## 2 MIT License + file LICENSE
## 3 MIT + file LICENCE
## 4 MIT + file LICENSE | Apache License 2.0
## 5 MIT +file LICENSE
## 6 MIT + file LICENSE | Unlimited
</code></pre>
<p>The above output also illustrates that:</p>
<ul>
<li>An additional LICENSE (or LICENCE) file can be used to add additional terms to the license (the year and copyright holder in the case of MIT).</li>
<li>Packages can have more than one license (the user can choose any of the alternatives).</li>
<li>Authors do not always provide the license in a standard form!</li>
</ul>
<p>A LICENSE file can also be used on its own to specify a non-standard license. Given this variation in license specification, we will use <tt>transmute()</tt> to create a new set of variables, counting the number of times each type of license appears in the specification. We create a helper function <tt>n_match()</tt> to count the number of matches for a regular expression, which helps to deal with variations in the form provided. Finally, we compare against the expected number of licenses for each package to check we have covered all the options.</p>
<pre><code>n_match <- function(s, x) lengths(regmatches(x, gregexpr(s, x)))
licenses <- pkgs |>
transmute(
ACM = n_match("ACM", License),
AGPL = n_match("(Affero General Public License)|(AGPL)", License),
Apache = n_match("Apache", License),
Artistic = n_match("Artistic", License),
BSD = n_match("(^|[^e])BSD", License),
BSL = n_match("BSL", License),
CC0 = n_match("CC0", License),
`CC BY` = n_match("(Creative Commons)|(CC BY)", License),
CeCILL = n_match("CeCILL", License),
CPL = n_match("(Common Public License)|(CPL)", License),
EPL = n_match("EPL", License),
EUPL = n_match("EUPL", License),
FreeBSD = n_match("FreeBSD", License),
GPL = n_match("((^|[^ro] )General Public License|(^|[^LA])GPL)", License),
LGPL = n_match("(Lesser General Public License)|(LGPL)", License),
LICENSE = n_match("(^|[|] *)file LICEN[SC]E", License),
LPL = n_match("(Lucent Public License)", License),
MIT = n_match("MIT", License),
MPL = n_match("(Mozilla Public License)|(MPL)", License),
Unlimited = n_match("Unlimited", License))
n_license <- n_match("[|]", pkgs$License) + 1
all(rowSums(licenses) == n_license)</code></pre>
<pre><code>## [1] TRUE
</code></pre>
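<p>As a quick sanity check on the helper, here is <tt>n_match()</tt> applied to a few made-up license strings (toy examples, not drawn from CRAN):</p>
<pre><code># lengths() of regmatches() counts non-overlapping matches per element
n_match <- function(s, x) lengths(regmatches(x, gregexpr(s, x)))

lic <- c("GPL-2 | GPL-3", "MIT + file LICENSE", "Apache License 2.0 | GPL-3")
n_match("GPL", lic)
## [1] 2 0 1</code></pre>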
<p>Now we can tally the counts for each license, discounting version differences (i.e., <tt>GPL-2 | GPL-3</tt> would only count once for <tt>GPL</tt>). We convert the license variable into a factor so that we can order by descending frequency in a plot. </p>
<pre><code>tally <- colSums(licenses > 0)
tally_data <-
tibble(license = names(tally),
count = tally) |>
arrange(desc(count)) |>
mutate(license = factor(license, levels = license))</code></pre>
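<p>The plot below was produced with code along these lines (a sketch using <strong>ggplot2</strong>; the exact geometry, theme and labels of the original figure are assumptions):</p>
<pre><code>library(ggplot2)

# Bar chart of license frequency as a percentage of all CRAN packages
ggplot(tally_data, aes(x = license, y = 100 * count / nrow(pkgs))) +
  geom_col() +
  labs(x = "License", y = "% of CRAN packages")</code></pre>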
<center><img src="https://blogs.warwick.ac.uk/images/researchsoftware/2022/04/08/cran_stats.png?maxWidth=500" alt="Bar chart of license frequencies on CRAN as a percentage of the number of packages. The vast majority are GPL (73%), followed by MIT (18%). All other licenses are represented in less than 3% of packages." border="0" /></center>
<p><a class="downloadLink application_vnd_ms_excel" href="https://blogs.warwick.ac.uk/files/researchsoftware/tally_data.csv">tally_data.csv</a><br />
</p>
<p>The vast majority are GPL (73%), followed by MIT (18%). All other licenses are represented in less than 3% of packages. This is consistent with R itself being licensed under GPL-2 | GPL-3. The only licenses in the top 10 that are not mentioned as "in use" on https://www.r-project.org/Licenses/ are the Apache and CC0 licenses, used by 1.7% and 1.1% of packages, respectively. The Apache license is a modern permissive license similar to MIT or the older BSD license, while CC0 is often used for data packages where attribution is not required. A separate LICENSE file is the 3rd most common license among CRAN packages; without exploring further, it is unclear if this is always a stand-alone alternative license (as the specification implies) or if it might sometimes be adding further terms to another license.</p>
<h2>Licenses on Bioconductor</h2>
<p>Bioconductor is the second largest repository of R packages (excluding GitHub, which acts as a more informal repository). Bioconductor specialises in packages for bioinformatics. We can conduct a similar analysis to that for CRAN using the <strong>BiocPkgTools</strong> package. The function to obtain metadata on Bioconductor packages is <tt>biocPkgList()</tt>. With this we find there are currently 2041 packages on Bioconductor:</p>
<pre><code>library(BiocPkgTools)
pkgs <- biocPkgList()
nrow(pkgs)</code></pre>
<pre><code>## [1] 2041</code></pre>
<p>Still, there are 89 distinct licenses among these packages:</p>
<pre><code>n_distinct(pkgs$License)</code></pre>
<pre><code>## [1] 89</code></pre>
<p>We can use the same code as before to tally each license and create a plot; the only change made for the plot below was to exclude licenses that were not represented on Bioconductor.</p>
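<p>After rerunning the tallying code on the Bioconductor metadata, the exclusion step might look like this (a sketch; the object names follow the CRAN code above):</p>
<pre><code># Keep only licenses that appear in at least one Bioconductor package
tally_data <- tally_data |>
  filter(count > 0)</code></pre>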
<center><img src="https://blogs.warwick.ac.uk/images/researchsoftware/2022/04/08/bioc_stats.png?maxWidth=500" alt="Bar chart of license frequencies on Bioconductor as a percentage of the number of packages. GPL is still a popular license, represented by 55% of packages. However the Artistic license is also popular in this community (23%). Third to fifth place are taken by MIT (9%), LGPL (7%) and LICENSE (4%), respectively, with the remaining licenses represented in less than 2% of packages." border="0" /></center>
<p><a class="downloadLink application_vnd_ms_excel" href="https://blogs.warwick.ac.uk/files/researchsoftware/tally_data_bioc.csv">tally_data_bioc.csv</a><br />
</p>
<p>GPL is still a popular license, represented by 55% of packages. However, the Artistic license is also popular in this community (23%). This reflects the fact that the Bioconductor core packages are typically licensed under Artistic-2.0 and community members may follow the lead of the core team. Third to fifth place are taken by MIT (9%), LGPL (7%) and LICENSE (4%), respectively, with the remaining licenses represented in less than 2% of packages. The ACM, BSL, CC0, EUPL, FreeBSD and LPL licenses are unrepresented here.</p>
<h2>Summary</h2>
<p>Although the Bioconductor community is a subset of the R community, it has different norms regarding package licenses. In both communities though, the GPL is a common choice for licensing R packages, consistent with the license choice for R itself.</p>
<p><i>Posted in Code, Licenses, R on Fri, 08 Apr 2022 by Heather Turner.</i></p>

Advent of Code 2021: First days with R by Heather Turner
https://blogs.warwick.ac.uk/researchsoftware/entry/advent_of_code/
<p>The <a href="https://adventofcode.com">Advent of Code</a> is a series of daily programming puzzles running up to Christmas. On 3 December, the <a href="https://www.meetup.com/Warwick-useRs">Warwick R User Group</a> met jointly with the <a href="https://personalpages.manchester.ac.uk/staff/david.selby/rthritis.html">Manchester R-thritis Statistical Computing Group</a> to informally discuss our solutions to the puzzles from the first two days. Some of the participants shared their solutions in advance, collected in this <a href="https://personalpages.manchester.ac.uk/staff/david.selby/rthritis/2021-12-03-advent2021/">slide deck</a>.</p>
<p>In this post, Heather Turner (RSE Fellow, Statistics) shares her solutions and how they can be improved based on the ideas put forward at the meetup by David Selby (Manchester R-thritis organizer) and others, while James Tripp (Senior Research Software Engineer, Information and Digital Group) reflects on some issues raised in the meetup discussion.</p>
<h2>R Solutions for Days 1 and 2</h2>
<p> <i>Heather Turner</i></p>
<h3>Day 1: Sonar Sweep</h3>
<p>For a full description of the problem for Day 1, see <a href="https://adventofcode.com/2021/day/1">https://adventofcode.com/2021/day/1</a>.</p>
<h4>Day 1 - Part 1</h4>
<p>How many measurements are larger than the previous measurement?</p>
<pre><code>199 (N/A - no previous measurement)
200 (increased)
208 (increased)
210 (increased)
200 (decreased)
207 (increased)
240 (increased)
269 (increased)
260 (decreased)
263 (increased)</code></pre>
<p>First create a vector with the example data:</p>
<pre><code>x <- c(199, 200, 208, 210, 200, 207, 240, 269, 260, 263)</code></pre>
<p>Then the puzzle can be solved with the following R function, that takes the vector <code>x</code> as input, uses <code>diff()</code> to compute differences between consecutive values of <code>x</code>, then sums the differences that are positive:</p>
<pre><code>f01a <- function(x) {
dx <- diff(x)
sum(sign(dx) == 1)
}
f01a(x)</code></pre>
<pre><code>## [1] 7</code></pre>
<p>Inspired by <a href="https://personalpages.manchester.ac.uk/staff/david.selby/rthritis/2021-12-03-advent2021/resources/adventofcode.html#7">David Selby’s solution</a>, this could be made slightly simpler by finding the positive differences with <code>dx > 0</code>, rather than using the <code>sign()</code> function.</p>
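<p>With the example data, that simplification reduces Part 1 to a one-liner:</p>
<pre><code>x <- c(199, 200, 208, 210, 200, 207, 240, 269, 260, 263)
f01a_revised <- function(x) sum(diff(x) > 0)
f01a_revised(x)
## [1] 7</code></pre>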
<h4>Day 1 - Part 2</h4>
<p>How many <strong>sliding three-measurement sums</strong> are larger than the previous sum?</p>
<pre><code>199 A 607 (N/A - no previous sum)
200 A B 618 (increased)
208 A B C 618 (no change)
210 B C D 617 (decreased)
200 E C D 647 (increased)
207 E F D 716 (increased)
240 E F G 769 (increased)
269 F G H 792 (increased)
260 G H
263 H</code></pre>
<p>This can be solved by the following function of <code>x</code>. First, the rolling sums of three consecutive values are computed in a vectorized computation, i.e. creating three vectors containing the first, second and third value in the sum, then adding the vectors together. Then, the function from Part 1 is used to sum the positive differences between these values.</p>
<pre><code>f01b <- function(x) {
n <- length(x)
sum3 <- x[1:(n - 2)] + x[2:(n - 1)] + x[3:n]
f01a(sum3)
}
f01b(x)</code></pre>
<pre><code>## [1] 5</code></pre>
<p>David Schoch put forward <a href="https://personalpages.manchester.ac.uk/staff/david.selby/rthritis/2021-12-03-advent2021/resources/adventofcode.html#8">a solution</a> that takes advantage of the fact that the difference between consecutive rolling sums of three values is just the difference between values three places apart (the second and third values in the first sum cancel out the first and second values in the second sum). Putting what we’ve learnt together gives this much neater solution for Day 1 Part 2:</p>
<pre><code>f01b_revised <- function(x) {
dx3 <- diff(x, lag = 3)
sum(dx3 > 0)
}
f01b_revised(x)</code></pre>
<pre><code>## [1] 5</code></pre>
<h3>Day 2: Dive!</h3>
<p>For a full description of the problem see <a href="https://adventofcode.com/2021/day/2">https://adventofcode.com/2021/day/2</a>.</p>
<h4>Day 2 - Part 1</h4>
<ul>
<li><code>forward X</code> increases the horizontal position by <code>X</code> units.</li>
<li><code>down X</code> increases the depth by <code>X</code> units.</li>
<li><code>up X</code> decreases the depth by <code>X</code> units.</li>
</ul>
<pre><code>              horizontal   depth
forward 5 -->      5         -
down 5    -->                5
forward 8 -->     13
up 3      -->                2
down 8    -->               10
forward 2 -->     15
==> horizontal = 15, depth = 10</code></pre>
<p>First create a data frame with the example data</p>
<pre><code>x <- data.frame(direction = c("forward", "down", "forward",
"up", "down", "forward"),
amount = c(5, 5, 8, 3, 8, 2))</code></pre>
<p>Then the puzzle can be solved with the following function, which takes the variables <code>direction</code> and <code>amount</code> as input. The horizontal position is the sum of the amounts where the direction is “forward”. The depth is the sum of the amounts where direction is “down” minus the sum of the amounts where direction is “up”.</p>
<pre><code>f02a <- function(direction, amount) {
horizontal <- sum(amount[direction == "forward"])
depth <- sum(amount[direction == "down"]) - sum(amount[direction == "up"])
c(horizontal = horizontal, depth = depth)
}
f02a(x$direction, x$amount)</code></pre>
<pre><code>## horizontal depth
## 15 10</code></pre>
<p>The code above uses logical indexing to select the amounts that contribute to each sum. <a href="https://personalpages.manchester.ac.uk/staff/david.selby/rthritis/2021-12-03-advent2021/resources/adventofcode.html#16">An alternative approach</a> from David Selby is to coerce the logical indices to numeric (coercing <code>TRUE</code> to 1 and <code>FALSE</code> to 0) and multiply the amount by the resulting vectors as required:</p>
<pre><code>f02a_selby <- function(direction, amount) {
horizontal_move <- amount * (direction == 'forward')
depth_move <- amount * ((direction == 'down') - (direction == 'up'))
c(horizontal = sum(horizontal_move), depth = sum(depth_move))
}</code></pre>
<p>Benchmarking on 1000 datasets of 1000 rows this alternative solution is only marginally faster (an average run-time of 31 μs vs 37 μs), but it has an advantage in Part 2!</p>
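<p>The benchmark was run along the following lines (a sketch using the <strong>microbenchmark</strong> package; the simulated data and settings shown here are assumptions, not the exact benchmark from the meetup):</p>
<pre><code>library(microbenchmark)

set.seed(42)
sim <- data.frame(
  direction = sample(c("forward", "down", "up"), 1000, replace = TRUE),
  amount    = sample(1:10, 1000, replace = TRUE)
)
# Compare logical indexing (f02a) with coercion to numeric (f02a_selby)
microbenchmark(
  indexing = f02a(sim$direction, sim$amount),
  coercion = f02a_selby(sim$direction, sim$amount),
  times = 1000
)</code></pre>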
<h4>Day 2 - Part 2</h4>
<ul>
<li><code>down X</code> increases your aim by <code>X</code> units.</li>
<li><code>up X</code> decreases your aim by <code>X</code> units.</li>
<li><code>forward X</code> does two things: <ul>
<li>It increases your horizontal position by <code>X</code> units.</li>
<li>It increases your depth by your aim <strong>multiplied by</strong> <code>X</code>.</li>
</ul>
</li>
</ul>
<pre><code>              horizontal   aim   depth
forward 5 -->      5        -      -
down 5    -->               5
forward 8 -->     13              40
up 3      -->               2
down 8    -->              10
forward 2 -->     15              60
==> horizontal = 15, depth = 60</code></pre>
<p>The following function solves this problem by first computing the sign of the change to aim, which is negative if the direction is “up” and positive otherwise. Then for each change in position, if the direction is “forward” the function adds the amount to the horizontal position and the amount multiplied by aim to the depth, otherwise it adds the sign multiplied by the amount to the aim.</p>
<pre><code>f02b <- function(direction, amount) {
horizontal <- depth <- aim <- 0
sign <- ifelse(direction == "up", -1, 1)
for (i in seq_along(direction)){
if (direction[i] == "forward"){
horizontal <- horizontal + amount[i]
depth <- depth + aim * amount[i]
next
}
aim <- aim + sign[i]*amount[i]
}
c(horizontal = horizontal, depth = depth)
}
f02b(x$direction, x$amount)</code></pre>
<pre><code>## horizontal depth
## 15 60</code></pre>
<p>R is an interpreted language, so for loops can be slow and vectorized solutions are often preferable if memory is not an issue. David Selby showed that his solution from Part 1 can be extended to solve the problem in Part 2, by using cumulative sums of the value that represented <code>depth</code> in Part 1 to compute the <code>aim</code> value in Part 2.</p>
<pre><code>f02b_revised <- function(direction, amount) {
horizontal_move <- amount * (direction == "forward")
aim <- cumsum(amount * (direction == "down") - amount * (direction == "up"))
depth_move <- aim * horizontal_move
c(horizontal = sum(horizontal_move), depth = sum(depth_move))
}
f02b_revised(x$direction, x$amount)</code></pre>
<pre><code>## horizontal depth
## 15 60</code></pre>
<p>Benchmarking these two solutions on 1000 data sets of 1000 rows, the vectorized solution is ten times faster (on average 58 μs vs 514 μs).</p>
<h2>Reflections</h2>
<p><i>James Tripp</i></p>
<p>How do we solve a problem with code? Writing an answer requires what some educators call computational thinking. We systematically conceptualise the solution to a problem and then work through a series of steps, drawing on coding conventions, to formulate an answer. Each answer is different and, often, a reflection of our priorities, experience, and domains of work. In our meeting, it was wonderful to see people with a wide range of experience and differing interests.</p>
<p>Our discussion considered the criteria of a ‘good solution’.</p>
<ul>
<li><strong>Speed</strong> is one criterion of success - a solution which takes 100 μs (microseconds) is better than a solution taking 150 μs.</li>
<li><strong>Readability</strong> for both sharing with others (as done above) and to help future you, lest you forget the intricacies of your own solution.</li>
<li><strong>Good practice</strong> such as variable naming and, perhaps, avoiding for loops where possible. Loops are slower and somewhat discouraged in the R community. However, some would argue they are more explicit and helpful for those coming from other languages, such as Python.</li>
<li><strong>Debugging friendly.</strong> Some participants, including Heather Turner and David Selby, checked their solutions with tests comparing known inputs and outputs. I drew on my Psychology experience and opted for an explicit DataFrame where I can see each operation. Testing is almost certainly the better approach, and one I adopt in my packages.</li>
<li><strong>Generalisability.</strong> A solution tailored for the Part 1 task on a given day may not be easily generalisable for the Part 2 task. It seemed desirable to refactor one’s code to create a solution which encompasses both tasks. However, the effort and benefits of doing so are certainly debatable.</li>
</ul>
<p>We also discussed levels of abstraction. The <a href="https://www.tidyverse.org/">tidyverse family of R packages</a> is powerful, high-level and quite opinionated. Using tidyverse functions returned some intuitive, but slower solutions where we were unsure of the bottlenecks. Solutions built on R base (the functions which come with R) were somewhat faster, though others using libraries such as data.table were also rather quick. These reflections are certainly generalisations and prompted some discussion.</p>
<p>How does one produce fast, readable, debuggable, generalisable code which follows good practice and operates at a suitable level of abstraction? Our discussions did not produce a definitive answer. Instead, discussing and sharing solutions helped us understand the pros and cons of different approaches, and I certainly learned a few useful tricks.</p>
<p><i>Posted in Code, Programming, R on Wed, 08 Dec 2021 by Heather Turner.</i></p>