Short best practice blueprint for sharing code scripts with academic outputs #186

adamkucharski · 2024-02-15T10:44:02Z

adamkucharski
Feb 15, 2024
Maintainer

Description

As a paper reviewer, I increasingly see code shared on git repositories alongside papers (which is a good thing) but in an often impenetrable way (no README, minimal comments, not much file structure, unclear modularisation of code, no knitted vignettes, no licence).

I wonder if there is an opportunity to provide a short document giving contributors' opinions on best practice for sharing analysis code that isn't necessarily a full package. This isn't about asking users to fundamentally redesign their analysis code – rather, it is about ensuring that the code is clearly documented and structured, so that it is easy to understand and reproduce.

It's analogous to similar discussions we're having with @joshwlambert and @CarmenTamayo in {epiparameter} – at one end, we have best practice for estimating parameters in the first place (more demanding on users) and at the other we have best practice for reporting (less demanding, but still valuable for removing reuse obstacles).

Typical end-users

Researchers publishing preprints/papers in outbreak analysis

Potential contributors

Others interested in best practice for code sharing

Key collaborators

Colleagues at LSHTM and beyond

Inputs

NA (not a package)

Outputs

NA (not a package)

Imports

NA

Used by

NA

Related projects

Model share programme (via @jamesmbaazam): https://sciencegateways.org/networking-community/community-news/n/introducing-modelshare-program

Repo quality metrics (@Bisaloo et al): WHO-Collaboratory/collaboratory-epipipeline-community#6

CODECHECK project: https://codecheck.org.uk/

Usage

NA

(Although perhaps could suggest running 'repo health' functions if available from above WHO collaboratory project)
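
As a rough illustration of what a 'repo health' check could look for, here is a minimal sketch in Python (chosen only to keep the example independent of any particular R tooling). It is not the WHO Collaboratory project's code: the function name `check_repo` and the three checks are assumptions made for illustration, based on the gaps listed in the Description (no README, no licence, not much file structure).

```python
from pathlib import Path


def check_repo(root):
    """Return a dict mapping each (hypothetical) check name to True or False."""
    root = Path(root)
    # Compare names case-insensitively so README.md, Readme.txt, etc. all count.
    files = [p.name.lower() for p in root.iterdir() if p.is_file()]
    dirs = [p.name.lower() for p in root.iterdir() if p.is_dir()]
    return {
        "readme": any(name.startswith("readme") for name in files),
        # Accept both spellings of the licence file name.
        "licence": any(name.startswith(("license", "licence")) for name in files),
        # Ignore hidden directories such as .git when looking for file structure.
        "subdirectories": any(not name.startswith(".") for name in dirs),
    }
```

Calling `check_repo(".")` from the top of a repository returns one entry per check, which a reviewer or author could scan before submission. The checks only test that files and folders exist, not that their contents are useful.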

Additional comments

...

pratikunterwegs · 2024-02-15T12:08:57Z

pratikunterwegs
Feb 15, 2024

Thanks for raising @adamkucharski - I think this is pretty important. Just putting down some thoughts here - might sound a bit basic for some and I've mostly picked these up informally. Here's an example of how I've previously structured repos for submission to journals: "Source Code and Supplementary Material for: A Guide to Pre-processing High-throughput Animal Tracking Data"

Some basic steps, in addition to a clear Readme:

Separate package code from analysis code; especially important for modelling papers where it's tempting to source() a local model function;
Have directories with clear naming: data/ (for raw input data), scripts/, output/ (to save statistical model fits, processed data, etc.), figure_scripts/, figures/;
Write supplementary material as .Rmd files for easy submission as LaTeX or Word files; optionally separate the main text analyses from the supplementary material;
Mention any special instructions to install packages (e.g. using Rtools on Windows), or to promote reproducibility (e.g. take care when using multithreaded options), or when code requires HPC cluster use (example here).
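
The directory layout in the steps above can be sketched as a small scaffolding script. This is an illustrative Python sketch rather than a recommended tool: the directory names come from the list above, while the function name `scaffold_repo`, the `README.md` file name, its placeholder text, and the `.gitkeep` placeholder files are assumptions.

```python
from pathlib import Path

# Directory names taken from the list above.
LAYOUT = ("data", "scripts", "output", "figure_scripts", "figures")


def scaffold_repo(root):
    """Create the analysis directories and a placeholder README under `root`."""
    root = Path(root)
    for name in LAYOUT:
        directory = root / name
        directory.mkdir(parents=True, exist_ok=True)
        # Git does not track empty directories, so add a placeholder file
        # (the .gitkeep name is a common convention, assumed here).
        (directory / ".gitkeep").touch()
    readme = root / "README.md"
    # Never overwrite a README that already exists.
    if not readme.exists():
        readme.write_text("# Project title\n\nWhat the code does and how to run it.\n")
    return [root / name for name in LAYOUT]
```

Running `scaffold_repo("my-analysis")` is safe to repeat: existing directories and an existing README are left untouched.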

Some extra steps I take, which definitely take time but are worth it for oneself and the wider community imo:

Knit vignettes as a {pkgdown} website when papers describe new methods. Need not actually run the code, which could be quite time consuming;
Make GitHub releases so reviewers are aware which version of the code is being referenced, and can see changes between versions;
Link GH repo with Zenodo so the analysis code is persistent and cite-able with a DOI;
Optionally, run scheduled workflows to demonstrate the code actually works (can be turned off after a set time, e.g. GH disables scheduled workflows in public repos after 60 days without repository activity iirc).

3 replies

avallecam Feb 27, 2024
Collaborator

Thanks @adamkucharski for filing this discussion and @pratikunterwegs for sharing your steps. Your vision and resources partially match, and can complement, what we intended to share in the last IDDconf workshop, with material hosted in the research-compendium tutorial repo.

The keypoints of the tutorial offer a summary of the learning goals. We cover using a research-compendium template for folder structure with {rcompendium} and {usethis}, reproducible analysis with {renv}, and README standards, covering introductory tools to progressively increase the reproducible science, sustainable research and open science features of an analysis project.

The wrap-up includes a self-assessment template to compare the learned features with the JOSS review checklist.

This has an appendix where we provide extended tools to create a github page of the repo with {pkgdown} and templates for manuscript writing with {rrtools}. This could be a nice place to host the extra steps listed and not yet included like gh-releases, link with Zenodo for DOI, and workflows (which can also be added with {usethis}).

After some updates of the first content delivery, we can open a full repo review. Happy to read your thoughts.

adamkucharski Feb 27, 2024
Maintainer Author

Yes, I've been thinking about where would be best for the above to live. One option is a 'recommendation' type journal article, or a 'best practice' one (like a PLOS CB 'ten simple rules' piece). But given we're pointing to lots of code tools, a short vignette or post probably works best for now. The training is useful for new users, but I think it would also be worth having a shorter summarised version for those who already use R quite a bit but perhaps haven't thought about best practice for sharing, and what tools are available.

For such a guide, we should also consider where the biggest marginal gains are to be had vs user effort, and maybe structure the guidelines from quick wins to more advanced options (e.g. it's possible to set up a repo without installing git via R, and would be better to have lots of published work with well structured repos via desktop than relatively few via R!)

avallecam Feb 28, 2024
Collaborator

I agree with the approach for a different audience. We can write a blog post with the approach, selecting key tools from the tutorial plus more specialised extensions. We can end up redirecting to the tutorial material for step-by-step guides, and also let readers know that it is already compiled as a workshop they can run locally using our materials.

sbfnk · 2024-02-28T09:42:18Z

sbfnk
Feb 28, 2024
Maintainer

I think some general recommendations/suggestions would be good, but also think there is potentially scope for a slightly more opinionated definition of "best practice" (as e.g. in the EPIFORGE guidelines) that could be pointed to by paper reviewers.

2 replies

sbfnk Feb 29, 2024
Maintainer

Also there's a really good existing article on Good enough practices in scientific computing.

avallecam Feb 29, 2024
Collaborator

Good indeed. Actually, that is the reference we based the materials on, and the one we refer to in the self-assessment section above 👍 {rcompendium} and {rrtools} aim to provide specific R tools for the practices in the Wilson et al. paper. In the tutorial, we used mainly {rcompendium} (turned out to be more stable and with no defaults when using it across projects) and suggested the manuscript feature from {rrtools} as an add-in to the workflow process.

kathsherratt · 2024-03-01T16:14:13Z

kathsherratt
Mar 1, 2024

Thanks @sbfnk for pointing me to this post. I had a couple of more or less relevant thoughts :)

Firstly - similar to @avallecam I thought I'd chip in to add an experience teaching on this theme (also with @avallecam and @bquilty25). I recently developed a workshop on "R for research - intro to good practices". It broadly covered the same principles as the Wilson et al paper (modularity, documentation etc), but was pitched slightly differently than the rcompendium material - more going for the kind of quick wins as @adamkucharski mentioned above. The workshop was internal to LSHTM and had a mix of mostly early career staff with a few students, nearly all regular R users.

The feedback from course attendees was that the material was either new to them, or that it was practice they'd seen and had maybe partly tried to adopt but without having any structured motivation or guidance for doing so (e.g. directory set up or having a README). They especially liked having lots of small steps/packages that they could use straightaway to improve existing code. In general there was loads of interest in the material, and I think there's definitely unmet demand for making this really simple and accessible.

Secondly, I had a thought which is more along the lines of @sbfnk's suggestion for an opinionated piece. Personally I feel I'm missing, and would love to see, opinions/guidance on when to leave code for a paper as it is, and when to split it out into a package. I've struggled with this in a couple of papers that may/may not be re-used, and wondered whether the extra steps for package dev are worth it (simple as they may be with devtools etc). A guidance piece that included that would be really nice to see.

0 replies

adamkucharski · 2024-03-04T13:14:33Z

adamkucharski
Mar 4, 2024
Maintainer Author

They especially liked having lots of small steps/packages that they could use straightaway to improve existing code.

This is a really nice aspect to provide for users/readers. That 'to package or not to package' is something that's come up in a few other project discussions recently, agree could be useful to include. Shall we start with a draft blog post outlining key points/opinions for best practice, then can decide whether to keep as online post (faster, informal) or refine into a publication (slower, more formal)?

0 replies

joshwlambert · 2024-03-04T14:12:15Z

joshwlambert
Mar 4, 2024
Collaborator

Interesting thread. I agree with many points already mentioned. I wonder if any blog post/paper we write should be more of a how-to guide to actively facilitate good practices, by pointing to existing tools and explaining step-by-step procedures that are crucial to reproducibility, e.g. version releases (DOIs) & repository structure. It will be easier if we focus on best practices for R rather than research software or analysis code in general, but that would of course limit the readership of the piece.

It's analogous to similar discussions we're having with @joshwlambert and @CarmenTamayo in {epiparameter} – at one end, we have best practice for estimating parameters in the first place (more demanding on users) and at the other we have best practice for reporting (less demanding, but still valuable for removing reuse obstacles).

This is a good point because it is not clear where to draw the line on "best practices" for code sharing. Code is often shared in a very rough state, and there are lots of small easy wins to improve this that a how-to guide could easily assist with. However, there is also an optimal method of code sharing, which could involve containers (e.g. Docker) and tools like {renv} that are not as easy for all researchers to quickly pick up and use.

0 replies

TimTaylor · 2024-03-07T11:43:37Z

TimTaylor
Mar 7, 2024

This seems like it might be relevant. Not read but sharing anyways ...
https://carpentries.org/blog/2024/03/good-enough-practices-carpentries-lab/

2 replies

avallecam Mar 7, 2024
Collaborator

this looks interesting to review, I was not aware of it. thanks for sharing!

jamesmbaazam Mar 7, 2024
Collaborator

@TimTaylor You beat me to it. 😄


Epiverse-TRACE

Replies: 6 comments 7 replies
