Sunday, September 3, 2023

Running Tarsnap

This post documents how to run Tarsnap on macOS. Tarsnap is an online backup service that runs on top of Amazon's cloud infrastructure. If you are reading this, then you probably already know a good deal about it, but it bears repeating that it is a cool program.

First, we need to install it, which is easily done with Homebrew.

brew install tarsnap

It can also be compiled from source, but that is outside the scope of this post.

Once it is installed, it is a good idea to do a dry run with statistics (--dry-run together with --print-stats) to see how much space the backup will take and what the compression ratio is. Tarsnap de-duplicates the data and stores only the unique bytes.

For example, to see what uploading only the .pdf and .Rdata files from the "Analysis" folder would cost, we can run the following command.

find /Users/xyz/Analysis -type f \( -name '*.pdf' -o -name '*.Rdata' \) -print0 | tarsnap --dry-run --no-default-config --print-stats --humanize-numbers -c --null -T-


This command does a lot of things. First, find restricts the search to regular files with "-type f" and then keeps only the files whose names end in .Rdata or .pdf. Notice the use of -o as the "or" operator within the find command. Because -o binds more loosely than the implicit "and" between the other tests, the alternatives must be enclosed in escaped parentheses so that "-type f" and "-print0" apply to all of the file types. We use "-print0" to separate the filenames with the null character, so the list won't break on spaces or other unusual characters in the filenames. The list is then piped to tarsnap. The key option here is --null, which tells tarsnap that the names arrive null-separated, as produced by "-print0". The man page describes --null and -T as follows:
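To see why the parentheses matter, here is a small sketch that can be run safely in a scratch directory. The file names are made up for illustration; only the find expression mirrors the command above.

```shell
# Build a throwaway directory tree so the find expression can be tried safely.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/Analysis/sub dir"
touch "$demo_dir/Analysis/report.pdf" \
      "$demo_dir/Analysis/sub dir/model fit.Rdata" \
      "$demo_dir/Analysis/notes.txt"
mkdir "$demo_dir/Analysis/folder.pdf"   # a directory whose name ends in .pdf

# Grouped: -type f and -print0 apply to both name tests,
# so the two regular files are printed and the directory is skipped.
grouped=$(find "$demo_dir/Analysis" -type f \( -name '*.pdf' -o -name '*.Rdata' \) -print0 \
  | tr '\0' '\n' | wc -l | tr -d ' ')

# Ungrouped: parsed as (-type f AND -name '*.pdf') OR (-name '*.Rdata' AND -print0),
# so only the .Rdata file is printed.
ungrouped=$(find "$demo_dir/Analysis" -type f -name '*.pdf' -o -name '*.Rdata' -print0 \
  | tr '\0' '\n' | wc -l | tr -d ' ')

echo "grouped: $grouped, ungrouped: $ungrouped"
rm -rf "$demo_dir"
```

Without the parentheses the .pdf files silently drop out of the list, which is exactly the kind of mistake a dry run helps to catch.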

     --null  (use with -I, -T, or -X) Filenames or patterns are separated by
             null characters, not by newlines.  This is often used to read
             filenames output by the -print0 option to find(1).

     -T filename
             (c, x, and t modes only) In x or t mode, tarsnap will read the
             list of names to be extracted from filename.  In c mode, tarsnap
             will read names to be archived from filename.  The special name
             “-C” on a line by itself will cause the current directory to be
             changed to the directory specified on the following line.  Names
             are terminated by newlines unless --null is specified.  Note that
             --null also disables the special handling of lines containing
             “-C”.  If filename is “-” then the list of names will be read
             from the standard input.  Note:  If you are generating lists of
             files using find(1), you probably want to use -n as well.

So the "-" following the "-T" option tells tarsnap to read the names from standard input, which is where find sends them. Once we run the command, we should see output like this.

tarsnap: Removing leading '/' from member names
                                       Total size  Compressed size
All archives                               8.4 MB           3.4 MB
  (unique data)                            8.4 MB           3.4 MB
This archive                               8.4 MB           3.4 MB
New data                                   8.4 MB           3.4 MB

This is just a test. To actually run Tarsnap, we need to register for an account and set up some configuration, for which we need:

  1. tarsnap.conf
  2. tarsnap.key
  3. a cache directory

If you installed tarsnap with Homebrew, then a sample configuration file, tarsnap.conf.sample, will be in /opt/homebrew/etc (the default Homebrew prefix on Apple Silicon Macs), so just copy that file to your home directory using

cp /opt/homebrew/etc/tarsnap.conf.sample ~/tarsnap.conf

According to the documentation (https://www.tarsnap.com/gettingstarted.html#configuration-file), if you would prefer to run Tarsnap as a normal user, the configuration goes in ~/.tarsnaprc instead:

 cp /opt/homebrew/etc/tarsnap.conf.sample ~/.tarsnaprc

Since we will be running it as a normal user, we use this alternative. The "tarsnap.key" file is generated when the computer is registered with the Tarsnap server.


sudo tarsnap-keygen \
    --keyfile /Users/xyz/tarsnap.key \
    --user me@example.com \
    --machine mybox

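Because tarsnap-keygen was run with sudo, the key file belongs to root. Make sure the key can be read by your user, otherwise the backup won't work. Avoid making it world-readable, though, since anyone who can read the key can read the backups.

     sudo chown "$USER" /Users/xyz/tarsnap.key
     chmod 600 /Users/xyz/tarsnap.key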


Now let us set up the cache directory.

     mkdir -p /Users/xyz/tarsnap_cache
     chmod 700 /Users/xyz/tarsnap_cache
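As a minimal sketch, the two settings that matter in the configuration file can be written out like this. The paths are the ones used in this post. RC_FILE is a hypothetical variable that defaults to a scratch file, so the snippet can be tried without touching a real ~/.tarsnaprc.

```shell
# Assumption: key and cache live at the paths used in this post; adjust as needed.
# RC_FILE is a hypothetical variable; by default the sketch writes to a scratch file.
RC_FILE="${RC_FILE:-$(mktemp)}"

cat > "$RC_FILE" <<'EOF'
# Key generated by tarsnap-keygen
keyfile /Users/xyz/tarsnap.key
# Directory where tarsnap keeps its local cache
cachedir /Users/xyz/tarsnap_cache
EOF

cat "$RC_FILE"
```

Once the values look right, the same two lines can go into ~/.tarsnaprc, either by editing the copied sample or by running the snippet with RC_FILE set to that path.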


Make sure your .tarsnaprc file points to the location of the key (keyfile) and the cache directory (cachedir). If all goes right, use this command to run the real backup:


find /Users/xyz/Analysis -type f \( -name '*.R' -o -name '*.pdf' -o -name '*.Rdata' \) -print0 | tarsnap --print-stats -c -f "analysis_back-$(date +%Y-%m-%d_%H-%M-%S)" --null -T-


Notice that we added $(date +%Y-%m-%d_%H-%M-%S) so the archive name records when the backup was made. Tarsnap won't allow us to create a second archive with an existing name, and it won't delete an archive unless explicitly told to do so. This command also backs up R scripts, in addition to the .pdf and .Rdata files.

We can list the archives using this command:
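The naming scheme can be tried on its own before running a backup. In this small sketch, archive_name is a hypothetical helper and not part of tarsnap.

```shell
# Hypothetical helper: build an archive name from a prefix plus the current timestamp.
archive_name() {
  printf '%s-%s\n' "$1" "$(date +%Y-%m-%d_%H-%M-%S)"
}

name=$(archive_name analysis_back)
echo "$name"

# The name would then be passed to tarsnap with -f, for example:
#   tarsnap -c -f "$name" ...
# and an archive is only removed when asked for explicitly:
#   tarsnap -d -f "$name"
```

Because the timestamp goes down to the second, two backups started at different times get different names.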

tarsnap --list-archives

Now we can set up launchd (https://web.archive.org/web/20230627074009/https://www.launchd.info/) to run that command every week or every day as needed. Hopefully, this will help someone who is looking to set up Tarsnap on macOS. That someone will most likely be me.
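As a rough sketch of the launchd side, the snippet below writes a job definition that runs a backup script every Sunday at 03:00. The label com.example.tarsnap-backup and the script path /Users/xyz/bin/tarsnap_backup.sh are hypothetical placeholders. The script is assumed to contain the find | tarsnap command from this post. PLIST is also a hypothetical variable that defaults to a scratch file.

```shell
# Assumptions: the label and the script path are hypothetical placeholders.
# PLIST defaults to a scratch file; a real agent would live in ~/Library/LaunchAgents/.
PLIST="${PLIST:-$(mktemp)}"

cat > "$PLIST" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.example.tarsnap-backup</string>
    <key>ProgramArguments</key>
    <array>
        <string>/bin/sh</string>
        <string>/Users/xyz/bin/tarsnap_backup.sh</string>
    </array>
    <!-- Run every Sunday (Weekday 0) at 03:00 -->
    <key>StartCalendarInterval</key>
    <dict>
        <key>Weekday</key>
        <integer>0</integer>
        <key>Hour</key>
        <integer>3</integer>
        <key>Minute</key>
        <integer>0</integer>
    </dict>
</dict>
</plist>
EOF

echo "wrote $PLIST"
# On macOS the file can be checked and loaded with:
#   plutil -lint "$PLIST"
#   launchctl load ~/Library/LaunchAgents/com.example.tarsnap-backup.plist
```

Dropping the Weekday key should make the job run daily instead of weekly. launchd jobs typically run with a minimal PATH, so the script should probably call tarsnap by its full path.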




Saturday, June 24, 2023

Installing R packages

Using R packages can be fun, but installing them can sometimes be difficult. The problem usually arises when the installation needs devtools or a dependency from one of the Bioconductor repositories. Usually the instructions look like this.


# Required packages
install.packages("devtools")
install.packages("BiocManager")



# Install package here
devtools::install_github("xxxx/xxxx",
                         dependencies = c("Depends", "Imports", "LinkingTo"),
                         repos = c("https://cloud.r-project.org/",
                                   BiocManager::repositories()))


devtools::install_github("XXX/XXX")

That devtools command results in something like the output below.



Using github PAT from envvar GITHUB_PAT
Error: Failed to install 'SPRING' from GitHub:
HTTP error 401.
Bad credentials


Rate limit remaining: 59/60
Rate limit reset at: 2023-06-23 04:15:37 UTC


The problem is that devtools talks to the GitHub API with a Personal Access Token (PAT), and here the token found in the GITHUB_PAT environment variable was rejected (HTTP 401, "Bad credentials"). This is a really annoying problem. Also, installing through Bioconductor will sometimes update other packages. Of course, it prompts before it does so, but checking every time can be cumbersome. And what if one needs a package, or a package version, that is no longer available in the latest version of Bioconductor? One can set the version when installing from Bioconductor, but I still have trouble finding the right package version.

 
The method I am going to describe is pretty simple but needs some manual work, which I think is worth it because it should not disturb your existing installations. Actually Karl Broman has nicely described this method here: 


For Github hosted R packages:
    • Download the zip file and save it locally. 
    • Unzip the file and remove the branch suffix from the folder name. For example, for NetCoMi-main.zip, remove the "-main" part. Now, drag the folder into the terminal or type

R CMD build /Users/ashish/Downloads/NetCoMi

    • This prepares the package for installation and results in the file NetCoMi_1.1.0.tar.gz. This is the tarball that is ready to be installed on your computer. Install it using this command. 


R CMD INSTALL NetCoMi_1.1.0.tar.gz


You will find that it sometimes fails because dependencies are not installed. Either install them using the same method I am describing or use the install.packages() function. Repeat until the package can be installed. 
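The manual steps above can be sketched in the shell as follows. strip_branch_suffix is a hypothetical helper that only handles the "-main" and "-master" suffixes. The R commands are left as comments because they need the downloaded package and a working R installation.

```shell
# Hypothetical helper: strip the branch suffix from the unzipped folder name.
strip_branch_suffix() {
  name=$1
  name=${name%-main}
  name=${name%-master}
  printf '%s\n' "$name"
}

pkg=$(strip_branch_suffix "NetCoMi-main")
echo "$pkg"

# With the folder renamed, the build and install steps would be:
#   mv NetCoMi-main "$pkg"
#   R CMD build "$pkg"
#   R CMD INSTALL "$pkg"_*.tar.gz
```

If the install step stops on a missing dependency, install that package first and run the last command again.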

For Bioconductor hosted packages: 


The bottom line is to note all the packages that are used for an analysis, since updates can sometimes change the output. 

Comparing R and Python

 I have used R for quite some time for data analysis. Especially with the use of Tidyverse package, it has been a very decent experience. Gg...