Tuesday, April 9, 2024

Comparing R and Python

 I have used R for quite some time for data analysis. Especially with the use of Tidyverse package, it has been a very decent experience. Ggplot2 package for plotting is mostly intuitive. Synergy of Tidyverse ecosystem along with availability of bioinformatics and statistical analysis software with the R platform, it is an awesome combination. 

Recently, I have wondered to try out Python for my daily microbiome data analysis. Julia was another option but for some reason, it feels still incomplete. There have been decent attempt to replicate the tidyverse package in Julia such as Tidier.jl https://github.com/TidierOrg/Tidier.jl. However, it still feels work in progress. 

There have been time when trying to code with more defensive approach in R has lead to very cumbersome code.  For example when trying to apply try and catch statement. This example was generated using chatGPT 4 which was similar to my use case:


Sunday, September 3, 2023

Running Tarsnap

 This post is to document the procedure to run Tarsnap on MacOS. Tarsnap is basically an online service that runs on top of Amazon cloud infrastructure for online backup. If you are reading this than you already know much about it but needless to say it is a cool program.

First thing, we need to install it which can be easily accomplished using homebrew. 

brew install tarsnap

It can also be installed by compiling it from source but it is outside the scope of this post. 

Once installed, it is a good idea to run the dry-stat command to see how much space it will take and what is the compression ratio. It is supposed to de-duplicate the data and store the those unique bytes. 

For example, to see if we want to only upload pdf and .Rdata files from "Analysis" folder, we can run the following command. 

find /Users/xyz/Analysis -type f \( -name '*.pdf' -o -name '*.Rdata' \) -print0 | tarsnap --dry-run --no-default-config --print-stats --humanize-numbers -c --null -T-

Now this command is doing lot of things, first it is "finding" files with "-type f" and then finds only files  ending with Rdata and pdf extension. Notice the use of -o to indicate "or" operator within find command. If we have more than two file type extensions, then we need to use parenthesis to enclose all the files types. We are using "-print0" to separate the filenames using the null character so it won't fail with some weird characters in the filenames. This is then piped to the tarsnap program. The keywords here are --null to account for the passing files with "-print0" option. 

   --null  (use with -I, -T, or -X) Filenames or patterns are separated by

             null characters, not by newlines.  This is often used to read

             filenames output by the -print0 option to find(1).

     -T filename

             (c, x, and t modes only) In x or t mode, tarsnap will read the

             list of names to be extracted from filename.  In c mode, tarsnap

             will read names to be archived from filename.  The special name

             “-C” on a line by itself will cause the current directory to be

             changed to the directory specified on the following line.  Names

             are terminated by newlines unless --null is specified.  Note that

             --null also disables the special handling of lines containing

             “-C”.  If filename is “-” then the list of names will be read

             from the standard input.  Note:  If you are generating lists of

             files using find(1), you probably want to use -n as well.

so the "-" following the "-T" option allows to pass the name using std-in via find command. Once you run the command, we should see output like this. 

tarsnap: Removing leading '/' from member names

                                       Total size  Compressed size

All archives                               8.4 MB           3.4 MB

  (unique data)                            8.4 MB           3.4 MB

This archive                               8.4 MB           3.4 MB

New data                                   8.4 MB           3.4 MB

Now this is just the test. In order to run the Tarsnap, we need to register for an account and get ready for some configuration for which we need:

  1. tarsnap.conf
  2. tarsnap.key
  3. setting cache directory
If you installed tarsnap simply using homebrew than the location of tarsnap.conf will be in /opt/homebrew/etc so just copy that file to your home directory using 

cp /opt/homebrew/etc/tarsnap.conf.sample ~/tarsnap.conf

According to the documentation (https://www.tarsnap.com/gettingstarted.html#configuration-file),

If you would prefer to run Tarsnap as a normal user,

 cp /opt/homebrew/etc/tarsnap.conf.sample ~/.tarsnaprc

since we will be running it as normal user, we use this above alternative. Now for the "tarsnap.key", it is generated as part of the registration of the computer with tarsnap server. 

sudo tarsnap-keygen \

    --keyfile /Users/xyz/tarsnap.key \

    --user me@example.com \

    --machine mybox

Make sure the key can be read by the user otherwise it won't work. 

     sudo chmod 0444 tarsnap.key

 Now let us set the cache directory

     sudo chmod 700 /Users/xyz/tarsnap_cache

Make sure to point your .tarsnaprc file to the location of the key and cachedir. If it all goes write, then use this command to actually run the real backup:

find /Users/xyz/Analysis  -type f \( -name '*R' -o -name "*.pdf" -o -name "*.Rdata" \)  -print0 | tarsnap  --print-stats  -c -f "analysis_back-$(date +%Y-%m-%d_%H-%M-%S)" --null -T-  

Notice we added $(date +%Y-%m-%d_%H-%M-%S) to note the date of the archive. Tarsnap won't allow us to back with the same archive name. It won't let us delete the archive unless explicitly told to do so. 

we can list the archives using this command:

tarsnap --list-archives

Now we can set up launchd https://web.archive.org/web/20230627074009/https://www.launchd.info/ to run that command every week or every day as needed. Hopefully, this will help someone who is looking to set tarsnap on MacOS. That someone will be mostly likely me. 

Saturday, June 24, 2023

Installing R packages

 Using R packages can be fun but installing them can be difficult sometimes. The problem usually happens when the installation needs dev tools or some from one of the bioconductor repository.  Usually the instructions are as below. 

# Required packages

# Install package here
dependencies = c("Depends", "Imports", "LinkingTo"),
repos = c("https://cloud.r-project.org/",


That devtools command results into something the output below. 

Using github PAT from envvar GITHUB_PAT
Error: Failed to install 'SPRING' from GitHub:
HTTP error 401.
Bad credentials

Rate limit remaining: 59/60
Rate limit reset at: 2023-06-23 04:15:37 UTC

The problem with this is that one needs the Personal access Token (PAT) for Github API. This is really annoying problem. Also trying to install through bioconductor will update other packages sometimes. Of course, it will prompt before it does that but it can be cumbersome to check every time. What if one needs to install package/package version that is no longer available in the latest version of Bioconductor. One can set the version when installing from Bioconductor but I still have trouble finding the right package version. 

The method I am going to describe is pretty simple but will need some manual work which I think is worth it because it should not disturb your existing installations. Actually Karl Broman has nicely described this method here: 

For Github hosted R packages:
    • Download the zip file and save it locally. 
    • Unzip the file and remove the master prefix. For example, in NetCoMi-main.zip, remove the "-main" part. Now, drag the file into terminal or type

R CMD BUILD /Users/ashish/Downloads/NetCoMi

    • This will prepare the file for installation and will result into NetCoMi_1.1.0.tar.gz file. Now this is the folder that is ready to get install on your computer. Install it using this command. 

R CMD INSTALL NetCoMi_1.1.0.tar.gz

Now you will realize that it will fail sometimes because of the dependencies are not installed. Either install them using the same method I am describing or use the install.package() function. Repeat until the package can be installed. 

For Bioconductor hosted packages: 

Bottom line is to make sure to note all the packages that are used for the analysis since sometimes updates can change your analysis output. 

Sunday, April 8, 2018

Data summary

Data analysis is a huge topic but any data analysis should begin with simple data summary. It is so easy to overlook this step especially when you are dealing with the data is presented in an overly complicated way.

Recently, I was handed longitudinal data collected at a two-time point with 2 years interval between them. The data had about 20 columns including the socio-economic factors, parasite burden and growth indicators which were normalized to z-scores. 
Now, since it was longitudinal data, the attention quickly diverted on how to model the impact of parasites on the growth parameters. Various models including fixed level, random effects, and GEE models were investigated on whether or not they can be used for this kind of data. This took so many hours of research on the web regarding similar data analysis. Eventually, we found that GEE was the appropriate analysis for this kind of analysis (https://stats.stackexchange.com/questions/16390/when-to-use-generalized-estimating-equations-vs-mixed-effects-models). The key word here is that we were interested in "marginal" effect and not conditional effect. We went ahead with the GEE and found the significance for few of the parasites on the growth parameters as the outcome. This looked all good and obvious. 

However, since the parasite burden had highly skewed distribution, we decided to transform it. It was not absolutely necessary for us to go forward with transformation with this analysis.  We went for the transformation anyway. Using rcompanion package and transformtukey function, we found that some values were not transforming at all. Remaining values seems to be "normally" transformed. It turns out that most values were "zeroes". These zero values were not missing or an anomaly in the measurement. These were all genuine zeroes since most of the subjects did not have any parasites.  In fact, about 90-95% of cases had no parasites and here we are trying to find their impact on the growth of these subjects.

This was missed in the beginning because we were overly focused on the longitudinal part.  If we had gone along doing simple data summary, we would have caught it a lot earlier. Lesson learned. 

Now we know the problem, how to deal with this? Do we model them separately which would leave us with only 5% of the cases for the parasite group or do we lump them with the same group? This definitely impacted the statistics of parasite impact.  This lead to another marathon of research on whether on or not to conduct the analysis using GEE with all the data or analyze them separately. No answers have been found yet.  Possible solution include mixture models, mixed level modeling but we have not implemented any of those yet.  I may update this post when we find out conclusively. 

Tuesday, January 23, 2018

Character Encoding

Google released Colaboratory as a data science tool for the purpose of providing computational and hosting Jupyter notebooks to experiment. It also provides a good demonstration of Tensorflow machine learning library. It saves files as python notebook (.ipynb extension) within a designated folder inside your Google drive account.

I got all excited with this and not to say the least that it provides users with 13 GB of ram, Intel dual-core Xeon processors as shown below. It is rationed from some VM but I am not aware of the internal details.

At the writing of this post, it just provides Python 2 and 3 kernels. R and other languages are supposed to be added in future. It seems they are serious since they have added Jake Vanderplas (http://vanderplas.com/)  as a visiting researcher beginning of this year.
I took a plunge in the Colaboratory by trying to see if I can analyze my local data. There is a Jupyter notebook provided as a documentation to load the data from local machine, Google drive, Google sheets, Google cloud as expected: https://colab.research.google.com/notebook#fileId=/v2/external/notebooks/io.ipynb however it does tell us how can we "use" those files. It seems to be outside the scope of documentation so it was of no help to me. I kept trying to read the file after loading it into an object


import pandas as pd
from google.colab import files
uploaded = files.upload()

for fn in uploaded.keys():
   print('User uploaded file "{name}" with length {length} bytes'.format(
   name=fn, length=len(uploaded[fn])))

#To see if the file is in the current folder but I could not find it

#I loaded the file anyway to see if it was present

But there was no file. It kept saying file could not be found. 

The problem was pretty trivial, it turns out that uploaded files are "never" uploaded to the hard drive but are stored in RAM as Python objects and we need to work with those objects as I found the solution  using this stack overflow link: 

##Parital solution

import pandas as pd
import io

from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
   print('User uploaded file "{name}" with length {length} bytes'.format(
   name=fn, length=len(uploaded[fn])))


YCOM-Web2016_2017-11-29.csv(text/csv) - 1243703 bytes, last modified: 1/19/2018 - %100 done
User uploaded file "YCOM-Web2016_2017-11-29.csv" with length 1243703 bytes

##End of output

df = pd.read_csv(io.StringIO(uploaded['YCOM-Web2016_2017-11-29.csv'].decode('utf-8')))

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 505403: invalid start byte

What is this UnicodeDecodeError? It turns out that my file did not have utf-8 encoding which I supposed. I needed to convert the io object string into file format which Pandas library could swallow. Again, google come to the rescue, I got this page where we have the list of different encodings: https://docs.python.org/3/library/codecs.html#standard-encodings
for which Python 3 has the support. 

I changed my line in the code to 

df = pd.read_csv(io.StringIO(uploaded['YCOM-Web2016_2017-11-29.csv'].decode('ISO-8859-1')))

and it loaded with full glory!

Notice we have the argument to decode changed from utf-8 to ISO-8859-1 which allows characters not supported by utf-8. I am still mystified since, the file was supposed to be a plain old csv file but well for the future references, always check your file encoding using this command:

file -I file_name.csv

which on my mac shows up as "unknown-8bit".  Not very helpful but we do know that it is not plain old "us-ascii". 

TLDR: Check your encoding before you load the file and do not make quick assumptions especially when working with new systems!

Tuesday, January 16, 2018

Dealing with low sample size significance testing

Recently I had to analyze data with very few data points in the range of 3-15. The data consisted of 3 groups and multiple subgroups. The most obvious choice, in this case, was to use a non-parametric statistical test such as Wilcox test. The problem with the Wilcox test is that we have the problem of losing power/sensitivity. A t-test, on the other hand, may give us false-positive especially with the sample size of 3. How do we deal with this? This issue is exacerbated especially for p-value calculation. P-values seems to be necessary "evil" but below are the points to address this problem.

These are some of the links which helped me to understand this issue:

I was intrigued further to carry informal "research" and found possible options.

  • If we cannot determine normality, should we just use t-test anyways? 
    • The idea is that this experiment is to detect potential vaccine candidate in the very preliminary stage so we need to be slightly "lenient" and err on the side of allowing few false positives.

  • Just select Wilcox.test since it is appropriate for non-parametric and enough power to detect a difference
    • https://stats.stackexchange.com/a/66235/124490

  • Use bootstrapped values: 
    • http://biostat.mc.vanderbilt.edu/wiki/pub/Main/JenniferThompson/ms_mtg_18oct07.pdf
    • Requires more than 8 samples.
      • https://stats.stackexchange.com/questions/33300/determining-sample-size-necessary-for-bootstrap-method-proposed-method
    • However, some say we may require more than 20
      • https://speakerdeck.com/jakevdp/statistics-for-hackers

  • Using Permutation test:
    • It works with a fewer sample size as compared to bootstrapping but it cannot generate confidence interval. 
    • In fact, Wilcox test is a subset of a permutation test. 

  • Plainly displaying the data points with a confidence interval.

  • Using Effect size to illustrate the "significance". Site: https://garstats.wordpress.com/2016/05/02/robust-effect-sizes-for-2-independent-groups/
    • Some of the recommendations include 
      • Cohens.d (Not to use for non-normal/non-parametric data) 
      • Cliff's delta (Non-parametric ordinal data)
      • Mutual information (MI)
      • Kolmogorov-Smirnov
      • Wilcox & Muska’s Q

  • Equivalence test. Site: https://support.minitab.com/en-us/minitab/18/help-and-how-to/statistics/equivalence-tests/supporting-topics/why-use-an-equivalence-test/ 
    • This option requires knowing the "difference" which has some biological/Clinical significance.
Overall, this gives us many options but still, this is no panacea. The small sample size is a very difficult issue and these "solutions" can help to minimize the pain. 

Comparing R and Python

 I have used R for quite some time for data analysis. Especially with the use of Tidyverse package, it has been a very decent experience. Gg...