Sunday, April 8, 2018

Data summary

Data analysis is a huge topic, but any data analysis should begin with a simple data summary. It is easy to overlook this step, especially when the data is presented in an overly complicated way.

Recently, I was handed longitudinal data collected at two time points, 2 years apart. The data had about 20 columns, including socio-economic factors, parasite burden, and growth indicators that were normalized to z-scores.
Since it was longitudinal data, attention quickly shifted to how to model the impact of parasites on the growth parameters. Various models, including fixed effects, random effects, and GEE (generalized estimating equations) models, were investigated to see whether they could be used for this kind of data. This took many hours of research on the web into similar data analyses. Eventually, we found that GEE was the appropriate method for this kind of analysis (https://stats.stackexchange.com/questions/16390/when-to-use-generalized-estimating-equations-vs-mixed-effects-models). The key point is that we were interested in the "marginal" (population-averaged) effect and not the conditional (subject-specific) effect. We went ahead with GEE and found significant effects for a few of the parasites, with the growth parameters as the outcome. This all looked good and obvious.

However, since the parasite burden had a highly skewed distribution, we decided to transform it. The transformation was not absolutely necessary for this analysis, but we went for it anyway. Using the transformTukey function from the rcompanion package, we found that some values were not transforming at all. The remaining values seemed to be "normally" transformed. It turned out that most values were zeroes. These zero values were not missing data or an anomaly in the measurement. They were all genuine zeroes, since most of the subjects did not have any parasites. In fact, about 90-95% of cases had no parasites, and here we were trying to find their impact on the growth of these subjects.
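
To see why the zeroes "did not transform", here is a minimal Python sketch. The burden values are made up for illustration, and the square root and log(1 + x) are only stand-ins for whichever power transformTukey would pick; this is not the analysis we ran. The point is that a monotone transformation sends tied values to tied values, so a spike of zeroes stays a spike.

```python
import math
from collections import Counter

# Made-up burden values (assumption): 18 genuine zeroes, 2 infected subjects
burden = [0] * 18 + [12, 340]

# Stand-in transformations (assumption), all defined at zero
transforms = {
    "raw": lambda x: x,
    "sqrt": math.sqrt,
    "log1p": math.log1p,  # log(1 + x)
}

# Share of observations sitting on the single most common value
spike_share = {}
for name, f in transforms.items():
    transformed = [f(x) for x in burden]
    value, count = Counter(transformed).most_common(1)[0]
    spike_share[name] = count / len(transformed)
    print(f"{name:>6}: most common value = {value}, share = {spike_share[name]:.0%}")
```

Whatever transformation is used, the share of observations stuck on one value does not change, so the result cannot look normal.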

This was missed in the beginning because we were overly focused on the longitudinal part. If we had started with a simple data summary, we would have caught it a lot earlier. Lesson learned.
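
As a rough idea of what such a summary could look like, here is a short Python sketch using only the standard library. The function name `summarize` and the burden values are hypothetical, made up for illustration; the real data had many more columns and two time points.

```python
from statistics import mean, quantiles

def summarize(values):
    """Basic summary of a numeric column; None is treated as missing."""
    observed = [v for v in values if v is not None]
    n = len(observed)
    zeros = sum(1 for v in observed if v == 0)
    q1, q2, q3 = quantiles(observed, n=4)  # quartiles
    return {
        "n": n,
        "missing": len(values) - n,
        "zeros": zeros,
        "pct_zero": 100 * zeros / n,
        "min": min(observed),
        "q1": q1,
        "median": q2,
        "q3": q3,
        "max": max(observed),
        "mean": mean(observed),
    }

# Made-up parasite burden (assumption): 18 of 20 subjects have no parasites
burden = [0] * 18 + [12, 340]
summary = summarize(burden)
for name, value in summary.items():
    print(f"{name:>8}: {value}")
```

A median of zero next to a large maximum, or a `pct_zero` close to 90, is exactly the kind of warning sign that is easy to spot before any modelling starts.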

Now that we know the problem, how do we deal with it? Do we model the infected subjects separately, which would leave us with only about 5-10% of the cases in the parasite group, or do we lump everyone into the same group? This definitely affects the statistics of parasite impact. This led to another marathon of research on whether to conduct the analysis using GEE with all the data or to analyze the groups separately. No answers have been found yet. Possible solutions include mixture models and mixed-level modeling, but we have not implemented any of those yet. I may update this post when we reach a conclusion.



Comparing R and Python

I have used R for quite some time for data analysis. Especially with the Tidyverse packages, it has been a very decent experience. Gg...