3
$\begingroup$

I am trying to determine which variables affect the likelihood of lumpfish eating salmon lice and which variables predict the number of lice eaten. My data are highly zero-inflated, so I decided to use a hurdle model (hurdle()) from the pscl package. However, since I have several predictors that do not seem to have a linear relationship with either the likelihood or the number of lice eaten, I incorporated splines (ns()) into the model. Here is the model with the lowest AIC:

Call:
hurdle(formula = Laksalus_tal ~ ns(Total_laksalus_rolling, df = 3) + ns(Vekt, df = 5) + Verur_a_notini | ns(Total_laksalus_rolling, df = 3) + ns(Vekt, df = 5) + 
    Hvalspýggj + Lice_per_ring_baseline + Djóraæti_uttanjelly + Verur_a_notini, data = data, dist = "negbin", zero.dist = "binomial")

Pearson residuals:
     Min       1Q   Median       3Q      Max 
-0.26391 -0.11924 -0.08661 -0.06190 81.26243 

Count model coefficients (truncated negbin with log link):
                                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)                         -12.9943    87.5056  -0.148 0.881950    
ns(Total_laksalus_rolling, df = 3)1   1.1977     0.6326   1.893 0.058335 .  
ns(Total_laksalus_rolling, df = 3)2   4.9079     0.9588   5.119 3.08e-07 ***
ns(Total_laksalus_rolling, df = 3)3   2.7670     1.7266   1.603 0.109025    
ns(Vekt, df = 5)1                     1.5328     0.8386   1.828 0.067579 .  
ns(Vekt, df = 5)2                     0.3391     1.0679   0.318 0.750796    
ns(Vekt, df = 5)3                     7.0208     2.2309   3.147 0.001649 ** 
ns(Vekt, df = 5)4                   -32.1159     9.4733  -3.390 0.000699 ***
ns(Vekt, df = 5)5                   -68.8257    19.4001  -3.548 0.000389 ***
Verur_a_notini1                      -1.2374     0.1734  -7.136 9.63e-13 ***
Log(theta)                          -14.3543    87.5002  -0.164 0.869692    
Zero hurdle model coefficients (binomial with logit link):
                                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)                         -5.482984   0.444500 -12.335  < 2e-16 ***
ns(Total_laksalus_rolling, df = 3)1  0.187322   0.267284   0.701  0.48341    
ns(Total_laksalus_rolling, df = 3)2  2.986707   0.354080   8.435  < 2e-16 ***
ns(Total_laksalus_rolling, df = 3)3  3.154348   0.524225   6.017 1.77e-09 ***
ns(Vekt, df = 5)1                    1.836310   0.391115   4.695 2.67e-06 ***
ns(Vekt, df = 5)2                    1.122467   0.473999   2.368  0.01788 *  
ns(Vekt, df = 5)3                   -1.296933   0.752062  -1.725  0.08462 .  
ns(Vekt, df = 5)4                    2.086199   1.967267   1.060  0.28894    
ns(Vekt, df = 5)5                   -1.192533   3.779546  -0.316  0.75236    
Hvalspýggj1                         -0.955653   0.091983 -10.389  < 2e-16 ***
Lice_per_ring_baseline               0.027207   0.009029   3.013  0.00258 ** 
Djóraæti_uttanjelly1                 0.285546   0.113077   2.525  0.01156 *  
Verur_a_notini1                     -0.308816   0.078037  -3.957 7.58e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Theta: count = 0
Number of iterations in BFGS optimization: 58 
Log-likelihood: -6399 on 24 Df

Total_laksalus_rolling, Lice_per_ring_baseline and Vekt are numerical predictors, while the other variables are binary (0/1).

My questions:

Model evaluation: How should I evaluate the overall fit and appropriateness of this hurdle model? Are there specific diagnostics or metrics recommended for assessing hurdle models?

Interpretation of splines: How do I interpret the spline terms (ns())? Specifically, how can I understand the relationship between predictors (Total_laksalus_rolling and Vekt) and both the likelihood of lice consumption (zero hurdle) and the number of lice eaten (count component)?

Below are residual diagnostics plots:

Residuals vs fitted values

Rootogram

$\endgroup$
1
  • 1
    $\begingroup$ Welcome to Cross Validated! First, please edit the question to specify the package that contains the hurdle() function you used. Second, don't forget that the default locations of boundary knots in ns() are at the extreme values. You might try boundary knots within the range (e.g., 5th and 95th percentiles). Third, look at this answer for guidance on hurdle versus zero-inflated models; might zero inflated be better here?. Fourth, are you worried by the extreme (and poorly estimated) value of theta in the negbin model? $\endgroup$ Commented Jun 13 at 16:09

1 Answer 1

2
$\begingroup$

You can tell a lot from what you already show.

First, in the count model you have massive imprecision in your estimates of the (Intercept) and Log(theta), and the last 2 coefficients for ns(Vekt) are extremely negative.

Second, your model predicts values of about 2 for many observations that are 10 to 100 times as large, according to your Residuals vs. Fitted plot.

Third, for predicted values beyond about 10 the model seems to make systematically increasing over-estimations of observed values.

Fourth, the estimate of theta is extremely small, about 5.8e-07. That would typically mean an enormous negative binomial variance, but the massive standard error in the estimate raises questions about reliability.

All of those suggest that the model isn't very adequate.

Something that you don't show is the adequacy of the hurdle model itself. How well does it represent the probability of 0 lice eaten, as a function of those predictors?

Without knowing more about the data, it's hard to say what's going on.

Part of the problem might be in your invocation of the ns() function for the splines. The default of that function is to place boundary knots at the extremes of the distribution of the values of the argument. That can over-emphasize behavior at the extremes while making it more difficult to model behavior at intermediate values. Moving the boundary knots a bit inward, say at the 5th and 95th percentile values of the argument, might help. (Natural splines are linear outside the boundary knots.)

I also wonder if your 5-df spline for Vekt is too flexible, leading to over-fitting of the count model. There's no need to have the same functional form for the hurdle model and the truncated negative binomial model, and the last 2 terms aren't even "significant" in the hurdle.*

You might also consider whether a zero-inflated model better represents the data.


*With respect to your question about spline coefficients, I suggest that you don't try to interpret them individually, particularly for ns() with its B-spline basis. See this page for what those coefficients mean. You typically just use the coefficients together to reconstruct the predictions. There are other spline bases that are a bit more interpretable. The rcs() splines in the R rms package allow for a simple test of non-linearity, and the nsk() function in the R survival package will "create the design matrix for a natural spline, such that the coefficient[s] of the resulting fit are the values of the function at the knots."

$\endgroup$

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.