Looking more closely at (5), we see that (7)yi−y˜i(j)=yi−y^i+xijβ˜j=ri+xijβ˜j, where ŷi is the current fit of the model for observation i, and hence ri the current residual. Thus (8)1N∑i=1Nxij(yi−y˜i(j))=1N∑i=1Nxijri+β˜j, because the xj are standardized. The first term on the right-hand side is the gradient of the loss with respect to βj. It is clear from (8) why coordinate descent is computationally efficient. Many coefficients are zero, remain zero after the thresholding, and so nothing needs to be changed. Such a step costs O(N) operations— the sum to compute the gradient. On the other hand, if a coefficient does change after the thresholding, ri is changed in O(N) and the step costs O(2N). Thus a complete cycle through all p variables costs O(pN) operations. We refer to this as the naive algorithm, since it is generally less efficient than the covariance updating algorithm to follow. Later we use these algorithms in the context of iteratively reweighted least squares (IRLS), where the observation weights change frequently; there the naive algorithm dominates.