Use __builtin_clz, when available, for logbase and next_power_of_two; otherwise, replace next_power_of_two with a faster implementation.
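
For illustration only, here is a minimal sketch of how such a change might look. The helper names (`floor_log2_u32`, `next_pow2_u32`), the `HAVE_BUILTIN_CLZ` macro, and the 32-bit signatures are hypothetical and are not taken from the patched code. The fallback uses the common bit-smearing technique, which is an assumption about what "faster implementation" means here. The sketch also assumes `unsigned int` is 32 bits wide, since `__builtin_clz` operates on `unsigned int`.

```c
#include <stdint.h>

/* Hypothetical feature test: GCC and Clang both provide __builtin_clz. */
#if defined(__GNUC__) || defined(__clang__)
#define HAVE_BUILTIN_CLZ 1
#endif

/* floor(log2(v)) for v > 0. Returns 0 for v == 0 by convention in this
 * sketch, because __builtin_clz(0) is undefined. */
static unsigned floor_log2_u32(uint32_t v)
{
    if (v == 0)
        return 0;
#ifdef HAVE_BUILTIN_CLZ
    /* Index of the highest set bit = 31 - number of leading zeros. */
    return 31u - (unsigned)__builtin_clz(v);
#else
    unsigned r = 0;
    while (v >>= 1)
        r++;
    return r;
#endif
}

/* Smallest power of two >= v. Inputs 0 and 1 both map to 1.
 * Inputs above 2^31 cannot be represented and yield 0. */
static uint32_t next_pow2_u32(uint32_t v)
{
    if (v <= 1)
        return 1;
#ifdef HAVE_BUILTIN_CLZ
    /* v - 1 >= 1 here, so the builtin's argument is never zero. */
    unsigned shift = 32u - (unsigned)__builtin_clz(v - 1);
    return shift < 32u ? (uint32_t)1 << shift : 0;
#else
    /* Bit-smearing fallback: copy the highest set bit of v - 1 into
     * every lower position, then add one. */
    v--;
    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    v++;
    return v;
#endif
}
```

Both branches are written to return the same values, so the builtin path can be checked against the portable fallback by compiling with and without `HAVE_BUILTIN_CLZ` defined.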