hlb-CIFAR10 Release Notes

Train to 94% on CIFAR-10 in <6.3 seconds on a single A100. Or ~95.79% in ~110 seconds (or less!)

v0.7.0

6 months ago

Run it yourself right now (requires: torch, torchvision, nvidia gpu): git clone https://github.com/tysam-code/hlb-CIFAR10 && cd hlb-CIFAR10 && python main.py

Technical discussion: https://twitter.com/hi_tysam/status/1721764010159477161

Code: https://github.com/tysam-code/hlb-CIFAR10/blob/v0.7.0/main.py

Diffs: https://github.com/tysam-code/hlb-CIFAR10/commit/ad103b43d29f08b348b522ad89d38beba8955f7c

v0.6.0

1 year ago

Welcome to the new release! Much of this one lays down part of the (hopefully?) groundwork for potential future work. We do pick up some nice minor speed boosts in the meantime, though with this particular update we're nearing the tail end of one family of optimizations. However, we're all chipping away at that one singular goal of getting under 2.0 seconds in 2.0 years (or so), so every little bit helps! Usually! (as long as it's not too complicated!)

Anywho, patch notes with technical details in them are located at https://twitter.com/hi_tysam/status/1647832159158165505

v0.5.0

1 year ago

Hello everyone,

One-shot patch notes today since I'm pretty tired from getting this one to completion. But, in the name of scientific (and otherwise) integrity, let's get these changes logged! This is a rougher cut, so apologies in advance for any spelling or other mistakes (I'll update as necessary!).

Changes Summary + Notes

  • While last patch we christened our SpeedyResNet 1.0 architecture, there was in fact at least one more network architecture speedup awaiting us. By halving the depth of the first layer group, we gain a massive speed boost with a decently mild reduction in performance. Increasing the training time to 12.6 epochs lets us regain that performance while still finishing in just under 7 seconds.

  • Speaking of fractional epochs, we support them now! They added more complexity than what was needed before but seem to be very important now. I haven't quite logged the exchange rate of accuracy for training epochs, but it does seem to follow some consistent nonlinear law. Now it's much easier to implement incremental percentage increase improvements and not have to 'save up' to remove or add whole epochs in one big jump. I recommend taking it for a spin and playing around!

  • We fixed a bug in which the Lookahead-inspired update strategy with the exponential moving average (EMA) was doing... nothing at all last patch! This was because the updates to the original network were not being applied in-place. That was embarrassing to learn about! But thankfully we got it working in this patch to good effect. Notably, we do change the paradigm a bit by adding an aggressive warmup schedule that ramps the EMA decay smoothly, primarily in the last few epochs of training, letting the network train helter-skelter until the last minute, where we then strongly self-ensemble with a continuously-decreasing learning rate (see the first sketch after this list).

  • We bring back the final_lr_ratio parameter, because unfortunately having the final lr ratio go to 0 did not play nicely with an extremely strong EMA at the end. We still need some kind of learning rate to learn something interesting! (especially as the effective learning rate in the EMA drops very quickly towards the end of training). However, we generally left this alone around .05 as that seemed good enough for our purposes.

  • Cutout is out, and Cutmix has been added into the mix (see the Cutmix sketch after this list)! Theoretically this should help us more, since we're no longer just destroying information to regularize our network -- at least not as much as before. This means that our labels are now one-hot arrays that get mixed around too, which opens up some opportunities for fun label-based tricks in the future! We use this now in the shorter runs, since the accelerated Lookahead-EMA optimization tends to overfit more quickly at the end otherwise. On the whole, the combination seems to be a strong net positive.

  • In between and after implementing these things (most of which happened near the beginning of the week), we did about 25-30 hours of manual hyperparameter tuning. I do not know how many thousands of runs this was, but it was certainly a lot. I tried implementing a short genetic algorithm of my own for live hyperparameter tuning, but unfortunately I am not that well-versed in that field, and the run results are extremely noisy at this point -- even a 25-run battery (~3-4 minutes or so) is a moderately noisy measure of performance. I got lost in the valleys of trying to tell apart whether the hyperparameter surface was nonlinear or incredibly noisy, or if there was a stateful heisenbug of some kind, or all three... In any case, I 'gave up' by increasing the number of epochs at two different points and then tuning to allow for more generous performance headroom. Even if the average is bang-on at 94.00%, the visceral feeling of getting >94% on that first run is an important part of the UX of this repo.

  • An important point that I want to emphasize from the above is that the hyperparameter surface around these peaked areas seems to be very aggressively flat... for the most part, and a lot of the hyperparameters really are fine being left where they are when translating into different contexts.

  • One final thing that we added was weight normalization to the initial projection and the final linear layer. It seemed to perform well on both of these, but the p norm values are both empirically-derived magic numbers, so this is slightly messier than I'd want. However, it did let us save ~.15 seconds or so in removing the initial BatchNorm and let us break under 7 seconds. I have some very rough initial thoughts on this, but want to try to develop them more over time. Hopefully there's a nuanced and predictable way to apply this throughout the network. It didn't work well on the other layers I tried (though it did make for very rapid overfitting when I applied it to certain layers and left the BatchNorms in instead). It seems like BatchNorm really is still doing something very protective for the network during training (we are throwing an awful lot at it, to be sure), and one of the main downsides of most of the BatchNorm replacements seems to be that they are unfortunately still very noise-sensitive. That said, noise helped us out when we originally had GhostBatchNorm (it was just too slow for our purposes), so maybe there is something to be said there for the future.

  • Oh, and I guess there is one more thing I thought worth adding to this section. For a tiny (but still good!) speed boost, we set foreach=True in the SGD optimizers (see the snippet just below). Every bit counts! Thanks to @bonlime for the suggestion/reminder to revisit this -- I'd totally passed over it and written it off initially!
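
Here's a rough, hedged sketch of what that in-place Lookahead-style EMA update with a late ramp might look like. The class/function names and the exact ramp shape are illustrative stand-ins, not the repo's actual code:

```python
import copy
import torch

class NetworkEMA(torch.nn.Module):  # hypothetical name, not the repo's class
    def __init__(self, net):
        super().__init__()
        self.net_ema = copy.deepcopy(net).eval()

    @torch.no_grad()
    def update(self, net, decay):
        for p, ema_p in zip(net.parameters(), self.net_ema.parameters()):
            ema_p.lerp_(p, 1.0 - decay)  # fold the live weights into the EMA
            p.copy_(ema_p)               # ...and write the result back in-place (the fix this patch)

def ema_decay(step, total_steps, final_decay=0.95, power=4.0):
    # Stays near 0 for most of training and ramps hard at the end, so the
    # self-ensembling only really bites in the last few epochs.
    return final_decay * (step / total_steps) ** power

net = torch.nn.Linear(4, 4)  # stand-in for the real network
ema = NetworkEMA(net)
ema.update(net, decay=ema_decay(step=900, total_steps=1000))
```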
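
And a minimal sketch of Cutmix over a batch with one-hot labels -- this is generic Cutmix with a fixed square patch, written to match the spirit of the change rather than the repo's exact variant:

```python
import torch

def cutmix(images, onehot_labels, patch_size):
    b, _, h, w = images.shape
    y = torch.randint(0, h - patch_size + 1, (1,)).item()
    x = torch.randint(0, w - patch_size + 1, (1,)).item()
    perm = torch.randperm(b, device=images.device)
    # paste a square patch from a shuffled copy of the batch into each image
    images[:, :, y:y+patch_size, x:x+patch_size] = images[perm, :, y:y+patch_size, x:x+patch_size]
    lam = (patch_size * patch_size) / (h * w)  # fraction of pixels that were pasted in
    mixed_labels = (1.0 - lam) * onehot_labels + lam * onehot_labels[perm]
    return images, mixed_labels

images = torch.rand(8, 3, 32, 32)
labels = torch.nn.functional.one_hot(torch.randint(0, 10, (8,)), num_classes=10).float()
images, labels = cutmix(images, labels, patch_size=8)
```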
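
Lastly, the foreach change really is just a flag on the optimizer (the hyperparameter values below are illustrative stand-ins, not the repo's tuned settings; it needs a reasonably recent PyTorch):

```python
import torch

net = torch.nn.Linear(8, 8)  # stand-in for the real network
opt = torch.optim.SGD(net.parameters(), lr=0.1, momentum=0.85, nesterov=True,
                      weight_decay=5e-4, foreach=True)  # foreach batches the per-parameter update math
```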

Moving Forward

I think it might be a while before the next direct speed/accuracy/etc. update on this project. (Though admittedly, I think I've felt this every time. But then again, all of that hyperparameter tuning was a pretty big wall, and most of my initial and mid-term toolkit is exhausted at this point. It is a good toolkit, though, and I'm very glad to have it.) I wouldn't be surprised if development enters a slower, more analytical phase. I also may take a break just to cleanse the palate, as I've been working quite a lot on this project! Quite a lot! There are a few cool options on the table for investigation, but since they A. are things I haven't really done before and don't know the corners/guarantees of, B. could take a longer time to test/implement, and C. aren't guaranteed (or even especially likely) to pay off without sweat/blood/tears or significantly breaking the intentions of the repo, I'm really hesitant to go into them alone.

So, with that said, I think anything the community contributes would be super helpful! There are a lot of avenues we can tackle to improve this thing. For example, did you know that MaxPooling now takes almost as much time as all of our convolutions added together? That's crazy! And in the backwards pass, the MaxPooling kernels are using 100% of the SM capacity they're allocated, yet they're still quite slow. That should be just a tile and multiply of a cached mask under ideal circumstances, right?

There's also some analysis of the learning dynamics of this thing as an option on the table that I would love to get some help/eyes on. I have a sneaking suspicion that how this network learns looks very different through a training-dynamics lens than how other, more traditional networks learn. Whether or not having the training process compressed so tightly 'brings certain dynamics into focus', I can't quite say. But I think laying out and starting to trace what in the heck is happening, now that the empirical flurry of optimization has passed, can help us from a theoretical standpoint. There is of course the engineering standpoint too, but since we've focused on that a lot, maybe there are some unrealized gains to be had now from a deeper, more thorough analysis.

Many thanks for reading, hope you enjoyed or that this was helpful to you, and if you find anything, feel free to let me know! I'm on twitter, so feel free to reach out if you have any pressing questions (or find something interesting, etc!): https://twitter.com/hi_tysam. You can also open an issue as well, of course.

Look forward to seeing what we all make together! :D :) ๐ŸŽ‡ ๐ŸŽ‰

Special Thanks

Special thanks to Carter B. and Daniel G. for supporting me on Patreon. Y'all rock! Thank you both so very much for your support! The GPU hours helped a ton.

I also want to extend special thanks to Carter B. again, this time for pointing me to a resource that was helpful in developing this release.

Lastly, many thanks to the people who have been supportive of and helpful with this project in other ways. Your kind words and encouragement have meant a ton to me. Many thanks!

Support

If this software has helped you, please consider supporting me on my Patreon (https://www.patreon.com/tysam). Every bit counts, and I certainly do use a number of GPU hours! Your support directly helps me create more open source software for the community.

v0.4.0

1 year ago

Welcome to the release notes! If you're looking for the code/README, check out https://github.com/tysam-code/hlb-CIFAR10

If you're just looking to run the code, use git clone https://github.com/tysam-code/hlb-CIFAR10 && cd hlb-CIFAR10 && python -m pip install -r requirements.txt && python main.py

Summary

Welcome to the v0.4.0 release of hlb-CIFAR10! In this release, we (somehow) find another large timesave in the model architecture, round a number of hyperparameters, and even remove one or two for good measure. We also convert our existing codebase to an optimization protocol similar to the Lookahead optimizer, and get even more aggressive with our learning schedules. Oh, and of course we clean up some of the formatting a bit, and update some annotations/comments so they're more informative or no longer incorrect. And, bizarrely enough, this update is very special because we do practically all of this by just reorganizing, rearranging, condensing, or removing lines of code. Two purely novel lines this time*! Wow!

Now, we had set a personal goal for this release to include at least one technique from the 1990s/early 2000s, but unfortunately couldn't find anything suitable this time around. That said, it was a productive exercise and should at least indirectly help us on our journey to under 2 seconds of training time.

One final critical note -- for final accuracies on short training runs (<15-20 epochs), please refer to the val_acc column and not the ema_val_acc column. This is due to a quirk that we describe later in the patch notes.

Now, on to the patch notes!

*two short new lines in the EMA, and three total if you consider the statistics calculation reorganization in the dataloader to be novel

Patch Notes

Architecture Changes

  • While expanding our 'on-deck' research queue for current and future releases, we accidentally stumbled into a network architecture that converges to a higher accuracy much more quickly than before. What's more, it's a lot simpler, and is only 8 layers!

  • As this is a novel architecture (to our knowledge), we are now officially giving it the name it has had as a class name since the start of the codebase -- "SpeedyResNet". The core block is now very simple, with only a one-depth residual branch and no 'short' version used at all (see the hedged sketch after this list). Check it out here.

  • One downside is that it does seem to mildly overfit much more quickly, but I think there are some (possibly) very succinct solutions to this in the future. However, we'll need Pytorch 2.0 to be fully released (and Colab to be appropriately updated) for us to take full advantage of them due to kernel launch times. All in all, this provided the net fastest gain on the whole. For the longer runs, we now need to use more regularization -- in this case, cutout, to keep the network from overfitting. Thankfully, the overfitting does seem to be mild and not catastrophic (~95.5% vs 95.8% performance on the longer, 90 epoch runs).

  • With this change in architecture, the Squeeze-and-Excite layers seem not to be as useful as before. This could be for a variety of reasons, but they were being applied to the second convolution in the residual block, which no longer exists. Whether it's gradient flow, the degrees of freedom offered by a 2-block residual branch, or some other kind of phenomenon -- we don't quite know. All we know at this point is that we can relatively safely remove it without too much of a hit in performance. This pushes our speed even further beyond last patch's starting point of ~9.92 seconds!
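
For the curious, here's a very rough, hedged guess at the shape of that simplified block -- a single conv on the residual branch after a pooled projection. This is an illustrative reconstruction with made-up names, not the repo's exact module (see main.py for the real thing):

```python
import torch
from torch import nn

class ConvGroupSketch(nn.Module):  # hypothetical name; main.py has the real block
    def __init__(self, channels_in, channels_out):
        super().__init__()
        self.project = nn.Conv2d(channels_in, channels_out, 3, padding=1, bias=False)
        self.pool = nn.MaxPool2d(2)
        self.norm1 = nn.BatchNorm2d(channels_out)
        self.conv = nn.Conv2d(channels_out, channels_out, 3, padding=1, bias=False)
        self.norm2 = nn.BatchNorm2d(channels_out)
        self.act = nn.GELU()

    def forward(self, x):
        x = self.act(self.norm1(self.pool(self.project(x))))
        return x + self.act(self.norm2(self.conv(x)))  # one-depth residual branch

out = ConvGroupSketch(3, 64)(torch.rand(2, 3, 32, 32))  # -> (2, 64, 16, 16)
```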

EMA

  • We now run the EMA over the entire network run, every 5 steps, with a simple added feature that sets all of the network's weights (ignoring the running batchnorm averages) to the EMA's updated weights every time the EMA runs. This is effectively analogous to the Lookahead optimizer, but it reuses our existing code, and the momentum is much higher (.98**5 ≈ .9). This seems to have some very interesting impacts on the learning procedure that might come in handy later, but for now, it provided a good accuracy boost for us. (Note: on some post-investigation, it seems that the EMA code is not working properly for the Lookahead-like usecases in this release -- though the performance numbers should still be accurate! Hopefully we can fix this in the future. The two points below should still be accurate.)

  • In adding this, we discovered that for short-term training usecases, the 'final_lr_ratio' parameter is no longer very useful and can be safely replaced with a value that effectively makes the final lr 0. I believe this is also responsible for some of the overfitting in longer runs, but I think we can hope to address that some in the future.

  • One side effect (!!!) of this is that the ema_val_acc is not able to catch up to the final steps of training as quickly in the short training runs, though it tends to do better in the longer training runs. To avoid complexity and any scheduling changes in the short term, we leave it as is and encourage the user to refer to the val_acc column when making their training accuracy judgements.

Hyperparameters/Misc

  • The loss_scale_scaler value has been useful enough to training to move to the hyperparameters block. Increasing this seems to help the resulting network be more robust, though there do seem to be some diminishing returns. It's been in here for a little while -- play around with it and see what you think!

  • Lots of hyperparameters are rounded to nice, 'clean' numbers. It's a great thing we currently have the headroom to do this!

  • We double the batch size! This helps us a lot but did require some retuning to account for the changes in how the network learns. Testing this configuration, it appears to run well on 6.5 GB or more of GPU memory, keeping it still very suitable for home users and/or researchers (something we'll try to keep as long as possible!). That said, if you're running in constrained memory with a Jupyter notebook, you may need to restart the whole kernel at times, since some memory gets freed too late and things can get clogged up. Hopefully we'll have a good solution for this in future releases, though no promises. If you find a good clean one, feel free to open up a PR! :D

  • Instead of hardcoding the CIFAR10 statistics up front, we now just use the torch.std_mean function to get them dynamically after the dataset is loaded onto the GPU (see the sketch below). It's really fast, simple, and does it all in one line of code. What's not to love?
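
A hedged sketch of that idea -- the tensor names and exact preprocessing here are illustrative, not the repo's code (drop the device argument to try it on CPU):

```python
import torch
import torchvision

cifar10 = torchvision.datasets.CIFAR10("cifar10/", download=True)
images = torch.tensor(cifar10.data, device="cuda").float().div(255).permute(0, 3, 1, 2)
std, mean = torch.std_mean(images, dim=(0, 2, 3))                    # per-channel stats, computed on-GPU
images = (images - mean.view(1, -1, 1, 1)) / std.view(1, -1, 1, 1)   # replaces the hardcoded constants
```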

Scaling and Generalization to Other Similar Datasets

I don't want to spend too much time in this section as this work is very preliminary, but the performance on other datasets of the same size seems very promising, at least. For example, changing over to CIFAR100 takes less than half a minute, and running the same network with the same hyperparameters roughly matches the SOTA of the same era for that dataset in the same timeframe (~2015-ish for both CIFAR10 and CIFAR100). If you squint your eyes, smooth the jumps in the progress charts a bit, increase the base_depth of the networks 64->128 and the num_epochs 10->90, and change the EMA and regularization in the same way as well (num_epochs 9->78, and cutout 0->11), then on a smoothed version of the SOTA charts we're near early ~2016 for both of them. Effectively -- the hyperparameters and architecture should transfer reasonably well to problems of at least the same size, since identical changes in the hyperparameters of the network seem to result in similar changes in performance for the respective datasets.

By the way, we do slightly upgrade our 'long'-running training values on a slightly larger network from ~95.77% -> ~95.84% accuracy in 188 -> 172 seconds, though both sets of numbers are rounded a bit toward worse values to slightly underpromise, since the training process is noisy -- and hey, underpromising is not always a bad bet.

I will note that we still lose some performance here relative to the short run's gains over the previous patch, as you may have noticed, and you'll probably see it in the EMA of the network over the long runs. Sometimes it swings a (rather monstrous) .10-.20% up and then down in the last epochs, in what can only likely be overfitting. However, that said, we do have some improvements over the previous release, and the short runs are the main focus for now. We may have some update focusing on the long runs in the future, but when it's best to do that remains up in the air (since speeding up the network offers raw benefits each time, and that might outweigh the utility of a long-run-only kind of update).

Special Thanks

Special thanks to everyone who's contributed help, insight, thoughts, or who's helped promote this project to other people. Your support has truly helped!

I'd also like to especially extend my thanks to my current Patreon sponsors, ๐ŸŽ‰Daniel G.๐ŸŽ‰ and ๐ŸŽ‰Carter B.๐ŸŽ‰ These two helped support a fair bit of the overhead expense that went into securing GPU resources for this project. I used more GPU hours in this last cycle than I have for any release, and it really helped a lot. Many, many thanks to you two for your support!

If you'd like to help assist me in making future software releases, please consider supporting me on Patreon (https://www.patreon.com/tysam). Every bit goes a long way!

If you have any questions or problems, feel free to open up an issue here or reach out to me on the reddit thread. I'm also on Twitter: https://twitter.com/hi_tysam

Thanks again for reading, and I hope this software provides you with a lot of value! It's certainly been quite a bit of work to pull it all together, but I think it's been worth it. Have a good one!

v0.3.0

1 year ago

Here is an overview of the changes. Remember, none of this so far is in JIT (!!!!), so things should be really snappy if you're experimenting around.

Changes

  • Misc extensive hyperparameter tuning (lr <-> final_lr_ratio were especially sensitive to each other).

  • Added squeeze-and-excitation layers (very effective, might be faster with Pytorch 2.0).

  • The depth of the initial projection layer has been halved. (!) This had very little negative effect, and was much, much faster.

  • Converted the whitening conv from 3x3 -> 2x2. This significantly increased speed and resulted in some accuracy loss, which hyperparameter tuning brought back.

  • With the whitening conv at 2x2, we could now set its padding to 0 to avoid an expensive early padding operation. This also made the rest of the network faster, at the cost of accuracy due to the spatial outputs being slightly smaller.

  • The whitening conv is now computed from the whole dataset. To be friendlier to smaller GPUs (8 GB or so, I think), we process the whitening operation in chunks over the dataset.

  • We scale the loss before and after summing, since with float16 that is a regularizing operation, and it was regularizing slightly too strongly for our needs (see the sketch below).

  • We unfortunately had to bring another large timesave/accuracy boost off the shelf to make this update fly under 10 seconds (the first being the 3x3 -> 2x2 conv conversion), and that was replacing the CELU(alpha=.3) activation functions with the now-reasonably-standard GELU() activations. They perform extremely well and the kernel is very fast for the boost that the activation provides. What's not to like?
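
A hedged sketch of that scale-before-sum / unscale-after-sum pattern (the scale value and tensors here are illustrative stand-ins, not the repo's tuned setup; run it on a GPU since it relies on float16):

```python
import torch

loss_scale = 32.0                                                   # illustrative, not the tuned value
outputs = torch.randn(512, 10, device="cuda", dtype=torch.float16)  # stand-in logits
labels = torch.randint(0, 10, (512,), device="cuda")

per_sample = torch.nn.functional.cross_entropy(outputs, labels, reduction="none")
# Scaling before the float16 sum (and unscaling after) changes how much rounding
# noise the summation picks up -- that noise acts as an implicit regularizer.
loss = per_sample.mul(loss_scale).sum().div(loss_scale)
```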

If you'd like to follow our progress on our journey to our goal of training to 94% on CIFAR10 on a single GPU in under 2 seconds within maybe (hopefully) 2 years or so, don't forget to watch the repo! We're in the phase where the updates are starting to get harder to put out, but there's still potential for a few good, "quick" chunks to be optimized away yet.

Further discussion

We've noted that the space of hyperparameters that perform optimally for a given problem grows sharper and sharper as we approach optimal performance, similar to the behavior noted in https://arxiv.org/pdf/2004.09468.pdf. Much of this update involved the extremely laborious task of tuning many hyperparameters within the code, which was done manually partially out of laziness, and partially because it's in the best interest of future me to have an instinctive feel for how they interact with each other. Unfortunately most of the hyperparameter values are no longer clean powers of 2, but we eventually did have to break that particular complexity barrier.

We performed some preliminary scaling law experiments, and we find that indeed, only increasing the network base width and training epochs yields a good scaling of performance -- twiddling the hyperparameters for these longer runs seems to decrease performance (outside of the EMA). In our runs, we got an average of 95.74% with depth 64->128 and epochs 10->80 (Final EMA percentages: 95.72, 95.83, 95.76, 95.99, 95.72, 95.66, 95.53). We're in the regime now where the scaling laws seem to hold very clearly and cleanly, generally speaking, woop woop woop! :O :D <3 <3

Feel free to open an issue or reach out if you have any questions or anything else of that nature related to this particular work, my email here is [email protected]. Many thanks, and I really appreciate your time and attention!

v0.2.0

1 year ago

New speed: ~12.31-12.38 seconds on an A100 SXM4 (through Colab). Many, many thanks to @99991 (https://github.com/99991/cifar10-fast-simple) for their help with finding the issues that eventually led to some of these improvements, as well as detailed debugging and verification of results on their end. I encourage you to check out some of their work! <3 :)

Notes

After some NVIDIA kernel profiling, we changed the following:

  • Swapped the memory format to channels_last (in beta for pytorch currently).

  • Replaced nn.AdaptiveMaxPool2d with a similarly-named class wrapping torch.amax (nearly a ~.5 second speedup in total -- see the sketch below).

  • Replaced GhostNorm with a noisier BatchNorm to take advantage of the faster/simpler kernels. This resulted in roughly a 5.25 second (!!!!) speedup over the baseline, but required some parameter tuning to get to a similar level of regularization that GhostNorm provided.
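
For reference, a hedged sketch of that pooling swap -- a thin wrapper around torch.amax standing in for an adaptive max pool down to a 1x1 output (the class name here is illustrative, not the repo's):

```python
import torch
from torch import nn

class FastGlobalMaxPool(nn.Module):  # hypothetical name
    def forward(self, x):
        return torch.amax(x, dim=(2, 3))  # max over H and W; returns (N, C) instead of (N, C, 1, 1)

pooled = FastGlobalMaxPool()(torch.rand(8, 64, 8, 8))  # -> (8, 64)
```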

In doing so, we, along with Peter, Ray, Egon, and Winston, helped GhostNorm finally find its peaceful rest. That said, the idea of batch norm noise for the sake of helping regularize the network does continue to live on, albeit in a strangely different form.

There are many other avenues we can go down in the future to continue speeding up this network and to get below the ~2-second training mark. I always burn myself out with these releases, but I already have a couple of fun things on deck that have been doing well/showing promise!

As always, feel free to support the project through my Patreon, or drop me a line at [email protected] if you ever want to hire me for up to a part time amount of hours!

v0.1.0

1 year ago

This baseline mimics the original functionality of David Page's work on achieving the world record pace for CIFAR10 on a single GPU (via Myrtle AI: How to Train Your Resnet 8: Bag of Tricks, and via Page's original GitHub: cifar10-fast)

Per the README.md in this version, this code has been practically entirely rewritten to be extremely hackable, to enable the next stage of development: improving upon the techniques used to achieve the world record. We hope to get under ~2 seconds or so within ~2 years. Once the first few batches of initial improvements come out and the dust settles from the backlog of improvements discovered during development, feel free to contribute your own. We need all the help we can get! :D <3 <3 <3 <3 :D