How can we explain the predictions of a black-box model? In "Understanding Black-box Predictions via Influence Functions" (Pang Wei Koh and Percy Liang, Proc. 34th Int. Conf. on Machine Learning, pp. 1885-1894, 2017), the authors use influence functions -- a classic technique from robust statistics -- to trace a model's prediction through the learning algorithm and back to its training data, thereby identifying the training points most responsible for a given prediction.

After all, the optimization landscape is nonconvex, highly nonlinear, and high-dimensional, so why are we able to train these networks? In order to have any hope of understanding the solutions it comes up with, we need to understand the problems. Part of the answer boils down to the observation that neural net training seems to have two distinct phases: a small-batch, noise-dominated phase, and a large-batch, curvature-dominated one. We'll then consider how the gradient noise in SGD optimization can contribute an implicit regularization effect, Bayesian or non-Bayesian; this leads to an important optimization tool called the natural gradient. A classic result by Radford Neal showed that (under proper scaling) the distribution of functions computed by random neural nets approaches a Gaussian process. We'll also consider bilevel optimization in the context of the ideas covered thus far in the course; such problems can arise because we explicitly build optimization into the architecture, as in MAML or Deep Equilibrium Models. For this class, we'll use Python and the JAX deep learning framework; there are various full-featured deep learning frameworks built on top of JAX and designed to resemble other frameworks you might be familiar with, such as PyTorch or Keras. We have 3 hours scheduled for lecture and/or tutorial, and students are encouraged to attend class each week.

The accompanying implementation works as long as you have a supervised learning problem. The configuration described below is divided into parameters affecting the calculation and parameters affecting the output. For each processed test sample, Helpful is a list of numbers, the IDs of the training data samples that were the most helpful for the prediction outcome, whereas Harmful lists the IDs of the training samples that hurt it the most. The dict structure looks similar to this:
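A minimal sketch of such a result dict (the helpful/harmful ID lists are described above; the surrounding keys and the per-test-sample nesting are assumptions made purely for illustration):

```python
# Hypothetical structure of the influence results for two test samples.
# Only "helpful" and "harmful" are described in the text; other keys are illustrative.
influence_results = {
    "0": {                               # ID of the first processed test sample
        "helpful": [341, 87, 1523],      # training-sample IDs that most supported the prediction
        "harmful": [992, 14, 2760],      # training-sample IDs that most hurt the prediction
        "influence": [0.021, -0.004],    # raw influence values (truncated for brevity)
    },
    "1": {
        "helpful": [77, 4031, 652],
        "harmful": [1204, 9, 388],
        "influence": [0.013, -0.007],
    },
}
```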
Here, we used CIFAR-10 as the dataset. The package first calculates all s_test values and saves them to disk. Acknowledgements go to the authors of the conference paper "Understanding Black-box Predictions via Influence Functions", Pang Wei Koh et al.

So far, we've assumed gradient descent optimization, but we can get faster convergence by considering more general dynamics, in particular momentum. Which algorithmic choices matter at which batch sizes? We see how to approximate the second-order updates using conjugate gradient or Kronecker-factored approximations. Up to now, we've assumed networks were trained to minimize a single cost function; things get more complicated when there are multiple networks being trained simultaneously with different cost functions. Therefore, this course will finish with bilevel optimization, drawing upon everything covered up to that point in the course. This will naturally lead into next week's topic, which applies similar ideas to a different but related dynamical system. Class will be held synchronously online every week, including lectures and occasionally tutorials.

The most barebones way of getting the code to run is like this (here, config contains default values for the influence function calculation, which can of course be changed):
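A minimal usage sketch, assuming the package is importable as ptif (the get_default_config() call is quoted later in this document; the exact calc_img_wise signature, the config keys, and the helper functions for the model and CIFAR-10 loaders are assumptions, so check the package's own README before relying on them):

```python
import pytorch_influence_functions as ptif  # assumed import path for the ptif package

# Assumed setup: a trained PyTorch classifier plus CIFAR-10 train/test DataLoaders.
model = load_trained_model()                      # hypothetical helper
trainloader, testloader = get_cifar10_loaders()   # hypothetical helper

config = ptif.get_default_config()   # default parameters for the influence calculation
config["gpu"] = 0                    # illustrative override; key names may differ

# calc_img_wise: for each test image, estimate s_test and the influence of every
# training sample, then store the helpful/harmful rankings.
influences = ptif.calc_img_wise(config, model, trainloader, testloader)
```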
With the rapid adoption of machine learning systems in sensitive applications, there is an increasing need to make black-box models explainable. Retraining the model with a training point removed would reveal that point's exact effect, but doing so for every point is prohibitively expensive; fortunately, influence functions give us an efficient approximation. To scale up influence functions to modern machine learning settings, the authors develop a simple, efficient implementation that requires only oracle access to gradients and Hessian-vector products, and they show that even on non-convex and non-differentiable models, where the theory breaks down, approximations to influence functions can still provide valuable information.

In many cases, neural nets have far more than enough parameters to memorize the data, so why do they generalize well? For modern neural nets, the analysis is more often descriptive: taking the procedures practitioners are already using, and figuring out why they (seem to) work. Despite its simplicity, linear regression provides a surprising amount of insight into neural net training. Therefore, if we bring in an idea from optimization, we need to think not just about whether it will minimize a cost function faster, but also about whether it does so in a way that's conducive to generalization. Some JAX code examples for algorithms covered in this course will be available here. Lectures will be delivered synchronously via Zoom, and recorded for asynchronous viewing by enrolled students. For the Colab notebook and paper presentation, you will form a group of 2-3 and pick one paper from a list; your job will be to read and understand the paper, and then to produce a Colab notebook which demonstrates one of its key ideas. The details of the assignment are here.
Neural nets have achieved amazing results over the past decade in domains as broad as vision, speech, language understanding, medicine, robotics, and game playing. There are several neural net libraries built on top of JAX, but keep in mind that some of the key concepts in this course, such as directional derivatives or Hessian-vector products, might not be so straightforward to use in some frameworks. Adaptive Gradient Methods, Normalization, and Weight Decay [Slides]: we try to understand the effects they have on the dynamics and identify some gotchas in building deep learning systems. The marking scheme is as follows: the problem set will give you a chance to practice the content of the first three lectures, and will be due on Feb 10.

This is a PyTorch reimplementation of Influence Functions from the ICML 2017 best paper, "Understanding Black-box Predictions via Influence Functions" by Pang Wei Koh and Percy Liang. You can install the package directly through pip; to run the tests, further requirements apply. It calculates the influence of the individual samples of your training dataset on your individual test dataset. The first mode is called calc_img_wise, during which the influence of every training sample is calculated for each test sample. I recommend changing the following parameters to your liking.

Why use influence functions? We are given training points z_1, ..., z_n, where z_i = (x_i, y_i) is an input-label pair in X x Y. A natural exercise is then: (a) what is the effect of the training loss and the H^{-1} term in I_up,loss? The definitions below make these terms precise.
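The block below restates the standard definitions from the paper: the empirical risk minimizer, the upweighted minimizer, and the influence of upweighting a training point z on the loss at a test point (notation follows Koh and Liang):

```latex
% Empirical risk minimizer over the training points z_1, \dots, z_n
\hat{\theta} = \arg\min_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^{n} L(z_i, \theta)

% Minimizer after upweighting a training point z by a small \epsilon
\hat{\theta}_{\epsilon, z} = \arg\min_{\theta \in \Theta}
    \frac{1}{n} \sum_{i=1}^{n} L(z_i, \theta) + \epsilon \, L(z, \theta)

% Influence of upweighting z on the loss at z_test
\mathcal{I}_{\mathrm{up,loss}}(z, z_{\mathrm{test}})
  = \left.\frac{d\,L(z_{\mathrm{test}}, \hat{\theta}_{\epsilon, z})}{d\epsilon}\right|_{\epsilon=0}
  = -\nabla_{\theta} L(z_{\mathrm{test}}, \hat{\theta})^{\top}
     H_{\hat{\theta}}^{-1}
     \nabla_{\theta} L(z, \hat{\theta}),
\qquad
H_{\hat{\theta}} = \frac{1}{n} \sum_{i=1}^{n} \nabla_{\theta}^{2} L(z_i, \hat{\theta}).
```

Removing z corresponds to upweighting it by epsilon = -1/n, so -I_up,loss(z, z_test)/n approximates the change in test loss from leaving z out of training.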
This paper applies influence functions to neural networks, taking advantage of the accessibility of their gradients. On linear models and convolutional neural networks, the authors demonstrate that influence functions are useful for multiple purposes: understanding model behavior, debugging models, detecting dataset errors, and even creating visually-indistinguishable training-set attacks. See more on this at https://www.microsoft.com/en-us/research/video/understanding-black-box-predictions-via-influence-functions/. Beyond single points, we often want to identify an influential group of training samples behind a particular test prediction of a given machine learning model.

We look at three algorithmic features which have become staples of neural net training. Which optimization techniques are useful at which batch sizes? When can we take advantage of parallelism to train neural nets? Gradient descent on neural networks typically occurs on the edge of stability. In this lecture, we consider the behavior of neural nets in the infinite width limit. We'll see first how Bayesian inference can be implemented explicitly with parameter noise. Hopefully this understanding will let us improve the algorithms. A sign-up sheet will be distributed via email.

This package offers two modes of computation to calculate the influence. When testing for a single test image, you can calculate which training images had the largest effect on that image's classification outcome. Visualised, the output can look like this: the test image on the top left is the test image for which the influences were calculated. You can get the default config by calling ptif.get_default_config().
Data-trained predictive models see widespread use, but for the most part they are used as black boxes which output a prediction or score. This isn't the sort of applied class that will give you a recipe for achieving state-of-the-art performance on ImageNet; while these topics had consumed much of the machine learning research community's attention when it came to simpler models, the attitude of the neural nets community was to train first and ask questions later. Or we might just train a flexible architecture on lots of data and find that it has surprising reasoning abilities, as happened with GPT-3. We'll cover first-order Taylor approximations (gradients, directional derivatives) and second-order approximations (Hessian) for neural nets, and we'll use the Hessian to diagnose slow convergence and interpret the dependence of a network's predictions on the training data. More details can be found in the project handout. If you have questions, please contact Pang Wei Koh (pangwei@cs.stanford.edu).

The model used here was ResNet-110; the next figure shows the same but for a different model, DenseNet-100/12. Thus, we can see that different models learn more from different images. TL;DR: the recommended way is using calc_img_wise. Keeping the grad_zs only makes sense if they can be kept in RAM, or loaded from disk, faster than calculating them on-the-fly; otherwise grad_z effectively has to be calculated twice, and reading both values from disk before calculating the influence based on them saves nothing. Once the influence for one test image is computed, the algorithm then moves on to the next image. The precision of the output can be adjusted by using more iterations and/or more recursions when approximating the influence.
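To make those iterations and recursions concrete, here is a minimal PyTorch sketch of the underlying computation: a Hessian-vector product via double backpropagation, and a LiSSA-style recursion that approximates s_test = H^{-1} ∇_θ L(z_test, θ̂). This is an illustrative sketch rather than the package's actual code; the damping and scaling constants, the loop lengths, and the two loss callables are assumptions.

```python
import torch

def hvp(loss, params, vec):
    """Hessian-vector product H v via double backprop.
    loss: scalar loss built from params (each with requires_grad=True);
    vec: list of tensors shaped like params."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    dot = sum((g * v).sum() for g, v in zip(grads, vec))
    return torch.autograd.grad(dot, params)

def estimate_s_test(test_loss_fn, sample_train_loss, params,
                    damping=0.01, scale=25.0, recursions=1, iterations=100):
    """LiSSA-style estimate of s_test = H^{-1} grad L(z_test, theta).
    test_loss_fn / sample_train_loss are assumed callables returning the scalar
    loss on the test point / a freshly sampled training mini-batch.
    More iterations and recursions trade compute for precision."""
    v = torch.autograd.grad(test_loss_fn(), params)   # gradient of the test loss
    estimates = []
    for _ in range(recursions):
        h_est = [vi.clone() for vi in v]              # running estimate of H^{-1} v
        for _ in range(iterations):
            hv = hvp(sample_train_loss(), params, h_est)
            # h_est <- v + ((1 - damping) I - H / scale) h_est
            h_est = [vi + (1 - damping) * hi - hvi / scale
                     for vi, hi, hvi in zip(v, h_est, hv)]
        estimates.append([hi / scale for hi in h_est])
    # average the independent recursions
    return [sum(parts) / len(estimates) for parts in zip(*estimates)]
```

The influence of a training point z on the test loss is then approximately the negative dot product of s_test with ∇_θ L(z, θ̂), matching the formula given earlier.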
We have a reproducible, executable, and Dockerized version of these scripts on Codalab; the reference implementation can be found here: link. For details and examples, look here. Influence estimates align well with leave-one-out retraining. One practical use is to compress your dataset slightly, keeping the most influential images, which can increase prediction accuracy.

This is a tentative schedule, which will likely change as the course goes on. Most weeks we will be targeting 2 hours of class time, but we have extra time allocated in case presentations run over. Delivery is online. Differentiable Games (Lecture by Guodong Zhang) [Slides]. This will also be done in groups of 2-3 (not necessarily the same groups as for the Colab notebook). The project proposal is due on Feb 17, and is primarily a way for us to give you feedback on your project idea; the final report is due April 7.

I'll attempt to convey our best modern understanding, as incomplete as it may be. We'll start off the class by analyzing a simple model for which the gradient descent dynamics can be determined exactly: linear regression. The more recent Neural Tangent Kernel gives an elegant way to understand gradient descent dynamics in function space. Bilevel optimization refers to optimization problems where the cost function is defined in terms of the optimal solution to another optimization problem; the canonical example in machine learning is hyperparameter optimization.
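A standard way to write the hyperparameter-optimization example as a bilevel problem (generic formulation, not notation specific to this course): the outer objective is the validation loss of the inner problem's solution.

```latex
\min_{\lambda} \; \mathcal{L}_{\mathrm{val}}\bigl(\theta^{*}(\lambda)\bigr)
\qquad \text{subject to} \qquad
\theta^{*}(\lambda) = \arg\min_{\theta} \; \mathcal{L}_{\mathrm{train}}(\theta, \lambda).
```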
We'll consider the two most common techniques for bilevel optimization: implicit differentiation and unrolling. Metrics give a local notion of distance on a manifold. One would have expected the success of neural nets to require overcoming significant obstacles that had been theorized to exist. The aim, rather, is to give you the conceptual tools you need to reason through the factors affecting training in any particular instance; neither is this the sort of theory class where we prove theorems for the sake of proving theorems. Assignments for the course include one problem set, a paper presentation, and a final project; for the final project, you will carry out a small research project relating to the course content.

If the influence function is calculated for multiple test samples, the helpful and harmful lists for each sample are ordered by helpfulness. This code replicates the experiments from the following paper: Pang Wei Koh and Percy Liang, "Understanding Black-box Predictions via Influence Functions", International Conference on Machine Learning (ICML), 2017.