Can you see my slides? Yes. Okay. Great. So thank you very much for the opportunity to be here and to present, let's say, our recent results. We will talk about parton densities with deep learning models. This is a paper we published last year, and it is in the context of N3PDF. For those who are not aware, N3PDF is a research project we have in Milan, led by Stefano Forte, and it tries, somehow, to construct the next generation of NNPDF fits.

So let's start this talk. I wanted to begin with a couple of words concerning the PDF challenges. In the last 20 years the PDF community has given a lot of priority to the data implementation, trying to collect as much data as possible from different processes and experiments. Here on the right side you see the kinematic plot where we collect all the DIS, fixed-target Drell-Yan and hadronic data, and the LHC data that has, for example, been included in the NNPDF 3.1 data set. We also had lots of developments in terms of theory tools, trying to have a fast mechanism to compute the theoretical predictions for multiple processes, in order to make a PDF regression possible, to get a PDF fit from them. And the third point, which is the most important one and the one I wanted to focus on today, is the methodology. We have to define a proper methodological framework in order to obtain reliable PDFs with reliable uncertainties, and somehow have full control of what we are doing. This is something very complicated that we spend a lot of time trying to tweak and to understand.

In terms of NNPDF, the latest release we have so far is NNPDF 3.1, released a couple of years ago. It implements neural network architectures optimized through a genetic algorithm; that was the original idea, and we have been using it for a long time. The implementation we had was in C++, because it was easier to implement our convolution kernels between the theoretical predictions and the PDFs. And the third point which characterizes this release is that the tuning of the model was done with a manual approach: we tried to fit, we tried to understand the fit.
We used closure tests, but the fine-tuning of the neural network architecture, the optimizer setup and the other parameters of our system was performed manually, so we had to retry several times to get the perfect solution, or let's say as close to a perfect solution as we could.

When you see these three points, I think there are at least two natural questions we may ask in terms of challenges. The first question is: can we have fits which are very, very fast, and that can give us a better understanding of what is going on in a release? So far in the NNPDF framework, fits usually take around 30 to 35 hours per replica, and 35 hours is a very long time; you could try many other things during this time. So this is the first question: what can we improve? Can we maybe move away from the genetic optimizer and change the parameterization? And the second question is: can we generalize the methodology and implement some sort of automatic learning of the methodology, so that the selection criteria are applied automatically instead of manually? And the answer is: probably yes, we can try to follow the approaches adopted by the deep learning community, by the many, many projects that we are also seeing here in physics. So we need some sort of general upgrade of our software technology, but also of the methodology. That was the missing step, and that is why we have N3PDF: it is some sort of research and development team that tries to implement new features and to see whether these new features are acceptable or better than the things we had so far. So we divided the work between the methodology code and the software technology.

Going towards a deep learning approach, PDF fits are a supervised learning problem. We have our data, a model, a cost function and an optimizer; all four items enter the training, and then we have cross-validation methods that provide us with the best possible model. If we want to improve on that, we have to focus on at least three points. The first is the model definition: how we build the neural networks, how we decide on the best architecture. Then we move to the optimizer: shall we continue using genetic optimizers, or shall we change and use, for example, gradient descent, which is the most popular technique used so far?
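To make this setup concrete, here is a minimal sketch of such a supervised-learning pipeline, with data, model, cost function, optimizer and a validation split used for cross-validation, written in Keras. The data, architecture and hyperparameters are placeholders for illustration and are not the actual NNPDF configuration.

```python
# Minimal sketch of the generic pipeline: data -> model -> cost function ->
# optimizer -> training, with a validation split and early stopping acting
# as the cross-validation step. All numbers here are placeholders.
import numpy as np
import tensorflow as tf

x_data = np.random.rand(1000, 1).astype("float32")   # stand-in for kinematic inputs
y_data = np.random.rand(1000, 1).astype("float32")   # stand-in for experimental data

model = tf.keras.Sequential([
    tf.keras.layers.Dense(25, activation="tanh", input_shape=(1,)),
    tf.keras.layers.Dense(20, activation="tanh"),
    tf.keras.layers.Dense(1, activation="linear"),
])

# Cost function and optimizer: gradient descent (Adam) on a chi^2-like loss.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss="mse")

# Cross-validation via a held-out split, monitoring the validation loss.
stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=50,
                                        restore_best_weights=True)
model.fit(x_data, y_data, validation_split=0.25, epochs=1000,
          callbacks=[stop], verbose=0)
```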
And the third question is how to deal with cross-validation: what is the best approach for PDFs, how do we do that?

In order to address the first two items, the model and the optimizer, we have designed a new fitting algorithm that we call the n3fit model, and here you have a schematic picture of how it is built. The most important features of this particular model are, first, the implementation: it uses Python instead of pure C++ and relies on TensorFlow. We decided to move to an external library because it provides many features, like running the code on different hardware, such as CPUs and GPUs, and it gives us the flexibility to change the parameterization without having to implement everything ourselves. So if we want to try some new architecture, we can just import it from an external code and test it. The second point is the modular approach: each single block you see in this picture, from the values in x that you feed into the PDF, to the preprocessing, the neural network architecture, the fitting basis, the normalization, and then the next steps, the rotation and the convolution with the theoretical predictions, all of these can be tuned automatically and are now part of our metadata. You can change them very easily without spending much time coding, so we can vary all aspects of the methodology by modifying a few lines of configuration.

So this is n3fit. In terms of performance, by switching from a pure C++ implementation based on genetic optimization to TensorFlow using gradient descent, we observed an incredible improvement. For global PDF fits, here you see the histogram of computation time in hours as a function of the number of replicas. In orange, we used to spend something like 35 to 40 hours per replica with the genetic optimizer; after switching to gradient descent and the Python framework, this time went down to about one hour overall, around 60 minutes. The same for the DIS-only fits: before, they took around 15 hours; now they take just a few minutes, we are talking about 10 to 15 minutes. So here you see the difference between n3fit, on average, for a global data set, the same used in NNPDF 3.1, where we now sit at around 70 minutes, and the DIS-only fit, which takes just a few minutes, between five and fifteen at maximum.
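As an illustration of the modular structure just described, the following sketch builds a toy PDF-like model in the same spirit: an input grid in x, an analytic preprocessing factor, and a neural network, combined into one trainable Keras model. The exponents, layer sizes and single output flavour are invented for the example and are not the n3fit defaults.

```python
# Illustrative sketch of a modular PDF-like model: preprocessing block times
# neural-network block. All numerical choices here are placeholders.
import tensorflow as tf

x = tf.keras.Input(shape=(1,), name="x_grid")

# Neural-network block (the architecture is one of the tunable hyperparameters).
nn = tf.keras.layers.Dense(25, activation="tanh")(x)
nn = tf.keras.layers.Dense(20, activation="tanh")(nn)
nn = tf.keras.layers.Dense(1, activation="linear")(nn)   # one flavour, for simplicity

# Preprocessing block: x^(1-alpha) * (1-x)^beta imposed analytically.
alpha, beta = 1.1, 3.0   # made-up exponents
prepro = tf.keras.layers.Lambda(
    lambda t: tf.pow(t, 1.0 - alpha) * tf.pow(1.0 - t, beta),
    name="preprocessing")(x)

# PDF = preprocessing * neural network; further blocks (fitting-basis rotation,
# normalization, convolution with theory predictions) would be stacked on top.
pdf = tf.keras.layers.Multiply(name="pdf")([nn, prepro])

model = tf.keras.Model(inputs=x, outputs=pdf)
model.summary()
```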
So we gained a lot of efficiency by doing that, and that is very nice because it opens the possibility of fine-tuning the model: instead of waiting 40 hours, we can really run 40 fits, or close to that. And that is exactly what we tried to do. We moved on and implemented a hyperoptimization algorithm that automatically scans different configurations of our fitting procedure. For example, here you see on the plot the loss function, in our case the chi-square of the fit, as a function of different hyperparameters: the number of layers, the optimizer, the learning rate, the initializers, the number of epochs. There are many, many parameters that can be scanned, and we can monitor the validation chi-square as a possible metric for our hyperoptimization setup.

In this setup we also decided to implement a Bayesian approach, using hyperopt, so we can use the Tree of Parzen Estimators to propose the next configuration for our fit. By looking at these plots we can get an intuition about the best configuration for our specific problem. You see, for example, that if the neural network uses just one layer we are not doing so well, we have many outliers; if you select two, you have a better distribution, with lower chi-squares. So you can try to correlate all these different parameters and extract the best configuration.

Now, when we tried this for the first time, we decided to use just the validation chi-square as the figure of merit for the hyperoptimization. If you compare the n3fit replicas, the green ones, for example for the down PDF, to the NNPDF 3.1 replicas, you see that both show some wiggles. In the case of NNPDF 3.1 the wiggles are due to finite-size effects: in principle, if we increase the number of replicas, these wiggles will disappear or be very much reduced. In n3fit that was not the case: what we are seeing here is really overfitting, extremely strong overfitting, and that is due to the training/validation split of the data. If you just pick the data points randomly, that is not enough: the data is extremely correlated, so you end up overfitting, and you need a very clever way to separate training and validation.
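Schematically, the kind of hyperparameter scan described above could be set up with hyperopt's Tree of Parzen Estimators as follows. The search space and the run_fit stand-in are hypothetical; in the real setup the objective would be a full (fast) PDF fit, and the reported loss is the figure of merit discussed next.

```python
# Sketch of a Bayesian hyperparameter scan with hyperopt (TPE).
import random
from hyperopt import fmin, tpe, hp, STATUS_OK

search_space = {
    "n_layers": hp.choice("n_layers", [1, 2, 3, 4]),
    "units": hp.choice("units", [15, 25, 35, 50]),
    "learning_rate": hp.loguniform("learning_rate", -9, -3),
    "epochs": hp.choice("epochs", [10000, 30000, 50000]),
}

def run_fit(n_layers, units, learning_rate, epochs):
    # Hypothetical stand-in: in reality this would run a fit and return the
    # validation chi^2. Here we just return a fake number so the sketch runs.
    return random.uniform(1.0, 3.0) + 0.1 * n_layers

def objective(params):
    chi2 = run_fit(**params)
    return {"loss": chi2, "status": STATUS_OK}

best = fmin(fn=objective, space=search_space, algo=tpe.suggest, max_evals=50)
print(best)
```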
That is what we did: we had to implement a new quality control criterion, and what we have in place is shown by this plot. We had to define a completely uncorrelated test set, which is used as a quality control for the hyperoptimization algorithm. Then, inside each fit, the PDF optimization looks at the chi-square of the validation set to implement early stopping, that is, to monitor the generalization quality of the model. But that alone was not enough to avoid overfitting, so we had to modify the approach and include the test set as well; at the end of the day, we decided to use as the final figure of merit a weighted average between the validation and the test set. The test set I am talking about here is a subset of data sets from processes that are already represented in the fit. For example, for a process like jets, we include one jet data set for training, with a very large range in x and transverse momentum, and then we take another jet data set and select one with a smaller range. So we know that at least we are putting the required information into the fit.

By doing that, this is what we get: for the ubar PDF, we move from the oscillating, over-learned green distribution to the nice and smooth orange distribution. You see that there is no overfitting; looking at the plot, you see that the complexity of the architecture is not so high and you have very nice and smooth distributions. You can also compute the arc-lengths, and you see that they are pretty small. And you have great stability and reduced uncertainties. If you compare the overall chi-square of the fit we obtain with the previous methodology, the one in NNPDF 3.1, the chi-squares are very close to each other, but we get much less complexity inside the architecture.
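The final figure of merit described above, a weighted combination of the validation and test chi-square, can be written schematically as below. The 50/50 weight is an arbitrary placeholder, not necessarily the value used in the actual procedure.

```python
# Schematic figure of merit for the hyperoptimization: a weighted average of
# the validation chi^2 and the chi^2 on the uncorrelated test set.
def hyperopt_loss(chi2_validation, chi2_test, weight=0.5):
    """Loss reported back to the hyperparameter scan (placeholder weight)."""
    return weight * chi2_validation + (1.0 - weight) * chi2_test

# An overfitted configuration typically has a low validation chi^2 but a high
# test chi^2, so it is penalised by this combination.
print(hyperopt_loss(chi2_validation=1.05, chi2_test=2.40))
```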
Now you may ask: well, is that enough? What can we do next? The answer is that we can implement further tests, to have further quality control. The first thing we tried is to see whether it is possible to get stable chronological fits. The idea is to go back in time, for example to the pre-HERA era when we had a reduced data set, perform our hyperoptimization using the procedure I just presented, and then compare our predictions to the future data. So we did that: for example, we tried to remove all the data taken after that point, and here you have the predictions for later observables, such as F2 and the ttbar cross-section. You can repeat this for different experiments and also compare PDFs, and you will always see that we get results which lie within the PDF uncertainty; the results are always compatible. So that is a reassuring exercise that makes us more confident about the approach and its limitations.

The second question is how to define a proper test set. In the case of PDFs we have a very limited data set, around 5000 data points, and if we pick the wrong test set we might get a completely different answer; that is the worry. So what can we do? We can try to implement K-folding cross-validation. We tried it: in the K-folding we perform a rotation of the test set, and then we send the average value of the test-set loss over the folds to the hyperoptimization algorithm. Here you have the comparison between doing the K-folding and not doing it, just using the previous, manual selection of the test set, and again you see that the results are very compatible in both cases. So our stability with respect to the test-set definition is quite strong.
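The K-folding procedure just mentioned can be sketched as follows: each fold takes a turn as the test set, and the average test loss over the folds is what gets passed to the hyperoptimizer. The fit_and_chi2 stand-in and the number of folds are hypothetical.

```python
# Schematic K-folding over data sets for the hyperoptimization loss.
import random

def fit_and_chi2(train_sets, test_sets):
    # Placeholder: pretend we fit on train_sets and evaluate chi^2 on test_sets.
    return random.uniform(1.0, 2.0)

def kfold_loss(datasets, k=4):
    random.shuffle(datasets)
    folds = [datasets[i::k] for i in range(k)]
    losses = []
    for i, test_fold in enumerate(folds):
        train_sets = [d for j, fold in enumerate(folds) if j != i for d in fold]
        losses.append(fit_and_chi2(train_sets, test_fold))
    return sum(losses) / k   # average test chi^2 sent to the hyperoptimizer

print(kfold_loss([f"dataset_{n}" for n in range(12)]))
```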
Now, to conclude, I want to comment on two points we are still studying. The first concerns extrapolation. Today PDFs are computed using a polynomial preprocessing term which multiplies the neural networks, and what we observe is that if we remove the preprocessing, for example at small x, the PDFs saturate. So we have saturation, and we cannot deal with that particular behaviour. This gives two possible challenges. The first challenge is to avoid saturation, and this we know how to do: we have to create custom input layers that modify the behaviour and avoid the saturation in our network, so we can do this preprocessing at the data level. But then you may ask: well, now that we do not have saturation, what is the best approach to model that region? The answer is that we could create pseudo-data based on some specific process. For example, we can take the DIS observables, like F2 or F2 charm, whatever distribution we have at low Q and low x, build a Gaussian process model for that data, and then include pseudo-data from that Gaussian prior in our fits. Here we have an example: on the left you see the pseudo-data for a DIS distribution at very low x values. If you include the yellow points in the fit, then we are able to modify the PDFs, and the PDFs will include that information in the extrapolation region. So that is a possible solution for the extrapolation problem.

Coming to my final comments: we are very close to achieving the NNPDF 4.0 release. This release will contain many methodological improvements, like speed; the code will also be open source; and we have the possibility to learn the methodology with hyperoptimization and better quality controls for the different aspects of the methodology. So, thank you very much. If you have questions, please let me know.

Okay, thanks a lot. So, do we have any questions from the room?

Actually, a question about the extrapolation by hand, on slide 14, I think. Yes, this one? The extrapolation with pseudo-data there is really strongly guided by the Gaussian-process modelling of this pseudo-data. Yes, for example, yes. Which could be fine: if nature is pretty smooth, then the Gaussian process will get it, but if it does not extrapolate properly, then it is going to go bad, right? So, indeed, how do you decide? Indeed. The decision to use a Gaussian process is because, well, we start from the conceptual idea that the data derives from a Gaussian probability, a normal distribution. But in this particular case we have a couple of things to consider, like the kernel of the Gaussian process, the correlation length, the autocorrelation, and different technical features inside them. We are planning to test all these different approaches with different kernels, and then probably, at the end, combine all the results together, to have a more comprehensive representation of the uncertainty for that data and see what this translates into in terms of PDFs. Obviously it is not a perfect solution, because unfortunately it is extrapolation, so we cannot do much, but we could try to combine these Gaussian processes with different configurations for different process types at small x, and see whether the final answer improves the description in that small-x region or not. But that is still a very, very complicated problem to solve, because unfortunately we cannot rely on simple neural networks for that.
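For concreteness, a hedged sketch of what such Gaussian-process pseudo-data generation might look like with scikit-learn is given below. The measurements, the kernel and the correlation length are all invented placeholders; choosing them well is exactly the open issue raised in the question above.

```python
# Sketch of Gaussian-process pseudo-data in an extrapolation region.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Toy "data": an observable measured at a handful of x values (placeholders).
x_meas = np.array([1e-4, 3e-4, 1e-3, 3e-3, 1e-2]).reshape(-1, 1)
y_meas = 0.1 * np.log(1.0 / x_meas[:, 0])        # made-up smooth trend
log_x = np.log10(x_meas)                          # work in log10(x)

# Kernel choice and correlation length are the debatable model assumptions.
kernel = RBF(length_scale=0.5) + WhiteKernel(noise_level=1e-3)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(log_x, y_meas)

# Sample pseudo-data in the extrapolation region (smaller x than the data),
# with the GP uncertainty, to be fed back into the fit as a prior.
x_new = np.log10(np.logspace(-6, -4, 10)).reshape(-1, 1)
mean, std = gp.predict(x_new, return_std=True)
pseudo_data = np.random.default_rng(0).normal(mean, std)
print(pseudo_data)
```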
Okay, thanks. Are there any more questions? Because this is the last talk before we go to the coffee break. I guess, since we don't really have the ability to applaud people, I can unmute all of the panelists, and then maybe we can clap for them.