Can you see my slides? Yes. Okay. Great. So thank you very much for the opportunity to be here and to present, let's say, our recent results. We will talk about parton densities with deep learning models. This is a paper we published last year, and it is in the context of N3PDF. For those who are not aware, N3PDF is a research project we have in Milan, led by Stefano Forte, and it tries, somehow, to construct the next generation of NNPDF fits.

So let's start this talk. I wanted to begin with a couple of words concerning the PDF challenges. In the last 20 years the PDF community has given a lot of priority to the data implementation, trying to collect as much data as possible from different processes and experiments. Here on the right side you see the kinematic plot where we collect all the DIS, fixed-target Drell-Yan and hadronic data, and the LHC data that has, for example, been included in the NNPDF 3.1 data set. We also had lots of developments in terms of theory tools, trying to have a fast mechanism to compute the theoretical predictions for multiple processes, in order to make a PDF regression possible, to get a PDF fit from them. And the third point, which is the most important one and the one I wanted to focus on today, is the methodology. We have to define a proper methodological framework in order to obtain reliable PDFs with reliable uncertainties, and somehow have full control of what we are doing. This is something very complicated that we spend a lot of time trying to tweak and to understand.

In terms of NNPDF, the latest release we have so far is NNPDF 3.1, released a couple of years ago. It implements neural network architectures optimized through a genetic algorithm; that was the original idea, and we have been using it for a long time. The implementation we had was in C++, because it was easier to implement our convolution kernels between the theoretical predictions and the PDFs. And the third point which characterizes this release is that the tuning of the model was done with a manual approach: we tried to fit, we tried to understand the fit.
We used closure tests, but the fine-tuning of the neural network architecture, the optimizer setup and the other parameters of our system was performed manually, so we had to retry several times to get the perfect solution, or let's say as close to a perfect solution as we could.

When you see these three points, I think there are at least two natural questions we may ask in terms of challenges. The first question is: can we have fits which are very, very fast, and that can give us a better understanding of what is going on in a release? So far in the NNPDF framework, fits usually take around 30 to 35 hours per replica, and 35 hours is a very long time; you could try many other things during this time. So this is the first question: what can we improve? Can we maybe move away from the genetic optimizer and change the parameterization? And the second question is: can we generalize the methodology and implement some sort of automatic learning of the methodology, so that the selection criteria are applied automatically instead of manually? And the answer is: probably yes, we can try to follow the approaches adopted by the deep learning community, by the many, many projects that we are also seeing here in physics. So we need some sort of general upgrade of our software technology, but also of the methodology. That was the missing step, and that is why we have N3PDF: it is some sort of research and development team that tries to implement new features and to see whether these new features are acceptable or better than the things we had so far. So we divided the work between the methodology code and the software technology.

Going towards a deep learning approach, PDF fits are a supervised learning problem. We have our data, a model, a cost function and an optimizer; all four items enter the training, and then we have cross-validation methods that provide us with the best possible model. If we want to improve on that, we have to focus on at least three points. The first is the model definition: how we build the neural networks, how we decide on the best architecture. Then we move to the optimizer: shall we continue using genetic optimizers, or shall we change and use, for example, gradient descent, which is the most popular technique used so far?
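To make this setup concrete, here is a minimal sketch of such a supervised-learning pipeline, with data, model, cost function, optimizer and a validation split used for cross-validation, written in Keras. The data, architecture and hyperparameters are placeholders for illustration and are not the actual NNPDF configuration.

```python
# Minimal sketch of the generic pipeline: data -> model -> cost function ->
# optimizer -> training, with a validation split and early stopping acting
# as the cross-validation step. All numbers here are placeholders.
import numpy as np
import tensorflow as tf

x_data = np.random.rand(1000, 1).astype("float32")   # stand-in for kinematic inputs
y_data = np.random.rand(1000, 1).astype("float32")   # stand-in for experimental data

model = tf.keras.Sequential([
    tf.keras.layers.Dense(25, activation="tanh", input_shape=(1,)),
    tf.keras.layers.Dense(20, activation="tanh"),
    tf.keras.layers.Dense(1, activation="linear"),
])

# Cost function and optimizer: gradient descent (Adam) on a chi^2-like loss.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss="mse")

# Cross-validation via a held-out split, monitoring the validation loss.
stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=50,
                                        restore_best_weights=True)
model.fit(x_data, y_data, validation_split=0.25, epochs=1000,
          callbacks=[stop], verbose=0)
```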
And the third question is how to deal with cross-validation: what is the best approach for PDFs, how do we do that?

In order to address the first two items, the model and the optimizer, we have designed a new fitting algorithm that we call the n3fit model, and here you have a schematic picture of how it is built. The most important features of this particular model are, first, the implementation: it uses Python instead of pure C++ and relies on TensorFlow. We decided to move to an external library because it provides many features, like running the code on different hardware, such as CPUs and GPUs, and it gives us the flexibility to change the parameterization without having to implement everything ourselves. So if we want to try some new architecture, we can just import it from an external code and test it. The second point is the modular approach: each single block you see in this picture, from the values in x that you feed into the PDF, to the preprocessing, the neural network architecture, the fitting basis, the normalization, and then the next steps, the rotation and the convolution with the theoretical predictions, all of these can be tuned automatically and are now part of our metadata. You can change them very easily without spending much time coding, so we can vary all aspects of the methodology by modifying a few lines of configuration.

So this is n3fit. In terms of performance, by switching from a pure C++ implementation based on genetic optimization to TensorFlow using gradient descent, we observed an incredible improvement. For global PDF fits, here you see the histogram of computation time in hours as a function of the number of replicas. In orange, we used to spend something like 35 to 40 hours per replica with the genetic optimizer; after switching to gradient descent and the Python framework, this time went down to about one hour overall, around 60 minutes. The same for the DIS-only fits: before, they took around 15 hours; now they take just a few minutes, we are talking about 10 to 15 minutes. So here you see the difference between n3fit, on average, for a global data set, the same used in NNPDF 3.1, where we now sit at around 70 minutes, and the DIS-only fit, which takes just a few minutes, between five and fifteen at maximum.
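As an illustration of the modular structure just described, the following sketch builds a toy PDF-like model in the same spirit: an input grid in x, an analytic preprocessing factor, and a neural network, combined into one trainable Keras model. The exponents, layer sizes and single output flavour are invented for the example and are not the n3fit defaults.

```python
# Illustrative sketch of a modular PDF-like model: preprocessing block times
# neural-network block. All numerical choices here are placeholders.
import tensorflow as tf

x = tf.keras.Input(shape=(1,), name="x_grid")

# Neural-network block (the architecture is one of the tunable hyperparameters).
nn = tf.keras.layers.Dense(25, activation="tanh")(x)
nn = tf.keras.layers.Dense(20, activation="tanh")(nn)
nn = tf.keras.layers.Dense(1, activation="linear")(nn)   # one flavour, for simplicity

# Preprocessing block: x^(1-alpha) * (1-x)^beta imposed analytically.
alpha, beta = 1.1, 3.0   # made-up exponents
prepro = tf.keras.layers.Lambda(
    lambda t: tf.pow(t, 1.0 - alpha) * tf.pow(1.0 - t, beta),
    name="preprocessing")(x)

# PDF = preprocessing * neural network; further blocks (fitting-basis rotation,
# normalization, convolution with theory predictions) would be stacked on top.
pdf = tf.keras.layers.Multiply(name="pdf")([nn, prepro])

model = tf.keras.Model(inputs=x, outputs=pdf)
model.summary()
```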
So we gained a lot of efficiency by doing that, and that is very nice because it opens the possibility of fine-tuning the model: instead of waiting 40 hours, we can really run 40 fits, or close to that. And that is exactly what we tried to do. We moved on and implemented a hyperoptimization algorithm that automatically scans different configurations of our fitting procedure. For example, here you see on the plot the loss function, in our case the chi-square of the fit, as a function of different hyperparameters: the number of layers, the optimizer, the learning rate, the initializers, the number of epochs. There are many, many parameters that can be scanned, and we can monitor the validation chi-square as a possible metric for our hyperoptimization setup.

In this setup we also decided to implement a Bayesian approach, using hyperopt, so we can use the Tree of Parzen Estimators to propose the next configuration for our fit. By looking at these plots we can get an intuition about the best configuration for our specific problem. You see, for example, that if the neural network uses just one layer we are not doing so well, we have many outliers; if you select two, you have a better distribution, with lower chi-squares. So you can try to correlate all these different parameters and extract the best configuration.

Now, when we tried this for the first time, we decided to use just the validation chi-square as the figure of merit for the hyperoptimization. If you compare the n3fit replicas, the green ones, for example for the down PDF, to the NNPDF 3.1 replicas, you see that both show some wiggles. In the case of NNPDF 3.1 the wiggles are due to finite-size effects: in principle, if we increase the number of replicas, these wiggles will disappear or be very much reduced. In n3fit that was not the case: what we are seeing here is really overfitting, extremely strong overfitting, and that is due to the training/validation split of the data. If you just pick the data points randomly, that is not enough: the data is extremely correlated, so you end up overfitting, and you need a very clever way to separate training and validation.
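Schematically, the kind of hyperparameter scan described above could be set up with hyperopt's Tree of Parzen Estimators as follows. The search space and the run_fit stand-in are hypothetical; in the real setup the objective would be a full (fast) PDF fit, and the reported loss is the figure of merit discussed next.

```python
# Sketch of a Bayesian hyperparameter scan with hyperopt (TPE).
import random
from hyperopt import fmin, tpe, hp, STATUS_OK

search_space = {
    "n_layers": hp.choice("n_layers", [1, 2, 3, 4]),
    "units": hp.choice("units", [15, 25, 35, 50]),
    "learning_rate": hp.loguniform("learning_rate", -9, -3),
    "epochs": hp.choice("epochs", [10000, 30000, 50000]),
}

def run_fit(n_layers, units, learning_rate, epochs):
    # Hypothetical stand-in: in reality this would run a fit and return the
    # validation chi^2. Here we just return a fake number so the sketch runs.
    return random.uniform(1.0, 3.0) + 0.1 * n_layers

def objective(params):
    chi2 = run_fit(**params)
    return {"loss": chi2, "status": STATUS_OK}

best = fmin(fn=objective, space=search_space, algo=tpe.suggest, max_evals=50)
print(best)
```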
That is what we did: we had to implement a new quality control criterion, and what we have in place is shown by this plot. We had to define a completely uncorrelated test set, which is used as a quality control for the hyperoptimization algorithm. Then, inside each fit, the PDF optimization looks at the chi-square of the validation set to implement early stopping, that is, to monitor the generalization quality of the model. But that alone was not enough to avoid overfitting, so we had to modify the approach and include the test set as well; at the end of the day, we decided to use as the final figure of merit a weighted average between the validation and the test set. The test set I am talking about here is a subset of data sets from processes that are already represented in the fit. For example, for a process like jets, we include one jet data set for training, with a very large range in x and transverse momentum, and then we take another jet data set and select one with a smaller range. So we know that at least we are putting the required information into the fit.

By doing that, this is what we get: for the ubar PDF, we move from the oscillating, over-learned green distribution to the nice and smooth orange distribution. You see that there is no overfitting; looking at the plot, you see that the complexity of the architecture is not so high and you have very nice and smooth distributions. You can also compute the arc-lengths, and you see that they are pretty small. And you have great stability and reduced uncertainties. If you compare the overall chi-square of the fit we obtain with the previous methodology, the one in NNPDF 3.1, the chi-squares are very close to each other, but we get much less complexity inside the architecture.
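The final figure of merit described above, a weighted combination of the validation and test chi-square, can be written schematically as below. The 50/50 weight is an arbitrary placeholder, not necessarily the value used in the actual procedure.

```python
# Schematic figure of merit for the hyperoptimization: a weighted average of
# the validation chi^2 and the chi^2 on the uncorrelated test set.
def hyperopt_loss(chi2_validation, chi2_test, weight=0.5):
    """Loss reported back to the hyperparameter scan (placeholder weight)."""
    return weight * chi2_validation + (1.0 - weight) * chi2_test

# An overfitted configuration typically has a low validation chi^2 but a high
# test chi^2, so it is penalised by this combination.
print(hyperopt_loss(chi2_validation=1.05, chi2_test=2.40))
```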
Now you may ask: well, is that enough? What can we do next? The answer is that we can implement further tests, to have further quality control. The first thing we tried is to see whether it is possible to get stable chronological fits. The idea is to go back in time, for example to the pre-HERA era when we had a reduced data set, perform our hyperoptimization using the procedure I just presented, and then compare our predictions to the future data. So we did that: for example, we tried to remove all the data taken after that point, and here you have the predictions for later observables, such as F2 and the ttbar cross-section. You can repeat this for different experiments and also compare PDFs, and you will always see that we get results which lie within the PDF uncertainty; the results are always compatible. So that is a reassuring exercise that makes us more confident about the approach and its limitations.

The second question is how to define a proper test set. In the case of PDFs we have a very limited data set, around 5000 data points, and if we pick the wrong test set we might get a completely different answer; that is the worry. So what can we do? We can try to implement K-folding cross-validation. We tried it: in the K-folding we perform a rotation of the test set, and then we send the average value of the test-set loss over the folds to the hyperoptimization algorithm. Here you have the comparison between doing the K-folding and not doing it, just using the previous, manual selection of the test set, and again you see that the results are very compatible in both cases. So our stability with respect to the test-set definition is quite strong.
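The K-folding procedure just mentioned can be sketched as follows: each fold takes a turn as the test set, and the average test loss over the folds is what gets passed to the hyperoptimizer. The fit_and_chi2 stand-in and the number of folds are hypothetical.

```python
# Schematic K-folding over data sets for the hyperoptimization loss.
import random

def fit_and_chi2(train_sets, test_sets):
    # Placeholder: pretend we fit on train_sets and evaluate chi^2 on test_sets.
    return random.uniform(1.0, 2.0)

def kfold_loss(datasets, k=4):
    random.shuffle(datasets)
    folds = [datasets[i::k] for i in range(k)]
    losses = []
    for i, test_fold in enumerate(folds):
        train_sets = [d for j, fold in enumerate(folds) if j != i for d in fold]
        losses.append(fit_and_chi2(train_sets, test_fold))
    return sum(losses) / k   # average test chi^2 sent to the hyperoptimizer

print(kfold_loss([f"dataset_{n}" for n in range(12)]))
```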
Now, to conclude, I want to comment on two points we are still studying. The first concerns extrapolation. Today PDFs are computed using a polynomial preprocessing term which multiplies the neural networks, and what we observe is that if we remove the preprocessing, for example at small x, the PDFs saturate. So we have saturation, and we cannot deal with that particular behaviour. This gives two possible challenges. The first challenge is to avoid saturation, and this we know how to do: we have to create custom input layers that modify the behaviour and avoid the saturation in our network, so we can do this preprocessing at the data level. But then you may ask: well, now that we do not have saturation, what is the best approach to model that region? The answer is that we could create pseudo-data based on some specific process. For example, we can take the DIS observables, like F2 or F2 charm, whatever distribution we have at low Q and low x, build a Gaussian process model for that data, and then include pseudo-data from that Gaussian prior in our fits. Here we have an example: on the left you see the pseudo-data for a DIS distribution at very low x values. If you include the yellow points in the fit, then we are able to modify the PDFs, and the PDFs will include that information in the extrapolation region. So that is a possible solution for the extrapolation problem.

Coming to my final comments: we are very close to achieving the NNPDF 4.0 release. This release will contain many methodological improvements, like speed; the code will also be open source; and we have the possibility to learn the methodology with hyperoptimization and better quality controls for the different aspects of the methodology. So, thank you very much. If you have questions, please let me know.

Okay, thanks a lot. So, do we have any questions from the room?

Actually, a question about the extrapolation by hand, on slide 14, I think. Yes, this one? The extrapolation with pseudo-data there is really strongly guided by the Gaussian-process modelling of this pseudo-data. Yes, for example, yes. Which could be fine: if nature is pretty smooth, then the Gaussian process will get it, but if it does not extrapolate properly, then it is going to go bad, right? So, indeed, how do you decide? Indeed. The decision to use a Gaussian process is because, well, we start from the conceptual idea that the data derives from a Gaussian probability, a normal distribution. But in this particular case we have a couple of things to consider, like the kernel of the Gaussian process, the correlation length, the autocorrelation, and different technical features inside them. We are planning to test all these different approaches with different kernels, and then probably, at the end, combine all the results together, to have a more comprehensive representation of the uncertainty for that data and see what this translates into in terms of PDFs. Obviously it is not a perfect solution, because unfortunately it is extrapolation, so we cannot do much, but we could try to combine these Gaussian processes with different configurations for different process types at small x, and see whether the final answer improves the description in that small-x region or not. But that is still a very, very complicated problem to solve, because unfortunately we cannot rely on simple neural networks for that.
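For concreteness, a hedged sketch of what such Gaussian-process pseudo-data generation might look like with scikit-learn is given below. The measurements, the kernel and the correlation length are all invented placeholders; choosing them well is exactly the open issue raised in the question above.

```python
# Sketch of Gaussian-process pseudo-data in an extrapolation region.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Toy "data": an observable measured at a handful of x values (placeholders).
x_meas = np.array([1e-4, 3e-4, 1e-3, 3e-3, 1e-2]).reshape(-1, 1)
y_meas = 0.1 * np.log(1.0 / x_meas[:, 0])        # made-up smooth trend
log_x = np.log10(x_meas)                          # work in log10(x)

# Kernel choice and correlation length are the debatable model assumptions.
kernel = RBF(length_scale=0.5) + WhiteKernel(noise_level=1e-3)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(log_x, y_meas)

# Sample pseudo-data in the extrapolation region (smaller x than the data),
# with the GP uncertainty, to be fed back into the fit as a prior.
x_new = np.log10(np.logspace(-6, -4, 10)).reshape(-1, 1)
mean, std = gp.predict(x_new, return_std=True)
pseudo_data = np.random.default_rng(0).normal(mean, std)
print(pseudo_data)
```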
Okay, thanks. Are there any more questions? Because this is the last talk before we go to the coffee break. I guess, since we don't really have the ability to applaud people, I can unmute all of the panelists, and then maybe we can clap for them.