I think it is time to start the second session of the tutorial track. Welcome, everyone. We will be using the webinar functionality of Zoom, as we did yesterday. For those not familiar with it: the speakers have the right to speak, while the attendees are asked to raise their hands with the button on the Zoom interface, so that they can get the attention of the co-hosts, and we will unmute you after the talks. So feel free to raise your hands during a talk and after a talk, so that we can unmute you and you will then be able to ask your questions. If there is any problem, you also have the chat, so that we can try to see whether we can resolve it.

If there is no major issue, we can start with the first presentation, from Matteo, on GPUs for boosting performance in HEP.

Okay, so can you hear me?

Yes, we can hear you.

I'm trying to share the slides. Okay, do you see the slides?

Yes.

Okay, so good afternoon, everybody. Today I am pleased to present my summary of the usage of graphics processing units for boosting performance in high-energy-physics computational use cases. This is the outline. Let me first briefly introduce the architecture of GPUs, for those who are less familiar with it. A graphics processing unit is a programmable architecture that offers a large number of parallel and independent streams of instructions. GPUs were initially designed for image processing and rendering, but nowadays they are also used for so-called general-purpose GPU computing, which means using the GPU to solve mathematical problems in scientific fields. When we compare the architecture of a graphics processing unit with that of a CPU, we see that the essential layout is quite similar: they both have a dedicated memory, and they have arithmetic logic units, which are responsible for performing the arithmetic and logic operations. They also have a different number of control units, which dispatch the instructions to the processor. So the structure of the chips is very similar. Despite this fact, they were designed to cope with different types of computation.
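To make the scale difference between the two architectures tangible, here is a minimal CUDA sketch (added for illustration, not part of the talk) that queries how many multiprocessors and resident threads a device exposes; the calls are standard CUDA runtime API.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int dev = 0; dev < count; ++dev) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);
            // A single GPU keeps many thousands of threads resident at once,
            // versus the hundreds of hardware threads of a large CPU node.
            std::printf("Device %d: %s\n", dev, prop.name);
            std::printf("  multiprocessors: %d\n", prop.multiProcessorCount);
            std::printf("  max resident threads: %d\n",
                        prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor);
        }
        return 0;
    }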
Let me just add that both GPU and CPU use threads to distribute the parallel workloads of a program among the multiple processors. There is, however, a difference in the scale of the architecture: a CPU can manage up to hundreds of processors and threads, while a GPU can scale well beyond many thousands of them. Their typical scopes are also quite different. The CPU focuses more on minimizing the latency of context switching among applications, reducing the time spent moving from one to another, while the GPU aims more at implementing the single-instruction, multiple-threads paradigm, which maximizes throughput, that is, the amount of operations performed on data per unit of time.

So, while a CPU is more flexible, a GPU is focused only on the aforementioned paradigm, where each thread in a group performs the same sequence of instructions as its neighbors do. In some contexts, GPUs are therefore faster and less power-consuming than CPUs, and a workflow that fits this paradigm must comply with certain characteristics. When we successfully exploit the GPU for this kind of computation, we can reach peaks of performance that can be one order of magnitude higher than what the CPU offers. One other thing to take into account is that recently the number of floating-point operations per second, that is, the capability of the GPU, has been growing at a faster pace than that of CPUs, so GPUs are becoming more and more powerful as time passes. Using a GPU to accelerate our computing workloads is indeed a choice of convenience, also because, from our point of view, a computing node nowadays can host up to eight GPUs and up to four CPUs on the same motherboard, meaning that if we have a workload that can be deployed well both on GPU and on CPU, we can buy fewer computing nodes by using more GPUs. So in the end, the cost of re-engineering the software and the algorithms we currently have is compensated by the performance gains related to the adoption of GPUs.
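As a concrete illustration of the single-instruction, multiple-threads paradigm just described, here is a minimal CUDA sketch (illustrative only, not from the talk): every thread executes the same instruction sequence on its own data element, and the launch spans about a million threads.

    #include <cuda_runtime.h>

    // SIMT: each thread applies the same instruction sequence to one element.
    __global__ void saxpy(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];  // identical instructions, different data
    }

    int main() {
        const int n = 1 << 20;  // ~1M elements, far beyond any CPU thread count
        float *x, *y;
        cudaMallocManaged(&x, n * sizeof(float));
        cudaMallocManaged(&y, n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;  // cover the whole array
        saxpy<<<blocks, threads>>>(n, 2.0f, x, y);
        cudaDeviceSynchronize();

        cudaFree(x);
        cudaFree(y);
        return 0;
    }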
So let's talk about general-purpose GPU (GPGPU) computing, meaning, as I said before, the usage of graphics cards for generic computation. This is a reality in scientific fields also outside high-energy physics: we have a lot of successful machine-learning and artificial-intelligence applications of GPUs, but also real-time image processing, in those use cases where reactivity is a requirement, like, for instance, self-driving cars. So any inherently parallel algorithm, one that can hypothetically scale well on a parallel architecture, is a good candidate to be deployed on a GPU with a large core density. This is also true in the field of high-energy physics, where GPUs have been investigated for over a decade, and there are currently many kinds of applications that promote their usage in different fields and at different stages. Inside the HEP field there are ongoing projects at more or less advanced maturity levels. To mention the four main large experiments at the LHC: ALICE, ATLAS, CMS and LHCb have been evaluating solutions that integrate GPUs for a while now. The three main topics are online and offline data reconstruction, but also physics simulation, and machine-learning and deep-learning applications applied to analysis.

Now, before looking at a broader overview of the status of the current projects, let us try to highlight the traits of an algorithm that can be deployed well on a GPU. In the field of high-energy physics we have multiple applications related to reconstruction: for instance, we have seeding and track fitting, and these are the kinds of algorithms, with an inherently combinatorial computation, that can benefit from the usage of graphics processing units. Also, in recent times some experiments are already moving to GPUs, sometimes even in order to meet the requirements of their upgrades. So let us define the main features that a high-energy-physics algorithm must comply with to be ported to a GPU.
In the end, what we find is that the algorithm must be static and predictable, meaning that in its control flow we do not want to encounter too many if statements, and also that the program should be intelligent in its use of the local memory, in order to minimize the latency of the computations. I will now present very briefly an interesting example that fits the single-instruction, multiple-threads paradigm very well, which is the Kalman filter. It is a popular algorithm and is present in many different scenarios. The idea is to have an iterative algorithm that at every iteration performs the same kind of operations, and this is true for every instance, so the instances can run concurrently on a parallel architecture, for instance that of a GPU. In this case, considering a hypothetical event, we can perform the fit of every track on a different thread. The only operations that the Kalman filter performs are matrix multiplications: the data can change, but the operations do not, so we do not encounter branches in the control flow, which means we can reach a very high throughput when deploying and scaling it on a parallel architecture. In the end, what we also consider is that to efficiently achieve a large throughput we have to fill the occupancy of the GPU, and the possibility to perform the reconstruction of multiple events at once allows us to implement multiple levels of parallelism. So GPUs are very, very efficient in performing those kinds of computation.

So let's move to the broader overview. In recent times we observe a raised interest in the topic of using GPUs in high-energy physics. This is mainly driven by the fact that the computational challenges of the upcoming runs at the LHC are more and more demanding in terms of computing resources. What we also observe is a kind of consolidation of a common knowledge base for this development in our field, and there is an increasing interest in rethinking existing algorithms in order to inspect whether there is a possibility to move them towards a parallel approach, be it on CPU or on GPU.
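To make the Kalman-filter example above concrete, here is a deliberately simplified CUDA sketch, added for illustration: real track fits use small matrices, while a scalar state is used here for brevity, but the structure is the one described in the talk, one thread per track, every thread running the identical, branch-free update sequence over the detector layers.

    #include <cuda_runtime.h>

    // One thread per track. The measurements differ per thread, the
    // instructions do not, so there are no branches in the hot loop.
    __global__ void kalman1d(int nTracks, int nLayers,
                             const float* meas,  // nTracks * nLayers hits
                             float* fitted)      // one fitted value per track
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= nTracks) return;

        float x = 0.0f;        // state estimate
        float P = 1e3f;        // state variance (vague prior)
        const float R = 0.1f;  // measurement variance, assumed constant

        for (int l = 0; l < nLayers; ++l) {
            float K = P / (P + R);                 // Kalman gain
            x += K * (meas[t * nLayers + l] - x);  // update estimate
            P *= (1.0f - K);                       // shrink variance
        }
        fitted[t] = x;
    }

    int main() {
        const int nTracks = 10000, nLayers = 10;
        float *meas, *fitted;
        cudaMallocManaged(&meas, nTracks * nLayers * sizeof(float));
        cudaMallocManaged(&fitted, nTracks * sizeof(float));
        for (int i = 0; i < nTracks * nLayers; ++i) meas[i] = 1.0f;  // dummy hits

        // Batching the tracks of many events into one launch like this is
        // what fills the occupancy of the GPU, as mentioned in the talk.
        kalman1d<<<(nTracks + 255) / 256, 256>>>(nTracks, nLayers, meas, fitted);
        cudaDeviceSynchronize();
        cudaFree(meas);
        cudaFree(fitted);
        return 0;
    }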
In the end, we also have some critical points, such as the integration with the existing, consolidated frameworks, which sometimes might not be obvious, and the fact that we want to avoid code duplication wherever possible. In general, any physics algorithm whose parallel implementation can provide a large benefit is a good candidate for being offloaded to an accelerator.

Okay, let's start with the overview. We start with LHCb: from Run 3, LHCb will operate its high-level-trigger-1 reconstruction entirely on GPUs. That means that the reconstruction of each sub-detector will be offloaded to a graphics card, and this is thanks to the Allen framework, a self-contained piece of software that can be deployed both on CPU and on GPU with the same results, and that implements the efficient tracking and all the other algorithms used in reconstruction and selection in HLT1. I advise you to also check the talk on Wednesday, which should be about this. What emerged in this case is that there is a good match between the features provided by GPUs and the requirements of the high-level trigger 1 at LHCb. And as you can see in the plot on the right, as the capability of the GPUs increases, the Allen framework does not saturate in terms of throughput, meaning that it is currently able to successfully and efficiently exploit all the resources that are deployed for it. So it is very promising.

Now we move to CMS. Historically, a lot of effort on their side has been spent in supporting heterogeneity in the framework, and currently up to 24% of the online reconstruction is GPU-ready, meaning that it can be run on GPUs. This is thanks to projects like, for instance, Patatrack, which is responsible for the pixel-based track and vertex reconstruction. You can see in the bar plot on the right a comparison between the legacy version and the new version, which can run both on GPU and on CPU. They also have a calorimeter local reconstruction that has been ported to GPU. It is also interesting to observe that for the high-luminosity LHC they are going to deploy the high-granularity calorimeter,
and its reconstruction algorithm is going to be on GPU as well. Observing the plot on the right, the parallel version on GPU can scale up to 550 times faster than the sequential, single-core version, and the interesting fact is that the same code can also be deployed on a CPU, obtaining scaling by running in parallel. Also in CMS we have a lot of machine-learning applications that can be run on a GPU, not only for the training part but also for the inference, and they are investigating possible techniques for simulation.

Considering ATLAS now: ATLAS has historically already performed some tests of an implementation of their tracking on GPUs. You can see in the plot on the right that the results were promising; in the end they decided not to use them, but that is not very relevant at this point. What is relevant is that for Runs 4 and 5 they will have the knowledge base that allows them to investigate the use of GPUs. They are probably going to use them, for instance for the trigger and for the offline reconstruction, and for a lot of other applications, like the simulation, which is very interesting, since they have projects where they are porting parts of the libraries to be able to exploit the GPU to boost their computation. That said, GPUs for simulation are not as easily exploitable as in other use cases. Also in ATLAS we have different machine-learning techniques that are ready to be run on GPU backends.

Last but not least, the ALICE experiment in Run 3 will move towards a triggerless data-acquisition mode, meaning that they will implement the so-called continuous readout. To achieve that, they are going to need large data-compression factors, which will reduce the input bandwidth from 3.5 terabytes per second down to around 100 gigabytes per second, and GPUs are the pivotal architecture to process the data and perform this reduction efficiently.
Considering, for instance, the most computationally expensive part, which is the time-projection-chamber tracking: the plot on the right, which shows a normalized speed-up, compares the computing power of a GPU to the computing power of CPUs, in order to somehow estimate the exchange factor between the two devices. Since the working conditions will be more towards the right part of the plot, we can say that one GPU will replace up to 40 CPU cores in their online infrastructure. There are also more advanced scenarios that foresee deploying more and more parts of the reconstruction on the GPUs whenever they are available, in order to efficiently exploit those resources, but also to free some resources on the CPUs. I also advise you to attend David Rohr's presentation on this topic.

So, here I come to my conclusions. The upcoming runs at the LHC will be extremely challenging in terms of computing requirements, and GPUs might be one solution to this increasing demand, to address the lack of computing power. Many different efforts with diverse scopes are ongoing, carried out by all the large LHC experiments, and the idea is to accelerate all the parallel workflows that we have, in order to push performance beyond what can currently be achieved with CPUs. There are also some cases where GPUs are enabling scenarios that otherwise would not be reachable by only using standard CPUs. One last point I will leave you to consider is that the next generation of data centres, but also of high-performance-computing facilities, will increase the number of GPUs on board their computing nodes. So being able to efficiently exploit those kinds of architectures will give us a lot of potential that we would otherwise just ignore, since most of our workflows run on CPUs. From my side that is everything, and thanks for your attention.

Hey, Matteo, thank you for the talk. I am opening the floor for questions: please use the raise-hand button in the interface. I see a raised hand; I am allowing you to talk, so you should be able to unmute yourself and ask your question.

Yes, can you hear me? Yes. Cool.
So I was wondering about the use of CUDA, particularly considering that apparently more vendors are trying to develop GPGPUs and to deploy them, and whether it is still a good idea to be vendor-specific by using CUDA exclusively, also considering that there is a lot of development in terms of other parallelism frameworks. Is that clear?

Yes, yes. So what is your opinion on this?

My take on that is that, in the end, this has to be considered experiment by experiment. There are some use cases, also in the past, where NVIDIA frameworks were considered just because of their level of maturity, but also because of their level of performance. So the idea is that, depending on how much manpower and how many resources you have, you can also investigate different vendor solutions, and also different frameworks that claim to be portable across different architectures. I did not go very much into detail on this topic in the presentation, because it is a broader one, but the idea is that more or less all the experiments are considering frameworks that allow for portability on NVIDIA, AMD or, in the future, Intel GPUs. I think that nowadays nobody is committing anymore to a single-vendor solution; in any case, they always leave some other door open, at least for tests and development. In the end you have to converge, for production, on one architecture, but it depends: if you are able to support it, you can also write one code base and be able to run it on different GPUs, but you have to make sure every time that the results are the same, that the performance is portable, and so on. So my take on that is that it depends on how much you are able to support two of them, or three of them. It depends.

But would you say there is a trend?

The trend depends on whom you consider and on how fluent they are, because if you look outside our field, the domination of NVIDIA is indeed undeniable. But there are more and more convincing use cases claiming that you can run, let's say, TensorFlow on AMD GPUs, and so on. So for the time being I do not have a very strong opinion on it, but the trend of considering other GPUs, which may be much cheaper and may just fulfil what you need, is present. It is growing.
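As a concrete illustration of the "write one code base, run it on CPU and GPU" idea touched on in this answer, here is a common single-source pattern in CUDA (illustrative only; this is not the code of any experiment's framework): the algorithm lives in one __host__ __device__ function, compiled both into a GPU kernel and into a plain CPU loop, and the two results can be cross-checked.

    #include <cstdio>
    #include <cuda_runtime.h>

    // The algorithm is written once; __host__ __device__ lets the very same
    // source compile both for a CPU loop and for a GPU kernel.
    __host__ __device__ float transform(float x) { return 2.0f * x + 1.0f; }

    __global__ void transformKernel(int n, const float* in, float* out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = transform(in[i]);
    }

    void transformOnCpu(int n, const float* in, float* out) {
        for (int i = 0; i < n; ++i) out[i] = transform(in[i]);  // same code path
    }

    int main() {
        const int n = 1024;
        float *in, *outGpu;
        float outCpu[1024];
        cudaMallocManaged(&in, n * sizeof(float));
        cudaMallocManaged(&outGpu, n * sizeof(float));
        for (int i = 0; i < n; ++i) in[i] = float(i);

        transformKernel<<<(n + 255) / 256, 256>>>(n, in, outGpu);
        cudaDeviceSynchronize();
        transformOnCpu(n, in, outCpu);

        // Cross-check that the two backends agree, as the answer stresses.
        std::printf("backends agree: %s\n", outGpu[10] == outCpu[10] ? "yes" : "no");
        cudaFree(in);
        cudaFree(outGpu);
        return 0;
    }

Portability layers such as Alpaka, Kokkos or SYCL generalize this same idea across vendors, which is what makes the multi-vendor strategy discussed here practical.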
Okay, thank you very much, very interesting.

Are there any other questions? I don't see any raised hands, so here is one from me. You have talked mostly about the on-premises usage of GPUs, but I know that there is also a line of work on trying to use cloud-based GPUs, or server-based GPUs at a computing centre: something in between the online HLT type of facilities and the HPC facilities where there is a GPU on each node, and a setup where you either have just one GPU server for the whole HPC centre, or just GPUs on the cloud. Can you say a couple of words on this?

Well, I think that the usage of cloud GPUs will probably be more suited to the offline kinds of computation, because in the end the HLT farms, or de facto HLT clusters, are not really interested at the moment, at least to the best of my knowledge, in deploying some kind of abstraction layer for virtualizing GPUs for their use cases. Usually what an experiment tries to do is to exploit every single GPU at its maximum capability, which means parallelizing on it, so a cloud approach is not the goal you aim for there. It is possible that there are indeed some offline use cases that are going to consider converting a cluster in order to be able to allocate parts of GPUs to smaller tasks, but I am not really sure this is very convenient for our use cases. If with the cloud approach you mean virtual GPUs: indeed, when you buy vGPUs on Amazon, what you get is a GPU that is partitioned, and you will see it as a less powerful one. But I don't think an experiment is going to run an online workflow on an Amazon or Google platform, or whatever cloud-provider cluster. So these are different use cases that can indeed be investigated, but for the moment we have mostly stuck to the online reconstruction.

Sounds good. I see your hand up, do you want to comment?

I just want to very briefly comment on this. While everything you said is of course true, Matteo, I think that, at least in LHCb, once we have a large cluster of GPUs sitting there,
then at some point exposing them to other, non-online workflows when you are not taking data is something that isn't necessarily trivial, but I think there will be increasing work and reflection over the next years on how we make sure that they are not just sitting there as idle silicon, particularly during Run 3.

Good point, good point. Yes. All right, thank you, Matteo, for the talk and the answers.