I think it is time to start the second session of the tutorial track. Welcome, everyone. We will be using the webinar functionality of Zoom, as we did yesterday. For those not familiar with it: the speakers have the right to speak, while the attendees are asked to raise their hands with the button on the Zoom interface, so that they can get the attention of the co-hosts, and we will unmute you after the talks. So feel free to raise your hands during a talk and after a talk, so that we can unmute you and you will then be able to ask your questions. If there is any problem, you also have the chat, so that we can try to see whether we can resolve it.

If there is no major issue, we can start with the first presentation, from Matteo, on GPUs for boosting performance in HEP.

Okay, so can you hear me?

Yes, we can hear you.

I'm trying to share the slides. Okay, do you see the slides?

Yes.

Okay, so good afternoon, everybody. Today I am pleased to present my summary of the usage of graphics processing units for boosting performance in high-energy-physics computational use cases. This is the outline. Let me first briefly introduce the architecture of GPUs, for those who are less familiar with it. A graphics processing unit is a programmable architecture that offers a large number of parallel and independent streams of instructions. GPUs were initially designed for image processing and rendering, but nowadays they are also used for so-called general-purpose GPU computing, which means using the GPU to solve mathematical problems in scientific fields. When we compare the architecture of a graphics processing unit with that of a CPU, we see that the essential layout is quite similar: they both have a dedicated memory, and they have arithmetic logic units, which are responsible for performing the arithmetic and logic operations. They also have a different number of control units, which dispatch the instructions to the processor. So the structure of the chips is very similar. Despite this fact, they were designed to cope with different types of computation.
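To make the scale difference between the two architectures tangible, here is a minimal CUDA sketch (added for illustration, not part of the talk) that queries how many multiprocessors and resident threads a device exposes; the calls are standard CUDA runtime API.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int dev = 0; dev < count; ++dev) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);
            // A single GPU keeps many thousands of threads resident at once,
            // versus the hundreds of hardware threads of a large CPU node.
            std::printf("Device %d: %s\n", dev, prop.name);
            std::printf("  multiprocessors: %d\n", prop.multiProcessorCount);
            std::printf("  max resident threads: %d\n",
                        prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor);
        }
        return 0;
    }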
Let me just add that both GPU and CPU use threads to distribute the parallel workloads of a program among the multiple processors. There is, however, a difference in the scale of the architecture: a CPU can manage up to hundreds of processors and threads, while a GPU can scale well beyond many thousands of them. Their typical scopes are also quite different. The CPU focuses more on minimizing the latency of context switching among applications, reducing the time spent moving from one to another, while the GPU aims more at implementing the single-instruction, multiple-threads paradigm, which maximizes throughput, that is, the amount of operations performed on data per unit of time.

So, while a CPU is more flexible, a GPU is focused only on the aforementioned paradigm, where each thread in a group performs the same sequence of instructions as its neighbors do. In some contexts, GPUs are therefore faster and less power-consuming than CPUs, and a workflow that fits this paradigm must comply with certain characteristics. When we successfully exploit the GPU for this kind of computation, we can reach peaks of performance that can be one order of magnitude higher than what the CPU offers. One other thing to take into account is that recently the number of floating-point operations per second, that is, the capability of the GPU, has been growing at a faster pace than that of CPUs, so GPUs are becoming more and more powerful as time passes. Using a GPU to accelerate our computing workloads is indeed a choice of convenience, also because, from our point of view, a computing node nowadays can host up to eight GPUs and up to four CPUs on the same motherboard, meaning that if we have a workload that can be deployed well both on GPU and on CPU, we can buy fewer computing nodes by using more GPUs. So in the end, the cost of re-engineering the software and the algorithms we currently have is compensated by the performance gains related to the adoption of GPUs.
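As a concrete illustration of the single-instruction, multiple-threads paradigm just described, here is a minimal CUDA sketch (illustrative only, not from the talk): every thread executes the same instruction sequence on its own data element, and the launch spans about a million threads.

    #include <cuda_runtime.h>

    // SIMT: each thread applies the same instruction sequence to one element.
    __global__ void saxpy(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];  // identical instructions, different data
    }

    int main() {
        const int n = 1 << 20;  // ~1M elements, far beyond any CPU thread count
        float *x, *y;
        cudaMallocManaged(&x, n * sizeof(float));
        cudaMallocManaged(&y, n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;  // cover the whole array
        saxpy<<<blocks, threads>>>(n, 2.0f, x, y);
        cudaDeviceSynchronize();

        cudaFree(x);
        cudaFree(y);
        return 0;
    }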
So let's talk about general-purpose GPU (GPGPU) computing, meaning, as I said before, the usage of graphics cards for generic computation. This is a reality in scientific fields also outside high-energy physics: we have a lot of successful machine-learning and artificial-intelligence applications of GPUs, but also real-time image processing, in those use cases where reactivity is a requirement, like, for instance, self-driving cars. So any inherently parallel algorithm, one that can hypothetically scale well on a parallel architecture, is a good candidate to be deployed on a GPU with a large core density. This is also true in the field of high-energy physics, where GPUs have been investigated for over a decade, and there are currently many kinds of applications that promote their usage in different fields and at different stages. Inside the HEP field there are ongoing projects at more or less advanced maturity levels. To mention the four main large experiments at the LHC: ALICE, ATLAS, CMS and LHCb have been evaluating solutions that integrate GPUs for a while now. The three main topics are online and offline data reconstruction, but also physics simulation, and machine-learning and deep-learning applications applied to analysis.

Now, before looking at a broader overview of the status of the current projects, let us try to highlight the traits of an algorithm that can be deployed well on a GPU. In the field of high-energy physics we have multiple applications related to reconstruction: for instance, we have seeding and track fitting, and these are the kinds of algorithms, with an inherently combinatorial computation, that can benefit from the usage of graphics processing units. Also, in recent times some experiments are already moving to GPUs, sometimes even in order to meet the requirements of their upgrades. So let us define the main features that a high-energy-physics algorithm must comply with to be ported to a GPU.
In the end, what we find is that the algorithm must be static and predictable, meaning that in its control flow we do not want to encounter too many if statements, and also that the program should be intelligent in its use of the local memory, in order to minimize the latency of the computations. I will now present very briefly an interesting example that fits the single-instruction, multiple-threads paradigm very well, which is the Kalman filter. It is a popular algorithm and is present in many different scenarios. The idea is to have an iterative algorithm that at every iteration performs the same kind of operations, and this is true for every instance, so the instances can run concurrently on a parallel architecture, for instance that of a GPU. In this case, considering a hypothetical event, we can perform the fit of every track on a different thread. The only operations that the Kalman filter performs are matrix multiplications: the data can change, but the operations do not, so we do not encounter branches in the control flow, which means we can reach a very high throughput when deploying and scaling it on a parallel architecture. In the end, what we also consider is that to efficiently achieve a large throughput we have to fill the occupancy of the GPU, and the possibility to perform the reconstruction of multiple events at once allows us to implement multiple levels of parallelism. So GPUs are very, very efficient in performing those kinds of computation.

So let's move to the broader overview. In recent times we observe a raised interest in the topic of using GPUs in high-energy physics. This is mainly driven by the fact that the computational challenges of the upcoming runs at the LHC are more and more demanding in terms of computing resources. What we also observe is a kind of consolidation of a common knowledge base for this development in our field, and there is an increasing interest in rethinking existing algorithms in order to inspect whether there is a possibility to move them towards a parallel approach, be it on CPU or on GPU.
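To make the Kalman-filter example above concrete, here is a deliberately simplified CUDA sketch, added for illustration: real track fits use small matrices, while a scalar state is used here for brevity, but the structure is the one described in the talk, one thread per track, every thread running the identical, branch-free update sequence over the detector layers.

    #include <cuda_runtime.h>

    // One thread per track. The measurements differ per thread, the
    // instructions do not, so there are no branches in the hot loop.
    __global__ void kalman1d(int nTracks, int nLayers,
                             const float* meas,  // nTracks * nLayers hits
                             float* fitted)      // one fitted value per track
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= nTracks) return;

        float x = 0.0f;        // state estimate
        float P = 1e3f;        // state variance (vague prior)
        const float R = 0.1f;  // measurement variance, assumed constant

        for (int l = 0; l < nLayers; ++l) {
            float K = P / (P + R);                 // Kalman gain
            x += K * (meas[t * nLayers + l] - x);  // update estimate
            P *= (1.0f - K);                       // shrink variance
        }
        fitted[t] = x;
    }

    int main() {
        const int nTracks = 10000, nLayers = 10;
        float *meas, *fitted;
        cudaMallocManaged(&meas, nTracks * nLayers * sizeof(float));
        cudaMallocManaged(&fitted, nTracks * sizeof(float));
        for (int i = 0; i < nTracks * nLayers; ++i) meas[i] = 1.0f;  // dummy hits

        // Batching the tracks of many events into one launch like this is
        // what fills the occupancy of the GPU, as mentioned in the talk.
        kalman1d<<<(nTracks + 255) / 256, 256>>>(nTracks, nLayers, meas, fitted);
        cudaDeviceSynchronize();
        cudaFree(meas);
        cudaFree(fitted);
        return 0;
    }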
In the end, we also have some critical points, such as the integration with the existing, consolidated frameworks, which sometimes might not be obvious, and the fact that we want to avoid code duplication wherever possible. In general, any physics algorithm whose parallel implementation can provide a large benefit is a good candidate for being offloaded to an accelerator.

Okay, let's start with the overview. We start with LHCb: from Run 3, LHCb will operate its high-level-trigger-1 reconstruction entirely on GPUs. That means that the reconstruction of each sub-detector will be offloaded to a graphics card, and this is thanks to the Allen framework, a self-contained piece of software that can be deployed both on CPU and on GPU with the same results, and that implements the efficient tracking and all the other algorithms used in reconstruction and selection in HLT1. I advise you to also check the talk on Wednesday, which should be about this. What emerged in this case is that there is a good match between the features provided by GPUs and the requirements of the high-level trigger 1 at LHCb. And as you can see in the plot on the right, as the capability of the GPUs increases, the Allen framework does not saturate in terms of throughput, meaning that it is currently able to successfully and efficiently exploit all the resources that are deployed for it. So it is very promising.

Now we move to CMS. Historically, a lot of effort on their side has been spent in supporting heterogeneity in the framework, and currently up to 24% of the online reconstruction is GPU-ready, meaning that it can be run on GPUs. This is thanks to projects like, for instance, Patatrack, which is responsible for the pixel-based track and vertex reconstruction. You can see in the bar plot on the right a comparison between the legacy version and the new version, which can run both on GPU and on CPU. They also have a calorimeter local reconstruction that has been ported to GPU. It is also interesting to observe that for the high-luminosity LHC they are going to deploy the high-granularity calorimeter,
and its reconstruction algorithm is going to be on GPU as well. Observing the plot on the right, the parallel version on GPU can scale up to 550 times faster than the sequential, single-core version, and the interesting fact is that the same code can also be deployed on a CPU, obtaining scaling by running in parallel. Also in CMS we have a lot of machine-learning applications that can be run on a GPU, not only for the training part but also for the inference, and they are investigating possible techniques for simulation.

Considering ATLAS now: ATLAS has historically already performed some tests of an implementation of their tracking on GPUs. You can see in the plot on the right that the results were promising; in the end they decided not to use them, but that is not very relevant at this point. What is relevant is that for Runs 4 and 5 they will have the knowledge base that allows them to investigate the use of GPUs. They are probably going to use them, for instance for the trigger and for the offline reconstruction, and for a lot of other applications, like the simulation, which is very interesting, since they have projects where they are porting parts of the libraries to be able to exploit the GPU to boost their computation. That said, GPUs for simulation are not as easily exploitable as in other use cases. Also in ATLAS we have different machine-learning techniques that are ready to be run on GPU backends.

Last but not least, the ALICE experiment in Run 3 will move towards a triggerless data-acquisition mode, meaning that they will implement the so-called continuous readout. To achieve that, they are going to need large data-compression factors, which will reduce the input bandwidth from 3.5 terabytes per second down to around 100 gigabytes per second, and GPUs are the pivotal architecture to process the data and perform this reduction efficiently.
Considering, for instance, the most computationally expensive part, which is the time-projection-chamber tracking: the plot on the right, which shows a normalized speed-up, compares the computing power of a GPU to the computing power of CPUs, in order to somehow estimate the exchange factor between the two devices. Since the working conditions will be more towards the right part of the plot, we can say that one GPU will replace up to 40 CPU cores in their online infrastructure. There are also more advanced scenarios that foresee deploying more and more parts of the reconstruction on the GPUs whenever they are available, in order to efficiently exploit those resources, but also to free some resources on the CPUs. I also advise you to attend David Rohr's presentation on this topic.

So, here I come to my conclusions. The upcoming runs at the LHC will be extremely challenging in terms of computing requirements, and GPUs might be one solution to this increasing demand, to address the lack of computing power. Many different efforts with diverse scopes are ongoing, carried out by all the large LHC experiments, and the idea is to accelerate all the parallel workflows that we have, in order to push performance beyond what can currently be achieved with CPUs. There are also some cases where GPUs are enabling scenarios that otherwise would not be reachable by only using standard CPUs. One last point I will leave you to consider is that the next generation of data centres, but also of high-performance-computing facilities, will increase the number of GPUs on board their computing nodes. So being able to efficiently exploit those kinds of architectures will give us a lot of potential that we would otherwise just ignore, since most of our workflows run on CPUs. From my side that is everything, and thanks for your attention.

Hey, Matteo, thank you for the talk. I am opening the floor for questions: please use the raise-hand button in the interface. I see a raised hand; I am allowing you to talk, so you should be able to unmute yourself and ask your question.

Yes, can you hear me? Yes. Cool.
So I was wondering about the use of CUDA, particularly considering that apparently more vendors are trying to develop GPGPUs and to deploy them, and whether it is still a good idea to be vendor-specific by using CUDA exclusively, also considering that there is a lot of development in terms of other parallelism frameworks. Is that clear?

Yes, yes. So what is your opinion on this?

My take on that is that, in the end, this has to be considered experiment by experiment. There are some use cases, also in the past, where NVIDIA frameworks were considered just because of their level of maturity, but also because of their level of performance. So the idea is that, depending on how much manpower and how many resources you have, you can also investigate different vendor solutions, and also different frameworks that claim to be portable across different architectures. I did not go very much into detail on this topic in the presentation, because it is a broader one, but the idea is that more or less all the experiments are considering frameworks that allow for portability on NVIDIA, AMD or, in the future, Intel GPUs. I think that nowadays nobody is committing anymore to a single-vendor solution; in any case, they always leave some other door open, at least for tests and development. In the end you have to converge, for production, on one architecture, but it depends: if you are able to support it, you can also write one code base and be able to run it on different GPUs, but you have to make sure every time that the results are the same, that the performance is portable, and so on. So my take on that is that it depends on how much you are able to support two of them, or three of them. It depends.

But would you say there is a trend?

The trend depends on whom you consider and on how fluent they are, because if you look outside our field, the domination of NVIDIA is indeed undeniable. But there are more and more convincing use cases claiming that you can run, let's say, TensorFlow on AMD GPUs, and so on. So for the time being I do not have a very strong opinion on it, but the trend of considering other GPUs, which may be much cheaper and may just fulfil what you need, is present. It is growing.
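As a concrete illustration of the "write one code base, run it on CPU and GPU" idea touched on in this answer, here is a common single-source pattern in CUDA (illustrative only; this is not the code of any experiment's framework): the algorithm lives in one __host__ __device__ function, compiled both into a GPU kernel and into a plain CPU loop, and the two results can be cross-checked.

    #include <cstdio>
    #include <cuda_runtime.h>

    // The algorithm is written once; __host__ __device__ lets the very same
    // source compile both for a CPU loop and for a GPU kernel.
    __host__ __device__ float transform(float x) { return 2.0f * x + 1.0f; }

    __global__ void transformKernel(int n, const float* in, float* out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = transform(in[i]);
    }

    void transformOnCpu(int n, const float* in, float* out) {
        for (int i = 0; i < n; ++i) out[i] = transform(in[i]);  // same code path
    }

    int main() {
        const int n = 1024;
        float *in, *outGpu;
        float outCpu[1024];
        cudaMallocManaged(&in, n * sizeof(float));
        cudaMallocManaged(&outGpu, n * sizeof(float));
        for (int i = 0; i < n; ++i) in[i] = float(i);

        transformKernel<<<(n + 255) / 256, 256>>>(n, in, outGpu);
        cudaDeviceSynchronize();
        transformOnCpu(n, in, outCpu);

        // Cross-check that the two backends agree, as the answer stresses.
        std::printf("backends agree: %s\n", outGpu[10] == outCpu[10] ? "yes" : "no");
        cudaFree(in);
        cudaFree(outGpu);
        return 0;
    }

Portability layers such as Alpaka, Kokkos or SYCL generalize this same idea across vendors, which is what makes the multi-vendor strategy discussed here practical.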
Okay, thank you very much, very interesting.

Are there any other questions? I don't see any raised hands, so here is one from me. You have talked mostly about the on-premises usage of GPUs, but I know that there is also a line of work on trying to use cloud-based GPUs, or server-based GPUs at a computing centre: something in between the online HLT type of facilities and the HPC facilities where there is a GPU on each node, and a setup where you either have just one GPU server for the whole HPC centre, or just GPUs on the cloud. Can you say a couple of words on this?

Well, I think that the usage of cloud GPUs will probably be more suited to the offline kinds of computation, because in the end the HLT farms, or de facto HLT clusters, are not really interested at the moment, at least to the best of my knowledge, in deploying some kind of abstraction layer for virtualizing GPUs for their use cases. Usually what an experiment tries to do is to exploit every single GPU at its maximum capability, which means parallelizing on it, so a cloud approach is not the goal you aim for there. It is possible that there are indeed some offline use cases that are going to consider converting a cluster in order to be able to allocate parts of GPUs to smaller tasks, but I am not really sure this is very convenient for our use cases. If with the cloud approach you mean virtual GPUs: indeed, when you buy vGPUs on Amazon, what you get is a GPU that is partitioned, and you will see it as a less powerful one. But I don't think an experiment is going to run an online workflow on an Amazon or Google platform, or whatever cloud-provider cluster. So these are different use cases that can indeed be investigated, but for the moment we have mostly stuck to the online reconstruction.

Sounds good. I see your hand up, do you want to comment?

I just want to very briefly comment on this. While everything you said is of course true, Matteo, I think that, at least in LHCb, once we have a large cluster of GPUs sitting there,
then at some point exposing them to other, non-online workflows when you are not taking data is something that isn't necessarily trivial, but I think there will be increasing work and reflection over the next years on how we make sure that they are not just sitting there as idle silicon, particularly during Run 3.

Good point, good point. Yes. All right, thank you, Matteo, for the talk and the answers.