Yes. Okay, yeah. So this is a data and analysis preservation presentation, touching on the different approaches across the different LHC experiments. Due to time it is going to be fairly high level, but I will try to touch on the main things.

First, some motivation for why we are doing this at all. I think sometimes, in our day-to-day life, we forget how special the LHC is. But it really is special: the data and the results that we are extracting from this machine are pretty unique, and the analyses that we use to extract these results are also unique. Both of these things merit preservation for posterity. So you can ask yourself what the scientific output is that we have beyond the papers. Since we have a unique machine, we should strive to make our results as useful as possible, and also make our data available in formats that are as useful as possible.

In this area of activity there are three main directions. There is the preservation of data products, the high-level data products that we extract from our analyses. Then there is analysis preservation, where you try to preserve the analysis workflow itself. And then there is a third branch, which goes more towards open data, where you open up the data for researchers and people outside of the collaborations.

I will go through all three. So, HEPData: I think everybody is familiar with it. HEPData has really been a crucial piece of cyberinfrastructure for our field. It is basically the main destination where we put numeric, machine-readable data that relates to our publications online for people to reuse, and it is really a destination for high-quality but small data products. Traditionally this started as just a digitization of the tables that are in the papers, but it has evolved into a much wider set of data products. All the LHC experiments are participating; there are different levels, different percentages of analyses that have a corresponding HEPData record, but in principle most of the experiments are using it as a platform. So that's good.
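As a small illustration of what "machine readable" means in practice, here is a hedged sketch of pulling a record programmatically. It assumes HEPData's JSON export (requesting a record page with format=json); the record ID is a placeholder and the exact response layout should be checked against the live service.

```python
import requests

# Placeholder record ID: substitute the INSPIRE ID of any published analysis.
record_url = "https://www.hepdata.net/record/ins1234567"

# Assumed JSON export: the same record page requested with format=json.
response = requests.get(record_url, params={"format": "json"}, timeout=30)
response.raise_for_status()
record = response.json()

# Inspect what the record provides without assuming its internal layout.
print(list(record) if isinstance(record, dict) else type(record))
```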
As I said, we started out with just tables, but now the experiments upload all types of information that relates to the analysis there. It goes from C++ snippets and pseudo-code to spectrum files for searches, likelihoods, machine learning models, and all that. So HEPData has proven to be a really crucial piece.

Just as an example of the kinds of things we cover there: one of the things that a lot of people outside the collaborations like to do is re-implement an analysis so that they have a fast, approximate version of the event selection procedure. So, for example, ATLAS puts up C++ snippets and pseudo-code; sometimes we also have Rivet analyses for the different analyses, and they are linked on HEPData. One thing I want to point out: as Ben was saying, we all expect machine learning to be a major component of our analysis tool chain going forward, and one thing that is a little bit hard is, if you use machine learning heavily, how do you actually put that out there? Normally, if you write a re-implementation routine, you can go through the cuts you are applying; with machine learning this becomes a little trickier. But at least in ATLAS we have one example where we put the entire machine learning model publicly on HEPData, on the record associated with the analysis. So you have all the weights, and if you have a simulation that is good enough to reproduce the inputs to this machine learning model faithfully, you can use that public model to evaluate the multivariate function. So that's pretty good.
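A sketch of how an outside user might evaluate such a published model. The concrete serialization depends on what the analysis released; purely as an assumption here, suppose the model was exported as an ONNX file, so it can be run with onnxruntime. The file name, input shape, and feature ordering are hypothetical.

```python
import numpy as np
import onnxruntime as ort

# Hypothetical file name for a classifier attached to a HEPData record.
session = ort.InferenceSession("published_classifier.onnx")
input_name = session.get_inputs()[0].name

# Stand-in for kinematic inputs reconstructed from your own simulation;
# the number and ordering of features must match what the experiment documents.
events = np.random.rand(10, 12).astype(np.float32)  # 10 events, 12 features

scores = session.run(None, {input_name: events})[0]
print(scores[:5])
```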
One of the things that I want to point out in this talk is the public likelihoods. There has been a lot of attention on likelihoods, so I just want to motivate why that is. Basically, what we try to do in our field is extract information about the theory that we assume produces our data, from the data itself. If you try to specify this inference problem, you have maybe your prior over the theory; that part is the job of the theorists. The likelihood is then really the part that summarizes what the experiments do: it quantifies how likely the data is given a theory. It is really a focal point of the entire analysis chain, where all the different decisions about performance optimizations, data-acquisition operations, or the analysis choices you make all get reflected. It is a really high-information-density data product if we are able to preserve it, and all the standard inference results, like limits, yield tables, or data/Monte Carlo plots, are basically derived from this likelihood. So it is kind of a bottleneck, and that makes it really valuable to preserve.
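To make this concrete, a binned likelihood of the HistFactory type used for such searches has the standard form

\[ L(\mu, \boldsymbol{\theta}) = \prod_{i \in \mathrm{bins}} \mathrm{Pois}\!\left(n_i \mid \mu\, s_i(\boldsymbol{\theta}) + b_i(\boldsymbol{\theta})\right) \times \prod_{j} c_j(a_j \mid \theta_j), \]

where \(\mu\) is the parameter of interest, \(\boldsymbol{\theta}\) are the nuisance parameters, \(n_i\) the observed counts, \(s_i\) and \(b_i\) the signal and background predictions, and \(c_j\) the constraint terms encoding auxiliary measurements. Limits, yield tables, and data/Monte Carlo comparisons are all functions of this one object.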
Often what theorists outside the collaborations do is go to HEPData, take the tables, like the background distributions published there, and try to construct a likelihood out of that information. But that is lossy: the fidelity of the published information does not allow you to reconstruct the likelihood. So there is this idea: what would happen if we just provided the likelihood directly on HEPData? Because internally, as experiments, we have it. This is not a new idea; it has been long in the making. At the first PhyStat conference at CERN in the year 2000 there was a discussion around this, and more or less everybody in that group of experts agreed that it would be a good idea for the LHC experiments to publish the likelihood function. But there were various technical limitations to that idea, and also sociological ones. The first step in that direction was to find a serialization format, and this was the introduction of the RooFit workspace. Then in 2012 ATLAS published the first profile likelihood. That is of course limited, because it is the likelihood after you have profiled out all the nuisance parameters; it is useful, but it does not allow you to do combinations and things like that. A couple of years later CMS also published simplified likelihoods, which have the same problem: very useful as a simple form of the likelihood, but they do not allow you to do actual combinations where you vary the nuisance parameters across multiple analyses systematically.

One thing that happened maybe half a year or a year ago is that we now have the first full likelihood release from an LHC experiment. This is an ATLAS effort where we took the binned likelihood; in ATLAS we use the HistFactory format, and a lot of analyses use it. It is nice because you can define a pretty clean schema for it, so you can define a data product around it. So we have a publicly available likelihood on HEPData, and that allows external people to reproduce the key part of the analysis, the exclusion contour that delineates which theories are excluded and which are not, to the exact same fidelity as inside the experiment, which I think is pretty good. This has been a bit of a milestone for open data products at the LHC.
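As a rough sketch of what an external user can do with such a published JSON likelihood, here is a minimal example using the pyhf library, which implements the HistFactory JSON schema these releases follow. The file name is a placeholder; the real files are attached to the analysis record on HEPData, and some releases ship a background-only workspace plus a set of signal patches (handled with pyhf.PatchSet) rather than a single complete workspace.

```python
import json
import pyhf

# Placeholder name for a complete (signal plus background) workspace file.
with open("workspace.json") as f:
    spec = json.load(f)

workspace = pyhf.Workspace(spec)
model = workspace.model()      # full probability model, nuisance parameters included
data = workspace.data(model)   # observed counts plus auxiliary measurements

# Observed and expected CLs for a signal-strength hypothesis mu = 1,
# i.e. the ingredient behind the published exclusion contours.
cls_obs, cls_exp = pyhf.infer.hypotest(
    1.0, data, model, test_stat="qtilde", return_expected=True
)
print(f"CLs observed: {float(cls_obs):.3f}, CLs expected: {float(cls_exp):.3f}")
```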
Okay, so aside from open data products, there is also the use case of internal reuse. This is the idea of preserving the analysis for internal projects. There are efforts across all of the LHC experiments to foster analysis preservation, and there are different ingredients you need for it. First of all, the analysis is basically implemented in software, so there are different software packages that need to be preserved. But having the software alone is not enough; you also need to know what to do with it, so you need to capture which commands to run. And since your analysis is likely a multi-step procedure, you also need to capture the workflow, so that you know you first run the event selection, then the statistical analysis, and so on. Finally, you also need to preserve data assets, like your background estimates, ntuples, and all that.

CERN provides some infrastructure to assist the experiments in this effort. There is the REANA project, which basically provides workflows as a service: you can describe your analysis as a sequence of steps and then run it on this platform. And then there is CAP, which stands for the CERN Analysis Preservation portal. Once you have defined your analysis as this workflow, as a sequence of steps, you can take that description and put it into the analysis preservation framework, so that later on, if you want to reuse your analysis, you can pull it up, extract the workflow definition, and rerun the analysis.

Okay, so how do we do this? The capturing of the software is basically done through containers. Probably by now a lot of people are familiar with this: it is a technology that grew out of industry to package software in a portable way, sometimes referred to as Docker containers. It has been revolutionizing how we can package software, it has been picked up by all of the experiments, and it is pretty much universally seen as the solution to this problem. In ATLAS we are providing official base images, and CMS, ALICE, and LHCb are doing something similar. So that part is largely solved, even though initially, if you think about it, it sounds like the most complicated thing to be doing: you need to preserve not only your analysis software but all the dependencies, the ROOT version, the compiler version, all that. It sounds daunting, but through this technology it is actually almost the easiest part of the analysis preservation problem.

And then for the workflow, as I said, there is REANA, the platform that CERN provides to run these workflows, and it uses the concept of workflow languages. If you are familiar with continuous integration or things like that, it is similar: it allows you to define a pipeline in a declarative way. This is actually quite heavily used in fields outside of high-energy physics, like bioinformatics, and it is now starting to creep into high-energy physics. It allows you to go beyond pure software preservation and actually preserve the full analysis workflow, so that you can actually execute the analysis and do not need to remember what the steps are.
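As a toy sketch of the declarative idea: the preserved description is just data that records, for each step, which container image to use and which commands to run, and an engine executes it. Real REANA or yadage workflows are written in YAML and add data staging, scheduling, and provenance on top; the step names, image tags, and commands below are made up.

```python
import os
import subprocess

# Toy declarative description of a two-step analysis (hypothetical names and images).
workflow = [
    {"name": "event_selection",
     "image": "atlas/analysisbase:21.2.100",
     "command": "python select_events.py --out selected.root"},
    {"name": "statistical_analysis",
     "image": "atlas/statanalysis:latest",
     "command": "python run_fit.py --in selected.root --out limits.json"},
]

# Naive local runner: execute each step's command inside its container image,
# mounting the working directory so steps can pass files to each other.
for step in workflow:
    subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{os.getcwd()}:/work", "-w", "/work",
         step["image"], "bash", "-lc", step["command"]],
        check=True,
    )
```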
Okay, so one of the major use cases for this type of internal preservation and reuse is reinterpretation, and this goes under the name of RECAST. The idea here is that you have your analysis, which might be a search, exploring some corner of phase space that is interesting for some class of beyond-the-Standard-Model physics. You are probably not finding new physics, but you might be able to set limits on this class of models, and the phase-space region that you studied might actually also be sensitive to a different class of models. So the idea is that, since you already studied this phase space and you have your analysis workflow as a tool to look into it, you preserve the analysis; then, once a new class of models comes around to which this phase-space region seems to be sensitive, you can reuse it and extract limits.

ATLAS has been pushing this, and in ATLAS all the major search groups that probe beyond-the-Standard-Model physics now require the analysis teams to preserve their analysis in this reusable way, so that maybe a year down the line, if there is a new model, we can actually redo the analysis and extract new limits. And we have seen this happen: we have new scientific results based on what is, to be honest, a pretty technical requirement (you require people to provide container images, documentation, all that), but after you go through this exercise you actually extract new science out of it, at a cost that is much lower than the cost of setting up a new analysis dedicated to the new class of models. I have put up three different publications from ATLAS that use this technology to extract new limits in regions of theory space that were previously uncovered, which is nice to see.

Okay, so now the CERN Analysis Preservation portal. This is the portal where we can save information related to the analysis, and it is also meant more for internal usage. All the experiments are working on integrating their internal databases; there are two screenshots here, for LHCb and CMS. This is developed by the library section at CERN, the same people behind INSPIRE and these other services.
And so the focus here is to make it easy for analysis teams to submit information about what their analysis entails, and then also to make it easy to discover analyses that have specific features. Ideally it would work in such a way that if you are looking for analyses that use a specific trigger or a specific collection of objects, you would be able to query this database and find the analyses that match those criteria.

Okay, so the third column of this topic is the open data. All the experiments have open data programs, and here again CERN is providing infrastructure with the CERN Open Data portal. For ATLAS, LHCb, and ALICE this is mostly focused on outreach, and I have put in some examples of the different plots that you can make with the open data from these experiments. CMS has a much more expansive open data program, which is not only for outreach and educational resources but also for research. What is nice to see is that an external ecosystem is slowly developing around this type of open data. There is a workshop in October for external people to learn more about this data; I have put the link on the slide. And we have seen a number of papers appear based on this open data. One major theme there is the development of machine learning methods based on the CMS open data, and I have put down some examples here. There is also a schedule, defined well in advance, for where these open data releases will be at different points in time, and that makes it predictable and allows this ecosystem to develop outside of the experiments.

Okay, so this brings me to my conclusion. I think the LHC experiments have pretty strong analysis and data preservation programs, and some of the technological progress actually helps drive this kind of preservation, for example the use of containers and things like that. One important component in this entire endeavor is the availability of cyberinfrastructure for the different pieces: HEPData, CERN Analysis Preservation, the Open Data portal, REANA, RECAST, and all these things that have allowed the community to adopt these practices in a systematic way. And yeah, that is basically my conclusion. Thanks.
Thank you very much. So we have time for a couple of questions. Everyone, can you hear me? Yes? Yes, please.

Thanks for the presentation and the work. I was wondering, on RECAST and reinterpretation: what is the policy for citing the underlying source of data and information, if the reinterpretation is not done within the collaborations? I see that here you have papers that are reinterpretations by the collaboration itself, but one could also be requesting this from outside. So how do you do this, how do you give credit for the data and to the experiments?

So in this example it is all done by the collaboration: these are reinterpretations performed by the collaboration, and if you read these papers they reference the input analyses that were used for the reinterpretation. The idea is that, as this gets more streamlined, you might have a first analysis with a benchmark model that just explores the phase space, and then a sequence of follow-up publications that reference the original publication but explore somewhat more specific models to which that phase space is sensitive. If you are talking about external reinterpretation, where we have an analysis and we put out some information on HEPData that allows external people to reinterpret it, then the assumption is that those external reinterpretations cite the HEPData record. So if you, for example, use the public likelihood or the Rivet analysis that was released as part of the publication procedure, you need to cite those things and the tools that you used. But that is basically it.

Thanks a lot. I actually have a very quick question, more on the data preservation side. I think that in the last period we have seen more and more groups and collaborations using derived datasets that are not ROOT-based. In some sense, at least in my experience, this could create some long-term preservation problems. Of course these are derived data, so the argument is that one can always go back to the original or reconstructed data. But how do you see this evolving in the long term at CERN? In particular, I am also referring to the effort of having data-frame tooling that allows better storage formats for people who use Python-based data analysis and software, for example.
So I mean, the crucial thing is that whatever format you use for long-term data preservation has a specification and, hopefully, more than one implementation for reading the data. And you need to somehow balance things. I do not think there is going to be one specific open data format that all the experiments agree on, because in order to make this work at all you need to make it easy for these collaborations to release the data. So it will likely be in formats that the experiments define, but then the experiments also need to release the software to read the data. In the ATLAS case, for example, we are not using flat TTrees, we have an event data model on top of that, but that is also public. It is the same for CMS, where CMSSW is also open source. So as long as the software to read the data is available, I think you are covered. Whether, separately, the experiments explore alternative data formats, like Parquet or HDF5 or something like that, is I think a separate discussion, but in the end the open data format will likely be what the experiments use internally, whether that is NanoAOD or something similar.

Okay, thanks.

Are there any other questions for Lucas? If not, I think we can move to the last presentation of the session, on analysis description languages for the LHC.