Lxcert 21.04.05 draft1

Minutes of the LXCERT meeting, 21.04.2005

Invited / Attendance

Abbrev  Name                     Affiliation                    Present
        Alberto Aimar            LCG Application Area           missing
BB      Bruce M Barnett          ATLAS-online
AB      Alastair Bland           AB/CO-admin
        Eric Cano                CMS-online                     missing
JC      Joel Closier             LHCb-offline
NMN     Nicolas De Metz-Noblat   AB/CO-development              excused → AB
BG      Benigno Gobbo            non-LHC experiments
JI      Jan Iven                 IT-services catchall (chair)
NN      Niko Neufeld             LHCb-online
EO      Emil Obreshkov           ATLAS-offline
JP      Jarek Polok              general Desktops (secretary)
        Fons Rademakers          ALICE-offline                  excused → KS
        Thorsten Kleinwort       IT PLUS/BATCH service          missing
KS      Klaus Schossmaier        ALICE-online                   late
        Stephan Wynhoff          CMS-offline                    missing

Agenda

  • plans for the next CERN Linux version
  • Version proliferation & LCG collaboration (physics compiler)
  • continuous SLC evolution & test releases
  • post-mortem of SLC3, changes to the process and LXCERT membership

Issue1: plans for the next Linux version

JI: Position summary from the mails sent before the meeting:

  • SLC4 will be made available in any case for tests, but priority unclear
  • go to SLC5 directly, don't certify SLC4: LHCb/Marco, ALICE/Fons, ATLAS/David
  • need SLC4 (and some libraries): ATLAS-online/Bruce, LHCb-online/Niko, ALICE-online/Klaus (agree with ATLAS)
→ need to define the scope of SLC4 software availability and priority to see whether a full certification is required or not

JI: (assuming compiler/OS split is OK): library availability for -online becomes an intra-experiment issue, please coordinate internally and with library providers, via AF.

JI: what are the reasons for anybody to need SLC4 in "production" (e.g. kernel-2.6 availability?), and when?

AB: NMN wants SLC4 in January 06, and would prefer to have LXPLUS with that since it then feels "fully supported" (can direct users to IT helpdesk etc). May not be a "hard" requirement.

NN: LHCb "slow controls" needs PVSS, and compatibility would need testing (e.g. <iostream.h> needs to be available).
[JI: explicit action with ETM required during SLC3 certification]
[NN: aside: several issues still outstanding, SIGCHLD in log, /tmp-mode 0777]

AB: from Frank Schmidt (beam simulation): don't care about OS and batch capacity, need compiler (+ AFS capacity).

BB: the 2.6 kernel isn't the only point for ATLAS-online, stability is as well (release milestone in September 2006); unlikely to be in time if we only start a certification in summer 2006.

JP: last experience → 6 months to certify software (deep chain). Please remember that compat-libs only work backwards for 1 release, so SLC3→SLC5 will (most likely) need a recompile.
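
The compat-libs constraint JP describes can be written down as a small rule. This is only a sketch: the release list, the one-release window and the helper name `needs_recompile` are assumptions chosen for illustration, not part of any SLC tooling.

```python
# Illustrative sketch only. RELEASES, COMPAT_WINDOW and needs_recompile()
# are hypothetical names; the one-release window follows JP's remark.
RELEASES = ["SLC3", "SLC4", "SLC5"]
COMPAT_WINDOW = 1  # compat-libs cover only the immediately preceding release


def needs_recompile(built_on: str, run_on: str) -> bool:
    """Return True if binaries built on `built_on` most likely need a rebuild on `run_on`."""
    gap = RELEASES.index(run_on) - RELEASES.index(built_on)
    if gap < 0:
        raise ValueError("running on an older release is not covered by this sketch")
    return gap > COMPAT_WINDOW


for target in ("SLC4", "SLC5"):
    print(f"SLC3 -> {target}: recompile needed = {needs_recompile('SLC3', target)}")
```

Under these assumptions, skipping SLC4 puts SLC3 binaries outside the compatibility window, which is the recompile JP warns about.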

BB: staying on SLC3 completely forces "revolutionary change" instead of "evolution"

NN: all the LHCb data acquisition is already running on 2.6 (on SLC3), only own online code (including kernel drivers). Benefits from the "clean" LHCb separation between online and offline.
[AB: have 2.6 driver experience? interested..]
[mini diskless discussion → redirected to Linux4Controls/CNIC]

JI: proposal: split by OS and compiler, certify in parallel, merge, decide quickly before the September 2006 deadline?

BB/EO: summer 2006 is too late for any new certification to start. In general agree that new compiler (on old OS) and new OS should go in parallel, then combine OS+compiler and run another round of tests.

[NN: when will support for SLC3 be dropped?
JI: other way round, decide when new stuff is available+migration period etc., then phase out. RHE3 lifetime is until 2009
JP: but hardware support may push us, RHE3 only gets new drivers in 2005, and we don't want to carry too many versions around.]

JP: could have SLC4 installable in "basic setup" within a few weeks. Do we need this?
Announced changes: mostly 2.6 kernel, gcc-3.4.3, general package update.
[diverge: EM64T support: looking at it (works also on SLC3, some compatibility trouble between Opteron/Intel (wrong kernel))]
[diverge: updating system on SLC4? apt (have coded some dependencies against it)?
JP: don't know.. x86_64 already on yum, may go to yum by default on SLC4 - this is an internal Linux.Support choice]

NN: could use the cert infrastructure even for SLC4 if non-production?

JI: Agree, either do "real" certification or none at all, certification does not automatically mean deployment.

JP: could start certification "early" on FedoraCore4, as an alpha-test for Red Hat 5, but the current rate of changes is horrible (FC4 is still in beta, will cool down later). Useful for porting tests only (versus announced changes); random runtime breaks could occur and would need to be ignored.
AB: would be mostly interested in commercial applications + Java
JI: won't work now on FC4
JP: ORACLE will support RHE4+5, but significant lag for native libraries, relies on compatibility environment until then.

[AB: assume SLC4: new control room (100 PCs, about 60 of them Linux with multiscreen+NVidia): will the closed-source driver model continue to work?
JP/JI: yes]

back to basic question - SLC4 or not?

JI summary:
  • need "backup" stable SLC4 in any case.
  • rolled-out SLC4 makes migration to SLC5 easier (also for experiment)

JP: what is the latest release date (fully certified, ready for deployment etc.) for SLC5 that would still get it into the LHC startup phase?

AB: 1st November 2006 (general agreement, also from ATLAS, this is a HARD limit for AB-CO and LHCb)
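
Combining this hard limit with JP's figure of 6 months for the last certification gives a latest possible start date. The sketch below assumes that the next certification takes as long as the last one; the helper `months_before` and the constant names are introduced here for illustration only.

```python
from datetime import date


def months_before(d: date, months: int) -> date:
    """Shift `d` back by whole calendar months, keeping the day (safe for day 1)."""
    total = d.year * 12 + (d.month - 1) - months
    return date(total // 12, total % 12 + 1, d.day)


HARD_LIMIT = date(2006, 11, 1)  # AB: last acceptable release date
CERTIFICATION_MONTHS = 6        # JP: duration of the last certification (assumed to repeat)

latest_start = months_before(HARD_LIMIT, CERTIFICATION_MONTHS)
print(latest_start.isoformat())  # 2006-05-01
```

With these numbers a certification has to start by 1 May 2006, consistent with the BB/EO remark that summer 2006 is too late to start one.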

[ AB: other changes that could go into next release: SingleSignOn for Windows, shared home dir on DFS? (would be nice for whole AB, Fermi has done SSO?)
JP: AFS not technically required for desktops/laptops, but depends on software environment. Password sync: see Linux4Controls?
DFS: No 'supported' DFS client on Linux...
JI: cert=time for changes, please let us know, can discuss later.]

JI: explains situation to be discussed at HEPiX (May 09): SL(C)3 now is available world-wide (good for sites + experiments), would like to keep this situation => need sites + experiments to agree on timescale for next release.
Experiments: Please contact non-CERN sites and get requirements/deployment plans. Need to match experiment and site plans.

AB: assumption of 'Laptop=Desktop=control machine' is nice (compile once, run directly), please keep if possible.
JI: still current working assumption: laptop=desktop=PLUS=BATCH=online/controls/servers. but actual roll-out may be staggered.

Conclusion:

  • Split compiler and OS certification. Can in principle mix & match, but system compiler is somewhat preferred for 3rd-party compatibility
  • SLC5 would be nicer to have for LHC startup than SLC4, but carries a risk => SLC4 needs to be fully & formally certified in 3Q2006
  • if RHE5 looks promising (already from FedoraCore4/5 and the RHE5-beta) and is on time, check whether it can be fully certified, with whatever compiler, by October 2006.
  • October 2006: decide which OS version + compiler version to deploy for LHC startup.
  • only "test" SLC4 batch capacity (probably) required until that decision
  • sometime before end of 2005, SLC4 should be a fully-supported OS (available for network installs, updated, helpdesk etc)

Issue2: Version proliferation & LCG collaboration

(intro by JI) 'LHC AF decides on compiler for non-LHC experiments'-issue:

BG: bad, but have to see facts:
COMPASS is only running experiment in 2006
LEP is disbanded, e.g. L3 has no more people
non-LHC in general has no manpower to either port or test. = "no changes" preferred, but not always possible.
concerns only offline → mostly compiler issues, if libraries are available - e.g. CERNLIB widely used, availability is critical..
[JC: confirms, nothing to do for ALEPH on SLC3 since gcc-3.2.3 was already present]

JI:

  • no manpower for tests → no veto even if formal right to do so
  • LHC will have huge overlap in terms of requirements (FORTRAN, libraries etc) → will work most of the time
  • non-LHC will still be "normal" client to Application Area/Compiler provider (just like AB), and can escalate via hierarchy if they feel treated unjustly

Conclusion: compiler/OS split is OK.


Issue2bis: continuous SLC evolution & test releases

(not discussed, shelved for now)

Issue3: post-mortem of SLC3, changes to the process and LXCERT membership?

(some membership changes already happened before the meeting)

Issue: light-weight certification vs formal exercise:

NN: informal tests could be nice?

JI: results are useless unless some outcome is recorded (=formal yes/no, binding),
certification infrastructure is still useful,
deployment & certification can be decoupled

Conclusion: keep current certification infrastructure for now.

Issue: SLC3 post-mortem

JI: (quick rundown, will send details per mail): Remember: SLC3 was 12 months late!
[AB: but actually like the November release time]
Lost time with Red Hat negotiations (~6 months), due to summer (key people absent), and due to the 'blocking' chain August-October: CMS (and LHCb and ATLAS, but they didn't veto) → POOL → cxxabi bug.

(Trivial) lessons:

  • any cross-site negotiations take long
  • summer is bad for certifications (issue for SLC5!)
  • need to break down dependency chains for quick certifications (e.g. certify "production" scripts independently from working jobs?)

AOB:

AB: would like effort from IT/ORACLE for SLC4: one defined ORACLE client production version (compatible with system compiler), need Pro*C. If the client is distributed via AFS, please use replicated volumes.
JI: need exact requirements, will forward. Explained split Physics-DB (tied in via AppArea) / non-Physics (same situation as before).

BB: should we open 32 or 64 discussion?
JI: forecast: all 3 current platforms will need to be supported on OS level, may drop IA64 in 2-3 years; decision on priorities/batch capacity is with AF/IT task force, after performance evaluation.

AB: On the desktop, will we get Opterons at least as a choice from PCshop?

JP: will stay on 32bit for this year, and will have an additional delay for PC purchases anyway: need to redo market survey → nothing moves before FC in November(?)


Actions:

JI: send post-mortem summary

JI: send minutes

all: continue discussion via mail, get external site requirements

JI/JP: report from HEPiX

JP: (slowly) set up SLC4 for initial tests.