Minutes from the LXCERT meeting 06.07.04
Minutes updated on 09.07.2004 on request from A.Aimar
Alberto Aimar, LCG Application Area
Bruce M Barnett, ATLAS-online
Eric Cano, CMS-online
Marco Cattaneo, LHCb
Benigno Gobbo, non-LHC experiments
Jan Iven, IT-services catchall (chair)
Jarek Polok, general Desktops (secretary)
German Cancio, IT PLUS/BATCH service (for Thorsten Kleinwort)
Stephan Wynhoff, CMS-offline
Emil Obreshkov, ATLAS-offline
David Quarrie, ATLAS
Helge Meinhard, CLUG
Fons Rademakers, ALICE
Alastair Bland, AB/CO (no blocking issues)
Nicolas De Metz-Noblat, AB/CO (no blocking issues)
Round-the-table open issues with SLC3, timeline for the SLC3 certification
Desktop issues, Jarek Polok
- Firewall configuration with AFS
- Screensavers not refreshing AFS token
- su destroying AFS token
- lpr not using default printer
current estimate: < 2 weeks (including testing) until fully ready for deployment.
SW: KDE version is too old. Main consequence for CMS: KDevelop is too old (very hard to backport, tied to too many libraries); minor: the KDE mailer is dying from time to time. CMS will probably install KDE-3.2, but will first try to get a newer KDevelop with SLC3 (which linux.support would then take into the main distribution; a full new KDE would need to be shipped separately). JP suggests trying Eclipse as an IDE; CMS has not tried that yet.
EC reports a HEPiX bug: after installing a JDK with an environment file in /etc/profile.d/, zzz_HEPIX.sh blows away the PATH. Resolved by removing HEPiX; CMS-online farms can live without it. (Later: this turned out to be "expected legacy behavior", not a bug; discussion is underway to change this.)
PLUS/BATCH, German Cancio
- showstopper: the version of RPM is broken when used by SPMA; the proposal to go back to the CEL3 version (4.2.1) is accepted (if the problem disappears)
- ORACLE environment setup is pushed into user environment.
JI explains current situation (which has been clarified further directly from IT-DB): no client RPM will be provided, the default user environment will not be aware of ORACLE. This has implications on add-on tools.
- Please review the list of open issues from TK (no reply received yet?), none of which are formally blocking but which could delay uptake.
LXCEL3 didn't get much feedback from the users. Waiting for RPM issue to be solved before reinstalling with SLC3. Would prefer 4 weeks of "test mode" running before certifying, could reduce this if differences (CEL3→SLC3) are found to be small.
Switching over the PLUS alias: experiments drive this, they should give required SPECint numbers per platform for the batch capacity migration.
ATLAS-offline, David Quarrie
David introduced Emil Obreshkov (the new ATLAS librarian) as new ATLAS-offline representative at the LXCERT meeting.
ATLAS-offline is still testing the builds on CEL3, hunting missing external deps (e.g. GENSER, special version?). Compilation of own code is actually rather OK.
ATLAS asks for a CEL3 → SLC3 migration strategy from SPI. (AA: assume binary compatibility between CEL3 and SLC3 for now, not recompiling all packages. Compatibility seems to be good; differences in library size are most likely due to the Red Hat quarterly update that was integrated into SLC3 but not into CEL3.)
ATLAS expects to have a working release on CEL3/SLC3 at the end of July. They are currently building on LXCEL3 and will switch compilation environment when IT-FIO reinstalls these machines.
Deployment: ATLAS have a data challenge and testbeam right now. They would resist any change to the "default aliases" before end of September / early October.
CMS-online, Eric Cano
CMS-online is in the process of building a cluster (CMSDAQ preseries), moved from CEL3 → SLC3. No showstoppers, currently deploying (HEPiX glitch, see above).
Another glitch: the "ant" package as shipped by Red Hat uses gcj instead of javac (no Swing → not usable). classic-ant is available with the expected behavior. They propose to use a "normal" ant. (AA: SPI tools have an old version of "ant", but nobody is using it.) JI: build tools are a critical area; this would need to be coordinated with Fermi to make sure all of SL could use it, and compatibility with RHE3 still needs to be considered.
CMS-offline, Stephan Wynhoff
good news: almost everything works like on 7.3 (slightly different package versions, not all packages officially available from SPI yet), even on non-AFS machines.
- HEPiX (some parts of CMS use it). Is this still supported? (JI: yes.) The move from /usr/local will hurt users' scripts, don't do it too often. (JI: aware, but there is little we can do for user environments. Group environments should be OK.)
CMS would like to test different compilers (like gcc-3.4.1), before "officially" requesting them for inclusion/support by SPI. These should be made available by IT, but IT has no compiler expert assigned to this anymore. Eventually, other "stable" production compilers will be required as well, but here the SPI infrastructure could be used.
Request: IT should clarify what happened to compiler support.
(for the production build environment, ATLAS+CMS ship their own compiler)
Timeline: CMS has no time/manpower for full tests in summer. Production chain and analysis have not been tested yet. Estimate: (end of) September. Could shorten if some batch capacity is available early.
SPI, Alberto Aimar
Some tools/packages have changed versions since the last compilation round and need recompiling/catching up, but SPI will not redo everything. SLC3 will be a standard supported platform; new releases should be built there by default (otherwise bug reports, please). SPI is not "in the certification loop" anymore.
SPI is also validating their own services against SLC3, no showstoppers expected.
ATLAS-online, Bruce Barnett
summary: no showstoppers. Issues:
- no "default" rollout while testbeam is running (September)
- would like ISO images, to allow external institutes to test as well
- gcc support (eventually will want new stable version, see above)
- worried about CVS issue reported by CMS: turned out to be a non-issue
LHCb, Marco Cattaneo
Status: only little things are missing, and these are being addressed. Software has been built on CEL3 (build environment is OK, runs). No full/official/complete release of offline software yet; will have it in the next release (SLC3 will be a supported platform at the end of July). Currently using compat 7.3 mode for production, works OK.
Move into production: suggest to do it like the 6 → 7 move (provide test farm, switch default (with ample spare capacity on the new system!), incrementally release more nodes).
non-LHC, Benigno Gobbo
no "veto" from non-LHC experiments, but no reaction either (except from NA49, NA60). DELPHI was interested in 64bit tests, but didn't have time for real tests. HARP, NA48, OPAL: no reaction. COMPASS: working on RHE3 → CEL3 → SLC3 tests; tested mixed libraries (RHE3 → SLC3), binary compatibility is OK. Tested ORACLE 10 lib against ORACLE 9 servers. Only missing "heavy production" tests (but cannot do this without batch capacity).
CLUG, Helge Meinhard
no feedback whatsoever from CLUG. Propose to officially abolish CLUG, inactive for years.
ALICE, Fons Rademakers
(report received by email after the meeting)
The ALICE situation with SLC3 is that everything works fine. No issues. However, we would like to see the Intel compilers installed in /opt. Our software has been validated against these compilers and we see a performance increase of 25-30%. We also ask for the AMD64 support to be maintained. Strangely enough, AMD64s are not (yet) widespread at CERN, while they provide the largest bang for the buck at the moment (my AMD64 2.4 GHz gives 1700 rootmarks using gcc -O in full 64bit mode, while my P4 3.2 GHz gives only 1100 rootmarks).
- Linux.support will declare SLC3 fit for Desktops as soon as the new LXPLUS beta candidate has been available for a few days. Announce this widely (LXPLUS==desktop paradigm broken). Provide updating scripts afterwards (to cope with eventual changes), no support for people who turn off updates. Linux Support will from then on fully support SLC3, and SLC3 will be installed by default on desktops.
- IT-FIO to coordinate "default" alias switch with experiments based on the feedback from the beta cluster and the ongoing production activities, this becomes the de-facto "certification"
- IT-FIO to migrate capacity by demand afterwards, again in coordination with the experiments.
SW asks whether anybody has ever tested Opterons in 64bit mode (this is not a blocking issue for CMS, but some sites will be using Opterons). JP: we keep i386+amd64 in sync, but test this very little. CERN is not running this in production, but is open to contributions, and is actually inviting collaboration via the "Scientific Linux" framework. Some effort from SPI would be required for the base libraries. There was some discussion on whether CERN should intensify this effort, given that CERN itself does not benefit directly.
BB asked about license costs for the next release, as this could affect remote sites' planning. Cost is affordable when compared to HW; advantages and TCO will be discussed at one of the next HEPiX meetings to prepare for the next (2005) release decision. MC reminded the meeting about obsolete PCs lying around being re-used for "free", but TCO is difficult to calculate (more machines per SPECint). BB commented that the next release may have more focus on longevity, if experiments are no longer "virtual collaborations" but closer to "production mode".
No date for next coordination meeting has been set, the hope is to be able to conclude the certification without another face-to-face meeting. In the case of new showstoppers, a VRVS meeting may be invoked.