Tier-1 Status Report: Andrew Sansum, GridPP12, 1 February 2004
Overview
Hardware configuration/utilisation
That old SRM story
dCache deployment
Network developments
Security stuff
Hardware
CPU
500 dual-processor Intel PIII and Xeon servers, mainly rack-mounted (about 884 KSI2K), plus about 100 older systems.
Tape Service
STK Powderhorn 9310 silo with 8 9940B drives. Maximum capacity 1 PB at present, but capable of 3 PB by 2007.
LCG in September
LCG Load
Babar Tier-A
Last 7 Days
ACTIVE: BaBar, D0, LHCb, SNO, H1, ATLAS, ZEUS
dCache
Motivation:
Needed SRM access to disk (and tape)
We have 60+ disk servers (140+ filesystems) and needed disk pool management
September 2004
Redeployed dCache into the LCG system for the CMS and DTeam VOs.
dCache deployed within the JRA1 testing infrastructure for gLite I/O daemon testing.
Jan 2005: still working with CMS to resolve interoperation issues, partly due to hybrid grid/non-grid use.
Jan 2005: prototype back end to tape written.
dCache Deployment
dCache deployed as a production service (also test instance, JRA1, developer1 and developer2?)
Now available in production for ATLAS, CMS, LHCb and DTeam (17 TB now configured, 4 TB used)
Reliability good, but load is light
Will use dCache (preferably the production instance) as the interface to Service Challenge 2
Work underway to provide a tape backend; a prototype is already operational
This will be the production SRM to tape, at least until after the July service challenge
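Transfers into an SRM-fronted dCache of the kind described above are typically driven with the srmcp client. A minimal sketch of assembling such a command follows; the hostname, port and pnfs path are illustrative placeholders, not the actual RAL configuration, and exact URL forms vary between client versions.

```python
# Sketch: build an srmcp argument list for a local-file-to-dCache SRM copy.
# Endpoint, port and pnfs path are hypothetical, not the RAL Tier-1 setup.

def srm_copy_command(local_file: str, srm_host: str, pnfs_path: str,
                     port: int = 8443) -> list[str]:
    """Return an srmcp command copying a local file into an SRM endpoint."""
    source = f"file:///{local_file.lstrip('/')}"
    dest = f"srm://{srm_host}:{port}/{pnfs_path.lstrip('/')}"
    return ["srmcp", source, dest]

cmd = srm_copy_command("/data/test.root", "dcache.example.ac.uk",
                       "/pnfs/example.ac.uk/data/dteam/test.root")
print(" ".join(cmd))
```

In practice the command would be handed to the shell or a batch wrapper; building it as a list keeps the arguments safe from quoting issues.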
Service Challenges
[Diagram: Service Challenge dCache layout — a production dCache (and possibly a test dCache) with gridftp doors and a head node in front of the disk servers.]
Network
Recent upgrade to the Tier-1 network
Begin to put in place a new generation of network infrastructure
Low-cost solution based on commodity hardware
10 Gigabit ready
Able to meet the needs of:
  forthcoming service challenges
  increasing production data flows
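A back-of-envelope check of what these links can carry helps size the service challenges. The sketch below converts a sustained line rate into terabytes per day; the 80% efficiency figure is an assumption, not a measured value.

```python
# Back-of-envelope: data volume per day through a sustained network link.
# The 0.8 efficiency factor is an illustrative assumption.

def tb_per_day(gbit_per_s: float, efficiency: float = 0.8) -> float:
    """Terabytes per day through a link at the given sustained efficiency."""
    bytes_per_s = gbit_per_s * 1e9 / 8 * efficiency
    return bytes_per_s * 86400 / 1e12

print(f"1 Gbit/s link:  {tb_per_day(1):.1f} TB/day")   # ~8.6 TB/day
print(f"10 Gbit/s link: {tb_per_day(10):.1f} TB/day")  # ~86.4 TB/day
```

So a single 1 Gbit/s path saturates below 10 TB/day, which is why the 10 Gbit upgrade path matters for the service challenges.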
September
[Diagram: site router connected at 1 Gbit to Summit 7i switches serving the disk and CPU farms.]
Now (Production)
[Diagram: site router connected to combined disk+CPU nodes.]
Soon (Lightpath)
[Diagram: site router at 1 Gbit; Summit 7i with N*1 Gbit links to a Nortel 5510 stack (80 Gbit, 1 Gbit/link) serving disk+CPU; 2*1 Gbit dual-attach to UKLight.]
Next (Production)
[Diagram: new RAL site router with a 10 Gb link to a 10 Gigabit switch; N*10 Gbit links to a Nortel 5510 stack (80 Gbit, 1 Gbit/link) serving disk+CPU.]
Also: profiling power distribution and a new access control system. Worrying about power stability (brownout and blackout in the last quarter).
Test system
[Diagram: STK 9310 silo with 8 x 9940 tape drives, 4 drives to each FC switch; basil (AIX) test dataserver; mchenry1 (AIX) logging/test flfsys.]
Production system
[Diagram: buxton (SunOS) running ACSLS; tape devices behind Brocade FC switches ADS_switch_1 and ADS_switch_2; dataservers ermintrude, florence, zebedee and dougal (all AIX) with cache and disk arrays 1-4; brian (AIX) running flfsys, the catalogue and admin commands (create, query). Users reach the service via SRB (Inq, S commands, MySRB); connection types are sysreq udp commands, user SRB commands, STK ACSLS commands and VTP data transfers. All sysreq, vtp and ACSLS connections shown to dougal also apply to the other dataserver machines, but are left out for clarity.]
Catalogue Manipulation
[Chart (4 November 2004): catalogue manipulation times in seconds, y-axis 20-120 s.]
Write Performance
[Chart: total write and read throughput (KB/sec), sampled from 16:00 to 08:41 overnight.]
Conclusions
Have found a number of easily fixable bugs
Have found some less easily fixable architecture issues
Have a much better understanding of the limitations of the architecture
Estimate suggests 60-80 MB/s to tape now; buy more/faster disk and try again
Current drives good for 240 MB/s peak; actual performance likely to be limited by the ratio of drive (read+write) time to (load+unload+seek) time
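The duty-cycle argument in the last bullet can be made concrete: effective drive throughput is the streaming rate scaled by the fraction of each cycle spent actually transferring. The numbers below are illustrative assumptions, not measured RAL figures.

```python
# Sketch of the duty-cycle limit on tape throughput: the streaming rate
# is diluted by time spent on load/unload/seek between transfers.
# All figures here are illustrative assumptions, not RAL measurements.

def effective_rate(stream_mb_s: float, transfer_s: float,
                   overhead_s: float) -> float:
    """MB/s averaged over a transfer plus its mount/seek overhead."""
    return stream_mb_s * transfer_s / (transfer_s + overhead_s)

# e.g. 8 drives streaming at 30 MB/s each (240 MB/s peak), but with
# 60 s of load/unload/seek for every 120 s of transfer per cycle:
per_drive = effective_rate(30, transfer_s=120, overhead_s=60)
print(f"{8 * per_drive:.0f} MB/s aggregate")  # prints "160 MB/s aggregate"
```

Shrinking the overhead term (fewer mounts, longer streams, faster disk feeding the drives) is what moves the aggregate back towards the 240 MB/s peak.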
Security Incident
26 August: X11 scan acquires a userid/password at an upstream site
Hacker logs on to the upstream site and snoops known_hosts
Ssh to a Tier-1 front-end host using an unencrypted private key from the upstream site
Upstream site responds but does not notify RAL
Hacker loads an IRC bot on the Tier-1 and registers at a remote IRC server (for command/control/monitoring)
Tries a root exploit (fails); attempts logins to downstream sites (fails, we think)
7 October: Tier-1 notified of the incident by the IRC service and begins response
2000-3000 sites involved globally
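The hop to the Tier-1 relied on plaintext hostnames snooped from known_hosts; OpenSSH can hash these entries (HashKnownHosts yes), and hashed lines begin with "|1|". A small sketch of flagging unhashed entries follows; the sample file content is invented for illustration.

```python
# Sketch: flag known_hosts entries left in plaintext (i.e. not hashed
# via OpenSSH's HashKnownHosts option; hashed lines start with "|1|").
# The sample content below is invented for illustration.

def plaintext_hosts(known_hosts_text: str) -> list[str]:
    """Return the host fields of entries not protected by hashing."""
    exposed = []
    for line in known_hosts_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                          # skip blanks and comments
        if not line.startswith("|1|"):        # hashed entries begin |1|
            exposed.append(line.split()[0])   # host[,host...] field
    return exposed

sample = """tier1.example.ac.uk ssh-rsa AAAAB3...
|1|abc=|def= ssh-rsa AAAAB3..."""
print(plaintext_hosts(sample))  # prints "['tier1.example.ac.uk']"
```

Hashing would not have stopped the stolen-key login itself, but it denies an intruder the ready-made list of onward targets.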
Objectives
Comply with site security policy (disconnect, etc.)
Will disconnect hosts promptly once an active intrusion is detected
Roles
Incident controller: manage the information flood, deal with external contacts
Log trawler: hunt for contacts in the logs
Hunter/killer(s): search for compromised hosts
Detailed forensics: understand what happened on compromised hosts
End user contacts: get in touch with users, confirm usage patterns, terminate IDs
Chronology
At 09:30 on 7 October, the RAL network group forwarded a complaint from Undernet suggesting unauthorized connections from Tier-1 hosts to Undernet.
At 10:00, initial investigation suggests unauthorised activity on csfmove02; csfmove02 physically disconnected from the network. By now 5 Tier-1 staff plus the E-Science Security Officer were 100% engaged on the incident, with additional support from the CCLRC network group and the site security officer; effort remained at this level for several days. BaBar support staff at RAL were also active tracking down unexplained activity.
At 10:07, request made to the site firewall admin for firewall logs of all contacts with suspected hostile IRC servers.
At 10:37, the firewall admin provides an initial report, confirming unexplained current outbound activity from csfmove02, but no other nodes involved.
At 11:29, BaBar report that the bfactory account password was common to two further IDs (bbdatsrv and babartst).
At 11:31, Steve completes a rootkit check; no hosts found, although with possible false positives on Red Hat 7.2 that we are uneasy about.
By 11:40, preliminary investigations at RAL had concluded that an unauthorized access had taken place on host csfmove02 (a data mover node), which in turn was connected outbound to an IRC service. At this point we notified the security mailing lists (hepix-security, lcg-security, hepsysman).
Security Events
[Chart: security events per 6-hour interval over several days, peaking around 30-35.]
Security Summary
The intrusion took 1-2 staff-months of CCLRC effort to investigate: 6 staff full-time for 3 days (5 Tier-1 plus the E-Science Security Officer), working long hours. Also involved:
Networking group
CCLRC site security
BaBar support
Other sites
Prompt notification of the incident by the upstream site would have substantially reduced the size and complexity of the investigation.
The good standard of patching on the Tier-1 minimised the internal spread of the incident (but we were lucky).
Can no longer trust who logged-on users are: many userids (globally) probably compromised.
Conclusions
A period of consolidation
User demand continues to fluctuate, but an increasing number of experiments are able to use LCG
Good progress on SRM to disk (dCache); making progress with SRM to tape
Having an SRM isn't enough: it has to meet the needs of the experiments
Expect focus to shift (somewhat) towards the service challenges