Before you start, you ought to grab a copy of the TCP state transition diagram as specified in RFC 793 on page 23. Its drawback is that it lacks the corrections supplied by later RFCs. There is an easier way to obtain blow-up printouts to staple to your office walls: grab the PostScript pocket guide that accompanies Stevens' TCP/IP Illustrated Volume 1 and print page 2. Or simply open the book at figure 18.12.
/etc/system
Appendices are separate documents. They are referenced from within the text, but you might want to fetch them as well when downloading the current document.
1. History and introduction
This page and the related work took a long time to gather. I started
out peeking wide-eyed over the shoulders of two people from a search engine provider while they were
installing the German server of a
customer of my former employer. My
only other source of tuning information was the brilliant book TCP/IP Illustrated 1 by Stevens. I started gathering
all the information about tuning I was able to get my hands on. What you
find on these pages is the accumulation of that material.
Solaris allows you to tune, tweak, set and reset various parameters related to the TCP/IP stack while the system is running. Back in the SunOS 4.x days, one had to change various C files in the kernel source tree, generate a new kernel, reboot the machine and try out the changes. The Solaris feature of changing the important parameters on the fly is very convenient.
Many of the parameters I mention in the rest of this document are time intervals. All intervals are measured in milliseconds. Other parameters are usually byte counts, but in a few places different units of measurement are used and documented. A few items appear totally unrelated to TCP/IP, but for lack of a better framework they ended up on this page.
Most tunings can be achieved using the program ndd. Any user
may execute this program to read the current settings, depending on the
readability of the respective device files. But only the super user is
allowed to execute ndd -set to change values. This makes sense
considering the sensitive parameters you are tuning. Details on the use of
ndd can be obtained from the respective manual page.
ndd /dev/tcp \? # show all parameter keys
ndd /dev/tcp tcp_mss_def # show the value to this key
ndd -set /dev/ip ip_forwarding 0 # switch off forwarding
All keys starting with ip_ have to be used with the
pseudo device /dev/ip. The same holds analogously for the keys
starting with tcp_, and so on. Andres Kroonmaa kindly supplied a
nifty script to check all existing values for a
network component (tcp, udp, ip, icmp, etc.). Usually I do the same thing using Perl.
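A minimal sketch of such a dump in shell follows. It is not Andres Kroonmaa's script. The function names ndd_keys and ndd_dump are made up for illustration, a POSIX shell (ksh rather than the old /bin/sh) is assumed, and the parsing assumes that ndd prints one key per line followed by its access mode in parentheses.

```shell
# ndd_keys: read the output of "ndd <device> \?" on stdin and print the
# bare key names, skipping empty lines and the "?" help entry itself.
ndd_keys() {
    awk 'NF > 0 && $1 != "?" { print $1 }'
}

# ndd_dump: print "key = value" for every key of one component.
# $1 is the component name (tcp, udp, ip, icmp, ...).
ndd_dump() {
    dev="/dev/$1"
    ndd "$dev" '?' | ndd_keys | while read -r key; do
        # Keys that cannot be read simply show an empty value.
        printf '%s = %s\n' "$key" "$(ndd "$dev" "$key" 2>/dev/null)"
    done
}
```

Calling ndd_dump tcp would then list the keys of /dev/tcp. Some keys print multi-line reports instead of a single number, so expect the output to need a little manual cleanup.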
2. TCP connection initiation
This section is dedicated exclusively to the various queues and tunable
variable(s) used during connection instantiation. The socket API maintains
some control over the queues. But in order to tune anything, you have to
understand how listen and accept interact with
the queues. For details, see the various Stevens books mentioned in the
literature section.
When the server calls listen, the kernel moves the socket from
the TCP state CLOSED into the state LISTEN, thus
doing a passive open. All TCP servers work like this. Also, the kernel
creates and initializes various data structures, among them the socket buffers and two queues:
The incomplete connection queue holds an entry for each SYN that has
arrived. BSD sources assign so_q0len entries to this
queue. The server sends off the ACK of the client's
SYN and the server side SYN. The connection
gets queued and the kernel now awaits the completion of the TCP three way
handshake to open a connection. The socket is in the SYN_RCVD
state. On reception of the client's ACK to the server's
SYN, after the connection has spent about one round trip time (RTT)
in this queue, the kernel moves the entry into the completed connection queue.
The completed connection queue holds an entry for each connection whose three way handshake has finished. The socket is in the
ESTABLISHED
state. Each call to accept() removes the front entry of the
queue. If there are no entries in the queue, the call to
accept usually blocks. BSD sources assign a length of
so_qlen to this queue.
With the call to listen(), the server is allowed to specify the size of the
second queue for completed connections. If the server is for whatever
reason unable to remove entries from the completed connection
queue, the kernel is not supposed to queue any more connections. A
timeout is associated with each received and queued SYN
segment. If the server never receives an acknowledgement for a queued
SYN segment (TCP state SYN_RCVD), the timer
expires and the connection is thrown away. The timeout is an important
defence against SYN flood attacks.
![[connection queues]](blog-1.gif)
Figure 1: Queues maintained for listening sockets.
![[connection initiation]](blog-2.gif)
Figure 2: TCP three way handshake, connection initiation.
Stevens shows that for busy servers the incomplete connection queue needs
more entries than the completed
connection queue. The only reason for specifying a large backlog value is
to enable the incomplete connection queue to grow as SYNs
arrive from clients. Stevens shows that a moderately busy webserver has an
empty completed connection queue 99 % of the time, but the
incomplete connection queue needed 15 or fewer entries 98 % of
the time! Just try to imagine what this would mean for a really busy
webcache like Squid.
Data for an established connection which arrives before the connection is
accept()ed should be stored in the socket buffer. If the
queues are full when a SYN arrives, it is dropped in the hope
that the client will resend it and find room in the queues
then.
According to Cockcroft, there was only one listen queue for unpatched Solari <= 2.5.1. Solari >= 2.6, or earlier releases with TCP patch 103582-12 or above applied, split the single queue into the two shown in figure 1. With Solaris, the system administrator is allowed to tweak and tune the various maxima of the queue or queues. Depending on whether there are one or two queues, there are different sets of tweakable parameters.
The old semantics contained just one tunable parameter,
tcp_conn_req_max, which specified the maximum argument for
listen(). The patched versions and Solaris 2.6 replaced
this parameter with the two new parameters
tcp_conn_req_max_q0 and
tcp_conn_req_max_q. A
SunWorld
article on 2.6 by Adrian Cockcroft says the following about the new
parameters:
"tcp_conn_req_max [is] replaced. This value is well-known as it normally needs to be increased for Web servers in older releases of Solaris 2. It no longer exists in Solaris 2.6, and patch 103582-12 adds this feature to Solaris 2.5.1. The change is part of a fix that prevents denial of service from SYN flood attacks. There are now two separate queues of partially complete connections instead of one.
tcp_conn_req_max_q0 is the maximum number of connections with handshake incomplete. A SYN flood attack could only affect this queue, and a special algorithm makes sure that valid connections can still get through.
tcp_conn_req_max_q is the maximum number of completed connections waiting to return from an accept call as soon as the right process gets some CPU time."
In other words, the first specifies the size of the incomplete connection queue while the second parameter sets the maximum length of the completed connection queue. All three parameters are covered below.
You can determine if you need to tweak this set of parameters by watching
the output of netstat -sP tcp. Look for the value of
tcpListenDrop, if available on your version of Solaris. Older
versions don't have this counter. Any value showing up might indicate
something wrong with your server, but then, killing a busy server (like
squid) shuts down its listening socket, and might increase this counter
(and others). If you get many drops, you might need to increase the
appropriate parameter. Since connections can also be dropped because
the argument passed to listen() is too small, you have to be
careful interpreting the counter value. On old versions, a SYN flood attack
might also increase this counter.
Newer or patched versions of Solaris, with both queues available, also
have the additional counters tcpListenDropQ0 and
tcpHalfOpenDrop. There the original counter
tcpListenDrop counts only connections dropped from the
completed connection queue, and the counter ending in
Q0 counts the drops from the incomplete connection
queue. Killing a busy server application might increase either or both
counters. If tcpHalfOpenDrop is non-zero, your server
was likely the victim of a SYN flood. The counter is only incremented
for dropping noxious connection attempts. I do not know whether those
also show up in the Q0 counter.
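To watch these counters without scanning the whole netstat report by eye, a small filter helps. The following is only a sketch: the function name tcp_counter is made up, a POSIX shell and awk (nawk on Solaris) are assumed, and the parsing assumes the usual "name = value" pairs with possibly several pairs per line.

```shell
# tcp_counter: read "netstat -sP tcp" output on stdin and print the value
# of the counter named in $1. Prints nothing if the counter is missing,
# as is the case for tcpListenDrop on older versions of Solaris.
tcp_counter() {
    # Put blanks around every "=" so that "name =12345" parses as well.
    sed 's/=/ = /g' | awk -v name="$1" '
        { for (i = 1; i + 2 <= NF; i++)
              if ($i == name && $(i + 1) == "=") print $(i + 2) }'
}
```

A typical call would be netstat -sP tcp | tcp_counter tcpListenDrop, repeated for tcpListenDropQ0 and tcpHalfOpenDrop where they exist. Comparing two readings taken some minutes apart says more than a single absolute value.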
The parameter tcp_conn_req_max describes the maximum number of pending connection requests queued for a listening endpoint in the completed connection queue. The queue can hold only the specified finite number of requests. If a queue overflows, nothing is sent back. The client will time out and (hopefully) retransmit.
The size of the completed connection queue does not influence the
maximum number of simultaneous established connections after they were
accepted nor
does it have any influence on the maximum number of clients a server can
serve. With Solaris, the maximum number of file descriptors is the limiting
factor for simultaneous connections, which just happened to coincide with
the maximum backlog queue size.
From the viewpoint of TCP those connections placed in the completed
connection queue are in the TCP state ESTABLISHED, even
though the application has not reaped the connection with a call to
accept. That is the number limited by the size of the queue,
which you tune with this parameter. If the application, for some reason,
does not release entries from the queue by calling accept, the
queue might overflow, and the connection is dropped. The client's TCP will
hopefully retransmit, and might find a place in the queue.
Solaris offers the possibility to place connections into the backlog queue
as soon as the first SYN arrives, called eager
listening. The three way handshake will be completed as soon as the
application accept()s the connection. The use of eager
listening is not recommended for production systems.
Solari < 2.5 have a maximum queue length of 32 pending connections. The length of the completed connection queue can also be used to decrease the load on an overloaded server: If the queue is completely filled, remote clients will be denied further connections. Sometimes this will lead to a connection timed out error message.
Naively, I assumed that a very large queue length might lead to a long service time on a loaded server. Stevens showed that the incomplete connection queue needs much more attention than the completed connection queue. But with tcp_conn_req_max you have no option to tweak that particular length.
When tuning tcp_conn_req_max, always do it with regard to the values of rlim_fd_max and rlim_fd_cur. This is just a rule of thumb. Setting your listen backlog queue larger than the number of file descriptors available to you won't do you any good if your service time is long. A server shouldn't accept any further connections if it has run out of descriptors. Even though new connections won't be thrown away with a long backlog, a server might want to reduce the size to as many connections as can be serviced simultaneously. Again, you have to consider your average service time, too.
There is a trick to overcome the hardcoded limit of 1024 with a patch. SunSolve shows this trick in connection with SYN flood attacks. A greatly increased listen backlog queue may offer a little extra protection against this vulnerability. On this topic also look at the tcp_ip_abort_cinterval parameter. Better still, use the mentioned TCP patches and increase the q0 length.
echo "tcp_param_arr+14/W 0t10240" | adb -kw /dev/ksyms /dev/mem
This patch is only effective on the currently running kernel, so it lasts only until the next boot. Usually you want to append the line above to the startup script /etc/init.d/inetinit. The patch shown raises the hard limit of the listen backlog queue to 10240. Only after applying this patch may you use values above 1024 for the tcp_conn_req_max parameter.
A further warning: Changes to the value of the
tcp_conn_req_max parameter in a running system
will not take effect until each listening application is
restarted. The backlog queue length is evaluated whenever an application
calls listen(3N), usually once during startup. Sending a HUP
signal may or may not work; personally I prefer to TERM the application and
restart it manually or, even better, use a startup script.
After installing the mentioned TCP patches, alternatively after installing Solaris 2.6, the parameter tcp_conn_req_max is no longer available. In its stead the new parameters tcp_conn_req_max_q and tcp_conn_req_max_q0 emerged. tcp_conn_req_max_q0 is the maximum number of connections with handshake incomplete, basically the length of the incomplete connection queue.
In other words, the connections in this queue are just being
instantiated. A SYN was just received from the client, thus
the connection is in the TCP SYN_RCVD state. The connection
cannot be accept()ed until the handshake is complete, even if
the eager listening is active.
To protect against SYN flooding, you can increase this parameter. Also refer to the parameter tcp_conn_req_max_q above. I believe that changes won't take effect unless the applications are restarted.
After installing the mentioned TCP patches, alternatively after installing Solaris 2.6, the parameter tcp_conn_req_max is no longer available. In its stead the new parameters tcp_conn_req_max_q and tcp_conn_req_max_q0 emerged. tcp_conn_req_max_q is the length of the completed connection queue.
In other words, connections in this queue of length
tcp_conn_req_max_q have completed the three way handshake
of a TCP open. The connection is in the state ESTABLISHED.
Connections in this queue have not been accept()ed by the
server process (yet).
Also refer to the parameter tcp_conn_req_max_q0. Remember that changes won't take effect unless the applications are restarted.
This parameter specifies the minimum number of available connections
in the completed connection queue for select()
or poll() to return "readable" for a listening (server)
socket descriptor.
Programmers should note that Stevens describes a
timing problem, if the connection is RST between the
select() or poll() call and the subsequent
accept() call. If the listening socket is blocking, the
default for sockets, it will block in accept() until a
valid connection is received. While this seems no tragedy with a
webserver or cache receiving several connection requests per second,
the application is not free to do other things in the meantime, which
might constitute a problem.
The recommended upper and lower bounds on the RTO are known to be inadequate on large internets. The lower bound SHOULD be measured in fractions of a second (to accommodate high speed LANs) and the upper bound should be 2*MSL, i.e., 240 seconds.
Besides the retransmit timeout (RTO) value, two further parameters R1 and R2 may be of interest. These don't seem to be tunable via any interface Solaris offers that I know of.
The value of R1 SHOULD correspond to at least 3 retransmissions, at the current RTO. The value of R2 SHOULD correspond to at least 100 seconds.[...]
However, the values of R1 and R2 may be different for SYN and data segments. In particular, R2 for a SYN segment MUST be set large enough to provide retransmission of the segment for at least 3 minutes. The application can close the connection (i.e., give up on the open attempt) sooner, of course.
A great many internet servers running Solaris retransmit segments unnecessarily often. The current condition of European networks indicates that a connection to the US may take up to 2 seconds. All parameters mentioned in the first part of this section relate to each other!
As a starter take this little example. Consider a picture of 1440 bytes,
LZW compressed, which is to be transferred over a serial link at 14400
bps using an MTU of 1500. In the ideal case only one PDU gets
transmitted. The ACK segment can only be sent after
the complete PDU is received. The transmission takes about 1 second. These
values seem low, but they are meant as 'food for thought'. Now consider
something going awry...
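The "about 1 second" can be checked with a line of shell arithmetic. This is a back-of-the-envelope sketch under my own assumptions, which the example above does not spell out: 40 bytes of TCP/IP header, 10 bits on the wire per byte (8 data bits plus start and stop bit), no modem compression. The function name tx_time_ms is made up, and a POSIX shell is assumed.

```shell
# tx_time_ms: milliseconds needed to clock one PDU onto a serial line.
# $1 = payload in bytes, $2 = line speed in bit/s.
# Assumptions: 40 bytes of TCP/IP header, 10 bits on the wire per byte.
tx_time_ms() {
    echo $(( ($1 + 40) * 10 * 1000 / $2 ))
}

tx_time_ms 1440 14400    # prints 1027
```

Under these assumptions the data PDU alone occupies the line for roughly a second before the receiver can even begin to send its ACK, which is the time scale any retransmission interval has to be compared against.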
Solaris 2.5.1 behaves strangely if the initial SYN segment
from the host doing the active open is lost. The initial SYN
gets retransmitted only after a period of 4 *
tcp_rexmit_interval_initial plus a constant C. The time is 12
seconds with the default settings. More information is being prepared on
the retransmission test page.
The initial lost SYN may or may not be of importance in your
environment. For instance, if you are connected via ATM SVCs, the initial
PDU might initiate a logical connection (ATM works point to point) in less
than 0.3 seconds, but will still be lost in the process. It is rather
annoying for a user of 2.5.1 to wait 12 seconds until something happens.
The interval tcp_rexmit_interval_initial is waited before the last data sent is retransmitted due to a missing acknowledgement. Mind that this interval is used only for the first retransmission. The more international your server is, the larger you should choose this interval.
Laboratory setups confined to a LAN might be better off with 500 ms or even less. If you are doing measurements involving TCP (which is almost always a bad idea), you should consider lowering this parameter.
After the initial retransmission further retransmissions will start after the tcp_rexmit_interval_min interval. BSD usually specifies 1500 milliseconds. This interval should be tuned relative to the value of tcp_rexmit_interval_initial, e.g. some value between 50 % and 200 % of it. The parameter has no effect on retransmissions during an active open, see my accompanying document on retransmissions.
The tcp_rexmit_interval_min doesn't display any influence on connection establishment with Solaris 2.5.1. It does with 2.6, though. The influence on regular data retransmissions, or FIN retransmissions I have yet to research.
This interval specifies how long retransmissions for a connection
in the ESTABLISHED state should be tried before a
RESET segment is sent. BSD systems default to 9 minutes.
This interval specifies how long retransmissions for a remote host are
repeated until the RESET segment is sent. The difference to
the tcp_ip_abort_interval parameter is that this
connection is about to be established - it has not yet reached the state
ESTABLISHED. This value is interesting considering SYN flood
attacks on your server. Proxy servers are doubly handicapped because of
their Janus behaviour (like a server towards the downstream cache, like a
client towards the upstream server).
According to Stevens this interval is connected to the active
open, e.g. the connect(3N) call. But according to
SunSolve the interval has an impact in both
directions. A remote client can refuse to acknowledge an opening connection
up to this interval. After the interval a RESET is sent. The
other way around works out, too. If the three-way handshake to open a
connection is not finished within this interval, the RESET
segment will be sent. This can only happen if the final ACK went astray,
which is a difficult test case to simulate.
To improve your SYN flood resistance, SUN suggests using an interval as
small as 10000 milliseconds. This value has only been tested for the "fast"
networks of SUN. The more international your connection is, the slower it
will be, and the more time you should grant in this interval. Proxy servers
should never lower this value (and should let Squid terminate the
connection). Webservers are usually not affected, as they seldom actively
open connections beyond the LAN.
All the retransmission-related intervals mentioned above use an exponential backoff algorithm. The wait interval between two consecutive retransmissions of the same PDU is doubled each time, starting with the minimum.
The tcp_rexmit_interval_max interval specifies the maximum wait interval between two retransmissions. If changing this value, you should also give the abort interval an inspection. The maximum wait interval should only be reached shortly before the abort interval timer expires. Additionally, you should coordinate your interval with the value of tcp_close_wait_interval.
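To see how the first wait, the maximum wait and the abort interval interact, it helps to print the sequence of waits. The sketch below is a deliberately simplified model of exponential backoff, not the algorithm of the Solaris kernel. The function name backoff_schedule is made up, a POSIX shell is assumed, and the numbers in the example are illustrative values, not Solaris defaults.

```shell
# backoff_schedule: print the wait before each retransmission in ms.
# $1 = first wait, $2 = maximum wait, $3 = abort interval (all in ms).
# Simplified model: double the wait each time, cap it at the maximum,
# and stop once the accumulated waiting time reaches the abort interval.
backoff_schedule() {
    wait_ms=$1 max_ms=$2 abort_ms=$3 elapsed=0
    while [ "$elapsed" -lt "$abort_ms" ]; do
        echo "$wait_ms"
        elapsed=$(( elapsed + wait_ms ))
        wait_ms=$(( wait_ms * 2 ))
        if [ "$wait_ms" -gt "$max_ms" ]; then
            wait_ms=$max_ms
        fi
    done
}

backoff_schedule 3000 60000 30000    # prints 3000 6000 12000 24000, one per line
```

In this run the cap of 60000 is never reached before the abort interval ends the attempt. With a cap far below the abort interval, most of the waits would sit at the maximum instead.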
This parameter specifies the timeout before sending a delayed
ACK. The value should not be increased above
500, as required by RFC 1122.
This value is of great interest for interactive services. A small number
will increase the "responsiveness" of a remote service (telnet, X11), while
a larger value can decrease the number of segments exchanged.
The parameter might also be of interest for HTTP servers which transmit small
amounts of data after a very short retrieval time. With heavy-duty
servers or in a laboratory load-testing environment, you might encounter service
times answering a request which are well above 50 ms. An increase to 500
might lead to fewer PDUs transferred over the network, because TCP is able
to merge the ACK with data. Increases beyond 500 should not even be
considered.
Please note that Solaris recognizes the initial data phase of a
connection. An initial ACK (not SYN) is not delayed. Therefore a
request for a webservice (either server or proxy) which does not fit into a
single PDU can be transmitted faster. Web benchmarks will show this as
improved performance. Also check the tcp_slow_start_initial parameter.
This parameter has something to do with the number of delayed
acknowledgements or the number of bytes to be collected. My guess is that
this parameter specifies the number of outstanding ACKs in
interactive transfer mode. In this case tiny amounts of data are flowing in
both directions. In contrast to my prior statement, you need not
give this parameter a look when tuning bulk transfers, because its impact
is on interactive transfers.
The next part looks at a few parameters having to do with retransmissions, as well.
This parameter lets Solaris reproduce the slow-start bug discovered in BSD and Windows
TCP/IP implementations. More information on the topic can be
found on the servers of SUN and in [Stevens III].
To summarize the effect, a server starts sending two PDUs at once without
waiting for an ACK due to wrong ACK counts: the
ACK from connection initiation is counted as a data
ACK - compare with figure 2.
Network congestion avoidance algorithms are being undermined. The slow
start algorithm does not allow the buggy behaviour, compare with RFC
2001.
Setting the parameter to 2 allows a Solaris machine to behave as if it had the slow start bug, too. Well, the IETF is said to be amending the slow start algorithm, and the bug is now actively turned into a feature. SUN also warns:
It's still conceivable, although rare, that on a configuration that supports many clients on very slow-links, the change might induce more network congestions. Therefore the change of tcp_slow_start_initial should be made with caution.
[...] Future Solaris releases are likely to default to 2.
You can also gain performance if many of your clients are running old BSD or derived TCP/IP stacks (like MS). I expect new BSD OS releases not to exhibit this bug, but then I am not familiar with the BSD OS family. A reader of this page told me about cutting the latency of his server in half, just by using the value of 2.
If you want to know more about this feature and its behaviour, you can have a look at some experiments I have conducted concerning that particular feature. The summary is that I agree with the reader: A BSDish client like Windows definitely profits from using a value of 2.
Something to do with the number of duplicate ACKs. If we use the
fast retransmit and fast recovery algorithms, this many duplicate ACKs
must be received before we assume that a segment has really been lost.
A simple reordering of segments usually causes no more than two duplicate
ACKs.
If the ICMP error message fragmentation needed is received, a router on the way to the destination needed to fragment the PDU, but was not allowed to do so. Therefore the router discarded the PDU and sent back the ICMP error. Newer router implementations enclose the needed MSS in the error message. If the needed MSS is not included, the correct MSS must be determined by trial and error.
Due to the internet being a packet switching network, the route a PDU travels along a TCP virtual circuit may change with time. For this reason RFC 1191 recommends rediscovering the path MTU of an active connection after 10 minutes. Improvements of the route can only be noticed by repeated rediscoveries. Unfortunately, Solaris aggressively tries to rediscover the path MTU every 30 seconds. While this is o.k. for LAN environments, it is grossly impolite behaviour in WANs. Since routes may not change that often, aggressive repetition of path MTU discovery leads to unnecessary consumption of channel capacity and elongated service times.
Path MTU discovery is a far reaching and controversial topic when discussing it with local ISPs. But think, the discovery is at the foundation of IPv6. The PSC tuning page argues pro path MTU discovery, especially if you maintain a high-speed or long-delay (e.g. satellite) link.
The recommendation I can give you is not to use the defaults of Solaris < 2.5. Please use path MTU discovery, but tune your system to conform to the RFC. You may alternatively want to switch off path MTU discovery altogether, though there are few situations where this is necessary.
I was made aware of the fact that in certain circumstances bridges connecting data link layers of differing MTU sizes defeat pMTU discovery. I have to put some more investigation into this matter. If a frame with maximum MTU size is to be transported into the network with the smaller MTU size, it is truncated silently. A bridge does not know anything about the upper protocol levels: A bridge neither fragments IP nor sends an ICMP error.
There may be work-arounds, and the tcp_mss_def is one of them. Setting all interfaces to the minimum shared MTU might help, at the cost of losing performance on the larger MTU network. Using what RFC 1122 calls an IP gateway is a possible, yet expensive solution.
This timer determines the interval at which Solaris rediscovers the path MTU. With an extremely large value the path MTU is evaluated only once, at connection establishment.
This parameter switches path MTU discovery on or off. If you enter a 0 here, Solaris will never try to set the DF bit in the IP header - unless your application explicitly requests it.
This is a debug switch! When activated, this switch will have the IP or TCP layer ignore all ICMP error messages fragmentation needed. By this, you will achieve the opposite of what you intended.
This parameter determines the default MSS (maximum segment size) for non-local destinations. For path MTU discovery to work effectively, this value can be set to the MTU of the most-used outgoing interface decreased by the 20 byte IP header and the 20 byte TCP header - if and only if the result is bigger than 536.
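The rule for tcp_mss_def is simple enough to put into a helper. This is a sketch only: the function name mss_for_mtu is made up, a POSIX shell is assumed, and falling back to 536 when the result is not larger is my reading of the "if and only if" above.

```shell
# mss_for_mtu: candidate value for tcp_mss_def, given an interface MTU in $1.
mss_for_mtu() {
    mss=$(( $1 - 20 - 20 ))      # minus IP header, minus TCP header
    if [ "$mss" -gt 536 ]; then
        echo "$mss"
    else
        echo 536                 # result not bigger than 536: stay at 536
    fi
}

mss_for_mtu 1500    # prints 1460
```

The result would then be handed to ndd, for example ndd -set /dev/tcp tcp_mss_def 1460 for an ethernet interface with an MTU of 1500.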
Additionally, I strongly suggest the use of a file /etc/init.d/your-tune (always called
first script) which changes the tunable
parameters. /etc/rcS.d/S31your-tune is a hardlink to this
file. The script will be executed during bootup when the system is in
single user mode. A killscript is not necessary. The section about
startup scripts below reiterates this topic in
greater depth.
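A bare-bones version of such a script might look like the sketch below. The keys are ones mentioned on this page, and the values are merely the examples used in the text (forwarding off, slow start 2, the 30000 a provider of my acquaintance uses), not recommendations. The NDD variable is my own addition to allow a dry run.

```shell
#!/bin/sh
# /etc/init.d/your-tune - sketch only, adjust keys and values to your site.
# NDD may be overridden for a dry run, e.g. NDD=echo.
NDD=${NDD:-ndd}

tune() {
    $NDD -set /dev/ip  ip_forwarding 0
    $NDD -set /dev/tcp tcp_slow_start_initial 2
    $NDD -set /dev/tcp tcp_close_wait_interval 30000
}

# The boot process calls the script with "start"; ignore everything else.
case "${1:-}" in
start) tune ;;
esac
```

The hard link is created once with ln /etc/init.d/your-tune /etc/rcS.d/S31your-tune. Running the script by hand with NDD=echo and the argument start shows what would be set without touching the running system.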
5.1 Common TCP timers
The current subsection covers three important TCP timers. First I will have
a look at the keepalive timer. The timer is rather controversial, and some
Solari implement it incorrectly. The next parameter limits the twice
maximum segment lifetime (2MSL) value, which is connected to the time
a socket spends in the TCP state TIME_WAIT. The final entry
looks at the time spent in the TCP state FIN_WAIT_2.
This value is one of the most controversial ones when talking with other people about appropriate values. The interval specified with this key must expire before a keep-alive probe can be sent. Keep-alive probes are described in the host requirements RFC 1122: If a host chooses to implement keep-alive probes, it must enable the application to switch them on or off for a connection, and keep-alive probes must be switched off by default.
Keep-alives can terminate a perfectly good connection (as far as
TCP/IP is concerned), cost you money and use up transmission capacity
(commonly called bandwidth, which is, actually, something completely
different). Determining whether a peer is alive should be a task of the
application and thus kept on the application layer. Only if you run
the risk of keeping a server in the ESTABLISHED state
forever, and thus using up precious server resources, should you switch on
keep-alive probes.
![[Webserver response]](tune.gif)
Figure 3: A typical handshake during a transaction.
Figure 3 shows the typical handshake during a HTTP connection. It is of no importance for the argumentation if the server is threaded, preforked or just plain forked. Webservers work transaction oriented as is shown in the following simplified description - the numbers do not relate to the figure:
Common implementations need to exchange 9..10 TCP segments per HTTP connection. The keep-alive option, an HTTP/1.0 protocol extension, can be regarded as a hack. Persistent connections are a different matter, and not shown here. Most people still use HTTP/1.0, especially the Squid users.
The keep-alive timer becomes significant for webservers if in step 1 the client crashes or terminates without the server knowing about it. This condition can sometimes be forced by quickly pressing the stop button of Netscape or the logo of Mosaic. Thus the keep-alive probes do make sense for webservers. HTTP proxies look like a server to the browser, but look like a client to the server they are querying. Due to their server-like interface, the conditions for webservers hold for proxies as well.
With a correctly working implementation of keep-alive probes,
a very small value can make sense when trying to
improve webservers. In this case you have to make sure that the probes stop
after a finite time if a peer does not answer. Solari <= 2.5
have a bug and send keep-alive probes forever. They seem to want
to elicit some response, like a RST or some ICMP error message
from an intermediate router, but never counted on the destination simply
being down. Is this fixed with 2.5.1? Is there a patch available against
this misbehaviour? I don't know, maybe you can help me.
I am quite sure that this bug is fixed in 2.6 and that it is safe to use a small value like ten minutes. Squid users should synchronize their cache configuration accordingly. There are some Squid timeouts dealing with an idle connection.
As Stevens repeatedly states in his books, the TIME_WAIT
state is your friend. You should not desperately try to avoid it, rather
try to understand it. The maximum segment lifetime (MSL) is the
maximum interval a TCP segment may live in the net. Thus waiting twice this
interval ensures that there are no leftover segments coming to haunt
you. This is what the 2MSL is about. Afterwards it is safe to reuse the
socket resource.
The parameter specifies the 2MSL according to the four minute limit specified in RFC 1122. Given what is known about current network topologies and the strategies to reserve ephemeral ports, you should consider a shorter interval. The shorter the interval, the faster precious resources like ephemeral ports are available again.
A top-level search engine implementor recommends a value of 1000 milliseconds to its customers. Personally I believe this is too low for a regular server. A loaded search engine is a different matter altogether, but now you see where some people start tweaking their systems. I rather tend to use a multiple of the tcp_rexmit_interval_initial interval. The current value of tcp_rexmit_interval_max should also be considered in this case - even though retransmissions are unconnected to the 2MSL time. A good starting point might be double the RTT to a very remote system (e.g. Australia for European sites). Alternatively, a German commercial provider of my acquaintance uses 30000, the smallest interval recommended by BSD.
This value seems to describe the (BSD) timer interval which prevents a
connection from staying in the FIN_WAIT_2 state
forever. FIN_WAIT_2 is reached if a connection closes
actively. The FIN is acknowledged, but the FIN
from the passive side has not arrived yet - and maybe never will.
Usually webservers and proxies actively close connections - as long as you don't use persistent connections, and even those are closed from time to time. Apart from that, HTTP/1.0 compliant servers and proxies close connections after each transaction. A crashed or misbehaving browser may cause a server to use up a precious resource for a long time.
You should consider decreasing this interval, if netstat -f
inet shows many connections in the state
FIN_WAIT_2. The timer is only used, if the connection is
really idle. Mind that after a TCP half close a simplex data
transmission is still available towards the actively closing end. TCP half
closes are not yet supported by Squid, though many web servers do support
them (certain HTTP drafts suggest an independent use of TCP
connections). Nevertheless, as long as the client sends data after
the server actively half closed an established connection the timer is not
active.
CLOSE_WAIT for reasons beyond me. During this phase the proxy
is virtually unreachable for HTTP requests though, obnoxiously, it still
answers ICP requests. Although lowering the value for
tcp_close_wait_interval is only fixing
symptoms indirectly, not the cause, it may help overcoming those
periods of erratic behaviour faster than the default. The thing needed
would be some means to influence the CLOSE_WAIT interval
directly.
5.2 Erratic IPX behaviour
I noticed that Solaris releases < 2.6 behave erratically under some conditions, if
the IPX ethernet MTU of 1500 is used. Maybe there is an error in the frame
assembly algorithm. If you limit yourself to the IEEE 802.3 MTU of 1492
byte, the problem does not seem to appear. A sample startup script with link in /etc/rc2.d
can be used to change the MTU of ethernet interfaces after their
initialization. Remember to set the MTU for every virtual interface,
too!
Note, with a patched Solaris 2.5.1 or Solaris 2.6, the problem does not seem to appear. Limiting your MTU to a non-standard value might introduce problems with truncated PDUs in certain (admittedly very special) environments. Thus you may want to refrain from using the above mentioned script (always called second script in this document). Since I observed the erratic behaviour only in a Solaris 2.5, I believe it has been fixed with patch 103169-10, or above. The error description reads "1226653 IP can send packets larger than MTU size to the driver."
5.3 Common TCP/IP parameters
The following parameters have little impact on performance, nevertheless I
reckon them worth noting here:
This parameter determines if IP datagrams can be forwarded which have the source routing option activated. The parameter has little meaning for performance but is rather of security relevance. Solaris may forward such datagrams, if the host route option is activated, bypassing certain security constructs - possibly undermining your firewall. Thus you should always disable it, unless the host functions as a regular router (and provides no other services).
This switch decides whether datagrams directed to any of your direct broadcast addresses can be forwarded as link-layer broadcasts. If the switch is on (default), such datagrams are forwarded. If set to zero, pings or other broadcasts to the broadcast address(es) of your installed interface(s) are silently discarded. The switch is recommended for any host, but can break "expected" behaviour.
If you intend to disable the routing abilities of your host altogether, because you know you don't need them, you can set this switch to 0. The default value of 2 activates IP forwarding, if two or more real interfaces are up. The value of 1 activates IP forwarding regardless of the number of interfaces. With the possible exception of MBone routers and firewalling, you should leave routing to the dedicated routing hardware.
If you don't want to respond to a ping to any of your broadcast addresses, set this parameter to 0. On one hand, responding to broadcast pings is rumored to have caused panics, or at least partial network meltdowns. On the other hand, it is a valid behaviour, and often used to determine the number of alive hosts on a particular network. If you are dead sure that neither you nor your network admin will need this feature, you can switch it off by using the value of 0.
The current parameter defines the minimum time between two consecutive ICMP error responses. This includes a time exceeded as evoked by a traceroute. If your current setting here is above the RTT of a traceroute probe, the second probe will time out.
If you set this value to exactly 0, traceroute will not give your host away as running Solaris. I am afraid I don't have any idea what kind of ghosts you invite by setting this parameter to 0. So far, it didn't hurt the hosts I used it upon. But I could think that security reasons would argue for a value above 0.
This value has the same size for UDP and TCP. Solaris allocates ephemeral ports above 32768. Busy servers or hosts using a large 2MSL, see tcp_close_wait_interval, may want to lower this limit to 8192. This yields more precious resources, especially for proxy servers.
A contra-indication may be servers and services running on well known ports above 8192. This parameter should be set very early during system bootup, especially before the portmapper is started.
This parameter has to be seen in combination with
udp_smallest_anon_port. The traceroute
program tries to reach a random UDP port above 32768 - or rather tries not
to reach such a port - in order to provoke an ICMP error message from the
host.
Paranoid system administrators may want to lower the value for this reason down to 32767, after the corresponding value for udp_smallest_anon_port has been lowered. On the other hand, datagram application protocols should be able to cope with foreign protocol datagrams.
If Squid or other UDP hyper-active applications are used, the lowering of this value can not be recommended. The respective TCP parameter tcp_largest_anon_port does not suffer this problem.
![[buffers and fragmentation]](fraggle.gif)
Figure 4: buffers and related issues
Here just a short trip through the network layer in order to explain what happens where. Your application is able to send almost any size of data to the transport layer. The transport layer is either UDP or TCP. The socket buffers are implemented on the transport layer. Depending on your choice of transport protocol, different actions are taken on this level.
Only after the peer instance has acknowledged the data can it be removed from the socket buffer! For slow connections or a slowly working peer, this means that some data may occupy the buffer for a very long time.
Please assume that there is not really a socket buffer for sending UDP. This really depends on the operating systems, but many systems copy the user data to some kernel storage area, whereas others try to eliminate all copy operations for the sake of performance.
If the output queue of the datalink layer interface is full, the datagram will be discarded and an error will be returned to IP and back to the transport layer. If the transport protocol was TCP, TCP will try to resend the segment at a later time. UDP should return the ENOBUFS error, but some implementations don't.
To determine the MTU sizes, use the ifconfig -a command. The
MTUs are needed for some calculation to be done later in this section.
With IPv4 you can determine the MSS from the interface MTU by subtracting
20 Bytes for the TCP header and 20 Bytes for the IP header. Keep this in
mind, as the calculation will be repeatedly necessary in the text following
below.
$ ifconfig -a
lo0: flags=849 mtu 8232
        inet 127.0.0.1 netmask ff000000
el0: flags=863 mtu 1500
        inet 130.75.215.xxx netmask ffffff00 broadcast 130.75.215.255
        ether xx:xx:xx:xx:xx:xx
hme0: flags=863 mtu 1500
        inet 130.75.5.xxx netmask ffffff00 broadcast 130.75.5.255
qaa0: flags=863 mtu 9180
        inet 130.75.214.xxx netmask ffffff00 broadcast 130.75.214.255
        ether xx:xx:xx:xx:xx:xx
fa0: flags=842 mtu 9188
        inet 0.0.0.0 netmask 0
        ether xx:xx:xx:xx:xx:xx
I removed the uninteresting things. hme0 is the regular 100 Mbps ethernet interface. The 10 Mbps ethernet interface is called le0. el0 is the ATM LAN emulation (lane) interface. qaa0 is the ATM classical IP (clip) interface. fa0 is the interface that supports Fore's proprietary implementation of native ATM. Fore is the vendor of the installed ATM card. AFAIK you can use this interface to build PVCs or, if you are also using Fore switches, SVCs. You see an unconfigured interface there.
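The MSS rule from above can be sketched in a few lines of Python. The MTU values are the ones from the ifconfig listing; the function and constant names are made up for this illustration, and the sketch assumes IPv4 with neither IP nor TCP options in use.

```python
# Header sizes without options, as assumed in the text above.
IPV4_HEADER = 20
TCP_HEADER = 20


def mss_from_mtu(mtu: int) -> int:
    """Return the TCP MSS for an IPv4 interface with the given MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


# MTUs as reported by the ifconfig -a listing above.
interface_mtus = {"lo0": 8232, "el0": 1500, "hme0": 1500, "qaa0": 9180}

for name, mtu in interface_mtus.items():
    print(f"{name}: MTU {mtu} -> MSS {mss_from_mtu(mtu)}")
```

The resulting MSS values are the ones the window sizes in the following sections are multiples of.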
The buffer sizes for sending and receiving TCP segments and UDP
datagrams can be tuned with Solaris. With the help of the
netstat command you can obtain an output similar to
the following one. The data was obtained on a server which runs a Squid
with five dnsserver children. Since the interprocess communication is
accomplished via localhost sockets, you see both the client side and the
server side of each dnsserver child socket.
$ netstat -f inet

TCP
   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q  State
-------------------- -------------------- ----- ------ ----- ------ -------
blau-clip.ssh        challenger-clip.1023 57344     19 63980      0 ESTABLISHED
localhost.38437      localhost.38436      57344      0 57344      0 ESTABLISHED
localhost.38436      localhost.38437      57344      0 57344      0 ESTABLISHED
localhost.38439      localhost.38438      57344      0 57344      0 ESTABLISHED
localhost.38438      localhost.38439      57344      0 57344      0 ESTABLISHED
localhost.38441      localhost.38440      57344      0 57344      0 ESTABLISHED
localhost.38440      localhost.38441      57344      0 57344      0 ESTABLISHED
localhost.38443      localhost.38442      57344      0 57344      0 ESTABLISHED
localhost.38442      localhost.38443      57344      0 57344      0 ESTABLISHED
localhost.38445      localhost.38444      57344      0 57344      0 ESTABLISHED
localhost.38444      localhost.38445      57344      0 57344      0 ESTABLISHED
The columns titled with Swind and Rwind contain
values for the size of the respective send and receive
windows, based on the free space available in the receive
buffer at each peer. The Swind column
contains the offered window size as reported by the
remote peer. The Rwind column displays the advertised
window size being transmitted to the remote peer.
An application can change the size of the socket layer
buffers with calls to setsockopt with the
parameter SO_SNDBUF or SO_RCVBUF. Windows and
buffers are not interchangeable. Just remember: The buffers have a fixed
size - unless you use setsockopt to change it. Windows on the
other hand depend on the free space available in the input buffer. The
minimum and maximum requirements for buffer sizes are tuneable
watermarks.
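As a minimal sketch of such a setsockopt call, the following Python snippet requests new buffer sizes for one socket and reads back what the kernel actually granted. The helper name resize_buffers and the sizes 32768 and 16384 are only illustrative assumptions. The granted values may differ from the requested ones, because the operating system is free to round or clamp them, and a request beyond the system maximum may raise an OSError.

```python
import socket


def resize_buffers(sock: socket.socket, send_size: int, recv_size: int):
    """Request new socket buffer sizes and return what the kernel granted."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, send_size)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, recv_size)
    granted_send = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    granted_recv = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return granted_send, granted_recv


# Illustrative sizes only; they are not a tuning recommendation.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    print(resize_buffers(s, 32768, 16384))
```

The buffer sizes should be set before the connection is established, since the window scale option is only negotiated during connection setup.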
![[buffers, watermarks and windows]](buffers.gif)
Figure 5: buffers, watermarks and window sizes.
Figure 5 shows the relation of the different buffers, windows and watermarks. I decided to let the send buffer grow from the maximum towards zero, which is just a way of showing things, and does probably not represent the real implementation. I left out the different socket options as the picture is confusing enough.
SO_RCVBUF allows the dynamic change of the
receive buffer size within the application on a per socket basis.
select or poll return the socket as readable.
The socket option
SO_RCVLOWAT allows the dynamic change of the receive
low watermark on a per socket basis. With UDP, the socket is reported
readable as soon as there is a complete datagram in the receive buffer.
SO_SNDBUF socket option within an application. Mind that for
UDP the size of the output buffer represents the maximum datagram size.
select and poll report the socket writable. The
socket option SO_SNDLOWAT allows a dynamic change of this
size on a per-socket basis.
Swind
column in the netstat output. From the offered window, the
usable window
is calculated, that is the amount of data which can be sent as soon as
possible. TCP never sends more than the minimum of the current congestion
window and the offered window.
to_send := MIN( cwnd, offered window )
Rwind column
in the netstat output.
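The limit to_send := MIN( cwnd, offered window ) and the usable window derived from it can be sketched as follows. The parameter bytes_in_flight (data already sent but not yet acknowledged) is an assumption of this sketch and does not appear under that name in the text; the numbers in the example call are arbitrary.

```python
def usable_window(cwnd: int, offered_window: int, bytes_in_flight: int) -> int:
    """Bytes TCP may send right now.

    to_send is the limit from the text: MIN(cwnd, offered window).
    bytes_in_flight (sent but not yet acknowledged) is an assumption of
    this sketch; it is subtracted because that data already occupies
    part of the window.
    """
    to_send = min(cwnd, offered_window)
    return max(0, to_send - bytes_in_flight)


# Arbitrary example: a small congestion window limits a large offered window.
print(usable_window(cwnd=8760, offered_window=57344, bytes_in_flight=2920))
```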
Squid users should note the following behaviour seen with Solaris 2.6. The default socket buffer sizes which are detected during the configuration phase are representative of the values for tcp_recv_hiwat, udp_recv_hiwat, tcp_xmit_hiwat and udp_xmit_hiwat. Also note that enabling the hit object feature still limits the hit object size to 16384 bytes, regardless of what your system is able to achieve.
Output from the Squid 1.1.19 configuration script on a Solaris 2.6 host with the previously mentioned parameters all set to 64000. Please mind that these parameters do not constitute optimal sizes in most environments:
checking Default UDP send buffer size... 64000
checking Default UDP receive buffer size... 64000
checking Default TCP send buffer size... 64000
checking Default TCP receive buffer size... 64000
Buffers and windows are very important if you link via satellite. Because a satellite link combines a high possible data rate with extremely high round-trip delays, you will need very large TCP windows and possibly the TCP timestamp option. Only RFC 1323 conformant systems will achieve these ends. In other words, get a Solaris 2.6. For 2.5 systems, RFC 1323 compliance can be purchased as a Sun Consulting Special.
Window sizes are important for maximum throughput calculations, too. As Stevens shows, you cannot go faster than the window size offered by your peer, divided by the round-trip time (RTT). The lower your RTT, the faster you can transmit. The larger your window, the faster you can transmit. If you intend to employ maximum window sizes, you might want to give tcp_deferred_acks_max another look.
The network research laboratory of the German research network did measurements on satellite links. The RTT for a 10 Mbps link (if I remember correctly) was about 500 ms. A regular system was able to transmit 600 kbps whereas a RFC 1323 conformant system was able to transmit about 7 Mbps. Only bulk data transfer will do that for you.
(1) 10 Mbps * 0.5 s = 5 Mbit = 625 KB
(2) 512 KB = 4 Mbit = 0.5 s * 8 Mbps
(3) 64 KB / 0.5 s = 128 KBps = 1 Mbps
The bandwidth-delay-product can be used to estimate the initial value when tweaking buffer sizes. The buffers then represent the capacity of the link. If we apply the bandwidth-delay-product calculations to the satellite link above, we get the following results: Equation 1 estimates the buffer sizes necessary to fully fill the 10 Mbps link. Equation 2 assumes that the buffer sizes were set to 512 KB, which would yield 8 Mbps. Slight deviation in the experiment may have been caused by retransmissions. Finally, equation 3 estimates the maximum datarate we can use on the satellite link, if limited to 64 KB buffers, e.g. Solaris <= 2.5.1. The 1 Mbps constitute an upper limit, as can be seen by the measured 600 Kbps.
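The three equations can be reproduced with a small Python sketch. The function names are made up for this illustration. Note that the sketch mixes units the same way the text does: equation 1 uses decimal units (1 Mbps = 1,000,000 bit/s), whereas equations 2 and 3 use buffers in multiples of 1024 bytes, so the results are only approximately the round figures quoted above.

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes needed to keep the link full."""
    return bandwidth_bps * rtt_s / 8


def max_throughput_bps(window_bytes: float, rtt_s: float) -> float:
    """Upper bound on throughput for a given window and RTT, in bit/s."""
    return window_bytes * 8 / rtt_s


RTT = 0.5  # seconds, the satellite round-trip time quoted above

print(bdp_bytes(10_000_000, RTT))           # equation 1, in bytes
print(max_throughput_bps(512 * 1024, RTT))  # equation 2, in bit/s
print(max_throughput_bps(64 * 1024, RTT))   # equation 3, in bit/s
```

The second function is the window size divided by the RTT, that is the limit Stevens describes: whatever the link could carry, a 64 KB window over a 0.5 s round trip cannot exceed roughly 1 Mbps.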
Squid users beware: As long as Squid does not implement HTTP/1.1 persistent connections, you will not get any decent HTTP transmissions via satellite. The average cached object size is about 13 kbyte, thus you almost never get past the TCP slow start. While this may or may not be a big deal with terrestrial links, you will never be able to fill a satellite pipe to a satisfactory degree. Doing things in parallel might help. Only when reaching TCP congestion avoidance will you see any filling of the pipe.
This parameter describes the maximum size the congestion window can be opened. The congestion window is opened as large as possible with any Solaris up to 2.5.1. A change to this value is only necessary for older Solaris systems, which defaulted to 32768. The Solaris 2.6 default looks reasonable, but you might need to increase this further for satellite links.
Though window sizes beyond 64k are possible, mind that the window scale option is only announced during connection creation and your maximum window size is 1 GByte (1,073,725,440 Byte). Also, the window scale option is only employed during the connection, if both sides support it.
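The maximum of 1,073,725,440 bytes follows from the 16 bit window field of the TCP header and the largest shift count of 14 that RFC 1323 allows. A minimal Python sketch, with names chosen only for this illustration:

```python
MAX_UNSCALED_WINDOW = 65535  # 16 bit window field in the TCP header
MAX_WINDOW_SHIFT = 14        # largest shift count allowed by RFC 1323


def scaled_window(window_field: int, shift: int) -> int:
    """Effective window for a 16 bit window field and a scale shift."""
    if not 0 <= window_field <= MAX_UNSCALED_WINDOW:
        raise ValueError("window field must fit into 16 bits")
    if not 0 <= shift <= MAX_WINDOW_SHIFT:
        raise ValueError("shift count must be between 0 and 14")
    return window_field << shift


print(scaled_window(MAX_UNSCALED_WINDOW, MAX_WINDOW_SHIFT))
```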
This parameter determines the maximum size of the initial TCP reception buffer. The specified value will be rounded up to the next multiple of the MSS. From the free space within the buffer the advertised window size is determined. That is, the size of the reception window advertised to the remote peer. Squid users will be interested in this value with regards to the socket buffer size the Squid auto configuration program finds.
The previous table shows an Rwind value of 63980 = 7 *
9140. 9140 is the MSS of the ATM classical IP interface (clip) in host
blau. The interface itself uses an MTU of 9180. For the standard
builtin 10 Mbps or 100 Mbps IPX ethernet, you get an MTU of 1500 on the
outgoing interface, which yields an MSS of 1460. The value of 57344 in the
next Rwind line points to the lo0 (loopback)
interface, MTU 8232, MSS 8192 and 57344 = 7 * 8192.
Starting with Solaris 2.6 values above 65535 are possible, see the window scale option from RFC 1323. Only if the peer host also implements RFC 1323, you will benefit from buffer sizes above 65535. If one host does not implement the window scale option, the window is still limited to 64K. The option is only activated, if buffer sizes above 64K are used.
For HTTP, I don't see the need to increase the buffer above 64k. Imagine servicing 1024 simultaneous connections. If both the TCP high watermarks of your system are tuned to 64k and your application uses the system's defaults, you would need 128M just for your TCP buffers!
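The memory estimate is simple arithmetic, sketched here in Python. The function name is made up, and the sketch assumes the worst case in which every connection fills both its send and its receive buffer completely.

```python
def tcp_buffer_memory(connections: int, xmit_hiwat: int, recv_hiwat: int) -> int:
    """Worst-case bytes of TCP buffer space if every buffer fills up."""
    return connections * (xmit_hiwat + recv_hiwat)


K = 1024
total = tcp_buffer_memory(1024, 64 * K, 64 * K)
print(total // (K * K), "MB")
```

In practice most buffers are far from full most of the time, so this is an upper bound rather than an expected value.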
Squid's configuration option tcp_recv_bufsize lets you select
a TCP receive buffer size, but if set to 0 (default) the kernel value will
be taken, which is configurable with the tcp_recv_hiwat
parameter. A buffer size of 16K is large enough to cover over 70 % of all
received webobjects on our caches.
This parameter influences the minimum size of the input buffer. The reception buffer is at least as large as this value multiplied by the MSS. The real value is the maximum of tcp_recv_hiwat rounded up to the next MSS and tcp_recv_hiwat_minmss multiplied by the MSS, in other words, something akin to:
hiwat_tmp ~= ceil( tcp_recv_hiwat / MSS )
real_size := MAX( hiwat_tmp, tcp_recv_hiwat_minmss ) * MSS
That way, however bad you misconfigure the buffers, there is a guaranteed space for tcp_recv_hiwat_minmss full segments in your input buffer.
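The pseudocode translates directly into Python. This is only a sketch of the formula as given above, not of the actual kernel code. The value 57344 for tcp_recv_hiwat is an assumption that happens to reproduce the Rwind values of the netstat output, and the value 4 for tcp_recv_hiwat_minmss is purely hypothetical.

```python
def receive_buffer_size(tcp_recv_hiwat: int, minmss: int, mss: int) -> int:
    """Effective receive buffer size, following the pseudocode above."""
    # Integer ceiling division: number of full segments covering the hiwat.
    hiwat_segments = (tcp_recv_hiwat + mss - 1) // mss
    return max(hiwat_segments, minmss) * mss


# 57344 is an assumed tcp_recv_hiwat, 4 a hypothetical tcp_recv_hiwat_minmss.
for mss in (1460, 8192, 9140):
    print(mss, receive_buffer_size(57344, 4, mss))
```

With a badly misconfigured tcp_recv_hiwat of, say, 1000 bytes and an MSS of 1460, the same function still yields room for four full segments.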
The highwater mark for the UDP reception buffer size. This value may be of interest for Squid proxies which use ICP extensively. Please read the explanations for tcp_recv_hiwat. Squid users will want at least 16384, especially if you are planning on using the hit object feature of Squid.
If you see many dead parent detections in your cache.log file
without cause, you might want to increase the receive buffer. In most
environments an increase to 64000 will have a negligible effect on the
memory consumption, as most applications, including Squid, use only one or
very few UDP sockets, and only in an iterative way.
Remember, if you don't set your socket buffer explicitly with a call to
setsockopt(), your default reception buffer will have about
this size. Arriving datagrams of a larger size might be truncated or
completely rejected. Some systems don't even notify your receiving
application.
This parameter influences a heuristic which determines the size of the initial send window. The actual value will be rounded up to the next multiple of the MSS, e.g. 8760 = 6 * 1460. Also do read the section on tcp_recv_hiwat.
The table further to the top shows a Swind of 57344 = 7 *
8192. For the standard builtin 10 Mbps or 100 Mbps IPX ethernet, you get an
MTU of 1500 on the outgoing interface, which yields an MSS of 1460.
Starting with Solaris 2.6 values above 65535 are possible, see the window scale option from RFC 1323. Only if the peer host also implements RFC 1323, you will benefit from buffer sizes above 65535. If one host does not implement the window scale option, the window is still limited to 64K.
I don't see the need to increase the buffer above 32K for HTTP
applications. Imagine servicing 1024 simultaneous connections. If both TCP
high watermarks of your system are tuned to 32K, you would need 64M just
for your TCP buffers! Squid 1.1.x does not seem to use the socket
option SO_SNDBUF to limit this memory hunger during
runtime. Mind that the send buffer has to keep a copy of all unacknowledged
segments. Therefore it is affordable to give it a greater size than the
receive buffer. Again, 16K covers over 70 % of all transferred webdocuments
on our caches, and 32K should cover 90 %.
This refers to the highwater mark for send buffers. May be of interest for proxies using ICP extensively. Please refer to the explanations for tcp_xmit_hiwat. Squid users will want at least 16384, especially if you are planning on using the hit object feature of Squid.
Please remember that there exists no real send buffer for UDP on the socket
layer. Thus, trying to send a larger amount of data than
udp_xmit_hiwat will truncate the excess, unless the
SO_SNDBUF socket option was used to extend the buffer.
The current parameter refers to the amount of data which must be available
in the TCP socket sendbuffer until select or poll
return writable for the connected file descriptor.
Usually there is no need to tune this parameter. Applications can use the
socket option SO_SNDLOWAT to change this parameter on a process
local basis.
The current parameter refers to the amount of data which must be available
until select or poll return writable for
the connected file descriptor. Since UDP does not need to keep datagrams
and thus needs no outgoing socket buffer, the socket will always be
writable as long as the socket sendbuffer size value is greater than the
low watermark. Thus it does not really make much sense to wait for a
datagram socket to become writable unless you constantly adjust the
sendbuffer size.
Usually there is no need to tune this parameter, especially not on a system-wide basis.
Finally found the explanations in the SUN TCP/IP Admin Guide. The current parameter refers to the maximum buffer size an application is allowed to specify with the SO_SNDBUF and SO_RCVBUF socket option calls. Attempts to use larger buffers will fail with an EINVAL return code from the socket option call. SUN recommends to use only the largest buffer necessary for any of your applications - that is, the supremum function, not the sum. Specifying a greater size does not seem to have much impact, if all your applications are well-behaving. If not, they may consume quite an amount of kernel memory, thus this parameter is also a kind of safety line.
7. Tuning your system
7.1 Things to watch
Did you reserve enough swap space? You should have at least
as much swap as you have main memory. If you have little main
memory, even double your swap. Do not be fooled by the result
of the vmstat command - read the manpage and realize that the
small value for free memory shown there is (usually) correct.
With Solaris there seems to exist a difference between virtually generated
processes and real processes. The latter is extremely dependent on the
amount of virtual memory. To test the amount of both kinds of processes,
try a small program of mine. Do start it at the
console, without X and not as a privileged user. The first value is
the hardlimit of processes, and the second value the amount of processes
you can really create given your virtual memory configuration. Tweaking
your ulimit values may or may not help.
7.2 General entries in the file /etc/system
The file /etc/system contains various very important resource
configurable parameters for your system. You use these tunings to give a
heavily loaded system more resources of a certain kind. Unfortunately a
reboot is necessary after changing anything. Though one could
schedule reboots after midnight, I advise against it. You should always
check if your changes have the desired effect, and won't tear down the
system.
Adrian Cockcroft severely warns against transporting an
/etc/system from one system onto another, even worse, onto
another hardware platform:
Clean out your /etc/system when you upgrade.
The most frequent changes are limited to the number of file descriptors, because the socket API uses filedescriptors for handling internet connectivity. You may want to look at the hardlimit of filehandles available to you. Proxies like Squid have to count two descriptors for each connection: an open request descriptor and either an open file or an open forwarding request descriptor.
You are able to influence the tuning with the reserved word
set. Use a whitespace to separate the key from the
keyword. Use an equals sign to separate the value from its key. There are a
few examples in the comments of the file.
Please, before you start, make a backup copy of your initial
/etc/system. The backup should be located on your root
filesystem. Thus, if some parameters fail, you can always supply the
alternative, original system file on the boot prompt. The following shows
two typically entered parameters:
* these are the defaults of the system
set rlim_fd_max=1024
set rlim_fd_cur=64
WARNING! SUN does not make any guarantees for the correct working of your system, if you use more filedescriptors than 4096. Personally, my old fvwm window manager did quit working altogether. In my case, I compiled it on a Solaris 2.3 or 2.4 system and always transferred it onwards to a 2.5 system. After compiling the fvwm95, it worked to my satisfaction.
If you experience SEGV core dumps from your select(3c) system
call after increasing your file descriptors above 4096, you have to
recompile the affected programs. Especially the select(3c)
call is known to the Squid users for its bad tempers concerning the maximum
number of file descriptors. SUN remarks to this topic:
The default value for FD_SETSIZE (currently 1024) is larger
than the default limit on the number of open files. In
order to accommodate programs that may use a larger number
of open files with select(), it is possible to increase this
size within a program by providing a larger definition of
FD_SETSIZE before the inclusion of <sys/types.h>.
Note: This does not work as expected. See text below.
I did test this suggestion by SUN, and a friend of mine tried it with Squid
caches. The result was a complete success or disaster both times,
depending on your point of view: If you can live with supplying naked women
to your customers instead of bouncing company logos, go ahead and try
it. If you really need to access filedescriptors above 1024, don't
use select(), use poll() instead!
poll() is supposed to be faster with Solaris, anyway. A
different source mentions that the redefinition workaround mentioned above
works satisfactorily; it did not for me, nor with Squid.
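To illustrate the poll() interface, here is a minimal Python sketch using the standard select.poll object, which wraps the system call on Unix systems. Unlike select(), poll() takes a list of descriptors instead of a fixed-size bit set, so it is not bound by FD_SETSIZE. The helper name wait_readable and the socket pair are assumptions of this example and have nothing to do with the Squid sources.

```python
import select
import socket


def wait_readable(sockets, timeout_ms: int):
    """Return the sockets that poll() reports readable within the timeout."""
    poller = select.poll()
    by_fd = {}
    for sock in sockets:
        poller.register(sock.fileno(), select.POLLIN)
        by_fd[sock.fileno()] = sock
    return [by_fd[fd] for fd, event in poller.poll(timeout_ms)
            if event & select.POLLIN]


# A connected pair of local sockets stands in for real network connections.
left, right = socket.socketpair()
right.send(b"ping")
ready = wait_readable([left, right], 1000)
print(ready == [left])
left.close()
right.close()
```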
At the pages of VJ are some tricks which I incorporated into this paper, too. Personally I am of the opinion that the VJ pages are not as up to date as they could be.
Many parameters of interest can be determined using the sysdef
-i command. Please keep in mind that many values are in
hexadecimal notation without the 0x prefix. Another
very good program to see your system's configuration is
sysinfo. Refer to the manpages on how to
invoke this program.
This parameter defines the softlimit of open files you can have. The currently active softlimit can be determined from a shell with something like
ulimit -Sn
Use values above 1024 at your own risk, especially if you are running old binaries. A value of 4096 may look harmless enough, but may still break old binaries.
Another source mentions that using more than 8192 filedescriptors is discouraged. It mentions that you ought to use more processes, if you need more than 4096 file descriptors. On the other hand, an ISP of my acquaintance is using 16384 descriptors to his satisfaction.
The predicate rlim_fd_cur <= rlim_fd_max must be fulfilled.
This parameter defines the hardlimit of open files you can have. For a Squid and most other servers, regardless of TCP or UDP, the number of open filedescriptors per user process is among the most important parameters. The number of filedescriptors is one limit on the number of connections you can have in parallel. You can find out the value of your hardlimit on a shell with something like
ulimit -Hn
You should consider a value of at least 2 * tcp_conn_req_max and you should provide at least 2 * rlim_fd_cur. The predicate rlim_fd_cur <= rlim_fd_max must be fulfilled.
Use values above 1024 at your own risk. SUN does not make any
warranty for the workability of your system, if you increase this above
1024. Squid users of busy proxies will have to increase this
value, though. A good starting point seems to be 16384 <= x <= 32768.
Remember to change the Makefile for Squid to use poll()
instead of select(). Also remember that each call of
configure will change the Makefile back, if you didn't
change Makefile.in.
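The limits in effect for a process can also be inspected from within a program. The following Python sketch uses the standard resource module, which reports the same soft and hard limits as ulimit -Sn and ulimit -Hn, and checks the predicate rlim_fd_cur <= rlim_fd_max. The helper name limits_consistent is made up for this illustration.

```python
import resource


def limits_consistent(soft: int, hard: int) -> bool:
    """Check the predicate rlim_fd_cur <= rlim_fd_max."""
    if hard == resource.RLIM_INFINITY:
        return True  # an unlimited hardlimit admits any softlimit
    return soft != resource.RLIM_INFINITY and soft <= hard


soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit (ulimit -Sn):", soft)
print("hard limit (ulimit -Hn):", hard)
print("consistent:", limits_consistent(soft, hard))
```

A process may raise its own softlimit up to the hardlimit at runtime, whereas the hardlimit itself is what the rlim_fd_max entry in /etc/system controls.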
This parameter determines the size of certain kernel data structures which are initialized at startup. There is strong indication that the default is determined from the main memory in megs. It might also be a function of the available memory and/or architecture.
The defaults of the parameters max_nprocs, maxuprc, ufs_ninode, ncsize and ndquot will be determined from this parameter's value. The greater you choose the number for maxusers, the greater the number of the mentioned resources. The relation is strictly proportional: A doubling of maxusers will (more or less) double the other resources.
Adrian Cockcroft advises against a setting of maxusers. The kernel uses a lot of space while keeping track of the RAM usage within the system, therefore it might need to be reduced on systems with gigabytes of main memory.
This is the systemwide number of processes available. You should leave sufficient space to the parameter maxuprc. The value of this parameter is influenced by the setting of maxusers.
This parameter describes the number of processes available to a single user. The actual value is determined from max_nprocs which is itself determined by maxusers. The negative value seems to be a relative distance with regards to max_nprocs, but I haven't been able to test this (yet).
The parameter defines the maximum number of BSD ttys
(/dev/ptty??) available. A few BSD networking things might
need these devices. If you run into a limit, you may want to increase
the number of available ttys, but usually the size is sufficient.
Solaris only allocates 48 SYSV pseudo tty devices (slave devices in
/dev/pts/*). On a server with many remote logins, or many
open xterm windows you may reach this limit. It is of little interest
to webservers or proxies, but of greater interest for personal
workstations.
This parameter specifies the size of the virtual address cache. If a personal workstation with many open xterms and sufficient tty devices has a very degraded performance, this parameter might be too small. My recommendation is to let the system choose the correct value. The current value is determined by the size of maxusers.
The current parameter specifies the size of the inode table. This is some kind of cache. The actual value will be determined by the value of maxusers.
Some webcache users increase this value. If your intention was to keep the inode for each squid data file in memory, forget it. You'd need over L1 * L2 * swapfiles_per_dir entries. But an increase with regards to the ncsize value might help a tiny bit.
This parameter specifies the size of the directory name lookup cache. A large directory name lookup cache size significantly helps NFS servers that have a lot of clients. On other systems the default is adequate.
I don't know about the ties to ufs_ninode, but the formula is the same. The current value is determined by maxusers.
I have heard from a few people who increase ncsize to 30000 when using the Squid webcache. Imagine a Squid that uses 16 toplevel directories and 256 second level directories. Thus you'd need over 4096 entries just for the directories. It looks as if webcaches and news servers which store data in files generated from a hash need to increase this value for efficient access. Twice the default should be a good starting point. You may want to increase ufs_ninode by the same size, too.
This parameter specifies the size of the quota table. Many standalone webservers or proxies don't use quotas.
This parameter determines how many STREAMS modules you are allowed
to push into the Solaris kernel - I guess this is a per user or per
process count. The only application of widespread
use which may need such a kernel module is xntp. Even
with other modules pushed, usually you have sufficient room and no
need to tweak this parameter.
This parameter determines the maximum size of a message which is to be piped through the SYSV STREAMS.
The maximum size of the control part of a STREAMS message.
Now, considering the SVR3 buffer cache described by Maurice Bach, this parameter specifies the maximum memory size allowed for the buffer cache. The 0 value reported by sysinfo says to take 2 % of the main memory for buffer caches. sysdef -i shows the size in bytes taken for the buffer cache.
I have seen Squid admins increasing this value up to 10 %. If you change this value, you have to enter the number of kByte you want for the buffer cache. Please keep in mind that you are effectively 'double buffering', if you increase this value in conjunction with a cache like Squid 1.1.
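Since the value has to be entered in kByte, a percentage of main memory must be converted first. A minimal sketch, assuming a POSIX shell; the function name and the 512 MByte machine are hypothetical.

```shell
# Hypothetical helper: size in kByte of a given percentage of main memory.
bufhwm_kb() {
    mem_mb=$1    # main memory in MByte
    percent=$2   # share of main memory to allow for the buffer cache
    echo $(( mem_mb * 1024 * percent / 100 ))
}

bufhwm_kb 512 2    # the 2 % default on an assumed 512 MByte machine
bufhwm_kb 512 10   # the 10 % some Squid admins use
```

The resulting number is what would go into /etc/system, e.g. as a line of the form set bufhwm=52428.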
The autoup value determines the maximum age of a modified
memory page. The fsflush kernel daemon wakes up every five
seconds as determined by the tune_t_fsflushr interval.
At each wakeup, it checks a fraction of the main memory, namely
tune_t_fsflushr divided by autoup, so that all of memory is
covered once per autoup interval. The modified
pages are queued to the pageout kernel daemon, which forms
them into clusters for faster write access. Furthermore, the
fsflush daemon flushes modified entries from the inode
caches to disk!
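To get a feeling for the numbers, the share of memory examined per wakeup can be worked out in shell. A small sketch, assuming a POSIX shell and an autoup of 30 seconds (an assumed value; the text above only fixes tune_t_fsflushr at five seconds). The function name is hypothetical.

```shell
# Hypothetical helper: number of fsflush wakeups needed to cover all of
# main memory once, i.e. autoup divided by tune_t_fsflushr.
fsflush_wakeups() {
    autoup=$1     # maximum page age in seconds
    fsflushr=$2   # seconds between fsflush wakeups
    echo $(( autoup / fsflushr ))
}

fsflush_wakeups 30 5   # with the assumed autoup, one sixth of memory per wakeup
```

Lowering autoup makes each wakeup scan a larger share of memory; raising it spreads the scan over more wakeups.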
Some squid admins recommend lowering this value, because at high disk
loads, the fsflush effectively kills the I/O subsystem with
its updates, unless the stuff is flushed out fairly often. Stewart Forster
notes that this is justifiable, because squid writes disjoint data sets
and rarely does multiple writes to the same disk block. You can measure
the time spent updating the disks with:
/usr/proc/bin/ptime sync
If this reports more than five seconds on
several occasions, you can consider lowering autoup
among several options. Please note that a larger bufhwm
will take longer to flush. Also, the settings of
ufs_ninode and ncsize have an impact
on the time spent updating the disks. Setting the value too low has a
harmful impact on your performance, too.
There are also instances where increasing
autoup makes sense. Whenever you are using synchronous
writes like NFS or raw database partitions, fsflush has
little to do, and the overhead of frequent memory scans is a hindrance.
Refer to Adrian Cockcroft, "Sun Performance and Tuning", 2nd
edition, for a more detailed explanation of the subject. I never
claimed that tweaking your kernel is easy or foolproof.
Adrian Cockcroft explains this parameter in What are the tunable kernel parameters for Solaris 2? The parameter controls the prefetches of the external cache controller. You have to know your workload. Applications with extensive floating point arithmetic will benefit from prefetches, thus the parameter is turned on on personal workstations. On random access databases with little or no need for floating point arithmetic the prefetch will likely get in the way, therefore it is turned off on server machines. It looks as if it should be turned off on dedicated squid servers.
7.3 100 Mbit ethernet related entries
Mr. Nebel and Mr. Hüsemann were so kind as to give me a few hints concerning
100 Mbit ethernet interfaces and Solaris. It looks as if these cards
default to half duplex operation. Before you switch to full duplex mode,
make sure your switch or router can also work full duplex.
This parameter switches on the full duplex mode. Only use this parameter together with the next option.
Switch off the half duplex mode, must be used together with the previous parameter.
This parameter determines whether the SUN workstation should automatically negotiate the 100 Mbit mode with the switch or router. Usually Cisco switches also do auto negotiation, thus it may be necessary to set this switch to 0 and configure your Cisco hardware manually to 100 FDX.
netstat -ni input errors. Of course, good information
can only be obtained at the switch or router side.
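A matching set of /etc/system entries might look like the sketch below. The three parameter names are an assumption (the names commonly used with the hme driver), since they are not spelled out in the text above; check them against the output of ndd /dev/hme \? on your machine before relying on them.

```
* Sketch only: force 100 Mbit full duplex on an hme interface.
* The parameter names are assumed, verify them with ndd /dev/hme \?
set hme:hme_adv_100fdx_cap=1
set hme:hme_adv_100hdx_cap=0
set hme:hme_adv_autoneg_cap=0
```

The first two lines belong together, as described above; the third one only makes sense if the switch or router is configured manually as well.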
7.4 System V IPC related entries
Many applications still use the (old) SYSV IPCs. The System V IPC can be
divided into three separate areas: message queues, shared memory and
semaphores. With Solaris you have an easier and faster API to achieve the
same ends with Unix sockets or FIFOs, shared memory through
memory maps, see mmap(2), and file locks instead of
semaphores. Due to the
reduced need for System V IPC, Solaris has decreased the resources for
System V IPC drastically. This is o.k. for stand alone servers, but
personal workstations may need increased resources.
In some cases large database applications or VRML viewers use System V
IPC. Thus you should consider increasing a few resources. The active
resources can be determined with the sysdef -i
command. Relevant for your inspection are the parts near the end, all
having IPC in their names.
At first glance, the System V IPC resources for message queues and semaphores seem to be disabled by default. This is not true, because the necessary modules are loaded dynamically into the kernel as soon as they are referenced. The default System V shared memory uses 1 MB main memory. Proxies and webservers may even want to decrease this value, but database servers may need up to 25 % of the main memory as System V shared memory.
The entries in /etc/system for all System V IPC related
information contain the prefix msgsys:msginfo_
for message queues, the prefix
semsys:seminfo_
for semaphores,
and the prefix shmsys:shminfo_
for shared memory. After the prefix comes
the resource identifier, in all lower case letters, for the corresponding
value displayed by the sysdef command,
e.g. shmmax for the value of SHMMAX. The meaning
of the parameters can be obtained from any programming resource on System V
IPC, e.g. Stevens' APUE. If anything, you only need to
change the value for SHMMAX, for example:
* personal workstations using mpeg_play, or vic
set shmsys:shminfo_shmmax=16777216
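The naming rule above is mechanical enough to be written down as a small shell function. This is a sketch under the assumption of a POSIX shell; the function name ipc_entry is hypothetical, and it only looks at the first three letters of the sysdef name to pick the prefix.

```shell
# Hypothetical helper: turn a sysdef resource name such as SHMMAX into
# the matching /etc/system line, following the prefix rules above.
ipc_entry() {
    name=$1
    value=$2
    # The resource identifier is the sysdef name in lower case letters.
    key=$(echo "$name" | tr '[:upper:]' '[:lower:]')
    case $key in
        msg*) prefix=msgsys:msginfo_ ;;   # message queues
        sem*) prefix=semsys:seminfo_ ;;   # semaphores
        shm*) prefix=shmsys:shminfo_ ;;   # shared memory
        *)    echo "unknown System V IPC resource: $name" >&2
              return 1 ;;
    esac
    echo "set ${prefix}${key}=${value}"
}

ipc_entry SHMMAX 16777216
```

The printed line is meant to be reviewed and then added to /etc/system by hand; the sketch deliberately does not touch the file itself.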
7.5 How to find further entries
There are thousands of further items you can adjust. Every module
which has a device in the /dev directory and a module
file somewhere in the kernel tree underneath /kernel can
be configured with the help of ndd. Whether you have to have
superuser privileges depends on the access mode of the device file.
For instance, there exists a device /dev/hme and a kernel
module /kernel/drv/hme. This driver is connected, as you
might know, to the 100 Mbit ethernet interface. If you want to know which
values you can tweak, you can ask ndd:
ndd /dev/hme \?
Of course, you can only change entries marked for read and write. If you have tweaked enough and want to store some configuration as a default at boot time, you can enter your preferred values into the /etc/system
file. Just prefix the key with the module name and separate both with a
colon. You saw this earlier in the subsection on 100 Mbit ethernet and
the System V IPC page.
There is another way to get your hands on the names of keys to tweak. For
instance, the System V IPC modules don't have a related device file. This
implies that you cannot tweak things with the help of ndd.
Nevertheless, you can obtain all clear text strings from the module file
in the kernel.
strings -a /kernel/sys/shmsys # possible
nm /kernel/sys/shmsys # recommended
You will see a number of strings. Most of them are either names of functions within the module or clear text string passages defined within. Strings starting with
shminfo are the names of
user tuneable parameters, though. Now, how do you separate tuneable
parameters from the other stuff? I really don't know. If you have some
knowledge about Sun DDI, you may be able to help me to find a
recommendable way, e.g. using _info(9E) and
mod_info.
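Until a better way turns up, the output can at least be narrowed down by prefix. A minimal sketch, assuming a POSIX shell; the function name is hypothetical, and the sample input is made up so that the example runs on any machine.

```shell
# Hypothetical helper: keep only the lines on standard input that start
# with the given prefix, sorted and without duplicates.
tunable_candidates() {
    grep "^$1" | sort -u
}

# On a Solaris machine the input might come from the module file:
#   strings -a /kernel/sys/shmsys | tunable_candidates shminfo
# The sample input below is invented for demonstration purposes.
printf '%s\n' shmget shminfo_shmmax shminfo_shmmin shmat shminfo_shmmax |
    tunable_candidates shminfo
```

This only filters by name; it does not prove that a listed symbol is really meant to be tuned.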
8. Recommended patches
It is absolutely necessary to patch your Solaris system, if you didn't already
do so! Have a look at the DFN CERT patch
mirror or the
original
source from SUN. There may be a mirror closer to you, e.g. EUNet and
FUNET have their own mirrors, if I am informed correctly.
In order to increase your TCP performance, improve the security of websites and fix several severe bugs, do patch! Whoever still runs a Solaris below 2.5 should upgrade to 2.6 at least. I am about to find out how good Solaris 2.6 really is, and it is looking very promising.
Please remember to press the Shift button on your netscape navigator while selecting a link. If the patch is not loadable, probably a new release appeared in the meantime. To check whether this is the case, have a look at the directories of DFN CERT or SUN. The README file on the DFN-CERT server is kept without a version number and thus always up to date.
The SUN supplied patches to fix multicast problems with 2.5.1 are incompatible with the TCP patch. Unfortunately, you have to decide between an unbroken multicast and a fixed TCP module. Yes, I am aware that multicast is only possible via UDP, nevertheless the multicast patch replaces the installed TCP module. If you have problems here, ask your SUN partner for a workaround - he will probably suggest upgrading to 2.6.
9. Related books and software
This section started after receiving some information from
Christian
Grimm and Franz Haberhauer on TCP/IP and performance related
literature.
ACK.
zoom, a tool with a traffic light like display of system states and the percollator.
proctool.
linger?
The timeout after which IP is notified by TCP to find a new route during an active open.
The timeout after which IP is notified by TCP to find a new route for an established connection.
Something in connection with retransmissions.
11. Startup scripts
Startup scripts for Solaris exist for the important tweakable parameters.
Only the first script is really necessary.
/etc/init.d/your-tune and you must link
(hardlinks preferred, symbolic links are o.k.)
/etc/rcS.d/S31your-tune to the init.d file.
Please read the script carefully before installing. It is a rather easy shell script. The piping and awking isn't as bad as it looks:
PATH to standard values and
prints a message. For all messages which are not to contain a
linefeed, we have to use the UCB echo.
$osver is set with the operating system
major and minor version number times ten: Solaris 2.6 AKA SunOS 5.6
will set $osver to 560 and Solaris 2.5.1 AKA SunOS 5.5.1
will be counted as 551.
$patch looks into the installed kernel TCP module,
because it mustn't be assumed that /var is already mounted. The
result is either 0 for an unpatched system (or some error in the
pipeline), or the applied TCP patch level. For non-2.5.1 systems, you
have to change this line to your needs. All 2.5.1 systems (Sparc, x86
and PPC) will be recognized.
if tree just prints a message about the patch
found.
Always tune the parameters to your needs, not mine. Thus, examine the values closely.
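The version numbering described above can be reproduced in a few lines. This is not the code from the script itself, only a sketch of the same idea, assuming a POSIX shell and awk; the function name is hypothetical, and the formula only holds for single-digit minor and micro versions.

```shell
# Hypothetical helper: map a SunOS release string to the three-digit
# number described above (major * 100 + minor * 10 + micro).
osver_from_release() {
    # A missing third field counts as 0 in awk arithmetic.
    echo "$1" | awk -F. '{ print $1 * 100 + $2 * 10 + $3 }'
}

osver_from_release 5.6     # Solaris 2.6
osver_from_release 5.5.1   # Solaris 2.5.1
```

On a live system the argument would come from uname -r.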
le0 from the IPX to the IEEE 802.3 size. The meaning is shown
further up. The script is not strictly necessary, and
reports about odd behaviour may have ceased with a patched 2.5.1 or a
2.6.
Since I observed the erratic behaviour only on Solaris 2.5, I believe it has been fixed with patch 103169-10, or above. The error description reads "1226653 IP can send packets larger than MTU size to the driver."
If you intend to go ahead with this script, the file is called
/etc/init.d/your-tune2 and you need to create a link to it
(hard or soft, as above) as /etc/rc2.d/S90your-tune2. Please
mind that GNU awk is used in the script, normal awk does not seem to work
satisfactorily.
As this is the scripts section, I should mention again the nifty script kindly supplied by Mr. Kroonmaa. It allows the user to check on all existing values for a network component (tcp, udp, ip, icmp, etc.). Previously, I did something similar in Perl, but nothing as sophisticated until I saw Mr. Kroonmaa's script.
12. List of things to do
This section is not about things you have to do, but rather about
items which I think need to be reworked. Thus it is more a
kind of meta-section.
SYN
segment. Solaris 2.6 is well-behaved in this regard. Also I should finish
the few examples which show what is going on.
/etc/system values to be
put up which I don't know about. If you know something more about
maxpgio, minfree,
desfree, lotsfree,
fastscan, slowscan,
tune_t_gpgslo, tune_t_fsflushr,
autoup or (nbuf), feel free to write to
me. They might be covered in various SunWorld articles and Adrian
Cockcroft's second edition (which I am currently reading).
adb, especially those parameters, which are not
accessible with ndd. Anybody out there more familiar with
adb?
Please send your suggestions, bugfixes, comments, and ideas for new items to voeckler@rvs.uni-hannover.de