- 02 Dec, 2021 3 commits
-
-
Gleb Smirnoff authored
Just trust the pcb database, that if we did in_pcbref(), no way an inpcb can go away. And if we never put a dropped inpcb on our queue, and tcp_discardcb() always removes an inpcb to be dropped from the queue, then any inpcb on the queue is valid. Now, to solve LOR between inpcb lock and HPTS queue lock do the following trick. When we are about to process a certain time slot, take the full queue of the head list into on stack list, drop the HPTS lock and work on our queue. This of course opens a race when an inpcb is being removed from the on stack queue, which was already mentioned in comments. To address this race introduce generation count into queues. If we want to remove an inpcb with generation count mismatch, we can't do that, we can only mark it with desired new time slot or -1 for remove. Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D33026
-
Gleb Smirnoff authored
The HPTS input queue is in reality used only for "delayed drops". When a TCP stack decides to drop a connection on the output path it can't do that due to locking protocol between main tcp_output() and stacks. So, rack/bbr utilize HPTS to drop the connection in a different context. In the past the queue could also process input packets in context of HPTS thread, but now no stack uses this, so remove this functionality. Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D33025
-
Gleb Smirnoff authored
With introduction of epoch(9) synchronization to network stack the inpcb database became protected by the network epoch together with static network data (interfaces, addresses, etc). However, inpcb aren't static in nature, they are created and destroyed all the time, which creates some traffic on the epoch(9) garbage collector. Fairly new feature of uma(9) - Safe Memory Reclamation allows to safely free memory in page-sized batches, with virtually zero overhead compared to uma_zfree(). However, unlike epoch(9), it puts stricter requirement on the access to the protected memory, needing the critical(9) section to access it. Details: - The database is already build on CK lists, thanks to epoch(9). - For write access nothing is changed. - For a lookup in the database SMR section is now required. Once the desired inpcb is found we need to transition from SMR section to r/w lock on the inpcb itself, with a check that inpcb isn't yet freed. This requires some compexity, since SMR section itself is a critical(9) section. The complexity is hidden from KPI users in inp_smr_lock(). - For a inpcb list traversal (a pcblist sysctl, or broadcast notification) also a new KPI is provided, that hides internals of the database - inp_next(struct inp_iterator *). Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D33022
-
- 29 Nov, 2021 1 commit
-
-
Michael Tuexen authored
Reviewed by: glebius, lstewart, rrs Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D33098
-
- 19 Nov, 2021 1 commit
-
-
Gleb Smirnoff authored
Until this change there were two places where we would free tcpcb - tcp_discardcb() in case if all timers are drained and tcp_timer_discard() otherwise. They were pretty much copy-n-paste, except that in the default case we would run tcp_hc_update(). Merge this into single function tcp_freecb() and move new short version of tcp_timer_discard() to tcp_timer.c and make it static. Reviewed by: rrs, hselasky Differential revision: https://reviews.freebsd.org/D32965
-
- 14 Nov, 2021 1 commit
-
-
Michael Tuexen authored
tcp_respond() is sometimes called with only a read lock. The logging however, requires a write lock. So either try to upgrade the lock if needed, or don't log the packet. Reported by: syzbot+8151ef969c170f76706b@syzkaller.appspotmail.com Reported by: syzbot+eb679adb3304c511c1e4@syzkaller.appspotmail.com Reviewed by: markj, rrs Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D32983
-
- 11 Nov, 2021 2 commits
-
-
Randall Stewart authored
When a persists probe is lost, we will end up calculating a long RTT based on the initial probe and when the response comes from the second probe (or third etc). This means we have a minimum of a confidence level of 3 on a incorrect probe. This commit will change it so that we have one of two options a) Just not count RTT of probes where we had a loss <or> b) Count them still but degrade the confidence to 0. I have set in this the default being to just not measure them, but I am open to having the default be otherwise. Reviewed by: Michael Tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D32897
-
Randall Stewart authored
NOTE: HEADS UP read the note below if your kernel config is not including GENERIC!! This patch does a bit of cleanup on TCP congestion control modules. There were some rather interesting surprises that one could get i.e. where you use a socket option to change from one CC (say cc_cubic) to another CC (say cc_vegas) and you could in theory get a memory failure and end up on cc_newreno. This is not what one would expect. The new code fixes this by requiring a cc_data_sz() function so we can malloc with M_WAITOK and pass in to the init function preallocated memory. The CC init is expected in this case *not* to fail but if it does and a module does break the "no fail with memory given" contract we do fall back to the CC that was in place at the time. This also fixes up a set of common newreno utilities that can be shared amongst other CC modules instead of the other CC modules reaching into newreno and executing what they think is a "common and understood" function. Lets put these functions in cc.c and that way we have a common place that is easily findable by future developers or bug fixers. This also allows newreno to evolve and grow support for its features i.e. ABE and HYSTART++ without having to dance through hoops for other CC modules, instead both newreno and the other modules just call into the common functions if they desire that behavior or roll there own if that makes more sense. Note: This commit changes the kernel configuration!! If you are not using GENERIC in some form you must add a CC module option (one of CC_NEWRENO, CC_VEGAS, CC_CUBIC, CC_CDG, CC_CHD, CC_DCTCP, CC_HTCP, CC_HD). You can have more than one defined as well if you desire. Note that if you create a kernel configuration that does not define a congestion control module and includes INET or INET6 the kernel compile will break. Also you need to define a default, generic adds 'options CC_DEFAULT=\"newreno\" but you can specify any string that represents the name of the CC module (same names that show up in the CC module list under net.inet.tcp.cc). If you fail to add the options CC_DEFAULT in your kernel configuration the kernel build will also break. Reviewed by: Michael Tuexen Sponsored by: Netflix Inc. RELNOTES:YES Differential Revision: https://reviews.freebsd.org/D32693
-
- 09 Nov, 2021 1 commit
-
-
John Baldwin authored
Previously, sorele() always required the socket lock and dropped the lock if the released reference was not the last reference. Many callers locked the socket lock just before calling sorele() resulting in a wasted lock/unlock when not dropping the last reference. Move the previous implementation of sorele() into a new sorele_locked() function and use it instead of sorele() for various places in uipc_socket.c that called sorele() while already holding the socket lock. The sorele() macro now uses refcount_release_if_not_last() try to drop the socket reference without locking the socket. If that shortcut fails, it locks the socket and calls sorele_locked(). Reviewed by: kib, markj Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D32741
-
- 03 Nov, 2021 1 commit
-
-
Gordon Bergling authored
- s/writting/writing/ MFC after: 3 days
-
- 27 Oct, 2021 2 commits
-
-
Gleb Smirnoff authored
Pass control for IP/IP6 level options from generic tcp_ctloutput_set() down to per-stack ctloutput. Call tcp6_use_min_mtu() from tcp stack tcp_default_ctloutput(). Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D32655
-
Peter Lei authored
TCP stack sysctl nodes are currently inserted using the stack name alias. Allow the user to get the current stack's alias to allow for programatic sysctl access. Obtained from: Netflix
-
- 01 Oct, 2021 1 commit
-
-
Randall Stewart authored
DSACK accounting has been for quite some time under a NETFLIX_STATS ifdef. Statistics on DSACKs however are very useful in figuring out how much bad retransmissions you are doing. This is further complicated, however, by stacks that do TLP. A TLP when discovering a lost ack in the reverse path will cause the generation of a DSACK. For this situation we introduce a new dsack-tlp-bytes as well as the more traditional dsack-bytes and dsack-packets. These will now all display in netstat -p tcp -s. This also updates all stacks that are currently built to keep track of these stats. Reviewed by: tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D32158
-
- 13 Jul, 2021 1 commit
-
-
Randall Stewart authored
In reviewing tcp_lro.c we have a possibility that some drives may send a mbuf into LRO without making sure that the checksum passes. Some drivers actually are aware of this and do not call lro when the csum failed, others do not do this and thus could end up sending data up that we think has a checksum passing when it does not. This change will fix that situation by properly verifying that the mbuf has the correct markings (CSUM VALID bits as well as csum in mbuf header is set to 0xffff). Reviewed by: tuexen, hselasky, gallatin Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D31155
-
- 27 Jun, 2021 1 commit
-
-
Michael Tuexen authored
Some TCP stacks negotiate TS support, but do not send TS at all or not for keep-alive segments. Since this includes modern widely deployed stacks, tolerate the violation of RFC 7323 per default. Reviewed by: rgrimes, rrs, rscheff MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D30740 Sponsored by: Netflix, Inc.
-
- 14 Jun, 2021 1 commit
-
-
Mark Johnston authored
Some code was using it already, but in many places we were testing SO_ACCEPTCONN directly. As a small step towards fixing some bugs involving synchronization with listen(2), make the kernel consistently use SOLISTENING(). No functional change intended. MFC after: 1 week Sponsored by: The FreeBSD Foundation
-
- 24 May, 2021 1 commit
-
-
Randall Stewart authored
The push bit itself was also not actually being properly moved to the right edge. The FIN bit was incorrectly on the left edge. We fix these two issues as well as plumb in the mtu_change for alternate stacks. Reviewed by: mtuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30413
-
- 10 May, 2021 2 commits
-
-
Richard Scheffenegger authored
Recover from excessive losses without reverting to a retransmission timeout (RTO). Disabled by default, enable with sysctl net.inet.tcp.do_lrd=1 Reviewed By: #transport, rrs, tuexen, #manpages Sponsored by: Netapp, Inc. Differential Revision: https://reviews.freebsd.org/D28931
-
Randall Stewart authored
The hostcache up to now as been updated in the discard callback but without checking if we are all done (the race where there are more than one calls and the counter has not yet reached zero). This means that when the race occurs, we end up calling the hc_upate more than once. Also alternate stacks can keep there srtt/rttvar in different formats (example rack keeps its values in microseconds). Since we call the hc_update *before* the stack fini() then the values will be in the wrong format. Rack on the other hand, needs to convert items pulled from the hostcache into its internal format else it may end up with very much incorrect values from the hostcache. In the process lets commonize the update mechanism for srtt/rttvar since we now have more than one place that needs to call it. Reviewed by: Michael Tuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30172
-
- 06 May, 2021 1 commit
-
-
Randall Stewart authored
This fixes several breakages (panics) since the tcp_lro code was committed that have been reported. Quite a few new features are now in rack (prefecting of DGP -- Dynamic Goodput Pacing among the largest). There is also support for ack-war prevention. Documents comming soon on rack.. Sponsored by: Netflix Reviewed by: rscheff, mtuexen Differential Revision: https://reviews.freebsd.org/D30036
-
- 21 Apr, 2021 1 commit
-
-
Navdeep Parhar authored
Notify the TOE driver when when an ICMP type 3 code 4 (Fragmentation needed and DF set) message is received for an offloaded connection. This gives the driver an opportunity to lower the path MTU for the connection and resume transmission, much like what the kernel does for the connections that it handles. Reviewed by: glebius@ Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D29755
-
- 20 Apr, 2021 1 commit
-
-
Hans Petter Selasky authored
This change makes the TCP LRO code more generic and flexible with regards to supporting multiple different TCP encapsulation protocols and in general lays the ground for broader TCP LRO support. The main job of the TCP LRO code is to merge TCP packets for the same flow, to reduce the number of calls to upper layers. This reduces CPU and increases performance, due to being able to send larger TSO offloaded data chunks at a time. Basically the TCP LRO makes it possible to avoid per-packet interaction by the host CPU. Because the current TCP LRO code was tightly bound and optimized for TCP/IP over ethernet only, several larger changes were needed. Also a minor bug was fixed in the flushing mechanism for inactive entries, where the expire time, "le->mtime" was not always properly set. To avoid having to re-run time consuming regression tests for every change, it was chosen to squash the following list of changes into a single commit: - Refactor parsing of all address information into the "lro_parser" structure. This easily allows to reuse parsing code for inner headers. - Speedup header data comparison. Don't compare field by field, but instead use an unsigned long array, where the fields get packed. - Refactor the IPv4/TCP/UDP checksum computations, so that they may be computed recursivly, only applying deltas as the result of updating payload data. - Make smaller inline functions doing one operation at a time instead of big functions having repeated code. - Refactor the TCP ACK compression code to only execute once per TCP LRO flush. This gives a minor performance improvement and keeps the code simple. - Use sbintime() for all time-keeping. This change also fixes flushing of inactive entries. - Try to shrink the size of the LRO entry, because it is frequently zeroed. - Removed unused TCP LRO macros. - Cleanup unused TCP LRO statistics counters while at it. - Try to use __predict_true() and predict_false() to optimise CPU branch predictions. Bump the __FreeBSD_version due to changing the "lro_ctrl" structure. Tested by: Netflix Reviewed by: rrs (transport) Differential Revision: https://reviews.freebsd.org/D29564 MFC after: 2 week Sponsored by: Mellanox Technologies // NVIDIA Networking
-
- 18 Apr, 2021 1 commit
-
-
Michael Tuexen authored
Adding support for TCP over UDP allows communication with TCP stacks which can be implemented in userspace without requiring special priviledges or specific support by the OS. This is joint work with rrs. Reviewed by: rrs Sponsored by: Netflix, Inc. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D29469
-
- 16 Apr, 2021 1 commit
-
-
Gleb Smirnoff authored
-
- 12 Apr, 2021 1 commit
-
-
Gleb Smirnoff authored
When packet is a SYN packet, we don't need to modify any existing PCB. Normally SYN arrives on a listening socket, we either create a syncache entry or generate syncookie, but we don't modify anything with the listening socket or associated PCB. Thus create a new PCB lookup mode - rlock if listening. This removes the primary contention point under SYN flood - the listening socket PCB. Sidenote: when SYN arrives on a synchronized connection, we still don't need write access to PCB to send a challenge ACK or just to drop. There is only one exclusion - tcptw recycling. However, existing entanglement of tcp_input + stacks doesn't allow to make this change small. Consider this patch as first approach to the problem. Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D29576
-
- 17 Feb, 2021 1 commit
-
-
Randall Stewart authored
a further CPU enhancements for compressed acks. These are acks that are compressed into an mbuf. The transport has to be aware of how to process these, and an upcoming update to rack will do so. You need the rack changes to actually test and validate these since if the transport does not support mbuf compression, then the old code paths stay in place. We do in this commit take out the concept of logging if you don't have a lock (which was quite dangerous and was only for some early debugging but has been left in the code). Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D28374
-
- 20 Jan, 2021 1 commit
-
-
Richard Scheffenegger authored
Summary: When using the base stack in conjunction with RACK, it appears that infrequently, ++tp->t_dupacks is instantly larger than tcprexmtthresh. This leaves the recover flightsize (sackhint.recover_fs) uninitialized, leading to a div/0 panic. Address this by properly initializing the variable just prior to first use, if it is not properly initialized. In order to prevent stale information from a prior recovery to negatively impact the PRR calculations in this event, also clear recover_fs once loss recovery is finished. Finally, improve the readability of the initialization of recover_fs when t_dupacks == tcprexmtthresh by adjusting the indentation and using the max(1, snd_nxt - snd_una) macro. Reviewers: rrs, kbowling, tuexen, jtl, #transport, gnn!, jmg, manu, #manpages Reviewed By: rrs, kbowling, #transport Subscribers: bdrewery, andrew, rpokala, ae, emaste, bz, bcran, #linuxkpi, imp, melifaro Differential Revision: https://reviews.freebsd.org/D28114
-
- 14 Jan, 2021 1 commit
-
-
Michael Tuexen authored
When timestamp support has been negotiated, TCP segements received without a timestamp should be discarded. However, there are broken TCP implementations (for example, stacks used by Omniswitch 63xx and 64xx models), which send TCP segments without timestamps although they negotiated timestamp support. This patch adds a sysctl variable which tolerates such TCP segments and allows to interoperate with broken stacks. Reviewed by: jtl@, rscheff@ Differential Revision: https://reviews.freebsd.org/D28142 Sponsored by: Netflix, Inc. PR: 252449 MFC after: 1 week
-
- 29 Oct, 2020 1 commit
-
-
John Baldwin authored
Reviewed by: gallatin, gnn Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D26875
-
- 09 Oct, 2020 1 commit
-
-
Richard Scheffenegger authored
Extend netstat to display TCP stack and detailed congestion state Adding the "-c" option used to show detailed per-connection congestion control state for TCP sessions. This is one summary patch, which adds the relevant variables into xtcpcb. As previous "spare" space is used, these changes are ABI compatible. Reviewed by: tuexen MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D26518
-
- 25 Sep, 2020 1 commit
-
-
Richard Scheffenegger authored
The fastpath in tcp_output tries to send out full segments, and avoid sending partial segments by comparing against the static t_maxseg variable. That value does not consider tcp options like timestamps, while the initial window calculation is using the correct dynamic tcp_maxseg() function. Due to this interaction, the last, full size segment is considered too short and not sent out immediately. Reviewed by: tuexen MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D26478
-
- 13 Sep, 2020 1 commit
-
-
Michael Tuexen authored
and netstat. Reviewed by: rscheff MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D26412
-
- 01 Sep, 2020 1 commit
-
-
Mateusz Guzik authored
-
- 31 Jul, 2020 1 commit
-
-
Randall Stewart authored
back from the end of the function created an issue. If one of the routines returns NULL during setup we have inp's with extra references (which is why the increment was at the end). Also the stack switch return code was being ignored and actually has meaning if the stack cannot take over it should return NULL. Fix both of these situation by being sure to test the return code and of course in any case of return NULL (there are 3) make sure we properly reduce the ref count. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D25903
-
- 07 Jul, 2020 1 commit
-
-
Richard Scheffenegger authored
While testing with system default cc set to cubic, and running a memory exhaustion validation, FreeBSD panics for a missing inpcb reference / lock. Reviewed by: rgrimes (mentor), tuexen (mentor) Approved by: rgrimes (mentor), tuexen (mentor) MFC after: 3 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D25583
-
- 28 May, 2020 1 commit
-
-
Alexander V. Chernikov authored
fib[46]_lookup_nh_ represents pre-epoch generation of fib api, providing less guarantees over pointer validness and requiring on-stack data copying. Conversion is straight-forwarded, as the only 2 differences are requirement of running in network epoch and the need to handle RTF_GATEWAY case in the caller code. Differential Revision: https://reviews.freebsd.org/D24974
-
- 27 Apr, 2020 1 commit
-
-
Randall Stewart authored
latest rack and bbr in from the NF repo. When those come in the OOB data handling will be fixed where Skyzaller crashes. Differential Revision: https://reviews.freebsd.org/D24575
-
- 25 Apr, 2020 1 commit
-
-
Alexander V. Chernikov authored
This change is build on top of nexthop objects introduced in r359823. Nexthops are separate datastructures, containing all necessary information to perform packet forwarding such as gateway interface and mtu. Nexthops are shared among the routes, providing more pre-computed cache-efficient data while requiring less memory. Splitting the LPM code and the attached data solves multiple long-standing problems in the routing layer, drastically reduces the coupling with outher parts of the stack and allows to transparently introduce faster lookup algorithms. Route caching was (re)introduced to minimise (slow) routing lookups, allowing for notably better performance for large TCP senders. Caching works by acquiring rtentry reference, which is protected by per-rtentry mutex. If the routing table is changed (checked by comparing the rtable generation id) or link goes down, cache record gets withdrawn. Nexthops have the same reference counting interface, backed by refcount(9). This change merely replaces rtentry with the actual forwarding nextop as a cached object, which is mostly mechanical. Other moving parts like cache cleanup on rtable change remains the same. Differential Revision: https://reviews.freebsd.org/D24340
-
- 26 Feb, 2020 1 commit
-
-
Pawel Biernacki authored
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are still not MPSAFE (or already are but aren’t properly marked). Use it in preparation for a general review of all nodes. This is non-functional change that adds annotations to SYSCTL_NODE and SYSCTL_PROC nodes using one of the soon-to-be-required flags. Mark all obvious cases as MPSAFE. All entries that haven't been marked as MPSAFE before are by default marked as NEEDGIANT Approved by: kib (mentor, blanket) Commented by: kib, gallatin, melifaro Differential Revision: https://reviews.freebsd.org/D23718
-
- 12 Feb, 2020 1 commit
-
-
Randall Stewart authored
from any line. Sponsored by: Netflix Inc.
-