1. 02 Dec, 2021 1 commit
    • Remove "options PCBGROUP" · 93c67567
      Gleb Smirnoff authored
      With upcoming changes to the inpcb synchronisation it is going to be
      broken. Even its current status after the move of PCB synchronization
      to the network epoch is very questionable.
      
      This experimental feature was sponsored by Juniper but ended up never
      being used at Juniper and doesn't exist in their source tree [sjg@, stevek@,
      jtl@]. In the past (AFAIK, pre-epoch times) it was tried out at Netflix
      [gallatin@, rrs@] with no positive result and at Yandex [ae@, melifaro@].
      
      I'm up for resurrecting it if there is any interest from anybody.
      
      Reviewed by:		rrs
      Differential revision:	https://reviews.freebsd.org/D33020
  2. 18 Oct, 2021 3 commits
  3. 07 Jul, 2021 1 commit
    • tcp: HPTS performance enhancements · d7955cc0
      Randall Stewart authored
      HPTS drives both rack and bbr, and yet there have been many complaints
      about performance. This bit of work restructures hpts to help reduce CPU
      overhead. It no longer relies on the timer/callout to drive it; instead
      it uses the return to user space from a system call, as well as LRO flushes,
      to drive hpts. The timer becomes a backstop that dynamically adjusts
      based on how "late" we are.
      
      Reviewed by: tuexen, glebius
      Sponsored by: Netflix Inc.
      Differential Revision: https://reviews.freebsd.org/D31083
  4. 20 Apr, 2021 1 commit
    • tcp_input: always request read-locking of PCB for any pure SYN segment. · 1db08fbe
      Gleb Smirnoff authored
      This is further rework of 08d9c920.  Now we carry the knowledge of
      lock type all the way through tcp_input() and also into tcp_twcheck().
      Ideally the rlocking for pure SYNs should propagate all the way into
      the alternative TCP stacks, but not yet today.
      
      This should close a race when socket is bind(2)-ed but not yet
      listen(2)-ed and a SYN-packet arrives racing with listen(2), discovered
      recently by pho@.
  5. 12 Apr, 2021 1 commit
    • tcp_input/syncache: acquire only read lock on PCB for SYN,!ACK packets · 08d9c920
      Gleb Smirnoff authored
      When packet is a SYN packet, we don't need to modify any existing PCB.
      Normally SYN arrives on a listening socket, we either create a syncache
      entry or generate syncookie, but we don't modify anything with the
      listening socket or associated PCB. Thus create a new PCB lookup
      mode - rlock if listening. This removes the primary contention point
      under SYN flood - the listening socket PCB.
      
      Sidenote: when SYN arrives on a synchronized connection, we still
      don't need write access to PCB to send a challenge ACK or just to
      drop. There is only one exclusion - tcptw recycling. However,
      the existing entanglement of tcp_input + stacks doesn't allow this
      change to be small. Consider this patch a first approach to the problem.
      
      Reviewed by:	rrs
      Differential revision:	https://reviews.freebsd.org/D29576
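The lookup-mode decision described in this commit can be sketched as a small model. This is not the kernel code: `is_pure_syn`, `lookup_lock_mode`, and the string return values are hypothetical names chosen for illustration. Only the flag bit values are taken from the standard TCP header layout.

```python
# TCP header flag bits (standard values from the TCP header layout).
TH_FIN = 0x01
TH_SYN = 0x02
TH_RST = 0x04
TH_ACK = 0x10


def is_pure_syn(thflags: int) -> bool:
    """Return True for a segment with SYN set and ACK clear ("SYN,!ACK")."""
    return (thflags & (TH_SYN | TH_ACK)) == TH_SYN


def lookup_lock_mode(thflags: int, pcb_is_listening: bool) -> str:
    """Pick a PCB lock mode for an incoming segment (hypothetical model).

    A pure SYN that lands on a listening PCB only creates a syncache entry
    or a syncookie, so read access is enough. Everything else keeps the
    write lock in this simplified model.
    """
    if is_pure_syn(thflags) and pcb_is_listening:
        return "rlock"
    return "wlock"
```

In this model a SYN flood against a listener takes only read locks, which matches the stated goal of removing contention on the listening socket PCB.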
  6. 17 Feb, 2021 1 commit
    • Update the LRO processing code so that we can support · 69a34e8d
      Randall Stewart authored
      further CPU enhancements for compressed acks. These
      are acks that are compressed into an mbuf. The transport
      has to be aware of how to process these, and an upcoming
      update to rack will do so. You need the rack changes
      to actually test and validate these since if the transport
      does not support mbuf compression, then the old code paths
      stay in place. We do in this commit take out the concept
      of logging if you don't have a lock (which was quite
      dangerous and was only for some early debugging but has
      been left in the code).
      
      Sponsored by: Netflix Inc.
      Differential Revision: https://reviews.freebsd.org/D28374
  7. 19 Dec, 2020 1 commit
    • Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain · a034518a
      Andrew Gallatin authored
      In order to efficiently serve web traffic on a NUMA
      machine, one must avoid as many NUMA domain crossings as
      possible. With SO_REUSEPORT_LB, a number of workers can share a
      listen socket. However, even if a worker sets affinity to a core
      or set of cores on a NUMA domain, it will receive connections
      associated with all NUMA domains in the system. This will lead to
      cross-domain traffic when the server writes to the socket or
      calls sendfile(), and memory is allocated on the server's local
      NUMA node, but transmitted on the NUMA node associated with the
      TCP connection. Similarly, when the server reads from the socket,
      it will likely be reading memory allocated on the NUMA domain
      associated with the TCP connection.
      
      This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A
      server can now tell the kernel to filter traffic so that only
      incoming connections associated with the desired NUMA domain are
      given to the server. (Of course, in the case where there are no
      servers sharing the listen socket on some domain, then as a
      fallback, traffic will be hashed as normal to all servers sharing
      the listen socket regardless of domain). This allows a server to
      deal only with traffic that is local to its NUMA domain, and
      avoids cross-domain traffic in most cases.
      
      This patch, and a corresponding small patch to nginx to use
      TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted
      https media content from dual-socket Xeons with only 13% (as
      measured by pcm.x) cross domain traffic on the memory controller.
      
      Reviewed by:	jhb, bz (earlier version), bcr (man page)
      Tested by: gonzo
      Sponsored by:	Netflix
      Differential Revision:	https://reviews.freebsd.org/D21636
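The filtering and fallback behaviour described in this commit can be modelled in a few lines. This is an illustrative sketch, not the kernel implementation: `Worker`, `pick_worker`, and the modulo-based selection are assumptions made for the example.

```python
from typing import NamedTuple, Sequence


class Worker(NamedTuple):
    """A server sharing the listen socket (hypothetical model type)."""
    name: str
    numa_domain: int


def pick_worker(workers: Sequence[Worker], flow_hash: int,
                conn_domain: int) -> Worker:
    """Choose a listener for a new connection (illustrative model only)."""
    # Prefer workers that asked for connections from this NUMA domain.
    local = [w for w in workers if w.numa_domain == conn_domain]
    # Fallback from the commit message: if nobody serves this domain,
    # hash across every worker sharing the listen socket.
    candidates = local if local else list(workers)
    return candidates[flow_hash % len(candidates)]
```

With workers on domains 0, 0, and 1, a connection arriving on domain 1 always goes to the single domain-1 worker, while a connection on an unserved domain 2 is spread over all three.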
  8. 29 Oct, 2020 1 commit
  9. 09 Oct, 2020 1 commit
    • Add IP(V6)_VLAN_PCP to set 802.1 priority per-flow. · 868aabb4
      Richard Scheffenegger authored
      This adds a new IPPROTO_IP / IPPROTO_IPV6 setsockopt (getsockopt)
      option IP(V6)_VLAN_PCP, which can be set to -1 (interface
      default), or explicitly to any priority between 0 and 7.
      
      Note that for untagged traffic, explicitly adding a
      priority will insert a special 802.1Q vlan header with
      vlan ID = 0 to carry the priority setting.
      
      Reviewed by:	gallatin, rrs
      MFC after:	2 weeks
      Sponsored by:	NetApp, Inc.
      Differential Revision:	https://reviews.freebsd.org/D26409
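To make the "priority-only" tag concrete, here is a sketch of how an 802.1Q tag with VLAN ID 0 carries a priority value. It follows the standard tag layout (16-bit TPID 0x8100, then a 16-bit TCI made of a 3-bit PCP, a 1-bit DEI, and a 12-bit VLAN ID). The function name `build_vlan_tag` is hypothetical, and this is not how the kernel builds the header.

```python
import struct

TPID_8021Q = 0x8100  # EtherType value that marks an 802.1Q tag


def build_vlan_tag(pcp: int, vlan_id: int = 0, dei: int = 0) -> bytes:
    """Pack a 4-byte 802.1Q tag: 16-bit TPID followed by 16-bit TCI."""
    if not 0 <= pcp <= 7:
        raise ValueError("PCP must be in 0..7")
    if not 0 <= vlan_id <= 0xFFF:
        raise ValueError("VLAN ID must fit in 12 bits")
    if dei not in (0, 1):
        raise ValueError("DEI must be 0 or 1")
    # TCI layout: PCP (3 bits) | DEI (1 bit) | VLAN ID (12 bits).
    tci = (pcp << 13) | (dei << 12) | vlan_id
    # Network byte order, two unsigned 16-bit fields.
    return struct.pack("!HH", TPID_8021Q, tci)
```

Calling `build_vlan_tag(5)` yields a tag whose VLAN ID is 0, so only the priority bits carry information, which is the case the commit message describes for untagged traffic.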
  10. 18 May, 2020 1 commit
    • Allow TCP to reuse local port with different destinations · 25102351
      Mike Karels authored
      Previously, tcp_connect() would bind a local port before connecting,
      forcing the local port to be unique across all outgoing TCP connections
      for the address family. Instead, choose a local port after selecting
      the destination and the local address, requiring only that the tuple
      is unique and does not match a wildcard binding.
      
      Reviewed by:	tuexen (rscheff, rrs previous version)
      MFC after:	1 month
      Sponsored by:	Forcepoint LLC
      Differential Revision:	https://reviews.freebsd.org/D24781
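The selection rule described in this commit (a port is acceptable if the full tuple is unique and the port does not match a wildcard binding) can be sketched as follows. The names `choose_local_port`, `in_use`, and `wildcard_ports` are hypothetical, and the real kernel logic handles many more cases than this model.

```python
from typing import Iterable, Optional, Set, Tuple

# (local address, local port, foreign address, foreign port)
FourTuple = Tuple[str, int, str, int]


def choose_local_port(in_use: Set[FourTuple], wildcard_ports: Set[int],
                      laddr: str, faddr: str, fport: int,
                      candidates: Iterable[int]) -> Optional[int]:
    """Return the first candidate port that keeps the 4-tuple unique."""
    for lport in candidates:
        if lport in wildcard_ports:
            continue  # would match a wildcard binding on this port
        if (laddr, lport, faddr, fport) in in_use:
            continue  # this exact connection tuple already exists
        return lport
    return None  # no usable port among the candidates
```

Under this rule a local port that is already used for a connection to one destination can be chosen again for a different destination, which is the behaviour change the commit introduces.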
  11. 12 Feb, 2020 1 commit
  12. 15 Jan, 2020 1 commit
  13. 12 Jan, 2020 1 commit
    • Fix race when accepting TCP connections. · fe1274ee
      Michael Tuexen authored
      When expanding a SYN-cache entry to a socket/inp a two step approach was
      taken:
      1) The local address was filled in, then the inp was added to the hash
         table.
      2) The remote address was filled in and the inp was relocated in the
         hash table.
      Before the epoch changes, a write lock was held when this happens and
      the code looking up entries was holding a corresponding read lock.
      Since the read lock went away after the introduction of the
      epochs, the half populated inp was found during lookup.
      This resulted in processing TCP segments in the context of the wrong
      TCP connection.
      This patch changes the above procedure so that the inp is fully
      populated before being inserted into the hash table.
      
      Thanks to Paul <devgs@ukr.net> for reporting the issue on the net@
      mailing list and for testing the patch!
      
      Reviewed by:		rrs@
      MFC after:		1 week
      Sponsored by:		Netflix, Inc.
      Differential Revision:	https://reviews.freebsd.org/D22971
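The window described in this commit can be reproduced in a toy model. A plain dictionary stands in for the hash table, and a lookup tries an exact 4-tuple match before falling back to an entry with no foreign address. All function names here are hypothetical, and the model ignores locking and epochs entirely; it only shows why a half-populated entry is dangerous once readers are no longer excluded.

```python
from typing import Dict, Optional, Tuple

# (laddr, lport, faddr, fport); the foreign parts are None until filled in.
Key = Tuple[str, int, Optional[str], Optional[int]]


def lookup(table: Dict[Key, str], laddr: str, lport: int,
           faddr: str, fport: int) -> Optional[str]:
    """Exact 4-tuple match first, then a wildcard-foreign entry."""
    exact = table.get((laddr, lport, faddr, fport))
    if exact is not None:
        return exact
    return table.get((laddr, lport, None, None))


def insert_two_step_begin(table: Dict[Key, str], name: str,
                          laddr: str, lport: int) -> None:
    """Old step 1: publish the entry with only the local address set."""
    table[(laddr, lport, None, None)] = name


def insert_two_step_finish(table: Dict[Key, str], name: str, laddr: str,
                           lport: int, faddr: str, fport: int) -> None:
    """Old step 2: fill in the remote address and relocate the entry."""
    del table[(laddr, lport, None, None)]
    table[(laddr, lport, faddr, fport)] = name


def insert_fully_populated(table: Dict[Key, str], name: str, laddr: str,
                           lport: int, faddr: str, fport: int) -> None:
    """Fixed approach: publish only once both addresses are set."""
    table[(laddr, lport, faddr, fport)] = name
```

Between `insert_two_step_begin` and `insert_two_step_finish`, a segment from any other peer to the same local address and port resolves to the new connection. With `insert_fully_populated` there is no such window.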
  14. 07 Nov, 2019 3 commits
  15. 02 Aug, 2019 1 commit
    • IPv6 cleanup: kernel · 0ecd976e
      Bjoern A. Zeeb authored
      Finish what was started a few years ago and harmonize IPv6 and IPv4
      kernel names.  We are down to very few places now, so it is feasible
      to do the change for everything remaining without causing too much disturbance.
      
      Remove "aliases" for IPv6 names which confusingly could indicate
      that we are talking about a different data structure or field or
      have two fields, one for each address family.
      Try to follow common conventions used in FreeBSD.
      
      * Rename sin6p to sin6 as that is how it is spelt in most places.
      * Remove "aliases" (#defines) for:
        - in6pcb which really is an inpcb and nothing separate
        - sotoin6pcb which is sotoinpcb (as per above)
        - in6p_sp which is inp_sp
        - in6p_flowinfo which is inp_flow
      * Try to use ia6 for in6_addr rather than in6p.
      * With all these gone, also rename the in6p variables to inp as
        that is what we call it in most of the network stack including
        parts of netinet6.
      
      The reasons behind this cleanup are that we try to further
      unify netinet and netinet6 code where possible and that people
      will be less likely to overlook one or the other protocol family when doing
      code changes as they may not have spotted places due to different
      names for the same thing.
      
      No functional changes.
      
      Discussed with:		tuexen (SCTP changes)
      MFC after:		3 months
      Sponsored by:		Netflix
  16. 01 Aug, 2019 1 commit
  17. 10 Jul, 2019 1 commit
  18. 25 Apr, 2019 1 commit
    • Track TCP connection's NUMA domain in the inpcb · 50575ce1
      Andrew Gallatin authored
      Drivers can now pass up numa domain information via the
      mbuf numa domain field.  This information is then used
      by TCP syncache_socket() to associate that information
      with the inpcb. The domain information is then fed back
      into transmitted mbufs in ip{6}_output(). This mechanism
      is nearly identical to what is done to track RSS hash values
      in the inp_flowid.
      
      Follow on changes will use this information for lacp egress
      port selection, binding TCP pacers to the appropriate NUMA
      domain, etc.
      
      Reviewed by:	markj, kib, slavash, bz, scottl, jtl, tuexen
      Sponsored by:	Netflix
      Differential Revision:	https://reviews.freebsd.org/D20028
  19. 09 Jan, 2019 1 commit
    • Mechanical cleanup of epoch(9) usage in network stack. · a68cc388
      Gleb Smirnoff authored
      - Remove macros that covertly create epoch_tracker on thread stack. Such
        macros are quite unsafe, e.g. they will produce buggy code if the same
        macro is used in nested scopes. Explicitly declare epoch_tracker always.
      
      - Unmask interface list IFNET_RLOCK_NOSLEEP(), interface address list
        IF_ADDR_RLOCK() and interface AF specific data IF_AFDATA_RLOCK() read
        locking macros to what they actually are - the net_epoch.
        Keeping them as is is very misleading. They all are named FOO_RLOCK(),
        while they no longer have lock semantics. Now they allow recursion and
        what's more important they now no longer guarantee protection against
        their companion WLOCK macros.
        Note: INP_HASH_RLOCK() has same problems, but not touched by this commit.
      
      This is a non-functional mechanical change. The only functionally changed
      functions are ni6_addrs() and ni6_store_addrs(), where we no longer enter
      epoch recursively.
      
      Discussed with:	jtl, gallatin
  20. 05 Dec, 2018 1 commit
  21. 01 Oct, 2018 1 commit
  22. 10 Sep, 2018 1 commit
    • Fix synchronization of LB group access. · 54af3d0d
      Mark Johnston authored
      Lookups are protected by an epoch section, so the LB group linkage must
      be a CK_LIST rather than a plain LIST.  Furthermore, we were not
      deferring LB group frees, so in_pcbremlbgrouphash() could race with
      readers and cause a use-after-free.
      
      Reviewed by:	sbruno, Johannes Lundberg <johalun0@gmail.com>
      Tested by:	gallatin
      Approved by:	re (gjb)
      Sponsored by:	The FreeBSD Foundation
      Differential Revision:	https://reviews.freebsd.org/D17031
  23. 06 Sep, 2018 1 commit
    • The inp_lle field to struct inpcb, along with two "valid" flags · 113c4fad
      Bjoern A. Zeeb authored
      for the rt and lle cache were added in r191129 (2009).
      To the best of my knowledge they have never been used and route caching
      has converted the inp_rt field from that commit to inp_route
      rendering this field and these flags obsolete.
      
      Convert the pointer into a spare pointer to not change the size of
      the structure anymore (and to have a spare pointer) and mark the
      two fields as unused.
      
      Reviewed by:	markj, karels
      Approved by:	re (gjb)
      Differential Revision:	https://reviews.freebsd.org/D17062
  24. 21 Aug, 2018 1 commit
  25. 20 Aug, 2018 1 commit
  26. 04 Aug, 2018 1 commit
  27. 05 Jul, 2018 1 commit
    • Make struct xinpcb and friends word-size independent. · f38b68ae
      Brooks Davis authored
      Replace size_t members with ksize_t (uint64_t) and pointer members
      (never used as pointers in userspace, but instead as unique
      identifiers) with kvaddr_t (uint64_t). This makes the structs
      identical between 32-bit and 64-bit ABIs.
      
      On 64-bit systems, the ABI is maintained. On 32-bit systems,
      this is an ABI breaking change. The ABI of most of these structs
      was previously broken in r315662.  This also imposes a small API
      change on userspace consumers who must handle kernel pointers
      becoming virtual addresses.
      
      PR:		228301 (exp-run by antoine)
      Reviewed by:	jtl, kib, rwatson (various versions)
      Sponsored by:	DARPA, AFRL
      Differential Revision:	https://reviews.freebsd.org/D15386
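The effect of widening `size_t` and pointer members to fixed 64-bit types can be illustrated with `ctypes`. The two structures below are hypothetical stand-ins, not the real `struct xinpcb` layout. They only show that the fixed-width version has the same size on every ABI, while the native version follows the platform word size.

```python
import ctypes


class NativeLayout(ctypes.Structure):
    """Hypothetical struct using word-size dependent member types."""
    _fields_ = [("len", ctypes.c_size_t), ("ptr", ctypes.c_void_p)]


class FixedLayout(ctypes.Structure):
    """Same members widened to 64 bits, mirroring ksize_t / kvaddr_t."""
    _fields_ = [("len", ctypes.c_uint64), ("ptr", ctypes.c_uint64)]


def layout_sizes() -> tuple:
    """Return (native size, fixed size) in bytes for the current ABI."""
    return ctypes.sizeof(NativeLayout), ctypes.sizeof(FixedLayout)
```

On a 64-bit ABI both sizes agree, which corresponds to the commit's note that the ABI is maintained there. On a 32-bit ABI the native layout is smaller, which is why the change breaks the ABI only on 32-bit systems.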
  28. 04 Jul, 2018 1 commit
    • epoch(9): allow preemptible epochs to compose · 6573d758
      Matt Macy authored
      - Add tracker argument to preemptible epochs
      - Inline epoch read path in kernel and tied modules
      - Change in_epoch to take an epoch as argument
      - Simplify tfb_tcp_do_segment to not take a ti_locked argument,
        there's no longer any benefit to dropping the pcbinfo lock
        and trying to do so just adds an error prone branchfest to
        these functions
      - Remove cases of same function recursion on the epoch as
        recursing is no longer free.
      - Remove the TAILQ_ENTRY and epoch_section from struct
        thread as the tracker field is now stack or heap allocated
        as appropriate.
      
      Tested by: pho and Limelight Networks
      Reviewed by: kbowling at llnw dot com
      Sponsored by: Limelight Networks
      Differential Revision: https://reviews.freebsd.org/D16066
  29. 19 Jun, 2018 1 commit
    • convert inpcbinfo hash and info rwlocks to epoch + mutex · 9e58ff6f
      Matt Macy authored
      - Convert inpcbinfo info & hash locks to epoch for read and mutex for write
      - Garbage collect code that handled INP_INFO_TRY_RLOCK failures as
        INP_INFO_RLOCK which can no longer fail
      
      When running 64 netperfs sending minimal sized packets on a 2x8x2, this
      reduces unhalted core cycles samples in rwlock rlock/runlock in udp_send
      from 51% to 3%.
      
      Overall packet throughput rate limited by CPU affinity and NIC driver design
      choices.
      
      On the receiver unhalted core cycles samples in in_pcblookup_hash went from
      13% to 1.6%.
      
      Tested by LLNW and pho@
      
      Reviewed by: jtl
      Sponsored by: Limelight Networks
      Differential Revision: https://reviews.freebsd.org/D15686
  30. 13 Jun, 2018 1 commit
  31. 12 Jun, 2018 3 commits
  32. 06 Jun, 2018 1 commit
    • Load balance sockets with new SO_REUSEPORT_LB option. · 1a43cff9
      Sean Bruno authored
      This patch adds a new socket option, SO_REUSEPORT_LB, which allows multiple
      programs or threads to bind to the same port; incoming connections are
      load balanced using a hash function.
      
      Most of the code was copied from a similar patch for DragonflyBSD.
      
      However, in DragonflyBSD, load balancing is a global on/off setting and can not
      be set per socket. This patch allows for simultaneous use of both the current
      SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system.
      
      Required changes to structures:
      Globally change so_options from 16 to 32 bit value to allow for more options.
      Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets.
      
      Limitations:
      As in DragonflyBSD, a load balance group is limited to 256 pcbs (256 programs or
      threads sharing the same socket).
      
      This is a substantially different contribution as compared to its original
      incarnation at svn r332894 and reverted at svn r332967.  Thanks to rwatson@
      for the substantive feedback that is included in this commit.
      
      Submitted by:	Johannes Lundberg <johalun0@gmail.com>
      Obtained from:	DragonflyBSD
      Relnotes:	Yes
      Sponsored by:	Limelight Networks
      Differential Revision:	https://reviews.freebsd.org/D11003
  33. 21 May, 2018 1 commit
  34. 20 May, 2018 1 commit