1. 24 Apr, 2021 1 commit
    • Allow the tcp_lro_flush_all() function to be called when the control · a9b66dbd
      Hans Petter Selasky authored
      structure is zeroed, by setting the VNET after checking the mbuf count
      for zero. It appears there are some cases with early interrupts on some
      network devices which still trigger page-faults on accessing a NULL "ifp"
      pointer before the TCP LRO control structure has been initialized.
      This basically preserves the old behaviour, prior to 9ca874cf.
      
      No functional change.
      
      Reported by:	rscheff@
      Differential Revision:	https://reviews.freebsd.org/D29564
      MFC after:	2 weeks
      Sponsored by:	Mellanox Technologies // NVIDIA Networking
  2. 20 Apr, 2021 1 commit
    • Add TCP LRO support for VLAN and VxLAN. · 9ca874cf
      Hans Petter Selasky authored
      This change makes the TCP LRO code more generic and flexible with regards
      to supporting multiple different TCP encapsulation protocols and in general
      lays the ground for broader TCP LRO support. The main job of the TCP LRO code is
      to merge TCP packets for the same flow, to reduce the number of calls to upper
      layers. This reduces CPU usage and increases performance, due to being able to
      send larger TSO offloaded data chunks at a time. Basically, TCP LRO makes it
      possible to avoid per-packet interaction by the host CPU.
      
      Because the current TCP LRO code was tightly bound and optimized for TCP/IP
      over ethernet only, several larger changes were needed. A minor bug was also
      fixed in the flushing mechanism for inactive entries, where the expire time,
      "le->mtime", was not always properly set.
      
      To avoid having to re-run time consuming regression tests for every change,
      it was chosen to squash the following list of changes into a single commit:
      - Refactor parsing of all address information into the "lro_parser" structure.
        This makes it easy to reuse the parsing code for inner headers.
      - Speedup header data comparison. Don't compare field by field, but
        instead use an unsigned long array, where the fields get packed.
      - Refactor the IPv4/TCP/UDP checksum computations, so that they may be computed
        recursively, only applying deltas as the result of updating payload data.
      - Make smaller inline functions doing one operation at a time instead of
        big functions having repeated code.
      - Refactor the TCP ACK compression code to only execute once
        per TCP LRO flush. This gives a minor performance improvement and
        keeps the code simple.
      - Use sbintime() for all time-keeping. This change also fixes flushing
        of inactive entries.
      - Try to shrink the size of the LRO entry, because it is frequently zeroed.
      - Remove unused TCP LRO macros.
      - Clean up unused TCP LRO statistics counters while at it.
      - Try to use __predict_true() and __predict_false() to optimise CPU branch
        predictions.
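      The packed header comparison mentioned above can be sketched roughly as
      follows. This is an illustrative sketch only: the word count, struct name
      and field layout are assumptions, not the actual FreeBSD structures.

```c
#include <stdbool.h>

/*
 * Sketch: pack the header fields to be compared into a fixed array of
 * unsigned longs, so that flow comparison is a short loop over machine
 * words instead of many field-by-field tests. (Hypothetical layout.)
 */
#define	LRO_RAW_WORDS	4

struct flow_key {
	unsigned long raw[LRO_RAW_WORDS];	/* packed header fields */
};

static bool
flow_key_equal(const struct flow_key *a, const struct flow_key *b)
{
	unsigned long diff = 0;
	int i;

	/* OR the word-wise differences; zero means identical keys. */
	for (i = 0; i < LRO_RAW_WORDS; i++)
		diff |= a->raw[i] ^ b->raw[i];
	return (diff == 0);
}
```

      Comparing whole words also avoids data-dependent early exits, which
      tends to be friendlier to the branch predictor on the hot path.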
      
      Bump the __FreeBSD_version due to changing the "lro_ctrl" structure.
      
      Tested by:	Netflix
      Reviewed by:	rrs (transport)
      Differential Revision:	https://reviews.freebsd.org/D29564
      MFC after:	2 weeks
      Sponsored by:	Mellanox Technologies // NVIDIA Networking
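      The delta-based ("recursive") checksum update described above follows
      the standard incremental one's-complement arithmetic of RFC 1624. A
      minimal sketch, with hypothetical function names rather than the actual
      tcp_lro helpers:

```c
#include <stdint.h>

/* Fold a 32-bit sum into 16 bits, one's-complement style. */
static uint16_t
csum_fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return ((uint16_t)sum);
}

/*
 * Update an existing checksum when a 16-bit field changes from "oval"
 * to "nval" (RFC 1624, eqn. 3): HC' = ~(~HC + ~m + m'). Only the delta
 * is applied; the unchanged payload is never re-summed.
 */
static uint16_t
csum_update16(uint16_t csum, uint16_t oval, uint16_t nval)
{
	uint32_t sum;

	sum = (uint32_t)(uint16_t)~csum +
	    (uint32_t)(uint16_t)~oval + (uint32_t)nval;
	return ((uint16_t)~csum_fold(sum));
}
```

      Updating a field to its current value leaves the checksum unchanged,
      which is the property that makes such updates safely composable across
      the IPv4, TCP, and UDP headers.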
  3. 04 Mar, 2021 1 commit
  4. 18 Feb, 2021 2 commits
  5. 17 Feb, 2021 2 commits
    • Add ifdef TCPHPTS around build_ack_entry and do_bpf_and_csum to avoid · ab4fad4b
      Randall Stewart authored
      warnings when HPTS is not included
      
      Thanks to Gary Jennejohn for pointing this out.
    • Update the LRO processing code so that we can support · 69a34e8d
      Randall Stewart authored
      further CPU enhancements for compressed acks. These
      are acks that are compressed into an mbuf. The transport
      has to be aware of how to process these, and an upcoming
      update to rack will do so. You need the rack changes
      to actually test and validate these, since if the transport
      does not support mbuf compression, the old code paths
      stay in place. This commit also removes the concept of
      logging without holding a lock, which was quite dangerous
      and had only been left in the code from some early
      debugging.
      
      Sponsored by: Netflix Inc.
      Differential Revision: https://reviews.freebsd.org/D28374
  6. 01 Sep, 2020 1 commit
  7. 12 Feb, 2020 1 commit
  8. 07 Nov, 2019 1 commit
  9. 09 Oct, 2019 1 commit
    • Fix casting error from newer gcc · b23b156e
      Warner Losh authored
      Cast the pointers to (uintptr_t) before assigning to type
      uint64_t. This eliminates an error from gcc when we cast the pointer
      to a larger integer type.
  10. 06 Oct, 2019 1 commit
    • Brad Davis identified a problem with the new LRO code: VLANs · 5b63b220
      Randall Stewart authored
      no longer worked. The problem was that the defines used the
      same space as the VLAN id. This commit does three things:
      1) Move the LRO used fields to the PH_per fields. This is
         safe since the entire PH_per is used for IP reassembly,
         which the LRO code will not hit.
      2) Remove old pace fields that are no longer used in mbuf.h.
      3) The VLAN processing is not in the mbuf queueing code.
         Consequently, if a VLAN submits to Rack or BBR we need to
         bypass the mbuf queueing for now, until rack_bbr_common is
         updated to handle the VLAN properly.
      
      Reported by:	Brad Davis
  11. 06 Sep, 2019 2 commits
    • Fix build after r351934 · 373013b0
      Conrad Meyer authored
      tcp_queue_pkts() is only used if TCPHPTS is defined (and it is not by
      default).
      
      Reported by:	gcc
    • This adds the final tweaks to LRO that will now allow me · e57b2d0e
      Randall Stewart authored
      to add BBR. These changes make it so you can get an
      array of timestamps instead of a compressed ack/data segment.
      BBR uses this to aid with its delivery estimates. We also
      now (via Drew's suggestions) will not go to the expense of
      the tcb lookup if no stack registers interest in this feature. If
      HPTS is not present, the feature is not present either, and you
      just get the compressed behavior.
      
      Sponsored by:	Netflix Inc
      Differential Revision: https://reviews.freebsd.org/D21127
  12. 09 Mar, 2018 1 commit
    • Update tcp_lro with tested bugfixes from Netflix and LLNW: · d7fb35d1
      Sean Bruno authored
          rrs - Make the LRO code recognize true dup-acks, and let window
                update acks fly on through and combine.
          rrs - Make the LRO engine a bit more aware of ack-only seq space.
                Don't let it incorrectly wipe out newer acks for older acks
                when we have out-of-order acks (common in wifi environments).
          jeggleston - LRO eating window updates
      
      Based on all of the above I think we are RFC compliant doing it this way:
      
      https://tools.ietf.org/html/rfc1122
      
      section 4.2.2.16
      
      "Note that TCP has a heuristic to select the latest window update despite
      possible datagram reordering; as a result, it may ignore a window update with
      a smaller window than previously offered if neither the sequence number nor the
      acknowledgment number is increased."
      
      Submitted by:	Kevin Bowling <kevin.bowling@kev009.com>
      Reviewed by:	rstone gallatin
      Sponsored by:	NetFlix and Limelight Networks
      Differential Revision:	https://reviews.freebsd.org/D14540
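      The RFC 1122 heuristic quoted above can be sketched as follows. This is
      an illustrative sketch only, not the actual tcp_lro merge logic; the
      struct and function names are made up.

```c
#include <stdint.h>

/* Standard TCP modular sequence-number comparison. */
#define	SEQ_GT(a, b)	((int32_t)((a) - (b)) > 0)

/* State kept for a pure ACK currently held by the (hypothetical) LRO entry. */
struct held_ack {
	uint32_t seq;
	uint32_t ack;
	uint16_t win;
};

/*
 * Merge a newly arrived pure ACK into the held one. Per RFC 1122
 * section 4.2.2.16, adopt the new window only when the sequence number
 * or the acknowledgment number advanced; a reordered, stale ACK must
 * not shrink the previously offered window.
 */
static void
merge_pure_ack(struct held_ack *held, uint32_t seq, uint32_t ack,
    uint16_t win)
{
	if (SEQ_GT(seq, held->seq) || SEQ_GT(ack, held->ack)) {
		/* Newer information: adopt sequence, ack, and window. */
		held->seq = seq;
		held->ack = ack;
		held->win = win;
	}
	/* Else: out-of-order ACK; ignore its (possibly smaller) window. */
}
```

      The modular comparison macro makes the check robust across sequence
      number wrap-around, which matters for long-lived flows.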
  13. 27 Nov, 2017 1 commit
    • sys: general adoption of SPDX licensing ID tags. · fe267a55
      Pedro F. Giffuni authored
      Mainly focus on files that use the BSD 2-Clause license; however, the tool I
      was using misidentified many licenses, so this was mostly a manual (and
      error prone) task.
      
      The Software Package Data Exchange (SPDX) group provides a specification
      to make it easier for automated tools to detect and summarize well known
      opensource licenses. We are gradually adopting the specification, noting
      that the tags are considered only advisory and do not, in any way,
      supersede or replace the license texts.
      
      No functional change intended.
  14. 24 Apr, 2017 2 commits
  15. 19 Apr, 2017 3 commits
  16. 25 Aug, 2016 1 commit
  17. 16 Aug, 2016 1 commit
  18. 05 Aug, 2016 1 commit
    • tcp/lro: If timestamps mismatch or it's a FIN, force flush. · b9ec6f0b
      Sepherosa Ziehau authored
      This keeps the segments/ACK/FIN delivery order.
      
      Before this patch, it was observed that if A sent a FIN immediately after
      an ACK, B would deliver the FIN first to the TCP stack, then the ACK.
      This out-of-order delivery caused B to send one unnecessary ACK.
      
      Reviewed by:	gallatin, hps
      Obtained from:  rrs, gallatin
      Sponsored by:	Netflix (rrs, gallatin), Microsoft (sephe)
      Differential Revision:	https://reviews.freebsd.org/D7415
  19. 02 Aug, 2016 1 commit
  20. 03 Jun, 2016 1 commit
    • Use insertion sort instead of bubble sort in TCP LRO. · ec668905
      Hans Petter Selasky authored
      Replacing the bubble sort with insertion sort gives an 80% reduction
      in runtime on average, with randomized keys, for small partitions.
      
      If the keys are pre-sorted, insertion sort runs in linear time, and
      even if the keys are reversed, insertion sort is faster than bubble
      sort, although not by much.
      
      Update comment describing "tcp_lro_sort()" while at it.
      
      Differential Revision:	https://reviews.freebsd.org/D6619
      Sponsored by:	Mellanox Technologies
      Tested by:	Netflix
      Suggested by:	Pieter de Goeje <pieter@degoeje.nl>
      Reviewed by:	ed, gallatin, gnn, transport
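      The replacement described above amounts to a textbook insertion sort
      over 64-bit keys. A minimal sketch (illustrative; not the actual
      tcp_lro_sort() code):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Insertion sort over 64-bit keys. On pre-sorted input the inner loop
 * never runs, giving linear time; each out-of-place key is shifted
 * right in one pass instead of being repeatedly swapped as in bubble
 * sort, which explains the runtime reduction cited in the commit.
 */
static void
insertion_sort64(uint64_t *a, size_t n)
{
	size_t i, j;
	uint64_t key;

	for (i = 1; i < n; i++) {
		key = a[i];
		/* Shift larger keys right until key's slot is found. */
		for (j = i; j > 0 && a[j - 1] > key; j--)
			a[j] = a[j - 1];
		a[j] = key;
	}
}
```

      In the LRO case the sort is applied to small partitions produced by
      the bit-partitioning pass, where inputs are often nearly sorted, which
      is exactly insertion sort's best case.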
  21. 26 May, 2016 1 commit
    • Use an optimised, complexity-safe sorting routine instead of the kernel's · fc271df3
      Hans Petter Selasky authored
      "qsort()".
      
      The kernel's "qsort()" routine can in the worst case spend O(N*N)
      comparisons before the input array is sorted. It can also recurse a
      significant number of times, using up the kernel's interrupt thread
      stack.
      
      The custom sorting routine takes advantage of the fact that the sorting
      key is only 64 bits. Based on set and cleared bits in the sorting key it
      partitions the array until it is sorted. This process has a recursion
      limit of 64 levels, due to the number of set and cleared bits which can
      occur. Compiled with -O2 the sorting routine was measured to use
      64 bytes of stack. Multiplying this by 64 gives a maximum stack
      consumption of 4096 bytes on AMD64. The same bound applies to the
      execution time: the array to be sorted will not be traversed more than
      64 times.
      
      When serving roughly 80Gb/s with 80K TCP connections, the old method
      consisting of "qsort()" and "tcp_lro_mbuf_compare_header()" used 1.4%
      CPU, while the new "tcp_lro_sort()" used 1.1% for LRO related sorting
      as measured by Intel Vtune. The testing was done using a sysctl to
      toggle between "qsort()" and "tcp_lro_sort()".
      
      Differential Revision:	https://reviews.freebsd.org/D6472
      Sponsored by:	Mellanox Technologies
      Tested by:	Netflix
      Reviewed by:	gallatin, rrs, sephe, transport
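      The bit-partitioning scheme described above is essentially a binary
      most-significant-digit radix sort. A minimal sketch (illustrative; not
      the actual tcp_lro_sort() implementation):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Partition the array on one bit of the 64-bit key, then recurse on
 * each half with the next lower bit. Recursion depth is bounded by the
 * 64 key bits, so both stack use and the number of array traversals
 * are bounded regardless of input distribution, unlike quicksort's
 * O(N*N) worst case.
 */
static void
bit_sort64(uint64_t *a, size_t n, uint64_t bit)
{
	size_t i, j;
	uint64_t tmp;

	if (n < 2 || bit == 0)
		return;
	/* Partition: keys with the bit clear first, keys with it set last. */
	for (i = 0, j = n; i < j;) {
		if ((a[i] & bit) == 0) {
			i++;
		} else {
			j--;
			tmp = a[i];
			a[i] = a[j];
			a[j] = tmp;
		}
	}
	bit_sort64(a, i, bit >> 1);		/* bit-clear half */
	bit_sort64(a + i, n - i, bit >> 1);	/* bit-set half */
}
```

      The initial call passes the top bit, e.g. bit_sort64(a, n, 1ULL << 63);
      each level of recursion consumes one key bit, which is where the
      64-level bound in the commit message comes from.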
  22. 03 May, 2016 1 commit
  23. 28 Apr, 2016 1 commit
  24. 27 Apr, 2016 1 commit
  25. 01 Apr, 2016 2 commits
  26. 25 Mar, 2016 1 commit
  27. 18 Feb, 2016 1 commit
  28. 11 Feb, 2016 1 commit
  29. 01 Feb, 2016 1 commit
  30. 19 Jan, 2016 1 commit
    • Add optimizing LRO wrapper: · e936121d
      Hans Petter Selasky authored
      - Add optimizing LRO wrapper which pre-sorts all incoming packets
        according to the hash type and flowid. This prevents exhaustion of
        the LRO entries due to too many connections at the same time.
        Testing using a larger number of higher bandwidth TCP connections
        showed that the incoming ACK packet aggregation rate increased from
        ~1.3:1 to almost 3:1. Another test showed that for a number of TCP
        connections greater than 16 per hardware receive ring, where 8 TCP
        connections was the LRO active entry limit, there was a significant
        improvement in throughput due to being able to fully aggregate more
        than 8 TCP streams. For very few, very high bandwidth TCP streams,
        the optimizing LRO wrapper will add CPU usage instead of reducing
        CPU usage. This is expected. Network drivers which want to use the
        optimizing LRO wrapper need to call "tcp_lro_queue_mbuf()" instead
        of "tcp_lro_rx()" and "tcp_lro_flush_all()" instead of
        "tcp_lro_flush()". Further, the LRO control structure must be
        initialized using "tcp_lro_init_args()", passing a non-zero number
        into the "lro_mbufs" argument.
      
      - Make LRO statistics 64-bit. Previously 32-bit integers were used for
        statistics, which are prone to wrap-around. Fix this while at it and
        update all SYSCTLs which expose LRO statistics.
      
      - Ensure all data is freed when destroying an LRO control structure,
        especially leftover LRO entries.
      
      - Reduce number of memory allocations needed when setting up a LRO
        control structure by precomputing the total amount of memory needed.
      
      - Add own memory allocation counter for LRO.
      
      - Bump the FreeBSD version to force recompilation of all KLDs due to
        change of the LRO control structure size.
      
      Sponsored by:	Mellanox Technologies
      Reviewed by:	gallatin, sbruno, rrs, gnn, transport
      Tested by:	Netflix
      Differential Revision:	https://reviews.freebsd.org/D4914
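      The idea behind the pre-sorting wrapper can be sketched with a compound
      sort key. The field widths and function name below are illustrative
      assumptions, not the actual tcp_lro layout: the point is only that
      sorting queued mbufs by (hash type, flowid, arrival order) groups
      packets of the same flow back-to-back while preserving their relative
      order, so one pass can fully aggregate each flow even when flows
      outnumber the active LRO entries.

```c
#include <stdint.h>

/*
 * Hypothetical 64-bit sort key for a queued mbuf:
 *   bits 56..63  RSS hash type
 *   bits 24..55  flowid
 *   bits  0..23  arrival sequence number (keeps per-flow order stable)
 */
static uint64_t
lro_sort_key(uint8_t hash_type, uint32_t flowid, uint32_t seqno)
{
	return (((uint64_t)hash_type << 56) |
	    ((uint64_t)flowid << 24) | (seqno & 0xffffff));
}
```

      Sorting by such a key is what makes the wrapper pay off for many
      concurrent flows and cost a little extra CPU for very few flows, as the
      commit message notes.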
  31. 30 Jun, 2015 1 commit
  32. 28 Aug, 2013 1 commit
    • Merge r254336 from user/np/cxl_tuning. · 7127e6ac
      Navdeep Parhar authored
      Add a last-modified timestamp to each LRO entry and provide an interface
      to flush all inactive entries.  Drivers decide when to flush and what
      the inactivity threshold should be.
      
      Network drivers that process an rx queue to completion can enter a
      livelock type situation when the rate at which packets are received
      reaches equilibrium with the rate at which the rx thread is processing
      them.  When this happens the final LRO flush (normally when the rx
      routine is done) does not occur.  Pure ACKs and segments with total
      payload < 64K can get stuck in an LRO entry.  Symptoms are that TCP
      tx-mostly connections' performance falls off a cliff during heavy,
      unrelated rx on the interface.
      
      Flushing only inactive LRO entries works better than any of these
      alternates that I tried:
      - don't LRO pure ACKs
      - flush _all_ LRO entries periodically (every 'x' microseconds or every
        'y' descriptors)
      - stop rx processing in the driver periodically and schedule remaining
        work for later.
      
      Reviewed by:	andre
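      The inactivity flush described above can be sketched as follows. This
      is a simplified, illustrative model (the struct and function names are
      made up; the real code tracks "le->mtime" per LRO entry and lets the
      driver pick the threshold):

```c
#include <stdint.h>

/* Hypothetical, simplified LRO entry: last-modified time plus a flag. */
struct lro_entry_sketch {
	int64_t	mtime;		/* last-modified time, driver units */
	int	active;		/* entry currently holds merged data */
};

/*
 * Flush only the entries that have not been touched within the
 * driver's inactivity threshold. Under continuous receive load the
 * final "rx done" flush may never run, so this is what prevents pure
 * ACKs and small segments from getting stuck in an LRO entry.
 */
static int
flush_inactive(struct lro_entry_sketch *e, int n, int64_t now,
    int64_t threshold)
{
	int i, flushed = 0;

	for (i = 0; i < n; i++) {
		if (e[i].active && now - e[i].mtime > threshold) {
			e[i].active = 0;	/* hand merged data up */
			flushed++;
		}
	}
	return (flushed);
}
```

      Flushing only the inactive entries, rather than all entries on a
      timer, keeps aggregation intact for the flows that are still receiving,
      which is why the commit found it beat the listed alternatives.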
  33. 21 Feb, 2013 1 commit