
Accessible and Quantitative Entangled Polymer Rheology Predictions, Suitable for Complex Flow Calculations

Marat Andreev†,‡ and Jay D. Schieber*,§,†,‡

†Department of Physics, ‡Center for Molecular Study of Condensed Soft Matter, and §Department of Chemical and Biological Engineering, Illinois Institute of Technology, Chicago, Illinois 60616, United States

ABSTRACT: Despite strong industrial interest, only recently have quantitative predictions for entangled polymer rheology become possible. Major advances have been achieved with an alternative approach to entanglement modeling: slip-links. For example, the discrete slip-link model (DSM) allows for equilibrium and high-flow-rate predictions with a strong connection to a molecular basis. Slip-link models are applicable to arbitrary composition and chain architectures, but usually incur high numerical cost. Recent advances in coarse-graining allowed for a speedup of order one hundred, but further improvements are required for the ultimate goal: simulations of nonhomogeneous flows. On the other hand, it is possible to rely on advanced computational devices to obtain additional speedup. GPUs are one of the most rapidly developing kinds of computational device and provide immense computational power, but they also have some critical restrictions on numerical operations. Since slip-link models have a unique level of description, our numerical implementations are designed from scratch, rather than using available modeling frameworks. In this work, we present a numerical algorithm for a variant of DSM, the clustered fixed slip-link model, fully implemented in CUDA. Because the model is a well-defined mathematical object, the completely new GPU-optimized algorithm produces results indistinguishable from those of the earlier CPU implementation. We believe the performance of the algorithm will allow simulations of complex, nonhomogeneous flows of entangled polymer melts and solutions on a relatively small GPU computational cluster. The single-GPU version of the code for equilibrium and homogeneous flow calculations is available for free download online, which allows homogeneous flow calculations on an inexpensive desktop. We discuss implementation details, test performance, and show a comparison of the results with experiments.



INTRODUCTION

Entangled polymer melts and solutions are non-Newtonian liquids with vast industrial applications. A strong connection between their complex macroscopic behavior and molecular structure makes their study particularly challenging. It is commonly accepted that polymer chains above a certain molecular weight form so-called entanglements: topological constraints that chains exert on each other. These entanglements are further believed to govern the macroscopic behavior of polymer melts and solutions. It is possible to capture many important physical details with atomistic simulations, but modern molecular dynamics computations are unable to reach the long relaxation times observed for entangled polymer melts, which can reach minutes or hours. Whereas numerically cheap reptation models are able to capture entanglement dynamics in certain cases, they have a number of phenomenological parameters and usually exhibit some limitations, such as a zero second-normal-stress coefficient in shear flows.1,2 On the other hand, a strong connection with a molecular basis is required to build a robust modeling framework.

The DSM is a single-chain mean-field model for flexible entangled polymers, first proposed by Schieber et al.3 Compared to other models, DSM has a minimal set of parameters and is applicable to equilibrium dynamics and nonlinear flow of arbitrary chain architecture and composition without any modification to the mathematics. It is consistent with nonequilibrium thermodynamics and with the stress-optic rule (for moderate deformations), and gives consistency between Green−Kubo predictions and the relaxation modulus predicted for small deformations. It is able to predict simultaneously the linear viscoelasticity of monodisperse linear, polydisperse linear, and branched systems. Finally, DSM shows excellent agreement with shear flow experiments and with elongational flows at stretches up to ∼10−20. In recent work4 a strong connection between DSM and atomistic simulations was proposed. The authors introduced the mobile slip-link model (MSM), which is based on an earlier model5,6 but has a slightly more-detailed level of description. All MSM parameters, except one, can be calculated from the results of atomistic simulations. In this way MSM allows for almost ab initio predictions. The MSM, together with the original fixed slip-link model (FSM) and the less-detailed clustered fixed slip-link model (CFSM), form a hierarchy of models with different levels of description.7 Models in the



hierarchy are consistent with each other in overlapping regions, and the parameters for MSM unambiguously set the parameters for the less-detailed members. The numerical cost of the least-detailed member (CFSM) is many orders of magnitude less than that of MSM, which allows for homogeneous flow predictions of entangled star-branched polymer melts in excellent agreement with experiment.8 A somewhat faster version of the DSM would be a good candidate for nonhomogeneous flow calculations.

In order to perform nonhomogeneous flow calculations for entangled polymers, a number of issues must first be overcome. Since entangled polymers possess memory of past deformations, theoretical models frequently rely on state variables representing chain conformations. An Eulerian approach to flow usually requires interpolation of these variables between grid points, which is difficult. Clearly, a Lagrangian approach is preferable. Since fluid flow simulations are nontrivial and numerically expensive even for a simple Newtonian fluid, most early attempts relied on simple and numerically cheap polymer models. The first application of a stochastic model is the CONNFFESSIT approach9 based on the finite element method. It can work with an arbitrary viscoelastic model, as was shown for example in ref 1. CONNFFESSIT can also utilize the concept of Brownian configuration fields developed by Hulsen et al.10 and later generalized11 to significantly reduce noise. More recently, a combination of smoothed particle hydrodynamics (SPH) with a simple polymeric fluid model was proposed by Vázquez-Quesada et al.12 for nonhomogeneous flow calculations. SPH is a Lagrangian method and was originally developed by Gingold and Monaghan13 for astrophysical problems. The method has been successfully applied to various problems and is known to handle free surfaces and moving boundaries well. It has proved to be robust and reliable, with highly efficient algorithms available. The version of SPH by Vázquez-Quesada et al.12 is capable of working with an arbitrary viscoelasticity model and was shown to satisfy nonequilibrium thermodynamics by the GENERIC formalism.14 Vázquez-Quesada et al.15 successfully applied this method together with the Oldroyd-B model to microrheology simulations, which included moving boundaries. We believe that SPH is a preferable flow calculation method for combination with DSM, provided that there is a way to reduce the numerical cost of DSM.

The GPU was originally developed for one specific task: the processing of computer graphics. Continuous demand from home entertainment consumers led to rapid growth of GPUs in both power and complexity. Compared to CPUs, modern GPUs theoretically possess up to three orders of magnitude more computational power. Equally important, GPUs provide a significantly smaller electrical power per FLOPS ratio as well as a smaller price per FLOPS ratio. These factors inspired the GPGPU (general-purpose computing on graphics processing units) idea. Fortunately, GPU manufacturers quickly recognized and supported the idea. The biggest challenge for GPGPU comes from the differences between GPU and CPU programming. The GPU is designed for massively parallel execution of one simple set of numerical operations on different data values. Complex operations can seriously impair GPU performance. While interactions between parallel threads are possible with modern GPUs, they have severe restrictions. Sometimes trivial tasks take much longer on a GPU than on a CPU. Overall, GPU programming is challenging and, therefore, it is often easier to develop a new algorithm for the GPU, rather than adopt one from the CPU. CUDA is a programming language developed by the NVIDIA corporation for GPGPU calculations. NVIDIA is the world's largest GPU manufacturer, and with intensive support CUDA became the most common language for GPGPU programming. This support is reflected not only in the CUDA framework itself; new mainstream GPUs are also developed with GPGPU in mind. CUDA has been used to accelerate calculations in many fields, including cryptography and molecular dynamics simulations.16,17 An attempt to accelerate entangled polymer rheology calculations with a GPU was made by Uneyama,18 who used CUDA and an NVIDIA TESLA card to perform calculations for modifications of a slip-spring simulation.19 Although the author reports very significant speedup, several simplifications were made in the model and the GPU code did not reproduce predictions of the original model. The work nevertheless showed the potential of entanglement theory calculations on a GPU.



CFSM

Here we briefly review the least-detailed member of the DSM hierarchy: the clustered fixed slip-link model (CFSM). The main goal of this section is to aid the GPU code description further in the text. Additionally, since the CFSM is mathematically identical to the FSM, and the code is suited for both models, some details of the FSM are also mentioned.

In the DSM the mean-field chain is represented as a random walk of elementary chain segments with constant length. The elementary chain segment is usually the Kuhn step, a characteristic length scale of flexible polymer chains, but the CFSM uses a cluster consisting of several Kuhn steps. Entanglements (slip-links) represent topological constraints from other chains in the mean field. They are placed randomly in junctures between chain segments. The FSM and CFSM assume that slip-links are fixed to the background. The chain is represented by the following dynamic variables: Z, the number of chain strands, equal to the number of entanglements plus one; {Qi}, the strand connector vectors; and {Ni}, the number of segments in each strand. The chain is subject to two dynamic processes: dynamics of the chain itself, or sliding dynamics (SD), and interaction with the mean field, or constraint dynamics (CD). Sliding dynamics corresponds to reallocation of segments through entanglements, and it creates and destroys entanglements at the ends of the chain. When the last segment of a dangling end goes through an entanglement, the entanglement is destroyed. The probability to create an entanglement at the end of the chain is connected with the destruction process through detailed balance. CD is the creation and destruction of entanglements along the entire length of the chain due to SD of the mean-field chains. We implement CD self-consistently with SD by introducing an entanglement lifetime distribution p^CD(τ_i^CD). Thus, the FSM and CFSM have the following set of stochastic variables: {Z, {Ni}, {Qi}, {τ_i^CD}}. The molecular-weight-, architecture-, and chemistry-independent parameter set for the FSM is MK, the molecular weight of a Kuhn step; β, which is related to the entanglement density as β = lim_{NK→∞}⟨Ni⟩ + 1; and τK, a characteristic time of Kuhn-step shuffling. MK is determined by chemistry and we do not treat it as an adjustable parameter, while β and τK also depend on solvent concentration, and only τK is a function of temperature. They can be found by comparison with G* data or from atomistic simulations (except τK), as described in refs 4 and 7. The CFSM parameter set is Mc, the molecular weight of a cluster, and τc, a characteristic time of cluster shuffling. Usually the CFSM assumes two clusters between entanglements on average. However, in highly extensional flow the CFSM prediction may deviate from that of the more-detailed FSM; in this case a smaller cluster molecular weight can be used. The mapping of FSM parameters to CFSM parameters is reported in our previous work.8 More-detailed information on the FSM can be found in ref 5. Since the DSM is a single-chain stochastic model, predictions usually involve following the time evolution of an ensemble of independent chains. Macroscopic quantities, such as the stress tensor or the relaxation modulus, are calculated as averages over the ensemble.
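For orientation in the GPU code description that follows, the sketch below shows one possible structure-of-arrays layout for the stochastic variables {Z, {Ni}, {Qi}, {τ_i^CD}} of an ensemble of chains. The layout, the fixed capacity Z_MAX, and all identifier names are illustrative assumptions and not the actual data structures of the DSM code.

```cuda
// Hypothetical structure-of-arrays layout for an ensemble of CFSM chains.
// Z_MAX is an assumed upper bound on the number of strands per chain;
// the actual DSM code may organize memory differently.
#include <cuda_runtime.h>

constexpr int Z_MAX = 128;  // assumed maximum number of strands per chain

struct ChainEnsemble {
    int     n_chains;  // number of independent mean-field chains
    int*    Z;         // [n_chains]          strands per chain (entanglements + 1)
    float4* Q;         // [n_chains * Z_MAX]  strand connector vectors (xyz) + padding
    int*    N;         // [n_chains * Z_MAX]  clusters (Kuhn steps) in each strand
    float*  tauCD;     // [n_chains * Z_MAX]  entanglement lifetimes drawn from p^CD
};

// Allocate device memory for the ensemble (error handling omitted for brevity).
inline void allocate_on_device(ChainEnsemble& e, int n_chains)
{
    e.n_chains = n_chains;
    cudaMalloc(&e.Z,     n_chains * sizeof(int));
    cudaMalloc(&e.Q,     n_chains * Z_MAX * sizeof(float4));
    cudaMalloc(&e.N,     n_chains * Z_MAX * sizeof(int));
    cudaMalloc(&e.tauCD, n_chains * Z_MAX * sizeof(float));
}
```

A structure-of-arrays layout of this kind keeps memory reads coalesced when one thread per entanglement accesses the same field across a block, which is relevant to the kernels discussed later in the text.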

Figure 2. CFSM shear flow predictions. Symbols are experimental data:20 linear polyisoprene melt, 90 kDa, −35 °C. CFSM parameters: Mc = 2050 Da, τc = 0.1 s. Black line is the CPU code prediction, and green line is the GPU code prediction.



RESULTS

In this section we present DSM predictions calculated on a GPU and compare the results and performance of the CPU and GPU codes. We applied our GPU code to two linear entangled polymer systems in equilibrium and in flow. First, we compare predictions for a linear, monodisperse polyisoprene melt with molecular weight 90 kDa and 22.5 entanglements per chain on average. Experimental measurements were performed by Auhl et al.20 Figure 1 shows the

Figure 3. CFSM G* prediction. Symbols are experimental data:22 polybutadiene solution, 813 kDa, 30 °C. CFSM parameters: Mc = 8381 Da, β = 5, τc = 0.0014 s. Black line is the CPU code prediction, and green line is the GPU code prediction.

dynamic modulus G* data with the DSM predictions. Here the parameters were fit to the G* data. Note that the theory has coarse-grained out the high-frequency Rouse modes, so we expect disagreement there. However, these modes can be added back into the model without any additional parameters.23 Again we observe good agreement with experiment, except in the high-frequency region, and almost identical results from the CPU and GPU. No parameter adjustments were then made for the flow predictions. Figure 4 shows predictions for the transient shear viscosity during startup of flow, and Figure 5 shows predictions for the transient first normal-stress coefficient. As above, the GPU and CPU produce nearly identical results. Overall quality of the

Figure 1. CFSM G* prediction. Symbols are experimental data:20 linear polyisoprene melt, 90 kDa, −35 °C. CFSM parameters: Mc = 2050 Da, τc = 0.1 s. Black line is the CPU code prediction, and green line is the GPU code prediction.

experimentally measured dynamic modulus G* together with the DSM prediction calculated on both GPU and CPU. GPU and CPU produce nearly identical results. We fit our time-domain results to a BSW spectrum21 to convert them to the frequency domain, which explains the very small discrepancy. At high frequencies, Rouse modes, which were integrated out of the model, dominate the experimental data. The value for Mc is taken from ref 8 and, although found by fitting data originally, is consistent with the value obtained from first principles. The value for τc was obtained by fitting these experimental data. Figure 2 shows predictions for the transient shear viscosity during startup of flow. Here we do not adjust any parameters from the equilibrium predictions. We observe excellent agreement between the CPU and GPU calculations as well as very good agreement with the experimental data. Second, we compared predictions for a linear monodisperse polybutadiene solution with chain molecular weight 813 kDa and 16 entanglements per chain on average. Experimental measurements were performed by Menezes and Graessley.22 Figure 3 shows a comparison of the experimentally measured

Figure 4. CFSM predictions of transient shear viscosity. Symbols are experimental data:22 linear polybutadiene solution, 813 kDa, 30 °C. CFSM parameters: Mc = 8381 Da, β = 5, τc = 0.0014 s. Black line is the CPU code prediction, and green line is the GPU code prediction.


prediction. A large ensemble again benefits GPU effectiveness, and we observe speedup values between 170 and 350. DSM does not always require the large ensemble sizes for which the GPU shows optimal performance. However, even for moderate ensemble sizes the performance gain from switching to the GPU is on the order of 100. We conclude that a single GPU can replace a computational cluster for equilibrium and homogeneous flow calculations with DSM. Figure 6 shows GPU performance as a function of ensemble size. GPU calculation time is a linear function of the number of chains for ensembles larger than ≈16000 chains. Nonhomogeneous flow calculations require much larger ensemble sizes and would greatly benefit from GPU calculations. Suppose we are doing a 3D nonhomogeneous flow simulation with 64 × 64 × 64 grid points on a relatively small GPU cluster of 100 GPUs. We use PI90K at γ̇ = 0.02 s^-1 as an example to estimate the time required to perform such a simulation. We assume we can use 1000 chains per grid point and target 3000 s ≈ 1 h of real time. Thus, we have 2 × 10^8 chains for 100 GPUs, which means 2 × 10^6 chains per GPU. Figure 6 shows that the time required to perform the time evolution for 1 024 000 chains is roughly 4700 s. Assuming perfect linear scaling and neglecting the numerical cost of other calculations, this gives 2 × 10^6/(1.024 × 10^6) × 4700 s ≈ 9 × 10^3 s ≈ 2.5 h for the simulation. It is probable that a finer stress resolution will be required in nonhomogeneous flow calculations. This means more frequent ensemble synchronizations, which can impact the performance, as discussed in the next section. Figure 7 shows the GPU calculation time as a function of synchronization time. We picked the same rate as that used for the nonhomogeneous flow estimate. However, only for synchronization times ≲5τc is the performance penalty significant. Such synchronization times are unreasonable since they are very close to the model's resolution and would represent massive Weissenberg numbers. Yet, even in the worst case (1τc), our estimate for the nonhomogeneous flow calculation time is still feasible (hours, not days). For reasonable synchronization times, >5τc, the performance penalty is almost negligible. We conclude that nonhomogeneous flow simulations with DSM are easily accessible for a cluster of 100 GPUs.
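For reference, the estimate above can be written in one line, using the authors' rounding of 64^3 × 1000 ≈ 2 × 10^8 chains and the 4700 s benchmark for 1 024 000 chains from Figure 6:

```latex
t_{\mathrm{sim}} \approx
  \frac{2 \times 10^{8}\ \text{chains} \,/\, 100\ \text{GPUs}}
       {1.024 \times 10^{6}\ \text{chains}}
  \times 4700\ \mathrm{s}
  \approx 9 \times 10^{3}\ \mathrm{s}
  \approx 2.5\ \mathrm{h}
```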

Figure 5. CFSM predictions of transient shear first normal stress. Symbols are experimental data:22 linear polybutadiene solution, 813 kDa, 30 °C. CFSM parameters: Mc = 8381 Da, β = 5, τc = 0.0014 s. Black line is the CPU code prediction, and green line is the GPU code prediction.

predictions is also very good. With β = 1, the fit of G* data yields τc = 0.016 s. Since the highest rate is close to a strand Weissenberg number of 4τcγ̇ = 0.7, and work by Andreev et al.8 shows that clustering may artifactually affect the quality of predictions at high rates, we decreased the amount of clustering (β = 5) between slip-links.

Predictions and performance measurements were taken on a single desktop computer equipped with a single NVIDIA GeForce GTX780 mainstream gaming card. CPU predictions were calculated on a computational cluster with approximately 1000 CPU cores. Each CPU core performed calculations for a fraction of the whole chain ensemble. Averages over the ensemble were taken afterward by separate postprocessing software. CPU performance was measured separately on one core of an Intel Xeon E5-2640 CPU for just one chain. We report the CPU execution time for the whole ensemble calculated assuming perfect linear scaling of the CPU code with the number of chains. The speedup factor is the ratio of CPU to GPU execution times. We believe the way we perform the CPU performance measurements allows a simple practical interpretation: the speedup factor is the number of CPU cores that one GPU can replace. It is important to mention that the CPU code uses double precision for floating-point operations, whereas the GPU code uses single precision. We do not observe any effect of the single precision on the overall quality of the predictions. Table 1 shows a comparison of GPU and CPU performance for equilibrium calculations. We find the speedup factor on the



DETAILS OF CUDA IMPLEMENTATION

During the implementation of DSM on CUDA, we had to consider several GPU limitations. Here we first briefly review the most important aspects of CUDA programming and then discuss optimization details of our code. In CUDA, functions executed on the GPU are called kernels; a kernel is executed many times in parallel by different threads. Threads are organized into blocks, and only the threads in one block are guaranteed to be executed simultaneously. In typical situations some blocks are waiting for on-board memory input/output operations while other blocks are being executed. The order of block execution is undefined. In order to effectively utilize GPU computational power, the total number of threads should be larger than the number of hardware computational units on the GPU, and the threads should be separated into many blocks. On the other hand, it is undesirable to have very small blocks, due to hardware limitations. The latest GPUs are capable of processing a few thousand threads at once. Thus, first of all, it is desirable to have more than several thousand threads. The typical number of threads in a block is between several tens and a thousand. Second, threads in one block can exchange data much faster than with threads from different blocks. It is

Table 1. CPU and GPU Code Performance in Equilibrium^a

              ensemble size   calculation length, τc   GPU time, s   CPU time, s   speedup factor
PI90K G*      4000            140000                   243           28000         115
PBD813K G*    4000            450000                   664           53200         80

^a Calculation length is set to 10 times the longest relaxation time. Calculation results are shown in Figures 1 and 3. CPU time is calculated from a measurement for 1 chain, assuming perfect linear scaling.

order of 100 satisfactory, given that only minimal optimization for equilibrium was implemented. Tables 2 and 3 show CPU and GPU performance in flow. For PI90K, we observe a speedup factor between 115 and 135 for all rates except the lowest. A larger ensemble size improves GPU effectiveness in this case. We used a larger ensemble for the PBD813K predictions to improve the signal-to-noise ratio for the first normal stress


Table 2. GPU and CPU Performance for PI90K Flow^a

γ̇, s^-1        ensemble size   calculation length, τc   stress resolution, τc   GPU time, s   CPU time, s   speedup factor
8 × 10^-4       4000            100000                   100                     165           56924         345
2 × 10^-3       1000            100000                   100                     104           13686         132
5 × 10^-3       1000            50000                    50                      53            6939          131
8 × 10^-3       1000            50000                    50                      51            6364          125
2 × 10^-2       1000            20000                    50                      19            2421          126
7 × 10^-2       1000            20000                    50                      17.2          2081          121
1.35 × 10^-1    1000            10000                    50                      8.2           980           118
3.37 × 10^-1    1000            2000                     10                      1.8           235           134

^a Calculation results are shown in Figure 2. CPU time is calculated from a measurement for 1 chain, assuming perfect linear scaling.

Table 3. GPU and CPU Performance for PBD813K Flow^a

γ̇, s^-1        ensemble size   calculation length, τc   GPU time, s   CPU time, s   speedup factor
2.7 × 10^-2     16000           80000                    310           107300        346
5.4 × 10^-2     16000           80000                    302           78200         259
8.5 × 10^-2     16000           80000                    293           66300         226
1.7 × 10^-1     4000            40000                    51.8          9190          177
6.8 × 10^-1     4000            40000                    41.3          8240          200
2.7             4000            8000                     8.1           2070          255
10.8            4000            8000                     5.55          1310          236

^a Calculation results are shown in Figures 4 and 5. For each deformation rate, the stress calculation was performed 200 times. CPU time is calculated from a measurement for 1 chain, assuming perfect linear scaling.

Figure 7. GPU calculation time for startup of shear flow as a function of synchronization time. Linear polyisoprene melt: 90 kDa, γ̇ = 2 × 10^-2 s^-1, calculation length 20000τc, ensemble size 1000 chains.

are often not as accurate as on a CPU. Seventh, random access to on-board memory is much slower than on a CPU, though several special instructions are introduced for sequential memory access, and some memory areas can be temporarily marked as write-only or read-only. In order to write efficient CUDA code, all of the aspects above should be taken into account.

DSM chain dynamics are implemented as jumps, where each jump has a probability determined by the chain free energy. The jump-process probability calculation is the most expensive part of the code on a CPU. Since DSM is a single-chain stochastic model, the simplest way to perform parallel calculations is to assign one chain to one thread. Nevertheless, the ensemble sizes typically used for DSM predictions are on the order of a few thousand at most. As described above, we need several thousand threads to use the GPU effectively. Therefore, we split the DSM algorithm into two major parts. The first part is an entanglement-parallel kernel, which performs the algorithm steps that can be done in parallel for each entanglement, such as the jump-process probability calculation and the deformation of the strand connector vectors {Qi}. Most arithmetic operations are performed in this kernel. The number of threads in the entanglement-parallel kernel is roughly ⟨Z⟩ times the number of chains. The second part is a chain-parallel kernel, which performs the time-interval calculation and picks the jump process randomly according to the appropriate probabilities. The block sizes of the entanglement-parallel and chain-parallel kernels are 1024 and 256, respectively. Some trials were required to find these numbers, and we cannot guarantee that they are optimal for any particular hardware or for all possible situations. However, we do not observe a strong dependence of performance on these exact numbers. Additional kernels

Figure 6. GPU calculation time for startup of shear flow as a function of ensemble size. Linear polyisoprene melt: 90 kDa, γ̇ = 2 × 10^-2 s^-1. Gray line is a linear function of the number of chains.

preferable to reduce or eliminate, if possible, data exchange between different blocks. Third, the GPU handles code branching poorly. Execution of threads in a block is strictly parallel; in other words, all threads in the block perform the same operation at any given time. When threads take different conditional branches, the branches are executed serially. In this situation one portion of the threads is actually performing calculations, and all the others are just waiting for their turn. As a consequence, execution time increases if just one of the threads in the block takes a different conditional branch from the others. Fourth, the amount of cache memory and the number of registers on a GPU are small. It is preferable to write smaller and simpler kernels. Luckily, kernel calls are cheap; thus splitting one big kernel into a few smaller ones is highly recommended. Fifth, double-precision floating-point performance is much lower than single-precision performance, and integer and bitwise operations are much slower than on a CPU. Sixth, floating-point operations


calculate ensemble averages and prepare pseudorandom sequences. In our CPU code, the chain conformation is updated twice during a time step: first when the jump process is applied, and second when the flow deformation of the strand connector vectors is applied. In the case of segment shuffling, only the two variables Ni and Ni+1 must be updated. In the case of entanglement creation or destruction, however, a major portion of the memory assigned to the chain conformation arrays must be updated. For example, if entanglement i is destroyed, {Qj}, {Nj}, and {τ_j^CD} with j > i + 1 must be moved to cell j − 1. The usual implementation of this operation would require a sequence of reads and writes to memory, or two parallel writes and reads and a temporary array. Considering that the probability of entanglement creation or destruction is roughly proportional to Z in the CFSM, and that the number of threads in a block is usually more than ⟨Z⟩ for typical systems of interest, it is safe to assume that nearly every block would perform this expensive operation at every time step. Performance of the chain-parallel part would be limited by this expensive operation. Updating the {Qi} array on deformation is another memory-intensive operation. It involves roughly 60% of the chain conformation memory, since Qi is a 3D vector while Ni and τ_i^CD are two scalars.

In order to reduce the number of memory writes, we implement two duplicate arrays for the chain conformations. At any given time step one of them is marked as read-only and the other as write-only. When a jump process is picked, it is not applied immediately; instead, information about the change is stored in temporary variables (just a few bytes per chain is enough). During the next time step, the entanglement-parallel kernel reads {Qi}, {Ni}, {τ_i^CD} from the first array, applies the stored jump process on the fly, applies the deformation to {Qi}, calculates probabilities, and writes the chain conformations to the second array. The arrays then exchange roles. In this way only one copy of the chain conformation is performed at each time step. The algorithm requires twice the memory that is actually needed for the chain conformations, but it makes extensive use of high-performance memory access instructions (a minimal sketch of this scheme is given at the end of this section). Compared to a simple translation of the CPU algorithm, we estimate a 5−15 times speedup. Additional performance is gained by using a temporary array for the pseudorandom sequence. Generation of a random number is a moderately expensive operation on a GPU but, more importantly, on entanglement creation the model requires four additional pseudorandom numbers: three to generate a new connector vector Q and one to generate a new entanglement lifetime τ^CD. In order to reduce the time that the other threads spend waiting for a thread where creation occurs, we simply read pseudorandom numbers from an array prepared earlier. In order to enable parallel reads, each chain uses its own sequence. The sequence length is on the order of several ⟨Z⟩, so in the preparation step we generate roughly the same number of values for each chain.

Since the DSM algorithm uses a variable time-step size, chains in the ensemble quickly become out of sync. However, calculation of ensemble averages makes sense only for a synchronized ensemble. Suppose we would like to synchronize an ensemble at a given time mark. When a chain reaches the time mark, the thread associated with it becomes idle. In that case, a GPU performing a calculation for one chain requires the same computational time as for several hundred chains. The number of active threads drops as more and more chains reach the mark, and GPU effectiveness decreases. Ensemble synchronization is an expensive operation on the GPU and should be performed as rarely as possible. In the case of a given time-dependent deformation, we can postpone ensemble averages by caching separate results for each chain in temporary arrays. The ensemble averages can then be performed when the time evolution is complete or when we run out of memory. However, under unsteady conditions, as in nonhomogeneous flow calculations, this scheme cannot be used. Thus, it is desirable for nonhomogeneous flow calculations to use large synchronization times. For equilibrium dynamics, the stress autocorrelation function is of greatest interest. Here autocorrelation functions for individual chains are averaged in postprocessing. During a typical equilibrium calculation, macroscopic quantities are calculated every few time steps. Performance of the equilibrium version is sufficient for a single desktop, even with no specific equilibrium optimizations except explicitly setting the deformation tensor κ = 0. A fully optimized version of the equilibrium code should probably be written from scratch and is the subject of future work. The flow version calculates macroscopic quantities by ensemble synchronization, since typical homogeneous flow predictions require less-frequent stress calculations. Besides, our primary interest is in performance in situations close to nonhomogeneous flow calculations. Additionally, Figure 7 shows the performance dependence on synchronization frequency.

The CPU code has been extensively tested over the course of several years. Distributions of model variables generated by the CPU code have been compared to analytic expressions with excellent agreement. We find that these distributions are extremely sensitive to bugs in the code. Similar checks were performed for the GPU code. In addition, we performed a different test to debug the CUDA code: we used the same pseudorandom number sequence as our CPU code and compared chain conformations after performing an equilibrium time evolution for a sufficient time. Since the GPU code is written completely from scratch, this test checks both the CPU and GPU codes. We find that, even with the reduced accuracy, the chains on the GPU deviate from the CPU chains only after several longest relaxation times. Most importantly, the averages remain the same within uncertainty. On every occasion of deviation in a detailed sense, we explicitly checked that the deviation occurred only due to the reduced GPU accuracy. Passing this test, together with the agreement with the statistical distributions, constitutes the code verification.
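To make the double-buffered update described above concrete, the sketch below shows an entanglement-parallel step that reads from one buffer and writes to the other, applying a simple-shear deformation on the fly. All names, the float4 layout, the fixed Z_MAX capacity, and the launch configuration are illustrative assumptions; the actual kernel additionally applies the pending jump process and computes the full jump probabilities described in the text.

```cuda
#include <cuda_runtime.h>

constexpr int Z_MAX = 128;  // assumed maximum number of strands per chain (illustrative)

// One thread per (chain, strand) pair: read the conformation from the
// read-only buffer, apply an affine simple-shear deformation to Q, and
// write the result to the write-only buffer.
__global__ void entanglement_parallel_step(const float4* __restrict__ Q_read,
                                           float4* __restrict__ Q_write,
                                           const int* __restrict__ Z,
                                           float gamma_dot,  // shear rate: dQx/dt = gamma_dot * Qy
                                           float dt,
                                           int n_chains)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int chain  = tid / Z_MAX;
    int strand = tid % Z_MAX;
    if (chain >= n_chains || strand >= Z[chain]) return;

    float4 q = Q_read[chain * Z_MAX + strand];   // coalesced read from the read-only buffer
    q.x += dt * gamma_dot * q.y;                 // explicit Euler step of eq 1 (simple shear only)
    Q_write[chain * Z_MAX + strand] = q;         // single write to the other buffer
}

// Host-side driver: the two buffers exchange roles every step, so each chain
// conformation is copied exactly once per time step (the ping-pong scheme).
void run_steps(float4* Q_a, float4* Q_b, const int* Z, int n_chains,
               float gamma_dot, float dt, int n_steps)
{
    const int block = 1024;  // entanglement-parallel block size quoted in the text
    const int grid  = (n_chains * Z_MAX + block - 1) / block;
    for (int step = 0; step < n_steps; ++step) {
        entanglement_parallel_step<<<grid, block>>>(Q_a, Q_b, Z, gamma_dot, dt, n_chains);
        float4* tmp = Q_a; Q_a = Q_b; Q_b = tmp;  // swap read/write roles
    }
    cudaDeviceSynchronize();
}
```

Because the roles of the two buffers alternate, each conformation is written exactly once per step, matching the single-copy-per-time-step behavior described above.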



CONCLUSIONS

A completely new implementation of DSM on NVIDIA CUDA is presented. During the implementation, we were able to get around most limitations of GPU programming while avoiding any model simplifications. The code has been extensively tested. We applied the DSM GPU code to equilibrium and start-up of shear flow experimental data sets for a moderately entangled polyisoprene melt and a polybutadiene solution, with parameters partially obtained ab initio. The code results are in very good agreement with the experimental data and are identical to the results of the earlier CPU implementation, thanks to the unambiguous mathematical formulation of DSM. As a result, the performance of the code allows for quantitative rheology predictions of entangled polymer melts and solutions in both equilibrium and flow on a single desktop equipped with a mainstream gaming GPU. Calculation of model predictions is a few hundred times faster than on a CPU and usually takes between


seconds to several minutes, depending on the system of interest. The current code allows one to obtain predictions of the equilibrium relaxation modulus and the transient stress during startup of homogeneous deformation easily. The code is currently limited to linear monodisperse chains; however, DSM is capable of successfully predicting polydisperse systems of arbitrary chain architecture. We believe implementation of polydispersity and different chain architectures should not pose a problem for GPUs. Calculation of additional polymer properties can be easily added to the code. We will continue to work on the code to expand its capabilities and further improve performance. Additionally, we made our code available under an open-source license to popularize the use of DSM and to provide a simple tool for theoretical predictions of entangled polymers.24 Most importantly, our estimates suggest that a relatively small GPU cluster is capable of performing complex nonhomogeneous flow calculations with DSM.

To conclude, we use the following algorithm for chain time evolution:
1. Calculate all possible jump process probabilities W(Ω′|Ω0).
2. Set the time step interval to Δt = 1/ΣΩ′ W(Ω′|Ω0).
3. Pick one jump process randomly, using the probabilities as weights.
4. In the presence of flow, deform {Qi} according to eq 1.
5. Repeat.
The algorithm assumes that W(Ω|Ω0) is constant during the time interval Δt; in flow, however, W(Ω|Ω0) changes with {Qi}, following eq 1. It is possible to validate this approximation either by deriving a more accurate algorithm or by reducing the time step size. We performed such a check in the past for the CPU version of the code and found no effect on the predictions. We assume that the algorithm is valid as long as Δt given by eq 4 is small and the flow rate is moderate. Δt is inversely proportional to ⟨Z⟩ and thus stays small for long chains. On the other hand, we know that for shorter chains physics that has been integrated out of the model becomes important. The same physics are important at high rates. In other words, we expect the model to break down at approximately the same point as the numerical algorithm.
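As an illustration only, the following single-chain sketch implements steps 1−5 above with placeholder hooks; the actual jump catalogue, the rates derived from the chain free energy, and the deformation of eq 1 are not reproduced here, and all names are hypothetical.

```cuda
#include <cstddef>
#include <random>
#include <vector>

// One candidate jump out of the current conformation: its rate W(Omega'|Omega)
// and an opaque tag describing what it does. Placeholder for the real model.
struct Jump { double rate; int what; };

// Hypothetical hooks; in the real code these come from the chain free energy,
// detailed balance, and eq 1. The bodies are trivial stubs so the sketch compiles.
static std::vector<Jump> enumerate_jumps() { return { {1.0, 0} }; }
static void apply_jump(int /*what*/) {}
static void deform_strands(double /*dt*/) {}

// Single-chain time evolution following steps 1-5 of the algorithm above.
void evolve_chain(double t_end, std::mt19937& rng)
{
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    double t = 0.0;
    while (t < t_end) {
        // 1. all jump rates W(Omega'|Omega_0) for the current conformation
        std::vector<Jump> jumps = enumerate_jumps();
        double total = 0.0;
        for (const Jump& j : jumps) total += j.rate;
        if (total <= 0.0) break;          // no accessible jumps (should not occur in DSM)

        // 2. variable time step, eq 4
        double dt = 1.0 / total;

        // 3. pick one jump process, using the rates as weights
        double r = uni(rng) * total, acc = 0.0;
        std::size_t pick = 0;
        for (std::size_t i = 0; i < jumps.size(); ++i) {
            acc += jumps[i].rate;
            if (r <= acc) { pick = i; break; }
        }
        apply_jump(jumps[pick].what);

        // 4. in the presence of flow, deform the strand connectors over dt (eq 1)
        deform_strands(dt);

        // 5. repeat
        t += dt;
    }
}
```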



APPENDIX. NUMERICAL ALGORITHM

Except for the number of segments per chain and the parameter values, the FSM and CFSM are mathematically identical. The numerical algorithm below applies to both the FSM and CFSM, with only minor changes required for the MSM. In the model the chain conformation Ω := {{Qi}, {Ni}, {τ_i^CD}, Z} changes in time by jumps. Jumps account for chain segment relocation between different strands, or entanglement creation and destruction. The probability of a jump from conformation Ω to conformation Ω′ per unit time is W(Ω′|Ω). Probabilities are determined by the chain free energy and detailed balance. Additionally, in flow the strand connector vectors deform continuously,

\frac{dQ_i(t)}{dt} = \kappa \cdot Q_i(t)    (1)

where κ is the transpose of the imposed velocity gradient. We postulate the time evolution for a chain conformation Ω in the absence of flow:

\frac{\partial p(\Omega, t \,|\, \Omega_0, t_0)}{\partial t} = \int \left[ W(\Omega|\Omega')\, p(\Omega', t \,|\, \Omega_0, t_0) - W(\Omega'|\Omega)\, p(\Omega, t \,|\, \Omega_0, t_0) \right] d\Omega'    (2)

After integrating eq 2 from t0 to t0 + Δt and substituting \int_{t_0}^{t_0+\Delta t} p(\Omega, t \,|\, \Omega_0, t_0)\, dt \cong \Delta t\, \delta(\Omega - \Omega_0), we get

p(\Omega, t_0 + \Delta t \,|\, \Omega_0, t_0) \cong W(\Omega|\Omega_0)\,\Delta t + \delta(\Omega - \Omega_0)\left[ 1 - \Delta t \int W(\Omega'|\Omega)\, d\Omega' \right]    (3)

Equation 3 shows that in the time interval

\Delta t = \frac{1}{\int W(\Omega'|\Omega)\, d\Omega'}    (4)

one of the jump processes is guaranteed to happen, each with probability p(\Omega, t_0 + \Delta t \,|\, \Omega_0, t_0) \cong W(\Omega|\Omega_0)\,\Delta t. Therefore, we perform chain evolution with the variable time step given by eq 4. If the chain conformation after a time interval Δt0 < Δt is required, the probability to stay at the conformation Ω0 is 1 − Δt0/Δt.

AUTHOR INFORMATION

Corresponding Author
*(J.D.S.) E-mail: [email protected].

Notes
The authors declare no competing financial interest.

ACKNOWLEDGMENTS

Support of this work by Army Research Office Grants W911NF-08-2-0058 and W911NF-09-2-0071 is gratefully acknowledged.

REFERENCES

(1) Hua, C. C.; Schieber, J. D. J. Rheol. 1998, 42, 477−491.
(2) Graham, R. S.; Likhtman, A. E.; McLeish, T. C. B.; Milner, S. T. J. Rheol. 2003, 47, 1171−1200.
(3) Schieber, J. D.; Neergaard, J.; Gupta, S. J. Rheol. 2003, 47, 213.
(4) Steenbakkers, R. J. A.; Tzoumanekas, C.; Li, Y.; Liu, W. K.; Kröger, M.; Schieber, J. D. New J. Phys. 2014, 16, 015027.
(5) Khaliullin, R. N.; Schieber, J. D. Macromolecules 2009, 42, 7504−7517.
(6) Khaliullin, R. N.; Schieber, J. D. Macromolecules 2010, 43, 6202−6212.
(7) Schieber, J. D.; Andreev, M. Annu. Rev. Chem. Biomol. Eng. 2014, 5, 367−381.
(8) Andreev, M.; Feng, H.; Yang, L.; Schieber, J. D. J. Rheol. 2014, 58, 723.
(9) Laso, M.; Öttinger, H. J. Non-Newtonian Fluid Mech. 1993, 47, 1−20.
(10) Hulsen, M. A.; van den Brule, B.; van Heel, A. J. Non-Newtonian Fluid Mech. 1997, 70, 79−101.
(11) Schieber, J. D. J. Non-Newtonian Fluid Mech. 2006, 135, 179−181.
(12) Vázquez-Quesada, A.; Ellero, M.; Español, P. Phys. Rev. E 2009, 79, 056701.
(13) Gingold, R.; Monaghan, J. Mon. Not. R. Astron. Soc. 1977, 181, 375−389.
(14) Grmela, M.; Öttinger, H. C. Phys. Rev. E 1997, 56, 6620−6632.
(15) Vázquez-Quesada, A.; Ellero, M.; Español, P. Microfluid. Nanofluid. 2012, 13, 249−260.


(16) Anderson, J. A.; Lorenz, C. D.; Travesset, A. J. Comput. Phys. 2008, 227, 5342−5359.
(17) Manavski, S. CUDA Compatible GPU as an Efficient Hardware Accelerator for AES Cryptography. In 2007 IEEE International Conference on Signal Processing and Communications (ICSPC); Curran Associates: Red Hook, NY, 2007; pp 65−68.
(18) Uneyama, T. Nihon Reoroji Gakkaishi 2011, 39, 135−152.
(19) Likhtman, A. E. Macromolecules 2005, 38, 6128−6139.
(20) Auhl, D.; Ramirez, J.; Likhtman, A. E.; Chambon, P.; Fernyhough, C. J. Rheol. 2008, 52, 801−835.
(21) Baumgärtel, M.; Schausberger, A.; Winter, H. H. Rheol. Acta 1990, 29, 400−408.
(22) Menezes, E. V.; Graessley, W. W. J. Polym. Sci., Part B: Polym. Phys. 1982, 20, 1817−1833.
(23) Katzarova, M.; Yang, L.; Andreev, M.; Córdoba, A.; Schieber, J. D. Rheol. Acta 2015, in press; DOI: 10.1007/s00397-015-0836-0.
(24) The code is available at http://www.chbe.iit.edu/~schieber/dsm-software.html.
