Solving SEO with Headless Chrome (Polymer Summit 2017)

August 26, 2019


SAM LI: Hi, everyone. I’m Sam Li, and I’m an
engineer on the Polymer team. If you managed to pick up on my
accent in the last five words, I am indeed Australian,
and so honored to be followed up by Trey,
a fellow Aussie, as well. Prior to joining this team, I’d
worked on the beloved Chrome DevTools. One of my smallest, but maybe
my greatest contribution was adding the ability to
rearrange tabs in DevTools. [APPLAUSE] It’s probably
the greatest five lines I’ve ever written. I did work on other features. So if you find me afterwards,
feel free to ask me about them. I might share a
DevTools trick or two. More recently I’ve had the
humbling experience of building webcomponents.org and witnessing
all the incredible components that all of you have
built and published. For example, the one and
only Pokemon selector. And if you are the
person who says, but there’s only 151
Pokemon in the original set, well, there’s even an option
that lets you set that, too. So all kudos to Sami for this. He was, however, in the process
of building webcomponents.org, which brings us to what we’re
here to talk about today. So first, I’m going to
cover my story of how I came to encounter this
SEO problem while building webcomponents.org. We’ll then look at how I used
headless Chrome to solve this before diving into all the
details of how that actually works and how you can use it. So I’m going to take a
step back for a moment and talk about what I learned
in the process of building webcomponents.org. The first thing
I learned was how the platform supports
encapsulation through the use
of web components. With this encapsulation
comes inherent code reuse, which leads to a
specific architecture. I also learned a lot about
progressive web apps and how they can provide us
with fast, engaging experiences. I learned how the
platform provides APIs, such as service
workers, to help enable those experiences. I also learned how to
compose web components to build a progressive web app. We heard from Kevin yesterday
about the PRPL pattern– Push, Render,
Pre-cache, Lazy Load– as a method of optimizing
delivery of this application to the user. And one of the
architectures which enables us to utilize the PRPL
pattern is the App Shell model. It provides us with instant
reliable performance by using an aggressively
cached App Shell. You can see that for all the
requests which hit our server, we serve the entry
point file, which we serve regardless of the route. The client then requests the
App Shell, which is similar. But because it’s the same
URL across the application, we can combine that
with a service worker to achieve near-instant
loading on repeated visits. The shell is then
responsible for looking at the actual route
that was requested and then requests the necessary
resources to render that route. So at this point I’d learned
how to build a progressive web app using client-side
technologies like Web Components and Polymer,
and how to use patterns such as the PRPL pattern
to deliver this application quickly to the user. Then there’s the elephant
in the room, SEO. Some of these
bots are basically just running curl with that
URL and stopping right there. No rendering, no JavaScript. So what are we left with? With this PWA that we built
using the App Shell model, we’re left with just your
entry point file, which has no information in it at all. And in fact, it’s the
same generic entry point file that you serve
across your entire application. So this is particularly
problematic for Web Components which require JavaScript to be
executed for them to be useful. This issue applies to
all search engine indexers that don’t render JavaScript. But it also applies to the
plethora of link rendering bots out there. There’s the social bots
like Facebook and Twitter, but don’t forget the enormous
number of link rendering bots such as Slack,
Hangouts, Gmail, you name it. So what is it about
the App Shell model that I’d really like to keep? Well, for me, this approach
pushes our application complexity out to the client. You can see that the server
has no understanding of routes. It just serves the
entry point file, and it has no real understanding
of what the user is actually trying to achieve. This allows our server to
be significantly decoupled from the front end application,
since it now only needs to expose a simple API to
read and manipulate data. The application that we
pushed out to the client is then responsible for
surfacing this data to the user and mediating user interactions
to manipulate this data. So I asked, can we keep
the simple architecture that we know and we love and
also solve this SEO use case with zero performance cost? So then we thought,
what if we just use headless Chrome to
render on our behalf? So here’s a breakdown
of how that would work. We have our regular users
who are making a request, and they would
like a cat picture. Because who wouldn’t? And as part of this approach
we ask, are you a robot? And to answer this, we look
at the user agent string and check if it’s a known
bot that doesn’t render. In this case, the
user can render, so we serve the page
as we normally would. The server responds with a
fetch cat picture function, and then the client can go
and execute that function to get the rendered
result. By the way, this is one of my kittens,
which I fostered recently. She’s super adorable. Now when we encounter
a bot, we can look at the user-agent
string and determine that they don’t render. And instead of serving that
fetch cat picture function, we fire off a request
to headless Chrome to render this
page on our behalf. And then we send the serialized
rendered response back to the bot so they can see
the full contents of the page. So I built a proof of
concept of this approach for webcomponents.org,
and it worked. I wrote a Medium
post about it, and people were really
interested in this approach and wanted to see more of it. So based on this
response, I eventually decided that instead
of my hacky solution, I would build it properly. But then came the
most challenging part of any project. And I know you’ve all
experienced it as well. Naming. So I asked in our team
chat for some suggestions, and I got a ton. [LAUGHTER] So these are
some of our top ones. There’s some great
ones in there. Power Renders, Use The
Platform As A Renderer. However, today I am very
pleased to introduce Rendertron. Let me render that for you. [APPLAUSE] Rendertron is a Dockerized
headless Chrome rendering solution. So that’s a mouthful,
so let’s break it down. First off, what is Docker
and why did I use it? Well, no one knows what it
means, but it’s provocative. In all seriousness,
Docker containers allow you to create lightweight
images as standalone executable packages which isolate
software from its surrounding environment. In Rendertron, we have
headless Chrome packaged up in this container so that you
can easily clone and deploy these to wherever you like. So what about headless Chrome? It was introduced in Chrome
59 for Linux and Mac, Chrome 60 for
Windows, and it allows Chrome to run in environments
which don’t have a UI, such as a server. This means that you can
now use Chrome as any part of your tool chain. You can use it for
automated testing, you can use it for
measuring the performance of your application,
generating PDFs, amongst many other things. Headless Chrome itself
exposes a really basic JSON API for managing tabs, with
most of the power coming from the DevTools protocol. All of DevTools is built
on top of this protocol, so it’s a pretty powerful API. And one of the key
reasons that headless Chrome is great is that now
we’re bringing the latest and greatest from Chrome to
ensure that all the latest web platform features are supported. With Rendertron, this
means that your SEO can now be a first class
environment which is no different from
the rest of your users. So just a quick shout-out. If this all sounds really
interesting to you, and you’d like to
include headless Chrome in some other
way in your tool chain, there’s a brand-new
Node library that was published just last week
that exposes a high-level API to control Chrome while
also bundling all of Chrome inside that Node package. So you can check
it out on GitHub at GoogleChrome/puppeteer. So I’ve looked at the high
level of how headless Chrome can fit into your application
to fulfill your SEO needs. Now it’s time to
dive into how it works. But I’ve been talking a lot. So who wants to see
Rendertron in action? [CHEERS] All right, so this
is the Hacker News PWA created by some of
my awesome colleagues, and it’s built using
Polymer and Web Components. It loads really fast, and all
around performs pretty well. We can see the separate
network request which loads the main content that we see. And we can guess that it’s
affected by this SEO problem, since it uses Web Components
which require JavaScript, and it pulls in
data asynchronously. So one quick way to verify
this is by disabling JavaScript and refreshing the page. And once we do that, we can
see that we still get the app header, since that was
in the initial request, but we lose the main content
of the page, which isn’t good. So we jump over to Rendertron,
a headless Chrome service that is meant to render
and serialize this for you. So I wrote this UI as a
quick way to put in the URL and test the output
from Rendertron. So first off, what
are we hoping to see? Because these bots only
perform one request, we want to see that whole page
come back in that one network request. We also want to
see that it doesn’t need any JavaScript to do this. So take a look. I’m going to put in
the Hacker News URL and tell Rendertron to
render and serialize this, and that I’m also
using Web Components. And it renders correctly. I’m going to disable JavaScript
and verify that it still works. So you can see it’s still
there, and it all comes back in that single network request. Rendertron automatically
detects when your PWA has completed loading. It looks at the page load event
and ensures that it has fired. But we know that’s a really poor
indication of when the page has actually completed loading. So Rendertron also ensures
that any async work has been completed, and it also
looks at your network requests to make sure they’re
finished as well. In total, you have a
10-second rendering budget. This doesn’t mean that it
waits 10 seconds, though. It’ll finish as soon as
your rendering is complete. If this is insufficient
for you, you can also fire a
custom event which signals to Rendertron that
your PWA has completed loading. Serializing Web
Components is tricky because of Shadow DOM,
which abstracts away part of the DOM tree. So to keep things simple,
Rendertron uses Shady DOM, which polyfills Shadow DOM. This allows Rendertron
to effectively serialize the DOM tree so that it can
be preserved in the output. So let’s take a look at the
news PWA, which we’ve all seen, and it’s also built by some
of our other colleagues. And we’ll plug that
into Rendertron. We’ll then ask Rendertron
to render this as well, and that I’m also
using Web Components. And there we have it. So what do you need to do
to enable this behavior? With Polymer 1
this is super easy, and Rendertron doesn’t
actually need to do anything. Simply append dom=shady
to the URLs that you pass to
Rendertron and Polymer 1 will ensure that
Shady DOM is used. With Polymer 2, and
with Web Components v1, it’s recommended you use
webcomponents-loader.js, which pulls in all the right
polyfills on different browsers. You then set a
flag to Rendertron telling it that you’re
using web components, and it will ensure that the
necessary polyfills that it needs for serialization
get enabled. So another feature of
Rendertron is that it lets you set HTTP status codes. These status codes are used by
indexers as important signals. For example, if it
comes across a 404, it’s not going to
link to that page because that would be a really
poor search result. Our server, though, is still returning
that entry point file with the status code of 200 OK. So it looks like
every URL exists. Rendertron lets you
configure that status code from within your PWA,
which understands when a page is invalid. Simply add meta tags– dynamically is fine–
to signal to Rendertron what the status code should be. Rendertron will then pick these
up and return that status code to the bot. So this approach isn’t
specific to Polymer or even Web Components. Let’s plug in
fonts.google.com and see what happens when we serialize it. So that looks pretty good. Who can guess what
JavaScript library was used to build Google Fonts? Angular. Rendertron works with any and
all client-side technologies that work in Chrome and whose
DOM tree can be serialized. The Rendertron
endpoint also features screenshot capabilities
so that you can check that headless Chrome
and the load-detecting function are performing as you expect. Unfortunately, this
service is not fast. For each URL that we render,
we spin up headless Chrome to render that entire page. So performance is strictly tied
to the performance of your PWA. Rendertron does, however,
implement a perfect cache. This means that if we have
rendered the same page within a certain cache
freshness threshold, we’ll serve the cached response
instead of re-rendering it again. So how can you get your
hands on this today, and how do you use it? Well first, you’ll need to
deploy the Rendertron service to an endpoint. You’ll need to clone
the GitHub repo at GoogleChrome/rendertron. And it’s built primarily
for Google Cloud, so it’s easiest to deploy there. But if you remember, this
is a Docker container. So you can deploy
this to anywhere which supports a Docker image. So to make things simple
for you to test out, we have the demo
service endpoint, which you can hit at
render-tron.appspot.com. And that’s the one with
the UI that we saw earlier. It is not intended to be used
as a production endpoint. However, you are
welcome to use it, but we make no
guarantees on uptime. Having this as a
ready to use service is something that
we might consider based on the interest received. So just in case
you’re wondering, my boss’s Twitter
handle is @mattsmcnulty, just in case you want to
tell him how awesome I am. So once we have
that endpoint up, you’re going to need to
install some middleware in your application to do the
user-agent splitting that I was talking about earlier. So this middleware needs
to look at the user-agent, figure out whether or
not they can render, and if not, proxy the request
through the Rendertron endpoint. If you’re using prpl-server,
which is a Node server designed to serve production
applications using PRPL, you simply need to specify
the bot proxy option and provide it with your
Rendertron endpoint. If you’re using Express,
there’s a middleware that you can include
directly by saying app.use rendertron-middleware
with a proxy endpoint and whether or not you’re
using Web Components. If you’re not using
either of these, check the docs for a list
of community-maintained middleware. There’s a Firebase
function there, as well as a list of existing
middleware that Rendertron is compatible with. If it’s not listed,
it’s also fairly simple to roll your own middleware
by simply proxying based on the user-agent string. And that’s it. That’s all the changes
you need to make to use Rendertron today, and
all these bots can now be happy. Rendertron is
available to use today, compatible with any
client side technologies, including both Polymer
1 and Polymer 2. Thank you.
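
The user-agent splitting described toward the end of the talk can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code of rendertron-middleware or prpl-server: the bot list is a short, incomplete sample, the `/render/<encoded URL>` route is an assumed shape for the rendering endpoint (check the docs of the service you deploy), and the names `isNonRenderingBot`, `buildRenderUrl`, and `decide` are hypothetical.

```typescript
// Illustrative, incomplete list of bots assumed not to execute JavaScript.
// A real deployment would maintain a longer list (assumption, not from the talk).
const NON_RENDERING_BOTS: readonly string[] = [
  "facebookexternalhit",
  "twitterbot",
  "slackbot",
  "linkedinbot",
];

// Outcome of the "are you a robot?" check described in the talk.
type Decision =
  | { kind: "serve-entry-point" }
  | { kind: "proxy-to-renderer"; target: string };

// Returns true when the user-agent string matches a known non-rendering bot.
function isNonRenderingBot(userAgent: string | undefined): boolean {
  if (!userAgent) {
    return false;
  }
  const ua = userAgent.toLowerCase();
  return NON_RENDERING_BOTS.some((bot) => ua.indexOf(bot) !== -1);
}

// Builds the URL that a bot request would be proxied to.
// The "/render/" path is an assumed route shape, not a confirmed API.
function buildRenderUrl(rendererEndpoint: string, pageUrl: string): string {
  const base = rendererEndpoint.replace(/\/+$/, ""); // drop trailing slashes
  return base + "/render/" + encodeURIComponent(pageUrl);
}

// Regular users get the normal entry point; known bots get the
// serialized page from the headless Chrome rendering service.
function decide(
  userAgent: string | undefined,
  pageUrl: string,
  rendererEndpoint: string
): Decision {
  if (isNonRenderingBot(userAgent)) {
    return {
      kind: "proxy-to-renderer",
      target: buildRenderUrl(rendererEndpoint, pageUrl),
    };
  }
  return { kind: "serve-entry-point" };
}
```

In a server, a function like `decide` would run inside request middleware: when it returns `proxy-to-renderer`, the server fetches `target` and relays the response to the bot; otherwise it serves the entry point file as usual.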

5 thoughts on “Solving SEO with Headless Chrome (Polymer Summit 2017)”

  1. Justin O'Neill Post author

    Finallllllly all my bitchin paid offf 😀

    Just kidding. Thank you guys SO MUCH for getting to this.

  2. Saumendra Swain Post author

    Would be awaiting a guidance on the Headless Chrome Sam! Is its a mmust to have for the Future PWA or to utilise in the current PWA. ?

  3. Kostas Bariotis Post author

    Haven't we already solved the SEO issue with universal Javascript? We don't need headless chrome for this.

  4. Bruno Mateus Post author

    I've been using a local Prerender.io node app since early 2016 for angular, using PhantomJs instead. Basically, this replaces it with headless Chrome 🙂

