Tuesday 25 October 2016

The Technical SEO Renaissance: The Whys and Hows of SEO’s Forgotten Role in the Mechanics of the Web

Posted by iPullRank

Web technologies and their adoption are advancing at a frenetic pace. Content is a game that every type of team and agency plays, so we’re all competing for a piece of that pie. Meanwhile, technical SEO is more complicated and more important than ever before and much of the SEO discussion has shied away from its growing technical components in favor of content marketing.

As a result, SEO is going through a renaissance wherein the technical components are coming back to the forefront and we need to be prepared. At the same time, a number of thought leaders have made statements that modern SEO is not technical. These statements misrepresent the opportunities and problems that have sprouted on the backs of newer technologies. They also contribute to an ever-growing technical knowledge gap within SEO as a marketing field and make it difficult for many SEOs to solve our new problems.

That resulting knowledge gap that’s been growing for the past couple of years influenced me to, for the first time, “tour” a presentation. I’d been giving my Technical SEO Renaissance talk in one form or another since January because I thought it was important to stoke a conversation around the fact that things have shifted and many organizations and websites may be behind the curve if they don’t account for these shifts. A number of things have happened that prove I’ve been on the right track since I began giving this presentation, so I figured it’s worth bringing the discussion to continue the discussion. Shall we?

An abridged history of SEO (according to me)

It’s interesting to think that the technical SEO has become a dying breed in recent years. There was a time when it was a prerequisite.

Image via PCMag

Personally, I started working on the web in 1995 as a high school intern at Microsoft. My title, like everyone else who worked on the web then, was "webmaster." This was well before the web profession splintered into myriad disciplines. There was no Front End vs. Backend. There was no DevOps or UX person. You were just a Webmaster.

Back then, before Yahoo, AltaVista, Lycos, Excite, and WebCrawler entered their heyday, we discovered the web by clicking linkrolls, using Gopher, Usenet, IRC, from magazines, and via email. Around the same time, IE and Netscape were engaged in the Browser Wars and you had more than one client-side scripting language to choose from. Frames were the rage.

Then the search engines showed up. Truthfully, at this time, I didn’t really think about how search engines worked. I just knew Lycos gave me what I believed to be the most trustworthy results to my queries. At that point, I had no idea that there was this underworld of people manipulating these portals into doing their bidding.

Enter SEO.

Image via Fox

SEO was born of a cross-section of these webmasters, the subset of computer scientists that understood the otherwise esoteric field of information retrieval and those “Get Rich Quick on the Internet” folks. These Internet puppeteers were essentially magicians who traded tips and tricks in the almost dark corners of the web. They were basically nerds wringing dollars out of search engines through keyword stuffing, content spinning, and cloaking.

Then Google showed up to the party.

Image via droidforums.net

Early Google updates started the cat-and-mouse game that would shorten some perpetual vacations. To condense the last 15 years of search engine history into a short paragraph, Google changed the game from being about content pollution and link manipulation through a series of updates starting with Florida and more recently Panda and Penguin. After subsequent refinements of Panda and Penguin, the face of the SEO industry changed pretty dramatically. Many of the most arrogant “I can rank anything” SEOs turned white hat, started software companies, or cut their losses and did something else. That’s not to say that hacks and spam links don’t still work, because they certainly often do. Rather, Google’s sophistication finally discouraged a lot of people who no longer have the stomach for the roller coaster.

Simultaneously, people started to come into SEO from different disciplines. Well, people always came into SEO from very different professional histories, but it started to attract a lot more more actual “marketing” people. This makes a lot of sense because SEO as an industry has shifted heavily into a content marketing focus. After all, we’ve got to get those links somehow, right?

Image via Entrepreneur

Naturally, this begat a lot of marketers marketing to marketers about marketing who made statements like “Modern SEO Requires Almost No Technical Expertise.”

Or one of my favorites, that may have attracted even more ire: “SEO is Makeup.”

Image via Search Engine Land

While I, naturally, disagree with these statements, I understand why these folks would contribute these ideas in their thought leadership. Irrespective of the fact that I’ve worked with both gentlemen in the past in some capacity and know their predispositions towards content, the core point they're making is that many modern Content Management Systems do account for many of our time-honored SEO best practices. Google is pretty good at understanding what you’re talking about in your content. Ultimately, your organization’s focus needs to be on making something meaningful for your user base so you can deliver competitive marketing.

If you remember the last time I tried to make the case for a paradigm shift in the SEO space, you’d be right in thinking that I agree with that idea fundamentally. However, not at the cost of ignoring the fact that the technical landscape has changed. Technical SEO is the price of admission. Or, to quote Adam Audette, “SEO should be invisible,” not makeup.

Changes in web technology are causing a technical renaissance

In SEO, we often criticize developers for always wanting to deploy the new shiny thing. Moving forward, it's important that we understand the new shiny things so we can be more effective in optimizing them.

SEO has always had a healthy fear of JavaScript, and with good reason. Despite the fact that search engines have had the technology to crawl the web the same way we see it in a browser for at least 10 years, it has always been a crapshoot as to whether that content actually gets crawled and, more importantly, indexed.

When we’d initially examined the idea of headless browsing in 2011, the collective response was that the computational expense prohibited it at scale. But it seems that even if that is the case, Google believes enough of the web is rendered using JavaScript that it’s a worthy investment.

Over time more and more folks would examine this idea; ultimately, a comment from this ex-Googler on Hacker News would indicate that this has long been something Google understood needed conquering:

This was actually my primary role at Google from 2006 to 2010.

One of my first test cases was a certain date range of the Wall Street Journal's archives of their Chinese language pages, where all of the actual text was in a JavaScript string literal, and before my changes, Google thought all of these pages had identical content... just the navigation boilerplate. Since the WSJ didn't do this for its English language pages, my best guess is that they weren't trying to hide content from search engines, but rather trying to work around some old browser bug that incorrectly rendered (or made ugly) Chinese text, but somehow rendering text via JavaScript avoided the bug.

The really interesting parts were (1) trying to make sure that rendering was deterministic (so that identical pages always looked identical to Google for duplicate elimination purposes) (2) detecting when we deviated significantly from real browser behavior (so we didn't generate too many nonsense URLs for the crawler or too many bogus redirects), and (3) making the emulated browser look a bit like IE and Firefox (and later Chrome) at the some time, so we didn't get tons of pages that said "come back using IE" er "please download Firefox".

I ended up modifying SpiderMonkey's bytecode dispatch to help detect when the simulated browser had gone off into the weeds and was likely generating nonsense.

I went through a lot of trouble figuring out the order that different JavaScript events were fired off in IE, FireFox, and Chrome. It turns out that some pages actually fire off events in different orders between a freshly loaded page and a page if you hit the refresh button. (This is when I learned about holding down shift while hitting the browser's reload button to make it act like it was a fresh page fetch.)

At some point, some SEO figured out that random() was always returning 0.5. I'm not sure if anyone figured out that JavaScript always saw the date as sometime in the Summer of 2006, but I presume that has changed. I hope they now set the random seed and the date using a keyed cryptographic hash of all of the loaded javascript and page text, so it's deterministic but very difficult to game. (You can make the date determistic for a month and dates of different pages jump forward at different times by adding an HMAC of page content (mod number of seconds in a month) to the current time, rounding down that time to a month boundary, and then subtracting back the value you added earlier. This prevents excessive index churn from switching all dates at once, and yet gives each page a unique date.)

Now, consider these JavaScript usage statistics across the web from BuiltWith:

JavaScript is obviously here to stay. Most of the web is using it to render content in some form or another. This means there’s potential for search quality to plummet over time if Google couldn't make sense of what content is on pages rendered with JavaScript.

Additionally, Google’s own JavaScript MVW framework, AngularJS, has seen pretty strong adoption as of late. When I attended Google’s I/O conference a few months ago, the recent advancements of Progressive Web Apps and Firebase were being harped upon due to the speed and flexibility they bring to the web. You can only expect that developers will make a stronger push.

Image via Builtwith

Sadly, despite BuiltVisible’s fantastic contributions to the subject, there hasn’t been enough discussion around Progressive Web Apps, Single-Page Applications, and JavaScript frameworks in the SEO space. Instead, there are arguments about 301s vs 302s. Perhaps the latest spike in adoption and the proliferation of PWAs, SPAs, and JS frameworks across different verticals will change that. At iPullRank, we’ve worked with a number of companies who have made the switch to Angular; there's a lot worth discussing on this specific topic.

Additionally, Facebook’s contribution to the JavaScript MVW frameworks, React, is being adopted for the very similar speed and benefits of flexibility in the development process.

However, regarding SEO, the key difference between Angular and React is that, from the beginning, React had a renderToString function built in which allows the content to render properly from the server side. This makes the question of indexation of React pages rather trivial.

AngularJS 1.x, on the other hand, has birthed an SEO best practice wherein you pre-render pages using headless browser-driven snapshot appliance such as Prerender.io, Brombone, etc. This is somewhat ironic, as it's Google’s own product. More on that later.

View Source is dead

As a result of the adoption of these JavaScript frameworks, using View Source to examine the code of a website is an obsolete practice. What you’re seeing in View Source is not the computed Document Object Model (DOM). Rather, you’re seeing the code before it's processed by the browser. The lack of understanding around why you might need to view a page’s code differently is another instance where having a more detailed understanding of the technical components of how the web works is more effective.

Depending on how the page is coded, you may see variables in the place of actual content, or you may not see the completed DOM tree that's there once the page has loaded completely. This is the fundamental reason why, as soon as an SEO hears that there’s JavaScript on the page, the recommendation is to make sure all content is visible without JavaScript.

To illustrate the point further, consider this View Source view of Seamless.com. If you look for the meta description or the rel-canonical on this page, you’ll find variables in the place of the actual copy:

If instead you look at the code in the Elements section of Chrome DevTools or Inspect Element in other browsers, you’ll find the fully executed DOM. You’ll see the variables are now filled in with copy. The URL for the rel-canonical is on the page, as is the meta description:

Since search engines are crawling this way, you may be missing out on the complete story of what's going on if you default to just using View Source to examine the code of the site.

HTTP/2 is on the way

One of Google’s largest points of emphasis is page speed. An understanding of how networking impacts page speed is definitely a must-have to be an effective SEO.

Before HTTP/2 was announced, the HyperText Transfer Protocol specification had not been updated in a very long time. In fact, we’ve been using HTTP/1.1 since 1999. HTTP/2 is a large departure from HTTP/1.1, and I encourage you to read up on it, as it will make a dramatic contribution to the speed of the web.

Image via Slideshare

Quickly though, one of the biggest differences is that HTTP/2 will make use of one TCP (Transmission Control Protocol) connection per origin and “multiplex” the stream. If you’ve ever taken a look at the issues that Google PageSpeed Insights highlights, you’ll notice that one of the primary things that always comes up is limiting the number of HTTP requests/ This is what multiplexing helps eliminate; HTTP/2 opens up one connection to each server, pushing assets across it at the same time, often making determinations of required resources based on the initial resource. With browsers requiring Transport Layer Security (TLS) to leverage HTTP/2, it’s very likely that Google will make some sort of push in the near future to get websites to adopt it. After all, speed and security have been common threads throughout everything in the past five years.

Image via Builtwith

As of late, more hosting providers have been highlighting the fact that they are making HTTP/2 available, which is probably why there’s been a significant jump in its usage this year. The beauty of HTTP/2 is that most browsers already support it and you don’t have to do much to enable it unless your site is not secure.

Image via CanIUse.com

Definitely keep HTTP/2 on your radar, as it may be the culmination of what Google has been pushing for.

SEO tools are lagging behind search engines

When I think critically about this, SEO tools have always lagged behind the capabilities of search engines. That’s to be expected, though, because SEO tools are built by smaller teams and the most important things must be prioritized. A lack of technical understanding may lead to you believe the information from the tools you use when they are inaccurate.

When you review some of Google’s own documentation, you’ll find that some of my favorite tools are not in line with Google’s specifications. For instance, Google allows you to specify hreflang, rel-canonical, and x-robots in HTTP headers. There's a huge lack of consistency in SEO tools’ ability to check for those directives.

It's possible that you've performed an audit of a site and found it difficult to determine why a page has fallen out of the index. It very well could be because a developer was following Google’s documentation and specifying a directive in an HTTP header, but your SEO tool did not surface it. In fact, it’s generally better to set these at the HTTP header level than to add bytes to your download time by filling up every page’s <head> with them.

Google is crawling headless, despite the computational expense, because they recognize that so much of the web is being transformed by JavaScript. Recently, Screaming Frog made the shift to render the entire page using JS:

To my knowledge, none of the other crawling tools are doing this yet. I do recognize the fact that it would be considerably more expensive for all SEO tools to make

source https://moz.com/blog/the-technical-seo-renaissance

No comments:

Post a Comment